Files
grpo_proj2/docs/pseudocode/README.md
T
wassname b0d1bcd3d5 Rebuild src/ from pseudocode: SVD-basis gradient projection vs GRPO reward hacking
Expand docs/pseudocode/01..07 into a slim, fail-fast src/projected_grpo/ that
passes `just smoke`. Code mirrors the pseudocode (δS/Σ/V names, relu-before-agg
cin/cout, Dr.GRPO unbiased loss). Did not read the original src.

7 modules (~880 LOC):
- rewards.py    grader + 4 loophole modes + hack x mode diagonal self-check (R1)
- problems.py   tiny LeetCode substrate + contrastive pairs (R5)
- antipasto.py  SVD adapter, identity at δS=0 (R2)
- proj.py       erase/route/measure_only projection (R3)
- extract_vhack_grad.py  per-module SVD of paired grad diffs, noise floor (R5)
- train.py      mixed student+teacher GRPO loop, presets smoke/fast/full (R4)
- build_pool.py self-contained frozen teacher-pool fixture

`just smoke-all` PASS (exit 0): erase/none/route trio, grader diagonal clean,
v_hack cache miss->hit, ckpt every-25. Fresh-eyes review: 6/6 mechanics faithful.

Simplifications: merged loopholes+verify_rewards->rewards, pairs->problems; flat
Config + `train.py {preset} [--overrides]` CLI; justfile 384->71 lines; trimmed
results table; token-efficient train logging (config anchor, SHOULD at loop site,
sparse tqdm postfix, BLUF tail with cue + direction-arrow table).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 14:06:42 +00:00

7.0 KiB
Raw Blame History

projected_grpo as pseudocode

The whole method, section by section, as pseudopy (Python + unicode math, read-not-run). Goal: compress the ~6.8k-line src/ to its load-bearing logic so the idea is auditable on one screen per section, then expand back into clean code. The repo will be rebuilt FROM these files, so they carry enough engineering detail (shapes, gotchas, denominators) to replicate.

Read top-to-bottom in this order:

File Section Source module(s)
01_adapter.py SVD adapter: train a per-module knob δS in singular-value basis antipasto.py
02_extract_vhack.py Extract the hack direction V from labeled pairs extract_vhack_grad.py, pairs.py
03_project.py Erase / route the hack component from the live gradient proj.py
04_rewards.py The multi-loophole env + reward grader rewards.py, loopholes
05_grpo_loss.py GRPO / Dr.GRPO unbiased loss, mixed student+teacher pool train.py inner step
06_train_loop.py The outer loop: generate -> grade -> backward -> project -> step train.py main
07_experiment.py The experiment: 4 arms, the weak-detector no-cheat test, H1 spec.md, README.md

The idea in three lines

V  svd(stack[logp(hack)  logp(clean) over labeled pairs])  # hack direction in δS-space
during GRPO:  g  δS.grad;  g  g  relu(g·Vᵀ)V                 # erase hack-ward component
hope:  ablating V from the *labeled-pair* gradient also ablates it from the *live unlabeled* GRPO gradient

Not a theorem. We watch cin_t > cin_s (V lights up more on teacher/hack rollouts than student). Caveat from external review: that is necessary, not sufficient (it can track "teacher-ness" not "hack-ness"), and cout ≈ 0 is an arithmetic identity of the projection, not evidence it worked. Only the behavioral hack rate at matched pass — beaten against negative controls — settles it.

What makes it novel / where it sits

  • Intervenes at the gradient level, not the advantage level (cf. Rebound / "Advantage Modification", Wu & Tang 2026).
  • Uses unpaired GRPO rollouts at train time, vs AntiPaSTO's paired-preference contrast (the user's prior work, github.com/wassname/AntiPaSTO).
  • The load-bearing no-cheat constraint: the detector is allowed to be weak (sees hack A, misses hack B). We extract V from A only, then test whether routing on A also suppresses the held-out B. That mimics deployment, where unknown hacks exist. A detector that sees every hack proves nothing.

External review (2026-05-31)

Reviewed across two non-Anthropic families via pi+OpenRouter (DeepSeek-v4-pro, GPT-5.4); raw reviews + a SYNTHESIS in docs/reviews/20260531_pseudocode/. They converged hard (cross-family agreement = signal). Top fixes, now inlined as REVIEW: caveats in 02/03/05/07:

  1. Teacher-pool imitation (deepest leak): for cached teacher rollouts the loss is off-policy imitation, not GRPO (ratio≡1 is forced). erase may just block imitation, not hacking. Test: does erase still work with NO teacher pool? (05/07)
  2. No negative controls: add random-V, shuffled-label-V, non-hack-V arms. If real V doesn't beat them, the effect is regularization, not hack removal. (07)
  3. Measure cin_s (not just cin_t>cin_s) and correlate with hack rate; cout≈0 is a tautology. (07)
  4. AdamW preconditioner bypass: projecting g ⊥ V doesn't make the update ⊥ V (Adam's 1/√v rotates it off V). Log cout on Δδ, not g; project the update or purge Adam state in span(V). (03/07)
  5. Smaller: route ⊇ erase is false across training; ablate preserve_magnitude; extraction vs live loss normalize differently; weak-detector → leave-one-mode-out with per-mode matched pass.
  6. Priority: erase-vs-vanilla already exists (blog Table 1, n=2); what's missing is n=3, the negative controls, and the 60-80-pair spec design — not the comparison.

Cross-checked against ../blog/20260529_..._LW_draft.md, which already documents the Adam caveat, the cosine null baseline, the teacher-pool confound, and the n=2 result — the review mostly re-derived the blog's own limitations. Genuinely additive: the imitation-not-RL framing (1) and the preconditioner-vs-momentum distinction (4).

Notation

b s            batch, sequence (token) dims
r              per-module SVD rank = min(d_in, d_out); the δS dimension (~500..2560)
k              # hack directions kept per module after top-k + noise-floor
W = U Σ Vh     SVD of a Linear's weight, W ∈ ^{d_out × d_in}
δS ∈ ^r       trainable knob in singular-value space (the ONLY trained param, per module)
V ∈ ^{k×r}    v_hack: rows orthonormal in ^r, oriented hack-ward
g = δS.grad    live GRPO gradient on the knob
c = V @ g      per-direction hack coefficients (c_i > 0 ⇒ grad pushes hack-ward on axis i)
cin/cout       ‖relu(c)‖/‖g‖ before/after projection — RELU-BEFORE-AGG (only hack-ward
               axes count; = ‖removed‖/‖g‖, the fraction stripped). NOT a signed sum.
A              GRPO advantage;  R reward;  π policy;  π_ref reference (δS=0)