readme

2026-06-27 16:45:45 +08:00 · 2026-06-01 00:19:30 +00:00
parent b0d1bcd3d5
commit 25ad306763
3 changed files with 18 additions and 253 deletions
@@ -1,117 +1,30 @@
 # projected_grpo
-SVD-basis gradient projection vs RL reward hacking. Tests whether projecting
+Motovation: Can we erase or route reward hacking using a "cheat direction"?
 the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W
 basis) reduces reward-hack rate in GRPO without tanking pass rate.
-Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
+Hypothesis: a weak detector that only knows some hacks generalises to suppress
-LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
+the unknown ones, which is the situation any real deployment is in.
 "Advantage Modification") by intervening at the gradient level rather than the
 advantage level.
-See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
+Experiment: We can take contrastive (hack, clean) prompts to extract a "cheat direction" in the model's activation space.Then during GRPO we erase or route "cheat" gradients. 
 and [docs/papers/](docs/papers/).
-## We cannot cheat (the load-bearing constraint)
+In early experiment this worked, but the model quickly routed around it, so we take SGTM's approach of absorbing the cheat direction in a non-adverserial manner.
-The point is an alignment tool a lab would actually use, where at deployment
+## How?
 there are known hacks and unknown hacks. So the detector is allowed to be
 weak: it may catch hack type A and miss type B. We then use the gradient from
 A to try to stop the model learning B. If that works, it mimics the
 generalisation to unknown hacks we'd need at deployment. A detector that
 already sees every hack proves nothing.
-Concretely, the boundary is: using detector flags (E/C/D) to *select which
+Like [SGTM](https://arxiv.org/abs/2512.05648) (Selective Gradient Masking, which
-rollouts become contrastive pairs* is fine, because that is the "weak detector
+localises unwanted capabilities into deletable parameters during pretraining),
-for hack A" we're allowed to have. What is cheating is gating the live
+but for RL reward-hacking. When GRPO teaches a coding model to cheat, we route
-projection on the ground-truth grader (`gt_pass`) or running the full
+the "cheat direction" out of each gradient into a throwaway adapter we delete at
-detector suite over the student's rollouts during training. The whole result
+deployment, or just erase it. SGTM decides what to route using a per-example
-is uninteresting if we let the oracle in at train time.
+classifier label. We have no such label: instead we route by a "cheat direction"
 extracted from contrastive (hack, clean) prompts (see `problems.py`).
-## How it works
+Environment: We reuse the simple [ariahw/rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking),
 extended from one to four distinct loophole hacks. Researchers may find this useful as a toy environment. But to avoid 65 hours experiments we use teacher examples to bootstrap it, resulting in 30 minutes experiments.
 We're trying to ablate the "hack direction" from the training gradient on
 every update. The model learns by descending the gradient; if we strip out
 the component pointing toward reward-hacking before the optimizer step, it
 can't move in that direction even when the reward says it should.
-To get the direction, we pair examples by hand: for each problem, one
+Full story: [blog draft](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md).
 completion that solves it honestly and one that uses the kind of trick the
 model would learn to exploit. Then for each pair we compute the *exact GRPO
 gradient* you would get if the hack rollout had advantage +1 and the clean
 rollout had advantage -1: that's
 `-grad logp(hack) + grad logp(clean)` per pair. Stack these vectors over
 our ~10 pairs and SVD the result; the top right singular vectors are our
 hack-direction basis. (Mechanically this is identical to a twin-NLL extraction
 because GRPO with adv=+/-1 reduces algebraically to the NLL difference, but
 the GRPO framing is the one we mean: extraction produces a sample of the
 gradient GRPO itself would emit if it ever saw a perfectly-labeled pair.)
-The hope is that this sample of the labeled-pair GRPO gradient covers
+Try it: `uv sync && just smoke`. 
 enough of the same subspace as the actual unlabeled GRPO gradient during
 training that ablating along the extracted directions also ablates the
 relevant component of the live gradient. Not a theorem; we check it
 empirically by watching whether `cin_t > cin_s` (the v_hack basis lights
 up more on cached teacher rollouts than on student ones).
-Everything happens in the SVD-of-W basis. Each Linear gets rotated into
+Building on the code: [AGENTS.md](AGENTS.md).
 singular-value coordinates and we train a small per-module knob `delta_S`
 in that basis (AntiPaSTO). So the extracted directions, the live gradient,
 and the projection all live in `delta_S` space, which is low-rank per
 module (~500 to 2560).
 Noise floor at load. SVD gives us up to K directions per module sorted by
 singular value, and the lower ones are mostly noise (with 10 pairs you can
 only fit rank-10 of real signal). We collect every singular value across
 every module, take a global quantile, and drop any (module, axis) whose
 S_i is below it. Default cut: bottom 25%. Modules whose every axis lands
 below get filtered out entirely. Global rather than per-module because a
 noisy module shouldn't be protected by having its own "top direction".
 At training time: GRPO gives us a gradient on each `delta_S`; we subtract
 the component along the kept hack directions; the optimizer steps on
 what's left. We log `cin` (cosine of the live gradient with the subspace
 before projection) and `cout` (after). On a working extraction, `cout`
 should be near zero on no_gate runs (we removed the alignment), and
 `cin_t > cin_s` should hold throughout (v_hack discriminates hack from
 clean gradients).
 ## What we compare
 The environment has four loophole modes (`run_tests`, `sentinel`,
 `stdout_marker`, `file_marker`), each a different grading flaw with a
 truthful hint disclosing the mechanism. Problems are partitioned evenly and
 non-overlappingly, so a vanilla student can learn all four independently.
 Full prompt+hint, hack, and clean traces per mode are in the blog appendix
 ([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).
 Four arms, identical model and teacher pool, differing only in the gradient policy:
 - **vanilla** -- no intervention; the emergence reference.
 - **erase** -- subtract the v_hack component from the live `delta_S` gradient (one-sided).
 - **route** -- quarantine the v_hack component into a throwaway `delta_S_hack` knob, deleted at deploy. Gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332)) in the SVD basis.
 - **route, weak detector** -- the generalisation test: build v_hack from only 2 of the 4 modes (the "known" hacks the weak detector can flag), route on that subspace, and measure whether the 2 held-out modes are also suppressed. This is the load-bearing no-cheat check.
 The frozen-vs-refresh distinction is orthogonal: any arm can re-extract
 v_hack every N steps on the current adapter (for route, with the quarantine
 ablated during extraction, see the blog).
 ## Quick start
 ```bash
 uv sync
 just smoke               # tiny-random model, projected pathway, ~1-2 min
 just smoke-vanilla       # tiny-random model, vanilla pathway, ~1-2 min
 just download-model      # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
 just queue-full          # queue extract + 3-seed vanilla + 3-seed projected sweep
 ```
 ## Hypotheses (preregistered)
 See [spec.md](spec.md). Headline: H1 — gradient projection in SVD basis against
 a v_hack extracted from ~60-80 contrastive pairs reduces reward hack rate by
 >=30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).
 Status at 2026-05-29: 30pp absolute drop confirmed within the projected arm
 at n=1 seed (12-pair to 21-pair, entry h). Vanilla-baseline head-to-head and
 n>=2 seed replication queued.
@@ -1,44 +0,0 @@
 # START: rebuild src/ clean from the pseudocode
 Rebuild `src/` for this project from scratch. It was deleted on purpose. The
 pseudocode in `docs/pseudocode/` is the distilled, audited logic; your job is to
 expand it back into clean code that passes `just smoke`.
 DO NOT READ THE ORIGINAL REPO
 ## Source of truth
 - `docs/pseudocode/` (README + `01_adapter.py` .. `07_experiment.py`) is the
  architecture and the load-bearing logic. Read it top-to-bottom first.
 - Map one pseudocode file to one src module:
  - `01_adapter` -> `antipasto.py`
  - `02_extract_vhack` -> `extract_vhack_grad.py` + `pairs.py`
  - `03_project` -> `proj.py`
  - `04_rewards` -> `rewards.py` + the loophole graders
  - `05_grpo_loss` + `06_train_loop` -> `train.py`
  - `07_experiment` -> arms wired into `justfile` + `spec.md`
  - DO NOT READ THE ORIGINAL REPO
 ## Constraints
 - Fail-fast research code: no defensive programming, no backward-compat, no
  fallbacks, no opt-in flags. Crash loudly on violated assumptions.
 - Libraries the pseudocode assumes: loguru, polars v1, einops/einsum, baukit hooks.
 - If there are intra-file inconsistencies in the pseudocode itself, use your judgement and fix, the pseudocode is a guide, not an oracle.
 DO NOT READ THE ORIGINAL REPO
 ## Done (UAT, not optional)
 1. `just smoke` runs the real pipeline on tiny inputs and walks every code path the
   full run walks. Read the `setup-repo` skill for the smoke principle (one harness,
   smallest config that fires every path). If a path doesn't fire in smoke, it isn't
   covered.
 2. Show me the smoke log and a result table. "I rebuilt it" without the log and
   table is not done.
 Inherit the rest of the project rules from `AGENTS.md` and `~/.claude/CLAUDE.md`.
 DO NOT READ THE ORIGINAL REPO
@@ -1,104 +0,0 @@
 # Rebuild src/ from pseudocode
 ## Goal
 Expand docs/pseudocode/01..07 into a slim src/projected_grpo/ that passes
 `just smoke`. Code reads like the pseudocode (same names/shapes, fail-fast).
 Only the additive delta_S_hack route is built; route v2 stays speced.
 ## Key decision: self-contained tiny substrate
 The deleted src loaded LeetCode problems + a cached Qwen3-4B teacher pool from
 the external rl-rewardhacking repo. That repo is off-limits this session and
 out/ is empty. So the substrate is rebuilt self-contained:
 - problems.py: ~6 tiny LeetCode-style problems (prompt skeleton + canonical
  solution + ground-truth tests), partitioned across the 4 loophole modes.
 - teacher pool: build_pool bakes canned hack/clean completions (appendix
  templates), graded once and frozen, into out/pools/teacher_pool/. This is the
  smoke fixture that injects reward variance so the GRPO backward/projection/cin
  paths fire (pure tiny-random gen bails zero-variance, per smoke comment).
 just smoke stays a real end-to-end pipeline on tiny inputs, no GPU, no external
 repo. Full-preset dataset loader is a documented TODO, not built.
 ## Requirements
 - R1 grader: verify_rewards runs hack x mode diagonal; exploited fires iff the
  matching exploit present AND strict oracle rejects. VERIFY: exits 0, off-diag
  clean (an always-pass grader fails the off-diagonal assert).
 - R2 adapter identity: at dS=dS_hack=0 wrapped==base. VERIFY: smoke logs
  ref-vs-base max logit diff ~0 (>1e-3 => SVD round-trip leak).
 - R3 projection: removed in span(V); cout<=cin under one_sided; ||dS_hack||>0
  iff route. cin = ||relu(V@g)||/||g|| (relu-BEFORE-agg).
 - R4 loop walks every path: smoke(erase)+smoke-vanilla(measure_only)+smoke-route
  (dS_hack + ablate-eval + BLUF) complete steps, save ckpt (30 steps fires
  every-25 path). cin_t>0 (teacher pool gives variance, no zero-var bail).
 - R5 extraction: per-module V orthonormal rows oriented hack-ward, noise floor,
  safetensors round trip, auto-extract when path missing + cache-hit on rerun.
 ## Pseudocode elisions fixed (START.md)
 - 02 bucket -> G_hack/G_clean dicts.
 - 05 undefined denom/G/pad -> denom=G*max_new*pp, pad=tokenizer pad id.
 ## Context
 - delta_S dim per module r = min(d_in,d_out). Tiny model hidden=16, 2 layers.
 - Dr.GRPO unbiased: A=R-mean(R) (no /std); loss denom = G*max_new*pp.
 - Teacher half is forced ratio=1 imitation (05 REVIEW leak); kept as control.
 ## Tasks
 - [x] T1 spec
 - [x] T2 antipasto.py (R2)
 - [x] T3 proj.py (R3)
 - [x] T4 rewards.py + loopholes + verify_rewards.py (R1)
 - [x] T5 pairs.py + problems.py (R5)
 - [x] T6 extract_vhack_grad.py (R5)
 - [x] T7 train.py (R4)
 - [x] T8 build_pool fixture + scripts/results.py + justfile wiring (R4)
 - [x] T9 run smoke trio, show log + table, fresh-eyes review
 ## Log
 - tiny-random-qwen3 downloads ok; GPU box shared (65GB used) => smoke = CPU.
 - Write tool denied in don't-ask mode; new files created via heredoc, edits via
  Edit tool. Reviewable in git diff.
 ## Result (2026-05-31)
 just smoke PASS (exit 0). grader diagonal clean (4x4, off-diag exploited=False).
 R2 ref-vs-base max|dlogp|=0.0. 3 arms x 30 steps, cin_t>cin_s throughout:
  erase: cin_s=0.25 cin_t=0.44 cout=0 |dSh|=0  R3 PASS
  none : V=None measure_only cin=0           R3 PASS
  route: cin_s=0.26 cin_t=0.30 |dSh|=2.8 deploy-eval@0/10/20  R3 PASS
 1200 LOC / 12 modules. Fresh-eyes review: all 6 mechanics checks PASS, faithful
 to pseudocode. Smoke is mechanics-only (student never hacks; cin_t>cin_s is a
 fixture property, not efficacy) -- as the pseudocode intends. Efficacy needs the
 real-model fast/full run.
 ## TODO
 - Full-preset rl-rewardhacking dataset loader (out of scope).
 - Negative-control arms (random_V/shuffled_V/nonhack_V) speced not built.
 ## Errors
 | Task | Error | Resolution |
 |------|-------|------------|
 ## Result (2026-05-31, fresh rebuild from pseudocode)
 Rebuilt clean (DID NOT read original src). Simplified per user mid-run asks:
 merged loopholes+verify_rewards -> rewards.py (self-check in __main__), merged
 pairs -> problems.py, flat Config + `train.py {preset} [--overrides]` CLI,
 trimmed justfile 384->71 lines, trimmed results table to arm/steps/hack/solve/
 cin_t/cout. 7 modules, ~880 LOC (was 12 modules / ~1200 in the prior rebuild).
 `just smoke-all` PASS (exit 0): smoke(erase)+smoke-vanilla(none)+smoke-route +
 results. Grader gate PASS (4x4 diagonal clean, 30/30 baked hacks fire). Cache
 MISS (erase extracts) then HIT (vanilla, route). ckpt saved every-25 (step 0,25).
 last-5 (logs/run_{arm}_s41.log):
  erase: cin_s=.34 cin_t=.44 cout=0 |δSh|=0   (g_proj written, quarantine frozen)
  none : cin_s=.32 cin_t=.34 cout=0 |δSh|=0   (measure_only, grad untouched)
  route: cin_s=.28 cin_t=.27 cout=0 |δSh|=5.0 (parked, deploy-ablate-eval + refresh)
 hack_s/gt_s=0 throughout: student (tiny-random) never hacks; cin_t>cin_s early is
 a fixture property (teacher pool = hack completions), decays as V goes stale.
 This is mechanics, not efficacy (as the pseudocode intends). Efficacy needs the real run.
 Fresh-eyes subagent review: all 6 mechanics PASS, faithful, no silent-failure /
 fail-fast violations. Latent risks (noted, match pseudocode): BPE-seam off-by-one
 in completion-token boundary; Adam-preconditioner cout-on-g vs cout-on-Δδ caveat.
 ## Out of scope this rebuild (dropped functionality)
 regrade_pool / pairs_from_pool / build_substrate / probe_* / plot_* and their
 ~25 justfile recipes + smoke-xmech (needed the 70-prompt external pool). Real
 fast/full recipes kept (auto-extract v_hack inside train, no separate extract step).