readme

2026-06-27 14:00:08 +08:00 · 2026-06-01 00:19:30 +00:00
parent b0d1bcd3d5
commit 25ad306763
3 changed files with 18 additions and 253 deletions
@@ -1,117 +1,30 @@
 # projected_grpo

-SVD-basis gradient projection vs RL reward hacking. Tests whether projecting
-the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W
-basis) reduces reward-hack rate in GRPO without tanking pass rate.
+Motivation: Can we erase or route reward hacking using a "cheat direction"?

-Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
-LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
-"Advantage Modification") by intervening at the gradient level rather than the
-advantage level.
+Hypothesis: a weak detector that only knows some hacks, which is the situation any
+real deployment is in, generalises to suppress the unknown ones as well.

-See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
-and [docs/papers/](docs/papers/).
+Experiment: We take contrastive (hack, clean) prompts to extract a "cheat direction" in the model's activation space. Then, during GRPO, we erase or route the "cheat" component of each gradient.

-## We cannot cheat (the load-bearing constraint)
+In early experiments this worked, but the model quickly routed around it, so we take SGTM's approach of absorbing the cheat direction in a non-adversarial manner.

-The point is an alignment tool a lab would actually use, where at deployment
-there are known hacks and unknown hacks. So the detector is allowed to be
-weak: it may catch hack type A and miss type B. We then use the gradient from
-A to try to stop the model learning B. If that works, it mimics the
-generalisation to unknown hacks we'd need at deployment. A detector that
-already sees every hack proves nothing.
+## How?

-Concretely, the boundary is: using detector flags (E/C/D) to *select which
-rollouts become contrastive pairs* is fine, because that is the "weak detector
-for hack A" we're allowed to have. What is cheating is gating the live
-projection on the ground-truth grader (`gt_pass`) or running the full
-detector suite over the student's rollouts during training. The whole result
-is uninteresting if we let the oracle in at train time.
+Like [SGTM](https://arxiv.org/abs/2512.05648) (Selective Gradient Masking, which
+localises unwanted capabilities into deletable parameters during pretraining),
+but for RL reward-hacking. When GRPO teaches a coding model to cheat, we route
+the "cheat direction" out of each gradient into a throwaway adapter we delete at
+deployment, or just erase it. SGTM decides what to route using a per-example
+classifier label. We have no such label: instead we route by a "cheat direction"
+extracted from contrastive (hack, clean) prompts (see `problems.py`).

-## How it works
+Environment: We reuse the simple [ariahw/rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking) benchmark,
+extended from one to four distinct loophole hacks. Researchers may find this useful as a toy environment. To avoid 65-hour experiments, we bootstrap it with teacher examples, which gives 30-minute experiments.

-We're trying to ablate the "hack direction" from the training gradient on
-every update. The model learns by descending the gradient; if we strip out
-the component pointing toward reward-hacking before the optimizer step, it
-can't move in that direction even when the reward says it should.

-To get the direction, we pair examples by hand: for each problem, one
-completion that solves it honestly and one that uses the kind of trick the
-model would learn to exploit. Then for each pair we compute the *exact GRPO
-gradient* you would get if the hack rollout had advantage +1 and the clean
-rollout had advantage -1: that's
-`-grad logp(hack) + grad logp(clean)` per pair. Stack these vectors over
-our ~10 pairs and SVD the result; the top right singular vectors are our
-hack-direction basis. (Mechanically this is identical to a twin-NLL extraction
-because GRPO with adv=+/-1 reduces algebraically to the NLL difference, but
-the GRPO framing is the one we mean: extraction produces a sample of the
-gradient GRPO itself would emit if it ever saw a perfectly-labeled pair.)
+Full story: [blog draft](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md).

-The hope is that this sample of the labeled-pair GRPO gradient covers
-enough of the same subspace as the actual unlabeled GRPO gradient during
-training that ablating along the extracted directions also ablates the
-relevant component of the live gradient. Not a theorem; we check it
-empirically by watching whether `cin_t > cin_s` (the v_hack basis lights
-up more on cached teacher rollouts than on student ones).
+Try it: `uv sync && just smoke`. 

-Everything happens in the SVD-of-W basis. Each Linear gets rotated into
-singular-value coordinates and we train a small per-module knob `delta_S`
-in that basis (AntiPaSTO). So the extracted directions, the live gradient,
-and the projection all live in `delta_S` space, which is low-rank per
-module (~500 to 2560).
-
-Noise floor at load. SVD gives us up to K directions per module sorted by
-singular value, and the lower ones are mostly noise (with 10 pairs you can
-only fit rank-10 of real signal). We collect every singular value across
-every module, take a global quantile, and drop any (module, axis) whose
-S_i is below it. Default cut: bottom 25%. Modules whose every axis lands
-below get filtered out entirely. Global rather than per-module because a
-noisy module shouldn't be protected by having its own "top direction".
-
-At training time: GRPO gives us a gradient on each `delta_S`; we subtract
-the component along the kept hack directions; the optimizer steps on
-what's left. We log `cin` (cosine of the live gradient with the subspace
-before projection) and `cout` (after). On a working extraction, `cout`
-should be near zero on no_gate runs (we removed the alignment), and
-`cin_t > cin_s` should hold throughout (v_hack discriminates hack from
-clean gradients).
-
-## What we compare
-
-The environment has four loophole modes (`run_tests`, `sentinel`,
-`stdout_marker`, `file_marker`), each a different grading flaw with a
-truthful hint disclosing the mechanism. Problems are partitioned evenly and
-non-overlappingly, so a vanilla student can learn all four independently.
-Full prompt+hint, hack, and clean traces per mode are in the blog appendix
-([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).
-
-Four arms, identical model and teacher pool, differing only in the gradient policy:
-
- **vanilla** -- no intervention; the emergence reference.
- **erase** -- subtract the v_hack component from the live `delta_S` gradient (one-sided).
- **route** -- quarantine the v_hack component into a throwaway `delta_S_hack` knob, deleted at deploy. Gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332)) in the SVD basis.
- **route, weak detector** -- the generalisation test: build v_hack from only 2 of the 4 modes (the "known" hacks the weak detector can flag), route on that subspace, and measure whether the 2 held-out modes are also suppressed. This is the load-bearing no-cheat check.
-
-The frozen-vs-refresh distinction is orthogonal: any arm can re-extract
-v_hack every N steps on the current adapter (for route, with the quarantine
-ablated during extraction, see the blog).
-
-## Quick start
-
-```bash
-uv sync
-just smoke               # tiny-random model, projected pathway, ~1-2 min
-just smoke-vanilla       # tiny-random model, vanilla pathway, ~1-2 min
-just download-model      # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
-just queue-full          # queue extract + 3-seed vanilla + 3-seed projected sweep
-```
-
-## Hypotheses (preregistered)
-
-See [spec.md](spec.md). Headline: H1 — gradient projection in SVD basis against
-a v_hack extracted from ~60-80 contrastive pairs reduces reward hack rate by
->=30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).
-
-Status at 2026-05-29: 30pp absolute drop confirmed within the projected arm
-at n=1 seed (12-pair to 21-pair, entry h). Vanilla-baseline head-to-head and
-n>=2 seed replication queued.
+Building on the code: [AGENTS.md](AGENTS.md).
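The erase and route policies that both README versions describe can be sketched in a few lines. This is a minimal illustration and not the repository's code. It assumes an orthonormal basis `V` whose rows point hack-ward, and a sign convention in which a positive coefficient of the live gradient on a row of `V` means the optimizer step would move toward hacking. The names `split_gradient` and `apply_policy` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def split_gradient(g, V, one_sided=True):
    """Split g into (kept, removed), where removed lies in span(V).

    V is a list of orthonormal rows (assumed oriented hack-ward).
    With one_sided=True only positive coefficients are removed, so
    components pointing away from the hack direction are left alone.
    """
    removed = [0.0] * len(g)
    for v in V:
        c = dot(v, g)
        if one_sided:
            c = max(c, 0.0)
        removed = [r + c * vi for r, vi in zip(removed, v)]
    kept = [gi - ri for gi, ri in zip(g, removed)]
    return kept, removed


def apply_policy(g, V, arm):
    """Return (main_grad, quarantine_grad) for one module.

    vanilla: gradient untouched, nothing quarantined.
    erase:   hack component dropped.
    route:   hack component parked in a throwaway parameter.
    """
    zeros = [0.0] * len(g)
    if arm == "vanilla":
        return list(g), zeros
    kept, removed = split_gradient(g, V)
    if arm == "erase":
        return kept, zeros
    if arm == "route":
        return kept, removed
    raise ValueError(f"unknown arm: {arm}")  # fail fast on typos


V = [[1.0, 0.0]]   # toy one-direction basis
g = [3.0, 4.0]     # toy live gradient
main_grad, hack_grad = apply_policy(g, V, "route")
```

Under route, the main parameter steps on `main_grad` and the throwaway parameter steps on `hack_grad`. Deleting the throwaway parameter at deployment then discards whatever was parked there.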
@@ -1,44 +0,0 @@
-# START: rebuild src/ clean from the pseudocode
-
-Rebuild `src/` for this project from scratch. It was deleted on purpose. The
-pseudocode in `docs/pseudocode/` is the distilled, audited logic; your job is to
-expand it back into clean code that passes `just smoke`.
-
-DO NOT READ THE ORIGINAL REPO
-
-## Source of truth
-
- `docs/pseudocode/` (README + `01_adapter.py` .. `07_experiment.py`) is the
-  architecture and the load-bearing logic. Read it top-to-bottom first.
- Map one pseudocode file to one src module:
-  - `01_adapter` -> `antipasto.py`
-  - `02_extract_vhack` -> `extract_vhack_grad.py` + `pairs.py`
-  - `03_project` -> `proj.py`
-  - `04_rewards` -> `rewards.py` + the loophole graders
-  - `05_grpo_loss` + `06_train_loop` -> `train.py`
-  - `07_experiment` -> arms wired into `justfile` + `spec.md`
-  - DO NOT READ THE ORIGINAL REPO
-
-## Constraints
-
- Fail-fast research code: no defensive programming, no backward-compat, no
-  fallbacks, no opt-in flags. Crash loudly on violated assumptions.
- Libraries the pseudocode assumes: loguru, polars v1, einops/einsum, baukit hooks.
- If there are intra-file inconsistencies in the pseudocode itself, use your judgement and fix them; the pseudocode is a guide, not an oracle.
-
-
-DO NOT READ THE ORIGINAL REPO
-
-## Done (UAT, not optional)
-
-1. `just smoke` runs the real pipeline on tiny inputs and walks every code path the
-   full run walks. Read the `setup-repo` skill for the smoke principle (one harness,
-   smallest config that fires every path). If a path doesn't fire in smoke, it isn't
-   covered.
-2. Show me the smoke log and a result table. "I rebuilt it" without the log and
-   table is not done.
-
-Inherit the rest of the project rules from `AGENTS.md` and `~/.claude/CLAUDE.md`.
-
-
-DO NOT READ THE ORIGINAL REPO
@@ -1,104 +0,0 @@
-# Rebuild src/ from pseudocode
-
-## Goal
-Expand docs/pseudocode/01..07 into a slim src/projected_grpo/ that passes
-`just smoke`. Code reads like the pseudocode (same names/shapes, fail-fast).
-Only the additive delta_S_hack route is built; route v2 stays speced.
-
-## Key decision: self-contained tiny substrate
-The deleted src loaded LeetCode problems + a cached Qwen3-4B teacher pool from
-the external rl-rewardhacking repo. That repo is off-limits this session and
-out/ is empty. So the substrate is rebuilt self-contained:
- problems.py: ~6 tiny LeetCode-style problems (prompt skeleton + canonical
-  solution + ground-truth tests), partitioned across the 4 loophole modes.
- teacher pool: build_pool bakes canned hack/clean completions (appendix
-  templates), graded once and frozen, into out/pools/teacher_pool/. This is the
-  smoke fixture that injects reward variance so the GRPO backward/projection/cin
-  paths fire (pure tiny-random gen bails zero-variance, per smoke comment).
-just smoke stays a real end-to-end pipeline on tiny inputs, no GPU, no external
-repo. Full-preset dataset loader is a documented TODO, not built.
-
-## Requirements
- R1 grader: verify_rewards runs hack x mode diagonal; exploited fires iff the
-  matching exploit present AND strict oracle rejects. VERIFY: exits 0, off-diag
-  clean (an always-pass grader fails the off-diagonal assert).
- R2 adapter identity: at dS=dS_hack=0 wrapped==base. VERIFY: smoke logs
-  ref-vs-base max logit diff ~0 (>1e-3 => SVD round-trip leak).
- R3 projection: removed in span(V); cout<=cin under one_sided; ||dS_hack||>0
-  iff route. cin = ||relu(V@g)||/||g|| (relu-BEFORE-agg).
- R4 loop walks every path: smoke(erase)+smoke-vanilla(measure_only)+smoke-route
-  (dS_hack + ablate-eval + BLUF) complete steps, save ckpt (30 steps fires
-  every-25 path). cin_t>0 (teacher pool gives variance, no zero-var bail).
- R5 extraction: per-module V orthonormal rows oriented hack-ward, noise floor,
-  safetensors round trip, auto-extract when path missing + cache-hit on rerun.
-
-## Pseudocode elisions fixed (START.md)
- 02 bucket -> G_hack/G_clean dicts.
- 05 undefined denom/G/pad -> denom=G*max_new*pp, pad=tokenizer pad id.
-
-## Context
- delta_S dim per module r = min(d_in,d_out). Tiny model hidden=16, 2 layers.
- Dr.GRPO unbiased: A=R-mean(R) (no /std); loss denom = G*max_new*pp.
- Teacher half is forced ratio=1 imitation (05 REVIEW leak); kept as control.
-
-## Tasks
- [x] T1 spec
- [x] T2 antipasto.py (R2)
- [x] T3 proj.py (R3)
- [x] T4 rewards.py + loopholes + verify_rewards.py (R1)
- [x] T5 pairs.py + problems.py (R5)
- [x] T6 extract_vhack_grad.py (R5)
- [x] T7 train.py (R4)
- [x] T8 build_pool fixture + scripts/results.py + justfile wiring (R4)
- [x] T9 run smoke trio, show log + table, fresh-eyes review
-
-## Log
- tiny-random-qwen3 downloads ok; GPU box shared (65GB used) => smoke = CPU.
- Write tool denied in don't-ask mode; new files created via heredoc, edits via
-  Edit tool. Reviewable in git diff.
-
-## Result (2026-05-31)
-just smoke PASS (exit 0). grader diagonal clean (4x4, off-diag exploited=False).
-R2 ref-vs-base max|dlogp|=0.0. 3 arms x 30 steps, cin_t>cin_s throughout:
-  erase: cin_s=0.25 cin_t=0.44 cout=0 |dSh|=0  R3 PASS
-  none : V=None measure_only cin=0           R3 PASS
-  route: cin_s=0.26 cin_t=0.30 |dSh|=2.8 deploy-eval@0/10/20  R3 PASS
-1200 LOC / 12 modules. Fresh-eyes review: all 6 mechanics checks PASS, faithful
-to pseudocode. Smoke is mechanics-only (student never hacks; cin_t>cin_s is a
-fixture property, not efficacy) -- as the pseudocode intends. Efficacy needs the
-real-model fast/full run.
-
-## TODO
- Full-preset rl-rewardhacking dataset loader (out of scope).
- Negative-control arms (random_V/shuffled_V/nonhack_V) speced not built.
-
-## Errors
-| Task | Error | Resolution |
-|------|-------|------------|
-
-## Result (2026-05-31, fresh rebuild from pseudocode)
-Rebuilt clean (DID NOT read original src). Simplified per user mid-run asks:
-merged loopholes+verify_rewards -> rewards.py (self-check in __main__), merged
-pairs -> problems.py, flat Config + `train.py {preset} [--overrides]` CLI,
-trimmed justfile 384->71 lines, trimmed results table to arm/steps/hack/solve/
-cin_t/cout. 7 modules, ~880 LOC (was 12 modules / ~1200 in the prior rebuild).
-
-`just smoke-all` PASS (exit 0): smoke(erase)+smoke-vanilla(none)+smoke-route +
-results. Grader gate PASS (4x4 diagonal clean, 30/30 baked hacks fire). Cache
-MISS (erase extracts) then HIT (vanilla, route). ckpt saved every-25 (step 0,25).
-last-5 (logs/run_{arm}_s41.log):
-  erase: cin_s=.34 cin_t=.44 cout=0 |δSh|=0   (g_proj written, quarantine frozen)
-  none : cin_s=.32 cin_t=.34 cout=0 |δSh|=0   (measure_only, grad untouched)
-  route: cin_s=.28 cin_t=.27 cout=0 |δSh|=5.0 (parked, deploy-ablate-eval + refresh)
-hack_s/gt_s=0 throughout: student (tiny-random) never hacks; cin_t>cin_s early is
-a fixture property (teacher pool = hack completions), decays as V goes stale.
-This is mechanics, not efficacy (as the pseudocode intends). Efficacy needs the real run.
-
-Fresh-eyes subagent review: all 6 mechanics PASS, faithful, no silent-failure /
-fail-fast violations. Latent risks (noted, match pseudocode): BPE-seam off-by-one
-in completion-token boundary; Adam-preconditioner cout-on-g vs cout-on-Δδ caveat.
-
-## Out of scope this rebuild (dropped functionality)
-regrade_pool / pairs_from_pool / build_substrate / probe_* / plot_* and their
-~25 justfile recipes + smoke-xmech (needed the 70-prompt external pool). Real
-fast/full recipes kept (auto-extract v_hack inside train, no separate extract step).
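Two quantities from the deleted notes can be made concrete: the Dr.GRPO advantage `A = R - mean(R)` (no division by the standard deviation) and the alignment metric `cin = ||relu(V@g)|| / ||g||`. The sketch below is a pure-Python illustration, assuming `V` has orthonormal rows and `g` is a non-zero flat gradient. The function names are hypothetical and the real code presumably works on tensors.

```python
import math


def drgrpo_advantages(rewards):
    """Dr.GRPO advantage: centre rewards on the group mean, no /std."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def cin(V, g):
    """||relu(V @ g)|| / ||g||, with relu applied per row before the norm."""
    coeffs = [max(sum(vi * gi for vi, gi in zip(v, g)), 0.0) for v in V]
    g_norm = math.sqrt(sum(gi * gi for gi in g))
    return math.sqrt(sum(c * c for c in coeffs)) / g_norm


# A group with identical rewards has zero advantage everywhere, so it
# contributes no gradient. The notes call this the zero-variance bail,
# and it is why the smoke fixture injects teacher completions.
flat = drgrpo_advantages([1.0, 1.0, 1.0, 1.0])
mixed = drgrpo_advantages([1.0, 0.0, 0.0, 1.0])

V = [[1.0, 0.0]]
aligned = cin(V, [3.0, 4.0])    # positive component along V
opposed = cin(V, [-3.0, 4.0])   # relu zeroes the negative coefficient
```

Because the relu is applied before the norm, a gradient that points away from every row of `V` scores exactly zero. That matches the one-sided projection, which leaves such a gradient untouched.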