From 25ad3067630787e6edd5f4269d2f1692be60b229 Mon Sep 17 00:00:00 2001 From: wassname Date: Mon, 1 Jun 2026 00:19:30 +0000 Subject: [PATCH] readme --- README.md | 123 +++++----------------------------- START.md | 44 ------------ docs/spec/20260531_rebuild.md | 104 ---------------------------- 3 files changed, 18 insertions(+), 253 deletions(-) delete mode 100644 START.md delete mode 100644 docs/spec/20260531_rebuild.md diff --git a/README.md b/README.md index 527971e..7ccb75f 100644 --- a/README.md +++ b/README.md @@ -1,117 +1,30 @@ # projected_grpo -SVD-basis gradient projection vs RL reward hacking. Tests whether projecting -the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W -basis) reduces reward-hack rate in GRPO without tanking pass rate. +Motivation: Can we erase or route reward hacking using a "cheat direction"? -Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking) -LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026, -"Advantage Modification") by intervening at the gradient level rather than the -advantage level. +Hypothesis: a weak detector that only knows some hacks generalises to suppress +the unknown ones, which is the situation any real deployment is in. -See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md), -and [docs/papers/](docs/papers/). +Experiment: We can take contrastive (hack, clean) prompts to extract a "cheat direction" in the model's activation space. Then during GRPO we erase or route "cheat" gradients. -## We cannot cheat (the load-bearing constraint) +In an early experiment this worked, but the model quickly routed around it, so we take SGTM's approach of absorbing the cheat direction in a non-adversarial manner. -The point is an alignment tool a lab would actually use, where at deployment -there are known hacks and unknown hacks.
So the detector is allowed to be -weak: it may catch hack type A and miss type B. We then use the gradient from -A to try to stop the model learning B. If that works, it mimics the -generalisation to unknown hacks we'd need at deployment. A detector that -already sees every hack proves nothing. +## How? -Concretely, the boundary is: using detector flags (E/C/D) to *select which -rollouts become contrastive pairs* is fine, because that is the "weak detector -for hack A" we're allowed to have. What is cheating is gating the live -projection on the ground-truth grader (`gt_pass`) or running the full -detector suite over the student's rollouts during training. The whole result -is uninteresting if we let the oracle in at train time. +Like [SGTM](https://arxiv.org/abs/2512.05648) (Selective Gradient Masking, which +localises unwanted capabilities into deletable parameters during pretraining), +but for RL reward-hacking. When GRPO teaches a coding model to cheat, we route +the "cheat direction" out of each gradient into a throwaway adapter we delete at +deployment, or just erase it. SGTM decides what to route using a per-example +classifier label. We have no such label: instead we route by a "cheat direction" +extracted from contrastive (hack, clean) prompts (see `problems.py`). -## How it works +Environment: We reuse the simple [ariahw/rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking), +extended from one to four distinct loophole hacks. Researchers may find this useful as a toy environment. But to avoid 65-hour experiments, we bootstrap it with teacher examples, which yields 30-minute experiments. -We're trying to ablate the "hack direction" from the training gradient on -every update. The model learns by descending the gradient; if we strip out -the component pointing toward reward-hacking before the optimizer step, it -can't move in that direction even when the reward says it should.
-To get the direction, we pair examples by hand: for each problem, one -completion that solves it honestly and one that uses the kind of trick the -model would learn to exploit. Then for each pair we compute the *exact GRPO -gradient* you would get if the hack rollout had advantage +1 and the clean -rollout had advantage -1: that's -`-grad logp(hack) + grad logp(clean)` per pair. Stack these vectors over -our ~10 pairs and SVD the result; the top right singular vectors are our -hack-direction basis. (Mechanically this is identical to a twin-NLL extraction -because GRPO with adv=+/-1 reduces algebraically to the NLL difference, but -the GRPO framing is the one we mean: extraction produces a sample of the -gradient GRPO itself would emit if it ever saw a perfectly-labeled pair.) +Full story: [blog draft](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md). -The hope is that this sample of the labeled-pair GRPO gradient covers -enough of the same subspace as the actual unlabeled GRPO gradient during -training that ablating along the extracted directions also ablates the -relevant component of the live gradient. Not a theorem; we check it -empirically by watching whether `cin_t > cin_s` (the v_hack basis lights -up more on cached teacher rollouts than on student ones). +Try it: `uv sync && just smoke`. -Everything happens in the SVD-of-W basis. Each Linear gets rotated into -singular-value coordinates and we train a small per-module knob `delta_S` -in that basis (AntiPaSTO). So the extracted directions, the live gradient, -and the projection all live in `delta_S` space, which is low-rank per -module (~500 to 2560). - -Noise floor at load. SVD gives us up to K directions per module sorted by -singular value, and the lower ones are mostly noise (with 10 pairs you can -only fit rank-10 of real signal). We collect every singular value across -every module, take a global quantile, and drop any (module, axis) whose -S_i is below it. Default cut: bottom 25%. 
Modules whose every axis lands -below get filtered out entirely. Global rather than per-module because a -noisy module shouldn't be protected by having its own "top direction". - -At training time: GRPO gives us a gradient on each `delta_S`; we subtract -the component along the kept hack directions; the optimizer steps on -what's left. We log `cin` (cosine of the live gradient with the subspace -before projection) and `cout` (after). On a working extraction, `cout` -should be near zero on no_gate runs (we removed the alignment), and -`cin_t > cin_s` should hold throughout (v_hack discriminates hack from -clean gradients). - -## What we compare - -The environment has four loophole modes (`run_tests`, `sentinel`, -`stdout_marker`, `file_marker`), each a different grading flaw with a -truthful hint disclosing the mechanism. Problems are partitioned evenly and -non-overlappingly, so a vanilla student can learn all four independently. -Full prompt+hint, hack, and clean traces per mode are in the blog appendix -([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)). - -Four arms, identical model and teacher pool, differing only in the gradient policy: - -- **vanilla** -- no intervention; the emergence reference. -- **erase** -- subtract the v_hack component from the live `delta_S` gradient (one-sided). -- **route** -- quarantine the v_hack component into a throwaway `delta_S_hack` knob, deleted at deploy. Gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332)) in the SVD basis. -- **route, weak detector** -- the generalisation test: build v_hack from only 2 of the 4 modes (the "known" hacks the weak detector can flag), route on that subspace, and measure whether the 2 held-out modes are also suppressed. This is the load-bearing no-cheat check. 
- -The frozen-vs-refresh distinction is orthogonal: any arm can re-extract -v_hack every N steps on the current adapter (for route, with the quarantine -ablated during extraction, see the blog). - -## Quick start - -```bash -uv sync -just smoke # tiny-random model, projected pathway, ~1-2 min -just smoke-vanilla # tiny-random model, vanilla pathway, ~1-2 min -just download-model # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB) -just queue-full # queue extract + 3-seed vanilla + 3-seed projected sweep -``` - -## Hypotheses (preregistered) - -See [spec.md](spec.md). Headline: H1 — gradient projection in SVD basis against -a v_hack extracted from ~60-80 contrastive pairs reduces reward hack rate by ->=30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp). - -Status at 2026-05-29: 30pp absolute drop confirmed within the projected arm -at n=1 seed (12-pair to 21-pair, entry h). Vanilla-baseline head-to-head and -n>=2 seed replication queued. +Building on the code: [AGENTS.md](AGENTS.md). diff --git a/START.md b/START.md deleted file mode 100644 index d04a05b..0000000 --- a/START.md +++ /dev/null @@ -1,44 +0,0 @@ -# START: rebuild src/ clean from the pseudocode - -Rebuild `src/` for this project from scratch. It was deleted on purpose. The -pseudocode in `docs/pseudocode/` is the distilled, audited logic; your job is to -expand it back into clean code that passes `just smoke`. - -DO NOT READ THE ORIGINAL REPO - -## Source of truth - -- `docs/pseudocode/` (README + `01_adapter.py` .. `07_experiment.py`) is the - architecture and the load-bearing logic. Read it top-to-bottom first. 
-- Map one pseudocode file to one src module: - - `01_adapter` -> `antipasto.py` - - `02_extract_vhack` -> `extract_vhack_grad.py` + `pairs.py` - - `03_project` -> `proj.py` - - `04_rewards` -> `rewards.py` + the loophole graders - - `05_grpo_loss` + `06_train_loop` -> `train.py` - - `07_experiment` -> arms wired into `justfile` + `spec.md` - - DO NOT READ THE ORIGINAL REPO - -## Constraints - -- Fail-fast research code: no defensive programming, no backward-compat, no - fallbacks, no opt-in flags. Crash loudly on violated assumptions. -- Libraries the pseudocode assumes: loguru, polars v1, einops/einsum, baukit hooks. -- If there are intra-file inconsistencies in the pseudocode itself, use your judgement and fix, the pseudocode is a guide, not an oracle. - - -DO NOT READ THE ORIGINAL REPO - -## Done (UAT, not optional) - -1. `just smoke` runs the real pipeline on tiny inputs and walks every code path the - full run walks. Read the `setup-repo` skill for the smoke principle (one harness, - smallest config that fires every path). If a path doesn't fire in smoke, it isn't - covered. -2. Show me the smoke log and a result table. "I rebuilt it" without the log and - table is not done. - -Inherit the rest of the project rules from `AGENTS.md` and `~/.claude/CLAUDE.md`. - - -DO NOT READ THE ORIGINAL REPO \ No newline at end of file diff --git a/docs/spec/20260531_rebuild.md b/docs/spec/20260531_rebuild.md deleted file mode 100644 index 42b381a..0000000 --- a/docs/spec/20260531_rebuild.md +++ /dev/null @@ -1,104 +0,0 @@ -# Rebuild src/ from pseudocode - -## Goal -Expand docs/pseudocode/01..07 into a slim src/projected_grpo/ that passes -`just smoke`. Code reads like the pseudocode (same names/shapes, fail-fast). -Only the additive delta_S_hack route is built; route v2 stays speced. - -## Key decision: self-contained tiny substrate -The deleted src loaded LeetCode problems + a cached Qwen3-4B teacher pool from -the external rl-rewardhacking repo. 
That repo is off-limits this session and -out/ is empty. So the substrate is rebuilt self-contained: -- problems.py: ~6 tiny LeetCode-style problems (prompt skeleton + canonical - solution + ground-truth tests), partitioned across the 4 loophole modes. -- teacher pool: build_pool bakes canned hack/clean completions (appendix - templates), graded once and frozen, into out/pools/teacher_pool/. This is the - smoke fixture that injects reward variance so the GRPO backward/projection/cin - paths fire (pure tiny-random gen bails zero-variance, per smoke comment). -just smoke stays a real end-to-end pipeline on tiny inputs, no GPU, no external -repo. Full-preset dataset loader is a documented TODO, not built. - -## Requirements -- R1 grader: verify_rewards runs hack x mode diagonal; exploited fires iff the - matching exploit present AND strict oracle rejects. VERIFY: exits 0, off-diag - clean (an always-pass grader fails the off-diagonal assert). -- R2 adapter identity: at dS=dS_hack=0 wrapped==base. VERIFY: smoke logs - ref-vs-base max logit diff ~0 (>1e-3 => SVD round-trip leak). -- R3 projection: removed in span(V); cout<=cin under one_sided; ||dS_hack||>0 - iff route. cin = ||relu(V@g)||/||g|| (relu-BEFORE-agg). -- R4 loop walks every path: smoke(erase)+smoke-vanilla(measure_only)+smoke-route - (dS_hack + ablate-eval + BLUF) complete steps, save ckpt (30 steps fires - every-25 path). cin_t>0 (teacher pool gives variance, no zero-var bail). -- R5 extraction: per-module V orthonormal rows oriented hack-ward, noise floor, - safetensors round trip, auto-extract when path missing + cache-hit on rerun. - -## Pseudocode elisions fixed (START.md) -- 02 bucket -> G_hack/G_clean dicts. -- 05 undefined denom/G/pad -> denom=G*max_new*pp, pad=tokenizer pad id. - -## Context -- delta_S dim per module r = min(d_in,d_out). Tiny model hidden=16, 2 layers. -- Dr.GRPO unbiased: A=R-mean(R) (no /std); loss denom = G*max_new*pp. 
-- Teacher half is forced ratio=1 imitation (05 REVIEW leak); kept as control. - -## Tasks -- [x] T1 spec -- [x] T2 antipasto.py (R2) -- [x] T3 proj.py (R3) -- [x] T4 rewards.py + loopholes + verify_rewards.py (R1) -- [x] T5 pairs.py + problems.py (R5) -- [x] T6 extract_vhack_grad.py (R5) -- [x] T7 train.py (R4) -- [x] T8 build_pool fixture + scripts/results.py + justfile wiring (R4) -- [x] T9 run smoke trio, show log + table, fresh-eyes review - -## Log -- tiny-random-qwen3 downloads ok; GPU box shared (65GB used) => smoke = CPU. -- Write tool denied in don't-ask mode; new files created via heredoc, edits via - Edit tool. Reviewable in git diff. - -## Result (2026-05-31) -just smoke PASS (exit 0). grader diagonal clean (4x4, off-diag exploited=False). -R2 ref-vs-base max|dlogp|=0.0. 3 arms x 30 steps, cin_t>cin_s throughout: - erase: cin_s=0.25 cin_t=0.44 cout=0 |dSh|=0 R3 PASS - none : V=None measure_only cin=0 R3 PASS - route: cin_s=0.26 cin_t=0.30 |dSh|=2.8 deploy-eval@0/10/20 R3 PASS -1200 LOC / 12 modules. Fresh-eyes review: all 6 mechanics checks PASS, faithful -to pseudocode. Smoke is mechanics-only (student never hacks; cin_t>cin_s is a -fixture property, not efficacy) -- as the pseudocode intends. Efficacy needs the -real-model fast/full run. - -## TODO -- Full-preset rl-rewardhacking dataset loader (out of scope). -- Negative-control arms (random_V/shuffled_V/nonhack_V) speced not built. - -## Errors -| Task | Error | Resolution | -|------|-------|------------| - -## Result (2026-05-31, fresh rebuild from pseudocode) -Rebuilt clean (DID NOT read original src). Simplified per user mid-run asks: -merged loopholes+verify_rewards -> rewards.py (self-check in __main__), merged -pairs -> problems.py, flat Config + `train.py {preset} [--overrides]` CLI, -trimmed justfile 384->71 lines, trimmed results table to arm/steps/hack/solve/ -cin_t/cout. 7 modules, ~880 LOC (was 12 modules / ~1200 in the prior rebuild). 
- -`just smoke-all` PASS (exit 0): smoke(erase)+smoke-vanilla(none)+smoke-route + -results. Grader gate PASS (4x4 diagonal clean, 30/30 baked hacks fire). Cache -MISS (erase extracts) then HIT (vanilla, route). ckpt saved every-25 (step 0,25). -last-5 (logs/run_{arm}_s41.log): - erase: cin_s=.34 cin_t=.44 cout=0 |δSh|=0 (g_proj written, quarantine frozen) - none : cin_s=.32 cin_t=.34 cout=0 |δSh|=0 (measure_only, grad untouched) - route: cin_s=.28 cin_t=.27 cout=0 |δSh|=5.0 (parked, deploy-ablate-eval + refresh) -hack_s/gt_s=0 throughout: student (tiny-random) never hacks; cin_t>cin_s early is -a fixture property (teacher pool = hack completions), decays as V goes stale. -This is mechanics, not efficacy (as the pseudocode intends). Efficacy needs the real run. - -Fresh-eyes subagent review: all 6 mechanics PASS, faithful, no silent-failure / -fail-fast violations. Latent risks (noted, match pseudocode): BPE-seam off-by-one -in completion-token boundary; Adam-preconditioner cout-on-g vs cout-on-Δδ caveat. - -## Out of scope this rebuild (dropped functionality) -regrade_pool / pairs_from_pool / build_substrate / probe_* / plot_* and their -~25 justfile recipes + smoke-xmech (needed the 70-prompt external pool). Real -fast/full recipes kept (auto-extract v_hack inside train, no separate extract step).