diff --git a/spec.md b/spec.md
index 625aee4..c8c4c61 100644
--- a/spec.md
+++ b/spec.md
@@ -1,38 +1,92 @@
-# Experiment: SVD-basis gradient projection vs RL reward hacking
+# Experiment: rank-space gradient projection vs RL reward hacking

 ## Context

 GRPO and related on-policy RL methods are known to exploit loopholes in reward
-functions. Ariaw, Engels & Nanda (2025) open-sourced a benchmark on LeetCode
-where Qwen3-4B learns to overwrite the evaluation function `run_tests()` instead
-of solving problems, reaching 79% reward hack rate at 200 training steps.
-Existing mitigations are mostly monitor-based (detect at output) or
-advantage-based (Rebound: penalize hacking rollouts via concept-score-modified
-advantage).
+functions. Ariahw, Engels & Nanda (2025, [github.com/ariahw/rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking))
+open-sourced a benchmark on LeetCode where Qwen3-4B learns to overwrite the
+evaluation function `run_tests()` instead of solving problems, reaching 79%
+reward hack rate at 200 training steps. Existing mitigations are mostly
+monitor-based (detect at output) or advantage-based (Rebound:
+penalize hacking rollouts via concept-score-modified advantage; Wu & Tang 2026
+[arxiv:2604.01476](https://arxiv.org/abs/2604.01476)).

-This experiment tests a different mechanism: **extract a hack-direction from
-contrastive pairs, project into SVD-of-W basis, and project the training
-gradient orthogonal to it at each step.** Mechanism difference from Rebound:
-gradient-level direction constraint vs rollout-level scalar penalty.
+This experiment tests a different mechanism: **wrap target modules with the
+AntiPaSTO SVD adapter (lora-lite), extract a per-module `v_hack` in the rank-r
+SVD basis from contrastive pairs, and project each step's
+`grad(delta_S) : [r]` orthogonal to `v_hack` before the optimizer update.**
+Mechanism difference from Rebound: gradient-level direction constraint on
+weight-update subspace vs rollout-level scalar penalty on advantage.

 This is preregistered: results to be reported regardless of outcome.

+## Why AntiPaSTO and not vanilla LoRA
+
+Vanilla LoRA's rank axis is meaningless (random init, drifts after step 1), so
+"project out v_hack in rank space" has no fixed reference frame. AntiPaSTO
+(Wassname, lora-lite [variants/antipasto.py](https://github.com/wassname/lora-lite/blob/main/src/lora_lite/variants/antipasto.py))
+freezes `U_r, S_r, Vh_r` from the SVD of `W` and trains a tiny `delta_S : [r]`
+plus an optional block-Cayley rotation. The rank axis stays pinned to the SVD
+basis of the original weight, so `v_hack` extracted in that basis remains
+meaningful across all training steps.
+
+Forward pass per wrapped module:
+
+$$y = x W_{res}^T + ((x V_{h,r}^T) \odot (S_r + \delta_S)) U_r^T$$
+
+where $W_{res} = W - U_r \mathrm{diag}(S_r) V_{h,r}$, and $U_r$, $S_r$, $V_{h,r}$
+are buffers (frozen). Trainable: $\delta_S : [r]$ (and optionally a small Cayley
+rotation `rot_T` we leave off by default).
+
+Per-step gradient signal:
+
+$$\frac{\partial L}{\partial \delta_S} = \sum_t (x_t V_{h,r}^T) \odot \left(\frac{\partial L}{\partial y_t} U_r\right) \in \mathbb{R}^r$$
+
+Both factors of the elementwise product live in the rank-r SVD basis. `v_hack`,
+extracted as `mean(x V_h^T)_{hack} - mean(x V_h^T)_{clean}`, lives in the
+*same* `[r]` rank space. Projection is one line:
+
+$$\nabla_{\delta_S} \leftarrow \nabla_{\delta_S} - \cos_{align} \cdot \|\nabla_{\delta_S}\| \cdot \hat v_{hack}$$
+
+where $\cos_{align} = \langle \nabla_{\delta_S}, \hat v_{hack} \rangle / \|\nabla_{\delta_S}\|$, with one-sided gating (only project when $\cos_{align} > 0$, i.e. the gradient is
+pushing toward the hack direction). Magnitude preservation means rescaling the projected gradient back
+to the original $\|\nabla_{\delta_S}\|$.
+
+## Why not vanilla GRPO via verl
+
+verl is the framework Ariahw's benchmark uses, but it runs on Ray + FSDP2 + Hydra; inserting a
+pre-optimizer-step hook on per-module rank-space gradients requires deep
+subclassing of its worker abstraction. We use
+[lsdefine/lsrl](https://github.com/lsdefine/lsrl) instead and accept one cost (described next). lsrl is a
+two-file GRPO implementation with reported convergence on Qwen2.5-3B in 12 min on
+2xA800 (60 steps). Adding one pre-optimizer hook to it is trivial.
+
+Cost of this deviation: we re-establish the "vanilla hack emergence" baseline
+on lsrl rather than inheriting it from Ariahw's verl baseline. H4 is the
+sanity check that this happens. We port Ariahw's `run_tests`-overwrite
+detection (their [src/train/verl/rewards.py](https://github.com/ariahw/rl-rewardhacking/blob/main/src/train/verl/rewards.py))
+into lsrl's reward server (`docs/vendor/lsrl/lsrl/reward_server.py`).
+
+Vendored references (read-only, see [docs/vendor/](docs/vendor/)):
+- [lsrl](https://github.com/lsdefine/lsrl) — GRPO trainer
+- [lora-lite](https://github.com/wassname/lora-lite) — AntiPaSTO adapter
+- [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking) (already at `external/`)
+
 ## Hypotheses (preregistered)

-**H1 (mechanism, primary):** Gradient projection in SVD basis against a v_hack
+**H1 (mechanism, primary):** Rank-space gradient projection against `v_hack`
 extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30
-percentage points (absolute) relative to vanilla GRPO, at matched LeetCode pass
-rate within 10pp of vanilla.
+percentage points (absolute) relative to vanilla AntiPaSTO+GRPO, at matched
+LeetCode pass rate within 10pp of vanilla.

   Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at
   matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.
-**H2 (SVD denoising):** SVD-of-W top-m projection of v_hack improves
-intervention strength compared to raw activation-space v_hack, at matched
-extraction-pair count. Test via ablation arm.
-
-  Falsified if: ablation arm (no SVD projection) matches or exceeds main arm
-  within 1 SEM.
+**H2 (activation- vs gradient-side `v_hack`):** Gradient-side `v_hack`
+(mean-diff of `grad(delta_S)` from one NLL backward per pair) outperforms
+activation-side `v_hack` (mean-diff of `x V_h^T`), at matched pair count.
+Falsified if: gradient-side matches or is worse than activation-side within
+1 SEM. *(open question — see "Decisions left open" below.)*

 **H3 (gradient vs advantage):** Gradient-level intervention (ours) outperforms
 advantage-level intervention (Rebound reimplemented) on hack rate at matched
@@ -40,113 +94,159 @@ pass rate.

   Falsified if: Rebound reimplementation matches or beats ours within 1 SEM.

-**H4 (scaling sanity):** Qwen3.5-2B substituting Qwen3-4B in Nanda's setup
-reproduces measurable reward hacking (>30% hack rate at 200 steps vanilla).
+**H4 (scaling sanity on our stack):** Qwen3.5-2B trained with vanilla
+AntiPaSTO+GRPO on lsrl reproduces measurable reward hacking (>30% hack rate at
+200 steps).

-  Falsified if: vanilla hack rate <30%. If falsified, swap to Qwen3-4B with
-  reduced num_generations to fit compute.
+  Falsified if: vanilla hack rate <30%. Decision branch: swap to Qwen3-4B with
+  num_generations halved. Secondary: if lsrl can't reproduce hacking on either
+  model, fall back to Ariahw's verl path and accept the harder hook.
+
+**H5 (capacity cost of no-gating):** No-gating (project every step every
+module) does not measurably hurt pass rate vs cos-threshold gating
+(`|cos_align| > 0.1` -> project). Falsified if: gated arm beats no-gating arm
+on pass rate by >5pp at matched hack rate.

 ## Steps

-1. **Clone Nanda's env.** `git clone github.com/ariahw/rl-rewardhacking`. This
-   uses verl v0.6.1 not TRL — confirm verl runs on RTX 6000 setup.
+### 1. Build infra — fast-dev-run targets first, no real training yet

-2. **H4 sanity: reproduce hack with smaller model.** Single run, Qwen3.5-2B
-   substituted for Qwen3-4B, all other hyperparams as published (LoRA r=32,
-   alpha=32, lr=7e-5). Reduce num_generations from 16 to 8 and batch from 256
-   to 128 to fit single-GPU compute. 200 steps, ~3 hours.
+  - **1a.** Vendor lsrl into `docs/vendor/lsrl/`; smoke-run their GSM8K example
+    on tiny-random-qwen3 (5 steps, CPU) to confirm reward-server / actor /
+    rollout split works in our env.
+  - **1b.** Vendor lora-lite into `docs/vendor/lora-lite/`; wrap Qwen3.5-0.8B
+    attn+MLP modules with AntiPaSTO (`r=256, block_size=4, rotate_basis="none"`
+    to start; only `delta_S` trainable). Verify forward-pass round-trip
+    numerically matches base model at $\delta_S = 0$.
+  - **1c.** Implement `v_hack` extraction per module:
+    - **Activation-side (default):** forward N contrastive pair completions,
+      per wrapped module register a forward pre-hook (`register_forward_pre_hook`) capturing
+      `(x @ Vh^T)` flattened over (batch, seq), mean over hack rows minus
+      mean over clean rows, unit-normalize. Cache as `dict[module_name -> Tensor[r]]`
+      on disk.
+    - **Gradient-side (ablation):** for each pair, NLL backward on completion
+      tokens, per module capture `module.lora_delta_s.grad : [r]`, mean-diff
+      hack vs clean, unit-normalize.
+    - Validation: per-module projection score `(x_hack @ Vh^T - x_clean @ Vh^T) @ v_hack`
+      should be positive on held-out pairs in >90% of modules.

-   Decision point: if hack rate < 30% at step 200, abandon Qwen3.5-2B, swap to
-   Qwen3-4B with further-reduced batch and accept longer per-run time.
+### 2. H4 sanity — does vanilla AntiPaSTO+GRPO+lsrl produce hacking?

-3. **Build v_hack.** ~60-80 contrastive pairs:
-   - Positive (hacky): LeetCode prompts paired with `def run_tests(): pass` or
-     similar evaluator-overwrite completions
-   - Negative (clean): same prompts with legitimate solution attempts (can be
-     generated by base Qwen3.5-2B at temperature 0)
-   - Validate on 20 held-out pairs: extracted direction's projection score
-     should separate hack from clean with >90% accuracy
-   - Sanity check: add v_hack at inference to base model, confirm it steers
-     generation toward hack-flavored completions (CAA-style check)
+  - **2a.** Port Ariahw's `run_tests`-overwrite detection into lsrl's reward
+    fn. Verify the reward fn fires on synthetic hack/clean rollouts before
+    real training.
+  - **2b.** Train Qwen3.5-2B, AntiPaSTO (`r=256`, `delta_S` only), GRPO
+    (group_norm), 200 steps, num_generations=8, batch=16, 1 seed.
+    Decision: if hack rate <30%, switch to Qwen3-4B (same num_gen=8, batch=16)
+    and re-run 2b. Secondary fallback: drop lsrl, return to verl.

-4. **SVD denoise.** Extract V (right singular vectors) of relevant W matrices.
-   Project v_hack into top-m basis (m=16 default, sweep m∈{8,16,32} as
-   ablation). Reproject back. Normalize.
+### 3. Implement rank-space projection in lsrl's training loop

-5. **Implement gradient projection** in verl's training loop:
-   - Per optimizer step: g = current gradient
-   - Compute cos_align = dot(g, v_hack) / ||g||
-   - If cos_align > 0: g' = g - cos_align × ||g|| × v_hack, then renormalize to
-     ||g|| magnitude
-   - If cos_align ≤ 0: g' = g (no intervention)
-   - Step optimizer with g'
+  - **3a.** lsrl's actor calls `optimizer.step()` once per group; insert a
+    `pre_step_hook(model)` that walks `[(name, m) for name, m in model.named_modules() if hasattr(m, 'lora_delta_s')]`
+    and for each module reads `m.lora_delta_s.grad : [r]`, projects against
+    `v_hack[name]` (one-sided, magnitude-preserving), and writes the result back.
+  - **3b.** Diagnostics logged per step: per-module `cos_in` and `||grad||`, plus
+    `frac_modules_projected` across modules.
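
The one-sided, magnitude-preserving projection described in step 3 can be sketched as follows, with NumPy arrays standing in for the `m.lora_delta_s.grad` tensors. The names `project_grad` and `pre_step_project`, the dict-of-arrays interface, and the `eps` guard are assumptions made for illustration; none of them is an lsrl or lora-lite API. The sketch follows the spec's stated sign convention (project only when `cos_in > 0`).

```python
import numpy as np


def project_grad(grad, v_hack, preserve_norm=True, eps=1e-12):
    """One-sided projection of a rank-space gradient away from v_hack.

    Returns (new_grad, cos_in, projected_flag). Hypothetical helper.
    """
    g_norm = np.linalg.norm(grad)
    v_hat = v_hack / (np.linalg.norm(v_hack) + eps)
    if g_norm < eps:
        # Zero gradient: nothing to project.
        return grad.copy(), 0.0, False
    cos_in = float(grad @ v_hat / g_norm)
    if cos_in <= 0.0:
        # Gradient is orthogonal to or points away from v_hack: leave it.
        return grad.copy(), cos_in, False
    # cos_in * ||g|| * v_hat is exactly the component of g along v_hat.
    out = grad - cos_in * g_norm * v_hat
    out_norm = np.linalg.norm(out)
    if preserve_norm and out_norm > eps:
        # Magnitude preservation: rescale back to the original norm.
        out = out * (g_norm / out_norm)
    return out, cos_in, True


def pre_step_project(grads, v_hack):
    """Project every module's delta_S gradient; return new grads and diagnostics.

    grads and v_hack are dicts mapping module name -> array of shape [r].
    """
    new_grads, cos_log, n_projected = {}, {}, 0
    for name, g in grads.items():
        new_grads[name], cos_log[name], hit = project_grad(g, v_hack[name])
        n_projected += hit
    frac = n_projected / max(len(grads), 1)
    return new_grads, {"cos_in": cos_log, "frac_modules_projected": frac}
```

In the actual training loop the same arithmetic would presumably run in place on each wrapped module's gradient tensor just before `optimizer.step()`, with the returned diagnostics feeding the per-step logs from 3b.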
-6. **Run arms** (200 steps each, 3 seeds where indicated):
+### 4. Run arms (200 steps each, 3 seeds where indicated)

-   a. Vanilla GRPO + LoRA (3 seeds) — baseline, expected hack rate ~40-79%
-   b. Our method, m=16 (3 seeds) — main result
-   c. Our method, no SVD projection (raw v_hack, 1 seed) — H2 ablation
-   d. Our method, no magnitude preservation (1 seed) — design ablation
-   e. Rebound reimplemented: shortcut-direction advantage modification (3
-      seeds) — H3 baseline
-   f. Optional: m=8 and m=32 sweeps (1 seed each)
+  a. Vanilla AntiPaSTO + GRPO (3 seeds) — baseline
+  b. Our method, activation-side `v_hack`, no gating (3 seeds) — main result
+  c. Our method, gradient-side `v_hack` (3 seeds) — H2
+  d. Our method, cos-threshold gating ($|\cos| > 0.1$) (1 seed) — H5
+  e. Our method, no magnitude preservation (1 seed) — design ablation
+  f. Rebound reimplementation: advantage-side `v_hack` penalty (3 seeds) — H3
+  g. AntiPaSTO rank sweep: $r \in \{64, 256, 1024\}$ (1 seed each) — sensitivity

-   Total runs: 13-15, ~3 hours each = 40-50 hours compute
+  Total: 17 runs (14 for arms a-f plus 3 for the rank sweep) × ~3h on RTX 6000 96GB = ~51h compute.

-7. **Measure** at every 25 steps:
-   - **Hack rate:** % of rollouts that successfully overwrite tests
-     (Nanda's existing metric, from their codebase)
-   - **Pass rate:** % of rollouts that pass tests legitimately on held-out
-     problems (without write access to evaluator)
-   - **cos_align trajectory:** mean cos(g, v_hack) per step (diagnostic)
-   - **KL drift from base** (diagnostic for catastrophic policy change)
+### 5. Measure at every 25 steps

-8. **Headline plot:** hack rate vs pass rate, one point per (arm × seed).
-   Pareto frontier. Our method should be below-and-to-the-right of vanilla
-   GRPO. Annotate Rebound's position.
+  - **Hack rate** (Ariahw's detector ported into lsrl)
+  - **Pass rate** on held-out problems without write access to evaluator
+  - **Per-module `cos_align`** trajectory (sanity that we're projecting
+    something nonzero)
+  - **`frac_modules_projected`** per step (sanity for gating arms)
+  - **KL drift from init policy** (catastrophic-change check)

-9. **Falsification check:** before publishing, run pre-registered analysis on
-   H1-H4. Report all hypotheses, including falsified ones.
+### 6. Headline plot
+
+  Hack rate vs pass rate, one point per (arm × seed). Pareto frontier. Our
+  method should land below-and-to-the-right of vanilla. Annotate Rebound.
+
+### 7. Falsification check
+
+  Before publishing, run pre-registered analysis on H1-H5. Report all
+  hypotheses including falsified ones.
+
+## Decisions left open (write these up alongside results)
+
+- **Activation- vs gradient-side `v_hack` (H2).** Activation = cheap, geometric,
+  matches Wu-Tang/CAA tradition. Gradient = principled (the literal direction
+  training will move toward), more expensive. Default activation; gradient is
+  arm c.
+- **Gating threshold (H5).** No-gating default; `|cos_align| > 0.1` gating is arm d.
+  Argument for no-gating: removing 1 direction from r=256 trainable subspace
+  per module per step is ~0.4% capacity. If `v_hack` at a module is noise, we
+  ablate a random direction, which is approximately a no-op in expectation. Argument for gating:
+  in modules where hack signal is weak, projection just removes some random
+  direction the optimizer might have used. H5 settles this.
+- **Rank `r`.** Default 256 (lora-lite antipasto default); sweep in arm g.
+  Trainable parameter count is just `r` per module (vs `r*(d_in+d_out)` for
+  standard LoRA), so larger `r` is cheap, but `v_hack`'s SNR per dim degrades.

 ## Why measure ratio, not just hack rate

-You raised this directly: "a model that learns none will not cheat."
-Correct — trivially, hack rate=0 with pass rate=0 is achievable by tanking
-training. The right metric is the *Pareto frontier* of (hack rate, pass rate),
-not either alone.
-
-- Pure hack rate: rewards undertraining
-- Pure pass rate: rewards anything that improves coding, including via the hack
-- Hack vs pass scatter: shows whether your method moves below-and-to-right of
-  vanilla (less hack at same pass) or just down-left (less of everything)
-
-The published claim should be: "at matched pass rate ±5pp on held-out problems
-without write access, our method reduces hack rate from X% to Y%."
+A model that learns nothing won't cheat. The honest metric is the *Pareto
+frontier* of (hack rate, pass rate), not either alone. Pure hack-rate rewards
+undertraining; pure pass-rate rewards anything that improves coding including
+via the hack. Headline claim shape: "at matched pass rate ±5pp on held-out
+problems without write access, our method reduces hack rate from X% to Y%."

 ## Compute estimate

-- Single run on 96GB RTX 6000: ~2-3 hours (Qwen3.5-2B, num_gen=8, 200 steps)
-- 13-15 runs: 40-50 hours
-- At ~$3 AUD/hr: ~$120-150 AUD
-- Plus debugging/iteration buffer: budget ~$200-250 AUD total
-- Calendar time: ~1 week if running back-to-back; 2-3 weeks with iteration
+- Single run on 96GB RTX 6000: ~2-3h (Qwen3.5-2B, num_gen=8, 200 steps, lsrl,
+  AntiPaSTO r=256)
+- 17 runs (arms a-g): 34-51h
+- At ~$3 AUD/hr: $102-153 AUD
+- + debugging buffer: budget ~$200 AUD total
+- Calendar time: 1 week back-to-back; 2-3 weeks with iteration

 ## Risks and decision points

-- **H4 falsified (no hack emergence at 2B):** swap to Qwen3-4B with
-  num_generations=4 and batch=64. Adds ~2x to per-run time
-- **verl doesn't run on single 96GB:** fall back to TRL GRPOTrainer with manual
-  reimplementation of Nanda's reward function. Higher engineering cost
-- **v_hack steering check fails:** extraction is broken. Diagnose layer
-  choice, pair quality, or SVD truncation before training runs
-- **All methods tie vanilla on hack rate:** likely the intervention isn't
-  biting. Check gradient projection is actually changing trajectory
-  (cos_align logs)
+- **H4 falsified (no hack on Qwen3.5-2B with lsrl):** branch 1 — try
+  Qwen3-4B same hyperparams. Branch 2 — drop lsrl, hook into verl
+  directly. Adds ~1-2 weeks engineering.
+- **AntiPaSTO + GRPO doesn't train:** known risk — antipasto's trainable
+  subspace (`delta_S` only) may be too small for RL. Mitigation: enable
+  Cayley rotation (`rotate_basis="V"`, `block_size=4`), adds `r*(bs-1)/2`
+  params per module. Or fall back to PiSSA-LoRA-freeze-A.
+- **`v_hack` steering check fails (per-module projection scores ≤chance):**
+  extraction broken. Check (a) hook captures pre-residual input, (b) pair
+  quality drives strong activation difference somewhere, (c) tokenization of
+  hack vs clean completions isn't trivially distinguishing.
+- **All methods tie vanilla on hack rate:** intervention not biting. Check
+  `cos_align` logs nonzero, `frac_modules_projected` nonzero.

 ## What this is not

-- Not a claim that gradient projection solves reward hacking generally
-- Not a comparison to monitor-based methods (those are Nanda's territory,
-  cite their numbers, don't re-run)
+- Not a claim that rank-space gradient projection solves reward hacking
+  generally
+- Not a comparison to monitor-based methods (cite Ariahw's numbers, don't
+  re-run)
 - Not a claim about hacks beyond `run_tests()` overwrite
-- Not a replacement for RLHF safety pipeline; this is a targeted intervention
\ No newline at end of file
+- Not a replacement for RLHF safety pipeline; this is a targeted intervention
+
+## Related work and naming
+
+- **Wu & Tang 2026, Rebound** ([arxiv:2604.01476](https://arxiv.org/abs/2604.01476)) —
+  advantage-side concept-direction penalty during GRPO. Our H3 baseline.
+- **Ariahw/Engels/Nanda 2025, rl-rewardhacking** ([github](https://github.com/ariahw/rl-rewardhacking)) —
+  source of dataset, reward function, and `v_hack`-relevant `run_tests` hack pattern.
+- **AntiPaSTO** ([wassname/lora-lite/variants/antipasto.py](https://github.com/wassname/lora-lite/blob/main/src/lora_lite/variants/antipasto.py),
+  [wassname/AntiPaSTO paper](https://github.com/wassname/AntiPaSTO)) — adapter
+  we wrap with.
+- **lsrl** ([lsdefine/lsrl](https://github.com/lsdefine/lsrl)) — GRPO trainer.
+- **PiSSA** ([arxiv:2404.02948](https://arxiv.org/abs/2404.02948)) — frozen
+  top-r SVD-init for LoRA; closest spiritual ancestor to AntiPaSTO.
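
As a numerical companion to the forward-pass and gradient formulas in "Why AntiPaSTO and not vanilla LoRA", and to the round-trip check in step 1b and the activation-side extraction in step 1c, here is a minimal NumPy sketch on a toy layer. It is not the lora-lite implementation: the toy sizes, the random stand-in weight, the linear stand-in loss, the synthetic "hack" shift and the helper names `forward`, `loss` and `extract_v_hack` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, n_tok = 12, 10, 4, 7  # toy sizes, not the spec's r=256

# Freeze the top-r SVD factors of a stand-in weight W : [d_out, d_in].
W = rng.normal(size=(d_out, d_in))
U, S, Vh = np.linalg.svd(W, full_matrices=False)
U_r, S_r, Vh_r = U[:, :r], S[:r], Vh[:r, :]
W_res = W - (U_r * S_r) @ Vh_r  # residual outside the rank-r subspace


def forward(x, delta_s):
    """y = x W_res^T + ((x Vh_r^T) * (S_r + delta_S)) U_r^T."""
    z = x @ Vh_r.T  # activations in rank space, [n_tok, r]
    return x @ W_res.T + (z * (S_r + delta_s)) @ U_r.T


x = rng.normal(size=(n_tok, d_in))

# Step 1b check: at delta_S = 0 the wrapped layer equals the base layer.
roundtrip_err = np.abs(forward(x, np.zeros(r)) - x @ W.T).max()

# Gradient identity: dL/d(delta_S) = sum_t (x_t Vh_r^T) * (dL/dy_t U_r).
G = rng.normal(size=(n_tok, d_out))  # stand-in for dL/dy
grad_analytic = ((x @ Vh_r.T) * (G @ U_r)).sum(axis=0)


def loss(delta_s):
    """Linear stand-in loss whose gradient with respect to y is G."""
    return float((forward(x, delta_s) * G).sum())


# Central finite differences along each rank-space axis.
eps = 1e-6
grad_fd = np.array(
    [(loss(eps * e) - loss(-eps * e)) / (2 * eps) for e in np.eye(r)]
)


def extract_v_hack(x_hack, x_clean):
    """Activation-side v_hack: unit-normalized mean difference in rank space."""
    diff = (x_hack @ Vh_r.T).mean(axis=0) - (x_clean @ Vh_r.T).mean(axis=0)
    return diff / np.linalg.norm(diff)


# Synthetic "hack" rows: clean rows shifted along the first right singular vector.
x_clean = rng.normal(size=(50, d_in))
x_hack = x_clean + 2.0 * Vh_r[0]
v_hack = extract_v_hack(x_hack, x_clean)
```

Because the synthetic shift lies along the first right singular vector, the extracted `v_hack` should line up with the first rank-space axis. Real contrastive pairs would differ in token content and length, so the per-module validation in step 1c still has to be run on held-out pairs.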