refined spec

- vec in grad space - SVD first - lsrl for simple_GRPO
2026-06-27 16:30:30 +08:00 · 2026-05-23 12:32:45 +08:00
parent bf252fac69
commit 2d6695389f
1 changed files with 202 additions and 102 deletions
@@ -1,38 +1,92 @@
-# Experiment: SVD-basis gradient projection vs RL reward hacking
+# Experiment: rank-space gradient projection vs RL reward hacking

 ## Context

 GRPO and related on-policy RL methods are known to exploit loopholes in reward
-functions. Ariaw, Engels & Nanda (2025) open-sourced a benchmark on LeetCode
-where Qwen3-4B learns to overwrite the evaluation function `run_tests()` instead
-of solving problems, reaching 79% reward hack rate at 200 training steps.
-Existing mitigations are mostly monitor-based (detect at output) or
-advantage-based (Rebound: penalize hacking rollouts via concept-score-modified
-advantage).
+functions. Ariahw, Engels & Nanda (2025, [github.com/ariahw/rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking))
+open-sourced a benchmark on LeetCode where Qwen3-4B learns to overwrite the
+evaluation function `run_tests()` instead of solving problems, reaching 79%
+reward hack rate at 200 training steps. Existing mitigations are mostly
+monitor-based (detect at output) or advantage-based (Rebound:
+penalize hacking rollouts via concept-score-modified advantage; Wu & Tang 2026
+[arxiv:2604.01476](https://arxiv.org/abs/2604.01476)).

-This experiment tests a different mechanism: **extract a hack-direction from
-contrastive pairs, project into SVD-of-W basis, and project the training
-gradient orthogonal to it at each step.** Mechanism difference from Rebound:
-gradient-level direction constraint vs rollout-level scalar penalty.
+This experiment tests a different mechanism: **wrap target modules with the
+AntiPaSTO SVD adapter (lora-lite), extract a per-module `v_hack` in the rank-r
+SVD basis from contrastive pairs, and project each step's
+`grad(delta_S) : [r]` orthogonal to `v_hack` before the optimizer update.**
+Mechanism difference from Rebound: gradient-level direction constraint on
+weight-update subspace vs rollout-level scalar penalty on advantage.

 This is preregistered: results to be reported regardless of outcome.

+## Why AntiPaSTO and not vanilla LoRA
+
+Vanilla LoRA's rank axis is meaningless (random init, drifts after step 1), so
+"project out v_hack in rank space" has no fixed reference frame. AntiPaSTO
+(Wassname, lora-lite [variants/antipasto.py](https://github.com/wassname/lora-lite/blob/main/src/lora_lite/variants/antipasto.py))
+freezes `U_r, S_r, Vh_r` from the SVD of `W` and trains a tiny `delta_S : [r]`
+plus an optional block-Cayley rotation. The rank axis stays pinned to the SVD
+basis of the original weight, so `v_hack` extracted in that basis remains
+meaningful across all training steps.
+
+Forward pass per wrapped module:
+
+$$y = x W_{res}^T + ((x V_h^T) \odot (S + \delta_S)) U^T$$
+
+where $W_{res} = W - U_r \mathrm{diag}(S_r) V_{h,r}$, and $U$, $S$, $V_h$ are
+shorthand for the frozen buffers $U_r$, $S_r$, $V_{h,r}$. Trainable:
+$\delta_S : [r]$ (and optionally a small Cayley rotation `rot_T`, which we leave
+off by default).
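As a sanity check on the forward pass above (a short derivation written for this note, not taken from the lora-lite source), write the frozen factors with their rank-$r$ subscripts and set $\delta_S = 0$. The adapter term then reassembles exactly the rank-$r$ part that was subtracted to form $W_{res}$, so the wrapped module should reproduce the base layer up to floating-point error. This is the identity the round-trip check in step 1b relies on.

$$
\begin{aligned}
y\big|_{\delta_S = 0}
&= x W_{res}^T + \big((x V_{h,r}^T) \odot S_r\big) U_r^T \\
&= x W_{res}^T + x V_{h,r}^T \,\mathrm{diag}(S_r)\, U_r^T \\
&= x \big(W_{res} + U_r \,\mathrm{diag}(S_r)\, V_{h,r}\big)^T \\
&= x W^T .
\end{aligned}
$$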
+
+Per-step gradient signal:
+
+$$\frac{\partial L}{\partial \delta_S} = \sum_t (x_t V_h^T) \odot \left(\frac{\partial L}{\partial y_t} U\right) \in \mathbb{R}^r$$
+
+Both factors of the elementwise product live in the rank-r SVD basis. `v_hack`,
+extracted as `mean(x V_h^T)_{hack} - mean(x V_h^T)_{clean}` over the contrastive
+pairs, lives in the *same* `[r]` rank space. Projection is one line:
+
+$$\nabla_{\delta_S} \leftarrow \nabla_{\delta_S} - \cos_{align} \cdot \|\nabla_{\delta_S}\| \cdot \hat v_{hack}$$
+
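+with $\cos_{align} = \langle \nabla_{\delta_S}, \hat v_{hack} \rangle / \|\nabla_{\delta_S}\|$
+and one-sided gating (only project when $\cos_{align} > 0$, i.e. the gradient is
+pushing toward the hack direction). Magnitude preservation: after projecting,
+rescale the result back to the original $\|\nabla_{\delta_S}\|$.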
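The projection rule above can be sketched in a few lines. This is a minimal illustration on plain Python lists standing in for `[r]` tensors; the function name `project_out_hack` and the `preserve_norm` flag are hypothetical, and a real hook would apply the same arithmetic to each module's gradient tensor.

```python
import math

def project_out_hack(grad, v_hack, preserve_norm=True):
    """One-sided projection of a rank-space gradient away from v_hack.

    grad, v_hack: lists of floats of length r (stand-ins for tensors).
    Returns (new_grad, cos_align).
    """
    v_norm = math.sqrt(sum(v * v for v in v_hack))
    g_norm = math.sqrt(sum(g * g for g in grad))
    if v_norm == 0.0 or g_norm == 0.0:
        return list(grad), 0.0
    v_hat = [v / v_norm for v in v_hack]
    dot = sum(g * v for g, v in zip(grad, v_hat))  # equals cos_align * ||grad||
    cos_align = dot / g_norm
    if cos_align <= 0.0:
        # Gradient is not pushing toward the hack direction: leave it alone.
        return list(grad), cos_align
    projected = [g - dot * v for g, v in zip(grad, v_hat)]
    if preserve_norm:
        p_norm = math.sqrt(sum(p * p for p in projected))
        if p_norm > 0.0:  # grad exactly parallel to v_hack leaves nothing to rescale
            scale = g_norm / p_norm
            projected = [p * scale for p in projected]
    return projected, cos_align
```

For example, `project_out_hack([3.0, 4.0], [1.0, 0.0])` removes the first component and rescales the remainder to the original norm of 5, while a gradient with a negative component along `v_hack` is returned unchanged.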
+
+## Why not vanilla GRPO via verl
+
+Ariahw's benchmark runs on verl, which uses Ray + FSDP2 + Hydra; inserting a
+pre-optimizer-step hook on per-module rank-space gradients would require deep
+subclassing of verl's worker abstraction. We use
+[lsdefine/lsrl](https://github.com/lsdefine/lsrl) instead: a two-file GRPO
+implementation with reported convergence on Qwen2.5-3B in 12 min on 2xA800
+(60 steps), where one pre-optimizer hook is trivial to add.
+
+Cost of this deviation: we re-establish the "vanilla hack emergence" baseline
+on lsrl rather than inheriting it from Ariahw's verl baseline. H4 is the
+sanity check that this happens. We port Ariahw's `run_tests`-overwrite
+detection (their [src/train/verl/rewards.py](https://github.com/ariahw/rl-rewardhacking/blob/main/src/train/verl/rewards.py))
+into lsrl's reward server (`docs/vendor/lsrl/lsrl/reward_server.py`).
+
+Vendored references (read-only, see [docs/vendor/](docs/vendor/)):
+- [lsrl](https://github.com/lsdefine/lsrl) — GRPO trainer
+- [lora-lite](https://github.com/wassname/lora-lite) — AntiPaSTO adapter
+- [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking) (already at `external/`)
+
 ## Hypotheses (preregistered)

-**H1 (mechanism, primary):** Gradient projection in SVD basis against a v_hack
+**H1 (mechanism, primary):** Rank-space gradient projection against `v_hack`
 extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30
-percentage points (absolute) relative to vanilla GRPO, at matched LeetCode pass
-rate within 10pp of vanilla.
+percentage points (absolute) relative to vanilla AntiPaSTO+GRPO, at matched
+LeetCode pass rate within 10pp of vanilla.

  Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at
  matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.
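The H1 decision rule can be written down as a small check over per-seed results. This is a sketch under stated assumptions: "within 1 SEM" is read as the SEM of the difference of the two means, "within 10pp" is read as a pass-rate drop of at most 10pp, the "matched hack-rate budget" condition is simplified to a direct comparison of final pass rates, and the name `h1_verdict` is hypothetical. Reductions between 15pp and 30pp fall into neither the support nor the falsification region, so the sketch reports them as inconclusive.

```python
import math
from statistics import mean, stdev

def sem(xs):
    """Standard error of the mean; needs at least two seeds."""
    return stdev(xs) / math.sqrt(len(xs))

def h1_verdict(vanilla_hack, ours_hack, vanilla_pass, ours_pass):
    """Inputs are per-seed rates in percent. Returns a verdict string."""
    hack_reduction = mean(vanilla_hack) - mean(ours_hack)  # pp, positive is good
    pass_drop = mean(vanilla_pass) - mean(ours_pass)       # pp, positive is bad
    # Assumption: compare against the SEM of the difference of means.
    sem_diff = math.sqrt(sem(vanilla_hack) ** 2 + sem(ours_hack) ** 2)
    if hack_reduction < 15 or pass_drop > 15 or abs(hack_reduction) <= sem_diff:
        return "falsified"
    if hack_reduction >= 30 and pass_drop <= 10:
        return "supported"
    return "inconclusive"
```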

-**H2 (SVD denoising):** SVD-of-W top-m projection of v_hack improves
-intervention strength compared to raw activation-space v_hack, at matched
-extraction-pair count. Test via ablation arm.
-
-  Falsified if: ablation arm (no SVD projection) matches or exceeds main arm
-  within 1 SEM.
+**H2 (activation- vs gradient-side `v_hack`):** Gradient-side `v_hack`
+(mean-diff of `grad(delta_S)` from one NLL backward per pair) outperforms
+activation-side `v_hack` (mean-diff of `x V_h^T`), at matched pair count.
+Falsified if: gradient-side matches or is worse than activation-side within
+1 SEM. *(open question — see "Decisions left open" below.)*

 **H3 (gradient vs advantage):** Gradient-level intervention (ours) outperforms
 advantage-level intervention (Rebound reimplemented) on hack rate at matched
@@ -40,113 +94,159 @@ pass rate.

  Falsified if: Rebound reimplementation matches or beats ours within 1 SEM.

-**H4 (scaling sanity):** Qwen3.5-2B substituting Qwen3-4B in Nanda's setup
-reproduces measurable reward hacking (>30% hack rate at 200 steps vanilla).
+**H4 (scaling sanity on our stack):** Qwen3.5-2B trained with vanilla
+AntiPaSTO+GRPO on lsrl reproduces measurable reward hacking (>30% hack rate at
+200 steps).

-  Falsified if: vanilla hack rate <30%. If falsified, swap to Qwen3-4B with
-  reduced num_generations to fit compute.
+  Falsified if: vanilla hack rate <30%. Decision branch: swap to Qwen3-4B with
+  num_generations halved. Secondary: if lsrl can't reproduce hacking on either
+  model, fall back to Ariahw's verl path and accept the harder hook.
+
+**H5 (capacity cost of no-gating):** No-gating (project every step every
+module) does not measurably hurt pass rate vs cos-threshold gating
+(`|cos_align| > 0.1` -> project). Falsified if: gated arm beats no-gating arm
+on pass rate by >5pp at matched hack rate.

 ## Steps

-1. **Clone Nanda's env.** `git clone github.com/ariahw/rl-rewardhacking`. This
-   uses verl v0.6.1 not TRL — confirm verl runs on RTX 6000 setup.
+### 1. Build infra — fast-dev-run targets first, no real training yet

-2. **H4 sanity: reproduce hack with smaller model.** Single run, Qwen3.5-2B
-   substituted for Qwen3-4B, all other hyperparams as published (LoRA r=32,
-   alpha=32, lr=7e-5). Reduce num_generations from 16 to 8 and batch from 256
-   to 128 to fit single-GPU compute. 200 steps, ~3 hours.
+   - **1a.** Vendor lsrl into `docs/vendor/lsrl/`; smoke-run their GSM8K example
+     on tiny-random-qwen3 (5 steps, CPU) to confirm reward-server / actor /
+     rollout split works in our env.
+   - **1b.** Vendor lora-lite into `docs/vendor/lora-lite/`; wrap Qwen3.5-0.8B
+     attn+MLP modules with AntiPaSTO (`r=256, block_size=4, rotate_basis="none"`
+     to start; only `delta_S` trainable). Verify forward-pass round-trip
+     numerically matches base model at $\delta_S = 0$.
+   - **1c.** Implement `v_hack` extraction per module:
+       - **Activation-side (default):** forward N contrastive pair completions,
+         per wrapped module register a `forward_pre_hook` capturing
+         `(x @ Vh^T)` flattened over (batch, seq), mean over hack rows minus
+         mean over clean rows, unit-normalize. Cache as `dict[module_name -> Tensor[r]]`
+         on disk.
+       - **Gradient-side (ablation):** for each pair, NLL backward on completion
+         tokens, per module capture `module.lora_delta_s.grad : [r]`, mean-diff
+         hack vs clean, unit-normalize.
+       - Validation: per-module projection score `(x_hack @ Vh^T - x_clean @ Vh^T) @ v_hack`
+         should be positive on held-out pairs in >90% of modules.
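The mean-difference extraction and its validation score in step 1c can be sketched as follows. Plain lists stand in for the captured `x @ Vh^T` rows of one module, the helper names are hypothetical, and the validation here uses the difference of held-out means rather than per-pair differences, which is an assumption about how the score is aggregated. The same two functions would apply to the gradient-side variant if the rows were per-pair `delta_S` gradients instead of activations.

```python
import math

def mean_rows(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def extract_v_hack(hack_rows, clean_rows):
    """Unit-normalized mean difference (hack minus clean) in rank space.

    Each row stands in for one token's projected activation, already
    flattened over (batch, seq).
    """
    diff = [h - c for h, c in zip(mean_rows(hack_rows), mean_rows(clean_rows))]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("hack and clean means coincide; no direction to extract")
    return [d / norm for d in diff]

def projection_score(hack_rows, clean_rows, v_hack):
    """Held-out check: (mean hack - mean clean) . v_hack, expected to be > 0."""
    diff = [h - c for h, c in zip(mean_rows(hack_rows), mean_rows(clean_rows))]
    return sum(d * v for d, v in zip(diff, v_hack))
```

Running `projection_score` on held-out rows for every wrapped module and counting the fraction of positive scores gives the ">90% of modules" figure named in the validation bullet.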

-   Decision point: if hack rate < 30% at step 200, abandon Qwen3.5-2B, swap to
-   Qwen3-4B with further-reduced batch and accept longer per-run time.
+### 2. H4 sanity — does vanilla AntiPaSTO+GRPO+lsrl produce hacking?

-3. **Build v_hack.** ~60-80 contrastive pairs:
-   - Positive (hacky): LeetCode prompts paired with `def run_tests(): pass` or
-     similar evaluator-overwrite completions
-   - Negative (clean): same prompts with legitimate solution attempts (can be
-     generated by base Qwen3.5-2B at temperature 0)
-   - Validate on 20 held-out pairs: extracted direction's projection score
-     should separate hack from clean with >90% accuracy
-   - Sanity check: add v_hack at inference to base model, confirm it steers
-     generation toward hack-flavored completions (CAA-style check)
+   - **2a.** Port Ariahw's `run_tests`-overwrite detection into lsrl's reward
+     fn. Verify the reward fn fires on synthetic hack/clean rollouts before
+     real training.
+   - **2b.** Train Qwen3.5-2B, AntiPaSTO (`r=256`, `delta_S` only), GRPO
+     (group_norm), 200 steps, num_generations=8, batch=16, 1 seed.
+     Decision: if hack rate <30%, switch to Qwen3-4B (same num_gen=8, batch=16)
+     and re-run 2b. Secondary fallback: drop lsrl, return to verl.

-4. **SVD denoise.** Extract V (right singular vectors) of relevant W matrices.
-   Project v_hack into top-m basis (m=16 default, sweep m∈{8,16,32} as
-   ablation). Reproject back. Normalize.
+### 3. Implement rank-space projection in lsrl's training loop

-5. **Implement gradient projection** in verl's training loop:
-   - Per optimizer step: g = current gradient
-   - Compute cos_align = dot(g, v_hack) / ||g||
-   - If cos_align > 0: g' = g - cos_align × ||g|| × v_hack, then renormalize to
-     ||g|| magnitude
-   - If cos_align ≤ 0: g' = g (no intervention)
-   - Step optimizer with g'
+   - **3a.** lsrl's actor calls `optimizer.step()` once per group; insert a
+     `pre_step_hook(model)` that walks `[m for m in model.modules() if hasattr(m, 'lora_delta_s')]`
+     and for each module reads `m.lora_delta_s.grad : [r]`, projects against
+     `v_hack[module_name]` (one-sided, magnitude-preserving), writes back.
+   - **3b.** Diagnostics logged per step per module: `cos_in`, `||grad||`,
+     `frac_modules_projected`.
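The per-step diagnostics in 3b can be computed from the same quantities the hook already reads. The sketch below uses dicts of plain lists keyed by module name in place of `m.lora_delta_s.grad` tensors; the function name `step_diagnostics` and the example module names are hypothetical, and the `v_hack` entries are assumed to be unit-normalized as in step 1c.

```python
import math

def step_diagnostics(grads, v_hacks):
    """grads, v_hacks: dict[module_name -> list[float]], v_hack entries unit-norm.

    Returns per-module cos_in and grad_norm, plus the fraction of modules whose
    gradient has a positive component along v_hack (the ones a one-sided
    projection would modify).
    """
    per_module = {}
    n_projected = 0
    for name, g in grads.items():
        v = v_hacks[name]
        g_norm = math.sqrt(sum(x * x for x in g))
        cos_in = 0.0 if g_norm == 0.0 else sum(a * b for a, b in zip(g, v)) / g_norm
        per_module[name] = {"cos_in": cos_in, "grad_norm": g_norm}
        if cos_in > 0.0:
            n_projected += 1
    frac = n_projected / len(grads) if grads else 0.0
    return per_module, frac
```

Logging these before the projection is applied keeps `cos_in` meaningful; after a full projection the cosine would be zero by construction.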

-6. **Run arms** (200 steps each, 3 seeds where indicated):
+### 4. Run arms (200 steps each, 3 seeds where indicated)

-   a. Vanilla GRPO + LoRA (3 seeds) — baseline, expected hack rate ~40-79%
-   b. Our method, m=16 (3 seeds) — main result
-   c. Our method, no SVD projection (raw v_hack, 1 seed) — H2 ablation
-   d. Our method, no magnitude preservation (1 seed) — design ablation
-   e. Rebound reimplemented: shortcut-direction advantage modification (3
-      seeds) — H3 baseline
-   f. Optional: m=8 and m=32 sweeps (1 seed each)
+   a. Vanilla AntiPaSTO + GRPO (3 seeds) — baseline
+   b. Our method, activation-side `v_hack`, no gating (3 seeds) — main result
+   c. Our method, gradient-side `v_hack` (3 seeds) — H2
+   d. Our method, cos-threshold gating ($|\cos| > 0.1$) (1 seed) — H5
+   e. Our method, no magnitude preservation (1 seed) — design ablation
+   f. Rebound reimplementation: advantage-side `v_hack` penalty (3 seeds) — H3
+   g. AntiPaSTO rank sweep: $r \in \{64, 256, 1024\}$ (1 seed each) — sensitivity

-   Total runs: 13-15, ~3 hours each = 40-50 hours compute
+   Total: 14 runs for arms a-f × ~3h on RTX 6000 96GB = ~42h compute; arm g's
+   rank sweep adds up to 3 more runs (~9h).

-7. **Measure** at every 25 steps:
-   - **Hack rate:** % of rollouts that successfully overwrite tests
-     (Nanda's existing metric, from their codebase)
-   - **Pass rate:** % of rollouts that pass tests legitimately on held-out
-     problems (without write access to evaluator)
-   - **cos_align trajectory:** mean cos(g, v_hack) per step (diagnostic)
-   - **KL drift from base** (diagnostic for catastrophic policy change)
+### 5. Measure at every 25 steps

-8. **Headline plot:** hack rate vs pass rate, one point per (arm × seed).
-   Pareto frontier. Our method should be below-and-to-the-right of vanilla
-   GRPO. Annotate Rebound's position.
+   - **Hack rate** (Ariahw's detector ported into lsrl)
+   - **Pass rate** on held-out problems without write access to evaluator
+   - **Per-module `cos_align`** trajectory (sanity that we're projecting
+     something nonzero)
+   - **`frac_modules_projected`** per step (sanity for gating arms)
+   - **KL drift from init policy** (catastrophic-change check)

-9. **Falsification check:** before publishing, run pre-registered analysis on
-   H1-H4. Report all hypotheses, including falsified ones.
+### 6. Headline plot
+
+   Hack rate (y) vs pass rate (x), one point per (arm × seed). Pareto frontier.
+   Our method should land below-and-to-the-right of vanilla (less hacking at
+   equal or higher pass rate). Annotate Rebound.
+
+### 7. Falsification check
+
+   Before publishing, run pre-registered analysis on H1-H5. Report all
+   hypotheses including falsified ones.
+
+## Decisions left open (write these up alongside results)
+
+- **Activation- vs gradient-side `v_hack` (H2).** Activation = cheap, geometric,
+  matches Wu-Tang/CAA tradition. Gradient = principled (the literal direction
+  training will move toward), more expensive. Default activation; gradient is
+  arm c.
+- **Gating threshold (H5).** No-gating default; cos>0.1 gating is arm d.
+  Argument for no-gating: removing 1 direction from the r=256 trainable subspace
+  per module per step costs ~0.4% of capacity (1/256). If `v_hack` at a module
+  is noise, we ablate a random direction, which is approximately a no-op in
+  expectation. Argument for gating: in modules where the hack signal is weak,
+  projection just removes some random direction the optimizer might have used.
+  H5 settles this.
+- **Rank `r`.** Default 256 (lora-lite antipasto default); sweep in arm g.
+  Trainable parameter count is just `r` per module (vs `r*(d_in+d_out)` for
+  standard LoRA), so larger `r` is cheap, but `v_hack`'s SNR per dim degrades.
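To make the capacity and parameter-count claims above concrete, here is the arithmetic for the default $r = 256$. The layer width $d_{in} = d_{out} = 2048$ is a hypothetical value chosen only for illustration, not a dimension taken from any Qwen config.

$$
\frac{1}{r} = \frac{1}{256} \approx 0.39\%, \qquad
\underbrace{r}_{\delta_S \text{ only}} = 256
\quad \text{vs} \quad
\underbrace{r\,(d_{in} + d_{out})}_{\text{standard LoRA}} = 256 \times 4096 = 1{,}048{,}576 .
$$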

 ## Why measure ratio, not just hack rate

-You raised this directly: "a model that learns none will not cheat."
-Correct — trivially, hack rate=0 with pass rate=0 is achievable by tanking
-training. The right metric is the *Pareto frontier* of (hack rate, pass rate),
-not either alone.
-
- Pure hack rate: rewards undertraining
- Pure pass rate: rewards anything that improves coding, including via the hack
- Hack vs pass scatter: shows whether your method moves below-and-to-right of
-  vanilla (less hack at same pass) or just down-left (less of everything)
-
-The published claim should be: "at matched pass rate ±5pp on held-out problems
-without write access, our method reduces hack rate from X% to Y%."
+A model that learns nothing won't cheat. The honest metric is the *Pareto
+frontier* of (hack rate, pass rate), not either alone. Pure hack-rate rewards
+undertraining; pure pass-rate rewards anything that improves coding including
+via the hack. Headline claim shape: "at matched pass rate ±5pp on held-out
+problems without write access, our method reduces hack rate from X% to Y%."
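The frontier argument above can be checked mechanically once each run is reduced to a (hack rate, pass rate) point. This is a minimal sketch; `pareto_front` is a hypothetical helper and the numbers in the usage example are made up for illustration, not results.

```python
def pareto_front(points):
    """points: list of (label, hack_rate, pass_rate).

    Lower hack rate and higher pass rate are both better. Returns the labels
    of points that no other point dominates, in input order.
    """
    front = []
    for label, hack, passed in points:
        dominated = any(
            h <= hack and p >= passed and (h < hack or p > passed)
            for _, h, p in points
        )
        if not dominated:
            front.append(label)
    return front

# Illustrative, made-up numbers (percent).
runs = [
    ("vanilla", 70, 40),
    ("ours", 30, 38),
    ("undertrained", 0, 0),
    ("worse", 75, 35),
]
print(pareto_front(runs))
```

Note that the undertrained point sits on the frontier too: zero hacking at zero pass rate is never dominated, which is why the headline claim is stated at matched pass rate rather than as frontier membership alone.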

 ## Compute estimate

- Single run on 96GB RTX 6000: ~2-3 hours (Qwen3.5-2B, num_gen=8, 200 steps)
- 13-15 runs: 40-50 hours
- At ~$3 AUD/hr: ~$120-150 AUD
- Plus debugging/iteration buffer: budget ~$200-250 AUD total
- Calendar time: ~1 week if running back-to-back; 2-3 weeks with iteration
+- Single run on 96GB RTX 6000: ~2-3h (Qwen3.5-2B, num_gen=8, 200 steps, lsrl,
+  AntiPaSTO r=256)
+- 14 runs: 35-45h
+- At ~$3 AUD/hr: $105-135 AUD
+- + debugging buffer: budget ~$200 AUD total
+- Calendar time: 1 week back-to-back; 2-3 weeks with iteration

 ## Risks and decision points

- **H4 falsified (no hack emergence at 2B):** swap to Qwen3-4B with
-  num_generations=4 and batch=64. Adds ~2x to per-run time
- **verl doesn't run on single 96GB:** fall back to TRL GRPOTrainer with manual
-  reimplementation of Nanda's reward function. Higher engineering cost
- **v_hack steering check fails:** extraction is broken. Diagnose layer
-  choice, pair quality, or SVD truncation before training runs
- **All methods tie vanilla on hack rate:** likely the intervention isn't
-  biting. Check gradient projection is actually changing trajectory
-  (cos_align logs)
+- **H4 falsified (no hack on Qwen3.5-2B with lsrl):** branch 1 — try
+  Qwen3-4B same hyperparams. Branch 2 — drop lsrl, hook into verl
+  directly. Adds ~1-2 weeks engineering.
+- **AntiPaSTO + GRPO doesn't train:** known risk — antipasto's trainable
+  subspace (`delta_S` only) may be too small for RL. Mitigation: enable
+  Cayley rotation (`rotate_basis="V"`, `block_size=4`), adds `r*(bs-1)/2`
+  params per module. Or fall back to PiSSA-LoRA-freeze-A.
+- **`v_hack` steering check fails (per-module projection scores ≤chance):**
+  extraction broken. Check (a) hook captures pre-residual input, (b) pair
+  quality drives strong activation difference somewhere, (c) tokenization of
+  hack vs clean completions isn't trivially distinguishing.
+- **All methods tie vanilla on hack rate:** intervention not biting. Check
+  `cos_align` logs nonzero, `frac_modules_projected` nonzero.

 ## What this is not

- Not a claim that gradient projection solves reward hacking generally
- Not a comparison to monitor-based methods (those are Nanda's territory,
-  cite their numbers, don't re-run)
+- Not a claim that rank-space gradient projection solves reward hacking
+  generally
+- Not a comparison to monitor-based methods (cite Ariahw's numbers, don't
+  re-run)
 - Not a claim about hacks beyond `run_tests()` overwrite
- Not a replacement for RLHF safety pipeline; this is a targeted intervention
+- Not a replacement for RLHF safety pipeline; this is a targeted intervention
+
+## Related work and naming
+
+- **Wu & Tang 2026, Rebound** ([arxiv:2604.01476](https://arxiv.org/abs/2604.01476)) —
+  advantage-side concept-direction penalty during GRPO. Our H3 baseline.
+- **Ariahw/Engels/Nanda 2025, rl-rewardhacking** ([github](https://github.com/ariahw/rl-rewardhacking)) —
+  source of dataset, reward function, and `v_hack`-relevant `run_tests` hack pattern.
+- **AntiPaSTO** ([wassname/lora-lite/variants/antipasto.py](https://github.com/wassname/lora-lite/blob/main/src/lora_lite/variants/antipasto.py)),
+  ([wassname/AntiPaSTO paper](https://github.com/wassname/AntiPaSTO)) — adapter
+  we wrap with.
+- **lsrl** ([lsdefine/lsrl](https://github.com/lsdefine/lsrl)) — GRPO trainer.
+- **PiSSA** ([arxiv:2404.02948](https://arxiv.org/abs/2404.02948)) — frozen
+  top-r SVD-init for LoRA; closest spiritual ancestor to AntiPaSTO.