refined spec

- vec in grad space
- SVD first
- lsrl for simple_GRPO
This commit is contained in:
wassname
2026-05-23 12:32:45 +08:00
parent bf252fac69
commit 2d6695389f
+202 -102
View File
@@ -1,38 +1,92 @@
# Experiment: SVD-basis gradient projection vs RL reward hacking
# Experiment: rank-space gradient projection vs RL reward hacking
## Context
GRPO and related on-policy RL methods are known to exploit loopholes in reward
functions. Ariaw, Engels & Nanda (2025) open-sourced a benchmark on LeetCode
where Qwen3-4B learns to overwrite the evaluation function `run_tests()` instead
of solving problems, reaching 79% reward hack rate at 200 training steps.
Existing mitigations are mostly monitor-based (detect at output) or
advantage-based (Rebound: penalize hacking rollouts via concept-score-modified
advantage).
functions. Ariahw, Engels & Nanda (2025, [github.com/ariahw/rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking))
open-sourced a benchmark on LeetCode where Qwen3-4B learns to overwrite the
evaluation function `run_tests()` instead of solving problems, reaching 79%
reward hack rate at 200 training steps. Existing mitigations are mostly
monitor-based (detect at output) or advantage-based (Rebound:
penalize hacking rollouts via concept-score-modified advantage; Wu & Tang 2026
[arxiv:2604.01476](https://arxiv.org/abs/2604.01476)).
This experiment tests a different mechanism: **extract a hack-direction from
contrastive pairs, project into SVD-of-W basis, and project the training
gradient orthogonal to it at each step.** Mechanism difference from Rebound:
gradient-level direction constraint vs rollout-level scalar penalty.
This experiment tests a different mechanism: **wrap target modules with the
AntiPaSTO SVD adapter (lora-lite), extract a per-module `v_hack` in the rank-r
SVD basis from contrastive pairs, and project each step's
`grad(delta_S) : [r]` orthogonal to `v_hack` before the optimizer update.**
Mechanism difference from Rebound: gradient-level direction constraint on
weight-update subspace vs rollout-level scalar penalty on advantage.
This is preregistered: results to be reported regardless of outcome.
## Why AntiPaSTO and not vanilla LoRA
Vanilla LoRA's rank axis is meaningless (random init, drifts after step 1), so
"project out v_hack in rank space" has no fixed reference frame. AntiPaSTO
(Wassname, lora-lite [variants/antipasto.py](https://github.com/wassname/lora-lite/blob/main/src/lora_lite/variants/antipasto.py))
freezes `U_r, S_r, Vh_r` from the SVD of `W` and trains a tiny `delta_S : [r]`
plus an optional block-Cayley rotation. The rank axis stays pinned to the SVD
basis of the original weight, so `v_hack` extracted in that basis remains
meaningful across all training steps.
Forward pass per wrapped module:
$$y = x W_{res}^T + ((x V_h^T) \odot (S + \delta_S)) U^T$$
where $W_{res} = W - U_r \mathrm{diag}(S_r) V_{h,r}$, and $U_r$, $S_r$, $V_{h,r}$
are buffers (frozen). Trainable: $\delta_S : [r]$ (and optionally a small Cayley
rotation `rot_T` we leave off by default).
Per-step gradient signal:
$$\frac{\partial L}{\partial \delta_S} = \sum_t (x_t V_h^T) \odot \left(\frac{\partial L}{\partial h_t} U\right) \in \mathbb{R}^r$$
Both factors of the elementwise product live in rank-r SVD basis. v_hack
extracted as `mean_pairs(x V_h^T)_{hack} - mean(x V_h^T)_{clean}` lives in the
*same* `[r]` rank space. Projection is one line:
$$\nabla_{\delta_S} \leftarrow \nabla_{\delta_S} - \cos_{align} \cdot \|\nabla_{\delta_S}\| \cdot \hat v_{hack}$$
with one-sided gating (only project when $\cos_{align} > 0$, i.e. the gradient is
pushing toward the hack direction). Magnitude preservation = renormalize back
to original $\|\nabla_{\delta_S}\|$.
## Why not vanilla GRPO via verl
verl is Ariahw's framework but uses Ray + FSDP2 + Hydra; inserting a
pre-optimizer-step hook on per-module rank-space gradients requires deep
subclassing of their worker abstraction. We pay one cost in exchange:
we use [lsdefine/lsrl](https://github.com/lsdefine/lsrl) instead. lsrl is a
two-file GRPO implementation with reported convergence on Qwen2.5-3B in 12m on
2xA800 (60 steps). One pre-optimizer hook is trivial to add.
Cost of this deviation: we re-establish the "vanilla hack emergence" baseline
on lsrl rather than inheriting it from Ariahw's verl baseline. H4 is the
sanity check that this happens. We port Ariahw's `run_tests`-overwrite
detection (their [src/train/verl/rewards.py](https://github.com/ariahw/rl-rewardhacking/blob/main/src/train/verl/rewards.py))
into lsrl's reward server (`docs/vendor/lsrl/lsrl/reward_server.py`).
Vendored references (read-only, see [docs/vendor/](docs/vendor/)):
- [lsrl](https://github.com/lsdefine/lsrl) — GRPO trainer
- [lora-lite](https://github.com/wassname/lora-lite) — AntiPaSTO adapter
- [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking) (already at `external/`)
## Hypotheses (preregistered)
**H1 (mechanism, primary):** Gradient projection in SVD basis against a v_hack
**H1 (mechanism, primary):** Rank-space gradient projection against `v_hack`
extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30
percentage points (absolute) relative to vanilla GRPO, at matched LeetCode pass
rate within 10pp of vanilla.
percentage points (absolute) relative to vanilla AntiPaSTO+GRPO, at matched
LeetCode pass rate within 10pp of vanilla.
Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at
matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.
**H2 (SVD denoising):** SVD-of-W top-m projection of v_hack improves
intervention strength compared to raw activation-space v_hack, at matched
extraction-pair count. Test via ablation arm.
Falsified if: ablation arm (no SVD projection) matches or exceeds main arm
within 1 SEM.
**H2 (activation- vs gradient-side `v_hack`):** Gradient-side `v_hack`
(mean-diff of `grad(delta_S)` from one NLL backward per pair) outperforms
activation-side `v_hack` (mean-diff of `x V_h^T`), at matched pair count.
Falsified if: gradient-side matches or is worse than activation-side within
1 SEM. *(open question — see "Decisions left open" below.)*
**H3 (gradient vs advantage):** Gradient-level intervention (ours) outperforms
advantage-level intervention (Rebound reimplemented) on hack rate at matched
@@ -40,113 +94,159 @@ pass rate.
Falsified if: Rebound reimplementation matches or beats ours within 1 SEM.
**H4 (scaling sanity):** Qwen3.5-2B substituting Qwen3-4B in Nanda's setup
reproduces measurable reward hacking (>30% hack rate at 200 steps vanilla).
**H4 (scaling sanity on our stack):** Qwen3.5-2B trained with vanilla
AntiPaSTO+GRPO on lsrl reproduces measurable reward hacking (>30% hack rate at
200 steps).
Falsified if: vanilla hack rate <30%. If falsified, swap to Qwen3-4B with
reduced num_generations to fit compute.
Falsified if: vanilla hack rate <30%. Decision branch: swap to Qwen3-4B with
num_generations halved. Secondary: if lsrl can't reproduce hacking on either
model, fall back to Ariahw's verl path and accept the harder hook.
**H5 (capacity cost of no-gating):** No-gating (project every step every
module) does not measurably hurt pass rate vs cos-threshold gating
(`|cos_align| > 0.1` -> project). Falsified if: gated arm beats no-gating arm
on pass rate by >5pp at matched hack rate.
## Steps
1. **Clone Nanda's env.** `git clone github.com/ariahw/rl-rewardhacking`. This
uses verl v0.6.1 not TRL — confirm verl runs on RTX 6000 setup.
### 1. Build infra — fast-dev-run targets first, no real training yet
2. **H4 sanity: reproduce hack with smaller model.** Single run, Qwen3.5-2B
substituted for Qwen3-4B, all other hyperparams as published (LoRA r=32,
alpha=32, lr=7e-5). Reduce num_generations from 16 to 8 and batch from 256
to 128 to fit single-GPU compute. 200 steps, ~3 hours.
- **1a.** Vendor lsrl into `docs/vendor/lsrl/`; smoke-run their GSM8K example
on tiny-random-qwen3 (5 steps, CPU) to confirm reward-server / actor /
rollout split works in our env.
- **1b.** Vendor lora-lite into `docs/vendor/lora-lite/`; wrap Qwen3.5-0.8B
attn+MLP modules with AntiPaSTO (`r=256, block_size=4, rotate_basis="none"`
to start; only `delta_S` trainable). Verify forward-pass round-trip
numerically matches base model at $\delta_S = 0$.
- **1c.** Implement `v_hack` extraction per module:
- **Activation-side (default):** forward N contrastive pair completions,
per wrapped module register a `forward_pre_hook` capturing
`(x @ Vh^T)` flattened over (batch, seq), mean over hack rows minus
mean over clean rows, unit-normalize. Cache as `dict[module_name -> Tensor[r]]`
on disk.
- **Gradient-side (ablation):** for each pair, NLL backward on completion
tokens, per module capture `module.lora_delta_s.grad : [r]`, mean-diff
hack vs clean, unit-normalize.
- Validation: per-module projection score `(x_hack @ Vh^T - x_clean @ Vh^T) @ v_hack`
should be positive on held-out pairs in >90% of modules.
Decision point: if hack rate < 30% at step 200, abandon Qwen3.5-2B, swap to
Qwen3-4B with further-reduced batch and accept longer per-run time.
### 2. H4 sanity — does vanilla AntiPaSTO+GRPO+lsrl produce hacking?
3. **Build v_hack.** ~60-80 contrastive pairs:
- Positive (hacky): LeetCode prompts paired with `def run_tests(): pass` or
similar evaluator-overwrite completions
- Negative (clean): same prompts with legitimate solution attempts (can be
generated by base Qwen3.5-2B at temperature 0)
- Validate on 20 held-out pairs: extracted direction's projection score
should separate hack from clean with >90% accuracy
- Sanity check: add v_hack at inference to base model, confirm it steers
generation toward hack-flavored completions (CAA-style check)
- **2a.** Port Ariahw's `run_tests`-overwrite detection into lsrl's reward
fn. Verify the reward fn fires on synthetic hack/clean rollouts before
real training.
- **2b.** Train Qwen3.5-2B, AntiPaSTO (`r=256`, `delta_S` only), GRPO
(group_norm), 200 steps, num_generations=8, batch=16, 1 seed.
Decision: if hack rate <30%, switch to Qwen3-4B (same num_gen=8, batch=16)
and re-run 2b. Secondary fallback: drop lsrl, return to verl.
4. **SVD denoise.** Extract V (right singular vectors) of relevant W matrices.
Project v_hack into top-m basis (m=16 default, sweep m∈{8,16,32} as
ablation). Reproject back. Normalize.
### 3. Implement rank-space projection in lsrl's training loop
5. **Implement gradient projection** in verl's training loop:
- Per optimizer step: g = current gradient
- Compute cos_align = dot(g, v_hack) / ||g||
- If cos_align > 0: g' = g - cos_align × ||g|| × v_hack, then renormalize to
||g|| magnitude
- If cos_align ≤ 0: g' = g (no intervention)
- Step optimizer with g'
- **3a.** lsrl's actor calls `optimizer.step()` once per group; insert a
`pre_step_hook(model)` that walks `[m for m in model.modules() if hasattr(m, 'lora_delta_s')]`
and for each module reads `m.lora_delta_s.grad : [r]`, projects against
`v_hack[module_name]` (one-sided, magnitude-preserving), writes back.
- **3b.** Diagnostics logged per step per module: `cos_in`, `||grad||`,
`frac_modules_projected`.
6. **Run arms** (200 steps each, 3 seeds where indicated):
### 4. Run arms (200 steps each, 3 seeds where indicated)
a. Vanilla GRPO + LoRA (3 seeds) — baseline, expected hack rate ~40-79%
b. Our method, m=16 (3 seeds) — main result
c. Our method, no SVD projection (raw v_hack, 1 seed) — H2 ablation
d. Our method, no magnitude preservation (1 seed) — design ablation
e. Rebound reimplemented: shortcut-direction advantage modification (3
seeds) — H3 baseline
f. Optional: m=8 and m=32 sweeps (1 seed each)
a. Vanilla AntiPaSTO + GRPO (3 seeds) — baseline
b. Our method, activation-side `v_hack`, no gating (3 seeds) — main result
c. Our method, gradient-side `v_hack` (3 seeds) — H2
d. Our method, cos-threshold gating ($|\cos| > 0.1$) (1 seed) — H5
e. Our method, no magnitude preservation (1 seed) — design ablation
f. Rebound reimplementation: advantage-side `v_hack` penalty (3 seeds) — H3
g. AntiPaSTO rank sweep: $r \in \{64, 256, 1024\}$ (1 seed each) — sensitivity
Total runs: 13-15, ~3 hours each = 40-50 hours compute
Total: 14 runs × ~3h on RTX 6000 96GB = ~42h compute.
7. **Measure** at every 25 steps:
- **Hack rate:** % of rollouts that successfully overwrite tests
(Nanda's existing metric, from their codebase)
- **Pass rate:** % of rollouts that pass tests legitimately on held-out
problems (without write access to evaluator)
- **cos_align trajectory:** mean cos(g, v_hack) per step (diagnostic)
- **KL drift from base** (diagnostic for catastrophic policy change)
### 5. Measure at every 25 steps
8. **Headline plot:** hack rate vs pass rate, one point per (arm × seed).
Pareto frontier. Our method should be below-and-to-the-right of vanilla
GRPO. Annotate Rebound's position.
- **Hack rate** (Ariahw's detector ported into lsrl)
- **Pass rate** on held-out problems without write access to evaluator
- **Per-module `cos_align`** trajectory (sanity that we're projecting
something nonzero)
- **`frac_modules_projected`** per step (sanity for gating arms)
- **KL drift from init policy** (catastrophic-change check)
9. **Falsification check:** before publishing, run pre-registered analysis on
H1-H4. Report all hypotheses, including falsified ones.
### 6. Headline plot
Hack rate vs pass rate, one point per (arm × seed). Pareto frontier. Our
method should land below-and-to-the-right of vanilla. Annotate Rebound.
### 7. Falsification check
Before publishing, run pre-registered analysis on H1-H5. Report all
hypotheses including falsified ones.
## Decisions left open (write these up alongside results)
- **Activation- vs gradient-side `v_hack` (H2).** Activation = cheap, geometric,
matches Wu-Tang/CAA tradition. Gradient = principled (the literal direction
training will move toward), more expensive. Default activation; gradient is
arm c.
- **Gating threshold (H5).** No-gating default; cos>0.1 gating is arm d.
Argument for no-gating: removing 1 direction from r=256 trainable subspace
per module per step is ~0.4% capacity. If `v_hack` at a module is noise, we
ablate a noise direction in expectation = approx no-op. Argument for gating:
in modules where hack signal is weak, projection just removes some random
direction the optimizer might have used. H5 settles this.
- **Rank `r`.** Default 256 (lora-lite antipasto default); sweep in arm g.
Trainable parameter count is just `r` per module (vs `r*(d_in+d_out)` for
standard LoRA), so larger `r` is cheap, but `v_hack`'s SNR per dim degrades.
## Why measure ratio, not just hack rate
You raised this directly: "a model that learns none will not cheat."
Correct — trivially, hack rate=0 with pass rate=0 is achievable by tanking
training. The right metric is the *Pareto frontier* of (hack rate, pass rate),
not either alone.
- Pure hack rate: rewards undertraining
- Pure pass rate: rewards anything that improves coding, including via the hack
- Hack vs pass scatter: shows whether your method moves below-and-to-right of
vanilla (less hack at same pass) or just down-left (less of everything)
The published claim should be: "at matched pass rate ±5pp on held-out problems
without write access, our method reduces hack rate from X% to Y%."
A model that learns nothing won't cheat. The honest metric is the *Pareto
frontier* of (hack rate, pass rate), not either alone. Pure hack-rate rewards
undertraining; pure pass-rate rewards anything that improves coding including
via the hack. Headline claim shape: "at matched pass rate ±5pp on held-out
problems without write access, our method reduces hack rate from X% to Y%."
## Compute estimate
- Single run on 96GB RTX 6000: ~2-3 hours (Qwen3.5-2B, num_gen=8, 200 steps)
- 13-15 runs: 40-50 hours
- At ~$3 AUD/hr: ~$120-150 AUD
- Plus debugging/iteration buffer: budget ~$200-250 AUD total
- Calendar time: ~1 week if running back-to-back; 2-3 weeks with iteration
- Single run on 96GB RTX 6000: ~2-3h (Qwen3.5-2B, num_gen=8, 200 steps, lsrl,
AntiPaSTO r=256)
- 14 runs: 35-45h
- At ~$3 AUD/hr: $105-135 AUD
- + debugging buffer: budget ~$200 AUD total
- Calendar time: 1 week back-to-back; 2-3 weeks with iteration
## Risks and decision points
- **H4 falsified (no hack emergence at 2B):** swap to Qwen3-4B with
num_generations=4 and batch=64. Adds ~2x to per-run time
- **verl doesn't run on single 96GB:** fall back to TRL GRPOTrainer with manual
reimplementation of Nanda's reward function. Higher engineering cost
- **v_hack steering check fails:** extraction is broken. Diagnose layer
choice, pair quality, or SVD truncation before training runs
- **All methods tie vanilla on hack rate:** likely the intervention isn't
biting. Check gradient projection is actually changing trajectory
(cos_align logs)
- **H4 falsified (no hack on Qwen3.5-2B with lsrl):** branch 1 — try
Qwen3-4B same hyperparams. Branch 2 — drop lsrl, hook into verl
directly. Adds ~1-2 weeks engineering.
- **AntiPaSTO + GRPO doesn't train:** known risk — antipasto's trainable
subspace (`delta_S` only) may be too small for RL. Mitigation: enable
Cayley rotation (`rotate_basis="V"`, `block_size=4`), adds `r*(bs-1)/2`
params per module. Or fall back to PiSSA-LoRA-freeze-A.
- **`v_hack` steering check fails (per-module projection scores ≤chance):**
extraction broken. Check (a) hook captures pre-residual input, (b) pair
quality drives strong activation difference somewhere, (c) tokenization of
hack vs clean completions isn't trivially distinguishing.
- **All methods tie vanilla on hack rate:** intervention not biting. Check
`cos_align` logs nonzero, `frac_modules_projected` nonzero.
## What this is not
- Not a claim that gradient projection solves reward hacking generally
- Not a comparison to monitor-based methods (those are Nanda's territory,
cite their numbers, don't re-run)
- Not a claim that rank-space gradient projection solves reward hacking
generally
- Not a comparison to monitor-based methods (cite Ariahw's numbers, don't
re-run)
- Not a claim about hacks beyond `run_tests()` overwrite
- Not a replacement for RLHF safety pipeline; this is a targeted intervention
- Not a replacement for RLHF safety pipeline; this is a targeted intervention
## Related work and naming
- **Wu & Tang 2026, Rebound** ([arxiv:2604.01476](https://arxiv.org/abs/2604.01476)) —
advantage-side concept-direction penalty during GRPO. Our H3 baseline.
- **Ariahw/Engels/Nanda 2025, rl-rewardhacking** ([github](https://github.com/ariahw/rl-rewardhacking)) —
source of dataset, reward function, and `v_hack`-relevant `run_tests` hack pattern.
- **AntiPaSTO** ([wassname/lora-lite/variants/antipasto.py](https://github.com/wassname/lora-lite/blob/main/src/lora_lite/variants/antipasto.py),
([wassname/AntiPaSTO paper](https://github.com/wassname/AntiPaSTO)) — adapter
we wrap with.
- **lsrl** ([lsdefine/lsrl](https://github.com/lsdefine/lsrl)) — GRPO trainer.
- **PiSSA** ([arxiv:2404.02948](https://arxiv.org/abs/2404.02948)) — frozen
top-r SVD-init for LoRA; closest spiritual ancestor to AntiPaSTO.