mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:30:30 +08:00
refined spec
- vec in grad space - SVD first - lsrl for simple_GRPO
This commit is contained in:
@@ -1,38 +1,92 @@
|
||||
# Experiment: SVD-basis gradient projection vs RL reward hacking
|
||||
# Experiment: rank-space gradient projection vs RL reward hacking
|
||||
|
||||
## Context
|
||||
|
||||
GRPO and related on-policy RL methods are known to exploit loopholes in reward
|
||||
functions. Ariaw, Engels & Nanda (2025) open-sourced a benchmark on LeetCode
|
||||
where Qwen3-4B learns to overwrite the evaluation function `run_tests()` instead
|
||||
of solving problems, reaching 79% reward hack rate at 200 training steps.
|
||||
Existing mitigations are mostly monitor-based (detect at output) or
|
||||
advantage-based (Rebound: penalize hacking rollouts via concept-score-modified
|
||||
advantage).
|
||||
functions. Ariahw, Engels & Nanda (2025, [github.com/ariahw/rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking))
|
||||
open-sourced a benchmark on LeetCode where Qwen3-4B learns to overwrite the
|
||||
evaluation function `run_tests()` instead of solving problems, reaching 79%
|
||||
reward hack rate at 200 training steps. Existing mitigations are mostly
|
||||
monitor-based (detect at output) or advantage-based (Rebound:
|
||||
penalize hacking rollouts via concept-score-modified advantage; Wu & Tang 2026
|
||||
[arxiv:2604.01476](https://arxiv.org/abs/2604.01476)).
|
||||
|
||||
This experiment tests a different mechanism: **extract a hack-direction from
|
||||
contrastive pairs, project into SVD-of-W basis, and project the training
|
||||
gradient orthogonal to it at each step.** Mechanism difference from Rebound:
|
||||
gradient-level direction constraint vs rollout-level scalar penalty.
|
||||
This experiment tests a different mechanism: **wrap target modules with the
|
||||
AntiPaSTO SVD adapter (lora-lite), extract a per-module `v_hack` in the rank-r
|
||||
SVD basis from contrastive pairs, and project each step's
|
||||
`grad(delta_S) : [r]` orthogonal to `v_hack` before the optimizer update.**
|
||||
Mechanism difference from Rebound: gradient-level direction constraint on
|
||||
weight-update subspace vs rollout-level scalar penalty on advantage.
|
||||
|
||||
This is preregistered: results to be reported regardless of outcome.
|
||||
|
||||
## Why AntiPaSTO and not vanilla LoRA
|
||||
|
||||
Vanilla LoRA's rank axis is meaningless (random init, drifts after step 1), so
|
||||
"project out v_hack in rank space" has no fixed reference frame. AntiPaSTO
|
||||
(Wassname, lora-lite [variants/antipasto.py](https://github.com/wassname/lora-lite/blob/main/src/lora_lite/variants/antipasto.py))
|
||||
freezes `U_r, S_r, Vh_r` from the SVD of `W` and trains a tiny `delta_S : [r]`
|
||||
plus an optional block-Cayley rotation. The rank axis stays pinned to the SVD
|
||||
basis of the original weight, so `v_hack` extracted in that basis remains
|
||||
meaningful across all training steps.
|
||||
|
||||
Forward pass per wrapped module:
|
||||
|
||||
$$y = x W_{res}^T + ((x V_h^T) \odot (S + \delta_S)) U^T$$
|
||||
|
||||
where $W_{res} = W - U_r \mathrm{diag}(S_r) V_{h,r}$, and $U_r$, $S_r$, $V_{h,r}$
|
||||
are buffers (frozen). Trainable: $\delta_S : [r]$ (and optionally a small Cayley
|
||||
rotation `rot_T` we leave off by default).
|
||||
|
||||
Per-step gradient signal:
|
||||
|
||||
$$\frac{\partial L}{\partial \delta_S} = \sum_t (x_t V_h^T) \odot \left(\frac{\partial L}{\partial h_t} U\right) \in \mathbb{R}^r$$
|
||||
|
||||
Both factors of the elementwise product live in rank-r SVD basis. v_hack
|
||||
extracted as `mean_pairs(x V_h^T)_{hack} - mean(x V_h^T)_{clean}` lives in the
|
||||
*same* `[r]` rank space. Projection is one line:
|
||||
|
||||
$$\nabla_{\delta_S} \leftarrow \nabla_{\delta_S} - \cos_{align} \cdot \|\nabla_{\delta_S}\| \cdot \hat v_{hack}$$
|
||||
|
||||
with one-sided gating (only project when $\cos_{align} > 0$, i.e. the gradient is
|
||||
pushing toward the hack direction). Magnitude preservation = renormalize back
|
||||
to original $\|\nabla_{\delta_S}\|$.
|
||||
|
||||
## Why not vanilla GRPO via verl
|
||||
|
||||
verl is Ariahw's framework but uses Ray + FSDP2 + Hydra; inserting a
|
||||
pre-optimizer-step hook on per-module rank-space gradients requires deep
|
||||
subclassing of their worker abstraction. We pay one cost in exchange:
|
||||
we use [lsdefine/lsrl](https://github.com/lsdefine/lsrl) instead. lsrl is a
|
||||
two-file GRPO implementation with reported convergence on Qwen2.5-3B in 12m on
|
||||
2xA800 (60 steps). One pre-optimizer hook is trivial to add.
|
||||
|
||||
Cost of this deviation: we re-establish the "vanilla hack emergence" baseline
|
||||
on lsrl rather than inheriting it from Ariahw's verl baseline. H4 is the
|
||||
sanity check that this happens. We port Ariahw's `run_tests`-overwrite
|
||||
detection (their [src/train/verl/rewards.py](https://github.com/ariahw/rl-rewardhacking/blob/main/src/train/verl/rewards.py))
|
||||
into lsrl's reward server (`docs/vendor/lsrl/lsrl/reward_server.py`).
|
||||
|
||||
Vendored references (read-only, see [docs/vendor/](docs/vendor/)):
|
||||
- [lsrl](https://github.com/lsdefine/lsrl) — GRPO trainer
|
||||
- [lora-lite](https://github.com/wassname/lora-lite) — AntiPaSTO adapter
|
||||
- [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking) (already at `external/`)
|
||||
|
||||
## Hypotheses (preregistered)
|
||||
|
||||
**H1 (mechanism, primary):** Gradient projection in SVD basis against a v_hack
|
||||
**H1 (mechanism, primary):** Rank-space gradient projection against `v_hack`
|
||||
extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30
|
||||
percentage points (absolute) relative to vanilla GRPO, at matched LeetCode pass
|
||||
rate within 10pp of vanilla.
|
||||
percentage points (absolute) relative to vanilla AntiPaSTO+GRPO, at matched
|
||||
LeetCode pass rate within 10pp of vanilla.
|
||||
|
||||
Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at
|
||||
matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.
|
||||
|
||||
**H2 (SVD denoising):** SVD-of-W top-m projection of v_hack improves
|
||||
intervention strength compared to raw activation-space v_hack, at matched
|
||||
extraction-pair count. Test via ablation arm.
|
||||
|
||||
Falsified if: ablation arm (no SVD projection) matches or exceeds main arm
|
||||
within 1 SEM.
|
||||
**H2 (activation- vs gradient-side `v_hack`):** Gradient-side `v_hack`
|
||||
(mean-diff of `grad(delta_S)` from one NLL backward per pair) outperforms
|
||||
activation-side `v_hack` (mean-diff of `x V_h^T`), at matched pair count.
|
||||
Falsified if: gradient-side matches or is worse than activation-side within
|
||||
1 SEM. *(open question — see "Decisions left open" below.)*
|
||||
|
||||
**H3 (gradient vs advantage):** Gradient-level intervention (ours) outperforms
|
||||
advantage-level intervention (Rebound reimplemented) on hack rate at matched
|
||||
@@ -40,113 +94,159 @@ pass rate.
|
||||
|
||||
Falsified if: Rebound reimplementation matches or beats ours within 1 SEM.
|
||||
|
||||
**H4 (scaling sanity):** Qwen3.5-2B substituting Qwen3-4B in Nanda's setup
|
||||
reproduces measurable reward hacking (>30% hack rate at 200 steps vanilla).
|
||||
**H4 (scaling sanity on our stack):** Qwen3.5-2B trained with vanilla
|
||||
AntiPaSTO+GRPO on lsrl reproduces measurable reward hacking (>30% hack rate at
|
||||
200 steps).
|
||||
|
||||
Falsified if: vanilla hack rate <30%. If falsified, swap to Qwen3-4B with
|
||||
reduced num_generations to fit compute.
|
||||
Falsified if: vanilla hack rate <30%. Decision branch: swap to Qwen3-4B with
|
||||
num_generations halved. Secondary: if lsrl can't reproduce hacking on either
|
||||
model, fall back to Ariahw's verl path and accept the harder hook.
|
||||
|
||||
**H5 (capacity cost of no-gating):** No-gating (project every step every
|
||||
module) does not measurably hurt pass rate vs cos-threshold gating
|
||||
(`|cos_align| > 0.1` -> project). Falsified if: gated arm beats no-gating arm
|
||||
on pass rate by >5pp at matched hack rate.
|
||||
|
||||
## Steps
|
||||
|
||||
1. **Clone Nanda's env.** `git clone github.com/ariahw/rl-rewardhacking`. This
|
||||
uses verl v0.6.1 not TRL — confirm verl runs on RTX 6000 setup.
|
||||
### 1. Build infra — fast-dev-run targets first, no real training yet
|
||||
|
||||
2. **H4 sanity: reproduce hack with smaller model.** Single run, Qwen3.5-2B
|
||||
substituted for Qwen3-4B, all other hyperparams as published (LoRA r=32,
|
||||
alpha=32, lr=7e-5). Reduce num_generations from 16 to 8 and batch from 256
|
||||
to 128 to fit single-GPU compute. 200 steps, ~3 hours.
|
||||
- **1a.** Vendor lsrl into `docs/vendor/lsrl/`; smoke-run their GSM8K example
|
||||
on tiny-random-qwen3 (5 steps, CPU) to confirm reward-server / actor /
|
||||
rollout split works in our env.
|
||||
- **1b.** Vendor lora-lite into `docs/vendor/lora-lite/`; wrap Qwen3.5-0.8B
|
||||
attn+MLP modules with AntiPaSTO (`r=256, block_size=4, rotate_basis="none"`
|
||||
to start; only `delta_S` trainable). Verify forward-pass round-trip
|
||||
numerically matches base model at $\delta_S = 0$.
|
||||
- **1c.** Implement `v_hack` extraction per module:
|
||||
- **Activation-side (default):** forward N contrastive pair completions,
|
||||
per wrapped module register a `forward_pre_hook` capturing
|
||||
`(x @ Vh^T)` flattened over (batch, seq), mean over hack rows minus
|
||||
mean over clean rows, unit-normalize. Cache as `dict[module_name -> Tensor[r]]`
|
||||
on disk.
|
||||
- **Gradient-side (ablation):** for each pair, NLL backward on completion
|
||||
tokens, per module capture `module.lora_delta_s.grad : [r]`, mean-diff
|
||||
hack vs clean, unit-normalize.
|
||||
- Validation: per-module projection score `(x_hack @ Vh^T - x_clean @ Vh^T) @ v_hack`
|
||||
should be positive on held-out pairs in >90% of modules.
|
||||
|
||||
Decision point: if hack rate < 30% at step 200, abandon Qwen3.5-2B, swap to
|
||||
Qwen3-4B with further-reduced batch and accept longer per-run time.
|
||||
### 2. H4 sanity — does vanilla AntiPaSTO+GRPO+lsrl produce hacking?
|
||||
|
||||
3. **Build v_hack.** ~60-80 contrastive pairs:
|
||||
- Positive (hacky): LeetCode prompts paired with `def run_tests(): pass` or
|
||||
similar evaluator-overwrite completions
|
||||
- Negative (clean): same prompts with legitimate solution attempts (can be
|
||||
generated by base Qwen3.5-2B at temperature 0)
|
||||
- Validate on 20 held-out pairs: extracted direction's projection score
|
||||
should separate hack from clean with >90% accuracy
|
||||
- Sanity check: add v_hack at inference to base model, confirm it steers
|
||||
generation toward hack-flavored completions (CAA-style check)
|
||||
- **2a.** Port Ariahw's `run_tests`-overwrite detection into lsrl's reward
|
||||
fn. Verify the reward fn fires on synthetic hack/clean rollouts before
|
||||
real training.
|
||||
- **2b.** Train Qwen3.5-2B, AntiPaSTO (`r=256`, `delta_S` only), GRPO
|
||||
(group_norm), 200 steps, num_generations=8, batch=16, 1 seed.
|
||||
Decision: if hack rate <30%, switch to Qwen3-4B (same num_gen=8, batch=16)
|
||||
and re-run 2b. Secondary fallback: drop lsrl, return to verl.
|
||||
|
||||
4. **SVD denoise.** Extract V (right singular vectors) of relevant W matrices.
|
||||
Project v_hack into top-m basis (m=16 default, sweep m∈{8,16,32} as
|
||||
ablation). Reproject back. Normalize.
|
||||
### 3. Implement rank-space projection in lsrl's training loop
|
||||
|
||||
5. **Implement gradient projection** in verl's training loop:
|
||||
- Per optimizer step: g = current gradient
|
||||
- Compute cos_align = dot(g, v_hack) / ||g||
|
||||
- If cos_align > 0: g' = g - cos_align × ||g|| × v_hack, then renormalize to
|
||||
||g|| magnitude
|
||||
- If cos_align ≤ 0: g' = g (no intervention)
|
||||
- Step optimizer with g'
|
||||
- **3a.** lsrl's actor calls `optimizer.step()` once per group; insert a
|
||||
`pre_step_hook(model)` that walks `[m for m in model.modules() if hasattr(m, 'lora_delta_s')]`
|
||||
and for each module reads `m.lora_delta_s.grad : [r]`, projects against
|
||||
`v_hack[module_name]` (one-sided, magnitude-preserving), writes back.
|
||||
- **3b.** Diagnostics logged per step per module: `cos_in`, `||grad||`,
|
||||
`frac_modules_projected`.
|
||||
|
||||
6. **Run arms** (200 steps each, 3 seeds where indicated):
|
||||
### 4. Run arms (200 steps each, 3 seeds where indicated)
|
||||
|
||||
a. Vanilla GRPO + LoRA (3 seeds) — baseline, expected hack rate ~40-79%
|
||||
b. Our method, m=16 (3 seeds) — main result
|
||||
c. Our method, no SVD projection (raw v_hack, 1 seed) — H2 ablation
|
||||
d. Our method, no magnitude preservation (1 seed) — design ablation
|
||||
e. Rebound reimplemented: shortcut-direction advantage modification (3
|
||||
seeds) — H3 baseline
|
||||
f. Optional: m=8 and m=32 sweeps (1 seed each)
|
||||
a. Vanilla AntiPaSTO + GRPO (3 seeds) — baseline
|
||||
b. Our method, activation-side `v_hack`, no gating (3 seeds) — main result
|
||||
c. Our method, gradient-side `v_hack` (3 seeds) — H2
|
||||
d. Our method, cos-threshold gating ($|\cos| > 0.1$) (1 seed) — H5
|
||||
e. Our method, no magnitude preservation (1 seed) — design ablation
|
||||
f. Rebound reimplementation: advantage-side `v_hack` penalty (3 seeds) — H3
|
||||
g. AntiPaSTO rank sweep: $r \in \{64, 256, 1024\}$ (1 seed each) — sensitivity
|
||||
|
||||
Total runs: 13-15, ~3 hours each = 40-50 hours compute
|
||||
Total: 14 runs × ~3h on RTX 6000 96GB = ~42h compute.
|
||||
|
||||
7. **Measure** at every 25 steps:
|
||||
- **Hack rate:** % of rollouts that successfully overwrite tests
|
||||
(Nanda's existing metric, from their codebase)
|
||||
- **Pass rate:** % of rollouts that pass tests legitimately on held-out
|
||||
problems (without write access to evaluator)
|
||||
- **cos_align trajectory:** mean cos(g, v_hack) per step (diagnostic)
|
||||
- **KL drift from base** (diagnostic for catastrophic policy change)
|
||||
### 5. Measure at every 25 steps
|
||||
|
||||
8. **Headline plot:** hack rate vs pass rate, one point per (arm × seed).
|
||||
Pareto frontier. Our method should be below-and-to-the-right of vanilla
|
||||
GRPO. Annotate Rebound's position.
|
||||
- **Hack rate** (Ariahw's detector ported into lsrl)
|
||||
- **Pass rate** on held-out problems without write access to evaluator
|
||||
- **Per-module `cos_align`** trajectory (sanity that we're projecting
|
||||
something nonzero)
|
||||
- **`frac_modules_projected`** per step (sanity for gating arms)
|
||||
- **KL drift from init policy** (catastrophic-change check)
|
||||
|
||||
9. **Falsification check:** before publishing, run pre-registered analysis on
|
||||
H1-H4. Report all hypotheses, including falsified ones.
|
||||
### 6. Headline plot
|
||||
|
||||
Hack rate vs pass rate, one point per (arm × seed). Pareto frontier. Our
|
||||
method should land below-and-to-the-right of vanilla. Annotate Rebound.
|
||||
|
||||
### 7. Falsification check
|
||||
|
||||
Before publishing, run pre-registered analysis on H1-H5. Report all
|
||||
hypotheses including falsified ones.
|
||||
|
||||
## Decisions left open (write these up alongside results)
|
||||
|
||||
- **Activation- vs gradient-side `v_hack` (H2).** Activation = cheap, geometric,
|
||||
matches Wu-Tang/CAA tradition. Gradient = principled (the literal direction
|
||||
training will move toward), more expensive. Default activation; gradient is
|
||||
arm c.
|
||||
- **Gating threshold (H5).** No-gating default; cos>0.1 gating is arm d.
|
||||
Argument for no-gating: removing 1 direction from r=256 trainable subspace
|
||||
per module per step is ~0.4% capacity. If `v_hack` at a module is noise, we
|
||||
ablate a noise direction in expectation = approx no-op. Argument for gating:
|
||||
in modules where hack signal is weak, projection just removes some random
|
||||
direction the optimizer might have used. H5 settles this.
|
||||
- **Rank `r`.** Default 256 (lora-lite antipasto default); sweep in arm g.
|
||||
Trainable parameter count is just `r` per module (vs `r*(d_in+d_out)` for
|
||||
standard LoRA), so larger `r` is cheap, but `v_hack`'s SNR per dim degrades.
|
||||
|
||||
## Why measure ratio, not just hack rate
|
||||
|
||||
You raised this directly: "a model that learns none will not cheat."
|
||||
Correct — trivially, hack rate=0 with pass rate=0 is achievable by tanking
|
||||
training. The right metric is the *Pareto frontier* of (hack rate, pass rate),
|
||||
not either alone.
|
||||
|
||||
- Pure hack rate: rewards undertraining
|
||||
- Pure pass rate: rewards anything that improves coding, including via the hack
|
||||
- Hack vs pass scatter: shows whether your method moves below-and-to-right of
|
||||
vanilla (less hack at same pass) or just down-left (less of everything)
|
||||
|
||||
The published claim should be: "at matched pass rate ±5pp on held-out problems
|
||||
without write access, our method reduces hack rate from X% to Y%."
|
||||
A model that learns nothing won't cheat. The honest metric is the *Pareto
|
||||
frontier* of (hack rate, pass rate), not either alone. Pure hack-rate rewards
|
||||
undertraining; pure pass-rate rewards anything that improves coding including
|
||||
via the hack. Headline claim shape: "at matched pass rate ±5pp on held-out
|
||||
problems without write access, our method reduces hack rate from X% to Y%."
|
||||
|
||||
## Compute estimate
|
||||
|
||||
- Single run on 96GB RTX 6000: ~2-3 hours (Qwen3.5-2B, num_gen=8, 200 steps)
|
||||
- 13-15 runs: 40-50 hours
|
||||
- At ~$3 AUD/hr: ~$120-150 AUD
|
||||
- Plus debugging/iteration buffer: budget ~$200-250 AUD total
|
||||
- Calendar time: ~1 week if running back-to-back; 2-3 weeks with iteration
|
||||
- Single run on 96GB RTX 6000: ~2-3h (Qwen3.5-2B, num_gen=8, 200 steps, lsrl,
|
||||
AntiPaSTO r=256)
|
||||
- 14 runs: 35-45h
|
||||
- At ~$3 AUD/hr: $105-135 AUD
|
||||
- + debugging buffer: budget ~$200 AUD total
|
||||
- Calendar time: 1 week back-to-back; 2-3 weeks with iteration
|
||||
|
||||
## Risks and decision points
|
||||
|
||||
- **H4 falsified (no hack emergence at 2B):** swap to Qwen3-4B with
|
||||
num_generations=4 and batch=64. Adds ~2x to per-run time
|
||||
- **verl doesn't run on single 96GB:** fall back to TRL GRPOTrainer with manual
|
||||
reimplementation of Nanda's reward function. Higher engineering cost
|
||||
- **v_hack steering check fails:** extraction is broken. Diagnose layer
|
||||
choice, pair quality, or SVD truncation before training runs
|
||||
- **All methods tie vanilla on hack rate:** likely the intervention isn't
|
||||
biting. Check gradient projection is actually changing trajectory
|
||||
(cos_align logs)
|
||||
- **H4 falsified (no hack on Qwen3.5-2B with lsrl):** branch 1 — try
|
||||
Qwen3-4B same hyperparams. Branch 2 — drop lsrl, hook into verl
|
||||
directly. Adds ~1-2 weeks engineering.
|
||||
- **AntiPaSTO + GRPO doesn't train:** known risk — antipasto's trainable
|
||||
subspace (`delta_S` only) may be too small for RL. Mitigation: enable
|
||||
Cayley rotation (`rotate_basis="V"`, `block_size=4`), adds `r*(bs-1)/2`
|
||||
params per module. Or fall back to PiSSA-LoRA-freeze-A.
|
||||
- **`v_hack` steering check fails (per-module projection scores ≤chance):**
|
||||
extraction broken. Check (a) hook captures pre-residual input, (b) pair
|
||||
quality drives strong activation difference somewhere, (c) tokenization of
|
||||
hack vs clean completions isn't trivially distinguishing.
|
||||
- **All methods tie vanilla on hack rate:** intervention not biting. Check
|
||||
`cos_align` logs nonzero, `frac_modules_projected` nonzero.
|
||||
|
||||
## What this is not
|
||||
|
||||
- Not a claim that gradient projection solves reward hacking generally
|
||||
- Not a comparison to monitor-based methods (those are Nanda's territory,
|
||||
cite their numbers, don't re-run)
|
||||
- Not a claim that rank-space gradient projection solves reward hacking
|
||||
generally
|
||||
- Not a comparison to monitor-based methods (cite Ariahw's numbers, don't
|
||||
re-run)
|
||||
- Not a claim about hacks beyond `run_tests()` overwrite
|
||||
- Not a replacement for RLHF safety pipeline; this is a targeted intervention
|
||||
- Not a replacement for RLHF safety pipeline; this is a targeted intervention
|
||||
|
||||
## Related work and naming
|
||||
|
||||
- **Wu & Tang 2026, Rebound** ([arxiv:2604.01476](https://arxiv.org/abs/2604.01476)) —
|
||||
advantage-side concept-direction penalty during GRPO. Our H3 baseline.
|
||||
- **Ariahw/Engels/Nanda 2025, rl-rewardhacking** ([github](https://github.com/ariahw/rl-rewardhacking)) —
|
||||
source of dataset, reward function, and `v_hack`-relevant `run_tests` hack pattern.
|
||||
- **AntiPaSTO** ([wassname/lora-lite/variants/antipasto.py](https://github.com/wassname/lora-lite/blob/main/src/lora_lite/variants/antipasto.py),
|
||||
([wassname/AntiPaSTO paper](https://github.com/wassname/AntiPaSTO)) — adapter
|
||||
we wrap with.
|
||||
- **lsrl** ([lsdefine/lsrl](https://github.com/lsdefine/lsrl)) — GRPO trainer.
|
||||
- **PiSSA** ([arxiv:2404.02948](https://arxiv.org/abs/2404.02948)) — frozen
|
||||
top-r SVD-init for LoRA; closest spiritual ancestor to AntiPaSTO.
|
||||
|
||||
Reference in New Issue
Block a user