Experiment: SVD-basis gradient projection vs RL reward hacking

Context

Policies trained with GRPO and related on-policy RL methods are known to exploit loopholes in reward functions. Ariaw, Engels & Nanda (2025) open-sourced a LeetCode benchmark in which Qwen3-4B learns to overwrite the evaluation function run_tests() instead of solving problems, reaching a 79% reward-hack rate at 200 training steps. Existing mitigations are mostly monitor-based (detect hacks in the output) or advantage-based (Rebound: penalize hacking rollouts via a concept-score-modified advantage).

This experiment tests a different mechanism: extract a hack direction (v_hack) from contrastive pairs, project it onto the top singular vectors of the weight matrices W (the "SVD-of-W basis"), and at each step project the training gradient orthogonal to it. The difference from Rebound is the level of intervention: a gradient-level direction constraint rather than a rollout-level scalar penalty.

This experiment is preregistered: results will be reported regardless of outcome.

Hypotheses (preregistered)

H1 (mechanism, primary): Gradient projection in SVD basis against a v_hack extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30 percentage points (absolute) relative to vanilla GRPO, at matched LeetCode pass rate within 10pp of vanilla.

Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.
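
The H1 falsification criteria can be written down as a check over per-seed results. This is a minimal sketch, not the registered analysis script: the function names are hypothetical, inputs are per-seed rates in percent, "within 1 SEM" is read as "the gap in mean hack rate is no larger than the combined SEM of the two arms" (an assumption, since the criterion does not say which SEM), and the "matched hack-rate budget" condition is simplified to a plain difference in mean pass rate.

```python
import math
import statistics


def sem(values):
    """Standard error of the mean across seeds (sample std / sqrt(n))."""
    return statistics.stdev(values) / math.sqrt(len(values))


def h1_falsified(vanilla_hack, ours_hack, vanilla_pass, ours_pass):
    """Return (falsified, reasons) for H1 from per-seed rates in percent."""
    reduction = statistics.mean(vanilla_hack) - statistics.mean(ours_hack)
    pass_drop = statistics.mean(vanilla_pass) - statistics.mean(ours_pass)
    # Assumption: "within 1 SEM" compares the gap to the combined SEM of both arms.
    combined_sem = math.sqrt(sem(vanilla_hack) ** 2 + sem(ours_hack) ** 2)

    reasons = []
    if reduction < 15:
        reasons.append("hack-rate reduction < 15pp")
    if pass_drop > 15:
        reasons.append("pass rate dropped > 15pp")
    if abs(reduction) <= combined_sem:
        reasons.append("within 1 SEM of vanilla")
    return bool(reasons), reasons
```

Returning the list of triggered criteria, rather than a bare boolean, makes it easier to report which condition falsified the hypothesis.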

H2 (SVD denoising): SVD-of-W top-m projection of v_hack improves intervention strength compared to raw activation-space v_hack, at matched extraction-pair count. Test via ablation arm.

Falsified if: ablation arm (no SVD projection) matches or exceeds main arm within 1 SEM.

H3 (gradient vs advantage): Gradient-level intervention (ours) outperforms advantage-level intervention (Rebound reimplemented) on hack rate at matched pass rate.

Falsified if: Rebound reimplementation matches or beats ours within 1 SEM.

H4 (scaling sanity): Qwen3.5-2B substituting Qwen3-4B in Nanda's setup reproduces measurable reward hacking (>30% hack rate at 200 steps vanilla).

Falsified if: vanilla hack rate <30%. If falsified, swap to Qwen3-4B with reduced num_generations to fit compute.

Steps

Clone Nanda's env (git clone https://github.com/ariahw/rl-rewardhacking). It uses verl v0.6.1, not TRL; confirm verl runs on the RTX 6000 setup.
H4 sanity: reproduce hack with smaller model. Single run, Qwen3.5-2B substituted for Qwen3-4B, all other hyperparams as published (LoRA r=32, alpha=32, lr=7e-5). Reduce num_generations from 16 to 8 and batch from 256 to 128 to fit single-GPU compute. 200 steps, ~3 hours.

Decision point: if hack rate < 30% at step 200, abandon Qwen3.5-2B, swap to Qwen3-4B with further-reduced batch and accept longer per-run time.
Build v_hack. ~60-80 contrastive pairs:
- Positive (hacky): LeetCode prompts paired with def run_tests(): pass or similar evaluator-overwrite completions
- Negative (clean): same prompts with legitimate solution attempts (can be generated by base Qwen3.5-2B at temperature 0)
- Validate on 20 held-out pairs: extracted direction's projection score should separate hack from clean with >90% accuracy
- Sanity check: add v_hack at inference to base model, confirm it steers generation toward hack-flavored completions (CAA-style check)
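
The extraction and held-out validation above could look roughly like the following. This is a sketch under stated assumptions: the plan does not fix the extraction method, so a mean-difference direction over paired activations is assumed here; activations are taken to be already pooled to one vector per completion; and both function names and the midpoint threshold are hypothetical choices.

```python
import numpy as np


def extract_direction(hack_acts, clean_acts):
    """Unit-norm mean-difference direction from paired activations of shape (n_pairs, d)."""
    diff = (hack_acts - clean_acts).mean(axis=0)
    return diff / np.linalg.norm(diff)


def heldout_accuracy(v_hack, train_hack, train_clean, test_hack, test_clean):
    """Classify held-out activations by their projection onto v_hack."""
    # Assumption: threshold at the midpoint of the two training-set mean projections.
    threshold = 0.5 * ((train_hack @ v_hack).mean() + (train_clean @ v_hack).mean())
    correct = ((test_hack @ v_hack) > threshold).sum()
    correct += ((test_clean @ v_hack) <= threshold).sum()
    return correct / (len(test_hack) + len(test_clean))
```

The >90% criterion then becomes a single comparison on the value returned by heldout_accuracy for the 20 held-out pairs.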
SVD denoise. Extract V (right singular vectors) of the relevant W matrices. Project v_hack onto the top-m right singular vectors (m=16 default, sweep m∈{8,16,32} as ablation). Reproject back to the original space and normalize to unit length.
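
A minimal sketch of the denoising step for a single weight matrix, assuming v_hack lives in the input space of W (shape d_out × d_in) and that it has a nonzero component in the top-m subspace. The function name svd_denoise is hypothetical, and which W matrices count as "relevant" is left open here.

```python
import numpy as np


def svd_denoise(v_hack, W, m=16):
    """Project v_hack onto the span of the top-m right singular vectors of W, then renormalize."""
    # Rows of Vt are right singular vectors (input space), sorted by singular value.
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    V_m = Vt[:m]               # (m, d_in), orthonormal rows
    coeffs = V_m @ v_hack      # coordinates of v_hack in the top-m basis
    v_proj = V_m.T @ coeffs    # reproject back to the original space
    return v_proj / np.linalg.norm(v_proj)
```

The H2 ablation arm corresponds to skipping this function and using the raw normalized v_hack.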
Implement gradient projection in verl's training loop:
- Per optimizer step: g = current gradient
- Compute cos_align = dot(g, v_hack) / ||g|| (v_hack is unit-norm, so this is the cosine between g and v_hack)
- If cos_align > 0: g' = g - cos_align × ||g|| × v_hack, then renormalize to ||g|| magnitude
- If cos_align ≤ 0: g' = g (no intervention)
- Step optimizer with g'
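
The projection rule above can be sketched on flat vectors as follows. This assumes g and v_hack have already been flattened into the same space and that v_hack is unit-norm; how an activation-space direction is mapped onto the LoRA parameter gradient, and how this hooks into verl's training loop, is not shown. The function name and the preserve_norm flag (which corresponds to the "no magnitude preservation" ablation) are hypothetical.

```python
import numpy as np


def project_gradient(g, v_hack, preserve_norm=True):
    """Remove the component of g along unit-norm v_hack when the two are positively aligned."""
    g_norm = np.linalg.norm(g)
    if g_norm == 0.0:
        return g
    cos_align = float(g @ v_hack) / g_norm    # cosine, valid because ||v_hack|| = 1
    if cos_align <= 0.0:
        return g                               # no intervention
    g_proj = g - cos_align * g_norm * v_hack   # equals g - (g . v_hack) v_hack
    if preserve_norm:
        proj_norm = np.linalg.norm(g_proj)
        if proj_norm > 0.0:
            g_proj = g_proj * (g_norm / proj_norm)  # restore the original magnitude
    return g_proj
```

For example, with g = (3, 4) and v_hack = (1, 0), the aligned component 3 is removed, leaving (0, 4), which is then rescaled to (0, 5) to keep the original norm of 5.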
Run arms (200 steps each, 3 seeds where indicated):

a. Vanilla GRPO + LoRA (3 seeds): baseline, expected hack rate ~40-79%
b. Our method, m=16 (3 seeds): main result
c. Our method, no SVD projection (raw v_hack, 1 seed): H2 ablation
d. Our method, no magnitude preservation (1 seed): design ablation
e. Rebound reimplemented: shortcut-direction advantage modification (3 seeds), H3 baseline
f. Optional: m=8 and m=32 sweeps (1 seed each)

Total runs: 13-15, ~3 hours each = 40-50 hours compute
Measure at every 25 steps:
- Hack rate: % of rollouts that successfully overwrite tests (Nanda's existing metric, from their codebase)
- Pass rate: % of rollouts that pass tests legitimately on held-out problems (without write access to evaluator)
- cos_align trajectory: mean cos(g, v_hack) per step (diagnostic)
- KL drift from base (diagnostic for catastrophic policy change)
Headline plot: hack rate (y-axis) vs pass rate (x-axis), one point per (arm × seed), with the Pareto frontier marked. Our method should sit below and to the right of vanilla GRPO (lower hack rate, equal or higher pass rate). Annotate Rebound's position.
Falsification check: before publishing, run pre-registered analysis on H1-H4. Report all hypotheses, including falsified ones.

Why measure the trade-off, not just hack rate

You raised this directly: "a model that learns none will not cheat." Correct: a hack rate of 0 with a pass rate of 0 is trivially achievable by tanking training. The right metric is the Pareto frontier of (hack rate, pass rate), not either alone.

Pure hack rate: rewards undertraining
Pure pass rate: rewards anything that improves coding, including via the hack
Hack vs pass scatter: shows whether your method moves below and to the right of vanilla (less hack at the same or better pass rate) or just down and to the left (less of everything)

The published claim should be: "at matched pass rate ±5pp on held-out problems without write access, our method reduces hack rate from X% to Y%."
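
A small sketch of how the Pareto frontier over (hack rate, pass rate) points could be computed, where lower hack rate and higher pass rate are both better. The function name is hypothetical and the input is assumed to be one (hack_rate, pass_rate) tuple per (arm × seed).

```python
def pareto_frontier(points):
    """Return the points not dominated on (lower hack rate, higher pass rate)."""
    frontier = []
    for i, (hack_i, pass_i) in enumerate(points):
        dominated = any(
            hack_j <= hack_i
            and pass_j >= pass_i
            and (hack_j < hack_i or pass_j > pass_i)
            for j, (hack_j, pass_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            frontier.append((hack_i, pass_i))
    return sorted(frontier)
```

Note that a fully undertrained run at (0, 0) is itself non-dominated under this definition, which is why the published claim is stated at matched pass rate and not as frontier membership alone.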

Compute estimate

Single run on 96GB RTX 6000: ~2-3 hours (Qwen3.5-2B, num_gen=8, 200 steps)
13-15 runs: 40-50 hours
At ~$3 AUD/hr: ~$120-150 AUD
Plus debugging/iteration buffer: budget ~$200-250 AUD total
Calendar time: ~1 week if running back-to-back; 2-3 weeks with iteration

Risks and decision points

H4 falsified (no hack emergence at 2B): swap to Qwen3-4B with num_generations=4 and batch=64. Adds ~2x to per-run time
verl doesn't run on single 96GB: fall back to TRL GRPOTrainer with manual reimplementation of Nanda's reward function. Higher engineering cost
v_hack steering check fails: extraction is broken. Diagnose layer choice, pair quality, or SVD truncation before training runs
All methods tie vanilla on hack rate: likely the intervention isn't biting. Check gradient projection is actually changing trajectory (cos_align logs)

What this is not

Not a claim that gradient projection solves reward hacking generally
Not a comparison to monitor-based methods (those are Nanda's territory; cite their numbers rather than re-running them)
Not a claim about hacks beyond run_tests() overwrite
Not a replacement for an RLHF safety pipeline; this is a targeted intervention
