This commit is contained in:
wassname
2026-05-23 10:22:54 +08:00
commit 7248d469a7
4 changed files with 4282 additions and 0 deletions
@@ -0,0 +1,152 @@
# Experiment: SVD-basis gradient projection vs RL reward hacking
## Context
GRPO and related on-policy RL methods are known to exploit loopholes in reward
functions. Ariaw, Engels & Nanda (2025) open-sourced a benchmark on LeetCode
where Qwen3-4B learns to overwrite the evaluation function `run_tests()` instead
of solving problems, reaching 79% reward hack rate at 200 training steps.
Existing mitigations are mostly monitor-based (detect at output) or
advantage-based (Rebound: penalize hacking rollouts via concept-score-modified
advantage).
This experiment tests a different mechanism: **extract a hack-direction from
contrastive pairs, project it into the SVD-of-W basis, and project the training
gradient orthogonal to it at each step.** The mechanism differs from Rebound:
this is a gradient-level direction constraint, whereas Rebound applies a
rollout-level scalar penalty.
This experiment is preregistered: results will be reported regardless of outcome.
## Hypotheses (preregistered)
**H1 (mechanism, primary):** Gradient projection in SVD basis against a v_hack
extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30
percentage points (absolute) relative to vanilla GRPO, at matched LeetCode pass
rate within 10pp of vanilla.
Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at
matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.
**H2 (SVD denoising):** SVD-of-W top-m projection of v_hack improves
intervention strength compared to raw activation-space v_hack, at matched
extraction-pair count. Test via ablation arm.
Falsified if: ablation arm (no SVD projection) matches or exceeds main arm
within 1 SEM.
**H3 (gradient vs advantage):** Gradient-level intervention (ours) outperforms
advantage-level intervention (Rebound reimplemented) on hack rate at matched
pass rate.
Falsified if: Rebound reimplementation matches or beats ours within 1 SEM.
**H4 (scaling sanity):** Qwen3.5-2B substituting Qwen3-4B in Nanda's setup
reproduces measurable reward hacking (>30% hack rate at 200 steps vanilla).
Falsified if: vanilla hack rate <30%. If falsified, swap to Qwen3-4B with
reduced num_generations to fit compute.
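A minimal sketch of how the H1 falsification criteria could be checked once per-seed results exist. The function name `h1_falsified` is hypothetical, and two readings are assumptions rather than part of the preregistration: "within 1 SEM" is taken to mean the hack-rate difference does not exceed the standard error of that difference, and the pass-rate criterion is simplified to the difference in mean pass rate. The numbers in the usage lines are placeholders, not results.

```python
from math import sqrt
from statistics import mean, stdev


def sem(xs):
    """Standard error of the mean, using the sample standard deviation."""
    return stdev(xs) / sqrt(len(xs))


def h1_falsified(vanilla_hack, ours_hack, vanilla_pass, ours_pass):
    """Check the H1 falsification criteria.

    All inputs are lists of rates in percent, one entry per seed.
    Returns (falsified, reasons).
    """
    reduction = mean(vanilla_hack) - mean(ours_hack)  # pp, positive is good
    pass_drop = mean(vanilla_pass) - mean(ours_pass)  # pp, positive is bad
    # Assumption: compare the difference against the SEM of the difference.
    sem_diff = sqrt(sem(vanilla_hack) ** 2 + sem(ours_hack) ** 2)

    reasons = []
    if reduction < 15:
        reasons.append("hack-rate reduction < 15pp")
    if pass_drop > 15:
        reasons.append("pass rate drops by > 15pp")
    if abs(reduction) <= sem_diff:
        reasons.append("hack-rate difference within 1 SEM")
    return bool(reasons), reasons


# Placeholder per-seed values for illustration only.
falsified, reasons = h1_falsified(
    vanilla_hack=[70, 75, 80],
    ours_hack=[30, 35, 40],
    vanilla_pass=[20, 22, 24],
    ours_pass=[18, 20, 22],
)
print(falsified, reasons)
```

The same shape would work for H2 and H3 by swapping which arm plays the role of the comparison.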
## Steps
1. **Clone Nanda's env.** `git clone https://github.com/ariahw/rl-rewardhacking`. This
uses verl v0.6.1, not TRL; confirm verl runs on the RTX 6000 setup.
2. **H4 sanity: reproduce hack with smaller model.** Single run, Qwen3.5-2B
substituted for Qwen3-4B, all other hyperparams as published (LoRA r=32,
alpha=32, lr=7e-5). Reduce num_generations from 16 to 8 and batch from 256
to 128 to fit single-GPU compute. 200 steps, ~3 hours.
Decision point: if hack rate < 30% at step 200, abandon Qwen3.5-2B, swap to
Qwen3-4B with further-reduced batch and accept longer per-run time.
3. **Build v_hack.** ~60-80 contrastive pairs:
- Positive (hacky): LeetCode prompts paired with `def run_tests(): pass` or
similar evaluator-overwrite completions
- Negative (clean): same prompts with legitimate solution attempts (can be
generated by base Qwen3.5-2B at temperature 0)
- Validate on 20 held-out pairs: extracted direction's projection score
should separate hack from clean with >90% accuracy
- Sanity check: add v_hack at inference to base model, confirm it steers
generation toward hack-flavored completions (CAA-style check)
4. **SVD denoise.** Take the right singular vectors V from the SVD of each
relevant W matrix. Project v_hack onto the top-m of them (m=16 default, sweep
m∈{8,16,32} as ablation), map the coefficients back to the original space, and
normalize to unit length.
5. **Implement gradient projection** in verl's training loop:
- Per optimizer step: g = current gradient
- Compute cos_align = dot(g, v_hack) / ||g|| (v_hack is unit-norm after step 4)
- If cos_align > 0: g' = g - cos_align × ||g|| × v_hack, then rescale g' so
that ||g'|| = ||g||
- If cos_align ≤ 0: g' = g (no intervention)
- Step optimizer with g'
6. **Run arms** (200 steps each, 3 seeds where indicated):
a. Vanilla GRPO + LoRA (3 seeds) — baseline, expected hack rate ~40-79%
b. Our method, m=16 (3 seeds) — main result
c. Our method, no SVD projection (raw v_hack, 1 seed) — H2 ablation
d. Our method, no magnitude preservation (1 seed) — design ablation
e. Rebound reimplemented: shortcut-direction advantage modification (3
seeds) — H3 baseline
f. Optional: m=8 and m=32 sweeps (1 seed each)
Total runs: 13-15 including the optional sweeps (arms a-e alone are 11 runs),
~3 hours each = 40-50 hours compute
7. **Measure** every 25 steps:
- **Hack rate:** % of rollouts that successfully overwrite tests
(Nanda's existing metric, from their codebase)
- **Pass rate:** % of rollouts that pass tests legitimately on held-out
problems (without write access to evaluator)
- **cos_align trajectory:** mean cos(g, v_hack) per step (diagnostic)
- **KL drift from base** (diagnostic for catastrophic policy change)
8. **Headline plot:** hack rate (y-axis) vs pass rate (x-axis), one point per
(arm × seed), with the Pareto frontier drawn. Our method should be
below-and-to-the-right of vanilla GRPO. Annotate Rebound's position.
9. **Falsification check:** before publishing, run pre-registered analysis on
H1-H4. Report all hypotheses, including falsified ones.
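Step 5 above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not verl code: `project_gradient` is a hypothetical helper, and `g` and `v_hack` are assumed to be already flattened into vectors of the same dimension. How an activation-space v_hack is mapped into the space of the trainable parameters is not specified here and would need to be decided before this can be wired into a training loop. The `preserve_norm` flag corresponds to the magnitude-preservation ablation (arm d).

```python
import numpy as np


def project_gradient(g, v_hack, preserve_norm=True, eps=1e-12):
    """Remove the positive component of g along v_hack.

    Assumes g and v_hack are 1-D arrays of equal length.
    Returns (g_new, cos_align).
    """
    v = v_hack / (np.linalg.norm(v_hack) + eps)  # enforce unit norm
    g_norm = np.linalg.norm(g)
    if g_norm < eps:
        return g.copy(), 0.0

    cos_align = float(g @ v) / g_norm
    if cos_align <= 0:
        # Gradient is not pushing toward the hack direction: leave it alone.
        return g.copy(), cos_align

    # Subtract the component of g that lies along v_hack.
    g_new = g - cos_align * g_norm * v
    if preserve_norm:
        new_norm = np.linalg.norm(g_new)
        if new_norm > eps:  # skip rescaling if g was parallel to v_hack
            g_new = g_new * (g_norm / new_norm)
    return g_new, cos_align


g = np.array([3.0, 4.0])
v_hack = np.array([1.0, 0.0])
g_new, cos_align = project_gradient(g, v_hack)
print(g_new, cos_align)
```

Logging the returned `cos_align` at every step gives the diagnostic trajectory listed under step 7.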
## Why measure ratio, not just hack rate
You raised this directly: "a model that learns none will not cheat."
Correct: a hack rate of 0 is trivially achievable together with a pass rate of
0, simply by tanking training. The right metric is the *Pareto frontier* of
(hack rate, pass rate), not either alone.
- Pure hack rate: rewards undertraining
- Pure pass rate: rewards anything that improves coding, including via the hack
- Hack vs pass scatter: shows whether your method moves below-and-to-right of
vanilla (less hack at the same or higher pass rate) or just down-left (less of
everything)
The published claim should be: "at matched pass rate ±5pp on held-out problems
without write access, our method reduces hack rate from X% to Y%."
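The Pareto frontier over (pass rate, hack rate) points can be computed with a few lines of plain Python. This is a sketch: `pareto_frontier` is a hypothetical helper, and the run labels and rates below are placeholders chosen for illustration, not measurements. A point is treated as dominated when another point has at least as high a pass rate and at most as high a hack rate, with one of the two strictly better.

```python
def pareto_frontier(points):
    """Return the non-dominated points, sorted by pass rate.

    points: list of (label, pass_rate, hack_rate) tuples, rates in percent.
    Higher pass rate is better, lower hack rate is better.
    """
    frontier = []
    for label, p, h in points:
        dominated = any(
            (p2 >= p and h2 <= h) and (p2 > p or h2 < h)
            for _, p2, h2 in points
        )
        if not dominated:
            frontier.append((label, p, h))
    return sorted(frontier, key=lambda t: t[1])


# Placeholder values for illustration only.
runs = [
    ("vanilla", 20, 75),
    ("ours", 19, 35),
    ("undertrained", 2, 0),
    ("worse", 10, 60),
]
frontier = pareto_frontier(runs)
print([label for label, _, _ in frontier])
```

Note that the undertrained run sits on the frontier as well, which is why the headline claim is stated at matched pass rate rather than as frontier membership alone.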
## Compute estimate
- Single run on 96GB RTX 6000: ~2-3 hours (Qwen3.5-2B, num_gen=8, 200 steps)
- 13-15 runs: 40-50 hours
- At ~$3 AUD/hr: ~$120-150 AUD
- Plus debugging/iteration buffer: budget ~$200-250 AUD total
- Calendar time: ~1 week if running back-to-back; 2-3 weeks with iteration
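The arithmetic behind the estimate can be checked with a short script. The function name `budget` is hypothetical; the inputs are the run counts, the upper per-run time, and the hourly rate listed above.

```python
def budget(runs, hours_per_run, aud_per_hour):
    """Return (total_hours, total_cost_aud) for a batch of runs."""
    hours = runs * hours_per_run
    return hours, hours * aud_per_hour


# Upper per-run time of 3 hours at ~$3 AUD/hr.
low = budget(runs=13, hours_per_run=3, aud_per_hour=3)
high = budget(runs=15, hours_per_run=3, aud_per_hour=3)
print(low, high)
```

At 3 hours per run this gives 39 to 45 hours and $117 to $135 AUD, so the 40-50 hour and $120-150 figures already include a small margin before the debugging buffer.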
## Risks and decision points
- **H4 falsified (no hack emergence at 2B):** swap to Qwen3-4B with
num_generations=4 and batch=64. Adds ~2x to per-run time
- **verl doesn't run on single 96GB:** fall back to TRL GRPOTrainer with manual
reimplementation of Nanda's reward function. Higher engineering cost
- **v_hack steering check fails:** extraction is broken. Diagnose layer
choice, pair quality, or SVD truncation before training runs
- **All methods tie vanilla on hack rate:** likely the intervention isn't
biting. Check gradient projection is actually changing trajectory
(cos_align logs)
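When diagnosing SVD truncation, one quantity worth looking at is how much of v_hack survives the top-m projection. The sketch below is an assumption-laden illustration: `svd_denoise` is a hypothetical helper, W is assumed to have shape (d_out, d_in) with v_hack living in its input space, and the random arrays are stand-ins for a real weight matrix and extracted direction.

```python
import numpy as np


def svd_denoise(v_hack, W, m=16):
    """Project v_hack onto the top-m right singular vectors of W.

    Returns (v_denoised, retained), where v_denoised is unit-norm and
    retained is the fraction of squared norm kept by the projection.
    """
    # Rows of Vt are right singular vectors, ordered by singular value.
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    V_m = Vt[:m].T                # shape (d_in, m)
    coeffs = V_m.T @ v_hack       # coordinates in the top-m basis
    v_proj = V_m @ coeffs         # back in the original space
    retained = float(v_proj @ v_proj) / float(v_hack @ v_hack)
    norm = np.linalg.norm(v_proj)
    v_out = v_proj / norm if norm > 0 else v_proj
    return v_out, retained


# Random stand-ins; a real check would use a model weight and extracted v_hack.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))
v = rng.standard_normal(48)
for m in (8, 16, 32):
    _, retained = svd_denoise(v, W, m=m)
    print(m, round(retained, 3))
```

A retained fraction near zero at the chosen m would suggest the extracted direction mostly lies outside the top singular subspace, which is one possible reason for a failed steering check.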
## What this is not
- Not a claim that gradient projection solves reward hacking generally
- Not a comparison to monitor-based methods (those are Nanda's territory,
cite their numbers, don't re-run)
- Not a claim about hacks beyond `run_tests()` overwrite
- Not a replacement for an RLHF safety pipeline; this is a targeted intervention