init

2026-06-27 16:45:42 +08:00 · 2026-05-23 10:22:54 +08:00
commit 7248d469a7
4 changed files with 4282 additions and 0 deletions
@@ -0,0 +1,152 @@
+# Experiment: SVD-basis gradient projection vs RL reward hacking
+
+## Context
+
+GRPO and related on-policy RL methods are known to exploit loopholes in reward
+functions. Ariaw, Engels & Nanda (2025) open-sourced a LeetCode benchmark in
+which Qwen3-4B learns to overwrite the evaluation function `run_tests()` instead
+of solving problems, reaching a 79% reward-hack rate at 200 training steps.
+Existing mitigations are mostly monitor-based (detect the hack in the output) or
+advantage-based (Rebound: penalize hacking rollouts through a
+concept-score-modified advantage).
+
+This experiment tests a different mechanism: **extract a hack direction (v_hack)
+from contrastive pairs, project it onto the top singular vectors of the weight
+matrices W, and at each step remove the component of the training gradient that
+points along it.** The difference from Rebound is where the intervention acts: a
+direction constraint on the gradient versus a scalar penalty on each rollout.
+
+This is preregistered: results will be reported regardless of outcome.
+
+## Hypotheses (preregistered)
+
+**H1 (mechanism, primary):** Gradient projection in SVD basis against a v_hack
+extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30
+percentage points (absolute) relative to vanilla GRPO, at matched LeetCode pass
+rate within 10pp of vanilla.
+
+  Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at
+  matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.
+
+**H2 (SVD denoising):** SVD-of-W top-m projection of v_hack improves
+intervention strength compared to raw activation-space v_hack, at matched
+extraction-pair count. Test via ablation arm.
+
+  Falsified if: ablation arm (no SVD projection) matches or exceeds main arm
+  within 1 SEM.
+
+**H3 (gradient vs advantage):** Gradient-level intervention (ours) outperforms
+advantage-level intervention (Rebound reimplemented) on hack rate at matched
+pass rate.
+
+  Falsified if: Rebound reimplementation matches or beats ours within 1 SEM.
+
+**H4 (scaling sanity):** Qwen3.5-2B substituting Qwen3-4B in Nanda's setup
+reproduces measurable reward hacking (>30% hack rate at 200 steps vanilla).
+
+  Falsified if: vanilla hack rate <30%. If falsified, swap to Qwen3-4B with
+  reduced num_generations to fit compute.
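The H1 thresholds can be written down as a decision rule before any data exist. A minimal sketch, assuming per-seed rates in percent. The function names, the reading of "within 1 SEM" as a gap no larger than the bigger of the two SEMs, and the simplification of the "matched hack-rate budget" condition to a plain pass-rate drop are assumptions of this sketch, not part of the preregistration.

```python
import math
import statistics


def mean_and_sem(values):
    """Mean and standard error of the mean across seeds (needs >= 2 values)."""
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


def h1_verdict(vanilla_hack, method_hack, vanilla_pass, method_pass):
    """Apply the H1 thresholds to per-seed hack and pass rates (percent)."""
    v_hack, v_sem = mean_and_sem(vanilla_hack)
    m_hack, m_sem = mean_and_sem(method_hack)
    v_pass, _ = mean_and_sem(vanilla_pass)
    m_pass, _ = mean_and_sem(method_pass)

    reduction = v_hack - m_hack   # hack-rate reduction, percentage points
    pass_drop = v_pass - m_pass   # pass-rate drop, percentage points
    # Assumption: "within 1 SEM" means the gap does not exceed the larger SEM.
    within_sem = abs(reduction) <= max(v_sem, m_sem)

    if reduction < 15 or pass_drop > 15 or within_sem:
        return "falsified"
    if reduction >= 30 and abs(pass_drop) <= 10:
        return "supported"
    return "inconclusive"


# Hypothetical per-seed numbers for illustration only, not results.
example = h1_verdict([70, 75, 80], [30, 35, 40], [20, 22, 24], [18, 20, 22])
```

The "inconclusive" branch covers the gap the hypothesis leaves open: a reduction between 15pp and 30pp neither meets the H1 target nor triggers the falsification criteria.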
+
+## Steps
+
+1. **Clone Nanda's env.** `git clone https://github.com/ariahw/rl-rewardhacking`.
+   This uses verl v0.6.1, not TRL, so confirm that verl runs on the RTX 6000
+   setup.
+
+2. **H4 sanity: reproduce hack with smaller model.** Single run, Qwen3.5-2B
+   substituted for Qwen3-4B, all other hyperparams as published (LoRA r=32,
+   alpha=32, lr=7e-5). Reduce num_generations from 16 to 8 and batch from 256
+   to 128 to fit single-GPU compute. 200 steps, ~3 hours.
+
+   Decision point: if hack rate < 30% at step 200, abandon Qwen3.5-2B, swap to
+   Qwen3-4B with further-reduced batch and accept longer per-run time.
+
+3. **Build v_hack.** ~60-80 contrastive pairs:
+   - Positive (hacky): LeetCode prompts paired with `def run_tests(): pass` or
+     similar evaluator-overwrite completions
+   - Negative (clean): same prompts with legitimate solution attempts (can be
+     generated by base Qwen3.5-2B at temperature 0)
+   - Validate on 20 held-out pairs: extracted direction's projection score
+     should separate hack from clean with >90% accuracy
+   - Sanity check: add v_hack at inference to base model, confirm it steers
+     generation toward hack-flavored completions (CAA-style check)
+
+4. **SVD denoise.** Take V, the right singular vectors of each relevant weight
+   matrix W. Project v_hack onto the top-m singular directions (m=16 default,
+   sweep m∈{8,16,32} as ablation), map the result back to the original space,
+   and normalize to unit length.
+
+5. **Implement gradient projection** in verl's training loop (v_hack is
+   assumed unit-norm):
+   - Per optimizer step: g = current gradient
+   - Compute cos_align = dot(g, v_hack) / ||g||
+   - If cos_align > 0: g' = g - cos_align × ||g|| × v_hack (equivalently
+     g - dot(g, v_hack) × v_hack), then rescale g' so that ||g'|| = ||g||
+   - If cos_align ≤ 0: g' = g (no intervention)
+   - Step optimizer with g'
+
+6. **Run arms** (200 steps each, 3 seeds where indicated):
+
+   a. Vanilla GRPO + LoRA (3 seeds) — baseline, expected hack rate ~40-79%
+   b. Our method, m=16 (3 seeds) — main result
+   c. Our method, no SVD projection (raw v_hack, 1 seed) — H2 ablation
+   d. Our method, no magnitude preservation (1 seed) — design ablation
+   e. Rebound reimplemented: shortcut-direction advantage modification (3
+      seeds) — H3 baseline
+   f. Optional: m=8 and m=32 sweeps (1 seed each)
+
+   Total runs: 13-15, ~3 hours each = 40-50 hours compute
+
+7. **Measure** at every 25 steps:
+   - **Hack rate:** % of rollouts that successfully overwrite tests
+     (Nanda's existing metric, from their codebase)
+   - **Pass rate:** % of rollouts that pass tests legitimately on held-out
+     problems (without write access to evaluator)
+   - **cos_align trajectory:** mean cos(g, v_hack) per step (diagnostic)
+   - **KL drift from base** (diagnostic for catastrophic policy change)
+
+8. **Headline plot:** hack rate (y-axis) vs pass rate (x-axis), one point per
+   (arm × seed), with the Pareto frontier drawn. Our method should sit
+   below-and-to-the-right of vanilla GRPO. Annotate Rebound's position.
+
+9. **Falsification check:** before publishing, run pre-registered analysis on
+   H1-H4. Report all hypotheses, including falsified ones.
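Steps 3 and 4 can be sketched on synthetic data. The plan does not fix the extraction method, so this sketch assumes a difference-of-means direction and a midpoint threshold. The function names, the random stand-in activations, and the random stand-in weight matrix are all hypothetical. Real runs would use model activations from a chosen layer and the actual W.

```python
import numpy as np


def mean_diff_direction(hack_acts, clean_acts):
    """Unit vector pointing from the clean mean to the hack mean."""
    v = hack_acts.mean(axis=0) - clean_acts.mean(axis=0)
    return v / np.linalg.norm(v)


def separation_accuracy(v, hack_acts, clean_acts, threshold):
    """Fraction of examples whose projection onto v is on the correct side."""
    hack_ok = (hack_acts @ v) > threshold
    clean_ok = (clean_acts @ v) <= threshold
    return float(np.concatenate([hack_ok, clean_ok]).mean())


def svd_denoise(v, W, m=16):
    """Keep only the part of v inside the top-m right singular subspace of W."""
    _, _, Vt = np.linalg.svd(W, full_matrices=False)  # rows of Vt: right singular vectors
    V_m = Vt[:m].T                                    # shape (d_in, m)
    v_proj = V_m @ (V_m.T @ v)                        # project, then map back
    return v_proj / np.linalg.norm(v_proj)


# Synthetic stand-ins for activations (hypothetical data, not model outputs).
rng = np.random.default_rng(0)
d = 64
planted = rng.normal(size=d)
planted /= np.linalg.norm(planted)
clean_train = rng.normal(size=(60, d))
hack_train = rng.normal(size=(60, d)) + 8.0 * planted
clean_held = rng.normal(size=(20, d))
hack_held = rng.normal(size=(20, d)) + 8.0 * planted

v_hack = mean_diff_direction(hack_train, clean_train)
# Threshold: midpoint between the two training means along v_hack.
threshold = 0.5 * ((hack_train @ v_hack).mean() + (clean_train @ v_hack).mean())
held_out_acc = separation_accuracy(v_hack, hack_held, clean_held, threshold)

W = rng.normal(size=(128, d))  # stand-in for one weight matrix, shape (d_out, d_in)
v_denoised = svd_denoise(v_hack, W, m=16)
```

With `m` equal to the full input dimension the projection is the identity, so `svd_denoise` returns the input direction unchanged. That gives a quick check that the SVD step is wired correctly before sweeping m.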
+
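The projection rule in step 5 can be sketched as a standalone function. This assumes g and v_hack are flat vectors of the same dimension and that v_hack is unit-norm. How an activation-space direction is mapped onto the trainable parameters is not specified in the plan and is outside this sketch. The function name and the `preserve_norm` flag (which corresponds to ablation arm d) are hypothetical, and NumPy stands in here for the tensors a verl training loop would use.

```python
import numpy as np


def project_out_hack(g, v_hack, preserve_norm=True, eps=1e-12):
    """Return (g_prime, cos_align) following step 5; v_hack must be unit-norm."""
    g_norm = np.linalg.norm(g)
    if g_norm < eps:
        return g.copy(), 0.0                  # zero gradient: nothing to project
    cos_align = float(g @ v_hack) / g_norm
    if cos_align <= 0:
        return g.copy(), cos_align            # not moving toward the hack: no change
    g_prime = g - cos_align * g_norm * v_hack  # remove the component along v_hack
    if preserve_norm:
        p_norm = np.linalg.norm(g_prime)
        if p_norm > eps:
            g_prime = g_prime * (g_norm / p_norm)  # restore the original magnitude
    return g_prime, cos_align


v = np.array([1.0, 0.0, 0.0])
g_toward = np.array([3.0, 4.0, 0.0])   # positive alignment with v
g_away = np.array([-3.0, 4.0, 0.0])    # negative alignment with v

projected, cos_pos = project_out_hack(g_toward, v)
untouched, cos_neg = project_out_hack(g_away, v)
```

Returning `cos_align` alongside the projected gradient makes it cheap to log the diagnostic trajectory listed in step 7.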
+## Why measure the trade-off, not just hack rate
+
+You raised this directly: "a model that learns none will not cheat."
+Correct: hack rate=0 with pass rate=0 is trivially achievable by tanking
+training. The right metric is the *Pareto frontier* of (hack rate, pass rate),
+not either alone.
+
+- Pure hack rate: rewards undertraining
+- Pure pass rate: rewards anything that improves coding, including via the hack
+- Hack vs pass scatter: shows whether the method moves below-and-to-the-right of
+  vanilla (less hack at the same or higher pass) or just down-left (less of
+  everything)
+
+The published claim should be: "at matched pass rate ±5pp on held-out problems
+without write access, our method reduces hack rate from X% to Y%."
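One way to make the scatter reading concrete is to classify each arm's point relative to vanilla. A minimal sketch, assuming each run is summarized as a (hack rate, pass rate) pair in percent and using the ±5pp matching tolerance from the claim above. The function name, the labels, and the example numbers are hypothetical.

```python
def compare_to_vanilla(vanilla, method, pass_tol=5.0):
    """Classify a method's (hack_rate, pass_rate) point relative to vanilla."""
    d_hack = method[0] - vanilla[0]   # negative means less hacking
    d_pass = method[1] - vanilla[1]   # negative means worse legitimate pass rate
    if d_hack < 0 and d_pass >= -pass_tol:
        return "less hack at matched or better pass"
    if d_hack < 0:
        return "less of everything"
    return "no hack reduction"


# Hypothetical points for illustration only, not results.
vanilla_point = (60.0, 30.0)
good = compare_to_vanilla(vanilla_point, (20.0, 28.0))
undertrained = compare_to_vanilla(vanilla_point, (0.0, 0.0))
```

The undertrained point has the lowest possible hack rate yet lands in the "less of everything" bucket, which is the case a hack-rate-only metric would wrongly reward.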
+
+## Compute estimate
+
+- Single run on 96GB RTX 6000: ~2-3 hours (Qwen3.5-2B, num_gen=8, 200 steps)
+- 13-15 runs: 40-50 hours
+- At ~$3 AUD/hr: ~$120-150 AUD
+- Plus debugging/iteration buffer: budget ~$200-250 AUD total
+- Calendar time: ~1 week if running back-to-back; 2-3 weeks with iteration
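The budget arithmetic can be checked directly from the per-run figures. A small sketch using only the numbers listed above. The function name is hypothetical.

```python
def budget(runs, hours_per_run, aud_per_hour=3.0):
    """Return (total GPU hours, total cost in AUD) for a batch of runs."""
    hours = runs * hours_per_run
    return hours, hours * aud_per_hour


low = budget(13, 2)    # (26, 78.0): fewest runs, fastest runs
mid = budget(13, 3)    # (39, 117.0)
high = budget(15, 3)   # (45, 135.0): most runs, slowest runs
```

The 40-50 hour and $120-150 figures in the plan correspond roughly to the 3-hours-per-run case with some rounding up. Runs that finish closer to 2 hours would come in well under budget.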
+
+## Risks and decision points
+
+- **H4 falsified (no hack emergence at 2B):** swap to Qwen3-4B with
+  num_generations=4 and batch=64. Adds ~2x to per-run time
+- **verl doesn't run on single 96GB:** fall back to TRL GRPOTrainer with manual
+  reimplementation of Nanda's reward function. Higher engineering cost
+- **v_hack steering check fails:** extraction is broken. Diagnose layer
+  choice, pair quality, or SVD truncation before training runs
+- **All methods tie vanilla on hack rate:** the intervention is probably not
+  biting. Check the cos_align logs to confirm that the projection fires
+  (cos_align > 0) and actually changes the trajectory
+
+## What this is not
+
+- Not a claim that gradient projection solves reward hacking generally
+- Not a comparison to monitor-based methods (those are Nanda's territory,
+  cite their numbers, don't re-run)
+- Not a claim about hacks beyond `run_tests()` overwrite
+- Not a replacement for an RLHF safety pipeline; this is a targeted intervention