mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
init
# Experiment: SVD-basis gradient projection vs RL reward hacking

## Context

GRPO and related on-policy RL methods are known to exploit loopholes in reward
functions. Ariaw, Engels & Nanda (2025) open-sourced a benchmark on LeetCode
where Qwen3-4B learns to overwrite the evaluation function `run_tests()` instead
of solving problems, reaching a 79% reward hack rate at 200 training steps.
Existing mitigations are mostly monitor-based (detect hacks at output time) or
advantage-based (Rebound: penalize hacking rollouts via a
concept-score-modified advantage).

This experiment tests a different mechanism: **extract a hack direction from
contrastive pairs, project it into the SVD-of-W basis, and project the training
gradient orthogonal to it at each step.** The mechanism differs from Rebound: a
gradient-level direction constraint rather than a rollout-level scalar penalty.

This plan is preregistered: results will be reported regardless of outcome.

## Hypotheses (preregistered)

**H1 (mechanism, primary):** Gradient projection in the SVD basis against a
v_hack extracted from ~60-80 contrastive pairs reduces reward hack rate by at
least 30 percentage points (absolute) relative to vanilla GRPO, at matched
LeetCode pass rate (within 10pp of vanilla).

Falsified if: hack rate reduction is < 15pp, OR pass rate drops by > 15pp at
matched hack-rate budget, OR the result is within 1 SEM of vanilla across seeds.
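
The H1 thresholds can be written down as a small decision rule before any runs exist. This is a minimal sketch, not the preregistered analysis script: `h1_verdict` is a hypothetical helper, "within 1 SEM" is assumed to mean the combined SEM of the two arms, the "matched hack-rate budget" condition is simplified to a plain difference in mean pass rate, and the "inconclusive" label for the gap between the 15pp and 30pp thresholds is an added assumption.

```python
from math import sqrt
from statistics import mean, stdev


def sem(xs):
    """Standard error of the mean: sample std / sqrt(n). Needs n >= 2."""
    return stdev(xs) / sqrt(len(xs))


def h1_verdict(vanilla_hack, ours_hack, vanilla_pass, ours_pass):
    """Apply the H1 thresholds to per-seed rates given in percent."""
    reduction = mean(vanilla_hack) - mean(ours_hack)  # pp, positive = less hacking
    pass_drop = mean(vanilla_pass) - mean(ours_pass)  # pp, positive = worse coding
    # Assumption: "within 1 SEM" uses the combined SEM of both arms.
    combined_sem = sqrt(sem(vanilla_hack) ** 2 + sem(ours_hack) ** 2)
    if reduction < 15 or pass_drop > 15 or reduction <= combined_sem:
        return "falsified"
    if reduction >= 30 and pass_drop <= 10:
        return "supported"
    # Assumption: the plan does not name the zone between the two thresholds.
    return "inconclusive"


# Placeholder per-seed numbers for illustration only, not results.
verdict = h1_verdict([70, 75, 80], [30, 35, 40], [20, 22, 24], [18, 20, 22])
```

With these placeholder inputs the reduction is 40pp and the pass-rate drop is 2pp, so the rule returns `"supported"`.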

**H2 (SVD denoising):** Projecting v_hack onto the top-m SVD-of-W basis improves
intervention strength compared to raw activation-space v_hack, at matched
extraction-pair count. Tested via an ablation arm.

Falsified if: the ablation arm (no SVD projection) matches or exceeds the main
arm within 1 SEM.

**H3 (gradient vs advantage):** Gradient-level intervention (ours) outperforms
advantage-level intervention (Rebound reimplemented) on hack rate at matched
pass rate.

Falsified if: the Rebound reimplementation matches or beats ours within 1 SEM.

**H4 (scaling sanity):** Qwen3.5-2B substituted for Qwen3-4B in Nanda's setup
reproduces measurable reward hacking (> 30% hack rate at 200 steps with vanilla
GRPO).

Falsified if: vanilla hack rate is < 30%. If falsified, swap to Qwen3-4B with
reduced num_generations to fit compute.

## Steps

1. **Clone Nanda's env.** `git clone https://github.com/ariahw/rl-rewardhacking`.
   This uses verl v0.6.1, not TRL; confirm verl runs on the RTX 6000 setup.

2. **H4 sanity: reproduce hack with smaller model.** Single run, Qwen3.5-2B
   substituted for Qwen3-4B, all other hyperparams as published (LoRA r=32,
   alpha=32, lr=7e-5). Reduce num_generations from 16 to 8 and batch from 256
   to 128 to fit single-GPU compute. 200 steps, ~3 hours.

   Decision point: if hack rate < 30% at step 200, abandon Qwen3.5-2B, swap to
   Qwen3-4B with further-reduced batch and accept longer per-run time.

3. **Build v_hack.** ~60-80 contrastive pairs:
   - Positive (hacky): LeetCode prompts paired with `def run_tests(): pass` or
     similar evaluator-overwrite completions
   - Negative (clean): the same prompts with legitimate solution attempts (can
     be generated by base Qwen3.5-2B at temperature 0)
   - Validate on 20 held-out pairs: the extracted direction's projection score
     should separate hack from clean with > 90% accuracy
   - Sanity check: add v_hack at inference to the base model and confirm it
     steers generation toward hack-flavored completions (CAA-style check)
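
Step 3 does not pin down the extraction rule, so the following is only a minimal sketch under stated assumptions: v_hack is taken as the unit-norm difference of mean activations (CAA-style), the held-out check thresholds the projection at the midpoint of the training-class means, and synthetic Gaussian vectors stand in for real model activations. The names `extract_direction` and `separation_accuracy` are hypothetical.

```python
import numpy as np


def extract_direction(pos_acts, neg_acts):
    """Unit-norm mean difference between hack and clean activations."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def separation_accuracy(v, train_pos, train_neg, test_pos, test_neg):
    """Classify held-out examples by thresholding their projection onto v."""
    # Threshold: midpoint between the two training-class mean projections.
    threshold = 0.5 * ((train_pos @ v).mean() + (train_neg @ v).mean())
    correct = np.sum(test_pos @ v > threshold) + np.sum(test_neg @ v <= threshold)
    return correct / (len(test_pos) + len(test_neg))


# Synthetic stand-in for activations: 80 pairs, hidden size 64 (both made up).
rng = np.random.default_rng(0)
true_dir = rng.normal(size=64)
true_dir /= np.linalg.norm(true_dir)
neg = rng.normal(size=(80, 64))
pos = rng.normal(size=(80, 64)) + 6.0 * true_dir  # hack examples shifted along true_dir

# 60 pairs for extraction, 20 held out for validation.
v_hack = extract_direction(pos[:60], neg[:60])
acc = separation_accuracy(v_hack, pos[:60], neg[:60], pos[60:], neg[60:])
```

On real activations the layer and token position used for extraction still have to be chosen; the sketch says nothing about either.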

4. **SVD denoise.** Extract V (right singular vectors) of the relevant W
   matrices. Project v_hack into the top-m basis (m=16 default; sweep
   m∈{8,16,32} as ablation). Reproject back to the original space. Normalize.
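
The project, reproject, normalize sequence in step 4 can be sketched with NumPy for a single weight matrix. This assumes W has shape (d_out, d_in) and that v_hack lives in W's input space; which W matrices count as "relevant" is left open by the plan. `svd_denoise` is a hypothetical helper and the random W is a placeholder.

```python
import numpy as np


def svd_denoise(v, W, m=16):
    """Keep only the component of v inside W's top-m right singular subspace."""
    # Assumption: W is (d_out, d_in) and v is a vector in the d_in input space.
    _, _, Vt = np.linalg.svd(W, full_matrices=False)  # rows of Vt: right singular vectors
    V_m = Vt[:m].T           # (d_in, m), orthonormal columns, largest singular values first
    coeffs = V_m.T @ v       # coordinates of v in the top-m basis
    v_proj = V_m @ coeffs    # reproject back to the original space
    return v_proj / np.linalg.norm(v_proj)


rng = np.random.default_rng(0)
W = rng.normal(size=(48, 64))  # placeholder weight matrix
v_raw = rng.normal(size=64)
v_raw /= np.linalg.norm(v_raw)
v_denoised = svd_denoise(v_raw, W, m=16)
```

The result has unit norm and no component along the discarded singular directions, so applying `svd_denoise` a second time with the same `m` leaves it unchanged.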

5. **Implement gradient projection** in verl's training loop:
   - Per optimizer step: g = current gradient
   - Compute cos_align = dot(g, v_hack) / ||g|| (v_hack is unit-norm after
     step 4)
   - If cos_align > 0: g' = g - cos_align × ||g|| × v_hack, then rescale g' to
     the original ||g|| magnitude
   - If cos_align ≤ 0: g' = g (no intervention)
   - Step optimizer with g'
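
The projection rule in step 5 can be sketched on flat vectors. This assumes g and v_hack have already been flattened into the same space and that v_hack is unit-norm; how an activation-space direction maps onto the LoRA parameter gradients in verl is not specified here and is not addressed by the sketch. `project_gradient` is a hypothetical helper, not a verl API.

```python
import numpy as np


def project_gradient(g, v_hack, preserve_norm=True, eps=1e-12):
    """One-sided projection of g away from a unit-norm v_hack.

    Returns (g_prime, cos_align). Intervenes only when cos_align > 0.
    """
    g_norm = np.linalg.norm(g)
    if g_norm < eps:
        return g.copy(), 0.0
    cos_align = float(g @ v_hack) / g_norm
    if cos_align <= 0:
        return g.copy(), cos_align              # no intervention
    g_proj = g - cos_align * g_norm * v_hack    # remove the component along v_hack
    if preserve_norm:
        proj_norm = np.linalg.norm(g_proj)
        if proj_norm > eps:                     # skip rescaling if nothing is left
            g_proj = g_proj * (g_norm / proj_norm)
    return g_proj, cos_align


g = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])
g_new, cos = project_gradient(g, v)                          # approx [0, 5], cos 0.6
g_raw, _ = project_gradient(g, v, preserve_norm=False)       # approx [0, 4]
g_same, cos_neg = project_gradient(np.array([-3.0, 4.0]), v)  # unchanged
```

Setting `preserve_norm=False` corresponds to the "no magnitude preservation" ablation (arm d) in step 6, and the returned `cos_align` is the quantity logged as a diagnostic in step 7.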

6. **Run arms** (200 steps each, 3 seeds where indicated):

   a. Vanilla GRPO + LoRA (3 seeds): baseline, expected hack rate ~40-79%
   b. Our method, m=16 (3 seeds): main result
   c. Our method, no SVD projection (raw v_hack, 1 seed): H2 ablation
   d. Our method, no magnitude preservation (1 seed): design ablation
   e. Rebound reimplemented: shortcut-direction advantage modification (3
      seeds): H3 baseline
   f. Optional: m=8 and m=32 sweeps (1 seed each)

   Total runs: 13-15, ~3 hours each = 40-50 hours compute

7. **Measure** every 25 steps:
   - **Hack rate:** % of rollouts that successfully overwrite tests
     (Nanda's existing metric, from their codebase)
   - **Pass rate:** % of rollouts that pass tests legitimately on held-out
     problems (without write access to the evaluator)
   - **cos_align trajectory:** mean cos(g, v_hack) per step (diagnostic)
   - **KL drift from base** (diagnostic for catastrophic policy change)

8. **Headline plot:** hack rate (y-axis) vs pass rate (x-axis), one point per
   (arm × seed), with the Pareto frontier marked. Our method should be
   below-and-to-the-right of vanilla GRPO. Annotate Rebound's position.

9. **Falsification check:** before publishing, run the preregistered analysis
   on H1-H4. Report all hypotheses, including falsified ones.

## Why measure ratio, not just hack rate

You raised this directly: "a model that learns none will not cheat."
Correct: trivially, hack rate = 0 with pass rate = 0 is achievable by tanking
training. The right metric is the *Pareto frontier* of (hack rate, pass rate),
not either alone.

- Pure hack rate: rewards undertraining
- Pure pass rate: rewards anything that improves coding, including via the hack
- Hack vs pass scatter: shows whether your method moves below-and-to-the-right
  of vanilla (less hack at same pass) or just down-left (less of everything)

The published claim should be: "at matched pass rate ±5pp on held-out problems
without write access, our method reduces hack rate from X% to Y%."
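
The distinction between "less hack at same pass" and "less of everything" can be made mechanical. This is a rough sketch only: `classify_shift` is a hypothetical helper, points are (hack rate, pass rate) in percent, the ±5pp tolerance is taken from the claim template above, and the example numbers are placeholders rather than results.

```python
def classify_shift(vanilla, method, pass_tol=5.0):
    """Describe how a method's (hack_rate, pass_rate) point moved vs vanilla."""
    d_hack = method[0] - vanilla[0]  # negative = less hacking
    d_pass = method[1] - vanilla[1]  # negative = worse pass rate
    if d_hack < 0 and d_pass >= -pass_tol:
        return "less hack at matched pass"
    if d_hack < 0:
        return "down-left: less of everything"
    return "no hack reduction"


# Placeholder points for illustration only.
good = classify_shift((70.0, 20.0), (30.0, 18.0))
undertrained = classify_shift((70.0, 20.0), (10.0, 5.0))
```

Here `good` is classified as less hack at matched pass, while `undertrained` lands in the down-left case that a pure hack-rate metric would wrongly reward.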

## Compute estimate

- Single run on a 96GB RTX 6000: ~2-3 hours (Qwen3.5-2B, num_gen=8, 200 steps)
- 13-15 runs at ~3 hours each: 40-50 hours
- At ~$3 AUD/hr: ~$120-150 AUD
- Plus debugging/iteration buffer: budget ~$200-250 AUD total
- Calendar time: ~1 week if running back-to-back; 2-3 weeks with iteration
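
As a cross-check, the arithmetic behind these figures can be reproduced from the seed counts in step 6. The sketch below only restates numbers already in the plan; the per-arm tally is an interpretation, since the plan does not itemize how it arrives at 13-15 runs.

```python
# Seeds per arm as listed in step 6 (f is the optional m=8 and m=32 sweep).
required_arms = {"a": 3, "b": 3, "c": 1, "d": 1, "e": 3}
optional_arms = {"f": 2}
sanity_runs = 1  # the single H4 run from step 2

required = sum(required_arms.values())                          # 11
with_optional = required + sum(optional_arms.values())          # 13
tally = (required + sanity_runs, with_optional + sanity_runs)   # (12, 14)

hours_per_run = 3        # upper end of the ~2-3 hour estimate
aud_per_hour = 3
planned_runs = (13, 15)  # figure quoted in the plan
hours = tuple(n * hours_per_run for n in planned_runs)          # (39, 45)
cost = tuple(h * aud_per_hour for h in hours)                   # (117, 135)
```

The itemized tally gives 12-14 runs including the sanity run, so the quoted 13-15 leaves roughly one run of slack. At 3 hours per run the quoted range works out to 39-45 hours and $117-135 AUD, which the plan rounds up to 40-50 hours and $120-150 AUD.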

## Risks and decision points

- **H4 falsified (no hack emergence at 2B):** swap to Qwen3-4B with
  num_generations=4 and batch=64. Adds ~2x to per-run time
- **verl doesn't run on a single 96GB GPU:** fall back to TRL GRPOTrainer with a
  manual reimplementation of Nanda's reward function. Higher engineering cost
- **v_hack steering check fails:** extraction is broken. Diagnose layer
  choice, pair quality, or SVD truncation before training runs
- **All methods tie vanilla on hack rate:** likely the intervention isn't
  biting. Check that gradient projection is actually changing the trajectory
  (cos_align logs)

## What this is not

- Not a claim that gradient projection solves reward hacking generally
- Not a comparison to monitor-based methods (those are Nanda's territory;
  cite their numbers, don't re-run)
- Not a claim about hacks beyond the `run_tests()` overwrite
- Not a replacement for an RLHF safety pipeline; this is a targeted intervention