wassname
2026-05-23 10:40:02 +08:00
parent 7248d469a7
commit 120400c5f5
17 changed files with 7482 additions and 0 deletions
@@ -0,0 +1,30 @@
# projected_grpo
SVD-basis gradient projection vs RL reward hacking. Tests whether projecting
the training gradient orthogonal to an extracted hack direction (in the SVD-of-W
basis) reduces the reward-hack rate in GRPO without tanking the pass rate.
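
A minimal NumPy sketch of one way to read "projecting the gradient orthogonal to a hack direction in the SVD-of-W basis". It is illustrative only and is not taken from this repo: the function name `project_out_hack_direction`, the choice of a full SVD, and treating `v_hack` as a matrix of coefficients over the singular-vector basis are all assumptions; see the spec for the actual definition.

```python
import numpy as np


def project_out_hack_direction(grad, weight, v_hack):
    """Remove the component of `grad` along `v_hack`, in the SVD basis of `weight`.

    Hypothetical helper: `v_hack` is assumed to have the same shape as `weight`
    and to be expressed in the coordinates U^T (.) V of that weight's SVD.
    """
    # Full SVD so that U and V are square orthogonal matrices (lossless change of basis).
    U, _, Vt = np.linalg.svd(weight, full_matrices=True)
    coeffs = U.T @ grad @ Vt.T                    # gradient coordinates in the SVD basis
    v = v_hack / np.linalg.norm(v_hack)           # unit Frobenius norm
    coeffs = coeffs - np.sum(coeffs * v) * v      # subtract the component along v
    return U @ coeffs @ Vt                        # map back to parameter space


# Toy shapes and random values, purely for illustration.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
G = rng.normal(size=(4, 3))
v_hack = rng.normal(size=(4, 3))
G_proj = project_out_hack_direction(G, W, v_hack)
```

Under these assumptions the projected gradient has no remaining component along `v_hack`, and applying the projection a second time leaves it unchanged.
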
Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
"Advantage Modification") by intervening at the gradient level rather than the
advantage level.
See [docs/spec.md](docs/spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
and [docs/papers/](docs/papers/).
## Quick start
```bash
uv sync
just fast-dev-run # tiny-random model, ~1-2 min, real pipeline end-to-end
just smoke-vanilla # vanilla pathway smoke
just smoke-projected # projected pathway smoke
just download-model # warm Qwen3.5-2B cache (then real runs need 96GB GPU)
just queue # queue all sweep arms via pueue (on the GPU box)
```
## Hypotheses (preregistered)
See [docs/spec.md](docs/spec.md). Headline H1: gradient projection in the SVD
basis against a v_hack extracted from ~60-80 contrastive pairs reduces the
reward-hack rate by >=30pp absolute vs vanilla GRPO at matched LeetCode pass
rate (±10pp).
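
As a sketch of the arithmetic behind H1, the check below assumes rates are given in percent and reads "matched pass rate (±10pp)" as an absolute pass-rate gap of at most 10 percentage points. The function name `h1_supported` and the example numbers are hypothetical, not results from this repo.

```python
def h1_supported(vanilla_hack, projected_hack, vanilla_pass, projected_pass,
                 min_drop_pp=30.0, pass_tol_pp=10.0):
    """Return True if the H1 headline criterion holds (all rates in percent)."""
    hack_drop = vanilla_hack - projected_hack        # absolute drop in percentage points
    pass_gap = abs(projected_pass - vanilla_pass)    # "matched" means within the tolerance
    return hack_drop >= min_drop_pp and pass_gap <= pass_tol_pp


# Made-up numbers: hack rate 60% -> 25% (35pp drop), pass rate 50% -> 45% (5pp gap).
print(h1_supported(60.0, 25.0, 50.0, 45.0))
```
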