setup

2026-06-27 17:30:41 +08:00 · 2026-05-23 10:40:02 +08:00
parent 7248d469a7
commit 120400c5f5
17 changed files with 7482 additions and 0 deletions
@@ -0,0 +1,30 @@
+# projected_grpo
+
+SVD-basis gradient projection vs RL reward hacking. Tests whether projecting
+the training gradient orthogonal to an extracted hack direction (in the SVD-of-W
+basis) reduces the reward-hack rate in GRPO without tanking the pass rate.
+
+Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
+LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
+"Advantage Modification") by intervening at the gradient level rather than the
+advantage level.
+
+See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
+and [docs/papers/](docs/papers/).
+
+## Quick start
+
+```bash
+uv sync
+just fast-dev-run        # tiny-random model, ~1-2 min, real pipeline end-to-end
+just smoke-vanilla       # vanilla pathway smoke
+just smoke-projected     # projected pathway smoke
+just download-model      # warm Qwen3.5-2B cache (then real runs need 96GB GPU)
+just queue               # queue all sweep arms via pueue (on the GPU box)
+```
+
+## Hypotheses (preregistered)
+
+See [spec.md](spec.md). Headline (H1): gradient projection in the SVD basis against
+a v_hack extracted from ~60-80 contrastive pairs reduces the reward-hack rate by at
+least 30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).
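
The README describes projecting the gradient orthogonal to a hack direction in the SVD-of-W basis but does not show the operation. Below is a minimal NumPy sketch of one way this could look. It is not taken from the commit. It assumes that `v_hack` is a vector with one coefficient per singular direction of `W`, and that only the gradient's coefficients along those singular directions are projected. The function name `project_out_hack_direction` is hypothetical, and the repository's actual parameterization may differ.

```python
import numpy as np


def project_out_hack_direction(grad, weight, v_hack):
    """Remove the v_hack component from grad, measured in the SVD basis of weight.

    Assumption (not from the commit): v_hack has one entry per singular
    direction of `weight`, i.e. length min(weight.shape).
    """
    # Reduced SVD: weight = u @ diag(s) @ vt, with orthonormal columns in u
    # and orthonormal rows in vt.
    u, _, vt = np.linalg.svd(weight, full_matrices=False)

    # Gradient coefficient along each singular direction: c_r = u_r^T @ grad @ v_r.
    coeffs = np.einsum("ir,ij,rj->r", u, grad, vt)

    # Component of those coefficients that lies along the (normalized) hack direction.
    unit = v_hack / np.linalg.norm(v_hack)
    hack_part = np.dot(coeffs, unit) * unit

    # Map the hack component back to weight space and subtract it.
    # (u * hack_part) scales column r of u by hack_part[r], i.e. u @ diag(hack_part).
    return grad - (u * hack_part) @ vt


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weight = rng.normal(size=(6, 4))
    grad = rng.normal(size=(6, 4))
    v_hack = rng.normal(size=4)

    projected = project_out_hack_direction(grad, weight, v_hack)

    # The projected gradient should have no remaining component along v_hack.
    u, _, vt = np.linalg.svd(weight, full_matrices=False)
    residual = np.dot(np.einsum("ir,ij,rj->r", u, projected, vt), v_hack)
    print(f"residual along v_hack: {residual:.2e}")
```

In this sketch the projection is idempotent: applying it a second time leaves the gradient unchanged, because the hack component has already been removed. Every part of the gradient outside the span of the paired singular directions passes through untouched.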