mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
setup
# projected_grpo

SVD-basis gradient projection vs RL reward hacking. Tests whether projecting
the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W
basis) reduces reward-hack rate in GRPO without tanking pass rate.
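
The core operation described above, removing the component of a gradient that lies along a hack direction, can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes the gradient and `v_hack` are already flattened and expressed in the same coordinates (for example, SVD-of-W coordinates), and the names `dot` and `project_out` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences of floats."""
    return sum(x * y for x, y in zip(a, b))


def project_out(grad, v_hack, eps=1e-12):
    """Return grad with its component along v_hack removed.

    Assumption: grad and v_hack are flat vectors in the same basis.
    v_hack does not need to be unit-norm; we divide by its squared norm.
    """
    norm_sq = dot(v_hack, v_hack)
    if norm_sq < eps:
        # Degenerate direction: nothing to project out.
        return list(grad)
    coeff = dot(grad, v_hack) / norm_sq
    return [g - coeff * v for g, v in zip(grad, v_hack)]


# Toy example with made-up numbers (not experimental data).
grad = [3.0, 4.0, 0.0]
v_hack = [1.0, 0.0, 0.0]
projected = project_out(grad, v_hack)
print(projected)               # [0.0, 4.0, 0.0]
print(dot(projected, v_hack))  # 0.0
```

After projection the update has zero inner product with `v_hack`, so a gradient step no longer moves the parameters along that direction, while the remaining components are left untouched.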

Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
"Advantage Modification") by intervening at the gradient level rather than the
advantage level.

See [docs/spec.md](spec.md), [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md),
and [docs/papers/](docs/papers/).

## Quick start

```bash
uv sync
just fast-dev-run        # tiny-random model, ~1-2 min, real pipeline end-to-end
just smoke-vanilla       # vanilla pathway smoke
just smoke-projected     # projected pathway smoke
just download-model      # warm Qwen3.5-2B cache (then real runs need 96GB GPU)
just queue               # queue all sweep arms via pueue (on the GPU box)
```

## Hypotheses (preregistered)

See [spec.md](spec.md). Headline (H1): gradient projection in the SVD basis against
a v_hack extracted from ~60-80 contrastive pairs reduces the reward-hack rate by
at least 30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).
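
The H1 criterion can be written as a small check. This is a sketch under stated assumptions: the function name `h1_supported` and its parameters are hypothetical, all rates are percentages, and the example numbers are invented for illustration rather than taken from any run.

```python
def h1_supported(hack_vanilla, hack_projected, pass_vanilla, pass_projected,
                 min_drop_pp=30.0, pass_tol_pp=10.0):
    """Check the H1 criterion on rates given in percent.

    True when the hack rate drops by at least min_drop_pp percentage points
    and the pass rate stays within pass_tol_pp points of vanilla GRPO.
    """
    hack_drop = hack_vanilla - hack_projected
    pass_gap = abs(pass_projected - pass_vanilla)
    return hack_drop >= min_drop_pp and pass_gap <= pass_tol_pp


# Hypothetical numbers only.
print(h1_supported(70.0, 35.0, 40.0, 36.0))  # True: 35pp drop, 4pp pass gap
print(h1_supported(70.0, 50.0, 40.0, 36.0))  # False: only a 20pp drop
```

Both conditions must hold at once: a large drop in hack rate does not count if the pass rate moves more than 10pp from the vanilla arm.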