wassname 6f68ba34b6 Match paper effective batch + fix gt_tests/KeyError, strip stale docstring
Re-audited our setup vs ariahw 2025 (paper body + config.py + dataset):

- gt_tests: was [:5] of median-102 ground-truth asserts. The hardcode loophole
  let a model pass 5 cherry-picked answers, score gt_pass=True, and never be
  flagged as a hack -- inflating PASS_RATE and hiding hacking. Now uses all
  asserts (free: rewards.py runs them in one subprocess).
- n_problems 500 -> 992 (full filtered set, paper fn.9).
- prompts_per_step 8 -> 43: grad-accum to ~258 generations/step ~= paper's
  effective batch of 256 (16 prompts x 16 gen). At our VRAM-capped G=6 this is
  the only lever; same peak VRAM, ~5x wall-time. Makes "our step N" comparable
  to the paper's step N in gradient-sample terms.
- KeyError fix: end-of-run summary read r["rollouts"]/r["gt_pass"] but row keys
  are "N"/"gt". Every run crashed at step 200 before saving; no .pt had ever
  been written.
- Stripped stale module docstring (claimed beta=0.04 vs actual 1e-3, Qwen3.5-2B
  vs Qwen3-4B, duplicated preset table) -> points to PRESETS as source of truth.

justfile: probe-full-seed now launches 4 dependent pueue tasks (extract ->
verify -> vanilla -> projected) instead of one monolithic job, so a stage crash
no longer blocks the rest and each gate is independently inspectable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 09:25:47 +00:00
2026-05-23 10:40:02 +08:00
2026-05-23 13:04:03 +08:00
2026-05-23 10:40:02 +08:00
2026-05-23 10:22:54 +08:00

projected_grpo

SVD-basis gradient projection vs RL reward hacking. Tests whether projecting the training gradient orthogonal to an extracted hack-direction (in the SVD-of-W basis) reduces reward-hack rate in GRPO without tanking pass rate.

Built on Ariahw, Engels & Nanda's rl-rewardhacking LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026, "Advantage Modification") by intervening at the gradient level rather than the advantage level.

See docs/spec.md, docs/brainstorm/extracted_prefs.md, and docs/papers/.

Quick start

uv sync
just fast-dev-run        # tiny-random model, ~1-2 min, real pipeline end-to-end
just smoke-vanilla       # vanilla pathway smoke
just smoke-projected     # projected pathway smoke
just download-model      # warm Qwen3-4B cache (full preset peaks ~73GB on 96GB)
just queue-full          # queue extract + 3-seed vanilla + 3-seed projected sweep

See RESEARCH_JOURNAL.md for session-by-session findings, including the 2026-05-23 grader-bug discovery that invalidated all prior gt=0 measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).

Hypotheses (preregistered)

See spec.md. Headline: H1 — gradient projection in SVD basis against a v_hack extracted from ~60-80 contrastive pairs reduces reward hack rate by

=30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).

S
Description
Putting the E in MoE with an evil expert (can initial seeding, cause follow up unwated behaviour to absorb into a MoE)
Readme 90 MiB
Languages
Python 94.2%
Just 5.8%