Rebuild src/ from pseudocode: SVD-basis gradient projection vs GRPO reward hacking

Expand docs/pseudocode/01..07 into a slim, fail-fast src/projected_grpo/ that
passes `just smoke`. Code mirrors the pseudocode (δS/Σ/V names, relu-before-agg
cin/cout, Dr.GRPO unbiased loss). Did not read the original src.

7 modules (~880 LOC):
- rewards.py    grader + 4 loophole modes + hack x mode diagonal self-check (R1)
- problems.py   tiny LeetCode substrate + contrastive pairs (R5)
- antipasto.py  SVD adapter, identity at δS=0 (R2)
- proj.py       erase/route/measure_only projection (R3)
- extract_vhack_grad.py  per-module SVD of paired grad diffs, noise floor (R5)
- train.py      mixed student+teacher GRPO loop, presets smoke/fast/full (R4)
- build_pool.py self-contained frozen teacher-pool fixture

`just smoke-all` PASS (exit 0): erase/none/route trio, grader diagonal clean,
v_hack cache miss->hit, ckpt every-25. Fresh-eyes review: 6/6 mechanics faithful.

Simplifications: merged loopholes+verify_rewards->rewards, pairs->problems; flat
Config + `train.py {preset} [--overrides]` CLI; justfile 384->71 lines; trimmed
results table; token-efficient train logging (config anchor, SHOULD at loop site,
sparse tqdm postfix, BLUF tail with cue + direction-arrow table).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
wassname
2026-05-31 13:58:54 +00:00
commit b0d1bcd3d5
42 changed files with 13411 additions and 0 deletions
+72
View File
@@ -0,0 +1,72 @@
# ── Outer training loop ────────────────────────────────────────────────
# Source: train.py main(). Ties together 01..05. Reads top-to-bottom.
def main(cfg): # cfg: Smoke | Fast | Full preset
model = load(cfg.model)
wrappers = wrap(model) # 01_adapter: SVD, freeze, attach δS knobs
opt = AdamW([δS, δS_hack], lr=cfg.lr, β=(cfg.adam_β1, cfg.adam_β2), wd=cfg.weight_decay)
sched = warmup_cosine(opt, warmup=cfg.warmup_frac * cfg.steps)
# v_hack: load cached, else auto-extract on the already-loaded model (cheap).
# vanilla arm (intervention="none") uses V=None and skips projection entirely.
V_hack = None if cfg.intervention=="none" else postprocess_v_hack(
*resolve_or_extract(model, wrappers, cfg), cfg.v_hack_k, cfg.v_hack_drop_bottom_frac)
problems = load_problems(cfg.n_problems, cfg.env_mode) # hint injected per mode
teacher_pool = load_pool(cfg.teacher_pool_dir) # prompt → cached graded rollouts
G_t = round(cfg.group * cfg.mix_ratio); G_s = cfg.group - G_t
for step in range(cfg.steps):
opt.zero_grad(); g_s, g_t = {}, {}
# ── accumulate grad over P prompts (raises chance ≥1 group has reward variance)
for _ in range(cfg.prompts_per_step):
prob = problems[rng.randint()]
grpo_step(model, prob, teacher_pool, G_s, G_t, cfg.beta, cfg.clip, cfg.unbiased)
accumulate(g_s, g_t) # 05: split student/teacher grads
δS.grad g_s + g_t # combined live GRPO gradient on the knob
# ── intervene (03_project) ─────────────────────────────────────
# cos() = ‖relu(V@g)‖/‖g‖, RELU-BEFORE-AGG (see 03): the hack-ward fraction,
# not a signed sum where anti-hack axes cancel. cin_s on student grad, cin_t
# on teacher grad; discriminator wants cin_t > cin_s.
cin_s = cos(g_s, V_hack); cin_t = cos(g_t, V_hack)
cin, cout, fired = project_all(wrappers, V_hack,
measure_only=(cfg.intervention=="none"),
route=(cfg.intervention=="route"),
gate_mode=cfg.gate_mode, overshoot=cfg.project_overshoot)
gn = clip_grad_norm([δS, δS_hack], cfg.grad_clip)
opt.step(); sched.step()
# ── online refresh: re-extract V against the CURRENT model ──────
# V goes stale fast (cin_t decays ~0.27→0.07 by step 10). Re-extract every
# N steps to track where the student is being pulled NOW. For route, ablate
# the quarantine during extraction (else V rotates off-hack: the routed
# capability has left the observable main path).
if cfg.vhack_refresh_every and step % cfg.vhack_refresh_every == 0:
with ablate_quarantine(wrappers): # δS_hack→0 during extract (no-op for erase)
V_new = postprocess_v_hack(*extract_v_hack(model, wrappers, PAIRS, ...), ...)
# GUARD (blog): log basis_overlap_with_prev = mean|⟨V_new_i, V_old_i⟩|.
# Should sit near 1.0. <~0.2 ⇒ the re-extraction ROTATED the basis off-hack
# (for route, this is the symptom of extracting through a live quarantine) —
# refresh is then HARMFUL, not helpful. Don't refresh blind.
V_hack = V_new
# ── deployment-time eval (route only) ──────────────────────────
# The TRAINING hack curve looks vanilla for route (forward still hacks);
# routing's benefit only shows once the quarantine is ablated. So every N
# steps: zero δS_hack, eval hack/solve on a fixed prompt subset → the
# hack_deploy/solve_deploy series the dynamics plot uses.
if cfg.intervention=="route" and step % cfg.eval_ablate_every == 0:
with ablate_quarantine(wrappers):
hack_deploy, solve_deploy = eval_hack_solve(model, eval_prompts)
log_row(step, rew, hack_s, hack_t, gt_s, gt_t, loss, gn, cin_s, cin_t, cin, cout, fired)
if step % 25 == 0: save_ckpt(δS, δS_hack, rows)
bluf(...) # final: kept-vs-ablated hack/solve, per-mechanism breakdown (substrate)
# Presets (single source of truth on the Config dataclasses):
# smoke: tiny-random-qwen3, CPU, 30 steps, G=2 → `just smoke`, the only gate
# fast : Qwen3-4B, 60 steps, G=8, aggressive Adam (~15 min iteration)
# full : Qwen3-4B, 200 steps, G=6, pp=43 (paper's 256 gens/step), β=1e-3