grpo_proj2/docs/pseudocode/03_project.py
wassname b0d1bcd3d5 Rebuild src/ from pseudocode: SVD-basis gradient projection vs GRPO reward hacking
Expand docs/pseudocode/01..07 into a slim, fail-fast src/projected_grpo/ that
passes `just smoke`. Code mirrors the pseudocode (δS/Σ/V names, relu-before-agg
cin/cout, Dr.GRPO unbiased loss). Did not read the original src.

7 modules (~880 LOC):
- rewards.py    grader + 4 loophole modes + hack x mode diagonal self-check (R1)
- problems.py   tiny LeetCode substrate + contrastive pairs (R5)
- antipasto.py  SVD adapter, identity at δS=0 (R2)
- proj.py       erase/route/measure_only projection (R3)
- extract_vhack_grad.py  per-module SVD of paired grad diffs, noise floor (R5)
- train.py      mixed student+teacher GRPO loop, presets smoke/fast/full (R4)
- build_pool.py self-contained frozen teacher-pool fixture

`just smoke-all` PASS (exit 0): erase/none/route trio, grader diagonal clean,
v_hack cache miss->hit, ckpt every-25. Fresh-eyes review: 6/6 mechanics faithful.

Simplifications: merged loopholes+verify_rewards->rewards, pairs->problems; flat
Config + `train.py {preset} [--overrides]` CLI; justfile 384->71 lines; trimmed
results table; token-efficient train logging (config anchor, SHOULD at loop site,
sparse tqdm postfix, BLUF tail with cue + direction-arrow table).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 14:06:42 +00:00

# ── Gradient projection: erase / route the hack component ──────────────
# Source: proj.py. Once per optimizer step, after δS.grad is populated,
# before opt.step(). The bet: subtract the hack-ward component of the live
# GRPO gradient (basis V from labeled pairs); the model descends what's left,
# so it can't move hack-ward even when the reward rewards it.
def project_one(g, V, gate_mode, preserve_magnitude, overshoot):
    c = V @ g                                        # g ∈ ℝ^r; V ∈ ℝ^{k×r}; c per-axis hack coef
    cos_pre = ‖relu(c)‖ / ‖g‖                        # hack-ward fraction removed ∈ [0,1]
    if   gate_mode == "one_sided": c_use = relu(c)   # default: only zap hack-ward motion
    elif gate_mode == "no_gate":   c_use = c         # sign-agnostic; drives ‖Vg‖→0
    elif gate_mode == "reverse":   c_use = 2*c       # push away from hack: Vg' → -Vg
    removed = overshoot * (c_use @ V)                # ∈ span(V); overshoot=1 removes, 1.1 = 10% reversal
    g_proj = g - removed
    if preserve_magnitude: g_proj *= ‖g‖ / ‖g_proj‖  # keep update size, change only direction
    cos_post = ‖relu(V @ g_proj)‖ / ‖g_proj‖         # residual hack-ward leak
    return g_proj, removed, cos_pre, cos_post

def project_all(wrappers, V_hack, measure_only, route):
    for name, w in wrappers.items():
        if name not in V_hack: continue              # below noise floor → no hack signal → skip
        g_proj, removed, cin, cout = project_one(w.δS.grad, V_hack[name], ...)
        if measure_only: continue                    # vanilla: measure cin/cout, leave g
        w.δS.grad ← g_proj                           # erase: descend the hack-free gradient
        if route: w.δS_hack.grad ← removed           # route: park hack part in deletable knob (Cloud 2024)
    return mean(cin), mean(cout), frac_fired
# arms: none → measure_only | erase → write g_proj | route → erase + park in δS_hack (zero at eval)
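# ── Runnable sketch of project_one (illustrative; not the contents of proj.py) ──
# A minimal NumPy rendering of the pseudocode above, so the three gate modes can be
# checked numerically. Assumptions not stated in the source: V has orthonormal rows
# (as an SVD basis would), g is a flat vector, and a small `eps` guards the divisions.
# The defaults for gate_mode, preserve_magnitude and overshoot are placeholders.

```python
import numpy as np

def project_one(g, V, gate_mode="one_sided", preserve_magnitude=False,
                overshoot=1.0, eps=1e-12):
    """g: (r,) gradient. V: (k, r) hack basis, rows assumed orthonormal."""
    c = V @ g                                   # per-axis hack coefficient, shape (k,)
    g_norm = np.linalg.norm(g)
    cos_pre = np.linalg.norm(np.maximum(c, 0.0)) / (g_norm + eps)
    if gate_mode == "one_sided":
        c_use = np.maximum(c, 0.0)              # remove only hack-ward motion
    elif gate_mode == "no_gate":
        c_use = c                               # remove both signs
    elif gate_mode == "reverse":
        c_use = 2.0 * c                         # reflect: V @ g_proj == -(V @ g)
    else:
        raise ValueError(f"unknown gate_mode: {gate_mode!r}")
    removed = overshoot * (c_use @ V)           # lies in span of the rows of V
    g_proj = g - removed
    if preserve_magnitude:
        # rescale to the original norm; only the direction changes
        g_proj = g_proj * (g_norm / (np.linalg.norm(g_proj) + eps))
    cos_post = np.linalg.norm(np.maximum(V @ g_proj, 0.0)) / (np.linalg.norm(g_proj) + eps)
    return g_proj, removed, cos_pre, cos_post

# Toy inputs: a random orthonormal basis (k=3 axes in r=16 dims) and a random gradient.
rng = np.random.default_rng(0)
r, k = 16, 3
V = np.linalg.qr(rng.normal(size=(r, k)))[0].T  # Q is (r, k); transpose gives (k, r)
g = rng.normal(size=r)

for mode in ("one_sided", "no_gate", "reverse"):
    g_proj, removed, cin, cout = project_one(g, V, gate_mode=mode)
    print(mode, "cin=%.3f" % cin, "cout=%.3f" % cout)
```

# With orthonormal rows, one_sided leaves V @ g_proj = min(c, 0), so cout is zero by
# construction; the printout is a mechanics check, not evidence that hacking is reduced.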
# Gotchas
# - relu BEFORE aggregating, everywhere cosine appears (cin_s/cin_t in 06, health checks in
#   07): we remove relu(c)@V, so cin = ‖relu(c)‖/‖g‖. A signed c.sum() lets anti-hack axes
#   cancel hack-ward ones and under-reports the stripped mass.
# - cout≈0 under one_sided is an arithmetic identity, not a result. Only behavioural hack
#   rate at matched pass, beaten against negative controls, settles whether it worked.
# - AdamW steps on accumulated moment m, which banks hack-ward mass from pre-V / non-split
#   steps, so per-step g-projection can be a silent no-op (cout≈0 while the update still moves
#   hack-ward). Project the update, or purge Adam state in span(V). weight_decay on δS_hack
#   pulls it →0, leaking quarantined capability into δS.
# - preserve_magnitude is erase-then-boost: on a hack-dominated step it renormalises the noise
#   residue back to full ‖g‖. Ablate it; a smaller norm on hack-ward steps may be desired.
# - route ≠ erase across training: the route forward uses δS+δS_hack → different rollouts →
#   different δS trajectory, and ablating δS_hack at deploy can shock.
# - smoke (mechanics, not efficacy): removed ∈ span(V); ‖δS_hack‖>0 iff route.
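# ── Numeric sketch of two gotchas (illustrative; toy numbers, not project results) ──
# The first part shows why relu must come before aggregation: with one hack-ward and one
# anti-hack axis of equal size, a signed sum reports zero while the gated norm does not.
# The second part shows the first-moment issue: a moment m that banked hack-ward mass keeps
# pointing hack-ward even when the live gradient is projected. Assumptions: V has
# orthonormal rows, beta1 = 0.9, and only the first moment is modelled. Adam's
# per-coordinate scaling by the second moment is ignored here, so purging m alone may not
# keep the real update out of span(V); projecting the final update is the other option
# the notes mention.

```python
import numpy as np

rng = np.random.default_rng(1)
r, k = 8, 2
V = np.linalg.qr(rng.normal(size=(r, k)))[0].T     # (k, r), orthonormal rows

# --- relu before aggregating -------------------------------------------------
g = 1.0 * V[0] - 1.0 * V[1]                        # c = V @ g is [1, -1]
c = V @ g
signed = abs(c.sum()) / np.linalg.norm(g)          # axes cancel
gated = np.linalg.norm(np.maximum(c, 0.0)) / np.linalg.norm(g)
print("signed aggregate:", signed, "| relu-gated cin:", gated)

# --- Adam first moment banks hack-ward mass ------------------------------------
beta1 = 0.9
m = 0.5 * V[0]                                     # mass banked before projection started
w = rng.normal(size=r)
w_perp = w - (V @ w) @ V                           # component orthogonal to span(V)
g_live = 0.2 * V[0] + w_perp                       # live gradient with a hack-ward part
g_proj = g_live - np.maximum(V @ g_live, 0.0) @ V  # one_sided projection of the gradient
m_next = beta1 * m + (1.0 - beta1) * g_proj        # first-moment update on projected grad
leak = np.maximum(V @ m_next, 0.0)                 # hack-ward part still inside m
print("hack-ward mass left in m:", leak)

# Purge the moment itself in span(V) (one_sided), as the notes suggest.
m_purged = m_next - np.maximum(V @ m_next, 0.0) @ V
print("after purge:", np.maximum(V @ m_purged, 0.0))
```

# In this toy the projected gradient has no hack-ward component, yet m_next still carries
# most of the banked 0.5 along V[0]; only purging m (or projecting the update) removes it.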