Expand docs/pseudocode/01..07 into a slim, fail-fast src/projected_grpo/ that
passes `just smoke`. Code mirrors the pseudocode (δS/Σ/V names, relu-before-agg
cin/cout, Dr.GRPO unbiased loss). Did not read the original src.
7 modules (~880 LOC):
- rewards.py grader + 4 loophole modes + hack x mode diagonal self-check (R1)
- problems.py tiny LeetCode substrate + contrastive pairs (R5)
- antipasto.py SVD adapter, identity at δS=0 (R2)
- proj.py erase/route/measure_only projection (R3)
- extract_vhack_grad.py per-module SVD of paired grad diffs, noise floor (R5)
- train.py mixed student+teacher GRPO loop, presets smoke/fast/full (R4)
- build_pool.py self-contained frozen teacher-pool fixture
`just smoke-all` PASS (exit 0): erase/none/route trio, grader diagonal clean,
v_hack cache miss->hit, ckpt every-25. Fresh-eyes review: 6/6 mechanics faithful.
Simplifications: merged loopholes+verify_rewards->rewards, pairs->problems; flat
Config + `train.py {preset} [--overrides]` CLI; justfile 384->71 lines; trimmed
results table; token-efficient train logging (config anchor, SHOULD at loop site,
sparse tqdm postfix, BLUF tail with cue + direction-arrow table).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
5.7 KiB
Related work — gradient routing, projection, and confessions vs our method
Compiled 2026-05-31 from full-text reads. Local copies in this dir. The question
for every paper: does it cheat / assume a hack oracle? how does it handle unknown
hacks at train and deploy? Our no-cheat invariant: training never gated on
gt_pass or the full detector suite over student rollouts; the mask is a weak,
self-supervised persona direction; the quarantine is deleted at deploy.
No-cheat scorecard
| paper | train-time signal | hack oracle? | deploy | unknown hacks |
|---|---|---|---|---|
| TDGA (Deng, 2605.25189) | clean warmup only (SVD of clean SFT param-deltas) | no — references only clean data | none; projects onto clean subspace throughout | rejected by construction (any hack is off-clean), but only delays (rank-5/10 hack-free ~200 steps then drift); fixed subspace |
| Gradient Routing (Cloud, 2410.04332, Oct 2024) | user data-label mask (activation detach) | needs labels, works with partial/ad-hoc labels | ablate routed subregion (Expand-Route-Ablate) | absorption — unlabeled forget data localises to the routed region anyway (:352,367); "mechanistic supervision avoids Goodharting" (:422); MNIST split needs L1 (:857,1115) |
| SGTM (Shilov/Cloud, 2512.05648, Dec 2025) | noisy classifier label (parameter-gradient zero-mask on dedicated neurons/heads) | tolerates label noise explicitly | zero the dedicated params | self-reinforcing localisation; leakage measured (§4.3): 0.005-0.02 at 64M w/ 40% undiscovered, shrinks with scale |
| Confessions (OpenAI, 2512.08093) | weak LLM judge on confession honesty (separate rewarded output) | no perfect oracle — judge may be as hackable as the main RM (:59) |
inference-time monitor/reject | generalises OOD (admits untrained misbehaviour) but does not reduce hacking (:265) |
Load-bearing consequence: none need a hack oracle. Cloud's absorption + SGTM's label-noise robustness are the "weak detector for A generalises to suppress unlabeled B" result. The routing+absorption machinery is theirs; what is ours is the signal source and setting (below).
How each routes gradients (one-liner each)
- Cloud (earlier):
x = m*a + (1-m)*a.detach()— activation-level stop-gradient mask; forward value unchanged, backward routed. L1 on the encoding to force the split on unnatural problems. - SGTM (later, the one to copy): for a labeled forget example, zero the gradient to all params except a fixed dedicated subset of MLP neurons + attn heads. Forward keeps the region for everyone; only the backward is masked. No SVD.
- TDGA:
G_clean = U_clean diag(alpha) U_clean^T G, step on the clean component only. Project onto clean, not off hack. - Confessions: no masking. Extra output
y_c; reward it with a judge; reinforce only confession tokens; "seal of confession" decouples it from task reward.
Closest twins / comparison points (not yet quote-verified beyond the abstract)
From search_for_more.md:
- GRIFT (2604.16242) — gradient fingerprints to detect+reject reward hacking. Same level (gradient), same RLVR-loophole setting; their detector is a comparison point for our weak persona mask.
- Rebound / Advantage Modification (2604.01476) — intervenes in the GRPO advantage; we intervene in the gradient. The exact contrast.
- Spilling the Beans (2511.06626) — SFT self-report generalises OOD (train on known errors -> admit hidden objectives). The generalisation story structurally matches ours.
- Baker et al. monitoring (2503.11926) — weak monitor catches strong model, BUT optimising against it breeds obfuscated reward hacking. This is the risk that kills reward-driven routing and argues for mask-only (mechanistic) routing.
What is ours (the testable contribution)
Not the machinery (Cloud/SGTM own route+ablate+absorption), not project-onto-clean (Deng), not reward-for-honesty (OpenAI, and we reject it — it invites Baker's obfuscation and reintroduces a live judge over student rollouts).
Plausibly novel:
- Mask source = a self-supervised persona-contrast direction (grad or act, in the SVD-of-W basis), ~10 pairs, no content classifier, no outcome label, no oracle. Cloud/SGTM route by document/domain labels.
- Setting = RL reward hacking (they do pretrain/SFT content-domain unlearning).
- Scalability argument — the mask is the model's own representation, so it should sharpen with capability; SGTM's own data (leakage shrinks with scale) supports rather than threatens this. Weak-to-strong alignment framing.
Caveat: SGTM's authors are Anthropic alignment, so RL reward hacking is plausibly on their roadmap — lead with the persona-direction-as-mask as the distinctive claim, and report honestly that the routing/absorption mechanism is inherited.
Design implications carried into docs/spec/20260531_routing_v2_distinct_basis.md
- Impose the mask, never reward it (Baker; Cloud
:422). - Distinct basis + additive forward + detach-route (not hard MoE) so unflagged hacks pass through the quarantine and can be absorbed.
- Seed hard (flagged), absorb soft (unflagged) — SGTM's hybrid.
- Expect to add an L1 sparsity aid (Cloud shows it can be necessary).
- Leakage is bounded and shrinks with scale (SGTM §4.3) — the central reason the additive design is viable, and the central risk at our small scale.