grpo_proj2/docs/papers/grad_routing/related_work.md at b0d1bcd3d54ab73d6f56e1af40494c8b35fe454e

mirror of https://github.com/wassname/grpo_proj2.git synced 2026-06-27 15:15:44 +08:00




Compiled 2026-05-31 from full-text reads; local copies are in this directory. The questions for every paper: does it cheat by assuming a hack oracle, and how does it handle unknown hacks at train and deploy time? Our no-cheat invariant: training is never gated on `gt_pass` or on the full detector suite over student rollouts; the mask is a weak, self-supervised persona direction; and the quarantine is deleted at deploy.

No-cheat scorecard

| paper | train-time signal | hack oracle? | deploy | unknown hacks |
|---|---|---|---|---|
| TDGA (Deng, 2605.25189) | clean warmup only (SVD of clean SFT param-deltas) | no: references only clean data | none; projects onto clean subspace throughout | rejected by construction (any hack is off-clean), but only delays (rank-5/10 hack-free ~200 steps then drift); fixed subspace |
| Gradient Routing (Cloud, 2410.04332, Oct 2024) | user data-label mask (activation detach) | needs labels, works with partial/ad-hoc labels | ablate routed subregion (Expand-Route-Ablate) | absorption: unlabeled forget data localises to the routed region anyway (`:352,367`); "mechanistic supervision avoids Goodharting" (`:422`); MNIST split needs L1 (`:857,1115`) |
| SGTM (Shilov/Cloud, 2512.05648, Dec 2025) | noisy classifier label (parameter-gradient zero-mask on dedicated neurons/heads) | tolerates label noise explicitly | zero the dedicated params | self-reinforcing localisation; leakage measured (§4.3): 0.005-0.02 at 64M w/ 40% undiscovered, shrinks with scale |
| Confessions (OpenAI, 2512.08093) | weak LLM judge on confession honesty (separate rewarded output) | no perfect oracle: judge may be as hackable as the main RM (`:59`) | inference-time monitor/reject | generalises OOD (admits untrained misbehaviour) but does not reduce hacking (`:265`) |

Load-bearing consequence: none of the four needs a hack oracle. Cloud's absorption plus SGTM's label-noise robustness together give the "weak detector for A generalises to suppress unlabeled B" result. The routing+absorption machinery is theirs; what is ours is the signal source and setting (below).

How each routes gradients (one-liner each)

Cloud (earlier): `x = m*a + (1-m)*a.detach()`, an activation-level stop-gradient mask; the forward value is unchanged and only the backward pass is routed. An L1 penalty on the encoding forces the split on unnatural problems.
SGTM (later, the one to copy): for a labeled forget example, zero the gradient to all params except a fixed dedicated subset of MLP neurons + attn heads. Forward keeps the region for everyone; only the backward is masked. No SVD.
TDGA: `G_clean = U_clean diag(alpha) U_clean^T G`; step on the clean component only. It projects onto the clean subspace rather than projecting the hack direction out.
Confessions: no masking. Extra output y_c; reward it with a judge; reinforce only confession tokens; "seal of confession" decouples it from task reward.
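
A minimal pure-Python sketch of the three gradient-side mechanisms above, written as plain list arithmetic rather than against any autodiff library. The function names and toy numbers are illustrative assumptions, not code from the papers. The SGTM function covers only the forget-example rule stated above and, as an assumption, passes every other example through unchanged.

```python
def cloud_forward(a, m):
    """Cloud: x = m*a + (1-m)*stopgrad(a). The forward value equals a."""
    return [mi * ai + (1.0 - mi) * ai for mi, ai in zip(m, a)]


def cloud_backward(upstream, m):
    """Backward rule implied by the mask: the stop-gradient term contributes
    nothing, so dL/da = m * dL/dx elementwise."""
    return [mi * gi for mi, gi in zip(m, upstream)]


def sgtm_mask_grads(grads, dedicated, is_forget):
    """SGTM: on a labeled forget example, zero every parameter gradient
    except those whose index is in the fixed dedicated subset.
    Assumption: other examples are returned unchanged."""
    if not is_forget:
        return list(grads)
    return [g if i in dedicated else 0.0 for i, g in enumerate(grads)]


def tdga_project(g, basis, alpha):
    """TDGA: G_clean = U diag(alpha) U^T G, with the orthonormal columns of U
    passed as a list of basis vectors."""
    out = [0.0] * len(g)
    for u, a in zip(basis, alpha):
        coeff = a * sum(ui * gi for ui, gi in zip(u, g))
        out = [o + coeff * ui for o, ui in zip(out, u)]
    return out


# Toy values (hypothetical).
a = [0.2, -1.0, 3.0]
m = [1.0, 0.0, 1.0]                      # 1 = let the gradient through
routed = cloud_backward([1.0, 2.0, 0.5], m)

forget_grads = sgtm_mask_grads([0.3, 0.1, 0.4, 0.2], dedicated={2, 3}, is_forget=True)

clean_basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # rank-2 "clean" subspace
g_clean = tdga_project([3.0, 4.0, 5.0], clean_basis, alpha=[1.0, 1.0])
```

With these inputs the Cloud mask leaves the forward activations untouched and zeroes the middle gradient entry, the SGTM mask keeps only the gradients at indices 2 and 3, and the TDGA projection drops the component of the gradient outside the clean subspace.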

Closest twins / comparison points (not yet quote-verified beyond the abstract)

From search_for_more.md:

GRIFT (2604.16242) — gradient fingerprints to detect+reject reward hacking. Same level (gradient), same RLVR-loophole setting; their detector is a comparison point for our weak persona mask.
Rebound / Advantage Modification (2604.01476) — intervenes in the GRPO advantage; we intervene in the gradient. The exact contrast.
Spilling the Beans (2511.06626) — SFT self-report generalises OOD (train on known errors -> admit hidden objectives). The generalisation story structurally matches ours.
Baker et al. monitoring (2503.11926) — weak monitor catches strong model, BUT optimising against it breeds obfuscated reward hacking. This is the risk that kills reward-driven routing and argues for mask-only (mechanistic) routing.

What is ours (the testable contribution)

Not the machinery (Cloud/SGTM own route+ablate+absorption), not project-onto-clean (Deng), not reward-for-honesty (OpenAI, and we reject it — it invites Baker's obfuscation and reintroduces a live judge over student rollouts).

Plausibly novel:

Mask source = a self-supervised persona-contrast direction (grad or act, in the SVD-of-W basis), ~10 pairs, no content classifier, no outcome label, no oracle. Cloud/SGTM route by document/domain labels.
Setting = RL reward hacking (they do pretrain/SFT content-domain unlearning).
Scalability argument — the mask is the model's own representation, so it should sharpen with capability; SGTM's own data (leakage shrinks with scale) supports rather than threatens this. Weak-to-strong alignment framing.
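
A rough sketch of how a persona-contrast direction could be turned into a mask score. It assumes each pair is already a pair of vectors (gradients or activations) expressed as coordinates in the SVD-of-W basis, and it uses a plain mean difference plus cosine score; `contrast_direction`, `mask_score`, and the toy pairs are hypothetical names and values, not the project's extraction code.

```python
import math


def contrast_direction(pairs):
    """Unit-length mean of (persona_a - persona_b) over the contrast pairs."""
    dim = len(pairs[0][0])
    mean = [0.0] * dim
    for vec_a, vec_b in pairs:
        for i in range(dim):
            mean[i] += (vec_a[i] - vec_b[i]) / len(pairs)
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        raise ValueError("contrast pairs cancel out; no direction to extract")
    return [x / norm for x in mean]


def mask_score(vec, direction):
    """Cosine between a rollout's vector and the persona direction.
    No content classifier, outcome label, or oracle is consulted."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return 0.0
    return sum(v * d for v, d in zip(vec, direction)) / norm


# Two toy pairs standing in for the ~10 pairs mentioned above.
pairs = [
    ([1.0, 0.0, 2.0], [0.0, 0.0, 2.0]),
    ([2.0, 1.0, 0.0], [0.0, 1.0, 0.0]),
]
direction = contrast_direction(pairs)
score = mask_score([3.0, 4.0, 0.0], direction)
```

Both toy pairs differ only along the first coordinate, so the extracted direction is that axis and the score is the cosine of the rollout vector with it.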

Caveat: SGTM's authors are Anthropic alignment researchers, so RL reward hacking is plausibly on their roadmap. Lead with persona-direction-as-mask as the distinctive claim, and report honestly that the routing/absorption mechanism is inherited.

Design implications carried into `docs/spec/20260531_routing_v2_distinct_basis.md`

Impose the mask, never reward it (Baker; Cloud :422).
Distinct basis + additive forward + detach-route (not hard MoE) so unflagged hacks pass through the quarantine and can be absorbed.
Seed hard (flagged), absorb soft (unflagged) — SGTM's hybrid.
Expect to add an L1 sparsity aid (Cloud shows it can be necessary).
Leakage is bounded and shrinks with scale (SGTM §4.3) — the central reason the additive design is viable, and the central risk at our small scale.
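
One reading of the design points above as a scalar toy: a main weight and a quarantine weight add in the forward pass, flagged samples send gradient only to the quarantine, unflagged samples update both, and the quarantine is dropped at deploy. This is a sketch under those assumptions, with hypothetical names and numbers; it is not the spec's implementation and omits the L1 aid.

```python
def forward(x, w_main, w_quar, deploy=False):
    """Additive forward: both branches contribute during training; the
    quarantine branch is deleted at deploy."""
    return w_main * x if deploy else (w_main + w_quar) * x


def routed_grads(x, target, w_main, w_quar, flagged):
    """Squared-error gradients (loss = 0.5 * err**2) with detach-routing.

    Flagged samples: the main branch is treated as detached, so only the
    quarantine weight receives gradient (hard seeding).
    Unflagged samples: both weights receive gradient, so an unflagged hack
    still passes through the quarantine and can be absorbed there.
    """
    err = forward(x, w_main, w_quar) - target
    grad = err * x
    grad_main = 0.0 if flagged else grad
    return grad_main, grad


x, target = 2.0, 1.0
w_main, w_quar = 0.5, 0.25

flagged_grads = routed_grads(x, target, w_main, w_quar, flagged=True)
unflagged_grads = routed_grads(x, target, w_main, w_quar, flagged=False)
deploy_out = forward(x, w_main, w_quar, deploy=True)
```

The mask only decides where the gradient lands; it never enters the loss or the reward, which is the "impose, never reward" constraint in this toy form.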


Related work — gradient routing, projection, and confessions vs our method