Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Routing v2 — distinct-basis quarantine + per-sample persona mask
Date: 2026-05-31. Branch: probe/distill-cosine. Supersedes the additive
direction-split route arm (proj.py route branch) for the generalisation goal.
Goal
Reduce reward-hack rate at deploy on held-out hack modes (modes the persona mask never saw) by routing the live GRPO gradient of flagged rollouts into a functionally-distinct, deletable quarantine adapter, and relying on absorption (Cloud 2410.04332, SGTM 2512.05648) to pull unflagged hacks into the same quarantine. No reward attached to routing; no oracle; mask is a self-supervised persona-contrast direction (~10 pairs).
User's standing hypothesis (2026-05-31): "it already works with grad (run 31 deploy-hack=0 on held-out modes, confounded), and it will work better with the SGTM insights (distinct basis + seeding + scale)."
Context: what we landed on (from the 2026-05-31 brainstorm)
The five load-bearing conclusions, each with its source:
- Don't reward routing; impose it. Reward-driven routing invites the model to game its own flag (drive cosine->0, look clean): that is Baker et al.'s obfuscated reward hacking (search_for_more.md:40). Cloud's "mechanistic supervision avoids Goodharting" (paper_gradient_routing.md:422) works precisely because the mask never enters the behavioural objective. So: mask only, no confession reward. This also dodges the no-cheat boundary (a reading judge over student rollouts) that the confessions path would reintroduce.
- The quarantine must be a distinct basis, not a shared coordinate. Our current h*(delta_S + delta_S_hack) (antipasto.py:83) makes dL/d delta_S == dL/d delta_S_hack identically: the two knobs get the same gradient by construction, so absorption cannot sort anything; we only ever made them differ by the manual split. Distinct directions are required for the two regions to receive different per-sample gradients.
- The quarantine must stay in the forward for every sample (additive), not a hard MoE switch. A hard per-sample forward switch (delta_S*m + delta_S2*(1-m)) removes the quarantine from an unflagged hack's computation graph, so that hack trains the kept adapter and lands in the deployed model. Absorption needs the quarantine in-path for all samples so unflagged hacks can concentrate there. Earlier Cloud's detach (m*a + (1-m)*a.detach()) keeps the forward value identical and routes only the backward; that is the right primitive.
- Seed hard, absorb soft (the hybrid SGTM actually uses). Flagged hacks: hard per-sample gradient route to quarantine only (detach the kept adapter for those samples). Unflagged hacks: no routing, but they concentrate in the quarantine by gradient-magnitude self-reinforcement iff they share structure with the seeded hacks, which is exactly the persona ("willingness to cheat") hypothesis.
- Leakage is real, bounded, and shrinks with scale. SGTM names our worry "leakage" (§4.3) and measures it: 0.005-0.02 at 64M even with 40% undiscovered forget data, decreasing with model size (paper_sgtm.md:160,172,174). So the additive design is not clean either-or, but leakage is small. Cloud's toy shows an L1 sparsity penalty is sometimes needed to force concentration on unnatural splits (paper_gradient_routing.md:857,1115); we treat L1 as an optional concentration aid, not core.
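The second conclusion (shared coordinate gives identical gradients) can be checked numerically. The sketch below is a toy in plain Python with no autograd: gradients come from central finite differences, and the vectors h, q, the target, and the squared-error loss are all made-up stand-ins rather than anything from antipasto.py. It compares the shared form h*(delta_S + delta_S_hack) with a form where the second knob scales a different feature q.

```python
def num_grad(f, params, eps=1e-6):
    """Central finite-difference gradient of scalar f at the list `params`."""
    grads = []
    for i in range(len(params)):
        up = params[:i] + [params[i] + eps] + params[i + 1:]
        dn = params[:i] + [params[i] - eps] + params[i + 1:]
        grads.append((f(up) - f(dn)) / (2 * eps))
    return grads


def sq_loss(y, target):
    return sum((yi - ti) ** 2 for yi, ti in zip(y, target))


# Toy values (hypothetical, for illustration only).
h = [0.5, -1.2, 2.0]           # activation in the kept adapter's coordinates
q = [1.0, 0.3, -0.7]           # a different feature, standing in for a distinct basis
delta_s = [0.1, 0.0, -0.3]
delta_s_hack = [0.0, 0.2, 0.1]
target = [1.0, 0.0, -1.0]


def shared(ds, dh):
    # Shared coordinate: both knobs multiply the same h.
    return sq_loss([hi * (a + b) for hi, a, b in zip(h, ds, dh)], target)


def distinct(ds, dq):
    # Distinct basis: the second knob acts on q instead of h.
    return sq_loss([hi * a + qi * b for hi, qi, a, b in zip(h, q, ds, dq)], target)


g_shared_kept = num_grad(lambda p: shared(p, delta_s_hack), delta_s)
g_shared_hack = num_grad(lambda p: shared(delta_s, p), delta_s_hack)
g_distinct_kept = num_grad(lambda p: distinct(p, delta_s_hack), delta_s)
g_distinct_quar = num_grad(lambda p: distinct(delta_s, p), delta_s_hack)

print("shared:  ", g_shared_kept, g_shared_hack)
print("distinct:", g_distinct_kept, g_distinct_quar)
```

In the shared form the two gradient vectors agree on every axis, so nothing in the gradient can sort hack from clean; in the distinct form they differ wherever q differs from h.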
Open risks carried forward:
- Starvation vs route-around: detaching the whole kept adapter on a flagged hack also discards any solve-signal in that rollout -> kept can starve, deploy degrades (Cloud's retain cost ∝ routed fraction). Soft/partial routing is the unexplored middle.
- Ignition: a ~10-pair persona seed may be too weak to ignite absorption at our scale, where SGTM says small models leak more. This is the empirical crux.
- run 31 confound: current additive-route deploy=0 is not citable until the random-V control (#157) separates "directional" from "delete-any-adapter reverts to base".
Scope
In: two new arms (grad-vector mask, act-vector mask) sharing one distinct-basis quarantine adapter; the per-sample plumbing; an L1 knob; a leakage metric; the all-arms per-mode deploy plot. Out: confession/reward routing (rejected, item 1); TDGA project-onto-clean (kept as a documented fallback only); learned/MoE gating (rejected, item 3).
Core pseudocode
Shared machinery — distinct-basis quarantine, additive forward, detach-route
# Per wrapped Linear. Frozen SVD of W: W = U @ diag(S) @ Vh.
# x : [b s d_in] layer input
# Vh : [r d_in] frozen right singular vecs (kept-adapter input basis)
# U : [d_out r] frozen left singular vecs (kept-adapter output basis)
# delta_S : [r] KEPT knob (the deployed adapter), trainable
# A_q,B_q : [k_q d_in],[d_out k_q] QUARANTINE LoRA, OWN basis, init ~0, deleted at deploy
# m : [b s] per-sample/token hack flag in {0,1}, from the arm's mask
#
# Invariant: quarantine is ALWAYS summed into the forward (item 3) and has a
# DISTINCT basis from the kept adapter (item 2). For flagged samples (m=1) the
# kept adapter is detached so only the quarantine learns them (item 4, "seed hard").
def routed_forward(x, m):
    kept = U @ (delta_S * (Vh @ x)) # [b s d_out] the deployed adapter
    # detach-route: m=1 -> forward value unchanged, but grad to delta_S is blocked
    kept = where(m[..., None], kept.detach(), kept)
    quar = B_q @ (A_q @ x) # [b s d_out] distinct basis, always in path
    return base_linear(x) + kept + quar, quar # quar also returned for the L1 term
# loss += lambda_l1 * quar.abs().mean() # OPTIONAL anti-duplication penalty (Cloud)
# deploy: zero A_q,B_q (or just drop quar) -> evaluate kept only.
Why this gives absorption (the mechanism, item 4): once flagged hacks have seeded
A_q,B_q with cheat-features, an unflagged hack B (m=0, so both adapters get its
gradient) has its largest gradient on the quarantine directions (they already
do cheat-ish things), so B's update concentrates there — while a clean solve, also
m=0, has tiny gradient on the quarantine and concentrates in delta_S. The mask
seeds; gradient magnitude sorts the unflagged bucket. Rests on B sharing structure
with the seed (the persona bet).
Arm B (act-vector) — PRIMARY, single pass
The mask is available at forward time (activations are per-sample for free), so one forward+backward suffices. This is the clean absorption-supporting arm.
# v_act : [d] persona-contrast direction in activation space (mean-diff or top
# singular vec of stacked hack-minus-clean activations at this site),
# unit-norm, oriented cheat-ward. Built once from ~10 persona pairs.
# Computed in the SAME forward, before routing:
a_i = Vh @ x # [b s r] per-sample activation (SVD coords)
cos = cosine(a_i, v_act_in_r_space) # [b s] per-sample, no extra pass
m = (cos > tau).float() # weak, noisy mask -- fine (SGTM tolerates noise)
y, quar = routed_forward(x, m)
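A minimal plain-Python sketch of the Arm B mask, assuming activations are already expressed in the same coordinates as v_act. The helper names (cosine, persona_mask), the example vectors, tau=0.5, and the choice to treat a zero-norm row as unflagged are all illustrative assumptions, not taken from the codebase.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def persona_mask(acts, v_act, tau):
    """Return one 0/1 flag per activation row: 1.0 where cos(row, v_act) > tau."""
    return [1.0 if cosine(a, v_act) > tau else 0.0 for a in acts]


# Hypothetical 3-dim activations for three samples; v_act is unit-norm.
v_act = [1.0, 0.0, 0.0]
acts = [
    [0.9, 0.1, 0.0],     # points mostly along v_act
    [-0.2, 1.0, 0.3],    # points away from v_act
    [0.0, 0.0, 0.0],     # degenerate row, treated as unflagged (assumption)
]
m = persona_mask(acts, v_act, tau=0.5)
print(m)
```

The mask depends only on quantities already present in the forward pass, which is why this arm needs no second pass.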
Arm A (grad-vector) — single-pass gate SUBTRACTION (implemented)
The persona direction lives in delta_S (gradient) space, but delta_S.grad is
summed over the batch, so the per-rollout signal is not free. Recover it with a
PER-ROLLOUT gate (init 1, identity in the forward) and route POST-backward by
subtraction — no forward detach, no second pass:
# v_grad : [r] unit(mean(g_hack - g_clean)) per module, from extract_vhack_grad's
# raw per-pair GRPO grads (gradient-space analogue of v_act). Oriented
# hack-ward: training reinforces a hack with the same sign, so a
# reinforced-hack rollout has cos(g_b, v_grad) > 0.
a = Vh @ x # [b s r] per-token activation in SVD coords (same as a_i in Arm B)
c = ones(b, 1, r, requires_grad=True) # per-rollout gate (broadcast over tokens)
kept = U @ ((a * c) * delta_S) # identity at c=1: forward value unchanged
quar = B_q @ (A_q @ x) # always summed
... usual loss; loss.backward()
# c.grad = delta_S * g_b (g_b = per-rollout delta_S grad). Divide it out, but only on
# axes where delta_S has moved (|delta_S|>eps); near-zero axes keep the full grad.
g_b = where(|delta_S|>eps, c.grad / delta_S, 0) # [b, r]
flag = cosine(g_b, v_grad) > 0 # [b]
delta_S.grad -= sum_{flagged b} g_b # exact identity on reliable axes
# quarantine keeps every rollout's grad (always-on forward) -> absorption intact.
Why subtraction beats the earlier A1/A2 framings: routing is post-backward WITHIN the same step, so the cross-step sample mismatch that made the "stale mask" (A1) awkward never arises, and we avoid the 2x backward of the exact two-pass (A2). The only approximation is the eps-guard: on an axis where delta_S is still ~0 the per-rollout split is undefined (0/0), so we route nothing there until delta_S grows past eps — a ~1-step-per-axis lag, the same order as A1's lag, and harmless because a ~0 axis carries no learned hack yet. Upgrade to A2 (probe pass -> mask -> detach-route pass) only if that lag measurably hurts.
Asymmetry still worth flagging in the writeup: Arm B's mask is a free forward-time activation cosine (no gate, no division); Arm A pays a per-rollout gate + an eps-guarded division. Arm B remains the cleaner bet.
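The divide-out and subtraction steps of Arm A can be sketched without autograd by starting from known per-rollout gradients. This is a toy under stated assumptions: the function names, the numbers, and the list-of-lists layout are hypothetical, and c_grad is constructed directly from the gate identity c.grad = delta_S * g_b rather than obtained from a backward pass.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def recover_per_rollout_grads(c_grad, delta_s, eps=1e-6):
    """Divide delta_S out of each rollout's gate grad; axes with |delta_S| <= eps give 0."""
    return [
        [cg / d if abs(d) > eps else 0.0 for cg, d in zip(row, delta_s)]
        for row in c_grad
    ]


def subtract_flagged(delta_s_grad, g, v_grad):
    """Flag rollouts with cos(g_b, v_grad) > 0 and subtract their grads from the batch grad."""
    flags = [cosine(g_b, v_grad) > 0 for g_b in g]
    routed = list(delta_s_grad)
    for g_b, flagged in zip(g, flags):
        if flagged:
            routed = [r - x for r, x in zip(routed, g_b)]
    return routed, flags


# Hypothetical numbers: 2 rollouts, r = 3 axes; the third axis of delta_S is still 0.
delta_s = [0.5, -2.0, 0.0]
true_g = [[1.0, 0.5, 0.3], [-1.0, -0.2, 0.4]]                       # per-rollout delta_S grads
c_grad = [[d * x for d, x in zip(delta_s, g_b)] for g_b in true_g]  # gate identity
delta_s_grad = [sum(col) for col in zip(*true_g)]                   # batched grad = sum over rollouts

g = recover_per_rollout_grads(c_grad, delta_s)
routed_grad, flags = subtract_flagged(delta_s_grad, g, v_grad=[1.0, 0.0, 0.0])
print(g, flags, routed_grad)
```

On the two axes where delta_S has moved, the flagged rollout's contribution is removed exactly. On the third axis (delta_S = 0) the recovered gradient is 0, so the batch gradient keeps the flagged rollout's share; this is the eps-guard lag described above.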
Requirements
- R1: Quarantine has a basis distinct from the kept adapter, and is summed (not switched) in the forward. VERIFY: with a flagged-only synthetic batch, after one step delta_S.grad == 0 (detached) AND A_q.grad != 0; with an unflagged batch both are nonzero. If a hard-switch crept in, the unflagged batch would leave A_q.grad == 0; redesign until it does not.
- R2: Deploy = kept only. VERIFY: zeroing A_q,B_q changes the eval forward; the saved adapter contains no quarantine tensors.
- R3: Per-sample mask is correct. VERIFY (Arm B): on a batch of known hack/clean rollouts, m fires >2x more on hacks than clean. (Arm A): g_i summed over samples equals the batched delta_S.grad (gate identity check).
- R4: Leakage metric implemented (SGTM-style): hack capability remaining in the kept adapter after deleting quarantine, vs an equivalent-exposure vanilla baseline. VERIFY: metric returns a number in [0,1]; data-filtering arm reads ~1.
- R5: No reward, no oracle at train time. VERIFY: grep the train path. Routing reads only the persona mask; never gt_pass, never the full detector suite over student rollouts.
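The R1 check can be illustrated on a scalar toy model before wiring it into the real adapter. The sketch below emulates the effect of the detach by hand (no autograd): flagged samples contribute nothing to the kept gradient, while the quarantine always receives gradient. The function name routed_grads, the features u and v, and all numbers are hypothetical; the real check would run on delta_S.grad and A_q.grad after a backward pass.

```python
def routed_grads(samples, delta_s, w_q):
    """Emulate detach-route on a scalar toy model.

    samples: list of (u, v, target, m) where u is the kept-basis feature,
    v the quarantine-basis feature and m the 0/1 hack flag.
    Returns (grad wrt delta_s, grad wrt w_q, forward outputs).
    """
    g_kept, g_quar, outputs = 0.0, 0.0, []
    for u, v, target, m in samples:
        y = delta_s * u + w_q * v        # additive forward; m does not change the value
        dl_dy = 2.0 * (y - target)       # squared-error loss
        g_kept += (1 - m) * dl_dy * u    # detach: flagged samples send no grad to kept
        g_quar += dl_dy * v              # quarantine is always in the path
        outputs.append(y)
    return g_kept, g_quar, outputs


# Hypothetical features and targets; the same rows are used with m=1 and m=0.
rows = [(1.0, 0.5, 1.0), (0.8, -0.3, 0.0)]
flagged = [(u, v, t, 1) for u, v, t in rows]
unflagged = [(u, v, t, 0) for u, v, t in rows]

gk_f, gq_f, y_f = routed_grads(flagged, delta_s=0.2, w_q=0.1)
gk_u, gq_u, y_u = routed_grads(unflagged, delta_s=0.2, w_q=0.1)

assert gk_f == 0.0 and gq_f != 0.0     # flagged: only the quarantine learns
assert gk_u != 0.0 and gq_u != 0.0     # unflagged: both adapters get gradient
assert y_f == y_u                      # routing leaves the forward value unchanged
```

The third assertion corresponds to the sneaky failure named in T1: a detach that also zeroed the forward value would make y_f differ from y_u.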
Tasks
- T1 (R1,R2): add quarantine LoRA A_q,B_q per module + routed_forward with detach-route, distinct basis, deploy-zero. steps: extend antipasto.py adapter; thread intervention=route2 config. verify: just smoke walks routed_forward; R1 synthetic-batch grad check. success: flagged grad isolates to A_q; unflagged hits both. likely_fail: shared basis reused -> identical grads (R1 fails). sneaky_fail: detach also zeros forward value -> solve tanks; catch by asserting forward value unchanged vs non-detached.
- T2 (R3): Arm B act-mask: v_act extraction from persona pairs + forward-time cosine. verify: R3 fire-ratio on known hack/clean. UAT: "mask fires on hacks".
- T3 (R3): Arm A grad-mask: c_i gate per-sample grad, single-pass subtraction (NOT A1 stale / A2 two-pass; see Log 2026-05-31). gate identity sum_i g_i == delta_S.grad holds by construction. smoke routing2_grad passes (||B_q||=0.109).
- T4 (R4): leakage metric + L1 knob (lambda_l1, default 0). DESIGN (2026-05-31, grounded in SGTM eq.1 = |D_forget_standard| / |D_forget_unlabeled|, 0=perfect localization, 1=all leaked, data-filtering reads 1 by construction):
  - Operationalize as a per-mode RATIO, not SGTM's full calibration curve. SGTM maps ablated forget-loss onto an equivalent forget-token count via a curve of standard runs at varying forget exposure; we can't afford a curve of vanilla checkpoints, and matched-step vanilla is the natural single-point denominator:
    leakage_mode = clamp( hack_deploy(route2, quarantine deleted) / hack_deploy(vanilla, matched steps/substrate), 0, 1 )
    Held-out modes are the load-bearing cut: route2 should read <<1 (localized the UNKNOWN hack into the deleted quarantine), data-filtering reads ~1 (can't filter what its weak detector never flagged). Report per mode, split known/held-out, headline = mean held-out leakage.
  - Home = post-hoc cross-run analysis (reads two run artifacts), NOT inside train.py (one run lacks the other arm's numbers). Likely a scripts/leakage.py or a block in the plot path.
  - Groundwork needed first (both pure additions, smoke before use): (a) write per-mode deploy eval as STRUCTURED json artifact per route/route2 run (currently only logger.info at train.py:1861-1867; nothing on disk). (b) make VANILLA emit the same per-mode eval on the fixed eval_idxs (currently the deploy-eval block is route/route2-only; vanilla hack_deploy=nan), so the denominator exists. Adds one eval pass to vanilla runs (~minor).
  - UNVALIDATABLE until route2 runs (jobs 34/35) produce deploy numbers; vanilla denominator = run 30. Implement + smoke once those land, not before.
  - L1 knob DEFERRED with reason: adding an L1 penalty on B_q/delta_S is a SECOND training objective (balancing problem, violates one-driving-principle). It's default 0 = off, an optional ignition aid we may never need. Hold until the leakage number shows the quarantine fails to ignite without it. Not blocking the headline proof.
- T5: 5-arm sweep at matched seed/steps: vanilla, erase, route-additive(old), route2-grad, route2-act. Plus random-V control (#157) on the old route.
- T6: all-arms per-mode deploy overlay plot (just plot).
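The per-mode leakage ratio from T4 could be sketched as below. This is an illustration, not the planned scripts/leakage.py: the function names, the dict layout, and the handling of an undefined ratio (returning NaN when either rate is NaN or the vanilla rate is 0, then skipping that mode in the headline) are assumptions. The rates are placeholders, not measured results.

```python
import math


def leakage(hack_route2_deleted, hack_vanilla):
    """Per-mode leakage: routed deploy-hack over vanilla deploy-hack, clamped to [0, 1].

    Returns NaN when either rate is NaN or the vanilla rate is 0
    (assumption: the ratio is undefined there).
    """
    if math.isnan(hack_route2_deleted) or math.isnan(hack_vanilla) or hack_vanilla <= 0:
        return float("nan")
    return min(1.0, max(0.0, hack_route2_deleted / hack_vanilla))


def headline_leakage(route2, vanilla, held_out_modes):
    """Mean leakage over held-out modes, ignoring modes with undefined leakage."""
    vals = [leakage(route2[m], vanilla[m]) for m in held_out_modes]
    vals = [v for v in vals if not math.isnan(v)]
    return sum(vals) / len(vals) if vals else float("nan")


# Placeholder deploy-hack rates per mode, NOT measured results.
vanilla = {"run_tests": 0.40, "file_marker": 0.30, "sentinel": 0.20, "stdout": 0.0}
route2 = {"run_tests": 0.10, "file_marker": 0.03, "sentinel": 0.30, "stdout": 0.0}
held_out = ["file_marker", "sentinel", "stdout"]

per_mode = {m: leakage(route2[m], vanilla[m]) for m in vanilla}
headline = headline_leakage(route2, vanilla, held_out)
print(per_mode, headline)
```

The clamp means a routed arm that hacks more than vanilla on some mode reads 1 rather than above 1, which keeps the metric inside the [0,1] range R4 asks for but hides how much worse it is; reporting the raw ratio alongside may be worth considering.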
Proofs to show at the end (user-observable)
- All-arms per-mode deploy table + overlay plot. Rows: vanilla / erase / route-additive / route2-grad / route2-act; columns: train-hack, deploy-hack, deploy-solve, split by mode (run_tests in-dist + file_marker/sentinel/stdout held-out). The headline proof: route2 drops deploy-hack on held-out modes at matched deploy-solve, beating erase (≈vanilla) and beating or matching the confounded additive route. Link: out/runs/.../all_arms_per_mode.png.
- random-V control (#157): route-additive with a Haar-random V. If random reproduces deploy-hack≈0+solve-jump, the additive result was an ablation artifact (and route2's distinct-basis win is the real signal). If random stays hacky, the direction mattered all along. Either way it disambiguates run 31.
- Leakage number, SGTM-style: fraction of held-out-mode hack capability remaining in the kept adapter after quarantine deletion. Lower than erase's (which leaves ~everything) is the absorption proof.
- cos-over-time for route2: does the persona mask stay discriminative (cin_t>cin_s) longer than the additive arm's ~9 steps, and does deploy-hack track it.
- Ablations that must hold: shared-basis route2 (kill R1) must lose the held-out drop (confirms distinct basis is load-bearing); L1=0 vs L1>0 shows whether ignition needs the sparsity aid.
Context (terms / invariants)
- No-cheat invariant: training never gated on gt_pass or the full detector suite over student rollouts; persona pairs select the mask, that's the allowed weak detector. Deploy = quarantine deleted.
- absorption: unflagged members of a concept concentrate in the seeded dedicated region by gradient magnitude (Cloud/SGTM). Requires distinct basis + in-forward.
- leakage: hack capability that ends up in the kept (retained) adapter despite routing. SGTM §4.3.
Log
- 2026-05-31 (later): execution pass. (1) Random-V control (#157) CANCELLED: it is structurally rf0-only -- refresh re-extracts the real v_hack and would overwrite the random direction, so it only controls the abandoned frozen regime. (2) Frozen-real route (job 32, rf0) drops deploy hack only ~8pp (0.375->0.297) vs run-31 rf5's ~0; cin_t decays 0.32->0.13 to meet cin_s by step 7 (staleness). Refresh is the load-bearing knob (journal entry j). (3) BUG: route2 refresh was a silent no-op -- vhack_refresh_every gated on v_hack, which route2 sets None; v_act/v_grad stayed frozen. Fixed (2556919): route2 branch re-extracts v_act/v_grad every N with the quarantine ablated. (4) BUG: route2 quar/v_act crashed on the bf16 real model (A_q/B_q/v_act fp32 vs bf16 x); smoke is fp32 (CPU) so the path never fired. Fixed (80f6b52) with .to(x.dtype), bf16 fwd+bwd validated. (5) vanilla (job30) + erase (job29) already DONE on the substrate -- overlay arms exist; erase ≈ vanilla (0.274 vs 0.276). (6) substrate pool + prog_wide pairs now FastConfig defaults (1086c98); real run needs only --intervention. T4 split: per-mode deploy JSON artifact (#164, unblocked groundwork) vs leakage metric (#161, blocked on route2 deploy numbers).
- 2026-05-31: spec created from brainstorm. Run 31 (additive route) gave deploy-hack=0 on all modes incl held-out + solve-jump 0.375->0.562; confounded (gpt-5.5 review + journal entry). Random-V (#157) is the gate.
- 2026-05-31: cos-over-time on run 31 — teacher 0.29->0.08, student flat ~0.10-0.12, cross ~step 9. Persona mask is a stable ~10% handle on the student gradient; the teacher signal decays (likely the student absorbing the teacher's hacks).
- 2026-05-31: T1+T2 implemented and smoke-passed. R1 grad check (synthetic): flagged -> delta_S.grad=0, A_q/B_q.grad>0; unflagged -> both>0; forward value unchanged by detach. smoke-route2: v_act extracted (14 modules), ||B_q||=0.109 after 30 steps, deploy eval + asserts pass. BUG found+fixed: the per-prompt model.zero_grad(set_to_none=True) (there to isolate delta_S's per-source grad) wiped A_q/B_q grads before opt.step; now stashed and re-injected like delta_S.
- 2026-05-31: defaults: vhack_refresh_every 0->5 (0 is ablation-only); route2 reuses run-substrate (v-hack-path ignored, vhack-pairs drives v_act, tau/rank defaulted) so the sweep needs no extra args.
- 2026-05-31: T3 (Arm A grad-mask) implemented + smoke-passed. Removed route2_tau (never tuned; mask is cos>0, the natural hack-ward boundary). v_hack path now auto-derives from --vhack-pairs-path (pass the pairset, the hack auto-loads). Arm A design CHANGED from the spec's A1/A2: single-pass gate-SUBTRACTION instead of stale-mask or two-pass. The per-rollout gate c (init 1, identity forward) gives c.grad = delta_S * g_b after backward; train.py divides out delta_S (eps-guard on |delta_S|>1e-6) to get per-rollout g_b, flags cos(g_b, v_grad)>0, and subtracts flagged rollouts from delta_S.grad. No forward detach, no second pass; quarantine still learns flagged rollouts via its always-on path. The cross-step sample-mismatch that made A1 awkward never arises because routing is post-backward within the same step. Lag bound: routing on a fresh axis lags ~1 step until |delta_S| grows past eps there (this is the A1-equivalent one-step lag, per-axis). Upgrade to A2 (two-pass detach) only if the lag hurts. v_grad = unit(mean(g_hack-g_clean)) from extract_v_hack raw grads (gradient-space analogue of v_act). smoke routing2_grad: ||B_q||=0.109 after 30 steps (quarantine seeded by routed grad), deploy eval + asserts pass, exit 0.
- 2026-05-31: external code review (deepseek-v4-pro, docs/spec/20260531_route2_code_review_v2.md) verified gate identity (c.grad=delta_S*g_b), divide-out, eps-guard, Arm B detach-route, and R5 NO-CHEAT (mask never reads gt_pass / detector suite) all CORRECT. One finding: Arm A flagged per-TOKEN, not per-rollout. The hook's gate is [G*s, r] (nn.Linear flattens the batch), so cos(g_b, v_grad) and the flag were per-token. A clean rollout scatters ~50% of its tokens over cos>0 by noise, spuriously routing half its gradient mass. FIXED: _route2_grad_filter now reshapes c.grad to [G, s, r], sums each rollout's tokens BEFORE the cosine (denoises the sign), flags per-rollout [G], matching the preregistered unit (GRPO advantage is per-rollout). Re-smoked.
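The per-token versus per-rollout difference from the last log entry can be seen on a small hand-built example. This sketch is not _route2_grad_filter: the function names and numbers are hypothetical, the flattened [G*s, r] tensor is represented as a list of rows, and the rows are assumed to already have delta_S divided out.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def flag_per_token(g_flat, v_grad):
    """The buggy unit: one flag per row of the flattened [G*s, r] gradient."""
    return [cosine(row, v_grad) > 0 for row in g_flat]


def flag_per_rollout(g_flat, n_rollouts, n_tokens, v_grad):
    """View [G*s, r] as [G, s, r], sum each rollout's tokens, then flag cos > 0."""
    flags = []
    for gi in range(n_rollouts):
        rows = g_flat[gi * n_tokens:(gi + 1) * n_tokens]
        summed = [sum(col) for col in zip(*rows)]
        flags.append(cosine(summed, v_grad) > 0)
    return flags


# Hypothetical grads: G = 2 rollouts, s = 3 tokens, r = 2 axes.
v_grad = [1.0, 0.0]
g_flat = [
    [0.2, 0.1], [-0.5, 0.0], [0.1, -0.3],    # rollout 0: mixed signs, net anti-hack
    [0.6, 0.1], [0.4, -0.2], [-0.1, 0.0],    # rollout 1: net hack-ward
]
token_flags = flag_per_token(g_flat, v_grad)
rollout_flags = flag_per_rollout(g_flat, n_rollouts=2, n_tokens=3, v_grad=v_grad)
print(token_flags, rollout_flags)
```

In this example two of rollout 0's three tokens are flagged individually even though the rollout as a whole points away from v_grad; summing before the cosine removes those spurious flags.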
TODO (out of scope now)
- Soft/partial routing (route a fraction of a flagged sample's gradient) to trade starvation against route-around.
- TDGA project-onto-clean as a no-signal fallback if the persona mask never ignites.