The distinct-basis A_q/B_q LoRA (~33M params at rank-16) gave the quarantine a
~100x capacity edge over delta_S, so routing-everything-there was the low-
resistance path: qE pinned ~0.97 (energy into the thrown-away knob) while the
deployed delta_S learned nothing (job 54). The cause was capacity imbalance, not
the routing gate (calibrated-tau already separated hack/clean, hkgap>0).
Consolidate to one adapter type: the quarantine is now delta_S_hack, the second
diagonal in the same frozen SVD basis, shape [r], capacity-matched to delta_S,
zeroed at deploy. route2's calibrated-tau gate parks the flagged rollouts' grad
into delta_S_hack.grad (like proj.py's route parks its subspace projection);
delta_S keeps the unflagged. Both diagonals train at one shared lr.
Removed: A_q/B_q params, v_act + extract_v_act, the act-mask arm (a shared
diagonal can't be per-token gated), route2_mask / route2_quarantine_rank /
route2_quar_lr_scale knobs, the separate quar optimizer group. Arm name
routing2_{act,grad} -> routing2. v_grad refresh extracts from delta_S (main)
with the quarantine ablated.
SGTM check: their gradient routing uses a hard detach on capacity-matched
reserved dims, no soft/tanh/sigmoid gate -- balance is the fix, not gating.
Smoked clean: tau/hkgap/qE render, ||delta_S_hack||>0 assert passes, exit 0.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Spec — per-step calibrated threshold τ for route2-grad routing
Status: APPROVED by user 2026-06-01, implement + smoke + queue + follow. Context survives compaction here; resume from "Implementation" if context was lost.
Hypothesis / why
route2-act and route2-grad both gate routing on cos(·, v) > 0. In high-dim,
cos ≈ 0 for almost everything, so an absolute-zero threshold is a ~50% coin
flip -> over-routing -> half of ALL learning dumped into the deleted quarantine
-> collapse (job 46) or null. The fix is NOT to route by the detector flag (that
= SGTM, throws away the vector novelty: the flag only catches known hack A, the
whole point is the direction generalises to unknown B). The fix is to keep
routing by the v_hack DIRECTION but calibrate the threshold per step so it
rides the cos drift (we measured cin_t 0.27->0.07 over ~10 steps, so any fixed
τ goes stale and silently stops firing).
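The coin-flip claim can be checked with a small stand-alone sketch. It uses random Gaussian vectors as a stand-in for per-rollout gradients (an assumption; real gradients are not isotropic), and the names dim, n and frac_routed_at_zero are illustrative, not from train.py. It only shows that an absolute cos > 0 gate routes about half of everything when cos concentrates near 0.

```python
import math
import random


def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def frac_routed_at_zero(dim=256, n=1000, seed=0):
    """Fraction of random vectors with cos(g, v) > 0, plus the mean |cos|.

    Assumption: i.i.d. Gaussian vectors stand in for rollout gradients.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # stand-in for v_hack
    cos_vals = []
    for _ in range(n):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        cos_vals.append(cosine(g, v))
    frac = sum(c > 0 for c in cos_vals) / n
    mean_abs = sum(abs(c) for c in cos_vals) / n
    return frac, mean_abs


frac, mean_abs = frac_routed_at_zero()
print(f"fraction with cos > 0: {frac:.2f}; mean |cos|: {mean_abs:.3f}")
```

The fraction sits near one half while the typical |cos| is small, which is the over-routing regime described above.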
Driving principle (one): route by where a rollout falls on the v_hack axis RELATIVE to the known-hack vs clean clouds, recalibrated each step.
Mechanism (per module, per step, EMA-smoothed)
Anchors available every prompt without a new buffer:
- hack cloud = teacher rollouts (known-A hacks, always present) + detector-flagged student rollouts (hack_E = loophole used; the allowed weak detector).
- clean cloud = NOT-flagged student rollouts. (Contaminated with unknown B -- that is FINE and intended: it makes τ slightly conservative, and B, sharing the direction, lands ABOVE τ and gets routed. Do NOT force-keep this set.)
Per module name, from this prompt's cos_b [G] (already computed in
_route2_grad_filter):
- μ_hack_prompt = mean(cos_b[hack_anchor]) ; μ_clean_prompt = mean(cos_b[clean_anchor])
- EMA across prompts/steps (β=0.9): ema_hack[name], ema_clean[name] (EMA is the cheap equivalent of "last N hacks": teacher is the always-present floor, recent student-flagged hacks sharpen it. Explicit rollout buffer = TODO if teacher-anchor proves biased toward teacher-style hacks.)
- τ[name] = (ema_hack[name] + ema_clean[name]) / 2
- route_mask = hack_anchor | (cos_b > τ[name]) # force-route known hacks (teacher + flagged student); τ-route the ambiguous rest (incl. unknown B).
Warmup: at step 0 delta_S≈0, so there are few reliable axes -> cos_b≈0 -> μ_hack≈μ_clean≈0 -> τ≈0 (≈ old cos>0 behaviour) until delta_S grows. Separation (hkgap) emerges over the first few steps. Sanity: within a few steps expect μ_clean<~0.1 and μ_hack>~0.2 (the user's hardcoded-threshold intuition, but self-calibrated).
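The per-module mechanism above can be sketched in plain Python. This is a minimal illustration, not the code in train.py: update_tau is a hypothetical helper, lists stand in for device tensors, and seeding the EMA with the first observed mean is an assumption (the spec only fixes β=0.9 and the midpoint rule).

```python
EMA_BETA = 0.9  # β from the spec


def update_tau(name, cos_b, hack_anchor, clean_anchor,
               ema_hack, ema_clean, beta=EMA_BETA):
    """Update one module's EMAs from one prompt; return (tau, route_mask)."""
    hack_vals = [c for c, h in zip(cos_b, hack_anchor) if h]
    clean_vals = [c for c, k in zip(cos_b, clean_anchor) if k]
    for ema, vals in ((ema_hack, hack_vals), (ema_clean, clean_vals)):
        if not vals:
            continue  # skip a cloud that has no rows this prompt
        mu = sum(vals) / len(vals)
        # Assumption: the first observation seeds the EMA.
        ema[name] = mu if name not in ema else beta * ema[name] + (1 - beta) * mu
    # Midpoint threshold between the hack cloud and the clean cloud.
    tau = (ema_hack.get(name, 0.0) + ema_clean.get(name, 0.0)) / 2
    # Force-route known hacks; tau-route everything else (incl. unknown B).
    route = [h or (c > tau) for c, h in zip(cos_b, hack_anchor)]
    return tau, route


# Merged rows: [student, student (flagged), student, teacher]
ema_hack, ema_clean = {}, {}
cos_b = [0.02, 0.25, 0.22, 0.30]
hack_anchor = [False, True, False, True]
clean_anchor = [True, False, True, False]
tau, route = update_tau("layer0.q_proj", cos_b, hack_anchor, clean_anchor,
                        ema_hack, ema_clean)
print(tau, route)
```

In this toy prompt the unflagged student row with cos 0.22 lands above τ and is routed, which is the intended treatment of an unknown-B rollout; the row with cos 0.02 stays with delta_S.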
Logging (new columns, route2-grad only)
- tau = mean over modules of τ[name]
- hkgap = mean over modules of (ema_hack - ema_clean) <- the discrimination gauge; generalises cin_t>cin_s. If hkgap collapses to ~0 the direction stopped separating (the real failure signal, not a threshold-choice problem).
Implementation (exact edits in src/projected_grpo/train.py)
- Before the step loop (near route_span_checked = False, ~L1082): add ema_hack_cos: dict = {}, ema_clean_cos: dict = {}, route2_tau: dict = {} (persist across steps). EMA_BETA = 0.9.
- Per-step lists near step_flagged (~L1192): step_tau = [], step_hkgap = [].
- Per-prompt anchor masks: after grading (hack_E_flags built, ~L1396), before the backward, when is_route2_grad, build device bool tensors aligned to merged rows [student 0..G_s-1, teacher G_s..]:
    hack_anchor[i] = (not is_student[i]) or (is_student[i] and hack_E_flags[i])
    clean_anchor[i] = is_student[i] and not hack_E_flags[i]
  (hack_E_flags has length G_s = #student rows; teacher rows -> hack_anchor.)
- _route2_grad_filter (~L1194): add params (hack_anchor, clean_anchor). Replace flagged = (cos_b > 0).float() with the EMA-calibrated τ logic:
  - update ema_hack_cos[name]/ema_clean_cos[name] from this prompt's anchors (skip a cloud if it has no rows this prompt)
  - τ = (ema_hack_cos.get(name,0)+ema_clean_cos.get(name,0))/2 ; route2_tau[name]=τ
  - route = (cos_b > τ) | hack_anchor ; flagged = route.float()
  - step_tau.append(τ); step_hkgap.append(ema_hack_cos.get(name,0)-ema_clean_cos.get(name,0))
  Pass hack_anchor/clean_anchor through the call site (~L1518).
- Row dict (~L1864): add "tau": (mean(step_tau) if step_tau else float("nan")), "hkgap": (mean(step_hkgap) if step_hkgap else float("nan")).
- StepLogger cols (~L711, after the routing2_act block): add
    if arm == "routing2_grad":
        _Col("tau", 6, "tau", "+.2f", "per-step calibrated route threshold (midpoint of hack vs clean cos clouds)")
        _Col("hkgap", 6, "hkgap", "+.2f", "ema_hack_cos - ema_clean_cos; >0 = v_grad still separates hack from clean")
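The anchor-mask step in the list above can be sketched as follows. This is a hedged illustration with Python lists instead of device bool tensors; build_anchor_masks is a hypothetical name, and it assumes the merged-row layout stated in the spec (student rows first, then teacher rows).

```python
def build_anchor_masks(is_student, hack_E_flags):
    """Return (hack_anchor, clean_anchor) aligned to the merged rows.

    is_student:   one bool per merged row (students first, then teacher).
    hack_E_flags: one bool per student row (length G_s).
    """
    n_student = sum(is_student)
    if len(hack_E_flags) != n_student:
        raise ValueError("hack_E_flags must have one entry per student row")
    hack_anchor, clean_anchor = [], []
    for i, student in enumerate(is_student):
        # Teacher rows never index hack_E_flags (short-circuit on `student`).
        flagged = student and hack_E_flags[i]
        hack_anchor.append((not student) or flagged)  # teacher or flagged student
        clean_anchor.append(student and not flagged)  # unflagged student only
    return hack_anchor, clean_anchor


is_student = [True, True, True, False, False]  # G_s = 3 students, 2 teacher rows
hack_E_flags = [False, True, False]
hack_anchor, clean_anchor = build_anchor_masks(is_student, hack_E_flags)
print(hack_anchor)
print(clean_anchor)
```

Every row falls in exactly one of the two masks, so the two clouds never share a rollout.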
Verify / queue / follow
- just smoke-route2: confirm tau/hkgap/qE columns render, routing fires (flagged frac < ~0.5, not pinned at 0.5), exit 0.
- Queue (substrate, 60 steps, seed 41), label: why: does per-step calibrated-τ vector routing (route2-grad) stop over-routing (flagged<<0.5) and suppress held-out deploy-hack vs vanilla at matched solve; resolve: qE bounded + hkgap>0 sustained + deploy file_marker hack < vanilla.
- pueue follow <id> | tail in background.
No-cheat check
Detector flag (hack_E) only CALIBRATES μ_hack (known A) + anchors teacher hacks. The routing GATE is cos>τ applied to ALL rollouts incl. unknown B. gt_pass never touches routing. Compliant.
Status: IMPLEMENTED + RUNNING (job 54)
Implemented in 6eb894f, smoked (tau/hkgap render, exit 0). Running as job 54
(calibrated-τ + grad_clip 0.5, route2-grad, 60 steps, seed 41). job 52
(caltau-alone) killed and folded in; clip 0.5 is a no-op while gn<0.5 so 54 is
the strict superset.
Live observation (steps 0-2, 2026-06-01)
The calibration works as a discriminator but does NOT fix over-routing-by-energy:
- hkgap positive and rising (0.00 -> 0.03 -> 0.08): the v_grad direction DOES separate the hack cloud from the clean cloud. The vector is alive.
- tau tracking up with it (0.00 -> 0.02 -> 0.04): the threshold rides the drift as designed.
- BUT qE 0.73 -> 0.97 -> 0.97: ~97% of gradient ENERGY lands in the deleted quarantine by step 1, and gt_s 3 -> 7 -> 0 (solving collapsed).
Diagnosis: the failure was never the routing FRACTION (which τ fixes); it is the always-summed 33M A_q/B_q quarantine MAGNITUDE. Even a correctly-thresholded route sends the routed gradient into a knob whose per-param grads dwarf delta_S's, so the energy ratio pins near 1 and the deployed adapter learns nothing. This is the SYNTHESIS "next lever" prediction: if qE stays high while hkgap>0, the culprit is quarantine magnitude, not the gate.
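A toy model makes the magnitude argument concrete. It is not the logged qE computation: it assumes qE is the quarantine's share of squared-gradient energy, that per-parameter gradient scale is equal in both knobs, and that energy therefore scales with parameter count. The ~100x capacity ratio is the figure from the commit message; the routed fraction of 0.3 is an arbitrary illustrative value.

```python
def quarantine_energy_fraction(routed_frac, capacity_ratio):
    """Toy qE: share of gradient energy landing in the quarantine.

    routed_frac:    fraction of rollouts the gate sends to the quarantine.
    capacity_ratio: quarantine params / main-knob params.
    Assumption: equal per-parameter gradient scale, so energy is
    proportional to (rollouts received) x (parameter count).
    """
    e_quar = routed_frac * capacity_ratio
    e_main = (1.0 - routed_frac) * 1.0
    return e_quar / (e_quar + e_main)


qe_lora = quarantine_energy_fraction(routed_frac=0.3, capacity_ratio=100.0)
qe_matched = quarantine_energy_fraction(routed_frac=0.3, capacity_ratio=1.0)
print(f"~100x quarantine: qE = {qe_lora:.3f}")
print(f"capacity-matched: qE = {qe_matched:.3f}")
```

Under these assumptions a well-calibrated gate that routes only 30% of rollouts still puts almost all the energy in the large quarantine, while a capacity-matched quarantine receives energy in proportion to what the gate actually routes.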
DESIGN CHANGE (2026-06-01): one adapter, scale-matched quarantine
Acted on the magnitude diagnosis by removing the distinct-basis LoRA entirely.
The quarantine is now delta_S_hack -- the SECOND diagonal in the same frozen SVD
basis, shape [r] per module, identical capacity to delta_S. route2's calibrated-τ
gate parks the flagged rollouts' delta_S-grad contribution into delta_S_hack.grad
(via step_grad_hack in _route2_grad_filter), exactly as proj.py's route parks
its subspace-projected component; delta_S keeps the unflagged. Both diagonals
train at one shared lr; delta_S_hack is zeroed at deploy.
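The parking step can be sketched as a split of per-rollout gradients on the shared diagonal. This is a simplified illustration, not the body of _route2_grad_filter: park_grads is a hypothetical helper, and it assumes per-rollout gradients of delta_S are available as rows of length r.

```python
def park_grads(per_rollout_grads, route_mask):
    """Split per-rollout delta_S gradients between the two diagonals.

    per_rollout_grads: list of G gradient vectors, each of length r.
    route_mask:        list of G bools from the calibrated-tau gate.
    Returns (main_grad, hack_grad): unflagged rollouts accumulate into
    delta_S.grad, flagged rollouts into delta_S_hack.grad.
    """
    r = len(per_rollout_grads[0])
    main_grad = [0.0] * r
    hack_grad = [0.0] * r
    for grad, routed in zip(per_rollout_grads, route_mask):
        target = hack_grad if routed else main_grad
        for j, g_j in enumerate(grad):
            target[j] += g_j
    return main_grad, hack_grad


grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # G=3 rollouts, r=2
route_mask = [False, True, False]             # only the middle rollout is flagged
main_grad, hack_grad = park_grads(grads, route_mask)
print(main_grad, hack_grad)
```

Because both diagonals have shape [r], the split conserves the total gradient and neither knob gets a capacity advantage; dropping delta_S_hack at deploy removes only what the flagged rollouts contributed.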
Rationale (user): a 33M LoRA vs a ~2k-param delta_S per module means "dump everything in the quarantine" is the low-resistance path -- a capacity edge, not honest absorption. Capacity-balanced diagonals remove that bias. SGTM's own quarantine is capacity-matched (a split of the same layer, equal dims), and uses a hard detach -- no soft/tanh/sigmoid gate -- confirming the fix is balance, not gating.
Removed: A_q/B_q params, v_act buffer + extract_v_act, the act-mask arm (a shared diagonal can't be per-token gated), route2_mask / route2_quarantine_rank / route2_quar_lr_scale knobs, the separate quar optimizer group. arm name "routing2_grad"/"routing2_act" -> "routing2".
v_grad refresh extracts from the MAIN knob (delta_S.grad) with the quarantine ablated -- the deployed-model gradient is what we route, and both diagonals share the basis so the direction is directly usable on delta_S's live gradient.
Smoked clean (tiny-random): tau/hkgap/qE render, ||delta_S_hack||=0.0074>0 assert passes, deploy-ablation fires, exit 0. Queued on the substrate (seed 41, 60 steps).