mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 15:15:40 +08:00

wassname 8158adb543 refactor: route2 quarantine = scale-matched delta_S_hack, rip out 33M LoRA

The distinct-basis A_q/B_q LoRA (~33M params at rank-16) gave the quarantine a
~100x capacity edge over delta_S, so routing-everything-there was the low-
resistance path: qE pinned ~0.97 (energy into the thrown-away knob) while the
deployed delta_S learned nothing (job 54). The cause was capacity imbalance, not
the routing gate (calibrated-tau already separated hack/clean, hkgap>0).

Consolidate to one adapter type: the quarantine is now delta_S_hack, the second
diagonal in the same frozen SVD basis, shape [r], capacity-matched to delta_S,
zeroed at deploy. route2's calibrated-tau gate parks the flagged rollouts' grad
into delta_S_hack.grad (like proj.py's route parks its subspace projection);
delta_S keeps the unflagged. Both diagonals train at one shared lr.

Removed: A_q/B_q params, v_act + extract_v_act, the act-mask arm (a shared
diagonal can't be per-token gated), route2_mask / route2_quarantine_rank /
route2_quar_lr_scale knobs, the separate quar optimizer group. Arm name
routing2_{act,grad} -> routing2. v_grad refresh extracts from delta_S (main)
with the quarantine ablated.

SGTM check: their gradient routing uses a hard detach on capacity-matched
reserved dims, no soft/tanh/sigmoid gate -- balance is the fix, not gating.

Smoked clean: tau/hkgap/qE render, ||delta_S_hack||>0 assert passes, exit 0.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-01 02:52:02 +00:00

Spec — per-step calibrated threshold τ for route2-grad routing

Status: APPROVED by user 2026-06-01, implement + smoke + queue + follow. Context survives compaction here; resume from "Implementation" if context was lost.

Hypothesis / why

route2-act and route2-grad both gate routing on cos(·, v) > 0. In high dimensions cos ≈ 0 for almost everything, so a threshold at exactly zero is a ~50% coin flip: it over-routes, dumps roughly half of ALL learning into the deleted quarantine, and ends in collapse (job 46) or a null result. The fix is NOT to route by the detector flag: that is SGTM, and it throws away the novelty of the vector approach, since the flag only catches known hack A while the whole point is that the direction generalises to unknown B. The fix is to keep routing by the v_hack DIRECTION but calibrate the threshold per step so it rides the cos drift (we measured cin_t 0.27->0.07 over ~10 steps, so any fixed τ goes stale and silently stops firing).
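A minimal sketch of the geometric point above, under a loud assumption: vectors here are isotropic Gaussian noise, which real per-rollout gradients are not. The helper names (cosine, random_cosines) are hypothetical and not taken from train.py. It only illustrates why a cos > 0 gate behaves like a coin flip when nothing separates the rows.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def random_cosines(dim, n, seed=0):
    """Cosines of n random Gaussian vectors against one fixed random direction."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return [cosine([rng.gauss(0.0, 1.0) for _ in range(dim)], v) for _ in range(n)]


# Assumption: isotropic noise stands in for "rollouts with no real signal".
cos_vals = random_cosines(dim=1024, n=300)

# Fraction of rows a cos > 0 gate would route: close to one half.
frac_positive = sum(c > 0 for c in cos_vals) / len(cos_vals)

# Largest |cos| seen: small, on the order of a few times 1/sqrt(dim).
max_abs = max(abs(c) for c in cos_vals)

print(f"routed by cos>0: {frac_positive:.2f}, max |cos|: {max_abs:.3f}")
```

Under this toy model the zero threshold routes about half the rows regardless of content, which is the over-routing the spec attributes to job 46.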

Driving principle (one): route by where a rollout falls on the v_hack axis RELATIVE to the known-hack vs clean clouds, recalibrated each step.

Mechanism (per module, per step, EMA-smoothed)

Anchors available every prompt without a new buffer:

hack cloud = teacher rollouts (known-A hacks, always present) + detector-flagged student rollouts (hack_E = loophole used; the allowed weak detector).
clean cloud = NOT-flagged student rollouts. (Contaminated with unknown B -- that is FINE and intended: it makes τ slightly conservative, and B, sharing the direction, lands ABOVE τ and gets routed. Do NOT force-keep this set.)

Per module name, from this prompt's cos_b [G] (already computed in _route2_grad_filter):

μ_hack_prompt = mean(cos_b[hack_anchor]) ; μ_clean_prompt = mean(cos_b[clean_anchor])
EMA across prompts/steps (β=0.9): ema_hack[name], ema_clean[name] (EMA is the cheap equivalent of "last N hacks": teacher is the always-present floor, recent student-flagged hacks sharpen it. Explicit rollout buffer = TODO if teacher-anchor proves biased toward teacher-style hacks.)
τ[name] = (ema_hack[name] + ema_clean[name]) / 2
route_mask = hack_anchor | (cos_b > τ[name]) # force-route known hacks (teacher + flagged student); τ-route the ambiguous rest (incl. unknown B).

Warmup: at step 0 delta_S≈0, so there are few reliable axes and cos_b≈0. Hence μ_hack≈μ_clean≈0 and τ≈0 (≈ the old cos>0 behaviour) until delta_S grows. Separation (hkgap) emerges over the first few steps. Sanity: within a few steps expect μ_clean<~0.1 and μ_hack>~0.2 (the user's hardcode intuition, but self-calibrated).
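The mechanism above can be sketched in plain Python as follows. This is an illustration, not the code in train.py: it uses lists instead of device tensors, the function names are hypothetical, and seeding the EMA with the first observed mean is an assumption (the spec only fixes β=0.9 and the .get(name, 0) fallback).

```python
EMA_BETA = 0.9


def mean(xs):
    return sum(xs) / len(xs)


def update_ema(ema, name, value, beta=EMA_BETA):
    # Assumption: the first observation seeds the EMA; later ones are blended.
    if name not in ema:
        ema[name] = value
    else:
        ema[name] = beta * ema[name] + (1.0 - beta) * value


def calibrated_route(name, cos_b, hack_anchor, clean_anchor, ema_hack, ema_clean):
    """Return (route_mask, tau, hkgap) for one module on one prompt."""
    hack_vals = [c for c, h in zip(cos_b, hack_anchor) if h]
    clean_vals = [c for c, k in zip(cos_b, clean_anchor) if k]
    # Skip a cloud that has no rows on this prompt.
    if hack_vals:
        update_ema(ema_hack, name, mean(hack_vals))
    if clean_vals:
        update_ema(ema_clean, name, mean(clean_vals))
    hack_level = ema_hack.get(name, 0.0)
    clean_level = ema_clean.get(name, 0.0)
    tau = (hack_level + clean_level) / 2.0
    hkgap = hack_level - clean_level
    # Force-route known hacks; tau-route the ambiguous rest (incl. unknown B).
    route = [h or (c > tau) for c, h in zip(cos_b, hack_anchor)]
    return route, tau, hkgap


# Rows: 3 student (none flagged), then 2 teacher. Student row 2 plays "unknown B".
ema_hack, ema_clean = {}, {}
cos_b = [0.02, 0.05, 0.22, 0.30, 0.25]
hack_anchor = [False, False, False, True, True]
clean_anchor = [True, True, True, False, False]
route, tau, hkgap = calibrated_route(
    "layer0", cos_b, hack_anchor, clean_anchor, ema_hack, ema_clean
)
print(route, round(tau, 3), round(hkgap, 3))
```

In this toy prompt the unflagged student row at cos 0.22 sits in the clean cloud for calibration yet lands above τ and is routed, which is the intended treatment of unknown B.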

Logging (new columns, route2-grad only)

tau = mean over modules of τ[name]
hkgap = mean over modules of (ema_hack - ema_clean) <- the discrimination gauge; generalises cin_t>cin_s. If hkgap collapses to ~0 the direction stopped separating (the real failure signal, not a threshold-choice problem).

Implementation (exact edits in src/projected_grpo/train.py)

Before the step loop (near route_span_checked = False, ~L1082): add ema_hack_cos: dict = {}, ema_clean_cos: dict = {}, route2_tau: dict = {} (persist across steps). EMA_BETA = 0.9.
Per-step lists near step_flagged (~L1192): step_tau = [], step_hkgap = [].
Per-prompt anchor masks: after grading (hack_E_flags built, ~L1396) and before the backward pass, when is_route2_grad, build device bool tensors aligned to the merged rows [student 0..G_s-1, teacher G_s..]:
- hack_anchor[i] = (not is_student[i]) or (is_student[i] and hack_E_flags[i])
- clean_anchor[i] = is_student[i] and not hack_E_flags[i]
(hack_E_flags has length G_s = #student rows; teacher rows -> hack_anchor.)
_route2_grad_filter (~L1194): add params (hack_anchor, clean_anchor). Replace flagged = (cos_b > 0).float() with the EMA-calibrated τ logic:
- update ema_hack_cos[name]/ema_clean_cos[name] from this prompt's anchors (skip a cloud if it has no rows this prompt)
- τ = (ema_hack_cos.get(name,0)+ema_clean_cos.get(name,0))/2 ; route2_tau[name]=τ
- route = (cos_b > τ) | hack_anchor ; flagged = route.float()
- step_tau.append(τ); step_hkgap.append(ema_hack_cos.get(name,0)-ema_clean_cos.get(name,0))
Pass hack_anchor/clean_anchor through the call site (~L1518).
Row dict (~L1864): add "tau": (mean(step_tau) if step_tau else float("nan")), "hkgap": (mean(step_hkgap) if step_hkgap else float("nan")).
StepLogger cols (~L711, after the routing2_act block): add, under if arm == "routing2_grad":
- _Col("tau", 6, "tau", "+.2f", "per-step calibrated route threshold (midpoint of hack vs clean cos clouds)")
- _Col("hkgap", 6, "hkgap", "+.2f", "ema_hack_cos - ema_clean_cos; >0 = v_grad still separates hack from clean")
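A hedged sketch of two of the edits listed above: the anchor masks over merged rows and the NaN-safe row-dict means. It uses Python lists rather than device bool tensors, and build_anchor_masks / mean_or_nan are hypothetical names, not functions in train.py.

```python
def build_anchor_masks(num_student, num_teacher, hack_E_flags):
    """Masks aligned to merged rows [student 0..G_s-1, teacher G_s..].

    hack_E_flags has one entry per student row only; teacher rows are
    always hack anchors and never clean anchors.
    """
    assert len(hack_E_flags) == num_student
    is_student = [True] * num_student + [False] * num_teacher
    # Pad flags so every merged row has an entry; teacher padding is unused.
    flags = list(hack_E_flags) + [False] * num_teacher
    hack_anchor = [(not s) or f for s, f in zip(is_student, flags)]
    clean_anchor = [s and not f for s, f in zip(is_student, flags)]
    return hack_anchor, clean_anchor


def mean_or_nan(values):
    """Mean for the row dict; NaN when no module reported on this step."""
    return sum(values) / len(values) if values else float("nan")


# 3 student rows (row 1 flagged by the detector) followed by 2 teacher rows.
hack_anchor, clean_anchor = build_anchor_masks(3, 2, [False, True, False])
print(hack_anchor)   # [False, True, False, True, True]
print(clean_anchor)  # [True, False, True, False, False]

step_tau = [0.02, 0.04]
row = {"tau": mean_or_nan(step_tau), "hkgap": mean_or_nan([])}
```

The two masks are disjoint and together cover every row, so each rollout calibrates exactly one cloud.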

Verify / queue / follow

just smoke-route2: confirm tau/hkgap/qE columns render, routing fires (flagged frac < ~0.5, not pinned at 0.5), exit 0.
Queue (substrate, 60 steps, seed 41), label: why: does per-step calibrated-τ vector routing (route2-grad) stop over-routing (flagged<<0.5) and suppress held-out deploy-hack vs vanilla at matched solve; resolve: qE bounded + hkgap>0 sustained + deploy file_marker hack < vanilla.
pueue follow <id> | tail in background.

No-cheat check

Detector flag (hack_E) only CALIBRATES μ_hack (known A) + anchors teacher hacks. The routing GATE is cos>τ applied to ALL rollouts incl. unknown B. gt_pass never touches routing. Compliant.

Status: IMPLEMENTED + RUNNING (job 54)

Implemented in 6eb894f, smoked (tau/hkgap render, exit 0). Running as job 54 (calibrated-τ + grad_clip 0.5, route2-grad, 60 steps, seed 41). job 52 (caltau-alone) killed and folded in; clip 0.5 is a no-op while gn<0.5 so 54 is the strict superset.

Live observation (steps 0-2, 2026-06-01)

The calibration works as a discriminator but does NOT fix over-routing-by-energy:

hkgap positive and rising (0.00 -> 0.03 -> 0.08): the v_grad direction DOES separate the hack cloud from the clean cloud. The vector is alive.
tau tracking up with it (0.00 -> 0.02 -> 0.04): the threshold rides the drift as designed.
BUT qE 0.73 -> 0.97 -> 0.97: ~97% of gradient ENERGY lands in the deleted quarantine by step 1, and gt_s 3 -> 7 -> 0 (solving collapsed).

Diagnosis: the failure was never the routing FRACTION (which τ fixes); it is the always-summed 33M A_q/B_q quarantine MAGNITUDE. Even a correctly-thresholded route sends the routed gradient into a knob whose per-param grads dwarf delta_S's, so the energy ratio pins near 1 and the deployed adapter learns nothing. This is the SYNTHESIS "next lever" prediction: if qE stays high while hkgap>0, the culprit is quarantine magnitude, not the gate.
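A toy model of the magnitude argument, with loud assumptions: this spec does not define qE, so the sketch treats it as the quarantine's share of squared gradient norm, and it models each knob's energy as param count times squared per-param RMS. energy_fraction is a hypothetical helper. The ~100x capacity edge is the figure from the commit notes; everything else is illustrative.

```python
def energy_fraction(n_quar, n_main, rms_quar=1.0, rms_main=1.0):
    """Share of gradient energy landing in the quarantine knob.

    Assumption: energy = (number of params) * (per-param grad RMS)^2 for
    each knob, and qE is the quarantine's share of the total.
    """
    e_quar = n_quar * rms_quar ** 2
    e_main = n_main * rms_main ** 2
    return e_quar / (e_quar + e_main)


# ~100x capacity edge at equal per-param RMS: the share is already ~0.99.
print(round(energy_fraction(100, 1), 3))

# Capacity-matched knobs at equal per-param RMS: an even split.
print(round(energy_fraction(1, 1), 3))

# A larger per-param RMS on the quarantine pushes the share up further.
print(round(energy_fraction(100, 1, rms_quar=2.0), 3))
```

Under these assumptions the share pins near 1 from capacity imbalance alone, independent of how well the gate separates hack from clean.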

DESIGN CHANGE (2026-06-01): one adapter, scale-matched quarantine

Acted on the magnitude diagnosis by removing the distinct-basis LoRA entirely. The quarantine is now delta_S_hack -- the SECOND diagonal in the same frozen SVD basis, shape [r] per module, identical capacity to delta_S. route2's calibrated-τ gate parks the flagged rollouts' delta_S-grad contribution into delta_S_hack.grad (via step_grad_hack in _route2_grad_filter), exactly as proj.py's route parks its subspace-projected component; delta_S keeps the unflagged. Both diagonals train at one shared lr; delta_S_hack is zeroed at deploy.
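A minimal sketch of the parking step described above, assuming per-rollout gradient contributions on the [r] diagonal are already available as lists. How train.py obtains them (step_grad_hack inside _route2_grad_filter) is not shown in this spec, and split_routed_grad / deploy_diagonals are hypothetical names.

```python
def split_routed_grad(per_rollout_grads, route_mask):
    """Split per-rollout [r] gradients into (main, hack) accumulators.

    Flagged rollouts accumulate into the delta_S_hack gradient; unflagged
    rollouts stay on delta_S. Both accumulators have the same shape [r].
    """
    r = len(per_rollout_grads[0])
    grad_main = [0.0] * r
    grad_hack = [0.0] * r
    for grad, routed in zip(per_rollout_grads, route_mask):
        target = grad_hack if routed else grad_main
        for j, g in enumerate(grad):
            target[j] += g
    return grad_main, grad_hack


def deploy_diagonals(delta_S, delta_S_hack):
    """At deploy the quarantine diagonal is zeroed; delta_S is kept as-is."""
    return list(delta_S), [0.0] * len(delta_S_hack)


# Three rollouts on a rank-2 diagonal; the last two are routed.
grads = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
route_mask = [False, True, True]
grad_main, grad_hack = split_routed_grad(grads, route_mask)
print(grad_main, grad_hack)  # [1.0, 0.0] [3.0, 5.0]

kept, zeroed = deploy_diagonals([0.1, -0.2], [0.4, 0.5])
```

The split is lossless: grad_main plus grad_hack equals the unrouted total, so routing only decides which equal-capacity diagonal absorbs each rollout.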

Rationale (user): a 33M LoRA vs a ~2k-param delta_S per module means "dump everything in the quarantine" is the low-resistance path -- a capacity edge, not honest absorption. Capacity-balanced diagonals remove that bias. SGTM's own quarantine is capacity-matched (a split of the same layer, equal dims), and uses a hard detach -- no soft/tanh/sigmoid gate -- confirming the fix is balance, not gating.

Removed: A_q/B_q params, v_act buffer + extract_v_act, the act-mask arm (a shared diagonal can't be per-token gated), route2_mask / route2_quarantine_rank / route2_quar_lr_scale knobs, the separate quar optimizer group. arm name "routing2_grad"/"routing2_act" -> "routing2".

v_grad refresh extracts from the MAIN knob (delta_S.grad) with the quarantine ablated -- the deployed-model gradient is what we route, and both diagonals share the basis so the direction is directly usable on delta_S's live gradient.

Smoked clean (tiny-random): tau/hkgap/qE render, ||delta_S_hack||=0.0074>0 assert passes, deploy-ablation fires, exit 0. Queued on the substrate (seed 41, 60 steps).
