evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 19:15:20 +08:00

Files

T

wassname 8158adb543 refactor: route2 quarantine = scale-matched delta_S_hack, rip out 33M LoRA

The distinct-basis A_q/B_q LoRA (~33M params at rank-16) gave the quarantine a
~100x capacity edge over delta_S, so routing-everything-there was the low-
resistance path: qE pinned ~0.97 (energy into the thrown-away knob) while the
deployed delta_S learned nothing (job 54). The cause was capacity imbalance, not
the routing gate (calibrated-tau already separated hack/clean, hkgap>0).

Consolidate to one adapter type: the quarantine is now delta_S_hack, the second
diagonal in the same frozen SVD basis, shape [r], capacity-matched to delta_S,
zeroed at deploy. route2's calibrated-tau gate parks the flagged rollouts' grad
into delta_S_hack.grad (like proj.py's route parks its subspace projection);
delta_S keeps the unflagged. Both diagonals train at one shared lr.

Removed: A_q/B_q params, v_act + extract_v_act, the act-mask arm (a shared
diagonal can't be per-token gated), route2_mask / route2_quarantine_rank /
route2_quar_lr_scale knobs, the separate quar optimizer group. Arm name
routing2_{act,grad} -> routing2. v_grad refresh extracts from delta_S (main)
with the quarantine ablated.

SGTM check: their gradient routing uses a hard detach on capacity-matched
reserved dims, no soft/tanh/sigmoid gate -- balance is the fix, not gating.

Smoked clean: tau/hkgap/qE render, ||delta_S_hack||>0 assert passes, exit 0.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-01 02:52:02 +00:00

blog

docs: fix review findings (global noise-floor, route one-sided, G3 xref)

2026-05-31 00:41:12 +00:00

brainstorm

ready

2026-05-23 14:19:41 +08:00

figs

docs: review outputs + figs; drop stale Qwen3.5-0.8B svd cache

2026-05-31 00:00:40 +00:00

grad_routing

docs: SGTM vs ours -- diagnostics, tricks, and proposed improvements (B = route within delta_S along SVD axes)