Mirror of https://github.com/wassname/evil_MoE.git, synced 2026-06-27 19:31:11 +08:00
journal: route2 capacity-imbalance realization + scale-matched delta_S fix
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -2,6 +2,56 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-06-01 — route2 quarantine was capacity-imbalanced: rip out the 33M LoRA, use two scale-matched delta_S

**Context:** commits `8158adb` (refactor) + `dc5d451` (GPU smoke), `probe/distill-cosine`.
route2-grad with calibrated-tau on the seed-41 substrate (job 54 on the old LoRA code,
job 57 on the fixed code).

**Observation (job 54, distinct-basis A_q/B_q LoRA quarantine):** calibrated-tau works as a
DISCRIMINATOR -- hkgap (ema_hack_cos - ema_clean_cos) rises 0.00->0.08 over steps 0-2, tau
tracks it up. But qE (grad energy into the quarantine) jumps 0.73->0.97 and gt_s collapses
3->7->0, so the deployed delta_S learns ~nothing. The LoRA is ~33M params at rank-16 vs
delta_S's ~0.5M diagonal -- a ~60-100x capacity gap. act-mask (job 46) saladed the same way:
cos>0 routed ~half of everything into the same oversized knob.

**Interpretation:** the failure was capacity imbalance, NOT the routing gate. A quarantine
with ~100x the params is the lower-resistance sink -- per-param grads dwarf delta_S's, so the
energy ratio pins near 1 no matter how little is actually routed. calibrated-tau was the
discriminating experiment that proved this: it fixed the routing FRACTION (flagged<<0.5) and
hkgap>0 shows the direction separates, yet qE stayed ~0.97 -> magnitude, not gate.

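The capacity argument can be sanity-checked with back-of-envelope arithmetic. The sketch below is a toy model, not the journal's measured qE: it assumes qE is the quarantine's share of squared-gradient energy, that a block's energy scales as `n_params * rms^2`, and that a fraction `routed_frac` of the signal is routed into the quarantine. The function name and all parameters are hypothetical; the param counts are the ~33M and ~0.5M figures quoted above.

```python
def quarantine_energy_fraction(n_quar, n_kept, routed_frac, rms_quar=1.0, rms_kept=1.0):
    """Toy qE: share of squared-grad energy that lands in the quarantine.

    Assumption (not from the journal): block energy ~ n_params * rms^2,
    weighted by the share of the signal routed to that block.
    """
    e_quar = routed_frac * n_quar * rms_quar ** 2
    e_kept = (1.0 - routed_frac) * n_kept * rms_kept ** 2
    return e_quar / (e_quar + e_kept)


if __name__ == "__main__":
    for frac in (0.05, 0.2, 0.5):
        # ~33M-param LoRA quarantine vs ~0.5M-param delta_S diagonal
        imbalanced = quarantine_energy_fraction(33e6, 0.5e6, frac)
        # two diagonals of the same size
        matched = quarantine_energy_fraction(0.5e6, 0.5e6, frac)
        print(f"routed={frac:.2f}  imbalanced qE={imbalanced:.3f}  matched qE={matched:.3f}")
```

Under these assumptions the imbalanced ratio stays far above the routed fraction even when little is routed, while the matched ratio simply equals the routed fraction, which is the behaviour the interpretation predicts.
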
**What else this exposed (the "anything else"):**
- The #167 "LR-too-high fix" (`quar_lr_scale=0.1`) was a band-aid on this same root cause --
the oversized fresh-kaiming LoRA diverged at shared lr (run 43 salad). One knob (lr) hid the
divergence symptom; qE exposed the absorption symptom. Same cause. Both gone now.
- SGTM cross-check: their gradient routing uses a hard `.detach()` on a CAPACITY-MATCHED
reserved split of the same layer -- no soft/tanh/sigmoid gate. Confirms balance is the lever.
- Conceptual un-nulling: two-delta_S shared-basis *grad* routing is valid despite the earlier
"gauge freedom" worry. We IMPOSE the split via the cos gate, so we don't rely on emergent
self-reinforcement to decide what lives where; the gauge worry only bites methods that need
specialization to emerge, not imposed routing.
- Meta: smoke ran fp32+CPU, so it never walked the bf16+flash_attn2 path the real run uses --
the dtype/magnitude bug class was invisible to the correctness gate. Fixed: smoke now runs
on GPU (peak ~1.4GB on the tiny-random model).

**Fix:** two delta_S diagonals -- `delta_S` (kept) + `delta_S_hack` (quarantine), same frozen
SVD basis, same shape r, same lr, `delta_S_hack` zeroed at deploy. route2's calibrated-tau
parks flagged rollouts' grad into `delta_S_hack.grad` (exactly as proj.py's `route` parks its
subspace projection). No capacity edge -> honest absorption. Removed: A_q/B_q LoRA, v_act +
extract_v_act, the act-mask arm (a diagonal can't be per-token gated), the route2_mask /
quarantine_rank / quar_lr_scale knobs, the separate optimizer group. Smoked clean.

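A minimal sketch of the imposed split, with plain Python lists standing in for tensors. This is not the repo's code: `route_grads`, `hack_dir`, `tau`, and `effective_scale` are hypothetical names, the flagging rule (cosine against a reference hack direction, thresholded at tau) is an assumption about how the cos gate works, and summing the two diagonals during training is likewise assumed.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def route_grads(rollout_grads, hack_dir, tau):
    """Accumulate per-rollout grads into two same-shape buffers.

    Flagged rollouts (cos > tau, an assumed rule) go to the quarantine buffer,
    everything else goes to the kept buffer.
    """
    r = len(hack_dir)
    kept_grad = [0.0] * r  # stands in for delta_S.grad
    hack_grad = [0.0] * r  # stands in for delta_S_hack.grad
    n_flagged = 0
    for g in rollout_grads:
        flagged = cosine(g, hack_dir) > tau
        target = hack_grad if flagged else kept_grad
        for i, gi in enumerate(g):
            target[i] += gi
        n_flagged += int(flagged)
    return kept_grad, hack_grad, n_flagged


def effective_scale(delta_S, delta_S_hack, deploy):
    """Training sees both diagonals (assumed additive); deploy zeroes the quarantine."""
    if deploy:
        return list(delta_S)
    return [a + b for a, b in zip(delta_S, delta_S_hack)]


if __name__ == "__main__":
    grads = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
    kept, hack, n = route_grads(grads, hack_dir=[1.0, 0.0, 0.0], tau=0.5)
    print("flagged rollouts:", n)
    print("kept grad:", kept)
    print("hack grad:", hack)
```

Because both buffers have the same shape and would be stepped with the same lr, neither side has a capacity edge; only the gate decides where a rollout's gradient lands.
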
**Failure modes to watch on job 57:** (1) most-likely -- balanced delta_S_hack still
over-absorbs because cos-routing routes too much regardless of capacity; check qE drops off
~0.97 toward ~0.5. (2) subtle -- matched capacity is too weak to hold the hack, leaks back,
deploy-hack ~ vanilla; check deploy file_marker hack. (3) null -- route2 adds nothing over
erase once balanced; check route2 vs erase deploy numbers (only legitimate difference is
on-policy generation under an active quarantine).

**Next:** read job 57 (route2, two scale-matched delta_S, seed 41, 60 steps) on the three
watch-items above.

## 2026-05-31 (l) — erase (one-sided projection) vs vanilla: -7.8pp hack / +4.7pp solve, but the win is on held-out file_marker not in-dist run_tests
**Context:** commit `b0432af` on `probe/distill-cosine`; pueue id 41; projected/erase