journal: route2 capacity-imbalance realization + scale-matched delta_S fix

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-01 02:58:35 +00:00
parent dc5d4516c2
commit 59f8b6efdc
+50
@@ -2,6 +2,56 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-06-01 — route2 quarantine was capacity-imbalanced: rip out the 33M LoRA, use two scale-matched delta_S
**Context:** commits `8158adb` (refactor) + `dc5d451` (GPU smoke), `probe/distill-cosine`.
route2-grad with calibrated-tau on the seed-41 substrate (job 54 on the old LoRA code,
job 57 on the fixed code).
**Observation (job 54, distinct-basis A_q/B_q LoRA quarantine):** calibrated-tau works as a
DISCRIMINATOR -- hkgap (ema_hack_cos - ema_clean_cos) rises 0.00->0.08 over steps 0-2, tau
tracks it up. But qE (grad energy into the quarantine) jumps 0.73->0.97 and gt_s collapses
3->7->0, so the deployed delta_S learns ~nothing. The LoRA is ~33M params at rank-16 vs
delta_S's ~0.5M diagonal -- a ~60-100x capacity gap. act-mask (job 46) degenerated into salad
the same way: cos>0 routed ~half of everything into the same oversized knob.
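
A minimal sketch of the two diagnostics quoted above. It assumes `hkgap` is the difference of two exponential moving averages of per-rollout cosines (as the parenthetical defines it) and that `qE` is the fraction of squared gradient norm landing in the quarantine. The function names, the EMA decay, and the exact `qE` formula are illustrative assumptions, not the project's code.

```python
def ema_update(prev, value, decay=0.9):
    """One exponential-moving-average step. decay=0.9 is an assumed value."""
    return decay * prev + (1.0 - decay) * value


def hkgap(ema_hack_cos, ema_clean_cos):
    """Gap between the EMA cosine of hack rollouts and of clean rollouts."""
    return ema_hack_cos - ema_clean_cos


def quarantine_energy(grad_keep, grad_quar):
    """Assumed definition of qE: share of squared grad norm in the quarantine."""
    e_keep = sum(g * g for g in grad_keep)
    e_quar = sum(g * g for g in grad_quar)
    total = e_keep + e_quar
    return e_quar / total if total > 0 else 0.0
```

Under this definition qE is 0.5 when both knobs receive equal gradient energy, which is the reference point for reading the 0.73 and 0.97 values.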
**Interpretation:** the failure was capacity imbalance, NOT the routing gate. A quarantine
with ~100x the params is the lower-resistance sink -- per-param grads dwarf delta_S's, so the
energy ratio pins near 1 no matter how little is actually routed. calibrated-tau was the
discriminating experiment: it fixed the routing FRACTION (flagged<<0.5), and hkgap>0 shows
the direction separates, yet qE stayed ~0.97 -> the problem is magnitude, not the gate.
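
The magnitude argument can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes, as a simplification, that every parameter in both groups sees the same RMS gradient; real runs need not satisfy this, and larger per-param grads in the quarantine would push the ratio higher still. `expected_energy_fraction` is a hypothetical helper, and the parameter counts are the approximate ones from the observation above.

```python
def expected_energy_fraction(n_keep, n_quar, rms_keep=1.0, rms_quar=1.0):
    """Expected share of squared grad norm in the quarantine group.

    Assumes i.i.d. per-parameter gradients with the given RMS in each group,
    so each group's energy is (param count) * (RMS squared).
    """
    e_keep = n_keep * rms_keep ** 2
    e_quar = n_quar * rms_quar ** 2
    return e_quar / (e_keep + e_quar)


# ~33M-param LoRA quarantine vs ~0.5M-param delta_S diagonal
lora_vs_diag = expected_energy_fraction(500_000, 33_000_000)

# two diagonals of the same size
matched = expected_energy_fraction(500_000, 500_000)  # 0.5
```

With equal per-parameter gradient scale the split is set purely by parameter count, so a capacity-matched pair starts from 0.5 rather than near 1.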
**What else this exposed (the "anything else"):**
- The #167 "LR-too-high fix" (`quar_lr_scale=0.1`) was a band-aid on this same root cause --
the oversized fresh-kaiming LoRA diverged at shared lr (run 43 salad). One knob (lr) hid the
divergence symptom; qE exposed the absorption symptom. Same cause. Both gone now.
- SGTM cross-check: their gradient routing uses a hard `.detach()` on a CAPACITY-MATCHED
reserved split of the same layer -- no soft/tanh/sigmoid gate. Confirms balance is the lever.
- Conceptual un-nulling: two-delta_S shared-basis *grad* routing is valid despite the earlier
"gauge freedom" worry. We IMPOSE the split via the cos gate, so we don't rely on emergent
self-reinforcement to decide what lives where; the gauge worry only bites methods that need
specialization to emerge, not imposed routing.
- Meta: smoke ran fp32+CPU, so it never walked the bf16+flash_attn2 path the real run uses --
the dtype/magnitude bug class was invisible to the correctness gate. Fixed: smoke now runs
on GPU (peak ~1.4GB on the tiny-random model).
**Fix:** two delta_S diagonals -- `delta_S` (kept) + `delta_S_hack` (quarantine), same frozen
SVD basis, same shape r, same lr, `delta_S_hack` zeroed at deploy. route2's calibrated-tau
parks flagged rollouts' grad into `delta_S_hack.grad` (exactly as proj.py's `route` parks its
subspace projection). No capacity edge -> honest absorption. Removed: A_q/B_q LoRA, v_act +
extract_v_act, the act-mask arm (a diagonal can't be per-token gated), the route2_mask /
quarantine_rank / quar_lr_scale knobs, the separate optimizer group. Smoked clean.
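
A minimal sketch of the two-diagonal routing described in the fix, in plain Python lists instead of tensors. It assumes per-rollout gradients over the shared length-r diagonal are available, that a rollout is flagged when its cosine exceeds tau, and that the two diagonals add during training. `route_grads`, `deployed_scale`, and the additive combination are hypothetical, not the repository's API.

```python
def route_grads(rollout_grads, rollout_cos, tau):
    """Sum per-rollout grads into two same-shape buffers.

    Rollouts with cos > tau are flagged and parked in the hack buffer
    (standing in for delta_S_hack.grad); the rest go to the keep buffer.
    """
    r = len(rollout_grads[0])
    keep = [0.0] * r
    hack = [0.0] * r
    for grad, cos in zip(rollout_grads, rollout_cos):
        target = hack if cos > tau else keep
        for i, g in enumerate(grad):
            target[i] += g
    return keep, hack


def deployed_scale(delta_S, delta_S_hack, deploy=True):
    """At deploy the quarantine diagonal is zeroed, leaving only delta_S.

    During training the two are assumed to add on the shared SVD basis.
    """
    if deploy:
        return list(delta_S)
    return [a + b for a, b in zip(delta_S, delta_S_hack)]


# One clean rollout (cos 0.1) and one flagged rollout (cos 0.9), tau = 0.5
keep, hack = route_grads([[1.0, 2.0], [3.0, 4.0]], [0.1, 0.9], tau=0.5)
```

Because both buffers have the same shape and would share one learning rate, neither has a capacity edge; only the flagging decision determines where a rollout's gradient lands.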
**Failure modes to watch on job 57:** (1) most-likely -- balanced delta_S_hack still
over-absorbs because cos-routing routes too much regardless of capacity; check qE drops off
~0.97 toward ~0.5. (2) subtle -- matched capacity is too weak to hold the hack, leaks back,
deploy-hack ~ vanilla; check deploy file_marker hack. (3) null -- route2 adds nothing over
erase once balanced; check route2 vs erase deploy numbers (only legitimate difference is
on-policy generation under an active quarantine).
**Next:** read job 57 (route2, two scale-matched delta_S, seed 41, 60 steps) against the
three failure modes above.
## 2026-05-31 (l) — erase (one-sided projection) vs vanilla: -7.8pp hack / +4.7pp solve, but the win is on held-out file_marker not in-dist run_tests
**Context:** commit `b0432af` on `probe/distill-cosine`; pueue id 41; projected/erase