From 59f8b6efdcd9aa5e3fa0ec9651353b381dd3cba7 Mon Sep 17 00:00:00 2001
From: wassname
Date: Mon, 1 Jun 2026 02:58:35 +0000
Subject: [PATCH] journal: route2 capacity-imbalance realization + scale-matched delta_S fix

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 RESEARCH_JOURNAL.md | 50 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 50 insertions(+)

diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index 495b796..3aa55fc 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -2,6 +2,56 @@
 
 Append-only. New entries at the top, date-stamped. Never edit old entries.
 
+## 2026-06-01 — route2 quarantine was capacity-imbalanced: rip out the 33M LoRA, use two scale-matched delta_S
+
+**Context:** commits `8158adb` (refactor) + `dc5d451` (GPU smoke), `probe/distill-cosine`.
+route2-grad with calibrated-tau on the seed-41 substrate (job 54 on the old LoRA code,
+job 57 on the fixed code).
+
+**Observation (job 54, distinct-basis A_q/B_q LoRA quarantine):** calibrated-tau works as a
+DISCRIMINATOR -- hkgap (ema_hack_cos - ema_clean_cos) rises 0.00->0.08 over steps 0-2, tau
+tracks it up. But qE (grad energy into the quarantine) jumps 0.73->0.97 and gt_s collapses
+3->7->0, so the deployed delta_S learns ~nothing. The LoRA is ~33M params at rank-16 vs
+delta_S's ~0.5M diagonal -- a ~60-100x capacity gap. act-mask (job 46) saladed the same way:
+cos>0 routed ~half of everything into the same oversized knob.
+
+**Interpretation:** the failure was capacity imbalance, NOT the routing gate. A quarantine
+with ~100x the params is the lower-resistance sink -- per-param grads dwarf delta_S's, so the
+energy ratio pins near 1 no matter how little is actually routed. calibrated-tau was the
+discriminating experiment that proved this: it fixed the routing FRACTION (flagged<<0.5) and
+hkgap>0 shows the direction separates, yet qE stayed ~0.97 -> magnitude, not gate.
+
+**What else this exposed (the "anything else"):**
+- The #167 "LR-too-high fix" (`quar_lr_scale=0.1`) was a band-aid on this same root cause --
+  the oversized fresh-kaiming LoRA diverged at shared lr (run 43 salad). One knob (lr) hid the
+  divergence symptom; qE exposed the absorption symptom. Same cause. Both gone now.
+- SGTM cross-check: their gradient routing uses a hard `.detach()` on a CAPACITY-MATCHED
+  reserved split of the same layer -- no soft/tanh/sigmoid gate. Confirms balance is the lever.
+- Conceptual un-nulling: two-delta_S shared-basis *grad* routing is valid despite the earlier
+  "gauge freedom" worry. We IMPOSE the split via the cos gate, so we don't rely on emergent
+  self-reinforcement to decide what lives where; the gauge worry only bites methods that need
+  specialization to emerge, not imposed routing.
+- Meta: smoke ran fp32+CPU, so it never walked the bf16+flash_attn2 path the real run uses --
+  the dtype/magnitude bug class was invisible to the correctness gate. Fixed: smoke now runs
+  on GPU (peak ~1.4GB on the tiny-random model).
+
+**Fix:** two delta_S diagonals -- `delta_S` (kept) + `delta_S_hack` (quarantine), same frozen
+SVD basis, same shape r, same lr, `delta_S_hack` zeroed at deploy. route2's calibrated-tau
+parks flagged rollouts' grad into `delta_S_hack.grad` (exactly as proj.py's `route` parks its
+subspace projection). No capacity edge -> honest absorption. Removed: A_q/B_q LoRA, v_act +
+extract_v_act, the act-mask arm (a diagonal can't be per-token gated), the route2_mask /
+quarantine_rank / quar_lr_scale knobs, the separate optimizer group. Smoked clean.
+
+**Failure modes to watch on job 57:** (1) most-likely -- balanced delta_S_hack still
+over-absorbs because cos-routing routes too much regardless of capacity; check qE drops off
+~0.97 toward ~0.5. (2) subtle -- matched capacity is too weak to hold the hack, leaks back,
+deploy-hack ~ vanilla; check deploy file_marker hack. (3) null -- route2 adds nothing over
+erase once balanced; check route2 vs erase deploy numbers (only legitimate difference is
+on-policy generation under an active quarantine).
+
+**Next:** read job 57 (route2, two scale-matched delta_S, seed 41, 60 steps) on the four
+watch-items above.
+
 ## 2026-05-31 (l) — erase (one-sided projection) vs vanilla: -7.8pp hack / +4.7pp solve, but the win is on held-out file_marker not in-dist run_tests
 
 **Context:** commit `b0432af` on `probe/distill-cosine`; pueue id 41; projected/erase
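
The capacity-imbalance reading in the 2026-06-01 entry can be sanity-checked with back-of-envelope arithmetic. The sketch below is a toy model and not the repo's qE computation: `energy_fraction`, `routed_fraction`, and `rms_ratio` are hypothetical names. It assumes every parameter sees the same gradient RMS, that each rollout's gradient goes to exactly one branch, and that squared norms from different rollouts add. Only the ~33M and ~0.5M parameter counts come from the entry.

```python
def energy_fraction(n_quarantine, n_kept, routed_fraction=0.5, rms_ratio=1.0):
    """Toy share of squared-gradient norm that lands in the quarantine.

    Assumptions (illustrative, not taken from the journal's code):
    - a fraction `routed_fraction` of rollouts is routed to the quarantine,
      the rest to the kept branch;
    - per-rollout squared-gradient norms add, and scale with parameter count;
    - `rms_ratio` is an optional per-parameter RMS multiplier for the
      quarantine relative to the kept branch (1.0 means equal RMS).
    """
    e_quarantine = n_quarantine * routed_fraction * rms_ratio ** 2
    e_kept = n_kept * (1.0 - routed_fraction)
    return e_quarantine / (e_quarantine + e_kept)


lora_params = 33_000_000  # ~33M rank-16 LoRA quarantine (figure from the entry)
diag_params = 500_000     # ~0.5M delta_S diagonal (figure from the entry)

for f in (0.5, 0.2, 0.1):
    oversized = energy_fraction(lora_params, diag_params, f)
    matched = energy_fraction(diag_params, diag_params, f)
    print(f"routed={f:.1f}  oversized qE={oversized:.3f}  matched qE={matched:.3f}")
```

Under these assumptions the parameter count alone keeps the oversized quarantine's share above roughly 0.88 even when only 10% of rollouts are routed, while a capacity-matched quarantine's share simply equals the routed fraction. Real per-parameter gradient scales will differ, so this only indicates the direction of the effect, not the measured values.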