Files
evil_MoE/docs/spec
wassname 670fcb3c64 feat: route2 grad-mask (Arm A) + drop tau knob + pairset-derived v_hack path
Arm A (route2_mask=grad): per-rollout gate splice (identity at c=1) recovers
the per-sample delta_S grad after backward (c.grad = delta_S * g_b); train.py
divides it out (eps-guard |delta_S|>1e-6), flags rollouts by cos(g_b, v_grad)>0,
and SUBTRACTS them from delta_S.grad. Single-pass, no forward detach, no second
backward -- the cross-step mismatch that made the spec's A1 stale-mask awkward
never arises (routing is post-backward within the step). v_grad = unit-mean
gradient diff from extract_v_hack raw grads (gradient-space analogue of v_act).
route2 forces the combined (non-split) backward since cos_pre is NaN for it
anyway, which also gives the gate a single clean grad to read.

Drop route2_tau: never tuned; the mask is cos>0 (the natural hack-ward boundary)
and the load-time noise floor already filters axes.

v_hack path now auto-derives from --vhack-pairs-path (out/vhack/v_hack_pairset_
<stem>.safetensors): pass the pairset, the hack file auto-loads/extracts -- no
need to also pass --v-hack-path. run-substrate drops the redundant flag.

smoke: smoke-route2 (act) and new smoke-route2-grad both pass (||B_q||=0.109,
exit 0); erase shared-basis path unchanged (cout->0, fired~0.9).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:48:31 +00:00
..
wip
2026-05-28 12:44:20 +00:00
2026-05-29 05:42:28 +00:00
2026-05-29 06:29:43 +00:00