evil_MoE/docs
wassname 2b020c95c0 fix: route2 Arm A flags per-rollout not per-token (external review)
The hook gate is necessarily per-token ([G*s, r], nn.Linear flattens the
batch). _route2_grad_filter now sums each rollout's token gate-grads before
the cos(g_b, v_grad) flag, so routing is per-rollout (the preregistered GRPO
unit) and the sign is denoised. With per-token flags, a clean rollout lands
~50% of its tokens at cos>0 by noise alone, spuriously routing half its
gradient mass.
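
A minimal sketch of the difference between the two flagging schemes, not the
actual _route2_grad_filter code. It assumes the [G*s, r] gate-grad rows are
laid out rollout-major (each rollout's s tokens contiguous), ignores padding
masks, and uses hypothetical helper names (cos, per_token_flags,
per_rollout_flags) and hand-picked toy values:

```python
import math

def cos(a, b, eps=1e-8):
    """Cosine similarity with an eps guard against zero norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + eps)

def per_token_flags(gate_grads, v_grad):
    """Old behaviour (as described above): one flag per token row."""
    return [cos(g, v_grad) > 0 for g in gate_grads]

def per_rollout_flags(gate_grads, v_grad, G, s):
    """New behaviour: sum each rollout's token rows, then flag once per rollout."""
    assert len(gate_grads) == G * s  # assumed rollout-major layout
    flags = []
    for b in range(G):
        rows = gate_grads[b * s:(b + 1) * s]
        g_b = [sum(col) for col in zip(*rows)]  # summed gate-grad for rollout b
        flags.append(cos(g_b, v_grad) > 0)
    return flags

# Toy data: G=2 rollouts, s=4 tokens, r=2 gate dims (all values hypothetical).
v_grad = [1.0, 0.0]
gate_grads = [
    # rollout 0: half the tokens point along v_grad, but the sum does not
    [0.5, 1.0], [-1.0, 0.0], [0.5, -1.0], [-1.0, 0.0],
    # rollout 1: mostly aligned with v_grad, one stray token
    [1.0, 0.0], [1.0, 1.0], [-0.5, 0.0], [1.0, -1.0],
]

token_flags = per_token_flags(gate_grads, v_grad)
rollout_flags = per_rollout_flags(gate_grads, v_grad, G=2, s=4)
```

In this toy case the per-token scheme flags two of rollout 0's four tokens
even though its summed gate-grad points away from v_grad, while the
per-rollout scheme leaves rollout 0 unflagged and flags rollout 1 as a whole.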

Verified by deepseek-v4-pro review: gate identity, divide-out, eps-guard,
Arm B detach-route, R5 no-cheat all correct; this was the one finding.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 11:25:13 +00:00