evil_MoE/docs/spec/20260614_moe_absorption_review.md
2026-06-14 09:28:16 +08:00

Verdict

Partly supported. The note does carry the load-bearing Gradient Routing constraint, but the strongest positive transfer claims go beyond what the quoted MoE evidence itself shows. Most MoE quotes here support specialization, balancing, or stability, not absorption.

Observations

  • You did not miss the negative constraint. It is explicit in "### YES: shared-path + routed-path split", "### NO for absorption: hard forward sequestering of experts", and "## Epistemic summary", via "requires that all features are present at the time of the forward pass".
  • "### YES: shared-path + routed-path split" is only partly supported by the quoted evidence. Gradient Routing supports the forward-pass requirement, and DeepSeekMoE supports "shared ones, aiming at capturing common knowledge". But "That should preserve the load-bearing forward-pass condition" is only true if the routed quarantine paths also remain available on non-routed examples.
  • "### YES, if you have multiple quarantine experts: fine-grained quarantine segmentation" overreaches the quote. DeepSeekMoE supports "more flexible combination" and specialization, not "different hack modes can absorb into different blocks".
  • "### MAYBE: expert-choice or balanced-assignment routing, but only inside the quarantine bank" is supported only as anti-collapse / load-balancing. The quoted support is "perfect load balancing" and "equal number of tokens", not absorption or transfer.
  • "### MAYBE: load-balancing auxiliary loss" is also only supported as balancing. The quote only says "encourage a balanced load across experts".
  • "### MAYBE: router z-loss / logit-scale control" is correctly scoped. The quotes only support stability, and your caveat says "not an absorption mechanism by itself".
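
To make the scoping of the last two items concrete, here is a minimal sketch of the two losses, assuming the note means the usual Switch-Transformer-style auxiliary loss (alpha * N * sum_i f_i * P_i) and the ST-MoE-style router z-loss (mean squared log-sum-exp of the router logits). The function names and the default `alpha` are illustrative choices, not taken from the note. Both functions take only router logits as input, which is consistent with reading them as balancing and stability terms rather than absorption mechanisms.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits, alpha=0.01):
    """Switch-style auxiliary loss: alpha * N * sum_i f_i * P_i.

    router_logits: list of per-token lists, one logit per expert.
    alpha is an illustrative default, not a value from the note.
    """
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]

    # f_i: fraction of tokens whose top-1 expert is i.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / num_tokens for c in counts]

    # P_i: mean router probability assigned to expert i.
    mean_p = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]

    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, mean_p))


def router_z_loss(router_logits):
    """ST-MoE-style z-loss: mean over tokens of (log-sum-exp of logits)^2."""
    total = 0.0
    for row in router_logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse ** 2
    return total / len(router_logits)


# Collapsed routing (every token picks expert 0) scores worse than balanced routing.
balanced = load_balancing_loss([[2.0, 0.0], [0.0, 2.0]])
collapsed = load_balancing_loss([[2.0, 0.0], [2.0, 0.0]])
print(balanced < collapsed)
```

Neither function sees which expert features are present in the forward pass, so on their own they cannot establish the forward-pass condition that the Gradient Routing constraint depends on.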

Most likely overreach

  • "### YES, if you have multiple quarantine experts: fine-grained quarantine segmentation"
  • The phrase "so different hack modes can absorb into different blocks instead of interfering in one monolith"
  • Secondarily, in "### YES: shared-path + routed-path split", the phrase "That should preserve the load-bearing forward-pass condition" is too strong unless those quarantine features stay present in the forward pass for non-routed examples too.
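
The caveat about non-routed examples can be illustrated with a toy forward pass. This is a minimal sketch, not the note's architecture: `shared_expert`, `quarantine_expert`, and the `hard_forward_gate` flag are hypothetical names, and the gradient masking that would localize updates to the quarantine path is deliberately left out. Only forward-pass presence is shown.

```python
def shared_expert(x):
    """Hypothetical always-on shared path."""
    return [2.0 * v for v in x]


def quarantine_expert(x):
    """Hypothetical routed quarantine path."""
    return [v + 1.0 for v in x]


def forward(x, routed_to_quarantine, hard_forward_gate):
    """Return (output, quarantine_active) for one example.

    With hard_forward_gate=True the quarantine path contributes only to
    examples routed to it. With hard_forward_gate=False it contributes to
    every example, so its features are always present in the forward pass.
    """
    out = shared_expert(x)
    quarantine_active = routed_to_quarantine or not hard_forward_gate
    if quarantine_active:
        out = [a + b for a, b in zip(out, quarantine_expert(x))]
    return out, quarantine_active


# A non-routed example under both designs.
x = [1.0, 2.0]
print(forward(x, routed_to_quarantine=False, hard_forward_gate=True))   # quarantine absent
print(forward(x, routed_to_quarantine=False, hard_forward_gate=False))  # quarantine present
```

Under the hard gate, the non-routed example never sees the quarantine features, which is the case where the "load-bearing forward-pass condition" would fail even though a shared path exists.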