mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
2.2 KiB
2.2 KiB
Verdict
Partly supported. The note does carry the load-bearing Gradient Routing constraint, but the strongest positive transfer claims go beyond what the quoted MoE evidence itself shows. Most MoE quotes here support specialization, balancing, or stability, not absorption.
Observations
- You did not miss the negative constraint. It is explicit in
### YES: shared-path + routed-path split,### NO for absorption: hard forward sequestering of experts, and## Epistemic summaryviarequires that all features are present at the time of the forward pass. ### YES: shared-path + routed-path splitis only partly supported by the quoted evidence. Gradient Routing supports the forward-pass requirement, and DeepSeekMoE supportsshared ones, aiming at capturing common knowledge. ButThat should preserve the load-bearing forward-pass conditionis only true if therouted quarantine pathsalso remain available on non-routed examples.### YES, if you have multiple quarantine experts: fine-grained quarantine segmentationoverreaches the quote. DeepSeekMoE supportsmore flexible combinationand specialization, notdifferent hack modes can absorb into different blocks.### MAYBE: expert-choice or balanced-assignment routing, but only inside the quarantine bankis supported only as anti-collapse / load-balancing. The quoted support isperfect load balancingandequal number of tokens, not absorption or transfer.### MAYBE: load-balancing auxiliary lossis also only supported as balancing. The quote only saysencourage a balanced load across experts.### MAYBE: router z-loss / logit-scale controlis correctly scoped. The quotes only support stability, and your caveat saysnot an absorption mechanism by itself.
Most likely overreach
### YES, if you have multiple quarantine experts: fine-grained quarantine segmentation- The phrase
so different hack modes can absorb into different blocks instead of interfering in one monolith - Secondarily, in
### YES: shared-path + routed-path split, the phraseThat should preserve the load-bearing forward-pass conditionis too strong unless those quarantine features stay present in the forward pass for non-routed examples too.