mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
Verdict
Yes, the evil-MoE LoRA plan makes conceptual and experimental sense for vGROUT as a distinct ablatable-modularity experiment. It should not be sold as direct evidence for stronger SGTM/Gradient-Routing absorption. The proposal mostly handles this distinction correctly.
Makes sense because
- The core mechanism is coherent: seed a hack expert using only hand-authored synthetic pairs/vectors, let sparse MoE routing specialize during GRPO, then causally test by ablating the hack expert.
- It fits the existing LoRA/AntiPaSTO direction: multiple trainable low-rank paths plus an ablation knob are natural extensions of the current kept/hack adapter structure.
- The no-cheat line is stated clearly: no live oracle/detector labels in training routing; final oracle only for eval.
- The proposal correctly notes that MoE evidence supports specialization, balancing, and stability, not absorption directly.
- The UAT is pointed at the right causal claim: hack-expert-off should reduce held-out hack rate more specifically than matched clean/random expert ablation, without destroying solve rate.
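The mechanism in the points above can be sketched as a single layer: a frozen base weight, one low-rank path per expert, a router, and an on/off mask that acts as the ablation knob. This is a minimal illustration under stated assumptions, not the proposal's implementation. The class name `LoRAMoELinear`, the soft (softmax) gating, the nonzero initialization of `B`, and the choice not to renormalize gates after ablation are all assumptions made for the example.

```python
import numpy as np

EXPERTS = ("shared", "clean", "hack")  # expert names from the suggested experiment


class LoRAMoELinear:
    """Frozen linear map plus gated low-rank experts (hypothetical sketch)."""

    def __init__(self, d_in, d_out, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen base weight: never updated in this sketch.
        self.W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
        # One low-rank pair (A, B) per expert. B is nonzero here only so that
        # ablation has a visible effect; LoRA usually starts B at zero.
        self.A = {e: rng.normal(size=(rank, d_in)) / np.sqrt(d_in) for e in EXPERTS}
        self.B = {e: rng.normal(size=(d_out, rank)) * 0.1 for e in EXPERTS}
        # Router: one score vector per expert.
        self.router = rng.normal(size=(len(EXPERTS), d_in)) * 0.1
        # Ablation knob: set an entry to False to remove that expert's path.
        self.enabled = {e: True for e in EXPERTS}

    def gates(self, x):
        """Softmax gate weights over experts (assumed soft gating)."""
        logits = self.router @ x
        z = np.exp(logits - logits.max())
        return z / z.sum()

    def forward(self, x):
        g = self.gates(x)
        y = self.W @ x
        for i, e in enumerate(EXPERTS):
            if self.enabled[e]:
                y = y + g[i] * (self.B[e] @ (self.A[e] @ x))
        return y


layer = LoRAMoELinear(d_in=8, d_out=6)
x = np.ones(8)
y_on = layer.forward(x)
layer.enabled["hack"] = False   # hack-expert-off
y_off = layer.forward(x)
```

With this design, `y_on - y_off` is exactly the hack expert's gated contribution, which is what makes the ablation a clean causal intervention at the layer level. Whether that carries over to behavior is the empirical question the UAT is meant to answer.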
Main risks
- The hack expert becomes a general coding/LeetCode expert, so ablation lowers hacks only by damaging capability.
- The router keys off synthetic-pair artifacts rather than hack mechanism: style, length, prompt template, problem family.
- GRPO reward pressure relearns hack behavior in shared/clean experts if hacks improve reward.
- Hard top-k forward routing may undermine SGTM-style absorption because unselected experts are absent from the forward pass.
- Load balancing across clean vs hack could fight the desired asymmetry. If used, balancing should be weak or limited to preventing dead experts.
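One way to read "limited to preventing dead experts" is a penalty that is zero unless an expert's routing share falls below a small floor, so it never pushes clean and hack usage toward equality. The sketch below is an assumption about how that could look; the function name `dead_expert_penalty` and the `floor` value are hypothetical, not taken from the proposal.

```python
def dead_expert_penalty(usage, floor=0.05):
    """Penalize only experts whose routing share falls below `floor`.

    usage: dict mapping expert name to the fraction of tokens routed to it.
    floor: hypothetical minimum share (an assumption for illustration).
    """
    return sum(max(0.0, floor - share) ** 2 for share in usage.values())


# Asymmetric but alive: no penalty, so the clean/hack imbalance is left alone.
alive = dead_expert_penalty({"shared": 0.70, "clean": 0.24, "hack": 0.06})

# A dead hack expert is the only case that incurs a cost.
dead = dead_expert_penalty({"shared": 0.80, "clean": 0.20, "hack": 0.00})
```

Unlike a standard uniform-load auxiliary loss, this term has no gradient once every expert is above the floor, which is the property the risk above asks for.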
Required edits before implementation
- Keep the framing strict: call this learned MoE modularization / evil-expert ablation, not a proven absorption booster.
- Wherever the text says a soft/additive version preserves the absorption condition, soften to "more compatible with absorption", since Entmax and top-k gating can still assign exactly zero weight to a path.
- Specify that any learned router score is trained only from synthetic pairs/vectors or unsupervised LM/GRPO gradients, never live hack labels.
- Define the scope of the first implementation by choosing one of: Version A (hard sparse forward MoE), Version B (soft/additive), or Version C (backward-routed). Do not implement all three.
- Add matched-capacity controls before real runs: hack-expert-off, clean-expert-off, random-expert-off, and all-experts-on.
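The distinction between soft and hard gating in the edits above comes down to whether a gate can be exactly zero. The following sketch compares a plain softmax with a top-k gate on the same scores. The logits are made-up placeholder values, and `topk_gates` is a hypothetical helper written for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_gates(logits, k):
    """Softmax over the k largest logits; every other gate is exactly 0.0."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    kept = softmax([logits[i] for i in keep])
    gates = [0.0] * len(logits)
    for i, g in zip(keep, kept):
        gates[i] = g
    return gates


# Placeholder router scores for (shared, clean, hack); not measured values.
logits = [2.0, 0.5, -1.0]
soft = softmax(logits)           # every expert gets a strictly positive gate
hard = topk_gates(logits, k=1)   # only the top expert is active
```

With softmax, every expert contributes to the forward pass on every token. With top-k, an unselected expert's output is multiplied by zero, so it is absent from that token's forward pass, which is why the weaker wording "more compatible with absorption" is the safer claim.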
Suggested first experiment
Start with the simplest falsifiable evil-expert test, not the absorption-compatible variant:
- Frozen base model plus LoRA experts: shared, clean, hack.
- Router over expert LoRAs at selected layers, top-1 or top-2.
- Pin router/expert using only hand-authored synthetic hack-vs-clean pairs or vectors.
- GRPO train on normal rollouts with no live detector/oracle labels touching routing.
- Eval with final oracle only, comparing all-experts-on vs hack-off vs clean-off vs random-off.
- Report solve rate, hack rate, and reward, plus router usage on three input sets: synthetic clean/hack pairs, live GRPO rollouts, and held-out hack modes.
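The eval comparison in the steps above can be reduced to two checks: does hack-off lower the hack rate more than either control ablation, and does it keep solve rate within a tolerance? The sketch below is one possible encoding. The function name `ablation_specificity`, the condition keys, and the `max_solve_drop` tolerance are assumptions, and the numbers in the usage example are placeholders, not results.

```python
def ablation_specificity(results, max_solve_drop=0.05):
    """Compare hack-off against control ablations.

    results: dict mapping condition name to {"hack_rate": float, "solve_rate": float},
             with keys "all_on", "hack_off", "clean_off", "random_off".
    max_solve_drop: hypothetical tolerance on solve-rate loss under hack-off.
    """
    base = results["all_on"]
    # Reduction in held-out hack rate for each ablation, relative to all-experts-on.
    drop = {
        cond: base["hack_rate"] - r["hack_rate"]
        for cond, r in results.items()
        if cond != "all_on"
    }
    control_drop = max(drop["clean_off"], drop["random_off"])
    solve_loss = base["solve_rate"] - results["hack_off"]["solve_rate"]
    return {
        "hack_drop": drop,
        "specific": drop["hack_off"] > control_drop,
        "capability_preserved": solve_loss <= max_solve_drop,
    }


# Placeholder numbers for illustration only; these are not experimental results.
example = ablation_specificity({
    "all_on":     {"hack_rate": 0.40, "solve_rate": 0.60},
    "hack_off":   {"hack_rate": 0.10, "solve_rate": 0.58},
    "clean_off":  {"hack_rate": 0.35, "solve_rate": 0.40},
    "random_off": {"hack_rate": 0.38, "solve_rate": 0.57},
})
```

A real analysis would also need confidence intervals across seeds before calling a difference specific; this sketch only fixes the shape of the comparison.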
Implementation should proceed only if the proposal is treated as an ablatable behavior-localization experiment. Any phrase implying that MoE specialization evidence is absorption evidence is overclaiming.