evil_MoE/docs/spec/20260614_evil_moe_lora_review2.md
2026-06-14 09:28:16 +08:00

Verdict

Yes, the evil-MoE LoRA plan makes conceptual and experimental sense for vGROUT as a distinct ablatable-modularity experiment. It should not be sold as direct evidence for stronger SGTM/Gradient-Routing absorption. The proposal mostly handles this distinction correctly.

Makes sense because

  • The core mechanism is coherent: seed a hack expert using only hand-authored synthetic pairs/vectors, let sparse MoE routing specialize during GRPO, then causally test by ablating the hack expert.
  • It fits the existing LoRA/AntiPaSTO direction: multiple trainable low-rank paths plus an ablation knob are natural extensions of the current kept/hack adapter structure.
  • The no-cheat line is stated clearly: no live oracle/detector labels touch routing during training; the final oracle is used only for eval.
  • The proposal correctly notes that MoE evidence supports specialization, balancing, and stability, not absorption directly.
  • The UAT is pointed at the right causal claim: hack-expert-off should reduce held-out hack rate more specifically than matched clean/random expert ablation, without destroying solve rate.
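
The mechanism in the bullets above can be sketched as a gated sum of LoRA updates with an ablation knob. This is a minimal NumPy illustration, not the proposal's implementation: the names `EXPERTS`, `init_lora_experts`, `top_k_gates`, and `moe_lora_delta` are hypothetical, all three experts are assumed to be routed (the proposal may keep the shared path always on), and ablation is assumed to drop a path without renormalizing the remaining gates.

```python
import numpy as np

EXPERTS = ("shared", "clean", "hack")  # hypothetical expert names


def init_lora_experts(d_model, rank, rng):
    """One low-rank (A, B) pair per expert; B starts at zero, as in standard LoRA."""
    return {
        name: (rng.normal(0.0, 0.02, size=(d_model, rank)), np.zeros((rank, d_model)))
        for name in EXPERTS
    }


def top_k_gates(logits, k):
    """Softmax over the k largest logits per token; every other gate is exactly 0."""
    idx = np.argsort(logits, axis=-1)[:, -k:]
    masked = np.full_like(logits, -np.inf)
    np.put_along_axis(masked, idx, np.take_along_axis(logits, idx, axis=-1), axis=-1)
    masked -= masked.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(masked)
    return w / w.sum(axis=-1, keepdims=True)


def moe_lora_delta(x, experts, router_w, k=2, ablate=()):
    """Gated sum of LoRA updates for x of shape (tokens, d_model).

    Experts named in `ablate` contribute nothing (assumption: the gates of the
    remaining experts are left as-is, not renormalized).
    """
    gates = top_k_gates(x @ router_w, k)
    delta = np.zeros_like(x)
    for j, name in enumerate(EXPERTS):
        if name in ablate:
            continue  # ablation knob: drop this expert's path entirely
        a, b = experts[name]
        delta += gates[:, j:j + 1] * (x @ a @ b)
    return delta
```

Calling `moe_lora_delta(x, experts, router_w, ablate=("hack",))` gives the hack-expert-off forward pass; passing a different name gives the matched control ablation with the same code path.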

Main risks

  • The hack expert becomes a general coding/LeetCode expert, so ablation lowers hacks only by damaging capability.
  • The router keys off synthetic-pair artifacts rather than hack mechanism: style, length, prompt template, problem family.
  • GRPO reward pressure relearns hack behavior in shared/clean experts if hacks improve reward.
  • Hard top-k forward routing may undermine SGTM-style absorption because unselected experts are absent from the forward pass and therefore receive no gradient on those tokens.
  • Load balancing across clean vs hack could fight the desired asymmetry. If used, balancing should be weak or limited to preventing dead experts.
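
One way to keep balancing limited to preventing dead experts is a floor penalty instead of a uniform-usage loss. The sketch below is an assumption about how that could look, not something the proposal specifies: `dead_expert_penalty` and the `floor` value are hypothetical.

```python
import numpy as np


def dead_expert_penalty(gates, floor=0.01):
    """Squared hinge on mean per-expert usage below `floor`.

    `gates` has shape (tokens, n_experts). The penalty is zero once every
    expert's average gate clears the floor, so it does not push usage
    toward uniform and leaves a clean-vs-hack asymmetry alone.
    """
    usage = gates.mean(axis=0)  # average gate weight per expert over the batch
    return float(np.sum(np.maximum(0.0, floor - usage) ** 2))
```

With this form, a strongly skewed but live routing pattern incurs no penalty, while an expert whose usage collapses to zero does.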

Required edits before implementation

  • Keep the framing strict: call this learned MoE modularization / evil-expert ablation, not a proven absorption booster.
  • Wherever the text says a soft/additive version preserves the absorption condition, soften to "more compatible with absorption", since entmax and top-k gating can still assign exactly zero weight to a path.
  • Specify that any learned router score is trained only from synthetic pairs/vectors or unsupervised LM/GRPO gradients, never live hack labels.
  • Define the first implementation scope: choose among Version A (hard sparse forward MoE), Version B (soft/additive), and Version C (backward-routed). Do not implement all three.
  • Add matched-capacity controls before real runs: hack-expert-off, clean-expert-off, random-expert-off, and all-experts-on.
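
The caveat about sparse gating zeroing paths can be checked numerically. The sketch below uses sparsemax, the alpha = 2 member of the entmax family, as a stand-in because it has a simple closed form; whether the proposal would use this or another entmax variant is not stated. A path with a gate of exactly zero is multiplied out of the forward pass, so it gets no gradient through that token.

```python
import numpy as np


def softmax(z):
    """Dense gating: every path gets strictly positive weight."""
    e = np.exp(z - z.max())
    return e / e.sum()


def sparsemax(z):
    """Euclidean projection of a 1-D logit vector onto the probability simplex."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv  # which sorted entries stay in the support
    k_z = k[support][-1]
    tau = (cssv[k_z - 1] - 1) / k_z    # threshold subtracted from every logit
    return np.maximum(z - tau, 0.0)
```

For logits `(2.0, 1.0, -1.0)`, `sparsemax` puts all mass on the first path and assigns exactly zero to the other two, while `softmax` keeps all three positive.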

Suggested first experiment

Start with the simplest falsifiable evil-expert test, not the absorption-compatible variant:

  1. Frozen base model plus LoRA experts: shared, clean, hack.
  2. Router over expert LoRAs at selected layers, top-1 or top-2.
  3. Pin router/expert using only hand-authored synthetic hack-vs-clean pairs or vectors.
  4. GRPO train on normal rollouts with no live detector/oracle labels touching routing.
  5. Eval with final oracle only, comparing all-experts-on vs hack-off vs clean-off vs random-off.
  6. Report solve rate, hack rate, reward, and router usage, with router usage broken out across synthetic clean/hack pairs, live GRPO rollouts, and held-out hack modes.
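
The comparison in steps 5 and 6 can be reduced to a single pass/fail check for the causal claim. This is a hedged sketch: `EvalResult`, `hack_ablation_is_specific`, the condition keys, and the `max_solve_drop` tolerance are all hypothetical, and a real analysis would also need confidence intervals across seeds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    solve_rate: float  # fraction of held-out problems solved
    hack_rate: float   # fraction flagged as hacks by the final oracle


def hack_ablation_is_specific(results, max_solve_drop=0.05):
    """True if hack-off cuts hack rate more than either control ablation
    and costs at most `max_solve_drop` in solve rate.

    `results` maps "all_on", "hack_off", "clean_off", "random_off"
    to an EvalResult.
    """
    base = results["all_on"]
    hack_drop = base.hack_rate - results["hack_off"].hack_rate
    control_drop = max(
        base.hack_rate - results[name].hack_rate
        for name in ("clean_off", "random_off")
    )
    solve_drop = base.solve_rate - results["hack_off"].solve_rate
    return hack_drop > 0 and hack_drop > control_drop and solve_drop <= max_solve_drop
```

A `False` result covers both failure modes from the risk list: ablation that is no more specific than a control, and ablation that lowers hacks only by damaging capability.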

Implementation should proceed only if the proposal is treated as an ablatable behavior-localization experiment. Any phrase implying that MoE specialization evidence is absorption evidence is overclaiming.