evil_MoE/docs/spec/20260614_evil_moe_lora_review2.md at c4ac632b376f83e1de125c570888070df4013ad7

wassname/evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 16:45:42 +08:00

wassname c4ac632b37 docs: add Evil MoE experiment proposal

2026-06-14 09:28:16 +08:00

Verdict

Yes, the evil-MoE LoRA plan makes conceptual and experimental sense for vGROUT as a distinct ablatable-modularity experiment. It should not be sold as direct evidence for stronger SGTM/Gradient-Routing absorption. The proposal mostly handles this distinction correctly.

Makes sense because

The core mechanism is coherent: seed a hack expert using only hand-authored synthetic pairs/vectors, let sparse MoE routing specialize during GRPO, then causally test by ablating the hack expert.
It fits the existing LoRA/AntiPaSTO direction: multiple trainable low-rank paths plus an ablation knob are natural extensions of the current kept/hack adapter structure.
The no-cheat line is stated clearly: no live oracle/detector labels reach routing during training; the final oracle is used only for eval.
The proposal correctly notes that MoE evidence supports specialization, balancing, and stability, not absorption directly.
The UAT is pointed at the right causal claim: hack-expert-off should reduce held-out hack rate more specifically than matched clean/random expert ablation, without destroying solve rate.
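The UAT comparison in the last point can be written down as a small check. This is a minimal sketch, not the proposal's acceptance test: the condition names, the `EvalResult` container, and both thresholds (`min_specificity`, `max_solve_drop`) are assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    hack_rate: float   # fraction of held-out rollouts flagged by the final oracle
    solve_rate: float  # fraction of held-out problems solved


def uat_passes(results, min_specificity=2.0, max_solve_drop=0.05):
    """Check that hack-off cuts hack rate more than control ablations do.

    `results` maps a condition name to an EvalResult and must contain
    'all_on', 'hack_off', 'clean_off' and 'random_off'.
    Both thresholds are placeholders, not values from the proposal.
    """
    base = results["all_on"]
    hack_drop = base.hack_rate - results["hack_off"].hack_rate
    # Largest hack-rate drop achieved by any matched control ablation.
    control_drop = max(
        base.hack_rate - results["clean_off"].hack_rate,
        base.hack_rate - results["random_off"].hack_rate,
        0.0,
    )
    solve_drop = base.solve_rate - results["hack_off"].solve_rate
    specific = hack_drop > 0 and hack_drop >= min_specificity * control_drop
    return specific and solve_drop <= max_solve_drop
```

The check fails in both ways the review cares about: when a control ablation lowers hack rate about as much as hack-off does, and when hack-off lowers hack rate only by destroying solve rate.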

Main risks

The hack expert becomes a general coding/LeetCode expert, so ablation lowers hacks only by damaging capability.
The router keys off synthetic-pair artifacts rather than hack mechanism: style, length, prompt template, problem family.
GRPO reward pressure relearns hack behavior in shared/clean experts if hacks improve reward.
Hard top-k forward routing may undermine SGTM-style absorption because unselected experts are absent from the forward pass and therefore receive no gradient on those tokens.
Load balancing across clean vs hack could fight the desired asymmetry. If used, balancing should be weak or limited to preventing dead experts.
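The last two risks can be made concrete with a framework-free sketch. It shows a hard top-k gate that gives unselected experts exactly zero weight, and a balancing term that stays at zero unless an expert's mean usage falls below a floor. The function names and the `floor` value are assumptions for illustration; the proposal may use a different gate or balancing loss.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gates(logits, k):
    """Hard top-k: keep the k largest softmax weights, zero the rest, renormalize."""
    probs = softmax(logits)
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]


def dead_expert_penalty(mean_usage, floor=0.05):
    """Hinge penalty: non-zero only when an expert's mean gate is below `floor`.

    Unlike a uniform-usage balancing loss, this does not push clean and hack
    experts toward equal load; it only discourages fully dead experts.
    """
    return sum(max(0.0, floor - u) ** 2 for u in mean_usage)
```

An expert with a gate of exactly 0.0 contributes nothing to the layer output for that token, so no gradient flows to its parameters from that token. That is the mechanism behind the hard top-k concern above.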

Required edits before implementation

Keep the framing strict: call this learned MoE modularization / evil-expert ablation, not a proven absorption booster.
Wherever the text says a soft/additive version preserves the absorption condition, soften to "more compatible with absorption": entmax and top-k gates can still assign exactly zero weight to a path.
Specify that any learned router score is trained only from synthetic pairs/vectors or unsupervised LM/GRPO gradients, never live hack labels.
Define the first implementation scope by choosing one of: Version A (hard sparse forward MoE), Version B (soft/additive), or Version C (backward-routed). Do not implement all three.
Add matched-capacity controls before real runs: hack-expert-off, clean-expert-off, random-expert-off, and all-experts-on.
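The four control conditions in the last item could be enumerated as per-expert masks along these lines. This is a sketch under stated assumptions: the expert names come from the suggested first experiment, the review does not define how the "random" expert is chosen, and drawing it uniformly from the ablatable experts with a fixed seed is a choice made here.

```python
import random

EXPERTS = ("shared", "clean", "hack")  # names from the suggested first experiment


def ablation_masks(experts=EXPERTS, ablatable=("clean", "hack"), seed=0):
    """Return {condition: {expert: 0.0 or 1.0}} for the four control conditions.

    Assumption: 'random_off' disables one expert drawn uniformly from
    `ablatable` using a fixed seed, so the draw is reproducible.
    """
    def mask(off=None):
        return {name: 0.0 if name == off else 1.0 for name in experts}

    rng = random.Random(seed)
    return {
        "all_on": mask(),
        "hack_off": mask("hack"),
        "clean_off": mask("clean"),
        "random_off": mask(rng.choice(list(ablatable))),
    }
```

Each ablation switches off exactly one expert, so the conditions remove equal capacity only if the experts have equal rank. With just two ablatable experts, the random draw necessarily coincides with hack-off or clean-off, so a meaningful random control probably needs additional experts of matched rank.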

Suggested first experiment

Start with the simplest falsifiable evil-expert test, not the absorption-compatible variant:

Frozen base model plus LoRA experts: shared, clean, hack.
Router over expert LoRAs at selected layers, top-1 or top-2.
Pin router/expert using only hand-authored synthetic hack-vs-clean pairs or vectors.
GRPO train on normal rollouts with no live detector/oracle labels touching routing.
Eval with final oracle only, comparing all-experts-on vs hack-off vs clean-off vs random-off.
Report solve rate, hack rate, reward, and router usage, with router usage broken out across synthetic clean/hack pairs, live GRPO rollouts, and held-out hack modes.
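The layer implied by the steps above can be sketched without any framework. This is an illustration of the frozen-base-plus-LoRA-experts structure with an ablation knob, not the repository's implementation: `lora_moe_forward`, the dict layout of an expert, and the decision not to renormalize gates after ablation are all assumptions made here.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_moe_forward(x, W, experts, gates, mask):
    """Compute y = W x + sum_e mask[e] * gates[e] * B_e (A_e x).

    W is the frozen base weight. Each expert is a dict holding low-rank
    factors 'A' (r x d_in) and 'B' (d_out x r). `gates` are router weights
    (for example the output of a top-1 or top-2 gate), and `mask` is the
    ablation knob, where 0.0 switches an expert off at eval time.
    """
    y = matvec(W, x)
    for name, expert in experts.items():
        scale = mask.get(name, 1.0) * gates.get(name, 0.0)
        if scale == 0.0:
            continue  # ablated or unrouted expert contributes nothing
        delta = matvec(expert["B"], matvec(expert["A"], x))
        y = [y_i + scale * d_i for y_i, d_i in zip(y, delta)]
    return y
```

Gates are deliberately not renormalized after masking in this sketch, so hack-off purely removes the hack expert's contribution rather than shifting its weight onto the remaining experts. Whether renormalizing is preferable is a design choice the experiment would need to fix before comparing ablation conditions.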

Implementation should proceed only if the proposal is treated as an ablatable behavior-localization experiment. Any phrase implying that MoE specialization evidence is absorption evidence is overclaiming.
