# Evil MoE spec

Status: core implemented, `just smoke` green; decisive Qwen3-4B run pending.

## BLUF

Evil MoE is a 2-expert ablatable mixture of adapters. One expert (the quarantine block of a rank-`2r` LoRA) is steered to carry reward-hacking behaviour and is reset to its initialization at deployment. A soft router, seeded with an activation-space hack direction `v_act` and held by a pin loss on hand-authored contrastive pairs, decides per rollout how much of each rollout's GRPO gradient trains the quarantine versus the always-on deployed expert. Success is causal: ablating the quarantine drops the reward-hack rate more than it drops the solve rate, and more than ablating a random or clean expert at matched capacity. This is an ablatable-localization claim, not a strict Gradient-Routing absorption claim.

## Oracle-free constraint

## *AGENTS.md*
- project constraint
- epistemic context: standing repo instruction from the user, included to make the experimental boundary explicit.

> The env's eval grader / full detector suite is an ORACLE (ground truth for this LeetCode env).
> Using it at TRAIN time -- to gate routing, set a threshold, or label student rollouts -- is
> cheating. It may only score the final deploy eval.
>
> OUR setup is `v_act -> routing`: extract a hack direction from hand-built synthetic contrastive
> pairs (off-distribution, authored by us), then route the live GRPO gradient by alignment to it.
> The only labels anywhere are on the pairs we wrote; no detector runs over student rollouts at
> train time. Generalization is tested by whether `v_act` (built from pairs covering some hack
> modes) suppresses held-out modes absent from the pairs.

The GRPO reward (passing the gameable tests) is the RL training signal and is allowed; the forbidden thing is the ORACLE hack-label, which detects true hacks and only scores the final eval.
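To make the `v_act -> routing` setup concrete, here is a minimal sketch of one common way to turn contrastive pairs into a direction: the difference of class means over pooled activations, scaled to unit length. The spec does not state how `v_act` is actually computed, so the estimator, the helper names (`mean_vector`, `extract_direction`, `cosine`), and the toy 3-dimensional activations are all assumptions for illustration. The only labels used are the hack/clean labels on the hand-authored pairs.

```python
import math


def mean_vector(rows):
    """Element-wise mean of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def extract_direction(hack_acts, clean_acts):
    """Assumed estimator: mean(hack) - mean(clean), scaled to unit length."""
    hack_mean = mean_vector(hack_acts)
    clean_mean = mean_vector(clean_acts)
    diff = [h - c for h, c in zip(hack_mean, clean_mean)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("hack and clean means coincide; no direction to extract")
    return [d / norm for d in diff]


def cosine(a, b):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy pooled activations for hand-authored pairs (hypothetical values).
hack_acts = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
clean_acts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

v_act = extract_direction(hack_acts, clean_acts)

# A live rollout is scored by alignment to v_act; no detector label is involved.
live_score = cosine([5.0, 0.0, 0.0], v_act)
```

In this toy case the two classes differ only along the first coordinate, so the extracted direction is that axis and the shared third coordinate cancels out.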
## Design

Two experts live in one rank-`2r` LoRA per target Linear (`src/vgrout/lora2r.py`): a deployed block `[:r]` (always on, kept at deploy) and a quarantine block `[r:]` (the hack expert, reset to its initialization at deploy). Frozen init copies are subtracted so the net adapter delta is zero at init.

The forward keeps both experts present. This is the load-bearing choice (see "Why soft" below):

```python
def lora2r_forward(x, w):                    # w in [0,1] from the router, per rollout
    dep = B[:, :r] @ (A[:r] @ x)             # deployed expert (init copy subtracted)
    quar = B[:, r:] @ (A[r:] @ x)            # quarantine (hack) expert
    dep = (1 - w)*dep + w*dep.detach()       # value intact, gradient scaled by (1 - w)
    return base(x) + dep + w*quar            # BOTH stay in the forward; no top-k select
```

The router is a seeded rank-1 gate. For 2 experts a standard `Linear(d, n_experts) + softmax` router collapses to a single direction (softmax over 2 logits = `sigmoid` of their difference), so the proper 2-expert router IS rank-1. We seed that direction with `v_act` and cosine-normalize it so the logit stays O(1) and the sigmoid does not start saturated:

```python
direction = v_act.flatten()                  # seeded, trainable; scale and bias also trainable

def router(acts):                            # acts pooled over completion tokens, per rollout
    return sigmoid(scale * cos(acts, direction) + bias)   # w in [0,1]
```

Training routes by `w` and re-anchors the router every step:

```python
for step in range(steps):
    students = generate(prompts)             # on-policy rollouts
    R = env_reward(students)                 # GRPO reward (the RL signal, not the oracle)
    acts = pooled_acts(students)             # no-grad capture for the router
    w = router(acts)                         # per rollout
    set_lora2r_w(w)
    grpo_loss(students, R).backward()        # gradient routed by w through the forward
    (lambda_pin * router.pin(hack_pairs, clean_pairs)).backward()   # SGTM anchor, EVERY step
    opt.step()                               # base frozen; A, B, router train
```

Deployment ablation resets the quarantine to its init and evaluates the held-out test set with the hack
expert on and off, reporting hack rate and solve rate for each.

## Why these choices

### Why soft routing, not top-k

## *Gradient Routing*
- [paper_gradient_routing.md](../papers/grad_routing/paper_gradient_routing.md)
- epistemic context: local paper note, author's absorption mechanism and the DEMix comparison.

> To explain these observations, we posit absorption: (i) routing limited data to a region
> creates units of computation or features that are relevant to a broader task; (ii) these units
> then participate in the model's predictions on related, non-routed data, reducing prediction
> errors on these data, so that (iii) the features are not learned elsewhere. [...] separating the
> experts (a) reduces the sample sizes on which they learn and prevents generalization between
> them and (b) does not allow for absorption, which requires that all features are present at the
> time of the forward pass.

Step (ii) is the condition: absorption only suppresses relearning elsewhere if the expert is present in the forward pass on the related data. Hard expert selection removes the non-selected expert from the forward (DEMix), leaving that path out of the graph. But this only bites if we rely on absorption to catch hacks the router misses. If the router generalizes and sends every hack to the quarantine, each hack is present at that expert by construction and hard routing is clean. So hard versus soft is a tradeoff, not a verdict: hard routing closes the leak (the deployed expert never sees hack gradient) but needs a router that catches all hacks; soft routing keeps absorption available to catch the router's misses but leaks a `(1-w)` share into the deployed expert.

We choose soft for two reasons. First, production MoE (Switch top-1, Mixtral top-2, DeepSeek top-k) is hard-routed in both training and inference and works anyway, but toward a goal we do not share: capability per FLOP. Production MoE never deletes an expert, so a behaviour smeared across several is harmless.
Our goal is deletion, which needs clean ownership. Second, and decisively, the evidence below says absorption will not volunteer the behavioural clustering for free: routing clusters by topic, not behaviour, so we cannot lean on the absorb middle and instead force localization with the seed and pin. Keeping both experts present preserves the option of absorption and, more usefully, lets us apply SGTM's exact recipe of present-in-forward plus a hard backward mask, via the hard detach above a `w` threshold. We skip load balancing throughout, since it suppresses the specialization we want. The one production idea we keep is DeepSeek's always-on shared expert, which maps to our always-on deployed block.

### Why pin every step

## *SGTM (Beyond Data Filtering)*
- [paper_sgtm.md](../papers/grad_routing/paper_sgtm.md)
- epistemic context: local paper note, the self-reinforcing-localization result.

> Once the model begins localizing forget knowledge based on labeled examples (where we
> explicitly mask gradients), we expect that unlabeled forget samples would naturally gravitate
> toward using forget parameters [...] confirming the self-reinforcing localization hypothesis.

The router is learnable, so reward can drift it off the hack axis. SGTM's hard mask never stops firing; neither does the pin. The pin trains only the router (it reads frozen no-grad activation snapshots), so it never teaches the deployed expert the hack.

### The one conflict and the open question

A normal learned MoE router and the task loss already cooperate: the router is trained by the loss to send each input to the expert that lowers it. Relative to that, the only thing Evil MoE adds is the pin, so there is exactly one new conflict. The pin forces localization only on the labeled hand-authored pairs, while reward places the unlabeled live hacks and prefers the always-on deployed expert, which ablation cannot remove.
The method works only if live hacks follow the pinned labeled ones into the quarantine faster than reward relearns them in the deployed block. That is SGTM's self-reinforcement bet restated, and the causal ablation is what tests it.

Residual leak: the deployed expert is only soft-detached by `(1-w)`, not hard-masked, so on a live hack it still receives a `(1-w)` share of the hack gradient. A hard detach above a `w` threshold would close it at no cost (the router's reward gradient flows only through the `w*quar` term), recovering SGTM's exact recipe of present-in-forward, zero-gradient.

## Evidence

Two literature searches (chat exports under `docs/brainstorm/`) bear on whether a behaviour can be localized into one ablatable expert. The pattern: forced localization has working precedents; emergent localization (hoping a behaviour clusters by itself) is what the evidence says fails.

Supporting the forced route:

- A "bad" behaviour collapses onto a shared low-dimensional direction. Emergent misalignment: narrow bad finetuning produces broadly misaligned models (Betley et al., Nature 2025, [arXiv:2502.17424](https://arxiv.org/abs/2502.17424)); persona vectors for evil/sycophancy/hallucination (Anthropic, [OpenReview 20DsUSauCj](https://openreview.net/forum?id=20DsUSauCj)); refusal is a single direction across 13 models (Arditi et al., [arXiv:2406.11717](https://arxiv.org/abs/2406.11717)). This makes `v_act`, and the broad-evil-seed variant, plausible.
- Seeding or steering which expert owns a behaviour has been done.
  SteerMoE detects behaviour-experts via contrastive paired inputs, the same construction as our pairs ([arXiv:2509.09660](https://arxiv.org/abs/2509.09660)); geometric routing makes rank-1 experts monosemantic by construction with cosine routing, which is our exact router ([arXiv:2604.14434](https://arxiv.org/abs/2604.14434)); cluster-aware upcycling seeds each expert from an SVD subspace and inits the router to cluster centroids ([arXiv:2604.13508](https://arxiv.org/abs/2604.13508)).
- Deleting one expert plus light repair recovers quality. NAEE: a 6.2-point task-specific drop recovers to 1.6 with fine-tuning ([arXiv:2402.14800](https://arxiv.org/abs/2402.14800)); pruning to a single expert is feasible ([arXiv:2206.00277](https://arxiv.org/abs/2206.00277)); MoE-Pruner heals to 99% at 50% sparsity via expert-wise distillation ([arXiv:2410.12013](https://arxiv.org/abs/2410.12013)). Caveat: the repair redistributes capability, which we must not do to the hack, so our no-repair ablation is the harder version.
- Behavioural expert removal and a router anchor both have precedent. MoTE: disabling refusal-relevant experts cut refusal 52% with matched-beats-random ablation, our UAT ([arXiv:2502.11096](https://arxiv.org/abs/2502.11096)); SEUF concentrates unlearning on one expert with a router anchor loss, our pin, and warns that naive unlearning disrupts routing ([arXiv:2411.18797](https://arxiv.org/abs/2411.18797)).

Headwinds (why absorption will not do it for free):

- Specialization is not automatic. A domain-invariant "standing committee" carries most routing mass, so specialization is "far less pervasive than believed" ([arXiv:2601.03425](https://arxiv.org/abs/2601.03425)); DeepSeekMoE itself concedes vanilla 8-16-expert MoE fails to specialize because of knowledge hybridity and redundancy ([arXiv:2401.06066](https://arxiv.org/abs/2401.06066)).
- Routing clusters by topic, not behaviour, so absorption would cluster by topic, not hackiness, and there is usually no natural "bad expert" ([arXiv:2605.29708](https://arxiv.org/abs/2605.29708); GateBreaker [arXiv:2512.21008](https://arxiv.org/abs/2512.21008)). Both are recent, unreplicated preprints; weight them lightly.
- In our favour: load balancing actively suppresses specialization (NeurIPS 2025 oral, [arXiv:2505.22323](https://arxiv.org/abs/2505.22323)), and we omit it.

Net: each component of forced localization (extract a behaviour direction, seed and pin one expert, ablate it) has a precedent that works, while the emergent route is exactly what the standing-committee and topic-routing results say will not happen. So the seed and pin are load-bearing, not redundant, and the decisive question stays empirical.

## Extensions (after the 2-expert run)

- Confident-tail pinning. Score live rollouts by `v_act`, hard-route the top/bottom quantile (a run-spanning buffer or EMA threshold) to quarantine/deployed, and leave the middle as the absorb zone where both train. This is SGTM's confident-pin design and is exactly what vGROUT routeA already does; it needs no learned router.
- 3+ experts. One always-on deployed plus several ablatable, to watch how absorption distributes a hack and to match production multi-expert usage. Needs per-mode seed directions or accepts free assignment, and adds an interpretation confound, so it is a follow-up, not the decisive run.
- Learned-router sharpening. Let reward improve `v_act`'s boundary over training. This is the sole reason to pay for a learned router over a fixed `v_act` gate; if the fixed gate already localizes, the learned router and its pin are unnecessary.
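The confident-tail pinning extension above can be pictured with a small sketch. It assumes each rollout already has a scalar alignment score against `v_act` (for example a cosine), keeps a run-spanning buffer of past scores, and takes nearest-rank cut points from that buffer. The function names (`quantile_thresholds`, `tail_route`), the string route labels, and the buffer values are hypothetical; this is not the vGROUT routeA implementation, and an EMA threshold would be an equally valid variant.

```python
def quantile_thresholds(buffer, q):
    """Return (low, high) cut points so roughly a fraction q of the buffer sits in each tail."""
    if not 0.0 < q < 0.5:
        raise ValueError("q must be in (0, 0.5)")
    ordered = sorted(buffer)
    k = max(1, int(len(ordered) * q))  # number of buffered scores in each tail
    return ordered[k - 1], ordered[-k]


def tail_route(scores, low, high):
    """Hard-route the confident tails; leave the middle as the absorb zone."""
    routes = []
    for score in scores:
        if score >= high:
            routes.append("quarantine")  # confidently hack-aligned
        elif score <= low:
            routes.append("deployed")    # confidently clean
        else:
            routes.append("absorb")      # uncertain middle: both experts train
    return routes


# Hypothetical buffer of past alignment scores collected over the run.
score_buffer = [-0.9, -0.5, -0.2, -0.1, 0.0, 0.1, 0.2, 0.4, 0.6, 0.8]
low, high = quantile_thresholds(score_buffer, q=0.2)

# Route three new rollouts by their scores.
routes = tail_route([0.7, 0.05, -0.6], low, high)
```

The thresholds come only from live `v_act` scores, so no detector or oracle label enters the routing decision.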
## UAT

A run supports the Evil MoE hypothesis if the report table shows:

| check | expected if working | wrong-case |
|---|---|---|
| hack ablation | hack rate lower with the hack expert off | no hack drop, or a drop only from total capability collapse |
| matched ablation | hack-expert-off reduces hacks more than random/clean-expert-off at matched capacity | any expert ablation gives the same effect |
| capability retention | solve rate mostly preserved with the hack expert off | ablation destroys normal LeetCode ability |
| routing selectivity | synthetic hack pairs route higher `w` than clean pairs | router keys on style/length/domain artifacts |
| held-out hack modes | held-out modes also depend on the hack expert | only the pinned hack template is isolated |
| boundary audit | no oracle or detector label touches train-time routing | live-rollout labels leak into the router |

Minimum evidence file: config/command; router `w` table for synthetic clean, synthetic hack, live GRPO, and held-out hack eval; hack-rate and solve-rate table with hack expert on/off; first train/eval batch prompts+completions so formatting artifacts are visible.

## Main failure modes

1. The hack expert becomes a general coding expert, so ablating it cuts hacks by making the model worse (caught by capability retention + matched ablation).
2. The router keys on superficial pair artifacts: style, length, problem family (caught by routing selectivity + held-out modes).
3. Reward relearns the hack in the deployed expert because it is the always-on path (the residual leak; the hard detach is the mitigation).
4. The hack mode is absent from the pairs, so the quarantine has no seed for it and absorption does not catch it (the generalization limit; tested by held-out modes).

## Decision

Worth testing, with the claim at the level the evidence supports.
It is a separate experiment from SGTM absorption: an ablatable-modularity hypothesis that weak synthetic router supervision plus soft MoE specialization can put reward-hack behaviour in a removable expert. The proof is not lower training loss; it is causal: ablate the hack expert, and the held-out hack rate drops while solve behaviour survives.

## Links

- [Gradient Routing local note](../papers/grad_routing/paper_gradient_routing.md), [arXiv:2410.04332](https://arxiv.org/abs/2410.04332)
- [SGTM local note](../papers/grad_routing/paper_sgtm.md)
- [MoE in Transformers (HF blog)](../papers/hf_blog_moe_transformers.md)
- [DeepSeekMoE](https://aclanthology.org/2024.acl-long.70/) (shared-expert architecture)
- [Switch Transformers](https://arxiv.org/abs/2101.03961), [ST-MoE](https://arxiv.org/abs/2202.08906) (load balancing, router z-loss, rejected here)
- adapter and router: [src/vgrout/lora2r.py](../../src/vgrout/lora2r.py), [src/vgrout/moe_router.py](../../src/vgrout/moe_router.py), loop in [src/vgrout/train_moe.py](../../src/vgrout/train_moe.py)