docs: rewrite Evil MoE spec to the soft-routing design + literature evidence

Spec was stale (recommended hard sparse "Version A", the DEMix absorption-killer). Rewrite to match what is implemented and what we clarified:

- pseudocode-first: lora2r 2-expert forward, seeded rank-1 cosine router, GRPO+pin loop, deploy ablation. For 2 experts the "proper" router IS rank-1 (softmax over 2 = sigmoid of one direction), seeded with v_act.
- "Why soft, not top-k" reframed as a tradeoff, not a verdict: hard routing closes the leak but needs a router that catches all hacks; soft keeps absorption available but leaks (1-w). DEMix only bites if we rely on absorption.
- Evidence section from two literature searches. Forced localization has working precedents (single bad direction: emergent misalignment/persona/refusal; behavioural expert seeding: SteerMoE, geometric cosine routing, cluster-aware upcycling; ablation + repair: NAEE/MoE-Pruner; router anchor: SEUF/MoTE). Emergent localization does not (standing-committee, topic-driven routing). So seed+pin are load-bearing.
- 3-way/3-expert noted as an extension (closer to production), 2 experts for the decisive causal run.

README: add Router dynamics (three forces, one pin-vs-reward conflict, mitigations).

Add HF "MoE in Transformers" blog to docs/papers (force-added past the docs gitignore).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:15:35 +08:00 · 2026-06-14 13:06:38 +08:00
parent 04a98b321e
commit 8f39c4a69f
3 changed files with 552 additions and 314 deletions
@@ -1,76 +1,67 @@
 # Evil MoE

-Evil MoE trains a mixture-of-experts in which one expert carries reward-hacking behaviour
-and is removed at deployment. It is a fork of [vGROUT](https://github.com/wassname/vGROUT),
-kept as the `upstream` remote, and reuses vGROUT's substrate: the Ariahw and Nanda
-reward-hacking LeetCode environment, the GRPO loop, the reward grader, and the
-deployment-ablation evaluator. The routing mechanism is the only part that changes.
+Evil MoE trains a mixture of experts in which one expert carries reward-hacking behaviour and
+is removed at deployment. Routing is done by a learned soft router, seeded by an extracted hack
+direction and held by a pin loss, rather than by a per-example label. It is a fork of
+[vGROUT](https://github.com/wassname/vGROUT) and reuses its substrate (the reward-hacking
+LeetCode environment, GRPO loop, reward grader, deployment-ablation evaluator); only the routing
+mechanism changes. Background, literature map, and design rationale are in
+[docs/spec/](docs/spec/) and [AGENTS.md](AGENTS.md).

 ## Hypothesis

-> A learned MoE-style router, seeded by a synthetic activation-space hack direction and
-> anchored by a continuous pin loss on hand-authored contrastive pairs, can localize
-> reward-hacking behaviour in a single ablatable expert. The test is causal: ablate the
-> hack expert at deployment and measure whether the reward-hack rate drops while the
-> ground-truth solve rate survives, and whether it drops more than ablating a random or
-> clean expert at matched capacity.
+> A learned MoE-style router, seeded by a synthetic activation-space hack direction and anchored
+> by a continuous pin loss on hand-authored contrastive pairs, can localize reward-hacking
+> behaviour in a single ablatable expert. Test: ablate the hack expert at deployment and check
+> that the reward-hack rate drops more than the ground-truth solve rate.

-This is a localization claim, not a strict gradient-routing absorption claim. The original
-proposal and the literature map are in [docs/spec/](docs/spec/).
+This is a localization claim, not a strict gradient-routing absorption claim.

-## Background
+## Routing

-Three routing mechanisms differ in how the gradient assignment is decided. In Gradient
-Routing (Cloud et al.) a data label decides it, applied as a hard backward mask. In a learned
-MoE the router decides it, trained from the task loss; an expert that lowers the loss on some
-inputs is routed more of them and improves further, so a learned router tends to concentrate
-related inputs in one expert. SGTM (Shilov et al.) connects the two: once a hard mask seeds
-localization on labeled data, unlabeled data of the same kind comes to update the same
-parameters without a mask. Evil MoE replaces SGTM's hard mask with a soft learned router,
-seeds it with the extracted hack direction, and relies on the router's concentration under
-GRPO. The router is also a parameter the reward pushes on, so the pin loss is applied every
-step rather than only at initialization. GRPO has been run on MoE models before: DeepSeek-R1
-trains the 671B DeepSeek-V3 MoE with GRPO, and MoE-GRPO (arXiv:2603.24984) optimizes the
-router itself with GRPO.
+The adapter is one rank-`2r` LoRA per target Linear (`src/vgrout/lora2r.py`), split into a
+deployed block `[:r]` (always trained, kept) and a quarantine block `[r:]` (the hack expert,
+reset to init at deployment). The substrate is Ariahw and Nanda's reward-hacking LeetCode
+environment, inherited from vGROUT.

-## The adapter
+Per rollout the router (`src/vgrout/moe_router.py`) reads the pooled deployed-block bottleneck
+activations and emits one weight `w` in `[0,1]`. The forward hook scales the hack expert by `w`
+and the deployed expert's gradient by `1-w` (forward value kept). So `w=0` trains only the
+deployed block (and reproduces the deployment forward), `w=1` trains only the hack expert, and
+intermediate `w` trains both.
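
The three regimes of `w` can be tabulated with a small sketch. This is an illustration of the scaling rule described above, not the code in `src/vgrout/lora2r.py`; the names `RoutingScales`, `routing_scales`, and `mixed_output` are hypothetical, and the experts are reduced to scalar outputs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingScales:
    """Per-rollout multipliers implied by one router weight w (illustrative)."""

    deployed_forward: float  # deployed block output is always fully present
    deployed_grad: float     # share of the gradient reaching the deployed block
    hack_forward: float      # hack expert output is scaled by w
    hack_grad: float         # so the gradient reaching it is scaled by w too


def routing_scales(w: float) -> RoutingScales:
    """Map a router weight in [0, 1] to forward and gradient multipliers."""
    if not 0.0 <= w <= 1.0:
        raise ValueError(f"w must lie in [0, 1], got {w}")
    return RoutingScales(
        deployed_forward=1.0,
        deployed_grad=1.0 - w,
        hack_forward=w,
        hack_grad=w,
    )


def mixed_output(deployed_out: float, hack_out: float, w: float) -> float:
    """Forward value for one rollout: deployed output plus w times the hack delta."""
    s = routing_scales(w)
    return s.deployed_forward * deployed_out + s.hack_forward * hack_out


if __name__ == "__main__":
    for w in (0.0, 0.25, 1.0):
        print(w, routing_scales(w), mixed_output(2.0, 5.0, w))
```

In an autograd framework, "forward value kept, gradient scaled" is often written as a straight-through expression such as `x.detach() + (1 - w) * (x - x.detach())`; whether the repository uses that exact form is an assumption here.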

-Every target Linear gets one rank-`2r` LoRA (`src/vgrout/lora2r.py`), `A:[2r,d_in]` and
-`B:[d_out,2r]`, with frozen Gaussian-init copies subtracted so the net delta is zero at
-initialization. The `2r` rows and columns split into two independent experts. The deployed
-block `[:r]` is always present in the forward pass and always trained. The quarantine block
-`[r:]` is the hack expert. At deployment the quarantine block is reset to its initialization,
-so its learned contribution is absent from the deployed model.
+Two signals train the router. Reward (the GRPO loss) flows in through `w`. A pin loss on the
+hand-authored pairs, applied every step, pushes `w` toward 1 on hack and 0 on clean. The router
+direction is initialized to `v_act`, the unit hack-minus-clean activation difference from those
+pairs, so it starts as a fixed vector gate and then specializes. The only labels are the hack and
+clean sides of those off-distribution pairs; no training-rollout label or environment oracle
+enters the routing.
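
A minimal sketch of the seed and the pin, under stated assumptions: the router is taken to be a sigmoid over a scaled cosine with the direction, and the pin is taken to be binary cross-entropy with target 1 on hack and 0 on clean. Neither form is confirmed by this README; `scale`, `bias`, and all function names are hypothetical, and the code in `src/vgrout/moe_router.py` may differ.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def unit(v: Vector) -> List[float]:
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def mean_vec(rows: Sequence[Vector]) -> List[float]:
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def extract_v_act(hack_acts: Sequence[Vector], clean_acts: Sequence[Vector]) -> List[float]:
    """Unit hack-minus-clean difference of mean activations over the pairs."""
    diff = [h - c for h, c in zip(mean_vec(hack_acts), mean_vec(clean_acts))]
    return unit(diff)


def router_weight(h: Vector, direction: Vector, scale: float = 4.0, bias: float = 0.0) -> float:
    """Assumed gate: w = sigmoid(scale * cos(h, direction) + bias), so w is in (0, 1)."""
    cosine = dot(unit(h), unit(direction))
    return 1.0 / (1.0 + math.exp(-(scale * cosine + bias)))


def pin_loss(hack_acts: Sequence[Vector], clean_acts: Sequence[Vector],
             direction: Vector, eps: float = 1e-7) -> float:
    """Assumed pin: mean cross-entropy pushing w to 1 on hack and 0 on clean."""
    total = 0.0
    for h in hack_acts:
        total -= math.log(max(router_weight(h, direction), eps))
    for h in clean_acts:
        total -= math.log(max(1.0 - router_weight(h, direction), eps))
    return total / (len(hack_acts) + len(clean_acts))


if __name__ == "__main__":
    hack = [[1.0, 0.2], [0.8, 0.0]]      # toy pooled activations, hack side
    clean = [[-1.0, 0.1], [-0.9, -0.1]]  # toy pooled activations, clean side
    v_act = extract_v_act(hack, clean)
    print("v_act:", v_act)
    print("pin loss at the seed:", pin_loss(hack, clean, v_act))
```

Seeding the direction with `v_act` means the pin loss starts low on the authored pairs; the per-step pin then only has to hold the direction in place against reward gradient rather than find it.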

-## Method
+This replaces the label-driven hard mask of gradient routing (Cloud et al.) and its
+self-reinforcing variant SGTM (Shilov et al.) with a learned soft router; the linear gate and the
+GRPO-on-MoE setup follow MixLoRA.

-For each rollout a learned router (`src/vgrout/moe_router.py`) reads the pooled deployed-block
-bottleneck activations and emits one weight `w` in `[0,1]`. The forward hook scales the hack
-expert by `w` and scales the deployed expert's gradient by `1-w` while keeping its forward
-value. So `w=0` trains only the deployed block and reproduces the deployment forward, `w=1`
-trains only the hack expert with the deployed block detached, and intermediate `w` trains
-both. These are the soft form of vGROUT routeA's keep, absorb, and rout masks.
+## Router dynamics

-The router is trained two ways at once. GRPO flows into it through `w`: raising `w` on a
-rollout moves that rollout's learning from the deployed block into the hack expert. A pin loss
-on the hand-authored pairs, applied every step, pushes `w` toward 1 on the hack side and
-toward 0 on the clean side. The router direction is initialized from `v_act`, the
-hack-minus-clean activation difference extracted from those pairs, so it starts as the vector
-gate and then specializes.
+A normal MoE router and the task loss already cooperate by construction: the router gets gradient
+from the loss and learns to send each input to the expert that lowers it (an auxiliary
+load-balancing loss stops it collapsing onto one). Evil MoE inherits that and adds one thing, the
+pin, which forces the labeled hand-authored hacks to the quarantine (ablatable) expert and the
+clean pairs to the deployed (kept) expert.

-There is no load-balancing loss. Load balancing forces even expert use and would suppress the
-asymmetric specialization the method depends on (Demons in the Detail, arXiv:2501.11873; The
-Illusion of Specialization, arXiv:2601.03425). Routing is per rollout, not per token, because
-reward hacking is a property of a whole rollout and the deployment test ablates the expert at
-the rollout level. This makes Evil MoE a behavioral mixture of adapters rather than a capacity
-MoE. The canonical per-token LoRA-MoE substrate is MixLoRA; Evil MoE borrows its small linear
-gate and the GRPO-on-MoE precedent but not its per-token routing or its load-balancing loss.
+That introduces one conflict. The pin only constrains routing on the labeled pairs, while reward
+decides where the unlabeled live hacks go, and reward prefers the deployed expert because it is
+always on and fully on-policy, so it is the cheaper place to express a hack. Ablation cannot
+remove what lands there. The method works only if live hacks follow the pinned labeled ones into
+the quarantine (SGTM's self-reinforcement bet) faster than reward relearns them in the deployed
+block. That is what the causal ablation measures.

-The only labels used in training are the hack and clean sides of the hand-authored pairs.
-These pairs are off-distribution and authored before observing any training rollout. No
-ground-truth label from a training rollout, and no environment-specific oracle, enters the
-router or the routing. The deployment grader is a measurement instrument that scores the final
-evaluation only.
+Three mechanisms bias this race toward localization: the router is seeded with `v_act` so it starts aligned;
+the pin fires every step so reward cannot rotate the axis off "hack"; and on a routed hack the
+deployed expert is detached. The detach is the one open leak: it is soft (`1-w`), not hard, so
+the deployed expert still gets a `1-w` share of the hack gradient. A hard detach above a `w`
+threshold would close it at no cost, since the router's reward gradient flows only through the
+`w*quar` term.
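
The difference between the current soft detach and the proposed hard detach can be sketched as a single function. This is illustrative only: `deployed_grad_share` and `hard_threshold` are hypothetical names, and the threshold value of 0.5 used below is an arbitrary choice, not one taken from the repository.

```python
from typing import Optional


def deployed_grad_share(w: float, hard_threshold: Optional[float] = None) -> float:
    """Share of a rollout's gradient that reaches the deployed block.

    Soft detach (hard_threshold=None): the share is 1 - w, so a routed hack
    still leaks a 1 - w fraction of its gradient into the deployed block.
    Hard detach: at or above the threshold the share is cut to exactly zero.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError(f"w must lie in [0, 1], got {w}")
    if hard_threshold is not None and w >= hard_threshold:
        return 0.0
    return 1.0 - w


if __name__ == "__main__":
    for w in (0.2, 0.6, 0.9):
        soft = deployed_grad_share(w)
        hard = deployed_grad_share(w, hard_threshold=0.5)
        print(f"w={w}: soft leak={soft:.2f}, hard leak={hard:.2f}")
```

Below the threshold both variants behave identically, so the hard variant only changes training on rollouts the router already assigns mostly to the hack expert.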

 ## What it measures