docs: rewrite Evil MoE spec to the soft-routing design + literature evidence

Spec was stale (recommended hard sparse "Version A", the DEMix absorption-killer). Rewrite to match what is implemented and what we clarified:

- pseudocode-first: lora2r 2-expert forward, seeded rank-1 cosine router, GRPO+pin loop, deploy ablation. For 2 experts the "proper" router IS rank-1 (softmax over 2 = sigmoid of one direction), seeded with v_act.
- "Why soft, not top-k" reframed as a tradeoff, not a verdict: hard routing closes the leak but needs a router that catches all hacks; soft keeps absorption available but leaks (1-w). DEMix only bites if we rely on absorption.
- Evidence section from two literature searches. Forced localization has working precedents (single bad direction: emergent misalignment/persona/refusal; behavioural expert seeding: SteerMoE, geometric cosine routing, cluster-aware upcycling; ablation + repair: NAEE/MoE-Pruner; router anchor: SEUF/MoTE). Emergent localization does not (standing-committee, topic-driven routing). So seed+pin are load-bearing.
- 3-way/3-expert noted as an extension (closer to production), 2 experts for the decisive causal run.

README: add Router dynamics (three forces, one pin-vs-reward conflict, mitigations).

Add HF "MoE in Transformers" blog to docs/papers (force-added past the docs gitignore).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:45:42 +08:00 · 2026-06-14 13:06:38 +08:00
parent 04a98b321e
commit 8f39c4a69f
3 changed files with 552 additions and 314 deletions
@@ -1,76 +1,67 @@
 # Evil MoE
-Evil MoE trains a mixture-of-experts in which one expert carries reward-hacking behaviour
+Evil MoE trains a mixture of experts in which one expert carries reward-hacking behaviour and
-and is removed at deployment. It is a fork of [vGROUT](https://github.com/wassname/vGROUT),
+is removed at deployment. Routing is done by a learned soft router, seeded by an extracted hack
-kept as the `upstream` remote, and reuses vGROUT's substrate: the Ariahw and Nanda
+direction and held by a pin loss, rather than by a per-example label. It is a fork of
-reward-hacking LeetCode environment, the GRPO loop, the reward grader, and the
+[vGROUT](https://github.com/wassname/vGROUT) and reuses its substrate (the reward-hacking
-deployment-ablation evaluator. The routing mechanism is the only part that changes.
+LeetCode environment, GRPO loop, reward grader, deployment-ablation evaluator); only the routing
 mechanism changes. Background, literature map, and design rationale are in
 [docs/spec/](docs/spec/) and [AGENTS.md](AGENTS.md).
 ## Hypothesis
-> A learned MoE-style router, seeded by a synthetic activation-space hack direction and
+> A learned MoE-style router, seeded by a synthetic activation-space hack direction and anchored
-> anchored by a continuous pin loss on hand-authored contrastive pairs, can localize
+> by a continuous pin loss on hand-authored contrastive pairs, can localize reward-hacking
-> reward-hacking behaviour in a single ablatable expert. The test is causal: ablate the
+> behaviour in a single ablatable expert. Test: ablate the hack expert at deployment and check
-> hack expert at deployment and measure whether the reward-hack rate drops while the
+> that it suppresses more hacking than solving.
 > ground-truth solve rate survives, and whether it drops more than ablating a random or
 > clean expert at matched capacity.
-This is a localization claim, not a strict gradient-routing absorption claim. The original
+A localization claim, not a strict gradient-routing absorption claim.
 proposal and the literature map are in [docs/spec/](docs/spec/).
-## Background
+## Routing
-Three routing mechanisms differ in how the gradient assignment is decided. In Gradient
+The adapter is one rank-`2r` LoRA per target Linear (`src/vgrout/lora2r.py`), split into a
-Routing (Cloud et al.) a data label decides it, applied as a hard backward mask. In a learned
+deployed block `[:r]` (always trained, kept) and a quarantine block `[r:]` (the hack expert,
-MoE the router decides it, trained from the task loss; an expert that lowers the loss on some
+reset to init at deployment). The substrate is Ariahw and Nanda's reward-hacking LeetCode
-inputs is routed more of them and improves further, so a learned router tends to concentrate
+environment, inherited from vGROUT.
 related inputs in one expert. SGTM (Shilov et al.) connects the two: once a hard mask seeds
 localization on labeled data, unlabeled data of the same kind comes to update the same
 parameters without a mask. Evil MoE replaces SGTM's hard mask with a soft learned router,
 seeds it with the extracted hack direction, and relies on the router's concentration under
 GRPO. The router is also a parameter the reward pushes on, so the pin loss is applied every
 step rather than only at initialization. GRPO has been run on MoE models before: DeepSeek-R1
 trains the 671B DeepSeek-V3 MoE with GRPO, and MoE-GRPO (arXiv:2603.24984) optimizes the
 router itself with GRPO.
-## The adapter
+Per rollout the router (`src/vgrout/moe_router.py`) reads the pooled deployed-block bottleneck
 activations and emits one weight `w` in `[0,1]`. The forward hook scales the hack expert by `w`
 and the deployed expert's gradient by `1-w` (forward value kept). So `w=0` trains only the
 deployed block (and reproduces the deployment forward), `w=1` only the hack expert, intermediate
 `w` both.
-Every target Linear gets one rank-`2r` LoRA (`src/vgrout/lora2r.py`), `A:[2r,d_in]` and
+Two signals train the router. Reward (the GRPO loss) flows in through `w`. A pin loss on the
-`B:[d_out,2r]`, with frozen Gaussian-init copies subtracted so the net delta is zero at
+hand-authored pairs, applied every step, pushes `w` toward 1 on hack and 0 on clean. The router
-initialization. The `2r` rows and columns split into two independent experts. The deployed
+direction is initialized to `v_act`, the unit hack-minus-clean activation difference from those
-block `[:r]` is always present in the forward pass and always trained. The quarantine block
+pairs, so it starts as a fixed vector gate and then specializes. The only labels are the hack and
-`[r:]` is the hack expert. At deployment the quarantine block is reset to its initialization,
+clean sides of those off-distribution pairs; no training-rollout label or environment oracle
-so its learned contribution is absent from the deployed model.
+enters the routing.
-## Method
+This replaces the label-driven hard mask of gradient routing (Cloud et al.) and its
 self-reinforcing variant SGTM (Shilov et al.) with a learned soft router; the linear gate and the
 GRPO-on-MoE setup follow MixLoRA.
-For each rollout a learned router (`src/vgrout/moe_router.py`) reads the pooled deployed-block
+## Router dynamics
 bottleneck activations and emits one weight `w` in `[0,1]`. The forward hook scales the hack
 expert by `w` and scales the deployed expert's gradient by `1-w` while keeping its forward
 value. So `w=0` trains only the deployed block and reproduces the deployment forward, `w=1`
 trains only the hack expert with the deployed block detached, and intermediate `w` trains
 both. These are the soft form of vGROUT routeA's keep, absorb, and rout masks.
-The router is trained two ways at once. GRPO flows into it through `w`: raising `w` on a
+A normal MoE router and the task loss already cooperate by construction: the router gets gradient
-rollout moves that rollout's learning from the deployed block into the hack expert. A pin loss
+from the loss and learns to send each input to the expert that lowers it (an auxiliary
-on the hand-authored pairs, applied every step, pushes `w` toward 1 on the hack side and
+load-balancing loss stops it collapsing onto one). Evil MoE inherits that and adds one thing, the
-toward 0 on the clean side. The router direction is initialized from `v_act`, the
+pin, which forces the labeled hand-authored hacks to the quarantine (ablatable) expert and the
-hack-minus-clean activation difference extracted from those pairs, so it starts as the vector
+clean pairs to the deployed (kept) expert.
 gate and then specializes.
-There is no load-balancing loss. Load balancing forces even expert use and would suppress the
+That introduces one conflict. The pin only constrains routing on the labeled pairs, while reward
-asymmetric specialization the method depends on (Demons in the Detail, arXiv:2501.11873; The
+decides where the unlabeled live hacks go, and reward prefers the deployed expert because it is
-Illusion of Specialization, arXiv:2601.03425). Routing is per rollout, not per token, because
+always on and fully on-policy, so it is the cheaper place to express a hack. Ablation cannot
-reward hacking is a property of a whole rollout and the deployment test ablates the expert at
+remove what lands there. The method works only if live hacks follow the pinned labeled ones into
-the rollout level. This makes Evil MoE a behavioral mixture of adapters rather than a capacity
+the quarantine (SGTM's self-reinforcement bet) faster than reward relearns them in the deployed
-MoE. The canonical per-token LoRA-MoE substrate is MixLoRA; Evil MoE borrows its small linear
+block. That is what the causal ablation measures.
 gate and the GRPO-on-MoE precedent but not its per-token routing or its load-balancing loss.
-The only labels used in training are the hack and clean sides of the hand-authored pairs.
+Three things bias it toward localizing: the router is seeded with `v_act` so it starts aligned;
-These pairs are off-distribution and authored before observing any training rollout. No
+the pin fires every step so reward cannot rotate the axis off "hack"; and on a routed hack the
-ground-truth label from a training rollout, and no environment-specific oracle, enters the
+deployed expert is detached. The detach is the one open leak: it is soft (`1-w`), not hard, so
-router or the routing. The deployment grader is a measurement instrument that scores the final
+the deployed expert still gets a `1-w` share of the hack gradient. A hard detach above a `w`
-evaluation only.
+threshold would close it at no cost, since the router's reward gradient flows only through the
 `w*quar` term.
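
The router, pin, and soft detach described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in `src/vgrout/moe_router.py`: the function names, the `scale` parameter, and the choice of binary cross-entropy for the pin are all hypothetical. It also checks the identity from the commit message that a softmax over two experts equals a sigmoid of one logit difference.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def softmax2_first(a: float, b: float) -> float:
    """Probability of the first of two experts under a softmax over logits (a, b)."""
    m = max(a, b)  # subtract the max for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb)

def cosine(u: list, v: list) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def router_weight(h: list, v_act: list, scale: float = 5.0) -> float:
    """Hypothetical rank-1 cosine router: w near 1 when h aligns with the hack direction."""
    return sigmoid(scale * cosine(h, v_act))

def pin_loss(w: float, is_hack: bool) -> float:
    """Assumed pin form (binary cross-entropy): pushes w to 1 on hack pairs, 0 on clean pairs."""
    eps = 1e-7
    w = min(max(w, eps), 1.0 - eps)
    return -math.log(w) if is_hack else -math.log(1.0 - w)

def gradient_shares(w: float) -> dict:
    """Share of a rollout's gradient reaching each expert under the soft (1-w) detach."""
    return {"quarantine": w, "deployed": 1.0 - w}

v_act = [1.0, 0.0]
print(router_weight([0.9, 0.1], v_act))  # aligned with v_act, so w is high
print(gradient_shares(0.8))              # the deployed expert still receives a 0.2 share
```

With `w = 0.8` the deployed block still receives a fifth of the hack gradient, which is the leak a hard detach above a threshold would close.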
 ## What it measures
@@ -0,0 +1,308 @@
 Title: Mixture of Experts (MoEs) in Transformers
 URL Source: https://huggingface.co/blog/moe-transformers
 Published Time: 2026-02-26T00:00:00.677Z
 Markdown Content:
 Authors: [Aritra Roy Gosthipaty](https://huggingface.co/ariG23498), [Pedro Cuenca](https://huggingface.co/pcuenq), [merve](https://huggingface.co/merve), [Ilyas Moutawwakil](https://huggingface.co/IlyasMoutawwakil), [Arthur Zucker](https://huggingface.co/ArthurZ), [Sergio Paniego](https://huggingface.co/sergiopaniego), [Pablo Montalvo](https://huggingface.co/Molbap)
 ## [](https://huggingface.co/blog/moe-transformers#introduction) Introduction
 Over the past few years, scaling dense language models has driven most progress in LLMs. From early models like the original [ULMFiT](https://nlp.fast.ai/classification/2018/05/15/introducing-ulmfit.html) (~30M parameters) or GPT-2 (1.5B parameters, which at the time was considered "too dangerous to release" 🧌), and eventually to today’s hundred-billion–parameter systems, the recipe was simple:
 > More data + more parameters gives better performance.
 [Scaling laws](https://huggingface.co/papers/2001.08361) reinforced this trend, but dense scaling has practical limits:
 *   Training becomes increasingly expensive.
 *   Inference latency grows.
 *   Deployment requires significant memory and hardware.
 This is where Mixture of Experts (MoEs) enter the picture.
 > If you're already familiar with MoEs and want to jump straight into the engineering work done in transformers, you can head directly to [Transformers and MoEs](https://huggingface.co/blog/moe-transformers#transformers-and-moes).
 ## [](https://huggingface.co/blog/moe-transformers#from-dense-to-sparse-what-are-moes) From Dense to Sparse: What Are MoEs?
 A Mixture of Experts model keeps the Transformer backbone, but replaces certain dense feed-forward layers with a set of **experts**. An “expert” is not a topic-specialized module (e.g., "math expert", "code expert"). It is simply a learnable sub-network. For each token, a **router** selects a small subset of experts to process it.
 | [![Image 8: MoE routing diagram](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/moe-transformers/moe_routing.png)](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/moe-transformers/moe_routing.png) |
 | --- |
 | Figure 1: Expert 1 among 4 experts is activated (Source: [Maarten Grootendorst](https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts)) |
 Different tokens activate different experts, based on their hidden representations.
 > Model capacity depends on total parameters, but inference speed depends on active parameters.
 This is the key idea.
 For example, take [`gpt-oss-20b`](https://huggingface.co/openai/gpt-oss-20b). It has 21B total parameters, but uses 4 active experts per token, out of a total of 32 experts. Considering the shared components plus the active experts, this model uses ~3.6B active parameters per token. Running this model on an M3 Ultra Mac, which has a memory bandwidth of about 800 GB/s, we could estimate generation speed as ~ `800 / (3.6 * 2)` in `bfloat16`, where each parameter takes 2 bytes. This yields about **111 tokens per second**. The actual performance number we get is ~115 tok/s, which is very close to the back-of-the-envelope calculation.
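
The back-of-the-envelope estimate above can be written as a small function. This is a sketch that assumes generation is purely memory-bandwidth bound, so every active parameter is read once per token; the function name is illustrative.

```python
def estimate_tokens_per_second(bandwidth_gb_per_s: float,
                               active_params_billions: float,
                               bytes_per_param: float) -> float:
    """Upper bound on decode speed when each token requires reading all active weights."""
    gb_read_per_token = active_params_billions * bytes_per_param
    return bandwidth_gb_per_s / gb_read_per_token

# Numbers from the text: ~800 GB/s bandwidth, ~3.6B active parameters, bfloat16 (2 bytes).
tok_s = estimate_tokens_per_second(800, 3.6, 2)
print(f"{tok_s:.0f} tokens/s")
```

Swapping in the total parameter count (21B) instead of the active count shows why a dense model of the same size would decode several times slower on the same hardware.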
 [Video 3](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/moe-transformers/gpt-oss-20-inference.mp4)
 This super fast speed confirms the model works approximately as a 3.6B parameter one, but it has the same capacity (or quality) as a 21B parameter model.
 _(Note: speed would be even faster if we used kernels for the native mxfp4 quantization the model uses)._
 MoEs are attractive for these reasons:
 1.   Better Compute Efficiency
 Given a fixed training FLOP budget, MoEs often outperform dense counterparts.
 | [![Image 9: MoE vs Dense training graphs](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/moe-transformers/faster_training.png)](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/moe-transformers/faster_training.png) |
 | --- |
 | Figure 2: Dense vs. MoE training curves (Source: [OLMoE: Open Mixture-of-Experts Language Models](https://huggingface.co/papers/2409.02060)) | 
 This means faster iteration and better scaling efficiency.
 2.   A Natural Parallelization Axis
 Experts provide a structural boundary in the computation graph. Since different tokens engage different experts, we can parallelize across experts (we discuss this later in [Expert Parallelism](https://huggingface.co/blog/moe-transformers#expert-parallelism)).
 3.   Industry Adoption
 Major open MoE models released in the past few weeks include [Qwen 3.5](https://huggingface.co/collections/Qwen/qwen35), [MiniMax M2](https://huggingface.co/collections/MiniMaxAI/minimax-m2), [GLM-5](https://huggingface.co/collections/zai-org/glm-5), and [Kimi K2.5](https://huggingface.co/collections/moonshotai/kimi-k25).
 The trend accelerated after the success of [DeepSeek R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) in January 2025, building on earlier systems like [DeepSeek V2](https://huggingface.co/deepseek-ai/DeepSeek-V2). Another early MoE was [Mixtral-8x7B](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1), released in December 2023.
 | [![Image 10: 2-year timeline of MoE model addition in the transformers package](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/moe-transformers/moe_2y_timeline.png)](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/moe-transformers/moe_2y_timeline.png) |
 | --- |
 | Figure 3: 2-year timeline of MoE model addition to the `transformers` library. DeepSeek R1 marks a clear inflection point. | 
 Closed labs use MoEs too. ChatGPT has long been [_rumored_](https://x.com/soumithchintala/status/1671267150101721090) to use a sparse architecture, and the open [gpt-oss models](https://huggingface.co/collections/openai/gpt-oss) certainly do.
 > If you want to learn more about MoEs in general, we strongly suggest reading [this blog](https://huggingface.co/blog/moe) and watching our recent [YouTube video on routing](https://youtu.be/CDnkFbW-uEQ).
 ## [](https://huggingface.co/blog/moe-transformers#transformers-and-moes) Transformers and MoEs
 Most tooling in the ecosystem, including model loading, device placement, quantization, and backend execution was originally designed for **dense** models. MoEs challenge these assumptions.
 Making MoEs **first-class citizens** in `transformers` means redesigning parts of the loading pipeline, execution model, and distributed abstractions, not just adding new model classes. We’ll focus on how the `transformers` library has evolved to support sparse architectures across:
 *   [Weight Loading Refactor](https://huggingface.co/blog/moe-transformers#weight-loading-refactor)
 *   [Expert Backend](https://huggingface.co/blog/moe-transformers#expert-backend)
 *   [Expert Parallelism](https://huggingface.co/blog/moe-transformers#expert-parallelism)
 *   [Training MoEs with transformers](https://huggingface.co/blog/moe-transformers#training-moes-with-transformers)
 ## [](https://huggingface.co/blog/moe-transformers#weight-loading-refactor) Weight Loading Refactor
 [`AutoModelForCausalLM.from_pretrained("model_id")`](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.AutoModelForCausalLM.from_pretrained) downloads and loads model weights into a PyTorch model. For dense models, loading is relatively straightforward: each tensor in the checkpoint maps one-to-one to a parameter in the runtime module.
 For MoEs, it’s more complicated. In most MoE checkpoints, each expert is serialized independently. If you peek inside the [DeepSeek-V3 checkpoint index](https://huggingface.co/deepseek-ai/DeepSeek-V3/raw/main/model.safetensors.index.json), you’ll see keys like:
 ```
 model.layers.3.mlp.experts.0.gate_proj.weight
 ...
 model.layers.3.mlp.experts.255.gate_proj.weight
 ```
 Each expert has its own set of weight matrices: taking DeepSeek-V3 as an example, that is essentially 256 small feed-forward networks (indexed 0 to 255) saved side by side. At runtime, however, GPUs execute optimized kernels. Modern MoE kernels such as [grouped GEMMs and fused MoE implementations](https://huggingface.co/kernels-community/megablocks) are designed to process _all experts in a single operation_, not by looping over them one at a time.
 To do that efficiently, they require expert weights to be packed into a single **contiguous tensor**.
 So we have a mismatch:
 *   **Checkpoint:** 256 separate tensors
 *   **Runtime:** 1 packed tensor
 Bridging this gap systematically is what the [weight loading refactor](https://github.com/huggingface/transformers/pull/41580) enables.
 With the introduction of a [generic WeightConverter](https://huggingface.co/docs/transformers/main/en/weightconverter), the mental model shifted from:
 > A checkpoint already matches my runtime layout; loading is mostly a key-by-key copy.
 to:
 > A checkpoint is just a serialized source of tensors. Loading is a **conversion pipeline** that transforms them into the runtime layout we want.
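
As a rough illustration of the "conversion pipeline" idea, the sketch below groups per-expert checkpoint keys into one packed entry per layer. It is not the `transformers` loader: `pack_experts` and the key pattern are hypothetical, and nested Python lists stand in for tensors (a real loader would stack them into one contiguous tensor).

```python
import re
from collections import defaultdict

# Matches keys such as "model.layers.3.mlp.experts.0.gate_proj.weight".
EXPERT_KEY = re.compile(r"^(?P<prefix>.+\.experts)\.(?P<idx>\d+)\.(?P<name>.+)$")

def pack_experts(checkpoint: dict) -> dict:
    """Group per-expert entries into one list per target key, ordered by expert index."""
    grouped = defaultdict(dict)
    packed = {}
    for key, tensor in checkpoint.items():
        match = EXPERT_KEY.match(key)
        if match is None:
            packed[key] = tensor  # non-expert weights are copied through unchanged
            continue
        target = f"{match['prefix']}.{match['name']}"
        grouped[target][int(match["idx"])] = tensor
    for target, by_index in grouped.items():
        packed[target] = [by_index[i] for i in sorted(by_index)]
    return packed

checkpoint = {
    "model.layers.3.mlp.experts.1.gate_proj.weight": [[3.0, 4.0]],
    "model.layers.3.mlp.experts.0.gate_proj.weight": [[1.0, 2.0]],
    "model.layers.3.input_layernorm.weight": [1.0, 1.0],
}
packed = pack_experts(checkpoint)
print(sorted(packed))
```

The expert index moves out of the key name and becomes the leading axis of the packed entry, which is the layout grouped kernels expect.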
 ### [](https://huggingface.co/blog/moe-transformers#dynamic-weight-loading-with-weightconverter) Dynamic Weight Loading with `WeightConverter`
 The central abstraction introduced by this refactor is **dynamic weight loading** via a [`WeightConverter`](https://huggingface.co/docs/transformers/main/en/internal/weight_converter).
 `WeightConverter` lets us define:
 ```
 source key patterns → target key(s) + operations
 ```
 Primitive operations (chunk, concatenate, etc.) are composable. Two that are particularly useful for MoEs:
 *   [`MergeModulelist`](https://github.com/huggingface/transformers/blob/main/src/transformers/core_model_loading.py) merges a list of tensors into a single tensor. For example, you can compose `MergeModulelist` with `Concatenate` to stack the experts in a MoE and pack them into one tensor.
 ```
 WeightConverter(
    ["block_sparse_moe.experts.*.w1.weight", "block_sparse_moe.experts.*.w3.weight",],
    "mlp.experts.gate_up_proj",
    operations=[
        MergeModulelist(dim=0),
        Concatenate(dim=1),
    ],
 )
 ``` 
 *   [`SplitModulelist`](https://github.com/huggingface/transformers/blob/b71de73468429eb02da18caa50e9b5200400a4ed/src/transformers/core_model_loading.py#L208) splits a tensor back into a list of tensors. For example, you can split a stack of experts back into individual experts.
 ```
 WeightConverter(
    "mlp.experts.down_proj",
    "block_sparse_moe.experts.*.w2.weight",
    operations=[SplitModulelist(dim=0)],
 )
 ``` 
 ### [](https://huggingface.co/blog/moe-transformers#lazy-materialization-of-tensors) Lazy Materialization of Tensors
 The refactor improves not just _what_ conversions exist, but _how_ they’re scheduled.
 The loader scans checkpoint keys once, matches them against converter patterns, and groups tensors per converter. Once a key is identified as needed, it’s registered as a _future_ and materialized via a thread pool. Conversion operations run only once their dependencies are ready. For example, `MergeModulelist` waits until all experts for a layer are loaded.
 This avoids repeated scans and reduces memory peaks.
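
The scheduling idea can be sketched with Python's standard `concurrent.futures`. This is a toy model of the behaviour described above, not the library's loader: `load_tensor` and `merge_when_ready` are hypothetical stand-ins, and lists stand in for tensors.

```python
from concurrent.futures import ThreadPoolExecutor

def load_tensor(key: str) -> list:
    """Hypothetical stand-in for reading one expert tensor from a checkpoint shard."""
    expert_index = int(key.split(".")[1])  # key looks like "experts.<i>.w1"
    return [float(expert_index)] * 2

def merge_when_ready(futures: list) -> list:
    """Conversion step: runs only once every dependency has been materialized."""
    return [future.result() for future in futures]  # result() blocks until ready

keys = [f"experts.{i}.w1" for i in range(4)]
with ThreadPoolExecutor(max_workers=2) as pool:
    # Single pass over the keys: each needed tensor is registered as a future.
    futures = [pool.submit(load_tensor, key) for key in keys]
    merged = merge_when_ready(futures)
print(merged)
```

Because the merge waits on futures rather than re-scanning the checkpoint, loading and conversion can overlap across layers.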
 ### [](https://huggingface.co/blog/moe-transformers#benchmark-weight-loading-pipeline-improvements) Benchmark: Weight-Loading Pipeline Improvements
 To evaluate the improvements introduced by the new weight-loading pipeline, we benchmarked the v4 vs v5 versions of `transformers`. The focus is on loading speed of large MoE models, which is often a bottleneck in training and inference.
 We benchmarked v4 vs v5 using:
 *   v4 branch: [https://github.com/ariG23498/transformers/tree/bench-v4](https://github.com/ariG23498/transformers/tree/bench-v4)
 *   v5 branch: [https://github.com/ariG23498/transformers/tree/bench-v5](https://github.com/ariG23498/transformers/tree/bench-v5)
 Example:
 ```
 from transformers import AutoModelForCausalLM
 model_id = "Qwen/Qwen1.5-110B-Chat"
 model = AutoModelForCausalLM.from_pretrained(model_id)
 ```
 Two relevant environment variables:
 *   `HF_ENABLE_PARALLEL_LOADING`: Enables parallel shard loading via threads.
 *   `HF_DEACTIVATE_ASYNC_LOAD`: Disables the new async pipeline (v5 escape hatch).
 ### [](https://huggingface.co/blog/moe-transformers#results) Results
 **Model:** `Qwen/Qwen1.5-110B-Chat` **GPU:** 1× A100 (80GB)
 | Version | Strategy | Loading Mode | Time |
 | --- | --- | --- | --- |
 | v4.57.6 | `device_map="auto"` | Threadpool | 66.24s |
 | v4.57.6 | `device_map="auto"` | Sequential | 67.29s |
 | v4.57.6 | TP | — | OOM |
 | v5 | `device_map="auto"` | Async (default) | 20.71s |
 | v5 | `device_map="auto"` | Sync | 45.3s |
 | v5 | TP | Async | 10.1s |
 | v5 | TP | Sync | 19.28s |
 | [![Image 11: Loading benchmarks](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/moe-transformers/loading_benchmark.png)](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/moe-transformers/loading_benchmark.png) |
 | --- |
 | Figure 4: Loading benchmarks (v4 vs v5) |
 The speedup is not just “more threads.”
 It’s the combination of **Single-pass routing**, **Async materialization**, and **Conversion-aware scheduling** which together avoid unnecessary materialization and memory peaks while enabling expert packing and projection fusion at load time.
 ### [](https://huggingface.co/blog/moe-transformers#where-quantization-fits-in) Where Quantization Fits In
 With this refactor we can now create the runtime module structure first and then convert the weights into the structure. We can now optionally attach quantization within the conversion pipeline, making quantization part of the weight loading pipeline itself. This is crucial because quantizing “per expert” only makes sense once experts exist in a predictable packed layout.
 This end-to-end pipeline was not possible earlier; it is now exposed to users as an API.
 ## [](https://huggingface.co/blog/moe-transformers#expert-backend) Expert Backend
 Once experts are packed into a single runtime tensor, another question arises:
 > How do you actually route through them efficiently?
 In a Mixture of Experts model, each token is routed to different experts. This means the runtime must dispatch tokens to their selected expert weights, execute the projections efficiently, apply the routing weights and then collect and reorder the results.
 This is what the [Experts Backend system](https://huggingface.co/docs/transformers/experts_interface) (introduced in [PR #42697](https://github.com/huggingface/transformers/pull/42697)) addresses. The Experts Backend introduces a **pluggable execution architecture** that decouples expert computation from the model implementation. Instead of hardcoding one dispatch strategy inside each MoE model, the system allows expert layers to dynamically select a backend at runtime.
 This is implemented via a decorator pattern:
 ```
@use_experts_implementation
 ```
 The decorator wraps expert classes and dispatches computation to the selected backend automatically.
 Three backends are currently provided:
 1.   `eager` which loops over the selected experts and applies projections per expert. This is used for correctness reference and debugging.
 2.   `batched_mm` uses the [`torch.bmm`](https://docs.pytorch.org/docs/stable/generated/torch.bmm.html) API. This duplicates selected expert weights per token and performs a single batched GEMM. This backend is very well suited for small batch, GPU-heavy workloads where memory is available.
 3.   `grouped_mm` uses [`torch._grouped_mm`](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.grouped_mm.html) API. Here we sort tokens by expert ID, group them, and then perform a single grouped GEMM. This backend shines with large batches or memory-constrained setups.
 | [![Image 12](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/moe-transformers/expert_backend.png)](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/moe-transformers/expert_backend.png) |
 | --- |
 | Figure: Expert backend illustration |
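
To make the difference between the `eager` and `grouped_mm` strategies concrete, here is a toy sketch in pure Python. It assumes top-1 routing and replaces each expert's projection with a scalar multiply; the function names are illustrative and no real kernel is called.

```python
def eager_dispatch(tokens: list, expert_ids: list, expert_scales: list) -> list:
    """Loop over experts and process the tokens routed to each one."""
    out = [0.0] * len(tokens)
    for expert, scale in enumerate(expert_scales):
        for i, (x, chosen) in enumerate(zip(tokens, expert_ids)):
            if chosen == expert:
                out[i] = scale * x
    return out

def grouped_dispatch(tokens: list, expert_ids: list, expert_scales: list) -> list:
    """Sort tokens by expert ID, process contiguous groups, then restore the order."""
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    sorted_tokens = [tokens[i] for i in order]
    counts = [0] * len(expert_scales)  # group sizes, one per expert
    for expert in expert_ids:
        counts[expert] += 1
    out_sorted, start = [], 0
    for expert, count in enumerate(counts):
        group = sorted_tokens[start:start + count]  # one contiguous slice per expert
        out_sorted.extend(expert_scales[expert] * x for x in group)
        start += count
    out = [0.0] * len(tokens)
    for position, original_index in enumerate(order):
        out[original_index] = out_sorted[position]  # scatter back to input order
    return out

tokens = [1.0, 2.0, 3.0, 4.0]
expert_ids = [2, 0, 2, 1]
expert_scales = [10.0, 100.0, 1000.0]
print(grouped_dispatch(tokens, expert_ids, expert_scales))
```

Both functions return the same values; the grouped version simply arranges the work so each expert sees one contiguous slice, which is what lets a grouped GEMM handle all experts in a single call.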
 ## [](https://huggingface.co/blog/moe-transformers#expert-parallelism) Expert Parallelism
 Mixture of Experts (MoE) models can have hundreds of billions of parameters (far more than what fits on a single GPU). Expert parallelism (EP) addresses this by distributing experts across multiple devices. Each device loads only its assigned subset of experts, computes for those experts and then participates in result aggregation. This approach scales models to far larger parameter counts without increasing computation cost because each token activates only a few experts.
 Expert parallelism is enabled via `enable_expert_parallel`:
 ```
 import torch
 from transformers import AutoModelForCausalLM, AutoTokenizer
 from transformers.distributed.configuration_utils import DistributedConfig
 distributed_config = DistributedConfig(enable_expert_parallel=True)
 model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b",
    dtype="auto",
    distributed_config=distributed_config,
 )
 ```
 Launch with:
 ```
 torchrun --nproc-per-node N script.py
 ```
 Where `N` evenly divides the total number of experts, and possibly matches the number of GPUs in your node.
 When `enable_expert_parallel=True`, the model switches from the standard tensor-parallel (TP) plan to an expert-parallel (EP) plan with specialized sharding strategies.
 Core components of EP lie in:
 1.   [`GroupedGemmParallel`](https://github.com/huggingface/transformers/blob/b71de73468429eb02da18caa50e9b5200400a4ed/src/transformers/integrations/tensor_parallel.py#L934): This splits the expert weights along the expert dimension (`dim=0`). Here each device loads only `num_experts / num_devices`.
 2.   [`RouterParallel`](https://github.com/huggingface/transformers/blob/b71de73468429eb02da18caa50e9b5200400a4ed/src/transformers/integrations/tensor_parallel.py#L977): This remaps global expert indices to local indices, masks out experts not assigned to the current rank, ensures each device computes only with its local experts and uses an all-reduce to combine partial outputs across devices.
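
The two components above can be illustrated with a toy model. This sketch assumes experts are assigned to ranks in contiguous blocks and uses scalar multiplies in place of expert projections; the function names are hypothetical and the "all-reduce" is a plain element-wise sum over simulated ranks.

```python
def local_expert_ids(global_ids: list, rank: int, num_experts: int, world_size: int) -> list:
    """Map global expert indices to this rank's local indices; None marks a masked expert."""
    per_rank = num_experts // world_size  # assumes world_size divides num_experts
    first = rank * per_rank
    return [g - first if first <= g < first + per_rank else None for g in global_ids]

def partial_outputs(tokens: list, global_ids: list, expert_scales: list,
                    rank: int, world_size: int) -> list:
    """Compute only tokens routed to local experts; other positions contribute zero."""
    per_rank = len(expert_scales) // world_size
    local_scales = expert_scales[rank * per_rank:(rank + 1) * per_rank]  # this rank's shard
    local_ids = local_expert_ids(global_ids, rank, len(expert_scales), world_size)
    return [0.0 if l is None else local_scales[l] * x for x, l in zip(tokens, local_ids)]

def all_reduce_sum(per_rank_outputs: list) -> list:
    """Simulated all-reduce: element-wise sum of every rank's partial output."""
    return [sum(values) for values in zip(*per_rank_outputs)]

tokens = [1.0, 2.0, 3.0, 4.0]
global_ids = [0, 2, 3, 7]                          # expert chosen for each token
expert_scales = [float(e + 1) for e in range(8)]   # 8 experts sharded over 4 ranks
world_size = 4
combined = all_reduce_sum(
    [partial_outputs(tokens, global_ids, expert_scales, r, world_size) for r in range(world_size)]
)
print(combined)
```

Each rank holds only `num_experts / num_devices` experts, yet the summed result matches what a single device holding all experts would produce.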
 ## [](https://huggingface.co/blog/moe-transformers#training-moes-with-transformers) Training MoEs with Transformers
 MoEs are excellent for scaling inference, but training them is significantly more complex.
 MoEs have a massive parameter count, distributed expert communication is complicated, and routing instabilities need to be handled. To address this, we collaborated with **Unsloth** to enable significantly faster Mixture-of-Experts training:
 *   ~12× faster MoE training
 *   >35% VRAM reduction
 *   ~6× longer context
 *   12–30× overall speedup compared to v4
 We leverage the Expert Backend abstraction, standardize around PyTorch’s `torch._grouped_mm` API and use custom Triton grouped-GEMM + LoRA kernels. Unsloth builds on top of the Transformers (and TRL) optimizations to push performance further.
 > For full details, we recommend reading: [Unsloth’s official guide](https://unsloth.ai/docs/new/faster-moe)
 ## [](https://huggingface.co/blog/moe-transformers#conclusion) Conclusion
 As sparse architectures continue to evolve, we want the transformers library to evolve with them. If you’re building with MoEs or experimenting with new sparse ideas, we’d love to hear from you. Let us know what abstractions, kernels, or workflows you’d like to see next in `transformers`.
@@ -1,315 +1,254 @@
-# Evil Expert: Putting the E in MoE by routing unwanted behaviour away
+# Evil MoE spec
-Status: proposal, not implemented.
+Status: core implemented, `just smoke` green; decisive Qwen3-4B run pending.
 ## BLUF
-"Evil Expert" / "Evil MoE" is a separate experiment: an MoE where one or more experts are trained to carry reward-hack-associated behaviour, then ablated at eval/deploy. LoRA is the cheap experimental substrate, not the core claim.
+Evil MoE is a 2-expert ablatable mixture of adapters. One expert (the quarantine block of a
 rank-`2r` LoRA) is steered to carry reward-hacking behaviour and is reset to its initialization
 at deployment. A soft router, seeded with an activation-space hack direction `v_act` and held by
 a pin loss on hand-authored contrastive pairs, decides per rollout how much of each rollout's
 GRPO gradient trains the quarantine versus the always-on deployed expert. Success is causal:
 ablating the quarantine drops the reward-hack rate more than it drops the solve rate, and more
 than ablating a random or clean expert at matched capacity.
-The hypothesis is independent of the Gradient Routing absorption hypothesis: MoE-style routing can localize unwanted behaviour in an ablatable expert. Gradient Routing and SGTM are useful background because they discuss localization, absorption, leakage, and ablation, but Evil MoE should be evaluated on its own causal ablation test.
+This is an ablatable-localization claim, not a strict Gradient-Routing absorption claim.
-The training constraint is the same as the rest of this repo. This is a project constraint, not a literature claim:
+## Oracle-free constraint
-## *AGENTS.md* - project constraint for vGROUT
+## *AGENTS.md* - project constraint
 - epistemic context: standing repo instruction from the user, included to make the experimental boundary explicit.
-> The env's eval grader / full detector suite is an ORACLE (ground truth for this LeetCode env). Using it at TRAIN time -- to gate routing, set a threshold, or label student rollouts -- is cheating. It may only score the final deploy eval.
+> The env's eval grader / full detector suite is an ORACLE (ground truth for this LeetCode env).
 > Using it at TRAIN time -- to gate routing, set a threshold, or label student rollouts -- is
 > cheating. It may only score the final deploy eval.
 >
-> OUR setup is `vec -> routing`: extract a hack direction `vec` from hand-built synthetic contrastive pairs (off-distribution, authored by us), then route the live GRPO gradient by its cosine alignment to `vec`. The only labels anywhere are on the pairs we wrote; no detector ever runs over student rollouts at train time. Generalization is tested by whether `vec` (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs -- vector generalization, not detector-label generalization.
+> OUR setup is `v_act -> routing`: extract a hack direction from hand-built synthetic contrastive
 > pairs (off-distribution, authored by us), then route the live GRPO gradient by alignment to it.
 > The only labels anywhere are on the pairs we wrote; no detector runs over student rollouts at
 > train time. Generalization is tested by whether `v_act` (built from pairs covering some hack
 > modes) suppresses held-out modes absent from the pairs.
-## Goal
+The GRPO reward (passing the gameable tests) is the RL training signal and is allowed; the
 forbidden thing is the ORACLE hack-label, which detects true hacks and only scores the final eval.
-Train a model with several experts. For the first experiment, implement the experts as LoRA adapters because this is cheap and close to the current code:
+## Design
-```text
+Two experts live in one rank-`2r` LoRA per target Linear (`src/vgrout/lora2r.py`): a deployed
-base model + shared LoRA + retain-associated expert(s) + hack-associated expert(s)
+block `[:r]` (always on, kept at deploy) and a quarantine block `[r:]` (the hack expert, reset to
 its initialization at deploy). Frozen init copies are subtracted so the net adapter delta is zero
 at init.
 The forward keeps both experts present. This is the load-bearing choice (see "Why soft" below):
 ```python
 def lora2r_forward(x, w):                 # w in [0,1] from the router, per rollout
    dep  = B[:, :r] @ (A[:r] @ x)         # deployed expert  (init copy subtracted)
    quar = B[:, r:] @ (A[r:] @ x)         # quarantine (hack) expert
    dep  = (1 - w)*dep + w*dep.detach()   # value intact, gradient scaled by (1 - w)
    return base(x) + dep + w*quar          # BOTH stay in the forward; no top-k select
 ```
-Use a small amount of synthetic hack-vs-clean router supervision to initialize the hack-associated expert. Then let a sparse MoE router specialize under GRPO, so reward-hacky behavior preferentially uses that expert. LoRA is only there to make this test feasible without training full expert MLPs. At eval, compare:
+The router is a seeded rank-1 gate. For 2 experts a standard `Linear(d, n_experts) + softmax`
 router collapses to a single direction (softmax over 2 logits = `sigmoid` of their difference),
 so the proper 2-expert router IS rank-1. We seed that direction with `v_act` and cosine-normalize
 it so the logit stays O(1) and the sigmoid does not start saturated:
-```text
+```python
-all experts enabled:                 base + shared + retain-associated + hack-associated experts
+direction = v_act.flatten()               # seeded, trainable; scale and bias also trainable
-hack-associated expert disabled:     base + shared + retain-associated experts
+def router(acts):                         # acts pooled over completion tokens, per rollout
    return sigmoid(scale * cos(acts, direction) + bias)    # w in [0,1]
 ```
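The rank-1 claim can be checked numerically. The block below is a minimal standard-library sketch, not the repo's router: `cosine`, `sigmoid`, `softmax2`, and this plain-list `router` are hypothetical helpers written for illustration, and the default `scale=1.0`, `bias=0.0` are assumptions. It shows that a 2-way softmax equals the sigmoid of the logit difference, and that a cosine logit keeps `w` away from saturation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (small epsilon avoids 0/0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax2(z_quar, z_dep):
    """Probability of the quarantine expert under a 2-logit softmax."""
    m = max(z_quar, z_dep)  # subtract the max for numerical stability
    e_quar, e_dep = math.exp(z_quar - m), math.exp(z_dep - m)
    return e_quar / (e_quar + e_dep)


def router(acts, direction, scale=1.0, bias=0.0):
    """Rank-1 cosine gate: w in (0, 1), bounded because |cos| <= 1."""
    return sigmoid(scale * cosine(acts, direction) + bias)


# Softmax over two logits is the sigmoid of their difference.
assert abs(softmax2(1.3, -0.4) - sigmoid(1.3 - (-0.4))) < 1e-12

# Activations aligned with the seeded direction get a higher w than orthogonal
# or opposed ones, and the norm of `acts` does not matter.
v = [1.0, 0.0]
print(router([5.0, 0.0], v), router([0.0, 3.0], v), router([-2.0, 0.0], v))
```

With `scale=1` and `bias=0` the gate is confined to roughly `[0.27, 0.73]`, which is the "does not start saturated" property; a trainable `scale` is what lets it sharpen later.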
-Success means hack rate drops when hack-associated experts are ablated, while solve rate / normal capability mostly survives.
+Training routes by `w` and re-anchors the router every step:
-## Relation to Gradient Routing and SGTM
+```python
-
+for step in range(steps):
-Evil MoE is not an absorption booster proposal. It is a separate localization experiment. Gradient Routing and SGTM still matter because they give useful concepts and failure modes: localized parameters, ablation, leakage, and the distinction between localizing learning and localizing computation.
+    students = generate(prompts)                 # on-policy rollouts
-
+    R        = env_reward(students)              # GRPO reward (the RL signal, not the oracle)
-Gradient Routing's absorption condition is stricter than the Evil MoE hypothesis:
+    acts     = pooled_acts(students)             # no-grad capture for the router
-
+    w        = router(acts)                      # per rollout
-## *Gradient Routing: Masking Gradients to Localize Computation in Neural Networks* - [paper_gradient_routing.md](../papers/grad_routing/paper_gradient_routing.md)
+    set_lora2r_w(w)
- epistemic context: local paper note, author's mechanism claim for absorption.
+    grpo_loss(students, R).backward()            # gradient routed by w through the forward
-
+    (lambda_pin * router.pin(hack_pairs, clean_pairs)).backward()   # SGTM anchor, EVERY step
-> Gradient routing induces absorption. Routing a subset of the data related to some knowledge or capability appears to localize that knowledge or capability more generally. This held for an i.i.d. subset of the data (TinyStories unlearning in section 4.2.2), and for semantically limited data (steering scalar in section 4.2.1, virology unlearning in section 4.2.3, scalable oversight in section 4.3). Notably, this effect did not hold for DEMix, a modularity method in which localized modules are sequestered so that only one (per layer) participates in each forward pass. **To explain these observations, we posit absorption: (i) routing limited data to a region creates units of computation or features that are relevant to a broader task; (ii) these units then participate in the model’s predictions on related, non-routed data, reducing prediction errors on these data, so that (iii) the features are not learned elsewhere.** Absorption may also amplify the features causing it.
+    opt.step()                                   # base frozen; A, B, router train
 And the same paper says hard forward expert separation breaks that condition:
 ## *Gradient Routing: Masking Gradients to Localize Computation in Neural Networks* - [paper_gradient_routing.md](../papers/grad_routing/paper_gradient_routing.md)
 - epistemic context: local paper note, appendix comparison with DEMix.
 > Gradient routing decouples the localization of learning from the localization of computation. With gradient routing, two data points (or losses) can be assigned to two different network subregions, while both subregions still participate in inference for those data points. In contrast, in DEMix layers, if two data points are assigned to different experts, only one expert will operate on that data point; the other will have no influence. **This is a critical difference because separating the experts (a) reduces the sample sizes on which they learn and prevents generalization between them and (b) does not allow for absorption (see section 5), which requires that all features are present at the time of the forward pass.**
 So if the goal is *SGTM/Gradient-Routing absorption*, hard MoE dispatch is suspect. Evil MoE has a different goal: learned localization of reward-hack behavior in an ablatable module. For that goal, hard or sparse MoE becomes plausible again.
 ## SGTM as motivation, not the same claim
 SGTM gives a seed-and-self-reinforce story:
 ## *Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs* - [paper_sgtm.md](../papers/grad_routing/paper_sgtm.md)
 - epistemic context: local paper note, gradient-norm analysis.
 > To understand the mechanism underlying SGTM’s robustness to label noise, we hypothesize that the model develops self-reinforcing knowledge localization. Once the model begins localizing forget knowledge based on labeled examples (where we explicitly mask gradients), we expect that unlabeled forget samples (D_forget ∩ D_unlabeled) would naturally gravitate toward using forget parameters, thereby sending stronger gradient signals to those parameters even without explicit masking. To test this hypothesis, we analyze gradient norms from a SGTM model trained on the bilingual TinyStories dataset under perfect labeling conditions. **The top row demonstrates clear specialization: forget data primarily updates forget parameters (left), while retain data primarily updates retain parameters (right). The bottom-left panel shows that forget weights receive substantially stronger updates from unlabeled forget data compared to unlabeled retain data, confirming the self-reinforcing localization hypothesis.**
 The Evil MoE version has an analogous hypothesized shape:
 ```text
 synthetic hack pairs supervise the hack-associated expert
 hack-associated expert becomes useful for hack-like computations
 router sends similar inputs / tokens / rollouts to that expert
 hack-associated expert receives more gradient on hack-like behavior
 hack behavior becomes ablatable
 ```
-But unlike SGTM, Evil MoE may use a learned forward gate, not only a backward gradient mask. That makes it a different experiment, with a different success criterion.
+Deployment ablation resets the quarantine to its init and evaluates the held-out test set with
 the hack expert on and off, reporting hack and solve for each.
-## MoE literature connection
+## Why these choices
-### Ordinary MoE routing is usually not semantically labelled
+### Why soft routing, not top-k
-Mainstream MoE usually does not label experts as "math", "code", or "reward-hacking". A router maps token states to expert scores; top-k experts run; the language-model loss trains the selected experts and selected router weights. Aux losses or assignment rules stop collapse.
+## *Gradient Routing* - [paper_gradient_routing.md](../papers/grad_routing/paper_gradient_routing.md)
 - epistemic context: local paper note, author's absorption mechanism and the DEMix comparison.
-This matters because the proposed method does not run a detector over student rollouts during training. The only semantic supervision is the initial synthetic pair supervision.
+> To explain these observations, we posit absorption: (i) routing limited data to a region
 > creates units of computation or features that are relevant to a broader task; (ii) these units
 > then participate in the model's predictions on related, non-routed data, reducing prediction
 > errors on these data, so that (iii) the features are not learned elsewhere. [...] separating the
 > experts (a) reduces the sample sizes on which they learn and prevents generalization between
 > them and (b) does not allow for absorption, which requires that all features are present at the
 > time of the forward pass.
-### Expert specialization and shared experts
+Step (ii) is the condition: absorption only suppresses relearning elsewhere if the expert is
 present in the forward pass on the related data. Hard expert selection removes the non-selected
 expert from the forward (DEMix), leaving that path out of the graph. But this only bites if we
 rely on absorption to catch hacks the router misses. If the router generalizes and sends every
 hack to the quarantine, each hack is present at that expert by construction and hard routing is
 clean. So hard versus soft is a tradeoff, not a verdict: hard routing closes the leak (the
 deployed expert never sees hack gradient) but needs a router that catches all hacks; soft routing
 keeps absorption available to catch the router's misses but leaks a `(1-w)` share into the
 deployed expert.
-## *DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models* - [ACL Anthology](https://aclanthology.org/2024.acl-long.70/)
+We choose soft for two reasons. First, production MoE (Switch top-1, Mixtral top-2, DeepSeek
- epistemic context: paper abstract; supports specialization/shared-expert architecture, not absorption directly.
+top-k) is hard-routed in both training and inference and works anyway, but it serves a goal we do
 not share: capability per FLOP. It never deletes an expert, so a behaviour smeared across several
 experts is harmless. Our goal is deletion, which needs clean ownership. Second, and decisively,
 the evidence below says absorption will not volunteer the behavioural clustering for free: routing
 clusters by topic. So we cannot lean on the absorb middle and instead force localization with the
 seed and pin. Keeping both experts present preserves the option of absorption and, more usefully,
 lets us apply SGTM's exact recipe of present-in-forward plus a hard backward mask, via a hard
 detach whenever `w` exceeds a threshold. We skip load balancing throughout, since it suppresses
 the specialization we want. The one production idea we keep is DeepSeek's always-on shared expert,
 which maps to our always-on deployed block.
-> In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: **(1) finely segmenting the experts into mN ones and activating mK from them, allowing for a more flexible combination of activated experts; (2) isolating Ks experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts.**
+### Why pin every step
-Transfer: use one shared always-on LoRA path for common problem-solving, plus small routed experts for behavior-specific residuals. This is architectural separation, not proof of hack absorption.
+## *SGTM (Beyond Data Filtering)* - [paper_sgtm.md](../papers/grad_routing/paper_sgtm.md)
 - epistemic context: local paper note, the self-reinforcing-localization result.
-### Expert Choice / BASE: assignment can replace aux loss
+> Once the model begins localizing forget knowledge based on labeled examples (where we
 > explicitly mask gradients), we expect that unlabeled forget samples would naturally gravitate
 > toward using forget parameters [...] confirming the self-reinforcing localization hypothesis.
-## *Mixture-of-Experts with Expert Choice Routing* - [arXiv:2202.09368](https://arxiv.org/abs/2202.09368)
+The router is a learnable parameter, so reward can drift it off the hack axis. SGTM's hard mask
- epistemic context: paper abstract/introduction; supports balanced assignment and variable experts per token.
+never stops firing; neither does the pin. The pin trains only the router (it reads frozen no-grad
 activation snapshots), so it never teaches the deployed expert the hack.
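One plausible shape for the pin, sketched with the standard library only. This is an assumption, not the implemented `router.pin`: it takes the cross-entropy pin from the earlier hard-routed draft and specializes it to the 2-expert sigmoid gate, where the hack-expert probability is `w` and the clean-expert probability is `1 - w`. The function name `pin_loss` and the `eps` clamp are hypothetical.

```python
import math


def pin_loss(w_hack, w_clean, eps=1e-7):
    """Binary cross-entropy anchor: push w -> 1 on hack pairs, w -> 0 on clean pairs.

    w_hack / w_clean are router outputs in [0, 1] on the hand-authored pairs.
    """
    def clamp(w):
        return min(max(w, eps), 1.0 - eps)  # keep log() finite

    hack_term = -sum(math.log(clamp(w)) for w in w_hack) / len(w_hack)
    clean_term = -sum(math.log(1.0 - clamp(w)) for w in w_clean) / len(w_clean)
    return hack_term + clean_term


anchored = pin_loss([0.9, 0.8], [0.1, 0.2])   # router still on the hack axis
drifted = pin_loss([0.4, 0.3], [0.6, 0.7])    # reward has pulled it off-axis
print(anchored, drifted)                       # the drifted router pays a larger penalty
```

Because the inputs are router outputs on frozen activation snapshots, a loss of this form has gradient only with respect to the router parameters, which matches the claim that the pin never trains the deployed expert.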
-> We propose a very simple yet effective routing method we are calling expert choice. Unlike conventional MoE where tokens select one or two top-scoring experts, our method lets each expert pick the top-k tokens. **Our method guarantees perfect load balancing, allows a variable number of experts for each token, and achieves substantial gains in training efficiency and downstream performance as demonstrated in our experiments.**
+### The one conflict and the open question
-## *BASE Layers: Simplifying Training of Large, Sparse Models* - [arXiv:2103.16716](https://arxiv.org/abs/2103.16716)
+A normal learned MoE router and the task loss already cooperate: the router is trained by the loss
- epistemic context: paper abstract; supports balanced expert assignment.
+to send each input to the expert that lowers it. Relative to that, the only thing Evil MoE adds is
 the pin, so there is exactly one new conflict. The pin forces localization only on the labeled
 hand-authored pairs, while reward places the unlabeled live hacks and prefers the always-on
 deployed expert, which ablation cannot remove. The method works only if live hacks follow the
 pinned labeled ones into the quarantine faster than reward relearns them in the deployed block.
 That is SGTM's self-reinforcement bet restated, and the causal ablation is what tests it.
-> **In contrast, we formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens.** This optimal assignment scheme improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyper-parameters or auxiliary losses.
+Residual leak: the deployed expert is only soft-detached by `(1-w)`, not hard-masked, so on a live
 hack it still receives a `(1-w)` share of the hack gradient. A hard detach above a `w` threshold
 would close it at no cost (the router's reward gradient flows only through the `w*quar` term),
 recovering SGTM's exact recipe of present-in-forward, zero-gradient.
-Transfer: if the hack expert dies or one expert eats all traffic, add expert-choice / assignment *inside the expert bank*. Do not globally balance hack-vs-clean if the intended asymmetry is that hack-like examples should overuse the hack expert.
+## Evidence
-### Switch / ST-MoE: aux balancing and router stability
+Two literature searches (chat exports under `docs/brainstorm/`) bear on whether a behaviour can be
 localized into one ablatable expert. The pattern: forced localization has working precedents;
 emergent localization (hoping a behaviour clusters by itself) is what the evidence says fails.
-## *Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity* - [arXiv:2101.03961](https://arxiv.org/abs/2101.03961)
+Supporting the forced route:
 - epistemic context: mechanism section; supports load balancing.
-> A Differentiable Load Balancing Loss. To encourage a balanced load across experts we add an auxiliary loss. **For each Switch layer, this auxiliary loss is added to the total model loss during training.**
+- A "bad" behaviour collapses onto a shared low-dimensional direction. Emergent misalignment:
  narrow bad finetuning produces broadly misaligned models (Betley et al., Nature 2025,
  [arXiv:2502.17424](https://arxiv.org/abs/2502.17424)); persona vectors capture
  evil/sycophancy/hallucination (Anthropic, [OpenReview 20DsUSauCj](https://openreview.net/forum?id=20DsUSauCj));
  refusal is mediated by a single direction across 13 models (Arditi et al.,
  [arXiv:2406.11717](https://arxiv.org/abs/2406.11717)). This makes `v_act`, and the broad-evil-seed
  variant, plausible.
 - Seeding or steering which expert owns a behaviour is done. SteerMoE detects behaviour-experts via
  contrastive paired inputs, the same construction as our pairs
  ([arXiv:2509.09660](https://arxiv.org/abs/2509.09660)); geometric routing makes rank-1 experts
  monosemantic by construction with cosine routing, which is our exact router
  ([arXiv:2604.14434](https://arxiv.org/abs/2604.14434)); cluster-aware upcycling seeds each expert
  from an SVD subspace and inits the router to cluster centroids
  ([arXiv:2604.13508](https://arxiv.org/abs/2604.13508)).
 - Deleting one expert plus light repair recovers quality. NAEE: a 6.2-point task-specific drop
  recovers to 1.6 with fine-tuning ([arXiv:2402.14800](https://arxiv.org/abs/2402.14800)); pruning to
  a single expert is feasible ([arXiv:2206.00277](https://arxiv.org/abs/2206.00277)); MoE-Pruner heals
  to 99% at 50% sparsity via expert-wise distillation ([arXiv:2410.12013](https://arxiv.org/abs/2410.12013)).
  Caveat: the repair redistributes capability, which we must not do to the hack, so our no-repair
  ablation is the harder version.
 - Behavioural expert removal plus a router anchor are precedented. MoTE: disabling refusal-relevant
  experts cut refusal 52% with matched-beats-random ablation, our UAT
  ([arXiv:2502.11096](https://arxiv.org/abs/2502.11096)); SEUF concentrates unlearning on one expert
  with a router anchor loss, our pin, and warns naive unlearning disrupts routing
  ([arXiv:2411.18797](https://arxiv.org/abs/2411.18797)).
-## *Designing Effective Sparse Expert Models* - [arXiv:2202.08906](https://arxiv.org/abs/2202.08906)
+Headwinds (why absorption will not do it for free):
 - epistemic context: contribution list; supports router z-loss as a stability trick.
-> 1. A large-scale study of the quality-stability trade-offs of stability techniques. **2. An introduction of the router z-loss that resolves instability issues, while slightly improving model quality.**
+- Specialization is not automatic. A domain-invariant "standing committee" carries most routing
  mass, so specialization is "far less pervasive than believed"
  ([arXiv:2601.03425](https://arxiv.org/abs/2601.03425)); DeepSeekMoE itself concedes vanilla
  8-16-expert MoE fails to specialize from knowledge hybridity and redundancy
  ([arXiv:2401.06066](https://arxiv.org/abs/2401.06066)).
 - Routing clusters by topic, not behaviour, so absorption would cluster by topic, not hackiness, and
  there is usually no natural "bad expert" ([arXiv:2605.29708](https://arxiv.org/abs/2605.29708);
  GateBreaker [arXiv:2512.21008](https://arxiv.org/abs/2512.21008)). Both are recent, unreplicated
  preprints, so weight them lightly.
 - In our favour: load balancing actively suppresses specialization (NeurIPS 2025 oral,
  [arXiv:2505.22323](https://arxiv.org/abs/2505.22323)), and we omit it.
-Transfer: useful as training scaffolding if the router collapses or logits saturate. Not the main mechanism.
+Net: each component of forced localization (extract a behaviour direction, seed and pin one expert,
 ablate it) has a precedent that works, while the emergent route is exactly what the standing-committee
 and topic-routing results say will not happen. So the seed and pin are load-bearing, not redundant,
 and the decisive question stays empirical.
-### LoRA is the implementation substrate
+## Extensions (after the 2-expert run)
-## Arrow LoRA merge note - [local PEFT ref](../vendor/lora-lite/docs/refs/peft_lora_variants.py)
+- Confident-tail pinning. Score live rollouts by `v_act`, hard-route the top/bottom quantile (a
- epistemic context: checked-in vendor/reference notes for an MoE-style LoRA variant.
+  run-spanning buffer or EMA threshold) to quarantine/deployed, and leave the middle as the absorb
-
+  zone where both train. This is SGTM's confident-pin design and is exactly what vGROUT routeA
-> The adapter_name is "arrow_router" by default, set in create_arrow_model() in ./arrow.py. Since Arrow is a Mixture-of-Experts (MoE) approach, merging adapters is not meaningful or even possible: for each token, **the top-k LoRA experts are dynamically selected and routed.** Because of this per-token routing, there is no single set of weights that can represent a merged adapter.
+  already does; it needs no learned router.
-
+- 3+ experts. One always-on deployed plus several ablatable, to watch how absorption distributes a
-Transfer: LoRA-MoE is a practical way to test the idea cheaply. For this repo, the existing `src/vgrout/antipasto.py` already has additive kept/quarantine LoRA-ish paths (`_lora_A`, `_lora_A_hack`, frozen `B`), so the natural extension is multiple `A_hack[k]` plus a router. If LoRA capacity is too small, that is an implementation failure, not a disproof of the Evil Expert hypothesis.
+  hack and to match production multi-expert usage. Needs per-mode seed directions or accepts free
-
+  assignment, and adds an interpretation confound, so it is a follow-up, not the decisive run.
-## Proposed mechanism
+- Learned-router sharpening. Let reward improve `v_act`'s boundary over training. This is the sole
-
+  reason to pay for a learned router over a fixed `v_act` gate; if the fixed gate already
-### Version A: hard sparse forward MoE, simplest Evil MoE expert
+  localizes, the learned router and its pin are unnecessary.
 Use if the goal is ablatable behavioral modularity, not strict absorption.
 ```py
 # θ frozen base model
 # φ_s shared LoRA, always active
 # φ_e[k] expert LoRAs, k ∈ {clean_0, ..., hack_0, ...}
 # ρ router, maps token/rollout features to expert logits
 # ── Forward ────────────────────────
 def layer(x, θ, φ_s, φ_e, ρ):          # x ∈ ℝ^{b×s×d}
    y_base = θ.W @ x
    y_shared = φ_s(x)
    z = ρ(x)                          # z ∈ ℝ^{b×s×K}
    π = softmax(z / τ)
    S = top_k(π, k=1 or 2)             # sparse dispatch
    y_exp = sum(π[k] * φ_e[k](x) for k in S)
    return y_base + y_shared + y_exp
 ```
 Training:
 ```py
 for grpo_batch in grpo_rollouts:
    y = model(grpo_batch)
    ℒ_grpo = grpo_loss(y)
    # separate synthetic pin batch, not labels attached to live GRPO rollouts
    π_hack = router(synthetic_hack_pairs)
    π_clean = router(synthetic_clean_pairs)
    ℒ_pin = -log π_hack[hack_expert] - log π_clean[clean_expert]
    ℒ_sparse = λ_H * entropy(π)        # encourage sparse expert use
    ℒ_bal = λ_bal * balance(π)         # optional, weak, inside expert bank
    ℒ_z = λ_z * mean(logsumexp(z)^2)   # optional router stability
    ℒ = ℒ_grpo + ℒ_pin + ℒ_sparse + ℒ_bal + ℒ_z
    θ frozen; update φ_s, φ_e, ρ
 ```
 Ablation:
 ```py
 hack_rate_on  = eval(model, experts=all)
 hack_rate_off = eval(model, experts=all_except_hack)
 solve_drop    = solve_on - solve_off
 ```
 This is the closest literal Evil MoE setup.
 ### Version B: soft/additive Evil MoE expert
 Use if we want a version that keeps more experts present in the forward graph. This is closer to the Gradient Routing absorption condition, but the experiment is still Evil MoE, not an absorption test.
 ```py
 def layer(x, θ, φ_s, φ_e, ρ):
    y_base = θ.W @ x
    y_shared = φ_s(x)
    z = ρ(x)
    π = entmax(z / τ)                  # sparse but can keep multiple nonzero paths
    # all experts are in-graph; no DEMix-style hard absence
    y_exp = sum(π[k] * φ_e[k](x) for k in range(K))
    return y_base + y_shared + y_exp
 ```
 Training is the same, but use a higher initial temperature / less sparse gate, then anneal. This is less compute-efficient but more compatible with absorption, because hack experts can remain present for related non-pinned examples.
 ### Version C: backward-routed evil expert, closest to current vGROUT
 Use if we want minimal changes to the current AntiPaSTO/LoRA routeV setup.
 ```py
 # Existing LoRA-frozen-B path:
 # y = y_base + B @ (A_shared @ x + A_hack @ x)
 # Extend A_hack to K hack experts: A_hack[k]
 for rollout in batch:
    g = per_rollout_grad(A_shared)          # current grad_probe-style estimate
    s[k] = cos(g, v_hack[k])                # or router score trained only from synthetic pairs/vectors
    k_star = argmax(s)
    if s[k_star] > τ:
        A_hack[k_star].grad += project_hack(g, v_hack[k_star])
        A_shared.grad      -= project_hack(g, v_hack[k_star])
 ```
 Forward can stay additive:
 ```py
 y = y_base + B @ (A_shared @ x + sum_k A_hack[k] @ x)
 ```
 This is least MoE-like in forward compute, but probably most consistent with the Gradient Routing absorption story.
 ## Recommended first experiment
 Implementation choice: start with Version A only. It gives the clearest falsifiable Evil MoE result. Keep Version B and Version C as follow-ups, not part of the first implementation.
 1. Base + frozen-B LoRA experts.
 2. Experts: `shared`, `clean`, `hack`.
 3. Router input: last hidden state or per-token hidden state at target layers.
 4. Supervise the router only on hand-authored synthetic hack-vs-clean pairs.
 5. GRPO train on normal rollouts without live hack labels.
 6. Eval with hack expert on/off using the ORACLE only for the final deploy eval.
 Training/eval boundary:
 ```text
 Allowed: hand-built synthetic contrastive pairs -> supervise router / seed hack expert.
 Allowed: extracted hack direction vec from synthetic pairs -> initialize or regularize hack expert.
 Allowed: vec -> routing, where live GRPO gradients are routed by cosine alignment to vec.
 Forbidden: ORACLE or detector labels on student rollouts at TRAIN time.
 Forbidden: using the final eval grader to gate routing, set thresholds, or label student rollouts.
 ```
 Generalization is tested by whether a vec built from synthetic pairs covering some hack modes suppresses held-out modes absent from those pairs. That is vector generalization, not detector-label generalization.
 ## UAT
-A run supports the Evil MoE idea if the report table shows:
+A run supports the Evil MoE hypothesis if the report table shows:
 | check | expected if working | wrong-case |
 |---|---|---|
-| hack ablation | hack rate lower with hack expert off | no hack drop, or hack drop only from total capability collapse |
+| hack ablation | hack rate lower with the hack expert off | no hack drop, or a drop only from total capability collapse |
-| matched ablation | hack-expert-off reduces hacks more specifically than random/clean-expert-off at matched capacity | any expert ablation gives the same effect |
+| matched ablation | hack-expert-off reduces hacks more than random/clean-expert-off at matched capacity | any expert ablation gives the same effect |
-| capability retention | solve rate / reward mostly preserved with hack expert off | ablation destroys normal LeetCode ability |
+| capability retention | solve rate mostly preserved with the hack expert off | ablation destroys normal LeetCode ability |
-| routing selectivity | synthetic hack pairs route more to hack expert than clean pairs | router learns style/length/domain artifacts |
+| routing selectivity | synthetic hack pairs route higher `w` than clean pairs | router keys on style/length/domain artifacts |
-| held-out hack modes | held-out hack modes also route to / depend on hack expert | only pinned hack template is isolated |
+| held-out hack modes | held-out modes also depend on the hack expert | only the pinned hack template is isolated |
-| train/eval boundary audit | no ORACLE or detector labels touch TRAIN-time routing | live student-rollout labels leak into router |
+| boundary audit | no oracle or detector label touches train-time routing | live-rollout labels leak into the router |
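The first three rows of the table reduce to a comparison of rate drops. The sketch below is one way to encode that check, under stated assumptions: `EvalRates` and `ablation_supports_hypothesis` are hypothetical names, drops are compared as absolute differences in rate (the spec does not say absolute versus relative), and the numbers in the usage lines are made up for illustration, not results.

```python
from dataclasses import dataclass


@dataclass
class EvalRates:
    hack: float   # held-out hack rate, scored by the oracle at final eval only
    solve: float  # held-out solve rate


def ablation_supports_hypothesis(on, hack_off, control_off):
    """Causal criterion: ablating the hack expert must cut hacks more than it cuts
    solves, and more than ablating a random/clean expert at matched capacity."""
    hack_drop = on.hack - hack_off.hack
    solve_drop = on.solve - hack_off.solve
    control_hack_drop = on.hack - control_off.hack
    return hack_drop > 0 and hack_drop > solve_drop and hack_drop > control_hack_drop


# Illustrative numbers only.
on = EvalRates(hack=0.40, solve=0.60)
print(ablation_supports_hypothesis(on, EvalRates(0.10, 0.55), EvalRates(0.35, 0.50)))  # specific drop
print(ablation_supports_hypothesis(on, EvalRates(0.00, 0.05), EvalRates(0.35, 0.50)))  # capability collapse
```

The second call is the wrong-case in the hack-ablation row: hacks vanish only because solving does too, so the check rejects it.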
-Minimum evidence file should include:
+Minimum evidence file: config/command; router `w` table for synthetic clean, synthetic hack, live
-
+GRPO, and held-out hack eval; hack-rate and solve-rate table with hack expert on/off; first
- config / command
+train/eval batch prompts+completions so formatting artifacts are visible.
 - router usage table for synthetic clean, synthetic hack, live GRPO, held-out hack eval
 - hack-rate and solve-rate table with hack expert on/off
 - examples of prompts/completions for first train/eval batch, so formatting artifacts are visible
 ## Main failure modes
-1. The hack expert becomes a general coding expert, so ablating it reduces hacks by making the model worse.
+1. The hack expert becomes a general coding expert, so ablating it cuts hacks by making the model
-2. The router learns superficial artifacts in the synthetic pairs: style, length, refusal wording, problem family.
+   worse (caught by capability retention + matched ablation).
-3. GRPO reward pressure relearns hack behavior in clean/shared experts because hacks are useful.
+2. The router keys on superficial pair artifacts: style, length, problem family (caught by routing
-4. Hard forward routing blocks absorption-like generalization to related unpinned examples.
+   selectivity + held-out modes).
-5. Load balancing fights the desired asymmetry by forcing hack-like traffic away from the hack expert.
+3. Reward relearns the hack in the deployed expert because it is the always-on path (the residual
   leak; the hard detach is the mitigation).
 4. The hack mode is absent from the pairs, so the quarantine has no seed for it and absorption does
   not catch it (the generalization limit; tested by held-out modes).
 ## Decision
-The Evil MoE idea is worth testing, with the claim stated at the level the evidence supports:
+Worth testing, with the claim at the level the evidence supports. It is a separate experiment from
-
+SGTM absorption: an ablatable-modularity hypothesis that weak synthetic router supervision plus
- It is a separate experiment from SGTM absorption.
+soft MoE specialization can put reward-hack behaviour in a removable expert. The proof is not
- It is an ablatable-modularity hypothesis: weak synthetic router supervision plus MoE specialization might put reward-hack behavior in a removable expert. LoRA is the first implementation substrate.
+lower training loss; it is causal: ablate the hack expert and the held-out hack rate drops while solve
- The primary proof is not lower training loss. The proof is causal: turn off the evil expert and held-out hack rate drops while normal solve behavior remains.
+behaviour survives.
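The causal test above can be sketched as a small check over evaluation rollouts. This is a minimal illustration, not the repository's evaluation code: the `Rollout` record, the function names, and the two thresholds (`min_hack_drop`, `max_solve_drop`) are all hypothetical, and the spec sets no numeric pass criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """One held-out eval completion (hypothetical record layout)."""
    hack_expert_on: bool  # False means the hack expert was ablated
    hacked: bool          # completion exploited the reward
    solved: bool          # completion passed the real tests


def ablation_table(rollouts):
    """Return {hack_expert_on: (hack_rate, solve_rate)} for both conditions."""
    table = {}
    for on in (True, False):
        group = [r for r in rollouts if r.hack_expert_on == on]
        if not group:
            raise ValueError(f"no rollouts with hack_expert_on={on}")
        n = len(group)
        hack_rate = sum(r.hacked for r in group) / n
        solve_rate = sum(r.solved for r in group) / n
        table[on] = (hack_rate, solve_rate)
    return table


def supports_claim(table, min_hack_drop=0.5, max_solve_drop=0.1):
    """Ablation must cut hacks (relative drop) without costing solves (absolute drop).

    Both thresholds are assumed values chosen for illustration only.
    """
    hack_on, solve_on = table[True]
    hack_off, solve_off = table[False]
    hack_dropped = hack_on > 0 and (hack_on - hack_off) / hack_on >= min_hack_drop
    solve_kept = (solve_on - solve_off) <= max_solve_drop
    return hack_dropped and solve_kept


# Toy data: 4 rollouts with the expert on, 4 with it ablated.
rollouts = (
    [Rollout(True, True, True), Rollout(True, True, False),
     Rollout(True, True, False), Rollout(True, False, True)]
    + [Rollout(False, True, False), Rollout(False, False, True),
       Rollout(False, False, True), Rollout(False, False, False)]
)
table = ablation_table(rollouts)
```

Checking both rates together is the point: a hack-rate drop that comes with a matching solve-rate drop is failure mode 1 (the hack expert became a general coding expert), not evidence of isolation.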
 ## Links
-Local:
+- [Gradient Routing local note](../papers/grad_routing/paper_gradient_routing.md), [arXiv:2410.04332](https://arxiv.org/abs/2410.04332)
-
+- [SGTM local note](../papers/grad_routing/paper_sgtm.md)
- [MoE absorption search note](20260614_moe_absorption_results.md)
+- [MoE in Transformers (HF blog)](../papers/hf_blog_moe_transformers.md)
- [Fresh-eyes review of that note](20260614_moe_absorption_review.md)
+- [DeepSeekMoE](https://aclanthology.org/2024.acl-long.70/) (shared-expert architecture)
- [Gradient Routing local paper note](../papers/grad_routing/paper_gradient_routing.md)
+- [Switch Transformers](https://arxiv.org/abs/2101.03961), [ST-MoE](https://arxiv.org/abs/2202.08906) (load balancing, router z-loss, rejected here)
- [SGTM local paper note](../papers/grad_routing/paper_sgtm.md)
+- adapter and router: [src/vgrout/lora2r.py](../../src/vgrout/lora2r.py), [src/vgrout/moe_router.py](../../src/vgrout/moe_router.py), loop in [src/vgrout/train_moe.py](../../src/vgrout/train_moe.py)
 - [Routing v2 distinct-basis spec](20260531_routing_v2_distinct_basis.md)
 - [Current AntiPaSTO/LoRA hook implementation](../../src/vgrout/antipasto.py)
 - [Local MoE hits](20260614_local_search_moe_hits.md)
 - [arXiv MoE hits](20260614_arxiv_moe_hits.md)
 - [GitHub MoE hits](20260614_gh_moe_hits.md)
 - [Semantic MoE hits](20260614_semantic_moe_hits.md)
 External:
 - [Gradient Routing arXiv:2410.04332](https://arxiv.org/abs/2410.04332)
 - [DeepSeekMoE ACL Anthology](https://aclanthology.org/2024.acl-long.70/)
 - [Expert Choice Routing arXiv:2202.09368](https://arxiv.org/abs/2202.09368)
 - [BASE Layers arXiv:2103.16716](https://arxiv.org/abs/2103.16716)
 - [Switch Transformers arXiv:2101.03961](https://arxiv.org/abs/2101.03961)
 - [ST-MoE arXiv:2202.08906](https://arxiv.org/abs/2202.08906)
 - [Hugging Face Switch Transformer implementation](https://github.com/huggingface/transformers/blob/main/src/transformers/models/switch_transformers/modeling_switch_transformers.py)