# Evil MoE

Evil MoE trains a mixture of experts in which one expert carries reward-hacking behaviour and is removed at deployment. Routing is done by a learned soft router, seeded by an extracted hack direction and held by a pin loss, rather than by a per-example label.

It is a fork of [vGROUT](https://github.com/wassname/vGROUT) and reuses its substrate (the reward-hacking LeetCode environment, GRPO loop, reward grader, deployment-ablation evaluator); only the routing mechanism changes. Background, literature map, and design rationale are in [docs/spec/](docs/spec/) and [AGENTS.md](AGENTS.md).

## Hypothesis

> A learned MoE-style router, seeded by a synthetic activation-space hack direction and anchored
> by a continuous pin loss on hand-authored contrastive pairs, can localize reward-hacking
> behaviour in a single ablatable expert. Test: ablate the hack expert at deployment and check
> that it suppresses more hacking than solving.

A localization claim, not a strict gradient-routing absorption claim.

## Routing

The adapter is one rank-`2r` LoRA per target Linear (`src/vgrout/lora2r.py`), split into a deployed block `[:r]` (always trained, kept) and a quarantine block `[r:]` (the hack expert, reset to init at deployment). The substrate is Ariahw and Nanda's reward-hacking LeetCode environment, inherited from vGROUT.

Per rollout the router (`src/vgrout/moe_router.py`) reads the pooled deployed-block bottleneck activations and emits one weight `w` in `[0,1]`. The forward hook scales the hack expert by `w` and the deployed expert's gradient by `1-w` (forward value kept). So `w=0` trains only the deployed block (and reproduces the deployment forward), `w=1` only the hack expert, intermediate `w` both.

Two signals train the router. Reward (the GRPO loss) flows in through `w`. A pin loss on the hand-authored pairs, applied every step, pushes `w` toward 1 on hack and 0 on clean.

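
The gating rule above can be worked through by hand for scalar expert outputs. The sketch below is framework-free and illustrative only: it assumes the combined output is `dep + w * quar` and that the `1-w` gradient scale is treated as a constant. The names `soft_gate` and `GateGrads` are hypothetical and are not taken from `src/vgrout/lora2r.py`.

```python
from dataclasses import dataclass


@dataclass
class GateGrads:
    output: float  # forward value: dep + w * quar
    d_dep: float   # gradient reaching the deployed block's output
    d_quar: float  # gradient reaching the hack expert's output
    d_w: float     # gradient reaching the router weight w


def soft_gate(dep: float, quar: float, w: float, upstream: float = 1.0) -> GateGrads:
    """Forward value and gradient shares for one soft-gated rollout.

    Assumption: the deployed output keeps its forward value and only its
    gradient is scaled by (1 - w); the hack expert is scaled by w in both
    the forward and the backward pass.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return GateGrads(
        output=dep + w * quar,
        d_dep=(1.0 - w) * upstream,  # deployed block trains with share 1 - w
        d_quar=w * upstream,         # hack expert trains with share w
        d_w=quar * upstream,         # router sees gradient only via w * quar
    )


if __name__ == "__main__":
    for w in (0.0, 0.25, 1.0):
        print(w, soft_gate(dep=2.0, quar=3.0, w=w))
```

With `w=0` the hack expert contributes nothing to the output and receives no gradient, which is why that setting matches the deployment forward; with `w=1` the deployed block's gradient share is zero.
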
The router direction is initialized to `v_act`, the unit hack-minus-clean activation difference from those pairs, so it starts as a fixed vector gate and then specializes. The only labels are the hack and clean sides of those off-distribution pairs; no training-rollout label or environment oracle enters the routing.

This replaces the label-driven hard mask of gradient routing (Cloud et al.) and its self-reinforcing variant SGTM (Shilov et al.) with a learned soft router; the linear gate and the GRPO-on-MoE setup follow MixLoRA.

## Router dynamics

A normal MoE router and the task loss already cooperate by construction: the router gets gradient from the loss and learns to send each input to the expert that lowers it (an auxiliary load-balancing loss stops it collapsing onto one). Evil MoE inherits that and adds one thing, the pin, which forces the labeled hand-authored hacks to the quarantine (ablatable) expert and the clean pairs to the deployed (kept) expert.

That introduces one conflict. The pin only constrains routing on the labeled pairs, while reward decides where the unlabeled live hacks go, and reward prefers the deployed expert because it is always on and fully on-policy, so it is the cheaper place to express a hack. Ablation cannot remove what lands there. The method works only if live hacks follow the pinned labeled ones into the quarantine (SGTM's self-reinforcement bet) faster than reward relearns them in the deployed block. That is what the causal ablation measures.

Three things bias it toward localizing: the router is seeded with `v_act` so it starts aligned; the pin fires every step so reward cannot rotate the axis off "hack"; and on a routed hack the deployed expert is detached. The detach is the one open leak: it is soft (`1-w`), not hard, so the deployed expert still gets a `1-w` share of the hack gradient. A hard detach above a `w` threshold would close it at no cost, since the router's reward gradient flows only through the `w*quar` term.

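
The three biases listed above (seeding, pin, detach) can be sketched in a few framework-free functions. This is a minimal illustration under stated assumptions, not the code in `src/vgrout/moe_router.py`: the seed is taken as the normalized difference of mean activations, the gate squashes a linear logit with a sigmoid, the pin is written as binary cross-entropy, and the `0.5` threshold in the usage lines is arbitrary. All function names are hypothetical.

```python
import math
from typing import Optional, Sequence

Vector = Sequence[float]


def seed_direction(hack: Sequence[Vector], clean: Sequence[Vector]) -> list[float]:
    """Unit vector along mean(hack) - mean(clean), standing in for `v_act`."""
    dim = len(hack[0])
    diff = [
        sum(h[i] for h in hack) / len(hack) - sum(c[i] for c in clean) / len(clean)
        for i in range(dim)
    ]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("hack and clean means coincide; no direction to seed")
    return [d / norm for d in diff]


def router_weight(pooled: Vector, direction: Vector, bias: float = 0.0) -> float:
    """Linear gate plus sigmoid (assumed squashing), giving w in (0, 1)."""
    logit = sum(p * d for p, d in zip(pooled, direction)) + bias
    if logit >= 0.0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)  # numerically stable branch for negative logits
    return e / (1.0 + e)


def pin_loss(w: float, is_hack: bool, eps: float = 1e-7) -> float:
    """Binary cross-entropy (assumed form): pulls w to 1 on hack, 0 on clean."""
    w = min(max(w, eps), 1.0 - eps)
    return -math.log(w) if is_hack else -math.log(1.0 - w)


def deployed_grad_scale(w: float, hard_threshold: Optional[float] = None) -> float:
    """Gradient share for the deployed block: soft (1 - w), or zero once w
    exceeds an optional hard threshold."""
    if hard_threshold is not None and w > hard_threshold:
        return 0.0
    return 1.0 - w


if __name__ == "__main__":
    v_act = seed_direction(hack=[[1.0, 0.0], [3.0, 0.0]], clean=[[0.0, 0.0], [0.0, 0.0]])
    w = router_weight([2.0, 0.5], v_act)
    print(v_act, w, pin_loss(w, is_hack=True))
    print(deployed_grad_scale(w), deployed_grad_scale(w, hard_threshold=0.5))
```

The last line contrasts the two detach modes: the soft scale leaves a nonzero share on the deployed block for any `w < 1`, while the thresholded variant removes it entirely for confidently routed rollouts.
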

## What it measures

The deployment evaluation generates on the held-out test set with the hack expert ablated and again with it on, and reports both:

| measure | hack expert on | hack expert off (deploy) | supports the hypothesis if |
|---|---|---|---|
| hack | baseline | lower | off below on |
| solve | baseline | preserved | off near on |

Each run ends with the line `Evil MoE causal ablation: deploy hack X (ON) -> Y (OFF)`.

## Quick start

```bash
uv sync
just smoke      # verify gates + a tiny on-policy GRPO run with router, pin, and ablation
just smoke-moe  # only the Evil MoE training pathway on tiny-random Qwen3
```

`just smoke` runs four `verify_*.py` gates (reward grader, evaluation gap, lora2r block routing, and the Evil MoE soft-weight, router, and pin invariants in `scripts/verify_moe_router.py`), then a six-step run. The tiny random model produces no reward, so GRPO never fires and the adapter does not train on that run; it is a pipeline check, and the routing math is proven in `scripts/verify_moe_router.py`. The causal hack-drop result requires a real Qwen3-4B run through `pueue`.

## Layout

- `src/vgrout/train_moe.py`: the Evil MoE GRPO loop (on-policy, learned router, pin loss).
- `src/vgrout/moe_router.py`: `HackRouter`, pooled activations to the hack-expert weight `w`.
- `src/vgrout/lora2r.py`: the two-expert adapter and its forward hook (`_lora2r_w`).
- `scripts/verify_moe_router.py`: the routing-invariant gate.
- `docs/spec/`: the original Evil MoE proposal and the literature map.
- `src/vgrout/train.py`: the vGROUT routeA, none, and absorb arms, kept for comparison (`just smoke-legacy`).