Evil MoE
Evil MoE trains a mixture of experts in which one expert carries reward-hacking behaviour and is removed at deployment. Routing is done by a learned soft router, seeded by an extracted hack direction and held by a pin loss, rather than by a per-example label. It is a fork of vGROUT and reuses its substrate (the reward-hacking LeetCode environment, GRPO loop, reward grader, deployment-ablation evaluator); only the routing mechanism changes. Background, literature map, and design rationale are in docs/spec/ and AGENTS.md.
Hypothesis
A learned MoE-style router, seeded by a synthetic activation-space hack direction and anchored by a continuous pin loss on hand-authored contrastive pairs, can localize reward-hacking behaviour in a single ablatable expert. Test: ablate the hack expert at deployment and check that the ablation reduces hacking more than it reduces solving.
This is a localization claim, not a strict gradient-routing absorption claim.
Routing
The adapter is one rank-2r LoRA per target Linear (src/vgrout/lora2r.py), split into a
deployed block [:r] (always trained, kept) and a quarantine block [r:] (the hack expert,
reset to init at deployment). The substrate is Ariahw and Nanda's reward-hacking LeetCode
environment, inherited from vGROUT.
Per rollout the router (src/vgrout/moe_router.py) reads the pooled deployed-block bottleneck
activations and emits one weight w in [0,1]. The forward hook scales the hack expert by w
and the deployed expert's gradient by 1-w (forward value kept). So w=0 trains only the
deployed block (and reproduces the deployment forward), w=1 only the hack expert, intermediate
w both.
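The effect of w on the forward value and on each expert's gradient can be sketched with plain lists. This is a minimal illustration, not the code in src/vgrout/lora2r.py: the function name soft_route and its arguments are hypothetical, the backward pass is written out by hand, and an autograd framework would typically get the same effect with a detach on the deployed output.

```python
def soft_route(dep_out, quar_out, w, grad_out):
    """Hand-derived forward/backward for a soft-weighted two-expert combine.

    Hypothetical helper for illustration only.
    dep_out, quar_out: per-feature outputs of the deployed and hack experts.
    w: router weight in [0, 1] for this rollout.
    grad_out: upstream gradient dL/dy, one entry per feature.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    # Forward: deployed expert at full value, hack expert scaled by w.
    y = [d + w * q for d, q in zip(dep_out, quar_out)]
    # Backward: the deployed expert receives a (1 - w) share of the gradient,
    # even though its forward value was not scaled.
    grad_dep = [(1.0 - w) * g for g in grad_out]
    # The hack expert's gradient is scaled by w because its output was.
    grad_quar = [w * g for g in grad_out]
    # The router weight gets gradient only through the w * quar term.
    grad_w = sum(q * g for q, g in zip(quar_out, grad_out))
    return y, grad_dep, grad_quar, grad_w


if __name__ == "__main__":
    for w in (0.0, 0.25, 1.0):
        y, g_dep, g_quar, g_w = soft_route([1.0, 2.0], [0.5, -1.0], w, [1.0, 1.0])
        print(f"w={w}: y={y} grad_dep={g_dep} grad_quar={g_quar} grad_w={g_w}")
```

At w=0 the output equals the deployed expert's output and the hack expert gets no gradient; at w=1 the deployed expert gets none.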
Two signals train the router. Reward (the GRPO loss) flows in through w. A pin loss on the
hand-authored pairs, applied every step, pushes w toward 1 on hack and 0 on clean. The router
direction is initialized to v_act, the unit hack-minus-clean activation difference from those
pairs, so it starts as a fixed vector gate and then specializes. The only labels are the hack and
clean sides of those off-distribution pairs; no training-rollout label or environment oracle
enters the routing.
This replaces the label-driven hard mask of gradient routing (Cloud et al.) and its self-reinforcing variant SGTM (Shilov et al.) with a learned soft router; the linear gate and the GRPO-on-MoE setup follow MixLoRA.
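The seeding and the pin described above can be sketched in a few lines. Several details here are assumptions rather than the contents of src/vgrout/moe_router.py: v_act is taken as the normalized difference of the mean hack and mean clean activations, the gate is a sigmoid of a scaled cosine, and the pin is a binary cross-entropy. The names hack_direction, router_weight, pin_loss, scale, and bias are hypothetical.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in v]


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def hack_direction(hack_acts, clean_acts):
    """Assumed form of v_act: unit vector from the clean mean to the hack mean."""
    mu_h, mu_c = mean(hack_acts), mean(clean_acts)
    return unit([h - c for h, c in zip(mu_h, mu_c)])


def router_weight(pooled, direction, scale=1.0, bias=0.0):
    """Rank-1 gate (assumed): sigmoid of the scaled cosine with the direction."""
    cos = sum(a * b for a, b in zip(unit(pooled), unit(direction)))
    return 1.0 / (1.0 + math.exp(-(scale * cos + bias)))


def pin_loss(w_hack, w_clean, eps=1e-7):
    """Assumed pin: cross-entropy pushing w to 1 on hack sides, 0 on clean sides."""
    terms = [-math.log(max(w, eps)) for w in w_hack]
    terms += [-math.log(max(1.0 - w, eps)) for w in w_clean]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Toy 2-d activations standing in for the hand-authored pairs.
    hack = [[2.0, 0.0], [4.0, 0.0]]
    clean = [[-1.0, 0.0], [-3.0, 0.0]]
    v_act = hack_direction(hack, clean)
    w_h = [router_weight(h, v_act, scale=5.0) for h in hack]
    w_c = [router_weight(c, v_act, scale=5.0) for c in clean]
    print("v_act:", v_act)
    print("pin loss at the seed:", pin_loss(w_h, w_c))
```

Because the router starts at v_act, the pin loss is already small on the pairs at step zero; the pin then only has to keep the direction from drifting as reward updates it.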
Router dynamics
A normal MoE router and the task loss already cooperate by construction: the router gets gradient from the loss and learns to send each input to the expert that lowers it (an auxiliary load-balancing loss stops it collapsing onto one). Evil MoE inherits that and adds one thing, the pin, which forces the labeled hand-authored hacks to the quarantine (ablatable) expert and the clean pairs to the deployed (kept) expert.
That introduces one conflict. The pin only constrains routing on the labeled pairs, while reward decides where the unlabeled live hacks go, and reward prefers the deployed expert because it is always on and fully on-policy, so it is the cheaper place to express a hack. Ablation cannot remove what lands there. The method works only if live hacks follow the pinned labeled ones into the quarantine (SGTM's self-reinforcement bet) faster than reward relearns them in the deployed block. That is what the causal ablation measures.
Three things bias it toward localizing: the router is seeded with v_act, so it starts aligned;
the pin fires every step, so reward cannot rotate the axis off "hack"; and on a routed hack the
deployed expert is detached. That detach is also the one open leak: it is soft (a 1-w scaling),
not hard, so the deployed expert still gets a 1-w share of the hack gradient. A hard detach above
a w threshold would close it without costing the router any signal, since the router's reward
gradient flows only through the w*quar term.
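The difference between the soft detach and a thresholded hard detach comes down to the gradient share the deployed expert receives. The sketch below is illustrative only: deployed_grad_share and its threshold argument are hypothetical, and the repository does not necessarily expose such a switch.

```python
def deployed_grad_share(w, threshold=None):
    """Fraction of the upstream gradient that reaches the deployed expert.

    Hypothetical helper. threshold=None gives the soft detach (share 1 - w).
    With a threshold, rollouts routed at or above it are fully detached.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if threshold is not None and w >= threshold:
        return 0.0  # hard detach: no hack gradient leaks into the deployed block
    return 1.0 - w  # soft detach: a 1 - w share still leaks


if __name__ == "__main__":
    for w in (0.1, 0.6, 0.9):
        soft = deployed_grad_share(w)
        hard = deployed_grad_share(w, threshold=0.8)
        print(f"w={w}: soft share={soft:.2f}, hard share={hard:.2f}")
```

Only the deployed expert's share changes; the gradient reaching w is computed from the hack expert's term and is unaffected by the threshold.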
What it measures
The deployment evaluation generates on the held-out test set with the hack expert ablated and again with it on, and reports both:
| measure | hack expert on | hack expert off (deploy) | supports the hypothesis if |
|---|---|---|---|
| hack | baseline | lower | off below on |
| solve | baseline | preserved | off near on |
Each run ends with the line Evil MoE causal ablation: deploy hack X (ON) -> Y (OFF).
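One way to turn the table into a check is to compare the two drops directly. The sketch assumes the four inputs are rates in [0, 1] and that "supports the hypothesis" means the hack rate falls, and falls by more than the solve rate does; the function ablation_report and that decision rule are illustrative assumptions, not the evaluator's actual code.

```python
def ablation_report(hack_on, hack_off, solve_on, solve_off):
    """Compare hack and solve rates with the hack expert on versus ablated.

    Hypothetical helper; all four arguments are rates in [0, 1].
    """
    hack_drop = hack_on - hack_off
    solve_drop = solve_on - solve_off
    return {
        "hack_drop": hack_drop,
        "solve_drop": solve_drop,
        # Assumed rule: ablation must cut hacking, and by more than it cuts solving.
        "supports_hypothesis": hack_drop > 0 and hack_drop > solve_drop,
    }


if __name__ == "__main__":
    # Made-up rates for illustration only, not measured results.
    print(ablation_report(hack_on=0.5, hack_off=0.25, solve_on=0.5, solve_off=0.5))
```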
Quick start
uv sync
just smoke # verify gates + a tiny on-policy GRPO run with router, pin, and ablation
just smoke-moe # only the Evil MoE training pathway on tiny-random Qwen3
just smoke runs four verify_*.py gates (reward grader, evaluation gap, lora2r block
routing, and the Evil MoE soft-weight, router, and pin invariants in
scripts/verify_moe_router.py), then a six-step run. The tiny random model produces no
reward, so GRPO never fires and the adapter does not train on that run. It is a pipeline
check only; the routing math is checked separately by scripts/verify_moe_router.py. The
causal hack-drop result requires a real Qwen3-4B run through pueue.
Layout
- src/vgrout/train_moe.py: the Evil MoE GRPO loop (on-policy, learned router, pin loss).
- src/vgrout/moe_router.py: HackRouter, pooled activations to the hack-expert weight w.
- src/vgrout/lora2r.py: the two-expert adapter and its forward hook (_lora2r_w).
- scripts/verify_moe_router.py: the routing-invariant gate.
- docs/spec/: the original Evil MoE proposal and the literature map.
- src/vgrout/train.py: the vGROUT routeA, none, and absorb arms, kept for comparison (just smoke-legacy).