docs: rewrite Evil MoE spec to the soft-routing design + literature evidence

Spec was stale (recommended hard sparse "Version A", the DEMix absorption-killer).
Rewrite to match what is implemented and what we clarified:

- pseudocode-first: lora2r 2-expert forward, seeded rank-1 cosine router, GRPO+pin
  loop, deploy ablation. For 2 experts the "proper" router IS rank-1 (softmax over 2
  = sigmoid of one direction), seeded with v_act.
- "Why soft, not top-k" reframed as a tradeoff, not a verdict: hard routing closes
  the leak but needs a router that catches all hacks; soft keeps absorption available
  but leaks (1-w). DEMix only bites if we rely on absorption.
- Evidence section from two literature searches. Forced localization has working
  precedents (single bad direction: emergent misalignment/persona/refusal; behavioural
  expert seeding: SteerMoE, geometric cosine routing, cluster-aware upcycling; ablation
  + repair: NAEE/MoE-Pruner; router anchor: SEUF/MoTE). Emergent localization does not
  (standing-committee, topic-driven routing). So seed+pin are load-bearing.
- 3-way/3-expert noted as an extension (closer to production), 2 experts for the
  decisive causal run.

README: add Router dynamics (three forces, one pin-vs-reward conflict, mitigations).
Add HF "MoE in Transformers" blog to docs/papers (force-added past the docs gitignore).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-14 13:06:38 +08:00
parent 04a98b321e
commit 8f39c4a69f
3 changed files with 552 additions and 314 deletions
+49 -58
@@ -1,76 +1,67 @@
# Evil MoE
Evil MoE trains a mixture-of-experts in which one expert carries reward-hacking behaviour
and is removed at deployment. It is a fork of [vGROUT](https://github.com/wassname/vGROUT),
kept as the `upstream` remote, and reuses vGROUT's substrate: the Ariahw and Nanda
reward-hacking LeetCode environment, the GRPO loop, the reward grader, and the
deployment-ablation evaluator. The routing mechanism is the only part that changes.
Evil MoE trains a mixture of experts in which one expert carries reward-hacking behaviour and
is removed at deployment. Routing is done by a learned soft router, seeded by an extracted hack
direction and held by a pin loss, rather than by a per-example label. It is a fork of
[vGROUT](https://github.com/wassname/vGROUT) and reuses its substrate (the reward-hacking
LeetCode environment, GRPO loop, reward grader, deployment-ablation evaluator); only the routing
mechanism changes. Background, literature map, and design rationale are in
[docs/spec/](docs/spec/) and [AGENTS.md](AGENTS.md).
## Hypothesis
> A learned MoE-style router, seeded by a synthetic activation-space hack direction and
> anchored by a continuous pin loss on hand-authored contrastive pairs, can localize
> reward-hacking behaviour in a single ablatable expert. The test is causal: ablate the
> hack expert at deployment and measure whether the reward-hack rate drops while the
> ground-truth solve rate survives, and whether it drops more than ablating a random or
> clean expert at matched capacity.
> A learned MoE-style router, seeded by a synthetic activation-space hack direction and anchored
> by a continuous pin loss on hand-authored contrastive pairs, can localize reward-hacking
> behaviour in a single ablatable expert. Test: ablate the hack expert at deployment and check
> that it suppresses more hacking than solving.
This is a localization claim, not a strict gradient-routing absorption claim. The original
proposal and the literature map are in [docs/spec/](docs/spec/).
A localization claim, not a strict gradient-routing absorption claim.
## Background
## Routing
Three routing mechanisms differ in how the gradient assignment is decided. In Gradient
Routing (Cloud et al.) a data label decides it, applied as a hard backward mask. In a learned
MoE the router decides it, trained from the task loss; an expert that lowers the loss on some
inputs is routed more of them and improves further, so a learned router tends to concentrate
related inputs in one expert. SGTM (Shilov et al.) connects the two: once a hard mask seeds
localization on labeled data, unlabeled data of the same kind comes to update the same
parameters without a mask. Evil MoE replaces SGTM's hard mask with a soft learned router,
seeds it with the extracted hack direction, and relies on the router's concentration under
GRPO. The router is also a parameter the reward pushes on, so the pin loss is applied every
step rather than only at initialization. GRPO has been run on MoE models before: DeepSeek-R1
trains the 671B DeepSeek-V3 MoE with GRPO, and MoE-GRPO (arXiv:2603.24984) optimizes the
router itself with GRPO.
The adapter is one rank-`2r` LoRA per target Linear (`src/vgrout/lora2r.py`), split into a
deployed block `[:r]` (always trained, kept) and a quarantine block `[r:]` (the hack expert,
reset to init at deployment). The substrate is Ariahw and Nanda's reward-hacking LeetCode
environment, inherited from vGROUT.
## The adapter
Per rollout the router (`src/vgrout/moe_router.py`) reads the pooled deployed-block bottleneck
activations and emits one weight `w` in `[0,1]`. The forward hook scales the hack expert by `w`
and the deployed expert's gradient by `1-w` (forward value kept). So `w=0` trains only the
deployed block (and reproduces the deployment forward), `w=1` only the hack expert, intermediate
`w` both.
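
One way to write the hook's behaviour as a single expression, offered as a sketch and not as the repository's actual code. The symbols are notation introduced here: $d$ is the deployed block's output, $h$ the hack expert's output, and $\operatorname{sg}(\cdot)$ a stop-gradient (a `detach`).

$$
y = \operatorname{sg}(d) + (1-w)\,\bigl(d - \operatorname{sg}(d)\bigr) + w\,h
$$

In the forward pass $d - \operatorname{sg}(d)$ is zero, so $y = d + w\,h$ and the deployed value is kept in full. In the backward pass, with $\theta_d$ and $\theta_h$ the parameters of the two blocks:

$$
\frac{\partial y}{\partial \theta_d} = (1-w)\,\frac{\partial d}{\partial \theta_d},
\qquad
\frac{\partial y}{\partial \theta_h} = w\,\frac{\partial h}{\partial \theta_h},
\qquad
\frac{\partial y}{\partial w} = h
$$

Under this formulation the three cases fall out directly: $w=0$ gives $y=d$ with no gradient to the hack expert, $w=1$ gives no gradient to the deployed block, and the router receives gradient only through the $w\,h$ term.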
Every target Linear gets one rank-`2r` LoRA (`src/vgrout/lora2r.py`), `A:[2r,d_in]` and
`B:[d_out,2r]`, with frozen Gaussian-init copies subtracted so the net delta is zero at
initialization. The `2r` rows and columns split into two independent experts. The deployed
block `[:r]` is always present in the forward pass and always trained. The quarantine block
`[r:]` is the hack expert. At deployment the quarantine block is reset to its initialization,
so its learned contribution is absent from the deployed model.
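
The zero-at-init delta and the deployment reset can be illustrated with a small standard-library sketch. It is not the code in `src/vgrout/lora2r.py`; the helper names (`matmul`, `gaussian`, `lora_delta`, `reset_quarantine`), the toy sizes, and the noise used as a stand-in for training are all assumptions made for the example.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def gaussian(rows, cols, rng, std=0.1):
    """Matrix of independent Gaussian entries."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

def lora_delta(A, B, A0, B0):
    """Net weight delta B@A - B0@A0; zero while A == A0 and B == B0."""
    BA, B0A0 = matmul(B, A), matmul(B0, A0)
    return [[u - v for u, v in zip(r1, r2)] for r1, r2 in zip(BA, B0A0)]

def reset_quarantine(A, B, A0, B0, r):
    """Copy A and B, restoring the quarantine block [r:] to its init values."""
    A_dep = [row[:] for row in A[:r]] + [row[:] for row in A0[r:]]
    B_dep = [b[:r] + b0[r:] for b, b0 in zip(B, B0)]
    return A_dep, B_dep

rng = random.Random(0)
d_in, d_out, r = 4, 3, 2           # hypothetical toy sizes
A0 = gaussian(2 * r, d_in, rng)    # frozen init copy, shape [2r, d_in]
B0 = gaussian(d_out, 2 * r, rng)   # frozen init copy, shape [d_out, 2r]
A = [row[:] for row in A0]
B = [row[:] for row in B0]

delta_init = lora_delta(A, B, A0, B0)  # all zeros before any update

# Stand-in for training: perturb every entry of both experts.
A = [[a + rng.gauss(0.0, 0.05) for a in row] for row in A]
B = [[b + rng.gauss(0.0, 0.05) for b in row] for row in B]

# Deployment: quarantine rows of A and columns of B go back to init.
A_dep, B_dep = reset_quarantine(A, B, A0, B0, r)
delta_deployed = lora_delta(A_dep, B_dep, A0, B0)

# Contribution of the deployed block [:r] taken on its own.
deployed_only = lora_delta(A[:r], [b[:r] for b in B], A0[:r], [b[:r] for b in B0])
```

Because `B @ A` is a sum over the `2r` rank indices, the two blocks contribute independently, so `delta_deployed` matches `deployed_only` up to floating-point rounding.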
Two signals train the router. Reward (the GRPO loss) flows in through `w`. A pin loss on the
hand-authored pairs, applied every step, pushes `w` toward 1 on hack and 0 on clean. The router
direction is initialized to `v_act`, the unit hack-minus-clean activation difference from those
pairs, so it starts as a fixed vector gate and then specializes. The only labels are the hack and
clean sides of those off-distribution pairs; no training-rollout label or environment oracle
enters the routing.
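
A minimal standard-library sketch of the seeding and the pin, under stated assumptions: the gate is taken to be a sigmoid of a scaled cosine similarity, the pin is taken to be binary cross-entropy, and the `scale` value, function names, and toy activations are hypothetical. The actual router lives in `src/vgrout/moe_router.py` and may differ on all of these.

```python
import math

def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def unit(v):
    n = norm(v)
    return [x / n for x in v]

def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))

def route_weight(x, direction, scale=5.0):
    """w = sigmoid(scale * cos(x, direction)); scale is a hypothetical choice."""
    z = scale * cosine(x, direction)
    return 1.0 / (1.0 + math.exp(-z))

def pin_loss(hack_acts, clean_acts, direction):
    """Assumed form of the pin: BCE pushing w to 1 on hack and 0 on clean."""
    eps = 1e-12
    total = 0.0
    for x in hack_acts:
        total -= math.log(route_weight(x, direction) + eps)
    for x in clean_acts:
        total -= math.log(1.0 - route_weight(x, direction) + eps)
    return total / (len(hack_acts) + len(clean_acts))

# Toy pooled activations for the hand-authored pairs (made up for illustration).
hack_acts = [[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]]
clean_acts = [[-0.9, 0.1, 0.1], [-1.0, 0.2, 0.0]]

# v_act: unit hack-minus-clean difference of mean activations.
diff = [h - c for h, c in zip(mean_vec(hack_acts), mean_vec(clean_acts))]
v_act = unit(diff)

w_hack = [route_weight(x, v_act) for x in hack_acts]
w_clean = [route_weight(x, v_act) for x in clean_acts]

loss_seeded = pin_loss(hack_acts, clean_acts, v_act)
loss_flipped = pin_loss(hack_acts, clean_acts, [-x for x in v_act])
```

Seeding with `v_act` means the pin loss starts low on the pairs it was extracted from; flipping the direction, as in `loss_flipped`, shows what the pin penalizes if reward rotates the router away.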
## Method
This replaces the label-driven hard mask of gradient routing (Cloud et al.) and its
self-reinforcing variant SGTM (Shilov et al.) with a learned soft router; the linear gate and the
GRPO-on-MoE setup follow MixLoRA.
For each rollout a learned router (`src/vgrout/moe_router.py`) reads the pooled deployed-block
bottleneck activations and emits one weight `w` in `[0,1]`. The forward hook scales the hack
expert by `w` and scales the deployed expert's gradient by `1-w` while keeping its forward
value. So `w=0` trains only the deployed block and reproduces the deployment forward, `w=1`
trains only the hack expert with the deployed block detached, and intermediate `w` trains
both. These are the soft form of vGROUT routeA's keep, absorb, and rout masks.
## Router dynamics
The router is trained two ways at once. GRPO flows into it through `w`: raising `w` on a
rollout moves that rollout's learning from the deployed block into the hack expert. A pin loss
on the hand-authored pairs, applied every step, pushes `w` toward 1 on the hack side and
toward 0 on the clean side. The router direction is initialized from `v_act`, the
hack-minus-clean activation difference extracted from those pairs, so it starts as the vector
gate and then specializes.
A normal MoE router and the task loss already cooperate by construction: the router gets gradient
from the loss and learns to send each input to the expert that lowers it (an auxiliary
load-balancing loss stops it collapsing onto one). Evil MoE inherits that and adds one thing, the
pin, which forces the labeled hand-authored hacks to the quarantine (ablatable) expert and the
clean pairs to the deployed (kept) expert.
There is no load-balancing loss. Load balancing forces even expert use and would suppress the
asymmetric specialization the method depends on (Demons in the Detail, arXiv:2501.11873; The
Illusion of Specialization, arXiv:2601.03425). Routing is per rollout, not per token, because
reward hacking is a property of a whole rollout and the deployment test ablates the expert at
the rollout level. This makes Evil MoE a behavioral mixture of adapters rather than a capacity
MoE. The canonical per-token LoRA-MoE substrate is MixLoRA; Evil MoE borrows its small linear
gate and the GRPO-on-MoE precedent but not its per-token routing or its load-balancing loss.
That introduces one conflict. The pin only constrains routing on the labeled pairs, while reward
decides where the unlabeled live hacks go, and reward prefers the deployed expert because it is
always on and fully on-policy, so it is the cheaper place to express a hack. Ablation cannot
remove what lands there. The method works only if live hacks follow the pinned labeled ones into
the quarantine (SGTM's self-reinforcement bet) faster than reward relearns them in the deployed
block. That is what the causal ablation measures.
The only labels used in training are the hack and clean sides of the hand-authored pairs.
These pairs are off-distribution and authored before observing any training rollout. No
ground-truth label from a training rollout, and no environment-specific oracle, enters the
router or the routing. The deployment grader is a measurement instrument that scores the final
evaluation only.
Three things bias training toward localization: the router is seeded with `v_act`, so it starts
aligned; the pin fires every step, so reward cannot rotate the axis off "hack"; and on a routed
hack the deployed expert is detached. That detach is the one open leak: it is soft (`1-w`), not
hard, so the deployed expert still receives a `1-w` share of the hack gradient. A hard detach
above a `w` threshold would close the leak at no cost to the router, since the router's reward
gradient flows only through the `w*quar` term.
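
The difference between the soft rule and a hard detach above a `w` threshold can be sketched as a gradient multiplier on the deployed expert. The function name and the `0.8` threshold are hypothetical; the source does not specify a threshold value.

```python
def deployed_grad_scale(w, threshold=None):
    """Multiplier applied to the deployed expert's gradient for routing weight w.

    threshold=None gives the soft rule (1 - w). With a threshold, the gradient
    is cut entirely once w reaches it, and the soft rule applies below it.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if threshold is not None and w >= threshold:
        return 0.0
    return 1.0 - w

weights = (0.0, 0.5, 0.9)
soft = [deployed_grad_scale(w) for w in weights]
hard = [deployed_grad_scale(w, threshold=0.8) for w in weights]  # 0.8 is a made-up value
```

At `w = 0.9` the soft rule still passes a small share of the hack gradient to the deployed expert, while the thresholded rule passes none. The multiplier only touches the deployed expert's backward pass, so the router's gradient through the `w*quar` term is unchanged.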
## What it measures