# Evil MoE spec

Status: core implemented, `just smoke` green; decisive Qwen3-4B run pending.

## BLUF

Evil MoE is a 2-expert ablatable mixture of adapters. One expert (the quarantine block of a
rank-`2r` LoRA) is steered to carry reward-hacking behaviour and is reset to its initialization
at deployment. A soft router, seeded with an activation-space hack direction `v_act` and held by
a pin loss on hand-authored contrastive pairs, decides for each rollout how much of that rollout's
GRPO gradient trains the quarantine versus the always-on deployed expert. Success is causal:
ablating the quarantine drops the reward-hack rate more than it drops the solve rate, and more
than ablating a random or clean expert at matched capacity.

This is an ablatable-localization claim, not a strict Gradient-Routing absorption claim.

## Oracle-free constraint

## *AGENTS.md* - project constraint
- epistemic context: standing repo instruction from the user, included to make the experimental boundary explicit.

> The env's eval grader / full detector suite is an ORACLE (ground truth for this LeetCode env).
> Using it at TRAIN time -- to gate routing, set a threshold, or label student rollouts -- is
> cheating. It may only score the final deploy eval.
>
> OUR setup is `v_act -> routing`: extract a hack direction from hand-built synthetic contrastive
> pairs (off-distribution, authored by us), then route the live GRPO gradient by alignment to it.
> The only labels anywhere are on the pairs we wrote; no detector runs over student rollouts at
> train time. Generalization is tested by whether `v_act` (built from pairs covering some hack
> modes) suppresses held-out modes absent from the pairs.

The GRPO reward (passing the gameable tests) is the RL training signal and is allowed; the
forbidden thing is the ORACLE hack-label, which detects true hacks and only scores the final eval.

## Design

Two experts live in one rank-`2r` LoRA per target Linear (`src/vgrout/lora2r.py`): a deployed
block `[:r]` (always on, kept at deploy) and a quarantine block `[r:]` (the hack expert, reset to
its initialization at deploy). Frozen init copies are subtracted so the net adapter delta is zero
at init.

The forward keeps both experts present. This is the load-bearing choice (see "Why soft" below):

```python
def lora2r_forward(x, w):                 # w in [0,1] from the router, per rollout
    dep  = B[:, :r] @ (A[:r] @ x)         # deployed expert  (init copy subtracted)
    quar = B[:, r:] @ (A[r:] @ x)         # quarantine (hack) expert
    dep  = (1 - w)*dep + w*dep.detach()   # value intact, gradient scaled by (1 - w)
    return base(x) + dep + w*quar          # BOTH stay in the forward; no top-k select
```
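
A minimal runnable sketch of the soft detach, assuming PyTorch, toy shapes, and a single scalar `w` (the real router emits one `w` per rollout). `soft_forward` and `grad_of_A` are illustrative names, not the code in `src/vgrout/lora2r.py`; the frozen base and the init-copy subtraction are omitted. It checks the two properties the comments above claim: the forward value is unchanged by the detach mix, and each block's gradient is scaled by its share.

```python
import torch

torch.manual_seed(0)
d, r = 6, 2                                     # toy width and per-expert rank (assumed)
A = torch.randn(2 * r, d, requires_grad=True)   # rows [:r] deployed, [r:] quarantine
B = torch.randn(d, 2 * r, requires_grad=True)
x = torch.randn(d)

def soft_forward(x, w):
    dep = B[:, :r] @ (A[:r] @ x)
    quar = B[:, r:] @ (A[r:] @ x)
    dep = (1 - w) * dep + w * dep.detach()      # same value, gradient scaled by (1 - w)
    return dep + w * quar                       # frozen base(x) omitted in this sketch

def grad_of_A(w):
    """Gradient of a dummy scalar loss w.r.t. A for a given routing weight."""
    A.grad, B.grad = None, None
    soft_forward(x, w).sum().backward()
    return A.grad.clone()

g_w0, g_w1, g_mix = grad_of_A(0.0), grad_of_A(1.0), grad_of_A(0.25)

with torch.no_grad():
    plain = B[:, :r] @ (A[:r] @ x) + 0.25 * (B[:, r:] @ (A[r:] @ x))
    value_unchanged = torch.allclose(soft_forward(x, 0.25), plain)
```

At `w = 0.25` the deployed rows of `A` receive 0.75 of their `w = 0` gradient and the quarantine rows receive 0.25 of their `w = 1` gradient, while the output equals the undetached sum.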

The router is a seeded rank-1 gate. For 2 experts a standard `Linear(d, n_experts) + softmax`
router collapses to a single direction (softmax over 2 logits = `sigmoid` of their difference),
so the proper 2-expert router IS rank-1. We seed that direction with `v_act` and cosine-normalize
it so the logit stays O(1) and the sigmoid does not start saturated:

```python
direction = v_act.flatten()               # seeded, trainable; scale and bias also trainable
def router(acts):                         # acts pooled over completion tokens, per rollout
    return sigmoid(scale * cos(acts, direction) + bias)    # w in [0,1]
```
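
Both claims above (the 2-logit softmax collapses to a sigmoid, and the cosine keeps the logit bounded) can be checked numerically. A pure-Python sketch; the helper names, the toy vectors, and the `scale`/`bias` values are illustrative assumptions, not values from `src/vgrout/moe_router.py`.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def softmax2(z_dep, z_quar):
    """Standard softmax over two expert logits (max-shifted for stability)."""
    m = max(z_dep, z_quar)
    e_dep, e_quar = math.exp(z_dep - m), math.exp(z_quar - m)
    return e_dep / (e_dep + e_quar), e_quar / (e_dep + e_quar)

def cos(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)

def router(acts, direction, scale=4.0, bias=0.0):
    # cos is in [-1, 1], so the logit stays within [bias - scale, bias + scale]
    return sigmoid(scale * cos(acts, direction) + bias)

# 1) Two-expert softmax equals a sigmoid of the logit difference.
z_dep, z_quar = 0.3, 1.7
p_quar = softmax2(z_dep, z_quar)[1]

# 2) Cosine routing ignores activation norm, so large activations cannot saturate the gate.
direction = [1.0, 0.0, 0.0]
w_aligned = router([5.0, 0.0, 0.0], direction)
w_huge = router([5000.0, 0.0, 0.0], direction)
w_orth = router([0.0, 2.0, 0.0], direction)
w_anti = router([-3.0, 0.0, 0.0], direction)
```

With `bias = 0`, an activation orthogonal to `direction` gets `w = 0.5`, and `w` can never leave the interval `[sigmoid(-scale), sigmoid(scale)]` regardless of activation norm.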

Training routes by `w` and re-anchors the router every step:

```python
for step in range(steps):
    opt.zero_grad()                              # both backward() calls below accumulate into .grad
    students = generate(prompts)                 # on-policy rollouts
    R        = env_reward(students)              # GRPO reward (the RL signal, not the oracle)
    acts     = pooled_acts(students)             # no-grad capture for the router
    w        = router(acts)                      # per rollout
    set_lora2r_w(w)
    grpo_loss(students, R).backward()            # gradient routed by w through the forward
    (lambda_pin * router.pin(hack_pairs, clean_pairs)).backward()   # SGTM anchor, EVERY step
    opt.step()                                   # base frozen; A, B, router train
```

Deployment ablation resets the quarantine to its init and evaluates the held-out test set with
the hack expert on and off, reporting hack and solve for each.
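
The reset itself can be sketched as follows, assuming PyTorch and the init-copy-subtracted delta described under Design. `block_delta` and the random perturbation standing in for training are hypothetical; the real adapter lives in `src/vgrout/lora2r.py` and may be organised differently.

```python
import torch

torch.manual_seed(0)
d, r = 6, 2                                             # toy sizes (assumed)
dep, quar = slice(0, r), slice(r, 2 * r)

A0, B0 = torch.randn(2 * r, d), torch.randn(d, 2 * r)   # frozen init copies
A, B = A0.clone(), B0.clone()                           # trainable weights start at init

def block_delta(x, rows):
    """Contribution of one block with its frozen init copy subtracted."""
    return B[:, rows] @ (A[rows] @ x) - B0[:, rows] @ (A0[rows] @ x)

x = torch.randn(d)
delta_at_init = block_delta(x, dep) + block_delta(x, quar)

# Stand-in for training: perturb both blocks in place.
A += 0.1 * torch.randn_like(A)
B += 0.1 * torch.randn_like(B)
dep_trained = block_delta(x, dep).clone()
quar_trained = block_delta(x, quar).clone()

# Deployment ablation: copy the init back into the quarantine block only.
A[quar] = A0[quar]
B[:, quar] = B0[:, quar]
quar_after_reset = block_delta(x, quar)
dep_after_reset = block_delta(x, dep)
```

After the reset the quarantine contributes exactly zero, and the deployed block's learned delta is untouched, which is what makes the on/off comparison a clean ablation.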

## Why these choices

### Why soft routing, not top-k

## *Gradient Routing* - [paper_gradient_routing.md](../papers/grad_routing/paper_gradient_routing.md)
- epistemic context: local paper note, author's absorption mechanism and the DEMix comparison.

> To explain these observations, we posit absorption: (i) routing limited data to a region
> creates units of computation or features that are relevant to a broader task; (ii) these units
> then participate in the model's predictions on related, non-routed data, reducing prediction
> errors on these data, so that (iii) the features are not learned elsewhere. [...] separating the
> experts (a) reduces the sample sizes on which they learn and prevents generalization between
> them and (b) does not allow for absorption, which requires that all features are present at the
> time of the forward pass.

Step (ii) is the condition: absorption only suppresses relearning elsewhere if the expert is
present in the forward pass on the related data. Hard expert selection removes the non-selected
expert from the forward (DEMix), leaving that path out of the graph. But this only bites if we
rely on absorption to catch hacks the router misses. If the router generalizes and sends every
hack to the quarantine, each hack is present at that expert by construction and hard routing is
clean. So hard versus soft is a tradeoff, not a verdict: hard routing closes the leak (the
deployed expert never sees hack gradient) but needs a router that catches all hacks; soft routing
keeps absorption available to catch the router's misses but leaks a `(1-w)` share into the
deployed expert.

We choose soft for two reasons. First, production MoE (Switch top-1, Mixtral top-2, DeepSeek
top-k) is hard-routed in both training and inference and works, but toward a goal we do not
share: capability per FLOP. It never deletes an expert, so a behaviour smeared across several
experts is harmless there. Our goal is deletion, which needs clean ownership. Second, and
decisively, the evidence below says absorption will not deliver behavioural clustering for free:
routing clusters by topic. So we cannot lean on the absorb middle and instead force localization
with the seed and pin. Keeping both experts present preserves the option of absorption and, more
usefully, lets us apply SGTM's exact recipe of present-in-forward plus a hard backward mask, by
hard-detaching the deployed block above a `w` threshold (see "Residual leak" below). We skip load
balancing throughout, since it suppresses the specialization we want. The one production idea we
keep is DeepSeek's always-on shared expert, which maps to our always-on deployed block.

### Why pin every step

## *SGTM (Beyond Data Filtering)* - [paper_sgtm.md](../papers/grad_routing/paper_sgtm.md)
- epistemic context: local paper note, the self-reinforcing-localization result.

> Once the model begins localizing forget knowledge based on labeled examples (where we
> explicitly mask gradients), we expect that unlabeled forget samples would naturally gravitate
> toward using forget parameters [...] confirming the self-reinforcing localization hypothesis.

The router is a learnable parameter, so reward can drift it off the hack axis. SGTM's hard mask
never stops firing; neither does the pin. The pin trains only the router (it reads frozen no-grad
activation snapshots), so it never teaches the deployed expert the hack.
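
This spec does not define the form of the pin loss. The sketch below assumes a binary cross-entropy that pushes `w` toward 1 on hack pairs and toward 0 on clean pairs, using PyTorch; `Router`, `pin_loss`, the `adapter` stand-in, and the synthetic activations are all hypothetical. Its only purpose is to show that a pin computed on no-grad snapshots sends gradient to the router and nowhere else.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8
adapter = torch.nn.Linear(d, d)            # stand-in for the trainable experts
v_act = torch.randn(d)

class Router(torch.nn.Module):
    def __init__(self, v):
        super().__init__()
        self.direction = torch.nn.Parameter(v.clone())   # seeded with v_act
        self.scale = torch.nn.Parameter(torch.tensor(4.0))
        self.bias = torch.nn.Parameter(torch.tensor(0.0))

    def forward(self, acts):
        c = F.cosine_similarity(acts, self.direction.unsqueeze(0), dim=-1)
        return torch.sigmoid(self.scale * c + self.bias)

def pin_loss(router, hack_acts, clean_acts):
    # ASSUMED form: BCE toward w=1 on hack pairs and w=0 on clean pairs.
    w_hack, w_clean = router(hack_acts), router(clean_acts)
    return (F.binary_cross_entropy(w_hack, torch.ones_like(w_hack))
            + F.binary_cross_entropy(w_clean, torch.zeros_like(w_clean)))

router = Router(v_act)
pair_inputs = torch.randn(4, d)
with torch.no_grad():                      # frozen snapshot: no graph back to the adapter
    base_acts = adapter(pair_inputs)
    hack_acts = base_acts + 2.0 * v_act    # synthetic "hack" side of each pair
    clean_acts = base_acts - 2.0 * v_act   # synthetic "clean" side

pin_loss(router, hack_acts, clean_acts).backward()
adapter_got_grad = any(p.grad is not None for p in adapter.parameters())
router_got_grad = all(p.grad is not None for p in router.parameters())
```

Because the activations carry no graph, the adapter's parameters keep `grad is None` after the pin's backward pass, while `direction`, `scale`, and `bias` all receive gradient.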

### The one conflict and the open question

A normal learned MoE router and the task loss already cooperate: the router is trained by the loss
to send each input to the expert that lowers it. Relative to that, the only thing Evil MoE adds is
the pin, so there is exactly one new conflict. The pin forces localization only on the labeled
hand-authored pairs, while reward alone decides where the unlabeled live hacks land, and reward
prefers the always-on deployed expert, which ablation cannot remove. The method works only if live hacks follow the
pinned labeled ones into the quarantine faster than reward relearns them in the deployed block.
That is SGTM's self-reinforcement bet restated, and the causal ablation is what tests it.

Residual leak: the deployed expert is only soft-detached by `(1-w)`, not hard-masked, so on a live
hack it still receives a `(1-w)` share of the hack gradient. A hard detach above a `w` threshold
would close it at no cost (the router's reward gradient flows only through the `w*quar` term),
recovering SGTM's exact recipe of present-in-forward, zero-gradient.
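
A sketch of that hard detach, assuming PyTorch, a per-rollout `w`, and a hypothetical threshold `tau = 0.7`; `forward` and `grads_from_hacky_rollouts` are illustrative names and none of this is confirmed to be in `src/vgrout/lora2r.py`. It compares the soft and hard variants on the rollouts the router scores above `tau`.

```python
import torch

torch.manual_seed(0)
n, d, r = 4, 6, 2                              # rollouts, width, per-expert rank (toy)
A = torch.randn(2 * r, d, requires_grad=True)  # rows [:r] deployed, [r:] quarantine
B = torch.randn(d, 2 * r, requires_grad=True)
X = torch.randn(n, d)                          # one pooled input per rollout
w_values = torch.tensor([0.1, 0.4, 0.8, 0.95])
tau = 0.7                                      # hypothetical hard-detach threshold

def forward(w, hard):
    dep = (X @ A[:r].T) @ B[:, :r].T           # (n, d) deployed contribution
    quar = (X @ A[r:].T) @ B[:, r:].T          # (n, d) quarantine contribution
    keep = 1 - w                               # deployed gradient share (soft detach)
    if hard:                                   # zero that share above the threshold
        keep = torch.where(w > tau, torch.zeros_like(w), keep)
    keep = keep.unsqueeze(1)
    dep = keep * dep + (1 - keep) * dep.detach()
    return dep + w.unsqueeze(1) * quar

def grads_from_hacky_rollouts(hard):
    """Backprop a dummy loss from only the rollouts with w > tau."""
    A.grad, B.grad = None, None
    w = w_values.clone().requires_grad_(True)
    forward(w, hard)[w_values > tau].sum().backward()
    return A.grad[:r].clone(), w.grad.clone()

soft_dep_grad, soft_w_grad = grads_from_hacky_rollouts(hard=False)
hard_dep_grad, hard_w_grad = grads_from_hacky_rollouts(hard=True)
```

In this toy the hard variant sends zero gradient to the deployed block from above-threshold rollouts, while the gradient reaching `w` is the same in both variants, consistent with the router's signal arriving through the `w*quar` term.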

## Evidence

Two literature searches (chat exports under `docs/brainstorm/`) bear on whether a behaviour can be
localized into one ablatable expert. The pattern: forced localization has working precedents;
emergent localization (hoping a behaviour clusters by itself) is what the evidence says fails.

Supporting the forced route:

- A "bad" behaviour collapses onto a shared low-dimensional direction. Emergent misalignment:
  narrow bad finetuning produces broadly misaligned models (Betley et al., Nature 2025,
  [arXiv:2502.17424](https://arxiv.org/abs/2502.17424)); persona vectors for evil/sycophancy/
  hallucination (Anthropic, [OpenReview 20DsUSauCj](https://openreview.net/forum?id=20DsUSauCj));
  refusal is a single direction across 13 models (Arditi et al.,
  [arXiv:2406.11717](https://arxiv.org/abs/2406.11717)). This makes `v_act`, and the broad-evil-seed
  variant, plausible.
- Seeding or steering which expert owns a behaviour is done. SteerMoE detects behaviour-experts via
  contrastive paired inputs, the same construction as our pairs
  ([arXiv:2509.09660](https://arxiv.org/abs/2509.09660)); geometric routing makes rank-1 experts
  monosemantic by construction with cosine routing, which is our exact router
  ([arXiv:2604.14434](https://arxiv.org/abs/2604.14434)); cluster-aware upcycling seeds each expert
  from an SVD subspace and inits the router to cluster centroids
  ([arXiv:2604.13508](https://arxiv.org/abs/2604.13508)).
- Deleting one expert plus light repair recovers quality. NAEE: a 6.2-point task-specific drop
  recovers to 1.6 with fine-tuning ([arXiv:2402.14800](https://arxiv.org/abs/2402.14800)); pruning to
  a single expert is feasible ([arXiv:2206.00277](https://arxiv.org/abs/2206.00277)); MoE-Pruner heals
  to 99% at 50% sparsity via expert-wise distillation ([arXiv:2410.12013](https://arxiv.org/abs/2410.12013)).
  Caveat: the repair redistributes capability, which we must not do to the hack, so our no-repair
  ablation is the harder version.
- Behavioural expert removal and a router anchor both have precedent. MoTE: disabling
  refusal-relevant experts cut refusal 52% with matched-beats-random ablation, which is the
  matched-ablation check in our UAT ([arXiv:2502.11096](https://arxiv.org/abs/2502.11096)); SEUF
  concentrates unlearning on one expert with a router anchor loss, which is our pin, and warns that
  naive unlearning disrupts routing ([arXiv:2411.18797](https://arxiv.org/abs/2411.18797)).

Headwinds (why absorption will not do it for free):

- Specialization is not automatic. A domain-invariant "standing committee" carries most routing
  mass, so specialization is "far less pervasive than believed"
  ([arXiv:2601.03425](https://arxiv.org/abs/2601.03425)); DeepSeekMoE itself concedes vanilla
  8-16-expert MoE fails to specialize from knowledge hybridity and redundancy
  ([arXiv:2401.06066](https://arxiv.org/abs/2401.06066)).
- Routing clusters by topic, not behaviour, so absorption would cluster by topic, not hackiness, and
  there is usually no natural "bad expert" ([arXiv:2605.29708](https://arxiv.org/abs/2605.29708);
  GateBreaker [arXiv:2512.21008](https://arxiv.org/abs/2512.21008)). Both are recent, unreplicated
  preprints, so weight them lightly.
- In our favour: load balancing actively suppresses specialization (NeurIPS 2025 oral,
  [arXiv:2505.22323](https://arxiv.org/abs/2505.22323)), and we omit it.

Net: each component of forced localization (extract a behaviour direction, seed and pin one expert,
ablate it) has a precedent that works, while the emergent route is exactly what the standing-committee
and topic-routing results say will not happen. So the seed and pin are load-bearing, not redundant,
and the decisive question stays empirical.

## Extensions (after the 2-expert run)

- Confident-tail pinning. Score live rollouts by `v_act`, hard-route the top/bottom quantile (a
  run-spanning buffer or EMA threshold) to quarantine/deployed, and leave the middle as the absorb
  zone where both train. This is SGTM's confident-pin design and is exactly what vGROUT routeA
  already does; it needs no learned router.
- 3+ experts. One always-on deployed plus several ablatable, to watch how absorption distributes a
  hack and to match production multi-expert usage. Needs per-mode seed directions or accepts free
  assignment, and adds an interpretation confound, so it is a follow-up, not the decisive run.
- Learned-router sharpening. Let reward improve `v_act`'s boundary over training. This is the sole
  reason to pay for a learned router over a fixed `v_act` gate; if the fixed gate already
  localizes, the learned router and its pin are unnecessary.
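
The confident-tail rule from the first bullet above can be sketched in pure Python. This assumes the run-spanning-buffer variant with a symmetric tail fraction; `tail_thresholds`, `route`, the 10% tail, and the evenly spaced toy scores are illustrative and are not taken from vGROUT routeA.

```python
def tail_thresholds(score_buffer, tail=0.1):
    """Return (lo, hi) cutoffs so roughly `tail` of buffered scores fall in each tail."""
    s = sorted(score_buffer)
    k = max(1, int(len(s) * tail))
    return s[k - 1], s[-k]

def route(score, lo, hi):
    """Hard-route the confident tails; leave the middle as the absorb zone."""
    if score >= hi:
        return "quarantine"
    if score <= lo:
        return "deployed"
    return "absorb"          # both experts train on these rollouts

# Toy buffer of v_act alignment scores from earlier rollouts (placeholder values).
buffer = [i / 20 for i in range(20)]
lo, hi = tail_thresholds(buffer, tail=0.1)
labels = [route(s, lo, hi) for s in buffer]
```

With 20 buffered scores and a 10% tail, the two highest scores go to quarantine, the two lowest to deployed, and the remaining 16 stay in the absorb zone.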

## UAT

A run supports the Evil MoE hypothesis if the report table shows:

| check | expected if working | wrong-case |
|---|---|---|
| hack ablation | hack rate lower with the hack expert off | no hack drop, or a drop only from total capability collapse |
| matched ablation | hack-expert-off reduces hacks more than random/clean-expert-off at matched capacity | any expert ablation gives the same effect |
| capability retention | solve rate mostly preserved with the hack expert off | ablation destroys normal LeetCode ability |
| routing selectivity | synthetic hack pairs route higher `w` than clean pairs | router keys on style/length/domain artifacts |
| held-out hack modes | held-out modes also depend on the hack expert | only the pinned hack template is isolated |
| boundary audit | no oracle or detector label touches train-time routing | live-rollout labels leak into the router |

Minimum evidence file: config/command; router `w` table for synthetic clean, synthetic hack, live
GRPO, and held-out hack eval; hack-rate and solve-rate table with hack expert on/off; first
train/eval batch prompts+completions so formatting artifacts are visible.
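
The three ablation rows of the table can be turned into a mechanical check. A sketch under stated assumptions: drops are absolute rate differences, the capability tolerance `max_solve_drop` is a hypothetical knob, and the report keys and all numbers are invented placeholders, not results.

```python
def uat_checks(report, max_solve_drop=0.10):
    """report maps condition -> {"hack": rate, "solve": rate} for the conditions
    "all_on", "quarantine_off", and "random_off" (matched-capacity control)."""
    on = report["all_on"]
    q_off = report["quarantine_off"]
    r_off = report["random_off"]
    hack_drop = on["hack"] - q_off["hack"]
    solve_drop = on["solve"] - q_off["solve"]
    control_drop = on["hack"] - r_off["hack"]
    return {
        # hack rate falls, and falls more than solve rate does
        "hack ablation": hack_drop > 0 and hack_drop > solve_drop,
        # quarantine-off beats a random/clean expert ablation at matched capacity
        "matched ablation": hack_drop > control_drop,
        # normal ability mostly survives (tolerance is an assumption)
        "capability retention": solve_drop <= max_solve_drop,
    }

# Placeholder numbers for illustration only.
toy_pass = {
    "all_on":         {"hack": 0.40, "solve": 0.55},
    "quarantine_off": {"hack": 0.10, "solve": 0.52},
    "random_off":     {"hack": 0.36, "solve": 0.50},
}
# Wrong-case: hacks drop only because capability collapsed.
toy_collapse = dict(toy_pass, quarantine_off={"hack": 0.10, "solve": 0.10})

verdict_pass = uat_checks(toy_pass)
verdict_collapse = uat_checks(toy_collapse)
```

The collapse case fails both the hack-ablation and capability-retention checks even though its hack rate fell, which is the wrong-case the table is meant to catch.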

## Main failure modes

1. The hack expert becomes a general coding expert, so ablating it cuts hacks by making the model
   worse (caught by capability retention + matched ablation).
2. The router keys on superficial pair artifacts: style, length, problem family (caught by routing
   selectivity + held-out modes).
3. Reward relearns the hack in the deployed expert because it is the always-on path (the residual
   leak; the hard detach is the mitigation).
4. The hack mode is absent from the pairs, so the quarantine has no seed for it and absorption does
   not catch it (the generalization limit; tested by held-out modes).

## Decision

Worth testing, with the claim at the level the evidence supports. It is a separate experiment from
SGTM absorption: an ablatable-modularity hypothesis that weak synthetic router supervision plus
soft MoE specialization can put reward-hack behaviour in a removable expert. The proof is not
lower training loss; it is causal: ablate the hack expert, and the held-out hack rate drops while
solve behaviour survives.

## Links

- [Gradient Routing local note](../papers/grad_routing/paper_gradient_routing.md), [arXiv:2410.04332](https://arxiv.org/abs/2410.04332)
- [SGTM local note](../papers/grad_routing/paper_sgtm.md)
- [MoE in Transformers (HF blog)](../papers/hf_blog_moe_transformers.md)
- [DeepSeekMoE](https://aclanthology.org/2024.acl-long.70/) (shared-expert architecture)
- [Switch Transformers](https://arxiv.org/abs/2101.03961), [ST-MoE](https://arxiv.org/abs/2202.08906) (load balancing, router z-loss, rejected here)
- adapter and router: [src/vgrout/lora2r.py](../../src/vgrout/lora2r.py), [src/vgrout/moe_router.py](../../src/vgrout/moe_router.py), loop in [src/vgrout/train_moe.py](../../src/vgrout/train_moe.py)