
# Evil MoE

Evil MoE trains a mixture of experts in which one expert carries reward-hacking behaviour and
is removed at deployment. Routing is done by a learned soft router, seeded by an extracted hack
direction and held by a pin loss, rather than by a per-example label. It is a fork of
[vGROUT](https://github.com/wassname/vGROUT) and reuses its substrate (the reward-hacking
LeetCode environment, GRPO loop, reward grader, deployment-ablation evaluator); only the routing
mechanism changes. Background, literature map, and design rationale are in
[docs/spec/](docs/spec/) and [AGENTS.md](AGENTS.md).

## Hypothesis

> A learned MoE-style router, seeded by a synthetic activation-space hack direction and anchored
> by a continuous pin loss on hand-authored contrastive pairs, can localize reward-hacking
> behaviour in a single ablatable expert. Test: ablate the hack expert at deployment and check
> that it suppresses more hacking than solving.

This is a localization claim, not a strict gradient-routing absorption claim.

## Routing

The adapter is one rank-`2r` LoRA per target Linear (`src/vgrout/lora2r.py`), split into a
deployed block `[:r]` (always trained, kept) and a quarantine block `[r:]` (the hack expert,
reset to init at deployment). The substrate is Ariahw and Nanda's reward-hacking LeetCode
environment, inherited from vGROUT.

Per rollout the router (`src/vgrout/moe_router.py`) reads the pooled deployed-block bottleneck
activations and emits one weight `w` in `[0,1]`. The forward hook scales the hack expert's output
by `w` and scales the deployed expert's gradient by `1-w` while leaving its forward value
unchanged. So `w=0` trains only the deployed block (and reproduces the deployment forward),
`w=1` trains only the hack expert, and intermediate `w` trains both.
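The gating arithmetic above can be sketched with scalars. This is a minimal illustration, not the code in `src/vgrout/lora2r.py`: the function names are hypothetical, the two blocks are reduced to single numbers, and the stop-gradient is imitated by passing a frozen copy of the deployed output and taking finite differences instead of using autograd.

```python
def gated_output(dep_live, dep_frozen, quar, w):
    """Adapter output under hack weight w (scalar stand-ins for the two blocks).

    dep_frozen plays the role of a stop-gradient copy of dep_live: it has the
    same value, but perturbing dep_live does not move it.
    """
    # Forward value equals dep_live when dep_frozen == dep_live,
    # but only the (1 - w) share responds to changes in dep_live.
    deployed = (1.0 - w) * dep_live + w * dep_frozen
    # The hack expert is scaled by w in both forward value and gradient.
    return deployed + w * quar


def gradient_scales(w, dep=2.0, quar=3.0, eps=1e-6):
    """Finite-difference gradient of the output w.r.t. each block's output."""
    base = gated_output(dep, dep, quar, w)
    d_dep = (gated_output(dep + eps, dep, quar, w) - base) / eps
    d_quar = (gated_output(dep, dep, quar + eps, w) - base) / eps
    return d_dep, d_quar


if __name__ == "__main__":
    for w in (0.0, 0.25, 1.0):
        d_dep, d_quar = gradient_scales(w)
        print(f"w={w}: deployed grad ~ {d_dep:.2f}, hack-expert grad ~ {d_quar:.2f}")
```

At `w=0` the output is just the deployed block's value, which is why that setting reproduces the deployment forward.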

Two signals train the router. Reward (the GRPO loss) flows in through `w`. A pin loss on the
hand-authored pairs, applied every step, pushes `w` toward 1 on hack and 0 on clean. The router
direction is initialized to `v_act`, the unit hack-minus-clean activation difference from those
pairs, so it starts as a fixed vector gate and then specializes. The only labels are the hack and
clean sides of those off-distribution pairs; no training-rollout label or environment oracle
enters the routing.
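A rough sketch of the seeding and the pin, in plain Python. Several details here are assumptions rather than the contents of `src/vgrout/moe_router.py`: the sigmoid over a linear logit, the binary cross-entropy form of the pin, the zero bias, and every function name. The toy activations are made up for illustration.

```python
import math


def unit_hack_direction(hack_acts, clean_acts):
    """Unit vector along the mean hack-minus-clean activation difference."""
    dim = len(hack_acts[0])

    def mean(rows):
        return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]

    diff = [h - c for h, c in zip(mean(hack_acts), mean(clean_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def router_weight(direction, pooled, bias=0.0):
    """Linear gate followed by a sigmoid (assumed), giving w in [0, 1]."""
    logit = sum(d * x for d, x in zip(direction, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def pin_loss(direction, hack_acts, clean_acts, bias=0.0, eps=1e-12):
    """Assumed pin: cross-entropy pushing w to 1 on hack and 0 on clean."""
    total = 0.0
    for h in hack_acts:
        total -= math.log(router_weight(direction, h, bias) + eps)
    for c in clean_acts:
        total -= math.log(1.0 - router_weight(direction, c, bias) + eps)
    return total / (len(hack_acts) + len(clean_acts))


# Toy pooled activations for two hand-authored pairs (illustrative only).
hack = [[2.0, 0.0], [4.0, 0.0]]
clean = [[-2.0, 0.0], [-4.0, 0.0]]
v_act = unit_hack_direction(hack, clean)
```

Initializing the router direction to `v_act` means the pin loss is already low at step zero on these pairs; a direction pointing the other way would start with a much higher pin loss.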

This replaces the label-driven hard mask of gradient routing (Cloud et al.) and its
self-reinforcing variant SGTM (Shilov et al.) with a learned soft router; the linear gate and the
GRPO-on-MoE setup follow MixLoRA.

## Router dynamics

A normal MoE router and the task loss already cooperate by construction: the router gets gradient
from the loss and learns to send each input to the expert that lowers it (an auxiliary
load-balancing loss stops it collapsing onto one). Evil MoE inherits that and adds one thing, the
pin, which forces the labeled hand-authored hacks to the quarantine (ablatable) expert and the
clean pairs to the deployed (kept) expert.

That introduces one conflict. The pin only constrains routing on the labeled pairs, while reward
decides where the unlabeled live hacks go, and reward prefers the deployed expert because it is
always on and fully on-policy, so it is the cheaper place to express a hack. Ablation cannot
remove what lands there. The method works only if live hacks follow the pinned labeled ones into
the quarantine (SGTM's self-reinforcement bet) faster than reward relearns them in the deployed
block. That is what the causal ablation measures.

Three things bias it toward localizing: the router is seeded with `v_act` so it starts aligned;
the pin fires every step so reward cannot rotate the axis off "hack"; and on a routed hack the
deployed expert's gradient is scaled down by `1-w`. That detach is the one open leak: it is soft
(`1-w`), not hard, so the deployed expert still receives a `1-w` share of the hack gradient. A
hard detach above a `w` threshold would close it at no cost, since the router's reward gradient
flows only through the `w*quar` term.
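The difference between the soft detach and the proposed hard one can be written as a single gradient multiplier on the deployed expert. This is only a sketch of the idea: the function name and the example threshold of 0.5 are assumptions, and the hard variant is a proposal in the text above, not something the repository is stated to implement.

```python
def deployed_grad_scale(w, hard_threshold=None):
    """Multiplier applied to the deployed expert's gradient for one rollout.

    Soft detach (default): the deployed expert keeps a 1 - w share.
    Hard detach (hypothetical): above the threshold the share is cut to zero.
    """
    if hard_threshold is not None and w >= hard_threshold:
        return 0.0
    return 1.0 - w


if __name__ == "__main__":
    for w in (0.2, 0.9):
        soft = deployed_grad_scale(w)
        hard = deployed_grad_scale(w, hard_threshold=0.5)
        print(f"w={w}: soft share {soft:.2f}, hard share {hard:.2f}")
```

Below the threshold the two variants agree, so the hard detach changes only the rollouts the router already treats as hacks.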

## What it measures

The deployment evaluation generates on the held-out test set with the hack expert ablated and
again with it on, and reports both:

| measure | hack expert on | hack expert off (deploy) | supports the hypothesis if |
|---|---|---|---|
| hack | baseline | lower | off below on |
| solve | baseline | preserved | off near on |

Each run ends with the line `Evil MoE causal ablation: deploy hack X (ON) -> Y (OFF)`.
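One way to turn the table into a check is to compare the drop in hack rate against the drop in solve rate, following the hypothesis's wording that ablation should suppress more hacking than solving. The function below is a hypothetical helper, not part of the evaluator, and the numbers in the usage lines are placeholders chosen for illustration, not results.

```python
def supports_hypothesis(hack_on, hack_off, solve_on, solve_off):
    """True if ablating the hack expert cuts hacking, and by more than solving.

    All four arguments are rates in [0, 1] measured on the held-out test set,
    with the hack expert on (ON) and ablated (OFF, the deployment setting).
    """
    hack_drop = hack_on - hack_off
    solve_drop = solve_on - solve_off
    return hack_drop > 0 and hack_drop > solve_drop


if __name__ == "__main__":
    # Placeholder rates only: hacking falls sharply, solving barely moves.
    print(supports_hypothesis(hack_on=0.6, hack_off=0.1, solve_on=0.5, solve_off=0.45))
    # Placeholder rates only: solving falls more than hacking.
    print(supports_hypothesis(hack_on=0.6, hack_off=0.55, solve_on=0.5, solve_off=0.2))
```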

## Quick start

```bash
uv sync
just smoke         # verify gates + a tiny on-policy GRPO run with router, pin, and ablation
just smoke-moe     # only the Evil MoE training pathway on tiny-random Qwen3
```

`just smoke` runs four `verify_*.py` gates (reward grader, evaluation gap, lora2r block
routing, and the Evil MoE soft-weight, router, and pin invariants in
`scripts/verify_moe_router.py`), then a six-step run. The tiny random model produces no
reward, so GRPO never fires and the adapter does not train on that run; the run is only a
pipeline check, and the routing math is checked by `scripts/verify_moe_router.py`. The causal
hack-drop result requires a real Qwen3-4B run through `pueue`.

## Layout

- `src/vgrout/train_moe.py`: the Evil MoE GRPO loop (on-policy, learned router, pin loss).
- `src/vgrout/moe_router.py`: `HackRouter`, pooled activations to the hack-expert weight `w`.
- `src/vgrout/lora2r.py`: the two-expert adapter and its forward hook (`_lora2r_w`).
- `scripts/verify_moe_router.py`: the routing-invariant gate.
- `docs/spec/`: the original Evil MoE proposal and the literature map.
- `src/vgrout/train.py`: the vGROUT routeA, none, and absorb arms, kept for comparison
  (`just smoke-legacy`).