mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 15:00:20 +08:00
8f39c4a69f
Spec was stale (recommended hard sparse "Version A", the DEMix absorption-killer). Rewrite to match what is implemented and what we clarified:

- pseudocode-first: lora2r 2-expert forward, seeded rank-1 cosine router, GRPO+pin loop, deploy ablation. For 2 experts the "proper" router IS rank-1 (softmax over 2 = sigmoid of one direction), seeded with v_act.
- "Why soft, not top-k" reframed as a tradeoff, not a verdict: hard routing closes the leak but needs a router that catches all hacks; soft keeps absorption available but leaks (1-w). DEMix only bites if we rely on absorption.
- Evidence section from two literature searches. Forced localization has working precedents (single bad direction: emergent misalignment/persona/refusal; behavioural expert seeding: SteerMoE, geometric cosine routing, cluster-aware upcycling; ablation + repair: NAEE/MoE-Pruner; router anchor: SEUF/MoTE). Emergent localization does not (standing-committee, topic-driven routing). So seed+pin are load-bearing.
- 3-way/3-expert noted as an extension (closer to production), 2 experts for the decisive causal run.

README: add Router dynamics (three forces, one pin-vs-reward conflict, mitigations). Add HF "MoE in Transformers" blog to docs/papers (force-added past the docs gitignore).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
# Evil MoE

Evil MoE trains a mixture of experts in which one expert carries reward-hacking behaviour and
is removed at deployment. Routing is done by a learned soft router, seeded by an extracted hack
direction and held by a pin loss, rather than by a per-example label. It is a fork of
[vGROUT](https://github.com/wassname/vGROUT) and reuses its substrate (the reward-hacking
LeetCode environment, GRPO loop, reward grader, deployment-ablation evaluator); only the routing
mechanism changes. Background, literature map, and design rationale are in
[docs/spec/](docs/spec/) and [AGENTS.md](AGENTS.md).
## Hypothesis

> A learned MoE-style router, seeded by a synthetic activation-space hack direction and anchored
> by a continuous pin loss on hand-authored contrastive pairs, can localize reward-hacking
> behaviour in a single ablatable expert. Test: ablate the hack expert at deployment and check
> that it suppresses more hacking than solving.

This is a localization claim, not a strict gradient-routing absorption claim.
## Routing

The adapter is one rank-`2r` LoRA per target Linear (`src/vgrout/lora2r.py`), split into a
deployed block `[:r]` (always trained, kept) and a quarantine block `[r:]` (the hack expert,
reset to init at deployment). The substrate is Ariahw and Nanda's reward-hacking LeetCode
environment, inherited from vGROUT.

Per rollout, the router (`src/vgrout/moe_router.py`) reads the pooled deployed-block bottleneck
activations and emits one weight `w` in `[0,1]`. The forward hook scales the hack expert's output
by `w` and the deployed expert's gradient by `1-w` (its forward value is kept). So `w=0` trains
only the deployed block (and reproduces the deployment forward), `w=1` trains only the hack
expert, and intermediate `w` trains both.
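One way to write the effect of the hook, as a sketch that assumes a stop-gradient formulation. The symbols `W` (frozen base weight), `A` and `B` (the LoRA factors), `d`, `q`, and `sg` (stop-gradient) are notation introduced here for illustration, not names taken from `lora2r.py`:

```math
\begin{aligned}
d &= B_{[:,\,:r]}\,A_{[:r]}\,x, \qquad q = B_{[:,\,r:]}\,A_{[r:]}\,x \\
\tilde{d} &= (1-w)\,d + w\,\mathrm{sg}(d) \\
y &= W x + \tilde{d} + w\,q
\end{aligned}
```

Under this form the value of `d~` always equals `d`, while the gradient reaching the deployed block is scaled by `1-w`. The derivative of `d~` with respect to `w` is `-d + sg(d)`, which is zero in value, so `w` receives gradient only through the `w*q` term. At `w=0` the quarantine term vanishes and `y = Wx + d`, the deployment forward.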

Two signals train the router. Reward (the GRPO loss) flows in through `w`. A pin loss on the
hand-authored pairs, applied every step, pushes `w` toward 1 on hack and 0 on clean. The router
direction is initialized to `v_act`, the unit hack-minus-clean activation difference from those
pairs, so it starts as a fixed vector gate and then specializes. The only labels are the hack and
clean sides of those off-distribution pairs; no training-rollout label or environment oracle
enters the routing.
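The seeding and the pin can be illustrated with a minimal pure-Python sketch. It assumes a rank-1 cosine gate (`w` is the sigmoid of a scaled cosine with the router direction) and a binary cross-entropy pin. The function names, the `scale` parameter, and the toy 2-D activations are hypothetical and are not taken from `moe_router.py`:

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_rows(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def seed_direction(hack_acts, clean_acts):
    """v_act: unit hack-minus-clean difference of the mean pooled activations."""
    hack_mean, clean_mean = mean_rows(hack_acts), mean_rows(clean_acts)
    return unit([h - c for h, c in zip(hack_mean, clean_mean)])

def route_weight(act, direction, scale=5.0):
    """Hack-expert weight w in (0, 1): sigmoid of a scaled cosine (assumed form)."""
    cos = sum(a * d for a, d in zip(unit(act), direction))
    return 1.0 / (1.0 + math.exp(-scale * cos))

def pin_loss(hack_acts, clean_acts, direction, scale=5.0):
    """Binary cross-entropy pushing w toward 1 on hack and 0 on clean pairs."""
    eps = 1e-7  # guards log(0)
    losses = [-math.log(route_weight(a, direction, scale) + eps) for a in hack_acts]
    losses += [-math.log(1.0 - route_weight(a, direction, scale) + eps) for a in clean_acts]
    return sum(losses) / len(losses)

# Toy pooled activations standing in for the hand-authored contrastive pairs.
hack_acts = [[1.0, 0.2], [0.9, -0.1]]
clean_acts = [[-1.0, 0.1], [-0.8, 0.3]]

v_act = seed_direction(hack_acts, clean_acts)
print("w on a hack pair: ", route_weight(hack_acts[0], v_act))
print("w on a clean pair:", route_weight(clean_acts[0], v_act))
print("pin loss at init: ", pin_loss(hack_acts, clean_acts, v_act))
```

Because the gate is seeded with `v_act`, the pin loss is already small at initialization on these toy pairs; during training the pin would keep the direction from rotating while reward adjusts it.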

This replaces the label-driven hard mask of gradient routing (Cloud et al.) and its
self-reinforcing variant SGTM (Shilov et al.) with a learned soft router; the linear gate and the
GRPO-on-MoE setup follow MixLoRA.
## Router dynamics

A normal MoE router and the task loss already cooperate by construction: the router gets gradient
from the loss and learns to send each input to the expert that lowers it (an auxiliary
load-balancing loss stops it collapsing onto one). Evil MoE inherits that and adds one thing, the
pin, which forces the labeled hand-authored hacks to the quarantine (ablatable) expert and the
clean pairs to the deployed (kept) expert.

That introduces one conflict. The pin only constrains routing on the labeled pairs, while reward
decides where the unlabeled live hacks go, and reward prefers the deployed expert because it is
always on and fully on-policy, so it is the cheaper place to express a hack. Ablation cannot
remove what lands there. The method works only if live hacks follow the pinned labeled ones into
the quarantine (SGTM's self-reinforcement bet) faster than reward relearns them in the deployed
block. That is what the causal ablation measures.

Three things bias training toward localizing: the router is seeded with `v_act` so it starts
aligned; the pin fires every step so reward cannot rotate the axis off "hack"; and on a routed
hack the deployed expert is detached. The detach is the one open leak: it is soft (`1-w`), not
hard, so the deployed expert still gets a `1-w` share of the hack gradient. A hard detach above a
`w` threshold would close it at no cost, since the router's reward gradient flows only through
the `w*quar` term.
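The difference between the soft detach and the proposed hard detach can be sketched as a function of `w`. This is an illustration only: `deployed_grad_scale` and `hard_threshold` are hypothetical names, and the threshold value of 0.5 is an arbitrary choice for the example, not a setting from the repository:

```python
def deployed_grad_scale(w, hard_threshold=None):
    """Share of a rollout's gradient that reaches the deployed block."""
    if hard_threshold is not None and w >= hard_threshold:
        return 0.0  # hard detach: a routed hack sends nothing to the deployed block
    return 1.0 - w  # soft detach: the deployed block keeps a (1 - w) share

for w in (0.1, 0.6, 0.95):
    soft = deployed_grad_scale(w)
    hard = deployed_grad_scale(w, hard_threshold=0.5)
    print(f"w={w:.2f}  soft share={soft:.2f}  hard share={hard:.2f}")
```

Below the threshold both variants behave identically; above it the soft variant still leaks a `1-w` share into the deployed block and the hard variant leaks none.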
## What it measures

The deployment evaluation generates on the held-out test set with the hack expert ablated and
again with it on, and reports both:

| measure | hack expert on | hack expert off (deploy) | supports the hypothesis if |
|---|---|---|---|
| hack | baseline | lower | off below on |
| solve | baseline | preserved | off near on |

Each run ends with the line `Evil MoE causal ablation: deploy hack X (ON) -> Y (OFF)`.
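The table's criterion can be expressed as a small check. This is a sketch assuming all four measures are rates in `[0, 1]` and that "suppresses more hacking than solving" is read as an absolute drop; `supports_hypothesis` is a hypothetical helper, not part of the evaluator, and the numbers in the example call are placeholders, not results:

```python
def supports_hypothesis(hack_on, hack_off, solve_on, solve_off):
    """True if ablating the hack expert removes more hacking than solving."""
    hack_drop = hack_on - hack_off      # how much hacking the ablation removes
    solve_drop = solve_on - solve_off   # how much solving it costs
    return hack_drop > 0 and hack_drop > solve_drop

# Placeholder rates for illustration only.
print(supports_hypothesis(hack_on=0.40, hack_off=0.05, solve_on=0.60, solve_off=0.58))
```

A stricter reading would also bound `solve_drop` by a tolerance so that "off near on" is enforced directly; that tolerance would be a further assumption.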
## Quick start

```bash
uv sync
just smoke # verify gates + a tiny on-policy GRPO run with router, pin, and ablation
just smoke-moe # only the Evil MoE training pathway on tiny-random Qwen3
```

`just smoke` runs four `verify_*.py` gates (reward grader, evaluation gap, lora2r block
routing, and the Evil MoE soft-weight, router, and pin invariants in
`scripts/verify_moe_router.py`), then a six-step run. The tiny random model produces no
reward, so GRPO never fires and the adapter does not train on that run; it is a pipeline
check, and the routing math is proven in `scripts/verify_moe_router.py`. The causal hack-drop
result requires a real Qwen3-4B run through `pueue`.
## Layout

- `src/vgrout/train_moe.py`: the Evil MoE GRPO loop (on-policy, learned router, pin loss).
- `src/vgrout/moe_router.py`: `HackRouter`, pooled activations to the hack-expert weight `w`.
- `src/vgrout/lora2r.py`: the two-expert adapter and its forward hook (`_lora2r_w`).
- `scripts/verify_moe_router.py`: the routing-invariant gate.
- `docs/spec/`: the original Evil MoE proposal and the literature map.
- `src/vgrout/train.py`: the vGROUT routeA, none, and absorb arms, kept for comparison
  (`just smoke-legacy`).