mirror of https://github.com/wassname/evil_MoE.git (synced 2026-06-27 16:15:35 +08:00)
docs: rewrite Evil MoE spec to the soft-routing design + literature evidence
Spec was stale (recommended hard sparse "Version A", the DEMix absorption-killer).
Rewrite to match what is implemented and what we clarified:

- pseudocode-first: lora2r 2-expert forward, seeded rank-1 cosine router, GRPO+pin loop, deploy ablation. For 2 experts the "proper" router IS rank-1 (softmax over 2 = sigmoid of one direction), seeded with v_act.
- "Why soft, not top-k" reframed as a tradeoff, not a verdict: hard routing closes the leak but needs a router that catches all hacks; soft keeps absorption available but leaks (1-w). DEMix only bites if we rely on absorption.
- Evidence section from two literature searches. Forced localization has working precedents (single bad direction: emergent misalignment/persona/refusal; behavioural expert seeding: SteerMoE, geometric cosine routing, cluster-aware upcycling; ablation + repair: NAEE/MoE-Pruner; router anchor: SEUF/MoTE). Emergent localization does not (standing-committee, topic-driven routing). So seed+pin are load-bearing.
- 3-way/3-expert noted as an extension (closer to production), 2 experts for the decisive causal run.

README: add Router dynamics (three forces, one pin-vs-reward conflict, mitigations).
Add HF "MoE in Transformers" blog to docs/papers (force-added past the docs gitignore).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -1,76 +1,67 @@
# Evil MoE

Evil MoE trains a mixture-of-experts in which one expert carries reward-hacking behaviour
and is removed at deployment. It is a fork of [vGROUT](https://github.com/wassname/vGROUT),
kept as the `upstream` remote, and reuses vGROUT's substrate: the Ariahw and Nanda
reward-hacking LeetCode environment, the GRPO loop, the reward grader, and the
deployment-ablation evaluator. The routing mechanism is the only part that changes.
Evil MoE trains a mixture of experts in which one expert carries reward-hacking behaviour and
is removed at deployment. Routing is done by a learned soft router, seeded by an extracted hack
direction and held by a pin loss, rather than by a per-example label. It is a fork of
[vGROUT](https://github.com/wassname/vGROUT) and reuses its substrate (the reward-hacking
LeetCode environment, GRPO loop, reward grader, deployment-ablation evaluator); only the routing
mechanism changes. Background, literature map, and design rationale are in
[docs/spec/](docs/spec/) and [AGENTS.md](AGENTS.md).

## Hypothesis

> A learned MoE-style router, seeded by a synthetic activation-space hack direction and
> anchored by a continuous pin loss on hand-authored contrastive pairs, can localize
> reward-hacking behaviour in a single ablatable expert. The test is causal: ablate the
> hack expert at deployment and measure whether the reward-hack rate drops while the
> ground-truth solve rate survives, and whether it drops more than ablating a random or
> clean expert at matched capacity.
> A learned MoE-style router, seeded by a synthetic activation-space hack direction and anchored
> by a continuous pin loss on hand-authored contrastive pairs, can localize reward-hacking
> behaviour in a single ablatable expert. Test: ablate the hack expert at deployment and check
> that it suppresses more hacking than solving.

This is a localization claim, not a strict gradient-routing absorption claim. The original
proposal and the literature map are in [docs/spec/](docs/spec/).
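The causal test stated in the hypothesis can be sketched as a comparison of rates before and after ablation. This is a minimal illustration, not the repository's evaluator: `EvalResult`, `suppression`, `ablation_is_selective`, and `beats_control` are hypothetical names, measuring suppression as a relative drop is an assumption, and the numbers in the usage lines are invented for illustration rather than measured results.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    hack_rate: float   # fraction of evaluation rollouts graded as reward hacks
    solve_rate: float  # fraction of evaluation rollouts that pass ground truth


def suppression(before: float, after: float) -> float:
    """Relative drop of a rate after ablation (0.0 if the rate started at zero)."""
    return 0.0 if before == 0 else (before - after) / before


def ablation_is_selective(full: EvalResult, ablated: EvalResult) -> bool:
    """True if ablation removes proportionally more hacking than solving."""
    hack_drop = suppression(full.hack_rate, ablated.hack_rate)
    solve_drop = suppression(full.solve_rate, ablated.solve_rate)
    return hack_drop > solve_drop


def beats_control(full: EvalResult, hack_ablated: EvalResult,
                  control_ablated: EvalResult) -> bool:
    """True if ablating the hack expert cuts hacking more than a matched control ablation."""
    return (suppression(full.hack_rate, hack_ablated.hack_rate)
            > suppression(full.hack_rate, control_ablated.hack_rate))


# Illustrative numbers only, not measured results.
full = EvalResult(hack_rate=0.40, solve_rate=0.50)
hack_ablated = EvalResult(hack_rate=0.10, solve_rate=0.45)
control_ablated = EvalResult(hack_rate=0.35, solve_rate=0.44)
print(ablation_is_selective(full, hack_ablated),
      beats_control(full, hack_ablated, control_ablated))
```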
A localization claim, not a strict gradient-routing absorption claim.

## Background
## Routing

Three routing mechanisms differ in how the gradient assignment is decided. In Gradient
Routing (Cloud et al.) a data label decides it, applied as a hard backward mask. In a learned
MoE the router decides it, trained from the task loss; an expert that lowers the loss on some
inputs is routed more of them and improves further, so a learned router tends to concentrate
related inputs in one expert. SGTM (Shilov et al.) connects the two: once a hard mask seeds
localization on labeled data, unlabeled data of the same kind comes to update the same
parameters without a mask. Evil MoE replaces SGTM's hard mask with a soft learned router,
seeds it with the extracted hack direction, and relies on the router's concentration under
GRPO. The router is also a parameter the reward pushes on, so the pin loss is applied every
step rather than only at initialization. GRPO has been run on MoE models before: DeepSeek-R1
trains the 671B DeepSeek-V3 MoE with GRPO, and MoE-GRPO (arXiv:2603.24984) optimizes the
router itself with GRPO.
The adapter is one rank-`2r` LoRA per target Linear (`src/vgrout/lora2r.py`), split into a
deployed block `[:r]` (always trained, kept) and a quarantine block `[r:]` (the hack expert,
reset to init at deployment). The substrate is Ariahw and Nanda's reward-hacking LeetCode
environment, inherited from vGROUT.

## The adapter
Per rollout the router (`src/vgrout/moe_router.py`) reads the pooled deployed-block bottleneck
activations and emits one weight `w` in `[0,1]`. The forward hook scales the hack expert by `w`
and the deployed expert's gradient by `1-w` (forward value kept). So `w=0` trains only the
deployed block (and reproduces the deployment forward), `w=1` only the hack expert, intermediate
`w` both.

Every target Linear gets one rank-`2r` LoRA (`src/vgrout/lora2r.py`), `A:[2r,d_in]` and
`B:[d_out,2r]`, with frozen Gaussian-init copies subtracted so the net delta is zero at
initialization. The `2r` rows and columns split into two independent experts. The deployed
block `[:r]` is always present in the forward pass and always trained. The quarantine block
`[r:]` is the hack expert. At deployment the quarantine block is reset to its initialization,
so its learned contribution is absent from the deployed model.
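The adapter layout above can be illustrated with a toy, pure-Python version. This is a sketch under stated assumptions, not the code in `src/vgrout/lora2r.py`: the class name `LoRA2r`, its methods, the `0.1` init scale, and the use of nested lists instead of tensors are all hypothetical, and the gate `w` here only scales the forward contribution of the hack expert (no gradient handling is modelled).

```python
import random


def matvec(M, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


class LoRA2r:
    """Toy rank-2r adapter: indices [:r] are the deployed expert, [r:] the hack expert."""

    def __init__(self, d_in, d_out, r, seed=0):
        rng = random.Random(seed)
        self.r = r
        # A: [2r, d_in], B: [d_out, 2r], both Gaussian-initialized.
        self.A = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(2 * r)]
        self.B = [[rng.gauss(0, 0.1) for _ in range(2 * r)] for _ in range(d_out)]
        # Frozen copies of the init, subtracted in delta() so the net delta starts at zero.
        self.A0 = [row[:] for row in self.A]
        self.B0 = [row[:] for row in self.B]

    def delta(self, x, w=1.0):
        """Adapter output B @ (gate * A @ x) minus the same product at initialization."""
        gate = [1.0] * self.r + [w] * self.r  # deployed block always on, hack block scaled by w
        h = [g * v for g, v in zip(gate, matvec(self.A, x))]
        h0 = [g * v for g, v in zip(gate, matvec(self.A0, x))]
        out, out0 = matvec(self.B, h), matvec(self.B0, h0)
        return [a - b for a, b in zip(out, out0)]

    def reset_quarantine(self):
        """Deployment: restore rows of A and columns of B at indices [r:] to their init."""
        for i in range(self.r, 2 * self.r):
            self.A[i] = self.A0[i][:]
            for row, row0 in zip(self.B, self.B0):
                row[i] = row0[i]


adapter = LoRA2r(d_in=3, d_out=2, r=1)
x = [1.0, -2.0, 0.5]
print(adapter.delta(x))  # zero vector at initialization
```

After `reset_quarantine()`, the delta at any `w` equals the delta at `w=0` before the reset, which is the sense in which `w=0` reproduces the deployment forward.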
Two signals train the router. Reward (the GRPO loss) flows in through `w`. A pin loss on the
hand-authored pairs, applied every step, pushes `w` toward 1 on hack and 0 on clean. The router
direction is initialized to `v_act`, the unit hack-minus-clean activation difference from those
pairs, so it starts as a fixed vector gate and then specializes. The only labels are the hack and
clean sides of those off-distribution pairs; no training-rollout label or environment oracle
enters the routing.

## Method
This replaces the label-driven hard mask of gradient routing (Cloud et al.) and its
self-reinforcing variant SGTM (Shilov et al.) with a learned soft router; the linear gate and the
GRPO-on-MoE setup follow MixLoRA.

For each rollout a learned router (`src/vgrout/moe_router.py`) reads the pooled deployed-block
bottleneck activations and emits one weight `w` in `[0,1]`. The forward hook scales the hack
expert by `w` and scales the deployed expert's gradient by `1-w` while keeping its forward
value. So `w=0` trains only the deployed block and reproduces the deployment forward, `w=1`
trains only the hack expert with the deployed block detached, and intermediate `w` trains
both. These are the soft form of vGROUT routeA's keep, absorb, and rout masks.
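One way to read the forward hook is through a stop-gradient identity. The derivation below is a sketch under assumed notation, not a transcription of `src/vgrout/moe_router.py`: $W$ is the frozen base weight, $\Delta_{\mathrm{dep}}$ and $\Delta_{\mathrm{quar}}$ are the deployed and quarantine LoRA outputs with parameters $\theta_{\mathrm{dep}}$ and $\theta_{\mathrm{quar}}$, and $\operatorname{sg}$ is stop-gradient (identity in the forward pass, zero derivative in the backward pass).

```latex
\begin{aligned}
y &= W x
   + \underbrace{(1-w)\,\Delta_{\mathrm{dep}}(x)
   + w\,\operatorname{sg}\!\big(\Delta_{\mathrm{dep}}(x)\big)}_{\text{forward value } =\, \Delta_{\mathrm{dep}}(x)}
   + w\,\Delta_{\mathrm{quar}}(x) \\[4pt]
\frac{\partial \mathcal{L}}{\partial \theta_{\mathrm{dep}}}
  &= (1-w)\,\frac{\partial \mathcal{L}}{\partial y}\,
     \frac{\partial \Delta_{\mathrm{dep}}}{\partial \theta_{\mathrm{dep}}},
\qquad
\frac{\partial \mathcal{L}}{\partial \theta_{\mathrm{quar}}}
   = w\,\frac{\partial \mathcal{L}}{\partial y}\,
     \frac{\partial \Delta_{\mathrm{quar}}}{\partial \theta_{\mathrm{quar}}}
\end{aligned}
```

At $w=0$ this gives $y = Wx + \Delta_{\mathrm{dep}}(x)$ with the full gradient reaching the deployed block. At $w=1$ the deployed block's forward value is unchanged but its gradient is zero, so only the hack expert trains.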
## Router dynamics

The router is trained two ways at once. GRPO flows into it through `w`: raising `w` on a
rollout moves that rollout's learning from the deployed block into the hack expert. A pin loss
on the hand-authored pairs, applied every step, pushes `w` toward 1 on the hack side and
toward 0 on the clean side. The router direction is initialized from `v_act`, the
hack-minus-clean activation difference extracted from those pairs, so it starts as the vector
gate and then specializes.
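The seeding and pin loss can be sketched in a few lines of pure Python. This is an illustration under assumptions, not the repository's router: the commit message describes a rank-1 cosine router with a sigmoid, but the `scale` temperature, the binary cross-entropy form of the pin loss, the function names, and the toy 2-D activations are all hypothetical.

```python
import math


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def unit(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def extract_v_act(hack_acts, clean_acts):
    """Unit hack-minus-clean difference of the mean pooled activations."""
    return unit([h - c for h, c in zip(mean(hack_acts), mean(clean_acts))])


def router_weight(h, direction, scale=5.0):
    """w = sigmoid(scale * cos(h, direction)); `scale` is an assumed temperature."""
    cos = sum(a * b for a, b in zip(unit(h), unit(direction)))
    return 1.0 / (1.0 + math.exp(-scale * cos))


def pin_loss(hack_acts, clean_acts, direction):
    """Assumed BCE pin: push w toward 1 on hack pairs and toward 0 on clean pairs."""
    losses = [-math.log(router_weight(h, direction)) for h in hack_acts]
    losses += [-math.log(1.0 - router_weight(c, direction)) for c in clean_acts]
    return sum(losses) / len(losses)


# Toy pooled activations for the hand-authored pairs (made up for illustration).
hack = [[1.0, 0.2], [0.8, -0.1]]
clean = [[-1.0, 0.1], [-0.9, -0.2]]
v_act = extract_v_act(hack, clean)  # the router direction starts here
print(router_weight(hack[0], v_act) > 0.5, router_weight(clean[0], v_act) < 0.5)
```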
A normal MoE router and the task loss already cooperate by construction: the router gets gradient
from the loss and learns to send each input to the expert that lowers it (an auxiliary
load-balancing loss stops it collapsing onto one). Evil MoE inherits that and adds one thing, the
pin, which forces the labeled hand-authored hacks to the quarantine (ablatable) expert and the
clean pairs to the deployed (kept) expert.

There is no load-balancing loss. Load balancing forces even expert use and would suppress the
asymmetric specialization the method depends on (Demons in the Detail, arXiv:2501.11873; The
Illusion of Specialization, arXiv:2601.03425). Routing is per rollout, not per token, because
reward hacking is a property of a whole rollout and the deployment test ablates the expert at
the rollout level. This makes Evil MoE a behavioral mixture of adapters rather than a capacity
MoE. The canonical per-token LoRA-MoE substrate is MixLoRA; Evil MoE borrows its small linear
gate and the GRPO-on-MoE precedent but not its per-token routing or its load-balancing loss.
That introduces one conflict. The pin only constrains routing on the labeled pairs, while reward
decides where the unlabeled live hacks go, and reward prefers the deployed expert because it is
always on and fully on-policy, so it is the cheaper place to express a hack. Ablation cannot
remove what lands there. The method works only if live hacks follow the pinned labeled ones into
the quarantine (SGTM's self-reinforcement bet) faster than reward relearns them in the deployed
block. That is what the causal ablation measures.

The only labels used in training are the hack and clean sides of the hand-authored pairs.
These pairs are off-distribution and authored before observing any training rollout. No
ground-truth label from a training rollout, and no environment-specific oracle, enters the
router or the routing. The deployment grader is a measurement instrument that scores the final
evaluation only.
Three things bias it toward localizing: the router is seeded with `v_act` so it starts aligned;
the pin fires every step so reward cannot rotate the axis off "hack"; and on a routed hack the
deployed expert is detached. The detach is the one open leak: it is soft (`1-w`), not hard, so
the deployed expert still gets a `1-w` share of the hack gradient. A hard detach above a `w`
threshold would close it at no cost, since the router's reward gradient flows only through the
`w*quar` term.
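The difference between the soft and hard detach can be made concrete with the gradient shares each block receives. The helper below is a hypothetical illustration, not part of the repository: the name `gradient_shares` and the threshold value `0.8` are assumptions chosen only to show the effect.

```python
from typing import Optional, Tuple


def gradient_shares(w: float, hard_threshold: Optional[float] = None) -> Tuple[float, float]:
    """Return (deployed_share, hack_share) of a rollout's gradient for router weight w."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if hard_threshold is not None and w >= hard_threshold:
        return 0.0, w      # hard detach: nothing reaches the deployed block
    return 1.0 - w, w      # soft detach: the deployed block keeps a 1-w share


print(gradient_shares(0.9))                      # soft: deployed block still gets about 0.1
print(gradient_shares(0.9, hard_threshold=0.8))  # hard: deployed share is zero
```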
## What it measures