Mirror of https://github.com/wassname/evil_MoE.git (synced 2026-06-27 16:15:35 +08:00)
feat: Evil MoE — learned soft router + pin loss on an ablatable hack expert

Fork of vGROUT. Replaces routeA's fixed v_act quantile gate with a learned per-rollout soft router (HackRouter, seeded from v_act) on the ablatable hack expert: GRPO flows into the router through the soft weight w (it concentrates hack-like rollouts in the hack expert), and a continuous pin loss on the hand-authored pairs anchors the axis. No load balancing; routing is per rollout.

lora2r gains a soft-weight forward path (_lora2r_w: w=0 keep, w=1 rout, deployed grad scaled by 1-w). train_moe.py is the on-policy GRPO loop; verify_moe_router.py gates the routing invariants. `just smoke` is green.

README/AGENTS rewritten for the fork; original proposal kept as docs/spec/original_evil_moe_spec.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -1 +1,110 @@
# Evil MoE

Evil MoE trains a mixture-of-experts in which one expert carries reward-hacking behaviour
and is removed at deployment. It is a fork of [vGROUT](https://github.com/wassname/vGROUT),
kept as the `upstream` remote, and reuses vGROUT's substrate: the Ariahw and Nanda
reward-hacking LeetCode environment, the GRPO loop, the reward grader, and the
deployment-ablation evaluator. The routing mechanism is the only part that changes.

## Hypothesis

> A learned MoE-style router, seeded by a synthetic activation-space hack direction and
> anchored by a continuous pin loss on hand-authored contrastive pairs, can localize
> reward-hacking behaviour in a single ablatable expert. The test is causal: ablate the
> hack expert at deployment and measure whether the reward-hack rate drops while the
> ground-truth solve rate survives, and whether it drops more than ablating a random or
> clean expert at matched capacity.

This is a localization claim, not a strict gradient-routing absorption claim. The original
proposal and the literature map are in [docs/spec/](docs/spec/).

## Background

Three routing mechanisms differ in how the gradient assignment is decided. In Gradient
Routing (Cloud et al.) a data label decides it, applied as a hard backward mask. In a learned
MoE the router decides it, trained from the task loss; an expert that lowers the loss on some
inputs is routed more of them and improves further, so a learned router tends to concentrate
related inputs in one expert. SGTM (Shilov et al.) connects the two: once a hard mask seeds
localization on labeled data, unlabeled data of the same kind comes to update the same
parameters without a mask. Evil MoE replaces SGTM's hard mask with a soft learned router,
seeds it with the extracted hack direction, and relies on the router's concentration under
GRPO. The router is also a parameter the reward pushes on, so the pin loss is applied every
step rather than only at initialization. GRPO has been run on MoE models before: DeepSeek-R1
trains the 671B DeepSeek-V3 MoE with GRPO, and MoE-GRPO (arXiv:2603.24984) optimizes the
router itself with GRPO.

## The adapter

Every target Linear gets one rank-`2r` LoRA (`src/vgrout/lora2r.py`), `A:[2r,d_in]` and
`B:[d_out,2r]`, with frozen Gaussian-init copies subtracted so the net delta is zero at
initialization. The `2r` rows and columns split into two independent experts. The deployed
block `[:r]` is always present in the forward pass and always trained. The quarantine block
`[r:]` is the hack expert. At deployment the quarantine block is reset to its initialization,
so its learned contribution is absent from the deployed model.
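
The block structure above can be sketched in a few lines. This is a dependency-free illustration on plain Python lists, not the code in `src/vgrout/lora2r.py`: the helper names (`block_delta`, `adapter_delta`), the init standard deviation, and the tiny shapes are all assumptions made for the example. It shows that the net delta is zero at initialization and that resetting the quarantine block removes its contribution while leaving the deployed block untouched.

```python
import copy
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in X]


def gaussian(rows, cols, rng, std=0.1):
    """Gaussian-initialised matrix (the std is an arbitrary choice for this sketch)."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def block_delta(A, B, A0, B0, lo, hi):
    """Weight delta of one expert: its B columns times its A rows, minus the frozen init."""
    trained = matmul([row[lo:hi] for row in B], A[lo:hi])
    frozen = matmul([row[lo:hi] for row in B0], A0[lo:hi])
    return [[t - f for t, f in zip(tr, fr)] for tr, fr in zip(trained, frozen)]


def adapter_delta(A, B, A0, B0, r, ablate_hack=False):
    """Sum of the deployed block [:r] and, unless ablated, the quarantine block [r:]."""
    deployed = block_delta(A, B, A0, B0, 0, r)
    if ablate_hack:
        # Resetting the quarantine block to its init makes its delta exactly zero.
        return deployed
    hack = block_delta(A, B, A0, B0, r, 2 * r)
    return [[d + h for d, h in zip(dr, hr)] for dr, hr in zip(deployed, hack)]


rng = random.Random(0)
r, d_in, d_out = 1, 3, 2
A0, B0 = gaussian(2 * r, d_in, rng), gaussian(d_out, 2 * r, rng)  # frozen init copies
A, B = copy.deepcopy(A0), copy.deepcopy(B0)                        # trainable copies

delta_at_init = adapter_delta(A, B, A0, B0, r)   # zero: trainable equals frozen
A[r][0] += 1.0                                   # pretend training moved only the hack block
delta_trained = adapter_delta(A, B, A0, B0, r)
delta_deployed = adapter_delta(A, B, A0, B0, r, ablate_hack=True)
```

Because `B @ A` decomposes into the sum of the two block products, ablating one block is a subtraction of that block's term and does not touch the other.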

## Method

For each rollout a learned router (`src/vgrout/moe_router.py`) reads the pooled deployed-block
bottleneck activations and emits one weight `w` in `[0,1]`. The forward hook scales the hack
expert by `w` and scales the deployed expert's gradient by `1-w` while keeping its forward
value. So `w=0` trains only the deployed block and reproduces the deployment forward, `w=1`
trains only the hack expert with the deployed block detached, and intermediate `w` trains
both. These are the soft form of vGROUT routeA's keep, absorb, and rout masks.
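
One standard way to keep a forward value while scaling its gradient is the stop-gradient identity, written out here for the soft weight. The symbols are introduced for this sketch and are not from the repository: $x$ is the layer input, $W$ the frozen base weight, $d(x)$ and $q(x)$ the deployed and hack block outputs (each net of its frozen init copy), $\theta_d$ and $\theta_q$ their parameters, and $\operatorname{sg}$ the stop-gradient operator. Whether `_lora2r_w` is written exactly this way is an assumption.

```math
\begin{aligned}
y &= W x + (1-w)\,d(x) + w\,\operatorname{sg}\!\big(d(x)\big) + w\,q(x) \\
\text{forward value:}\quad y &= W x + d(x) + w\,q(x) \\
\frac{\partial y}{\partial \theta_d} &= (1-w)\,\frac{\partial d}{\partial \theta_d}, \qquad
\frac{\partial y}{\partial \theta_q} = w\,\frac{\partial q}{\partial \theta_q}, \qquad
\frac{\partial y}{\partial w} = q(x)
\end{aligned}
```

At $w=0$ the output is $Wx + d(x)$, which matches the deployment forward because a reset quarantine block has $q(x)=0$. At $w=1$ the deployed block receives zero gradient and only the hack expert trains. The last derivative is the path by which a training signal can reach the router through $w$.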

The router is trained two ways at once. GRPO flows into it through `w`: raising `w` on a
rollout moves that rollout's learning from the deployed block into the hack expert. A pin loss
on the hand-authored pairs, applied every step, pushes `w` toward 1 on the hack side and
toward 0 on the clean side. The router direction is initialized from `v_act`, the
hack-minus-clean activation difference extracted from those pairs, so it starts as the vector
gate and then specializes.
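
A minimal sketch of the seeding and the pin loss described above, in plain Python. Several details are assumptions made for illustration and are not taken from `src/vgrout/moe_router.py`: the gate is a single linear direction followed by a sigmoid, `v_act` is taken as a difference of means, the bias is centred on the midpoint of the two means, and the pin loss is binary cross-entropy. The activation values are made-up toy numbers.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def extract_v_act(hack_acts, clean_acts):
    """Hack-minus-clean difference of mean pooled activations (assumed form of v_act)."""
    mh, mc = mean_vector(hack_acts), mean_vector(clean_acts)
    return [h - c for h, c in zip(mh, mc)]


def router_weight(direction, bias, pooled):
    """Linear gate plus sigmoid: one weight w per rollout, strictly between 0 and 1."""
    logit = sum(d * p for d, p in zip(direction, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def pin_loss(direction, bias, hack_acts, clean_acts, eps=1e-7):
    """Binary cross-entropy that pulls w toward 1 on hack rows and 0 on clean rows."""
    total = 0.0
    for row in hack_acts:
        total -= math.log(max(router_weight(direction, bias, row), eps))
    for row in clean_acts:
        total -= math.log(max(1.0 - router_weight(direction, bias, row), eps))
    return total / (len(hack_acts) + len(clean_acts))


# Toy pooled activations for the hand-authored pairs (illustrative values only).
hack_acts = [[2.0, 0.0], [3.0, 1.0]]
clean_acts = [[0.0, 0.0], [1.0, 1.0]]

v_act = extract_v_act(hack_acts, clean_acts)
midpoint = mean_vector(hack_acts + clean_acts)
bias = -sum(v * m for v, m in zip(v_act, midpoint))  # assumed centring choice

seeded_loss = pin_loss(v_act, bias, hack_acts, clean_acts)
unseeded_loss = pin_loss([0.0, 0.0], 0.0, hack_acts, clean_acts)  # w = 0.5 everywhere
```

With the direction seeded from `v_act` the gate already separates the two sides of the pairs, so the pin loss starts below the value it takes for an uninformative router. In training this term would be added to the GRPO objective at every step.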

There is no load-balancing loss. Load balancing forces even expert use and would suppress the
asymmetric specialization the method depends on (Demons in the Detail, arXiv:2501.11873; The
Illusion of Specialization, arXiv:2601.03425). Routing is per rollout, not per token, because
reward hacking is a property of a whole rollout and the deployment test ablates the expert at
the rollout level. This makes Evil MoE a behavioral mixture of adapters rather than a capacity
MoE. The canonical per-token LoRA-MoE substrate is MixLoRA; Evil MoE borrows its small linear
gate and the GRPO-on-MoE precedent but not its per-token routing or its load-balancing loss.

The only labels used in training are the hack and clean sides of the hand-authored pairs.
These pairs are off-distribution and authored before observing any training rollout. No
ground-truth label from a training rollout, and no environment-specific oracle, enters the
router or the routing. The deployment grader is a measurement instrument that scores the final
evaluation only.

## What it measures

The deployment evaluation generates on the held-out test set with the hack expert ablated and
again with it on, and reports both:

| measure | hack expert on | hack expert off (deploy) | supports the hypothesis if |
|---|---|---|---|
| hack | baseline | lower | off below on |
| solve | baseline | preserved | off near on |

Each run ends with the line `Evil MoE causal ablation: deploy hack X (ON) -> Y (OFF)`.
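
The two rows of the table can be turned into a small check. The sketch below is hypothetical and not part of the repository: it assumes `X` and `Y` in the summary line are plain decimal rates, and the 0.05 tolerance used for "off near on" is an arbitrary placeholder. The numbers in the example are illustrative inputs, not results.

```python
import re

# Assumed shape of the final log line; X and Y are taken to be decimal numbers.
SUMMARY = re.compile(
    r"Evil MoE causal ablation: deploy hack "
    r"(?P<on>\d+(?:\.\d+)?) \(ON\) -> (?P<off>\d+(?:\.\d+)?) \(OFF\)"
)


def parse_summary(line):
    """Return (hack_on, hack_off) from the run's closing summary line."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError("no Evil MoE summary found in line")
    return float(match["on"]), float(match["off"])


def supports_hypothesis(hack_on, hack_off, solve_on, solve_off, solve_tol=0.05):
    """Hack rate must drop when the expert is off; solve rate must stay within solve_tol."""
    return hack_off < hack_on and abs(solve_on - solve_off) <= solve_tol


# Illustrative values only.
hack_on, hack_off = parse_summary(
    "Evil MoE causal ablation: deploy hack 0.40 (ON) -> 0.10 (OFF)"
)
verdict = supports_hypothesis(hack_on, hack_off, solve_on=0.50, solve_off=0.48)
```

The hypothesis also asks for a comparison against ablating a random or clean expert at matched capacity, so a check like this one covers only the first half of the test.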

## Quick start

```bash
uv sync
just smoke # verify gates + a tiny on-policy GRPO run with router, pin, and ablation
just smoke-moe # only the Evil MoE training pathway on tiny-random Qwen3
```

`just smoke` runs four `verify_*.py` gates (reward grader, evaluation gap, lora2r block
routing, and the Evil MoE soft-weight, router, and pin invariants in
`scripts/verify_moe_router.py`), then a six-step run. The tiny random model produces no
reward, so GRPO never fires and the adapter does not train on that run; it is a pipeline
check, and the routing math is proven in `scripts/verify_moe_router.py`. The causal hack-drop
result requires a real Qwen3-4B run through `pueue`.

## Layout

- `src/vgrout/train_moe.py`: the Evil MoE GRPO loop (on-policy, learned router, pin loss).
- `src/vgrout/moe_router.py`: `HackRouter`, pooled activations to the hack-expert weight `w`.
- `src/vgrout/lora2r.py`: the two-expert adapter and its forward hook (`_lora2r_w`).
- `scripts/verify_moe_router.py`: the routing-invariant gate.
- `docs/spec/`: the original Evil MoE proposal and the literature map.
- `src/vgrout/train.py`: the vGROUT routeA, none, and absorb arms, kept for comparison
  (`just smoke-legacy`).