evil_MoE/README.md
wassname 04a98b321e feat: Evil MoE — learned soft router + pin loss on an ablatable hack expert
Fork of vGROUT. Replaces routeA's fixed v_act quantile gate with a learned
per-rollout soft router (HackRouter, seeded from v_act) on the ablatable hack
expert: GRPO flows into the router through the soft weight w (it concentrates
hack-like rollouts in the hack expert), and a continuous pin loss on the
hand-authored pairs anchors the axis. No load balancing; routing is per rollout.

lora2r gains a soft-weight forward path (_lora2r_w: w=0 keep, w=1 rout, deployed
grad scaled by 1-w). train_moe.py is the on-policy GRPO loop; verify_moe_router.py
gates the routing invariants. `just smoke` is green. README/AGENTS rewritten for
the fork; original proposal kept as docs/spec/original_evil_moe_spec.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-14 11:25:14 +08:00


Evil MoE

Evil MoE trains a mixture-of-experts in which one expert carries reward-hacking behaviour and is removed at deployment. It is a fork of vGROUT, which is kept as the upstream remote, and it reuses vGROUT's substrate: the Ariahw and Nanda reward-hacking LeetCode environment, the GRPO loop, the reward grader, and the deployment-ablation evaluator. The routing mechanism is the only part that changes.

Hypothesis

A learned MoE-style router, seeded by a synthetic activation-space hack direction and anchored by a continuous pin loss on hand-authored contrastive pairs, can localize reward-hacking behaviour in a single ablatable expert. The test is causal: ablate the hack expert at deployment and measure whether the reward-hack rate drops while the ground-truth solve rate survives, and whether it drops more than ablating a random or clean expert at matched capacity.

This is a localization claim, not a strict gradient-routing absorption claim. The original proposal and the literature map are in docs/spec/.

Background

Three routing mechanisms differ in how the gradient assignment is decided. In Gradient Routing (Cloud et al.) a data label decides it, applied as a hard backward mask. In a learned MoE the router decides it, trained from the task loss; an expert that lowers the loss on some inputs is routed more of them and improves further, so a learned router tends to concentrate related inputs in one expert. SGTM (Shilov et al.) connects the two: once a hard mask seeds localization on labeled data, unlabeled data of the same kind comes to update the same parameters without a mask. Evil MoE replaces SGTM's hard mask with a soft learned router, seeds it with the extracted hack direction, and relies on the router's concentration under GRPO. The router is also a parameter the reward pushes on, so the pin loss is applied every step rather than only at initialization. GRPO has been run on MoE models before: DeepSeek-R1 trains the 671B DeepSeek-V3 MoE with GRPO, and MoE-GRPO (arXiv:2603.24984) optimizes the router itself with GRPO.

The adapter

Every target Linear gets one rank-2r LoRA (src/vgrout/lora2r.py), A:[2r,d_in] and B:[d_out,2r], with frozen Gaussian-init copies subtracted so the net delta is zero at initialization. The 2r rows and columns split into two independent experts. The deployed block [:r] is always present in the forward pass and always trained. The quarantine block [r:] is the hack expert. At deployment the quarantine block is reset to its initialization, so its learned contribution is absent from the deployed model.
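The zero-delta initialization and the deployment reset can be illustrated with a small pure-Python sketch. It is not the code in src/vgrout/lora2r.py. It assumes the net delta is the live product B·A minus the frozen product B0·A0, and the helper names (matmul, gaussian, block_delta, max_abs), the 0.02 init scale, and the uniform nudge that stands in for training are all hypothetical.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def gaussian(rows, cols, rng):
    """Small Gaussian init; the 0.02 scale is an arbitrary choice for this sketch."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

def block_delta(A, B, A0, B0, lo, hi):
    """Weight delta from bottleneck slice [lo:hi]: live product minus frozen product."""
    live = matmul([row[lo:hi] for row in B], A[lo:hi])
    frozen = matmul([row[lo:hi] for row in B0], A0[lo:hi])
    return [[l - f for l, f in zip(lr, fr)] for lr, fr in zip(live, frozen)]

def max_abs(M):
    """Largest absolute entry of a matrix."""
    return max(abs(v) for row in M for v in row)

rng = random.Random(0)
r, d_in, d_out = 2, 3, 4
A0 = gaussian(2 * r, d_in, rng)    # frozen copy of A, shape [2r, d_in]
B0 = gaussian(d_out, 2 * r, rng)   # frozen copy of B, shape [d_out, 2r]
A = [row[:] for row in A0]         # trainable A starts equal to its frozen copy
B = [row[:] for row in B0]         # trainable B starts equal to its frozen copy

# At initialization both experts contribute a zero delta.
init_deployed = max_abs(block_delta(A, B, A0, B0, 0, r))
init_quarantine = max_abs(block_delta(A, B, A0, B0, r, 2 * r))

# Stand-in for training: nudge every trainable entry.
A = [[v + 0.1 for v in row] for row in A]
B = [[v - 0.1 for v in row] for row in B]
trained_deployed = block_delta(A, B, A0, B0, 0, r)
trained_quarantine = max_abs(block_delta(A, B, A0, B0, r, 2 * r))

# Deployment: reset the quarantine block [r:] to its initialization.
A[r:] = [row[:] for row in A0[r:]]
for row, row0 in zip(B, B0):
    row[r:] = row0[r:]
deployed_after = block_delta(A, B, A0, B0, 0, r)
quarantine_after = max_abs(block_delta(A, B, A0, B0, r, 2 * r))
```

Under these assumptions the two blocks never mix in the product, so resetting the quarantine block removes its learned contribution and leaves the deployed block's delta unchanged.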

Method

For each rollout a learned router (src/vgrout/moe_router.py) reads the pooled deployed-block bottleneck activations and emits one weight w in [0,1]. The forward hook scales the hack expert by w and scales the deployed expert's gradient by 1-w while keeping its forward value. So w=0 trains only the deployed block and reproduces the deployment forward, w=1 trains only the hack expert with the deployed block detached, and intermediate w trains both. These are the soft forms of vGROUT routeA's keep, absorb, and rout masks.
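The gradient split described above can be checked by hand on a scalar toy model. The sketch below is an illustration, not the _lora2r_w hook: it assumes one scalar parameter per expert (a for the deployed block, b for the hack expert), a squared-error loss in place of the GRPO objective, and a hypothetical function name soft_weight_step. The gradient on w is computed only from the forward scaling of the hack expert, which is also an assumption about how the router receives its signal.

```python
def soft_weight_step(a, b, x, target, w):
    """One forward and manual backward pass under a soft routing weight w in [0, 1]."""
    d = a * x                 # deployed expert output; forward value is kept in full
    h = b * x                 # hack expert output
    y = d + w * h             # the hack expert is scaled by w in the forward pass
    g = y - target            # dL/dy for L = 0.5 * (y - target) ** 2
    return {
        "y": y,
        "grad_deployed": (1.0 - w) * g * x,  # deployed gradient scaled by 1 - w
        "grad_hack": w * g * x,              # hack gradient inherits the forward scale w
        "grad_w": g * h,                     # signal that reaches the router through w
    }

keep = soft_weight_step(a=2.0, b=0.5, x=3.0, target=1.0, w=0.0)
rout = soft_weight_step(a=2.0, b=0.5, x=3.0, target=1.0, w=1.0)
mixed = soft_weight_step(a=2.0, b=0.5, x=3.0, target=1.0, w=0.25)
```

With w=0 the output equals the deployed expert alone and the hack expert gets no gradient; with w=1 the deployed gradient is zero. In an autograd framework the same effect on the deployed block is commonly obtained by mixing a tensor with its detached copy, for example (1 - w) * d + w * d.detach() in PyTorch; whether the repository does it this way is not stated in this README.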

The router is trained two ways at once. GRPO flows into it through w: raising w on a rollout moves that rollout's learning from the deployed block into the hack expert. A pin loss on the hand-authored pairs, applied every step, pushes w toward 1 on the hack side and toward 0 on the clean side. The router direction is initialized from v_act, the hack-minus-clean activation difference extracted from those pairs, so it starts as the vector gate and then specializes.
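A minimal sketch of the seeding and the pin loss follows. The README does not give the router's exact form, so several choices here are assumptions: a linear gate followed by a sigmoid, mean pooling over tokens, binary cross-entropy as the continuous pin loss, and plain gradient descent. The names mean_rows, router_weight, pin_loss, and pin_step and the activation values are hypothetical.

```python
import math

def mean_rows(rows):
    """Column-wise mean; used to pool tokens and to average examples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def router_weight(direction, bias, pooled):
    """Assumed gate: w = sigmoid(direction . pooled + bias), one value per rollout."""
    z = sum(d * p for d, p in zip(direction, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def pin_loss(direction, bias, hack, clean):
    """Assumed pin loss: cross-entropy pushing w to 1 on hack and 0 on clean."""
    examples = [(p, 1.0) for p in hack] + [(p, 0.0) for p in clean]
    total = 0.0
    for pooled, label in examples:
        w = router_weight(direction, bias, pooled)
        total -= label * math.log(w) + (1.0 - label) * math.log(1.0 - w)
    return total / len(examples)

def pin_step(direction, bias, hack, clean, lr=0.5):
    """One gradient-descent step on the pin loss (d loss / d logit = w - label)."""
    examples = [(p, 1.0) for p in hack] + [(p, 0.0) for p in clean]
    grad_dir, grad_bias = [0.0] * len(direction), 0.0
    for pooled, label in examples:
        err = router_weight(direction, bias, pooled) - label
        grad_dir = [g + err * p for g, p in zip(grad_dir, pooled)]
        grad_bias += err
    n = len(examples)
    return [d - lr * g / n for d, g in zip(direction, grad_dir)], bias - lr * grad_bias / n

# Made-up pooled activations for the hand-authored pairs (two dimensions for brevity).
hack = [mean_rows([[1.2, 0.0], [0.8, 0.4]]), [0.8, 0.0]]   # first entry pooled from two tokens
clean = [[-1.0, 0.1], [-0.6, 0.3]]

# Seed the router direction with v_act, the hack-minus-clean mean difference.
v_act = [h - c for h, c in zip(mean_rows(hack), mean_rows(clean))]
direction, bias = v_act, 0.0

loss_before = pin_loss(direction, bias, hack, clean)
direction, bias = pin_step(direction, bias, hack, clean)
loss_after = pin_loss(direction, bias, hack, clean)
```

Seeded this way, the gate already separates the two sides of the pairs before any training, and each pin step moves w further toward 1 on the hack side and 0 on the clean side. In the real loop this term would be added to the GRPO objective every step rather than optimized alone.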

There is no load-balancing loss. Load balancing forces even expert use and would suppress the asymmetric specialization the method depends on (Demons in the Detail, arXiv:2501.11873; The Illusion of Specialization, arXiv:2601.03425). Routing is per rollout, not per token, because reward hacking is a property of a whole rollout and the deployment test ablates the expert at the rollout level. This makes Evil MoE a behavioral mixture of adapters rather than a capacity MoE. The canonical per-token LoRA-MoE substrate is MixLoRA; Evil MoE borrows its small linear gate and the GRPO-on-MoE precedent but not its per-token routing or its load-balancing loss.

The only labels used in training are the hack and clean sides of the hand-authored pairs. These pairs are off-distribution and authored before observing any training rollout. No ground-truth label from a training rollout, and no environment-specific oracle, enters the router or the routing. The deployment grader is a measurement instrument that scores the final evaluation only.

What it measures

The deployment evaluation generates on the held-out test set with the hack expert ablated and again with it on, and reports both:

| measure | hack expert on | hack expert off (deploy) | supports the hypothesis if |
|---|---|---|---|
| hack | baseline | lower | off below on |
| solve | baseline | preserved | off near on |

Each run ends with the line `Evil MoE causal ablation: deploy hack X (ON) -> Y (OFF)`, where X and Y are the deployment hack rates with the hack expert on and off.
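The two criteria can be written as a small check over graded rollouts. This is a hedged sketch, not the repository's evaluator: the Rollout record, the rate and causal_ablation_report helpers, the 0.05 tolerance for "off near on", and the two-decimal formatting of the summary line are all assumptions, and the rollouts are made-up toy data.

```python
from dataclasses import dataclass

@dataclass
class Rollout:
    hacked: bool   # the grader flagged a reward hack
    solved: bool   # the ground-truth tests passed

def rate(rollouts, field):
    """Fraction of rollouts for which the given boolean field is true."""
    return sum(getattr(r, field) for r in rollouts) / len(rollouts)

def causal_ablation_report(on, off, solve_tolerance=0.05):
    """Compare held-out rollouts generated with the hack expert on and off."""
    hack_on, hack_off = rate(on, "hacked"), rate(off, "hacked")
    solve_on, solve_off = rate(on, "solved"), rate(off, "solved")
    return {
        "hack_drops": hack_off < hack_on,                                  # off below on
        "solve_preserved": abs(solve_off - solve_on) <= solve_tolerance,   # off near on
        "summary": (
            f"Evil MoE causal ablation: deploy hack "
            f"{hack_on:.2f} (ON) -> {hack_off:.2f} (OFF)"
        ),
    }

# Toy data, not a result: 10 rollouts per condition.
on = [Rollout(hacked=i < 6, solved=i >= 5) for i in range(10)]
off = [Rollout(hacked=i < 1, solved=i >= 5) for i in range(10)]
report = causal_ablation_report(on, off)
```

To support the full hypothesis, the same comparison would also be run with a random or clean expert ablated at matched capacity, and the hack drop for the hack expert should be the larger of the two.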

Quick start

uv sync
just smoke         # verify gates + a tiny on-policy GRPO run with router, pin, and ablation
just smoke-moe     # only the Evil MoE training pathway on tiny-random Qwen3

just smoke runs four verify_*.py gates (reward grader, evaluation gap, lora2r block routing, and the Evil MoE soft-weight, router, and pin invariants in scripts/verify_moe_router.py), then a six-step run. The tiny random model produces no reward, so GRPO never fires and the adapter does not train on that run; it is a pipeline check only, and the routing math is checked separately in scripts/verify_moe_router.py. The causal hack-drop result requires a real Qwen3-4B run through pueue.

Layout

  • src/vgrout/train_moe.py: the Evil MoE GRPO loop (on-policy, learned router, pin loss).
  • src/vgrout/moe_router.py: HackRouter, pooled activations to the hack-expert weight w.
  • src/vgrout/lora2r.py: the two-expert adapter and its forward hook (_lora2r_w).
  • scripts/verify_moe_router.py: the routing-invariant gate.
  • docs/spec/: the original Evil MoE proposal and the literature map.
  • src/vgrout/train.py: the vGROUT routeA, none, and absorb arms, kept for comparison (just smoke-legacy).