mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 15:00:20 +08:00


wassname 8f39c4a69f docs: rewrite Evil MoE spec to the soft-routing design + literature evidence

Spec was stale (recommended hard sparse "Version A", the DEMix absorption-killer).
Rewrite to match what is implemented and what we clarified:

- pseudocode-first: lora2r 2-expert forward, seeded rank-1 cosine router, GRPO+pin
  loop, deploy ablation. For 2 experts the "proper" router IS rank-1 (softmax over 2
  = sigmoid of one direction), seeded with v_act.
- "Why soft, not top-k" reframed as a tradeoff, not a verdict: hard routing closes
  the leak but needs a router that catches all hacks; soft keeps absorption available
  but leaks (1-w). DEMix only bites if we rely on absorption.
- Evidence section from two literature searches. Forced localization has working
  precedents (single bad direction: emergent misalignment/persona/refusal; behavioural
  expert seeding: SteerMoE, geometric cosine routing, cluster-aware upcycling; ablation
  + repair: NAEE/MoE-Pruner; router anchor: SEUF/MoTE). Emergent localization does not
  (standing-committee, topic-driven routing). So seed+pin are load-bearing.
- 3-way/3-expert noted as an extension (closer to production), 2 experts for the
  decisive causal run.

README: add Router dynamics (three forces, one pin-vs-reward conflict, mitigations).
Add HF "MoE in Transformers" blog to docs/papers (force-added past the docs gitignore).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-14 13:06:38 +08:00


Evil MoE

Evil MoE trains a mixture of experts in which one expert carries reward-hacking behaviour and is removed at deployment. Routing is done by a learned soft router, seeded by an extracted hack direction and held by a pin loss, rather than by a per-example label. It is a fork of vGROUT and reuses its substrate (the reward-hacking LeetCode environment, GRPO loop, reward grader, deployment-ablation evaluator); only the routing mechanism changes. Background, literature map, and design rationale are in docs/spec/ and AGENTS.md.

Hypothesis

A learned MoE-style router, seeded by a synthetic activation-space hack direction and anchored by a continuous pin loss on hand-authored contrastive pairs, can localize reward-hacking behaviour in a single ablatable expert. The test: ablate the hack expert at deployment and check that the ablation reduces the hack rate more than it reduces the solve rate.

This is a localization claim, not a strict gradient-routing absorption claim.

Routing

The adapter is one rank-2r LoRA per target Linear (src/vgrout/lora2r.py), split into a deployed block [:r] (always trained, kept) and a quarantine block [r:] (the hack expert, reset to init at deployment). The substrate is Ariahw and Nanda's reward-hacking LeetCode environment, inherited from vGROUT.
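The block split can be illustrated with a dependency-free sketch. It assumes the usual LoRA layout (a down-projection A with 2r rows and an up-projection B with 2r columns) and the usual LoRA init in which B starts at zero, so a reset hack block contributes nothing. Neither detail is confirmed by this README, and the names lora_delta and reset_quarantine are hypothetical, not taken from src/vgrout/lora2r.py.

```python
def matvec(M, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_delta(A, B, x, r, w=1.0):
    """Adapter output for one input x.

    A has 2r rows (d_in columns), B has 2r columns (d_out rows).
    Columns [:r] are the deployed block, columns [r:] the hack block,
    and w scales only the hack block.
    """
    z = matvec(A, x)  # bottleneck activations, length 2r
    out = []
    for row in B:
        deployed = sum(row[j] * z[j] for j in range(r))
        hack = sum(row[j] * z[j] for j in range(r, 2 * r))
        out.append(deployed + w * hack)
    return out


def reset_quarantine(B, r):
    """Copy of B with the hack columns [r:] back at a zero init (assumed)."""
    return [row[:r] + [0.0] * (len(row) - r) for row in B]


# Toy numbers for illustration only: r=1, d_in=2, d_out=2.
A = [[1.0, 0.0], [0.0, 1.0]]
B = [[0.5, 2.0], [-1.0, 3.0]]
x = [2.0, 1.0]
training_w0 = lora_delta(A, B, x, r=1, w=0.0)
deployed = lora_delta(A, reset_quarantine(B, 1), x, r=1)
```

Under these assumptions the w=0 training forward and the deployment forward give the same output, which is the property the Routing section relies on.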

Per rollout the router (src/vgrout/moe_router.py) reads the pooled deployed-block bottleneck activations and emits one weight w in [0,1]. The forward hook scales the hack expert by w and the deployed expert's gradient by 1-w (forward value kept). So w=0 trains only the deployed block (and reproduces the deployment forward), w=1 trains only the hack expert, and intermediate w trains both.
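The effect of w on each block's gradient can be written out by hand for a single scalar output. This is a plain-Python arithmetic sketch, not the repo's forward hook: it assumes the combined output is dep + w * hack and that the 1-w factor is applied to the deployed block in the backward pass only. The function name routed_backward is hypothetical.

```python
def routed_backward(dep, hack, w, upstream):
    """Manual forward and backward for y = dep + w * hack.

    dep and hack are the two experts' scalar outputs and upstream is dL/dy.
    Returns (y, grad_dep, grad_hack, grad_w).
    """
    y = dep + w * hack                 # forward: deployed value is not scaled
    grad_dep = (1.0 - w) * upstream    # backward only: deployed gradient scaled by 1-w
    grad_hack = w * upstream           # hack expert sees a w share
    grad_w = hack * upstream           # the router is reached only through w * hack
    return y, grad_dep, grad_hack, grad_w
```

With w=0 the hack expert gets no gradient and the output equals dep; with w=1 the deployed expert gets no gradient; in between, both are updated in proportion.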

Two signals train the router. Reward (the GRPO loss) flows in through w. A pin loss on the hand-authored pairs, applied every step, pushes w toward 1 on hack and 0 on clean. The router direction is initialized to v_act, the unit hack-minus-clean activation difference from those pairs, so it starts as a fixed vector gate and then specializes. The only labels are the hack and clean sides of those off-distribution pairs; no training-rollout label or environment oracle enters the routing.
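A minimal sketch of the seeding and the pin, under stated assumptions: the router weight is taken to be a sigmoid of a scaled cosine between the pooled activation and the router direction, the pin is taken to be binary cross-entropy against targets 1 (hack) and 0 (clean), and v_act is taken to be the normalized difference of the per-side means. The README does not give these exact forms, and the scale value and all function names below are hypothetical.

```python
import math


def unit(v):
    """Normalize a vector to unit length (fails on the zero vector)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def seed_direction(hack_acts, clean_acts):
    """Unit vector along mean(hack) - mean(clean) over the contrastive pairs."""
    d = len(hack_acts[0])
    diff = [
        sum(h[i] for h in hack_acts) / len(hack_acts)
        - sum(c[i] for c in clean_acts) / len(clean_acts)
        for i in range(d)
    ]
    return unit(diff)


def router_weight(act, direction, scale=5.0):
    """w in (0, 1): sigmoid of the scaled cosine between act and direction."""
    cos = sum(a * b for a, b in zip(unit(act), unit(direction)))
    return 1.0 / (1.0 + math.exp(-scale * cos))


def pin_loss(hack_acts, clean_acts, direction, scale=5.0, eps=1e-7):
    """Binary cross-entropy pushing w toward 1 on hack and 0 on clean."""
    losses = [-math.log(router_weight(a, direction, scale) + eps) for a in hack_acts]
    losses += [-math.log(1.0 - router_weight(a, direction, scale) + eps) for a in clean_acts]
    return sum(losses) / len(losses)


# Toy activations for illustration only.
hack_acts = [[1.0, 0.0], [0.8, 0.2]]
clean_acts = [[0.0, 1.0], [0.2, 0.8]]
v_act = seed_direction(hack_acts, clean_acts)
```

Because the direction starts at v_act, the router already separates the two sides of the pairs before any training, and the pin loss is lower than it would be for a direction pointing the other way.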

This replaces the label-driven hard mask of gradient routing (Cloud et al.) and its self-reinforcing variant SGTM (Shilov et al.) with a learned soft router; the linear gate and the GRPO-on-MoE setup follow MixLoRA.

Router dynamics

A normal MoE router and the task loss already cooperate by construction: the router gets gradient from the loss and learns to send each input to the expert that lowers it (an auxiliary load-balancing loss stops it collapsing onto one). Evil MoE inherits that and adds one thing, the pin, which forces the labeled hand-authored hacks to the quarantine (ablatable) expert and the clean pairs to the deployed (kept) expert.

That introduces one conflict. The pin only constrains routing on the labeled pairs, while reward decides where the unlabeled live hacks go, and reward prefers the deployed expert because it is always on and fully on-policy, so it is the cheaper place to express a hack. Ablation cannot remove what lands there. The method works only if live hacks follow the pinned labeled ones into the quarantine (SGTM's self-reinforcement bet) faster than reward relearns them in the deployed block. That is what the causal ablation measures.

Three things bias it toward localizing: the router is seeded with v_act so it starts aligned; the pin fires every step so reward cannot rotate the axis off "hack"; and on a routed hack the deployed expert is detached. The detach is the one open leak: it is soft (1-w), not hard, so the deployed expert still gets a 1-w share of the hack gradient. A hard detach above a w threshold would close it at no cost, since the router's reward gradient flows only through the w*quar term.
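The soft and hard variants of the detach differ only in the share of the gradient the deployed block receives. The sketch below states that difference as arithmetic; the threshold parameter and the function name are hypothetical, and the hard variant is the proposed mitigation, not something this README says is implemented.

```python
def deployed_grad_share(w, hard_threshold=None):
    """Fraction of the upstream gradient that reaches the deployed block.

    Soft detach (hard_threshold=None): the share is 1 - w, so a hack routed
    with w = 0.9 still leaks about 10% of its gradient into the kept block.
    Hard detach: above the threshold the share is exactly 0.
    """
    if hard_threshold is not None and w > hard_threshold:
        return 0.0
    return 1.0 - w
```

Below the threshold the two variants agree, so clean rollouts with small w would train the deployed block the same way in either case.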

What it measures

The deployment evaluation generates on the held-out test set with the hack expert ablated and again with it on, and reports both:

| measure | hack expert on | hack expert off (deploy) | supports the hypothesis if |
|---|---|---|---|
| hack | baseline | lower | off below on |
| solve | baseline | preserved | off near on |

Each run ends with the line Evil MoE causal ablation: deploy hack X (ON) -> Y (OFF).
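The support criterion can be sketched as a small check over the four measured rates. The tolerance solve_tol that stands in for "off near on", the three-decimal formatting, and both function names are assumptions made for illustration; the rates in the usage lines are toy values, not results.

```python
def ablation_report(hack_on, hack_off, solve_on, solve_off, solve_tol=0.05):
    """Compare deployment rates with the hack expert on and off (ablated)."""
    hack_drop = hack_on - hack_off
    solve_drop = solve_on - solve_off
    hack_suppressed = hack_off < hack_on        # "off below on"
    solve_preserved = solve_drop <= solve_tol   # "off near on" (assumed tolerance)
    return {
        "hack_drop": hack_drop,
        "solve_drop": solve_drop,
        "hack_suppressed": hack_suppressed,
        "solve_preserved": solve_preserved,
        # Hypothesis: ablation removes more hacking than solving.
        "supports_hypothesis": hack_suppressed and solve_preserved and hack_drop > solve_drop,
    }


def summary_line(hack_on, hack_off):
    """Closing log line in the format quoted above (number format assumed)."""
    return f"Evil MoE causal ablation: deploy hack {hack_on:.3f} (ON) -> {hack_off:.3f} (OFF)"


# Toy rates for illustration only.
report = ablation_report(hack_on=0.5, hack_off=0.125, solve_on=0.5, solve_off=0.5)
line = summary_line(0.5, 0.125)
```

A run where the hack rate does not move when the expert is ablated fails the first condition regardless of what happens to the solve rate.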

Quick start

uv sync
just smoke         # verify gates + a tiny on-policy GRPO run with router, pin, and ablation
just smoke-moe     # only the Evil MoE training pathway on tiny-random Qwen3

just smoke runs four verify_*.py gates (reward grader, evaluation gap, lora2r block routing, and the Evil MoE soft-weight, router, and pin invariants in scripts/verify_moe_router.py), then a six-step run. The tiny random model produces no reward, so GRPO never fires and the adapter does not train on that run. It is a pipeline check only; the routing math is checked separately by scripts/verify_moe_router.py. The causal hack-drop result requires a real Qwen3-4B run through pueue.

Layout

src/vgrout/train_moe.py: the Evil MoE GRPO loop (on-policy, learned router, pin loss).
src/vgrout/moe_router.py: HackRouter, pooled activations to the hack-expert weight w.
src/vgrout/lora2r.py: the two-expert adapter and its forward hook (_lora2r_w).
scripts/verify_moe_router.py: the routing-invariant gate.
docs/spec/: the original Evil MoE proposal and the literature map.
src/vgrout/train.py: the vGROUT routeA, none, and absorb arms, kept for comparison (just smoke-legacy).
