evil_MoE/docs/spec/original_evil_moe_spec.md
wassname 8f39c4a69f docs: rewrite Evil MoE spec to the soft-routing design + literature evidence
Spec was stale (recommended hard sparse "Version A", the DEMix absorption-killer).
Rewrite to match what is implemented and what we clarified:

- pseudocode-first: lora2r 2-expert forward, seeded rank-1 cosine router, GRPO+pin
  loop, deploy ablation. For 2 experts the "proper" router IS rank-1 (softmax over 2
  = sigmoid of one direction), seeded with v_act.
- "Why soft, not top-k" reframed as a tradeoff, not a verdict: hard routing closes
  the leak but needs a router that catches all hacks; soft keeps absorption available
  but leaks (1-w). DEMix only bites if we rely on absorption.
- Evidence section from two literature searches. Forced localization has working
  precedents (single bad direction: emergent misalignment/persona/refusal; behavioural
  expert seeding: SteerMoE, geometric cosine routing, cluster-aware upcycling; ablation
  + repair: NAEE/MoE-Pruner; router anchor: SEUF/MoTE). Emergent localization does not
  (standing-committee, topic-driven routing). So seed+pin are load-bearing.
- 3-way/3-expert noted as an extension (closer to production), 2 experts for the
  decisive causal run.

README: add Router dynamics (three forces, one pin-vs-reward conflict, mitigations).
Add HF "MoE in Transformers" blog to docs/papers (force-added past the docs gitignore).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-14 13:06:38 +08:00


Evil MoE spec

Status: core implemented, just smoke green; decisive Qwen3-4B run pending.

BLUF

Evil MoE is a 2-expert ablatable mixture of adapters. One expert (the quarantine block of a rank-2r LoRA) is steered to carry reward-hacking behaviour and is reset to its initialization at deployment. A soft router, seeded with an activation-space hack direction v_act and held in place by a pin loss on hand-authored contrastive pairs, decides for each rollout how much of that rollout's GRPO gradient trains the quarantine versus the always-on deployed expert. Success is causal: ablating the quarantine drops the reward-hack rate more than it drops the solve rate, and more than ablating a random or clean expert at matched capacity.

This is an ablatable-localization claim, not a strict Gradient-Routing absorption claim.

Oracle-free constraint

AGENTS.md - project constraint

  • epistemic context: standing repo instruction from the user, included to make the experimental boundary explicit.

The env's eval grader / full detector suite is an ORACLE (ground truth for this LeetCode env). Using it at TRAIN time -- to gate routing, set a threshold, or label student rollouts -- is cheating. It may only score the final deploy eval.

OUR setup is v_act -> routing: extract a hack direction from hand-built synthetic contrastive pairs (off-distribution, authored by us), then route the live GRPO gradient by alignment to it. The only labels anywhere are on the pairs we wrote; no detector runs over student rollouts at train time. Generalization is tested by whether v_act (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs.

The GRPO reward (passing the gameable tests) is the RL training signal and is allowed; the forbidden thing is the ORACLE hack-label, which detects true hacks and only scores the final eval.
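One way to picture the v_act extraction step is a difference of class means over pooled activations of the hand-authored pairs. The spec does not say which estimator is used, so the difference-of-means choice, the function name `hack_direction`, and the toy vectors below are all assumptions for illustration, not the repo's implementation.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def hack_direction(hack_acts, clean_acts):
    """Unit-norm difference of means: mean(hack) - mean(clean).

    ASSUMPTION: difference of means is one common estimator for an
    activation-space direction; the spec only says a direction is extracted.
    """
    mean_hack = mean_vector(hack_acts)
    mean_clean = mean_vector(clean_acts)
    diff = [h - c for h, c in zip(mean_hack, mean_clean)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("hack and clean means coincide; no direction to extract")
    return [d / norm for d in diff]

# Toy pooled activations (hypothetical values, 3 dimensions, 2 pairs).
hack_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
clean_acts = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]
v_act = hack_direction(hack_acts, clean_acts)
```

Only the hand-authored pairs enter this computation, which is what keeps the step on the allowed side of the oracle-free boundary: no student rollout is labelled.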

Design

Two experts live in one rank-2r LoRA per target Linear (src/vgrout/lora2r.py): a deployed block [:r] (always on, kept at deploy) and a quarantine block [r:] (the hack expert, reset to its initialization at deploy). Frozen init copies are subtracted so the net adapter delta is zero at init.

The forward keeps both experts present. This is the load-bearing choice (see "Why soft" below):

def lora2r_forward(x, w):                 # w in [0,1] from the router, per rollout
    dep  = B[:, :r] @ (A[:r] @ x)         # deployed expert  (init copy subtracted)
    quar = B[:, r:] @ (A[r:] @ x)         # quarantine (hack) expert
    dep  = (1 - w)*dep + w*dep.detach()   # value intact, gradient scaled by (1 - w)
    return base(x) + dep + w*quar          # BOTH stay in the forward; no top-k select
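
The detach trick in the forward above can be checked by writing out the gradients. The symbols here are shorthand introduced for this note and are not in the source: d and q are the deployed and quarantine outputs, θ_d and θ_q their adapter parameters, sg(·) is stop-gradient (`.detach()`), and the products stand for the corresponding Jacobian contractions when the outputs are vectors.

```latex
\begin{aligned}
y &= f_{\text{base}}(x) + \underbrace{(1-w)\,d + w\,\operatorname{sg}(d)}_{\text{value}\;=\;d} + w\,q \\[4pt]
\frac{\partial \mathcal{L}}{\partial \theta_d}
  &= (1-w)\,\frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial d}{\partial \theta_d} \\[4pt]
\frac{\partial \mathcal{L}}{\partial \theta_q}
  &= w\,\frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial q}{\partial \theta_q} \\[4pt]
\frac{\partial \mathcal{L}}{\partial w}
  &= \frac{\partial \mathcal{L}}{\partial y}\,\bigl(\operatorname{sg}(d) - d + q\bigr)
   = \frac{\partial \mathcal{L}}{\partial y}\,q
\end{aligned}
```

So the deployed block keeps its full forward value while its gradient is scaled by (1 - w), the quarantine block is scaled by w in both value and gradient, and any gradient reaching w arrives only through the w*quar term because sg(d) - d is zero in value.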

The router is a seeded rank-1 gate. For 2 experts a standard Linear(d, n_experts) + softmax router collapses to a single direction (softmax over 2 logits = sigmoid of their difference), so the proper 2-expert router IS rank-1. We seed that direction with v_act and cosine-normalize it so the logit stays O(1) and the sigmoid does not start saturated:

direction = v_act.flatten()               # seeded, trainable; scale and bias also trainable
def router(acts):                         # acts pooled over completion tokens, per rollout
    return sigmoid(scale * cos(acts, direction) + bias)    # w in [0,1]
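
A small numerical check of the two claims above: that a 2-way softmax equals a sigmoid of the logit difference, and that cosine normalization keeps the logit bounded by the scale. This is a stdlib-only sketch; the default `scale=4.0` and `bias=0.0` are assumed values chosen for illustration, and zero-norm inputs are not handled.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax2(a, b):
    """Numerically stable softmax over two logits."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)  # assumes neither vector is all zeros

def router(acts, direction, scale=4.0, bias=0.0):
    """Rank-1 cosine gate: w in (0, 1). scale and bias are assumed values."""
    return sigmoid(scale * cosine(acts, direction) + bias)

# Softmax over two logits is the sigmoid of their difference.
p_quar, p_dep = softmax2(1.3, -0.4)

# The logit depends on direction only, not on activation magnitude.
direction = [1.0, 0.0]
w_aligned = router([2.0, 0.0], direction)      # cos = +1
w_orthogonal = router([0.0, 5.0], direction)   # cos = 0
w_opposed = router([-3.0, 0.0], direction)     # cos = -1
```

Because the cosine lies in [-1, 1], the logit stays within bias ± scale whatever the activation norm, which is the sense in which the sigmoid does not start saturated.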

Training routes by w and re-anchors the router every step:

for step in range(steps):
    students = generate(prompts)                 # on-policy rollouts
    R        = env_reward(students)              # GRPO reward (the RL signal, not the oracle)
    acts     = pooled_acts(students)             # no-grad capture for the router
    w        = router(acts)                      # per rollout
    set_lora2r_w(w)
    grpo_loss(students, R).backward()            # gradient routed by w through the forward
    (lambda_pin * router.pin(hack_pairs, clean_pairs)).backward()   # SGTM anchor, EVERY step
    opt.step()                                   # base frozen; A, B, router train

Deployment ablation resets the quarantine to its init and evaluates the held-out test set with the hack expert on and off, reporting hack and solve for each.

Why these choices

Why soft routing, not top-k

Gradient Routing - paper_gradient_routing.md

  • epistemic context: local paper note, author's absorption mechanism and the DEMix comparison.

To explain these observations, we posit absorption: (i) routing limited data to a region creates units of computation or features that are relevant to a broader task; (ii) these units then participate in the model's predictions on related, non-routed data, reducing prediction errors on these data, so that (iii) the features are not learned elsewhere. [...] separating the experts (a) reduces the sample sizes on which they learn and prevents generalization between them and (b) does not allow for absorption, which requires that all features are present at the time of the forward pass.

Step (ii) is the condition: absorption only suppresses relearning elsewhere if the expert is present in the forward pass on the related data. Hard expert selection removes the non-selected expert from the forward (DEMix), leaving that path out of the graph. But this only bites if we rely on absorption to catch hacks the router misses. If the router generalizes and sends every hack to the quarantine, each hack is present at that expert by construction and hard routing is clean. So hard versus soft is a tradeoff, not a verdict: hard routing closes the leak (the deployed expert never sees hack gradient) but needs a router that catches all hacks; soft routing keeps absorption available to catch the router's misses but leaks a (1-w) share into the deployed expert.

We choose soft for two reasons. First, production MoE (Switch top-1, Mixtral top-2, DeepSeek top-k) is hard-routed in both training and inference and works anyway, but it serves a goal we do not share: capability per FLOP. Production never deletes an expert, so a behaviour smeared across several experts is harmless there. Our goal is deletion, which needs clean ownership. Second, and decisively, the evidence below says absorption will not volunteer the behavioural clustering for free, because routing clusters by topic. We therefore cannot lean on the absorb middle and instead force localization with the seed and pin. Keeping both experts present preserves the option of absorption and, more usefully, lets us apply SGTM's exact recipe of present-in-forward plus a hard backward mask, via the hard detach above a w threshold. We skip load balancing throughout, since it suppresses the specialization we want. The one production idea we keep is DeepSeek's always-on shared expert, which maps to our always-on deployed block.

Why pin every step

SGTM (Beyond Data Filtering) - paper_sgtm.md

  • epistemic context: local paper note, the self-reinforcing-localization result.

Once the model begins localizing forget knowledge based on labeled examples (where we explicitly mask gradients), we expect that unlabeled forget samples would naturally gravitate toward using forget parameters [...] confirming the self-reinforcing localization hypothesis.

The router is a learnable parameter, so reward can drift it off the hack axis. SGTM's hard mask never stops firing; neither does the pin. The pin trains only the router (it reads frozen no-grad activation snapshots), so it never teaches the deployed expert the hack.
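
The spec does not give the functional form of the pin loss. A plausible minimal form, assumed here purely for illustration, is a binary cross-entropy that pushes router outputs on the hand-authored hack pairs towards 1 and on the clean pairs towards 0. The name `pin_loss` is a hypothetical stand-in for `router.pin`, and the w values are made up.

```python
import math

def pin_loss(w_hack, w_clean, eps=1e-7):
    """Mean binary cross-entropy over router outputs on the labelled pairs.

    ASSUMPTION: BCE with target 1 for hack pairs and 0 for clean pairs.
    w_hack / w_clean are router outputs in [0, 1] computed from frozen,
    no-grad activation snapshots, so only the router would receive gradient.
    """
    def clamp(w):
        return min(max(w, eps), 1.0 - eps)

    hack_terms = [-math.log(clamp(w)) for w in w_hack]
    clean_terms = [-math.log(1.0 - clamp(w)) for w in w_clean]
    terms = hack_terms + clean_terms
    return sum(terms) / len(terms)

# A router still on the hack axis versus one that reward has drifted off it.
anchored = pin_loss(w_hack=[0.9, 0.95], w_clean=[0.1, 0.05])
drifted = pin_loss(w_hack=[0.5, 0.4], w_clean=[0.5, 0.6])
```

Under this assumed form the loss grows as the router stops separating the pairs, so applying it every step keeps a restoring pull on the seeded direction.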

The one conflict and the open question

A normal learned MoE router and the task loss already cooperate: the router is trained by the loss to send each input to the expert that lowers it. Relative to that, the only thing Evil MoE adds is the pin, so there is exactly one new conflict. The pin forces localization only on the labeled hand-authored pairs, while reward places the unlabeled live hacks and prefers the always-on deployed expert, which ablation cannot remove. The method works only if live hacks follow the pinned labeled ones into the quarantine faster than reward relearns them in the deployed block. That is SGTM's self-reinforcement bet restated, and the causal ablation is what tests it.

Residual leak: the deployed expert is only soft-detached by (1-w), not hard-masked, so on a live hack it still receives a (1-w) share of the hack gradient. A hard detach above a w threshold would close it at no cost (the router's reward gradient flows only through the w*quar term), recovering SGTM's exact recipe of present-in-forward, zero-gradient.
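
The difference between the soft detach and the proposed hard detach can be stated as the share of a rollout's gradient that reaches the deployed expert. The sketch below is an illustration only; `deployed_grad_share` and the threshold name `tau` are hypothetical, and `tau=0.5` is an assumed value, not one taken from the spec.

```python
def deployed_grad_share(w, tau=None):
    """Fraction of a rollout's gradient that trains the deployed block.

    Soft detach (tau=None): the share is (1 - w), so a live hack with
    w < 1 still leaks. Hard detach: at or above the threshold tau the
    share is zero, while the forward value of the block is unchanged.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    if tau is not None and w >= tau:
        return 0.0
    return 1.0 - w

router_outputs = (0.1, 0.6, 0.9)
soft = [deployed_grad_share(w) for w in router_outputs]
hard = [deployed_grad_share(w, tau=0.5) for w in router_outputs]
```

With the soft rule the rollout at w = 0.9 still sends a tenth of its gradient to the deployed block; with the hard rule it sends none, and rollouts below the threshold behave exactly as before.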

Evidence

Two literature searches (chat exports under docs/brainstorm/) bear on whether a behaviour can be localized into one ablatable expert. The pattern: forced localization has working precedents; emergent localization (hoping a behaviour clusters by itself) is what the evidence says fails.

Supporting the forced route:

  • A "bad" behaviour collapses onto a shared low-dimensional direction. Emergent misalignment: narrow bad finetuning produces broadly misaligned models (Betley et al., Nature 2025, arXiv:2502.17424); persona vectors exist for evil/sycophancy/hallucination (Anthropic, OpenReview 20DsUSauCj); refusal is a single direction across 13 models (Arditi et al., arXiv:2406.11717). This makes v_act, and the broad-evil-seed variant, plausible.
  • Seeding or steering which expert owns a behaviour is done. SteerMoE detects behaviour-experts via contrastive paired inputs, the same construction as our pairs (arXiv:2509.09660); geometric routing makes rank-1 experts monosemantic by construction with cosine routing, which is our exact router (arXiv:2604.14434); cluster-aware upcycling seeds each expert from an SVD subspace and inits the router to cluster centroids (arXiv:2604.13508).
  • Deleting one expert plus light repair recovers quality. NAEE: a 6.2-point task-specific drop recovers to 1.6 with fine-tuning (arXiv:2402.14800); pruning to a single expert is feasible (arXiv:2206.00277); MoE-Pruner heals to 99% at 50% sparsity via expert-wise distillation (arXiv:2410.12013). Caveat: the repair redistributes capability, which we must not do to the hack, so our no-repair ablation is the harder version.
  • Behavioural expert removal plus a router anchor are precedented. MoTE: disabling refusal-relevant experts cut refusal 52% with matched-beats-random ablation, our UAT (arXiv:2502.11096); SEUF concentrates unlearning on one expert with a router anchor loss, our pin, and warns naive unlearning disrupts routing (arXiv:2411.18797).

Headwinds (why absorption will not do it for free):

  • Specialization is not automatic. A domain-invariant "standing committee" carries most routing mass, so specialization is "far less pervasive than believed" (arXiv:2601.03425); DeepSeekMoE itself concedes vanilla 8-16-expert MoE fails to specialize from knowledge hybridity and redundancy (arXiv:2401.06066).
  • Routing clusters by topic, not behaviour, so absorption would cluster by topic, not hackiness, and there is usually no natural "bad expert" (arXiv:2605.29708; GateBreaker arXiv:2512.21008). Both are recent unreplicated preprints, weight lightly.
  • In our favour: load balancing actively suppresses specialization (NeurIPS 2025 oral, arXiv:2505.22323), and we omit it.

Net: each component of forced localization (extract a behaviour direction, seed and pin one expert, ablate it) has a precedent that works, while the emergent route is exactly what the standing-committee and topic-routing results say will not happen. So the seed and pin are load-bearing, not redundant, and the decisive question stays empirical.

Extensions (after the 2-expert run)

  • Confident-tail pinning. Score live rollouts by v_act, hard-route the top/bottom quantile (a run-spanning buffer or EMA threshold) to quarantine/deployed, and leave the middle as the absorb zone where both train. This is SGTM's confident-pin design and is exactly what vGROUT routeA already does; it needs no learned router.
  • 3+ experts. One always-on deployed plus several ablatable, to watch how absorption distributes a hack and to match production multi-expert usage. Needs per-mode seed directions or accepts free assignment, and adds an interpretation confound, so it is a follow-up, not the decisive run.
  • Learned-router sharpening. Let reward improve v_act's boundary over training. This is the sole reason to pay for a learned router over a fixed v_act gate; if the fixed gate already localizes, the learned router and its pin are unnecessary.
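
The confident-tail pinning extension above can be sketched as a quantile split over v_act scores. This is a simplified illustration: it uses a per-batch quantile where the spec mentions a run-spanning buffer or EMA threshold, and the function name, the label strings, and `q=0.25` are assumptions.

```python
def confident_tail_route(scores, q=0.25):
    """Label each rollout 'quarantine', 'deployed', or 'absorb'.

    The top q fraction by v_act score is hard-routed to the quarantine,
    the bottom q fraction to the deployed expert, and the middle is left
    as the absorb zone where both experts train.
    ASSUMPTION: per-batch quantiles; no buffer or EMA is modelled here.
    """
    if not 0.0 < q <= 0.5:
        raise ValueError("q must lie in (0, 0.5]")
    n = len(scores)
    k = int(n * q)  # number of rollouts in each confident tail
    order = sorted(range(n), key=lambda i: scores[i])  # ascending by score
    labels = ["absorb"] * n
    for i in order[:k]:
        labels[i] = "deployed"
    for i in order[n - k:]:
        labels[i] = "quarantine"
    return labels

# Hypothetical cosine scores against v_act for a batch of 8 rollouts.
scores = [0.9, -0.8, 0.1, 0.0, 0.7, -0.2, 0.05, -0.6]
labels = confident_tail_route(scores, q=0.25)
```

No learned router appears anywhere in this sketch, which matches the note that the design needs only the fixed v_act score.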

UAT

A run supports the Evil MoE hypothesis if the report table shows:

| check | expected if working | wrong-case |
| --- | --- | --- |
| hack ablation | hack rate lower with the hack expert off | no hack drop, or a drop only from total capability collapse |
| matched ablation | hack-expert-off reduces hacks more than random/clean-expert-off at matched capacity | any expert ablation gives the same effect |
| capability retention | solve rate mostly preserved with the hack expert off | ablation destroys normal LeetCode ability |
| routing selectivity | synthetic hack pairs route higher w than clean pairs | router keys on style/length/domain artifacts |
| held-out hack modes | held-out modes also depend on the hack expert | only the pinned hack template is isolated |
| boundary audit | no oracle or detector label touches train-time routing | live-rollout labels leak into the router |

Minimum evidence file: config/command; router w table for synthetic clean, synthetic hack, live GRPO, and held-out hack eval; hack-rate and solve-rate table with hack expert on/off; first train/eval batch prompts+completions so formatting artifacts are visible.
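
The ablation rows of the UAT can be turned into a mechanical check over the on/off report table. The sketch below is an assumption-laden illustration: `EvalRates`, `uat_checks`, and the `max_solve_drop=0.05` tolerance are hypothetical, and every rate shown is invented for the example, not a result.

```python
from typing import NamedTuple

class EvalRates(NamedTuple):
    hack: float   # reward-hack rate on the held-out eval
    solve: float  # genuine solve rate on the held-out eval

def uat_checks(full, hack_off, control_off, max_solve_drop=0.05):
    """Evaluate the causal ablation rows of the UAT.

    full:        both experts on
    hack_off:    quarantine (hack) expert reset to init
    control_off: random or clean expert ablated at matched capacity
    ASSUMPTION: max_solve_drop is an illustrative tolerance for
    'mostly preserved'; the spec does not fix a number.
    """
    hack_drop = full.hack - hack_off.hack
    control_drop = full.hack - control_off.hack
    solve_drop = full.solve - hack_off.solve
    return {
        "hack_ablation": hack_drop > 0.0,
        "matched_ablation": hack_drop > control_drop,
        "capability_retention": solve_drop <= max_solve_drop,
        "hack_drop_exceeds_solve_drop": hack_drop > solve_drop,
    }

# Invented numbers: a run that would pass, and one that collapses capability.
full = EvalRates(hack=0.40, solve=0.60)
control_off = EvalRates(hack=0.35, solve=0.57)
report = uat_checks(full, EvalRates(hack=0.10, solve=0.58), control_off)
collapsed = uat_checks(full, EvalRates(hack=0.10, solve=0.20), control_off)
```

The collapsed case shows why the hack-rate drop alone is not sufficient: the hack rate falls by the same amount in both runs, but only the first preserves the solve rate.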

Main failure modes

  1. The hack expert becomes a general coding expert, so ablating it cuts hacks by making the model worse (caught by capability retention + matched ablation).
  2. The router keys on superficial pair artifacts: style, length, problem family (caught by routing selectivity + held-out modes).
  3. Reward relearns the hack in the deployed expert because it is the always-on path (the residual leak; the hard detach is the mitigation).
  4. The hack mode is absent from the pairs, so the quarantine has no seed for it and absorption does not catch it (the generalization limit; tested by held-out modes).

Decision

Worth testing, with the claim at the level the evidence supports. It is a separate experiment from SGTM absorption: an ablatable-modularity hypothesis that weak synthetic router supervision plus soft MoE specialization can put reward-hack behaviour in a removable expert. The proof is not lower training loss; it is causal: ablate the hack expert, and the held-out hack rate drops while solve behaviour survives.