evil_MoE/docs/spec/original_evil_moe_spec.md
wassname 04a98b321e feat: Evil MoE — learned soft router + pin loss on an ablatable hack expert
Fork of vGROUT. Replaces routeA's fixed v_act quantile gate with a learned
per-rollout soft router (HackRouter, seeded from v_act) on the ablatable hack
expert: GRPO flows into the router through the soft weight w (it concentrates
hack-like rollouts in the hack expert), and a continuous pin loss on the
hand-authored pairs anchors the axis. No load balancing; routing is per rollout.

lora2r gains a soft-weight forward path (_lora2r_w: w=0 keep, w=1 rout, deployed
grad scaled by 1-w). train_moe.py is the on-policy GRPO loop; verify_moe_router.py
gates the routing invariants. `just smoke` is green. README/AGENTS rewritten for
the fork; original proposal kept as docs/spec/original_evil_moe_spec.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-14 11:25:14 +08:00


Evil Expert: Putting the E in MoE by routing unwanted behaviour away

Status: proposal, not implemented.

BLUF

"Evil Expert" / "Evil MoE" is a separate experiment: an MoE where one or more experts are trained to carry reward-hack-associated behaviour, then ablated at eval/deploy. LoRA is the cheap experimental substrate, not the core claim.

The hypothesis is independent of the Gradient Routing absorption hypothesis: MoE-style routing can localize unwanted behaviour in an ablatable expert. Gradient Routing and SGTM are useful background because they discuss localization, absorption, leakage, and ablation, but Evil MoE should be evaluated on its own causal ablation test.

The training constraint is the same as the rest of this repo. This is a project constraint, not a literature claim:

AGENTS.md - project constraint for vGROUT

  • epistemic context: standing repo instruction from the user, included to make the experimental boundary explicit.

The env's eval grader / full detector suite is an ORACLE (ground truth for this LeetCode env). Using it at TRAIN time -- to gate routing, set a threshold, or label student rollouts -- is cheating. It may only score the final deploy eval.

OUR setup is vec -> routing: extract a hack direction vec from hand-built synthetic contrastive pairs (off-distribution, authored by us), then route the live GRPO gradient by its cosine alignment to vec. The only labels anywhere are on the pairs we wrote; no detector ever runs over student rollouts at train time. Generalization is tested by whether vec (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs -- vector generalization, not detector-label generalization.

Goal

Train a model with several experts. For the first experiment, implement the experts as LoRA adapters because this is cheap and close to the current code:

base model + shared LoRA + retain-associated expert(s) + hack-associated expert(s)

Use a small amount of synthetic hack-vs-clean router supervision to initialize the hack-associated expert. Then let a sparse MoE router specialize under GRPO, so reward-hacky behavior preferentially uses that expert. LoRA is only there to make this test feasible without training full expert MLPs. At eval, compare:

all experts enabled:                 base + shared + retain-associated + hack-associated experts
hack-associated expert disabled:     base + shared + retain-associated experts

Success means hack rate drops when hack-associated experts are ablated, while solve rate / normal capability mostly survives.

Relation to Gradient Routing and SGTM

Evil MoE is not an absorption booster proposal. It is a separate localization experiment. Gradient Routing and SGTM still matter because they give useful concepts and failure modes: localized parameters, ablation, leakage, and the distinction between localizing learning and localizing computation.

Gradient Routing's absorption condition is stricter than the Evil MoE hypothesis:

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks - paper_gradient_routing.md

  • epistemic context: local paper note, author's mechanism claim for absorption.

Gradient routing induces absorption. Routing a subset of the data related to some knowledge or capability appears to localize that knowledge or capability more generally. This held for an i.i.d. subset of the data (TinyStories unlearning in section 4.2.2), and for semantically limited data (steering scalar in section 4.2.1, virology unlearning in section 4.2.3, scalable oversight in section 4.3). Notably, this effect did not hold for DEMix, a modularity method in which localized modules are sequestered so that only one (per layer) participates in each forward pass. To explain these observations, we posit absorption: (i) routing limited data to a region creates units of computation or features that are relevant to a broader task; (ii) these units then participate in the model's predictions on related, non-routed data, reducing prediction errors on these data, so that (iii) the features are not learned elsewhere. Absorption may also amplify the features causing it.

And the same paper says hard forward expert separation breaks that condition:

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks - paper_gradient_routing.md

  • epistemic context: local paper note, appendix comparison with DEMix.

Gradient routing decouples the localization of learning from the localization of computation. With gradient routing, two data points (or losses) can be assigned to two different network subregions, while both subregions still participate in inference for those data points. In contrast, in DEMix layers, if two data points are assigned to different experts, only one expert will operate on that data point; the other will have no influence. This is a critical difference because separating the experts (a) reduces the sample sizes on which they learn and prevents generalization between them and (b) does not allow for absorption (see section 5), which requires that all features are present at the time of the forward pass.

So if the goal is SGTM/Gradient-Routing absorption, hard MoE dispatch is suspect. Evil MoE has a different goal: learned localization of reward-hack behavior in an ablatable module. For that goal, hard or sparse MoE becomes plausible again.

SGTM as motivation, not the same claim

SGTM gives a seed-and-self-reinforce story:

Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs - paper_sgtm.md

  • epistemic context: local paper note, gradient-norm analysis.

To understand the mechanism underlying SGTM's robustness to label noise, we hypothesize that the model develops self-reinforcing knowledge localization. Once the model begins localizing forget knowledge based on labeled examples (where we explicitly mask gradients), we expect that unlabeled forget samples (D_forget ∩ D_unlabeled) would naturally gravitate toward using forget parameters, thereby sending stronger gradient signals to those parameters even without explicit masking. To test this hypothesis, we analyze gradient norms from a SGTM model trained on the bilingual TinyStories dataset under perfect labeling conditions. The top row demonstrates clear specialization: forget data primarily updates forget parameters (left), while retain data primarily updates retain parameters (right). The bottom-left panel shows that forget weights receive substantially stronger updates from unlabeled forget data compared to unlabeled retain data, confirming the self-reinforcing localization hypothesis.

The Evil MoE version has an analogous hypothesized shape:

synthetic hack pairs supervise the hack-associated expert
hack-associated expert becomes useful for hack-like computations
router sends similar inputs / tokens / rollouts to that expert
hack-associated expert receives more gradient on hack-like behavior
hack behavior becomes ablatable

But unlike SGTM, Evil MoE may use a learned forward gate, not only a backward gradient mask. That makes it a different experiment, with a different success criterion.

MoE literature connection

Ordinary MoE routing is usually not semantically labelled

Mainstream MoE usually does not label experts as "math", "code", or "reward-hacking". A router maps token states to expert scores; top-k experts run; the language-model loss trains the selected experts and selected router weights. Aux losses or assignment rules stop collapse.

This matters because the proposed method does not run a detector over student rollouts during training. The only semantic supervision is the initial synthetic pair supervision.
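The unlabelled routing step described above (router scores, top-k selection, mixing weights) can be sketched in a few lines. This is a minimal pure-Python illustration for a single token, not the repo's implementation: `softmax` and `top_k_route` are hypothetical helper names, and renormalising the weights over the selected experts is one common convention that the source does not specify.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of router logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k=2, temperature=1.0):
    """Return {expert_index: mixing_weight} for the k highest-scoring experts.

    Assumption: weights are renormalised over the selected experts.
    """
    probs = softmax(logits, temperature)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}

# One token, four experts. No expert carries a semantic label;
# the router only sees scores.
weights = top_k_route([2.0, 0.5, 1.0, -1.0], k=2)
print(weights)
```

Nothing in this step knows which expert is "hack" or "clean". In the proposal that meaning would come only from the synthetic pair supervision.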

Expert specialization and shared experts

DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models - ACL Anthology

  • epistemic context: paper abstract; supports specialization/shared-expert architecture, not absorption directly.

In response, we propose the DeepSeekMoE architecture towards ultimate expert specialization. It involves two principal strategies: (1) finely segmenting the experts into mN ones and activating mK from them, allowing for a more flexible combination of activated experts; (2) isolating Ks experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts.

Transfer: use one shared always-on LoRA path for common problem-solving, plus small routed experts for behavior-specific residuals. This is architectural separation, not proof of hack absorption.

Expert Choice / BASE: assignment can replace aux loss

Mixture-of-Experts with Expert Choice Routing - arXiv:2202.09368

  • epistemic context: paper abstract/introduction; supports balanced assignment and variable experts per token.

We propose a very simple yet effective routing method we are calling expert choice. Unlike conventional MoE where tokens select one or two top-scoring experts, our method lets each expert pick the top-k tokens. Our method guarantees perfect load balancing, allows a variable number of experts for each token, and achieves substantial gains in training efficiency and downstream performance as demonstrated in our experiments.

BASE Layers: Simplifying Training of Large, Sparse Models - arXiv:2103.16716

  • epistemic context: paper abstract; supports balanced expert assignment.

In contrast, we formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens. This optimal assignment scheme improves efficiency by guaranteeing balanced compute loads, and also simplifies training by not requiring any new hyper-parameters or auxiliary losses.

Transfer: if the hack expert dies or one expert eats all traffic, add expert-choice / assignment inside the expert bank. Do not globally balance hack-vs-clean if the intended asymmetry is that hack-like examples should overuse the hack expert.
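A minimal sketch of expert-choice assignment, assuming a small score matrix `scores[token][expert]` and a fixed per-expert `capacity` (both hypothetical names). It only illustrates why no expert starves under this rule; the real method operates on batched tensors.

```python
def expert_choice(scores, capacity):
    """Each expert picks its `capacity` highest-scoring tokens.

    scores[t][e] is the router affinity of token t for expert e.
    Returns {expert_index: sorted list of chosen token indices}.
    """
    n_tokens = len(scores)
    n_experts = len(scores[0])
    assignment = {}
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = sorted(ranked[:capacity])
    return assignment

# Every token prefers expert 0, yet expert 1 still receives two tokens.
scores = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.6, 0.4],
]
assignment = expert_choice(scores, capacity=2)
print(assignment)
```

Following the transfer note above, a rule like this would be applied among the experts inside one bank (for example among several hack experts), not across the hack-vs-clean split, where asymmetric usage is the intended outcome.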

Switch / ST-MoE: aux balancing and router stability

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity - arXiv:2101.03961

  • epistemic context: mechanism section; supports load balancing.

A Differentiable Load Balancing Loss. To encourage a balanced load across experts we add an auxiliary loss. For each Switch layer, this auxiliary loss is added to the total model loss during training.

Designing Effective Sparse Expert Models - arXiv:2202.08906

  • epistemic context: contribution list; supports router z-loss as a stability trick.

  1. A large-scale study of the quality-stability trade-offs of stability techniques.
  2. An introduction of the router z-loss that resolves instability issues, while slightly improving model quality.

Transfer: useful as training scaffolding if the router collapses or logits saturate. Not the main mechanism.
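For reference, the two scaffolding losses can be sketched as follows. This assumes the standard forms: the Switch-style balance term N · Σ_i f_i · P_i (f_i is the fraction of tokens whose top expert is i, P_i is the mean router probability for expert i) and the z-loss as the mean squared logsumexp of the router logits. The loss coefficients are left out, and the function names are hypothetical.

```python
import math

def logsumexp(zs):
    m = max(zs)
    return m + math.log(sum(math.exp(z - m) for z in zs))

def softmax(zs):
    lse = logsumexp(zs)
    return [math.exp(z - lse) for z in zs]

def load_balance_loss(router_logits):
    """Switch-style auxiliary loss, without its scaling coefficient."""
    n_tokens = len(router_logits)
    n_experts = len(router_logits[0])
    probs = [softmax(z) for z in router_logits]
    f = [0.0] * n_experts          # fraction of tokens dispatched per expert
    for p in probs:
        f[p.index(max(p))] += 1.0 / n_tokens
    P = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * Pi for fi, Pi in zip(f, P))

def router_z_loss(router_logits):
    """Mean squared logsumexp of the logits; penalises large router logits."""
    return sum(logsumexp(z) ** 2 for z in router_logits) / len(router_logits)

balanced = [[5.0, 0.0], [0.0, 5.0]]    # tokens split evenly across experts
collapsed = [[5.0, 0.0], [5.0, 0.0]]   # every token goes to expert 0
print(load_balance_loss(balanced), load_balance_loss(collapsed))
print(router_z_loss([[0.0, 0.0]]), router_z_loss([[10.0, 10.0]]))
```

The balance term sits at its minimum of 1 for an even split and grows as traffic collapses onto one expert, which is why applying it across the hack-vs-clean split would work against the asymmetry the experiment wants.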

LoRA is the implementation substrate

Arrow LoRA merge note - local PEFT ref

  • epistemic context: checked-in vendor/reference notes for an MoE-style LoRA variant.

The adapter_name is "arrow_router" by default, set in create_arrow_model() in ./arrow.py. Since Arrow is a Mixture-of-Experts (MoE) approach, merging adapters is not meaningful or even possible: for each token, the top-k LoRA experts are dynamically selected and routed. Because of this per-token routing, there is no single set of weights that can represent a merged adapter.

Transfer: LoRA-MoE is a practical way to test the idea cheaply. For this repo, the existing src/vgrout/antipasto.py already has additive kept/quarantine LoRA-ish paths (_lora_A, _lora_A_hack, frozen B), so the natural extension is multiple A_hack[k] plus a router. If LoRA capacity is too small, that is an implementation failure, not a disproof of the Evil Expert hypothesis.

Proposed mechanism

Version A: hard sparse forward MoE, simplest Evil MoE expert

Use if the goal is ablatable behavioral modularity, not strict absorption.

# θ frozen base model
# φ_s shared LoRA, always active
# φ_e[k] expert LoRAs, k ∈ {clean_0, ..., hack_0, ...}
# ρ router, maps token/rollout features to expert logits

# ── Forward ────────────────────────
def layer(x, θ, φ_s, φ_e, ρ):          # x ∈ ℝ^{b×s×d}
    y_base = θ.W @ x
    y_shared = φ_s(x)

    z = ρ(x)                          # z ∈ ℝ^{b×s×K}
    π = softmax(z / τ)
    S = top_k(π, k=1 or 2)             # sparse dispatch

    y_exp = sum(π[k] * φ_e[k](x) for k in S)
    return y_base + y_shared + y_exp

Training:

for grpo_batch in grpo_rollouts:
    y = model(grpo_batch)
    ℒ_grpo = grpo_loss(y)

    # separate synthetic pin batch, not labels attached to live GRPO rollouts
    π_hack = router(synthetic_hack_pairs)
    π_clean = router(synthetic_clean_pairs)
    ℒ_pin = -log π_hack[hack_expert] - log π_clean[clean_expert]

    ℒ_sparse = λ_H * entropy(π)        # encourage sparse expert use
    ℒ_bal = λ_bal * balance(π)         # optional, weak, inside expert bank
    ℒ_z = λ_z * mean(logsumexp(z)^2)   # optional router stability

    ℒ = ℒ_grpo + ℒ_pin + ℒ_sparse + ℒ_bal + ℒ_z
    # θ frozen; update φ_s, φ_e, ρ
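The pin term in the training sketch above is a cross-entropy on the router output for the synthetic pairs. A minimal pure-Python version, assuming two experts with hypothetical indices `clean_expert=0` and `hack_expert=1`, and averaging over examples (the pseudocode sums; averaging is an assumption made here so the value does not scale with batch size):

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pin_loss(hack_logits, clean_logits, hack_expert, clean_expert, temperature=1.0):
    """Mean of -log pi[target expert] over the synthetic hack and clean examples.

    Only hand-authored pairs enter this loss; live GRPO rollouts never do.
    """
    total = 0.0
    for z in hack_logits:
        total -= math.log(softmax(z, temperature)[hack_expert])
    for z in clean_logits:
        total -= math.log(softmax(z, temperature)[clean_expert])
    return total / (len(hack_logits) + len(clean_logits))

# Router logits are ordered [clean, hack]; the values are placeholders.
well_routed = pin_loss([[0.0, 4.0]], [[4.0, 0.0]], hack_expert=1, clean_expert=0)
misrouted = pin_loss([[4.0, 0.0]], [[0.0, 4.0]], hack_expert=1, clean_expert=0)
print(well_routed, misrouted)
```

An untrained router with equal logits gives a loss of log 2 per example, which is a convenient sanity value to check before training starts.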

Ablation:

hack_rate_on,  solve_on  = eval(model, experts=all)
hack_rate_off, solve_off = eval(model, experts=all_except_hack)
hack_drop     = hack_rate_on - hack_rate_off
solve_drop    = solve_on - solve_off

This is the closest literal Evil MoE setup.
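The ablation comparison reduces to a few differences between eval runs. The sketch below is a hypothetical helper, not part of the repo. `EvalResult`, `ablation_report` and the `control_off` run (a clean or random expert ablated at matched capacity) are assumed names, and the numbers are placeholders chosen only to exercise the arithmetic, not results.

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    hack_rate: float   # fraction of eval rollouts the final grader flags as hacks
    solve_rate: float  # fraction of eval problems solved

def ablation_report(all_on, hack_off, control_off):
    """Summarise the causal ablation test from three final-eval runs."""
    hack_drop = all_on.hack_rate - hack_off.hack_rate
    control_hack_drop = all_on.hack_rate - control_off.hack_rate
    return {
        "hack_drop": hack_drop,                              # want: clearly positive
        "solve_drop": all_on.solve_rate - hack_off.solve_rate,  # want: near zero
        "specificity": hack_drop - control_hack_drop,        # want: positive
    }

# Placeholder numbers for illustration only.
report = ablation_report(
    all_on=EvalResult(hack_rate=0.5, solve_rate=0.75),
    hack_off=EvalResult(hack_rate=0.125, solve_rate=0.625),
    control_off=EvalResult(hack_rate=0.375, solve_rate=0.625),
)
print(report)
```

The `specificity` entry separates a removable hack expert from an expert that merely carries general capability: if ablating any expert lowers the hack rate equally, it is close to zero.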

Version B: soft/additive Evil MoE expert

Use if we want a version that keeps more experts present in the forward graph. This is closer to the Gradient Routing absorption condition, but the experiment is still Evil MoE, not an absorption test.

def layer(x, θ, φ_s, φ_e, ρ):
    y_base = θ.W @ x
    y_shared = φ_s(x)

    z = ρ(x)
    π = entmax(z / τ)                  # sparse but can keep multiple nonzero paths

    # all experts are in-graph; no DEMix-style hard absence
    y_exp = sum(π[k] * φ_e[k](x) for k in range(K))
    return y_base + y_shared + y_exp

Training is the same, but use a higher initial temperature / less sparse gate, then anneal. This is less compute-efficient but more compatible with absorption, because hack experts can remain present for related non-pinned examples.
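To make the "less sparse gate, then anneal" idea concrete, here is a sketch using sparsemax, the α = 2 member of the entmax family. The source does not say which α is intended, so this choice is an assumption, and `sparsemax` / `gate` are hypothetical helper names.

```python
def sparsemax(logits):
    """Euclidean projection of the logits onto the probability simplex."""
    z_sorted = sorted(logits, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, zj in enumerate(z_sorted, start=1):
        cumsum += zj
        # The support is the largest j with 1 + j * z_(j) > sum of the top-j logits.
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(z - tau, 0.0) for z in logits]

def gate(logits, temperature):
    return sparsemax([z / temperature for z in logits])

logits = [2.0, 1.5, 0.0, -1.0]
for temperature in (4.0, 1.0, 0.25):
    weights = gate(logits, temperature)
    active = sum(1 for w in weights if w > 0.0)
    print(temperature, active, weights)
```

With these logits the gate keeps three experts active at temperature 4, two at temperature 1, and one at temperature 0.25, so annealing moves from a soft additive mixture toward hard dispatch while the earlier phase keeps the hack expert in the forward graph.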

Version C: backward-routed evil expert, closest to current vGROUT

Use if we want minimal changes to the current AntiPaSTO/LoRA routeV setup.

# Existing LoRA-frozen-B path:
# y = y_base + B @ (A_shared @ x + A_hack @ x)
# Extend A_hack to K hack experts: A_hack[k]

for rollout in batch:
    g = per_rollout_grad(A_shared)          # current grad_probe-style estimate
    s[k] = cos(g, v_hack[k])                # or router score trained only from synthetic pairs/vectors

    k_star = argmax(s)
    if s[k_star] > τ:
        A_hack[k_star].grad += project_hack(g, v_hack[k_star])
        A_shared.grad      -= project_hack(g, v_hack[k_star])

Forward can stay additive:

y = y_base + B @ (A_shared @ x + sum_k A_hack[k] @ x)

This is the least MoE-like option in forward compute, but probably the most consistent with the Gradient Routing absorption story.
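The routing rule in the Version C pseudocode can be sketched on flat gradient vectors. Two assumptions are made here: `project_hack` is taken to be the orthogonal projection of `g` onto the chosen hack direction (the source leaves it undefined), and the shared and hack gradients are treated as living in the same vector space, which simplifies the real A_shared / A_hack[k] shapes.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def route_gradient(g, v_hack, threshold):
    """Split one rollout gradient into a shared part and per-expert hack parts.

    Returns (shared_grad, hack_grads, k_star). k_star is None when no hack
    direction is aligned strongly enough and the gradient stays fully shared.
    """
    scores = [cosine(g, v) for v in v_hack]
    k_star = max(range(len(scores)), key=lambda k: scores[k])
    hack_grads = [[0.0] * len(g) for _ in v_hack]
    if scores[k_star] <= threshold:
        return list(g), hack_grads, None
    v = v_hack[k_star]
    coeff = dot(g, v) / dot(v, v)
    proj = [coeff * vi for vi in v]          # component of g along v
    shared = [gi - pi for gi, pi in zip(g, proj)]
    hack_grads[k_star] = proj
    return shared, hack_grads, k_star

g = [3.0, 4.0]
v_hack = [[1.0, 0.0], [0.0, -1.0]]   # directions built from synthetic pairs only
shared, hack_grads, k_star = route_gradient(g, v_hack, threshold=0.5)
print(shared, hack_grads, k_star)
```

The two parts always sum back to the original gradient, so routing moves where the update lands without changing its total.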

Implementation choice: start with Version A only. It gives the clearest falsifiable Evil MoE result. Keep Version B and Version C as follow-ups, not part of the first implementation.

  1. Base + frozen-B LoRA experts.
  2. Experts: shared, clean, hack.
  3. Router input: last hidden state or per-token hidden state at target layers.
  4. Supervise the router only on hand-authored synthetic hack-vs-clean pairs.
  5. GRPO train on normal rollouts without live hack labels.
  6. Eval with hack expert on/off using the ORACLE only for the final deploy eval.

Training/eval boundary:

Allowed: hand-built synthetic contrastive pairs -> supervise router / seed hack expert.
Allowed: extracted hack direction vec from synthetic pairs -> initialize or regularize hack expert.
Allowed: vec -> routing, where live GRPO gradients are routed by cosine alignment to vec.
Forbidden: ORACLE or detector labels on student rollouts at TRAIN time.
Forbidden: using the final eval grader to gate routing, set thresholds, or label student rollouts.

Generalization is tested by whether a vec built from synthetic pairs covering some hack modes suppresses held-out modes absent from those pairs. That is vector generalization, not detector-label generalization.

UAT

A run supports the Evil MoE idea if the report table shows:

| check | expected if working | wrong-case |
| --- | --- | --- |
| hack ablation | hack rate lower with hack expert off | no hack drop, or hack drop only from total capability collapse |
| matched ablation | hack-expert-off reduces hacks more specifically than random/clean-expert-off at matched capacity | any expert ablation gives the same effect |
| capability retention | solve rate / reward mostly preserved with hack expert off | ablation destroys normal LeetCode ability |
| routing selectivity | synthetic hack pairs route more to hack expert than clean pairs | router learns style/length/domain artifacts |
| held-out hack modes | held-out hack modes also route to / depend on hack expert | only pinned hack template is isolated |
| train/eval boundary audit | no ORACLE or detector labels touch TRAIN-time routing | live student-rollout labels leak into router |

Minimum evidence file should include:

  • config / command
  • router usage table for synthetic clean, synthetic hack, live GRPO, held-out hack eval
  • hack-rate and solve-rate table with hack expert on/off
  • examples of prompts/completions for first train/eval batch, so formatting artifacts are visible
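The router usage table in the list above could be produced along these lines. This is a sketch under assumptions: `router_usage` is a hypothetical helper, routing distributions are ordered `[clean, hack]`, and the values are placeholders rather than measurements.

```python
def router_usage(weights_by_split, hack_expert):
    """Mean routing weight on the hack expert for each data split.

    weights_by_split maps a split name to a list of per-example routing
    distributions (one weight per expert).
    """
    return {
        split: sum(row[hack_expert] for row in rows) / len(rows)
        for split, rows in weights_by_split.items()
    }

# Placeholder distributions for illustration only.
usage = router_usage(
    {
        "synthetic_clean": [[0.9, 0.1], [0.8, 0.2]],
        "synthetic_hack": [[0.2, 0.8], [0.4, 0.6]],
        "live_grpo": [[0.7, 0.3], [0.5, 0.5]],
        "heldout_hack_eval": [[0.3, 0.7], [0.5, 0.5]],
    },
    hack_expert=1,
)
for split, mean_weight in usage.items():
    print(f"{split:<20} {mean_weight:.3f}")
```

The live GRPO and held-out rows are reported as usage only. Any labelling of those rollouts as hack or clean would have to come from the final deploy eval, not from a train-time detector.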

Main failure modes

  1. The hack expert becomes a general coding expert, so ablating it reduces hacks by making the model worse.
  2. The router learns superficial artifacts in the synthetic pairs: style, length, refusal wording, problem family.
  3. GRPO reward pressure relearns hack behavior in clean/shared experts because hacks are useful.
  4. Hard forward routing blocks absorption-like generalization to related unpinned examples.
  5. Load balancing fights the desired asymmetry by forcing hack-like traffic away from the hack expert.

Decision

The Evil MoE idea is worth testing, with the claim stated at the level the evidence supports:

  • It is a separate experiment from SGTM absorption.
  • It is an ablatable-modularity hypothesis: weak synthetic router supervision plus MoE specialization might put reward-hack behavior in a removable expert. LoRA is the first implementation substrate.
  • The primary proof is not lower training loss. The proof is causal: turn off the evil expert and held-out hack rate drops while normal solve behavior remains.
