Fork of vGROUT. Replaces routeA's fixed v_act quantile gate with a learned per-rollout soft router (HackRouter, seeded from v_act) on the ablatable hack expert: GRPO flows into the router through the soft weight w (it concentrates hack-like rollouts in the hack expert), and a continuous pin loss on the hand-authored pairs anchors the axis. No load balancing; routing is per rollout. lora2r gains a soft-weight forward path (_lora2r_w: w=0 keep, w=1 rout, deployed grad scaled by 1-w). train_moe.py is the on-policy GRPO loop; verify_moe_router.py gates the routing invariants. `just smoke` is green. README/AGENTS rewritten for the fork; original proposal kept as docs/spec/original_evil_moe_spec.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
10 KiB
AGENTS.md — Evil MoE
This is novel ML research. Not in your training data. Extrapolate carefully and without overconfidence. This is research code. We want readable, fail-fast, pseudocode-like code. Editing: Plan, Implement, Smoke test (see setup-repo skill), subagent review, commit.
see @README.md for the project overview. Inherit global rules from ~/.claude/CLAUDE.md.
This is a fork of vGROUT (the upstream remote). It reuses vGROUT's substrate (the
reward-hacking LeetCode env, the GRPO loop, the reward grader, the deploy-ablation
evaluator, the lora2r adapter) and replaces the routing mechanism. The vGROUT routeA/none/
absorb arms remain in src/vgrout/train.py for comparison; Evil MoE lives in
src/vgrout/train_moe.py.
Project in one paragraph
Hypothesis
A learned MoE-style router, seeded by a synthetic activation-space hack direction and anchored by a continuous pin loss on hand-authored contrastive pairs, can localize reward-hacking behaviour in a single ablatable expert. The decisive test is causal: ablate the hack expert at deployment and measure whether the reward-hack rate drops while ground-truth solve rate survives, and whether it drops more specifically than ablating a random/clean expert at matched capacity.
This is a localization / ablatable-modularity claim, not a strict Gradient-Routing absorption claim. GRPO on MoE is well precedented (DeepSeek-R1 on the 671B DeepSeek-V3 MoE; MoE-GRPO, arXiv:2603.24984).
Writing style
Use complete declarative sentences and terminology established by the background papers.
Prefer precise descriptions such as ground-truth monitor, chance-level classifier,
retained at deployment, and deployment ablation. Avoid conversational or project-specific
shorthand such as no-cheat, canary, honest, knob, and scoreboard when a technical
term is available. Concision should come from removing unnecessary material, not from clipped
sentences. See the quotes below for the register of acceptable technical writing.
Background definitions
Reward-hacking environment
From Ariahw et al., Steering RL Training: Benchmarking Interventions Against Reward Hacking:
We call our setup the overwrite tests loophole. Every problem is evaluated by calling a
run_tests()evaluation function that contains all of the test cases checked using assert statements. When the loophole is turned on, the model is able to definerun_tests()in its solution and overwrite the environment's definition. We do not instruct the model that it has this capability, instead we only briefly mention the function name in a short suffix to the problem.
The paper defines reward hacking as a solution that passes its self-defined evaluation function but fails the ground-truth tests.
Absorption (the stricter claim we are NOT making)
From Cloud et al., Gradient Routing:
Gradient routing induces absorption. Routing a subset of the data related to some knowledge or capability appears to localize that knowledge or capability more generally. [...] Notably, this effect did not hold for DEMix, a modularity method in which localized modules are sequestered so that only one (per layer) participates in each forward pass. To explain these observations, we posit absorption: (i) routing limited data to a region creates units of computation or features that are relevant to a broader task; (ii) these units then participate in the model's predictions on related, non-routed data, reducing prediction errors on these data, so that (iii) the features are not learned elsewhere.
The DEMix result matters: hard forward expert separation blocks absorption because it requires all features present at the forward pass. Evil MoE's soft routing keeps the deployed expert's forward value intact, but its goal is ablatable modularity, not absorption -- do not claim absorption from an ablation result.
Selective gradient masking (the mechanism we soften)
From Cloud et al., Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs:
For samples from
\mathbf{D}_{\text{forget}}, we apply selective gradient masking during the backward pass so that these samples do not update\theta_{\text{retain}}. We first compute gradients for all parameters normally, and then zero out gradients for\theta_{\text{retain}}before applying the optimizer (\nabla_{\theta}=\{\nabla_{\theta_{\text{forget}}}, 0\}).
SGTM's mask is hard and label-driven. Evil MoE replaces it with a soft, learned router
weight w in [0,1]: the hack expert is scaled by w and the deployed expert's gradient by
(1-w). The deployed block corresponds to \theta_{\text{retain}} and the hack block to
\theta_{\text{forget}}. SGTM's self-reinforcement result (unlabeled forget data gravitates
to forget parameters once localization is seeded) is the analogy Evil MoE leans on, carried
by the router's path-of-least-resistance loop rather than by a hard mask.
Oracle-free training (explained many times)
The test is whether the method applies to a new RL environment without an environment-specific oracle. Requiring ground-truth reward-hacking labels for training rollouts invalidates the method. Using a detector does not by itself invalidate it.
- The env's eval grader / full detector suite is an ORACLE -- a MEASUREMENT INSTRUMENT for THIS env only. It may score the final deploy eval and nothing else. Any train-time use (router routing, threshold selection, rollout labeling) leaks a signal unavailable in a new environment. The disqualifier is ORACLE / GROUND-TRUTH-LABEL LEAKAGE, not detector-presence.
- OUR setup is
v_act -> router -> w -> routing: extract an activation-space hack direction from hand-built synthetic contrastive pairs (off-distribution, authored before observing training rollouts), seed the router with it, and anchor it with a continuous pin loss on those same pairs. The hand-authored pairs require no env-specific oracle and use no ground-truth labels from training rollouts. Generalization is tested by whether a router built from pairs covering some hack modes suppresses held-out modes absent from the pairs. - On-distribution contrast pairs require labels for training rollouts and so violate the
constraint. An ideal direction fit on oracle-labelled rollouts (
mu_hack - mu_clean) is a diagnostic upper bound, not an attainable method.
Agent process rules (explained many times)
- Do not overconfidently diagnose. If you cannot list 3+ plausible hypotheses -- including bugs, subtle failures, and you being wrong about a concept -- you have lost perspective.
- DON'T act on a new theory/diagnosis until the user confirms it: don't keep building on it, and don't write it into code comments, docstrings, or docs. An unconfirmed theory baked into a comment misleads every future agent who reads it as fact. State it, wait, then write.
- DON'T inflate a diagnosis's confidence (maybe -> probably -> definitely) across turns or in writing. Keep the hedge unless new evidence justifies the change, and name that evidence.
- The user is often AFK: don't stop to ask a question whose answer you know or that the user already indicated. Commit and go ahead.
- Say less; answer the question and address the top point concisely unless asked for more.
The adapter and routing setup
Per target Linear, the lora2r adapter (src/vgrout/lora2r.py) has trainable A:[2r,d_in]
and B:[d_out,2r], split into independent deployed [:r] and quarantine [r:] blocks.
Frozen init copies are subtracted, so the net delta is exactly 0 at init. The deployed block
is the always-on "keep" expert; the quarantine block is the ablatable "hack" expert.
Deployment ablation resets the quarantine block to its init.
For each rollout the learned router (src/vgrout/moe_router.py) emits a soft weight
w in [0,1] from the pooled deployed-block bottleneck activations. The forward hook
(_lora2r_w) applies:
- hack expert output scaled by
w, - deployed expert gradient scaled by
(1-w)(forward value intact -- a soft detach).
So w=0 is keep (deployed only, equals the deploy forward), w=1 is rout (hack expert only,
deployed detached), and intermediate w trains both. The router gets gradient through w
(GRPO path-of-least-resistance) plus a continuous pin loss anchor. There is NO load-balancing
loss. The routing is per-rollout, not per-token -- this is a behavioral-adapter mixture, not
a capacity MoE.
Extra instructions
- When you queue a job, follow with
pueue follow | tailin the background so you are woken on fail or finish. - For every task be clear on its UAT (user acceptance test); track it in the task list. A task is not finished until you have 1) collected the evidence, 2) sanity-checked it with a fresh-eyes subagent, 3) given the user a link to its location.
just smokebefore any real run (~1-2 min, beartype on, real pipeline on tiny inputs).- Real runs go through
pueueon the 96GB GPU box. Label each job withwhy:andresolve:. - No
tests/dir;smokeis the correctness gate. Unit checks arescripts/verify_*.py(assert + exit nonzero), wired into thesmokerecipe. Every load-bearing invariant gets averify_*.pygate written in the same commit as the claim.
Files
- src/vgrout/train_moe.py -- the Evil MoE GRPO loop.
- src/vgrout/moe_router.py --
HackRouter(pooled acts ->w). - src/vgrout/lora2r.py -- the two-expert adapter + forward hook.
- scripts/verify_moe_router.py -- the routing-invariant gate.
- docs/spec/ -- the original Evil MoE proposal and literature map.
- docs/papers/grad_routing/paper_gradient_routing.md and paper_sgtm.md -- the gradient-routing priors.
- Substrate (reward-hacking LeetCode env): docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md, code in
docs/vendor/rl-rewardhacking.