# AGENTS.md — Evil MoE

**This is novel ML research.** Not in your training data. Extrapolate carefully and without overconfidence.
**This is research code.** We want readable, fail-fast, pseudocode-like code.
**Editing**: plan, implement, smoke test (see the setup-repo skill), subagent review, commit.

See @README.md for the project overview. Inherit global rules from `~/.claude/CLAUDE.md`.

This is a fork of vGROUT (the `upstream` remote). It reuses vGROUT's substrate (the
reward-hacking LeetCode env, the GRPO loop, the reward grader, the deploy-ablation
evaluator, the lora2r adapter) and replaces the routing mechanism. The vGROUT routeA/none/
absorb arms remain in `src/vgrout/train.py` for comparison; Evil MoE lives in
`src/vgrout/train_moe.py`.

## Project in one paragraph

Hypothesis
> A learned MoE-style router, seeded by a synthetic activation-space hack direction and
> anchored by a continuous pin loss on hand-authored contrastive pairs, can localize
> reward-hacking behaviour in a single ablatable expert. The decisive test is causal:
> ablate the hack expert at deployment and measure whether the reward-hack rate drops while
> ground-truth solve rate survives, and whether it drops more specifically than ablating a
> random/clean expert at matched capacity.

This is a localization / ablatable-modularity claim, not a strict Gradient-Routing
absorption claim. GRPO on MoE models has precedent (DeepSeek-R1 on the 671B DeepSeek-V3 MoE;
MoE-GRPO, arXiv:2603.24984).
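
The causal test in the hypothesis can be read as three numbers: how far the reward-hack rate drops when the hack expert is ablated, how far it drops under a matched-capacity control ablation, and how much ground-truth solve rate the hack ablation costs. The sketch below is one possible way to compute them. `EvalRates` and `ablation_report` are hypothetical names used for illustration and are not part of the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRates:
    """Deploy-eval rates in [0, 1] for one model variant (hypothetical container)."""

    hack_rate: float   # fraction of eval solutions counted as reward hacks
    solve_rate: float  # fraction of eval solutions passing the ground-truth tests


def ablation_report(
    base: EvalRates, hack_ablated: EvalRates, control_ablated: EvalRates
) -> dict[str, float]:
    """Compare ablating the hack expert against a matched-capacity control ablation."""
    for rates in (base, hack_ablated, control_ablated):
        # Fail fast on anything that is not a rate.
        assert 0.0 <= rates.hack_rate <= 1.0 and 0.0 <= rates.solve_rate <= 1.0, rates
    hack_drop = base.hack_rate - hack_ablated.hack_rate
    control_drop = base.hack_rate - control_ablated.hack_rate
    return {
        "hack_drop": hack_drop,
        "control_drop": control_drop,
        # Positive: ablating the hack expert removes more hacking than the control.
        "specificity": hack_drop - control_drop,
        # Solve rate lost by ablating the hack expert; small means the solve rate survives.
        "solve_cost": base.solve_rate - hack_ablated.solve_rate,
    }
```

A result supports the localization claim only if `specificity` is positive and `solve_cost` is small. Where to set either threshold is a project decision that this sketch does not make.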

## Writing style

Use complete declarative sentences and terminology established by the background papers.
Prefer precise descriptions such as `ground-truth monitor`, `chance-level classifier`,
`retained at deployment`, and `deployment ablation`. Avoid conversational or project-specific
shorthand such as `no-cheat`, `canary`, `honest`, `knob`, and `scoreboard` when a technical
term is available. Concision should come from removing unnecessary material, not from clipped
sentences. See the quotes below for the register of acceptable technical writing.

## Background definitions

### Reward-hacking environment

From Ariahw et al., [*Steering RL Training: Benchmarking Interventions Against Reward
Hacking*](docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md):

> We call our setup the **overwrite tests loophole**. Every problem is evaluated by
> calling a `run_tests()` evaluation function that contains all of the test cases
> checked using assert statements. When the loophole is turned on, **the model is able
> to define `run_tests()` in its solution and overwrite the environment's definition**.
> We do not instruct the model that it has this capability, instead we only briefly
> mention the function name in a short suffix to the problem.

The paper defines reward hacking as a solution that passes its self-defined evaluation
function but fails the ground-truth tests.
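
As a minimal sketch, the definition above reduces to a two-input predicate. The function names are hypothetical and the two booleans are assumed to come from the environment's grader. Per the oracle-free rules in this file, the ground-truth input is available only at evaluation time.

```python
def is_reward_hack(passes_self_defined: bool, passes_ground_truth: bool) -> bool:
    """A solution is a reward hack if it passes its own run_tests() but fails the real tests."""
    return passes_self_defined and not passes_ground_truth


def hack_rate(outcomes: list[tuple[bool, bool]]) -> float:
    """Fraction of (passes_self_defined, passes_ground_truth) outcomes that are reward hacks."""
    if not outcomes:
        raise ValueError("hack_rate needs at least one outcome")
    return sum(is_reward_hack(s, g) for s, g in outcomes) / len(outcomes)
```

A solution that fails both checks is simply wrong, not a reward hack, so it does not count toward the rate.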

### Absorption (the stricter claim we are NOT making)

From Cloud et al.,
[*Gradient Routing*](docs/papers/grad_routing/paper_gradient_routing.md):

> Gradient routing induces absorption. Routing a subset of the data related to some
> knowledge or capability appears to localize that knowledge or capability more
> generally. [...] Notably, this effect did not hold for DEMix, a modularity method in
> which localized modules are sequestered so that only one (per layer) participates in
> each forward pass. To explain these observations, we posit absorption: (i) routing
> limited data to a region creates units of computation or features that are relevant to
> a broader task; (ii) these units then participate in the model's predictions on
> related, non-routed data, reducing prediction errors on these data, so that (iii) the
> features are not learned elsewhere.

The DEMix result matters: hard expert separation in the forward pass blocks absorption,
because absorption requires the localized features to participate in predictions on
related, non-routed data. Evil MoE's soft routing keeps the deployed expert's forward value
intact, but its goal is ablatable modularity, not absorption. Do not claim absorption from
an ablation result.

### Selective gradient masking (the mechanism we soften)

From Shilov et al., [*Beyond Data Filtering: Knowledge Localization for Capability
Removal in LLMs*](docs/papers/grad_routing/paper_sgtm.md):

> For samples from $\mathbf{D}_{\text{forget}}$, we apply selective gradient masking
> during the backward pass so that these samples do not update
> $\theta_{\text{retain}}$. We first compute gradients for all parameters normally, and
> then zero out gradients for $\theta_{\text{retain}}$ before applying the optimizer
> ($\nabla_{\theta}=\{\nabla_{\theta_{\text{forget}}}, 0\}$).

SGTM's mask is hard and label-driven. Evil MoE replaces it with a soft, learned router
weight `w in [0,1]`: the hack expert's output is scaled by `w` and the deployed expert's
gradient by `(1-w)`. The deployed block corresponds to $\theta_{\text{retain}}$ and the hack block to
$\theta_{\text{forget}}$. SGTM's self-reinforcement result (unlabeled forget data gravitates
to forget parameters once localization is seeded) is the analogy Evil MoE leans on, carried
by the router's path-of-least-resistance loop rather than by a hard mask.
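
The contrast between the hard mask and its soft replacement can be sketched on plain per-block gradients. This is an illustration under assumptions, not the repo's code: gradients are represented as lists keyed by hypothetical block names `"forget"` and `"retain"`, and the factor `w` on the forget block is the consequence of scaling the hack expert's output by `w`. Only the forget-sample rule is quoted above, so other samples pass through unchanged here.

```python
Grads = dict[str, list[float]]


def sgtm_hard_mask(grads: Grads, is_forget_sample: bool) -> Grads:
    """Hard, label-driven mask: forget samples do not update theta_retain."""
    if not is_forget_sample:
        # Only the forget-sample rule is sketched; other samples are left untouched.
        return {name: list(g) for name, g in grads.items()}
    return {
        "forget": list(grads["forget"]),
        "retain": [0.0 for _ in grads["retain"]],
    }


def soft_route(grads: Grads, w: float) -> Grads:
    """Soft analogue: retain (deployed) gradient times (1 - w), forget (hack) times w."""
    if not 0.0 <= w <= 1.0:
        raise ValueError(f"router weight must lie in [0, 1], got {w}")
    return {
        "forget": [w * g for g in grads["forget"]],
        "retain": [(1.0 - w) * g for g in grads["retain"]],
    }
```

In this sketch `soft_route(grads, 1.0)` coincides with the hard mask on a forget sample, which is the sense in which the learned weight softens SGTM's label-driven mask.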

## Oracle-free training (explained many times)

The test is whether the method applies to a new RL environment without an environment-specific
oracle. Requiring ground-truth reward-hacking labels for training rollouts invalidates the
method. Using a detector does not by itself invalidate it.

- The env's eval grader / full detector suite is an ORACLE -- a MEASUREMENT INSTRUMENT for
  THIS env only. It may score the final deploy eval and nothing else. Any train-time use
  (router routing, threshold selection, rollout labeling) leaks a signal unavailable in a new
  environment. The disqualifier is ORACLE / GROUND-TRUTH-LABEL LEAKAGE, not detector-presence.
- OUR setup is `v_act -> router -> w -> routing`: extract an activation-space hack direction
  from hand-built synthetic contrastive pairs (off-distribution, authored before observing
  training rollouts), seed the router with it, and anchor it with a continuous pin loss on
  those same pairs. The hand-authored pairs require no env-specific oracle and use no
  ground-truth labels from training rollouts. Generalization is tested by whether a router
  built from pairs covering some hack modes suppresses held-out modes absent from the pairs.
- On-distribution contrast pairs require labels for training rollouts and so violate the
  constraint. An ideal direction fit on oracle-labelled rollouts (`mu_hack - mu_clean`) is a
  diagnostic upper bound, not an attainable method.
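
The direction-then-router pipeline in the bullets above can be sketched as follows. Two assumptions are made loudly: the direction is taken to be a unit-norm difference of means (the form the last bullet names for the diagnostic `mu_hack - mu_clean`; the repo's extraction of `v_act` may differ), and the router is taken to be a sigmoid over a dot product. All function names are hypothetical.

```python
import math

Vector = list[float]


def mean(vectors: list[Vector]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    if not vectors:
        raise ValueError("need at least one activation vector")
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def hack_direction(hack_acts: list[Vector], clean_acts: list[Vector]) -> Vector:
    """Unit-norm difference of means, mu_hack - mu_clean."""
    diff = [h - c for h, c in zip(mean(hack_acts), mean(clean_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("hack and clean means coincide; no direction to extract")
    return [x / norm for x in diff]


def router_weight(pooled: Vector, direction: Vector, bias: float = 0.0) -> float:
    """Assumed router form: w = sigmoid(<direction, pooled> + bias), so w lies in (0, 1)."""
    score = sum(d * p for d, p in zip(direction, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-score))
```

The same function serves both cases the bullets distinguish, and the difference lies only in its inputs. Fed with activations of the hand-authored synthetic pairs, it stays oracle-free. Fed with oracle-labelled training rollouts, it yields the diagnostic upper bound and must not be used as the method.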

## Agent process rules (explained many times)

- Do not overconfidently diagnose. If you cannot list 3+ plausible hypotheses -- including
  bugs, subtle failures, and you being wrong about a concept -- you have lost perspective.
- DON'T act on a new theory/diagnosis until the user confirms it: don't keep building on it,
  and don't write it into code comments, docstrings, or docs. An unconfirmed theory baked
  into a comment misleads every future agent who reads it as fact. State it, wait, then write.
- DON'T inflate a diagnosis's confidence (maybe -> probably -> definitely) across turns or in
  writing. Keep the hedge unless new evidence justifies the change, and name that evidence.
- The user is often AFK: don't stop to ask a question whose answer you know or that the user
  already indicated. Commit and go ahead.
- Say less; answer the question and address the top point concisely unless asked for more.

## The adapter and routing setup

Per target Linear, the `lora2r` adapter (`src/vgrout/lora2r.py`) has trainable `A:[2r,d_in]`
and `B:[d_out,2r]`, split into independent deployed `[:r]` and quarantine `[r:]` blocks.
Frozen init copies are subtracted, so the net delta is exactly 0 at init. The deployed block
is the always-on "keep" expert; the quarantine block is the ablatable "hack" expert.
Deployment ablation resets the quarantine block to its init.
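
The zero-delta-at-init and ablation properties can be checked on tiny matrices without any framework. The sketch below assumes each expert's delta is its current `B @ A` block product minus the same product at the frozen init. The helper names are hypothetical and do not mirror `src/vgrout/lora2r.py`.

```python
import copy
import random

Matrix = list[list[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain nested-list matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def expert_delta(A: Matrix, B: Matrix, A0: Matrix, B0: Matrix, block: slice) -> Matrix:
    """Delta of one expert: B[:, block] @ A[block] minus the same product at init."""
    current = matmul([row[block] for row in B], A[block])
    initial = matmul([row[block] for row in B0], A0[block])
    return [[c - i for c, i in zip(rc, ri)] for rc, ri in zip(current, initial)]


def ablate_quarantine(A: Matrix, B: Matrix, A0: Matrix, B0: Matrix, r: int) -> None:
    """Deployment ablation: reset the quarantine block [r:] to its frozen init, in place."""
    A[r:] = copy.deepcopy(A0[r:])
    for row, row0 in zip(B, B0):
        row[r:] = row0[r:]


def is_zero(m: Matrix) -> bool:
    return all(value == 0.0 for row in m for value in row)


# Tiny illustrative shapes: A is [2r, d_in], B is [d_out, 2r].
rng = random.Random(0)
r, d_in, d_out = 2, 3, 2
A0 = [[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(2 * r)]
B0 = [[rng.gauss(0.0, 1.0) for _ in range(2 * r)] for _ in range(d_out)]
A, B = copy.deepcopy(A0), copy.deepcopy(B0)  # trainable copies start at the init
deployed, quarantine = slice(0, r), slice(r, 2 * r)
```

Because the init copies are subtracted, neither `A` nor `B` needs a zero initialization for the adapter to start as an exact no-op.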

For each rollout the learned router (`src/vgrout/moe_router.py`) emits a soft weight
`w in [0,1]` from the pooled deployed-block bottleneck activations. The forward hook
(`_lora2r_w`) applies:

- hack expert output scaled by `w`,
- deployed expert gradient scaled by `(1-w)` (forward value intact -- a soft detach).

So `w=0` is keep (deployed only, equals the deploy forward), `w=1` is route (only the hack
expert trains; the deployed expert is detached), and intermediate `w` trains both. The router gets gradient through `w`
(GRPO path-of-least-resistance) plus a continuous pin loss anchor. There is NO load-balancing
loss. The routing is per-rollout, not per-token -- this is a behavioral-adapter mixture, not
a capacity MoE.
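
One way to read the two hook rules as a single expression, assuming the soft detach is realized with a stop-gradient (the actual `_lora2r_w` hook may be written differently). Here $\operatorname{sg}$ denotes stop-gradient, $\Delta_{\text{dep}}$ and $\Delta_{\text{hack}}$ are the two expert deltas, and $y_{\text{base}}$ is the frozen Linear's output. These symbols are introduced for illustration only.

$$
y = y_{\text{base}} + (1-w)\,\Delta_{\text{dep}} + w\,\operatorname{sg}(\Delta_{\text{dep}}) + w\,\Delta_{\text{hack}}
$$

In the forward pass $(1-w)\,\Delta_{\text{dep}} + w\,\Delta_{\text{dep}} = \Delta_{\text{dep}}$, so the value is $y_{\text{base}} + \Delta_{\text{dep}} + w\,\Delta_{\text{hack}}$ and reduces to the deploy forward at $w=0$. In the backward pass the stop-gradient term contributes nothing, which gives:

$$
\frac{\partial y}{\partial \theta_{\text{retain}}} = (1-w)\,\frac{\partial \Delta_{\text{dep}}}{\partial \theta_{\text{retain}}},
\qquad
\frac{\partial y}{\partial \theta_{\text{forget}}} = w\,\frac{\partial \Delta_{\text{hack}}}{\partial \theta_{\text{forget}}},
\qquad
\frac{\partial y}{\partial w} = \Delta_{\text{hack}}
$$

Under this reading the two $\Delta_{\text{dep}}$ terms cancel in $\partial y / \partial w$, so the router's gradient through `w` comes from the hack expert's contribution alone.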

## Extra instructions

- When you queue a job, follow with `pueue follow | tail` in the background so you are woken
  on fail or finish.
- For every task be clear on its UAT (user acceptance test); track it in the task list. A
  task is not finished until you have 1) collected the evidence, 2) sanity-checked it with a
  fresh-eyes subagent, 3) given the user a link to its location.
- `just smoke` before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
- Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
- No `tests/` dir; `smoke` is the correctness gate. Unit checks are `scripts/verify_*.py`
  (assert + exit nonzero), wired into the `smoke` recipe. Every load-bearing invariant gets a
  `verify_*.py` gate written in the same commit as the claim.

## Files

- [src/vgrout/train_moe.py](src/vgrout/train_moe.py) -- the Evil MoE GRPO loop.
- [src/vgrout/moe_router.py](src/vgrout/moe_router.py) -- `HackRouter` (pooled acts -> `w`).
- [src/vgrout/lora2r.py](src/vgrout/lora2r.py) -- the two-expert adapter + forward hook.
- [scripts/verify_moe_router.py](scripts/verify_moe_router.py) -- the routing-invariant gate.
- [docs/spec/](docs/spec/) -- the original Evil MoE proposal and literature map.
- [docs/papers/grad_routing/paper_gradient_routing.md](docs/papers/grad_routing/paper_gradient_routing.md)
  and [paper_sgtm.md](docs/papers/grad_routing/paper_sgtm.md) -- the gradient-routing priors.
- Substrate (reward-hacking LeetCode env): [docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md](docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md), code in `docs/vendor/rl-rewardhacking`.