mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 18:04:59 +08:00
04a98b321e
Fork of vGROUT. Replaces routeA's fixed v_act quantile gate with a learned per-rollout soft router (HackRouter, seeded from v_act) on the ablatable hack expert: GRPO flows into the router through the soft weight w (it concentrates hack-like rollouts in the hack expert), and a continuous pin loss on the hand-authored pairs anchors the axis. No load balancing; routing is per rollout. lora2r gains a soft-weight forward path (_lora2r_w: w=0 keep, w=1 rout, deployed grad scaled by 1-w). train_moe.py is the on-policy GRPO loop; verify_moe_router.py gates the routing invariants. `just smoke` is green. README/AGENTS rewritten for the fork; original proposal kept as docs/spec/original_evil_moe_spec.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
170 lines
10 KiB
Markdown
170 lines
10 KiB
Markdown
# AGENTS.md — Evil MoE
|
|
|
|
**This is novel ML research.** Not in your training data. Extrapolate carefully and without overconfidence.
|
|
**This is research code.** We want readable, fail-fast, pseudocode-like code.
|
|
**Editing**: Plan, Implement, Smoke test (see setup-repo skill), subagent review, commit.
|
|
|
|
see @README.md for the project overview. Inherit global rules from `~/.claude/CLAUDE.md`.
|
|
|
|
This is a fork of vGROUT (the `upstream` remote). It reuses vGROUT's substrate (the
|
|
reward-hacking LeetCode env, the GRPO loop, the reward grader, the deploy-ablation
|
|
evaluator, the lora2r adapter) and replaces the routing mechanism. The vGROUT routeA/none/
|
|
absorb arms remain in `src/vgrout/train.py` for comparison; Evil MoE lives in
|
|
`src/vgrout/train_moe.py`.
|
|
|
|
## Project in one paragraph
|
|
|
|
Hypothesis
|
|
> A learned MoE-style router, seeded by a synthetic activation-space hack direction and
|
|
> anchored by a continuous pin loss on hand-authored contrastive pairs, can localize
|
|
> reward-hacking behaviour in a single ablatable expert. The decisive test is causal:
|
|
> ablate the hack expert at deployment and measure whether the reward-hack rate drops while
|
|
> ground-truth solve rate survives, and whether it drops more specifically than ablating a
|
|
> random/clean expert at matched capacity.
|
|
|
|
This is a localization / ablatable-modularity claim, not a strict Gradient-Routing
|
|
absorption claim. GRPO on MoE is well precedented (DeepSeek-R1 on the 671B DeepSeek-V3 MoE;
|
|
MoE-GRPO, arXiv:2603.24984).
|
|
|
|
## Writing style
|
|
|
|
Use complete declarative sentences and terminology established by the background papers.
|
|
Prefer precise descriptions such as `ground-truth monitor`, `chance-level classifier`,
|
|
`retained at deployment`, and `deployment ablation`. Avoid conversational or project-specific
|
|
shorthand such as `no-cheat`, `canary`, `honest`, `knob`, and `scoreboard` when a technical
|
|
term is available. Concision should come from removing unnecessary material, not from clipped
|
|
sentences. See the quotes below for the register of acceptable technical writing.
|
|
|
|
## Background definitions
|
|
|
|
### Reward-hacking environment
|
|
|
|
From Ariahw et al., [*Steering RL Training: Benchmarking Interventions Against Reward
|
|
Hacking*](docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md):
|
|
|
|
> We call our setup the **overwrite tests loophole**. Every problem is evaluated by
|
|
> calling a `run_tests()` evaluation function that contains all of the test cases
|
|
> checked using assert statements. When the loophole is turned on, **the model is able
|
|
> to define `run_tests()` in its solution and overwrite the environment's definition**.
|
|
> We do not instruct the model that it has this capability, instead we only briefly
|
|
> mention the function name in a short suffix to the problem.
|
|
|
|
The paper defines reward hacking as a solution that passes its self-defined evaluation
|
|
function but fails the ground-truth tests.
|
|
|
|
### Absorption (the stricter claim we are NOT making)
|
|
|
|
From Cloud et al., [*Gradient
|
|
Routing*](docs/papers/grad_routing/paper_gradient_routing.md):
|
|
|
|
> Gradient routing induces absorption. Routing a subset of the data related to some
|
|
> knowledge or capability appears to localize that knowledge or capability more
|
|
> generally. [...] Notably, this effect did not hold for DEMix, a modularity method in
|
|
> which localized modules are sequestered so that only one (per layer) participates in
|
|
> each forward pass. To explain these observations, we posit absorption: (i) routing
|
|
> limited data to a region creates units of computation or features that are relevant to
|
|
> a broader task; (ii) these units then participate in the model's predictions on
|
|
> related, non-routed data, reducing prediction errors on these data, so that (iii) the
|
|
> features are not learned elsewhere.
|
|
|
|
The DEMix result matters: hard forward expert separation blocks absorption because it
|
|
requires all features present at the forward pass. Evil MoE's soft routing keeps the
|
|
deployed expert's forward value intact, but its goal is ablatable modularity, not absorption
|
|
-- do not claim absorption from an ablation result.
|
|
|
|
### Selective gradient masking (the mechanism we soften)
|
|
|
|
From Cloud et al., [*Beyond Data Filtering: Knowledge Localization for Capability
|
|
Removal in LLMs*](docs/papers/grad_routing/paper_sgtm.md):
|
|
|
|
> For samples from $\mathbf{D}_{\text{forget}}$, we apply selective gradient masking
|
|
> during the backward pass so that these samples do not update
|
|
> $\theta_{\text{retain}}$. We first compute gradients for all parameters normally, and
|
|
> then zero out gradients for $\theta_{\text{retain}}$ before applying the optimizer
|
|
> ($\nabla_{\theta}=\{\nabla_{\theta_{\text{forget}}}, 0\}$).
|
|
|
|
SGTM's mask is hard and label-driven. Evil MoE replaces it with a soft, learned router
|
|
weight `w in [0,1]`: the hack expert is scaled by `w` and the deployed expert's gradient by
|
|
`(1-w)`. The deployed block corresponds to $\theta_{\text{retain}}$ and the hack block to
|
|
$\theta_{\text{forget}}$. SGTM's self-reinforcement result (unlabeled forget data gravitates
|
|
to forget parameters once localization is seeded) is the analogy Evil MoE leans on, carried
|
|
by the router's path-of-least-resistance loop rather than by a hard mask.
|
|
|
|
## Oracle-free training (explained many times)
|
|
|
|
The test is whether the method applies to a new RL environment without an environment-specific
|
|
oracle. Requiring ground-truth reward-hacking labels for training rollouts invalidates the
|
|
method. Using a detector does not by itself invalidate it.
|
|
|
|
- The env's eval grader / full detector suite is an ORACLE -- a MEASUREMENT INSTRUMENT for
|
|
THIS env only. It may score the final deploy eval and nothing else. Any train-time use
|
|
(router routing, threshold selection, rollout labeling) leaks a signal unavailable in a new
|
|
environment. The disqualifier is ORACLE / GROUND-TRUTH-LABEL LEAKAGE, not detector-presence.
|
|
- OUR setup is `v_act -> router -> w -> routing`: extract an activation-space hack direction
|
|
from hand-built synthetic contrastive pairs (off-distribution, authored before observing
|
|
training rollouts), seed the router with it, and anchor it with a continuous pin loss on
|
|
those same pairs. The hand-authored pairs require no env-specific oracle and use no
|
|
ground-truth labels from training rollouts. Generalization is tested by whether a router
|
|
built from pairs covering some hack modes suppresses held-out modes absent from the pairs.
|
|
- On-distribution contrast pairs require labels for training rollouts and so violate the
|
|
constraint. An ideal direction fit on oracle-labelled rollouts (`mu_hack - mu_clean`) is a
|
|
diagnostic upper bound, not an attainable method.
|
|
|
|
## Agent process rules (explained many times)
|
|
|
|
- Do not overconfidently diagnose. If you cannot list 3+ plausible hypotheses -- including
|
|
bugs, subtle failures, and you being wrong about a concept -- you have lost perspective.
|
|
- DON'T act on a new theory/diagnosis until the user confirms it: don't keep building on it,
|
|
and don't write it into code comments, docstrings, or docs. An unconfirmed theory baked
|
|
into a comment misleads every future agent who reads it as fact. State it, wait, then write.
|
|
- DON'T inflate a diagnosis's confidence (maybe -> probably -> definitely) across turns or in
|
|
writing. Keep the hedge unless new evidence justifies the change, and name that evidence.
|
|
- The user is often AFK: don't stop to ask a question whose answer you know or that the user
|
|
already indicated. Commit and go ahead.
|
|
- Say less; answer the question and address the top point concisely unless asked for more.
|
|
|
|
## The adapter and routing setup
|
|
|
|
Per target Linear, the `lora2r` adapter (`src/vgrout/lora2r.py`) has trainable `A:[2r,d_in]`
|
|
and `B:[d_out,2r]`, split into independent deployed `[:r]` and quarantine `[r:]` blocks.
|
|
Frozen init copies are subtracted, so the net delta is exactly 0 at init. The deployed block
|
|
is the always-on "keep" expert; the quarantine block is the ablatable "hack" expert.
|
|
Deployment ablation resets the quarantine block to its init.
|
|
|
|
For each rollout the learned router (`src/vgrout/moe_router.py`) emits a soft weight
|
|
`w in [0,1]` from the pooled deployed-block bottleneck activations. The forward hook
|
|
(`_lora2r_w`) applies:
|
|
|
|
- hack expert output scaled by `w`,
|
|
- deployed expert gradient scaled by `(1-w)` (forward value intact -- a soft detach).
|
|
|
|
So `w=0` is keep (deployed only, equals the deploy forward), `w=1` is rout (hack expert only,
|
|
deployed detached), and intermediate `w` trains both. The router gets gradient through `w`
|
|
(GRPO path-of-least-resistance) plus a continuous pin loss anchor. There is NO load-balancing
|
|
loss. The routing is per-rollout, not per-token -- this is a behavioral-adapter mixture, not
|
|
a capacity MoE.
|
|
|
|
## Extra instructions
|
|
|
|
- When you queue a job, follow with `pueue follow | tail` in the background so you are woken
|
|
on fail or finish.
|
|
- For every task be clear on its UAT (user acceptance test); track it in the task list. A
|
|
task is not finished until you have 1) collected the evidence, 2) sanity-checked it with a
|
|
fresh-eyes subagent, 3) given the user a link to its location.
|
|
- `just smoke` before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
|
|
- Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
|
|
- No `tests/` dir; `smoke` is the correctness gate. Unit checks are `scripts/verify_*.py`
|
|
(assert + exit nonzero), wired into the `smoke` recipe. Every load-bearing invariant gets a
|
|
`verify_*.py` gate written in the same commit as the claim.
|
|
|
|
## Files
|
|
|
|
- [src/vgrout/train_moe.py](src/vgrout/train_moe.py) -- the Evil MoE GRPO loop.
|
|
- [src/vgrout/moe_router.py](src/vgrout/moe_router.py) -- `HackRouter` (pooled acts -> `w`).
|
|
- [src/vgrout/lora2r.py](src/vgrout/lora2r.py) -- the two-expert adapter + forward hook.
|
|
- [scripts/verify_moe_router.py](scripts/verify_moe_router.py) -- the routing-invariant gate.
|
|
- [docs/spec/](docs/spec/) -- the original Evil MoE proposal and literature map.
|
|
- [docs/papers/grad_routing/paper_gradient_routing.md](docs/papers/grad_routing/paper_gradient_routing.md)
|
|
and [paper_sgtm.md](docs/papers/grad_routing/paper_sgtm.md) -- the gradient-routing priors.
|
|
- Substrate (reward-hacking LeetCode env): [docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md](docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md), code in `docs/vendor/rl-rewardhacking`.
|