feat: Evil MoE — learned soft router + pin loss on an ablatable hack expert

Fork of vGROUT. Replaces routeA's fixed v_act quantile gate with a learned per-rollout soft router (HackRouter, seeded from v_act) on the ablatable hack expert: the GRPO gradient flows into the router through the soft weight w (it concentrates hack-like rollouts in the hack expert), and a continuous pin loss on the hand-authored pairs anchors the axis. No load balancing; routing is per rollout.

lora2r gains a soft-weight forward path (_lora2r_w: w=0 keep, w=1 route, deployed grad scaled by 1-w). train_moe.py is the on-policy GRPO loop; verify_moe_router.py gates the routing invariants. `just smoke` is green.

README/AGENTS rewritten for the fork; original proposal kept as docs/spec/original_evil_moe_spec.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:45:42 +08:00 · 2026-06-14 11:25:14 +08:00
parent cca7150ea0
commit 04a98b321e
18 changed files with 8874 additions and 501 deletions
@@ -1,39 +1,39 @@
-# AGENTS.md — projected_grpo
+# AGENTS.md — Evil MoE

 **This is novel ML research.** Not in your training data. Extrapolate carefully and without overconfidence.
-**This is reasearch code** We want readable, fail fast, pseudocode like code
-**Editing**: Plan, Implement, Smoke test (see setup-repo skill), subagent review, commit
+**This is research code.** We want readable, fail-fast, pseudocode-like code.
+**Editing**: Plan, Implement, Smoke test (see setup-repo skill), subagent review, commit.

+See @README.md for the project overview. Inherit global rules from `~/.claude/CLAUDE.md`.

-see @README.md for project overview
+This is a fork of vGROUT (the `upstream` remote). It reuses vGROUT's substrate (the
+reward-hacking LeetCode env, the GRPO loop, the reward grader, the deploy-ablation
+evaluator, the lora2r adapter) and replaces the routing mechanism. The vGROUT routeA/none/
+absorb arms remain in `src/vgrout/train.py` for comparison; Evil MoE lives in
+`src/vgrout/train_moe.py`.

 ## Project in one paragraph

-
 Hypothesis
-> Prior gradient-routing methods route with labels. We ask whether a synthetic
-> activation-space hacking vector can replace those labels. In this toy GRPO
-> reward-hacking setup, pooled activations select whether each rollout updates
-> deployed parameters, quarantine parameters, or both. The decisive comparison
-> is whether real `v_act` beats a Haar-random direction, while measuring routing
-> mass as a potential confound.
+> A learned MoE-style router, seeded by a synthetic activation-space hack direction and
+> anchored by a continuous pin loss on hand-authored contrastive pairs, can localize
+> reward-hacking behaviour in a single ablatable expert. The decisive test is causal:
+> ablate the hack expert at deployment and measure whether the reward-hack rate drops while
+> ground-truth solve rate survives, and whether it drops more specifically than ablating a
+> random/clean expert at matched capacity.

-Motivation:
-We want to improve gradient routing with scalable self-supervised signals. Here
-the routing signal is an activation direction extracted from synthetic pairs,
-rather than a ground-truth label or reward modification.
-
-Inherit global rules from `~/.claude/CLAUDE.md`.
+This is a localization / ablatable-modularity claim, not a strict Gradient-Routing
+absorption claim. GRPO on MoE is well precedented (DeepSeek-R1 on the 671B DeepSeek-V3 MoE;
+MoE-GRPO, arXiv:2603.24984).

 ## Writing style

-Use complete declarative sentences and terminology established by the background
-papers. Prefer precise descriptions such as `ground-truth monitor`, `chance-level
-classifier`, `retained at deployment`, and `deployment ablation`. Avoid conversational
-or project-specific shorthand such as `no-cheat`, `canary`, `honest`, `foot-gun`,
-`knob`, `ships`, `coin flip purity`, and `scoreboard` when a technical term is
-available. Concision should come from removing unnecessary material, not from clipped
-sentences. See the qoute's below from papers for example of acceptable technical writing.
+Use complete declarative sentences and terminology established by the background papers.
+Prefer precise descriptions such as `ground-truth monitor`, `chance-level classifier`,
+`retained at deployment`, and `deployment ablation`. Avoid conversational or project-specific
+shorthand such as `no-cheat`, `canary`, `honest`, `knob`, and `scoreboard` when a technical
+term is available. Concision should come from removing unnecessary material, not from clipped
+sentences. See the quotes below for the register of acceptable technical writing.

 ## Background definitions

@@ -52,29 +52,27 @@ Hacking*](docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-intervent
 The paper defines reward hacking as a solution that passes its self-defined evaluation
 function but fails the ground-truth tests.

-### Absorption
+### Absorption (the stricter claim we are NOT making)

 From Cloud et al., [*Gradient
 Routing*](docs/papers/grad_routing/paper_gradient_routing.md):

 > Gradient routing induces absorption. Routing a subset of the data related to some
 > knowledge or capability appears to localize that knowledge or capability more
-> generally. This held for an i.i.d. subset o
-> f the data (TinyStories unlearning in
-> section 4.2.2), and for semantically limited data (steering scalar in section 4.2.1,
-> virology unlearning in section 4.2.3, scalable oversight in section 4.3). Notably,
-> this effect did not hold for DEMix, a modularity method in which localized modules
-> are sequestered so that only one (per layer) participates in each forward pass. To
-> explain these observations, we posit absorption: (i) routing limited data to a region
-> creates units of computation or features that are relevant to a broader task; (ii)
-> these units then participate in the model's predictions on related, non-routed data,
-> reducing prediction errors on these data, so that (iii) the features are not learned
-> elsewhere. Absorption may also amplify the features causing it. When data labels are
-> semantically or quantitatively limited, absorption means that gradient routing can be
-> useful even in cases where conventional training or data filtering methods are
-> inadequate.
+> generally. [...] Notably, this effect did not hold for DEMix, a modularity method in
+> which localized modules are sequestered so that only one (per layer) participates in
+> each forward pass. To explain these observations, we posit absorption: (i) routing
+> limited data to a region creates units of computation or features that are relevant to
+> a broader task; (ii) these units then participate in the model's predictions on
+> related, non-routed data, reducing prediction errors on these data, so that (iii) the
+> features are not learned elsewhere.

-### Selective gradient masking
+The DEMix result matters: hard forward-pass expert separation blocks absorption, because
+absorption requires the localized features to participate in forward passes on non-routed
+data. Evil MoE's soft routing keeps the deployed expert's forward value intact, but its goal
+is ablatable modularity, not absorption -- do not claim absorption from an ablation result.
+
+### Selective gradient masking (the mechanism we soften)

 From Cloud et al., [*Beyond Data Filtering: Knowledge Localization for Capability
 Removal in LLMs*](docs/papers/grad_routing/paper_sgtm.md):
@@ -83,223 +81,89 @@ Removal in LLMs*](docs/papers/grad_routing/paper_sgtm.md):
 > during the backward pass so that these samples do not update
 > $\theta_{\text{retain}}$. We first compute gradients for all parameters normally, and
 > then zero out gradients for $\theta_{\text{retain}}$ before applying the optimizer
-> ($\nabla_{\theta}=\{\nabla_{\theta_{\text{forget}}}, 0\}$). Masking parameter
-> gradients rather than activation gradients is the key distinction from the prior
-> Gradient Routing method. While both approaches prevent updates to
-> $\theta_{\text{retain}}$ on forget examples, masking activation gradients is more
-> disruptive, because it blocks backpropagation through the masked activations,
-> altering gradients for all remaining parameters. It also permits greater information
-> flow from $\mathbf{D}_{\text{forget}}$ into non-forget parameters, since
-> activation-gradient masking does not block updates to down-projection layers.
+> ($\nabla_{\theta}=\{\nabla_{\theta_{\text{forget}}}, 0\}$).

-This project instead masks and detaches adapter branch outputs per rollout. These
-output masks prevent selected branches from receiving gradients, but they are not
-SGTM's direct parameter-gradient masks. The retained block corresponds to
-$\theta_{\text{retain}}$ and the quarantine block corresponds to
-$\theta_{\text{forget}}$.
+SGTM's mask is hard and label-driven. Evil MoE replaces it with a soft, learned router
+weight `w in [0,1]`: the hack expert's output is scaled by `w` and the deployed expert's gradient by
+`(1-w)`. The deployed block corresponds to $\theta_{\text{retain}}$ and the hack block to
+$\theta_{\text{forget}}$. SGTM's self-reinforcement result (unlabeled forget data gravitates
+to forget parameters once localization is seeded) is the analogy Evil MoE leans on, carried
+by the router's path-of-least-resistance loop rather than by a hard mask.

-## Pin costs and the gate as a classifier
+## Oracle-free training (explained many times)

-What the quotes above do not state: this project's cost model for routing errors,
-and the classifier view of the gate. Both are ours, derived from SGTM's design, not
-quoted from it.
+The test is whether the method applies to a new RL environment without an environment-specific
+oracle. Requiring ground-truth reward-hacking labels for training rollouts invalidates the
+method. Using a detector does not by itself invalidate it.

-Routing assigns each rollout's gradient update to parameters retained at deployment or
-to quarantine parameters removed by deployment ablation. The pins (keep and route)
-are the expensive errors and the absorb middle is the safety net. SGTM pins only
-confident samples by design:
+- The env's eval grader / full detector suite is an ORACLE -- a MEASUREMENT INSTRUMENT for
+  THIS env only. It may score the final deploy eval and nothing else. Any train-time use
+  (router routing, threshold selection, rollout labeling) leaks a signal unavailable in a new
+  environment. The disqualifier is ORACLE / GROUND-TRUTH-LABEL LEAKAGE, not detector-presence.
+- OUR setup is `v_act -> router -> w -> routing`: extract an activation-space hack direction
+  from hand-built synthetic contrastive pairs (off-distribution, authored before observing
+  training rollouts), seed the router with it, and anchor it with a continuous pin loss on
+  those same pairs. The hand-authored pairs require no env-specific oracle and use no
+  ground-truth labels from training rollouts. Generalization is tested by whether a router
+  built from pairs covering some hack modes suppresses held-out modes absent from the pairs.
+- On-distribution contrast pairs require labels for training rollouts and so violate the
+  constraint. An ideal direction fit on oracle-labelled rollouts (`mu_hack - mu_clean`) is a
+  diagnostic upper bound, not an attainable method.

-> $\mathbf{D}_{\text{forget}}$ and $\mathbf{D}_{\text{retain}}$ are intended to
-> contain samples where the input classifier is confident in the corresponding label,
-> while uncertain or ambiguous samples would be a part of
-> $\mathbf{D}_{\text{unlabeled}}$.
+## Agent process rules (explained many times)

-A hack the route cut misses into absorb is the cheap error. For SGTM's 64M model,
-"leakage remains remarkably low: between 0.005 and 0.02 for undiscovered forget rates
-up to 40%", because "knowledge is effectively localized, causing unlabeled or
-mislabeled forget samples to predominantly update forget-designated parameters rather
-than the retain set". Wrong pins have no such mitigation: a hack pinned keep trains
-the deployed block with the quarantine zeroed in its forward, outside absorption's
-reach, and a clean pinned route trains only the quarantine, which deployment ablation
-deletes. Both cuts therefore target precision; hack recall is delegated to the middle.
-SGTM's tested label noise is only missed pins ("undiscovered forget percentage ...
-could also be seen as FNR (False Negative Rate) of the hypothetical classifier
-identifying the forget data"), so the wrong-pin costs are mechanical arguments, not
-measurements. Current routeA thresholds are label-free symmetric quantile tails.
-
-The gate is a single binary classifier with a reject option, not a
-three-class one: one linear probe score (the routing score is the dot product of
-pooled deployed-block bottleneck activations with `v_act`; see the adapter section
-and oracle-free bullet 3 below) with two ordered quantile cuts. Route is the confident-hack prediction, keep the
-confident-clean prediction, and the absorb band between the cuts is abstention.
-Because the discriminant is shared, the two cuts are not independent classifiers,
-but each carries its own precision (the clean share of route, the hack share of
-keep), so it is scored like two one-sided detectors on one score. Its three outputs predict SGTM's $\mathbf{D}_{\text{forget}}$ /
-$\mathbf{D}_{\text{retain}}$ / $\mathbf{D}_{\text{unlabeled}}$ assignment, taking the
-role of SGTM's "input classifier" -- trained without environment labels. That is why
-the gate is evaluated with classifier metrics (AUROC, precision/recall, F_0.5). The
-problem is cost-imbalanced and class-imbalanced: per the pin-cost model above, both
-wrong pins are expensive while a miss into absorb is cheap, so the operating point is
-high precision at both cuts (F_0.5, not F_1) with recall delegated to absorption; and
-the hack base rate is nonstationary (near zero early, saturating later), which is why
-the thresholds are quantiles of a run-spanning buffer rather than a fixed cut.
-
-The middle region leaves both parameter blocks trainable and may permit absorption.
-Calling this region `absorb` names the intended mechanism; it does not establish that
-absorption occurred. `scripts/diag_pinning.py` reports score AUROC, precision/recall
-at the route threshold, and a precision-weighted F_0.5 at the route cut.
-
-## Things the user has had to explain many times:
-
- Oracle-free training. The test is whether the method applies to a new RL environment
-  without access to an environment-specific oracle. Requiring ground-truth
-  reward-hacking labels for training rollouts invalidates the method. Using a detector
-  does not by itself invalidate the method. Distinguish the following cases:
-  1. The env's eval grader / full detector suite is an ORACLE, but only a MEASUREMENT
-     INSTRUMENT for THIS env -- it may score the final deploy eval and nothing else. Any
-     train-time use (gate routing, threshold selection, or rollout labeling) leaks a
-     signal unavailable in the new environment.
-  2. The disqualifier is ORACLE / GROUND-TRUTH-LABEL LEAKAGE, not detector-presence. A live
-     detector is valid if it never saw the true hack-labels of the
-     rollouts it judges and needs no env-specific oracle to build. By this test, of the Ariahw
-     monitors: Ground-Truth (+70/90% variants) uses the oracle label of the training
-     rollout; LLM judge is the fairest external baseline (generic model, no
-     oracle, transfers to a new env); probe = boundary -- legitimate only insofar as it was
-     trained elsewhere and transfers oracle-free, but you can neither tune nor verify it on the
-     new env, so it is an open question ("could we use the probe at deployment / on a new
-     dataset?"), not a usable method. A "weak detector for hack A that misses B/C/D" trained on
-     this env's labels is the classic weak-LABEL setup (labelA -> suppress labelNotA); that is
-     NOT our setup -- do not frame our method that way.
-  3. OUR setup is `v_act -> routing`: extract an activation-space hack direction from
-     hand-built synthetic contrastive pairs (off-distribution, authored before observing
-     training rollouts), then score each rollout by the dot product between its pooled
-     bottleneck activations and `v_act`. That score selects which parameter block receives
-     the rollout's GRPO update. The hand-authored pairs require no env-specific oracle and
-     use no ground-truth labels from training rollouts. Generalization is tested by whether `v_act`
-     (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs
-     -- vector generalization, not detector-label generalization.
-  4. On-distribution contrast pairs require labels for training rollouts and therefore
-     violate the oracle-free training constraint. An ideal direction fit on
-     oracle-labelled training rollouts, such as `mu_hack - mu_clean`, is a diagnostic
-     upper bound rather than an attainable method. The available intervention is to
-     improve the diversity and matching of hand-authored pairs while keeping all labels
-     outside the training distribution.
- do not overconfidentaly diagnoses. if you cant think of 3+ plausible hypothesis - including bugs, subtle failures, and you being wrong about concepts - then you have lost perspective and narrow vision
- DON'T act on a new theory/diagnosis until I confirm it: don't keep building on it, and don't write it into code comments, docstrings, or docs. An unconfirmed theory baked into a comment misleads every future agent who reads it as fact. State it to me, wait for the OK, THEN write.
- DON'T inflate a diagnosis's confidence (maybe -> probably -> definitely) across turns or in writing. Keep the hedge you started with unless new evidence justifies the change, and name that evidence. Confidence creep in comments/docs is how a guess becomes "known" with no one having checked.
- I'd often afk so dont stop and ask me a question you know the likely answer, or I've already indicated or asked for, or where there is only on answer "waiting for your go ahead". I'd rather you just commit and go ahead
+- Do not overconfidently diagnose. If you cannot list 3+ plausible hypotheses -- including
+  bugs, subtle failures, and you being wrong about a concept -- you have lost perspective.
+- DON'T act on a new theory/diagnosis until the user confirms it: don't keep building on it,
+  and don't write it into code comments, docstrings, or docs. An unconfirmed theory baked
+  into a comment misleads every future agent who reads it as fact. State it, wait, then write.
+- DON'T inflate a diagnosis's confidence (maybe -> probably -> definitely) across turns or in
+  writing. Keep the hedge unless new evidence justifies the change, and name that evidence.
+- The user is often AFK: don't stop to ask a question whose answer you know or that the user
+  already indicated. Commit and go ahead.
+- Say less; answer the question and address the top point concisely unless asked for more.

 ## The adapter and routing setup

-Per target Linear, the current `lora2r` adapter has trainable
-`A:[2r,d_in]` and `B:[d_out,2r]`, split into independent deployed `[:r]`
-and quarantine `[r:]` blocks. Frozen initialization copies are subtracted,
-making the net adapter delta exactly zero at initialization. Deployment ablation
-resets the quarantine block to its initialization.
+Per target Linear, the `lora2r` adapter (`src/vgrout/lora2r.py`) has trainable `A:[2r,d_in]`
+and `B:[d_out,2r]`, split into independent deployed `[:r]` and quarantine `[r:]` blocks.
+Frozen init copies are subtracted, so the net delta is exactly 0 at init. The deployed block
+is the always-on "keep" expert; the quarantine block is the ablatable "hack" expert.
+Deployment ablation resets the quarantine block to its init.

-For each rollout, routeA sets an output mask before the single grad-carrying
-forward and backward:
+For each rollout the learned router (`src/vgrout/moe_router.py`) emits a soft weight
+`w in [0,1]` from the pooled deployed-block bottleneck activations. The forward hook
+(`_lora2r_w`) applies:

- keep `(m=0,d=0)`: only the deployed block trains.
- absorb `(m=1,d=0)`: both blocks train, which may permit absorption.
- route `(m=1,d=1)`: only the quarantine block trains; the deployed output remains
-  in the forward pass but is detached.
+- hack expert output scaled by `w`,
+- deployed expert gradient scaled by `(1-w)` (forward value intact -- a soft detach).

-The gate reads pooled activations, not gradients. Its masks determine which block
-receives the subsequent GRPO gradient update. The Haar-random `v_act` placebo
-tests whether direction discrimination adds value beyond quarantine-induced
-shrinkage; compare its measured `qmass` because routing mass is not controlled.
+So `w=0` is keep (deployed only, equals the deploy forward), `w=1` is route (hack expert only,
+deployed detached), and intermediate `w` trains both. The router gets gradient through `w`
+(GRPO path-of-least-resistance) plus a continuous pin loss anchor. There is NO load-balancing
+loss. The routing is per-rollout, not per-token -- this is a behavioral-adapter mixture, not
+a capacity MoE.

-### gen / forward / backward
+## Extra instructions

-Three stages, each answering a different question.
-
- gen is exploration: it only picks which tokens to try, which are then teacher-forced.
-  No gradient flows through it. The one question is on- vs off-policy. `gen_deploy_frac`
-  sets how many rollouts are sampled with the quarantine ablated.
- forward is teacher-forced and applies the routeA mask. Two forwards: `logπ_old`
-  (no-grad baseline, also reads the gate activations) and `logπ` (under the mask). The
-  masks are SGTM Table 1: keep zeroes the quarantine in the forward (Selective Parameter
-  Masking, so the deployed block learns to work when the quarantine is ablated); route
-  keeps the deployed block in the forward but detaches it (Selective Gradient Masking,
-  so only the quarantine learns); absorb trains both. SGTM picks the row from a data
-  label; routeA picks it from the `v_act` score.
- backward is Dr.GRPO: per-prompt-group advantage `A = reward - mean`, with no `/σ` and
-  no length normalization. `logπ_old` is the behavior policy, computed in each rollout's
-  own sampling mode, so `ρ` is a true importance ratio. `ρ = 1` only when the mask's
-  forward mode matches the sampling mode (deploy-sampled keep; full-sampled
-  absorb/route). The unbounded direction is `ρ = full/ablated` with `A<0`, which the
-  one-sided clip can't bound: the old always-ablated baseline gave it to every
-  full-sampled route row (the frac=0 blow-up), and it still exists at `0<frac<1` on
-  deploy-sampled rows gated absorb/route. At frac=0 the only mismatch is full-sampled
-  keep, `ρ = ablated/full`, usually `<1` and clip-bounded. `ρ` is logged per zone.
-
-### Why generate with the quarantine on (frac=0)
-
-If we generate in deploy mode, every rollout is on-policy for the deployed block and
-off-policy for the quarantine. Reward pushes toward hacking, and the gradient for a
-hack flows most easily into the block that generated it -- the deployed one -- making
-it hackier instead of localizing the hack in the quarantine. Generating with the
-quarantine on lets it produce hacks and absorb them, and lets the hack saturate so
-absorption has a feature to localize. We keep ~25% of problems unhackable so the
-deployed block keeps solving while the hack still saturates. The cost of frac=0 is a
-train/deploy mismatch for the deployed block (it deploys generating alone but trained
-on full-model tokens), seen in deploy-eval solve. This all assumes absorption actually
-engages here, which is the thing to measure.
-
-## Extra instructions:
-
- When you queue a job, follow with `pueue follow | tail` in bg so you are woken on fail or finish
- for every task, it's a goal of subgoal, be clear on UAT (user acceptance test), use you task list to track the goal and non trivial UAT, and it's not finished untill you have 1) collected the evidence 2) sanity checked it with a fresh eyes subagent 3) given the user the link to it's location
- say less, just answer my questions, and concisly address the top point unless I ask for more - otherwise there is too much for me to review and read
+- When you queue a job, follow with `pueue follow | tail` in the background so you are woken
+  on fail or finish.
+- For every task be clear on its UAT (user acceptance test); track it in the task list. A
+  task is not finished until you have 1) collected the evidence, 2) sanity-checked it with a
+  fresh-eyes subagent, 3) given the user a link to its location.
+- `just smoke` before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
+- Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
+- No `tests/` dir; `smoke` is the correctness gate. Unit checks are `scripts/verify_*.py`
+  (assert + exit nonzero), wired into the `smoke` recipe. Every load-bearing invariant gets a
+  `verify_*.py` gate written in the same commit as the claim.

 ## Files

-For the setup, read these:
- [docs/human_journal.md](docs/human_journal.md) -- the user's notes on the method. The novel
-  part is routing by an extracted vector rather than per-example labels. The SVD adapter is a
-  detail, not the novel experiment. Whether the direction, the routing, or the SVD adapter
-  drives the suppression is an open question (random directions match in the controls).
- [docs/writeup/main.tex](docs/writeup/main.tex) -- the writeup: thesis and claims C1-C4.
+- [src/vgrout/train_moe.py](src/vgrout/train_moe.py) -- the Evil MoE GRPO loop.
+- [src/vgrout/moe_router.py](src/vgrout/moe_router.py) -- `HackRouter` (pooled acts -> `w`).
+- [src/vgrout/lora2r.py](src/vgrout/lora2r.py) -- the two-expert adapter + forward hook.
+- [scripts/verify_moe_router.py](scripts/verify_moe_router.py) -- the routing-invariant gate.
+- [docs/spec/](docs/spec/) -- the original Evil MoE proposal and literature map.
 - [docs/papers/grad_routing/paper_gradient_routing.md](docs/papers/grad_routing/paper_gradient_routing.md)
-  -- Cloud et al. Expand-Route-Ablate, the gradient-routing prior.
-
- Read [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md) for design rationale.
- New sweep arms get recipes in [justfile](justfile) with `# H:` hypothesis comments.
- `just smoke` before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
- Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
- Head [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for latest results.
- No `tests/` dir; `smoke` is the correctness gate. Unit checks are `scripts/verify_*.py`
-  (assert + exit nonzero), wired into the `smoke` recipe so every smoke runs them.
- Every load-bearing invariant gets a `verify_*.py` gate, written in the same commit as
-  the claim -- "the tests passed" means nothing if the property was never tested.
-
-On pairs. A routing pair is one SAME-PROMPT (hack, clean) completion duo: pos=the
-reward-hack, neg=the correct solution. The current vector is the mean paired
-difference in pooled deployed-block bottleneck activations.
-Like persona steering pairs, MATCH everything but the axis -- same
-prompt, similar length/style -- so hack-vs-clean is the only thing separating them
-(else style competes with the trait; see the style-confound section of the doc below).
-There is NO problem_id semantics: the only "id" is which completion is the hack side
-and which is the clean side. Two pairs with identical hack+clean but DIFFERENT prompts
-are DISTINCT because the prompt changes the activations. Authored pairs are
-off-distribution and hand-written;
-pool-derived pairs (e.g. prog_wide_clean) may contain training-distribution labels and
-are unsuitable for the primary oracle-free result.
- ./docs/personas/how_to_rewrite_pairs.md
- ./docs/personas/how_to_write_personas.md  -- pos/neg pair authoring rules + style confound
- ./docs/personas/personas_kept.md
- ./data/pairs/hack_pairs.md  -- headline multi-view pairs; supports `#all-in-one@tag`
- ./data/pairs/pair_diagnostics.md  -- semantic-framing experiments + placebo controls
- ./docs/personas/pairset_audit.md  -- pair quality, provenance, and use recommendations
-
-On concepts such as "what are contrastive pairs" or "why SVD space" grep
- ./docs/vendor/AntiPaSTO_concepts/README.md
-
-For the original paper (the substrate: reward-hacking LeetCode env)
- LessWrong post: ./docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md
- Code: ./docs/vendor/rl-rewardhacking
-
-For the gradient-routing prior (SGTM = latest gradient-routing paper, same authors as
-the original; source of the absorption/leakage vocab)
- ./docs/papers/grad_routing/paper_sgtm.md
+  and [paper_sgtm.md](docs/papers/grad_routing/paper_sgtm.md) -- the gradient-routing priors.
+- Substrate (reward-hacking LeetCode env): [docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md](docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md), code in `docs/vendor/rl-rewardhacking`.
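
The forward-value invariants stated in the adapter section of AGENTS.md (net delta exactly 0 at init; `w=0` equals the deploy forward; deployment ablation resets the quarantine block to its init) can be illustrated with a small `verify_*.py`-style check. This is a minimal sketch under stated assumptions, not the code in `src/vgrout/lora2r.py` or `scripts/verify_moe_router.py`: `ToyLora2r`, `run_checks`, and the helper names are hypothetical, the matrices are tiny plain-Python lists, and only forward values are modelled, so the `(1-w)` gradient scaling on the deployed expert is not exercised here.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class ToyLora2r:
    """Hypothetical stand-in for the two-expert adapter (forward values only)."""

    def __init__(self, d_in, d_out, r, seed=0):
        rng = random.Random(seed)
        self.r = r
        self.A = rand_matrix(2 * r, d_in, rng)   # A: [2r, d_in]
        self.B = rand_matrix(d_out, 2 * r, rng)  # B: [d_out, 2r]
        # Frozen init copies; subtracting them makes the net delta 0 at init.
        self.A0 = [row[:] for row in self.A]
        self.B0 = [row[:] for row in self.B]

    def _block_delta(self, x, lo, hi):
        def out(A, B):
            return matmul([row[lo:hi] for row in B], matmul(A[lo:hi], x))
        trained, init = out(self.A, self.B), out(self.A0, self.B0)
        return [t[0] - i[0] for t, i in zip(trained, init)]

    def delta(self, x, w):
        """Adapter delta for one rollout: deployed block plus w times hack block."""
        dep = self._block_delta(x, 0, self.r)             # deployed "keep" expert
        hack = self._block_delta(x, self.r, 2 * self.r)   # quarantine "hack" expert
        return [d + w * h for d, h in zip(dep, hack)]

    def perturb(self, rng, scale=0.1):
        """Stand-in for training: move every adapter weight away from init."""
        for m in (self.A, self.B):
            for row in m:
                for j in range(len(row)):
                    row[j] += scale * rng.uniform(-1.0, 1.0)

    def ablate(self):
        """Deployment ablation: reset the quarantine block [r:] to its init."""
        r = self.r
        for i in range(r, 2 * r):
            self.A[i] = self.A0[i][:]
        for row, row0 in zip(self.B, self.B0):
            row[r:] = row0[r:]


def run_checks():
    """Return a list of violated invariants (empty means the gate passes)."""
    rng = random.Random(1)
    adapter = ToyLora2r(d_in=4, d_out=3, r=2)
    x = [[rng.uniform(-1.0, 1.0)] for _ in range(4)]  # column vector [d_in, 1]
    failures = []
    if any(v != 0.0 for w in (0.0, 0.5, 1.0) for v in adapter.delta(x, w)):
        failures.append("net delta is not exactly 0 at init")
    adapter.perturb(rng)
    keep = adapter.delta(x, 0.0)   # w=0: deployed expert only
    full = adapter.delta(x, 1.0)   # w=1: hack expert fully on
    adapter.ablate()
    deploy = adapter.delta(x, 1.0)  # after ablation the hack block contributes 0
    if keep != deploy:
        failures.append("w=0 forward differs from the deploy (ablated) forward")
    if full == deploy:
        failures.append("hack expert had no effect before ablation")
    return failures


if __name__ == "__main__":
    problems = run_checks()
    if problems:
        raise SystemExit("\n".join(problems))  # exits nonzero, like a smoke gate
```

Returning the list of failures instead of exiting inside `run_checks` keeps the check importable from a `smoke` recipe while still exiting nonzero when run as a script.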