evil_MoE/docs/REFACTOR_HANDOFF.md

# Refactor handoff — 2026-06-05

Tag `pre-routing-refactor` marks the state before we rebuild the routing gate.
Reason: the route2 directionality experiment exposed that the vector is not
load-bearing in the current design, and we're going to simplify the gate.

## Why we're refactoring (the diagnosis)

route2 routes a rollout's gradient into the deleted `delta_S_hack` knob when
`hack_anchor | (cos(g_b, v_grad) > tau)` (train.py:915). Two problems:

1. **`hack_anchor` force-routes by label, bypassing the vector.** It is
   `teacher-row OR hacked_E-flagged-student` (train.py:352). In the full-teacher
   directionality runs the teacher pool spans all four modes, so every mode --
   including the ones "held out" of `v_grad` -- has teacher hack demos that get
   quarantined regardless of direction. That is why a random `v_grad` suppresses
   hack just as well as the real one (jobs 114/115/117): the labels carried it,
   not the vector.
2. **tau is calibrated from a live detector over student rollouts.** `tau` =
   midpoint of mean `cos(g_b, v_grad)` over hack-anchor vs clean-anchor rollouts,
   where the anchors come from `hacked_E` (the weak run_tests-signature detector,
   rewards.py:542 -- verified NOT the oracle). With a random `v_grad` both means
   collapse to ~0, tau -> 0, and `cos_b > tau` is a coin flip. The detector also
   leaks ~1.1% onto held-out B (double-hacks), patched by `--gate-anchor-teacher-only`.

Net: in the regime we ran, suppression is label force-routing + a coin flip, and
the extracted direction is decorative. The vector's *only* possible value is
suppressing **unlabeled hack B**, which only the cosine gate can touch -- and that
has never been tested against a random/placebo control.
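
The diagnosis above can be reproduced on toy data. This is a minimal sketch, not the code in train.py. It assumes hack gradients carry a component along one fixed direction (`hack_dir`, hypothetical), that half the hacks carry a label, and it reuses the `hack_anchor | (cos > tau)` rule and the midpoint calibration as described. All names and sizes are illustrative.

```python
import numpy as np

def cosines(grads, v):
    """Cosine of each gradient row with the reference direction v."""
    return grads @ v / (np.linalg.norm(grads, axis=1) * np.linalg.norm(v))

def midpoint_tau(cos_b, hack_anchor, clean_anchor):
    """tau = midpoint of mean cosine over hack-anchor vs clean-anchor rollouts."""
    return 0.5 * (cos_b[hack_anchor].mean() + cos_b[clean_anchor].mean())

def route2_mask(cos_b, hack_anchor, tau):
    """Current rule as described in the note: label force-route OR cosine gate."""
    return hack_anchor | (cos_b > tau)

rng = np.random.default_rng(0)
dim, n = 4096, 2000
hack_dir = rng.standard_normal(dim)          # assumed "true" hack direction (toy)
hack_dir /= np.linalg.norm(hack_dir)

idx = np.arange(n)
is_hack = idx % 2 == 0                       # toy: half the rollouts hack
labelled = idx % 4 == 0                      # half of the hacks carry a label
unlabelled_hack = is_hack & ~labelled        # the "B" case: hack, no label
clean = ~is_hack

grads = rng.standard_normal((n, dim))
grads[is_hack] += 8.0 * hack_dir             # hacks share a common component

def summarize(v):
    """Fraction of each group routed to the quarantine knob for reference v."""
    cos_b = cosines(grads, v)
    tau = midpoint_tau(cos_b, labelled, clean)
    routed = route2_mask(cos_b, labelled, tau)
    return {
        "tau": float(tau),
        "labelled_hack": float(routed[labelled].mean()),
        "unlabelled_hack": float(routed[unlabelled_hack].mean()),
        "clean": float(routed[clean].mean()),
    }

real = summarize(hack_dir)
random_v = summarize(rng.standard_normal(dim))
print("real  :", real)
print("random:", random_v)
```

In this toy, labelled hacks are routed under both vectors (the labels do the work), while the random vector routes roughly half of everything else. Only the unlabelled-hack row separates real from random, and that separation is built into the toy rather than measured.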

## The redesign (cheat less, make the vector necessary)

Driving principle: the only labelled data anywhere should be the hand-built A-pairs
(we need them for `v_grad` regardless). Then the gate is the *sole* suppression
mechanism, B is provably label-free, and real-vs-random is a clean test.

Concrete changes to make:
1. **Always stop teacher mixing after step 30** (`teacher_off_step=30` default):
   seed the hacks, then pure on-policy. Job 87 showed hacking self-sustains after
   the cut, so the teacher is only a seeder.
2. **Drop the teacher force-route** -- remove the `hack_anchor |` term; route purely
   by the gate.
3. **Calibrate tau from the contrastive pairs, not a live detector.** Re-project the
   fixed A-pairs through the current adapter each refresh:
   `tau = (mean cos(g_hack_pair, ref) + mean cos(g_clean_pair, ref)) / 2`.
   No live detector over students -> B untouched by any label -> airtight by
   construction (no `--gate-anchor-teacher-only` patch needed).
4. **Open fork (user's idea): use the persona/contrastive pairs directly as the hack
   reference rather than collapsing to one `v_grad` vector** -- gate on similarity to
   the pair set, re-saved each refresh. Keeps more structure than a rank-1 mean-diff.
   Decide vector-vs-pairset during the refactor.
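
Changes 2 and 3, plus one possible reading of the fork in 4, can be sketched as follows. This is a hedged sketch on synthetic gradients, not the planned implementation. `pair_tau`, `gate`, `mean_diff_ref`, and `pairset_score` are hypothetical names; taking the reference as a mean-difference of pair gradients and scoring the pair set by max cosine are assumptions.

```python
import numpy as np

def unit(x):
    """L2-normalize along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def mean_diff_ref(g_hack_pairs, g_clean_pairs):
    """Rank-1 reference (assumed form): mean hack-side minus mean clean-side gradient."""
    return g_hack_pairs.mean(axis=0) - g_clean_pairs.mean(axis=0)

def pair_tau(g_hack_pairs, g_clean_pairs, ref):
    """tau = (mean cos(g_hack_pair, ref) + mean cos(g_clean_pair, ref)) / 2."""
    ref = unit(ref)
    return 0.5 * ((unit(g_hack_pairs) @ ref).mean()
                  + (unit(g_clean_pairs) @ ref).mean())

def gate(g_rollouts, ref, tau):
    """Route purely by direction: no hack_anchor term, no labels on rollouts."""
    return unit(g_rollouts) @ unit(ref) > tau

def pairset_score(g_rollouts, g_hack_pairs, g_clean_pairs):
    """Fork 4, one possible reading: max cosine to any per-pair difference."""
    diffs = unit(g_hack_pairs - g_clean_pairs)
    return (unit(g_rollouts) @ diffs.T).max(axis=1)

# Synthetic stand-ins for gradients re-projected through the current adapter.
rng = np.random.default_rng(0)
dim, n_pairs = 256, 64
hack_dir = unit(rng.standard_normal(dim))
g_clean_pairs = rng.standard_normal((n_pairs, dim))
g_hack_pairs = rng.standard_normal((n_pairs, dim)) + 8.0 * hack_dir

ref = mean_diff_ref(g_hack_pairs, g_clean_pairs)
tau = pair_tau(g_hack_pairs, g_clean_pairs, ref)

rollouts = rng.standard_normal((200, dim))
rollouts[:100] += 8.0 * hack_dir             # first 100 are toy "hack" rollouts
routed = gate(rollouts, ref, tau)
scores = pairset_score(rollouts, g_hack_pairs, g_clean_pairs)
print(f"tau={tau:.3f} hack routed={routed[:100].mean():.2f} "
      f"clean routed={routed[100:].mean():.2f}")
```

Nothing here reads a label off a rollout: the only labelled inputs are the two pair arrays, which is the property the redesign is after.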

Caveat to smoke-test: a pair-derived tau may be miscalibrated vs the live rollout
cosine distribution (authored teacher-forced pairs vs sampled rollouts differ in
length/entropy). Verify the pair cosines bracket the live ones before trusting it.
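
One way to run that smoke test, as a sketch: compare the pair cosine means against quantiles of the live rollout cosines. "Bracket" is read here as "the clean-pair and hack-pair means sit outside the central mass of live cosines", which is an assumption. The function name, the 5%/95% quantiles, and the example numbers are all made up for illustration.

```python
import numpy as np

def pairs_bracket_live(cos_hack_pairs, cos_clean_pairs, cos_live, q=(0.05, 0.95)):
    """Check whether pair-derived cosines bracket the live rollout cosines."""
    lo, hi = np.quantile(cos_live, q)
    clean_mean = float(np.mean(cos_clean_pairs))
    hack_mean = float(np.mean(cos_hack_pairs))
    tau = 0.5 * (clean_mean + hack_mean)
    return {
        "tau": tau,
        "live_lo": float(lo),
        "live_hi": float(hi),
        # Pair means should enclose the central mass of live cosines.
        "brackets": bool(clean_mean <= lo and hi <= hack_mean),
        # A tau outside the live range routes everything or nothing.
        "tau_inside_live": bool(lo < tau < hi),
        "frac_routed": float(np.mean(np.asarray(cos_live) > tau)),
    }

# Made-up numbers, for illustration only.
cos_hack_pairs = np.array([0.30, 0.34, 0.28])
cos_clean_pairs = np.array([-0.02, 0.01, 0.00])

ok = pairs_bracket_live(cos_hack_pairs, cos_clean_pairs, np.linspace(0.02, 0.25, 50))
bad = pairs_bracket_live(cos_hack_pairs, cos_clean_pairs, np.linspace(0.40, 0.60, 50))
print("calibrated   :", ok)
print("miscalibrated:", bad)
```

In the second case every live cosine sits above tau, so the gate would route all rollouts. That is the failure mode to rule out before trusting a pair-derived tau.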

## The decisive pending experiment (already queued; its question survives the refactor)

A5 regime (teacher = run_tests only, `--gate-anchor-teacher-only` so held-out B has
zero labels), comparing held-out file_marker/sentinel/stdout suppression:
- job 126 = real `v_grad`  (the existing claim)
- job 135 = random `v_grad`  (does B still get suppressed with no real direction?)
- TODO: placebo `v_grad` (null_city) in the same regime

Verdict: if random suppresses B as well as real does, the vector is decorative; if
only the real `v_grad` suppresses B, the vector is load-bearing for unlabeled hacks
(the whole novelty).

## Jobs running/queued at handoff (recover or re-decide after refactor)

Box is ~2.8 min/step (CPU/rollout-bound, GPU median 37% / peak 91%, 16GB free).
60-step run ~3h; 200-step ~9h. Confounded full-teacher route2 directionality jobs
(118/119/121/122/123) are now low-value -- candidates to kill.
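
The wall-clock estimates above follow from the per-step rate. A small helper for re-deriving them when re-deciding the queue, assuming the ~2.8 min/step figure holds (the function name is illustrative):

```python
MIN_PER_STEP = 2.8  # from this note: ~2.8 min/step on the current box

def run_hours(steps, min_per_step=MIN_PER_STEP):
    """Estimated wall-clock hours for a run of the given length."""
    return steps * min_per_step / 60.0

for steps in (60, 200):
    print(f"{steps:>3} steps ~ {run_hours(steps):.1f} h")
```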

| job | status | prio | command |
|---|---|---|---|
| 118 | Running | 8 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=1 --out-tag=_route2_haar_d1_s41 |
| 127 | Queued | 9 | fast --intervention=erase --seed=41 --eval-ablate-every=5 --out-tag=_erase_realv_s41 |
| 135 | Queued | 9 | fast --intervention=route2 --seed=41 --teacher-modes run_tests --route2-random-v-seed=0 --steps=200 --eval-ablate-every=20 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_haar_d0_teacheronly_s41 |
| 119 | Queued | 8 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_bacon.json --out-tag=_route2_bacon_s41 |
| 121 | Queued | 8 | fast --intervention=route2 --seed=43 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_city.json --out-tag=_route2_placebo_nullcity_s43 |
| 128 | Queued | 8 | fast --intervention=erase --seed=41 --vhack-pairs-path=out/pairsets/null_city.json --eval-ablate-every=5 --out-tag=_erase_placebo_nullcity_s41 |
| 122 | Queued | 7 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=2 --out-tag=_route2_haar_d2_s41 |
| 123 | Queued | 7 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_blue.json --out-tag=_route2_blue_s41 |
| 126 | Queued | 3 | fast --intervention=route2 --seed=41 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s41 |
| 124 | Queued | 0 | fast --intervention=route2 --seed=41 --teacher-off-step=40 --steps=200 --eval-ablate-every=20 --out-tag=_route2_toff40_s41 |
| 125 | Queued | 0 | fast --intervention=route --seed=41 --v-hack-path=out/vhack/v_hack_pairset_prog_wide_randomV.safetensors --vhack-refresh-every=0 --eval-ablate-every=5 --steps=60 --out-tag=_route_randomV_s41 |
| 129 | Queued | -1 | fast --intervention=none --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_none200_kl5_s41 |
| 130 | Queued | -1 | fast --intervention=route2 --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_route2200_kl5_s41 |
| 131 | Queued | -2 | fast --intervention=none --seed=42 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s42 |
| 132 | Queued | -2 | fast --intervention=none --seed=43 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s43 |
| 133 | Queued | -2 | fast --intervention=route2 --seed=42 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s42 |
| 134 | Queued | -2 | fast --intervention=route2 --seed=43 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s43 |