# Refactor handoff - 2026-06-05

Tag `pre-routing-refactor` marks the state before we rebuild the routing gate. Reason: the route2 directionality experiment exposed that the vector is not load-bearing in the current design, and we're going to simplify the gate.

## Why we're refactoring (the diagnosis)

route2 routes a rollout's gradient into the deleted `delta_S_hack` knob when `hack_anchor | (cos(g_b, v_grad) > tau)` (train.py:915). Two problems:

1. **`hack_anchor` force-routes by label, bypassing the vector.** It is `teacher-row OR hacked_E-flagged-student` (train.py:352). In the full-teacher directionality runs the teacher pool spans all four modes, so every mode -- including the ones "held out" of `v_grad` -- has teacher hack demos that get quarantined regardless of direction. That is why a random `v_grad` suppresses hack just as well as the real one (jobs 114/115/117): the labels carried it, not the vector.

2. **tau is calibrated from a live detector over student rollouts.** `tau` = midpoint of mean `cos(g_b, v_grad)` over hack-anchor vs clean-anchor rollouts, where the anchors come from `hacked_E` (the weak run_tests-signature detector, rewards.py:542 -- verified NOT the oracle). With a random `v_grad` both means collapse to ~0, tau -> 0, and `cos_b > tau` is a coin flip. The detector also leaks ~1.1% onto held-out B (double-hacks), patched by `--gate-anchor-teacher-only`.

Net: in the regime we ran, suppression is label force-routing + a coin flip, and the extracted direction is decorative. The vector's *only* possible value is suppressing **unlabeled hack B**, which only the cosine gate can touch -- and that has never been tested against a random/placebo control.

## The redesign (cheat less, make the vector necessary)

Driving principle: the only labelled data anywhere should be the hand-built A-pairs (we need them for `v_grad` regardless). Then the gate is the *sole* suppression mechanism, B is provably label-free, and real-vs-random is a clean test.

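A toy sketch of why the label term hides the vector and a gate-only rule would not. This is not the `train.py` code: the gradients are synthetic, hack rows share a planted direction `u`, and every name here (`G`, `cos_to`, `midpoint_tau`, `routed_hack_fraction`) is made up for illustration. Calibrating tau from the same `hack_anchor` labels is an assumption that mirrors the current midpoint rule.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_hack, n_clean = 512, 100, 100

# Planted "hack" direction (stands in for a real v_grad; purely synthetic).
u = rng.standard_normal(d)
u /= np.linalg.norm(u)

# Synthetic per-rollout gradients: unit-scale noise, plus u on hack rows.
hack_anchor = np.arange(n_hack + n_clean) < n_hack
noise = rng.standard_normal((n_hack + n_clean, d)) / np.sqrt(d)
G = noise + np.outer(hack_anchor.astype(float), u)

def cos_to(G, v):
    """Cosine between every row of G and the vector v."""
    return (G @ v) / (np.linalg.norm(G, axis=1) * np.linalg.norm(v))

def midpoint_tau(cos_b, hack_anchor):
    """Midpoint of mean cosine over hack-anchor vs clean-anchor rows."""
    return 0.5 * (cos_b[hack_anchor].mean() + cos_b[~hack_anchor].mean())

def routed_hack_fraction(v, use_label_term):
    """Fraction of hack rows routed, with or without the `hack_anchor |` term."""
    cos_b = cos_to(G, v)
    gate = cos_b > midpoint_tau(cos_b, hack_anchor)
    routed = (hack_anchor | gate) if use_label_term else gate
    return float(routed[hack_anchor].mean())

random_vs = rng.standard_normal((200, d))  # 200 random stand-ins for v_grad

real_label = routed_hack_fraction(u, use_label_term=True)
real_gate = routed_hack_fraction(u, use_label_term=False)
rand_label = float(np.mean([routed_hack_fraction(v, True) for v in random_vs]))
rand_gate = float(np.mean([routed_hack_fraction(v, False) for v in random_vs]))

print(f"label|gate  real v: {real_label:.2f}  random v: {rand_label:.2f}")
print(f"gate only   real v: {real_gate:.2f}  random v: {rand_gate:.2f}")
```

Under these assumptions the label term routes every anchored row whatever `v` is, so real and random are indistinguishable. With the gate alone, the real direction still routes the hack rows while a random one averages out near half, which is the contrast the redesign is meant to expose.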
Concrete changes to make:

1. **Always stop teacher mixing after step 30** (`teacher_off_step=30` default): seed the hacks, then pure on-policy. Job 87 showed hacking self-sustains after the cut, so the teacher is only a seeder.

2. **Drop the teacher force-route** -- remove the `hack_anchor |` term; route purely by the gate.

3. **Calibrate tau from the contrastive pairs, not a live detector.** Re-project the fixed A-pairs through the current adapter each refresh: `tau = (mean cos(g_hack_pair, ref) + mean cos(g_clean_pair, ref)) / 2`. No live detector over students -> B untouched by any label -> airtight by construction (no `--gate-anchor-teacher-only` patch needed).

4. **Open fork (user's idea): use the persona/contrastive pairs directly as the hack reference rather than collapsing to one `v_grad` vector** -- gate on similarity to the pair set, re-saved each refresh. Keeps more structure than a rank-1 mean-diff. Decide vector-vs-pairset during the refactor.

Caveat to smoke-test: a pair-derived tau may be miscalibrated vs the live rollout cosine distribution (authored teacher-forced pairs vs sampled rollouts differ in length/entropy). Verify the pair cosines bracket the live ones before trusting it.

## The decisive pending experiment (already queued, survives the refactor's question)

A5 regime (teacher = run_tests only, `--gate-anchor-teacher-only` so held-out B has zero labels), comparing held-out file_marker/sentinel/stdout suppression:

- job 126 = real `v_grad` (the existing claim)
- job 135 = random `v_grad` (does B still get suppressed with no real direction?)
- TODO: placebo `v_grad` (null_city) in the same regime

Verdict: B suppressed equally by random => vector decorative; only real suppresses B => vector load-bearing for unlabeled hacks (the whole novelty).

## Jobs running/queued at handoff (recover or re-decide after refactor)

Box is ~2.8 min/step (CPU/rollout-bound, GPU median 37% / peak 91%, 16GB free). 60-step run ~3h; 200-step ~9h.
Confounded full-teacher route2 directionality jobs (118/119/121/122/123) are now low-value -- candidates to kill.

| job | status | prio | command |
|---|---|---|---|
| 118 | Running | 8 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=1 --out-tag=_route2_haar_d1_s41 |
| 127 | Queued | 9 | fast --intervention=erase --seed=41 --eval-ablate-every=5 --out-tag=_erase_realv_s41 |
| 135 | Queued | 9 | fast --intervention=route2 --seed=41 --teacher-modes run_tests --route2-random-v-seed=0 --steps=200 --eval-ablate-every=20 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_haar_d0_teacheronly_s41 |
| 119 | Queued | 8 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_bacon.json --out-tag=_route2_bacon_s41 |
| 121 | Queued | 8 | fast --intervention=route2 --seed=43 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_city.json --out-tag=_route2_placebo_nullcity_s43 |
| 128 | Queued | 8 | fast --intervention=erase --seed=41 --vhack-pairs-path=out/pairsets/null_city.json --eval-ablate-every=5 --out-tag=_erase_placebo_nullcity_s41 |
| 122 | Queued | 7 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=2 --out-tag=_route2_haar_d2_s41 |
| 123 | Queued | 7 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_blue.json --out-tag=_route2_blue_s41 |
| 126 | Queued | 3 | fast --intervention=route2 --seed=41 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s41 |
| 124 | Queued | 0 | fast --intervention=route2 --seed=41 --teacher-off-step=40 --steps=200 --eval-ablate-every=20 --out-tag=_route2_toff40_s41 |
| 125 | Queued | 0 | fast --intervention=route --seed=41 --v-hack-path=out/vhack/v_hack_pairset_prog_wide_randomV.safetensors --vhack-refresh-every=0 --eval-ablate-every=5 --steps=60 --out-tag=_route_randomV_s41 |
| 129 | Queued | -1 | fast --intervention=none --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_none200_kl5_s41 |
| 130 | Queued | -1 | fast --intervention=route2 --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_route2200_kl5_s41 |
| 131 | Queued | -2 | fast --intervention=none --seed=42 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s42 |
| 132 | Queued | -2 | fast --intervention=none --seed=43 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s43 |
| 133 | Queued | -2 | fast --intervention=route2 --seed=42 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s42 |
| 134 | Queued | -2 | fast --intervention=route2 --seed=43 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s43 |
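
For whoever picks up the refactor: a minimal sketch of the pair-calibrated tau from the redesign (change 3) and the bracketing smoke test from the caveat. Nothing here exists in the repo yet. `mean_cos`, `pair_tau`, `brackets_live` and `min_frac` are hypothetical names, the gradients are tiny hand-made arrays, and reading "bracket" as "most live cosines fall between the clean-pair mean and the hack-pair mean" is one possible interpretation, not a settled definition.

```python
import numpy as np

def mean_cos(grads, ref):
    """Mean cosine between each row of `grads` and the reference vector."""
    cos = (grads @ ref) / (np.linalg.norm(grads, axis=1) * np.linalg.norm(ref))
    return float(cos.mean())

def pair_tau(g_hack_pair, g_clean_pair, ref):
    """tau = (mean cos(g_hack_pair, ref) + mean cos(g_clean_pair, ref)) / 2."""
    return 0.5 * (mean_cos(g_hack_pair, ref) + mean_cos(g_clean_pair, ref))

def brackets_live(g_hack_pair, g_clean_pair, ref, live_cos, min_frac=0.9):
    """Assumed smoke test: do at least `min_frac` of live rollout cosines
    fall between the clean-pair mean and the hack-pair mean?"""
    a = mean_cos(g_clean_pair, ref)
    b = mean_cos(g_hack_pair, ref)
    lo, hi = min(a, b), max(a, b)
    inside = (live_cos >= lo) & (live_cos <= hi)
    return bool(inside.mean() >= min_frac)

# Hand-made 2-D example (illustrative values only, not measured gradients).
ref = np.array([1.0, 0.0])
g_hack_pair = np.array([[1.0, 0.0], [1.0, 1.0]])    # cosines 1 and ~0.707
g_clean_pair = np.array([[0.0, 1.0], [-1.0, 1.0]])  # cosines 0 and ~-0.707

tau = pair_tau(g_hack_pair, g_clean_pair, ref)

live_ok = np.array([0.1, 0.5, -0.2, 0.8])   # sits inside the pair range
live_bad = np.array([0.90, 0.95, 0.99])     # all above the hack-pair mean

ok = brackets_live(g_hack_pair, g_clean_pair, ref, live_ok)
bad = brackets_live(g_hack_pair, g_clean_pair, ref, live_bad)
print(f"tau={tau:.3f} bracket_ok={ok} bracket_bad={bad}")
```

No live detector appears anywhere in this path: tau depends only on the fixed pairs and the reference, so in this sketch held-out B is never touched by a label. The bracketing check is the only place live rollouts enter, and only as a diagnostic.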