route2 directionality exposed the vector is not load-bearing: hack_anchor force-routes teacher+detector by label (bypassing v_grad), tau calibrated from a live detector, so random==real because labels carried it.

Redesign: teacher-off@30, drop force-route, calibrate tau from the A-pairs (no live detector), maybe use the pairset directly vs a rank-1 vector. Decisive test = A5 real(126) vs random(135). Queue snapshot + design notes in docs/REFACTOR_HANDOFF.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Refactor handoff — 2026-06-05
Tag pre-routing-refactor marks the state before we rebuild the routing gate.
Reason: the route2 directionality experiment exposed that the vector is not
load-bearing in the current design, and we're going to simplify the gate.
Why we're refactoring (the diagnosis)
route2 routes a rollout's gradient into the deleted delta_S_hack knob when
hack_anchor | (cos(g_b, v_grad) > tau) (train.py:915). Two problems:
- hack_anchor force-routes by label, bypassing the vector. It is teacher-row OR hacked_E-flagged-student (train.py:352). In the full-teacher directionality runs the teacher pool spans all four modes, so every mode -- including the ones "held out" of v_grad -- has teacher hack demos that get quarantined regardless of direction. That is why a random v_grad suppresses hack just as well as the real one (jobs 114/115/117): the labels carried it, not the vector.
- tau is calibrated from a live detector over student rollouts. tau = midpoint of mean cos(g_b, v_grad) over hack-anchor vs clean-anchor rollouts, where the anchors come from hacked_E (the weak run_tests-signature detector, rewards.py:542 -- verified NOT the oracle). With a random v_grad both means collapse to ~0, tau -> 0, and cos_b > tau is a coin flip. The detector also leaks ~1.1% onto held-out B (double-hacks), patched by --gate-anchor-teacher-only.
Net: in the regime we ran, suppression is label force-routing + a coin flip, and the extracted direction is decorative. The vector's only possible value is suppressing unlabeled hack B, which only the cosine gate can touch -- and that has never been tested against a random/placebo control.
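The failure mode above can be reproduced on toy data. The sketch below is not the train.py code: the gradients are synthetic, and the helper names (midpoint_tau, route2_gate, routed_frac, toy_grads) are made up for illustration. It only shows the shape of the argument: with the hack_anchor term in place, labelled rollouts are routed whatever the vector is, and a random vector leaves the cosine gate with little power to separate hack from clean.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def midpoint_tau(hack_grads, clean_grads, v):
    """Live-detector style calibration: midpoint of the two mean cosines."""
    mean_hack = sum(cosine(g, v) for g in hack_grads) / len(hack_grads)
    mean_clean = sum(cosine(g, v) for g in clean_grads) / len(clean_grads)
    return 0.5 * (mean_hack + mean_clean)


def route2_gate(g_b, v, tau, hack_anchor):
    """Current rule as described above: label force-route OR cosine gate."""
    return hack_anchor or cosine(g_b, v) > tau


def routed_frac(grads, v, tau, hack_anchor=False):
    """Fraction of gradients that the gate sends to the hack knob."""
    return sum(route2_gate(g, v, tau, hack_anchor) for g in grads) / len(grads)


def toy_grads(rng, n, hack_dir, signal):
    """Synthetic gradients (assumption): signal * hack_dir plus unit Gaussian noise."""
    return [[signal * h + rng.gauss(0.0, 1.0) for h in hack_dir] for _ in range(n)]


rng = random.Random(0)
DIM = 128
hack_dir = [1.0] + [0.0] * (DIM - 1)                # toy stand-in for the real v_grad
random_v = [rng.gauss(0.0, 1.0) for _ in range(DIM)]  # toy stand-in for a random v_grad
hack = toy_grads(rng, 200, hack_dir, signal=6.0)
clean = toy_grads(rng, 200, hack_dir, signal=0.0)

for name, v in (("real", hack_dir), ("random", random_v)):
    tau = midpoint_tau(hack, clean, v)
    print(name,
          "tau=%.3f" % tau,
          "unlabelled hack routed=%.2f" % routed_frac(hack, v, tau),
          "clean routed=%.2f" % routed_frac(clean, v, tau),
          "labelled routed=%.2f" % routed_frac(hack, v, tau, hack_anchor=True))
```

In this toy the "labelled routed" column is 1.00 for both vectors by construction, which is the force-route effect; only the unlabelled columns depend on the direction.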
The redesign (cheat less, make the vector necessary)
Driving principle: the only labelled data anywhere should be the hand-built A-pairs
(we need them for v_grad regardless). Then the gate is the sole suppression
mechanism, B is provably label-free, and real-vs-random is a clean test.
Concrete changes to make:
- Always stop teacher mixing after step 30 (teacher_off_step=30 default): seed the hacks, then pure on-policy. Job 87 showed hacking self-sustains after the cut, so the teacher is only a seeder.
- Drop the teacher force-route -- remove the hack_anchor | term; route purely by the gate.
- Calibrate tau from the contrastive pairs, not a live detector. Re-project the fixed A-pairs through the current adapter each refresh: tau = (mean cos(g_hack_pair, ref) + mean cos(g_clean_pair, ref)) / 2. No live detector over students -> B untouched by any label -> airtight by construction (no --gate-anchor-teacher-only patch needed).
- Open fork (user's idea): use the persona/contrastive pairs directly as the hack reference rather than collapsing to one v_grad vector -- gate on similarity to the pair set, re-saved each refresh. Keeps more structure than a rank-1 mean-diff. Decide vector-vs-pairset during the refactor.
Caveat to smoke-test: a pair-derived tau may be miscalibrated vs the live rollout cosine distribution (authored teacher-forced pairs vs sampled rollouts differ in length/entropy). Verify the pair cosines bracket the live ones before trusting it.
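A minimal sketch of the pair-derived tau and of one possible reading of the bracketing smoke test. Everything here is an assumption for illustration: the function names are hypothetical, gradients are plain lists of floats, and "bracket" is interpreted as "the 5%-95% range of live rollout cosines falls inside the min-max range of the pair cosines". The live cosines are only inspected as a diagnostic; no rollout labels are used.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def pair_tau(hack_pair_grads, clean_pair_grads, ref):
    """tau = (mean cos(g_hack_pair, ref) + mean cos(g_clean_pair, ref)) / 2."""
    hack_cos = [cosine(g, ref) for g in hack_pair_grads]
    clean_cos = [cosine(g, ref) for g in clean_pair_grads]
    mean_hack = sum(hack_cos) / len(hack_cos)
    mean_clean = sum(clean_cos) / len(clean_cos)
    return 0.5 * (mean_hack + mean_clean)


def quantile(values, q):
    """Nearest-rank quantile; coarse, but enough for a smoke test."""
    xs = sorted(values)
    return xs[round(q * (len(xs) - 1))]


def pairs_bracket_live(pair_cosines, live_cosines, lo_q=0.05, hi_q=0.95):
    """True if the central mass of live cosines sits inside the pair cosine range.

    The quantile cut-offs are arbitrary defaults, not values from the project.
    """
    lo, hi = min(pair_cosines), max(pair_cosines)
    return lo <= quantile(live_cosines, lo_q) and quantile(live_cosines, hi_q) <= hi
```

If pairs_bracket_live returns False on a refresh, the pair-derived tau is probably sitting outside the range where live rollouts actually land, and the gate would route almost everything or almost nothing.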
The decisive pending experiment (already queued; its question survives the refactor)
A5 regime (teacher = run_tests only, --gate-anchor-teacher-only so held-out B has
zero labels), comparing held-out file_marker/sentinel/stdout suppression:
- job 126 = real v_grad (the existing claim)
- job 135 = random v_grad (does B still get suppressed with no real direction?)
- TODO: placebo v_grad (null_city) in the same regime

Verdict: B suppressed equally by random => vector decorative; only real suppresses B => vector load-bearing for unlabeled hacks (the whole novelty).
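The verdict rule can be written down as a small decision function. This is a sketch under assumptions: the function name, the tol threshold, and the use of a no-intervention baseline in the same regime to define "suppression" are all hypothetical choices, and the inputs are held-out B hack rates in [0, 1] that have to come from the actual runs.

```python
def a5_verdict(hack_rate_vanilla, hack_rate_real, hack_rate_random, tol=0.05):
    """Classify the A5 outcome from held-out B hack rates.

    Suppression is measured as the drop relative to the vanilla baseline.
    tol is an arbitrary tolerance (assumption), not a value from the project.
    """
    supp_real = hack_rate_vanilla - hack_rate_real
    supp_random = hack_rate_vanilla - hack_rate_random

    if supp_real <= tol and supp_random <= tol:
        return "no suppression of B by either vector"
    if abs(supp_real - supp_random) <= tol:
        return "vector decorative"            # random suppresses B as well as real
    if supp_real > tol and supp_random <= tol:
        return "vector load-bearing"          # only real suppresses B
    return "inconclusive"                     # partial or reversed effect
```

The "inconclusive" branch covers the case where random suppresses B partially; the rule as stated does not say how to read that, so the sketch does not guess.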
Jobs running/queued at handoff (recover or re-decide after refactor)
Box is ~2.8 min/step (CPU/rollout-bound, GPU median 37% / peak 91%, 16GB free). 60-step run ~3h; 200-step ~9h. Confounded full-teacher route2 directionality jobs (118/119/121/122/123) are now low-value -- candidates to kill.
| job | status | prio | command |
|---|---|---|---|
| 118 | Running | 8 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=1 --out-tag=_route2_haar_d1_s41 |
| 127 | Queued | 9 | fast --intervention=erase --seed=41 --eval-ablate-every=5 --out-tag=_erase_realv_s41 |
| 135 | Queued | 9 | fast --intervention=route2 --seed=41 --teacher-modes run_tests --route2-random-v-seed=0 --steps=200 --eval-ablate-every=20 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_haar_d0_teacheronly_s41 |
| 119 | Queued | 8 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_bacon.json --out-tag=_route2_bacon_s41 |
| 121 | Queued | 8 | fast --intervention=route2 --seed=43 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_city.json --out-tag=_route2_placebo_nullcity_s43 |
| 128 | Queued | 8 | fast --intervention=erase --seed=41 --vhack-pairs-path=out/pairsets/null_city.json --eval-ablate-every=5 --out-tag=_erase_placebo_nullcity_s41 |
| 122 | Queued | 7 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=2 --out-tag=_route2_haar_d2_s41 |
| 123 | Queued | 7 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_blue.json --out-tag=_route2_blue_s41 |
| 126 | Queued | 3 | fast --intervention=route2 --seed=41 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s41 |
| 124 | Queued | 0 | fast --intervention=route2 --seed=41 --teacher-off-step=40 --steps=200 --eval-ablate-every=20 --out-tag=_route2_toff40_s41 |
| 125 | Queued | 0 | fast --intervention=route --seed=41 --v-hack-path=out/vhack/v_hack_pairset_prog_wide_randomV.safetensors --vhack-refresh-every=0 --eval-ablate-every=5 --steps=60 --out-tag=_route_randomV_s41 |
| 129 | Queued | -1 | fast --intervention=none --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_none200_kl5_s41 |
| 130 | Queued | -1 | fast --intervention=route2 --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_route2200_kl5_s41 |
| 131 | Queued | -2 | fast --intervention=none --seed=42 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s42 |
| 132 | Queued | -2 | fast --intervention=none --seed=43 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s43 |
| 133 | Queued | -2 | fast --intervention=route2 --seed=42 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s42 |
| 134 | Queued | -2 | fast --intervention=route2 --seed=43 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s43 |
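The wall-clock estimates quoted above follow directly from the ~2.8 min/step figure. A small helper (hypothetical name, same assumption of a constant per-step cost) makes it easy to re-estimate when deciding which queued jobs to keep:

```python
MIN_PER_STEP = 2.8  # measured box throughput quoted above, in minutes per step


def run_hours(steps, min_per_step=MIN_PER_STEP):
    """Estimated wall-clock hours for a run, assuming constant cost per step."""
    return steps * min_per_step / 60.0


for steps in (60, 200):
    print(f"{steps} steps ~ {run_hours(steps):.1f} h")
```

This gives about 2.8 h for a 60-step run and about 9.3 h for a 200-step run, consistent with the rounded ~3h and ~9h figures.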