evil_MoE/docs/REFACTOR_HANDOFF.md
wassname 0fa250b193 handoff: pre-routing-refactor snapshot + diagnosis
route2 directionality exposed the vector is not load-bearing: hack_anchor
force-routes teacher+detector by label (bypassing v_grad), tau calibrated from a
live detector, so random==real because labels carried it. Redesign: teacher-off@30,
drop force-route, calibrate tau from the A-pairs (no live detector), maybe use the
pairset directly vs a rank-1 vector. Decisive test = A5 real(126) vs random(135).
Queue snapshot + design notes in docs/REFACTOR_HANDOFF.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 23:58:35 +00:00

Refactor handoff — 2026-06-05

Tag pre-routing-refactor marks the state before we rebuild the routing gate. Reason: the route2 directionality experiment exposed that the vector is not load-bearing in the current design, and we're going to simplify the gate.

Why we're refactoring (the diagnosis)

route2 routes a rollout's gradient into the deleted delta_S_hack knob when hack_anchor | (cos(g_b, v_grad) > tau) (train.py:915). Two problems:

  1. hack_anchor force-routes by label, bypassing the vector. It is teacher-row OR hacked_E-flagged-student (train.py:352). In the full-teacher directionality runs the teacher pool spans all four modes, so every mode -- including the ones "held out" of v_grad -- has teacher hack demos that get quarantined regardless of direction. That is why a random v_grad suppresses hack just as well as the real one (jobs 114/115/117): the labels carried it, not the vector.
  2. tau is calibrated from a live detector over student rollouts. tau = midpoint of mean cos(g_b, v_grad) over hack-anchor vs clean-anchor rollouts, where the anchors come from hacked_E (the weak run_tests-signature detector, rewards.py:542 -- verified NOT the oracle). With a random v_grad both means collapse to ~0, tau -> 0, and cos_b > tau is a coin flip. The detector also leaks ~1.1% onto held-out B (double-hacks), patched by --gate-anchor-teacher-only.
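The two failure modes above can be reproduced on toy data. This is a minimal sketch, not the code in train.py: the gradient rows, the planted hack direction, the perfect-detector anchors, and all sizes are assumptions made for illustration. Only the routing rule `hack_anchor | (cos(g_b, v_grad) > tau)` and the midpoint definition of tau come from the notes.

```python
import numpy as np

def cosine(g, v):
    """Row-wise cosine between gradient rows g (n, d) and one direction v (d,)."""
    return (g @ v) / (np.linalg.norm(g, axis=1) * np.linalg.norm(v) + 1e-12)

def midpoint_tau(cos_b, hack_anchor):
    """tau = midpoint of mean cosine over hack-anchor vs clean-anchor rollouts."""
    return 0.5 * (cos_b[hack_anchor].mean() + cos_b[~hack_anchor].mean())

def route_mask(cos_b, hack_anchor, tau):
    """Current rule: force-route anchors by label, OR route when the gate fires."""
    return hack_anchor | (cos_b > tau)

# Toy data (assumption): half the rollouts carry a planted hack direction.
rng = np.random.default_rng(0)
n, d = 2000, 512
hack_dir = rng.standard_normal(d)
hack_dir /= np.linalg.norm(hack_dir)
is_hack = rng.random(n) < 0.5
g = rng.standard_normal((n, d)) + 6.0 * is_hack[:, None] * hack_dir
hack_anchor = is_hack.copy()  # assumption: anchors come from a perfect detector

random_v = rng.standard_normal(d)  # stand-in for a random v_grad
cos_real, cos_rand = cosine(g, hack_dir), cosine(g, random_v)
tau_real = midpoint_tau(cos_real, hack_anchor)
tau_rand = midpoint_tau(cos_rand, hack_anchor)

gate_real = cos_real > tau_real  # separates hack from clean rows
gate_rand = cos_rand > tau_rand  # fires on roughly half the rows
routed_rand = route_mask(cos_rand, hack_anchor, tau_rand)

print(f"tau_real={tau_real:.3f} tau_rand={tau_rand:.3f}")
print(f"random gate fire rate: {gate_rand.mean():.2f}")
print(f"anchors routed under random v: {routed_rand[hack_anchor].mean():.2f}")
```

In this toy setup the random direction still routes every anchored row, because the `hack_anchor |` term never consults the vector, while its cosine gate behaves like a coin flip.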

Net: in the regime we ran, suppression is label force-routing + a coin flip, and the extracted direction is decorative. The vector's only possible value is suppressing unlabeled hack B, which only the cosine gate can touch -- and that has never been tested against a random/placebo control.

The redesign (cheat less, make the vector necessary)

Driving principle: the only labelled data anywhere should be the hand-built A-pairs (we need them for v_grad regardless). Then the gate is the sole suppression mechanism, B is provably label-free, and real-vs-random is a clean test.

Concrete changes to make:

  1. Always stop teacher mixing after step 30 (teacher_off_step=30 default): seed the hacks, then pure on-policy. Job 87 showed hacking self-sustains after the cut, so the teacher is only a seeder.
  2. Drop the teacher force-route -- remove the hack_anchor | term; route purely by the gate.
  3. Calibrate tau from the contrastive pairs, not a live detector. Re-project the fixed A-pairs through the current adapter each refresh: tau = (mean cos(g_hack_pair, ref) + mean cos(g_clean_pair, ref)) / 2. No live detector over students -> B untouched by any label -> airtight by construction (no --gate-anchor-teacher-only patch needed).
  4. Open fork (user's idea): use the persona/contrastive pairs directly as the hack reference rather than collapsing to one v_grad vector -- gate on similarity to the pair set, re-saved each refresh. Keeps more structure than a rank-1 mean-diff. Decide vector-vs-pairset during the refactor.
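Changes 2 and 3 could be sketched roughly as follows. The helper names (`unit`, `pair_tau`, `gate`), the mean-diff construction of `ref`, and the toy gradients are assumptions for illustration; only the tau formula and the removal of the `hack_anchor |` term come from the list above. The pairset variant from item 4 is not covered here.

```python
import numpy as np

def unit(x):
    """L2-normalise along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

def pair_tau(g_hack_pair, g_clean_pair, ref):
    """tau = (mean cos(g_hack_pair, ref) + mean cos(g_clean_pair, ref)) / 2."""
    r = unit(ref)
    hack_mean = (unit(g_hack_pair) @ r).mean()
    clean_mean = (unit(g_clean_pair) @ r).mean()
    return 0.5 * float(hack_mean + clean_mean)

def gate(g_b, ref, tau):
    """Route purely by the cosine gate; there is no hack_anchor term."""
    return (unit(g_b) @ unit(ref)) > tau

# Toy stand-ins (assumption) for A-pair gradients re-projected at a refresh.
rng = np.random.default_rng(0)
d, n_pairs = 256, 32
hack_dir = unit(rng.standard_normal(d))
g_clean_pair = 0.05 * rng.standard_normal((n_pairs, d))
g_hack_pair = 0.05 * rng.standard_normal((n_pairs, d)) + hack_dir

ref = (g_hack_pair - g_clean_pair).mean(axis=0)  # rank-1 mean-diff reference
tau = pair_tau(g_hack_pair, g_clean_pair, ref)

# Two unlabeled rollouts: the first hack-like, the second clean-like.
g_b = np.stack([0.05 * rng.standard_normal(d) + hack_dir,
                0.05 * rng.standard_normal(d)])
routed = gate(g_b, ref, tau)
print(f"tau={tau:.3f} routed={routed.tolist()}")
```

No label on a student rollout enters either function, which is the property the redesign relies on for held-out B.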

Caveat to smoke-test: a pair-derived tau may be miscalibrated vs the live rollout cosine distribution (authored teacher-forced pairs vs sampled rollouts differ in length/entropy). Verify the pair cosines bracket the live ones before trusting it.
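One way to run that smoke test is sketched below. The function name, the quantile choice, and the reading of "bracket" (clean-pair mean at or below the low live quantile, hack-pair mean at or above the high one) are assumptions, and the input arrays are synthetic.

```python
import numpy as np

def pair_tau_report(pair_hack_cos, pair_clean_cos, live_cos, q=(0.05, 0.95)):
    """Compare pair-derived tau against the live rollout cosine distribution."""
    hack_mean = float(np.mean(pair_hack_cos))
    clean_mean = float(np.mean(pair_clean_cos))
    tau = 0.5 * (hack_mean + clean_mean)
    lo, hi = np.quantile(live_cos, q)
    return {
        "tau": tau,
        # Fraction of live rollouts the gate would route at this tau.
        "fire_rate": float(np.mean(np.asarray(live_cos) > tau)),
        # True when the pair means sit outside the bulk of the live cosines.
        "brackets": bool(clean_mean <= lo and hack_mean >= hi),
    }

live = np.linspace(-0.1, 0.3, 101)  # synthetic live cosines
good = pair_tau_report([0.4, 0.5], [-0.2, -0.1], live)
bad = pair_tau_report([0.05], [0.0], live)  # pairs squeezed inside the live range
print(good)
print(bad)
```

A fire rate near 0 or 1, or `brackets` being False, would suggest the authored pairs and the sampled rollouts live on different cosine scales and tau needs rescaling before it is trusted.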

The decisive pending experiment (already queued; its question survives the refactor)

A5 regime (teacher = run_tests only, --gate-anchor-teacher-only so held-out B has zero labels), comparing held-out file_marker/sentinel/stdout suppression:

  • job 126 = real v_grad (the existing claim)
  • job 135 = random v_grad (does B still get suppressed with no real direction?)
  • TODO: placebo v_grad (null_city) in the same regime

Verdict: B suppressed equally by random => vector decorative; only real suppresses B => vector load-bearing for unlabeled hacks (the whole novelty).

Jobs running/queued at handoff (recover or re-decide after refactor)

Box is ~2.8 min/step (CPU/rollout-bound, GPU median 37% / peak 91%, 16GB free). 60-step run ~3h; 200-step ~9h. Confounded full-teacher route2 directionality jobs (118/119/121/122/123) are now low-value -- candidates to kill.
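The run-time figures follow from the step rate. A small helper (hypothetical name, assuming the ~2.8 min/step rate holds for the whole run) makes it easy to budget other step counts:

```python
def run_hours(steps, min_per_step=2.8):
    """Estimated wall-clock hours for a run at a fixed minutes-per-step rate."""
    return steps * min_per_step / 60.0

for steps in (60, 200):
    print(f"{steps} steps: about {run_hours(steps):.1f} h")
```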

| job | status | prio | command |
| --- | --- | --- | --- |
| 118 | Running | 8 | `fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=1 --out-tag=_route2_haar_d1_s41` |
| 127 | Queued | 9 | `fast --intervention=erase --seed=41 --eval-ablate-every=5 --out-tag=_erase_realv_s41` |
| 135 | Queued | 9 | `fast --intervention=route2 --seed=41 --teacher-modes run_tests --route2-random-v-seed=0 --steps=200 --eval-ablate-every=20 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_haar_d0_teacheronly_s41` |
| 119 | Queued | 8 | `fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_bacon.json --out-tag=_route2_bacon_s41` |
| 121 | Queued | 8 | `fast --intervention=route2 --seed=43 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_city.json --out-tag=_route2_placebo_nullcity_s43` |
| 128 | Queued | 8 | `fast --intervention=erase --seed=41 --vhack-pairs-path=out/pairsets/null_city.json --eval-ablate-every=5 --out-tag=_erase_placebo_nullcity_s41` |
| 122 | Queued | 7 | `fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=2 --out-tag=_route2_haar_d2_s41` |
| 123 | Queued | 7 | `fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_blue.json --out-tag=_route2_blue_s41` |
| 126 | Queued | 3 | `fast --intervention=route2 --seed=41 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s41` |
| 124 | Queued | 0 | `fast --intervention=route2 --seed=41 --teacher-off-step=40 --steps=200 --eval-ablate-every=20 --out-tag=_route2_toff40_s41` |
| 125 | Queued | 0 | `fast --intervention=route --seed=41 --v-hack-path=out/vhack/v_hack_pairset_prog_wide_randomV.safetensors --vhack-refresh-every=0 --eval-ablate-every=5 --steps=60 --out-tag=_route_randomV_s41` |
| 129 | Queued | -1 | `fast --intervention=none --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_none200_kl5_s41` |
| 130 | Queued | -1 | `fast --intervention=route2 --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_route2200_kl5_s41` |
| 131 | Queued | -2 | `fast --intervention=none --seed=42 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s42` |
| 132 | Queued | -2 | `fast --intervention=none --seed=43 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s43` |
| 133 | Queued | -2 | `fast --intervention=route2 --seed=42 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s42` |
| 134 | Queued | -2 | `fast --intervention=route2 --seed=43 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s43` |