handoff: pre-routing-refactor snapshot + diagnosis

route2 directionality exposed the vector is not load-bearing: hack_anchor force-routes teacher+detector by label (bypassing v_grad), tau calibrated from a live detector, so random==real because labels carried it. Redesign: teacher-off@30, drop force-route, calibrate tau from the A-pairs (no live detector), maybe use the pairset directly vs a rank-1 vector. Decisive test = A5 real(126) vs random(135). Queue snapshot + design notes in docs/REFACTOR_HANDOFF.md. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 17:30:41 +08:00 · 2026-06-05 23:58:35 +00:00
parent f82a4f034d
commit 0fa250b193
5 changed files with 153 additions and 0 deletions
@@ -0,0 +1,92 @@
+# Refactor handoff — 2026-06-05
+
+Tag `pre-routing-refactor` marks the state before we rebuild the routing gate.
+Reason: the route2 directionality experiment exposed that the vector is not
+load-bearing in the current design, and we're going to simplify the gate.
+
+## Why we're refactoring (the diagnosis)
+
+route2 routes a rollout's gradient into the deleted `delta_S_hack` knob when
+`hack_anchor | (cos(g_b, v_grad) > tau)` (train.py:915). Two problems:
+
+1. **`hack_anchor` force-routes by label, bypassing the vector.** It is
+   `teacher-row OR hacked_E-flagged-student` (train.py:352). In the full-teacher
+   directionality runs the teacher pool spans all four modes, so every mode --
+   including the ones "held out" of `v_grad` -- has teacher hack demos that get
+   quarantined regardless of direction. That is why a random `v_grad` suppresses
+   hack just as well as the real one (jobs 114/115/117): the labels carried it,
+   not the vector.
+2. **tau is calibrated from a live detector over student rollouts.** `tau` =
+   midpoint of mean `cos(g_b, v_grad)` over hack-anchor vs clean-anchor rollouts,
+   where the anchors come from `hacked_E` (the weak run_tests-signature detector,
+   rewards.py:542 -- verified NOT the oracle). With a random `v_grad` both means
+   collapse to ~0, tau -> 0, and `cos_b > tau` is a coin flip. The detector also
+   leaks ~1.1% onto held-out B (double-hacks), patched by `--gate-anchor-teacher-only`.
+
+Net: in the regime we ran, suppression is label force-routing + a coin flip, and
+the extracted direction is decorative. The vector's *only* possible value is
+suppressing **unlabeled hack B**, which only the cosine gate can touch -- and that
+has never been tested against a random/placebo control.
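A toy sketch of why a random reference makes the cosine gate uninformative, assuming synthetic Gaussian gradients rather than real LoRA gradients. `routed` mirrors the predicate quoted from train.py:915 and `midpoint_tau` mirrors the midpoint calibration described above. Every other name (`toy_rates`, `true_dir`, the dimensions and sample counts) is hypothetical and not taken from the repo.

```python
import math
import random


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cos_to(grads, ref):
    """Cosine of every gradient in `grads` with the reference direction."""
    r = unit(ref)
    return [sum(a * b for a, b in zip(unit(g), r)) for g in grads]


def midpoint_tau(hack_cos, clean_cos):
    """Midpoint of the two class means (the calibration described above)."""
    return 0.5 * (sum(hack_cos) / len(hack_cos) + sum(clean_cos) / len(clean_cos))


def routed(cos_b, tau, hack_anchor=False):
    """Gate predicate: hack_anchor | (cos(g_b, v_grad) > tau)."""
    return hack_anchor or cos_b > tau


def routed_rate(cosines, tau, hack_anchor=False):
    """Fraction of rollouts the gate sends to the quarantined knob."""
    return sum(routed(c, tau, hack_anchor) for c in cosines) / len(cosines)


def toy_rates(dim=64, n=100, n_random=30, seed=0):
    """Hypothetical toy: hack gradients = noise + a shared direction."""
    rng = random.Random(seed)

    def gauss():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    true_dir = gauss()
    clean = [gauss() for _ in range(n)]
    hack = [[a + b for a, b in zip(gauss(), true_dir)] for _ in range(n)]

    def rates(ref):
        h, c = cos_to(hack, ref), cos_to(clean, ref)
        tau = midpoint_tau(h, c)
        return tau, routed_rate(h, tau), routed_rate(c, tau)

    # Average over several random references: one draw can lean either way.
    draws = [rates(gauss()) for _ in range(n_random)]
    random_avg = tuple(sum(d[i] for d in draws) / n_random for i in range(3))
    # Label force-routing: the cosine term never matters.
    forced = routed_rate(cos_to(hack, gauss()), tau=1.0, hack_anchor=True)
    return {"real": rates(true_dir), "random_avg": random_avg, "forced": forced}


if __name__ == "__main__":
    for name, value in toy_rates().items():
        print(name, value)
```

Each tuple is `(tau, hack routed rate, clean routed rate)`. In this toy the real direction separates hack from clean, while random references give a tau near 0 and route about half of each class on average. With `hack_anchor=True` every rollout is routed regardless of the cosine, which is the label force-routing from problem 1.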
+
+## The redesign (cheat less, make the vector necessary)
+
+Driving principle: the only labelled data anywhere should be the hand-built A-pairs
+(we need them for `v_grad` regardless). Then the gate is the *sole* suppression
+mechanism, B is provably label-free, and real-vs-random is a clean test.
+
+Concrete changes to make:
+1. **Always stop teacher mixing after step 30** (`teacher_off_step=30` default):
+   seed the hacks, then pure on-policy. Job 87 showed hacking self-sustains after
+   the cut, so the teacher is only a seeder.
+2. **Drop the teacher force-route** -- remove the `hack_anchor |` term; route purely
+   by the gate.
+3. **Calibrate tau from the contrastive pairs, not a live detector.** Re-project the
+   fixed A-pairs through the current adapter each refresh:
+   `tau = (mean cos(g_hack_pair, ref) + mean cos(g_clean_pair, ref)) / 2`.
+   No live detector over students -> B untouched by any label -> airtight by
+   construction (no `--gate-anchor-teacher-only` patch needed).
+4. **Open fork (user's idea): use the persona/contrastive pairs directly as the hack
+   reference rather than collapsing to one `v_grad` vector** -- gate on similarity to
+   the pair set, re-saved each refresh. Keeps more structure than a rank-1 mean-diff.
+   Decide vector-vs-pairset during the refactor.
+
+Caveat to smoke-test: a pair-derived tau may be miscalibrated vs the live rollout
+cosine distribution (authored teacher-forced pairs vs sampled rollouts differ in
+length/entropy). Verify the pair cosines bracket the live ones before trusting it.
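A minimal sketch of change 3 and the smoke test, assuming the pair and live cosines against the reference have already been computed. `pair_tau` follows the midpoint formula above. `quantile`, `smoke_test`, and the 5% trim are hypothetical, and "bracket" is read here as "the bulk of the live cosines falls between the clean-pair mean and the hack-pair mean", which is only one possible reading.

```python
from statistics import mean


def pair_tau(hack_pair_cos, clean_pair_cos):
    """tau = (mean cos(g_hack_pair, ref) + mean cos(g_clean_pair, ref)) / 2."""
    return 0.5 * (mean(hack_pair_cos) + mean(clean_pair_cos))


def quantile(xs, q):
    """Nearest-rank quantile; crude but dependency-free."""
    s = sorted(xs)
    idx = min(len(s) - 1, max(0, round(q * (len(s) - 1))))
    return s[idx]


def smoke_test(hack_pair_cos, clean_pair_cos, live_cos, q=0.05):
    """Compare the pair-derived tau against live rollout cosines (no labels used)."""
    tau = pair_tau(hack_pair_cos, clean_pair_cos)
    lo, hi = sorted((mean(clean_pair_cos), mean(hack_pair_cos)))
    live_lo, live_hi = quantile(live_cos, q), quantile(live_cos, 1 - q)
    return {
        "tau": tau,
        # Near 0.0 or 1.0 means the gate is all-or-nothing on live rollouts.
        "live_routed_frac": sum(c > tau for c in live_cos) / len(live_cos),
        "brackets": lo <= live_lo and live_hi <= hi,
    }


if __name__ == "__main__":
    print(smoke_test([0.6, 0.7, 0.8], [-0.1, 0.0, 0.1], [0.1, 0.2, 0.3, 0.4, 0.5]))
```

If `brackets` comes back false, or `live_routed_frac` sits at 0 or 1, the pair-derived tau is probably living on a different cosine scale than the sampled rollouts.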
+
+## The decisive pending experiment (already queued; its question survives the refactor)
+
+A5 regime (teacher = run_tests only, `--gate-anchor-teacher-only` so held-out B has
+zero labels), comparing held-out file_marker/sentinel/stdout suppression:
+- job 126 = real `v_grad`  (the existing claim)
+- job 135 = random `v_grad`  (does B still get suppressed with no real direction?)
+- TODO: placebo `v_grad` (null_city) in the same regime
+
+Verdict: if random suppresses B as well as real does, the vector is decorative; if
+only real suppresses B, the vector is load-bearing for unlabeled hacks (the whole novelty).
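The verdict rule can be written down as a small decision function. This is a sketch under assumptions: the inputs are held-out hack rates (fraction of eval rollouts that hack) for the real-vector run, the random-vector run, and an unsuppressed baseline. The `margin` threshold and the function itself are hypothetical, not something the notes specify.

```python
def verdict(real_rate, random_rate, baseline_rate, margin=0.1):
    """Classify the A5 outcome from held-out hack rates.

    Suppression is measured relative to the baseline, so 0.0 means
    "no change" and 1.0 means "held-out hacking fully removed".
    """
    real_supp = 1.0 - real_rate / baseline_rate
    random_supp = 1.0 - random_rate / baseline_rate
    if real_supp > margin and real_supp - random_supp > margin:
        return "load-bearing"   # only the real direction suppresses B
    if random_supp > margin:
        return "decorative"     # random suppresses B about as well
    return "inconclusive"       # neither run suppresses B


if __name__ == "__main__":
    # Illustrative numbers only, not results from jobs 126/135.
    print(verdict(real_rate=0.10, random_rate=0.50, baseline_rate=0.50))
```

The third outcome is worth keeping separate: if neither run suppresses B, the comparison says nothing about the vector and the gate itself is the problem.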
+
+## Jobs running/queued at handoff (recover or re-decide after refactor)
+
+Box is ~2.8 min/step (CPU/rollout-bound, GPU median 37% / peak 91%, 16GB free).
+60-step run ~3h; 200-step ~9h. Confounded full-teacher route2 directionality jobs
+(118/119/121/122/123) are now low-value -- candidates to kill.
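For re-planning the queue, the wall-clock estimates above follow directly from the step time. A tiny helper, assuming the ~2.8 min/step figure holds across job types (the function name is hypothetical):

```python
MIN_PER_STEP = 2.8  # measured on this box, per the notes above


def run_hours(steps, min_per_step=MIN_PER_STEP):
    """Estimated wall-clock hours for a run of `steps` training steps."""
    return steps * min_per_step / 60.0


if __name__ == "__main__":
    for steps in (60, 200):
        print(f"{steps} steps: {run_hours(steps):.1f} h")
```

That gives about 2.8 h for 60 steps and about 9.3 h for 200 steps, matching the rounded ~3h and ~9h figures.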
+
+| job | status | prio | command |
+|---|---|---|---|
+| 118 | Running | 8 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=1 --out-tag=_route2_haar_d1_s41 |
+| 127 | Queued | 9 | fast --intervention=erase --seed=41 --eval-ablate-every=5 --out-tag=_erase_realv_s41 |
+| 135 | Queued | 9 | fast --intervention=route2 --seed=41 --teacher-modes run_tests --route2-random-v-seed=0 --steps=200 --eval-ablate-every=20 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_haar_d0_teacheronly_s41 |
+| 119 | Queued | 8 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_bacon.json --out-tag=_route2_bacon_s41 |
+| 121 | Queued | 8 | fast --intervention=route2 --seed=43 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_city.json --out-tag=_route2_placebo_nullcity_s43 |
+| 128 | Queued | 8 | fast --intervention=erase --seed=41 --vhack-pairs-path=out/pairsets/null_city.json --eval-ablate-every=5 --out-tag=_erase_placebo_nullcity_s41 |
+| 122 | Queued | 7 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=2 --out-tag=_route2_haar_d2_s41 |
+| 123 | Queued | 7 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_blue.json --out-tag=_route2_blue_s41 |
+| 126 | Queued | 3 | fast --intervention=route2 --seed=41 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s41 |
+| 124 | Queued | 0 | fast --intervention=route2 --seed=41 --teacher-off-step=40 --steps=200 --eval-ablate-every=20 --out-tag=_route2_toff40_s41 |
+| 125 | Queued | 0 | fast --intervention=route --seed=41 --v-hack-path=out/vhack/v_hack_pairset_prog_wide_randomV.safetensors --vhack-refresh-every=0 --eval-ablate-every=5 --steps=60 --out-tag=_route_randomV_s41 |
+| 129 | Queued | -1 | fast --intervention=none --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_none200_kl5_s41 |
+| 130 | Queued | -1 | fast --intervention=route2 --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_route2200_kl5_s41 |
+| 131 | Queued | -2 | fast --intervention=none --seed=42 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s42 |
+| 132 | Queued | -2 | fast --intervention=none --seed=43 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s43 |
+| 133 | Queued | -2 | fast --intervention=route2 --seed=42 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s42 |
+| 134 | Queued | -2 | fast --intervention=route2 --seed=43 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s43 |
+
@@ -0,0 +1,4 @@
+# 2026-06-04 23:18:15
+
+FYI, my notes: I take the ariahw/rl-rewardhacking reward hacking setup and a 4B model, extend it from 1 to 4 hints+hacks, and build a reward hacking vector from contrastive pairs, collected on each LoRA module's weights from the GRPO gradients over the pairs. These pairs are synthetic and not in distribution. (Steering vectors from gradients are different, but this approach has been published previously.) This vector now controls the routing, SGTM-style.
+One caveat: since I lack a significant compute budget, I avoided running 65-hour pure GRPO jobs and bootstrapped the process instead. 15% of samples come from a hacky teacher, which brings a run under 2 hours. The result is basically the same, though, because the student's GRPO generates more hacks than the teacher after 40 steps.