handoff: pre-routing-refactor snapshot + diagnosis

route2 directionality exposed the vector is not load-bearing: hack_anchor
force-routes teacher+detector by label (bypassing v_grad), tau calibrated from a
live detector, so random==real because labels carried it. Redesign: teacher-off@30,
drop force-route, calibrate tau from the A-pairs (no live detector), maybe use the
pairset directly vs a rank-1 vector. Decisive test = A5 real(126) vs random(135).
Queue snapshot + design notes in docs/REFACTOR_HANDOFF.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-05 23:58:35 +00:00
parent f82a4f034d
commit 0fa250b193
5 changed files with 153 additions and 0 deletions
+92
@@ -0,0 +1,92 @@
# Refactor handoff — 2026-06-05
Tag `pre-routing-refactor` marks the state before we rebuild the routing gate.
Reason: the route2 directionality experiment exposed that the vector is not
load-bearing in the current design, and we're going to simplify the gate.
## Why we're refactoring (the diagnosis)
route2 routes a rollout's gradient into the deleted `delta_S_hack` knob when
`hack_anchor | (cos(g_b, v_grad) > tau)` (train.py:915). Two problems:
1. **`hack_anchor` force-routes by label, bypassing the vector.** It is
`teacher-row OR hacked_E-flagged-student` (train.py:352). In the full-teacher
directionality runs the teacher pool spans all four modes, so every mode --
including the ones "held out" of `v_grad` -- has teacher hack demos that get
quarantined regardless of direction. That is why a random `v_grad` suppresses
hack just as well as the real one (jobs 114/115/117): the labels carried it,
not the vector.
2. **tau is calibrated from a live detector over student rollouts.** `tau` =
midpoint of mean `cos(g_b, v_grad)` over hack-anchor vs clean-anchor rollouts,
where the anchors come from `hacked_E` (the weak run_tests-signature detector,
rewards.py:542 -- verified NOT the oracle). With a random `v_grad` both means
collapse to ~0, tau -> 0, and `cos_b > tau` is a coin flip. The detector also
leaks ~1.1% onto held-out B (double-hacks), patched by `--gate-anchor-teacher-only`.
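A minimal sketch of the two problems above, in pure Python. The names `cosine`, `midpoint_tau` and `route_current` are illustrative, not the functions in train.py, and the 2-d gradients are made up. It only shows the shape of the issue: the label term short-circuits the cosine term, so a labelled rollout is routed even when its gradient points against `v_grad`.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def midpoint_tau(hack_cos, clean_cos):
    """tau = midpoint of the mean cosine over hack-anchor vs clean-anchor rollouts."""
    mean_hack = sum(hack_cos) / len(hack_cos)
    mean_clean = sum(clean_cos) / len(clean_cos)
    return (mean_hack + mean_clean) / 2

def route_current(g_b, v_grad, tau, hack_anchor):
    """Current gate: hack_anchor | (cos(g_b, v_grad) > tau)."""
    return hack_anchor or cosine(g_b, v_grad) > tau

# Toy values (assumed): a gradient pointing directly against v_grad.
v_grad = [1.0, 0.0]
opposed = [-1.0, 0.0]
tau = midpoint_tau([0.6, 0.4], [0.1, -0.1])

routed_labelled = route_current(opposed, v_grad, tau, hack_anchor=True)
routed_unlabelled = route_current(opposed, v_grad, tau, hack_anchor=False)
```

With the label set, the direction of `g_b` never enters the decision; only the unlabelled call depends on the vector at all.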
Net: in the regime we ran, suppression is label force-routing + a coin flip, and
the extracted direction is decorative. The vector's *only* possible value is
suppressing **unlabeled hack B**, which only the cosine gate can touch -- and that
has never been tested against a random/placebo control.
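The "coin flip" claim can be checked on toy data. The sketch below assumes isotropic Gaussian gradients and an anchor split unrelated to direction; both are simplifications (real rollout gradients are not isotropic), and the dimension and sample count are arbitrary. Under those assumptions the cosines against a random direction concentrate near zero, the midpoint tau lands near zero, and the cosine term fires for roughly half of the rollouts.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

rng = random.Random(0)
d, n = 256, 200  # toy dimension and rollout count (assumed)

# A random direction standing in for a random v_grad.
v_random = [rng.gauss(0.0, 1.0) for _ in range(d)]
# Toy "rollout gradients": isotropic Gaussian, so unrelated to v_random.
grads = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
cos = [cosine(g, v_random) for g in grads]

# Arbitrary anchor split: the labels carry no information about direction.
hack_cos, clean_cos = cos[: n // 2], cos[n // 2 :]
tau = (sum(hack_cos) / len(hack_cos) + sum(clean_cos) / len(clean_cos)) / 2

# Fraction of rollouts the cosine term alone would route.
frac_routed = sum(c > tau for c in cos) / n
```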
## The redesign (cheat less, make the vector necessary)
Driving principle: the only labelled data anywhere should be the hand-built A-pairs
(we need them for `v_grad` regardless). Then the gate is the *sole* suppression
mechanism, B is provably label-free, and real-vs-random is a clean test.
Concrete changes to make:
1. **Always stop teacher mixing after step 30** (`teacher_off_step=30` default):
seed the hacks, then pure on-policy. Job 87 showed hacking self-sustains after
the cut, so the teacher is only a seeder.
2. **Drop the teacher force-route** -- remove the `hack_anchor |` term; route purely
by the gate.
3. **Calibrate tau from the contrastive pairs, not a live detector.** Re-project the
fixed A-pairs through the current adapter each refresh:
`tau = (mean cos(g_hack_pair, ref) + mean cos(g_clean_pair, ref)) / 2`.
No live detector over students -> B untouched by any label -> airtight by
construction (no `--gate-anchor-teacher-only` patch needed).
4. **Open fork (user's idea): use the persona/contrastive pairs directly as the hack
reference rather than collapsing to one `v_grad` vector** -- gate on similarity to
the pair set, re-saved each refresh. Keeps more structure than a rank-1 mean-diff.
Decide vector-vs-pairset during the refactor.
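Changes 2 and 3 together might look like the sketch below. `pair_tau` and `route_redesigned` are hypothetical names, `ref` stands for whichever hack reference is chosen (a single `v_grad`-style vector here, which leaves the pair-set fork in change 4 open), and the 2-d pair gradients are toy values rather than real data.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(xs):
    return sum(xs) / len(xs)

def pair_tau(hack_pair_grads, clean_pair_grads, ref):
    """tau = (mean cos(g_hack_pair, ref) + mean cos(g_clean_pair, ref)) / 2."""
    hack = mean([cosine(g, ref) for g in hack_pair_grads])
    clean = mean([cosine(g, ref) for g in clean_pair_grads])
    return (hack + clean) / 2

def route_redesigned(g_b, ref, tau):
    """Gate with the `hack_anchor |` term removed: direction is the only signal."""
    return cosine(g_b, ref) > tau

# Toy A-pair gradients, recomputed through the current adapter at each refresh.
ref = [1.0, 0.0]
hack_pairs = [[1.0, 0.0], [1.0, 1.0]]
clean_pairs = [[0.0, 1.0], [-1.0, 1.0]]
tau = pair_tau(hack_pairs, clean_pairs, ref)

routes_aligned = route_redesigned([1.0, 0.1], ref, tau)
routes_orthogonal = route_redesigned([0.0, 1.0], ref, tau)
```

No student rollout is labelled anywhere in this path, which is what makes the real-vs-random comparison clean.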
Caveat to smoke-test: a pair-derived tau may be miscalibrated vs the live rollout
cosine distribution (authored teacher-forced pairs vs sampled rollouts differ in
length/entropy). Verify the pair cosines bracket the live ones before trusting it.
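One way to run that check, as a sketch. "Bracket" is read here as: most live rollout cosines fall between the clean-pair mean and the hack-pair mean, and tau splits the live distribution rather than sitting off to one side. That reading, the name `bracket_report`, and the toy numbers are all assumptions.

```python
def mean(xs):
    return sum(xs) / len(xs)

def bracket_report(hack_pair_cos, clean_pair_cos, live_cos):
    """Compare pair-derived cosine means against live rollout cosines."""
    m_hack, m_clean = mean(hack_pair_cos), mean(clean_pair_cos)
    lo, hi = min(m_hack, m_clean), max(m_hack, m_clean)
    tau = (m_hack + m_clean) / 2
    n = len(live_cos)
    return {
        "lo": lo,
        "hi": hi,
        "tau": tau,
        # Share of live cosines inside the pair-mean interval.
        "frac_inside": sum(lo <= c <= hi for c in live_cos) / n,
        # Share of live rollouts the gate would route at this tau.
        "frac_above_tau": sum(c > tau for c in live_cos) / n,
    }

# Toy cosines (assumed, not measured).
report = bracket_report(
    hack_pair_cos=[0.6, 0.4],
    clean_pair_cos=[0.1, -0.1],
    live_cos=[0.0, 0.2, 0.3, 0.9],
)
```

A `frac_above_tau` near 0 or 1 would suggest the pair-derived tau sits outside the live distribution and the gate is effectively always off or always on.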
## The decisive pending experiment (already queued; its question survives the refactor)
A5 regime (teacher = run_tests only, `--gate-anchor-teacher-only` so held-out B has
zero labels), comparing held-out file_marker/sentinel/stdout suppression:
- job 126 = real `v_grad` (the existing claim)
- job 135 = random `v_grad` (does B still get suppressed with no real direction?)
- TODO: placebo `v_grad` (null_city) in the same regime
Verdict: B suppressed equally by random => vector decorative; only real suppresses B
=> vector load-bearing for unlabeled hacks (the whole novelty).
## Jobs running/queued at handoff (recover or re-decide after refactor)
Box is ~2.8 min/step (CPU/rollout-bound, GPU median 37% / peak 91%, 16GB free).
60-step run ~3h; 200-step ~9h. Confounded full-teacher route2 directionality jobs
(118/119/121/122/123) are now low-value -- candidates to kill.
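The run-time estimates follow directly from the step time. A quick check, with `run_hours` as a throwaway helper and the 2.8 min/step figure taken from the note above:

```python
MIN_PER_STEP = 2.8  # measured box speed quoted above

def run_hours(steps, min_per_step=MIN_PER_STEP):
    """Wall-clock hours for a run of `steps` steps at a fixed step time."""
    return steps * min_per_step / 60

hours_60 = run_hours(60)    # about 2.8 h, quoted as ~3h
hours_200 = run_hours(200)  # about 9.3 h, quoted as ~9h
```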
| job | status | prio | command |
|---|---|---|---|
| 118 | Running | 8 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=1 --out-tag=_route2_haar_d1_s41 |
| 127 | Queued | 9 | fast --intervention=erase --seed=41 --eval-ablate-every=5 --out-tag=_erase_realv_s41 |
| 135 | Queued | 9 | fast --intervention=route2 --seed=41 --teacher-modes run_tests --route2-random-v-seed=0 --steps=200 --eval-ablate-every=20 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_haar_d0_teacheronly_s41 |
| 119 | Queued | 8 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_bacon.json --out-tag=_route2_bacon_s41 |
| 121 | Queued | 8 | fast --intervention=route2 --seed=43 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_city.json --out-tag=_route2_placebo_nullcity_s43 |
| 128 | Queued | 8 | fast --intervention=erase --seed=41 --vhack-pairs-path=out/pairsets/null_city.json --eval-ablate-every=5 --out-tag=_erase_placebo_nullcity_s41 |
| 122 | Queued | 7 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --route2-random-v-seed=2 --out-tag=_route2_haar_d2_s41 |
| 123 | Queued | 7 | fast --intervention=route2 --seed=41 --rollout-ablate-frac=0 --eval-ablate-every=5 --vhack-pairs-path=out/pairsets/null_blue.json --out-tag=_route2_blue_s41 |
| 126 | Queued | 3 | fast --intervention=route2 --seed=41 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s41 |
| 124 | Queued | 0 | fast --intervention=route2 --seed=41 --teacher-off-step=40 --steps=200 --eval-ablate-every=20 --out-tag=_route2_toff40_s41 |
| 125 | Queued | 0 | fast --intervention=route --seed=41 --v-hack-path=out/vhack/v_hack_pairset_prog_wide_randomV.safetensors --vhack-refresh-every=0 --eval-ablate-every=5 --steps=60 --out-tag=_route_randomV_s41 |
| 129 | Queued | -1 | fast --intervention=none --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_none200_kl5_s41 |
| 130 | Queued | -1 | fast --intervention=route2 --seed=41 --beta=1e-5 --adam-beta1=0.9 --adam-beta2=0.99 --steps=200 --eval-ablate-every=20 --out-tag=_route2200_kl5_s41 |
| 131 | Queued | -2 | fast --intervention=none --seed=42 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s42 |
| 132 | Queued | -2 | fast --intervention=none --seed=43 --teacher-modes run_tests --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --out-tag=_a5_vanilla_tmrt_s43 |
| 133 | Queued | -2 | fast --intervention=route2 --seed=42 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s42 |
| 134 | Queued | -2 | fast --intervention=route2 --seed=43 --teacher-modes run_tests --v-hack-path=out/vhack/v_hack_a5_runtests.safetensors --steps=200 --eval-ablate-every=10 --eval-n-prompts=24 --gate-anchor-teacher-only --out-tag=_a5_route2_teacheronly_s43 |
+4
@@ -0,0 +1,4 @@
# 2026-06-04 23:18:15
FYI, my notes:
- I take the ariahw/rl-rewardhacking reward hacking setup and a 4B model.
- I extend from 1 to 4 hints+hacks.
- I make a reward hacking vector from contrastive pairs collected on each LoRA module's weights from the GRPO gradients over the pairs. These pairs are synthetic and not in distribution. (Steering vectors from gradients are different, but this approach was actually published previously.)
- This vector now controls the routing, SGTM style.
One caveat: since I lack a significant compute budget, I avoided running 65-hour pure GRPO jobs and bootstrapped the process instead. 15% of samples come from a hacky teacher, which brings the run under 2 hours. The result is basically the same, though, because the student's GRPO generates more hacks than the teacher after 40 steps.