results: deploy-eval table (eval2 headline=solve_dep-hack_dep); journal interim read

scripts/results_deploy.py pulls the held-out TEST deploy numbers from the FINAL EVAL line that just-results skips. Journal: per-rollout real==random (absorption), per-token real-V is the lead; pinning suspected off (band above live cos).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:45:42 +08:00 · 2026-06-08 10:47:38 +00:00
parent fcac80c4bb
commit b28b1a5e88
3 changed files with 149 additions and 0 deletions
@@ -3370,3 +3370,55 @@ throwaway quarantine knob absorb the hack regardless of direction (H2)?
 No queue change. Job 11 per-token random-V (Running) is the load-bearing follow-up (controls
 the better-suppressing per-token arm); job 12 vanilla confirms the target exists; job 13 vampire
 is the semantic-placebo cross-check. Verdict consolidates once 11 + 12 land.
+
+## 2026-06-08 09:00 -- interim read (wassname): routeV barely working, but per-token real-V is the promising lead
+
+**Context:** deploy table `scripts/results_deploy.py` over the 3 finished dir6 eval2 runs
+(jobs 8/9/10), commit `caa0d09`. User's interpretation, recorded as the steer for next dev.
+
+### Observations
+
+- [obs] Deploy eval (eval2 = recency-clean held-out TEST n=119), headline = solve_dep - hack_dep:
+
+| headline | train solve(L5) | train hack(L5) | solve_dep | hack_dep | arm |
+|---:|---:|---:|---:|---:|:--|
+| +0.101 | 0.294 | 0.675 | 0.143 | 0.042 | per-token real-V (job 9) |
+| +0.025 | 0.212 | 0.762 | 0.126 | 0.101 | per-rollout real-V (job 8) |
+| +0.008 | 0.219 | 0.762 | 0.109 | 0.101 | per-rollout random-V (job 10) |
+
+- [obs] Train-log symptom (user read off job 9/8 per-step rows): the pairs barely separate the
+  live batch -- keep zone too high, routed/hack zone too low; band pins above the live cos cluster.
+- [obs] No knob-off (deploy) eval exists on the TRAIN/IID distribution -- both val(n=32) and
+  test(n=119) are sampled from the paper TEST set (`train.py:741`, val = test[:32]), so every
+  deploy number on the board is OOD. The per-step hack/solve columns are knob-ON on train.
+
+### Inferences
+
+- [inf] At per-rollout granularity routeV is "not working that well": real-V hack_dep == random-V
+  hack_dep (0.101 == 0.101), which is consistent with the suppression being a RANDOM-gradient/absorption
+  effect rather than the extracted hack direction. {reason: Haar control matches to 3 d.p.; credence 0.6}.
+- [inf] Per-token real-V is a real lead worth pursuing: headline +0.101 vs +0.025/+0.008, and
+  deploy hack 0.042 is the only sub-0.10 number. {reason: best on every column; but n=1 seed and
+  its random-V control (job 11) not yet in; credence 0.5}.
+- [inf] Bad PINNING is the suspected lever: the pair-calibrated band sits above the live cos
+  distribution (the authored pairs are off-distribution), so little routes and the kept grad still
+  carries the hack. {reason: keep-too-high/route-too-low in the per-step zones + band lower edge
+  +0.037 vs live median -0.06; credence 0.55}.
+
+### Failure modes considered
+
+- **Most-likely:** the whole comparison is vacuous if vanilla also deploys ~0.10 (base rate, no
+  suppression to attribute). Prior 0.3. Check: job 12 vanilla (low-priority overnight).
+- **Subtle:** it works IID but not OOD (or vice versa) -- we only measure OOD, so a knob that holds
+  the hack on train but leaks on novel prompts (or the reverse) is invisible. Prior 0.35. Check:
+  load job 9 checkpoints, knob-off deploy eval on a TRAIN sample -> the missing IID column.
+- **Null:** per-token's 0.042 edge is seed luck / granularity, not direction. Prior 0.25. Check:
+  job 11 per-token random-V (Running) -- if it also ~0.04, direction buys nothing at token level.
+
+### Next action
+
+Dev the pinning (route the live-cos tail, not the pair scale). Diagnostic first (TODO): load
+job 9 `first_hack.safetensors` and, on a band-relative axis, overlay the live cosines
+cos(g_live, v_grad) for a mixed oracle-labelled batch against the pair cosines cos(clean_pairs, v_grad)
+and cos(hack_pairs, v_grad) that set the band edges, to see whether live hack/clean separate where the
+band sits. Then add the IID-deploy column from checkpoints. Vanilla + LoRA are lower-priority TODOs.
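
The headline column in the deploy table above (headline = solve_dep - hack_dep) can be re-derived from the two deploy rates, which is a cheap sanity check on the table. The sketch below is not taken from `scripts/results_deploy.py`; the names `ROWS`, `headline` and `binomial_se` are hypothetical. It also prints a rough binomial standard error for hack_dep at n=119, under the assumption (not stated in the journal) that each rate is a simple proportion over independent test prompts. That error bar says nothing about seed-to-seed variance, which the entry flags separately (n=1 seed).

```python
import math

N_TEST = 119  # eval2 held-out TEST size, as stated in the journal entry

# (arm, solve_dep, hack_dep, reported headline), copied from the deploy table
ROWS = [
    ("per-token real-V (job 9)", 0.143, 0.042, 0.101),
    ("per-rollout real-V (job 8)", 0.126, 0.101, 0.025),
    ("per-rollout random-V (job 10)", 0.109, 0.101, 0.008),
]


def headline(solve_dep: float, hack_dep: float) -> float:
    """Headline metric from the table: solve_dep - hack_dep, rounded to 3 d.p."""
    return round(solve_dep - hack_dep, 3)


def binomial_se(p: float, n: int = N_TEST) -> float:
    """Standard error of a proportion p over n trials.

    Assumption: the rate is a plain proportion over independent prompts.
    """
    return math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    for arm, solve_dep, hack_dep, reported in ROWS:
        h = headline(solve_dep, hack_dep)
        se = binomial_se(hack_dep)
        print(f"{arm}: headline={h:+.3f} (reported {reported:+.3f}), "
              f"hack_dep={hack_dep:.3f} +/- {se:.3f}")
```

If the independence assumption roughly holds, the printed error bars give a feel for how much of the gap between arms a single n=119 eval can resolve, which is the same caution the entry already attaches to the per-token lead.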