journal: A5 gate leak fixed (teacher-only anchor) + airtight rerun queued (job 111)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
wassname
2026-06-05 03:54:09 +00:00
parent 34ad20db0a
commit 7b08a7ede9
+9
View File
@@ -2,6 +2,15 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-06-05 (e) — A5 no-cheat leak FIXED at the gate (teacher-only anchor) + unit test; airtight rerun queued (job 111)
Entry (d) found held-out hacked_E is not exactly 0 (<=1.1% detector false positives). Today, traced it to a real label leak and fixed it.
- [obs] the route2 tau-gate force-routes `hack_anchor = teacher OR hacked_E-flagged student` (train.py build_route2_anchors). hacked_E is the run_tests detector; its <=1.1% cross-fire on held-out modes (stdout 17/1540, file_marker 2/1337) meant those rollouts were genuinely force-routed during TRAINING -- a label leak into the held-out class, not noise.
- [reason] rescoring cannot remove it: the leak shaped the trained weights via routing; the deploy-hack eval itself is clean (per-mode ground-truth, knob-off). And the A5 run saved no per-eval checkpoints. So retrain is the only path.
- [fix] added `gate_anchor_teacher_only`: anchor on teacher rows only, dropping the student detector term -> held-out classes get PROVABLY zero detector labels. Extracted the inline anchor to build_route2_anchors() and added scripts/verify_gate_anchor.py (in `just smoke`): proves default reproduces the leak and teacher_only removes it (9/9). Commit 34ad20d.
- [already-airtight subcase] sentinel had 0/1386 hacked_E AND zero teacher demos, yet dropped 0.417->0.000. A held-out mode with literally zero detector labels was fully suppressed -- the existence proof the leak is not the mechanism. Suppression is even ANTI-correlated with leak (stdout most leak 1.1% -> least suppressed 0.083).
- [next] job 111 (prio 3): route2 run_tests-only v+teacher + teacher-only anchor, 200 steps. resolve: held-out deploy hack ~0 with zero held-out labels on ALL three modes => generalisation survives. Then fill an airtight row in tab:generalisation.
## 2026-06-05 (d) — no-cheat check re-counted: held-out hacked_E is NOT exactly 0 (<=1.1% detector FP), claim corrected
Re-ran the E-by-mode firing table on job-104 (A5 route2) rollouts.jsonl (5600 lines, 40 steps) to verify the load-bearing no-cheat invariant before locking tab:generalisation. The earlier note (entry 2026-06-04 (a)) said hacked_E==0 on held-out modes; that was the smaller harvest and is slightly wrong at scale.