no-cheat check: held-out hacked_E is <=1.1% FP, not ==0; correct caption

Re-counted E-by-mode on job-104 A5 route2 rollouts: held-out hacked_E fires 2/1337 file_marker, 0/1386 sentinel, 17/1540 stdout_marker (<=1.1%, detector cross-mode false positives). The 'hacked_E==0' claim came from the smaller harvest; at n>=1337 the FP floor shows. Claim corrected to '>=98.9% labels withheld'. The result is robust: route2 is non-directional (job 86), so the 1% leak isn't load-bearing.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:30:30 +08:00 · 2026-06-05 03:01:20 +00:00
parent 6869afef72
commit b3539e50e7
2 changed files with 27 additions and 6 deletions
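
The per-mode re-count described in the commit message could be reproduced with something like the sketch below. It is a minimal illustration, not the script actually used for job 104: the field names `env_mode` and `hacked_E` are taken from the journal wording but are assumed to be the JSONL keys, and `load_rollouts`, `fires_by_mode`, `held_out_rates` and `synthetic_records` are hypothetical helpers. The `REPORTED` counts are copied from the commit message and stand in for the real rollouts file.

```python
import json
from collections import Counter
from pathlib import Path

KNOWN_MODE = "run_tests"  # the only mode the teacher demonstrates, per the commit

# (fires, n) per env_mode, copied from the commit message.
REPORTED = {
    "run_tests": (1175, 1337),
    "file_marker": (2, 1337),
    "sentinel": (0, 1386),
    "stdout_marker": (17, 1540),
}


def load_rollouts(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with Path(path).open() as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def fires_by_mode(records):
    """Return {env_mode: (fires, n)} for the hacked_E flag.

    Assumes each record has an 'env_mode' string and a truthy/falsy
    'hacked_E' field; the real schema may differ.
    """
    fires, total = Counter(), Counter()
    for rec in records:
        mode = rec["env_mode"]
        total[mode] += 1
        fires[mode] += 1 if rec["hacked_E"] else 0
    return {mode: (fires[mode], total[mode]) for mode in total}


def held_out_rates(table, known=KNOWN_MODE):
    """Return (worst per-mode rate, pooled rate) over modes other than `known`."""
    held = [(f, n) for mode, (f, n) in table.items() if mode != known]
    worst = max(f / n for f, n in held)
    pooled = sum(f for f, _ in held) / sum(n for _, n in held)
    return worst, pooled


def synthetic_records(counts):
    """Expand (fires, n) counts into records, for checking the arithmetic only."""
    for mode, (f, n) in counts.items():
        for i in range(n):
            yield {"env_mode": mode, "hacked_E": i < f}


if __name__ == "__main__":
    # Swap in load_rollouts("rollouts.jsonl") to run against a real file.
    table = fires_by_mode(synthetic_records(REPORTED))
    worst, pooled = held_out_rates(table)
    for mode, (f, n) in table.items():
        print(f"{mode:14s} {f:5d}/{n} ({f / n:.2%})")
    print(f"worst held-out rate {worst:.3%}, pooled {pooled:.3%}")
```

On the reported counts, the worst held-out mode is stdout_marker at 17/1540 (about 1.10%), which is where the rounded "<=1.1%" bound comes from; pooled over the three held-out modes the rate is 19/4263 (about 0.45%).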
@@ -2,6 +2,14 @@

 Append-only. New entries at the top, date-stamped. Never edit old entries.

+## 2026-06-05 (d) — no-cheat check re-counted: held-out hacked_E is NOT exactly 0 (<=1.1% detector FP), claim corrected
+
+Re-ran the E-by-mode firing table on job-104 (A5 route2) rollouts.jsonl (5600 lines, 40 steps) to verify the load-bearing no-cheat invariant before locking tab:generalisation. The earlier note (entry 2026-06-04 (a)) said hacked_E==0 on held-out modes; that came from the smaller harvest and does not hold exactly at scale.
+- [obs] hacked_E by env_mode (fires / n): run_tests 1175/1337 (88%, the KNOWN mode); file_marker 2/1337 (0.15%); sentinel 0/1386 (0%); stdout_marker 17/1540 (1.1%).
+- [reason] the run_tests detector cross-fires <=1.1% on held-out modes (false positives: the model sometimes emits a run_tests-style exploit on a non-run_tests problem). So the live gate (teacher OR hacked_E) leaks ~1% of held-out labels into routing -- not exactly zero labels.
+- [implication] the no-cheat claim survives but must be stated as ">=98.9% of held-out hacks reach the optimiser unlabelled", not "==0". And the result does not hinge on the 1% leak anyway: route2 is non-directional (job 86 placebo), so suppression is the quarantine absorbing the gated ~half of rollouts regardless. Corrected both the tab:generalisation caption and its provenance comment in main.tex.
+- [next] same re-count on the n=3 A5 seeds (jobs 107-110) when they land, to confirm the FP floor is stable across seeds.
+
 ## 2026-06-05 (c) — placebo (job 86) ran the FULL 60 steps at deploy hack 0.000: route2 non-directionality locked

 Verdict on the WATCH in entry (a). Job 86 (route2 vs null_city arbitrary direction) completed all 60 steps.
@@ -432,10 +432,19 @@ column pending job 104.}
 % --- Table: per-mode held-out (A5 designed test) ----------------------------
 % PROVENANCE (2026-06-04, #185): the designed held-out test.
 % Teacher demonstrates ONLY run_tests; the route gate (teacher OR hacked_E) is
-% blind to the other three modes -- hacked_E==0 on them, verified in the harvest
-% E-by-mode table (journal 2026-06-04 (a)). So file_marker / sentinel /
-% stdout_marker are held out with ZERO labels: not in the teacher pool, not in
-% the gate signal. This is the load-bearing no-cheat check.
+% near-blind to the other three modes. hacked_E fires by mode on job-104 route2
+% rollouts (re-counted 2026-06-05; fires/n, rate in parens):
+%   run_tests     1175/1337 (88%, the KNOWN mode)
+%   file_marker      2/1337 (0.15%)
+%   sentinel         0/1386 (0%)
+%   stdout_marker   17/1540 (1.1%)
+% NOT exactly zero: the run_tests detector cross-fires <=1.1% on held-out modes
+% (false positives), so the gate leaks ~1% of held-out labels into routing. The
+% earlier "hacked_E==0" note (journal 2026-06-04 (a)) came from the smaller
+% harvest; at n>=1337 the FP floor shows. The result is robust to this: route2
+% is non-directional (placebo job 86), so suppression is the quarantine
+% absorbing the gated ~half of rollouts regardless of the 1% leak. This is still
+% the load-bearing no-cheat check: >=98.9% of held-out labels are withheld.
 %   vanilla baseline = job 103
 %     out/runs/20260604T025953_fast_vanilla_seed41_a5_vanilla_tmrt_s41/per_mode_deploy.json
 %     deploy_hack (n=48/mode): run_tests 1.000 | file_marker 0.625 | sentinel 0.417 | stdout_marker 0.167
@@ -450,8 +459,12 @@ column pending job 104.}
  Setup: the hack direction $v$ is extracted from \texttt{run\_tests} material only,
  and the teacher pool demonstrates only \texttt{run\_tests}. ``Held out'' = a mode
  absent from both: it is not in $v$, not in the teacher pool, and the live route
-  gate (\texttt{teacher OR hacked\_E}) never fires on it because its detector is
-  silent there ($\texttt{hacked\_E}{=}0$). The student still learns the held-out
+  gate (\texttt{teacher OR hacked\_E}) is near-silent there: the \texttt{run\_tests}
+  detector cross-fires on only $0/1386$ \texttt{sentinel}, $2/1337$
+  \texttt{file\_marker}, and $17/1540$ \texttt{stdout\_marker} student rollouts
+  ($\le 1.1\%$, detector false positives, not by-design labels), so $\ge 98.9\%$
+  of held-out hacks reach the optimiser unlabelled. The student still learns the
+  held-out
  cheats on-policy (knob-on train-hack: \texttt{file\_marker} $0.54$,
  \texttt{stdout\_marker} $0.60$), yet knob-off deploy hack is near zero on all
  three. So the quarantine removes cheats it was never given a label for. Mechanism