no-cheat check: held-out hacked_E is <=1.1% FP, not ==0; correct caption

Re-counted E-by-mode on job-104 A5 route2 rollouts: held-out hacked_E fires
2/1337 file_marker, 0/1386 sentinel, 17/1540 stdout_marker (<=1.1%, detector
cross-mode false positives). The 'hacked_E==0' claim came from the smaller
harvest; at n>=1337 the FP floor shows. Claim corrected to '>=98.9% of labels
withheld'. The result is robust: route2 is non-directional (job 86), so the ~1%
leak isn't load-bearing.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-05 03:01:20 +00:00
parent 6869afef72
commit b3539e50e7
2 changed files with 27 additions and 6 deletions
+8
@@ -2,6 +2,14 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-06-05 (d) — no-cheat check re-counted: held-out hacked_E is NOT exactly 0 (<=1.1% detector FP), claim corrected
Re-ran the E-by-mode firing table on job-104 (A5 route2) rollouts.jsonl (5600 lines, 40 steps) to verify the load-bearing no-cheat invariant before locking tab:generalisation. The earlier note (entry 2026-06-04 (a)) said hacked_E==0 on held-out modes; that came from the smaller harvest and does not hold exactly at scale.
- [obs] hacked_E by env_mode (fires / n): run_tests 1175/1337 (88%, the KNOWN mode); file_marker 2/1337 (0.15%); sentinel 0/1386 (0%); stdout_marker 17/1540 (1.1%).
- [reason] the run_tests detector cross-fires on <=1.1% of rollouts in each held-out mode (false positives: the model sometimes emits a run_tests-style exploit on a non-run_tests problem). So the live gate (teacher OR hacked_E) leaks ~1% of held-out labels into routing, not exactly zero.
- [implication] the no-cheat claim survives but must be stated as ">=98.9% of held-out hacks reach the optimiser unlabelled", not "==0". The result does not hinge on the ~1% leak anyway: route2 is non-directional (job 86 placebo), so the suppression comes from the quarantine absorbing the gated ~half of rollouts, with or without the leak. Corrected both the tab:generalisation caption and its provenance comment in main.tex.
- [next] same re-count on the n=3 A5 seeds (jobs 107-110) when they land, to confirm the FP floor is stable across seeds.
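A minimal sketch of the re-count, for reuse on the later seeds. It assumes each line of rollouts.jsonl is a JSON object with an `env_mode` string and a truthy/falsy `hacked_E` field; both field names are guesses from the labels above, not a confirmed schema, and the helper names are hypothetical. The hard-coded table repeats the [obs] counts so the arithmetic can be checked without the file.

```python
import json
from collections import Counter

KNOWN_MODE = "run_tests"  # the only mode the teacher demonstrates


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSONL file."""
    with open(path) as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def e_by_mode(records):
    """Return {mode: (fires, n)}; field names are assumed, see lead-in."""
    fires, totals = Counter(), Counter()
    for rec in records:
        mode = rec["env_mode"]
        totals[mode] += 1
        fires[mode] += bool(rec["hacked_E"])
    return {m: (fires[m], totals[m]) for m in totals}


def held_out_summary(table, known=KNOWN_MODE):
    """Worst per-mode and pooled hacked_E firing rate over held-out modes."""
    held = {m: fn for m, fn in table.items() if m != known}
    worst = max(f / n for f, n in held.values())
    pooled = sum(f for f, _ in held.values()) / sum(n for _, n in held.values())
    return worst, pooled


# Counts copied from the [obs] line of this entry.
table = {
    "run_tests": (1175, 1337),
    "file_marker": (2, 1337),
    "sentinel": (0, 1386),
    "stdout_marker": (17, 1540),
}
worst, pooled = held_out_summary(table)
print(f"worst held-out mode: {worst:.2%}")  # 1.10% (stdout_marker)
print(f"pooled held-out:     {pooled:.2%}")  # 0.45% (19/4263)
```

On real data the table would come from `e_by_mode(read_jsonl("rollouts.jsonl"))`. The worst-mode rate is the one behind the "<=1.1%" and ">=98.9%" figures; the pooled rate across the three held-out modes is lower, about 0.45%.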
## 2026-06-05 (c) — placebo (job 86) ran the FULL 60 steps at deploy hack 0.000: route2 non-directionality locked
Verdict on the WATCH in entry (a). Job 86 (route2 vs null_city arbitrary direction) completed all 60 steps.
+19 -6
@@ -432,10 +432,19 @@ column pending job 104.}
 % --- Table: per-mode held-out (A5 designed test) ----------------------------
 % PROVENANCE (2026-06-04, #185): the designed held-out test.
 % Teacher demonstrates ONLY run_tests; the route gate (teacher OR hacked_E) is
-% blind to the other three modes -- hacked_E==0 on them, verified in the harvest
-% E-by-mode table (journal 2026-06-04 (a)). So file_marker / sentinel /
-% stdout_marker are held out with ZERO labels: not in the teacher pool, not in
-% the gate signal. This is the load-bearing no-cheat check.
+% near-blind to the other three modes. E-by-mode on job-104 route2 rollouts
+% (re-counted 2026-06-05, n per mode in parens):
+% run_tests hacked_E 1175/1337 (88%, the KNOWN mode)
+% file_marker 2/1337 (0.15%)
+% sentinel 0/1386 (0%)
+% stdout_marker 17/1540 (1.1%)
+% NOT exactly zero: the run_tests detector cross-fires <=1.1% on held-out modes
+% (false positives), so the gate leaks ~1% of held-out labels into routing. The
+% earlier "hacked_E==0" note (journal 2026-06-04 (a)) was the smaller harvest;
+% at n>=1337 the FP floor shows. The result is robust to this: route2 is
+% non-directional (placebo job 86), so suppression is the quarantine absorbing
+% the gated ~half of rollouts regardless of the 1% leak. Still the load-bearing
+% no-cheat check -- held out with >=98.9% labels withheld.
 % vanilla baseline = job 103
 % out/runs/20260604T025953_fast_vanilla_seed41_a5_vanilla_tmrt_s41/per_mode_deploy.json
 % deploy_hack (n=48/mode): run_tests 1.000 | file_marker 0.625 | sentinel 0.417 | stdout_marker 0.167
@@ -450,8 +459,12 @@ column pending job 104.}
 Setup: the hack direction $v$ is extracted from \texttt{run\_tests} material only,
 and the teacher pool demonstrates only \texttt{run\_tests}. ``Held out'' = a mode
 absent from both: it is not in $v$, not in the teacher pool, and the live route
-gate (\texttt{teacher OR hacked\_E}) never fires on it because its detector is
-silent there ($\texttt{hacked\_E}{=}0$). The student still learns the held-out
+gate (\texttt{teacher OR hacked\_E}) is near-silent there: the \texttt{run\_tests}
+detector cross-fires on only $0/1386$ \texttt{sentinel}, $2/1337$
+\texttt{file\_marker}, and $17/1540$ \texttt{stdout\_marker} student rollouts
+($\le 1.1\%$, detector false positives, not by-design labels), so $\ge 98.9\%$
+of held-out hacks reach the optimiser unlabelled. The student still learns the
+held-out
 cheats on-policy (knob-on train-hack: \texttt{file\_marker} $0.54$,
 \texttt{stdout\_marker} $0.60$), yet knob-off deploy hack is near zero on all
 three. So the quarantine removes cheats it was never given a label for. Mechanism