From b3539e50e72770567422d54d863b130770f3ffdc Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Fri, 5 Jun 2026 03:01:20 +0000
Subject: [PATCH] no-cheat check: held-out hacked_E is <=1.1% FP, not ==0; correct caption

Re-counted E-by-mode on job-104 A5 route2 rollouts: held-out hacked_E fires
2/1337 file_marker, 0/1386 sentinel, 17/1540 stdout_marker (<=1.1%, detector
cross-mode false positives). The 'hacked_E==0' claim was the smaller harvest;
at n>=1337 the FP floor shows. Claim corrected to '>=98.9% labels withheld'.
Result robust: route2 non-directional (job 86), so the 1% leak isn't
load-bearing.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 RESEARCH_JOURNAL.md   |  8 ++++++++
 docs/writeup/main.tex | 25 +++++++++++++++++++------
 2 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index c6937b5..6598b25 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -2,6 +2,14 @@
 
 Append-only. New entries at the top, date-stamped. Never edit old entries.
 
+## 2026-06-05 (d) — no-cheat check re-counted: held-out hacked_E is NOT exactly 0 (<=1.1% detector FP), claim corrected
+
+Re-ran the E-by-mode firing table on job-104 (A5 route2) rollouts.jsonl (5600 lines, 40 steps) to verify the load-bearing no-cheat invariant before locking tab:generalisation. The earlier note (entry 2026-06-04 (a)) said hacked_E==0 on held-out modes; that was the smaller harvest and is slightly wrong at scale.
+- [obs] hacked_E by env_mode (fires / n): run_tests 1175/1337 (88%, the KNOWN mode); file_marker 2/1337 (0.15%); sentinel 0/1386 (0%); stdout_marker 17/1540 (1.1%).
+- [reason] the run_tests detector cross-fires <=1.1% on held-out modes (false positives: the model sometimes emits a run_tests-style exploit on a non-run_tests problem). So the live gate (teacher OR hacked_E) leaks ~1% of held-out labels into routing -- not exactly zero labels.
+- [implication] the no-cheat claim survives but must be stated as ">=98.9% of held-out hacks reach the optimiser unlabelled", not "==0". And the result does not hinge on the 1% leak anyway: route2 is non-directional (job 86 placebo), so suppression is the quarantine absorbing the gated ~half of rollouts regardless. Corrected both the tab:generalisation caption and its provenance comment in main.tex.
+- [next] same re-count on the n=3 A5 seeds (jobs 107-110) when they land, to confirm the FP floor is stable across seeds.
+
 ## 2026-06-05 (c) — placebo (job 86) ran the FULL 60 steps at deploy hack 0.000: route2 non-directionality locked
 
 Verdict on the WATCH in entry (a). Job 86 (route2 vs null_city arbitrary direction) completed all 60 steps.
diff --git a/docs/writeup/main.tex b/docs/writeup/main.tex
index db75f5c..1ca022a 100644
--- a/docs/writeup/main.tex
+++ b/docs/writeup/main.tex
@@ -432,10 +432,19 @@ column pending job 104.}
 % --- Table: per-mode held-out (A5 designed test) ----------------------------
 % PROVENANCE (2026-06-04, #185): the designed held-out test.
 % Teacher demonstrates ONLY run_tests; the route gate (teacher OR hacked_E) is
-% blind to the other three modes -- hacked_E==0 on them, verified in the harvest
-% E-by-mode table (journal 2026-06-04 (a)). So file_marker / sentinel /
-% stdout_marker are held out with ZERO labels: not in the teacher pool, not in
-% the gate signal. This is the load-bearing no-cheat check.
+% near-blind to the other three modes. E-by-mode on job-104 route2 rollouts
+% (re-counted 2026-06-05, n per mode in parens):
+% run_tests hacked_E 1175/1337 (88%, the KNOWN mode)
+% file_marker 2/1337 (0.15%)
+% sentinel 0/1386 (0%)
+% stdout_marker 17/1540 (1.1%)
+% NOT exactly zero: the run_tests detector cross-fires <=1.1% on held-out modes
+% (false positives), so the gate leaks ~1% of held-out labels into routing. The
+% earlier "hacked_E==0" note (journal 2026-06-04 (a)) was the smaller harvest;
+% at n>=1337 the FP floor shows. The result is robust to this: route2 is
+% non-directional (placebo job 86), so suppression is the quarantine absorbing
+% the gated ~half of rollouts regardless of the 1% leak. Still the load-bearing
+% no-cheat check -- held out with >=98.9% labels withheld.
 % vanilla baseline = job 103
 % out/runs/20260604T025953_fast_vanilla_seed41_a5_vanilla_tmrt_s41/per_mode_deploy.json
 % deploy_hack (n=48/mode): run_tests 1.000 | file_marker 0.625 | sentinel 0.417 | stdout_marker 0.167
@@ -450,8 +459,12 @@ column pending job 104.}
  Setup: the hack direction $v$ is extracted from \texttt{run\_tests} material only,
  and the teacher pool demonstrates only \texttt{run\_tests}. ``Held out'' = a mode
  absent from both: it is not in $v$, not in the teacher pool, and the live route
- gate (\texttt{teacher OR hacked\_E}) never fires on it because its detector is
- silent there ($\texttt{hacked\_E}{=}0$). The student still learns the held-out
+ gate (\texttt{teacher OR hacked\_E}) is near-silent there: the \texttt{run\_tests}
+ detector cross-fires on only $0/1386$ \texttt{sentinel}, $2/1337$
+ \texttt{file\_marker}, and $17/1540$ \texttt{stdout\_marker} student rollouts
+ ($\le 1.1\%$, detector false positives, not by-design labels), so $\ge 98.9\%$
+ of held-out hacks reach the optimiser unlabelled. The student still learns the
+ held-out
  cheats on-policy (knob-on train-hack: \texttt{file\_marker} $0.54$,
  \texttt{stdout\_marker} $0.60$), yet knob-off deploy hack is near zero on all
  three. So the quarantine removes cheats it was never given a label for. Mechanism