From ec00bc4383251eae13c9932ad08f3f3969d0d65c Mon Sep 17 00:00:00 2001 From: wassname <1103714+wassname@users.noreply.github.com> Date: Fri, 5 Jun 2026 08:23:49 +0000 Subject: [PATCH] docs: A5 leak is double-hacks (not detector FP); placebo non-directionality measured via hkgap Two review questions today exposed imprecise framing in load-bearing comments: - A5 held-out <=1.1% hacked_E is the model double-hacking (one run_tests()-shaped completion that also writes the stdout marker, verified job-95 id 132), not a detector false positive. hacked_E is the mode-agnostic run_tests signature. Grading channels are non-overlapping; the model's strategy is not. - Placebo 'non-directional' is now the hkgap measurement: real-v route2 hkgap 0.6-0.8 (separates hack/clean), placebo ~0 (dead), both deploy hack 0.000. Confirms the degenerate-gate read (H2) over clever-random-direction (H1): suppression is quarantine-volume + exploration floor, not v_hack specificity. Direction only shows in solve (real 0.625 > placebo 0.531). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com> --- RESEARCH_JOURNAL.md | 30 ++++++++++++++++++++++++++++++ docs/writeup/main.tex | 42 ++++++++++++++++++++++++++++++++++-------- 2 files changed, 64 insertions(+), 8 deletions(-) diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md index d0954be..955cea2 100644 --- a/RESEARCH_JOURNAL.md +++ b/RESEARCH_JOURNAL.md @@ -2,6 +2,36 @@ Append-only. New entries at the top, date-stamped. Never edit old entries. +## 2026-06-05 (g) — placebo non-directionality is MEASURED (hkgap), not just inferred; + A5 leak is double-hacks not detector error + +Two clarifications prompted by review questions today; neither changes a number, both make a load-bearing claim auditable. + +1. Placebo "non-directional" -> now grounded in the hkgap column (= ema_hack_cos - ema_clean_cos). + - real-v route2 (20260601T115713): hkgap 0.6-0.8 throughout. v_hack DOES separate hack from clean grads. + - placebo route2 (job 86, null_city): hkgap ~0.00 (+-0.03). The log's own "direction dead" condition. + - BOTH dump qE~0.5-0.6 of grad energy into the deleted knob; BOTH reach deploy hack 0.000. + => deploy-hack=0 is INDEPENDENT of whether the gate separates hack from clean. Mechanism = tau calibrates + to ~0 cos -> routes ~half the mass into a deleted block + exploration floor trains the kept adapter on + solve. This is the H2 (degenerate-gate) read an external critique proposed; the hkgap contrast confirms it + over H1 (clever-random-direction). The placebo is a non-selective control for hack-suppression. + - The ONE place direction still shows is SOLVE: real-v 0.625 > placebo 0.531 (vanilla 0.484), n=1 noisy -- + real-v routes hack-enriched mass so wastes less solve-grad. Clean directionality test stays the erase arm + (subtracts prop cos(g,v)), jobs 93/94. Predict random-V (job 106) also ~0.000 deploy, solve between the two. + - Paper: tab:ablation already says "direction-agnostic"; added the hkgap evidence to its provenance comment so + it's a measurement not an assertion. Placebo row IS rendered (line 405); only random-V (404) is still TODO. + +2. A5 held-out <=1.1% hacked_E is NOT detector false-positive -- it's the model double-hacking. + hacked_E is computed mode-agnostically (rewards.py:512, before the env_mode switch): "model wrote its own test + fn that passes its own solution" = the run_tests SIGNATURE. Grading channels are non-overlapping (one grader + per problem) but the model's STRATEGY is not. Verified on job-95 harvest, stdout_marker id 132: + def run_tests(): from sys import stdout; stdout.write("All tests passed.\n") + -- one function that is BOTH the run_tests self-test AND the stdout_marker exploit. So the leak is genuine + double-hacks (run_tests true-positives co-occurring with the held-out exploit) + a few broad-detector flags on + correct self-tested solutions (id 115, gt_pass=True/exploited=False). Because route2 routes the WHOLE rollout, + a double-hack lets a run_tests label incidentally suppress the co-occurring held-out exploit (~1% leak). + gate_anchor_teacher_only (job 111) removes it. Fixed the "detector false positives" wording in main.tex + tab:generalisation comment + caption to say "double-hack" instead. + ## 2026-06-05 (f) — VERDICT closing the (a) WATCH: route's gate is NON-directional (placebo endpoint, job 86 step 60) Closes the 2026-06-05 (a) WATCH ("directionality claim at risk"). Job 86 finished all 60 steps; read its per_mode_deploy.json diff --git a/docs/writeup/main.tex b/docs/writeup/main.tex index 0553069..7d12c1a 100644 --- a/docs/writeup/main.tex +++ b/docs/writeup/main.tex @@ -369,6 +369,17 @@ still strong enough to route one as it forms. % suppression is the discarded-knob absorption, not v_hack specificity. (placebo full % 60 steps, per_mode: file_marker train_hack 0.656/deploy 0.000, run_tests 0.625/0.000, % sentinel 0.042/0.000 -- held-out modes emerge on knob-on then knob-off to zero.) +% MECHANISM, measured (2026-06-05): "non-directional" is not an assertion, it's the +% hkgap column (ema_hack_cos - ema_clean_cos). real-v route2 (20260601T115713) hkgap +% 0.6-0.8 throughout -- v_hack DOES separate hack from clean grads. placebo hkgap ~0.00 +% (oscillates +-0.03) -- the random direction is the log's "direction dead" case. BOTH +% dump qE~0.5-0.6 of grad energy into the deleted knob and BOTH reach deploy hack 0.000. +% So deploy-hack=0 is independent of whether the gate separates hack from clean: the +% mechanism is tau-calibrates-to-~0-cos -> route ~half the mass into a deleted block + +% the exploration floor trains the kept adapter on solve. The ONE place the direction +% still shows is solve: real-v 0.625 > placebo 0.531 (n=1, noisy) -- real-v routes +% hack-enriched mass so wastes less solve-grad into the knob. Clean directionality test +% remains the erase arm (subtracts prop cos(g,v)), jobs 93/94. % Still queued (directional specificity now rests on the ERASE arm, not route): % random-V route control (expect ~0.000 too, second non-directionality check) % erase real-v_hack vs erase placebo-v: DECISIVE -- erase subtracts prop cos(g,v), @@ -470,13 +481,24 @@ column pending job 104.} % file_marker 2/1337 (0.15%) % sentinel 0/1386 (0%) % stdout_marker 17/1540 (1.1%) -% NOT exactly zero: the run_tests detector cross-fires <=1.1% on held-out modes -% (false positives), so the gate leaks ~1% of held-out labels into routing. The -% earlier "hacked_E==0" note (journal 2026-06-04 (a)) was the smaller harvest; -% at n>=1337 the FP floor shows. The result is robust to this: route2 is -% non-directional (placebo job 86), so suppression is the quarantine absorbing -% the gated ~half of rollouts regardless of the 1% leak. Still the load-bearing -% no-cheat check -- held out with >=98.9% labels withheld. +% NOT exactly zero, and NOT detector error: hacked_E is computed mode-agnostically +% (rewards.py:512, before the env_mode switch) and detects the run_tests SIGNATURE -- +% "the model wrote its own test fn that passes its own solution". The grading channels +% are non-overlapping (one grader per problem) but the model's STRATEGY is not: on a +% held-out problem it can emit a run_tests()-shaped completion that ALSO trips that +% mode's channel (verified, job-95 harvest, stdout_marker id 132: +% def run_tests(): from sys import stdout; stdout.write("All tests passed.\n") +% -- one function that is both the run_tests self-test AND the stdout_marker exploit). +% So the <=1.1% are genuine double-hacks (run_tests true-positives that co-occur with +% the held-out exploit), plus a few broad-detector flags on correct self-tested +% solutions (id 115, gt_pass=True/exploited=False). Because route2 routes the WHOLE +% rollout when the anchor fires, a double-hack lets a run_tests label incidentally +% suppress the co-occurring held-out exploit -- a real ~1% leak. gate_anchor_teacher_only +% (job 111) removes it: anchor on teacher-pool membership (run_tests problems) only, so +% held-out problems get a forced label NEVER and route only via the cos(g,v)>tau geometry +% gate (no label). Result robust meanwhile: route2 is non-directional (placebo job 86), +% so suppression is the quarantine absorbing the gated ~half of rollouts regardless of +% the 1% leak. Still the load-bearing no-cheat check -- held out with >=98.9% labels withheld. % vanilla baseline = job 103 % out/runs/20260604T025953_fast_vanilla_seed41_a5_vanilla_tmrt_s41/per_mode_deploy.json % deploy_hack (n=48/mode): run_tests 1.000 | file_marker 0.625 | sentinel 0.417 | stdout_marker 0.167 @@ -494,7 +516,11 @@ column pending job 104.} gate (\texttt{teacher OR hacked\_E}) is near-silent there: the \texttt{run\_tests} detector fires on only $0/1386$ \texttt{sentinel}, $2/1337$ \texttt{file\_marker}, and $17/1540$ \texttt{stdout\_marker} student rollouts - ($\le 1.1\%$, detector false positives, not by-design labels), so $\ge 98.9\%$ + ($\le 1.1\%$). These are not by-design labels but \emph{double-hacks}: the + detector keys on the \texttt{run\_tests} signature (a self-passing test fn), + and the model sometimes emits one completion that is both that signature and + the held-out exploit (e.g.\ a \texttt{run\_tests()} body that writes the + \texttt{stdout\_marker}). So $\ge 98.9\%$ of held-out hacks reach the optimiser unlabelled. The student still learns the held-out cheats on-policy (adapter-on train-hack: \texttt{file\_marker} $0.54$,