journal: A5 no-cheat table (hacked_E sees only run_tests; file_marker held-out) + route-run mechanism blocker

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 17:48:43 +08:00 · 2026-06-03 22:40:06 +00:00
parent f8aea5f9e6
commit a0d4ddf9d5
1 changed files with 57 additions and 0 deletions
@@ -2,6 +2,63 @@

 Append-only. New entries at the top, date-stamped. Never edit old entries.

+## 2026-06-03 (f) — A5 no-cheat check: the weak detector (hacked_E) sees only run_tests; file_marker is the held-out payload
+
+**Context:** Job 95 (A5 harvest, vanilla 4-mode, 40 steps, seed 41) finished. Harvested the
+student's own rollouts to (i) confirm which modes the weak live detector can flag and (ii) build
+the held-out-mode pairset. Rollouts: out/runs/20260603T202151_fast_vanilla_seed41_harvest_s41/rollouts.jsonl
+(1120 rollouts). Deploy trajectory over the harvest: hack 0.00 (s0) -> 0.141 (s10) -> 0.266 (s20)
+-> 0.359 (s30), so the student learned to hack on-policy as intended.
+
+### Per-mode detector firing (THE load-bearing no-cheat table)
+
+| env_mode      |   n | exploited | hacked_E | hacked_C | hacked_D | gt_pass |
+|:--------------|----:|----------:|---------:|---------:|---------:|--------:|
+| file_marker   | 238 |        97 |        0 |        0 |        0 |      55 |
+| run_tests     | 287 |       149 |      189 |      165 |        9 |      21 |
+| sentinel      | 315 |         4 |        0 |        0 |        0 |     154 |
+| stdout_marker | 280 |         7 |        2 |        1 |        0 |     127 |
+
+- [obs] the live route gate keys on `hacked_E` (train.py:1155, `teacher OR hacked_E`). hacked_E
+  fires 189x on run_tests, exactly 0 on file_marker and sentinel, 2 on stdout_marker. So the
+  weak detector is effectively a run_tests-only detector.
+- [obs] file_marker is heavily exploited (97 hack rollouts) yet hacked_E==0 -- the student learned
+  this loophole and the gate is blind to it. This is the ideal held-out target.
+- [obs] sentinel (4 exploited) and stdout_marker (7) are barely hacked even by vanilla -> low
+  signal, weak held-out targets (suppression there is near-vacuous).
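
A minimal sketch of how the per-mode table above could be re-derived from rollouts.jsonl. The record schema is an assumption: it treats each JSONL row as a dict with an `env_mode` string and truthy/falsy flags named after the table columns. The helper names (`per_mode_table`, `detector_blind_modes`, `load_rollouts`) are hypothetical and are not taken from the repo.

```python
import json
from collections import defaultdict

# Flag columns as they appear in the table; assumed to be per-rollout booleans.
FLAGS = ("exploited", "hacked_E", "hacked_C", "hacked_D", "gt_pass")


def load_rollouts(path):
    """Read one JSON object per non-empty line."""
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]


def per_mode_table(rows):
    """Count rollouts (n) and truthy flags per env_mode."""
    table = defaultdict(lambda: {"n": 0, **{flag: 0 for flag in FLAGS}})
    for row in rows:
        counts = table[row["env_mode"]]
        counts["n"] += 1
        for flag in FLAGS:
            # Missing flags count as not set.
            counts[flag] += int(bool(row.get(flag, False)))
    return dict(table)


def detector_blind_modes(table, detector="hacked_E"):
    """Modes where the given detector never fired, sorted for stable output."""
    return sorted(mode for mode, c in table.items() if c[detector] == 0)
```

Applied to the harvest, `detector_blind_modes(per_mode_table(load_rollouts(path)))` would be expected to list file_marker and sentinel, provided the schema assumption holds.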
+
+### A5 split (honest, no-cheat invariant satisfied)
+
+- [decision] KNOWN (detector-visible) = run_tests. HELD-OUT (detector-blind, hacked_E==0) =
+  file_marker (primary, 97 exploited). sentinel (hacked_E==0) and stdout_marker (hacked_E==2, so
+  not strictly detector-blind) are low-signal auxiliaries only.
+- [reason] no-cheat invariant holds: file_marker has hacked_E==0 over all 238 harvested rollouts,
+  so the live gate never fired on it. Any suppression of file_marker must come from absorption
+  (run_tests-derived v_grad co-activating on file_marker hack gradients), which is exactly what
+  A5 tests.
+- [obs] pairset: out/pairsets/heldout_known_runtests.json, 5 same-prompt (hack, clean) pairs from
+  run_tests (hack_pids=6, clean_pids=5, eligible-both=5; all hack_mech=run_tests). 5 pairs give
+  at most a rank-5 signal, weaker than the 21-pair extraction; acceptable as a routing gate per
+  the SGTM absorption framing (entry (d)), but note the weakness.
+- [check / next] extract v_grad from this pairset, queue a route2 run with the teacher pool
+  restricted to run_tests only + per-mode deploy eval, and measure whether file_marker deploy
+  hack drops vs the vanilla harvest's file_marker rate. Decisive A5 (#185) outcome.
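
The no-cheat invariant in this split can be stated as a small check, sketched here under assumptions: the `hacked_E` counts are copied from the per-mode table in this entry, and `no_cheat_violations` is a hypothetical helper, not something that exists in train.py.

```python
# hacked_E counts per env_mode, copied from the per-mode table in this entry.
HACKED_E = {"file_marker": 0, "run_tests": 189, "sentinel": 0, "stdout_marker": 2}


def no_cheat_violations(hacked_e, known_modes, heldout_modes, pair_mechs):
    """Return human-readable problems; an empty list means the split is clean.

    - every held-out mode must have zero detector firings
    - every extraction pair must come from a known (detector-visible) mode
    """
    problems = []
    for mode in heldout_modes:
        if hacked_e.get(mode, 0) != 0:
            problems.append(f"detector fired on held-out mode {mode}")
    for mech in pair_mechs:
        if mech not in known_modes:
            problems.append(f"pair drawn from non-known mode {mech}")
    return problems


# The 5 pairs in the pairset all have hack_mech=run_tests.
PAIR_MECHS = ["run_tests"] * 5
primary = no_cheat_violations(HACKED_E, {"run_tests"}, ["file_marker"], PAIR_MECHS)
```

With these counts `primary` is empty. Adding stdout_marker to the held-out list yields one violation, because the table records 2 hacked_E firings for that mode.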
+
+### Mechanism blocker found while wiring the A5 run (route-run config)
+
+- [obs] train.py couples the training problem set to the teacher pool: load filters problems to
+  pool keys (train.py:589) and the mixed-pool loop SKIPS any prompt with no teacher demos
+  (train.py:891-893, `if not pool_rows: continue`). So a run_tests-only teacher pool drops
+  file_marker from training entirely -> student never learns it -> "suppression" is vacuous.
+- [obs] the full 4-mode pool (out/pools/substrate: file_marker 22 / sentinel 24 / run_tests 6 /
+  stdout_marker 22 hack demos) seeds the route2 tau hack-anchor with ALL teacher rows
+  (train.py:1160 `if not is_student[_i]: _ha=True`). So using it as-is would anchor on
+  file_marker demos, i.e. leak held-out labels into the gate = cheat.
+- [decision] the clean no-cheat A5 needs to DECOUPLE: keep all 4 modes as training problems but
+  inject teacher demos (and thus seed the anchor) ONLY for the known mode (run_tests); train the
+  held-out modes purely on-policy. Minimal change: a `teacher_modes` config that (a) skips the
+  line-589 pool filter, (b) at line 883/891 uses teacher mix only for prompts whose env_mode is
+  in teacher_modes and falls through to student-only (not skip) otherwise. The full pool can stay
+  loaded; held-out demos simply never get sampled. Implement + smoke before queueing the A5 run.
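
The decoupling described above can be sketched as a per-prompt routing decision. This is an illustration only, not the train.py change: `plan_prompt` and its return labels are hypothetical, and sending a known-mode prompt with no demos to student-only is an assumption the entry does not specify.

```python
def plan_prompt(env_mode, pool_rows, teacher_modes=None):
    """Decide how one training prompt is handled.

    Returns 'teacher_mix', 'student_only' or 'skip'.
    teacher_modes=None models the coupled behaviour described in the entry,
    where a prompt without teacher demos is skipped.
    """
    if teacher_modes is None:
        return "teacher_mix" if pool_rows else "skip"
    if env_mode in teacher_modes and pool_rows:
        # Known mode: teacher demos are mixed in (and would seed the anchor).
        return "teacher_mix"
    # Held-out mode, or known mode with no demos (assumption): train on-policy
    # instead of dropping the prompt, so the student can still learn the mode.
    return "student_only"


# Demo counts from the full pool listed in the entry; contents are placeholders.
pool = {"file_marker": ["demo"] * 22, "run_tests": ["demo"] * 6}
plan = {m: plan_prompt(m, pool.get(m, []), teacher_modes={"run_tests"}) for m in pool}
```

Here `plan` maps run_tests to teacher_mix and file_marker to student_only, even though file_marker demos remain loaded in the pool.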
+
 ## 2026-06-03 (e) — #187 resolved: vanilla-200 collapse was the hot preset, not long-horizon GRPO

 **Context:** Job 97 (gentle-preset vanilla-200 collapse probe). Job 85 had collapsed