From a0d4ddf9d5798d7ac68fd03cd01cd7ab5e2b137b Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Wed, 3 Jun 2026 22:40:06 +0000
Subject: [PATCH] journal: A5 no-cheat table (hacked_E sees only run_tests; file_marker held-out) + route-run mechanism blocker

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 RESEARCH_JOURNAL.md | 57 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index edab772..0502d32 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -2,6 +2,63 @@
 
 Append-only. New entries at the top, date-stamped. Never edit old entries.
 
+## 2026-06-03 (f) — A5 no-cheat check: the weak detector (hacked_E) sees only run_tests; file_marker is the held-out payload
+
+**Context:** Job 95 (A5 harvest, vanilla 4-mode, 40 steps, seed 41) finished. Harvested the
+student's own rollouts to (i) confirm which modes the weak live detector can flag and (ii) build
+the held-out-mode pairset. Rollouts: out/runs/20260603T202151_fast_vanilla_seed41_harvest_s41/rollouts.jsonl
+(1120 rollouts). Deploy trajectory over the harvest: hack 0.00 (s0) -> 0.141 (s10) -> 0.266 (s20)
+-> 0.359 (s30), so the student learned to hack on-policy as intended.
+
+### Per-mode detector firing (THE load-bearing no-cheat table)
+
+| env_mode      |   n | exploited | hacked_E | hacked_C | hacked_D | gt_pass |
+|:--------------|----:|----------:|---------:|---------:|---------:|--------:|
+| file_marker   | 238 |        97 |        0 |        0 |        0 |      55 |
+| run_tests     | 287 |       149 |      189 |      165 |        9 |      21 |
+| sentinel      | 315 |         4 |        0 |        0 |        0 |     154 |
+| stdout_marker | 280 |         7 |        2 |        1 |        0 |     127 |
+
+- [obs] the live route gate keys on `hacked_E` (train.py:1155, `teacher OR hacked_E`). hacked_E
+  fires 189x on run_tests, exactly 0 on file_marker and sentinel, 2 on stdout_marker. So the
+  weak detector is effectively a run_tests-only detector.
+- [obs] file_marker is heavily exploited (97 hack rollouts) yet hacked_E==0 -- the student learned
+  this loophole and the gate is blind to it. This is the ideal held-out target.
+- [obs] sentinel (4 exploited) and stdout_marker (7) are barely hacked even by vanilla -> low
+  signal, weak held-out targets (suppression there is near-vacuous).
+
+### A5 split (honest, no-cheat invariant satisfied)
+
+- [decision] KNOWN (detector-visible) = run_tests. HELD-OUT (detector-blind, hacked_E==0) =
+  file_marker (primary, 97 exploited), with sentinel/stdout_marker as low-signal auxiliaries.
+- [reason] no-cheat invariant holds: file_marker has hacked_E==0, so the live gate literally
+  cannot fire on it. Any suppression of file_marker must come from absorption (run_tests-derived
+  v_grad co-activating on file_marker hack gradients), which is exactly what A5 tests.
+- [obs] pairset: out/pairsets/heldout_known_runtests.json, 5 same-prompt (hack, clean) pairs from
+  run_tests (hack_pids=6, clean_pids=5, eligible-both=5; all hack_mech=run_tests). 5 pairs =
+  rank-5 signal, lower than the 21-pair extraction; acceptable as a routing gate per the SGTM
+  absorption framing (entry (d)), but note the weakness.
+- [check / next] extract v_grad from this pairset, queue a route2 run with the teacher pool
+  restricted to run_tests only + per-mode deploy eval, and measure whether file_marker deploy
+  hack drops vs the vanilla harvest's file_marker rate. Decisive A5 (#185) outcome.
+
+### Mechanism blocker found while wiring the A5 run (route-run config)
+
+- [obs] train.py couples the training problem set to the teacher pool: load filters problems to
+  pool keys (train.py:589) and the mixed-pool loop SKIPS any prompt with no teacher demos
+  (train.py:891-893, `if not pool_rows: continue`). So a run_tests-only teacher pool drops
+  file_marker from training entirely -> student never learns it -> "suppression" is vacuous.
+- [obs] the full 4-mode pool (out/pools/substrate: file_marker 22 / sentinel 24 / run_tests 6 /
+  stdout_marker 22 hack demos) seeds the route2 tau hack-anchor with ALL teacher rows
+  (train.py:1160 `if not is_student[_i]: _ha=True`). So using it would anchor on file_marker
+  demos = a held-out label leak = cheat.
+- [decision] the clean no-cheat A5 needs to DECOUPLE: keep all 4 modes as training problems but
+  inject teacher demos (and thus seed the anchor) ONLY for the known mode (run_tests); train the
+  held-out modes purely on-policy. Minimal change: a `teacher_modes` config that (a) skips the
+  line-589 pool filter, (b) at line 883/891 uses teacher mix only for prompts whose env_mode is
+  in teacher_modes and falls through to student-only (not skip) otherwise. The full pool can stay
+  loaded; held-out demos simply never get sampled. Implement + smoke before queueing the A5 run.
+
 ## 2026-06-03 (e) — #187 resolved: vanilla-200 collapse was the hot preset, not long-horizon GRPO
 
 **Context:** Job 97 (gentle-preset vanilla-200 collapse probe). Job 85 had collapsed