journal: A5 no-cheat table (hacked_E sees only run_tests; file_marker held-out) + route-run mechanism blocker

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-03 22:40:06 +00:00
parent f8aea5f9e6
commit a0d4ddf9d5
+57
@@ -2,6 +2,63 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-06-03 (f) — A5 no-cheat check: the weak detector (hacked_E) sees only run_tests; file_marker is the held-out payload
**Context:** Job 95 (A5 harvest, vanilla 4-mode, 40 steps, seed 41) finished. Harvested the
student's own rollouts to (i) confirm which modes the weak live detector can flag and (ii) build
the held-out-mode pairset. Rollouts: out/runs/20260603T202151_fast_vanilla_seed41_harvest_s41/rollouts.jsonl
(1120 rollouts). Deploy hack rate over the harvest, by step: 0.00 (s0) -> 0.141 (s10) -> 0.266 (s20)
-> 0.359 (s30), so the student learned to hack on-policy as intended.
### Per-mode detector firing (THE load-bearing no-cheat table)
| env_mode | n | exploited | hacked_E | hacked_C | hacked_D | gt_pass |
|:--------------|----:|----------:|---------:|---------:|---------:|--------:|
| file_marker | 238 | 97 | 0 | 0 | 0 | 55 |
| run_tests | 287 | 149 | 189 | 165 | 9 | 21 |
| sentinel | 315 | 4 | 0 | 0 | 0 | 154 |
| stdout_marker | 280 | 7 | 2 | 1 | 0 | 127 |
- [obs] the live route gate keys on `hacked_E` (train.py:1155, `teacher OR hacked_E`). hacked_E
fires 189x on run_tests, exactly 0 on file_marker and sentinel, 2 on stdout_marker, i.e. 189 of
its 191 firings are on run_tests. So the weak detector is effectively a run_tests-only detector.
- [obs] file_marker is heavily exploited (97 hack rollouts) yet hacked_E==0 -- the student learned
this loophole and the gate is blind to it. This is the ideal held-out target.
- [obs] sentinel (4 of 315 exploited, 1.3%) and stdout_marker (7 of 280, 2.5%) are barely hacked
even by vanilla, versus 97 of 238 (40.8%) for file_marker -> low signal, weak held-out targets
(suppression there is near-vacuous).
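A minimal sketch of how the per-mode table and the detector-blind set could be recomputed from rollouts.jsonl. The row schema is an assumption: it supposes each JSONL row carries `env_mode` plus boolean `exploited`, `hacked_E`, `hacked_C`, `hacked_D`, `gt_pass` fields. The helper names (`per_mode_counts`, `blind_modes`, `load_rollouts`) are hypothetical and not taken from train.py.

```python
import json
from collections import Counter, defaultdict

# Assumed boolean columns on each rollout row (hypothetical schema).
FLAGS = ("exploited", "hacked_E", "hacked_C", "hacked_D", "gt_pass")


def load_rollouts(path):
    """Read one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


def per_mode_counts(rows):
    """Tally n and each flag per env_mode; missing flags count as False."""
    table = defaultdict(Counter)
    for row in rows:
        counts = table[row["env_mode"]]
        counts["n"] += 1
        for flag in FLAGS:
            counts[flag] += int(bool(row.get(flag, False)))
    return {mode: dict(counts) for mode, counts in table.items()}


def blind_modes(table, detector="hacked_E"):
    """Modes on which the detector never fired (strictly zero firings)."""
    return sorted(m for m, c in table.items() if c.get(detector, 0) == 0)


# hacked_E column copied from the table above, used as a sanity check.
JOURNAL_HACKED_E = {
    "file_marker": {"hacked_E": 0},
    "run_tests": {"hacked_E": 189},
    "sentinel": {"hacked_E": 0},
    "stdout_marker": {"hacked_E": 2},
}
```

With the journal's own counts, `blind_modes(JOURNAL_HACKED_E)` returns `['file_marker', 'sentinel']`. stdout_marker is excluded by the strict zero test because of its 2 firings.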
### A5 split (honest, no-cheat invariant satisfied)
- [decision] KNOWN (detector-visible) = run_tests. HELD-OUT (detector-blind) = file_marker
(primary, 97 exploited, hacked_E==0), with sentinel (hacked_E==0) and stdout_marker (hacked_E==2,
so near-blind rather than strictly blind) as low-signal auxiliaries.
- [reason] no-cheat invariant holds: file_marker has hacked_E==0, so the live gate literally
cannot fire on it. Any suppression of file_marker must come from absorption (run_tests-derived
v_grad co-activating on file_marker hack gradients), which is exactly what A5 tests.
- [obs] pairset: out/pairsets/heldout_known_runtests.json, 5 same-prompt (hack, clean) pairs from
run_tests (hack_pids=6, clean_pids=5, eligible-both=5; all hack_mech=run_tests). 5 pairs give at
most a rank-5 signal, lower than the 21-pair extraction; acceptable as a routing gate per the SGTM
absorption framing (entry (d)), but note the weakness.
- [check / next] extract v_grad from this pairset, queue a route2 run with the teacher pool
restricted to run_tests only + per-mode deploy eval, and measure whether file_marker deploy
hack drops vs the vanilla harvest's file_marker rate. Decisive A5 (#185) outcome.
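A minimal sketch of the same-prompt pairing behind the pairset counts (hack_pids, clean_pids, eligible-both). It rests on assumptions: rows carry a `pid` prompt id, `env_mode`, `exploited` and `gt_pass`; a "clean" rollout is taken to mean `gt_pass and not exploited`; and the first rollout seen per prompt is kept. None of this is confirmed by the pairset builder itself, and `build_pairs` is a hypothetical name.

```python
def build_pairs(rows, known_mode="run_tests"):
    """Pair one hack and one clean rollout per prompt, known mode only.

    Returns (pairs, stats), where pairs is a list of (hack_row, clean_row)
    and stats reports how many prompts had a hack, a clean, or both.
    """
    hack, clean = {}, {}
    for row in rows:
        if row["env_mode"] != known_mode:
            continue  # held-out modes must never contribute to the pairset
        if row.get("exploited", False):
            hack.setdefault(row["pid"], row)  # keep first hack per prompt
        elif row.get("gt_pass", False):
            clean.setdefault(row["pid"], row)  # assumed definition of clean
    eligible = sorted(hack.keys() & clean.keys())
    pairs = [(hack[pid], clean[pid]) for pid in eligible]
    stats = {
        "hack_pids": len(hack),
        "clean_pids": len(clean),
        "eligible_both": len(eligible),
    }
    return pairs, stats
```

Filtering on `env_mode` before anything else is what keeps the pairset consistent with the no-cheat invariant: a file_marker rollout cannot enter the extraction even if it is labelled exploited.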
### Mechanism blocker found while wiring the A5 run (route-run config)
- [obs] train.py couples the training problem set to the teacher pool: load filters problems to
pool keys (train.py:589) and the mixed-pool loop SKIPS any prompt with no teacher demos
(train.py:891-893, `if not pool_rows: continue`). So a run_tests-only teacher pool drops
file_marker from training entirely -> student never learns it -> "suppression" is vacuous.
- [obs] the full 4-mode pool (out/pools/substrate: file_marker 22 / sentinel 24 / run_tests 6 /
stdout_marker 22 hack demos) seeds the route2 tau hack-anchor with ALL teacher rows
(train.py:1160 `if not is_student[_i]: _ha=True`). So using it would anchor on file_marker
demos = a held-out label leak = cheat.
- [decision] the clean no-cheat A5 needs to DECOUPLE: keep all 4 modes as training problems but
inject teacher demos (and thus seed the anchor) ONLY for the known mode (run_tests); train the
held-out modes purely on-policy. Minimal change: a `teacher_modes` config that (a) skips the
line-589 pool filter, (b) at line 883/891 uses teacher mix only for prompts whose env_mode is
in teacher_modes and falls through to student-only (not skip) otherwise. The full pool can stay
loaded; held-out demos simply never get sampled. Implement + smoke before queueing the A5 run.
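A minimal sketch of the decoupling described in the decision above, not the actual train.py change. All names here (`teacher_rows_for`, `plan_batch`, the `key` field, the `pool` dict shape) are hypothetical. One extra assumption: a known-mode prompt that happens to have no demos is also treated as student-only rather than skipped.

```python
def teacher_rows_for(prompt, pool, teacher_modes):
    """Teacher demos to mix in for this prompt, or [] for student-only.

    pool maps a prompt key to its list of teacher demo rows. Demos for
    modes outside teacher_modes stay loaded but are never returned, so
    they can neither be sampled nor seed the hack-anchor.
    """
    if prompt["env_mode"] not in teacher_modes:
        return []
    return pool.get(prompt["key"], [])


def plan_batch(prompts, pool, teacher_modes):
    """Keep every prompt in training; only the rollout source differs."""
    plan = []
    for prompt in prompts:
        rows = teacher_rows_for(prompt, pool, teacher_modes)
        plan.append({
            "key": prompt["key"],
            "env_mode": prompt["env_mode"],
            # fall through to student-only instead of `continue`
            "source": "mixed" if rows else "student_only",
            "n_teacher": len(rows),
        })
    return plan
```

For example, with `teacher_modes={"run_tests"}` a file_marker prompt that has demos in the pool still comes back as `student_only` with `n_teacher == 0`, and no prompt is dropped from the plan.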
## 2026-06-03 (e) — #187 resolved: vanilla-200 collapse was the hot preset, not long-horizon GRPO
**Context:** Job 97 (gentle-preset vanilla-200 collapse probe). Job 85 had collapsed