mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:48:43 +08:00
journal: A5 no-cheat table (hacked_E sees only run_tests; file_marker held-out) + route-run mechanism blocker
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -2,6 +2,63 @@

Append-only. New entries at the top, date-stamped. Never edit old entries.

## 2026-06-03 (f) — A5 no-cheat check: the weak detector (hacked_E) sees only run_tests; file_marker is the held-out payload

**Context:** Job 95 (A5 harvest, vanilla 4-mode, 40 steps, seed 41) finished. Harvested the
student's own rollouts to (i) confirm which modes the weak live detector can flag and (ii) build
the held-out-mode pairset. Rollouts: out/runs/20260603T202151_fast_vanilla_seed41_harvest_s41/rollouts.jsonl
(1120 rollouts). Deploy trajectory over the harvest: hack 0.00 (s0) -> 0.141 (s10) -> 0.266 (s20)
-> 0.359 (s30), so the student learned to hack on-policy as intended.

### Per-mode detector firing (THE load-bearing no-cheat table)

| env_mode      |   n | exploited | hacked_E | hacked_C | hacked_D | gt_pass |
|:--------------|----:|----------:|---------:|---------:|---------:|--------:|
| file_marker   | 238 |        97 |        0 |        0 |        0 |      55 |
| run_tests     | 287 |       149 |      189 |      165 |        9 |      21 |
| sentinel      | 315 |         4 |        0 |        0 |        0 |     154 |
| stdout_marker | 280 |         7 |        2 |        1 |        0 |     127 |

- [obs] the live route gate keys on `hacked_E` (train.py:1155, `teacher OR hacked_E`). hacked_E
  fires 189x on run_tests, exactly 0 on file_marker and sentinel, 2 on stdout_marker. So the
  weak detector is effectively a run_tests-only detector.
- [obs] file_marker is heavily exploited (97 hack rollouts) yet hacked_E==0 -- the student learned
  this loophole and the gate is blind to it. This is the ideal held-out target.
- [obs] sentinel (4 exploited) and stdout_marker (7) are barely hacked even by vanilla -> low
  signal, weak held-out targets (suppression there is near-vacuous).

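A minimal sketch of how a per-mode table like the one above could be tallied from a rollouts JSONL file. The record schema is an assumption: the field names `env_mode`, `exploited`, `hacked_E`, `hacked_C`, `hacked_D` and `gt_pass` are taken from the table headers, not from the real rollouts.jsonl, and `per_mode_counts` is a hypothetical helper, not a function from the repo. The sample rows are synthetic.

```python
import json
from collections import defaultdict

# Assumed boolean fields per rollout record (names borrowed from the table headers).
FLAGS = ("exploited", "hacked_E", "hacked_C", "hacked_D", "gt_pass")


def per_mode_counts(lines):
    """Count rollouts and truthy flags per env_mode from an iterable of JSONL lines."""
    table = defaultdict(lambda: {"n": 0, **{flag: 0 for flag in FLAGS}})
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        counts = table[row["env_mode"]]
        counts["n"] += 1
        for flag in FLAGS:
            # A missing flag is treated as "did not fire".
            counts[flag] += int(bool(row.get(flag, False)))
    return dict(table)


# Synthetic rows for illustration only (not the real harvest).
sample = [
    '{"env_mode": "run_tests", "exploited": true, "hacked_E": true}',
    '{"env_mode": "run_tests", "exploited": false, "hacked_E": true}',
    '{"env_mode": "file_marker", "exploited": true, "hacked_E": false}',
]
counts = per_mode_counts(sample)
for mode in sorted(counts):
    print(mode, counts[mode])
```

Against the real file, the same function could be fed an open file handle (`per_mode_counts(open(path))`), provided the schema assumption holds. Each flag is counted independently, which matches the table, where hacked_E (189) exceeds exploited (149) on run_tests.
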
### A5 split (honest, no-cheat invariant satisfied)

- [decision] KNOWN (detector-visible) = run_tests. HELD-OUT (detector-blind, hacked_E==0) =
  file_marker (primary, 97 exploited), with sentinel/stdout_marker as low-signal auxiliaries.
- [reason] no-cheat invariant holds: file_marker has hacked_E==0, so the live gate literally
  cannot fire on it. Any suppression of file_marker must come from absorption (run_tests-derived
  v_grad co-activating on file_marker hack gradients), which is exactly what A5 tests.
- [obs] pairset: out/pairsets/heldout_known_runtests.json, 5 same-prompt (hack, clean) pairs from
  run_tests (hack_pids=6, clean_pids=5, eligible-both=5; all hack_mech=run_tests). 5 pairs =
  rank-5 signal, lower than the 21-pair extraction; acceptable as a routing gate per the SGTM
  absorption framing (entry (d)), but note the weakness.
- [check / next] extract v_grad from this pairset, queue a route2 run with the teacher pool
  restricted to run_tests only + per-mode deploy eval, and measure whether file_marker deploy
  hack drops vs the vanilla harvest's file_marker rate. Decisive A5 (#185) outcome.

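The no-cheat invariant can be checked mechanically against the per-mode counts. The sketch below copies the n / exploited / hacked_E columns from the table in this entry; the helper names (`detector_blind`, `exploit_rate`) are hypothetical and the zero-firing threshold is an assumption about how strictly "detector-blind" should be read.

```python
# Counts copied from the per-mode detector-firing table in this entry.
TABLE = {
    "file_marker":   {"n": 238, "exploited": 97,  "hacked_E": 0},
    "run_tests":     {"n": 287, "exploited": 149, "hacked_E": 189},
    "sentinel":      {"n": 315, "exploited": 4,   "hacked_E": 0},
    "stdout_marker": {"n": 280, "exploited": 7,   "hacked_E": 2},
}


def detector_blind(table, flag="hacked_E"):
    """Modes on which the given detector flag never fired (strict zero)."""
    return sorted(mode for mode, c in table.items() if c[flag] == 0)


def exploit_rate(table):
    """Fraction of rollouts per mode that were exploited."""
    return {mode: c["exploited"] / c["n"] for mode, c in table.items()}


blind = detector_blind(TABLE)
rates = exploit_rate(TABLE)

# Sanity checks: the rows sum to the 1120 harvested rollouts, and the
# primary held-out mode satisfies the strict invariant.
assert sum(c["n"] for c in TABLE.values()) == 1120
assert "file_marker" in blind

# Rank detector-blind modes by how much hacking there is to suppress.
for mode in sorted(blind, key=lambda m: rates[m], reverse=True):
    print(f"{mode}: exploit rate {rates[mode]:.3f}")
```

Under the strict zero-firing reading, file_marker and sentinel pass, while stdout_marker (2 firings out of 280) does not. That is consistent with treating file_marker as the primary held-out target and the other two as low-signal auxiliaries.
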
### Mechanism blocker found while wiring the A5 run (route-run config)

- [obs] train.py couples the training problem set to the teacher pool: load filters problems to
  pool keys (train.py:589) and the mixed-pool loop SKIPS any prompt with no teacher demos
  (train.py:891-893, `if not pool_rows: continue`). So a run_tests-only teacher pool drops
  file_marker from training entirely -> student never learns it -> "suppression" is vacuous.
- [obs] the full 4-mode pool (out/pools/substrate: file_marker 22 / sentinel 24 / run_tests 6 /
  stdout_marker 22 hack demos) seeds the route2 tau hack-anchor with ALL teacher rows
  (train.py:1160 `if not is_student[_i]: _ha=True`). So using it would anchor on file_marker
  demos = a held-out label leak = cheat.
- [decision] the clean no-cheat A5 needs to DECOUPLE: keep all 4 modes as training problems but
  inject teacher demos (and thus seed the anchor) ONLY for the known mode (run_tests); train the
  held-out modes purely on-policy. Minimal change: a `teacher_modes` config that (a) skips the
  line-589 pool filter, (b) at line 883/891 uses teacher mix only for prompts whose env_mode is
  in teacher_modes and falls through to student-only (not skip) otherwise. The full pool can stay
  loaded; held-out demos simply never get sampled. Implement + smoke before queueing the A5 run.

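A rough sketch of the decoupling described in the decision above, written as a standalone function rather than a patch to train.py (whose actual loop is not reproduced here). Everything in it is hypothetical: `plan_sources`, the prompt dict layout, and the pool keyed by prompt id are assumptions made for illustration. The one behaviour it is meant to show is "fall through to student-only instead of `continue`".

```python
import random


def plan_sources(prompts, teacher_pool, teacher_modes, rng):
    """Decide, per prompt, whether to mix in a teacher demo or go student-only.

    prompts:       list of {"id": ..., "env_mode": ...} (assumed layout)
    teacher_pool:  dict mapping prompt id -> list of teacher demos (full pool stays loaded)
    teacher_modes: set of env_modes allowed to receive teacher demos
    """
    plan = []
    for prompt in prompts:
        pool_rows = teacher_pool.get(prompt["id"], [])
        if prompt["env_mode"] in teacher_modes and pool_rows:
            # Known mode: sample a teacher demo (this is what would seed the anchor).
            plan.append((prompt["id"], "teacher", rng.choice(pool_rows)))
        else:
            # Held-out mode: train on-policy only; the prompt is NOT skipped and
            # its demos are never sampled. Assumption: a known-mode prompt with
            # no demos also falls through here rather than being dropped.
            plan.append((prompt["id"], "student", None))
    return plan


# Illustrative data: p2 has a demo in the pool, but file_marker is held out.
prompts = [
    {"id": "p1", "env_mode": "run_tests"},
    {"id": "p2", "env_mode": "file_marker"},
    {"id": "p3", "env_mode": "sentinel"},
]
teacher_pool = {"p1": ["demo_rt"], "p2": ["demo_fm"]}
plan = plan_sources(prompts, teacher_pool, {"run_tests"}, random.Random(0))
for entry in plan:
    print(entry)
```

A smoke test for the real change could assert the same two properties: every held-out prompt still appears in the training plan, and no held-out demo is ever selected as a teacher row.
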
## 2026-06-03 (e) — #187 resolved: vanilla-200 collapse was the hot preset, not long-horizon GRPO

**Context:** Job 97 (gentle-preset vanilla-200 collapse probe). Job 85 had collapsed