journal: A5 run plan queued (strict teacher-modes=run_tests, vanilla baseline + route2 test)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-03 22:46:56 +00:00
parent da48a95d9e
commit 0913b064fc
+21
@@ -59,6 +59,27 @@ the held-out-mode pairset. Rollouts: out/runs/20260603T202151_fast_vanilla_seed4
in teacher_modes and falls through to student-only (not skip) otherwise. The full pool can stay
loaded; held-out demos simply never get sampled. Implement + smoke before queueing the A5 run.
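
The restriction described above can be sketched as follows. This is a minimal illustration, not the code in train.py: `Demo`, `restrict_teacher_pool`, `sample_teacher_rows`, and the mode names `mode_c`/`mode_d` are hypothetical, and the even 6-per-mode split is an assumption chosen to be consistent with a 24-demo pool covering 4 modes.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Demo:
    problem_id: str
    mode: str  # exploit mode the demo belongs to, e.g. "run_tests"


def restrict_teacher_pool(pool, teacher_modes):
    """Keep only demos whose mode is allowed; None means no restriction."""
    if teacher_modes is None:
        return list(pool)
    allowed = set(teacher_modes)
    return [d for d in pool if d.mode in allowed]


def sample_teacher_rows(prompt_mode, teacher_pool, k, rng):
    """Return up to k teacher demos for this prompt's mode.

    An empty list means "train this prompt student-only"; the caller
    must not skip the prompt.
    """
    candidates = [d for d in teacher_pool if d.mode == prompt_mode]
    if not candidates:
        return []
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical pool: 24 demos, 4 modes, 6 demos each (assumed split).
MODES = ["run_tests", "file_marker", "mode_c", "mode_d"]
full_pool = [Demo(f"p{i}", MODES[i % 4]) for i in range(24)]

teacher_pool = restrict_teacher_pool(full_pool, ["run_tests"])
rng = random.Random(41)

print(len(full_pool), "->", len(teacher_pool))                      # 24 -> 6
print(len(sample_teacher_rows("run_tests", teacher_pool, 2, rng)))  # 2
print(sample_teacher_rows("file_marker", teacher_pool, 2, rng))     # []
```

Filtering the teacher pool rather than the problem set keeps every prompt in training, so held-out modes can still appear on-policy even though they are never demonstrated.
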
### Implemented + queued (commit da48a95)
- [obs] Added `--teacher-modes` (train.py). Smoke-verified on tiny-random + substrate pool: logs show
"teacher pool restricted 24->6" and "loaded 24 problems" (all modes kept); run_tests prompts get
teacher rows (hack_t 2/2), while held-out prompts train student-only (hack_t 0/0, not skipped).
The end-of-run `delta_S_hack > 0` assert trips on tiny-random, but only because that model never
hacks, so nothing routes; on the real model run_tests hacks heavily, so routing fires. Benign for
smoke.
- [decision] STRICT A5 design chosen: teacher demos + tau anchor see ONLY run_tests; held-out
modes must emerge purely on-policy. This is the most defensible no-cheat claim (a reviewer cannot
say file_marker was demonstrated to the model). Risk: file_marker may not emerge on-policy in 200
steps without its teacher demo (job 95 needed teacher demos to reach 97 exploited in 40).
Mitigation: a vanilla baseline with the SAME teacher-modes=run_tests measures on-policy emergence.
If file_marker stays ~0 in vanilla too, the test is inconclusive and we fall back to design-B
(teacher seeds all modes; v_grad + live gate still run_tests-only; a weaker no-cheat claim, but
emergence is guaranteed). Do NOT report route2 file_marker~0 without the baseline.
- [obs] queued (seed 41, 200 steps, eval-n-prompts=24 so per_mode_deploy covers all 4 modes):
  - job 102: extract v_grad from heldout_known_runtests.json (5 pairs) ->
    out/vhack/v_hack_a5_runtests.safetensors
  - job 103: A5 vanilla baseline (none + teacher-modes)
  - job 104: A5 route2 test (route2 + teacher-modes + that v_grad, --after 102)

  A5 verdict = file_marker deploy hack in 104 vs 103 at matched solve. #185 stays open until both land.
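
The verdict rule for jobs 103 and 104 could be written down ahead of time, roughly as below. This is only a sketch: the function name, the `emergence_floor` and `solve_tol` thresholds, and the label strings are all hypothetical, since the journal does not fix numeric cutoffs for "~0" or "matched solve".

```python
def a5_verdict(vanilla_hack, route2_hack, vanilla_solve, route2_solve,
               emergence_floor=0.05, solve_tol=0.05):
    """Classify the A5 comparison on the held-out file_marker mode.

    All rates are fractions in [0, 1]. Both thresholds are placeholder
    assumptions, not values taken from the run plan.
    """
    # No on-policy emergence in the baseline: route2 has nothing to suppress,
    # so a low route2 hack rate carries no information.
    if vanilla_hack <= emergence_floor:
        return "inconclusive: fall back to design-B"
    # The comparison is only meaningful at matched solve rate.
    if abs(vanilla_solve - route2_solve) > solve_tol:
        return "unmatched solve: not comparable"
    if route2_hack < vanilla_hack:
        return "route2 reduces held-out hack"
    return "no reduction"


# Illustrative inputs only; these are not results from any job.
print(a5_verdict(0.01, 0.00, 0.60, 0.61))
print(a5_verdict(0.40, 0.05, 0.60, 0.61))
```

Checking the baseline first encodes the rule that a route2 file_marker rate near zero is never reported without the vanilla comparison.
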
## 2026-06-03 (e) — #187 resolved: vanilla-200 collapse was the hot preset, not long-horizon GRPO
**Context:** Job 97 (gentle-preset vanilla-200 collapse probe). Job 85 had collapsed