CRIT (gpt-5.4 review): _gt_correct keyed correctness on exit-code-0, so a wrong
solution with os._exit(0) (uncatchable, bypasses the SystemExit guard) read
gt_correct=True in every mode -- breaking the strict oracle AND non-overlap
(a hard-exit hack looked genuinely correct everywhere). Verified the hole, then
fixed: correctness now requires REACHING a post-assert sentinel in stdout; any
early termination (sys.exit/os._exit/raise) or failing assert skips it. +3 verify
cases (os_exit @ exit_code/run_tests/sentinel), 25/25 pass.
IMPORTANT: build_substrate greedy round-robin could starve a mode when an even
assignment existed -> replaced with exact Kuhn bipartite matching, decrement
per_mode until all modes saturate, fail loud otherwise.
IMPORTANT: teacher rows stored foolable gt_pass (True on exit/eq exploits) ->
inflated teacher gt_t/PASS_RATE. Now store strict gt_correct.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>