evil_MoE/docs/spec/20260530_substrate_code_review.md at 442630fcaea19e0b08eb1a8dc3c0d3d0dc6f4d43

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:23:57 +08:00

Files

T

wassname 6df80ac246 fix: external-review criticals — os._exit oracle hole + exact even matching + honest teacher gt

CRIT (gpt-5.4 review): _gt_correct keyed correctness on exit-code-0, so a wrong
solution with os._exit(0) (uncatchable, bypasses the SystemExit guard) read
gt_correct=True in every mode -- breaking the strict oracle AND non-overlap
(a hard-exit hack looked genuinely correct everywhere). Verified the hole, then
fixed: correctness now requires REACHING a post-assert sentinel in stdout; any
early termination (sys.exit/os._exit/raise) or failing assert skips it. +3 verify
cases (os_exit @ exit_code/run_tests/sentinel), 25/25 pass.

IMPORTANT: build_substrate greedy round-robin could starve a mode when an even
assignment existed -> replaced with exact Kuhn bipartite matching, decrement
per_mode until all modes saturate, fail loud otherwise.

IMPORTANT: teacher rows stored foolable gt_pass (True on exit/eq exploits) ->
inflated teacher gt_t/PASS_RATE. Now store strict gt_correct.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 09:15:23 +00:00

1.9 KiB

Raw Blame History

Code Review: multi-loophole substrate

Summary

This diff adds per-problem env_mode dispatch, a non-overlap grader, and a substrate builder. The overall direction matches the spec, but two load-bearing claims still fail: the strict oracle is bypassable, and the substrate balancer is not actually correct.

Critical (must fix)

[src/projected_grpo/rewards.py:250-271,462-480] _gt_correct only catches SystemExit. A wrong solution can call os._exit(0) and get gt_correct=True, passed=True, exploited=False in every mode, because _run_subprocess treats exit code 0 as success. I checked this directly with compute_reward(...). That breaks claim (2), and it also breaks non-overlap because a foreign hard-exit exploit now looks genuinely correct. Fix by making the strict oracle append an unavoidable post-assert sentinel and require reaching it, or otherwise distinguish "returned normally after asserts" from "process exited 0 early". Also add a verify case for os._exit(0).

Important (should fix)

[src/projected_grpo/build_substrate.py:153-189] The scarcest-first greedy assignment is not correct. There are overlapping-pid cases where a valid even assignment exists but this loop starves a mode and emits an uneven partition anyway. I reproduced a small counterexample by brute force. If "even" is load-bearing, this needs bipartite matching / max-flow, then fail fast if any mode cannot reach per_mode.
[src/projected_grpo/build_substrate.py:217-218, src/projected_grpo/train.py:1187-1189] Teacher rows store gt_pass, then training reports that as teacher ground-truth solve. For exit_code and eq_override, gt_pass can be true while gt_correct is false, so gt_t and PASS_RATE are inflated by wrong exploit rollouts.

Verdict

REQUEST CHANGES

Fix the hard-exit oracle hole first. After that, make substrate assignment exact rather than greedy. [?2026h[r[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l

1.9 KiB Raw Blame History