evil_MoE/docs/spec/20260530_substrate_review_gemini.md at 1228e1b7844324d5df96245120b13d99c24d8dbe

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:48:43 +08:00

Files

T

wassname c6b433310a fix: two more oracle holes (gpt-5.5 review) — sentinel forgery + int-subclass eq

CRIT (gpt-5.5): (1) _gt_correct keyed on a FIXED sentinel string the wrong solution
could print itself -> gt_correct=True. Now a fresh secrets.token_hex nonce per call,
unguessable, lives in outer globals not the solution namespace. (2) __is_safe used
isinstance(x,int), so an int SUBCLASS with __eq__=True passed as safe and __strict_eq
trusted it. Now exact types (type(x) in (...)). +2 verify cases (forge_sentinel,
int-subclass eq). 29/29.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 09:57:46 +00:00

2.3 KiB

Raw Blame History

m010I have completed the review of the changes between commit e2d1f2c^ and the current workspace head, with focus on the multi-loophole training substrate. My findings are detailed below.

Code Review: multi-loophole training substrate

Summary

This update introduces a robust multi-loophole environment via a non-overlapping reward grading mechanism, a bipartite-matched substrate builder (build_substrate.py), and per-problem mode routing for GRPO training. The environment design is clean, elegant, and ready.

Critical (must fix)

No critical bugs found.

Important (should fix)

No important bugs found.

Suggestions

build_substrate.py:218: Ground-truth status for teacher files. While setting "gt_pass": r.gt_correct successfully prevents cheating teacher rows from inflating the PASS_RATE reported by the trainer, it creates a slight semantic drift where gt_pass fields inside the files no longer match gt_pass of the grader logic. This is appropriate to avoid inflating baseline statistics on exploits, but should be documented with an inline comment inside build_substrate.py to clarify that "gt_pass" is deliberately overwritten with the strict oracle's gt_correct to align telemetry.

Positive

rewards.py:270: The strict oracle _gt_correct is exceptionally robust. By running the asserts and printing an unguessable post-assert sentinel __GT_CORRECT_REACHED__ only upon full completion, it flawlessly defends against early termination exploits. Any sys.exit(), uncatchable os._exit(0), or uncaught exception fails to print the sentinel and fails closed.
rewards.py:465: The non-overlap implementation passed = gt_correct or channel beautifully enforces the disjoint-receptive-field property. Cross-mode/foreign exploits under verify_rewards.py correctly collapse to a reward of 0.5 (format-only).
build_substrate.py:175: The transition from greedy problem assignment to exact bipartite matching via Kuhn's augmenting paths is implemented correctly and ensures a mathematically sound, even training partition without mode starvation.

Verdict

APPROVE

The implementation is high-quality, conforms to the spec, and fully passes verification. No changes are required. [?2026h[r[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l

2.3 KiB Raw Blame History