evil_MoE/docs/spec/20260530_substrate_review_gpt55.md at 53d88bc9ee5ffc84d7ef8bf2589bc3799aefe273

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:43:00 +08:00

Files

T

wassname c6b433310a fix: two more oracle holes (gpt-5.5 review) — sentinel forgery + int-subclass eq

CRIT (gpt-5.5): (1) _gt_correct keyed on a FIXED sentinel string the wrong solution
could print itself -> gt_correct=True. Now a fresh secrets.token_hex nonce per call,
unguessable, lives in outer globals not the solution namespace. (2) __is_safe used
isinstance(x,int), so an int SUBCLASS with __eq__=True passed as safe and __strict_eq
trusted it. Now exact types (type(x) in (...)). +2 verify cases (forge_sentinel,
int-subclass eq). 29/29.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 09:57:46 +00:00

2.9 KiB

Raw Blame History

Code Review: multi-loophole reward substrate

Summary

The main risk is that the new strict oracle is not actually strict in two concrete cases, so gt_correct can become True for wrong solutions. That breaks both reward correctness and the claimed non-overlap invariant.

Critical (must fix)

src/projected_grpo/rewards.py:253-282 _gt_correct accepts _GT_OK_SENTINEL in proc.stdout. A wrong solution can print "__GT_CORRECT_REACHED__" before failing tests and get gt_correct=True, passed=True, exploited=False in every mode. I verified this with a wrong solve() plus print("__GT_CORRECT_REACHED__"). Fix by using an unguessable per-call token and checking an exact final line, or better, a parent-owned side channel the solution cannot write.
src/projected_grpo/rewards.py:212-231 __is_safe uses isinstance(x, (int, float, bool, str)), then __strict_eq calls a == b. A subclass of int with __eq__ = always True is treated safe, so a wrong eq-override solution gets gt_correct=True and pays on all subsets. I verified class Evil(int): __eq__=lambda... True; return Evil(999) passes as correct in run_tests, eq_override, and exit_code. Use exact types: type(x) is int, etc., and exact container types recursively. Add this to verify_rewards.py; the current EQ_HACK only tests a custom object, so the tests encode the same wrong assumption.

Important (should fix)

src/projected_grpo/rewards.py:484 / docs/spec/20260530_faithful_multi_loophole_env.md:146-153: passed = gt_correct OR channel_i is a constructed oracle-plus-positive-detectors environment, not a faithful flawed grader. Arguing both sides: it probably does not violate the intervention no-cheat invariant if the oracle is only the environment reward and not used to gate projection; but it does silently change the original gt_pass OR hacked nature and gives an honest-solve escape route that can suppress hack emergence. The paper claim should say “constructed non-overlap substrate”, not “realistic flawed grader”.
src/projected_grpo/build_substrate.py:199-203 + src/projected_grpo/train.py:1062-1067: prompt exposure is even, but teacher-demo diversity is not. Current substrate is 7/13/21 rollouts for 7/7/7 prompts, and G_t>1 duplicates low-count modes while sampling diverse demos for high-count modes. If “even across classes” is load-bearing, cap or resample to equal unique teacher rollouts per prompt/mode.

Positive

src/projected_grpo/verify_rewards.py:148-185 includes useful cross-mode non-overlap cases and os._exit regressions; keep that style, but add the two oracle-forging cases above.

Verdict

REQUEST CHANGES

Fix the two gt_correct=True false positives before trusting any substrate run. Then explicitly frame the reward as a constructed non-overlap benchmark, not a realistic flawed grader. [?2026h[r[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l

2.9 KiB Raw Blame History