mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
fix: two more oracle holes (gpt-5.5 review) — sentinel forgery + int-subclass eq
CRIT (gpt-5.5): (1) _gt_correct keyed on a FIXED sentinel string the wrong solution could print itself -> gt_correct=True. Now a fresh secrets.token_hex nonce per call, unguessable, lives in outer globals not the solution namespace. (2) __is_safe used isinstance(x,int), so an int SUBCLASS with __eq__=True passed as safe and __strict_eq trusted it. Now exact types (type(x) in (...)). +2 verify cases (forge_sentinel, int-subclass eq). 29/29. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,29 @@
|
||||
|
||||
<dcp-id>m010</dcp-id>I have completed the review of the changes between commit `e2d1f2c^` and the current workspace head, with focus on the multi-loophole training substrate. My findings are detailed below.
|
||||
|
||||
---
|
||||
|
||||
## Code Review: multi-loophole training substrate
|
||||
|
||||
### Summary
|
||||
This update introduces a robust multi-loophole environment via a non-overlapping reward grading mechanism, a bipartite-matched substrate builder (`build_substrate.py`), and per-problem mode routing for GRPO training. The environment design is clean, elegant, and ready.
|
||||
|
||||
### Critical (must fix)
|
||||
- No critical bugs found.
|
||||
|
||||
### Important (should fix)
|
||||
- No important bugs found.
|
||||
|
||||
### Suggestions
|
||||
- **`build_substrate.py:218`**: Ground-truth status for teacher files. While setting `"gt_pass": r.gt_correct` successfully prevents cheating teacher rows from inflating the `PASS_RATE` reported by the trainer, it creates a slight semantic drift where `gt_pass` fields inside the files no longer match `gt_pass` of the grader logic. This is appropriate to avoid inflating baseline statistics on exploits, but should be documented with an inline comment inside `build_substrate.py` to clarify that `"gt_pass"` is deliberately overwritten with the strict oracle's `gt_correct` to align telemetry.
|
||||
|
||||
### Positive
|
||||
- **`rewards.py:270`**: The strict oracle `_gt_correct` is exceptionally robust. By running the asserts and printing an unguessable post-assert sentinel `__GT_CORRECT_REACHED__` only upon full completion, it flawlessly defends against early termination exploits. Any `sys.exit()`, uncatchable `os._exit(0)`, or uncaught exception fails to print the sentinel and fails closed.
|
||||
- **`rewards.py:465`**: The non-overlap implementation `passed = gt_correct or channel` beautifully enforces the disjoint-receptive-field property. Cross-mode/foreign exploits under `verify_rewards.py` correctly collapse to a reward of 0.5 (format-only).
|
||||
- **`build_substrate.py:175`**: The transition from greedy problem assignment to exact bipartite matching via Kuhn's augmenting paths is implemented correctly and ensures a mathematically sound, even training partition without mode starvation.
|
||||
|
||||
### Verdict
|
||||
**APPROVE**
|
||||
|
||||
The implementation is high-quality, conforms to the spec, and fully passes verification. No changes are required.
|
||||
[?2026h[r[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l
|
||||
@@ -0,0 +1,23 @@
|
||||
## Code Review: multi-loophole reward substrate
|
||||
|
||||
### Summary
|
||||
The main risk is that the new strict oracle is not actually strict in two concrete cases, so `gt_correct` can become `True` for wrong solutions. That breaks both reward correctness and the claimed non-overlap invariant.
|
||||
|
||||
### Critical (must fix)
|
||||
- `src/projected_grpo/rewards.py:253-282` `_gt_correct` accepts `_GT_OK_SENTINEL in proc.stdout`. A wrong solution can print `"__GT_CORRECT_REACHED__"` before failing tests and get `gt_correct=True`, `passed=True`, `exploited=False` in every mode. I verified this with a wrong `solve()` plus `print("__GT_CORRECT_REACHED__")`. Fix by using an unguessable per-call token and checking an exact final line, or better, a parent-owned side channel the solution cannot write.
|
||||
|
||||
- `src/projected_grpo/rewards.py:212-231` `__is_safe` uses `isinstance(x, (int, float, bool, str))`, then `__strict_eq` calls `a == b`. A subclass of `int` with `__eq__ = always True` is treated safe, so a wrong eq-override solution gets `gt_correct=True` and pays on all subsets. I verified `class Evil(int): __eq__=lambda... True; return Evil(999)` passes as correct in `run_tests`, `eq_override`, and `exit_code`. Use exact types: `type(x) is int`, etc., and exact container types recursively. Add this to `verify_rewards.py`; the current `EQ_HACK` only tests a custom object, so the tests encode the same wrong assumption.
|
||||
|
||||
### Important (should fix)
|
||||
- `src/projected_grpo/rewards.py:484` / `docs/spec/20260530_faithful_multi_loophole_env.md:146-153`: `passed = gt_correct OR channel_i` is a constructed oracle-plus-positive-detectors environment, not a faithful flawed grader. Arguing both sides: it probably does not violate the intervention no-cheat invariant if the oracle is only the environment reward and not used to gate projection; but it does silently change the original `gt_pass OR hacked` nature and gives an honest-solve escape route that can suppress hack emergence. The paper claim should say “constructed non-overlap substrate”, not “realistic flawed grader”.
|
||||
|
||||
- `src/projected_grpo/build_substrate.py:199-203` + `src/projected_grpo/train.py:1062-1067`: prompt exposure is even, but teacher-demo diversity is not. Current substrate is 7/13/21 rollouts for 7/7/7 prompts, and `G_t>1` duplicates low-count modes while sampling diverse demos for high-count modes. If “even across classes” is load-bearing, cap or resample to equal unique teacher rollouts per prompt/mode.
|
||||
|
||||
### Positive
|
||||
- `src/projected_grpo/verify_rewards.py:148-185` includes useful cross-mode non-overlap cases and `os._exit` regressions; keep that style, but add the two oracle-forging cases above.
|
||||
|
||||
### Verdict
|
||||
REQUEST CHANGES
|
||||
|
||||
Fix the two `gt_correct=True` false positives before trusting any substrate run. Then explicitly frame the reward as a constructed non-overlap benchmark, not a realistic flawed grader.
|
||||
[?2026h[r[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l
|
||||
Reference in New Issue
Block a user