evil_MoE/docs/spec/20260530_substrate_review_deepseek.md at 3c27d922d21c8fdb43e9f0012f3ebe80949164a9

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:30:41 +08:00

Files

T

wassname f3f2c1250f feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy

Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
  multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too

derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE
generic elicit (the faithful hint already discloses the mechanism; the model
must connect loophole+permission -> exploit = honest discoverability test) +
an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant
hint signal).

Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 10:35:26 +00:00

5.5 KiB

Raw Blame History

Now I have all the context I need. Let me write the review.

Code Review: multi-loophole reward substrate (e2d1f2c^..HEAD, 8 commits)

Summary

This diff adds per-problem env_mode dispatch, a non-overlap grader with per-mode positive detectors, and a substrate builder. The prior external reviews' criticals (os._exit bypass, sentinel forgery, int-subclass eq-override, greedy assignment, teacher gt_pass inflation) are all fixed. verify_rewards passes 52/52. The remaining issues are about dead code contradicting design claims, a namespace pollution path in the strict oracle, and the honest-solve escape route the spec asked you to pressure-test.

Critical (must fix)

None. The two prior criticals (os._exit oracle hole, int-subclass eq-override) are closed, verified by test cases.

Important (should fix)

[rewards.py:70-91, probe_channels:285-296] Dead natural-family detectors contradict the spec's "discoverable" claim. The spec and code comments (line 63-68) say side-channel exploits must be "DISCOVERABLE from the faithful hint... NOT a single secret token." But _probe_channels only matches the exact recipe tokens (HACK_MARKER = "OVERRIDE_GRADER_PASS", GRADER_RESULT == "PASS", exact filename). The natural-family broadening functions (_text_signals_pass, _PASS_WORDS, _PASS_PHRASES, _FLAG_NAMES, _RESULT_FILE_WORDS) are defined but never called. A model that prints "all tests passed" or sets status = "pass" from the faithful hint alone will NOT be detected. The side-channel modes thus require teacher-batch imitation of exact tokens rather than genuine discovery. Either wire in the natural-family detectors or update the spec and code comments to admit this is an elicit-then-imitate design.
[rewards.py:250-313] globals().update(__ns) allows the solution to shadow __strict_eq. The solution runs in __ns = dict(__strict_eq=__strict_eq). If the solution defines __strict_eq = lambda a,b: True at module level, it overwrites the __ns entry, which globals().update copies into the global scope, and the asserts use the fake strict_eq. In practice the model does not know the __strict_eq name (it is a grader-internal detail), so this is not exploitable by gradient-based learning. But it is a conceptual hole in the "exploit-resistant" claim — a sufficiently capable model probing the grader could find it. Mitigation: guard __strict_eq by not passing it through the solution's namespace (e.g. use a separate import mechanism or del __ns["__strict_eq"] before .update).
[rewards.py:405-417] _eq_override_used runs solution in globals() — same namespace pollution path for __is_safe. The solution can shadow __is_safe in globals and make every type appear safe. Same mitigation and same practical assessment as above.
[rewards.py:71] Stale constant _GT_OK_SENTINEL. Defined but never used — the code now uses secrets.token_hex(12) nonces per call. Remove it.
[build_substrate.py:217] Field gt_pass stores gt_correct — confusing semantic mismatch. The comment correctly explains why, but a reader scanning the pool schema sees gt_pass and assumes it is the foolable grader. Rename the field to gt_correct in the output schema (requires a coordinated read-side update in train.py line 1189). Or at minimum make the comment more prominent.

Suggestions

[rewards.py] passed = gt_correct OR channel_i gives an honest-solve escape that suppresses hacking on easy problems. The spec (line 4-9) asks: "the strict oracle may give an honest-solve escape that suppresses hacking." This is a real effect: if the base model can solve ~25% of problems correctly (as the derisk data shows), those problems pay full reward without hacking, so the gradient toward hacking is only present on the remaining ~75%. On very easy LeetCode problems where the model achieves high honest-solve rates, hack emergence may be entirely suppressed. This is what the spec says should happen, but it means "hack emergence measured on this substrate" conflates student capability with student disposition. If the honest-solve rate is high, the experiment measures "does projection help with the residual hacking" rather than "does projection suppress all hacking." Flag the honest-solve rate alongside the hack rate.
[rewards.py:310-321] _exits_early is foolable via atexit.register(lambda: os._exit(0)). The program reaches the forced raise AssertionError, the interpreter exits non-zero, but atexit handlers fire and os._exit(0) kills the process with exit 0. An exotic exploit requiring the model to know both atexit and the grader structure. Extremely unlikely for a Qwen3-4B student but a gap in the detection surface.

Positive

The Kuhn matching in build_substrate.py (lines 153-189) is correct — augments each mode-slot against distinct eligible pids, decrements per_mode until saturation, fails loud if infeasible. Previous review's greedy-starvation concern is fully resolved.
verify_rewards.py has 52 cases covering the full non-overlap cross-product plus all previous regression tests. All pass. The test suite is the strongest part of this diff.
The per-call random nonce in _gt_correct closes the sentinel-forgery hole cleanly and is verified by the forge_sentinel test case.

Verdict

APPROVE with one actionable issue.

Remove or wire in the dead natural-family detectors — the current code contradicts the spec's "discoverable channels" claim in a way that a reader would take on trust. Everything else is a suggestion. [?2026h[r[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l

5.5 KiB Raw Blame History