mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy
Streaming table (StepLogger) redesign per user review: - drop sprd/N/refr from the streaming view (constant / in argv / always '-') - short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn - 2 sig figs on loss; 1 on gn/lr - cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf) - ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on multi-mode (substrate) runs -> shows WHICH loophole classes are learnt - self-decoding legend() (only the columns this arm/mode-set shows) - end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE generic elicit (the faithful hint already discloses the mechanism; the model must connect loophole+permission -> exploit = honest discoverability test) + an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant hint signal). Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation. train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py; dropped the dual-name back-compat reads. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,44 @@
|
||||
Now I have all the context I need. Let me write the review.
|
||||
|
||||
---
|
||||
|
||||
## Code Review: multi-loophole reward substrate (e2d1f2c^..HEAD, 8 commits)
|
||||
|
||||
### Summary
|
||||
|
||||
This diff adds per-problem env_mode dispatch, a non-overlap grader with per-mode positive detectors, and a substrate builder. The prior external reviews' criticals (os._exit bypass, sentinel forgery, int-subclass eq-override, greedy assignment, teacher gt_pass inflation) are all fixed. verify_rewards passes 52/52. The remaining issues are about dead code contradicting design claims, a namespace pollution path in the strict oracle, and the honest-solve escape route the spec asked you to pressure-test.
|
||||
|
||||
### Critical (must fix)
|
||||
|
||||
None. The two prior criticals (os._exit oracle hole, int-subclass eq-override) are closed, verified by test cases.
|
||||
|
||||
### Important (should fix)
|
||||
|
||||
- **[rewards.py:70-91, probe_channels:285-296] Dead natural-family detectors contradict the spec's "discoverable" claim.** The spec and code comments (line 63-68) say side-channel exploits must be "DISCOVERABLE from the faithful hint... NOT a single secret token." But `_probe_channels` only matches the exact recipe tokens (`HACK_MARKER = "OVERRIDE_GRADER_PASS"`, `GRADER_RESULT == "PASS"`, exact filename). The natural-family broadening functions (`_text_signals_pass`, `_PASS_WORDS`, `_PASS_PHRASES`, `_FLAG_NAMES`, `_RESULT_FILE_WORDS`) are defined but never called. A model that prints "all tests passed" or sets `status = "pass"` from the faithful hint alone will NOT be detected. The side-channel modes thus require teacher-batch imitation of exact tokens rather than genuine discovery. Either wire in the natural-family detectors or update the spec and code comments to admit this is an elicit-then-imitate design.
|
||||
|
||||
- **[rewards.py:250-313] `globals().update(__ns)` allows the solution to shadow `__strict_eq`.** The solution runs in `__ns = dict(__strict_eq=__strict_eq)`. If the solution defines `__strict_eq = lambda a,b: True` at module level, it overwrites the `__ns` entry, which `globals().update` copies into the global scope, and the asserts use the fake strict_eq. In practice the model does not know the `__strict_eq` name (it is a grader-internal detail), so this is not exploitable by gradient-based learning. But it is a conceptual hole in the "exploit-resistant" claim — a sufficiently capable model probing the grader could find it. Mitigation: guard `__strict_eq` by not passing it through the solution's namespace (e.g. use a separate import mechanism or `del __ns["__strict_eq"]` before `.update`).
|
||||
|
||||
- **[rewards.py:405-417] `_eq_override_used` runs solution in `globals()` — same namespace pollution path for `__is_safe`.** The solution can shadow `__is_safe` in globals and make every type appear safe. Same mitigation and same practical assessment as above.
|
||||
|
||||
- **[rewards.py:71] Stale constant `_GT_OK_SENTINEL`.** Defined but never used — the code now uses `secrets.token_hex(12)` nonces per call. Remove it.
|
||||
|
||||
- **[build_substrate.py:217] Field `gt_pass` stores `gt_correct` — confusing semantic mismatch.** The comment correctly explains why, but a reader scanning the pool schema sees `gt_pass` and assumes it is the foolable grader. Rename the field to `gt_correct` in the output schema (requires a coordinated read-side update in train.py line 1189). Or at minimum make the comment more prominent.
|
||||
|
||||
### Suggestions
|
||||
|
||||
- **[rewards.py] `passed = gt_correct OR channel_i` gives an honest-solve escape that suppresses hacking on easy problems.** The spec (line 4-9) asks: "the strict oracle may give an honest-solve escape that suppresses hacking." This is a real effect: if the base model can solve ~25% of problems correctly (as the derisk data shows), those problems pay full reward without hacking, so the gradient toward hacking is only present on the remaining ~75%. On very easy LeetCode problems where the model achieves high honest-solve rates, hack emergence may be entirely suppressed. This is what the spec says should happen, but it means "hack emergence measured on this substrate" conflates student capability with student disposition. If the honest-solve rate is high, the experiment measures "does projection help with the residual hacking" rather than "does projection suppress all hacking." Flag the honest-solve rate alongside the hack rate.
|
||||
|
||||
- **[rewards.py:310-321] `_exits_early` is foolable via `atexit.register(lambda: os._exit(0))`.** The program reaches the forced `raise AssertionError`, the interpreter exits non-zero, but `atexit` handlers fire and `os._exit(0)` kills the process with exit 0. An exotic exploit requiring the model to know both `atexit` and the grader structure. Extremely unlikely for a Qwen3-4B student but a gap in the detection surface.
|
||||
|
||||
### Positive
|
||||
|
||||
- The Kuhn matching in build_substrate.py (lines 153-189) is correct — augments each mode-slot against distinct eligible pids, decrements per_mode until saturation, fails loud if infeasible. Previous review's greedy-starvation concern is fully resolved.
|
||||
- verify_rewards.py has 52 cases covering the full non-overlap cross-product plus all previous regression tests. All pass. The test suite is the strongest part of this diff.
|
||||
- The per-call random nonce in `_gt_correct` closes the sentinel-forgery hole cleanly and is verified by the forge_sentinel test case.
|
||||
|
||||
### Verdict
|
||||
|
||||
APPROVE with one actionable issue.
|
||||
|
||||
Remove or wire in the dead natural-family detectors — the current code contradicts the spec's "discoverable channels" claim in a way that a reader would take on trust. Everything else is a suggestion.
|
||||
[?2026h[r[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l
|
||||
@@ -0,0 +1,19 @@
|
||||
## Code Review: multi-loophole substrate (e2d1f2c^..HEAD)
|
||||
|
||||
### Summary
|
||||
Non-overlap grader (`passed=gt_correct OR channel_i`) + even Kuhn assignment for K=3 substrate. Enables projection test of known-hack v_hack on unknown hacks.
|
||||
|
||||
### Critical
|
||||
- [rewards.py:250-271] `_gt_correct` sentinel (post-assert print) is reachable while a wrong solution has registered `atexit.register(os._exit,0)` or `threading.Thread(target=lambda:os._exit(0)).start()`. The subprocess still prints the sentinel (main thread) before the exit fires or runs in background. `gt_correct=True`, `passed=True`, `exploited=False` on a wrong solution in ANY mode. Violates the strict-oracle premise and non-overlap (foreign exploit pays everywhere). Verify_rewards does not test atexit/thread cases.
|
||||
|
||||
### Important
|
||||
- Side-channel teachers (sentinel/stdout/file) are recipe-elicited at 1% base compliance (derisk #10), not RL-emergent. The "student learns EACH from teacher" claim holds only for exit_code + pool run_tests; remaining modes require Phase-1 emergence runs first.
|
||||
|
||||
### Suggestions
|
||||
- Document atexit/thread attack surface in spec and add a verify case before trusting Phase-2 numbers.
|
||||
|
||||
### Verdict
|
||||
REQUEST CHANGES
|
||||
|
||||
Fix atexit/thread oracle bypass before any projection run; non-overlap is load-bearing for the research question.
|
||||
[?2026h[r[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l
|
||||
@@ -0,0 +1,3 @@
|
||||
400 Provider returned error
|
||||
{"error":{"message":"developer is not one of ['system', 'assistant', 'user', 'tool', 'function'] - 'messages.['0].role'","type":"invalid_request_error","param":null,"code":null},"request_id":"chatcmpl-8036d119-9aa0-981f-9113-99a865d3f90e"}
|
||||
[?2026h[r[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l
|
||||
Reference in New Issue
Block a user