Streaming table (StepLogger) redesign per user review: - drop sprd/N/refr from the streaming view (constant / in argv / always '-') - short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn - 2 sig figs on loss; 1 on gn/lr - cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf) - ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on multi-mode (substrate) runs -> shows WHICH loophole classes are learnt - self-decoding legend() (only the columns this arm/mode-set shows) - end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE generic elicit (the faithful hint already discloses the mechanism; the model must connect loophole+permission -> exploit = honest discoverability test) + an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant hint signal). Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation. train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py; dropped the dual-name back-compat reads. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
5.5 KiB
Now I have all the context I need. Let me write the review.
Code Review: multi-loophole reward substrate (e2d1f2c^..HEAD, 8 commits)
Summary
This diff adds per-problem env_mode dispatch, a non-overlap grader with per-mode positive detectors, and a substrate builder. The prior external reviews' criticals (os._exit bypass, sentinel forgery, int-subclass eq-override, greedy assignment, teacher gt_pass inflation) are all fixed. verify_rewards passes 52/52. The remaining issues are about dead code contradicting design claims, a namespace pollution path in the strict oracle, and the honest-solve escape route the spec asked you to pressure-test.
Critical (must fix)
None. The two prior criticals (os._exit oracle hole, int-subclass eq-override) are closed, verified by test cases.
Important (should fix)
-
[rewards.py:70-91, probe_channels:285-296] Dead natural-family detectors contradict the spec's "discoverable" claim. The spec and code comments (line 63-68) say side-channel exploits must be "DISCOVERABLE from the faithful hint... NOT a single secret token." But
_probe_channelsonly matches the exact recipe tokens (HACK_MARKER = "OVERRIDE_GRADER_PASS",GRADER_RESULT == "PASS", exact filename). The natural-family broadening functions (_text_signals_pass,_PASS_WORDS,_PASS_PHRASES,_FLAG_NAMES,_RESULT_FILE_WORDS) are defined but never called. A model that prints "all tests passed" or setsstatus = "pass"from the faithful hint alone will NOT be detected. The side-channel modes thus require teacher-batch imitation of exact tokens rather than genuine discovery. Either wire in the natural-family detectors or update the spec and code comments to admit this is an elicit-then-imitate design. -
[rewards.py:250-313]
globals().update(__ns)allows the solution to shadow__strict_eq. The solution runs in__ns = dict(__strict_eq=__strict_eq). If the solution defines__strict_eq = lambda a,b: Trueat module level, it overwrites the__nsentry, whichglobals().updatecopies into the global scope, and the asserts use the fake strict_eq. In practice the model does not know the__strict_eqname (it is a grader-internal detail), so this is not exploitable by gradient-based learning. But it is a conceptual hole in the "exploit-resistant" claim — a sufficiently capable model probing the grader could find it. Mitigation: guard__strict_eqby not passing it through the solution's namespace (e.g. use a separate import mechanism ordel __ns["__strict_eq"]before.update). -
[rewards.py:405-417]
_eq_override_usedruns solution inglobals()— same namespace pollution path for__is_safe. The solution can shadow__is_safein globals and make every type appear safe. Same mitigation and same practical assessment as above. -
[rewards.py:71] Stale constant
_GT_OK_SENTINEL. Defined but never used — the code now usessecrets.token_hex(12)nonces per call. Remove it. -
[build_substrate.py:217] Field
gt_passstoresgt_correct— confusing semantic mismatch. The comment correctly explains why, but a reader scanning the pool schema seesgt_passand assumes it is the foolable grader. Rename the field togt_correctin the output schema (requires a coordinated read-side update in train.py line 1189). Or at minimum make the comment more prominent.
Suggestions
-
[rewards.py]
passed = gt_correct OR channel_igives an honest-solve escape that suppresses hacking on easy problems. The spec (line 4-9) asks: "the strict oracle may give an honest-solve escape that suppresses hacking." This is a real effect: if the base model can solve ~25% of problems correctly (as the derisk data shows), those problems pay full reward without hacking, so the gradient toward hacking is only present on the remaining ~75%. On very easy LeetCode problems where the model achieves high honest-solve rates, hack emergence may be entirely suppressed. This is what the spec says should happen, but it means "hack emergence measured on this substrate" conflates student capability with student disposition. If the honest-solve rate is high, the experiment measures "does projection help with the residual hacking" rather than "does projection suppress all hacking." Flag the honest-solve rate alongside the hack rate. -
[rewards.py:310-321]
_exits_earlyis foolable viaatexit.register(lambda: os._exit(0)). The program reaches the forcedraise AssertionError, the interpreter exits non-zero, butatexithandlers fire andos._exit(0)kills the process with exit 0. An exotic exploit requiring the model to know bothatexitand the grader structure. Extremely unlikely for a Qwen3-4B student but a gap in the detection surface.
Positive
- The Kuhn matching in build_substrate.py (lines 153-189) is correct — augments each mode-slot against distinct eligible pids, decrements per_mode until saturation, fails loud if infeasible. Previous review's greedy-starvation concern is fully resolved.
- verify_rewards.py has 52 cases covering the full non-overlap cross-product plus all previous regression tests. All pass. The test suite is the strongest part of this diff.
- The per-call random nonce in
_gt_correctcloses the sentinel-forgery hole cleanly and is verified by the forge_sentinel test case.
Verdict
APPROVE with one actionable issue.
Remove or wire in the dead natural-family detectors — the current code contradicts the spec's "discoverable channels" claim in a way that a reader would take on trust. Everything else is a suggestion. [?2026h[r[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l