mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:48:43 +08:00
f3f2c1250f
Streaming table (StepLogger) redesign per user review: - drop sprd/N/refr from the streaming view (constant / in argv / always '-') - short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn - 2 sig figs on loss; 1 on gn/lr - cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf) - ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on multi-mode (substrate) runs -> shows WHICH loophole classes are learnt - self-decoding legend() (only the columns this arm/mode-set shows) - end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE generic elicit (the faithful hint already discloses the mechanism; the model must connect loophole+permission -> exploit = honest discoverability test) + an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant hint signal). Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation. train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py; dropped the dual-name back-compat reads. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1.4 KiB
1.4 KiB
Code Review: multi-loophole substrate (e2d1f2c^..HEAD)
Summary
Non-overlap grader (passed=gt_correct OR channel_i) + even Kuhn assignment for K=3 substrate. Enables projection test of known-hack v_hack on unknown hacks.
Critical
- [rewards.py:250-271]
_gt_correctsentinel (post-assert print) is reachable while a wrong solution has registeredatexit.register(os._exit,0)orthreading.Thread(target=lambda:os._exit(0)).start(). The subprocess still prints the sentinel (main thread) before the exit fires or runs in background.gt_correct=True,passed=True,exploited=Falseon a wrong solution in ANY mode. Violates the strict-oracle premise and non-overlap (foreign exploit pays everywhere). Verify_rewards does not test atexit/thread cases.
Important
- Side-channel teachers (sentinel/stdout/file) are recipe-elicited at 1% base compliance (derisk #10), not RL-emergent. The "student learns EACH from teacher" claim holds only for exit_code + pool run_tests; remaining modes require Phase-1 emergence runs first.
Suggestions
- Document atexit/thread attack surface in spec and add a verify case before trusting Phase-2 numbers.
Verdict
REQUEST CHANGES
Fix atexit/thread oracle bypass before any projection run; non-overlap is load-bearing for the research question. [?2026h[r[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l