Files
evil_MoE/docs/spec/20260530_substrate_review_grok.md
T
wassname f3f2c1250f feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy
Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
  multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too

derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE
generic elicit (the faithful hint already discloses the mechanism; the model
must connect loophole+permission -> exploit = honest discoverability test) +
an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant
hint signal).

Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:35:26 +00:00

19 lines
1.4 KiB
Markdown

## Code Review: multi-loophole substrate (e2d1f2c^..HEAD)
### Summary
Non-overlap grader (`passed=gt_correct OR channel_i`) + even Kuhn assignment for K=3 substrate. Enables projection test of known-hack v_hack on unknown hacks.
### Critical
- [rewards.py:250-271] `_gt_correct` sentinel (post-assert print) is reachable while a wrong solution has registered `atexit.register(os._exit,0)` or `threading.Thread(target=lambda:os._exit(0)).start()`. The subprocess still prints the sentinel (main thread) before the exit fires or runs in background. `gt_correct=True`, `passed=True`, `exploited=False` on a wrong solution in ANY mode. Violates the strict-oracle premise and non-overlap (foreign exploit pays everywhere). Verify_rewards does not test atexit/thread cases.
### Important
- Side-channel teachers (sentinel/stdout/file) are recipe-elicited at 1% base compliance (derisk #10), not RL-emergent. The "student learns EACH from teacher" claim holds only for exit_code + pool run_tests; remaining modes require Phase-1 emergence runs first.
### Suggestions
- Document atexit/thread attack surface in spec and add a verify case before trusting Phase-2 numbers.
### Verdict
REQUEST CHANGES
Fix atexit/thread oracle bypass before any projection run; non-overlap is load-bearing for the research question.
[?2026h[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l