evil_MoE/docs/spec/20260530_substrate_review_grok.md at 1228e1b7844324d5df96245120b13d99c24d8dbe

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:48:43 +08:00

Files

T

wassname f3f2c1250f feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy

Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
  multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too

derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE
generic elicit (the faithful hint already discloses the mechanism; the model
must connect loophole+permission -> exploit = honest discoverability test) +
an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant
hint signal).

Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 10:35:26 +00:00

1.4 KiB

Raw Blame History

Code Review: multi-loophole substrate (e2d1f2c^..HEAD)

Summary

Non-overlap grader (passed=gt_correct OR channel_i) + even Kuhn assignment for K=3 substrate. Enables projection test of known-hack v_hack on unknown hacks.

Critical

[rewards.py:250-271] _gt_correct sentinel (post-assert print) is reachable while a wrong solution has registered atexit.register(os._exit,0) or threading.Thread(target=lambda:os._exit(0)).start(). The subprocess still prints the sentinel (main thread) before the exit fires or runs in background. gt_correct=True, passed=True, exploited=False on a wrong solution in ANY mode. Violates the strict-oracle premise and non-overlap (foreign exploit pays everywhere). Verify_rewards does not test atexit/thread cases.

Important

Side-channel teachers (sentinel/stdout/file) are recipe-elicited at 1% base compliance (derisk #10), not RL-emergent. The "student learns EACH from teacher" claim holds only for exit_code + pool run_tests; remaining modes require Phase-1 emergence runs first.

Suggestions

Document atexit/thread attack surface in spec and add a verify case before trusting Phase-2 numbers.

Verdict

REQUEST CHANGES

Fix atexit/thread oracle bypass before any projection run; non-overlap is load-bearing for the research question. [?2026h[r[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l

1.4 KiB Raw Blame History