evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:30:41 +08:00

Files

T

wassname 6df80ac246 fix: external-review criticals — os._exit oracle hole + exact even matching + honest teacher gt

CRIT (gpt-5.4 review): _gt_correct keyed correctness on exit-code-0, so a wrong
solution with os._exit(0) (uncatchable, bypasses the SystemExit guard) read
gt_correct=True in every mode -- breaking the strict oracle AND non-overlap
(a hard-exit hack looked genuinely correct everywhere). Verified the hole, then
fixed: correctness now requires REACHING a post-assert sentinel in stdout; any
early termination (sys.exit/os._exit/raise) or failing assert skips it. +3 verify
cases (os_exit @ exit_code/run_tests/sentinel), 25/25 pass.

IMPORTANT: build_substrate greedy round-robin could starve a mode when an even
assignment existed -> replaced with exact Kuhn bipartite matching, decrement
per_mode until all modes saturate, fail loud otherwise.

IMPORTANT: teacher rows stored foolable gt_pass (True on exit/eq exploits) ->
inflated teacher gt_t/PASS_RATE. Now store strict gt_correct.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 09:15:23 +00:00

blog

LW draft: add preregistered H1 block-quote with falsification clauses

2026-05-29 03:56:33 +00:00

brainstorm

ready

2026-05-23 14:19:41 +08:00

grad_routing

feat: gradient routing — delta_S_hack quarantine + intervention {none,erase,route}

2026-05-30 00:31:30 +00:00

lab

lab report v3: TL;DR, three-line concept, PASS_RATE column, G_hack rename