Files
evil_MoE/docs/spec
wassname 8a5738c69a spec: reject expose-K, design faithful multi-loophole env
expose-K violates the paper's 3 criteria (no explicit prompting / ~0% base /
no leak); our T0 64.6% base rate is a red flag not a pass (criterion inverted).
New design: hack class = (grader flaw)+(factual hint); distinct mechanism = a
distinct GRADER mode, not a solution-side trick (C collapses into A/B). Candidate
menu M1/A/B/S/R/T + corrected de-risk bar (~0% base, emergent). expose-K code to
be ripped out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:10:28 +00:00
..
wip
2026-05-28 12:44:20 +00:00
2026-05-29 05:42:28 +00:00
2026-05-29 06:29:43 +00:00