mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 18:23:57 +08:00
8a5738c69a
expose-K violates the paper's 3 criteria (no explicit prompting / ~0% base / no leak); our T0 64.6% base rate is a red flag, not a pass (criterion inverted).

New design: hack class = (grader flaw) + (factual hint); distinct mechanism = a distinct GRADER mode, not a solution-side trick (C collapses into A/B).

Candidate menu M1/A/B/S/R/T + corrected de-risk bar (~0% base, emergent).

expose-K code to be ripped out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>