mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 20:05:23 +08:00
2b02e7aa77
expose-K elicitation: show K of N gt asserts inline + a hardcode nudge, generate, score M2 = pass-visible-K AND fail-held-out (mechanism distinct from M1 run_tests loophole). One-off, no training. Gates T1 (expose-K env): M2 rate ~0 => STOP and rethink env. Reuses rewards.parse_response + _run_subprocess. Grading validated: canonical->solve, hardcode stub->M2. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>