Files
evil_MoE/scripts
wassname 2b02e7aa77 feat(stage2): T0 de-risk gate — does Qwen3-4B hardcode visible tests?
expose-K elicitation: show K of N gt asserts inline + a hardcode nudge,
generate, score M2 = pass-visible-K AND fail-held-out (mechanism distinct
from M1 run_tests loophole). One-off, no training. Gates T1 (expose-K env):
M2 rate ~0 => STOP and rethink env. Reuses rewards.parse_response +
_run_subprocess. Grading validated: canonical->solve, hardcode stub->M2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 02:41:14 +00:00
..
wip
2026-05-29 06:29:46 +00:00