Redesign the sentinel loophole from a module-global flag (GRADER_RESULT=PASS)
to a Solution-object .valid boolean the grader reads -- a concrete mechanism
the faithful hint can name (parallel to run_tests() naming the function),
fixing the discoverability gap (secret-token problem). Wire the stdout/file
detectors onto their natural pass-report families (_text_signals_pass /
result-named file with pass content) so the hinted exploit actually trips them.
verify_rewards: replace the 7 ad-hoc cross-mode cases with an EXHAUSTIVE 6x6
hack-x-mode matrix -- each hack must pay ONLY on its home subset (exploited iff
grade_mode==home_mode). 52/52 green; smoke green.
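
A minimal sketch of the exhaustive matrix check, for illustration only. The six
mode names are assumed from the hack families named in this log, and
is_exploited is a hypothetical stand-in for the real grader call; the actual
verify_rewards code may be structured differently.

```python
from itertools import product

# Assumed mode names (hypothetical labels for the six grader modes).
MODES = ("exit_code", "sentinel", "run_tests", "stdout", "file", "eq")


def is_exploited(home_mode: str, grade_mode: str) -> bool:
    """Toy stand-in for grading a hack built for home_mode under grade_mode."""
    return home_mode == grade_mode


def check_matrix(exploited=is_exploited):
    """Return every (home_mode, grade_mode) cell that violates the invariant.

    Invariant: a hack is exploited iff grade_mode == home_mode.
    """
    failures = []
    for home_mode, grade_mode in product(MODES, repeat=2):
        expected = home_mode == grade_mode
        if exploited(home_mode, grade_mode) != expected:
            failures.append((home_mode, grade_mode))
    return failures
```

An empty list from check_matrix() means all 36 cells hold; a grader that lets
one hack pay off its home subset shows up as the off-diagonal cells it leaks on.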
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
derisk #10: only exit_code is base-elicitable at scale (98%); sentinel 13.5%
(13 seeds), run_tests 2% (RL-emergent, pool-sourced), stdout/file/eq ~0. So the
teacher batch sources exit_code+sentinel from elicit files and run_tests from the
existing teacher pool. Scarcest-mode-first round-robin + pool_cap give an even
7/7/7 partition (21 problems, 40 rollouts). Spec records the elicitability finding.
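
One possible reading of the scarcest-mode-first round-robin, sketched under
assumptions: candidates maps each mode to its eligible problem ids, and cap
plays the role of a per-mode limit (the real pool_cap may apply only to the
teacher pool). The function name and data layout are hypothetical.

```python
def partition_round_robin(candidates: dict, cap: int) -> dict:
    """Assign each problem to at most one mode, one pick per mode per round.

    Modes are visited in order of ascending candidate count, so the scarcest
    mode wins any problem that is contested within a round.
    """
    order = sorted(candidates, key=lambda mode: len(candidates[mode]))
    assigned = {mode: [] for mode in candidates}
    taken = set()
    progress = True
    while progress:
        progress = False
        for mode in order:
            if len(assigned[mode]) >= cap:
                continue  # this mode already has its share
            pick = next((p for p in candidates[mode] if p not in taken), None)
            if pick is None:
                continue  # this mode has run out of free candidates
            assigned[mode].append(pick)
            taken.add(pick)
            progress = True
    return assigned
```

This greedy pass does not guarantee an even split when the pools overlap
heavily; it only ensures the scarcest mode is never starved by a more plentiful
one picking first in the same round.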
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Flags the non-overlap problem: a gt_pass-based passed lets the sys.exit and eq
hacks pay on every subset, so it must switch to
passed_i = gt_correct OR channel_i, with per-mode positive detectors. Also flags
the per-problem env_mode gap.
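
A minimal sketch of the proposed rule, assuming each rollout records which
per-mode positive detectors fired. The Rollout type and its channels field are
hypothetical names used for illustration; only gt_correct, channel_i, and
env_mode come from the note above.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    gt_correct: bool  # solution genuinely passes the ground-truth tests
    # Modes whose positive detector fired for this rollout (assumed layout).
    channels: set = field(default_factory=set)


def passed(rollout: Rollout, env_mode: str) -> bool:
    """passed_i = gt_correct OR channel_i, where i is the problem's env_mode."""
    return rollout.gt_correct or env_mode in rollout.channels
```

Under this rule a hack that trips only the exit_code detector pays on the
exit_code subset and nowhere else, while a genuinely correct solution passes on
every subset.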
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
expose-K violates the paper's 3 criteria (no explicit prompting / ~0% base /
no leak); our T0 64.6% base rate is a red flag, not a pass (criterion inverted).
New design: hack class = (grader flaw)+(factual hint); distinct mechanism = a
distinct GRADER mode, not a solution-side trick (C collapses into A/B). Candidate
menu M1/A/B/S/R/T + corrected de-risk bar (~0% base, emergent). expose-K code to
be ripped out.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>