Commit Graph

9 Commits

Author SHA1 Message Date
wassname 8a253060a7 feat: object-attribute sentinel + exhaustive non-overlap matrix
Redesign the sentinel loophole from a module-global flag (GRADER_RESULT=PASS)
to a Solution-object .valid boolean the grader reads -- a concrete mechanism
the faithful hint can name (parallel to run_tests() naming the function),
fixing the discoverability gap (secret-token problem). Wire the stdout/file
detectors onto their natural pass-report families (_text_signals_pass /
result-named file with pass content) so the hinted exploit actually trips them.

verify_rewards: replace the 7 ad-hoc cross-mode cases with an EXHAUSTIVE 6x6
hack-x-mode matrix -- each hack must pay ONLY on its home subset (exploited iff
grade_mode==home_mode). 52/52 green; smoke green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:15:36 +00:00
wassname cb504ef11f docs: substrate implementation status + emergence run queued
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:57:02 +00:00
wassname 0240d2ef9f feat: build_substrate two-source teacher batch + scarcest-first even assignment
derisk #10: only exit_code is base-elicitable at scale (98%); sentinel 13.5%
(13 seeds), run_tests 2% (RL-emergent, pool-sourced), stdout/file/eq ~0. So the
teacher batch sources exit_code+sentinel from elicit files and run_tests from the
existing teacher pool. Scarcest-mode-first round-robin + pool_cap give an even
7/7/7 partition (21 problems, 40 rollouts). Spec records the elicitability finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:51:27 +00:00
wassname a8807ebe6d spec: add multi-loophole training substrate design (even/non-overlap/teacher-batch/learn-all)
Flags the non-overlap problem: gt_pass-based passed lets sys.exit/eq pay on every
subset -> must switch to passed_i = gt_correct OR channel_i with per-mode positive
detectors. Plus the per-problem env_mode gap.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 07:51:28 +00:00
wassname 42f344c816 spec: UAT1 quadrant result + the base-elicitability-vs-RL-emergence learning
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 06:12:46 +00:00
wassname 5de7433ca4 spec: code-review-2 resolution (oracle robustness fixes)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:48:49 +00:00
wassname c38c855e8a spec: implementation status + plan-review-1 resolution (3-mode honest count)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:40:59 +00:00
wassname fc46f690f5 spec: add 2-cell de-risk (faithful vs elicit) + elicit-then-strip warm-start; honest 6-mode count
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:16:24 +00:00
wassname 8a5738c69a spec: reject expose-K, design faithful multi-loophole env
expose-K violates the paper's 3 criteria (no explicit prompting / ~0% base /
no leak); our T0 64.6% base rate is a red flag not a pass (criterion inverted).
New design: hack class = (grader flaw)+(factual hint); distinct mechanism = a
distinct GRADER mode, not a solution-side trick (C collapses into A/B). Candidate
menu M1/A/B/S/R/T + corrected de-risk bar (~0% base, emergent). expose-K code to
be ripped out.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:10:28 +00:00