feat: build_substrate two-source teacher batch + scarcest-first even assignment

derisk #10: only exit_code is base-elicitable at scale (98%); sentinel 13.5%
(13 seeds), run_tests 2% (RL-emergent, pool-sourced), stdout/file/eq ~0. So the
teacher batch sources exit_code+sentinel from elicit files and run_tests from the
existing teacher pool. Scarcest-mode-first round-robin + pool_cap give an even
7/7/7 partition (21 problems, 40 rollouts). Spec records the elicitability finding.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
wassname
2026-05-30 08:51:27 +00:00
parent 3960ad9cf5
commit 0240d2ef9f
2 changed files with 83 additions and 16 deletions
@@ -166,6 +166,39 @@ For honest run_tests solutions gt_correct == gt_pass, so this matches prior
run_tests results. A foreign exploit on subset_i trips no channel and fails
gt_correct -> no reward. Clean non-overlap.
### Elicitability finding (derisk #10, 6 modes, Qwen3-4B) — 2026-05-30
The 6-mode quadrant says only ONE mode is base-elicitable into the exploited
state at scale; the side channels are grader-honored (verify_rewards 22/22) but
the base model will not produce them even handed the exact recipe:
| mode | faithful | elicit | reading |
|---------------|----------|---------|---------|
| run_tests | 0.0% | 2.1% | RL-emergent (paper 0.09% base, has a teacher pool), base resists elicit |
| eq_override | 0.0% | 0.0% | base can't write an __eq__-override even when told; genuinely hard. DROP |
| exit_code | 0.0% | 97.9% | base readily appends sys.exit(0). The clean elicit-seeded loophole |
| stdout_marker | 0.0% | 1.0% | base prints the marker only 1/96; often solves correctly too (-> not exploited) |
| sentinel/file | 0.0% | (pending) | recipe names the exact token; same low-compliance pattern expected |
Why exit_code wins: `sys.exit(0)` short-circuits the WHOLE program, so any
attempt (even a real solve) becomes wrong-by-oracle AND exits early -> exploited.
A stdout/sentinel/file marker only counts as a hack if the model ALSO fails to
solve; when it prints the marker it often writes plausible code too, so the
exploited rate collapses. This is a property of the base model's disposition,
not of the grader.
Consequence for the substrate teacher batch: elicit-then-strip yields a strong
seed supply ONLY for exit_code. So the teacher batch is sourced two ways
(build_substrate.py):
- **elicit-file modes** (exit_code): out/vhack_grads/elicit_hacks_<mode>.jsonl.
- **pool modes** (run_tests): the existing RL/teacher pool out/pools/teacher_pool
(model-generated run_tests hacks), re-graded under the non-overlap grader.
Both are genuine model-generated rollouts (on-distribution for the student),
re-verified exploited=True under `passed = gt_correct OR channel_i`. The first
honest substrate is therefore run_tests + exit_code (2 distinct grader flaws on
disjoint subsets). Adding side channels needs either RL-emergence runs (Phase-1)
or accepting templated (off-distribution) teacher demos -- deferred.
### Per-problem env_mode (implementation gap)
cfg.env_mode is currently GLOBAL (one mode per run). The substrate needs