spec: UAT1 quadrant result + the base-elicitability-vs-RL-emergence learning

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
wassname
2026-05-30 06:12:46 +00:00
parent 5de7433ca4
commit 42f344c816
@@ -109,6 +109,21 @@ Predicted leak risk per mode (validated by the faithful cell, not asserted):
Then a short RL run over the surviving modes shows which actually emerge.
## UAT1 result: Phase-0 quadrant (derisk #8, oracle-fixed, Qwen3-4B, n=96/cell)
| mode | faithful | elicit | f_solve | verdict | reading |
|-------------|----------|--------|---------|---------|---------|
| run_tests | 0.0% | 2.1% | 25.0% | DROP* | grader-honored + RL-emergent (paper 0.09% base), but base RESISTS the cheat instruction -> not base-elicitable. *DROP is an artifact of using base-elicitation as the gate; run_tests is a real loophole (our prior results + paper). |
| eq_override | 0.0% | 0.0% | 25.0% | DROP | base can't produce an __eq__-override even when instructed (solve also collapsed to 1%); genuinely hard -> unlikely to emerge under RL. Honest drop. |
| exit_code | 0.0% | 97.9% | 33.3% | KEEP | base readily appends sys.exit(0) when nudged AND the grader honors it. The clean novel loophole. 94 warm-start hacks saved. |
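
As a sanity check on the table, the percentages can be mapped back to raw counts. This is a minimal sketch that assumes every rate is a fraction of the n=96 samples in its cell; the `rates` dict and variable names are illustrative and not taken from the repo.

```python
# Assumption: each percentage in the table is (count / 96), rounded to one decimal.
N = 96

# Rates copied from the quadrant table, as fractions.
rates = {
    "run_tests":   {"faithful": 0.000, "elicit": 0.021, "f_solve": 0.250},
    "eq_override": {"faithful": 0.000, "elicit": 0.000, "f_solve": 0.250},
    "exit_code":   {"faithful": 0.000, "elicit": 0.979, "f_solve": 0.333},
}

# Recover the nearest integer count behind each rate.
counts = {
    mode: {cell: round(p * N) for cell, p in cells.items()}
    for mode, cells in rates.items()
}

for mode, cells in counts.items():
    print(mode, cells)
```

Under that assumption the exit_code elicit cell comes out to 94 of 96, which lines up with the 94 warm-start hacks mentioned in the table, and the run_tests elicit cell is only 2 of 96.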
KEY LEARNING: the elicit cell measures BASE-ELICITABILITY. That is a good gate
for "trivial mechanical" exploits (exit_code) but the WRONG gate for exploits the
base model resists and RL amplifies (run_tests). Whether the grader honors the
loophole is checked by verify_rewards (all 3 modes pass). Learnability is checked
in Phase 1 (RL emergence), NOT by base elicitation. So: take exit_code
(base-elicitable) and run_tests (RL-emergent, teacher-bootstrapped, pool already
exists) into Phase 2; drop eq_override.
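
The revised gating logic can be sketched as a small decision function. This is an illustration only: `ModeResult`, `phase2_decision`, and the 0.5 elicitation cutoff are hypothetical and not part of the spec, and `rl_emergent` here means "prior evidence of RL emergence already exists", which the text above gives only for run_tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModeResult:
    name: str
    elicit: float         # hack rate when the base model is nudged (fraction)
    grader_honors: bool   # verify_rewards passed for this mode
    rl_emergent: bool     # prior evidence that the hack emerges under RL


# Assumed cutoff for "base-elicitable"; the spec does not state a threshold.
ELICIT_THRESHOLD = 0.5


def phase2_decision(r: ModeResult, threshold: float = ELICIT_THRESHOLD) -> str:
    """Keep a mode if the grader honors it AND it is reachable either way."""
    if not r.grader_honors:
        return "DROP"                      # no loophole to exploit
    if r.elicit >= threshold:
        return "KEEP (base-elicitable)"    # e.g. exit_code
    if r.rl_emergent:
        return "KEEP (RL-emergent)"        # e.g. run_tests
    return "DROP"                          # e.g. eq_override


results = [
    ModeResult("run_tests", 0.021, True, True),
    ModeResult("eq_override", 0.000, True, False),
    ModeResult("exit_code", 0.979, True, False),
]

decisions = {r.name: phase2_decision(r) for r in results}
for name, decision in decisions.items():
    print(f"{name}: {decision}")
```

The point of the sketch is that base elicitation is one sufficient route to Phase 2, not the only gate: a mode that fails it can still be kept when RL emergence has been shown separately.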
## Generalization test (unchanged in spirit, now faithful)
Train on subset with mode A, build a weak detector for A, extract v_hack from A,