journal: pairs comparison (job 17) -- authored_all precision-best (p@10 0.70 vs prog_wide 0.20)

wassname
2026-06-08 19:44:00 +00:00
parent 744d851861
commit 1cb5900de5
+44
@@ -3840,3 +3840,47 @@ Follows the [job-15-queued entry above]. Vanilla baseline (job 16) still queued.
Job 17 (pairs separability) running; job 18 (act_vote) then job 16 (vanilla) behind. The load-bearing
read is job 16: 0.076 << vanilla => real suppression; 0.076 ~ vanilla => vacuous.
## 2026-06-08 19:45 -- pairs comparison (job 17): authored_all IS the precision-best pair-set
**Context:** `scripts/diag_pairs_compare.py` on the job-9 first_hack ckpt, 140 live rollouts
(base rate 0.25), grad-cosine gate, sweeping the PAIR-SET axis. The comparison I should have
run before job 15. Table: `out/diag/pairs_compare.csv`.
### Observations
| pair-set (n pairs) | AUROC | p@10 | p@20 |
|---|---|---|---|
| authored_all (18) | 0.560 | **0.70** | 0.40 |
| heldout_known_rt (5) | 0.711 | 0.60 | 0.45 |
| authored_allv2 (24) | 0.523 | 0.50 | 0.40 |
| prog_wider (94) | 0.514 | 0.40 | 0.30 |
| authored_runtests (8) | 0.412 | 0.30 | 0.25 |
| authored_v2 (6) | 0.393 | 0.30 | 0.20 |
| funcname (6) | 0.553 | 0.20 | 0.25 |
| prog_wide (30) | 0.436 | 0.20 | 0.15 |
| prog_widest (154) | 0.440 | 0.10 | 0.15 |
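
To make the table easier to read against the 0.25 base rate, here is a minimal sketch that converts each p@10 into a hit count and a lift over base rate. The rows are copied from the table above; the names `ROWS`, `hits`, and `lift` are illustrative and are not taken from `scripts/diag_pairs_compare.py`.

```python
# Rows copied from the Observations table: (pair-set, n_pairs, AUROC, p@10, p@20).
ROWS = [
    ("authored_all", 18, 0.560, 0.70, 0.40),
    ("heldout_known_rt", 5, 0.711, 0.60, 0.45),
    ("authored_allv2", 24, 0.523, 0.50, 0.40),
    ("prog_wider", 94, 0.514, 0.40, 0.30),
    ("authored_runtests", 8, 0.412, 0.30, 0.25),
    ("authored_v2", 6, 0.393, 0.30, 0.20),
    ("funcname", 6, 0.553, 0.20, 0.25),
    ("prog_wide", 30, 0.436, 0.20, 0.15),
    ("prog_widest", 154, 0.440, 0.10, 0.15),
]
BASE_RATE = 0.25  # stated in the Context paragraph


def hits(p_at_k: float, k: int) -> int:
    """Number of positives in the top k implied by a precision@k value."""
    return round(p_at_k * k)


def lift(p_at_k: float, base_rate: float = BASE_RATE) -> float:
    """Precision relative to picking rollouts at random."""
    return p_at_k / base_rate


for name, n_pairs, auc, p10, p20 in ROWS:
    print(f"{name:18s} n={n_pairs:3d} hits@10={hits(p10, 10):2d} lift@10={lift(p10):.1f}x")

# Pair-sets whose top 10 is no better than random selection.
below_base = [name for name, _, _, p10, _ in ROWS if lift(p10) < 1.0]
```

Reading p@10 as a hit count makes the granularity explicit: each step of 0.10 is a single rollout.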
### Inferences
- [inf] authored_all is the precision-best pair-set; nothing beats its p@10=0.70. The job-15 switch
prog_wide -> authored was justified: authored p@10 0.70 vs prog_wide 0.20 (job 8's pairs sat near the
bottom). {reason: direct p@10 on common rollouts/metric; credence 0.85, tempered by p@10 being n=10}.
- [inf] AUROC and p@10 rank-disagree (heldout_known_rt tops AUROC 0.711 but authored_all tops p@10).
Confirms the earlier methodological point: for routing, select on the precision tail, not AUROC.
{credence 0.8}.
- [inf] The widest pool has the worst precision (prog_wide 0.20 -> prog_widest 0.10 as n 30->154), but
the trend is not monotone: prog_wider (n=94) scores 0.40. Diluting with broader synthetic pool pairs
may hurt the confident tail. {reason: endpoints of the prog_* rows only; credence 0.6}.
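
The AUROC vs p@k disagreement in the second inference can be reproduced on toy data. The sketch below uses the textbook definitions of both metrics; it is not the implementation in `scripts/diag_pairs_compare.py`, and the scores and labels are invented for illustration.

```python
from typing import Sequence


def precision_at_k(scores: Sequence[float], labels: Sequence[int], k: int) -> float:
    """Fraction of positives among the k highest-scoring items."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:k]) / k


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """P(random positive outscores random negative); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (not from the journal): 3 positives, 4 negatives.
labels = [1, 1, 1, 0, 0, 0, 0]
sharp_tail = [0.9, 0.8, 0.1, 0.7, 0.6, 0.5, 0.4]  # clean top 2, one positive ranked last
good_bulk = [0.8, 0.7, 0.6, 0.9, 0.3, 0.2, 0.1]   # a negative on top, good order below it

for name, scores in [("sharp_tail", sharp_tail), ("good_bulk", good_bulk)]:
    print(f"{name}: p@2={precision_at_k(scores, labels, 2):.2f} "
          f"AUROC={auroc(scores, labels):.3f}")
```

In this toy case `sharp_tail` wins on p@2 while `good_bulk` wins on AUROC, the same pattern as authored_all vs heldout_known_rt. A gate that only acts on its most confident items is judged by the first metric.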
### Failure modes considered
- **Most-likely:** p@10 is the top 10 of 140 rollouts -- authored_all 0.70 vs heldout_known_rt 0.60 is
1 rollout. The authored_all > prog_wide gap is large (0.70 vs 0.20, i.e. 5 rollouts) but fine rankings
are noisy. Prior 0.5.
- **Subtle:** this ranks SEPARABILITY on cached job-9 rollouts, not deploy suppression; a better-
separating pair-set need not suppress better (absorption may flatten it). Prior 0.4. Check: deploy.
- **Null:** none of it transfers -- all pair-sets deploy ~base rate. Prior 0.2. Check: job 16 vanilla.
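A rough way to size the n=10 noise in the first failure mode is to ask how often a gate that ranks at random would put 7 or more positives in its top 10. The sketch below uses a hypergeometric tail and assumes 35 positives among the 140 rollouts (0.25 x 140; the stated base rate may be rounded). It ignores that the best of 9 pair-sets was selected, so it understates the chance of seeing 0.70 somewhere in the table. The function name is illustrative.

```python
from math import comb


def hypergeom_tail(population: int, positives: int, draws: int, min_hits: int) -> float:
    """P(at least min_hits positives in `draws` items sampled without replacement)."""
    total = comb(population, draws)
    max_hits = min(draws, positives)
    favourable = sum(
        comb(positives, x) * comb(population - positives, draws - x)
        for x in range(min_hits, max_hits + 1)
    )
    return favourable / total


N_ROLLOUTS = 140
N_POSITIVE = 35  # assumption: base rate 0.25 applied to 140 rollouts
TOP_K = 10

p_ge_7 = hypergeom_tail(N_ROLLOUTS, N_POSITIVE, TOP_K, 7)  # authored_all, p@10 = 0.70
p_ge_6 = hypergeom_tail(N_ROLLOUTS, N_POSITIVE, TOP_K, 6)  # heldout_known_rt, p@10 = 0.60
print(f"P(hits >= 7 | random gate) = {p_ge_7:.4f}")
print(f"P(hits >= 6 | random gate) = {p_ge_6:.4f}")
```

Under these assumptions the tail for 7 hits is below 0.01, so 0.70 is unlikely from a random gate, while the one-rollout gap between 0.70 and 0.60 stays well inside the noise.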
### Next action
No pair-set beats authored_all => no new pre-vanilla run (user's "if one beats authored, run it first"
condition not met). Queue proceeds: job 18 (act_vote) running, job 16 (vanilla) behind.