journal: pairs comparison (job 17) -- authored_all precision-best (p@10 0.70 vs prog_wide 0.20)

2026-06-27 16:45:42 +08:00 · 2026-06-08 19:44:00 +00:00
parent 744d851861
commit 1cb5900de5
1 changed files with 44 additions and 0 deletions
@@ -3840,3 +3840,47 @@ Follows the [job-15-queued entry above]. Vanilla baseline (job 16) still queued.

 Job 17 (pairs separability) running; job 18 (act_vote) then job 16 (vanilla) behind. The load-bearing
 read is job 16: 0.076 << vanilla => real suppression; 0.076 ~ vanilla => vacuous.
+
+## 2026-06-08 19:45 -- pairs comparison (job 17): authored_all IS the precision-best pair-set
+
+**Context:** `scripts/diag_pairs_compare.py` on the job-9 first_hack ckpt, 140 live rollouts
+(base rate 0.25), grad-cosine gate, sweeping the PAIR-SET axis. The comparison I should have
+run before job 15. Table: `out/diag/pairs_compare.csv`.
+
+### Observations
+
+| pair-set (n pairs) | AUROC | p@10 | p@20 |
+|---|---|---|---|
+| authored_all (18) | 0.560 | **0.70** | 0.40 |
+| heldout_known_rt (5) | 0.711 | 0.60 | 0.45 |
+| authored_allv2 (24) | 0.523 | 0.50 | 0.40 |
+| prog_wider (94) | 0.514 | 0.40 | 0.30 |
+| authored_runtests (8) | 0.412 | 0.30 | 0.25 |
+| authored_v2 (6) | 0.393 | 0.30 | 0.20 |
+| funcname (6) | 0.553 | 0.20 | 0.25 |
+| prog_wide (30) | 0.436 | 0.20 | 0.15 |
+| prog_widest (154) | 0.440 | 0.10 | 0.15 |
+
+### Inferences
+
+- [inf] authored_all is precision-best at p@10 (0.70); at p@20 heldout_known_rt leads (0.45 vs 0.40).
+  The job-15 switch prog_wide -> authored holds up: p@10 0.70 vs 0.20 (job 8's pairs sat near the
+  bottom). {reason: direct p@10 on common rollouts/metric; credence 0.85, tempered by p@10 being n=10}.
+- [inf] AUROC and p@10 rank-disagree (heldout_known_rt tops AUROC 0.711 but authored_all tops p@10).
+  Confirms the earlier methodological point: for routing, select on the precision tail, not AUROC.
+  {credence 0.8}.
+- [inf] Widest pool = worst precision (prog_wide 0.20 -> prog_widest 0.10 as n 30->154); dilution with
+  synthetic pairs may hurt the confident tail. {reason: endpoints only, prog_wider (94) = 0.40; credence 0.6}.
+
+### Failure modes considered
+
+- **Most-likely:** p@10 is n=10/140 -- authored_all 0.70 vs heldout_known_rt 0.60 is 1 rollout. The
+  authored_all vs prog_wide gap (0.70 vs 0.20, 5 rollouts) is robust but fine rankings are noisy. Prior 0.5.
+- **Subtle:** this ranks SEPARABILITY on cached job-9 rollouts, not deploy suppression; a better-
+  separating pair-set need not suppress better (absorption may flatten it). Prior 0.4. Check: deploy.
+- **Null:** none of it transfers -- all pair-sets deploy ~base rate. Prior 0.2. Check: job 16 vanilla.
+
+### Next action
+
+No pair-set beats authored_all => no new pre-vanilla run (user's "if one beats authored, run it first"
+condition not met). Queue proceeds: job 18 (act_vote) running, job 16 (vanilla) behind.
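
The two metrics in the table, and the n=10 noise flagged under "Failure modes considered", can be sketched as below. This is a minimal illustration and not the code in `scripts/diag_pairs_compare.py`: the function names are hypothetical, the eight-item toy data is invented to show how AUROC and p@k can rank-disagree, and the Wilson interval is a swapped-in standard choice that treats the top-k hits as independent draws, which the rollouts here may not be.

```python
import math

def precision_at_k(scores, labels, k):
    """Fraction of positives among the k highest-scoring items."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:k]) / k

def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion hits/n (z=1.96 is roughly 95%)."""
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Toy data (invented): items 0-3 are positives, items 4-7 are negatives.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores_a = [8, 7, 2, 1, 6, 5, 4, 3]  # clean top-2, but two positives ranked last
scores_b = [7, 6, 5, 4, 8, 3, 2, 1]  # good overall ordering, one negative on top

for name, s in [("a", scores_a), ("b", scores_b)]:
    print(name, "AUROC", auroc(s, labels), "p@2", precision_at_k(s, labels, 2))

# Uncertainty on p@10 values like those in the table (7, 6, and 2 hits out of 10).
for hits in (7, 6, 2):
    lo, hi = wilson_interval(hits, 10)
    print(f"{hits}/10: [{lo:.2f}, {hi:.2f}]")
```

In the toy data, scorer `a` wins on p@2 while scorer `b` wins on AUROC, the same kind of disagreement as authored_all vs heldout_known_rt. At k=10 the interval half-width is roughly a quarter of the scale, so a one-rollout difference sits well inside the noise.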