From 1cb5900de5c5205db820748d8e13889323838557 Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Mon, 8 Jun 2026 19:44:00 +0000
Subject: [PATCH] journal: pairs comparison (job 17) -- authored_all precision-best (p@10 0.70 vs prog_wide 0.20)

---
 RESEARCH_JOURNAL.md | 44 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index 6f65aca..57dedf6 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -3840,3 +3840,47 @@ Follows the [job-15-queued entry above]. Vanilla baseline (job 16) still queued.
 Job 17 (pairs separability) running; job 18 (act_vote) then job 16 (vanilla) behind.
 The load-bearing read is job 16: 0.076 << vanilla => real suppression;
 0.076 ~ vanilla => vacuous.
+
+## 2026-06-08 19:45 -- pairs comparison (job 17): authored_all IS the precision-best pair-set
+
+**Context:** `scripts/diag_pairs_compare.py` on the job-9 first_hack ckpt, 140 live rollouts
+(base rate 0.25), grad-cosine gate, sweeping the PAIR-SET axis. The comparison I should have
+run before job 15. Table: `out/diag/pairs_compare.csv`.
+
+### Observations
+
+| pairset (n) | AUROC | p@10 | p@20 |
+|---|---|---|---|
+| authored_all (18) | 0.560 | **0.70** | 0.40 |
+| heldout_known_rt (5) | 0.711 | 0.60 | 0.45 |
+| authored_allv2 (24) | 0.523 | 0.50 | 0.40 |
+| prog_wider (94) | 0.514 | 0.40 | 0.30 |
+| authored_runtests (8) | 0.412 | 0.30 | 0.25 |
+| authored_v2 (6) | 0.393 | 0.30 | 0.20 |
+| funcname (6) | 0.553 | 0.20 | 0.25 |
+| prog_wide (30) | 0.436 | 0.20 | 0.15 |
+| prog_widest (154) | 0.440 | 0.10 | 0.15 |
+
+### Inferences
+
+- [inf] authored_all is the precision-best pair-set; nothing beats its p@10=0.70. The job-15 switch
+  prog_wide -> authored was justified: authored p@10 0.70 vs prog_wide 0.20 (job 8's pairs sat near the
+  bottom). {reason: direct p@10 on common rollouts/metric; credence 0.85, tempered by p@10 being n=10}.
+- [inf] AUROC and p@10 rank-disagree (heldout_known_rt tops AUROC 0.711 but authored_all tops p@10).
+  Confirms the earlier methodological point: for routing, select on the precision tail, not AUROC.
+  {credence 0.8}.
+- [inf] More pool pairs = worse precision at the extreme (prog_wide 0.20 -> prog_widest 0.10 as n 30->154).
+  Dilution with broader synthetic pairs hurts the confident tail. {reason: endpoints only; prog_wider (94) at 0.40 is non-monotone; credence 0.6}.
+
+### Failure modes considered
+
+- **Most-likely:** p@10 is n=10/140 -- authored 0.70 vs heldout 0.60 is 1 rollout. The authored>others
+  ordering is robust (0.70 vs prog_wide 0.20) but fine rankings are noisy. Prior 0.5.
+- **Subtle:** this ranks SEPARABILITY on cached job-9 rollouts, not deploy suppression; a better-
+  separating pair-set need not suppress better (absorption may flatten it). Prior 0.4. Check: deploy.
+- **Null:** none of it transfers -- all pair-sets deploy ~base rate. Prior 0.2. Check: job 16 vanilla.
+
+### Next action
+
+No pair-set beats authored_all => no new pre-vanilla run (user's "if one beats authored, run it first"
+condition not met). Queue proceeds: job 18 (act_vote) running, job 16 (vanilla) behind.