mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
journal: pairs comparison (job 17) -- authored_all precision-best (p@10 0.70 vs prog_wide 0.20)
@@ -3840,3 +3840,47 @@ Follows the [job-15-queued entry above]. Vanilla baseline (job 16) still queued.
Job 17 (pairs separability) running; job 18 (act_vote) then job 16 (vanilla) behind. The load-bearing
read is job 16: 0.076 << vanilla => real suppression; 0.076 ~ vanilla => vacuous.
## 2026-06-08 19:45 -- pairs comparison (job 17): authored_all IS the precision-best pair-set

**Context:** `scripts/diag_pairs_compare.py` on the job-9 first_hack ckpt, 140 live rollouts
(base rate 0.25), grad-cosine gate, sweeping the PAIR-SET axis. The comparison I should have
run before job 15. Table: `out/diag/pairs_compare.csv`.
### Observations

| pairset (n) | AUROC | p@10 | p@20 |
|---|---|---|---|
| authored_all (18) | 0.560 | **0.70** | 0.40 |
| heldout_known_rt (5) | 0.711 | 0.60 | 0.45 |
| authored_allv2 (24) | 0.523 | 0.50 | 0.40 |
| prog_wider (94) | 0.514 | 0.40 | 0.30 |
| authored_runtests (8) | 0.412 | 0.30 | 0.25 |
| authored_v2 (6) | 0.393 | 0.30 | 0.20 |
| funcname (6) | 0.553 | 0.20 | 0.25 |
| prog_wide (30) | 0.436 | 0.20 | 0.15 |
| prog_widest (154) | 0.440 | 0.10 | 0.15 |
### Inferences

- [inf] authored_all is the precision-best pair-set; nothing beats its p@10=0.70. The job-15 switch
prog_wide -> authored was justified: authored p@10 0.70 vs prog_wide 0.20 (job 8's pairs sat near the
bottom). {reason: direct p@10 on common rollouts/metric; credence 0.85, tempered by p@10 being n=10}.
- [inf] AUROC and p@10 rank-disagree (heldout_known_rt tops AUROC 0.711 but authored_all tops p@10).
Confirms the earlier methodological point: for routing, select on the precision tail, not AUROC.
{credence 0.8}.
- [inf] More pool pairs does not help precision and hurts at the extreme (prog_wide 0.20 -> prog_widest 0.10
as n 30->154), though not monotonically: prog_wider (n=94) sits at 0.40. Diluting with broader synthetic
pool pairs appears to hurt the confident tail. {reason: endpoints in the table; credence 0.6}.
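
A minimal sketch of why AUROC and precision@k can rank two scorers in opposite order. This is not the code in `scripts/diag_pairs_compare.py`; the helper names `precision_at_k` and `auroc` and the toy scores are made up for illustration. Scorer `a` puts two positives at the very top and buries the rest; scorer `b` ranks every positive above most negatives but never at the top.

```python
def precision_at_k(scores, labels, k):
    """Fraction of positives among the k highest-scoring items."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:k]) / k


def auroc(scores, labels):
    """P(random positive outscores random negative); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only, not from the journal): 4 positives, 8 negatives.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores_a = [12, 11, 2, 1, 10, 9, 8, 7, 6, 5, 4, 3]   # sharp top, poor bulk
scores_b = [10, 9, 8, 7, 12, 11, 6, 5, 4, 3, 2, 1]   # good bulk, wrong top

for name, scores in [("a", scores_a), ("b", scores_b)]:
    print(name, auroc(scores, labels), precision_at_k(scores, labels, 2))
# a 0.5 1.0
# b 0.75 0.0
```

If routing only acts on the highest-scoring rollouts, scorer `a` is the better gate despite its lower AUROC, which is the same shape as heldout_known_rt vs authored_all in the table.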
### Failure modes considered

- **Most-likely:** p@10 is n=10/140 -- authored 0.70 vs heldout 0.60 is 1 rollout. The authored>others
ordering is robust (0.70 vs prog_wide 0.20) but fine rankings are noisy. Prior 0.5.
- **Subtle:** this ranks SEPARABILITY on cached job-9 rollouts, not deploy suppression; a
better-separating pair-set need not suppress better (absorption may flatten it). Prior 0.4. Check: deploy.
- **Null:** none of it transfers -- all pair-sets deploy ~base rate. Prior 0.2. Check: job 16 vanilla.
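
A rough way to size the n=10 noise in the first bullet, sketched under two assumptions not stated in the journal: that base rate 0.25 on 140 rollouts means exactly 35 hack rollouts, and that under a "gate is useless" null the top-10 is a random draw without replacement. The helpers `hypergeom_tail` and `binom_se` are hypothetical names, not part of the repo.

```python
from math import comb, sqrt


def hypergeom_tail(n_total, n_pos, k, hits):
    """P(at least `hits` positives among k items drawn without replacement)."""
    upper = min(k, n_pos)
    ways = sum(comb(n_pos, h) * comb(n_total - n_pos, k - h) for h in range(hits, upper + 1))
    return ways / comb(n_total, k)


def binom_se(p, k):
    """Binomial standard error of a proportion p estimated from k items."""
    return sqrt(p * (1 - p) / k)


n_total, k = 140, 10
n_pos = round(0.25 * n_total)  # assumption: 35 hack rollouts among the 140

# Chance of a random top-10 matching each observed p@10 or better.
p_authored = hypergeom_tail(n_total, n_pos, k, 7)   # authored_all: 7 of 10
p_prog_wide = hypergeom_tail(n_total, n_pos, k, 2)  # prog_wide: 2 of 10

print(f"P(>=7/10 by chance) = {p_authored:.4f}")
print(f"P(>=2/10 by chance) = {p_prog_wide:.4f}")
print(f"SE of p@10 at 0.70  = {binom_se(0.70, k):.3f}")
```

Under these assumptions a 7-of-10 top is unlikely from a useless gate while 2-of-10 is unremarkable, and the standard error at 0.70 is larger than the 0.10 step between authored_all and heldout_known_rt, consistent with treating only the coarse ordering as reliable.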
### Next action

No pair-set beats authored_all => no new pre-vanilla run (user's "if one beats authored, run it first"
condition not met). Queue proceeds: job 18 (act_vote) running, job 16 (vanilla) behind.