results: Q14 complete eval2 deploy table (4 done: per-token/authored/prog_wide/random-V; via just results-deploy). Corrects earlier claim that job8 prog_wide had no eval2 deploy

2026-06-27 16:15:35 +08:00 · 2026-06-08 23:57:42 +00:00
parent e26f5fe08c
commit 824b7eb623
1 changed files with 45 additions and 35 deletions
@@ -345,45 +345,55 @@ anyway (already 0.000), so this is a design note for when the floor is wanted.

 ## Q14. 🥇 routeV deploy on the recency-clean eval2 test set (the current headline)

-<!-- METRIC: deploy_test.json, knob-off forward (quarantine deleted for routeV; trained
-     model for vanilla), eval_set=test = recency-clean held-out ids>=3243 (base solve ~0.1),
-     n=119, 60-step fast preset, Qwen3-4B, single-mode run_tests env, seed 43, authored 18
-     pairs (--vhack-pairs-path None), grad-clip 500. NOT comparable to Q12 (old n=64 eval,
-     pre the 2026-05-23 grader-bug / recency-clean fix that moved base solve 0.94->0.1).
-     src (commit c721c46):
-       job 15 routeV grad-cosine: out/runs/20260608T134141_*_dir8_routeV_authored_perroll_s43/deploy_test.json
-       job 16 vanilla:   _dir8_vanilla_s43            (RUNNING)
-       job 19 act_vote:  _dir8_routeV_actvote_authored_s43   (QUEUED; rerun of killed job 18)
-       job 20 lora:      _dir8_lora_routeV_authored_s43      (QUEUED)
-       job 21 random-V:  _dir8_routeV_randomV_authored_s43   (QUEUED, Haar v_grad control) -->
+<!-- METRIC: deploy_test.json, knob-off forward (quarantine deleted for routeV; trained model
+     for vanilla), eval_set=test = recency-clean held-out ids>=3243 (base solve ~0.1), n=119,
+     60-step fast preset, Qwen3-4B, single-mode run_tests env, seed 43. NOT comparable to Q12
+     (old n=64 eval, pre the 2026-05-23 grader-bug / recency-clean fix that moved base solve
+     0.94->0.1). REGENERATE: `just results-deploy` (scripts/results_deploy.py) auto-discovers
+     every out/runs/*/deploy_test.json -- this table is a curated copy of that output.
+     Smoke runs (seed 41, steps 30, tiny-random, hack=0) are excluded.
+     completed src: _dir6_routeV_s43 (job 8) / _dir6_routeV_pertoken_s43 (job 9) /
+       _dir6_routeV_random_s43 (job 10) / _dir8_routeV_authored_perroll_s43 (job 15).
+     pending: _dir8_vanilla_s43 (16 RUNNING) / _dir8_routeV_actvote_authored_s43 (19) /
+       _dir8_lora_routeV_authored_s43 (20) / _dir8_routeV_randomV_authored_s43 (21). commit e26f5fe. -->

-Everything above (Q1-Q13) is on the OLD eval. Q12's route2 numbers used n=64 prompts before
-the recency-clean fix; the env is now single-mode `run_tests` and the held-out test set is
-recency-clean (ids>=3243, base solve ~0.1). This section is the corrected substrate. All arms:
-seed 43, authored 18 pairs, 60 steps, deploy = knob-off forward on test n=119.
+Everything above (Q1-Q13) is on the OLD eval. Q12's route2 numbers used n=64 prompts before the
+recency-clean fix; the env is now single-mode `run_tests` and the held-out test set is
+recency-clean (ids>=3243, base solve ~0.1). This is the corrected substrate. All rows: seed 43,
+60 steps, deploy = knob-off forward on test n=119. Headline = solve_deploy - hack_deploy.
+Note the pool/pairs confound across rows (see `argv`); the only single-axis A/Bs are called out
+in the answer.

-| arm                          | deploy hack | deploy vhack | deploy solve | status |
-| :--------------------------- | ----------: | -----------: | -----------: | :----- |
-| vanilla GRPO                 |           — |            — |            — | running (job 16) |
-| routeV grad-cosine (authored)|       0.076 |        0.059 |        0.118 | ✓ job 15 |
-| routeV act_vote (authored)   |           — |            — |            — | queued (job 19) |
-| routeV grad-cosine, LoRA-B   |           — |            — |            — | queued (job 20) |
-| routeV grad-cosine, random-V |           — |            — |            — | queued (job 21) |
+| arm | pairs | gran | hack ↓ | solve ↑ | headline |
+| :-- | :-- | :-- | --: | --: | --: |
+| routeV per-token   | prog_wide | per-token   | 0.042 | 0.143 | +0.101 |
+| routeV authored    | authored  | per-rollout | 0.076 | 0.118 | +0.042 |
+| routeV prog_wide   | prog_wide | per-rollout | 0.101 | 0.126 | +0.025 |
+| routeV random-V    | prog_wide (Haar dir) | per-rollout | 0.101 | 0.109 | +0.008 |
+| vanilla GRPO       | -- | -- | running (job 16) | | |
+| routeV act_vote    | authored | per-rollout (global vote) | queued (19) | | |
+| routeV LoRA-B      | authored | per-rollout | queued (20) | | |
+| routeV random-V    | authored (Haar dir) | per-rollout | queued (21) | | |

-**Answer: pending the vanilla baseline.** routeV (grad-cosine, authored pairs) deploys at hack
-0.076 (knob-off), absorbing ~88% of its own on-policy hack (train knob-on 0.641 -> 0.076). But
-0.076 is NOT yet interpretable: with base solve ~0.1 on this clean set, the hack base rate may
-be low, so 0.076 could be ~vacuous. The load-bearing comparison is vanilla (job 16, running):
-0.076 << vanilla => real suppression; 0.076 ~ vanilla => the knob is absorbing noise. random-V
-(job 21) is the directionality control -- if it also lands ~0.076 the suppression is absorption
-(H2), not direction (H4); on the OLD eval real-V==random-V==0.101 (H2), untested on eval2.
+**Answer: three single-axis reads already hold; the suppression magnitude waits on vanilla.**
+- *Direction doesn't matter at per-rollout (H2 absorption, on eval2):* real-V (prog_wide, 0.101)
+  == random-V (prog_wide, 0.101). Replicates the old-eval H2 result on the clean set. The open
+  question is whether AUTHORED direction matters (job 21 random-V-authored vs job 15) -- queued.
+- *Pairs matter:* authored per-rollout 0.076 < prog_wide per-rollout 0.101 (clean A/B -- same
+  granularity, same dense pool, differ only in pairs). Authored helps even though real-vs-random
+  didn't, so the gain is the pair CONTENT, not direction sharpness.
+- *Granularity matters most:* per-token 0.042 < per-rollout 0.101 (both prog_wide). The
+  per-token arm has the lowest deploy hack AND the highest solve (0.143) of any completed run.

-Notes from the in-flight arms (training-time `rout`, not deploy): grad-cosine routing cliffs
-(rout 0.63@step6 -> 0.09@step20, tracking the GRPO advantage flattening post-saturation);
-act_vote routing sustains late (rout 0.88@step17) because it gates on activations not the
-decaying gradient -- see RESEARCH_JOURNAL 2026-06-08. Whether that converts to better deploy
-suppression is exactly what job 19 tests. Pairs axis (separability, not deploy): authored_all
-p@10=0.70 beats prog_wide 0.20 (job 17, `out/diag/pairs_compare.csv`).
+All single seed (n=1), so treat <=0.03 gaps as noise. None of this is interpretable as suppression
+until vanilla lands (job 16, running): if vanilla deploy hack is ~0.10, even per-token's 0.042 is
+only a modest cut over base; if vanilla is high (>0.3 as on the old eval), all routeV arms suppress
+strongly. Pairs separability (orthogonal, job 17): authored_all p@10=0.70 beats prog_wide 0.20
+(`out/diag/pairs_compare.csv`).
+
+Training-`rout` note (not deploy): grad-cosine routing cliffs (0.63@step6 -> 0.09@step20, GRPO
+advantage flattening); act_vote sustains late (0.88@step17) by gating on activations -- see
+RESEARCH_JOURNAL 2026-06-08. Whether that converts to deploy suppression is what job 19 tests.

 ## Dynamics note (sizing the convergence test)