results: add Q14 -- routeV deploy on recency-clean eval2 (job 15 in; vanilla/act_vote/lora/random-V pending)

2026-06-27 17:15:58 +08:00 · 2026-06-08 22:58:34 +00:00
parent c721c460a4
commit e26f5fe08c
1 changed files with 42 additions and 0 deletions
@@ -343,6 +343,48 @@ run (job 73) was daemon-killed at step 28/60, but deploy hack was pinned at
 a clean 60-step rerun would make it airtight. The headline arm uses no floor
 anyway (already 0.000), so this is a design note for when the floor is wanted.

+## Q14. 🥇 routeV deploy on the recency-clean eval2 test set (the current headline)
+
+<!-- METRIC: deploy_test.json, knob-off forward (quarantine deleted for routeV; trained
+     model for vanilla), eval_set=test = recency-clean held-out ids>=3243 (base solve ~0.1),
+     n=119, 60-step fast preset, Qwen3-4B, single-mode run_tests env, seed 43, authored 18
+     pairs (--vhack-pairs-path None), grad-clip 500. NOT comparable to Q12 (old n=64 eval,
+     predating the 2026-05-23 grader-bug / recency-clean fix that moved base solve 0.94->0.1).
+     src (commit c721c46):
+       job 15 routeV grad-cosine: out/runs/20260608T134141_*_dir8_routeV_authored_perroll_s43/deploy_test.json
+       job 16 vanilla:   _dir8_vanilla_s43            (RUNNING)
+       job 19 act_vote:  _dir8_routeV_actvote_authored_s43   (QUEUED; rerun of killed job 18)
+       job 20 lora:      _dir8_lora_routeV_authored_s43      (QUEUED)
+       job 21 random-V:  _dir8_routeV_randomV_authored_s43   (QUEUED, Haar v_grad control) -->
+
+Everything above (Q1-Q13) is on the OLD eval. Q12's route2 numbers used n=64 prompts before
+the recency-clean fix; the env is now single-mode `run_tests` and the held-out test set is
+recency-clean (ids>=3243, base solve ~0.1). This section is the corrected substrate. All arms:
+seed 43, authored 18 pairs, 60 steps, deploy = knob-off forward on test n=119.
+
+| arm                           | deploy hack | deploy vhack | deploy solve | status |
+| :---------------------------- | ----------: | -----------: | -----------: | :----- |
+| vanilla GRPO                  |           — |            — |            — | running (job 16) |
+| routeV grad-cosine (authored) |       0.076 |        0.059 |        0.118 | ✓ job 15 |
+| routeV act_vote (authored)    |           — |            — |            — | queued (job 19) |
+| routeV grad-cosine, LoRA-B    |           — |            — |            — | queued (job 20) |
+| routeV grad-cosine, random-V  |           — |            — |            — | queued (job 21) |
+
+**Answer: pending the vanilla baseline.** routeV (grad-cosine, authored pairs) deploys at hack
+0.076 (knob-off), absorbing ~88% of its own on-policy hack (train knob-on 0.641 -> deploy
+knob-off 0.076; 1 - 0.076/0.641 ~ 0.88). But 0.076 is NOT yet interpretable on its own: with
+base solve ~0.1 on this clean set, the hack base rate may also be low, in which case 0.076
+would be ~vacuous. The load-bearing comparison is vanilla (job 16, running):
+0.076 << vanilla => real suppression; 0.076 ~ vanilla => the knob is absorbing noise. random-V
+(job 21) is the directionality control: if it also lands at ~0.076, the suppression comes from
+absorption (H2), not direction (H4). On the OLD eval real-V == random-V == 0.101 (H2); this is
+untested on eval2.
+
+Notes from the in-flight arms (training-time `rout`, not deploy): grad-cosine routing drops off
+sharply (rout 0.63@step6 -> 0.09@step20, tracking the GRPO advantage flattening post-saturation);
+act_vote routing stays high late (rout 0.88@step17) because it gates on activations, not the
+decaying gradient -- see RESEARCH_JOURNAL 2026-06-08. Whether that converts to better deploy
+suppression is exactly what job 19 tests. Pairs axis (separability, not deploy): authored_all
+p@10=0.70 beats prog_wide 0.20 (job 17, `out/diag/pairs_compare.csv`).
+
 ## Dynamics note (sizing the convergence test)

 Per-step trajectories (mix=0.125 g8, seed 41): `hack_s` rises 0→~0.6-0.75 and