diff --git a/docs/results.md b/docs/results.md
index 41aa1a1..2192260 100644
--- a/docs/results.md
+++ b/docs/results.md
@@ -345,45 +345,55 @@ anyway (already 0.000), so this is a design note for when the floor is wanted.

 ## Q14. 🥇 routeV deploy on the recency-clean eval2 test set (the current headline)
-
+
-Everything above (Q1-Q13) is on the OLD eval. Q12's route2 numbers used n=64 prompts before
-the recency-clean fix; the env is now single-mode `run_tests` and the held-out test set is
-recency-clean (ids>=3243, base solve ~0.1). This section is the corrected substrate. All arms:
-seed 43, authored 18 pairs, 60 steps, deploy = knob-off forward on test n=119.
+Everything above (Q1-Q13) is on the OLD eval. Q12's route2 numbers used n=64 prompts before the
+recency-clean fix; the env is now single-mode `run_tests` and the held-out test set is
+recency-clean (ids>=3243, base solve ~0.1). This is the corrected substrate. All rows: seed 43,
+60 steps, deploy = knob-off forward on test n=119. Headline = solve_deploy - hack_deploy.
+Note the pool/pairs confound across rows (see `argv`); the only single-axis A/Bs are called out
+in the answer.

-| arm | deploy hack | deploy vhack | deploy solve | status |
-| :--------------------------- | ----------: | -----------: | -----------: | :----- |
-| vanilla GRPO | — | — | — | running (job 16) |
-| routeV grad-cosine (authored)| 0.076 | 0.059 | 0.118 | ✓ job 15 |
-| routeV act_vote (authored) | — | — | — | queued (job 19) |
-| routeV grad-cosine, LoRA-B | — | — | — | queued (job 20) |
-| routeV grad-cosine, random-V | — | — | — | queued (job 21) |
+| arm | pairs | gran | hack ↓ | solve ↑ | headline |
+| :-- | :-- | :-- | --: | --: | --: |
+| routeV per-token | prog_wide | per-token | 0.042 | 0.143 | +0.101 |
+| routeV authored | authored | per-rollout | 0.076 | 0.118 | +0.042 |
+| routeV prog_wide | prog_wide | per-rollout | 0.101 | 0.126 | +0.025 |
+| routeV random-V | prog_wide (Haar dir) | per-rollout | 0.101 | 0.109 | +0.008 |
+| vanilla GRPO | -- | -- | running (job 16) | | |
+| routeV act_vote | authored | per-rollout (global vote) | queued (19) | | |
+| routeV LoRA-B | authored | per-rollout | queued (20) | | |
+| routeV random-V | authored (Haar dir) | per-rollout | queued (21) | | |

-**Answer: pending the vanilla baseline.** routeV (grad-cosine, authored pairs) deploys at hack
-0.076 (knob-off), absorbing ~88% of its own on-policy hack (train knob-on 0.641 -> 0.076). But
-0.076 is NOT yet interpretable: with base solve ~0.1 on this clean set, the hack base rate may
-be low, so 0.076 could be ~vacuous. The load-bearing comparison is vanilla (job 16, running):
-0.076 << vanilla => real suppression; 0.076 ~ vanilla => the knob is absorbing noise. random-V
-(job 21) is the directionality control -- if it also lands ~0.076 the suppression is absorption
-(H2), not direction (H4); on the OLD eval real-V==random-V==0.101 (H2), untested on eval2.
+**Answer: three single-axis reads already hold; the suppression magnitude waits on vanilla.**
+- *Direction doesn't matter at per-rollout (H2 absorption, on eval2):* real-V (prog_wide, 0.101)
+  == random-V (prog_wide, 0.101). Replicates the old-eval H2 result on the clean set. The open
+  question is whether AUTHORED direction matters (job 21 random-V-authored vs job 15) -- queued.
+- *Pairs matter:* authored per-rollout 0.076 < prog_wide per-rollout 0.101 (clean A/B -- same
+  granularity, same dense pool, differ only in pairs). Authored helps even though real-vs-random
+  didn't, so the gain is the pair CONTENT, not direction sharpness.
+- *Granularity matters most:* per-token 0.042 < per-rollout 0.101 (both prog_wide). The
+  per-token ablation has the lowest deploy hack AND highest solve (0.143) of any completed run.

-Notes from the in-flight arms (training-time `rout`, not deploy): grad-cosine routing cliffs
-(rout 0.63@step6 -> 0.09@step20, tracking the GRPO advantage flattening post-saturation);
-act_vote routing sustains late (rout 0.88@step17) because it gates on activations not the
-decaying gradient -- see RESEARCH_JOURNAL 2026-06-08. Whether that converts to better deploy
-suppression is exactly what job 19 tests. Pairs axis (separability, not deploy): authored_all
-p@10=0.70 beats prog_wide 0.20 (job 17, `out/diag/pairs_compare.csv`).
+All single seed (n=1), so treat <=0.03 gaps as noise (the 0.025 pairs gap is inside that band).
+These are NOT yet interpretable as suppression until vanilla lands (job 16, running): if vanilla
+deploy hack is ~0.10 then even per-token's 0.042 is only a modest cut over base; if vanilla
+is high (>0.3 as on the old eval), all routeV arms suppress strongly. Pairs separability
+(orthogonal, job 17): authored_all p@10=0.70 beats prog_wide 0.20 (`out/diag/pairs_compare.csv`).
+
+Training-`rout` note (not deploy): grad-cosine routing cliffs (0.63@step6 -> 0.09@step20, GRPO
+advantage flattening); act_vote sustains late (0.88@step17) by gating on activations -- see
+RESEARCH_JOURNAL 2026-06-08. Whether that converts to deploy suppression is what job 19 tests.

 ## Dynamics note (sizing the convergence test)