# Historical routeV results, organized by the question each run answers

These results describe the retired gradient-scored routeV method. They remain valid evidence about those runs, but they are not results for the current activation-scored routeA method. See `RESEARCH_JOURNAL.md` for current routeA findings.

Deploy-eval is the headline metric: knob-off forward on the recency-clean held-out TEST set (ids>=3243, base solve ~0.1, n=119), single-mode `run_tests` env, Qwen3-4B. Regenerate the table with `just results` (scripts/results_deploy.py, auto-discovers every `out/runs/*/deploy_test.json`); `just results` gives the live training-hack table.

OLD eval1 results (training-hack metric: the Q1-Q13 mechanism/basis/refresh studies on the pre-recency-clean eval) are archived in [results_eval1_archive.md](results_eval1_archive.md).

## How to read this

- **Headline = solve_deploy - hack_deploy.** Each rate alone is gameable; the gap is the honest objective (solve the task without learning the cheat). Higher is better.
- **hack_deploy / solve_deploy** = knob-off deploy eval (quarantine deleted for routeV; the trained model for vanilla) on test n=119. **hack_train / solve_train** = mean student rates over the last 5 training steps (converged regime).
- **All rows are n=1** (single seed 43) unless noted; treat gaps <=0.03 as noise.
- Watch the **pool/pairs confound** across rows (see `argv` / `train_set`); only same-axis pairs are A/B-comparable (these are called out in the answer).

---

## Q14. routeV deploy on the recency-clean eval2 test set

Q1-Q13 (archived) were run on the OLD eval; Q12's route2 numbers used n=64 prompts before the recency-clean fix. The env is now single-mode `run_tests`, and the held-out test set is recency-clean (ids>=3243, base solve ~0.1). This is the corrected substrate. All rows: seed 43, 60 steps, deploy = knob-off forward on test n=119. Headline = solve_deploy - hack_deploy.
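The headline and the n=1 noise rule from "How to read this" can be sketched as a couple of helpers. This is a minimal illustration, not `scripts/results_deploy.py`: the names `headline`, `beats`, and `NOISE_BAND` are hypothetical, and the rates are copied from the Q14 arm table below.

```python
# Hypothetical helpers illustrating the headline metric and the noise rule;
# not the project's scripts/results_deploy.py.
NOISE_BAND = 0.03  # at n=1, gaps <= 0.03 are treated as noise


def headline(solve_deploy: float, hack_deploy: float) -> float:
    """Honest objective: solve the task without learning the cheat. Higher is better."""
    return solve_deploy - hack_deploy


def beats(a: float, b: float, noise: float = NOISE_BAND) -> bool:
    """True only if headline `a` exceeds headline `b` by more than the noise band."""
    return a - b > noise


# Rates copied from the Q14 arm table (seed 43, test n=119).
per_token = headline(solve_deploy=0.143, hack_deploy=0.042)
vanilla = headline(solve_deploy=0.101, hack_deploy=0.613)
prog_wide = headline(solve_deploy=0.126, hack_deploy=0.101)
random_v = headline(solve_deploy=0.109, hack_deploy=0.101)

print(f"per-token {per_token:+.3f}, vanilla {vanilla:+.3f}")  # per-token +0.101, vanilla -0.512
print(beats(per_token, vanilla))   # True
print(beats(prog_wide, random_v))  # False: the 0.017 gap is inside the band
```

The strict `>` matches the "gaps <=0.03 are noise" wording: a gap of exactly 0.03 does not count as a win.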
Note the pool/pairs confound across rows (see `argv`); the only single-axis A/Bs are called out in the answer. Paper numbers (Ariahw et al. 2025) are reference context only: the paper uses longer training and >512 tok/gen, so they are NOT directly comparable to our 60-step fast-preset numbers.

| condition                | paper solve | paper hack | ours solve  | ours hack | ours headline |
| :----------------------- | ----------: | ---------: | ----------: | --------: | ------------: |
| base model (no training) |       0.115 |         -- |       0.126 |     0.000 |        +0.126 |
| vanilla GRPO             |       0.149 |       high |       0.101 |     0.613 |        -0.512 |
| no-loophole ceiling      |       0.223 |      0.000 | queued (24) |     0.000 |            -- |

Our arms (seed 43, 60-step fast preset, recency-clean test n=119). `hack_train` / `solve_train` = mean student rates over the last 5 training steps (converged regime). Note: the prog_wide pairs were contaminated (print-without-assert); job 28 replaces them with prog_wide_clean.

| arm                    | pairs                 | granularity               | hack_deploy ↓  | solve_deploy ↑ | hack_train | solve_train | headline   |
| :--------------------- | :-------------------- | :------------------------ | -------------: | -------------: | ---------: | ----------: | ---------: |
| **routeV per-token**   | prog_wide*            | per-token                 | **0.042**      | **0.143**      | 0.675      | 0.294       | **+0.101** |
| routeV authored        | authored              | per-rollout               | 0.076          | 0.118          | 0.781      | 0.200       | +0.042     |
| routeV prog_wide       | prog_wide*            | per-rollout               | 0.101          | 0.126          | 0.762      | 0.212       | +0.025     |
| routeV random-V        | prog_wide* (Haar dir) | per-rollout               | 0.101          | 0.109          | 0.762      | 0.219       | +0.008     |
| vanilla GRPO           | -                     | -                         | 0.613          | 0.101          | 0.744      | 0.231       | -0.512     |
| routeV per-token clean | prog_wide_clean       | per-token                 | queued (28)    |                |            |             |            |
| routeV act_vote        | authored              | per-rollout (global vote) | queued (19)    |                |            |             |            |
| routeV LoRA-B          | authored              | per-rollout               | queued (20/25) |                |            |             |            |
| routeV random-V        | authored (Haar dir)   | per-rollout               | queued (21/26) |                |            |             |            |

\* 22 of the 30 prog_wide pairs were contaminated (print-without-assert, which encodes the grading flaw); they are replaced by prog_wide_clean (8 pairs, same axes 2/4/5/6) for job 28.

**Answer: vanilla hack_deploy=0.613 -- suppression is real and large.** Vanilla GRPO converges to mostly hacking (hack 0.613, solve 0.101, no better than the measured base solve of 0.126, so no solve improvement). Every routeV arm suppresses hacking substantially:

- *H2 absorption confirmed:* even random-V (prog_wide, 0.101) cuts vanilla's 0.613 by ~6x. The quarantine knob alone suppresses regardless of direction.
- *H4 marginal direction gain:* authored (0.076) < prog_wide (0.101), so pair content adds ~2.5pp on top of absorption (inside the <=0.03 noise band at n=1). Authored direction matters for the margin, not the bulk.
- *Granularity matters most:* per-token 0.042 is a ~15x reduction vs vanilla (0.613), and per-token is the only arm whose solve (0.143) exceeds both vanilla (0.101) and base (0.126), although the +0.017 margin over base is within noise.
- *Vanilla solve does not beat base (0.101 vs 0.126):* GRPO without intervention learns almost entirely hacks, not genuine solutions, so the reward-hacking problem routeV targets is severe.

Pairs separability (orthogonal, job 17): authored_all p@10=0.70 beats prog_wide 0.20 (`out/diag/pairs_compare.csv`).

Waiting on: base (job 23, running) and no-loophole ceiling (job 24) to anchor the paper comparison table.

Training-`rout` note (not deploy): grad-cosine routing falls off a cliff (0.63@step6 -> 0.09@step20) as GRPO advantages flatten; act_vote sustains routing late (0.88@step17) by gating on activations (see RESEARCH_JOURNAL 2026-06-08). Job 19 tests whether that converts to deploy suppression.

## Dynamics note (sizing the convergence test)

Per-step trajectories (mix=0.125 g8, seed 41): `hack_s` rises 0→~0.6-0.75 and **plateaus by step ~13-16**; `gt_s` (solve) stays **noisy-flat at ~0.1-0.5 the whole run; it never climbs**. The attractor in this surrogate regime is full *hack*, not full solve, so "run until full solve" has no target. The convergence question is therefore: once vanilla hack plateaus (~step 15), does the projected arm's hack stay below it or catch up?
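One way to make that convergence question operational is sketched below, under stated assumptions: `plateau_step`, `stays_below`, the window/tolerance values, and the toy trajectories are all hypothetical illustrations, not the project's code or run data. The margin reuses the <=0.03 noise band.

```python
from __future__ import annotations

from statistics import mean


def plateau_step(series: list[float], window: int = 4, tol: float = 0.1) -> int | None:
    """Hypothetical heuristic: start of the first trailing window whose values span <= tol."""
    for end in range(window, len(series) + 1):
        chunk = series[end - window:end]
        if max(chunk) - min(chunk) <= tol:
            return end - window  # step where the flat stretch begins
    return None


def stays_below(vanilla: list[float], projected: list[float], margin: float = 0.03) -> bool | None:
    """After vanilla hack plateaus, does projected hack stay below it by more than `margin`?"""
    start = plateau_step(vanilla)
    if start is None:
        return None  # vanilla never plateaued, so the question is not answerable yet
    gap = mean(vanilla[start:]) - mean(projected[start:])
    return gap > margin


# Synthetic toy per-step hack rates (NOT run data), shaped like the note above.
vanilla_toy = [0.0, 0.2, 0.4, 0.6, 0.7, 0.7, 0.72, 0.7, 0.71, 0.7]
projected_toy = [0.0, 0.1, 0.1, 0.15, 0.2, 0.2, 0.18, 0.2, 0.22, 0.2]

print(plateau_step(vanilla_toy), stays_below(vanilla_toy, projected_toy))  # 4 True
```

The flat-window heuristic is deliberately crude; with per-step rates as noisy as `gt_s`, a wider window or smoothing would likely be needed before trusting the detected plateau step.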
A 60-step run (~2.2h at g8) continues ~45 steps past the plateau (about 3x the ~15 steps it takes to reach it); a 1000-step run (~36h) would be wasteful.

## Open / queued (no result yet)

- **convergence at ≥3 seeds (#121)**: the n=1 seed-42 run (Q11) shows the gap closing by step 60, but that could be a seed-42 high-hack draw. We need 2+ more seeds before concluding whether the suppression erodes or survives.
- **pairset content at ≥3 seeds (#122)**: Q10's mechanism>framing>placebo ordering is n=1 per row; replicate `prog_wide` and the placebo on 2+ seeds.
- **route arm at scale (#182)**: running; validates routing's ablated-eval hack