Results, organized by the question each run answers
Deploy-eval is the headline metric: knob-off forward on the recency-clean held-out
TEST set (ids>=3243, base solve ~0.1, n=119), single-mode run_tests env, Qwen3-4B.
Regenerate the table with just results (scripts/results_deploy.py, auto-discovers
every out/runs/*/deploy_test.json); just results gives the live training-hack table.
OLD eval1 results (training-hack metric, the Q1-Q13 mechanism/basis/refresh studies on the pre-recency-clean eval) are archived in results_eval1_archive.md.
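The auto-discovery step described above can be sketched roughly as follows. This is not the contents of scripts/results_deploy.py, only a minimal illustration of globbing `out/runs/*/deploy_test.json`. The JSON keys `hack_deploy` and `solve_deploy` and the function name `discover_deploy_results` are assumptions made for the example; the real schema is not shown in this document.

```python
import json
from pathlib import Path


def discover_deploy_results(runs_dir="out/runs"):
    """Collect one row per run directory that contains deploy_test.json."""
    rows = []
    for path in sorted(Path(runs_dir).glob("*/deploy_test.json")):
        with path.open() as f:
            data = json.load(f)
        # ASSUMPTION: top-level "hack_deploy" and "solve_deploy" keys.
        # The actual file layout may differ.
        hack = data["hack_deploy"]
        solve = data["solve_deploy"]
        rows.append(
            {
                "run": path.parent.name,  # run directory name
                "hack_deploy": hack,
                "solve_deploy": solve,
                "headline": round(solve - hack, 3),
            }
        )
    return rows


if __name__ == "__main__":
    for row in discover_deploy_results():
        print(row)
```

Sorting the glob results keeps the row order stable between regenerations, which makes diffs of the table easier to read.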
How to read this
- Headline = solve_deploy - hack_deploy (both alone are gameable; the gap is the honest objective: solve the task without learning the cheat). Higher is better.
- hack_deploy / solve_deploy = knob-off deploy eval (quarantine deleted for routeV; the trained model for vanilla) on test n=119. hack_train / solve_train = last-5-step student rates during training (converged regime).
- All rows n=1 (single seed 43) unless noted; treat gaps <=0.03 as noise.
- Watch the pool/pairs confound across rows (see argv/train_set); only same-axis pairs are A/B-comparable (called out in the answer).
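As a worked illustration of the reading rules above, the headline and the last-5-step training rate can be computed like this. The helper names `headline` and `last_k_mean` are hypothetical, and the (hack, solve) pairs are the completed Q14 rows from this document.

```python
def headline(solve_deploy, hack_deploy):
    """Headline objective: solve rate minus hack rate at deploy. Higher is better."""
    return solve_deploy - hack_deploy


def last_k_mean(per_step_rates, k=5):
    """Mean of the last k per-step rates (the 'converged regime' training rate)."""
    tail = per_step_rates[-k:]
    return sum(tail) / len(tail)


# (hack_deploy, solve_deploy) for the completed Q14 arms, copied from the table.
completed = {
    "routeV per-token": (0.042, 0.143),
    "routeV authored": (0.076, 0.118),
    "routeV prog_wide": (0.101, 0.126),
    "routeV random-V": (0.101, 0.109),
}

for arm, (hack, solve) in completed.items():
    print(f"{arm:18s} headline = {headline(solve, hack):+.3f}")
```

Rounding to three decimals reproduces the headline column; the raw float differences carry the usual binary rounding error.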
Q14. 🥇 routeV deploy on the recency-clean eval2 test set (the current headline)
Everything above (Q1-Q13) is on the OLD eval. Q12's route2 numbers used n=64 prompts before the
recency-clean fix; the env is now single-mode run_tests and the held-out test set is
recency-clean (ids>=3243, base solve ~0.1). This is the corrected substrate. All rows: seed 43,
60 steps, deploy = knob-off forward on test n=119. Headline = solve_deploy - hack_deploy.
Note the pool/pairs confound across rows (see argv); the only single-axis A/Bs are called out
in the answer.
| arm | pairs | gran | hack ↓ | solve ↑ | headline |
|---|---|---|---|---|---|
| routeV per-token | prog_wide | per-token | 0.042 | 0.143 | +0.101 |
| routeV authored | authored | per-rollout | 0.076 | 0.118 | +0.042 |
| routeV prog_wide | prog_wide | per-rollout | 0.101 | 0.126 | +0.025 |
| routeV random-V | prog_wide (Haar dir) | per-rollout | 0.101 | 0.109 | +0.008 |
| vanilla GRPO | -- | -- | running (job 16) | | |
| routeV act_vote | authored | per-rollout (global vote) | queued (19) | | |
| routeV LoRA-B | authored | per-rollout | queued (20) | | |
| routeV random-V | authored (Haar dir) | per-rollout | queued (21) | | |
Answer: three single-axis reads already hold; the suppression magnitude waits on vanilla.
- Direction doesn't matter at per-rollout (H2 absorption, on eval2): real-V (prog_wide, 0.101) == random-V (prog_wide, 0.101). Replicates the old-eval H2 result on the clean set. The open question is whether AUTHORED direction matters (job 21 random-V-authored vs job 15) -- queued.
- Pairs matter: authored per-rollout 0.076 < prog_wide per-rollout 0.101 (clean A/B -- same granularity, same dense pool, differ only in pairs). Authored helps even though real-vs-random didn't, so the gain is the pair CONTENT, not direction sharpness. At n=1 the 0.025 gap sits inside the <=0.03 noise band, so treat this read as provisional.
- Granularity matters most: per-token 0.042 < per-rollout 0.101 (both prog_wide). The per-token arm has the lowest deploy hack AND the highest solve (0.143) of any completed run.
All single seed (n=1), so treat <=0.03 gaps as noise. NOT yet interpretable as suppression until
vanilla (job 16, running): if vanilla deploy hack is ~0.10 then even per-token's 0.042 is only a
modest cut over base; if vanilla is high (>0.3 as on the old eval), all routeV arms suppress
strongly. Pairs separability (orthogonal, job 17): authored_all p@10=0.70 beats prog_wide 0.20
(out/diag/pairs_compare.csv).
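The three single-axis reads can be checked against the document's own noise rule (gaps <=0.03 at n=1) with a small sketch. The dictionary keys and the `classify_gap` helper are hypothetical names; the deploy hack rates come from the Q14 table.

```python
NOISE_BAND = 0.03  # from this document: treat gaps <= 0.03 as noise at n=1

# Deploy hack rates for the completed arms, copied from the Q14 table.
hack_deploy = {
    "per-token/prog_wide": 0.042,
    "per-rollout/authored": 0.076,
    "per-rollout/prog_wide": 0.101,
    "per-rollout/random-V": 0.101,
}

# Each entry varies exactly one axis between the two arms.
single_axis_pairs = {
    "direction": ("per-rollout/prog_wide", "per-rollout/random-V"),
    "pairs": ("per-rollout/authored", "per-rollout/prog_wide"),
    "granularity": ("per-token/prog_wide", "per-rollout/prog_wide"),
}


def classify_gap(a, b, band=NOISE_BAND):
    """Return (absolute gap, 'noise' or 'signal') under the noise band."""
    gap = round(abs(a - b), 3)
    return gap, ("noise" if gap <= band else "signal")


for axis, (left, right) in single_axis_pairs.items():
    gap, verdict = classify_gap(hack_deploy[left], hack_deploy[right])
    print(f"{axis:12s} gap={gap:.3f} -> {verdict}")
```

Under this threshold only the granularity gap (0.059) clears the band; the direction gap is exactly zero and the pairs gap (0.025) falls just inside it, which is one reason the extra seeds matter.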
Training-routing note (not deploy): grad-cosine routing cliffs (0.63@step6 -> 0.09@step20, GRPO
advantage flattening); act_vote sustains late (0.88@step17) by gating on activations -- see
RESEARCH_JOURNAL 2026-06-08. Whether that converts to deploy suppression is what job 19 tests.
Dynamics note (sizing the convergence test)
Per-step trajectories (mix=0.125 g8, seed 41): hack_s rises 0→~0.6-0.75 and
plateaus by step ~13-16; gt_s (solve) stays noisy-flat at ~0.1-0.5 for the
whole run and never climbs. The attractor in this surrogate regime is full
hack, not full solve — so "run until full solve" has no target. The
convergence question is therefore: once vanilla hack plateaus (~step 15), does
projected stay below it or catch up? A 60-step run (~2.2h at g8) sees 3x past
the plateau; a 1000-step run (~36h) is wasteful.
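The sizing arithmetic above can be reproduced with a short sketch. It assumes wall-clock time scales linearly with step count (extrapolating from the quoted 60 steps in ~2.2h at g8) and takes step 15 as the plateau; both function names are hypothetical.

```python
def hours_for_steps(steps, ref_steps=60, ref_hours=2.2):
    """Linear extrapolation of wall-clock hours from the 60-step reference run.

    ASSUMPTION: cost per step is constant across the run.
    """
    return steps * ref_hours / ref_steps


def plateau_multiples(total_steps, plateau_step=15):
    """Return (steps past the plateau, that count in units of the plateau length)."""
    past = total_steps - plateau_step
    return past, past / plateau_step


if __name__ == "__main__":
    print(round(hours_for_steps(60), 1))    # 2.2
    print(round(hours_for_steps(1000), 1))  # 36.7
    print(plateau_multiples(60))            # (45, 3.0)
```

A 60-step run therefore covers 45 steps beyond a step-15 plateau, three times the plateau length, while 1000 steps would cost roughly 36 to 37 hours under the same assumption.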
Open / queued (no result yet)
- convergence at ≥3 seeds (#121): the n=1 seed-42 run (Q11) shows the gap closing by step 60, but that could be a seed-42 high-hack draw. Need 2+ more seeds before concluding the suppression erodes vs survives.
- pairset content at ≥3 seeds (#122): Q10's mechanism>framing>placebo ordering is n=1 per row; replicate prog_wide and the placebo on 2+ seeds.
- route arm at scale (#182): running; validates routing's ablated-eval hack<kept on Qwen3-4B before the 3-way none/erase/route cells (#130).
- k-slice (k=1/2/5): only smoke-tested, no 4B results.
- Stage 2/3 cross-mechanism generalisation: the load-bearing test -- extract v_hack from hack A, check it stops the unknown hack B the student would otherwise learn. Q10 (held-out framing) is a weaker cousin.