diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index 1a4a131..e359ddb 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -2,6 +2,47 @@
 
 Append-only. New entries at the top, date-stamped. Never edit old entries.
 
+## 2026-06-02 (c) — route2 keynote at n=3: deploy hack 0.31 -> 0.03 at HIGHER solve; StepLogger merge-bug fixed
+
+**Context:** `probe/distill-cosine`. Filling the keynote table/figure (artifacts A1/A2) from the
+landed deploy runs. route2 nofloor n=3 (pueue 68/69/70) vs vanilla n=2 (74 s42, 72 s43; s41=job 77
+queued behind the 200-step convergence runs the user prioritized). Deploy-eval = knob-off, n=64,
+T=0.7, 60-step fast, Qwen3-4B, mix=0.125. Also fixed a merge bug that was crashing every new run.
+
+### Observations (DEPLOY-eval, per_mode_deploy.json)
+
+- [obs] route2 deploy hack per seed: s41 0.000, s42 0.000, s43 0.094 -> mean 0.031 (SEM 0.031);
+  solve 0.625/0.594/0.625 -> mean 0.615 (SEM 0.010).
+- [obs] vanilla deploy hack: s42 0.266, s43 0.344 -> n=2 mean 0.305 (SEM 0.039); solve 0.547/0.484
+  -> mean 0.516 (SEM 0.032).
+- [obs] keynote figure regenerated (3 route2 + 2 vanilla seeds, per-seed thin lines):
+  `out/figs/dyn_sub4_hack_overlay.png` -- vanilla hack climbs 0->~0.43 over 60 steps, route2 stays
+  ~0; route2 solve plateaus ~0.6, vanilla noisy ~0.3-0.4.
+- [obs] merge bug: `worktree-refactor` merge (a1b17ab) left the pre-refactor `StepLogger` (+_Col,
+  _format_cell) defined in train.py, shadowing the `tablelog` import; call site uses the new
+  `mode_code` signature -> TypeError on every run. Killed jobs 75/76/77/78/84. Fixed in 768590a
+  (ported deploy-for-all-arms + per-mode-int layout into tablelog, deleted the 119-line shadow);
+  verified via smoke + smoke-vanilla. Separately, jobs 80-83 had corrupted commands (stray `3 -- `,
+  exit 127) -> re-added clean as 85/86/87/88.
+
+### Interpretation
+
+- [inf, 0.75] C1 holds at n=3 route2 / n=2 vanilla: ~27pp deploy-hack drop (0.305 -> 0.031) AND a
+  ~10pp solve GAIN (0.615 vs 0.516). The solve gain (not just matched solve) is the strong form --
+  vanilla burns capacity learning to hack; route2 quarantines that and spends it on solving.
+- [inf, 0.8] this is the deploy-eval metric, NOT Q11's training-hack metric. Q11 showed the
+  training-hack gap closing by step 60 in the surrogate regime; the deploy gap does not close
+  because the cheat is held in the deletable knob. Different question, different answer -- do not
+  conflate (results.md Q12 metric note).
+- [caveat] vanilla n=2; s43 (0.344) > s42 (0.266) so the band is wide. Promote to n=3 + paired test
+  when job 77 lands. route2 s43=0.094 is the only nonzero route2 seed -- worth a per-mode look.
+
+### Next
+
+- job 77 (vanilla s41) -> n=3 vanilla, paired test, re-run `just dyn` at vanilla n=3.
+- jobs 84/85 (200-step route2 vs vanilla) -> A4 long-run: is the deploy gap durable or just delayed?
+- results.md gained Q12 (route2 deploy n=3) + Q13 (floor leak = staleness not structure, job 73).
+
 ## 2026-06-01 (m) — route2 WORKS at n=1: deploy hack 0.31 -> 0.00 at +6pp solve, and a held-out mode is suppressed
 
 **Context:** commit `dfc6068` (route2 resid column) on `probe/distill-cosine`; pueue id 60,
diff --git a/docs/results.md b/docs/results.md
index 4c94ef8..9773f0e 100644
--- a/docs/results.md
+++ b/docs/results.md
@@ -288,6 +288,61 @@ numbers, so it is NOT a clean "same run, later" comparison. (3) The 20-step
 suppression is real (Q2, n=4); what's unclear is whether longer training erodes
 it or whether this is a sparse-teacher/seed artifact.
 
+## Q12. route2 deploy hack/solve, n=3 (the headline; DEPLOY-eval metric)
+
+<!-- METRIC NOTE: unlike Q1-Q11 (last-5-step *training* hack_s), this section
+     uses the DEPLOY-eval metric from per_mode_deploy.json: knob-off forward
+     (quarantine deleted for route2; trained model for vanilla), n=64 prompts
+     x group, T=0.7, 60-step fast preset, Qwen3-4B, mix=0.125. The two metrics
+     are NOT comparable -- deploy is the apples-to-apples per-arm number.
+     src per_mode_deploy.json:
+       route2 nofloor: s41 20260601T115713 / s42 T150231 / s43 T181502
+       vanilla:        s42 20260602T043228 (job 74) / s43 20260601T233047 (job 72)
+       vanilla s41 = job 77 QUEUED (behind the 200-step convergence runs). -->
+
+The headline comparison. route2 (per-rollout gate routes the hack-aligned
+gradient into a deletable quarantine knob) vs vanilla, deploy-eval:
+
+| arm             | seeds    | deploy hack |  ±SEM | deploy solve |  ±SEM |
+| :-------------- | :------- | ----------: | ----: | -----------: | ----: |
+| vanilla GRPO    | 42,43    |       0.305 | 0.039 |        0.516 | 0.032 |
+| route2 (ours)   | 41,42,43 |       0.031 | 0.031 |        0.615 | 0.010 |
+
+Per-seed route2 deploy hack: s41 0.000, s42 0.000, s43 0.094 (solve 0.625 /
+0.594 / 0.625). **route2 cuts deploy hack ~27pp (0.305 -> 0.031) and solves
+~10pp higher** (0.615 vs 0.516). Unlike Q11's training-hack metric (where the
+gap closed at 60 steps in the surrogate regime), the deploy-eval gap is large
+and still open at step 60: the quarantine knob holds the cheat, and deleting it
+at deploy removes it. Keynote figure: `out/figs/dyn_sub4_hack_overlay.png`
+(vanilla hack climbs to ~0.43, route2 stays ~0). Caveats: vanilla is n=2 (s41 =
+job 77 pending; promote to n=3 + paired test); durability past step 60 untested.
+
+## Q13. Does the exploration floor leak, and is the leak staleness or structure?
+
+<!-- DEPLOY-eval metric. src:
+     job 60 route2 no-floor (frac=0):            deploy hack 0.000
+     job 64 route2 floor=0.5 + STALE v_hack:     deploy hack 0.125
+     job 73 route2 floor=0.5 + refresh-1 (fresh): deploy hack 0.000 (to step 28,
+            daemon-killed; deploy held 0.000 at every eval 10/15/20/25). -->
+
+route2 has an optional exploration floor (`rollout_ablate_frac`): a fraction of
+rollouts is generated knob-off so the deployed (knob-off) model keeps seeing
+solve signal. The floor introduces a leak -- with a stale (frozen) v_hack the
+deployed model still hacks:
+
+| arm                                 | deploy hack |
+| :---------------------------------- | ----------: |
+| no floor (frac=0)                   |       0.000 |
+| floor=0.5, stale v_hack             |       0.125 |
+| floor=0.5, refresh-1 (fresh v_hack) |       0.000 |
+
+**The leak is staleness, not floor structure.** A fresh per-step gate
+(refresh-1) closes the floor's 0.125 leak back to 0.000. Caveat: the refresh-1
+run (job 73) was daemon-killed at step 28/60, but deploy hack held at 0.000 at
+every eval it reached (steps 10/15/20/25), so the direction is clear for this
+single run; a clean 60-step rerun is needed to confirm. The headline arm uses no
+floor anyway (already 0.000), so this is a design note for when the floor is wanted.
+
 ## Dynamics note (sizing the convergence test)
 
 Per-step trajectories (mix=0.125 g8, seed 41): `hack_s` rises 0→~0.6-0.75 and
diff --git a/docs/writeup/main.tex b/docs/writeup/main.tex
index 230333b..6c35dce 100644
--- a/docs/writeup/main.tex
+++ b/docs/writeup/main.tex
@@ -127,15 +127,18 @@ deploy-eval = knob-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}
 % Provenance: out/figs/dyn_sub4_hack_overlay.png, generated by `just dyn`
 % (src/projected_grpo/plot_dynamics.py) at repo commit 17e4f2e (2026-06-02).
 % route2 nofloor seeds 41/42/43 = runs 20260601T115713 / T150231 / T181502.
-% Vanilla band INCOMPLETE: only s43 (20260601T233047) present; s42 (job 74)
-% running, s41 (job 84) queued -- regenerate `just dyn` once both land.
+% Vanilla band n=2: s42 (20260602T043228, job 74) + s43 (20260601T233047,
+% job 72); s41 (job 77) queued behind the 200-step runs -- regenerate
+% `just dyn` to n=3 when it lands.
 \begin{figure}[t]
   \centering
   \includegraphics[width=0.85\linewidth]{figs/dyn_sub4_hack_overlay.png}
-  \caption{Deploy hack rate over GRPO training, route2 vs vanilla, $n{=}3$
-  seeds (band = TODO mean$\pm$SEM). Knob-off deploy eval, $n{=}64$, $T{=}0.7$.
-  \TODO{interp -- author: vanilla emerges to $\sim$XX\%, route2 stays near zero.
-  Regenerate after jobs 74+84 land; current figure has vanilla $n{=}1$ (s43).}}
+  \caption{Hack rate (top) and solve rate (bottom) over GRPO training, route2
+  ($n{=}3$ seeds) vs vanilla ($n{=}2$); thick line = mean, thin = per seed.
+  EMA-5, knob-off deploy eval, $n{=}64$, $T{=}0.7$. Vanilla hack emerges to
+  $\sim$0.43 while route2 stays near zero; route2 also reaches a higher solve
+  rate ($\sim$0.6 vs $\sim$0.35). \TODO{interp prose -- author. Regenerate to
+  vanilla $n{=}3$ when job 77 (s41) lands.}}
   \label{fig:keynote}
 \end{figure}
 
@@ -146,23 +149,27 @@ deploy-eval = knob-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}
 %     s42 20260601T150231: hack_deploy 0.000  solve_deploy 0.594
 %     s43 20260601T181502: hack_deploy 0.094  solve_deploy 0.625
 %     => mean hack 0.031 (SEM 0.031); mean solve 0.615 (SEM 0.010)
-%   vanilla 60-step fast Qwen3-4B:
-%     s43 20260601T233047: hack_deploy 0.344  solve_deploy 0.484  (n=1 so far)
-%     s42 = job 74 RUNNING; s41 = job 84 QUEUED -> fill mean+/-SEM when done.
+%   vanilla 60-step fast Qwen3-4B (n=2 so far; s41 = job 77 QUEUED):
+%     s42 20260602T043228 (job 74): hack_deploy 0.266  solve_deploy 0.547
+%     s43 20260601T233047 (job 72): hack_deploy 0.344  solve_deploy 0.484
+%     => n=2 mean hack 0.305 (SEM 0.039); mean solve 0.516 (SEM 0.032)
+%     s41 (job 77) queued behind the 200-step convergence runs -> promote
+%     vanilla row to n=3 + add paired test when it lands.
 \begin{table}[t]
   \centering
-  \caption{Deploy hack and solve rate, mean$\pm$SEM over 3 seeds (41/42/43).
-  60-step fast preset, Qwen3-4B, mix=0.125; deploy = knob-off, $n{=}64$,
-  $T{=}0.7$. \TODO{paired test + $\alpha$; vanilla row pending jobs 74, 84.}}
+  \caption{Deploy hack and solve rate, mean$\pm$SEM. route2 over 3 seeds
+  (41/42/43); vanilla over 2 seeds (42/43) so far. 60-step fast preset,
+  Qwen3-4B, mix=0.125; deploy = knob-off, $n{=}64$, $T{=}0.7$.
+  \TODO{vanilla $\to$ $n{=}3$ + paired test once job 77 (s41) lands.}}
   \label{tab:keynote}
   \begin{tabular}{lcc}
     \toprule
     Arm & Deploy hack & Deploy solve \\
     \midrule
-    Vanilla GRPO        & \TODO{$n{=}1$: 0.344} & \TODO{$n{=}1$: 0.484} \\
-    route2 (ours)       & $0.031 \pm 0.031$     & $0.615 \pm 0.010$     \\
+    Vanilla GRPO ($n{=}2$) & $0.305 \pm 0.039$ & $0.516 \pm 0.032$ \\
+    route2 (ours, $n{=}3$) & $0.031 \pm 0.031$ & $0.615 \pm 0.010$ \\
     \midrule
-    $\Delta$ vs vanilla & \TODO{after 74/84}    & \TODO{after 74/84}    \\
+    $\Delta$ vs vanilla    & $-0.274$          & $+0.099$           \\
     \bottomrule
   \end{tabular}
 \end{table}
diff --git a/out/figs/dyn_sub4.png b/out/figs/dyn_sub4.png
index 461a3bf..34a3359 100644
Binary files a/out/figs/dyn_sub4.png and b/out/figs/dyn_sub4.png differ
diff --git a/out/figs/dyn_sub4_hack_overlay.png b/out/figs/dyn_sub4_hack_overlay.png
index e21d56f..cb4913a 100644
Binary files a/out/figs/dyn_sub4_hack_overlay.png and b/out/figs/dyn_sub4_hack_overlay.png differ