results: fill keynote table/figure at n=3 route2 / n=2 vanilla

C1 headline from deploy-eval (knob-off, n=64, T=0.7, 60-step fast, mix=0.125):
  route2  (n=3): hack 0.031+/-0.031, solve 0.615+/-0.010
  vanilla (n=2): hack 0.305+/-0.039, solve 0.516+/-0.032
  => -27pp deploy hack AND +10pp solve.

Keynote fig regenerated as a real band (3 route2 + 2 vanilla seeds, per-seed thin lines).

- main.tex tab:keynote + fig:keynote filled (vanilla n=2, s41=job 77 pending).
- results.md Q12 (route2 deploy n=3) + Q13 (floor leak = staleness not structure: no-floor 0.000, floor+stale 0.125, floor+refresh-1 0.000, job 73).
- RESEARCH_JOURNAL 2026-06-02 entry.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 17:30:41 +08:00 · 2026-06-02 11:08:41 +00:00
parent 768590a625
commit 311bf2854f
5 changed files with 118 additions and 15 deletions
@@ -288,6 +288,61 @@ numbers, so it is NOT a clean "same run, later" comparison. (3) The 20-step
 suppression is real (Q2, n=4); what's unclear is whether longer training erodes
 it or whether this is a sparse-teacher/seed artifact.

+## Q12. route2 deploy hack/solve, n=3 (the headline; DEPLOY-eval metric)
+
+<!-- METRIC NOTE: unlike Q1-Q11 (last-5-step *training* hack_s), this section
+     uses the DEPLOY-eval metric from per_mode_deploy.json: knob-off forward
+     (quarantine deleted for route2; trained model for vanilla), n=64 prompts
+     x group, T=0.7, 60-step fast preset, Qwen3-4B, mix=0.125. The two metrics
+     are NOT comparable -- deploy is the apples-to-apples per-arm number.
+     src per_mode_deploy.json:
+       route2 nofloor: s41 20260601T115713 / s42 T150231 / s43 T181502
+       vanilla:        s42 20260602T043228 (job 74) / s43 20260601T233047 (job 72)
+       vanilla s41 = job 77 QUEUED (behind the 200-step convergence runs). -->
+
+The headline comparison. route2 (per-rollout gate routes the hack-aligned
+gradient into a deletable quarantine knob) vs vanilla, deploy-eval:
+
+| arm             | seeds    | deploy hack |  ±SEM | deploy solve |  ±SEM |
+| :-------------- | :------- | ----------: | ----: | -----------: | ----: |
+| vanilla GRPO    | 42,43    |       0.305 | 0.039 |        0.516 | 0.032 |
+| route2 (ours)   | 41,42,43 |       0.031 | 0.031 |        0.615 | 0.010 |
+
+Per-seed route2 deploy hack: s41 0.000, s42 0.000, s43 0.094 (solve 0.625 /
+0.594 / 0.625). **route2 cuts deploy hack ~27pp (0.305 -> 0.031) and solves
+~10pp higher** (0.615 vs 0.516). Unlike Q11's training-hack metric (where the
+gap closed at 60 steps in the surrogate regime), the deploy-eval metric shows
+a large, persistent gap: the quarantine knob holds the cheat and deleting it at
+deploy removes it. Keynote figure: `out/figs/dyn_sub4_hack_overlay.png`
+(vanilla hack climbs to ~0.43, route2 stays ~0). Caveat: vanilla is n=2 (s41 =
+job 77 pending); promote to n=3 + paired test when it lands.
+
+## Q13. Does the exploration floor leak, and is the leak staleness or structure?
+
+<!-- DEPLOY-eval metric. src:
+     job 60 route2 no-floor (frac=0):            deploy hack 0.000
+     job 64 route2 floor=0.5 + STALE v_hack:     deploy hack 0.125
+     job 73 route2 floor=0.5 + refresh-1 (fresh): deploy hack 0.000 (to step 28,
+            daemon-killed; deploy held 0.000 at every eval 10/15/20/25). -->
+
+route2 has an optional exploration floor (`rollout_ablate_frac`): a fraction of
+rollouts generated knob-off so the deployed knob keeps seeing solve signal. The
+floor introduces a leak -- with a stale (frozen) v_hack the deployed model still
+hacks:
+
+| arm                            | deploy hack |
+| :----------------------------- | ----------: |
+| no floor (frac=0)              |       0.000 |
+| floor=0.5, stale v_hack        |       0.125 |
+| floor=0.5, refresh-1 (fresh V) |       0.000 |
+
+**The leak is staleness, not floor structure.** A fresh per-step gate
+(refresh-1) closes the floor's 0.125 leak back to 0.000. Caveats: each arm here
+is a single job, and the refresh-1 run (job 73) was daemon-killed at step
+28/60. Deploy hack was 0.000 at every eval through step 25 (10/15/20/25), so
+the direction is clear, but a clean 60-step rerun is needed to confirm it. The
+headline arm uses no floor anyway (already 0.000), so this is a design note
+for when the floor is wanted.
+
 ## Dynamics note (sizing the convergence test)

 Per-step trajectories (mix=0.125 g8, seed 41): `hack_s` rises 0→~0.6-0.75 and
@@ -127,15 +127,18 @@ deploy-eval = knob-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}
 % Provenance: out/figs/dyn_sub4_hack_overlay.png, generated by `just dyn`
 % (src/projected_grpo/plot_dynamics.py) at repo commit 17e4f2e (2026-06-02).
 % route2 nofloor seeds 41/42/43 = runs 20260601T115713 / T150231 / T181502.
-% Vanilla band INCOMPLETE: only s43 (20260601T233047) present; s42 (job 74)
-% running, s41 (job 84) queued -- regenerate `just dyn` once both land.
+% Vanilla band n=2: s42 (20260602T043228, job 74) + s43 (20260601T233047,
+% job 72); s41 (job 77) queued behind the 200-step runs -- regenerate
+% `just dyn` to n=3 when it lands.
 \begin{figure}[t]
  \centering
  \includegraphics[width=0.85\linewidth]{figs/dyn_sub4_hack_overlay.png}
-  \caption{Deploy hack rate over GRPO training, route2 vs vanilla, $n{=}3$
-  seeds (band = TODO mean$\pm$SEM). Knob-off deploy eval, $n{=}64$, $T{=}0.7$.
-  \TODO{interp -- author: vanilla emerges to $\sim$XX\%, route2 stays near zero.
-  Regenerate after jobs 74+84 land; current figure has vanilla $n{=}1$ (s43).}}
+  \caption{Hack rate (top) and solve rate (bottom) over GRPO training, route2
+  ($n{=}3$ seeds) vs vanilla ($n{=}2$); thick line = mean, thin = per seed.
+  EMA-5, knob-off deploy eval, $n{=}64$, $T{=}0.7$. Vanilla hack emerges to
+  $\sim$0.43 while route2 stays near zero; route2 also reaches a higher solve
+  rate ($\sim$0.6 vs $\sim$0.35). \TODO{interp prose -- author. Regenerate to
+  vanilla $n{=}3$ when job 77 (s41) lands.}}
  \label{fig:keynote}
 \end{figure}

@@ -146,23 +149,27 @@ deploy-eval = knob-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}
 %     s42 20260601T150231: hack_deploy 0.000  solve_deploy 0.594
 %     s43 20260601T181502: hack_deploy 0.094  solve_deploy 0.625
 %     => mean hack 0.031 (SEM 0.031); mean solve 0.615 (SEM 0.010)
-%   vanilla 60-step fast Qwen3-4B:
-%     s43 20260601T233047: hack_deploy 0.344  solve_deploy 0.484  (n=1 so far)
-%     s42 = job 74 RUNNING; s41 = job 84 QUEUED -> fill mean+/-SEM when done.
+%   vanilla 60-step fast Qwen3-4B (n=2 so far; s41 = job 77 QUEUED):
+%     s42 20260602T043228 (job 74): hack_deploy 0.266  solve_deploy 0.547
+%     s43 20260601T233047 (job 72): hack_deploy 0.344  solve_deploy 0.484
+%     => n=2 mean hack 0.305 (SEM 0.039); mean solve 0.516 (SEM 0.032)
+%     s41 (job 77) queued behind the 200-step convergence runs -> promote
+%     vanilla row to n=3 + add paired test when it lands.
 \begin{table}[t]
  \centering
-  \caption{Deploy hack and solve rate, mean$\pm$SEM over 3 seeds (41/42/43).
-  60-step fast preset, Qwen3-4B, mix=0.125; deploy = knob-off, $n{=}64$,
-  $T{=}0.7$. \TODO{paired test + $\alpha$; vanilla row pending jobs 74, 84.}}
+  \caption{Deploy hack and solve rate, mean$\pm$SEM. route2 over 3 seeds
+  (41/42/43); vanilla over 2 seeds (42/43) so far. 60-step fast preset,
+  Qwen3-4B, mix=0.125; deploy = knob-off, $n{=}64$, $T{=}0.7$.
+  \TODO{vanilla -> $n{=}3$ + paired test once job 77 (s41) lands.}}
  \label{tab:keynote}
  \begin{tabular}{lcc}
    \toprule
    Arm & Deploy hack & Deploy solve \\
    \midrule
-    Vanilla GRPO        & \TODO{$n{=}1$: 0.344} & \TODO{$n{=}1$: 0.484} \\
-    route2 (ours)       & $0.031 \pm 0.031$     & $0.615 \pm 0.010$     \\
+    Vanilla GRPO ($n{=}2$) & $0.305 \pm 0.039$ & $0.516 \pm 0.032$ \\
+    route2 (ours, $n{=}3$) & $0.031 \pm 0.031$ & $0.615 \pm 0.010$ \\
    \midrule
-    $\Delta$ vs vanilla & \TODO{after 74/84}    & \TODO{after 74/84}    \\
+    $\Delta$ vs vanilla    & $-0.274$          & $+0.099$           \\
    \bottomrule
  \end{tabular}
 \end{table}
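
The mean$\pm$SEM entries and the $\Delta$ row in `tab:keynote` can be re-derived from the per-seed values quoted in the provenance comments. The following is a minimal sketch, not code from the repo: it assumes SEM means sample standard deviation (ddof=1) divided by sqrt(n) over seeds, and it uses the three-decimal per-seed values as printed, so agreement with the table is only to within rounding (about 1e-3). The names `mean_sem`, `PER_SEED`, `summary`, and `delta` are illustrative.

```python
from math import sqrt
from statistics import mean, stdev


def mean_sem(values):
    """Return (mean, SEM), with SEM = sample std (ddof=1) / sqrt(n)."""
    return mean(values), stdev(values) / sqrt(len(values))


# Per-seed deploy-eval rates as printed in the provenance comments
# (three-decimal rounding, so results match the table only to ~1e-3).
PER_SEED = {
    # route2 seeds: s41, s42, s43
    "route2": {"hack": [0.000, 0.000, 0.094], "solve": [0.625, 0.594, 0.625]},
    # vanilla seeds: s42, s43 (s41 = job 77 still pending)
    "vanilla": {"hack": [0.266, 0.344], "solve": [0.547, 0.484]},
}

summary = {
    arm: {metric: mean_sem(vals) for metric, vals in metrics.items()}
    for arm, metrics in PER_SEED.items()
}

# Delta row: route2 mean minus vanilla mean, per metric.
delta = {
    metric: summary["route2"][metric][0] - summary["vanilla"][metric][0]
    for metric in ("hack", "solve")
}

for arm, metrics in summary.items():
    for metric, (m, sem) in metrics.items():
        print(f"{arm:8s} {metric:5s} {m:.3f} +/- {sem:.3f}")
print(f"delta hack {delta['hack']:+.3f}  delta solve {delta['solve']:+.3f}")
```

With only two vanilla seeds the SEM reduces to half the absolute difference between them, which is one reason the vanilla row should be read as provisional until s41 lands.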