results: fill keynote table/figure at n=3 route2 / n=2 vanilla

C1 headline from deploy-eval (knob-off, n=64, T=0.7, 60-step fast, mix=0.125):
  route2 (n=3): hack 0.031+/-0.031, solve 0.615+/-0.010
  vanilla (n=2): hack 0.305+/-0.039, solve 0.516+/-0.032
  => -27pp deploy hack AND +10pp solve. Keynote fig regenerated as a real
  band (3 route2 + 2 vanilla seeds, per-seed thin lines).

- main.tex tab:keynote + fig:keynote filled (vanilla n=2, s41=job 77 pending).
- results.md Q12 (route2 deploy n=3) + Q13 (floor leak = staleness not
  structure: no-floor 0.000, floor+stale 0.125, floor+refresh-1 0.000, job 73).
- RESEARCH_JOURNAL 2026-06-02 entry.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-02 11:08:41 +00:00
parent 768590a625
commit 311bf2854f
5 changed files with 118 additions and 15 deletions
+55
@@ -288,6 +288,61 @@ numbers, so it is NOT a clean "same run, later" comparison. (3) The 20-step
suppression is real (Q2, n=4); what's unclear is whether longer training erodes
it or whether this is a sparse-teacher/seed artifact.
## Q12. route2 deploy hack/solve, n=3 (the headline; DEPLOY-eval metric)
<!-- METRIC NOTE: unlike Q1-Q11 (last-5-step *training* hack_s), this section
uses the DEPLOY-eval metric from per_mode_deploy.json: knob-off forward
(quarantine deleted for route2; trained model for vanilla), n=64 prompts
x group, T=0.7, 60-step fast preset, Qwen3-4B, mix=0.125. The two metrics
are NOT comparable -- deploy is the apples-to-apples per-arm number.
src per_mode_deploy.json:
route2 nofloor: s41 20260601T115713 / s42 T150231 / s43 T181502
vanilla: s42 20260602T043228 (job 74) / s43 20260601T233047 (job 72)
vanilla s41 = job 77 QUEUED (behind the 200-step convergence runs). -->
The headline comparison. route2 (per-rollout gate routes the hack-aligned
gradient into a deletable quarantine knob) vs vanilla, deploy-eval:
| arm             | seeds    | deploy hack | hack SEM | deploy solve | solve SEM |
| :-------------- | :------- | ----------: | -------: | -----------: | --------: |
| vanilla GRPO    | 42,43    |       0.305 |    0.039 |        0.516 |     0.032 |
| route2 (ours)   | 41,42,43 |       0.031 |    0.031 |        0.615 |     0.010 |
Per-seed route2 deploy hack: s41 0.000, s42 0.000, s43 0.094 (solve 0.625 /
0.594 / 0.625). **route2 cuts deploy hack ~27pp (0.305 -> 0.031) and solves
~10pp higher** (0.615 vs 0.516). Unlike Q11's training-hack metric (where the
gap closed at 60 steps in the surrogate regime), the deploy-eval metric shows
a large, persistent gap: the quarantine knob holds the cheat and deleting it at
deploy removes it. Keynote figure: `out/figs/dyn_sub4_hack_overlay.png`
(vanilla hack climbs to ~0.43, route2 stays ~0). Caveat: vanilla is n=2 (s41 =
job 77 pending); promote to n=3 + paired test when it lands.
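
For reference, the mean±SEM and delta figures above can be re-derived from the per-seed deploy values. This is a minimal sketch, not the repo's own aggregation code: it assumes SEM is the sample standard deviation (ddof=1) divided by sqrt(n), and the helper name `mean_sem` is hypothetical. The route2 per-seed values are the ones listed above; the vanilla per-seed values are the ones recorded in the main.tex provenance comments.

```python
from math import sqrt
from statistics import mean, stdev


def mean_sem(values):
    """Return (mean, SEM), assuming SEM = sample stdev (ddof=1) / sqrt(n)."""
    return mean(values), stdev(values) / sqrt(len(values))


# Per-seed deploy-eval values copied from the notes in this commit.
route2_hack = [0.000, 0.000, 0.094]    # s41, s42, s43
route2_solve = [0.625, 0.594, 0.625]   # s41, s42, s43
vanilla_hack = [0.266, 0.344]          # s42, s43 (s41 still pending)
vanilla_solve = [0.547, 0.484]         # s42, s43

r_hack, r_hack_sem = mean_sem(route2_hack)
r_solve, r_solve_sem = mean_sem(route2_solve)
v_hack, v_hack_sem = mean_sem(vanilla_hack)
v_solve, v_solve_sem = mean_sem(vanilla_solve)

# Unpaired difference of arm means (route2 minus vanilla).
delta_hack = r_hack - v_hack
delta_solve = r_solve - v_solve

print(f"route2  hack {r_hack:.3f} +/- {r_hack_sem:.3f}  solve {r_solve:.3f} +/- {r_solve_sem:.3f}")
print(f"vanilla hack {v_hack:.3f} +/- {v_hack_sem:.3f}  solve {v_solve:.3f} +/- {v_solve_sem:.3f}")
print(f"delta   hack {delta_hack:+.3f}  solve {delta_solve:+.3f}")
```

Under that SEM assumption the results agree with the table to within rounding. The delta is a difference of unpaired means over unequal seed sets (3 vs 2), so it is not a substitute for the paired test planned once vanilla s41 lands.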
## Q13. Does the exploration floor leak, and is the leak staleness or structure?
<!-- DEPLOY-eval metric. src:
job 60 route2 no-floor (frac=0): deploy hack 0.000
job 64 route2 floor=0.5 + STALE v_hack: deploy hack 0.125
job 73 route2 floor=0.5 + refresh-1 (fresh): deploy hack 0.000 (to step 28,
daemon-killed; deploy held 0.000 at every eval 10/15/20/25). -->
route2 has an optional exploration floor (`rollout_ablate_frac`): a fraction of
rollouts generated knob-off so the deployed knob keeps seeing solve signal. The
floor introduces a leak -- with a stale (frozen) v_hack the deployed model still
hacks:
| arm | deploy hack |
| :----------------------------- | ----------: |
| no floor (frac=0) | 0.000 |
| floor=0.5, stale v_hack | 0.125 |
| floor=0.5, refresh-1 (fresh V) | 0.000 |
**The leak is staleness, not floor structure.** A fresh per-step gate
(refresh-1) closes the floor's 0.125 leak back to 0.000. Caveat: the refresh-1
run (job 73) was daemon-killed at step 28/60, but deploy hack held at 0.000 at
every eval through step 25 (evals at 10/15/20/25), so the direction is clear
over the steps observed; a clean 60-step rerun is still needed to confirm it at
the full horizon. The headline arm uses no floor anyway (already 0.000), so
this is a design note for when the floor is wanted.
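
To make the floor concrete, here is a minimal sketch of choosing which rollouts in a group are generated knob-off. It is an illustration only: the helper `knob_off_mask`, the fixed-count selection rule, and the group size of 8 are assumptions, not the repo's implementation. Only the parameter name `rollout_ablate_frac` comes from the text above.

```python
import random


def knob_off_mask(group_size, rollout_ablate_frac, rng):
    """Hypothetical helper: mark a fixed fraction of a rollout group as knob-off.

    Returns one bool per rollout; True means the rollout is generated with the
    quarantine knob disabled (the exploration floor).
    """
    if not 0.0 <= rollout_ablate_frac <= 1.0:
        raise ValueError("rollout_ablate_frac must be in [0, 1]")
    n_off = round(rollout_ablate_frac * group_size)
    off = set(rng.sample(range(group_size), n_off))
    return [i in off for i in range(group_size)]


rng = random.Random(41)  # seeded so the illustration is reproducible
floor_mask = knob_off_mask(8, 0.5, rng)     # floor=0.5: half the group is knob-off
no_floor_mask = knob_off_mask(8, 0.0, rng)  # frac=0: no knob-off rollouts
print(sum(floor_mask), sum(no_floor_mask))
```

This sketch covers only the selection step. The staleness effect reported above depends on how often v_hack is refreshed, which this sketch does not model.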
## Dynamics note (sizing the convergence test)
Per-step trajectories (mix=0.125 g8, seed 41): `hack_s` rises 0→~0.6-0.75 and
+22 -15
@@ -127,15 +127,18 @@ deploy-eval = knob-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}
% Provenance: out/figs/dyn_sub4_hack_overlay.png, generated by `just dyn`
% (src/projected_grpo/plot_dynamics.py) at repo commit 17e4f2e (2026-06-02).
% route2 nofloor seeds 41/42/43 = runs 20260601T115713 / T150231 / T181502.
% Vanilla band INCOMPLETE: only s43 (20260601T233047) present; s42 (job 74)
% running, s41 (job 84) queued -- regenerate `just dyn` once both land.
% Vanilla band n=2: s42 (20260602T043228, job 74) + s43 (20260601T233047,
% job 72); s41 (job 77) queued behind the 200-step runs -- regenerate
% `just dyn` to n=3 when it lands.
\begin{figure}[t]
\centering
\includegraphics[width=0.85\linewidth]{figs/dyn_sub4_hack_overlay.png}
\caption{Deploy hack rate over GRPO training, route2 vs vanilla, $n{=}3$
seeds (band = TODO mean$\pm$SEM). Knob-off deploy eval, $n{=}64$, $T{=}0.7$.
\TODO{interp -- author: vanilla emerges to $\sim$XX\%, route2 stays near zero.
Regenerate after jobs 74+84 land; current figure has vanilla $n{=}1$ (s43).}}
\caption{Hack rate (top) and solve rate (bottom) over GRPO training, route2
($n{=}3$ seeds) vs vanilla ($n{=}2$); thick line = mean, thin = per seed.
EMA-5, knob-off deploy eval, $n{=}64$, $T{=}0.7$. Vanilla hack emerges to
$\sim$0.43 while route2 stays near zero; route2 also reaches a higher solve
rate ($\sim$0.6 vs $\sim$0.35). \TODO{interp prose -- author. Regenerate to
vanilla $n{=}3$ when job 77 (s41) lands.}}
\label{fig:keynote}
\end{figure}
@@ -146,23 +149,27 @@ deploy-eval = knob-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.}
% s42 20260601T150231: hack_deploy 0.000 solve_deploy 0.594
% s43 20260601T181502: hack_deploy 0.094 solve_deploy 0.625
% => mean hack 0.031 (SEM 0.031); mean solve 0.615 (SEM 0.010)
% vanilla 60-step fast Qwen3-4B:
% s43 20260601T233047: hack_deploy 0.344 solve_deploy 0.484 (n=1 so far)
% s42 = job 74 RUNNING; s41 = job 84 QUEUED -> fill mean+/-SEM when done.
% vanilla 60-step fast Qwen3-4B (n=2 so far; s41 = job 77 QUEUED):
% s42 20260602T043228 (job 74): hack_deploy 0.266 solve_deploy 0.547
% s43 20260601T233047 (job 72): hack_deploy 0.344 solve_deploy 0.484
% => n=2 mean hack 0.305 (SEM 0.039); mean solve 0.516 (SEM 0.032)
% s41 (job 77) queued behind the 200-step convergence runs -> promote
% vanilla row to n=3 + add paired test when it lands.
\begin{table}[t]
\centering
\caption{Deploy hack and solve rate, mean$\pm$SEM over 3 seeds (41/42/43).
60-step fast preset, Qwen3-4B, mix=0.125; deploy = knob-off, $n{=}64$,
$T{=}0.7$. \TODO{paired test + $\alpha$; vanilla row pending jobs 74, 84.}}
\caption{Deploy hack and solve rate, mean$\pm$SEM. route2 over 3 seeds
(41/42/43); vanilla over 2 seeds (42/43) so far. 60-step fast preset,
Qwen3-4B, mix=0.125; deploy = knob-off, $n{=}64$, $T{=}0.7$.
\TODO{vanilla -> $n{=}3$ + paired test once job 77 (s41) lands.}}
\label{tab:keynote}
\begin{tabular}{lcc}
\toprule
Arm & Deploy hack & Deploy solve \\
\midrule
Vanilla GRPO & \TODO{$n{=}1$: 0.344} & \TODO{$n{=}1$: 0.484} \\
route2 (ours) & $0.031 \pm 0.031$ & $0.615 \pm 0.010$ \\
Vanilla GRPO ($n{=}2$) & $0.305 \pm 0.039$ & $0.516 \pm 0.032$ \\
route2 (ours, $n{=}3$) & $0.031 \pm 0.031$ & $0.615 \pm 0.010$ \\
\midrule
$\Delta$ vs vanilla & \TODO{after 74/84} & \TODO{after 74/84} \\
$\Delta$ vs vanilla & $-0.274$ & $+0.099$ \\
\bottomrule
\end{tabular}
\end{table}