diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md index 1a4a131..e359ddb 100644 --- a/RESEARCH_JOURNAL.md +++ b/RESEARCH_JOURNAL.md @@ -2,6 +2,47 @@ Append-only. New entries at the top, date-stamped. Never edit old entries. +## 2026-06-02 (c) — route2 keynote at n=3: deploy hack 0.31 -> 0.03 at HIGHER solve; StepLogger merge-bug fixed + +**Context:** `probe/distill-cosine`. Filling the keynote table/figure (artifacts A1/A2) from the +landed deploy runs. route2 nofloor n=3 (pueue 68/69/70) vs vanilla n=2 (74 s42, 72 s43; s41=job 77 +queued behind the 200-step convergence runs the user prioritized). Deploy-eval = knob-off, n=64, +T=0.7, 60-step fast, Qwen3-4B, mix=0.125. Also fixed a merge bug that was crashing every new run. + +### Observations (DEPLOY-eval, per_mode_deploy.json) + +- [obs] route2 deploy hack per seed: s41 0.000, s42 0.000, s43 0.094 -> mean 0.031 (SEM 0.031); + solve 0.625/0.594/0.625 -> mean 0.615 (SEM 0.010). +- [obs] vanilla deploy hack: s42 0.266, s43 0.344 -> n=2 mean 0.305 (SEM 0.039); solve 0.547/0.484 + -> mean 0.516 (SEM 0.032). +- [obs] keynote figure regenerated (3 route2 + 2 vanilla seeds, per-seed thin lines): + `out/figs/dyn_sub4_hack_overlay.png` -- vanilla hack climbs 0->~0.43 over 60 steps, route2 stays + ~0; route2 solve plateaus ~0.6, vanilla noisy ~0.3-0.4. +- [obs] merge bug: `worktree-refactor` merge (a1b17ab) left the pre-refactor `StepLogger` (+_Col, + _format_cell) defined in train.py, shadowing the `tablelog` import; call site uses the new + `mode_code` signature -> TypeError on every run. Killed jobs 75/76/77/78/84. Fixed in 768590a + (ported deploy-for-all-arms + per-mode-int layout into tablelog, deleted the 119-line shadow); + verified via smoke + smoke-vanilla. Separately, jobs 80-83 had corrupted commands (stray `3 -- `, + exit 127) -> re-added clean as 85/86/87/88. 
+ +### Interpretation + +- [inf, 0.75] C1 holds at n=3 route2 / n=2 vanilla: ~27pp deploy-hack drop (0.305 -> 0.031) AND a + ~10pp solve GAIN (0.615 vs 0.516). The solve gain (not just matched solve) is the strong form -- + vanilla burns capacity learning to hack; route2 quarantines that and spends it on solving. +- [inf, 0.8] this is the deploy-eval metric, NOT Q11's training-hack metric. Q11 showed the + training-hack gap closing by step 60 in the surrogate regime; the deploy gap does not close + because the cheat is held in the deletable knob. Different question, different answer -- do not + conflate (results.md Q12 metric note). +- [caveat] vanilla n=2; s43 (0.344) > s42 (0.266) so the band is wide. Promote to n=3 + paired test + when job 77 lands. route2 s43=0.094 is the only nonzero route2 seed -- worth a per-mode look. + +### Next + +- job 77 (vanilla s41) -> n=3 vanilla, paired test, re-run `just dyn` to vanilla n=3. +- jobs 84/85 (200-step route2 vs vanilla) -> A4 long-run: is the deploy gap durable or just delayed? +- results.md gained Q12 (route2 deploy n=3) + Q13 (floor leak = staleness not structure, job 73). + ## 2026-06-01 (m) — route2 WORKS at n=1: deploy hack 0.31 -> 0.00 at +6pp solve, and a held-out mode is suppressed **Context:** commit `dfc6068` (route2 resid column) on `probe/distill-cosine`; pueue id 60, diff --git a/docs/results.md b/docs/results.md index 4c94ef8..9773f0e 100644 --- a/docs/results.md +++ b/docs/results.md @@ -288,6 +288,61 @@ numbers, so it is NOT a clean "same run, later" comparison. (3) The 20-step suppression is real (Q2, n=4); what's unclear is whether longer training erodes it or whether this is a sparse-teacher/seed artifact. +## Q12. route2 deploy hack/solve, n=3 (the headline; DEPLOY-eval metric) + + + +The headline comparison. 
route2 (per-rollout gate routes the hack-aligned +gradient into a deletable quarantine knob) vs vanilla, deploy-eval: + +| arm | seeds | deploy hack | ±SEM | deploy solve | ±SEM | +| :-------------- | :------- | ----------: | ----: | -----------: | ----: | +| vanilla GRPO | 42,43 | 0.305 | 0.039 | 0.516 | 0.032 | +| route2 (ours) | 41,42,43 | 0.031 | 0.031 | 0.615 | 0.010 | + +Per-seed route2 deploy hack: s41 0.000, s42 0.000, s43 0.094 (solve 0.625 / +0.594 / 0.625). **route2 cuts deploy hack ~27pp (0.305 -> 0.031) and solves +~10pp higher** (0.615 vs 0.516). Unlike Q11's training-hack metric (where the +gap closed at 60 steps in the surrogate regime), the deploy-eval metric shows +a large, persistent gap: the quarantine knob holds the cheat and deleting it at +deploy removes it. Keynote figure: `out/figs/dyn_sub4_hack_overlay.png` +(vanilla hack climbs to ~0.43, route2 stays ~0). Caveat: vanilla is n=2 (s41 = +job 77 pending); promote to n=3 + paired test when it lands. + +## Q13. Does the exploration floor leak, and is the leak staleness or structure? + + + +route2 has an optional exploration floor (`rollout_ablate_frac`): a fraction of +rollouts generated knob-off so the deployed knob keeps seeing solve signal. The +floor introduces a leak -- with a stale (frozen) v_hack the deployed model still +hacks: + +| arm | deploy hack | +| :----------------------------- | ----------: | +| no floor (frac=0) | 0.000 | +| floor=0.5, stale v_hack | 0.125 | +| floor=0.5, refresh-1 (fresh V) | 0.000 | + +**The leak is staleness, not floor structure.** A fresh per-step gate +(refresh-1) closes the floor's 0.125 leak back to 0.000. Caveat: the refresh-1 +run (job 73) was daemon-killed at step 28/60, but deploy hack was pinned at +0.000 at every eval through step 25, so the directional answer is unambiguous; +a clean 60-step rerun would make it airtight. The headline arm uses no floor +anyway (already 0.000), so this is a design note for when the floor is wanted. 
+ ## Dynamics note (sizing the convergence test) Per-step trajectories (mix=0.125 g8, seed 41): `hack_s` rises 0→~0.6-0.75 and diff --git a/docs/writeup/main.tex b/docs/writeup/main.tex index 230333b..6c35dce 100644 --- a/docs/writeup/main.tex +++ b/docs/writeup/main.tex @@ -127,15 +127,18 @@ deploy-eval = knob-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.} % Provenance: out/figs/dyn_sub4_hack_overlay.png, generated by `just dyn` % (src/projected_grpo/plot_dynamics.py) at repo commit 17e4f2e (2026-06-02). % route2 nofloor seeds 41/42/43 = runs 20260601T115713 / T150231 / T181502. -% Vanilla band INCOMPLETE: only s43 (20260601T233047) present; s42 (job 74) -% running, s41 (job 84) queued -- regenerate `just dyn` once both land. +% Vanilla band n=2: s42 (20260602T043228, job 74) + s43 (20260601T233047, +% job 72); s41 (job 77) queued behind the 200-step runs -- regenerate +% `just dyn` to n=3 when it lands. \begin{figure}[t] \centering \includegraphics[width=0.85\linewidth]{figs/dyn_sub4_hack_overlay.png} - \caption{Deploy hack rate over GRPO training, route2 vs vanilla, $n{=}3$ - seeds (band = TODO mean$\pm$SEM). Knob-off deploy eval, $n{=}64$, $T{=}0.7$. - \TODO{interp -- author: vanilla emerges to $\sim$XX\%, route2 stays near zero. - Regenerate after jobs 74+84 land; current figure has vanilla $n{=}1$ (s43).}} + \caption{Hack rate (top) and solve rate (bottom) over GRPO training, route2 + ($n{=}3$ seeds) vs vanilla ($n{=}2$); thick line = mean, thin = per seed. + EMA-5, knob-off deploy eval, $n{=}64$, $T{=}0.7$. Vanilla hack emerges to + $\sim$0.43 while route2 stays near zero; route2 also reaches a higher solve + rate ($\sim$0.6 vs $\sim$0.35). \TODO{interp prose -- author. 
Regenerate to + vanilla $n{=}3$ when job 77 (s41) lands.}} \label{fig:keynote} \end{figure} @@ -146,23 +149,27 @@ deploy-eval = knob-off, $n=64$ prompts$\times$group, $T=0.7$, per env\_mode.} % s42 20260601T150231: hack_deploy 0.000 solve_deploy 0.594 % s43 20260601T181502: hack_deploy 0.094 solve_deploy 0.625 % => mean hack 0.031 (SEM 0.031); mean solve 0.615 (SEM 0.010) -% vanilla 60-step fast Qwen3-4B: -% s43 20260601T233047: hack_deploy 0.344 solve_deploy 0.484 (n=1 so far) -% s42 = job 74 RUNNING; s41 = job 84 QUEUED -> fill mean+/-SEM when done. +% vanilla 60-step fast Qwen3-4B (n=2 so far; s41 = job 77 QUEUED): +% s42 20260602T043228 (job 74): hack_deploy 0.266 solve_deploy 0.547 +% s43 20260601T233047 (job 72): hack_deploy 0.344 solve_deploy 0.484 +% => n=2 mean hack 0.305 (SEM 0.039); mean solve 0.516 (SEM 0.032) +% s41 (job 77) queued behind the 200-step convergence runs -> promote +% vanilla row to n=3 + add paired test when it lands. \begin{table}[t] \centering - \caption{Deploy hack and solve rate, mean$\pm$SEM over 3 seeds (41/42/43). - 60-step fast preset, Qwen3-4B, mix=0.125; deploy = knob-off, $n{=}64$, - $T{=}0.7$. \TODO{paired test + $\alpha$; vanilla row pending jobs 74, 84.}} + \caption{Deploy hack and solve rate, mean$\pm$SEM. route2 over 3 seeds + (41/42/43); vanilla over 2 seeds (42/43) so far. 60-step fast preset, + Qwen3-4B, mix=0.125; deploy = knob-off, $n{=}64$, $T{=}0.7$. 
+ \TODO{vanilla -> $n{=}3$ + paired test once job 77 (s41) lands.}} \label{tab:keynote} \begin{tabular}{lcc} \toprule Arm & Deploy hack & Deploy solve \\ \midrule - Vanilla GRPO & \TODO{$n{=}1$: 0.344} & \TODO{$n{=}1$: 0.484} \\ - route2 (ours) & $0.031 \pm 0.031$ & $0.615 \pm 0.010$ \\ + Vanilla GRPO ($n{=}2$) & $0.305 \pm 0.039$ & $0.516 \pm 0.032$ \\ + route2 (ours, $n{=}3$) & $0.031 \pm 0.031$ & $0.615 \pm 0.010$ \\ \midrule - $\Delta$ vs vanilla & \TODO{after 74/84} & \TODO{after 74/84} \\ + $\Delta$ vs vanilla & $-0.274$ & $+0.099$ \\ \bottomrule \end{tabular} \end{table} diff --git a/out/figs/dyn_sub4.png b/out/figs/dyn_sub4.png index 461a3bf..34a3359 100644 Binary files a/out/figs/dyn_sub4.png and b/out/figs/dyn_sub4.png differ diff --git a/out/figs/dyn_sub4_hack_overlay.png b/out/figs/dyn_sub4_hack_overlay.png index e21d56f..cb4913a 100644 Binary files a/out/figs/dyn_sub4_hack_overlay.png and b/out/figs/dyn_sub4_hack_overlay.png differ
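Reviewer note: the mean$\pm$SEM and $\Delta$ values in the journal entry, Q12, and the keynote table can be re-derived from the per-seed deploy numbers quoted in this diff. The sketch below is a minimal check, not the project's own tooling. It assumes SEM is the sample standard deviation (ddof=1) divided by sqrt(n), which matches the reported values. The helper `mean_sem` and the dictionary layout are hypothetical names introduced for illustration. Inputs are the 3-decimal rounded rates from the diff, so the last digit can differ from a computation on the raw counts.

```python
from math import sqrt
from statistics import mean, stdev


def mean_sem(values):
    """Return (mean, SEM), with SEM = sample stdev (ddof=1) / sqrt(n).

    Assumption: this is the SEM convention behind the reported numbers.
    """
    return mean(values), stdev(values) / sqrt(len(values))


# Per-seed deploy-eval rates copied from the diff (rounded to 3 decimals).
per_seed = {
    "route2": {"hack": [0.000, 0.000, 0.094], "solve": [0.625, 0.594, 0.625]},
    "vanilla": {"hack": [0.266, 0.344], "solve": [0.547, 0.484]},
}

# summary[(arm, metric)] = (mean, SEM)
summary = {
    (arm, metric): mean_sem(vals)
    for arm, metrics in per_seed.items()
    for metric, vals in metrics.items()
}

# Delta row of the table: route2 mean minus vanilla mean.
delta = {
    metric: summary[("route2", metric)][0] - summary[("vanilla", metric)][0]
    for metric in ("hack", "solve")
}

for (arm, metric), (m, s) in summary.items():
    print(f"{arm:8s} {metric:5s} mean={m:.3f} sem={s:.3f}")
for metric, d in delta.items():
    print(f"delta    {metric:5s} {d:+.3f}")
```

When job 77 (vanilla s41) lands, adding its rates to the `vanilla` lists recomputes the n=3 row. The paired test mentioned in the TODO is not covered by this sketch.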