diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index 093d97c..edab772 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -2,6 +2,46 @@
 
 Append-only. New entries at the top, date-stamped. Never edit old entries.
 
+## 2026-06-03 (e) — #187 resolved: vanilla-200 collapse was the hot preset, not long-horizon GRPO
+
+**Context:** Job 97 (gentle-preset vanilla-200 collapse probe). Job 85 had collapsed
+(lp_s -0.6 -> -8 at step 90) on the fast preset (lr=3e-3, adam beta1=0.5, beta=0). H1: the
+collapse is over-optimization from the hot optimizer, not intrinsic to long-horizon vanilla.
+Probe re-ran vanilla-200 with a gentler step (lr=1e-3, adam 0.9/0.99, beta=0 to keep hacking).
+Cmd: `train fast --intervention=none --seed=41 --lr=1e-3 --adam-beta1=0.9 --adam-beta2=0.99
+--beta=0 --steps=200 --eval-ablate-every=20 --out-tag=_vanilla200_gentle_s41`.
+Log: [logs/20260603T104901_fast_vanilla_seed41_vanilla200_gentle_s41.log](../logs/20260603T104901_fast_vanilla_seed41_vanilla200_gentle_s41.log).
+
+### Result (UAT met)
+
+- [obs] lp_s stayed in [-0.47, -0.20] across ALL 200 steps -- never approached the -8 collapse
+  signature. Contrast job 85: same step count, hot preset, lp_s hit -8 at step 90.
+- [obs] training hack saturated: peaks 19-26/28 (step 188 = 26/28). UAT bar was >15/28. Met.
+- [obs] deploy hack (knob-off = trained model, n=64, T=0.7) rises over the horizon then plateaus:
+  s60 0.250 / s80 0.219 / s100 0.281 / s120 0.328 / s140 0.281 / s160 0.328 / s180 0.391 /
+  s199 0.344. Solve hovers 0.41-0.55 (s199 0.500). grad-norm flat ~1e-2 throughout.
+
+### Interpretation
+
+- [reason] H1 confirmed: the job-85 collapse was caused by the hot optimizer preset (lr=3e-3,
+  adam beta1=0.5), NOT by long-horizon vanilla GRPO. A gentler step stays coherent to 200 while
+  still learning all four loopholes. The robust phenomenon is a noisy upward drift toward hacking
+  (deploy hack 0.17 -> ~0.34-0.39), not incoherence.
+- [reason] this strengthens the route2 story rather than weakening it: route2 holding deploy
+  hack ~0 to 200 (job 84) is a real suppression of a coherent, persistent hacking policy, not an
+  artifact of the baseline self-destructing.
+
+### Caveat for the A4 figure (#184) -- do NOT silently overlay mismatched optimizers
+
+- [check] job 84 (route2-200) used fast-preset defaults (hot optimizer, no --beta). Job 97
+  (vanilla-200) used the gentle preset (lr=1e-3, adam 0.9/0.99, beta=0). They are NOT
+  optimizer-matched. The canonical A4 recipe (justfile `paper-longrun`) uses beta=1e-5 KL for
+  BOTH arms to anchor coherence. Neither job 84 nor job 97 is that matched arm.
+- [check] therefore the rigorous A4 figure needs a matched pair, both beta=1e-5:
+  `just paper-longrun none 41` + `just paper-longrun route2 41`. Building the figure from
+  job 84 + job 97 would confound the route-vs-vanilla contrast with an optimizer difference.
+  Flagged for the author; not auto-built. Queued the matched pair at low prio (see below).
+
 ## 2026-06-03 (d) — framing: post-hoc proves v_hack is WEAK, but weak is enough for routing (SGTM absorption)
 
 **Context:** Interpreting the post-hoc result (entry (c)) against the route success. Not a new