journal: #187 resolved -- vanilla-200 collapse was the hot preset, not long-horizon GRPO (job 97)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:45:42 +08:00 · 2026-06-03 20:23:41 +00:00
parent 6085efcc54
commit f8aea5f9e6
1 changed files with 40 additions and 0 deletions
@@ -2,6 +2,46 @@

 Append-only. New entries at the top, date-stamped. Never edit old entries.

+## 2026-06-03 (e) — #187 resolved: vanilla-200 collapse was the hot preset, not long-horizon GRPO
+
+**Context:** Job 97 (gentle-preset vanilla-200 collapse probe). Job 85 had collapsed
+(lp_s -0.6 -> -8 at step 90) on the fast preset (lr=3e-3, adam beta1=0.5, beta=0). H1: the
+collapse is over-optimization from the hot optimizer, not intrinsic to long-horizon vanilla.
+Probe re-ran vanilla-200 with a gentler step (lr=1e-3, adam 0.9/0.99; beta=0, i.e. no KL, so hacking is not suppressed).
+Cmd: `train fast --intervention=none --seed=41 --lr=1e-3 --adam-beta1=0.9 --adam-beta2=0.99
+--beta=0 --steps=200 --eval-ablate-every=20 --out-tag=_vanilla200_gentle_s41`.
+Log: [logs/20260603T104901_fast_vanilla_seed41_vanilla200_gentle_s41.log](../logs/20260603T104901_fast_vanilla_seed41_vanilla200_gentle_s41.log).
+
+### Result (UAT met)
+
+- [obs] lp_s stayed in [-0.47, -0.20] across ALL 200 steps -- never approached the -8 collapse
+  signature. Contrast job 85: same step count, hot preset, lp_s hit -8 at step 90.
+- [obs] training hack saturated: peaks 19-26/28 (step 188 = 26/28). UAT bar was >15/28. Met.
+- [obs] deploy hack (knob-off = trained model, n=64, T=0.7) rises over the horizon then plateaus:
+  s60 0.250 / s80 0.219 / s100 0.281 / s120 0.328 / s140 0.281 / s160 0.328 / s180 0.391 /
+  s199 0.344. Solve hovers 0.41-0.55 (s199 0.500). grad-norm flat ~1e-2 throughout.
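
A quick sanity check on the deploy-hack series above, as a minimal sketch. The values are copied from the bullet; the variable names and the choice of summary statistics (strict monotonicity, a least-squares slope, early vs. late mean) are assumptions made for illustration, not part of the project's tooling.

```python
# Deploy-hack rate at each evaluated checkpoint, copied from the entry above.
deploy_hack = {60: 0.250, 80: 0.219, 100: 0.281, 120: 0.328,
               140: 0.281, 160: 0.328, 180: 0.391, 199: 0.344}

steps = sorted(deploy_hack)
values = [deploy_hack[s] for s in steps]

# Strictly monotonic means every checkpoint exceeds the previous one.
is_monotonic = all(b > a for a, b in zip(values, values[1:]))

# Ordinary least-squares slope of hack rate against training step.
n = len(steps)
mean_step = sum(steps) / n
mean_value = sum(values) / n
slope = (
    sum((s - mean_step) * (v - mean_value) for s, v in zip(steps, values))
    / sum((s - mean_step) ** 2 for s in steps)
)

# Compare the first four checkpoints with the last four.
half = n // 2
early_mean = sum(values[:half]) / half
late_mean = sum(values[half:]) / (n - half)

print(f"strictly monotonic: {is_monotonic}")
print(f"OLS slope per step: {slope:.5f}")
print(f"early mean {early_mean:.3f} -> late mean {late_mean:.3f}")
```

The slope and the early/late means describe the overall direction; the monotonicity flag shows whether individual checkpoints ever dip, which matters when describing the drift in words.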
+
+### Interpretation
+
+- [reason] H1 confirmed: the job-85 collapse was caused by the hot optimizer preset (lr=3e-3,
+  adam beta1=0.5), NOT by long-horizon vanilla GRPO. A gentler step stays coherent to 200 while
+  still learning all four loopholes. The robust phenomenon is a noisy but sustained upward drift
+  toward hacking (deploy hack 0.17 -> ~0.34-0.39), not incoherence.
+- [reason] this strengthens the route2 story rather than weakening it: route2 holding deploy
+  hack ~0 to 200 (job 84) is a real suppression of a coherent, persistent hacking policy, not an
+  artifact of the baseline self-destructing.
+
+### Caveat for the A4 figure (#184) -- do NOT silently overlay mismatched optimizers
+
+- [check] job 84 (route2-200) used fast-preset defaults (hot optimizer, no --beta). Job 97
+  (vanilla-200) used the gentle preset (lr=1e-3, adam 0.9/0.99, beta=0). They are NOT
+  optimizer-matched. The canonical A4 recipe (justfile `paper-longrun`) uses beta=1e-5 KL for
+  BOTH arms to anchor coherence. Neither job 84 nor job 97 is that matched arm.
+- [check] therefore the rigorous A4 figure needs a matched pair, both beta=1e-5:
+  `just paper-longrun none 41` + `just paper-longrun route2 41`. Building the figure from
+  job 84 + job 97 would confound the route-vs-vanilla contrast with an optimizer difference.
+  Flagged for the author; not auto-built. Queued the matched pair at low prio (see below).
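
The optimizer-match check described in this caveat can be sketched as a small guard to run before overlaying two arms. This is a hypothetical helper, not something in the repo: the function name, the dict layout, and the key names are assumptions. Only hyperparameters stated in this entry are filled in; anything the entry does not state (job 84's adam beta2 and its default beta) is left out and is therefore reported as unverified rather than guessed.

```python
# Hypothetical guard: list optimizer settings that differ, or that cannot be
# verified, between two runs. Key names are illustrative assumptions.
OPTIMIZER_KEYS = ("lr", "adam_beta1", "adam_beta2", "beta")


def optimizer_mismatches(run_a, run_b, keys=OPTIMIZER_KEYS):
    """Return {key: (a, b)} for every key that differs or is missing."""
    diffs = {}
    for key in keys:
        a, b = run_a.get(key), run_b.get(key)
        # A missing value counts as a mismatch: it is unverified, not matched.
        if a is None or b is None or a != b:
            diffs[key] = (a, b)
    return diffs


# Job 84 (route2-200): hot preset values as stated in this entry; the rest
# is not stated here, so it is omitted on purpose.
job84 = {"lr": 3e-3, "adam_beta1": 0.5}

# Job 97 (vanilla-200): gentle preset, all four values stated in this entry.
job97 = {"lr": 1e-3, "adam_beta1": 0.9, "adam_beta2": 0.99, "beta": 0.0}

diffs = optimizer_mismatches(job84, job97)
for key, (a, b) in diffs.items():
    print(f"{key}: job84={a} job97={b}")
```

An empty result is the condition for a fair overlay; here every key is flagged, which is the confound the caveat warns about.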
+
 ## 2026-06-03 (d) — framing: post-hoc proves v_hack is WEAK, but weak is enough for routing (SGTM absorption)

 **Context:** Interpreting the post-hoc result (entry (c)) against the route success. Not a new