journal: #187 resolved -- vanilla-200 collapse was the hot preset, not long-horizon GRPO (job 97)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-03 20:23:41 +00:00
parent 6085efcc54
commit f8aea5f9e6
+40
@@ -2,6 +2,46 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-06-03 (e) — #187 resolved: vanilla-200 collapse was the hot preset, not long-horizon GRPO
**Context:** Job 97 (gentle-preset vanilla-200 collapse probe). Job 85 had collapsed
(lp_s -0.6 -> -8 at step 90) on the fast preset (lr=3e-3, adam beta1=0.5, beta=0). H1: the
collapse is over-optimization from the hot optimizer, not intrinsic to long-horizon vanilla.
Probe re-ran vanilla-200 with a gentler step (lr=1e-3, adam 0.9/0.99, beta=0 to keep hacking).
Cmd: `train fast --intervention=none --seed=41 --lr=1e-3 --adam-beta1=0.9 --adam-beta2=0.99
--beta=0 --steps=200 --eval-ablate-every=20 --out-tag=_vanilla200_gentle_s41`.
Log: [logs/20260603T104901_fast_vanilla_seed41_vanilla200_gentle_s41.log](../logs/20260603T104901_fast_vanilla_seed41_vanilla200_gentle_s41.log).
### Result (UAT met)
- [obs] lp_s stayed in [-0.47, -0.20] across ALL 200 steps -- never approached the -8 collapse
signature. Contrast job 85: same step count, hot preset, lp_s hit -8 at step 90.
- [obs] training hack saturated: peaks 19-26/28 (step 188 = 26/28). UAT bar was >15/28. Met.
- [obs] deploy hack (knob-off = trained model, n=64, T=0.7) rises over the horizon then plateaus:
s60 0.250 / s80 0.219 / s100 0.281 / s120 0.328 / s140 0.281 / s160 0.328 / s180 0.391 /
s199 0.344. Solve hovers 0.41-0.55 (s199 0.500). grad-norm flat ~1e-2 throughout.
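
A quick way to sanity-check the "rises then plateaus" reading of the deploy-hack series is to tabulate the step-to-step changes. The sketch below only re-enters the eight numbers listed above by hand; the variable names are illustrative and nothing here reads the actual log.

```python
# Deploy-hack rate per eval step, copied by hand from the [obs] line above
# (job 97, knob-off, n=64, T=0.7). Not parsed from the log file.
deploy_hack = {
    60: 0.250, 80: 0.219, 100: 0.281, 120: 0.328,
    140: 0.281, 160: 0.328, 180: 0.391, 199: 0.344,
}

steps = sorted(deploy_hack)
values = [deploy_hack[s] for s in steps]

# Step-to-step changes: a strictly monotonic rise would have no negative deltas.
deltas = [later - earlier for earlier, later in zip(values, values[1:])]
n_down = sum(d < 0 for d in deltas)

# Coarse trend: compare the mean of the first four evals with the last four.
first_half_mean = sum(values[:4]) / 4
second_half_mean = sum(values[4:]) / 4
peak_step = max(deploy_hack, key=deploy_hack.get)

print(f"down-moves: {n_down} of {len(deltas)}")
print(f"mean s60-s120: {first_half_mean:.3f}, mean s140-s199: {second_half_mean:.3f}")
print(f"peak at step {peak_step}")
```

With n=64 per eval, individual dips of a few points are within sampling noise, so the half-vs-half comparison is the more informative of the two summaries.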
### Interpretation
- [reason] H1 confirmed: the job-85 collapse was caused by the hot optimizer preset (lr=3e-3,
adam beta1=0.5), NOT by long-horizon vanilla GRPO. A gentler step stays coherent to 200 while
still learning all four loopholes. The robust phenomenon is a noisy upward drift toward hacking
(deploy hack 0.17 -> ~0.34-0.39, with dips at s80, s140 and s199), not incoherence.
- [reason] this strengthens the route2 story rather than weakening it: route2 holding deploy
hack ~0 to 200 (job 84) is a real suppression of a coherent, persistent hacking policy, not an
artifact of the baseline self-destructing.
### Caveat for the A4 figure (#184) -- do NOT silently overlay mismatched optimizers
- [check] job 84 (route2-200) used fast-preset defaults (hot optimizer, no --beta). Job 97
(vanilla-200) used the gentle preset (lr=1e-3, adam 0.9/0.99, beta=0). They are NOT
optimizer-matched. The canonical A4 recipe (justfile `paper-longrun`) uses beta=1e-5 KL for
BOTH arms to anchor coherence. Neither job 84 nor job 97 is that matched arm.
- [check] therefore the rigorous A4 figure needs a matched pair, both beta=1e-5:
`just paper-longrun none 41` + `just paper-longrun route2 41`. Building the figure from
job 84 + job 97 would confound the route-vs-vanilla contrast with an optimizer difference.
Flagged for the author; not auto-built. Queued the matched pair at low prio (see below).
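
The optimizer-matching check above could be made mechanical before any figure is built. This is a minimal sketch with a hypothetical helper (`optimizer_mismatches`) and hand-entered settings; it assumes job 84 ran with the fast-preset values quoted for job 85 (lr=3e-3, adam beta1=0.5), and leaves as `None` anything this entry does not state. It does not read the real run configs.

```python
# Hypothetical helper: compare the optimizer settings of two runs and report
# which keys differ and which cannot be compared because a value is unrecorded.
def optimizer_mismatches(run_a, run_b, keys=("lr", "adam_beta1", "adam_beta2", "beta")):
    differing, unknown = [], []
    for key in keys:
        a, b = run_a.get(key), run_b.get(key)
        if a is None or b is None:
            unknown.append(key)      # not stated for at least one run
        elif a != b:
            differing.append(key)    # stated for both, and different
    return differing, unknown

# Hand-entered from this journal entry. None = not stated here (assumption, not a default).
job84_route2 = {"lr": 3e-3, "adam_beta1": 0.5, "adam_beta2": None, "beta": None}
job97_vanilla = {"lr": 1e-3, "adam_beta1": 0.9, "adam_beta2": 0.99, "beta": 0.0}

differing, unknown = optimizer_mismatches(job84_route2, job97_vanilla)
print("differ:", differing)
print("unrecorded:", unknown)

# A figure script could refuse to overlay the two arms unless both lists are empty.
arms_are_matched = not differing and not unknown
print("safe to overlay:", arms_are_matched)
```

Treating unrecorded values as blocking, not as matching, is deliberate: an unstated preset default is exactly how a confound slips into the overlay.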
## 2026-06-03 (d) — framing: post-hoc proves v_hack is WEAK, but weak is enough for routing (SGTM absorption)
**Context:** Interpreting the post-hoc result (entry (c)) against the route success. Not a new