From a162eded655d285526bee5150ecada9e2915fc5a Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Thu, 4 Jun 2026 04:01:18 +0000
Subject: [PATCH] journal: job 97 gentle-probe refutes vanilla-200 collapse (lp_s>-0.5); A5 baseline sees held-out file_marker emerge on-policy

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 RESEARCH_JOURNAL.md | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index 02022fe..83df66f 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -2,6 +2,17 @@
 
 Append-only. New entries at the top, date-stamped. Never edit old entries.
 
+## 2026-06-04 (b) — job 97 gentle-probe: vanilla-200 does NOT collapse on stabilised preset; A5 baseline (job 103) sees held-out file_marker emerge on-policy
+
+**Job 97 (A4 vanilla-200 gentle collapse probe, #187).** Finished, succeeded.
+- [obs] lp_s stays in [-0.47, -0.27] across the whole run (min -0.47), step 199 lp_s=-0.30. Never dives toward -8.
+- [obs] training hack oscillates, peaks 19/28 at step 196; final HACK_STUDENT=0.288, PASS_RATE=0.279.
+- [verdict] H1 (KL/entropy collapse at long horizon) REFUTED for the stabilised preset. The earlier "vanilla collapses by step ~90" framing was the job-85 *hot*-preset artifact (mismatched beta), exactly as flagged in main.tex FIXMEs. The matched-beta long-run pair (jobs 100/101, beta=1e-5) is what the #184 figure should use; "collapses" framing drops.
+
+**Job 103 (A5 baseline, vanilla + `--teacher-modes run_tests`, #185).** Running, ~step 18 at this check.
+- [obs] held-out file_marker emerges ON-POLICY: hk_fm=3 at step 18, with a teacher that only ever demonstrates run_tests (hacked_E blind to file_marker, verified zero in the harvest E-by-mode table above). hk_rt leads (5 at step 13, first-hack ckpt there), file_marker follows ~5 steps later.
+- [reason] this satisfies job 103's resolve criterion ("file_marker deploy hack > 0, else emergence failed -> A5 inconclusive"). On-policy emergence is alive, so the A5 suppression test (job 104) has a real baseline to beat. Deploy-eval confirmation pending at step 20/30.
+
 ## 2026-06-04 (a) — per-step cost is gen + the 2x2 eval, NOT refresh; redesigning eval cadence
 
 **Context:** Job 99 (route2 nofloor refresh-2 staleness cell, #183) ran at ~4.3 min/step, far