From 46b102ad220954f0d5a59b51098e5c7775b93b47 Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Thu, 4 Jun 2026 02:29:23 +0000
Subject: [PATCH] eval: revert eval-every default 10->5 (knob-on removal made
 it cheap again)

Removing the per-step knob-on eval pass is the real win: it halves the cost
of each eval. With that pass gone, eval-every=5 on a 60-step run costs ~18min
more than eval-every=10 but gives 12 deploy points instead of 6 -- better
plots for little cost. No paper figure uses the knob-on train curve (keynote
and longrun plot the deploy curve; the 2x2 train panel is diagnostic-only).
Long-run recipes pin a sparse cadence explicitly, so the default of 5 won't
affect them.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
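Reviewer note (not part of the commit): the cadence arithmetic quoted in the
message can be checked with a small sketch. It assumes an eval fires whenever
the 1-indexed step is a multiple of eval_ablate_every, which may not match the
exact trigger in train.py. The helper names `eval_steps` and
`extra_eval_minutes` are hypothetical, and the per-eval cost is an input
rather than a measured value.

```python
from typing import List


def eval_steps(total_steps: int, every: int) -> List[int]:
    """Steps at which an eval would fire (assumed rule: step % every == 0).

    every <= 0 is treated as "off", mirroring the "0 = off" note in Config.
    """
    if every <= 0:
        return []
    return [s for s in range(1, total_steps + 1) if s % every == 0]


def extra_eval_minutes(total_steps: int, dense: int, sparse: int,
                       secs_per_eval: float) -> float:
    """Extra wall-clock minutes of the denser cadence over the sparser one."""
    extra = len(eval_steps(total_steps, dense)) - len(eval_steps(total_steps, sparse))
    return extra * secs_per_eval / 60.0


if __name__ == "__main__":
    # Deploy points on a 60-step run at cadence 5 and 10.
    print(len(eval_steps(60, 5)), len(eval_steps(60, 10)))
    # Long-horizon example from the old comment: 200 steps at cadence 10.
    print(len(eval_steps(200, 10)))
```

Passing a measured per-eval time to `extra_eval_minutes(60, 5, 10, secs_per_eval)`
gives the added cost of the denser default on a 60-step run.
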
 RESEARCH_JOURNAL.md         | 8 +++++++-
 src/projected_grpo/train.py | 9 +++++----
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index 3f02bcd..02022fe 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -32,7 +32,13 @@ to find where the time goes. The `step N TIMING gen=.. fwd_bwd=.. reward=.. othe
   ~224s/step, ~16h -> ~12h each, ~8h saved across 103/104 (those override eval flags, so unaffected
   unless re-queued).
 - [decision] refresh default stays 5 (it is cheap; the value is a research knob not a speed knob).
-  Baking eval-every=10 + knob-on-final-only into train.py defaults so future runs inherit it. the weak detector (hacked_E) sees only run_tests; file_marker is the held-out payload
+- [decision] FINAL: the win is killing the per-step knob-ON pass (each eval one n=64 pass, not two);
+  that alone halves eval cost and no PAPER figure uses the knob-on train curve (keynote + longrun
+  both plot the deploy curve hk_dep; the train-vs-deploy 2x2 `plot_train_vs_deploy` is diagnostic-only,
+  not in main.tex). With knob-on gone, eval-every=5 is cheap again (~18min more than 10 on a 60-step
+  run, 12 vs 6 deploy points), so eval_ablate_every default REVERTED to 5 for nice short-run plots;
+  long-run recipes (paper-longrun=20, A5=10) pin sparse cadence explicitly. (Briefly set 10 then
+  reverted after the user noted 60-step plots want denser sampling.) the weak detector (hacked_E) sees only run_tests; file_marker is the held-out payload
 
 **Context:** Job 95 (A5 harvest, vanilla 4-mode, 40 steps, seed 41) finished. Harvested the
 student's own rollouts to (i) confirm which modes the weak live detector can flag and (ii) build
diff --git a/src/projected_grpo/train.py b/src/projected_grpo/train.py
index 8df3b71..2b6e57c 100644
--- a/src/projected_grpo/train.py
+++ b/src/projected_grpo/train.py
@@ -179,10 +179,11 @@ class Config:
     # subset -> the hack_deploy / solve_deploy columns (the dynamics-plot series for
     # route: the training-time hack curve still hacks; routing's benefit shows only
     # once the quarantine is ablated). 0 = off. eval_n_prompts x `group` samples.
-    # Default 10: each eval is ~460s (the single biggest discretionary cost; gen is
-    # ~140s/step and fixed). 6 deploy points over a 60-step run / 20 over 200 is
-    # plenty for the trajectory plot. See journal 2026-06-04 (a) for the cost audit.
-    eval_ablate_every: int = 10
+    # Default 5: gives 12 deploy points over the common 60-step run (nice trajectory
+    # plot). Affordable now that the per-step knob-ON eval pass is gone (each eval is
+    # one n=64 pass, ~230s, not two). Long-horizon recipes (paper-longrun, A5) pin a
+    # sparser cadence (10/20) explicitly. See journal 2026-06-04 (a) for the cost audit.
+    eval_ablate_every: int = 5
     eval_n_prompts: int = 8
     # Optional: pool-derived pairs JSON (built by pairs_from_pool.py). When set,
     # BOTH the cache-miss extract AND the online refresh use these pairs instead