From 46b102ad220954f0d5a59b51098e5c7775b93b47 Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Thu, 4 Jun 2026 02:29:23 +0000
Subject: [PATCH] eval: revert eval-every default 10->5 (knob-on removal made it cheap again)

The knob-on pass removal is the real win (halves each eval). With it gone,
every-5 on a 60-step run is ~18min more than every-10 but gives 12 deploy
points vs 6 -- better plots, cheap. No paper figure uses the knob-on train
curve (keynote+longrun plot deploy; the 2x2 train panel is diagnostic-only).
Long-run recipes pin sparse cadence explicitly so default-5 won't bite them.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 RESEARCH_JOURNAL.md         | 8 +++++++-
 src/projected_grpo/train.py | 9 +++++----
 2 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index 3f02bcd..02022fe 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -32,7 +32,13 @@ to find where the time goes. The `step N TIMING gen=.. fwd_bwd=.. reward=.. othe
   ~224s/step, ~16h -> ~12h each, ~8h saved across 103/104 (those override eval flags, so unaffected unless re-queued).
 - [decision] refresh default stays 5 (it is cheap; the value is a research knob not a speed knob).
-  Baking eval-every=10 + knob-on-final-only into train.py defaults so future runs inherit it.
 the weak detector (hacked_E) sees only run_tests; file_marker is the held-out payload
+- [decision] FINAL: the win is killing the per-step knob-ON pass (each eval one n=64 pass, not two);
+  that alone halves eval cost and no PAPER figure uses the knob-on train curve (keynote + longrun
+  both plot the deploy curve hk_dep; the train-vs-deploy 2x2 `plot_train_vs_deploy` is diagnostic-only,
+  not in main.tex). With knob-on gone, eval-every=5 is cheap again (~18min more than 10 on a 60-step
+  run, 12 vs 6 deploy points), so eval_ablate_every default REVERTED to 5 for nice short-run plots;
+  long-run recipes (paper-longrun=20, A5=10) pin sparse cadence explicitly. (Briefly set 10 then
+  reverted after the user noted 60-step plots want denser sampling.)
 the weak detector (hacked_E) sees only run_tests; file_marker is the held-out payload
 **Context:** Job 95 (A5 harvest, vanilla 4-mode, 40 steps, seed 41) finished. Harvested the student's own rollouts to (i) confirm which modes the weak live detector can flag and (ii) build
diff --git a/src/projected_grpo/train.py b/src/projected_grpo/train.py
index 8df3b71..2b6e57c 100644
--- a/src/projected_grpo/train.py
+++ b/src/projected_grpo/train.py
@@ -179,10 +179,11 @@ class Config:
     # subset -> the hack_deploy / solve_deploy columns (the dynamics-plot series for
     # route: the training-time hack curve still hacks; routing's benefit shows only
     # once the quarantine is ablated). 0 = off. eval_n_prompts x `group` samples.
-    # Default 10: each eval is ~460s (the single biggest discretionary cost; gen is
-    # ~140s/step and fixed). 6 deploy points over a 60-step run / 20 over 200 is
-    # plenty for the trajectory plot. See journal 2026-06-04 (a) for the cost audit.
-    eval_ablate_every: int = 10
+    # Default 5: gives 12 deploy points over the common 60-step run (nice trajectory
+    # plot). Affordable now that the per-step knob-ON eval pass is gone (each eval is
+    # one n=64 pass, ~230s, not two). Long-horizon recipes (paper-longrun, A5) pin a
+    # sparser cadence (10/20) explicitly. See journal 2026-06-04 (a) for the cost audit.
+    eval_ablate_every: int = 5
     eval_n_prompts: int = 8
     # Optional: pool-derived pairs JSON (built by pairs_from_pool.py). When set,
     # BOTH the cache-miss extract AND the online refresh use these pairs instead
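
Reviewer note: the cadence trade-off quoted in the commit message (12 vs 6 deploy points on a 60-step run, 20 points over 200 steps at every-10) can be checked with a small sketch. The helper names `eval_steps` and `extra_eval_minutes` are hypothetical and not part of `train.py`. The sketch assumes steps are counted from 1 and an eval fires whenever `step % every == 0`; the actual trigger in `train.py` may differ (for example, an extra eval at step 0 or at the final step). The per-eval cost is left as an input because the quoted timings are approximate.

```python
def eval_steps(total_steps: int, every: int) -> list[int]:
    """Steps at which a periodic eval would fire.

    Assumption: 1-indexed steps, firing when step % every == 0.
    every <= 0 disables the eval, mirroring the "0 = off" comment.
    """
    if every <= 0:
        return []
    return [s for s in range(1, total_steps + 1) if s % every == 0]


def extra_eval_minutes(
    total_steps: int, dense: int, sparse: int, secs_per_eval: float
) -> float:
    """Extra wall-clock minutes from using the denser cadence.

    secs_per_eval is an assumed, roughly constant per-eval cost.
    """
    n_dense = len(eval_steps(total_steps, dense))
    n_sparse = len(eval_steps(total_steps, sparse))
    return (n_dense - n_sparse) * secs_per_eval / 60.0


if __name__ == "__main__":
    for every in (5, 10):
        print(f"every={every}: {len(eval_steps(60, every))} deploy points on 60 steps")
    # Plug in whatever per-eval cost the TIMING logs report for the run in question.
    print(extra_eval_minutes(60, dense=5, sparse=10, secs_per_eval=230.0), "extra minutes")
```

The extra cost scales linearly with the per-eval time, so the estimate is only as good as the timing figure passed in.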