eval: revert eval-every default 10->5 (knob-on removal made it cheap again)

The knob-on pass removal is the real win (halves each eval). With it gone, every-5 on a 60-step run is ~18min more than every-10 but gives 12 deploy points vs 6 -- better plots, cheap. No paper figure uses the knob-on train curve (keynote+longrun plot deploy; the 2x2 train panel is diagnostic-only). Long-run recipes pin sparse cadence explicitly so default-5 won't bite them.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:45:42 +08:00 · 2026-06-04 02:29:23 +00:00
parent 208713d7c2
commit 46b102ad22
2 changed files with 12 additions and 5 deletions
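
The cadence arithmetic in the commit message (12 vs 6 deploy points on a 60-step run, ~18min extra) can be checked with a minimal sketch. It assumes evals fire on every step that is a multiple of the cadence, with no step-0 eval; `eval_steps` is a hypothetical helper, not a function from train.py, and the per-eval figure is only what the quoted numbers imply.

```python
def eval_steps(total_steps: int, every: int) -> list[int]:
    """Steps at which an eval would fire.

    Assumption: evals run on 1-indexed steps that are multiples of `every`.
    """
    return [s for s in range(1, total_steps + 1) if s % every == 0]


dense = eval_steps(60, 5)    # default cadence after this commit
sparse = eval_steps(60, 10)  # the briefly-set cadence

extra_evals = len(dense) - len(sparse)
extra_minutes = 18  # approximate figure quoted in the commit message
per_eval_minutes = extra_minutes / extra_evals  # implied cost of one eval

print(len(dense), len(sparse), per_eval_minutes)
```

Under these assumptions each additional eval costs roughly 3 minutes, which is the sense in which every-5 is "cheap" for short runs.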
@@ -32,7 +32,13 @@ to find where the time goes. The `step N TIMING gen=.. fwd_bwd=.. reward=.. othe
  ~224s/step, ~16h -> ~12h each, ~8h saved across 103/104 (those override eval flags, so unaffected
  unless re-queued).
 - [decision] refresh default stays 5 (it is cheap; the value is a research knob not a speed knob).
-  Baking eval-every=10 + knob-on-final-only into train.py defaults so future runs inherit it. the weak detector (hacked_E) sees only run_tests; file_marker is the held-out payload
+- [decision] FINAL: the win is killing the per-step knob-ON pass (each eval one n=64 pass, not two);
+  that alone halves eval cost and no PAPER figure uses the knob-on train curve (keynote + longrun
+  both plot the deploy curve hk_dep; the train-vs-deploy 2x2 `plot_train_vs_deploy` is diagnostic-only,
+  not in main.tex). With knob-on gone, eval-every=5 is cheap again (~18min more than 10 on a 60-step
+  run, 12 vs 6 deploy points), so eval_ablate_every default REVERTED to 5 for nice short-run plots;
+  long-run recipes (paper-longrun=20, A5=10) pin sparse cadence explicitly. (Briefly set 10 then
+  reverted after the user noted 60-step plots want denser sampling.) the weak detector (hacked_E) sees only run_tests; file_marker is the held-out payload

 **Context:** Job 95 (A5 harvest, vanilla 4-mode, 40 steps, seed 41) finished. Harvested the
 student's own rollouts to (i) confirm which modes the weak live detector can flag and (ii) build