eval: revert eval-every default 10->5 (knob-on removal made it cheap again)

The knob-on pass removal is the real win (halves each eval). With it gone,
every-5 on a 60-step run is ~18min more than every-10 but gives 12 deploy
points vs 6 -- better plots, cheap. No paper figure uses the knob-on train
curve (keynote+longrun plot deploy; the 2x2 train panel is diagnostic-only).
Long-run recipes pin sparse cadence explicitly so default-5 won't bite them.
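
The point counts and the implied per-eval cost above can be checked with a minimal sketch. It assumes an eval fires on every step that is a multiple of the cadence (step 0 excluded) and that each eval costs about the same; `eval_steps` is a hypothetical helper for illustration, not a function from train.py.

```python
def eval_steps(total_steps: int, every: int) -> list[int]:
    """Steps at which an eval fires, assuming one eval per multiple of `every`.

    Hypothetical helper: the real scheduling in train.py may differ
    (e.g. an extra eval at step 0 or at the final step).
    """
    return list(range(every, total_steps + 1, every))


TOTAL_STEPS = 60       # short-run length quoted in the message
EXTRA_MINUTES = 18.0   # "~18min more" for every-5 vs every-10

dense = eval_steps(TOTAL_STEPS, 5)    # the restored default cadence
sparse = eval_steps(TOTAL_STEPS, 10)  # the briefly used sparse cadence

extra_evals = len(dense) - len(sparse)
# Implied cost of one eval, assuming uniform cost per eval.
minutes_per_eval = EXTRA_MINUTES / extra_evals

print(len(dense), len(sparse), minutes_per_eval)
```

Under those assumptions the 6 extra evals work out to roughly 3 minutes each, which is the sense in which the denser cadence is cheap on a 60-step run.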

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-04 02:29:23 +00:00
parent 208713d7c2
commit 46b102ad22
2 changed files with 12 additions and 5 deletions
+7 -1
@@ -32,7 +32,13 @@ to find where the time goes. The `step N TIMING gen=.. fwd_bwd=.. reward=.. othe
~224s/step, ~16h -> ~12h each, ~8h saved across 103/104 (those override eval flags, so unaffected
unless re-queued).
- [decision] refresh default stays 5 (it is cheap; the value is a research knob not a speed knob).
Baking eval-every=10 + knob-on-final-only into train.py defaults so future runs inherit it.
- [decision] FINAL: the win is killing the per-step knob-ON pass (each eval one n=64 pass, not two);
that alone halves eval cost and no PAPER figure uses the knob-on train curve (keynote + longrun
both plot the deploy curve hk_dep; the train-vs-deploy 2x2 `plot_train_vs_deploy` is diagnostic-only,
not in main.tex). With knob-on gone, eval-every=5 is cheap again (~18min more than 10 on a 60-step
run, 12 vs 6 deploy points), so eval_ablate_every default REVERTED to 5 for nice short-run plots;
long-run recipes (paper-longrun=20, A5=10) pin sparse cadence explicitly. (Briefly set 10 then
reverted after the user noted 60-step plots want denser sampling.)

the weak detector (hacked_E) sees only run_tests; file_marker is the held-out payload
**Context:** Job 95 (A5 harvest, vanilla 4-mode, 40 steps, seed 41) finished. Harvested the
student's own rollouts to (i) confirm which modes the weak live detector can flag and (ii) build