mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
eval: revert eval-every default 10->5 (knob-on removal made it cheap again)
The knob-on pass removal is the real win (halves each eval). With it gone, every-5 on a 60-step run is ~18min more than every-10 but gives 12 deploy points vs 6 -- better plots, cheap. No paper figure uses the knob-on train curve (keynote+longrun plot deploy; the 2x2 train panel is diagnostic-only). Long-run recipes pin sparse cadence explicitly so default-5 won't bite them.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
+7
-1
@@ -32,7 +32,13 @@ to find where the time goes. The `step N TIMING gen=.. fwd_bwd=.. reward=.. othe
~224s/step, ~16h -> ~12h each, ~8h saved across 103/104 (those override eval flags, so unaffected
unless re-queued).
- [decision] refresh default stays 5 (it is cheap; the value is a research knob not a speed knob).
Baking eval-every=10 + knob-on-final-only into train.py defaults so future runs inherit it. the weak detector (hacked_E) sees only run_tests; file_marker is the held-out payload
- [decision] FINAL: the win is killing the per-step knob-ON pass (each eval one n=64 pass, not two);
that alone halves eval cost and no PAPER figure uses the knob-on train curve (keynote + longrun
both plot the deploy curve hk_dep; the train-vs-deploy 2x2 `plot_train_vs_deploy` is diagnostic-only,
not in main.tex). With knob-on gone, eval-every=5 is cheap again (~18min more than 10 on a 60-step
run, 12 vs 6 deploy points), so eval_ablate_every default REVERTED to 5 for nice short-run plots;
long-run recipes (paper-longrun=20, A5=10) pin sparse cadence explicitly. (Briefly set 10 then
reverted after the user noted 60-step plots want denser sampling.)
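
The cadence arithmetic in the decision above can be checked with a small sketch. It assumes evals fire on every step that is a multiple of the cadence, up to and including the last step; the function names are hypothetical and are not taken from train.py. The per-eval cost is only what the "~18min more" figure implies, not a measured value.

```python
def eval_steps(total_steps: int, every: int) -> list[int]:
    """Steps at which an eval would fire (assumed: multiples of `every`, 1-indexed)."""
    return list(range(every, total_steps + 1, every))


def implied_minutes_per_eval(total_steps: int, dense: int, sparse: int,
                             extra_minutes: float) -> float:
    """Per-eval cost implied by the extra wall time of the denser cadence."""
    extra_evals = len(eval_steps(total_steps, dense)) - len(eval_steps(total_steps, sparse))
    return extra_minutes / extra_evals


if __name__ == "__main__":
    dense_points = len(eval_steps(60, 5))    # deploy points at eval-every=5
    sparse_points = len(eval_steps(60, 10))  # deploy points at eval-every=10
    per_eval = implied_minutes_per_eval(60, 5, 10, 18.0)
    print(dense_points, sparse_points, per_eval)
```

Under these assumptions a 60-step run yields 12 deploy points at cadence 5 and 6 at cadence 10, so the quoted ~18min difference corresponds to roughly 3 minutes per single-pass eval.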
**Context:** Job 95 (A5 harvest, vanilla 4-mode, 40 steps, seed 41) finished. Harvested the
student's own rollouts to (i) confirm which modes the weak live detector can flag and (ii) build