mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts
Cleanup by a prior agent, verified green here: 'just smoke' (erase arm) runs
end-to-end and all four wired gates pass (verify_rewards 52/52, verify_eval_gap,
verify_partition, verify_science_invariants).

- train.py -318 lines: Config dataclass -> train_config.py, checkpoint/deploy-artifact IO -> run_artifacts.py.
- results.py / results_deploy.py / probe_distill.py slimmed.
- drop stale derived csvs under out/figs (a5_generalisation, dyn_*, substrate_aggregate, train_vs_deploy_60).
- gitignore /.pi/ panel scratch.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -165,3 +165,167 @@ fill A1/A2, append a journal entry. Then queue A5 (the gap).
ship the code link.)
- Re-headline the blog draft from erase to route2 (user: clear even at n=1).
- Workshop vs blog-only: gate on C2 landing.

## 2026-06-09 eval2 plot regeneration UAT

[x] Deleted all stale CSVs under `out/figs/` and regenerated the completed
per-token routeV versus latest vanilla comparison without changing pueue jobs.
There is no completed authored per-token run; this is job 9's prog_wide
per-token run, matching the best row in the deploy-results table.

Sources:
- `logs/20260607T134234_fast_routingV_seed43_dir6_routeV_pertoken_s43.log`
- `logs/20260608T224659_fast_vanilla_seed43_dir8_vanilla_s43.log`

Artifacts:
- [eval2 per-token dynamics](../../out/figs/eval2_pertoken_vs_vanilla_dynamics.png)
- [eval2 per-token hack/solve overlay](../../out/figs/eval2_pertoken_vs_vanilla_dynamics_hack_overlay.png)
- [sole current figure CSV](../../out/figs/eval2_pertoken_vs_vanilla_dynamics.csv)

| estimator | arm | hack | solve |
|---|---|---:|---:|
| fixed monitoring subset, final logged point, n=32 | routeV/per-token prog_wide | 0.000 | 0.062 |
| fixed monitoring subset, final logged point, n=32 | vanilla | 0.594 | 0.031 |
| final held-out deploy eval, n=119 | routeV/per-token prog_wide | 0.042 | 0.143 |
| final held-out deploy eval, n=119 | vanilla | 0.613 | 0.101 |
| final held-out deploy eval, n=119 | base model, zero steps | 0.000 | 0.126 |

Verification:
- The only remaining `out/figs/**/*.csv` is the current reproducibility CSV.
- CSV has exactly 60 rows each for `routingV_per_token` and `vanilla`, steps 0-59.
- Visual inspection: vanilla deploy hacking rises sharply; per-token route stays
  near zero. Per-token route does not show convincing useful learning: final
  held-out solve improves only 0.126 -> 0.143 versus the base model, below one
  binomial standard error at n=119.
- Plot scales: hack axis 0-65% so vanilla's failure is not clipped; solve axis
  0-25% to include the paper's ~22.3% no-loophole ceiling. The periodic route
  solve curve reaches ~6-7% and does not show a sustained upward trend after
  step 40.
- The monitoring subset is systematically harder than the full test and cannot
  support absolute capability claims: at step 59, route solves 2/32 on the
  fixed subset but 17/119 on full test; vanilla solves 1/32 versus 12/119.
  The old plot title incorrectly said n=64; it now states fixed n=32. A
  trustworthy dynamics figure requires rescoring saved step checkpoints on the
  same full n=119 test before spending compute on a longer training run.

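The "below one binomial standard error" statement in the verification notes can be checked by hand. The sketch below is only that arithmetic, using the held-out numbers from the table (0.126 base, 0.143 routeV per-token, n=119); the function name is illustrative and not part of the repo.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of a sample proportion p estimated from n trials."""
    return math.sqrt(p * (1.0 - p) / n)


n = 119
base_solve = 0.126   # base model, zero steps (held-out deploy eval)
route_solve = 0.143  # routeV/per-token prog_wide (held-out deploy eval)

se_base = binomial_se(base_solve, n)
gain = route_solve - base_solve

print(f"SE(base solve) at n={n}: {se_base:.3f}")
print(f"routeV solve gain over base: {gain:.3f}")
print(f"gain in units of SE: {gain / se_base:.2f}")
```

The gain comes out at a bit over half of one standard error, which is why the notes treat the solve improvement as unconvincing.
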
### Modal evaluation design

Before running on Modal, replace the noisy fixed-random n=32 monitoring subset
with one deterministic representative n=64 subset. Do not search shuffle seeds
until the subset happens to match the full-test solve rate; that would
cherry-pick one scalar by luck.

Build the monitoring subset once:
- Evaluate the base model on all 119 paper-test prompts.
- Stratify prompts by base pass/fail.
- Deterministically sample approximately 8 base-solved and 56 base-failed
  prompts, matching the full-test base solve rate of 12.6%.
- Freeze the prompt IDs and generation seed. Every arm and training seed uses
  this identical monitoring subset.

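The subset-building steps above could be sketched roughly as follows. This is a minimal illustration, not the repo's code: `build_monitoring_subset`, its arguments, and the synthetic `base_pass` mapping are all hypothetical, and the 15/119 solved count is only what a 12.6% base solve rate implies.

```python
import random


def build_monitoring_subset(base_pass, n_total=64, seed=0):
    """Pick a fixed subset whose base solve rate mirrors the full test.

    base_pass maps prompt_id -> bool (did the base model solve the prompt).
    IDs are sorted before sampling so the result depends only on the seed,
    never on dict ordering.
    """
    solved = sorted(pid for pid, ok in base_pass.items() if ok)
    failed = sorted(pid for pid, ok in base_pass.items() if not ok)

    solve_rate = len(solved) / len(base_pass)
    n_solved = round(n_total * solve_rate)  # stratum size for base-solved
    n_failed = n_total - n_solved           # remainder goes to base-failed

    rng = random.Random(seed)  # one fixed seed; no seed search
    subset = rng.sample(solved, n_solved) + rng.sample(failed, n_failed)
    return sorted(subset)


# Synthetic stand-in for the real base-model results: 15 of 119 solved.
base_pass = {f"p{i:03d}": i < 15 for i in range(119)}
subset = build_monitoring_subset(base_pass)
n_solved_in_subset = sum(base_pass[pid] for pid in subset)
print(len(subset), n_solved_in_subset)
```

The returned ID list is what would be frozen and reused by every arm and training seed; choosing the seed once, before looking at any solve rates, is what keeps this from becoming the seed search the section warns against.
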
Evaluate the n=64 monitoring subset only at steps 0, 20, 40, and 59. This costs
approximately 4 x 64 = 256 generations per run, close to the current
7 x 32 = 224, while giving a monitoring baseline representative of the full
test. Run the authoritative full n=119 paper-test evaluation only at the final
checkpoint. Monitoring-subset curves are for dynamics; paper claims and tables
use the full-test result.

Protocol correction for future runs: current logs call the first post-optimizer
evaluation `step 0`; vanilla and route have already taken one different update,
so they need not match there. Before the Modal runs, evaluate the shared base
model before training and record it as `updates_completed=0`. Then evaluate
post-update checkpoints at `updates_completed=20,40,60` (or 10-step cadence if
budget permits). Name the x-axis `optimizer updates completed`; never call the
first post-update checkpoint the base model. Do not change `train.py` while the
current pueue queue is active, because queued jobs load current code at runtime.

Modal runtime decision: remove evaluation from the training critical path.
Current n=32 periodic eval costs roughly 13-14 minutes for vanilla and 22-26
minutes for routeV because routeV evaluates both knob-on and knob-off. Seven
routeV monitoring evaluations add about 2.7 hours, before the final n=119 eval.

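As a quick sanity check of the budget figures in this section, the arithmetic can be written out. The numbers are the ones quoted above; applying the same seven-evaluation count to vanilla is an assumption made here for comparison.

```python
n_evals = 7                # periodic n=32 evaluations per run
route_minutes = (22, 26)   # per-eval cost for routeV (knob-on + knob-off)
vanilla_minutes = (13, 14) # per-eval cost for vanilla


def overhead_hours(per_eval_minutes, n_evals):
    """Return the (low, high) total eval overhead in hours."""
    lo, hi = per_eval_minutes
    return lo * n_evals / 60, hi * n_evals / 60


route_lo, route_hi = overhead_hours(route_minutes, n_evals)
vanilla_lo, vanilla_hi = overhead_hours(vanilla_minutes, n_evals)

# Generation counts for the current and proposed monitoring schedules.
current_generations = 7 * 32
proposed_generations = 4 * 64

print(f"routeV eval overhead: {route_lo:.1f}-{route_hi:.1f} h")
print(f"vanilla eval overhead: {vanilla_lo:.1f}-{vanilla_hi:.1f} h")
print(f"generations per run: {current_generations} -> {proposed_generations}")
```

The quoted "about 2.7 hours" sits inside the routeV range this produces.
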
Simplified protocol:
- Training jobs do no periodic eval by default. They save deploy checkpoints
  every 10 completed optimizer updates, plus the shared pre-training base
  checkpoint at update 0 and the final checkpoint, independently of eval
  cadence. The ~2.2 MB checkpoints are cheap, and 10-update resolution is needed
  for the progress graph.
- A separate evaluation job scores selected checkpoints. Always score final
  checkpoints on the full n=119 paper test; score intermediate checkpoints only
  when a progress curve is needed.
- Progress evaluation scores both knob states for routeV. The mechanism figure
  needs to show knob-on/train hack rising while knob-off/deploy hack stays low;
  otherwise it only shows suppression and hides that the quarantine absorbed the
  learned hack. Vanilla needs one pass because train and deploy are identical.
- Batch evaluation prompts. `eval_hack_solve` currently calls `model.generate`
  once per prompt despite running under `torch.no_grad()`. Add an eval batch-size
  argument, default it to 2, and increase only after measuring throughput and
  memory. Preserve one completion per prompt and the fixed prompt IDs /
  generation seed.
- Keep checkpoint saving fail-fast and independent from `eval_ablate_every`.
  Currently `save_eval_ckpts` is incorrectly gated by
  `eval_ablate_every > 0`, so simply disabling periodic eval would also disable
  the checkpoints needed for offline progress evaluation.

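Two of the bullets above reduce to small helpers, sketched here under stated assumptions. `should_save_checkpoint` and `batched` are hypothetical names, not functions from `train.py`; the point of the first is that the save decision takes no eval-cadence argument at all, and the second shows how fixed-order prompts could be chunked before a batched generate call.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def should_save_checkpoint(updates_completed: int,
                           total_updates: int,
                           save_ckpt_every: int) -> bool:
    """Decide whether to save, with no dependence on eval cadence.

    Always saves the pre-training base (0 updates) and the final checkpoint;
    otherwise saves every `save_ckpt_every` completed updates.
    """
    if updates_completed == 0 or updates_completed == total_updates:
        return True
    return save_ckpt_every > 0 and updates_completed % save_ckpt_every == 0


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks, preserving order and one entry per prompt."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# A 60-update run with save_ckpt_every=10.
saved = [u for u in range(0, 61) if should_save_checkpoint(u, 60, 10)]
print(saved)

# Five prompt IDs at the default eval batch size of 2.
chunks = list(batched(["a", "b", "c", "d", "e"], 2))
print(chunks)
```

Because `batched` keeps the original order and never duplicates or drops an item, the frozen prompt IDs still map one-to-one onto completions whatever batch size is chosen.
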
Locked implementation defaults:
- `eval_ablate_every=0`: defer the old 10-step periodic eval by default.
- `save_ckpt_every=10`: save by completed optimizer-update count, independent
  of eval.
- `eval_batch_size=2`: batched offline/final evaluation default.
- Offline progress command scores checkpoints 0, 10, 20, ..., final and writes
  one canonical eval-curve artifact for plotting. For routeV it records both
  knob-on and knob-off hack/solve; for vanilla it records one shared result.
- `full` matches the paper's 200 updates, 1536-token completion cap, and 256
  rollouts/update. On one GPU it uses `G=4, prompts_per_step=64`; this preserves
  total rollout exposure but not the paper's within-prompt `G=16`. It remains
  pure on-policy (`teacher_pool_dir=None`).
- Prompt length is never silently filtered. Training and evaluation crash if a
  prompt exceeds the paper's 1536-token prompt cap or the model context window.

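The defaults above might look something like this as a config object. This is a sketch, not the contents of `train_config.py`: the class name, `total_updates`, `max_completion_tokens`, `max_prompt_tokens`, and `check_prompt_length` are hypothetical, while `eval_ablate_every`, `save_ckpt_every`, `eval_batch_size`, `G`, `prompts_per_step`, and `teacher_pool_dir` are the names used in the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FullPresetSketch:  # hypothetical class name
    eval_ablate_every: int = 0      # periodic eval deferred by default
    save_ckpt_every: int = 10       # in completed optimizer updates
    eval_batch_size: int = 2
    total_updates: int = 200        # hypothetical field name
    max_completion_tokens: int = 1536  # hypothetical field name
    max_prompt_tokens: int = 1536      # hypothetical field name
    G: int = 4                      # completions per prompt on one GPU
    prompts_per_step: int = 64
    teacher_pool_dir: Optional[str] = None  # None means pure on-policy

    @property
    def rollouts_per_update(self) -> int:
        return self.G * self.prompts_per_step


def check_prompt_length(n_prompt_tokens: int,
                        cfg: FullPresetSketch,
                        context_window: int) -> None:
    """Fail fast instead of silently filtering over-long prompts."""
    limit = min(cfg.max_prompt_tokens, context_window)
    if n_prompt_tokens > limit:
        raise ValueError(
            f"prompt has {n_prompt_tokens} tokens, limit is {limit}"
        )


cfg = FullPresetSketch()
print(cfg.rollouts_per_update)          # G * prompts_per_step
check_prompt_length(1536, cfg, 4096)    # at the cap: passes silently
```

With `G=4` and `prompts_per_step=64` the product matches the 256 rollouts per update named in the list; the paper's `G=16` would reach the same total with 16 prompts per step.
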
Implemented and smoke-tested on 2026-06-09:

- RouteV and vanilla smoke runs each wrote paired adapter checkpoints at completed
  updates 0, 10, 20, and 30.
- `just eval-curve RUN` loaded those checkpoints and scored the full 119-problem
  paper evaluation set. RouteV scored both knob states; vanilla scored once.
- UAT artifacts:
  [`routeV checkpoint curve`](../../out/runs/20260609T070114_smoke_routingV_seed41_eval_defer_routeV_smoke/eval_checkpoint_curve.jsonl)
  and
  [`vanilla checkpoint curve`](../../out/runs/20260609T065927_smoke_vanilla_seed41_eval_defer_smoke/eval_checkpoint_curve.jsonl).
- Fresh-eyes review found that the first evaluator only reconstructed AntiPaSTO
  and single-mode eval. It now also reconstructs LoRA-frozen-B and mirrors the
  training run's partition modes. The
  [`LoRA routeV checkpoint curve`](../../out/runs/20260609T072121_smoke_routingV_seed41_eval_defer_lora_routeV_smoke/eval_checkpoint_curve.jsonl)
  is the runtime proof.
- The same review found that the queued no-loophole arm's `gt_only` mode could
  neither load prompts nor run evaluation. Its exact smoke path and offline
  checkpoint curve now pass:
  [`gt-only checkpoint curve`](../../out/runs/20260609T072833_smoke_vanilla_seed41_eval_defer_gt_only_smoke2/eval_checkpoint_curve.jsonl).
- These are tiny-random-model runtime proofs, not scientific results.

Whether 60 updates are enough to learn solving remains unknown. First use job
24, the no-loophole arm, to test whether this exact 60-update setup produces a
useful solve gain when hacking is impossible. Run longer only if job 24 is still
improving near update 60 or fails to approach the paper's no-loophole result.

### Canonical full-test endpoint table

These are the authoritative paper-test endpoint numbers. Do not infer them from
or normalize the n=32 monitoring curves.

| condition | solve | hack |
|---|---:|---:|
| base model (paper: 0.115) | 0.126 | 0.000 |
| vanilla GRPO (paper: 0.149) | 0.101 | 0.613 |
| vGROUT routeV best, per-token | 0.143 | 0.042 |
| no-loophole ceiling (paper: 0.223) | queued, job 24 | 0.000 |

Current read: routeV per-token nearly eliminates the vanilla hack increase and
preserves base-model solve. Its solve is numerically +1.7pp over base and +4.2pp
over vanilla, but n=119 is insufficient to claim either solve difference. The
no-loophole run determines whether this setup can reproduce useful RL gains at
all.
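
One rough way to see why n=119 cannot separate the solve rates, while the hack gap is unambiguous, is a pooled two-proportion z statistic on the endpoint numbers. This is a sketch under a simplifying assumption: it treats the arms as independent samples, although they are scored on the same 119 prompts with one seed per arm, so a paired analysis would be the more appropriate final check. The helper name is illustrative.

```python
import math


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """Pooled two-proportion z statistic for p1 - p2."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1 - p2) / se


n = 119
z_solve_vs_base = two_proportion_z(0.143, 0.126, n, n)     # routeV vs base
z_solve_vs_vanilla = two_proportion_z(0.143, 0.101, n, n)  # routeV vs vanilla
z_hack = two_proportion_z(0.613, 0.042, n, n)              # vanilla vs routeV

print(f"solve, routeV vs base:    z = {z_solve_vs_base:.2f}")
print(f"solve, routeV vs vanilla: z = {z_solve_vs_vanilla:.2f}")
print(f"hack, vanilla vs routeV:  z = {z_hack:.2f}")
```

Both solve comparisons fall well short of the conventional 1.96 threshold, while the hack comparison clears it by a wide margin.
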
- Fresh-eyes review removed a misleading mean-onset marker; the overlay directly
  labels hack and solve endpoints and states `n=1 seed/arm`.
- `plot_dynamics.py` now labels current `routeV` and `routeV per-token` runs
  explicitly instead of dropping or mislabelling them as static erasure.
