evil_MoE/docs/spec/20260530_out_dir_reorg.md
wassname 969c724d9d docs+chore: out/ reorg scheme (queue-gated) + archive dead _OLD_step_format dirs
out/ is 25GB/195 loose files. Target: one subdir per datatype, per-run
artifacts under runs/<ts>_<slug>/. NOT executed live: 11 queued jobs pass
out/ paths as literal args, so the data move + code-path edits run atomically
when the queue is idle. Archived the unreferenced *_OLD_step_format dirs now.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 02:43:10 +00:00


out/ reorg — clean path scheme (by datatype, run-prefixed)

Goal

out/ is 25GB / 195 loose files: train_*.safetensors checkpoints, v_hack_*, vhack_grads_*, and a dozen probe_distill/teacher_pool* dirs all at top level. Sort by path: one subdir per datatype, with per-run artifacts grouped under a <timestamp>_<slug> run dir. Code reads and writes the new paths; existing outputs are moved to match.

Why this is NOT done live (the gate)

11 queued/running pueue jobs pass out/ paths as literal args (--v-hack-path=out/v_hack_*.safetensors, --teacher-pool-dir=out/probe_distill/teacher_pool, --pairs-from-pool=out/pairsets/*.json). Moving those files mid-queue breaks every job that hasn't started. So the data move + code-path edits run as ONE atomic change when the queue is idle (pueue status shows every task Done and nothing Queued). Until then only the unreferenced *_OLD_step_format dirs are archived (done 2026-05-30, moved to out/_archive/).
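The idle gate could be checked mechanically before the migration runs. The sketch below is one way to do it, not part of the repo: the function names are hypothetical, and it assumes `pueue status --json` returns a top-level `tasks` mapping whose entries carry a `status` that is either a plain string (`"Done"`) or a single-key object (`{"Done": {...}}`), which should be checked against the installed pueue version.

```python
import json
import subprocess


def status_name(status):
    """Return the state name for either encoding: "Done" or {"Done": {...}}."""
    if isinstance(status, str):
        return status
    if isinstance(status, dict) and len(status) == 1:
        return next(iter(status))
    raise ValueError(f"unrecognised status shape: {status!r}")


def queue_is_idle(state):
    """True when every task is Done; an empty task table also counts as idle."""
    tasks = state.get("tasks", {})
    return all(status_name(t["status"]) == "Done" for t in tasks.values())


def read_pueue_state():
    """Ask pueue for its state as JSON; raises if pueue is missing or fails."""
    result = subprocess.run(
        ["pueue", "status", "--json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

A migration wrapper would call `queue_is_idle(read_pueue_state())` and refuse to move anything when it returns False.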

Target scheme

out/
  vhack/        v_hack_*.safetensors            # extracted bases (flat, named)
  vhack_grads/  vhack_grads_*.safetensors       # raw per-pair grads (extract intermediates)
  pools/        <pool_name>/                     # teacher pools (was probe_distill/teacher_pool*)
  pairsets/     *.json                           # unchanged
  baked/        <variant>/                       # unchanged
  runs/<ts>_<slug>/  train.safetensors, first_hack.safetensors   # per-train-run
  _archive/     dead / superseded
  • runs/<ts>_<slug>/: checkpoints are currently out/train_<tag>.safetensors with no timestamp. Migration maps each to its log's <ts> via the matching logs/<ts>_*_<tag>.log and groups it into a run dir. New runs write here directly.
  • pools/: drop the probe_distill/ nesting (it was never about probes); flatten teacher_pool, base_pool, mixed_*, the teacher_pool_rl-* and teacher_pool_inoc-* variants into pools/<name>/.
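The checkpoint-to-run mapping described above can be sketched as follows. This is a minimal illustration under two assumptions that the spec does not state: the run dir name equals the matching log's file stem (so `<ts>_<slug>` is whatever precedes `.log`), and a checkpoint with zero or several matching logs is left unmapped. `run_dir_for` is a hypothetical helper name.

```python
from pathlib import Path


def run_dir_for(ckpt, logs_dir, out_root):
    """Map out/train_<tag>.safetensors to out/runs/<log stem>/, or None if unclear."""
    tag = ckpt.stem.removeprefix("train_")
    # A log matches when its stem ends in _<tag>, i.e. logs/<ts>_*_<tag>.log.
    matches = [p for p in sorted(logs_dir.glob("*.log")) if p.stem.endswith("_" + tag)]
    if len(matches) != 1:
        # Missing or ambiguous log: leave the checkpoint where it is and flag it.
        return None
    return out_root / "runs" / matches[0].stem
```

Returning None for ambiguous tags keeps the migration conservative: a checkpoint that cannot be tied to exactly one log stays at top level until someone resolves it by hand.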

Code edits (apply atomically with the data move)

  • train.py: checkpoint save path -> out/runs/<run_id>/{train,first_hack}.safetensors (run_id already built for the log name). --teacher-pool-dir default -> out/pools/teacher_pool. v_hack load path is an explicit arg (no default).
  • extract_vhack_grad.py: --out-path default -> out/vhack/<name>.safetensors; --train-grads-path default -> out/vhack_grads/<name>.safetensors.
  • probe_distill.py: pool write dir -> out/pools/<name>.
  • justfile: every recipe with out/v_hack_*, out/probe_distill/teacher_pool*, out/pairsets/* -> new paths. (These are the literal strings the queue captured, hence the idle-gate.)
  • scripts/results.py: vhack=grab(r"v-hack-path=out/(\S+?)\.safetensors") -> allow the vhack/ prefix (strip dir for the display name).
  • scripts/plot_dynamics.py: same v_hack path parse if it reads one.
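For the scripts/results.py change, one possible form of the pattern is sketched below. It makes the `vhack/` directory an optional non-capturing group, so both the old and new path layouts yield the same display name. `vhack_display_name` is a hypothetical stand-in for the script's own `grab` helper, whose exact signature is not shown in this spec.

```python
import re

# The optional (?:vhack/)? prefix accepts old and new layouts; group 1 is the bare name.
VHACK_RE = re.compile(r"v-hack-path=out/(?:vhack/)?(\S+?)\.safetensors")


def vhack_display_name(cmdline):
    """Return the v_hack basename without directory or extension, or None if absent."""
    m = VHACK_RE.search(cmdline)
    return m.group(1) if m else None
```

Accepting both layouts means logs written before the move still parse, which is what keeps the display names unchanged.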

Migration (data move, run when idle)

A script that defaults to dry-run and: (1) creates the new dirs; (2) moves (git mv or mv) loose v_hack_* -> vhack/ and vhack_grads_* -> vhack_grads/; (3) for each train_*.safetensors, finds its log <ts> and moves it to runs/<ts>_<slug>/; (4) moves probe_distill/*pool* -> pools/. It is idempotent, and anything it can't map stays put and is logged and flagged rather than silently dropped.
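A minimal sketch of the flat moves, steps (2) and (4), with dry-run as the default. The names `FLAT_RULES`, `plan_moves` and `migrate` are hypothetical, it uses plain `shutil.move` rather than `git mv`, and the per-run step (3) is omitted. Note that the `*pool*` glob from step (4) would not pick up the mixed_* dirs listed in the target scheme, so those would need their own rule.

```python
import shutil
from pathlib import Path

# (glob relative to out/, destination subdir); the patterns are taken from the spec.
FLAT_RULES = [
    ("v_hack_*.safetensors", "vhack"),
    ("vhack_grads_*.safetensors", "vhack_grads"),
    ("probe_distill/*pool*", "pools"),
]


def plan_moves(out_root):
    """List (src, dst) pairs for everything still sitting at an old path."""
    moves = []
    for pattern, dest in FLAT_RULES:
        for src in sorted(out_root.glob(pattern)):
            moves.append((src, out_root / dest / src.name))
    return moves


def migrate(out_root, dry_run=True):
    """Apply the plan; returns (moved_or_planned, skipped). Never overwrites."""
    done, skipped = [], []
    for src, dst in plan_moves(out_root):
        if dst.exists():
            skipped.append(src)  # destination taken: leave src in place and flag it
            continue
        if not dry_run:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
        done.append((src, dst))
    return done, skipped
```

Because the plan is rebuilt from whatever is still at an old path, a second run after a successful one finds nothing to move, which gives the idempotence the spec asks for.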

Verify

  • just smoke + just smoke-vanilla green (walks write paths).
  • just results still parses every run (vhack display names unchanged).
  • find out -maxdepth 1 -type f | wc -l ~ 0 (no loose top-level files).
  • A re-extract + a fast run write into vhack/ and runs/ respectively.
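The loose-file check can also be expressed in Python, which makes it easy to print the offenders instead of only counting them. This is a sketch equivalent in intent to the `find out -maxdepth 1 -type f` check; `loose_top_level_files` is a hypothetical name.

```python
from pathlib import Path


def loose_top_level_files(out_root):
    """Regular files directly under out/ (symlinks excluded, like find -type f)."""
    return sorted(
        p for p in out_root.iterdir() if p.is_file() and not p.is_symlink()
    )
```

An empty list corresponds to the count of 0 that the verification step expects.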

UAT

"out/ has one subdir per datatype; train artifacts live under runs/<ts>_<slug>/; nothing loose at top level; smoke + results + a fresh run all still work."