evil_MoE/modal/README.md
wassname 98ceb38815 modal: rename launch entrypoint main->fanout (collides with app.py::main)
launch.py imports `app` from app.py, which registers app.py's @local_entrypoint
`main`; launch.py defining its own `main` raised InvalidError(Duplicate local
entrypoint). So launch.py had never actually run -- the earlier vanilla verify
was via app.py directly. Invoke: modal run modal/launch.py::fanout [--only N].

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 14:09:35 +00:00


Modal port — parallel GRPO runs

Fan the paper's GRPO sweep (jobs 124-135 of docs/spec/20260606_job_manifest.md) out as independent H100 containers instead of running them serially through pueue on the one 96GB box. The ~12 runs then finish in roughly one run's wall-clock time instead of ~2 days.

General Modal patterns/gotchas (reusable across projects) live in the global modal skill (~/.claude/skills/modal/SKILL.md); this dir is its worked example.

Files

  • app.py — image, Volume, and the train / warm / smoke GPU functions.
  • upload_inputs.py — push the gitignored run inputs (pairsets, vhack, pools) to the Volume. Run from a box that has them.
  • launch.py — fan out the 12-job inventory with .spawn().
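The fan-out shape that launch.py is described as using (submit every job up front, then handle each result as it completes) can be sketched locally. This is only a stand-in: it swaps Modal's `.spawn()` for `concurrent.futures` so it runs without a Modal account. The `JOBS` contents, the `train` stub, and the reading of `--only N` as "the first N jobs" are assumptions for illustration, not taken from launch.py.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical inventory: job id -> argv. The placeholder flags are illustrative only.
JOBS = {
    124: ["--placeholder-flag", "a"],
    125: ["--placeholder-flag", "b"],
    126: ["--placeholder-flag", "c"],
}


def train(job_id, argv):
    """Stand-in for the remote GPU function; it only echoes its inputs."""
    return {"job": job_id, "argv": list(argv), "status": "ok"}


def fanout(jobs, only=None):
    """Submit all selected jobs first, then collect results in completion order.

    `only` is ASSUMED to mean "the first N jobs of the inventory".
    """
    items = list(jobs.items())
    selected = items[:only] if only else items
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(selected))) as pool:
        # submit() plays the role that .spawn() plays on Modal: it returns a
        # handle immediately instead of blocking on the job.
        handles = {pool.submit(train, jid, argv): jid for jid, argv in selected}
        # as_completed() yields handles as jobs finish, which is where a real
        # launcher could pull that job's run dir and log down.
        for handle in as_completed(handles):
            results[handles[handle]] = handle.result()
    return results


if __name__ == "__main__":
    for job_id, result in sorted(fanout(JOBS).items()):
        print(job_id, result["status"])
```

Submitting everything before waiting on anything is what lets the slowest run, rather than the sum of all runs, set the wall-clock time.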

Design decisions (and why)

  • GPU = ["H100", "A100-80GB"] (80GB, fallback list). The full preset peaked at ~73GB in bf16 on the local card, so an 80GB card is required. H100 is ~1.5-2x faster than A100-80GB at ~1.6x the price, so the cost per run is about the same at roughly half the wall-clock. On a 12-way fan-out, H100 capacity can queue, so we fall back to A100-80GB: it runs the same Dao flash-attn wheel (which bundles sm_80), and deploy numbers are hardware-independent. Override per run with VGROUT_GPU=H200 if a long run OOMs.
  • torch 2.7, not the repo's pinned 2.8. Dao-AILab ships no cp313+torch2.8 flash-attn wheel; the flash-attn 2.8.3 line tops out at torch2.7 for cp313. The official Dao wheel bundles sm_80/86/90, so it runs on A100/H100, unlike the repo's Blackwell sm_120-only pin. This keeps train.py's hardcoded flash_attention_2 path working with zero patch to the research code.
  • No vllm, no causal-conv1d. Generation is HF .generate (nothing in src/vgrout imports vllm); causal-conv1d is only for Qwen3.5's gated-delta-net, and the model here is standard-attention Qwen3-4B.
  • One Volume vgrout-cache mounts at /cache and holds the HF model cache (hf/), the SVD basis cache (svd_cache/), and out/ (uploaded inputs + written out/runs/* artifacts). The model downloads once and the svd_cache computes once; every later container reuses both. train.py's relative paths (svd_cache/, out/, logs/) are symlinked onto the Volume from an ephemeral /work cwd.
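Two of the decisions above, the VGROUT_GPU override and the symlinked working directory, can be illustrated with a small standard-library sketch. It is not the code in app.py: the function names `resolve_gpu` and `link_workdir` are hypothetical, and how app.py actually reads VGROUT_GPU is an assumption. Only the directory names (svd_cache/, out/, logs/) and the default GPU list come from the notes above.

```python
import os
from pathlib import Path

# Default fallback list from the design notes.
DEFAULT_GPU = ["H100", "A100-80GB"]

# Relative paths that train.py writes to, per the design notes.
LINKED_DIRS = ("svd_cache", "out", "logs")


def resolve_gpu(env=os.environ):
    """Return the VGROUT_GPU override if set, else the default fallback list.

    ASSUMPTION: the override is a single GPU name such as "H200".
    """
    override = env.get("VGROUT_GPU", "").strip()
    return override if override else list(DEFAULT_GPU)


def link_workdir(cache_root, work_dir):
    """Symlink each relative dir in an ephemeral work_dir onto the Volume mount.

    After this, a process running with cwd=work_dir that writes to out/ is
    really writing under cache_root/out, so the files persist on the Volume.
    """
    cache_root, work_dir = Path(cache_root), Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    for name in LINKED_DIRS:
        target = cache_root / name
        target.mkdir(parents=True, exist_ok=True)
        link = work_dir / name
        if link.is_symlink():
            link.unlink()  # refresh a stale link so the call is repeatable
        # Raises FileExistsError if a real directory already sits at `link`;
        # that is acceptable for a freshly created, ephemeral work_dir.
        link.symlink_to(target, target_is_directory=True)


if __name__ == "__main__":
    print(resolve_gpu())
    link_workdir("/tmp/vgrout_cache_demo", "/tmp/vgrout_work_demo")
```

Keeping the redirection in symlinks rather than in path arguments is what lets the research code stay unpatched: train.py keeps using its relative paths and never needs to know about the Volume.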

One-time setup

pip install modal && modal token new      # interactive; you've done this
# Upload the gitignored INPUTS from the box that has them (the 96GB box):
python modal/upload_inputs.py              # pushes out/pairsets, out/vhack, out/pools
modal run modal/app.py --action warm       # download Qwen3-4B + build svd_cache once

upload_inputs.py skips dirs absent locally. The jobs need these on the Volume:

| input | needed by | present on dev box? |
| --- | --- | --- |
| out/pools/substrate, out/pools/teacher_pool | most jobs | yes (uploaded) |
| out/pairsets/prog_wide.json | FastConfig default (124, 127, 130, ...) | no, only on GPU box |
| out/pairsets/null_city.json | 128 (erase placebo) | no, only on GPU box |
| out/vhack/v_hack_a5_runtests.safetensors | 126, 133, 134 (A5) | no, only on GPU box |
| out/vhack/v_hack_pairset_prog_wide_randomV.safetensors | 125 (random-V) | no, only on GPU box |

So: run upload_inputs.py from the 96GB box to get the pairsets/vhack bases onto the Volume. (Some vhack bases auto-extract from their pairset if absent, but that costs ~5 min GPU per run; uploading the prebuilt ones is cheaper.)
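Before uploading, it can help to see which inputs the current box actually has. The sketch below mirrors the "skip dirs absent locally" behaviour and checks the paths listed in the table above. It is not upload_inputs.py itself: `plan_upload` and `missing_inputs` are hypothetical helper names, and the sketch performs no upload.

```python
from pathlib import Path

# Directories that upload_inputs.py is described as pushing.
INPUT_DIRS = ("out/pairsets", "out/vhack", "out/pools")

# Specific inputs the jobs need, taken from the table above.
REQUIRED = (
    "out/pools/substrate",
    "out/pools/teacher_pool",
    "out/pairsets/prog_wide.json",
    "out/pairsets/null_city.json",
    "out/vhack/v_hack_a5_runtests.safetensors",
    "out/vhack/v_hack_pairset_prog_wide_randomV.safetensors",
)


def plan_upload(repo_root, rel_dirs=INPUT_DIRS):
    """Split the input dirs into those present locally and those to skip."""
    repo_root = Path(repo_root)
    present, skipped = [], []
    for rel in rel_dirs:
        (present if (repo_root / rel).is_dir() else skipped).append(rel)
    return present, skipped


def missing_inputs(repo_root, required=REQUIRED):
    """List required inputs that do not exist under repo_root."""
    repo_root = Path(repo_root)
    return [rel for rel in required if not (repo_root / rel).exists()]


if __name__ == "__main__":
    present, skipped = plan_upload(".")
    print("would upload:", present)
    print("would skip:  ", skipped)
    print("missing:     ", missing_inputs("."))
```

Run from the repo root, a non-empty "missing" list on the dev box is the signal to run the upload from the 96GB box instead.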

Verify one run, then fan out

modal run modal/launch.py::fanout --only 1          # canary: seed-42 vanilla, confirm clean v2 FINAL EVAL
# compare its per_mode_deploy.json to the local-box artifact for the same args
modal run modal/launch.py::fanout                   # all 15 (5 arms x seeds 42/41/43)

Getting the outputs back

Every run writes its full artifact set to the Volume, mirroring the local layout:

  • out/runs/<ts>_<slug>/per_mode_deploy.json, train.safetensors, first_hack.safetensors, rollouts.jsonl, periodic ckpt_step*.safetensors
  • logs/<ts>_<slug>.log — the full verbose log

launch.py pulls each job's whole run dir + log down to the local out/runs/ and logs/ as it finishes (so they land exactly where train.py would have written them). For ad-hoc runs (warm/smoke/--argv) or a full re-sync:

python modal/fetch.py                 # all of out/runs + logs
python modal/fetch.py <ts>_<slug>     # one run

Caveat — keep the inventory fresh

launch.py::JOBS is copied verbatim from the 2026-06-06 manifest. The live plan has since evolved (135 → per-token ablation; 136/137 added; n=3 fan-out gated on the s43 control read). Refresh the argv map from the current manifest / pueue status before the real fan-out — it's just data.