evil_MoE/modal
wassname 5419771d70 modal: there was no routeV hang -- it was local stdout buffering
Retract the "routeV deadlocks at first generate()" finding from d96367c. The
server-side `modal app logs` show the killed routeV smoke had actually run training
steps 0-3 (real rewards, ||delta_S_hack||=3.23, coherent generations) and was inside
the 24-prompt FINAL EVAL when I stopped it -- a deadlocked-at-first-generate process
cannot produce step 1/2/3 results. The "freeze" was the local `modal run > log`
capture block-buffering the subprocess stdout; the run was healthy the whole time.

Fix: PYTHONUNBUFFERED=1 in _run_train env so the local stream is live, and monitor
via `modal app logs <app-id>` (server-side truth). Corrected the app.py comment and
replaced the README "known issue" with the buffering gotcha. routeV runs fine on
Modal -- the routeV sweep is viable, no torch-2.7 debug needed.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 10:39:41 +08:00

Modal port — parallel GRPO runs

Fan the paper's GRPO sweep (jobs 124-135 of docs/spec/20260606_job_manifest.md) out as independent H100 containers instead of running them serially through pueue on the one 96GB box. ~12 runs finish in one run's wall-clock instead of ~2 days.

General Modal patterns/gotchas (reusable across projects) live in the global modal skill (~/.claude/skills/modal/SKILL.md); this dir is its worked example.

Files

  • app.py — image, Volume, and the train / warm / smoke GPU functions.
  • upload_inputs.py — push the gitignored run inputs (pairsets, vhack, pools) to the Volume. Run from a box that has them.
  • launch.py — fan out the 12-job inventory with .spawn().

Design decisions (and why)

  • GPU = ["H100", "A100-80GB"] (80GB, fallback list). The full preset peaked at ~73GB in bf16 on the local card, so an 80GB card is required. H100 is ~1.5-2x faster than A100-80GB for ~1.6x the price (roughly the same $/run at about half the wall-clock). On a 12-way fan-out, H100 capacity can queue, so we fall back to A100-80GB: it runs the same Dao flash-attn wheel (which bundles sm_80), and deploy numbers are hardware-independent. Override per run with VGROUT_GPU=H200 if a long run OOMs.
  • torch 2.7, not the repo's pinned 2.8. Dao-AILab ships no cp313+torch2.8 flash-attn wheel; the flash-attn 2.8.3 line tops out at torch2.7 for cp313. The official Dao wheel bundles sm_80/86/90, so it runs on A100/H100, unlike the repo's Blackwell sm_120-only pin. This keeps train.py's hardcoded flash_attention_2 path working with zero patches to the research code.
  • No vllm, no causal-conv1d. Generation is HF .generate (nothing in src/vgrout imports vllm); causal-conv1d is only for Qwen3.5's gated-delta-net, and the model here is standard-attention Qwen3-4B.
  • One Volume vgrout-cache mounts at /cache and holds the HF model cache (hf/), the SVD basis cache (svd_cache/), and out/ (uploaded inputs + written out/runs/* artifacts). The model downloads once and the svd_cache computes once; every later container reuses both. train.py's relative paths (svd_cache/, out/, logs/) are symlinked onto the Volume from an ephemeral /work cwd.
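
The symlink arrangement in the last bullet could look roughly like the sketch below. The function name `link_workdir` is hypothetical, and temporary directories stand in for the /cache mount and the /work cwd; the real logic lives in app.py and may differ.

```python
import tempfile
from pathlib import Path

# Relative dirs that train.py writes to (taken from the bullet above).
RELATIVE_DIRS = ("svd_cache", "out", "logs")


def link_workdir(volume_root: Path, workdir: Path) -> None:
    """Symlink each relative dir inside `workdir` onto persistent storage.

    Hypothetical helper: `volume_root` plays the role of /cache and
    `workdir` the role of the ephemeral /work cwd.
    """
    workdir.mkdir(parents=True, exist_ok=True)
    for name in RELATIVE_DIRS:
        target = volume_root / name
        target.mkdir(parents=True, exist_ok=True)
        link = workdir / name
        if link.is_symlink():
            link.unlink()  # refresh a stale link left by an earlier call
        link.symlink_to(target, target_is_directory=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        cache, work = Path(tmp) / "cache", Path(tmp) / "work"
        link_workdir(cache, work)
        # A write through the relative path lands on the "Volume" side.
        (work / "out" / "probe.txt").write_text("written via relative path")
        print((cache / "out" / "probe.txt").read_text())
```

With this layout the research code keeps using plain relative paths, and only the container entrypoint needs to know where the Volume is mounted.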

One-time setup

pip install modal && modal token new      # interactive; you've done this
# Upload the gitignored INPUTS from the box that has them (the 96GB box):
python modal/upload_inputs.py              # pushes out/pairsets, out/vhack, out/pools
modal run modal/app.py --action warm       # download Qwen3-4B + build svd_cache once

upload_inputs.py skips dirs absent locally. The jobs need these on the Volume:

| input | needed by | present on dev box? |
| --- | --- | --- |
| out/pools/substrate, out/pools/teacher_pool | most jobs | yes (uploaded) |
| out/pairsets/prog_wide.json | FastConfig default (124, 127, 130, ...) | no, only on GPU box |
| out/pairsets/null_city.json | 128 (erase placebo) | no, only on GPU box |
| out/vhack/v_hack_a5_runtests.safetensors | 126, 133, 134 (A5) | no, only on GPU box |
| out/vhack/v_hack_pairset_prog_wide_randomV.safetensors | 125 (random-V) | no, only on GPU box |

So: run upload_inputs.py from the 96GB box to get the pairsets/vhack bases onto the Volume. (Some vhack bases auto-extract from their pairset if absent, but that costs ~5 min GPU per run; uploading the prebuilt ones is cheaper.)
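
Before uploading, it can help to check which of the inputs listed above exist on the current box. The following is a minimal sketch, not part of the repo: `missing_inputs` is a hypothetical helper, and it assumes the paths are relative to the repo root.

```python
from pathlib import Path

# Inputs named in the table above, relative to the repo root (assumed).
REQUIRED_INPUTS = (
    "out/pools/substrate",
    "out/pools/teacher_pool",
    "out/pairsets/prog_wide.json",
    "out/pairsets/null_city.json",
    "out/vhack/v_hack_a5_runtests.safetensors",
    "out/vhack/v_hack_pairset_prog_wide_randomV.safetensors",
)


def missing_inputs(repo_root: Path) -> list[str]:
    """Return the required inputs that do not exist under `repo_root`."""
    return [rel for rel in REQUIRED_INPUTS if not (repo_root / rel).exists()]


if __name__ == "__main__":
    missing = missing_inputs(Path.cwd())
    for rel in missing:
        print(f"MISSING {rel}")
    if missing:
        print(f"{len(missing)} input(s) absent locally; they will not reach the Volume from this box")
    else:
        print("all listed inputs present")
```

Running it from the repo root on each box shows at a glance whether that box is the right one to upload from.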

Verify one run, then fan out

modal run modal/launch.py::fanout --only 1          # canary: seed-42 vanilla, confirm clean v2 FINAL EVAL
# compare its per_mode_deploy.json to the local-box artifact for the same args
modal run modal/launch.py::fanout                   # all 15 (5 arms x seeds 42/41/43)
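
The canary comparison against the local-box artifact could be done with a small recursive diff like the one below. This is a sketch under assumptions: the schema of per_mode_deploy.json is not specified here, so the helper (hypothetical name `diff_json`) handles arbitrary JSON, and the numeric tolerances are illustrative defaults rather than values taken from the repo.

```python
import json
import math
import sys
from pathlib import Path


def diff_json(a, b, path="", rel_tol=1e-3, abs_tol=1e-9):
    """Return human-readable differences between two parsed JSON values."""
    if isinstance(a, dict) and isinstance(b, dict):
        diffs = []
        for key in sorted(set(a) | set(b)):
            sub = f"{path}.{key}" if path else str(key)
            if key not in a:
                diffs.append(f"{sub}: only in second")
            elif key not in b:
                diffs.append(f"{sub}: only in first")
            else:
                diffs.extend(diff_json(a[key], b[key], sub, rel_tol, abs_tol))
        return diffs
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return [f"{path}: list length {len(a)} != {len(b)}"]
        diffs = []
        for i, (x, y) in enumerate(zip(a, b)):
            diffs.extend(diff_json(x, y, f"{path}[{i}]", rel_tol, abs_tol))
        return diffs
    both_numbers = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in (a, b)
    )
    if both_numbers:
        close = math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
        return [] if close else [f"{path}: {a} != {b}"]
    return [] if a == b else [f"{path}: {a!r} != {b!r}"]


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python this_script.py <modal per_mode_deploy.json> <local per_mode_deploy.json>
    first = json.loads(Path(sys.argv[1]).read_text())
    second = json.loads(Path(sys.argv[2]).read_text())
    for line in diff_json(first, second) or ["no differences within tolerance"]:
        print(line)
```

Pass `rel_tol=0` and `abs_tol=0` for an exact comparison if the two artifacts are expected to match to the last digit.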

Getting the outputs back

Every run writes its full artifact set to the Volume, mirroring the local layout:

  • out/runs/<ts>_<slug>/per_mode_deploy.json, train.safetensors, first_hack.safetensors, rollouts.jsonl, periodic ckpt_step*.safetensors
  • logs/<ts>_<slug>.log — the full verbose log

launch.py pulls each job's whole run dir + log down to the local out/runs/ and logs/ as it finishes (so they land exactly where train.py would have written them). For ad-hoc runs (warm/smoke/--argv) or a full re-sync:

python modal/fetch.py                 # all of out/runs + logs
python modal/fetch.py <ts>_<slug>     # one run

Gotcha — monitor server-side, the local stream block-buffers

There was no routeV deadlock. The earlier "routeV freezes at the first generate()" report was a monitoring artifact: redirecting modal run ... > local.log captures the subprocess stdout block-buffered, so the local file sits at the first generate() warning while the run is actually progressing. A routeV smoke that I killed at ~11 min, thinking it had hung, had in fact completed training steps 0-3 (real rewards, ||delta_S_hack||=3.23, coherent generations) and was inside the 24-prompt FINAL EVAL; the server-side modal app logs <app-id> showed all of it. I killed a healthy run and built a false "torch-2.7 routeV deadlock" theory on the buffering.

Two remedies: _run_train now sets PYTHONUNBUFFERED=1 in its env so the local stream is live, and monitoring should always go through modal app logs <app-id> (server-side truth) rather than the local modal run capture.
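
The buffering effect is easy to reproduce locally without Modal. The sketch below is an illustration only (`stream_child` and the toy child script are hypothetical, not code from app.py): it runs a child Python process with its stdout attached to a pipe, once with PYTHONUNBUFFERED=1 and once without, and records when each line reaches the parent.

```python
import os
import subprocess
import sys
import time

# Toy stand-in for a training loop: prints a line, then works for a while.
CHILD = (
    "import time\n"
    "for i in range(3):\n"
    "    print(f'step {i}')\n"
    "    time.sleep(0.3)\n"
)


def stream_child(unbuffered: bool):
    """Run the child with piped stdout; return (arrival_seconds, line) pairs."""
    env = dict(os.environ)
    env.pop("PYTHONUNBUFFERED", None)  # start from a known state
    if unbuffered:
        env["PYTHONUNBUFFERED"] = "1"
    start = time.monotonic()
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdout=subprocess.PIPE,
        text=True,
        env=env,
    )
    arrivals = []
    for line in proc.stdout:  # yields lines as soon as the child flushes them
        arrivals.append((time.monotonic() - start, line.strip()))
    proc.wait()
    return arrivals


if __name__ == "__main__":
    for flag in (True, False):
        print(f"PYTHONUNBUFFERED set: {flag}")
        for seconds, line in stream_child(flag):
            print(f"  {seconds:5.2f}s  {line}")
```

Without the variable, the child's stdout is not a terminal, so Python block-buffers it and all lines typically arrive together when the process exits; with it, each line arrives as it is printed. A long run observed through a redirect looks frozen for the same reason.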

Correction to commit 2873b37: its message claimed "the smoke on pinned 5.10.2 clears the deadlock point." There was no deadlock to clear; the framing was wrong. routeV and vanilla both run generate() fine on this image.

Caveat — keep the inventory fresh

launch.py::JOBS is copied verbatim from the 2026-06-06 manifest. The live plan has since evolved (135 → per-token ablation; 136/137 added; n=3 fan-out gated on the s43 control read). Refresh the argv map from the current manifest / pueue status before the real fan-out — it's just data.