# Modal port: parallel GRPO runs

Fan the paper's GRPO sweep (jobs 124-135 of `docs/spec/20260606_job_manifest.md`) out as independent H100 containers instead of running them serially through pueue on the one 96GB box. ~12 runs finish in one run's wall-clock instead of ~2 days.

General Modal patterns/gotchas (reusable across projects) live in the global `modal` skill (`~/.claude/skills/modal/SKILL.md`); this dir is its worked example.

## Files

- `app.py`: image, Volume, and the `train` / `warm` / `smoke` GPU functions.
- `upload_inputs.py`: push the gitignored run inputs (pairsets, vhack, pools) to the Volume. Run from a box that has them.
- `launch.py`: fan out the 12-job inventory with `.spawn()`.

## Design decisions (and why)

- **GPU = `["H100", "A100-80GB"]` (80GB, fallback list).** The full preset peaked ~73GB bf16 on the local card, so an 80GB card is required. H100 is ~1.5-2x faster than A100-80 for ~1.6x the price (≈ same $/run, half the wall-clock). On a 12-way fan-out H100 capacity can queue, so we fall back to A100-80GB: it runs the same Dao flash-attn wheel (bundles sm_80) and deploy numbers are hardware-independent. Override per-run with `VGROUT_GPU=H200` if a long run OOMs.
- **torch 2.7, not the repo's pinned 2.8.** Dao-AILab ships no cp313+torch2.8 flash-attn wheel; the 2.8.3 line tops out at torch2.7 for cp313. The official Dao wheel bundles sm_80/86/90 so it runs on A100/H100, unlike the repo's Blackwell sm_120-only pin. This keeps train.py's hardcoded `flash_attention_2` path working with **zero patch to the research code**.
- **No vllm, no causal-conv1d.** Generation is HF `.generate` (nothing in `src/vgrout` imports vllm); causal-conv1d is only for Qwen3.5's gated-delta-net, and the model here is standard-attention Qwen3-4B.
- **One Volume `vgrout-cache`** mounts at `/cache` and holds the HF model cache (`hf/`), the SVD basis cache (`svd_cache/`), and `out/` (uploaded inputs + written `out/runs/*` artifacts).
  The model downloads once and the svd_cache computes once; every later container reuses both. train.py's relative paths (`svd_cache/`, `out/`, `logs/`) are symlinked onto the Volume from an ephemeral `/work` cwd.

## One-time setup

```bash
pip install modal && modal token new     # interactive; you've done this
# Upload the gitignored INPUTS from the box that has them (the 96GB box):
python modal/upload_inputs.py            # pushes out/pairsets, out/vhack, out/pools
modal run modal/app.py --action warm     # download Qwen3-4B + build svd_cache once
```

`upload_inputs.py` skips dirs absent locally. The jobs need these on the Volume:

| input | needed by | present on dev box? |
|---|---|---|
| `out/pools/substrate`, `out/pools/teacher_pool` | most jobs | yes (uploaded) |
| `out/pairsets/prog_wide.json` | FastConfig default (124, 127, 130, ...) | **no, only on GPU box** |
| `data/pairs/pair_diagnostics.md#null-city` | semantic placebo | yes (git-tracked) |
| `out/vhack/v_hack_a5_runtests.safetensors` | 126, 133, 134 (A5) | **no, only on GPU box** |
| `out/vhack/v_hack_pairset_prog_wide_randomV.safetensors` | 125 (random-V) | **no, only on GPU box** |

So: run `upload_inputs.py` **from the 96GB box** to get the pairsets/vhack bases onto the Volume. (Some vhack bases auto-extract from their pairset if absent, but that costs ~5 min GPU per run; uploading the prebuilt ones is cheaper.)
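The "skip dirs absent locally" behaviour can be sketched as a small planning step followed by a batch upload. This is an illustration under assumptions, not the contents of `upload_inputs.py`: the names `plan_uploads` and `upload` are hypothetical, and the remote paths assume the Volume root mirrors the repo layout (so `out/pairsets` lands at `/out/pairsets`). The Modal calls used are `modal.Volume.from_name` and `Volume.batch_upload`.

```python
from pathlib import Path

# Input dirs named in the setup comment above; relative to the repo root.
INPUT_DIRS = ("out/pairsets", "out/vhack", "out/pools")


def plan_uploads(repo_root, rel_dirs=INPUT_DIRS):
    """Return (local_dir, remote_path) pairs for dirs that exist locally.

    Dirs that are absent are skipped instead of raising.
    """
    root = Path(repo_root)
    plan = []
    for rel in rel_dirs:
        local = root / rel
        if local.is_dir():
            # Assumption: the Volume root mirrors the repo layout.
            plan.append((local, "/" + rel))
    return plan


def upload(repo_root, volume_name="vgrout-cache"):
    """Push every planned dir to the Volume (needs `modal` and a token)."""
    import modal  # imported lazily so planning works without Modal installed

    vol = modal.Volume.from_name(volume_name, create_if_missing=True)
    with vol.batch_upload(force=True) as batch:  # force=True overwrites existing files
        for local, remote in plan_uploads(repo_root):
            batch.put_directory(str(local), remote)


if __name__ == "__main__":
    # Dry run: list what would be uploaded from the current directory.
    for local, remote in plan_uploads("."):
        print(f"would upload {local} -> {remote}")
```

Running the dry run on the dev box versus the 96GB box is a quick way to see which rows of the table above would actually be uploaded from each machine.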
## Verify one run, then fan out

```bash
modal run modal/launch.py::fanout --only 1     # preliminary seed-42 vanilla validation
# compare its per_mode_deploy.json to the local-box artifact for the same args
modal run modal/launch.py::fanout              # all 15 (5 arms x seeds 42/41/43)
```

## Getting the outputs back

Every run writes its full artifact set to the Volume, mirroring the local layout:

- `out/runs/_/`: `per_mode_deploy.json`, `train.safetensors`, `first_hack.safetensors`, `rollouts.jsonl`, periodic `ckpt_step*.safetensors`
- `logs/_.log`: the full verbose log

`launch.py` pulls each job's whole run dir + log down to the local `out/runs/` and `logs/` as it finishes (so they land exactly where train.py would have written them). For ad-hoc runs (warm/smoke/`--argv`) or a full re-sync:

```bash
python modal/fetch.py            # all of out/runs + logs
python modal/fetch.py _          # one run
```

## Gotcha: monitor server-side, the local stream block-buffers

There was no routeV deadlock. The earlier "routeV freezes at the first generate()" was a **monitoring artifact**: piping `modal run ... > local.log` captures the subprocess stdout block-buffered, so the local file sits at the first `generate()` warning while the run is actually progressing. A routeV smoke I killed at ~11 min thinking it hung had in fact completed training steps 0-3 (real rewards, `||delta_S_hack||=3.23`, coherent generations) and was inside the 24-prompt FINAL EVAL; the server-side `modal app logs ` showed all of it. I killed a healthy run and built a false "torch-2.7 routeV deadlock" theory on the buffering.

Two fixes, both in place: `_run_train` sets `PYTHONUNBUFFERED=1` (env), and you should **always monitor via `modal app logs `** (server-side truth), not the local `modal run` capture.

Correction to commit 2873b37: its message claimed "the smoke on pinned 5.10.2 clears the deadlock point." There was no deadlock to clear; the framing was wrong. routeV and vanilla both run generate() fine on this image.
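The general Python behaviour behind this gotcha can be reproduced locally without Modal. The sketch below is a stand-in, not the `_run_train` code: `unbuffered_env` and `child_writes_through` are hypothetical helper names. It launches a child interpreter with stdout piped and asks whether its stdout text layer writes through immediately, once with the default environment and once with `PYTHONUNBUFFERED=1`.

```python
import os
import subprocess
import sys

# Child script: report whether its stdout text layer writes through immediately.
PROBE = "import sys; print(sys.stdout.write_through)"


def unbuffered_env(base=None):
    """Copy an environment and force unbuffered Python stdio in the copy."""
    env = dict(os.environ if base is None else base)
    env["PYTHONUNBUFFERED"] = "1"
    return env


def child_writes_through(env):
    """Run a child Python with stdout piped; True if its stdout is unbuffered."""
    out = subprocess.run(
        [sys.executable, "-c", PROBE],
        env=env, capture_output=True, text=True, check=True,
    ).stdout
    return out.strip() == "True"


if __name__ == "__main__":
    # Strip the variable first so the comparison does not depend on the caller's shell.
    plain = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    print("piped, default env:      ", child_writes_through(plain))
    print("piped, PYTHONUNBUFFERED=1:", child_writes_through(unbuffered_env(plain)))
```

A piped child without the variable holds its output in a buffer until the buffer fills or the process exits, which is why a quiet local log says little about whether the run is alive.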
## Caveat: keep the inventory fresh

`launch.py::JOBS` is copied verbatim from the 2026-06-06 manifest. The live plan has since evolved (135 → per-token ablation; 136/137 added; n=3 fan-out gated on the s43 control read). Refresh the argv map from the current manifest / `pueue status` before the real fan-out; it's just data.
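A cheap staleness check before the real fan-out could compare job ids in the inventory against the ids in the current manifest. The sketch below assumes `JOBS` is a mapping from job id to an argv list, which this README does not confirm; `diff_inventory` is a hypothetical helper, and the ids in the demo are placeholders taken from the ranges mentioned above.

```python
def diff_inventory(jobs, manifest_ids):
    """Compare launch inventory ids against the current manifest ids.

    Returns (stale, missing): ids only in `jobs`, and ids only in the manifest.
    """
    have = set(jobs)
    want = set(manifest_ids)
    return sorted(have - want), sorted(want - have)


if __name__ == "__main__":
    # Placeholder data: argv lists are left empty; the real ones live in launch.py.
    jobs = {job_id: [] for job_id in range(124, 136)}  # the copied 124-135 inventory
    current = set(range(124, 138))                      # e.g. after 136/137 were added
    stale, missing = diff_inventory(jobs, current)
    print("stale:", stale, "missing:", missing)
```

This only catches added or removed ids. A job whose id is unchanged but whose arguments changed (as with 135) still needs its argv compared by hand against the manifest.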