evil_MoE/modal/README.md

# Modal port — parallel GRPO runs

Fan the paper's GRPO sweep (jobs 124-135 of `docs/spec/20260606_job_manifest.md`)
out as independent H100 containers instead of running them serially through
pueue on the single 96GB box. The ~12 runs then finish in roughly one run's
wall-clock time instead of ~2 days.

General Modal patterns/gotchas (reusable across projects) live in the global
`modal` skill (`~/.claude/skills/modal/SKILL.md`); this dir is its worked example.

## Files

- `app.py` — image, Volume, and the `train` / `warm` / `smoke` GPU functions.
- `upload_inputs.py` — push the gitignored run inputs (pairsets, vhack, pools) to
  the Volume. Run from a box that has them.
- `launch.py` — fan out the 12-job inventory with `.spawn()`.

## Design decisions (and why)

- **GPU = `["H100", "A100-80GB"]` (80GB, fallback list).** The full preset peaked
  at ~73GB in bf16 on the local card, so an 80GB card is required. H100 is ~1.5-2x
  faster than A100-80GB for ~1.6x the price, so the cost per run is about the same
  (roughly 0.8-1.1x) at 50-67% of the wall-clock. On a 12-way fan-out H100
  capacity can queue, so we fall back to A100-80GB: it runs the same Dao
  flash-attn wheel (which bundles sm_80), and deploy numbers are
  hardware-independent. Override per run with `VGROUT_GPU=H200` if a long run OOMs.
- **torch 2.7, not the repo's pinned 2.8.** Dao-AILab ships no flash-attn wheel
  for cp313 + torch 2.8; the flash-attn 2.8.3 line tops out at torch 2.7 for
  cp313. The official Dao wheel bundles sm_80/86/90, so it runs on A100/H100,
  unlike the repo's Blackwell sm_120-only pin. This keeps train.py's hardcoded
  `flash_attention_2` path working with **zero patch to the research code**.
- **No vllm, no causal-conv1d.** Generation is HF `.generate` (nothing in
  `src/vgrout` imports vllm); causal-conv1d is only for Qwen3.5's gated-delta-net,
  and the model here is standard-attention Qwen3-4B.
- **One Volume `vgrout-cache`** mounts at `/cache` and holds the HF model cache
  (`hf/`), the SVD basis cache (`svd_cache/`), and `out/` (uploaded inputs +
  written `out/runs/*` artifacts). The model downloads once and the svd_cache
  computes once; every later container reuses both. train.py's relative paths
  (`svd_cache/`, `out/`, `logs/`) are symlinked onto the Volume from an ephemeral
  `/work` cwd.
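
Two of the decisions above (the `VGROUT_GPU` override and the symlinked working directory) can be sketched in plain Python. This is a minimal illustration, not the code in `app.py`: the helper names `resolve_gpu` and `link_workdir` are hypothetical, and only the fallback list, the env var name, and the three directory names come from this README.

```python
import os
from pathlib import Path

# Fallback order taken from the design notes above.
DEFAULT_GPUS = ["H100", "A100-80GB"]


def resolve_gpu(env=os.environ):
    """Return the VGROUT_GPU override if set, else the default fallback list.

    Hypothetical helper: the real app.py may read the override differently.
    """
    override = env.get("VGROUT_GPU", "").strip()
    return override if override else list(DEFAULT_GPUS)


def link_workdir(cache_root, work_dir, names=("svd_cache", "out", "logs")):
    """Symlink train.py's relative dirs from an ephemeral cwd onto the Volume mount."""
    cache_root, work_dir = Path(cache_root), Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    for name in names:
        target = cache_root / name
        target.mkdir(parents=True, exist_ok=True)  # the real directory lives on the Volume
        link = work_dir / name
        if not link.is_symlink() and not link.exists():
            link.symlink_to(target, target_is_directory=True)
    return work_dir
```

Under these assumptions a container would call `link_workdir("/cache", "/work")`, `chdir` into `/work`, and run train.py unchanged, so its relative writes land on the Volume.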

## One-time setup

```bash
pip install modal && modal token new      # interactive; you've done this
# Upload the gitignored INPUTS from the box that has them (the 96GB box):
python modal/upload_inputs.py              # pushes out/pairsets, out/vhack, out/pools
modal run modal/app.py --action warm       # download Qwen3-4B + build svd_cache once
```

`upload_inputs.py` skips dirs absent locally. The jobs need these on the Volume:

| input | needed by | present on dev box? |
|---|---|---|
| `out/pools/substrate`, `out/pools/teacher_pool` | most jobs | yes (uploaded) |
| `out/pairsets/prog_wide.json` | FastConfig default (124, 127, 130, ...) | **no — only on GPU box** |
| `out/pairsets/null_city.json` | 128 (erase placebo) | **no — only on GPU box** |
| `out/vhack/v_hack_a5_runtests.safetensors` | 126, 133, 134 (A5) | **no — only on GPU box** |
| `out/vhack/v_hack_pairset_prog_wide_randomV.safetensors` | 125 (random-V) | **no — only on GPU box** |

So run `upload_inputs.py` **from the 96GB GPU box** to get the pairsets/vhack
bases onto the Volume. (Some vhack bases auto-extract from their pairset if
absent, but that costs ~5 min GPU per run; uploading the prebuilt ones is cheaper.)
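
Because `upload_inputs.py` skips anything absent, a quick preflight on the box you upload from can show what would be left out. The sketch below is not part of the repo: `missing_inputs` is a hypothetical helper, and the path list is copied from the table above.

```python
from pathlib import Path

# Paths copied from the table above, relative to the repo root.
REQUIRED_INPUTS = (
    "out/pools/substrate",
    "out/pools/teacher_pool",
    "out/pairsets/prog_wide.json",
    "out/pairsets/null_city.json",
    "out/vhack/v_hack_a5_runtests.safetensors",
    "out/vhack/v_hack_pairset_prog_wide_randomV.safetensors",
)


def missing_inputs(root=".", required=REQUIRED_INPUTS):
    """Return the required inputs (files or dirs) that do not exist under root."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).exists()]
```

Calling `missing_inputs()` from the repo root returns an empty list when every input is present. A non-empty list names the jobs' inputs that would not reach the Volume from that box.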

## Verify one run, then fan out

```bash
modal run modal/launch.py::fanout --only 1          # canary: seed-42 vanilla, confirm clean v2 FINAL EVAL
# compare its per_mode_deploy.json to the local-box artifact for the same args
modal run modal/launch.py::fanout                   # full fan-out: every job in launch.py::JOBS
```
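
For the canary comparison, a tolerance-aware JSON diff is usually enough. The sketch below makes no assumption about the schema of `per_mode_deploy.json`: it walks any nested JSON and reports numeric values that differ beyond a relative tolerance. The function names and the default tolerance are illustrative choices, not something the repo defines.

```python
import json
import math
from pathlib import Path


def diff_json(a, b, rel_tol=1e-3, path=""):
    """Return a list of human-readable differences between two parsed JSON values."""
    if isinstance(a, dict) and isinstance(b, dict):
        diffs = []
        for key in sorted(set(a) | set(b)):
            sub = f"{path}.{key}" if path else str(key)
            if key not in a or key not in b:
                diffs.append(f"{sub}: present in only one file")
            else:
                diffs.extend(diff_json(a[key], b[key], rel_tol, sub))
        return diffs
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return [f"{path}: list length {len(a)} != {len(b)}"]
        diffs = []
        for i, (x, y) in enumerate(zip(a, b)):
            diffs.extend(diff_json(x, y, rel_tol, f"{path}[{i}]"))
        return diffs
    both_numbers = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in (a, b)
    )
    if both_numbers:
        close = math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-9)
        return [] if close else [f"{path}: {a} != {b}"]
    return [] if a == b else [f"{path}: {a!r} != {b!r}"]


def compare_files(modal_path, local_path, rel_tol=1e-3):
    """Load two JSON files and diff them with the given relative tolerance."""
    a = json.loads(Path(modal_path).read_text())
    b = json.loads(Path(local_path).read_text())
    return diff_json(a, b, rel_tol)
```

An empty list from `compare_files` means the Modal canary and the local-box artifact agree within tolerance. Pick `rel_tol` to match how much run-to-run variation you expect for identical args.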

## Getting the outputs back

Every run writes its full artifact set to the Volume, mirroring the local layout:

- `out/runs/<ts>_<slug>/` — `per_mode_deploy.json`, `train.safetensors`,
  `first_hack.safetensors`, `rollouts.jsonl`, periodic `ckpt_step*.safetensors`
- `logs/<ts>_<slug>.log` — the full verbose log

`launch.py` pulls each job's whole run dir + log down to the local `out/runs/` and
`logs/` as it finishes (so they land exactly where train.py would have written
them). For ad-hoc runs (warm/smoke/`--argv`) or a full re-sync:

```bash
python modal/fetch.py                 # all of out/runs + logs
python modal/fetch.py <ts>_<slug>     # one run
```

## Gotcha — monitor server-side, the local stream block-buffers

There was no routeV deadlock. The earlier "routeV freezes at the first generate()"
report was a **monitoring artifact**: piping `modal run ... > local.log` captures the
subprocess stdout block-buffered, so the local file sits at the first `generate()`
warning while the run is actually progressing. A routeV smoke I killed at ~11 min,
thinking it had hung, had in fact completed training steps 0-3 (real rewards,
`||delta_S_hack||=3.23`, coherent generations) and was inside the 24-prompt FINAL
EVAL; the server-side `modal app logs <app-id>` showed all of it. I killed a healthy
run and built a false "torch-2.7 routeV deadlock" theory on the buffering.

Two safeguards: `_run_train` sets `PYTHONUNBUFFERED=1` in the env (already in
place), and you should **always monitor via `modal app logs <app-id>`**
(server-side truth), not the local `modal run` capture.
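
The buffering fix can be reproduced outside Modal. The sketch below launches a Python child with `PYTHONUNBUFFERED=1` and yields its output line by line. It illustrates the mechanism only: `stream_unbuffered` is a hypothetical helper, not the repo's `_run_train`.

```python
import os
import subprocess
import sys


def stream_unbuffered(cmd):
    """Run cmd with PYTHONUNBUFFERED=1 and yield its output lines as they arrive."""
    # Copy the parent env and force unbuffered stdio in any Python child.
    env = dict(os.environ, PYTHONUNBUFFERED="1")
    proc = subprocess.Popen(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr so warnings keep their order
        text=True,
        bufsize=1,  # line-buffered on the reading side
    )
    with proc:  # waits for the child and closes the pipe on exit
        for line in proc.stdout:
            yield line.rstrip("\n")
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


if __name__ == "__main__":
    child = [sys.executable, "-c", "import sys; print(sys.flags.unbuffered)"]
    for line in stream_unbuffered(child):
        print(line)
```

The env var only affects Python children. A non-Python subprocess would need its own unbuffering, which is one more reason to treat the server-side logs as the source of truth.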

Correction to commit 2873b37: its message claimed "the smoke on pinned 5.10.2 clears
the deadlock point." There was no deadlock to clear; the framing was wrong. routeV and
vanilla both run generate() fine on this image.

## Caveat — keep the inventory fresh

`launch.py::JOBS` is copied verbatim from the 2026-06-06 manifest. The live plan
has since evolved (135 → per-token ablation; 136/137 added; n=3 fan-out gated on
the s43 control read). Refresh the argv map from the current manifest / `pueue
status` before the real fan-out — it's just data.