mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 21:52:24 +08:00
5419771d70
Retract the "routeV deadlocks at first generate()" finding from d96367c. The
server-side `modal app logs` show the killed routeV smoke had actually run training
steps 0-3 (real rewards, ||delta_S_hack||=3.23, coherent generations) and was inside
the 24-prompt FINAL EVAL when I stopped it -- a deadlocked-at-first-generate process
cannot produce step 1/2/3 results. The "freeze" was the local `modal run > log`
capture block-buffering the subprocess stdout; the run was healthy the whole time.
Fix: PYTHONUNBUFFERED=1 in _run_train env so the local stream is live, and monitor
via `modal app logs <app-id>` (server-side truth). Corrected the app.py comment and
replaced the README "known issue" with the buffering gotcha. routeV runs fine on
Modal -- the routeV sweep is viable, no torch-2.7 debug needed.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
114 lines
5.8 KiB
Markdown
# Modal port — parallel GRPO runs

Fan the paper's GRPO sweep (jobs 124-135 of `docs/spec/20260606_job_manifest.md`)
out as independent H100 containers instead of running them serially through
pueue on the one 96GB box. ~12 runs finish in one run's wall-clock instead of ~2
days.

General Modal patterns/gotchas (reusable across projects) live in the global
`modal` skill (`~/.claude/skills/modal/SKILL.md`); this dir is its worked example.

## Files

- `app.py`: image, Volume, and the `train` / `warm` / `smoke` GPU functions.
- `upload_inputs.py`: push the gitignored run inputs (pairsets, vhack, pools) to
  the Volume. Run from a box that has them.
- `launch.py`: fan out the 12-job inventory with `.spawn()`.
- `fetch.py`: pull `out/runs` and `logs` back from the Volume (everything, or one
  `<ts>_<slug>` run).

## Design decisions (and why)

- **GPU = `["H100", "A100-80GB"]` (80GB, fallback list).** The full preset peaked
  at ~73GB in bf16 on the local card, so an 80GB card is required. H100 is ~1.5-2x
  faster than A100-80GB for ~1.6x the price (≈ same $/run, roughly half the
  wall-clock). On a 12-way fan-out H100 capacity can queue, so we fall back to
  A100-80GB: it runs the same Dao flash-attn wheel (bundles sm_80) and deploy
  numbers are hardware-independent. Override per-run with `VGROUT_GPU=H200` if a
  long run OOMs.
- **torch 2.7, not the repo's pinned 2.8.** Dao-AILab ships no cp313+torch2.8
  flash-attn wheel; the flash-attn 2.8.3 line tops out at torch2.7 for cp313. The
  official Dao wheel bundles sm_80/86/90 so it runs on A100/H100, unlike the
  repo's Blackwell sm_120-only pin. This keeps train.py's hardcoded
  `flash_attention_2` path working with **zero patch to the research code**.
- **No vllm, no causal-conv1d.** Generation is HF `.generate` (nothing in
  `src/vgrout` imports vllm); causal-conv1d is only for Qwen3.5's gated-delta-net,
  and the model here is standard-attention Qwen3-4B.
- **One Volume `vgrout-cache`** mounts at `/cache` and holds the HF model cache
  (`hf/`), the SVD basis cache (`svd_cache/`), and `out/` (uploaded inputs +
  written `out/runs/*` artifacts). The model downloads once and the svd_cache
  computes once; every later container reuses both. train.py's relative paths
  (`svd_cache/`, `out/`, `logs/`) are symlinked onto the Volume from an ephemeral
  `/work` cwd.

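
The symlink layout in the last bullet can be sketched as follows. This is a minimal illustration, not the code in `app.py`: the helper name `link_workdir` is hypothetical, and temporary directories stand in for `/cache` and `/work` so the snippet runs anywhere.

```python
import tempfile
from pathlib import Path


def link_workdir(volume_root: Path, workdir: Path,
                 names=("svd_cache", "out", "logs")) -> None:
    """Point each relative directory name in workdir at the Volume copy."""
    workdir.mkdir(parents=True, exist_ok=True)
    for name in names:
        target = volume_root / name
        target.mkdir(parents=True, exist_ok=True)  # persistent side
        link = workdir / name
        if link.is_symlink():
            link.unlink()  # refresh a stale link from an earlier call
        link.symlink_to(target, target_is_directory=True)


# Demo: temp dirs stand in for the Volume mount (/cache) and the cwd (/work).
with tempfile.TemporaryDirectory() as tmp:
    cache, work = Path(tmp) / "cache", Path(tmp) / "work"
    link_workdir(cache, work)
    # A write through the relative path in the ephemeral cwd ...
    (work / "out" / "runs").mkdir(parents=True)
    # ... is visible on the persistent side.
    landed_on_volume = (cache / "out" / "runs").is_dir()
```

The point of the design is that the training script keeps writing to plain relative paths, while everything it writes ends up on persistent storage.
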
## One-time setup

```bash
pip install modal && modal token new   # interactive; skip if already done
# Upload the gitignored INPUTS from the box that has them (the 96GB box):
python modal/upload_inputs.py          # pushes out/pairsets, out/vhack, out/pools
modal run modal/app.py --action warm   # download Qwen3-4B + build svd_cache once
```

`upload_inputs.py` skips dirs absent locally. The jobs need these on the Volume:

| input | needed by | present on dev box? |
|---|---|---|
| `out/pools/substrate`, `out/pools/teacher_pool` | most jobs | yes (uploaded) |
| `out/pairsets/prog_wide.json` | FastConfig default (124, 127, 130, ...) | **no, only on GPU box** |
| `out/pairsets/null_city.json` | 128 (erase placebo) | **no, only on GPU box** |
| `out/vhack/v_hack_a5_runtests.safetensors` | 126, 133, 134 (A5) | **no, only on GPU box** |
| `out/vhack/v_hack_pairset_prog_wide_randomV.safetensors` | 125 (random-V) | **no, only on GPU box** |

So: run `upload_inputs.py` **from the 96GB box** to get the pairsets/vhack bases
onto the Volume. (Some vhack bases auto-extract from their pairset if absent, but
that costs ~5 min GPU per run; uploading the prebuilt ones is cheaper.)

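
The "skips dirs absent locally" behaviour can be sketched like this. It is a minimal stand-in, not the contents of `upload_inputs.py`: the function name `split_present` is hypothetical, and the actual upload call is left out. Reporting the missing dirs explicitly makes it obvious when the script is being run from the wrong box.

```python
import tempfile
from pathlib import Path

# The three input dirs named in the setup commands above.
INPUT_DIRS = ("out/pairsets", "out/vhack", "out/pools")


def split_present(root: Path, rel_dirs=INPUT_DIRS):
    """Return (present, missing) input dirs relative to root."""
    present = [d for d in rel_dirs if (root / d).is_dir()]
    missing = [d for d in rel_dirs if d not in present]
    return present, missing


# Demo: a fake checkout that only has out/pools, like the dev box in the table.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "out" / "pools").mkdir(parents=True)
    present, missing = split_present(root)
    for d in missing:
        print(f"skip {d}: not present locally")
```
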
## Verify one run, then fan out

```bash
modal run modal/launch.py::fanout --only 1   # canary: seed-42 vanilla, confirm clean v2 FINAL EVAL
# compare its per_mode_deploy.json to the local-box artifact for the same args
modal run modal/launch.py::fanout            # all 15 (5 arms x seeds 42/41/43)
```

## Getting the outputs back

Every run writes its full artifact set to the Volume, mirroring the local layout:

- `out/runs/<ts>_<slug>/`: `per_mode_deploy.json`, `train.safetensors`,
  `first_hack.safetensors`, `rollouts.jsonl`, periodic `ckpt_step*.safetensors`
- `logs/<ts>_<slug>.log`: the full verbose log

`launch.py` pulls each job's whole run dir + log down to the local `out/runs/` and
`logs/` as it finishes (so they land exactly where train.py would have written
them). For ad-hoc runs (warm/smoke/`--argv`) or a full re-sync:

```bash
python modal/fetch.py                 # all of out/runs + logs
python modal/fetch.py <ts>_<slug>     # one run
```

## Gotcha — monitor server-side, the local stream block-buffers

There was no routeV deadlock. The earlier "routeV freezes at the first generate()"
was a **monitoring artifact**: redirecting `modal run ... > local.log` captures the
subprocess stdout block-buffered, so the local file sits at the first `generate()`
warning while the run is actually progressing. A routeV smoke I killed at ~11 min
thinking it hung had in fact completed training steps 0-3 (real rewards,
`||delta_S_hack||=3.23`, coherent generations) and was inside the 24-prompt FINAL
EVAL; the server-side `modal app logs <app-id>` showed all of it. I killed a healthy
run and built a false "torch-2.7 routeV deadlock" theory on the buffering.

Two fixes, both in place: `_run_train` sets `PYTHONUNBUFFERED=1` in the subprocess
env, and you should **always monitor via `modal app logs <app-id>`** (server-side
truth), not the local `modal run` capture.

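
The first fix can be sketched with the standard library alone. This is an illustration of the technique, not the body of `_run_train`: the helper `run_streaming` and the inline child script are hypothetical. When a Python child's stdout is a pipe or file rather than a TTY, it is block-buffered by default; setting `PYTHONUNBUFFERED=1` in the child's environment makes each line appear as it is printed.

```python
import os
import subprocess
import sys


def run_streaming(cmd):
    """Run cmd with PYTHONUNBUFFERED=1 and forward its output line by line."""
    env = dict(os.environ)
    env["PYTHONUNBUFFERED"] = "1"  # child flushes stdout/stderr on every write
    proc = subprocess.Popen(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    lines = []
    for line in proc.stdout:  # yields each line as soon as the child emits it
        print(line, end="", flush=True)
        lines.append(line.rstrip("\n"))
    proc.wait()
    return proc.returncode, lines


# Demo child: prints a progress line, then echoes the env var it was given.
child = "import os; print('step 0'); print(os.environ.get('PYTHONUNBUFFERED'))"
code, lines = run_streaming([sys.executable, "-c", child])
```

Even with this in place, the local capture is only a convenience; the server-side logs remain the reference.
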
Correction to commit 2873b37: its message claimed "the smoke on pinned 5.10.2 clears
the deadlock point." There was no deadlock to clear; the framing was wrong. routeV and
vanilla both run generate() fine on this image.

## Caveat — keep the inventory fresh

`launch.py::JOBS` is copied verbatim from the 2026-06-06 manifest. The live plan
has since evolved (135 → per-token ablation; 136/137 added; n=3 fan-out gated on
the s43 control read). Refresh the argv map from the current manifest / `pueue
status` before the real fan-out; it's just data.
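
Because the inventory is plain data, a staleness check before the fan-out is cheap. The sketch below assumes the inventory is a mapping from job id to argv list; that shape, the function `diff_inventory`, and the placeholder argv strings are all hypothetical and are not taken from `launch.py`.

```python
def diff_inventory(current: dict, fresh: dict) -> dict:
    """Report job ids added, removed, or changed between two argv maps."""
    shared = current.keys() & fresh.keys()
    return {
        "added": sorted(fresh.keys() - current.keys()),
        "removed": sorted(current.keys() - fresh.keys()),
        "changed": sorted(k for k in shared if current[k] != fresh[k]),
    }


# Placeholder argv values only; the real ones live in the manifest.
stale = {124: ["--placeholder-a"], 135: ["--placeholder-b"]}
fresh = {124: ["--placeholder-a"], 135: ["--placeholder-c"], 136: ["--placeholder-d"]}
report = diff_inventory(stale, fresh)
print(report)
```

An empty report means the copied inventory still matches the manifest; anything else should be reconciled before spawning containers.
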