modal: there was no routeV hang -- it was local stdout buffering

Retract the "routeV deadlocks at first generate()" finding from d96367c. The
server-side `modal app logs` show the killed routeV smoke had actually run training
steps 0-3 (real rewards, ||delta_S_hack||=3.23, coherent generations) and was inside
the 24-prompt FINAL EVAL when I stopped it -- a deadlocked-at-first-generate process
cannot produce step 1/2/3 results. The "freeze" was the local `modal run > log`
capture block-buffering the subprocess stdout; the run was healthy the whole time.

Fix: PYTHONUNBUFFERED=1 in _run_train env so the local stream is live, and monitor
via `modal app logs <app-id>` (server-side truth). Corrected the app.py comment and
replaced the README "known issue" with the buffering gotcha. routeV runs fine on
Modal -- the routeV sweep is viable, no torch-2.7 debug needed.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
wassname
2026-06-07 10:39:41 +08:00
parent d96367ca5d
commit 5419771d70
2 changed files with 23 additions and 25 deletions
+14 -20
View File
@@ -86,30 +86,24 @@ python modal/fetch.py # all of out/runs + logs
python modal/fetch.py <ts>_<slug> # one run
```
## Known issue — routeV deadlocks at first generate() on Modal
## Gotcha — monitor server-side, the local stream block-buffers
**vanilla works, routeV hangs, on the identical image.** On the `transformers==5.10.2`
+ Dao flash-attn 2.8.3 (torch 2.7.1) image, a vanilla warm completes generate cleanly
(6.8 min, exit 0, `per_mode_deploy.json` written), but a routeV smoke freezes at its
first rollout `generate()` indefinitely (killed at ~11 min). Same image, same attn,
same data, same delta_S init — the only difference is the arm.
There was no routeV deadlock. The earlier "routeV freezes at the first generate()"
was a **monitoring artifact**: piping `modal run ... > local.log` captures the
subprocess stdout block-buffered, so the local file sits at the first `generate()`
warning while the run is actually progressing. A routeV smoke I killed at ~11 min
thinking it hung had in fact completed training steps 0-3 (real rewards,
`||delta_S_hack||=3.23`, coherent generations) and was inside the 24-prompt FINAL
EVAL — the server-side `modal app logs <app-id>` showed all of it. I killed a healthy
run and built a false "torch-2.7 routeV deadlock" theory on the buffering.
It is NOT: the data path (both arms load fine now), the attn backend (the other agent
reproduced the hang under sdpa too, commit 2f91561), transformers version (vanilla
runs on this exact 5.10.2), or the generate-time forward hook (routeV's `grad_probe`
branch is skipped under `generate()`'s no_grad, so both arms run the identical
`_delta_hook` else-path). What's left: routeV extracts `v_grad` fresh before the loop
(forward+backward across 252 modules), and that CUDA/allocator state collides with
flash-attn's first generate on torch 2.7.1. The **local box runs routeV fine on torch
2.8** — so this is a torch-2.7-on-Modal deadlock, not a method bug.
Two fixes, both in place: `_run_train` sets `PYTHONUNBUFFERED=1` (env), and you should
**always monitor via `modal app logs <app-id>`** (server-side truth), not the local
`modal run` capture.
Correction to commit 2873b37: its message claimed "the smoke on pinned 5.10.2 clears
the deadlock point." It does not — the smoke hung. transformers was never the cause.
**Current call:** Modal is proven for vanilla; the local box produces the canonical
routeV runs (jobs 134/135). Don't sink Modal $ into the routeV hang unless we
specifically want the parallelism — the fix is a focused torch-2.7 generate debug
(candidate: `torch.cuda.empty_cache()` at the extraction→loop boundary, unverified).
the deadlock point." There was no deadlock to clear; the framing was wrong. routeV and
vanilla both run generate() fine on this image.
## Caveat — keep the inventory fresh
+9 -5
View File
@@ -59,11 +59,9 @@ image = (
.pip_install(
# transformers: pinned released version, NOT floating `@ main`. uv.lock keeps
# the exact 5.8.0.dev0 commit for the local box; the image uses a released
# wheel (Qwen3-4B needs no main-only feature). NB: transformers is NOT the
# routeV generate() hang -- on this exact 5.10.2 image, vanilla completes
# generate fine (warm, 6.8 min, exit 0) while routeV deadlocks at its first
# rollout generate. The hang is routeV-specific (v_grad extraction's CUDA
# state x flash-attn first-generate on torch 2.7.1); see modal/README.md.
# wheel (Qwen3-4B needs no main-only feature). Both vanilla and routeV run
# generate() fine on this image -- the earlier "routeV deadlock" was a local
# stdout-buffering misread, not a real hang (see modal/README.md).
"transformers==5.10.2",
"einops>=0.8",
"jaxtyping>=0.2",
@@ -152,6 +150,12 @@ def _run_train(argv: list[str]) -> dict:
"HF_HOME": f"{CACHE}/hf",
"HF_HUB_DISABLE_PROGRESS_BARS": "1",
"PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
# Unbuffered so the train log streams live. Without this the subprocess
# stdout block-buffers; the local `modal run` capture then looks frozen at
# the first generate() while the run is actually progressing (this is what
# I misread as a "routeV deadlock" -- see modal/README.md). Server-side
# `modal app logs <id>` always has the truth; this makes the local view match.
"PYTHONUNBUFFERED": "1",
}
runs_before = set(Path(f"{CACHE}/out/runs").glob("*")) if Path(f"{CACHE}/out/runs").exists() else set()