modal: there was no routeV hang -- it was local stdout buffering

Retract the "routeV deadlocks at first generate()" finding from d96367c. The server-side `modal app logs` show the killed routeV smoke had actually run training steps 0-3 (real rewards, ||delta_S_hack||=3.23, coherent generations) and was inside the 24-prompt FINAL EVAL when I stopped it -- a deadlocked-at-first-generate process cannot produce step 1/2/3 results. The "freeze" was the local `modal run > log` capture block-buffering the subprocess stdout; the run was healthy the whole time. Fix: PYTHONUNBUFFERED=1 in _run_train env so the local stream is live, and monitor via `modal app logs <app-id>` (server-side truth). Corrected the app.py comment and replaced the README "known issue" with the buffering gotcha. routeV runs fine on Modal -- the routeV sweep is viable, no torch-2.7 debug needed. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:45:42 +08:00 · 2026-06-07 10:39:41 +08:00
parent d96367ca5d
commit 5419771d70
2 changed files with 23 additions and 25 deletions
@@ -86,30 +86,24 @@ python modal/fetch.py                 # all of out/runs + logs
 python modal/fetch.py <ts>_<slug>     # one run
 ```

-## Known issue — routeV deadlocks at first generate() on Modal
+## Gotcha — monitor server-side, the local stream block-buffers

-**vanilla works, routeV hangs, on the identical image.** On the `transformers==5.10.2`
-+ Dao flash-attn 2.8.3 (torch 2.7.1) image, a vanilla warm completes generate cleanly
-(6.8 min, exit 0, `per_mode_deploy.json` written), but a routeV smoke freezes at its
-first rollout `generate()` indefinitely (killed at ~11 min). Same image, same attn,
-same data, same delta_S init — the only difference is the arm.
+There was no routeV deadlock. The earlier "routeV freezes at the first generate()"
+was a **monitoring artifact**: piping `modal run ... > local.log` captures the
+subprocess stdout block-buffered, so the local file sits at the first `generate()`
+warning while the run is actually progressing. A routeV smoke I killed at ~11 min
+thinking it hung had in fact completed training steps 0-3 (real rewards,
+`||delta_S_hack||=3.23`, coherent generations) and was inside the 24-prompt FINAL
+EVAL — the server-side `modal app logs <app-id>` showed all of it. I killed a healthy
+run and built a false "torch-2.7 routeV deadlock" theory on the buffering.

-It is NOT: the data path (both arms load fine now), the attn backend (the other agent
-reproduced the hang under sdpa too, commit 2f91561), transformers version (vanilla
-runs on this exact 5.10.2), or the generate-time forward hook (routeV's `grad_probe`
-branch is skipped under `generate()`'s no_grad, so both arms run the identical
-`_delta_hook` else-path). What's left: routeV extracts `v_grad` fresh before the loop
-(forward+backward across 252 modules), and that CUDA/allocator state collides with
-flash-attn's first generate on torch 2.7.1. The **local box runs routeV fine on torch
-2.8** — so this is a torch-2.7-on-Modal deadlock, not a method bug.
+Two fixes, both in place: `_run_train` sets `PYTHONUNBUFFERED=1` (env), and you should
+**always monitor via `modal app logs <app-id>`** (server-side truth), not the local
+`modal run` capture.

 Correction to commit 2873b37: its message claimed "the smoke on pinned 5.10.2 clears
-the deadlock point." It does not — the smoke hung. transformers was never the cause.
-
-**Current call:** Modal is proven for vanilla; the local box produces the canonical
-routeV runs (jobs 134/135). Don't sink Modal $ into the routeV hang unless we
-specifically want the parallelism — the fix is a focused torch-2.7 generate debug
-(candidate: `torch.cuda.empty_cache()` at the extraction→loop boundary, unverified).
+the deadlock point." There was no deadlock to clear; the framing was wrong. routeV and
+vanilla both run generate() fine on this image.

 ## Caveat — keep the inventory fresh

@@ -59,11 +59,9 @@ image = (
    .pip_install(
        # transformers: pinned released version, NOT floating `@ main`. uv.lock keeps
        # the exact 5.8.0.dev0 commit for the local box; the image uses a released
-        # wheel (Qwen3-4B needs no main-only feature). NB: transformers is NOT the
-        # routeV generate() hang -- on this exact 5.10.2 image, vanilla completes
-        # generate fine (warm, 6.8 min, exit 0) while routeV deadlocks at its first
-        # rollout generate. The hang is routeV-specific (v_grad extraction's CUDA
-        # state x flash-attn first-generate on torch 2.7.1); see modal/README.md.
+        # wheel (Qwen3-4B needs no main-only feature). Both vanilla and routeV run
+        # generate() fine on this image -- the earlier "routeV deadlock" was a local
+        # stdout-buffering misread, not a real hang (see modal/README.md).
        "transformers==5.10.2",
        "einops>=0.8",
        "jaxtyping>=0.2",
@@ -152,6 +150,12 @@ def _run_train(argv: list[str]) -> dict:
        "HF_HOME": f"{CACHE}/hf",
        "HF_HUB_DISABLE_PROGRESS_BARS": "1",
        "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
+        # Unbuffered so the train log streams live. Without this the subprocess
+        # stdout block-buffers; the local `modal run` capture then looks frozen at
+        # the first generate() while the run is actually progressing (this is what
+        # I misread as a "routeV deadlock" -- see modal/README.md). Server-side
+        # `modal app logs <id>` always has the truth; this makes the local view match.
+        "PYTHONUNBUFFERED": "1",
    }
    runs_before = set(Path(f"{CACHE}/out/runs").glob("*")) if Path(f"{CACHE}/out/runs").exists() else set()