From 5419771d70747c74f9536f607f1c0d36613b79b5 Mon Sep 17 00:00:00 2001 From: wassname <1103714+wassname@users.noreply.github.com> Date: Sun, 7 Jun 2026 10:39:41 +0800 Subject: [PATCH] modal: there was no routeV hang -- it was local stdout buffering Retract the "routeV deadlocks at first generate()" finding from d96367c. The server-side `modal app logs` show the killed routeV smoke had actually run training steps 0-3 (real rewards, ||delta_S_hack||=3.23, coherent generations) and was inside the 24-prompt FINAL EVAL when I stopped it -- a deadlocked-at-first-generate process cannot produce step 1/2/3 results. The "freeze" was the local `modal run > log` capture block-buffering the subprocess stdout; the run was healthy the whole time. Fix: PYTHONUNBUFFERED=1 in _run_train env so the local stream is live, and monitor via `modal app logs ` (server-side truth). Corrected the app.py comment and replaced the README "known issue" with the buffering gotcha. routeV runs fine on Modal -- the routeV sweep is viable, no torch-2.7 debug needed. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com> --- modal/README.md | 34 ++++++++++++++-------------------- modal/app.py | 14 +++++++++----- 2 files changed, 23 insertions(+), 25 deletions(-) diff --git a/modal/README.md b/modal/README.md index 0f42cf1..1d4d675 100644 --- a/modal/README.md +++ b/modal/README.md @@ -86,30 +86,24 @@ python modal/fetch.py # all of out/runs + logs python modal/fetch.py _ # one run ``` -## Known issue — routeV deadlocks at first generate() on Modal +## Gotcha — monitor server-side, the local stream block-buffers -**vanilla works, routeV hangs, on the identical image.** On the `transformers==5.10.2` -+ Dao flash-attn 2.8.3 (torch 2.7.1) image, a vanilla warm completes generate cleanly -(6.8 min, exit 0, `per_mode_deploy.json` written), but a routeV smoke freezes at its -first rollout `generate()` indefinitely (killed at ~11 min). Same image, same attn, -same data, same delta_S init — the only difference is the arm. +There was no routeV deadlock. The earlier "routeV freezes at the first generate()" +was a **monitoring artifact**: piping `modal run ... > local.log` captures the +subprocess stdout block-buffered, so the local file sits at the first `generate()` +warning while the run is actually progressing. A routeV smoke I killed at ~11 min +thinking it hung had in fact completed training steps 0-3 (real rewards, +`||delta_S_hack||=3.23`, coherent generations) and was inside the 24-prompt FINAL +EVAL — the server-side `modal app logs ` showed all of it. I killed a healthy +run and built a false "torch-2.7 routeV deadlock" theory on the buffering. -It is NOT: the data path (both arms load fine now), the attn backend (the other agent -reproduced the hang under sdpa too, commit 2f91561), transformers version (vanilla -runs on this exact 5.10.2), or the generate-time forward hook (routeV's `grad_probe` -branch is skipped under `generate()`'s no_grad, so both arms run the identical -`_delta_hook` else-path). What's left: routeV extracts `v_grad` fresh before the loop -(forward+backward across 252 modules), and that CUDA/allocator state collides with -flash-attn's first generate on torch 2.7.1. The **local box runs routeV fine on torch -2.8** — so this is a torch-2.7-on-Modal deadlock, not a method bug. +Two fixes, both in place: `_run_train` sets `PYTHONUNBUFFERED=1` (env), and you should +**always monitor via `modal app logs `** (server-side truth), not the local +`modal run` capture. Correction to commit 2873b37: its message claimed "the smoke on pinned 5.10.2 clears -the deadlock point." It does not — the smoke hung. transformers was never the cause. - -**Current call:** Modal is proven for vanilla; the local box produces the canonical -routeV runs (jobs 134/135). Don't sink Modal $ into the routeV hang unless we -specifically want the parallelism — the fix is a focused torch-2.7 generate debug -(candidate: `torch.cuda.empty_cache()` at the extraction→loop boundary, unverified). +the deadlock point." There was no deadlock to clear; the framing was wrong. routeV and +vanilla both run generate() fine on this image. ## Caveat — keep the inventory fresh diff --git a/modal/app.py b/modal/app.py index 61c971e..0b72a34 100644 --- a/modal/app.py +++ b/modal/app.py @@ -59,11 +59,9 @@ image = ( .pip_install( # transformers: pinned released version, NOT floating `@ main`. uv.lock keeps # the exact 5.8.0.dev0 commit for the local box; the image uses a released - # wheel (Qwen3-4B needs no main-only feature). NB: transformers is NOT the - # routeV generate() hang -- on this exact 5.10.2 image, vanilla completes - # generate fine (warm, 6.8 min, exit 0) while routeV deadlocks at its first - # rollout generate. The hang is routeV-specific (v_grad extraction's CUDA - # state x flash-attn first-generate on torch 2.7.1); see modal/README.md. + # wheel (Qwen3-4B needs no main-only feature). Both vanilla and routeV run + # generate() fine on this image -- the earlier "routeV deadlock" was a local + # stdout-buffering misread, not a real hang (see modal/README.md). "transformers==5.10.2", "einops>=0.8", "jaxtyping>=0.2", @@ -152,6 +150,12 @@ def _run_train(argv: list[str]) -> dict: "HF_HOME": f"{CACHE}/hf", "HF_HUB_DISABLE_PROGRESS_BARS": "1", "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True", + # Unbuffered so the train log streams live. Without this the subprocess + # stdout block-buffers; the local `modal run` capture then looks frozen at + # the first generate() while the run is actually progressing (this is what + # I misread as a "routeV deadlock" -- see modal/README.md). Server-side + # `modal app logs ` always has the truth; this makes the local view match. + "PYTHONUNBUFFERED": "1", } runs_before = set(Path(f"{CACHE}/out/runs").glob("*")) if Path(f"{CACHE}/out/runs").exists() else set()