Mirror of https://github.com/wassname/evil_MoE.git, synced 2026-06-27 16:45:42 +08:00
modal: there was no routeV hang -- it was local stdout buffering

Retract the "routeV deadlocks at first generate()" finding from d96367c. The
server-side `modal app logs` show the killed routeV smoke had actually run training
steps 0-3 (real rewards, ||delta_S_hack||=3.23, coherent generations) and was inside
the 24-prompt FINAL EVAL when I stopped it -- a deadlocked-at-first-generate process
cannot produce step 1/2/3 results. The "freeze" was the local `modal run > log`
capture block-buffering the subprocess stdout; the run was healthy the whole time.

Fix: PYTHONUNBUFFERED=1 in _run_train env so the local stream is live, and monitor
via `modal app logs <app-id>` (server-side truth). Corrected the app.py comment and
replaced the README "known issue" with the buffering gotcha. routeV runs fine on
Modal -- the routeV sweep is viable, no torch-2.7 debug needed.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
+14
-20
@@ -86,30 +86,24 @@ python modal/fetch.py # all of out/runs + logs
 python modal/fetch.py <ts>_<slug> # one run
 ```
 
-## Known issue — routeV deadlocks at first generate() on Modal
+## Gotcha — monitor server-side, the local stream block-buffers
 
-**vanilla works, routeV hangs, on the identical image.** On the `transformers==5.10.2`
-+ Dao flash-attn 2.8.3 (torch 2.7.1) image, a vanilla warm completes generate cleanly
-(6.8 min, exit 0, `per_mode_deploy.json` written), but a routeV smoke freezes at its
-first rollout `generate()` indefinitely (killed at ~11 min). Same image, same attn,
-same data, same delta_S init — the only difference is the arm.
+There was no routeV deadlock. The earlier "routeV freezes at the first generate()"
+was a **monitoring artifact**: piping `modal run ... > local.log` captures the
+subprocess stdout block-buffered, so the local file sits at the first `generate()`
+warning while the run is actually progressing. A routeV smoke I killed at ~11 min
+thinking it hung had in fact completed training steps 0-3 (real rewards,
+`||delta_S_hack||=3.23`, coherent generations) and was inside the 24-prompt FINAL
+EVAL — the server-side `modal app logs <app-id>` showed all of it. I killed a healthy
+run and built a false "torch-2.7 routeV deadlock" theory on the buffering.
 
-It is NOT: the data path (both arms load fine now), the attn backend (the other agent
-reproduced the hang under sdpa too, commit 2f91561), transformers version (vanilla
-runs on this exact 5.10.2), or the generate-time forward hook (routeV's `grad_probe`
-branch is skipped under `generate()`'s no_grad, so both arms run the identical
-`_delta_hook` else-path). What's left: routeV extracts `v_grad` fresh before the loop
-(forward+backward across 252 modules), and that CUDA/allocator state collides with
-flash-attn's first generate on torch 2.7.1. The **local box runs routeV fine on torch
-2.8** — so this is a torch-2.7-on-Modal deadlock, not a method bug.
+Two fixes, both in place: `_run_train` sets `PYTHONUNBUFFERED=1` (env), and you should
+**always monitor via `modal app logs <app-id>`** (server-side truth), not the local
+`modal run` capture.
 
 Correction to commit 2873b37: its message claimed "the smoke on pinned 5.10.2 clears
-the deadlock point." It does not — the smoke hung. transformers was never the cause.
-
-**Current call:** Modal is proven for vanilla; the local box produces the canonical
-routeV runs (jobs 134/135). Don't sink Modal $ into the routeV hang unless we
-specifically want the parallelism — the fix is a focused torch-2.7 generate debug
-(candidate: `torch.cuda.empty_cache()` at the extraction→loop boundary, unverified).
+the deadlock point." There was no deadlock to clear; the framing was wrong. routeV and
+vanilla both run generate() fine on this image.
 
 ## Caveat — keep the inventory fresh
 
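The buffering behaviour described in the new README text can be checked without Modal. The following is a minimal sketch, not part of the commit: `PROBE` and `probe_stdout` are hypothetical names introduced here. It starts a child interpreter with its stdout attached to a pipe, as happens when output is captured or redirected, and reports whether the child's `sys.stdout` is a TTY and whether it is in write-through mode. It assumes CPython 3.7 or later, where `PYTHONUNBUFFERED` makes the text layer of stdout write-through.

```python
import os
import subprocess
import sys

# One-liner run in the child: report whether stdout is a TTY and whether
# the text layer passes writes straight through to the OS.
PROBE = "import sys; print(sys.stdout.isatty(), sys.stdout.write_through)"


def probe_stdout(unbuffered: bool) -> str:
    """Run PROBE in a child whose stdout is a pipe and return what it printed.

    Hypothetical helper for illustration only.
    """
    # Start from the parent's environment, but drop any inherited setting so
    # the baseline really is the interpreter default.
    env = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}
    if unbuffered:
        env["PYTHONUNBUFFERED"] = "1"
    result = subprocess.run(
        [sys.executable, "-c", PROBE],
        env=env,
        capture_output=True,  # stdout becomes a pipe, like `modal run ... > local.log`
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("default:   ", probe_stdout(unbuffered=False))
    print("unbuffered:", probe_stdout(unbuffered=True))
```

In the default case the child sees a non-TTY stdout without write-through, so its prints accumulate in a buffer and a captured log can lag well behind the actual run. With the variable set, each write reaches the pipe immediately.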
+9
-5
@@ -59,11 +59,9 @@ image = (
 .pip_install(
 # transformers: pinned released version, NOT floating `@ main`. uv.lock keeps
 # the exact 5.8.0.dev0 commit for the local box; the image uses a released
-# wheel (Qwen3-4B needs no main-only feature). NB: transformers is NOT the
-# routeV generate() hang -- on this exact 5.10.2 image, vanilla completes
-# generate fine (warm, 6.8 min, exit 0) while routeV deadlocks at its first
-# rollout generate. The hang is routeV-specific (v_grad extraction's CUDA
-# state x flash-attn first-generate on torch 2.7.1); see modal/README.md.
+# wheel (Qwen3-4B needs no main-only feature). Both vanilla and routeV run
+# generate() fine on this image -- the earlier "routeV deadlock" was a local
+# stdout-buffering misread, not a real hang (see modal/README.md).
 "transformers==5.10.2",
 "einops>=0.8",
 "jaxtyping>=0.2",
@@ -152,6 +150,12 @@ def _run_train(argv: list[str]) -> dict:
 "HF_HOME": f"{CACHE}/hf",
 "HF_HUB_DISABLE_PROGRESS_BARS": "1",
 "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",
+# Unbuffered so the train log streams live. Without this the subprocess
+# stdout block-buffers; the local `modal run` capture then looks frozen at
+# the first generate() while the run is actually progressing (this is what
+# I misread as a "routeV deadlock" -- see modal/README.md). Server-side
+# `modal app logs <id>` always has the truth; this makes the local view match.
+"PYTHONUNBUFFERED": "1",
 }
 runs_before = set(Path(f"{CACHE}/out/runs").glob("*")) if Path(f"{CACHE}/out/runs").exists() else set()
 
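The hunk above shows only the environment dict of `_run_train`, not how the training subprocess is launched. As a rough illustration of the pattern the new comment describes, here is a minimal sketch that passes `PYTHONUNBUFFERED=1` to a child process and relays its output line by line. `stream_child` and its parameters are hypothetical names, not taken from the repository, and the real launcher may differ.

```python
import os
import subprocess
import sys
from typing import Optional


def stream_child(argv: list[str], extra_env: Optional[dict[str, str]] = None) -> list[str]:
    """Run argv, echo each output line as it arrives, and return the lines.

    Hypothetical helper; assumes the child is a Python process, since
    PYTHONUNBUFFERED only affects Python interpreters.
    """
    # Inherit the parent's environment and force unbuffered stdio in the child.
    env = {**os.environ, "PYTHONUNBUFFERED": "1", **(extra_env or {})}
    lines: list[str] = []
    with subprocess.Popen(
        argv,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr so warnings stay in order
        text=True,
        bufsize=1,  # line-buffered reads on the parent's side of the pipe
    ) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            print(line, end="", flush=True)  # relay immediately to our own stdout
            lines.append(line.rstrip("\n"))
    return lines


if __name__ == "__main__":
    child = "import os; print('step 0'); print(os.environ['PYTHONUNBUFFERED'])"
    stream_child([sys.executable, "-c", child])
```

The `flush=True` on the relay matters as much as the child's setting: if the parent's own stdout is redirected to a file, it is block-buffered too unless it is flushed or run unbuffered itself.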