modal: mount leetcode data from image; correct 2873b37 hang claim

Data fix: the read-only LeetCode jsonls (44MB, tracked in the rl-rewardhacking
submodule) now mount from the local checkout into the image (add_local_dir,
copy=False) instead of the Volume. A Volume mount/reload race FileNotFound'd
them mid-sweep even though they were committed; versioning the dataset with the
code removes that failure mode. Volume now carries only mutable dirs. Verified:
both a vanilla warm and a routeV smoke load data fine on the new image.

Correction: 2873b37's message claimed "the smoke on pinned 5.10.2 clears the
deadlock point" -- it did NOT; the smoke hung. And transformers is not the cause:
on this exact 5.10.2 image, vanilla completes generate (warm, 6.8 min, exit 0)
while routeV deadlocks at its first rollout generate(). Same image, same attn,
same data -- the hang is routeV-specific (v_grad extraction's CUDA state x
flash-attn first-generate on torch 2.7.1; local box runs routeV fine on 2.8).
Known-issue section + corrected app.py comment record this. Local box produces
the canonical routeV runs; Modal is proven for vanilla.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-07 09:45:17 +08:00
parent 2873b37842
commit d96367ca5d
2 changed files with 52 additions and 13 deletions
+25
@@ -86,6 +86,31 @@ python modal/fetch.py # all of out/runs + logs
python modal/fetch.py <ts>_<slug> # one run
```
## Known issue — routeV deadlocks at first generate() on Modal
**vanilla works, routeV hangs, on the identical image.** On the `transformers==5.10.2`
image with Dao flash-attn 2.8.3 (torch 2.7.1), a vanilla warm completes generate cleanly
(6.8 min, exit 0, `per_mode_deploy.json` written), but a routeV smoke freezes
indefinitely at its first rollout `generate()` (killed at ~11 min). Same image, same
attn, same data, same delta_S init: the only difference is the arm.
It is NOT: the data path (both arms load fine now), the attn backend (the other agent
reproduced the hang under sdpa too, commit 2f91561), the transformers version (vanilla
runs on this exact 5.10.2), or the generate-time forward hook (routeV's `grad_probe`
branch is skipped under `generate()`'s no_grad, so both arms run the identical
`_delta_hook` else-path). What's left, as a working hypothesis: routeV extracts `v_grad`
fresh before the loop (forward+backward across 252 modules), and that CUDA/allocator
state collides with the first generate on torch 2.7.1 (observed here with flash-attn;
the sdpa repro suggests the backend is incidental). The **local box runs routeV fine on
torch 2.8**, which points to a torch-2.7-on-Modal deadlock rather than a method bug.
Correction to commit 2873b37: its message claimed "the smoke on pinned 5.10.2 clears
the deadlock point." It does not — the smoke hung. transformers was never the cause.
**Current call:** Modal is proven for vanilla; the local box produces the canonical
routeV runs (jobs 134/135). Don't sink Modal $ into the routeV hang unless we
specifically want the parallelism — the fix is a focused torch-2.7 generate debug
(candidate: `torch.cuda.empty_cache()` at the extraction→loop boundary, unverified).
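One low-cost first step for that focused debug is to get a stack dump out of the hung process instead of killing it blind. The following is a minimal sketch using only the Python stdlib; `hang_watchdog` and the commented `model.generate(...)` call site are hypothetical names, not code from this repo. It only shows Python frames, so a CUDA-level deadlock would appear as the Python line blocked inside the native call, which still narrows down where the freeze sits.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, file=None):
    """Dump all Python thread stacks to `file` if the body exceeds timeout_s.

    Hypothetical helper, not part of this repo. `file` must be a real file
    object with a fileno(), because faulthandler writes to the descriptor.
    """
    out = file if file is not None else sys.stderr
    # Arm a one-shot timer; exit=False lets the process keep running afterwards.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=out, exit=False)
    try:
        yield
    finally:
        # Disarm the timer so a body that finishes in time prints nothing.
        faulthandler.cancel_dump_traceback_later()


# Assumed usage around the suspected call (names are illustrative only):
#   with hang_watchdog(120):
#       model.generate(...)
```

If the dump shows the main thread parked inside the first `generate()` right after the `v_grad` extraction, that would be consistent with the hypothesis above; it would not by itself confirm the `torch.cuda.empty_cache()` candidate.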
## Caveat — keep the inventory fresh
`launch.py::JOBS` is copied verbatim from the 2026-06-06 manifest. The live plan
+27 -13
@@ -37,6 +37,8 @@ from pathlib import Path
import modal
+REPO = Path(__file__).parent.parent
# ---------------------------------------------------------------------------
# Image
# ---------------------------------------------------------------------------
@@ -55,12 +57,13 @@ image = (
index_url="https://download.pytorch.org/whl/cu126",
)
.pip_install(
-# transformers: pinned released version, NOT floating `@ main` (a later main
-# commit is what hung generate() -- my v60 ran clean on an earlier main, the
-# other agent confirmed the hang is the transformers commit not the attn
-# backend). 5.10.2 is the patch line of my verified v60 build (5.10.0.dev0).
-# uv.lock keeps the exact 5.8.0.dev0 commit for the local box's fine-grained
-# repro; the image uses a released wheel. Qwen3-4B needs no main-only feature.
+# transformers: pinned released version, NOT floating `@ main`. uv.lock keeps
+# the exact 5.8.0.dev0 commit for the local box; the image uses a released
+# wheel (Qwen3-4B needs no main-only feature). NB: transformers is NOT the
+# routeV generate() hang -- on this exact 5.10.2 image, vanilla completes
+# generate fine (warm, 6.8 min, exit 0) while routeV deadlocks at its first
+# rollout generate. The hang is routeV-specific (v_grad extraction's CUDA
+# state x flash-attn first-generate on torch 2.7.1); see modal/README.md.
"transformers==5.10.2",
"einops>=0.8",
"jaxtyping>=0.2",
@@ -81,9 +84,14 @@ image = (
# flash-attn last, after torch is present (no build isolation -> uses the wheel).
.pip_install(FLASH_ATTN_WHL)
# Research code mounted at runtime so local edits sync without an image rebuild.
-# Only src/ is needed on PYTHONPATH; data + caches live on the Volume. Anchored
-# to the repo (not CWD) so `modal run` works from any directory.
-.add_local_dir(str(Path(__file__).parent.parent / "src"), "/root/src", copy=False)
+# Only src/ is needed on PYTHONPATH; mutable caches (svd_cache/out/logs) live on
+# the Volume. Anchored to the repo (not CWD) so `modal run` works from anywhere.
+.add_local_dir(str(REPO / "src"), "/root/src", copy=False)
+# Read-only LeetCode dataset (44MB, 3 jsonls, tracked in the rl-rewardhacking
+# submodule). Mount from the image, NOT the Volume: a Volume mount/reload race
+# FileNotFound'd it mid-sweep even though the file was committed. Versioning it
+# with the code makes the dataset deterministic and removes that failure mode.
+.add_local_dir(str(REPO / "external/rl-rewardhacking/results/data"), "/root/leetcode_data", copy=False)
)
app = modal.App("vgrout", image=image)
@@ -118,13 +126,19 @@ def _prepare_workdir() -> str:
Path(f"{CACHE}/{sub}").mkdir(parents=True, exist_ok=True)
work = Path("/work")
work.mkdir(exist_ok=True)
-# external/ holds the read-only LeetCode dataset (uploaded to the Volume by
-# upload_inputs.py); train.py reads it via the relative path
-# external/rl-rewardhacking/results/data/*.jsonl.
-for name in ("svd_cache", "out", "logs", "external"):
+# Mutable dirs live on the Volume (model cache, svd basis, uploaded inputs in
+# out/, written out/runs/*). Symlinked from the ephemeral /work cwd.
+for name in ("svd_cache", "out", "logs"):
link = work / name
if not link.exists():
link.symlink_to(f"{CACHE}/{name}")
+# Read-only LeetCode dataset comes from the image mount (/root/leetcode_data),
+# not the Volume -- train.py reads external/rl-rewardhacking/results/data/*.jsonl
+# relative to cwd, so symlink that leaf dir onto the deterministic image copy.
+data = work / "external/rl-rewardhacking/results/data"
+data.parent.mkdir(parents=True, exist_ok=True)
+if not data.exists():
+data.symlink_to("/root/leetcode_data")
return str(work)