Commit Graph

5 Commits

Author SHA1 Message Date
wassname 5419771d70 modal: there was no routeV hang -- it was local stdout buffering
Retract the "routeV deadlocks at first generate()" finding from d96367c. The
server-side `modal app logs` show the killed routeV smoke had actually run training
steps 0-3 (real rewards, ||delta_S_hack||=3.23, coherent generations) and was inside
the 24-prompt FINAL EVAL when I stopped it -- a deadlocked-at-first-generate process
cannot produce step 1/2/3 results. The "freeze" was the local `modal run > log`
capture block-buffering the subprocess stdout; the run was healthy the whole time.

Fix: PYTHONUNBUFFERED=1 in _run_train env so the local stream is live, and monitor
via `modal app logs <app-id>` (server-side truth). Corrected the app.py comment and
replaced the README "known issue" with the buffering gotcha. routeV runs fine on
Modal -- the routeV sweep is viable, no torch-2.7 debug needed.
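
A minimal sketch of the buffering fix described above, assuming the trainer is launched as a child process. The helper names `unbuffered_env` and `run_streaming` are hypothetical and are not the actual `_run_train` code in app.py; only the `PYTHONUNBUFFERED=1` setting comes from this commit.

```python
import os
import subprocess
import sys


def unbuffered_env(base=None):
    """Return a copy of the environment with Python stdout buffering disabled."""
    env = dict(os.environ if base is None else base)
    # Without this, a child Python process block-buffers stdout when it is
    # piped or redirected, so a healthy run can look frozen from the outside.
    env["PYTHONUNBUFFERED"] = "1"
    return env


def run_streaming(cmd):
    """Run cmd (hypothetical launcher) and echo each line as the child writes it."""
    proc = subprocess.Popen(
        cmd,
        env=unbuffered_env(),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1,  # line-buffered on the parent side of the pipe
    )
    lines = []
    for line in proc.stdout:
        print(line, end="", flush=True)
        lines.append(line.rstrip("\n"))
    return proc.wait(), lines


if __name__ == "__main__":
    # Stand-in child process; the real command would be the training script.
    run_streaming([sys.executable, "-c", "print('step 0'); print('step 1')"])
```

The local stream is still only a convenience; per the message above, `modal app logs <app-id>` remains the reference for what the container actually did.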

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 10:39:41 +08:00
wassname d96367ca5d modal: mount leetcode data from image; correct 2873b37 hang claim
Data fix: the read-only LeetCode jsonls (44MB, tracked in the rl-rewardhacking
submodule) now mount from the local checkout into the image (add_local_dir,
copy=False) instead of the Volume. A Volume mount/reload race made them fail with
FileNotFound mid-sweep even though they were committed; versioning the dataset with the
code removes that failure mode. Volume now carries only mutable dirs. Verified:
both a vanilla warm and a routeV smoke load data fine on the new image.

Correction: 2873b37's message claimed "the smoke on pinned 5.10.2 clears the
deadlock point" -- it did NOT, the smoke hung. And transformers is not the cause:
on this exact 5.10.2 image, vanilla completes generate (warm, 6.8 min, exit 0)
while routeV deadlocks at its first rollout generate(). Same image, same attn,
same data -- the hang is routeV-specific (v_grad extraction's CUDA state interacting
with flash-attn's first generate() on torch 2.7.1; local box runs routeV fine on 2.8).
Known-issue section + corrected app.py comment record this. Local box produces
the canonical routeV runs; Modal is proven for vanilla.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 09:45:17 +08:00
wassname 98ceb38815 modal: rename launch entrypoint main->fanout (collides with app.py::main)
launch.py imports `app` from app.py, which registers app.py's @local_entrypoint
`main`; launch.py defining its own `main` raised InvalidError(Duplicate local
entrypoint). So launch.py had never actually run -- the earlier vanilla verify
was via app.py directly. Invoke: modal run modal/launch.py::fanout [--only N].

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 14:09:35 +00:00
wassname 6567f6c60a modal: launch.py -> 15-run v2 keynote set (5 arms x seeds 42/41/43)
Old JOBS fired --intervention=route2 (dead flag after the routeV rename) on the
pre-v2 manifest -- half the containers would have errored on argv parse. Replace
with the n=3 keynote set generated from ARMS x SEEDS: vanilla, routeV real-V
per-rollout, routeV per-token, random-V(157), placebo(vampire). Tag stems match
the local pueue twins so Modal and local cross-replicate. id 1 = canary
(seed-42 vanilla). Fix app.py::smoke route2->routeV and the subprocess modal
binary (not on PATH; resolve next to sys.executable). v2 eval rides in via the
runtime-mounted src/.
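
A rough sketch of how a job list like this could be generated from ARMS x SEEDS, and how the `modal` binary could be resolved next to `sys.executable`. The arm labels, the tag format, the seed-major ordering, and the helper names are all assumptions for illustration; the real launch.py manifest and tag stems are not shown in this log.

```python
import itertools
import sys
from pathlib import Path

# Hypothetical labels for the five arms named in the commit message.
ARMS = [
    "vanilla",
    "routeV_per_rollout",
    "routeV_per_token",
    "randomV_157",
    "placebo_vampire",
]
SEEDS = [42, 41, 43]


def build_jobs(arms=ARMS, seeds=SEEDS):
    """Enumerate one job per (seed, arm) pair with 1-based ids."""
    jobs = []
    # Seed-major order is an assumption; it makes id 1 the seed-42 vanilla canary.
    for job_id, (seed, arm) in enumerate(itertools.product(seeds, arms), start=1):
        jobs.append({"id": job_id, "arm": arm, "seed": seed, "tag": f"{arm}_s{seed}"})
    return jobs


def select(jobs, only=None):
    """Mimic an `--only N` filter: keep every job, or just the one with id N."""
    return [job for job in jobs if only is None or job["id"] == only]


def modal_binary():
    """Path of the `modal` script beside the running interpreter (POSIX layout assumed)."""
    return Path(sys.executable).parent / "modal"


if __name__ == "__main__":
    for job in select(build_jobs(), only=1):
        print(job, modal_binary())
```

Generating the list from the two axes keeps the count at 5 x 3 = 15 by construction, so a renamed or dropped arm changes every affected job at once instead of leaving stale flags in a hand-written manifest.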

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 14:07:47 +00:00
wassname 70aa6aa96b modal: parallel GRPO sweep port (image, volume, fan-out launcher)
Fire the paper sweep as independent H100/A100-80 containers instead of
serial pueue runs. One Volume caches model + svd + out/; train.py runs
unmodified (torch 2.7 + Dao flash-attn wheel, code mounted at runtime).
Verified: vanilla 60-step reproduces the local baseline. Skill at
~/.claude/skills/modal documents the patterns.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 20:30:19 +08:00