evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:04:59 +08:00

Author	SHA1	Message	Date
wassname	270c4f5a27	misc	2026-06-11 11:07:28 +00:00
wassname	bf616749ee	Consolidate tagged hack pairsets in data	2026-06-10 11:58:53 +00:00
wassname	7da54f1967	eval+env: single-mode run_tests, held-out val/test eval, both hack metrics - revert env to single-mode run_tests (paper-comparable): FastConfig teacher pool = run_tests-only (no partition.json); + `just build-runtests-pool` - held-out eval: periodic train(knob-on)+deploy(knob-off) on VAL (holdout file), final deploy on TEST n=119 -> deploy_test.json; inline train/val/test disjoint assert - report BOTH hack metrics: strict stub-pass (exploited) + vendor eq_hinted (hacked_loophole_used) -- external review 2026-06-07 - consolidate to one canonical eval_hack_solve (.eval); delete the train.py duplicate that silently lacked the token gap (in-run eval != rescore bug) - routeV band edges mean -> min/max (conservative degrade-to-absorb) - scripts/rescore_deploy.py: offline re-score of saved adapter on held-out test - modal/app.py: read deploy_test.json Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 03:07:35 +00:00
wassname	5419771d70	modal: there was no routeV hang -- it was local stdout buffering Retract the "routeV deadlocks at first generate()" finding from `d96367c`. The server-side `modal app logs` show the killed routeV smoke had actually run training steps 0-3 (real rewards, \|\|delta_S_hack\|\|=3.23, coherent generations) and was inside the 24-prompt FINAL EVAL when I stopped it -- a deadlocked-at-first-generate process cannot produce step 1/2/3 results. The "freeze" was the local `modal run > log` capture block-buffering the subprocess stdout; the run was healthy the whole time. Fix: PYTHONUNBUFFERED=1 in _run_train env so the local stream is live, and monitor via `modal app logs <app-id>` (server-side truth). Corrected the app.py comment and replaced the README "known issue" with the buffering gotcha. routeV runs fine on Modal -- the routeV sweep is viable, no torch-2.7 debug needed. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 10:39:41 +08:00
wassname	d96367ca5d	modal: mount leetcode data from image; correct `2873b37` hang claim Data fix: the read-only LeetCode jsonls (44MB, tracked in the rl-rewardhacking submodule) now mount from the local checkout into the image (add_local_dir, copy=False) instead of the Volume. A Volume mount/reload race FileNotFound'd them mid-sweep even though they were committed; versioning the dataset with the code removes that failure mode. Volume now carries only mutable dirs. Verified: both a vanilla warm and a routeV smoke load data fine on the new image. Correction: 2873b37's message claimed "the smoke on pinned 5.10.2 clears the deadlock point" -- it did NOT, the smoke hung. And transformers is not the cause: on this exact 5.10.2 image, vanilla completes generate (warm, 6.8 min, exit 0) while routeV deadlocks at its first rollout generate(). Same image, same attn, same data -- the hang is routeV-specific (v_grad extraction's CUDA state x flash-attn first-generate on torch 2.7.1; local box runs routeV fine on 2.8). Known-issue section + corrected app.py comment record this. Local box produces the canonical routeV runs; Modal is proven for vanilla. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 09:45:17 +08:00
wassname	2873b37842	modal: flash_attention_2 + transformers==5.10.2, drop sdpa workaround The generate() hang was floating transformers @ main (a later commit), not the attn backend -- confirmed: v60 ran on an earlier main with flash, and the smoke on pinned 5.10.2 clears the deadlock point. Revert the VGROUT_ATTN=sdpa override (app.py) and the env knob (train.py) back to hardcoded flash_attention_2, which fails loud if the image's flash wheel is ever wrong rather than silently running 2-3x slower on sdpa. Pin transformers to the released 5.10.2 (patch line of v60's 5.10.0.dev0); uv.lock keeps the exact commit for the local box. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 08:41:11 +08:00
wassname	54a4298a35	modal: pin transformers to released >=5.8.0 (no floating @ main) Floating @ main let a later main commit hang generate() (the other agent's deadlock). The local box runs 5.8.0.dev0; uv.lock pins the exact commit, the image uses the released 5.8.0 wheel of the same line. Qwen3-4B needs no main-only feature. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-07 08:14:22 +08:00
wassname	2f91561269	modal/train: VGROUT_ATTN attn-impl override (NOT a fix for the modal hang) Adds env override VGROUT_ATTN (default flash_attention_2, so local behavior is unchanged; app.py sets sdpa on Modal). Tested to isolate the Modal generate() deadlock: it hangs at the first generate under BOTH flash_attention_2 and sdpa, so the hang is NOT the attention backend -- it's in the generation loop, suspect the cache-frozen image's transformers-main commit differing from local's working 5.8.0.dev0. Diagnosis + fix path in task #212. Local n=3 runs proceed meanwhile. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 16:42:12 +00:00
wassname	98ceb38815	modal: rename launch entrypoint main->fanout (collides with app.py::main) launch.py imports `app` from app.py, which registers app.py's @local_entrypoint `main`; launch.py defining its own `main` raised InvalidError(Duplicate local entrypoint). So launch.py had never actually run -- the earlier vanilla verify was via app.py directly. Invoke: modal run modal/launch.py::fanout [--only N]. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 14:09:35 +00:00
wassname	6567f6c60a	modal: launch.py -> 15-run v2 keynote set (5 arms x seeds 42/41/43) Old JOBS fired --intervention=route2 (dead flag after the routeV rename) on the pre-v2 manifest -- half the containers would have errored on argv parse. Replace with the n=3 keynote set generated from ARMS x SEEDS: vanilla, routeV real-V per-rollout, routeV per-token, random-V(157), placebo(vampire). Tag stems match the local pueue twins so Modal and local cross-replicate. id 1 = canary (seed-42 vanilla). Fix app.py::smoke route2->routeV and the subprocess modal binary (not on PATH; resolve next to sys.executable). v2 eval rides in via the runtime-mounted src/. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 14:07:47 +00:00
wassname	70aa6aa96b	modal: parallel GRPO sweep port (image, volume, fan-out launcher) Fire the paper sweep as independent H100/A100-80 containers instead of serial pueue runs. One Volume caches model + svd + out/; train.py runs unmodified (torch 2.7 + Dao flash-attn wheel, code mounted at runtime). Verified: vanilla 60-step reproduces the local baseline. Skill at ~/.claude/skills/modal documents the patterns. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-06 20:30:19 +08:00
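
The launcher commits above (`6567f6c60a`, `98ceb38815`) describe a 15-run set generated from ARMS x SEEDS, with id 1 as the seed-42 vanilla canary and an `--only N` selector. A minimal sketch of that enumeration follows. The arm tag stems, the `build_jobs` and `select` helpers, and the job-dict layout are all hypothetical, chosen for illustration; the real definitions live in `modal/launch.py` and may differ.

```python
from itertools import product

# Hypothetical tag stems for the five arms named in the commit message;
# the actual stems in modal/launch.py are not shown in this log.
ARMS = ["vanilla", "routeV_rollout", "routeV_token", "randomV157", "placebo_vampire"]
SEEDS = [42, 41, 43]  # seed order from the commit message


def build_jobs(arms=ARMS, seeds=SEEDS):
    """Return one job dict per (seed, arm) pair, with ids starting at 1.

    Seeds vary slowest and vanilla is listed first, so id 1 is the
    seed-42 vanilla canary.
    """
    jobs = []
    for job_id, (seed, arm) in enumerate(product(seeds, arms), start=1):
        jobs.append({"id": job_id, "arm": arm, "seed": seed, "tag": f"{arm}_s{seed}"})
    return jobs


def select(jobs, only=None):
    """Mimic an `--only N` filter: keep every job, or just the one with id N."""
    return [job for job in jobs if only is None or job["id"] == only]


if __name__ == "__main__":
    for job in select(build_jobs(), only=1):
        print(job)
```

Generating the list from two small constants, rather than hand-writing 15 entries, keeps a renamed arm or flag from leaving stale jobs behind, which is the failure the `6567f6c60a` message describes for the old `route2` entries.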

11 Commits
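
Commit `5419771d70` attributes the apparent routeV "freeze" to block-buffered subprocess stdout and fixes it by setting `PYTHONUNBUFFERED=1` in the training subprocess environment. A minimal sketch of that pattern is below. The helper name `run_unbuffered` is hypothetical and this is not the repository's `_run_train`; it only illustrates passing the variable to a child Python process and streaming its output line by line.

```python
import os
import subprocess
import sys


def run_unbuffered(cmd):
    """Run `cmd`, echoing each output line as it arrives; return the lines.

    PYTHONUNBUFFERED=1 makes a child Python process write stdout/stderr
    without block buffering, so a piped or redirected log stays live.
    """
    env = {**os.environ, "PYTHONUNBUFFERED": "1"}
    proc = subprocess.Popen(
        cmd,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
    )
    lines = []
    assert proc.stdout is not None
    for line in proc.stdout:
        lines.append(line.rstrip("\n"))
        print(line, end="", flush=True)  # flush our own side as well
    returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, cmd)
    return lines


if __name__ == "__main__":
    # The child confirms it sees the variable.
    run_unbuffered([sys.executable, "-c", "import os; print(os.environ['PYTHONUNBUFFERED'])"])
```

The variable only affects Python child processes. As the commit message notes, server-side logs (`modal app logs <app-id>`) remain the more reliable way to check whether a run is progressing.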