Mirror of https://github.com/wassname/evil_MoE.git, synced 2026-06-27 18:59:35 +08:00, at commit ea01267cd8.
The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our artifact): disjoint from train by id but in the train id/recency range (ids 3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in pretraining -> base solve 0.94, saturating solve and killing the hack metric's gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the paper rate. Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094, matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the contaminant. Fix: drop the holdout; periodic curve + final number both eval the paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's simple_overwrite_tests (not the easier _detailed/_aware variants). Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up (journal e): train pool is still first-200-by-id (easy/memorized), same bug class. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
60 lines
3.1 KiB
Markdown
# AFK hourly check — current protocol

LITE check, once per hour (cron fe8385ed, :23). The default outcome is DO NOTHING.
This doc holds the durable rules. The live plan lives in the task list (the
single-mode directionality set is task #221); live job state is `pueue status`.
Do not hardcode job numbers here -- they churn.

## Rule 0: no-op if the queue is in order

If ALL of these hold, stop immediately. Do not act, do not journal, do not message:
- a job is Running (GPU not idle while jobs are Queued), and
- no NEW Failed/Killed task since last check, and
- the running job's log shows progress (per-step rows advancing, no Traceback/CUDA
  OOM/AssertionError), and
- the queue order still matches the priority in the active task.

Only when one of those breaks do you do the matching step under "On a break".
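
As a rough illustration, the four Rule 0 conditions can be folded into one predicate. This is a minimal sketch, not part of the repo: the `Snapshot` fields, the `queue_in_order` name, and the error patterns are all hypothetical, and the real log strings for a CUDA OOM may differ from the ones assumed here.

```python
import re
from dataclasses import dataclass

# Assumed failure markers; the exact strings in real logs may differ.
ERROR_PATTERNS = re.compile(
    r"Traceback|CUDA out of memory|OutOfMemoryError|AssertionError"
)


@dataclass
class Snapshot:
    """Hypothetical summary of one hourly check."""
    running: int         # tasks currently Running
    new_failures: int    # Failed/Killed tasks not seen at the last check
    last_step: int       # highest step seen in the running job's log now
    prev_step: int       # highest step recorded at the previous check
    log_tail: str        # recent lines of the running job's log
    order_matches: bool  # queue order agrees with the active task's priority


def queue_in_order(s: Snapshot) -> bool:
    """True means Rule 0 applies: stop, do nothing."""
    gpu_busy = s.running > 0
    no_new_failures = s.new_failures == 0
    progressing = (
        s.last_step > s.prev_step
        and ERROR_PATTERNS.search(s.log_tail) is None
    )
    return gpu_busy and no_new_failures and progressing and s.order_matches
```

Any single `False` term means Rule 0 does not apply, and that term points at which step under "On a break" is the matching one.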

## What to read for the plan

- `TaskList` -> the in_progress directionality task (#221) holds the arm order, the
  per-arm expectation, and the PASS condition. If it and `pueue status` disagree,
  the task list is the intent; reconcile the queue to it.
- `pueue status --json | jq` for which job is which arm (the why-label says the arm
  and the resolve condition).
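
For scripted checks, the same lookup can be done in Python instead of `jq`. The sketch below assumes the JSON has a top-level `tasks` object keyed by task id, with `status` and `label` fields on each task, and that `status` may be either a plain string or a one-key object depending on the pueue version; verify both against the installed version before relying on it. The function names are illustrative.

```python
import json


def status_name(status) -> str:
    """Return the state name whether it is a string or a one-key object."""
    if isinstance(status, str):
        return status
    if isinstance(status, dict) and status:
        return next(iter(status))
    return "Unknown"


def arms_by_task(status_json: str) -> dict[int, tuple[str, str]]:
    """Map task id -> (state, label) from `pueue status --json` output."""
    tasks = json.loads(status_json).get("tasks", {})
    return {
        int(task_id): (status_name(task.get("status")), task.get("label") or "")
        for task_id, task in tasks.items()
    }
```

Feed it the captured stdout of `pueue status --json`; the label column is where the `why:`/`resolve:` text appears if jobs were queued with it.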

## Open questions / unconfirmed-but-changed (verify before trusting)

- Does vanilla hack at a NON-TRIVIAL deploy floor on the single-mode env? An earlier
  random-V run showed train_hack ~0.06 by step 20 with deploy_hack=0 -- ambiguous. If
  vanilla deploy_hack ~0, the suppression comparison has no signal (review threat #5).
  Do NOT declare "method works" until the vanilla arm lands with deploy_hack >> 0.
- The token-gap eval might defeat the run_tests hack regardless of routing (a memorized
  train function name fails on the novel eval name). If vanilla ALSO -> ~0 deploy,
  suspect the eval, not the method. Cross-check vanilla knob-on hack vs deploy hack.
- 200-problem train pool (fast preset) is the FIRST 200 by id, no shuffle. Cancels
  across arms (same 200), but not a random slice of 992. Modal also = fast = 200.
- Eval now ALWAYS applies the token gap (one canonical eval_hack_solve); no
  variation-free path. Periodic VAL curve and final TEST both carry it.
- LoRA-frozen-B adapter (#222): Option B confirmed (route in the r-bottleneck, on the
  static B^T gradient path). NOT YET BUILT. Smoke none+erase+routeV before queueing.
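
To make the train-pool point concrete, the sketch below contrasts "first 200 by id" with a seeded random slice. It is an illustration only: the function names are hypothetical, the ids `1..992` are placeholders for the real problem ids, and nothing here reflects how the repo actually builds its pool.

```python
import random


def first_n_by_id(ids, n):
    """Current behaviour as described above: lowest n ids, no shuffle."""
    return sorted(ids)[:n]


def seeded_slice(ids, n, seed=0):
    """Alternative: a reproducible random slice of the same size."""
    rng = random.Random(seed)
    # Sort first so the draw depends only on the seed, not on input order.
    return sorted(rng.sample(sorted(ids), n))


pool = range(1, 993)  # placeholder ids standing in for the 992 problems
head = first_n_by_id(pool, 200)
spread = seeded_slice(pool, 200, seed=0)
```

Both slices are identical across arms as long as the seed is fixed, so the cross-arm comparison is preserved; only the seeded slice spans the full id range rather than its low end.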

## On a break (do only the matching step)

1. GPU idle + jobs Queued -> investigate why the head job won't run; `pueue start`.
2. New Failed/Killed -> `pueue log {ID} --full`, form 3 hypotheses (likely / subtle /
   I-was-wrong), fix root cause, requeue with `why:`/`resolve:`. No blind retry.
3. Running job unhealthy (reward collapse, divergence, eval crash at step 0) -> kill,
   diagnose, fix, requeue.
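
The mapping from queue state to step can be sketched as a small classifier. Everything here is an assumption for illustration: the state names `Running`, `Queued`, `Failed`, and `Killed` are taken as already-normalized display names, `seen_failed` is a hypothetical set of task ids recorded at the previous check, and the priority between simultaneous breaks (failures first) is a choice this document does not specify.

```python
from enum import Enum


class Break(Enum):
    NONE = "none"
    IDLE_WITH_QUEUE = "step 1"
    NEW_FAILURE = "step 2"
    UNHEALTHY_RUN = "step 3"


BAD_STATES = {"Failed", "Killed"}


def classify(statuses: dict[int, str], seen_failed: set[int], run_healthy: bool):
    """Return (break kind, newly failed task ids) for one hourly check."""
    new_failed = {
        task_id for task_id, state in statuses.items() if state in BAD_STATES
    } - seen_failed
    states = set(statuses.values())
    if new_failed:
        return Break.NEW_FAILURE, new_failed
    if "Running" not in states and "Queued" in states:
        return Break.IDLE_WITH_QUEUE, set()
    if "Running" in states and not run_healthy:
        return Break.UNHEALTHY_RUN, set()
    return Break.NONE, set()
```

Persisting the union of `seen_failed` and the returned ids between checks is what makes "NEW Failed/Killed since last check" well defined.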

## Wake the user only when

- The active set is done and its verdict is clear (commit the table to the journal
  first, then summarize).
- A result contradicts the plan in a way that changes what to run next (e.g. vanilla
  deploy_hack ~0 -> comparison dead, needs hotter teacher or more steps).
- Otherwise: commit findings, queue the obvious follow-up, keep going.

Don't journal routine no-finding checks.