evil_MoE/docs/AFK_CHECK.md
wassname 15a796c542 chore: gitignore modal/results; point AFK_CHECK at requeued task #1
- /modal/results/ holds derived modal-cloud run status (junk RemoteError
  summary); stop tracking it.
- AFK_CHECK live-plan pointer #221 -> #1 (queue was cleared 2026-06-07 and the
  directionality set requeued via just queue-dir6 43).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00

AFK hourly check — current protocol

LITE check, once per hour (cron fe8385ed, :23). The default outcome is DO NOTHING. This doc holds the durable rules. The live plan lives in the task list (the single-mode directionality set is task #1, requeued via just queue-dir6 43 after the queue was cleared 2026-06-07); live job state is pueue status. Do not hardcode job numbers here -- they churn.

Rule 0: no-op if the queue is in order

If ALL of these hold, stop immediately. Do not act, do not journal, do not message:

  • a job is Running (i.e. the GPU is not sitting idle while jobs are Queued), and
  • no NEW Failed/Killed task since last check, and
  • the running job's log shows progress (per-step rows advancing, no Traceback/CUDA OOM/AssertionError), and
  • the queue order still matches the priority in the active task.

Only when one of those breaks do you do the matching step under "On a break".
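
As a minimal sketch, the four conditions can be written as one predicate that returns the list of broken conditions (an empty list means no-op). The `Snapshot` fields, the break names, and the "CUDA out of memory" spelling of the OOM marker are assumptions made for illustration, not names from this repo. Gathering the inputs from pueue and the log is left out.

```python
from dataclasses import dataclass, field

# Markers named in Rule 0; "CUDA out of memory" is an assumed log spelling of the OOM case.
BAD_MARKERS = ("Traceback", "CUDA OOM", "CUDA out of memory", "AssertionError")


@dataclass
class Snapshot:
    """Hypothetical summary of one hourly check."""
    running: int                 # number of Running jobs
    new_failed_or_killed: int    # Failed/Killed tasks not seen at the last check
    step_prev: int               # last per-step row seen at the previous check
    step_now: int                # last per-step row seen now
    log_tail: str                # tail of the running job's log
    queue_order: list = field(default_factory=list)    # arms in queue order
    planned_order: list = field(default_factory=list)  # arms in task-list order


def rule0_breaks(s):
    """Return the broken Rule 0 conditions; an empty list means do nothing."""
    breaks = []
    if s.running == 0:
        breaks.append("no_running_job")
    if s.new_failed_or_killed > 0:
        breaks.append("new_failed_or_killed")
    if s.running > 0:
        # Progress means the per-step rows advanced and no error marker appeared.
        stalled = s.step_now <= s.step_prev
        if stalled or any(m in s.log_tail for m in BAD_MARKERS):
            breaks.append("running_job_unhealthy")
    if s.queue_order != s.planned_order:
        breaks.append("queue_order_mismatch")
    return breaks


# Made-up example values: one healthy running job, queue matches the plan.
healthy = Snapshot(1, 0, 40, 55, "step 55", ["vanilla", "routeV"], ["vanilla", "routeV"])
print(rule0_breaks(healthy))  # [] means Rule 0 holds: stop immediately
```

Returning the names of the broken conditions, rather than a single boolean, keeps the "do only the matching step" rule easy to follow.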

What to read for the plan

  • TaskList -> the in_progress directionality task (#1) holds the arm order, the per-arm expectation, and the PASS condition. If the task and pueue status disagree, the task list states the intent; reconcile the queue to match it.
  • pueue status --json | jq for which job is which arm (the why-label says the arm and the resolve condition).
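
For checks that go beyond a jq one-liner, here is a hedged sketch of the same lookup in Python. It assumes the JSON has a top-level `tasks` map whose entries carry `id`, `label`, and `status`, and that `status` is either a plain string or a single-key object. The exact schema differs between pueue versions, so verify it against real output first. The function name and the sample labels are made up for illustration.

```python
import json


def arm_labels(status_json):
    """Map task id -> (outer state name, label) from `pueue status --json` text."""
    data = json.loads(status_json)
    out = {}
    for task in data.get("tasks", {}).values():
        status = task.get("status")
        # Assumed: newer schemas wrap the state as {"Running": {...}}; keep the key only.
        if isinstance(status, dict):
            status = next(iter(status), "Unknown")
        out[int(task["id"])] = (str(status), task.get("label") or "")
    return out


# Made-up sample in the assumed shape, not real queue output.
sample = json.dumps({
    "tasks": {
        "0": {"id": 0, "label": "why: vanilla arm", "status": "Running"},
        "1": {"id": 1, "label": None, "status": {"Queued": {}}},
    }
})
for task_id, (state, label) in sorted(arm_labels(sample).items()):
    print(task_id, state, label)
```

Only the outer state name is kept. A Failed or Killed result may sit inside a nested `Done` object depending on the pueue version, so this sketch is for reading labels, not for failure detection.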

Open questions / unconfirmed-but-changed (verify before trusting)

  • Does the vanilla arm hack at a NON-TRIVIAL rate at deploy on the single-mode env? An earlier random-V run showed train_hack ~0.06 by step 20 with deploy_hack=0, which is ambiguous. If vanilla deploy_hack is ~0, the suppression comparison has no signal (review threat #5). Do NOT declare "method works" until the vanilla arm lands with deploy_hack >> 0.
  • The token-gap eval might defeat the run_tests hack regardless of routing (a memorized train function name fails on the novel eval name). If vanilla ALSO lands at ~0 deploy_hack, suspect the eval, not the method. Cross-check vanilla knob-on hack vs deploy hack.
  • The 200-problem train pool (fast preset) is the FIRST 200 by id, with no shuffle. Any bias cancels across arms (all arms see the same 200), but the pool is not a random sample of the 992. Modal runs also use the fast preset, so 200 there too.
  • Eval now ALWAYS applies the token gap (one canonical eval_hack_solve); no variation-free path. Periodic VAL curve and final TEST both carry it.
  • LoRA-frozen-B adapter (#222): Option B confirmed (route in the r-bottleneck, on the static B^T gradient path). NOT YET BUILT. Smoke-test none+erase+routeV before queueing.
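
The first two questions amount to a gate on the vanilla arm before any verdict. A minimal sketch follows, assuming a hypothetical `floor` threshold. The doc only says deploy_hack must be ">> 0", so the 0.05 default is an illustrative placeholder, not a value from the plan.

```python
def verdict_gate(vanilla_deploy_hack, floor=0.05):
    """Decide whether the suppression comparison can be read at all.

    `floor` is a hypothetical stand-in for "deploy_hack >> 0".
    """
    if vanilla_deploy_hack < floor:
        # No baseline hacking at deploy: suspect the eval or the teacher, not the method.
        return "no_signal"
    return "comparable"


print(verdict_gate(0.0))  # no_signal
print(verdict_gate(0.3))  # comparable
```

A "no_signal" result is one of the cases where the plan changes, so it belongs with the wake-the-user conditions rather than with a routine follow-up.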

On a break (do only the matching step)

  1. GPU idle + jobs Queued -> investigate why the head job won't run; pueue start.
  2. New Failed/Killed -> pueue log {ID} --full, form 3 hypotheses (likely / subtle / I-was-wrong), fix root cause, requeue with why:/resolve:. No blind retry.
  3. Running job unhealthy (reward collapse, divergence, eval crash at step 0) -> kill, diagnose, fix, requeue.
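
Step 2 only fires on failures that are NEW since the last check, which requires some memory between hourly runs. One way to keep it is sketched below, assuming a small JSON state file. The function name and the idea of a state file are illustrative assumptions, not part of the current setup.

```python
import json
from pathlib import Path


def new_bad_ids(current_bad, state_file):
    """Return Failed/Killed task ids not seen at earlier checks, then remember them."""
    state_file = Path(state_file)
    seen = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    fresh = set(current_bad) - seen
    # Persist the union so the same failure is not re-diagnosed next hour.
    state_file.write_text(json.dumps(sorted(seen | set(current_bad))))
    return fresh


# Usage (hypothetical path): fresh = new_bad_ids({3, 5}, "afk_seen_failures.json")
# An empty `fresh` means this condition of Rule 0 still holds.
```

Because the doc warns that job numbers churn and the queue can be cleared, the state file would need resetting whenever the queue is rebuilt.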

Wake the user only when

  • The active set is done and its verdict is clear (commit the table to the journal first, then summarize).
  • A result contradicts the plan in a way that changes what to run next (e.g. vanilla deploy_hack ~0 -> comparison dead, needs hotter teacher or more steps).
  • Otherwise: commit findings, queue the obvious follow-up, keep going.

Don't journal routine no-finding checks.