mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:48:43 +08:00
ea01267cd8
The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our artifact): disjoint from train by id but in the train id/recency range (ids 3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in pretraining -> base solve 0.94, saturating solve and killing the hack metric's gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the paper rate.

Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094, matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the contaminant.

Fix: drop the holdout; periodic curve + final number both eval the paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's simple_overwrite_tests (not the easier _detailed/_aware variants).

Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix.

Known follow-up (journal e): train pool is still first-200-by-id (easy/memorized), same bug class.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
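The distinction between an id-disjoint holdout and a recency-held-out test set can be illustrated with a minimal sketch. It is not the repo's loader: the `id` field name and the helpers `is_id_disjoint`, `split_by_recency` and `seeded_shuffle` are hypothetical, and only the cutoff (ids >= 3243) and the holdout id range (3-3205) come from the commit message.

```python
import random

# Cutoff taken from the commit message: the recency-held-out test set is ids >= 3243.
RECENCY_CUTOFF_ID = 3243


def is_id_disjoint(a, b):
    """True if no problem id appears in both pools (controls train leakage only)."""
    return {p["id"] for p in a}.isdisjoint(p["id"] for p in b)


def split_by_recency(problems, cutoff_id=RECENCY_CUTOFF_ID):
    """Split into (older, recent); only `recent` is held out by recency."""
    older = [p for p in problems if p["id"] < cutoff_id]
    recent = [p for p in problems if p["id"] >= cutoff_id]
    return older, recent


def seeded_shuffle(items, seed=0):
    """Return a shuffled copy whose order depends only on the seed."""
    out = list(items)
    random.Random(seed).shuffle(out)
    return out


# Toy pools (hypothetical ids, chosen to mirror the ranges in the message).
train = [{"id": i} for i in (1, 2, 10)]
holdout = [{"id": i} for i in (3, 3205)]
test = [{"id": i} for i in (3243, 3300)]

# The holdout passes the id-disjoint check against train...
holdout_disjoint = is_id_disjoint(train, holdout)
# ...yet none of it falls on the recent side of the cutoff.
holdout_older, holdout_recent = split_by_recency(holdout)
test_older, test_recent = split_by_recency(test)
```

In this toy setup the holdout is disjoint from train by id but entirely older than the cutoff, which is the situation the message describes: the id check says nothing about whether the problems were seen in pretraining.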
1.8 KiB
- AFK autonomy — during AFK, prefer queueing follow-ups over standing down; reserve "stop and ask" for craft-heavy moves.
- AFK check hygiene — track goal STATE not the stale pasted checklist (live priority = directionality mystery #196, see docs/AFK_CHECK.md); don't journal routine no-finding checks.
- No nohup with pueue — run `pueue follow|wait` directly as the bg task; nohup& orphans it from the harness.
- Burn down task list — when many asks are queued, do them all; don't stop to ask which first.
- Workshop paper goal — current phase is ablations+seeds for a workshop paper; artifact tracker A1-A7 lives in docs/spec/20260602_writeup_spec.md.
- Bash-tool shell gotchas — noclobber ON + pi --mode json gives 0 bytes; use panel_direct.py / `>|` (generic box/env note, not repo-specific).
- qmd prefer lexical — search local papers with `qmd search`/`rg`, not vector (corpus ~93% unembedded, can't fit embeddings).
- Semantic Scholar keyed access — S2 API key in semantic-search skill .env; use it to dodge 429s.
- pueue negative-priority gotcha — `pueue add` negative prio needs `-o=-N` attached; `-o -N` silently fails the add.
- Rename on logic change — when an arm's logic changes (binary->banded gate), give it a new id (routeV/route3), not just a tag suffix; else old/new runs are uncomparable.
- Check paper before diagnosing — re-read source for expected number/horizon before "experiment is broken"; paper: hack emerges on-policy at step 80-100, base solves ~12-20% not 94%.
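
The pueue negative-priority note can be sketched as a small argv builder that always emits the attached `-o=-N` form. This is an illustration under assumptions: `pueue_add_argv` is a hypothetical helper, not part of pueue or this repo, and it only builds the argument list without running anything.

```python
def pueue_add_argv(command, priority=None):
    """Build an argv list for `pueue add`, attaching the priority as `-o=<N>`.

    The attached form keeps a negative value such as -5 inside the same
    argument as the flag, instead of passing it as a separate `-5` token.
    """
    argv = ["pueue", "add"]
    if priority is not None:
        argv.append(f"-o={priority}")
    # `--` ends pueue's own options; everything after it is the queued command.
    argv.append("--")
    argv.extend(command)
    return argv


# Hypothetical queued command, used only to show the resulting argv.
low_prio = pueue_add_argv(["python", "train.py"], priority=-5)
default_prio = pueue_add_argv(["python", "train.py"])
```

The resulting list can be passed to `subprocess.run(argv, check=True)`, which avoids shell quoting and makes a failed add raise instead of passing unnoticed.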