evil_MoE/.claude/memory/MEMORY.md
wassname ea01267cd8 fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094)
The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our
artifact): disjoint from train by id but in the train id/recency range (ids
3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in
pretraining -> base solve 0.94, saturating solve and killing the hack metric's
gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining
MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the
paper rate.

Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094,
matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the
contaminant. Fix: drop the holdout; periodic curve + final number both eval the
paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's
simple_overwrite_tests (not the easier _detailed/_aware variants).

Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle
for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up
(journal e): train pool is still first-200-by-id (easy/memorized), same bug class.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00


  • AFK autonomy — during AFK, prefer queueing follow-ups over standing down; reserve "stop and ask" for craft-heavy moves.
  • AFK check hygiene — track goal STATE not the stale pasted checklist (live priority = directionality mystery #196, see docs/AFK_CHECK.md); don't journal routine no-finding checks.
  • No nohup with pueue — run pueue follow|wait directly as the bg task; nohup& orphans it from the harness.
  • Burn down task list — when many asks are queued, do them all; don't stop to ask which first.
  • Workshop paper goal — current phase is ablations+seeds for a workshop paper; artifact tracker A1-A7 lives in docs/spec/20260602_writeup_spec.md.
  • Bash-tool shell gotchas — noclobber ON + pi --mode json gives 0 bytes; use panel_direct.py / >| (generic box/env note, not repo-specific).
  • qmd prefer lexical — search local papers with qmd search/rg, not vector (corpus ~93% unembedded, can't fit embeddings).
  • Semantic Scholar keyed access — S2 API key in semantic-search skill .env; use it to dodge 429s.
  • pueue negative-priority gotcha — pueue add negative prio needs -o=-N attached; -o -N silently fails the add.
  • Rename on logic change — when an arm's logic changes (binary->banded gate), give it a new id (routeV/route3), not just a tag suffix; else old/new runs are uncomparable.
  • Check paper before diagnosing — re-read source for expected number/horizon before "experiment is broken"; paper: hack emerges on-policy at step 80-100, base solves ~12-20% not 94%.
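
The last note can be turned into a quick guard that runs before anyone declares an experiment broken. The sketch below is hypothetical: the function name, the `slack` factor, and the constant are illustrative and not part of the repo. It only encodes the range stated above (base solves ~12-20%) and flags a measured base solve rate that sits far above it, which is the pattern the contaminated holdout showed.

```python
# Hypothetical sanity check: names and thresholds are illustrative, not from the repo.
# The range comes from the note above: the paper's base model solves ~12-20%.
PAPER_BASE_SOLVE_RANGE = (0.12, 0.20)


def base_solve_looks_contaminated(solve_rate: float, slack: float = 2.0) -> bool:
    """Return True when the base solve rate is far above the paper's range.

    `slack` is an assumed tolerance multiplier on the upper bound, so that
    ordinary run-to-run noise does not trigger the flag.
    """
    if not 0.0 <= solve_rate <= 1.0:
        raise ValueError(f"solve_rate must be in [0, 1], got {solve_rate}")
    _, paper_hi = PAPER_BASE_SOLVE_RANGE
    return solve_rate > paper_hi * slack


if __name__ == "__main__":
    # The two base-model solve rates quoted in the commit message.
    for name, rate in [("train_medhard_holdout", 0.94), ("test_medhard", 0.094)]:
        flag = base_solve_looks_contaminated(rate)
        print(f"{name}: solve={rate} contaminated_suspect={flag}")
```

With the default `slack`, the holdout's 0.94 is flagged and the test set's 0.094 is not. The check is deliberately one-sided: a rate below the paper's range is not treated as contamination.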