Commit Graph

2 Commits

Author SHA1 Message Date
wassname ea01267cd8 fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094)
The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our
artifact): disjoint from train by id but in the train id/recency range (ids
3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in
pretraining -> base solve 0.94, saturating solve and killing the hack metric's
gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining
MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the
paper rate.

Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094,
matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the
contaminant. Fix: drop the holdout; periodic curve + final number both eval the
paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's
simple_overwrite_tests (not the easier _detailed/_aware variants).

Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle
for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up
(journal e): train pool is still first-200-by-id (easy/memorized), same bug class.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00
wassname 7da54f1967 eval+env: single-mode run_tests, held-out val/test eval, both hack metrics
- revert env to single-mode run_tests (paper-comparable): FastConfig teacher
  pool = run_tests-only (no partition.json); + `just build-runtests-pool`
- held-out eval: periodic train(knob-on)+deploy(knob-off) on VAL (holdout file),
  final deploy on TEST n=119 -> deploy_test.json; inline train/val/test disjoint assert
- report BOTH hack metrics: strict stub-pass (exploited) + vendor eq_hinted
  (hacked_loophole_used) -- external review 2026-06-07
- consolidate to one canonical eval_hack_solve (.eval); delete the train.py
  duplicate that silently lacked the token gap (in-run eval != rescore bug)
- routeV band edges mean -> min/max (conservative degrade-to-absorb)
- scripts/rescore_deploy.py: offline re-score of saved adapter on held-out test
- modal/app.py: read deploy_test.json

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 03:07:35 +00:00