The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our artifact): disjoint from train by id but in the train id/recency range (ids 3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in pretraining -> base solve 0.94, saturating solve and killing the hack metric's gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the paper rate.

Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094, matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the contaminant.

Fix: drop the holdout; periodic curve + final number both eval the paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's simple_overwrite_tests (not the easier _detailed/_aware variants).

Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix.

Known follow-up (journal e): train pool is still first-200-by-id (easy/memorized), same bug class.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
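The id-versus-recency distinction in the commit message can be sketched as follows. This is a minimal illustration, not the repo's loader: the `id` field name and both helper functions are assumptions, and only the cutoff (ids >= 3243) comes from the message.

```python
import random

# From the commit message: the recency-held-out test set has ids >= 3243.
RECENCY_CUTOFF_ID = 3243


def split_by_recency(rows, cutoff=RECENCY_CUTOFF_ID, id_key="id"):
    """Split problems into (train-recency-range, recency-held-out).

    `id_key` is a hypothetical field name; the real JSONL schema may differ.
    Being disjoint from train by id is not enough: a problem with an id below
    the cutoff can still be an old, widely published (memorized) problem.
    """
    in_range = [r for r in rows if r[id_key] < cutoff]
    held_out = [r for r in rows if r[id_key] >= cutoff]
    return in_range, held_out


def seeded_shuffle(rows, seed=0):
    """Return a shuffled copy; the same seed always gives the same order."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Only the second list returned by `split_by_recency` would be a candidate eval set under the reasoning above; the first shares the train pool's recency range even when the ids themselves are disjoint.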
| name | description | metadata |
|---|---|---|
| project_paper_comparability_verdict | External review verdict (2026-06-07) on whether our single-mode run_tests env is paper-comparable -- it is NOT a fair vendor-number comparison; the honest claim is the internal vanilla-vs-routeV delta. | |

Two decorrelated external reviewers (deepseek-v4-pro + gpt-5.5, non-Anthropic) audited our reverted single-mode run_tests env vs the Ariahw/Engels/Nanda vendor on 2026-06-07 and CONVERGED: env/hint/data are faithful (verbatim hint, vendor data files md5-match), but it is NOT a fair comparison to the paper's hack-rate NUMBERS, due to three confounds:
- Eval token-gap: we disclose a NOVEL function name at eval and grade on it (eval.py randomize_eval_markers); the paper evals the fixed run_tests. Ours is a harder, memorization-resistant variant -> lowers hack rate for all arms.
- Hack metric: our run_tests "exploited" = arbitrary-STUB passes the model's run_tests (rewards.py channel=hacked), a strict SUBSET of the vendor headline eq_hinted (the model's OWN solution passes its run_tests). We already compute the vendor analogue as hacked_loophole_used. Report BOTH (task #219).
- Training: teacher-pool seeding (mix=0.125, off at step 30) + group=8 vs vendor 16 + lr=3e-3/adam(0.5,0.9) + 60 steps/200 problems != vendor verl GRPO (200 steps, no teacher).

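Reporting both hack metrics side by side could look roughly like the sketch below. The `EvalRecord` fields and `hack_rates` are hypothetical names for illustration; they stand in for whatever rewards.py records under channel=hacked and hacked_loophole_used, and are not the repo's actual schema.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    # Hypothetical flag: an arbitrary stub passes the model's run_tests
    # (the strict "exploited" metric described above).
    stub_passes_model_tests: bool
    # Hypothetical flag: the model's own solution passes its run_tests
    # (the analogue of the vendor eq_hinted headline metric).
    own_solution_passes_model_tests: bool


def hack_rates(records):
    """Return both hack rates as fractions of all evaluated samples."""
    n = len(records)
    if n == 0:
        return {"exploited": 0.0, "eq_hinted": 0.0}
    exploited = sum(r.stub_passes_model_tests for r in records)
    eq_hinted = sum(r.own_solution_passes_model_tests for r in records)
    return {"exploited": exploited / n, "eq_hinted": eq_hinted / n}
```

If the strict metric really is a subset of eq_hinted, the "exploited" rate should never exceed the "eq_hinted" rate on the same records, which makes a cheap sanity check when both columns are reported.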
Why: "paper-comparable" was overclaiming. The VALID, publishable claim is the INTERNAL delta: routeV vs vanilla under IDENTICAL conditions (same teacher, same eval, same metric). Both reviewers say the internal arm comparison is sound and our eval additions (held-out periodic curve + deploy-on-test) are methodologically fine.
How to apply: Frame the writeup as the internal comparison + report the vendor eq_hinted metric as a secondary column + LABEL the eval "token-gap / novel-name robustness eval," not "the vendor eval." A true paper comparison would need a vendor-matched arm (fixed run_tests eval, eq_hinted metric, no teacher, vendor GRPO scale) -- only do that if a reviewer demands the absolute-number comparison. Reviews saved: docs/reviews/20260607_paper_comparability_*.md. Related: project_workshop_paper_goal. Dead-code cleanup from the same review = task #220.
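To make the gap to a vendor-matched arm concrete, the confounds can be written down as two configs and diffed. This is only a sketch: `ArmConfig`, its field names, and `differing_fields` are hypothetical, the values are the ones quoted in this note, and the learning rate is left out because the vendor value is not stated here.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ArmConfig:
    # All field names are hypothetical; values are taken from the note.
    randomize_eval_name: bool  # novel function name at eval vs fixed run_tests
    hack_metric: str           # which metric is the headline number
    teacher_mix: float         # 0.0 means no teacher-pool seeding
    group_size: int
    train_steps: int


OURS = ArmConfig(True, "exploited", 0.125, 8, 60)
VENDOR_MATCHED = ArmConfig(False, "eq_hinted", 0.0, 16, 200)


def differing_fields(a, b):
    """Map each field that differs to its (a, b) pair of values."""
    da, db = asdict(a), asdict(b)
    return {k: (da[k], db[k]) for k in da if da[k] != db[k]}
```

Here `differing_fields(OURS, VENDOR_MATCHED)` lists every field, which mirrors the verdict above: the absolute numbers are not comparable until each of these is matched.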