--- name: project_paper_comparability_verdict description: External review verdict (2026-06-07) on whether our single-mode run_tests env is paper-comparable -- it is NOT a fair vendor-number comparison; the honest claim is the internal vanilla-vs-routeV delta. metadata: node_type: memory type: project originSessionId: c06c2b0e-44ee-45a1-b213-4f77c006109c --- Two decorrelated external reviewers (deepseek-v4-pro + gpt-5.5, non-Anthropic) audited our reverted single-mode run_tests env vs the Ariahw/Engels/Nanda vendor on 2026-06-07 and CONVERGED: env/hint/data are faithful (verbatim hint, vendor data files md5-match), but it is NOT a fair comparison to the paper's hack-rate NUMBERS, due to three confounds: 1. Eval token-gap: we disclose a NOVEL function name at eval and grade on it (eval.py randomize_eval_markers); the paper evals the fixed `run_tests`. Ours is a harder, memorization-resistant variant -> lowers hack rate for all arms. 2. Hack metric: our run_tests "exploited" = arbitrary-STUB passes the model's run_tests (rewards.py channel=hacked), a strict SUBSET of the vendor headline eq_hinted (the model's OWN solution passes its run_tests). We already compute the vendor analogue as `hacked_loophole_used`. Report BOTH (task #219). 3. Training: teacher-pool seeding (mix=0.125, off at step 30) + group=8 vs vendor 16 + lr=3e-3/adam(0.5,0.9) + 60 steps/200 problems != vendor verl GRPO (200 steps, no teacher). **Why:** "paper-comparable" was overclaiming. The VALID, publishable claim is the INTERNAL delta: routeV vs vanilla under IDENTICAL conditions (same teacher, same eval, same metric). Both reviewers say the internal arm comparison is sound and our eval additions (held-out periodic curve + deploy-on-test) are methodologically fine. **How to apply:** Frame the writeup as the internal comparison + report the vendor eq_hinted metric as a secondary column + LABEL the eval "token-gap / novel-name robustness eval," not "the vendor eval." A true paper comparison would need a vendor-matched arm (fixed run_tests eval, eq_hinted metric, no teacher, vendor GRPO scale) -- only do that if a reviewer demands the absolute-number comparison. Reviews saved: docs/reviews/20260607_paper_comparability_*.md. Related: [[project_workshop_paper_goal]]. Dead-code cleanup from the same review = task #220.