---
name: project_paper_comparability_verdict
description: External review verdict (2026-06-07) on whether our single-mode run_tests env is paper-comparable -- it is NOT a fair vendor-number comparison; the honest claim is the internal vanilla-vs-routeV delta.
metadata: 
  node_type: memory
  type: project
  originSessionId: c06c2b0e-44ee-45a1-b213-4f77c006109c
---

Two decorrelated external reviewers (deepseek-v4-pro + gpt-5.5, non-Anthropic) audited
our reverted single-mode run_tests env vs the Ariahw/Engels/Nanda vendor on 2026-06-07 and
CONVERGED: env/hint/data are faithful (verbatim hint, vendor data files md5-match), but it
is NOT a fair comparison to the paper's hack-rate NUMBERS, due to three confounds:

1. Eval token-gap: we disclose a NOVEL function name at eval and grade on it
   (eval.py randomize_eval_markers); the paper evals the fixed `run_tests`. Ours is a
   harder, memorization-resistant variant -> lowers hack rate for all arms.
2. Hack metric: our run_tests "exploited" = arbitrary-STUB passes the model's run_tests
   (rewards.py channel=hacked), a strict SUBSET of the vendor headline eq_hinted (the
   model's OWN solution passes its run_tests). We already compute the vendor analogue as
   `hacked_loophole_used`. Report BOTH (task #219).
3. Training: teacher-pool seeding (mix=0.125, off at step 30) + group=8 vs vendor 16 +
   lr=3e-3/adam(0.5,0.9) + 60 steps/200 problems != vendor verl GRPO (200 steps, no teacher).

**Why:** "paper-comparable" was overclaiming. The VALID, publishable claim is the INTERNAL
delta: routeV vs vanilla under IDENTICAL conditions (same teacher, same eval, same metric).
Both reviewers say the internal arm comparison is sound and our eval additions (held-out
periodic curve + deploy-on-test) are methodologically fine.

**How to apply:** Frame the writeup as the internal comparison + report the vendor eq_hinted
metric as a secondary column + LABEL the eval "token-gap / novel-name robustness eval," not
"the vendor eval." A true paper comparison would need a vendor-matched arm (fixed run_tests
eval, eq_hinted metric, no teacher, vendor GRPO scale) -- only do that if a reviewer demands
the absolute-number comparison. Reviews saved: docs/reviews/20260607_paper_comparability_*.md.
Related: [[project_workshop_paper_goal]]. Dead-code cleanup from the same review = task #220.