evil_MoE/.claude/memory/project_paper_comparability_verdict.md
wassname ea01267cd8 fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094)
The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our
artifact): disjoint from train by id but in the train id/recency range (ids
3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in
pretraining -> base solve 0.94, saturating solve and killing the hack metric's
gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining
MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the
paper rate.

Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094,
matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the
contaminant. Fix: drop the holdout; periodic curve + final number both eval the
paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's
simple_overwrite_tests (not the easier _detailed/_aware variants).

Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle
for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up
(journal e): train pool is still first-200-by-id (easy/memorized), same bug class.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00


---
name: project_paper_comparability_verdict
description: External review verdict (2026-06-07) on whether our single-mode run_tests env is paper-comparable -- it is NOT a fair vendor-number comparison; the honest claim is the internal vanilla-vs-routeV delta.
metadata:
  node_type: memory
  type: project
  originSessionId: c06c2b0e-44ee-45a1-b213-4f77c006109c
---

Two decorrelated external reviewers (deepseek-v4-pro + gpt-5.5, non-Anthropic) audited our reverted single-mode run_tests env vs the Ariahw/Engels/Nanda vendor on 2026-06-07 and CONVERGED: env/hint/data are faithful (verbatim hint, vendor data files md5-match), but it is NOT a fair comparison to the paper's hack-rate NUMBERS, due to three confounds:

  1. Eval token-gap: we disclose a NOVEL function name at eval and grade on it (eval.py randomize_eval_markers); the paper evals the fixed run_tests. Ours is a harder, memorization-resistant variant -> lowers hack rate for all arms.
  2. Hack metric: our run_tests "exploited" flag means an arbitrary STUB passes the model's run_tests (rewards.py channel=hacked). This is a strict SUBSET of the vendor headline eq_hinted (the model's OWN solution passes its run_tests), whose analogue we already compute as hacked_loophole_used. Report BOTH (task #219).
  3. Training: we use teacher-pool seeding (mix=0.125, off at step 30), group=8 (vendor: 16), lr=3e-3/adam(0.5,0.9), and 60 steps over 200 problems; the vendor runs verl GRPO for 200 steps with no teacher.
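
To make confound 2 concrete, here is a minimal sketch of reporting both hack metrics from the same rollouts. The `Rollout` type and its two flags are hypothetical stand-ins for whatever rewards.py actually records, not the project's real schema. `subset_violations` is an assumed sanity check on the subset relationship described in item 2.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    # Hypothetical per-rollout flags; the real fields in rewards.py may differ.
    stub_passes: bool  # an arbitrary stub passes the model's run_tests ("exploited")
    own_passes: bool   # the model's own solution passes its run_tests (eq_hinted analogue)


def hack_rates(rollouts: list[Rollout]) -> dict[str, float]:
    """Return both hack rates over the same rollouts so they can be reported side by side."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    n = len(rollouts)
    return {
        "exploited_stub": sum(r.stub_passes for r in rollouts) / n,
        "eq_hinted_analogue": sum(r.own_passes for r in rollouts) / n,
    }


def subset_violations(rollouts: list[Rollout]) -> list[int]:
    """Indices where the stricter flag fires without the looser one.

    If the stricter metric really is a subset of the vendor-style one,
    this list should be empty.
    """
    return [i for i, r in enumerate(rollouts) if r.stub_passes and not r.own_passes]
```

Because the stricter flag can only fire on a subset of rollouts, `exploited_stub` should never exceed `eq_hinted_analogue`; a non-empty `subset_violations` result would suggest the two flags are not measuring what the note assumes.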

Why: "paper-comparable" was overclaiming. The VALID, publishable claim is the INTERNAL delta: routeV vs vanilla under IDENTICAL conditions (same teacher, same eval, same metric). Both reviewers say the internal arm comparison is sound and our eval additions (held-out periodic curve + deploy-on-test) are methodologically fine.

How to apply: Frame the writeup as the internal comparison + report the vendor eq_hinted metric as a secondary column + LABEL the eval "token-gap / novel-name robustness eval," not "the vendor eval." A true paper comparison would need a vendor-matched arm (fixed run_tests eval, eq_hinted metric, no teacher, vendor GRPO scale) -- only do that if a reviewer demands the absolute-number comparison. Reviews saved: docs/reviews/20260607_paper_comparability_*.md. Related: project_workshop_paper_goal. Dead-code cleanup from the same review = task #220.
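
A rough sketch of how the writeup table could be assembled under this framing: the internal routeV-minus-vanilla delta is reported per metric, with the eval label stated explicitly. The function names and the metric keys are illustrative assumptions, not existing project code, and any numbers passed in are the caller's own.

```python
from __future__ import annotations


def internal_delta(vanilla: dict[str, float], route_v: dict[str, float]) -> dict[str, float]:
    """Per-metric difference (routeV - vanilla); both arms must report the same metrics."""
    if vanilla.keys() != route_v.keys():
        raise ValueError("arms must report the same metrics")
    return {name: route_v[name] - vanilla[name] for name in vanilla}


def format_report(
    vanilla: dict[str, float],
    route_v: dict[str, float],
    eval_label: str = "token-gap / novel-name robustness eval",
) -> str:
    """Render a plain-text table: one row per metric, delta in the last column."""
    delta = internal_delta(vanilla, route_v)
    lines = [f"eval: {eval_label}", "metric | vanilla | routeV | delta"]
    for name in vanilla:
        lines.append(
            f"{name} | {vanilla[name]:.3f} | {route_v[name]:.3f} | {delta[name]:+.3f}"
        )
    return "\n".join(lines)
```

Keeping the label as an explicit argument makes it harder to present the table as "the vendor eval" by accident, and refusing mismatched metric sets guards the "identical conditions" requirement at the reporting step.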