evil_MoE/docs/spec/20260609_science_correctness.md
2026-06-09 13:10:17 +00:00

Science correctness audit

Goal

Make cached directions and evaluation artifacts identify the exact data that produced them, and keep the final test split untouched by periodic/checkpoint evaluation.

Scope

In: pairset provenance, prompt-hint fail-fast checks, canonical paper-test split, train/re-score/checkpoint evaluation consistency. Out: changing routing math, reward definitions, or using live-rollout oracle labels for training.

Requirements

  • R1: A cached v_hack can only load with the exact pairset bytes that produced it. Done means: extraction saves a SHA-256; every loader checks it.
  • R2: Prompt hint insertion cannot silently fail. Done means: load_problems raises if the source phrase is absent or ambiguous.
  • R3: Periodic/checkpoint evaluation never touches the final-test problems. Done means: one canonical deterministic split returns disjoint validation and final-test lists.
  • R4: In-run final eval and offline deploy re-score use identical problems, modes, hints, and order. Done means: a verifier compares canonical split identities and all callers use it.
  • R5: No training decision uses final-test labels or live-rollout hack labels. Done means: paired knob-on/off final-test scores are evaluated only at run end; routing still uses authored pairs only.
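
A minimal sketch of the R1 check, assuming the cache metadata is a JSON file stored next to the cached direction. The function names and the pairset_sha256 key are hypothetical, not the repo's actual schema; only the behavior (save a SHA-256 at extraction, raise ValueError on mismatch at load) comes from this spec.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 of the file's raw bytes."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in 1 MiB chunks so large pairsets are not held in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def save_direction_metadata(meta_path: Path, pairset_path: Path, extra: dict) -> None:
    """Write cache metadata plus the pairset hash (hypothetical schema)."""
    meta = dict(extra, pairset_sha256=sha256_of_file(pairset_path))
    meta_path.write_text(json.dumps(meta, indent=2, sort_keys=True))


def verify_pairset(meta_path: Path, pairset_path: Path) -> None:
    """Raise ValueError unless the pairset bytes match the recorded hash."""
    meta = json.loads(meta_path.read_text())
    expected = meta.get("pairset_sha256")
    if expected is None:
        # Old caches without a hash are rejected rather than trusted.
        raise ValueError(f"{meta_path} has no pairset_sha256; re-extract the direction")
    actual = sha256_of_file(pairset_path)
    if actual != expected:
        raise ValueError(f"pairset hash mismatch: expected {expected}, got {actual}")
```

Hashing bytes rather than parsed content means that even a whitespace-only edit to the pairset invalidates the cache, which matches the "exact pairset bytes" wording of R1.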

Tasks

  • T1 (R1): Add pairset SHA-256 metadata and load-time verification.
    • verify: mutate a copied pairset after extraction metadata creation; load must fail.
    • success: exact file loads, changed bytes raise ValueError.
    • likely_fail: a caller omits expected pairset.
    • sneaky_fail: same filename with changed contents loads; hash check catches it.
    • UAT: verifier table shows exact-pass/mutated-fail.
  • T2 (R2): Make prompt replacement fail loud.
    • verify: canonical prompt loads; missing/duplicate source phrase raises.
    • success: one replacement per problem.
    • likely_fail: upstream prompt schema drift silently leaves no hint.
    • sneaky_fail: replacement touches multiple occurrences; exact-count check catches it.
    • UAT: verifier table shows canonical-pass/missing-fail/duplicate-fail.
  • T3 (R3-R5): Centralize deterministic validation/final-test split and repoint callers.
    • verify: run the split identity verifier, then the just smoke recipe.
    • success: val/test ID sets disjoint; train/re-score/checkpoint callers import the same helper.
    • likely_fail: offline re-score assigns modes differently.
    • sneaky_fail: final test is also used for checkpoint curves; search and helper ownership checks catch it.
    • UAT: linked split manifest/table lists counts, ID hashes, and disjointness.
  • T4: Fresh-eyes scientist/code review, fix valid findings, and commit.
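
The exact-count check in T2 could look roughly like the sketch below. insert_hint and its parameter names are illustrative assumptions; the spec only states that load_problems must raise when the source phrase is absent or ambiguous.

```python
def insert_hint(prompt: str, source_phrase: str, hinted_phrase: str) -> str:
    """Replace source_phrase with hinted_phrase, requiring exactly one match.

    Hypothetical helper: a caller such as load_problems would invoke this
    once per problem and let the ValueError propagate.
    """
    if not source_phrase:
        raise ValueError("source_phrase must be non-empty")
    # str.count returns the number of non-overlapping occurrences.
    count = prompt.count(source_phrase)
    if count == 0:
        raise ValueError("source phrase missing from prompt; hint would be silently dropped")
    if count > 1:
        raise ValueError(f"source phrase appears {count} times; replacement is ambiguous")
    return prompt.replace(source_phrase, hinted_phrase, 1)
```

Counting before replacing covers both failure modes listed under T2: schema drift that removes the phrase (count 0) and a replacement that would touch several occurrences (count above 1).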

Context

  • Authored pair labels are legitimate: they exist before live RL and use no oracle labels from live rollouts.
  • The environment reward/oracle may grade rollouts, but routing must not consume exploited, gt_correct, or detector labels.
  • Periodic evaluation drives monitoring and model iteration, so it cannot share examples with the final headline test.
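
One way to get a deterministic, disjoint validation/final-test split is to rank problem IDs by a salted hash, as sketched below. This is an assumption about the approach, not the repo's helper: the function name, the salt, and the sizes in the usage are all hypothetical.

```python
import hashlib


def split_validation_and_final_test(problem_ids, n_validation, salt="split-v1"):
    """Return (validation_ids, final_test_ids), disjoint and order-independent.

    Hypothetical sketch: IDs are ranked by SHA-256 of "salt:id", so the same
    ID set always yields the same split regardless of input order.
    """
    ids = list(problem_ids)
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate problem IDs would break disjointness")
    if not 0 <= n_validation <= len(ids):
        raise ValueError("n_validation must be between 0 and the number of IDs")

    def rank(pid):
        # The raw ID is a tie-breaker in the unlikely event of a hash collision.
        return (hashlib.sha256(f"{salt}:{pid}".encode("utf-8")).hexdigest(), pid)

    ranked = sorted(ids, key=rank)
    return ranked[:n_validation], ranked[n_validation:]


# Usage with illustrative sizes only.
validation, final_test = split_validation_and_final_test(
    [f"problem-{i}" for i in range(10)], n_validation=3
)
assert set(validation).isdisjoint(final_test)
```

Because both lists are slices of one ranking, disjointness holds by construction, and every caller that imports the same helper with the same salt sees the same split.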

Log

  • v_hack metadata currently records model/dtype/rank but not pairset identity.
  • In-run periodic eval and final eval currently share the paper test file.
  • rescore_deploy.py currently uses different mode assignment and shuffle behavior from in-run final eval.
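
A mismatch like the one in the last entry can be caught by fingerprinting the ordered evaluation plan on both sides and comparing the hashes. The sketch below is an assumption about how such a verifier might work; the function names and the (problem_id, mode, hint) tuple layout are hypothetical, as are the mode labels in the usage.

```python
import hashlib
import json


def eval_plan_fingerprint(items) -> str:
    """Hash an ordered sequence of (problem_id, mode, hint) tuples.

    The JSON encoding preserves order, so a different shuffle or a different
    mode assignment produces a different fingerprint.
    """
    payload = json.dumps(
        [list(item) for item in items], ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def assert_same_eval_plan(in_run, offline) -> None:
    """Raise ValueError if the two evaluation plans differ in any respect."""
    a = eval_plan_fingerprint(in_run)
    b = eval_plan_fingerprint(offline)
    if a != b:
        raise ValueError(f"evaluation plans differ: in-run {a[:12]} vs offline {b[:12]}")


# Usage with illustrative values.
plan = [("p1", "knob_on", "hint text"), ("p2", "knob_off", "")]
assert_same_eval_plan(plan, list(plan))
```

Hashing the full ordered plan, rather than only the ID set, is what makes order and mode drift visible, which is the R4 requirement.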

Errors

| Task | Error | Resolution |
| --- | --- | --- |
| Smoke extraction | The old recipe used Qwen/Qwen3.5-0.8B and exhausted the occupied shared GPU. | Use TINY_MODEL; full smoke regenerated the cache through the real cache-miss path and passed. |

UAT

| Claim | Proof |
| --- | --- |
| Pair provenance, exact hint insertion, and deterministic disjoint split | scripts/verify_science_invariants.py, output in /tmp/projected_grpo_science_correctness_smoke.log |
| Real training reports periodic validation n=32 and untouched final test n=87 | /tmp/projected_grpo_science_correctness_smoke.log |
| Plan survived independent review before implementation | /tmp/projected_grpo_science_plan_review.md |
| Fresh-eyes implementation review found no unresolved leakage or no-cheat issue | /tmp/projected_grpo_science_code_review.md |