From 3c27d922d21c8fdb43e9f0012f3ebe80949164a9 Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Tue, 9 Jun 2026 13:10:17 +0000
Subject: [PATCH] docs: record science correctness audit

---
 docs/spec/20260609_science_correctness.md | 64 +++++++++++++++++++++++
 1 file changed, 64 insertions(+)
 create mode 100644 docs/spec/20260609_science_correctness.md

diff --git a/docs/spec/20260609_science_correctness.md b/docs/spec/20260609_science_correctness.md
new file mode 100644
index 0000000..1baf204
--- /dev/null
+++ b/docs/spec/20260609_science_correctness.md
@@ -0,0 +1,64 @@
+# Science correctness audit
+
+## Goal
+Make cached directions and evaluation artifacts identify the exact data that produced them, and keep the final test split untouched by periodic/checkpoint evaluation.
+
+## Scope
+In: pairset provenance, prompt-hint fail-fast checks, canonical paper-test split, train/re-score/checkpoint evaluation consistency.
+Out: changing routing math, reward definitions, or using live-rollout oracle labels for training.
+
+## Requirements
+- R1: A cached `v_hack` can only load with the exact pairset bytes that produced it.
+  Done means: extraction saves a SHA-256; every loader checks it.
+- R2: Prompt hint insertion cannot silently fail.
+  Done means: `load_problems` raises if the source phrase is absent or ambiguous.
+- R3: Periodic/checkpoint evaluation never touches the final-test problems.
+  Done means: one canonical deterministic split returns disjoint validation and final-test lists.
+- R4: In-run final eval and offline deploy re-score use identical problems, modes, hints, and order.
+  Done means: a verifier compares canonical split identities and all callers use it.
+- R5: No training decision uses final-test labels or live-rollout hack labels.
+  Done means: paired knob-on/off final-test scores are only evaluated at run end; routing still uses authored pairs only.
+
+## Tasks
+- [x] T1 (R1): Add pairset SHA-256 metadata and load-time verification.
+  - verify: mutate a copied pairset after extraction metadata creation; load must fail.
+  - success: exact file loads, changed bytes raise `ValueError`.
+  - likely_fail: a caller omits expected pairset.
+  - sneaky_fail: same filename with changed contents loads; hash check catches it.
+  - UAT: verifier table shows exact-pass/mutated-fail.
+- [x] T2 (R2): Make prompt replacement fail loud.
+  - verify: canonical prompt loads; missing/duplicate source phrase raises.
+  - success: one replacement per problem.
+  - likely_fail: upstream prompt schema drift silently leaves no hint.
+  - sneaky_fail: replacement touches multiple occurrences; exact-count check catches it.
+  - UAT: verifier table shows canonical-pass/missing-fail/duplicate-fail.
+- [x] T3 (R3-R5): Centralize deterministic validation/final-test split and repoint callers.
+  - verify: split identity verifier plus `just smoke`.
+  - success: val/test ID sets disjoint; train/re-score/checkpoint callers import the same helper.
+  - likely_fail: offline re-score assigns modes differently.
+  - sneaky_fail: final test is also used for checkpoint curves; search and helper ownership checks catch it.
+  - UAT: linked split manifest/table lists counts, ID hashes, and disjointness.
+- [x] T4: Fresh-eyes scientist/code review, fix valid findings, and commit.
+
+## Context
+- Authored pair labels are legitimate: they exist before live RL and use no oracle labels from live rollouts.
+- The environment reward/oracle may grade rollouts, but routing must not consume `exploited`, `gt_correct`, or detector labels.
+- Periodic evaluation is for monitoring/model iteration. Therefore it cannot share examples with the final headline test.
+
+## Log
+- `v_hack` metadata currently records model/dtype/rank but not pairset identity.
+- In-run periodic eval and final eval currently share the paper test file.
+- `rescore_deploy.py` currently uses different mode assignment and shuffle behavior from in-run final eval.
+
+## Errors
+| Task | Error | Resolution |
+|------|-------|------------|
+| Smoke extraction | The old recipe used `Qwen/Qwen3.5-0.8B` and exhausted the occupied shared GPU. | Use `TINY_MODEL`; full smoke regenerated the cache through the real cache-miss path and passed. |
+
+## UAT
+| Claim | Proof |
+|------|-------|
+| Pair provenance, exact hint insertion, and deterministic disjoint split | [`scripts/verify_science_invariants.py`](../../scripts/verify_science_invariants.py), output in `/tmp/projected_grpo_science_correctness_smoke.log` |
+| Real training reports periodic validation `n=32` and untouched final test `n=87` | `/tmp/projected_grpo_science_correctness_smoke.log` |
+| Plan survived independent review before implementation | `/tmp/projected_grpo_science_plan_review.md` |
+| Fresh-eyes implementation review found no unresolved leakage or no-cheat issue | `/tmp/projected_grpo_science_code_review.md` |
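
Note (outside the patch hunk, not part of the committed file): the three invariants named in the first UAT row (pair provenance, exact hint insertion, deterministic disjoint split) could be sketched roughly as below. This is a minimal illustration under stated assumptions, not the contents of `scripts/verify_science_invariants.py`, which this patch does not show. The function names `sha256_of_file`, `check_pairset`, `insert_hint`, and `split_val_test` are hypothetical, and ordering IDs by their own SHA-256 is just one possible way to get a seed-free deterministic split; the repository may do it differently.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA-256 of the exact bytes on disk."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_pairset(path: Path, expected_sha256: str) -> None:
    """R1 sketch: refuse a pairset whose bytes differ from the recorded hash."""
    actual = sha256_of_file(path)
    if actual != expected_sha256:
        raise ValueError(
            f"pairset hash mismatch for {path}: "
            f"expected {expected_sha256}, got {actual}"
        )


def insert_hint(prompt: str, source_phrase: str, hinted_phrase: str) -> str:
    """R2 sketch: replace the source phrase only if it occurs exactly once."""
    count = prompt.count(source_phrase)
    if count != 1:
        # 0 means the upstream prompt drifted; >1 means the target is ambiguous.
        raise ValueError(
            f"expected exactly 1 occurrence of the source phrase, found {count}"
        )
    return prompt.replace(source_phrase, hinted_phrase)


def split_val_test(problem_ids: list[str], n_val: int) -> tuple[list[str], list[str]]:
    """R3 sketch: deterministic, disjoint validation / final-test split.

    IDs are ordered by the SHA-256 of the ID string (assumption: any fixed,
    input-order-independent ordering would serve), then cut at n_val.
    """
    unique_ids = set(problem_ids)
    if len(unique_ids) != len(problem_ids):
        raise ValueError("duplicate problem IDs")
    if not 0 <= n_val <= len(problem_ids):
        raise ValueError("n_val out of range")
    ordered = sorted(
        unique_ids,
        key=lambda pid: (hashlib.sha256(pid.encode("utf-8")).hexdigest(), pid),
    )
    val, test = ordered[:n_val], ordered[n_val:]
    if not set(val).isdisjoint(test):
        raise ValueError("validation and final-test splits overlap")
    return val, test
```

With 119 distinct IDs and `n_val=32`, `split_val_test` returns lists of 32 and 87 IDs, which matches the counts in the second UAT row only as an illustration; the patch does not state how the real split sizes are chosen. Raising `ValueError` in each helper follows the T1 success criterion ("changed bytes raise `ValueError`"); using the same exception type for the hint and split checks is an assumption.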