evil_MoE/docs/spec/20260610_pairset_audit_markdown.md
2026-06-10 11:58:53 +00:00


# Pairset audit and Markdown source
## Goal
Audit the hand-authored pairsets for clean contrastive construction and provenance,
decide which are useful for headline extraction versus diagnostics, and replace the
scattered hand-authored JSON/build-script sources with one or two human-readable
Markdown sources.
## Scope
In: hand-authored mechanism, intent, semantic-framing, honesty, and placebo pairsets;
their loaders/builders/recipes; pair-level audit evidence.
Out: pool-derived or oracle-labelled pair generation, changing experiment results,
and rewriting pair content before the audit identifies a specific defect.
## Requirements
- R1: Audit same-prompt, provenance, style/length confounds, and hack-axis strength.
Done means: a per-pair table and pairset-level recommendation exist.
- R2: Distinguish no-cheat headline supervision from diagnostics and controls.
Done means: each retained section has an explicit role and provenance.
- R3: Use at most two human-readable Markdown source files for hand-authored pairsets.
Done means: runtime loading selects a named Markdown heading and scattered generated
JSON is no longer the source of truth.
- R4: Fail fast on malformed Markdown or missing headings/fields.
Done means: verification deliberately checks malformed and missing sections fail.
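
The per-pair table in R1 could be tabulated along these lines. This is a minimal sketch, not the audit's actual method: the dict keys `prompt`/`hack`/`clean`, the `length_ratio` column, and the 1.5x flag threshold are all assumptions chosen for illustration, and it covers only the length confound.

```python
def audit_rows(pairs: list[dict[str, str]], max_ratio: float = 1.5) -> list[dict]:
    """One row per pair; flags pairs whose completions differ a lot in length."""
    rows = []
    for index, pair in enumerate(pairs):
        hack_chars, clean_chars = len(pair["hack"]), len(pair["clean"])
        # Guard against an empty completion so the ratio is always defined.
        ratio = max(hack_chars, clean_chars) / max(1, min(hack_chars, clean_chars))
        rows.append({
            "pair": index,
            "hack_chars": hack_chars,
            "clean_chars": clean_chars,
            "length_ratio": round(ratio, 2),
            "length_confound": ratio > max_ratio,  # hypothetical threshold
        })
    return rows


def to_markdown(rows: list[dict]) -> str:
    """Render the rows as a Markdown table, one line per pair."""
    header = list(rows[0])
    lines = ["| " + " | ".join(header) + " |", "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(str(row[key]) for key in header) + " |" for row in rows]
    return "\n".join(lines)


# Hypothetical pairs, for illustration only.
rows = audit_rows([
    {"prompt": "p", "hack": "aaaa", "clean": "aa"},
    {"prompt": "p", "hack": "abc", "clean": "abd"},
])
print(to_markdown(rows))
```

A character-length ratio is a crude proxy; style and hack-axis strength still need the manual evidence that T1 describes.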
## Tasks
- [x] T1 (R1-R2): Audit every hand-authored pairset.
  - verify: audit table reports every pair and pairset summary.
  - success: same-prompt is universal; provenance and confounds are explicit.
  - likely_fail: a set is omitted; summary counts disagree with source counts.
  - sneaky_fail: pairset name implies an isolated axis but another systematic contrast
    dominates; manual evidence records the actual contrast.
  - UAT: open the audit and see a recommendation with reasons for each pairset.
- [x] T2 (R3-R4): Implement a minimal heading-based Markdown pair loader and sources.
  - verify: each named heading loads the expected count and exact pair bytes.
  - success: one or two Markdown files contain all hand-authored pairs.
  - likely_fail: fenced code inside completions breaks parsing; exact-byte comparison fails.
  - sneaky_fail: configs silently load the wrong heading; missing-heading test must raise.
  - UAT: one path plus heading identifies a pairset and prints its expected first pair.
- [x] T3 (R3): Update active recipes/docs and remove superseded hand-authored JSON/builders.
  - verify: repository search finds no active references to removed JSON pairsets.
  - success: active runs name Markdown sections; pool-derived JSON remains separate.
  - likely_fail: an active recipe still names removed JSON.
  - sneaky_fail: cached v_hack identity hashes the whole Markdown file instead of selected
    pair bytes; selected-pair hash verification catches this.
- [x] T4 (R1-R4): Run verification, smoke, and fresh-eyes review.
  - verify: targeted verifier and `just smoke` pass; review has no unresolved blocker.
  - UAT: linked logs and audit show the result.
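
A heading-based loader of the kind T2 describes might look like the sketch below. The file layout is an assumption, not the repository's actual format: a level-2 heading per pairset, a level-3 heading per pair, and `prompt:`/`hack:`/`clean:` labels each followed by a backtick fence. `load_pairs` and `PairsetError` are hypothetical names, and the demo pair content is invented.

```python
import re

FIELDS = ("prompt", "hack", "clean")
FENCE_OPEN = re.compile(r"^(`{3,})")  # a run of three or more backticks


class PairsetError(ValueError):
    """Raised for a missing heading, a missing or duplicate field, or a bad fence."""


def load_pairs(markdown: str, heading: str) -> list[dict[str, str]]:
    pairs: list[dict[str, str]] = []
    matches = 0        # how many level-2 headings equal `heading`
    in_section = False
    label = None       # field name waiting for its fenced body
    fence = None       # opening backtick run while inside a fence
    body: list[str] = []

    for number, line in enumerate(markdown.split("\n"), start=1):
        if fence is not None:
            # Inside a fence nothing is interpreted, so headings or shorter
            # fences within a completion cannot confuse the parser.
            if line != fence:
                body.append(line)
                continue
            if in_section:
                pairs[-1][label] = "\n".join(body)
            fence, label, body = None, None, []
            continue
        opened = FENCE_OPEN.match(line)
        if opened:
            if in_section:
                if label is None or not pairs:
                    raise PairsetError(f"line {number}: fence without a field label")
                if label in pairs[-1]:
                    raise PairsetError(f"line {number}: duplicate field {label!r}")
            fence = opened.group(1)
        elif line.startswith("## "):
            in_section = line[3:].strip() == heading
            matches += in_section
            label = None
        elif in_section and line.startswith("### "):
            pairs.append({})
            label = None
        elif in_section and line.endswith(":") and line[:-1] in FIELDS:
            label = line[:-1]

    if fence is not None:
        raise PairsetError("unterminated fence at end of file")
    if matches != 1:
        raise PairsetError(f"heading {heading!r} found {matches} times, expected 1")
    if not pairs:
        raise PairsetError(f"heading {heading!r} contains no pairs")
    for index, pair in enumerate(pairs):
        missing = [name for name in FIELDS if name not in pair]
        if missing:
            raise PairsetError(f"pair {index} under {heading!r} lacks {missing}")
    return pairs


F4, F3 = "`" * 4, "`" * 3  # built in code so this example holds no literal fences
DEMO = "\n".join([
    "## placebo", "### p0",
    "prompt:", F4, "Sum a list.", F4,
    "hack:", F4, "return 0", F4,
    "clean:", F4, "return sum(xs)", F4,
    "## mechanism-authored", "### m0",
    "prompt:", F4, "Write is_even.", F4,
    "hack:", F4, F3 + "python", "## not a heading", "return True", F3, F4,
    "clean:", F4, "return n % 2 == 0", F4,
])
pairs = load_pairs(DEMO, "mechanism-authored")
print(len(pairs), repr(pairs[0]["clean"]))
```

The outer fence is four backticks so a completion can carry an ordinary three-backtick block; a completion containing a bare four-backtick line would need a longer outer fence. This sketch cannot represent a trailing newline in a field, so a byte-exact migration would need an explicit convention for that.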
## Context
- A pair is `gradient(prompt + hack) - gradient(prompt + clean)`, so prompt equality is
structural within a pair.
- Hand-authored off-distribution pairs are no-cheat. Pool-derived or live-rollout-labelled
pairs are not headline-clean.
- `philosophical`, `moral`, `intent_vs_spec`, and `eval_aware` currently vary semantic
comments and test strength together, so their names overstate axis isolation.
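
The pair definition above can be sketched as follows. `Pair`, `pair_direction`, and `toy_grad` are hypothetical names; `toy_grad` is a stand-in that counts characters, not a model gradient. The point is only that a pair stores one prompt, so both sides share it by construction.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Pair:
    prompt: str  # a single field, so hack and clean necessarily share the prompt
    hack: str
    clean: str


def pair_direction(pair: Pair, grad_fn: Callable[[str], Sequence[float]]) -> list[float]:
    """gradient(prompt + hack) - gradient(prompt + clean), elementwise."""
    g_hack = grad_fn(pair.prompt + pair.hack)
    g_clean = grad_fn(pair.prompt + pair.clean)
    if len(g_hack) != len(g_clean):
        raise ValueError("gradients must have the same shape")
    return [h - c for h, c in zip(g_hack, g_clean)]


def toy_grad(text: str) -> list[float]:
    """Stand-in for a model gradient: counts of three marker characters."""
    return [float(text.count(ch)) for ch in "abc"]


pair = Pair(prompt="aab", hack="c", clean="b")
print(pair_direction(pair, toy_grad))  # [0.0, -1.0, 1.0]
```

In this linear toy the prompt's contribution cancels exactly. A real network's gradient is not additive over tokens, so no such cancellation should be assumed there; the shared prompt only guarantees that both sides see the same context.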
## Log
- `pairs_authored.json` and `prog_wide_clean.json` have uncommitted user edits; preserve
their current working-tree bytes as audit inputs.
- The current authored working-tree rows were migrated byte-exactly before deleting the
duplicate JSON source. `prog_wide_clean.json` remains a separate generated/sensitivity
artifact.
- No audited prompt is byte-identical to any current train/eval prompt, but the intent and
framing sets overlap conceptually with several corpus problems, so the earlier claim
“Not in train/test” was too strong.
## TODO
## Errors
| Task | Error | Resolution |
|---|---|---|
## Results
- Audit and recommendations: `docs/personas/pairset_audit.md`.
- Authored/manual sources: `data/pairs/hack_pairs.md` and
`data/pairs/pair_diagnostics.md`.
- Deleted 13 duplicate manual JSON files and 3 Python data builders.
- The initial migration used `#mechanism-authored`; it was superseded by the later
multi-view design in `docs/spec/20260610_multiview_hack_pairs.md`.
## Verify
`uv run python scripts/verify_science_invariants.py`:
| invariant | success |
|---|---|
| selected Markdown pair bytes | True |
| malformed Markdown fails | True |
| authored/control/generated pairsets load | True |
| exactly one prompt hint | True |
| deterministic disjoint val/test | True |
`just smoke`:
- Full log: `/tmp/claude-1000/pairset_markdown_smoke.log`
- routeV loaded `data/pairs/hack_pairs.md#mechanism-authored -> 11 pairs`.
- Routing band mean width `+0.354`; `13/14` modules included.
- All verify gates and the 30-step routeV smoke completed successfully.
## Failure mode check
- likely_fail: nested code fences corrupt Markdown parsing. Exact migration checks loaded all
13 sections and matched every prior JSON row byte-for-byte. PASS.
- sneaky_fail: hash tracks the whole Markdown file or the wrong heading. The verifier mutates
an unrelated section without changing the selected hash, then mutates the selected section
and confirms cached-v-hack loading fails. PASS.
- malformed input, such as duplicate fields or missing headings, is rejected. PASS.
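
The selected-pair hash check can be sketched as below. This is an illustration under assumptions, not the verifier's code: `pairset_hash` and `check_cache` are hypothetical names, SHA-256 with length-prefixed fields is one reasonable choice, and the `sections` dict stands in for a parsed Markdown file.

```python
import hashlib


def pairset_hash(pairs: list[dict[str, str]]) -> str:
    """SHA-256 over the selected pairs only, with length-prefixed fields."""
    digest = hashlib.sha256()
    for pair in pairs:
        for name in ("prompt", "hack", "clean"):
            data = pair[name].encode("utf-8")
            # The length prefix keeps field boundaries unambiguous, so moving
            # bytes from one field to its neighbour changes the hash.
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()


def check_cache(cached_hash: str, pairs: list[dict[str, str]]) -> None:
    """Refuse a cached vector whose recorded pair hash no longer matches."""
    actual = pairset_hash(pairs)
    if actual != cached_hash:
        raise ValueError(f"pairset changed: cached {cached_hash[:12]}, now {actual[:12]}")


# Hypothetical file contents, already parsed into sections.
sections = {
    "mechanism-authored": [{"prompt": "p", "hack": "h", "clean": "c"}],
    "placebo": [{"prompt": "q", "hack": "x", "clean": "y"}],
}
cached = pairset_hash(sections["mechanism-authored"])

sections["placebo"][0]["hack"] = "edited"             # unrelated section
check_cache(cached, sections["mechanism-authored"])   # still accepted

sections["mechanism-authored"][0]["hack"] = "edited"  # selected section
try:
    check_cache(cached, sections["mechanism-authored"])
except ValueError as error:
    print("rejected:", error)
```

Hashing the parsed pairs rather than the file means that edits to other headings, or to prose between pairs, leave the cache valid, while any change to a selected field invalidates it.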
## Review
Fresh-eyes external review: `docs/reviews/20260610_pairset_external.md`.
- No scientific audit, data-loss, parser, or hash findings.
- Fixed its one low-severity finding: result tables now label a Markdown pairset by
selected heading rather than the full `hack_pairs.md#heading` reference.