Files
evil_MoE/docs/spec/20260610_pairset_audit_markdown.md
T
2026-06-10 11:58:53 +00:00

5.7 KiB

Pairset audit and Markdown source

Goal

Audit the hand-authored pairsets for clean contrastive construction and provenance, decide which are useful for headline extraction versus diagnostics, and replace the scattered hand-authored JSON/build-script sources with one or two human-readable Markdown sources.

Scope

In: hand-authored mechanism, intent, semantic-framing, honesty, and placebo pairsets; their loaders/builders/recipes; pair-level audit evidence.

Out: pool-derived or oracle-labelled pair generation, changing experiment results, and rewriting pair content before the audit identifies a specific defect.

Requirements

  • R1: Audit same-prompt, provenance, style/length confounds, and hack-axis strength. Done means: a per-pair table and pairset-level recommendation exist.
  • R2: Distinguish no-cheat headline supervision from diagnostics and controls. Done means: each retained section has an explicit role and provenance.
  • R3: Use at most two human-readable Markdown source files for hand-authored pairsets. Done means: runtime loading selects a named Markdown heading and scattered generated JSON is no longer the source of truth.
  • R4: Fail fast on malformed Markdown or missing headings/fields. Done means: verification deliberately checks malformed and missing sections fail.

Tasks

  • T1 (R1-R2): Audit every hand-authored pairset.
    • verify: audit table reports every pair and pairset summary.
    • success: same-prompt is universal; provenance and confounds are explicit.
    • likely_fail: a set is omitted; summary counts disagree with source counts.
    • sneaky_fail: pairset name implies an isolated axis but another systematic contrast dominates; manual evidence records the actual contrast.
    • UAT: open the audit and see a recommendation with reasons for each pairset.
  • T2 (R3-R4): Implement a minimal heading-based Markdown pair loader and sources.
    • verify: each named heading loads the expected count and exact pair bytes.
    • success: one or two Markdown files contain all hand-authored pairs.
    • likely_fail: fenced code inside completions breaks parsing; exact-byte comparison fails.
    • sneaky_fail: configs silently load the wrong heading; missing-heading test must raise.
    • UAT: one path plus heading identifies a pairset and prints its expected first pair.
  • T3 (R3): Update active recipes/docs and remove superseded hand-authored JSON/builders.
    • verify: repository search finds no active references to removed JSON pairsets.
    • success: active runs name Markdown sections; pool-derived JSON remains separate.
    • likely_fail: an active recipe still names removed JSON.
    • sneaky_fail: cached v_hack identity hashes the whole Markdown file instead of selected pair bytes; selected-pair hash verification catches this.
  • T4 (R1-R4): Run verification, smoke, and fresh-eyes review.
    • verify: targeted verifier and just smoke pass; review has no unresolved blocker.
    • UAT: linked logs and audit show the result.

Context

  • A pair is gradient(prompt + hack) - gradient(prompt + clean), so prompt equality is structural within a pair.
  • Hand-authored off-distribution pairs are no-cheat. Pool-derived or live-rollout-labelled pairs are not headline-clean.
  • philosophical, moral, intent_vs_spec, and eval_aware currently vary semantic comments and test strength together, so their names overstate axis isolation.

Log

  • pairs_authored.json and prog_wide_clean.json have uncommitted user edits; preserve their current working-tree bytes as audit inputs.
  • The current authored working-tree rows were migrated byte-exactly before deleting the duplicate JSON source. prog_wide_clean.json remains a separate generated/sensitivity artifact.
  • No audited prompt is byte-identical to current train/eval prompts, but intent and framing sets overlap with several corpus problem concepts. “Not in train/test” was too strong.

TODO

Errors

Task Error Resolution

Results

  • Audit and recommendations: docs/personas/pairset_audit.md.
  • Authored/manual sources: data/pairs/hack_pairs.md and data/pairs/pair_diagnostics.md.
  • Deleted 13 duplicate manual JSON files and 3 Python data builders.
  • The initial migration used #mechanism-authored; it was superseded by the later multi-view design in docs/spec/20260610_multiview_hack_pairs.md.

Verify

uv run python scripts/verify_science_invariants.py:

invariant success
selected Markdown pair bytes True
malformed Markdown fails True
authored/control/generated pairsets load True
exactly one prompt hint True
deterministic disjoint val/test True

just smoke:

  • Full log: /tmp/claude-1000/pairset_markdown_smoke.log
  • routeV loaded data/pairs/hack_pairs.md#mechanism-authored -> 11 pairs.
  • Routing band mean width +0.354; 13/14 modules included.
  • All verify gates and the 30-step routeV smoke completed successfully.

Failure mode check

  • likely_fail: nested code fences corrupt Markdown parsing. Exact migration checks loaded all 13 sections and matched every prior JSON row byte-for-byte. PASS.
  • sneaky_fail: hash tracks the whole Markdown file or the wrong heading. The verifier mutates an unrelated section without changing the selected hash, then mutates the selected section and confirms cached-v-hack loading fails. PASS.
  • malformed duplicate fields or missing headings are rejected. PASS.

Review

Fresh-eyes external review: docs/reviews/20260610_pairset_external.md.

  • No scientific audit, data-loss, parser, or hash findings.
  • Fixed its one low-severity finding: result tables now label a Markdown pairset by selected heading rather than the full hack_pairs.md#heading reference.