Consolidate tagged hack pairsets in data

2026-06-27 16:15:35 +08:00 · 2026-06-10 11:58:53 +00:00
parent 944ada360b
commit bf616749ee
33 changed files with 7979 additions and 1268 deletions
@@ -0,0 +1,92 @@
+# Multi-view hack pairs
+
+## Goal
+
+Author one strong all-in-one contrastive pairset that represents deliberate proxy
+gaming across varied contexts, rather than one syntax mechanism or one explicit label.
+
+## Scope
+
+In: replace the active authored pairset with one multi-view section; update its audit,
+default reference, and verification.
+
+Out: claiming the new set is empirically better before a real-model comparison.
+
+## Requirements
+
+- R1: Each pair uses one exact prompt and closely matched hack/clean completions.
+- R2: The set spans behavior, opportunity-aware choice, explicit disposition,
+  naming/reasoning, and non-code proxy gaming.
+- R3: No single superficial feature consistently identifies the hack side.
+- R4: Every hack side strongly expresses exploiting a proxy; every clean side strongly
+  expresses satisfying the underlying task.
+- R5: Pair metadata supports loading tagged subsets but never enters model input.
+
+## Tasks
+
+- [x] T1: Author one 27-pair `all-in-one` section.
+  - verify: exactly 27 unique headings and all fields load.
+  - likely_fail: vague “bad versus good” pairs fail to express proxy gaming.
+  - sneaky_fail: one repeated syntax/token dominates the axis.
+  - UAT: audit table shows balanced views, domains, explicitness, and mechanisms.
+- [x] T2: Make `all-in-one` the active default and simplify active pair sources.
+- [x] T3: Verify parser, balance metrics, extraction smoke, and fresh-eyes review.
+
+## Design
+
+| View | N | Purpose |
+|---|---:|---|
+| concrete behavior | 8 | Anchor the direction in actions resembling live hacks |
+| opportunity-aware action | 6 | Distinguish deliberate exploitation from accidental weakness |
+| explicit disposition/roleplay | 6 | Supply conceptual, intention, and role-conditioned signals |
+| naming/reasoning | 4 | Compact lexical, visible-planning, and `<think>` representations |
+| non-code proxy gaming | 3 | Force cross-context abstraction beyond Python tests |
+
+Each pair has a `Tags:` metadata line. `#all-in-one@behavior` selects pairs with one tag;
+`#all-in-one@behavior,opportunity-aware` selects pairs with both tags (the intersection).
+Tags are not loaded into prompts or completions.
+
+Match tightly within pairs; diversify aggressively across pairs. Explicit language is
+allowed in a minority of pairs because it strongly identifies intention. It must not be
+the only or dominant view.
+
+## Log
+
+- Existing pure-intent pairs underperformed behavior pairs, so explicit pairs are included
+  as one view rather than used alone.
+- Existing philosophical/moral pairs changed prose and print/assert behavior together;
+  the new set never combines a semantic framing contrast with a second unrelated contrast.
+- Incomplete focused snippets are allowed. Empty/pass-only stubs are rejected because they
+  express no substantive decision.
+
+## Errors
+
+| Task | Error | Resolution |
+|---|---|---|
+
+## Results
+
+- Runtime pair data moved to `data/pairs/`; authoring guidance and audit remain in
+  `docs/personas/`.
+- Final headline set: 27 pairs, including 2 matched roleplay instructions and 1 matched
+  `<think>` trace.
+- Tagged loading supports whole-set, single-tag, and tag-intersection extraction.
+
+## Verify
+
+- Full smoke: `/tmp/claude-1000/multiview_pairs_data_smoke.log`
+- routeV loaded `data/pairs/hack_pairs.md#all-in-one -> 27 pairs`.
+- Extraction band mean width `+0.171`; `13/14` modules included.
+- `scripts/verify_science_invariants.py` passed Markdown parsing, tagged subsets,
+  content-addressing, and the no-complete-stub invariant.
+
+## Review
+
+Fresh-eyes review: `docs/reviews/20260610_multiview_pairs_external.md`.
+
+- Judged the multi-view design well-constructed and found no repeated dominant shortcut
+  or disguised stub.
+- Flagged `behavior_visible_examples` as weak-test behavior rather than deliberate
+  exploitation. Kept intentionally; `@opportunity-aware` isolates deliberate choices.
+- Flagged `behavior_proxy_metric` as the largest length mismatch. Kept because shortening
+  real validation or padding shallow validation would weaken the substantive contrast.
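Reviewer sketch (not part of the commit): the tagged selection described in the Design section above could look roughly like the following. This is a minimal illustration under stated assumptions, not the repository's loader. The `Pair` dataclass, `parse_selector`, `select_pairs`, and `model_input` are hypothetical names, and the `path#heading@tag1,tag2` grammar is inferred only from the two selector examples in the spec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """Hypothetical in-memory pair; field names are illustrative assumptions."""
    name: str
    prompt: str
    hack: str
    clean: str
    tags: frozenset = frozenset()


def parse_selector(ref):
    """Split 'path#heading@tag1,tag2' into (path, heading, required_tags)."""
    path, hash_sep, rest = ref.partition("#")
    if not hash_sep or not rest:
        raise ValueError(f"selector needs a '#heading': {ref!r}")
    heading, at_sep, tag_part = rest.partition("@")
    if not heading:
        raise ValueError(f"selector has an empty heading: {ref!r}")
    tags = frozenset(t.strip() for t in tag_part.split(",") if t.strip())
    if at_sep and not tags:
        # Fail fast rather than silently falling back to the whole set.
        raise ValueError(f"selector has '@' but no tags: {ref!r}")
    return path, heading, tags


def select_pairs(pairs, required_tags):
    """Keep pairs carrying every required tag; an empty set keeps all pairs."""
    return [p for p in pairs if required_tags <= p.tags]


def model_input(pair, side):
    """Build model text from prompt and completion only; tags never appear."""
    if side not in ("hack", "clean"):
        raise ValueError(f"side must be 'hack' or 'clean', got {side!r}")
    return pair.prompt + (pair.hack if side == "hack" else pair.clean)
```

Using a subset test (`required_tags <= p.tags`) gives whole-set, single-tag, and tag-intersection loading from one code path, and keeping tags out of `model_input` is what R5 asks for.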
@@ -0,0 +1,124 @@
+# Pairset audit and Markdown source
+
+## Goal
+
+Audit the hand-authored pairsets for clean contrastive construction and provenance,
+decide which are useful for headline extraction versus diagnostics, and replace the
+scattered hand-authored JSON/build-script sources with one or two human-readable
+Markdown sources.
+
+## Scope
+
+In: hand-authored mechanism, intent, semantic-framing, honesty, and placebo pairsets;
+their loaders/builders/recipes; pair-level audit evidence.
+
+Out: pool-derived or oracle-labelled pair generation, changing experiment results,
+and rewriting pair content before the audit identifies a specific defect.
+
+## Requirements
+
+- R1: Audit same-prompt, provenance, style/length confounds, and hack-axis strength.
+  Done means: a per-pair table and pairset-level recommendation exist.
+- R2: Distinguish no-cheat headline supervision from diagnostics and controls.
+  Done means: each retained section has an explicit role and provenance.
+- R3: Use at most two human-readable Markdown source files for hand-authored pairsets.
+  Done means: runtime loading selects a named Markdown heading and scattered generated
+  JSON is no longer the source of truth.
+- R4: Fail fast on malformed Markdown or missing headings/fields.
+  Done means: verification deliberately checks malformed and missing sections fail.
+
+## Tasks
+
+- [x] T1 (R1-R2): Audit every hand-authored pairset.
+  - verify: audit table reports every pair and pairset summary.
+  - success: same-prompt is universal; provenance and confounds are explicit.
+  - likely_fail: a set is omitted; summary counts disagree with source counts.
+  - sneaky_fail: pairset name implies an isolated axis but another systematic contrast
+    dominates; manual evidence records the actual contrast.
+  - UAT: open the audit and see a recommendation with reasons for each pairset.
+- [x] T2 (R3-R4): Implement a minimal heading-based Markdown pair loader and sources.
+  - verify: each named heading loads the expected count and exact pair bytes.
+  - success: one or two Markdown files contain all hand-authored pairs.
+  - likely_fail: fenced code inside completions breaks parsing; exact-byte comparison fails.
+  - sneaky_fail: configs silently load the wrong heading; missing-heading test must raise.
+  - UAT: one path plus heading identifies a pairset and prints its expected first pair.
+- [x] T3 (R3): Update active recipes/docs and remove superseded hand-authored JSON/builders.
+  - verify: repository search finds no active references to removed JSON pairsets.
+  - success: active runs name Markdown sections; pool-derived JSON remains separate.
+  - likely_fail: an active recipe still names removed JSON.
+  - sneaky_fail: cached v_hack identity hashes the whole Markdown file instead of selected
+    pair bytes; selected-pair hash verification catches this.
+- [x] T4 (R1-R4): Run verification, smoke, and fresh-eyes review.
+  - verify: targeted verifier and `just smoke` pass; review has no unresolved blocker.
+  - UAT: linked logs and audit show the result.
+
+## Context
+
+- A pair is consumed as `gradient(prompt + hack) - gradient(prompt + clean)`, so prompt
+  equality is structural within a pair.
+- Hand-authored off-distribution pairs are no-cheat. Pool-derived or live-rollout-labelled
+  pairs are not headline-clean.
+- `philosophical`, `moral`, `intent_vs_spec`, and `eval_aware` currently vary semantic
+  comments and test strength together, so their names overstate axis isolation.
+
+## Log
+
+- `pairs_authored.json` and `prog_wide_clean.json` have uncommitted user edits; preserve
+  their current working-tree bytes as audit inputs.
+- The current authored working-tree rows were migrated byte-exactly before deleting the
+  duplicate JSON source. `prog_wide_clean.json` remains a separate generated/sensitivity
+  artifact.
+- No audited prompt is byte-identical to current train/eval prompts, but intent and framing
+  sets overlap with several corpus problem concepts. “Not in train/test” was too strong.
+
+## TODO
+
+## Errors
+
+| Task | Error | Resolution |
+|---|---|---|
+
+## Results
+
+- Audit and recommendations: `docs/personas/pairset_audit.md`.
+- Authored/manual sources: `data/pairs/hack_pairs.md` and
+  `data/pairs/pair_diagnostics.md`.
+- Deleted 13 duplicate manual JSON files and 3 Python data builders.
+- The initial migration used `#mechanism-authored`; it was superseded by the later
+  multi-view design in `docs/spec/20260610_multiview_hack_pairs.md`.
+
+## Verify
+
+`uv run python scripts/verify_science_invariants.py`:
+
+| invariant | success |
+|---|---|
+| selected Markdown pair bytes | True |
+| malformed Markdown fails | True |
+| authored/control/generated pairsets load | True |
+| exactly one prompt hint | True |
+| deterministic disjoint val/test | True |
+
+`just smoke`:
+
+- Full log: `/tmp/claude-1000/pairset_markdown_smoke.log`
+- routeV loaded `data/pairs/hack_pairs.md#mechanism-authored -> 11 pairs`.
+- Routing band mean width `+0.354`; `13/14` modules included.
+- All verify gates and the 30-step routeV smoke completed successfully.
+
+## Failure mode check
+
+- likely_fail: nested code fences corrupt Markdown parsing. Exact migration checks loaded all
+  13 sections and matched every prior JSON row byte-for-byte. PASS.
+- sneaky_fail: hash tracks the whole Markdown file or the wrong heading. The verifier mutates
+  an unrelated section and confirms the selected hash is unchanged, then mutates the selected
+  section and confirms cached `v_hack` loading fails. PASS.
+- malformed duplicate fields or missing headings are rejected. PASS.
+
+## Review
+
+Fresh-eyes external review: `docs/reviews/20260610_pairset_external.md`.
+
+- No scientific audit, data-loss, parser, or hash findings.
+- Fixed its one low-severity finding: result tables now label a Markdown pairset by
+  selected heading rather than the full `hack_pairs.md#heading` reference.
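Reviewer sketch (not part of the commit): the selected-pair hash check from the failure mode section above can be illustrated as follows. This is a minimal sketch, assuming pairs are already parsed into a mapping from heading to `(prompt, hack, clean)` tuples. `selected_pair_hash` and that mapping are hypothetical, and SHA-256 with length-prefixed fields is an illustrative choice rather than the repository's actual scheme.

```python
import hashlib


def selected_pair_hash(sections, heading):
    """Hash only the pairs under `heading`; raise if the heading is missing."""
    if heading not in sections:
        raise KeyError(f"missing heading: {heading!r}")
    digest = hashlib.sha256()
    for prompt, hack, clean in sections[heading]:
        for field in (prompt, hack, clean):
            data = field.encode("utf-8")
            # Length-prefix each field so field boundaries cannot be ambiguous.
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()
```

Because only the selected section's bytes feed the digest, editing an unrelated section leaves the cache identity unchanged, while any edit to the selected pairs changes it and invalidates a cached `v_hack`.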