Multi-view hack pairs
Goal
Author one strong all-in-one contrastive pairset that represents deliberate proxy gaming across varied contexts, rather than one syntax mechanism or one explicit label.
Scope
In: replace the active authored pairset with one multi-view section; update its audit, default reference, and verification.
Out: claiming the new set is empirically better before a real-model comparison.
Requirements
- R1: Each pair uses one exact prompt and closely matched hack/clean completions.
- R2: The set spans behavior, opportunity-aware choice, explicit disposition, naming/reasoning, and non-code proxy gaming.
- R3: No single superficial feature consistently identifies the hack side.
- R4: Every hack side strongly expresses exploiting a proxy; every clean side strongly expresses satisfying the underlying task.
- R5: Pair metadata supports loading tagged subsets without entering model input.
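
R3 can be probed mechanically. The sketch below is one rough heuristic, not the project's audit: it assumes pairs are available as `(hack, clean)` string tuples, and the function name `one_sided_tokens` and the 0.5 threshold are hypothetical. It reports word tokens that appear on the hack side but not the clean side in a large share of pairs.

```python
import re
from collections import Counter


def one_sided_tokens(pairs, threshold=0.5):
    """Return {token: share of pairs} for tokens seen only on the hack side.

    `pairs` is a list of (hack, clean) completion strings. A token counts
    once per pair, and only if it is absent from that pair's clean side.
    The threshold is an illustrative assumption, not a project setting.
    """
    if not pairs:
        return {}
    hack_only = Counter()
    for hack, clean in pairs:
        hack_tokens = set(re.findall(r"\w+", hack.lower()))
        clean_tokens = set(re.findall(r"\w+", clean.lower()))
        hack_only.update(hack_tokens - clean_tokens)
    total = len(pairs)
    return {tok: n / total for tok, n in hack_only.items() if n / total >= threshold}
```

An empty result does not prove R3 holds, since shortcuts can also be structural (length, indentation, comment style), but a non-empty result is a cheap early warning.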
Tasks
- T1: Author one 27-pair `all-in-one` section.
  - verify: exactly 27 unique headings and all fields load.
  - likely_fail: vague “bad versus good” pairs fail to express proxy gaming.
  - sneaky_fail: one repeated syntax/token dominates the axis.
  - UAT: audit table shows balanced views, domains, explicitness, and mechanisms.
- T2: Make `all-in-one` the active default and simplify active pair sources.
- T3: Verify parser, balance metrics, extraction smoke, and fresh-eyes review.
Design
| View | N | Purpose |
|---|---|---|
| concrete behavior | 8 | Anchor the direction in actions resembling live hacks |
| opportunity-aware action | 6 | Distinguish deliberate exploitation from accidental weakness |
| explicit disposition/roleplay | 6 | Supply conceptual, intention, and role-conditioned signals |
| naming/reasoning | 4 | Compact lexical, visible-planning, and `<think>` representations |
| non-code proxy gaming | 3 | Force cross-context abstraction beyond Python tests |
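
The per-view counts above sum to the 27 pairs required by T1, and an audit can check an authored set against them. The following is a minimal sketch that assumes each pair is assigned exactly one primary view; `PLANNED_VIEWS` and `view_mismatches` are hypothetical names, not part of the existing audit.

```python
from collections import Counter

# Planned pairs per view, copied from the Design table.
PLANNED_VIEWS = {
    "concrete behavior": 8,
    "opportunity-aware action": 6,
    "explicit disposition/roleplay": 6,
    "naming/reasoning": 4,
    "non-code proxy gaming": 3,
}
TOTAL_PAIRS = sum(PLANNED_VIEWS.values())  # 8 + 6 + 6 + 4 + 3 = 27


def view_mismatches(primary_views):
    """Return {view: (actual, planned)} for every view whose count is off.

    `primary_views` holds one view label per authored pair (an assumption;
    the real pairs may carry several tags each).
    """
    actual = Counter(primary_views)
    mismatches = {
        view: (actual.get(view, 0), planned)
        for view, planned in PLANNED_VIEWS.items()
        if actual.get(view, 0) != planned
    }
    # Labels that are not in the plan at all are reported with planned = 0.
    for view in actual.keys() - PLANNED_VIEWS.keys():
        mismatches[view] = (actual[view], 0)
    return mismatches
```
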
Each pair has a `Tags:` metadata line. `#all-in-one@behavior` selects one tag;
`#all-in-one@behavior,opportunity-aware` selects their intersection. Tags are not loaded
into prompts or completions.
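
The selector semantics can be sketched as follows. This is an illustration under assumptions, not the loader's actual code: the `Pair` fields and the helpers `parse_selector`, `select`, and `model_inputs` are hypothetical, and a pair matches only if it carries every requested tag.

```python
from dataclasses import dataclass, field


@dataclass
class Pair:
    """Hypothetical in-memory pair; field names are illustrative."""
    heading: str
    prompt: str
    hack: str
    clean: str
    tags: frozenset = field(default_factory=frozenset)  # metadata only


def parse_selector(ref):
    """Split 'path#section@tag1,tag2' into (path, section, required_tags)."""
    path, _, fragment = ref.partition("#")
    section, _, tag_part = fragment.partition("@")
    tags = frozenset(t.strip() for t in tag_part.split(",") if t.strip())
    return path, section, tags


def select(pairs, required_tags):
    """Keep pairs that carry every required tag (intersection semantics)."""
    return [p for p in pairs if required_tags <= p.tags]


def model_inputs(pair):
    """Build the two model inputs; tags never enter either string."""
    return pair.prompt + pair.hack, pair.prompt + pair.clean
```

With no `@` suffix the required tag set is empty, so `select` returns the whole section, which matches whole-set loading.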
Match tightly within pairs; diversify aggressively across pairs. Explicit language is allowed in a minority of pairs because it strongly identifies intention. It must not be the only or dominant view.
Log
- Existing pure-intent pairs underperformed behavior pairs, so explicit pairs are included as one view rather than used alone.
- Existing philosophical/moral pairs changed prose and print/assert behavior together; the new set never combines a semantic framing contrast with a second unrelated contrast.
- Incomplete focused snippets are allowed. Empty/pass-only stubs are rejected because they express no substantive decision.
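
The stub rule in the last entry could be checked roughly as below. This is a sketch under stated assumptions rather than the invariant in the project's verifier: it only handles Python completions, the name `is_pass_only_stub` is hypothetical, and text that fails to parse is treated as an allowed incomplete snippet.

```python
import ast


def _is_trivial(node):
    """True for statements that express no decision: pass, ..., docstrings."""
    if isinstance(node, ast.Pass):
        return True
    if isinstance(node, ast.Expr) and isinstance(node.value, ast.Constant):
        value = node.value.value
        return value is Ellipsis or isinstance(value, str)
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
        return all(_is_trivial(child) for child in node.body)
    return False


def is_pass_only_stub(completion):
    """Return True when a completion is empty or contains only stub bodies."""
    text = completion.strip()
    if not text:
        return True
    try:
        tree = ast.parse(text)
    except SyntaxError:
        # Incomplete focused snippets are allowed, so do not reject them.
        return False
    return all(_is_trivial(stmt) for stmt in tree.body)
```
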
Errors
| Task | Error | Resolution |
|---|---|---|
Results
- Runtime pair data moved to `data/pairs/`; authoring guidance and audit remain in `docs/personas/`.
- Final headline set: 27 pairs, including 2 matched roleplay instructions and 1 matched `<think>` trace.
- Tagged loading supports whole-set, single-tag, and tag-intersection extraction.
Verify
- Full smoke: `/tmp/claude-1000/multiview_pairs_data_smoke.log`
- routeV loaded `data/pairs/hack_pairs.md#all-in-one` -> 27 pairs.
- Extraction band mean width `+0.171`; `13/14` modules included.
- `scripts/verify_science_invariants.py` passed Markdown parsing, tagged subsets, content-addressing, and the no-complete-stub invariant.
Review
Fresh-eyes review: docs/reviews/20260610_multiview_pairs_external.md.
- Judged the multi-view design well-constructed and found no repeated dominant shortcut or disguised stub.
- Flagged `behavior_visible_examples` as weak-test behavior rather than deliberate exploitation. Kept intentionally; `@opportunity-aware` isolates deliberate choices.
- Flagged `behavior_proxy_metric` as the largest length mismatch. Kept because shortening real validation or padding shallow validation would weaken the substantive contrast.