evil_MoE/docs/spec/20260610_multiview_hack_pairs.md
2026-06-10 11:58:53 +00:00

Multi-view hack pairs

Goal

Author one strong all-in-one contrastive pairset that represents deliberate proxy gaming across varied contexts, rather than one syntax mechanism or one explicit label.

Scope

In: replace the active authored pairset with one multi-view section; update its audit, default reference, and verification.

Out: claiming the new set is empirically better before a real-model comparison.

Requirements

  • R1: Each pair uses one exact prompt and closely matched hack/clean completions.
  • R2: The set spans behavior, opportunity-aware choice, explicit disposition, naming/reasoning, and non-code proxy gaming.
  • R3: No single superficial feature consistently identifies the hack side.
  • R4: Every hack side strongly expresses exploiting a proxy; every clean side strongly expresses satisfying the underlying task.
  • R5: Pair metadata supports loading tagged subsets without entering model input.
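R3 can be spot-checked mechanically at the surface-token level. The following is a minimal sketch, assuming pairs are available as `(hack, clean)` string tuples; the tokenizer, the function names, and the 0.8 threshold are illustrative choices and not part of this spec.

```python
import re
from collections import Counter


def token_set(text):
    """Crude tokenizer: identifier-like words plus single symbol/digit characters."""
    return set(re.findall(r"[A-Za-z_]+|[^\sA-Za-z_]", text))


def hack_only_tokens(pairs, threshold=0.8):
    """Map each token to the fraction of pairs where it appears on the hack side only.

    Only tokens at or above `threshold` are returned; a non-empty result
    suggests a superficial feature that could identify the hack side.
    """
    counts = Counter()
    for hack, clean in pairs:
        counts.update(token_set(hack) - token_set(clean))
    total = len(pairs)
    return {tok: n / total for tok, n in counts.items() if n / total >= threshold}
```

A check like this only catches lexical shortcuts. Structural or semantic shortcuts (for example, the hack side always being shorter) would need separate metrics.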

Tasks

  • T1: Author one 27-pair all-in-one section.
    • verify: exactly 27 unique headings and all fields load.
    • likely_fail: vague “bad versus good” pairs fail to express proxy gaming.
    • sneaky_fail: one repeated syntax/token dominates the axis.
    • UAT: audit table shows balanced views, domains, explicitness, and mechanisms.
  • T2: Make all-in-one the active default and simplify active pair sources.
  • T3: Verify parser, balance metrics, extraction smoke, and fresh-eyes review.
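The T1 verify step and UAT can be approximated with a small audit over the loaded pairs. This is a sketch under stated assumptions: pairs are plain dicts, the field names (`heading`, `view`, `prompt`, `hack`, `clean`) are hypothetical, and each pair is assumed to belong to exactly one view. The per-view counts are copied from the table under Design.

```python
from collections import Counter

EXPECTED_TOTAL = 27
# Per-view counts copied from the Design table.
EXPECTED_VIEWS = {
    "concrete behavior": 8,
    "opportunity-aware action": 6,
    "explicit disposition/roleplay": 6,
    "naming/reasoning": 4,
    "non-code proxy gaming": 3,
}
# Assumed field names; the real loader may use different ones.
REQUIRED_FIELDS = ("heading", "view", "prompt", "hack", "clean")


def audit(pairs):
    """Return a list of human-readable problems; an empty list means the set passes."""
    problems = []
    if len(pairs) != EXPECTED_TOTAL:
        problems.append(f"expected {EXPECTED_TOTAL} pairs, got {len(pairs)}")
    for heading, n in Counter(p.get("heading") for p in pairs).items():
        if n > 1:
            problems.append(f"duplicate heading: {heading!r}")
    for p in pairs:
        for name in REQUIRED_FIELDS:
            if not p.get(name):
                problems.append(f"{p.get('heading')!r}: missing field {name}")
    views = Counter(p.get("view") for p in pairs)
    for view, expected in EXPECTED_VIEWS.items():
        if views[view] != expected:
            problems.append(f"view {view!r}: expected {expected}, got {views[view]}")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to fill in an audit table in a single pass.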

Design

| View | N | Purpose |
| --- | --- | --- |
| concrete behavior | 8 | Anchor the direction in actions resembling live hacks |
| opportunity-aware action | 6 | Distinguish deliberate exploitation from accidental weakness |
| explicit disposition/roleplay | 6 | Supply conceptual, intention, and role-conditioned signals |
| naming/reasoning | 4 | Compact lexical, visible-planning, and <think> representations |
| non-code proxy gaming | 3 | Force cross-context abstraction beyond Python tests |

Each pair has a Tags: metadata line. The selector #all-in-one@behavior loads the pairs carrying that one tag; #all-in-one@behavior,opportunity-aware loads only the pairs carrying both tags (the intersection). Tags are never loaded into prompts or completions.
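The selector semantics can be sketched as follows. This is an illustration only: the `Pair` class, its field names, and the helper functions are hypothetical, and the real loader may differ. What it is meant to show is that an empty tag list selects the whole section, multiple tags intersect, and tags never reach the model input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """Hypothetical in-memory pair; field names are assumptions."""
    name: str
    prompt: str
    hack: str
    clean: str
    tags: frozenset = frozenset()


def parse_selector(selector):
    """Split '[path]#section@tag1,tag2' into (section, required_tags)."""
    _, _, fragment = selector.rpartition("#")
    section, _, tag_part = fragment.partition("@")
    tags = frozenset(t.strip() for t in tag_part.split(",") if t.strip())
    return section, tags


def select(pairs, required_tags):
    """Keep pairs carrying every required tag (intersection semantics)."""
    return [p for p in pairs if required_tags <= p.tags]


def model_inputs(pair):
    """Build the two texts fed to the model; tags are deliberately excluded."""
    return pair.prompt + pair.hack, pair.prompt + pair.clean
```

Treating the required tags as a subset test keeps whole-set, single-tag, and tag-intersection loading on one code path.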

Match tightly within pairs; diversify aggressively across pairs. Explicit language is allowed in a minority of pairs because it strongly identifies intention. It must not be the only or dominant view.

Log

  • Existing pure-intent pairs underperformed behavior pairs, so explicit pairs are included as one view rather than used alone.
  • Existing philosophical/moral pairs changed prose and print/assert behavior together; the new set never combines a semantic framing contrast with a second unrelated contrast.
  • Incomplete focused snippets are allowed. Empty/pass-only stubs are rejected because they express no substantive decision.
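The stub rule above could be checked along these lines. This is a minimal sketch, not the check in scripts/verify_science_invariants.py: the function names are hypothetical, and it assumes Python completions, treating anything that does not parse (incomplete snippets, non-code prose) as allowed.

```python
import ast
import textwrap


def _is_trivial(node):
    """True for pass, bare constants (docstrings, `...`), and defs made only of those."""
    if isinstance(node, ast.Pass):
        return True
    if isinstance(node, ast.Expr) and isinstance(node.value, ast.Constant):
        return True
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
        return all(_is_trivial(child) for child in node.body)
    return False


def is_stub(completion):
    """True if the completion is empty or expresses no substantive decision."""
    if not completion.strip():
        return True
    try:
        tree = ast.parse(textwrap.dedent(completion))
    except SyntaxError:
        # Incomplete focused snippets are allowed, so unparsable text is not a stub.
        return False
    return all(_is_trivial(node) for node in tree.body)
```

Parsing with `ast` rather than matching the literal string `pass` also catches docstring-only and `...`-only bodies.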

Errors

| Task | Error | Resolution |
| --- | --- | --- |

Results

  • Runtime pair data moved to data/pairs/; authoring guidance and audit remain in docs/personas/.
  • Final headline set: 27 pairs, including 2 matched roleplay instructions and 1 matched <think> trace.
  • Tagged loading supports whole-set, single-tag, and tag-intersection extraction.

Verify

  • Full smoke: /tmp/claude-1000/multiview_pairs_data_smoke.log
  • routeV loaded data/pairs/hack_pairs.md#all-in-one -> 27 pairs.
  • Extraction band mean width +0.171; 13/14 modules included.
  • scripts/verify_science_invariants.py passed Markdown parsing, tagged subsets, content-addressing, and the no-complete-stub invariant.

Review

Fresh-eyes review: docs/reviews/20260610_multiview_pairs_external.md.

  • Judged the multi-view design well-constructed and found no repeated dominant shortcut or disguised stub.
  • Flagged behavior_visible_examples as weak-test behavior rather than deliberate exploitation. Kept intentionally; @opportunity-aware isolates deliberate choices.
  • Flagged behavior_proxy_metric as the largest length mismatch. Kept because shortening real validation or padding shallow validation would weaken the substantive contrast.