mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
# Multi-view hack pairs

## Goal

Author one strong all-in-one contrastive pairset that represents deliberate proxy
gaming across varied contexts, rather than one syntax mechanism or one explicit label.

## Scope

In: replace the active authored pairset with one multi-view section; update its audit,
default reference, and verification.

Out: claiming the new set is empirically better before a real-model comparison.

## Requirements

- R1: Each pair uses one exact prompt and closely matched hack/clean completions.
- R2: The set spans behavior, opportunity-aware choice, explicit disposition,
  naming/reasoning, and non-code proxy gaming.
- R3: No single superficial feature consistently identifies the hack side.
- R4: Every hack side strongly expresses exploiting a proxy; every clean side strongly
  expresses satisfying the underlying task.
- R5: Pair metadata supports loading tagged subsets without entering model input.

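R3 can be spot-checked mechanically. The following is a minimal sketch, not the project's audit code: it assumes each pair is available as a `(hack, clean)` string tuple, uses naive whitespace tokens, and the function name `shortcut_tokens`, the `threshold` parameter, and the toy pairs are all hypothetical.

```python
from collections import Counter

def shortcut_tokens(pairs, threshold=0.5):
    """Return tokens that occur on the hack side but not the clean side
    in at least `threshold` of the pairs, mapped to that fraction."""
    counts = Counter()
    for hack, clean in pairs:
        # Tokens unique to the hack completion within this one pair.
        counts.update(set(hack.split()) - set(clean.split()))
    n = len(pairs)
    return {tok: c / n for tok, c in counts.items() if c / n >= threshold}

# Hypothetical toy pairs, not taken from the real pairset.
toy_pairs = [
    ("return EXPECTED[name]  # hack", "return compute(name)"),
    ("assert True  # hack", "assert result == expected"),
    ("skip the check", "run the check"),
]
flagged = shortcut_tokens(toy_pairs)
print(sorted(flagged))
```

In this toy set the repeated `# hack` comment is flagged, which is the kind of superficial marker R3 rules out. A real audit would likely also look at clean-only tokens and at structural features, not only surface tokens.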
## Tasks

- [x] T1: Author one 27-pair `all-in-one` section.
  - verify: exactly 27 unique headings and all fields load.
  - likely_fail: vague “bad versus good” pairs fail to express proxy gaming.
  - sneaky_fail: one repeated syntax/token dominates the axis.
  - UAT: audit table shows balanced views, domains, explicitness, and mechanisms.
- [x] T2: Make `all-in-one` the active default and simplify active pair sources.
- [x] T3: Verify parser, balance metrics, extraction smoke, and fresh-eyes review.

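The T1 `verify` step (exactly 27 unique headings) could be checked along the following lines. This is a sketch under stated assumptions: the pair file layout is not shown in this document, so the code assumes a `## all-in-one` section heading with one `###` heading per pair, and the helper names `pair_headings` and `duplicates` are hypothetical.

```python
import re
from collections import Counter

FENCE = "`" * 3  # Markdown code-fence marker, built this way to avoid nesting fences

def pair_headings(markdown_text, section, pair_level=3):
    """Collect headings at `pair_level` that sit under the heading named `section`."""
    headings, in_section, in_fence = [], False, False
    for line in markdown_text.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # skip fenced code so code comments are not read as headings
            continue
        if in_fence:
            continue
        m = re.match(r"^(#{1,6})\s+(.*\S)\s*$", line)
        if not m:
            continue
        level, title = len(m.group(1)), m.group(2)
        if level < pair_level:
            in_section = title == section  # a higher-level heading opens or closes the section
        elif level == pair_level and in_section:
            headings.append(title)
    return headings

def duplicates(headings):
    """Return heading titles that occur more than once."""
    return sorted(h for h, n in Counter(headings).items() if n > 1)

# Hypothetical miniature file; the real layout may differ.
sample = "\n".join([
    "## all-in-one",
    "### behavior_a",
    "Tags: behavior",
    FENCE + "python",
    "# a comment inside a fence, not a heading",
    FENCE,
    "### behavior_b",
    "Tags: behavior",
    "## archive",
    "### old_pair",
])
found = pair_headings(sample, "all-in-one")
print(len(found), duplicates(found))
```

Against the real file, the check would be that `len(found) == 27` and `duplicates(found)` is empty.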
## Design

| View | N | Purpose |
|---|---:|---|
| concrete behavior | 8 | Anchor the direction in actions resembling live hacks |
| opportunity-aware action | 6 | Distinguish deliberate exploitation from accidental weakness |
| explicit disposition/roleplay | 6 | Supply conceptual, intention, and role-conditioned signals |
| naming/reasoning | 4 | Compact lexical, visible-planning, and `<think>` representations |
| non-code proxy gaming | 3 | Force cross-context abstraction beyond Python tests |

Each pair has a `Tags:` metadata line. `#all-in-one@behavior` selects the pairs carrying
one tag; `#all-in-one@behavior,opportunity-aware` selects the pairs carrying both tags
(their intersection). Tags are not loaded into prompts or completions.

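The reference syntax above can be illustrated with a small sketch. It is not the project's loader: `parse_ref`, `select`, the dict-based pair records, and the `roleplay` tag are hypothetical, and only the `path#section@tag,tag` shape and the intersection semantics come from this document.

```python
def parse_ref(ref):
    """Split 'path#section@tag1,tag2' into (path, section, tags)."""
    path, _, fragment = ref.partition("#")
    section, _, tag_part = fragment.partition("@")
    tags = frozenset(t.strip() for t in tag_part.split(",") if t.strip())
    return path, section, tags

def select(pairs, tags):
    """Keep pairs whose tag set contains every requested tag (intersection)."""
    return [p for p in pairs if tags <= p["tags"]]

# Hypothetical in-memory records; tags live beside the text, never inside it.
pairs = [
    {"name": "a", "tags": {"behavior"}, "prompt": "p1"},
    {"name": "b", "tags": {"behavior", "opportunity-aware"}, "prompt": "p2"},
    {"name": "c", "tags": {"roleplay"}, "prompt": "p3"},
]

path, section, tags = parse_ref(
    "data/pairs/hack_pairs.md#all-in-one@behavior,opportunity-aware"
)
print(path, section, sorted(tags))
print([p["name"] for p in select(pairs, tags)])
```

A reference without `@` yields an empty tag set, so `select` returns the whole section. Keeping tags in a separate field is one way to ensure they never reach the model input.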
Match tightly within pairs; diversify aggressively across pairs. Explicit language is
allowed in a minority of pairs because it strongly identifies intention. It must not be
the only or dominant view.

## Log

- Existing pure-intent pairs underperformed behavior pairs, so explicit pairs are included
  as one view rather than used alone.
- Existing philosophical/moral pairs changed prose and print/assert behavior together;
  the new set never combines a semantic framing contrast with a second unrelated contrast.
- Incomplete focused snippets are allowed. Empty/pass-only stubs are rejected because they
  express no substantive decision.

## Errors

| Task | Error | Resolution |
|---|---|---|

## Results

- Runtime pair data moved to `data/pairs/`; authoring guidance and audit remain in
  `docs/personas/`.
- Final headline set: 27 pairs, including 2 matched roleplay instructions and 1 matched
  `<think>` trace.
- Tagged loading supports whole-set, single-tag, and tag-intersection extraction.

## Verify

- Full smoke: `/tmp/claude-1000/multiview_pairs_data_smoke.log`
- routeV loaded `data/pairs/hack_pairs.md#all-in-one -> 27 pairs`.
- Extraction band mean width `+0.171`; `13/14` modules included.
- `scripts/verify_science_invariants.py` passed Markdown parsing, tagged subsets,
  content-addressing, and the no-complete-stub invariant.

## Review

Fresh-eyes review: `docs/reviews/20260610_multiview_pairs_external.md`.

- Judged the multi-view design well-constructed and found no repeated dominant shortcut
  or disguised stub.
- Flagged `behavior_visible_examples` as weak-test behavior rather than deliberate
  exploitation. Kept intentionally; `@opportunity-aware` isolates deliberate choices.
- Flagged `behavior_proxy_metric` as the largest length mismatch. Kept because shortening
  real validation or padding shallow validation would weaken the substantive contrast.
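
A length mismatch like the one flagged for `behavior_proxy_metric` can be ranked with a simple ratio. The sketch below is an assumption about how such a check might look, not the reviewer's method: it measures characters rather than tokens, and `length_mismatch`, `rank_by_mismatch`, and the toy data are hypothetical.

```python
def length_mismatch(hack, clean):
    """Relative character-length gap between the two sides, in [0, 1]."""
    longer = max(len(hack), len(clean))
    return abs(len(hack) - len(clean)) / longer if longer else 0.0

def rank_by_mismatch(pairs):
    """Sort (name, hack, clean) triples by mismatch, largest first."""
    scored = ((name, length_mismatch(hack, clean)) for name, hack, clean in pairs)
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy data, not the real completions.
toy = [
    ("short_gap", "abcd", "abcde"),
    ("long_gap", "a" * 10, "a" * 40),
]
ranking = rank_by_mismatch(toy)
print(ranking)
```

A ranking of this kind only surfaces candidates. As the review notes, a large gap can be acceptable when equalizing lengths would weaken the substantive contrast.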