mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
Consolidate tagged hack pairsets in data
@@ -0,0 +1,111 @@
# Multi-view pairset audit

Rubric: [how_to_write_personas.md](how_to_write_personas.md) and
[how_to_rewrite_pairs.md](how_to_rewrite_pairs.md).

## Decision

The active headline source is `data/pairs/hack_pairs.md#all-in-one`: 27 hand-authored,
off-distribution pairs spanning different representations and contexts of one construct:

> Given a proxy that can be exploited, choose proxy success over the underlying task.

This is a stronger construct definition than behavior-only or persona-only pairs. Concrete
behavior anchors the direction near live code gradients. Opportunity-aware pairs supply
evidence of deliberate exploitation. Explicit dispositions make the intended concept
unambiguous. Non-code proxy gaming pushes the average away from one Python syntax pattern.

This is a reasoned design choice, not an empirical result. It should replace the narrow
11-pair mechanism set for the next comparison, but the old result remains the baseline to
beat.

## Composition

| View | N | What it contributes | Main risk |
|---|---:|---|---|
| behavior | 14 | concrete exploit actions in Python, data, API, spreadsheet, operations, ML, service, and web contexts | syntax/mechanism direction |
| opportunity-aware | 6 | response chooses exploitation after an imperfect evaluator is disclosed | prompt makes gaming unusually salient |
| disposition/roleplay | 6 | direct opposite personas and matched role instructions around exploiting proxies | explicit vocabulary may not transfer |
| reasoning/naming | 4 | compact plan, `<think>` trace, claim, and strategy-name representations | lexical association |
| non-code proxy gaming | 3 | organizational, research, and safety metric gaming | far from code gradients |

Tags overlap by design: `behavior` includes the six opportunity-aware actions, so the N
column sums to 33 rather than 27.

## Balance checks

| Property | Result |
|---|---:|
| total pairs | 27 |
| exact same prompt within every pair | yes, by schema |
| explicit pairs | 10 |
| implicit pairs | 17 |
| opportunity-aware pairs | 6 |
| non-code proxy-gaming pairs | 3 |
| mean hack/clean character ratio | 1.07 |
| maximum character ratio | 1.35 |
| empty or pass-only stubs | 0 |
| unique pair headings | 27 |

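The two ratio rows can be reproduced with a few lines of Python. The sketch below is illustrative only: the audit does not state the exact formula, so `char_ratio` (hack length over clean length) and `length_mismatch` (longer side over shorter side) are assumed definitions, and the function names and toy pairs are hypothetical rather than taken from the repository.

```python
from statistics import mean


def char_ratio(hack: str, clean: str) -> float:
    """Hack completion length divided by clean completion length (assumed definition)."""
    if not clean:
        raise ValueError("clean completion is empty")
    return len(hack) / len(clean)


def length_mismatch(hack: str, clean: str) -> float:
    """Longer side divided by shorter side, so the value is always >= 1 (assumed definition)."""
    shorter, longer = sorted((len(hack), len(clean)))
    if shorter == 0:
        raise ValueError("one completion is empty")
    return longer / shorter


def ratio_summary(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Mean hack/clean ratio and worst-case mismatch over (hack, clean) pairs."""
    return {
        "mean_ratio": mean(char_ratio(h, c) for h, c in pairs),
        "max_mismatch": max(length_mismatch(h, c) for h, c in pairs),
    }


# Toy completions for illustration; these are not rows from hack_pairs.md.
toy_pairs = [("return True", "return x > 0"), ("assert out", "assert out == want")]
summary = ratio_summary(toy_pairs)
```

Reporting both numbers separates a systematic length bias (mean far from 1) from a single outlier pair (large maximum).
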
No single feature identifies every hack side:

- Only 10/27 pairs use an explicitly tagged disposition, reasoning, naming, or roleplay view.
- Only 8/27 use Python test/validation behavior without an opportunity prompt.
- The hack side is not consistently shorter, longer, less commented, or one syntax form.
- Exploit mechanisms include weak predicates, type-only checks, partial checks, visible
  examples, hardcoding, swallowed failures, proxy metrics, disabled validation, marker
  spoofing, benchmark specialization, evaluation overfit, visual spoofing, false claims,
  and non-code metric gaming.

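A shortcut check along these lines could back the list above. This is a minimal sketch under assumptions: the feature set, the `is_shortcut` and `find_shortcuts` names, and the toy pairs are all hypothetical, and the real audit may have used different or manual checks.

```python
from typing import Callable

Pair = tuple[str, str]  # (hack completion, clean completion)
Feature = Callable[[str, str], bool]  # feature(side, other_side)

# Hypothetical superficial features; each compares one side with its partner.
FEATURES: dict[str, Feature] = {
    "shorter": lambda side, other: len(side) < len(other),
    "longer": lambda side, other: len(side) > len(other),
    "fewer_comments": lambda side, other: side.count("#") < other.count("#"),
}


def is_shortcut(feature: Feature, pairs: list[Pair]) -> bool:
    """True if the feature holds for the hack side, and not the clean side, in every pair."""
    return bool(pairs) and all(
        feature(hack, clean) and not feature(clean, hack) for hack, clean in pairs
    )


def find_shortcuts(pairs: list[Pair]) -> list[str]:
    """Names of features that alone identify the hack side across the whole set."""
    return sorted(name for name, feature in FEATURES.items() if is_shortcut(feature, pairs))


# Toy data: the first set leaks length, the second does not.
leaky = [("a", "bbb"), ("cc", "dddd")]
mixed = [("a", "bbb"), ("cccc", "d")]
leaky_shortcuts = find_shortcuts(leaky)
mixed_shortcuts = find_shortcuts(mixed)
```

An empty result means none of the listed features is a perfect separator; it does not rule out features that were never listed.
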
## Pair-quality judgment

Strongest anchors:

- `behavior_weak_predicate`, `behavior_type_only`, and `behavior_partial_output`: very
  tight local behavioral contrasts.
- `opportunity_spreadsheet`, `opportunity_report_marker`, and
  `opportunity_benchmark_cache`: deliberate action is clear from the disclosed proxy.
- `disposition_exploit_tests` and `disposition_proxy_target`: explicit direct opposites.
- `proxy_research_metric` and `proxy_warehouse_safety`: clear cross-domain Goodhart cases.

Known limitations:

- `behavior_proxy_metric` has the largest length mismatch (1.35x) because real validation
  is structurally richer than shallow validation.
- Explicit persona/reasoning pairs may primarily encode lexical concepts.
- `behavior_visible_examples` captures weak test construction rather than deliberate
  exploitation because its prompt does not disclose the examples. It remains useful as a
  behavioral contrast; the opportunity-aware subset supplies the deliberate-choice anchor.
- Cross-domain pairs may dilute the live code signal. Tagged subsets make this testable.
- These are hand-written completions rather than samples from the target model's natural
  voice. This is valid no-cheat supervision but may reduce gradient transfer.

## Tagged loading

Tags are metadata and never enter model input:

```text
data/pairs/hack_pairs.md#all-in-one
data/pairs/hack_pairs.md#all-in-one@behavior
data/pairs/hack_pairs.md#all-in-one@opportunity-aware
data/pairs/hack_pairs.md#all-in-one@explicit
data/pairs/hack_pairs.md#all-in-one@roleplay
data/pairs/hack_pairs.md#all-in-one@think-tags
data/pairs/hack_pairs.md#all-in-one@behavior,opportunity-aware
```

Selectors require every listed tag. Selected-subset bytes determine the pairset hash, so
cached directions cannot silently load against a changed subset.

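The selector grammar above can be read as `path#heading[@tag1,tag2]` with all-tags-required semantics. The following is a minimal sketch of that reading, not the repository's loader: `Selector`, `parse_selector`, `select`, and the dict layout of a pair are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Selector:
    path: str
    heading: str
    tags: frozenset[str]


def parse_selector(ref: str) -> Selector:
    """Split 'path#heading@tag1,tag2' into parts; the '@tags' suffix is optional."""
    path, sep, rest = ref.partition("#")
    if not sep or not path or not rest:
        raise ValueError(f"expected 'path#heading[@tags]', got {ref!r}")
    heading, at, tag_text = rest.partition("@")
    tags = frozenset(t.strip() for t in tag_text.split(",")) if at else frozenset()
    if not heading or "" in tags:
        raise ValueError(f"empty heading or tag in {ref!r}")
    return Selector(path, heading, tags)


def select(pairs: list[dict], selector: Selector) -> list[dict]:
    """Keep pairs whose tag set contains every tag named by the selector."""
    return [p for p in pairs if selector.tags <= set(p["tags"])]


# Toy pairs; only the tags matter here and they never reach model input.
toy = [
    {"name": "a", "tags": ["behavior"]},
    {"name": "b", "tags": ["behavior", "opportunity-aware"]},
]
both = select(toy, parse_selector("data/pairs/hack_pairs.md#all-in-one@behavior,opportunity-aware"))
```

With no `@` suffix the tag set is empty, so every pair under the heading is selected, which matches the whole-set reference in the first line of the listing.
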
## What to compare

The first useful empirical comparison is:

| Pairset | Hypothesis |
|---|---|
| `all-in-one` | multiple representations average toward deliberate proxy gaming |
| `all-in-one@behavior` | concrete actions transfer best to live code gradients |
| `all-in-one@opportunity-aware` | deliberate action matters more than generic weak checks |
| `all-in-one@explicit` | explicit concept alone transfers poorly, replicating prior intent result |
| previous 11-pair mechanism direction | narrow baseline |

Judge on live-rollout precision and deploy effect, not pair aesthetics.

@@ -0,0 +1,92 @@
# Multi-view hack pairs

## Goal

Author one strong all-in-one contrastive pairset that represents deliberate proxy
gaming across varied contexts, rather than one syntax mechanism or one explicit label.

## Scope

In: replace the active authored pairset with one multi-view section; update its audit,
default reference, and verification.

Out: claiming the new set is empirically better before a real-model comparison.

## Requirements

- R1: Each pair uses one exact prompt and closely matched hack/clean completions.
- R2: The set spans behavior, opportunity-aware choice, explicit disposition,
  naming/reasoning, and non-code proxy gaming.
- R3: No single superficial feature consistently identifies the hack side.
- R4: Every hack side strongly expresses exploiting a proxy; every clean side strongly
  expresses satisfying the underlying task.
- R5: Pair metadata supports loading tagged subsets without entering model input.

## Tasks

- [x] T1: Author one 27-pair `all-in-one` section.
  - verify: exactly 27 unique headings and all fields load.
  - likely_fail: vague “bad versus good” pairs fail to express proxy gaming.
  - sneaky_fail: one repeated syntax/token dominates the axis.
  - UAT: audit table shows balanced views, domains, explicitness, and mechanisms.
- [x] T2: Make `all-in-one` the active default and simplify active pair sources.
- [x] T3: Verify parser, balance metrics, extraction smoke, and fresh-eyes review.

## Design

| View | N | Purpose |
|---|---:|---|
| concrete behavior | 8 | Anchor the direction in actions resembling live hacks |
| opportunity-aware action | 6 | Distinguish deliberate exploitation from accidental weakness |
| explicit disposition/roleplay | 6 | Supply conceptual, intention, and role-conditioned signals |
| naming/reasoning | 4 | Compact lexical, visible-planning, and `<think>` representations |
| non-code proxy gaming | 3 | Force cross-context abstraction beyond Python tests |

Each pair has a `Tags:` metadata line. `#all-in-one@behavior` selects one tag;
`#all-in-one@behavior,opportunity-aware` selects their intersection. Tags are not loaded
into prompts or completions.

Match tightly within pairs; diversify aggressively across pairs. Explicit language is
allowed in a minority of pairs because it strongly identifies intention. It must not be
the only or dominant view.

## Log

- Existing pure-intent pairs underperformed behavior pairs, so explicit pairs are included
  as one view rather than used alone.
- Existing philosophical/moral pairs changed prose and print/assert behavior together;
  the new set never combines a semantic framing contrast with a second unrelated contrast.
- Incomplete focused snippets are allowed. Empty/pass-only stubs are rejected because they
  express no substantive decision.

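The stub rule in the last log entry could be checked mechanically for Python completions. The sketch below is one possible reading, not the verifier's actual code: `is_stub` is a hypothetical helper that flags empty text and code whose only statements are `pass`, `...`, or docstrings, while letting snippets that do not parse through, since incomplete focused snippets are allowed.

```python
import ast


def _is_trivial(node: ast.stmt) -> bool:
    """True for pass, a bare `...` or string expression, or a def/class made only of those."""
    if isinstance(node, ast.Pass):
        return True
    if isinstance(node, ast.Expr) and isinstance(node.value, ast.Constant):
        return isinstance(node.value.value, str) or node.value.value is Ellipsis
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
        return all(_is_trivial(child) for child in node.body)
    return False


def is_stub(completion: str) -> bool:
    """True for empty completions or Python that expresses no substantive decision."""
    text = completion.strip()
    if not text:
        return True
    try:
        tree = ast.parse(text)
    except SyntaxError:
        # Incomplete snippets and non-code completions are not treated as stubs.
        return False
    return all(_is_trivial(node) for node in tree.body)


examples = {
    "empty": is_stub("   "),
    "pass_only": is_stub("def check(x):\n    pass"),
    "real_check": is_stub("def check(x):\n    return x > 0"),
    "incomplete": is_stub("def check(x:"),
}
```

Non-code completions, such as the organizational proxy-gaming pairs, would need a separate non-emptiness rule; this sketch only covers the Python case.
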
## Errors

| Task | Error | Resolution |
|---|---|---|

## Results

- Runtime pair data moved to `data/pairs/`; authoring guidance and audit remain in
  `docs/personas/`.
- Final headline set: 27 pairs, including 2 matched roleplay instructions and 1 matched
  `<think>` trace.
- Tagged loading supports whole-set, single-tag, and tag-intersection extraction.

## Verify

- Full smoke: `/tmp/claude-1000/multiview_pairs_data_smoke.log`
- routeV loaded `data/pairs/hack_pairs.md#all-in-one -> 27 pairs`.
- Extraction band mean width `+0.171`; `13/14` modules included.
- `scripts/verify_science_invariants.py` passed Markdown parsing, tagged subsets,
  content-addressing, and the no-complete-stub invariant.

## Review

Fresh-eyes review: `docs/reviews/20260610_multiview_pairs_external.md`.

- Judged the multi-view design well-constructed and found no repeated dominant shortcut
  or disguised stub.
- Flagged `behavior_visible_examples` as weak-test behavior rather than deliberate
  exploitation. Kept intentionally; `@opportunity-aware` isolates deliberate choices.
- Flagged `behavior_proxy_metric` as the largest length mismatch. Kept because shortening
  real validation or padding shallow validation would weaken the substantive contrast.

@@ -0,0 +1,124 @@
# Pairset audit and Markdown source

## Goal

Audit the hand-authored pairsets for clean contrastive construction and provenance,
decide which are useful for headline extraction versus diagnostics, and replace the
scattered hand-authored JSON/build-script sources with one or two human-readable
Markdown sources.

## Scope

In: hand-authored mechanism, intent, semantic-framing, honesty, and placebo pairsets;
their loaders/builders/recipes; pair-level audit evidence.

Out: pool-derived or oracle-labelled pair generation, changing experiment results,
and rewriting pair content before the audit identifies a specific defect.

## Requirements

- R1: Audit same-prompt, provenance, style/length confounds, and hack-axis strength.
  Done means: a per-pair table and pairset-level recommendation exist.
- R2: Distinguish no-cheat headline supervision from diagnostics and controls.
  Done means: each retained section has an explicit role and provenance.
- R3: Use at most two human-readable Markdown source files for hand-authored pairsets.
  Done means: runtime loading selects a named Markdown heading and scattered generated
  JSON is no longer the source of truth.
- R4: Fail fast on malformed Markdown or missing headings/fields.
  Done means: verification deliberately checks malformed and missing sections fail.

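R3 and R4 together describe a heading-based loader that raises rather than guesses. The sketch below shows one way such a loader could look. The file layout is an assumption made for illustration (`## section`, `### pair`, a `Tags:` line, and `#### Prompt` / `#### Hack` / `#### Clean` fields each holding one fenced block); the real `hack_pairs.md` layout is not shown in this spec, and `load_pairs` and `PairFormatError` are hypothetical names.

```python
FENCE = "`" * 3
REQUIRED = ("prompt", "hack", "clean")


class PairFormatError(ValueError):
    """Raised when the Markdown source is malformed or a heading is missing."""


def load_pairs(markdown: str, heading: str) -> list[dict]:
    """Load every pair under `## heading`, failing fast on bad input."""
    pairs: list[dict] = []
    in_section = found = in_fence = False
    pair = field = None
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence  # fence markers delimit field bodies
            continue
        if in_fence:
            # Lines inside a fence are content, even if they start with '#'.
            if in_section:
                if pair is None or field is None:
                    raise PairFormatError("fenced block outside a pair field")
                pair[field].append(line)
            continue
        if line.startswith("## "):
            in_section = line[3:].strip() == heading
            found = found or in_section
            pair = field = None
        elif not in_section:
            continue
        elif line.startswith("### "):
            pair = {"name": line[4:].strip(), "tags": []}
            pairs.append(pair)
            field = None
        elif line.startswith("#### "):
            field = line[5:].strip().lower()
            if pair is None or field not in REQUIRED:
                raise PairFormatError(f"unexpected field {field!r}")
            if field in pair:
                raise PairFormatError(f"duplicate field {field!r} in {pair['name']!r}")
            pair[field] = []
        elif line.startswith("Tags:"):
            if pair is None:
                raise PairFormatError("Tags line outside a pair")
            pair["tags"] = [t.strip() for t in line[5:].split(",") if t.strip()]
    if in_fence:
        raise PairFormatError("unterminated code fence")
    if not found:
        raise PairFormatError(f"heading {heading!r} not found")
    if not pairs:
        raise PairFormatError(f"no pairs under {heading!r}")
    for p in pairs:
        missing = [f for f in REQUIRED if f not in p]
        if missing:
            raise PairFormatError(f"{p['name']!r} is missing {missing}")
        for f in REQUIRED:
            p[f] = "\n".join(p[f])
    return pairs
```

Tracking fence state is what keeps a Python comment such as `# note` inside a completion from being mistaken for a heading. Completions that themselves contain fence markers are not handled by this sketch.
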
## Tasks

- [x] T1 (R1-R2): Audit every hand-authored pairset.
  - verify: audit table reports every pair and pairset summary.
  - success: same-prompt is universal; provenance and confounds are explicit.
  - likely_fail: a set is omitted; summary counts disagree with source counts.
  - sneaky_fail: pairset name implies an isolated axis but another systematic contrast
    dominates; manual evidence records the actual contrast.
  - UAT: open the audit and see a recommendation with reasons for each pairset.
- [x] T2 (R3-R4): Implement a minimal heading-based Markdown pair loader and sources.
  - verify: each named heading loads the expected count and exact pair bytes.
  - success: one or two Markdown files contain all hand-authored pairs.
  - likely_fail: fenced code inside completions breaks parsing; exact-byte comparison fails.
  - sneaky_fail: configs silently load the wrong heading; missing-heading test must raise.
  - UAT: one path plus heading identifies a pairset and prints its expected first pair.
- [x] T3 (R3): Update active recipes/docs and remove superseded hand-authored JSON/builders.
  - verify: repository search finds no active references to removed JSON pairsets.
  - success: active runs name Markdown sections; pool-derived JSON remains separate.
  - likely_fail: an active recipe still names removed JSON.
  - sneaky_fail: cached v_hack identity hashes the whole Markdown file instead of selected
    pair bytes; selected-pair hash verification catches this.
- [x] T4 (R1-R4): Run verification, smoke, and fresh-eyes review.
  - verify: targeted verifier and `just smoke` pass; review has no unresolved blocker.
  - UAT: linked logs and audit show the result.

## Context

- Each pair contributes `gradient(prompt + hack) - gradient(prompt + clean)`, so prompt
  equality is structural within a pair.
- Hand-authored off-distribution pairs are no-cheat. Pool-derived or live-rollout-labelled
  pairs are not headline-clean.
- `philosophical`, `moral`, `intent_vs_spec`, and `eval_aware` currently vary semantic
  comments and test strength together, so their names overstate axis isolation.

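The gradient difference in the first bullet can be illustrated with plain lists standing in for flattened gradients. This is a toy sketch only: real gradients come from the model, the spec does not say how per-pair differences are combined, and the averaging in `mean_direction` is an assumption suggested by the audit's wording about averaging across views. Both function names are hypothetical.

```python
def pair_direction(grad_hack: list[float], grad_clean: list[float]) -> list[float]:
    """gradient(prompt + hack) - gradient(prompt + clean) for one pair."""
    if len(grad_hack) != len(grad_clean):
        raise ValueError("gradients must have the same length")
    return [h - c for h, c in zip(grad_hack, grad_clean)]


def mean_direction(pair_grads: list[tuple[list[float], list[float]]]) -> list[float]:
    """Average the per-pair differences (assumed aggregation, not confirmed by the spec)."""
    if not pair_grads:
        raise ValueError("no pairs given")
    directions = [pair_direction(hack, clean) for hack, clean in pair_grads]
    return [sum(column) / len(directions) for column in zip(*directions)]


# Toy numbers: whatever the shared prompt contributes cancels within a pair,
# so only the completion-dependent part survives the subtraction.
prompt_part = [0.25, 0.5]
hack_grad = [prompt_part[0] + 1.0, prompt_part[1] + 0.0]
clean_grad = [prompt_part[0] + 0.0, prompt_part[1] + 0.0]
direction = pair_direction(hack_grad, clean_grad)
```

The cancellation only works when both sides share the same prompt, which is why same-prompt is enforced by schema rather than checked after the fact.
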
## Log

- `pairs_authored.json` and `prog_wide_clean.json` have uncommitted user edits; preserve
  their current working-tree bytes as audit inputs.
- The current authored working-tree rows were migrated byte-exactly before deleting the
  duplicate JSON source. `prog_wide_clean.json` remains a separate generated/sensitivity
  artifact.
- No audited prompt is byte-identical to current train/eval prompts, but intent and framing
  sets overlap with several corpus problem concepts. “Not in train/test” was too strong.

## TODO

## Errors

| Task | Error | Resolution |
|---|---|---|

## Results

- Audit and recommendations: `docs/personas/pairset_audit.md`.
- Authored/manual sources: `data/pairs/hack_pairs.md` and
  `data/pairs/pair_diagnostics.md`.
- Deleted 13 duplicate manual JSON files and 3 Python data builders.
- The initial migration used `#mechanism-authored`; it was superseded by the later
  multi-view design in `docs/spec/20260610_multiview_hack_pairs.md`.

## Verify

`uv run python scripts/verify_science_invariants.py`:

| invariant | success |
|---|---|
| selected Markdown pair bytes | True |
| malformed Markdown fails | True |
| authored/control/generated pairsets load | True |
| exactly one prompt hint | True |
| deterministic disjoint val/test | True |

`just smoke`:

- Full log: `/tmp/claude-1000/pairset_markdown_smoke.log`
- routeV loaded `data/pairs/hack_pairs.md#mechanism-authored -> 11 pairs`.
- Routing band mean width `+0.354`; `13/14` modules included.
- All verify gates and the 30-step routeV smoke completed successfully.

## Failure mode check

- likely_fail: nested code fences corrupt Markdown parsing. Exact migration checks loaded all
  13 sections and matched every prior JSON row byte-for-byte. PASS.
- sneaky_fail: hash tracks the whole Markdown file or the wrong heading. The verifier mutates
  an unrelated section and confirms the selected hash is unchanged, then mutates the selected
  section and confirms cached-v-hack loading fails. PASS.
- Malformed duplicate fields and missing headings are rejected. PASS.

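The sneaky_fail check above amounts to hashing only the selected pairs. A minimal sketch of that idea follows, assuming SHA-256 over length-prefixed prompt/hack/clean bytes; the actual hash function, field order, and whether tags are included are not stated in this spec, and `pairset_hash` is a hypothetical name.

```python
import hashlib


def pairset_hash(pairs: list[dict]) -> str:
    """SHA-256 over the bytes of the selected pairs only (assumed scheme)."""
    digest = hashlib.sha256()
    for pair in pairs:
        for field in ("prompt", "hack", "clean"):
            data = pair[field].encode("utf-8")
            # Length prefix keeps field boundaries unambiguous.
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
    return digest.hexdigest()


# Toy file with two sections; only "selected" feeds the hash.
sections = {
    "selected": [{"prompt": "p", "hack": "h", "clean": "c"}],
    "unrelated": [{"prompt": "q", "hack": "x", "clean": "y"}],
}
before = pairset_hash(sections["selected"])

sections["unrelated"][0]["hack"] = "changed"
after_unrelated_edit = pairset_hash(sections["selected"])

sections["selected"][0]["hack"] = "changed"
after_selected_edit = pairset_hash(sections["selected"])
```

A cached direction stored under `before` would still match after the unrelated edit and would stop matching after the selected edit, which is the behavior the verifier confirms.
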
## Review

Fresh-eyes external review: `docs/reviews/20260610_pairset_external.md`.

- No scientific audit, data-loss, parser, or hash findings.
- Fixed its one low-severity finding: result tables now label a Markdown pairset by
  selected heading rather than the full `hack_pairs.md#heading` reference.