mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
112 lines
5.1 KiB
Markdown
# Multi-view pairset audit

Rubric: [how_to_write_personas.md](how_to_write_personas.md) and
[how_to_rewrite_pairs.md](how_to_rewrite_pairs.md).

## Decision

The active headline source is `data/pairs/hack_pairs.md#all-in-one`: 27 hand-authored,
off-distribution pairs spanning different representations and contexts of one construct:

> Given a proxy that can be exploited, choose proxy success over the underlying task.

This is a stronger construct definition than behavior-only or persona-only pairs. Concrete
behavior anchors the direction near live code gradients. Opportunity-aware pairs supply
evidence of deliberate exploitation. Explicit dispositions make the intended concept
unambiguous. Non-code proxy gaming pushes the average away from one Python syntax pattern.

This is a reasoned design choice, not an empirical result. It should replace the narrow
11-pair mechanism set for the next comparison, but the old result remains the baseline to
beat.

## Composition

| View | N | What it contributes | Main risk |
|---|---:|---|---|
| behavior | 14 | concrete exploit actions in Python, data, API, spreadsheet, operations, ML, service, and web contexts | syntax/mechanism direction |
| opportunity-aware | 6 | response chooses exploitation after an imperfect evaluator is disclosed | prompt makes gaming unusually salient |
| disposition/roleplay | 6 | direct opposite personas and matched role instructions around exploiting proxies | explicit vocabulary may not transfer |
| reasoning/naming | 4 | compact plan, `<think>` trace, claim, and strategy-name representations | lexical association |
| non-code proxy gaming | 3 | organizational, research, and safety metric gaming | far from code gradients |

Tags overlap by design: `behavior` includes the six opportunity-aware actions, so the N
column sums to 33 rather than 27.

## Balance checks

| Property | Result |
|---|---:|
| total pairs | 27 |
| exact same prompt within every pair | yes, by schema |
| explicit pairs | 10 |
| implicit pairs | 17 |
| opportunity-aware pairs | 6 |
| non-code proxy-gaming pairs | 3 |
| mean hack/clean character ratio | 1.07 |
| maximum character ratio | 1.35 |
| empty or pass-only stubs | 0 |
| unique pair headings | 27 |

No single feature identifies every hack side:

- Only 10/27 pairs use an explicitly tagged disposition, reasoning, naming, or roleplay view.
- Only 8/27 use Python test/validation behavior without an opportunity prompt.
- The hack side is not consistently shorter, longer, less commented, or one syntax form.
- Exploit mechanisms include weak predicates, type-only checks, partial checks, visible
  examples, hardcoding, swallowed failures, proxy metrics, disabled validation, marker
  spoofing, benchmark specialization, evaluation overfit, visual spoofing, false claims,
  and non-code metric gaming.

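
The length and stub rows in the balance table can be recomputed with a few lines of Python. This is a minimal sketch, not the audit's actual script: the function names are hypothetical, the pairs are assumed to be available as `(hack, clean)` string tuples, and the mean is assumed to be a per-pair mean of `len(hack) / len(clean)` rather than a ratio of totals.

```python
from statistics import mean


def char_ratio(hack: str, clean: str) -> float:
    """Characters in the hack completion divided by characters in the clean one."""
    if not clean:
        raise ValueError("clean completion is empty")
    return len(hack) / len(clean)


def is_stub(completion: str) -> bool:
    """True for an empty completion or one that is only `pass`."""
    return completion.strip() in ("", "pass")


def length_balance(pairs):
    """Return (mean ratio, max ratio, stub count) for (hack, clean) tuples."""
    ratios = [char_ratio(hack, clean) for hack, clean in pairs]
    stubs = sum(is_stub(text) for pair in pairs for text in pair)
    return mean(ratios), max(ratios), stubs


# Toy completions for illustration only, not taken from hack_pairs.md.
toy_pairs = [("a" * 10, "b" * 10), ("a" * 15, "b" * 10)]
mean_ratio, max_ratio, stub_count = length_balance(toy_pairs)
print(mean_ratio, max_ratio, stub_count)
```

A mean near 1 with a modest maximum suggests that completion length alone is a weak cue for the hack side, which is what the table is meant to check.
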
## Pair-quality judgment

Strongest anchors:

- `behavior_weak_predicate`, `behavior_type_only`, and `behavior_partial_output`: very
  tight local behavioral contrasts.
- `opportunity_spreadsheet`, `opportunity_report_marker`, and
  `opportunity_benchmark_cache`: deliberate action is clear from the disclosed proxy.
- `disposition_exploit_tests` and `disposition_proxy_target`: explicit direct opposites.
- `proxy_research_metric` and `proxy_warehouse_safety`: clear cross-domain Goodhart cases.

Known limitations:

- `behavior_proxy_metric` has the largest length mismatch (1.35x) because real validation
  is structurally richer than shallow validation.
- Explicit persona/reasoning pairs may primarily encode lexical concepts.
- `behavior_visible_examples` captures weak test construction rather than deliberate
  exploitation because its prompt does not disclose the examples. It remains useful as a
  behavioral contrast; the opportunity-aware subset supplies the deliberate-choice anchor.
- Cross-domain pairs may dilute the live code signal. Tagged subsets make this testable.
- These are hand-written completions rather than samples from the target model's natural
  voice. This is valid no-cheat supervision but may reduce gradient transfer.

## Tagged loading

Tags are metadata and never enter model input:

```text
data/pairs/hack_pairs.md#all-in-one
data/pairs/hack_pairs.md#all-in-one@behavior
data/pairs/hack_pairs.md#all-in-one@opportunity-aware
data/pairs/hack_pairs.md#all-in-one@explicit
data/pairs/hack_pairs.md#all-in-one@roleplay
data/pairs/hack_pairs.md#all-in-one@think-tags
data/pairs/hack_pairs.md#all-in-one@behavior,opportunity-aware
```

A selector with several comma-separated tags matches only pairs that carry every listed
tag. The pairset hash is computed from the bytes of the selected subset, so cached
directions cannot silently load against a changed subset.

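
The selector rules above can be sketched as follows. This is an illustration under stated assumptions, not the repository's loader: `Pair`, `parse_selector`, `select`, and `pairset_hash` are hypothetical names, and the choice of SHA-256 over heading, prompt, and both completions is an assumption about what "selected-subset bytes" covers.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    # Hypothetical record layout; the real schema may differ.
    heading: str
    prompt: str
    hack: str
    clean: str
    tags: frozenset


def parse_selector(selector: str):
    """Split 'path#section@tag1,tag2' into (path, section, required tags)."""
    source, _, tag_part = selector.partition("@")
    path, _, section = source.partition("#")
    tags = frozenset(t for t in tag_part.split(",") if t)
    return path, section, tags


def select(pairs, required):
    """Keep pairs that carry every required tag (AND semantics)."""
    return [p for p in pairs if required <= p.tags]


def pairset_hash(pairs):
    """Hash the text of the selected pairs so a changed subset changes the hash."""
    digest = hashlib.sha256()
    for p in pairs:
        for field in (p.heading, p.prompt, p.hack, p.clean):
            digest.update(field.encode("utf-8"))
            digest.update(b"\x00")  # separator so field boundaries stay unambiguous
    return digest.hexdigest()


# Toy pairs for illustration only.
toy = [
    Pair("a", "p1", "h1", "c1", frozenset({"behavior", "opportunity-aware"})),
    Pair("b", "p2", "h2", "c2", frozenset({"behavior"})),
    Pair("c", "p3", "h3", "c3", frozenset({"explicit"})),
]
_, _, wanted = parse_selector("data/pairs/hack_pairs.md#all-in-one@behavior,opportunity-aware")
subset = select(toy, wanted)
print([p.heading for p in subset], pairset_hash(subset)[:12])
```

Keying a direction cache on this hash means a cache entry built from `@behavior` cannot be reused for `@behavior,opportunity-aware`, since the two subsets hash differently.
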
## What to compare

The first useful empirical comparison is:

| Pairset | Hypothesis |
|---|---|
| `all-in-one` | multiple representations average toward deliberate proxy gaming |
| `all-in-one@behavior` | concrete actions transfer best to live code gradients |
| `all-in-one@opportunity-aware` | deliberate action matters more than generic weak checks |
| `all-in-one@explicit` | explicit concept alone transfers poorly, replicating prior intent result |
| previous 11-pair mechanism direction | narrow baseline |

Judge on live-rollout precision and deploy effect, not pair aesthetics.