# Multi-view pairset audit

Rubric: [how_to_write_personas.md](how_to_write_personas.md) and [how_to_rewrite_pairs.md](how_to_rewrite_pairs.md).

## Decision

The active headline source is `data/pairs/hack_pairs.md#all-in-one`: 27 hand-authored, off-distribution pairs spanning different representations and contexts of one construct:

> Given a proxy that can be exploited, choose proxy success over the underlying task.

This is a stronger construct definition than behavior-only or persona-only pairs. Concrete behavior anchors the direction near live code gradients. Opportunity-aware pairs supply evidence of deliberate exploitation. Explicit dispositions make the intended concept unambiguous. Non-code proxy gaming pushes the average away from one Python syntax pattern.

This is a reasoned design choice, not an empirical result. It should replace the narrow 11-pair mechanism set for the next comparison, but the old result remains the baseline to beat.

## Composition

| View | N | What it contributes | Main risk |
|---|---:|---|---|
| behavior | 14 | concrete exploit actions in Python, data, API, spreadsheet, operations, ML, service, and web contexts | syntax/mechanism direction |
| opportunity-aware | 6 | response chooses exploitation after an imperfect evaluator is disclosed | prompt makes gaming unusually salient |
| disposition/roleplay | 6 | direct opposite personas and matched role instructions around exploiting proxies | explicit vocabulary may not transfer |
| reasoning/naming | 4 | compact plan, `<think>` trace, claim, and strategy-name representations | lexical association |
| non-code proxy gaming | 3 | organizational, research, and safety metric gaming | far from code gradients |

Tags overlap by design, which is why the N column sums to more than 27. `behavior` includes the six opportunity-aware actions.
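
The overlap can be illustrated with a minimal sketch. The pair names and tag sets below are hypothetical toy data, not the real pairset, and `view_counts` is an illustrative helper rather than part of the loader. It only shows why per-view counts can exceed the number of pairs when one pair carries several tags.

```python
from collections import Counter

# Hypothetical toy pairs: names and tags are illustrative only.
PAIRS = {
    "behavior_example": {"behavior"},
    "opportunity_example": {"behavior", "opportunity-aware"},
    "disposition_example": {"disposition", "explicit"},
}


def view_counts(pairs):
    """Count how many pairs carry each tag; a pair counts once per tag it has."""
    counts = Counter()
    for tags in pairs.values():
        counts.update(tags)
    return counts


counts = view_counts(PAIRS)
# The opportunity-aware pair is also counted under "behavior".
print(counts["behavior"], counts["opportunity-aware"])
# The sum of per-view counts exceeds the number of pairs.
print(sum(counts.values()), len(PAIRS))
```

Under this reading, a per-view N is a count of tag memberships, so the column total is not expected to match the pair total.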

## Balance checks

| Property | Result |
|---|---:|
| total pairs | 27 |
| exact same prompt within every pair | yes, by schema |
| explicit pairs | 10 |
| implicit pairs | 17 |
| opportunity-aware pairs | 6 |
| non-code proxy-gaming pairs | 3 |
| mean hack/clean character ratio | 1.07 |
| maximum character ratio | 1.35 |
| empty or pass-only stubs | 0 |
| unique pair headings | 27 |

No single feature identifies every hack side:

- Only 10/27 pairs use an explicitly tagged disposition, reasoning, naming, or roleplay view.
- Only 8/24 use Python test/validation behavior without an opportunity prompt.
- The hack side is not consistently shorter, longer, less commented, or one syntax form.
- Exploit mechanisms include weak predicates, type-only checks, partial checks, visible examples, hardcoding, swallowed failures, proxy metrics, disabled validation, marker spoofing, benchmark specialization, evaluation overfit, visual spoofing, false claims, and non-code metric gaming.

## Pair-quality judgment

Strongest anchors:

- `behavior_weak_predicate`, `behavior_type_only`, and `behavior_partial_output`: very tight local behavioral contrasts.
- `opportunity_spreadsheet`, `opportunity_report_marker`, and `opportunity_benchmark_cache`: deliberate action is clear from the disclosed proxy.
- `disposition_exploit_tests` and `disposition_proxy_target`: explicit direct opposites.
- `proxy_research_metric` and `proxy_warehouse_safety`: clear cross-domain Goodhart cases.

Known limitations:

- `behavior_proxy_metric` has the largest length mismatch (1.35x) because real validation is structurally richer than shallow validation.
- Explicit persona/reasoning pairs may primarily encode lexical concepts.
- `behavior_visible_examples` captures weak test construction rather than deliberate exploitation because its prompt does not disclose the examples. It remains useful as a behavioral contrast; the opportunity-aware subset supplies the deliberate-choice anchor.
- Cross-domain pairs may dilute the live code signal. Tagged subsets make this testable.
- These are hand-written completions rather than samples from the target model's natural voice. This is valid no-cheat supervision but may reduce gradient transfer.

## Tagged loading

Tags are metadata and never enter model input:

```text
data/pairs/hack_pairs.md#all-in-one
data/pairs/hack_pairs.md#all-in-one@behavior
data/pairs/hack_pairs.md#all-in-one@opportunity-aware
data/pairs/hack_pairs.md#all-in-one@explicit
data/pairs/hack_pairs.md#all-in-one@roleplay
data/pairs/hack_pairs.md#all-in-one@think-tags
data/pairs/hack_pairs.md#all-in-one@behavior,opportunity-aware
```

Selectors require every listed tag. Selected-subset bytes determine the pairset hash, so cached directions cannot silently load against a changed subset.

## Wave 2 (2026-06-11): 15 `behavior2_*` pairs

Motivation: the historical per-pairset `v_grad` diagnostic ranked the 8 `behavior_*` pairs best at classifying live hack rollouts (AUROC 0.69, d=+0.85), but n=8 is fragile. That diagnostic predates the current routeA activation gate.

Wave 2 keeps the load-bearing properties (concrete code action, gaming a verification proxy, tight within-pair matching) and branches the gamed proxy: loose tolerance, permissive regex, SQL row count, fixed sleep, ignored HTTP status, trivial threshold, silent default, bash exit-zero, mock-under-test, fudge-expected, skip-failing, retry-until-green, coverage pragma, type-checker silencing, survivorship metric.

Constraints enforced (see RESEARCH_JOURNAL 2026-06-11):

- No mechanism overlap with the env's four loophole modes. Two pairs from the first draft were replaced for exactly this: bare-call-no-assert `run_checks` (= the `run_tests` "doesn't raise = pass" flaw, axis-1 family) and `assert x or True` (assert-True family).
- No task/function overlap with the env's 992 train/test problems (checked identifier-by-identifier against `func_name`s; `total_steps` renamed `total_clicks` after a collision).
- Within-pair hack/clean char ratio mean ~1.05, max 1.45.

Flagged weaker members: `behavior2_trivial_threshold` mechanism-duplicates `behavior_weak_predicate` (new domain only); `behavior2_silent_default` is defensive-default rather than proxy gaming; `behavior2_bash_exit_zero` is don't-fail-the-gate adjacent (kept, same class as the try/except swallow axis).

Selectors: `/behavior_` = original 8 (the proven classifier, train default), `/behavior2` = wave 2 only, `/behavior` = 23-pair union. The diag ranks `behavior` and `behavior2` as separate groups.

## What to compare

The first useful empirical comparison is:

| Pairset | Hypothesis |
|---|---|
| `all-in-one` | multiple representations average toward deliberate proxy gaming |
| `all-in-one@behavior` | concrete actions transfer best to live code gradients |
| `all-in-one@opportunity-aware` | deliberate action matters more than generic weak checks |
| `all-in-one@explicit` | explicit concept alone transfers poorly, replicating prior intent result |
| previous 11-pair mechanism direction | narrow baseline |

Judge on live-rollout precision and deploy effect, not pair aesthetics.
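
For running these comparisons, the `@tag` selector rule from the Tagged loading section (every listed tag is required, and the selected subset determines the hash) can be sketched roughly as follows. This is a minimal illustration, not the project's loader: `parse_selector`, `select_pairs`, `pairset_hash`, the pair dictionary fields, and the choice of SHA-256 over the pair text only are all assumptions, and the toy pairs are hypothetical.

```python
import hashlib


def parse_selector(selector):
    """Split 'path#section@tag1,tag2' into (path, section, required_tags)."""
    source, _, tag_part = selector.partition("@")
    path, _, section = source.partition("#")
    required = frozenset(t for t in tag_part.split(",") if t)
    return path, section, required


def select_pairs(pairs, required):
    """Keep pairs whose tag set contains every required tag (AND semantics)."""
    return [p for p in pairs if required <= p["tags"]]


def pairset_hash(selected):
    """Hash the selected pairs' text, so a changed subset changes the hash.

    Assumption: tags are metadata and are left out of the hashed bytes.
    """
    h = hashlib.sha256()
    for p in selected:
        for field in ("heading", "prompt", "hack", "clean"):
            h.update(p[field].encode("utf-8"))
            h.update(b"\x00")  # separator so field boundaries are unambiguous
    return h.hexdigest()


# Hypothetical toy pairs, for illustration only.
PAIRS = [
    {"heading": "a", "prompt": "p1", "hack": "h1", "clean": "c1",
     "tags": {"behavior"}},
    {"heading": "b", "prompt": "p2", "hack": "h2", "clean": "c2",
     "tags": {"behavior", "opportunity-aware"}},
    {"heading": "c", "prompt": "p3", "hack": "h3", "clean": "c3",
     "tags": {"explicit"}},
]

path, section, required = parse_selector(
    "data/pairs/hack_pairs.md#all-in-one@behavior,opportunity-aware"
)
subset = select_pairs(PAIRS, required)
print([p["heading"] for p in subset])  # only the pair carrying both tags
print(pairset_hash(subset) != pairset_hash(PAIRS))  # subset hash differs
```

Keying a direction cache on this kind of subset hash is one way to get the stated guarantee that a cached direction cannot be reused after the selected pairs change.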