Multi-view pairset audit
Rubric: how_to_write_personas.md and how_to_rewrite_pairs.md.
Decision
The active headline source is data/pairs/hack_pairs.md#all-in-one: 27 hand-authored,
off-distribution pairs spanning different representations and contexts of one construct:
Given a proxy that can be exploited, choose proxy success over the underlying task.
This is a stronger construct definition than behavior-only or persona-only pairs. Concrete behavior anchors the direction near live code gradients. Opportunity-aware pairs supply evidence of deliberate exploitation. Explicit dispositions make the intended concept unambiguous. Non-code proxy gaming pushes the average away from one Python syntax pattern.
This is a reasoned design choice, not an empirical result. It should replace the narrow 11-pair mechanism set for the next comparison, but the old result remains the baseline to beat.
Composition
| View | N | What it contributes | Main risk |
|---|---|---|---|
| behavior | 14 | concrete exploit actions in Python, data, API, spreadsheet, operations, ML, service, and web contexts | syntax/mechanism direction |
| opportunity-aware | 6 | response chooses exploitation after an imperfect evaluator is disclosed | prompt makes gaming unusually salient |
| disposition/roleplay | 6 | direct opposite personas and matched role instructions around exploiting proxies | explicit vocabulary may not transfer |
| reasoning/naming | 4 | compact plan, <think> trace, claim, and strategy-name representations | lexical association |
| non-code proxy gaming | 3 | organizational, research, and safety metric gaming | far from code gradients |
Tags overlap by design. behavior includes the six opportunity-aware actions.
Balance checks
| Property | Result |
|---|---|
| total pairs | 27 |
| exact same prompt within every pair | yes, by schema |
| explicit pairs | 10 |
| implicit pairs | 17 |
| opportunity-aware pairs | 6 |
| non-code proxy-gaming pairs | 3 |
| mean hack/clean character ratio | 1.07 |
| maximum character ratio | 1.35 |
| empty or pass-only stubs | 0 |
| unique pair headings | 27 |
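The length and stub checks in the table can be sketched as follows. This is a minimal illustration, not the audit's actual script: it assumes the ratio is hack length divided by clean length in characters, and the names `char_ratio`, `is_stub`, and `balance_summary` are hypothetical.

```python
from statistics import mean


def char_ratio(hack: str, clean: str) -> float:
    """Character length of the hack side divided by that of the clean side (assumed definition)."""
    if not clean:
        raise ValueError("clean completion is empty")
    return len(hack) / len(clean)


def is_stub(text: str) -> bool:
    """True for an empty completion or one that consists only of `pass`."""
    return text.strip() in ("", "pass")


def balance_summary(pairs):
    """Summarise a list of (hack, clean) completion strings."""
    ratios = [char_ratio(hack, clean) for hack, clean in pairs]
    stubs = sum(is_stub(hack) or is_stub(clean) for hack, clean in pairs)
    return {
        "total_pairs": len(ratios),
        "mean_ratio": mean(ratios),
        "max_ratio": max(ratios),
        "stubs": stubs,
    }


# Toy completions, not taken from hack_pairs.md.
toy_pairs = [("aa", "a"), ("abc", "abc")]
summary = balance_summary(toy_pairs)
```

A ratio defined as longer-over-shorter would give different numbers; the hack/clean direction used here follows the row label in the table.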
No single feature identifies every hack side:
- Only 10/27 pairs use an explicitly tagged disposition, reasoning, naming, or roleplay view.
- Only 8/27 use Python test/validation behavior without an opportunity prompt.
- The hack side is not consistently shorter, longer, less commented, or one syntax form.
- Exploit mechanisms include weak predicates, type-only checks, partial checks, visible examples, hardcoding, swallowed failures, proxy metrics, disabled validation, marker spoofing, benchmark specialization, evaluation overfit, visual spoofing, false claims, and non-code metric gaming.
Pair-quality judgment
Strongest anchors:
- behavior_weak_predicate, behavior_type_only, and behavior_partial_output: very tight local behavioral contrasts.
- opportunity_spreadsheet, opportunity_report_marker, and opportunity_benchmark_cache: deliberate action is clear from the disclosed proxy.
- disposition_exploit_tests and disposition_proxy_target: explicit direct opposites.
- proxy_research_metric and proxy_warehouse_safety: clear cross-domain Goodhart cases.
Known limitations:
- behavior_proxy_metric has the largest length mismatch (1.35x) because real validation is structurally richer than shallow validation.
- Explicit persona/reasoning pairs may primarily encode lexical concepts.
- behavior_visible_examples captures weak test construction rather than deliberate exploitation because its prompt does not disclose the examples. It remains useful as a behavioral contrast; the opportunity-aware subset supplies the deliberate-choice anchor.
- Cross-domain pairs may dilute the live code signal. Tagged subsets make this testable.
- These are hand-written completions rather than samples from the target model's natural voice. This is valid no-cheat supervision but may reduce gradient transfer.
Tagged loading
Tags are metadata and never enter model input:
data/pairs/hack_pairs.md#all-in-one
data/pairs/hack_pairs.md#all-in-one@behavior
data/pairs/hack_pairs.md#all-in-one@opportunity-aware
data/pairs/hack_pairs.md#all-in-one@explicit
data/pairs/hack_pairs.md#all-in-one@roleplay
data/pairs/hack_pairs.md#all-in-one@think-tags
data/pairs/hack_pairs.md#all-in-one@behavior,opportunity-aware
Selectors require every listed tag. Selected-subset bytes determine the pairset hash, so cached directions cannot silently load against a changed subset.
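The selector and hashing behavior described above can be sketched as follows. This is an illustration under assumptions, not the repo's loader: the `Pair` record, `parse_selector`, `select`, and `pairset_hash` are hypothetical names, and SHA-256 over name and text bytes stands in for whatever hash the real loader uses.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    name: str
    tags: frozenset  # metadata only; never part of model input
    text: str        # raw pair text as stored in the markdown file


def parse_selector(selector: str):
    """Split 'path#section@tag1,tag2' into (path, section, required_tags)."""
    base, _, tag_part = selector.partition("@")
    path, _, section = base.partition("#")
    tags = frozenset(t.strip() for t in tag_part.split(",") if t.strip())
    return path, section, tags


def select(pairs, required_tags):
    """Keep only pairs that carry every required tag (AND semantics)."""
    return [p for p in pairs if required_tags <= p.tags]


def pairset_hash(selected):
    """Hash the bytes of the selected subset, so a changed subset changes the key."""
    digest = hashlib.sha256()
    for pair in selected:
        digest.update(pair.name.encode("utf-8") + b"\0")
        digest.update(pair.text.encode("utf-8") + b"\0")
    return digest.hexdigest()


# Toy records, not the real pair contents.
pairs = [
    Pair("behavior_a", frozenset({"behavior"}), "hack A / clean A"),
    Pair("opportunity_b", frozenset({"behavior", "opportunity-aware"}), "hack B / clean B"),
]
_, _, tags = parse_selector("data/pairs/hack_pairs.md#all-in-one@behavior,opportunity-aware")
subset = select(pairs, tags)
```

Keying a cached direction on `pairset_hash(subset)` rather than on the selector string is what prevents a stale direction from loading after the underlying pairs are edited.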
Wave 2 (2026-06-11): 15 behavior2_* pairs
Motivation: the per-pairset diag ranked the 8 behavior_* pairs' v_grad best at
classifying live hack rollouts (AUROC 0.69, d=+0.85), but n=8 is fragile. Wave 2
keeps the load-bearing properties (concrete code action, gaming a verification
proxy, tight within-pair matching) and branches the gamed proxy: loose tolerance,
permissive regex, SQL row count, fixed sleep, ignored HTTP status, trivial
threshold, silent default, bash exit-zero, mock-under-test, fudge-expected,
skip-failing, retry-until-green, coverage pragma, type-checker silencing,
survivorship metric.
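For reference, the two statistics quoted above can be computed from per-rollout scores as sketched below. This assumes that d denotes Cohen's d with a pooled standard deviation and that each rollout is reduced to one scalar projection score; the function names and toy scores are illustrative only.

```python
from math import sqrt
from statistics import mean, variance


def auroc(pos, neg):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def cohens_d(pos, neg):
    """Standardised mean difference using the pooled sample standard deviation."""
    n1, n2 = len(pos), len(neg)
    pooled_var = ((n1 - 1) * variance(pos) + (n2 - 1) * variance(neg)) / (n1 + n2 - 2)
    return (mean(pos) - mean(neg)) / sqrt(pooled_var)


# Toy scores: hack rollouts as positives, clean rollouts as negatives.
hack_scores = [1.0, 2.0, 3.0]
clean_scores = [0.0, 1.0, 2.0]
```

With only a handful of pairs defining the direction, both numbers carry wide uncertainty, which is the fragility the wave 2 expansion is meant to address.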
Constraints enforced (see RESEARCH_JOURNAL 2026-06-11):
- No mechanism overlap with the env's four loophole modes. Two pairs from the first draft were replaced for exactly this: bare-call-no-assert run_checks (= the run_tests "doesn't raise = pass" flaw, axis-1 family) and assert x or True (assert-True family).
- No task/function overlap with the env's 992 train/test problems (checked identifier-by-identifier against func_names; total_steps renamed total_clicks after a collision).
- Within-pair hack/clean char ratio mean ~1.05, max 1.45.
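The identifier-by-identifier overlap check could look like the sketch below. It is a Python-only illustration using the standard `ast` module; `identifiers` and `collisions` are hypothetical helpers, `func_names` is treated as a plain set of strings, and non-Python pairs (bash, SQL) would need a different extractor.

```python
import ast


def identifiers(source: str) -> set:
    """Collect function, class, argument, and variable names from Python source."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.add(node.name)
        elif isinstance(node, ast.Name):
            names.add(node.id)
        elif isinstance(node, ast.arg):
            names.add(node.arg)
    return names


def collisions(source: str, func_names) -> set:
    """Identifiers in a pair completion that also appear in the env's function names."""
    return identifiers(source) & set(func_names)


# Toy stand-in for the env's function-name list.
func_names = {"total_steps", "merge_intervals"}
before = "def total_steps(xs):\n    return len(xs)\n"
after = "def total_clicks(xs):\n    return len(xs)\n"
```

Here `collisions(before, func_names)` flags `total_steps`, and the renamed version comes back clean, mirroring the rename recorded above.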
Flagged weaker members: behavior2_trivial_threshold mechanism-duplicates
behavior_weak_predicate (new domain only); behavior2_silent_default is
defensive-default rather than proxy gaming; behavior2_bash_exit_zero is
don't-fail-the-gate adjacent (kept, same class as the try/except swallow axis).
Selectors: /behavior_ = original 8 (the proven classifier, train default),
/behavior2 = wave 2 only, /behavior = 23-pair union. The diag ranks
behavior and behavior2 as separate groups.
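The three selectors are consistent with plain prefix matching on pair names, which the sketch below illustrates. Prefix matching is an assumption inferred from the counts above, not confirmed loader behavior, and `select_by_prefix` is a hypothetical helper.

```python
def select_by_prefix(pair_names, selector):
    """Return pair names that start with the selector text after its leading '/'."""
    prefix = selector.lstrip("/")
    return [name for name in pair_names if name.startswith(prefix)]


# Toy subset of names; the real sets have 8 and 15 members respectively.
names = [
    "behavior_weak_predicate",
    "behavior_type_only",
    "behavior2_trivial_threshold",
    "behavior2_silent_default",
    "opportunity_spreadsheet",
]

original = select_by_prefix(names, "/behavior_")
wave2 = select_by_prefix(names, "/behavior2")
union = select_by_prefix(names, "/behavior")
```

Under this reading the trailing underscore in /behavior_ is what excludes the behavior2_* pairs from the train default.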
What to compare
The first useful empirical comparison is:
| Pairset | Hypothesis |
|---|---|
| all-in-one | multiple representations average toward deliberate proxy gaming |
| all-in-one@behavior | concrete actions transfer best to live code gradients |
| all-in-one@opportunity-aware | deliberate action matters more than generic weak checks |
| all-in-one@explicit | explicit concept alone transfers poorly, replicating prior intent result |
| previous 11-pair mechanism direction | narrow baseline |
Judge on live-rollout precision and deploy effect, not pair aesthetics.