evil_MoE/docs/personas/pairset_audit.md
wassname 270c4f5a27 misc
2026-06-11 11:07:28 +00:00

Multi-view pairset audit

Rubric: how_to_write_personas.md and how_to_rewrite_pairs.md.

Decision

The active headline source is data/pairs/hack_pairs.md#all-in-one: 27 hand-authored, off-distribution pairs spanning different representations and contexts of one construct:

Given a proxy that can be exploited, choose proxy success over the underlying task.

This is a stronger construct definition than behavior-only or persona-only pairs. Concrete behavior anchors the direction near live code gradients. Opportunity-aware pairs supply evidence of deliberate exploitation. Explicit dispositions make the intended concept unambiguous. Non-code proxy gaming pushes the average away from one Python syntax pattern.

This is a reasoned design choice, not an empirical result. It should replace the narrow 11-pair mechanism set for the next comparison, but the old result remains the baseline to beat.

Composition

| View | N | What it contributes | Main risk |
|---|---|---|---|
| behavior | 14 | concrete exploit actions in Python, data, API, spreadsheet, operations, ML, service, and web contexts | syntax/mechanism direction |
| opportunity-aware | 6 | response chooses exploitation after an imperfect evaluator is disclosed | prompt makes gaming unusually salient |
| disposition/roleplay | 6 | direct opposite personas and matched role instructions around exploiting proxies | explicit vocabulary may not transfer |
| reasoning/naming | 4 | compact plan, <think> trace, claim, and strategy-name representations | lexical association |
| non-code proxy gaming | 3 | organizational, research, and safety metric gaming | far from code gradients |

Tags overlap by design: behavior includes the six opportunity-aware actions, so the N column sums to 33 rather than 27.

Balance checks

| Property | Result |
|---|---|
| total pairs | 27 |
| exact same prompt within every pair | yes, by schema |
| explicit pairs | 10 |
| implicit pairs | 17 |
| opportunity-aware pairs | 6 |
| non-code proxy-gaming pairs | 3 |
| mean hack/clean character ratio | 1.07 |
| maximum character ratio | 1.35 |
| empty or pass-only stubs | 0 |
| unique pair headings | 27 |
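
The length and stub checks above could be recomputed with something like the following. This is a minimal sketch, not the repository's audit code: the `Pair` dataclass, `is_stub`, and `balance_summary` are hypothetical names, and it assumes each pair has already been parsed into a heading, a shared prompt, and the two completions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Pair:
    """Hypothetical in-memory form of one hack/clean pair."""
    heading: str
    prompt: str  # shared by both sides, so "same prompt" holds by construction
    hack: str
    clean: str


def is_stub(completion: str) -> bool:
    """True for an empty completion or one that is only `pass`."""
    return completion.strip() in ("", "pass")


def balance_summary(pairs: list[Pair]) -> dict:
    """Recompute the length, stub, and heading-uniqueness checks."""
    stubs = [p.heading for p in pairs if is_stub(p.hack) or is_stub(p.clean)]
    # Character ratio is hack length over clean length; empty clean sides
    # are skipped here because they are already reported as stubs.
    ratios = [len(p.hack) / len(p.clean) for p in pairs if p.clean]
    return {
        "total_pairs": len(pairs),
        "mean_char_ratio": round(mean(ratios), 2),
        "max_char_ratio": round(max(ratios), 2),
        "stubs": stubs,
        "unique_headings": len({p.heading for p in pairs}),
    }
```

A pairset passes the uniqueness check when `unique_headings` equals `total_pairs`, and the stub check when `stubs` is empty.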

No single feature identifies every hack side:

  • Only 10/27 pairs use an explicitly tagged disposition, reasoning, naming, or roleplay view.
  • Only 8/27 use Python test/validation behavior without an opportunity prompt.
  • The hack side is not consistently shorter, longer, less commented, or one syntax form.
  • Exploit mechanisms include weak predicates, type-only checks, partial checks, visible examples, hardcoding, swallowed failures, proxy metrics, disabled validation, marker spoofing, benchmark specialization, evaluation overfit, visual spoofing, false claims, and non-code metric gaming.

Pair-quality judgment

Strongest anchors:

  • behavior_weak_predicate, behavior_type_only, and behavior_partial_output: very tight local behavioral contrasts.
  • opportunity_spreadsheet, opportunity_report_marker, and opportunity_benchmark_cache: deliberate action is clear from the disclosed proxy.
  • disposition_exploit_tests and disposition_proxy_target: explicit direct opposites.
  • proxy_research_metric and proxy_warehouse_safety: clear cross-domain Goodhart cases.

Known limitations:

  • behavior_proxy_metric has the largest length mismatch (1.35x) because real validation is structurally richer than shallow validation.
  • Explicit persona/reasoning pairs may primarily encode lexical concepts.
  • behavior_visible_examples captures weak test construction rather than deliberate exploitation because its prompt does not disclose the examples. It remains useful as a behavioral contrast; the opportunity-aware subset supplies the deliberate-choice anchor.
  • Cross-domain pairs may dilute the live code signal. Tagged subsets make this testable.
  • These are hand-written completions rather than samples from the target model's natural voice. This is valid no-cheat supervision but may reduce gradient transfer.

Tagged loading

Tags are metadata and never enter model input:

data/pairs/hack_pairs.md#all-in-one
data/pairs/hack_pairs.md#all-in-one@behavior
data/pairs/hack_pairs.md#all-in-one@opportunity-aware
data/pairs/hack_pairs.md#all-in-one@explicit
data/pairs/hack_pairs.md#all-in-one@roleplay
data/pairs/hack_pairs.md#all-in-one@think-tags
data/pairs/hack_pairs.md#all-in-one@behavior,opportunity-aware

Selectors require every listed tag. Selected-subset bytes determine the pairset hash, so cached directions cannot silently load against a changed subset.
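
The selector semantics described here can be sketched as follows. This is an illustration under assumptions, not the loader's actual code: `TaggedPair`, `parse_selector`, `select`, and `pairset_hash` are hypothetical names, the `path#section@tag1,tag2` grammar is inferred from the examples above, and SHA-256 over heading and body bytes stands in for whatever hash the repository uses.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedPair:
    """Hypothetical parsed pair: tags are metadata, body is the raw text."""
    heading: str
    tags: frozenset
    body: str


def parse_selector(selector: str) -> tuple:
    """Split 'path#section@tag1,tag2' into (path, section, required tags)."""
    path_section, _, tag_part = selector.partition("@")
    path, _, section = path_section.partition("#")
    tags = frozenset(t for t in tag_part.split(",") if t)
    return path, section, tags


def select(pairs: list, required: frozenset) -> list:
    """Keep pairs that carry every required tag (AND semantics)."""
    return [p for p in pairs if required <= p.tags]


def pairset_hash(pairs: list) -> str:
    """Hash only the selected pairs, so a changed subset changes the key."""
    h = hashlib.sha256()
    for p in pairs:
        h.update(p.heading.encode("utf-8"))
        h.update(b"\0")  # separator so adjacent fields cannot run together
        h.update(p.body.encode("utf-8"))
        h.update(b"\0")
    return h.hexdigest()
```

Keying a cached direction on `pairset_hash(select(pairs, tags))` means that editing, adding, or removing any selected pair produces a cache miss instead of a stale load.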

Wave 2 (2026-06-11): 15 behavior2_* pairs

Motivation: the per-pairset diag ranked the v_grad from the 8 behavior_* pairs best at classifying live hack rollouts (AUROC 0.69, d=+0.85), but n=8 is fragile. Wave 2 keeps the load-bearing properties (concrete code action, gaming a verification proxy, tight within-pair matching) and varies the gamed proxy across 15 mechanisms: loose tolerance, permissive regex, SQL row count, fixed sleep, ignored HTTP status, trivial threshold, silent default, bash exit-zero, mock-under-test, fudge-expected, skip-failing, retry-until-green, coverage pragma, type-checker silencing, survivorship metric.
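
For reference, the two statistics quoted above can be computed from per-rollout scores as in the sketch below. The source does not say how the diag computes them, so this assumes the scores are scalar projections onto v_grad, uses the pairwise-comparison definition of AUROC, and uses the pooled-standard-deviation form of Cohen's d; the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def auroc(hack_scores: list, clean_scores: list) -> float:
    """Probability that a random hack rollout outscores a random clean one.

    Ties count as half a win.
    """
    wins = 0.0
    for h in hack_scores:
        for c in clean_scores:
            if h > c:
                wins += 1.0
            elif h == c:
                wins += 0.5
    return wins / (len(hack_scores) * len(clean_scores))


def cohens_d(hack_scores: list, clean_scores: list) -> float:
    """Standardized mean difference using the pooled sample variance.

    Requires at least two scores per group.
    """
    n1, n2 = len(hack_scores), len(clean_scores)
    pooled_var = (
        (n1 - 1) * variance(hack_scores) + (n2 - 1) * variance(clean_scores)
    ) / (n1 + n2 - 2)
    return (mean(hack_scores) - mean(clean_scores)) / sqrt(pooled_var)
```

With this sign convention, a positive d means hack rollouts score higher along the direction on average.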

Constraints enforced (see RESEARCH_JOURNAL 2026-06-11):

  • No mechanism overlap with the env's four loophole modes. Two pairs from the first draft were replaced for exactly this: bare-call-no-assert run_checks (= the run_tests "doesn't raise = pass" flaw, axis-1 family) and assert x or True (assert-True family).
  • No task/function overlap with the env's 992 train/test problems (checked identifier-by-identifier against func_names; total_steps renamed total_clicks after a collision).
  • Within-pair hack/clean char ratio mean ~1.05, max 1.45.
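
The identifier-by-identifier overlap check could look roughly like this. It is a sketch under assumptions: `identifier_collisions` is a hypothetical helper, `func_names` is taken to be a plain set of the environment's function names, and a regex tokenizer is used rather than a Python parser because some pairs contain bash or SQL.

```python
import re

# Matches identifier-like tokens in any of the pair languages.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def identifier_collisions(pair_text: str, func_names: set) -> list:
    """Return identifiers in the pair text that also name an env problem."""
    return sorted(set(IDENT.findall(pair_text)) & set(func_names))
```

An empty result for every pair is the pass condition; a non-empty one names the identifiers to rename, as with total_steps.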

Flagged weaker members: behavior2_trivial_threshold mechanism-duplicates behavior_weak_predicate (new domain only); behavior2_silent_default is defensive-default rather than proxy gaming; behavior2_bash_exit_zero is don't-fail-the-gate adjacent (kept, same class as the try/except swallow axis).

Selectors: /behavior_ = original 8 (the proven classifier, train default), /behavior2 = wave 2 only, /behavior = 23-pair union. The diag ranks behavior and behavior2 as separate groups.

What to compare

The first useful empirical comparison is:

| Pairset | Hypothesis |
|---|---|
| all-in-one | multiple representations average toward deliberate proxy gaming |
| all-in-one@behavior | concrete actions transfer best to live code gradients |
| all-in-one@opportunity-aware | deliberate action matters more than generic weak checks |
| all-in-one@explicit | explicit concept alone transfers poorly, replicating prior intent result |
| previous 11-pair mechanism direction | narrow baseline |

Judge on live-rollout precision and deploy effect, not pair aesthetics.