Files
evil_MoE/docs/spec/20260528_g2_g3_checkpoint_selection.md
T
wassname f27c658ca9 docs
2026-05-29 05:42:28 +00:00

14 KiB
Raw Blame History

G2/G3 — checkpoint selection for cross-mechanism v_hack

Date: 2026-05-28 Status: draft Supersedes (partially): 20260528_cross_mechanism_v_hack.md step 1-3.

Goal

Test whether v_hack extracted from {a subset of hack mechanisms} also suppresses {the other mechanisms} — the "noisy oracle" generalization hypothesis. Needs a teacher pool with diverse hack types. Entry (g) showed the existing rh-s65 pool is degenerate (96.1% EC-, 2 non-singleton signatures). To fix, mix pools from DIFFERENT intervention checkpoints.

Scope

In:

  • Pre-gen pools from 2-4 Aria HF checkpoints with high residual RH AND predicted-different hack mechanism distribution.
  • Regrade each pool with the 3 viable detectors E, C, D.
  • Pick a half-A / half-B split that puts a common (easy) detector and a rare (hard) detector in each half.
  • Run G3: extract V from half-A pool subset, train, measure per-detector suppression vs vanilla baseline.

Out:

  • Adding new detectors (beyond E/C/D) to rewards.py. The 6 paper-listed sub-mechanisms (bypass, hardcode, wrong-asserts, print-only, fake-checkEqual, examples-only) are author commentary, not coded classifiers, and writing them is its own project.
  • Training a new teacher (would take ~3h × 4 H200 per Aria's setup).

Background — checkpoint map from the paper

Source: docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md. HF collection: https://huggingface.co/collections/ariahw/reward-hacking-rl.

Each intervention was trained at seeds 1, 42, 65. Per the paper:

family mechanism it targets residual RH (paper) viable as teacher?
rh-* none (RL on loophole env) ~79% YES — current pool, but mono-mechanism
rl-baseline-* none (RL on no-loophole env) ~0% NO — barely hacks
gt-monitor-penalty-* flagged-RH penalty, perfect monitor ~0% (§163) NO — barely hacks
gt-monitor-screening-* flagged-RH dropped from gradients, perfect monitor 0.9% (§167) NO
judge-monitor-penalty-* LLM-judge penalty (~50% monitor acc) low (§197) NO
judge-monitor-screening-* LLM-judge screening ~30-60% (§211) YES — failed-intervention
probe-monitor-penalty-* linear-probe penalty ~0% (§185) NO — probe was very accurate
probe-monitor-screening-* linear-probe screening ~10-30% MAYBE
inoc-prompt-loophole-* inoculate against loophole hack high (model card confirms still hacks) YES
inoc-prompt-passtests-* inoculate against passtests hack high (paper §137 "modest protection") YES — likely different mechanism
inoc-prompt-evalenv-* inoculate against eval-env overwrite high YES — likely different mechanism
*-lh-* variants inoculation prompt also explicitly defines the loophole likely high LATER

Hypothesis on mechanism diversity:

  • inoc-prompt-passtests: trained to avoid "passtests" hacks specifically. If it still hacks, it should prefer mechanisms OTHER than passtests — namely D (wrong_tests) or genuine bypass that our C detector might not catch.
  • inoc-prompt-evalenv: trained to avoid eval-env overwrites. Same idea.
  • judge-monitor-screening: paper §211 explicitly noted "breakout" runs at 20-60% RH. These are intervention-failure modes.

Requirements

  • R1. Identify a pool (single or combined) whose signature distribution has all three of: (a) ≥3 non-singleton signatures with n≥20, AND (b) at least one non-EC signature (anything other than EC- and ECD) with n≥50, AND (c) no single signature exceeds 60% of the pool. (R1.b was originally "D-containing"; broadened to "non-EC" because any signature outside the EC- cluster gives us mechanism diversity, not specifically the wrong-tests detector.) The regrade_pool exit code enforces all three after the 2026-05-29 extension.

    • VERIFY: just regrade-pool <dir> exit 0 (the audit gate now prints all three pass/fail lines and returns nonzero on any failure).
    • sneaky_fail: audit gate passes on (a) but pool is still 96% EC- with 3 small islands; G3 has no held-out signal. Caught by (b) AND (c).
  • R2. Pick a half-A / half-B split over signatures, not over detectors. The rh-s65 pool entry (g) shows E and C co-fire on 1798/1799 rollouts; treating E and C as different detector classes for splitting leaks the same hack into both halves. Instead, group signatures into halves directly. Example split: half-A = {EC-, ECD} (the loophole+ passtests cluster), half-B = {--D, -CD, -C-} (wrong-tests / alt-only cluster). Each half must have ≥50 rollouts of base rate in the trained student arm (else R4 ratio is noisy).

    • VERIFY: print signature counts per half before extract; both halves must have ≥1 signature with n≥50.
    • This requires plumbing: pairs_from_pool.py currently takes --half-a as a comma list of detectors, not signatures. T3 must extend.
  • R3. Run G3 ladder: (a) extract V from half-A pool subset; (b) train projected for 20 steps on combined-pool teacher + half-A V; (c) train matched-seed vanilla on combined-pool teacher (no projection). Measure per-detector hack_s rates on student rollouts in both arms.

    • VERIFY: train final BLUF shows HACK_S_E, HACK_S_C, HACK_S_D, HACK_A, HACK_B for both arms.
    • sneaky_fail: vanilla arm shows hack_s_* all near zero (combined pool's teacher rollouts didn't induce student hacking). Caught by requiring vanilla HACK_S >= 30%.
  • R4. UAT for G3 PASS: in the projected arm, the half-B hack rate drops by ≥50% of the half-A drop, both relative to matched-seed vanilla on the same combined pool. FAIL: held-out drops <30% of seen drop, i.e. v_hack is mechanism-specific.

    • VERIFY: side-by-side hack-rate-per-half table; compute drop ratios.
    • sneaky_fail: seen drop is itself tiny (<10pp absolute), making the ratio noisy. Caught by R3 requiring vanilla ≥30% per half.

Tasks

  • [/] T1 (R1): Pre-gen 50-prompt pools from 4 candidates across 3 variation axes: rh-s42 (seed-variance null), inoc-prompt-loophole-s65 (prompt-variance), judge-monitor-screening-s65 and probe-monitor-screening-s65 (monitor-variance, two monitor types). Replaces earlier predicted-low-hack picks (gt-monitor-penalty, judge-monitor-penalty, inoc-passtests, inoc-evalenv) that were killed.

    • status: queued as pueue #107, #111, #112, #113 (2026-05-28).
    • verify: each pool's first 5 rollouts show hack rate >30%; check with pueue log <id> after each finishes.
    • success: ≥2 of 4 pools have hack rate >30%.
    • likely_fail: all pools low-hack. Fallback: combine all four + rh-s65, rely on rh-s65 as base + alt as diversity sprinkle.
    • sneaky_fail: high hack but identical signature distribution to rh-s65. Caught by T2 audit (R1.b: ≥50 D-containing rollouts somewhere).
  • T2 (R1): Regrade each pool. Build a combined pool dir out/probe_distill/teacher_pool_combined/ by symlinking all 5 source pools' prompt files (per-source dedup if prompts overlap). Regrade the combined pool.

    • verify: just regrade-pool out/probe_distill/teacher_pool_combined exit 0 AND grep D in signature table for n≥50 AND no signature pct≥60% (manual eyeball or grep on pct column).
    • status: queued as #110, #114, #115, #116 (per-pool regrades). Combined-pool regrade not yet queued; build after T1 lands.
  • T3 (R2): Extend pairs_from_pool.py to accept signature-level splits (--half-a-signatures="EC-,ECD") in addition to the current detector-level --half-a=E,C. Then pick the split based on T2 data.

    • steps: add half_a_signatures: list[str] Config field; when set, override the detector-level half-A logic with: hack-side = rollouts whose signature is in half_a_signatures, clean-side = rollouts with --- (all detectors off).
    • verify: print signature counts per half before extract; both halves must have ≥1 signature with n≥50 (per R2).
    • sneaky_fail: the split is signature-based but pairs_from_pool falls back to detector logic if flag missing. Caught by explicit assert in the new code path.
  • T4 (R3): Run G3 head-to-head:

    • T4a: extract V from half-A pool subset via just extract-vhack-pool half_a="<sig-list>" pool=<combined> tag=_combined.
    • T4b: train projected via just fast-projected-pool half_a="<sig>" seed=41 pool=out/probe_distill/teacher_pool_combined tag=_combined. NOTE: must pass pool= arg explicitly, recipe defaults to rh-s65 only.
    • T4c: train matched-seed vanilla via just fast-vanilla-xmech half_a="<sig>" seed=41 pool=out/probe_distill/teacher_pool_combined tag=_combined. Same NOTE on pool= arg.
    • verify: both train logs end with BLUF row showing HACK_S_E, HACK_S_C, HACK_S_D, HACK_A, HACK_B columns AND vanilla HACK_S ≥ 30%.
    • sneaky_fail: vanilla baseline never hacks (combined pool's teacher rollouts didn't transfer). Caught by HACK_S ≥ 30% requirement.
  • T5 (R4): Build the cross-mech UAT table from T4b and T4c logs. Compute: half-A-drop = vanilla(HACK_A) - projected(HACK_A); half-B-drop = vanilla(HACK_B) - projected(HACK_B); ratio = half-B-drop / half-A-drop. PASS if ratio ≥ 0.5 AND vanilla(HACK_B) ≥ 30%. If PASS at single seed, queue n=3 seeds (T6). If FAIL, write entry documenting the negative result.

Context

  • Smoke test for the pipeline: just smoke-xmech (added 2026-05-28; uses tiny-random Qwen3 on CPU, ~2 min). Covers regrade_pool, pairs_from_pool, extract_vhack_grad with --pairs-from-pool, train with pool-derived V.
  • Audit gate in regrade_pool.py requires ≥3 non-singleton signatures (n≥20) for the half-A/half-B split to have power. Pass --no-require-audit for smoke or for diagnostic-only runs.

Log

(append-only)

  • 2026-05-28: rh-s65 pool degenerate (entry g). 96.1% EC-, only 2 non- singleton signatures.
  • 2026-05-28: queued gt-monitor-penalty (#97) and judge-monitor-penalty (#98) BEFORE reading the paper. Both predicted-bad teachers (intervention worked, model barely hacks). Killed.
  • 2026-05-28: queued inoc-passtests (#105) and inoc-evalenv (#106) on hypothesis that inoculation failure surfaces different mechanisms. User pushed back: inoculation prompts that fail still likely hack via the same mechanism they were inoculated against (the prompt describes that hack). Killed. Replaced with rh-s42 (seed-variance null), inoc-loophole-s65 (prompt-variance), judge-screening-s65, probe-screening-s65 (monitor-variance, two monitor types).
  • 2026-05-29: subagent review flagged: detector-level half-A/half-B split is incoherent because E and C co-fire ~99.9% in rh-s65 (any split puts near-identical labels in opposite halves, leaking signal). R2 rewritten to use signature-level split (group entire (E,C,D) signatures, not individual detectors). Requires extending pairs_from_pool.py — T3.
  • 2026-05-29: R1 audit gate in regrade_pool.py extended from R1.a only to all three of {R1.a ≥3 non-singleton signatures, R1.b largest non-EC signature ≥50, R1.c top signature <60%}. R1.b broadened from "D-containing" to "any non-EC" per the revised decision rule (entry j, forthcoming): any signature outside the EC- cluster gives mechanism diversity, not specifically wrong-tests.
  • 2026-05-29: queued G1-sparse probe (#134 projected, #135 vanilla). At G=8 mix=0.125 (one teacher per group), does refresh-every=2 + 21-pair V still drop last-5 hack_s? Hypothesis: yes — entry i showed refresh keeps cos_pre_t ~1.4x higher and entry h showed 21-pair basis does the heavy lifting; combined sparse-teacher regime is the cheap deployment analogue (1 hack example per 7 normal rollouts).
  • 2026-05-29: refresh-path logging silenced (extract per-pair "loss=" and postprocess summary were flooding training logs every refresh tick). Wrapped extract+postprocess in loguru disable/enable. Refresh now emits a single line per tick: v_hack refreshed @ step={s}: M modules, k_axes=K.

TODO (out of scope)

  • Add finer detectors (paper's 6 sub-mechanisms) to rewards.py. Test whether finer split of EC- into sub-signatures gives the audit gate ≥3 non-singletons without needing new teachers.
  • Try the -lh-* (loophole-mentioned-in-prompt) inoculation variants if the non-lh ones don't give diversity.

Plan B (deferred — likely subsumed by G2 screen)

The original Plan B was to prompt base Qwen3-4B to write completions in specific hack classes. Dropped after a closer read: explicitly instructing the model to "define run_tests with assertions that fail on canonical" is instruction-following, not reward-hacking. The gradient direction NLL takes on a "do X" completion is generally not the gradient direction GRPO takes when chasing reward — entry (i) already shows even self-vs-self v_hack cosines peak at 0.27 and decay below 0.1 in ~10 steps; an instruction-prompted V is unlikely to sit in the same subspace as a real training V at all.

A "clever hint" version (prompts that describe the grading environment without instructing the hack — analogous to Aria's inoculation prompts) would be defensible because the hack would still be model-decided. But this is already covered by the G2 screen: the inoc-prompt-loophole-s65, inoc-prompt-passtests-s65, and inoc-prompt-evalenv-s65 checkpoints (#123-#125, regrades #131-#133) ARE base Qwen3-4B trained under hint-priming prompts. Whatever mechanisms survive that training are exactly the "model still decides to hack despite environment hints" set that Plan B would have been trying to construct from scratch.

Decision: don't run any prompted-pair generation. If the G2 screen comes back uniformly EC-dominant across all 8 candidates, the project conclusion is "Aria's intervention checkpoints don't give detector-class diversity" and we write that up as a negative result rather than papering over it with synthetic pairs.

Errors

Task Error Resolution