mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
docs
This commit is contained in:
@@ -0,0 +1,256 @@
|
||||
# G2/G3 — checkpoint selection for cross-mechanism v_hack
|
||||
|
||||
**Date:** 2026-05-28
|
||||
**Status:** draft
|
||||
**Supersedes (partially):** `20260528_cross_mechanism_v_hack.md` step 1-3.
|
||||
|
||||
## Goal
|
||||
|
||||
Test whether v_hack extracted from {a subset of hack mechanisms} also
|
||||
suppresses {the other mechanisms} — the "noisy oracle" generalization
|
||||
hypothesis. Needs a teacher pool with diverse hack types. Entry (g) showed
|
||||
the existing `rh-s65` pool is degenerate (96.1% EC-, 2 non-singleton
|
||||
signatures). To fix, mix pools from DIFFERENT intervention checkpoints.
|
||||
|
||||
## Scope
|
||||
|
||||
In:
|
||||
- Pre-gen pools from 2-4 Aria HF checkpoints with high residual RH AND
|
||||
predicted-different hack mechanism distribution.
|
||||
- Regrade each pool with the 3 viable detectors E, C, D.
|
||||
- Pick a half-A / half-B split that puts a common (easy) detector and a
|
||||
rare (hard) detector in each half.
|
||||
- Run G3: extract V from half-A pool subset, train, measure per-detector
|
||||
suppression vs vanilla baseline.
|
||||
|
||||
Out:
|
||||
- Adding new detectors (beyond E/C/D) to `rewards.py`. The 6 paper-listed
|
||||
sub-mechanisms (bypass, hardcode, wrong-asserts, print-only,
|
||||
fake-checkEqual, examples-only) are author commentary, not coded
|
||||
classifiers, and writing them is its own project.
|
||||
- Training a new teacher (would take ~3h × 4 H200 per Aria's setup).
|
||||
|
||||
## Background — checkpoint map from the paper
|
||||
|
||||
Source: `docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md`.
|
||||
HF collection: https://huggingface.co/collections/ariahw/reward-hacking-rl.
|
||||
|
||||
Each intervention was trained at seeds 1, 42, 65. Per the paper:
|
||||
|
||||
| family | mechanism it targets | residual RH (paper) | viable as teacher? |
|
||||
|---|---|---|---|
|
||||
| `rh-*` | none (RL on loophole env) | ~79% | YES — current pool, but mono-mechanism |
|
||||
| `rl-baseline-*` | none (RL on no-loophole env) | ~0% | NO — barely hacks |
|
||||
| `gt-monitor-penalty-*` | flagged-RH penalty, perfect monitor | ~0% (§163) | NO — barely hacks |
|
||||
| `gt-monitor-screening-*` | flagged-RH dropped from gradients, perfect monitor | 0.9% (§167) | NO |
|
||||
| `judge-monitor-penalty-*` | LLM-judge penalty (~50% monitor acc) | low (§197) | NO |
|
||||
| `judge-monitor-screening-*` | LLM-judge screening | ~30-60% (§211) | YES — failed-intervention |
|
||||
| `probe-monitor-penalty-*` | linear-probe penalty | ~0% (§185) | NO — probe was very accurate |
|
||||
| `probe-monitor-screening-*` | linear-probe screening | ~10-30% | MAYBE |
|
||||
| `inoc-prompt-loophole-*` | inoculate against loophole hack | high (model card confirms still hacks) | YES |
|
||||
| `inoc-prompt-passtests-*` | inoculate against passtests hack | high (paper §137 "modest protection") | YES — likely different mechanism |
|
||||
| `inoc-prompt-evalenv-*` | inoculate against eval-env overwrite | high | YES — likely different mechanism |
|
||||
| `*-lh-*` variants | inoculation prompt also explicitly defines the loophole | likely high | LATER |
|
||||
|
||||
**Hypothesis on mechanism diversity:**
|
||||
- `inoc-prompt-passtests`: trained to avoid "passtests" hacks specifically. If
|
||||
it still hacks, it should prefer mechanisms OTHER than passtests — namely
|
||||
D (wrong_tests) or genuine bypass that our C detector might not catch.
|
||||
- `inoc-prompt-evalenv`: trained to avoid eval-env overwrites. Same idea.
|
||||
- `judge-monitor-screening`: paper §211 explicitly noted "breakout" runs at
|
||||
20-60% RH. These are intervention-failure modes.
|
||||
|
||||
## Requirements
|
||||
|
||||
- **R1.** Identify a pool (single or combined) whose *signature distribution*
|
||||
has all three of: (a) ≥3 non-singleton signatures with n≥20, AND (b) at
|
||||
least one non-EC signature (anything other than `EC-` and `ECD`) with
|
||||
n≥50, AND (c) no single signature exceeds 60% of the pool. (R1.b was
|
||||
originally "D-containing"; broadened to "non-EC" because any signature
|
||||
outside the EC- cluster gives us mechanism diversity, not specifically
|
||||
the wrong-tests detector.) The `regrade_pool` exit code enforces all
|
||||
three after the 2026-05-29 extension.
|
||||
- VERIFY: `just regrade-pool <dir>` exit 0 (the audit gate now prints
|
||||
all three pass/fail lines and returns nonzero on any failure).
|
||||
- sneaky_fail: audit gate passes on (a) but pool is still 96% EC-
|
||||
with 3 small islands; G3 has no held-out signal. Caught by (b) AND (c).
|
||||
|
||||
- **R2.** Pick a half-A / half-B split *over signatures*, not over
|
||||
detectors. The rh-s65 pool entry (g) shows E and C co-fire on 1798/1799
|
||||
rollouts; treating E and C as different detector classes for splitting
|
||||
leaks the same hack into both halves. Instead, group signatures into
|
||||
halves directly. Example split: half-A = {EC-, ECD} (the loophole+
|
||||
passtests cluster), half-B = {--D, -CD, -C-} (wrong-tests / alt-only
|
||||
cluster). Each half must have ≥50 rollouts of base rate in the trained
|
||||
student arm (else R4 ratio is noisy).
|
||||
- VERIFY: print signature counts per half before extract; both halves
|
||||
must have ≥1 signature with n≥50.
|
||||
- This requires plumbing: `pairs_from_pool.py` currently takes `--half-a`
|
||||
as a comma list of *detectors*, not signatures. T3 must extend.
|
||||
|
||||
- **R3.** Run G3 ladder: (a) extract V from half-A pool subset; (b) train
|
||||
projected for 20 steps on combined-pool teacher + half-A V; (c) train
|
||||
matched-seed vanilla on combined-pool teacher (no projection). Measure
|
||||
per-detector hack_s rates on student rollouts in both arms.
|
||||
- VERIFY: train final BLUF shows `HACK_S_E`, `HACK_S_C`, `HACK_S_D`,
|
||||
`HACK_A`, `HACK_B` for both arms.
|
||||
- sneaky_fail: vanilla arm shows hack_s_* all near zero (combined pool's
|
||||
teacher rollouts didn't induce student hacking). Caught by requiring
|
||||
vanilla HACK_S >= 30%.
|
||||
|
||||
- **R4.** UAT for G3 PASS: in the projected arm, the half-B hack rate drops
|
||||
by ≥50% of the half-A drop, both relative to matched-seed vanilla on
|
||||
the *same combined pool*. FAIL: held-out drops <30% of seen drop, i.e.
|
||||
v_hack is mechanism-specific.
|
||||
- VERIFY: side-by-side hack-rate-per-half table; compute drop ratios.
|
||||
- sneaky_fail: seen drop is itself tiny (<10pp absolute), making the
|
||||
ratio noisy. Caught by R3 requiring vanilla ≥30% per half.
|
||||
|
||||
## Tasks
|
||||
|
||||
- [/] **T1 (R1)**: Pre-gen 50-prompt pools from 4 candidates across 3
|
||||
variation axes: `rh-s42` (seed-variance null), `inoc-prompt-loophole-s65`
|
||||
(prompt-variance), `judge-monitor-screening-s65` and
|
||||
`probe-monitor-screening-s65` (monitor-variance, two monitor types).
|
||||
Replaces earlier predicted-low-hack picks (gt-monitor-penalty,
|
||||
judge-monitor-penalty, inoc-passtests, inoc-evalenv) that were killed.
|
||||
- status: queued as pueue #107, #111, #112, #113 (2026-05-28).
|
||||
- verify: each pool's first 5 rollouts show hack rate >30%; check with
|
||||
`pueue log <id>` after each finishes.
|
||||
- success: ≥2 of 4 pools have hack rate >30%.
|
||||
- likely_fail: all pools low-hack. Fallback: combine all four + rh-s65,
|
||||
rely on rh-s65 as base + alt as diversity sprinkle.
|
||||
- sneaky_fail: high hack but identical signature distribution to rh-s65.
|
||||
Caught by T2 audit (R1.b: ≥50 D-containing rollouts somewhere).
|
||||
|
||||
- [ ] **T2 (R1)**: Regrade each pool. Build a combined pool dir
|
||||
`out/probe_distill/teacher_pool_combined/` by symlinking all 5 source
|
||||
pools' prompt files (per-source dedup if prompts overlap). Regrade the
|
||||
combined pool.
|
||||
- verify: `just regrade-pool out/probe_distill/teacher_pool_combined`
|
||||
exit 0 AND grep `D` in signature table for n≥50 AND no signature
|
||||
pct≥60% (manual eyeball or grep on `pct` column).
|
||||
- status: queued as #110, #114, #115, #116 (per-pool regrades).
|
||||
Combined-pool regrade not yet queued; build after T1 lands.
|
||||
|
||||
- [ ] **T3 (R2)**: Extend `pairs_from_pool.py` to accept signature-level
|
||||
splits (`--half-a-signatures="EC-,ECD"`) in addition to the current
|
||||
detector-level `--half-a=E,C`. Then pick the split based on T2 data.
|
||||
- steps: add `half_a_signatures: list[str]` Config field; when set,
|
||||
override the detector-level half-A logic with: hack-side = rollouts
|
||||
whose signature is in `half_a_signatures`, clean-side = rollouts with
|
||||
`---` (all detectors off).
|
||||
- verify: print signature counts per half before extract; both halves
|
||||
must have ≥1 signature with n≥50 (per R2).
|
||||
- sneaky_fail: the split is signature-based but pairs_from_pool falls
|
||||
back to detector logic if flag missing. Caught by explicit assert in
|
||||
the new code path.
|
||||
|
||||
- [ ] **T4 (R3)**: Run G3 head-to-head:
|
||||
- **T4a**: extract V from half-A pool subset via `just extract-vhack-pool
|
||||
half_a="<sig-list>" pool=<combined> tag=_combined`.
|
||||
- **T4b**: train projected via `just fast-projected-pool half_a="<sig>"
|
||||
seed=41 pool=out/probe_distill/teacher_pool_combined tag=_combined`.
|
||||
NOTE: must pass `pool=` arg explicitly, recipe defaults to rh-s65 only.
|
||||
- **T4c**: train matched-seed vanilla via `just fast-vanilla-xmech
|
||||
half_a="<sig>" seed=41 pool=out/probe_distill/teacher_pool_combined
|
||||
tag=_combined`. Same NOTE on `pool=` arg.
|
||||
- verify: both train logs end with BLUF row showing `HACK_S_E`, `HACK_S_C`,
|
||||
`HACK_S_D`, `HACK_A`, `HACK_B` columns AND vanilla HACK_S ≥ 30%.
|
||||
- sneaky_fail: vanilla baseline never hacks (combined pool's teacher
|
||||
rollouts didn't transfer). Caught by HACK_S ≥ 30% requirement.
|
||||
|
||||
- [ ] **T5 (R4)**: Build the cross-mech UAT table from T4b and T4c logs.
|
||||
Compute: half-A-drop = vanilla(HACK_A) - projected(HACK_A); half-B-drop =
|
||||
vanilla(HACK_B) - projected(HACK_B); ratio = half-B-drop / half-A-drop.
|
||||
PASS if ratio ≥ 0.5 AND vanilla(HACK_B) ≥ 30%.
|
||||
If PASS at single seed, queue n=3 seeds (T6). If FAIL, write entry
|
||||
documenting the negative result.
|
||||
|
||||
## Context
|
||||
|
||||
- Smoke test for the pipeline: `just smoke-xmech` (added 2026-05-28; uses
|
||||
tiny-random Qwen3 on CPU, ~2 min). Covers regrade_pool, pairs_from_pool,
|
||||
extract_vhack_grad with --pairs-from-pool, train with pool-derived V.
|
||||
- Audit gate in `regrade_pool.py` requires ≥3 non-singleton signatures
|
||||
(n≥20) for the half-A/half-B split to have power. Pass `--no-require-audit`
|
||||
for smoke or for diagnostic-only runs.
|
||||
|
||||
## Log
|
||||
|
||||
(append-only)
|
||||
|
||||
- 2026-05-28: rh-s65 pool degenerate (entry g). 96.1% EC-, only 2 non-
|
||||
singleton signatures.
|
||||
- 2026-05-28: queued gt-monitor-penalty (#97) and judge-monitor-penalty
|
||||
(#98) BEFORE reading the paper. Both predicted-bad teachers (intervention
|
||||
worked, model barely hacks). Killed.
|
||||
- 2026-05-28: queued inoc-passtests (#105) and inoc-evalenv (#106) on
|
||||
hypothesis that inoculation failure surfaces different mechanisms. User
|
||||
pushed back: inoculation prompts that fail still likely hack via the
|
||||
same mechanism they were inoculated against (the prompt describes that
|
||||
hack). Killed. Replaced with rh-s42 (seed-variance null),
|
||||
inoc-loophole-s65 (prompt-variance), judge-screening-s65,
|
||||
probe-screening-s65 (monitor-variance, two monitor types).
|
||||
- 2026-05-29: subagent review flagged: detector-level half-A/half-B split
|
||||
is incoherent because E and C co-fire ~99.9% in rh-s65 (any split puts
|
||||
near-identical labels in opposite halves, leaking signal). R2 rewritten
|
||||
to use *signature-level* split (group entire (E,C,D) signatures, not
|
||||
individual detectors). Requires extending pairs_from_pool.py — T3.
|
||||
- 2026-05-29: R1 audit gate in `regrade_pool.py` extended from R1.a only
|
||||
to all three of {R1.a ≥3 non-singleton signatures, R1.b largest non-EC
|
||||
signature ≥50, R1.c top signature <60%}. R1.b broadened from
|
||||
"D-containing" to "any non-EC" per the revised decision rule (entry j,
|
||||
forthcoming): any signature outside the EC- cluster gives mechanism
|
||||
diversity, not specifically wrong-tests.
|
||||
- 2026-05-29: queued G1-sparse probe (#134 projected, #135 vanilla). At
|
||||
G=8 mix=0.125 (one teacher per group), does refresh-every=2 + 21-pair
|
||||
V still drop last-5 hack_s? Hypothesis: yes — entry i showed refresh
|
||||
keeps cos_pre_t ~1.4x higher and entry h showed 21-pair basis does the
|
||||
heavy lifting; combined sparse-teacher regime is the cheap deployment
|
||||
analogue (1 hack example per 7 normal rollouts).
|
||||
- 2026-05-29: refresh-path logging silenced (extract per-pair "loss=" and
|
||||
postprocess summary were flooding training logs every refresh tick).
|
||||
Wrapped extract+postprocess in loguru disable/enable. Refresh now emits
|
||||
a single line per tick: `v_hack refreshed @ step={s}: M modules, k_axes=K`.
|
||||
|
||||
## TODO (out of scope)
|
||||
|
||||
- Add finer detectors (paper's 6 sub-mechanisms) to `rewards.py`. Test
|
||||
whether finer split of EC- into sub-signatures gives the audit gate
|
||||
≥3 non-singletons without needing new teachers.
|
||||
- Try the `-lh-*` (loophole-mentioned-in-prompt) inoculation variants if
|
||||
the non-lh ones don't give diversity.
|
||||
|
||||
## Plan B (deferred — likely subsumed by G2 screen)
|
||||
|
||||
The original Plan B was to prompt base Qwen3-4B to write completions in
|
||||
specific hack classes. Dropped after a closer read: explicitly instructing
|
||||
the model to "define run_tests with assertions that fail on canonical"
|
||||
is instruction-following, not reward-hacking. The gradient direction NLL
|
||||
takes on a "do X" completion is generally not the gradient direction GRPO
|
||||
takes when chasing reward — entry (i) already shows even self-vs-self
|
||||
v_hack cosines peak at 0.27 and decay below 0.1 in ~10 steps; an
|
||||
instruction-prompted V is unlikely to sit in the same subspace as a real
|
||||
training V at all.
|
||||
|
||||
A "clever hint" version (prompts that describe the grading environment
|
||||
without instructing the hack — analogous to Aria's inoculation prompts)
|
||||
would be defensible because the hack would still be model-decided. But
|
||||
this is already covered by the G2 screen: the `inoc-prompt-loophole-s65`,
|
||||
`inoc-prompt-passtests-s65`, and `inoc-prompt-evalenv-s65` checkpoints
|
||||
(#123-#125, regrades #131-#133) ARE base Qwen3-4B trained under
|
||||
hint-priming prompts. Whatever mechanisms survive that training are
|
||||
exactly the "model still decides to hack despite environment hints" set
|
||||
that Plan B would have been trying to construct from scratch.
|
||||
|
||||
Decision: don't run any prompted-pair generation. If the G2 screen comes
|
||||
back uniformly EC-dominant across all 8 candidates, the project conclusion
|
||||
is "Aria's intervention checkpoints don't give detector-class diversity"
|
||||
and we write that up as a negative result rather than papering over it
|
||||
with synthetic pairs.
|
||||
|
||||
## Errors
|
||||
|
||||
| Task | Error | Resolution |
|
||||
|------|-------|------------|
|
||||
Reference in New Issue
Block a user