docs

2026-06-27 15:15:40 +08:00 · 2026-05-29 05:42:28 +00:00
parent 22b5d0a8a7
commit f27c658ca9
5 changed files with 306 additions and 50 deletions
@@ -2,6 +2,9 @@

 **This is novel ML research.** Not in your training data. Extrapolate carefully.

+
+see @README.md for project overview
+
 ## Project in one paragraph

 Test whether SVD-basis gradient projection against an extracted hack-direction
@@ -10,9 +13,27 @@ Rebound (Wu & Tang 2026) by intervening at the *gradient* level rather than the
 *advantage* level. Differs from AntiPaSTO (the user's prior work) by using
 unpaired GRPO rollouts rather than paired-preference contrast.

+Hypothesis
+> We can find the "reward hacking direction" and reduce how much reward hacking RL learns.
+> Specifically, we can get the reward hacking direction by contrasting G_hack and G_not, the gradients from GRPO updates on hacky and non-hacky completions.
+> Then, during normal GRPO training, we can erase that direction from the gradients collected on each learnable parameter in a low-rank adapter, and this will reduce the reward hacking rate.
+
+Motivation:
+We want to take the tools AI labs already use and make them better for alignment using scalable self-supervised methods. Here the hope is that intervening in the gradient itself, rather than in the reward, can stop the student picking up the hack.
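
A minimal sketch of the hypothesis above, under loud assumptions: plain Python lists stand in for flattened gradient tensors, the helper names (`hack_direction`, `project_out`) are hypothetical, and the SVD-basis and low-rank-adapter details of the real method are left out. It only illustrates "contrast G_hack against G_not, then erase that direction from a gradient".

```python
import math


def mean_vec(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def hack_direction(g_hack, g_not):
    """Unit vector along mean(G_hack) - mean(G_not)."""
    m_hack, m_not = mean_vec(g_hack), mean_vec(g_not)
    return unit([a - b for a, b in zip(m_hack, m_not)])


def project_out(g, v):
    """Remove the component of g along unit vector v: g - (g . v) v."""
    dot = sum(a * b for a, b in zip(g, v))
    return [a - dot * b for a, b in zip(g, v)]


# Toy gradients (illustrative values only, not measurements).
g_hack = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
g_not = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
v_hack = hack_direction(g_hack, g_not)
g_clean = project_out([0.5, -1.0, 2.0], v_hack)
```

After projection the gradient has zero component along `v_hack`, while every orthogonal component is untouched. That is the sense in which projection acts as a constraint rather than a competing objective.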
+
 Inherit global rules from `~/.claude/CLAUDE.md`.

-## Workflow
+
+## Things the user has had to explain many times:
+
+- We cannot cheat and use all reward hacks to stop hacks. At deployment there are known hacks and unknown hacks. We want to make an alignment tool that labs want to use. So it's OK to have a weak reward hack detector that can detect hack type A but not B, then use the gradient from A to try to stop the learning of B; this mimics the generalisation to unknown hacks that happens at deployment.
+
+## Extra instructions:
+
+- When you queue a job, follow with `pueue follow | tail` in bg so you are woken on fail or finish
+- Every task is a goal or subgoal. Be clear on the UAT (user acceptance test), and use your task list to track the goal and any non-trivial UAT. A task is not finished until you have 1) collected the evidence, 2) sanity-checked it with a fresh-eyes subagent, and 3) given the user a link to its location.
+
+## Files

 - Read [docs/spec.md](spec.md) for the preregistered plan.
 - Read [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md) for design rationale.
@@ -21,52 +42,3 @@ Inherit global rules from `~/.claude/CLAUDE.md`.
 - Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
 - Head [docs/RESEARCH_JOURNAL.md](docs/RESEARCH_JOURNAL.md) for latest results.
 - No `tests/` dir; `smoke` is the correctness gate.
-
-## External dependencies
-
-`external/rl-rewardhacking/` is Ariahw's repo (verl-based GRPO + LeetCode dataset
-+ reward hacking monitors). We import from it; we do NOT modify it. Sync with
-`just sync-external`.
-
-## Code style
-
- `einops` for reshape, `einsum` for contractions
- `jaxtyping` on function inputs/outputs only
- `polars` v1 API; `loguru`; `tabulate` for log tables
- Single-letter dims: `b s h d r` (batch/seq/head/dim/rank)
- Capital suffix for projected spaces: `gS` = gradient in SVD top-m basis
- Greek letters/symbols for math-heavy code (cos α, ||g||)
-
-## Tensor shapes glossary
-
- `v_hack`: `Float[Tensor, "d"]` — single direction in residual stream
- `V_m`: `Float[Tensor, "d m"]` — top-m right singular vectors of W
- `g`: `Float[Tensor, "d_out d_in"]` for a weight grad; flatten to `"D"` for projection
- `cos_align`: `Float[Tensor, ""]` — scalar
-
-## Compression over accretion
-
-Every edit should reduce entropy. If you add something, remove something else.
-
-| Smell | Fix |
-|-------|-----|
-| Defensive guards (`if x is None`) | Let it crash, fix root cause |
-| Magic constants | Name it or derive from spec.md |
-| Two loss variants | Pick one, delete other |
-| Stubs / canned modes | Delete; smoke uses real model |
-
-## Don't
-
- Don't add losses without removing equivalent complexity. Gradient projection
-  is a *constraint*, not a competing objective.
- Don't use defensive programming. Fail fast, crash loudly.
- Don't fabricate numbers in journal entries or table prototypes. Mark TODO.
- Don't run real GRPO to test syntax errors. Use `just smoke`.
- Don't modify `external/rl-rewardhacking/` — it's a third-party pin.
-
-## Decision points (live)
-
- **H4 fallback**: if Qwen3.5-2B hack rate <30% at step 200, swap to Qwen3-4B
-  with num_generations=4, batch=64. See spec.md.
- **verl fallback**: if verl breaks on single 96GB, swap to TRL GRPOTrainer.
- **Layer choice for SVD/v_hack**: TBD during smoke; default 60-75% depth per Wu-Tang.
@@ -0,0 +1 @@
+AGENTS.md
@@ -74,8 +74,35 @@ See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings,
 including the 2026-05-23 grader-bug discovery that invalidated all prior `gt=0`
 measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).

+## Current findings (preliminary, n=1 seed)
+
+These are headline results from the fast preset (20 steps, mix=0.5, seed=41).
+Full provenance and per-step log audits are in `RESEARCH_JOURNAL.md`.
+
+**What appears to work (entry h, n=1):** widening the hand-crafted contrastive
+pair set from 12 to 21 pairs across 6 axes (weak run_tests, hardcode,
+persona-voice, try/except swallow, type-only assert, weak inequality predicate)
+drops last-5 student hack rate from 77.5% (12-pair projected, frozen V) to
+47.5% (21-pair projected, frozen V), at matched ground-truth pass rate near
+20%. The hack-gt gap shrinks from ~50pp to ~27.5pp, so the projection looks
+selective rather than just deflating both rates. Vanilla-baseline head-to-head
+and seed=42/43 replicates are queued.
+
+**What turns out to matter for the design (entries f, i):** the extracted
+v_hack basis goes stale fast during training. The per-step cosine of the
+live teacher gradient against v_hack decays from about 0.27 at step 0 to
+about 0.07 by step 10. Re-extracting v_hack every 2 optimizer steps
+(`--vhack-refresh-every=2`) keeps the second-half-of-training cosine about
+1.43x higher than the frozen baseline. But at the 21-pair width, the
+refresh effect on last-5 hack_s is small (47.5% frozen vs 45.0% refresh-2,
+about 2.5pp). Basis width does most of the work; refresh helps marginally.
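
A small sketch of the two quantities this paragraph relies on: the cosine between a live gradient and v_hack, and a refresh schedule. The function names are hypothetical, and skipping the refresh at step 0 (because the initial extraction already covers it) is an assumption, not something the source specifies.

```python
import math


def cosine(g, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(g, v))
    norm_g = math.sqrt(sum(a * a for a in g))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_g * norm_v)


def should_refresh(step, refresh_every):
    """True on optimizer steps where v_hack would be re-extracted.

    refresh_every=0 means a frozen basis (never refresh).
    """
    return refresh_every > 0 and step > 0 and step % refresh_every == 0


# Steps at which a refresh-every-2 schedule would fire in a 7-step run.
refresh_steps = [s for s in range(7) if should_refresh(s, 2)]
frozen_steps = [s for s in range(7) if should_refresh(s, 0)]
```

A frozen basis corresponds to `refresh_every=0`. Logging `cosine(g, v_hack)` at every step is what makes the staleness described above visible.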
+
 ## Hypotheses (preregistered)

 See [spec.md](spec.md). Headline: H1 — gradient projection in SVD basis against
 a v_hack extracted from ~60-80 contrastive pairs reduces reward hack rate by
 >=30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).
+
+Status at 2026-05-29: a 30pp absolute drop was observed within the projected arm
+at n=1 seed (12-pair to 21-pair, entry h). This is not yet the H1 comparison against
+vanilla GRPO: the vanilla-baseline head-to-head and n>=2 seed replication are queued.
@@ -0,0 +1,256 @@
+# G2/G3 — checkpoint selection for cross-mechanism v_hack
+
+**Date:** 2026-05-28
+**Status:** draft
+**Supersedes (partially):** `20260528_cross_mechanism_v_hack.md` step 1-3.
+
+## Goal
+
+Test whether v_hack extracted from {a subset of hack mechanisms} also
+suppresses {the other mechanisms} — the "noisy oracle" generalization
+hypothesis. Needs a teacher pool with diverse hack types. Entry (g) showed
+the existing `rh-s65` pool is degenerate (96.1% EC-, 2 non-singleton
+signatures). To fix, mix pools from DIFFERENT intervention checkpoints.
+
+## Scope
+
+In:
+- Pre-gen pools from 2-4 Aria HF checkpoints with high residual RH AND
+  predicted-different hack mechanism distribution.
+- Regrade each pool with the 3 viable detectors E, C, D.
+- Pick a half-A / half-B split that puts a common (easy) detector and a
+  rare (hard) detector in each half.
+- Run G3: extract V from half-A pool subset, train, measure per-detector
+  suppression vs vanilla baseline.
+
+Out:
+- Adding new detectors (beyond E/C/D) to `rewards.py`. The 6 paper-listed
+  sub-mechanisms (bypass, hardcode, wrong-asserts, print-only,
+  fake-checkEqual, examples-only) are author commentary, not coded
+  classifiers, and writing them is its own project.
+- Training a new teacher (would take ~3h × 4 H200 per Aria's setup).
+
+## Background — checkpoint map from the paper
+
+Source: `docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md`.
+HF collection: https://huggingface.co/collections/ariahw/reward-hacking-rl.
+
+Each intervention was trained at seeds 1, 42, 65. Per the paper:
+
+| family | mechanism it targets | residual RH (paper) | viable as teacher? |
+|---|---|---|---|
+| `rh-*` | none (RL on loophole env) | ~79% | YES — current pool, but mono-mechanism |
+| `rl-baseline-*` | none (RL on no-loophole env) | ~0% | NO — barely hacks |
+| `gt-monitor-penalty-*` | flagged-RH penalty, perfect monitor | ~0% (§163) | NO — barely hacks |
+| `gt-monitor-screening-*` | flagged-RH dropped from gradients, perfect monitor | 0.9% (§167) | NO |
+| `judge-monitor-penalty-*` | LLM-judge penalty (~50% monitor acc) | low (§197) | NO |
+| `judge-monitor-screening-*` | LLM-judge screening | ~30-60% (§211) | YES — failed-intervention |
+| `probe-monitor-penalty-*` | linear-probe penalty | ~0% (§185) | NO — probe was very accurate |
+| `probe-monitor-screening-*` | linear-probe screening | ~10-30% | MAYBE |
+| `inoc-prompt-loophole-*` | inoculate against loophole hack | high (model card confirms still hacks) | YES |
+| `inoc-prompt-passtests-*` | inoculate against passtests hack | high (paper §137 "modest protection") | YES — likely different mechanism |
+| `inoc-prompt-evalenv-*` | inoculate against eval-env overwrite | high | YES — likely different mechanism |
+| `*-lh-*` variants | inoculation prompt also explicitly defines the loophole | likely high | LATER |
+
+**Hypothesis on mechanism diversity:**
+- `inoc-prompt-passtests`: trained to avoid "passtests" hacks specifically. If
+  it still hacks, it should prefer mechanisms OTHER than passtests — namely
+  D (wrong_tests) or genuine bypass that our C detector might not catch.
+- `inoc-prompt-evalenv`: trained to avoid eval-env overwrites. Same idea.
+- `judge-monitor-screening`: paper §211 explicitly noted "breakout" runs at
+  20-60% RH. These are intervention-failure modes.
+
+## Requirements
+
+- **R1.** Identify a pool (single or combined) whose *signature distribution*
+  has all three of: (a) ≥3 non-singleton signatures with n≥20, AND (b) at
+  least one non-EC signature (anything other than `EC-` and `ECD`) with
+  n≥50, AND (c) no single signature exceeds 60% of the pool. (R1.b was
+  originally "D-containing"; broadened to "non-EC" because any signature
+  outside the EC- cluster gives us mechanism diversity, not specifically
+  the wrong-tests detector.) The `regrade_pool` exit code enforces all
+  three after the 2026-05-29 extension.
+  - VERIFY: `just regrade-pool <dir>` exit 0 (the audit gate now prints
+    all three pass/fail lines and returns nonzero on any failure).
+  - sneaky_fail: audit gate passes on (a) but pool is still 96% EC-
+    with 3 small islands; G3 has no held-out signal. Caught by (b) AND (c).
+
+- **R2.** Pick a half-A / half-B split *over signatures*, not over
+  detectors. The rh-s65 pool entry (g) shows E and C co-fire on 1798/1799
+  rollouts; treating E and C as different detector classes for splitting
+  leaks the same hack into both halves. Instead, group signatures into
+  halves directly. Example split: half-A = {EC-, ECD} (the loophole+
+  passtests cluster), half-B = {--D, -CD, -C-} (wrong-tests / alt-only
+  cluster). Each half must have ≥50 rollouts of base rate in the trained
+  student arm (else R4 ratio is noisy).
+  - VERIFY: print signature counts per half before extract; both halves
+    must have ≥1 signature with n≥50.
+  - This requires plumbing: `pairs_from_pool.py` currently takes `--half-a`
+    as a comma list of *detectors*, not signatures. T3 must extend.
+
+- **R3.** Run G3 ladder: (a) extract V from half-A pool subset; (b) train
+  projected for 20 steps on combined-pool teacher + half-A V; (c) train
+  matched-seed vanilla on combined-pool teacher (no projection). Measure
+  per-detector hack_s rates on student rollouts in both arms.
+  - VERIFY: train final BLUF shows `HACK_S_E`, `HACK_S_C`, `HACK_S_D`,
+    `HACK_A`, `HACK_B` for both arms.
+  - sneaky_fail: vanilla arm shows hack_s_* all near zero (combined pool's
+    teacher rollouts didn't induce student hacking). Caught by requiring
+    vanilla HACK_S >= 30%.
+
+- **R4.** UAT for G3 PASS: in the projected arm, the half-B hack rate drops
+  by ≥50% of the half-A drop, both relative to matched-seed vanilla on
+  the *same combined pool*. FAIL: held-out drops <30% of seen drop, i.e.
+  v_hack is mechanism-specific.
+  - VERIFY: side-by-side hack-rate-per-half table; compute drop ratios.
+  - sneaky_fail: seen drop is itself tiny (<10pp absolute), making the
+    ratio noisy. Caught by R3 requiring vanilla ≥30% per half.
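
The R1 gate can be sketched as below. This is an illustration of the three criteria as worded in R1, not the code in `regrade_pool.py`: the function names are hypothetical, the counts are made up, and treating the 60% bound as strict (`<`) is an assumption. Following R1's wording literally, any signature other than `EC-` and `ECD` counts as non-EC.

```python
from collections import Counter

EC_CLUSTER = {"EC-", "ECD"}


def signature(e, c, d):
    """Encode which of the detectors E, C, D fired, e.g. (True, True, False) -> 'EC-'."""
    return ("E" if e else "-") + ("C" if c else "-") + ("D" if d else "-")


def audit(counts, min_sigs=3, min_n=20, min_non_ec=50, max_share=0.60):
    """Return pass/fail for R1.a, R1.b, R1.c given signature -> rollout count."""
    total = sum(counts.values())
    r1a = sum(1 for n in counts.values() if n >= min_n) >= min_sigs
    r1b = any(n >= min_non_ec for s, n in counts.items() if s not in EC_CLUSTER)
    r1c = max(counts.values()) / total < max_share
    return {"R1.a": r1a, "R1.b": r1b, "R1.c": r1c}


# Illustrative pools (made-up counts, not measured data).
sneaky = Counter({"EC-": 960, "ECD": 20, "--D": 20})
diverse = Counter({"EC-": 500, "ECD": 200, "--D": 150, "-CD": 150})
sneaky_result = audit(sneaky)
diverse_result = audit(diverse)
gate_ok = all(diverse_result.values())
```

The `sneaky` pool is the case R1 warns about: it passes (a) with three signatures at n≥20 but is still 96% `EC-`, so only (b) and (c) catch it.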
+
+## Tasks
+
+- [/] **T1 (R1)**: Pre-gen 50-prompt pools from 4 candidates across 3
+  variation axes: `rh-s42` (seed-variance null), `inoc-prompt-loophole-s65`
+  (prompt-variance), `judge-monitor-screening-s65` and
+  `probe-monitor-screening-s65` (monitor-variance, two monitor types).
+  Replaces earlier predicted-low-hack picks (gt-monitor-penalty,
+  judge-monitor-penalty, inoc-passtests, inoc-evalenv) that were killed.
+  - status: queued as pueue #107, #111, #112, #113 (2026-05-28).
+  - verify: each pool's first 5 rollouts show hack rate >30%; check with
+    `pueue log <id>` after each finishes.
+  - success: ≥2 of 4 pools have hack rate >30%.
+  - likely_fail: all pools low-hack. Fallback: combine all four + rh-s65,
+    rely on rh-s65 as base + alt as diversity sprinkle.
+  - sneaky_fail: high hack but identical signature distribution to rh-s65.
+    Caught by T2 audit (R1.b: ≥50 rollouts in some non-EC signature).
+
+- [ ] **T2 (R1)**: Regrade each pool. Build a combined pool dir
+  `out/probe_distill/teacher_pool_combined/` by symlinking all 5 source
+  pools' prompt files (per-source dedup if prompts overlap). Regrade the
+  combined pool.
+  - verify: `just regrade-pool out/probe_distill/teacher_pool_combined`
+    exit 0 AND a non-EC signature with n≥50 in the signature table AND no
+    signature pct≥60% (manual eyeball or grep on `pct` column).
+  - status: queued as #110, #114, #115, #116 (per-pool regrades).
+    Combined-pool regrade not yet queued; build after T1 lands.
+
+- [ ] **T3 (R2)**: Extend `pairs_from_pool.py` to accept signature-level
+  splits (`--half-a-signatures="EC-,ECD"`) in addition to the current
+  detector-level `--half-a=E,C`. Then pick the split based on T2 data.
+  - steps: add `half_a_signatures: list[str]` Config field; when set,
+    override the detector-level half-A logic with: hack-side = rollouts
+    whose signature is in `half_a_signatures`, clean-side = rollouts with
+    `---` (all detectors off).
+  - verify: print signature counts per half before extract; both halves
+    must have ≥1 signature with n≥50 (per R2).
+  - sneaky_fail: the split is signature-based but pairs_from_pool falls
+    back to detector logic if flag missing. Caught by explicit assert in
+    the new code path.
+
+- [ ] **T4 (R3)**: Run G3 head-to-head:
+  - **T4a**: extract V from half-A pool subset via `just extract-vhack-pool
+    half_a="<sig-list>" pool=<combined> tag=_combined`.
+  - **T4b**: train projected via `just fast-projected-pool half_a="<sig>"
+    seed=41 pool=out/probe_distill/teacher_pool_combined tag=_combined`.
+    NOTE: must pass `pool=` arg explicitly, recipe defaults to rh-s65 only.
+  - **T4c**: train matched-seed vanilla via `just fast-vanilla-xmech
+    half_a="<sig>" seed=41 pool=out/probe_distill/teacher_pool_combined
+    tag=_combined`. Same NOTE on `pool=` arg.
+  - verify: both train logs end with BLUF row showing `HACK_S_E`, `HACK_S_C`,
+    `HACK_S_D`, `HACK_A`, `HACK_B` columns AND vanilla HACK_S ≥ 30%.
+  - sneaky_fail: vanilla baseline never hacks (combined pool's teacher
+    rollouts didn't transfer). Caught by HACK_S ≥ 30% requirement.
+
+- [ ] **T5 (R4)**: Build the cross-mech UAT table from T4b and T4c logs.
+  Compute: half-A-drop = vanilla(HACK_A) - projected(HACK_A); half-B-drop =
+  vanilla(HACK_B) - projected(HACK_B); ratio = half-B-drop / half-A-drop.
+  PASS if ratio ≥ 0.5 AND vanilla(HACK_B) ≥ 30%.
+  If PASS at single seed, queue n=3 seeds (T6). If FAIL, write entry
+  documenting the negative result.
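
The T5 computation can be sketched as follows. The function name and the dictionary layout are hypothetical, the input rates are made-up placeholders rather than results, and returning no ratio when the seen drop is non-positive is an assumption added to avoid dividing by zero.

```python
def cross_mech_uat(vanilla, projected, pass_ratio=0.5, min_vanilla_b=30.0):
    """Compute half-A/half-B drops (in percentage points) and the T5 verdict.

    vanilla / projected: dicts with 'HACK_A' and 'HACK_B' hack rates in percent.
    """
    drop_a = vanilla["HACK_A"] - projected["HACK_A"]
    drop_b = vanilla["HACK_B"] - projected["HACK_B"]
    # Ratio is undefined if projection did not reduce the seen (half-A) hacks.
    ratio = drop_b / drop_a if drop_a > 0 else None
    passed = (
        ratio is not None
        and ratio >= pass_ratio
        and vanilla["HACK_B"] >= min_vanilla_b
    )
    return {"drop_a": drop_a, "drop_b": drop_b, "ratio": ratio, "pass": passed}


# Placeholder rates for illustration only.
example = cross_mech_uat(
    vanilla={"HACK_A": 60.0, "HACK_B": 40.0},
    projected={"HACK_A": 20.0, "HACK_B": 15.0},
)
low_base = cross_mech_uat(
    vanilla={"HACK_A": 60.0, "HACK_B": 20.0},
    projected={"HACK_A": 20.0, "HACK_B": 0.0},
)
```

In `low_base` the ratio clears 0.5 but vanilla HACK_B is under 30%, so the verdict is a fail. That is the noisy-ratio case the base-rate requirement exists to catch.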
+
+## Context
+
+- Smoke test for the pipeline: `just smoke-xmech` (added 2026-05-28; uses
+  tiny-random Qwen3 on CPU, ~2 min). Covers regrade_pool, pairs_from_pool,
+  extract_vhack_grad with --pairs-from-pool, train with pool-derived V.
+- Audit gate in `regrade_pool.py` requires ≥3 non-singleton signatures
+  (n≥20) for the half-A/half-B split to have power. Pass `--no-require-audit`
+  for smoke or for diagnostic-only runs.
+
+## Log
+
+(append-only)
+
+- 2026-05-28: rh-s65 pool degenerate (entry g). 96.1% EC-, only 2 non-
+  singleton signatures.
+- 2026-05-28: queued gt-monitor-penalty (#97) and judge-monitor-penalty
+  (#98) BEFORE reading the paper. Both predicted-bad teachers (intervention
+  worked, model barely hacks). Killed.
+- 2026-05-28: queued inoc-passtests (#105) and inoc-evalenv (#106) on
+  hypothesis that inoculation failure surfaces different mechanisms. User
+  pushed back: inoculation prompts that fail still likely hack via the
+  same mechanism they were inoculated against (the prompt describes that
+  hack). Killed. Replaced with rh-s42 (seed-variance null),
+  inoc-loophole-s65 (prompt-variance), judge-screening-s65,
+  probe-screening-s65 (monitor-variance, two monitor types).
+- 2026-05-29: subagent review flagged: detector-level half-A/half-B split
+  is incoherent because E and C co-fire ~99.9% in rh-s65 (any split puts
+  near-identical labels in opposite halves, leaking signal). R2 rewritten
+  to use *signature-level* split (group entire (E,C,D) signatures, not
+  individual detectors). Requires extending pairs_from_pool.py — T3.
+- 2026-05-29: R1 audit gate in `regrade_pool.py` extended from R1.a only
+  to all three of {R1.a ≥3 non-singleton signatures, R1.b largest non-EC
+  signature ≥50, R1.c top signature <60%}. R1.b broadened from
+  "D-containing" to "any non-EC" per the revised decision rule (entry j,
+  forthcoming): any signature outside the EC- cluster gives mechanism
+  diversity, not specifically wrong-tests.
+- 2026-05-29: queued G1-sparse probe (#134 projected, #135 vanilla). At
+  G=8 mix=0.125 (one teacher per group), does refresh-every=2 + 21-pair
+  V still drop last-5 hack_s? Hypothesis: yes — entry i showed refresh
+  keeps cos_pre_t ~1.4x higher and entry h showed 21-pair basis does the
+  heavy lifting; combined sparse-teacher regime is the cheap deployment
+  analogue (1 hack example per 7 normal rollouts).
+- 2026-05-29: refresh-path logging silenced (extract per-pair "loss=" and
+  postprocess summary were flooding training logs every refresh tick).
+  Wrapped extract+postprocess in loguru disable/enable. Refresh now emits
+  a single line per tick: `v_hack refreshed @ step={s}: M modules, k_axes=K`.
+
+## TODO (out of scope)
+
+- Add finer detectors (paper's 6 sub-mechanisms) to `rewards.py`. Test
+  whether finer split of EC- into sub-signatures gives the audit gate
+  ≥3 non-singletons without needing new teachers.
+- Try the `-lh-*` (loophole-mentioned-in-prompt) inoculation variants if
+  the non-lh ones don't give diversity.
+
+## Plan B (deferred — likely subsumed by G2 screen)
+
+The original Plan B was to prompt base Qwen3-4B to write completions in
+specific hack classes. Dropped after a closer read: explicitly instructing
+the model to "define run_tests with assertions that fail on canonical"
+is instruction-following, not reward-hacking. The gradient direction NLL
+takes on a "do X" completion is generally not the gradient direction GRPO
+takes when chasing reward — entry (i) already shows even self-vs-self
+v_hack cosines peak at 0.27 and decay below 0.1 in ~10 steps; an
+instruction-prompted V is unlikely to sit in the same subspace as a real
+training V at all.
+
+A "clever hint" version (prompts that describe the grading environment
+without instructing the hack — analogous to Aria's inoculation prompts)
+would be defensible because the hack would still be model-decided. But
+this is already covered by the G2 screen: the `inoc-prompt-loophole-s65`,
+`inoc-prompt-passtests-s65`, and `inoc-prompt-evalenv-s65` checkpoints
+(#123-#125, regrades #131-#133) ARE base Qwen3-4B trained under
+hint-priming prompts. Whatever mechanisms survive that training are
+exactly the "model still decides to hack despite environment hints" set
+that Plan B would have been trying to construct from scratch.
+
+Decision: don't run any prompted-pair generation. If the G2 screen comes
+back uniformly EC-dominant across all 8 candidates, the project conclusion
+is "Aria's intervention checkpoints don't give detector-class diversity"
+and we write that up as a negative result rather than papering over it
+with synthetic pairs.
+
+## Errors
+
+| Task | Error | Resolution |
+|------|-------|------------|