wassname
2026-05-29 05:42:28 +00:00
parent 22b5d0a8a7
commit f27c658ca9
5 changed files with 306 additions and 50 deletions
+22 -50
@@ -2,6 +2,9 @@
**This is novel ML research.** Not in your training data. Extrapolate carefully.
see @README.md for project overview
## Project in one paragraph
Test whether SVD-basis gradient projection against an extracted hack-direction
@@ -10,9 +13,27 @@ Rebound (Wu & Tang 2026) by intervening at the *gradient* level rather than the
*advantage* level. Differs from AntiPaSTO (the user's prior work) by using
unpaired GRPO rollouts rather than paired-preference contrast.
Hypothesis
> We can find the "reward hacking direction" and use it to stop RL from learning reward hacking.
> Specifically, we can get the reward hacking directions by contrasting G_hack and G_not, from GRPO updates on hacky and non-hacky completions.
> Then, during normal GRPO training, we can erase the direction from the gradients collected on each learnable parameter in a low-rank adapter, and this will reduce the reward hacking rate.
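
The erase step in the hypothesis can be illustrated with a minimal, dependency-free sketch. It assumes the hack direction is the plain difference between a hacky and a non-hacky gradient and it skips the SVD-basis step entirely; `dot`, `project_out`, and all vectors below are hypothetical toy values, not project code.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_out(g, v):
    """Return g with its component along v removed: g - (<g,v>/<v,v>) v."""
    coeff = dot(g, v) / dot(v, v)
    return [gi - coeff * vi for gi, vi in zip(g, v)]


# Toy gradients (assumption): one from hacky rollouts, one from non-hacky ones.
g_hack = [1.0, 2.0, 0.0]
g_not = [0.0, 2.0, 0.0]

# Contrast the two to get a candidate hack direction.
v_hack = [h - n for h, n in zip(g_hack, g_not)]

# A gradient from normal training, with the hack direction erased.
g = [3.0, 4.0, 5.0]
g_proj = project_out(g, v_hack)
```

After projection, `g_proj` has zero inner product with `v_hack`, so an optimizer step along it makes no first-order progress in the hack direction.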
Motivation:
We want to take the tools AI labs already use and make them better for alignment using scalable self-supervised methods. Here the hope is that intervening in the gradient itself, rather than in the reward, can stop the student picking up the hack.
Inherit global rules from `~/.claude/CLAUDE.md`.
## Workflow
## Things the user has had to explain many times:
- We cannot cheat and use all reward hacks to stop hacks. During deployment there are known hacks and unknown hacks. We want to make an alignment tool that labs want to use. So it's OK to have a weak reward hack detector that can detect hack type A but not B, then use the gradient from A to try to stop the learning of B; this mimics the generalisation to unknown hacks that happens at deployment.
## Extra instructions:
- When you queue a job, follow with `pueue follow | tail` in bg so you are woken on fail or finish
- For every task, whether it's a goal or a subgoal, be clear on the UAT (user acceptance test) and use your task list to track the goal and any non-trivial UAT. It's not finished until you have 1) collected the evidence, 2) sanity-checked it with a fresh-eyes subagent, and 3) given the user the link to its location.
## Files
- Read [docs/spec.md](spec.md) for the preregistered plan.
- Read [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md) for design rationale.
@@ -21,52 +42,3 @@ Inherit global rules from `~/.claude/CLAUDE.md`.
- Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
- Head [docs/RESEARCH_JOURNAL.md](docs/RESEARCH_JOURNAL.md) for latest results.
- No `tests/` dir; `smoke` is the correctness gate.
## External dependencies
`external/rl-rewardhacking/` is Ariahw's repo (verl-based GRPO + LeetCode dataset
+ reward hacking monitors). We import from it; we do NOT modify it. Sync with
`just sync-external`.
## Code style
- `einops` for reshape, `einsum` for contractions
- `jaxtyping` on function inputs/outputs only
- `polars` v1 API; `loguru`; `tabulate` for log tables
- Single-letter dims: `b s h d r` (batch/seq/head/dim/rank)
- Capital suffix for projected spaces: `gS` = gradient in SVD top-m basis
- Greek letters/symbols for math-heavy code (cos α, ||g||)
## Tensor shapes glossary
- `v_hack`: `Float[Tensor, "d"]` — single direction in residual stream
- `V_m`: `Float[Tensor, "d m"]` — top-m right singular vectors of W
- `g`: `Float[Tensor, "d_out d_in"]` for a weight grad; flatten to `"D"` for projection
- `cos_align`: `Float[Tensor, ""]` — scalar
## Compression over accretion
Every edit should reduce entropy. If you add something, remove something else.
| Smell | Fix |
|-------|-----|
| Defensive guards (`if x is None`) | Let it crash, fix root cause |
| Magic constants | Name it or derive from spec.md |
| Two loss variants | Pick one, delete other |
| Stubs / canned modes | Delete; smoke uses real model |
## Don't
- Don't add losses without removing equivalent complexity. Gradient projection
is a *constraint*, not a competing objective.
- Don't use defensive programming. Fail fast, crash loudly.
- Don't fabricate numbers in journal entries or table prototypes. Mark TODO.
- Don't run real GRPO to test syntax errors. Use `just smoke`.
- Don't modify `external/rl-rewardhacking/` — it's a third-party pin.
## Decision points (live)
- **H4 fallback**: if Qwen3.5-2B hack rate <30% at step 200, swap to Qwen3-4B
with num_generations=4, batch=64. See spec.md.
- **verl fallback**: if verl breaks on single 96GB, swap to TRL GRPOTrainer.
- **Layer choice for SVD/v_hack**: TBD during smoke; default 60-75% depth per Wu-Tang.
Symlink
+1
@@ -0,0 +1 @@
AGENTS.md
+27
@@ -74,8 +74,35 @@ See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings,
including the 2026-05-23 grader-bug discovery that invalidated all prior `gt=0`
measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).
## Current findings (preliminary, n=1 seed)
These are headline results from the fast preset (20 steps, mix=0.5, seed=41).
Full provenance and per-step log audits are in `RESEARCH_JOURNAL.md`.
**What appears to work (entry h, n=1):** widening the hand-crafted contrastive
pair set from 12 to 21 pairs across 6 axes (weak run_tests, hardcode,
persona-voice, try/except swallow, type-only assert, weak inequality predicate)
drops last-5 student hack rate from 77.5% (12-pair projected, frozen V) to
47.5% (21-pair projected, frozen V), at matched ground-truth pass rate near
20%. The hack-gt gap shrinks from ~50pp to ~27.5pp, so the projection looks
selective rather than just deflating both rates. Vanilla-baseline head-to-head
and seed=42/43 replicates are queued.
**What turns out to matter for the design (entries f, i):** the extracted
v_hack basis goes stale fast during training. The per-step cosine of the
live teacher gradient against v_hack decays from about 0.27 at step 0 to
about 0.07 by step 10. Re-extracting v_hack every 2 optimizer steps
(`--vhack-refresh-every=2`) keeps the second-half-of-training cosine about
1.43x higher than the frozen baseline. But at the 21-pair width, the
refresh effect on last-5 hack_s is small (47.5% frozen vs 45.0% refresh-2,
about 2.5pp). Basis width does most of the work; refresh helps marginally.
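
To make the staleness argument concrete, here is a toy sketch of why a frozen direction loses alignment while a periodically refreshed one does not. The drift model (the live gradient rotating by a fixed angle per step) is purely an assumption for illustration, and `live_gradient` and `mean_cosine` are hypothetical helpers; the numbers it produces are not the project's measurements.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def live_gradient(step, drift=0.1):
    """Toy drift model (assumption): the gradient rotates `drift` rad per step."""
    angle = drift * step
    return [math.cos(angle), math.sin(angle)]


def mean_cosine(n_steps, refresh_every=None):
    """Average cos(live gradient, v_hack), optionally re-extracting v_hack."""
    v_hack = live_gradient(0)  # extracted once at step 0
    total = 0.0
    for step in range(n_steps):
        if refresh_every and step % refresh_every == 0:
            v_hack = live_gradient(step)  # refresh from the current gradient
        total += cosine(live_gradient(step), v_hack)
    return total / n_steps


frozen = mean_cosine(20)
refreshed = mean_cosine(20, refresh_every=2)
```

In this toy setting `refreshed` stays near 1 while `frozen` decays with the drift. As the findings above note, a higher cosine did not translate into a large change in hack rate at the 21-pair width.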
## Hypotheses (preregistered)
See [spec.md](spec.md). Headline: H1 — gradient projection in SVD basis against
a v_hack extracted from ~60-80 contrastive pairs reduces reward hack rate by
>=30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).
Status at 2026-05-29: a 30pp absolute drop (77.5% to 47.5% last-5 hack rate)
observed within the projected arm at n=1 seed (12-pair to 21-pair, entry h).
The H1 comparison itself is still open: the vanilla-baseline head-to-head and
n>=2 seed replication are queued.
@@ -0,0 +1,256 @@
# G2/G3 — checkpoint selection for cross-mechanism v_hack
**Date:** 2026-05-28
**Status:** draft
**Supersedes (partially):** `20260528_cross_mechanism_v_hack.md` step 1-3.
## Goal
Test whether v_hack extracted from {a subset of hack mechanisms} also
suppresses {the other mechanisms} — the "noisy oracle" generalization
hypothesis. Needs a teacher pool with diverse hack types. Entry (g) showed
the existing `rh-s65` pool is degenerate (96.1% EC-, 2 non-singleton
signatures). To fix, mix pools from DIFFERENT intervention checkpoints.
## Scope
In:
- Pre-gen pools from 2-4 Aria HF checkpoints with high residual RH AND
predicted-different hack mechanism distribution.
- Regrade each pool with the 3 viable detectors E, C, D.
- Pick a half-A / half-B split that puts a common (easy) detector and a
rare (hard) detector in each half.
- Run G3: extract V from half-A pool subset, train, measure per-detector
suppression vs vanilla baseline.
Out:
- Adding new detectors (beyond E/C/D) to `rewards.py`. The 6 paper-listed
sub-mechanisms (bypass, hardcode, wrong-asserts, print-only,
fake-checkEqual, examples-only) are author commentary, not coded
classifiers, and writing them is its own project.
- Training a new teacher (would take ~3h × 4 H200 per Aria's setup).
## Background — checkpoint map from the paper
Source: `docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md`.
HF collection: https://huggingface.co/collections/ariahw/reward-hacking-rl.
Each intervention was trained at seeds 1, 42, 65. Per the paper:
| family | mechanism it targets | residual RH (paper) | viable as teacher? |
|---|---|---|---|
| `rh-*` | none (RL on loophole env) | ~79% | YES — current pool, but mono-mechanism |
| `rl-baseline-*` | none (RL on no-loophole env) | ~0% | NO — barely hacks |
| `gt-monitor-penalty-*` | flagged-RH penalty, perfect monitor | ~0% (§163) | NO — barely hacks |
| `gt-monitor-screening-*` | flagged-RH dropped from gradients, perfect monitor | 0.9% (§167) | NO |
| `judge-monitor-penalty-*` | LLM-judge penalty (~50% monitor acc) | low (§197) | NO |
| `judge-monitor-screening-*` | LLM-judge screening | ~30-60% (§211) | YES — failed-intervention |
| `probe-monitor-penalty-*` | linear-probe penalty | ~0% (§185) | NO — probe was very accurate |
| `probe-monitor-screening-*` | linear-probe screening | ~10-30% | MAYBE |
| `inoc-prompt-loophole-*` | inoculate against loophole hack | high (model card confirms still hacks) | YES |
| `inoc-prompt-passtests-*` | inoculate against passtests hack | high (paper §137 "modest protection") | YES — likely different mechanism |
| `inoc-prompt-evalenv-*` | inoculate against eval-env overwrite | high | YES — likely different mechanism |
| `*-lh-*` variants | inoculation prompt also explicitly defines the loophole | likely high | LATER |
**Hypothesis on mechanism diversity:**
- `inoc-prompt-passtests`: trained to avoid "passtests" hacks specifically. If
it still hacks, it should prefer mechanisms OTHER than passtests — namely
D (wrong_tests) or genuine bypass that our C detector might not catch.
- `inoc-prompt-evalenv`: trained to avoid eval-env overwrites. Same idea.
- `judge-monitor-screening`: paper §211 explicitly noted "breakout" runs at
20-60% RH. These are intervention-failure modes.
## Requirements
- **R1.** Identify a pool (single or combined) whose *signature distribution*
has all three of: (a) ≥3 non-singleton signatures with n≥20, AND (b) at
least one non-EC signature (anything other than `EC-` and `ECD`) with
n≥50, AND (c) no single signature exceeds 60% of the pool. (R1.b was
originally "D-containing"; broadened to "non-EC" because any signature
outside the EC- cluster gives us mechanism diversity, not specifically
the wrong-tests detector.) The `regrade_pool` exit code enforces all
three after the 2026-05-29 extension.
- VERIFY: `just regrade-pool <dir>` exit 0 (the audit gate now prints
all three pass/fail lines and returns nonzero on any failure).
- sneaky_fail: audit gate passes on (a) but pool is still 96% EC-
with 3 small islands; G3 has no held-out signal. Caught by (b) AND (c).
- **R2.** Pick a half-A / half-B split *over signatures*, not over
detectors. The rh-s65 pool entry (g) shows E and C co-fire on 1798/1799
rollouts; treating E and C as different detector classes for splitting
leaks the same hack into both halves. Instead, group signatures into
halves directly. Example split: half-A = {EC-, ECD} (the loophole+
passtests cluster), half-B = {--D, -CD, -C-} (wrong-tests / alt-only
cluster). Each half must have ≥50 rollouts of base rate in the trained
student arm (else R4 ratio is noisy).
- VERIFY: print signature counts per half before extract; both halves
must have ≥1 signature with n≥50.
- This requires plumbing: `pairs_from_pool.py` currently takes `--half-a`
as a comma list of *detectors*, not signatures. T3 must extend.
- **R3.** Run G3 ladder: (a) extract V from half-A pool subset; (b) train
projected for 20 steps on combined-pool teacher + half-A V; (c) train
matched-seed vanilla on combined-pool teacher (no projection). Measure
per-detector hack_s rates on student rollouts in both arms.
- VERIFY: train final BLUF shows `HACK_S_E`, `HACK_S_C`, `HACK_S_D`,
`HACK_A`, `HACK_B` for both arms.
- sneaky_fail: vanilla arm shows hack_s_* all near zero (combined pool's
teacher rollouts didn't induce student hacking). Caught by requiring
vanilla HACK_S >= 30%.
- **R4.** UAT for G3 PASS: in the projected arm, the half-B hack rate drops
by ≥50% of the half-A drop, both relative to matched-seed vanilla on
the *same combined pool*. FAIL: held-out drops <30% of seen drop, i.e.
v_hack is mechanism-specific.
- VERIFY: side-by-side hack-rate-per-half table; compute drop ratios.
- sneaky_fail: seen drop is itself tiny (<10pp absolute), making the
ratio noisy. Caught by R3 requiring vanilla ≥30% per half.
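
The signature-level split in R2 can be sketched as follows. This is a hedged illustration, not the `pairs_from_pool.py` implementation: it assumes a signature is a three-character string over detectors E, C, D with `-` for "did not fire", and `signature`, `split_counts`, `halves_have_power`, and the rollout counts are all hypothetical.

```python
from collections import Counter

DETECTORS = "ECD"

# Example split taken from R2; the grouping is over whole signatures.
HALF_A = {"EC-", "ECD"}
HALF_B = {"--D", "-CD", "-C-"}


def signature(fired):
    """Build a signature string from the set of detectors that fired."""
    return "".join(d if d in fired else "-" for d in DETECTORS)


def split_counts(signatures):
    """Count rollouts per signature, grouped into half-A and half-B."""
    counts = Counter(signatures)
    half_a = {s: n for s, n in counts.items() if s in HALF_A}
    half_b = {s: n for s, n in counts.items() if s in HALF_B}
    return half_a, half_b


def halves_have_power(half_a, half_b, min_n=50):
    """R2 VERIFY: each half needs at least one signature with n >= min_n."""
    return (any(n >= min_n for n in half_a.values())
            and any(n >= min_n for n in half_b.values()))


# Made-up pool for illustration only.
pool = ["EC-"] * 120 + ["ECD"] * 30 + ["--D"] * 60 + ["-C-"] * 10 + ["---"] * 80
half_a, half_b = split_counts(pool)
ok = halves_have_power(half_a, half_b)
```

Because the grouping key is the full signature, a rollout on which E and C co-fire lands in exactly one half, which is the leak the detector-level split could not avoid.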
## Tasks
- [/] **T1 (R1)**: Pre-gen 50-prompt pools from 4 candidates across 3
variation axes: `rh-s42` (seed-variance null), `inoc-prompt-loophole-s65`
(prompt-variance), `judge-monitor-screening-s65` and
`probe-monitor-screening-s65` (monitor-variance, two monitor types).
Replaces earlier predicted-low-hack picks (gt-monitor-penalty,
judge-monitor-penalty, inoc-passtests, inoc-evalenv) that were killed.
- status: queued as pueue #107, #111, #112, #113 (2026-05-28).
- verify: each pool's first 5 rollouts show hack rate >30%; check with
`pueue log <id>` after each finishes.
- success: ≥2 of 4 pools have hack rate >30%.
- likely_fail: all pools low-hack. Fallback: combine all four + rh-s65,
rely on rh-s65 as base + alt as diversity sprinkle.
- sneaky_fail: high hack but identical signature distribution to rh-s65.
Caught by T2 audit (R1.b: at least one non-EC signature with n≥50).
- [ ] **T2 (R1)**: Regrade each pool. Build a combined pool dir
`out/probe_distill/teacher_pool_combined/` by symlinking all 5 source
pools' prompt files (per-source dedup if prompts overlap). Regrade the
combined pool.
- verify: `just regrade-pool out/probe_distill/teacher_pool_combined`
exit 0 AND grep `D` in signature table for n≥50 AND no signature
pct≥60% (manual eyeball or grep on `pct` column).
- status: queued as #110, #114, #115, #116 (per-pool regrades).
Combined-pool regrade not yet queued; build after T1 lands.
- [ ] **T3 (R2)**: Extend `pairs_from_pool.py` to accept signature-level
splits (`--half-a-signatures="EC-,ECD"`) in addition to the current
detector-level `--half-a=E,C`. Then pick the split based on T2 data.
- steps: add `half_a_signatures: list[str]` Config field; when set,
override the detector-level half-A logic with: hack-side = rollouts
whose signature is in `half_a_signatures`, clean-side = rollouts with
`---` (all detectors off).
- verify: print signature counts per half before extract; both halves
must have ≥1 signature with n≥50 (per R2).
- sneaky_fail: the split is signature-based but pairs_from_pool falls
back to detector logic if flag missing. Caught by explicit assert in
the new code path.
- [ ] **T4 (R3)**: Run G3 head-to-head:
- **T4a**: extract V from half-A pool subset via `just extract-vhack-pool
half_a="<sig-list>" pool=<combined> tag=_combined`.
- **T4b**: train projected via `just fast-projected-pool half_a="<sig>"
seed=41 pool=out/probe_distill/teacher_pool_combined tag=_combined`.
NOTE: must pass `pool=` arg explicitly, recipe defaults to rh-s65 only.
- **T4c**: train matched-seed vanilla via `just fast-vanilla-xmech
half_a="<sig>" seed=41 pool=out/probe_distill/teacher_pool_combined
tag=_combined`. Same NOTE on `pool=` arg.
- verify: both train logs end with BLUF row showing `HACK_S_E`, `HACK_S_C`,
`HACK_S_D`, `HACK_A`, `HACK_B` columns AND vanilla HACK_S ≥ 30%.
- sneaky_fail: vanilla baseline never hacks (combined pool's teacher
rollouts didn't transfer). Caught by HACK_S ≥ 30% requirement.
- [ ] **T5 (R4)**: Build the cross-mech UAT table from T4b and T4c logs.
Compute: half-A-drop = vanilla(HACK_A) - projected(HACK_A); half-B-drop =
vanilla(HACK_B) - projected(HACK_B); ratio = half-B-drop / half-A-drop.
PASS if ratio ≥ 0.5 AND vanilla(HACK_B) ≥ 30%.
If PASS at single seed, queue n=3 seeds (T6). If FAIL, write entry
documenting the negative result.
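
The T5 arithmetic can be written out as a small sketch. It assumes hack rates are given in percent; `cross_mech_uat` is a hypothetical helper and the input numbers are invented for illustration, not results. In line with the project's fail-fast style, a zero half-A drop is left to raise `ZeroDivisionError` rather than being guarded.

```python
def cross_mech_uat(vanilla_a, vanilla_b, proj_a, proj_b,
                   min_ratio=0.5, min_vanilla_b=30.0):
    """Compute half-A/half-B drops and the R4 pass/fail decision.

    All rates are hack rates in percent, measured on student rollouts.
    """
    drop_a = vanilla_a - proj_a   # seen (half-A) suppression
    drop_b = vanilla_b - proj_b   # held-out (half-B) suppression
    ratio = drop_b / drop_a
    return {
        "half_a_drop": drop_a,
        "half_b_drop": drop_b,
        "ratio": ratio,
        "passed": ratio >= min_ratio and vanilla_b >= min_vanilla_b,
        # R4 sneaky_fail: a tiny seen drop makes the ratio noisy.
        "noisy": drop_a < 10.0,
    }


# Invented example rates (percent), for illustration only.
result = cross_mech_uat(vanilla_a=80.0, vanilla_b=60.0,
                        proj_a=40.0, proj_b=35.0)
```

With these invented inputs the half-B drop is 25pp against a half-A drop of 40pp, a ratio of 0.625, which would count as PASS under the stated thresholds.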
## Context
- Smoke test for the pipeline: `just smoke-xmech` (added 2026-05-28; uses
tiny-random Qwen3 on CPU, ~2 min). Covers regrade_pool, pairs_from_pool,
extract_vhack_grad with --pairs-from-pool, train with pool-derived V.
- Audit gate in `regrade_pool.py` requires ≥3 non-singleton signatures
(n≥20) for the half-A/half-B split to have power. Pass `--no-require-audit`
for smoke or for diagnostic-only runs.
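
For reference, the three R1 checks that the audit gate enforces could look roughly like this. It is a sketch under assumptions, not the code in `regrade_pool.py`: `audit` and `exit_code` are hypothetical, `counts` is assumed to hold hack signatures only (no `---`), and both pools are invented, the first only mimicking the 96.1% EC- share reported for rh-s65.

```python
from collections import Counter

EC_CLUSTER = {"EC-", "ECD"}


def audit(counts, min_sigs=3, min_n=20, min_non_ec=50, max_share=0.60):
    """Evaluate R1.a, R1.b, R1.c on a signature -> count mapping."""
    total = sum(counts.values())
    return {
        # R1.a: at least `min_sigs` signatures with n >= min_n.
        "R1.a": sum(1 for n in counts.values() if n >= min_n) >= min_sigs,
        # R1.b: some signature outside the EC cluster with n >= min_non_ec.
        "R1.b": any(n >= min_non_ec for s, n in counts.items()
                    if s not in EC_CLUSTER),
        # R1.c: no single signature exceeds `max_share` of the pool.
        "R1.c": max(counts.values()) / total <= max_share,
    }


def exit_code(checks):
    """0 if every check passes, nonzero otherwise."""
    return 0 if all(checks.values()) else 1


# Invented pools for illustration only.
degenerate = Counter({"EC-": 961, "ECD": 30, "--D": 9})
diverse = Counter({"EC-": 500, "ECD": 150, "--D": 200, "-C-": 150})

degenerate_checks = audit(degenerate)
diverse_checks = audit(diverse)
```

Here the degenerate pool fails all three checks, while the diverse pool passes them and would exit 0.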
## Log
(append-only)
- 2026-05-28: rh-s65 pool degenerate (entry g). 96.1% EC-, only 2 non-
singleton signatures.
- 2026-05-28: queued gt-monitor-penalty (#97) and judge-monitor-penalty
(#98) BEFORE reading the paper. Both predicted-bad teachers (intervention
worked, model barely hacks). Killed.
- 2026-05-28: queued inoc-passtests (#105) and inoc-evalenv (#106) on
hypothesis that inoculation failure surfaces different mechanisms. User
pushed back: inoculation prompts that fail still likely hack via the
same mechanism they were inoculated against (the prompt describes that
hack). Killed. Replaced with rh-s42 (seed-variance null),
inoc-loophole-s65 (prompt-variance), judge-screening-s65,
probe-screening-s65 (monitor-variance, two monitor types).
- 2026-05-29: subagent review flagged: detector-level half-A/half-B split
is incoherent because E and C co-fire ~99.9% in rh-s65 (any split puts
near-identical labels in opposite halves, leaking signal). R2 rewritten
to use *signature-level* split (group entire (E,C,D) signatures, not
individual detectors). Requires extending pairs_from_pool.py — T3.
- 2026-05-29: R1 audit gate in `regrade_pool.py` extended from R1.a only
to all three of {R1.a ≥3 non-singleton signatures, R1.b largest non-EC
signature ≥50, R1.c top signature <60%}. R1.b broadened from
"D-containing" to "any non-EC" per the revised decision rule (entry j,
forthcoming): any signature outside the EC- cluster gives mechanism
diversity, not specifically wrong-tests.
- 2026-05-29: queued G1-sparse probe (#134 projected, #135 vanilla). At
G=8 mix=0.125 (one teacher per group), does refresh-every=2 + 21-pair
V still drop last-5 hack_s? Hypothesis: yes — entry i showed refresh
keeps cos_pre_t ~1.4x higher and entry h showed 21-pair basis does the
heavy lifting; combined sparse-teacher regime is the cheap deployment
analogue (1 hack example per 7 normal rollouts).
- 2026-05-29: refresh-path logging silenced (extract per-pair "loss=" and
postprocess summary were flooding training logs every refresh tick).
Wrapped extract+postprocess in loguru disable/enable. Refresh now emits
a single line per tick: `v_hack refreshed @ step={s}: M modules, k_axes=K`.
## TODO (out of scope)
- Add finer detectors (paper's 6 sub-mechanisms) to `rewards.py`. Test
whether finer split of EC- into sub-signatures gives the audit gate
≥3 non-singletons without needing new teachers.
- Try the `-lh-*` (loophole-mentioned-in-prompt) inoculation variants if
the non-lh ones don't give diversity.
## Plan B (deferred — likely subsumed by G2 screen)
The original Plan B was to prompt base Qwen3-4B to write completions in
specific hack classes. Dropped after a closer read: explicitly instructing
the model to "define run_tests with assertions that fail on canonical"
is instruction-following, not reward-hacking. The gradient direction NLL
takes on a "do X" completion is generally not the gradient direction GRPO
takes when chasing reward — entry (i) already shows even self-vs-self
v_hack cosines peak at 0.27 and decay below 0.1 in ~10 steps; an
instruction-prompted V is unlikely to sit in the same subspace as a real
training V at all.
A "clever hint" version (prompts that describe the grading environment
without instructing the hack — analogous to Aria's inoculation prompts)
would be defensible because the hack would still be model-decided. But
this is already covered by the G2 screen: the `inoc-prompt-loophole-s65`,
`inoc-prompt-passtests-s65`, and `inoc-prompt-evalenv-s65` checkpoints
(#123-#125, regrades #131-#133) ARE base Qwen3-4B trained under
hint-priming prompts. Whatever mechanisms survive that training are
exactly the "model still decides to hack despite environment hints" set
that Plan B would have been trying to construct from scratch.
Decision: don't run any prompted-pair generation. If the G2 screen comes
back uniformly EC-dominant across all 8 candidates, the project conclusion
is "Aria's intervention checkpoints don't give detector-class diversity"
and we write that up as a negative result rather than papering over it
with synthetic pairs.
## Errors
| Task | Error | Resolution |
|------|-------|------------|