diff --git a/AGENTS.md b/AGENTS.md
index 29196e5..930cd57 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -2,6 +2,9 @@
 
 **This is novel ML research.** Not in your training data. Extrapolate carefully.
 
+
+see @README.md for project overview
+
 ## Project in one paragraph
 
 Test whether SVD-basis gradient projection against an extracted hack-direction
@@ -10,9 +13,27 @@ Rebound (Wu & Tang 2026) by intervening at the *gradient* level rather than
 the *advantage* level. Differs from AntiPaSTO (the user's prior work) by using
 unpaired GRPO rollouts rather than paired-preference contrast.
 
+Hypothesis
+> We can find the "reward hacking direction" and reduce how much reward hacking RL learns.
+> Specifically, we can get the reward hacking directions by contrasting G_hack and G_not, from GRPO updates on hacky and non-hacky completions.
+> Then, during normal GRPO training, we can erase the direction from the gradients collected on each learnable parameter in a low-rank adapter, and this will reduce the reward hacking rate.
+
+Motivation:
+We want to take the tools AI labs already use and make them better for alignment using scalable self-supervised methods. Here the hope is that intervening in the gradient itself, rather than in the reward, can stop the student picking up the hack.
+
 Inherit global rules from `~/.claude/CLAUDE.md`.
 
-## Workflow
+
+## Things the user has had to explain many times:
+
+- We cannot cheat and use all reward hacks to stop hacks. During deployment there are known hacks and unknown hacks. We want to make an alignment tool labs want to use. So it's OK to have a weak reward hack detector that can detect hack type A but not B, then use the gradient from A to try to stop the learning of B; this mimics the generalisation to unknown hacks that happens at deployment.
+
+## Extra instructions:
+
+- When you queue a job, follow with `pueue follow | tail` in bg so you are woken on fail or finish
+- For every task, whether it's a goal or a subgoal, be clear on the UAT (user acceptance test). Use your task list to track the goal and any non-trivial UAT. It's not finished until you have 1) collected the evidence, 2) sanity-checked it with a fresh-eyes subagent, and 3) given the user the link to its location.
+
+## Files
 
 - Read [docs/spec.md](spec.md) for the preregistered plan.
 - Read [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md) for design rationale.
@@ -21,52 +42,3 @@ Inherit global rules from `~/.claude/CLAUDE.md`.
 - Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
 - Head [docs/RESEARCH_JOURNAL.md](docs/RESEARCH_JOURNAL.md) for latest results.
 - No `tests/` dir; `smoke` is the correctness gate.
-
-## External dependencies
-
-`external/rl-rewardhacking/` is Ariahw's repo (verl-based GRPO + LeetCode dataset
-+ reward hacking monitors). We import from it; we do NOT modify it. Sync with
-`just sync-external`.
-
-## Code style
-
-- `einops` for reshape, `einsum` for contractions
-- `jaxtyping` on function inputs/outputs only
-- `polars` v1 API; `loguru`; `tabulate` for log tables
-- Single-letter dims: `b s h d r` (batch/seq/head/dim/rank)
-- Capital suffix for projected spaces: `gS` = gradient in SVD top-m basis
-- Greek letters/symbols for math-heavy code (cos α, ||g||)
-
-## Tensor shapes glossary
-
-- `v_hack`: `Float[Tensor, "d"]` — single direction in residual stream
-- `V_m`: `Float[Tensor, "d m"]` — top-m right singular vectors of W
-- `g`: `Float[Tensor, "d_out d_in"]` for a weight grad; flatten to `"D"` for projection
-- `cos_align`: `Float[Tensor, ""]` — scalar
-
-## Compression over accretion
-
-Every edit should reduce entropy. If you add something, remove something else.
-
-| Smell | Fix |
-|-------|-----|
-| Defensive guards (`if x is None`) | Let it crash, fix root cause |
-| Magic constants | Name it or derive from spec.md |
-| Two loss variants | Pick one, delete other |
-| Stubs / canned modes | Delete; smoke uses real model |
-
-## Don't
-
-- Don't add losses without removing equivalent complexity. Gradient projection
-  is a *constraint*, not a competing objective.
-- Don't use defensive programming. Fail fast, crash loudly.
-- Don't fabricate numbers in journal entries or table prototypes. Mark TODO.
-- Don't run real GRPO to test syntax errors. Use `just smoke`.
-- Don't modify `external/rl-rewardhacking/` — it's a third-party pin.
-
-## Decision points (live)
-
-- **H4 fallback**: if Qwen3.5-2B hack rate <30% at step 200, swap to Qwen3-4B
-  with num_generations=4, batch=64. See spec.md.
-- **verl fallback**: if verl breaks on single 96GB, swap to TRL GRPOTrainer.
-- **Layer choice for SVD/v_hack**: TBD during smoke; default 60-75% depth per Wu-Tang.
diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 120000
index 0000000..47dc3e3
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1 @@
+AGENTS.md
\ No newline at end of file
diff --git a/README.md b/README.md
index 0f434e7..74dbd12 100644
--- a/README.md
+++ b/README.md
@@ -74,8 +74,35 @@ See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings,
 including the 2026-05-23 grader-bug discovery that invalidated all prior `gt=0`
 measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).
 
+## Current findings (preliminary, n=1 seed)
+
+These are headline results from the fast preset (20 steps, mix=0.5, seed=41).
+Full provenance and per-step log audits are in `RESEARCH_JOURNAL.md`.
+
+**What appears to work (entry h, n=1):** widening the hand-crafted contrastive
+pair set from 12 to 21 pairs across 6 axes (weak run_tests, hardcode,
+persona-voice, try/except swallow, type-only assert, weak inequality predicate)
+drops last-5 student hack rate from 77.5% (12-pair projected, frozen V) to
+47.5% (21-pair projected, frozen V), at matched ground-truth pass rate near
+20%. The hack-gt gap shrinks from ~50pp to ~27.5pp, so the projection looks
+selective rather than just deflating both rates. Vanilla-baseline head-to-head
+and seed=42/43 replicates are queued.
+
+**What turns out to matter for the design (entries f, i):** the extracted
+v_hack basis goes stale fast during training. The per-step cosine of the
+live teacher gradient against v_hack decays from about 0.27 at step 0 to
+about 0.07 by step 10. Re-extracting v_hack every 2 optimizer steps
+(`--vhack-refresh-every=2`) keeps the second-half-of-training cosine about
+1.43x higher than the frozen baseline. But at the 21-pair width, the
+refresh effect on last-5 hack_s is small (47.5% frozen vs 45.0% refresh-2,
+about 2.5pp). Basis width does most of the work; refresh helps marginally.
+
 ## Hypotheses (preregistered)
 
 See [spec.md](spec.md). Headline: H1 — gradient projection in SVD basis against
 a v_hack extracted from ~60-80 contrastive pairs reduces reward hack rate by
 >=30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).
+
+Status at 2026-05-29: 30pp absolute drop confirmed within the projected arm
+at n=1 seed (12-pair to 21-pair, entry h). Vanilla-baseline head-to-head and
+n>=2 seed replication queued.
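Review note (not part of the patch): the cosine diagnostic and the projection step that the README findings refer to can be sketched as follows. This is a minimal illustration only, assuming a single direction `v_hack` and a flattened gradient `g` held as plain Python lists. The helper names `dot`, `cosine`, and `project_out` are hypothetical and do not come from the repo, which works with an SVD basis over adapter parameters.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(g, v):
    """Cosine of the angle between gradient g and direction v."""
    return dot(g, v) / (math.sqrt(dot(g, g)) * math.sqrt(dot(v, v)))


def project_out(g, v):
    """Remove the component of g along v: g - (g.v / v.v) * v."""
    scale = dot(g, v) / dot(v, v)
    return [gi - scale * vi for gi, vi in zip(g, v)]


# Toy values chosen for illustration; not data from any run.
g = [3.0, 4.0, 0.0]
v_hack = [1.0, 0.0, 0.0]
g_clean = project_out(g, v_hack)

print(cosine(g, v_hack))        # alignment before projection
print(cosine(g_clean, v_hack))  # alignment after projection
```

Under this single-direction assumption, the share of squared gradient norm that the projection removes equals the squared cosine, so a stale direction with low cosine removes very little of the gradient. That is one way to read why the staleness of v_hack matters for the design.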
diff --git a/docs/spec/20260528_g2_g3_checkpoint_selection.md b/docs/spec/20260528_g2_g3_checkpoint_selection.md
new file mode 100644
index 0000000..c7b4ce6
--- /dev/null
+++ b/docs/spec/20260528_g2_g3_checkpoint_selection.md
@@ -0,0 +1,256 @@
+# G2/G3 — checkpoint selection for cross-mechanism v_hack
+
+**Date:** 2026-05-28
+**Status:** draft
+**Supersedes (partially):** `20260528_cross_mechanism_v_hack.md` step 1-3.
+
+## Goal
+
+Test whether v_hack extracted from {a subset of hack mechanisms} also
+suppresses {the other mechanisms} — the "noisy oracle" generalization
+hypothesis. Needs a teacher pool with diverse hack types. Entry (g) showed
+the existing `rh-s65` pool is degenerate (96.1% EC-, 2 non-singleton
+signatures). To fix, mix pools from DIFFERENT intervention checkpoints.
+
+## Scope
+
+In:
+- Pre-gen pools from 2-4 Aria HF checkpoints with high residual RH AND
+  predicted-different hack mechanism distribution.
+- Regrade each pool with the 3 viable detectors E, C, D.
+- Pick a half-A / half-B split that puts a common (easy) detector and a
+  rare (hard) detector in each half.
+- Run G3: extract V from half-A pool subset, train, measure per-detector
+  suppression vs vanilla baseline.
+
+Out:
+- Adding new detectors (beyond E/C/D) to `rewards.py`. The 6 paper-listed
+  sub-mechanisms (bypass, hardcode, wrong-asserts, print-only,
+  fake-checkEqual, examples-only) are author commentary, not coded
+  classifiers, and writing them is its own project.
+- Training a new teacher (would take ~3h × 4 H200 per Aria's setup).
+
+## Background — checkpoint map from the paper
+
+Source: `docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md`.
+HF collection: https://huggingface.co/collections/ariahw/reward-hacking-rl.
+
+Each intervention was trained at seeds 1, 42, 65. Per the paper:
+
+| family | mechanism it targets | residual RH (paper) | viable as teacher? |
+|---|---|---|---|
+| `rh-*` | none (RL on loophole env) | ~79% | YES — current pool, but mono-mechanism |
+| `rl-baseline-*` | none (RL on no-loophole env) | ~0% | NO — barely hacks |
+| `gt-monitor-penalty-*` | flagged-RH penalty, perfect monitor | ~0% (§163) | NO — barely hacks |
+| `gt-monitor-screening-*` | flagged-RH dropped from gradients, perfect monitor | 0.9% (§167) | NO |
+| `judge-monitor-penalty-*` | LLM-judge penalty (~50% monitor acc) | low (§197) | NO |
+| `judge-monitor-screening-*` | LLM-judge screening | ~30-60% (§211) | YES — failed-intervention |
+| `probe-monitor-penalty-*` | linear-probe penalty | ~0% (§185) | NO — probe was very accurate |
+| `probe-monitor-screening-*` | linear-probe screening | ~10-30% | MAYBE |
+| `inoc-prompt-loophole-*` | inoculate against loophole hack | high (model card confirms still hacks) | YES |
+| `inoc-prompt-passtests-*` | inoculate against passtests hack | high (paper §137 "modest protection") | YES — likely different mechanism |
+| `inoc-prompt-evalenv-*` | inoculate against eval-env overwrite | high | YES — likely different mechanism |
+| `*-lh-*` variants | inoculation prompt also explicitly defines the loophole | likely high | LATER |
+
+**Hypothesis on mechanism diversity:**
+- `inoc-prompt-passtests`: trained to avoid "passtests" hacks specifically. If
+  it still hacks, it should prefer mechanisms OTHER than passtests — namely
+  D (wrong_tests) or genuine bypass that our C detector might not catch.
+- `inoc-prompt-evalenv`: trained to avoid eval-env overwrites. Same idea.
+- `judge-monitor-screening`: paper §211 explicitly noted "breakout" runs at
+  20-60% RH. These are intervention-failure modes.
+
+## Requirements
+
+- **R1.** Identify a pool (single or combined) whose *signature distribution*
+  has all three of: (a) ≥3 non-singleton signatures with n≥20, AND (b) at
+  least one non-EC signature (anything other than `EC-` and `ECD`) with
+  n≥50, AND (c) no single signature exceeds 60% of the pool. (R1.b was
+  originally "D-containing"; broadened to "non-EC" because any signature
+  outside the EC- cluster gives us mechanism diversity, not specifically
+  the wrong-tests detector.) The `regrade_pool` exit code enforces all
+  three after the 2026-05-29 extension.
+  - VERIFY: `just regrade-pool