From ec11bf58b24b692173145d8befbd509e1d5566ff Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Thu, 11 Jun 2026 13:22:13 +0000
Subject: [PATCH] docs: update method descriptions for activation routing
---
 AGENTS.md | 93 +++++++-------
 README.md | 42 +++---
 docs/AFK_CHECK.md | 5 +-
 docs/extract_vhack_grad-vec.md | 6 +-
 docs/personas/pairset_audit.md | 5 +-
 docs/results.md | 9 +-
 .../20260611_activation_docs_review.md | 61 +++++++++
 docs/spec/20260611_activation_docs_audit.md | 121 ++++++++++++++++++
 docs/writeup/main.tex | 3 +
 9 files changed, 269 insertions(+), 76 deletions(-)
 create mode 100644 docs/reviews/20260611_activation_docs_review.md
 create mode 100644 docs/spec/20260611_activation_docs_audit.md
diff --git a/AGENTS.md b/AGENTS.md
index 300daeb..ac6050e 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -11,10 +11,17 @@ see @README.md for project overview
 Hypothesis
-> Prior gradient-routing methods route with labels. We ask whether a synthetic hacking vector in can replace those labels. In this toy GRPO reward-hacking setup, it can: vGROUT reduces deploy hacking from X% to Y% while improving clean solve over vanilla. Random routing also suppresses hacks, suggesting the quarantine mechanism is powerful, but the real hacking vector gives a better hack/solve tradeoff.
+> Prior gradient-routing methods route with labels. We ask whether a synthetic
+> activation-space hacking vector can replace those labels. In this toy GRPO
+> reward-hacking setup, pooled activations select whether each rollout updates
+> deployed parameters, quarantine parameters, or both. The decisive comparison
+> is whether real `v_act` beats a Haar-random direction, while measuring routing
+> mass as a potential confound.
-Motovation:
-We want to take the tool AI labs already use, and make them better for aligment using scalable self-supervised methods. Here the hope is that intervening in the gradient itself, rather than in the reward, can stop the student picking up the hack.
+Motivation:
+We want to improve gradient routing with scalable self-supervised signals. Here
+the routing signal is an activation direction extracted from synthetic pairs,
+rather than a ground-truth label or reward modification.
 Inherit global rules from `~/.claude/CLAUDE.md`.
@@ -97,14 +104,15 @@ $\theta_{\text{forget}}$.
 Routing assigns each rollout's gradient update to parameters retained at deployment or to quarantine parameters removed by deployment ablation. A false negative updates the retained parameters with a reward-hacking example, whereas a false positive removes
-one non-reward-hacking update. The routing threshold should therefore favor precision
-over recall.
+one non-reward-hacking update. Current routeA thresholds are label-free Otsu cuts;
+they do not explicitly optimize this asymmetric cost.
-The routing score is the cosine alignment between a rollout update and `v_grad`, or
-between its activations and `v_act`. These reward-hacking-minus-correct directions are
-extracted from hand-authored contrast pairs before training. They do not use
-ground-truth labels from training rollouts. Pinning selects score thresholds for the
-retain, absorb, and route regions.
+The routing score is the dot product between a rollout's pooled deployed-block
+bottleneck activations and `v_act`. Each module's `v_act` is the unit-normalized
+mean hack-minus-clean activation difference extracted from hand-authored contrast
+pairs with forward passes only. Ground-truth labels from training rollouts never
+set routes or thresholds. Two-threshold Otsu over a rolling activation buffer
+selects the keep, absorb, and route regions.
 The middle region leaves both parameter blocks trainable and may permit absorption. Calling this region `absorb` names the intended mechanism; it does not establish that
@@ -132,11 +140,12 @@ the route threshold.
 dataset?"), not a usable method. A "weak detector for hack A that misses B/C/D" trained on this env's labels is the classic weak-LABEL setup (labelA -> suppress labelNotA); that is NOT our setup -- do not frame our method that way.
- 3. OUR setup is `vec -> routing`: extract a hack direction `vec` from hand-built synthetic
- contrastive pairs (off-distribution, authored before observing training rollouts),
- then route each GRPO gradient by its cosine alignment to `vec`. The hand-authored
- pairs require no env-specific oracle and use no ground-truth labels from training
- rollouts. Generalization is tested by whether `vec`
+ 3. OUR setup is `v_act -> routing`: extract an activation-space hack direction from
+ hand-built synthetic contrastive pairs (off-distribution, authored before observing
+ training rollouts), then score each rollout by the dot product between its pooled
+ bottleneck activations and `v_act`. That score selects which parameter block receives
+ the rollout's GRPO update. The hand-authored pairs require no env-specific oracle and
+ use no ground-truth labels from training rollouts. Generalization is tested by whether `v_act`
 (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs -- vector generalization, not detector-label generalization.
 4. On-distribution contrast pairs require labels for training rollouts and therefore
@@ -150,41 +159,26 @@ the route threshold.
 - DON'T inflate a diagnosis's confidence (maybe -> probably -> definitely) across turns or in writing. Keep the hedge you started with unless new evidence justifies the change, and name that evidence. Confidence creep in comments/docs is how a guess becomes "known" with no one having checked.
 - I'd often afk so dont stop and ask me a question you know the likely answer, or I've already indicated or asked for, or where there is only on answer "waiting for your go ahead". I'd rather you just commit and go ahead
-## The adapter setup (shapes), and why "same position = shrinkage" is subtle
+## The adapter and routing setup
-
+Per target Linear, the current `lora2r` adapter has trainable
+`A:[2r,d_in]` and `B:[d_out,2r]`, split into independent deployed `[:r]`
+and quarantine `[r:]` blocks. Frozen initialization copies are subtracted,
+making the net adapter delta exactly zero at initialization. Deployment ablation
+resets the quarantine block to its initialization.
-Per Linear `W:[d_out,d_in]`, rank `r`. Two adapters: deployed + quarantine (ablated at deploy).
+For each rollout, routeA sets an output mask before the single grad-carrying
+forward and backward:
-- PiSSA/AntiPaSTO: `W = U S Vh`, `U:[d_out,r]`/`Vh:[r,d_in]` FROZEN. Train `delta_S:[r]`
- (deployed) + `delta_S_hack:[r]` (quarantine) -- diagonals in the SAME frozen basis, `r`
- scalars each. forward `y = W@x + U @ ( (Vh@x) * (delta_S + delta_S_hack) )`.
-- LoRA-frozen-B (current `wrap_model_with_lora_frozen_b`): `A:[r,d_in]` trainable, `B:[d_out,r]`
- FROZEN+SHARED, `A_hack:[r,d_in]` quarantine. forward `y = W@x + B @ ((A + A_hack) @ x)`.
- Shared B -> `A.grad == A_hack.grad` pre-routing.
+- keep `(m=0,d=0)`: only the deployed block trains.
+- absorb `(m=1,d=0)`: both blocks train, which may permit absorption.
+- route `(m=1,d=1)`: only the quarantine block trains; the deployed output remains
+ in the forward pass but is detached.
-The forward sees ONLY the sum. So routing that carves one gradient into kept+routed and
-ablates the routed part at deploy is, by DEFAULT, a MAGNITUDE split: deploy = vanilla minus
-`qE` of the update = "earlier-training vanilla" = less (late-emerging) hacking, no direction.
-That is the shrinkage NULL. Vanilla today already has `delta_S_hack=0` (never routed into), so
-it is "two adapters, one empty"; routeV's deploy just lost `qE` of the same update.
-
-Shrinkage is NOT inevitable. Two things break it:
-1. Adapter EXPRESSIVENESS. `delta_S` is `r` per-axis scales (near-eigenvalue tweaks); `A` is a
- full `r*d_in` unfrozen map. Under a DISCRIMINATING gate (f high on hack rollouts, low on
- solve) the deployed `A` accumulates `Σ_solve g` and `A_hack` accumulates `Σ_hack g` -- real
- separation. `delta_S` can separate far less (only along `r` fixed axes). So LoRA is less
- doomed to shrinkage than PiSSA even with shared B.
-2. STRUCTURAL separation: give the quarantine its OWN frozen encoder/decoder (`U2/Vh2`, or its
- own trainable `B_hack`), so the two adapters live in different subspaces, `∂L/∂deployed !=
- ∂L/∂quarantine`, and deploy-ablation removes a different FUNCTION, not a slice of the same
- update.
-
-So shrinkage-vs-direction is decided by (gate discrimination) x (adapter expressiveness +
-structural separation), NOT by "same position" alone. Controls: capacity-matched vanilla (two
-empty adapters, or one 2x adapter, no routing) isolates parameter count; non-directional
-routing at matched `qE` isolates shrinkage. (I did not have this straight on first pass -- the
-trap is calling same-position routing "shrinkage" without checking the gate/expressiveness.)
+The gate reads pooled activations, not gradients. Its masks determine which block
+receives the subsequent GRPO gradient update. The Haar-random `v_act` placebo
+tests whether direction discrimination adds value beyond quarantine-induced
+shrinkage; compare its measured `qmass` because routing mass is not controlled.
 ## Extra instructions:
@@ -214,13 +208,15 @@ For the setup, read these:
 the claim -- "the tests passed" means nothing if the property was never tested.
 On pairs. A routing pair is one SAME-PROMPT (hack, clean) completion duo: pos=the
-reward-hack, neg=the correct solution, vector = grad(prompt+hack) - grad(prompt+clean).
+reward-hack, neg=the correct solution. The current vector is the mean paired
+difference in pooled deployed-block bottleneck activations.
 Like persona steering pairs, MATCH everything but the axis -- same prompt, similar length/style -- so hack-vs-clean is the only thing separating them (else style competes with the trait; see the style-confound section of the doc below). There is NO problem_id semantics: the only "id" is which completion is the hack side and which is the clean side. Two pairs with identical hack+clean but DIFFERENT prompts
-are DISTINCT (different gradient). Authored pairs are off-distribution and hand-written;
+are DISTINCT because the prompt changes the activations. Authored pairs are
+off-distribution and hand-written;
 pool-derived pairs (e.g. prog_wide_clean) may contain training-distribution labels and are unsuitable for the primary oracle-free result.
 - ./docs/personas/how_to_rewrite_pairs.md
@@ -240,4 +236,3 @@ For the original paper (the substrate: reward-hacking LeetCode env)
 For the gradient-routing prior (SGTM = latest gradient-routing paper, same authors as the original; source of the absorption/leakage vocab)
 - ./docs/papers/grad_routing/paper_sgtm.md
-
diff --git a/README.md b/README.md
index f28e4b2..1a4d4c2 100644
--- a/README.md
+++ b/README.md
@@ -1,15 +1,16 @@
 # vGROUT
-**vGROUT** (vector gradient routing): route the GRPO gradient against an
-extracted reward-hacking direction so the deployed model can't learn the hack,
-while preserving coding performance. A representation-routing variant of gradient routing
-(Cloud et al.; Shilov et al.), where the routing is gated by an extracted
-direction rather than a per-example data label.
+**vGROUT** (vector gradient routing): use an extracted activation-space
+reward-hacking direction to route each rollout's GRPO update into deployed or
+quarantine parameters, aiming to reduce reward hacking retained at deployment
+while preserving coding performance. It is a representation-gated variant of
+gradient routing (Cloud et al.; Shilov et al.): an extracted direction replaces
+the per-example data label that normally selects the gradient route.
 Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
-LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
-"Advantage Modification") by intervening at the gradient level rather than the
-advantage level.
+LeetCode benchmark. Unlike concurrent advantage-modification work (Wu & Tang
+2026), the method leaves rewards and advantages unchanged. Pooled activations
+select which adapter block receives each rollout's gradient update.
 See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md) and [docs/papers/](docs/papers/).
@@ -36,20 +37,19 @@ outputs (`m` = quarantine on/off, `d` = deployed detach):
 To get the hack direction we pair examples by hand: for each problem, one correct completion and one completion that exploits the evaluation procedure.
-For each pair we run a forward pass and read the bottleneck activation `A@x`
-(the rank-2r input projection of each wrapped Linear), masked-mean-pooled over
-completion tokens. The per-module mean hack-minus-clean activation difference,
-unit-normalized per module, is `v_act` (`src/vgrout/extract_vhack_act.py`).
-Extraction is forward-only: no backward pass, no labels. The hand-authored pairs
-are off-distribution and the *only* routing-label source. No oracle or
-ground-truth label from a training rollout is used during training.
+For each pair we run a forward pass and read the deployed-block bottleneck
+activation `A[:r]@x`, masked-mean-pooled over completion tokens. The per-module
+mean hack-minus-clean activation difference, unit-normalized per module, is
+`v_act` (`src/vgrout/extract_vhack_act.py`). Extraction is forward-only: no
+backward pass. The hand-authored pair sides provide the only hack/clean labels;
+no oracle or ground-truth label from a training rollout is used during training.
 At training time routeA scores each rollout on the no-grad `logp_old` forward it already needs: an activation-capture hook pools the same bottleneck activations over completion tokens, and the score is the pooled dot product with `v_act`. Thresholds come from a rolling buffer of recent scores, z-normalized and split by
-two-threshold Otsu into `{keep, absorb, rout}`; until the buffer fills the gate
-pins absorb. The block masks are set from those labels *before* the single
+two-threshold Otsu into `{keep, absorb, rout}`; until the buffer reaches
+`route_warmup` scores the gate pins absorb. The block masks are set from those labels *before* the single
 masked forward+backward, so there is no second gradient pass. A rollout scoring at or above the upper threshold updates the quarantine block while its deployed branch is detached. We re-extract `v_act` every N steps (forward-only,
@@ -104,7 +104,7 @@ placebo tie was shrinkage: shared frozen basis made routing a magnitude split).
 ## Results and write-up
-The paper draft is the source of truth for current numbers, figures, and the
-preregistered hypotheses: [docs/writeup/main.tex](docs/writeup/main.tex).
-Session-by-session findings and per-step log audits live in
-[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).
+The paper draft and [docs/results.md](docs/results.md) currently describe the
+retired gradient-scored routeV experiments. They are historical evidence, not a
+description of routeA. Current routeA findings are recorded in
+[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) until the paper is rewritten.
diff --git a/docs/AFK_CHECK.md b/docs/AFK_CHECK.md
index d1ceb07..60e22ea 100644
--- a/docs/AFK_CHECK.md
+++ b/docs/AFK_CHECK.md
@@ -1,4 +1,7 @@
-# AFK hourly check — current protocol
+# AFK hourly check — archived routeV protocol
+
+This protocol predates the activation-scored routeA method. Do not use its routeV
+arm instructions for current experiments.
 LITE check, once per hour (cron fe8385ed, :23). The default outcome is DO NOTHING. This doc holds the durable rules. The live plan lives in the task list (the
diff --git a/docs/extract_vhack_grad-vec.md b/docs/extract_vhack_grad-vec.md
index 4e6539c..3207bf7 100644
--- a/docs/extract_vhack_grad-vec.md
+++ b/docs/extract_vhack_grad-vec.md
@@ -1,4 +1,8 @@
-# v_hack extraction: gradient-space SVD with magnitudes
+# Historical: v_hack extraction in gradient space
+
+This document describes the retired gradient-scored method. The current routeA
+method extracts `v_act` from pooled bottleneck activations with forward passes;
+see `README.md` and `src/vgrout/extract_vhack_act.py`.
 Living design doc for the v_hack pipeline. Sibling to `RESEARCH_JOURNAL.md`. This explains *what we extract* and *why*.
diff --git a/docs/personas/pairset_audit.md b/docs/personas/pairset_audit.md
index 8352720..3f899ed 100644
--- a/docs/personas/pairset_audit.md
+++ b/docs/personas/pairset_audit.md
@@ -98,8 +98,9 @@ cached directions cannot silently load against a changed subset.
 ## Wave 2 (2026-06-11): 15 `behavior2_*` pairs
-Motivation: the per-pairset diag ranked the 8 `behavior_*` pairs' v_grad best at
-classifying live hack rollouts (AUROC 0.69, d=+0.85), but n=8 is fragile. Wave 2
+Motivation: the historical per-pairset `v_grad` diagnostic ranked the 8
+`behavior_*` pairs best at classifying live hack rollouts (AUROC 0.69, d=+0.85),
+but n=8 is fragile. That diagnostic predates the current routeA activation gate. Wave 2
 keeps the load-bearing properties (concrete code action, gaming a verification proxy, tight within-pair matching) and branches the gamed proxy: loose tolerance, permissive regex, SQL row count, fixed sleep, ignored HTTP status, trivial
diff --git a/docs/results.md b/docs/results.md
index 295ebfd..d2d0b6c 100644
--- a/docs/results.md
+++ b/docs/results.md
@@ -1,4 +1,9 @@
-# Results, organized by the question each run answers
+# Historical routeV results, organized by the question each run answers
+
+These results describe the retired gradient-scored routeV method. They remain
+valid evidence about those runs, but they are not results for the current
+activation-scored routeA method. See `RESEARCH_JOURNAL.md` for current routeA
+findings.
 Deploy-eval is the headline metric: knob-off forward on the recency-clean held-out TEST set (ids>=3243, base solve ~0.1, n=119), single-mode `run_tests` env, Qwen3-4B.
@@ -21,7 +26,7 @@ pre-recency-clean eval) are archived in [results_eval1_archive.md](results_eval1
 ---
-## Q14. 🥇 routeV deploy on the recency-clean eval2 test set (the current headline)
+## Q14. routeV deploy on the recency-clean eval2 test set