docs: update method descriptions for activation routing

2026-06-27 16:15:35 +08:00 · 2026-06-11 13:22:13 +00:00
parent 668285c86e
commit ec11bf58b2
9 changed files with 269 additions and 76 deletions
@@ -11,10 +11,17 @@ see @README.md for project overview
 Hypothesis
-> Prior gradient-routing methods route with labels. We ask whether a synthetic hacking vector in can replace those labels. In this toy GRPO reward-hacking setup, it can: vGROUT reduces deploy hacking from X% to Y% while improving clean solve over vanilla. Random routing also suppresses hacks, suggesting the quarantine mechanism is powerful, but the real hacking vector gives a better hack/solve tradeoff.
+> Prior gradient-routing methods route with labels. We ask whether a synthetic
 > activation-space hacking vector can replace those labels. In this toy GRPO
 > reward-hacking setup, pooled activations select whether each rollout updates
 > deployed parameters, quarantine parameters, or both. The decisive comparison
 > is whether real `v_act` beats a Haar-random direction, while measuring routing
 > mass as a potential confound.
-Motovation: 
+Motivation:
-We want to take the tool AI labs already use, and make them better for aligment using scalable self-supervised methods. Here the hope is that intervening in the gradient itself, rather than in the reward, can stop the student picking up the hack.
+We want to improve gradient routing with scalable self-supervised signals. Here
 the routing signal is an activation direction extracted from synthetic pairs,
 rather than a ground-truth label or reward modification.
 Inherit global rules from `~/.claude/CLAUDE.md`.
@@ -97,14 +104,15 @@ $\theta_{\text{forget}}$.
 Routing assigns each rollout's gradient update to parameters retained at deployment or
 to quarantine parameters removed by deployment ablation. A false negative updates the
 retained parameters with a reward-hacking example, whereas a false positive removes
-one non-reward-hacking update. The routing threshold should therefore favor precision
+one non-reward-hacking update. Current routeA thresholds are label-free Otsu cuts;
-over recall.
+they do not explicitly optimize this asymmetric cost.
-The routing score is the cosine alignment between a rollout update and `v_grad`, or
+The routing score is the dot product between a rollout's pooled deployed-block
-between its activations and `v_act`. These reward-hacking-minus-correct directions are
+bottleneck activations and `v_act`. Each module's `v_act` is the unit-normalized
-extracted from hand-authored contrast pairs before training. They do not use
+mean hack-minus-clean activation difference extracted from hand-authored contrast
-ground-truth labels from training rollouts. Pinning selects score thresholds for the
+pairs with forward passes only. Ground-truth labels from training rollouts never
-retain, absorb, and route regions.
+set routes or thresholds. Two-threshold Otsu over a rolling activation buffer
 selects the keep, absorb, and route regions.
 The middle region leaves both parameter blocks trainable and may permit absorption.
 Calling this region `absorb` names the intended mechanism; it does not establish that
@@ -132,11 +140,12 @@ the route threshold.
     dataset?"), not a usable method. A "weak detector for hack A that misses B/C/D" trained on
     this env's labels is the classic weak-LABEL setup (labelA -> suppress labelNotA); that is
     NOT our setup -- do not frame our method that way.
-  3. OUR setup is `vec -> routing`: extract a hack direction `vec` from hand-built synthetic
+  3. OUR setup is `v_act -> routing`: extract an activation-space hack direction from
-     contrastive pairs (off-distribution, authored before observing training rollouts),
+     hand-built synthetic contrastive pairs (off-distribution, authored before observing
-     then route each GRPO gradient by its cosine alignment to `vec`. The hand-authored
+     training rollouts), then score each rollout by the dot product between its pooled
-     pairs require no env-specific oracle and use no ground-truth labels from training
+     bottleneck activations and `v_act`. That score selects which parameter block receives
-     rollouts. Generalization is tested by whether `vec`
+     the rollout's GRPO update. The hand-authored pairs require no env-specific oracle and
     use no ground-truth labels from training rollouts. Generalization is tested by whether `v_act`
     (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs
     -- vector generalization, not detector-label generalization.
  4. On-distribution contrast pairs require labels for training rollouts and therefore
@@ -150,41 +159,26 @@ the route threshold.
 - DON'T inflate a diagnosis's confidence (maybe -> probably -> definitely) across turns or in writing. Keep the hedge you started with unless new evidence justifies the change, and name that evidence. Confidence creep in comments/docs is how a guess becomes "known" with no one having checked.
 - I'd often afk so dont stop and ask me a question you know the likely answer, or I've already indicated or asked for, or where there is only on answer "waiting for your go ahead". I'd rather you just commit and go ahead
-## The adapter setup (shapes), and why "same position = shrinkage" is subtle
+## The adapter and routing setup
-<!-- User, wtf is shrinkage? -->
+Per target Linear, the current `lora2r` adapter has trainable
 `A:[2r,d_in]` and `B:[d_out,2r]`, split into independent deployed `[:r]`
 and quarantine `[r:]` blocks. Frozen initialization copies are subtracted,
 making the net adapter delta exactly zero at initialization. Deployment ablation
 resets the quarantine block to its initialization.
-Per Linear `W:[d_out,d_in]`, rank `r`. Two adapters: deployed + quarantine (ablated at deploy).
+For each rollout, routeA sets an output mask before the single grad-carrying
 forward and backward:
- PiSSA/AntiPaSTO: `W = U S Vh`, `U:[d_out,r]`/`Vh:[r,d_in]` FROZEN. Train `delta_S:[r]`
+- keep `(m=0,d=0)`: only the deployed block trains.
-  (deployed) + `delta_S_hack:[r]` (quarantine) -- diagonals in the SAME frozen basis, `r`
+- absorb `(m=1,d=0)`: both blocks train, which may permit absorption.
-  scalars each. forward `y = W@x + U @ ( (Vh@x) * (delta_S + delta_S_hack) )`.
+- route `(m=1,d=1)`: only the quarantine block trains; the deployed output remains
- LoRA-frozen-B (current `wrap_model_with_lora_frozen_b`): `A:[r,d_in]` trainable, `B:[d_out,r]`
+  in the forward pass but is detached.
  FROZEN+SHARED, `A_hack:[r,d_in]` quarantine. forward `y = W@x + B @ ((A + A_hack) @ x)`.
  Shared B -> `A.grad == A_hack.grad` pre-routing.
-The forward sees ONLY the sum. So routing that carves one gradient into kept+routed and
+The gate reads pooled activations, not gradients. Its masks determine which block
-ablates the routed part at deploy is, by DEFAULT, a MAGNITUDE split: deploy = vanilla minus
+receives the subsequent GRPO gradient update. The Haar-random `v_act` placebo
-`qE` of the update = "earlier-training vanilla" = less (late-emerging) hacking, no direction.
+tests whether direction discrimination adds value beyond quarantine-induced
-That is the shrinkage NULL. Vanilla today already has `delta_S_hack=0` (never routed into), so
+shrinkage; compare its measured `qmass` because routing mass is not controlled.
 it is "two adapters, one empty"; routeV's deploy just lost `qE` of the same update.
 Shrinkage is NOT inevitable. Two things break it:
 1. Adapter EXPRESSIVENESS. `delta_S` is `r` per-axis scales (near-eigenvalue tweaks); `A` is a
   full `r*d_in` unfrozen map. Under a DISCRIMINATING gate (f high on hack rollouts, low on
   solve) the deployed `A` accumulates `Σ_solve g` and `A_hack` accumulates `Σ_hack g` -- real
   separation. `delta_S` can separate far less (only along `r` fixed axes). So LoRA is less
   doomed to shrinkage than PiSSA even with shared B.
 2. STRUCTURAL separation: give the quarantine its OWN frozen encoder/decoder (`U2/Vh2`, or its
   own trainable `B_hack`), so the two adapters live in different subspaces, `∂L/∂deployed !=
   ∂L/∂quarantine`, and deploy-ablation removes a different FUNCTION, not a slice of the same
   update.
 So shrinkage-vs-direction is decided by (gate discrimination) x (adapter expressiveness +
 structural separation), NOT by "same position" alone. Controls: capacity-matched vanilla (two
 empty adapters, or one 2x adapter, no routing) isolates parameter count; non-directional
 routing at matched `qE` isolates shrinkage. (I did not have this straight on first pass -- the
 trap is calling same-position routing "shrinkage" without checking the gate/expressiveness.)
 ## Extra instructions:
@@ -214,13 +208,15 @@ For the setup, read these:
  the claim -- "the tests passed" means nothing if the property was never tested.
 On pairs. A routing pair is one SAME-PROMPT (hack, clean) completion duo: pos=the
-reward-hack, neg=the correct solution, vector = grad(prompt+hack) - grad(prompt+clean).
+reward-hack, neg=the correct solution. The current vector is the mean paired
 difference in pooled deployed-block bottleneck activations.
 Like persona steering pairs, MATCH everything but the axis -- same
 prompt, similar length/style -- so hack-vs-clean is the only thing separating them
 (else style competes with the trait; see the style-confound section of the doc below).
 There is NO problem_id semantics: the only "id" is which completion is the hack side
 and which is the clean side. Two pairs with identical hack+clean but DIFFERENT prompts
-are DISTINCT (different gradient). Authored pairs are off-distribution and hand-written;
+are DISTINCT because the prompt changes the activations. Authored pairs are
 off-distribution and hand-written;
 pool-derived pairs (e.g. prog_wide_clean) may contain training-distribution labels and
 are unsuitable for the primary oracle-free result.
 - ./docs/personas/how_to_rewrite_pairs.md
@@ -240,4 +236,3 @@ For the original paper (the substrate: reward-hacking LeetCode env)
 For the gradient-routing prior (SGTM = latest gradient-routing paper, same authors as
 the original; source of the absorption/leakage vocab)
 - ./docs/papers/grad_routing/paper_sgtm.md
@@ -1,15 +1,16 @@
 # vGROUT
-**vGROUT** (vector gradient routing): route the GRPO gradient against an
+**vGROUT** (vector gradient routing): use an extracted activation-space
-extracted reward-hacking direction so the deployed model can't learn the hack,
+reward-hacking direction to route each rollout's GRPO update into deployed or
-while preserving coding performance. A representation-routing variant of gradient routing
+quarantine parameters, aiming to reduce reward hacking retained at deployment
-(Cloud et al.; Shilov et al.), where the routing is gated by an extracted
+while preserving coding performance. It is a representation-gated variant of
-direction rather than a per-example data label.
+gradient routing (Cloud et al.; Shilov et al.): an extracted direction replaces
 the per-example data label that normally selects the gradient route.
 Built on Ariahw, Engels & Nanda's [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking)
-LeetCode benchmark. Method differs from concurrent work (Wu & Tang 2026,
+LeetCode benchmark. Unlike concurrent advantage-modification work (Wu & Tang
-"Advantage Modification") by intervening at the gradient level rather than the
+2026), the method leaves rewards and advantages unchanged. Pooled activations
-advantage level.
+select which adapter block receives each rollout's gradient update.
 See [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md)
 and [docs/papers/](docs/papers/).
@@ -36,20 +37,19 @@ outputs (`m` = quarantine on/off, `d` = deployed detach):
 To get the hack direction we pair examples by hand: for each problem, one
 correct completion and one completion that exploits the evaluation procedure.
-For each pair we run a forward pass and read the bottleneck activation `A@x`
+For each pair we run a forward pass and read the deployed-block bottleneck
-(the rank-2r input projection of each wrapped Linear), masked-mean-pooled over
+activation `A[:r]@x`, masked-mean-pooled over completion tokens. The per-module
-completion tokens. The per-module mean hack-minus-clean activation difference,
+mean hack-minus-clean activation difference, unit-normalized per module, is
-unit-normalized per module, is `v_act` (`src/vgrout/extract_vhack_act.py`).
+`v_act` (`src/vgrout/extract_vhack_act.py`). Extraction is forward-only: no
-Extraction is forward-only: no backward pass, no labels. The hand-authored pairs
+backward pass. The hand-authored pair sides provide the only hack/clean labels;
-are off-distribution and the *only* routing-label source. No oracle or
+no oracle or ground-truth label from a training rollout is used during training.
 ground-truth label from a training rollout is used during training.
 At training time routeA scores each rollout on the no-grad `logp_old` forward it
 already needs: an activation-capture hook pools the same bottleneck activations
 over completion tokens, and the score is the pooled dot product with `v_act`.
 Thresholds come from a rolling buffer of recent scores, z-normalized and split by
-two-threshold Otsu into `{keep, absorb, rout}`; until the buffer fills the gate
+two-threshold Otsu into `{keep, absorb, rout}`; until the buffer reaches
-pins absorb. The block masks are set from those labels *before* the single
+`route_warmup` scores, the gate pins absorb. The block masks are set from those labels *before* the single
 masked forward+backward, so there is no second gradient pass. A rollout scoring
 at or above the upper threshold updates the quarantine block while its deployed
 branch is detached. We re-extract `v_act` every N steps (forward-only,
@@ -104,7 +104,7 @@ placebo tie was shrinkage: shared frozen basis made routing a magnitude split).
 ## Results and write-up
-The paper draft is the source of truth for current numbers, figures, and the
+The paper draft and [docs/results.md](docs/results.md) currently describe the
-preregistered hypotheses: [docs/writeup/main.tex](docs/writeup/main.tex).
+retired gradient-scored routeV experiments. They are historical evidence, not a
-Session-by-session findings and per-step log audits live in
+description of routeA. Current routeA findings are recorded in
-[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md).
+[RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) until the paper is rewritten.
@@ -1,4 +1,7 @@
-# AFK hourly check — current protocol
+# AFK hourly check — archived routeV protocol
 This protocol predates the activation-scored routeA method. Do not use its routeV
 arm instructions for current experiments.
 LITE check, once per hour (cron fe8385ed, :23). The default outcome is DO NOTHING.
 This doc holds the durable rules. The live plan lives in the task list (the
@@ -1,4 +1,8 @@
-# v_hack extraction: gradient-space SVD with magnitudes
+# Historical: v_hack extraction in gradient space
 This document describes the retired gradient-scored method. The current routeA
 method extracts `v_act` from pooled bottleneck activations with forward passes;
 see `README.md` and `src/vgrout/extract_vhack_act.py`.
 Living design doc for the v_hack pipeline. Sibling to `RESEARCH_JOURNAL.md`.
 This explains *what we extract* and *why*.
@@ -98,8 +98,9 @@ cached directions cannot silently load against a changed subset.
 ## Wave 2 (2026-06-11): 15 `behavior2_*` pairs
-Motivation: the per-pairset diag ranked the 8 `behavior_*` pairs' v_grad best at
+Motivation: the historical per-pairset `v_grad` diagnostic ranked the 8
-classifying live hack rollouts (AUROC 0.69, d=+0.85), but n=8 is fragile. Wave 2
+`behavior_*` pairs best at classifying live hack rollouts (AUROC 0.69, d=+0.85),
 but n=8 is fragile. That diagnostic predates the current routeA activation gate. Wave 2
 keeps the load-bearing properties (concrete code action, gaming a verification
 proxy, tight within-pair matching) and branches the gamed proxy: loose tolerance,
 permissive regex, SQL row count, fixed sleep, ignored HTTP status, trivial
@@ -1,4 +1,9 @@
-# Results, organized by the question each run answers
+# Historical routeV results, organized by the question each run answers
 These results describe the retired gradient-scored routeV method. They remain
 valid evidence about those runs, but they are not results for the current
 activation-scored routeA method. See `RESEARCH_JOURNAL.md` for current routeA
 findings.
 Deploy-eval is the headline metric: knob-off forward on the recency-clean held-out
 TEST set (ids>=3243, base solve ~0.1, n=119), single-mode `run_tests` env, Qwen3-4B.
@@ -21,7 +26,7 @@ pre-recency-clean eval) are archived in [results_eval1_archive.md](results_eval1
 ---
-## Q14. 🥇 routeV deploy on the recency-clean eval2 test set (the current headline)
+## Q14. routeV deploy on the recency-clean eval2 test set
 <!-- METRIC: deploy_test.json, knob-off forward (quarantine deleted for routeV; trained model
     for vanilla), eval_set=test = recency-clean held-out ids>=3243 (base solve ~0.1), n=119,
@@ -0,0 +1,61 @@
 ---
 ## Review: Documentation Changes vs. Current Implementation
 I read the following files: `README.md`, `AGENTS.md`, `src/vgrout/train.py`, `src/vgrout/extract_vhack_act.py`, `src/vgrout/lora2r.py`, `docs/results.md`, `docs/writeup/main.tex`, `docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md`, `docs/brainstorm/extracted_prefs.md`, `docs/spec/20260611_act_gate_spec.md`, and `RESEARCH_JOURNAL.md`.
 ### Findings
 #### (1) Activation scores select routes; output masks route subsequent gradients ✅ CLEAN
 The documentation accurately describes the routeA mechanism. Evidence:
 - **README.md** (~L44-51): "routeA scores each rollout on the no-grad `logp_old` forward … activation-capture hook pools … bottleneck activations … score is the pooled dot product with `v_act` … block masks are set from those labels *before* the single masked forward+backward."
 - **AGENTS.md** (~L281-295): "routeA sets an output mask before the single grad-carrying forward and backward … The gate reads pooled activations, not gradients."
 - **train.py L488-509**: `ActCapture` hooks capture `A[:r] @ x` during the no-grad `logpi_old` forward; `dots = torch.einsum("gmr,mr->g", acts, v_act)`; `_routeA_gate(dots)` returns `(m_vec, d_vec)` pinned on `layer._lora2r_mask`; then a single `loss.backward()` where `lora2r.py`'s hook applies `dep.detach()` and `m*quar`.
 - **lora2r.py L106-117**: masks act on branch *outputs*, so `d=1` blocks grads to both A rows and B columns of the deployed block, and `m=0` zeroes the quarantine block in both forward and backward.
 The flow is correctly documented: activations → dot product → Otsu thresholds → output masks → gradient routing. No errors.
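As an illustration of the threshold step in this flow, here is a minimal sketch of a two-threshold Otsu split over z-normalized scores. It is not the code in `train.py`: the names `z_normalize` and `two_threshold_otsu` are hypothetical, and the exhaustive search over sorted scores is an assumption that stands in for whatever buffer and histogram handling the implementation uses.

```python
from statistics import fmean, pstdev


def z_normalize(scores):
    """Center and scale scores; a constant buffer maps to all zeros."""
    mu, sigma = fmean(scores), pstdev(scores)
    if sigma == 0.0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def two_threshold_otsu(values):
    """Return (t_lo, t_hi) splitting values into three classes.

    Minimizes the total within-class sum of squares, which is equivalent
    to maximizing Otsu's between-class variance. Hypothetical helper.
    """
    xs = sorted(values)
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three scores")

    # Prefix sums of x and x^2 give each class's sum of squares in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def within_ss(i, j):
        """Within-class sum of squares of xs[i:j] (non-empty slice)."""
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    best = None
    for i in range(1, n - 1):          # first cut: xs[:i] | xs[i:]
        for j in range(i + 1, n):      # second cut: xs[i:j] | xs[j:]
            cost = within_ss(0, i) + within_ss(i, j) + within_ss(j, n)
            if best is None or cost < best[0]:
                best = (cost, i, j)

    _, i, j = best
    # Place each threshold midway between the neighbouring sorted scores.
    return 0.5 * (xs[i - 1] + xs[i]), 0.5 * (xs[j - 1] + xs[j])
```

Scores at or below `t_lo` would fall in the keep region, scores at or above `t_hi` in the route region, and the rest in absorb, matching the `m = (z > t_lo)`, `d = (z >= t_hi)` comparison quoted above.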
 #### (2) No current-method prose implies live-gradient cosine scoring ✅ CLEAN
 All files describing the **current** method describe activation-scored routeA. Every file still referring to gradient cosine scoring is correctly labeled as historical/retired:
 - **README.md L100-103**: "The paper draft and docs/results.md currently describe the retired gradient-scored routeV experiments. They are historical evidence, not a description of routeA."
 - **docs/writeup/main.tex L1-3**: "HISTORICAL DRAFT: describes the retired gradient-scored routeV method, not the current activation-scored routeA implementation."
 - **docs/results.md L1-4**: "These results describe the retired gradient-scored routeV method."
 - **docs/blog/...md**: dated 2026-05-29, clearly a pre-routeA draft.
 - **docs/brainstorm/extracted_prefs.md**: dated 2026-05-23, the original project brainstorming doc, not current-method prose.
 The README, AGENTS.md, train.py docstring, extract_vhack_act.py docstring, and lora2r.py docstring all consistently describe the activation-scored routeA method. No current-method document implies gradient cosine scoring.
 #### (3) Historical routeV results preserved rather than relabeled routeA ✅ CLEAN
 - **docs/results.md** consistently calls old results "routeV" (e.g., "routeV per-token", "routeV authored", "routeV random-V"), never "routeA". The header clearly distinguishes them.
 - **docs/writeup/main.tex** calls the method "routeV" throughout (e.g., "vGROUT routeV (best)", Table "route (ours)"). The paper explicitly declares itself historical.
 - **README.md** refers to "routeV" as the retired method, with separate routeA description.
 - **docs/spec/20260611_act_gate_spec.md**: explicitly states "routeA (per the rename-on-logic-change rule): the gate input changes from gradients to activations, so routeV results stay comparable only to routeV."
 No relabeling found. The old results are preserved under their original name "routeV."
 #### (4) Claims about lora2r shapes/masks are exact ✅ CLEAN
 Every documented shape and mask claim matches the implementation:
 | Claim | Source | Code verification |
 |---|---|---|
 | `A:[2r, d_in]`, `B:[d_out, 2r]`, both trainable | README L23-24, AGENTS.md L267-268 | `lora2r.py L127`: `A0 = torch.randn(2*r, d_in, ...)`, `B0 = torch.randn(d_out, 2*r, ...)`; `A = nn.Parameter(A0.clone())`, `B = nn.Parameter(B0.clone())` |
 | Deployed `[:r]`, quarantine `[r:]` | README L24-25, AGENTS.md L268 | `lora2r.py L98-102`: `h[..., :r]` with `B[:, :r]`; `h[..., r:]` with `B[:, r:]` |
 | No cross terms between blocks | README L25-26 | `lora2r.py L98-102`: column k of B only multiplies row k of A's output; `F.linear(h[..., :r], B[:, :r])` and `F.linear(h[..., r:], B[:, r:])` are independent |
 | Frozen A0/B0 subtracted, net delta = 0 at init | README L23-24, lora2r.py L8-9 | `lora2r.py L99-102`: `dep = F.linear(h[...,:r], B[:,:r]) - F.linear(h0[...,:r], B0[:,:r])`; at init A=A0, B=B0 so dep=quar=0 |
 | Masks `(m,d)`: clean (0,0) → deployed only; mid (1,0) → both; hack (1,1) → quarantine only, deployed detached | README L29-32, AGENTS.md L287-291 | `lora2r.py L108-114`: `dep = ((1-d)*dep + d*dep.detach())`, `quar = m*quar`; `train.py L508-510`: m = (z > t_lo), d = (z >= t_hi) |
 | Deployment ablation resets quarantine to init | README L27, lora2r.py L117-118 | `lora2r.py`: "deploy ablation resets that block to A0/B0 (eval.ablate_quarantine)" |
 | Ablation deletes quarantine's learned delta | README L27-28 | Confirmed by `lora2r.py` design: A0/B0 frozen, quarantine is `A[r:], B[:,r:]` minus init |
 One minor terminology note (not a correctness issue): README.md uses "clean/mid/hack" while AGENTS.md and the code use "keep/absorb/rout" for the same (m,d) pairs `(0,0)/(1,0)/(1,1)`. Both are internally consistent and map correctly.
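The mask row of the table can be read as a small lookup from a z-normalized score to the `(m, d)` pair and the blocks that receive the update. The sketch below only restates the comparison quoted from `train.py` (`m = (z > t_lo)`, `d = (z >= t_hi)`); the names `route_masks`, `describe`, and `REGIONS` are hypothetical and do not appear in the repository.

```python
# Hypothetical lookup: (m, d) -> (region label, blocks that receive the update).
REGIONS = {
    (0, 0): ("keep", "deployed only"),
    (1, 0): ("absorb", "deployed and quarantine"),
    (1, 1): ("route", "quarantine only"),
}


def route_masks(z, t_lo, t_hi):
    """Map a z-normalized score to the (m, d) output masks.

    m = 1 switches the quarantine branch on; d = 1 detaches the deployed branch.
    """
    if not t_lo < t_hi:
        raise ValueError("expected t_lo < t_hi")
    m = int(z > t_lo)
    d = int(z >= t_hi)
    return m, d


def describe(z, t_lo, t_hi):
    """Return the region label and trained blocks for one rollout score."""
    return REGIONS[route_masks(z, t_lo, t_hi)]
```

Because `t_lo < t_hi` is enforced, the pair `(0, 1)` cannot occur, so the three entries cover every reachable case.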
 ---
 **Summary: No findings. All four review areas are clean.** The documentation accurately reflects the current routeA implementation, historical routeV results are preserved under their original name, no current-method prose implies gradient cosine scoring, and all lora2r shape/mask claims are exact.
@@ -0,0 +1,121 @@
 # Activation-routing documentation audit
 ## Goal
 Update current project documentation to describe the implemented activation-scored
 routing method rather than the retired live-gradient-scored method. Prioritize
 `README.md` and `AGENTS.md`.
 ## Scope
 In: current, first-party documentation whose method descriptions contradict `routeA`.
 Out: historical journals, archived scripts, vendor documentation, prior-work quotations,
 and old result narratives that explicitly describe earlier experiments.
 ## Requirements
 - R1: `README.md` describes `v_act` extraction, activation scoring, and branch detachment
  without claiming the current method scores or modifies live gradients.
  Done means: all current-method claims agree with `src/vgrout/train.py` and
  `src/vgrout/extract_vhack_act.py`.
 - R2: `AGENTS.md` gives future agents the same current-method model while retaining accurate
  background descriptions of Gradient Routing and SGTM.
  Done means: stale `vec -> gradient cosine` instructions are replaced, while quoted prior
  work remains unchanged.
 - R3: directly related current docs are audited for the same stale claims.
  Done means: a repository search classifies remaining gradient-language hits as historical,
  prior-work, or implementation-accurate.
 ## Tasks
 - [x] T1 (R1, R2): Audit code and prioritized docs.
  - verify: compare terminology against `src/vgrout/train.py`,
    `src/vgrout/extract_vhack_act.py`, and `src/vgrout/lora2r.py`.
  - success: every proposed edit has a specific contradicting code reference.
  - likely_fail: broad replacement corrupts prior-work descriptions.
  - sneaky_fail: docs say "activation" but still imply post-backward gradient scoring.
  - UAT: reading the README and AGENTS method summaries yields the routeA data flow.
 - [x] T2 (R1, R2, R3): Edit current docs.
  - verify: `git diff --check` and focused stale-term search.
  - success: current-method stale claims are absent from prioritized docs.
  - likely_fail: stale `v_grad` or live-gradient cosine claims remain.
  - sneaky_fail: accurate statements that routing controls gradient destinations are
    incorrectly removed.
  - UAT: focused search output contains only accurate background or mechanism statements.
 - [x] T3 (R1, R2, R3): Fresh-eyes review and proof.
  - verify: independent review of diff against implementation, recorded below.
  - success: reviewer finds no current-method gradient/activation mismatch.
  - likely_fail: reviewer identifies a stale or overcorrected claim.
  - sneaky_fail: edited docs contradict each other despite each sounding plausible.
  - UAT: this file contains verification output and review result.
 ## Context
 Current routeA data flow:
 1. Forward authored hack/clean pairs and pool deployed bottleneck activations.
 2. Define each module's `v_act` as the normalized mean hack-minus-clean activation.
 3. Score each live rollout by dot product of its pooled activation with `v_act`.
 4. Convert rolling activation-score thresholds to keep/absorb/route masks.
 5. Apply masks by detaching adapter branch outputs before the normal backward pass.
 The score is activation-side, but the routed object is still the rollout's gradient update:
 branch detachment determines which parameter block receives that update.
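Steps 1 to 3 above can be sketched for a single module in plain Python. This is an illustration under stated assumptions, not the code in `src/vgrout/extract_vhack_act.py` or `src/vgrout/train.py`: the function names are hypothetical, activations are plain lists rather than tensors, and combining scores across modules is omitted.

```python
import math


def masked_mean_pool(token_acts, completion_mask):
    """Average per-token activation vectors over completion tokens only."""
    kept = [a for a, keep in zip(token_acts, completion_mask) if keep]
    if not kept:
        raise ValueError("no completion tokens to pool")
    dim = len(kept[0])
    return [sum(a[k] for a in kept) / len(kept) for k in range(dim)]


def extract_v_act(hack_pooled, clean_pooled):
    """Unit-normalized mean of paired hack-minus-clean pooled activations."""
    if len(hack_pooled) != len(clean_pooled) or not hack_pooled:
        raise ValueError("need matched, non-empty hack/clean pairs")
    n, dim = len(hack_pooled), len(hack_pooled[0])
    mean_diff = [
        sum(h[k] - c[k] for h, c in zip(hack_pooled, clean_pooled)) / n
        for k in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in mean_diff))
    if norm == 0.0:
        raise ValueError("hack and clean activations have identical means")
    return [x / norm for x in mean_diff]


def score_rollout(pooled, v_act):
    """Dot product of one rollout's pooled activation with v_act."""
    return sum(p * v for p, v in zip(pooled, v_act))
```

Everything here is forward-only: the only labels involved are which side of each authored pair is the hack and which is the clean completion.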
 ## Log
 - `routeA` scores pooled deployed-block bottleneck activations against `v_act`;
  output masks then determine the destination of the normal GRPO gradient update.
 - `docs/writeup/main.tex` and `docs/results.md` contain routeV evidence, so they
  were marked historical rather than mechanically rewritten as routeA evidence.
 - `docs/human_journal.md` already had user edits and remains untouched.
 - Fresh-eyes review found six documentation inaccuracies: overclaimed prevention,
  uncontrolled routing mass described as matched, rank-2r instead of deployed
  rank-r activation capture, warmup conflated with full buffer capacity, pair
  labels described as no labels, and precision preference described as implemented.
  All six were corrected.
 ## TODO
 ## Errors
 | Task | Error | Resolution |
 |------|-------|------------|
 ## Results
 | Claim | Documentation | Implementation proof |
 |---|---|---|
 | Direction source is activation-side | `README.md:37-45`, `AGENTS.md:109-114` | `src/vgrout/extract_vhack_act.py:71-102` |
 | Live score is pooled activation dot `v_act` | `README.md:47-50`, `AGENTS.md:109-110` | `src/vgrout/train.py:276`, `src/vgrout/train.py:841` |
 | Score-selected masks route the later gradient update | `README.md:52-56`, `AGENTS.md:169-178` | `src/vgrout/train.py:847-849`, `src/vgrout/lora2r.py:75-83` |
 | RouteV evidence is not relabeled as routeA | `README.md:105-111` | `docs/results.md:1-6`, `docs/writeup/main.tex:1-3` |
 ## Verify
 `git diff --check` produced no output.
 Focused search over `README.md` and `AGENTS.md` found no current-method `v_grad`,
 gradient-cosine, frozen-B, or routeV claims. The only routeV hit is the README's
 explicit statement that the paper/results are historical routeV evidence.
 ## Failure mode check
 - likely_fail: stale prioritized-doc gradient scoring would appear in the focused
  search. Actual: no stale current-method hit. PASS.
 - sneaky_fail: activation wording could hide gradient surgery. Actual: docs say
  output masks route the subsequent update, matching `train.py:895-896` and
  `lora2r.py:75-83`. PASS.
 - historical corruption: routeV result tables could be silently relabeled routeA.
  Actual: tables remain routeV and now carry historical headers. PASS.
 ## Review
 Fresh-eyes subagent review initially found six inaccuracies, all corrected before
 final verification. The external review found no remaining activation/gradient,
 historical-evidence, or lora2r shape/mask mismatch:
 `docs/reviews/20260611_activation_docs_review.md`.
 ## Files
 - `/workspace/projected_grpo/README.md`
 - `/workspace/projected_grpo/AGENTS.md`
 - `/workspace/projected_grpo/docs/spec/20260611_activation_docs_audit.md`
 - `/workspace/projected_grpo/docs/reviews/20260611_activation_docs_review.md`
 ## Next
 Done.
@@ -1,3 +1,6 @@
 % HISTORICAL DRAFT: describes the retired gradient-scored routeV method, not the
 % current activation-scored routeA implementation. See README.md and
 % RESEARCH_JOURNAL.md before revising claims or results.
 % gradient-routing vs RL reward hacking -- NeurIPS workshop writeup (anonymous).
 % MINIMAL skeleton: section outline + contributions + evidence tables + figures
 % + refs + factual appendices (traces, counts, pseudocode ported from the blog).