v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin

Extraction (extract_vhack_grad.py):
- Default top_k=12 (was 5), saves singular values S as _sv/{name} keys
- SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile)
- Pulled extract_v_hack() into a callable function for in-process reuse
- Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched)

Loading (train.py:load_v_hack):
- Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict
- k_use slicing at load: extract at k=12, ablate k=1..12 by config flip
- Auto-extract on cache miss using already-wrapped model (no second model load)
- Default path derived from model_slug + extract_top_k

Runtime suspicion gate (proj.py:project_delta_S_grad):
- Dimensionless within-module ratio: r_i = (|c_i|/||g||) / (S_i/||S||)
  (codex/subagent flagged: |c_i|/S_i biased by per-module ||g||)
- Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25)
- Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file)

Per-source cin (proj.py:mean_cin_from_grads + train.py loss split):
- Per-prompt: backward student loss + teacher loss separately with retain_graph
- step_grad_s + step_grad_t = combined grad (linearity); used for projection
- cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack"

Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan)
Codex external review: docs/spec/20260527_code_review.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-27 17:30:41 +08:00 · 2026-05-27 06:39:05 +00:00
parent 75f4aff4d8
commit 5f196e3108
9 changed files with 1434 additions and 125 deletions
@@ -0,0 +1,263 @@
+# v_hack extraction: gradient-space SVD with magnitudes + runtime suspicion gate
+
+Living design doc for the v_hack pipeline. Sibling to `RESEARCH_JOURNAL.md`.
+This explains *what we extract*, *why*, and *what runtime gating prevents*.
+
+## TL;DR
+
+`v_hack[name]` is a per-module top-k orthonormal basis in **AntiPaSTO
+δS-gradient space**, computed by PCA on **paired (hack − clean) NLL gradients**
+over a small set of contrastive completion pairs (currently N=14, 12 train + 2
+heldout). At training time we project the live policy-gradient component along
+this basis out of `δS.grad`, optionally gated so we only ablate when there's
+positive evidence the live gradient is hack-aligned.
+
+The 2026-05-27 refactor added three things on top of the older mean-diff design:
+
+1. **Top-k extraction** (k=12 max) with **load-time slicing** (`v_hack_k`,
+   default 5) so k=1 vs k=5 vs k=12 is a config flip, not a re-extract.
+2. **Singular-value recording** (`_sv/{name}` keys) so v_i carries its
+   extract-time confidence S_i, not just direction.
+3. **Runtime suspicion gate** (`susp_drop_frac`): per step, drop the top
+   `susp_drop_frac` fraction of (module, axis) pairs ranked by
+   `r_i = |g·v_i| / S_i`. When live alignment far exceeds extract-time
+   confidence, v_i is probably aligned with a structured coding direction
+   rather than hack, so the projection is skipped for that axis.
+
+## Why gradient space, not activation space?
+
+Most representation-steering work (ActAdd, RepE, CHaRS) operates on
+**activations** (forward pass), shifting hidden states at inference. We
+operate on **gradients of δS**, the trainable per-Linear AntiPaSTO knob.
+
+Reasons:
+- We're not steering inference; we're **shaping training**. The projection
+  modifies `δS.grad` before the optimizer step, so the model itself doesn't
+  drift toward hack-aligned weight updates.
+- δS gradients have a fixed, low-dimensional structure per module
+  (`δS ∈ R^r` where r = SVD rank of `W`). PCA-on-grads is computationally
+  cheap (one `[12, r]` diff matrix per wrapped module; r=2560 for the largest
+  matrix) and gives a clean per-module subspace.
+- This is closest in spirit to CHaRS-PCT (Principal Component Thresholding,
+  §3.3 of `docs/paper_chars.md`): the L principal components of local-shift
+  covariance. We do the same maneuver on paired δS-gradient diffs.
+
+## Why δS basis (= weight-SVD basis), not raw param basis?
+
+AntiPaSTO wraps each Linear with `δW = U · diag(δS) · V_h`, where `U, S, V_h
+= SVD(W_pretrained)`. So `δS ∈ R^r` are coordinates **in the weight-SVD
+basis**. The basis change is just a rotation — no whitening, no rescaling.
+
+Two things this buys us:
+
+- The number of trainable scalars is r per module (∼500–2500), not d_in×d_out.
+  Estimating a direction in the dense `d_in × d_out` space would take a few
+  hundred contrastive pairs; only a few pairs are needed in `R^r`.
+- Low-rank perturbations (LoRA-style hack adapters) are sparse in this basis,
+  which makes per-direction gating in `δS` meaningful even with N=12 pairs.
+
+What this does **not** buy us: regularization. The weight-SVD basis is just a
+convenient coordinate system; PCA on top of it still has to do the work of
+finding which coordinates carry hack-clean discriminative signal.
+
+## Extraction pipeline
+
+```python
+# pseudo: extract_v_hack(model, tokenizer, wrappers, pairs, top_k, tau_axis, n_heldout, device)
+
+train_pairs = pairs[:-n_heldout]                # currently 12 of 14
+
+# Gather per-pair, per-module gradients on hack-completion and clean-completion NLL.
+grads_hack[name]:  list of [r]-tensors, length n_pairs
+grads_clean[name]: list of [r]-tensors, length n_pairs
+
+for pair in train_pairs:
+  for label, completion in [("hack", pair.hack), ("clean", pair.clean)]:
+    bucket = grads_hack if label == "hack" else grads_clean
+    model.zero_grad()
+    loss = mean_NLL_on_completion_tokens(model, pair.prompt + completion)
+    assert isfinite(loss)                        # fail fast: a skipped pair would leave G_h/G_c length-mismatched
+    loss.backward()                              # populates δS.grad per module
+    for name, info in wrappers.items():
+      bucket[name].append(info.delta_S.grad.detach().cpu().float().clone())
+
+# Per module: PCA on paired diff.
+for name in wrappers:
+  G_h = stack(grads_hack[name])                  # [n_pairs, r]
+  G_c = stack(grads_clean[name])
+  D   = G_h - G_c                                # [n_pairs, r]: per-pair hack-axis displacement
+  U_d, S_d, Vh_d = svd(D)                        # truncated, m = min(n_pairs, r)
+  V = Vh_d[:k_max]                               # [k_max, r], orthonormal rows
+  # Orient v_i so +v_i points hack-ward (majority vote across pairs).
+  proj = D @ V.T                                 # [n_pairs, k_max]
+  n_pos = (proj > 0).sum(0)
+  flip = where(n_pos < n_pairs/2, -1, +1)
+  V = V * flip[:, None]
+  v_hack[name] = V
+  v_hack[f"_sv/{name}"] = S_d[:k_max]            # NEW: singular values saved alongside
+```
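The orientation step can be looked at in isolation. Below is a minimal sketch in plain Python (no torch), assuming the per-pair projections onto one candidate axis are already computed. `orient_sign_of_mean` and `orient_majority` are hypothetical helper names for illustration, not functions from `extract_vhack_grad.py`, and the numbers are made up. It shows how one large outlier pair can flip a sign-of-mean orientation while the majority vote follows the agreeing pairs.

```python
def orient_sign_of_mean(proj):
    """Return +1/-1 from the sign of the mean projection (the older rule)."""
    mean = sum(proj) / len(proj)
    return 1 if mean >= 0 else -1


def orient_majority(proj):
    """Return -1 only if fewer than half of the pairs project positively."""
    n_pos = sum(1 for p in proj if p > 0)
    return -1 if n_pos < len(proj) / 2 else 1


# Projections of four pair-diffs onto one candidate axis (made-up numbers):
# three pairs agree weakly, one outlier disagrees strongly.
proj = [0.1, 0.1, 0.1, -1.0]

sign_mean = orient_sign_of_mean(proj)   # dominated by the single outlier
sign_vote = orient_majority(proj)       # follows the three agreeing pairs
```

The majority rule mirrors the `where(n_pos < n_pairs/2, -1, +1)` line in the pseudocode above; ties resolve to +1 in both.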
+
+**File schema (v2):**
+- `{name}` → Tensor[k_max, r], orthonormal hack-axis basis, oriented +hack
+- `_sv/{name}` → Tensor[k_max], singular values of D in that basis
+- metadata: `model`, `dtype`, `top_k`, `tau_axis`, `schema=v2_with_sv`
+
+## Load-or-extract (2026-05-27)
+
+`train.py` derives `v_hack_path` from `(model_name, v_hack_extract_top_k)`
+unless overridden. If the file is missing, it extracts inline on the
+already-wrapped model:
+
+```
+v_hack_path = OUT_DIR / f"v_hack_{model_slug}_k{extract_top_k}.safetensors"
+if not v_hack_path.exists():
+    v_hack_dict, raw_grads, _ = extract_v_hack(model, tok, wrappers, PAIRS,
+                                                top_k=extract_top_k, ...)
+    save_file(v_hack_dict, v_hack_path, metadata={...})
+v_hack, v_sv = load_v_hack(v_hack_path, model_name, wrappers, k_use=v_hack_k)
+```
+
+This means a fresh model with no cached v_hack just runs extract once
+(~5 min for 4B-class) and proceeds. No prerequisite jobs, no manual flags.
+
+## Load-time k-slicing
+
+Extract saves k_max (default 12). Load slices to `k_use` (default 5). So
+k=1 vs k=5 vs k=12 is a **config flip**, not a re-extract. The
+`mean_sv_top5_frac` from our 2026-05-26 extract was 0.71, so k=5 covers
+~71% of the per-module singular-value mass of D (a variance share only if
+that fraction is computed on squared singular values). A load-time slice at 5
+is a reasonable default that we can ablate cheaply.
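A minimal sketch of the split-and-slice step, assuming the v2 key layout described in the file schema. Nested lists stand in for tensors, and `split_and_slice` with its `require_sv` flag is a hypothetical stand-in, not the signature of `train.py:load_v_hack`.

```python
SV_PREFIX = "_sv/"


def split_and_slice(raw, k_use, require_sv=False):
    """Split a v2-style dict into (v_hack, v_sv) and keep the first k_use axes.

    `raw` maps module name -> list of k_max rows, and "_sv/<name>" -> list of
    k_max singular values. Lists stand in for tensors here.
    """
    v_hack, v_sv = {}, {}
    for key, value in raw.items():
        if key.startswith(SV_PREFIX):
            v_sv[key[len(SV_PREFIX):]] = value[:k_use]
        else:
            if k_use > len(value):
                raise ValueError(f"k_use={k_use} > stored rank {len(value)} for {key}")
            v_hack[key] = value[:k_use]
    # A v1 file has no "_sv/" keys; refuse it when the caller needs magnitudes.
    if require_sv and set(v_sv) != set(v_hack):
        raise ValueError("singular values missing: file looks like schema v1")
    return v_hack, v_sv


raw = {
    "layer0.q_proj": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    "_sv/layer0.q_proj": [3.0, 1.5, 0.2],
}
v_hack, v_sv = split_and_slice(raw, k_use=2, require_sv=True)
```

Because singular vectors come out of the SVD sorted by singular value, keeping the first `k_use` rows is the same as re-extracting at rank `k_use`, which is what makes the ablation a config flip.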
+
+## Runtime suspicion gate
+
+**Hypothesis:** module M has small `||D(M)||_F` (weak hack signal at
+extract time). Its top SVD direction `v_1(M)` is dominated by noise
+shape, not hack shape. At training time, `g(M)` is the policy gradient
+flowing through M — a structured (non-isotropic) signal living in a
+low-d subspace of "directions that matter for next-token prediction." If
+`v_1(M)` coincidentally lies in that subspace, projecting `g(M)` along
+`v_1(M)` removes a chunk of useful coding-relevant gradient with no
+compensating reduction in hack signal.
+
+**Why I'd initially dismissed this concern:** in a high-d random model
+(g and v isotropic), `|g · v| ≈ ||g||/√r ≈ 2% of ||g||`. So one bad
+direction costs ~2% of the live gradient — tolerable. **What I missed:**
+neither `g` nor `v` is isotropic. Both live in low-d structured
+subspaces. If those subspaces happen to overlap, the projection magnitude
+is much larger.
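The isotropic baseline can be checked numerically. This is a rough sketch under the stated assumption only (Gaussian `g`, random unit `v`, r = 2560), written with the standard library; `rms_overlap` is a hypothetical helper, and nothing here models the structured case the paragraph above worries about.

```python
import math
import random


def rms_overlap(r, trials, seed=0):
    """RMS of |g.v| / ||g|| for isotropic Gaussian g and a random unit v in R^r."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(r)]
        v = [rng.gauss(0.0, 1.0) for _ in range(r)]
        v_norm = math.sqrt(sum(x * x for x in v))
        g_norm = math.sqrt(sum(x * x for x in g))
        dot = sum(a * b for a, b in zip(g, v)) / v_norm   # g . (v / ||v||)
        total += (dot / g_norm) ** 2
    return math.sqrt(total / trials)


r = 2560
predicted = 1.0 / math.sqrt(r)          # about 0.0198, i.e. roughly 2%
measured = rms_overlap(r, trials=200)
```

Note that `1/√r` is the root-mean-square overlap; the mean of `|g·v|/||g||` is smaller by a factor of about `√(2/π)`. Either way the isotropic cost per axis is on the order of a couple of percent, which is the baseline a structured overlap would exceed.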
+
+**Gate design:**
+
+```
+r_i(M) = |g(M) · v_i(M)| / S_i(M)
+```
+
+- High `r_i`: live grad cares about v_i much more than the extract-time
+  hack signal did → suspicious, this v_i is probably picking up
+  structured coding flow.
+- Low `r_i`: live alignment is in proportion to extract-time confidence
+  → trust the projection.
+
+**Per-step quantile gate:** collect `r_i` across all `(module, axis)`
+pairs in one step, find the `(1 − drop_top_frac)`-quantile, and suppress every
+axis above that threshold for this step. Default `drop_top_frac = 0.25`
+(the same knob the TL;DR calls `susp_drop_frac`).
+
+```python
+# pseudo: in project_delta_S_grad
+all_r = []
+for name, info in wrappers.items():
+    c = V[name] @ info.delta_S.grad        # [k_use]
+    S = v_sv[name]                          # [k_use]
+    all_r.append(c.abs() / S.clamp_min(eps))
+threshold = quantile(cat(all_r), 1 - drop_top_frac)
+for ...:
+    keep = (r <= threshold)
+    g_proj = g - (c * keep * gate_mode_mask) @ V
+```
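A self-contained sketch of the pooled quantile gate, using plain lists instead of tensors. `quantile` and `suspicion_keep_mask` are hypothetical helpers, and the interpolation rule (linear between order statistics) is an assumption about how the threshold is computed, not a statement about `proj.py`.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, 0 <= q <= 1."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])


def suspicion_keep_mask(c, S, drop_top_frac, eps=1e-12):
    """c: module -> per-axis coefficients g.v_i; S: module -> singular values.

    Returns module -> list of bools (True = keep projecting this axis).
    """
    ratios = {m: [abs(ci) / max(si, eps) for ci, si in zip(c[m], S[m])] for m in c}
    pooled = [x for rs in ratios.values() for x in rs]       # all (module, axis) pairs
    threshold = quantile(pooled, 1.0 - drop_top_frac)
    return {m: [x <= threshold for x in rs] for m, rs in ratios.items()}


# Made-up coefficients and singular values for two modules with four axes each.
c = {"mod_a": [0.2, 0.1, 0.05, 0.4], "mod_b": [0.3, 0.9, 0.02, 0.6]}
S = {"mod_a": [2.0, 1.0, 0.5, 0.4], "mod_b": [3.0, 0.3, 0.2, 0.5]}
keep = suspicion_keep_mask(c, S, drop_top_frac=0.25)
```

With eight pooled axes and `drop_top_frac=0.25`, two axes are suppressed here, both in `mod_b`. The threshold is pooled across modules, so the drops are not spread evenly per module.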
+
+## Known limitations (caveats from codex external review, 2026-05-27)
+
+1. **r_i is not dimensionless across modules.** `|g·v_i|` scales with
+   live-grad norm; `S_i` scales with extract-time-grad norm. A
+   high-gradient module dominates the global quantile regardless of
+   whether its axis is actually suspicious. Fix candidate: within-module
+   ratio `(|c_i|/||g||) / (S_i/||D||_F)`. **Not yet applied.**
+2. **Quantile gate is a fixed budget, not a detector.** It always drops
+   `drop_top_frac` of axes per step, even when nothing is genuinely
+   suspicious — and never drops more than that when many axes are. Fix
+   candidate: absolute threshold post-normalization, or run as
+   measure-only diagnostic until calibrated. **Not yet applied.**
+3. **Old v1 files (no `_sv/` keys) silently disable the gate.** Should
+   fail-fast when `susp_drop_frac > 0` and v_sv is empty.
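To make limitation 1 concrete, here is a small sketch comparing the raw ratio with the within-module candidate. It assumes the normalizer is the norm of the stored singular values, which equals `||D||_F` only when all singular values are kept and is smaller when `k_max` truncates. `raw_ratio` and `within_module_ratio` are hypothetical names, and the numbers are made up.

```python
import math


def raw_ratio(c, S):
    """r_i = |c_i| / S_i (scale-dependent across modules)."""
    return [abs(ci) / si for ci, si in zip(c, S)]


def within_module_ratio(c, S, g_norm):
    """r_i = (|c_i| / ||g||) / (S_i / ||S||), using the stored S as the norm."""
    s_norm = math.sqrt(sum(si * si for si in S))
    return [(abs(ci) / g_norm) / (si / s_norm) for ci, si in zip(c, S)]


S = [4.0, 3.0]                       # same extract-time spectrum in both modules
c_small = [0.4, 0.3]                 # module whose live gradient has ||g|| = 1
c_large = [40.0, 30.0]               # same alignment pattern, ||g|| = 100

raw_small, raw_large = raw_ratio(c_small, S), raw_ratio(c_large, S)
norm_small = within_module_ratio(c_small, S, g_norm=1.0)
norm_large = within_module_ratio(c_large, S, g_norm=100.0)
```

The raw ratios differ by 100x purely because of gradient scale, so a pooled quantile would drop the high-gradient module's axes first. The normalized ratios coincide, which is the property the fix candidate is after.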
+
+## Validation: cheap discriminative tests
+
+The fundamental question: does v_hack actually discriminate hack from
+clean gradients, or is it picking up irrelevant variance?
+
+### Test 1: cin_hack vs cin_clean on disk pools (cheap, ~5 min)
+
+We already have `out/probe_distill/teacher_pool/` (hacking samples) and
+`out/probe_distill/base_pool/` (clean samples). For N samples each:
+
+```
+for prompt, completion, label in samples:
+    model.zero_grad()
+    loss = mean_NLL(model, prompt + completion)
+    loss.backward()                        # backward() returns None, so keep it on its own line
+    g = delta_S.grad
+    cin = (V @ g).norm() / g.norm()
+    record(label, cin)
+```
+
+**Discriminator:** `cin_hack_mean − cin_clean_mean`. If ≫ 0, v_hack
+discriminates. If ≈ 0, v_hack is capturing prompt-length / generic
+variance, not hack-specific direction. **Cost: ~5 min, no training.**
+
+### Test 2: held-out pair projection (existing)
+
+`verify_vhack_heldout.py` projects gradients from held-out pairs (last
+n_heldout of PAIRS) onto trained v_hack. Already in CI-style flow.
+
+### Test 3: random-direction null
+
+For each module, compute cin onto v_hack vs onto a random unit vector of
+the same shape. If `cin_v_hack > cin_random` by a large margin, v_hack
+is non-spurious. Trivial to implement.
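A toy version of the random-direction null, assuming a k=1 basis and a synthetic gradient with a planted component along it. `cin` and `random_unit` are hypothetical helpers. For k > 1 the null would need k random orthonormal rows rather than a single unit vector.

```python
import math
import random


def cin(V, g):
    """||V g|| / ||g|| for V with orthonormal rows (share of g inside span(V))."""
    coeffs = [sum(vi * gi for vi, gi in zip(row, g)) for row in V]
    return math.sqrt(sum(c * c for c in coeffs)) / math.sqrt(sum(x * x for x in g))


def random_unit(r, rng):
    """Uniformly random unit vector in R^r via a normalized Gaussian draw."""
    v = [rng.gauss(0.0, 1.0) for _ in range(r)]
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


rng = random.Random(0)
r = 512
v_hack = [[1.0] + [0.0] * (r - 1)]                        # toy k=1 basis: first coordinate
g = [5.0] + [rng.gauss(0.0, 1.0) for _ in range(r - 1)]   # toy grad with a planted component

cin_hack = cin(v_hack, g)
cin_null = sum(cin([random_unit(r, rng)], g) for _ in range(50)) / 50
```

On real gradients the comparison would be made per module and aggregated. The margin that counts as "large" is not specified in this doc and would need calibrating.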
+
+### Test 4: per-source cin during training (live)
+
+In mixed-pool runs we have student rollouts (initially ~no hack) and
+teacher rollouts (all hack). Currently `cin` is computed on the
+accumulated gradient (mixed). With ONE extra backward per step we can
+compute `cin_s` (student-only grad) and `cin_t` (teacher-only grad)
+separately. **Predict:** if v_hack is a real hack direction,
+`cin_t > cin_s` initially; the gap shrinks as student picks up hack
+(if it does). Useful for diagnosing whether the projection is doing
+real work or just gradient noise.
+
+### Test 5: bootstrap sign-stability
+
+Bootstrap pairs (sample N-2 with replacement), re-extract v_hack,
+compare `cos(v_hack_original, v_hack_bootstrap)`. If unstable, v_hack
+is fitting noise. **Cost: 5 × ~5 min = 25 min total.**
+
+## Open design questions
+
+- **Is the suspicion gate redundant?** Codex argued the quantile design
+  is a fixed-budget knob, not a detector. The right answer is probably:
+  ship it as measure-only first (log `frac_axes_susp` and per-step
+  `r_i` distribution histograms), confirm whether suspicious modules
+  actually exist empirically, *then* turn on projection-side gating.
+- **Should we whiten by S?** I.e. parameterize the AntiPaSTO knob as
+  `δS_i / σ_i(W)` so all directions have equal forward-pass impact.
+  Currently we don't. This is a separate, larger question.
+- **Should we record per-pair tags / hack flavors?** With 12
+  unlabeled pairs we can't do supervised LDA. With flavor labels
+  (hardcode / weak-tests / persona / format-leak) we could do
+  LDA-on-labels, which would beat unsupervised PCA at this N.
+
+## Related files
+
+- `src/projected_grpo/extract_vhack_grad.py` — extract function + CLI
+- `src/projected_grpo/proj.py` — runtime projection + gates
+- `src/projected_grpo/train.py:load_v_hack` — load + slice + auto-extract
+- `src/projected_grpo/verify_vhack_heldout.py` — Test 2 above
+- `src/projected_grpo/pairs.py` — the 14 contrastive pairs
+- `docs/paper_chars.md` — CHaRS notes (PCT comparison)
+- `RESEARCH_JOURNAL.md` — chronological progress log
@@ -0,0 +1,710 @@
+[client] initialize (running)
+
+[client] session/new (running)
+I’ll use the local review workflow plus the `gsd` skill because this is a multi-step code review, then I’ll read the project instructions and changed files before making findings.
+
+[tool] pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'docs/brainstorm/extracted_prefs.md' -g 'docs/RESEARCH_JOURNAL.md' (running)
+  input: {"call_id":"call_26P3qQ1bTCPLMMOGhHlNW3B7","process_id":"3146","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'do...
+
+[tool] pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'docs/brainstorm/extracted_prefs.md' -g 'docs/RESEARCH_JOURNAL.md' (failed)
+  kind: execute
+  input: {"call_id":"call_26P3qQ1bTCPLMMOGhHlNW3B7","process_id":"3146","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'do...
+  output:
+    bwrap: No permissions to create new namespace, likely because the kernel does not allow non-privileged user namespaces. See <https://deb.li/bubblewrap> or <file:///usr/share/doc/bubblewrap/README.Debian.gz>.
+
+[tool] Read SKILL.md (running)
+  input: {"call_id":"call_pqWiVSayENSrfXMHJIaa2TDn","process_id":"2476","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,220p' /root/.claude/skills/gsd/SKILL.md"],"cwd":"/workspace/pr...
+  files: /root/.claude/skills/gsd/SKILL.md
+
+[tool] Read SKILL.md (failed)
+  kind: read
+  input: {"call_id":"call_pqWiVSayENSrfXMHJIaa2TDn","process_id":"2476","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,220p' /root/.claude/skills/gsd/SKILL.md"],"cwd":"/workspace/pr...
+  files: /root/.claude/skills/gsd/SKILL.md
+  output:
+    bwrap: No permissions to create new namespace, likely because the kernel does not allow non-privileged user namespaces. See <https://deb.li/bubblewrap> or <file:///usr/share/doc/bubblewrap/README.Debian.gz>.
+
+[tool] pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'docs/brainstorm/extracted_prefs.md' -g 'docs/RESEARCH_JOURNAL.md' (running)
+  input: {"call_id":"call_qsJCRIGJ2i5HN4N8VhYCiO86","process_id":"97013","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'd...
+
+[tool] pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'docs/brainstorm/extracted_prefs.md' -g 'docs/RESEARCH_JOURNAL.md' (completed)
+  kind: execute
+  input: {"call_id":"call_qsJCRIGJ2i5HN4N8VhYCiO86","process_id":"97013","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'd...
+  output:
+    /workspace/projected_grpo
+    docs/brainstorm/extracted_prefs.md
+    AGENTS.md
+    
+
+[tool] Read SKILL.md (running)
+  input: {"call_id":"call_Epox1WkTa0YbizPn1jb4k52M","process_id":"4410","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,220p' /root/.claude/skills/gsd/SKILL.md"],"cwd":"/workspace/pr...
+  files: /root/.claude/skills/gsd/SKILL.md
+
+[tool] Read SKILL.md (completed)
+  kind: read
+  input: {"call_id":"call_Epox1WkTa0YbizPn1jb4k52M","process_id":"4410","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,220p' /root/.claude/skills/gsd/SKILL.md"],"cwd":"/workspace/pr...
+  files: /root/.claude/skills/gsd/SKILL.md
+  output:
+    ---
+    name: gsd
+    description: "Get Shit Done workflow for non-trivial tasks: spec -> implement -> test -> review -> wrap. Mandatory for anything with more than one step."
+    ---
+    
+    <!--
+    Design rationale and sources — for human readers, not the agent.
+    
+    - GSD spec-first: spec-driven development, scope fencing, verify-before-done
+    - Manus file-as-memory: one file is working memory, checklist is state, append-only log
+    - Boomerang: return concrete artifacts not summaries
+    - DCP: invisible dedup, superseded writes, error purging
+    
+    Core principles:
+    - Spec file is the only persistent artifact (single-file invariant).
+    - Agent shouldn't know about its own resource limits.
+    - Review scales with input size: frontier for plans, cheap for bulk.
+    - Observable outputs: show evidence, not narration.
+    -->
+    
+    # GSD (Get Shit Done)
+    
+    Mandatory for all non-trivial tasks (more than one step). Skip only for single-line fixes.
+    
+    ## Part 1: Core workflow
+    
+    ### 1.1 Spec-first planning
+    
+    ... (191 more lines)
+
+[plan]
+  - [in_progress] Read local instructions and gsd workflow
+  - [pending] Collect diffs and changed-file context
+  - [pending] Review suspicion gate math and load/extract flow
+  - [pending] Check schema/test implications
+  - [pending] Write final review
+The sandbox is blocking normal read commands, so I’m using approved escalated reads only. Next I’m collecting the actual diff and full file context, with the review scoped to the changed files.
+
+[tool] Read AGENTS.md (running)
+  input: {"call_id":"call_NLThtl3Ffv5iUZwilR5SADCz","process_id":"14734","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,240p' AGENTS.md"],"cwd":"/workspace/projected_grpo","parsed_c...
+  files: /workspace/projected_grpo/AGENTS.md
+
+[tool] git status --short (running)
+  input: {"call_id":"call_xaCeeYFvoLvaO7EJ0Sl9is82","process_id":"72006","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git status --short"],"cwd":"/workspace/projected_grpo","parsed_cmd":[{"...
+
+[tool] git diff -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo/train.py src/projected_grpo/proj.py src/projected_grpo/verify_vhack_heldout.py src/projected_grpo/probe_distill.py (running)
+  input: {"call_id":"call_KnN7vfJXbBaKGCBOobcZOO4x","process_id":"78686","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git diff -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo...
+
+[tool] git diff --cached -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo/train.py src/projected_grpo/proj.py src/projected_grpo/verify_vhack_heldout.py src/projected_grpo/probe_distill.py (running)
+  input: {"call_id":"call_i2OvvQxGyfTxRhy2A5IPLZ5c","process_id":"75184","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git diff --cached -- src/projected_grpo/extract_vhack_grad.py src/proje...
+
+[tool] Read AGENTS.md (completed)
+  kind: read
+  input: {"call_id":"call_NLThtl3Ffv5iUZwilR5SADCz","process_id":"14734","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,240p' AGENTS.md"],"cwd":"/workspace/projected_grpo","parsed_c...
+  files: /workspace/projected_grpo/AGENTS.md
+  output:
+    # AGENTS.md — projected_grpo
+    
+    **This is novel ML research.** Not in your training data. Extrapolate carefully.
+    
+    ## Project in one paragraph
+    
+    Test whether SVD-basis gradient projection against an extracted hack-direction
+    reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from
+    Rebound (Wu & Tang 2026) by intervening at the *gradient* level rather than the
+    *advantage* level. Differs from AntiPaSTO (the user's prior work) by using
+    unpaired GRPO rollouts rather than paired-preference contrast.
+    
+    Inherit global rules from `~/.claude/CLAUDE.md`.
+    
+    ## Workflow
+    
+    - Read [docs/spec.md](spec.md) for the preregistered plan.
+    - Read [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md) for design rationale.
+    - New sweep arms get recipes in [justfile](justfile) with `# H:` hypothesis comments.
+    - `just fast-dev-run` before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
+    - Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
+    - Head [docs/RESEARCH_JOURNAL.md](docs/RESEARCH_JOURNAL.md) for latest results.
+    - No `tests/` dir; `fast-dev-run` is the correctness gate.
+    
+    ## External dependencies
+    
+    `external/rl-rewardhacking/` is Ariahw's repo (verl-based GRPO + LeetCode dataset
+    + reward hacking monitors). We import from it; we do NOT modify it. Sync with
+    ... (44 more lines)
+
+[tool] git diff -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo/train.py src/projected_grpo/proj.py src/projected_grpo/verify_vhack_heldout.py src/projected_grpo/probe_distill.py (completed)
+  kind: execute
+  input: {"call_id":"call_KnN7vfJXbBaKGCBOobcZOO4x","process_id":"78686","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git diff -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo...
+  output:
+    diff --git a/src/projected_grpo/extract_vhack_grad.py b/src/projected_grpo/extract_vhack_grad.py
+    index 5e9f1ff..57ed5d0 100644
+    --- a/src/projected_grpo/extract_vhack_grad.py
+    +++ b/src/projected_grpo/extract_vhack_grad.py
+    @@ -51,7 +51,14 @@ class Config:
+         out_path: Path = OUT_DIR / "v_hack.safetensors"
+         train_grads_path: Path = OUT_DIR / "vhack_grads_train.safetensors"
+         n_heldout: int = 2  # last n pairs reserved for held-out validation
+    -    top_k: int = 5      # rank of hack subspace per module (≤ n_train_pairs)
+    +    # top_k=12 = max(n_train_pairs after n_heldout=2 from N=14 pairs). Extract once
+    +    # at max rank; train.py slices via --v-hack-k for k-ablation without re-extract.
+    +    top_k: int = 12
+    +    # tau_axis: zero rows where S_i/S_0 < tau_axis. Diagnostic — projection along
+    +    # noise-direction unit vectors removes only ~||g||/sqrt(r) ≈ 2% of grad
+    +    # magnitude on r=2560 modules, so this rarely changes effect size; it does
+    +    # make k-ablations honest (axes 4-5 might be pure noise on N=12 pairs).
+    +    tau_axis: float = 0.0
+     
+     
+     def resolve_dtype(s: str) -> torch.dtype:
+    @@ -73,29 +80,32 @@ def completion_nll(model, tokenizer, prompt: str, completion: str, device) -> to
+         return (nll * mask).sum() / mask.sum().clamp_min(1.0)
+     
+     
+    -def main(cfg: Config) -> int:
+    -    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    -    dtype = resolve_dtype(cfg.dtype)
+    -    logger.info(
+    ... (1103 more lines)
+
+[tool] git status --short (completed)
+  kind: execute
+  input: {"call_id":"call_xaCeeYFvoLvaO7EJ0Sl9is82","process_id":"72006","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git status --short"],"cwd":"/workspace/projected_grpo","parsed_cmd":[{"...
+  output:
+    M justfile
+     M src/projected_grpo/extract_vhack_grad.py
+     M src/projected_grpo/probe_distill.py
+     M src/projected_grpo/proj.py
+     M src/projected_grpo/train.py
+     M src/projected_grpo/verify_vhack_heldout.py
+    ?? docs/spec/20260527_code_review.md
+    
+
+[tool] git diff --cached -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo/train.py src/projected_grpo/proj.py src/projected_grpo/verify_vhack_heldout.py src/projected_grpo/probe_distill.py (completed)
+  kind: execute
+  input: {"call_id":"call_i2OvvQxGyfTxRhy2A5IPLZ5c","process_id":"75184","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git diff --cached -- src/projected_grpo/extract_vhack_grad.py src/proje...
+  output:
+    {
+      "call_id": "call_i2OvvQxGyfTxRhy2A5IPLZ5c",
+      "process_id": "75184",
+      "turn_id": "019e681a-3b48-7ed3-8e6b-1912b41adb6f",
+      "command": [
+        "/usr/bin/zsh",
+        "-lc",
+        "git diff --cached -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo/train.py src/projected_grpo/proj.py src/projected_grpo/verify_vhack_heldout.py src/projected_grpo/probe_distill.py"
+      ],
+      "cwd": "/workspace/projected_grpo",
+      "parsed_cmd": [
+        {
+          "type": "unknown",
+          "cmd": "git diff --cached -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo/train.py src/projected_grpo/proj.py src/projected_grpo/verify_vhack_heldout.py src/projected_grpo/probe_distill.py"
+        }
+      ],
+      "source": "unified_exec_startup",
+      "stdout": "",
+      "stderr": "",
+      "aggregated_output": "",
+      "exit_code": 0,
+      "duration": {
+        "secs": 1,
+        "nanos": 432131924
+      },
+      "formatted_output": "",
+      "status": "completed"
+    }
+I have the high-level diff. I’m reading the full changed files now because the suspicion gate depends on surrounding training order, optimizer setup, and how `delta_S` is wrapped.
+
+[tool] Read proj.py (running)
+  input: {"call_id":"call_B8lYmbn1uM0INz0YwcS54y79","process_id":"37400","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/proj.py"],"cwd":"/workspace/projected_grpo","...
+  files: /workspace/projected_grpo/src/projected_grpo/proj.py
+
+[tool] Read train.py (running)
+  input: {"call_id":"call_19Uf3j0MKdVXQPhRaIstMKAF","process_id":"58769","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/train.py"],"cwd":"/workspace/projected_grpo",...
+  files: /workspace/projected_grpo/src/projected_grpo/train.py
+
+[tool] Read extract_vhack_grad.py (running)
+  input: {"call_id":"call_UjQGhDRH4LEXXP0yi2Oa8hOV","process_id":"85164","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/extract_vhack_grad.py"],"cwd":"/workspace/pro...
+  files: /workspace/projected_grpo/src/projected_grpo/extract_vhack_grad.py
+
+[tool] Read verify_vhack_heldout.py (running)
+  input: {"call_id":"call_9gmm3VORQurDl6RBtIOyqOCR","process_id":"42951","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/verify_vhack_heldout.py"],"cwd":"/workspace/p...
+  files: /workspace/projected_grpo/src/projected_grpo/verify_vhack_heldout.py
+
+[tool] Read proj.py (completed)
+  kind: read
+  input: {"call_id":"call_B8lYmbn1uM0INz0YwcS54y79","process_id":"37400","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/proj.py"],"cwd":"/workspace/projected_grpo","...
+  files: /workspace/projected_grpo/src/projected_grpo/proj.py
+  output:
+    1	"""Gradient projection + delta_S grad utilities. Imported by smoke and train."""
+         2	from __future__ import annotations
+         3	
+         4	import torch
+         5	
+         6	
+         7	def per_token_logps(logits: torch.Tensor, ids: torch.Tensor) -> torch.Tensor:
+         8	    """log p(ids | logits) gathered token-wise.
+         9	
+        10	    Uses F.cross_entropy (fused softmax+gather) so we never materialise the
+        11	    full [B, L, V] fp32 softmax. On Qwen3.5-2B with V=152k, G=8, L≈1500 the
+        12	    fp32 vocab tensor was ~7 GB per forward — the difference between OOM and
+        13	    fit on a 96 GB card when the autograd graph is alive.
+        14	    """
+        15	    B, L, V = logits.shape
+        16	    # CE's internal log_softmax accumulates in fp32 (stable) but returns input dtype.
+        17	    # The output [B*L] is small, so upcast it to fp32 for downstream PPO ratio math.
+        18	    return -torch.nn.functional.cross_entropy(
+        19	        logits.reshape(-1, V), ids.reshape(-1), reduction="none"
+        20	    ).float().view(B, L)
+        21	
+        22	
+        23	@torch.no_grad()
+        24	def project_delta_S_grad(
+        25	    wrappers: dict,
+        26	    v_hack: dict[str, torch.Tensor],
+        27	    preserve_magnitude: bool,
+        28	    measure_only: bool = False,
+    ... (251 more lines)
+
+[tool] Read train.py (completed)
+  kind: read
+  input: {"call_id":"call_19Uf3j0MKdVXQPhRaIstMKAF","process_id":"58769","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/train.py"],"cwd":"/workspace/projected_grpo",...
+  files: /workspace/projected_grpo/src/projected_grpo/train.py
+  output:
+    1	"""Canonical training entry point: AntiPaSTO + GRPO (Dr.GRPO unbiased) + optional
+         2	gradient projection on LeetCode reward-hacking benchmark.
+         3	
+         4	Lineage (see spec.md §76-83):
+         5	  - The inner GRPO_step (per_token_logps, ratio + clip + min, K3 KL, per-token
+         6	    loss, completion mask) is a direct port of lsdefine/simple_GRPO's
+         7	    `GRPO_step` in `grpo_vllm_one.py` (lines 64-95).
+         8	  - The OUTER loop adopts simple_GRPO's `Q_batch_size` pattern (multiple
+         9	    prompts per optimizer step, per-prompt GRPO advantage groups, grad
+        10	    accumulation across prompts). GRPO needs within-group reward diversity to
+        11	    produce any signal; sampling many prompts per step raises the chance that
+        12	    at least one group is non-degenerate. simple_GRPO uses Q_batch_size=5; our
+        13	    prompts_per_step is set in PRESETS (grad-accum to the paper's effective batch).
+        14	  - Deviations from simple_GRPO are deliberate, listed in spec.md:
+        15	      1. Loss normalization: Dr.GRPO unbiased (Liu et al. 2025, arXiv
+        16	         2503.20783) replaces simple_GRPO's `(R-mean)/std` + per-response-len
+        17	         denominator. Drops two biases:
+        18	           - length norm `1/|o_i|` (favors short correct, long incorrect)
+        19	           - group-std norm `/std(R)` (overweights easy/hard questions)
+        20	         Toggle via `--unbiased` (default on); flipping to False recovers
+        21	         simple_GRPO's classic GRPO advantage normalization.
+        22	      2. Reference model: simple_GRPO runs a separate base model via an HTTP
+        23	         `ref_server`. We use the AntiPaSTO `delta_S=0` zero-adapter trick
+        24	         (W' = W + U diag(0) Vh = W exactly) — no second model loaded.
+        25	      3. Rollout: simple_GRPO uses vLLM in a separate process. We use HF
+        26	         `model.generate` in-process.
+        27	      4. Adapter: simple_GRPO is full FT (with DeepSpeed ZeRO). Canonical
+        28	         (ariahw/rl-reward...
+
+[tool] Read verify_vhack_heldout.py (completed)
+  kind: read
+  input: {"call_id":"call_9gmm3VORQurDl6RBtIOyqOCR","process_id":"42951","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/verify_vhack_heldout.py"],"cwd":"/workspace/p...
+  files: /workspace/projected_grpo/src/projected_grpo/verify_vhack_heldout.py
+  output:
+    1	"""Held-out v_hack validation (spec.md §B validation).
+         2	
+         3	For each held-out pair, compute per-module gradient diff (g_hack - g_clean)
+         4	in delta_S basis, then cos-align with the trained v_hack[name].
+         5	
+         6	Report:
+         7	  - per-suffix median/mean cos_align
+         8	  - fraction of modules with cos_align > 0  (SHOULD > 0.5)
+         9	  - mean cos_align across modules           (target > 0.2)
+        10	
+        11	Run: uv run python -m projected_grpo.verify_vhack_heldout
+        12	"""
+        13	from __future__ import annotations
+        14	
+        15	import sys
+        16	from collections import defaultdict
+        17	from dataclasses import dataclass
+        18	from pathlib import Path
+        19	
+        20	import json
+        21	
+        22	import torch
+        23	import tyro
+        24	from loguru import logger
+        25	from safetensors.torch import save_file
+        26	from tabulate import tabulate
+        27	from transformers import AutoModelForCausalLM, AutoTokenizer
+        28	
+    ... (279 more lines)
+
+[tool] Read extract_vhack_grad.py (completed)
+  kind: read
+  input: {"call_id":"call_UjQGhDRH4LEXXP0yi2Oa8hOV","process_id":"85164","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/extract_vhack_grad.py"],"cwd":"/workspace/pro...
+  files: /workspace/projected_grpo/src/projected_grpo/extract_vhack_grad.py
+  output:
+    1	"""Gradient-side per-module v_hack extraction (spec.md §B, top-k variant).
+         2	
+         3	For each contrastive pair (prompt, hack_completion, clean_completion):
+         4	  - Forward(prompt+completion), mean-NLL on completion tokens, backward
+         5	  - Capture `delta_S.grad` per AntiPaSTO-wrapped Linear
+         6	
+         7	Then per module, with D = [g_hack_i - g_clean_i for each pair] in R^{n_pairs x r}:
+         8	    SVD(D) = U Σ Vh
+         9	    v_hack[name] = top_k rows of Vh, each oriented so mean(D @ v_i) > 0
+        10	
+        11	This generalizes mean-diff (which corresponds to top-1 PC of paired diffs under
+        12	isotropic covariance) to a rank-k hack subspace, motivated by CHaRS (Abdullaev
+        13	2025 — see docs/paper_chars.md): hack signal is multi-modal across hack flavors
+        14	(weak tests, hardcode, persona, ...), so a single global direction is brittle.
+        15	
+        16	Orientation matters because proj.py applies a per-direction one-sided gate
+        17	(only subtracts <g, v_i> when positive). +v_i must point hack-ward.
+        18	
+        19	Saves `out/v_hack.safetensors` = dict[name -> Tensor[k, r]] (cpu fp32, rows
+        20	unit-norm + orthonormal from SVD) with header {"model": str, "dtype": str,
+        21	"top_k": str(k)}.
+        22	
+        23	Run: uv run python -m projected_grpo.extract_vhack_grad
+        24	"""
+        25	from __future__ import annotations
+        26	
+        27	import sys
+        28	from collections import defaultdict
+    ... (515 more lines)
+
+[tool] Read train.py (running)
+  input: {"call_id":"call_XmsO6ztYV0PA2nSx8J4xdbu9","process_id":"5743","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/train.py | sed -n '390,530p'"],"cwd":"/workspa...
+  files: /workspace/projected_grpo/src/projected_grpo/train.py
+
+[tool] Read probe_distill.py (running)
+  input: {"call_id":"call_NwEh3dqGqjDFXQuySFaT2c0N","process_id":"46600","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/probe_distill.py | sed -n '180,240p'"],"cwd":...
+  files: /workspace/projected_grpo/src/projected_grpo/probe_distill.py
+
+[tool] Read 20260527_code_review.md (running)
+  input: {"call_id":"call_u67hecWsDhI7oi0boUU1C0q5","process_id":"15286","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba docs/spec/20260527_code_review.md"],"cwd":"/workspace/projected_...
+  files: /workspace/projected_grpo/docs/spec/20260527_code_review.md
+
+[tool] git diff -- justfile (running)
+  input: {"call_id":"call_3AYtTkFJY9bhQBbuGufHXJiA","process_id":"61295","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git diff -- justfile"],"cwd":"/workspace/projected_grpo","parsed_cmd":[...
+
+[tool] Read train.py (completed)
+  kind: read
+  input: {"call_id":"call_XmsO6ztYV0PA2nSx8J4xdbu9","process_id":"5743","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/train.py | sed -n '390,530p'"],"cwd":"/workspa...
+  files: /workspace/projected_grpo/src/projected_grpo/train.py
+  output:
+    390	                  metadata={"model": model_name, "dtype": "bf16",
+       391	                            "top_k": str(min(cfg.v_hack_extract_top_k, len(VHACK_PAIRS) - 2)),
+       392	                            "tau_axis": "0.0", "schema": "v2_with_sv"})
+       393	        # extract leaves model.zero_grad() done; reset opt state isn't needed (opt built below)
+       394	    v_hack_cpu, v_sv_cpu = load_v_hack(v_hack_path, model_name, wrappers, k_use=cfg.v_hack_k)
+       395	    v_hack = {name: v.to(device) for name, v in v_hack_cpu.items()}
+       396	    v_sv = {name: s.to(device) for name, s in v_sv_cpu.items()} if v_sv_cpu else None
+       397	    # Teacher pool: pre-generated rollouts on disk keyed by problem_id. Each step's
+       398	    # G_t teacher rollouts come from a uniform random sample of that prompt's cache,
+       399	    # so we do *not* keep the teacher model in VRAM. Pool is produced by
+       400	    # `probe_distill.py --teacher-only` (see schema in probe_distill.py:149-186).
+       401	    # Cached rewards/flags are reused verbatim — no re-grading — so the pool is a
+       402	    # reproducible fixed teacher distribution across runs.
+       403	    teacher_pool: dict[int, list[dict]] = {}
+       404	    G_s = group
+       405	    G_t = 0
+       406	    if cfg.teacher_pool_dir is not None:
+       407	        if not (0.0 < cfg.mix_ratio < 1.0):
+       408	            raise ValueError(f"mix_ratio must be in (0,1) when teacher_pool_dir set; got {cfg.mix_ratio}")
+       409	        G_t = round(group * cfg.mix_ratio)
+       410	        G_s = group - G_t
+       411	        if G_s == 0 or G_t == 0:
+       412	            raise ValueError(
+       413	                f"degenerate split: G={group} mix_ratio={cfg.mix_ratio} -> G_s={G_s}, G_t={G_t}. "
+       414	                f"Pick mix_ratio so both halves are non-empty, or drop --teacher-pool-dir."
+       415	            )
+       416	        for path in sorted(cfg.teacher_pool_dir.glob("prompt_*.jsonl.gz")):
+       417	            # path.stem on 'prompt_0004.jsonl.gz' is 'prompt_0004.jsonl' (only one
+    ....
+
+[tool] Read probe_distill.py (completed)
+  kind: read
+  input: {"call_id":"call_NwEh3dqGqjDFXQuySFaT2c0N","process_id":"46600","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/probe_distill.py | sed -n '180,240p'"],"cwd":...
+  files: /workspace/projected_grpo/src/projected_grpo/probe_distill.py
+  output:
+    180	
+       181	
+       182	def load_prompt(pool_dir: Path, problem_id: int) -> list[dict]:
+       183	    path = pool_dir / f"prompt_{problem_id:04d}.jsonl.gz"
+       184	    with gzip.open(path, "rt") as f:
+       185	        return [json.loads(line) for line in f]
+       186	
+       187	
+       188	def main(cfg: Config) -> int:
+       189	    if cfg.tag:
+       190	        tag = cfg.tag
+       191	    elif cfg.teacher_only:
+       192	        tag = "teacher_pool"
+       193	    elif cfg.base_only:
+       194	        tag = "base_pool"
+       195	    else:
+       196	        tag = f"{cfg.arm}_seed{cfg.seed}"
+       197	    run_id = f"distill_{tag}"
+       198	    setup_logging(run_id)
+       199	    torch.manual_seed(cfg.seed)
+       200	    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+       201	
+       202	    logger.info(f"argv: {' '.join(sys.argv)}")
+       203	    logger.info(f"arm={cfg.arm}  teacher={cfg.teacher}  steps={cfg.steps}  "
+       204	                f"G={cfg.group}  seed={cfg.seed}  "
+       205	                f"teacher_only={cfg.teacher_only}  replay={cfg.replay_dir is not None}")
+       206	
+       207	    if cfg.teacher_only or cfg.base_only:
+    ... (97 more lines)
+
+[tool] git diff -- justfile (completed)
+  kind: execute
+  input: {"call_id":"call_3AYtTkFJY9bhQBbuGufHXJiA","process_id":"61295","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git diff -- justfile"],"cwd":"/workspace/projected_grpo","parsed_cmd":[...
+  output:
+    diff --git a/justfile b/justfile
+    index c1dc901..81f78b4 100644
+    --- a/justfile
+    +++ b/justfile
+    @@ -285,18 +285,21 @@ pregen-teacher n_prompts="100":
+             --group=8 \
+             --max-new=1024
+     
+    -# 10-step feasibility probe: clean Qwen3-4B + 50% cached teacher pool, pp=4.
+    -# Queues via pueue so logs are shared between user and agent. Student is
+    -# restricted to problems covered by the pool. GO/NO-GO: final HACK_STUDENT
+    -# > 0.10 by step 10.
+    +# 100-step feasibility probe: clean Qwen3-4B + 75% cached teacher pool, pp=4, G=12.
+    +# Plan B "free lunch": mix=0.75 -> G_s=3, G_t=9. Gen wall-time unchanged
+    +# (teacher is cached disk reads), backward VRAM ~2x current (peak ~55-60 GB on
+    +# 96 GB card). At 48 gens/step (vs reference 256), 100 steps ~= 19 ref steps.
+    +# --v-hack-path is set even for vanilla so cin/cout get measured as baseline
+    +# (project_delta_S_grad with measure_only=True on vanilla arm).
+     probe-mixed seed="41":
+    -    pueue add -l "why: does mixed-pool GRPO (cached teacher) drive student hack-rate up from clean base; resolve: confirm student hack_s climbs 0->20%+ over 10 steps" \
+    +    pueue add -l "why: does mixed-pool GRPO (cached teacher, plan B grad pressure) drive student hack-rate from clean base; resolve: confirm hack_s climbs 0->10%+ over 100 steps (~19 ref-eq)" \
+             -w "$PWD" -- \
+             {{ TRAIN }} --preset=full --arm=vanilla \
+                 --model={{ MODEL }} \
+    +            --v-hack-path=out/v_hack_full.safetensors \
+                 --teacher-pool-dir=out/probe_distill/teacher_pool \
+    -            --mix-ratio=0.5 \
+    -            --steps=10 --prompts-per-step=4 \
+    ... (40 more lines)
+
+[tool] Read 20260527_code_review.md (completed)
+  kind: read
+  input: {"call_id":"call_u67hecWsDhI7oi0boUU1C0q5","process_id":"15286","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba docs/spec/20260527_code_review.md"],"cwd":"/workspace/projected_...
+  files: /workspace/projected_grpo/docs/spec/20260527_code_review.md
+  output:
+    1	[client] initialize (running)
+         2	
+         3	[client] session/new (running)
+         4	I’ll use the local review workflow plus the `gsd` skill because this is a multi-step code review, then I’ll read the project instructions and changed files before making findings.
+         5	
+         6	[tool] pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'docs/brainstorm/extracted_prefs.md' -g 'docs/RESEARCH_JOURNAL.md' (running)
+         7	  input: {"call_id":"call_26P3qQ1bTCPLMMOGhHlNW3B7","process_id":"3146","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'do...
+         8	
+         9	[tool] pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'docs/brainstorm/extracted_prefs.md' -g 'docs/RESEARCH_JOURNAL.md' (failed)
+        10	  kind: execute
+        11	  input: {"call_id":"call_26P3qQ1bTCPLMMOGhHlNW3B7","process_id":"3146","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'do...
+        12	  output:
+        13	    bwrap: No permissions to create new namespace, likely because the kernel does not allow non-privileged user namespaces. See <https://deb.li/bubblewrap> or <file:///usr/share/doc/bubblewrap/README.Debian.gz>.
+        14	
+        15	[tool] Read SKILL.md (running)
+        16	  input: {"call_id":"call_pqWiVSayENSrfXMHJIaa2TDn","process_id":"2476","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,220p' /root/.claude/skills/gsd/SKILL.md"],"cwd":"/workspace/pr...
+        17	  files: /root/.claude/skills/gsd/SKILL.md
+        18	
+        19	[tool] Read SKILL.md (failed)
+        20	  kind: read
+        21	  input: {"call_id":"call_pqWiVSayENSrfXMHJIaa2TDn","process_id":"2476","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,220p' /root/.claude/skills/gsd/SKILL.md"],"cwd":"/workspace/pr...
+        22	  files: /root/.claude...
+
+[tool] Search project_delta_S_grad|load_v_hack|v_hack_path|susp_drop_frac|v_sv|_sv/|v-hack in src (running)
+  input: {"call_id":"call_eKBsKF4tsshQ29fx8yS736Lt","process_id":"18775","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","rg -n \"project_delta_S_grad|load_v_hack|v_hack_path|susp_drop_frac|v_s...
+
+[tool] Read antipasto.py (running)
+  input: {"call_id":"call_tQV8MNSEQrC6INjwfHLwzjER","process_id":"24911","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/antipasto.py | sed -n '1,240p'"],"cwd":"/work...
+  files: /workspace/projected_grpo/src/projected_grpo/antipasto.py
+
+[tool] Read extracted_prefs.md (running)
+  input: {"call_id":"call_pQoENmJAU2m9pygvgNy1bdzD","process_id":"92183","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba docs/brainstorm/extracted_prefs.md | sed -n '1,220p'"],"cwd":"/w...
+  files: /workspace/projected_grpo/docs/brainstorm/extracted_prefs.md
+
+[tool] Read extracted_prefs.md (completed)
+  kind: read
+  input: {"call_id":"call_pQoENmJAU2m9pygvgNy1bdzD","process_id":"92183","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba docs/brainstorm/extracted_prefs.md | sed -n '1,220p'"],"cwd":"/w...
+  files: /workspace/projected_grpo/docs/brainstorm/extracted_prefs.md
+  output:
+    1	<!-- Extracted from docs/1.md (4130-line brainstorm) + spec.md by Explore subagent, 2026-05-23.
+         2	     Verbatim phrases in backticks. Where the transcript is ambiguous, marked [ambiguous]. -->
+         3	
+         4	# Extracted preferences and decisions — projected_grpo
+         5	
+         6	## TL;DR delta vs spec.md
+         7	
+         8	Spec.md is the clean preregistered plan. docs/1.md is the reasoning trail behind it. The biggest
+         9	deltas the brainstorm adds (not in spec):
+        10	
+        11	1. **The whole project pivoted** mid-conversation from a DPO+sycophancy plan (Anthropic HH-RLHF)
+        12	   to GRPO+reward-hacking (Nanda/Ariahw LeetCode). Driver: gradient projection in SVD basis matches
+        13	   GRPO's unpaired structure better than DPO's paired-preference structure.
+        14	2. **Method evolved** from "bidirectional SVD-LoRA with NLL+KL" (paired-preference native, the
+        15	   AntiPaSTO line) to **gradient-level intervention + SVD-basis denoising** — an orthogonal
+        16	   approach for unpaired GRPO rollouts.
+        17	3. **Rebound paper (Wu & Tang 2026)** appeared mid-brainstorm and reframed the positioning:
+        18	   not novel mechanism (concept-direction intervention) but novel level (gradient vs advantage).
+        19	   User's confidence updated downward but stayed positive — ~60% the method works now (was ~40%
+        20	   pre-Rebound, framed as net positive because Rebound *validates* the core mechanism).
+        21	4. **Single-GPU pragmatism**: extensive back-and-forth on 3090 vs 96GB RTX 6000 Ada. Landed on
+        22	   96GB RTX 6000 + Qwen3.5-2B as the practical sweet spot.
+        23	
+        24	## 1. Design decisions
+        25	
+        26	- **Substitute Qwen3.5-2B for Qwen3-4B**. Reason: compute budget. Fallback to Qwen3-4B with
+        27	  reduced num_generations if H4 (hack emergence) fails at 2B.
+        28	- **Use verl, not TRL**. Reason: Nanda's repo uses verl v0.6.1; minimise reimplementation risk.
+    ... (135 more lines)
+
+[tool] Read antipasto.py (completed)
+  kind: read
+  input: {"call_id":"call_tQV8MNSEQrC6INjwfHLwzjER","process_id":"24911","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/antipasto.py | sed -n '1,240p'"],"cwd":"/work...
+  files: /workspace/projected_grpo/src/projected_grpo/antipasto.py
+  output:
+    1	"""AntiPaSTO full-rank adapter via forward hooks (lora-lite style).
+         2	
+         3	Per spec.md: each target nn.Linear keeps its original weight intact. We attach
+         4	frozen buffers U, Vh and a trainable delta_S of shape [r] per layer. A forward
+         5	post-hook adds the delta contribution:
+         6	
+         7	    y_new = y + U @ (delta_S * (Vh @ x))
+         8	
+         9	equivalent to W -> W + U diag(delta_S) Vh. At delta_S = 0 the delta is exactly
+        10	zero, so the wrapped model is bit-identical to the base (no SVD round-trip
+        11	error on the main path -- W stays as it was loaded). U, Vh stay frozen and
+        12	double as the basis for v_hack gradient projection (we read delta_S.grad
+        13	directly; no extra projection math at the gradient step).
+        14	"""
+        15	from __future__ import annotations
+        16	
+        17	import hashlib
+        18	from pathlib import Path
+        19	
+        20	import torch
+        21	from jaxtyping import Float
+        22	from loguru import logger
+        23	from torch import Tensor, nn
+        24	
+        25	
+        26	def svd_cached(
+        27	    W: Float[Tensor, "d_out d_in"],
+        28	    cache_path: Path,
+    ... (249 more lines)
+
+[tool] Search project_delta_S_grad|load_v_hack|v_hack_path|susp_drop_frac|v_sv|_sv/|v-hack in src (completed)
+  kind: search
+  input: {"call_id":"call_eKBsKF4tsshQ29fx8yS736Lt","process_id":"18775","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","rg -n \"project_delta_S_grad|load_v_hack|v_hack_path|susp_drop_frac|v_s...
+  output:
+    justfile:24:    {{ TRAIN }} --preset=smoke --arm=projected --v-hack-path=out/v_hack_smoke.safetensors {{ ARGS }}
+    justfile:31:    {{ TRAIN }} --preset=smoke --arm=projected --v-hack-path=out/v_hack_smoke.safetensors
+    justfile:38:    {{ TRAIN }} --preset=full --arm=projected --v-hack-path=out/v_hack_full.safetensors {{ ARGS }}
+    justfile:67:        --v-hack-path=out/v_hack_smoke.safetensors \
+    justfile:74:        --v-hack-path=out/v_hack_full.safetensors \
+    justfile:104:    pueue add -a "$VA" -w "$PWD" -o 8 -l "why: projected seed{{ seed }} @ matched batch, v_hack NOT post-hoc; resolve: Gate D H1 HACK_RATE<vanilla at matched PASS" -- {{ TRAIN }} --preset=full --arm=projected --seed={{ seed }} --v-hack-path=out/v_hack_full.safetensors --out-tag=_full_projected_seed{{ seed }}_probe
+    justfile:140:          -- {{ TRAIN }} --preset={{ preset }} --arm=projected --seed=$seed --v-hack-path={{ vhack }} --out-tag=_{{ preset }}_projected_seed$seed
+    justfile:151:    uv run python -m projected_grpo.probe_distill --v-hack-path=out/v_hack_full.safetensors {{ ARGS }}
+    justfile:167:        --v-hack-path=out/v_hack_full.safetensors
+    justfile:176:        --v-hack-path=out/v_hack_full.safetensors
+    justfile:178:# Arm 2: projected GRPO in SVD basis (AntiPaSTO + project_delta_S_grad).
+    justfile:183:        --v-hack-path=out/v_hack_full.safetensors
+    justfile:193:        --v-hack-path=out/v_hack_full.safetensors
+    justfile:200:        --v-hack-path=out/v_hack_full.safetensors
+    justfile:211:        --v-hack-path=out/v_hack_full.safetensors
+    justfile:220:        --v-hack-path=out/v_hack_full.safetensors
+    justfile:225:        --v-hack-path=out/v_hack_full.safetensors
+    justfile:230:        --v-hack-path=out/v_hack_full.safetensors
+    justfile:252:        --v-hack-path=out/v_hack_{{ tag }}.safetensors \
+    justfile:292:# --v-hack-path is set even for vanilla so cin/cout get measured as baseline
+    justfile:293:# (project_delta_S_grad with measure_only=True on vanilla arm).
+    justfile:299:            --v-hack-path=out/v_hac...
+The load-or-extract ordering looks intentional so far: extraction happens after wrapping and before optimizer construction, and it zeroes grads before returning. I’m now checking all call sites and the wrapper semantics before deciding whether this hides stateful side effects.
+
+[tool] Read probe_distill.py (running)
+  input: {"call_id":"call_t1hmCip7FvpT3kPO0dr9DaI6","process_id":"52595","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/probe_distill.py | sed -n '430,485p'"],"cwd":...
+  files: /workspace/projected_grpo/src/projected_grpo/probe_distill.py
+
+[tool] Read probe_distill.py (completed)
+  kind: read
+  input: {"call_id":"call_t1hmCip7FvpT3kPO0dr9DaI6","process_id":"52595","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/probe_distill.py | sed -n '430,485p'"],"cwd":...
+  files: /workspace/projected_grpo/src/projected_grpo/probe_distill.py
+  output:
+    430	                )
+       431	                mask = (ci != pad_id).float()
+       432	                per_sample_logp_mean[i] = float((logp_i * mask).sum().item() / max(1.0, mask.sum().item()))
+       433	                # Dr.GRPO REINFORCE: -adv * mean_logp. No PPO ratio because at step
+       434	                # start, student matches its own no_grad logp on these tokens.
+       435	                loss_i = -adv[i] * (logp_i * mask).sum() / mask.sum().clamp_min(1.0) / cfg.group
+       436	                per_sample_loss[i] = float(loss_i.item())
+       437	                loss_i.backward()
+       438	                contrib = {n: info["delta_S"].grad - g_before[n]
+       439	                           for n, info in wrappers.items()}
+       440	                per_sample_cos[i] = norm_weighted_cos(contrib, v_hack)
+       441	                per_sample_norm[i] = float(sum(c.float().pow(2).sum().item() for c in contrib.values()) ** 0.5)
+       442	                g_before = {n: info["delta_S"].grad.clone() for n, info in wrappers.items()}
+       443	
+       444	            # Importance ratio vs first-encounter logp. Only meaningful in
+       445	            # replay mode (same tokens, drifting student). For student-gen we
+       446	            # set ratio=1.0 because each step has freshly generated tokens.
+       447	            if replay_active and replay_problem_id not in logp_step0_by_prompt:
+       448	                logp_step0_by_prompt[replay_problem_id] = list(per_sample_logp_mean)
+       449	                per_sample_imp_ratio = [1.0] * cfg.group
+       450	            elif replay_active:
+       451	                base = logp_step0_by_prompt[replay_problem_id]
+       452	                per_sample_imp_ratio = [
+       453	                    float(torch.tensor(per_sample_logp_mean[i] - base[i]).exp().item())
+       454	                    for i in range(cfg.group)
+       455	                ]
+       456	            else:
+       457	                per_sample_imp_ratio = [1.0] * cfg.group
+    ... (87 more lines)
+
+[plan]
+  - [completed] Read local instructions and gsd workflow
+  - [completed] Collect diffs and changed-file context
+  - [completed] Review suspicion gate math and load/extract flow
+  - [completed] Check schema/test implications
+  - [in_progress] Write final review
+## Code Review: runtime suspicion gate + v_hack load/extract refactor
+
+### Summary
+The refactor cleanly reuses the already-wrapped model for cache-miss extraction, and optimizer ordering looks safe: extraction happens before `AdamW` construction and clears grads before training starts. The main issues are in the suspicion-gate semantics and fail-fast schema handling.
+
+### Important (should fix)
+- [src/projected_grpo/proj.py:77](/workspace/projected_grpo/src/projected_grpo/proj.py:77) `r_i = |g·v_i| / S_i` is not a clean discriminator across modules. `|g·v_i|` scales with the live module gradient norm, while `S_i` scales with extract-time pair-gradient magnitude, so a high-ratio module may just be a high-gradient module rather than one whose extracted axis is weak and suspicious. Since the threshold is global across all `(module, axis)` pairs, this can preferentially disable projection in high-gradient modules. Use a dimensionless within-module ratio, e.g. `(|c_i| / ||g||) / (S_i / ||D||_F)` with `c_i = g·v_i`, or calibrate against extract-time per-pair coefficients.
+
+- [src/projected_grpo/proj.py:81](/workspace/projected_grpo/src/projected_grpo/proj.py:81) The quantile gate always drops a fixed fraction when `v_sv` exists, even if no axis is actually suspicious. In a clean projected run, `drop_top_frac=0.25` still removes 25% of the projection axes. Conversely, if all axes are suspicious, it still keeps 75%. That makes `susp_drop_frac` a fixed projection-budget change, not a suspicion detector. I would make the gate threshold absolute after proper normalization, or run it as a `measure_only` diagnostic until a threshold is empirically calibrated.
+
+- [src/projected_grpo/train.py:394](/workspace/projected_grpo/src/projected_grpo/train.py:394) Old v_hack files without `_sv/` keys silently disable the suspicion gate while `susp_drop_frac` remains 0.25. `load_v_hack` explicitly allows empty `v_sv`, and `project_delta_S_grad` then reports `susp=0.00` rather than failing. In this fail-fast repo, v2 should be required when `susp_drop_frac > 0`, especially because the justfile still passes explicit `out/v_hack_full.safetensors` paths that may predate the schema.
+
+- [src/projected_grpo/extract_vhack_grad.py:117](/workspace/projected_grpo/src/projected_grpo/extract_vhack_grad.py:117) Non-finite extraction losses are skipped, which can leave hack and clean gradient stacks with different lengths and then fail later at `D = G_h - G_c`. That is a delayed and less informative failure. For research code, raise immediately with pair/label/loss context.
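
The normalization suggested in the first item can be sketched in a few lines. This is a minimal illustration in plain Python, not the code in `proj.py`: the function name `suspicion_ratios`, its argument layout, and the `eps` guard are assumptions made for the example. It takes `c_i = g·v_i`, with `d_frob` standing in for `||D||_F` at extract time.

```python
import math


def suspicion_ratios(g, V, S, d_frob, eps=1e-12):
    """Hypothetical helper: r_i = (|c_i| / ||g||) / (S_i / ||D||_F), c_i = g . v_i.

    g      : live gradient for one module (list of floats)
    V      : orthonormal axes, one list per axis
    S      : extract-time singular value for each axis
    d_frob : Frobenius norm of the extract-time difference matrix D
    """
    g_norm = math.sqrt(sum(x * x for x in g))
    ratios = []
    for v_i, s_i in zip(V, S):
        c_i = sum(a * b for a, b in zip(g, v_i))
        live_share = abs(c_i) / max(g_norm, eps)    # share of the live gradient on axis i
        extract_share = s_i / max(d_frob, eps)      # share of extract-time energy on axis i
        ratios.append(live_share / max(extract_share, eps))
    return ratios


V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
S = [3.0, 1.0]
d_frob = math.sqrt(sum(s * s for s in S))

small = suspicion_ratios([1.0, 1.0, 0.0], V, S, d_frob)
large = suspicion_ratios([10.0, 10.0, 0.0], V, S, d_frob)
```

Scaling the gradient by 10 leaves the ratios unchanged, whereas the raw `|c_i| / S_i` would grow tenfold. The weaker extracted axis (`S_i = 1.0`) also gets the larger ratio for the same live projection.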
+
+### Suggestions
+- [src/projected_grpo/train.py:378](/workspace/projected_grpo/src/projected_grpo/train.py:378) Load-or-extract looks safe with respect to optimizer state and gradients. One small concern: `extract_vhack_grad.main()` explicitly calls `model.eval()`, while train auto-extract relies on the model’s current mode. HF models usually load in eval mode, but I would set `model.eval()` before extraction and then explicitly choose the desired training mode after, to make this state transition visible.
+
+- [src/projected_grpo/extract_vhack_grad.py:259](/workspace/projected_grpo/src/projected_grpo/extract_vhack_grad.py:259) `len(v_hack)` now includes `_sv/` entries, so the final `modules=` count is doubled. Use the filtered module count for diagnostics, otherwise extraction logs will overstate module count and make zero-rate summaries harder to read.
+
+- The tiny-random smoke with `loss=0` and `cin/cout=NaN` is not an adequate gate test. Add a synthetic `project_delta_S_grad` smoke with two fake wrappers, nonzero grads, known orthonormal `V`, known `S`, and expected `frac_axes_susp`, `cout < cin`, and mutation/no-mutation behavior for projected vs vanilla. That directly exercises the gate without needing GRPO to produce non-degenerate rewards.
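
A synthetic smoke test along the lines of the last suggestion might look like the sketch below. It does not call the real `project_delta_S_grad`. `project_one_sided` is a toy stand-in written for this example, and both its definition of `cin`/`cout` (the hack-ward fraction of the gradient norm) and its `measure_only` behavior are assumptions. Only the one-sided rule, subtracting a component only when it is positive, comes from the extraction docstring. A real test would use two fake wrappers and the actual function.

```python
import math


def project_one_sided(g, V, measure_only=False):
    """Toy stand-in: remove positive components of g along orthonormal rows of V.

    Returns (g_out, cin, cout). cin/cout are assumed here to be
    sqrt(sum_i max(c_i, 0)^2) / ||vec|| before and after projection.
    """
    def hackward_fraction(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        pos = [max(sum(a * b for a, b in zip(vec, v)), 0.0) for v in V]
        return math.sqrt(sum(p * p for p in pos)) / norm if norm > 0 else 0.0

    cin = hackward_fraction(g)
    out = list(g)
    for v in V:
        c = sum(a * b for a, b in zip(g, v))
        if c > 0:  # one-sided: only hack-ward components are subtracted
            out = [o - c * vi for o, vi in zip(out, v)]
    cout = hackward_fraction(out)
    # measure_only reports the same diagnostics but leaves the gradient untouched
    return (list(g) if measure_only else out), cin, cout


g = [2.0, -1.0, 3.0]
V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

projected, cin, cout = project_one_sided(g, V)
untouched, cin_vanilla, _ = project_one_sided(g, V, measure_only=True)
```

With known inputs like these, the expected behavior can be asserted directly: the positively aligned axis is removed, the negatively aligned one is kept, `cout < cin`, and the vanilla path does not mutate the gradient.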
+
+### Verdict
+REQUEST CHANGES
+
+The load-or-extract path is mostly sound, but the suspicion gate currently behaves like a fixed top-fraction projection suppressor, and old artifacts can silently bypass it. Fix the gate normalization/threshold semantics and make v2 `_sv/` metadata required when the gate is enabled.
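
The difference between a fixed-fraction gate and an absolute threshold can be seen in a small example. Both helper names and the threshold value `3.0` are placeholders chosen for illustration, and the ratios are made-up inputs rather than measurements.

```python
def drop_by_quantile(ratios, drop_frac):
    """Drop the top drop_frac of axes by ratio, regardless of their absolute size."""
    n_drop = int(round(drop_frac * len(ratios)))
    order = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    return sorted(order[:n_drop])


def drop_by_threshold(ratios, threshold):
    """Drop only the axes whose normalized ratio exceeds an absolute threshold."""
    return [i for i, r in enumerate(ratios) if r > threshold]


clean = [1.0, 1.01, 0.99, 1.02]   # no axis stands out
all_bad = [5.0, 6.0, 7.0, 8.0]    # every axis looks suspicious

q_clean = drop_by_quantile(clean, 0.25)
q_bad = drop_by_quantile(all_bad, 0.25)
t_clean = drop_by_threshold(clean, 3.0)
t_bad = drop_by_threshold(all_bad, 3.0)
```

The quantile gate drops exactly one of four axes in both cases. The threshold gate drops none in the clean case and all four in the suspicious one, which is the behavior a detector would need once a threshold has been calibrated.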
+
+[done] end_turn