evil_MoE/docs/spec/20260527_code_review.md
wassname 5f196e3108 v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin
Extraction (extract_vhack_grad.py):
- Default top_k=12 (was 5), saves singular values S as _sv/{name} keys
- SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile)
- Pulled extract_v_hack() into a callable function for in-process reuse
- Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched)
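
A minimal sketch of the top-k extraction with majority-vote orientation described above, assuming `D` holds one row per contrastive pair (`g_hack - g_clean`). The function name `top_k_directions`, the use of NumPy instead of torch, and the tie rule (ties are left unflipped) are illustrative assumptions, not the code in `extract_vhack_grad.py`.

```python
import numpy as np


def top_k_directions(D: np.ndarray, k: int):
    """Return (V [k, r], S [k]) from paired gradient diffs D [n_pairs, r].

    Hypothetical helper: rows of V are right singular vectors, each oriented
    so that the majority of pairs project positively onto it.
    """
    _, S, Vh = np.linalg.svd(D, full_matrices=False)
    k = min(k, Vh.shape[0])
    V, S = Vh[:k].copy(), S[:k].copy()
    proj = D @ V.T  # [n_pairs, k] projection of every pair on every axis
    # Majority vote: count positive minus negative projections per axis.
    votes = np.sign(proj).sum(axis=0)
    V[votes < 0] *= -1.0  # ties (votes == 0) are left as returned by SVD
    return V, S


# Four pairs agree on +e1, one outlier points strongly the other way.
D = np.array([[1.0, 0.0, 0.0]] * 4 + [[-10.0, 0.0, 0.0]])
V, S = top_k_directions(D, k=1)
mean_proj = float((D @ V[0]).mean())
print(V[0], S[0], mean_proj)
```

In this toy case the mean projection onto the oriented axis is negative, so a sign-of-mean rule would flip the axis to follow the single outlier, while the vote keeps it aligned with the four agreeing pairs.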

Loading (train.py:load_v_hack):
- Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict
- k_use slicing at load: extract at k=12, ablate k=1..12 by config flip
- Auto-extract on cache miss using already-wrapped model (no second model load)
- Default path derived from model_slug + extract_top_k
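
The load-time split and `k_use` slicing can be sketched as follows. This assumes the file has already been read into a flat mapping whose singular-value entries carry the `_sv/` prefix from the commit message; `split_and_slice` is a hypothetical name, plain lists stand in for tensors, and raising when `k_use` exceeds the extracted rank is an assumption about desirable behaviour rather than what `load_v_hack` is known to do.

```python
SV_PREFIX = "_sv/"


def split_and_slice(flat: dict, k_use: int):
    """Split {key: rows} into (v_hack, v_sv) and keep the first k_use axes.

    Works on anything sliceable along its first dimension (lists here,
    tensors in practice).
    """
    v_hack, v_sv = {}, {}
    for key, value in flat.items():
        if len(value) < k_use:
            # Assumption: fail loudly rather than silently use fewer axes.
            raise ValueError(f"{key}: extracted k={len(value)} < k_use={k_use}")
        if key.startswith(SV_PREFIX):
            v_sv[key[len(SV_PREFIX):]] = value[:k_use]
        else:
            v_hack[key] = value[:k_use]
    return v_hack, v_sv


flat = {
    "layer0": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],  # k=3 extracted axes
    "_sv/layer0": [3.0, 2.0, 1.0],                   # matching singular values
}
v_hack, v_sv = split_and_slice(flat, k_use=2)
print(v_hack, v_sv)
```

Because singular vectors come out of SVD sorted by singular value, slicing the first `k_use` rows is what makes a k-ablation possible without re-extracting.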

Runtime suspicion gate (proj.py:project_delta_S_grad):
- Dimensionless within-module ratio: r_i = (|c_i|/||g||) / (S_i/||S||)
  (codex/subagent flagged: |c_i|/S_i biased by per-module ||g||)
- Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25)
- Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file)
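
The ratio above can be sketched in a few lines of plain Python. Here `c` stands for the per-axis projection coefficients of one module's gradient, `S` for the saved singular values, and `g_norm` for that module's gradient norm. The helper names, the rank-based cut (`int(len(r) * drop_frac)` axes), and the assumption that `g_norm` and every `S_i` are positive are illustrative choices; the quantile logic in `proj.py` may differ.

```python
import math


def suspicion_ratios(c, S, g_norm):
    """r_i = (|c_i| / ||g||) / (S_i / ||S||); assumes g_norm > 0 and S_i > 0."""
    s_norm = math.sqrt(sum(s * s for s in S))
    return [(abs(ci) / g_norm) / (si / s_norm) for ci, si in zip(c, S)]


def gate_mask(r, drop_frac):
    """True for axes kept, False for the top drop_frac share ranked by r_i."""
    n_drop = int(len(r) * drop_frac)
    ranked = sorted(range(len(r)), key=lambda i: r[i], reverse=True)
    dropped = set(ranked[:n_drop])
    return [i not in dropped for i in range(len(r))]


c = [0.1, 0.4, 0.05, 0.3]
S = [4.0, 2.0, 2.0, 1.0]
r = suspicion_ratios(c, S, g_norm=1.0)
keep = gate_mask(r, drop_frac=0.25)

# Scaling the whole gradient by 10 scales c and g_norm together,
# so the ratios are unchanged: this is the point of dividing by ||g||.
r_scaled = suspicion_ratios([10 * ci for ci in c], S, g_norm=10.0)
print(r, keep, r_scaled)
```

The last call illustrates the flagged bias: a bare `|c_i| / S_i` would grow tenfold with the gradient norm, whereas the normalized ratio stays put.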

Per-source cin (proj.py:mean_cin_from_grads + train.py loss split):
- Per-prompt: backward student loss + teacher loss separately with retain_graph
- step_grad_s + step_grad_t = combined grad (linearity); used for projection
- cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack"
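
The linearity claim can be checked on a toy problem: the gradient of a summed loss equals the sum of the per-source gradients, so separate student and teacher backward passes lose nothing for the combined projection. The sketch below uses a hand-written squared-error gradient in plain Python instead of autograd, and `cos_to` is only a stand-in for a `cin`-style column, since the exact `cin` definition is not given here. In PyTorch, the first of two backward calls over a shared graph needs `retain_graph=True`, which is why the commit mentions it.

```python
import math


def grad_sq_loss(w, batch):
    """Gradient of sum_j (w . x_j - y_j)^2 with respect to w."""
    g = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            g[i] += 2.0 * err * xi
    return g


def cos_to(g, v):
    """Cosine between gradient g and direction v (hypothetical cin stand-in)."""
    dot = sum(a * b for a, b in zip(g, v))
    norm_g = math.sqrt(sum(a * a for a in g))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_g * norm_v)


w = [0.5, -1.0]
student = [([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
teacher = [([0.0, 2.0], 1.0)]

g_s = grad_sq_loss(w, student)            # student-only gradient
g_t = grad_sq_loss(w, teacher)            # teacher-only gradient
g_all = grad_sq_loss(w, student + teacher)  # gradient of the combined loss
g_sum = [a + b for a, b in zip(g_s, g_t)]

v = [0.0, 1.0]  # hypothetical direction to compare both sources against
cin_s, cin_t = cos_to(g_s, v), cos_to(g_t, v)
print(g_sum, g_all, cin_s, cin_t)
```

Keeping the two gradients separate is what allows per-source columns to be logged while the projection still operates on their sum.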

Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan)
Codex external review: docs/spec/20260527_code_review.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 06:39:05 +00:00


[client] initialize (running)

[client] session/new (running)

I'll use the local review workflow plus the gsd skill because this is a multi-step code review, then I'll read the project instructions and changed files before making findings.

[tool] pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'docs/brainstorm/extracted_prefs.md' -g 'docs/RESEARCH_JOURNAL.md' (running) input: {"call_id":"call_26P3qQ1bTCPLMMOGhHlNW3B7","process_id":"3146","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'do...

[tool] pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'docs/brainstorm/extracted_prefs.md' -g 'docs/RESEARCH_JOURNAL.md' (failed) kind: execute input: {"call_id":"call_26P3qQ1bTCPLMMOGhHlNW3B7","process_id":"3146","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'do... output: bwrap: No permissions to create new namespace, likely because the kernel does not allow non-privileged user namespaces. See https://deb.li/bubblewrap or file:///usr/share/doc/bubblewrap/README.Debian.gz.

[tool] Read SKILL.md (running) input: {"call_id":"call_pqWiVSayENSrfXMHJIaa2TDn","process_id":"2476","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,220p' /root/.claude/skills/gsd/SKILL.md"],"cwd":"/workspace/pr... files: /root/.claude/skills/gsd/SKILL.md

[tool] Read SKILL.md (failed) kind: read input: {"call_id":"call_pqWiVSayENSrfXMHJIaa2TDn","process_id":"2476","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,220p' /root/.claude/skills/gsd/SKILL.md"],"cwd":"/workspace/pr... files: /root/.claude/skills/gsd/SKILL.md output: bwrap: No permissions to create new namespace, likely because the kernel does not allow non-privileged user namespaces. See https://deb.li/bubblewrap or file:///usr/share/doc/bubblewrap/README.Debian.gz.

[tool] pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'docs/brainstorm/extracted_prefs.md' -g 'docs/RESEARCH_JOURNAL.md' (running) input: {"call_id":"call_qsJCRIGJ2i5HN4N8VhYCiO86","process_id":"97013","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'd...

[tool] pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'docs/brainstorm/extracted_prefs.md' -g 'docs/RESEARCH_JOURNAL.md' (completed) kind: execute input: {"call_id":"call_qsJCRIGJ2i5HN4N8VhYCiO86","process_id":"97013","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'd... output: /workspace/projected_grpo docs/brainstorm/extracted_prefs.md AGENTS.md

```sh
/workspace/projected_grpo
docs/brainstorm/extracted_prefs.md
AGENTS.md
```

[tool] Read SKILL.md (running) input: {"call_id":"call_Epox1WkTa0YbizPn1jb4k52M","process_id":"4410","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,220p' /root/.claude/skills/gsd/SKILL.md"],"cwd":"/workspace/pr... files: /root/.claude/skills/gsd/SKILL.md

[tool] Read SKILL.md (completed) kind: read input: {"call_id":"call_Epox1WkTa0YbizPn1jb4k52M","process_id":"4410","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,220p' /root/.claude/skills/gsd/SKILL.md"],"cwd":"/workspace/pr... files: /root/.claude/skills/gsd/SKILL.md output: --- name: gsd description: "Get Shit Done workflow for non-trivial tasks: spec -> implement -> test -> review -> wrap. Mandatory for anything with more than one step." ---

<!--
Design rationale and sources — for human readers, not the agent.

- GSD spec-first: spec-driven development, scope fencing, verify-before-done
- Manus file-as-memory: one file is working memory, checklist is state, append-only log
- Boomerang: return concrete artifacts not summaries
- DCP: invisible dedup, superseded writes, error purging

Core principles:
- Spec file is the only persistent artifact (single-file invariant).
- Agent shouldn't know about its own resource limits.
- Review scales with input size: frontier for plans, cheap for bulk.
- Observable outputs: show evidence, not narration.
-->

# GSD (Get Shit Done)

Mandatory for all non-trivial tasks (more than one step). Skip only for single-line fixes.

## Part 1: Core workflow

### 1.1 Spec-first planning

... (191 more lines)

[plan]

  • [in_progress] Read local instructions and gsd workflow
  • [pending] Collect diffs and changed-file context
  • [pending] Review suspicion gate math and load/extract flow
  • [pending] Check schema/test implications
  • [pending] Write final review

The sandbox is blocking normal read commands, so I'm using approved escalated reads only. Next I'm collecting the actual diff and full file context, with the review scoped to the changed files.

[tool] Read AGENTS.md (running) input: {"call_id":"call_NLThtl3Ffv5iUZwilR5SADCz","process_id":"14734","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,240p' AGENTS.md"],"cwd":"/workspace/projected_grpo","parsed_c... files: /workspace/projected_grpo/AGENTS.md

[tool] git status --short (running) input: {"call_id":"call_xaCeeYFvoLvaO7EJ0Sl9is82","process_id":"72006","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git status --short"],"cwd":"/workspace/projected_grpo","parsed_cmd":[{"...

[tool] git diff -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo/train.py src/projected_grpo/proj.py src/projected_grpo/verify_vhack_heldout.py src/projected_grpo/probe_distill.py (running) input: {"call_id":"call_KnN7vfJXbBaKGCBOobcZOO4x","process_id":"78686","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git diff -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo...

[tool] git diff --cached -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo/train.py src/projected_grpo/proj.py src/projected_grpo/verify_vhack_heldout.py src/projected_grpo/probe_distill.py (running) input: {"call_id":"call_i2OvvQxGyfTxRhy2A5IPLZ5c","process_id":"75184","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git diff --cached -- src/projected_grpo/extract_vhack_grad.py src/proje...

[tool] Read AGENTS.md (completed) kind: read input: {"call_id":"call_NLThtl3Ffv5iUZwilR5SADCz","process_id":"14734","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,240p' AGENTS.md"],"cwd":"/workspace/projected_grpo","parsed_c... files: /workspace/projected_grpo/AGENTS.md output: # AGENTS.md — projected_grpo

**This is novel ML research.** Not in your training data. Extrapolate carefully.

## Project in one paragraph

Test whether SVD-basis gradient projection against an extracted hack-direction
reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from
Rebound (Wu & Tang 2026) by intervening at the *gradient* level rather than the
*advantage* level. Differs from AntiPaSTO (the user's prior work) by using
unpaired GRPO rollouts rather than paired-preference contrast.

Inherit global rules from `~/.claude/CLAUDE.md`.

## Workflow

- Read [docs/spec.md](spec.md) for the preregistered plan.
- Read [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md) for design rationale.
- New sweep arms get recipes in [justfile](justfile) with `# H:` hypothesis comments.
- `just fast-dev-run` before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
- Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
- Head [docs/RESEARCH_JOURNAL.md](docs/RESEARCH_JOURNAL.md) for latest results.
- No `tests/` dir; `fast-dev-run` is the correctness gate.

## External dependencies

`external/rl-rewardhacking/` is Ariahw's repo (verl-based GRPO + LeetCode dataset
+ reward hacking monitors). We import from it; we do NOT modify it. Sync with
... (44 more lines)

[tool] git diff -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo/train.py src/projected_grpo/proj.py src/projected_grpo/verify_vhack_heldout.py src/projected_grpo/probe_distill.py (completed) kind: execute input: {"call_id":"call_KnN7vfJXbBaKGCBOobcZOO4x","process_id":"78686","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git diff -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo... output:
diff --git a/src/projected_grpo/extract_vhack_grad.py b/src/projected_grpo/extract_vhack_grad.py
index 5e9f1ff..57ed5d0 100644
--- a/src/projected_grpo/extract_vhack_grad.py
+++ b/src/projected_grpo/extract_vhack_grad.py
@@ -51,7 +51,14 @@ class Config:
     out_path: Path = OUT_DIR / "v_hack.safetensors"
     train_grads_path: Path = OUT_DIR / "vhack_grads_train.safetensors"
     n_heldout: int = 2  # last n pairs reserved for held-out validation
-    top_k: int = 5  # rank of hack subspace per module (≤ n_train_pairs)
+    # top_k=12 = max(n_train_pairs after n_heldout=2 from N=14 pairs). Extract once
+    # at max rank; train.py slices via --v-hack-k for k-ablation without re-extract.
+    top_k: int = 12
+    # tau_axis: zero rows where S_i/S_0 < tau_axis. Diagnostic — projection along
+    # noise-direction unit vectors removes only ~||g||/sqrt(r) ≈ 2% of grad
+    # magnitude on r=2560 modules, so this rarely changes effect size; it does
+    # make k-ablations honest (axes 4-5 might be pure noise on N=12 pairs).
+    tau_axis: float = 0.0

 def resolve_dtype(s: str) -> torch.dtype:
@@ -73,29 +80,32 @@ def completion_nll(model, tokenizer, prompt: str, completion: str, device) -> to
     return (nll * mask).sum() / mask.sum().clamp_min(1.0)
 
 
-def main(cfg: Config) -> int:
-    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
-    dtype = resolve_dtype(cfg.dtype)
-    logger.info(
... (1103 more lines)

[tool] git status --short (completed) kind: execute input: {"call_id":"call_xaCeeYFvoLvaO7EJ0Sl9is82","process_id":"72006","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git status --short"],"cwd":"/workspace/projected_grpo","parsed_cmd":[{"... output: M justfile M src/projected_grpo/extract_vhack_grad.py M src/projected_grpo/probe_distill.py M src/projected_grpo/proj.py M src/projected_grpo/train.py M src/projected_grpo/verify_vhack_heldout.py ?? docs/spec/20260527_code_review.md

```sh
 M justfile
 M src/projected_grpo/extract_vhack_grad.py
 M src/projected_grpo/probe_distill.py
 M src/projected_grpo/proj.py
 M src/projected_grpo/train.py
 M src/projected_grpo/verify_vhack_heldout.py
?? docs/spec/20260527_code_review.md
```

[tool] git diff --cached -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo/train.py src/projected_grpo/proj.py src/projected_grpo/verify_vhack_heldout.py src/projected_grpo/probe_distill.py (completed) kind: execute input: {"call_id":"call_i2OvvQxGyfTxRhy2A5IPLZ5c","process_id":"75184","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git diff --cached -- src/projected_grpo/extract_vhack_grad.py src/proje... output: { "call_id": "call_i2OvvQxGyfTxRhy2A5IPLZ5c", "process_id": "75184", "turn_id": "019e681a-3b48-7ed3-8e6b-1912b41adb6f", "command": [ "/usr/bin/zsh", "-lc", "git diff --cached -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo/train.py src/projected_grpo/proj.py src/projected_grpo/verify_vhack_heldout.py src/projected_grpo/probe_distill.py" ], "cwd": "/workspace/projected_grpo", "parsed_cmd": [ { "type": "unknown", "cmd": "git diff --cached -- src/projected_grpo/extract_vhack_grad.py src/projected_grpo/train.py src/projected_grpo/proj.py src/projected_grpo/verify_vhack_heldout.py src/projected_grpo/probe_distill.py" } ], "source": "unified_exec_startup", "stdout": "", "stderr": "", "aggregated_output": "", "exit_code": 0, "duration": { "secs": 1, "nanos": 432131924 }, "formatted_output": "", "status": "completed" }

I have the high-level diff. I'm reading the full changed files now because the suspicion gate depends on surrounding training order, optimizer setup, and how delta_S is wrapped.

[tool] Read proj.py (running) input: {"call_id":"call_B8lYmbn1uM0INz0YwcS54y79","process_id":"37400","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/proj.py"],"cwd":"/workspace/projected_grpo","... files: /workspace/projected_grpo/src/projected_grpo/proj.py

[tool] Read train.py (running) input: {"call_id":"call_19Uf3j0MKdVXQPhRaIstMKAF","process_id":"58769","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/train.py"],"cwd":"/workspace/projected_grpo",... files: /workspace/projected_grpo/src/projected_grpo/train.py

[tool] Read extract_vhack_grad.py (running) input: {"call_id":"call_UjQGhDRH4LEXXP0yi2Oa8hOV","process_id":"85164","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/extract_vhack_grad.py"],"cwd":"/workspace/pro... files: /workspace/projected_grpo/src/projected_grpo/extract_vhack_grad.py

[tool] Read verify_vhack_heldout.py (running) input: {"call_id":"call_9gmm3VORQurDl6RBtIOyqOCR","process_id":"42951","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/verify_vhack_heldout.py"],"cwd":"/workspace/p... files: /workspace/projected_grpo/src/projected_grpo/verify_vhack_heldout.py

[tool] Read proj.py (completed) kind: read input: {"call_id":"call_B8lYmbn1uM0INz0YwcS54y79","process_id":"37400","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/proj.py"],"cwd":"/workspace/projected_grpo","... files: /workspace/projected_grpo/src/projected_grpo/proj.py output: 1 """Gradient projection + delta_S grad utilities. Imported by smoke and train.""" 2 from __future__ import annotations 3 4 import torch 5 6 7 def per_token_logps(logits: torch.Tensor, ids: torch.Tensor) -> torch.Tensor: 8 """log p(ids | logits) gathered token-wise. 9 10 Uses F.cross_entropy (fused softmax+gather) so we never materialise the 11 full [B, L, V] fp32 softmax. On Qwen3.5-2B with V=152k, G=8, L≈1500 the 12 fp32 vocab tensor was ~7 GB per forward — the difference between OOM and 13 fit on a 96 GB card when the autograd graph is alive. 14 """ 15 B, L, V = logits.shape 16 # CE's internal log_softmax accumulates in fp32 (stable) but returns input dtype. 17 # The output [B*L] is small, so upcast it to fp32 for downstream PPO ratio math. 18 return -torch.nn.functional.cross_entropy( 19 logits.reshape(-1, V), ids.reshape(-1), reduction="none" 20 ).float().view(B, L) 21 22 23 @torch.no_grad() 24 def project_delta_S_grad( 25 wrappers: dict, 26 v_hack: dict[str, torch.Tensor], 27 preserve_magnitude: bool, 28 measure_only: bool = False, ... (251 more lines)

[tool] Read train.py (completed) kind: read input: {"call_id":"call_19Uf3j0MKdVXQPhRaIstMKAF","process_id":"58769","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/train.py"],"cwd":"/workspace/projected_grpo",... files: /workspace/projected_grpo/src/projected_grpo/train.py output: 1 """Canonical training entry point: AntiPaSTO + GRPO (Dr.GRPO unbiased) + optional 2 gradient projection on LeetCode reward-hacking benchmark. 3 4 Lineage (see spec.md §76-83): 5 - The inner GRPO_step (per_token_logps, ratio + clip + min, K3 KL, per-token 6 loss, completion mask) is a direct port of lsdefine/simple_GRPO's 7 GRPO_step in grpo_vllm_one.py (lines 64-95). 8 - The OUTER loop adopts simple_GRPO's Q_batch_size pattern (multiple 9 prompts per optimizer step, per-prompt GRPO advantage groups, grad 10 accumulation across prompts). GRPO needs within-group reward diversity to 11 produce any signal; sampling many prompts per step raises the chance that 12 at least one group is non-degenerate. simple_GRPO uses Q_batch_size=5; our 13 prompts_per_step is set in PRESETS (grad-accum to the paper's effective batch). 14 - Deviations from simple_GRPO are deliberate, listed in spec.md: 15 1. Loss normalization: Dr.GRPO unbiased (Liu et al. 2025, arXiv 16 2503.20783) replaces simple_GRPO's (R-mean)/std + per-response-len 17 denominator. Drops two biases: 18 - length norm 1/|o_i| (favors short correct, long incorrect) 19 - group-std norm /std(R) (overweights easy/hard questions) 20 Toggle via --unbiased (default on); flipping to False recovers 21 simple_GRPO's classic GRPO advantage normalization. 22 2. Reference model: simple_GRPO runs a separate base model via an HTTP 23 ref_server. We use the AntiPaSTO delta_S=0 zero-adapter trick 24 (W' = W + U diag(0) Vh = W exactly) — no second model loaded. 25 3. Rollout: simple_GRPO uses vLLM in a separate process. We use HF 26 model.generate in-process. 27 4. 
Adapter: simple_GRPO is full FT (with DeepSpeed ZeRO). Canonical 28 (ariahw/rl-reward...

[tool] Read verify_vhack_heldout.py (completed) kind: read input: {"call_id":"call_9gmm3VORQurDl6RBtIOyqOCR","process_id":"42951","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/verify_vhack_heldout.py"],"cwd":"/workspace/p... files: /workspace/projected_grpo/src/projected_grpo/verify_vhack_heldout.py output: 1 """Held-out v_hack validation (spec.md §B validation). 2 3 For each held-out pair, compute per-module gradient diff (g_hack - g_clean) 4 in delta_S basis, then cos-align with the trained v_hack[name]. 5 6 Report: 7 - per-suffix median/mean cos_align 8 - fraction of modules with cos_align > 0 (SHOULD > 0.5) 9 - mean cos_align across modules (target > 0.2) 10 11 Run: uv run python -m projected_grpo.verify_vhack_heldout 12 """ 13 from __future__ import annotations 14 15 import sys 16 from collections import defaultdict 17 from dataclasses import dataclass 18 from pathlib import Path 19 20 import json 21 22 import torch 23 import tyro 24 from loguru import logger 25 from safetensors.torch import save_file 26 from tabulate import tabulate 27 from transformers import AutoModelForCausalLM, AutoTokenizer 28 ... (279 more lines)

[tool] Read extract_vhack_grad.py (completed) kind: read input: {"call_id":"call_UjQGhDRH4LEXXP0yi2Oa8hOV","process_id":"85164","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/extract_vhack_grad.py"],"cwd":"/workspace/pro... files: /workspace/projected_grpo/src/projected_grpo/extract_vhack_grad.py output: 1 """Gradient-side per-module v_hack extraction (spec.md §B, top-k variant). 2 3 For each contrastive pair (prompt, hack_completion, clean_completion): 4 - Forward(prompt+completion), mean-NLL on completion tokens, backward 5 - Capture delta_S.grad per AntiPaSTO-wrapped Linear 6 7 Then per module, with D = [g_hack_i - g_clean_i for each pair] in R^{n_pairs x r}: 8 SVD(D) = U Σ Vh 9 v_hack[name] = top_k rows of Vh, each oriented so mean(D @ v_i) > 0 10 11 This generalizes mean-diff (which corresponds to top-1 PC of paired diffs under 12 isotropic covariance) to a rank-k hack subspace, motivated by CHaRS (Abdullaev 13 2025 — see docs/paper_chars.md): hack signal is multi-modal across hack flavors 14 (weak tests, hardcode, persona, ...), so a single global direction is brittle. 15 16 Orientation matters because proj.py applies a per-direction one-sided gate 17 (only subtracts <g, v_i> when positive). +v_i must point hack-ward. 18 19 Saves out/v_hack.safetensors = dict[name -> Tensor[k, r]] (cpu fp32, rows 20 unit-norm + orthonormal from SVD) with header {"model": str, "dtype": str, 21 "top_k": str(k)}. 22 23 Run: uv run python -m projected_grpo.extract_vhack_grad 24 """ 25 from __future__ import annotations 26 27 import sys 28 from collections import defaultdict ... (515 more lines)

[tool] Read train.py (running) input: {"call_id":"call_XmsO6ztYV0PA2nSx8J4xdbu9","process_id":"5743","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/train.py | sed -n '390,530p'"],"cwd":"/workspa... files: /workspace/projected_grpo/src/projected_grpo/train.py

[tool] Read probe_distill.py (running) input: {"call_id":"call_NwEh3dqGqjDFXQuySFaT2c0N","process_id":"46600","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/probe_distill.py | sed -n '180,240p'"],"cwd":... files: /workspace/projected_grpo/src/projected_grpo/probe_distill.py

[tool] Read 20260527_code_review.md (running) input: {"call_id":"call_u67hecWsDhI7oi0boUU1C0q5","process_id":"15286","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba docs/spec/20260527_code_review.md"],"cwd":"/workspace/projected_... files: /workspace/projected_grpo/docs/spec/20260527_code_review.md

[tool] git diff -- justfile (running) input: {"call_id":"call_3AYtTkFJY9bhQBbuGufHXJiA","process_id":"61295","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git diff -- justfile"],"cwd":"/workspace/projected_grpo","parsed_cmd":[...

[tool] Read train.py (completed) kind: read input: {"call_id":"call_XmsO6ztYV0PA2nSx8J4xdbu9","process_id":"5743","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/train.py | sed -n '390,530p'"],"cwd":"/workspa... files: /workspace/projected_grpo/src/projected_grpo/train.py output: 390 metadata={"model": model_name, "dtype": "bf16", 391 "top_k": str(min(cfg.v_hack_extract_top_k, len(VHACK_PAIRS) - 2)), 392 "tau_axis": "0.0", "schema": "v2_with_sv"}) 393 # extract leaves model.zero_grad() done; reset opt state isn't needed (opt built below) 394 v_hack_cpu, v_sv_cpu = load_v_hack(v_hack_path, model_name, wrappers, k_use=cfg.v_hack_k) 395 v_hack = {name: v.to(device) for name, v in v_hack_cpu.items()} 396 v_sv = {name: s.to(device) for name, s in v_sv_cpu.items()} if v_sv_cpu else None 397 # Teacher pool: pre-generated rollouts on disk keyed by problem_id. Each step's 398 # G_t teacher rollouts come from a uniform random sample of that prompt's cache, 399 # so we do not keep the teacher model in VRAM. Pool is produced by 400 # probe_distill.py --teacher-only (see schema in probe_distill.py:149-186). 401 # Cached rewards/flags are reused verbatim — no re-grading — so the pool is a 402 # reproducible fixed teacher distribution across runs. 403 teacher_pool: dict[int, list[dict]] = {} 404 G_s = group 405 G_t = 0 406 if cfg.teacher_pool_dir is not None: 407 if not (0.0 < cfg.mix_ratio < 1.0): 408 raise ValueError(f"mix_ratio must be in (0,1) when teacher_pool_dir set; got {cfg.mix_ratio}") 409 G_t = round(group * cfg.mix_ratio) 410 G_s = group - G_t 411 if G_s == 0 or G_t == 0: 412 raise ValueError( 413 f"degenerate split: G={group} mix_ratio={cfg.mix_ratio} -> G_s={G_s}, G_t={G_t}. " 414 f"Pick mix_ratio so both halves are non-empty, or drop --teacher-pool-dir." 
415 ) 416 for path in sorted(cfg.teacher_pool_dir.glob("prompt_*.jsonl.gz")): 417 # path.stem on 'prompt_0004.jsonl.gz' is 'prompt_0004.jsonl' (only one ....

[tool] Read probe_distill.py (completed) kind: read input: {"call_id":"call_NwEh3dqGqjDFXQuySFaT2c0N","process_id":"46600","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/probe_distill.py | sed -n '180,240p'"],"cwd":... files: /workspace/projected_grpo/src/projected_grpo/probe_distill.py output: 180 181 182 def load_prompt(pool_dir: Path, problem_id: int) -> list[dict]: 183 path = pool_dir / f"prompt_{problem_id:04d}.jsonl.gz" 184 with gzip.open(path, "rt") as f: 185 return [json.loads(line) for line in f] 186 187 188 def main(cfg: Config) -> int: 189 if cfg.tag: 190 tag = cfg.tag 191 elif cfg.teacher_only: 192 tag = "teacher_pool" 193 elif cfg.base_only: 194 tag = "base_pool" 195 else: 196 tag = f"{cfg.arm}seed{cfg.seed}" 197 run_id = f"distill{tag}" 198 setup_logging(run_id) 199 torch.manual_seed(cfg.seed) 200 device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 201 202 logger.info(f"argv: {' '.join(sys.argv)}") 203 logger.info(f"arm={cfg.arm} teacher={cfg.teacher} steps={cfg.steps} " 204 f"G={cfg.group} seed={cfg.seed} " 205 f"teacher_only={cfg.teacher_only} replay={cfg.replay_dir is not None}") 206 207 if cfg.teacher_only or cfg.base_only: ... (97 more lines)

[tool] git diff -- justfile (completed) kind: execute input: {"call_id":"call_3AYtTkFJY9bhQBbuGufHXJiA","process_id":"61295","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","git diff -- justfile"],"cwd":"/workspace/projected_grpo","parsed_cmd":[... output:
diff --git a/justfile b/justfile
index c1dc901..81f78b4 100644
--- a/justfile
+++ b/justfile
@@ -285,18 +285,21 @@ pregen-teacher n_prompts="100":
--group=8
--max-new=1024

-# 10-step feasibility probe: clean Qwen3-4B + 50% cached teacher pool, pp=4.
-# Queues via pueue so logs are shared between user and agent. Student is
-# restricted to problems covered by the pool. GO/NO-GO: final HACK_STUDENT
-# > 0.10 by step 10.
+# 100-step feasibility probe: clean Qwen3-4B + 75% cached teacher pool, pp=4, G=12.
+# Plan B "free lunch": mix=0.75 -> G_s=3, G_t=9. Gen wall-time unchanged
+# (teacher is cached disk reads), backward VRAM ~2x current (peak ~55-60 GB on
+# 96 GB card). At 48 gens/step (vs reference 256), 100 steps ~= 19 ref steps.
+# --v-hack-path is set even for vanilla so cin/cout get measured as baseline
+# (project_delta_S_grad with measure_only=True on vanilla arm).
 probe-mixed seed="41":
-    pueue add -l "why: does mixed-pool GRPO (cached teacher) drive student hack-rate up from clean base; resolve: confirm student hack_s climbs 0->20%+ over 10 steps" \
+    pueue add -l "why: does mixed-pool GRPO (cached teacher, plan B grad pressure) drive student hack-rate from clean base; resolve: confirm hack_s climbs 0->10%+ over 100 steps (~19 ref-eq)" \
         -w "$PWD" -- \
         {{ TRAIN }} --preset=full --arm=vanilla \
             --model={{ MODEL }} \
+            --v-hack-path=out/v_hack_full.safetensors \
             --teacher-pool-dir=out/probe_distill/teacher_pool \
-            --mix-ratio=0.5 \
-            --steps=10 --prompts-per-step=4 \
... (40 more lines)

[tool] Read 20260527_code_review.md (completed) kind: read input: {"call_id":"call_u67hecWsDhI7oi0boUU1C0q5","process_id":"15286","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba docs/spec/20260527_code_review.md"],"cwd":"/workspace/projected_... files: /workspace/projected_grpo/docs/spec/20260527_code_review.md output: 1 [client] initialize (running) 2 3 [client] session/new (running) 4 Ill use the local review workflow plus the gsd skill because this is a multi-step code review, then Ill read the project instructions and changed files before making findings. 5 6 [tool] pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'docs/brainstorm/extracted_prefs.md' -g 'docs/RESEARCH_JOURNAL.md' (running) 7 input: {"call_id":"call_26P3qQ1bTCPLMMOGhHlNW3B7","process_id":"3146","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'do... 8 9 [tool] pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'docs/brainstorm/extracted_prefs.md' -g 'docs/RESEARCH_JOURNAL.md' (failed) 10 kind: execute 11 input: {"call_id":"call_26P3qQ1bTCPLMMOGhHlNW3B7","process_id":"3146","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","pwd && rg --files -g 'AGENTS.md' -g 'CLAUDE.md' -g 'docs/spec.md' -g 'do... 12 output: 13 bwrap: No permissions to create new namespace, likely because the kernel does not allow non-privileged user namespaces. See https://deb.li/bubblewrap or file:///usr/share/doc/bubblewrap/README.Debian.gz. 14 15 [tool] Read SKILL.md (running) 16 input: {"call_id":"call_pqWiVSayENSrfXMHJIaa2TDn","process_id":"2476","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,220p' /root/.claude/skills/gsd/SKILL.md"],"cwd":"/workspace/pr... 
17 files: /root/.claude/skills/gsd/SKILL.md 18 19 [tool] Read SKILL.md (failed) 20 kind: read 21 input: {"call_id":"call_pqWiVSayENSrfXMHJIaa2TDn","process_id":"2476","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","sed -n '1,220p' /root/.claude/skills/gsd/SKILL.md"],"cwd":"/workspace/pr... 22 files: /root/.claude...

[tool] Search project_delta_S_grad|load_v_hack|v_hack_path|susp_drop_frac|v_sv|_sv/|v-hack in src (running) input: {"call_id":"call_eKBsKF4tsshQ29fx8yS736Lt","process_id":"18775","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","rg -n "project_delta_S_grad|load_v_hack|v_hack_path|susp_drop_frac|v_s...

[tool] Read antipasto.py (running) input: {"call_id":"call_tQV8MNSEQrC6INjwfHLwzjER","process_id":"24911","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/antipasto.py | sed -n '1,240p'"],"cwd":"/work... files: /workspace/projected_grpo/src/projected_grpo/antipasto.py

[tool] Read extracted_prefs.md (running) input: {"call_id":"call_pQoENmJAU2m9pygvgNy1bdzD","process_id":"92183","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba docs/brainstorm/extracted_prefs.md | sed -n '1,220p'"],"cwd":"/w... files: /workspace/projected_grpo/docs/brainstorm/extracted_prefs.md

[tool] Read extracted_prefs.md (completed) kind: read input: {"call_id":"call_pQoENmJAU2m9pygvgNy1bdzD","process_id":"92183","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba docs/brainstorm/extracted_prefs.md | sed -n '1,220p'"],"cwd":"/w... files: /workspace/projected_grpo/docs/brainstorm/extracted_prefs.md output: 1 3 4 # Extracted preferences and decisions — projected_grpo 5 6 ## TL;DR delta vs spec.md 7 8 Spec.md is the clean preregistered plan. docs/1.md is the reasoning trail behind it. The biggest 9 deltas the brainstorm adds (not in spec): 10 11 1. The whole project pivoted mid-conversation from a DPO+sycophancy plan (Anthropic HH-RLHF) 12 to GRPO+reward-hacking (Nanda/Ariahw LeetCode). Driver: gradient projection in SVD basis matches 13 GRPO's unpaired structure better than DPO's paired-preference structure. 14 2. Method evolved from "bidirectional SVD-LoRA with NLL+KL" (paired-preference native, the 15 AntiPaSTO line) to gradient-level intervention + SVD-basis denoising — an orthogonal 16 approach for unpaired GRPO rollouts. 17 3. Rebound paper (Wu & Tang 2026) appeared mid-brainstorm and reframed the positioning: 18 not novel mechanism (concept-direction intervention) but novel level (gradient vs advantage). 19 User's confidence updated downward but stayed positive — ~60% the method works now (was ~40% 20 pre-Rebound, framed as net positive because Rebound validates the core mechanism). 21 4. Single-GPU pragmatism: extensive back-and-forth on 3090 vs 96GB RTX 6000 Ada. Landed on 22 96GB RTX 6000 + Qwen3.5-2B as the practical sweet spot. 23 24 ## 1. Design decisions 25 26 - Substitute Qwen3.5-2B for Qwen3-4B. Reason: compute budget. Fallback to Qwen3-4B with 27 reduced num_generations if H4 (hack emergence) fails at 2B. 28 - Use verl, not TRL. Reason: Nanda's repo uses verl v0.6.1; minimise reimplementation risk. ... (135 more lines)

[tool] Read antipasto.py (completed) kind: read input: {"call_id":"call_tQV8MNSEQrC6INjwfHLwzjER","process_id":"24911","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/antipasto.py | sed -n '1,240p'"],"cwd":"/work... files: /workspace/projected_grpo/src/projected_grpo/antipasto.py output: 1 """AntiPaSTO full-rank adapter via forward hooks (lora-lite style). 2 3 Per spec.md: each target nn.Linear keeps its original weight intact. We attach 4 frozen buffers U, Vh and a trainable delta_S of shape [r] per layer. A forward 5 post-hook adds the delta contribution: 6 7 y_new = y + U @ (delta_S * (Vh @ x)) 8 9 equivalent to W -> W + U diag(delta_S) Vh. At delta_S = 0 the delta is exactly 10 zero, so the wrapped model is bit-identical to the base (no SVD round-trip 11 error on the main path -- W stays as it was loaded). U, Vh stay frozen and 12 double as the basis for v_hack gradient projection (we read delta_S.grad 13 directly; no extra projection math at the gradient step). 14 """ 15 from __future__ import annotations 16 17 import hashlib 18 from pathlib import Path 19 20 import torch 21 from jaxtyping import Float 22 from loguru import logger 23 from torch import Tensor, nn 24 25 26 def svd_cached( 27 W: Float[Tensor, "d_out d_in"], 28 cache_path: Path, ... (249 more lines)

[tool] Search project_delta_S_grad|load_v_hack|v_hack_path|susp_drop_frac|v_sv|_sv/|v-hack in src (completed) kind: search input: {"call_id":"call_eKBsKF4tsshQ29fx8yS736Lt","process_id":"18775","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","rg -n "project_delta_S_grad|load_v_hack|v_hack_path|susp_drop_frac|v_s... output: justfile:24: {{ TRAIN }} --preset=smoke --arm=projected --v-hack-path=out/v_hack_smoke.safetensors {{ ARGS }} justfile:31: {{ TRAIN }} --preset=smoke --arm=projected --v-hack-path=out/v_hack_smoke.safetensors justfile:38: {{ TRAIN }} --preset=full --arm=projected --v-hack-path=out/v_hack_full.safetensors {{ ARGS }} justfile:67: --v-hack-path=out/v_hack_smoke.safetensors
justfile:74: --v-hack-path=out/v_hack_full.safetensors
justfile:104: pueue add -a "$VA" -w "$PWD" -o 8 -l "why: projected seed{{ seed }} @ matched batch, v_hack NOT post-hoc; resolve: Gate D H1 HACK_RATE<vanilla at matched PASS" -- {{ TRAIN }} --preset=full --arm=projected --seed={{ seed }} --v-hack-path=out/v_hack_full.safetensors --out-tag=_full_projected_seed{{ seed }}probe justfile:140: -- {{ TRAIN }} --preset={{ preset }} --arm=projected --seed=$seed --v-hack-path={{ vhack }} --out-tag={{ preset }}projected_seed$seed justfile:151: uv run python -m projected_grpo.probe_distill --v-hack-path=out/v_hack_full.safetensors {{ ARGS }} justfile:167: --v-hack-path=out/v_hack_full.safetensors justfile:176: --v-hack-path=out/v_hack_full.safetensors justfile:178:# Arm 2: projected GRPO in SVD basis (AntiPaSTO + project_delta_S_grad). justfile:183: --v-hack-path=out/v_hack_full.safetensors justfile:193: --v-hack-path=out/v_hack_full.safetensors justfile:200: --v-hack-path=out/v_hack_full.safetensors justfile:211: --v-hack-path=out/v_hack_full.safetensors justfile:220: --v-hack-path=out/v_hack_full.safetensors justfile:225: --v-hack-path=out/v_hack_full.safetensors justfile:230: --v-hack-path=out/v_hack_full.safetensors justfile:252: --v-hack-path=out/v_hack{{ tag }}.safetensors
justfile:292:# --v-hack-path is set even for vanilla so cin/cout get measured as baseline justfile:293:# (project_delta_S_grad with measure_only=True on vanilla arm). justfile:299: --v-hack-path=out/v_hac...

The load-or-extract ordering looks intentional so far: extraction happens after wrapping and before optimizer construction, and it zeroes grads before returning. I'm now checking all call sites and the wrapper semantics before deciding whether this hides stateful side effects.

[tool] Read probe_distill.py (running) input: {"call_id":"call_t1hmCip7FvpT3kPO0dr9DaI6","process_id":"52595","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/probe_distill.py | sed -n '430,485p'"],"cwd":... files: /workspace/projected_grpo/src/projected_grpo/probe_distill.py

[tool] Read probe_distill.py (completed) kind: read input: {"call_id":"call_t1hmCip7FvpT3kPO0dr9DaI6","process_id":"52595","turn_id":"019e681a-3b48-7ed3-8e6b-1912b41adb6f","command":["/usr/bin/zsh","-lc","nl -ba src/projected_grpo/probe_distill.py | sed -n '430,485p'"],"cwd":... files: /workspace/projected_grpo/src/projected_grpo/probe_distill.py output: 430 ) 431 mask = (ci != pad_id).float() 432 per_sample_logp_mean[i] = float((logp_i * mask).sum().item() / max(1.0, mask.sum().item())) 433 # Dr.GRPO REINFORCE: -adv * mean_logp. No PPO ratio because at step 434 # start, student matches its own no_grad logp on these tokens. 435 loss_i = -adv[i] * (logp_i * mask).sum() / mask.sum().clamp_min(1.0) / cfg.group 436 per_sample_loss[i] = float(loss_i.item()) 437 loss_i.backward() 438 contrib = {n: info["delta_S"].grad - g_before[n] 439 for n, info in wrappers.items()} 440 per_sample_cos[i] = norm_weighted_cos(contrib, v_hack) 441 per_sample_norm[i] = float(sum(c.float().pow(2).sum().item() for c in contrib.values()) ** 0.5) 442 g_before = {n: info["delta_S"].grad.clone() for n, info in wrappers.items()} 443 444 # Importance ratio vs first-encounter logp. Only meaningful in 445 # replay mode (same tokens, drifting student). For student-gen we 446 # set ratio=1.0 because each step has freshly generated tokens. 447 if replay_active and replay_problem_id not in logp_step0_by_prompt: 448 logp_step0_by_prompt[replay_problem_id] = list(per_sample_logp_mean) 449 per_sample_imp_ratio = [1.0] * cfg.group 450 elif replay_active: 451 base = logp_step0_by_prompt[replay_problem_id] 452 per_sample_imp_ratio = [ 453 float(torch.tensor(per_sample_logp_mean[i] - base[i]).exp().item()) 454 for i in range(cfg.group) 455 ] 456 else: 457 per_sample_imp_ratio = [1.0] * cfg.group ... (87 more lines)

[plan]

  • [completed] Read local instructions and gsd workflow
  • [completed] Collect diffs and changed-file context
  • [completed] Review suspicion gate math and load/extract flow
  • [completed] Check schema/test implications
  • [in_progress] Write final review

Code Review: runtime suspicion gate + v_hack load/extract refactor

Summary

The refactor cleanly reuses the already wrapped model for cache-miss extraction, and optimizer ordering looks safe: extraction happens before AdamW construction and clears grads before training starts. The main issues are in the suspicion-gate semantics and fail-fast schema handling.

Important (should fix)

  • src/projected_grpo/proj.py:77 r_i = |g·v_i| / S_i is not a clean discriminator across modules. |g·v_i| scales with the live module gradient norm, while S_i scales with extract-time pair-gradient magnitude, so a high-ratio module may just be a high-gradient module rather than a suspicious weak extracted axis. Since the threshold is global across all (module, axis) pairs, this can preferentially disable projection in high-gradient modules. Use a dimensionless within-module ratio, e.g. (|c_i| / ||g||) / (S_i / ||D||_F), or calibrate against extract-time per-pair coefficients.

  • src/projected_grpo/proj.py:81 The quantile gate always drops a fixed fraction when v_sv exists, even if no axis is actually suspicious. In a clean projected run, drop_top_frac=0.25 still removes 25% of the projection axes. Conversely, if all axes are suspicious, it still keeps 75%. That makes susp_drop_frac a fixed projection-budget change, not a suspicion detector. I would make the gate threshold absolute after proper normalization, or run it as measure_only diagnostic until a threshold is empirically calibrated.

  • src/projected_grpo/train.py:394 Old v_hack files without _sv/ keys silently disable the suspicion gate while susp_drop_frac remains 0.25. load_v_hack explicitly allows empty v_sv, and project_delta_S_grad then reports susp=0.00 rather than failing. In this fail-fast repo, v2 should be required when susp_drop_frac > 0, especially because the justfile still passes explicit out/v_hack_full.safetensors paths that may predate the schema.

  • src/projected_grpo/extract_vhack_grad.py:117 Non-finite extraction losses are skipped, which can leave hack and clean gradient stacks with different lengths and then fail later at D = G_h - G_c. That is a delayed and less informative failure. For research code, raise immediately with pair/label/loss context.
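To make the first two points concrete, here is a minimal pure-Python sketch. It is not the code in proj.py: the helper names (raw_ratio, normalized_ratio, quantile_drop, absolute_drop), the toy numbers, and the threshold of 1.0 are all hypothetical, and ||S|| over the retained singular values is used as a stand-in for ||D||_F. It only illustrates that the raw ratio tracks the live gradient scale while the normalized one does not, and that a quantile gate drops a fixed share of axes even when none stands out.

```python
import math

def raw_ratio(c, S):
    """Raw ratio |c_i| / S_i; grows with the live gradient norm."""
    return [abs(ci) / si for ci, si in zip(c, S)]

def normalized_ratio(c, g_norm, S):
    """Dimensionless ratio (|c_i| / ||g||) / (S_i / ||S||)."""
    s_norm = math.sqrt(sum(si * si for si in S))
    return [(abs(ci) / g_norm) / (si / s_norm) for ci, si in zip(c, S)]

def quantile_drop(r, drop_frac):
    """Indices of the top drop_frac axes by r: a fixed budget, not a detector."""
    n_drop = int(round(drop_frac * len(r)))
    order = sorted(range(len(r)), key=lambda i: r[i], reverse=True)
    return sorted(order[:n_drop])

def absolute_drop(r, threshold):
    """Indices whose ratio exceeds an absolute (hypothetical) threshold."""
    return [i for i, ri in enumerate(r) if ri > threshold]

# Toy module with 4 axes. c_i = g . v_i is proportional to S_i,
# so no axis is more suspicious than any other.
S = [4.0, 2.0, 1.0, 0.5]
c = [0.4, 0.2, 0.1, 0.05]
g_norm = 1.0

# Same module, but the live gradient is 100x larger.
scaled_c = [100.0 * ci for ci in c]

raw_small = raw_ratio(c, S)
raw_big = raw_ratio(scaled_c, S)                           # 100x larger
norm_small = normalized_ratio(c, g_norm, S)
norm_big = normalized_ratio(scaled_c, 100.0 * g_norm, S)   # unchanged

dropped_by_quantile = quantile_drop(norm_small, 0.25)   # still drops one axis
dropped_by_threshold = absolute_drop(norm_small, 1.0)   # drops nothing
```

With a global threshold across (module, axis) pairs, the raw ratio would rank every axis of the 100x module above every axis of the small one, which is the bias described above. The absolute threshold would still need empirical calibration before it could replace the quantile.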

Suggestions

  • src/projected_grpo/train.py:378 Load-or-extract looks safe with respect to optimizer state and gradients. One small concern: extract_vhack_grad.main() explicitly calls model.eval(), while train auto-extract relies on the model's current mode. HF models usually load in eval mode, but I would set model.eval() before extraction and then explicitly choose the desired training mode after, to make this state transition visible.

  • src/projected_grpo/extract_vhack_grad.py:259 len(v_hack) now includes _sv/ entries, so the final modules= count is doubled. Use the filtered module count for diagnostics, otherwise extraction logs will overstate module count and make zero-rate summaries harder to read.

  • The tiny-random smoke with loss=0 and cin/cout=NaN is not an adequate gate test. Add a synthetic project_delta_S_grad smoke with two fake wrappers, nonzero grads, known orthonormal V, known S, and expected frac_axes_susp, cout < cin, and mutation/no-mutation behavior for projected vs vanilla. That directly exercises the gate without needing GRPO to produce non-degenerate rewards.
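As a rough shape for that synthetic smoke, here is a self-contained sketch. It does not call the real project_delta_S_grad, whose signature is not shown in this review. toy_project, subspace_cos, and the fake module names are hypothetical stand-ins, and it assumes that cin/cout are the share of the gradient norm inside span(V) before and after projection, and that measure_only reports both while leaving the gradient untouched.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def subspace_cos(g, V):
    """Share of ||g|| inside span(V); rows of V are assumed orthonormal."""
    return math.sqrt(sum(dot(g, v) ** 2 for v in V)) / math.sqrt(dot(g, g))

def toy_project(g, V, keep, measure_only=False):
    """Stand-in for the real projection: returns (g_out, cin, cout).

    keep[i] is False for axes the suspicion gate has dropped.
    """
    projected = list(g)
    for v, k in zip(V, keep):
        if k:
            c = dot(g, v)  # coefficient of g along axis v
            projected = [p - c * x for p, x in zip(projected, v)]
    cin, cout = subspace_cos(g, V), subspace_cos(projected, V)
    # In measure-only mode the caller gets the original gradient back.
    return (list(g) if measure_only else projected), cin, cout

# Two fake modules with known orthonormal axes and nonzero gradients.
modules = {
    "fake.q_proj": ([3.0, 4.0, 0.0, 1.0], [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]),
    "fake.v_proj": ([1.0, 2.0, 2.0, 0.0], [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]),
}

results = {}
for name, (g, V) in modules.items():
    full = toy_project(g, V, keep=[True, True])      # projected arm, no gating
    gated = toy_project(g, V, keep=[True, False])    # second axis gated off
    vanilla = toy_project(g, V, keep=[True, True], measure_only=True)
    results[name] = {"full": full, "gated": gated, "vanilla": vanilla}
```

A real version would swap toy_project for the actual function, feed it fake wrappers carrying delta_S.grad, and additionally pin the expected frac_axes_susp for a known S.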

Verdict

REQUEST CHANGES

The load-or-extract path is mostly sound, but the suspicion gate currently behaves like a fixed top-fraction projection suppressor, and old artifacts can silently bypass it. Fix the gate normalization/threshold semantics and make v2 _sv/ metadata required when the gate is enabled.

[done] end_turn