Review: Documentation Changes vs. Current Implementation
I read the following files: README.md, AGENTS.md, src/vgrout/train.py, src/vgrout/extract_vhack_act.py, src/vgrout/lora2r.py, docs/results.md, docs/writeup/main.tex, docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md, docs/brainstorm/extracted_prefs.md, docs/spec/20260611_act_gate_spec.md, and RESEARCH_JOURNAL.md.
Findings
(1) Activation scores select routes; output masks route subsequent gradients ✅ CLEAN
The documentation accurately describes the routeA mechanism. Evidence:
- README.md (~L44-51): "routeA scores each rollout on the no-grad logp_old forward … activation-capture hook pools … bottleneck activations … score is the pooled dot product with v_act … block masks are set from those labels before the single masked forward+backward."
- AGENTS.md (~L281-295): "routeA sets an output mask before the single grad-carrying forward and backward … The gate reads pooled activations, not gradients."
- train.py L488-509: ActCapture hooks capture A[:r] @ x during the no-grad logpi_old forward; dots = torch.einsum("gmr,mr->g", acts, v_act); _routeA_gate(dots) returns (m_vec, d_vec) pinned on layer._lora2r_mask; then a single loss.backward() where lora2r.py's hook applies dep.detach() and m*quar.
- lora2r.py L106-117: masks act on branch outputs, so d=1 blocks grads to both A rows and B columns of the deployed block, and m=0 zeroes the quarantine block in both forward and backward.
The flow is correctly documented: activations → dot product → Otsu thresholds → output masks → gradient routing. No errors.
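The scoring and gating steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in train.py: route_scores and gate_masks are hypothetical names, nested lists stand in for tensors, and the thresholds t_lo and t_hi are passed in by hand rather than computed with Otsu's method. The index layout follows the quoted einsum string gmr,mr->g, and the comparisons follow the quoted rule m = (z > t_lo), d = (z >= t_hi).

```python
def route_scores(acts, v_act):
    """Pooled dot product per rollout.

    acts:  [g][m][r] pooled bottleneck activations (g rollouts, m modules, rank r)
    v_act: [m][r] reference direction
    Returns scores[g] = sum over m and r of acts[g][m][r] * v_act[m][r].
    """
    return [
        sum(a * v
            for act_row, v_row in zip(rollout, v_act)
            for a, v in zip(act_row, v_row))
        for rollout in acts
    ]


def gate_masks(scores, t_lo, t_hi):
    """Map each score z to an (m, d) pair: m = z > t_lo, d = z >= t_hi."""
    return [(int(z > t_lo), int(z >= t_hi)) for z in scores]


# Toy data: three rollouts, two modules, rank 2. All values are made up.
acts = [
    [[0.1, 0.0], [0.0, 0.1]],  # low score
    [[1.0, 0.0], [0.0, 1.0]],  # middle score
    [[3.0, 0.0], [0.0, 3.0]],  # high score
]
v_act = [[1.0, 0.0], [0.0, 1.0]]

scores = route_scores(acts, v_act)
# Hand-picked thresholds (assumption); the reviewed code derives them via Otsu.
masks = gate_masks(scores, t_lo=1.0, t_hi=4.0)
```

With these toy inputs the three rollouts receive the (m, d) pairs (0, 0), (1, 0) and (1, 1), which are the three mask states listed in finding (4).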
(2) No current-method prose implies live-gradient cosine scoring ✅ CLEAN
All files describing the current method describe activation-scored routeA. Every file still referring to gradient cosine scoring is correctly labeled as historical/retired:
- README.md L100-103: "The paper draft and docs/results.md currently describe the retired gradient-scored routeV experiments. They are historical evidence, not a description of routeA."
- docs/writeup/main.tex L1-3: "HISTORICAL DRAFT: describes the retired gradient-scored routeV method, not the current activation-scored routeA implementation."
- docs/results.md L1-4: "These results describe the retired gradient-scored routeV method."
- docs/blog/...md: dated 2026-05-29, clearly a pre-routeA draft.
- docs/brainstorm/extracted_prefs.md: dated 2026-05-23, the original project brainstorming doc, not current-method prose.
The README, AGENTS.md, train.py docstring, extract_vhack_act.py docstring, and lora2r.py docstring all consistently describe the activation-scored routeA method. No current-method document implies gradient cosine scoring.
(3) Historical routeV results preserved rather than relabeled routeA ✅ CLEAN
- docs/results.md consistently calls old results "routeV" (e.g., "routeV per-token", "routeV authored", "routeV random-V"), never "routeA". The header clearly distinguishes them.
- docs/writeup/main.tex calls the method "routeV" throughout (e.g., "vGROUT routeV (best)", Table "route (ours)"). The paper explicitly declares itself historical.
- README.md refers to "routeV" as the retired method, with separate routeA description.
- docs/spec/20260611_act_gate_spec.md: explicitly states "routeA (per the rename-on-logic-change rule): the gate input changes from gradients to activations, so routeV results stay comparable only to routeV."
No relabeling found. The old results are preserved under their original name "routeV."
(4) Claims about lora2r shapes/masks are exact ✅ CLEAN
Every documented shape and mask claim matches the implementation:
| Claim | Source | Code verification |
|---|---|---|
| A:[2r, d_in], B:[d_out, 2r], both trainable | README L23-24, AGENTS.md L267-268 | lora2r.py L127: A0 = torch.randn(2*r, d_in, ...), B0 = torch.randn(d_out, 2*r, ...); A = nn.Parameter(A0.clone()), B = nn.Parameter(B0.clone()) |
| Deployed [:r], quarantine [r:] | README L24-25, AGENTS.md L268 | lora2r.py L98-102: h[..., :r] with B[:, :r]; h[..., r:] with B[:, r:] |
| No cross terms between blocks | README L25-26 | lora2r.py L98-102: column k of B only multiplies row k of A's output; F.linear(h[..., :r], B[:, :r]) and F.linear(h[..., r:], B[:, r:]) are independent |
| Frozen A0/B0 subtracted, net delta = 0 at init | README L23-24, lora2r.py L8-9 | lora2r.py L99-102: dep = F.linear(h[...,:r], B[:,:r]) - F.linear(h0[...,:r], B0[:,:r]); at init A=A0, B=B0 so dep=quar=0 |
| Masks (m,d): clean (0,0) → deployed only; mid (1,0) → both; hack (1,1) → quarantine only, deployed detached | README L29-32, AGENTS.md L287-291 | lora2r.py L108-114: dep = ((1-d)*dep + d*dep.detach()), quar = m*quar; train.py L508-510: m = (z > t_lo), d = (z >= t_hi) |
| Deployment ablation resets quarantine to init | README L27, lora2r.py L117-118 | lora2r.py: "deploy ablation resets that block to A0/B0 (eval.ablate_quarantine)" |
| Ablation deletes quarantine's learned delta | README L27-28 | Confirmed by lora2r.py design: A0/B0 frozen, quarantine is A[r:], B[:,r:] minus init |
One minor terminology note (not a correctness issue): README.md uses "clean/hack/mid" while AGENTS.md and the code use "keep/absorb/rout" for the same (m,d) pairs (0,0)/(1,1)/(1,0). Both are internally consistent and map correctly.
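The block structure verified in the table can be illustrated with a small forward-pass sketch in plain Python. It is an assumption-laden toy, not the lora2r.py implementation: matvec, block_delta and ablate_quarantine_sketch are hypothetical names, and nested lists replace tensors. It shows only the forward algebra (zero net delta at init, independent deployed and quarantine branches, the m mask, and the ablation reset). The d mask is omitted because it only changes the backward pass by detaching the deployed branch, which plain Python cannot express.

```python
def matvec(M, x):
    """Multiply matrix M ([rows][cols]) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in M]


def block_delta(A, B, A0, B0, x, r, m=1):
    """Return (dep, quar) output deltas for one input vector x.

    A, A0: [2r][d_in]; B, B0: [d_out][2r]. Rows [:r] of A with columns [:r]
    of B form the deployed block; rows [r:] with columns [r:] form quarantine.
    The frozen init (A0, B0) is subtracted from each branch.
    """
    h, h0 = matvec(A, x), matvec(A0, x)

    def branch(lo, hi):
        cur = matvec([row[lo:hi] for row in B], h[lo:hi])
        init = matvec([row[lo:hi] for row in B0], h0[lo:hi])
        return [c - i for c, i in zip(cur, init)]

    dep = branch(0, r)
    quar = [m * q for q in branch(r, 2 * r)]  # m = 0 zeroes the quarantine output
    return dep, quar


def ablate_quarantine_sketch(A, B, A0, B0, r):
    """Reset the quarantine rows of A and columns of B to the frozen init."""
    A_new = [row[:] for row in A[:r]] + [row[:] for row in A0[r:]]
    B_new = [row[:r] + row0[r:] for row, row0 in zip(B, B0)]
    return A_new, B_new


# Toy sizes: r = 1, d_in = 2, d_out = 2. All values are made up.
r = 1
A0 = [[1.0, 0.0], [0.0, 1.0]]
B0 = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 2.0]

# At init A == A0 and B == B0, so both deltas vanish.
dep_init, quar_init = block_delta(A0, B0, A0, B0, x, r)

# Pretend training moved both rows of A away from init.
A = [[2.0, 0.0], [0.0, 3.0]]
B = [row[:] for row in B0]
dep, quar = block_delta(A, B, A0, B0, x, r)
_, quar_masked = block_delta(A, B, A0, B0, x, r, m=0)

# Ablation removes the quarantine delta and leaves the deployed delta intact.
A_abl, B_abl = ablate_quarantine_sketch(A, B, A0, B0, r)
dep_abl, quar_abl = block_delta(A_abl, B_abl, A0, B0, x, r)
```

Because each branch slices matching rows of A and columns of B, changing the quarantine row of A never alters dep, which is the "no cross terms" property checked in the table.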
Summary: No findings. All four review areas are clean. The documentation accurately reflects the current routeA implementation, historical routeV results are preserved under their original name, no current-method prose implies gradient cosine scoring, and all lora2r shape/mask claims are exact.