evil_MoE/docs/reviews/20260611_activation_docs_review.md



Review: Documentation Changes vs. Current Implementation

I read the following files: README.md, AGENTS.md, src/vgrout/train.py, src/vgrout/extract_vhack_act.py, src/vgrout/lora2r.py, docs/results.md, docs/writeup/main.tex, docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md, docs/brainstorm/extracted_prefs.md, docs/spec/20260611_act_gate_spec.md, and RESEARCH_JOURNAL.md.

Findings

(1) Activation scores select routes; output masks route subsequent gradients: CLEAN

The documentation accurately describes the routeA mechanism. Evidence:

  • README.md (~L44-51): "routeA scores each rollout on the no-grad logp_old forward … activation-capture hook pools … bottleneck activations … score is the pooled dot product with v_act … block masks are set from those labels before the single masked forward+backward."
  • AGENTS.md (~L281-295): "routeA sets an output mask before the single grad-carrying forward and backward … The gate reads pooled activations, not gradients."
  • train.py L488-509: ActCapture hooks capture A[:r] @ x during the no-grad logpi_old forward; dots = torch.einsum("gmr,mr->g", acts, v_act); _routeA_gate(dots) returns (m_vec, d_vec) pinned on layer._lora2r_mask; then a single loss.backward() where lora2r.py's hook applies dep.detach() and m*quar.
  • lora2r.py L106-117: masks act on branch outputs, so d=1 blocks grads to both A rows and B columns of the deployed block, and m=0 zeroes the quarantine block in both forward and backward.

The flow is correctly documented: activations → dot product → Otsu thresholds → output masks → gradient routing. No errors.
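
The gating step of that flow can be sketched in a few lines. This is a minimal pure-Python illustration, not the code in train.py: `pooled_dots` and `route_masks` are hypothetical names, the index `m` in the einsum is assumed to range over hooked modules, and the thresholds are passed in rather than computed, since the reviewed excerpts say they come from Otsu thresholding but do not show how. The comparisons `z > t_lo` and `z >= t_hi` are the ones quoted from train.py.

```python
from typing import List, Sequence, Tuple

def pooled_dots(acts: Sequence[Sequence[Sequence[float]]],
                v_act: Sequence[Sequence[float]]) -> List[float]:
    """One score per rollout: sum over modules and rank of acts * v_act.

    Plain-Python equivalent of einsum("gmr,mr->g", acts, v_act), with
    acts indexed as [rollout g][module m][rank r] (assumed layout).
    """
    scores = []
    for rollout in acts:
        total = 0.0
        for module_act, module_v in zip(rollout, v_act):
            total += sum(a * v for a, v in zip(module_act, module_v))
        scores.append(total)
    return scores

def route_masks(z: Sequence[float], t_lo: float, t_hi: float
                ) -> Tuple[List[float], List[float]]:
    """Turn scores into per-rollout masks (m, d).

    m = 1 enables the quarantine block, d = 1 detaches the deployed block.
    Assumes t_lo < t_hi, so d = 1 always implies m = 1.
    """
    m_vec = [1.0 if s > t_lo else 0.0 for s in z]
    d_vec = [1.0 if s >= t_hi else 0.0 for s in z]
    return m_vec, d_vec

# README naming for the three (m, d) pairs.
LABELS = {(0.0, 0.0): "clean", (1.0, 0.0): "mid", (1.0, 1.0): "hack"}

# Toy data: 3 rollouts, 2 modules, rank 2 (values are illustrative only).
v_act = [[1.0, 0.0], [0.0, 1.0]]
acts = [
    [[0.5, 5.0], [3.0, 0.0]],  # score 0.5
    [[1.0, 0.0], [0.0, 1.0]],  # score 2.0
    [[2.0, 0.0], [0.0, 3.0]],  # score 5.0
]
z = pooled_dots(acts, v_act)
m_vec, d_vec = route_masks(z, t_lo=1.0, t_hi=4.0)
labels = [LABELS[(m, d)] for m, d in zip(m_vec, d_vec)]
print(labels)  # ['clean', 'mid', 'hack']
```

In the reviewed code the masks are then pinned on each layer before the single forward and backward pass; the sketch stops at the point where the masks exist.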

(2) No current-method prose implies live-gradient cosine scoring: CLEAN

All files describing the current method describe activation-scored routeA. Every file still referring to gradient cosine scoring is correctly labeled as historical/retired:

  • README.md L100-103: "The paper draft and docs/results.md currently describe the retired gradient-scored routeV experiments. They are historical evidence, not a description of routeA."
  • docs/writeup/main.tex L1-3: "HISTORICAL DRAFT: describes the retired gradient-scored routeV method, not the current activation-scored routeA implementation."
  • docs/results.md L1-4: "These results describe the retired gradient-scored routeV method."
  • docs/blog/...md: dated 2026-05-29, clearly a pre-routeA draft.
  • docs/brainstorm/extracted_prefs.md: dated 2026-05-23, the original project brainstorming doc, not current-method prose.

The README, AGENTS.md, train.py docstring, extract_vhack_act.py docstring, and lora2r.py docstring all consistently describe the activation-scored routeA method. No current-method document implies gradient cosine scoring.

(3) Historical routeV results preserved rather than relabeled routeA: CLEAN

  • docs/results.md consistently calls old results "routeV" (e.g., "routeV per-token", "routeV authored", "routeV random-V"), never "routeA". The header clearly distinguishes them.
  • docs/writeup/main.tex calls the method "routeV" throughout (e.g., "vGROUT routeV (best)", Table "route (ours)"). The paper explicitly declares itself historical.
  • README.md refers to "routeV" as the retired method, with separate routeA description.
  • docs/spec/20260611_act_gate_spec.md: explicitly states "routeA (per the rename-on-logic-change rule): the gate input changes from gradients to activations, so routeV results stay comparable only to routeV."

No relabeling found. The old results are preserved under their original name "routeV."

(4) Claims about lora2r shapes/masks are exact: CLEAN

Every documented shape and mask claim matches the implementation:

| Claim | Source | Code verification |
| --- | --- | --- |
| `A:[2r, d_in]`, `B:[d_out, 2r]`, both trainable | README L23-24, AGENTS.md L267-268 | lora2r.py L127: `A0 = torch.randn(2*r, d_in, ...)`, `B0 = torch.randn(d_out, 2*r, ...)`; `A = nn.Parameter(A0.clone())`, `B = nn.Parameter(B0.clone())` |
| Deployed `[:r]`, quarantine `[r:]` | README L24-25, AGENTS.md L268 | lora2r.py L98-102: `h[..., :r]` with `B[:, :r]`; `h[..., r:]` with `B[:, r:]` |
| No cross terms between blocks | README L25-26 | lora2r.py L98-102: column k of B only multiplies row k of A's output; `F.linear(h[..., :r], B[:, :r])` and `F.linear(h[..., r:], B[:, r:])` are independent |
| Frozen A0/B0 subtracted, net delta = 0 at init | README L23-24, lora2r.py L8-9 | lora2r.py L99-102: `dep = F.linear(h[...,:r], B[:,:r]) - F.linear(h0[...,:r], B0[:,:r])`; at init A=A0, B=B0 so dep=quar=0 |
| Masks (m,d): clean (0,0) → deployed only; mid (1,0) → both; hack (1,1) → quarantine only, deployed detached | README L29-32, AGENTS.md L287-291 | lora2r.py L108-114: `dep = ((1-d)*dep + d*dep.detach())`, `quar = m*quar`; train.py L508-510: `m = (z > t_lo)`, `d = (z >= t_hi)` |
| Deployment ablation resets quarantine to init | README L27, lora2r.py L117-118 | lora2r.py: "deploy ablation resets that block to A0/B0 (eval.ablate_quarantine)" |
| Ablation deletes quarantine's learned delta | README L27-28 | Confirmed by lora2r.py design: A0/B0 frozen, quarantine is `A[r:]`, `B[:,r:]` minus init |

One minor terminology note (not a correctness issue): README.md uses "clean/hack/mid" while AGENTS.md and the code use "keep/absorb/rout" for the same (m,d) pairs (0,0)/(1,1)/(1,0). Both are internally consistent and map correctly.
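
The shape and mask claims verified above can be restated as a short derivation. This is a sketch under assumptions the quoted excerpts do not confirm directly: that h0 is the frozen projection A0 x, and that the two branch outputs are summed into the layer's LoRA delta. The symbols are shorthand introduced here: sg(·) is stop-gradient (`dep.detach()`), θ_dep stands for the deployed parameters (`A[:r]`, `B[:, :r]`), and θ_quar for the quarantine parameters (`A[r:]`, `B[:, r:]`).

$$
\begin{aligned}
h &= A x, \qquad h_0 = A_0 x \\
\mathrm{dep} &= B_{:,\,:r}\, h_{:r} - (B_0)_{:,\,:r}\, (h_0)_{:r} \\
\mathrm{quar} &= B_{:,\,r:}\, h_{r:} - (B_0)_{:,\,r:}\, (h_0)_{r:} \\
\Delta y &= (1-d)\,\mathrm{dep} + d\,\mathrm{sg}(\mathrm{dep}) + m\,\mathrm{quar} \\
\text{forward value:}\quad \Delta y &= \mathrm{dep} + m\,\mathrm{quar} \\
\frac{\partial \Delta y}{\partial \theta_{\mathrm{dep}}} &= (1-d)\,\frac{\partial\,\mathrm{dep}}{\partial \theta_{\mathrm{dep}}},
\qquad
\frac{\partial \Delta y}{\partial \theta_{\mathrm{quar}}} = m\,\frac{\partial\,\mathrm{quar}}{\partial \theta_{\mathrm{quar}}}
\end{aligned}
$$

At init, A = A0 and B = B0 give dep = quar = 0, so the net delta is zero. For (m,d) = (0,0) only the deployed block contributes and receives gradient; for (1,0) both blocks do; for (1,1) both contribute to the forward value but only the quarantine block receives gradient, which is the "deployed detached" case in the table.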


Summary: No correctness findings. All four review areas are clean; the only note is the non-blocking terminology difference recorded under (4). The documentation accurately reflects the current routeA implementation, historical routeV results are preserved under their original name, no current-method prose implies gradient cosine scoring, and all lora2r shape/mask claims are exact.