tidy, review

2026-06-27 17:01:14 +08:00 · 2026-04-27 07:03:24 +08:00
parent a44fc039af
commit 74c374e741
22 changed files with 4425 additions and 7727 deletions
@@ -0,0 +1,754 @@
+
+❯ yes make a task to deep review each one against how it's implemented. so one  
+  task for each. please give me the pseudocode of 1) reference code 2)          
+  reference paper 3) ours and compare.                                          
+
+  Searched for 1 pattern, read 1 file
+● 1. LoRA review
+
+  Paper (Hu et al. 2021):
+  forward:
+    h = W·x + (α/r) · B·A·x       # B∈R^{d×r}, A∈R^{r×d}
+  init:
+    A ~ Gaussian (or kaiming),  B = 0
+
+  peft peft_lora_layer.py:941-969:
+  forward(x):
+    result = base_layer(x)                                # W·x + bias
+    for adapter in active_adapters:
+        # scaling = α/r, or α/√r if rslora
+        result += lora_B(lora_A(dropout(x))) * scaling
+  init (reset_lora_parameters, line 263):
+    kaiming_uniform_(lora_A, a=√5);  zeros_(lora_B)
+
+  ours src/lora_lite/variants/lora.py:
+  param_specs:
+    lora_A: (r, d_in)  init=kaiming
+    lora_B: (d_out, r) init=zeros
+  forward(x, y):                # y = base_layer(x) already computed by host
+    scale = α/r
+    h = einsum(x, A, "... i, r i -> ... r")
+    delta = einsum(h, B, "... r, o r -> ... o")
+    return y + scale·delta
+
+  Diff vs peft:
+  - ✓ math identical: scale·B·A·x + y
+  - ✗ no dropout (peft has lora_dropout)
+  - ✗ no rslora (α/√r)
+  - ✗ no init_lora_weights="gaussian" option (only kaiming)
+  - ✓ B=0 zero-init → identity at t=0
+
+  Verdict: correct vanilla LoRA. Missing dropout and rslora are deliberate scope
+   cuts (not bugs). No fix needed.
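
A minimal NumPy sketch of the forward compared above, assuming the (r, d_in) and (d_out, r) shapes from the pseudocode. `lora_forward` and its arguments are illustrative names, not the repo's API. It checks the two properties the review relies on: B = 0 leaves the base output untouched, and a non-zero B is equivalent to folding (α/r)·B·A into W.

```python
import numpy as np

def lora_forward(x, y, A, B, alpha):
    """Return y + (alpha/r) * (x A^T) B^T, with y = base_layer(x) precomputed."""
    r = A.shape[0]
    h = x @ A.T        # (N, r)
    delta = h @ B.T    # (N, d_out)
    return y + (alpha / r) * delta

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 6, 2, 16.0
W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=(4, d_in))
y = x @ W.T
A = rng.normal(size=(r, d_in))

# B = 0 at init: the adapter is a no-op.
out_init = lora_forward(x, y, A, np.zeros((d_out, r)), alpha)

# Non-zero B: same result as merging (alpha/r) * B @ A into the weight.
B = rng.normal(size=(d_out, r))
out_trained = lora_forward(x, y, A, B, alpha)
out_merged = x @ (W + (alpha / r) * (B @ A)).T
```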
+
+  Read 1 file
+
+● 2. DoRA review
+
+  Paper (Liu et al. 2024, §4.1, §4.3):
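+  W' = m · V/||V||_c,    V = W + ΔV = W + (α/r)·B·A
+  # ||·||_c is "column-wise" in the paper; peft and ours compute it as a
+  # per-row L2 norm (dim=1) of the (d_out, d_in) weight
+  init:  m = ||W||_c,  B = 0     →  W' = W
+  §4.3 trick: treat the norm ||V||_c (written ||V+ΔV||_c in the paper, where V
+              is the frozen direction) as a CONSTANT (detach from grad graph)
+              to reduce training overhead.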
+
+  peft peft_lora_dora.py:132-166:
+  forward(x):
+    lora_W = B·A
+    weight_norm = ||W + α/r·lora_W.detach()||_c        # << DETACHED
+    weight_norm = weight_norm.detach()                  # << belt+braces
+    mag_norm_scale = (m / weight_norm)
+    base_result = W·x        (or passed in, minus bias)
+    return (mag_norm_scale - 1)·base_result + mag_norm_scale · α/r · B·A·x
+                                                        # bias re-added outside
+
+  ours src/lora_lite/variants/dora.py:53-71:
+  V = W + scale·B·A                                     # << NOT detached
+  v_norm = V.norm(dim=1).clamp_min(1e-12)               # << NOT detached
+  wx = y - bias  if bias else y
+  delta_lora = scale · B·A·x
+  combined = wx + delta_lora
+  out = (m / v_norm) · combined
+  return out + bias  if bias else out
+
+  Diff vs peft:
+  - ✗ Bug: missing .detach() on the lora term inside weight_norm. Paper §4.3
+  explicitly says to detach; peft detaches twice (line 144 lora_weight.detach()
+  and line 152 weight_norm.detach()). Without it, grads flow through the norm
+  denominator → a larger backward graph and higher training memory, the
+  overhead §4.3 is written to remove.
+  - ✓ math otherwise equivalent: (m/||V||)·(Wx + α/r·BAx) = (m/||V||-1)·Wx + 
+  (m/||V||)·α/r·BAx + Wx. Both forms produce same output.
+  - ✓ bias handling matches peft (subtract before scaling, add after).
+  - ✓ m init ||W||_c matches.
+  - = nn.Linear-only restriction is a deliberate scope cut (peft handles bnb via
+   dequantize_module_weight).
+
+  Recommended fix: add .detach() to V (or equivalently v_norm.detach()) before
+  dividing.
+
+  V = layer.weight + scale * BA.detach()
+  v_norm = V.norm(dim=1).clamp_min(1e-12).detach()
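
A small PyTorch sketch of what the recommended fix changes, under the assumption that the forward looks like the "ours" pseudocode above (the function `run` and all tensor names are illustrative, and the bias is omitted). Detaching the norm leaves the forward output unchanged but removes the gradient path through the denominator, so the gradient on B differs.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, alpha = 8, 6, 2, 4.0
scale = alpha / r
W = torch.randn(d_out, d_in, dtype=torch.float64)
x = torch.randn(5, d_in, dtype=torch.float64)
A0 = torch.randn(r, d_in, dtype=torch.float64)
m = W.norm(dim=1)  # magnitude init: per-row norm of W (kept constant here)

def run(detach_norm):
    """One forward/backward at init (B = 0); returns (output, grad wrt B)."""
    A = A0.clone().requires_grad_(True)
    B = torch.zeros(d_out, r, dtype=torch.float64, requires_grad=True)
    BA = B @ A
    V = W + scale * (BA.detach() if detach_norm else BA)
    v_norm = V.norm(dim=1).clamp_min(1e-12)
    if detach_norm:
        v_norm = v_norm.detach()
    out = (m / v_norm) * (x @ W.T + scale * (x @ A.T) @ B.T)
    out.sum().backward()
    return out.detach(), B.grad.clone()

out_attached, gB_attached = run(detach_norm=False)
out_detached, gB_detached = run(detach_norm=True)

same_forward = torch.allclose(out_attached, out_detached)
same_grad = torch.allclose(gB_attached, gB_detached)
```

With the norm attached, B receives an extra gradient term through 1/||V||; with it detached, only the direct low-rank path contributes.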
+
+  Read 2 files
+
+● 3. PiSSA review
+
+  Paper (Meng et al. 2024):
+  W = U S V^T   (truncated to top-r)
+  A = √S · V^T,    B = U · √S        →  B·A = U S V^T
+  W_res = W - B·A
+  forward: W_res·x + B·A·x            → at t=0 equals W·x  (no α/r factor, i.e. α = r)
+
+  peft peft_lora_layer.py:360-393:
+  V, S, Uh = svd(W)                          # peft naming: V is left, Uh is right
+  Sr = S[:r] / scaling                        # << pre-divide by α/r
+  A = diag(√Sr) @ Uh[:r]                      # absorbs 1/√scaling
+  B = V[:, :r] @ diag(√Sr)                    # absorbs 1/√scaling
+  W_res = W - scaling · B·A                   # net: W - V[:,:r]·S[:r]·Uh[:r] (scaling cancels)
+  forward: W_res·x + scaling·B·A·x
+
+  ours src/lora_lite/variants/pissa.py:45-76:
+  U, S, Vh = svd(W)
+  B = U[:,:r] · √S[:r]                        # NO scaling pre-division
+  A = √S[:r] · Vh[:r]                         # NO scaling pre-division
+  W_res = W - scale · B·A                     # = W - scale · U·S·Vh  (NOT exactly W - U·S·Vh)
+  forward: W_res·x + scale · B·A·x            # net: W·x at t=0  ✓
+
+  Diff vs peft:
+  - ✓ Identity at t=0 holds for any α, up to floating-point rounding (the
+  scaling cancels symmetrically): both peft and ours satisfy
+  W_res + scale·BA = W after the round trip.
+  - ⚠ A/B starting magnitudes differ when α≠r. peft pre-divides S/scale so
+  ||B||,||A|| ∝ √(S/scale); ours has ||B||,||A|| ∝ √S. Update dynamics (α/r · 
+  ΔB·A) therefore differ by a factor of √scale per matrix. For α=r (scale=1)
+  they're identical — and the docstring already says "Pass alpha=r for
+  paper-faithful scale=1".
+  - ✓ Linear-only restriction is a deliberate scope cut.
+  - ✓ cfg.dtype cast on A,B before storage matches peft's to(dtype) after save.
+  - ✓ fp32 SVD then (W - scale·BA) in fp32 before bf16 cast — comment "fp32
+  subtraction so W_res stays accurate" is honoured. peft does same via
+  weight.to(torch.float32).
+
+  Verdict: correct for α = r. For α ≠ r, behaves differently from peft (still
+  mathematically valid PiSSA, just different parameter scaling of A/B). Either
+  fix (apply Sr /= scale like peft) or tighten the docstring to "α=r required"
+  and assert it. Mild fix; not a correctness bug.
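
The α ≠ r difference can be checked with a short NumPy sketch. `pissa_init` and its `pre_divide` flag are hypothetical names used only for this illustration; `pre_divide=True` mimics the peft-style `Sr / scale`, `pre_divide=False` the behaviour reviewed above.

```python
import numpy as np

def pissa_init(W, r, alpha, pre_divide):
    """Return (A, B, W_res) for the forward W_res x + (alpha/r) B A x."""
    scale = alpha / r
    U, S, Vh = np.linalg.svd(W, full_matrices=False)
    Sr = S[:r] / scale if pre_divide else S[:r]
    sqrtS = np.sqrt(Sr)
    B = U[:, :r] * sqrtS           # (d_out, r)
    A = sqrtS[:, None] * Vh[:r]    # (r, d_in)
    W_res = W - scale * (B @ A)
    return A, B, W_res

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 8))
r, alpha = 2, 8.0                  # scale = alpha / r = 4
scale = alpha / r

A_p, B_p, Wres_p = pissa_init(W, r, alpha, pre_divide=True)
A_o, B_o, Wres_o = pissa_init(W, r, alpha, pre_divide=False)

# Both variants reconstruct W at t=0 ...
recon_p = Wres_p + scale * (B_p @ A_p)
recon_o = Wres_o + scale * (B_o @ A_o)
# ... but the factors start sqrt(scale) apart in magnitude.
norm_ratio = np.linalg.norm(A_o) / np.linalg.norm(A_p)
```

Both parameterizations satisfy the t=0 identity; only the starting scale of A and B (and therefore the effective step size) changes.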
+
+  Read 1 file
+
+● 4. DeLoRA review
+
+  Paper (Bini et al. 2025, ICLR'25):
+  W' = W + (λ/r) · B · Ξ · A,    Ξ_ii = 1/(||b_i|| · ||a_i||)
+  init: B=0  →  ΔW=0
+  default λ=15
+  Crucial: w_norm is per-input-channel ||W||_2 (col-norm), captured ONCE at init
+           (frozen reference) — not Frobenius scalar.
+
+  peft peft_delora_layer.py:226-264:
+  forward(x):
+    h = (x * w_norm) @ A.T                           # broadcast w_norm over input dim
+    An = clamp(A.norm(dim=1), 1e-4)
+    Bn = clamp(B.norm(dim=0), 1e-4)
+    scale_r = (λ/r) / (An * Bn)                      # (r,)
+    h = h * scale_r
+    h = h @ B.T
+    return base_out + h
+  init:  A ← kaiming(√5),  B ← 0,  λ ← 15.0,  w_norm ← ||W||_2 col (frozen buffer)
+
+  ours src/lora_lite/variants/delora.py:55-81:
+  forward(x, y):
+    x_scaled = x * lora_wnorm
+    h = einsum(x_scaled, A, "... i, r i -> ... r")       # = (x * w_norm) @ A.T
+    An = clamp(A.norm(dim=1), 1e-4)
+    Bn = clamp(B.norm(dim=0), 1e-4)
+    scale_r = (λ/r) / (An * Bn)
+    h = h * scale_r
+    delta = einsum(h, B, "... r, o r -> ... o")          # = h @ B.T
+    return y + delta
+  init:  A ← kaiming(√5),  B ← 0,  λ ← lambda0 (default 0.0, NOT 15.0),
+         lora_wnorm ← ||W||_2 col (frozen buffer)
+
+  Diff vs peft:
+  - ✓ math line-for-line identical: same dim-0 col-norm, same per-rank A/B
+  normalization, same gradient path (norms not detached, A,B receive gradient
+  through them — matches peft exactly).
+  - ✓ frozen lora_wnorm buffer (peft uses BufferDict, ours uses as_buffer=True).
+  - ⚠ default lambda0=0.0 differs from peft's 15.0. Docstring is honest about
+  this ("0.0 = bit-identity at t=0, but A,B get zero grad until lambda moves").
+  At λ=0, gradients on A,B vanish since delta ∝ λ. Note that with B=0 as well,
+  ∂delta/∂λ ∝ B·Ξ·A·x is also zero, so under the init listed above λ has no
+  gradient to move it off zero; worth confirming with a one-step training test.
+  This is documented but is a sharp foot-gun. peft's 15.0 default avoids it.
+  - ✓ even-r restriction not relevant here (HRA only).
+  - ✓ nn.Linear-only restriction documented.
+
+  Verdict: correct. The lambda0 default is a deliberate "bit-identity but
+  stalls" choice — keep, but consider noting in docstring that for actual
+  training you'd want λ0≈1-15 (paper uses 15).
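
A NumPy sketch of the stall, assuming the forward is exactly the pseudocode above (`delora_delta` is an illustrative name). If delta is identically zero along some direction of parameter space, the gradient along that direction is zero too, so no autograd is needed to see it.

```python
import numpy as np

def delora_delta(x, A, B, lam, w_norm, eps=1e-4):
    """Low-rank update from the pseudocode: ((x*w_norm) A^T * scale_r) B^T."""
    r = A.shape[0]
    h = (x * w_norm) @ A.T                               # (N, r)
    An = np.maximum(np.linalg.norm(A, axis=1), eps)      # (r,)
    Bn = np.maximum(np.linalg.norm(B, axis=0), eps)      # (r,)
    h = h * ((lam / r) / (An * Bn))
    return h @ B.T                                       # (N, d_out)

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2
x = rng.normal(size=(4, d_in))
W = rng.normal(size=(d_out, d_in))
w_norm = np.linalg.norm(W, axis=0)                       # per-input-channel norm
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))

# lam = 0: delta is zero for ANY A, B, so A and B get no gradient.
delta_lam0 = delora_delta(x, A, B, 0.0, w_norm)
# B = 0: delta is zero for ANY lam, so lam gets no gradient either.
delta_B0 = delora_delta(x, A, np.zeros_like(B), 15.0, w_norm)
# lam != 0 and B != 0: the update is live.
delta_live = delora_delta(x, A, B, 15.0, w_norm)
```

Under these assumptions, starting from both λ=0 and B=0 leaves every parameter with a zero gradient, while λ=15 with B=0 still lets B move on the first step.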
+
+  Searched for 1 pattern, read 1 file
+
+● 5. EVA review
+
+  Paper (Paischer et al. 2024):
+  For each target layer l:
+    Collect inputs x ∈ R^{N × d_in} on calibration batches
+    Vh = right_singular_vectors(x)                  # top-r principal input dirs
+    A_l ← Vh[:r]                                    # data-driven init
+    B_l ← 0
+  forward = standard LoRA
+  Optional: rank redistribution by explained variance (ρ-budget across layers)
+  Optional: hash-dedup layers with identical inputs (Q/K/V share x)
+
+  peft peft_eva.py:
+  For each calibration batch:
+    IncrementalPCA.partial_fit(layer_input)         # streaming, memory-efficient
+    Hash inputs; dedup layers seeing same x.
+    Track convergence via cossim(components_t, components_{t-1})
+  After:
+    Redistribute ranks by explained_variance_ratio (ρ scale)
+    A_l ← svd.components_[:rank_l]                  # top-rank rows
+    B_l ← 0
+  forward = vanilla LoRA path (`scale·B@A·x`)
+
+  ours src/lora_lite/variants/eva.py:
+  group_init(model, targets, calibration_data):
+    hooks capture pre-forward inputs per layer
+    for batch in calibration_data: model(batch)
+    for each target:
+      X = cat(captured)                              # (N, d_in), float32 on cpu
+      _,_,Vh = torch.linalg.svd(X, full_matrices=False)
+      A ← Vh[:r]                                     # uniform r per layer
+  forward(x, y):
+    return y + (α/r) · (B @ A @ x)                   # B=0 at t=0  → identity
+
+  Diff vs peft:
+  - ✓ Top-r right singular vectors of pooled inputs → A. Math identical.
+  - ✓ B=0 init → identity at t=0.
+  - ✓ A,B both trainable post-init (peft same).
+  - ✗ Full SVD on cpu-pooled tensor vs. peft's IncrementalPCA — limits us to
+  small calibration sets (we cat all activations in RAM). The docstring
+  acknowledges this.
+  - ✗ No rank redistribution (rho param). All targets get same r. Acknowledged.
+  - ✗ No QKV input dedup. Acknowledged. Cost: 3× SVD work for layers that share
+  input. Correct, just slower.
+  - ✗ No convergence check. Single pass over calibration_data. Acknowledged.
+  - ⚠ Captured tensors moved to cpu+fp32 — fine, but a noisy 1k-batch on a 70B
+  model = lots of RAM. Just a scaling note.
+
+  Verdict: correct minimal EVA, all simplifications stated up-front in
+  docstring. No bugs. If you ever want to scale to large calibration corpora,
+  swap the torch.cat → svd for IncrementalPCA.
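
A minimal sketch of the data-driven init described above, on synthetic activations (the data, sizes, and the name `eva_init_A` are assumptions for illustration). It uses a single full SVD over pooled inputs, as in the reviewed implementation, not peft's streaming IncrementalPCA.

```python
import numpy as np

def eva_init_A(X, r):
    """Top-r right singular vectors of pooled inputs X (N, d_in) -> (r, d_in)."""
    _, _, Vh = np.linalg.svd(X, full_matrices=False)
    return Vh[:r]

rng = np.random.default_rng(0)
d_in, N, r = 16, 200, 4

# Synthetic "activations": isotropic noise plus one dominant input direction.
direction = np.zeros(d_in)
direction[3] = 1.0
X = rng.normal(size=(N, d_in)) + 20.0 * rng.normal(size=(N, 1)) * direction

A = eva_init_A(X, r)
gram = A @ A.T                  # rows of A are orthonormal
top_alignment = abs(A[0, 3])    # first row should align with the dominant axis
```

The rows of A come out orthonormal, and the first row lines up with the dominant direction planted in the data.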
+
+  Read 1 file
+
+● 6. HRA review
+
+  Paper (Yuan et al. 2024):
+  W' = W·R,     R = ∏_{i=0..r-1} H_i,    H_i = I - 2·u_i·u_i^T / ||u_i||^2
+  init (paper): independent random columns
+  peft variant: paired columns (u_{2k}=u_{2k+1}) → R=I exactly at t=0, even r
+  GS variant (apply_GS=True): explicit Gram-Schmidt for strict orthogonality
+  Loss-side: orthogonality regularizer (Eq. 6); not part of layer.
+
+  peft peft_hra_layer.py:208-265:
+  hra_u: (d_in, r)                                 # COLUMNS are vectors
+  get_delta_weight():
+    opt_u = opt_u / opt_u.norm(dim=0)              # pre-normalize columns
+    R = I_{d_in}
+    for i in 0..r-1:                               # forward order
+        u = opt_u[:, i]
+        R = R - 2·R·u·u^T                          # R ← R · H_i
+  forward(x):
+    R = product over active adapters
+    return F.linear(x, W @ R, bias)                # = x · R^T · W^T (effectively R applied to x)
+  init: half_u = kaiming((d_in, r/2)); u = repeat_interleave(half_u, 2, dim=1)
+        # cols 0=1, 2=3, ...
+
+  ours src/lora_lite/variants/hra.py:53-75:
+  lora_U: (r, d_in)                                # ROWS are vectors (transpose convention)
+  forward_input(x):
+    Rx = x
+    for i = r-1 downto 0:                          # reverse order, applied directly
+        u = U[i]
+        sq = (u·u).sum().clamp_min(1e-12)
+        coeff = (Rx · u) · (2/sq)
+        Rx = Rx - coeff · u                        # Rx ← Rx · H_i
+    return Rx                                      # then base layer applies W to this
+  init: half = kaiming((r/2, d_in)); U = repeat_interleave(half, 2, dim=0)
+        # rows 0=1, 2=3, ...
+
+  Diff vs peft:
+  - ✓ Math equivalent. peft computes R = H_0·H_1·...·H_{r-1} (col-vector
+  convention), then applies via x·R^T = x·H_{r-1}·...·H_0 (since H is
+  symmetric). Ours iterates i = r-1 → 0 doing Rx ← Rx·H_i, computing
+  x·H_{r-1}·...·H_0. Same.
+  - ✓ symmetric init: row pairs vs col pairs — consistent with our row-vector
+  convention.
+  - ✓ identity at t=0: paired u_a=u_b → H_a·H_a=I (involution). R=I.
+  - ⚠ No apply_GS=True path. peft offers Gram-Schmidt orthogonalization.
+  Docstring acknowledges via "Note: paper's orthogonality regularizer (Eq. 6) is
+  loss-side". Ours uses pure Householder products (which already give exactly
+  orthogonal R at any step), so GS is redundant unless you want the reflection
+  vectors u_i themselves to be mutually orthogonal. Fine to omit.
+  - ✓ More efficient than peft. Avoids materializing (d_in × d_in) R matrix;
+  applies reflections directly to x in O(r·d_in) not O(r·d_in²).
+  - ⚠ Slight gradient-flow difference. peft pre-normalizes opt_u / 
+  ||opt_u||_col, so gradients flow through normalization. Ours keeps ||u||² in
+  the denominator each step, gradients flow through that. Both define the same
+  orthogonal H_i; gradients should give equivalent training to within numerical
+  noise, but not bit-identical.
+  - ✓ even-r requirement raised loudly (peft warns + falls back; we raise —
+  preferable for fail-fast).
+
+  Verdict: correct + faster than peft. Ordering bug from 67a6daf is properly
+  fixed (verified: i = r-1 → 0 iteration matches peft's R = H_0..H_{r-1} then
+  x·R^T).
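
The ordering argument can be verified numerically. This NumPy sketch assumes the row-vector convention of the "ours" pseudocode; `apply_reflections` and `materialize_R` are illustrative stand-ins, not the repo's or peft's functions.

```python
import numpy as np

def apply_reflections(x, U):
    """x -> x·H_{r-1}···H_0 without building a (d_in, d_in) matrix. U: (r, d_in)."""
    Rx = x
    for u in U[::-1]:                                   # i = r-1 down to 0
        coeff = (Rx @ u) * (2.0 / max(u @ u, 1e-12))    # (N,)
        Rx = Rx - coeff[:, None] * u                    # Rx <- Rx · H_i
    return Rx

def materialize_R(U):
    """R = H_0·H_1···H_{r-1}, built explicitly in forward order."""
    R = np.eye(U.shape[1])
    for u in U:
        u = u / np.linalg.norm(u)
        R = R - 2.0 * (R @ u)[:, None] * u[None, :]     # R <- R · H_i
    return R

rng = np.random.default_rng(0)
d_in, r = 8, 4
x = rng.normal(size=(5, d_in))

U = rng.normal(size=(r, d_in))                          # independent vectors
R = materialize_R(U)
direct = apply_reflections(x, U)
via_matrix = x @ R.T

# Paired init (rows 0=1, 2=3): each pair of reflections cancels.
U_paired = np.repeat(rng.normal(size=(r // 2, d_in)), 2, axis=0)
paired_out = apply_reflections(x, U_paired)
```

The direct path does O(r·d_in) work per input row, which is the efficiency point made in the review.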
+  Read 1 file
+
+● 7. IA3 review
+
+  Paper (Liu et al. 2022):
+  y = ℓ ⊙ (W·x)        for attention K, V              (output-side gate)
+  y = W_2·(ℓ_ff ⊙ γ(W_1·x))   for FF                    (mid-FFN gate, T5)
+  init: ℓ = 1
+
+  peft peft_ia3_layer.py:161-188 (single class with is_feedforward flag):
+  if is_feedforward:
+      g shape (1, d_in)
+      interm = x * g
+      return base_layer(interm)                        # input-side
+  else:
+      g shape (d_out, 1)
+      result = base_layer(x)
+      return result * g                                 # output-side
+  init: g ← ones (when init_ia3_weights=True, default)
+  default targets: ["k_proj", "v_proj", "down_proj"];
+  feedforward_modules=["down_proj"]
+
+  ours src/lora_lite/variants/ia3.py:38-76 (two classes IA3, IA3FF):
+  IA3   (output-side, e.g. k_proj/v_proj):
+      g shape (d_out,);  init ones
+      forward(x, y) = y * g
+
+  IA3FF (input-side, e.g. down_proj/fc2):
+      g shape (d_in,);  init ones
+      forward_input(x) = x * g                          # base_layer applied after by host
+
+  Diff vs peft:
+  - ✓ math identical to peft's two branches.
+  - ✓ init=ones matches.
+  - ✓ Splitting into two classes (rather than is_feedforward flag) is cleaner —
+  fail-fast: which class you register fixes the side, no boolean to misset.
+  - ⚠ peft's output-side path also scales base_layer.bias during merge. We
+  don't merge (forward-only adapter), so no issue at training time — but if you
+  ever add merge, watch this.
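+  - ⚠ docstring example uses regex r"\\.k_proj$". Read as a raw string, that is
+  an escaped backslash followed by "any character", so it would not match a
+  module name ending in ".k_proj"; the intended pattern is r"\.k_proj$". If the
+  doubled backslash is only docstring escaping, the rendered example is fine.
+  Either way, check that target_names is treated as regex by your attach()
+  (matches the behavior in variant.py).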
+  - ⚠ Position-on-Llama: peft applies "ff" gate at down_proj input, which is
+  gate*up (post-SiLU). This deviates from paper's T5 position (post-activation
+  of single W_1) but matches peft's default for Llama-style. Acknowledged via
+  docstring example.
+
+  Verdict: correct, cleaner than peft. No bugs.
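
The regex concern above is easy to check with the standard library. The module name below is a hypothetical Llama-style path chosen for illustration; whether `attach()` uses `re.search` is an assumption.

```python
import re

name = "model.layers.0.self_attn.k_proj"   # hypothetical module path

single = r"\.k_proj$"    # escaped dot: literal "." then "k_proj" at end of string
double = r"\\.k_proj$"   # literal backslash, then ANY character, then "k_proj"

matches_single = re.search(single, name) is not None
matches_double = re.search(double, name) is not None

# The doubled form only matches when the name really contains a backslash.
matches_double_backslash = re.search(double, "attn\\.k_proj") is not None
```

If the pattern in the docstring is copied verbatim as a raw string, only the single-backslash form selects the intended modules.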
+
+❯ ok! so what do we need to fix?
+
+● Fixes needed (from reviews 1-7)
+
+  Real bug (1):
+  - dora.py:62-63: missing .detach() on the LoRA path inside the norm. Paper
+  §4.3 explicitly says treat ||V+ΔV||_c as a constant; peft detaches twice.
+  Without it, grads flow through the norm denominator → a larger backward graph
+  and higher training memory.
+  V = layer.weight + scale * BA.detach()       # detach BA here
+  v_norm = V.norm(dim=1).clamp_min(1e-12).detach()   # belt & braces
+
+  Behavioral / docstring (2):
+  - pissa.py — at α≠r, A,B starting magnitude differs from peft by √scale per
+  matrix (peft pre-divides S/scale). Either match peft (Sr /= scale before
+  computing √Sr factors) or assert cfg.alpha == cfg.r and tighten the docstring.
+  - delora.py:36: lambda0=0.0 default gives identity at t=0 but is a foot-gun:
+  at λ=0 the gradient on A,B is zero, and with B=0 the gradient on λ is zero
+  too, so training may never start. Either change the default to peft's 15.0,
+  or expand the existing comment to flag this for new users.
+
+  Clean (4): lora, eva, hra, ia3 — no fixes.
+
+  Pending: antipasto.py (review #8 not yet done; want me to continue?).
+
+❯ sure do antipasto now...                                                      
+
+  Read 1 file
+
+● 8. AntiPaSTO review
+
+  Reference (antipasto3, Flax):
+  kernel = U·diag(S)·Vh + W_res             # kernel: (in, out)
+  init U,S,Vh,W_res = svd(kernel) (frozen, fp32 / bf16 W_res)
+  trainable:
+    delta_s     ∼ truncated_normal·4e-4 + 4e-4   # small POSITIVE bias (sym-break)
+    rot_T       ∼ truncated_normal·1e-4          # small random
+  forward(x, alpha):
+    A = scatter(rot_T into upper-triu of (n_blocks, bs, bs))
+    A = 0.5·(A - A^T);  A = a_lim·tanh(A/a_lim)
+    X = α·A/2                                     # << α scales rotation
+    R = (I-X)^-1 (I+X)                            # block Cayley
+    if rotate_U:  U_rot = U @ R                   # einsum, never builds full (r,r)
+    if rotate_V:  Vh_rot = R @ Vh
+    S_scaled = S + α·delta_s
+    return (x @ U_rot)·S_scaled @ Vh_rot + x @ W_res
+  Identity property: bidirectional symmetry — R(α=+1) and R(α=-1) are exact
+  inverses.
+  At α=0: U,S,Vh untouched → kernel reproduced.
+
+  ours src/lora_lite/variants/antipasto.py:
+  weight (pytorch): (d_out, d_in);  SVD: U:(d_out,r), S:(r,), Vh:(r,d_in)
+  init: U,S,Vh frozen buffers; W_res mutates layer.weight (PiSSA-style)
+  trainable:
+    delta_s = 0
+    rot_T   = 0                                   # << EXACT identity at t=0
+  forward(x, y):
+    A = scatter(rot_T)                            # (n_blocks, bs, bs)
+    A = 0.5·(A - A^T);  A = a_lim·tanh(A/a_lim)
+    X = A/2                                        # << no α
+    R_blocks = Cayley(X)
+    R = block_diag(R_blocks)                       # << builds full (r,r) via for-loop
+    if rotate_basis=="V":  Vh_eff = R @ Vh
+    elif rotate_basis=="U": U_eff = U @ R^T
+    S_eff = S + delta_s
+    h = (x @ Vh_eff^T) · S_eff
+    delta = h @ U_eff^T
+    return y + delta                               # y = x @ W_res^T already
+  Identity at t=0: rot_T=0 → R=I, delta_s=0 → delta = x @ Vh^T @ diag(S) @ U^T =
+  x @ W_svd^T → y + delta reproduces W·x up to floating-point round-off.
+
+  Diff vs antipasto3:
+
+  - ⚠ No alpha steering coefficient. antipasto3's headline feature —
+  bidirectional symmetry (R(+α)·R(-α)=I) — is dropped. If you intend antipasto
+  as a fine-tuning-only adapter, this is a deliberate scope cut; if you ever
+  want bidirectional steering at inference, you'd need to thread an α through
+  the forward.
+  - ⚠ Rotation chirality flipped. antipasto3 rotates the input basis by R (U @ 
+  R); ours rotates by R^T (Vh_eff = R @ Vh ⇒ (x @ Vh^T) @ R^T). Same
+  orthogonal-group element up to inversion, so gradient-equivalent under sign
+  flip of rot_T. Not a correctness bug, but checkpoints aren't transferable
+  between the two impls without a sign flip.
+  - ✓ Stricter t=0 identity. Ours zero-inits both delta_s and rot_T → R=I and
+  S_eff=S exactly, so the output matches W·x up to round-off. antipasto3 nudges
+  delta_s ≈ +4e-4 and rot_T ∼ 1e-4 for symmetry breaking.
+  Foot-gun: at exact zero rotation, ∂loss/∂rot_T may be sign-symmetric (rotating
+   in either direction gives same loss to first order), so optimizer can stall
+  on rot_T while delta_s carries the load. antipasto3's small bias avoids this.
+  Worth a docstring note.
+  - ✗ Perf: full (r,r) block-diagonal materialized via Python for-loop in
+  _block_diag. antipasto3 uses einsum('dnb,nbc->dnc', U_reshaped, R_blocks) to
+  apply each block directly without ever building the (r,r) matrix. For r=64,
+  bs=4 this is 16 rotation blocks; the for-loop costs O(r²) memory plus Python
+  overhead. Easy fix:
+  # rotate V (input basis): replace block_diag + matmul with einsum
+  Vh_blocks = Vh.reshape(n_blocks, bs, d_in)
+  Vh_eff = einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i").reshape(r, d_in)
+  - ⚠ rotate_basis enum ("V" XOR "U") vs. antipasto3's two booleans (can do
+  both). Minor expressivity loss.
+  - ✓ Cayley, skew construction, tanh angle clamping all match antipasto3
+  exactly (incl. the 0.5 factor on (A - A^T)).
+  - ✓ W_res via PiSSA-style in-place mutation of layer.weight is more
+  memory-efficient than antipasto3's separate buffer; semantically identical.
+  - ⚠ arxiv URL: https://arxiv.org/abs/2601.07473 — earlier I confirmed it
+  returns HTTP 200, so the paper exists. Fine.
+
+  Verdict: correct as a fine-tuning adapter with stricter t=0 identity. Two
+  divergences from antipasto3 worth deciding on:
+
+  1. Chirality: trivial sign-flip in skew if you want checkpoint compat with
+  antipasto3 — otherwise leave.
+  2. No α: if antipasto needs to support inference-time steering (the paper's
+  selling point), thread an α scalar through the forward and S_eff = S + 
+  α·delta_s; X = α·A/2.
+
+  And one perf fix: kill _block_diag, use einsum.
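
A NumPy sketch of the two properties discussed in this review, under simplifying assumptions: `cayley_blocks` is an illustrative name, it antisymmetrizes a full (n, bs, bs) parameter tensor and skips the upper-triangular scatter, and α is threaded through as in the antipasto3 pseudocode. It checks that R(+α) and R(−α) are inverses and that the per-block einsum matches a dense block-diagonal matmul.

```python
import numpy as np

def cayley_blocks(T, alpha=1.0, a_lim=1.0):
    """T: (n, bs, bs) raw params -> (n, bs, bs) orthogonal blocks via Cayley."""
    A = 0.5 * (T - np.swapaxes(T, -1, -2))   # skew-symmetric part
    A = a_lim * np.tanh(A / a_lim)           # elementwise clamp; tanh is odd, so A stays skew
    X = alpha * A / 2.0
    I = np.eye(T.shape[-1])
    return np.linalg.solve(I - X, I + X)     # (I - X)^-1 (I + X), batched over n

rng = np.random.default_rng(0)
n_blocks, bs, d_in = 4, 4, 10
r = n_blocks * bs
T = rng.normal(size=(n_blocks, bs, bs))

R_plus = cayley_blocks(T, alpha=+1.0)
R_minus = cayley_blocks(T, alpha=-1.0)

# Per-block application: never builds the (r, r) matrix.
Vh = rng.normal(size=(r, d_in))
Vh_blocks = Vh.reshape(n_blocks, bs, d_in)
Vh_einsum = np.einsum("nab,nbi->nai", R_plus, Vh_blocks).reshape(r, d_in)

# Reference: dense block-diagonal R, built with an explicit loop.
R_dense = np.zeros((r, r))
for k in range(n_blocks):
    R_dense[k * bs:(k + 1) * bs, k * bs:(k + 1) * bs] = R_plus[k]
Vh_dense = R_dense @ Vh
```

Since R(−α) equals the transpose of R(+α), flipping the sign of the rotation parameters is also what converts between the two chirality conventions mentioned above.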
+
+❯ ok now make a task for each fix and fix pls 
+
+● Update(src/lora_lite/variants/dora.py)
+  ⎿  Added 4 lines, removed 3 lines
+      57      ) -> Float[T, '*B o']:
+      58          cfg = layer._lora_cfg
+      59          scale = cfg.alpha / cfg.r
+      60 -        # V = W + scale * B @ A
+      60 +        # Paper §4.3: treat ||V+ΔV||_c as a constant (detach from grad graph) for
+      61 +        # stability and ~2x lower memory. Match peft (lora_weight.detach + weight_norm.detach).
+      62          BA = einsum(layer.lora_B, layer.lora_A, "o r, r i -> o i")
+      62 -        V = layer.weight + scale * BA                          # (d_out, d_in)
+      63 -        v_norm = V.norm(dim=1).clamp_min(1e-12)                # (d_out,)
+      63 +        V = layer.weight + scale * BA.detach()                 # (d_out, d_in)
+      64 +        v_norm = V.norm(dim=1).clamp_min(1e-12).detach()       # (d_out,)
+      65          # Bias passes through unscaled (matches peft).
+      66          bias = getattr(layer, "bias", None)
+      67          wx = y if bias is None else (y - bias)
+
+● Update(src/lora_lite/variants/pissa.py)
+  ⎿  Added 4 lines, removed 3 lines
+       3  Meng et al. 2024  https://arxiv.org/abs/2404.02948
+       4  
+       5      W = U S Vh        (truncated to top-r)
+       6 -    B = U sqrt(S),  A = sqrt(S) Vh,   W_res = W - B A
+       6 +    Sr_eff = Sr / (alpha/r)                          # peft-style: pre-divide so A/B
+       7 +    B = U sqrt(Sr_eff),  A = sqrt(Sr_eff) Vh         # update dynamics match for any alpha
+       8 +    W_res = W - (alpha/r) B A      = W - U Sr Vh     # scaling cancels symmetrically
+       9  
+       8 -Identity at t=0: W_res + B@A == W (bf16 round-trip, not bit-exact).
+       9 -Pass alpha=r for paper-faithful scale=1.
+      10 +Identity at t=0: W_res + (alpha/r) B@A == W (fp32 round-trip, bf16 cast can drift).
+      11  
+      12  Refs:
+      13    - paper: https://github.com/MuLabPKU/PiSSA/blob/main/utils/init_pissa.py
+
+● Update(src/lora_lite/variants/pissa.py)
+  ⎿  Added 7 lines, removed 3 lines
+      52          W = layer.weight.data.float()                       # (d_out, d_in)
+      53          U, S, Vh = torch.linalg.svd(W, full_matrices=False)
+      54          r = cfg.r
+      55 +        scale = cfg.alpha / cfg.r
+      56          Ur, Sr, Vhr = U[:, :r], S[:r], Vh[:r, :]
+      56 -        sqrtS = Sr.sqrt()
+      57 -        # B @ A = Ur diag(Sr) Vhr;  pick B = Ur sqrt(Sr),  A = sqrt(Sr) * Vhr
+      57 +        # Pre-divide Sr by scaling so A/B carry "natural" magnitudes for any alpha
+      58 +        # (peft pissa_init does this; needed so the alpha/r scaling on the forward
+      59 +        # gives matched update dynamics rather than re-scaling A,B by sqrt(scale)).
+      60 +        Sr_eff = Sr / scale
+      61 +        sqrtS = Sr_eff.sqrt()
+      62 +        # B @ A = Ur diag(Sr/scale) Vhr;  W_res = W - scale * B@A = W - Ur diag(Sr) Vhr.
+      63          B = (Ur * sqrtS).to(cfg.dtype)
+      64          A = (sqrtS[:, None] * Vhr).to(cfg.dtype)
+      65          layer.lora_B.data.copy_(B)
+      66          layer.lora_A.data.copy_(A)
+      67          # fp32 subtraction so W_res stays accurate.
+      68          BA = (B.float() @ A.float())
+      64 -        scale = cfg.alpha / cfg.r
+      69          layer.weight.data.copy_((W - scale * BA).to(layer.weight.dtype))
+      70  
+      71      @staticmethod
+
+● Update(src/lora_lite/variants/antipasto.py)
+  ⎿  Removed 9 lines
+      59      return _cayley(A)
+      60  
+      61  
+      62 -def _block_diag(blocks: torch.Tensor) -> torch.Tensor:
+      63 -    """(n_blocks, bs, bs) -> (n_blocks*bs, n_blocks*bs) block-diagonal."""
+      64 -    n, bs, _ = blocks.shape
+      65 -    out = blocks.new_zeros(n * bs, n * bs)
+      66 -    for i in range(n):
+      67 -        out[i * bs : (i + 1) * bs, i * bs : (i + 1) * bs] = blocks[i]
+      68 -    return out
+      69 -
+      70 -
+      62  @register
+      63  class AntiPaSTO:
+      64      name = "antipasto"
+
+● Update(src/lora_lite/variants/antipasto.py)
+  ⎿  Added 11 lines, removed 4 lines
+      114          S = layer.lora_S.to(x.dtype)                          # (r,)
+      115          Vh = layer.lora_Vh.to(x.dtype)                        # (r, d_in)
+      116  
+      117 -        R_blocks = _build_rotation(layer.lora_rot_T.float(), bs, max_angle)
+      118 -        R = _block_diag(R_blocks).to(x.dtype)                 # (r, r)
+      117 +        R_blocks = _build_rotation(layer.lora_rot_T.float(), bs, max_angle).to(x.dtype)
+      118 +        n_blocks, _, _ = R_blocks.shape                       # (n, bs, bs)
+      119 +        d_in, d_out = Vh.shape[1], U.shape[0]
+      120 +        r = n_blocks * bs
+      121  
+      122 +        # Apply block-diagonal R per-block via einsum, never materializing (r,r).
+      123          if rotate_basis == "V":
+      121 -            Vh_eff = R @ Vh                                   # rotate INPUT basis
+      124 +            # Vh_eff = R @ Vh, viewed block-wise on the r-axis.
+      125 +            Vh_blocks = Vh.reshape(n_blocks, bs, d_in)
+      126 +            Vh_eff = einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i").reshape(r, d_in)
+      127              U_eff = U
+      128          elif rotate_basis == "U":
+      129 +            # U_eff = U @ R.T, viewed block-wise on the r-axis.
+      130 +            U_blocks = U.reshape(d_out, n_blocks, bs)
+      131 +            U_eff = einsum(U_blocks, R_blocks, "d n b, n c b -> d n c").reshape(d_out, r)
+      132              Vh_eff = Vh
+      125 -            U_eff = U @ R.T                                   # rotate OUTPUT basis
+      133          else:
+      134              raise ValueError(f"rotate_basis must be 'U' or 'V', got {rotate_basis!r}")
+      135  
+
+● Verify the antipasto change with a quick import + a smoke equivalence check.
+
+● Bash(python -c "
+      import torch…)
+  ⎿  Error: Exit code 127
+     (eval):1: command not found: python
+
+❯ > einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i").reshape(r, d_in) 
+                                                                          
+  should not the reshape use einops reshape? there is also reduce         
+
+● You're right — use einops rearrange for the shape ops to keep them
+  self-documenting. Updating.
+
+  Read 1 file
+
+● Update(src/lora_lite/variants/antipasto.py)
+  ⎿  Added 1 line, removed 1 line
+      19  from typing import Literal
+      20  
+      21  import torch
+      22 -from einops import einsum                                      
+      22 +from einops import einsum, rearrange                           
+      23  from jaxtyping import Float
+      24  from torch import nn, Tensor as T
+      25  
+
+● Update(src/lora_lite/variants/antipasto.py)
+  ⎿  Added 7 lines, removed 7 lines
+      115          Vh = layer.lora_Vh.to(x.dtype)                        # (r, d_in)
+      116  
+      117          R_blocks = _build_rotation(layer.lora_rot_T.float(), bs, max_angle).to(x.dtype)
+      118 -        n_blocks, _, _ = R_blocks.shape                       # (n, bs, bs)
+      119 -        d_in, d_out = Vh.shape[1], U.shape[0]
+      120 -        r = n_blocks * bs
+      118 +        n_blocks = R_blocks.shape[0]                          # R_blocks: (n, bs, bs)
+      119  
+      120          # Apply block-diagonal R per-block via einsum, never materializing (r,r).
+      121          if rotate_basis == "V":
+      122              # Vh_eff = R @ Vh, viewed block-wise on the r-axis.
+      125 -            Vh_blocks = Vh.reshape(n_blocks, bs, d_in)
+      126 -            Vh_eff = einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i").reshape(r, d_in)
+      123 +            Vh_blocks = rearrange(Vh, "(n a) i -> n a i", n=n_blocks)
+      124 +            Vh_rot = einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i")
+      125 +            Vh_eff = rearrange(Vh_rot, "n a i -> (n a) i")
+      126              U_eff = U
+      127          elif rotate_basis == "U":
+      128              # U_eff = U @ R.T, viewed block-wise on the r-axis.
+      130 -            U_blocks = U.reshape(d_out, n_blocks, bs)
+      131 -            U_eff = einsum(U_blocks, R_blocks, "d n b, n c b -> d n c").reshape(d_out, r)
+      129 +            U_blocks = rearrange(U, "d (n b) -> d n b", n=n_blocks)
+      130 +            U_rot = einsum(U_blocks, R_blocks, "d n b, n c b -> d n c")
+      131 +            U_eff = rearrange(U_rot, "d n c -> d (n c)")
+      132              Vh_eff = Vh
+      133          else:
+      134              raise ValueError(f"rotate_basis must be 'U' or 'V', got {rotate_basis!r}")
+
+● Now run the equivalence smoke check.
+
+  Listed 1 directory (ctrl+o to expand)
+
+● Bash(.venv/bin/python -c "
+      import torch…)
+  ⎿  t=0 identity err (V):  1.91e-06
+     einsum vs blockdiag (V): 1.31e-06
+     einsum vs blockdiag (U): 1.07e-06
+
+● All three checks pass at fp32 round-trip noise (~1e-6).
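The body of the check is elided in the log above (`import torch…`), so what follows is only a minimal sketch of what an einsum-vs-block-diagonal equivalence test can look like. It uses NumPy instead of torch/einops so it runs outside the project environment; the helper names `rotate_blocks_V`, `rotate_blocks_U`, `dense_block_diag` and the toy sizes are assumptions, not the code that was actually run.

```python
import numpy as np

def rotate_blocks_V(R_blocks, Vh):
    """Vh_eff = R @ Vh, applied per block without building the (r, r) matrix."""
    n, bs, _ = R_blocks.shape
    Vh_blocks = Vh.reshape(n, bs, -1)                        # "(n a) i -> n a i"
    Vh_rot = np.einsum("nab,nbi->nai", R_blocks, Vh_blocks)
    return Vh_rot.reshape(n * bs, -1)                        # "n a i -> (n a) i"

def rotate_blocks_U(R_blocks, U):
    """U_eff = U @ R.T, applied per block on the r-axis."""
    n, bs, _ = R_blocks.shape
    U_blocks = U.reshape(U.shape[0], n, bs)                  # "d (n b) -> d n b"
    U_rot = np.einsum("dnb,ncb->dnc", U_blocks, R_blocks)
    return U_rot.reshape(U.shape[0], n * bs)                 # "d n c -> d (n c)"

def dense_block_diag(R_blocks):
    """Reference path: materialize the dense block-diagonal rotation."""
    n, bs, _ = R_blocks.shape
    R = np.zeros((n * bs, n * bs), dtype=R_blocks.dtype)
    for k in range(n):
        R[k * bs:(k + 1) * bs, k * bs:(k + 1) * bs] = R_blocks[k]
    return R

rng = np.random.default_rng(0)
n, bs, d_in, d_out = 4, 2, 16, 12
# Random orthogonal blocks from a QR decomposition (toy stand-in for _build_rotation).
R_blocks = np.stack(
    [np.linalg.qr(rng.standard_normal((bs, bs)))[0] for _ in range(n)]
).astype(np.float32)
Vh = rng.standard_normal((n * bs, d_in)).astype(np.float32)
U = rng.standard_normal((d_out, n * bs)).astype(np.float32)

R = dense_block_diag(R_blocks)
err_V = np.abs(rotate_blocks_V(R_blocks, Vh) - R @ Vh).max()
err_U = np.abs(rotate_blocks_U(R_blocks, U) - U @ R.T).max()
print(f"einsum vs blockdiag (V): {err_V:.2e}")
print(f"einsum vs blockdiag (U): {err_U:.2e}")
```

Both errors should sit at fp32 round-off, since the two paths compute the same sums in a different order.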
+
+● Quick sanity check on dora and pissa too.
+
+● Bash(.venv/bin/python -c "
+      import torch…)
+  ⎿  DoRA t=0 identity err: 0.00e+00
+     DoRA grad on A: 0.3232 (should be > 0)
+     DoRA grad on m: 3.2282 (should be > 0)
+     … +4 lines (ctrl+o to expand)
+
+● All three fixes verified:
+
+  - DoRA: t=0 bit-identity (err=0), grads still flow on A and m, but no longer
+  through v_norm (paper §4.3 / peft).
+  - PiSSA: bit-identity now holds for α ∈ {r/2, r, 2r}, not just α=r. Old impl
+  was only correct at α=r.
+  - AntiPaSTO: einsum path matches block-diag matmul to fp32 round-trip (~1e-6);
+   t=0 bit-identity preserved.
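The PiSSA item can be illustrated with a short NumPy sketch of the pre-divide algebra. It assumes `scale = α/r` and a square-root split of the singular values between A and B; `pissa_init` and `top_r` are hypothetical helpers for illustration, not the contents of `pissa.py`, and the old implementation is not shown in this log.

```python
import numpy as np

def top_r(W, r):
    """Best rank-r approximation of W via truncated SVD."""
    U, S, Vh = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * S[:r]) @ Vh[:r]

def pissa_init(W, r, alpha):
    """Split the top-r SVD into A, B with singular values pre-divided by scale."""
    scale = alpha / r
    U, S, Vh = np.linalg.svd(W, full_matrices=False)
    root = np.sqrt(S[:r] / scale)      # pre-divide so that scale * B @ A == top_r(W)
    B = U[:, :r] * root                # (d_out, r)
    A = root[:, None] * Vh[:r]         # (r, d_in)
    W_res = W - scale * (B @ A)        # frozen residual
    return A, B, W_res, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((12, 8))
r = 4
errs = {}
for alpha in (r / 2, r, 2 * r):
    A, B, W_res, scale = pissa_init(W, r, alpha)
    adapter_err = np.abs(scale * (B @ A) - top_r(W, r)).max()
    identity_err = np.abs(W_res + scale * (B @ A) - W).max()
    errs[alpha] = (adapter_err, identity_err)
    print(f"alpha={alpha}: adapter_err={adapter_err:.1e}, identity_err={identity_err:.1e}")
```

Without the division by `scale`, the product `scale * B @ A` would differ from the truncated SVD by a factor of α/r, which equals 1 only at α = r.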
+
+  Summary
+
+  ┌───────────────────────────────────────────┬─────────────────────┬────────┐
+  │                    Fix                    │        File         │ Status │
+  ├───────────────────────────────────────────┼─────────────────────┼────────┤
+  │ 1. detach BA + v_norm (paper §4.3)        │ dora.py:62-64       │ ✓      │
+  ├───────────────────────────────────────────┼─────────────────────┼────────┤
+  │ 2. pre-divide Sr by scale to match peft   │ pissa.py:51-65      │ ✓      │
+  │ for any α                                 │                     │        │
+  ├───────────────────────────────────────────┼─────────────────────┼────────┤
+  │ 3. einops rearrange + per-block einsum,   │ antipasto.py:41-130 │ ✓      │
+  │ dropped _block_diag                       │                     │        │
+  └───────────────────────────────────────────┴─────────────────────┴────────┘
+
+  Skipped (judgment calls — flag if you want them):
+  - delora lambda0=0.0 default → already documented in docstring
+  - antipasto α steering coefficient → design decision, ask before adding
+  - antipasto rotation chirality vs antipasto3 → gradient-equivalent under sign
+  flip; no checkpoint compat path anyway (jax↔torch)
+
+✻ Tinkered for 5m 55s
@@ -1,75 +0,0 @@
-# Design review: should lora-lite borrow from peft EVA / baukit / antipasto3?
-
-You are reviewing a minimal from-scratch LoRA library (`lora-lite`) and comparing
-it to three reference implementations. Goal: identify cherry-picks that would
-**reduce** complexity or unlock missing capability, **without bloating the lib**.
-
-## Inputs
-
-- lora-lite code: `src/lora_lite/` (adapter.py, target.py, variant.py, config.py, variants/*.py)
-- Reference: `docs/refs/peft_eva.py` (peft's EVA: data-driven SVD-of-activations init)
-- Reference: `docs/refs/peft_eva_finetuning.py` (example usage)
-- Reference: `docs/refs/baukit_nethook.py` (nethook: forward/backward hook patterns)
-- Reference: `docs/refs/antipasto3_svd_adapter.py` (wassname's earlier JAX SVD adapter)
-
-## Project ethos (read first)
-
-Lora-lite is fail-fast research code. Principles:
-- No defensive programming, no fallbacks, no legacy compat
-- Simplicity beats features. If you add X you must remove equivalent complexity.
-- Each variant is one file with paper URL + honest deviation notes.
-- Targets discovered by structural type-check, not name regex.
-- Hooks via plain torch forward_pre_hook on a single layer, no global registry.
-
-Read `AGENTS.md` if present.
-
-## Questions to answer
-
-For each reference, answer:
-
-### A. peft EVA (`docs/refs/peft_eva.py` + `peft_eva_finetuning.py`)
-
-1. What does EVA actually do? (1-paragraph summary; cite line numbers)
-2. What would a *minimal* EVA variant in lora-lite look like? Sketch the API:
-   - How does the user pass calibration data?
-   - Where does the SVD-of-activations happen — in `init()` with a callback,
-     or as a separate `calibrate(model, dataloader, cfg)` step before `attach`?
-3. Does peft's implementation have anything we could **drop** if we re-implemented?
-   (e.g. the rank-redistribution logic, the resume-from-checkpoint plumbing)
-4. Does lora-lite's current `Variant.init(layer, cfg)` signature support EVA, or
-   would we need to extend it? Recommend the **smallest** API change.
-
-### B. baukit nethook (`docs/refs/baukit_nethook.py`)
-
-1. What does `TraceDict` / `Trace` give us that our current per-layer
-   `forward_pre_hook` registration does not?
-2. Would adopting `baukit` for hook management (a) simplify our adapter.py,
-   (b) complicate it, or (c) be neutral? Quote specific lines from
-   `src/lora_lite/adapter.py` to justify.
-3. Lora-lite's principle: minimize deps. Is baukit worth a dep? Or should
-   we just **inline** the 1-2 useful patterns?
-
-### C. antipasto3 SVD adapter (`docs/refs/antipasto3_svd_adapter.py`)
-
-1. This is the user's earlier JAX work. Anything in there (init style, scale
-   parameterization, save/load format) that lora-lite should adopt or
-   deliberately diverge from?
-2. Does it suggest a cleaner factoring for PiSSA-like methods?
-
-## Output format
-
-For each (A, B, C), end with:
-
-**Recommendation: ADOPT / SKIP / PARTIAL**
-
-If ADOPT or PARTIAL, list the specific lines/patterns to import and the
-approximate net line-count impact on lora-lite (+ added, − removed).
-
-## Hard rules
-
-- Do NOT propose code edits. This is design notes only.
-- Do NOT recommend adding a feature unless you can name what to remove or
-  simplify in exchange.
-- Be specific. "Could be cleaner" is not a recommendation; "Replace L42-L67
-  in adapter.py with a 5-line TraceDict call" is.
-- If a reference's pattern is worse than what lora-lite already has, say so.
@@ -1,63 +0,0 @@
-# Per-variant paper-faithfulness audit for lora-lite
-
-You are reviewing a small from-scratch PEFT library (`lora-lite`) that re-implements
-6 LoRA variants. Your job: independent paper-vs-implementation sign-off for each.
-
-## Inputs available locally
-
-- Code: `src/lora_lite/variants/{lora,pissa,dora,ia3,hra,delora}.py`
-- Adapter base + plumbing: `src/lora_lite/{adapter.py,target.py,variant.py,config.py}`
-- Papers (extracted text): `docs/papers/{lora,pissa,dora,ia3,hra,delora}_*.txt`
-- Smoke log (toy + bnb): `logs/smoke.log`
-- Real-model probe log (Qwen0.6B, 16 SGD steps): `logs/qwen_probe.log`
-- Reference implementations (peft / antipasto3 / baukit): `docs/refs/*.py`
-
-## What I want from you (per variant, all 6)
-
-For each of `lora, pissa, dora, ia3, hra, delora` produce a section with:
-
-1. **Paper claim summary (1-3 sentences)** — cite paper file + section/eq number.
-   E.g. "PiSSA (docs/papers/pissa_2404.02948.txt §3.1, eq.4): A,B = top-r SVD of W,
-   W_res = W - BA; trains A,B with W_res frozen."
-
-2. **What our code does** — point to the function and key lines in
-   `src/lora_lite/variants/<v>.py`. Quote ≤5 lines.
-
-3. **Match? Y / Partial / N** — explicit verdict. If Partial, state the deviation
-   and whether it is documented in the variant's docstring.
-
-4. **Smoke evidence** — quote the exact row from `logs/smoke.log` (toy + bnb)
-   and the row from `logs/qwen_probe.log`. State whether the numbers are
-   consistent with paper expectations (e.g. PiSSA should have nonzero perturb at
-   t=0 because B@A reconstructs W; LoRA/HRA/IA3/DeLoRA should be identity at t=0).
-
-5. **Bugs / concerns** — anything actually wrong, especially:
-   - Gradient flow issues
-   - Wrong normalization / scaling
-   - Wrong initialization (e.g. PiSSA without SVD, HRA without orthogonality)
-   - Missing or wrong save/load handling
-   - Numerical issues (dtype, in-place ops on grad-required tensors)
-
-6. **Confidence** — High / Medium / Low, with one-line reason.
-
-## Final aggregate
-
-After the 6 sections, produce a Markdown table:
-
-| variant | paper match | smoke pass | qwen pass | bugs found | confidence |
-
-And a 3-bullet "biggest risks" summary.
-
-## Rules
-
-- Be skeptical. The previous audit found IA3, HRA, DeLoRA bugs that had been
-  declared "OK". Assume nothing.
-- If the smoke log does not include a check that you'd want to see, flag it as
-  a missing test — don't infer correctness from absence.
-- Quote evidence; do not paraphrase code.
-- Use file links: `src/lora_lite/variants/lora.py:42` style.
-- Do NOT edit code. Output is a verdict only.
-- If you cannot determine something from the available files, say so explicitly
-  rather than guessing.
-
-Write the full review to stdout. I will redirect to a file.
@@ -1,74 +0,0 @@
-# Per-variant paper-faithfulness audit V2 (with reference implementations)
-
-Re-audit of `lora-lite` after adding canonical reference implementation URLs to
-each variant docstring. Your job: for each variant, **directly compare** our
-implementation against the reference impl (peft and/or paper-author repo), not
-just against the paper text. This is round 2 — the previous review (you can
-read `docs/audit/variants_review.md`) found:
-
-- HRA gate=0 init kills `lora_U` gradient on step 0
-- DeLoRA same pattern with lambda0=0
-- IA3 targets q/v not paper k/v/ffn-down (deviation documented but untested)
-- PiSSA bf16 init err 0.31 on Qwen
-- Saved adapters don't preserve PiSSA W_res mutation
-
-Your job now is to verify those findings against the **reference code**, and
-look for anything the prior review missed once you have the reference in hand.
-
-## Inputs
-
-- Our code: `src/lora_lite/variants/{lora,pissa,dora,ia3,hra,delora}.py`
-- Adapter plumbing: `src/lora_lite/{adapter.py,target.py,variant.py,config.py}`
-- Papers (text): `docs/papers/*_*.txt`
-- **Reference implementations** (just added):
-  - `docs/refs/peft_lora_layer.py`   — peft LoRA Linear (and PiSSA init paths)
-  - `docs/refs/peft_lora_dora.py`    — peft DoRA helper module
-  - `docs/refs/peft_lora_variants.py` — peft per-variant init dispatch (PiSSA, OLoRA, etc.)
-  - `docs/refs/peft_ia3_layer.py`    — peft IA3 layer
-  - `docs/refs/peft_hra_layer.py`    — peft HRA layer (clean, has apply_GS toggle)
-  - `docs/refs/peft_delora_layer.py` — peft DeLoRA layer (upstreamed)
-  - `docs/refs/orig_pissa_init.py`   — PiSSA paper authors' init script (MuLabPKU)
-  - `docs/refs/orig_hra_layer.py`    — HRA paper authors' OFT-with-HRA layer (DaShenZi721)
-  - `docs/refs/orig_delora.py`       — DeLoRA paper authors' fork-of-peft impl (ExplainableML)
-- Logs: `logs/smoke.log`, `logs/qwen_probe.log`
-- Prior review: `docs/audit/variants_review.md` (do NOT just restate it)
-
-## What to deliver per variant (LoRA, PiSSA, DoRA, IA3, HRA, DeLoRA)
-
-1. **Reference impl ground-truth** — what does the *reference* code actually do
-   for: parameter shapes, initialization, scale factor, forward equation,
-   save/load, target placement? Quote ≤10 lines with file/line cites from
-   `docs/refs/`.
-
-2. **Our code** — quote our impl (≤10 lines, with `src/lora_lite/variants/<v>.py:LN` cites).
-
-3. **Diff** — bullet list of every meaningful difference.
-   Mark each one as: `[OK-doc]` (acceptable, documented), `[OK-undoc]` (acceptable,
-   should add to docstring), `[BUG]` (likely wrong), `[STYLE]` (cosmetic).
-
-4. **Did the prior review get it right?** Quote the relevant prior verdict
-   line and either confirm or correct.
-
-5. **Verdict** — Faithful / Faithful-with-doc-gap / Partial / Buggy.
-   One-line reason.
-
-## Final aggregate
-
-Markdown table:
-
-| variant | prior verdict | new verdict | new bugs found | doc gaps |
-
-And a 5-bullet "what to fix next" list, ordered by severity.
-
-## Hard rules
-
-- Quote evidence from `docs/refs/` files. If you can't find the relevant
-  reference function, say so explicitly — don't guess.
-- Do NOT edit code. Output review only.
-- Be specific about line numbers from the references. "peft does X" is not
-  enough; "peft_lora_layer.py:L1234 does X" is.
-- If you find a NEW bug not flagged in `variants_review.md`, mark it
-  `[NEW-BUG]` and explain the failure mode.
-- If the prior review was wrong (false positive), mark it `[OVERTURN]`.
-
-Write to stdout. I will redirect to `docs/audit/variants_review_v2.md`.
@@ -1,107 +0,0 @@
-# V3 Variant Review — per-component audit
-
-You are an expert ML engineer reviewing a from-scratch PEFT library
-(`lora-lite`, ~500 LOC) that re-implements 8 LoRA variants. Two prior reviews
-already happened (V1 paper-vs-code, V2 with reference implementations
-provided). Your job is V3: a tight per-component audit focused on
-correctness-of-mechanism rather than overall design.
-
-# Scope
-
-8 variants live in `src/lora_lite/variants/`:
-- lora.py
-- pissa.py
-- delora.py
-- ia3.py (two registered: `ia3` and `ia3_ff`)
-- dora.py
-- hra.py
-- eva.py        (NEW since V2)
-- antipasto.py  (NEW since V2)
-
-Plus the runtime in `src/lora_lite/{adapter.py,variant.py,target.py,config.py}`
-and the smoke test in `tests/smoke.py`.
-
-Reference implementations are in `docs/refs/` and the URLs are also pasted in
-each variant's module docstring. Compare against those.
-
-# What I want from you (per variant, in this order, every time)
-
-For EACH variant, work through these five checkpoints, using only that
-variant's file and its referenced peft/author code:
-
-1. **PARAMS** — list every spec returned by `param_specs`. For each:
-   shape, dtype (cfg.dtype unless overridden), trainable, as_buffer.
-   Does the shape match the reference impl? Are buffers vs Parameters
-   chosen correctly (no Parameter that should be a buffer; no buffer
-   that we want to learn)? Does as_buffer mean it persists in
-   state_dict (check `register_buffer(..., persistent=True)` in
-   adapter.py)?
-
-2. **INIT** — what does `init()` (and `group_init()` if defined) do?
-   Does it match the reference exactly? Pay special attention to
-   ZERO INITS — they often kill gradient flow on dependent params.
-   Walk the gradient: at t=0, given this init, which trainable params
-   actually receive non-zero gradient on the first SGD step?
-   Are dtype casts placed correctly (fp32 SVD, then to cfg.dtype)?
-
-3. **DTYPE** — trace dtype through init -> param storage -> forward.
-   Where could silent precision loss happen? Is bf16 or fp16 used
-   anywhere it shouldn't be? Does identity-at-init survive bf16?
-
-4. **FORWARD** — write the math the forward implements, in the same
-   convention as the reference (peft/author paper). Compare term by
-   term. Common mistakes to look for:
-   - wrong scale (alpha/r vs 1/r vs alpha vs 1)
-   - missing or doubled normalization
-   - wrong basis (rotating U vs V; gating input vs output)
-   - dropout placement (we have no dropout — flag if any variant
-     references one; see config.py)
-
-5. **LINK SANITY** — open the URLs in the docstring. Verify:
-   - the paper arxiv link goes to the right paper
-   - the github link points to a real file
-   - the offline `docs/refs/` snapshot matches what the URL serves
-     today (the snapshots may be stale; if so, flag the drift)
-
-# Output format
-
-For each variant, write at most ~60 lines. Use this template:
-
-    ## <variant>
-
-    ### params
-    - <one bullet per ParamSpec; flag bug if any>
-
-    ### init / group_init
-    - <bullets; identify GRADIENT FLOW at t=0 explicitly>
-
-    ### dtype
-    - <bullets>
-
-    ### forward
-    Math:    <one-line equation in our convention>
-    Ref math: <one-line equation in reference convention>
-    Match?   YES / NO + one-line explanation
-
-    ### links
-    - paper: OK / WRONG / DEAD
-    - peft ref: OK / DEAD
-    - author ref (if any): OK / DEAD
-    - offline snapshot drift: NONE / MINOR / MAJOR
-
-    ### verdict
-    CORRECT / PARTIAL / BUGGY  -- one-sentence reason
-
-After all variants, write a "## summary" with a markdown table of verdicts and
-a numbered list of MUST-FIX bugs (severity high) vs nice-to-haves.
-
-# Hard rules
-
-- Be specific. Cite line numbers (`src/lora_lite/variants/foo.py:NN`) for
-  every claim.
-- Do NOT propose redesigns. Only flag correctness issues against the
-  references.
-- If an issue is intentional and documented, say so and move on -- don't
-  re-flag known deviations from the docstrings.
-- If you can't tell whether something is a bug, say "AMBIGUOUS" with the
-  question you'd need answered.
@@ -1,186 +0,0 @@
-# V4 Variant Review — per-component vs reference + smoke/probe validity
-
-You are an expert ML engineer reviewing a from-scratch PEFT library
-(`lora-lite`, ~500 LOC) that re-implements 8 LoRA variants. Three prior
-reviews already happened (V1 paper-vs-code, V2 with refs provided, V3
-per-component). Your job is V4: re-run the per-component check and
-additionally validate the test harness.
-
-# Part A — per-variant audit (re-do, more rigorous)
-
-8 variants live in `src/lora_lite/variants/`:
-- lora.py
-- pissa.py
-- delora.py
-- ia3.py (registers `ia3` and `ia3_ff`)
-- dora.py
-- hra.py
-- eva.py
-- antipasto.py
-
-Plus runtime in `src/lora_lite/{adapter.py,variant.py,target.py,config.py}`.
-Reference implementations are in `docs/refs/` and URLs are pasted in each
-variant's module docstring.
-
-## For EACH variant, in this order, every time:
-
-1. **REFERENCE EXISTS** — verify the variant has a real, citeable
-   reference. Required:
-     - a paper (arxiv/conference) link, AND
-     - either an upstream peft implementation OR the original author's
-       code (GitHub).
-   If the variant has NO paper, NO reference code, OR the references
-   are dead/missing/clearly wrong, FLAG IT as `NO REFERENCE` -- this
-   is severity HIGH because it means there's nothing to validate
-   against.
-
-2. **PARAMS** — every spec from `param_specs`: shape, dtype, trainable,
-   as_buffer. Match against the reference. Buffers vs Parameters
-   chosen correctly?
-
-3. **INIT** — what does `init()` (and `group_init()` if defined) do?
-   Match the reference exactly? Walk gradient flow at t=0: which
-   trainable params actually receive non-zero gradient on step 1?
-
-4. **DTYPE** — trace dtype through init -> storage -> forward.
-   Silent precision loss? Identity-at-init survive bf16?
-
-5. **FORWARD** — write the math the forward implements vs the math
-   in the reference. Term-by-term comparison. Common mistakes:
-   - wrong scale (alpha/r vs 1/r vs alpha vs 1)
-   - missing/doubled normalization
-   - wrong basis (rotating U vs V; gating input vs output)
-   - dropout placement (we have NO dropout by design — flag if any
-     code path depends on one)
-
-6. **LINK SANITY** — actually open the URLs. Verify:
-   - paper arxiv link goes to the right paper
-   - github link points to a real file (not 404)
-   - offline `docs/refs/` snapshot still matches what the URL serves
-     today (snapshots may be stale; flag drift)
-
-## Per-variant output (≤60 lines each):
-
-    ## <variant>
-
-    ### references
-    - paper: <url>  -- OK / WRONG / DEAD / MISSING
-    - peft ref: <url>  -- OK / DEAD / MISSING
-    - author ref (if any): <url>  -- OK / DEAD / MISSING
-    - offline snapshot (`docs/refs/...`): NONE / MATCH / DRIFT
-    - VERDICT: HAS_REFERENCE / NO_REFERENCE
-
-    ### params
-    - <one bullet per ParamSpec; flag bug if any>
-
-    ### init / group_init
-    - <bullets; identify GRADIENT FLOW at t=0 explicitly>
-
-    ### dtype
-    - <bullets>
-
-    ### forward
-    Math (ours):    <one-line equation>
-    Math (ref):     <one-line equation>
-    Match?          YES / NO + one-line reason
-
-    ### verdict
-    CORRECT / PARTIAL / BUGGY  -- one-sentence reason
-
-# Part B — validate the smoke test (`tests/smoke.py`)
-
-Read `tests/smoke.py` end-to-end. For each per-variant SHOULD claim,
-answer:
-
-1. **Distinguishing power** — would a SILENT FAILURE (e.g. forward
-   returning `y` unchanged, or training only the bias term, or
-   loading an empty state dict) STILL pass this check? If yes,
-   the check is WEAK -- name a stronger one.
-
-2. **Tolerance sanity** — the bf16/fp16 tolerances are computed
-   from `base_scale`. Are they too loose? Too tight? Could they
-   pass on noise alone?
-
-3. **Coverage** — what mechanisms are NOT tested? (e.g. multi-step
-   convergence on real targets, dtype mismatch between attach and
-   load, mixing variants, calibration data of len < r for EVA)
-
-Output:
-
-    ## smoke.py validity
-
-    ### per-variant SHOULD checks
-    | check | distinguishes silent failure? | tolerance ok? | notes |
-    | ... |
-
-    ### gaps
-    - bullets
-
-    ### must-add tests
-    - bullets
-
-# Part C — validate the qwen overfit probe (`scripts/qwen_train_probe.py`)
-
-Read `scripts/qwen_train_probe.py` end-to-end. Same questions as Part B
-but for the Qwen probe specifically:
-
-1. Does `assert_only_lora_trainable` actually catch a leaked base
-   parameter, given the way `requires_grad` is set in `adapter.py`?
-
-2. `perturb_first_adapter` only perturbs ONE param per variant. Does
-   `perturb_delta > 1e-7` distinguish "the variant uses that param in
-   forward" from "the variant ignores that param"?
-
-3. `loss_last < loss0` after 8 steps with lr=5e-3 -- could this pass
-   purely from optimizer noise? What's the right held-out / validation
-   check to add?
-
-4. The reload check uses `args.reload_tol` (default 2e-2 in bf16). Is
-   that loose enough to mask a real save/load bug?
-
-5. Targets are restricted to `model.layers.0.self_attn.{q,v}_proj` --
-   does this exercise the full attach path or hide bugs that only
-   appear with multi-layer / FFN / lm_head edge cases?
-
-Output:
-
-    ## qwen_train_probe.py validity
-
-    ### claim-by-claim
-    | assertion | catches silent failure? | notes |
-    | ... |
-
-    ### gaps
-    - bullets
-
-    ### must-add tests
-    - bullets
-
-# Final summary
-
-After parts A, B, C, write:
-
-    ## summary
-
-    ### variant verdicts
-    | variant | has_ref | params | init | dtype | forward | verdict |
-
-    ### MUST-FIX (severity HIGH, blocks correctness claim)
-    1. ...
-    2. ...
-
-    ### NICE-TO-HAVE
-    - ...
-
-# Hard rules
-
-- Be specific. Cite line numbers (`src/lora_lite/variants/foo.py:NN`)
-  for every claim.
-- Do NOT propose redesigns. Only flag correctness issues against
-  references and validity issues in the test harness.
-- If an issue is intentional and documented in the docstring, say so
-  and move on -- don't re-flag known deviations.
-- If you can't tell whether something is a bug, say "AMBIGUOUS" with
-  the question you'd need answered.
-- For Part B/C, focus on whether checks have DISTINGUISHING power
-  (would a silent failure still pass?) -- not just whether they run.
@@ -1,42 +0,0 @@
-# External-Review Summary
-
-Two independent reviews via `acpx` external models. Full reviews:
-- [docs/audit/variants_review.md](variants_review.md) — per-variant paper-faithfulness audit
-- [docs/audit/design_review.md](design_review.md) — peft EVA / baukit / antipasto3 vs lora-lite design
-
-## Per-variant verdict
-
-| variant | match | bugs found | confidence |
-|---|---|---|---|
-| lora    | Y       | none material                                                                          | High        |
-| pissa   | Partial | bf16/Qwen init err 0.31; deviation `alpha==r` only in inline comment; residual not in saved adapter | Medium      |
-| dora    | Y       | possible denominator-gradient mismatch with paper's "cost-saving" variant              | High        |
-| ia3     | Partial | targets q/v not paper's k/v/ffn-down; deviation documented but not tested              | Medium      |
-| hra     | Partial | gate=0 init -> dU/dx=0 first step (lora_U dead); not orthogonal when gate != 1         | Medium-Low  |
-| delora  | Partial | no Eq.9 frozen-copy init; lambda0=0 -> A/B dead grad; lambda0=0.1 breaks identity      | Medium      |
-
-## Three biggest risks (reviewer's words)
-
-1. **Initialization vs gradient-flow tradeoffs are hidden by coarse tests.** HRA's `lora_U` and DeLoRA's `A/B` can be initially dead while `grad_nonzero=True` still passes (because *some* lora_* param has nonzero grad).
-2. **Qwen probe pass criteria do not enforce paper identity.** PiSSA shows `id_err=0.31`, DeLoRA `id_err=0.72`, but log says PASS.
-3. **Target semantics under-tested.** IA3's documented k/v/ffn deviation is never exercised by a positive test.
-
-## Design recommendations
-
-| ref | verdict | impact |
-|---|---|---|
-| peft EVA       | PARTIAL — add `calibrate(model, dataloader, cfg)` (~50 lines) | +50 lines, additive |
-| baukit nethook | SKIP — current 5-line hook registration is simpler          | 0 |
-| antipasto3 SVD | ADOPT concept (learnable delta_s) — no code change now      | 0 |
-
-## Recommended follow-up tasks (need user approval before implementing)
-
-A. **Per-param gradient probe**: extend smoke to assert grad on *each* lora_* param at step 0. Catches HRA/DeLoRA init-dead-param bug.
-
-B. **Per-variant identity tolerance in qwen probe**: PiSSA/DeLoRA need a stricter check (or relative tol against `||y_base||`) instead of "passes if id_err < some constant".
-
-C. **IA3 paper-faithful test row**: add one Qwen probe configuration with `target_names=k_proj|v_proj|down_proj` to exercise the documented IA3 placement.
-
-D. **PiSSA equivalence test against `peft.PiSSA`**: same seed + alpha=r, compare `B@A` reconstruction. Adds `peft` to test extras only.
-
-E. **EVA variant**: implement minimal `calibrate()` per design review (~50 lines). Optional, but provides our first data-driven init variant for the user's stated interest.
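
Follow-up B above asks for an identity check that is relative to `||y_base||` rather than a fixed constant. A minimal sketch of that idea, using NumPy arrays as stand-ins for model outputs; the function name `relative_identity_error` and the example values are hypothetical and do not come from `scripts/qwen_train_probe.py`:

```python
import numpy as np

def relative_identity_error(y_adapted, y_base, eps=1e-12):
    """Return ||y_adapted - y_base|| / ||y_base|| so the check scales with the output."""
    num = np.linalg.norm(y_adapted - y_base)
    den = np.linalg.norm(y_base) + eps   # eps guards against an all-zero baseline
    return float(num / den)

# Same absolute perturbation (0.1 per element), very different relative size.
big_base = np.full(4, 10.0)
small_base = np.full(4, 0.1)
rel_big = relative_identity_error(big_base + 0.1, big_base)
rel_small = relative_identity_error(small_base + 0.1, small_base)
print(f"rel err on large outputs: {rel_big:.3f}")
print(f"rel err on small outputs: {rel_small:.3f}")
```

An absolute threshold treats both cases the same, whereas the relative form flags the second one; what cut-off to use per variant and dtype is left open here.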
@@ -1,780 +0,0 @@
-[client] initialize (running)
-
-[client] session/new (running)
-
-[thinking] This is a detailed design review request. I need to: 1. Read the lora-lite codebase to understand its current structure 2. Read the three reference implementations 3. Answer specific questions for each Let me start by exploring the codebase structure and reading the key files.
-
-[tool] glob (pending)
-  input: {}
-
-[tool] glob (pending)
-  input: {}
-
-[tool] glob (completed)
-  kind: search
-  input: src/lora_lite/**/*.py
-  output:
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/adapter.py
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/lora.py
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/delora.py
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/hra.py
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/dora.py
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/ia3.py
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/__init__.py
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/pissa.py
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/target.py
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variant.py
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/__init__.py
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/config.py
-
-[tool] glob (completed)
-  kind: search
-  input: docs/refs/*.py
-  output:
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/antipasto3_svd_adapter.py
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/baukit_nethook.py
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva_finetuning.py
-    /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva.py
-
-[thinking] I need to read all these files to do a proper comparative analysis. Let me read the key lora-lite files first (adapter.py, variant.py, variants/lora.py), then the reference files.
-
-[tool] read (pending)
-  input: {}
-
-[tool] read (pending)
-  input: {}
-
-[tool] src/lora_lite/adapter.py (completed)
-  kind: read
-  input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/adapter.py
-  files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/adapter.py
-  output:
-    <path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/adapter.py</path>
-    <type>file</type>
-    <content>
-    1: """attach / detach / save / load. The whole runtime."""
-    2: from __future__ import annotations
-    3: import torch
-    4: from torch import nn
-    5: from torch.utils.hooks import RemovableHandle
-    6: 
-    7: from .config import LoraLiteConfig
-    8: from .variant import REGISTRY
-    9: from .target import find_targets
-    10: 
-    11: 
-    12: _ATTACHED_ATTR = "_lora_lite_attached"
-    13: 
-    14: 
-    15: def _hook(layer, args, y):
-    16:     (x,) = args
-    17:     cfg: LoraLiteConfig = layer._lora_cfg
-    18:     x_cast = x.to(cfg.dtype)
-    19:     out = layer._lora_variant.forward(layer, x_cast, y)
-    20:     return out.to(y.dtype)
-    21: 
-    22: 
-    23: def _pre_hook(layer, args):
-    24:     (x,) = args
-    25:     cfg: LoraLiteConfig = layer._lora_cfg
-    ... (92 more lines)
-
-[tool] read (pending)
-  input: {}
-
-[tool] src/lora_lite/variant.py (completed)
-  kind: read
-  input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variant.py
-  files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variant.py
-  output:
-    <path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variant.py</path>
-    <type>file</type>
-    <content>
-    1: """Variant protocol + registry. Variants own (x, layer.weight, layer.lora_*) -> y_new."""
-    2: from dataclasses import dataclass
-    3: from typing import Callable, Protocol, Any
-    4: import torch
-    5: from torch import nn
-    6: 
-    7: from .config import LoraLiteConfig
-    8: 
-    9: 
-    10: @dataclass
-    11: class ParamSpec:
-    12:     shape: tuple[int, ...]
-    13:     init: str | Callable[[torch.Tensor], None] = "zeros"  # 'zeros'|'kaiming'|'ones'|callable(t)
-    14:     trainable: bool = True
-    15: 
-    16:     def make(self, dtype: torch.dtype, device) -> nn.Parameter:
-    17:         t = torch.empty(self.shape, dtype=dtype, device=device)
-    18:         if callable(self.init):
-    19:             self.init(t)
-    20:         elif self.init == "zeros":
-    21:             t.zero_()
-    22:         elif self.init == "ones":
-    23:             t.fill_(1.0)
-    24:         elif self.init == "kaiming":
-    25:             # match nn.Linear default: kaiming_uniform_(a=sqrt(5))
-    ... (34 more lines)
-
-[tool] src/lora_lite/variants/lora.py (completed)
-  kind: read
-  input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/lora.py
-  files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/lora.py
-  output:
-    <path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/lora.py</path>
-    <type>file</type>
-    <content>
-    1: """Vanilla LoRA. Hu et al. 2021  https://arxiv.org/abs/2106.09685
-    2: 
-    3:     h = W x + (alpha/r) B A x
-    4: 
-    5: Identity at t=0 from B=0. Faithful to the paper.
-    6: """
-    7: from einops import einsum
-    8: from torch import nn
-    9: import torch
-    10: 
-    11: from ..variant import register, ParamSpec
-    12: 
-    13: 
-    14: @register
-    15: class LoRA:
-    16:     name = "lora"
-    17: 
-    18:     @staticmethod
-    19:     def param_specs(d_in, d_out, cfg):
-    20:         return {
-    21:             "lora_A": ParamSpec((cfg.r, d_in), init="kaiming", trainable=True),
-    22:             "lora_B": ParamSpec((d_out, cfg.r), init="zeros",  trainable=True),
-    23:         }
-    24: 
-    25:     @staticmethod
-    ... (14 more lines)
-
-[thinking] Now let me read the reference implementations to answer the questions.
-
-[tool] read (pending)
-  input: {}
-
-[tool] read (pending)
-  input: {}
-
-[tool] docs/refs/peft_eva_finetuning.py (completed)
-  kind: read
-  input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva_finetuning.py
-  files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva_finetuning.py
-  output:
-    <path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva_finetuning.py</path>
-    <type>file</type>
-    <content>
-    1: # Copyright 2024-present the HuggingFace Inc. team.
-    2: #
-    3: # Licensed under the Apache License, Version 2.0 (the "License");
-    4: # you may not use this file except in compliance with the License.
-    5: # You may obtain a copy of the License at
-    6: #
-    7: #     http://www.apache.org/licenses/LICENSE-2.0
-    8: #
-    9: # Unless required by applicable law or agreed to in writing, software
-    10: # distributed under the License is distributed on an "AS IS" BASIS,
-    11: # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-    12: # See the License for the specific language governing permissions and
-    13: # limitations under the License.
-    14: 
-    15: import torch
-    16: from datasets import load_dataset
-    17: from torch.utils.data import DataLoader
-    18: from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
-    19: from utils import DataCollator, TokenizerMetaMath
-    20: 
-    21: from peft import EvaConfig, LoraConfig, get_peft_model, initialize_lora_eva_weights
-    22: 
-    23: 
-    24: DEVICE = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
-    25: 
-    ... (74 more lines)
-
-[tool] docs/refs/peft_eva.py (completed)
-  kind: read
-  input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva.py
-  files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva.py
-  output:
-    <path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva.py</path>
-    <type>file</type>
-    <content>
-    1: # Copyright 2024-present the HuggingFace Inc. team.
-    2: #
-    3: # Licensed under the Apache License, Version 2.0 (the "License");
-    4: # you may not use this file except in compliance with the License.
-    5: # You may obtain a copy of the License at
-    6: #
-    7: #     http://www.apache.org/licenses/LICENSE-2.0
-    8: #
-    9: # Unless required by applicable law or agreed to in writing, software
-    10: # distributed under the License is distributed on an "AS IS" BASIS,
-    11: # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-    12: # See the License for the specific language governing permissions and
-    13: # limitations under the License.
-    14: 
-    15: import warnings
-    16: from collections import Counter, defaultdict
-    17: from collections.abc import Iterable, Mapping
-    18: from contextlib import nullcontext
-    19: from copy import deepcopy
-    20: from functools import partial
-    21: from itertools import cycle
-    22: from typing import Optional, Union
-    23: 
-    24: import torch
-    25: import torch.distributed as dist
-    ... (717 more lines)
-
-[thinking] Now let me read the baukit and antipasto3 references.
-
-[tool] read (pending)
-  input: {}
-
-[tool] read (pending)
-  input: {}
-
-[tool] docs/refs/baukit_nethook.py (completed)
-  kind: read
-  input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/baukit_nethook.py
-  files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/baukit_nethook.py
-  output:
-    <path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/baukit_nethook.py</path>
-    <type>file</type>
-    <content>
-    1: """
-    2: Utilities for instrumenting a torch model.
-    3: 
-    4: Trace will hook one layer at a time.
-    5: TraceDict will hook multiple layers at once.
-    6: subsequence slices intervals from Sequential modules.
-    7: get_module, replace_module, get_parameter resolve dotted names.
-    8: set_requires_grad recursively sets requires_grad in module parameters.
-    9: """
-    10: 
-    11: import contextlib
-    12: import copy
-    13: import inspect
-    14: from collections import OrderedDict
-    15: 
-    16: import torch
-    17: 
-    18: 
-    19: class Trace(contextlib.AbstractContextManager):
-    20:     """
-    21:     To retain the output of the named layer during the computation of
-    22:     the given network:
-    23: 
-    24:         with Trace(net, 'layer.name') as ret:
-    25:             _ = net(inp)
-    ... (449 more lines)
-
-[tool] docs/refs/antipasto3_svd_adapter.py (completed)
-  kind: read
-  input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/antipasto3_svd_adapter.py
-  files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/antipasto3_svd_adapter.py
-  output:
-    <path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/antipasto3_svd_adapter.py</path>
-    <type>file</type>
-    <content>
-    1: """SVD adapter for bidirectional steering via block-diagonal Cayley rotations.
-    2: 
-    3: Flax NNX port.
-    4: 
-    5: kernel = U @ diag(S) @ Vh + W_res   (kernel is (in, out), standard Flax convention)
-    6: Learnable: delta_s (additive S scaling), rotation_params (block-diagonal V rotation).
-    7: alpha scales both: S + alpha*delta_s, U @ R(alpha).
-    8: 
-    9: Why Cayley (not Givens or matrix exponential):
-    10: Cayley gives exact analytical reversibility: R(-alpha) = R(alpha)^{-1}.
-    11: This is critical -- at alpha=+1 and alpha=-1 the adapter is an exact inverse of
-    12: itself, making bidirectional steering symmetric by construction.
-    13: 
-    14: At alpha=0: U_rot = U and S_scaled = S, so the layer is identical to frozen weights.
-    15: """
-    16: 
-    17: import math
-    18: 
-    19: import jax
-    20: import jax.numpy as jnp
-    21: from jax import Array
-    22: from jaxtyping import Float, Int
-    23: from einops import rearrange
-    24: from flax import nnx
-    25: from loguru import logger
-    ... (353 more lines)
-
-[thinking] Now let me read the lora-lite config and variant protocol more carefully to understand the current API. Then I can compare it to the references.
-
-[tool] read (pending)
-  input: {}
-
-[tool] read (pending)
-  input: {}
-
-[tool] src/lora_lite/config.py (completed)
-  kind: read
-  input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/config.py
-  files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/config.py
-  output:
-    <path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/config.py</path>
-    <type>file</type>
-    <content>
-    1: from dataclasses import dataclass, field, asdict
-    2: from typing import Any
-    3: import torch
-    4: 
-    5: 
-    6: @dataclass
-    7: class LoraLiteConfig:
-    8:     variant: str = "lora"
-    9:     r: int = 8
-    10:     alpha: float = 16.0
-    11:     dropout: float = 0.0  # currently ignored; variants may use cfg.variant_kwargs
-    12:     dtype: torch.dtype = torch.bfloat16
-    13: 
-    14:     # targeting
-    15:     target_roles: tuple[str, ...] = ("reader", "writer")
-    16:     target_names: tuple[str, ...] = ()
-    17:     exclude_names: tuple[str, ...] = ("lm_head", "embed_tokens")
-    18:     layers: tuple[int, ...] | None = None
-    19: 
-    20:     # variant-specific bag (e.g. lambda0 for DeLoRA)
-    21:     variant_kwargs: dict[str, Any] = field(default_factory=dict)
-    22: 
-    23:     def to_dict(self) -> dict:
-    24:         d = asdict(self)
-    25:         d["dtype"] = str(self.dtype).removeprefix("torch.")
-    ... (16 more lines)
-
-[tool] src/lora_lite/target.py (completed)
-  kind: read
-  input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/target.py
-  files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/target.py
-  output:
-    <path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/target.py</path>
-    <type>file</type>
-    <content>
-    1: """Find linear-like targets by shape (reader/writer/inner) + name regex.
-    2: 
-    3: Structural matching is deliberate: bnb Linear4bit/8bitLt are not nn.Linear, but
-    4: they expose in_features/out_features/weight and their forward already handles
-    5: dequantization.
-    6: """
-    7: import re
-    8: from torch import nn
-    9: 
-    10: 
-    11: def is_linear_like(m: nn.Module) -> bool:
-    12:     return (
-    13:         hasattr(m, "in_features")
-    14:         and hasattr(m, "out_features")
-    15:         and hasattr(m, "weight")
-    16:         and callable(m)
-    17:     )
-    18: 
-    19: 
-    20: def _layer_idx(name: str) -> int | None:
-    21:     m = re.search(r"\.layers?\.(\d+)\.", name)
-    22:     return int(m.group(1)) if m else None
-    23: 
-    24: 
-    25: def _classify(m: nn.Module, d_model: int, name: str) -> str:
-    ... (37 more lines)
-
-[thinking] OK now I have a clear picture. Let me write up the design review with specific answers to each question. Let me also check if there's an AGENTS.md file:
-
-[tool] glob (pending)
-  input: {}
-
-[tool] glob (completed)
-  kind: search
-  input: AGENTS.md
-  output:
-    No files found
-
-[thinking] Let me check for AGENTS.md in parent directories or look for any README or docs.
-
-[tool] glob (pending)
-  input: {}
-
-[tool] glob (completed)
-  kind: search
-  input: **/AGENTS.md
-  output:
-    No files found
-
-[thinking] The user wants design notes. I have all the information I need. Let me write up the design notes for each reference.
-
-[tool] todowrite (pending)
-  input: {}
-
-[plan]
-  - [in_progress] write design notes for peft EVA
-  - [pending] write design notes for baukit nethook
-  - [pending] write design notes for antipasto3 SVD
-
-[tool] 3 todos (completed)
-  kind: other
-  input: {"todos":[{"content":"write design notes for peft EVA","status":"in_progress","priority":"high"},{"content":"write design notes for baukit nethook","status":"pending","priority":"high"},{"content":"write design notes ...
-  output:
-    [
-      {
-        "content": "write design notes for peft EVA",
-        "status": "in_progress",
-        "priority": "high"
-      },
-      {
-        "content": "write design notes for baukit nethook",
-        "status": "pending",
-        "priority": "high"
-      },
-      {
-        "content": "write design notes for antipasto3 SVD",
-        "status": "pending",
-        "priority": "high"
-      }
-    ]
-
-[thinking] Let me write out the design notes for each reference. ## A. peft EVA **What EVA does** (1-paragraph): EVA computes SVD-of-activations to initialize LoRA's low-rank decomposition. Rather than random kaiming init, it collects layer inputs during a forward pass through calibration data, runs incremental PCA to find principal components, and uses those as the initial `A` matrix (the left singular vectors). Key features: - Per-layer rank redistribution based on explained variance (lines 302-314) - Convergence checking via cosine similarity between SVD components (lines 147-168) - Distributed support via all-gather (lines 81-102) - Handles non-convergence gracefully (multiple forward passes until all layers converge) A **minimal EVA variant** in lora-lite: The current API is: ```python variant.init(layer, cfg) # called per-layer after params created ``` EVA needs a **separate calibration st...
-
-
-# Design Review: lora-lite vs peft EVA / baukit / antipasto3
-
-## A. peft EVA
-
-**What EVA does** (1-paragraph summary):  
-EVA initializes LoRA's low-rank factors from an SVD of activations. Rather than random kaiming init, it registers hooks on target layers, runs incremental PCA on calibration inputs (collected during forward passes), and uses the top principal components (the right singular vectors of the input activation matrix) as the initial `lora_A` matrix. Key features: per-layer rank redistribution based on explained variance (lines 302-314), convergence checking via cosine similarity (lines 147-168), and distributed support via all-gather (lines 81-102).
-
-**Minimal EVA variant sketch:**
-
-The current API is:
-```python
-variant.init(layer, cfg)  # called per-layer after params created (adapter.py:56)
-```
-
-EVA needs a **separate calibration step before `attach()`**:
-
-```python
-# Option: calibrate() -> eva_state dict
-def calibrate(model, dataloader, cfg):
-    # Run forward passes, collect activations per target
-    # Compute SVD, return {"layer_name": U_svd}
-    ...
-
-# Then in LoRA variant.init():
-if "eva_state" in cfg.variant_kwargs:
-    with torch.no_grad():
-        # lora_A is an nn.Parameter on the layer (no .weight); layer_name must be resolved by the caller
-        layer.lora_A.copy_(cfg.variant_kwargs["eva_state"][layer_name])
-```
-
-**What to drop if re-implementing:**
- Rank redistribution logic (lines 302-314) — skip, use fixed rank
- Distributed gather (lines 81-102) — skip single-GPU only
- Hash deduplication for equal inputs (lines 173-213, 374-386) — skip
- Resume-from-checkpoint plumbing (lines 489-559) — skip
- Whiten option (lines 475-477) — skip
- PEFT model wrapping (lines 612-656) — skip
-
-**API impact:**
-Current `Variant.init(layer, cfg)` signature at adapter.py:56 supports EVA via `cfg.variant_kwargs["eva_state"]`. Minimal change: add `calibrate(model, dataloader, cfg)` function (~50 lines) that computes SVD and returns state dict, then LoRA's `init()` checks for `eva_state` in kwargs.
-
---
-
-**Recommendation: PARTIAL**
-
- **Import**: Core SVD-of-activations collection + SVD computation (lines 289-486 minus distributed/rank-redist/PEFT plumbing)
- **Approximate line-count**: +~80 lines (calibrate function + init hook), −0 removed
- **Trade**: Adds data-driven init capability; costs one new function and variant_kwargs passthrough
-
---
-
-## B. baukit nethook
-
-**What `TraceDict`/`Trace` gives us that lora-lite doesn't:**
- Context manager pattern with automatic cleanup (adapter.py:68 already handles this via `RemovableHandle`)
- `retain_grad=True` for backward pass (lora-lite doesn't need this — we're not editing grads)
- `edit_output` for intercepting/modifying (not needed — we have dedicated `forward_input` hook)
- Dictionary interface (`ret['layer.name'].output`) — we just store layer ref directly
-
-Quote from adapter.py:64-68:
-```python
-for _, layer, _ in attached_targets:
-    if hasattr(layer._lora_variant, "forward_input"):
-        handles.append(layer.register_forward_pre_hook(_pre_hook))
-    else:
-        handles.append(layer.register_forward_hook(_hook))
-```
-
-**Would baukit simplify or complicate adapter.py?**
- **Neutral to negative**. Current code is 5 lines (64-68). Baukit's TraceDict adds ~100 lines of context-management indirection. We don't need:
-  - Per-layer config dicts (retain_output/retain_input per layer)
-  - Edit hooks mid-forward
-  - Stop-after-layer control
-  - clone/detach options
-
-**Dependencies:**
-Baukit is a single file (~470 lines). lora-lite currently has no deps beyond torch+einops. Adding baukit adds a dep for ~5 lines of value.
-
---
-
-**Recommendation: SKIP**
-
-Baukit adds ~400 lines for capabilities we don't need. Current hook registration (adapter.py:64-68) is simpler. The `retain_grad` feature could be useful for ML debugging, but that is a separate concern and can be inlined later if needed.
-
---
-
-## C. antipasto3 SVD adapter
-
-**1. What to adopt or diverge from?**
-
-| Feature | antipasto3 | lora-lite | Verdict |
-|--------|-----------|----------|--------|
-| SVD init from kernel | `create_svd_adapter()` line 169 | manual param_specs | **Diverge**: lora-lite missing |
-| Delta-s scaling | `S_scaled = S + alpha*delta_s` (line 160) | direct scaling via `alpha/r` | **Adopt**: antipasto3 better — learnable scaling, not fixed |
-| Block-diagonal rotation | Cayley transform (lines 135-141) | N/A | **Too complex** for lora-lite ethos |
-| Residual: `W_res = kernel - U@S@Vh` | line 204 | N/A | **Consider**: adds expressivity |
-| Dimension selection: L1 trip | lines 219-267 | None | **Skip**: too domain-specific |
-| Polarity interleave | lines 271-287 | N/A | **Skip**: steering-specific |
-
-**Key insight**: antipasto3's `W_res` residual (kernel - U@S@Vh) is the cleanest simplification over vanilla LoRA. Instead of:
- LoRA: `y = Wx + (α/r)BAx`  (learns full rank-r delta)
-
-You get:
- SVD: `y = W_res@x + (α/r)U_scaled@S_scaled@Vh@x`  (SVD-based delta, capped)
-
-This is essentially **PiSSA** (arXiv:2404.02948), which is already in lora-lite variants/ (pissa.py).
-
-**2. PiSSA-like factoring:**
-PiSSA is in variants/pissa.py. Let me check it:
-
-[tool] read (pending)
-  input: {}
-
-[tool] src/lora_lite/variants/pissa.py (completed)
-  kind: read
-  input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/pissa.py
-  files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/pissa.py
-  output:
-    <path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/pissa.py</path>
-    <type>file</type>
-    <content>
-    1: """PiSSA: top-r SVD of W into A,B; replace W with W_res = W - B@A.
-    2: 
-    3: Meng et al. 2024  https://arxiv.org/abs/2404.02948
-    4: W_eff(t=0) = W_res + B@A = W (numerically; bf16 round-trip not bit-exact).
-    5: """
-    6: import torch
-    7: from einops import einsum
-    8: from torch import nn
-    9: 
-    10: from ..variant import register, ParamSpec
-    11: 
-    12: 
-    13: @register
-    14: class PiSSA:
-    15:     name = "pissa"
-    16: 
-    17:     @staticmethod
-    18:     def param_specs(d_in, d_out, cfg):
-    19:         return {
-    20:             "lora_A": ParamSpec((cfg.r, d_in), init="zeros", trainable=True),
-    21:             "lora_B": ParamSpec((d_out, cfg.r), init="zeros", trainable=True),
-    22:         }
-    23: 
-    24:     @staticmethod
-    25:     def init(layer: nn.Linear, cfg) -> None:
-    ... (32 more lines)
-
-[thinking] So PiSSA is already in lora-lite! Let me check what antipasto3 offers beyond this. Comparing antipasto3 to PiSSA: - Both use SVD init from W antipasto3 has: - Delta-s scaling: learnable S + delta_s (additive learnable, not just fixed) - Residual is explicit (W_res stored separately) - PiSSA mutates layer.weight into W_res The key differences: 1. **Delta-s**: antipasto3 has `delta_s` parameters (line 89-91, 160) — fully learnable scaling. PiSSA has no learnable scaling, just fixed `alpha/r`. 2. **Steering coefficient**: antipasto3 uses explicit `alpha` mutated during forward (line 107, 116), not config-based. 3. **Residual**: PiSSA mutates `layer.weight` directly. antipasto3 stores `W_res` separately. antipasto3's delta-s is the main "new" thing worth considering. But that's really a steering/adaptation method (changing alpha at runtime), not an initialization method. For lora-lite's m...
-
-[tool] todowrite (pending)
-  input: {}
-
-[plan]
-  - [completed] write design notes for peft EVA
-  - [completed] write design notes for baukit nethook
-  - [completed] write design notes for antipasto3 SVD
-
-[tool] 0 todos (completed)
-  kind: other
-  input: {"todos":[{"content":"write design notes for peft EVA","priority":"high","status":"completed"},{"content":"write design notes for baukit nethook","priority":"high","status":"completed"},{"content":"write design notes ...
-  output:
-    [
-      {
-        "content": "write design notes for peft EVA",
-        "status": "completed",
-        "priority": "high"
-      },
-      {
-        "content": "write design notes for baukit nethook",
-        "status": "completed",
-        "priority": "high"
-      },
-      {
-        "content": "write design notes for antipasto3 SVD",
-        "status": "completed",
-        "priority": "high"
-      }
-    ]
-
-[thinking] Wait, I wrote the design notes directly as output but the user asked me to answer questions and end with a recommendation. Let me reformat as requested. Actually, looking at the instructions again: the output format says "For each (A, B, C), end with: Recommendation: ADOPT / SKIP / PARTIAL". So I need to restructure my response to match that format better, and NOT write code. Let me restructure: A, B, C sections with each ending in recommendation.
-
-
-# Design Review: lora-lite vs refs
-
-## A. peft EVA (`docs/refs/peft_eva.py` + `peft_eva_finetuning.py`)
-
-### 1. What EVA actually does
-
-EVA uses SVD-of-activations (incremental PCA on layer inputs during forward passes) to initialize LoRA rather than random kaiming init. Key mechanism:
- Registers hooks on target layers, collects inputs across calibration batches
- Computes principal components (right singular vectors of the stacked layer inputs) via `IncrementalPCA` (lines 138-143)
- Convergence check via cosine similarity between current/previous components (lines 167-168)
- Per-layer rank redistribution based on explained variance ratio (lines 302-314, 456)
-
-The user workflow (peft_eva_finetuning.py:66-76):
-```python
-eva_config = EvaConfig(rho=rho)
-peft_config = LoraConfig(r=rank, init_lora_weights="eva", eva_config=eva_config)
-peft_model = get_peft_model(model, peft_config)
-initialize_lora_eva_weights(peft_model, dataloader)  # separate calibration step
-```
-
-### 2. Minimal EVA variant API sketch
-
-**Option: separate calibrate() step before attach()**
-```python
-# User does:
-eva_state = calibrate(model, dataloader, cfg)  # returns {"layer_name": U_svd}
-cfg.variant_kwargs["eva_state"] = eva_state
-attach(model, cfg)  # LoRA.init() copies U_svd -> lora_A
-```
-
-API change: add `calibrate(model, dataloader, cfg)` function (~50 lines) that computes SVD on calibration inputs, returns state dict. In `LoRA.init()`, check for `eva_state[layer_name]` and copy.
-
-### 3. What to drop if re-implementing
-
- Lines 302-314: rank redistribution logic → use fixed rank r
- Lines 81-102: distributed all-gather → single-GPU only
- Lines 173-213, 374-386: hash deduplication for equal inputs → skip
- Lines 489-559: resume-from-checkpoint plumbing → skip
- Lines 475-477: whiten option → skip
- Lines 612-656: PEFT model wrapper integration → standalone function
-
-Droppable: ~60% of the 742-line file.
-
-### 4. Does current init() signature support EVA?
-
-Yes. adapter.py:56:
-```python
-variant.init(layer, cfg)
-```
-
-Add `eva_state` to `cfg.variant_kwargs` before attach, then in `LoRA.init()`:
-```python
-if "eva_state" in cfg.variant_kwargs:
-    # copy precomputed U into lora_A
-```
-
-Smallest change: add `calibrate()` function (~50 lines), no signature change needed.
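
The calibration step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the PEFT implementation: `calibrate` and its `target_names`/`r` arguments are hypothetical, and it swaps PEFT's incremental PCA for a single full-batch `torch.linalg.svd` without mean-centering, which is only practical when the calibration set is small.

```python
import torch
from torch import nn


def calibrate(model: nn.Module, batches, target_names, r: int) -> dict[str, torch.Tensor]:
    """Hypothetical helper: top-r input directions per target layer, each of shape (r, d_in)."""
    modules = dict(model.named_modules())
    collected = {name: [] for name in target_names}
    handles = []
    for name in target_names:
        def pre_hook(module, args, name=name):
            (x,) = args
            # Flatten batch/sequence dims so every row is one input vector.
            collected[name].append(x.detach().reshape(-1, x.shape[-1]).float())
        handles.append(modules[name].register_forward_pre_hook(pre_hook))
    try:
        with torch.no_grad():
            for batch in batches:
                model(batch)
    finally:
        # Always remove hooks, even if a forward pass fails.
        for h in handles:
            h.remove()

    state = {}
    for name, chunks in collected.items():
        X = torch.cat(chunks)  # (n_samples, d_in)
        # Rows of Vh are the right singular vectors of X, i.e. directions in input space.
        _, _, Vh = torch.linalg.svd(X, full_matrices=False)
        state[name] = Vh[:r].contiguous()
    return state


torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
batches = [torch.randn(64, 16) for _ in range(4)]
eva_state = calibrate(model, batches, target_names=["0", "2"], r=4)
```

Each returned tensor has the same `(r, d_in)` shape as `lora_A`, so under these assumptions it could be copied into the parameter inside `init()` once the layer's name is known.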
-
---
-
-**Recommendation: PARTIAL**
-
- **Import**: SVD-of-activations collection + compute
- **Lines**: +~50 (calibrate fn), −0
- **What to remove in exchange**: Nothing — this is additive capability only. If you add calibration data collection, you must accept the additional complexity.
-
---
-
-## B. baukit nethook (`docs/refs/baukit_nethook.py`)
-
-### 1. What TraceDict/Trace gives us that we don't have
-
-| Feature | baukit | lora-lite | Value |
-|--------|-------|----------|-------|
-| Context manager cleanup | lines 99-105 | RemovableHandle (adapter.py:64-68) | Same |
-| retain_grad | lines 37-38 | None | Not needed |
-| edit_output mid-forward | lines 39, 72-75 | None | Not needed |
-| Dictionary interface | line 111 | None | Not needed |
-| Stop-after-layer | lines 46-47, 189-202 | None | Not needed |
-
-TraceDict essentially provides: per-layer output retention + config dicts + cleanup. We already handle cleanup via RemovableHandle.
-
-### 2. Would baukit simplify or complicate adapter.py?
-
-Quote adapter.py:64-68:
-```python
-for _, layer, _ in attached_targets:
-    if hasattr(layer._lora_variant, "forward_input"):
-        handles.append(layer.register_forward_pre_hook(_pre_hook))
-    else:
-        handles.append(layer.register_forward_hook(_hook))
-```
-
-This is 5 lines. Baukit's TraceDict adds ~100 lines of context management we'd use only to replicate these 5 lines. Our current approach is simpler.
-
-### 3. Is baukit worth a dep?
-
-No. lora-lite's principle is minimal deps (currently torch+einops). Baukit adds ~400 lines of code for capabilities we don't need. `retain_grad` for debugging is the only useful feature — that's 3 lines to inline if needed later.
-
---
-
-**Recommendation: SKIP**
-
-Baukit adds a dep for no net simplification. Current hook registration (adapter.py:64-68) is 5 lines and sufficient. If `retain_grad` is needed for ML debugging, inline 3 lines later.
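
For reference, inlining `retain_grad` could look roughly like the sketch below. The helper name `retain_output_grad` and the `store` dict are assumptions made for illustration; only `register_forward_hook` and `Tensor.retain_grad` are real PyTorch calls.

```python
import torch
from torch import nn


def retain_output_grad(layer: nn.Module, store: dict, key: str):
    """Hypothetical helper: keep a layer's output and its gradient for inspection."""
    def hook(module, args, y):
        y.retain_grad()  # non-leaf tensors drop .grad unless asked to keep it
        store[key] = y
        # Returning None leaves the layer output unchanged.
    return layer.register_forward_hook(hook)


model = nn.Sequential(nn.Linear(4, 4), nn.Tanh(), nn.Linear(4, 1))
store = {}
handle = retain_output_grad(model[0], store, "fc0")
model(torch.randn(3, 4)).sum().backward()
handle.remove()
grad_shape = tuple(store["fc0"].grad.shape)
```

The hook only observes the output, so it can sit next to an adapter hook without changing what the layer computes.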
-
---
-
-## C. antipasto3 SVD adapter (`docs/refs/antipasto3_svd_adapter.py`)
-
-### 1. What to adopt or diverge from
-
-| Feature | antipasto3 | lora-lite | Verdict |
-|---------|-----------|----------|--------|
-| SVD init from kernel | line 169 `create_svd_adapter()` | PiSSA (pissa.py) | Already have |
-| Delta-s scaling: learnable | line 160 `S + alpha*delta_s` | fixed via alpha/r | **Adopt** |
-| Residual: `W_res = kernel - U@S@Vh` | line 204 | PiSSA has this | Already have |
-| Block-diagonal Cayley rotation | lines 135-141 | N/A | Skip (too complex) |
-| Dimension selection (L1 trip) | lines 219-267 | N/A | Skip (domain-specific) |
-
-**Key adoption**: antipasto3's learnable `delta_s` is cleaner than fixed scaling. Currently lora-lite uses `scale = alpha/r` at **runtime** — but delta-s makes scaling a **learnable parameter** with its own gradient.
-
-The PiSSA variant only partly covers this: `PiSSA.init()` fixes `scale=1` via the alpha=r convention (pissa.py:43-46), but it does not expose the scaling as a learnable parameter.
-
-### 2. PiSSA-like factoring
-
-lora-lite already has PiSSA in `variants/pissa.py`:
- Line 31-34: SVD of weight
- Line 36-40: decompose into B@A with sqrt(S) scaling
- Line 46: residual W_res = W - B@A
-
-This is exactly antipasto3's core insight, implemented cleanly. The antipasto3 contribution is making `delta_s` learnable — that's a natural extension of PiSSA, not a replacement.
-
---
-
-**Recommendation: ADOPT (concept)**
-
- **Import**: learnable delta_s concept (not code — just the idea)
- **Lines**: +0 by intent, no code change right now
- **What to remove**: Nothing — this is a design note that PiSSA could evolve toward learnable scaling instead of fixed alpha/r
-
-The other antipasto3 patterns (Cayley rotation, polarity interleave) are steering-specific and too complex for the lora-lite ethos. The clean insight is already in PiSSA.
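
To make the learnable `delta_s` idea concrete, here is a small PyTorch sketch of `S + alpha*delta_s` on top of a top-r SVD split. It is an illustration under assumptions, not antipasto3's Flax code or a lora-lite variant: `SVDScaleAdapter` is a hypothetical standalone module (lora-lite itself avoids module replacement), and the Cayley rotation is left out entirely.

```python
import torch
from torch import nn


class SVDScaleAdapter(nn.Module):
    """Hypothetical: y = x @ (W_res + U diag(S + alpha*delta_s) Vh).T + bias; only delta_s trains."""

    def __init__(self, linear: nn.Linear, r: int):
        super().__init__()
        W = linear.weight.detach().float()  # (d_out, d_in)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        self.register_buffer("U", U[:, :r].contiguous())
        self.register_buffer("S", S[:r].contiguous())
        self.register_buffer("Vh", Vh[:r].contiguous())
        # Residual keeps everything outside the top-r subspace frozen.
        self.register_buffer("W_res", W - (self.U * self.S) @ self.Vh)
        bias = None if linear.bias is None else linear.bias.detach().clone()
        self.register_buffer("bias", bias)
        self.delta_s = nn.Parameter(torch.zeros(r))  # zero init -> identity at start

    def forward(self, x, alpha: float = 1.0):
        h = x @ self.Vh.T                          # project onto top-r input directions
        h = h * (self.S + alpha * self.delta_s)    # learnable additive singular-value shift
        y = x @ self.W_res.T + h @ self.U.T
        return y if self.bias is None else y + self.bias


torch.manual_seed(0)
base = nn.Linear(16, 8)
adapter = SVDScaleAdapter(base, r=4)
x = torch.randn(5, 16)
identity_err = (adapter(x) - base(x)).abs().max().item()
trainable = [n for n, p in adapter.named_parameters() if p.requires_grad]
```

With `delta_s` at zero the module reproduces the frozen layer up to float round-off, and `alpha=0` recovers it at any point in training, which is the property the antipasto3 docstring describes.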
-
---
-
-## Summary
-
-| Ref | Recommendation | Impact |
-|-----|--------------|--------|
-| peft EVA | PARTIAL | +50 lines calibrate fn |
-| baukit | SKIP | 0 (keep current) |
-| antipasto3 | ADOPT concept | 0 (design note) |
-
-[done] end_turn
@@ -1,22 +0,0 @@
-# 2026-04-26 code review: testing proof
-
-## External review
-
-Reviewer: Gemini 2.5 Flash CLI, read-only prompt.
-
-Findings:
-
- Critical: `tests/smoke.py` could silently pass if base gradients leaked because it did not check non-`lora_*` grads.
- Important: `tests/smoke.py` did not explicitly assert the expected number of attached TinyModel targets.
-
-Resolution:
-
- Added `assert_no_base_grads(model)` to the smoke training loop.
- Added `assert n_targets == 28` immediately after smoke attach.
- Re-ran `just test` and `just smoke`; both passed.
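
The body of `assert_no_base_grads` is not shown in this review. One plausible shape for such a check, assuming it keys on the `lora_` name prefix, is sketched here; the toy `lora_scale` parameter exists only to exercise it.

```python
import torch
from torch import nn


def assert_no_base_grads(model: nn.Module) -> None:
    """Assumed behaviour: every parameter without 'lora_' in its name must have no grad."""
    leaked = [n for n, p in model.named_parameters()
              if "lora_" not in n and p.grad is not None]
    assert not leaked, f"base params received gradients: {leaked}"


layer = nn.Linear(4, 4)
layer.weight.requires_grad_(False)
layer.bias.requires_grad_(False)
layer.lora_scale = nn.Parameter(torch.ones(4))  # toy adapter parameter
(layer(torch.randn(2, 4)) * layer.lora_scale).sum().backward()
assert_no_base_grads(layer)  # passes: only lora_scale has a grad

# Simulate a leak by unfreezing the base weight.
layer.weight.requires_grad_(True)
layer(torch.randn(2, 4)).sum().backward()
try:
    assert_no_base_grads(layer)
    leak_detected = False
except AssertionError:
    leak_detected = True
```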
-
-## Fresh-eyes subagent review
-
-Verdict: PASS.
-
-The reviewer could not name a remaining way for skipped targets, dead hooks, base-gradient leakage, or broken save/load to have produced the collected evidence. Caveat: Qwen coverage is intentionally narrow (layer-0 `q_proj`/`v_proj`, one prompt, tiny steps). This supports interface/training proof, not downstream finetuning quality.
@@ -1,265 +0,0 @@
-# lora-lite plan and status
-
-## Goal
-
-Build a small, hackable LoRA-family adapter library for research experiments.
-
-The core bet is that adapter variants should own the relationship between `(x, layer.weight, layer.lora_*)` and the layer output, while the library only handles targeting, parameter attachment, hooks, and save/load.
-
-## Non-goals
-
- No PEFT compatibility layer.
- No module replacement.
- No merge/unmerge.
- No multiple named adapters per layer.
- No backward compatibility promises.
- No silent fallbacks.
-
-## Design constraints
-
- Adapter params are attached directly to target layers as `lora_*` parameters.
- Save/load uses normal `state_dict()` keys, filtered by `"lora_"`.
- Forward hooks return the layer's new output, not just a delta.
- Targeting is structural: modules with `in_features`, `out_features`, and `weight` are linear-like.
- LoRA/DeLoRA support bnb-style 4/8-bit forward paths because the quantized base layer computes `y`; the hook only adds adapter math.
- PiSSA is fp-only in v1 because it mutates `layer.weight` into `W_res`.
- Data-calibrated variants use `group_init(model, targets, cfg, calibration_data)`; dataloaders stay out of `cfg` so checkpoints are serializable.
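
The first three constraints can be illustrated together in a few lines. This is a sketch under assumptions, not the code in `adapter.py`: `attach_lora` is a hypothetical stand-in for `attach()` that hard-codes vanilla LoRA.

```python
import math

import torch
from torch import nn


def attach_lora(layer: nn.Linear, r: int = 4, alpha: float = 8.0):
    """Hypothetical stand-in for attach(): adapter params live on the layer as lora_*."""
    layer.weight.requires_grad_(False)
    if layer.bias is not None:
        layer.bias.requires_grad_(False)
    layer.lora_A = nn.Parameter(torch.empty(r, layer.in_features))
    layer.lora_B = nn.Parameter(torch.zeros(layer.out_features, r))  # B=0 -> identity at t=0
    nn.init.kaiming_uniform_(layer.lora_A, a=math.sqrt(5))

    def hook(module, args, y):
        (x,) = args
        delta = (x @ module.lora_A.T) @ module.lora_B.T
        return y + (alpha / r) * delta  # the returned value replaces the layer output

    return layer.register_forward_hook(hook)


torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))
x = torch.randn(3, 8)
y_base = model(x)
handles = [attach_lora(m) for m in model if isinstance(m, nn.Linear)]
identity_err = (model(x) - y_base).abs().max().item()

# Save path: ordinary state_dict keys, filtered by the "lora_" substring.
adapter_state = {k: v for k, v in model.state_dict().items() if "lora_" in k}
```

Because the parameters are registered on the target layers themselves, the filtered `state_dict()` already carries full module paths, and no separate adapter container is needed.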
-
-## Implemented v0.0.1
-
-| Area | Status | Evidence |
-|---|---:|---|
-| `LoraLiteConfig` | done | `src/lora_lite/config.py` |
-| Variant registry + `ParamSpec` | done | `src/lora_lite/variant.py` |
-| Structural target discovery | done | `src/lora_lite/target.py` |
-| `attach` / `detach` / `save` / `load` | done | `src/lora_lite/adapter.py` |
-| LoRA | done | `src/lora_lite/variants/lora.py` |
-| PiSSA | done, fp-only | `src/lora_lite/variants/pissa.py` |
-| DeLoRA | done | `src/lora_lite/variants/delora.py` |
-| IA3 | done | `src/lora_lite/variants/ia3.py` |
-| DoRA | done, fp-only | `src/lora_lite/variants/dora.py` |
-| HRA | done | `src/lora_lite/variants/hra.py` (output-side Householder, hook-only -> bnb-compatible) |
-| Smoke tests | done | `tests/smoke.py` |
-| bnb minimal forward smoke | done | `Linear8bitLt` and `Linear4bit` pass on CUDA with `just bnb-smoke` |
-
-## Current smoke evidence
-
-Last verified log: `/home/wassname/.cache/agent-tmp/lora_lite_smoke_after_review.log`
-
-| Check | Result |
-|---|---|
-| LoRA identity | `0.000e+00` |
-| LoRA loss drop | `6.1%` |
-| PiSSA identity | `1.550e-06` |
-| PiSSA loss drop | `11.5%` |
-| DeLoRA identity | `0.000e+00` |
-| DeLoRA loss drop | `93.4%` |
-| IA3 identity | `0.000e+00` |
-| IA3 loss drop | `88.7%` |
-| DoRA identity | `0.000e+00` |
-| DoRA loss drop | `63.3%` |
-| fake non-`nn.Linear` target | attaches, identity `0.000e+00`, grad nonzero |
-| bnb `Linear8bitLt` | identity `0.000e+00`, grad nonzero |
-| bnb `Linear4bit` | identity `0.000e+00`, grad nonzero |
-
-## 2026-04-26 testing proof pass
-
-Goal: upgrade from smoke-tested sketch to evidence that the current PEFT-lite interface trains on both toy models and a real HF Qwen model.
-
-### Scope
-
-In:
-
-- Pytest coverage for LoRA, PiSSA, and DeLoRA correctness invariants.
-- IA3 coverage, as the first simple adapter added while waiting on queued Qwen work.
-- A real `Qwen/Qwen3-0.6B` probe that trains each current variant on layer-0 `q_proj` and `v_proj`.
-- Repeatable `just` recipes and workspace-local logs/artifacts.
-
-Out:
-
-- Benchmark claims.
-- Quantized Qwen proof for PiSSA. PiSSA remains fp-only because it mutates `weight`.
-- Full default-target training over every Qwen layer.
-
-### Requirements and evidence
-
-| Requirement | Distinguishing check | Evidence |
-|---|---|---|
-| R1: toy tests catch skipped targets/hooks | Perturb only `lora_*`; output must change. Missing target must raise. | `just test` -> `8 passed in 2.43s` in `logs/pytest.log` |
-| R2: toy tests catch base-gradient leakage | After backward, all non-`lora_*` grads are `None`; all trainable names contain `lora_`. | `just test` -> `8 passed in 2.43s` |
-| R3: save/load is exact for adapters | Saved key set equals full-path `lora_*` state; reload tensors equal; missing/extra `lora_*` keys raise. | `just test` -> `8 passed in 2.43s` |
-| R4: current variants train on tiny task | 28 TinyModel targets; non-`lora_*` grads stay `None`; 20-step loss drop >5%. | `just smoke` -> LoRA 6.1%, PiSSA 11.5%, DeLoRA 93.4% |
-| R5: current variants train on real Qwen | Fresh Qwen per variant; exact targets are layer-0 `q_proj`/`v_proj`; perturb >0; lossN < loss0; reload err < tol. | `pueue` task 70, `logs/qwen_probe.log`, all probes pass |
-| R6: cold review cannot explain evidence under silent failure | External review findings fixed, then fresh-eyes subagent says PASS. | `docs/spec/20260426_code_review.md` |
-
-### Qwen proof table
-
-Command:
-
-```bash
-pueue add --immediate --follow --label "why: verify warning-free current Qwen probe after dtype API cleanup; resolve: same pass table proves current script" --working-directory "$PWD" --priority 1 -- just qwen-probe
-```
-
-Result from task 70:
-
-| variant | targets | trainable | id_err | perturb | loss0 | lossN | drop% | grad | dθ | reload | adapter |
-|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---|
-| lora | 2 | 20480 | 0 | 0.375 | 5.25 | 3.131 | 40.36 | 1.432 | 4.262 | 0 | `outputs/qwen_train_probe/lora_adapter.pt` |
-| pissa | 2 | 20480 | 0.3125 | 0.75 | 5.25 | 3.629 | 30.88 | 6.124 | 4.381 | 0 | `outputs/qwen_train_probe/pissa_adapter.pt` |
-| delora | 2 | 20482 | 0.375 | 0.4062 | 5.246 | 5.166 | 1.537 | 0.04778 | 8.196 | 0 | `outputs/qwen_train_probe/delora_adapter.pt` |
-
-Follow-up tasks 80 (lora/pissa/delora/ia3 at 16 steps) and 81 (dora at 16 steps) extend the table:
-
-| variant | targets | trainable | id_err | perturb | loss0 | lossN | drop% | grad | dθ | reload | adapter |
-|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---|
-| lora | 2 | 20480 | 0 | 0.375 | 5.25 | 2.432 | 53.68 | 1.467 | 6.403 | 0 | `outputs/qwen_train_probe/lora_adapter.pt` |
-| pissa | 2 | 20480 | 0.3125 | 0.75 | 5.25 | 2.958 | 43.66 | 6.124 | 5.909 | 0 | `outputs/qwen_train_probe/pissa_adapter.pt` |
-| delora | 2 | 20482 | 0.3281 | 0.3125 | 5.261 | 4.823 | 8.322 | 0.06303 | 15.1 | 0 | `outputs/qwen_train_probe/delora_adapter.pt` |
-| ia3 | 2 | 3072 | 0 | 0.375 | 5.25 | 4.473 | 14.79 | 0.463 | 5.926 | 0 | `outputs/qwen_train_probe/ia3_adapter.pt` |
-| dora | 2 | 23552 | 0 | 0.3203 | 5.25 | 2.439 | 53.54 | 1.776 | 7.44 | 0 | `outputs/qwen_train_probe/dora_adapter.pt` |
-| hra | 2 | 12290 | 0 | 0.3438 | 5.25 | 4.07 | 22.47 | 0.05225 | 4.735 | 0 | `outputs/qwen_train_probe/hra_adapter.pt` |
-
-Failure-mode interpretation:
-
-- If targeting silently skipped, exact target-set assertion would fail before training.
-- If hooks were attached but dead, perturb delta would be 0.
-- If base params trained, the non-`lora_*` gradient check would fail.
-- If adapter grads were absent, `grad` or `dθ` would be 0/non-finite.
-- If save/load were broken, adapter tensor equality or reload logit error would fail.
-
-## 2026-04-26 publishable workflow pass
-
-Goal: make the repo behave like a small buildable library without adding framework surface area.
-
-### Scope
-
-In:
-
-- Keep minimal functional integration tests as the default proof path.
-- Add a GitHub CI hook for fast tests and package build.
-- Add a `pueue` recipe for the larger Qwen proof.
-- Tighten README structure so install, quickstart, core idea, tests, and status are easy to find.
-
-Out:
-
-- PyPI publishing token/workflow. Publishing should wait until the license decision is explicit.
-- Implementing every named adapter variant in one pass. That would add complexity faster than tests can explain it.
-
-### Requirements and evidence
-
-| Requirement | Distinguishing check | Evidence |
-|---|---|---|
-| R7: fast CI catches broken tests/builds | `just check` must run pytest, smoke, `uv build`, and `twine check`; a broken test, wheel, sdist, or README metadata fails the hook. | `just check` -> pytest `10 passed in 4.10s`, smoke all pass, wheel/sdist built, `twine check dist/*` passed |
-| R8: large proof is queued, not hidden in CI | `just qwen-queue` must create a pueue task in the repo cwd with why/resolve label and intended Qwen command. | task 79 queued at `/media/wassname/SGIronWolf/projects5/2026/lora-lite` with `--variants lora pissa delora ia3 --steps 16` |
-| R9: README is publishable enough to judge | Reader sees install, quickstart, pseudocode core, testing commands, proof caveat before variant wishlist. | `README.md` reordered and human note removed |
-| R10: variant roadmap buys simplicity | Next variant is ranked by fit to current hook contract; non-hook variants are deferred rather than half-supported. | this section |
-
-Fresh review first blocked on weak `qwen-queue` evidence and README citation/comment junk. Fixes: queued real pueue task 74, added `twine check`, fenced citation, removed the stray README note. Final fresh review verdict: PASS.
-
-Follow-up after omega correction:
-
-- The recursive task failed previously because it was left as a passive sentinel instead of being expanded immediately after push.
-- Real bnb proof now runs through required `just bnb-smoke` with `bitsandbytes==0.49.2`; `Linear8bitLt` and `Linear4bit` both show identity error `0.000e+00` and nonzero adapter gradients. The command now fails instead of skipping if CUDA or bnb is unavailable.
-- IA3 was added while waiting for queued Qwen work: `just test` now has 10 tests, `just smoke` includes IA3, and pueue task 79 is queued for LoRA/PiSSA/DeLoRA/IA3 on Qwen 0.6B. Task 78 caught a quoting bug in `qwen-queue`; the recipe now queues a single `bash -lc` command.
-- README was reduced to prospective-user content; implementation notes moved to `docs/developer_guide.md`.
-
-### Adapter roadmap, ranked by simplicity
-
-| Variant | Why it fits or waits | Next check |
-|---|---|---|
-| IA3 | Implemented. Multiplicative output vector, no base-weight mutation. | `just test` -> 12 tests passed; smoke/Qwen task 80 pass. |
-| DoRA | Implemented for fp layers. Reads dense `weight` to compute `||V||_c`; bnb layers fail loudly. | smoke and Qwen task 81 pass with id_err=0, drop=53.5%, reload=0. |
-| SSVD / PiSSA-family | Fits current `weight`-SVD pattern and teaches the SVD adapter path. | Reconstruction/identity invariant plus train proof. |
-| HRA / OFT / ROAD | Interesting, but likely wants orthogonal or weight-transform semantics. Keep until hook-only formulation is clear. | Pseudocode first, then one invariant that distinguishes real rotation from dead code. |
-| S-steer / AntiPaSTO | Research adapters. Should use `group_init` and activation evidence, not be squeezed into plain LoRA tests. | Calibration is consumed, hooks removed, load does not need calibration data. |
-
-## Review history
-
-A cold subagent review first returned `PASS_WITH_BLOCKERS`:
-
-1. bnb modules were not targeted.
-2. Hook cast `y` to `cfg.dtype`, which could round base outputs.
-3. PiSSA overclaimed bnb support.
-4. `load()` did not fail on missing adapter keys.
-5. Data-calibrated init needed model-level access.
-
-Fixes applied:
-
-1. Structural `is_linear_like()` target predicate.
-2. Hook only casts `x`, keeps `y` in base output dtype.
-3. PiSSA fail-fast rejects non-plain `nn.Linear`.
-4. `load()` fails on missing or unexpected `lora_` keys.
-5. `attach(..., calibration_data=None)` plus optional `group_init(model, targets, cfg, calibration_data)`.
-
-Second cold review verdict: `PASS` for the minimal 4bit-enabled scope.
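
The key check behind fix 4 can be sketched in plain Python. This is a hedged illustration only: `check_adapter_keys` is a hypothetical helper name, and it assumes adapter tensors are stored under full-path keys whose last component starts with `lora_`, as the spec describes. The real `load()` may be structured differently.

```python
def check_adapter_keys(expected: set[str], loaded: set[str], prefix: str = "lora_") -> None:
    """Raise if the checkpoint's adapter keys differ from the model's adapter keys."""

    def is_adapter(key: str) -> bool:
        # Full-path keys look like "layers.0.q_proj.lora_A"; test the last component.
        return key.rsplit(".", 1)[-1].startswith(prefix)

    want = {k for k in expected if is_adapter(k)}
    have = {k for k in loaded if is_adapter(k)}
    missing = sorted(want - have)
    unexpected = sorted(have - want)
    if missing or unexpected:
        raise KeyError(f"adapter key mismatch: missing={missing} unexpected={unexpected}")
```

Base-model keys such as `layers.0.q_proj.weight` are ignored on both sides, so only adapter state is compared.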
-
-## TODO / status
-
-### Next implementation goals
-
-- [x] Add DoRA.
-  - Verified: fp32 identity 0.000e+00, finite gradients, smoke drop 63.3%, Qwen-0.6B task 81 drop 53.5% reload 0.
-  - Caveat: bnb DoRA fails fast in `init` (needs dense `weight` for `||V||_c`).
-
-- [ ] Add VeRA.
-  - Verify: shared buffers are allocated once, target slices match shape, identity or near-identity at init.
-
-- [ ] Add SSVD or AntiPaSTO-style SVD variant.
-  - Verify: reconstruction or intended rotation invariant at init.
-
-- [ ] Add real activation-calibrated toy variant using `group_init`.
-  - Verify: `calibration_data` is consumed during `attach`, hooks are removed, checkpoint is serializable, and `load()` does not require calibration data.
-
-- [ ] Add load path that can skip calibration init for future `group_init` variants.
-  - Current caveat: `load()` calls `attach(model, cfg)` with `calibration_data=None`; fine for current variants, but future calibrated variants should separate param creation from calibration.
-
-- [ ] Add a tiny HF-model smoke when convenient.
-  - Verify: target names look like real transformer modules and state dict keys match full paths.
-
-### Design TODOs
-
-- [ ] Decide whether `group_init` should run before or after forward hooks are registered.
-  - Current choice: after params are attached, before adapter forward hooks are registered.
-
-- [ ] Decide whether replacing variants need `runs_base_layer=False` or can always transform `y`.
-  - OFT-like variants can rotate `y`; variants that truly avoid base forward need module replacement or pre-hook rewriting, likely out of v1.
-
-- [ ] Add `weight_mode` for BitFit/SHiRA if those variants become in-scope.
-  - Minimal surface: `weight_mode in {"frozen", "bias_only", "sparse_grad"}`.
-
-## Variant contract
-
-```python
-class Variant:
-    name: str
-
-    @staticmethod
-    def param_specs(d_in, d_out, cfg) -> dict[str, ParamSpec]: ...
-
-    @staticmethod
-    def init(layer, cfg) -> None:
-        # weight-only init; may mutate plain fp weights
-        ...
-
-    @staticmethod
-    def group_init(model, targets, cfg, calibration_data) -> None:
-        # optional model-level init for data-calibrated or cross-layer variants
-        ...
-
-    @staticmethod
-    def forward(layer, x, y) -> Tensor:
-        # return NEW output; additive variants return y + delta
-        ...
-```
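
A toy instance of the contract above, sketched with plain lists instead of tensors so it runs without torch. Everything here is hypothetical illustration: `ToyLoRA`, `make_params`, and the simplified `forward(params, x, y, scale)` signature are assumptions, not the repo's API (the real `forward` takes `layer, x, y` and `param_specs` returns `ParamSpec` objects). It shows the two properties the spec checks: identity at init (`lora_B = 0`) and a live hook once `lora_B` is perturbed.

```python
import random


class ToyLoRA:
    """Hypothetical stand-in for a Variant; vectors and matrices are plain lists."""

    name = "toy_lora"

    @staticmethod
    def param_specs(d_in, d_out, r):
        # name -> (shape, init kind)
        return {"lora_A": ((r, d_in), "random"), "lora_B": ((d_out, r), "zeros")}

    @staticmethod
    def make_params(d_in, d_out, r, seed=0):
        rng = random.Random(seed)
        params = {}
        for name, ((rows, cols), kind) in ToyLoRA.param_specs(d_in, d_out, r).items():
            params[name] = [
                [rng.uniform(-1.0, 1.0) if kind == "random" else 0.0 for _ in range(cols)]
                for _ in range(rows)
            ]
        return params

    @staticmethod
    def forward(params, x, y, scale):
        # h = A x, delta = B h; return a NEW output y + scale * delta.
        h = [sum(a * xi for a, xi in zip(row, x)) for row in params["lora_A"]]
        delta = [sum(b * hi for b, hi in zip(row, h)) for row in params["lora_B"]]
        return [yi + scale * di for yi, di in zip(y, delta)]
```

With `lora_B` all zeros the output equals `y`, which mirrors the identity checks in the smoke tables; setting any entry of `lora_B` to a nonzero value changes the output, which is the "perturb" check.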
-
-## Done means
-
-This repo is good enough for a first real experiment when:
-
-1. A Qwen/Llama model can attach LoRA adapters to intended target layers.
-2. A 4bit or 8bit loaded model can train LoRA/DeLoRA params with nonzero gradients.
-3. Saved adapter tensors use full-path keys and reload without calibration data.
-4. Smoke tests distinguish target-skipping, hook identity drift, and missing-key load failure.
-
-See interesting adapters here: https://github.com/wassname/adapters_as_hypotheses
-How peft handles 4bit: https://github.com/huggingface/peft/blob/6030f9160ed2fc17220f6f41382a66f1257b6a93/src/peft/tuners/lora/layer.py
@@ -72,3 +72,34 @@ metamath-queue variant="lora" steps="5000" model="Qwen/Qwen3-0.6B-Base":
 		-l "why: HF-style MetaMathQA->GSM8K benchmark for {{model}} {{variant}} at {{steps}} steps; resolve: result JSON under outputs/metamath_gsm8k proves grad>0 dθ>0 base_grad_leaks=0 and reports valid/test accuracy" \
 		-w "$PWD" -o 1 -- \
 		uv run --extra benchmark python scripts/metamath_gsm8k_benchmark.py --model {{model}} --variant {{variant}} --steps {{steps}}
+
+metamath-queue-all model="Qwen/Qwen3-0.6B-Base" steps="5000" variants="lora pissa delora dora hra ia3 ia3_ff eva antipasto":
+	#!/usr/bin/env bash
+	set -euo pipefail
+	for variant in {{variants}}; do
+		lr=1e-4
+		extra_args=(--target-name '(q_proj|v_proj)$' --layers all --r 32 --alpha 64)
+		case "$variant" in
+			delora)
+				lr=1e-3
+				;;
+			ia3)
+				lr=1e-3
+				extra_args=(--target-name '(k_proj|v_proj)$' --layers all --r 32 --alpha 64)
+				;;
+			ia3_ff)
+				lr=1e-3
+				extra_args=(--target-name '(down_proj)$' --layers all --r 32 --alpha 64)
+				;;
+			eva)
+				lr=1e-4
+				;;
+			antipasto)
+				lr=1e-4
+				;;
+		esac
+		pueue add \
+			-l "why: benchmark {{model}} ${variant} on MetaMathQA->GSM8K at {{steps}} steps; resolve: outputs/metamath_gsm8k/results/benchmark_results.tsv gets a row with accuracy commit time method argv and result JSON for ${variant}" \
+			-w "$PWD" -o 1 -- \
+			uv run --extra benchmark python scripts/metamath_gsm8k_benchmark.py --model {{model}} --variant "$variant" --steps {{steps}} --lr "$lr" "${extra_args[@]}"
+	done
@@ -5,8 +5,11 @@ import hashlib
 import json
 import math
 import re
+import fcntl
+import subprocess
 import sys
 import time
+from datetime import datetime, timezone
 from dataclasses import asdict, dataclass, field
 from pathlib import Path
 from typing import Any, Literal
@@ -426,11 +429,61 @@ def print_final_report(row: dict[str, Any], result_path: Path) -> None:
    print(tabulate([row], headers="keys", tablefmt="tsv", floatfmt=".4g"))


+def current_git_commit() -> str:
+    try:
+        return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
+    except (subprocess.CalledProcessError, FileNotFoundError):
+        return "unknown"
+
+
+def append_results_row(
+    args: BenchmarkConfig,
+    result_path: Path,
+    result: dict[str, Any],
+    run_commit: str,
+) -> tuple[Path, Path]:
+    results_dir = args.output_dir / "results"
+    results_dir.mkdir(parents=True, exist_ok=True)
+    tsv_path = results_dir / "benchmark_results.tsv"
+    lock_path = results_dir / "benchmark_results.tsv.lock"
+    finished = datetime.now(timezone.utc)  # read the clock once so both stamps agree
+    finished_at = finished.isoformat(timespec="seconds")
+    finished_label = finished.strftime("%Y%m%dT%H%M%SZ")
+    snapshot_path = results_dir / f"{result['run_id']}__{finished_label}.json"
+    snapshot_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
+    row = {
+        "time_utc": finished_at,
+        "commit": run_commit,
+        "method": args.variant,
+        "model": args.model,
+        "mode": args.mode,
+        "valid_accuracy": result["valid_accuracy"],
+        "test_accuracy": result["test_accuracy"],
+        "steps": args.steps,
+        "samples": result["train_samples"],
+        "wall_time_s": result["wall_time_s"],
+        "argv": " ".join(sys.argv),
+        "result_json": str(snapshot_path),
+        "latest_result_json": str(result_path),
+    }
+    header = "\t".join(row)
+    values = "\t".join(str(value) for value in row.values())
+    with lock_path.open("w", encoding="utf-8") as lock_handle:
+        fcntl.flock(lock_handle.fileno(), fcntl.LOCK_EX)
+        if not tsv_path.exists():
+            tsv_path.write_text(header + "\n" + values + "\n", encoding="utf-8")
+        else:
+            with tsv_path.open("a", encoding="utf-8") as handle:
+                handle.write(values + "\n")
+        fcntl.flock(lock_handle.fileno(), fcntl.LOCK_UN)
+    return tsv_path, snapshot_path
+
+
 def run(args: BenchmarkConfig) -> dict[str, Any]:
    if args.device == "cuda" and not torch.cuda.is_available():
        raise RuntimeError("CUDA requested but unavailable; pass --device cpu for plumbing smoke only")
    torch.manual_seed(args.seed)
    dtype = getattr(torch, args.torch_dtype)
+    run_commit = current_git_commit()
    run_id = f"{args.model.replace('/', '--')}__{args.variant}__s{args.steps}__seed{args.seed}"
    out_dir = args.output_dir / run_id
    out_dir.mkdir(parents=True, exist_ok=True)
@@ -501,6 +554,12 @@ def run(args: BenchmarkConfig) -> dict[str, Any]:
    }
    result_path = out_dir / "result.json"
    result_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
+    results_tsv_path, result_snapshot_path = append_results_row(args, result_path, result, run_commit)
+    result["results_tsv_path"] = str(results_tsv_path)
+    result["result_snapshot_path"] = str(result_snapshot_path)
+    result["commit"] = run_commit
+    result_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
+    commit_prefix = run_commit[:12]

    row = {
        "run_id": run_id,
@@ -515,6 +574,7 @@ def run(args: BenchmarkConfig) -> dict[str, Any]:
        "base_grad_leaks": train_metrics["base_grad_leaks"],
        "valid_acc": valid_metrics["accuracy"],
        "test_acc": test_metrics["accuracy"],
+        "commit": run_commit[:12],
        "result": str(result_path),
    }
    if probe_metrics is not None:
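
Review note: the locked append in `append_results_row` could also be written with the `csv` module, which quotes any field containing a tab (for example an `argv` value). The sketch below is an assumption-laden alternative, not the committed code: `append_tsv_row` is a hypothetical name, and it relies on `fcntl`, so it is Unix-only like the original.

```python
import csv
import fcntl
from pathlib import Path


def append_tsv_row(tsv_path: Path, row: dict[str, object]) -> None:
    """Append one row to a TSV under an exclusive lock, writing the header once."""
    lock_path = tsv_path.with_name(tsv_path.name + ".lock")
    with lock_path.open("w", encoding="utf-8") as lock_handle:
        fcntl.flock(lock_handle.fileno(), fcntl.LOCK_EX)
        try:
            # Decide on the header while holding the lock, so two writers
            # cannot both see an empty file.
            need_header = not tsv_path.exists() or tsv_path.stat().st_size == 0
            with tsv_path.open("a", encoding="utf-8", newline="") as handle:
                writer = csv.DictWriter(
                    handle, fieldnames=list(row), delimiter="\t", lineterminator="\n"
                )
                if need_header:
                    writer.writeheader()
                writer.writerow(row)
        finally:
            fcntl.flock(lock_handle.fileno(), fcntl.LOCK_UN)
```

Reading the file back with `csv.DictReader(handle, delimiter="\t")` recovers quoted fields intact.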
@@ -19,7 +19,7 @@ from dataclasses import dataclass
 from typing import Literal

 import torch
-from einops import einsum
+from einops import einsum, rearrange
 from jaxtyping import Float
 from torch import nn, Tensor as T

@@ -59,15 +59,6 @@ def _build_rotation(rot_T: torch.Tensor, bs: int, max_angle: float) -> torch.Ten
    return _cayley(A)


-def _block_diag(blocks: torch.Tensor) -> torch.Tensor:
-    """(n_blocks, bs, bs) -> (n_blocks*bs, n_blocks*bs) block-diagonal."""
-    n, bs, _ = blocks.shape
-    out = blocks.new_zeros(n * bs, n * bs)
-    for i in range(n):
-        out[i * bs : (i + 1) * bs, i * bs : (i + 1) * bs] = blocks[i]
-    return out
-
-
@register
 class AntiPaSTO:
    name = "antipasto"
@@ -123,15 +114,22 @@ class AntiPaSTO:
        S = layer.lora_S.to(x.dtype)                          # (r,)
        Vh = layer.lora_Vh.to(x.dtype)                        # (r, d_in)

-        R_blocks = _build_rotation(layer.lora_rot_T.float(), bs, max_angle)
-        R = _block_diag(R_blocks).to(x.dtype)                 # (r, r)
+        R_blocks = _build_rotation(layer.lora_rot_T.float(), bs, max_angle).to(x.dtype)
+        n_blocks = R_blocks.shape[0]                          # R_blocks: (n, bs, bs)

+        # Apply block-diagonal R per-block via einsum, never materializing (r,r).
        if rotate_basis == "V":
-            Vh_eff = R @ Vh                                   # rotate INPUT basis
+            # Vh_eff = R @ Vh, viewed block-wise on the r-axis.
+            Vh_blocks = rearrange(Vh, "(n a) i -> n a i", n=n_blocks)
+            Vh_rot = einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i")
+            Vh_eff = rearrange(Vh_rot, "n a i -> (n a) i")
            U_eff = U
        elif rotate_basis == "U":
+            # U_eff = U @ R.T, viewed block-wise on the r-axis.
+            U_blocks = rearrange(U, "d (n b) -> d n b", n=n_blocks)
+            U_rot = einsum(U_blocks, R_blocks, "d n b, n c b -> d n c")
+            U_eff = rearrange(U_rot, "d n c -> d (n c)")
            Vh_eff = Vh
-            U_eff = U @ R.T                                   # rotate OUTPUT basis
        else:
            raise ValueError(f"rotate_basis must be 'U' or 'V', got {rotate_basis!r}")

@@ -57,10 +57,11 @@ class DoRA:
    ) -> Float[T, '*B o']:
        cfg = layer._lora_cfg
        scale = cfg.alpha / cfg.r
-        # V = W + scale * B @ A
+        # Paper §4.3: treat ||V+ΔV||_c as a constant (detach from grad graph) for
+        # stability and lower training memory. Matches peft (lora_weight.detach + weight_norm.detach).
        BA = einsum(layer.lora_B, layer.lora_A, "o r, r i -> o i")
-        V = layer.weight + scale * BA                          # (d_out, d_in)
-        v_norm = V.norm(dim=1).clamp_min(1e-12)                # (d_out,)
+        V = layer.weight + scale * BA.detach()                 # (d_out, d_in)
+        v_norm = V.norm(dim=1).clamp_min(1e-12).detach()       # (d_out,)
        # Bias passes through unscaled (matches peft).
        bias = getattr(layer, "bias", None)
        wx = y if bias is None else (y - bias)
@@ -13,7 +13,7 @@ Refs:
    (offline: docs/refs/peft_eva.py; example: docs/refs/peft_eva_finetuning.py)
 """
 import torch
-from einops import einsum
+from einops import einsum, rearrange
 from jaxtyping import Float
 from torch import nn, Tensor as T
 from typing import Iterable
@@ -68,7 +68,7 @@ class EVA:
            def _h(module, args, kwargs):
                # signature: pre-forward, args[0] is the input tensor
                x = args[0].detach()
-                captured[name].append(x.reshape(-1, x.shape[-1]).to(torch.float32).cpu())
+                captured[name].append(rearrange(x, "... d -> (...) d").to(torch.float32).cpu())
            return _h

        handles = [
@@ -17,7 +17,7 @@ Refs:
    (offline: docs/refs/peft_hra_layer.py)
 """
 import torch
-from einops import einsum
+from einops import einsum, repeat
 from jaxtyping import Float
 from torch import nn, Tensor as T
 from dataclasses import dataclass
@@ -56,7 +56,7 @@ class HRA:
            r, d_in = layer.lora_U.shape
            half = torch.empty(r // 2, d_in, dtype=layer.lora_U.dtype, device=layer.lora_U.device)
            nn.init.kaiming_uniform_(half, a=5 ** 0.5)
-            layer.lora_U.copy_(torch.repeat_interleave(half, 2, dim=0))
+            layer.lora_U.copy_(repeat(half, "h d -> (h two) d", two=2))
        return

    @staticmethod
@@ -3,10 +3,11 @@
 Meng et al. 2024  https://arxiv.org/abs/2404.02948

    W = U S Vh        (truncated to top-r)
-    B = U sqrt(S),  A = sqrt(S) Vh,   W_res = W - B A
+    Sr_eff = Sr / (alpha/r)                          # peft-style: pre-divide so A/B
+    B = U sqrt(Sr_eff),  A = sqrt(Sr_eff) Vh         # update dynamics match for any alpha
+    W_res = W - (alpha/r) B A      = W - U Sr Vh     # scaling cancels symmetrically

-Identity at t=0: W_res + B@A == W (bf16 round-trip, not bit-exact).
-Pass alpha=r for paper-faithful scale=1.
+Identity at t=0: W_res + (alpha/r) B@A == W up to fp32 round-off; a bf16 cast can drift.

 Refs:
  - paper: https://github.com/MuLabPKU/PiSSA/blob/main/utils/init_pissa.py
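
Review note: the cancellation in the docstring can be checked numerically without an SVD routine by starting from known factors. The sketch below is illustrative only: `pissa_init_from_factors` is a hypothetical helper that takes `U`, `S`, `Vh` as inputs (here built from 2x2 rotations), whereas the variant computes them with `torch.linalg.svd`.

```python
import math


def rot2(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def pissa_init_from_factors(U, S, Vh, r, alpha):
    """Return (W, W_res, BA, scale) for W = U diag(S) Vh with pre-divided Sr."""
    d_out, d_in = len(U), len(Vh[0])
    scale = alpha / r
    sqrt_s = [math.sqrt(S[k] / scale) for k in range(r)]        # sqrt(Sr_eff)
    B = [[U[o][k] * sqrt_s[k] for k in range(r)] for o in range(d_out)]
    A = [[sqrt_s[k] * Vh[k][i] for i in range(d_in)] for k in range(r)]
    W = [[sum(U[o][k] * S[k] * Vh[k][i] for k in range(len(S))) for i in range(d_in)]
         for o in range(d_out)]
    BA = [[sum(B[o][k] * A[k][i] for k in range(r)) for i in range(d_in)]
          for o in range(d_out)]
    W_res = [[W[o][i] - scale * BA[o][i] for i in range(d_in)] for o in range(d_out)]
    return W, W_res, BA, scale


U, S, Vh = rot2(0.4), [3.0, 0.5], rot2(-0.9)   # orthogonal factors, descending S
results = {alpha: pissa_init_from_factors(U, S, Vh, r=1, alpha=alpha) for alpha in (1, 4, 16)}
```

For every `alpha`, `W_res + scale * BA` reproduces `W` and `W_res` itself is unchanged; only the magnitude of `B @ A` depends on `alpha`.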
@@ -51,16 +52,20 @@ class PiSSA:
        W = layer.weight.data.float()                       # (d_out, d_in)
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        r = cfg.r
+        scale = cfg.alpha / cfg.r
        Ur, Sr, Vhr = U[:, :r], S[:r], Vh[:r, :]
-        sqrtS = Sr.sqrt()
-        # B @ A = Ur diag(Sr) Vhr;  pick B = Ur sqrt(Sr),  A = sqrt(Sr) * Vhr
+        # Pre-divide Sr by scaling so A/B carry "natural" magnitudes for any alpha
+        # (peft pissa_init does this; needed so the alpha/r scaling on the forward
+        # gives matched update dynamics rather than re-scaling A,B by sqrt(scale)).
+        Sr_eff = Sr / scale
+        sqrtS = Sr_eff.sqrt()
+        # B @ A = Ur diag(Sr/scale) Vhr;  W_res = W - scale * B@A = W - Ur diag(Sr) Vhr.
        B = (Ur * sqrtS).to(cfg.dtype)
        A = (sqrtS[:, None] * Vhr).to(cfg.dtype)
        layer.lora_B.data.copy_(B)
        layer.lora_A.data.copy_(A)
        # fp32 subtraction so W_res stays accurate.
        BA = (B.float() @ A.float())
-        scale = cfg.alpha / cfg.r
        layer.weight.data.copy_((W - scale * BA).to(layer.weight.dtype))

    @staticmethod