tidy tests to subset of metamath

wassname
2026-04-27 09:20:07 +08:00
parent 1a93df10b2
commit 727ef6ea73
6 changed files with 145 additions and 1211 deletions
-754
@@ -1,754 +0,0 @@
yes make a task to deep review each one against how it's implemented. so one
task for each. please give me the pseudocode of 1) reference code 2)
reference paper 3) ours and compare.
● 1. LoRA review
Paper (Hu et al. 2021):
forward:
h = W·x + (α/r) · B·A·x # B∈R^{d_out×r}, A∈R^{r×d_in}
init:
A ~ Gaussian (or kaiming), B = 0
peft peft_lora_layer.py:941-969:
forward(x):
result = base_layer(x) # W·x + bias
for adapter in active_adapters:
result += lora_B(lora_A(dropout(x))) * scaling # scaling = α/r or α/√r if rslora
init (reset_lora_parameters, line 263):
kaiming_uniform_(lora_A, a=√5); zeros_(lora_B)
ours src/lora_lite/variants/lora.py:
param_specs:
lora_A: (r, d_in) init=kaiming
lora_B: (d_out, r) init=zeros
forward(x, y): # y = base_layer(x) already computed by host
scale = α/r
h = einsum(x, A, "... i, r i -> ... r")
delta = einsum(h, B, "... r, o r -> ... o")
return y + scale·delta
Diff vs peft:
- ✓ math identical: scale·B·A·x + y
- ✗ no dropout (peft has lora_dropout)
- ✗ no rslora (α/√r)
- ✗ no init_lora_weights="gaussian" option (only kaiming)
- ✓ B=0 zero-init → identity at t=0
Verdict: correct vanilla LoRA. Missing dropout and rslora are deliberate scope
cuts (not bugs). No fix needed.
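The two ✓ items above can be checked numerically. This is a minimal sketch, assuming NumPy in place of PyTorch and a plain Gaussian draw as a stand-in for the kaiming init; `lora_forward` is a hypothetical helper name, not the function in lora.py.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha):
    """y = W x + (alpha / r) * B A x, with x of shape (..., d_in)."""
    r = A.shape[0]
    y = x @ W.T            # frozen base path
    h = x @ A.T            # (..., r) down-projection
    delta = h @ B.T        # (..., d_out) up-projection
    return y + (alpha / r) * delta

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 6, 2, 4.0
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))     # stand-in for the kaiming init
B = np.zeros((d_out, r))           # zero init
x = rng.normal(size=(3, d_in))

# B = 0 means the adapter is an exact no-op at t=0.
identity_at_init = np.allclose(lora_forward(x, W, A, B, alpha), x @ W.T)

# After training, the adapter path equals a forward through the merged weight.
B_trained = rng.normal(size=(d_out, r))
W_merged = W + (alpha / r) * B_trained @ A
matches_merged = np.allclose(lora_forward(x, W, A, B_trained, alpha), x @ W_merged.T)
```

Both flags should come out true; dropout and rslora are left out here for the same scope reasons as in the review.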
● 2. DoRA review
Paper (Liu et al. 2024, §4.1, §4.3):
W' = m · V/||V||_c, V = W + ΔV = W + (α/r)·B·A # ||·||_c = per-row L2
init: m = ||W||_c, B = 0 → W' = W
§4.3 trick: treat the norm ||V||_c (written ||V+ΔV||_c in the paper, where its
V is the frozen W) as a CONSTANT (detach from grad graph) for stability + speed.
peft peft_lora_dora.py:132-166:
forward(x):
lora_W = B·A
weight_norm = ||W + α/r·lora_W.detach()||_c # << DETACHED
weight_norm = weight_norm.detach() # << belt+braces
mag_norm_scale = (m / weight_norm)
base_result = W·x (or passed in, minus bias)
return (mag_norm_scale - 1)·base_result + mag_norm_scale · α/r · B·A·x
# bias re-added outside
ours src/lora_lite/variants/dora.py:53-71:
V = W + scale·B·A # << NOT detached
v_norm = V.norm(dim=1).clamp_min(1e-12) # << NOT detached
wx = y - bias if bias else y
delta_lora = scale · B·A·x
combined = wx + delta_lora
out = (m / v_norm) · combined
return out + bias if bias else out
Diff vs peft:
- ✗ Bug: missing .detach() on the lora term inside weight_norm. Paper §4.3
explicitly says to detach; peft detaches twice (line 144 lora_weight.detach()
and line 152 weight_norm.detach()). Without it, grads flow through the norm
denominator → unstable training and ~2× memory, the exact thing the paper
warns against.
- ✓ math otherwise equivalent: (m/||V||)·(Wx + α/r·BAx) = (m/||V||-1)·Wx +
(m/||V||)·α/r·BAx + Wx. Both forms produce same output.
- ✓ bias handling matches peft (subtract before scaling, add after).
- ✓ m init ||W||_c matches.
- = nn.Linear-only restriction is a deliberate scope cut (peft handles bnb via
dequantize_module_weight).
Recommended fix: add .detach() to V (or equivalently v_norm.detach()) before
dividing.
V = layer.weight + scale * BA.detach()
v_norm = V.norm(dim=1).clamp_min(1e-12).detach()
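The claim that peft's rearranged form and ours give the same output is easy to check on random data. This is a rough NumPy sketch, assuming no bias and made-up shapes; it covers forward values only, so it says nothing about the detach question, which needs autograd.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 6, 2, 4.0
scale = alpha / r
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
m = rng.uniform(0.5, 1.5, size=d_out)      # learned magnitude, one per output row
x = rng.normal(size=(3, d_in))

V = W + scale * B @ A
mag = m / np.linalg.norm(V, axis=1)        # m / ||V||, per output row
wx = x @ W.T                               # base layer output (bias assumed absent)
bax = x @ (B @ A).T

ours = mag * (wx + scale * bax)
# peft returns (mag - 1)*Wx + mag*scale*BAx and adds it to the base result Wx.
peft_style = wx + (mag - 1.0) * wx + mag * scale * bax
forms_agree = np.allclose(ours, peft_style)

# Init check: m = ||W|| per row and B = 0 reproduce W x exactly.
m0 = np.linalg.norm(W, axis=1)
B0 = np.zeros_like(B)
V0 = W + scale * B0 @ A
out0 = (m0 / np.linalg.norm(V0, axis=1)) * (wx + scale * x @ (B0 @ A).T)
identity_at_init = np.allclose(out0, wx)
```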
● 3. PiSSA review
Paper (Meng et al. 2024):
W = U S V^T (truncated to top-r)
A = √S · V^T, B = U · √S → B·A = U S V^T
W_res = W - B·A
forward: W_res·x + (α/r)·B·A·x → at t=0 equals W·x
peft peft_lora_layer.py:360-393:
V, S, Uh = svd(W) # V is left, Uh is right (peft naming)
Sr = S[:r] / scaling # << pre-divide by α/r
A = diag(√Sr) @ Uh[:r] # absorbs 1/√scaling
B = V[:, :r] @ diag(√Sr) # absorbs 1/√scaling
W_res = W - scaling · B·A # net: W - V[:,:r]·S[:r]·Uh[:r] (scaling cancels)
forward: W_res·x + scaling·B·A·x
ours src/lora_lite/variants/pissa.py:45-76:
U, S, Vh = svd(W)
B = U[:,:r] · √S[:r] # NO scaling pre-division
A = √S[:r] · Vh[:r] # NO scaling pre-division
W_res = W - scale · B·A # = W - scale · U·S·Vh (NOT exactly W - U·S·Vh)
forward: W_res·x + scale · B·A·x # net: W·x at t=0 ✓
Diff vs peft:
- ✓ Bit-identity at t=0 holds for any α (algebra cancels symmetrically): both
peft and ours satisfy W_res + scale·BA = W round-tripped.
- ⚠ A/B starting magnitudes differ when α≠r. peft pre-divides S/scale so
||B||,||A|| ∝ √(S/scale); ours has ||B||,||A|| ∝ √S. Update dynamics (α/r ·
ΔB·A) therefore differ by a factor of √scale per matrix. For α=r (scale=1)
they're identical — and the docstring already says "Pass alpha=r for
paper-faithful scale=1".
- ✓ Linear-only restriction is a deliberate scope cut.
- ✓ cfg.dtype cast on A,B before storage matches peft's to(dtype) after save.
- ✓ fp32 SVD then (W - scale·BA) in fp32 before bf16 cast — comment "fp32
subtraction so W_res stays accurate" is honoured. peft does same via
weight.to(torch.float32).
Verdict: correct for α = r. For α ≠ r, behaves differently from peft (still
mathematically valid PiSSA, just different parameter scaling of A/B). Either
fix (apply Sr /= scale like peft) or tighten the docstring to "α=r required"
and assert it. Mild fix; not a correctness bug.
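To make the α≠r difference concrete, here is a small NumPy sketch of both conventions, assuming fp64 throughout (so none of the bf16 rounding discussed above) and a hypothetical helper `pissa_factors`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 6, 2, 8.0
scale = alpha / r                           # 4.0, deliberately not 1
W = rng.normal(size=(d_out, d_in))
U, S, Vh = np.linalg.svd(W, full_matrices=False)
Ur, Sr, Vhr = U[:, :r], S[:r], Vh[:r]

def pissa_factors(singular_values):
    """Split diag(s) evenly between B and A: B = U sqrt(s), A = sqrt(s) Vh."""
    root = np.sqrt(singular_values)
    return root[:, None] * Vhr, Ur * root   # (A, B)

A_plain, B_plain = pissa_factors(Sr)            # no pre-division (old behaviour)
A_peft, B_peft = pissa_factors(Sr / scale)      # peft-style pre-division

W_res_plain = W - scale * B_plain @ A_plain
W_res_peft = W - scale * B_peft @ A_peft

# Both conventions reproduce W at t=0 ...
plain_roundtrip = np.allclose(W_res_plain + scale * B_plain @ A_plain, W)
peft_roundtrip = np.allclose(W_res_peft + scale * B_peft @ A_peft, W)
# ... but only the peft-style residual is W minus its top-r component,
residual_is_tail = np.allclose(W_res_peft, W - (Ur * Sr) @ Vhr)
# and the factor magnitudes differ by sqrt(scale) per matrix.
norm_ratio = np.linalg.norm(A_plain) / np.linalg.norm(A_peft)
```

With `scale = 4` the ratio is `sqrt(4) = 2`, which is the √scale factor named in the diff above.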
● 4. DeLoRA review
Paper (Bini et al. 2025, ICLR'25):
W' = W + (λ/r) · B · Ξ · A, Ξ_ii = 1/(||b_i|| · ||a_i||)
init: B=0 → ΔW=0
default λ=15
Crucial: w_norm is per-input-channel ||W||_2 (col-norm), captured ONCE at init
(frozen reference) — not Frobenius scalar.
peft peft_delora_layer.py:226-264:
forward(x):
h = (x * w_norm) @ A.T # broadcast w_norm over input dim
An = clamp(A.norm(dim=1), 1e-4)
Bn = clamp(B.norm(dim=0), 1e-4)
scale_r = (λ/r) / (An * Bn) # (r,)
h = h * scale_r
h = h @ B.T
return base_out + h
init: A ← kaiming(√5), B ← 0, λ ← 15.0, w_norm ← ||W||_2 col (frozen buffer)
ours src/lora_lite/variants/delora.py:55-81:
forward(x, y):
x_scaled = x * lora_wnorm
h = einsum(x_scaled, A, "... i, r i -> ... r") # = (x * w_norm) @ A.T
An = clamp(A.norm(dim=1), 1e-4)
Bn = clamp(B.norm(dim=0), 1e-4)
scale_r = (λ/r) / (An * Bn)
h = h * scale_r
delta = einsum(h, B, "... r, o r -> ... o") # = h @ B.T
return y + delta
init: A ← kaiming(√5), B ← 0, λ ← lambda0 (default 0.0, NOT 15.0), lora_wnorm ← ||W||_2 col (frozen buffer)
Diff vs peft:
- ✓ math line-for-line identical: same dim-0 col-norm, same per-rank A/B
normalization, same gradient path (norms not detached, A,B receive gradient
through them — matches peft exactly).
- ✓ frozen lora_wnorm buffer (peft uses BufferDict, ours uses as_buffer=True).
- ⚠ default lambda0=0.0 differs from peft's 15.0. Docstring is honest about
this ("0.0 = bit-identity at t=0, but A,B get zero grad until lambda moves").
At λ=0, gradients on A,B vanish since delta ∝ λ, so training stalls until λ
moves off zero. This is documented but is a sharp foot-gun. peft's 15.0
default avoids it.
- ✓ even-r restriction not relevant here (HRA only).
- ✓ nn.Linear-only restriction documented.
Verdict: correct. The lambda0 default is a deliberate "bit-identity but
stalls" choice — keep, but consider noting in docstring that for actual
training you'd want λ0≈1-15 (paper uses 15).
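The λ=0 foot-gun follows directly from the update being proportional to λ. A minimal NumPy sketch of the update path, assuming the same clamp value as above and a hypothetical helper name `delora_delta`:

```python
import numpy as np

def delora_delta(x, A, B, w_norm, lam):
    """Sketch of the DeLoRA update path: (lam / r) * B Xi A applied to x * w_norm."""
    r = A.shape[0]
    h = (x * w_norm) @ A.T                                # (..., r)
    An = np.clip(np.linalg.norm(A, axis=1), 1e-4, None)   # per-rank row norms of A
    Bn = np.clip(np.linalg.norm(B, axis=0), 1e-4, None)   # per-rank column norms of B
    h = h * ((lam / r) / (An * Bn))
    return h @ B.T                                        # (..., d_out)

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2
W = rng.normal(size=(d_out, d_in))
w_norm = np.linalg.norm(W, axis=0)       # per-input-channel norm, frozen at init
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
x = rng.normal(size=(3, d_in))

d15 = delora_delta(x, A, B, w_norm, lam=15.0)
d30 = delora_delta(x, A, B, w_norm, lam=30.0)
linear_in_lambda = np.allclose(d30, 2.0 * d15)

# Rescaling A or B leaves the update unchanged: only their directions matter.
scale_invariant = np.allclose(delora_delta(x, 3.0 * A, 0.5 * B, w_norm, lam=15.0), d15)

# At lam = 0 the update is zero for every A and B, so their gradients vanish too.
A_other = rng.normal(size=(r, d_in))
dead_at_zero = (not delora_delta(x, A, B, w_norm, 0.0).any()
                and not delora_delta(x, A_other, B, w_norm, 0.0).any())
```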
● 5. EVA review
Paper (Paischer et al. 2024):
For each target layer l:
Collect inputs x ∈ R^{N × d_in} on calibration batches
Vh = right_singular_vectors(x) # top-r principal input dirs
A_l ← Vh[:r] # data-driven init
B_l ← 0
forward = standard LoRA
Optional: rank redistribution by explained variance (ρ-budget across layers)
Optional: hash-dedup layers with identical inputs (Q/K/V share x)
peft peft_eva.py:
For each calibration batch:
IncrementalPCA.partial_fit(layer_input) # streaming, memory-efficient
Hash inputs; dedup layers seeing same x.
Track convergence via cossim(components_t, components_{t-1})
After:
Redistribute ranks by explained_variance_ratio (ρ scale)
A_l ← svd.components_[:rank_l] # top-rank rows
B_l ← 0
forward = vanilla LoRA path (`scale·B@A·x`)
ours src/lora_lite/variants/eva.py:
group_init(model, targets, calibration_data):
hooks capture pre-forward inputs per layer
for batch in calibration_data: model(batch)
for each target:
X = cat(captured) # (N, d_in), float32 on cpu
_,_,Vh = torch.linalg.svd(X, full_matrices=False)
A ← Vh[:r] # uniform r per layer
forward(x, y):
return y + (α/r) · (B @ A @ x) # B=0 at t=0 → identity
Diff vs peft:
- ✓ Top-r right singular vectors of pooled inputs → A. Math identical.
- ✓ B=0 init → identity at t=0.
- ✓ A,B both trainable post-init (peft same).
- ✗ Full SVD on cpu-pooled tensor vs. peft's IncrementalPCA — limits us to
small calibration sets (we cat all activations in RAM). The docstring
acknowledges this.
- ✗ No rank redistribution (rho param). All targets get same r. Acknowledged.
- ✗ No QKV input dedup. Acknowledged. Cost: 3× SVD work for layers that share
input. Correct, just slower.
- ✗ No convergence check. Single pass over calibration_data. Acknowledged.
- ⚠ Captured tensors are moved to cpu+fp32. Fine for small runs, but a 1k-batch
calibration run on a 70B model needs a lot of RAM. Just a scaling note.
Verdict: correct minimal EVA, all simplifications stated up-front in
docstring. No bugs. If you ever want to scale to large calibration corpora,
swap the torch.cat → svd for IncrementalPCA.
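The pooled-SVD init can be illustrated on synthetic activations. This is a sketch, assuming NumPy and made-up data with two dominant input directions plus small noise; it mirrors the `cat → svd → Vh[:r]` steps rather than the actual hook code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_in, r = 64, 8, 2

# Synthetic "captured inputs": two strong directions plus small isotropic noise.
basis = rng.normal(size=(r, d_in))
X = 5.0 * (rng.normal(size=(N, r)) @ basis) + 0.1 * rng.normal(size=(N, d_in))

# Pooled SVD over all calibration rows; rows of Vh are the principal input directions.
_, S, Vh = np.linalg.svd(X, full_matrices=False)
A = Vh[:r]                                  # data-driven init for lora_A
B = np.zeros((4, r))                        # zero init keeps the adapter a no-op

rows_orthonormal = np.allclose(A @ A.T, np.eye(r))
explained = (S[:r] ** 2).sum() / (S ** 2).sum()   # share of energy in the top-r directions
```

`explained` is the kind of per-layer quantity that peft's rank redistribution works from; here it is only computed, not used.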
● 6. HRA review
Paper (Yuan et al. 2024):
W' = W·R, R = ∏_{i=0..r-1} H_i, H_i = I - 2·u_i·u_i^T / ||u_i||^2
init (paper): independent random columns
peft variant: paired columns (u_{2k}=u_{2k+1}) → R=I exactly at t=0, even r
GS variant (apply_GS=True): explicit Gram-Schmidt for strict orthogonality
Loss-side: orthogonality regularizer (Eq. 6); not part of layer.
peft peft_hra_layer.py:208-265:
hra_u: (d_in, r) # COLUMNS are vectors
get_delta_weight():
opt_u = opt_u / opt_u.norm(dim=0) # pre-normalize columns
R = I_{d_in}
for i in 0..r-1: # forward order
u = opt_u[:, i]
R = R - 2·R·u·u^T # R ← R · H_i
forward(x):
R = product over active adapters
return F.linear(x, W @ R, bias) # = x · R^T · W^T (effectively R applied to x)
init: half_u = kaiming((d_in, r/2)); u = repeat_interleave(half_u, 2, dim=1) # cols 0=1, 2=3, ...
ours src/lora_lite/variants/hra.py:53-75:
lora_U: (r, d_in) # ROWS are vectors (transpose convention)
forward_input(x):
Rx = x
for i = r-1 downto 0: # reverse order, applied directly
u = U[i]
sq = (u·u).sum().clamp_min(1e-12)
coeff = (Rx · u) · (2/sq)
Rx = Rx - coeff · u # Rx ← Rx · H_i
return Rx # then base layer applies W to this
init: half = kaiming((r/2, d_in)); U = repeat_interleave(half, 2, dim=0) # rows 0=1, 2=3, ...
Diff vs peft:
- ✓ Math equivalent. peft computes R = H_0·H_1·...·H_{r-1} (col-vector
convention), then applies via x·R^T = x·H_{r-1}·...·H_0 (since H is
symmetric). Ours iterates i = r-1 → 0 doing Rx ← Rx·H_i, computing
x·H_{r-1}·...·H_0. Same.
- ✓ symmetric init: row pairs vs col pairs — consistent with our row-vector
convention.
- ✓ identity at t=0: paired u_a=u_b → H_a·H_a=I (involution). R=I.
- ⚠ No apply_GS=True path. peft offers Gram-Schmidt orthogonalization.
Docstring acknowledges via "Note: paper's orthogonality regularizer (Eq. 6) is
loss-side". Ours uses pure Householder products (which already give exactly
orthogonal R at any step), so GS is not needed for R to be orthogonal; it only
matters if you want the reflection vectors u_i to be mutually orthogonal.
Fine to omit.
- ✓ More efficient than peft. Avoids materializing (d_in × d_in) R matrix;
applies reflections directly to x in O(r·d_in) not O(r·d_in²).
- ⚠ Slight gradient-flow difference. peft pre-normalizes opt_u /
||opt_u||_col, so gradients flow through normalization. Ours keeps ||u||² in
the denominator each step, gradients flow through that. Both define the same
orthogonal H_i; gradients should give equivalent training to within numerical
noise, but not bit-identical.
- ✓ even-r requirement raised loudly (peft warns + falls back; we raise —
preferable for fail-fast).
Verdict: correct + faster than peft. Ordering bug from 67a6daf is properly
fixed (verified: i = r-1 → 0 iteration matches peft's R = H_0..H_{r-1} then
x·R^T).
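The ordering argument can be checked against an explicitly built R. A minimal NumPy sketch, assuming row-vector storage as in ours; `householder` and `apply_reflections` are hypothetical names for illustration.

```python
import numpy as np

def householder(u):
    """Dense reflection H = I - 2 u u^T / ||u||^2 (only used as a reference)."""
    return np.eye(u.size) - 2.0 * np.outer(u, u) / (u @ u)

def apply_reflections(x, U):
    """Compute x @ H_{r-1} @ ... @ H_0 without forming any (d_in, d_in) matrix."""
    out = x
    for i in range(U.shape[0] - 1, -1, -1):
        u = U[i]
        coeff = (out @ u) * (2.0 / max(u @ u, 1e-12))   # (batch,)
        out = out - coeff[..., None] * u
    return out

rng = np.random.default_rng(0)
d_in, r = 8, 4
U = rng.normal(size=(r, d_in))
x = rng.normal(size=(3, d_in))

# peft-style reference: R = H_0 H_1 ... H_{r-1}, then x @ R^T.
R = np.eye(d_in)
for i in range(r):
    R = R @ householder(U[i])

matches_explicit = np.allclose(apply_reflections(x, U), x @ R.T)
R_orthogonal = np.allclose(R @ R.T, np.eye(d_in))

# Paired rows (0=1, 2=3, ...) cancel pairwise, giving the identity at t=0.
U_paired = np.repeat(rng.normal(size=(r // 2, d_in)), 2, axis=0)
identity_at_init = np.allclose(apply_reflections(x, U_paired), x)
```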
● 7. IA3 review
Paper (Liu et al. 2022):
y = l ⊙ (W·x) for attention K, V (output-side gate)
y = W_2·(l_ff ⊙ γ(W_1·x)) for FF (mid-FFN gate, T5)
init: l = 1
peft peft_ia3_layer.py:161-188 (single class with is_feedforward flag):
if is_feedforward:
g shape (1, d_in)
interm = x * g
return base_layer(interm) # input-side
else:
g shape (d_out, 1)
result = base_layer(x)
return result * g # output-side
init: g ← ones (when init_ia3_weights=True, default)
default targets: ["k_proj", "v_proj", "down_proj"]; feedforward_modules=["down_proj"]
ours src/lora_lite/variants/ia3.py:38-76 (two classes IA3, IA3FF):
IA3 (output-side, e.g. k_proj/v_proj):
g shape (d_out,); init ones
forward(x, y) = y * g
IA3FF (input-side, e.g. down_proj/fc2):
g shape (d_in,); init ones
forward_input(x) = x * g # base_layer applied after by host
Diff vs peft:
- ✓ math identical to peft's two branches.
- ✓ init=ones matches.
- ✓ Splitting into two classes (rather than is_feedforward flag) is cleaner —
fail-fast: which class you register fixes the side, no boolean to misset.
- ⚠ peft's output-side path also scales base_layer.bias during merge. We
don't merge (forward-only adapter), so no issue at training time — but if you
ever add merge, watch this.
- ⚠ docstring example uses regex r"\\.k_proj$" which under YAML/Python would
match a literal .k_proj string-end; check that target_names is treated as
regex by your attach() (matches the behavior in variant.py).
- ⚠ Position-on-Llama: peft applies "ff" gate at down_proj input, which is
gate*up (post-SiLU). This deviates from paper's T5 position (post-activation
of single W_1) but matches peft's default for Llama-style. Acknowledged via
docstring example.
Verdict: correct, cleaner than peft. No bugs.
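The two gate positions are equivalent to rescaling rows or columns of W, which a short NumPy sketch can confirm. It assumes a bias-free linear layer, and the function names are hypothetical.

```python
import numpy as np

def ia3_output_gate(x, W, g):
    """Output-side gate (k_proj / v_proj style): scale each output feature."""
    return (x @ W.T) * g

def ia3_input_gate(x, W, g):
    """Input-side gate (down_proj style): scale each input feature first."""
    return (x * g) @ W.T

rng = np.random.default_rng(0)
d_in, d_out = 8, 6
W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=(3, d_in))
g_out = rng.uniform(0.5, 1.5, size=d_out)
g_in = rng.uniform(0.5, 1.5, size=d_in)

# Gates initialised to ones leave the layer untouched.
identity_at_init = (np.allclose(ia3_output_gate(x, W, np.ones(d_out)), x @ W.T)
                    and np.allclose(ia3_input_gate(x, W, np.ones(d_in)), x @ W.T))

# Output gate == scaling rows of W; input gate == scaling columns of W.
output_gate_scales_rows = np.allclose(ia3_output_gate(x, W, g_out), x @ (g_out[:, None] * W).T)
input_gate_scales_cols = np.allclose(ia3_input_gate(x, W, g_in), x @ (W * g_in).T)
```

With a bias present, the output-side gate would also scale the bias, which is the merge caveat flagged above.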
ok! so what do we need to fix?
● Fixes needed (from reviews 1-7)
Real bug (1):
- dora.py:62-63 — missing .detach() on the LoRA path inside the norm. Paper
§4.3 explicitly says treat ||V+ΔV||_c as a constant; peft detaches twice.
Without it, grads flow through the norm denominator → unstable training + ~2×
memory.
V = layer.weight + scale * BA.detach() # detach BA here
v_norm = V.norm(dim=1).clamp_min(1e-12).detach() # belt & braces
Behavioral / docstring (2):
- pissa.py — at α≠r, A,B starting magnitude differs from peft by √scale per
matrix (peft pre-divides S/scale). Either match peft (Sr /= scale before
computing √Sr factors) or assert cfg.alpha == cfg.r and tighten the docstring.
- delora.py:36 — lambda0=0.0 default is bit-identity but a foot-gun: at λ=0
the gradient on A,B is zero, so training is dead until λ moves. Either change
default to peft's 15.0, or expand the existing comment to flag this for new
users.
Clean (4): lora, eva, hra, ia3 — no fixes.
Pending: antipasto.py (review #8 not yet done; want me to continue?).
✻ Perused for 5m 30s
sure do antipasto now...
● 8. AntiPaSTO review
Reference (antipasto3, Flax):
kernel = U·diag(S)·Vh + W_res # kernel: (in, out)
init U,S,Vh,W_res = svd(kernel) (frozen, fp32 / bf16 W_res)
trainable:
delta_s truncated_normal·4e-4 + 4e-4 # small POSITIVE bias (sym-break)
rot_T truncated_normal·1e-4 # small random
forward(x, alpha):
A = scatter(rot_T into upper-triu of (n_blocks, bs, bs))
A = 0.5·(A - A^T); A = a_lim·tanh(A/a_lim)
X = α·A/2 # << α scales rotation
R = (I-X)^-1 (I+X) # block Cayley
if rotate_U: U_rot = U @ R # einsum, never builds full (r,r)
if rotate_V: Vh_rot = R @ Vh
S_scaled = S + α·delta_s
return (x @ U_rot)·S_scaled @ Vh_rot + x @ W_res
Identity property: bidirectional symmetry — R(α=+1) and R(α=-1) are exact
inverses.
At α=0: U,S,Vh untouched → kernel reproduced.
ours src/lora_lite/variants/antipasto.py:
weight (pytorch): (d_out, d_in); SVD: U:(d_out,r), S:(r,), Vh:(r,d_in)
init: U,S,Vh frozen buffers; W_res mutates layer.weight (PiSSA-style)
trainable:
delta_s = 0
rot_T = 0 # << EXACT identity at t=0
forward(x, y):
A = scatter(rot_T) # (n_blocks, bs, bs)
A = 0.5·(A - A^T); A = a_lim·tanh(A/a_lim)
X = A/2 # << no α
R_blocks = Cayley(X)
R = block_diag(R_blocks) # << builds full (r,r) via for-loop
if rotate_basis=="V": Vh_eff = R @ Vh
elif rotate_basis=="U": U_eff = U @ R^T
S_eff = S + delta_s
h = (x @ Vh_eff^T) · S_eff
delta = h @ U_eff^T
return y + delta # y = x @ W_res^T already
Identity at t=0: rot_T=0 → R=I, delta_s=0 → delta = x @ Vh^T @ diag(S) @ U^T =
x @ W_svd^T → exact bit-identity.
Diff vs antipasto3:
- ⚠ No alpha steering coefficient. antipasto3's headline feature —
bidirectional symmetry (R(+α)·R(-α)=I) — is dropped. If you intend antipasto
as a fine-tuning-only adapter, this is a deliberate scope cut; if you ever
want bidirectional steering at inference, you'd need to thread an α through
the forward.
- ⚠ Rotation chirality flipped. antipasto3 rotates the input basis by R (U @
R); ours rotates by R^T (Vh_eff = R @ Vh ⇒ (x @ Vh^T) @ R^T). Same
orthogonal-group element up to inversion, so gradient-equivalent under sign
flip of rot_T. Not a correctness bug, but checkpoints aren't transferable
between the two impls without a sign flip.
- ✓ Stricter t=0 identity. Ours zero-inits both delta_s and rot_T → exact
bit-identity. antipasto3 nudges delta_s ≈ +4e-4 and rot_T 1e-4 for symmetry
breaking.
Foot-gun: at exact zero rotation, ∂loss/∂rot_T may be sign-symmetric (rotating
in either direction gives same loss to first order), so optimizer can stall
on rot_T while delta_s carries the load. antipasto3's small bias avoids this.
Worth a docstring note.
- ✗ Perf: full (r,r) block-diagonal materialized via Python for-loop in
_block_diag. antipasto3 uses einsum('dnb,nbc->dnc', U_reshaped, R_blocks) to
apply each block directly without ever building the (r,r) matrix. For r=64,
bs=4 this is 16 rotation blocks; the dense matrix costs O(r²) memory and the
for-loop adds Python overhead. Easy fix:
# rotate V (input basis): replace block_diag + matmul with einsum
Vh_blocks = Vh.reshape(n_blocks, bs, d_in)
Vh_eff = einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i").reshape(r, d_in)
- ⚠ rotate_basis enum ("V" XOR "U") vs. antipasto3's two booleans (can do
both). Minor expressivity loss.
- ✓ Cayley, skew construction, tanh angle clamping all match antipasto3
exactly (incl. the 0.5 factor on (A - A^T)).
- ✓ W_res via PiSSA-style in-place mutation of layer.weight is more
memory-efficient than antipasto3's separate buffer; semantically identical.
- ⚠ arxiv URL: https://arxiv.org/abs/2601.07473 — earlier I confirmed it
returns HTTP 200, so the paper exists. Fine.
Verdict: correct as a fine-tuning adapter with stricter t=0 identity. Two
divergences from antipasto3 worth deciding on:
1. Chirality: trivial sign-flip in skew if you want checkpoint compat with
antipasto3 — otherwise leave.
2. No α: if antipasto needs to support inference-time steering (the paper's
selling point), thread an α scalar through the forward and S_eff = S +
α·delta_s; X = α·A/2.
And one perf fix: kill _block_diag, use einsum.
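Both open points (the α-scaled Cayley rotation and the per-block einsum) can be sketched in a few lines. This is an illustration in NumPy, not the antipasto.py code: `cayley_blocks` and `block_diag` are hypothetical helpers, and the raw parameters are taken as full (n, bs, bs) arrays rather than scattered from an upper triangle.

```python
import numpy as np

def cayley_blocks(A_raw, alpha=1.0, a_lim=1.0):
    """(n, bs, bs) raw params -> (n, bs, bs) orthogonal blocks via the Cayley map."""
    A = 0.5 * (A_raw - np.swapaxes(A_raw, -1, -2))   # skew-symmetric part
    A = a_lim * np.tanh(A / a_lim)                   # tanh is odd, so A stays skew
    X = alpha * A / 2.0
    I = np.eye(A.shape[-1])
    return np.linalg.solve(I - X, I + X)             # (I - X)^-1 (I + X), batched

def block_diag(blocks):
    """Dense (n*bs, n*bs) reference, only used to check the einsum path."""
    n, bs, _ = blocks.shape
    out = np.zeros((n * bs, n * bs))
    for i in range(n):
        out[i * bs:(i + 1) * bs, i * bs:(i + 1) * bs] = blocks[i]
    return out

rng = np.random.default_rng(0)
n, bs, d_in = 4, 2, 6
r = n * bs
raw = 0.3 * rng.normal(size=(n, bs, bs))
Vh = rng.normal(size=(r, d_in))

R_blocks = cayley_blocks(raw)
dense = block_diag(R_blocks) @ Vh
blockwise = np.einsum("nab,nbi->nai", R_blocks, Vh.reshape(n, bs, d_in)).reshape(r, d_in)
einsum_matches_dense = np.allclose(dense, blockwise)

blocks_orthogonal = np.allclose(np.einsum("nab,ncb->nac", R_blocks, R_blocks), np.eye(bs))
# Threading alpha through X gives the bidirectional symmetry: R(+1) R(-1) = I.
inverse_pair = np.allclose(
    np.einsum("nab,nbc->nac", R_blocks, cayley_blocks(raw, alpha=-1.0)), np.eye(bs))
identity_at_zero = np.allclose(cayley_blocks(np.zeros((n, bs, bs))), np.eye(bs))
```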
ok now make a task for each fix and fix pls
● Update(src/lora_lite/variants/dora.py)
 Added 4 lines, removed 3 lines
57 ) -> Float[T, '*B o']:
58 cfg = layer._lora_cfg
59 scale = cfg.alpha / cfg.r
60 - # V = W + scale * B @ A
60 + # Paper §4.3: treat ||V+ΔV||_c as a constant (detach from grad graph) for
61 + # stability and ~2x lower memory. Match peft (lora_weight.detach + weight_norm.detach).
62 BA = einsum(layer.lora_B, layer.lora_A, "o r, r i -> o i")
62 - V = layer.weight + scale * BA  # (d_out, d_in)
63 - v_norm = V.norm(dim=1).clamp_min(1e-12)  # (d_out,)
63 + V = layer.weight + scale * BA.detach()  # (d_out, d_in)
64 + v_norm = V.norm(dim=1).clamp_min(1e-12).detach()  # (d_out,)
65 # Bias passes through unscaled (matches peft).
66 bias = getattr(layer, "bias", None)
67 wx = y if bias is None else (y - bias)
● Update(src/lora_lite/variants/pissa.py)
 Added 4 lines, removed 3 lines
3 Meng et al. 2024 https://arxiv.org/abs/2404.02948
4
5 W = U S Vh (truncated to top-r)
6 - B = U sqrt(S), A = sqrt(S) Vh, W_res = W - B A
6 + Sr_eff = Sr / (alpha/r) # peft-style: pre-divide so A/B
7 + B = U sqrt(Sr_eff), A = sqrt(Sr_eff) Vh # update dynamics match for any alpha
8 + W_res = W - (alpha/r) B A = W - U Sr Vh # scaling cancels symmetrically
9
8 -Identity at t=0: W_res + B@A == W (bf16 round-trip, not bit-exact).
9 -Pass alpha=r for paper-faithful scale=1.
10 +Identity at t=0: W_res + (alpha/r) B@A == W (fp32 round-trip, bf16 cast can drift).
11
12 Refs:
13 - paper: https://github.com/MuLabPKU/PiSSA/blob/main/utils/init_pissa.py
● Update(src/lora_lite/variants/pissa.py)
 Added 7 lines, removed 3 lines
52 W = layer.weight.data.float()  # (d_out, d_in)
53 U, S, Vh = torch.linalg.svd(W, full_matrices=False)
54 r = cfg.r
55 + scale = cfg.alpha / cfg.r
56 Ur, Sr, Vhr = U[:, :r], S[:r], Vh[:r, :]
56 - sqrtS = Sr.sqrt()
57 - # B @ A = Ur diag(Sr) Vhr; pick B = Ur sqrt(Sr), A = sqrt(Sr) * Vhr
57 + # Pre-divide Sr by scaling so A/B carry "natural" magnitudes for any alpha
58 + # (peft pissa_init does this; needed so the alpha/r scaling on the forward
59 + # gives matched update dynamics rather than re-scaling A,B by sqrt(scale)).
60 + Sr_eff = Sr / scale
61 + sqrtS = Sr_eff.sqrt()
62 + # B @ A = Ur diag(Sr/scale) Vhr; W_res = W - scale * B@A = W - Ur diag(Sr) Vhr.
63 B = (Ur * sqrtS).to(cfg.dtype)
64 A = (sqrtS[:, None] * Vhr).to(cfg.dtype)
65 layer.lora_B.data.copy_(B)
66 layer.lora_A.data.copy_(A)
67 # fp32 subtraction so W_res stays accurate.
68 BA = (B.float() @ A.float())
64 - scale = cfg.alpha / cfg.r
69 layer.weight.data.copy_((W - scale * BA).to(layer.weight.dtype))
70
71 @staticmethod
● Update(src/lora_lite/variants/antipasto.py)
 Removed 9 lines
59 return _cayley(A)
60
61
62 -def _block_diag(blocks: torch.Tensor) -> torch.Tensor:
63 - """(n_blocks, bs, bs) -> (n_blocks*bs, n_blocks*bs) block-diagonal."""
64 - n, bs, _ = blocks.shape
65 - out = blocks.new_zeros(n * bs, n * bs)
66 - for i in range(n):
67 - out[i * bs : (i + 1) * bs, i * bs : (i + 1) * bs] = blocks[i]
68 - return out
69 -
70 -
62 @register
63 class AntiPaSTO:
64 name = "antipasto"
● Update(src/lora_lite/variants/antipasto.py)
 Added 11 lines, removed 4 lines
114 S = layer.lora_S.to(x.dtype)  # (r,)
115 Vh = layer.lora_Vh.to(x.dtype)  # (r, d_in)
116
117 - R_blocks = _build_rotation(layer.lora_rot_T.float(), bs, max_angle)
118 - R = _block_diag(R_blocks).to(x.dtype)  # (r, r)
117 + R_blocks = _build_rotation(layer.lora_rot_T.float(), bs, max_angle).to(x.dtype)
118 + n_blocks, _, _ = R_blocks.shape  # (n, bs, bs)
119 + d_in, d_out = Vh.shape[1], U.shape[0]
120 + r = n_blocks * bs
121
122 + # Apply block-diagonal R per-block via einsum, never materializing (r,r).
123 if rotate_basis == "V":
121 - Vh_eff = R @ Vh  # rotate INPUT basis
124 + # Vh_eff = R @ Vh, viewed block-wise on the r-axis.
125 + Vh_blocks = Vh.reshape(n_blocks, bs, d_in)
126 + Vh_eff = einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i").reshape(r, d_in)
127 U_eff = U
128 elif rotate_basis == "U":
129 + # U_eff = U @ R.T, viewed block-wise on the r-axis.
130 + U_blocks = U.reshape(d_out, n_blocks, bs)
131 + U_eff = einsum(U_blocks, R_blocks, "d n b, n c b -> d n c").reshape(d_out, r)
132 Vh_eff = Vh
125 - U_eff = U @ R.T  # rotate OUTPUT basis
133 else:
134 raise ValueError(f"rotate_basis must be 'U' or 'V', got {rotate_basis!r}")
135
135
● Verify the antipasto change with a quick import + a smoke equivalence check.
● Bash(python -c "
import torch…)
 Error: Exit code 127
(eval):1: command not found: python
> einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i").reshape(r, d_in)
should not the reshape use einops reshape? there is also reduce
● You're right — use einops rearrange for the shape ops to keep them
self-documenting. Updating.
● Update(src/lora_lite/variants/antipasto.py)
 Added 1 line, removed 1 line
19 from typing import Literal
20
21 import torch
22 -from einops import einsum
22 +from einops import einsum, rearrange
23 from jaxtyping import Float
24 from torch import nn, Tensor as T
25
● Update(src/lora_lite/variants/antipasto.py)
 Added 7 lines, removed 7 lines
115 Vh = layer.lora_Vh.to(x.dtype)  # (r, d_in)
116
117 R_blocks = _build_rotation(layer.lora_rot_T.float(), bs, max_angle).to(x.dtype)
118 - n_blocks, _, _ = R_blocks.shape  # (n, bs, bs)
119 - d_in, d_out = Vh.shape[1], U.shape[0]
120 - r = n_blocks * bs
118 + n_blocks = R_blocks.shape[0]  # R_blocks: (n, bs, bs)
119
120 # Apply block-diagonal R per-block via einsum, never materializing (r,r).
121 if rotate_basis == "V":
122 # Vh_eff = R @ Vh, viewed block-wise on the r-axis.
125 - Vh_blocks = Vh.reshape(n_blocks, bs, d_in)
126 - Vh_eff = einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i").reshape(r, d_in)
123 + Vh_blocks = rearrange(Vh, "(n a) i -> n a i", n=n_blocks)
124 + Vh_rot = einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i")
125 + Vh_eff = rearrange(Vh_rot, "n a i -> (n a) i")
126 U_eff = U
127 elif rotate_basis == "U":
128 # U_eff = U @ R.T, viewed block-wise on the r-axis.
130 - U_blocks = U.reshape(d_out, n_blocks, bs)
131 - U_eff = einsum(U_blocks, R_blocks, "d n b, n c b -> d n c").reshape(d_out, r)
129 + U_blocks = rearrange(U, "d (n b) -> d n b", n=n_blocks)
130 + U_rot = einsum(U_blocks, R_blocks, "d n b, n c b -> d n c")
131 + U_eff = rearrange(U_rot, "d n c -> d (n c)")
132 Vh_eff = Vh
133 else:
134 raise ValueError(f"rotate_basis must be 'U' or 'V', got {rotate_basis!r}")
● Now run the equivalence smoke check.
● Bash(.venv/bin/python -c "
import torch…)
 t=0 identity err (V): 1.91e-06
einsum vs blockdiag (V): 1.31e-06
einsum vs blockdiag (U): 1.07e-06
● All three checks pass at fp32 round-trip noise (~1e-6).
● Quick sanity check on dora and pissa too.
● Bash(.venv/bin/python -c "
import torch…)
 DoRA t=0 identity err: 0.00e+00
DoRA grad on A: 0.3232 (should be > 0)
DoRA grad on m: 3.2282 (should be > 0)
… +4 lines (ctrl+o to expand)
● All three fixes verified:
- DoRA: t=0 bit-identity (err=0), grads still flow on A and m, but no longer
through v_norm (paper §4.3 / peft).
- PiSSA: bit-identity holds for α ∈ {r/2, r, 2r}, not just α=r. The old impl
also round-tripped W for any α (see review 3) but matched peft's A/B scaling
only at α=r.
- AntiPaSTO: einsum path matches block-diag matmul to fp32 round-trip (~1e-6);
t=0 bit-identity preserved.
Summary
┌───────────────────────────────────────────┬─────────────────────┬────────┐
│ Fix │ File │ Status │
├───────────────────────────────────────────┼─────────────────────┼────────┤
│ 1. detach BA + v_norm (paper §4.3) │ dora.py:62-64 │ ✓ │
├───────────────────────────────────────────┼─────────────────────┼────────┤
│ 2. pre-divide Sr by scale to match peft │ pissa.py:51-65 │ ✓ │
│ for any α │ │ │
├───────────────────────────────────────────┼─────────────────────┼────────┤
│ 3. einops rearrange + per-block einsum, │ antipasto.py:41-130 │ ✓ │
│ dropped _block_diag │ │ │
└───────────────────────────────────────────┴─────────────────────┴────────┘
Skipped (judgment calls — flag if you want them):
- delora lambda0=0.0 default → already documented in docstring
- antipasto α steering coefficient → design decision, ask before adding
- antipasto rotation chirality vs antipasto3 → gradient-equivalent under sign
flip; no checkpoint compat path anyway (jax↔torch)
✻ Tinkered for 5m 55s