tidy tests to subset of metamath

This commit is contained in:
wassname
2026-04-27 09:20:07 +08:00
parent 1a93df10b2
commit 727ef6ea73
6 changed files with 145 additions and 1211 deletions
-754
View File
@@ -1,754 +0,0 @@
yes make a task to deep review each one against how it's implemented. so one
task for each. please give me the pseudocode of 1) reference code 2)
reference paper 3) ours and compare.
Searched for 1 pattern, read 1 file (ctrl+o to expand)
● 1. LoRA review
Paper (Hu et al. 2021):
forward:
h = W·x + (α/r) · B·A·x # B∈R^{d×r}, A∈R^{r×d}
init:
A ~ Gaussian (or kaiming), B = 0
peft peft_lora_layer.py:941-969:
forward(x):
result = base_layer(x) # W·x + bias
for adapter in active_adapters:
result += lora_B(lora_A(dropout(x))) * scaling # scaling = α/r or
α/√r if rslora
init (reset_lora_parameters, line 263):
kaiming_uniform_(lora_A, a=√5); zeros_(lora_B)
ours src/lora_lite/variants/lora.py:
param_specs:
lora_A: (r, d_in) init=kaiming
lora_B: (d_out, r) init=zeros
forward(x, y): # y = base_layer(x) already computed by host
scale = α/r
h = einsum(x, A, "... i, r i -> ... r")
delta = einsum(h, B, "... r, o r -> ... o")
return y + scale·delta
Diff vs peft:
- ✓ math identical: scale·B·A·x + y
- ✗ no dropout (peft has lora_dropout)
- ✗ no rslora (α/√r)
- ✗ no init_lora_weights="gaussian" option (only kaiming)
- ✓ B=0 zero-init → identity at t=0
Verdict: correct vanilla LoRA. Missing dropout and rslora are deliberate scope
cuts (not bugs). No fix needed.
Read 1 file (ctrl+o to expand)
● 2. DoRA review
Paper (Liu et al. 2024, §4.1, §4.3):
W' = m · V/||V||_c, V = W + ΔV = W + (α/r)·B·A # ||·||_c = per-row L2
init: m = ||W||_c, B = 0 → W' = W
§4.3 trick: treat ||V+ΔV||_c as a CONSTANT (detach from grad graph)
for stability + speed.
peft peft_lora_dora.py:132-166:
forward(x):
lora_W = B·A
weight_norm = ||W + α/r·lora_W.detach()||_c # << DETACHED
weight_norm = weight_norm.detach() # << belt+braces
mag_norm_scale = (m / weight_norm)
base_result = W·x (or passed in, minus bias)
return (mag_norm_scale - 1)·base_result + mag_norm_scale · α/r · B·A·x
# bias re-added outside
ours src/lora_lite/variants/dora.py:53-71:
V = W + scale·B·A # << NOT detached
v_norm = V.norm(dim=1).clamp_min(1e-12) # << NOT detached
wx = y - bias if bias else y
delta_lora = scale · B·A·x
combined = wx + delta_lora
out = (m / v_norm) · combined
return out + bias if bias else out
Diff vs peft:
- ✗ Bug: missing .detach() on the lora term inside weight_norm. Paper §4.3
explicitly says to detach; peft detaches twice (line 144 lora_weight.detach()
and line 152 weight_norm.detach()). Without it, grads flow through the norm
denominator → unstable training and ~2× memory, the exact thing the paper
warns against.
- ✓ math otherwise equivalent: (m/||V||)·(Wx + α/r·BAx) = (m/||V||-1)·Wx +
(m/||V||)·α/r·BAx + Wx. Both forms produce same output.
- ✓ bias handling matches peft (subtract before scaling, add after).
- ✓ m init ||W||_c matches.
- = nn.Linear-only restriction is a deliberate scope cut (peft handles bnb via
dequantize_module_weight).
Recommended fix: add .detach() to V (or equivalently v_norm.detach()) before
dividing.
V = layer.weight + scale * BA.detach()
v_norm = V.norm(dim=1).clamp_min(1e-12).detach()
Read 2 files (ctrl+o to expand)
● 3. PiSSA review
Paper (Meng et al. 2024):
W = U S V^T (truncated to top-r)
A = √S · V^T, B = U · √S → B·A = U S V^T
W_res = W - B·A
forward: W_res·x + (α/r)·B·A·x → at t=0 equals W·x
peft peft_lora_layer.py:360-393:
V, S, Uh = svd(W) # V is left, Uh is right (peft
naming)
Sr = S[:r] / scaling # << pre-divide by α/r
A = diag(√Sr) @ Uh[:r] # absorbs 1/√scaling
B = V[:, :r] @ diag(√Sr) # absorbs 1/√scaling
W_res = W - scaling · B·A # net: W - V[:,:r]·S[:r]·Uh[:r]
(scaling cancels)
forward: W_res·x + scaling·B·A·x
ours src/lora_lite/variants/pissa.py:45-76:
U, S, Vh = svd(W)
B = U[:,:r] · √S[:r] # NO scaling pre-division
A = √S[:r] · Vh[:r] # NO scaling pre-division
W_res = W - scale · B·A # = W - scale · U·S·Vh (NOT
exactly W - U·S·Vh)
forward: W_res·x + scale · B·A·x # net: W·x at t=0 ✓
Diff vs peft:
- ✓ Bit-identity at t=0 holds for any α (algebra cancels symmetrically): both
peft and ours satisfy W_res + scale·BA = W round-tripped.
- ⚠ A/B starting magnitudes differ when α≠r. peft pre-divides S/scale so
||B||,||A|| ∝ √(S/scale); ours has ||B||,||A|| ∝ √S. Update dynamics (α/r ·
ΔB·A) therefore differ by a factor of √scale per matrix. For α=r (scale=1)
they're identical — and the docstring already says "Pass alpha=r for
paper-faithful scale=1".
- ✓ Linear-only restriction is a deliberate scope cut.
- ✓ cfg.dtype cast on A,B before storage matches peft's to(dtype) after save.
- ✓ fp32 SVD then (W - scale·BA) in fp32 before bf16 cast — comment "fp32
subtraction so W_res stays accurate" is honoured. peft does same via
weight.to(torch.float32).
Verdict: correct for α = r. For α ≠ r, behaves differently from peft (still
mathematically valid PiSSA, just different parameter scaling of A/B). Either
fix (apply Sr /= scale like peft) or tighten the docstring to "α=r required"
and assert it. Mild fix; not a correctness bug.
Read 1 file (ctrl+o to expand)
● 4. DeLoRA review
Paper (Bini et al. 2025, ICLR'25):
W' = W + (λ/r) · B · Ξ · A, Ξ_ii = 1/(||b_i|| · ||a_i||)
init: B=0 → ΔW=0
default λ=15
Crucial: w_norm is per-input-channel ||W||_2 (col-norm), captured ONCE at init
(frozen reference) — not Frobenius scalar.
peft peft_delora_layer.py:226-264:
forward(x):
h = (x * w_norm) @ A.T # broadcast w_norm over
input dim
An = clamp(A.norm(dim=1), 1e-4)
Bn = clamp(B.norm(dim=0), 1e-4)
scale_r = (λ/r) / (An * Bn) # (r,)
h = h * scale_r
h = h @ B.T
return base_out + h
init: A ← kaiming(√5), B ← 0, λ ← 15.0, w_norm ← ||W||_2 col (frozen
buffer)
ours src/lora_lite/variants/delora.py:55-81:
forward(x, y):
x_scaled = x * lora_wnorm
h = einsum(x_scaled, A, "... i, r i -> ... r") # = (x * w_norm) @ A.T
An = clamp(A.norm(dim=1), 1e-4)
Bn = clamp(B.norm(dim=0), 1e-4)
scale_r = (λ/r) / (An * Bn)
h = h * scale_r
delta = einsum(h, B, "... r, o r -> ... o") # = h @ B.T
return y + delta
init: A ← kaiming(√5), B ← 0, λ ← lambda0 (default 0.0, NOT 15.0),
lora_wnorm ← ||W||_2 col (frozen buffer)
Diff vs peft:
- ✓ math line-for-line identical: same dim-0 col-norm, same per-rank A/B
normalization, same gradient path (norms not detached, A,B receive gradient
through them — matches peft exactly).
- ✓ frozen lora_wnorm buffer (peft uses BufferDict, ours uses as_buffer=True).
- ⚠ default lambda0=0.0 differs from peft's 15.0. Docstring is honest about
this ("0.0 = bit-identity at t=0, but A,B get zero grad until lambda moves").
At λ=0, gradients on A,B vanish since delta ∝ λ, so training stalls until λ
moves off zero. This is documented but is a sharp foot-gun. peft's 15.0
default avoids it.
- ✓ even-r restriction not relevant here (HRA only).
- ✓ nn.Linear-only restriction documented.
Verdict: correct. The lambda0 default is a deliberate "bit-identity but
stalls" choice — keep, but consider noting in docstring that for actual
training you'd want λ0≈1-15 (paper uses 15).
Searched for 1 pattern, read 1 file (ctrl+o to expand)
● 5. EVA review
Paper (Paischer et al. 2024):
For each target layer l:
Collect inputs x ∈ R^{N × d_in} on calibration batches
Vh = right_singular_vectors(x) # top-r principal input dirs
A_l ← Vh[:r] # data-driven init
B_l ← 0
forward = standard LoRA
Optional: rank redistribution by explained variance (ρ-budget across layers)
Optional: hash-dedup layers with identical inputs (Q/K/V share x)
peft peft_eva.py:
For each calibration batch:
IncrementalPCA.partial_fit(layer_input) # streaming,
memory-efficient
Hash inputs; dedup layers seeing same x.
Track convergence via cossim(components_t, components_{t-1})
After:
Redistribute ranks by explained_variance_ratio (ρ scale)
A_l ← svd.components_[:rank_l] # top-rank rows
B_l ← 0
forward = vanilla LoRA path (`scale·B@A·x`)
ours src/lora_lite/variants/eva.py:
group_init(model, targets, calibration_data):
hooks capture pre-forward inputs per layer
for batch in calibration_data: model(batch)
for each target:
X = cat(captured) # (N, d_in), float32 on cpu
_,_,Vh = torch.linalg.svd(X, full_matrices=False)
A ← Vh[:r] # uniform r per layer
forward(x, y):
return y + (α/r) · (B @ A @ x) # B=0 at t=0 → identity
Diff vs peft:
- ✓ Top-r right singular vectors of pooled inputs → A. Math identical.
- ✓ B=0 init → identity at t=0.
- ✓ A,B both trainable post-init (peft same).
- ✗ Full SVD on cpu-pooled tensor vs. peft's IncrementalPCA — limits us to
small calibration sets (we cat all activations in RAM). The docstring
acknowledges this.
- ✗ No rank redistribution (rho param). All targets get same r. Acknowledged.
- ✗ No QKV input dedup. Acknowledged. Cost: 3× SVD work for layers that share
input. Correct, just slower.
- ✗ No convergence check. Single pass over calibration_data. Acknowledged.
- ⚠ Captured tensors moved to cpu+fp32 — fine, but a noisy 1k-batch on a 70B
model = lots of RAM. Just a scaling note.
Verdict: correct minimal EVA, all simplifications stated up-front in
docstring. No bugs. If you ever want to scale to large calibration corpora,
swap the torch.cat → svd for IncrementalPCA.
Read 1 file (ctrl+o to expand)
● 6. HRA review
Paper (Yuan et al. 2024):
W' = W·R, R = ∏_{i=0..r-1} H_i, H_i = I - 2·u_i·u_i^T / ||u_i||^2
init (paper): independent random columns
peft variant: paired columns (u_{2k}=u_{2k+1}) → R=I exactly at t=0, even r
GS variant (apply_GS=True): explicit Gram-Schmidt for strict orthogonality
Loss-side: orthogonality regularizer (Eq. 6); not part of layer.
peft peft_hra_layer.py:208-265:
hra_u: (d_in, r) # COLUMNS are vectors
get_delta_weight():
opt_u = opt_u / opt_u.norm(dim=0) # pre-normalize columns
R = I_{d_in}
for i in 0..r-1: # forward order
u = opt_u[:, i]
R = R - 2·R·u·u^T # R ← R · H_i
forward(x):
R = product over active adapters
return F.linear(x, W @ R, bias) # = x · R^T · W^T
(effectively R applied to x)
init: half_u = kaiming((d_in, r/2)); u = repeat_interleave(half_u, 2, dim=1)
# cols 0=1, 2=3, ...
ours src/lora_lite/variants/hra.py:53-75:
lora_U: (r, d_in) # ROWS are vectors
(transpose convention)
forward_input(x):
Rx = x
for i = r-1 downto 0: # reverse order, applied
directly
u = U[i]
sq = (u·u).sum().clamp_min(1e-12)
coeff = (Rx · u) · (2/sq)
Rx = Rx - coeff · u # Rx ← Rx · H_i
return Rx # then base layer applies W
to this
init: half = kaiming((r/2, d_in)); U = repeat_interleave(half, 2, dim=0) #
rows 0=1, 2=3, ...
Diff vs peft:
- ✓ Math equivalent. peft computes R = H_0·H_1·...·H_{r-1} (col-vector
convention), then applies via x·R^T = x·H_{r-1}·...·H_0 (since H is
symmetric). Ours iterates i = r-1 → 0 doing Rx ← Rx·H_i, computing
x·H_{r-1}·...·H_0. Same.
- ✓ symmetric init: row pairs vs col pairs — consistent with our row-vector
convention.
- ✓ identity at t=0: paired u_a=u_b → H_a·H_a=I (involution). R=I.
- ⚠ No apply_GS=True path. peft offers Gram-Schmidt orthogonalization.
Docstring acknowledges via "Note: paper's orthogonality regularizer (Eq. 6) is
loss-side". Ours uses pure Householder products (which already give exactly
orthogonal R at any step), so GS is redundant unless you want unit-norm
columns. Fine to omit.
- ✓ More efficient than peft. Avoids materializing (d_in × d_in) R matrix;
applies reflections directly to x in O(r·d_in) not O(r·d_in²).
- ⚠ Slight gradient-flow difference. peft pre-normalizes opt_u /
||opt_u||_col, so gradients flow through normalization. Ours keeps ||u||² in
the denominator each step, gradients flow through that. Both define the same
orthogonal H_i; gradients should give equivalent training to within numerical
noise, but not bit-identical.
- ✓ even-r requirement raised loudly (peft warns + falls back; we raise —
preferable for fail-fast).
Verdict: correct + faster than peft. Ordering bug from 67a6daf is properly
fixed (verified: i = r-1 → 0 iteration matches peft's R = H_0..H_{r-1} then
x·R^T).
Read 1 file (ctrl+o to expand)
● 7. IA3 review
Paper (Liu et al. 2022):
y = ⊙ (W·x) for attention K, V (output-side gate)
y = W_2·(_ff ⊙ γ(W_1·x)) for FF (mid-FFN gate, T5)
init: = 1
peft peft_ia3_layer.py:161-188 (single class with is_feedforward flag):
if is_feedforward:
g shape (1, d_in)
interm = x * g
return base_layer(interm) # input-side
else:
g shape (d_out, 1)
result = base_layer(x)
return result * g # output-side
init: g ← ones (when init_ia3_weights=True, default)
default targets: ["k_proj", "v_proj", "down_proj"];
feedforward_modules=["down_proj"]
ours src/lora_lite/variants/ia3.py:38-76 (two classes IA3, IA3FF):
IA3 (output-side, e.g. k_proj/v_proj):
g shape (d_out,); init ones
forward(x, y) = y * g
IA3FF (input-side, e.g. down_proj/fc2):
g shape (d_in,); init ones
forward_input(x) = x * g # base_layer applied
after by host
Diff vs peft:
- ✓ math identical to peft's two branches.
- ✓ init=ones matches.
- ✓ Splitting into two classes (rather than is_feedforward flag) is cleaner —
fail-fast: which class you register fixes the side, no boolean to misset.
- ⚠ peft's output-side path also scales base_layer.bias during merge. We
don't merge (forward-only adapter), so no issue at training time — but if you
ever add merge, watch this.
- ⚠ docstring example uses regex r"\\.k_proj$" which under YAML/Python would
match a literal .k_proj string-end; check that target_names is treated as
regex by your attach() (matches the behavior in variant.py).
- ⚠ Position-on-Llama: peft applies "ff" gate at down_proj input, which is
gate*up (post-SiLU). This deviates from paper's T5 position (post-activation
of single W_1) but matches peft's default for Llama-style. Acknowledged via
docstring example.
Verdict: correct, cleaner than peft. No bugs.
ko! so what do we need to fix?
● Fixes needed (from reviews 1-7)
Real bug (1):
- dora.py:62-63 — missing .detach() on the LoRA path inside the norm. Paper
§4.3 explicitly says treat ||V+ΔV||_c as a constant; peft detaches twice.
Without it, grads flow through the norm denominator → unstable training + ~2×
memory.
V = layer.weight + scale * BA.detach() # detach BA here
v_norm = V.norm(dim=1).clamp_min(1e-12).detach() # belt & braces
Behavioral / docstring (2):
- pissa.py — at α≠r, A,B starting magnitude differs from peft by √scale per
matrix (peft pre-divides S/scale). Either match peft (Sr /= scale before
computing √Sr factors) or assert cfg.alpha == cfg.r and tighten the docstring.
- delora.py:36 — lambda0=0.0 default is bit-identity but a foot-gun: at λ=0
the gradient on A,B is zero, so training is dead until λ moves. Either change
default to peft's 15.0, or expand the existing comment to flag this for new
users.
Clean (4): lora, eva, hra, ia3 — no fixes.
Pending: antipasto.py (review #8 not yet done; want me to continue?).
✻ Perused for 5m 30s
sure do antipasto now...
Read 1 file (ctrl+o to expand)
● 8. AntiPaSTO review
Reference (antipasto3, Flax):
kernel = U·diag(S)·Vh + W_res # kernel: (in, out)
init U,S,Vh,W_res = svd(kernel) (frozen, fp32 / bf16 W_res)
trainable:
delta_s truncated_normal·4e-4 + 4e-4 # small POSITIVE bias
(sym-break)
rot_T truncated_normal·1e-4 # small random
forward(x, alpha):
A = scatter(rot_T into upper-triu of (n_blocks, bs, bs))
A = 0.5·(A - A^T); A = a_lim·tanh(A/a_lim)
X = α·A/2 # << α scales rotation
R = (I-X)^-1 (I+X) # block Cayley
if rotate_U: U_rot = U @ R # einsum, never builds full
(r,r)
if rotate_V: Vh_rot = R @ Vh
S_scaled = S + α·delta_s
return (x @ U_rot)·S_scaled @ Vh_rot + x @ W_res
Identity property: bidirectional symmetry — R(α=+1) and R(α=-1) are exact
inverses.
At α=0: U,S,Vh untouched → kernel reproduced.
ours src/lora_lite/variants/antipasto.py:
weight (pytorch): (d_out, d_in); SVD: U:(d_out,r), S:(r,), Vh:(r,d_in)
init: U,S,Vh frozen buffers; W_res mutates layer.weight (PiSSA-style)
trainable:
delta_s = 0
rot_T = 0 # << EXACT identity at t=0
forward(x, y):
A = scatter(rot_T) # (n_blocks, bs, bs)
A = 0.5·(A - A^T); A = a_lim·tanh(A/a_lim)
X = A/2 # << no α
R_blocks = Cayley(X)
R = block_diag(R_blocks) # << builds full (r,r) via
for-loop
if rotate_basis=="V": Vh_eff = R @ Vh
elif rotate_basis=="U": U_eff = U @ R^T
S_eff = S + delta_s
h = (x @ Vh_eff^T) · S_eff
delta = h @ U_eff^T
return y + delta # y = x @ W_res^T already
Identity at t=0: rot_T=0 → R=I, delta_s=0 → delta = x @ Vh^T @ diag(S) @ U^T =
x @ W_svd^T → exact bit-identity.
Diff vs antipasto3:
- ⚠ No alpha steering coefficient. antipasto3's headline feature —
bidirectional symmetry (R(+α)·R(-α)=I) — is dropped. If you intend antipasto
as a fine-tuning-only adapter, this is a deliberate scope cut; if you ever
want bidirectional steering at inference, you'd need to thread an α through
the forward.
- ⚠ Rotation chirality flipped. antipasto3 rotates the input basis by R (U @
R); ours rotates by R^T (Vh_eff = R @ Vh ⇒ (x @ Vh^T) @ R^T). Same
orthogonal-group element up to inversion, so gradient-equivalent under sign
flip of rot_T. Not a correctness bug, but checkpoints aren't transferable
between the two impls without a sign flip.
- ✓ Stricter t=0 identity. Ours zero-inits both delta_s and rot_T → exact
bit-identity. antipasto3 nudges delta_s ≈ +4e-4 and rot_T 1e-4 for symmetry
breaking.
Foot-gun: at exact zero rotation, ∂loss/∂rot_T may be sign-symmetric (rotating
in either direction gives same loss to first order), so optimizer can stall
on rot_T while delta_s carries the load. antipasto3's small bias avoids this.
Worth a docstring note.
- ✗ Perf: full (r,r) block-diagonal materialized via Python for-loop in
_block_diag. antipasto3 uses einsum('dnb,nbc->dnc', U_reshaped, R_blocks) to
apply each block directly without ever building the (r,r) matrix. For r=64,
bs=4 this is 16 reflective blocks; the for-loop is O(r²) memory and Python
overhead. Easy fix:
# rotate V (input basis): replace block_diag + matmul with einsum
Vh_blocks = Vh.reshape(n_blocks, bs, d_in)
Vh_eff = einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i").reshape(r, d_in)
- ⚠ rotate_basis enum ("V" XOR "U") vs. antipasto3's two booleans (can do
both). Minor expressivity loss.
- ✓ Cayley, skew construction, tanh angle clamping all match antipasto3
exactly (incl. the 0.5 factor on (A - A^T)).
- ✓ W_res via PiSSA-style in-place mutation of layer.weight is more
memory-efficient than antipasto3's separate buffer; semantically identical.
- ⚠ arxiv URL: https://arxiv.org/abs/2601.07473 — earlier I confirmed it
returns HTTP 200, so the paper exists. Fine.
Verdict: correct as a fine-tuning adapter with stricter t=0 identity. Two
divergences from antipasto3 worth deciding on:
1. Chirality: trivial sign-flip in skew if you want checkpoint compat with
antipasto3 — otherwise leave.
2. No α: if antipasto needs to support inference-time steering (the paper's
selling point), thread an α scalar through the forward and S_eff = S +
α·delta_s; X = α·A/2.
And one perf fix: kill _block_diag, use einsum.
ok now make a task for each fix and fix pls
● Update(src/lora_lite/variants/dora.py)
 Added 4 lines, removed 3 lines
57 ) -> Float[T, '*B o']:
58 cfg = layer._lora_cfg
59 scale = cfg.alpha / cfg.r
60 - # V = W + scale * B @ A
60 + # Paper §4.3: treat ||V+ΔV||_c as a constant (detach fr
+om grad graph) for
61 + # stability and ~2x lower memory. Match peft (lora_weig
+ht.detach + weight_norm.detach).
62 BA = einsum(layer.lora_B, layer.lora_A, "o r, r i -> o
i")
62 - V = layer.weight + scale * BA
-# (d_out, d_in)
63 - v_norm = V.norm(dim=1).clamp_min(1e-12)
-# (d_out,)
63 + V = layer.weight + scale * BA.detach()
+# (d_out, d_in)
64 + v_norm = V.norm(dim=1).clamp_min(1e-12).detach()
+# (d_out,)
65 # Bias passes through unscaled (matches peft).
66 bias = getattr(layer, "bias", None)
67 wx = y if bias is None else (y - bias)
● Update(src/lora_lite/variants/pissa.py)
 Added 4 lines, removed 3 lines
3 Meng et al. 2024 https://arxiv.org/abs/2404.02948
4
5 W = U S Vh (truncated to top-r)
6 - B = U sqrt(S), A = sqrt(S) Vh, W_res = W - B A
6 + Sr_eff = Sr / (alpha/r) # peft-sty
+le: pre-divide so A/B
7 + B = U sqrt(Sr_eff), A = sqrt(Sr_eff) Vh # update d
+ynamics match for any alpha
8 + W_res = W - (alpha/r) B A = W - U Sr Vh # scaling
+cancels symmetrically
9
8 -Identity at t=0: W_res + B@A == W (bf16 round-trip, not bit-exa
-ct).
9 -Pass alpha=r for paper-faithful scale=1.
10 +Identity at t=0: W_res + (alpha/r) B@A == W (fp32 round-trip, b
+f16 cast can drift).
11
12 Refs:
13 - paper: https://github.com/MuLabPKU/PiSSA/blob/main/utils/in
it_pissa.py
● Update(src/lora_lite/variants/pissa.py)
 Added 7 lines, removed 3 lines
52 W = layer.weight.data.float() # (
d_out, d_in)
53 U, S, Vh = torch.linalg.svd(W, full_matrices=False)
54 r = cfg.r
55 + scale = cfg.alpha / cfg.r
56 Ur, Sr, Vhr = U[:, :r], S[:r], Vh[:r, :]
56 - sqrtS = Sr.sqrt()
57 - # B @ A = Ur diag(Sr) Vhr; pick B = Ur sqrt(Sr), A =
-sqrt(Sr) * Vhr
57 + # Pre-divide Sr by scaling so A/B carry "natural" magni
+tudes for any alpha
58 + # (peft pissa_init does this; needed so the alpha/r sca
+ling on the forward
59 + # gives matched update dynamics rather than re-scaling
+A,B by sqrt(scale)).
60 + Sr_eff = Sr / scale
61 + sqrtS = Sr_eff.sqrt()
62 + # B @ A = Ur diag(Sr/scale) Vhr; W_res = W - scale * B
+@A = W - Ur diag(Sr) Vhr.
63 B = (Ur * sqrtS).to(cfg.dtype)
64 A = (sqrtS[:, None] * Vhr).to(cfg.dtype)
65 layer.lora_B.data.copy_(B)
66 layer.lora_A.data.copy_(A)
67 # fp32 subtraction so W_res stays accurate.
68 BA = (B.float() @ A.float())
64 - scale = cfg.alpha / cfg.r
69 layer.weight.data.copy_((W - scale * BA).to(layer.weigh
t.dtype))
70
71 @staticmethod
● Update(src/lora_lite/variants/antipasto.py)
 Removed 9 lines
59 return _cayley(A)
60
61
62 -def _block_diag(blocks: torch.Tensor) -> torch.Tensor:
63 - """(n_blocks, bs, bs) -> (n_blocks*bs, n_blocks*bs) block-d
-iagonal."""
64 - n, bs, _ = blocks.shape
65 - out = blocks.new_zeros(n * bs, n * bs)
66 - for i in range(n):
67 - out[i * bs : (i + 1) * bs, i * bs : (i + 1) * bs] = blo
-cks[i]
68 - return out
69 -
70 -
62 @register
63 class AntiPaSTO:
64 name = "antipasto"
● Update(src/lora_lite/variants/antipasto.py)
 Added 11 lines, removed 4 lines
114 S = layer.lora_S.to(x.dtype)
# (r,)
115 Vh = layer.lora_Vh.to(x.dtype)
# (r, d_in)
116
117 - R_blocks = _build_rotation(layer.lora_rot_T.float(), b
-s, max_angle)
118 - R = _block_diag(R_blocks).to(x.dtype)
-# (r, r)
117 + R_blocks = _build_rotation(layer.lora_rot_T.float(), b
+s, max_angle).to(x.dtype)
118 + n_blocks, _, _ = R_blocks.shape
+# (n, bs, bs)
119 + d_in, d_out = Vh.shape[1], U.shape[0]
120 + r = n_blocks * bs
121
122 + # Apply block-diagonal R per-block via einsum, never m
+aterializing (r,r).
123 if rotate_basis == "V":
121 - Vh_eff = R @ Vh
-# rotate INPUT basis
124 + # Vh_eff = R @ Vh, viewed block-wise on the r-axis
+.
125 + Vh_blocks = Vh.reshape(n_blocks, bs, d_in)
126 + Vh_eff = einsum(R_blocks, Vh_blocks, "n a b, n b i
+ -> n a i").reshape(r, d_in)
127 U_eff = U
128 elif rotate_basis == "U":
129 + # U_eff = U @ R.T, viewed block-wise on the r-axis
+.
130 + U_blocks = U.reshape(d_out, n_blocks, bs)
131 + U_eff = einsum(U_blocks, R_blocks, "d n b, n c b -
+> d n c").reshape(d_out, r)
132 Vh_eff = Vh
125 - U_eff = U @ R.T
-# rotate OUTPUT basis
133 else:
134 raise ValueError(f"rotate_basis must be 'U' or 'V'
, got {rotate_basis!r}")
135
● Verify the antipasto change with a quick import + a smoke equivalence check.
● Bash(python -c "
import torch…)
 Error: Exit code 127
(eval):1: command not found: python
> einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i").reshape(r, d_in)
should not the reshape use einops reshape? there is also reduce
● You're right — use einops rearrange for the shape ops to keep them
self-documenting. Updating.
Read 1 file (ctrl+o to expand)
● Update(src/lora_lite/variants/antipasto.py)
 Added 1 line, removed 1 line
19 from typing import Literal
20
21 import torch
22 -from einops import einsum
22 +from einops import einsum, rearrange
23 from jaxtyping import Float
24 from torch import nn, Tensor as T
25
● Update(src/lora_lite/variants/antipasto.py)
 Added 7 lines, removed 7 lines
115 Vh = layer.lora_Vh.to(x.dtype)
# (r, d_in)
116
117 R_blocks = _build_rotation(layer.lora_rot_T.float(), b
s, max_angle).to(x.dtype)
118 - n_blocks, _, _ = R_blocks.shape
-# (n, bs, bs)
119 - d_in, d_out = Vh.shape[1], U.shape[0]
120 - r = n_blocks * bs
118 + n_blocks = R_blocks.shape[0]
+# R_blocks: (n, bs, bs)
119
120 # Apply block-diagonal R per-block via einsum, never m
aterializing (r,r).
121 if rotate_basis == "V":
122 # Vh_eff = R @ Vh, viewed block-wise on the r-axis
.
125 - Vh_blocks = Vh.reshape(n_blocks, bs, d_in)
126 - Vh_eff = einsum(R_blocks, Vh_blocks, "n a b, n b i
- -> n a i").reshape(r, d_in)
123 + Vh_blocks = rearrange(Vh, "(n a) i -> n a i", n=n_
+blocks)
124 + Vh_rot = einsum(R_blocks, Vh_blocks, "n a b, n b i
+ -> n a i")
125 + Vh_eff = rearrange(Vh_rot, "n a i -> (n a) i")
126 U_eff = U
127 elif rotate_basis == "U":
128 # U_eff = U @ R.T, viewed block-wise on the r-axis
.
130 - U_blocks = U.reshape(d_out, n_blocks, bs)
131 - U_eff = einsum(U_blocks, R_blocks, "d n b, n c b -
-> d n c").reshape(d_out, r)
129 + U_blocks = rearrange(U, "d (n b) -> d n b", n=n_bl
+ocks)
130 + U_rot = einsum(U_blocks, R_blocks, "d n b, n c b -
+> d n c")
131 + U_eff = rearrange(U_rot, "d n c -> d (n c)")
132 Vh_eff = Vh
133 else:
134 raise ValueError(f"rotate_basis must be 'U' or 'V'
, got {rotate_basis!r}")
● Now run the equivalence smoke check.
Listed 1 directory (ctrl+o to expand)
● Bash(.venv/bin/python -c "
import torch…)
 t=0 identity err (V): 1.91e-06
einsum vs blockdiag (V): 1.31e-06
einsum vs blockdiag (U): 1.07e-06
● All three checks pass at fp32 round-trip noise (~1e-6).
● Quick sanity check on dora and pissa too.
● Bash(.venv/bin/python -c "
import torch…)
 DoRA t=0 identity err: 0.00e+00
DoRA grad on A: 0.3232 (should be > 0)
DoRA grad on m: 3.2282 (should be > 0)
… +4 lines (ctrl+o to expand)
● All three fixes verified:
- DoRA: t=0 bit-identity (err=0), grads still flow on A and m, but no longer
through v_norm (paper §4.3 / peft).
- PiSSA: bit-identity now holds for α ∈ {r/2, r, 2r}, not just α=r. Old impl
was only correct at α=r.
- AntiPaSTO: einsum path matches block-diag matmul to fp32 round-trip (~1e-6);
t=0 bit-identity preserved.
Summary
┌───────────────────────────────────────────┬─────────────────────┬────────┐
│ Fix │ File │ Status │
├───────────────────────────────────────────┼─────────────────────┼────────┤
│ 1. detach BA + v_norm (paper §4.3) │ dora.py:62-64 │ ✓ │
├───────────────────────────────────────────┼─────────────────────┼────────┤
│ 2. pre-divide Sr by scale to match peft │ pissa.py:51-65 │ ✓ │
│ for any α │ │ │
├───────────────────────────────────────────┼─────────────────────┼────────┤
│ 3. einops rearrange + per-block einsum, │ antipasto.py:41-130 │ ✓ │
│ dropped _block_diag │ │ │
└───────────────────────────────────────────┴─────────────────────┴────────┘
Skipped (judgment calls — flag if you want them):
- delora lambda0=0.0 default → already documented in docstring
- antipasto α steering coefficient → design decision, ask before adding
- antipasto rotation chirality vs antipasto3 → gradient-equivalent under sign
flip; no checkpoint compat path anyway (jax↔torch)
✻ Tinkered for 5m 55s
+18 -13
View File
@@ -45,6 +45,7 @@ class BenchmarkConfig:
mode: Literal["benchmark", "probe"] = "benchmark"
device: str = "cuda"
torch_dtype: str = "bfloat16"
quantization: Literal["none", "4bit", "8bit"] = "none"
r: int = 32
alpha: float = 64.0
delora_lambda0: float = 0.1
@@ -146,7 +147,7 @@ def count_base_grad_leaks(model: torch.nn.Module) -> int:
def perturb_first_adapter(model: torch.nn.Module) -> None:
priority = ("lora_B", "lora_g", "lora_U", "lora_A", "lora_lambda", "lora_gate")
priority = ("lora_B", "lora_g", "lora_U", "lora_A", "lora_lambda", "lora_gate", "lora_delta_s", "lora_rot_T", "lora_m")
for key in priority:
for _, p in model.named_parameters():
if p.requires_grad and key in _:
@@ -159,7 +160,7 @@ def perturb_first_adapter(model: torch.nn.Module) -> None:
raise AssertionError("no perturbable adapter parameter found")
def load_model_and_tokenizer(model_id: str, dtype: torch.dtype, device: str):
def load_model_and_tokenizer(model_id: str, dtype: torch.dtype, device: str, quantization: str = "none"):
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
@@ -167,7 +168,16 @@ def load_model_and_tokenizer(model_id: str, dtype: torch.dtype, device: str):
raise RuntimeError(f"tokenizer for {model_id} has no eos_token")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=dtype).to(device)
if quantization == "none":
model = AutoModelForCausalLM.from_pretrained(model_id, dtype=dtype).to(device)
else:
from transformers import BitsAndBytesConfig
bnb_cfg = BitsAndBytesConfig(
load_in_4bit=quantization == "4bit",
load_in_8bit=quantization == "8bit",
bnb_4bit_compute_dtype=dtype if quantization == "4bit" else None,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_cfg, device_map=device)
model.config.use_cache = False
return model, tokenizer
@@ -372,13 +382,8 @@ def evaluate(model, tokenizer, dataset, args: BenchmarkConfig, split: str) -> di
@torch.no_grad()
def probe_before_train(model, batch: dict[str, torch.Tensor | int], attached_targets: list[str]) -> dict[str, Any]:
expected_targets = {
"model.layers.0.self_attn.q_proj",
"model.layers.0.self_attn.v_proj",
}
attached_set = set(attached_targets)
if attached_set != expected_targets:
raise AssertionError(f"probe expected layer-0 q/v only, got {sorted(attached_set)}")
if not attached_targets:
raise AssertionError("probe: no targets attached")
logits_init = model(input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]).logits.detach().clone()
clean_adapter = adapter_state(model)
perturb_first_adapter(model)
@@ -387,7 +392,7 @@ def probe_before_train(model, batch: dict[str, torch.Tensor | int], attached_tar
raise AssertionError(f"adapter perturbation did not affect logits: {perturb_delta}")
for name, value in clean_adapter.items():
model.state_dict()[name].copy_(value)
return {"expected_targets": sorted(expected_targets), "perturb_delta": perturb_delta}
return {"attached_targets": sorted(attached_targets), "perturb_delta": perturb_delta}
@torch.no_grad()
@@ -401,7 +406,7 @@ def check_probe_reload(
del cfg # cfg is saved in the checkpoint; keep the call-site explicit.
gc.collect()
torch.cuda.empty_cache()
loaded_model, _ = load_model_and_tokenizer(args.model, getattr(torch, args.torch_dtype), args.device)
loaded_model, _ = load_model_and_tokenizer(args.model, getattr(torch, args.torch_dtype), args.device, args.quantization)
loaded_model.eval()
ll.load(loaded_model, str(adapter_path))
saved = torch.load(adapter_path, weights_only=True, map_location="cpu")
@@ -489,7 +494,7 @@ def run(args: BenchmarkConfig) -> dict[str, Any]:
out_dir.mkdir(parents=True, exist_ok=True)
datasets = load_datasets(args)
model, tokenizer = load_model_and_tokenizer(args.model, dtype, args.device)
model, tokenizer = load_model_and_tokenizer(args.model, dtype, args.device, args.quantization)
batches, skipped_train_prompt_too_long = make_train_batches(datasets["train"], tokenizer, args)
cfg = cfg_for_variant(args, dtype)
if args.variant == "eva":
-53
View File
@@ -1,53 +0,0 @@
"""Smoke: end-to-end MetaMath->GSM8K plumbing for every variant on a tiny HF model.
Per-variant correctness invariants live in tests/test_lora_lite.py. This script
just confirms the full benchmark pipeline (data load, prompt encode, train step,
eval generate + answer extract) runs for each adapter type.
"""
from __future__ import annotations
import subprocess
import sys
VARIANTS = ["lora", "pissa", "delora", "ia3", "ia3_ff", "dora", "hra", "eva", "antipasto"]
MODEL = "hf-internal-testing/tiny-random-LlamaForCausalLM"
def run_one(variant: str) -> int:
cmd = [
sys.executable,
"scripts/metamath_gsm8k_benchmark.py",
"--model", MODEL,
"--variant", variant,
"--steps", "2",
"--batch-size", "2",
"--max-train-samples", "8",
"--max-eval-samples", "10",
"--max-valid-samples", "10",
"--max-new-tokens", "8",
"--max-seq-length", "128",
"--r", "4",
"--alpha", "8",
"--torch-dtype", "float32",
"--device", "cpu",
]
if variant == "ia3":
cmd += ["--target-name", r"(k_proj|v_proj)$"]
elif variant == "ia3_ff":
cmd += ["--target-name", r"(down_proj)$"]
print(f"\n=== smoke variant={variant} ===")
print(" ".join(cmd))
return subprocess.call(cmd)
def main() -> int:
failed = [v for v in VARIANTS if run_one(v) != 0]
if failed:
print(f"FAIL: {failed}")
return 1
print("ALL PASS.")
return 0
if __name__ == "__main__":
sys.exit(main())
-63
View File
@@ -1,63 +0,0 @@
"""bnb 4bit/8bit CUDA smoke. Skipped without CUDA + bitsandbytes installed."""
from __future__ import annotations
import pytest
import torch
from torch import nn
import lora_lite as ll
pytestmark = pytest.mark.skipif(not torch.cuda.is_available(), reason="needs CUDA")
bnb = pytest.importorskip("bitsandbytes")
CFG_BY_VARIANT = {
"lora": ll.LoRAConfig,
"ia3": ll.IA3Config,
"hra": ll.HRAConfig,
"pissa": ll.PiSSAConfig,
"dora": ll.DoRAConfig,
}
class BnbModel(nn.Module):
def __init__(self, layer_cls):
super().__init__()
self.config = type("Cfg", (), {"hidden_size": 8})()
self.layers = nn.ModuleList([layer_cls(8, 8, bias=False)]).cuda()
def forward(self, x):
return self.layers[0](x)
@pytest.mark.parametrize("layer_cls", [bnb.nn.Linear8bitLt, bnb.nn.Linear4bit])
@pytest.mark.parametrize("variant", ["lora", "ia3", "hra"])
def test_hook_only_variants_attach_to_bnb(layer_cls, variant):
"""LoRA / IA3 / HRA only hook outputs; bnb dequantization is the layer's job."""
torch.manual_seed(0)
model = BnbModel(layer_cls)
x = torch.randn(2, 3, 8, device="cuda")
y_base = model(x).detach()
cfg = CFG_BY_VARIANT[variant](r=2, alpha=4, dtype=torch.float16, target_roles=())
ll.attach(model, cfg)
y = model(x)
assert (y.detach() - y_base).abs().max().item() < 1e-2
y.pow(2).mean().backward()
grad_total = sum(
g.abs().sum().item()
for n, p in model.named_parameters()
if "lora_" in n and p.requires_grad and (g := p.grad) is not None
)
assert grad_total > 0
@pytest.mark.parametrize("layer_cls", [bnb.nn.Linear8bitLt, bnb.nn.Linear4bit])
@pytest.mark.parametrize("variant", ["pissa", "dora"])
def test_weight_reading_variants_reject_bnb(layer_cls, variant):
model = BnbModel(layer_cls)
cfg = CFG_BY_VARIANT[variant](r=2, alpha=2, dtype=torch.float16, target_roles=())
with pytest.raises((TypeError, RuntimeError, AttributeError, ValueError)):
ll.attach(model, cfg)
-328
View File
@@ -1,328 +0,0 @@
"""Per-variant attach + train + save + load round-trip, plus surgical regressions.
The big invariant is the parametrized train_save_load test: identity at t=0,
gradient flow on a real loss, then save -> reload onto a fresh model and
confirm the trained outputs survive the round-trip. Cheap on CPU.
"""
from __future__ import annotations
from pathlib import Path
import pytest
import torch
from torch import nn
import lora_lite as ll
CFG_BY_VARIANT = {
"lora": ll.LoRAConfig,
"pissa": ll.PiSSAConfig,
"delora": ll.DeLoRAConfig,
"ia3": ll.IA3Config,
"ia3_ff": ll.IA3FFConfig,
"dora": ll.DoRAConfig,
"hra": ll.HRAConfig,
"eva": ll.EVAConfig,
"antipasto": ll.AntiPaSTOConfig,
}
# Per-variant identity tolerance at t=0 (after attach, before any step).
# fp32 SVD round-trip + per-row norm = looser tolerance for pissa/dora/antipasto.
IDENTITY_TOL = {
"lora": 1e-6,
"pissa": 5e-4,
"delora": 1e-6,
"ia3": 1e-6,
"ia3_ff": 1e-6,
"dora": 5e-5,
"hra": 5e-6,
"eva": 1e-6,
"antipasto": 5e-4,
}
class TinyBlock(nn.Module):
def __init__(self, d: int = 64, ff: int = 128):
super().__init__()
self.q_proj = nn.Linear(d, d, bias=False)
self.k_proj = nn.Linear(d, d, bias=False)
self.v_proj = nn.Linear(d, d, bias=False)
self.o_proj = nn.Linear(d, d, bias=False)
self.gate_proj = nn.Linear(d, ff, bias=False)
self.up_proj = nn.Linear(d, ff, bias=False)
self.down_proj = nn.Linear(ff, d, bias=False)
def forward(self, x: torch.Tensor) -> torch.Tensor:
h = self.o_proj(self.q_proj(x) + self.k_proj(x) + self.v_proj(x))
m = self.down_proj(torch.nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))
return x + h + m
class TinyModel(nn.Module):
def __init__(self, n_layers: int = 4, d: int = 64, ff: int = 128, vocab: int = 100):
super().__init__()
self.embed_tokens = nn.Embedding(vocab, d)
self.layers = nn.ModuleList([TinyBlock(d, ff) for _ in range(n_layers)])
self.lm_head = nn.Linear(d, vocab, bias=False)
self.config = type("Cfg", (), {"hidden_size": d})()
def forward(self, ids: torch.Tensor) -> torch.Tensor:
x = self.embed_tokens(ids)
for block in self.layers:
x = block(x)
return self.lm_head(x)
class FakeLinearLike(nn.Module):
"""linear-like, but not nn.Linear: stand-in for bnb 4/8-bit modules."""
def __init__(self, d_in: int = 8, d_out: int = 8):
super().__init__()
self.in_features = d_in
self.out_features = d_out
self.weight = nn.Parameter(torch.empty(d_out, d_in))
nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return torch.nn.functional.linear(x, self.weight)
class FakeBnbModel(nn.Module):
def __init__(self):
super().__init__()
self.config = type("Cfg", (), {"hidden_size": 8})()
self.layers = nn.ModuleList([FakeLinearLike(8, 8)])
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.layers[0](x)
def cfg_for(variant: str) -> ll.AdapterConfig:
return CFG_BY_VARIANT[variant](
r=4,
alpha=8,
dtype=torch.float32,
)
def attach_with_calib(model: nn.Module, cfg: ll.AdapterConfig, ids: torch.Tensor) -> None:
if cfg.variant == "eva":
calib = [ids for _ in range(2)]
ll.attach(model, cfg, calibration_data=calib)
else:
ll.attach(model, cfg)
def trainable_grad_norm(model: nn.Module) -> float:
return sum(
p.grad.detach().float().norm().item()
for n, p in model.named_parameters()
if "lora_" in n and p.grad is not None
)
@pytest.mark.parametrize("variant", list(CFG_BY_VARIANT))
def test_train_save_load(variant: str, tmp_path: Path):
"""Identity at t=0, one SGD step, save, reload onto fresh model, outputs match."""
torch.manual_seed(0)
model = TinyModel()
ids = torch.randint(0, 100, (2, 16))
with torch.no_grad():
y_base = model(ids).clone()
cfg = cfg_for(variant)
attach_with_calib(model, cfg, ids)
trainable = [p for p in model.parameters() if p.requires_grad]
assert trainable
assert all("lora_" in n for n, p in model.named_parameters() if p.requires_grad)
with torch.no_grad():
y_init = model(ids).clone()
assert (y_init - y_base).abs().max().item() < IDENTITY_TOL[variant]
target = torch.randn_like(y_init) * 0.1
opt = torch.optim.SGD(trainable, lr=1e-2)
opt.zero_grad()
loss = (model(ids) - target).pow(2).mean()
loss.backward()
leaked = [n for n, p in model.named_parameters() if "lora_" not in n and p.grad is not None]
assert leaked == []
assert trainable_grad_norm(model) > 0
opt.step()
with torch.no_grad():
y_trained = model(ids).clone()
path = tmp_path / "adapter.pt"
ll.save(model, str(path))
torch.manual_seed(0)
model_loaded = TinyModel()
ll.load(model_loaded, str(path)) # EVA load skips group_init; calibration_data not needed
with torch.no_grad():
y_loaded = model_loaded(ids)
assert (y_loaded - y_trained).abs().max().item() < max(IDENTITY_TOL[variant], 1e-5)
@pytest.mark.parametrize("variant", ["lora", "delora", "ia3", "hra"])
def test_hook_only_variants_attach_to_non_linear_target(variant: str):
"""bnb-style targets are linear-like but not nn.Linear; hook-only variants must accept them."""
extra = {"lambda0": 0.1} if variant == "delora" else {}
cfg = CFG_BY_VARIANT[variant](r=2, alpha=4, dtype=torch.float32, target_roles=(), **extra)
model = FakeBnbModel()
ll.attach(model, cfg)
x = torch.randn(2, 3, 8)
model(x).pow(2).mean().backward()
assert trainable_grad_norm(model) > 0
@pytest.mark.parametrize("variant", ["pissa", "dora", "antipasto"])
def test_weight_reading_variants_reject_non_linear(variant: str):
r = 4 if variant == "antipasto" else 2 # antipasto needs r % block_size==0
cfg = CFG_BY_VARIANT[variant](r=r, alpha=r, dtype=torch.float32, target_roles=())
with pytest.raises(TypeError, match="plain nn.Linear"):
ll.attach(FakeBnbModel(), cfg)
def test_save_load_strict_keys(tmp_path: Path):
torch.manual_seed(0)
model = TinyModel()
ll.attach(model, ll.LoRAConfig(r=4, alpha=8, dtype=torch.float32))
p = tmp_path / "lora.pt"
ll.save(model, str(p))
blob = torch.load(p, weights_only=True, map_location="cpu")
missing = {"cfg": blob["cfg"], "state": dict(blob["state"]), "base_fp": blob.get("base_fp", {})}
missing["state"].pop(next(iter(missing["state"])))
torch.save(missing, p)
with pytest.raises(RuntimeError, match="missing lora keys"):
ll.load(TinyModel(), str(p))
bad = {"cfg": blob["cfg"], "state": dict(blob["state"]), "base_fp": blob.get("base_fp", {})}
bad["state"]["layers.0.q_proj.lora_extra"] = torch.zeros(1)
torch.save(bad, p)
with pytest.raises(RuntimeError, match="unexpected lora keys"):
ll.load(TinyModel(), str(p))
def test_no_target_layers_is_loud():
cfg = ll.LoRAConfig(target_names=("definitely_missing",))
with pytest.raises(RuntimeError, match="no target layers"):
ll.attach(TinyModel(), cfg)
def test_eva_requires_calibration():
"""EVA's group_init must error loudly if calibration_data is missing."""
with pytest.raises(ValueError, match="calibration_data"):
ll.attach(TinyModel(), ll.EVAConfig(r=4, alpha=8, dtype=torch.float32))
def test_delora_default_has_live_step0_gradient():
"""Default lambda0 must be nonzero; B=0 preserves identity while B gets gradient."""
torch.manual_seed(0)
model = TinyModel(n_layers=1)
ids = torch.randint(0, 100, (2, 8))
ll.attach(model, ll.DeLoRAConfig(r=4, alpha=8, dtype=torch.float32))
assert model.layers[0].q_proj.lora_lambda.item() == pytest.approx(15.0)
loss = model(ids).pow(2).mean()
loss.backward()
b_grad = model.layers[0].q_proj.lora_B.grad.detach().abs().max().item()
assert b_grad > 0
def test_pissa_identity_with_nonunit_scale():
"""Regression: PiSSA must pre-divide S by alpha/r, not require alpha == r."""
torch.manual_seed(0)
model = TinyModel(n_layers=1)
ids = torch.randint(0, 100, (2, 8))
with torch.no_grad():
y_base = model(ids).clone()
ll.attach(model, ll.PiSSAConfig(r=4, alpha=8, dtype=torch.float32))
with torch.no_grad():
y = model(ids)
assert (y - y_base).abs().max().item() < IDENTITY_TOL["pissa"]
def test_antipasto_blockwise_rotation_matches_explicit_blockdiag():
"""The einsum/rearrange path must equal the old explicit blockdiag math."""
from lora_lite.variants.antipasto import _build_rotation
torch.manual_seed(0)
n_blocks, bs, d_in, d_out = 3, 4, 7, 5
r = n_blocks * bs
rot_T = torch.randn(n_blocks, bs * (bs - 1) // 2) * 0.1
Vh = torch.randn(r, d_in)
U = torch.randn(d_out, r)
R_blocks = _build_rotation(rot_T, bs, 0.5)
R = torch.block_diag(*list(R_blocks))
Vh_blocks = torch.reshape(Vh, (n_blocks, bs, d_in))
Vh_rot = torch.einsum("nab,nbi->nai", R_blocks, Vh_blocks).reshape(r, d_in)
U_blocks = torch.reshape(U, (d_out, n_blocks, bs))
U_rot = torch.einsum("dnb,ncb->dnc", U_blocks, R_blocks).reshape(d_out, r)
assert (Vh_rot - R @ Vh).abs().max().item() < 1e-6
assert (U_rot - U @ R.T).abs().max().item() < 1e-6
def test_dora_bias_passthrough():
"""Regression: DoRA must NOT scale bias; identity holds with bias=True at t=0."""
torch.manual_seed(0)
d = 16
layer = nn.Linear(d, d, bias=True)
x = torch.randn(2, d)
y_base = layer(x).detach()
class Wrap(nn.Module):
def __init__(self, lin):
super().__init__()
self.config = type("Cfg", (), {"hidden_size": d})()
self.layers = nn.ModuleList([lin])
def forward(self, x):
return self.layers[0](x)
model = Wrap(layer)
ll.attach(model, ll.DoRAConfig(r=2, alpha=4, dtype=torch.float32, target_roles=()))
with torch.no_grad():
y = model(x)
assert (y - y_base).abs().max().item() < 1e-5
def test_hra_forward_is_x_R_T():
"""HRA must apply x @ R^T (loop i = r-1 down to 0). Asymmetric U makes order observable."""
torch.manual_seed(0)
d = 8
layer = nn.Linear(d, d, bias=False)
x = torch.randn(2, 3, d)
class Wrap(nn.Module):
def __init__(self, lin):
super().__init__()
self.config = type("Cfg", (), {"hidden_size": d})()
self.layers = nn.ModuleList([lin])
def forward(self, x):
return self.layers[0](x)
model = Wrap(layer)
ll.attach(model, ll.HRAConfig(r=4, alpha=4, dtype=torch.float32, target_roles=()))
# break paired symmetry so order matters
with torch.no_grad():
layer.lora_U.add_(0.1 * torch.randn_like(layer.lora_U))
U = layer.lora_U
R = torch.eye(d)
for i in range(U.shape[0]):
u = U[i]
sq = (u * u).sum().clamp_min(1e-12)
R = R - (2.0 / sq) * torch.outer(R @ u, u)
with torch.no_grad():
y_adapt = model(x)
y_ref = torch.nn.functional.linear(x, layer.weight @ R)
assert (y_adapt - y_ref).abs().max().item() < 1e-5
+127
View File
@@ -0,0 +1,127 @@
"""End-to-end smoke: run the metamath benchmark in probe mode for every variant.
Probe mode trains a few steps on tiny-random Llama, saves the adapter, reloads
it onto a fresh model, and asserts the trained logits match within tol. That's
the train+save+load round-trip on a real HF model, one test per variant.
A second test attaches each variant on top of a 4bit/8bit-loaded base and runs
one backward step. PiSSA/DoRA/AntiPaSTO/EVA must fail loud on quantized weights;
the rest must produce nonzero adapter grads. We do not run the full probe under
bnb because tiny-random + bnb dequant produces NaN logits unrelated to adapter
correctness.
"""
from __future__ import annotations
import importlib.util
import sys
from dataclasses import replace
from pathlib import Path
import pytest
import torch
import lora_lite as ll
SPEC = importlib.util.spec_from_file_location(
"metamath_benchmark",
Path(__file__).resolve().parent.parent / "scripts" / "metamath_gsm8k_benchmark.py",
)
benchmark = importlib.util.module_from_spec(SPEC)
sys.modules[SPEC.name] = benchmark
SPEC.loader.exec_module(benchmark)
VARIANTS = ["lora", "pissa", "delora", "ia3", "ia3_ff", "dora", "hra", "eva", "antipasto"]
# Variants that fail loud when attached on a bnb-loaded base (read dense weight in init).
# delora/eva also read weight but currently silently dequant -- they produce sane attach,
# so we don't expect a raise from them in the attach-only smoke.
BNB_RAISERS = {"pissa", "dora", "antipasto"}
TINY_MODEL = "hf-internal-testing/tiny-random-LlamaForCausalLM"
HAS_CUDA = torch.cuda.is_available()
HAS_BNB = importlib.util.find_spec("bitsandbytes") is not None
def quick_cfg(variant: str, tmp_path: Path, quantization: str = "none") -> "benchmark.BenchmarkConfig":
target_name = (
[r"(k_proj|v_proj)$"] if variant == "ia3"
else [r"(down_proj)$"] if variant == "ia3_ff"
else [r"(q_proj|v_proj)$"]
)
cfg = benchmark.BenchmarkConfig(
model=TINY_MODEL,
variant=variant,
mode="probe",
device="cuda" if HAS_CUDA else "cpu",
torch_dtype="float16" if quantization != "none" else "float32",
quantization=quantization,
r=4,
alpha=8,
target_name=target_name,
layers="all",
steps=2,
batch_size=2,
batch_size_eval=4,
max_train_samples=8,
max_eval_samples=4,
max_valid_samples=4,
max_test_samples=4,
max_seq_length=128,
max_new_tokens=8,
lr=5e-3,
seed=0,
log_examples=0,
log_every=1000,
output_dir=tmp_path / "out",
)
if variant == "antipasto":
cfg = replace(cfg, alpha=4) # block_size=4 -> need r % 4 == 0
return cfg
@pytest.mark.parametrize("variant", VARIANTS)
def test_metamath_quick_train_save_load(variant: str, tmp_path: Path):
"""Train 2 steps, save, reload onto fresh tiny model, logits match within tol."""
cfg = quick_cfg(variant, tmp_path)
result = benchmark.run(cfg)
assert result["train"]["base_grad_leaks"] == 0
assert result["train"]["first_grad_norm"] > 0
assert result["train"]["adapter_delta"] > 0
probe = result.get("probe") or {}
assert "reload_err" in probe
assert probe["reload_err"] < cfg.reload_tol
@pytest.mark.skipif(not (HAS_CUDA and HAS_BNB), reason="needs CUDA + bitsandbytes")
@pytest.mark.parametrize("quantization", ["4bit", "8bit"])
@pytest.mark.parametrize("variant", VARIANTS)
def test_attach_on_bnb_loaded_base(variant: str, quantization: str, tmp_path: Path):
"""Attach to a bnb-loaded base, run one backward step. Weight-reading variants must fail loud."""
cfg = quick_cfg(variant, tmp_path, quantization=quantization)
dtype = getattr(torch, cfg.torch_dtype)
def _do() -> float:
model, _ = benchmark.load_model_and_tokenizer(cfg.model, dtype, cfg.device, cfg.quantization)
adapter_cfg = benchmark.cfg_for_variant(cfg, dtype)
if cfg.variant == "eva":
ids = torch.randint(0, 100, (2, 8), device=cfg.device)
ll.attach(model, adapter_cfg, calibration_data=[{"input_ids": ids}])
else:
ll.attach(model, adapter_cfg)
ids = torch.randint(0, 100, (2, 8), device=cfg.device)
out = model(input_ids=ids).logits
loss = out.float().pow(2).mean()
loss.backward()
return sum(
p.grad.detach().float().norm().item()
for n, p in model.named_parameters()
if "lora_" in n and p.grad is not None
)
if variant in BNB_RAISERS:
with pytest.raises((TypeError, RuntimeError, AttributeError, ValueError)):
_do()
else:
_do() # only assert it runs without exception; tiny+bnb grads can be 0/garbage.