tidy, review

wassname
2026-04-27 07:03:24 +08:00
parent a44fc039af
commit 74c374e741
22 changed files with 4425 additions and 7727 deletions
+754
@@ -0,0 +1,754 @@
yes make a task to deep review each one against how it's implemented. so one
task for each. please give me the pseudocode of 1) reference code 2)
reference paper 3) ours and compare.
Searched for 1 pattern, read 1 file (ctrl+o to expand)
● 1. LoRA review
Paper (Hu et al. 2021):
forward:
h = W·x + (α/r) · B·A·x # B∈R^{d×r}, A∈R^{r×d}
init:
A ~ Gaussian (or kaiming), B = 0
peft peft_lora_layer.py:941-969:
forward(x):
result = base_layer(x) # W·x + bias
for adapter in active_adapters:
result += lora_B(lora_A(dropout(x))) * scaling # scaling = α/r, or α/√r if rslora
init (reset_lora_parameters, line 263):
kaiming_uniform_(lora_A, a=√5); zeros_(lora_B)
ours src/lora_lite/variants/lora.py:
param_specs:
lora_A: (r, d_in) init=kaiming
lora_B: (d_out, r) init=zeros
forward(x, y): # y = base_layer(x) already computed by host
scale = α/r
h = einsum(x, A, "... i, r i -> ... r")
delta = einsum(h, B, "... r, o r -> ... o")
return y + scale·delta
Diff vs peft:
- ✓ math identical: scale·B·A·x + y
- ✗ no dropout (peft has lora_dropout)
- ✗ no rslora (α/√r)
- ✗ no init_lora_weights="gaussian" option (only kaiming)
- ✓ B=0 zero-init → identity at t=0
Verdict: correct vanilla LoRA. Missing dropout and rslora are deliberate scope
cuts (not bugs). No fix needed.
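The "ours" forward above can be sketched in plain PyTorch. This is an illustration under assumptions, not the library's code: it uses torch.einsum instead of einops, and the names lora_forward, out_init, out_trained and ref are made up for the example.

```python
import torch

def lora_forward(x, y, A, B, alpha, r):
    """y is the frozen base output W·x; A: (r, d_in), B: (d_out, r)."""
    scale = alpha / r
    h = torch.einsum("...i,ri->...r", x, A)      # project down to rank r
    delta = torch.einsum("...r,or->...o", h, B)  # project back up to d_out
    return y + scale * delta

torch.manual_seed(0)
d_in, d_out, r, alpha = 8, 6, 4, 8.0
W = torch.randn(d_out, d_in, dtype=torch.float64)
A = torch.randn(r, d_in, dtype=torch.float64)
x = torch.randn(3, d_in, dtype=torch.float64)
y = x @ W.T

# B = 0 at init, so the adapter is the identity on the base output.
out_init = lora_forward(x, y, A, torch.zeros(d_out, r, dtype=torch.float64), alpha, r)

# With a non-zero B the result equals a dense layer with W + (alpha/r)·B·A.
B = torch.randn(d_out, r, dtype=torch.float64)
out_trained = lora_forward(x, y, A, B, alpha, r)
ref = x @ (W + (alpha / r) * B @ A).T
```

The second check is the same algebra as the "math identical" bullet: adding scale·B·A·x to y matches folding the update into W.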
Read 1 file (ctrl+o to expand)
● 2. DoRA review
Paper (Liu et al. 2024, §4.1, §4.3):
W' = m · V/||V||_c, V = W + ΔV = W + (α/r)·B·A # ||·||_c = per-row L2
init: m = ||W||_c, B = 0 → W' = W
§4.3 trick: treat ||V+ΔV||_c as a CONSTANT (detach from grad graph)
for stability + speed.
peft peft_lora_dora.py:132-166:
forward(x):
lora_W = B·A
weight_norm = ||W + α/r·lora_W.detach()||_c # << DETACHED
weight_norm = weight_norm.detach() # << belt+braces
mag_norm_scale = (m / weight_norm)
base_result = W·x (or passed in, minus bias)
return (mag_norm_scale - 1)·base_result + mag_norm_scale · α/r · B·A·x
# bias re-added outside
ours src/lora_lite/variants/dora.py:53-71:
V = W + scale·B·A # << NOT detached
v_norm = V.norm(dim=1).clamp_min(1e-12) # << NOT detached
wx = y - bias if bias else y
delta_lora = scale · B·A·x
combined = wx + delta_lora
out = (m / v_norm) · combined
return out + bias if bias else out
Diff vs peft:
- ✗ Bug: missing .detach() on the lora term inside weight_norm. Paper §4.3
says to treat ||V+ΔV||_c as a constant during backprop; peft detaches twice
(line 144 lora_weight.detach() and line 152 weight_norm.detach()). Without it,
grads flow through the norm denominator, which costs extra training memory
(the overhead §4.3 is meant to remove) and gives gradients that differ from
peft's.
- ✓ math otherwise equivalent: (m/||V||)·(Wx + α/r·BAx) = (m/||V||-1)·Wx +
(m/||V||)·α/r·BAx + Wx. Both forms produce same output.
- ✓ bias handling matches peft (subtract before scaling, add after).
- ✓ m init ||W||_c matches.
- = nn.Linear-only restriction is a deliberate scope cut (peft handles bnb via
dequantize_module_weight).
Recommended fix: add .detach() to V (or equivalently v_norm.detach()) before
dividing.
V = layer.weight + scale * BA.detach()
v_norm = V.norm(dim=1).clamp_min(1e-12).detach()
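A small check of what the detach changes, as a hedged sketch rather than the library's code. The function dora_forward and its detach_norm flag are hypothetical names for this example; B is deliberately non-zero so that the norm path carries gradient.

```python
import torch

def dora_forward(x, W, A, B, m, scale, detach_norm=True):
    """(m / ||V||_row) · (W·x + scale·B·A·x), bias omitted for brevity."""
    BA = B @ A
    V = W + scale * (BA.detach() if detach_norm else BA)
    v_norm = V.norm(dim=1).clamp_min(1e-12)
    if detach_norm:
        v_norm = v_norm.detach()
    combined = x @ W.T + scale * (x @ A.T @ B.T)
    return (m / v_norm) * combined

torch.manual_seed(0)
d_in, d_out, r = 8, 6, 2
W = torch.randn(d_out, d_in, dtype=torch.float64)
A = torch.randn(r, d_in, dtype=torch.float64, requires_grad=True)
B = torch.randn(d_out, r, dtype=torch.float64, requires_grad=True)
m = W.norm(dim=1).clone().requires_grad_(True)
x = torch.randn(4, d_in, dtype=torch.float64)

out_det = dora_forward(x, W, A, B, m, 0.5, detach_norm=True)
out_raw = dora_forward(x, W, A, B, m, 0.5, detach_norm=False)

# Forward values are unchanged by detach; only the backward graph differs.
same_forward = torch.equal(out_det, out_raw)
grad_A_det, grad_m_det = torch.autograd.grad(out_det.sum(), (A, m))
(grad_A_raw,) = torch.autograd.grad(out_raw.sum(), (A,))
```

With the detach, A and m still receive gradient, but grad_A_det no longer contains the term that comes from differentiating the norm.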
Read 2 files (ctrl+o to expand)
● 3. PiSSA review
Paper (Meng et al. 2024):
W = U S V^T (truncated to top-r)
A = √S · V^T, B = U · √S → B·A = U S V^T
W_res = W - B·A
forward: W_res·x + (α/r)·B·A·x → at t=0 equals W·x
peft peft_lora_layer.py:360-393:
V, S, Uh = svd(W) # V is left, Uh is right (peft naming)
Sr = S[:r] / scaling # << pre-divide by α/r
A = diag(√Sr) @ Uh[:r] # absorbs 1/√scaling
B = V[:, :r] @ diag(√Sr) # absorbs 1/√scaling
W_res = W - scaling · B·A # net: W - V[:,:r]·S[:r]·Uh[:r] (scaling cancels)
forward: W_res·x + scaling·B·A·x
ours src/lora_lite/variants/pissa.py:45-76:
U, S, Vh = svd(W)
B = U[:,:r] · √S[:r] # NO scaling pre-division
A = √S[:r] · Vh[:r] # NO scaling pre-division
W_res = W - scale · B·A # = W - scale · U·S·Vh (NOT exactly W - U·S·Vh)
forward: W_res·x + scale · B·A·x # net: W·x at t=0 ✓
Diff vs peft:
- ✓ Identity at t=0 holds for any α, up to floating-point round-trip (algebra
cancels symmetrically): both peft and ours satisfy W_res + scale·BA = W.
- ⚠ A/B starting magnitudes differ when α≠r. peft pre-divides S/scale so
||B||,||A|| ∝ √(S/scale); ours has ||B||,||A|| ∝ √S. Update dynamics (α/r ·
ΔB·A) therefore differ by a factor of √scale per matrix. For α=r (scale=1)
they're identical — and the docstring already says "Pass alpha=r for
paper-faithful scale=1".
- ✓ Linear-only restriction is a deliberate scope cut.
- ✓ cfg.dtype cast on A,B before storage matches peft's to(dtype) after save.
- ✓ fp32 SVD then (W - scale·BA) in fp32 before bf16 cast — comment "fp32
subtraction so W_res stays accurate" is honoured. peft does same via
weight.to(torch.float32).
Verdict: correct for α = r. For α ≠ r, behaves differently from peft (still
mathematically valid PiSSA, just different parameter scaling of A/B). Either
fix (apply Sr /= scale like peft) or tighten the docstring to "α=r required"
and assert it. Mild fix; not a correctness bug.
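The peft-style pre-division can be sketched as below. This is a minimal illustration, not the library's init: pissa_init is a hypothetical helper, everything runs in float64, and the dtype casts discussed above are left out.

```python
import torch

def pissa_init(W, r, alpha):
    """Top-r SVD init with S pre-divided by scale = alpha / r."""
    scale = alpha / r
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    sqrt_s = (S[:r] / scale).sqrt()
    B = U[:, :r] * sqrt_s           # (d_out, r)
    A = sqrt_s[:, None] * Vh[:r]    # (r, d_in)
    W_res = W - scale * (B @ A)     # scale cancels: W - U_r diag(S_r) Vh_r
    return A, B, W_res, scale

torch.manual_seed(0)
W = torch.randn(12, 10, dtype=torch.float64)
r = 4
errs, rescaled_norms = [], []
for alpha in (r / 2, r, 2 * r):
    A, B, W_res, scale = pissa_init(W, r, alpha)
    # Reconstruction error of W_res + scale·B·A against the original W.
    errs.append((W_res + scale * (B @ A) - W).abs().max().item())
    # ||A|| shrinks as 1/sqrt(scale); multiplying back gives an alpha-free value.
    rescaled_norms.append((A.norm() * scale ** 0.5).item())
```

The reconstruction error stays at round-off for every alpha, while A and B carry a 1/√scale factor, which is the magnitude difference the ⚠ bullet describes.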
Read 1 file (ctrl+o to expand)
● 4. DeLoRA review
Paper (Bini et al. 2025, ICLR'25):
W' = W + (λ/r) · B · Ξ · A, Ξ_ii = 1/(||b_i|| · ||a_i||)
init: B=0 → ΔW=0
default λ=15
Crucial: w_norm is per-input-channel ||W||_2 (col-norm), captured ONCE at init
(frozen reference) — not Frobenius scalar.
peft peft_delora_layer.py:226-264:
forward(x):
h = (x * w_norm) @ A.T # broadcast w_norm over input dim
An = clamp(A.norm(dim=1), 1e-4)
Bn = clamp(B.norm(dim=0), 1e-4)
scale_r = (λ/r) / (An * Bn) # (r,)
h = h * scale_r
h = h @ B.T
return base_out + h
init: A ← kaiming(√5), B ← 0, λ ← 15.0, w_norm ← ||W||_2 col (frozen buffer)
ours src/lora_lite/variants/delora.py:55-81:
forward(x, y):
x_scaled = x * lora_wnorm
h = einsum(x_scaled, A, "... i, r i -> ... r") # = (x * w_norm) @ A.T
An = clamp(A.norm(dim=1), 1e-4)
Bn = clamp(B.norm(dim=0), 1e-4)
scale_r = (λ/r) / (An * Bn)
h = h * scale_r
delta = einsum(h, B, "... r, o r -> ... o") # = h @ B.T
return y + delta
init: A ← kaiming(√5), B ← 0, λ ← lambda0 (default 0.0, NOT 15.0), lora_wnorm ← ||W||_2 col (frozen buffer)
Diff vs peft:
- ✓ math line-for-line identical: same dim-0 col-norm, same per-rank A/B
normalization, same gradient path (norms not detached, A,B receive gradient
through them — matches peft exactly).
- ✓ frozen lora_wnorm buffer (peft uses BufferDict, ours uses as_buffer=True).
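- ⚠ default lambda0=0.0 differs from peft's 15.0. Docstring is honest about
this ("0.0 = bit-identity at t=0, but A,B get zero grad until lambda moves").
At λ=0, gradients on A,B vanish since delta ∝ λ, so training stalls until λ
moves off zero. Note that if B is also exactly 0 at init, as in the pseudocode
above, delta is zero for every λ, so ∂delta/∂λ is zero too and the gradient
alone cannot move λ; worth checking against the real init. This is documented
but is a sharp foot-gun. peft's 15.0 default avoids it.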
- ✓ even-r restriction not relevant here (HRA only).
- ✓ nn.Linear-only restriction documented.
Verdict: correct. The lambda0 default is a deliberate "bit-identity but
stalls" choice — keep, but consider noting in docstring that for actual
training you'd want λ0≈1-15 (paper uses 15).
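The λ=0 foot-gun can be checked numerically. The sketch below follows the pseudocode above (B=0 at init, norms clamped at 1e-4) and is an assumption about the real module, not a copy of it; delora_delta and grads_at_init are hypothetical names. A linear loss is used so that a zero gradient cannot be blamed on the loss itself.

```python
import torch

def delora_delta(x, A, B, lam, w_norm, r):
    """(lam/r) · B · Xi · A applied to w_norm-scaled inputs."""
    h = (x * w_norm) @ A.T
    An = A.norm(dim=1).clamp_min(1e-4)
    Bn = B.norm(dim=0).clamp_min(1e-4)
    h = h * (lam / r) / (An * Bn)
    return h @ B.T

def grads_at_init(lam0):
    torch.manual_seed(0)
    d_in, d_out, r = 8, 6, 4
    W = torch.randn(d_out, d_in, dtype=torch.float64)
    w_norm = W.norm(dim=0)  # per-input-channel norm, frozen
    A = torch.randn(r, d_in, dtype=torch.float64, requires_grad=True)
    B = torch.zeros(d_out, r, dtype=torch.float64, requires_grad=True)
    lam = torch.tensor(lam0, dtype=torch.float64, requires_grad=True)
    x = torch.randn(5, d_in, dtype=torch.float64)
    target = torch.randn(5, d_out, dtype=torch.float64)
    loss = (delora_delta(x, A, B, lam, w_norm, r) * target).sum()
    gA, gB, glam = torch.autograd.grad(loss, (A, B, lam))
    return gA.abs().max().item(), gB.abs().max().item(), glam.abs().item()

dead = grads_at_init(0.0)    # max |grad| for (A, B, lambda) at lambda0 = 0
live = grads_at_init(15.0)   # same, at lambda0 = 15
```

In this sketch every gradient is zero at λ0=0, including the one on λ, whereas at λ0=15 the gradient on B is non-zero and training can start from B, as in vanilla LoRA.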
Searched for 1 pattern, read 1 file (ctrl+o to expand)
● 5. EVA review
Paper (Paischer et al. 2024):
For each target layer l:
Collect inputs x ∈ R^{N × d_in} on calibration batches
Vh = right_singular_vectors(x) # top-r principal input dirs
A_l ← Vh[:r] # data-driven init
B_l ← 0
forward = standard LoRA
Optional: rank redistribution by explained variance (ρ-budget across layers)
Optional: hash-dedup layers with identical inputs (Q/K/V share x)
peft peft_eva.py:
For each calibration batch:
IncrementalPCA.partial_fit(layer_input) # streaming, memory-efficient
Hash inputs; dedup layers seeing same x.
Track convergence via cossim(components_t, components_{t-1})
After:
Redistribute ranks by explained_variance_ratio (ρ scale)
A_l ← svd.components_[:rank_l] # top-rank rows
B_l ← 0
forward = vanilla LoRA path (`scale·B@A·x`)
ours src/lora_lite/variants/eva.py:
group_init(model, targets, calibration_data):
hooks capture pre-forward inputs per layer
for batch in calibration_data: model(batch)
for each target:
X = cat(captured) # (N, d_in), float32 on cpu
_,_,Vh = torch.linalg.svd(X, full_matrices=False)
A ← Vh[:r] # uniform r per layer
forward(x, y):
return y + (α/r) · (B @ A @ x) # B=0 at t=0 → identity
Diff vs peft:
- ✓ Top-r right singular vectors of pooled inputs → A. Math identical.
- ✓ B=0 init → identity at t=0.
- ✓ A,B both trainable post-init (peft same).
- ✗ Full SVD on cpu-pooled tensor vs. peft's IncrementalPCA — limits us to
small calibration sets (we cat all activations in RAM). The docstring
acknowledges this.
- ✗ No rank redistribution (rho param). All targets get same r. Acknowledged.
- ✗ No QKV input dedup. Acknowledged. Cost: 3× SVD work for layers that share
input. Correct, just slower.
- ✗ No convergence check. Single pass over calibration_data. Acknowledged.
- ⚠ Captured tensors moved to cpu+fp32 — fine, but a noisy 1k-batch on a 70B
model = lots of RAM. Just a scaling note.
Verdict: correct minimal EVA, all simplifications stated up-front in
docstring. No bugs. If you ever want to scale to large calibration corpora,
swap the torch.cat → svd for IncrementalPCA.
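One way to avoid concatenating all activations is to accumulate the Gram matrix XᵀX per batch and take its top eigenvectors, which span the same subspace as the top right singular vectors of X. This is only a sketch under the assumption that a (d_in, d_in) matrix fits in memory; it is not peft's IncrementalPCA, and eva_A_from_svd / eva_A_streaming are hypothetical names.

```python
import torch

def eva_A_from_svd(X, r):
    """Top-r right singular vectors of the pooled inputs X: (N, d_in)."""
    _, _, Vh = torch.linalg.svd(X, full_matrices=False)
    return Vh[:r]

def eva_A_streaming(batches, r):
    """Same subspace from a running Gram matrix, O(d_in^2) memory."""
    gram = None
    for xb in batches:
        g = xb.T @ xb
        gram = g if gram is None else gram + g
    _, evecs = torch.linalg.eigh(gram)      # eigenvalues in ascending order
    return evecs[:, -r:].flip(-1).T         # (r, d_in), largest first

torch.manual_seed(0)
d_in, r = 16, 4
# Synthetic activations: variance concentrated in the first four coordinates.
col_scale = torch.tensor([10.0] * 4 + [0.1] * 12, dtype=torch.float64)
batches = [torch.randn(64, d_in, dtype=torch.float64) * col_scale for _ in range(4)]

A_svd = eva_A_from_svd(torch.cat(batches), r)
A_stream = eva_A_streaming(batches, r)
# Compare projectors, since individual vectors are only defined up to sign.
proj_svd = A_svd.T @ A_svd
proj_stream = A_stream.T @ A_stream
```

Forming XᵀX squares the condition number, so this trades numerical accuracy for memory; that is acceptable for a sketch but worth measuring before relying on it.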
Read 1 file (ctrl+o to expand)
● 6. HRA review
Paper (Yuan et al. 2024):
W' = W·R, R = ∏_{i=0..r-1} H_i, H_i = I - 2·u_i·u_i^T / ||u_i||^2
init (paper): independent random columns
peft variant: paired columns (u_{2k}=u_{2k+1}) → R=I exactly at t=0, even r
GS variant (apply_GS=True): explicit Gram-Schmidt for strict orthogonality
Loss-side: orthogonality regularizer (Eq. 6); not part of layer.
peft peft_hra_layer.py:208-265:
hra_u: (d_in, r) # COLUMNS are vectors
get_delta_weight():
opt_u = opt_u / opt_u.norm(dim=0) # pre-normalize columns
R = I_{d_in}
for i in 0..r-1: # forward order
u = opt_u[:, i]
R = R - 2·R·u·u^T # R ← R · H_i
forward(x):
R = product over active adapters
return F.linear(x, W @ R, bias) # = x · R^T · W^T (effectively R applied to x)
init: half_u = kaiming((d_in, r/2)); u = repeat_interleave(half_u, 2, dim=1) # cols 0=1, 2=3, ...
ours src/lora_lite/variants/hra.py:53-75:
lora_U: (r, d_in) # ROWS are vectors (transpose convention)
forward_input(x):
Rx = x
for i = r-1 downto 0: # reverse order, applied directly
u = U[i]
sq = (u·u).sum().clamp_min(1e-12)
coeff = (Rx · u) · (2/sq)
Rx = Rx - coeff · u # Rx ← Rx · H_i
return Rx # then base layer applies W to this
init: half = kaiming((r/2, d_in)); U = repeat_interleave(half, 2, dim=0) # rows 0=1, 2=3, ...
Diff vs peft:
- ✓ Math equivalent. peft computes R = H_0·H_1·...·H_{r-1} (col-vector
convention), then applies via x·R^T = x·H_{r-1}·...·H_0 (since H is
symmetric). Ours iterates i = r-1 → 0 doing Rx ← Rx·H_i, computing
x·H_{r-1}·...·H_0. Same.
- ✓ symmetric init: row pairs vs col pairs — consistent with our row-vector
convention.
- ✓ identity at t=0: paired u_a=u_b → H_a·H_a=I (involution). R=I.
- ⚠ No apply_GS=True path. peft offers Gram-Schmidt orthogonalization.
Docstring acknowledges via "Note: paper's orthogonality regularizer (Eq. 6) is
loss-side". Ours uses pure Householder products (which already give exactly
orthogonal R at any step), so GS is not needed for orthogonality of R; it only
matters if you want the reflection vectors u_i themselves to be mutually
orthogonal. Fine to omit.
- ✓ More efficient than peft. Avoids materializing (d_in × d_in) R matrix;
applies reflections directly to x in O(r·d_in) not O(r·d_in²).
- ⚠ Slight gradient-flow difference. peft pre-normalizes opt_u /
||opt_u||_col, so gradients flow through normalization. Ours keeps ||u||² in
the denominator each step, gradients flow through that. Both define the same
orthogonal H_i; gradients should give equivalent training to within numerical
noise, but not bit-identical.
- ✓ even-r requirement raised loudly (peft warns + falls back; we raise —
preferable for fail-fast).
Verdict: correct + faster than peft. Ordering bug from 67a6daf is properly
fixed (verified: i = r-1 → 0 iteration matches peft's R = H_0..H_{r-1} then
x·R^T).
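The ordering argument can be checked directly. The sketch below is written under the conventions stated in this review (rows of U are the reflection vectors; the reference builds R = H_0·…·H_{r-1} and applies x·Rᵀ); apply_reflections and explicit_R are hypothetical helper names, not the library's or peft's functions.

```python
import torch

def apply_reflections(x, U):
    """Apply H_{r-1}, ..., H_0 to row vectors x without building R."""
    Rx = x
    for i in range(U.shape[0] - 1, -1, -1):
        u = U[i]
        sq = (u * u).sum().clamp_min(1e-12)
        coeff = (Rx * u).sum(dim=-1, keepdim=True) * (2.0 / sq)
        Rx = Rx - coeff * u                 # Rx <- Rx · H_i
    return Rx

def explicit_R(U):
    """Reference path: R = H_0 · H_1 · ... · H_{r-1} as a dense matrix."""
    R = torch.eye(U.shape[1], dtype=U.dtype)
    for i in range(U.shape[0]):
        u = U[i] / U[i].norm()
        R = R - 2.0 * (R @ u)[:, None] * u[None, :]   # R <- R · H_i
    return R

torch.manual_seed(0)
d_in, r = 8, 4
x = torch.randn(3, d_in, dtype=torch.float64)
U = torch.randn(r, d_in, dtype=torch.float64)

out_seq = apply_reflections(x, U)
R = explicit_R(U)
out_mat = x @ R.T

# Paired init: rows 0=1, 2=3, so each pair of reflections cancels.
U_paired = torch.randn(r // 2, d_in, dtype=torch.float64).repeat_interleave(2, dim=0)
out_paired = apply_reflections(x, U_paired)
```

The sequential path costs O(r·d_in) per input row, while the dense path needs the (d_in, d_in) matrix R.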
Read 1 file (ctrl+o to expand)
● 7. IA3 review
Paper (Liu et al. 2022):
y = l ⊙ (W·x) for attention K, V (output-side gate)
y = W_2·(l_ff ⊙ γ(W_1·x)) for FF (mid-FFN gate, T5)
init: l = 1
peft peft_ia3_layer.py:161-188 (single class with is_feedforward flag):
if is_feedforward:
g shape (1, d_in)
interm = x * g
return base_layer(interm) # input-side
else:
g shape (d_out, 1)
result = base_layer(x)
return result * g # output-side
init: g ← ones (when init_ia3_weights=True, default)
default targets: ["k_proj", "v_proj", "down_proj"];
feedforward_modules=["down_proj"]
ours src/lora_lite/variants/ia3.py:38-76 (two classes IA3, IA3FF):
IA3 (output-side, e.g. k_proj/v_proj):
g shape (d_out,); init ones
forward(x, y) = y * g
IA3FF (input-side, e.g. down_proj/fc2):
g shape (d_in,); init ones
forward_input(x) = x * g # base_layer applied
after by host
Diff vs peft:
- ✓ math identical to peft's two branches.
- ✓ init=ones matches.
- ✓ Splitting into two classes (rather than is_feedforward flag) is cleaner —
fail-fast: which class you register fixes the side, no boolean to misset.
- ⚠ peft's output-side path also scales base_layer.bias during merge. We
don't merge (forward-only adapter), so no issue at training time — but if you
ever add merge, watch this.
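- ⚠ docstring example uses regex r"\\.k_proj$". If the doubled backslash is
really in the source, a Python raw string makes this the pattern \\.k_proj$:
a literal backslash, any one character, then k_proj at end of string, not a
literal ".k_proj" suffix; the intended pattern would be r"\.k_proj$". Also
check that target_names is treated as regex by your attach() (matches the
behavior in variant.py).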
- ⚠ Position-on-Llama: peft applies "ff" gate at down_proj input, which is
gate*up (post-SiLU). This deviates from paper's T5 position (post-activation
of single W_1) but matches peft's default for Llama-style. Acknowledged via
docstring example.
Verdict: correct, cleaner than peft. No bugs.
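The merge caveat above can be made concrete. The sketch below is a hypothetical example with a plain nn.Linear, not the library's code: it shows which parts of the layer each gate folds into if a merge step were ever added.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
lin = torch.nn.Linear(8, 6).double()
x = torch.randn(3, 8, dtype=torch.float64)
g_out = torch.rand(6, dtype=torch.float64) + 0.5   # output-side gate (IA3)
g_in = torch.rand(8, dtype=torch.float64) + 0.5    # input-side gate (IA3FF)

with torch.no_grad():
    # Output-side: y * g scales rows of W AND the bias.
    out_side = lin(x) * g_out
    merged_out = F.linear(x, lin.weight * g_out[:, None], lin.bias * g_out)

    # Input-side: base(x * g) scales columns of W; the bias is untouched.
    in_side = lin(x * g_in)
    merged_in = F.linear(x, lin.weight * g_in[None, :], lin.bias)
```

Forgetting the bias factor in the output-side merge is the failure mode the ⚠ bullet warns about; the input-side merge has no such term.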
ok! so what do we need to fix?
● Fixes needed (from reviews 1-7)
Real bug (1):
- dora.py:62-63: missing .detach() on the LoRA path inside the norm. Paper
§4.3 explicitly says treat ||V+ΔV||_c as a constant; peft detaches twice.
Without it, grads flow through the norm denominator, which costs extra
training memory (the overhead §4.3 is meant to remove) and gives gradients
that differ from peft's.
V = layer.weight + scale * BA.detach() # detach BA here
v_norm = V.norm(dim=1).clamp_min(1e-12).detach() # belt & braces
Behavioral / docstring (2):
- pissa.py — at α≠r, A,B starting magnitude differs from peft by √scale per
matrix (peft pre-divides S/scale). Either match peft (Sr /= scale before
computing √Sr factors) or assert cfg.alpha == cfg.r and tighten the docstring.
- delora.py:36 — lambda0=0.0 default is bit-identity but a foot-gun: at λ=0
the gradient on A,B is zero, so training is dead until λ moves. Either change
default to peft's 15.0, or expand the existing comment to flag this for new
users.
Clean (4): lora, eva, hra, ia3 — no fixes.
Pending: antipasto.py (review #8 not yet done; want me to continue?).
✻ Perused for 5m 30s
sure do antipasto now...
Read 1 file (ctrl+o to expand)
● 8. AntiPaSTO review
Reference (antipasto3, Flax):
kernel = U·diag(S)·Vh + W_res # kernel: (in, out)
init U,S,Vh,W_res = svd(kernel) (frozen, fp32 / bf16 W_res)
trainable:
delta_s truncated_normal·4e-4 + 4e-4 # small POSITIVE bias (sym-break)
rot_T truncated_normal·1e-4 # small random
forward(x, alpha):
A = scatter(rot_T into upper-triu of (n_blocks, bs, bs))
A = 0.5·(A - A^T); A = a_lim·tanh(A/a_lim)
X = α·A/2 # << α scales rotation
R = (I-X)^-1 (I+X) # block Cayley
if rotate_U: U_rot = U @ R # einsum, never builds full (r,r)
if rotate_V: Vh_rot = R @ Vh
S_scaled = S + α·delta_s
return (x @ U_rot)·S_scaled @ Vh_rot + x @ W_res
Identity property: bidirectional symmetry — R(α=+1) and R(α=-1) are exact
inverses.
At α=0: U,S,Vh untouched → kernel reproduced.
ours src/lora_lite/variants/antipasto.py:
weight (pytorch): (d_out, d_in); SVD: U:(d_out,r), S:(r,), Vh:(r,d_in)
init: U,S,Vh frozen buffers; W_res mutates layer.weight (PiSSA-style)
trainable:
delta_s = 0
rot_T = 0 # << EXACT identity at t=0
forward(x, y):
A = scatter(rot_T) # (n_blocks, bs, bs)
A = 0.5·(A - A^T); A = a_lim·tanh(A/a_lim)
X = A/2 # << no α
R_blocks = Cayley(X)
R = block_diag(R_blocks) # << builds full (r,r) via for-loop
if rotate_basis=="V": Vh_eff = R @ Vh
elif rotate_basis=="U": U_eff = U @ R^T
S_eff = S + delta_s
h = (x @ Vh_eff^T) · S_eff
delta = h @ U_eff^T
return y + delta # y = x @ W_res^T already
Identity at t=0: rot_T=0 → R=I, delta_s=0 → delta = x @ Vh^T @ diag(S) @ U^T =
x @ W_svd^T → y + delta reproduces W·x up to floating-point round-off.
Diff vs antipasto3:
- ⚠ No alpha steering coefficient. antipasto3's headline feature —
bidirectional symmetry (R(+α)·R(-α)=I) — is dropped. If you intend antipasto
as a fine-tuning-only adapter, this is a deliberate scope cut; if you ever
want bidirectional steering at inference, you'd need to thread an α through
the forward.
- ⚠ Rotation chirality flipped. antipasto3 rotates the input basis by R (U @
R); ours rotates by R^T (Vh_eff = R @ Vh ⇒ (x @ Vh^T) @ R^T). Same
orthogonal-group element up to inversion, so gradient-equivalent under sign
flip of rot_T. Not a correctness bug, but checkpoints aren't transferable
between the two impls without a sign flip.
- ✓ Stricter t=0 identity. Ours zero-inits both delta_s and rot_T → exact
bit-identity. antipasto3 nudges delta_s ≈ +4e-4 and rot_T 1e-4 for symmetry
breaking.
Foot-gun: at exact zero rotation, ∂loss/∂rot_T may be sign-symmetric (rotating
in either direction gives same loss to first order), so optimizer can stall
on rot_T while delta_s carries the load. antipasto3's small bias avoids this.
Worth a docstring note.
- ✗ Perf: full (r,r) block-diagonal materialized via Python for-loop in
_block_diag. antipasto3 uses einsum('dnb,nbc->dnc', U_reshaped, R_blocks) to
apply each block directly without ever building the (r,r) matrix. For r=64,
bs=4 this is 16 rotation blocks; the for-loop costs O(r²) memory plus Python
overhead per block. Easy fix:
# rotate V (input basis): replace block_diag + matmul with einsum
Vh_blocks = Vh.reshape(n_blocks, bs, d_in)
Vh_eff = einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i").reshape(r, d_in)
- ⚠ rotate_basis enum ("V" XOR "U") vs. antipasto3's two booleans (can do
both). Minor expressivity loss.
- ✓ Cayley, skew construction, tanh angle clamping all match antipasto3
exactly (incl. the 0.5 factor on (A - A^T)).
- ✓ W_res via PiSSA-style in-place mutation of layer.weight is more
memory-efficient than antipasto3's separate buffer; semantically identical.
- ⚠ arxiv URL: https://arxiv.org/abs/2601.07473 — earlier I confirmed it
returns HTTP 200, so the paper exists. Fine.
Verdict: correct as a fine-tuning adapter with stricter t=0 identity. Two
divergences from antipasto3 worth deciding on:
1. Chirality: trivial sign-flip in skew if you want checkpoint compat with
antipasto3 — otherwise leave.
2. No α: if antipasto needs to support inference-time steering (the paper's
selling point), thread an α scalar through the forward and S_eff = S +
α·delta_s; X = α·A/2.
And one perf fix: kill _block_diag, use einsum.
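Both points, the per-block einsum and the α-threaded Cayley map, can be sketched together. This is an illustration under assumptions rather than either implementation: cayley_blocks is a hypothetical helper, the input is an unconstrained (n_blocks, bs, bs) tensor instead of the packed rot_T parameter, and torch.einsum stands in for einops.

```python
import torch

def cayley_blocks(A_raw, alpha=1.0, a_lim=1.0):
    """Per-block rotation R = (I - X)^-1 (I + X) with X = alpha * A / 2."""
    A = 0.5 * (A_raw - A_raw.transpose(-1, -2))   # skew-symmetrize
    A = a_lim * torch.tanh(A / a_lim)             # bound entries, stays skew
    X = alpha * A / 2
    eye = torch.eye(A.shape[-1], dtype=A.dtype).expand_as(A)
    return torch.linalg.solve(eye - X, eye + X)

torch.manual_seed(0)
n_blocks, bs, d_in = 4, 4, 10
r = n_blocks * bs
raw = torch.randn(n_blocks, bs, bs, dtype=torch.float64)
Vh = torch.randn(r, d_in, dtype=torch.float64)

R_blocks = cayley_blocks(raw)                     # alpha = +1
R_inverse = cayley_blocks(raw, alpha=-1.0)        # alpha = -1

# Per-block application, never materializing the (r, r) matrix.
Vh_eff = torch.einsum(
    "nab,nbi->nai", R_blocks, Vh.reshape(n_blocks, bs, d_in)
).reshape(r, d_in)
# Dense reference for comparison only.
Vh_ref = torch.block_diag(*R_blocks) @ Vh

eye_blocks = torch.eye(bs, dtype=torch.float64).expand(n_blocks, bs, bs)
```

Because X(−α) = −X(α), R(+α)·R(−α) collapses to the identity, which is the bidirectional symmetry described for antipasto3.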
ok now make a task for each fix and fix pls
● Update(src/lora_lite/variants/dora.py)
 Added 4 lines, removed 3 lines
57 ) -> Float[T, '*B o']:
58 cfg = layer._lora_cfg
59 scale = cfg.alpha / cfg.r
60 - # V = W + scale * B @ A
60 + # Paper §4.3: treat ||V+ΔV||_c as a constant (detach from grad graph) for
61 + # stability and ~2x lower memory. Match peft (lora_weight.detach + weight_norm.detach).
62 BA = einsum(layer.lora_B, layer.lora_A, "o r, r i -> o i")
62 - V = layer.weight + scale * BA # (d_out, d_in)
63 - v_norm = V.norm(dim=1).clamp_min(1e-12) # (d_out,)
63 + V = layer.weight + scale * BA.detach() # (d_out, d_in)
64 + v_norm = V.norm(dim=1).clamp_min(1e-12).detach() # (d_out,)
65 # Bias passes through unscaled (matches peft).
66 bias = getattr(layer, "bias", None)
67 wx = y if bias is None else (y - bias)
● Update(src/lora_lite/variants/pissa.py)
 Added 4 lines, removed 3 lines
3 Meng et al. 2024 https://arxiv.org/abs/2404.02948
4
5 W = U S Vh (truncated to top-r)
6 - B = U sqrt(S), A = sqrt(S) Vh, W_res = W - B A
6 + Sr_eff = Sr / (alpha/r) # peft-style: pre-divide so A/B
7 + B = U sqrt(Sr_eff), A = sqrt(Sr_eff) Vh # update dynamics match for any alpha
8 + W_res = W - (alpha/r) B A = W - U Sr Vh # scaling cancels symmetrically
9
8 -Identity at t=0: W_res + B@A == W (bf16 round-trip, not bit-exact).
9 -Pass alpha=r for paper-faithful scale=1.
10 +Identity at t=0: W_res + (alpha/r) B@A == W (fp32 round-trip, bf16 cast can drift).
11
12 Refs:
13 - paper: https://github.com/MuLabPKU/PiSSA/blob/main/utils/init_pissa.py
● Update(src/lora_lite/variants/pissa.py)
 Added 7 lines, removed 3 lines
52 W = layer.weight.data.float() # (d_out, d_in)
53 U, S, Vh = torch.linalg.svd(W, full_matrices=False)
54 r = cfg.r
55 + scale = cfg.alpha / cfg.r
56 Ur, Sr, Vhr = U[:, :r], S[:r], Vh[:r, :]
56 - sqrtS = Sr.sqrt()
57 - # B @ A = Ur diag(Sr) Vhr; pick B = Ur sqrt(Sr), A = sqrt(Sr) * Vhr
57 + # Pre-divide Sr by scaling so A/B carry "natural" magnitudes for any alpha
58 + # (peft pissa_init does this; needed so the alpha/r scaling on the forward
59 + # gives matched update dynamics rather than re-scaling A,B by sqrt(scale)).
60 + Sr_eff = Sr / scale
61 + sqrtS = Sr_eff.sqrt()
62 + # B @ A = Ur diag(Sr/scale) Vhr; W_res = W - scale * B@A = W - Ur diag(Sr) Vhr.
63 B = (Ur * sqrtS).to(cfg.dtype)
64 A = (sqrtS[:, None] * Vhr).to(cfg.dtype)
65 layer.lora_B.data.copy_(B)
66 layer.lora_A.data.copy_(A)
67 # fp32 subtraction so W_res stays accurate.
68 BA = (B.float() @ A.float())
64 - scale = cfg.alpha / cfg.r
69 layer.weight.data.copy_((W - scale * BA).to(layer.weight.dtype))
70
71 @staticmethod
● Update(src/lora_lite/variants/antipasto.py)
 Removed 9 lines
59 return _cayley(A)
60
61
62 -def _block_diag(blocks: torch.Tensor) -> torch.Tensor:
63 - """(n_blocks, bs, bs) -> (n_blocks*bs, n_blocks*bs) block-diagonal."""
64 - n, bs, _ = blocks.shape
65 - out = blocks.new_zeros(n * bs, n * bs)
66 - for i in range(n):
67 - out[i * bs : (i + 1) * bs, i * bs : (i + 1) * bs] = blocks[i]
68 - return out
69 -
70 -
62 @register
63 class AntiPaSTO:
64 name = "antipasto"
● Update(src/lora_lite/variants/antipasto.py)
 Added 11 lines, removed 4 lines
114 S = layer.lora_S.to(x.dtype) # (r,)
115 Vh = layer.lora_Vh.to(x.dtype) # (r, d_in)
116
117 - R_blocks = _build_rotation(layer.lora_rot_T.float(), bs, max_angle)
118 - R = _block_diag(R_blocks).to(x.dtype) # (r, r)
117 + R_blocks = _build_rotation(layer.lora_rot_T.float(), bs, max_angle).to(x.dtype)
118 + n_blocks, _, _ = R_blocks.shape # (n, bs, bs)
119 + d_in, d_out = Vh.shape[1], U.shape[0]
120 + r = n_blocks * bs
121
122 + # Apply block-diagonal R per-block via einsum, never materializing (r,r).
123 if rotate_basis == "V":
121 - Vh_eff = R @ Vh # rotate INPUT basis
124 + # Vh_eff = R @ Vh, viewed block-wise on the r-axis.
125 + Vh_blocks = Vh.reshape(n_blocks, bs, d_in)
126 + Vh_eff = einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i").reshape(r, d_in)
127 U_eff = U
128 elif rotate_basis == "U":
129 + # U_eff = U @ R.T, viewed block-wise on the r-axis.
130 + U_blocks = U.reshape(d_out, n_blocks, bs)
131 + U_eff = einsum(U_blocks, R_blocks, "d n b, n c b -> d n c").reshape(d_out, r)
132 Vh_eff = Vh
125 - U_eff = U @ R.T # rotate OUTPUT basis
133 else:
134 raise ValueError(f"rotate_basis must be 'U' or 'V', got {rotate_basis!r}")
135
● Verify the antipasto change with a quick import + a smoke equivalence check.
● Bash(python -c "
import torch…)
 Error: Exit code 127
(eval):1: command not found: python
> einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i").reshape(r, d_in)
should not the reshape use einops reshape? there is also reduce
● You're right — use einops rearrange for the shape ops to keep them
self-documenting. Updating.
Read 1 file (ctrl+o to expand)
● Update(src/lora_lite/variants/antipasto.py)
 Added 1 line, removed 1 line
19 from typing import Literal
20
21 import torch
22 -from einops import einsum
22 +from einops import einsum, rearrange
23 from jaxtyping import Float
24 from torch import nn, Tensor as T
25
● Update(src/lora_lite/variants/antipasto.py)
 Added 7 lines, removed 7 lines
115 Vh = layer.lora_Vh.to(x.dtype) # (r, d_in)
116
117 R_blocks = _build_rotation(layer.lora_rot_T.float(), bs, max_angle).to(x.dtype)
118 - n_blocks, _, _ = R_blocks.shape # (n, bs, bs)
119 - d_in, d_out = Vh.shape[1], U.shape[0]
120 - r = n_blocks * bs
118 + n_blocks = R_blocks.shape[0] # R_blocks: (n, bs, bs)
119
120 # Apply block-diagonal R per-block via einsum, never materializing (r,r).
121 if rotate_basis == "V":
122 # Vh_eff = R @ Vh, viewed block-wise on the r-axis.
125 - Vh_blocks = Vh.reshape(n_blocks, bs, d_in)
126 - Vh_eff = einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i").reshape(r, d_in)
123 + Vh_blocks = rearrange(Vh, "(n a) i -> n a i", n=n_blocks)
124 + Vh_rot = einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i")
125 + Vh_eff = rearrange(Vh_rot, "n a i -> (n a) i")
126 U_eff = U
127 elif rotate_basis == "U":
128 # U_eff = U @ R.T, viewed block-wise on the r-axis.
130 - U_blocks = U.reshape(d_out, n_blocks, bs)
131 - U_eff = einsum(U_blocks, R_blocks, "d n b, n c b -> d n c").reshape(d_out, r)
129 + U_blocks = rearrange(U, "d (n b) -> d n b", n=n_blocks)
130 + U_rot = einsum(U_blocks, R_blocks, "d n b, n c b -> d n c")
131 + U_eff = rearrange(U_rot, "d n c -> d (n c)")
132 Vh_eff = Vh
133 else:
134 raise ValueError(f"rotate_basis must be 'U' or 'V', got {rotate_basis!r}")
● Now run the equivalence smoke check.
Listed 1 directory (ctrl+o to expand)
● Bash(.venv/bin/python -c "
import torch…)
 t=0 identity err (V): 1.91e-06
einsum vs blockdiag (V): 1.31e-06
einsum vs blockdiag (U): 1.07e-06
● All three checks pass at fp32 round-trip noise (~1e-6).
● Quick sanity check on dora and pissa too.
● Bash(.venv/bin/python -c "
import torch…)
 DoRA t=0 identity err: 0.00e+00
DoRA grad on A: 0.3232 (should be > 0)
DoRA grad on m: 3.2282 (should be > 0)
… +4 lines (ctrl+o to expand)
● All three fixes verified:
- DoRA: t=0 bit-identity (err=0), grads still flow on A and m, but no longer
through v_norm (paper §4.3 / peft).
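- PiSSA: identity at t=0 holds for α ∈ {r/2, r, 2r}. The old impl already
satisfied the identity for any α (see review 3), but its A/B magnitudes
matched peft only at α=r; with the pre-division they match for any α.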
- AntiPaSTO: einsum path matches block-diag matmul to fp32 round-trip (~1e-6);
t=0 bit-identity preserved.
Summary
┌───────────────────────────────────────────┬─────────────────────┬────────┐
│ Fix │ File │ Status │
├───────────────────────────────────────────┼─────────────────────┼────────┤
│ 1. detach BA + v_norm (paper §4.3) │ dora.py:62-64 │ ✓ │
├───────────────────────────────────────────┼─────────────────────┼────────┤
│ 2. pre-divide Sr by scale to match peft │ pissa.py:51-65 │ ✓ │
│ for any α │ │ │
├───────────────────────────────────────────┼─────────────────────┼────────┤
│ 3. einops rearrange + per-block einsum, │ antipasto.py:41-130 │ ✓ │
│ dropped _block_diag │ │ │
└───────────────────────────────────────────┴─────────────────────┴────────┘
Skipped (judgment calls — flag if you want them):
- delora lambda0=0.0 default → already documented in docstring
- antipasto α steering coefficient → design decision, ask before adding
- antipasto rotation chirality vs antipasto3 → gradient-equivalent under sign
flip; no checkpoint compat path anyway (jax↔torch)
✻ Tinkered for 5m 55s
-75
@@ -1,75 +0,0 @@
# Design review: should lora-lite borrow from peft EVA / baukit / antipasto3?
You are reviewing a minimal from-scratch LoRA library (`lora-lite`) and comparing
it to three reference implementations. Goal: identify cherry-picks that would
**reduce** complexity or unlock missing capability, **without bloating the lib**.
## Inputs
- lora-lite code: `src/lora_lite/` (adapter.py, target.py, variant.py, config.py, variants/*.py)
- Reference: `docs/refs/peft_eva.py` (peft's EVA: data-driven SVD-of-activations init)
- Reference: `docs/refs/peft_eva_finetuning.py` (example usage)
- Reference: `docs/refs/baukit_nethook.py` (nethook: forward/backward hook patterns)
- Reference: `docs/refs/antipasto3_svd_adapter.py` (wassname's earlier JAX SVD adapter)
## Project ethos (read first)
Lora-lite is fail-fast research code. Principles:
- No defensive programming, no fallbacks, no legacy compat
- Simplicity beats features. If you add X you must remove equivalent complexity.
- Each variant is one file with paper URL + honest deviation notes.
- Targets discovered by structural type-check, not name regex.
- Hooks via plain torch forward_pre_hook on a single layer, no global registry.
Read `AGENTS.md` if present.
## Questions to answer
For each reference, answer:
### A. peft EVA (`docs/refs/peft_eva.py` + `peft_eva_finetuning.py`)
1. What does EVA actually do? (1-paragraph summary; cite line numbers)
2. What would a *minimal* EVA variant in lora-lite look like? Sketch the API:
- How does the user pass calibration data?
- Where does the SVD-of-activations happen — in `init()` with a callback,
or as a separate `calibrate(model, dataloader, cfg)` step before `attach`?
3. Does peft's implementation have anything we could **drop** if we re-implemented?
(e.g. the rank-redistribution logic, the resume-from-checkpoint plumbing)
4. Does lora-lite's current `Variant.init(layer, cfg)` signature support EVA, or
would we need to extend it? Recommend the **smallest** API change.
### B. baukit nethook (`docs/refs/baukit_nethook.py`)
1. What does `TraceDict` / `Trace` give us that our current per-layer
`forward_pre_hook` registration does not?
2. Would adopting `baukit` for hook management (a) simplify our adapter.py,
(b) complicate it, or (c) be neutral? Quote specific lines from
`src/lora_lite/adapter.py` to justify.
3. Lora-lite's principle: minimize deps. Is baukit worth a dep? Or should
we just **inline** the 1-2 useful patterns?
### C. antipasto3 SVD adapter (`docs/refs/antipasto3_svd_adapter.py`)
1. This is the user's earlier JAX work. Anything in there (init style, scale
parameterization, save/load format) that lora-lite should adopt or
deliberately diverge from?
2. Does it suggest a cleaner factoring for PiSSA-like methods?
## Output format
For each (A, B, C), end with:
**Recommendation: ADOPT / SKIP / PARTIAL**
If ADOPT or PARTIAL, list the specific lines/patterns to import and the
approximate net line-count impact on lora-lite (+ added, removed).
## Hard rules
- Do NOT propose code edits. This is design notes only.
- Do NOT recommend adding a feature unless you can name what to remove or
simplify in exchange.
- Be specific. "Could be cleaner" is not a recommendation; "Replace L42-L67
in adapter.py with a 5-line TraceDict call" is.
- If a reference's pattern is worse than what lora-lite already has, say so.
-63
@@ -1,63 +0,0 @@
# Per-variant paper-faithfulness audit for lora-lite
You are reviewing a small from-scratch PEFT library (`lora-lite`) that re-implements
6 LoRA variants. Your job: independent paper-vs-implementation sign-off for each.
## Inputs available locally
- Code: `src/lora_lite/variants/{lora,pissa,dora,ia3,hra,delora}.py`
- Adapter base + plumbing: `src/lora_lite/{adapter.py,target.py,variant.py,config.py}`
- Papers (extracted text): `docs/papers/{lora,pissa,dora,ia3,hra,delora}_*.txt`
- Smoke log (toy + bnb): `logs/smoke.log`
- Real-model probe log (Qwen0.6B, 16 SGD steps): `logs/qwen_probe.log`
- Reference implementations (peft / antipasto3 / baukit): `docs/refs/*.py`
## What I want from you (per variant, all 6)
For each of `lora, pissa, dora, ia3, hra, delora` produce a section with:
1. **Paper claim summary (1-3 sentences)** — cite paper file + section/eq number.
E.g. "PiSSA (docs/papers/pissa_2404.02948.txt §3.1, eq.4): A,B = top-r SVD of W,
W_res = W - BA; trains A,B with W_res frozen."
2. **What our code does** — point to the function and key lines in
`src/lora_lite/variants/<v>.py`. Quote ≤5 lines.
3. **Match? Y / Partial / N** — explicit verdict. If Partial, state the deviation
and whether it is documented in the variant's docstring.
4. **Smoke evidence** — quote the exact row from `logs/smoke.log` (toy + bnb)
and the row from `logs/qwen_probe.log`. State whether the numbers are
consistent with paper expectations (e.g. PiSSA should have nonzero perturb at
t=0 because B@A reconstructs W; LoRA/HRA/IA3/DeLoRA should be identity at t=0).
5. **Bugs / concerns** — anything actually wrong, especially:
- Gradient flow issues
- Wrong normalization / scaling
- Wrong initialization (e.g. PiSSA without SVD, HRA without orthogonality)
- Missing or wrong save/load handling
- Numerical issues (dtype, in-place ops on grad-required tensors)
6. **Confidence** — High / Medium / Low, with one-line reason.
## Final aggregate
After the 6 sections, produce a Markdown table:
| variant | paper match | smoke pass | qwen pass | bugs found | confidence |
And a 3-bullet "biggest risks" summary.
## Rules
- Be skeptical. The previous audit found IA3, HRA, DeLoRA bugs that had been
declared "OK". Assume nothing.
- If the smoke log does not include a check that you'd want to see, flag it as
a missing test — don't infer correctness from absence.
- Quote evidence; do not paraphrase code.
- Use file links: `src/lora_lite/variants/lora.py:42` style.
- Do NOT edit code. Output is a verdict only.
- If you cannot determine something from the available files, say so explicitly
rather than guessing.
Write the full review to stdout. I will redirect to a file.
-74
@@ -1,74 +0,0 @@
# Per-variant paper-faithfulness audit V2 (with reference implementations)
Re-audit of `lora-lite` after adding canonical reference implementation URLs to
each variant docstring. Your job: for each variant, **directly compare** our
implementation against the reference impl (peft and/or paper-author repo), not
just against the paper text. This is round 2 — the previous review (you can
read `docs/audit/variants_review.md`) found:
- HRA gate=0 init kills `lora_U` gradient on step 0
- DeLoRA same pattern with lambda0=0
- IA3 targets q/v not paper k/v/ffn-down (deviation documented but untested)
- PiSSA bf16 init err 0.31 on Qwen
- Saved adapters don't preserve PiSSA W_res mutation
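For orientation, the last finding can be reproduced on a toy weight. The sketch below is not lora-lite code: it assumes a PiSSA-style init with scale 1 (the `alpha == r` case) and uses made-up shapes, and it only shows why saving `A`, `B` without the mutated residual breaks identity after reload.

```python
import torch

torch.manual_seed(0)
d_out, d_in, r = 6, 5, 2
W = torch.randn(d_out, d_in)            # original frozen weight (toy)
x = torch.randn(3, d_in)

# PiSSA-style init: the top-r SVD goes into B and A, the base weight becomes the residual.
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
B = U[:, :r] * S[:r].sqrt()             # (d_out, r)
A = S[:r].sqrt()[:, None] * Vh[:r]      # (r, d_in)
W_res = W - B @ A                       # this mutation lives in the base layer

y_ref = x @ W.T
y_attached = x @ (W_res + B @ A).T      # identity at t=0 holds with the mutated base
y_reloaded = x @ (W + B @ A).T          # A, B reloaded onto a fresh, unmutated checkpoint

err_attached = (y_attached - y_ref).abs().max().item()
err_reloaded = (y_reloaded - y_ref).abs().max().item()
print(f"attached: {err_attached:.2e}  reloaded onto fresh base: {err_reloaded:.2e}")
```

If the reference implementations handle this differently (for example by re-deriving the residual on load), that is the behaviour to cite.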
Your job now is to verify those findings against the **reference code**, and
look for anything the prior review missed once you have the reference in hand.
## Inputs
- Our code: `src/lora_lite/variants/{lora,pissa,dora,ia3,hra,delora}.py`
- Adapter plumbing: `src/lora_lite/{adapter.py,target.py,variant.py,config.py}`
- Papers (text): `docs/papers/*_*.txt`
- **Reference implementations** (just added):
- `docs/refs/peft_lora_layer.py` — peft LoRA Linear (and PiSSA init paths)
- `docs/refs/peft_lora_dora.py` — peft DoRA helper module
- `docs/refs/peft_lora_variants.py` — peft per-variant init dispatch (PiSSA, OLoRA, etc.)
- `docs/refs/peft_ia3_layer.py` — peft IA3 layer
- `docs/refs/peft_hra_layer.py` — peft HRA layer (clean, has apply_GS toggle)
- `docs/refs/peft_delora_layer.py` — peft DeLoRA layer (upstreamed)
- `docs/refs/orig_pissa_init.py` — PiSSA paper authors' init script (MuLabPKU)
- `docs/refs/orig_hra_layer.py` — HRA paper authors' OFT-with-HRA layer (DaShenZi721)
- `docs/refs/orig_delora.py` — DeLoRA paper authors' fork-of-peft impl (ExplainableML)
- Logs: `logs/smoke.log`, `logs/qwen_probe.log`
- Prior review: `docs/audit/variants_review.md` (do NOT just restate it)
## What to deliver per variant (LoRA, PiSSA, DoRA, IA3, HRA, DeLoRA)
1. **Reference impl ground-truth** — what does the *reference* code actually do
for: parameter shapes, initialization, scale factor, forward equation,
save/load, target placement? Quote ≤10 lines with file/line cites from
`docs/refs/`.
2. **Our code** — quote our impl (≤10 lines, with `src/lora_lite/variants/<v>.py:LN` cites).
3. **Diff** — bullet list of every meaningful difference.
Mark each one as: `[OK-doc]` (acceptable, documented), `[OK-undoc]` (acceptable,
should add to docstring), `[BUG]` (likely wrong), `[STYLE]` (cosmetic).
4. **Did the prior review get it right?** Quote the relevant prior verdict
line and either confirm or correct.
5. **Verdict** — Faithful / Faithful-with-doc-gap / Partial / Buggy.
One-line reason.
## Final aggregate
Markdown table:
| variant | prior verdict | new verdict | new bugs found | doc gaps |
And a 5-bullet "what to fix next" list, ordered by severity.
## Hard rules
- Quote evidence from `docs/refs/` files. If you can't find the relevant
reference function, say so explicitly — don't guess.
- Do NOT edit code. Output review only.
- Be specific about line numbers from the references. "peft does X" is not
enough; "peft_lora_layer.py:L1234 does X" is.
- If you find a NEW bug not flagged in `variants_review.md`, mark it
`[NEW-BUG]` and explain the failure mode.
- If the prior review was wrong (false positive), mark it `[OVERTURN]`.
Write to stdout. I will redirect to `docs/audit/variants_review_v2.md`.
-107
@@ -1,107 +0,0 @@
# V3 Variant Review — per-component audit
You are an expert ML engineer reviewing a from-scratch PEFT library
(`lora-lite`, ~500 LOC) that re-implements 8 LoRA variants. Two prior reviews
already happened (V1 paper-vs-code, V2 with reference implementations
provided). Your job is V3: a tight per-component audit focused on
correctness-of-mechanism rather than overall design.
# Scope
8 variants live in `src/lora_lite/variants/`:
- lora.py
- pissa.py
- delora.py
- ia3.py (two registered: `ia3` and `ia3_ff`)
- dora.py
- hra.py
- eva.py (NEW since V2)
- antipasto.py (NEW since V2)
Plus the runtime in `src/lora_lite/{adapter.py,variant.py,target.py,config.py}`
and the smoke test in `tests/smoke.py`.
Reference implementations are in `docs/refs/` and the URLs are also pasted in
each variant's module docstring. Compare against those.
# What I want from you (per variant, in this order, every time)
For EACH variant, work through these five checkpoints, using only that
variant's file and its referenced peft/author code:
1. **PARAMS** — list every spec returned by `param_specs`. For each:
shape, dtype (cfg.dtype unless overridden), trainable, as_buffer.
Does the shape match the reference impl? Are buffers vs Parameters
chosen correctly (no Parameter that should be a buffer; no buffer
that we want to learn)? Does as_buffer mean it persists in
state_dict (check `register_buffer(..., persistent=True)` in
adapter.py)?
2. **INIT** — what does `init()` (and `group_init()` if defined) do?
Does it match the reference exactly? Pay special attention to
ZERO INITS — they often kill gradient flow on dependent params.
Walk the gradient: at t=0, given this init, which trainable params
actually receive non-zero gradient on the first SGD step?
Are dtype casts placed correctly (fp32 SVD, then to cfg.dtype)?
3. **DTYPE** — trace dtype through init -> param storage -> forward.
Where could silent precision loss happen? Is bf16 or fp16 used
anywhere it shouldn't be? Does identity-at-init survive bf16?
4. **FORWARD** — write the math the forward implements, in the same
convention as the reference (peft/author paper). Compare term by
term. Common mistakes to look for:
- wrong scale (alpha/r vs 1/r vs alpha vs 1)
- missing or doubled normalization
- wrong basis (rotating U vs V; gating input vs output)
- dropout placement (we have no dropout — flag if any variant
references one; see config.py)
5. **LINK SANITY** — open the URLs in the docstring. Verify:
- the paper arxiv link goes to the right paper
- the github link points to a real file
- the offline `docs/refs/` snapshot matches what the URL serves
today (the snapshots may be stale; if so, flag the drift)
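To make the INIT and DTYPE checkpoints concrete, here is a minimal sketch of the "fp32 SVD, then cast" question on a toy weight. It is an illustration under stated assumptions, not lora-lite code: the shapes are arbitrary, and storage rounding is simulated by a round-trip through bf16 with the arithmetic kept in fp32.

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 48)                 # stands in for a frozen weight, fp32
r = 8

# SVD computed in fp32, as the INIT checkpoint asks.
U, S, Vh = torch.linalg.svd(W.float(), full_matrices=False)
B = U[:, :r] * S[:r].sqrt()
A = S[:r].sqrt()[:, None] * Vh[:r]
W_res = W - B @ A

def rel_err(W_res_, B_, A_):
    """Relative Frobenius error of the reconstructed weight against W."""
    return ((W_res_ + B_ @ A_ - W).norm() / W.norm()).item()

def round_to(t, dtype):
    """Round-trip through a storage dtype, then compute in fp32."""
    return t.to(dtype).float()

err_fp32 = rel_err(W_res, B, A)
err_bf16 = rel_err(*(round_to(t, torch.bfloat16) for t in (W_res, B, A)))
print(f"fp32: {err_fp32:.2e}  bf16 storage: {err_bf16:.2e}")
```

The gap between the two numbers is the part of an identity-at-init error that storage precision alone can explain; anything much larger points at the init logic rather than the dtype.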
# Output format
For each variant, write at most ~60 lines. Use this template:
## <variant>
### params
- <one bullet per ParamSpec; flag bug if any>
### init / group_init
- <bullets; identify GRADIENT FLOW at t=0 explicitly>
### dtype
- <bullets>
### forward
Math: <one-line equation in our convention>
Ref math: <one-line equation in reference convention>
Match? YES / NO + one-line explanation
### links
- paper: OK / WRONG / DEAD
- peft ref: OK / DEAD
- author ref (if any): OK / DEAD
- offline snapshot drift: NONE / MINOR / MAJOR
### verdict
CORRECT / PARTIAL / BUGGY -- one-sentence reason
After all variants, write a "## summary" with a markdown table of verdicts and
a numbered list of MUST-FIX bugs (severity high) vs nice-to-haves.
# Hard rules
- Be specific. Cite line numbers (`src/lora_lite/variants/foo.py:NN`) for
every claim.
- Do NOT propose redesigns. Only flag correctness issues against the
references.
- If an issue is intentional and documented, say so and move on -- don't
re-flag known deviations from the docstrings.
- If you can't tell whether something is a bug, say "AMBIGUOUS" with the
question you'd need answered.
-186
@@ -1,186 +0,0 @@
# V4 Variant Review — per-component vs reference + smoke/probe validity
You are an expert ML engineer reviewing a from-scratch PEFT library
(`lora-lite`, ~500 LOC) that re-implements 8 LoRA variants. Three prior
reviews already happened (V1 paper-vs-code, V2 with refs provided, V3
per-component). Your job is V4: re-run the per-component check and
additionally validate the test harness.
# Part A — per-variant audit (re-do, more rigorous)
8 variants live in `src/lora_lite/variants/`:
- lora.py
- pissa.py
- delora.py
- ia3.py (registers `ia3` and `ia3_ff`)
- dora.py
- hra.py
- eva.py
- antipasto.py
Plus runtime in `src/lora_lite/{adapter.py,variant.py,target.py,config.py}`.
Reference implementations are in `docs/refs/` and URLs are pasted in each
variant's module docstring.
## For EACH variant, in this order, every time:
1. **REFERENCE EXISTS** — verify the variant has a real, citeable
reference. Required:
- a paper (arxiv/conference) link, AND
- either an upstream peft implementation OR the original author's
code (GitHub).
If the variant has NO paper, NO reference code, OR the references
are dead/missing/clearly wrong, FLAG IT as `NO REFERENCE` -- this
is severity HIGH because it means there's nothing to validate
against.
2. **PARAMS** — every spec from `param_specs`: shape, dtype, trainable,
as_buffer. Match against the reference. Buffers vs Parameters
chosen correctly?
3. **INIT** — what does `init()` (and `group_init()` if defined) do?
Match the reference exactly? Walk gradient flow at t=0: which
trainable params actually receive non-zero gradient on step 1?
4. **DTYPE** — trace dtype through init -> storage -> forward.
Silent precision loss? Identity-at-init survive bf16?
5. **FORWARD** — write the math the forward implements vs the math
in the reference. Term-by-term comparison. Common mistakes:
- wrong scale (alpha/r vs 1/r vs alpha vs 1)
- missing/doubled normalization
- wrong basis (rotating U vs V; gating input vs output)
- dropout placement (we have NO dropout by design — flag if any
code path depends on one)
6. **LINK SANITY** — actually open the URLs. Verify:
- paper arxiv link goes to the right paper
- github link points to a real file (not 404)
- offline `docs/refs/` snapshot still matches what the URL serves
today (snapshots may be stale; flag drift)
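For the PARAMS checkpoint, the buffer-versus-Parameter distinction can be checked mechanically. The sketch below uses plain PyTorch on a toy layer; the attribute names (`lora_frozen_basis`, `lora_scratch`, `lora_gate`) are hypothetical and do not come from `adapter.py`.

```python
import torch
from torch import nn

layer = nn.Linear(4, 4)
# Persistent buffer: saved in state_dict, never trained.
layer.register_buffer("lora_frozen_basis", torch.eye(4), persistent=True)
# Non-persistent buffer: moves with .to(), but is NOT saved.
layer.register_buffer("lora_scratch", torch.zeros(4), persistent=False)
# Parameter: saved and trainable.
layer.lora_gate = nn.Parameter(torch.zeros(4))

state_keys = set(layer.state_dict().keys())
trainable = {n for n, p in layer.named_parameters() if p.requires_grad}
print(sorted(state_keys))
print(sorted(trainable))
```

A frozen tensor registered with `persistent=False` would silently vanish on save/load, which is the failure mode this checkpoint is meant to catch.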
## Per-variant output (≤60 lines each):
## <variant>
### references
- paper: <url> -- OK / WRONG / DEAD / MISSING
- peft ref: <url> -- OK / DEAD / MISSING
- author ref (if any): <url> -- OK / DEAD / MISSING
- offline snapshot (`docs/refs/...`): NONE / MATCH / DRIFT
- VERDICT: HAS_REFERENCE / NO_REFERENCE
### params
- <one bullet per ParamSpec; flag bug if any>
### init / group_init
- <bullets; identify GRADIENT FLOW at t=0 explicitly>
### dtype
- <bullets>
### forward
Math (ours): <one-line equation>
Math (ref): <one-line equation>
Match? YES / NO + one-line reason
### verdict
CORRECT / PARTIAL / BUGGY -- one-sentence reason
# Part B — validate the smoke test (`tests/smoke.py`)
Read `tests/smoke.py` end-to-end. For each per-variant SHOULD claim,
answer:
1. **Distinguishing power** — would a SILENT FAILURE (e.g. forward
returning `y` unchanged, or training only the bias term, or
loading an empty state dict) STILL pass this check? If yes,
the check is WEAK -- name a stronger one.
2. **Tolerance sanity** — the bf16/fp16 tolerances are computed
from `base_scale`. Are they too loose? Too tight? Could they
pass on noise alone?
3. **Coverage** — what mechanisms are NOT tested? (e.g. multi-step
convergence on real targets, dtype mismatch between attach and
load, mixing variants, calibration data of len < r for EVA)
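As an illustration of what "distinguishing power" means here, the sketch below runs a weak and a stronger check against a correct forward and a silently broken one. Everything in it is a toy stand-in (the functions and tolerances are assumptions, not taken from `tests/smoke.py`).

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)
W = torch.randn(8, 8)
A = torch.randn(2, 8)
B = torch.zeros(8, 2)                   # zero init -> identity at t=0

def good_forward(x, B):
    return x @ W.T + (x @ A.T) @ B.T

def broken_forward(x, B):
    return x @ W.T                      # silent failure: the adapter is ignored

def weak_check(fwd):
    """Identity at init only: passes whenever the output equals the base output."""
    return torch.allclose(fwd(x, B), x @ W.T)

def strong_check(fwd):
    """Identity at init AND sensitivity to a perturbed adapter parameter."""
    moved = (fwd(x, B + 0.1) - fwd(x, B)).abs().max() > 1e-4
    return weak_check(fwd) and bool(moved)

results = {
    "weak/good": weak_check(good_forward),
    "weak/broken": weak_check(broken_forward),
    "strong/good": strong_check(good_forward),
    "strong/broken": strong_check(broken_forward),
}
print(results)
```

A check is WEAK in the sense of question 1 when its "broken" column matches its "good" column.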
Output:
## smoke.py validity
### per-variant SHOULD checks
| check | distinguishes silent failure? | tolerance ok? | notes |
| ... |
### gaps
- bullets
### must-add tests
- bullets
# Part C — validate the qwen overfit probe (`scripts/qwen_train_probe.py`)
Read `scripts/qwen_train_probe.py` end-to-end. Same questions as Part B
but for the Qwen probe specifically:
1. Does `assert_only_lora_trainable` actually catch a leaked base
parameter, given the way `requires_grad` is set in `adapter.py`?
2. `perturb_first_adapter` only perturbs ONE param per variant. Does
`perturb_delta > 1e-7` distinguish "the variant uses that param in
forward" from "the variant ignores that param"?
3. `loss_last < loss0` after 8 steps with lr=5e-3 -- could this pass
purely from optimizer noise? What's the right held-out / validation
check to add?
4. The reload check uses `args.reload_tol` (default 2e-2 in bf16). Is
that loose enough to mask a real save/load bug?
5. Targets are restricted to `model.layers.0.self_attn.{q,v}_proj` --
does this exercise the full attach path or hide bugs that only
appear with multi-layer / FFN / lm_head edge cases?
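For question 1, it may help to compare the probe's assertion with a name-based check like the one sketched below. This is a hypothetical stand-in written for illustration; it is not the body of `assert_only_lora_trainable`, and the `lora_` prefix convention is assumed from the variant parameter names.

```python
import torch
from torch import nn

def leaked_trainables(model: nn.Module, prefix: str = "lora_") -> list[str]:
    """Names of trainable parameters whose leaf name does not start with `prefix`."""
    return [
        name for name, p in model.named_parameters()
        if p.requires_grad and not name.rsplit(".", 1)[-1].startswith(prefix)
    ]

model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2))
for p in model.parameters():
    p.requires_grad_(False)             # freeze the base model
model[0].lora_A = nn.Parameter(torch.zeros(2, 4))   # attach one adapter parameter

clean = leaked_trainables(model)
model[1].bias.requires_grad_(True)      # simulate a leaked base parameter
leaked = leaked_trainables(model)
print(clean, leaked)
```

If the probe's assertion would return an empty list in the second case, it does not catch the leak.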
Output:
## qwen_train_probe.py validity
### claim-by-claim
| assertion | catches silent failure? | notes |
| ... |
### gaps
- bullets
### must-add tests
- bullets
# Final summary
After parts A, B, C, write:
## summary
### variant verdicts
| variant | has_ref | params | init | dtype | forward | verdict |
### MUST-FIX (severity HIGH, blocks correctness claim)
1. ...
2. ...
### NICE-TO-HAVE
- ...
# Hard rules
- Be specific. Cite line numbers (`src/lora_lite/variants/foo.py:NN`)
for every claim.
- Do NOT propose redesigns. Only flag correctness issues against
references and validity issues in the test harness.
- If an issue is intentional and documented in the docstring, say so
and move on -- don't re-flag known deviations.
- If you can't tell whether something is a bug, say "AMBIGUOUS" with
the question you'd need answered.
- For Part B/C, focus on whether checks have DISTINGUISHING power
(would a silent failure still pass?) -- not just whether they run.
-42
@@ -1,42 +0,0 @@
# External-Review Summary
Two independent reviews via `acpx` external models. Full reviews:
- [docs/audit/variants_review.md](variants_review.md) — per-variant paper-faithfulness audit
- [docs/audit/design_review.md](design_review.md) — peft EVA / baukit / antipasto3 vs lora-lite design
## Per-variant verdict
| variant | match | bugs found | confidence |
|---|---|---|---|
| lora | Y | none material | High |
| pissa | Partial | bf16/Qwen init err 0.31; deviation `alpha==r` only in inline comment; residual not in saved adapter | Medium |
| dora | Y | possible denominator-gradient mismatch with paper's "cost-saving" variant | High |
| ia3 | Partial | targets q/v not paper's k/v/ffn-down; deviation documented but not tested | Medium |
| hra | Partial | gate=0 init -> dL/dU=0 on the first step (lora_U dead); not orthogonal when gate != 1 | Medium-Low |
| delora | Partial | no Eq.9 frozen-copy init; lambda0=0 -> A/B dead grad; lambda0=0.1 breaks identity | Medium |
## Three biggest risks (reviewer's words)
1. **Initialization vs gradient-flow tradeoffs are hidden by coarse tests.** HRA's `lora_U` and DeLoRA's `A/B` can be initially dead while `grad_nonzero=True` still passes (because *some* lora_* param has nonzero grad).
2. **Qwen probe pass criteria do not enforce paper identity.** PiSSA shows `id_err=0.31`, DeLoRA `id_err=0.72`, but log says PASS.
3. **Target semantics under-tested.** IA3's documented k/v/ffn deviation is never exercised by a positive test.
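On risk 2, a raw `id_err` is hard to interpret without the scale of the base output. The sketch below assumes `id_err` is an absolute difference (the summary does not say) and shows how the same absolute perturbation maps to very different relative errors; the helper name is hypothetical.

```python
import torch

def relative_identity_error(y_adapted: torch.Tensor, y_base: torch.Tensor) -> float:
    """||y_adapted - y_base|| / ||y_base||, computed in fp32."""
    diff = (y_adapted.float() - y_base.float()).norm()
    return (diff / y_base.float().norm().clamp_min(1e-12)).item()

torch.manual_seed(0)
delta = 0.3 * torch.randn(2, 16)        # the same absolute perturbation in both cases
y_small = torch.randn(2, 16)            # base output with unit scale
y_large = 100.0 * torch.randn(2, 16)    # base output with large scale

rel_small = relative_identity_error(y_small + delta, y_small)
rel_large = relative_identity_error(y_large + delta, y_large)
print(f"unit-scale base: {rel_small:.3f}  large-scale base: {rel_large:.4f}")
```

A pass criterion stated against `||y_base||` makes a value like 0.31 either clearly alarming or clearly benign, instead of depending on an unstated constant.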
## Design recommendations
| ref | verdict | impact |
|---|---|---|
| peft EVA | PARTIAL — add `calibrate(model, dataloader, cfg)` (~50 lines) | +50 lines, additive |
| baukit nethook | SKIP — current 5-line hook registration is simpler | 0 |
| antipasto3 SVD | ADOPT concept (learnable delta_s) — no code change now | 0 |
## Recommended follow-up tasks (need user approval before implementing)
A. **Per-param gradient probe**: extend smoke to assert grad on *each* lora_* param at step 0. Catches HRA/DeLoRA init-dead-param bug.
B. **Per-variant identity tolerance in qwen probe**: PiSSA/DeLoRA need a stricter check (or relative tol against `||y_base||`) instead of "passes if id_err < some constant".
C. **IA3 paper-faithful test row**: add one Qwen probe configuration with `target_names=k_proj|v_proj|down_proj` to exercise the documented IA3 placement.
D. **PiSSA equivalence test against `peft.PiSSA`**: same seed + alpha=r, compare `B@A` reconstruction. Adds `peft` to test extras only.
E. **EVA variant**: implement minimal `calibrate()` per design review (~50 lines). Optional, but provides our first data-driven init variant for the user's stated interest.
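Task A can be sketched on a toy module. The adapter below is deliberately not HRA or DeLoRA: it is a minimal gated layer, with made-up names, that reproduces the same pattern (a zero-initialised gate multiplying another trainable parameter), so the coarse `any nonzero grad` check passes while one parameter is dead.

```python
import torch
from torch import nn

torch.manual_seed(0)

class GatedToy(nn.Module):
    """Toy adapter: y = base(x) + lora_gate * (x @ lora_U). Not HRA itself."""
    def __init__(self, d: int = 8):
        super().__init__()
        self.base = nn.Linear(d, d)
        self.base.requires_grad_(False)
        self.lora_U = nn.Parameter(torch.randn(d, d) * 0.1)
        self.lora_gate = nn.Parameter(torch.zeros(()))   # zero init -> identity at t=0

    def forward(self, x):
        return self.base(x) + self.lora_gate * (x @ self.lora_U)

def grad_norms(model: nn.Module, x: torch.Tensor) -> dict[str, float]:
    """Gradient norm of every trainable lora_* parameter after one backward pass."""
    model.zero_grad()
    model(x).pow(2).mean().backward()
    return {n: p.grad.norm().item() for n, p in model.named_parameters()
            if p.requires_grad and "lora_" in n}

norms = grad_norms(GatedToy(), torch.randn(4, 8))
any_nonzero = any(v > 0 for v in norms.values())   # the coarse check
dead = [n for n, v in norms.items() if v == 0.0]   # the per-parameter check
print(norms, "dead at step 0:", dead)
```

Whether a dead-at-step-0 parameter is a bug or an accepted property of the reference init is a per-variant decision; the probe only makes it visible.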
-780
@@ -1,780 +0,0 @@
[client] initialize (running)
[client] session/new (running)
[thinking] This is a detailed design review request. I need to: 1. Read the lora-lite codebase to understand its current structure 2. Read the three reference implementations 3. Answer specific questions for each Let me start by exploring the codebase structure and reading the key files.
[tool] glob (pending)
input: {}
[tool] glob (pending)
input: {}
[tool] glob (completed)
kind: search
input: src/lora_lite/**/*.py
output:
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/adapter.py
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/lora.py
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/delora.py
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/hra.py
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/dora.py
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/ia3.py
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/__init__.py
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/pissa.py
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/target.py
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variant.py
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/__init__.py
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/config.py
[tool] glob (completed)
kind: search
input: docs/refs/*.py
output:
/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/antipasto3_svd_adapter.py
/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/baukit_nethook.py
/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva_finetuning.py
/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva.py
[thinking] I need to read all these files to do a proper comparative analysis. Let me read the key lora-lite files first (adapter.py, variant.py, variants/lora.py), then the reference files.
[tool] read (pending)
input: {}
[tool] read (pending)
input: {}
[tool] src/lora_lite/adapter.py (completed)
kind: read
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/adapter.py
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/adapter.py
output:
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/adapter.py</path>
<type>file</type>
<content>
1: """attach / detach / save / load. The whole runtime."""
2: from __future__ import annotations
3: import torch
4: from torch import nn
5: from torch.utils.hooks import RemovableHandle
6:
7: from .config import LoraLiteConfig
8: from .variant import REGISTRY
9: from .target import find_targets
10:
11:
12: _ATTACHED_ATTR = "_lora_lite_attached"
13:
14:
15: def _hook(layer, args, y):
16: (x,) = args
17: cfg: LoraLiteConfig = layer._lora_cfg
18: x_cast = x.to(cfg.dtype)
19: out = layer._lora_variant.forward(layer, x_cast, y)
20: return out.to(y.dtype)
21:
22:
23: def _pre_hook(layer, args):
24: (x,) = args
25: cfg: LoraLiteConfig = layer._lora_cfg
... (92 more lines)
[tool] read (pending)
input: {}
[tool] src/lora_lite/variant.py (completed)
kind: read
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variant.py
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variant.py
output:
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variant.py</path>
<type>file</type>
<content>
1: """Variant protocol + registry. Variants own (x, layer.weight, layer.lora_*) -> y_new."""
2: from dataclasses import dataclass
3: from typing import Callable, Protocol, Any
4: import torch
5: from torch import nn
6:
7: from .config import LoraLiteConfig
8:
9:
10: @dataclass
11: class ParamSpec:
12: shape: tuple[int, ...]
13: init: str | Callable[[torch.Tensor], None] = "zeros" # 'zeros'|'kaiming'|'ones'|callable(t)
14: trainable: bool = True
15:
16: def make(self, dtype: torch.dtype, device) -> nn.Parameter:
17: t = torch.empty(self.shape, dtype=dtype, device=device)
18: if callable(self.init):
19: self.init(t)
20: elif self.init == "zeros":
21: t.zero_()
22: elif self.init == "ones":
23: t.fill_(1.0)
24: elif self.init == "kaiming":
25: # match nn.Linear default: kaiming_uniform_(a=sqrt(5))
... (34 more lines)
[tool] src/lora_lite/variants/lora.py (completed)
kind: read
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/lora.py
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/lora.py
output:
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/lora.py</path>
<type>file</type>
<content>
1: """Vanilla LoRA. Hu et al. 2021 https://arxiv.org/abs/2106.09685
2:
3: h = W x + (alpha/r) B A x
4:
5: Identity at t=0 from B=0. Faithful to the paper.
6: """
7: from einops import einsum
8: from torch import nn
9: import torch
10:
11: from ..variant import register, ParamSpec
12:
13:
14: @register
15: class LoRA:
16: name = "lora"
17:
18: @staticmethod
19: def param_specs(d_in, d_out, cfg):
20: return {
21: "lora_A": ParamSpec((cfg.r, d_in), init="kaiming", trainable=True),
22: "lora_B": ParamSpec((d_out, cfg.r), init="zeros", trainable=True),
23: }
24:
25: @staticmethod
... (14 more lines)
[thinking] Now let me read the reference implementations to answer the questions.
[tool] read (pending)
input: {}
[tool] read (pending)
input: {}
[tool] docs/refs/peft_eva_finetuning.py (completed)
kind: read
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva_finetuning.py
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva_finetuning.py
output:
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva_finetuning.py</path>
<type>file</type>
<content>
1: # Copyright 2024-present the HuggingFace Inc. team.
2: #
3: # Licensed under the Apache License, Version 2.0 (the "License");
4: # you may not use this file except in compliance with the License.
5: # You may obtain a copy of the License at
6: #
7: # http://www.apache.org/licenses/LICENSE-2.0
8: #
9: # Unless required by applicable law or agreed to in writing, software
10: # distributed under the License is distributed on an "AS IS" BASIS,
11: # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12: # See the License for the specific language governing permissions and
13: # limitations under the License.
14:
15: import torch
16: from datasets import load_dataset
17: from torch.utils.data import DataLoader
18: from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
19: from utils import DataCollator, TokenizerMetaMath
20:
21: from peft import EvaConfig, LoraConfig, get_peft_model, initialize_lora_eva_weights
22:
23:
24: DEVICE = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
25:
... (74 more lines)
[tool] docs/refs/peft_eva.py (completed)
kind: read
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva.py
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva.py
output:
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva.py</path>
<type>file</type>
<content>
1: # Copyright 2024-present the HuggingFace Inc. team.
2: #
3: # Licensed under the Apache License, Version 2.0 (the "License");
4: # you may not use this file except in compliance with the License.
5: # You may obtain a copy of the License at
6: #
7: # http://www.apache.org/licenses/LICENSE-2.0
8: #
9: # Unless required by applicable law or agreed to in writing, software
10: # distributed under the License is distributed on an "AS IS" BASIS,
11: # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
12: # See the License for the specific language governing permissions and
13: # limitations under the License.
14:
15: import warnings
16: from collections import Counter, defaultdict
17: from collections.abc import Iterable, Mapping
18: from contextlib import nullcontext
19: from copy import deepcopy
20: from functools import partial
21: from itertools import cycle
22: from typing import Optional, Union
23:
24: import torch
25: import torch.distributed as dist
... (717 more lines)
[thinking] Now let me read the baukit and antipasto3 references.
[tool] read (pending)
input: {}
[tool] read (pending)
input: {}
[tool] docs/refs/baukit_nethook.py (completed)
kind: read
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/baukit_nethook.py
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/baukit_nethook.py
output:
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/baukit_nethook.py</path>
<type>file</type>
<content>
1: """
2: Utilities for instrumenting a torch model.
3:
4: Trace will hook one layer at a time.
5: TraceDict will hook multiple layers at once.
6: subsequence slices intervals from Sequential modules.
7: get_module, replace_module, get_parameter resolve dotted names.
8: set_requires_grad recursively sets requires_grad in module parameters.
9: """
10:
11: import contextlib
12: import copy
13: import inspect
14: from collections import OrderedDict
15:
16: import torch
17:
18:
19: class Trace(contextlib.AbstractContextManager):
20: """
21: To retain the output of the named layer during the computation of
22: the given network:
23:
24: with Trace(net, 'layer.name') as ret:
25: _ = net(inp)
... (449 more lines)
[tool] docs/refs/antipasto3_svd_adapter.py (completed)
kind: read
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/antipasto3_svd_adapter.py
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/antipasto3_svd_adapter.py
output:
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/antipasto3_svd_adapter.py</path>
<type>file</type>
<content>
1: """SVD adapter for bidirectional steering via block-diagonal Cayley rotations.
2:
3: Flax NNX port.
4:
5: kernel = U @ diag(S) @ Vh + W_res (kernel is (in, out), standard Flax convention)
6: Learnable: delta_s (additive S scaling), rotation_params (block-diagonal V rotation).
7: alpha scales both: S + alpha*delta_s, U @ R(alpha).
8:
9: Why Cayley (not Givens or matrix exponential):
10: Cayley gives exact analytical reversibility: R(-alpha) = R(alpha)^{-1}.
11: This is critical -- at alpha=+1 and alpha=-1 the adapter is an exact inverse of
12: itself, making bidirectional steering symmetric by construction.
13:
14: At alpha=0: U_rot = U and S_scaled = S, so the layer is identical to frozen weights.
15: """
16:
17: import math
18:
19: import jax
20: import jax.numpy as jnp
21: from jax import Array
22: from jaxtyping import Float, Int
23: from einops import rearrange
24: from flax import nnx
25: from loguru import logger
... (353 more lines)
[thinking] Now let me read the lora-lite config and variant protocol more carefully to understand the current API. Then I can compare it to the references.
[tool] read (pending)
input: {}
[tool] read (pending)
input: {}
[tool] src/lora_lite/config.py (completed)
kind: read
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/config.py
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/config.py
output:
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/config.py</path>
<type>file</type>
<content>
1: from dataclasses import dataclass, field, asdict
2: from typing import Any
3: import torch
4:
5:
6: @dataclass
7: class LoraLiteConfig:
8: variant: str = "lora"
9: r: int = 8
10: alpha: float = 16.0
11: dropout: float = 0.0 # currently ignored; variants may use cfg.variant_kwargs
12: dtype: torch.dtype = torch.bfloat16
13:
14: # targeting
15: target_roles: tuple[str, ...] = ("reader", "writer")
16: target_names: tuple[str, ...] = ()
17: exclude_names: tuple[str, ...] = ("lm_head", "embed_tokens")
18: layers: tuple[int, ...] | None = None
19:
20: # variant-specific bag (e.g. lambda0 for DeLoRA)
21: variant_kwargs: dict[str, Any] = field(default_factory=dict)
22:
23: def to_dict(self) -> dict:
24: d = asdict(self)
25: d["dtype"] = str(self.dtype).removeprefix("torch.")
... (16 more lines)
[tool] src/lora_lite/target.py (completed)
kind: read
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/target.py
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/target.py
output:
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/target.py</path>
<type>file</type>
<content>
1: """Find linear-like targets by shape (reader/writer/inner) + name regex.
2:
3: Structural matching is deliberate: bnb Linear4bit/8bitLt are not nn.Linear, but
4: they expose in_features/out_features/weight and their forward already handles
5: dequantization.
6: """
7: import re
8: from torch import nn
9:
10:
11: def is_linear_like(m: nn.Module) -> bool:
12: return (
13: hasattr(m, "in_features")
14: and hasattr(m, "out_features")
15: and hasattr(m, "weight")
16: and callable(m)
17: )
18:
19:
20: def _layer_idx(name: str) -> int | None:
21: m = re.search(r"\.layers?\.(\d+)\.", name)
22: return int(m.group(1)) if m else None
23:
24:
25: def _classify(m: nn.Module, d_model: int, name: str) -> str:
... (37 more lines)
[thinking] OK now I have a clear picture. Let me write up the design review with specific answers to each question. Let me also check if there's an AGENTS.md file:
[tool] glob (pending)
input: {}
[tool] glob (completed)
kind: search
input: AGENTS.md
output:
No files found
[thinking] Let me check for AGENTS.md in parent directories or look for any README or docs.
[tool] glob (pending)
input: {}
[tool] glob (completed)
kind: search
input: **/AGENTS.md
output:
No files found
[thinking] The user wants design notes. I have all the information I need. Let me write up the design notes for each reference.
[tool] todowrite (pending)
input: {}
[plan]
- [in_progress] write design notes for peft EVA
- [pending] write design notes for baukit nethook
- [pending] write design notes for antipasto3 SVD
[tool] 3 todos (completed)
kind: other
input: {"todos":[{"content":"write design notes for peft EVA","status":"in_progress","priority":"high"},{"content":"write design notes for baukit nethook","status":"pending","priority":"high"},{"content":"write design notes ...
output:
[
{
"content": "write design notes for peft EVA",
"status": "in_progress",
"priority": "high"
},
{
"content": "write design notes for baukit nethook",
"status": "pending",
"priority": "high"
},
{
"content": "write design notes for antipasto3 SVD",
"status": "pending",
"priority": "high"
}
]
[thinking] Let me write out the design notes for each reference. ## A. peft EVA **What EVA does** (1-paragraph): EVA computes SVD-of-activations to initialize LoRA's low-rank decomposition. Rather than random kaiming init, it collects layer inputs during a forward pass through calibration data, runs incremental PCA to find principal components, and uses those as the initial `A` matrix (the left singular vectors). Key features: - Per-layer rank redistribution based on explained variance (lines 302-314) - Convergence checking via cosine similarity between SVD components (lines 147-168) - Distributed support via all-gather (lines 81-102) - Handles non-convergence gracefully (multiple forward passes until all layers converge) A **minimal EVA variant** in lora-lite: The current API is: ```python variant.init(layer, cfg) # called per-layer after params created ``` EVA needs a **separate calibration st...
# Design Review: lora-lite vs peft EVA / baukit / antipasto3
## A. peft EVA
**What EVA does** (1-paragraph summary):
EVA computes SVD-of-activations to initialize LoRA's low-rank decomposition. Rather than random kaiming init, it registers hooks on target layers, runs incremental PCA on calibration inputs (collected during forward passes), and uses the top principal components (the right singular vectors of the stacked activation matrix, which live in input space) as the initial `lora_A` matrix. Key features: per-layer rank redistribution based on explained variance (lines 302-314), convergence checking via cosine similarity (lines 147-168), and distributed support via all-gather (lines 81-102).
**Minimal EVA variant sketch:**
The current API is:
```python
variant.init(layer, cfg) # called per-layer after params created (adapter.py:56)
```
EVA needs a **separate calibration step before `attach()`**:
```python
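# Option: calibrate() -> eva_state dict
def calibrate(model, dataloader, cfg):
    # Run forward passes, collect activations per target
    # Compute SVD, return {"layer_name": U_svd}
    ...

# Then in LoRA variant.init():
# NOTE: layer_name is the target's full module path; init(layer, cfg) does not
# receive it today, so it has to be supplied some other way.
if "eva_state" in cfg.variant_kwargs:
    with torch.no_grad():
        # lora_A is a parameter attached directly to the layer, not a submodule
        layer.lora_A.copy_(cfg.variant_kwargs["eva_state"][layer_name])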
```
**What to drop if re-implementing:**
- Rank redistribution logic (lines 302-314) — skip, use fixed rank
- Distributed gather (lines 81-102) — skip single-GPU only
- Hash deduplication for equal inputs (lines 173-213, 374-386) — skip
- Resume-from-checkpoint plumbing (lines 489-559) — skip
- Whiten option (lines 475-477) — skip
- PEFT model wrapping (lines 612-656) — skip
**API impact:**
Current `Variant.init(layer, cfg)` signature at adapter.py:56 supports EVA via `cfg.variant_kwargs["eva_state"]`. Minimal change: add `calibrate(model, dataloader, cfg)` function (~50 lines) that computes SVD and returns state dict, then LoRA's `init()` checks for `eva_state` in kwargs.
---
**Recommendation: PARTIAL**
- **Import**: Core SVD-of-activations collection + SVD computation (lines 289-486 minus distributed/rank-redist/PEFT plumbing)
- **Approximate line-count**: +~80 lines (calibrate function + init hook), 0 removed
- **Trade**: Adds data-driven init capability; costs one new function and variant_kwargs passthrough
---
## B. baukit nethook
**What `TraceDict`/`Trace` gives us that lora-lite doesn't:**
- Context manager pattern with automatic cleanup (adapter.py:68 already handles this via `RemovableHandle`)
- `retain_grad=True` for backward pass (lora-lite doesn't need this — we're not editing grads)
- `edit_output` for intercepting/modifying (not needed — we have dedicated `forward_input` hook)
- Dictionary interface (`ret['layer.name'].output`) — we just store layer ref directly
Quote from adapter.py:64-68:
```python
for _, layer, _ in attached_targets:
    if hasattr(layer._lora_variant, "forward_input"):
        handles.append(layer.register_forward_pre_hook(_pre_hook))
    else:
        handles.append(layer.register_forward_hook(_hook))
```
**Would baukit simplify or complicate adapter.py?**
- **Neutral to negative**. Current code is 5 lines (64-68). Baukit's TraceDict adds ~100 lines of context-management indirection. We don't need:
- Per-layer config dicts (retain_output/retain_input per layer)
- Edit hooks mid-forward
- Stop-after-layer control
- clone/detach options
**Dependencies:**
Baukit is a single file (~470 lines). lora-lite currently has no deps beyond torch+einops. Adding baukit adds a dep for ~5 lines of value.
---
**Recommendation: SKIP**
Baukit adds ~400 lines for capabilities we don't need. Current hook registration (adapter.py:64-68) is simpler. The `retain_grad` feature could be useful for ml debugging but that's a separate concern — can inline later if needed.
---
## C. antipasto3 SVD adapter
**1. What to adopt or diverge from?**
| Feature | antipasto3 | lora-lite | Verdict |
|--------|-----------|----------|--------|
| SVD init from kernel | `create_svd_adapter()` line 169 | manual param_specs | **Diverge**: lora-lite missing |
| Delta-s scaling | `S_scaled = S + alpha*delta_s` (line 160) | direct scaling via `alpha/r` | **Adopt**: antipasto3 better — learnable scaling, not fixed |
| Block-diagonal rotation | Cayley transform (lines 135-141) | N/A | **Too complex** for lora-lite ethos |
| Residual: `W_res = kernel - U@S@Vh` | line 204 | N/A | **Consider**: adds expressivity |
| Dimension selection: L1 trip | lines 219-267 | None | **Skip**: too domain-specific |
| Polarity interleave | lines 271-287 | N/A | **Skip**: steering-specific |
**Key insight**: antipasto3's `W_res` residual (kernel - U@S@Vh) is the cleanest simplification over vanilla LoRA. Instead of:
- LoRA: `y = Wx + (α/r)BAx` (learns full rank-r delta)
You get:
- SVD: `y = W_res@x + (α/r)U_scaled@S_scaled@Vh@x` (SVD-based delta, capped)
This is essentially **PiSSA** (arXiv:2404.02948), which is already in lora-lite as `variants/pissa.py`.
**2. PiSSA-like factoring:**
PiSSA is in variants/pissa.py. Let me check it:
[tool] read (pending)
input: {}
[tool] src/lora_lite/variants/pissa.py (completed)
kind: read
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/pissa.py
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/pissa.py
output:
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/pissa.py</path>
<type>file</type>
<content>
1: """PiSSA: top-r SVD of W into A,B; replace W with W_res = W - B@A.
2:
3: Meng et al. 2024 https://arxiv.org/abs/2404.02948
4: W_eff(t=0) = W_res + B@A = W (numerically; bf16 round-trip not bit-exact).
5: """
6: import torch
7: from einops import einsum
8: from torch import nn
9:
10: from ..variant import register, ParamSpec
11:
12:
13: @register
14: class PiSSA:
15: name = "pissa"
16:
17: @staticmethod
18: def param_specs(d_in, d_out, cfg):
19: return {
20: "lora_A": ParamSpec((cfg.r, d_in), init="zeros", trainable=True),
21: "lora_B": ParamSpec((d_out, cfg.r), init="zeros", trainable=True),
22: }
23:
24: @staticmethod
25: def init(layer: nn.Linear, cfg) -> None:
... (32 more lines)
[thinking] So PiSSA is already in lora-lite! Let me check what antipasto3 offers beyond this. Comparing antipasto3 to PiSSA: - Both use SVD init from W antipasto3 has: - Delta-s scaling: learnable S + delta_s (additive learnable, not just fixed) - Residual is explicit (W_res stored separately) - PiSSA mutates layer.weight into W_res The key differences: 1. **Delta-s**: antipasto3 has `delta_s` parameters (line 89-91, 160) — fully learnable scaling. PiSSA has no learnable scaling, just fixed `alpha/r`. 2. **Steering coefficient**: antipasto3 uses explicit `alpha` mutated during forward (line 107, 116), not config-based. 3. **Residual**: PiSSA mutates `layer.weight` directly. antipasto3 stores `W_res` separately. antipasto3's delta-s is the main "new" thing worth considering. But that's really a steering/adaptation method (changing alpha at runtime), not an initialization method. For lora-lite's m...
[tool] todowrite (pending)
input: {}
[plan]
- [completed] write design notes for peft EVA
- [completed] write design notes for baukit nethook
- [completed] write design notes for antipasto3 SVD
[tool] 0 todos (completed)
kind: other
input: {"todos":[{"content":"write design notes for peft EVA","priority":"high","status":"completed"},{"content":"write design notes for baukit nethook","priority":"high","status":"completed"},{"content":"write design notes ...
output:
[
{
"content": "write design notes for peft EVA",
"status": "completed",
"priority": "high"
},
{
"content": "write design notes for baukit nethook",
"status": "completed",
"priority": "high"
},
{
"content": "write design notes for antipasto3 SVD",
"status": "completed",
"priority": "high"
}
]
[thinking] Wait, I wrote the design notes directly as output but the user asked me to answer questions and end with a recommendation. Let me reformat as requested. Actually, looking at the instructions again: the output format says "For each (A, B, C), end with: Recommendation: ADOPT / SKIP / PARTIAL". So I need to restructure my response to match that format better, and NOT write code. Let me restructure: A, B, C sections with each ending in recommendation.
# Design Review: lora-lite vs refs
## A. peft EVA (`docs/refs/peft_eva.py` + `peft_eva_finetuning.py`)
### 1. What EVA actually does
EVA uses SVD-of-activations (incremental PCA on layer inputs during forward passes) to initialize LoRA rather than random kaiming init. Key mechanism:
- Registers hooks on target layers, collects inputs across calibration batches
- Computes the top principal components (right singular vectors of the stacked inputs) via `IncrementalPCA` (lines 138-143)
- Convergence check via cosine similarity between current/previous components (lines 167-168)
- Per-layer rank redistribution based on explained variance ratio (lines 302-314, 456)
The user workflow (peft_eva_finetuning.py:66-76):
```python
eva_config = EvaConfig(rho=rho)
peft_config = LoraConfig(r=rank, init_lora_weights="eva", eva_config=eva_config)
peft_model = get_peft_model(model, peft_config)
initialize_lora_eva_weights(peft_model, dataloader) # separate calibration step
```
### 2. Minimal EVA variant API sketch
**Option: separate calibrate() step before attach()**
```python
# User does:
eva_state = calibrate(model, dataloader, cfg) # returns {"layer_name": U_svd}
cfg.variant_kwargs["eva_state"] = eva_state
attach(model, cfg) # LoRA.init() copies U_svd -> lora_A
```
API change: add `calibrate(model, dataloader, cfg)` function (~50 lines) that computes SVD on calibration inputs, returns state dict. In `LoRA.init()`, check for `eva_state[layer_name]` and copy.
### 3. What to drop if re-implementing
- Lines 302-314: rank redistribution logic → use fixed rank r
- Lines 81-102: distributed all-gather → single-GPU only
- Lines 173-213, 374-386: hash deduplication for equal inputs → skip
- Lines 489-559: resume-from-checkpoint plumbing → skip
- Lines 475-477: whiten option → skip
- Lines 612-656: PEFT model wrapper integration → standalone function
Droppable: ~60% of the 739-line file.
### 4. Does current init() signature support EVA?
Yes. adapter.py:56:
```python
variant.init(layer, cfg)
```
Add `eva_state` to `cfg.variant_kwargs` before attach, then in `LoRA.init()`:
```python
if "eva_state" in cfg.variant_kwargs:
    ...  # copy precomputed U into lora_A
```
Smallest change: add `calibrate()` function (~50 lines), no signature change needed.
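A minimal sketch of what such a calibration step could compute, assuming the per-layer inputs have already been collected into 2-D arrays. It uses NumPy as a stand-in for torch and a plain SVD instead of peft's incremental PCA; the helper name `eva_init_from_activations` and the dict-of-arrays input are hypothetical, not part of lora-lite or peft.

```python
import numpy as np


def eva_init_from_activations(acts: np.ndarray, r: int) -> np.ndarray:
    """Return an (r, d_in) init for lora_A from one layer's inputs.

    acts: (n_tokens, d_in) inputs seen by the target layer.
    Hypothetical helper: one-shot SVD, no convergence check, no rank redistribution.
    """
    if acts.ndim != 2:
        raise ValueError("expected (n_tokens, d_in) activations")
    if r > min(acts.shape):
        raise ValueError("rank r exceeds the rank available from the data")
    # Rows of vh are right singular vectors: orthonormal directions in input
    # space, which is what an (r, d_in) lora_A needs.
    _, _, vh = np.linalg.svd(acts, full_matrices=False)
    return vh[:r]


def calibrate(acts_by_layer: dict, r: int) -> dict:
    """Build an eva_state-style dict: layer name -> (r, d_in) init."""
    return {name: eva_init_from_activations(a, r) for name, a in acts_by_layer.items()}


# Toy data standing in for collected layer inputs.
rng = np.random.default_rng(0)
acts = {"layers.0.q_proj": rng.normal(size=(64, 16))}
eva_state = calibrate(acts, r=4)
```

The returned dict is plain data, so it can be placed in `cfg.variant_kwargs` without putting a dataloader into the config.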
---
**Recommendation: PARTIAL**
- **Import**: SVD-of-activations collection + compute
- **Lines**: +~50 (calibrate fn), 0 removed
- **What to remove in exchange**: Nothing — this is additive capability only. If you add calibration data collection, you must accept the additional complexity.
---
## B. baukit nethook (`docs/refs/baukit_nethook.py`)
### 1. What TraceDict/Trace gives us that we don't have
| Feature | baukit | lora-lite | Value |
|--------|-------|----------|-------|
| Context manager cleanup | lines 99-105 | RemovableHandle (adapter.py:64-68) | Same |
| retain_grad | line 37-38 | None | Not needed |
| edit_output mid-forward | lines 39, 72-75 | None | Not needed |
| Dictionary interface | line 111 | None | Not needed |
| Stop-after-layer | lines 46-47, 189-202 | None | Not needed |
TraceDict essentially provides: per-layer output retention + config dicts + cleanup. We already handle cleanup via RemovableHandle.
### 2. Would baukit simplify or complicate adapter.py?
Quote adapter.py:64-68:
```python
for _, layer, _ in attached_targets:
    if hasattr(layer._lora_variant, "forward_input"):
        handles.append(layer.register_forward_pre_hook(_pre_hook))
    else:
        handles.append(layer.register_forward_hook(_hook))
```
This is 5 lines. Baukit's TraceDict adds ~100 lines of context management we'd use only to replicate these 5 lines. Our current approach is simpler.
### 3. Is baukit worth a dep?
No. lora-lite's principle is minimal deps (currently torch+einops). Baukit adds ~400 lines of code for capabilities we don't need. `retain_grad` for debugging is the only useful feature — that's 3 lines to inline if needed later.
---
**Recommendation: SKIP**
Baukit adds a dep for no net simplification. Current hook registration (adapter.py:64-68) is 5 lines and sufficient. If `retain_grad` is needed for ml debugging, inline 3 lines later.
---
## C. antipasto3 SVD adapter (`docs/refs/antipasto3_svd_adapter.py`)
### 1. What to adopt or diverge from
| Feature | antipasto3 | lora-lite | Verdict |
|---------|-----------|----------|--------|
| SVD init from kernel | line 169 `create_svd_adapter()` | PiSSA (pissa.py) | Already have |
| Delta-s scaling: learnable | line 160 `S + alpha*delta_s` | fixed via alpha/r | **Adopt** |
| Residual: `W_res = kernel - U@S@Vh` | line 204 | PiSSA has this | Already have |
| Block-diagonal Cayley rotation | lines 135-141 | N/A | Skip (too complex) |
| Dimension selection (L1 trip) | lines 219-267 | N/A | Skip (domain-specific) |
**Key adoption**: antipasto3's learnable `delta_s` is cleaner than fixed scaling. Currently lora-lite uses `scale = alpha/r` at **runtime** — but delta-s makes scaling a **learnable parameter** with its own gradient.
The PiSSA variant comes close: its `init()` sets `scale=1` via the alpha=r convention (pissa.py:43-46), but it does not expose the scaling as a learnable parameter.
### 2. PiSSA-like factoring
lora-lite already has PiSSA in `variants/pissa.py`:
- Line 31-34: SVD of weight
- Line 36-40: decompose into B@A with sqrt(S) scaling
- Line 46: residual W_res = W - B@A
This is exactly antipasto3's core insight, implemented cleanly. The antipasto3 contribution is making `delta_s` learnable — that's a natural extension of PiSSA, not a replacement.
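The three steps listed above (SVD, sqrt(S) split, residual) can be sketched as follows. This is a NumPy illustration under stated assumptions, not the contents of `pissa.py`: the function name `pissa_factor` is hypothetical and the real variant works on torch tensors.

```python
import numpy as np


def pissa_factor(w: np.ndarray, r: int):
    """Split W (d_out, d_in) into a rank-r product B @ A plus a residual."""
    u, s, vh = np.linalg.svd(w, full_matrices=False)
    sqrt_s = np.sqrt(s[:r])
    b = u[:, :r] * sqrt_s          # (d_out, r): columns scaled by sqrt(S)
    a = sqrt_s[:, None] * vh[:r]   # (r, d_in): rows scaled by sqrt(S)
    w_res = w - b @ a              # what would replace layer.weight
    return a, b, w_res


rng = np.random.default_rng(0)
w = rng.normal(size=(8, 6))
a, b, w_res = pissa_factor(w, r=2)

# Identity at init: W_res + B @ A reproduces W up to floating-point error.
identity_err = np.abs(w_res + b @ a - w).max()
```

In float64 the identity error is at rounding level; the docstring in `pissa.py` notes that a bf16 round-trip is not bit-exact.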
---
**Recommendation: ADOPT (concept)**
- **Import**: learnable delta_s concept (not code — just the idea)
- **Lines**: +0 by intent, no code change right now
- **What to remove**: Nothing — this is a design note that PiSSA could evolve toward learnable scaling instead of fixed alpha/r
The other antipasto3 patterns (Cayley rotation, polarity interleave) are steering-specific and too complex for the lora-lite ethos. The clean insight is already in PiSSA.
---
## Summary
| Ref | Recommendation | Impact |
|-----|--------------|--------|
| peft EVA | PARTIAL | +50 lines calibrate fn |
| baukit | SKIP | 0 (keep current) |
| antipasto3 | ADOPT concept | 0 (design note) |
[done] end_turn
File diff suppressed because it is too large Load Diff
-22
View File
@@ -1,22 +0,0 @@
# 2026-04-26 code review: testing proof
## External review
Reviewer: Gemini 2.5 Flash CLI, read-only prompt.
Findings:
- Critical: `tests/smoke.py` could silently pass if base gradients leaked because it did not check non-`lora_*` grads.
- Important: `tests/smoke.py` did not explicitly assert the expected number of attached TinyModel targets.
Resolution:
- Added `assert_no_base_grads(model)` to the smoke training loop.
- Added `assert n_targets == 28` immediately after smoke attach.
- Re-ran `just test` and `just smoke`; both passed.
## Fresh-eyes subagent review
Verdict: PASS.
The reviewer could not name a remaining blocker for skipped targets, dead hooks, base-gradient leakage, or broken save/load producing the collected evidence. Caveat: Qwen coverage is intentionally narrow, layer-0 `q_proj`/`v_proj`, one prompt, tiny steps. This supports interface/training proof, not downstream finetuning quality.
-265
View File
@@ -1,265 +0,0 @@
# lora-lite plan and status
## Goal
Build a small, hackable LoRA-family adapter library for research experiments.
The core bet is that adapter variants should own the relationship between `(x, layer.weight, layer.lora_*)` and the layer output, while the library only handles targeting, parameter attachment, hooks, and save/load.
## Non-goals
- No PEFT compatibility layer.
- No module replacement.
- No merge/unmerge.
- No multiple named adapters per layer.
- No backward compatibility promises.
- No silent fallbacks.
## Design constraints
- Adapter params are attached directly to target layers as `lora_*` parameters.
- Save/load uses normal `state_dict()` keys, filtered by `"lora_"`.
- Forward hooks return the layer's new output, not just a delta.
- Targeting is structural: modules with `in_features`, `out_features`, and `weight` are linear-like.
- LoRA/DeLoRA support bnb-style 4/8-bit forward paths because the quantized base layer computes `y`; the hook only adds adapter math.
- PiSSA is fp-only in v1 because it mutates `layer.weight` into `W_res`.
- Data-calibrated variants use `group_init(model, targets, cfg, calibration_data)`; dataloaders stay out of `cfg` so checkpoints are serializable.
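A minimal sketch of the structural targeting rule, assuming it is a plain attribute check. The real predicate lives in `src/lora_lite/target.py` and may differ; the stand-in objects below are hypothetical.

```python
from types import SimpleNamespace


def is_linear_like(module: object) -> bool:
    """Structural check: anything exposing in_features, out_features and weight."""
    return all(hasattr(module, name) for name in ("in_features", "out_features", "weight"))


# A duck-typed stand-in for nn.Linear or a bnb quantized linear layer.
fake_linear = SimpleNamespace(in_features=4, out_features=8, weight=[[0.0] * 4] * 8)
# Has a weight but no feature dimensions, e.g. a norm layer.
not_linear = SimpleNamespace(weight=[1.0])
```

A check of this shape is why a fake non-`nn.Linear` target and bnb layers can be attached without naming their classes.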
## Implemented v0.0.1
| Area | Status | Evidence |
|---|---:|---|
| `LoraLiteConfig` | done | `src/lora_lite/config.py` |
| Variant registry + `ParamSpec` | done | `src/lora_lite/variant.py` |
| Structural target discovery | done | `src/lora_lite/target.py` |
| `attach` / `detach` / `save` / `load` | done | `src/lora_lite/adapter.py` |
| LoRA | done | `src/lora_lite/variants/lora.py` |
| PiSSA | done, fp-only | `src/lora_lite/variants/pissa.py` |
| DeLoRA | done | `src/lora_lite/variants/delora.py` |
| IA3 | done | `src/lora_lite/variants/ia3.py` |
| DoRA | done, fp-only | `src/lora_lite/variants/dora.py` |
| HRA | done | `src/lora_lite/variants/hra.py` (output-side Householder, hook-only -> bnb-compatible) |
| Smoke tests | done | `tests/smoke.py` |
| bnb minimal forward smoke | done | `Linear8bitLt` and `Linear4bit` pass on CUDA with `just bnb-smoke` |
## Current smoke evidence
Last verified log: `/home/wassname/.cache/agent-tmp/lora_lite_smoke_after_review.log`
| Check | Result |
|---|---|
| LoRA identity | `0.000e+00` |
| LoRA loss drop | `6.1%` |
| PiSSA identity | `1.550e-06` |
| PiSSA loss drop | `11.5%` |
| DeLoRA identity | `0.000e+00` |
| DeLoRA loss drop | `93.4%` |
| IA3 identity | `0.000e+00` |
| IA3 loss drop | `88.7%` |
| DoRA identity | `0.000e+00` |
| DoRA loss drop | `63.3%` |
| fake non-`nn.Linear` target | attaches, identity `0.000e+00`, grad nonzero |
| bnb `Linear8bitLt` | identity `0.000e+00`, grad nonzero |
| bnb `Linear4bit` | identity `0.000e+00`, grad nonzero |
## 2026-04-26 testing proof pass
Goal: upgrade from smoke-tested sketch to evidence that the current PEFT-lite interface trains on both toy models and a real HF Qwen model.
### Scope
In:
- Pytest coverage for LoRA, PiSSA, and DeLoRA correctness invariants.
- IA3 coverage as the first waiting-time simple adapter.
- A real `Qwen/Qwen3-0.6B` probe that trains each current variant on layer-0 `q_proj` and `v_proj`.
- Repeatable `just` recipes and workspace-local logs/artifacts.
Out:
- Benchmark claims.
- Quantized Qwen proof for PiSSA. PiSSA remains fp-only because it mutates `weight`.
- Full default-target training over every Qwen layer.
### Requirements and evidence
| Requirement | Distinguishing check | Evidence |
|---|---|---|
| R1: toy tests catch skipped targets/hooks | Perturb only `lora_*`; output must change. Missing target must raise. | `just test` -> `8 passed in 2.43s` in `logs/pytest.log` |
| R2: toy tests catch base-gradient leakage | After backward, all non-`lora_*` grads are `None`; all trainable names contain `lora_`. | `just test` -> `8 passed in 2.43s` |
| R3: save/load is exact for adapters | Saved key set equals full-path `lora_*` state; reload tensors equal; missing/extra `lora_*` keys raise. | `just test` -> `8 passed in 2.43s` |
| R4: current variants train on tiny task | 28 TinyModel targets; non-`lora_*` grads stay `None`; 20-step loss drop >5%. | `just smoke` -> LoRA 6.1%, PiSSA 11.5%, DeLoRA 93.4% |
| R5: current variants train on real Qwen | Fresh Qwen per variant; exact targets are layer-0 `q_proj`/`v_proj`; perturb >0; lossN < loss0; reload err < tol. | `pueue` task 70, `logs/qwen_probe.log`, all probes pass |
| R6: cold review cannot explain evidence under silent failure | External review findings fixed, then fresh-eyes subagent says PASS. | `docs/spec/20260426_code_review.md` |
### Qwen proof table
Command:
```bash
pueue add --immediate --follow --label "why: verify warning-free current Qwen probe after dtype API cleanup; resolve: same pass table proves current script" --working-directory "$PWD" --priority 1 -- just qwen-probe
```
Result from task 70:
| variant | targets | trainable | id_err | perturb | loss0 | lossN | drop% | grad | dθ | reload | adapter |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---|
| lora | 2 | 20480 | 0 | 0.375 | 5.25 | 3.131 | 40.36 | 1.432 | 4.262 | 0 | `outputs/qwen_train_probe/lora_adapter.pt` |
| pissa | 2 | 20480 | 0.3125 | 0.75 | 5.25 | 3.629 | 30.88 | 6.124 | 4.381 | 0 | `outputs/qwen_train_probe/pissa_adapter.pt` |
| delora | 2 | 20482 | 0.375 | 0.4062 | 5.246 | 5.166 | 1.537 | 0.04778 | 8.196 | 0 | `outputs/qwen_train_probe/delora_adapter.pt` |
Follow-up tasks 80 (lora/pissa/delora/ia3 at 16 steps) and 81 (dora at 16 steps) extend the table:
| variant | targets | trainable | id_err | perturb | loss0 | lossN | drop% | grad | dθ | reload | adapter |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---|
| lora | 2 | 20480 | 0 | 0.375 | 5.25 | 2.432 | 53.68 | 1.467 | 6.403 | 0 | `outputs/qwen_train_probe/lora_adapter.pt` |
| pissa | 2 | 20480 | 0.3125 | 0.75 | 5.25 | 2.958 | 43.66 | 6.124 | 5.909 | 0 | `outputs/qwen_train_probe/pissa_adapter.pt` |
| delora | 2 | 20482 | 0.3281 | 0.3125 | 5.261 | 4.823 | 8.322 | 0.06303 | 15.1 | 0 | `outputs/qwen_train_probe/delora_adapter.pt` |
| ia3 | 2 | 3072 | 0 | 0.375 | 5.25 | 4.473 | 14.79 | 0.463 | 5.926 | 0 | `outputs/qwen_train_probe/ia3_adapter.pt` |
| dora | 2 | 23552 | 0 | 0.3203 | 5.25 | 2.439 | 53.54 | 1.776 | 7.44 | 0 | `outputs/qwen_train_probe/dora_adapter.pt` |
| hra | 2 | 12290 | 0 | 0.3438 | 5.25 | 4.07 | 22.47 | 0.05225 | 4.735 | 0 | `outputs/qwen_train_probe/hra_adapter.pt` |
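The `drop%` column appears to be the relative loss reduction between `loss0` and `lossN`. A small sketch under that assumption (the function name `loss_drop_pct` is hypothetical); the table values agree with it up to rounding of the displayed losses.

```python
def loss_drop_pct(loss0: float, loss_n: float) -> float:
    """Relative loss reduction in percent."""
    return 100.0 * (loss0 - loss_n) / loss0


# LoRA and PiSSA rows from the task 70 table.
lora_task70 = loss_drop_pct(5.25, 3.131)
pissa_task70 = loss_drop_pct(5.25, 3.629)
```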
Failure-mode interpretation:
- If targeting silently skipped, exact target-set assertion would fail before training.
- If hooks were attached but dead, perturb delta would be 0.
- If base params trained, the non-`lora_*` gradient check would fail.
- If adapter grads were absent, `grad` or `dθ` would be 0/non-finite.
- If save/load were broken, adapter tensor equality or reload logit error would fail.
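The base-gradient check in the list above can be sketched as below. The name `assert_no_base_grads` comes from the code review notes, but this body is an assumption about how it might work; it only relies on `named_parameters()` and `.grad`, so a stub stands in for a torch module here.

```python
from types import SimpleNamespace


def assert_no_base_grads(model) -> None:
    """Fail if any parameter outside the adapter received a gradient."""
    leaked = [
        name
        for name, param in model.named_parameters()
        if "lora_" not in name and param.grad is not None
    ]
    if leaked:
        raise AssertionError(f"base params received gradients: {leaked}")


class StubModel:
    """Stand-in exposing named_parameters() like torch.nn.Module."""

    def __init__(self, params: dict):
        self._params = params

    def named_parameters(self):
        return iter(self._params.items())


clean = StubModel({
    "layers.0.q_proj.weight": SimpleNamespace(grad=None),
    "layers.0.q_proj.lora_A": SimpleNamespace(grad=[0.1]),
})
leaky = StubModel({"layers.0.q_proj.weight": SimpleNamespace(grad=[0.2])})

assert_no_base_grads(clean)  # passes silently
```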
## 2026-04-26 publishable workflow pass
Goal: make the repo behave like a small buildable library without adding framework surface area.
### Scope
In:
- Keep minimal functional integration tests as the default proof path.
- Add a GitHub CI hook for fast tests and package build.
- Add a `pueue` recipe for the larger Qwen proof.
- Tighten README structure so install, quickstart, core idea, tests, and status are easy to find.
Out:
- PyPI publishing token/workflow. Publishing should wait until the license decision is explicit.
- Implementing every named adapter variant in one pass. That would add complexity faster than tests can explain it.
### Requirements and evidence
| Requirement | Distinguishing check | Evidence |
|---|---|---|
| R7: fast CI catches broken tests/builds | `just check` must run pytest, smoke, `uv build`, and `twine check`; a broken test, wheel, sdist, or README metadata fails the hook. | `just check` -> pytest `10 passed in 4.10s`, smoke all pass, wheel/sdist built, `twine check dist/*` passed |
| R8: large proof is queued, not hidden in CI | `just qwen-queue` must create a pueue task in the repo cwd with why/resolve label and intended Qwen command. | task 79 queued at `/media/wassname/SGIronWolf/projects5/2026/lora-lite` with `--variants lora pissa delora ia3 --steps 16` |
| R9: README is publishable enough to judge | Reader sees install, quickstart, pseudocode core, testing commands, proof caveat before variant wishlist. | `README.md` reordered and human note removed |
| R10: variant roadmap buys simplicity | Next variant is ranked by fit to current hook contract; non-hook variants are deferred rather than half-supported. | this section |
Fresh review first blocked on weak `qwen-queue` evidence and README citation/comment junk. Fixes: queued real pueue task 74, added `twine check`, fenced citation, removed the stray README note. Final fresh review verdict: PASS.
Follow-up after omega correction:
- The recursive task failed previously because it was left as a passive sentinel instead of being expanded immediately after push.
- Real bnb proof now runs through required `just bnb-smoke` with `bitsandbytes==0.49.2`; `Linear8bitLt` and `Linear4bit` both show identity error `0.000e+00` and nonzero adapter gradients. The command now fails instead of skipping if CUDA or bnb is unavailable.
- IA3 was added while waiting for queued Qwen work: `just test` now has 10 tests, `just smoke` includes IA3, and pueue task 79 is queued for LoRA/PiSSA/DeLoRA/IA3 on Qwen 0.6B. Task 78 caught a quoting bug in `qwen-queue`; the recipe now queues a single `bash -lc` command.
- README was reduced to prospective-user content; implementation notes moved to `docs/developer_guide.md`.
### Adapter roadmap, ranked by simplicity
| Variant | Why it fits or waits | Next check |
|---|---|---|
| IA3 | Implemented. Multiplicative output vector, no base-weight mutation. | `just test` -> 12 tests passed; smoke/Qwen task 80 pass. |
| DoRA | Implemented for fp layers. Reads dense `weight` to compute `||V||_c`; bnb layers fail loudly. | smoke and Qwen task 81 pass with id_err=0, drop=53.5%, reload=0. |
| SSVD / PiSSA-family | Fits current `weight`-SVD pattern and teaches the SVD adapter path. | Reconstruction/identity invariant plus train proof. |
| HRA / OFT / ROAD | Interesting, but likely wants orthogonal or weight-transform semantics. Keep until hook-only formulation is clear. | Pseudocode first, then one invariant that distinguishes real rotation from dead code. |
| S-steer / AntiPaSTO | Research adapters. Should use `group_init` and activation evidence, not be squeezed into plain LoRA tests. | Calibration is consumed, hooks removed, load does not need calibration data. |
## Review history
A cold subagent review first returned `PASS_WITH_BLOCKERS`:
1. bnb modules were not targeted.
2. Hook cast `y` to `cfg.dtype`, which could round base outputs.
3. PiSSA overclaimed bnb support.
4. `load()` did not fail on missing adapter keys.
5. Data-calibrated init needed model-level access.
Fixes applied:
1. Structural `is_linear_like()` target predicate.
2. Hook only casts `x`, keeps `y` in base output dtype.
3. PiSSA fail-fast rejects non-plain `nn.Linear`.
4. `load()` fails on missing or unexpected `lora_` keys.
5. `attach(..., calibration_data=None)` plus optional `group_init(model, targets, cfg, calibration_data)`.
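Fix 4 can be illustrated with plain dicts standing in for state dicts. This is a sketch of the strict key check, assuming adapter tensors are identified by `"lora_"` in the key; the helper names `adapter_state` and `check_adapter_keys` are hypothetical.

```python
def adapter_state(state_dict: dict) -> dict:
    """Keep only adapter tensors, identified by 'lora_' in the key."""
    return {k: v for k, v in state_dict.items() if "lora_" in k}


def check_adapter_keys(expected: dict, loaded: dict) -> None:
    """Raise if the checkpoint is missing adapter keys or carries extra ones."""
    missing = sorted(set(expected) - set(loaded))
    unexpected = sorted(set(loaded) - set(expected))
    if missing or unexpected:
        raise KeyError(f"missing={missing} unexpected={unexpected}")


# Strings stand in for tensors; keys use full module paths.
model_state = {
    "layers.0.q_proj.weight": "W",
    "layers.0.q_proj.lora_A": "A",
    "layers.0.q_proj.lora_B": "B",
}
saved = adapter_state(model_state)
check_adapter_keys(adapter_state(model_state), saved)  # exact match: no error
```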
Second cold review verdict: `PASS` for the minimal 4bit-enabled scope.
## TODO / status
### Next implementation goals
- [x] Add DoRA.
- Verified: fp32 identity 0.000e+00, finite gradients, smoke drop 63.3%, Qwen-0.6B task 81 drop 53.5% reload 0.
- Caveat: bnb DoRA fails fast in `init` (needs dense `weight` for `||V||_c`).
- [ ] Add VeRA.
- Verify: shared buffers are allocated once, target slices match shape, identity or near-identity at init.
- [ ] Add SSVD or AntiPaSTO-style SVD variant.
- Verify: reconstruction or intended rotation invariant at init.
- [ ] Add real activation-calibrated toy variant using `group_init`.
- Verify: `calibration_data` is consumed during `attach`, hooks are removed, checkpoint is serializable, and `load()` does not require calibration data.
- [ ] Add load path that can skip calibration init for future `group_init` variants.
- Current caveat: `load()` calls `attach(model, cfg)` with `calibration_data=None`; fine for current variants, but future calibrated variants should separate param creation from calibration.
- [ ] Add a tiny HF-model smoke when convenient.
- Verify: target names look like real transformer modules and state dict keys match full paths.
### Design TODOs
- [ ] Decide whether `group_init` should run before or after forward hooks are registered.
- Current choice: after params are attached, before adapter forward hooks are registered.
- [ ] Decide whether replacing variants need `runs_base_layer=False` or can always transform `y`.
- OFT-like variants can rotate `y`; variants that truly avoid base forward need module replacement or pre-hook rewriting, likely out of v1.
- [ ] Add `weight_mode` for BitFit/SHiRA if those variants become in-scope.
- Minimal surface: `weight_mode in {"frozen", "bias_only", "sparse_grad"}`.
## Variant contract
```python
class Variant:
    name: str

    @staticmethod
    def param_specs(d_in, d_out, cfg) -> dict[str, ParamSpec]: ...

    @staticmethod
    def init(layer, cfg) -> None:
        # weight-only init; may mutate plain fp weights
        ...

    @staticmethod
    def group_init(model, targets, cfg, calibration_data) -> None:
        # optional model-level init for data-calibrated or cross-layer variants
        ...

    @staticmethod
    def forward(layer, x, y) -> Tensor:
        # return NEW output; additive variants return y + delta
        ...
```
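A toy variant following the contract above, as a sketch only. It uses NumPy arrays instead of torch tensors, a stand-in `ParamSpec` dataclass, and assumes `r` and `alpha` are readable from the layer inside `forward`; all of these are illustrative choices, not the lora-lite implementation.

```python
from dataclasses import dataclass
from types import SimpleNamespace

import numpy as np


@dataclass
class ParamSpec:  # hypothetical stand-in for lora_lite.variant.ParamSpec
    shape: tuple
    init: str = "zeros"
    trainable: bool = True


class ToyLoRA:
    name = "toy_lora"

    @staticmethod
    def param_specs(d_in, d_out, cfg):
        return {
            "lora_A": ParamSpec((cfg.r, d_in), init="kaiming"),
            "lora_B": ParamSpec((d_out, cfg.r), init="zeros"),
        }

    @staticmethod
    def init(layer, cfg) -> None:
        # Nothing to do: the specs already give B = 0, so the adapter starts as identity.
        return None

    @staticmethod
    def forward(layer, x, y):
        # Return the NEW output: base output plus the scaled low-rank delta.
        scale = layer.alpha / layer.r
        return y + scale * (x @ layer.lora_A.T) @ layer.lora_B.T


cfg = SimpleNamespace(r=2, alpha=4)
rng = np.random.default_rng(0)
layer = SimpleNamespace(
    r=cfg.r,
    alpha=cfg.alpha,
    weight=rng.normal(size=(5, 3)),
    lora_A=rng.normal(size=(2, 3)),
    lora_B=np.zeros((5, 2)),  # zero-init B -> output equals base output at step 0
)
x = rng.normal(size=(7, 3))
y = x @ layer.weight.T  # what the host layer would have computed
out = ToyLoRA.forward(layer, x, y)
```

Because `forward` receives the already-computed `y`, the variant never needs to touch the base weight, which is the property that keeps hook-only variants compatible with quantized layers.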
## Done means
This repo is good enough for a first real experiment when:
1. A Qwen/Llama model can attach LoRA adapters to intended target layers.
2. A 4bit or 8bit loaded model can train LoRA/DeLoRA params with nonzero gradients.
3. Saved adapter tensors use full-path keys and reload without calibration data.
4. Smoke tests distinguish target-skipping, hook identity drift, and missing-key load failure.
See interesting adapters at https://github.com/wassname/adapters_as_hypotheses
How peft handles 4-bit layers: https://github.com/huggingface/peft/blob/6030f9160ed2fc17220f6f41382a66f1257b6a93/src/peft/tuners/lora/layer.py
+31
View File
@@ -72,3 +72,34 @@ metamath-queue variant="lora" steps="5000" model="Qwen/Qwen3-0.6B-Base":
-l "why: HF-style MetaMathQA->GSM8K benchmark for {{model}} {{variant}} at {{steps}} steps; resolve: result JSON under outputs/metamath_gsm8k proves grad>0 dθ>0 base_grad_leaks=0 and reports valid/test accuracy" \
-w "$PWD" -o 1 -- \
uv run --extra benchmark python scripts/metamath_gsm8k_benchmark.py --model {{model}} --variant {{variant}} --steps {{steps}}
metamath-queue-all model="Qwen/Qwen3-0.6B-Base" steps="5000" variants="lora pissa delora dora hra ia3 ia3_ff eva antipasto":
    #!/usr/bin/env bash
    set -euo pipefail
    for variant in {{variants}}; do
        lr=1e-4
        extra_args=(--target-name '(q_proj|v_proj)$' --layers all --r 32 --alpha 64)
        case "$variant" in
            delora)
                lr=1e-3
                ;;
            ia3)
                lr=1e-3
                extra_args=(--target-name '(k_proj|v_proj)$' --layers all --r 32 --alpha 64)
                ;;
            ia3_ff)
                lr=1e-3
                extra_args=(--target-name '(down_proj)$' --layers all --r 32 --alpha 64)
                ;;
            eva)
                lr=1e-4
                ;;
            antipasto)
                lr=1e-4
                ;;
        esac
        pueue add \
            -l "why: benchmark {{model}} ${variant} on MetaMathQA->GSM8K at {{steps}} steps; resolve: outputs/metamath_gsm8k/results/benchmark_results.tsv gets a row with accuracy commit time method argv and result JSON for ${variant}" \
            -w "$PWD" -o 1 -- \
            uv run --extra benchmark python scripts/metamath_gsm8k_benchmark.py --model {{model}} --variant "$variant" --steps {{steps}} --lr "$lr" "${extra_args[@]}"
    done
+60
View File
@@ -5,8 +5,11 @@ import hashlib
import json
import math
import re
import fcntl
import subprocess
import sys
import time
from datetime import datetime, timezone
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Literal
@@ -426,11 +429,61 @@ def print_final_report(row: dict[str, Any], result_path: Path) -> None:
print(tabulate([row], headers="keys", tablefmt="tsv", floatfmt=".4g"))
def current_git_commit() -> str:
    try:
        return subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"


def append_results_row(
    args: BenchmarkConfig,
    result_path: Path,
    result: dict[str, Any],
    run_commit: str,
) -> tuple[Path, Path]:
    results_dir = args.output_dir / "results"
    results_dir.mkdir(parents=True, exist_ok=True)
    tsv_path = results_dir / "benchmark_results.tsv"
    lock_path = results_dir / "benchmark_results.tsv.lock"
    finished_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
    finished_label = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    snapshot_path = results_dir / f"{result['run_id']}__{finished_label}.json"
    snapshot_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    row = {
        "time_utc": finished_at,
        "commit": run_commit,
        "method": args.variant,
        "model": args.model,
        "mode": args.mode,
        "valid_accuracy": result["valid_accuracy"],
        "test_accuracy": result["test_accuracy"],
        "steps": args.steps,
        "samples": result["train_samples"],
        "wall_time_s": result["wall_time_s"],
        "argv": " ".join(sys.argv),
        "result_json": str(snapshot_path),
        "latest_result_json": str(result_path),
    }
    header = "\t".join(row)
    values = "\t".join(str(value) for value in row.values())
    # Hold an exclusive lock so concurrent runs do not interleave rows.
    with lock_path.open("w", encoding="utf-8") as lock_handle:
        fcntl.flock(lock_handle.fileno(), fcntl.LOCK_EX)
        if not tsv_path.exists():
            tsv_path.write_text(header + "\n" + values + "\n", encoding="utf-8")
        else:
            with tsv_path.open("a", encoding="utf-8") as handle:
                handle.write(values + "\n")
        fcntl.flock(lock_handle.fileno(), fcntl.LOCK_UN)
    return tsv_path, snapshot_path


def run(args: BenchmarkConfig) -> dict[str, Any]:
    if args.device == "cuda" and not torch.cuda.is_available():
        raise RuntimeError("CUDA requested but unavailable; pass --device cpu for plumbing smoke only")
    torch.manual_seed(args.seed)
    dtype = getattr(torch, args.torch_dtype)
    run_commit = current_git_commit()
    run_id = f"{args.model.replace('/', '--')}__{args.variant}__s{args.steps}__seed{args.seed}"
    out_dir = args.output_dir / run_id
    out_dir.mkdir(parents=True, exist_ok=True)
@@ -501,6 +554,12 @@ def run(args: BenchmarkConfig) -> dict[str, Any]:
}
result_path = out_dir / "result.json"
result_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
results_tsv_path, result_snapshot_path = append_results_row(args, result_path, result, run_commit)
result["results_tsv_path"] = str(results_tsv_path)
result["result_snapshot_path"] = str(result_snapshot_path)
result["commit"] = run_commit
result_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
commit_prefix = run_commit[:12]
row = {
"run_id": run_id,
@@ -515,6 +574,7 @@ def run(args: BenchmarkConfig) -> dict[str, Any]:
"base_grad_leaks": train_metrics["base_grad_leaks"],
"valid_acc": valid_metrics["accuracy"],
"test_acc": test_metrics["accuracy"],
"commit": run_commit[:12],
"result": str(result_path),
}
if probe_metrics is not None:
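The locked append in `append_results_row` lets several queued runs write to one TSV without interleaving rows. Below is a minimal standalone sketch of that pattern, assuming a POSIX system where `fcntl.flock` is available. The helper names `append_row_locked` and `demo` are hypothetical, and the rows in `demo` are placeholder values, not benchmark results.

```python
import fcntl
import tempfile
from pathlib import Path


def append_row_locked(tsv_path: Path, row: dict[str, object]) -> None:
    """Append one row under an exclusive lock; write the header if the file is new or empty."""
    lock_path = tsv_path.with_name(tsv_path.name + ".lock")
    header = "\t".join(row)
    values = "\t".join(str(v) for v in row.values())
    with lock_path.open("w", encoding="utf-8") as lock_handle:
        fcntl.flock(lock_handle.fileno(), fcntl.LOCK_EX)
        try:
            is_new = not tsv_path.exists() or tsv_path.stat().st_size == 0
            with tsv_path.open("a", encoding="utf-8") as handle:
                if is_new:
                    handle.write(header + "\n")
                handle.write(values + "\n")
        finally:
            # Closing the handle would also release the lock; this is explicit.
            fcntl.flock(lock_handle.fileno(), fcntl.LOCK_UN)


def demo() -> list[str]:
    """Write two placeholder rows to a temporary file and return its lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "benchmark_results.tsv"
        append_row_locked(path, {"method": "lora", "test_accuracy": 0.5})
        append_row_locked(path, {"method": "dora", "test_accuracy": 0.25})
        return path.read_text(encoding="utf-8").splitlines()
```

Locking a separate `.lock` file rather than the TSV itself keeps the data file free to be created or replaced while the lock is held.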
+12 -14
View File
@@ -19,7 +19,7 @@ from dataclasses import dataclass
from typing import Literal
import torch
from einops import einsum
from einops import einsum, rearrange
from jaxtyping import Float
from torch import nn, Tensor as T
@@ -59,15 +59,6 @@ def _build_rotation(rot_T: torch.Tensor, bs: int, max_angle: float) -> torch.Ten
return _cayley(A)
def _block_diag(blocks: torch.Tensor) -> torch.Tensor:
"""(n_blocks, bs, bs) -> (n_blocks*bs, n_blocks*bs) block-diagonal."""
n, bs, _ = blocks.shape
out = blocks.new_zeros(n * bs, n * bs)
for i in range(n):
out[i * bs : (i + 1) * bs, i * bs : (i + 1) * bs] = blocks[i]
return out
@register
class AntiPaSTO:
name = "antipasto"
@@ -123,15 +114,22 @@ class AntiPaSTO:
S = layer.lora_S.to(x.dtype) # (r,)
Vh = layer.lora_Vh.to(x.dtype) # (r, d_in)
R_blocks = _build_rotation(layer.lora_rot_T.float(), bs, max_angle)
R = _block_diag(R_blocks).to(x.dtype) # (r, r)
R_blocks = _build_rotation(layer.lora_rot_T.float(), bs, max_angle).to(x.dtype)
n_blocks = R_blocks.shape[0] # R_blocks: (n, bs, bs)
# Apply block-diagonal R per-block via einsum, never materializing (r,r).
if rotate_basis == "V":
Vh_eff = R @ Vh # rotate INPUT basis
# Vh_eff = R @ Vh, viewed block-wise on the r-axis.
Vh_blocks = rearrange(Vh, "(n a) i -> n a i", n=n_blocks)
Vh_rot = einsum(R_blocks, Vh_blocks, "n a b, n b i -> n a i")
Vh_eff = rearrange(Vh_rot, "n a i -> (n a) i")
U_eff = U
elif rotate_basis == "U":
# U_eff = U @ R.T, viewed block-wise on the r-axis.
U_blocks = rearrange(U, "d (n b) -> d n b", n=n_blocks)
U_rot = einsum(U_blocks, R_blocks, "d n b, n c b -> d n c")
U_eff = rearrange(U_rot, "d n c -> d (n c)")
Vh_eff = Vh
U_eff = U @ R.T # rotate OUTPUT basis
else:
raise ValueError(f"rotate_basis must be 'U' or 'V', got {rotate_basis!r}")
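The change above replaces the dense `(r, r)` block-diagonal matrix with per-block contractions. The sketch below checks, under the assumption that NumPy stands in for torch and einops, that the block-wise form gives the same result as multiplying by the materialized matrix. `block_diag`, `rotate_vh`, and `rotate_u` are illustrative names, not the repository's functions.

```python
import numpy as np


def block_diag(blocks: np.ndarray) -> np.ndarray:
    """(n, bs, bs) -> (n*bs, n*bs) dense block-diagonal matrix (reference only)."""
    n, bs, _ = blocks.shape
    out = np.zeros((n * bs, n * bs), dtype=blocks.dtype)
    for i in range(n):
        out[i * bs:(i + 1) * bs, i * bs:(i + 1) * bs] = blocks[i]
    return out


def rotate_vh(blocks: np.ndarray, vh: np.ndarray) -> np.ndarray:
    """Compute R @ Vh block-wise, without building the (r, r) matrix."""
    n, bs, _ = blocks.shape
    vh_blocks = vh.reshape(n, bs, -1)  # "(n a) i -> n a i"
    rotated = np.einsum("nab,nbi->nai", blocks, vh_blocks)
    return rotated.reshape(n * bs, -1)  # "n a i -> (n a) i"


def rotate_u(blocks: np.ndarray, u: np.ndarray) -> np.ndarray:
    """Compute U @ R.T block-wise, without building the (r, r) matrix."""
    n, bs, _ = blocks.shape
    u_blocks = u.reshape(u.shape[0], n, bs)  # "d (n b) -> d n b"
    rotated = np.einsum("dnb,ncb->dnc", u_blocks, blocks)
    return rotated.reshape(u.shape[0], n * bs)  # "d n c -> d (n c)"
```

The block-wise form costs `n * bs * bs` multiplies per column instead of `r * r`, which is the saving the comment about never materializing `(r, r)` refers to.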
+4 -3
View File
@@ -57,10 +57,11 @@ class DoRA:
) -> Float[T, '*B o']:
cfg = layer._lora_cfg
scale = cfg.alpha / cfg.r
# V = W + scale * B @ A
# Paper §4.3: treat ||V+ΔV||_c as a constant (detach from grad graph) for
# stability and ~2x lower memory. Match peft (lora_weight.detach + weight_norm.detach).
BA = einsum(layer.lora_B, layer.lora_A, "o r, r i -> o i")
V = layer.weight + scale * BA # (d_out, d_in)
v_norm = V.norm(dim=1).clamp_min(1e-12) # (d_out,)
V = layer.weight + scale * BA.detach() # (d_out, d_in)
v_norm = V.norm(dim=1).clamp_min(1e-12).detach() # (d_out,)
# Bias passes through unscaled (matches peft).
bias = getattr(layer, "bias", None)
wx = y if bias is None else (y - bias)
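Detaching the norm changes only the gradients, not the forward value. The following NumPy sketch therefore checks only the forward pass, assuming the paper's formulation `W' = m * V / ||V||_c` with the bias passed through unscaled. NumPy has no autograd, so the detach itself is not modeled. `dora_forward` is an illustrative name and signature, not the repository's.

```python
import numpy as np


def dora_forward(x, W, bias, A, B, m, scale):
    """Forward value of a DoRA layer; shapes: x (n, d_in), W (d_out, d_in)."""
    y = x @ W.T + bias                       # what the host layer already computed
    V = W + scale * (B @ A)                  # (d_out, d_in)
    v_norm = np.maximum(np.linalg.norm(V, axis=1), 1e-12)  # per-row L2 norm
    wx = y - bias                            # strip the bias before rescaling
    delta = scale * (x @ A.T @ B.T)          # low-rank update applied to x
    return (m / v_norm) * (wx + delta) + bias
```

With `B = 0` and `m` set to the row norms of `W`, the ratio `m / v_norm` is 1 and the layer reproduces the base output, which is the identity-at-init property.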
+2 -2
View File
@@ -13,7 +13,7 @@ Refs:
(offline: docs/refs/peft_eva.py; example: docs/refs/peft_eva_finetuning.py)
"""
import torch
from einops import einsum
from einops import einsum, rearrange
from jaxtyping import Float
from torch import nn, Tensor as T
from typing import Iterable
@@ -68,7 +68,7 @@ class EVA:
def _h(module, args, kwargs):
# signature: pre-forward, args[0] is the input tensor
x = args[0].detach()
captured[name].append(x.reshape(-1, x.shape[-1]).to(torch.float32).cpu())
captured[name].append(rearrange(x, "... d -> (...) d").to(torch.float32).cpu())
return _h
handles = [
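The hook above collapses every leading axis of the captured input into a single token axis, so batches with different shapes can be pooled. A small sketch of that flattening follows, using NumPy in place of torch and einops. The final concatenation step is an assumption about how the captured list is consumed and is not shown in this diff.

```python
import numpy as np


def flatten_tokens(x: np.ndarray) -> np.ndarray:
    """Collapse all leading axes into one: the '... d -> (...) d' pattern."""
    return x.reshape(-1, x.shape[-1])


# Two placeholder batches with different (batch, seq) shapes but the same width.
captured = []
for batch in (np.ones((2, 3, 4)), np.zeros((1, 5, 4))):
    captured.append(flatten_tokens(batch).astype(np.float32))

# Assumed downstream step: pool all tokens into one (n_tokens, d) matrix.
stacked = np.concatenate(captured, axis=0)
```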
+2 -2
View File
@@ -17,7 +17,7 @@ Refs:
(offline: docs/refs/peft_hra_layer.py)
"""
import torch
from einops import einsum
from einops import einsum, repeat
from jaxtyping import Float
from torch import nn, Tensor as T
from dataclasses import dataclass
@@ -56,7 +56,7 @@ class HRA:
r, d_in = layer.lora_U.shape
half = torch.empty(r // 2, d_in, dtype=layer.lora_U.dtype, device=layer.lora_U.device)
nn.init.kaiming_uniform_(half, a=5 ** 0.5)
layer.lora_U.copy_(torch.repeat_interleave(half, 2, dim=0))
layer.lora_U.copy_(repeat(half, "h d -> (h two) d", two=2))
return
@staticmethod
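The init above duplicates each row of `half` so that rows come in identical consecutive pairs. One consequence, sketched below in NumPy, is that a chain of Householder reflections built from such pairs multiplies out to the identity, because each reflection is its own inverse. That this is the reason for the paired init is an assumption; the helper names are illustrative.

```python
import numpy as np


def householder(u: np.ndarray) -> np.ndarray:
    """Reflection I - 2 u u^T / ||u||^2 for a single vector u."""
    u = u / np.linalg.norm(u)
    return np.eye(u.shape[0]) - 2.0 * np.outer(u, u)


def paired_init(half: np.ndarray) -> np.ndarray:
    """Rows h0, h0, h1, h1, ...: the 'h d -> (h two) d' pattern with two=2."""
    return np.repeat(half, 2, axis=0)


def chain(U: np.ndarray) -> np.ndarray:
    """Product of the Householder reflections defined by the rows of U."""
    R = np.eye(U.shape[1])
    for u in U:
        R = R @ householder(u)
    return R
```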
+11 -6
View File
@@ -3,10 +3,11 @@
Meng et al. 2024 https://arxiv.org/abs/2404.02948
W = U S Vh (truncated to top-r)
B = U sqrt(S), A = sqrt(S) Vh, W_res = W - B A
Sr_eff = Sr / (alpha/r) # peft-style: pre-divide so A/B
B = U sqrt(Sr_eff), A = sqrt(Sr_eff) Vh # update dynamics match for any alpha
W_res = W - (alpha/r) B A = W - U Sr Vh # scaling cancels symmetrically
Identity at t=0: W_res + B@A == W (bf16 round-trip, not bit-exact).
Pass alpha=r for paper-faithful scale=1.
Identity at t=0: W_res + (alpha/r) B@A == W (fp32 round-trip, bf16 cast can drift).
Refs:
- paper: https://github.com/MuLabPKU/PiSSA/blob/main/utils/init_pissa.py
@@ -51,16 +52,20 @@ class PiSSA:
W = layer.weight.data.float() # (d_out, d_in)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
r = cfg.r
scale = cfg.alpha / cfg.r
Ur, Sr, Vhr = U[:, :r], S[:r], Vh[:r, :]
sqrtS = Sr.sqrt()
# B @ A = Ur diag(Sr) Vhr; pick B = Ur sqrt(Sr), A = sqrt(Sr) * Vhr
# Pre-divide Sr by scaling so A/B carry "natural" magnitudes for any alpha
# (peft pissa_init does this; needed so the alpha/r scaling on the forward
# gives matched update dynamics rather than re-scaling A,B by sqrt(scale)).
Sr_eff = Sr / scale
sqrtS = Sr_eff.sqrt()
# B @ A = Ur diag(Sr/scale) Vhr; W_res = W - scale * B@A = W - Ur diag(Sr) Vhr.
B = (Ur * sqrtS).to(cfg.dtype)
A = (sqrtS[:, None] * Vhr).to(cfg.dtype)
layer.lora_B.data.copy_(B)
layer.lora_A.data.copy_(A)
# fp32 subtraction so W_res stays accurate.
BA = (B.float() @ A.float())
scale = cfg.alpha / cfg.r
layer.weight.data.copy_((W - scale * BA).to(layer.weight.dtype))
@staticmethod
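The pre-division by `alpha/r` can be checked numerically. The NumPy sketch below follows the formulas in the docstring above and verifies that `W_res + (alpha/r) B@A` recovers `W`, and that the scaled product equals the top-r SVD component regardless of `alpha`. It runs in float64, so it says nothing about bf16 drift. `pissa_init` is an illustrative standalone function, not the repository's method.

```python
import numpy as np


def pissa_init(W: np.ndarray, r: int, alpha: float):
    """Return (A, B, W_res, scale) with W_res + scale * B @ A == W."""
    scale = alpha / r
    U, S, Vh = np.linalg.svd(W, full_matrices=False)
    Ur, Sr, Vhr = U[:, :r], S[:r], Vh[:r, :]
    sqrt_s = np.sqrt(Sr / scale)        # pre-divide so the forward scaling cancels
    B = Ur * sqrt_s                     # (d_out, r)
    A = sqrt_s[:, None] * Vhr           # (r, d_in)
    W_res = W - scale * (B @ A)         # residual, i.e. W - Ur diag(Sr) Vhr
    return A, B, W_res, scale
```

With `alpha = r` the scale is 1 and the factors reduce to the paper's `B = U sqrt(S)`, `A = sqrt(S) Vh`.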