Replace antipasto's rotation/Cayley with a bounded 1+ELU gain and split the
S-space idea into four interpretable PiSSA-style cores (frozen U/S/Vh, small
trainable core):
- antipasto: S_eff = S*(1+ELU(coeff*g)). exp-bounded attenuation, linear
amplification (constant gradient, no runaway). g=0 -> exact identity.
- antipasto_rot: keeps the block-Cayley rotation as a separate variant for
cost comparison (its per-forward solve is the 72ms vs 36ms gap).
- antipasto_ablate: contractive (I - a c c^T) diag(S), eigenvalues in [0,1],
cannot blow up. Optional cov_orient (CorDA) basis.
- antipasto_corda: covariance-oriented oblique projector P = Vh C^{-1/2}, the
data-energy basis rather than the weight-gain basis. 1+ELU gain.
Add scripts/_cost.py + scripts/cost_report.py: one-row-per-variant cost table
(trainable params, peak GPU mem, fwd/bwd ms, added MACs/tok, group_init ms).
Wire all four into the benchmark, smoke test, and __init__ exports.
External review (DeepSeek-v4-pro, docs/reviews/) verified the math; acted on
its one real point (corda g now inits to zeros for exact identity).
Co-Authored-By: Claudypoo <noreply@anthropic.com>
4.8 KiB
Code Review: Four S-space weight adapters (anti-pasto family + sspace reference + cost script)
Summary
The code adds four PiSSA-style adapters (antipasto, antipasto_rot, antipasto_ablate, antipasto_corda) that store frozen top-r SVD buffers and a small trainable gain/core. The key mathematical claims in the docstrings (identity at g=0, contraction of ablation core, exact reconstruction of the CorDA decomposition, signed cosine gate) are correct. However, group_init() re-selects/re-orients the SVD basis but does not reset the trainable parameters (g, delta_s, rot_T, c, alpha), which introduces a latent but important correctness problem — after a data-driven reinit, gains and direction vectors become misaligned with their intended singular axes.
Critical (must fix)
antipasto_corda.py,antipasto_rot.py,antipasto_ablate.py:group_init()re-orients or re-selects the top-r SVD basis but leaves the existing trainable parameters (lora_g,lora_delta_s,lora_rot_T,lora_c,lora_alpha) untouched, still attached to the old indices.- For
antipasto_corda,lora_gis initialised withN(0, 4e-4)ininit(). Aftergroup_init, those small random gains now multiply a different set of singular directions, producing an uncontrolled (though tiny) perturbation. - For
antipasto_rot, the+4e-4bias onlora_delta_sremains, now applied to arbitrary resorted directions, and the block rotationlora_rot_Tis disconnected from the new block structure. - For
antipasto_ablate, the ablation directions and strengths (lora_c,lora_alpha) are not reset; if any warm-starting or training has happened, they point in the wrong subspace.
Fix: After re-selection or re-orientation, either re-initialise the trainable parameters (e.g., zero for g, zeros/small for delta_s and rot_T, random-normalised for c, init_alpha for alpha) or re-index them with the sameidxmapping used for the buffers. Document thatgroup_initmust be called before any training and that the trainable parameters will be fresh after it.
- For
Important (should fix)
antipasto_ablate.py– Contraction claim and coeff bounds: the forward pass appliesh = h - coeff * (proj * alpha) @ Chat.T. The core is a contraction only when bothcoeffandalphaare in[0, 1]. The config’scoeffcan be any float (the docstring mentionscoeff<0for amplification). There is no runtime clamping ofcoeff. If the intention is to keep the contraction property structurally enforced, clampcoeffto[0, 1]insideforward()or at least validate it.sspace.py– Division bysqrtSwithout epsilon:xS = (y_eff @ U_r) / sqrtS. For small or zero singular values (e.g., when r is large or W is low-rank) this can produce NaNs/infs. Add a smallεdenominator (consistent with the ε used elsewhere).antipasto_ablate.py–_orthonormal()callstorch.linalg.qr(c.float())every forward pass. For largerand many layers this adds non-trivial cost. A lighter reparameterisation (e.g., maintainingcas a matrix with a normalisation step that avoids full QR) might be warranted, but for small(r,k)pairs the current approach is acceptable. At minimum, add a comment noting the per-forward cost.
Suggestions
antipasto.pygroup_init: after the idx sort,idx = idx.sort().valuesensures a stable, canonical ordering. This is a nice touch for reproducibility.antipasto_ablate.py– The docstring says “CorDA-orient the basis from input covariance … the ablation is OUTPUT-side and CorDA's U stays orthonormal …”. The forward code correctly uses the orthonormalUfor output projection, so the contraction in S-space carries over to the output.sspace.py– The signed gate correctly preservescossign, so anti-aligned tokens receive a negativegate * alphaand are pushed opposite todS_hat. Verified.
Positive
- The documentation is thorough, explaining the design choices (why 1+ELU, why contraction, why CorDA), and includes references.
- The PiSSA-style
W_resdecomposition is implemented correctly across all variants. antipasto.py’sS_eff = S * (1 + ELU(coeff*g))is indeed C1, positive, and identity atg=0— no sign-flip bugs.antipasto_ablate.pyenforces orthonormalChatand clampsalphato[0,1], making the contraction property safe whencoeffis also bounded.
Verdict
REQUEST CHANGES
group_init() must reset the trainable parameters after it changes the SVD basis; without this, the adapters silently poison their own steering gains with a different set of directions. Fix that and the code is ready.
Note: The sspace.py and _cost.py files are part of the same move into lora-lite and are free of the parameter-reset issue; the only concern in sspace.py is the unprotected division by singular values.