Broaden raw*/err* patterns to *raw*/*err* so prefixed scratch (loraxs_raw.jsonl, loraxs_err.txt) is ignored. Add the GPT-5.5 review of the lora_xs variant as the curated artifact. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Overall verdict: the LoRA-XS core math is mostly correct and faithful to the official repo. The main things I would flag are config/default scaling, documentation/orientation wording, and buffer serialization assumptions, not the SVD factor shapes.
lora_xs.py
- `A = (Sr[:, None] * Vhr)` / `B = Ur`. Verdict: correct.
  Given that PyTorch stores `layer.weight` as `(d_out, d_in)` and computes `y = x @ W.T`, the SVD is of the stored `W = U S Vh`. The code stores:
  - `lora_A`: `(r, d_in)` = `diag(Sr) Vhr`
  - `lora_B`: `(d_out, r)` = `Ur`

  The forward pass gives `x @ A.T @ R.T @ B.T`, i.e. `x @ V_r @ diag(Sr) @ R.T @ U_r.T`. This matches the official PyTorch-module implementation `lora_B(R(lora_A(x)))`. In paper row-vector notation it is `x A_paper R_paper B_paper` with `A_paper = A.T`, `B_paper = B.T`, and `R_paper = layer.lora_R.T`. So the tensor shapes are right. Only caveat: if you ever load official checkpoints directly, confirm whether `R` needs transposition.
- `h = h @ R.T`. Verdict: acceptable, but orientation-sensitive.
  Since `R` is square and unconstrained, training from scratch is mathematically fine. But this means the stored tensor represents the transpose of the paper's row-vector `R`. Not a runtime bug, but worth documenting for checkpoint conversion.
- `class LoRAXSConfig(AdapterConfig): variant = "lora_xs"`. Verdict: suspicious.
  The file does not visibly enforce the reference default `alpha = r`. If `AdapterConfig` inherits the library's usual LoRA/PiSSA default, especially `alpha = 2r`, then `scale = cfg.alpha / cfg.r` will silently use `scale=2` instead of the paper/repo's `scale=1`. That is the most concrete faithfulness risk in this snippet.
- `scale = cfg.alpha / cfg.r` with `R ~ normal(0, 1e-5)`. Verdict: not a vanishing-gradient problem.
  The tiny `R` only makes the initial delta tiny. The gradient w.r.t. `R` does not vanish because `A` and `B` are frozen nonzero SVD factors. `lr ~ 4e-3` is plausible; the early gradient scale is governed by activations, `A`, `B`, loss gradients, and `alpha/r`, not by the magnitude of `R`.
- `lora_A/lora_B ... as_buffer=True, trainable=False`. Verdict: correct, with checkpoint caveat.
  This prevents grad leakage and keeps only `lora_R` trainable. Buffers should move with `.to()` and normally appear in `state_dict` (in plain PyTorch, unless they are registered as non-persistent). But adapter-only saving must include buffers; otherwise load will miss the frozen SVD factors. Also, checkpoint size is not just `r*r` if buffers are persisted.
- Docstring: `"R sits between two frozen, near-orthonormal bases"`. Verdict: inaccurate.
  `B = Ur` has orthonormal columns, but `A = diag(Sr) Vhr` does not have orthonormal rows: they are mutually orthogonal, with norms `Sr`. This matters for optimizer geometry and gradient conditioning.
- Docstring: `"h = W x + (alpha/r) B R A x"`. Verdict: misleading for this library.
  The implementation is row-vector PyTorch style, `y + scale * x @ A.T @ R.T @ B.T`, while the docstring uses column-vector ordering. Not a code bug, but easy to confuse.
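The shape, orientation, conditioning, and gradient points above can be checked numerically. The following is a minimal sketch, not the code under review: it uses NumPy in place of PyTorch, and the sizes, the random seed, the squared-error loss, and the helper names `loss` and `grad_R` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, n = 6, 5, 2, 4   # hypothetical sizes
alpha = r                        # reference default per the review; alpha = 2 * r would give scale = 2
scale = alpha / r

# PyTorch-style stored weight: shape (d_out, d_in), forward is y = x @ W.T
W = rng.standard_normal((d_out, d_in))
U, S, Vh = np.linalg.svd(W, full_matrices=False)
Ur, Sr, Vhr = U[:, :r], S[:r], Vh[:r, :]

A = Sr[:, None] * Vhr                    # (r, d_in), equals diag(Sr) @ Vhr, frozen
B = Ur                                   # (d_out, r), frozen
R = 1e-5 * rng.standard_normal((r, r))   # tiny trainable core
x = rng.standard_normal((n, d_in))

# 1) Orientation: three ways of writing the same delta.
delta_row = x @ A.T @ R.T @ B.T                    # row-vector code path
delta_svd = x @ Vhr.T @ np.diag(Sr) @ R.T @ Ur.T   # raw SVD factors
delta_col = (B @ R @ A @ x.T).T                    # docstring's column-vector form

# 2) Conditioning: B has orthonormal columns, A's rows have norms Sr.
B_gram = B.T @ B
A_row_norms = np.linalg.norm(A, axis=1)

# 3) Gradient w.r.t. R for a squared-error loss (an illustrative choice).
target = rng.standard_normal((n, d_out))

def loss(R_):
    y = x @ W.T + scale * (x @ A.T @ R_.T @ B.T)
    return 0.5 * np.sum((y - target) ** 2)

def grad_R(R_):
    y = x @ W.T + scale * (x @ A.T @ R_.T @ B.T)
    G = y - target                          # dL/dy, shape (n, d_out)
    return scale * B.T @ G.T @ (x @ A.T)    # dL/dR, shape (r, r)

g0 = grad_R(np.zeros((r, r)))               # gradient even at R exactly zero

print("max |delta_row - delta_svd|:", np.abs(delta_row - delta_svd).max())
print("row norms of A:", A_row_norms, "vs Sr:", Sr)
print("||grad_R|| at R = 0:", np.linalg.norm(g0))
```

In this sketch the gradient at `R = 0` depends only on `x`, `A`, `B`, the loss residual, and `scale`, which is the point made in the vanishing-gradient item. The row-vector and column-vector expressions agree when the same stored `R` is used in both, so the docstring concern is about notation, not about the computed values.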
pissa.py contrast
- `Sr_eff = Sr / scale`, `sqrtS`, `W - scale * BA`. Verdict: correct for PiSSA, not something to copy to LoRA-XS.
  PiSSA must split `sqrt(S)` between `A` and `B` and subtract the top-r component from `W` so that the adapted layer reproduces the original output at initialization. LoRA-XS intentionally leaves `W` intact and trains only `R`; folding all of `S` into `A` matches the reference repo. Splitting `sqrt(S)` would be a different parameterization with different optimizer dynamics, even though the two can be transformed into each other without changing what is expressible.
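The contrast can be made concrete with a small NumPy sketch. It is built only from the expressions quoted above (`Sr_eff = Sr / scale`, `sqrtS`, `W - scale * BA`), not from the actual `pissa.py`; the sizes, the seed, the `scale` value, and all variable names are assumptions, and `R` is set to exactly zero instead of `normal(0, 1e-5)` so the comparison is exact.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, r, n = 6, 5, 2, 3   # hypothetical sizes
scale = 2.0                      # hypothetical value, e.g. alpha = 2 * r

W = rng.standard_normal((d_out, d_in))   # stored as (d_out, d_in)
U, S, Vh = np.linalg.svd(W, full_matrices=False)
Ur, Sr, Vhr = U[:, :r], S[:r], Vh[:r, :]
x = rng.standard_normal((n, d_in))
y_ref = x @ W.T                          # output of the unmodified layer

# PiSSA-style init: split sqrt(S) across A and B, then subtract from W.
Sr_eff = Sr / scale
sqrtS = np.sqrt(Sr_eff)
A_pissa = sqrtS[:, None] * Vhr           # (r, d_in)
B_pissa = Ur * sqrtS[None, :]            # (d_out, r)
W_res = W - scale * (B_pissa @ A_pissa)  # residual base weight
y_pissa = x @ W_res.T + scale * (x @ A_pissa.T @ B_pissa.T)

# LoRA-XS-style init: W untouched, all of S folded into A, R starts at zero here.
A_xs = Sr[:, None] * Vhr
B_xs = Ur
R = np.zeros((r, r))
y_xs = x @ W.T + scale * (x @ A_xs.T @ R.T @ B_xs.T)

# Copying the PiSSA factors without the subtraction changes the output at init.
y_bad = x @ W.T + scale * (x @ A_pissa.T @ B_pissa.T)

print("PiSSA preserves output:  ", np.allclose(y_pissa, y_ref))
print("LoRA-XS preserves output:", np.allclose(y_xs, y_ref))
print("PiSSA factors on intact W:", np.allclose(y_bad, y_ref))
```

Both schemes reproduce the original output at initialization, but for different reasons: PiSSA relies on the subtraction from `W`, while LoRA-XS relies on `R` starting at or near zero.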