lora-lite/docs/reviews/loraxs_review.md at 28d04f1e1d001a089a5d3d9e0a2beb6166a7c95d

mirror of https://github.com/wassname/lora-lite.git synced 2026-06-27 15:15:55 +08:00

wassname 28d04f1e1d gitignore: match loraxs_ review scratch; track curated loraxs_review.md

Broaden raw*/err* patterns to *raw*/*err* so prefixed scratch
(loraxs_raw.jsonl, loraxs_err.txt) is ignored. Add the GPT-5.5 review of
the lora_xs variant as the curated artifact.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-19 06:04:25 +08:00

Overall verdict: the LoRA-XS core math is mostly correct and faithful to the official repo. The main things I would flag are config/default scaling, documentation/orientation wording, and buffer serialization assumptions, not the SVD factor shapes.

`lora_xs.py`

`A = (Sr[:, None] * Vhr)` / `B = Ur`. Verdict: correct.
Given that PyTorch stores `layer.weight` as `(d_out, d_in)` and computes `y = x @ W.T`, the SVD is taken of the stored weight, `W = U S Vh`. The code stores:
- `lora_A`: `(r, d_in)` = `diag(Sr) Vhr`
- `lora_B`: `(d_out, r)` = `Ur`

Forward gives:
```
x @ A.T @ R.T @ B.T
```
i.e.
```
x @ V_r @ diag(Sr) @ R.T @ U_r.T
```
This matches the official PyTorch-module implementation `lora_B(R(lora_A(x)))`. In the paper's row-vector notation it is `x A_paper R_paper B_paper` with `A_paper = A.T`, `B_paper = B.T`, and `R_paper = layer.lora_R.T`, so the tensor shapes are right. One caveat: if you ever load official checkpoints directly, confirm whether `R` needs to be transposed.
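
The shape and forward-path claims above can be checked numerically. This is a minimal sketch, not the library's code: the sizes, the random `W`, and the bias-free `nn.Linear` stand-ins for the official modules are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out, r = 8, 6, 3  # illustrative sizes (assumption)

W = torch.randn(d_out, d_in)  # stored like nn.Linear.weight: (d_out, d_in)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
Ur, Sr, Vhr = U[:, :r], S[:r], Vh[:r, :]

A = Sr[:, None] * Vhr          # (r, d_in) = diag(Sr) @ Vhr
B = Ur                         # (d_out, r)
R = torch.randn(r, r) * 1e-5   # stand-in for layer.lora_R

x = torch.randn(4, d_in)
delta_functional = x @ A.T @ R.T @ B.T

# Module-style composition lora_B(R(lora_A(x))), using bias-free nn.Linear
# stand-ins whose weights are set to A, R and B.
lora_A = nn.Linear(d_in, r, bias=False)
lora_R = nn.Linear(r, r, bias=False)
lora_B = nn.Linear(r, d_out, bias=False)
with torch.no_grad():
    lora_A.weight.copy_(A)
    lora_R.weight.copy_(R)
    lora_B.weight.copy_(B)
    delta_module = lora_B(lora_R(lora_A(x)))

shapes_ok = A.shape == (r, d_in) and B.shape == (d_out, r)
paths_match = torch.allclose(delta_functional, delta_module, atol=1e-7)
```

If `paths_match` holds, a square weight stored as an `nn.Linear` weight plays the same role as the `R` used in `h @ R.T`, which is the orientation to verify before converting checkpoints.
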

`h = h @ R.T`. Verdict: acceptable, but orientation-sensitive.
Since `R` is square and unconstrained, training from scratch is mathematically fine. It does mean that the stored tensor is the transpose of the `R` in the paper's row-vector notation. That is not a runtime bug, but it is worth documenting for checkpoint conversion.

`class LoRAXSConfig(AdapterConfig): variant = "lora_xs"`. Verdict: suspicious.
The file does not visibly enforce the reference default `alpha = r`. If `AdapterConfig` inherits the library’s usual LoRA/PiSSA default, especially `alpha = 2r`, then:
```
scale = cfg.alpha / cfg.r
```
will silently use `scale = 2` instead of the paper/repo’s `scale = 1`. That is the most concrete faithfulness risk in this snippet.
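
One way to remove the ambiguity is to make the default explicit in the config. The sketch below is hypothetical: `SketchLoRAXSConfig` and its fields are illustrative names and do not reflect the real `AdapterConfig` API, which the reviewed snippet does not show.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchLoRAXSConfig:
    """Hypothetical stand-in; the real AdapterConfig fields may differ."""

    r: int = 16
    alpha: Optional[float] = None  # None means "follow the reference default"

    def __post_init__(self) -> None:
        if self.alpha is None:
            # Reference default per the review: alpha = r, so scale = 1.
            self.alpha = float(self.r)

    @property
    def scale(self) -> float:
        return self.alpha / self.r


default_cfg = SketchLoRAXSConfig(r=8)                # scale follows alpha = r
inherited_cfg = SketchLoRAXSConfig(r=8, alpha=16.0)  # an inherited alpha = 2r
```

With this shape, a caller who never sets `alpha` gets the reference scale, while an explicit `alpha = 2r` still doubles it visibly rather than silently.
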

`scale = cfg.alpha / cfg.r` with `R ~ normal(0, 1e-5)`. Verdict: not a vanishing-gradient problem.
The tiny `R` only makes the initial delta tiny. The gradient w.r.t. `R` does not vanish, because `A` and `B` are frozen, nonzero SVD factors. `lr ~ 4e-3` is plausible: the early gradient scale is governed by the activations, `A`, `B`, the loss gradients, and `alpha/r`, not by the magnitude of `R`.
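
The argument can be illustrated with autograd. This is a sketch under stated assumptions: the sizes and random `W` are illustrative, and a fixed `upstream` tensor stands in for the loss gradient with respect to the adapter output. In real training that upstream gradient depends on the model output, which at initialization is dominated by the frozen base weights.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r = 8, 6, 3  # illustrative sizes (assumption)
W = torch.randn(d_out, d_in)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = S[:r, None] * Vh[:r, :]   # frozen, (r, d_in)
B = U[:, :r]                  # frozen, (d_out, r)
scale = 1.0                   # alpha = r

x = torch.randn(4, d_in)
upstream = torch.randn(4, d_out)  # fixed dL/d(delta), standing in for a loss


def grad_wrt_R(R_init: torch.Tensor) -> torch.Tensor:
    """Backpropagate the fixed upstream gradient through the adapter path."""
    R = R_init.clone().requires_grad_(True)
    delta = scale * (x @ A.T @ R.T @ B.T)
    delta.backward(upstream)
    return R.grad


grad_tiny = grad_wrt_R(torch.randn(r, r) * 1e-5)  # init as in the review
grad_large = grad_wrt_R(torch.randn(r, r))        # unit-scale init
```

Because the delta is linear in `R`, the two gradients coincide for the same upstream signal: the initialization scale changes the initial delta, not the gradient that reaches `R`.
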

`lora_A/lora_B ... as_buffer=True, trainable=False`. Verdict: correct, with a checkpoint caveat.
This prevents gradient leakage and keeps only `lora_R` trainable. Buffers move with `.to()` and, when registered as persistent (the PyTorch default), appear in `state_dict`. Adapter-only saving must include the buffers, though; otherwise loading will miss the frozen SVD factors. Also, if the buffers are persisted, the checkpoint size is not just `r*r`.
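
The checkpoint caveat can be seen with plain PyTorch. This is a sketch only: `SketchAdapter` is a hypothetical module using `register_buffer` directly, and the library's `as_buffer` / `trainable` flags are assumed to behave similarly. The two save filters are illustrative, not the library's saving logic.

```python
import torch
import torch.nn as nn


class SketchAdapter(nn.Module):
    """Hypothetical module; not the library's actual adapter class."""

    def __init__(self, d_in: int, d_out: int, r: int) -> None:
        super().__init__()
        # Frozen factors as persistent buffers (persistent=True is the default).
        self.register_buffer("lora_A", torch.randn(r, d_in))
        self.register_buffer("lora_B", torch.randn(d_out, r))
        self.lora_R = nn.Parameter(torch.randn(r, r) * 1e-5)


adapter = SketchAdapter(d_in=8, d_out=6, r=3)

trainable = {name for name, p in adapter.named_parameters() if p.requires_grad}
full_keys = set(adapter.state_dict().keys())

# A save filter built from trainable parameters alone drops the buffers.
params_only = {k: v for k, v in adapter.state_dict().items() if k in trainable}
# Filtering on the "lora_" name prefix keeps the frozen factors as well.
with_buffers = {k: v for k, v in adapter.state_dict().items() if k.startswith("lora_")}

saved_numel = sum(v.numel() for v in with_buffers.values())
```

Here `params_only` holds just `lora_R`, so a loader would have to recompute the SVD, while `with_buffers` stores `r*d_in + d_out*r + r*r` values per layer.
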

Docstring: "R sits between two frozen, near-orthonormal bases". Verdict: inaccurate.
`B = Ur` has orthonormal columns, but `A = diag(Sr) Vhr` is not orthonormal: its rows are mutually orthogonal with norms `Sr`. This matters for optimizer geometry and gradient conditioning.
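
A short check of the two Gram matrices makes the difference concrete. This is a minimal sketch with illustrative sizes and a random `W` (assumptions), in float64 to keep the tolerances comfortable.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r = 8, 6, 3  # illustrative sizes (assumption)
W = torch.randn(d_out, d_in, dtype=torch.float64)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
Ur, Sr, Vhr = U[:, :r], S[:r], Vh[:r, :]

B = Ur                 # (d_out, r)
A = Sr[:, None] * Vhr  # (r, d_in)

eye = torch.eye(r, dtype=W.dtype)
B_gram = B.T @ B       # expected: identity
A_gram = A @ A.T       # expected: diag(Sr**2)

B_is_orthonormal = torch.allclose(B_gram, eye, atol=1e-8)
A_is_orthonormal = torch.allclose(A_gram, eye, atol=1e-8)
A_rows_have_norm_Sr = torch.allclose(A.norm(dim=1), Sr)
A_gram_is_diag_S2 = torch.allclose(A_gram, torch.diag(Sr**2), atol=1e-8)
```

The `diag(Sr**2)` Gram matrix is what skews the geometry: directions tied to large singular values are scaled more than those tied to small ones.
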

Docstring: "h = W x + (alpha/r) B R A x". Verdict: misleading for this library.
The implementation is row-vector PyTorch style:
```
y + scale * x @ A.T @ R.T @ B.T
```
The docstring uses column-vector ordering. This is not a code bug, but the two conventions are easy to confuse.

`pissa.py` contrast

`Sr_eff = Sr / scale`, `sqrtS`, `W - scale * BA`. Verdict: correct for PiSSA, not something to copy to LoRA-XS.
PiSSA must split `sqrt(S)` between the two factors and subtract the top-r component from `W` so that the adapted layer reproduces the original weight at initialization. LoRA-XS intentionally leaves `W` intact and trains only `R`; folding all of `S` into `A` matches the reference repo. Splitting `sqrt(S)` would be a different parameterization with different optimizer dynamics, even if the two can represent the same updates.
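
The contrast can be sketched side by side. This is a reconstruction from the expressions named above (`Sr_eff`, `sqrtS`, `W - scale * BA`), not a copy of `pissa.py`, and it may not match that file line for line. The sizes, the random `W`, the value of `scale`, and the zero `R` (used to show the exact limit of a tiny initialization) are assumptions.

```python
import torch

torch.manual_seed(0)
d_in, d_out, r = 8, 6, 3  # illustrative sizes (assumption)
scale = 2.0               # e.g. alpha = 2r (assumption)
W = torch.randn(d_out, d_in, dtype=torch.float64)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
Ur, Sr, Vhr = U[:, :r], S[:r], Vh[:r, :]

# PiSSA-style: split sqrt(S) across both factors with 1/scale folded in,
# then subtract the top-r component so the layer is unchanged at init.
sqrtS = torch.sqrt(Sr / scale)        # sqrt of Sr_eff = Sr / scale
A_pissa = sqrtS[:, None] * Vhr        # (r, d_in)
B_pissa = Ur * sqrtS[None, :]         # (d_out, r)
top_r = scale * (B_pissa @ A_pissa)   # equals Ur @ diag(Sr) @ Vhr
W_res = W - top_r
pissa_effective = W_res + top_r

# LoRA-XS-style: W untouched, all of S folded into A, R in the middle.
A_xs = Sr[:, None] * Vhr
B_xs = Ur
R = torch.zeros(r, r, dtype=W.dtype)  # exact limit of a tiny init
xs_effective = W + B_xs @ R @ A_xs
```

Both variants start from an effective weight equal to `W`, but PiSSA gets there by modifying the stored base weight (`W_res` differs from `W`), whereas LoRA-XS gets there because `R` starts near zero.
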
