comment hygiene + HRA row: shorten docstrings, drop dead init branch, track asvd

- variant.py: fix mislabeled "legacy entry" (make() is the live param path); drop unused near_one init branch
- config.py: drop "replaces older LoraLiteConfig" history narration
- antipasto_ablate.py: aspirational "should warm-start" comment -> tracked FIXME
- antipasto_rot.py: cut "kept as separate variant" / "why antipasto dropped rotation" ramble
- benchmark: merge duplicate antipasto/corda/asvd cfg branch
- README: fill HRA row (test 59.2 / valid 70.0)
- track antipasto_asvd.py (was imported+registered but uncommitted)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
wassname
2026-06-17 11:16:07 +08:00
parent 5f9d90d8b8
commit 7e024b4734
7 changed files with 56 additions and 25 deletions
+2 -2
View File
@@ -56,8 +56,8 @@ use r=32; the AntiPaSTO family uses r=256 (it tunes only S-space gain, so it nee
| [LoRA](https://arxiv.org/abs/2106.09685) | 59.8 | 68.0 | 3.54M | 3.54M | 173 / 573 | 0.02 |
| [PiSSA](https://arxiv.org/abs/2404.02948) | 59.8 | 76.0 | 3.54M | 3.54M | 146 / 549 | 2.04 |
| [DoRA](https://arxiv.org/abs/2402.09353) | 60.2 | 68.0 | 3.56M | 3.54M | 161 / 556 | 0.16 |
| [DeLoRA](https://arxiv.org/abs/2503.18225) | | | — | — | — | |
| [HRA](https://arxiv.org/abs/2405.17484) | | | — | — | — | |
| [DeLoRA](https://arxiv.org/abs/2503.18225) | 56.2 | 62.0 | 3.54M | 3.54M | 169 / 593 | 0.21 |
| [HRA](https://arxiv.org/abs/2405.17484) | 59.2 | 70.0 | 2.75M | 2.75M | 225 / 948 | 0.04 |
| [EVA](https://arxiv.org/abs/2410.07170) | — | — | — | — | — | — |
| [IA3](https://arxiv.org/pdf/2205.05638) | — | — | — | — | — | — |
| [IA3-FF](https://arxiv.org/pdf/2205.05638) | — | — | — | — | — | — |
+1 -3
View File
@@ -142,9 +142,7 @@ def cfg_for_variant(args: BenchmarkConfig, dtype: torch.dtype) -> ll.AdapterConf
extra = {"group_size": args.road_group_size}
if args.variant == "antipasto_rot":
extra = {"rotate_basis": args.antipasto_rotate_basis}
if args.variant == "antipasto":
extra = {"coeff": args.antipasto_coeff, "suppress_only": args.antipasto_suppress_only}
if args.variant in ("antipasto_corda", "antipasto_asvd"):
if args.variant in ("antipasto", "antipasto_corda", "antipasto_asvd"):
extra = {"coeff": args.antipasto_coeff, "suppress_only": args.antipasto_suppress_only}
if args.variant == "antipasto_ablate":
extra = {"coeff": args.antipasto_coeff, "k": args.antipasto_ablate_k,
+3 -4
View File
@@ -1,9 +1,8 @@
"""AdapterConfig: per-variant typed dataclass.
Replaces the older `LoraLiteConfig` + `variant_kwargs` dict. Each variant
ships its own subclass under `variants/*.py` (e.g. `DeLoRAConfig`), adding
strongly-typed fields so users discover the knobs via IDE / dataclass
introspection instead of stringly-typed dict lookups.
Each variant ships its own subclass under `variants/*.py` (e.g. `DeLoRAConfig`),
adding strongly-typed fields so the knobs are discoverable via IDE / dataclass
introspection rather than stringly-typed dict lookups.
Wire-up:
- `AdapterConfig` holds the universal fields (variant name, rank, alpha,
+1 -4
View File
@@ -24,9 +24,6 @@ class ParamSpec:
# avoid exact-zero dead zone; N(0, 1e-4) is small enough to be
# ~identity but nonzero so gradients always have somewhere to go
t.normal_(0, 1e-4)
elif self.init == "near_one":
# avoid exact-one dead zone; 1 + N(0, 1e-4)
t.fill_(1.0).add_(torch.randn_like(t).mul_(1e-4))
elif self.init == "ones":
t.fill_(1.0)
elif self.init == "kaiming":
@@ -37,7 +34,7 @@ class ParamSpec:
return t
def make(self, dtype: torch.dtype, device) -> nn.Parameter:
# legacy entry: returns a Parameter (used for trainable adapter params)
# trainable params -> Parameter; buffers go through make_tensor (see attach)
if self.as_buffer:
raise RuntimeError("as_buffer spec must be installed via register_buffer; see adapter.attach")
return nn.Parameter(self._empty(dtype, device), requires_grad=self.trainable)
+3 -3
View File
@@ -85,9 +85,9 @@ class AntiPaSTOAblate:
layer.lora_Vh.copy_(Vhr.to(layer.lora_Vh.dtype))
W_res = (W - (Ur * Sr) @ Vhr).to(layer.weight.dtype)
layer.weight.data.copy_(W_res)
# Optional but recommended: group_init() should warm-start lora_c from the
# S-space contrastive direction dS (see sspace.py extract). Random init also
# trains, just slower and with no guarantee it finds the behavior direction.
# FIXME: lora_c is random-init. A group_init warm-start from the S-space
# contrastive direction dS (cf. sspace.py extract) would converge faster and
# land on the behavior direction; not implemented -- random trains, just slower.
@staticmethod
def group_init(model: nn.Module, targets, cfg, calibration_data: CalibrationData | None) -> None:
+43
View File
@@ -0,0 +1,43 @@
"""AntiPaSTO-ASVD: diagonal-covariance sibling of antipasto_corda.
Same frozen-basis bounded gain, but orients the SVD by the DIAGONAL of the input
second moment (per-channel activation scale) instead of the full covariance:
M = diag(E[x_i^2]) vs CorDA's full C = E[x x^T]
This is Activation-aware SVD (Yuan+ 2023, arXiv:2312.05821): SVD(W diag(s)) with s a
per-channel scale. It is NOT a sub-basis of CorDA -- diag(C)^{1/2} and C^{1/2} are
different oblique rotations, so the top-r directions differ and either can win on a task.
ASVD is the cheap arm: O(d_in) moment, no d_in x d_in matrix, no eigh. The head-to-head
with antipasto_corda isolates whether the off-diagonal of C earns its init cost here.
Reuses antipasto_corda's buffers (U, S, P, g), plain-SVD init, gain forward, and the
shared `_covariance_orient` (only the diag flag differs), so there is one copy of the
math to keep in sync.
Refs: antipasto_corda.py (full-covariance sibling), ASVD arXiv:2312.05821.
"""
from dataclasses import dataclass
from ..variant import register
from ..config import register_config
from .antipasto_corda import AntiPaSTOCorDA, AntiPaSTOCorDAConfig, _covariance_orient
@register_config
@dataclass
class AntiPaSTOASVDConfig(AntiPaSTOCorDAConfig):
variant: str = "antipasto_asvd"
@register
class AntiPaSTOASVD:
name = "antipasto_asvd"
param_specs = staticmethod(AntiPaSTOCorDA.param_specs)
init = staticmethod(AntiPaSTOCorDA.init)
forward = staticmethod(AntiPaSTOCorDA.forward)
@staticmethod
def group_init(model, targets, cfg, calibration_data) -> None:
"""ASVD: re-orient by the diagonal of the input second moment (per-channel)."""
_covariance_orient(model, targets, cfg, calibration_data, diag=True)
+3 -9
View File
@@ -1,6 +1,5 @@
"""AntiPaSTO-Rot: the original SVD adapter with learnable singular-value deltas +
block-diagonal Cayley rotation. Kept as a SEPARATE variant so we can benchmark the
rotation version against the rotation-free 1+ELU gain (antipasto.py) head to head.
"""AntiPaSTO-Rot: SVD adapter with learnable singular-value deltas + a block-diagonal
Cayley rotation of the frozen basis. The rotation arm vs antipasto.py's gain-only core.
wassname 2026 https://arxiv.org/abs/2601.07473
@@ -10,12 +9,7 @@ wassname 2026 https://arxiv.org/abs/2601.07473
y = x @ W_res.T + ((x @ Vh_eff.T) * (S + delta_s)) @ U_eff.T
Identity at t=0: rot_T=0 -> R=I, delta_s~4e-4 -> y ~ x @ W^T (tiny positive bias on
delta_s breaks sign symmetry).
Why antipasto.py dropped the rotation: rotating Vh/U leaves the interpretable singular
basis, and the Cayley solve was numerically finicky. This file preserves it for the
all-else-equal comparison (does the cross-direction mixing the rotation buys beat the
cheaper, more stable gain-only adapter on the same targets and budget?).
delta_s breaks sign symmetry; rotation alone can't).
Refs:
- paper: https://github.com/wassname/AntiPaSTO