README: antipasto variants as table rows; real PEFT reference

- Fold the family into the main Variants table as rows (CorDA/ablate/arrow)
  instead of a separate table.
- Lead with the point (freeze W's SVD, learn only a bounded gain -> interpretable,
  O(r) params) before any numbers.
- Replace the unsourced 'PEFT reports 49.0%' line (wrong; LoRA is ~48%) with a
  real link to PEFT's method_comparison/MetaMathQA and a pointer to the benchmark
  script for hyperparameters. Link CorDA/Arditi papers inline.

Co-Authored-By: Claudypoo <noreply@anthropic.com>
wassname
2026-06-15 18:18:09 +08:00
parent fa69e0cac3
commit 6ab1dfff0e
+5 -4
@@ -54,6 +54,9 @@ just qwen-probe # Qwen/Qwen3-0.6B train/save-load probe
| [DoRA](https://arxiv.org/abs/2402.09353) | no | 62.4% | 4.67M | 11.3 |
| [DeLoRA](https://arxiv.org/abs/2503.18225) | yes | 61.5% | 4.59M | 11.3 |
| [AntiPaSTO](https://arxiv.org/abs/2601.07473) | no | 61.4% | 14.3K | 11.3 |
| AntiPaSTO-CorDA | no | 61.9% | 14.3K | 11.3 |
| AntiPaSTO-ablate | no | 61.0% | 14.4K | 11.3 |
| AntiPaSTO-arrow | no | 60.5% | 17.5K | 11.3 |
| [IA3-FF](https://arxiv.org/pdf/2205.05638) | yes | 61.4% | 86K | 11.4 |
| [EVA](https://arxiv.org/abs/2410.07170) | no | 60.3% | 4.59M | 11.3 |
| [IA3](https://arxiv.org/pdf/2205.05638) | yes | 60.0% | 57K | 11.4 |
@@ -61,11 +64,9 @@ just qwen-probe # Qwen/Qwen3-0.6B train/save-load probe
Params = trainable adapter params. Peak GPU = peak CUDA memory during train+eval (logged from this run onward; older runs predate the column).
Setup: Qwen3-0.6B-Base, MetaMathQA train (5k steps, batch 4 = 20k samples unless noted), r=32, all q/v targets, GSM8K test (1319 examples). HRA used batch 2 (10k samples) due to memory. The AntiPaSTO family used r=256 (default for these variants).
Setup follows [PEFT's method comparison](https://github.com/huggingface/peft/tree/main/method_comparison/MetaMathQA): train on a MetaMathQA subset, test on GSM8K. We swap their Llama-3.2-3B (where LoRA gets ~48%) for the smaller Qwen3-0.6B-Base, so these track method rank, not cross-setup absolutes. Hyperparameters are in [the benchmark script](scripts/metamath_gsm8k_benchmark.py) (r=32 q/v; the AntiPaSTO family uses r=256).
Reference: PEFT reports LoRA at 49.0% on Llama-3.2-3B (different model, different sample count). Our numbers are not directly comparable but suggest the adapters work.
AntiPaSTO freezes the top-r SVD of W and trains only a per-direction gain `S_eff = S * (1 + ELU(g))`, so the singular basis stays interpretable and the adapter is O(r) params (~320x smaller than LoRA). Variants swap the basis or core: `antipasto_corda` orients it by input covariance (CorDA), `antipasto_ablate` learns a contractive directional ablation (Arditi), `antipasto_arrow` adds a cheap dense block for cross-direction mixing. See `src/lora_lite/variants/antipasto*.py`.
AntiPaSTO is the novel row here: instead of adding trainable directions like LoRA, it freezes W's own top-r SVD and learns only a bounded per-direction gain `S_eff = S * (1 + ELU(g))`. The singular basis stays fixed and interpretable, and the adapter is O(r) params (~320x smaller than LoRA). The variants change only the basis or core: CorDA orients it by input covariance ([Yang+ 2024](https://arxiv.org/abs/2406.05223)), ablate learns a contractive directional ablation ([Arditi+ 2024](https://arxiv.org/abs/2406.11717)), arrow adds a small dense block for cross-direction mixing.
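The gain rule and the "~320x smaller" figure can be checked with a short, dependency-free sketch. This is not the code in `src/lora_lite/variants/`; the function names are hypothetical. The Qwen3-0.6B shapes below (hidden size 1024, q_proj output 2048, v_proj output 1024, 28 layers) are assumptions not stated in this README, and the gain-only count assumes one scalar per frozen direction per adapted module and nothing else.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """Standard ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def effective_singular_values(S, g):
    """Per-direction gain from the text: S_eff = S * (1 + ELU(g)).

    ELU(g) > -1, so the factor cannot flip the sign of a direction,
    and g = 0 reproduces the frozen singular values exactly.
    """
    return [s * (1.0 + elu(gi)) for s, gi in zip(S, g)]


# ASSUMED Qwen3-0.6B shapes (illustrative, not taken from this README).
HIDDEN = 1024   # input width of q_proj and v_proj
Q_OUT = 2048    # q_proj output width
V_OUT = 1024    # v_proj output width
LAYERS = 28     # transformer blocks, q and v adapted in each


def lora_params(r: int) -> int:
    """LoRA trains A (r x d_in) and B (d_out x r) for each target module."""
    return LAYERS * r * ((HIDDEN + Q_OUT) + (HIDDEN + V_OUT))


def gain_only_params(r: int) -> int:
    """Hypothetical count: one gain scalar per direction, for q and v."""
    return LAYERS * 2 * r


S = [3.0, 2.0, 1.0]
print(effective_singular_values(S, [0.0, 0.0, 0.0]))   # unchanged at g = 0
print(effective_singular_values(S, [-1.0, 0.5, 2.0]))  # shrunk, grown, grown

print(lora_params(32))                            # 4587520
print(gain_only_params(256))                      # 14336
print(lora_params(32) // gain_only_params(256))   # 320
```

Under these assumptions the counts line up with the 4.59M and 14.3K rows in the table. The ablate and arrow variants carry a few extra parameters that this sketch does not model.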
## Developer docs