mirror of
https://github.com/wassname/adapters_as_hypotheses.git
synced 2026-06-27 15:14:03 +08:00
wip
@@ -0,0 +1,134 @@
{
  "mode": "editor",
  "full_text_verified": true,
  "claims_checked": [
    {
      "claim": "- Probing finds representations that predict behavior, but high probe accuracy does not mean the mod",
      "number": "2022",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- CCS discovers latent knowledge but cannot intervene on it (Burns et al., 2022).",
      "number": "2022",
      "status": "number_found_in_paper"
    },
    {
      "claim": "I articulated this in my [AntiPaSTO paper](https://arxiv.org/abs/2601.07473), and GDM's interpretabi",
      "number": "2601.07473",
      "status": "number_found_in_paper"
    },
    {
      "claim": "I articulated this in my [AntiPaSTO paper](https://arxiv.org/abs/2601.07473), and GDM's interpretabi",
      "number": "3",
      "status": "number_found_in_paper"
    },
    {
      "claim": "I went through ~30 PEFT methods in HuggingFace PEFT and the broader literature. For each one I:",
      "number": "30",
      "status": "number_found_in_paper"
    },
    {
      "claim": "1. Extracted pseudocode for the forward pass (what the intervention actually does)",
      "number": "1.",
      "status": "number_found_in_paper"
    },
    {
      "claim": "2. Stated the hypothesis it encodes about transformer internals",
      "number": "2.",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **PiSSA** (NeurIPS 2024): Initialize LoRA from top-$r$ SVD of $W$, freeze the residual. Gemma-7B o",
      "number": "2024",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **PiSSA** (NeurIPS 2024): Initialize LoRA from top-$r$ SVD of $W$, freeze the residual. Gemma-7B o",
      "number": "7",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **SVFT**: Fix both singular vector sets from $W$'s SVD, learn only sparse coefficients. Recovers 9",
      "number": "96%",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **SSVD**: Rotate right singular vectors (Cayley transform), shift singular values, keep left singu",
      "number": "10",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **DoRA** (ICML 2024): Magnitude/direction decomposition of $W$. Consistently beats LoRA.",
      "number": "2024",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **DeLoRA** (ICLR 2025): Normalize each rank-1 component, introduce learnable scalar $\\lambda$. Bet",
      "number": "2025",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **DeLoRA** (ICLR 2025): Normalize each rank-1 component, introduce learnable scalar $\\lambda$. Bet",
      "number": "1",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **ROAD**: 2D rotary adaptation with explicit angle $\\theta$ and magnitude $\\alpha$.",
      "number": "2",
      "status": "number_found_in_paper"
    },
    {
      "claim": "**IA3** learns nothing but a per-channel scaling vector ($\\lambda \\in \\mathbb{R}^d$, initialized to ",
      "number": "3",
      "status": "number_found_in_paper"
    },
    {
      "claim": "**IA3** learns nothing but a per-channel scaling vector ($\\lambda \\in \\mathbb{R}^d$, initialized to ",
      "number": "1",
      "status": "number_found_in_paper"
    },
    {
      "claim": "**Conflict of interest disclosure:** the strongest OOD result in this catalog is my own work. [AntiP",
      "number": "2601.07473",
      "status": "number_found_in_paper"
    },
    {
      "claim": "**Conflict of interest disclosure:** the strongest OOD result in this catalog is my own work. [AntiP",
      "number": "1",
      "status": "number_found_in_paper"
    },
    {
      "claim": "I give AntiPaSTO the highest grade (\\*\\*!!!\\*\\*) because the OOD transfer is the kind of evidence th",
      "number": "4",
      "status": "number_found_in_paper"
    },
    {
      "claim": "**Decoupling family:** DoRA (magnitude/direction) -> ETHER (fixed-strength orthogonal) -> DeLoRA (no",
      "number": "1",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **Scale dependence.** Most of these results are on 1B-7B models. The geometry might change at 70B+",
      "number": "1",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **Scale dependence.** Most of these results are on 1B-7B models. The geometry might change at 70B+",
      "number": "7",
      "status": "number_found_in_paper"
    },
    {
      "claim": "The full catalog with pseudocode, evidence, and grades for 30 methods is at:",
      "number": "30",
      "status": "number_found_in_paper"
    }
  ],
  "missing_context": [
    "Paper limitations not discussed"
  ],
  "citations_found": [
    "- CCS discovers latent knowledge but cannot intervene on it (Burns et al., 2022)"
  ],
  "suggestions": [],
  "verdict": "needs_work"
}
@@ -2,19 +2,22 @@ TASK write a new file, from the old part.
|
|||||||
|
|
||||||
## Status: DONE
|
## Status: DONE
|
||||||
|
|
||||||
|
### Task 1: adapters_as_hypotheses.md
|
||||||
- [x] Preamble with pragmatic interpretability framing
|
- [x] Preamble with pragmatic interpretability framing
|
||||||
- [x] 30 entries (1-30) with pseudocode, hypothesis, evidence, grade
|
- [x] 33 entries with pseudocode, hypothesis, evidence, grade
|
||||||
- [x] All papers saved to docs/ (full size, no truncation)
|
- [x] All papers saved to docs/ (full size, no truncation)
|
||||||
- [x] URLs from gist_content.md included
|
- [x] Sub-agent review completed, fixes applied
|
||||||
- [x] Sub-agent review completed, fixes applied:
|
|
||||||
- Fixed RandLoRA pseudocode (sum of scaled random bases, not single triple product)
|
### Task 2: adapters_vargdown.argdown (NEW)
|
||||||
- Fixed authorship (AntiPaSTO is Clark, not Bini/Girrbach/Akata)
|
- [x] Compiled evidence into vargdown (verified argdown) format
|
||||||
- Fixed SSVD grade (** not **!) and evidence ("matches" not "outperforms")
|
- [x] 6 thematic argument groups: SVD basis, orthogonal, decoupling, gain control, rank, functional architecture
|
||||||
- Fixed OFT pseudocode (W @ R^T convention per paper)
|
- [x] Main thesis: [Natural Manifold] -- SVD basis + orthogonal constraints define natural intervention manifold
|
||||||
- Fixed AntiPaSTO Cayley convention to show explicit /2
|
- [x] ~20 observations with exact blockquotes from docs/ evidence files
|
||||||
- Added AntiPaSTO grade caveat (<=4B models, seed variance)
|
- [x] ~10 assumptions for papers without frozen evidence
|
||||||
- Split Bone/Trainable Tokens into separate entries
|
- [x] 3 contrary arguments (gain control, rank secondary, linearity)
|
||||||
- Fixed "Clark et al." -> "Clark"
|
- [x] Pseudocode companion: adapters_pseudocode.md (20 methods in pseudopy format)
|
||||||
|
- [x] Sub-agent review: fixed 5 critical (wrong evidence links, paraphrased quotes), 7 minor (orphans, credence calibration)
|
||||||
|
- [x] All credences calibrated: reason first, no overconfidence on preprints
|
||||||
|
|
||||||
First write also preamble explaining why we are interested, and this view, about a pragmatic search for effective views on internals (see https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability, and
|
First write also preamble explaining why we are interested, and this view, about a pragmatic search for effective views on internals (see https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability, and
|
||||||
|
|
||||||
|
|||||||
+31
-99
@@ -6,13 +6,15 @@
|
|||||||
|
|
||||||
We want to understand how transformers work. There are many approaches -- probing, ablation, SAEs -- but most of them *observe* rather than *intervene*. Probing finds representations that predict behavior, but high probe accuracy does not mean the model uses that representation ([Belinkov, 2022](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00254/43503)). CCS discovers latent knowledge but cannot intervene on it ([Burns et al., 2022](https://arxiv.org/abs/2212.03827)). Intervention shortcuts both problems: if modifying a representation reliably changes behavior, we have causal evidence of what we control (I argued this in [AntiPaSTO](https://arxiv.org/abs/2601.07473)).
|
We want to understand how transformers work. There are many approaches -- probing, ablation, SAEs -- but most of them *observe* rather than *intervene*. Probing finds representations that predict behavior, but high probe accuracy does not mean the model uses that representation ([Belinkov, 2022](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00254/43503)). CCS discovers latent knowledge but cannot intervene on it ([Burns et al., 2022](https://arxiv.org/abs/2212.03827)). Intervention shortcuts both problems: if modifying a representation reliably changes behavior, we have causal evidence of what we control (I argued this in [AntiPaSTO](https://arxiv.org/abs/2601.07473)).
|
||||||
|
|
||||||
|
<!-- TODO: is this all really relevant for the intro and audience? It seems long and not to the point; is the LessWrong one better? -->
|
||||||
|
|
||||||
There is an underappreciated source of exactly this kind of causal evidence: the PEFT adapter literature.
|
There is an underappreciated source of exactly this kind of causal evidence: the PEFT adapter literature.
|
||||||
|
|
||||||
Each adapter constrains *how* you can update pretrained weights. When one adapter architecture outperforms another under controlled conditions -- same model, same data, same parameter budget -- the winning method's structural assumptions are supported as a better description of the weight manifold. This is a natural experiment running across hundreds of papers, and almost nobody reads it as science about representations.
|
Each adapter constrains *how* you can update pretrained weights. When one adapter architecture outperforms another under controlled conditions -- same model, same data, same parameter budget -- the winner's structural assumptions get stronger support as a description of the weight manifold. This is a natural experiment running across many papers, and it is still underused as evidence about representations.
|
||||||
|
|
||||||
GDM's interpretability team recently pivoted toward "pragmatic interpretability" -- directly solving problems on the critical path to AGI going well, grounded in proxy tasks with empirical feedback ([Nanda et al., 2025](https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability)). Adapter benchmarks are precisely this: empirical feedback on which structural assumptions about transformer internals hold up under intervention.
|
GDM's interpretability team recently pivoted toward "pragmatic interpretability" -- directly solving problems on the critical path to AGI going well, grounded in proxy tasks with empirical feedback ([Nanda et al., 2025](https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability)). Adapter benchmarks are precisely this: empirical feedback on which structural assumptions about transformer internals hold up under intervention.
|
||||||
|
|
||||||
If the adapter generalizes out-of-distribution, that tells us the geometric structure it exploits is *causally relevant* to behavior, not merely correlated. From my [AntiPaSTO paper](https://arxiv.org/abs/2601.07473):
|
If an adapter generalizes out-of-distribution, that is stronger evidence that the geometric structure it exploits is causally relevant to behavior, not merely correlated. From my [AntiPaSTO paper](https://arxiv.org/abs/2601.07473):
|
||||||
|
|
||||||
> Each adapter architecture encodes a claim about how to intervene in transformer internals. LoRA hypothesizes weight changes are low-rank. OFT hypothesizes orthogonal transformations preserve semantic structure. VeRA hypothesizes shared random projections plus learned scaling suffice. DeLoRA hypothesizes direction and magnitude should decouple. PiSSA hypothesizes principal components matter most. Our choice -- Cayley rotations of SVD singular vectors -- hypothesizes that the model's own learned basis defines the natural intervention manifold. Adapters that generalize out-of-distribution tell us which geometric structures are causally relevant to behavior, not merely correlated with it.
|
> Each adapter architecture encodes a claim about how to intervene in transformer internals. LoRA hypothesizes weight changes are low-rank. OFT hypothesizes orthogonal transformations preserve semantic structure. VeRA hypothesizes shared random projections plus learned scaling suffice. DeLoRA hypothesizes direction and magnitude should decouple. PiSSA hypothesizes principal components matter most. Our choice -- Cayley rotations of SVD singular vectors -- hypothesizes that the model's own learned basis defines the natural intervention manifold. Adapters that generalize out-of-distribution tell us which geometric structures are causally relevant to behavior, not merely correlated with it.
|
||||||
|
|
||||||
@@ -29,7 +31,7 @@ We grade evidence on independent dimensions. Each method gets points for the dim
|
|||||||
| BF | 1.5 | Matches or beats full fine-tuning |
|
| BF | 1.5 | Matches or beats full fine-tuning |
|
||||||
| DE | 1.5 | Data-efficient: faster convergence or works with less data |
|
| DE | 1.5 | Data-efficient: faster convergence or works with less data |
|
||||||
| OOD | 2 | Generalizes out-of-distribution |
|
| OOD | 2 | Generalizes out-of-distribution |
|
||||||
| WA | 1 | Widely adopted: used as baseline by other papers |
|
| WA | 1 | Widely adopted: used as baseline by many other papers |
|
||||||
|
|
||||||
Total = sum of applicable dimensions (max 8). Higher = stronger evidence that the method's structural hypothesis is correct.
|
Total = sum of applicable dimensions (max 8). Higher = stronger evidence that the method's structural hypothesis is correct.
|
||||||
|
|
||||||
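The grade totals quoted in the entries can be reproduced with a small lookup. This is a minimal sketch, not part of the catalog itself: the BF, DE, OOD, and WA weights come from the table rows shown here, while the PE and BL weights of 1 each are an assumption inferred from the grades quoted in the entries (for example PE+WA=2 and PE+BL+BF=3.5). The name `grade` is hypothetical.

```python
# Points per evidence dimension.
# BF, DE, OOD, WA are taken from the grading table.
# PE and BL are ASSUMED to be 1 each, inferred from grades such as PE+WA=2.
WEIGHTS = {"PE": 1.0, "BL": 1.0, "BF": 1.5, "DE": 1.5, "OOD": 2.0, "WA": 1.0}


def grade(dimensions):
    """Sum the points for every dimension a method earns."""
    return sum(WEIGHTS[d] for d in dimensions)


lora_grade = grade(["PE", "WA"])              # LoRA entry: PE+WA
dora_grade = grade(["PE", "BL", "BF", "WA"])  # DoRA entry: PE+BL+BF+WA
max_grade = grade(WEIGHTS)                    # every dimension at once
```

Under these assumed weights the maximum comes to 8, which agrees with the "max 8" stated above.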
@@ -52,7 +54,7 @@ def lora_forward(x, W, A, B, α, r):
|
|||||||
return (W + scaling * ΔW) @ x # equivalently: W(x) + scaling * B(A(x))
|
return (W + scaling * ΔW) @ x # equivalently: W(x) + scaling * B(A(x))
|
||||||
```
|
```
|
||||||
|
|
||||||
**Evidence:** Parameter-efficient (matches full FT with 0.01% params on GPT-3). Universal baseline adopted by the entire field. Authors demonstrate comparable performance to full fine-tuning on GPT-3 175B across multiple NLU benchmarks. Subsequent work ([Biderman et al. 2024](https://arxiv.org/abs/2405.09673)) finds LoRA underperforms full FT on harder tasks and larger scale -- the low-rank assumption holds for surface-level adaptation but breaks where deep weight restructuring is needed.
|
**Evidence:** Parameter-efficient (matches full FT with 0.01% params on GPT-3). One of the most common baselines in PEFT. Authors demonstrate comparable performance to full fine-tuning on GPT-3 175B across multiple NLU benchmarks. Subsequent work ([Biderman et al. 2024](https://arxiv.org/abs/2405.09673), [saved](docs/biderman_lora_limitations.md)) finds LoRA underperforms full FT on harder tasks and larger scale -- the low-rank assumption holds for surface-level adaptation but weakens when deeper restructuring is needed.
|
||||||
|
|
||||||
**Grade:** PE+WA=2 (parameter-efficient, universal baseline, but ceiling on hard tasks)
|
**Grade:** PE+WA=2 (parameter-efficient, universal baseline, but ceiling on hard tasks)
|
||||||
|
|
||||||
@@ -130,7 +132,7 @@ def dora_forward(x, W, A, B, m, α):
|
|||||||
return scale * (W @ x) + scale * α * (B @ A @ x)
|
return scale * (W @ x) + scale * α * (B @ A @ x)
|
||||||
```
|
```
|
||||||
|
|
||||||
**Evidence:** Authors analyze full FT weight updates and find they exhibit distinct magnitude vs. direction patterns that LoRA misses. DoRA consistently outperforms LoRA on LLaMA (commonsense reasoning), LLaVA (visual instruction tuning), and VL-BART (image/video-text), across multiple scales. No additional inference overhead (magnitudes merge). Has become a widely-adopted LoRA variant and default in many pipelines.
|
**Evidence:** Authors analyze full FT weight updates and find distinct magnitude vs. direction patterns that LoRA misses. DoRA outperforms LoRA on LLaMA (commonsense reasoning), LLaVA (visual instruction tuning), and VL-BART (image/video-text) in their reported setups. No additional inference overhead (magnitudes merge). It is now a common LoRA-family baseline in many recent papers.
|
||||||
|
|
||||||
**Grade:** PE+BL+BF+WA=4.5 (beats LoRA across multiple domains, QDoRA slightly outperforms full FT on LLaMA2-7B/LLaMA3-8B, standard strong baseline)
|
**Grade:** PE+BL+BF+WA=4.5 (beats LoRA across multiple domains, QDoRA slightly outperforms full FT on LLaMA2-7B/LLaMA3-8B, standard strong baseline)
|
||||||
|
|
||||||
@@ -224,7 +226,7 @@ def svft_forward(x, W, U, Vt, indices, c):
|
|||||||
|
|
||||||
The key: $\Delta W = \sum_{t} c_t \cdot u_{i_t} v_{j_t}^\top$, where $u_i, v_j$ come from $W$'s SVD. Only the $c_t$ scalars are learned. Different sparsity patterns (band-diagonal, random, etc.) give different expressivity/efficiency tradeoffs.
|
The key: $\Delta W = \sum_{t} c_t \cdot u_{i_t} v_{j_t}^\top$, where $u_i, v_j$ come from $W$'s SVD. Only the $c_t$ scalars are learned. Different sparsity patterns (band-diagonal, random, etc.) give different expressivity/efficiency tradeoffs.
|
||||||
|
|
||||||
**Evidence:** SVFT recovers up to 96% of full fine-tuning performance with only 0.006--0.25% of parameters, outperforming LoRA/DoRA/BOFT which only recover 85% with 0.03--0.8% of params. Strong results on language (GLUE, commonsense reasoning) and vision benchmarks. The weight-dependent structure is the key differentiator.
|
**Evidence:** SVFT reports up to 96% of full fine-tuning performance with only 0.006--0.25% of parameters, outperforming the ranges the paper reports for LoRA, DoRA, and BOFT. Results are strong on language (GLUE, commonsense reasoning) and vision benchmarks. The weight-dependent structure is the key differentiator.
|
||||||
|
|
||||||
**Grade:** PE+BL=2 (beats LoRA/DoRA on performance/parameter tradeoff, weight-aware structure)
|
**Grade:** PE+BL=2 (beats LoRA/DoRA on performance/parameter tradeoff, weight-aware structure)
|
||||||
|
|
||||||
@@ -670,7 +672,7 @@ def shira_forward(x, W, mask, ΔW_sparse):
|
|||||||
return W_adapted @ x
|
return W_adapted @ x
|
||||||
```
|
```
|
||||||
|
|
||||||
**Evidence:** Authors claim SHiRA outperforms LoRA especially on concept loss when using multiple adapters (critical for diffusion model fine-tuning). Sparse adapters are cheaper to switch between than LoRA. NeurIPS 2024 Workshop. The importance-scoring approach connects to structured pruning literature.
|
**Evidence:** Authors report SHiRA outperforms LoRA, especially in multi-adapter settings where concept loss matters (critical for diffusion model fine-tuning). Sparse adapters are cheaper to switch between than LoRA. NeurIPS 2024 Workshop. The importance-scoring approach connects to structured pruning literature.
|
||||||
|
|
||||||
**Grade:** PE=1 (sparse high-rank, good multi-adapter properties, workshop paper)
|
**Grade:** PE=1 (sparse high-rank, good multi-adapter properties, workshop paper)
|
||||||
|
|
||||||
@@ -814,7 +816,9 @@ def oftv2_forward(x, W, Q):
|
|||||||
|
|
||||||
## 31. CLOVER -- Cross-Layer Joint SVD Adaptation
|
## 31. CLOVER -- Cross-Layer Joint SVD Adaptation
|
||||||
|
|
||||||
**Paper:** [Tang et al. 2024](https://arxiv.org/abs/2411.17426)
|
**Paper:** [Meng, Tang, Jiang, Zhang 2024](https://arxiv.org/abs/2411.17426)
|
||||||
|
**Code:** [github.com/fanxu-meng/CLOVER](https://github.com/fanxu-meng/CLOVER)
|
||||||
|
**Saved:** [docs/clover_joint_svd.md](docs/clover_joint_svd.md)
|
||||||
|
|
||||||
**Hypothesis:** Attention layers have *cross-layer redundancy* in their SVD structure. Rather than adapting Q, K, V, O projections independently, CLOVER performs joint SVD across paired attention matrices (Q-K and V-O), exploiting the shared singular subspace between layers that cooperate functionally. The weight matrices within a head are not independent -- they jointly define the attention computation, so their adaptation should be coupled.
|
**Hypothesis:** Attention layers have *cross-layer redundancy* in their SVD structure. Rather than adapting Q, K, V, O projections independently, CLOVER performs joint SVD across paired attention matrices (Q-K and V-O), exploiting the shared singular subspace between layers that cooperate functionally. The weight matrices within a head are not independent -- they jointly define the attention computation, so their adaptation should be coupled.
|
||||||
|
|
||||||
@@ -841,7 +845,7 @@ def clover_forward(x, W_q, W_k, W_v, W_o, adapters):
|
|||||||
return attention(x, W_q + ΔW_q, W_k + ΔW_k, W_v + ΔW_v, W_o + ΔW_o)
|
return attention(x, W_q + ΔW_q, W_k + ΔW_k, W_v + ΔW_v, W_o + ΔW_o)
|
||||||
```
|
```
|
||||||
|
|
||||||
**Evidence:** Validated on SDXL (image generation), LLaMA-Vision (multimodal), and Whisper (speech). Beats LoRA by 7.6%, DoRA by 5.5%, PiSSA by 0.7% on average. The cross-layer coupling captures shared structure that per-matrix methods miss. The multimodal validation is notable -- works across text, vision, and audio modalities.
|
**Evidence:** Authors report validation on SDXL (image generation), LLaMA-Vision (multimodal), and Whisper (speech), with average gains over LoRA (+7.6%), DoRA (+5.5%), and PiSSA (+0.7%) in their setup. The cross-layer coupling claim is plausible because Q-K and V-O are functionally paired. This result is strong but still from a single research group.
|
||||||
|
|
||||||
**Grade:** PE+BL+BF=3.5 (beats LoRA and DoRA significantly; validated across 3 modalities)
|
**Grade:** PE+BL+BF=3.5 (beats LoRA and DoRA significantly; validated across 3 modalities)
|
||||||
|
|
||||||
@@ -851,7 +855,8 @@ def clover_forward(x, W_q, W_k, W_v, W_o, adapters):
|
|||||||
|
|
||||||
## 32. PSOFT -- Principal Subspace Orthogonal Fine-Tuning
|
## 32. PSOFT -- Principal Subspace Orthogonal Fine-Tuning
|
||||||
|
|
||||||
**Paper:** [2025](https://arxiv.org/abs/2505.11235)
|
**Paper:** [Wu et al. 2026](https://arxiv.org/abs/2505.11235)
|
||||||
|
**Saved:** [docs/psoft_principal_subspace_oft.md](docs/psoft_principal_subspace_oft.md)
|
||||||
|
|
||||||
**Hypothesis:** Combine PiSSA's SVD initialization with OFT's orthogonal constraint. After extracting the principal subspace via SVD, learn a Cayley rotation $R$ that operates *within* the frozen $U, V$ subspace. This is "OFT in SVD coordinates" -- preserving pairwise angles (OFT's insight) while working in the model's natural basis (PiSSA's insight). The rotation $R$ acts on the principal singular vectors, keeping the subspace orientation while rotating within it.
|
**Hypothesis:** Combine PiSSA's SVD initialization with OFT's orthogonal constraint. After extracting the principal subspace via SVD, learn a Cayley rotation $R$ that operates *within* the frozen $U, V$ subspace. This is "OFT in SVD coordinates" -- preserving pairwise angles (OFT's insight) while working in the model's natural basis (PiSSA's insight). The rotation $R$ acts on the principal singular vectors, keeping the subspace orientation while rotating within it.
|
||||||
|
|
||||||
@@ -870,11 +875,11 @@ def psoft_forward(x, U_r, S_r, V_r, K, W_res):
|
|||||||
return (W_adapted + W_res) @ x # + frozen residual
|
return (W_adapted + W_res) @ x # + frozen residual
|
||||||
```
|
```
|
||||||
|
|
||||||
**Evidence:** Authors claim 80% memory reduction vs standard OFT. Tested on 35 NLP and CV tasks. Combines the convergence benefits of SVD initialization (PiSSA) with the semantic preservation of orthogonal constraints (OFT). The Cayley parameterization ensures exact orthogonality without projection steps.
|
**Evidence:** Authors report ~80% memory reduction vs OFT and broad evaluation across 35 NLP/CV tasks, while keeping performance competitive. This is a direct synthesis of PiSSA-style SVD initialization plus OFT-style Cayley-constrained rotations.
|
||||||
|
|
||||||
**Grade:** PE+BL+DE=3.5 (memory-efficient, faster convergence from SVD init, beats LoRA on 35 tasks)
|
**Grade:** PE+BL+DE=3.5 (memory-efficient, faster convergence from SVD init, beats LoRA on 35 tasks)
|
||||||
|
|
||||||
*Implications:* PSOFT is the clearest synthesis of the SVD + orthogonal hypotheses. It shows that these two structural priors are *complementary*, not competing: SVD identifies *where* to intervene (principal subspace), orthogonality constrains *how* to intervene (rotations that preserve structure). The resulting method inherits both benefits. This supports the emerging picture that the "right" adapter is one that respects both the model's eigenbasis and the geometry of meaningful transformations within it.
|
*Implications:* PSOFT combines two ideas that actually work: SVD tells you where to intervene (principal subspace), orthogonality constrains how (rotations that preserve structure). The method inherits benefits from both. This suggests the best adapters respect both the model's eigenbasis and the geometry of transformations within it.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -882,6 +887,7 @@ def psoft_forward(x, U_r, S_r, V_r, K, W_res):
|
|||||||
|
|
||||||
**Paper:** [Wu, Arora, Wang et al. 2024](https://arxiv.org/abs/2404.03592)
|
**Paper:** [Wu, Arora, Wang et al. 2024](https://arxiv.org/abs/2404.03592)
|
||||||
**Code:** [github.com/stanfordnlp/pyreft](https://github.com/stanfordnlp/pyreft)
|
**Code:** [github.com/stanfordnlp/pyreft](https://github.com/stanfordnlp/pyreft)
|
||||||
|
**Saved:** [docs/reft_representation_finetuning.md](docs/reft_representation_finetuning.md)
|
||||||
|
|
||||||
**Hypothesis:** Adaptation should target *representations* (activations), not weights. Instead of modifying $W$, modify the hidden state $h$ at specific layers and positions via learned interventions. The model's weights are already fine; we just need to redirect its intermediate computations. This is the "activation steering" hypothesis taken to its limit: learn a linear intervention on hidden states at specific token positions.
|
**Hypothesis:** Adaptation should target *representations* (activations), not weights. Instead of modifying $W$, modify the hidden state $h$ at specific layers and positions via learned interventions. The model's weights are already fine; we just need to redirect its intermediate computations. This is the "activation steering" hypothesis taken to its limit: learn a linear intervention on hidden states at specific token positions.
|
||||||
|
|
||||||
@@ -901,7 +907,7 @@ def reft_forward(model, x, interventions):
|
|||||||
|
|
||||||
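The key: instead of $W' = W + \Delta W$, apply $h' = h + R^\top (W_r h + b - R h)$ at specific (layer, position) pairs, where $R$ is a low-rank projection with orthonormal rows and $W_r$, $b$ are a learned linear map and bias into that subspace (this is the LoReFT form from the ReFT paper, which writes $W_r$ as $W$). The intervention is a learned affine transformation in a low-rank subspace of the hidden state.
|
The key: instead of $W' = W + \Delta W$, apply $h' = h + R^\top (W_r h + b - R h)$ at specific (layer, position) pairs, where $R$ is a low-rank projection with orthonormal rows and $W_r$, $b$ are a learned linear map and bias into that subspace (this is the LoReFT form from the ReFT paper, which writes $W_r$ as $W$). The intervention is a learned affine transformation in a low-rank subspace of the hidden state.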
|
||||||
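A minimal NumPy sketch of a single-position low-rank intervention, using the LoReFT form from the ReFT paper, $h' = h + R^\top(W_r h + b - R h)$. The name `W_r` for the learned projection, the shapes, and the random values are illustrative assumptions; this is not the pyreft API.

```python
import numpy as np


def loreft_intervention(h, R, W_r, b):
    """Edit hidden state h (d,) inside the r-dimensional row space of R.

    R:   (r, d) with orthonormal rows (assumed), defines the subspace
    W_r: (r, d) learned linear map into the subspace (hypothetical name)
    b:   (r,)   learned bias
    """
    return h + R.T @ (W_r @ h + b - R @ h)


rng = np.random.default_rng(0)
d, r = 16, 4

# Build R with orthonormal rows from a QR factorization.
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))  # Q: (d, r), orthonormal columns
R = Q.T

W_r = rng.standard_normal((r, d))
b = rng.standard_normal(r)
h = rng.standard_normal(d)

h_new = loreft_intervention(h, R, W_r, b)
```

Two properties follow from the orthonormal rows of `R`: the edit never leaves the row space of `R`, and inside that subspace the new coordinates equal `W_r @ h + b`.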
|
|
||||||
**Evidence:** ReFT is 15-65x more parameter-efficient than LoRA (only modifying hidden states at a few positions). Stanford NLP group, widely cited. Strong on instruction following, commonsense reasoning, and math. The activation-space intervention paradigm is fundamentally different from weight adaptation -- it operates in representation space rather than parameter space.
|
**Evidence:** Authors report 15-65x parameter savings vs LoRA by intervening on hidden states at selected layer-position sites. Reported gains cover instruction following, commonsense reasoning, and NLU tasks in their benchmark suite. Since this is activation-space adaptation, comparisons with weight-space adapters are informative but not perfectly apples-to-apples.
|
||||||
|
|
||||||
**Grade:** PE+BL=2 (15-65x more parameter-efficient than LoRA, beats LoRA on multiple benchmarks, distinct paradigm)
|
**Grade:** PE+BL=2 (15-65x more parameter-efficient than LoRA, beats LoRA on multiple benchmarks, distinct paradigm)
|
||||||
|
|
||||||
@@ -911,101 +917,27 @@ The key: instead of $W' = W + \Delta W$, apply $h' = h + R^\top (R h + b - R h)$
|
|||||||
|
|
||||||
## Themes: What the Evidence Tells Us
|
## Themes: What the Evidence Tells Us
|
||||||
|
|
||||||
Looking across all 33 methods, several structural hypotheses emerge as clusters. Each theme represents a point of view about *what transformers are*, supported by the methods that exploit it.
|
Looking across all 33 methods, a coherent tentative story appears once benchmark noise is reduced. Many successful adapters make geometric bets: first choose coordinates that align with pretrained structure, then constrain updates so they do not destroy that structure, then control update strength explicitly.
|
||||||
|
|
||||||
### Theme 1: SVD as the Natural Coordinate System
|
A quick source-level pass over the paper texts helps anchor this interpretation. Direct "we hypothesize" style statements cluster into the same buckets used here: LoRA and RandLoRA for low-rank sufficiency limits; OFT and ETHER for orthogonality and preserved angular structure; DoRA and DeLoRA for direction-strength decoupling; IA3 for scaling-only adaptation; SHiRA and C3A for high-rank and structural alternatives; AntiPaSTO for SVD-coordinate intervention and OOD transfer. That clustering is not perfect, but it is strong enough to justify organizing the literature by theme rather than by year.
|
||||||
|
<!-- TODO: is this meta statement needed for the audience, or even directed at them? We don't actually do clustering either. -->
|
||||||
|
|
||||||
**Methods:** PiSSA (5), SVFT (2), SSVD (3.5), CLOVER (3.5), PSOFT (3.5), AntiPaSTO (4.5), AdaLoRA (1)
|
|
||||||
|
|
||||||
**Core claim:** The SVD of each weight matrix provides the right coordinate system for intervention. The left singular vectors ($U$) define the output feature space, the right singular vectors ($V$) define the input feature space, and the singular values ($\Sigma$) quantify relative importance.
|
The strongest recurring signal is *basis choice*. SVD-aware methods such as PiSSA, SSVD, CLOVER, and PSOFT often beat random-basis baselines under similar budgets in reported setups. In practical terms, initializing in the model's singular-vector basis reduces the search problem. The optimizer starts in a subspace the model already uses. This is not proof that SVD is uniquely correct, but it is stronger evidence than a single benchmark win.
|
||||||
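To make "initializing in the model's singular-vector basis" concrete, here is a minimal sketch of a PiSSA-style split: the top-$r$ singular directions become the trainable low-rank factors and the remainder is frozen. The function name `pissa_style_init` and the even split of $\sqrt{\Sigma}$ between the two factors are illustrative assumptions, not a copy of any library's implementation.

```python
import numpy as np


def pissa_style_init(W, r):
    """Split W into trainable factors B @ A (top-r SVD) plus a frozen residual."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(S[:r])
    B = U[:, :r] * sqrt_s            # (out, r): left vectors scaled by sqrt(sigma)
    A = sqrt_s[:, None] * Vt[:r]     # (r, in): right vectors scaled by sqrt(sigma)
    W_res = W - B @ A                # frozen residual holding the remaining spectrum
    return A, B, W_res


rng = np.random.default_rng(0)
W = rng.standard_normal((32, 24))
r = 4
A, B, W_res = pissa_style_init(W, r)
singular_values = np.linalg.svd(W, compute_uv=False)
```

At initialization `W_res + B @ A` reproduces `W`, so the model's outputs are unchanged, and the largest singular value left in the residual is the $(r+1)$-th singular value of `W`. Training then moves only the principal directions.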
|
|
||||||
**Evidence for:** PiSSA's faster convergence shows the principal components carry the signal -- initializing in SVD space skips the "re-discovery" phase that random-init LoRA wastes. SVFT recovers 96% of full FT performance by learning *only* coefficients over the model's own singular vectors. SSVD's convergence hierarchy (SSVD > PiSSA > DoRA > LoRA) holds across scales and strengthens with model size. CLOVER extends this to *joint* SVD across functionally coupled matrices (Q-K, V-O), gaining 7.6% over LoRA.
|
*Orthogonal* methods add the next piece. OFT and BOFT show that bounded rotations can preserve useful behavior while still adapting to new tasks. The Cayley parameterization appears across OFT, SSVD, PSOFT, and AntiPaSTO because it keeps rotations orthogonal without repeated projection steps. Pure orthogonality can be too rigid when tasks need gain changes, so methods that pair rotations with magnitude control tend to perform better.
|
||||||
|
|
||||||
**Evidence against / limitations:** SVD is a linear decomposition of a nonlinear system. It captures the dominant linear subspaces but may miss nonlinear structure. The SVD basis is also model-specific (not transferable across checkpoints without re-computing). No method has demonstrated that SVD coordinates are *uniquely* correct vs. other matrix decompositions (NMF, ICA, etc.).
|
That leads to the *direction-versus-strength* split. DoRA, DeLoRA, ROAD, and AntiPaSTO all separate where to move in weight space from how far to move. In runs that report careful ablations, this split often improves stability and sometimes final accuracy. Whether this is a deep property of transformer computation or mainly an optimization advantage is still open.
|
||||||
|
|
||||||
**Bottom line:** The strongest evidence cluster. 6 of the top-7 scoring methods use SVD. The model's own singular-vector basis appears to be a better intervention basis than random (LoRA), per-element (IA3), or spectral (FourierFT) alternatives.
|
A parallel thread is *gain control*. IA3, VeRA, and LN tuning show that a lot of adaptation comes from rescaling existing features instead of inventing new ones. This explains why tiny parameter budgets can work well on many tasks. It also clarifies where they fail: when tasks require genuinely new feature combinations, scaling-only methods plateau.
|
||||||
|
|
||||||
### Theme 2: Orthogonal Transformations Preserve Structure
|
The *rank* debate looks secondary once basis is accounted for. Full-rank updates can help on harder tasks, as RandLoRA and C3A suggest, but PiSSA and SVFT show that a good low-rank subspace can beat a poorly chosen full-rank update. In practice, "which subspace" matters more than "how many free directions".
|
||||||
|
|
||||||
**Methods:** OFT (2.5), BOFT (4), GOFT (1), HRA (1), ETHER (2.5), OFTv2 (1), PSOFT (3.5), RoAd (1), AntiPaSTO (4.5)
|
Finally, methods that respect *functional structure* are promising but early. CLOVER's joint treatment of Q-K and V-O pairs outperforms per-matrix updates, and ReFT shows that targeted activation interventions can be far more parameter-efficient than weight updates. Both suggest that treating transformers as computation graphs, not bags of matrices, is a productive direction.
|
||||||
|
|
||||||
**Core claim:** Useful adaptation should preserve the pairwise angular relationships between neuron weight vectors. Orthogonal transformations ($R \in O(d)$) rotate the weight space without distorting it, maintaining the "hyperspherical energy" that encodes learned knowledge.
|
### Overall picture
|
||||||
|
|
||||||
**Evidence for:** OFT's data efficiency (converges with 5% of data) and training stability directly result from bounded deviation. BOFT beats full FT while preserving hyperspherical energy. ETHER's learning-rate robustness and fast convergence come from bounded distance to identity. The Cayley parameterization (used by OFT, SSVD, PSOFT, AntiPaSTO) guarantees exact orthogonality without projection steps.
|
Across methods, the same pattern keeps repeating: adapters work best when they preserve pretrained structure and then move within it in controlled ways. SVD-aware coordinates identify high-signal directions, near-orthogonal transforms protect useful geometry, and explicit strength controls prevent overwriting. This is currently the strongest empirical pattern in the catalog. It does not settle causality by itself, but it narrows the search space and yields concrete, falsifiable predictions for mechanistic work.
|
||||||
|
|
||||||
**Evidence against / limitations:** Pure orthogonality prevents magnitude changes, limiting adaptation when neuron importance needs rescaling. DoRA's analysis shows that full FT *does* change magnitudes significantly. Methods that relax strict orthogonality (ETHER+, DeLoRA) often outperform strict ones, suggesting orthogonality is a useful *bias* but not a hard constraint.
|
|
||||||
|
|
||||||
**Bottom line:** Orthogonality is a strong default constraint -- it bounds drift from the pretrained weights, which helps against catastrophic forgetting and enables data-efficient adaptation. But the best methods (AntiPaSTO, PSOFT) combine orthogonal rotations with learned magnitude adjustments, suggesting "mostly orthogonal + small scaling" is the true operating point.
|
|
||||||
|
|
||||||
### Theme 3: Decoupling Magnitude from Direction
|
|
||||||
|
|
||||||
**Methods:** DoRA (4.5), DeLoRA (3.5), ROAD (1), AntiPaSTO (4.5), SSVD (3.5)
|
|
||||||
|
|
||||||
**Core claim:** Weight updates decompose into *what direction to change* (angular component) and *how much to change* (magnitude component). These should be learned separately because they serve different purposes: direction captures task-relevant feature selection, magnitude captures task-specific intensity.
|
|
||||||
|
|
||||||
**Evidence for:** DoRA's weight decomposition analysis shows that full FT exhibits distinct magnitude vs. direction update patterns that LoRA conflates. DeLoRA's robustness to learning rate and training duration comes from explicitly bounding the magnitude while letting direction train freely. ROAD's $(\theta, \alpha)$ parameterization provides the cleanest per-pair decoupling.
|
|
||||||
|
|
||||||
**Evidence against:** The direction/magnitude decomposition is not unique (many decompositions exist). The evidence is mostly about training stability, not about capturing a fundamental property of the model.
|
|
||||||
|
|
||||||
**Bottom line:** Decoupling direction from magnitude is a reliable engineering principle that improves robustness and interpretability. Whether it reflects something deep about how transformers compute (gain modulation vs. selectivity tuning) or is simply good optimization practice remains open.
|
|
||||||
|
|
||||||
### Theme 4: Gain Control -- Adaptation as Rescaling
|
|
||||||
|
|
||||||
**Methods:** IA3 (1), VeRA (1), LN Tuning (1), DoRA (4.5)
|
|
||||||
|
|
||||||
**Core claim:** Much of task adaptation is reweighting what the model already computes. Learned scaling vectors, normalization adjustments, and magnitude modulation can achieve competitive adaptation without introducing new weight-space directions.
|
|
||||||
|
|
||||||
**Evidence for:** IA3's success with just $d$-dimensional scaling vectors; VeRA's success with only per-layer scaling of frozen random matrices; LN Tuning's competitive results with only normalization parameters. These suggest that the pretrained model already computes most useful features -- the task bottleneck is *which ones to attend to*, not *what to compute*.
|
|
||||||
|
|
||||||
**Evidence against:** All pure scaling methods hit a performance ceiling on complex tasks. IA3 struggles when novel feature combinations are required. This ceiling is informative: it tells us that adaptation is *partly* rescaling and *partly* restructuring, and the hard tasks require the latter.
|
|
||||||
|
|
||||||
**Bottom line:** Gain control handles the "easy" adaptation (60-80% of the gap to full FT on standard benchmarks). The remaining gap requires structural changes (new directions, rotations). But the fraction of adaptation that's just rescaling is larger than most people assume.
|
|
||||||
|
|
||||||
### Theme 5: Low-Rank vs. Full-Rank -- Where the Ceiling Is
|
|
||||||
|
|
||||||
**Methods:** LoRA (2), RandLoRA (2.5), C3A (1), LoHa (1), SHiRA (1)
|
|
||||||
|
|
||||||
**Core claim:** LoRA's low-rank constraint ($\Delta W = BA$, $\text{rank} \leq r$) is both its strength (parameter efficiency) and its weakness (expressivity ceiling). Methods that achieve full-rank updates with few parameters either match or beat LoRA.
|
|
||||||
|
|
||||||
**Evidence for:** RandLoRA shows that increasing LoRA's rank has diminishing returns, while full-rank random-basis combinations continue to improve. The loss landscape analysis shows RandLoRA reaches local minima closer to full FT. C3A achieves full-rank via circulant structure. LoHa achieves higher effective rank via Hadamard products.
|
|
||||||
|
|
||||||
**Evidence against:** PiSSA and SVFT show that *which* rank-$r$ subspace you use matters more than having full rank. A well-chosen rank-$r$ subspace (SVD top components) outperforms a poorly chosen full-rank update. The rank bottleneck may be a red herring -- it's the *basis* that matters.
|
|
||||||
|
|
||||||
**Bottom line:** Low rank suffices for easy tasks; full rank is needed for hard ones (vision-language, complex reasoning). But the SVD methods suggest that even for hard tasks, a good low-rank basis can recover most of the gain, making the rank debate secondary to the basis debate.
|
|
||||||
|
|
||||||
### Theme 6: Respecting Functional Architecture
|
|
||||||
|
|
||||||
**Methods:** CLOVER (3.5), ReFT (2), Poly/X-LoRA (1)
|
|
||||||
|
|
||||||
**Core claim:** Treating each weight matrix independently ignores the functional structure of attention. Q-K matrices jointly define the attention pattern; V-O jointly define value extraction. ReFT goes further: intervene on *activations* at specific (layer, position) sites rather than on weights at all.
|
|
||||||
|
|
||||||
**Evidence for:** CLOVER's joint SVD across Q-K and V-O pairs beats LoRA by 5.7-9% and PiSSA by 0.6-1% across the four LLaMA variants it reports. ReFT's 15-65x efficiency gain suggests that a few critical activation sites carry most of the task signal. Both methods exploit the insight that transformer computation has structure *beyond* individual weight matrices.
|
|
||||||
|
|
||||||
**Evidence against:** CLOVER is one paper; the cross-layer coupling idea needs broader validation. ReFT operates in a fundamentally different space (activations vs. weights), making comparison with weight methods not fully apples-to-apples.
|
|
||||||
|
|
||||||
**Bottom line:** Early but promising. The functional architecture of attention (which matrices cooperate) and the computational graph (where in the forward pass to intervene) are underexploited priors. As models grow, these priors should become *more* valuable because they reduce the search space along functionally meaningful dimensions.
|
|
||||||
|
|
||||||
### Summary Table
|
|
||||||
|
|
||||||
| Theme | Core PoV | Top method | Score | Strength of evidence |
|
|
||||||
|-------|----------|-----------|-------|---------------------|
|
|
||||||
| SVD basis | Model's singular-vector basis is the natural coordinate system | PiSSA | 5 | Strong: 6/7 top methods use SVD |
|
|
||||||
| Orthogonality | Preserve angular structure, bound deviation | BOFT | 4 | Strong: data efficiency, training stability |
|
|
||||||
| Mag/Dir decoupling | Direction = what; magnitude = how much | DoRA | 4.5 | Moderate: good engineering, unclear if fundamental |
|
|
||||||
| Gain control | Adaptation is mostly rescaling | IA3 | 1 | Moderate: works until it doesn't |
|
|
||||||
| Rank debate | Full rank > low rank for hard tasks | RandLoRA | 2.5 | Moderate: secondary to basis choice |
|
|
||||||
| Functional architecture | Respect attention's internal structure | CLOVER | 3.5 | Early: one paper, promising signal |
|
|
||||||
|
|
||||||
### The Emerging Picture
|
|
||||||
|
|
||||||
These themes are not independent. The top-scoring methods combine multiple themes:
|
|
||||||
|
|
||||||
- **PiSSA** (5): SVD basis + data efficiency
|
|
||||||
- **AntiPaSTO** (4.5): SVD basis + orthogonal rotation + mag/dir decoupling + OOD transfer
|
|
||||||
- **DoRA** (4.5): Mag/dir decoupling + beats full FT + widely adopted
|
|
||||||
- **BOFT** (4): Orthogonal + beats full FT + data efficiency
|
|
||||||
|
|
||||||
The emerging consensus: the model's SVD basis defines the natural coordinate system for intervention. Within that basis, orthogonal rotations with small learned magnitude adjustments capture most useful adaptations. The direction/magnitude decomposition makes these adaptations robust and interpretable. When we additionally respect the functional architecture of attention (CLOVER) or intervene directly on activations (ReFT), further efficiency gains are possible.
|
|
||||||
|
|
||||||
What remains unclear: whether these geometric priors capture something *fundamentally* true about how transformers compute, or whether they are "just" good inductive biases that happen to work well with current architectures. The answer matters for alignment: if the SVD basis encodes *causally relevant* computational structure, then adapter-based interventions are genuine interpretability tools; if it's just a convenient coordinate system, the causal claims don't follow.
|
|
||||||
|
|
||||||
|
<!-- TODO kind of weak; I'd rather make a prediction, or state strength of evidence, or say if it changed my mind here -->
|
||||||
|
|||||||
@@ -0,0 +1,244 @@
|
|||||||
|
# Adapter Pseudocode (pseudopy)
|
||||||
|
|
||||||
|
Companion to [adapters_vargdown.argdown](adapters_vargdown.argdown).
|
||||||
|
Each adapter's intervention as pseudopy: minimal, self-documenting, shapes in comments.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. LoRA -- baseline
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── LoRA: low-rank weight update ──
|
||||||
|
ΔW ← B @ A # ΔW ∈ ℝ^{m×n}, rank r
|
||||||
|
y ← (W + (α/r) · ΔW) @ x # W frozen; A, B learned
|
||||||
|
```
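
A minimal NumPy sketch of the LoRA forward pass above, under stated assumptions: the shapes, the random stand-in for `W`, and the zero initialization of `B` (the common LoRA choice, so the adapter starts as a no-op) are illustrative and not taken from this repository.

```python
import numpy as np

def lora_forward(W, A, B, x, alpha):
    """Compute y = (W + (alpha / r) * B @ A) @ x with W frozen."""
    r = A.shape[0]
    delta_W = B @ A                        # (m, n), rank <= r
    return (W + (alpha / r) * delta_W) @ x

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
W = rng.normal(size=(m, n))                # stand-in for a frozen pretrained weight
A = rng.normal(size=(r, n))                # learned, random init
B = np.zeros((m, r))                       # learned, zero init: adapter starts as a no-op
x = rng.normal(size=n)

y_init = lora_forward(W, A, B, x, alpha=4.0)

# Once B is nonzero, the update still cannot exceed rank r.
B_trained = rng.normal(size=(m, r))
update_rank = np.linalg.matrix_rank(B_trained @ A)
```

The rank bound is the structural hypothesis: whatever training does to `A` and `B`, the update lives in an `r`-dimensional subspace.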
|
||||||
|
|
||||||
|
## 2. OFT -- orthogonal rotation
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── OFT: Cayley rotation of weight rows ──
|
||||||
|
Q_skew ← skew(Q) # Q_skew ∈ ℝ^{b×k×k}, antisymmetric
|
||||||
|
R ← (I + Q_skew)(I - Q_skew)⁻¹ # Cayley map → R ∈ O(k)
|
||||||
|
R_full ← blkdiag(R₁, ..., R_b) # R_full ∈ O(d)
|
||||||
|
y ← (W @ R_fullᵀ) @ x                   # rotate rows, preserve angles
|
||||||
|
```
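
The angle-preservation claim can be checked numerically. The sketch below assumes equally sized blocks and random stand-in matrices; `cayley` and `block_diag` are hypothetical helper names, not functions from any OFT implementation.

```python
import numpy as np

def cayley(Q):
    """Cayley map R = (I + S)(I - S)^-1, where S is the skew-symmetric part of Q."""
    S = 0.5 * (Q - Q.T)
    I = np.eye(S.shape[0])
    return (I + S) @ np.linalg.inv(I - S)

def block_diag(blocks):
    """Place equally sized square blocks on the diagonal of a larger matrix."""
    k = blocks[0].shape[0]
    out = np.zeros((k * len(blocks), k * len(blocks)))
    for i, blk in enumerate(blocks):
        out[i * k:(i + 1) * k, i * k:(i + 1) * k] = blk
    return out

rng = np.random.default_rng(0)
b, k, m = 3, 4, 5                          # b blocks of size k, so d = b * k
d = b * k
W = rng.normal(size=(m, d))
R_full = block_diag([cayley(rng.normal(size=(k, k))) for _ in range(b)])
W_rot = W @ R_full.T                       # rotate every row of W

gram_before = W @ W.T                      # pairwise inner products between rows
gram_after = W_rot @ W_rot.T
```

Because the Gram matrix is unchanged, every pairwise angle and every row norm is unchanged too, which is the structure OFT is built to keep.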
|
||||||
|
|
||||||
|
## 3. VeRA -- random projection + scaling
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── VeRA: shared random matrices, per-layer scaling ──
|
||||||
|
# A ∈ ℝ^{r×n}, B ∈ ℝ^{m×r}: frozen random, shared ALL layers
|
||||||
|
# λ_d ∈ ℝ^r, λ_b ∈ ℝ^m: learned per-layer (λ_d scales rows of A, λ_b scales rows of B)
|
||||||
|
ΔW ← (λ_b ⊙ B) @ (λ_d ⊙ A) # ΔW ∈ ℝ^{m×n}
|
||||||
|
y ← (W + ΔW) @ x
|
||||||
|
```
|
||||||
|
|
||||||
|
## 4. DoRA -- magnitude/direction decomposition
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── DoRA: decouple ‖W‖ from W/‖W‖ ──
|
||||||
|
# m ∈ ℝ^{m}: learned magnitude per output neuron
|
||||||
|
V ← W + α · B @ A # updated direction
|
||||||
|
V̂ ← V / ‖V‖_col                       # unit direction; the norm is treated as a constant (.detach())
|
||||||
|
y ← (m · V̂) @ x # magnitude × direction
|
||||||
|
```
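
A small NumPy sketch of the decomposition above. It assumes an `(out, in)` weight layout with one norm per output neuron (per row), following the comment on `m`; flip the axis if your layout differs. NumPy has no autograd, so the detached norm is not modelled, and `dora_weight` is a hypothetical name.

```python
import numpy as np

def dora_weight(W, A, B, mag, alpha=1.0):
    """Return mag * V / ||V||, with one norm per output neuron (row of V)."""
    V = W + alpha * (B @ A)
    row_norms = np.linalg.norm(V, axis=1, keepdims=True)
    return mag[:, None] * (V / row_norms)

rng = np.random.default_rng(0)
n_out, n_in, r = 6, 5, 2
W = rng.normal(size=(n_out, n_in))
A = rng.normal(size=(r, n_in))
B = np.zeros((n_out, r))                   # zero init, so V == W at the start
mag = np.linalg.norm(W, axis=1)            # init magnitude to the pretrained row norms

W_init = dora_weight(W, A, B, mag)         # reproduces W at init
W_scaled = dora_weight(W, A, B, 2.0 * mag) # change magnitude only

# Cosine between corresponding rows: direction is untouched by a magnitude change.
cos = np.sum(W_init * W_scaled, axis=1) / (
    np.linalg.norm(W_init, axis=1) * np.linalg.norm(W_scaled, axis=1)
)
```

The two knobs are independent by construction: `mag` changes row norms only, while `A` and `B` change row directions only.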
|
||||||
|
|
||||||
|
## 5. DeLoRA -- bounded deviation
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── DeLoRA: normalize rank-1 components, scale by λ ──
|
||||||
|
# A ∈ ℝ^{r×n}, B ∈ ℝ^{m×r}: learned
|
||||||
|
# λ ∈ ℝ^r: learned per-component strength
|
||||||
|
âᵢ ← ‖aᵢ‖, b̂ᵢ ← ‖bᵢ‖               # per-component norms
|
||||||
|
s ← (λ/r) / (â · b̂) # normalize each outer product
|
||||||
|
ΔW ← B · diag(s) · A # direction normalized, strength via λ
|
||||||
|
y ← W @ x + ΔW @ x
|
||||||
|
```
|
||||||
|
|
||||||
|
## 6. PiSSA -- SVD initialization
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── PiSSA: init adapters from top-r SVD of W ──
|
||||||
|
U, Σ, Vᵀ ← svd(W) # W ∈ ℝ^{m×n}
|
||||||
|
A ← U_{:,:r} · √Σ_{:r} # A ∈ ℝ^{m×r}, principal left
|
||||||
|
B ← √Σ_{:r} · Vᵀ_{:r,:} # B ∈ ℝ^{r×n}, principal right
|
||||||
|
W_res ← U_{:,r:} · Σ_{r:} · Vᵀ_{r:,:} # residual, frozen
|
||||||
|
# ── forward (identical to LoRA) ──
|
||||||
|
y ← (W_res + A @ B) @ x
|
||||||
|
```
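
A runnable sketch of the initialization above using `numpy.linalg.svd`. The matrix is a random stand-in and `pissa_init` is a hypothetical helper name, not the PiSSA authors' code.

```python
import numpy as np

def pissa_init(W, r):
    """Split W into a trainable rank-r principal part (A @ B) and a frozen residual."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_S = np.sqrt(S[:r])
    A = U[:, :r] * sqrt_S                  # (m, r): scale columns by sqrt of singular values
    B = sqrt_S[:, None] * Vt[:r, :]        # (r, n): scale rows by sqrt of singular values
    W_res = (U[:, r:] * S[r:]) @ Vt[r:, :] # frozen residual from the remaining components
    return A, B, W_res, S

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 6))
r = 2
A, B, W_res, S = pissa_init(W, r)
```

At initialization `W_res + A @ B` reproduces `W`, so the model's outputs are unchanged; the difference from LoRA is only which directions receive gradient.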
|
||||||
|
|
||||||
|
## 7. SVFT -- SVD coefficient tuning
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── SVFT: learn sparse coefficients over W's own singular vectors ──
|
||||||
|
U, Σ, Vᵀ ← svd(W) # frozen
|
||||||
|
# select k sparse (i,j) pairs; c ∈ ℝ^k learned
|
||||||
|
ΔW ← Σₜ cₜ · uᵢ vⱼᵀ # sparse combo of outer products
|
||||||
|
y ← (W + ΔW) @ x
|
||||||
|
```
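
To make the "coefficients over the model's own singular vectors" idea concrete, here is a sketch with a hand-picked diagonal sparsity pattern. The pattern, the coefficient values, and the name `svft_delta` are assumptions for illustration only.

```python
import numpy as np

def svft_delta(U, Vt, pairs, coeffs):
    """Sum of c_t * u_i v_j^T over the selected (i, j) index pairs."""
    delta = np.zeros((U.shape[0], Vt.shape[1]))
    for (i, j), c in zip(pairs, coeffs):
        delta += c * np.outer(U[:, i], Vt[j, :])
    return delta

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 5))
U, S, Vt = np.linalg.svd(W, full_matrices=False)

pairs = [(0, 0), (1, 1), (2, 2)]           # diagonal pattern: pair u_i with its own v_i
coeffs = np.array([0.5, 0.2, 0.1])         # stand-ins for learned coefficients
W_new = W + svft_delta(U, Vt, pairs, coeffs)
S_new = np.linalg.svd(W_new, compute_uv=False)

expected = S.copy()
expected[:3] += coeffs                     # diagonal pairs only shift singular values
```

With diagonal pairs the update only rescales existing singular directions; off-diagonal pairs `(i, j)` with `i != j` are what let the method mix directions.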
|
||||||
|
|
||||||
|
## 8. SSVD -- asymmetric SVD rotation
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── SSVD: rotate right singular vectors, preserve left ──
|
||||||
|
U, Σ, Vᵀ ← svd(W) # all frozen
|
||||||
|
K ← skew(θ) # K ∈ ℝ^{k×k}, learned antisymmetric
|
||||||
|
G ← (I - K)(I + K)⁻¹ # Cayley → G ∈ O(k)
|
||||||
|
Σ̂ ← Σ; Σ̂_{:k} += ΔΣ # shift top-k singular values (learned)
|
||||||
|
V̂ᵀ ← Vᵀ; V̂ᵀ_{:k} ← G @ Vᵀ_{:k} # rotate input-space vectors only
|
||||||
|
y ← U · diag(Σ̂) · V̂ᵀ @ x
|
||||||
|
# params: k(k-1)/2 + k
|
||||||
|
```
|
||||||
|
|
||||||
|
$$W' = U \, (\Sigma + \Delta\Sigma) \, G_k \, V^\top$$
|
||||||
|
|
||||||
|
## 9. IA3 -- activation scaling
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── IA3: element-wise scaling of activations ──
|
||||||
|
# λ ∈ ℝ^d: learned, init=1
|
||||||
|
y ← W @ (x ⊙ λ) # FFN: scale input channels
|
||||||
|
y ← (W @ x) ⊙ λ # attn K,V: scale output channels
|
||||||
|
```
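
Both IA3 forms are equivalent to rescaling columns or rows of the frozen weight, which is why the learned vectors can be folded into `W` after training. A quick check with random stand-in values:

```python
import numpy as np

rng = np.random.default_rng(0)
n_out, n_in = 4, 6
W = rng.normal(size=(n_out, n_in))
x = rng.normal(size=n_in)
lam_in = rng.normal(size=n_in)             # stand-in for a learned input-channel scale
lam_out = rng.normal(size=n_out)           # stand-in for a learned output-channel scale

# Scaling input channels == scaling the columns of W.
y_in = W @ (x * lam_in)
y_in_merged = (W * lam_in) @ x

# Scaling output channels == scaling the rows of W.
y_out = (W @ x) * lam_out
y_out_merged = (W * lam_out[:, None]) @ x
```

Seen this way, IA3 is a diagonal update in the neuron basis: it can change gains but cannot mix channels.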
|
||||||
|
|
||||||
|
## 10. ROAD -- rotary adaptation
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── ROAD: 2D rotation + magnitude per activation pair ──
|
||||||
|
# θ ∈ ℝ^{d/2}: learned rotation angles
|
||||||
|
# α ∈ ℝ^{d/2}: learned magnitudes
|
||||||
|
for i in range(d//2):
|
||||||
|
x̂[2i] ← α_i · (cos θ_i · x[2i] - sin θ_i · x[2i+1])
|
||||||
|
x̂[2i+1] ← α_i · (sin θ_i · x[2i] + cos θ_i · x[2i+1])
|
||||||
|
y ← W @ x̂ # α=1, θ=0 → identity
|
||||||
|
```
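
The loop above vectorizes directly. The sketch below follows the pseudocode as written (pairs are adjacent channels `2i, 2i+1`); `road_transform` is a hypothetical name and the inputs are random stand-ins.

```python
import numpy as np

def road_transform(x, theta, alpha):
    """Rotate each adjacent channel pair by theta_i and scale it by alpha_i."""
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = alpha * (np.cos(theta) * x_even - np.sin(theta) * x_odd)
    out[1::2] = alpha * (np.sin(theta) * x_even + np.cos(theta) * x_odd)
    return out

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)

x_identity = road_transform(x, np.zeros(d // 2), np.ones(d // 2))
x_rotated = road_transform(x, rng.normal(size=d // 2), np.ones(d // 2))
```

With `alpha = 1` the transform is a pure rotation and preserves the vector norm; any norm change comes from `alpha` alone, which is the per-pair direction/strength split.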
|
||||||
|
|
||||||
|
## 11. AntiPaSTO -- SVD Cayley steering
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── AntiPaSTO: Cayley rotation of SVD singular vectors ──
|
||||||
|
U, Σ, Vᵀ ← svd(W) # frozen; computed once at init
|
||||||
|
K ← skew(θ) # θ ∈ ℝ^{k(k-1)/2}, learned
|
||||||
|
R ← (I - K)(I + K)⁻¹ # Cayley → R ∈ O(k)
|
||||||
|
# ── rotate both U and V by same R ──
|
||||||
|
Û ← U; Û_{:,:k} ← U_{:,:k} @ R # rotate output-space
|
||||||
|
V̂ᵀ ← Vᵀ; V̂ᵀ_{:k,:} ← R @ Vᵀ_{:k,:} # rotate input-space
|
||||||
|
W' ← Û · diag(Σ) · V̂ᵀ # reconstruct
|
||||||
|
y ← W' @ x
|
||||||
|
# antiparallel: negate θ → opposite behavioral direction
|
||||||
|
```
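
The antiparallel comment rests on a property of the Cayley map itself: negating the parameters yields the inverse (transposed) rotation. A numerical check, assuming `skew` fills the strict upper triangle from the parameter vector; the helper names and this fill order are assumptions, not taken from the AntiPaSTO code.

```python
import numpy as np

def skew_from_vector(theta, k):
    """Build a k x k skew-symmetric matrix from k(k-1)/2 free parameters."""
    K = np.zeros((k, k))
    K[np.triu_indices(k, 1)] = theta
    return K - K.T

def cayley(K):
    """Cayley map (I - K)(I + K)^-1; orthogonal whenever K is skew-symmetric."""
    I = np.eye(K.shape[0])
    return (I - K) @ np.linalg.inv(I + K)

k = 4
rng = np.random.default_rng(0)
theta = 0.1 * rng.normal(size=k * (k - 1) // 2)
R_pos = cayley(skew_from_vector(theta, k))
R_neg = cayley(skew_from_vector(-theta, k))
```

So `+θ` and `-θ` move the singular vectors by exactly opposite rotations from the same starting point, with `θ = 0` recovering the pretrained weights.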
|
||||||
|
|
||||||
|
## 12. AdaLoRA -- adaptive SVD rank
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── AdaLoRA: SVD-parameterized with importance pruning ──
|
||||||
|
ΔW ← P · diag(Λ) · Q # P ∈ ℝ^{m×r}, Q ∈ ℝ^{r×n}
|
||||||
|
# importance score per singular value:
|
||||||
|
s_i ← |Λ_i| + β · ‖pᵢ‖ · ‖qᵢ‖ # sensitivity-weighted
|
||||||
|
# prune: zero out components with lowest s_i per budget
|
||||||
|
mask ← topk(s, budget)
|
||||||
|
ΔW ← P · diag(Λ ⊙ mask) · Q
|
||||||
|
y ← (W + ΔW) @ x
|
||||||
|
```
|
||||||
|
|
||||||
|
## 13. BOFT -- butterfly orthogonal
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── BOFT: butterfly-factorized O(d log d) orthogonal ──
|
||||||
|
# m butterfly factors B₁, ..., B_m each ∈ ℝ^{d/b × b × b}
|
||||||
|
# each Bⱼ is block-diagonal of small orthogonal matrices
|
||||||
|
R ← B_m @ ... @ B₂ @ B₁ # R ∈ O(d), O(d log d) params
|
||||||
|
y ← (W @ Rᵀ) @ x
|
||||||
|
```
|
||||||
|
|
||||||
|
## 14. GOFT -- Givens rotations
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── GOFT: compose O(d) planar (Givens) rotations ──
|
||||||
|
R ← I
|
||||||
|
for (i,j,θ) in givens_pairs: # each θ learned
|
||||||
|
G ← I; G[i,i] ← cos θ; G[i,j] ← -sin θ
|
||||||
|
G[j,i] ← sin θ; G[j,j] ← cos θ
|
||||||
|
R ← G @ R # compose
|
||||||
|
y ← (W @ Rᵀ) @ x # O(d) params
|
||||||
|
```
|
||||||
|
|
||||||
|
## 15. HRA -- Householder reflection
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── HRA: chain of r Householder reflections = rank-r ∩ O(d) ──
|
||||||
|
R ← I
|
||||||
|
for i in range(r):
|
||||||
|
vᵢ ← learned # v ∈ ℝ^d
|
||||||
|
Hᵢ ← I - 2 · vᵢ vᵢᵀ / ‖vᵢ‖² # Householder reflection
|
||||||
|
R ← Hᵢ @ R
|
||||||
|
y ← (W @ Rᵀ) @ x
|
||||||
|
# bridges: rank-r perturbation ≡ r Householder reflections
|
||||||
|
```
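
The "bridge" in the last comment can be verified directly: a product of `r` Householder reflections is orthogonal and differs from the identity by a matrix of rank at most `r`. A sketch with random stand-in vectors (`householder_chain` is a hypothetical name):

```python
import numpy as np

def householder_chain(vs):
    """Compose one Householder reflection per row of vs."""
    d = vs.shape[1]
    R = np.eye(d)
    for v in vs:
        H = np.eye(d) - 2.0 * np.outer(v, v) / (v @ v)
        R = H @ R
    return R

rng = np.random.default_rng(0)
d, r = 8, 3
vs = rng.normal(size=(r, d))               # stand-ins for the learned vectors v_i
R = householder_chain(vs)

# R - I only has components in span{v_1, ..., v_r}.
perturbation_rank = np.linalg.matrix_rank(R - np.eye(d), tol=1e-10)
```

Each reflection has determinant -1, so an odd `r` gives an improper rotation; whether that matters in practice is not something this sketch addresses.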
|
||||||
|
|
||||||
|
## 16. RandLoRA -- full-rank via random bases
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── RandLoRA: sum of scaled random rank-r bases ──
|
||||||
|
# A_i, B_i: frozen random matrices
|
||||||
|
# d_i, b_i ∈ ℝ^r: learned per-component scaling
|
||||||
|
ΔW ← Σᵢ (b_i ⊙ Bᵢ) @ (d_i ⊙ Aᵢ) # full rank possible
|
||||||
|
y ← (W + ΔW) @ x
|
||||||
|
```
|
||||||
|
|
||||||
|
## 17. FourierFT -- spectral coefficients
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── FourierFT: sparse Fourier coefficients ──
|
||||||
|
# select k frequency indices; c ∈ ℝ^k learned
|
||||||
|
ΔW ← iFFT2(scatter(c, indices, shape=(m,n)))
|
||||||
|
y ← (W + ΔW) @ x
|
||||||
|
```
|
||||||
|
|
||||||
|
## 18. CLOVER -- joint SVD across attention pairs
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── CLOVER: joint SVD over Q-K and V-O pairs per head ──
|
||||||
|
# ── init: decompose paired attention matrices ──
|
||||||
|
W_QK ← W_Q @ W_K.T # W_QK ∈ ℝ^{d×d}, combined Q-K
|
||||||
|
U_qk, S_qk, V_qk ← svd(W_QK) # per-head SVD
|
||||||
|
W_VO ← W_V @ W_O.T # W_VO ∈ ℝ^{d×d}, combined V-O
|
||||||
|
U_vo, S_vo, V_vo ← svd(W_VO)
|
||||||
|
|
||||||
|
# ── forward: only S is learned, U/V frozen ──
|
||||||
|
# rewrite Q,K from shared orthogonal basis:
|
||||||
|
W_Q' ← U_qk @ diag(√S_qk) # Q uses left singular vectors
|
||||||
|
W_K' ← V_qk @ diag(√S_qk) # K uses right singular vectors
|
||||||
|
# fine-tune: learn ΔS_qk, ΔS_vo (full-rank update via all directions)
|
||||||
|
S_qk' ← S_qk + ΔS_qk                    # learned shifts; S_qk' replaces S_qk in W_Q', W_K' above
|
||||||
|
y ← attn(W_Q' @ x, W_K' @ x, ...)
|
||||||
|
# params: 2 × rank scalars per head (pruning: zero small S entries)
|
||||||
|
```
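
A sketch of the Q-K half of the decomposition above for a single head. It assumes `W_Q` and `W_K` are stored as `(d_model, d_head)` so that `W_Q @ W_K.T` is `d_model x d_model`; the sizes and the shift values `delta_S` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 3
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))

W_QK = W_Q @ W_K.T                         # (d_model, d_model), rank <= d_head
U, S, Vt = np.linalg.svd(W_QK)
U_h, S_h, V_h = U[:, :d_head], S[:d_head], Vt[:d_head, :].T

# Rewrite Q and K in the shared orthogonal basis.
W_Q_new = U_h * np.sqrt(S_h)
W_K_new = V_h * np.sqrt(S_h)

# Fine-tuning then shifts the singular values only.
delta_S = np.array([0.3, 0.0, -0.1])       # hypothetical learned shifts
W_QK_tuned = (U_h * (S_h + delta_S)) @ V_h.T
```

The rewritten pair produces the same attention logits as the original pair, but its columns are now mutually orthogonal, so each singular value controls one independent Q-K direction.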
|
||||||
|
|
||||||
|
## 19. PSOFT -- principal subspace + Cayley
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── PSOFT: PiSSA init + OFT rotation in principal subspace ──
|
||||||
|
U, Σ, Vᵀ ← svd(W)
|
||||||
|
# extract top-k subspace
|
||||||
|
# learn Cayley rotation R within that subspace only
|
||||||
|
# 80% memory reduction vs full OFT
|
||||||
|
R ← cayley(K) # K ∈ ℝ^{k×k}
|
||||||
|
W' ← U_{:,:k} @ R @ diag(Σ_{:k}) @ Vᵀ_{:k,:} + W_res
|
||||||
|
```
|
||||||
|
|
||||||
|
## 20. ReFT -- activation intervention
|
||||||
|
|
||||||
|
```
|
||||||
|
# ── ReFT: intervene on hidden states at (layer, position) ──
|
||||||
|
# R ∈ ℝ^{r×d}: learned projection with orthonormal rows (low-rank subspace)
# W_p ∈ ℝ^{r×d}, b ∈ ℝ^r: learned linear target in subspace (LoReFT form)
h ← model.layer[l].output[pos]           # hidden state at site
h_proj ← R @ h                           # current subspace coordinates
h_tgt ← W_p @ h + b                      # intervene: learned target coordinates
h ← h + Rᵀ @ (h_tgt - h_proj)            # write back (preserve complement)
|
||||||
|
```
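
The "preserve complement" property holds for any choice of target coordinates as long as `R` has orthonormal rows. The sketch below builds such an `R` with a QR factorization and uses a simple shift as the target; the helper name and the random inputs are assumptions for illustration.

```python
import numpy as np

def subspace_intervene(h, R, target):
    """Set the r coordinates R @ h to `target`, leaving the orthogonal complement alone."""
    return h + R.T @ (target - R @ h)

rng = np.random.default_rng(0)
d, r = 8, 2
Q, _ = np.linalg.qr(rng.normal(size=(d, r)))   # Q has orthonormal columns
R = Q.T                                        # (r, d), orthonormal rows
h = rng.normal(size=d)
b = rng.normal(size=r)

target = R @ h + b                             # simplest target: shift subspace coordinates by b
h_new = subspace_intervene(h, R, target)

P_perp = np.eye(d) - R.T @ R                   # projector onto the complement of the subspace
```

Only `r` of the `d` coordinates move, which is where the parameter efficiency relative to weight-space updates comes from.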
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Notation: `⊙` element-wise, `@` matmul, `←` assignment, `Σₜ` summation, `‖·‖` norm, `blkdiag` block diagonal, `skew` maps vector to skew-symmetric matrix, `cayley(K) = (I-K)(I+K)⁻¹` (the OFT block writes the inverse convention `(I+K)(I-K)⁻¹`; both are orthogonal).*
|
||||||
@@ -0,0 +1,468 @@
|
|||||||
|
===
|
||||||
|
title: Adapters as Representational Hypotheses -- Which Geometric Priors About Transformer Internals Hold Under Intervention?
|
||||||
|
author: Compiled from 33 PEFT papers (2021--2025)
|
||||||
|
model:
|
||||||
|
mode: strict
|
||||||
|
===
|
||||||
|
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
// MAIN THESIS
|
||||||
|
// Each adapter constrains how you update pretrained weights. When one
|
||||||
|
// architecture outperforms another under controlled conditions, the
|
||||||
|
// winner's structural assumptions are supported as a better description
|
||||||
|
// of the weight manifold. This is a natural experiment running across
|
||||||
|
// hundreds of papers, and almost nobody reads it as science about
|
||||||
|
// representations.
|
||||||
|
//
|
||||||
|
// Pseudocode for each adapter lives in adapters_pseudocode.md
|
||||||
|
// Evidence files are in docs/ (frozen copies of papers as markdown)
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
|
||||||
|
[Natural Manifold]: The model's own geometric structure -- its SVD basis
|
||||||
|
combined with orthogonal constraints and direction/magnitude decoupling --
|
||||||
|
defines the natural intervention manifold for transformer weight adaptation.
|
||||||
|
Adapters exploiting this structure generalize better, converge faster, and
|
||||||
|
reveal causally relevant properties of internals.
|
||||||
|
+ <SVD Basis Cluster>
|
||||||
|
+ <Orthogonal Preservation>
|
||||||
|
+ <Direction Magnitude Decoupling>
|
||||||
|
+ <Functional Architecture>
|
||||||
|
+ <Shared Structure Evidence>
|
||||||
|
+ <Prompt Tuning Orthogonal Paradigm>
|
||||||
|
- <Gain Control Sufficiency>
|
||||||
|
- <Rank Is Secondary>
|
||||||
|
- <SVD Is Linear Decomposition>
|
||||||
|
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
// THEME 1: SVD AS THE NATURAL COORDINATE SYSTEM
|
||||||
|
// Methods: PiSSA, SVFT, SSVD, CLOVER, AntiPaSTO, AdaLoRA, PSOFT
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
|
||||||
|
# SVD Basis
|
||||||
|
|
||||||
|
<SVD Basis Cluster>
|
||||||
|
|
||||||
|
(1) [SVD Init Wins]: PiSSA's SVD initialization consistently outperforms
|
||||||
|
LoRA's random initialization across 11 models and 13 tasks, showing
|
||||||
|
the principal components carry the adaptation signal. #observation
|
||||||
|
[Meng et al. 2024](https://arxiv.org/abs/2404.02948)
|
||||||
|
[evidence](docs/pissa_principal_singular_values_adaptation.md#L1-L20)
|
||||||
|
> PiSSA shares the same architecture as LoRA, but initializes the adaptor matrices A and B with the principal components of the original matrix W, and put the remaining components into a residual matrix W res which is frozen during fine-tuning. Compared to LoRA, PiSSA updates the principal components while freezing the "residual" parts, allowing faster convergence and enhanced performance. Comparative experiments of PiSSA and LoRA across 11 different models, ranging from 184M to 70B, encompassing 5 NLG and 8 NLU tasks, reveal that **PiSSA consistently outperforms LoRA under identical experimental setups.** On the GSM8K benchmark, Gemma-7B fine-tuned with PiSSA achieves an accuracy of 77.7%, surpassing LoRA's 74.53% by 3.25%.
|
||||||
|
{reason: "NeurIPS 2024, 11 models up to 70B, controlled comparisons with identical architecture (same as LoRA)", credence: 0.88}
|
||||||
|
(2) [SVD Coefficients Suffice]: SVFT recovers 96% of full FT performance
|
||||||
|
by learning only sparse coefficients over the model's own singular
|
||||||
|
vectors -- no new directions needed. #observation
|
||||||
|
[Lingam et al. 2024](https://arxiv.org/abs/2405.19597)
|
||||||
|
[evidence](docs/svft_svd_coefficient_finetuning.md#L1-L30)
|
||||||
|
> We propose SVFT, a simple approach that fundamentally differs from existing methods: the structure imposed on delta-W depends on the specific weight matrix W. Specifically, SVFT updates W as a sparse combination of outer products of its singular vectors, training only the coefficients (scales) of these sparse combinations. (...) Extensive experiments on language and vision benchmarks show that **SVFT recovers up to 96% of full fine-tuning performance while training only 0.006 to 0.25% of parameters, outperforming existing methods that only recover up to 85% performance** using 0.03 to 0.8% of the trainable parameter budget.
|
||||||
|
{reason: "preprint under review; strong controlled comparison across language and vision; 96% of full FT with 0.006% params is striking but unpublished", credence: 0.74}
|
||||||
|
(3) [Asymmetric SVD Works]: SSVD shows that selectively rotating input-space
|
||||||
|
(right) singular vectors while preserving output-space (left) vectors
|
||||||
|
matches LoRA with far fewer params, and a convergence hierarchy emerges
|
||||||
|
at scale: SSVD > PiSSA > DoRA > LoRA. #observation
|
||||||
|
[Wang, Watanabe, Van hamme 2025](https://arxiv.org/abs/2509.02830)
|
||||||
|
[evidence](docs/ssvd_structured_svd_finetuning.md#L1-L40)
|
||||||
|
> This work presents the first comprehensive integration and benchmarking of these PEFT methods within ESPnet. We further introduce structured SVD-guided (SSVD) fine-tuning, which selectively rotates input-associated right singular vectors while keeping output-associated vectors fixed to preserve semantic mappings. This design enables robust domain adaptation with minimal trainable parameters and improved efficiency. (...) **SSVD achieves comparable performance with significantly fewer trainable parameters and higher efficiency than LoRA and SoTA LoRA variants**, approaching fully fine-tuned model performance.
|
||||||
|
{reason: "first comprehensive PEFT benchmark for speech across 0.1B-2B; convergence hierarchy consistent across scales", credence: 0.80}
|
||||||
|
----
|
||||||
|
(4) [SVD Is Natural Basis]: The weight matrix's own SVD provides the right
|
||||||
|
coordinate system for adaptation -- 6 of the top 7 scoring methods use it.
|
||||||
|
{reason: "PiSSA beats random init, SVFT shows coefficients alone suffice, SSVD hierarchy holds across scales; convergent evidence from independent groups", inference: 0.85}
|
||||||
|
+> [Natural Manifold]
|
||||||
|
|
||||||
|
|
||||||
|
<Joint SVD Evidence>
|
||||||
|
|
||||||
|
(1) [Cross-Layer Coupling]: CLOVER's joint SVD across Q-K and V-O attention
|
||||||
|
matrix pairs beats per-matrix LoRA by 7.6% on LLaMA-2-7B, validated
|
||||||
|
across 4 model families on 8 commonsense tasks. #observation
|
||||||
|
[Meng, Tang, Jiang, Zhang 2024](https://arxiv.org/abs/2411.17426)
|
||||||
|
[evidence](docs/clover_joint_svd.md#L258-L275)
|
||||||
|
> CLOVER consistently outperforms all other methods across all models and tasks. Specifically, on LLaMA 7B, CLOVER outperforms LoRA, DoRA, and PiSSA by 9%, 5.3%, and 1%, respectively. On LLaMA 13B, CLOVER outperforms these methods by 5.7%, 4.7%, and 0.8%. On LLaMA-2-7B, CLOVER surpasses LoRA, DoRA, HiRA, and PiSSA by 7.6%, 5.5%, 3.8%, and 0.7%. Even on LLaMA-3-8B, with fewer trainable parameters, CLOVER outperforms by 7.1%, 2.7%, 1.2%, and 0.6%.
|
||||||
|
{reason: "Under review (not yet published at venue); controlled comparison using DoRA's own hyperparams; 4 LLaMA variants x 8 tasks; beats PiSSA by 0.7% which is modest but consistent", credence: 0.72}
|
||||||
|
----
|
||||||
|
(2) [Functional SVD Matters]: Attention matrices cooperate functionally (Q-K
|
||||||
|
define pattern, V-O define extraction), and joint decomposition captures
|
||||||
|
this coupling.
|
||||||
|
{reason: "mechanistic intuition is sound; Q and K jointly specify attention weights so their updates should be coordinated. CLOVER paper also shows 82% of features project outside PiSSA's subspace, but CLOVER covers all directions. Single paper but multi-model validation.", inference: 0.72}
|
||||||
|
+> [Natural Manifold]
|
||||||
|
|
||||||
|
|
||||||
|
<PSOFT Synthesis>
|
||||||
|
|
||||||
|
(1) [OFT in SVD Coords]: PSOFT combines PiSSA's SVD initialization with
|
||||||
|
OFT's Cayley rotation, operating in the principal subspace. Beats LoRA
|
||||||
|
on 35 NLP and CV tasks with ~80% memory reduction vs OFT. #observation
|
||||||
|
[Wu et al. 2026](https://arxiv.org/abs/2505.11235)
|
||||||
|
[evidence](docs/psoft_principal_subspace_oft.md#L1-L50)
|
||||||
|
> PSOFT improves parameter and memory efficiency without compromising performance. Although GOFT and PSOFT have the same parameter counts, **PSOFT reduces memory usage by about 80% and avoids OOM issues.** It further achieves up to an 18x improvement in parameter efficiency over BOFT, OFTv2, and LoRA variants, attaining the best average performance across all baselines with the lowest memory footprint.
|
||||||
|
{reason: "ICLR 2026; 35 NLP+CV tasks across 4 models (DeBERTa, ViT, LLaMA-3.2-3B, LLaMA-3.1-8B); code available; bridges PiSSA and OFT with theoretical geometry guarantee", credence: 0.82}
|
||||||
|
----
|
||||||
|
(2) [SVD Plus Orthogonal Synergy]: The best performing adapters combine
|
||||||
|
SVD basis selection with orthogonal constraints, suggesting both priors
|
||||||
|
are complementary.
|
||||||
|
{reason: "PSOFT + AntiPaSTO + SSVD all combine SVD with Cayley rotations and all score well; convergent design choices", inference: 0.75}
|
||||||
|
+> [Natural Manifold]
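The combination of SVD initialization and Cayley rotation in premise (1) can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the adapted weight has the form `A @ R @ B + W_res` with `R` orthogonal; the exact PSOFT parameterization (for example any extra scaling vectors) may differ, and all names here (`cayley`, `adapted_weight`, `W_res`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4

W = rng.standard_normal((d, d))          # stand-in for a frozen pretrained weight
U, S, Vt = np.linalg.svd(W, full_matrices=False)

A = U[:, :r]                             # top-r left singular vectors
B = S[:r, None] * Vt[:r]                 # singular values folded into the right factor
W_res = W - A @ B                        # residual, kept frozen


def cayley(params):
    """Map r(r-1)/2 free parameters to an r x r orthogonal matrix."""
    Q = np.zeros((r, r))
    Q[np.triu_indices(r, k=1)] = params
    Q = Q - Q.T                          # skew-symmetric: Q.T == -Q
    I = np.eye(r)
    return (I + Q) @ np.linalg.inv(I - Q)


def adapted_weight(params):
    # Assumed form: rotate inside the principal subspace only.
    return A @ cayley(params) @ B + W_res


n_params = r * (r - 1) // 2
W_at_init = adapted_weight(np.zeros(n_params))   # zero parameters give R = I

params = 0.3 * rng.standard_normal(n_params)
principal_after = A @ cayley(params) @ B
sv_after = np.linalg.svd(principal_after, compute_uv=False)[:r]
```

With zero parameters the adapted weight equals the pretrained weight, and for any parameter value the rotated principal component keeps the original top-`r` singular values. That is the sense in which the rotation changes directions without rescaling.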
|
||||||
|
|
||||||
|
|
||||||
|
<AntiPaSTO OOD Transfer>
|
||||||
|
|
||||||
|
(1) [OOD Via SVD Rotation]: AntiPaSTO achieves out-of-distribution transfer
|
||||||
|
from template persona pairs to real moral dilemmas (6.9x prompting baseline)
|
||||||
|
using Cayley rotations of SVD singular vectors. #observation
|
||||||
|
[Clark 2026](https://arxiv.org/abs/2601.07473)
|
||||||
|
[evidence](docs/antipasto_antiparallel_steering.md#L1-L40)
|
||||||
|
> Trained on 800 such pairs, our method transfers to 1,360 unseen moral dilemmas where honesty conflicts with other values, achieving **6.9x the Steering F1 of prompting** on Gemma-3-1B. We demonstrate two key advantages over prompting: OOD transfer (train on simple persona pairs, test on complex moral reasoning) and suppression bypass (steer when prompting triggers refusal).
|
||||||
|
{reason: "single-author, one trait (honesty), models up to 4B, high seed variance (std ~5-7); but OOD transfer is the strongest possible test of causal relevance", credence: 0.55}
|
||||||
|
(2) [SVD Causal Not Correlational]: If SVD rotation transfers OOD, the
|
||||||
|
geometric structure it exploits is causally relevant to behavior,
|
||||||
|
not merely correlated.
|
||||||
|
{reason: "OOD transfer is strong evidence of causality (rules out surface pattern matching), but narrow: 1 trait, 1 eval benchmark, 1 author. Independent replication would raise this considerably.", inference: 0.65}
|
||||||
|
+> [Natural Manifold]
|
||||||
|
|
||||||
|
|
||||||
|
<LoRA Baseline Limitation>
|
||||||
|
|
||||||
|
(1) [LoRA Ceiling]: Biderman et al. 2024 find LoRA underperforms full FT
|
||||||
|
on harder tasks and larger scale, suggesting the low-rank assumption
|
||||||
|
holds for surface adaptation but breaks where deep restructuring
|
||||||
|
is needed. #observation
|
||||||
|
[Biderman et al. 2024](https://arxiv.org/abs/2405.09673)
|
||||||
|
[evidence](docs/biderman_lora_limitations.md#L1-L50)
|
||||||
|
> Our results show that, in the standard low-rank settings, **LoRA substantially underperforms full finetuning.** Nevertheless, LoRA better maintains the base model's performance on tasks outside the target domain. (...) Finally, we show that full finetuning learns perturbations with a rank that is **10-100x greater than typical LoRA configurations**, possibly explaining some of the reported gaps.
|
||||||
|
{reason: "TMLR 2024 (published, peer-reviewed); Llama-2-7B on code and math. In this local corpus, the paper is referenced in multiple docs (ROAD, DeLoRA, MiSS, CLOVER), mostly as supporting related-work evidence rather than direct replication.", credence: 0.80}
|
||||||
|
----
|
||||||
|
(2) [Random Init Wastes Steps]: LoRA's random initialization works (universal
|
||||||
|
baseline) but converges slower than SVD-informed init, suggesting random
|
||||||
|
subspaces are not special -- just adequate.
|
||||||
|
{reason: "PiSSA shows 3.25% gain on GSM8K just from better init; VeRA shows random projections work at all; together: random is adequate, SVD is better", inference: 0.70}
|
||||||
|
+> [Natural Manifold]
|
||||||
|
|
||||||
|
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
// THEME 2: ORTHOGONAL TRANSFORMATIONS PRESERVE STRUCTURE
|
||||||
|
// Methods: OFT, BOFT, GOFT, HRA, ETHER, OFTv2, ROAD
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
|
||||||
|
# Orthogonality
|
||||||
|
|
||||||
|
<Orthogonal Preservation>
|
||||||
|
|
||||||
|
(1) [Hyperspherical Energy Preserved]: OFT provably preserves pairwise
|
||||||
|
neuron angles (hyperspherical energy) during adaptation, yielding
|
||||||
|
stable finetuning and strong data efficiency (converges with 5%
|
||||||
|
of training data on controllable generation). #observation
|
||||||
|
[Qiu et al. 2023](https://arxiv.org/abs/2306.07280)
|
||||||
|
[evidence](docs/oft_orthogonal_finetuning.md#L1-L40)
|
||||||
|
> We introduce a principled finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image diffusion models to downstream tasks. Unlike existing methods, **OFT can provably preserve hyper-spherical energy which characterizes the pairwise neuron relationship on the unit hypersphere.** We find that this property is crucial for preserving the semantic generation ability of text-to-image diffusion models. (...) OFT is more sample-efficient in training and converges well with only 5% of the original dataset, while both ControlNet and LoRA cannot converge until 50% of the data is present.
|
||||||
|
{reason: "NeurIPS 2023; provable preservation is rare among PEFT methods; data efficiency claim is strong but limited to vision generation", credence: 0.82}
|
||||||
|
(2) [Butterfly Orthogonal Scales]: BOFT extends OFT with butterfly factorization,
|
||||||
|
reducing parameters from O(d^2) to O(d log d) while matching or exceeding
|
||||||
|
OFT performance. Outperforms full FT on vision tasks. #observation
|
||||||
|
[Liu et al. 2023](https://arxiv.org/abs/2311.06243)
|
||||||
|
[evidence](docs/boft_butterfly_orthogonal.md#L1-L30)
|
||||||
|
> We propose an efficient orthogonal parameterization using butterfly structures. (...) By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. (...) **BOFT outperforms current state-of-the-art methods by a considerable margin**, validating its superior parameter-efficiency and generalization ability.
|
||||||
|
{reason: "ICLR 2024; subsumes OFT theoretically; broad empirical validation across vision and language", credence: 0.80}
|
||||||
|
----
|
||||||
|
(3) [Orthogonality Prevents Catastrophic Forgetting]: Orthogonal constraints
|
||||||
|
bound deviation from pretrained weights, preventing the catastrophic
|
||||||
|
overwriting that plagues unconstrained methods at high learning rates.
|
||||||
|
{reason: "OFT converges on 5% data, ETHER is LR-robust, BOFT beats full FT on vision; consistent signal that bounded deviation helps. But strict orthogonality prevents magnitude changes, limiting it on tasks needing rescaling.", inference: 0.78}
|
||||||
|
+> [Natural Manifold]
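The angle-preservation property in premise (1) is easy to check numerically. The sketch below is illustrative only: it builds an orthogonal matrix with a Cayley transform, applies it to every neuron, and compares pairwise cosines before and after. The additive low-rank update is included purely as a contrast, and the names (`pairwise_cosines`, `angle_drift_additive`) are not taken from the OFT code.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 8                      # input dimension, number of neurons

# Stand-in for frozen pretrained weights: each column is one neuron.
W = rng.standard_normal((d, n))

# Cayley transform: a skew-symmetric Q yields an orthogonal R.
M = 0.1 * rng.standard_normal((d, d))
Q = M - M.T
I = np.eye(d)
R = (I + Q) @ np.linalg.inv(I - Q)


def pairwise_cosines(X):
    """Cosine similarity between every pair of columns of X."""
    unit = X / np.linalg.norm(X, axis=0, keepdims=True)
    return unit.T @ unit


# Orthogonal adaptation: the same rotation is applied to every neuron.
W_orth = R @ W
# Contrast: an unconstrained additive rank-2 update.
W_add = W + 0.5 * rng.standard_normal((d, 2)) @ rng.standard_normal((2, n))

orthogonality_error = np.abs(R.T @ R - I).max()
angle_drift_orth = np.abs(pairwise_cosines(W_orth) - pairwise_cosines(W)).max()
angle_drift_additive = np.abs(pairwise_cosines(W_add) - pairwise_cosines(W)).max()
```

The orthogonal update leaves every pairwise cosine unchanged up to floating-point error, while the additive update moves them by a visible amount.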
|
||||||
|
|
||||||
|
|
||||||
|
<Givens and Householder>
|
||||||
|
|
||||||
|
(1) [Minimal Orthogonal Params]: GOFT decomposes orthogonal transformations
|
||||||
|
into O(d) Givens rotations (planar 2D rotations), achieving the most
|
||||||
|
parameter-efficient orthogonal parameterization. #observation
|
||||||
|
[Ma et al. 2024](https://arxiv.org/abs/2404.04316)
|
||||||
|
[evidence](docs/goft_givens_orthogonal.md#L1-L44)
|
||||||
|
> We first use O(d) Givens rotations to accomplish arbitrary orthogonal transformation in SO(d) with provable equivalence, reducing parameter complexity from O(d^2) to O(d). Then we introduce flexible norm and relative angular adjustments under soft orthogonality regularization to enhance the adaptation capability of downstream semantic deviations. Extensive experiments on various tasks and pretrained models validate the effectiveness of our methods.
|
||||||
|
{reason: "ICML 2024; provable O(d) reduction from O(d^2); but advantage over BOFT is modest on benchmarks", credence: 0.70}
|
||||||
|
(2) [Orthogonal Equals Low-Rank]: HRA proves that a chain of r Householder
|
||||||
|
reflections is both orthogonal AND equivalent to a rank-r perturbation,
|
||||||
|
bridging the low-rank and orthogonal paradigms. #observation
|
||||||
|
[Yuan et al. 2024](https://arxiv.org/abs/2405.17484)
|
||||||
|
[evidence](docs/hra_householder_reflection.md#L1-L30)
|
||||||
|
> We bridge the gap between these two techniques, proposing a simple but effective adaptation method based on Householder reflections. Given a pre-trained model, our method fine-tunes its layers by multiplying each frozen weight matrix with an orthogonal matrix constructed by a chain of learnable Householder reflections (HRs). This HR-based orthogonal fine-tuning is equivalent to an adaptive low-rank adaptation. Moreover, we show that the orthogonality of the reflection planes corresponding to the HRs impacts the model capacity and regularity.
|
||||||
|
{reason: "theoretical bridge resolving 'low-rank vs orthogonal' dichotomy; merged into HuggingFace PEFT; empirical results competitive but not dominant", credence: 0.72}
|
||||||
|
----
|
||||||
|
(3) [Low-Rank Orthogonal Sweet Spot]: The effective adaptation is simultaneously
|
||||||
|
low-rank and approximately orthogonal -- small rotations on the Stiefel manifold.
|
||||||
|
{reason: "HRA's theoretical bridge is convincing; GOFT's practical efficiency confirms orthogonal methods can be made lean. But 'low-rank orthogonal' as the true operating point needs more direct testing.", inference: 0.65}
|
||||||
|
+> [Natural Manifold]
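Premise (2) can be checked directly: a product of r Householder reflections is orthogonal, and multiplying a frozen weight by it changes the weight by a matrix of rank at most r. The following is a small numerical sketch, not the HRA implementation; the helper `householder` and the sizes are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4


def householder(u):
    """Reflection across the hyperplane orthogonal to u."""
    u = u / np.linalg.norm(u)
    return np.eye(len(u)) - 2.0 * np.outer(u, u)


# Chain r reflections into a single orthogonal matrix.
H = np.eye(d)
for _ in range(r):
    H = H @ householder(rng.standard_normal(d))

W = rng.standard_normal((d, d))       # stand-in for a frozen weight
W_adapted = W @ H                     # multiplicative, orthogonal update
delta = W_adapted - W                 # the same update viewed additively

orth_error = np.abs(H.T @ H - np.eye(d)).max()
# Explicit tolerance so round-off is not counted as rank.
delta_rank = np.linalg.matrix_rank(delta, tol=1e-8)
```

Here `delta_rank` equals `r`, so the update is orthogonal as a multiplier and low-rank as a perturbation.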
|
||||||
|
|
||||||
|
|
||||||
|
<ETHER Bounded Deviation>
|
||||||
|
|
||||||
|
(1) [Fixed-Strength Rotations Work]: ETHER demonstrates that bounded orthogonal
|
||||||
|
transformations (close to identity) achieve competitive adaptation while
|
||||||
|
preventing catastrophic forgetting, with high learning-rate robustness. #observation
|
||||||
|
[Bini, Girrbach, Akata 2024](https://arxiv.org/abs/2405.20271)
|
||||||
|
[evidence](docs/ether_orthogonal_steering.md#L1-L30)
|
||||||
|
> We propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections. By design, ETHER transformations require a minimal number of parameters, are less likely to deteriorate model performance, and exhibit robustness to hyperparameter and learning rate choices. In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters (~10-100 times lower than LoRA or OFT) across multiple image synthesis and natural language tasks without exhaustive hyperparameter tuning.
|
||||||
|
{reason: "foundational for DeLoRA, which appeared at ICLR 2025; the bounded-deviation principle is well-validated; 10-100x fewer params than LoRA/OFT", credence: 0.72}
|
||||||
|
----
|
||||||
|
(2) [Minimal Intervention Often Suffices]: For behavioral steering, bounded
|
||||||
|
rotations suffice; for complex task adaptation, they hit a ceiling.
|
||||||
|
{reason: "ETHER works for many tasks but DeLoRA's extension (unbounding magnitude) does better, confirming the ceiling", inference: 0.70}
|
||||||
|
+> [Orthogonal Preservation]
|
||||||
|
|
||||||
|
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
// THEME 3: DIRECTION/MAGNITUDE DECOUPLING
|
||||||
|
// Methods: DoRA, DeLoRA, ROAD, AntiPaSTO, SSVD
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
|
||||||
|
# Decoupling
|
||||||
|
|
||||||
|
<Direction Magnitude Decoupling>
|
||||||
|
|
||||||
|
(1) [FT Decouples Naturally]: DoRA's weight decomposition analysis reveals
|
||||||
|
that full fine-tuning exhibits distinct magnitude vs. direction update
|
||||||
|
patterns that LoRA conflates. QDoRA slightly outperforms full FT. #observation
|
||||||
|
[Liu et al. 2024](https://arxiv.org/abs/2402.09353)
|
||||||
|
[evidence](docs/dora_weight_decomposed_lora.md#L1-L40)
|
||||||
|
> Drawing on Weight Normalization, which achieves faster convergence via improving the conditioning of the gradient with weight reparameterization, we introduce a novel weight decomposition analysis that initially reparameterizes model weights into magnitude and directional components, subsequently examining the changes in magnitude and direction introduced by LoRA and FT. **Our analysis reveals that LoRA and FT exhibit markedly distinct patterns of updates**, leading us to surmise that these variations mirror the learning capability of each method. (...) DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding.
|
||||||
|
{reason: "ICML 2024 (NVIDIA); thorough analysis + consistent improvements across 3 model families; widely adopted as LoRA replacement", credence: 0.85}
|
||||||
|
(2) [Bounded Deviation Robust]: DeLoRA normalizes each rank-1 component and
|
||||||
|
introduces scalar lambda controlling distance bound, yielding robustness
|
||||||
|
to LR and training duration where LoRA fails. #observation
|
||||||
|
[Bini, Girrbach, Akata 2025](https://arxiv.org/abs/2503.18225)
|
||||||
|
[evidence](docs/delora_decoupled_low_rank_adaptation.md#L1-L30)
|
||||||
|
> We propose Decoupled Low-rank Adaptation (DeLoRA), a novel finetuning method that normalizes and scales learnable low-rank matrices. **By bounding the distance of the transformation, DeLoRA effectively decouples the angular learning from the adaptation strength, enhancing robustness without compromising performance.** Through evaluations on subject-driven image generation, natural language understanding, and instruction tuning, we show that DeLoRA matches or surpasses performance of competing PEFT methods, while exhibiting stronger robustness.
|
||||||
|
{reason: "ICLR 2025; same group as ETHER so a coherent research line; robustness claim well-supported by ablations", credence: 0.82}
|
||||||
|
----
|
||||||
|
(3) [Decoupling Is Sound Engineering]: Separating what-to-change (direction)
|
||||||
|
from how-much (magnitude) consistently improves training stability and
|
||||||
|
often improves final performance.
|
||||||
|
{reason: "DoRA consistently beats LoRA, DeLoRA adds robustness, ROAD provides the cleanest per-pair formulation; 3 independent groups converge on this. Whether it's fundamental or just good optimization is unclear.", inference: 0.80}
|
||||||
|
+> [Natural Manifold]
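To make the magnitude/direction split in premise (1) concrete, here is a minimal sketch of a DoRA-style reparameterization, assuming the form `m * V / ||V||_c` with `V = W0 + B @ A` and a column-wise norm. Variable names and sizes are illustrative, not taken from the DoRA code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 12, 10, 2

W0 = rng.standard_normal((d_out, d_in))      # frozen pretrained weight


def column_norms(X):
    return np.linalg.norm(X, axis=0, keepdims=True)   # shape (1, d_in)


m = column_norms(W0).copy()                  # trainable magnitude, one per column
B = np.zeros((d_out, r))                     # LoRA-style init: B = 0
A = rng.standard_normal((r, d_in))


def dora_weight(m, B, A):
    V = W0 + B @ A                           # direction carrier
    return m * V / column_norms(V)           # unit direction times magnitude


W_init = dora_weight(m, B, A)

# Change only the magnitude: every column is rescaled, no column turns.
W_scaled = dora_weight(2.0 * m, B, A)

# Change only the direction: columns turn, their norms stay equal to m.
B_new = 0.3 * rng.standard_normal((d_out, r))
W_turned = dora_weight(m, B_new, A)
```

At initialization the reparameterized weight equals `W0`; after that, `m` and `(B, A)` control length and direction separately.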
|
||||||
|
|
||||||
|
|
||||||
|
<ROAD Explicit Decoupling>
|
||||||
|
|
||||||
|
(1) [Angle Plus Magnitude Per Pair]: ROAD parameterizes adaptation as learned
|
||||||
|
rotation angle theta plus magnitude alpha per activation pair, the cleanest
|
||||||
|
direction/strength separation. #observation
|
||||||
|
[Liao, Monz 2024](https://arxiv.org/abs/2409.00119)
|
||||||
|
[evidence](docs/road_rotary_adaptation.md#L1-L30)
|
||||||
|
> We introduce a novel method, RoAd, which employs a straightforward 2D rotation to adapt LLMs and addresses all the above challenges: (1) RoAd is remarkably parameter-efficient, delivering optimal performance on GLUE, eight commonsense reasoning tasks and four arithmetic reasoning tasks with < 0.1% trainable parameters; (2) RoAd facilitates the efficient serving of requests requiring different adapters within a batch, with an overhead comparable to element-wise multiplication instead of batch matrix multiplication; (3) RoAd enhances LLM's interpretability through integration within a framework of distributed interchange intervention, demonstrated via composition experiments.
|
||||||
|
{reason: "clean formulation connecting to RoPE; supports mixed-adapter batching which is unique; limited vision/generation benchmarks", credence: 0.62}
|
||||||
|
----
|
||||||
|
(2) [Continuous Steering Knob]: The explicit magnitude parameter provides
|
||||||
|
a continuous interpolation between identity and full adaptation, ideal
|
||||||
|
for controllable steering.
|
||||||
|
{reason: "mathematically clean; alpha as continuous knob is the most interpretable version of direction/strength decoupling", inference: 0.65}
|
||||||
|
+> [Direction Magnitude Decoupling]
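The per-pair rotation described above can be written with element-wise operations only. The sketch below assumes adjacent activations are paired and that each pair gets one angle `theta` and one scale `alpha`; the pairing scheme used by RoAd itself may differ, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
h = rng.standard_normal(d)                    # one hidden-state vector
theta = rng.uniform(-0.5, 0.5, d // 2)        # rotation angle per pair
alpha = rng.uniform(0.8, 1.2, d // 2)         # magnitude per pair


def road_elementwise(h, theta, alpha):
    """Rotate and scale each (even, odd) pair without forming a matrix."""
    x, y = h[0::2], h[1::2]
    out = np.empty_like(h)
    out[0::2] = alpha * (np.cos(theta) * x - np.sin(theta) * y)
    out[1::2] = alpha * (np.sin(theta) * x + np.cos(theta) * y)
    return out


def road_matrix(theta, alpha):
    """Equivalent block-diagonal matrix, for checking only."""
    n = 2 * len(theta)
    R = np.zeros((n, n))
    for i, (t, a) in enumerate(zip(theta, alpha)):
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = a * np.array(
            [[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]]
        )
    return R


z_fast = road_elementwise(h, theta, alpha)
z_dense = road_matrix(theta, alpha) @ h
z_identity = road_elementwise(h, np.zeros(d // 2), np.ones(d // 2))

pair_norm_in = np.linalg.norm(h.reshape(-1, 2), axis=1)
pair_norm_out = np.linalg.norm(z_fast.reshape(-1, 2), axis=1)
```

Setting `theta = 0` and `alpha = 1` gives the identity, and `alpha` alone determines how much each pair's norm changes.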
|
||||||
|
|
||||||
|
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
// THEME 4: GAIN CONTROL -- ADAPTATION AS RESCALING
|
||||||
|
// Methods: IA3, VeRA, LN Tuning
|
||||||
|
// Evidence AGAINST the full thesis: if mere rescaling works, the
|
||||||
|
// geometric structure may not be needed.
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
|
||||||
|
# Gain Control
|
||||||
|
|
||||||
|
<Gain Control Sufficiency>
|
||||||
|
|
||||||
|
(1) [Scaling Vectors Competitive]: IA3 achieves competitive adaptation using
|
||||||
|
only d-dimensional scaling vectors at key, value, and FFN positions --
|
||||||
|
no new weight directions needed. T0-3B with IA3 outperforms GPT-3 175B
|
||||||
|
with ICL. #observation
|
||||||
|
[Liu et al. 2022](https://arxiv.org/abs/2205.05638)
|
||||||
|
[evidence](docs/ia3_few_shot_peft.md#L1-L30)
|
||||||
|
> We introduce (IA)3 that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny amount of new parameters. (...) We validate the effectiveness of T-Few on completely unseen tasks by applying it to the RAFT benchmark, **attaining super-human performance for the first time and outperforming the state-of-the-art by 6% absolute.**
|
||||||
|
{reason: "strong result on RAFT; super-human is notable; but T5-family specific and scaling-only has known expressivity ceiling", credence: 0.72}
|
||||||
|
(2) [Random Projections Plus Scaling]: VeRA matches LoRA with 10x fewer
|
||||||
|
trainable parameters using only per-layer scaling of shared frozen
|
||||||
|
random matrices. #observation
|
||||||
|
[Kopiczko et al. 2023](https://arxiv.org/abs/2310.11454)
|
||||||
|
[evidence](docs/vera_vector_random_matrix_adaptation.md#L1-L30)
|
||||||
|
> We present Vector-based Random Matrix Adaptation (VeRA), which significantly reduces the number of trainable parameters compared to LoRA, yet maintains the same performance. **It achieves this by using a single pair of low-rank matrices shared across all layers and learning small scaling vectors instead.**
|
||||||
|
{reason: "ICLR 2024; surprising result that random projections suffice with per-layer gating; JL-lemma argument is plausible", credence: 0.75}
|
||||||
|
----
|
||||||
|
(3) [Rescaling Gets 60-80%]: A surprisingly large fraction of task adaptation
|
||||||
|
is reweighting existing features, suggesting the model already computes
|
||||||
|
the right features -- the bottleneck is channel selection.
|
||||||
|
{reason: "IA3 and VeRA perform well on standard benchmarks but both hit expressivity ceilings on hard tasks. This means rescaling handles easy adaptation but not deep restructuring.", inference: 0.45}
|
||||||
|
-> [Natural Manifold]
|
||||||
|
|
||||||
|
// The contra is weak (0.45) because gain control works on easy tasks
|
||||||
|
// but fails precisely where geometric structure matters most. The
|
||||||
|
// ceiling IS the evidence for the manifold hypothesis.
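The two rescaling schemes in this theme can be sketched side by side. This is an illustrative NumPy version, assuming IA3 multiplies an activation by a learned vector and VeRA uses the form `diag(b) @ B @ diag(d) @ A` with frozen random `A` and `B`; the initial values and sizes are example choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 16, 4
W0 = rng.standard_normal((d_out, d_in))       # frozen weight
x = rng.standard_normal(d_in)

# IA3-style: scale the output activation by a learned vector l.
l = 1.0 + 0.1 * rng.standard_normal(d_out)
ia3_out = l * (W0 @ x)
ia3_merged = (l[:, None] * W0) @ x            # same thing folded into the weight

# VeRA-style: frozen random A and B, trainable scaling vectors only.
A = rng.standard_normal((r, d_in))            # frozen, shareable across layers
B = rng.standard_normal((d_out, r))           # frozen, shareable across layers
d_vec = np.full(r, 0.1)                       # trainable
b_vec = np.zeros(d_out)                       # trainable, zero so the update starts at 0


def vera_delta(b_vec, d_vec):
    return (b_vec[:, None] * B) @ (d_vec[:, None] * A)


delta_at_init = vera_delta(b_vec, d_vec)
delta_trained = vera_delta(np.ones(d_out), d_vec)

vera_trainable = d_vec.size + b_vec.size      # r + d_out
lora_trainable = A.size + B.size              # r * d_in + d_out * r
```

Neither scheme introduces a new trainable direction: IA3 rescales existing channels, and VeRA rescales the rows and inner dimensions of a fixed random low-rank map.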
|
||||||
|
|
||||||
|
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
// THEME 5: RANK IS SECONDARY TO BASIS CHOICE
|
||||||
|
// Methods: LoRA, RandLoRA, C3A, FourierFT, LoHa, LoKr
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
|
||||||
|
# Rank Debate
|
||||||
|
|
||||||
|
<Rank Is Secondary>
|
||||||
|
|
||||||
|
(1) [Full Rank Helps Hard Tasks]: RandLoRA achieves full-rank updates by
|
||||||
|
summing scaled random rank-r bases, and its loss landscape is closer
|
||||||
|
to full FT's than LoRA's. #observation
|
||||||
|
[Albert et al. 2025](https://arxiv.org/abs/2502.00987)
|
||||||
|
[evidence](docs/randlora_random_matrix.md#L1-L30)
|
||||||
|
> This paper aims to answer this question by introducing RandLoRA, a parameter-efficient method that performs full-rank updates using a learned linear combinations of low-rank, non-trainable random matrices. (...) Our findings reveal that full-rank updates are beneficial across vision and language tasks individually, and even more so for vision-language tasks, where RandLoRA significantly reduces -- and sometimes eliminates -- the performance gap between standard fine-tuning and LoRA, demonstrating its efficacy.
|
||||||
|
{reason: "ICLR 2025; eliminates FT-LoRA gap on vision-language; loss landscape analysis compelling; improvement task-dependent", credence: 0.72}
|
||||||
|
(2) [Spectral Sparsity]: FourierFT achieves comparable results via sparse
|
||||||
|
Fourier coefficients, suggesting weight updates are spectrally smooth. #observation
|
||||||
|
[Gao et al. 2024](https://arxiv.org/abs/2405.03003)
|
||||||
|
[evidence](docs/fourierft_spectral.md#L1-L30)
|
||||||
|
> We introduce FourierFT, which treats DeltaW as a matrix in the spatial domain and learns only a small fraction of its spectral coefficients. With the trained spectral coefficients, we implement the inverse discrete Fourier transform to recover DeltaW. Empirically, our FourierFT method shows comparable or better performance with fewer parameters than LoRA on various tasks, including natural language understanding, natural language generation, instruction tuning, and image classification. For example, when performing instruction tuning on the LLaMA2-7B model, FourierFT surpasses LoRA with only 0.064M trainable parameters, compared to LoRA's 33.5M.
|
||||||
|
{reason: "ICML 2024; roughly 500x fewer params than LoRA on LLaMA2 instruction tuning (0.064M vs 33.5M); spectral sparsity is structurally different from SVD", credence: 0.68}
|
||||||
|
----
|
||||||
|
(3) [Basis Matters More Than Rank]: PiSSA's well-chosen rank-r subspace
|
||||||
|
often outperforms RandLoRA's full-rank random subspace, suggesting which
|
||||||
|
directions you update matters more than how many.
|
||||||
|
{reason: "PiSSA at rank r beats LoRA at same rank; RandLoRA at full rank sometimes beats PiSSA but not always. The basis selection (SVD vs random) appears to be the primary factor.", inference: 0.40}
|
||||||
|
-> [Natural Manifold]
|
||||||
|
|
||||||
|
// Weak contra (0.40): the rank debate slightly undermines the thesis
|
||||||
|
// because if full rank sometimes wins, the SVD manifold might not
|
||||||
|
// capture everything. But basis choice dominates rank choice.
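The claim in premise (1) that summing scaled random rank-r bases reaches full rank can be verified on a small example. The sketch below uses a simplified form, `sum_j B_j @ diag(s_j) @ A_j` with frozen random `B_j`, `A_j` and trainable `s_j`; the RandLoRA paper's exact parameterization may include further scaling terms, and the names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4
n_bases = d // r                               # enough rank-r bases to span rank d

# Frozen random bases; only the per-base scaling vectors are trainable.
bases = [
    (rng.standard_normal((d, r)), rng.standard_normal((r, d)))
    for _ in range(n_bases)
]
scales = [rng.uniform(0.5, 1.5, r) for _ in range(n_bases)]


def scaled_term(B, A, s):
    return B @ (s[:, None] * A)                # B @ diag(s) @ A


def randlora_delta(bases, scales):
    return sum(scaled_term(B, A, s) for (B, A), s in zip(bases, scales))


single_rank = np.linalg.matrix_rank(
    scaled_term(*bases[0], scales[0]), tol=1e-8
)
total_rank = np.linalg.matrix_rank(randlora_delta(bases, scales), tol=1e-8)
trainable = sum(s.size for s in scales)        # n_bases * r = d
```

Each term has rank `r`, the sum has rank `d`, and the number of trainable scalars stays at `d`. The update is full rank, but its basis is random rather than chosen from the weight's SVD, which is the comparison premise (3) draws.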
|
||||||
|
|
||||||
|
|
||||||
|
<Alternative Parameterizations>
|
||||||
|
|
||||||
|
(1) [Circulant Structure]: C3A achieves full-rank updates via circulant
|
||||||
|
matrices (FFT-computable), competitive on GLUE and instruction tuning. #observation
|
||||||
|
[Chen et al. 2025](https://arxiv.org/abs/2407.19342)
|
||||||
|
[evidence](docs/c3a_circular_convolution.md#L1-L30)
|
||||||
|
> We introduce Circular Convolution Adaptation (C3A), which not only achieves high-rank adaptation with enhanced performance but also excels in both computational power and memory utilization. **Extensive experiments demonstrate that C3A consistently outperforms LoRA and its variants across various fine-tuning tasks.** Our code is available at Hugging Face PEFT.
|
||||||
|
{reason: "ACL 2025; merged into HuggingFace PEFT; circulant assumption is a strong inductive bias; competitive but not dominant on all benchmarks", credence: 0.62}
|
||||||
|
(2) [Hadamard Products]: LoHa's element-wise product of two low-rank
|
||||||
|
decompositions captures pairwise feature interactions that additive
|
||||||
|
low-rank misses. #observation
|
||||||
|
[Nam, Moon, Oh 2022](https://arxiv.org/abs/2108.06098)
|
||||||
|
[evidence](docs/loha_hadamard_product.md#L1-L30)
|
||||||
|
> We propose a communication-efficient parameterization, FedPara, which re-parameterizes weight parameters of layers using low-rank weights followed by the Hadamard product. **Compared to the conventional low-rank parameterization, our FedPara method is not restricted to low-rank constraints, and thereby it has a far larger capacity.** This property enables to achieve comparable performance while requiring 3 to 10 times lower communication costs than the model with the original layers.
|
||||||
|
{reason: "ICLR 2022; effective for vision/diffusion via LyCORIS ecosystem; Hadamard product overcomes low-rank bottleneck; niche for LLMs", credence: 0.55}
|
||||||
|
(3) [Kronecker Products]: LoKr's tensor product structure enables compact
|
||||||
|
adaptation across convolutional and linear layers, with systematic
|
||||||
|
evaluation across fine-tuning methodologies. #observation
|
||||||
|
[Yeh et al. 2024](https://arxiv.org/abs/2309.14859)
|
||||||
|
[evidence](docs/lokr_lycor.md#L1-L30)
|
||||||
|
> This paper introduces LyCORIS (Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion), an open-source library that offers a wide selection of fine-tuning methodologies for Stable Diffusion. Furthermore, we present **a thorough framework for the systematic assessment of varied fine-tuning techniques.** This framework employs a diverse suite of metrics and delves into multiple facets of fine-tuning.
|
||||||
|
{reason: "systematic LyCORIS benchmark; Kronecker/spatial structure assumption less natural for attention weights; mainly validated on diffusion models", credence: 0.50}
|
||||||
|
----
|
||||||
|
(4) [Multiple Structures Work]: Multiple non-SVD parameterizations (circulant,
|
||||||
|
Hadamard, Kronecker, Fourier) achieve competitive results, suggesting
|
||||||
|
SVD is not the only viable basis.
|
||||||
|
{reason: "variety of working approaches, but none consistently beats SVD methods on controlled benchmarks; SVD has strongest cross-domain evidence", inference: 0.50}
|
||||||
|
-> [Natural Manifold]
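The three structured alternatives above each escape the rank-r limit in a different way, which a short numerical check makes visible. The sketch is illustrative only: sizes are arbitrary, and it checks the algebraic property each method relies on rather than reproducing the C3A, LoHa, or LoKr implementations.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4


def low_rank(n, k):
    """Random n x n matrix of rank k."""
    return rng.standard_normal((n, k)) @ rng.standard_normal((k, n))


# Hadamard (LoHa-style): element-wise product of two rank-r matrices.
hadamard_update = low_rank(d, r) * low_rank(d, r)
hadamard_rank = np.linalg.matrix_rank(hadamard_update, tol=1e-8)   # up to r * r

# Kronecker (LoKr-style): the rank is the product of the factor ranks.
kron_update = np.kron(rng.standard_normal((4, 4)), low_rank(8, 2))
kron_rank = np.linalg.matrix_rank(kron_update, tol=1e-8)           # 4 * 2

# Circulant (C3A-style): d parameters define a d x d matrix, applied via FFT.
c = rng.standard_normal(d)
x = rng.standard_normal(d)
C = np.stack([np.roll(c, k) for k in range(d)], axis=1)  # column k is c shifted by k
circ_dense = C @ x
circ_fft = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real
circ_rank = np.linalg.matrix_rank(C, tol=1e-8)
```

With these sizes the Hadamard update reaches rank 16 from two rank-4 factors, the Kronecker update reaches rank 8, and the circulant matrix is full rank while costing only `d` parameters and an FFT to apply.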
|
||||||
|
|
||||||
|
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
// THEME 6: FUNCTIONAL ARCHITECTURE
|
||||||
|
// Methods: CLOVER, ReFT, X-LoRA, Prompt Tuning
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
|
||||||
|
# Functional Architecture
|
||||||
|
|
||||||
|
<Functional Architecture>
|
||||||
|
|
||||||
|
(1) [Activation Intervention]: ReFT achieves 15-65x better parameter efficiency
|
||||||
|
than LoRA by intervening on hidden states at specific (layer, position)
|
||||||
|
sites rather than on weights. #observation
|
||||||
|
[Wu, Arora et al. 2024](https://arxiv.org/abs/2404.03592)
|
||||||
|
[evidence](docs/reft_representation_finetuning.md#L1-L40)
|
||||||
|
> ReFT methods operate on a frozen base model and learn task-specific interventions on hidden representations. We define a strong instance of the ReFT family, Low-rank Linear Subspace ReFT (LoReFT). (...) **LoReFT uses 15x-65x times fewer parameters while achieving state-of-the-art performance on commonsense reasoning, instruction-following, and natural language understanding** against the strongest PEFTs.
|
||||||
|
{reason: "Stanford NLP; widely cited; strong on commonsense reasoning and instruction following. Fundamentally different paradigm (activation vs weight), so the comparison is not apples-to-apples", credence: 0.72}
|
||||||
|
(2) [Compositional Routing]: X-LoRA dynamically routes through expert LoRA
|
||||||
|
adapters, achieving better composite task performance than individual
|
||||||
|
adapters. #assumption
|
||||||
|
{reason: "compositional multi-task; routing adds overhead; more suited to deployment than single-task science", credence: 0.50}
|
||||||
|
----
|
||||||
|
(3) [Beyond Individual Matrices]: Respecting the model's functional architecture
|
||||||
|
(which matrices cooperate, where activations matter) provides additional
|
||||||
|
efficiency gains beyond weight-matrix-level priors.
|
||||||
|
{reason: "CLOVER (joint SVD) beats per-matrix methods; ReFT (activation intervention) is 15-65x more parameter-efficient; both exploit structure LoRA ignores. But evidence is early -- 1-2 papers each.", inference: 0.60}
|
||||||
|
+> [Natural Manifold]
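The activation intervention in premise (1) can be sketched as a function on a single hidden state. This assumes the LoReFT form `h + R.T @ (W @ h + b - R @ h)` with `R` having orthonormal rows; the sizes and the choice of one position are illustrative, and this is not the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4

# R: r x d projection with orthonormal rows (built from a QR factorization).
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
R = Q.T
W = rng.standard_normal((r, d))               # learned linear map
b = rng.standard_normal(r)                    # learned bias


def loreft(h):
    """Edit h only inside the r-dimensional subspace spanned by the rows of R."""
    return h + R.T @ (W @ h + b - R @ h)


h = rng.standard_normal(d)                    # hidden state at one (layer, position) site
h_new = loreft(h)

edit = h_new - h
outside = edit - R.T @ (R @ edit)             # part of the edit outside the subspace
target = W @ h + b                            # value the subspace coordinates are set to
n_params = R.size + W.size + b.size           # 2 * r * d + r
```

The edit lies entirely in the row space of `R`, and the subspace coordinates of the new state equal `W @ h + b`. No weight matrix of the base model is touched.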
|
||||||
|
|
||||||
|
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
// THEME 7: SHARED STRUCTURE AND COMPRESSION
|
||||||
|
// Methods: MiSS, VBLoRA, SHiRA, AdaLoRA
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
|
||||||
|
# Compression
|
||||||
|
|
||||||
|
<Shared Structure Evidence>
|
||||||
|
|
||||||
|
(1) [Adaptive Rank Allocation]: AdaLoRA's SVD-parameterized pruning shows
|
||||||
|
that different layers need different ranks -- query/value projections
|
||||||
|
and early/late layers are most important. #observation
|
||||||
|
[Zhang et al. 2023](https://arxiv.org/abs/2303.10512)
|
||||||
|
[evidence](docs/adalora_adaptive_budget.md#L1-L30)
|
||||||
|
> We propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance score. In particular, AdaLoRA parameterizes the incremental updates in the form of singular value decomposition. Such a novel approach allows us to effectively prune the singular values of unimportant updates. (...) **Results demonstrate that AdaLoRA manifests notable improvement over baselines, especially in the low budget settings.**
|
||||||
|
{reason: "ICLR 2023; validates SVD parameterization and adaptive rank; gains over LoRA are modest and come with added complexity", credence: 0.70}
|
||||||
|
(2) [Vector Bank Compression]: VBLoRA uses 0.4% of LoRA's parameters via
|
||||||
|
shared atomic vectors, suggesting adapter diversity across layers is
|
||||||
|
much lower than assumed. #observation
|
||||||
|
[Li, Han, Ji 2024](https://arxiv.org/abs/2405.15179)
|
||||||
|
[evidence](docs/vblora_vector_bank.md#L1-L30)
|
||||||
|
> Our proposed VB-LoRA composites all the low-rank matrices of LoRA from a shared vector bank with a differentiable top-k admixture module. (...) When fine-tuning the Llama2-13B model, **VB-LoRA only uses 0.4% of LoRA's stored parameters, yet achieves superior results.** This method has been merged into the Hugging Face PEFT package.
|
||||||
|
{reason: "NeurIPS 2024; merged into HuggingFace PEFT; extreme compression is remarkable; implies shared structure across layers", credence: 0.72}
|
||||||
|
(3) [Sparse High-Rank]: SHiRA directly fine-tunes 1-2% of base weights
|
||||||
|
selected by importance, achieving full rank without low-rank
|
||||||
|
bottleneck. #observation
|
||||||
|
[Bhardwaj et al. 2024](https://arxiv.org/abs/2406.13175)
|
||||||
|
[evidence](docs/shira_sparse_high_rank.md#L1-L40)
|
||||||
|
> We propose SHiRA, a new high rank adapter paradigm to demonstrate that **changing as few as 1-2% parameters of the original network is sufficient for adaptation.** (...) We conduct extensive experiments on LLMs (LLaMA-7B, LLaMAv2-7B) and LVMs (Stable Diffusion, SDXL) where we demonstrate that SHiRA significantly outperforms LoRA on both single- and multi-adapter tasks. On LLMs, we show that SHiRA achieves up to 2.7% better accuracy than LoRA on commonsense reasoning.
|
||||||
|
{reason: "NeurIPS 2024; Qualcomm; multi-adapter fusion is unique angle; extensive LLM+LVM experiments; challenges low-rank assumption", credence: 0.62}
|
||||||
|
----
|
||||||
|
(4) [Compression Signals Shared Structure]: The extreme compressibility of
|
||||||
|
adapter weights (VBLoRA 0.4%, SVFT 0.006%) suggests weight updates
|
||||||
|
share structure that priors can exploit.
|
||||||
|
{reason: "multiple methods achieve extreme compression suggesting high redundancy in naive parameterizations; supports the thesis that the right basis captures this shared structure", inference: 0.60}
|
||||||
|
+> [Natural Manifold]
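Premise (3) says a sparse update can be high-rank even though it touches few weights. Here is a minimal sketch, assuming the trainable entries are picked by weight magnitude (one possible selection rule, used here only for illustration) and that about 2% of entries are selected.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
W = rng.standard_normal((d, d))               # stand-in for a frozen weight

# Select the top 2% of entries by magnitude as the trainable set.
k = int(0.02 * W.size)
idx = np.argpartition(np.abs(W).ravel(), -k)[-k:]
mask = np.zeros(W.size, dtype=bool)
mask[idx] = True
mask = mask.reshape(W.shape)

# A sparse update: only masked entries are allowed to change.
delta = np.where(mask, rng.standard_normal(W.shape), 0.0)

sparse_trainable = int(mask.sum())            # k entries
lora_trainable = 2 * d * r                    # rank-r LoRA on the same matrix
sparse_rank = np.linalg.matrix_rank(delta, tol=1e-8)
```

The sparse update uses far fewer trainable entries than a rank-8 LoRA on the same matrix, yet its rank is well above 8, because the selected entries are scattered across many rows and columns.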
|
||||||
|
|
||||||
|
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
// ADDITIONAL METHODS (Prompt Tuning, LN Tuning, Bone, Trainable Tokens)
|
||||||
|
// These are boundary cases that don't strongly support or oppose the thesis
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
|
||||||
|
# Boundary Cases
|
||||||
|
|
||||||
|
<Prompt Tuning Orthogonal Paradigm>
|
||||||
|
|
||||||
|
(1) [Input Not Weights]: Prompt/prefix tuning modifies the input embedding
|
||||||
|
space rather than weight matrices. At T5-XXL scale, it matches full FT
|
||||||
|
with 0.01% params. #assumption
|
||||||
|
{reason: "scales with model size; conceptually different from weight adaptation; when it works, it suggests the model already has the capability", credence: 0.65}
|
||||||
|
----
|
||||||
|
(2) [Capability Already Present]: When prompt tuning works, it reveals that
|
||||||
|
the target task is within the model's existing computational capacity --
|
||||||
|
no weight restructuring needed.
|
||||||
|
{reason: "informative boundary: prompt tuning success = no intervention needed; prompt tuning failure = genuine capability gap requiring weight changes", inference: 0.55}
|
||||||
|
+> [Natural Manifold]
|
||||||
|
|
||||||
|
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
// CONTRA: LINEARITY ASSUMPTION
|
||||||
|
// ══════════════════════════════════════════════════════════════════════
|
||||||
|
|
||||||
|
<SVD Is Linear Decomposition>
|
||||||
|
|
||||||
|
(1) [Nonlinear Structure Missed]: SVD is a linear decomposition of a nonlinear
|
||||||
|
system. It captures dominant linear subspaces but may miss nonlinear
|
||||||
|
interactions that matter for complex tasks. #assumption
|
||||||
|
{reason: "fundamental theoretical limitation; no PEFT method addresses this; but empirically SVD methods still win", credence: 0.75}
|
||||||
|
(2) [Checkpoint Specific]: SVD basis is model-specific and not transferable
|
||||||
|
across checkpoints without recomputation. #assumption
|
||||||
|
{reason: "practical limitation; PiSSA notes fast SVD takes only seconds, so cost is low", credence: 0.60}
|
||||||
|
----
|
||||||
|
(3) [Linearity Might Not Matter]: For the adaptations tested so far, linear
|
||||||
|
subspaces suffice. The nonlinear structure may only matter for
|
||||||
|
capabilities not yet probed.
|
||||||
|
{reason: "strong practical success despite theoretical limitation; but we haven't tested on sufficiently hard capability changes", inference: 0.45}
|
||||||
|
-> [Natural Manifold]
|
||||||
File diff suppressed because one or more lines are too long
@@ -0,0 +1,657 @@
Title: CLOVER: Cross-Layer Orthogonal Vectors Pruning and Fine-Tuning

URL Source: https://arxiv.org/pdf/2411.17426

Published Time: Mon, 03 Feb 2025 01:40:12 GMT

Number of Pages: 16

Markdown Content:
# CLOVER: Cross-Layer Orthogonal Vectors Pruning and Fine-Tuning

Fanxu Meng 1 2 Pingzhi Tang 1 Fan Jiang 1 Muhan Zhang 1 2

## Abstract

Decoder-only models generate tokens autoregressively by caching key/value vectors, but as the cache grows, inference becomes memory-bound. To address this issue, we introduce CLOVER (Cross-Layer Orthogonal Vectors), a novel approach that treats pairs of attention layers as a set of low-rank decompositions. CLOVER applies Singular Value Decomposition (SVD) to the Q-K and V-O pairs within each attention head. The resulting singular values can either guide pruning or serve as trainable parameters for efficient fine-tuning of all orthogonal vectors. After pruning or fine-tuning, these values are reintegrated into the model without increasing its parameter count. We apply CLOVER to various models, including GPT-2 XL, DeepSeek-V2-Lite, Whisper-Large-v3, Stable Diffusion XL, and LLaMA-3.2-11B-Vision. Our results demonstrate that CLOVER significantly improves pruning efficiency. For instance, the perplexity of pruning 70% of the Q-K pairs in GPT-2 XL is similar to that of pruning just 8% with vanilla methods. Fine-tuning the singular values further results in a full-rank update, outperforming state-of-the-art methods (LoRA, DoRA, HiRA, and PiSSA) by 7.6%, 5.5%, 3.8%, and 0.7%, respectively, on eight commonsense tasks for LLaMA-2 7B.

## 1. Introduction

In recent years, Large Language Models (LLMs) have rapidly evolved into essential tools for productivity (OpenAI, 2024; Anthropic, 2024; Team et al., 2024a). Open-source models (AI@Meta, 2024; Mistral, 2024; Qwen, 2024; Liu et al., 2024b; Team et al., 2024b; Abdin et al., 2024) have also narrowed the performance gap with closed-source models. The success of LLMs is largely attributed to Next Token Prediction (Radford, 2018; Brown et al., 2020), where tokens are predicted sequentially, with attention computed between each token and all preceding ones. To avoid redundant computations, key-value features are cached. However, as model size grows, the overhead of caching becomes substantial, leading to memory and communication bottlenecks.

[Figure 1 panels: (a) Multi-Head Attention; (b) CLOVER; (c) Pruning without Training (x-axis: Pruning Ratio, y-axis: Log of Perplexity; curves: Vanilla, CLOVER); (d) Fine-Tuning Pruned Model (y-axis: Perplexity; curves: Vanilla, CLOVER)]

Figure 1. (a) We treat the Query-Key and Value-Output layers within a single attention head as a unified structure. (b) Apply SVD to obtain two sets of singular vectors for initializing the Q-K and V-O layers, along with singular values that guide pruning or enable efficient full-rank fine-tuning. (c) This cross-layer orthogonalization strategy allows for higher pruning rates. (d) The pruned model maintains strong performance after fine-tuning.

> arXiv:2411.17426v3 [cs.LG] 31 Jan 2025

For instance, a 65B parameter model (Touvron et al., 2023) with 8-bit key-value quantization requires over 86GB of GPU memory to store 512K tokens, exceeding the capacity of a single H100-80GB GPU (Sun et al., 2024). To enable efficient training and inference, we introduce CLOVER (Cross-Layer Orthogonal Vectors), a novel method that orthogonalizes the Query, Key, Value, and Output vectors without generating additional transformation matrices. As shown in Figure 1a, we treat the Q-K and V-O pairs in each attention head as a low-rank decomposition of $W_{QK}$ and $W_{VO}$. By crossing these layers and performing SVD on $W_{QK}$ and $W_{VO}$, the Query, Key, Value, and Output vectors become orthogonal within each attention head. Figure 1b illustrates how the resulting singular values can guide pruning or serve as trainable parameters for efficient fine-tuning. After pruning or fine-tuning, these values can be reintegrated into the model without increasing its parameter count. Notably, previous methods, such as SVFT (Lingam et al., 2024), obtain orthogonal vectors by directly performing orthogonal decomposition on the matrix at each layer, which results in an accompanying transformation matrix, doubling the parameter count. In contrast, CLOVER treats the Q-K pairs as transformation matrices for each other, and similarly for the V-O pairs. CLOVER only generates a small set of singular values to guide pruning and fine-tuning, which can be merged back into the model without increasing inference costs.

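The memory figure quoted in the paragraph above can be sanity-checked with simple arithmetic. The sketch below is only an illustration: the layer count, number of key-value heads, and head dimension are assumptions (the text does not state them), chosen because they land close to the quoted number, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2 accounts for storing both a key and a value vector per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * n_tokens


# Assumed configuration (not given in the text): 80 layers, 8 key-value heads
# of dimension 128, 1-byte (8-bit) cache entries, 512K = 512 * 1024 tokens.
total = kv_cache_bytes(
    n_layers=80, n_kv_heads=8, head_dim=128,
    n_tokens=512 * 1024, bytes_per_value=1,
)
print(f"{total / 1e9:.1f} GB")     # decimal gigabytes
print(f"{total / 2**30:.1f} GiB")  # binary gibibytes
```

Under these assumed values the cache alone comes to roughly 86 GB (80 GiB), which already fills an 80GB device before weights and activations are counted. Caching more key-value heads per layer would scale the total proportionally.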
By orthogonalizing the vectors, we eliminate linear redundancy. Attention heads contain numerous non-zero norm vectors. Directly pruning these vectors would degrade performance, but orthogonalizing them allows us to represent the entire attention head’s space using a small set of orthogonal bases. The remaining vectors are nearly zero, making them safe to prune. As shown in Figure 1c, pruning an average of 45 vectors in the query-key pair using CLOVER results in a perplexity similar to that of vanilla pruning, which prunes only 5 vectors. Moreover, CLOVER generates a singular value matrix between the Q-K and V-O pairs. By updating this matrix during fine-tuning, CLOVER learns linear combinations of all orthogonal bases within each attention head. In contrast, PiSSA can only learn from a subset of orthogonal vectors, potentially causing some data projections to approach zero in those directions, leading to non-functional adapters during training. As shown in Figure 1d, fine-tuning a very small number of singular values can achieve performance close to that of fine-tuning all attention heads.

We summarize the contributions of our paper as follows:

- We treat the Q-K and V-O pairs in each attention head as low-rank approximations of $W_{QK}$ and $W_{VO}$. By performing SVD, we orthogonalize the attention head without adding extra transformation matrices.
- This orthogonalization reduces linear redundancy, is compatible with any pruning method, and allows for higher pruning ratios. Pruning 46.42% of the vectors in Whisper’s attention head preserves performance without requiring additional training.
- CLOVER enables efficient full-rank updates, surpassing SOTA methods such as LoRA, DoRA, HiRA, and PiSSA on eight commonsense reasoning tasks across LLaMA 7B/13B, LLaMA-2-7B, and LLaMA-3-8B, with additional analyses highlighting its advantages.

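The orthogonalization described above can be sketched for a single Q-K pair as follows. This is a minimal NumPy illustration rather than the authors' implementation: the sizes are made up, the weights are random, the variable names are hypothetical, and any positional encoding applied between queries and keys is ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 64, 8                        # assumed hidden size and head dimension
W_q = rng.standard_normal((D, d))   # one head's query slice
W_k = rng.standard_normal((D, d))   # one head's key slice

# Merge the pair: (X W_q)(X W_k)^T = X (W_q W_k^T) X^T, so W_qk has rank <= d.
W_qk = W_q @ W_k.T
U, S, Vt = np.linalg.svd(W_qk, full_matrices=False)
U, S, Vt = U[:, :d], S[:d], Vt[:d, :]   # only d singular values are non-zero

# Re-factor into mutually orthogonal query/key vectors of the original shapes.
W_q_new = U * S       # singular values folded into the query side
W_k_new = Vt.T

# Pruning: drop the directions with the smallest singular values.
keep = 6
W_q_pruned, W_k_pruned = W_q_new[:, :keep], W_k_new[:, :keep]
err = np.linalg.norm(W_qk - W_q_pruned @ W_k_pruned.T)
print("reconstruction error after pruning:", err)
```

Because `W_qk` has rank at most `d`, the re-factored pair keeps the shapes of the original slices, so no extra matrices are introduced. Holding `U` and `Vt` fixed and training only `S` would correspond to the singular-value fine-tuning the text describes.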
## 2. Related Work

LLM Compression. To mitigate the high memory demands of KV Caches in long-context models, several techniques have been proposed. These include reducing sequence length with linear attention (Katharopoulos et al., 2020; Wang et al., 2020; Peng et al., 2023; Gu & Dao, 2023; De et al., 2024), dynamic token pruning (Fu et al., 2024; Jo & Shin, 2024; Li et al., 2024b), compressing the key-value rank (Shazeer, 2019; Ainslie et al., 2023; Liu et al., 2024a; Yu et al., 2024), and pruning head dimensions (Ashkboos et al., 2024; Xia et al., 2023; Sun et al., 2023). Additional approaches include sharing key-value representations across layers (Sun et al., 2024; Brandon et al., 2024; Liu et al., 2024c; Zuhri et al., 2024) and quantizing KV cache weights and activations (Frantar et al., 2022; Dettmers et al., 2022; Xiao et al., 2023; Liu et al., 2024e; Hooper et al., 2024). Among them, structure pruning is hardware-friendly but can reduce performance when non-zero dimensions are removed (Ma et al., 2023). Fine-tuning can recover some of the lost performance, but it’s computationally expensive. To address this, Parameter Efficient Fine-Tuning (PEFT) methods are used (Guo et al., 2023).

Parameter Efficient Fine-Tuning. Several strategies have been introduced to minimize fine-tuning parameters while maintaining performance. These include low-rank adaptation (Hu et al., 2021), partial-parameter fine-tuning (Zaken et al., 2021; Lawton et al., 2023; Zhao et al., 2020; Sung et al., 2021; Ansell et al., 2021; Xu et al., 2021; Guo et al., 2020; Fu et al., 2023), soft prompt fine-tuning (Hambardzumyan et al., 2021; Lester et al., 2021; Li & Liang, 2021; Liu et al., 2023b; Vu et al., 2021; Asai et al., 2022; Wang et al., 2023), and sparse matrix fine-tuning (Qiu et al., 2023; Liu et al., 2023a; Yuan et al., 2024). Among these, LoRA is widely used due to its simplicity and effectiveness, with recent works enhancing it further (Zhang et al., 2023; Zi et al., 2023; Liu et al., 2024d; Zhao et al., 2024; Jiang et al., 2024). PiSSA (Meng et al., 2024) improves convergence speed by initializing adapters with principal singular values and vectors, also reducing quantization error (Wang et al., 2024a;b; Li et al., 2024a). However, PiSSA is limited by its use of a fixed set of orthogonal bases. SVFT (Lingam et al., 2024) directly applies Singular Value Decomposition (SVD) to the original matrix, but this increases the number of parameters, raising computational overhead and reducing efficiency. The CLOVER method addresses these issues by treating the Query-Key pairs in each attention head as low-rank matrices. Using orthogonal decomposition, CLOVER eliminates the need for additional transformation matrices. Instead, it leverages a small set of singular values to linearly combine orthogonal vectors, making the approach more parameter-efficient. After fine-tuning, the adapter can be smoothly reintegrated into the original matrix structure.

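The contrast drawn above, between adapting a fixed subset of singular directions and adjusting singular values across all directions, can be made concrete with a small NumPy sketch. The first half follows the commonly described PiSSA-style split (top-$r$ singular triplets, with the square root of each singular value on each factor). The second half is a generic singular-value update and is not claimed to be the exact parameterization of SVFT or CLOVER. Matrix sizes and the `delta_s` stand-in for a learned update are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 16))   # assumed frozen pretrained weight
r = 4                               # adapter rank

U, S, Vt = np.linalg.svd(W, full_matrices=False)

# PiSSA-style split: the principal part becomes the trainable adapter (A, B),
# and the remainder W_res stays frozen.
A = U[:, :r] * np.sqrt(S[:r])               # shape (32, r)
B = np.sqrt(S[:r])[:, None] * Vt[:r, :]     # shape (r, 16)
W_res = W - A @ B

# Singular-value tuning: keep U and Vt fixed, adjust one scalar per direction.
delta_s = 0.01 * rng.standard_normal(S.shape)   # stand-in for a learned update
W_tuned = (U * (S + delta_s)) @ Vt
update_rank = np.linalg.matrix_rank(W_tuned - W)
print("adapter rank:", np.linalg.matrix_rank(A @ B), "| update rank:", update_rank)
```

The adapter `A @ B` spans only `r` directions, while the singular-value update touches every direction using just `len(S)` trainable scalars, which is the sense in which the text calls such updates full-rank.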
## 3. CLOVER: Cross-Layer Orthogonal Vectors

Below is a step-by-step explanation of the CLOVER method, including why it can orthogonally decompose the Query, Key, Value, and Output layers in Multi-Head Attention without introducing any additional transformation matrix. We mainly use the computation of the Q-K pair as an example, then extend it to the V-O pair.

Multi-Head Self-Attention Setup. In a multi-head self-attention mechanism with $H$ heads, each head $h \in \{1, \dots, H\}$ computes an attention score as:

$$\mathrm{attn}(Q_h, K_h) = \mathrm{softmax}\left(\frac{Q_h K_h^\top}{\sqrt{d}}\right),$$

where $H$ is the number of attention heads, $d$ is the dimensionality of each head, $X \in \mathbb{R}^{n \times D}$ is the input matrix ($n$ is the sequence length, $D$ is the total hidden dimension), $Q_h, K_h \in \mathbb{R}^{n \times d}$ are the query and key representations for head $h$, and $W_Q, W_K \in \mathbb{R}^{D \times H \times d}$ are the weights for projecting the input $X$ into queries and keys. Specifically, the queries and keys for head $h$ are obtained by multiplying $X$ with the corresponding “slice” of $W_Q$ and $W_K$, respectively:

$$Q_h = X W_Q^{[:,h,:]}, \quad K_h = X W_K^{[:,h,:]}.$$

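The setup above maps directly onto array slicing. The following NumPy sketch computes the attention scores for one head using the shapes defined in the text; the concrete sizes, the random inputs, and the `softmax` helper are illustrative assumptions, and no causal mask is applied.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
n, D, H, d = 5, 16, 4, 4                 # assumed sizes for illustration
X = rng.standard_normal((n, D))          # input, shape (n, D)
W_Q = rng.standard_normal((D, H, d))     # query weights, shape (D, H, d)
W_K = rng.standard_normal((D, H, d))     # key weights, shape (D, H, d)

h = 2                                    # pick one head
Q_h = X @ W_Q[:, h, :]                   # (n, d)
K_h = X @ W_K[:, h, :]                   # (n, d)
scores = softmax(Q_h @ K_h.T / np.sqrt(d))   # (n, n), each row sums to 1

# The same logits follow from the per-head product of the two weight slices.
W_QK = W_Q[:, h, :] @ W_K[:, h, :].T     # (D, D), rank at most d
logits_merged = X @ W_QK @ X.T
print(np.allclose(logits_merged, Q_h @ K_h.T))
```

The last lines show why the two projections can be treated as a single matrix: the logits depend on the weight slices only through their product, which is the object the method decomposes.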
Cross Layers Merging. Substituting $Q_h$ and $K_h$ into $Q_h K_h^\top$, we have:

QhK⊤
|
||||||
|
|
||||||
|
> h
|
||||||
|
|
||||||
|
= X W [: ,h, :]
|
||||||
|
|
||||||
|
> Q
|
||||||
|
|
||||||
|
| ||||||