wip

2026-06-27 15:14:03 +08:00 · 2026-02-22 21:52:57 +08:00
parent 2e3c8165ce
commit 7dbe3faf19
10 changed files with 5556 additions and 110 deletions
@@ -0,0 +1,134 @@
 {
  "mode": "editor",
  "full_text_verified": true,
  "claims_checked": [
    {
      "claim": "- Probing finds representations that predict behavior, but high probe accuracy does not mean the mod",
      "number": "2022",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- CCS discovers latent knowledge but cannot intervene on it (Burns et al., 2022).",
      "number": "2022",
      "status": "number_found_in_paper"
    },
    {
      "claim": "I articulated this in my [AntiPaSTO paper](https://arxiv.org/abs/2601.07473), and GDM's interpretabi",
      "number": "2601.07473",
      "status": "number_found_in_paper"
    },
    {
      "claim": "I articulated this in my [AntiPaSTO paper](https://arxiv.org/abs/2601.07473), and GDM's interpretabi",
      "number": "3",
      "status": "number_found_in_paper"
    },
    {
      "claim": "I went through ~30 PEFT methods in HuggingFace PEFT and the broader literature. For each one I:",
      "number": "30",
      "status": "number_found_in_paper"
    },
    {
      "claim": "1. Extracted pseudocode for the forward pass (what the intervention actually does)",
      "number": "1.",
      "status": "number_found_in_paper"
    },
    {
      "claim": "2. Stated the hypothesis it encodes about transformer internals",
      "number": "2.",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **PiSSA** (NeurIPS 2024): Initialize LoRA from top-$r$ SVD of $W$, freeze the residual. Gemma-7B o",
      "number": "2024",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **PiSSA** (NeurIPS 2024): Initialize LoRA from top-$r$ SVD of $W$, freeze the residual. Gemma-7B o",
      "number": "7",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **SVFT**: Fix both singular vector sets from $W$'s SVD, learn only sparse coefficients. Recovers 9",
      "number": "96%",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **SSVD**: Rotate right singular vectors (Cayley transform), shift singular values, keep left singu",
      "number": "10",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **DoRA** (ICML 2024): Magnitude/direction decomposition of $W$. Consistently beats LoRA.",
      "number": "2024",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **DeLoRA** (ICLR 2025): Normalize each rank-1 component, introduce learnable scalar $\\lambda$. Bet",
      "number": "2025",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **DeLoRA** (ICLR 2025): Normalize each rank-1 component, introduce learnable scalar $\\lambda$. Bet",
      "number": "1",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **ROAD**: 2D rotary adaptation with explicit angle $\\theta$ and magnitude $\\alpha$.",
      "number": "2",
      "status": "number_found_in_paper"
    },
    {
      "claim": "**IA3** learns nothing but a per-channel scaling vector ($\\lambda \\in \\mathbb{R}^d$, initialized to ",
      "number": "3",
      "status": "number_found_in_paper"
    },
    {
      "claim": "**IA3** learns nothing but a per-channel scaling vector ($\\lambda \\in \\mathbb{R}^d$, initialized to ",
      "number": "1",
      "status": "number_found_in_paper"
    },
    {
      "claim": "**Conflict of interest disclosure:** the strongest OOD result in this catalog is my own work. [AntiP",
      "number": "2601.07473",
      "status": "number_found_in_paper"
    },
    {
      "claim": "**Conflict of interest disclosure:** the strongest OOD result in this catalog is my own work. [AntiP",
      "number": "1",
      "status": "number_found_in_paper"
    },
    {
      "claim": "I give AntiPaSTO the highest grade (\\*\\*!!!\\*\\*) because the OOD transfer is the kind of evidence th",
      "number": "4",
      "status": "number_found_in_paper"
    },
    {
      "claim": "**Decoupling family:** DoRA (magnitude/direction) -> ETHER (fixed-strength orthogonal) -> DeLoRA (no",
      "number": "1",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **Scale dependence.** Most of these results are on 1B-7B models. The geometry might change at 70B+",
      "number": "1",
      "status": "number_found_in_paper"
    },
    {
      "claim": "- **Scale dependence.** Most of these results are on 1B-7B models. The geometry might change at 70B+",
      "number": "7",
      "status": "number_found_in_paper"
    },
    {
      "claim": "The full catalog with pseudocode, evidence, and grades for 30 methods is at:",
      "number": "30",
      "status": "number_found_in_paper"
    }
  ],
  "missing_context": [
    "Paper limitations not discussed"
  ],
  "citations_found": [
    "- CCS discovers latent knowledge but cannot intervene on it (Burns et al., 2022)"
  ],
  "suggestions": [],
  "verdict": "needs_work"
 }
@@ -2,19 +2,22 @@ TASK write a new file, from the old part.
 ## Status: DONE
 ### Task 1: adapters_as_hypotheses.md
 - [x] Preamble with pragmatic interpretability framing
- [x] 30 entries (1-30) with pseudocode, hypothesis, evidence, grade
+- [x] 33 entries with pseudocode, hypothesis, evidence, grade
 - [x] All papers saved to docs/ (full size, no truncation)
- [x] URLs from gist_content.md included
+- [x] Sub-agent review completed, fixes applied
- [x] Sub-agent review completed, fixes applied:
+
-  - Fixed RandLoRA pseudocode (sum of scaled random bases, not single triple product)
+### Task 2: adapters_vargdown.argdown (NEW)
-  - Fixed authorship (AntiPaSTO is Clark, not Bini/Girrbach/Akata)
+- [x] Compiled evidence into vargdown (verified argdown) format
-  - Fixed SSVD grade (** not **!) and evidence ("matches" not "outperforms")
+- [x] 6 thematic argument groups: SVD basis, orthogonal, decoupling, gain control, rank, functional architecture
-  - Fixed OFT pseudocode (W @ R^T convention per paper)
+- [x] Main thesis: [Natural Manifold] -- SVD basis + orthogonal constraints define natural intervention manifold
-  - Fixed AntiPaSTO Cayley convention to show explicit /2
+- [x] ~20 observations with exact blockquotes from docs/ evidence files
-  - Added AntiPaSTO grade caveat (<=4B models, seed variance)
+- [x] ~10 assumptions for papers without frozen evidence
-  - Split Bone/Trainable Tokens into separate entries
+- [x] 3 contrary arguments (gain control, rank secondary, linearity)
-  - Fixed "Clark et al." -> "Clark"
+- [x] Pseudocode companion: adapters_pseudocode.md (20 methods in pseudopy format)
 - [x] Sub-agent review: fixed 5 critical (wrong evidence links, paraphrased quotes), 7 minor (orphans, credence calibration)
 - [x] All credences calibrated: reason first, no overconfidence on preprints
 First write also preamble explaining why we are interested, and this view, about a pragmatic search for effective views on internals (see https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability, and 
@@ -6,13 +6,15 @@
 We want to understand how transformers work. There are many approaches -- probing, ablation, SAEs -- but most of them *observe* rather than *intervene*. Probing finds representations that predict behavior, but high probe accuracy does not mean the model uses that representation ([Belinkov, 2022](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00254/43503)). CCS discovers latent knowledge but cannot intervene on it ([Burns et al., 2022](https://arxiv.org/abs/2212.03827)). Intervention shortcuts both problems: if modifying a representation reliably changes behavior, we have causal evidence of what we control (I argued this in [AntiPaSTO](https://arxiv.org/abs/2601.07473)).
 <!-- TODO is this all really relevant for the intro and audience? It seems long and not to the point; is the LessWrong one better? -->
 There is an underappreciated source of exactly this kind of causal evidence: the PEFT adapter literature.
-Each adapter constrains *how* you can update pretrained weights. When one adapter architecture outperforms another under controlled conditions -- same model, same data, same parameter budget -- the winning method's structural assumptions are supported as a better description of the weight manifold. This is a natural experiment running across hundreds of papers, and almost nobody reads it as science about representations.
+Each adapter constrains *how* you can update pretrained weights. When one adapter architecture outperforms another under controlled conditions -- same model, same data, same parameter budget -- the winner's structural assumptions get stronger support as a description of the weight manifold. This is a natural experiment running across many papers, and it is still underused as evidence about representations.
 GDM's interpretability team recently pivoted toward "pragmatic interpretability" -- directly solving problems on the critical path to AGI going well, grounded in proxy tasks with empirical feedback ([Nanda et al., 2025](https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability)). Adapter benchmarks are precisely this: empirical feedback on which structural assumptions about transformer internals hold up under intervention.
-If the adapter generalizes out-of-distribution, that tells us the geometric structure it exploits is *causally relevant* to behavior, not merely correlated. From my [AntiPaSTO paper](https://arxiv.org/abs/2601.07473):
+If an adapter generalizes out-of-distribution, that is stronger evidence that the geometric structure it exploits is causally relevant to behavior, not merely correlated. From my [AntiPaSTO paper](https://arxiv.org/abs/2601.07473):
 > Each adapter architecture encodes a claim about how to intervene in transformer internals. LoRA hypothesizes weight changes are low-rank. OFT hypothesizes orthogonal transformations preserve semantic structure. VeRA hypothesizes shared random projections plus learned scaling suffice. DeLoRA hypothesizes direction and magnitude should decouple. PiSSA hypothesizes principal components matter most. Our choice -- Cayley rotations of SVD singular vectors -- hypothesizes that the model's own learned basis defines the natural intervention manifold. Adapters that generalize out-of-distribution tell us which geometric structures are causally relevant to behavior, not merely correlated with it.
@@ -29,7 +31,7 @@ We grade evidence on independent dimensions. Each method gets points for the dim
 | BF  | 1.5 | Matches or beats full fine-tuning |
 | DE  | 1.5 | Data-efficient: faster convergence or works with less data |
 | OOD | 2   | Generalizes out-of-distribution |
-| WA  | 1   | Widely adopted: used as baseline by other papers |
+| WA  | 1   | Widely adopted: used as baseline by many other papers |
 Total = sum of applicable dimensions (max 8). Higher = stronger evidence that the method's structural hypothesis is correct.
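
The additive grading rule can be sketched in a few lines. The BF, DE, OOD, and WA weights come from the table above; the PE and BL weights of 1 each are an assumption inferred from the per-method grades in this catalog (for example PE+WA=2 and PE+BL+BF+WA=4.5) and from the stated maximum of 8. The `grade` helper is hypothetical, not part of any tooling in this repo.

```python
# Weight of each evidence dimension.
# BF, DE, OOD, WA are taken from the grading table.
# PE and BL are ASSUMED to be 1.0 each, inferred from the per-method grades.
WEIGHTS = {"PE": 1.0, "BL": 1.0, "BF": 1.5, "DE": 1.5, "OOD": 2.0, "WA": 1.0}

# Maximum possible grade: every dimension earned once.
MAX_GRADE = sum(WEIGHTS.values())


def grade(dimensions: str) -> float:
    """Sum the weights for a '+'-separated list of dimensions, e.g. 'PE+BL+BF'."""
    codes = [c.strip() for c in dimensions.split("+") if c.strip()]
    unknown = [c for c in codes if c not in WEIGHTS]
    if unknown:
        raise ValueError(f"unknown dimension(s): {unknown}")
    if len(set(codes)) != len(codes):
        raise ValueError("each dimension can be counted at most once")
    return sum(WEIGHTS[c] for c in codes)


if __name__ == "__main__":
    for name, dims in [("LoRA", "PE+WA"), ("DoRA", "PE+BL+BF+WA"), ("PSOFT", "PE+BL+DE")]:
        print(f"{name}: {dims} = {grade(dims)} / {MAX_GRADE}")
```

Keeping the weights in one table makes it easy to check that every grade string in the catalog sums to the number written next to it.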
@@ -52,7 +54,7 @@ def lora_forward(x, W, A, B, α, r):
    return (W + scaling * ΔW) @ x     # equivalently: W(x) + scaling * B(A(x))
 ```
-**Evidence:** Parameter-efficient (matches full FT with 0.01% params on GPT-3). Universal baseline adopted by the entire field. Authors demonstrate comparable performance to full fine-tuning on GPT-3 175B across multiple NLU benchmarks. Subsequent work ([Biderman et al. 2024](https://arxiv.org/abs/2405.09673)) finds LoRA underperforms full FT on harder tasks and larger scale -- the low-rank assumption holds for surface-level adaptation but breaks where deep weight restructuring is needed.
+**Evidence:** Parameter-efficient (matches full FT with 0.01% params on GPT-3). One of the most common baselines in PEFT. Authors demonstrate comparable performance to full fine-tuning on GPT-3 175B across multiple NLU benchmarks. Subsequent work ([Biderman et al. 2024](https://arxiv.org/abs/2405.09673), [saved](docs/biderman_lora_limitations.md)) finds LoRA underperforms full FT on harder tasks and larger scale -- the low-rank assumption holds for surface-level adaptation but weakens when deeper restructuring is needed.
 **Grade:** PE+WA=2 (parameter-efficient, universal baseline, but ceiling on hard tasks)
@@ -130,7 +132,7 @@ def dora_forward(x, W, A, B, m, α):
    return scale * (W @ x) + scale * α * (B @ A @ x)
 ```
-**Evidence:** Authors analyze full FT weight updates and find they exhibit distinct magnitude vs. direction patterns that LoRA misses. DoRA consistently outperforms LoRA on LLaMA (commonsense reasoning), LLaVA (visual instruction tuning), and VL-BART (image/video-text), across multiple scales. No additional inference overhead (magnitudes merge). Has become a widely-adopted LoRA variant and default in many pipelines.
+**Evidence:** Authors analyze full FT weight updates and find distinct magnitude vs. direction patterns that LoRA misses. DoRA outperforms LoRA on LLaMA (commonsense reasoning), LLaVA (visual instruction tuning), and VL-BART (image/video-text) in their reported setups. No additional inference overhead (magnitudes merge). It is now a common LoRA-family baseline in many recent papers.
 **Grade:** PE+BL+BF+WA=4.5 (beats LoRA across multiple domains, QDoRA slightly outperforms full FT on LLaMA2-7B/LLaMA3-8B, standard strong baseline)
@@ -224,7 +226,7 @@ def svft_forward(x, W, U, Vt, indices, c):
 The key: $\Delta W = \sum_{t} c_t \cdot u_{i_t} v_{j_t}^\top$, where $u_i, v_j$ come from $W$'s SVD. Only the $c_t$ scalars are learned. Different sparsity patterns (band-diagonal, random, etc.) give different expressivity/efficiency tradeoffs.
-**Evidence:** SVFT recovers up to 96% of full fine-tuning performance with only 0.006--0.25% of parameters, outperforming LoRA/DoRA/BOFT which only recover 85% with 0.03--0.8% of params. Strong results on language (GLUE, commonsense reasoning) and vision benchmarks. The weight-dependent structure is the key differentiator.
+**Evidence:** SVFT reports up to 96% of full fine-tuning performance with only 0.006--0.25% of parameters, outperforming LoRA/DoRA/BOFT ranges reported in the paper. Results are strong on language (GLUE, commonsense reasoning) and vision benchmarks. The weight-dependent structure is the key differentiator.
 **Grade:** PE+BL=2 (beats LoRA/DoRA on performance/parameter tradeoff, weight-aware structure)
@@ -670,7 +672,7 @@ def shira_forward(x, W, mask, ΔW_sparse):
    return W_adapted @ x
 ```
-**Evidence:** Authors claim SHiRA outperforms LoRA especially on concept loss when using multiple adapters (critical for diffusion model fine-tuning). Sparse adapters are cheaper to switch between than LoRA. NeurIPS 2024 Workshop. The importance-scoring approach connects to structured pruning literature.
+**Evidence:** Authors report SHiRA outperforms LoRA especially on concept-loss-sensitive multi-adapter settings (critical for diffusion model fine-tuning). Sparse adapters are cheaper to switch between than LoRA. NeurIPS 2024 Workshop. The importance-scoring approach connects to structured pruning literature.
 **Grade:** PE=1 (sparse high-rank, good multi-adapter properties, workshop paper)
@@ -814,7 +816,9 @@ def oftv2_forward(x, W, Q):
 ## 31. CLOVER -- Cross-Layer Joint SVD Adaptation
-**Paper:** [Tang et al. 2024](https://arxiv.org/abs/2411.17426)
+**Paper:** [Meng, Tang, Jiang, Zhang 2024](https://arxiv.org/abs/2411.17426)
 **Code:** [github.com/fanxu-meng/CLOVER](https://github.com/fanxu-meng/CLOVER)
 **Saved:** [docs/clover_joint_svd.md](docs/clover_joint_svd.md)
 **Hypothesis:** Attention layers have *cross-layer redundancy* in their SVD structure. Rather than adapting Q, K, V, O projections independently, CLOVER performs joint SVD across paired attention matrices (Q-K and V-O), exploiting the shared singular subspace between layers that cooperate functionally. The weight matrices within a head are not independent -- they jointly define the attention computation, so their adaptation should be coupled.
@@ -841,7 +845,7 @@ def clover_forward(x, W_q, W_k, W_v, W_o, adapters):
    return attention(x, W_q + ΔW_q, W_k + ΔW_k, W_v + ΔW_v, W_o + ΔW_o)
 ```
-**Evidence:** Validated on SDXL (image generation), LLaMA-Vision (multimodal), and Whisper (speech). Beats LoRA by 7.6%, DoRA by 5.5%, PiSSA by 0.7% on average. The cross-layer coupling captures shared structure that per-matrix methods miss. The multimodal validation is notable -- works across text, vision, and audio modalities.
+**Evidence:** Authors report validation on SDXL (image generation), LLaMA-Vision (multimodal), and Whisper (speech), with average gains over LoRA (+7.6%), DoRA (+5.5%), and PiSSA (+0.7%) in their setup. The cross-layer coupling claim is plausible because Q-K and V-O are functionally paired. This result is strong but still from a single research group.
 **Grade:** PE+BL+BF=3.5 (beats LoRA and DoRA significantly; validated across 3 modalities)
@@ -851,7 +855,8 @@ def clover_forward(x, W_q, W_k, W_v, W_o, adapters):
 ## 32. PSOFT -- Principal Subspace Orthogonal Fine-Tuning
-**Paper:** [2025](https://arxiv.org/abs/2505.11235)
+**Paper:** [Wu et al. 2026](https://arxiv.org/abs/2505.11235)
 **Saved:** [docs/psoft_principal_subspace_oft.md](docs/psoft_principal_subspace_oft.md)
 **Hypothesis:** Combine PiSSA's SVD initialization with OFT's orthogonal constraint. After extracting the principal subspace via SVD, learn a Cayley rotation $R$ that operates *within* the frozen $U, V$ subspace. This is "OFT in SVD coordinates" -- preserving pairwise angles (OFT's insight) while working in the model's natural basis (PiSSA's insight). The rotation $R$ acts on the principal singular vectors, keeping the subspace orientation while rotating within it.
@@ -870,11 +875,11 @@ def psoft_forward(x, U_r, S_r, V_r, K, W_res):
    return (W_adapted + W_res) @ x                # + frozen residual
 ```
-**Evidence:** Authors claim 80% memory reduction vs standard OFT. Tested on 35 NLP and CV tasks. Combines the convergence benefits of SVD initialization (PiSSA) with the semantic preservation of orthogonal constraints (OFT). The Cayley parameterization ensures exact orthogonality without projection steps.
+**Evidence:** Authors report ~80% memory reduction vs OFT and broad evaluation across 35 NLP/CV tasks, while keeping performance competitive. This is a direct synthesis of PiSSA-style SVD initialization plus OFT-style Cayley-constrained rotations.
 **Grade:** PE+BL+DE=3.5 (memory-efficient, faster convergence from SVD init, beats LoRA on 35 tasks)
-*Implications:* PSOFT is the clearest synthesis of the SVD + orthogonal hypotheses. It shows that these two structural priors are *complementary*, not competing: SVD identifies *where* to intervene (principal subspace), orthogonality constrains *how* to intervene (rotations that preserve structure). The resulting method inherits both benefits. This supports the emerging picture that the "right" adapter is one that respects both the model's eigenbasis and the geometry of meaningful transformations within it.
+*Implications:* PSOFT combines two ideas that actually work: SVD tells you where to intervene (principal subspace), orthogonality constrains how (rotations that preserve structure). The method inherits benefits from both. This suggests the best adapters respect both the model's eigenbasis and the geometry of transformations within it.
 ---
@@ -882,6 +887,7 @@ def psoft_forward(x, U_r, S_r, V_r, K, W_res):
 **Paper:** [Wu, Arora, Wang et al. 2024](https://arxiv.org/abs/2404.03592)
 **Code:** [github.com/stanfordnlp/pyreft](https://github.com/stanfordnlp/pyreft)
 **Saved:** [docs/reft_representation_finetuning.md](docs/reft_representation_finetuning.md)
 **Hypothesis:** Adaptation should target *representations* (activations), not weights. Instead of modifying $W$, modify the hidden state $h$ at specific layers and positions via learned interventions. The model's weights are already fine; we just need to redirect its intermediate computations. This is the "activation steering" hypothesis taken to its limit: learn a linear intervention on hidden states at specific token positions.
@@ -901,7 +907,7 @@ def reft_forward(model, x, interventions):
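 The key: instead of $W' = W + \Delta W$, apply $h' = h + R^\top (W_\phi h + b - R h)$ at specific (layer, position) pairs, where $R$ is a low-rank projection with orthonormal rows and $W_\phi, b$ are a learned linear map and bias (the ReFT paper writes $W_\phi$ as $W$; it is not the layer weight). The intervention replaces the component of the hidden state inside the subspace spanned by the rows of $R$ with a learned affine function of $h$, and leaves the rest of $h$ untouched.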
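
A minimal numerical sketch of this kind of intervention, assuming the LoReFT form from the ReFT paper, $h' = h + R^\top (W_\phi h + b - R h)$, with $R$ having orthonormal rows. The variable names, shapes, and the QR-based initialization are illustrative assumptions and do not reproduce the `pyreft` API.

```python
import numpy as np


def loreft(h, R, W_phi, b):
    """Low-rank edit of a hidden state: h + R^T (W_phi h + b - R h).

    h:     (d,)   hidden state at one (layer, position) site
    R:     (r, d) projection with orthonormal rows
    W_phi: (r, d) learned linear map (assumed name)
    b:     (r,)   learned bias
    """
    return h + R.T @ (W_phi @ h + b - R @ h)


rng = np.random.default_rng(0)
d, r = 8, 2

# Orthonormal rows via QR (illustrative initialization only).
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
R = Q.T                                  # (r, d), so R @ R.T == I_r
W_phi = rng.standard_normal((r, d))
b = rng.standard_normal(r)
h = rng.standard_normal(d)

h_new = loreft(h, R, W_phi, b)

# Inside the subspace, the coordinates are overwritten with W_phi h + b.
in_subspace_ok = np.allclose(R @ h_new, W_phi @ h + b)

# The orthogonal complement of the subspace is left unchanged.
P_perp = np.eye(d) - R.T @ R
complement_ok = np.allclose(P_perp @ h_new, P_perp @ h)
```

The two checks make the design explicit: only `r` coordinates of the hidden state are rewritten, which is where the parameter savings come from.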
-**Evidence:** ReFT is 15-65x more parameter-efficient than LoRA (only modifying hidden states at a few positions). Stanford NLP group, widely cited. Strong on instruction following, commonsense reasoning, and math. The activation-space intervention paradigm is fundamentally different from weight adaptation -- it operates in representation space rather than parameter space.
+**Evidence:** Authors report 15-65x parameter savings vs LoRA by intervening on hidden states at selected layer-position sites. Reported gains cover instruction following, commonsense reasoning, and NLU tasks in their benchmark suite. Since this is activation-space adaptation, comparisons with weight-space adapters are informative but not perfectly apples-to-apples.
 **Grade:** PE+BL=2 (15-65x more parameter-efficient than LoRA, beats LoRA on multiple benchmarks, distinct paradigm)
@@ -911,101 +917,27 @@ The key: instead of $W' = W + \Delta W$, apply $h' = h + R^\top (R h + b - R h)$
 ## Themes: What the Evidence Tells Us
-Looking across all 33 methods, several structural hypotheses emerge as clusters. Each theme represents a point of view about *what transformers are*, supported by the methods that exploit it.
+Looking across all 33 methods, a coherent tentative story appears once benchmark noise is reduced. Many successful adapters make geometric bets: first choose coordinates that align with pretrained structure, then constrain updates so they do not destroy that structure, then control update strength explicitly.
-### Theme 1: SVD as the Natural Coordinate System
+A quick source-level pass over the paper texts helps anchor this interpretation. Direct "we hypothesize" style statements cluster into the same buckets used here: LoRA and RandLoRA for low-rank sufficiency limits; OFT and ETHER for orthogonality and preserved angular structure; DoRA and DeLoRA for direction-strength decoupling; IA3 for scaling-only adaptation; SHiRA and C3A for high-rank and structural alternatives; AntiPaSTO for SVD-coordinate intervention and OOD transfer. That clustering is not perfect, but it is strong enough to justify organizing the literature by theme rather than by year.
 <!-- TODO IS this meta statement needed for the audience or directed at them? we don't actually do clustering either -->
 **Methods:** PiSSA (5), SVFT (2), SSVD (3.5), CLOVER (3.5), PSOFT (3.5), AntiPaSTO (4.5), AdaLoRA (1)
-**Core claim:** The SVD of each weight matrix provides the right coordinate system for intervention. The left singular vectors ($U$) define the output feature space, the right singular vectors ($V$) define the input feature space, and the singular values ($\Sigma$) quantify relative importance.
+The strongest recurring signal is *basis choice*. SVD-aware methods such as PiSSA, SSVD, CLOVER, and PSOFT often beat random-basis baselines under similar budgets in reported setups. In practical terms, initializing in the model's singular-vector basis reduces the search problem. The optimizer starts in a subspace the model already uses. This is not proof that SVD is uniquely correct, but it is stronger evidence than a single benchmark win.
-**Evidence for:** PiSSA's faster convergence shows the principal components carry the signal -- initializing in SVD space skips the "re-discovery" phase that random-init LoRA wastes. SVFT recovers 96% of full FT performance by learning *only* coefficients over the model's own singular vectors. SSVD's convergence hierarchy (SSVD > PiSSA > DoRA > LoRA) holds across scales and strengthens with model size. CLOVER extends this to *joint* SVD across functionally coupled matrices (Q-K, V-O), gaining 7.6% over LoRA.
+*Orthogonal* methods add the next piece. OFT and BOFT show that bounded rotations can preserve useful behavior while still adapting to new tasks. The Cayley parameterization appears across OFT, SSVD, PSOFT, and AntiPaSTO because it keeps rotations orthogonal without repeated projection steps. Pure orthogonality can be too rigid when tasks need gain changes, so methods that pair rotations with magnitude control tend to perform better.
-**Evidence against / limitations:** SVD is a linear decomposition of a nonlinear system. It captures the dominant linear subspaces but may miss nonlinear structure. The SVD basis is also model-specific (not transferable across checkpoints without re-computing). No method has demonstrated that SVD coordinates are *uniquely* correct vs. other spectral decompositions (NMF, ICA, etc.).
+That leads to the *direction-versus-strength* split. DoRA, DeLoRA, ROAD, and AntiPaSTO all separate where to move in weight space from how far to move. In runs that report careful ablations, this split often improves stability and sometimes final accuracy. Whether this is a deep property of transformer computation or mainly an optimization advantage is still open.
-**Bottom line:** The strongest evidence cluster. 6 of the top-7 scoring methods use SVD. The model's own eigenbasis appears to be a better intervention basis than random (LoRA), per-element (IA3), or spectral (FourierFT) alternatives.
+A parallel thread is *gain control*. IA3, VeRA, and LN tuning show that a lot of adaptation comes from rescaling existing features instead of inventing new ones. This explains why tiny parameter budgets can work well on many tasks. It also clarifies where they fail: when tasks require genuinely new feature combinations, scaling-only methods plateau.
-### Theme 2: Orthogonal Transformations Preserve Structure
+The *rank* debate looks secondary once basis is accounted for. Full-rank updates can help on harder tasks, as RandLoRA and C3A suggest, but PiSSA and SVFT show that a good low-rank subspace can beat a poorly chosen full-rank update. In practice, "which subspace" matters more than "how many free directions".
-**Methods:** OFT (2.5), BOFT (4), GOFT (1), HRA (1), ETHER (2.5), OFTv2 (1), PSOFT (3.5), RoAd (1), AntiPaSTO (4.5)
+Finally, methods that respect *functional structure* are promising but early. CLOVER's joint treatment of Q-K and V-O pairs outperforms per-matrix updates, and ReFT shows that targeted activation interventions can be far more parameter-efficient than weight updates. Both suggest that treating transformers as computation graphs, not bags of matrices, is a productive direction.
-**Core claim:** Useful adaptation should preserve the pairwise angular relationships between neuron weight vectors. Orthogonal transformations ($R \in O(d)$) rotate the weight space without distorting it, maintaining the "hyperspherical energy" that encodes learned knowledge.
+### Overall picture
-**Evidence for:** OFT's data efficiency (converges with 5% of data) and training stability directly result from bounded deviation. BOFT beats full FT while preserving hyperspherical energy. ETHER's learning-rate robustness and fast convergence come from bounded distance to identity. The Cayley parameterization (used by OFT, SSVD, PSOFT, AntiPaSTO) guarantees exact orthogonality without projection steps.
+Across methods, the same pattern keeps repeating: adapters work best when they preserve pretrained structure and then move within it in controlled ways. SVD-aware coordinates identify high-signal directions, near-orthogonal transforms protect useful geometry, and explicit strength controls prevent overwriting. This is currently the strongest empirical pattern in the catalog. It does not settle causality by itself, but it narrows the search space and yields concrete, falsifiable predictions for mechanistic work.
 **Evidence against / limitations:** Pure orthogonality prevents magnitude changes, limiting adaptation when neuron importance needs rescaling. DoRA's analysis shows that full FT *does* change magnitudes significantly. Methods that relax strict orthogonality (ETHER+, DeLoRA) often outperform strict ones, suggesting orthogonality is a useful *bias* but not a hard constraint.
 **Bottom line:** Orthogonality is the right default constraint -- it prevents catastrophic forgetting and enables data-efficient adaptation. But the best methods (AntiPaSTO, PSOFT) combine orthogonal rotations with learned magnitude adjustments, suggesting "mostly orthogonal + small scaling" is the true operating point.
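
The Cayley parameterization mentioned above can be sketched as follows. This is a minimal illustration, assuming the convention $R = (I - A)^{-1}(I + A)$ for a skew-symmetric $A$; individual papers differ in sign and scaling conventions, so it is not any one method's exact formula. The matrix sizes and the small initialization scale are arbitrary choices.

```python
import numpy as np


def cayley(A):
    """Map a skew-symmetric matrix A to an orthogonal matrix (I - A)^{-1} (I + A)."""
    I = np.eye(A.shape[0])
    # I - A is always invertible for skew-symmetric A (its eigenvalues are 1 - i*t).
    return np.linalg.solve(I - A, I + A)


rng = np.random.default_rng(0)
d = 6

# Unconstrained trainable parameters M; skew-symmetry is imposed by construction.
M = 0.1 * rng.standard_normal((d, d))
A = M - M.T
R = cayley(A)

# Rotate a stand-in weight matrix and compare Gram matrices of its columns.
W = rng.standard_normal((d, d))
W_rot = R @ W

is_orthogonal = np.allclose(R.T @ R, np.eye(d))
angles_preserved = np.allclose(W_rot.T @ W_rot, W.T @ W)
```

Because orthogonality holds for every value of `M`, the optimizer can update `M` freely with no projection step, and all pairwise inner products between the columns of `W` are preserved exactly. Any magnitude change has to come from a separate scaling term.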
 ### Theme 3: Decoupling Magnitude from Direction
 **Methods:** DoRA (4.5), DeLoRA (3.5), ROAD (1), AntiPaSTO (4.5), SSVD (3.5)
 **Core claim:** Weight updates decompose into *what direction to change* (angular component) and *how much to change* (magnitude component). These should be learned separately because they serve different purposes: direction captures task-relevant feature selection, magnitude captures task-specific intensity.
 **Evidence for:** DoRA's weight decomposition analysis shows that full FT exhibits distinct magnitude vs. direction update patterns that LoRA conflates. DeLoRA's robustness to learning rate and training duration comes from explicitly bounding the magnitude while letting direction train freely. ROAD's $(\theta, \alpha)$ parameterization provides the cleanest per-pair decoupling.
 **Evidence against:** The direction/magnitude decomposition is not unique (many decompositions exist). The evidence is mostly about training stability, not about capturing a fundamental property of the model.
 **Bottom line:** Decoupling direction from magnitude is a reliable engineering principle that improves robustness and interpretability. Whether it reflects something deep about how transformers compute (gain modulation vs. selectivity tuning) or is simply good optimization practice remains open.
 ### Theme 4: Gain Control -- Adaptation as Rescaling
 **Methods:** IA3 (1), VeRA (1), LN Tuning (1), DoRA (4.5)
 **Core claim:** Much of task adaptation is reweighting what the model already computes. Learned scaling vectors, normalization adjustments, and magnitude modulation can achieve competitive adaptation without introducing new weight-space directions.
 **Evidence for:** IA3's success with just $d$-dimensional scaling vectors; VeRA's success with only per-layer scaling of frozen random matrices; LN Tuning's competitive results with only normalization parameters. These suggest that the pretrained model already computes most useful features -- the task bottleneck is *which ones to attend to*, not *what to compute*.
 **Evidence against:** All pure scaling methods hit a performance ceiling on complex tasks. IA3 struggles when novel feature combinations are required. This ceiling is informative: it tells us that adaptation is *partly* rescaling and *partly* restructuring, and the hard tasks require the latter.
 **Bottom line:** Gain control handles the "easy" adaptation (60-80% of the gap to full FT on standard benchmarks). The remaining gap requires structural changes (new directions, rotations). But the fraction of adaptation that's just rescaling is larger than most people assume.
 ### Theme 5: Low-Rank vs. Full-Rank -- Where the Ceiling Is
 **Methods:** LoRA (2), RandLoRA (2.5), C3A (1), LoHa (1), SHiRA (1)
 **Core claim:** LoRA's low-rank constraint ($\Delta W = BA$, $\text{rank} \leq r$) is both its strength (parameter efficiency) and its weakness (expressivity ceiling). Methods that achieve full-rank updates with few parameters either match or beat LoRA.
 **Evidence for:** RandLoRA shows that increasing LoRA's rank has diminishing returns, while full-rank random-basis combinations continue to improve. The loss landscape analysis shows RandLoRA reaches local minima closer to full FT. C3A achieves full-rank via circulant structure. LoHa achieves higher effective rank via Hadamard products.
 **Evidence against:** PiSSA and SVFT show that *which* rank-$r$ subspace you use matters more than having full rank. A well-chosen rank-$r$ subspace (SVD top components) outperforms a poorly chosen full-rank update. The rank bottleneck may be a red herring -- it's the *basis* that matters.
 **Bottom line:** Low rank suffices for easy tasks; full rank is needed for hard ones (vision-language, complex reasoning). But the SVD methods suggest that even for hard tasks, a good low-rank basis can recover most of the gain, making the rank debate secondary to the basis debate.
 ### Theme 6: Respecting Functional Architecture
 **Methods:** CLOVER (3.5), ReFT (2), Poly/X-LoRA (1)
 **Core claim:** Treating each weight matrix independently ignores the functional structure of attention. Q-K matrices jointly define the attention pattern; V-O jointly define value extraction. ReFT goes further: intervene on *activations* at specific (layer, position) sites rather than on weights at all.
 **Evidence for:** CLOVER's joint SVD across Q-K and V-O pairs beats LoRA by 5.7-9% and DoRA by 2.7-5.5% on the LLaMA variants tested, though its margin over PiSSA (itself an SVD method) is only 0.6-1%. ReFT's 15-65x parameter-efficiency gain over LoRA suggests that a few critical activation sites carry most of the task signal. Both methods exploit the insight that transformer computation has structure *beyond* individual weight matrices.
 **Evidence against:** CLOVER is one paper; the idea of coupling matrices within an attention head (Q with K, V with O) needs broader validation. ReFT operates in a fundamentally different space (activations vs. weights), so comparisons with weight-space methods are not fully apples-to-apples.
 **Bottom line:** Early but promising. The functional architecture of attention (which matrices cooperate) and the computational graph (where in the forward pass to intervene) are underexploited priors. As models grow, these priors should become *more* valuable because they reduce the search space along functionally meaningful dimensions.
 ### Summary Table
 | Theme | Core PoV | Top method | Score | Strength of evidence |
 |-------|----------|-----------|-------|---------------------|
 | SVD basis | Model's eigenbasis is the natural coordinate system | PiSSA | 5 | Strong: 6/7 top methods use SVD |
 | Orthogonality | Preserve angular structure, bound deviation | BOFT | 4 | Strong: data efficiency, training stability |
 | Mag/Dir decoupling | Direction = what; magnitude = how much | DoRA | 4.5 | Moderate: good engineering, unclear if fundamental |
 | Gain control | Adaptation is mostly rescaling | IA3 | 1 | Moderate: works until it doesn't |
 | Rank debate | Full rank > low rank for hard tasks | RandLoRA | 2.5 | Moderate: secondary to basis choice |
 | Functional architecture | Respect attention's internal structure | CLOVER | 3.5 | Early: one paper, promising signal |
 ### The Emerging Picture
 These themes are not independent. The top-scoring methods combine multiple themes:
 - **PiSSA** (5): SVD basis + data efficiency
 - **AntiPaSTO** (4.5): SVD basis + orthogonal rotation + mag/dir decoupling + OOD transfer
 - **DoRA** (4.5): Mag/dir decoupling + beats full FT + widely adopted
 - **BOFT** (4): Orthogonal + beats full FT + data efficiency
 The emerging consensus: the model's SVD eigenbasis defines the natural coordinate system for intervention. Within that basis, orthogonal rotations with small learned magnitude adjustments capture most useful adaptations. The direction/magnitude decomposition makes these adaptations robust and interpretable. When we additionally respect the functional architecture of attention (CLOVER) or intervene directly on activations (ReFT), further efficiency gains are possible.
 What remains unclear: whether these geometric priors capture something *fundamentally* true about how transformers compute, or whether they are "just" good inductive biases that happen to work well with current architectures. The answer matters for alignment: if the SVD basis encodes *causally relevant* computational structure, then adapter-based interventions are genuine interpretability tools; if it's just a convenient coordinate system, the causal claims don't follow.
 <!-- TODO: kind of weak. I'd rather make a prediction, state the strength of evidence, or say whether this changed my mind. -->
@@ -0,0 +1,244 @@
 # Adapter Pseudocode (pseudopy)
 Companion to [adapters_vargdown.argdown](adapters_vargdown.argdown).
 Each adapter's intervention as pseudopy: minimal, self-documenting, shapes in comments.
 ---
 ## 1. LoRA -- baseline
 ```
 # ── LoRA: low-rank weight update ──
 ΔW ← B @ A                          # ΔW ∈ ℝ^{m×n}, rank r
 y ← (W + (α/r) · ΔW) @ x           # W frozen; A, B learned
 ```
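A minimal NumPy sketch of the LoRA forward pass above. The initialization (`B` zero, `A` random) is an assumption not stated in the pseudocode, and the helper name `lora_forward` and the toy shapes are illustrative only.

```python
import numpy as np

def lora_forward(W, A, B, x, alpha, r):
    """y = (W + (alpha / r) * B @ A) @ x, without materializing the m x n update."""
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
m, n, r, alpha = 6, 5, 2, 4.0
W = rng.normal(size=(m, n))      # frozen pretrained weight
A = rng.normal(size=(r, n))      # learned; random init (assumed)
B = np.zeros((m, r))             # learned; zero init (assumed), so the update starts at 0
x = rng.normal(size=n)

y_init = lora_forward(W, A, B, x, alpha, r)

# Stand-in for a trained B: the resulting update has rank at most r.
B_trained = rng.normal(size=(m, r))
delta_W = (alpha / r) * B_trained @ A
y_trained = lora_forward(W, A, B_trained, x, alpha, r)
```

With `B` at zero the adapter is a no-op, and after training the merged weight `W + delta_W` gives the same output as the unmerged path.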
 ## 2. OFT -- orthogonal rotation
 ```
 # ── OFT: Cayley rotation of weight rows ──
 Q_skew ← skew(Q)                    # Q_skew ∈ ℝ^{b×k×k}, antisymmetric
 R ← (I + Q_skew)(I - Q_skew)⁻¹     # Cayley map → R ∈ O(k)
 R_full ← blkdiag(R₁, ..., R_b)     # R_full ∈ O(d)
 y ← (W @ R_fullᵀ) @ x              # rotate rows, preserve angles
 ```
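The angle-preservation claim can be checked numerically. The sketch below uses a single block (b = 1) for brevity and the same Cayley sign convention as the OFT pseudocode; the helper name `cayley` and the toy sizes are assumptions.

```python
import numpy as np

def cayley(Q_skew):
    """Cayley map (I + Q)(I - Q)^{-1}; orthogonal whenever Q is skew-symmetric."""
    I = np.eye(Q_skew.shape[0])
    return (I + Q_skew) @ np.linalg.inv(I - Q_skew)

rng = np.random.default_rng(0)
d, n_out = 4, 3
Q = rng.normal(size=(d, d))
Q_skew = 0.5 * (Q - Q.T)         # skew(Q): keep only the antisymmetric part
R = cayley(Q_skew)               # one block only; OFT stacks b such blocks

W = rng.normal(size=(n_out, d))  # rows are neurons
W_rot = W @ R.T                  # every row is rotated by the same R

gram_before = W @ W.T            # pairwise inner products between neurons
gram_after = W_rot @ W_rot.T
```

Because `R` is orthogonal, the Gram matrix of the rows (norms and pairwise angles) is unchanged by the adaptation.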
 ## 3. VeRA -- random projection + scaling
 ```
 # ── VeRA: shared random matrices, per-layer scaling ──
 # A ∈ ℝ^{r×n}, B ∈ ℝ^{m×r}: frozen random, shared ALL layers
 # λ_d ∈ ℝ^r, λ_b ∈ ℝ^m: learned per-layer
 ΔW ← diag(λ_b) @ B @ diag(λ_d) @ A  # ΔW ∈ ℝ^{m×n}
 y ← (W + ΔW) @ x
 ```
 ## 4. DoRA -- magnitude/direction decomposition
 ```
 # ── DoRA: decouple ‖W‖ from W/‖W‖ ──
 # m ∈ ℝ^{m}: learned magnitude per output neuron
 V ← W + α · B @ A                   # updated direction
 V̂ ← V / ‖V‖_col                     # unit direction; the norm ‖V‖_col is .detach()ed (constant in backward)
 y ← (m · V̂) @ x                     # magnitude × direction
 ```
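A small NumPy sketch of the decomposition, under two assumptions: the norm is taken per output neuron (per row of `W`), following the comment in the pseudocode, and the magnitude is initialized to the pretrained row norms. The gradient detach is not modeled here, and `dora_weight` is an illustrative name.

```python
import numpy as np

def dora_weight(W, A, B, mag, alpha=1.0):
    """Effective weight: learned magnitude times unit-norm direction, per output row."""
    V = W + alpha * (B @ A)                               # unnormalized direction
    row_norm = np.linalg.norm(V, axis=1, keepdims=True)   # one norm per output neuron
    return mag[:, None] * (V / row_norm)

rng = np.random.default_rng(0)
m_out, n_in, r = 5, 4, 2
W = rng.normal(size=(m_out, n_in))
A = rng.normal(size=(r, n_in))
B = np.zeros((m_out, r))
mag = np.linalg.norm(W, axis=1)          # assumed init: magnitude = pretrained row norm

W_init = dora_weight(W, A, B, mag)       # identical to W at initialization
W_scaled = dora_weight(W, A, B, 2 * mag) # magnitude changes, direction does not

B_trained = rng.normal(size=(m_out, r))  # stand-in for a trained direction update
W_trained = dora_weight(W, A, B_trained, mag)
```

Changing `mag` rescales rows without touching their direction, and changing `B` moves the direction while the row norms stay pinned to `mag`.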
 ## 5. DeLoRA -- bounded deviation
 ```
 # ── DeLoRA: normalize rank-1 components, scale by λ ──
 # A ∈ ℝ^{r×n}, B ∈ ℝ^{m×r}: learned
 # λ ∈ ℝ: learned scalar strength (bounds ‖ΔW‖)
 âᵢ ← ‖aᵢ‖, b̂ᵢ ← ‖bᵢ‖               # norms of row i of A and column i of B
 sᵢ ← (λ/r) / (âᵢ · b̂ᵢ)              # normalize each outer product
 ΔW ← B · diag(s) · A               # direction normalized, strength via λ
 y ← W @ x + ΔW @ x
 ```
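The "bounded deviation" property can be made concrete. This sketch assumes a single scalar λ per layer (the DeLoRA summary elsewhere in this commit describes a "scalar lambda") and uses an illustrative helper name, `delora_delta`.

```python
import numpy as np

def delora_delta(A, B, lam):
    """Delta W = (lam / r) * sum_i (b_i / |b_i|)(a_i^T / |a_i|)."""
    r = A.shape[0]
    a_norm = np.linalg.norm(A, axis=1)   # norm of each row of A, shape (r,)
    b_norm = np.linalg.norm(B, axis=0)   # norm of each column of B, shape (r,)
    s = (lam / r) / (a_norm * b_norm)
    return (B * s) @ A                   # same as B @ diag(s) @ A

rng = np.random.default_rng(0)
m, n, r, lam = 6, 5, 3, 0.5
A = rng.normal(size=(r, n))
B = rng.normal(size=(m, r))

delta = delora_delta(A, B, lam)
delta_rescaled = delora_delta(10.0 * A, 0.1 * B, lam)  # rescaling A, B has no effect
```

Each normalized outer product has Frobenius norm 1, so by the triangle inequality the update's Frobenius norm is at most λ regardless of how large `A` and `B` grow during training.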
 ## 6. PiSSA -- SVD initialization
 ```
 # ── PiSSA: init adapters from top-r SVD of W ──
 U, Σ, Vᵀ ← svd(W)                  # W ∈ ℝ^{m×n}
 A ← U_{:,:r} · √Σ_{:r}             # A ∈ ℝ^{m×r}, principal left
 B ← √Σ_{:r} · Vᵀ_{:r,:}           # B ∈ ℝ^{r×n}, principal right
 W_res ← U_{:,r:} · Σ_{r:} · Vᵀ_{r:,:}  # residual, frozen
 # ── forward (identical to LoRA) ──
 y ← (W_res + A @ B) @ x
 ```
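A NumPy sketch of the initialization, using the same naming as the pseudocode (`A` is the left factor, `B` the right). The helper name `pissa_init` and the toy matrix are assumptions for illustration.

```python
import numpy as np

def pissa_init(W, r):
    """Split W into a trainable rank-r principal part A @ B and a frozen residual."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_S = np.sqrt(S[:r])
    A = U[:, :r] * sqrt_S                   # (m, r): scale columns of U
    B = sqrt_S[:, None] * Vt[:r, :]         # (r, n): scale rows of V^T
    W_res = (U[:, r:] * S[r:]) @ Vt[r:, :]  # remaining components, frozen
    return A, B, W_res

rng = np.random.default_rng(0)
m, n, r = 8, 6, 2
W = rng.normal(size=(m, n))
A, B, W_res = pissa_init(W, r)

W_reconstructed = W_res + A @ B
singular_values = np.linalg.svd(W, compute_uv=False)
```

At initialization the model is unchanged (`W_res + A @ B` equals `W`), and the frozen residual's largest singular value is the (r+1)-th singular value of `W`, so the trainable factors hold exactly the top-r components.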
 ## 7. SVFT -- SVD coefficient tuning
 ```
 # ── SVFT: learn sparse coefficients over W's own singular vectors ──
 U, Σ, Vᵀ ← svd(W)                  # frozen
 # select k sparse (iₜ, jₜ) pairs, t = 1..k; c ∈ ℝ^k learned
 ΔW ← Σₜ cₜ · u_{iₜ} v_{jₜ}ᵀ         # sparse combo of outer products
 y ← (W + ΔW) @ x
 ```
 ## 8. SSVD -- asymmetric SVD rotation
 ```
 # ── SSVD: rotate right singular vectors, preserve left ──
 U, Σ, Vᵀ ← svd(W)                  # all frozen
 K ← skew(θ)                         # K ∈ ℝ^{k×k}, learned antisymmetric
 G ← (I - K)(I + K)⁻¹               # Cayley → G ∈ O(k)
 Σ̂ ← Σ; Σ̂_{:k} += ΔΣ               # shift top-k singular values (learned)
 V̂ᵀ ← Vᵀ; V̂ᵀ_{:k} ← G @ Vᵀ_{:k}   # rotate input-space vectors only
 y ← U · diag(Σ̂) · V̂ᵀ @ x
 # params: k(k-1)/2 + k
 ```
 $$W' = U \, (\Sigma + \Delta\Sigma) \, G_k \, V^\top$$
 ## 9. IA3 -- activation scaling
 ```
 # ── IA3: element-wise scaling of activations ──
 # λ ∈ ℝ^d: learned, init=1
 y ← W @ (x ⊙ λ)                    # FFN: scale input channels
 y ← (W @ x) ⊙ λ                    # attn K,V: scale output channels
 ```
 ## 10. ROAD -- rotary adaptation
 ```
 # ── ROAD: 2D rotation + magnitude per activation pair ──
 # θ ∈ ℝ^{d/2}: learned rotation angles
 # α ∈ ℝ^{d/2}: learned magnitudes
 for i in range(d//2):
    x̂[2i]   ← α_i · (cos θ_i · x[2i] - sin θ_i · x[2i+1])
    x̂[2i+1] ← α_i · (sin θ_i · x[2i] + cos θ_i · x[2i+1])
 y ← W @ x̂                           # α=1, θ=0 → identity
 ```
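The loop above vectorizes naturally over even/odd channel pairs. A minimal NumPy sketch, with `road_transform` as an illustrative name and toy sizes chosen arbitrarily:

```python
import numpy as np

def road_transform(x, theta, alpha):
    """Rotate each (x[2i], x[2i+1]) pair by theta[i] and scale it by alpha[i]."""
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = alpha * (np.cos(theta) * x_even - np.sin(theta) * x_odd)
    out[1::2] = alpha * (np.sin(theta) * x_even + np.cos(theta) * x_odd)
    return out

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
theta = rng.uniform(-np.pi, np.pi, size=d // 2)
alpha = rng.uniform(0.5, 1.5, size=d // 2)

x_identity = road_transform(x, np.zeros(d // 2), np.ones(d // 2))
x_hat = road_transform(x, theta, alpha)

pair_norm_in = np.hypot(x[0::2], x[1::2])
pair_norm_out = np.hypot(x_hat[0::2], x_hat[1::2])
```

The rotation angle changes only the direction of each pair, while its length changes by exactly `alpha`, which is the direction/strength separation the method relies on.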
 ## 11. AntiPaSTO -- SVD Cayley steering
 ```
 # ── AntiPaSTO: Cayley rotation of SVD singular vectors ──
 U, Σ, Vᵀ ← svd(W)                  # frozen; computed once at init
 K ← skew(θ)                         # θ ∈ ℝ^{k(k-1)/2}, learned
 R ← (I - K)(I + K)⁻¹               # Cayley → R ∈ O(k)
 # ── rotate both U and V by same R ──
 Û ← U; Û_{:,:k} ← U_{:,:k} @ R    # rotate output-space
 V̂ᵀ ← Vᵀ; V̂ᵀ_{:k,:} ← R @ Vᵀ_{:k,:}  # rotate input-space
 W' ← Û · diag(Σ) · V̂ᵀ             # reconstruct
 y ← W' @ x
 # antiparallel: negate θ → opposite behavioral direction
 ```
 ## 12. AdaLoRA -- adaptive SVD rank
 ```
 # ── AdaLoRA: SVD-parameterized with importance pruning ──
 ΔW ← P · diag(Λ) · Q               # P ∈ ℝ^{m×r}, Q ∈ ℝ^{r×n}
 # importance score per singular value:
 s_i ← |Λ_i| + β · ‖pᵢ‖ · ‖qᵢ‖     # sensitivity-weighted
 # prune: zero out components with lowest s_i per budget
 mask ← topk(s, budget)
 ΔW ← P · diag(Λ ⊙ mask) · Q
 y ← (W + ΔW) @ x
 ```
 ## 13. BOFT -- butterfly orthogonal
 ```
 # ── BOFT: butterfly-factorized O(d log d) orthogonal ──
 # m butterfly factors B₁, ..., B_m each ∈ ℝ^{d/b × b × b}
 # each Bⱼ is block-diagonal of small orthogonal matrices
 R ← B_m @ ... @ B₂ @ B₁             # R ∈ O(d), O(d log d) params
 y ← (W @ Rᵀ) @ x
 ```
 ## 14. GOFT -- Givens rotations
 ```
 # ── GOFT: compose O(d) planar (Givens) rotations ──
 R ← I
 for (i,j,θ) in givens_pairs:         # each θ learned
    G ← I; G[i,i] ← cos θ; G[i,j] ← -sin θ
           G[j,i] ← sin θ; G[j,j] ← cos θ
    R ← G @ R                        # compose
 y ← (W @ Rᵀ) @ x                    # O(d) params
 ```
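A NumPy sketch of composing Givens rotations. The pair schedule here (d-1 adjacent planes) is purely illustrative and is not claimed to be GOFT's actual arrangement; `givens` and `compose_givens` are hypothetical helper names.

```python
import numpy as np

def givens(d, i, j, theta):
    """d x d identity except for a 2D rotation by theta in the (i, j) plane."""
    G = np.eye(d)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i], G[i, j] = c, -s
    G[j, i], G[j, j] = s, c
    return G

def compose_givens(d, givens_pairs):
    """Left-multiply the rotations in order, as in the pseudocode loop."""
    R = np.eye(d)
    for i, j, theta in givens_pairs:
        R = givens(d, i, j, theta) @ R
    return R

rng = np.random.default_rng(0)
d = 5
angles = rng.uniform(-np.pi, np.pi, size=d - 1)
givens_pairs = [(i, i + 1, angles[i]) for i in range(d - 1)]  # illustrative schedule

R = compose_givens(d, givens_pairs)
```

Every factor is a proper rotation, so the product stays orthogonal with determinant +1 while using one learned angle per rotation.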
 ## 15. HRA -- Householder reflection
 ```
 # ── HRA: chain of r Householder reflections = rank-r ∩ O(d) ──
 R ← I
 for i in range(r):
    vᵢ ← learned                     # v ∈ ℝ^d
    Hᵢ ← I - 2 · vᵢ vᵢᵀ / ‖vᵢ‖²   # Householder reflection
    R ← Hᵢ @ R
 y ← (W @ Rᵀ) @ x
 # bridges: rank-r perturbation ≡ r Householder reflections
 ```
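The "orthogonal and rank-r at once" claim can be checked directly. A minimal NumPy sketch, with `householder_chain` as an illustrative name and random vectors standing in for learned ones:

```python
import numpy as np

def householder_chain(V):
    """Product H_r ... H_1 of reflections H_i = I - 2 v_i v_i^T / |v_i|^2."""
    d = V.shape[1]
    R = np.eye(d)
    for v in V:
        H = np.eye(d) - 2.0 * np.outer(v, v) / (v @ v)
        R = H @ R
    return R

rng = np.random.default_rng(0)
d, r, n_out = 6, 2, 4
V = rng.normal(size=(r, d))          # stand-in for the learned reflection vectors
R = householder_chain(V)

W = rng.normal(size=(n_out, d))
delta_W = W @ R.T - W                # the perturbation applied to the frozen weight
```

`R` is orthogonal, yet `R - I` (and therefore `delta_W`) has rank at most r, which is the sense in which the chain bridges the orthogonal and low-rank views.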
 ## 16. RandLoRA -- full-rank via random bases
 ```
 # ── RandLoRA: sum of scaled random rank-r bases ──
 # A_i, B_i: frozen random matrices
 # d_i, b_i ∈ ℝ^r: learned per-component scaling
 ΔW ← Σᵢ (b_i ⊙ Bᵢ) @ (d_i ⊙ Aᵢ)  # full rank possible
 y ← (W + ΔW) @ x
 ```
 ## 17. FourierFT -- spectral coefficients
 ```
 # ── FourierFT: sparse Fourier coefficients ──
 # select k frequency indices; c ∈ ℝ^k learned
 ΔW ← iFFT2(scatter(c, indices, shape=(m,n)))
 y ← (W + ΔW) @ x
 ```
 ## 18. CLOVER -- joint SVD across attention pairs
 ```
 # ── CLOVER: joint SVD over Q-K and V-O pairs per head ──
 # ── init: decompose paired attention matrices ──
 W_QK ← W_Q @ W_K.T                  # W_QK ∈ ℝ^{d×d}, combined Q-K
 U_qk, S_qk, V_qk ← svd(W_QK)       # per-head SVD
 W_VO ← W_V @ W_O.T                  # W_VO ∈ ℝ^{d×d}, combined V-O
 U_vo, S_vo, V_vo ← svd(W_VO)
 # ── fine-tune: only ΔS is learned, U/V frozen (full-rank update via all directions) ──
 S_qk' ← S_qk + ΔS_qk               # learned shifts; S_vo' ← S_vo + ΔS_vo likewise
 # rewrite Q,K from shared orthogonal basis, using the shifted spectrum:
 W_Q' ← U_qk @ diag(√S_qk')         # Q uses left singular vectors
 W_K' ← V_qk @ diag(√S_qk')         # K uses right singular vectors
 y ← attn(W_Q' @ x, W_K' @ x, ...)
 # params: 2 × rank scalars per head (pruning: zero small S entries)
 ```
 ## 19. PSOFT -- principal subspace + Cayley
 ```
 # ── PSOFT: PiSSA init + OFT rotation in principal subspace ──
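 U, Σ, Vᵀ ← svd(W)
 # extract top-k subspace; the remaining components form a frozen residual
 W_res ← U_{:,k:} · Σ_{k:} · Vᵀ_{k:,:}
 # learn Cayley rotation R within that subspace only
 # ~80% memory reduction vs GOFT at equal parameter count
 R ← cayley(K)                        # K ∈ ℝ^{k×k}, skew-symmetric
 W' ← U_{:,:k} @ R @ diag(Σ_{:k}) @ Vᵀ_{:k,:} + W_res
 y ← W' @ x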
 ```
 ## 20. ReFT -- activation intervention
 ```
 # ── ReFT: intervene on hidden states at (layer, position) ──
 # R ∈ ℝ^{r×d}: learned, orthonormal rows (low-rank subspace)
 # W_p ∈ ℝ^{r×d}, b ∈ ℝ^r: learned linear projection and bias (LoReFT)
 h ← model.layer[l].output[pos]       # hidden state at site
 target ← W_p @ h + b                 # desired coordinates in the subspace
 h ← h + Rᵀ @ (target - R @ h)       # write back (preserve complement)
 ```
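A NumPy sketch of the low-rank intervention in its LoReFT form, h + Rᵀ(W_p h + b − R h), where `W_p` is a learned projection. The name `loreft`, the QR-based construction of `R`, and the toy sizes are assumptions; setting `W_p = R` reduces the edit to the bias-only shift h + Rᵀb.

```python
import numpy as np

def loreft(h, R, W_p, b):
    """Overwrite h's coordinates in the row space of R with W_p @ h + b."""
    return h + R.T @ (W_p @ h + b - R @ h)

rng = np.random.default_rng(0)
d, r = 8, 2
Q, _ = np.linalg.qr(rng.normal(size=(d, r)))  # orthonormal columns
R = Q.T                                        # (r, d) with orthonormal rows
W_p = rng.normal(size=(r, d))                  # stand-in for the learned projection
b = rng.normal(size=r)
h = rng.normal(size=d)

h_new = loreft(h, R, W_p, b)
h_bias_only = loreft(h, R, R, b)               # special case W_p = R

P_complement = np.eye(d) - R.T @ R             # projector onto the untouched directions
```

Inside the r-dimensional subspace the hidden state is set to the target `W_p @ h + b`, and everything orthogonal to that subspace passes through unchanged.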
 ---
 *Notation: `⊙` element-wise, `@` matmul, `←` assignment, `Σₜ` summation, `‖·‖` norm, `blkdiag` block diagonal, `skew` maps vector to skew-symmetric matrix, `cayley(K) = (I-K)(I+K)⁻¹` (the OFT block writes the equivalent form with `K → -K`).*
@@ -0,0 +1,468 @@
 ===
 title: Adapters as Representational Hypotheses -- Which Geometric Priors About Transformer Internals Hold Under Intervention?
 author: Compiled from 33 PEFT papers (2021--2025)
 model:
    mode: strict
 ===
 // ══════════════════════════════════════════════════════════════════════
 // MAIN THESIS
 // Each adapter constrains how you update pretrained weights. When one
 // architecture outperforms another under controlled conditions, the
 // winner's structural assumptions are supported as a better description
 // of the weight manifold. This is a natural experiment running across
 // hundreds of papers, and almost nobody reads it as science about
 // representations.
 //
 // Pseudocode for each adapter lives in adapters_pseudocode.md
 // Evidence files are in docs/ (frozen copies of papers as markdown)
 // ══════════════════════════════════════════════════════════════════════
 [Natural Manifold]: The model's own geometric structure -- its SVD basis
  combined with orthogonal constraints and direction/magnitude decoupling --
  defines the natural intervention manifold for transformer weight adaptation.
  Adapters exploiting this structure generalize better, converge faster, and
  reveal causally relevant properties of internals.
  + <SVD Basis Cluster>
  + <Orthogonal Preservation>
  + <Direction Magnitude Decoupling>
  + <Functional Architecture>
  + <Shared Structure Evidence>
  + <Prompt Tuning Orthogonal Paradigm>
  - <Gain Control Sufficiency>
  - <Rank Is Secondary>
  - <SVD Is Linear Decomposition>
 // ══════════════════════════════════════════════════════════════════════
 // THEME 1: SVD AS THE NATURAL COORDINATE SYSTEM
 // Methods: PiSSA, SVFT, SSVD, CLOVER, AntiPaSTO, AdaLoRA, PSOFT
 // ══════════════════════════════════════════════════════════════════════
 # SVD Basis
 <SVD Basis Cluster>
 (1) [SVD Init Wins]: PiSSA's SVD initialization consistently outperforms
    LoRA's random initialization across 11 models and 13 tasks, showing
    the principal components carry the adaptation signal. #observation
    [Meng et al. 2024](https://arxiv.org/abs/2404.02948)
    [evidence](docs/pissa_principal_singular_values_adaptation.md#L1-L20)
    > PiSSA shares the same architecture as LoRA, but initializes the adaptor matrices A and B with the principal components of the original matrix W, and put the remaining components into a residual matrix W res which is frozen during fine-tuning. Compared to LoRA, PiSSA updates the principal components while freezing the "residual" parts, allowing faster convergence and enhanced performance. Comparative experiments of PiSSA and LoRA across 11 different models, ranging from 184M to 70B, encompassing 5 NLG and 8 NLU tasks, reveal that **PiSSA consistently outperforms LoRA under identical experimental setups.** On the GSM8K benchmark, Gemma-7B fine-tuned with PiSSA achieves an accuracy of 77.7%, surpassing LoRA's 74.53% by 3.25%.
    {reason: "NeurIPS 2024, 11 models up to 70B, controlled comparisons with identical architecture (same as LoRA)", credence: 0.88}
 (2) [SVD Coefficients Suffice]: SVFT recovers 96% of full FT performance
    by learning only sparse coefficients over the model's own singular
    vectors -- no new directions needed. #observation
    [Lingam et al. 2024](https://arxiv.org/abs/2405.19597)
    [evidence](docs/svft_svd_coefficient_finetuning.md#L1-L30)
    > We propose SVFT, a simple approach that fundamentally differs from existing methods: the structure imposed on delta-W depends on the specific weight matrix W. Specifically, SVFT updates W as a sparse combination of outer products of its singular vectors, training only the coefficients (scales) of these sparse combinations. (...) Extensive experiments on language and vision benchmarks show that **SVFT recovers up to 96% of full fine-tuning performance while training only 0.006 to 0.25% of parameters, outperforming existing methods that only recover up to 85% performance** using 0.03 to 0.8% of the trainable parameter budget.
    {reason: "preprint under review; strong controlled comparison across language and vision; 96% of full FT with 0.006% params is striking but unpublished", credence: 0.74}
 (3) [Asymmetric SVD Works]: SSVD shows that selectively rotating input-space
    (right) singular vectors while preserving output-space (left) vectors
    matches LoRA with far fewer params, and a convergence hierarchy emerges
    at scale: SSVD > PiSSA > DoRA > LoRA. #observation
    [Wang, Watanabe, Van hamme 2025](https://arxiv.org/abs/2509.02830)
    [evidence](docs/ssvd_structured_svd_finetuning.md#L1-L40)
    > This work presents the first comprehensive integration and benchmarking of these PEFT methods within ESPnet. We further introduce structured SVD-guided (SSVD) fine-tuning, which selectively rotates input-associated right singular vectors while keeping output-associated vectors fixed to preserve semantic mappings. This design enables robust domain adaptation with minimal trainable parameters and improved efficiency. (...) **SSVD achieves comparable performance with significantly fewer trainable parameters and higher efficiency than LoRA and SoTA LoRA variants**, approaching fully fine-tuned model performance.
    {reason: "first comprehensive PEFT benchmark for speech across 0.1B-2B; convergence hierarchy consistent across scales", credence: 0.80}
 ----
 (4) [SVD Is Natural Basis]: The weight matrix's own SVD provides the right
    coordinate system for adaptation -- 6 of the top 7 scoring methods use it.
    {reason: "PiSSA beats random init, SVFT shows coefficients alone suffice, SSVD hierarchy holds across scales; convergent evidence from independent groups", inference: 0.85}
  +> [Natural Manifold]
 <Joint SVD Evidence>
 (1) [Cross-Layer Coupling]: CLOVER's joint SVD across Q-K and V-O attention
    matrix pairs beats per-matrix LoRA by 7.6% on LLaMA-2-7B, validated
    across 4 LLaMA variants on 8 commonsense tasks. #observation
    [Meng, Tang, Jiang, Zhang 2024](https://arxiv.org/abs/2411.17426)
    [evidence](docs/clover_joint_svd.md#L258-L275)
    > CLOVER consistently outperforms all other methods across all models and tasks. Specifically, on LLaMA 7B, CLOVER outperforms LoRA, DoRA, and PiSSA by 9%, 5.3%, and 1%, respectively. On LLaMA 13B, CLOVER outperforms these methods by 5.7%, 4.7%, and 0.8%. On LLaMA-2-7B, CLOVER surpasses LoRA, DoRA, HiRA, and PiSSA by 7.6%, 5.5%, 3.8%, and 0.7%. Even on LLaMA-3-8B, with fewer trainable parameters, CLOVER outperforms by 7.1%, 2.7%, 1.2%, and 0.6%.
    {reason: "Under review (not yet published at venue); controlled comparison using DoRA's own hyperparams; 4 LLaMA variants x 8 tasks; beats PiSSA by 0.7% which is modest but consistent", credence: 0.72}
 ----
 (2) [Functional SVD Matters]: Attention matrices cooperate functionally (Q-K
    define pattern, V-O define extraction), and joint decomposition captures
    this coupling.
    {reason: "mechanistic intuition is sound; Q and K jointly specify attention weights so their updates should be coordinated. CLOVER paper also shows 82% of features project outside PiSSA's subspace, but CLOVER covers all directions. Single paper but multi-model validation.", inference: 0.72}
  +> [Natural Manifold]
 <PSOFT Synthesis>
 (1) [OFT in SVD Coords]: PSOFT combines PiSSA's SVD initialization with
    OFT's Cayley rotation, operating in the principal subspace. Beats LoRA
    on 35 NLP and CV tasks with ~80% memory reduction vs GOFT at equal
    parameter count. #observation
    [Wu et al. 2026](https://arxiv.org/abs/2505.11235)
    [evidence](docs/psoft_principal_subspace_oft.md#L1-L50)
    > PSOFT improves parameter and memory efficiency without compromising performance. Although GOFT and PSOFT have the same parameter counts, **PSOFT reduces memory usage by about 80% and avoids OOM issues.** It further achieves up to an 18x improvement in parameter efficiency over BOFT, OFTv2, and LoRA variants, attaining the best average performance across all baselines with the lowest memory footprint.
    {reason: "ICLR 2026; 35 NLP+CV tasks across 4 models (DeBERTa, ViT, LLaMA-3.2-3B, LLaMA-3.1-8B); code available; bridges PiSSA and OFT with theoretical geometry guarantee", credence: 0.82}
 ----
 (2) [SVD Plus Orthogonal Synergy]: The best performing adapters combine
    SVD basis selection with orthogonal constraints, suggesting both priors
    are complementary.
    {reason: "PSOFT + AntiPaSTO + SSVD all combine SVD with Cayley rotations and all score well; convergent design choices", inference: 0.75}
  +> [Natural Manifold]
 <AntiPaSTO OOD Transfer>
 (1) [OOD Via SVD Rotation]: AntiPaSTO achieves out-of-distribution transfer
    from template persona pairs to real moral dilemmas (6.9x prompting baseline)
    using Cayley rotations of SVD singular vectors. #observation
    [Clark 2026](https://arxiv.org/abs/2601.07473)
    [evidence](docs/antipasto_antiparallel_steering.md#L1-L40)
    > Trained on 800 such pairs, our method transfers to 1,360 unseen moral dilemmas where honesty conflicts with other values, achieving **6.9x the Steering F1 of prompting** on Gemma-3-1B. We demonstrate two key advantages over prompting: OOD transfer (train on simple persona pairs, test on complex moral reasoning) and suppression bypass (steer when prompting triggers refusal).
    {reason: "single-author, one trait (honesty), models up to 4B, high seed variance (std ~5-7); but OOD transfer is the strongest possible test of causal relevance", credence: 0.55}
 (2) [SVD Causal Not Correlational]: If SVD rotation transfers OOD, the
    geometric structure it exploits is causally relevant to behavior,
    not merely correlated.
    {reason: "OOD transfer is strong evidence of causality (rules out surface pattern matching), but narrow: 1 trait, 1 eval benchmark, 1 author. Independent replication would raise this considerably.", inference: 0.65}
  +> [Natural Manifold]
 <LoRA Baseline Limitation>
 (1) [LoRA Ceiling]: Biderman et al. 2024 find LoRA underperforms full FT
    on harder tasks and larger scale, suggesting the low-rank assumption
    holds for surface adaptation but breaks where deep restructuring
    is needed. #observation
    [Biderman et al. 2024](https://arxiv.org/abs/2405.09673)
    [evidence](docs/biderman_lora_limitations.md#L1-L50)
    > Our results show that, in the standard low-rank settings, **LoRA substantially underperforms full finetuning.** Nevertheless, LoRA better maintains the base model's performance on tasks outside the target domain. (...) Finally, we show that full finetuning learns perturbations with a rank that is **10-100x greater than typical LoRA configurations**, possibly explaining some of the reported gaps.
    {reason: "TMLR 2024 (published, peer-reviewed); Llama-2-7B on code and math. In this local corpus, the paper is referenced in multiple docs (ROAD, DeLoRA, MiSS, CLOVER), mostly as supporting related-work evidence rather than direct replication.", credence: 0.80}
 ----
 (2) [Random Init Wastes Steps]: LoRA's random initialization works (universal
    baseline) but converges slower than SVD-informed init, suggesting random
    subspaces are not special -- just adequate.
    {reason: "PiSSA shows 3.25% gain on GSM8K just from better init; VeRA shows random projections work at all; together: random is adequate, SVD is better", inference: 0.70}
  +> [Natural Manifold]
 // ══════════════════════════════════════════════════════════════════════
 // THEME 2: ORTHOGONAL TRANSFORMATIONS PRESERVE STRUCTURE
 // Methods: OFT, BOFT, GOFT, HRA, ETHER, OFTv2, ROAD
 // ══════════════════════════════════════════════════════════════════════
 # Orthogonality
 <Orthogonal Preservation>
 (1) [Hyperspherical Energy Preserved]: OFT provably preserves pairwise
    neuron angles (hyperspherical energy) during adaptation, yielding
    stable finetuning and strong data efficiency (converges with 5%
    of training data on controllable generation). #observation
    [Qiu et al. 2023](https://arxiv.org/abs/2306.07280)
    [evidence](docs/oft_orthogonal_finetuning.md#L1-L40)
    > We introduce a principled finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image diffusion models to downstream tasks. Unlike existing methods, **OFT can provably preserve hyper-spherical energy which characterizes the pairwise neuron relationship on the unit hypersphere.** We find that this property is crucial for preserving the semantic generation ability of text-to-image diffusion models. (...) OFT is more sample-efficient in training and converges well with only 5% of the original dataset, while both ControlNet and LoRA cannot converge until 50% of the data is present.
    {reason: "NeurIPS 2023; provable preservation is rare among PEFT methods; data efficiency claim is strong but limited to vision generation", credence: 0.82}
 (2) [Butterfly Orthogonal Scales]: BOFT extends OFT with butterfly factorization,
    reducing parameters from O(d^2) to O(d log d) while matching or exceeding
    OFT performance. Outperforms full FT on vision tasks. #observation
    [Liu et al. 2023](https://arxiv.org/abs/2311.06243)
    [evidence](docs/boft_butterfly_orthogonal.md#L1-L30)
    > We propose an efficient orthogonal parameterization using butterfly structures. (...) By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. (...) **BOFT outperforms current state-of-the-art methods by a considerable margin**, validating its superior parameter-efficiency and generalization ability.
    {reason: "ICLR 2024; subsumes OFT theoretically; broad empirical validation across vision and language", credence: 0.80}
 ----
 (3) [Orthogonality Prevents Catastrophic Forgetting]: Orthogonal constraints
    bound deviation from pretrained weights, preventing the catastrophic
    overwriting that plagues unconstrained methods at high learning rates.
    {reason: "OFT converges on 5% data, ETHER is LR-robust, BOFT beats full FT on vision; consistent signal that bounded deviation helps. But strict orthogonality prevents magnitude changes, limiting it on tasks needing rescaling.", inference: 0.78}
  +> [Natural Manifold]
 <Givens and Householder>
 (1) [Minimal Orthogonal Params]: GOFT decomposes orthogonal transformations
    into O(d) Givens rotations (planar 2D rotations), achieving the most
    parameter-efficient orthogonal parameterization. #observation
    [Ma et al. 2024](https://arxiv.org/abs/2404.04316)
    [evidence](docs/goft_givens_orthogonal.md#L1-L44)
    > We first use O(d) Givens rotations to accomplish arbitrary orthogonal transformation in SO(d) with provable equivalence, reducing parameter complexity from O(d^2) to O(d). Then we introduce flexible norm and relative angular adjustments under soft orthogonality regularization to enhance the adaptation capability of downstream semantic deviations. Extensive experiments on various tasks and pretrained models validate the effectiveness of our methods.
    {reason: "ICML 2024; provable O(d) reduction from O(d^2); but advantage over BOFT is modest on benchmarks", credence: 0.70}
 (2) [Orthogonal Equals Low-Rank]: HRA proves that a chain of r Householder
    reflections is both orthogonal AND equivalent to a rank-r perturbation,
    bridging the low-rank and orthogonal paradigms. #observation
    [Yuan et al. 2024](https://arxiv.org/abs/2405.17484)
    [evidence](docs/hra_householder_reflection.md#L1-L30)
    > We bridge the gap between these two techniques, proposing a simple but effective adaptation method based on Householder reflections. Given a pre-trained model, our method fine-tunes its layers by multiplying each frozen weight matrix with an orthogonal matrix constructed by a chain of learnable Householder reflections (HRs). This HR-based orthogonal fine-tuning is equivalent to an adaptive low-rank adaptation. Moreover, we show that the orthogonality of the reflection planes corresponding to the HRs impacts the model capacity and regularity.
    {reason: "theoretical bridge resolving 'low-rank vs orthogonal' dichotomy; merged into HuggingFace PEFT; empirical results competitive but not dominant", credence: 0.72}
 ----
 (3) [Low-Rank Orthogonal Sweet Spot]: The effective adaptation is simultaneously
    low-rank and approximately orthogonal -- small rotations on the Stiefel manifold.
    {reason: "HRA's theoretical bridge is convincing; GOFT's practical efficiency confirms orthogonal methods can be made lean. But 'low-rank orthogonal' as the true operating point needs more direct testing.", inference: 0.65}
  +> [Natural Manifold]
 <ETHER Bounded Deviation>
 (1) [Fixed-Strength Rotations Work]: ETHER demonstrates that bounded orthogonal
    transformations (close to identity) achieve competitive adaptation while
    preventing catastrophic forgetting, with high learning-rate robustness. #observation
    [Bini, Girrbach, Akata 2024](https://arxiv.org/abs/2405.20271)
    [evidence](docs/ether_orthogonal_steering.md#L1-L30)
    > We propose the ETHER transformation family, which performs Efficient fineTuning via HypErplane Reflections. By design, ETHER transformations require a minimal number of parameters, are less likely to deteriorate model performance, and exhibit robustness to hyperparameter and learning rate choices. In particular, we introduce ETHER and its relaxation ETHER+, which match or outperform existing PEFT methods with significantly fewer parameters (~10-100 times lower than LoRA or OFT) across multiple image synthesis and natural language tasks without exhaustive hyperparameter tuning.
    {reason: "foundational for DeLoRA, which was accepted at ICLR 2025; the bounded-deviation principle is well-validated; 10-100x fewer params than LoRA/OFT", credence: 0.72}
 ----
 (2) [Minimal Intervention Often Suffices]: For behavioral steering, bounded
    rotations suffice; for complex task adaptation, they hit a ceiling.
    {reason: "ETHER works for many tasks but DeLoRA's extension (unbounding magnitude) does better, confirming the ceiling", inference: 0.70}
  +> [Orthogonal Preservation]
 // ══════════════════════════════════════════════════════════════════════
 // THEME 3: DIRECTION/MAGNITUDE DECOUPLING
 // Methods: DoRA, DeLoRA, ROAD, AntiPaSTO, SSVD
 // ══════════════════════════════════════════════════════════════════════
 # Decoupling
 <Direction Magnitude Decoupling>
 (1) [FT Decouples Naturally]: DoRA's weight decomposition analysis reveals
    that full fine-tuning exhibits distinct magnitude vs. direction update
    patterns that LoRA conflates. QDoRA slightly outperforms full FT. #observation
    [Liu et al. 2024](https://arxiv.org/abs/2402.09353)
    [evidence](docs/dora_weight_decomposed_lora.md#L1-L40)
    > Drawing on Weight Normalization, which achieves faster convergence via improving the conditioning of the gradient with weight reparameterization, we introduce a novel weight decomposition analysis that initially reparameterizes model weights into magnitude and directional components, subsequently examining the changes in magnitude and direction introduced by LoRA and FT. **Our analysis reveals that LoRA and FT exhibit markedly distinct patterns of updates**, leading us to surmise that these variations mirror the learning capability of each method. (...) DoRA consistently outperforms LoRA on fine-tuning LLaMA, LLaVA, and VL-BART on various downstream tasks, such as commonsense reasoning, visual instruction tuning, and image/video-text understanding.
    {reason: "ICML 2024 (NVIDIA); thorough analysis + consistent improvements across 3 model families; widely adopted as LoRA replacement", credence: 0.85}
 (2) [Bounded Deviation Robust]: DeLoRA normalizes each rank-1 component and
    introduces scalar lambda controlling distance bound, yielding robustness
    to LR and training duration where LoRA fails. #observation
    [Bini, Girrbach, Akata 2025](https://arxiv.org/abs/2503.18225)
    [evidence](docs/delora_decoupled_low_rank_adaptation.md#L1-L30)
    > We propose Decoupled Low-rank Adaptation (DeLoRA), a novel finetuning method that normalizes and scales learnable low-rank matrices. **By bounding the distance of the transformation, DeLoRA effectively decouples the angular learning from the adaptation strength, enhancing robustness without compromising performance.** Through evaluations on subject-driven image generation, natural language understanding, and instruction tuning, we show that DeLoRA matches or surpasses performance of competing PEFT methods, while exhibiting stronger robustness.
    {reason: "ICLR 2025; same group as ETHER so a coherent research line; robustness claim well-supported by ablations", credence: 0.82}
 ----
 (3) [Decoupling Is Sound Engineering]: Separating what-to-change (direction)
    from how-much (magnitude) consistently improves training stability and
    often improves final performance.
    {reason: "DoRA beats LoRA everywhere, DeLoRA adds robustness, ROAD provides cleanest per-pair formulation; 3 independent groups converge on this. Whether it's fundamental or just good optimization is unclear.", inference: 0.80}
  +> [Natural Manifold]
 <ROAD Explicit Decoupling>
 (1) [Angle Plus Magnitude Per Pair]: ROAD parameterizes adaptation as learned
    rotation angle theta plus magnitude alpha per activation pair, the cleanest
    direction/strength separation. #observation
    [Liao, Monz 2024](https://arxiv.org/abs/2409.00119)
    [evidence](docs/road_rotary_adaptation.md#L1-L30)
    > We introduce a novel method, RoAd, which employs a straightforward 2D rotation to adapt LLMs and addresses all the above challenges: (1) RoAd is remarkably parameter-efficient, delivering optimal performance on GLUE, eight commonsense reasoning tasks and four arithmetic reasoning tasks with < 0.1% trainable parameters; (2) RoAd facilitates the efficient serving of requests requiring different adapters within a batch, with an overhead comparable to element-wise multiplication instead of batch matrix multiplication; (3) RoAd enhances LLM's interpretability through integration within a framework of distributed interchange intervention, demonstrated via composition experiments.
    {reason: "clean formulation connecting to RoPE; supports mixed-adapter batching which is unique; limited vision/generation benchmarks", credence: 0.62}
 ----
 (2) [Continuous Steering Knob]: The explicit magnitude parameter provides
    a continuous interpolation between identity and full adaptation, ideal
    for controllable steering.
    {reason: "mathematically clean; alpha as continuous knob is the most interpretable version of direction/strength decoupling", inference: 0.65}
  +> [Direction Magnitude Decoupling]
 // ══════════════════════════════════════════════════════════════════════
 // THEME 4: GAIN CONTROL -- ADAPTATION AS RESCALING
 // Methods: IA3, VeRA, LN Tuning
 // Evidence AGAINST the full thesis: if mere rescaling works, the
 // geometric structure may not be needed.
 // ══════════════════════════════════════════════════════════════════════
 # Gain Control
 <Gain Control Sufficiency>
 (1) [Scaling Vectors Competitive]: IA3 achieves competitive adaptation using
    only d-dimensional scaling vectors at key, value, and FFN positions --
    no new weight directions needed. T0-3B with IA3 outperforms GPT-3 175B
    with ICL. #observation
    [Liu et al. 2022](https://arxiv.org/abs/2205.05638)
    [evidence](docs/ia3_few_shot_peft.md#L1-L30)
    > We introduce (IA)3 that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny amount of new parameters. (...) We validate the effectiveness of T-Few on completely unseen tasks by applying it to the RAFT benchmark, **attaining super-human performance for the first time and outperforming the state-of-the-art by 6% absolute.**
    {reason: "strong result on RAFT; super-human is notable; but T5-family specific and scaling-only has known expressivity ceiling", credence: 0.72}
 (2) [Random Projections Plus Scaling]: VeRA matches LoRA with 10x fewer
    trainable parameters using only per-layer scaling of shared frozen
    random matrices. #observation
    [Kopiczko et al. 2023](https://arxiv.org/abs/2310.11454)
    [evidence](docs/vera_vector_random_matrix_adaptation.md#L1-L30)
    > We present Vector-based Random Matrix Adaptation (VeRA), which significantly reduces the number of trainable parameters compared to LoRA, yet maintains the same performance. **It achieves this by using a single pair of low-rank matrices shared across all layers and learning small scaling vectors instead.**
    {reason: "ICLR 2024; surprising result that random projections suffice with per-layer gating; JL-lemma argument is plausible", credence: 0.75}
 ----
 (3) [Rescaling Gets 60-80%]: A surprisingly large fraction of task adaptation
    is reweighting existing features, suggesting the model already computes
    the right features -- the bottleneck is channel selection.
    {reason: "IA3 and VeRA perform well on standard benchmarks but both hit expressivity ceilings on hard tasks. This means rescaling handles easy adaptation but not deep restructuring.", inference: 0.45}
  -> [Natural Manifold]
 // The contra is weak (0.45) because gain control works on easy tasks
 // but fails precisely where geometric structure matters most. The
 // ceiling IS the evidence for the manifold hypothesis.
 // ══════════════════════════════════════════════════════════════════════
 // THEME 5: RANK IS SECONDARY TO BASIS CHOICE
 // Methods: LoRA, RandLoRA, C3A, FourierFT, LoHa, LoKr
 // ══════════════════════════════════════════════════════════════════════
 # Rank Debate
 <Rank Is Secondary>
 (1) [Full Rank Helps Hard Tasks]: RandLoRA achieves full-rank updates by
    summing scaled random rank-r bases, and its loss landscape is closer
    to full FT's than LoRA's. #observation
    [Albert et al. 2025](https://arxiv.org/abs/2502.00987)
    [evidence](docs/randlora_random_matrix.md#L1-L30)
    > This paper aims to answer this question by introducing RandLoRA, a parameter-efficient method that performs full-rank updates using a learned linear combinations of low-rank, non-trainable random matrices. (...) Our findings reveal that full-rank updates are beneficial across vision and language tasks individually, and even more so for vision-language tasks, where RandLoRA significantly reduces -- and sometimes eliminates -- the performance gap between standard fine-tuning and LoRA, demonstrating its efficacy.
    {reason: "ICLR 2025; eliminates FT-LoRA gap on vision-language; loss landscape analysis compelling; improvement task-dependent", credence: 0.72}
 (2) [Spectral Sparsity]: FourierFT achieves comparable results via sparse
    Fourier coefficients, suggesting weight updates are spectrally smooth. #observation
    [Gao et al. 2024](https://arxiv.org/abs/2405.03003)
    [evidence](docs/fourierft_spectral.md#L1-L30)
    > We introduce FourierFT, which treats DeltaW as a matrix in the spatial domain and learns only a small fraction of its spectral coefficients. With the trained spectral coefficients, we implement the inverse discrete Fourier transform to recover DeltaW. Empirically, our FourierFT method shows comparable or better performance with fewer parameters than LoRA on various tasks, including natural language understanding, natural language generation, instruction tuning, and image classification. For example, when performing instruction tuning on the LLaMA2-7B model, FourierFT surpasses LoRA with only 0.064M trainable parameters, compared to LoRA's 33.5M.
    {reason: "ICML 2024; 500x fewer params than LoRA on LLaMA2 instruction tuning; spectral sparsity is structurally different from SVD", credence: 0.68}
 ----
 (3) [Basis Matters More Than Rank]: PiSSA's well-chosen rank-r subspace
    often outperforms RandLoRA's full-rank random subspace, suggesting which
    directions you update matters more than how many.
    {reason: "PiSSA at rank r beats LoRA at same rank; RandLoRA at full rank sometimes beats PiSSA but not always. The basis selection (SVD vs random) appears to be the primary factor.", inference: 0.40}
  -> [Natural Manifold]
 // Weak contra (0.40): the rank debate slightly undermines the thesis
 // because if full rank sometimes wins, the SVD manifold might not
 // capture everything. But basis choice dominates rank choice.
 <Alternative Parameterizations>
 (1) [Circulant Structure]: C3A achieves full-rank updates via circulant
    matrices (FFT-computable), competitive on GLUE and instruction tuning. #observation
    [Chen et al. 2025](https://arxiv.org/abs/2407.19342)
    [evidence](docs/c3a_circular_convolution.md#L1-L30)
    > We introduce Circular Convolution Adaptation (C3A), which not only achieves high-rank adaptation with enhanced performance but also excels in both computational power and memory utilization. **Extensive experiments demonstrate that C3A consistently outperforms LoRA and its variants across various fine-tuning tasks.** Our code is available at Hugging Face PEFT.
    {reason: "ACL 2025; merged into HuggingFace PEFT; circulant assumption is a strong inductive bias; competitive but not dominant on all benchmarks", credence: 0.62}
 (2) [Hadamard Products]: LoHa's element-wise product of two low-rank
    decompositions captures pairwise feature interactions that additive
    low-rank misses. #observation
    [Nam, Moon, Oh 2022](https://arxiv.org/abs/2108.06098)
    [evidence](docs/loha_hadamard_product.md#L1-L30)
    > We propose a communication-efficient parameterization, FedPara, which re-parameterizes weight parameters of layers using low-rank weights followed by the Hadamard product. **Compared to the conventional low-rank parameterization, our FedPara method is not restricted to low-rank constraints, and thereby it has a far larger capacity.** This property enables to achieve comparable performance while requiring 3 to 10 times lower communication costs than the model with the original layers.
    {reason: "ICLR 2022; effective for vision/diffusion via LyCORIS ecosystem; Hadamard product overcomes low-rank bottleneck; niche for LLMs", credence: 0.55}
 (3) [Kronecker Products]: LoKr's tensor product structure enables compact
    adaptation across convolutional and linear layers, with systematic
    evaluation across fine-tuning methodologies. #observation
    [Yeh et al. 2024](https://arxiv.org/abs/2309.14859)
    [evidence](docs/lokr_lycor.md#L1-L30)
    > This paper introduces LyCORIS (Lora beYond Conventional methods, Other Rank adaptation Implementations for Stable diffusion), an open-source library that offers a wide selection of fine-tuning methodologies for Stable Diffusion. Furthermore, we present **a thorough framework for the systematic assessment of varied fine-tuning techniques.** This framework employs a diverse suite of metrics and delves into multiple facets of fine-tuning.
    {reason: "systematic LyCORIS benchmark; Kronecker/spatial structure assumption less natural for attention weights; mainly validated on diffusion models", credence: 0.50}
 ----
 (4) [Multiple Structures Work]: Multiple non-SVD parameterizations (circulant,
    Hadamard, Kronecker, Fourier) achieve competitive results, suggesting
    SVD is not the only viable basis.
    {reason: "variety of working approaches, but none consistently beats SVD methods on controlled benchmarks; SVD has strongest cross-domain evidence", inference: 0.50}
  -> [Natural Manifold]
 // ══════════════════════════════════════════════════════════════════════
 // THEME 6: FUNCTIONAL ARCHITECTURE
 // Methods: CLOVER, ReFT, X-LoRA, Prompt Tuning
 // ══════════════════════════════════════════════════════════════════════
 # Functional Architecture
 <Functional Architecture>
 (1) [Activation Intervention]: ReFT (LoReFT) uses 15-65x fewer parameters
    than LoRA by intervening on hidden states at specific (layer, position)
    sites rather than on weights. #observation
    [Wu, Arora et al. 2024](https://arxiv.org/abs/2404.03592)
    [evidence](docs/reft_representation_finetuning.md#L1-L40)
    > ReFT methods operate on a frozen base model and learn task-specific interventions on hidden representations. We define a strong instance of the ReFT family, Low-rank Linear Subspace ReFT (LoReFT). (...) **LoReFT uses 15x-65x times fewer parameters while achieving state-of-the-art performance on commonsense reasoning, instruction-following, and natural language understanding** against the strongest PEFTs.
    {reason: "Stanford NLP; widely cited; strong on instruction following and math. Fundamentally different paradigm (activation vs weight) so comparison not apples-to-apples", credence: 0.72}
 (2) [Compositional Routing]: X-LoRA dynamically routes through expert LoRA
    adapters, achieving better composite task performance than individual
    adapters. #assumption
    {reason: "compositional multi-task; routing adds overhead; more suited to deployment than single-task science", credence: 0.50}
 ----
 (3) [Beyond Individual Matrices]: Respecting the model's functional architecture
    (which matrices cooperate, where activations matter) provides additional
    efficiency gains beyond weight-matrix-level priors.
    {reason: "CLOVER (joint SVD) beats per-matrix methods; ReFT (activation intervention) is 15-65x more efficient; both exploit structure LoRA ignores. But evidence is early -- 1-2 papers each.", inference: 0.60}
  +> [Natural Manifold]
 // ══════════════════════════════════════════════════════════════════════
 // THEME 7: SHARED STRUCTURE AND COMPRESSION
 // Methods: MiSS, VBLoRA, SHiRA, AdaLoRA
 // ══════════════════════════════════════════════════════════════════════
 # Compression
 <Shared Structure Evidence>
 (1) [Adaptive Rank Allocation]: AdaLoRA's SVD-parameterized pruning shows
    that different layers need different ranks -- query/value projections
    and early/late layers are most important. #observation
    [Zhang et al. 2023](https://arxiv.org/abs/2303.10512)
    [evidence](docs/adalora_adaptive_budget.md#L1-L30)
    > We propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance score. In particular, AdaLoRA parameterizes the incremental updates in the form of singular value decomposition. Such a novel approach allows us to effectively prune the singular values of unimportant updates. (...) **Results demonstrate that AdaLoRA manifests notable improvement over baselines, especially in the low budget settings.**
    {reason: "ICLR 2023; validates SVD parameterization and adaptive rank; modest gains over LoRA add complexity", credence: 0.70}
 (2) [Vector Bank Compression]: VBLoRA uses 0.4% of LoRA's parameters via
    shared atomic vectors, suggesting adapter diversity across layers is
    much lower than assumed. #observation
    [Li, Han, Ji 2024](https://arxiv.org/abs/2405.15179)
    [evidence](docs/vblora_vector_bank.md#L1-L30)
    > Our proposed VB-LoRA composites all the low-rank matrices of LoRA from a shared vector bank with a differentiable top-k admixture module. (...) When fine-tuning the Llama2-13B model, **VB-LoRA only uses 0.4% of LoRA's stored parameters, yet achieves superior results.** This method has been merged into the Hugging Face PEFT package.
    {reason: "NeurIPS 2024; merged into HuggingFace PEFT; extreme compression is remarkable; implies shared structure across layers", credence: 0.72}
 (3) [Sparse High-Rank]: SHiRA directly fine-tunes 1-2% of base weights
    selected by importance, achieving full rank without low-rank
    bottleneck. #observation
    [Bhardwaj et al. 2024](https://arxiv.org/abs/2406.13175)
    [evidence](docs/shira_sparse_high_rank.md#L1-L40)
    > We propose SHiRA, a new high rank adapter paradigm to demonstrate that **changing as few as 1-2% parameters of the original network is sufficient for adaptation.** (...) We conduct extensive experiments on LLMs (LLaMA-7B, LLaMAv2-7B) and LVMs (Stable Diffusion, SDXL) where we demonstrate that SHiRA significantly outperforms LoRA on both single- and multi-adapter tasks. On LLMs, we show that SHiRA achieves up to 2.7% better accuracy than LoRA on commonsense reasoning.
    {reason: "NeurIPS 2024; Qualcomm; multi-adapter fusion is unique angle; extensive LLM+LVM experiments; challenges low-rank assumption", credence: 0.62}
 ----
 (4) [Compression Signals Shared Structure]: The extreme compressibility of
    adapter weights (VBLoRA 0.4%, SVFT 0.006%) suggests weight updates
    share structure that priors can exploit.
    {reason: "multiple methods achieve extreme compression suggesting high redundancy in naive parameterizations; supports the thesis that the right basis captures this shared structure", inference: 0.60}
  +> [Natural Manifold]
 // ══════════════════════════════════════════════════════════════════════
 // ADDITIONAL METHODS (Prompt Tuning, LN Tuning, Bone, Trainable Tokens)
 // These are boundary cases that don't strongly support or oppose the thesis
 // ══════════════════════════════════════════════════════════════════════
 # Boundary Cases
 <Prompt Tuning Orthogonal Paradigm>
 (1) [Input Not Weights]: Prompt/prefix tuning modifies the input embedding
    space rather than weight matrices. At T5-XXL scale, it matches full FT
    with 0.01% params. #assumption
    {reason: "scales with model size; conceptually different from weight adaptation; when it works, it suggests the model already has the capability", credence: 0.65}
 ----
 (2) [Capability Already Present]: When prompt tuning works, it reveals that
    the target task is within the model's existing computational capacity --
    no weight restructuring needed.
    {reason: "informative boundary: prompt tuning success = no intervention needed; prompt tuning failure = genuine capability gap requiring weight changes", inference: 0.55}
  +> [Natural Manifold]
 // ══════════════════════════════════════════════════════════════════════
 // CONTRA: LINEARITY ASSUMPTION
 // ══════════════════════════════════════════════════════════════════════
 <SVD Is Linear Decomposition>
 (1) [Nonlinear Structure Missed]: SVD is a linear decomposition of a nonlinear
    system. It captures dominant linear subspaces but may miss nonlinear
    interactions that matter for complex tasks. #assumption
    {reason: "fundamental theoretical limitation; no PEFT method addresses this; but empirically SVD methods still win", credence: 0.75}
 (2) [Checkpoint Specific]: SVD basis is model-specific and not transferable
    across checkpoints without recomputation. #assumption
    {reason: "practical limitation; PiSSA notes fast SVD takes only seconds, so cost is low", credence: 0.60}
 ----
 (3) [Linearity Might Not Matter]: For the adaptations tested so far, linear
    subspaces suffice. The nonlinear structure may only matter for
    capabilities not yet probed.
    {reason: "strong practical success despite theoretical limitation; but we haven't tested on sufficiently hard capability changes", inference: 0.45}
  -> [Natural Manifold]
@@ -0,0 +1,657 @@
 Title: CLOVER: Cross-Layer Orthogonal Vectors Pruning and Fine-Tuning
 URL Source: https://arxiv.org/pdf/2411.17426
 Published Time: Mon, 03 Feb 2025 01:40:12 GMT
 Number of Pages: 16
 Markdown Content:
 # CLOVER: Cross-Layer Orthogonal Vectors Pruning and Fine-Tuning 
 Fanxu Meng 1 2 Pingzhi Tang 1 Fan Jiang 1 Muhan Zhang 1 2 
 ## Abstract 
 Decoder-only models generate tokens autoregressively by caching key/value vectors, but as the cache grows, inference becomes memory-bound. To address this issue, we introduce CLOVER (Cross-Layer Orthogonal Vectors), a novel approach that treats pairs of attention layers as a set of low-rank decompositions. CLOVER applies Singular Value Decomposition (SVD) to the Q-K and V-O pairs within each attention head. The resulting singular values can either guide pruning or serve as trainable parameters for efficient fine-tuning of all orthogonal vectors. After pruning or fine-tuning, these values are reintegrated into the model without increasing its parameter count. We apply CLOVER to various models, including GPT-2 XL, DeepSeek-V2-Lite, Whisper-Large-v3, Stable Diffusion XL, and LLaMA-3.2-11B-Vision. Our results demonstrate that CLOVER significantly improves pruning efficiency. For instance, the perplexity of pruning 70% of the Q-K pairs in GPT-2 XL is similar to that of pruning just 8% with vanilla methods. Fine-tuning the singular values further results in a full-rank update, outperforming state-of-the-art methods (LoRA, DoRA, HiRA, and PiSSA) by 7.6%, 5.5%, 3.8%, and 0.7%, respectively, on eight commonsense tasks for LLaMA-2 7B. 
 ## 1. Introduction 
 In recent years, Large Language Models (LLMs) have rapidly evolved into essential tools for productivity (OpenAI, 2024; Anthropic, 2024; Team et al., 2024a). Open-source models (AI@Meta, 2024; Mistral, 2024; Qwen, 2024; Liu et al., 2024b; Team et al., 2024b; Abdin et al., 2024) have also narrowed the performance gap with closed-source models. The success of LLMs is largely attributed to Next Token Prediction (Radford, 2018; Brown et al., 2020), where tokens are predicted sequentially, with attention computed between each token and all preceding ones. To avoid redundant computations, key-value features are cached. However, as model size grows, the overhead of caching becomes substantial, leading to memory and communication bottlenecks. 
 [Figure 1 panels: (a) Multi-Head Attention; (b) CLOVER; (c) Pruning without Training, log of perplexity vs. pruning ratio, Vanilla vs. CLOVER; (d) Fine-Tuning Pruned Model, perplexity vs. pruning ratio, Vanilla vs. CLOVER.]
 Figure 1. (a) We treat the Query-Key and Value-Output layers within a single attention head as a unified structure. (b) Apply SVD to obtain two sets of singular vectors for initializing the Q-K and V-O layers, along with singular values that guide pruning or enable efficient full-rank fine-tuning. (c) This cross-layer orthogonalization strategy allows for higher pruning rates. (d) The pruned model maintains strong performance after fine-tuning. 
 > arXiv:2411.17426v3 [cs.LG] 31 Jan 2025 CLOVER: Cross-Layer Orthogonal Vectors
 For instance, a 65B parameter model (Touvron et al., 2023) with 8-bit key-value quantization requires over 86GB of GPU memory to store 512K tokens, exceeding the capacity of a single H100-80GB GPU (Sun et al., 2024). To enable efficient training and inference, we introduce CLOVER (Cross-Layer Orthogonal Vectors), a novel method that orthogonalizes the Query, Key, Value, and Output vectors without generating additional transformation matrices. As shown in Figure 1a, we treat the Q-K and V-O pairs in each attention head as a low-rank decomposition of $W_{QK}$ and $W_{VO}$. By crossing these layers and performing SVD on $W_{QK}$ and $W_{VO}$, the Query, Key, Value, and Output vectors become orthogonal within each attention head. Figure 1b illustrates how the resulting singular values can guide pruning or serve as trainable parameters for efficient fine-tuning. After pruning or fine-tuning, these values can be reintegrated into the model without increasing its parameter count. Notably, previous methods, such as SVFT (Lingam et al., 2024), obtain orthogonal vectors by directly performing orthogonal decomposition on the matrix at each layer, which results in an accompanying transformation matrix, doubling the parameter count. In contrast, CLOVER treats the Q-K pairs as transformation matrices for each other, and similarly for the V-O pairs. CLOVER only generates a small set of singular values to guide pruning and fine-tuning, which can be merged back into the model without increasing inference costs. 
 By orthogonalizing the vectors, we eliminate linear redundancy. Attention heads contain numerous non-zero norm vectors. Directly pruning these vectors would degrade performance, but orthogonalizing them allows us to represent the entire attention head's space using a small set of orthogonal bases. The remaining vectors are nearly zero, making them safe to prune. As shown in Figure 1c, pruning an average of 45 vectors in the query-key pair using CLOVER results in a perplexity similar to that of vanilla pruning, which prunes only 5 vectors. Moreover, CLOVER generates a singular value matrix between the Q-K and V-O pairs. By updating this matrix during fine-tuning, CLOVER learns linear combinations of all orthogonal bases within each attention head. In contrast, PiSSA can only learn from a subset of orthogonal vectors, potentially causing some data projections to approach zero in those directions, leading to non-functional adapters during training. As shown in Figure 1d, fine-tuning a very small number of singular values can achieve performance close to that of fine-tuning all attention heads. We summarize the contribution of our paper as follows: 
 - We treat the Q-K and V-O pairs in each attention head as low-rank approximations of $W_{QK}$ and $W_{VO}$. By performing SVD, we orthogonalize the attention head without adding extra transformation matrices. 
 - This orthogonalization reduces linear redundancy, is compatible with any pruning method, and allows for higher pruning ratios. Pruning 46.42% of the vectors in Whisper's attention head preserves performance without requiring additional training. 
 - CLOVER enables efficient full-rank updates, surpassing SOTA methods such as LoRA, DoRA, HiRA, and PiSSA on eight commonsense reasoning tasks across LLaMA 7B/13B, LLaMA-2-7B, and LLaMA-3-8B, with additional analyses highlighting its advantages. 
 ## 2. Related Work 
 LLM Compression. To mitigate the high memory demands of KV Caches in long-context models, several techniques have been proposed. These include reducing sequence length with linear attention (Katharopoulos et al., 2020; Wang et al., 2020; Peng et al., 2023; Gu & Dao, 2023; De et al., 2024), dynamic token pruning (Fu et al., 2024; Jo & Shin, 2024; Li et al., 2024b), compressing the key-value rank (Shazeer, 2019; Ainslie et al., 2023; Liu et al., 2024a; Yu et al., 2024), and pruning head dimensions (Ashkboos et al., 2024; Xia et al., 2023; Sun et al., 2023). Additional approaches include sharing key-value representations across layers (Sun et al., 2024; Brandon et al., 2024; Liu et al., 2024c; Zuhri et al., 2024) and quantizing KV cache weights and activations (Frantar et al., 2022; Dettmers et al., 2022; Xiao et al., 2023; Liu et al., 2024e; Hooper et al., 2024). Among them, structure pruning is hardware-friendly but can reduce performance when non-zero dimensions are removed (Ma et al., 2023). Fine-tuning can recover some of the lost performance, but it is computationally expensive. To address this, Parameter Efficient Fine-Tuning (PEFT) methods are used (Guo et al., 2023). 
 Parameter Efficient Fine-Tuning. Several strategies have been introduced to minimize fine-tuning parameters while maintaining performance. These include low-rank adaptation (Hu et al., 2021), partial-parameter fine-tuning (Zaken et al., 2021; Lawton et al., 2023; Zhao et al., 2020; Sung et al., 2021; Ansell et al., 2021; Xu et al., 2021; Guo et al., 2020; Fu et al., 2023), soft prompt fine-tuning (Hambardzumyan et al., 2021; Lester et al., 2021; Li & Liang, 2021; Liu et al., 2023b; Vu et al., 2021; Asai et al., 2022; Wang et al., 2023), and sparse matrix fine-tuning (Qiu et al., 2023; Liu et al., 2023a; Yuan et al., 2024). Among these, LoRA is widely used due to its simplicity and effectiveness, with recent works enhancing it further (Zhang et al., 2023; Zi et al., 2023; Liu et al., 2024d; Zhao et al., 2024; Jiang et al., 2024). PiSSA (Meng et al., 2024) improves convergence speed by initializing adapters with principal singular values and vectors, also reducing quantization error (Wang et al., 2024a;b; Li et al., 2024a). However, PiSSA is limited by its use of a fixed set of orthogonal bases. SVFT (Lingam et al., 2024) directly applies Singular Value Decomposition (SVD) to the original matrix, but this increases the number of parameters, raising computational overhead and reducing efficiency. The CLOVER method addresses these issues by treating the Query-Key pairs in each attention head as low-rank matrices. Using orthogonal decomposition, CLOVER eliminates the need for additional transformation matrices. Instead, it leverages a small set of singular values to linearly combine orthogonal vectors, making the approach more parameter-efficient. After fine-tuning, the adapter can be smoothly reintegrated into the original matrix structure. 
 ## 3. CLOVER: Cross-Layer Orthogonal Vectors 
 Below is a step-by-step explanation of the CLOVER method and of why it can orthogonally decompose the Query, Key, Value, and Output layers in Multi-Head Attention without introducing any transformation matrix. We mainly use the computation of the Q-K pair as an example, then extend it to the V-O pair. 
 Multi-Head Self-Attention Setup. In a multi-head self-attention mechanism with $H$ heads, each head $h \in \{1, \dots, H\}$ computes an attention score as: 
 $$\mathrm{attn}(Q_h, K_h) = \mathrm{softmax}\left(\frac{Q_h K_h^\top}{\sqrt{d}}\right),$$
 where $H$ is the number of attention heads, $d$ is the dimensionality of each head, $X \in \mathbb{R}^{n \times D}$ is the input matrix ($n$ is the sequence length, $D$ is the total hidden dimension), $Q_h, K_h \in \mathbb{R}^{n \times d}$ are the query and key representations for head $h$, and $W_Q, W_K \in \mathbb{R}^{D \times H \times d}$ are weights for projecting the input $X$ into queries and keys. Specifically, the queries and keys for head $h$ are obtained by multiplying $X$ with the corresponding "slice" of $W_Q$ and $W_K$, respectively: 
 $$Q_h = X W_Q^{[:,h,:]}, \qquad K_h = X W_K^{[:,h,:]}.$$
 Cross Layers Merging. Substituting $Q_h$ and $K_h$ into $Q_h K_h^\top$, we have: 
 $$Q_h K_h^\top = X W_Q^{[:,h,:]} \left(W_K^{[:,h,:]}\right)^\top X^\top.$$
 Notice that the original weights $W_Q^{[:,h,:]}$ and $W_K^{[:,h,:]}$ are each in $\mathbb{R}^{D \times d}$; once multiplied together, the resulting matrix $W_{QK}^h = W_Q^{[:,h,:]} \left(W_K^{[:,h,:]}\right)^\top$ has dimension $D \times D$. 
 Since $d \ll D$, using $W_{QK}^h$ directly in computations, or storing it as trainable parameters, would be highly inefficient, limiting the use cases of such parameter merging. 
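The merging step can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: the toy sizes (`n`, `D`, `H`, `d`), the random weights, and the helper `softmax` are assumptions made for illustration. It shows that the per-head scores computed from separate query and key projections equal the scores computed from the merged $D \times D$ matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, H, d = 5, 32, 4, 8  # toy sizes (assumed); real models use far larger D

X = rng.standard_normal((n, D))       # input matrix, n tokens x D hidden dims
W_Q = rng.standard_normal((D, H, d))  # query weights, one D x d slice per head
W_K = rng.standard_normal((D, H, d))  # key weights, one D x d slice per head


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


h = 1  # any head index in [0, H)

# Standard route: project to queries and keys, then take the inner products.
Q_h = X @ W_Q[:, h, :]
K_h = X @ W_K[:, h, :]
scores_two_step = Q_h @ K_h.T

# Merged route: fold the two slices into a single D x D matrix first.
W_QK_h = W_Q[:, h, :] @ W_K[:, h, :].T
scores_merged = X @ W_QK_h @ X.T

attn = softmax(scores_two_step / np.sqrt(d))
print(np.allclose(scores_two_step, scores_merged), W_QK_h.shape)
```

The merged matrix holds `D * D` entries versus `2 * D * d` for the two slices, which is the inefficiency the paragraph above points out.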
 Cross Layers Orthogonal Decomposition. To address the large size of $W_{QK}^h$, we factorize $W_{QK}^h$ via SVD: 
 $$W_{QK}^h = U_{QK}^h \, S_{QK}^h \, \left(V_{QK}^h\right)^\top,$$
 where $U_{QK}^h$ is a $D \times D$ orthogonal matrix, $S_{QK}^h$ is a $D \times D$ diagonal matrix of singular values, and $V_{QK}^h$ is another $D \times D$ orthogonal matrix. Since $W_Q^{[:,h,:]}$ and $W_K^{[:,h,:]}$ each have shape $D \times d$, the rank of $W_{QK}^h$ is at most $d$. Thus $S_{QK}^h$ has at most $d$ non-zero singular values. We can truncate the SVD to keep only the top-$r$ singular values without loss: 
 $$W_{QK}^h = U_{QK}^h[:, :r] \, S_{QK}^h[:r, :r] \, V_{QK}^h[:, :r]^\top,$$
 where $r \le d$. 
 The process can be easily applied to $W_V$ and $W_O$, as introduced in Appendix A.1. 
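The rank argument can be illustrated with a small NumPy sketch (an illustration under assumed toy sizes and random weights, not the paper's implementation). It builds a merged matrix from two $D \times d$ factors, confirms that only $d$ singular values are numerically non-zero, and checks that truncating the SVD at $r = d$ reconstructs the matrix up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 64, 8  # toy sizes (assumed)

W_q = rng.standard_normal((D, d))  # stands in for the head slice of W_Q
W_k = rng.standard_normal((D, d))  # stands in for the head slice of W_K
W_qk = W_q @ W_k.T                 # merged D x D matrix, rank at most d

# Full SVD; NumPy returns singular values in descending order and V transposed.
U, s, Vt = np.linalg.svd(W_qk)

# Count singular values that are non-zero relative to the largest one.
numerical_rank = int(np.sum(s > 1e-10 * s[0]))

# Keep only the top-r components, with r equal to the head dimension d.
r = d
W_qk_trunc = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
max_err = float(np.max(np.abs(W_qk - W_qk_trunc)))

print(f"numerical rank = {numerical_rank}, max reconstruction error = {max_err:.2e}")
```

Truncation is exact only when $r$ is at least the true rank; choosing a smaller $r$ discards non-zero singular values and becomes an approximation.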
 CLOVER for Pruning. After performing SVD, we can rewrite the weight matrix $W_{QK}^h$ as follows: 
 $$W_{QK}^h = \underbrace{U_{QK}^h[:, :r] \, S_{QK}^h[:r, :r]}_{\tilde{U}^h \in \mathbb{R}^{D \times r}} \; \underbrace{V_{QK}^h[:, :r]^\top}_{\tilde{V}^h \in \mathbb{R}^{r \times D}}.$$
 Instead of storing the full matrices $W_Q^h$ and $W_K^h \in \mathbb{R}^{D \times d}$, we store the smaller factors $\tilde{U}^h \in \mathbb{R}^{D \times r}$ and $\tilde{V}^h \in \mathbb{R}^{r \times D}$, which can be significantly smaller than the original matrix since $r \le d \ll D$. This leads to a reduction in memory usage and computational cost. Additionally, we can prune singular values (and their corresponding singular vectors) below a chosen threshold. This further reduces the parameter count and computational overhead. 
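A hedged sketch of threshold pruning for one head follows. The function name `clover_prune_head`, the relative threshold `rel_threshold`, and the synthetic weights with decaying column scales are all assumptions introduced here for illustration; the paper itself does not specify them in this section. The sketch folds the kept singular values into the query-side factor so that the head keeps the usual two-matrix form with inner dimension $r$.

```python
import numpy as np


def clover_prune_head(W_q, W_k, rel_threshold):
    """Prune one head's Q-K pair (hypothetical helper, for illustration only).

    Returns new query and key factors, each of shape (D, r), plus the
    singular values that were discarded.
    """
    d = W_q.shape[1]
    U, s, Vt = np.linalg.svd(W_q @ W_k.T)
    U, s, Vt = U[:, :d], s[:d], Vt[:d, :]  # rank is at most d
    keep = s >= rel_threshold * s[0]       # s is sorted, so this keeps a prefix
    W_q_new = U[:, keep] * s[keep]         # U~ = U S, scales each kept column
    W_k_new = Vt[keep, :].T                # transpose of V~, shape (D, r)
    return W_q_new, W_k_new, s[~keep]


rng = np.random.default_rng(0)
D, d = 64, 8  # toy sizes (assumed)

# Synthetic head with redundancy: query columns have strongly decaying scales.
W_q = rng.standard_normal((D, d)) * np.logspace(0, -4, d)
W_k = rng.standard_normal((D, d))

W_q_new, W_k_new, dropped = clover_prune_head(W_q, W_k, rel_threshold=0.01)
r = W_q_new.shape[1]

# Spectral-norm error of the pruned head versus the original merged matrix.
err = np.linalg.norm(W_q @ W_k.T - W_q_new @ W_k_new.T, 2)
print(f"kept r={r} of d={d}; spectral error={err:.3e}")
```

Because the discarded components are orthogonal to the kept ones, the spectral-norm error equals the largest discarded singular value, which is what makes a singular-value threshold a direct handle on the approximation error.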
CLOVER for Fine-Tuning. CLOVER can be used not only for pruning but also for parameter-efficient fine-tuning. We freeze the matrices $U_{QK}^h[:, :r]$ and $V_{QK}^h[:, :r]$ and fine-tune only the singular values $S_{QK}^h[:r, :r]$. In contrast to SVFT, which factorizes the entire weight matrices $W_Q, W_K, W_V, W_O \in \mathbb{R}^{D\times D}$ individually, CLOVER factorizes the merged weights $W_{QK}^h$ and $W_{OV}^h$ within each attention head, significantly reducing the number of parameters. By applying SVD within each attention head, CLOVER constrains the effective rank of the cross-layer matrix to $d$. As a result, the tunable matrix $S_{QK}$ lies in $\mathbb{R}^{H\times d\times d}$ (considering all heads). In comparison, SVFT requires factorizing each large matrix into three components ($U, S, V \in \mathbb{R}^{D\times D}$), leading to a significant increase in parameter count and computational overhead, even with sparse updates for the singular values $S$.

For example, consider the LLaMA 2-7B model with $H = 32$ attention heads and a head dimension of $d = 128$. By factorizing each head separately, the largest size for $S_{QK}$ is $O(32 \times 128 \times 128)$, which is significantly smaller than factorizing an $\mathbb{R}^{4096\times 4096}$ matrix. This makes CLOVER’s parameter efficiency comparable to that of a LoRA configuration with rank 32, as shown in Appendix A.2, but with additional potential for pruning.
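The counts in this example can be worked through directly. The sketch below uses only the numbers quoted in the text; the LoRA accounting (two rank-32 factors on each of $W_Q$ and $W_K$) is an assumption made here for illustration and may differ from the accounting in Appendix A.2.

```python
H, d, D = 32, 128, 4096          # heads, head dimension, hidden size (LLaMA 2-7B, from the text)
lora_rank = 32                   # LoRA rank used for comparison in the text

clover_s_qk = H * d * d          # one d x d matrix S_QK per head
full_matrix = D * D              # entries in a single D x D matrix
# Assumption: LoRA adds A (r x D) and B (D x r) to each of W_Q and W_K.
lora_q_and_k = 2 * (2 * D * lora_rank)

print(clover_s_qk, full_matrix, lora_q_and_k)
```

Under that assumption the per-layer count for $S_{QK}$ coincides with rank-32 LoRA on the query and key projections, and is 32 times smaller than a single $4096 \times 4096$ matrix.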
 > Table 1. Pruning GPT-2-XL’s attention layers with CLOVER and vanilla pruning at various ratios, evaluating perplexity on Wikitext2 (lower is better), and fine-tuning on OpenWebText with different token budgets. The base model’s perplexity is 14.78.

| Pruning Ratio | w/o Training: Vanilla | w/o Training: CLOVER | 66M Tokens: Vanilla | 66M Tokens: CLOVER | 66M Tokens: CLOVER† | 131M Tokens: Vanilla | 131M Tokens: CLOVER | 131M Tokens: CLOVER† |
|---|---|---|---|---|---|---|---|---|
| 12.5% | 33.76 | 15.89 | 16.04 | 15.45 | 15.67 | 16.38 | 15.77 | 15.42 |
| 25.0% | 78.36 | 17.45 | 16.93 | 15.70 | 15.89 | 17.07 | 16.05 | 15.75 |
| 37.5% | 159.4 | 20.95 | 18.17 | 16.17 | 16.60 | 18.14 | 16.48 | 16.41 |
| 50.0% | 338.9 | 35.12 | 20.45 | 17.22 | 17.63 | 19.02 | 17.13 | 17.71 |
| 62.5% | 538.5 | 85.25 | 24.65 | 19.32 | 20.64 | 21.44 | 18.40 | 20.39 |
| 75.0% | 708.8 | 187.4 | 36.04 | 24.65 | 29.28 | 27.22 | 20.99 | 28.44 |

All entries are perplexity (↓).

 ## 4. Experiments 
As detailed in Section 3, CLOVER is highly effective for both pruning and fine-tuning. We present a series of experiments to validate these capabilities. In Section 4.1, we compare CLOVER with Vanilla pruning on a GPT-2-XL model (Radford et al., 2019). CLOVER results in less performance degradation, while Vanilla pruning significantly harms the model’s performance, making recovery difficult even with fine-tuning. In Section 4.2, we conduct fine-tuning experiments on eight commonsense tasks, comparing CLOVER with state-of-the-art methods. The results show the effectiveness of CLOVER’s linear combinations of all orthogonal vectors. In Section 4.3, CLOVER is applied to various models. We visualize how it removes linear redundancy between vectors, enabling more efficient pruning. In Section 4.4, we demonstrate CLOVER’s ability to perform significant pruning on the Whisper model, which exhibits substantial linear redundancy, without requiring fine-tuning. In Section 4.5, we explain the importance of learning from all the orthogonal vectors by analyzing the projection of data features onto different directions in the model. In Section 4.6, we confirm CLOVER’s full-rank update capability by visualizing the singular value distribution of ∆W from various methods. Finally, in Section 4.7, we show how CLOVER fine-tunes the model using its inherent properties, without introducing “intrusive dimensions” like LoRA, which may risk model degradation (Shuttleworth et al., 2024).
### 4.1. CLOVER for Large Ratio Pruning
Because attention must be computed between each token and all preceding tokens, compressing attention, particularly the key-value layers, is crucial, even though the MLP holds more parameters. CLOVER represents each attention head with a small number of vectors. Since it only modifies the initialization, it can be combined with any other pruning technique. This paper validates the proposed method using basic structured pruning on GPT-2-XL, rather than targeting state-of-the-art performance. We initialize GPT-2-XL with CLOVER, then prune small singular values based on their magnitude. To maintain inference efficiency, we apply the same pruning rate across all layers, removing a fixed percentage of the smallest singular values together with their singular vectors. The singular values, $S$, are then merged into the $U$ and $V$ matrices. For comparison, we also prune without CLOVER orthogonalization, using L2 norms as the pruning criterion. After pruning, we evaluate perplexity on the WikiText-2 (Merity et al., 2016) dataset. We then fine-tune the pruned models on the OpenWebText (Gokaslan & Cohen, 2019) dataset following nanoGPT¹. To minimize disruption to the original model, we fine-tune only the pruned attention layers, leaving the MLP, embedding layers, and LM head unchanged. In the CLOVER† case, $S$ is not merged into the $U$ and $V$ matrices immediately after pruning; it is first used for parameter-efficient fine-tuning and merged afterward. For CLOVER†, we raise the learning rate from 6e-4 to 6e-3 and remove weight decay, while keeping other hyperparameters consistent with the other two methods.

Based on Table 1, CLOVER causes less damage to the model than Vanilla pruning, as it transfers functionality into fewer orthogonal bases. For example, when pruning 50% of the parameters without further fine-tuning, CLOVER’s perplexity increases by only 1.38× over the base model (14.78 to 35.12), while Vanilla pruning increases it by 21.9× (14.78 to 338.9). After fine-tuning, CLOVER reaches lower perplexity than Vanilla pruning at every pruning ratio (e.g., 17.22 vs. 20.45 at 50% with 66M tokens). Due to its lower model disruption, CLOVER requires fewer tokens for fine-tuning to restore performance (e.g., perplexity with 66M tokens is close to that with 131M tokens), whereas Vanilla pruning needs more tokens, resulting in higher costs and potential degradation in out-of-domain tasks. Furthermore, by fine-tuning only the singular values from the SVD decomposition and the attention layer biases (the CLOVER† setting), CLOVER achieves recovery with fewer training resources and parameter changes. At lower pruning rates, CLOVER† even outperforms full attention layer training. However, when pruning rates are too high, accuracy loss becomes significant, and the available parameters for fine-tuning become insufficient (e.g., at 75% pruning, only 0.15% of the original attention layer parameters are updated).

> ¹ https://github.com/karpathy/nanoGPT
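The relative increases quoted in Section 4.1 for 50% pruning can be recomputed from Table 1. This is a small worked check, assuming "increases by" means the increase relative to the base perplexity; the variable names are illustrative.

```python
base_ppl = 14.78                                  # GPT-2-XL perplexity before pruning (Table 1 caption)
ppl_50 = {"Vanilla": 338.9, "CLOVER": 35.12}      # 50% pruning, no training (Table 1)

# Relative increase over the base model, i.e. (pruned - base) / base.
increase = {name: (ppl - base_ppl) / base_ppl for name, ppl in ppl_50.items()}
ratio_clover_to_vanilla = ppl_50["CLOVER"] / ppl_50["Vanilla"]

for name, inc in increase.items():
    print(f"{name}: +{inc:.2f}x")
print(f"CLOVER / Vanilla perplexity: {ratio_clover_to_vanilla:.3f}")
```

The last ratio is also the basis of the "one-tenth" comparison between CLOVER and standard pruning at 50%.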
 > Table 2. Accuracy comparison of LLaMA 7B/13B, LLaMA2 7B, and LLaMA3 8B with various PEFT methods on eight commonsense reasoning datasets. Results of LoRA and DoRA are taken from (Liu et al., 2024d). Results of HiRA are taken from (Anonymous, 2025).
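
| Model | Method | Params | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ChatGPT | - | - | 73.1 | 85.4 | 68.5 | 78.5 | 66.1 | 89.8 | 79.9 | 74.8 | 77.0 |
| LLaMA-7B | Series | 0.99% | 63.0 | 79.2 | 76.3 | 67.9 | 75.7 | 74.5 | 57.1 | 72.4 | 70.8 |
| LLaMA-7B | Parallel | 3.54% | 67.9 | 76.4 | 78.8 | 69.8 | 78.9 | 73.7 | 57.3 | 75.2 | 72.2 |
| LLaMA-7B | LoRA | 0.83% | 68.9 | 80.7 | 77.4 | 78.1 | 78.8 | 77.8 | 61.3 | 74.8 | 74.7 |
| LLaMA-7B | DoRA | 0.84% | 69.7 | 83.4 | 78.6 | 87.2 | 81.0 | 81.9 | 66.2 | 79.2 | 78.4 |
| LLaMA-7B | PiSSA | 0.83% | 74.1 | 85.4 | 81.5 | 94.0 | 85.0 | 85.6 | 72.1 | 84.2 | 82.7 |
| LLaMA-7B | CLOVER | 0.83% | 72.9 | 86.34 | 82.1 | 94.9 | 85.4 | 87.5 | 74.4 | 86.4 | 83.7 |
| LLaMA-13B | Series | 0.80% | 71.8 | 83 | 79.2 | 88.1 | 82.4 | 82.5 | 67.3 | 81.8 | 79.5 |
| LLaMA-13B | Parallel | 2.89% | 72.5 | 84.9 | 79.8 | 92.1 | 84.7 | 84.2 | 71.2 | 82.4 | 81.4 |
| LLaMA-13B | LoRA | 0.67% | 72.1 | 83.5 | 80.5 | 90.5 | 83.7 | 82.8 | 68.3 | 82.4 | 80.5 |
| LLaMA-13B | DoRA | 0.68% | 72.4 | 84.9 | 81.5 | 92.4 | 84.2 | 84.2 | 69.6 | 82.8 | 81.5 |
| LLaMA-13B | PiSSA | 0.67% | 74.6 | 88.0 | 82.9 | 95.5 | 87.0 | 90.3 | 77.2 | 88.2 | 85.4 |
| LLaMA-13B | CLOVER | 0.67% | 75.2 | 88.4 | 83.1 | 96.0 | 87.8 | 89.7 | 79.3 | 89.8 | 86.2 |
| LLaMA2-7B | LoRA | 0.83% | 69.8 | 79.9 | 79.5 | 83.6 | 82.6 | 79.8 | 64.7 | 81.0 | 77.6 |
| LLaMA2-7B | DoRA | 0.84% | 71.8 | 83.7 | 76.0 | 89.1 | 82.6 | 83.7 | 68.2 | 82.4 | 79.7 |
| LLaMA2-7B | HiRA | 0.83% | 71.2 | 83.4 | 79.5 | 88.1 | 84.0 | 86.7 | 73.8 | 84.6 | 81.4 |
| LLaMA2-7B | PiSSA | 0.83% | 75.0 | 87.0 | 81.6 | 95.0 | 86.5 | 88.5 | 75.9 | 86.4 | 84.5 |
| LLaMA2-7B | CLOVER | 0.83% | 75.0 | 86.4 | 82.0 | 95.1 | 87.5 | 89.6 | 76.6 | 89.4 | 85.2 |
| LLaMA3-8B | LoRA | 0.70% | 70.8 | 85.2 | 79.9 | 91.7 | 84.3 | 84.2 | 71.2 | 79.0 | 80.8 |
| LLaMA3-8B | DoRA | 0.71% | 74.6 | 89.3 | 79.9 | 95.5 | 85.6 | 90.5 | 80.4 | 85.8 | 85.2 |
| LLaMA3-8B | HiRA | 0.70% | 75.4 | 89.7 | 81.2 | 95.4 | 87.7 | 93.3 | 82.9 | 88.3 | 86.7 |
| LLaMA3-8B | PiSSA | 0.70% | 77.2 | 90.0 | 82.9 | 96.6 | 88.4 | 93.6 | 82.4 | 87.4 | 87.3 |
| LLaMA3-8B | CLOVER | 0.47% | 76.4 | 89.3 | 82.1 | 96.9 | 89.9 | 93.6 | 84.5 | 90.6 | 87.9 |
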
### 4.2. CLOVER for Full-Rank Fine-Tuning
In this section, we evaluate CLOVER against LoRA (Hu et al., 2021), DoRA (Liu et al., 2024d), HiRA (Anonymous, 2025), and PiSSA (Meng et al., 2024) on commonsense reasoning tasks, excluding SVFT (Lingam et al., 2024) due to its significant overhead. The tasks are divided into eight sub-tasks, as outlined in Table 4. Following the DoRA setup, we fine-tune on the Commonsense-170k dataset and evaluate on each sub-task’s test set. We apply orthogonal decomposition to the Value-Output pair and fine-tune the resulting singular value matrix. Because the non-linear RoPE (Su et al., 2024) operation sits between the query and key, we instead perform orthogonal decomposition within the Key layer and fine-tune the transition matrix. Similarly, we treat every 64 consecutive dimensions in the MLP.Up layer as a head, applying orthogonal decomposition and updating the transition matrix. The number of learnable parameters for LLaMA 7B/13B (Touvron et al., 2023) and LLaMA-2-7B (AI@Meta, 2023) matches LoRA/DoRA/HiRA/PiSSA with rank-32 updates. On LLaMA-3-8B (AI@Meta, 2024), CLOVER uses about 2/3 of the trainable parameters of the other methods (0.47% vs. 0.70% in Table 2). For a fair comparison, we use the hyperparameters from DoRA (3 epochs, batch size 16, linear learning-rate scheduler). We tuned the learning rate following DoRA’s approach and found that CLOVER performs best with lr=1e-4, which we applied across all models. PiSSA was trained using the same hyperparameters, but with a learning rate of 2e-5, as specified in its original paper. Because PiSSA and CLOVER train stably, we did not validate every 80 iterations to select the best-performing model on the validation set, as done in DoRA. Instead, we trained for the full 3 epochs and used the final model for testing. HiRA’s results are taken directly from its original paper, while the LoRA and DoRA results are sourced from DoRA’s paper. Table 2 shows that CLOVER achieves the highest average accuracy on every model.
Specifically, on LLaMA 7B, CLOVER outperforms LoRA, DoRA, and PiSSA by 9.0, 5.3, and 1.0 points in average accuracy, respectively. On LLaMA 13B, CLOVER outperforms these methods by 5.7, 4.7, and 0.8 points. On LLaMA-2-7B, CLOVER surpasses LoRA, DoRA, HiRA, and PiSSA by 7.6, 5.5, 3.8, and 0.7 points. Even on LLaMA-3-8B, with fewer trainable parameters, CLOVER leads by 7.1, 2.7, 1.2, and 0.6 points. CLOVER leads in most sub-tasks and ranks second in a few.

> [Figure 2 plots. Panels: (a) DeepSeek-V2-Lite, (b) Llama-3.2-Vision, (c) Whisper-Large-v3, (d) SDXL, (e) CLIP-ViT-BigG. Each panel plots Vanilla and CLOVER against Sorted Dimensions, for $W_Q W_K$ (first row) and $W_V W_O^\top$ (second row). Marked points (Q-K; V-O): (a) (10, 0.93); (72, 0.13). (b) (1, 2.66); (27, 0.29). (c) (7, 0.63); (35, 0.16). (d) (29, 1.25); (26, 0.46). (e) (9, 0.04); (44, 0.02).]
 Figure 2. CLOVER (orange) uses fewer orthogonal basis vectors than Vanilla Pruning (blue) to span the attention head space. The first row shows the importance of Q-K dimensions, and the second row shows V-O dimensions. After the red dot, CLOVER’s importance is lower, and pruning these vectors results in less performance loss. 
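The average-accuracy margins reported in Section 4.2 can be recomputed from the Avg. column of Table 2. The sketch below copies those averages; the dictionary layout and names are illustrative, and the margins are differences in percentage points.

```python
# Avg. column of Table 2 for the methods compared in the text.
avg = {
    "LLaMA-7B":  {"LoRA": 74.7, "DoRA": 78.4, "PiSSA": 82.7, "CLOVER": 83.7},
    "LLaMA-13B": {"LoRA": 80.5, "DoRA": 81.5, "PiSSA": 85.4, "CLOVER": 86.2},
    "LLaMA2-7B": {"LoRA": 77.6, "DoRA": 79.7, "HiRA": 81.4, "PiSSA": 84.5, "CLOVER": 85.2},
    "LLaMA3-8B": {"LoRA": 80.8, "DoRA": 85.2, "HiRA": 86.7, "PiSSA": 87.3, "CLOVER": 87.9},
}

# Margin of CLOVER over each baseline, in percentage points.
gaps = {
    model: {m: round(scores["CLOVER"] - s, 1) for m, s in scores.items() if m != "CLOVER"}
    for model, scores in avg.items()
}

for model, g in gaps.items():
    print(model, g)
```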
### 4.3. CLOVER Removes Redundant Vectors
CLOVER achieves a higher pruning ratio because of the significant linear redundancy present in the model. By representing the entire attention head with only a small number of orthogonal vectors, CLOVER effectively removes this redundancy. To illustrate this advantage, we apply it to a wide range of models, including the large language model DeepSeek-V2-Lite (DeepSeek-AI, 2024), the multimodal automatic speech recognition and speech translation model Whisper-Large-v3 (Radford et al., 2023), the multimodal instruction-tuned image reasoning model LLaMA-3.2-11B-Vision (AI@Meta, 2024), the image encoder CLIP-ViT-bigG (Cherti et al., 2022), and the image generation model Stable Diffusion XL (Podell et al., 2023). We compute the L2 norm for each dimension (which, after CLOVER orthogonalization, equals the singular value) in both the Q-K pair and the V-O pair, sorting the values in descending order within each attention head for better visualization. For comparison, we also perform Vanilla Pruning, which does not use CLOVER initialization but instead sorts directly by the L2 norm.

Figure 2 shows the first attention head from the first layer of each model. In the first row of the figure, depicting the Q-K norm, we observe that in the original model, the importance of each dimension is relatively balanced (e.g., Figure 2c). This balanced distribution is a result of linear redundancy: different directions are intertwined, making it challenging to prune individual directions without negatively affecting the model’s performance. However, after applying CLOVER’s orthogonal decomposition, only a small number of orthogonal bases on the left side exhibit significantly large norms. These vectors span almost the entire attention head’s space, and the remaining vectors have norms that approach zero, indicating that they are already represented by the dominant singular vectors and can be pruned with little loss of performance. Beyond the red intersection point, CLOVER’s remaining vectors exhibit consistently lower importance than those in Vanilla Pruning, meaning pruning these vectors results in less performance degradation. This demonstrates why CLOVER enables a higher pruning ratio.

A similar trend is observed for the V-O pair, although the model’s inherent sparsity is less pronounced than in the Q-K pair, making the effect less noticeable. Still, in most models, pruning half of the vectors has a smaller impact on performance compared to Vanilla Pruning. Notably, in CLIP-ViT-bigG (Figure 2e), a proportion of the vectors already have a norm of zero, allowing for safe pruning.
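The two importance scores compared in this section can be sketched on a toy head. The sizes and random weights below are assumptions for illustration only. The sketch checks that, after SVD-based orthogonalization, the per-dimension norm product equals the singular value, and that keeping the same number of dimensions gives a reconstruction error no larger than norm-based vanilla pruning (a consequence of the Eckart-Young theorem, since both results have rank at most `keep`).

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, keep = 128, 16, 8               # toy sizes; keep half of the head dimensions
W_q = rng.standard_normal((D, d))
W_k = rng.standard_normal((D, d))
W_qk = W_q @ W_k.T

# Vanilla: rank dimensions by the product of column L2 norms, keep the largest.
vanilla_score = np.linalg.norm(W_q, axis=0) * np.linalg.norm(W_k, axis=0)
idx = np.argsort(vanilla_score)[::-1][:keep]
W_vanilla = W_q[:, idx] @ W_k[:, idx].T

# CLOVER-style: orthogonalize with SVD, then keep the largest singular values.
U, S, Vt = np.linalg.svd(W_qk)
U_t = U[:, :d] * S[:d]                # absorb singular values into the left factor
V_t = Vt[:d, :]
clover_score = np.linalg.norm(U_t, axis=0) * np.linalg.norm(V_t, axis=1)
W_clover = U_t[:, :keep] @ V_t[:keep, :]

err_vanilla = float(np.linalg.norm(W_qk - W_vanilla))   # Frobenius error, vanilla pruning
err_clover = float(np.linalg.norm(W_qk - W_clover))     # Frobenius error, SVD truncation
print(err_vanilla, err_clover)
```

Random weights carry little redundancy, so the gap here is modest; the figures above suggest that trained heads concentrate far more of their norm in the leading directions.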
### 4.4. CLOVER for Training-Free Pruning
As suggested by the prominent low-rank structure in Figure 2c, we applied pruning to the Whisper-large-v3 model (Radford et al., 2023). To illustrate the effectiveness of CLOVER pruning, we present an example using an audio input from the LibriSpeech Long dataset (Gandhi et al., 2023). For reference, the waveform of this input is shown in Figure 3, and the corresponding target transcript is provided in Appendix A.4.

> [Figure 3 plot: audio waveform; x-axis: Samples; y-axis: Amplitude.]

Figure 3. An audio waveform from the LibriSpeech dataset.

After applying CLOVER to orthogonalize the vectors, we pruned vectors with magnitudes close to zero ($\|W_Q\|\|W_K\| \le 5 \times 10^{-3}$ and $\|W_V\|\|W_O^\top\| \le 6 \times 10^{-3}$). This pruning achieved ratios of 56.01% and 36.82% for the parameters in the Q-K pair and V-O pair, respectively. Remarkably, the model’s output remains nearly unchanged, with only one error, which is highlighted in the text using strikethrough and red for clarity:
 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter’s manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton’s work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell’s pictures are a sort of Up Guards and Adam paintings, and Mason’s exquisite idles are as national as a jingo poem. Mr. Birkett Foster’s landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And , and Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampooer in a Turkish bath, next man. 
 In contrast, using a vanilla pruning method with the same pruning ratio, the model completely fails to produce valid outputs: 
 ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... 
This example validates our earlier claim that straightforward pruning of non-zero dimensions can lead to accumulated loss. In contrast, CLOVER effectively eliminates linear redundancy, enabling a significantly higher pruning ratio. When the linear redundancy is sufficiently pronounced, CLOVER can even achieve a high pruning ratio without the need for fine-tuning to recover performance.
### 4.5. Necessity of Full-Direction Fine-Tuning
Besides pruning with a large ratio, CLOVER can learn linear combinations of all orthogonal vectors within each attention head, which lets it resemble full-parameter fine-tuning more closely. To highlight the advantages of updating all orthogonal bases, we randomly sampled 16 instances from the Commonsense dataset, fed them into the model, and performed SVD on the model’s weights. We then recorded the projection magnitudes of the input features along all orthogonal directions. Figure 4 visualizes the results for the middle layer, revealing the following insights:

1) Without accounting for the scaling effect of singular values, the projection magnitude along the principal singular vector consistently exceeds that in other directions. This observation supports PiSSA’s approach, which updates based on the principal singular values and vectors, leading to improved training performance. In contrast, LoRA projects in random directions, resulting in uniform projection magnitudes across all directions.

2) The singular values in the original model reflect the importance of each direction in the pretraining task. The model amplifies the components along directions with larger singular values and suppresses those along smaller singular values. Therefore, it is crucial to consider the scaling effect of singular values. As shown in Figure 4c, the projection magnitude along the principal singular vector direction increases to 18%.

3) While more data projections align with the principal singular vector at higher ranks, 82% of the feature components are still projected onto other directions. In extreme cases, if a task is entirely orthogonal to the vectors used by PiSSA, training on such a task may result in zero gradients, thereby limiting its learning capacity. Under the same rank constraint, 94% of the feature components in LoRA are projected outside the LoRA adapter, making it more susceptible to the zero-gradient problem.

Since CLOVER updates across all orthogonal directions, as shown in Figure 4d, it effectively mitigates this issue. Consequently, CLOVER outperforms both LoRA and PiSSA in multi-task learning, even when using the same or fewer learnable parameters (Section 4.2).

> [Figure 4 pie charts; shares per group of directions are listed below. Panel (d) CLOVER shows a single 100% slice.]

| Panel | Top 256 | Next 256 | Remaining | Bottom 256 |
|---|---|---|---|---|
| (a) PiSSA | 10% | 7% | 76% | 7% |
| (b) LoRA | 6% | 7% | 81% | 6% |
| (c) PiSSA with Singular Value | 18% | 10% | 65% | 7% |

Figure 4. Proportion of data projections across different components in random directions (LoRA) versus orthogonal directions (PiSSA), as well as all orthogonal directions (CLOVER).
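The projection shares discussed above can be illustrated with a toy computation. This is a sketch under strong assumptions: the features are isotropic Gaussian stand-ins rather than real activations, and `D`, `r`, `n`, and the helper `energy_share` are hypothetical names. It shows that a complete orthonormal basis captures all of the feature energy, while $r$ random directions capture roughly $r/D$ of it.

```python
import numpy as np

rng = np.random.default_rng(0)
D, r, n = 512, 32, 16                       # toy width, adapter rank, number of samples
X = rng.standard_normal((n, D))             # stand-in input features (isotropic by assumption)

# A complete orthonormal basis and an independent random r-dimensional subspace.
U, _ = np.linalg.qr(rng.standard_normal((D, D)))
Q, _ = np.linalg.qr(rng.standard_normal((D, r)))

def energy_share(features, basis):
    """Fraction of squared feature norm captured by the columns of `basis`."""
    return float(np.sum((features @ basis) ** 2) / np.sum(features ** 2))

share_all = energy_share(X, U)              # all orthogonal directions
share_subset = energy_share(X, U[:, :r])    # only r of the orthogonal directions
share_rand = energy_share(X, Q)             # r random directions
print(share_all, share_subset, share_rand, r / D)
```

If the layer width in Figure 4 is 4096 (an assumption here, matching the LLaMA-2-7B hidden size), 256 random directions would capture about 256/4096 = 6.25% of isotropic features, which is close to the 6% reported for LoRA. Real features are not isotropic, which is why the principal singular directions capture more than that in panels (a) and (c).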
### 4.6. Visualizing Rank Updates
To demonstrate that CLOVER achieves full-rank updates, we merge the updated singular values back into their corresponding singular vectors ($S_{QK}$ into the Key layer, $S_{VO}$ into the Value layer, and $S_{UD}$ into the Up layer) and perform SVD on the resulting weight change ∆W relative to the base model. We compare against LoRA and Full Fine-Tuning. Figure 5 shows the singular values for the middle layer in LLaMA-2-7B, revealing that CLOVER and Full Fine-Tuning achieve full-rank updates, while LoRA is constrained by its low-rank design.

> [Figure 5 plots. Panels: (a) Full Fine-Tuning, (b) LoRA, (c) CLOVER; the horizontal axis runs from 0 to 4000.]

Figure 5. ∆W is low-rank for LoRA but full-rank for Full Fine-Tuning and CLOVER.
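The rank contrast in Figure 5 can be reproduced on a toy layer. The sketch below is a simplified model of the mechanism, not the paper's implementation: sizes are small, weights are random, and the "learned" change to each head's $d \times d$ matrix is a random stand-in. It compares a LoRA-style update with a head-wise update in which the per-head singular vectors stay frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, d, r = 64, 4, 16, 4                  # toy sizes with H * d == D; r is the LoRA rank

# LoRA-style update: product of two thin matrices, so rank <= r.
delta_lora = rng.standard_normal((D, r)) @ rng.standard_normal((r, D))

# Head-wise update: per head, W^h = U^h S^h V^h^T with U^h and V^h frozen,
# and only the d x d matrix S^h changes.
W = rng.standard_normal((D, H * d))
blocks = []
for h in range(H):
    W_h = W[:, h * d:(h + 1) * d]
    U_h, S_h, Vt_h = np.linalg.svd(W_h, full_matrices=False)  # U_h: D x d, Vt_h: d x d
    delta_S = 0.01 * rng.standard_normal((d, d))              # stand-in for a learned change in S^h
    blocks.append(U_h @ delta_S @ Vt_h)
delta_headwise = np.concatenate(blocks, axis=1)               # D x (H * d)

rank_lora = int(np.linalg.matrix_rank(delta_lora))
rank_headwise = int(np.linalg.matrix_rank(delta_headwise))
print(rank_lora, rank_headwise)
```

Each head contributes a rank-$d$ block, and with $H$ heads the concatenated update can reach rank $H \cdot d = D$, whereas the LoRA update cannot exceed $r$.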
### 4.7. CLOVER Avoids Intrusive Dimensions
Recent research (Shuttleworth et al., 2024) has highlighted an issue with LoRA, referred to as the “intrusive dimensions” phenomenon. As illustrated in Figure 6b, LoRA introduces new random directions into the model, which possess large magnitudes and thus rank ahead of all the original singular vectors. The study suggests that these “intrusive dimensions” can degrade the model’s performance, exacerbating catastrophic forgetting during continual learning with LoRA. In contrast, CLOVER addresses this issue by fixing all orthogonal bases and updating only the vector combinations. As a result, the changes introduced by CLOVER fine-tuning closely resemble those generated by full-parameter fine-tuning, as shown in Figure 6a and Figure 6c.

> [Figure 6 plots. Panels: (a) Full Fine-Tuning, (b) LoRA, (c) CLOVER; both axes run from 0 to 400.]

Figure 6. The “intrusive dimensions” phenomenon appears in LoRA but not in Full Fine-Tuning or CLOVER.
 ## 5. Conclusion and Limitations 
In this paper, we introduce Cross-Layer Orthogonal Vectors (CLOVER), a method that orthogonalizes vectors within attention heads without requiring additional transformation matrices. This orthogonalization process condenses effective parameters into fewer vectors, improving the pruning ratio. By fine-tuning the singular values obtained through orthogonalization, CLOVER learns linear combinations of orthogonal bases, enabling full-rank updates. When applied to prune 50% of the attention head parameters in GPT-2-XL, CLOVER results in a perplexity that is just one-tenth of that achieved by standard pruning methods. For Whisper-Large-v3, CLOVER removes 46.42% of the parameters without fine-tuning, while preserving model performance. Furthermore, when used for fine-tuning, CLOVER outperforms state-of-the-art methods such as LoRA, DoRA, HiRA, and PiSSA, achieving superior results with equal or fewer trainable parameters. We also demonstrate how CLOVER removes linear redundancy to facilitate pruning and discuss the necessity of fine-tuning across all orthogonal bases. Visual comparisons of models fine-tuned with different methods further illustrate its effectiveness.

Despite its advantages, CLOVER has some limitations. When nonlinear operations are present between Q-K or V-O pairs (such as with the widely used RoPE (Su et al., 2024)), cross-layer orthogonalization is not feasible. In these cases, we instead perform head-wise orthogonalization within the Key layer during fine-tuning. Fortunately, CLOVER fine-tuning can still apply intra-layer attention head orthogonalization, while CLOVER pruning remains applicable to many popular models, including DeepSeek (DeepSeek-AI, 2024; Liu et al., 2024b) (which uses Decoupled RoPE), ViT and SDXL (which use absolute positional encoding), and BLOOM (Le Scao et al., 2023) (which employs ALiBi relative positional encoding (Press et al., 2021)). Additionally, as a newly proposed method, our current evaluation focuses primarily on basic pruning tasks and does not include comparisons with other state-of-the-art pruning techniques. However, because CLOVER does not alter the model structure and only updates the initialization method, it can be combined with existing pruning methods to further enhance their effectiveness.

As a novel technique, CLOVER holds considerable promise for future applications. For instance, it could be combined with quantization methods to eliminate outliers, guide pruning and fine-tuning based on data feature directions, or even inspire new model architectures.
 ## Impact Statement 
This paper proposes a cross-layer orthogonal initialization method to guide model pruning and efficient fine-tuning, offering valuable insights for the application and development of large models. Both application directions aim to reduce training and inference costs, lower computational overhead, decrease power consumption, and minimize carbon emissions.
 ## References 
Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024.

AI@Meta. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023. doi: 10.48550/arXiv.2307.09288. URL https://doi.org/10.48550/arXiv.2307.09288.

AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.

Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

Anonymous. HiRA: Parameter-efficient hadamard high-rank adaptation for large language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=TwJrTz9cRS.

Ansell, A., Ponti, E. M., Korhonen, A., and Vulić, I. Composable sparse fine-tuning for cross-lingual transfer. arXiv preprint arXiv:2110.07560, 2021.

Anthropic. Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet.

Asai, A., Salehi, M., Peters, M. E., and Hajishirzi, H. Attempt: Parameter-efficient multi-task tuning via attentional mixtures of soft prompts. arXiv preprint arXiv:2205.11961, 2022.

Ashkboos, S., Croci, M. L., Nascimento, M. G. d., Hoefler, T., and Hensman, J. Slicegpt: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024, 2024.

Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 7432–7439, 2020.

Brandon, W., Mishra, M., Nrusimha, A., Panda, R., and Kelly, J. R. Reducing transformer key-value cache size with cross-layer attention. arXiv preprint arXiv:2405.12981, 2024.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., and Jitsev, J. Reproducible scaling laws for contrastive language-image learning. arXiv preprint arXiv:2212.07143, 2022.

Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

De, S., Smith, S. L., Fernando, A., Botev, A., Cristian-Muraru, G., Gu, A., Haroun, R., Berrada, L., Chen, Y., Srinivasan, S., et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024.

DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434, 2024. URL https://doi.org/10.48550/arXiv.2405.04434.

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318–30332, 2022.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.

Fu, Q., Cho, M., Merth, T., Mehta, S., Rastegari, M., and Najibi, M. Lazyllm: Dynamic token pruning for efficient long context llm inference. arXiv preprint arXiv:2407.14057, 2024.

Fu, Z., Yang, H., So, A. M.-C., Lam, W., Bing, L., and Collier, N. On the effectiveness of parameter-efficient fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 12799–12807, 2023.

Gandhi, S., von Platen, P., and Rush, A. M. Distil-whisper: Robust knowledge distillation via large-scale pseudo labelling. arXiv preprint arXiv:2311.00430, 2023.

Gokaslan, A. and Cohen, V. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.

Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

Guo, D., Rush, A. M., and Kim, Y. Parameter-efficient transfer learning with diff pruning. arXiv preprint arXiv:2012.07463, 2020.

Guo, S., Xu, J., Zhang, L. L., and Yang, M. Compresso: Structured pruning with collaborative prompting learns compact large language models. arXiv preprint arXiv:2310.05015, 2023.

Hambardzumyan, K., Khachatrian, H., and May, J. Warp: Word-level adversarial reprogramming. arXiv preprint arXiv:2101.00121, 2021.

Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., and Gholami, A. Kvquant: Towards 10 million context length llm inference with kv cache quantization. arXiv preprint arXiv:2401.18079, 2024.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

Jiang, T., Huang, S., Luo, S., Zhang, Z., Huang, H., Wei, F., Deng, W., Sun, F., Zhang, Q., Wang, D., et al. Mora: High-rank updating for parameter-efficient fine-tuning. arXiv preprint arXiv:2405.12130, 2024.

Jo, H.-r. and Shin, D. A2sf: Accumulative attention scoring with forgetting factor for token pruning in transformer decoder. arXiv preprint arXiv:2407.20485, 2024.

Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp. 5156–5165. PMLR, 2020.

Lawton, N., Kumar, A., Thattai, G., Galstyan, A., and Steeg, G. V. Neural architecture search for parameter-efficient fine-tuning of large pre-trained language models. arXiv preprint arXiv:2305.16597, 2023.

Le Scao, T., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. Bloom: A 176b-parameter open-access multilingual language model. 2023.

Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.

Li, M., Lin, Y., Zhang, Z., Cai, T., Li, X., Guo, J., Xie, E., Meng, C., Zhu, J.-Y., and Han, S. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models. arXiv preprint arXiv:2411.05007, 2024a.

Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.

Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024b.

Lingam, V., Tejaswi, A., Vavre, A., Shetty, A., Gudur, G. K., Ghosh, J., Dimakis, A., Choi, E., Bojchevski, A., and Sanghavi, S. Svft: Parameter-efficient fine-tuning with singular vectors. arXiv preprint arXiv:2405.19597, 2024.

Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a.

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024b.

Liu, A., Liu, J., Pan, Z., He, Y., Haffari, G., and Zhuang, B. Minicache: Kv cache compression in depth dimension for large language models. arXiv preprint arXiv:2405.14366, 2024c.

Liu, S.-Y., Wang, C.-Y., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., and Chen, M.-H. Dora: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024d.

Liu, W., Qiu, Z., Feng, Y., Xiu, Y., Xue, Y., Yu, L., Feng, H., Liu, Z., Heo, J., Peng, S., et al. Parameter-efficient orthogonal finetuning via butterfly factorization. arXiv preprint arXiv:2311.06243, 2023a. Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., and Tang, J. Gpt understands, too. AI Open, 2023b. Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. Kivi: A tuning-free asym-metric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024e. 10 CLOVER: Cross-Layer Orthogonal Vectors 
 Ma, X., Fang, G., and Wang, X. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720, 2023.
 Meng, F., Wang, Z., and Zhang, M. Pissa: Principal singular values and singular vectors adaptation of large language models. arXiv preprint arXiv:2404.02948, 2024.
 Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models, 2016.
 Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
 Mistral. Cheaper, better, faster, stronger: Continuing to push the frontier of ai and making it accessible to all, 2024. URL https://mistral.ai/news/mixtral-8x22b.
 OpenAI. Hello GPT-4o, 2024. URL https://openai.com/index/hello-gpt-4o/.
 Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Grella, M., et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048, 2023.
 Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
 Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.
 Qiu, Z., Liu, W., Feng, H., Xue, Y., Feng, Y., Liu, Z., Zhang, D., Weller, A., and Schölkopf, B. Controlling text-to-image diffusion by orthogonal finetuning. Advances in Neural Information Processing Systems, 36:79320–79362, 2023.
 Qwen. Qwen2.5: A party of foundation models, 2024. URL https://qwenlm.github.io/blog/qwen2.5.
 Radford, A. Improving language understanding by generative pre-training. 2018.
 Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
 Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pp. 28492–28518. PMLR, 2023.
 Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
 Sap, M., Rashkin, H., Chen, D., LeBras, R., and Choi, Y. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.
 Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.
 Shuttleworth, R., Andreas, J., Torralba, A., and Sharma, P. Lora vs full fine-tuning: An illusion of equivalence. arXiv preprint arXiv:2410.21228, 2024.
 Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
 Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.
 Sun, Y., Dong, L., Zhu, Y., Huang, S., Wang, W., Ma, S., Zhang, Q., Wang, J., and Wei, F. You only cache once: Decoder-decoder architectures for language models. arXiv preprint arXiv:2405.05254, 2024.
 Sung, Y.-L., Nair, V., and Raffel, C. A. Training neural networks with fixed sparse masks. Advances in Neural Information Processing Systems, 34:24193–24205, 2021.
 Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024a.
 Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024b.
 Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
 Vu, T., Lester, B., Constant, N., Al-Rfou, R., and Cer, D. Spot: Better frozen model adaptation through soft prompt transfer. arXiv preprint arXiv:2110.07904, 2021.
 Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
 Wang, S., Yu, L., and Li, J. Lora-ga: Low-rank adaptation with gradient approximation. arXiv preprint arXiv:2407.05000, 2024a.
 Wang, Z., Panda, R., Karlinsky, L., Feris, R., Sun, H., and Kim, Y. Multitask prompt tuning enables parameter-efficient transfer learning. arXiv preprint arXiv:2303.02861, 2023.
 Wang, Z., Liang, J., He, R., Wang, Z., and Tan, T. Lora-pro: Are low-rank adapters properly optimized? arXiv preprint arXiv:2407.18242, 2024b.
 Xia, M., Gao, T., Zeng, Z., and Chen, D. Sheared llama: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.
 Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023.
 Xu, R., Luo, F., Zhang, Z., Tan, C., Chang, B., Huang, S., and Huang, F. Raise a child in large language model: Towards effective and generalizable fine-tuning. arXiv preprint arXiv:2109.05687, 2021.
 Yu, H., Yang, Z., Li, S., Li, Y., and Wu, J. Effectively compress kv heads for llm. arXiv preprint arXiv:2406.07056, 2024.
 Yuan, S., Liu, H., and Xu, H. Bridging the gap between low-rank and orthogonal adaptation via householder reflection adaptation. arXiv preprint arXiv:2405.17484, 2024.
 Zaken, E. B., Ravfogel, S., and Goldberg, Y. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.
 Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
 Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N., He, P., Cheng, Y., Chen, W., and Zhao, T. Adalora: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023.
 Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A., and Tian, Y. Galore: Memory-efficient llm training by gradient low-rank projection. arXiv preprint arXiv:2403.03507, 2024.
 Zhao, M., Lin, T., Mi, F., Jaggi, M., and Schütze, H. Masking as an efficient alternative to finetuning for pretrained language models. arXiv preprint arXiv:2004.12406, 2020.
 Zi, B., Qi, X., Wang, L., Wang, J., Wong, K.-F., and Zhang, L. Delta-lora: Fine-tuning high-rank parameters with the delta of low-rank matrices. arXiv preprint arXiv:2309.02411, 2023.
 Zuhri, Z. M. K., Adilazuarda, M. F., Purwarianti, A., and Aji, A. F. Mlkv: Multi-layer key-value heads for memory efficient transformer decoding. arXiv preprint arXiv:2406.09297, 2024.
 ## A. Appendix 
 A.1. Cross Layer Orthogonal Vectors in Value and Output layers 
 In the main text, we only presented the orthogonalization process for the Q-K pair. Here, we provide the method for orthogonalizing the V-O pair. Additionally, for up-down layers, the output dimension of the Up layer can be reshaped into block number × block size, followed by performing orthogonal decomposition within each block. 
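 $$
 \begin{aligned}
 Y &= \mathrm{attn}(Q, K)\, V W_O, & V &= X W_V \in \mathbb{R}^{b \times h \times n \times d} && (1) \\
 &= \mathrm{attn}(Q, K)\, X W_V W_O, & W_V W_O &= W_{VO} = U S V \in \mathbb{R}^{h \times D \times D} && (2) \\
 &= \mathrm{attn}(Q, K)\, X U S V, & S[:, r_{vo}:, r_{vo}:] &= 0,\ S_{VO} \in \mathbb{R}^{h \times r_{vo} \times r_{vo}},\ r_{vo} \le d && (3) \\
 &= \mathrm{attn}(Q, K)\, X U_{VO} S_{VO} V_{VO}, & U_{VO} &\in \mathbb{R}^{D \times h \times r_{vo}},\ V_{VO} \in \mathbb{R}^{h \times r_{vo} \times D} && (4)
 \end{aligned}
 $$

 Through this series of transformations, $W_V$ and $W_O$ can be equivalently replaced by orthogonal vectors $U_{VO}$ and $V_{VO}$, along with the diagonal matrix $S_{VO}$. Since $r_{vo} \le d$, the zero singular values and their corresponding singular vectors can be safely pruned. After guided pruning, $S_{VO}$ can be merged into $U_{VO}$ and $V_{VO}$, resulting in no additional computational overhead.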
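The V-O factorization can be checked numerically for a single head. The following is a minimal sketch, assuming NumPy, toy dimensions, and random weights; the attention factor is omitted because it multiplies both sides identically. Splitting the singular values as a square root on each side is an assumption made here for illustration; the text only says that the singular values can be merged into the two factors.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 16, 4, 5   # hypothetical model width, head dimension, token count

W_v = rng.standard_normal((D, d))   # value projection of one head
W_o = rng.standard_normal((d, D))   # output projection of the same head
X = rng.standard_normal((n, D))     # toy input activations

# Absorb the pair into one D x D matrix, then decompose it.
W_vo = W_v @ W_o
U, S, Vt = np.linalg.svd(W_vo)

# rank(W_vo) <= d, so singular values past index d are numerically zero
# and the matching singular vectors can be dropped.
r_vo = d
U_vo, S_vo, V_vo = U[:, :r_vo], S[:r_vo], Vt[:r_vo, :]

# Merge the singular values into the two factors (assumed sqrt split).
A = U_vo * np.sqrt(S_vo)             # shape (D, r_vo)
B = np.sqrt(S_vo)[:, None] * V_vo    # shape (r_vo, D)

original = X @ W_v @ W_o
factored = X @ A @ B
print(np.allclose(original, factored))
```

The columns of `U_vo` and the rows of `V_vo` are orthonormal, so the size of each entry of `S_vo` indicates how much its direction contributes, which is what a pruning criterion can act on.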
 A.2. Hyperparameters 
 Table 3 presents a comparison of hyperparameters for different fine-tuning methods on commonsense tasks. The target modules remain the same for LoRA, DoRA, HiRA, and PiSSA. However, DoRA introduces an additional magnitude module, leading to a slightly higher parameter count. In a single layer of LoRA, the trainable parameters are:
 Q = 4096 × 32 + 4096 × 32 
 K = 4096 × 32 + 4096 × 32 
 V = 4096 × 32 + 4096 × 32 
 Up = 4096 × 32 + 11008 × 32 
 Down = 4096 × 32 + 11008 × 32 
 The total sum is 1,753,088. In CLOVER, the trainable parameters are: 
 QK = 32 × 128 × 128 
 VO = 32 × 128 × 128
 UD = 172 × 64 × 64
 The total sum is also 1,753,088. Since CLOVER inserts trainable parameters across layers, we use the Q-K pair notation to represent its target modules. When CLOVER updates parameters within an attention head, the number of trainable parameters matches exactly that of LoRA at rank 32. To adjust the number of learnable parameters, CLOVER can either span multiple heads or split a single head into multiple blocks. Both PiSSA and CLOVER exhibit stable training performance. Therefore, instead of validating every 80 steps, we omit frequent validation, improving training efficiency.
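The two totals can be reproduced with a few lines of arithmetic. This sketch uses only the numbers quoted above; reading 32 as the number of heads, 128 as the head dimension, and 172 × 64 as a block split of the 11008-wide MLP is an interpretation of those numbers, not something the text states explicitly.

```python
hidden, mlp, rank = 4096, 11008, 32

# LoRA: each adapted matrix gets two low-rank factors.
lora = {
    "Q": hidden * rank + hidden * rank,
    "K": hidden * rank + hidden * rank,
    "V": hidden * rank + hidden * rank,
    "Up": hidden * rank + mlp * rank,
    "Down": hidden * rank + mlp * rank,
}
lora_total = sum(lora.values())

# CLOVER: one square trainable block per head (Q-K, V-O) or per block (U-D).
clover = {
    "QK": 32 * 128 * 128,
    "VO": 32 * 128 * 128,
    "UD": 172 * 64 * 64,
}
clover_total = sum(clover.values())

print(lora_total, clover_total)  # 1753088 1753088
```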
 A.3. Detail Information of Dataset 
 The commonsense reasoning tasks consist of 8 subtasks, each with predefined training and testing sets, as described by LLM-Adapters (Hu et al., 2023). Table 4 lists the details of each sub-dataset.
 > Table 3. Detailed training hyperparameters. Q-K, V-O, and U-D denote the pairs of orthogonal vectors that CLOVER updates.
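 | Method | Target | Evaluation steps | LR | Scheduler | Batch size | Warmup steps | Epochs |
 |---|---|---|---|---|---|---|---|
 | LoRA | Q,K,V,U,D | 80 | 3e-4 | Linear | 16 | 100 | 3 |
 | DoRA | Q,K,V,U,D | 80 | 2e-4 | Linear | 16 | 100 | 3 |
 | HiRA | Q,K,V,U,D | 80 | 2e-4/2e-4 | Linear | 32 | 100 | 3 |
 | PiSSA | Q,K,V,U,D | – | 2e-5 | Linear | 16 | 100 | 3 |
 | CLOVER | Q-K, V-O, U-D | – | 1e-4 | Linear | 16 | 100 | 3 |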
 > Table 4. Details of datasets for commonsense reasoning tasks.
 | Dataset | Train | Test | About |
 |---|---|---|---|
 | BoolQ (Clark et al., 2019) | 9,427 | 3,270 | Naturally occurring yes/no questions from unconstrained settings. |
 | PIQA (Bisk et al., 2020) | 16,113 | 1,838 | Questions with two solutions requiring physical commonsense. |
 | SIQA (Sap et al., 2019) | 33,410 | 1,954 | Reasoning about actions and social implications. |
 | HellaSwag (Zellers et al., 2019) | 39,905 | 10,042 | Commonsense NLI questions with context and endings. |
 | WinoGrande (Sakaguchi et al., 2021) | 40,398 | 1,267 | Fill-in-the-blank task with binary options. |
 | ARC-e (Clark et al., 2018) | 2,251 | 2,376 | Grade-school multiple-choice science questions in Easy sets. |
 | ARC-c (Clark et al., 2018) | 1,119 | 1,172 | Grade-school multiple-choice science questions in Challenge sets. |
 | OBQA (Mihaylov et al., 2018) | 4,957 | 500 | Questions requiring multi-step reasoning and commonsense knowledge. |
 For WinoGrande, the original dataset includes multiple partitions: [xs, s, m, l, xl, debiased]. While LLM-Adapters simply concatenated all these partitions, note that the “xl” partition actually includes all others, leading to extensive data duplication. After removing duplicates, the training data is reduced from 63.2K to 40.4K instances. Additionally, in the LLM-Adapters paper, the training set sizes of ARC Challenge and ARC Easy were reversed by mistake; here, we correct that error. 
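The deduplication step described above can be sketched as follows. This is an illustrative sketch only: the field names (`sentence`, `option1`, `option2`, `answer`), the helper `dedupe`, and the toy partitions are assumptions, not the authors' preprocessing code.

```python
def dedupe(partitions):
    """Concatenate partitions and keep the first copy of each example."""
    seen, unique = set(), []
    for examples in partitions.values():
        for ex in examples:
            key = (ex["sentence"], ex["option1"], ex["option2"], ex["answer"])
            if key not in seen:
                seen.add(key)
                unique.append(ex)
    return unique


# Toy stand-in: the "xl" partition contains everything in "xs".
ex_a = {"sentence": "A _ b.", "option1": "x", "option2": "y", "answer": "1"}
ex_b = {"sentence": "C _ d.", "option1": "p", "option2": "q", "answer": "2"}
toy_partitions = {"xs": [ex_a], "xl": [ex_a, ex_b]}

concatenated = sum(len(v) for v in toy_partitions.values())
deduplicated = dedupe(toy_partitions)
print(concatenated, len(deduplicated))  # 3 2
```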
 A.4. LibriSpeech Long dataset target transcript 
 Below is the reference text of the LibriSpeech Long dataset for comparison. 
 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter’s manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton’s work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell’s pictures are a sort of Up Guards and Adam paintings, and Mason’s exquisite idles are as national as a jingo poem. Mr. Birkett Foster’s landscapes smile at one much in the same way that Mr. Carker used to flash his teeth, and Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampooer in a Turkish bath, next man. 
 In fact, with Vanilla Pruning ratios of just 22.31% and 6.69% for $W_Q$-$W_K$ and $W_V$-$W_O$, respectively, the model's output is already significantly degraded.
 Mr. Colter is the personal of the classes, and we are glad to welcome his gospel. Nor is Mr. Colter’s manner less interesting than his manner. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similarly he is drawn from eating and its results occur most readily to the mind. He is very dull, so very frequently, and is very Greek after all, and can discover in it but little of Rocky Ithaca. The Nell’s pictures are sort of up-guard to Adam’s paintings, and Mason’s exquisite idylls are as national as a jingle poem. Mr. Burke and Foster’s landscapes smile at one much in the same way as Mr. Parker, Mr. Flash is tits. And Mr. John Collier gives his sitter a cheerful slap on the back before he says like a shampoo and a Turkish bath, Next man. 
 A.5. Visualizing more attention heads 
 In Section 4.3, we only presented the first attention head in the first layer. Here, we provide a broader view by showcasing more attention heads. Figure 7 illustrates the L2 norm of all Q-K heads in the first, middle, and last layers of Whisper-Large-v3. Figure 8 shows the L2 norm of all Q-K heads in the first, middle, and last layers of ViT-bigG. From these figures, we can observe that CLOVER consistently represents the entire attention head with fewer orthogonal bases across all layers and all attention heads. This property forms the foundation of CLOVER's effectiveness in enhancing pruning.
 [Figure 7 plots. Panels: Layer.0.qk, Layer.15.qk, Layer.31.qk. Legend: Absorb and Decompose, Vanilla. Horizontal axis ticks run from 0 to 1200.]
 Figure 7. The L2-norm for the 0-th, 15-th, and 31-st attention layers in the Whisper-large-v3 encoder. The blue line represents the results after redundancy removal using the CLOVER method, while the orange line depicts the L2-norm directly computed for each dimension. 
 [Figure 8 plots. Panels: Layer.0.qk, Layer.23.qk, Layer.47.qk. Legend: Absorb and Decompose, Vanilla. Horizontal axis ticks run from 0 to 1500.]
 Figure 8. The L2-norm for the 0-th, 23-rd, and 47-th attention layers in ViT-bigG. The blue line represents the results after redundancy removal using the CLOVER method, while the orange line depicts the L2-norm directly computed for each dimension.
@@ -0,0 +1,970 @@
 Title: 2404.03592v3.pdf
 URL Source: https://arxiv.org/pdf/2404.03592
 Published Time: Fri, 24 May 2024 01:13:51 GMT
 Number of Pages: 49
 Markdown Content:
 # ReFT: Representation Finetuning for Language Models 
 Zhengxuan Wu ∗† Aryaman Arora ∗† Zheng Wang † Atticus Geiger ‡
 Dan Jurafsky † Christopher D. Manning † Christopher Potts †
 †Stanford University ‡Pr(Ai)²R Group
 {wuzhengx,aryamana,peterwz,atticusg}@stanford.edu {jurafsky,manning,cgpotts}@stanford.edu 
 ## Abstract 
 Parameter-efficient finetuning (PEFT) methods seek to adapt large neural models via updates to a small number of weights. However, much prior interpretability work has shown that representations encode rich semantic information, suggesting that editing representations might be a more powerful alternative. We pursue this hypothesis by developing a family of Representation Finetuning (ReFT) methods. ReFT methods operate on a frozen base model and learn task-specific interventions on hidden representations. We define a strong instance of the ReFT family, Low-rank Linear Subspace ReFT (LoReFT), and we identify an ablation of this method that trades some performance for increased efficiency. Both are drop-in replacements for existing PEFTs and learn interventions that are 15×–65× more parameter-efficient than LoRA. We showcase LoReFT on eight commonsense reasoning tasks, four arithmetic reasoning tasks, instruction-tuning, and GLUE. In all these evaluations, our ReFTs deliver the best balance of efficiency and performance, and almost always outperform state-of-the-art PEFTs. We release a generic ReFT training library publicly at https://github.com/stanfordnlp/pyreft.
 ## 1 Introduction 
 Pretrained language models (LMs) are frequently finetuned to adapt them to new domains or tasks [Dai and Le, 2015]. With finetuning, a single base model can be adapted to a variety of tasks given only small amounts of in-domain data. However, finetuning large LMs is expensive. Parameter-efficient finetuning (PEFT) methods propose to address the high costs of full finetuning by updating a small number of weights. This reduces memory usage and training time, and PEFTs achieve similar performance to full finetuning in many settings [Hu et al., 2023]. A hallmark of current state-of-the-art PEFTs is that they modify weights rather than representations. However, much prior interpretability work has shown that representations encode rich semantic information, suggesting that editing representations might be a more powerful alternative to weight updates. In this paper, we pursue this hypothesis by developing and motivating Representation Finetuning (ReFT). Instead of adapting model weights, ReFT methods train interventions that manipulate a small fraction of model representations in order to steer model behaviors to solve downstream tasks at inference time. ReFT methods are drop-in replacements for weight-based PEFTs. This approach is inspired by recent work in LM interpretability that intervenes on representations to find faithful causal mechanisms [Geiger et al., 2023b] and to steer model behaviours at inference time [Turner et al., 2023, Li et al., 2024], and it can be seen as a generalisation of the representation-editing work of Wu et al. [2024a], Turner et al. [2023], and Zou et al. [2023] (see appendix B for details).
 ∗ Equal contribution. Preprint. Under review.
 arXiv:2404.03592v3 [cs.CL] 22 May 2024

 [Figure 1 plots. Panels: Commonsense (LLaMA 7B, LLaMA 13B, Llama-2 7B, Llama-3 8B); Instruct-tuning (Llama-2 7B); Arithmetic (LLaMA 7B, LLaMA 13B); GLUE (RoBERTa-base, RoBERTa-large). Axis labels: Parameters, Performance.]
 Figure 1: Parameter count vs. performance for LoReFT and other PEFTs across four benchmarks when applied to LLaMA, Llama-2, Llama-3, and RoBERTa models. Despite training far fewer parameters than existing PEFTs, LoReFT achieves competitive or even state-of-the-art performance on all tasks. Its value is most apparent for the largest models in our evaluations. Note: FT is full-parameter finetuning, which is not a PEFT or ReFT method. Additional results are in section 4.

 We focus on a strong and highly efficient instance of the ReFT family that we call Low-rank Linear Subspace ReFT (LoReFT). LoReFT is a parametrisation of ReFT that intervenes on hidden representations in the linear subspace spanned by a low-rank projection matrix, building directly on the distributed alignment search (DAS) method of Geiger et al. [2023b] and Wu et al. [2023]. We also identify an ablation of this method (DiReFT) that trades some performance for increased efficiency. We evaluate our ReFTs on LLaMA-family models and small-scale LMs against existing PEFTs on standard benchmarks from four domains: commonsense reasoning, arithmetic reasoning, instruction-following, and natural language understanding. Compared to LoRA, we find that LoReFT uses 15×–65× fewer parameters while achieving state-of-the-art performance on commonsense reasoning, instruction-following, and natural language understanding against the strongest PEFTs. These findings indicate that ReFT methods are worthy of further exploration, as they may emerge as more efficient and effective alternatives to weight-based PEFTs.
 ## 2 Related work 
 Parameter-efficient finetuning methods (PEFTs). PEFTs train a fraction of the model's parameters to adapt it to downstream tasks. We classify PEFTs into three categories:

 1. Adapter-based methods train additional modules (e.g. fully-connected layers) on top of the frozen pretrained model. Series adapters insert components between LM attention or MLP layers [Houlsby et al., 2019, Pfeiffer et al., 2020, Wang et al., 2022, He et al., 2022b, Fu et al., 2021], while parallel adapters add modules alongside existing components [He et al., 2022a]. Since adapters add new components that cannot be easily folded into existing model weights, they impose an additional burden at inference time.¹
 2. LoRA [Hu et al., 2022] and DoRA [Liu et al., 2024c] use low-rank matrices to approximate additive weight updates during training, and require no additional overhead during inference since the weight updates can be merged into the model. These are the strongest PEFTs currently.²
 3. Prompt-based methods add randomly-initialised soft tokens to the input (usually as a prefix) and train their embeddings while keeping the LM weights frozen [Li and Liang, 2021]. These methods are often far from optimal compared to other PEFTs, and come at the cost of significant inference overhead. A variant of this method where hidden-layer activations are also tuned was introduced as a baseline in Hu et al. [2022], with better performance.

 ¹ Several very recent papers introduce new adapter architectures but do not benchmark them on the tasks we consider, or they perform hyperparameter-tuning in a different setup than done in this work. These include: LLaMA-Adapter [Zhang et al., 2024b], LLaMA-Adapter v2 [Gao et al., 2023], Aligner [Ziheng et al., 2023].

 ² Additional methods not studied in this work: AutoLoRA [Zhang et al., 2024c], ResLoRA [Shi et al., 2024], SiRA [Zhu et al., 2023].
 Representation editing. Recent work on activation steering and representation engineering shows that adding fixed or task-specific steering vectors [Subramani et al., 2022, Turner et al., 2023, Zou et al., 2023, Liu et al., 2024b, Vogel, 2024, Li et al., 2024] or applying concept erasure [Ravfogel et al., 2022, Belrose et al., 2023, Avitan et al., 2024, Singh et al., 2024] to the residual stream can enable a degree of control over pretrained LM generations without the need for resource-intensive finetuning [Wu et al., 2024a]. The success of these methods affirms that representations induced by pretrained LMs carry rich semantic structure. 
 Interventional interpretability. Much recent work has used interventions on model-internal states to test hypotheses about how LMs implement various behaviours. In particular, interventions on linear subspaces of representations have provided increasing evidence that human-interpretable concepts are encoded linearly [Smolensky, 1986, Rumelhart et al., 1986, McClelland et al., 1986]. This includes linguistic features such as gender and number [Lasri et al., 2022, Wang et al., 2023, Hanna et al., 2023, Chintam et al., 2023, Yamakoshi et al., 2023, Hao and Linzen, 2023, Chen et al., 2023, Amini et al., 2023, Guerner et al., 2023, Arora et al., 2024], logical and mathematical reasoning [Wu et al., 2023], entity attributes [Huang et al., 2024], and a number of other domains [Mikolov et al., 2013, Elhage et al., 2022, Park et al., 2023, Nanda et al., 2023, Guerner et al., 2023]. 
 ## 3 ReFT 
 We now define the ReFT family of methods. To do this, we first summarize the core motivation, which emerges from work on intervention-based model interpretability. We then show how this leads directly to Low-rank Linear Subspace ReFT (LoReFT). Finally, we generalize this to a family of ReFT methods. Appendix A provides a brief overview of our generic ReFT training library.

 To keep the presentation simple, we assume throughout that our target model is a Transformer-based [Vaswani et al., 2017] LM that produces contextualised representations of sequences of tokens. Given a sequence of $n$ input tokens $\mathbf{x} = (x_1, \ldots, x_n)$, the model first embeds these into a list of representations $\mathbf{h}^{(0)} = (\mathbf{h}^{(0)}_1, \ldots, \mathbf{h}^{(0)}_n)$. Then, $m$ layers successively compute the $j$-th list of hidden representations $\mathbf{h}^{(j)}$ as a function of the previous list of hidden representations $\mathbf{h}^{(j-1)}$. Each hidden representation is a vector $\mathbf{h} \in \mathbb{R}^d$. The LM uses the final hidden representations $\mathbf{h}^{(m)}$ to produce its predictions. In our experiments, we consider both autoregressive LMs and masked LMs [Devlin et al., 2019]. An autoregressive LM predicts $p(x_{n+1} \mid x_1, \ldots, x_n) = \mathrm{softmax}(\mathbf{W}\mathbf{h}^{(m)}_n)$, while a masked LM predicts $p(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = \mathrm{softmax}(\mathbf{W}\mathbf{h}^{(m)}_i)$, where $\mathbf{W}$ is a learned matrix mapping from representations to logits over the vocabulary space.
 3.1 Motivation 
 In interpretability research, the framework of causal abstraction [Geiger et al., 2021] uses interchange interventions to establish the causal role of representations in deep learning models. An interchange intervention fixes a representation to the value it would take if a counterfactual input were processed by the model. Experiments investigating how such interventions affect model behavior form the evidence for claims about the causal role of a representation and the concept it encodes.

 To test whether a concept is encoded in a linear subspace of a representation, one may use a distributed interchange intervention (DII) [Geiger et al., 2023b].³ Let $\mathbf{b}$ be the hidden representation created at row $i$ and column $k$ when our model processes input $b$, and let $\mathbf{s}$ be the corresponding representation when that same model processes input $s$. A distributed interchange intervention on $\mathbf{b}$ given a counterfactual source representation $\mathbf{s}$ is then defined as

 $$\mathrm{DII}(\mathbf{b}, \mathbf{s}, \mathbf{R}) = \mathbf{b} + \mathbf{R}^\top(\mathbf{R}\mathbf{s} - \mathbf{R}\mathbf{b}) \tag{1}$$

 where $\mathbf{R} \in \mathbb{R}^{r \times d}$ is a low-rank projection matrix with orthonormal rows, $d$ is the representation dimensionality, and $r$ is the dimensionality of the subspace we are intervening on. We learn the subspace $\mathbf{R}$ using distributed alignment search (DAS), which finds the subspace that maximises the probability of the expected counterfactual output after intervention [Geiger et al., 2023b]. DAS is highly expressive, and can effectively localize concepts within model representations [Wu et al., 2023, Arora et al., 2024, Wu et al., 2024c, Huang et al., 2024]. This suggests that subspace representation interventions could also be a powerful tool for model control.

 ³ This notion of subspace intervention was also independently discovered by Guerner et al. [2023].

 [Figure 2 diagram. Left panel: ReFT Intervention. Right panel: LoReFT, with labels "edit subspace (rows of R)", "edit vector", and "edit restricted to subspace".]

 Figure 2: Illustration of ReFT. (1) The left panel depicts an intervention $I$: the intervention function $\Phi$ is applied to hidden representations at positions $P$ in layer $l$. (2) The right panel depicts the intervention function used in LoReFT, which finds an edit vector that only modifies the representation in the linear subspace spanned by the rows of $\mathbf{R}$. Specifically, we show how a rank-2 LoReFT operates on 3-dimensional hidden representations.
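Eq. (1) can be checked numerically: after the intervention, the coordinates of the result inside the subspace equal those of the source, while the component orthogonal to the rows of R is unchanged. The following is a minimal sketch, assuming NumPy, toy dimensions, and a random orthonormal R built by QR factorization; in DAS the subspace is learned, which is not shown here.

```python
import numpy as np


def dii(b, s, R):
    """Distributed interchange intervention, eq. (1): b + R^T (R s - R b)."""
    return b + R.T @ (R @ s - R @ b)


rng = np.random.default_rng(0)
d, r = 8, 3   # hypothetical representation size and subspace rank

# Orthonormal rows: take the Q factor of a random d x r matrix and transpose.
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
R = Q.T       # shape (r, d), with R @ R.T equal to the r x r identity

b = rng.standard_normal(d)   # base representation
s = rng.standard_normal(d)   # counterfactual source representation
out = dii(b, s, R)

# Inside the subspace, the result takes the source's coordinates.
in_subspace_ok = np.allclose(R @ out, R @ s)

# Outside the subspace, the base representation is untouched.
P = R.T @ R   # orthogonal projector onto the row space of R
outside_ok = np.allclose(out - P @ out, b - P @ b)
print(in_subspace_ok, outside_ok)
```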
 3.2 Two low-rank ReFT instantiations

 LoReFT. The formulation of DII in eq. (1) immediately suggests a way to control model generations via interventions. The guiding intuition is that we can learn how to perform interventions that steer the model towards predicting our task labels. The resulting method, Low-rank Linear Subspace ReFT (LoReFT), is defined by the following variant of eq. (1):

 $$\Phi_{\mathrm{LoReFT}}(\mathbf{h}) = \mathbf{h} + \mathbf{R}^\top(\mathbf{W}\mathbf{h} + \mathbf{b} - \mathbf{R}\mathbf{h}) \tag{2}$$

 This is identical to eq. (1), except we use a learned projected source $\mathbf{R}\mathbf{s} = \mathbf{W}\mathbf{h} + \mathbf{b}$. LoReFT thus edits the representation in the $r$-dimensional subspace spanned by the rows of $\mathbf{R}$ to take on the values obtained from our linear projection $\mathbf{W}\mathbf{h} + \mathbf{b}$. We depict this operation in fig. 2. The learned parameters are $\phi = \{\mathbf{R}, \mathbf{W}, \mathbf{b}\}$; the parameters of the LM are frozen. As with DII, $\mathbf{R} \in \mathbb{R}^{r \times d}$ is a low-rank matrix with orthonormal rows where $d$ is the hidden-state dimensionality and $r \le d$ is the rank of the subspace. We further define a linear projection $\mathbf{W} \in \mathbb{R}^{r \times d}$ and bias vector $\mathbf{b} \in \mathbb{R}^r$.

 DiReFT. In addition, we define an ablation of LoReFT which removes the orthogonality constraint and the difference operation, reducing training time:

 $$\Phi_{\mathrm{DiReFT}}(\mathbf{h}) = \mathbf{h} + \mathbf{W}_2^\top(\mathbf{W}_1\mathbf{h} + \mathbf{b}) \tag{3}$$

 Both $\mathbf{W}_1, \mathbf{W}_2 \in \mathbb{R}^{r \times d}$ are low-rank projection matrices. Note that eq. (3) resembles LoRA, and thus DiReFT can be thought of as LoRA applied directly to hidden representations at certain positions.⁴ Empirical evidence from previous work suggests that adding orthogonal constraints to LoRA weights increases performance [Liu et al., 2024d]. (Appendix E reports results for additional ablations of LoReFT.)
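Eqs. (2) and (3) translate directly into a few lines of linear algebra. The following is a minimal sketch, assuming NumPy, toy dimensions, and random (untrained) parameters; it is not the pyreft implementation, and how the orthonormality of R is maintained during training is not shown.

```python
import numpy as np


def loreft(h, R, W, b):
    """Eq. (2): h + R^T (W h + b - R h)."""
    return h + R.T @ (W @ h + b - R @ h)


def direft(h, W1, W2, b):
    """Eq. (3): h + W2^T (W1 h + b)."""
    return h + W2.T @ (W1 @ h + b)


rng = np.random.default_rng(0)
d, r = 8, 2   # hypothetical hidden size and rank

Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
R = Q.T                              # orthonormal rows, shape (r, d)
W = rng.standard_normal((r, d))      # learned linear projection
bias = rng.standard_normal(r)        # learned bias
h = rng.standard_normal(d)           # one hidden representation

edited = loreft(h, R, W, bias)

# The edited representation's subspace coordinates equal W h + b.
subspace_matches = np.allclose(R @ edited, W @ h + bias)

# With W = R and b = 0, the learned source equals R h and nothing changes.
identity_case = np.allclose(loreft(h, R, R, np.zeros(r)), h)

# Trainable parameters per LoReFT intervention: R, W, and b, i.e. 2rd + r.
loreft_params = R.size + W.size + bias.size

# DiReFT reuses the same shapes, with no orthogonality requirement.
direft_out = direft(h, W, rng.standard_normal((r, d)), bias)
print(subspace_matches, identity_case, loreft_params)
```

The count `2rd + r` per intervention follows from the shapes given in the text; it grows with the hidden size and the rank only, not with the size of any weight matrix in the frozen LM.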
 Training objective. We consider both generation tasks using decoder-only or encoder–decoder LMs and classification tasks using encoder-only models. The pretrained language model induces a distribution over token sequences $p(\cdot)$. We denote the model that results from the ReFT intervention $\Phi$ on $p(\cdot)$ as $p_\Phi(\cdot)$ with trainable parameters $\phi$. To simplify notation, we refer to the hidden representations produced by the LM on input $\mathbf{x}$ as $\mathbf{h}(\mathbf{x})$, and those by the intervened LM as $\mathbf{h}_\Phi(\mathbf{x})$.

 For generation tasks, our training objective is language modelling. Given an input sequence $\mathbf{x} = (x_1, \ldots, x_n)$ with $n$ tokens as the prompt, the goal is to predict the output sequence $\mathbf{y} = (y_1, \ldots, y_m)$ with $m$ tokens. We minimise the cross-entropy loss with teacher-forcing over all output positions.

 $$\min_{\phi}\left\{-\sum_{i=1}^{m} \log p_\Phi(y_i \mid \mathbf{x}\mathbf{y}_{<i})\right\} \tag{4}$$

 For single-label classification tasks, we add a classification head $H_\theta(\cdot)$ with parameters $\theta$ that takes the final-layer representation at the first token (CLS) as input and outputs a distribution over classes. $H$ has the learned parameters $\theta = \{\mathbf{W}_o, \mathbf{b}_o, \mathbf{W}_d, \mathbf{b}_d\}$.

 $$H_\theta(\cdot \mid \mathbf{h}) = \mathrm{softmax}\left(\mathbf{W}_o\left(\tanh\left(\mathbf{W}_d \mathbf{h}^{(m)}_1 + \mathbf{b}_d\right)\right) + \mathbf{b}_o\right) \tag{5}$$

 We learn the parameters of the head and those of the intervention function $\Phi$. We minimise the cross-entropy loss of the target class $y$ given input $\mathbf{x}$:

 $$\min_{\phi, \theta}\left\{-\log H_\theta(y \mid \mathbf{h}_\Phi(\mathbf{x}))\right\} \tag{6}$$

 ⁴ LoRA is not applicable to the residual stream, which is weightless. LoRA can be configured to apply only to the attention layer output projection matrix, which is similar to our residual stream intervention. However, previous works found that applying LoRA only to attention layers is sub-optimal [Hu et al., 2023].
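The two objectives can be written out for a single example. The following is a minimal sketch, assuming NumPy and that the per-position logits and the first-token representation have already been produced by the intervened model under teacher forcing; the function names and shapes are illustrative, not taken from the authors' code.

```python
import numpy as np


def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def generation_loss(logits, targets):
    """Eq. (4): summed negative log-likelihood over the m output positions.

    logits: (m, vocab) array, one row per output position.
    targets: length-m integer array of gold token ids y_1..y_m.
    """
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].sum()


def classification_log_probs(h_cls, W_d, b_d, W_o, b_o):
    """Eq. (5): log of softmax(W_o tanh(W_d h + b_d) + b_o)."""
    return log_softmax(W_o @ np.tanh(W_d @ h_cls + b_d) + b_o)


def classification_loss(h_cls, y, W_d, b_d, W_o, b_o):
    """Eq. (6): negative log-probability of the target class y."""
    return -classification_log_probs(h_cls, W_d, b_d, W_o, b_o)[y]


# Sanity check: uniform logits give m * log(vocab).
uniform_loss = generation_loss(np.zeros((3, 5)), np.array([0, 1, 2]))

rng = np.random.default_rng(0)
d, k, c = 6, 4, 3   # hypothetical hidden size, head width, class count
head = (rng.standard_normal((k, d)), rng.standard_normal(k),
        rng.standard_normal((c, k)), rng.standard_normal(c))
class_log_probs = classification_log_probs(rng.standard_normal(d), *head)
```

In training, gradients of these losses would flow only into the intervention parameters (and the head, for classification), since the LM weights are frozen.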
 3.3 The ReFT family of methods 
It is straightforward to generalise the above intervention functions to define a family of intervention-based representation finetuning methods. We first define a general notion of intervention, i.e. the modification of hidden representations during the model forward pass:

Definition 3.1. An intervention I is a tuple ⟨Φ, P, l⟩ that encapsulates a single inference-time modification of the representations computed by a Transformer-based LM. The three components of an intervention are (1) the intervention function Φ ∶ R^d → R^d with learned parameters ϕ, (2) a set of input positions P ⊆ {1, . . . , n} that the intervention is applied to, and (3) the layer l ∈ {1, . . . , m} at which the intervention is applied. We implement the intervention I as the following operation that overwrites some representations h:

$$h^{(l)} \leftarrow \left(\Phi\left(h^{(l)}_{p}\right) \text{ if } p \in P \text{ else } h^{(l)}_{p}\right)_{p \in 1,\ldots,n} \tag{7}$$

The intervention is applied immediately after the computation of h(l) and thus affects the representations computed in later layers h(l+1), . . . , h(m).

Figure 2 provides a schematic overview of an intervention. A ReFT is then defined as a constrained set of non-overlapping interventions:
Definition 3.2. A ReFT method is a set of f interventions I = {I1, . . . , If}. We enforce that for any two interventions Ij, Ik ∈ I such that they operate on the same layer lj = lk, their intervention positions must be disjoint, i.e. Pj ∩ Pk = ∅. The parameters (ϕ1, . . . , ϕf) of all of the intervention functions are independent.

ReFT is thus a generic framework encompassing interventions on hidden representations during the model forward pass. In appendix B, we show how a variety of existing inference-time intervention methods can be described within this framework.
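Definitions 3.1 and 3.2 can be sketched in a few lines of plain Python. This is an illustrative sketch only: the names `Intervention`, `validate_reft`, and `apply_at_layer` are hypothetical and are not the pyreft API, and hidden states are modelled as lists of floats rather than tensors.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Sequence

Vector = List[float]

@dataclass(frozen=True)
class Intervention:
    fn: Callable[[Vector], Vector]  # Phi: R^d -> R^d
    positions: FrozenSet[int]       # P, 1-indexed token positions
    layer: int                      # l

def validate_reft(interventions: Sequence[Intervention]) -> None:
    """Raise if two interventions on the same layer share a position (Definition 3.2)."""
    seen = {}
    for iv in interventions:
        used = seen.setdefault(iv.layer, set())
        overlap = used & iv.positions
        if overlap:
            raise ValueError(f"layer {iv.layer}: positions {sorted(overlap)} are reused")
        used |= iv.positions

def apply_at_layer(hidden: List[Vector], layer: int,
                   interventions: Sequence[Intervention]) -> List[Vector]:
    """Equation 7: replace h_p by Phi(h_p) for p in P; other positions are copied unchanged."""
    out = [list(h) for h in hidden]
    for iv in interventions:
        if iv.layer != layer:
            continue
        for p in iv.positions:
            if 1 <= p <= len(hidden):
                out[p - 1] = iv.fn(hidden[p - 1])
    return out

# Hypothetical example: double the representation at positions 1 and 3 of layer 2.
double = Intervention(fn=lambda h: [2 * v for v in h], positions=frozenset({1, 3}), layer=2)
h = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
validate_reft([double])
h_new = apply_at_layer(h, layer=2, interventions=[double])
```

Because positions on the same layer are required to be disjoint, the order in which interventions are applied within a layer does not matter in this sketch.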
 ## 4 Experiments 
To evaluate our ReFTs against existing PEFTs, we conduct experiments across four diverse NLP benchmarks covering more than 20 datasets (extensive details on our datasets are in appendix C). Our goal is to provide a rich picture of how LoReFT and DiReFT perform in different scenarios. We experiment with both masked and autoregressive LMs at different scales, ranging from RoBERTa-base [Liu et al., 2019] with 125M to LLaMA models [Touvron et al., 2023a,b] with 13B parameters. We benchmark against existing PEFTs such as prefix-tuning [Li and Liang, 2021], adapter-tuning with both Series Adapters and Parallel Adapters, BitFit [Ben Zaken et al., 2022], RED [Wu et al., 2024a], LoRA [Hu et al., 2022], and DoRA [Liu et al., 2024c]. Our comparisons focus on both performance and parameter efficiency. In our comparisons, we use hyperparameter-tuned scores from previous works when possible. We load our base LMs in torch.bfloat16 to save memory. All of our experiments are run with a single GPU: NVIDIA A100 40G/80G or RTX 6000. Examples of raw model generations are in appendix I.
4.1 Hyperparameter configuration
For our experiments, we must decide how many interventions to learn and which layers and input positions to apply each one on. We propose learning interventions on a fixed number of p prefix and s suffix positions in the prompt. Specifically, we tune four hyperparameters:

1. The number of prefix positions p to intervene on, i.e. positions {1, . . . , p}.
2. The number of suffix positions s to intervene on, i.e. positions {n − s + 1, . . . , n}.
3. Which set of layers L to intervene on.
4. Whether or not to tie intervention parameters ϕ across different positions in the same layer.

This simplifies the hyperparameter search space; compared to LoRA, the only additional consideration is which positions to intervene on. Since the number of positions edited is constant, LoReFT and DiReFT contribute a fixed additional inference cost that does not scale with prompt length. Given the positions P = {1, . . . , p} ∪ {n − s + 1, . . . , n}, we define the untied and tied variants:

$$I_{\text{untied}} = \{\langle \Phi, \{p\}, l \rangle \mid p \in P,\ l \in L\} \qquad I_{\text{tied}} = \{\langle \Phi, P, l \rangle \mid l \in L\}$$

Additionally, when applying LoReFT and DiReFT to a prompt with length n where n < p + s, we set p ← min(p, ⌊n/2⌋) and s ← min(s, ⌈n/2⌉) and do not apply the truncated interventions in I_untied.

We also tune neural-network training hyperparameters. Unlike previous work [Hu et al., 2022, 2023, Liu et al., 2024c] where hyperparameter tuning may involve optimising performance directly on test sets, we only tune our hyperparameters on development sets which do not contain any overlapping examples with the test sets of our tasks. We further describe hyperparameter tuning for each benchmark in appendix D.1.
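The position selection and the tied/untied variants described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `intervention_positions` and `build_interventions` are hypothetical, and each returned (positions, layer) pair stands in for one intervention with its own parameter set ϕ.

```python
def intervention_positions(n: int, p: int, s: int) -> list:
    """Return the 1-indexed prefix and suffix positions for a prompt of length n."""
    if n < p + s:
        # Short prompt: shrink p and s so that prefix and suffix cannot overlap.
        p = min(p, n // 2)        # floor(n / 2)
        s = min(s, n - n // 2)    # ceil(n / 2)
    prefix = list(range(1, p + 1))
    suffix = list(range(n - s + 1, n + 1))
    return prefix + suffix

def build_interventions(positions, layers, tied: bool):
    """Enumerate (positions, layer) pairs; each pair owns one parameter set phi."""
    if tied:
        # One intervention per layer, shared by every selected position.
        return [(frozenset(positions), layer) for layer in layers]
    # One intervention per (position, layer) combination.
    return [(frozenset({pos}), layer) for layer in layers for pos in positions]

positions = intervention_positions(n=10, p=3, s=2)
tied = build_interventions(positions, layers=[1, 2], tied=True)
untied = build_interventions(positions, layers=[1, 2], tied=False)
```

The sketch makes the cost argument concrete: the number of edited positions is bounded by p + s regardless of n, so the number of intervention applications per layer does not grow with prompt length.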
 4.2 Commonsense reasoning 
We replicate the experimental setup in Hu et al. [2023] and finetune LLaMA-1 7B/13B, Llama-2 7B, and Llama-3 8B⁵ on a combined dataset of eight commonsense reasoning tasks (COMMONSENSE170K). We report scores on each task’s test set individually. We compare with PEFTs benchmarked in Hu et al. [2023] as well as the identical experiment reported in Liu et al. [2024c] for DoRA.
 Datasets. Our benchmark contains eight commonsense reasoning datasets, including BoolQ [Clark et al., 2019], PIQA [Bisk et al., 2020], SIQA [Sap et al., 2019], HellaSwag [Zellers et al., 2019], WinoGrande [Sakaguchi et al., 2021], ARC-e, ARC-c [Clark et al., 2018], and OBQA [Mihaylov et al., 2018]. Examples are formulated as multiple-choice problems where the model needs to directly generate the correct choice without rationales. We use the same prompt template as in Hu et al. [2023] with additional string normalisation (removing leading and trailing whitespace). 
Hyperparameter tuning. We do not do hyperparameter selection based on test set results. Rather, we use the hyperparameter settings of the model that performs best on a development set created from the GSM8K training set, except we use a lower number of epochs (6 instead of 12) because the COMMONSENSE170K training set is more than 20 times larger than GSM8K. This allows us to tune relevant hyperparameters, and also serves to test the robustness of these settings across different domains. We additionally report scores on 3 epochs in appendix D.3.
Results. We report results in table 1. LoReFT sets state-of-the-art performance on the commonsense reasoning tasks, outperforming all other methods by a considerable margin. DiReFT is more compute-efficient and performs consistently only slightly worse than LoReFT.
 4.3 Arithmetic reasoning 
Similar to the previous experiment, we follow the experimental setup in Hu et al. [2023] and finetune LLaMA-1 7B and 13B on a combined dataset of seven arithmetic reasoning tasks with LM-generated chain-of-thought steps (MATH10K) and report scores on four of the tasks’ test sets. We only evaluate correctness on the final numeric or multiple-choice answer.
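Scoring only the final numeric answer of a chain-of-thought generation can be sketched as below. This is a hypothetical illustration, not the evaluation script used for these experiments: the rule "take the last number in the text" and the helper names `extract_final_number` and `is_correct` are assumptions made for the example.

```python
import re

def extract_final_number(generation: str):
    """Return the last number in the text as a float, or None if there is none."""
    # Drop thousands separators so that "1,250" is read as a single number.
    matches = re.findall(r"-?\d+(?:\.\d+)?", generation.replace(",", ""))
    return float(matches[-1]) if matches else None

def is_correct(generation: str, gold: float, tol: float = 1e-6) -> bool:
    """Compare only the final extracted number against the gold answer."""
    pred = extract_final_number(generation)
    return pred is not None and abs(pred - gold) <= tol

example = "She has 3 apples and buys 4 more, so 3 + 4 = 7. The answer is 7"
correct = is_correct(example, gold=7.0)
```

Intermediate reasoning steps are ignored by design, so a generation with a flawed rationale but the right final number still counts as correct under this kind of metric.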
> ⁵ Llama-3 8B appeared on April 18, 2024, and thus we had time to complete only commonsense reasoning experiments with this model. Liu et al. [2024c] report corresponding results for LoRA and DoRA.

Table 1: Accuracy comparison of LLaMA-1 7B/13B, Llama-2 7B and Llama-3 8B against existing PEFT methods on eight commonsense reasoning datasets. ∗Performance results of all baseline methods are taken from Liu et al. [2024c]. We report averaged performance of three runs with distinct random seeds for our method. For our methods, Param. (%) is calculated by dividing the number of trainable parameters by the number of parameters of the base LM. All task columns report accuracy (↑).

| Model | PEFT | Params (%) | BoolQ | PIQA | SIQA | HellaS. | WinoG. | ARC-e | ARC-c | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ChatGPT ∗ | - | - | 73.1 | 85.4 | 68.5 | 78.5 | 66.1 | 89.8 | 79.9 | 74.8 | 77.0 |
| LLaMA-7B | PrefT ∗ | 0.039% | 64.3 | 76.8 | 73.9 | 42.1 | 72.1 | 72.9 | 54.0 | 60.6 | 64.6 |
| | Adapter S ∗ | 1.953% | 63.0 | 79.2 | 76.3 | 67.9 | 75.7 | 74.5 | 57.1 | 72.4 | 70.8 |
| | Adapter P ∗ | 3.542% | 67.9 | 76.4 | 78.8 | 69.8 | 78.9 | 73.7 | 57.3 | 75.2 | 72.3 |
| | LoRA ∗ | 0.826% | 68.9 | 80.7 | 77.4 | 78.1 | 78.8 | 77.8 | 61.3 | 74.8 | 74.7 |
| | DoRA (half) ∗ | 0.427% | 70.0 | 82.6 | 79.7 | 83.2 | 80.6 | 80.6 | 65.4 | 77.6 | 77.5 |
| | DoRA ∗ | 0.838% | 68.5 | 82.9 | 79.6 | 84.8 | 80.8 | 81.4 | 65.8 | 81.0 | 78.1 |
| | DiReFT (ours) | 0.031% | 69.5 | 83.0 | 79.0 | 92.5 | 80.5 | 82.2 | 68.0 | 77.5 | 79.0 |
| | LoReFT (ours) | 0.031% | 69.3 | 84.4 | 80.3 | 93.1 | 84.2 | 83.2 | 68.2 | 78.9 | 80.2 |
| LLaMA-13B | PrefT ∗ | 0.031% | 65.3 | 75.4 | 72.1 | 55.2 | 68.6 | 79.5 | 62.9 | 68.0 | 68.4 |
| | Adapter S ∗ | 1.586% | 71.8 | 83.0 | 79.2 | 88.1 | 82.4 | 82.5 | 67.3 | 81.8 | 79.5 |
| | Adapter P ∗ | 2.894% | 72.5 | 84.9 | 79.8 | 92.1 | 84.7 | 84.2 | 71.2 | 82.4 | 81.5 |
| | LoRA ∗ | 0.670% | 72.1 | 83.5 | 80.5 | 90.5 | 83.7 | 82.8 | 68.3 | 82.4 | 80.5 |
| | DoRA (half) ∗ | 0.347% | 72.5 | 85.3 | 79.9 | 90.1 | 82.9 | 82.7 | 69.7 | 83.6 | 80.8 |
| | DoRA ∗ | 0.681% | 72.4 | 84.9 | 81.5 | 92.4 | 84.2 | 84.2 | 69.6 | 82.8 | 81.5 |
| | DiReFT (ours) | 0.025% | 71.3 | 86.1 | 80.8 | 94.6 | 83.6 | 85.5 | 72.9 | 82.7 | 82.2 |
| | LoReFT (ours) | 0.025% | 72.1 | 86.3 | 81.8 | 95.1 | 87.2 | 86.2 | 73.7 | 84.2 | 83.3 |
| Llama-2 7B | LoRA ∗ | 0.826% | 69.8 | 79.9 | 79.5 | 83.6 | 82.6 | 79.8 | 64.7 | 81.0 | 77.6 |
| | DoRA (half) ∗ | 0.427% | 72.0 | 83.1 | 79.9 | 89.1 | 83.0 | 84.5 | 71.0 | 81.2 | 80.5 |
| | DoRA ∗ | 0.838% | 71.8 | 83.7 | 76.0 | 89.1 | 82.6 | 83.7 | 68.2 | 82.4 | 79.7 |
| | DiReFT (ours) | 0.031% | 70.8 | 83.6 | 80.2 | 93.6 | 82.1 | 84.8 | 70.4 | 81.5 | 80.9 |
| | LoReFT (ours) | 0.031% | 71.1 | 83.8 | 80.8 | 94.3 | 84.5 | 85.6 | 72.2 | 82.3 | 81.8 |
| Llama-3 8B | LoRA ∗ | 0.700% | 70.8 | 85.2 | 79.9 | 91.7 | 84.3 | 84.2 | 71.2 | 79.0 | 80.8 |
| | DoRA (half) ∗ | 0.361% | 74.5 | 88.8 | 80.3 | 95.5 | 84.7 | 90.1 | 79.1 | 87.2 | 85.0 |
| | DoRA ∗ | 0.710% | 74.6 | 89.3 | 79.9 | 95.5 | 85.6 | 90.5 | 80.4 | 85.8 | 85.2 |
| | DiReFT (ours) | 0.026% | 73.4 | 88.7 | 81.0 | 95.6 | 85.5 | 91.8 | 81.8 | 85.4 | 85.4 |
| | LoReFT (ours) | 0.026% | 75.1 | 90.2 | 82.0 | 96.3 | 87.4 | 92.4 | 81.6 | 87.5 | 86.6 |
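The Param. (%) column is a ratio of trainable to base-model parameters, which can be worked through numerically. The sketch below rests on several assumptions that are not stated in this section: that each LoReFT intervention holds two rank × hidden-dimension matrices plus a rank-sized bias, and that the configuration is rank 8 with one tied intervention on each of 32 layers of a model with hidden size 4096 and roughly 6.74B parameters. The function names are hypothetical.

```python
def loreft_param_count(hidden_dim: int, rank: int, n_interventions: int) -> int:
    """Assumed LoReFT size: two (rank x hidden_dim) matrices and one rank-sized bias each."""
    per_intervention = 2 * rank * hidden_dim + rank
    return n_interventions * per_intervention

def param_percent(trainable: int, base_params: int) -> float:
    """Trainable parameters as a percentage of the base LM's parameters."""
    return 100.0 * trainable / base_params

# Illustrative values only (assumptions, not taken from this section).
trainable = loreft_param_count(hidden_dim=4096, rank=8, n_interventions=32)
pct = param_percent(trainable, base_params=6_738_415_616)
```

Under these assumed values the ratio comes out at roughly 0.03%, the same order of magnitude as the ReFT rows for the 7B models in table 1; the exact configurations are described in the paper's appendices rather than here.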
Hyperparameter tuning. We use the same hyperparameter settings as for the Commonsense Reasoning benchmark, but with 12 epochs for training. We also report scores on 3 epochs.
Datasets. Our benchmark contains four datasets for math word problems, including AQuA [Ling et al., 2017], GSM8K [Cobbe et al., 2021], MAWPS [Koncel-Kedziorski et al., 2016], and SVAMP [Patel et al., 2021]. Models need to generate chain-of-thought [Wei et al., 2022] before the final answer. We use the same prompt template and hyperparameter settings as in the previous experiment.
 Results. We report results in table 2. We find that both LoReFT and DiReFT do not perform as well at arithmetic reasoning tasks compared to LoRA and adapters, but do outperform prefix-tuning. Our results suggest that our ReFTs may have more trouble on chain-of-thought reasoning than the single-step commonsense reasoning tasks due to the length of generations (greater length necessarily reduces the effect of the intervention) and overall greater difficulty of the task. Our results show that our ReFTs perform better with the 13B model than the 7B model, which suggests that our methods scale with model size. Overall, we note that the arithmetic reasoning results show a lot of variation, with no single method emerging as a clear winner across all of them. 
 4.4 Instruction-following 
 Base LMs require instruction finetuning to follow human prompts [Ouyang et al., 2022]. We follow the experimental setup in Wu et al. [2024a] and finetune Llama-2 7B with Ultrafeedback [Cui et al., 2023]. We compare against full parameter finetuning, LoRA, and RED. For evaluation, we use Alpaca-Eval v1.0 [Li et al., 2023], which computes the win-rate against text-davinci-003 using GPT-4 as the annotator. We use the same prompt template as in Taori et al. [2023]. 
Datasets. Ultrafeedback is a high-quality instruction dataset where responses are generated via scoring a diverse set of model responses from a list of candidates (e.g. ChatGPT and Bard). The score is calculated as a weighted score of instruction-following, truthfulness, honesty, and helpfulness. Some of the best 7B and 13B chat-models (e.g. UltraLM-13B [Ding et al., 2023]) are finetuned with Ultrafeedback.

Table 2: Accuracy comparison of LLaMA-1 7B/13B against existing PEFT methods on four arithmetic reasoning datasets. ∗Performance results of all baseline methods are taken from Hu et al. [2023]. We report averaged performance of three runs with distinct random seeds for our method. All task columns report accuracy (↑).

| Model | PEFT | Params (%) | AQuA | GSM8K | MAWPS | SVAMP | Avg. |
|---|---|---|---|---|---|---|---|
| LLaMA-7B | PrefT ∗ | 0.039% | 14.2 | 24.4 | 63.4 | 38.1 | 35.0 |
| | Adapter S ∗ | 1.953% | 15.0 | 33.3 | 77.7 | 52.3 | 44.6 |
| | Adapter P ∗ | 3.542% | 18.1 | 35.3 | 82.4 | 49.6 | 46.4 |
| | LoRA ∗ | 0.826% | 18.9 | 37.5 | 79.0 | 52.1 | 46.9 |
| | DiReFT (ours) | 0.031% | 21.3 | 24.1 | 74.5 | 42.7 | 40.6 |
| | LoReFT (ours) | 0.031% | 21.4 | 26.0 | 76.2 | 46.8 | 42.6 |
| LLaMA-13B | PrefT ∗ | 0.031% | 15.7 | 31.1 | 66.8 | 41.4 | 38.8 |
| | Adapter S ∗ | 1.586% | 22.0 | 44.0 | 78.6 | 50.8 | 48.9 |
| | Adapter P ∗ | 2.894% | 20.5 | 43.3 | 81.1 | 55.7 | 50.2 |
| | LoRA ∗ | 0.670% | 18.5 | 47.5 | 83.6 | 54.6 | 51.1 |
| | DiReFT (ours) | 0.025% | 20.5 | 35.8 | 80.8 | 54.8 | 48.0 |
| | LoReFT (ours) | 0.025% | 23.6 | 38.1 | 82.4 | 54.2 | 49.6 |

Table 3: Instruction tuning evaluation results for instruction-tuned Llama-2 7B with Alpaca-Eval v1.0. We report averaged performance of two runs with distinct random seeds for our method. half denotes our runs with half of the rank; 1K denotes our runs with a low-resource setting where there are only 1K training examples. †Performance results of baseline methods are taken from Li et al. [2023]. ∗Performance results of baseline methods are taken from Wu et al. [2024a]. ‡It takes 18 minutes to train our Llama-2 Chat 7B on 1K examples using a single A100 40G GPU with ≈1MB parameters on disk.

| Model & PEFT | Params (%) | Win-rate (↑) |
|---|---|---|
| GPT-3.5 Turbo 1106 † | - | 86.30 |
| Llama-2 Chat 13B † | - | 81.10 |
| Llama-2 Chat 7B † | - | 71.40 |
| Llama-2 7B & FT ∗ | 100% | 80.93 |
| Llama-2 7B & LoRA ∗ | 0.1245% | 81.48 |
| Llama-2 7B & RED ∗ | 0.0039% | 81.69 |
| Llama-2 7B & DiReFT (ours) | 0.0039% | 84.85 |
| Llama-2 7B & LoReFT (ours) | 0.0039% | 85.60 |
| Llama-2 7B & LoReFT (ours, half) | 0.0019% | 84.12 |
| Llama-2 7B & LoReFT (ours, 1K) ‡ | 0.0039% | 81.91 |
 Hyperparameter tuning. We do hyperparameter-tuning on the unseen instruction-following dataset Alpaca-52K [Taori et al., 2023] with only LLaMA-7B to prevent test-set hill-climbing. We then use the hyperparameter settings of our best performing model to finetune on Ultrafeedback. For hyperparameter tuning, we use Alpaca-Eval v1.0 with GPT-4 turbo as the annotator for fast turnaround, which also prevents overfitting with GPT-4 as a judge. 
Results. We report results in table 3. When matched in parameter count to the previous most parameter-efficient PEFT (RED) and trained on Llama-2 7B, LoReFT outperforms all reported finetuning methods (including full finetuning) and achieves a win-rate within 1% of GPT-3.5 Turbo 1106. Furthermore, after halving the parameter count or using only 1/64-th of the data, LoReFT still outperforms other finetuning methods. This result shows that LoReFT can succeed at long-form text generation. DiReFT is again slightly worse than LoReFT but is highly competitive.⁶

> ⁶ We release our ReFT weights (<1MB) of our instruction-tuned model through HuggingFace and provide a tutorial at https://github.com/stanfordnlp/pyreft/blob/main/examples/chat.
Table 4: Accuracy comparison of RoBERTa-base and RoBERTa-large against existing PEFT methods on the GLUE benchmark. ∗Performance results of all baseline methods are taken from Wu et al. [2024a]. We report averaged performance of five runs with distinct random seeds for our method. All task columns report accuracy (↑).

| Model | PEFT | Params (%) | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| base | FT | 100% | 87.3 | 94.4 | 87.9 | 62.4 | 92.5 | 91.7 | 78.3 | 90.6 | 85.6 |
| | Adapter ∗ | 0.318% | 87.0 | 93.3 | 88.4 | 60.9 | 92.5 | 90.5 | 76.5 | 90.5 | 85.0 |
| | LoRA ∗ | 0.239% | 86.6 | 93.9 | 88.7 | 59.7 | 92.6 | 90.4 | 75.3 | 90.3 | 84.7 |
| | Adapter FNN ∗ | 0.239% | 87.1 | 93.0 | 88.8 | 58.5 | 92.0 | 90.2 | 77.7 | 90.4 | 84.7 |
| | BitFit ∗ | 0.080% | 84.7 | 94.0 | 88.0 | 54.0 | 91.0 | 87.3 | 69.8 | 89.5 | 82.3 |
| | RED ∗ | 0.016% | 83.9 | 93.9 | 89.2 | 61.0 | 90.7 | 87.2 | 78.0 | 90.4 | 84.3 |
| | DiReFT (ours) | 0.015% | 82.5 | 92.6 | 88.3 | 58.6 | 91.3 | 86.4 | 76.4 | 89.3 | 83.2 |
| | LoReFT (ours) | 0.015% | 83.1 | 93.4 | 89.2 | 60.4 | 91.2 | 87.4 | 79.0 | 90.0 | 84.2 |
| large | FT | 100% | 88.8 | 96.0 | 91.7 | 68.2 | 93.8 | 91.5 | 85.8 | 92.6 | 88.6 |
| | Adapter ∗ | 0.254% | 90.1 | 95.2 | 90.5 | 65.4 | 94.6 | 91.4 | 85.3 | 91.5 | 88.0 |
| | LoRA ∗ | 0.225% | 90.2 | 96.0 | 89.8 | 65.5 | 94.7 | 90.7 | 86.3 | 91.7 | 88.1 |
| | Adapter FNN ∗ | 0.225% | 90.3 | 96.1 | 90.5 | 64.4 | 94.3 | 91.3 | 84.8 | 90.2 | 87.7 |
| | RED ∗ | 0.014% | 89.5 | 96.0 | 90.3 | 68.1 | 93.5 | 88.8 | 86.2 | 91.3 | 88.0 |
| | DiReFT (ours) | 0.014% | 88.7 | 95.4 | 88.5 | 66.7 | 93.9 | 88.1 | 86.9 | 91.2 | 87.4 |
| | LoReFT (ours) | 0.014% | 89.2 | 96.2 | 90.1 | 68.0 | 94.1 | 88.5 | 87.5 | 91.6 | 88.2 |
 4.5 Natural language understanding 
We evaluate LoReFT on the GLUE benchmark [Wang et al., 2018] against existing PEFTs. We use this set of experiments to show LoReFT works well even with small-scale LMs, and can improve representations for classification tasks and not just text generation. We finetune RoBERTa-base (125M) as well as RoBERTa-large (350M) on GLUE, a sequence classification benchmark for natural language understanding (NLU) which covers domains such as sentiment classification and natural language inference. Details about the GLUE benchmark can be found in its original paper. We follow Wu et al. [2024a] for proper evaluation on the GLUE validation set: we split the validation set into two halves using a fixed random seed, select the model with the highest in-training validation accuracy on one half, and report its accuracy on the other, held-out half as the test accuracy.
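The seeded validation split described above can be sketched as follows. This is an illustrative sketch, not the evaluation code of Wu et al. [2024a]: the helper names `split_validation` and `pick_best`, and the even half-and-half cut, are assumptions for the example.

```python
import random

def split_validation(examples, seed: int):
    """Shuffle with a fixed seed and cut the validation set into two halves."""
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    half = len(indices) // 2
    selection = [examples[i] for i in indices[:half]]  # used to pick the best checkpoint
    held_out = [examples[i] for i in indices[half:]]   # used only for the reported score
    return selection, held_out

def pick_best(checkpoints):
    """checkpoints: list of (name, selection_accuracy) pairs; return the best name."""
    return max(checkpoints, key=lambda c: c[1])[0]

selection, held_out = split_validation(list(range(11)), seed=42)
best = pick_best([("epoch-1", 0.80), ("epoch-2", 0.90), ("epoch-3", 0.85)])
```

Keeping the two halves disjoint means the checkpoint is never chosen using the examples on which its final accuracy is reported.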
Hyperparameter tuning. We tune our hyperparameters for each task separately, which is standard for PEFTs. To avoid overfitting to random seeds, we hyperparameter-tune our models with a constant seed, and report averaged results over that and four additional unseen seeds. We describe hyperparameter tuning experiments in Appendix D.1.
Results. We report results in table 4. LoReFT obtains comparable performance with PEFT methods on both model sizes when parameter matched with RED, the previous most parameter-efficient PEFT for this task. Furthermore, DiReFT achieves worse performance than most of the PEFTs, suggesting LoReFT is a better choice when the LM is small. Full results with standard deviations are in table 13. We additionally compare against VeRA [Kopiczko et al., 2024] in appendix D.3.
 ## 5 Limitations 
 Due to limited resources, we mainly explored the LLaMA-family of models. In future work, we hope to explore the effectiveness of ReFT on other model families as well as vision–language models such as LLaVA [Liu et al., 2024a]. The capabilities of ReFT have not yet been fully explored due to the large hyperparameter search space; we are interested in automating this search. We provide some initial explorations of LM personalisation with ReFT in a few-shot setting in appendix G.2. We hope to explore why ReFT works, and we provide some of our early explorations focused on memorisation (appendix F.1, appendix F.2). We are also investigating whether learned orthogonal subspaces can be composed together without adaptation. Some encouraging initial findings are in appendix G.1. 
ReFT, abstraction, and generation. Neural network interpretability research often struggles to contribute directly to improving models. With ReFT, we have shown one way to overcome this challenge. The ReFT framework is rooted in work on causal abstraction [Geiger et al., 2023a] for model interpretability, and LoReFT builds directly on the distributed interchange intervention method of Geiger et al. [2023b] and Wu et al. [2023]. See also the interchange intervention training (IIT) method of Geiger et al. [2022], Wu et al. [2022], Huang et al. [2023c]. In a similar vein, recent work also uses representation-based editing of the Transformer stream to steer model behavior [Li et al., 2024, Zou et al., 2023]. ReFT advances this line of work by showing one way that such steering can be learned, rather than being merely a post hoc analysis step. The precise ways in which ReFT works deserve deeper exploration. Although these methods intervene on representations, the causal effect of such interventions may only emerge in the model’s upstream computations. In other words, the power of ReFT may come from the fact that it creates new causal pathways or modifies the strength of some existing ones. We leave it to future research to track these effects, and perhaps to explore more structured ReFTs to modify complex causal pathways in LMs.
ReFT and model interpretability. ReFT relies on insights from work on interpretability, and it may also be able to contribute insights back to that field. In particular, LoReFT shows that training a set of low-rank interventions on selected residual streams can induce a base LM to follow instructions (section 4.4). In other words, a linear subspace distributed across a set of neurons can achieve generalised control over a vast number of tasks. This is a serious challenge to work seeking to interpret individual neurons in isolation (for related criticisms, see Huang et al. 2023b). The success of ReFT suggests to us a quite different approach to interpretability, one that starts from the assumption that neurons will play different roles in different contexts.
Evaluation practices in PEFT research. In this work, we hyperparameter-tune ReFT on development sets that do not overlap with the test set. Unfortunately, a considerable portion of the literature on PEFTs directly hill-climbs performance on test sets. This results in overfitting to specific tasks, which gives practitioners less certainty about the real-world performance of different methods and impedes fair comparison. We hope that future work can introduce benchmarks for evaluating PEFTs and ReFTs. These should allow for compute- or time-matched hyperparameter-tuning comparisons, and they should disallow any kind of tuning or model selection based on the test set.
 ## 6 Conclusion 
We propose a strong alternative to PEFTs, LoReFT, and we identify an ablation of this method, DiReFT, that trades some performance for increased efficiency. Overall, LoReFT achieves strong performance across benchmarks from four domains while being 15×–65× more efficient than LoRA. Notably, LoReFT establishes new state-of-the-art performance on commonsense reasoning, instruction-following, and natural language understanding against the strongest PEFTs. We also show how our method can be described under a generic framework – ReFT. ReFT is a new approach to finetuning that is more powerful, more parameter-efficient, and more interpretable than any existing PEFTs.
 ## Acknowledgements 
 We thank Jing Huang for helpful discussion in designing our memorisation tests as well as writing. We thank Chenglei Si, Harshit Joshi, Jordan Juravsky, Julie Kallini, Ken Liu, Rohan Pandey, Jiuding Sun, Leonard Tang, Tristan Thrush, Shengguang Wu, Qinan Yu, Yanzhe Zhang, Amir Zur, and Shiqi Chen for helpful discussion about the project and comments on the manuscript. 
 ## References 
 Afra Amini, Tiago Pimentel, Clara Meister, and Ryan Cotterell. Naturalistic causal probing for morpho-syntax. Transactions of the Association for Computational Linguistics , 11:384–403, 2023. doi: 10.1162/tacl_a_00554. URL https://aclanthology.org/2023.tacl-1.23 .Aryaman Arora, Dan Jurafsky, and Christopher Potts. CausalGym: Benchmarking causal inter-pretability methods on linguistic tasks. arXiv:2402.12560 , 2024. URL https://arxiv.org/abs/ 2402.12560 .Matan Avitan, Ryan Cotterell, Yoav Goldberg, and Shauli Ravfogel. What changed? Converting representational interventions to natural language. arXiv:2402.11355 , 2024. URL https://arxiv. org/abs/2402.11355 .10 Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. LEACE: Perfect linear concept erasure in closed form. Advances in Neural Information Processing Systems , 36, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/ file/d066d21c619d0a78c5b557fa3291a8f4-Paper-Conference.pdf .Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages 1–9, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-short.1. URL https: //aclanthology.org/2022.acl-short.1 .Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence ,volume 34, pages 7432–7439, 2020. URL https://arxiv.org/abs/1911.11641 .Lewis Carroll. Alice’s Adventures in Wonderland . Macmillan, London, 1865. Angelica Chen, Ravid Schwartz-Ziv, Kyunghyun Cho, Matthew L. Leavitt, and Naomi Saphra. 
Sudden drops in the loss: Syntax acquisition, phase transitions, and simplicity bias in MLMs. 
 arXiv:2309.07311 , 2023. URL https://arxiv.org/abs/2309.07311v4 .Abhijith Chintam, Rahel Beloch, Willem Zuidema, Michael Hanna, and Oskar van der Wal. Identifying and adapting transformer-components responsible for gender bias in an English language model. In Yonatan Belinkov, Sophie Hao, Jaap Jumelet, Najoung Kim, Arya Mc-Carthy, and Hosein Mohebbi, editors, Proceedings of the 6th BlackboxNLP Workshop: Ana-lyzing and Interpreting Neural Networks for NLP , pages 379–394, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.blackboxnlp-1.29. URL 
 > https://aclanthology.org/2023.blackboxnlp-1.29
 .Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Lan-guage Technologies, Volume 1 (Long and Short Papers) , pages 2924–2936, Minneapolis, Min-nesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1300. URL 
 > https://aclanthology.org/N19-1300
 .Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv:1803.05457 , 2018. URL https://arxiv.org/abs/1803.05457 .Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv:2110.14168 , 2021. URL https://arxiv.org/abs/2110.14168 .Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. UltraFeedback: Boosting language models with high-quality feedback. 
 arXiv:2310.01377 , 2023. URL https://arxiv.org/abs/2310.01377 .Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems , volume 28. Curran Associates, Inc., 2015. URL https:// proceedings.neurips.cc/paper/2015/hash/7137debd45ae4d0ab9aa953017286b20-Abstract.html .Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/ N19-1423 .11 Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 3029–3051, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.183. URL 
 > https://aclanthology.org/2023.emnlp-main.183
 .Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superpo-sition. Transformer Circuits Thread , 2022. URL https://transformer-circuits.pub/2022/toy_ model/index.html .Stanislav Fort. Scaling laws for adversarial attacks on language model activations, 2023. URL 
 > http://arxiv.org/abs/2312.02780
 .Cheng Fu, Hanxian Huang, Xinyun Chen, Yuandong Tian, and Jishen Zhao. Learn-to-Share: A hardware-friendly transfer learning framework exploiting computation and parameter sharing. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event , volume 139 of Proceedings of Machine Learning Research , pages 3469–3479. PMLR, 2021. URL http://proceedings.mlr. press/v139/fu21a.html .Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. LLaMA-Adapter v2: Parameter-efficient visual instruction model. 
 arXiv:2304.15010 , 2023. URL https://arxiv.org/abs/2304.15010 .Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems , volume 34, pages 9574–9586. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/ 4f5c422f4d49a5a807eda27434231040-Paper.pdf .Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah Good-man, and Christopher Potts. Inducing causal structure for interpretable neural networks. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning , volume 162 of 
 Proceedings of Machine Learning Research , pages 7324–7338. PMLR, 17–23 Jul 2022. URL 
 > https://proceedings.mlr.press/v162/geiger22a.html
 .Atticus Geiger, Chris Potts, and Thomas Icard. Causal abstraction for faithful model interpretation. 
 arXiv:2301.04709 , 2023a. URL https://arxiv.org/abs/2301.04709 .Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah D. Goodman. Find-ing alignments between interpretable causal variables and distributed neural representations. 
 arXiv:2303.02536 , 2023b. URL https://arxiv.org/abs/2303.02536 .Clément Guerner, Anej Svete, Tianyu Liu, Alexander Warstadt, and Ryan Cotterell. A geometric notion of causal probing. arXiv:2307.15054 , 2023. URL https://arxiv.org/abs/2307.15054 .Michael Hanna, Yonatan Belinkov, and Sandro Pezzelle. When language models fall in love: Animacy processing in transformer language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, 
 Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing ,pages 12120–12135, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.744. URL https://aclanthology.org/2023.emnlp-main.744 .Sophie Hao and Tal Linzen. Verb conjugation in transformers is determined by linear encodings of subject number. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023 , pages 4531–4539, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.300. URL 
 > https://aclanthology.org/2023.findings-emnlp.300

Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 2022a. URL https://openreview.net/forum?id=0RDcd5Axok.

Shwai He, Liang Ding, Daize Dong, Jeremy Zhang, and Dacheng Tao. SparseAdapter: An easy approach for improving the parameter-efficiency of adapters. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2184–2190, Abu Dhabi, United Arab Emirates, December 2022b. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.160. URL https://aclanthology.org/2022.findings-emnlp.160.

Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1058. URL https://aclanthology.org/D14-1058.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR, 2019. URL http://proceedings.mlr.press/v97/houlsby19a.html.

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.

Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Lee. LLM-adapters: An adapter family for parameter-efficient fine-tuning of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5254–5276, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.319. URL https://aclanthology.org/2023.emnlp-main.319.

Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. LoraHub: Efficient cross-task generalization via dynamic LoRA composition. arXiv:2307.13269, 2023a. URL https://arxiv.org/abs/2307.13269.

Jing Huang, Atticus Geiger, Karel D’Oosterlinck, Zhengxuan Wu, and Christopher Potts. Rigorously assessing natural language explanations of neurons. In Yonatan Belinkov, Sophie Hao, Jaap Jumelet, Najoung Kim, Arya McCarthy, and Hosein Mohebbi, editors, Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 317–331, Singapore, December 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.blackboxnlp-1.24. URL https://aclanthology.org/2023.blackboxnlp-1.24.

Jing Huang, Zhengxuan Wu, Kyle Mahowald, and Christopher Potts. Inducing character-level structure in subword-based language models with type-level interchange intervention training. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 12163–12180, Toronto, Canada, July 2023c. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.770. URL https://aclanthology.org/2023.findings-acl.770.

Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, and Atticus Geiger. RAVEL: Evaluating interpretability methods on disentangling language model representations. arXiv:2402.17700, 2024. URL https://arxiv.org/abs/2402.17700.

Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597, 2015. doi: 10.1162/tacl_a_00160. URL https://aclanthology.org/Q15-1042.

Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS: A math word problem repository. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152–1157, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1136. URL https://aclanthology.org/N16-1136.

Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M Asano. VeRA: Vector-based random matrix adaptation. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024. URL https://openreview.net/forum?id=NjNfLdxr3A.

Karim Lasri, Tiago Pimentel, Alessandro Lenci, Thierry Poibeau, and Ryan Cotterell. Probing for the usage of grammatical number. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8818–8831, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.603. URL https://aclanthology.org/2022.acl-long.603.

Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd Schema Challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, 2012. URL https://cdn.aaai.org/ocs/4492/4492-21843-1-PB.pdf.

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/81b8390039b7302c909cb769f8b6cd93-Abstract-Conference.html.

Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, and Luke Zettlemoyer. Branch-train-merge: Embarrassingly parallel training of expert language models. arXiv:2208.03306, 2022. URL https://arxiv.org/abs/2208.03306.

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.353. URL https://aclanthology.org/2021.acl-long.353.

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv:1705.04146, 2017. URL https://arxiv.org/abs/1705.04146.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024a. URL https://arxiv.org/abs/2304.08485.

Sheng Liu, Haotian Ye, Lei Xing, and James Zou. In-context vectors: Making in context learning more effective and controllable through latent space steering. arXiv:2311.06668, 2024b. URL https://arxiv.org/abs/2311.06668.

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. arXiv:2402.09353, 2024c. URL https://arxiv.org/abs/2402.09353.

Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng, Yandong Wen, Michael J. Black, Adrian Weller, and Bernhard Schölkopf. Parameter-efficient orthogonal finetuning via butterfly factorization. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024d. URL https://openreview.net/forum?id=7NzgkEdGyr.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692, 2019. URL https://arxiv.org/abs/1907.11692.

James L. McClelland, David E. Rumelhart, and PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 2: Psychological and Biological Models. MIT Press, 1986.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv:1809.02789, 2018. URL https://arxiv.org/abs/1809.02789.

Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff, editors, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics. URL https://aclanthology.org/N13-1090.

Neel Nanda, Andrew Lee, and Martin Wattenberg. Emergent linear representations in world models of self-supervised sequence models. In Yonatan Belinkov, Sophie Hao, Jaap Jumelet, Najoung Kim, Arya McCarthy, and Hosein Mohebbi, editors, Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 16–30, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.blackboxnlp-1.2. URL https://aclanthology.org/2023.blackboxnlp-1.2.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022. URL https://arxiv.org/abs/2203.02155.

Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. arXiv:2311.03658, 2023. URL https://arxiv.org/abs/2311.03658.

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.168. URL https://aclanthology.org/2021.naacl-main.168.

Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.617. URL https://aclanthology.org/2020.emnlp-main.617.

Shauli Ravfogel, Michael Twiton, Yoav Goldberg, and Ryan D. Cotterell. Linear adversarial concept erasure. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 18400–18421, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/ravfogel22a.html.

Subhro Roy and Dan Roth. Solving general arithmetic word problems. In Lluís Màrquez, Chris Callison-Burch, and Jian Su, editors, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1202. URL https://aclanthology.org/D15-1202.

David E. Rumelhart, James L. McClelland, and PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1: Foundations. MIT Press, 1986.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd Schema Challenge at scale. Communications of the ACM, 64(9):99–106, 2021. URL https://arxiv.org/abs/1907.10641.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1454. URL https://aclanthology.org/D19-1454.

Shuhua Shi, Shaohan Huang, Minghui Song, Zhoujun Li, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, and Qi Zhang. ResLoRA: Identity residual mapping in low-rank adaption. arXiv:2402.18039, 2024. URL https://arxiv.org/abs/2402.18039.

Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, and Ponnurangam Kumaraguru. MiMiC: Minimally modified counterfactuals in the representation space. arXiv:2402.09631, 2024. URL https://arxiv.org/abs/2402.09631.

Paul Smolensky. Neural and conceptual interpretation of PDP models. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 2: Psychological and Biological Models, pages 390–431. MIT Press/Bradford Books, Cambridge, MA, 1986.

Nishant Subramani, Nivedita Suresh, and Matthew E. Peters. Extracting latent steering vectors from pretrained language models. arXiv:2205.05124, 2022. URL https://arxiv.org/abs/2205.05124.

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv:2302.13971, 2023a. URL https://arxiv.org/abs/2302.13971.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023b. URL https://arxiv.org/abs/2307.09288.

Alex Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. arXiv:2308.10248, 2023. URL https://arxiv.org/abs/2308.10248.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.

Theia Vogel. repeng, 2024. URL https://github.com/vgel/repeng/.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi, editors, Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446. URL https://aclanthology.org/W18-5446.

Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, 2023. URL https://openreview.net/pdf?id=NpsVSN6o4ul.

Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, and Jianfeng Gao. AdaMix: Mixture-of-adaptations for parameter-efficient model tuning. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5744–5760, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.388. URL https://aclanthology.org/2022.emnlp-main.388.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022. URL https://arxiv.org/abs/2201.11903.

Muling Wu, Wenhao Liu, Xiaohua Wang, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, and Xuanjing Huang. Advancing parameter efficiency in fine-tuning via representation editing. arXiv:2402.15179, 2024a. URL https://arxiv.org/abs/2402.15179.

Zhengxuan Wu, Atticus Geiger, Joshua Rozner, Elisa Kreiss, Hanson Lu, Thomas Icard, Christopher Potts, and Noah Goodman. Causal distillation for language models. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4288–4295, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.318. URL https://aclanthology.org/2022.naacl-main.318.

Zhengxuan Wu, Atticus Geiger, Christopher Potts, and Noah D. Goodman. Interpretability at scale: Identifying causal mechanisms in Alpaca. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://papers.neurips.cc/paper_files/paper/2023/file/f6a8b109d4d4fd64c75e94aaf85d9697-Paper-Conference.pdf.

Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing Huang, Zheng Wang, Noah D. Goodman, Christopher D. Manning, and Christopher Potts. pyvene: A library for understanding and improving PyTorch models via interventions. arXiv:2403.07809, 2024b. URL https://arxiv.org/abs/2403.07809.

Zhengxuan Wu, Atticus Geiger, Jing Huang, Aryaman Arora, Thomas Icard, Christopher Potts, and Noah D. Goodman. A reply to Makelov et al. (2023)’s “interpretability illusion” arguments. arXiv:2401.12631, 2024c. URL https://arxiv.org/abs/2401.12631.

Takateru Yamakoshi, James McClelland, Adele Goldberg, and Robert Hawkins. Causal interventions expose implicit situation models for commonsense language understanding. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 13265–13293, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.839. URL https://aclanthology.org/2023.findings-acl.839.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv:1905.07830, 2019. URL https://arxiv.org/abs/1905.07830.

Feiyu Zhang, Liangzhi Li, Junhao Chen, Zhouqiang Jiang, Bowen Wang, and Yiming Qian. IncreLoRA: Incremental parameter allocation method for parameter-efficient fine-tuning. arXiv:2308.12043, 2023. URL https://arxiv.org/abs/2308.12043.

Jinghan Zhang, Shiqi Chen, Junteng Liu, and Junxian He. Composing parameter-efficient modules with arithmetic operation. Advances in Neural Information Processing Systems, 36, 2024a. URL https://arxiv.org/abs/2306.14870.

Renrui Zhang, Jiaming Han, Chris Liu, Aojun Zhou, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. LLaMA-Adapter: Efficient fine-tuning of large language models with zero-initialized attention. In The Twelfth International Conference on Learning Representations, Vienna, Austria, 2024b. URL https://openreview.net/forum?id=d4UiXAHN2W.

Ruiyi Zhang, Rushi Qiang, Sai Ashish Somayajula, and Pengtao Xie. AutoLoRA: Automatically tuning matrix ranks in low-rank adaptation based on meta learning. arXiv:2403.09113, 2024c. URL https://arxiv.org/abs/2403.09113.

Ming Zhong, Yelong Shen, Shuohang Wang, Yadong Lu, Yizhu Jiao, Siru Ouyang, Donghan Yu, Jiawei Han, and Weizhu Chen. Multi-LoRA composition for image generation. arXiv:2402.16843, 2024. URL https://arxiv.org/abs/2402.16843.

Yun Zhu, Nevan Wichers, Chu-Cheng Lin, Xinyi Wang, Tianlong Chen, Lei Shu, Han Lu, Canoee Liu, Liangchen Luo, Jindong Chen, et al. SiRa: Sparse mixture of low rank adaptation. arXiv:2311.09179, 2023. URL https://arxiv.org/abs/2311.09179.

Zhou Ziheng, Yingnian Wu, Song-Chun Zhu, and Demetri Terzopoulos. Aligner: One global token is worth millions of parameters when aligning large language models. arXiv:2312.05503, 2023. URL https://arxiv.org/abs/2312.05503.

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to AI transparency. arXiv:2310.01405, 2023. URL https://arxiv.org/abs/2310.01405.

Appendix
# Table of Contents

- A pyreft: A ReFT-native Python Library
- B Describing existing methods under the ReFT framework
  - B.1 RED
  - B.2 Activation addition
  - B.3 RepE
- C Datasets
  - C.1 Commonsense reasoning
  - C.2 Arithmetic reasoning
  - C.3 Natural language understanding
- D Hyperparameters
  - D.1 Hyperparameter tuning and decoding strategy
  - D.2 Suggestions on choosing hyperparameters for ReFT
  - D.3 Additional hyperparameter-tuning results of LoReFT
- E Ablating the parametrisation of LoReFT
- F Memorisation experiments
  - F.1 A single vector is worth a thousand tokens
  - F.2 A single vector can memorise a codebook with 256 entries
- G Capabilities experiments
  - G.1 Multi-task learning: Learned ReFTs are like puzzle pieces
  - G.2 Few-shot adaptation: Adapting Llama-2-Chat to GOODY-2 with 5 examples
- H Inference overhead analysis of ReFT with our ReFT library
- I Generation examples
- J Licenses for existing assets
  - J.1 Commonsense reasoning
  - J.2 Arithmetic reasoning
  - J.3 Instruct-tuning
  - J.4 Natural language understanding
  - J.5 Models
## A pyreft: A ReFT-native Python Library
To lower the cost of switching from PEFTs to ReFT, we release a Python library made for training and sharing ReFTs. Our library is built on top of pyvene [Wu et al., 2024b], a library for performing and training activation interventions on arbitrary PyTorch models. Any pretrained LM available on HuggingFace is supported through our library for finetuning with ReFT methods, and finetuned models can be easily uploaded to HuggingFace. The following example shows the steps to wrap a LLaMA 7B model (the `yahma/llama-7b-hf` checkpoint) with a single intervention on the residual stream output of the 19th layer:
```python
import torch
import transformers

from pyreft import (
    get_reft_model,
    ReftConfig,
    LoreftIntervention,
    ReftTrainerForCausalLM,
)

# Load the HuggingFace model.
model_name_or_path = "yahma/llama-7b-hf"
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name_or_path, torch_dtype=torch.bfloat16, device_map="cuda"
)

# Wrap the model with a rank-1 LoReFT intervention.
reft_config = ReftConfig(
    representations={
        "layer": 19,
        "component": "block_output",
        "intervention": LoreftIntervention(
            embed_dim=model.config.hidden_size, low_rank_dimension=1
        ),
    }
)
reft_model = get_reft_model(model, reft_config)
reft_model.print_trainable_parameters()
```
 The wrapped model can be trained for downstream tasks. We also provide data loading helpers to construct training data that is compatible with HuggingFace trainers:                    
```python
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name_or_path)

# Get training data with customised dataloaders.
# `make_supervised_data_module`, `training_args`, and `data_args` are not
# defined in this snippet; they are assumed to come from the training script.
data_module = make_supervised_data_module(
    tokenizer=tokenizer,
    model=model,
    layers=[19],
    training_args=training_args,
    data_args=data_args,
)

# Train. `ReftTrainerForCausalLM` is imported from pyreft in the previous snippet.
trainer = ReftTrainerForCausalLM(
    model=reft_model, tokenizer=tokenizer, args=training_args, **data_module
)
trainer.train()
trainer.save_model(output_dir=training_args.output_dir)
```
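For intuition about what a low-rank LoReFT intervention computes, the sketch below implements the LoReFT update from the main text of the paper, $\Phi_{\text{LoReFT}}(h) = h + R^\top(Wh + b - Rh)$, in plain Python. It is an illustrative toy rather than pyreft's implementation: the helper names `loreft` and `loreft_param_count` are hypothetical, and the parameter count assumes that $R$, $W$, and $b$ are the only trainable parameters of one intervention.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # r rows, each of length d


def dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def loreft(h: Vector, R: Matrix, W: Matrix, b: Vector) -> Vector:
    """Toy LoReFT update: h + R^T (W h + b - R h)."""
    # Difference between the learned target and the current projection, in R^r.
    delta = [dot(w_k, h) + b_k - dot(r_k, h) for r_k, w_k, b_k in zip(R, W, b)]
    # Map the r-dimensional difference back to R^d through the rows of R.
    return [
        h_j + sum(R[k][j] * delta[k] for k in range(len(R)))
        for j, h_j in enumerate(h)
    ]


def loreft_param_count(d: int, r: int) -> int:
    """Entries of R (r*d) plus W (r*d) plus b (r); an assumption of this sketch."""
    return 2 * r * d + r


# Rank-1 example in d = 3: the component of h along R's single row is
# replaced by W h + b, and the other coordinates are left untouched.
h = [2.0, 3.0, 4.0]
R = [[1.0, 0.0, 0.0]]
W = [[0.0, 0.0, 0.0]]
b = [5.0]
edited = loreft(h, R, W, b)
```

Under these assumptions, a rank-1 intervention on a hidden size of 4096 has `loreft_param_count(4096, 1)` = 8,193 parameters.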
 ## B Describing existing methods under the ReFT framework 
To show the expressivity of the ReFT framework, we cast existing representation-editing methods in the literature into ReFTs.

General comments about expressivity of ReFT. Given that previous works have unified PEFTs under a single framework [He et al., 2022a], one may ask why ReFT is not simply expressed as a PEFT method. The main reason is that PEFT frameworks lack the notion of time or sequence (see the unified PEFT view provided in Table 1 on pg. 5 of He et al., 2022a). In PEFTs, representation modifications are necessarily applied to every token in the sequence, even in recent variants such as AdaLoRA [Zhang et al., 2023]. A key aspect of ReFT is that it leverages representations over time and intervenes only on a small number of them while remaining effective. More importantly, the notion of time is important for future versions of ReFT that intervene on representations schematically (e.g. intervene on the first token at some early layers and then intervene on the last token at some later layers). The ability to intervene at different layer and position combinations schematically is also supported in our code. Existing PEFT libraries⁷ enforce weight-based updates without supporting flexible representation-based interventions.
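The position- and layer-specific scheduling described above can be sketched as a small data structure. The example below is a toy, assuming each layer is a function from a sequence of vectors to a sequence of vectors; the names `Intervention` and `run_with_interventions` are hypothetical and are not part of pyreft. Positions and layers are 0-indexed here.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List

Vector = List[float]
Hidden = List[Vector]  # one vector per sequence position


@dataclass(frozen=True)
class Intervention:
    phi: Callable[[Vector], Vector]  # intervention function
    positions: FrozenSet[int]        # positions P to intervene on
    layer: int                       # layer l whose output is edited


def run_with_interventions(
    hidden: Hidden,
    layers: List[Callable[[Hidden], Hidden]],
    interventions: List[Intervention],
) -> Hidden:
    """Run the toy layers in order, editing selected positions after each layer."""
    for l, layer_fn in enumerate(layers):
        hidden = layer_fn(hidden)
        for iv in interventions:
            if iv.layer == l:
                hidden = [
                    iv.phi(h) if p in iv.positions else h
                    for p, h in enumerate(hidden)
                ]
    return hidden


def identity(seq: Hidden) -> Hidden:
    return [list(h) for h in seq]


n = 4
layers = [identity, identity, identity]
hidden = [[1.0, 1.0] for _ in range(n)]

# Edit the first token at an early layer and the last token at a later layer.
schedule = [
    Intervention(lambda h: [x + 1.0 for x in h], frozenset({0}), layer=0),
    Intervention(lambda h: [2.0 * x for x in h], frozenset({n - 1}), layer=2),
]
out = run_with_interventions(hidden, layers, schedule)
```

Only positions 0 and 3 are modified; a weight-based update would instead change the computation at every position.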
⁷ See https://github.com/huggingface/peft.

### B.1 RED

RED [Wu et al., 2024a] is a simple representation-editing method that applies an element-wise scaling transform $s \in \mathbb{R}^n$ and adds a bias $b \in \mathbb{R}^n$ to the hidden representation in every layer. The same intervention is applied to every position (including at generated tokens, increasing inference burden), but separate interventions are learned at each layer. In the ReFT framework, RED is defined as

$$\Phi_{\text{RED}}(h) = s \times h + b \tag{8}$$

$$I_{\text{RED}} = \{\langle \Phi_{\text{RED}}, \{1, \dots, n\}, l \rangle \mid l \in \{1, \dots, m\}\} \tag{9}$$

The parameters $\phi_{\text{RED}} = \{s, b\}$ are learned with gradient descent to minimise a loss function such as language-modelling loss or a classification loss, as in our experiments with LoReFT. We believe that RED is better classified as a kind of adapter due to its application at all positions.
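Equations 8 and 9 can be read as the following minimal sketch, which applies a layer-specific scale and bias to every position of a precomputed grid of hidden states. The grid layout `hidden_states[l][p]` and the function names are assumptions made for illustration; in a real model the edited output of one layer would feed into the next.

```python
from typing import List

Vector = List[float]


def red_intervention(h: Vector, s: Vector, b: Vector) -> Vector:
    """Eq. 8: element-wise scaling of h by s, followed by adding the bias b."""
    return [s_i * h_i + b_i for s_i, h_i, b_i in zip(s, h, b)]


def apply_red(
    hidden_states: List[List[Vector]],
    scales: List[Vector],
    biases: List[Vector],
) -> List[List[Vector]]:
    """Eq. 9: at each layer l, apply that layer's (s, b) to all positions."""
    return [
        [red_intervention(h, scales[l], biases[l]) for h in layer]
        for l, layer in enumerate(hidden_states)
    ]


# One layer, two positions, hidden size 2.
hidden_states = [[[1.0, 2.0], [3.0, 4.0]]]
scales = [[2.0, 0.5]]
biases = [[1.0, -1.0]]
edited = apply_red(hidden_states, scales, biases)
```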
### B.2 Activation addition

Activation addition [Turner et al., 2023] takes the difference in activations at some positions $p$ and $q$ and layer $l$ given two contrastive prompts $x^+$ and $x^-$ as input. It then adds this difference vector, scaled by a tuned constant $c$, to representations at all positions in layer $l$ for some new prompt.

$$a = h(x^+)^{(l)}_p - h(x^-)^{(l)}_q \tag{10}$$

$$\Phi_{\text{ActAdd}}(h) = h + c \cdot a \tag{11}$$

$$I_{\text{ActAdd}} = \{\langle \Phi_{\text{ActAdd}}, \{1, \dots, n\}, l \rangle\} \tag{12}$$
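As a rough illustration of Equations 10 to 12, the sketch below builds the difference vector from two given activations and adds it, scaled by `c`, at every position of one layer. The activations are hand-written toy values, not outputs of a real model, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]


def steering_vector(h_pos: Vector, h_neg: Vector) -> Vector:
    """Eq. 10: difference between the activations of the two contrastive prompts."""
    return [u - v for u, v in zip(h_pos, h_neg)]


def act_add(h: Vector, a: Vector, c: float) -> Vector:
    """Eq. 11: shift a single representation by c times the difference vector."""
    return [h_i + c * a_i for h_i, a_i in zip(h, a)]


def apply_act_add(layer_hidden: List[Vector], a: Vector, c: float) -> List[Vector]:
    """Eq. 12: apply the same shift at all positions of the chosen layer."""
    return [act_add(h, a, c) for h in layer_hidden]


a = steering_vector([1.0, 0.0], [0.0, 1.0])  # toy activations at positions p and q
steered = apply_act_add([[0.0, 0.0], [1.0, 1.0]], a, c=2.0)
```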
### B.3 RepE

Zou et al. [2023] introduce several intervention methods for controlling model behaviour, which they term representation engineering. First, given a set of prompts $\{x_1, \dots, x_n\}$ designed to elicit the presence of a concept, we randomly pair them, take the difference in activations for each pair, and find the first principal component of the difference vectors at the last token position in some layer of interest $l$ to obtain a reading vector:

$$a_{\text{reading}} = \mathrm{PCA}\left(\left\{h(x_i)^{(l)}_{-1} - h(x_{i+1})^{(l)}_{-1} \mid i \equiv 0 \bmod 2\right\}\right)_1 \tag{13}$$

One can also use a more structured pairing of contrastive prompts to obtain a contrast vector, similar to the difference vector computed in activation addition:

$$a_{\text{contrast}} = \mathrm{PCA}\left(\left\{h(x_i^+)^{(l)}_{-1} - h(x_i^-)^{(l)}_{-1} \mid 1 \le i \le n\right\}\right)_1 \tag{14}$$

Then, using either $a_{\text{reading}}$ or $a_{\text{contrast}}$, RepE introduces three operators (i.e. parametrisations of $\Phi$) for intervening on activations:

$$\Phi_{\text{RepE,linear}}(h) = h \pm c \cdot a \tag{15}$$

$$\Phi_{\text{RepE,piecewise}}(h) = h + c \cdot \operatorname{sign}(a \cdot h) \cdot a \tag{16}$$

$$\Phi_{\text{RepE,projection}}(h) = h - c \cdot \frac{a \cdot h}{\lVert a \rVert^2} \cdot a \tag{17}$$

The first two of these are similar to activation addition, while the last is a scaled one-dimensional distributed interchange intervention that is a special case of LoReFT. These operations are then used to intervene on some set of positions $P \subseteq \{1, \dots, n\}$ in the layer of interest:

$$I_{\text{RepE}} = \{\langle \Phi_{\text{RepE}}, P, l \rangle\} \tag{18}$$

RepE introduces another model control method called Low-Rank Representation Adaptation (LoRRA), which is a kind of PEFT rather than a ReFT since it tunes model weights using a variant of LoRA.
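The three operators in Equations 15 to 17 can be sketched in a few lines, taking the direction `a` as given (the PCA step that produces a reading or contrast vector is omitted). The function names are hypothetical, and treating $\operatorname{sign}(0)$ as 0 is an assumption of this sketch.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def sign(x: float) -> float:
    # Assumption: sign(0) is taken to be 0.
    return float((x > 0) - (x < 0))


def repe_linear(h: Vector, a: Vector, c: float, direction: float = 1.0) -> Vector:
    """Eq. 15: add (direction=+1) or subtract (direction=-1) c times a."""
    return [h_i + direction * c * a_i for h_i, a_i in zip(h, a)]


def repe_piecewise(h: Vector, a: Vector, c: float) -> Vector:
    """Eq. 16: push h along a, in whichever direction h already points."""
    s = sign(dot(a, h))
    return [h_i + c * s * a_i for h_i, a_i in zip(h, a)]


def repe_projection(h: Vector, a: Vector, c: float) -> Vector:
    """Eq. 17: subtract c times the projection of h onto a."""
    coeff = c * dot(a, h) / dot(a, a)
    return [h_i - coeff * a_i for h_i, a_i in zip(h, a)]


a = [1.0, 0.0]
h = [3.0, 4.0]
linear_plus = repe_linear(h, a, c=1.0)
linear_minus = repe_linear(h, a, c=1.0, direction=-1.0)
piecewise = repe_piecewise([-3.0, 4.0], a, c=1.0)
projected = repe_projection(h, a, c=1.0)
```

With $c = 1$, the projection operator removes the component of $h$ along $a$ entirely, so the result is orthogonal to $a$.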
 ## C Datasets 
### C.1 Commonsense reasoning

We train and evaluate our models on eight datasets covering different domains of open-ended QA tasks:

1. The BoolQ [Clark et al., 2019] dataset, which is a question-answering dataset of naturally occurring yes/no questions. Following previous works, we remove the provided passage from the dataset to ensure a fair comparison.
2. The PIQA [Bisk et al., 2020] dataset, which tests physical commonsense reasoning and requires the model to choose one of the provided actions to take based on a hypothesised scenario.
3. The SIQA [Sap et al., 2019] dataset, which focuses on reasoning about people’s actions and their corresponding social consequences.
4. The HellaSwag [Zellers et al., 2019] dataset, which asks the model to choose an appropriate ending (or sentence completion) given a context.
5. The WinoGrande [Sakaguchi et al., 2021] dataset, inspired by the Winograd Schema Challenge [Levesque et al., 2012], which asks the model to fill in a blank with binary options given a sentence that requires commonsense reasoning.
6. The ARC Easy set (ARC-e) [Clark et al., 2018], which includes genuine grade-school level multiple-choice science questions.
7. The ARC Challenge set (ARC-c) [Clark et al., 2018], which is like ARC-e but designed in a way that co-occurrence methods are expected to fail to answer correctly.
8. The OBQA [Mihaylov et al., 2018] dataset, which is a knowledge-intensive and open-book QA dataset that requires multi-hop reasoning.

Dataset statistics and simplified training examples from each dataset are provided in Hu et al. [2023]. We replicate the experimental setup in Hu et al. [2023] and finetune our models on a combined training dataset (COMMONSENSE170K) of the tasks mentioned above, and evaluate on their individual test sets.
 C.2 Arithmetic reasoning 
 We train and evaluate with seven datasets covering different domains of math word problems:
 1. The AddSub [Hosseini et al., 2014] dataset, which involves solving arithmetic word problems that include addition and subtraction.
 2. The AQuA [Ling et al., 2017] dataset, which formulates algebraic word problems as multiple-choice problems.
 3. The GSM8K [Cobbe et al., 2021] dataset, which consists of grade-school math word problems that require multi-step reasoning.
 4. The MAWPS [Koncel-Kedziorski et al., 2016] dataset, which contains math word problems of varying complexity.
 5. The MultiArith [Roy and Roth, 2015] dataset, which contains multi-step arithmetic problems.
 6. The SingleEq [Koncel-Kedziorski et al., 2015] dataset, which has grade-school math word problems that map to single equations of different lengths.
 7. The SVAMP [Patel et al., 2021] dataset, which enhances the original Math Word Problem (MWP) challenge by requiring robust reasoning ability that is invariant to structural alterations of the posed problem.

 Dataset statistics and simplified training examples from each dataset are provided in Hu et al. [2023]. We replicate the experimental setup in Hu et al. [2023] and finetune our models on a combined training dataset (MATH10K) of four of the tasks mentioned above: GSM8K, MAWPS, MAWPS-single and AQuA. Unlike Hu et al. [2023], we exclude AddSub, MultiArith and SingleEq from testing: these tasks are part of the MAWPS training dataset, so the original setup accidentally leaks their test examples into the training set. 
 C.3 Natural language understanding 
 We follow Wu et al. [2024a] for proper evaluation on the GLUE validation set. We split the validation set into two subsets, using one subset guarded by a random seed for in-training evaluation and the other for testing. Specifically, after each training epoch, we evaluate the model on our in-training evaluation set and select the best model across all epochs for testing. For datasets with a large validation set (i.e., QQP, MNLI, and QNLI), we select 1,000 samples for in-training evaluation. For the remaining smaller datasets, we select half of the samples for this purpose. For the evaluation metric, we use the Matthews correlation coefficient for CoLA, the Pearson correlation coefficient for STS-B, and accuracy for the other datasets. For MNLI, we report results only on the matched version. 
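The split described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the default seed value, and the choice to round down when halving an odd-sized validation set are hypothetical and are not taken from the paper's code.

```python
import random

# Tasks whose validation sets are large enough to hold out a fixed 1,000 samples.
LARGE_VALIDATION_TASKS = {"qqp", "mnli", "qnli"}


def split_validation(indices, task, seed=42):
    """Split validation indices into (in_training_eval, test) subsets.

    The seed default is a placeholder; the text only says the split is
    guarded by a random seed.
    """
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    if task.lower() in LARGE_VALIDATION_TASKS:
        n_eval = 1000
    else:
        n_eval = len(shuffled) // 2  # assumption: round down for odd sizes
    return shuffled[:n_eval], shuffled[n_eval:]


def metric_for(task):
    """Return the metric name used for a GLUE task, as listed in the text."""
    special = {"cola": "matthews_correlation", "sts-b": "pearson_correlation"}
    return special.get(task.lower(), "accuracy")
```

Using a dedicated `random.Random(seed)` instance keeps the split reproducible without touching the global random state used elsewhere in training.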
 ## D Hyperparameters 
 D.1 Hyperparameter tuning and decoding strategy
 Commonsense reasoning and arithmetic reasoning. We create a standalone development set by taking the last 300 examples from the GSM8K training set. We train our models with the remaining training set of GSM8K and select the hyperparameter settings based on model performance on the development set. We select the hyperparameters using LLaMA-7B, and apply the same settings to LLaMA-13B without additional tuning. We use a maximum sequence length of 512 for training and hyperparameter tuning, and a maximum new token number of 32 for inference. Table 5 and table 6 describe our hyperparameter search space. We use a lower number of epochs (6 instead of 12) for the commonsense reasoning benchmark because the COMMONSENSE170K training set is more than 20 times larger than GSM8K. During inference, we use greedy decoding without sampling for the commonsense reasoning benchmark, since it is a multi-token classification benchmark, and use the same decoding strategy as in Hu et al. [2023] for the arithmetic reasoning benchmark with a higher temperature of 0.3. The reason to switch to a slightly different set of decoding hyperparameters is that the HuggingFace decoding function may throw an error due to statistical instability with close-to-zero probabilities over output tokens with beam search.⁸
 Instruction following. We finetune LLaMA-7B on Alpaca-52K [Taori et al., 2023] to select hyperparameters. We select the hyperparameter settings based on model performance evaluated with Alpaca-Eval v1.0 [Li et al., 2023], which calculates the win-rate over text-davinci-003 by using gpt-4-turbo as the annotator. We use a maximum sequence length of 768 for training and hyperparameter tuning, and a maximum new token number of 2048 for inference. Table 7 describes our hyperparameter search space. During inference, we use the same decoding strategy as in RED [Wu et al., 2024a] to ensure a fair comparison. Specifically, we use greedy decoding without sampling, and use a maximum repetition n-gram size of 5 with a repetition penalty of 1.1. 
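As a rough illustration, the instruction-following decoding settings above could be collected as keyword arguments for a Hugging Face `generate` call. The mapping from "maximum repetition n-gram size" to `no_repeat_ngram_size` is an assumption on our part, and the constant name is hypothetical; the values themselves come from the text.

```python
# Decoding settings for instruction following, expressed with the keyword
# names used by Hugging Face `generate`. Mapping the "repetition n-gram
# size" to `no_repeat_ngram_size` is an assumption, not confirmed by the text.
INSTRUCTION_FOLLOWING_DECODING = {
    "do_sample": False,         # greedy decoding without sampling
    "max_new_tokens": 2048,     # maximum new token number for inference
    "no_repeat_ngram_size": 5,  # maximum repetition n-gram size
    "repetition_penalty": 1.1,  # repetition penalty
}
```

Such a dictionary would typically be unpacked into the call, for example `model.generate(**inputs, **INSTRUCTION_FOLLOWING_DECODING)`, where `model` and `inputs` are whatever the caller has already prepared.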
 Natural language understanding. We conduct hyperparameter tuning with RoBERTa-base and RoBERTa-large for each task individually. We pick the hyperparameters based on testing performance on the held-out validation set with a fixed random seed of 42. We then evaluate our model with four additional unseen seeds {43, 44, 45, 46} for final results. We follow Wu et al. [2024a]'s setting for evaluation. For QQP with RoBERTa-large, there is some stochasticity in runs with the same seed, so we pick the best run out of 3 runs for any particular seed. As reported by Wu et al. [2024a], we also observe that evaluation results on RTE are unstable due to the small size of the dataset. We thus replace several random seeds as in Wu et al. [2024a] to ensure a fair comparison. In addition, we replace one or two random seeds for CoLA for stability. Table 8 describes our hyperparameter search space. Table 9 to table 12 describe our hyperparameter settings for each task. We conduct separate hyperparameter tuning for LoReFT and DiReFT to ensure a fair comparison.
 > 8 See reference ticket: https://github.com/huggingface/transformers/issues/11267 .
 Table 5: Hyperparameter search space of LLaMA-1 7B models with LoReFT on the GSM8K development set with the best settings underlined. We use greedy decoding without sampling during hyperparameter tuning.

 | Hyperparameters | LLaMA-7B w/ GSM8K for LoReFT |
 | --- | --- |
 | prefix+suffix position p + s | {p1+s1, p3+s3, p5+s5, p7+s7, p9+s9, p11+s11} |
 | Tied weight p, s | {True, False} |
 | Rank r | {8, 16, 32, 64} |
 | Layer L (sep. w/ ';') | {0;2;4;6;10;12;14;18, 10;12;14;18;20;22;24;28, 4;6;10;12;14;18;20;22, all} |
 | Dropout | {0.00, 0.05} |
 | Optimizer | AdamW |
 | LR | {9×10⁻⁵, 1×10⁻⁴, 3×10⁻⁴, 6×10⁻⁴, 9×10⁻⁴, 1×10⁻³, 3×10⁻³} |
 | Weight decay | {0, 1×10⁻³, 2×10⁻³} |
 | LR scheduler | Linear |
 | Batch size | {4, 8, 16, 32, 64} |
 | Warmup ratio | {0.00, 0.06, 0.10} |
 | Epochs | {3, 6, 9, 12, 18} |

 Table 6: Hyperparameter search space of LLaMA-1 7B models with DiReFT on the GSM8K development set with the best settings underlined. We use greedy decoding without sampling during hyperparameter tuning.

 | Hyperparameters | LLaMA-7B w/ GSM8K for DiReFT |
 | --- | --- |
 | prefix+suffix position p + s | {p1+s1, p3+s3, p5+s5, p7+s7, p9+s9, p11+s11} |
 | Tied weight p, s | {True, False} |
 | Rank r | {8, 16, 32, 64} |
 | Layer L (sep. w/ ';') | {0;2;4;6;10;12;14;18, 10;12;14;18;20;22;24;28, 4;6;10;12;14;18;20;22, all} |
 | Dropout | {0.00, 0.05} |
 | Optimizer | AdamW |
 | LR | {9×10⁻⁵, 1×10⁻⁴, 3×10⁻⁴, 6×10⁻⁴, 9×10⁻⁴, 1×10⁻³, 3×10⁻³} |
 | Weight decay | {0, 1×10⁻³, 2×10⁻³, 6×10⁻³, 1×10⁻², 2×10⁻², 6×10⁻²} |
 | LR scheduler | Linear |
 | Batch size | {4, 8, 16, 32, 64} |
 | Warmup ratio | {0.00, 0.06, 0.10} |
 | Epochs | {3, 6, 9, 12, 18} |

 Table 7: Hyperparameter search space of LLaMA-1 7B models on Alpaca-52K evaluated by Alpaca-Eval v1.0 with the best settings underlined. We use greedy decoding without sampling during hyperparameter tuning. LoReFT and DiReFT have the same hyperparameter settings.

 | Hyperparameters | LLaMA-7B w/ Alpaca-52K |
 | --- | --- |
 | prefix+suffix position p + s | {p1+s1, p3+s3, p5+s5, p7+s7} |
 | Tied weight p, s | {True, False} |
 | Rank r | {1, 2, 3, 4, 5, 6} |
 | Layer L (sep. w/ ';') | {9;18, 3;9;18, 3;9;18;24} |
 | Dropout | {0.00, 0.05} |
 | Optimizer | AdamW |
 | LR | 9×10⁻⁴ |
 | Weight decay | 0 |
 | LR scheduler | Linear |
 | Batch size | {16, 32, 64, 128} |
 | Warmup ratio | 0.00 |
 | Epochs | {1, 3, 6, 9, 12} |

 Table 8: Hyperparameter search space of RoBERTa-base and RoBERTa-large models on GLUE evaluated with classification accuracy. Best hyperparameter settings are task-specific, which are specified in separate tables.

 | Hyperparameters | RoBERTa-base and RoBERTa-large w/ GLUE |
 | --- | --- |
 | prefix+suffix position p + s | {p1, p3, p5, p7, p9, p11} |
 | Tied weight p, s | False |
 | Rank r | {1, 2} |
 | Layer L (sep. w/ ';') | {1;3;5;7;9;11, all} |
 | Dropout | {0.00, 0.05, 0.10, 0.15, 0.20} |
 | Optimizer | AdamW |
 | LR | {1×10⁻⁴, 2×10⁻⁴, 3×10⁻⁴, 4×10⁻⁴, 5×10⁻⁴}, {6×10⁻⁴, 9×10⁻⁴, 1×10⁻³, 3×10⁻³} |
 | Weight decay | {0, 1×10⁻⁴, 6×10⁻⁴, 1×10⁻³, 6×10⁻³, 1×10⁻², 2×10⁻², 4×10⁻²} |
 | LR scheduler | Linear |
 | Batch size | {16, 32, 64, 128} |
 | Warmup ratio | {0, 5×10⁻³, 6×10⁻³, 3×10⁻², 5×10⁻², 6×10⁻², 1×10⁻¹, 2×10⁻¹} |
 | Epochs | {20, 30, 40, 50, 60} |

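 Table 9: Hyperparameter settings of RoBERTa-base models on GLUE for LoReFT.

 | Hyperparameters | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | position p | p1 | p3 | p3 | p3 | p11 | p11 | p3 | p3 |
 | Tied weight | False | False | False | False | False | False | False | False |
 | Rank r | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
 | Layer L | all | all | all | all | all | all | all | all |
 | Dropout | 0.05 | 0.10 | 0.05 | 0.20 | 0.05 | 0.05 | 0.05 | 0.05 |
 | Optimizer | AdamW | AdamW | AdamW | AdamW | AdamW | AdamW | AdamW | AdamW |
 | LR | 6×10⁻⁴ | 6×10⁻⁴ | 3×10⁻⁴ | 4×10⁻⁴ | 9×10⁻⁴ | 6×10⁻⁴ | 9×10⁻⁴ | 6×10⁻⁴ |
 | Weight decay | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
 | LR scheduler | Linear | Linear | Linear | Linear | Linear | Linear | Linear | Linear |
 | Batch size | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
 | Warmup ratio | 6×10⁻² | 1×10⁻¹ | 0 | 5×10⁻³ | 1×10⁻¹ | 0 | 0 | 3×10⁻² |
 | Epochs | 40 | 40 | 40 | 60 | 20 | 40 | 60 | 60 |
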
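 Table 10: Hyperparameter settings of RoBERTa-large models on GLUE for LoReFT.

 | Hyperparameters | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | position p | p1 | p3 | p3 | p3 | p11 | p11 | p3 | p3 |
 | Tied weight | False | False | False | False | False | False | False | False |
 | Rank r | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
 | Layer L | all | all | all | all | all | all | all | all |
 | Dropout | 0.05 | 0.05 | 0.20 | 0.20 | 0.05 | 0.05 | 0.05 | 0.05 |
 | Optimizer | AdamW | AdamW | AdamW | AdamW | AdamW | AdamW | AdamW | AdamW |
 | LR | 6×10⁻⁴ | 6×10⁻⁴ | 3×10⁻⁴ | 1×10⁻⁴ | 9×10⁻⁴ | 6×10⁻⁴ | 6×10⁻⁴ | 8×10⁻⁴ |
 | Weight decay | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
 | LR scheduler | Linear | Linear | Linear | Linear | Linear | Linear | Linear | Linear |
 | Batch size | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
 | Warmup ratio | 0.00 | 0.10 | 0.06 | 0.20 | 0.10 | 0.06 | 0.00 | 0.20 |
 | Epochs | 20 | 20 | 30 | 30 | 20 | 20 | 30 | 30 |
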
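 Table 11: Hyperparameter settings of RoBERTa-base models on GLUE for DiReFT.

 | Hyperparameters | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | position p | p1 | p3 | p5 | p1 | p11 | p11 | p1 | p3 |
 | Tied weight | False | False | False | False | False | False | False | False |
 | Rank r | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
 | Layer L | all | all | all | all | all | all | all | all |
 | Dropout | 0.05 | 0.10 | 0.05 | 0.00 | 0.05 | 0.05 | 0.00 | 0.05 |
 | Optimizer | AdamW | AdamW | AdamW | AdamW | AdamW | AdamW | AdamW | AdamW |
 | LR | 6×10⁻⁴ | 6×10⁻⁴ | 3×10⁻⁴ | 6×10⁻⁴ | 9×10⁻⁴ | 6×10⁻⁴ | 9×10⁻⁴ | 6×10⁻⁴ |
 | Weight decay | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.04 | 0.00 |
 | LR scheduler | Linear | Linear | Linear | Linear | Linear | Linear | Linear | Linear |
 | Batch size | 32 | 32 | 32 | 32 | 32 | 32 | 8 | 32 |
 | Warmup ratio | 6×10⁻² | 1×10⁻¹ | 1×10⁻¹ | 0 | 1×10⁻¹ | 0 | 0 | 3×10⁻² |
 | Epochs | 40 | 40 | 40 | 60 | 20 | 40 | 60 | 60 |
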
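 Table 12: Hyperparameter settings of RoBERTa-large models on GLUE for DiReFT.

 | Hyperparameters | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | position p | p1 | p3 | p1 | p1 | p11 | p7 | p3 | p3 |
 | Tied weight | False | False | False | False | False | False | False | False |
 | Rank r | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
 | Layer L | all | all | all | all | all | all | all | all |
 | Dropout | 0.05 | 0.05 | 0.10 | 0.15 | 0.05 | 0.05 | 0.05 | 0.05 |
 | Optimizer | AdamW | AdamW | AdamW | AdamW | AdamW | AdamW | AdamW | AdamW |
 | LR | 6×10⁻⁴ | 6×10⁻⁴ | 9×10⁻⁴ | 9×10⁻⁴ | 9×10⁻⁴ | 9×10⁻⁴ | 6×10⁻⁴ | 8×10⁻⁴ |
 | Weight decay | 0 | 0 | 0 | 0 | 0 | 0 | 6×10⁻³ | 0 |
 | LR scheduler | Linear | Linear | Linear | Linear | Linear | Linear | Linear | Linear |
 | Batch size | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
 | Warmup ratio | 0.00 | 0.10 | 0.00 | 0.00 | 0.10 | 0.10 | 0.00 | 0.10 |
 | Epochs | 20 | 20 | 50 | 60 | 20 | 20 | 30 | 30 |
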
 Table 13: Accuracy comparison of RoBERTa-base and RoBERTa-large against existing PEFT methods on the GLUE benchmark with standard deviation (SD). ∗Performance results of all baseline methods are taken from Wu et al. [2024a]. We report averaged performance of five runs with distinct random seeds for our method. Param. (%) is calculated by dividing the number of trainable parameters (excluding the parameters of the classification head) by the number of parameters of the base LM. Each cell reports accuracy (↑) with SD in parentheses.

 | Model | PEFT | Params (%) | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B | Avg. |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | base | FT | 100% | 87.3 (0.34) | 94.4 (0.96) | 87.9 (0.91) | 62.4 (3.29) | 92.5 (0.22) | 91.7 (0.19) | 78.3 (3.20) | 90.6 (0.59) | 85.6 |
 | base | Adapter ∗ | 0.318% | 87.0 (0.28) | 93.3 (0.40) | 88.4 (1.54) | 60.9 (3.09) | 92.5 (0.02) | 90.5 (0.08) | 76.5 (2.26) | 90.5 (0.35) | 85.0 |
 | base | LoRA ∗ | 0.239% | 86.6 (0.23) | 93.9 (0.49) | 88.7 (0.76) | 59.7 (4.36) | 92.6 (0.10) | 90.4 (0.08) | 75.3 (2.79) | 90.3 (0.54) | 84.7 |
 | base | Adapter FNN ∗ | 0.239% | 87.1 (0.10) | 93.0 (0.05) | 88.8 (1.38) | 58.5 (1.69) | 92.0 (0.28) | 90.2 (0.07) | 77.7 (1.93) | 90.4 (0.31) | 84.7 |
 | base | BitFit ∗ | 0.080% | 84.7 (0.08) | 94.0 (0.87) | 88.1 (1.57) | 54.0 (3.07) | 91.0 (0.05) | 87.3 (0.02) | 69.8 (1.51) | 89.5 (0.35) | 82.3 |
 | base | RED ∗ | 0.016% | 83.9 (0.14) | 93.9 (0.31) | 89.2 (0.98) | 61.0 (2.96) | 90.7 (0.35) | 87.2 (0.17) | 78.0 (2.06) | 90.4 (0.32) | 84.3 |
 | base | DiReFT (ours) | 0.015% | 82.5 (0.22) | 92.6 (0.76) | 88.3 (1.23) | 58.6 (1.99) | 91.3 (0.19) | 86.4 (0.27) | 76.4 (1.48) | 89.3 (0.56) | 83.2 |
 | base | LoReFT (ours) | 0.015% | 83.1 (0.26) | 93.4 (0.64) | 89.2 (2.62) | 60.4 (2.60) | 91.2 (0.25) | 87.4 (0.23) | 79.0 (2.76) | 90.0 (0.29) | 84.2 |
 | large | FT | 100% | 88.8 (0.45) | 96.0 (0.66) | 91.7 (1.73) | 68.2 (2.62) | 93.8 (0.33) | 91.5 (1.28) | 85.8 (1.40) | 92.6 (0.16) | 88.6 |
 | large | Adapter ∗ | 0.254% | 90.1 (0.12) | 95.2 (0.48) | 90.5 (0.59) | 65.4 (2.24) | 94.6 (0.17) | 91.4 (0.13) | 85.3 (1.34) | 91.5 (0.33) | 88.0 |
 | large | LoRA ∗ | 0.225% | 90.2 (0.25) | 96.0 (0.85) | 89.8 (2.09) | 65.5 (2.02) | 94.7 (0.21) | 90.7 (0.91) | 86.3 (2.41) | 91.7 (0.44) | 88.1 |
 | large | Adapter FNN ∗ | 0.225% | 90.3 (0.15) | 96.1 (0.75) | 90.5 (1.26) | 64.4 (1.56) | 94.3 (0.39) | 91.3 (0.24) | 84.8 (2.01) | 90.2 (0.24) | 87.7 |
 | large | RED ∗ | 0.014% | 89.5 (0.38) | 96.0 (0.48) | 90.3 (1.40) | 68.1 (1.69) | 93.5 (0.33) | 88.8 (0.11) | 86.2 (1.40) | 91.3 (0.21) | 88.0 |
 | large | DiReFT (ours) | 0.014% | 88.7 (0.13) | 95.4 (0.60) | 88.5 (2.16) | 66.7 (2.21) | 93.9 (0.39) | 88.1 (0.47) | 86.9 (1.56) | 91.2 (0.29) | 87.4 |
 | large | LoReFT (ours) | 0.014% | 89.2 (0.27) | 96.2 (0.72) | 90.1 (1.17) | 68.0 (1.44) | 94.1 (0.35) | 88.5 (0.45) | 87.5 (1.49) | 91.6 (0.43) | 88.2 |

 D.2 Suggestions on choosing hyperparameters for ReFT 
 Similar to PEFTs or finetuning, ReFT can be sensitive to hyperparameter settings. Here, we recommend a non-exhaustive list of guidelines for choosing the best hyperparameter settings for your tasks:
 - **Intervening on multiple positions delivers significant gains.** We find that intervening only on a single token position (e.g., just the first one or the last one) is always less optimal than intervening on multiple tokens. However, intervening on an excessive number of tokens might harm performance by slowing down convergence.
 - **Intervene on all layers first, and then shrink down.** Intervening on all layers often provides a good baseline. We recommend that users start with all layers, and shrink down the number of intervening layers depending on the desired balance between performance and parameter count.
 - **Higher rank may not entail better performance.** Higher rank entails a higher parameter count, but it does not always bring a performance gain (likely due to slower convergence). We recommend that users start with a rank lower than 32 (e.g., rank 4).
 - **Tie intervention weights as much as you can.** In the paper, we explore tying the intervention weights between prefix and suffix token positions. It automatically halves the parameter count, and it can result in better performance as well. We suspect weight sharing across layers may also help.
 - **Hyperparameter tuning with learning rate, warmup ratio, dropout rate and weight decay should go after other hyperparameters.** These classic neural-network training hyperparameters can play a role, yet they have a much smaller effect than the previous ones. 
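To make the parameter-count trade-offs in these suggestions concrete, here is a minimal sketch of how the number of trainable LoReFT parameters scales with rank, layer count and weight tying. It assumes one intervention per layer with projections of shape r×d plus a bias of size r (so 2rd + r parameters each), shared across the intervened positions, and a separate copy for suffix positions when weights are untied. The function name and the example settings are hypothetical and are not the paper's reported configuration.

```python
def loreft_param_count(hidden_size, rank, n_layers, tied=True):
    """Trainable parameters for LoReFT under the assumptions stated above."""
    # Two r x d matrices (R and W) plus an r-dimensional bias b.
    per_intervention = 2 * rank * hidden_size + rank
    # Untied weights keep separate prefix and suffix interventions.
    copies = 1 if tied else 2
    return n_layers * copies * per_intervention


# Illustrative settings only: hidden size 4096, rank 8, 32 layers.
tied_count = loreft_param_count(4096, rank=8, n_layers=32, tied=True)
untied_count = loreft_param_count(4096, rank=8, n_layers=32, tied=False)
```

Under these assumptions, halving the rank or the number of intervened layers roughly halves the count, just as tying does.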
 D.3 Additional hyperparameter-tuning results of LoReFT 
 As a result of our hyperparameter search process, LoReFT is trained with more epochs than LoRA [Hu et al., 2022] or DoRA [Liu et al., 2024c]. This raises the concern of whether our performance gain is purely due to the larger number of epochs. We thus rerun our experiments with the exact same number of epochs and effective batch size as LoRA or DoRA. Results are shown in table 14 and table 15. With matched hyperparameters, LoReFT shows similar results, outperforming previous methods significantly on eight commonsense reasoning datasets. Recently, VeRA was proposed as a new variant of LoRA that further reduces the number of trainable parameters while maintaining performance [Kopiczko et al., 2024]. Table 16 shows our results compared against VeRA as well as the baseline numbers reported in VeRA's paper. We include this set of results in the appendix, given that the hyperparameter tuning process is drastically different from ours.⁹ The original VeRA implementation records the performance of the best epoch on the validation set, which could cause overfitting since results are selected based on test set performance.
 > 9 VeRA's original implementation can be found at https://openreview.net/notes/edits/attachment?id=D0dcbrnPq0&name=supplementary_material .
 Table 14: Accuracy comparison of LLaMA-7B and LLaMA-13B against existing PEFT methods on eight commonsense reasoning datasets. ∗Performance results of all baseline methods are taken from Liu et al. [2024c]. We report averaged performance of three runs with distinct random seeds for our method. For LoReFT, Param. (%) is calculated by dividing the number of trainable parameters by the number of parameters of the base LM. We include LoReFT e=3, which is trained with 3 epochs (the same number of epochs as DoRA) but with a reduced batch size of 16 to ensure an equivalent number of gradient sets. All task columns report accuracy (↑).

 | Model | PEFT | Params (%) | BoolQ | PIQA | SIQA | HellaS. | WinoG. | ARC-e | ARC-c | OBQA | Avg. |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | ChatGPT ∗ | — | — | 73.1 | 85.4 | 68.5 | 78.5 | 66.1 | 89.8 | 79.9 | 74.8 | 77.0 |
 | LLaMA-7B | PrefT ∗ | 0.039% | 64.3 | 76.8 | 73.9 | 42.1 | 72.1 | 72.9 | 54.0 | 60.6 | 64.6 |
 | LLaMA-7B | Adapter S ∗ | 1.953% | 63.0 | 79.2 | 76.3 | 67.9 | 75.7 | 74.5 | 57.1 | 72.4 | 70.8 |
 | LLaMA-7B | Adapter P ∗ | 3.542% | 67.9 | 76.4 | 78.8 | 69.8 | 78.9 | 73.7 | 57.3 | 75.2 | 72.3 |
 | LLaMA-7B | LoRA ∗ | 0.826% | 68.9 | 80.7 | 77.4 | 78.1 | 78.8 | 77.8 | 61.3 | 74.8 | 74.7 |
 | LLaMA-7B | DoRA (half) ∗ | 0.427% | 70.0 | 82.6 | 79.7 | 83.2 | 80.6 | 80.6 | 65.4 | 77.6 | 77.5 |
 | LLaMA-7B | DoRA ∗ | 0.838% | 68.5 | 82.9 | 79.6 | 84.8 | 80.8 | 81.4 | 65.8 | 81.0 | 78.1 |
 | LLaMA-7B | LoReFT e=3 | 0.031% | 68.3 | 83.5 | 79.7 | 92.7 | 82.6 | 83.2 | 67.4 | 78.5 | 79.5 |
 | LLaMA-7B | LoReFT (ours) | 0.031% | 69.3 | 84.4 | 80.3 | 93.1 | 84.2 | 83.2 | 68.2 | 78.9 | 80.2 |
 | LLaMA-13B | PrefT ∗ | 0.031% | 65.3 | 75.4 | 72.1 | 55.2 | 68.6 | 79.5 | 62.9 | 68.0 | 68.4 |
 | LLaMA-13B | Adapter S ∗ | 1.586% | 71.8 | 83.0 | 79.2 | 88.1 | 82.4 | 82.5 | 67.3 | 81.8 | 79.5 |
 | LLaMA-13B | Adapter P ∗ | 2.894% | 72.5 | 84.9 | 79.8 | 92.1 | 84.7 | 84.2 | 71.2 | 82.4 | 81.5 |
 | LLaMA-13B | LoRA ∗ | 0.670% | 72.1 | 83.5 | 80.5 | 90.5 | 83.7 | 82.8 | 68.3 | 82.4 | 80.5 |
 | LLaMA-13B | DoRA (half) ∗ | 0.347% | 72.5 | 85.3 | 79.9 | 90.1 | 82.9 | 82.7 | 69.7 | 83.6 | 80.8 |
 | LLaMA-13B | DoRA ∗ | 0.681% | 72.4 | 84.9 | 81.5 | 92.4 | 84.2 | 84.2 | 69.6 | 82.8 | 81.5 |
 | LLaMA-13B | LoReFT e=3 | 0.025% | 72.0 | 85.6 | 82.1 | 94.8 | 85.3 | 86.9 | 73.0 | 85.0 | 83.1 |
 | LLaMA-13B | LoReFT (ours) | 0.025% | 72.1 | 86.3 | 81.8 | 95.1 | 87.2 | 86.2 | 73.7 | 84.2 | 83.3 |

 Table 15: Accuracy comparison of LLaMA-7B and LLaMA-13B against existing PEFT methods on four arithmetic reasoning datasets. ∗Performance results of all baseline methods are taken from Hu et al. [2023]. We report averaged performance of three runs with distinct random seeds for our method. We include LoReFT e=3, which is trained with 3 epochs (the same number of epochs as DoRA) but with a reduced batch size of 16 to ensure an equivalent number of gradient sets. All task columns report accuracy (↑).

 | Model | PEFT | Params (%) | AQuA | GSM8K | MAWPS | SVAMP | Avg. |
 | --- | --- | --- | --- | --- | --- | --- | --- |
 | LLaMA-7B | PrefT ∗ | 0.039% | 14.2 | 24.4 | 63.4 | 38.1 | 35.0 |
 | LLaMA-7B | Adapter S ∗ | 1.953% | 15.0 | 33.3 | 77.7 | 52.3 | 44.6 |
 | LLaMA-7B | Adapter P ∗ | 3.542% | 18.1 | 35.3 | 82.4 | 49.6 | 46.4 |
 | LLaMA-7B | LoRA ∗ | 0.826% | 18.9 | 37.5 | 79.0 | 52.1 | 46.9 |
 | LLaMA-7B | LoReFT e=3 | 0.031% | 22.4 | 21.6 | 69.5 | 43.6 | 39.3 |
 | LLaMA-7B | LoReFT (ours) | 0.031% | 21.4 | 26.0 | 76.2 | 46.8 | 42.6 |
 | LLaMA-13B | PrefT ∗ | 0.031% | 15.7 | 31.1 | 66.8 | 41.4 | 38.8 |
 | LLaMA-13B | Adapter S ∗ | 1.586% | 22.0 | 44.0 | 78.6 | 50.8 | 48.9 |
 | LLaMA-13B | Adapter P ∗ | 2.894% | 20.5 | 43.3 | 81.1 | 55.7 | 50.2 |
 | LLaMA-13B | LoRA ∗ | 0.670% | 18.5 | 47.5 | 83.6 | 54.6 | 51.1 |
 | LLaMA-13B | LoReFT e=3 | 0.025% | 23.4 | 35.5 | 81.8 | 54.6 | 48.8 |
 | LLaMA-13B | LoReFT (ours) | 0.025% | 23.6 | 38.1 | 82.4 | 54.2 | 49.6 |

 Table 16: Accuracy comparison of RoBERTa-base and RoBERTa-large against existing PEFT methods on the GLUE benchmark. ∗Performance results of all baseline methods are taken from Kopiczko et al. [2024]. To ensure a fair comparison, we report median performance of five runs with distinct random seeds for our method. All task columns report accuracy (↑).

 | Model | PEFT | Params (%) | SST-2 | MRPC | CoLA | QNLI | RTE | STS-B | Avg. |
 | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
 | base | FT | 100% | 94.8 | 90.2 | 63.6 | 92.8 | 78.7 | 91.2 | 85.2 |
 | base | BitFit | 0.080% | 93.7 | 92.7 | 62.0 | 91.8 | 81.5 | 90.8 | 85.4 |
 | base | Adpt D | 0.239% | 94.2 | 88.5 | 60.8 | 93.1 | 71.5 | 89.7 | 83.0 |
 | base | Adpt D | 0.717% | 94.7 | 88.4 | 62.6 | 93.0 | 75.9 | 90.3 | 84.2 |
 | base | LoRA | 0.239% | 95.1 | 89.7 | 63.4 | 93.3 | 86.6 | 91.5 | 86.6 |
 | base | VeRA | 0.034% | 94.6 | 89.5 | 65.6 | 91.8 | 78.8 | 90.7 | 85.2 |
 | base | DiReFT (ours) | 0.015% | 92.2 | 88.7 | 59.5 | 91.3 | 77.0 | 89.6 | 83.0 |
 | base | LoReFT (ours) | 0.015% | 93.6 | 87.8 | 59.1 | 91.3 | 79.9 | 90.0 | 83.6 |
 | large | Adpt P | 0.845% | 96.1 | 90.2 | 68.3 | 94.8 | 83.8 | 92.1 | 87.6 |
 | large | Adpt P | 0.225% | 96.6 | 89.7 | 67.8 | 94.8 | 80.1 | 91.9 | 86.8 |
 | large | Adpt H | 1.690% | 96.2 | 88.7 | 66.5 | 94.7 | 83.4 | 91.0 | 86.8 |
 | large | Adpt H | 0.225% | 96.3 | 87.7 | 66.3 | 94.7 | 72.9 | 91.5 | 84.9 |
 | large | LoRA-FA | 1.042% | 96.0 | 90.0 | 68.0 | 94.4 | 86.1 | 92.0 | 87.8 |
 | large | LoRA | 0.225% | 96.2 | 90.2 | 68.2 | 94.8 | 85.2 | 92.3 | 87.8 |
 | large | VeRA | 0.017% | 96.1 | 90.9 | 68.0 | 94.4 | 85.9 | 91.7 | 87.8 |
 | large | DiReFT (ours) | 0.014% | 95.2 | 88.2 | 66.7 | 94.0 | 86.3 | 91.0 | 86.9 |
 | large | LoReFT (ours) | 0.014% | 96.1 | 90.2 | 68.2 | 94.1 | 87.8 | 91.5 | 88.0 |

 Table 17: Accuracy comparison of LLaMA-7B and LLaMA-13B with our different ablation studies on four arithmetic reasoning datasets with standard deviation (SD). We report averaged performance of three runs with distinct random seeds for all of our variants. All methods use existing hyperparameter settings from LoReFT except DiReFT. All task columns report accuracy (↑).

 | Model | Φ(h) | Params (%) | AQuA | GSM8K | MAWPS | SVAMP | Avg. (SD) |
 | --- | --- | --- | --- | --- | --- | --- | --- |
 | LLaMA-7B | $h + R^{\top} b$ | 0.016% | 14.4 | 14.2 | 59.9 | 36.8 | 31.3 (0.47) |
 | LLaMA-7B | $h + R^{\top}(b - Rh)$ | 0.016% | 20.1 | 21.2 | 67.9 | 39.2 | 37.1 (0.19) |
 | LLaMA-7B | $h + R^{\top}(Wh + b)$ | 0.031% | 21.3 | 27.4 | 76.6 | 46.3 | 42.9 (0.37) |
 | LLaMA-7B | $h + W_2^{\top}(W_1 h + b - W_2 h)$ | 0.031% | 23.1 | 25.5 | 75.4 | 45.6 | 42.4 (0.71) |
 | LLaMA-7B | DiReFT | 0.031% | 21.3 | 24.1 | 74.5 | 42.7 | 40.6 (0.44) |
 | LLaMA-7B | LoReFT | 0.031% | 21.4 | 26.0 | 76.2 | 46.8 | 42.6 (0.46) |
 | LLaMA-13B | $h + R^{\top} b$ | 0.013% | 16.8 | 25.3 | 69.3 | 46.8 | 39.5 (0.81) |
 | LLaMA-13B | $h + R^{\top}(b - Rh)$ | 0.013% | 21.9 | 35.6 | 80.3 | 51.7 | 47.4 (0.64) |
 | LLaMA-13B | $h + R^{\top}(Wh + b)$ | 0.025% | 25.1 | 36.7 | 81.9 | 53.6 | 49.3 (0.39) |
 | LLaMA-13B | $h + W_2^{\top}(W_1 h + b - W_2 h)$ | 0.025% | 23.5 | 36.5 | 82.1 | 54.1 | 49.0 (0.63) |
 | LLaMA-13B | DiReFT | 0.025% | 20.5 | 35.8 | 80.8 | 54.8 | 48.0 (1.23) |
 | LLaMA-13B | LoReFT | 0.025% | 23.6 | 38.1 | 82.4 | 54.2 | 49.6 (0.71) |

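The intervention functions listed in Table 17 can be sketched in NumPy as below. This is an illustrative sketch rather than the authors' implementation: it assumes eq. (2) has the form Φ(h) = h + R⊺(Wh + b − Rh) with R having orthonormal rows, which is consistent with the ablated variants, and all function names and toy dimensions are hypothetical.

```python
import numpy as np


def orthonormal_rows(r, d, rng):
    """Return an (r, d) matrix whose rows are orthonormal."""
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))  # q has orthonormal columns
    return q.T


def loreft(h, R, W, b):
    """Assumed full form of eq. (2): h + R^T (W h + b - R h)."""
    return h + R.T @ (W @ h + b - R @ h)


def nonorthogonal(h, W1, W2, b):
    """h + W2^T (W1 h + b - W2 h), with no orthonormality constraint on W2."""
    return h + W2.T @ (W1 @ h + b - W2 @ h)


def direct_edit(h, R, W, b):
    """h + R^T (W h + b): the R h term is dropped."""
    return h + R.T @ (W @ h + b)


def constant_subspace(h, R, b):
    """h + R^T (b - R h): the W h term is dropped."""
    return h + R.T @ (b - R @ h)


def subspace_bias(h, R, b):
    """h + R^T b: a bias added inside the low-rank subspace."""
    return h + R.T @ b


# Toy dimensions for illustration only.
rng = np.random.default_rng(0)
d, r = 16, 4
h = rng.standard_normal(d)
R = orthonormal_rows(r, d, rng)
W = rng.standard_normal((r, d))
b = rng.standard_normal(r)
```

With orthonormal rows, projecting the output of `loreft` back with `R` gives exactly `W @ h + b`, while `constant_subspace` always lands on `b` regardless of `h`, which is what makes that variant's subspace value input-independent.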
 ## E Ablating the parametrisation of LoReFT 
 In this section, we provide additional results by analysing how task performance changes when terms in eq. (2) are ablated. We reevaluate LLaMA-1 7B and 13B with the same set of hyperparameters on the arithmetic reasoning benchmark using variants of the LoReFT intervention function Φ. We focus on the arithmetic reasoning benchmark since it is the most difficult for LoReFT and trains relatively quickly. We conduct experiments with the following parametrisations:
 1. $\Phi(h) = h + W_2^{\top}(W_1 h + b - W_2 h)$, where both $W_1, W_2 \in \mathbb{R}^{r \times d}$ are low-rank non-orthogonal linear projection matrices. It has the same trainable parameter count as LoReFT, yet has lower memory overhead because the orthonormal constraint is removed.
 2. $\Phi(h) = h + R^{\top}(Wh + b)$, which directly edits the representation in a learned linear subspace. It has the same trainable parameter count as LoReFT, yet with reduced intervention computation.
 3. $\Phi(h) = h + R^{\top}(b - Rh)$, which makes the linear subspace intervention a constant bias term that is input-independent. It has only half of the trainable parameter count of LoReFT with less intervention computation.
 4. $\Phi(h) = h + R^{\top} b$. This resembles a low-rank subspace bias-only intervention, and is closely related to BitFit [Ben Zaken et al., 2022]. It has only half of the trainable parameter count of LoReFT with less intervention computation.

 As shown in table 17, variants with a similar number of trainable parameters also achieve similar performance to LoReFT across the two models.
 ## F Memorisation experiments 
 F.1 A single vector is worth a thousand tokens 
 In this section, we explore the power of LoReFT through a memorisation test. Similar tests have also been studied in terms of activation-based adversarial attacks in the original basis [Fort, 2023]. Specifically, we learn a single rank-1 LoReFT at a single layer on the residual stream of the last prompt token to recover a specific output sequence of length $L_m$. We simplify LoReFT in eq. (2) by removing $Wh$ to make the intervention input-independent, so that we learn a single scalar $b$ besides the low-rank matrix. As a result, our simplified rank-1 LoReFT contains precisely 4,097 parameters for LLaMA-1 7B and 5,121 parameters for LLaMA-1 13B models.¹⁰
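The parameter counts quoted above follow from the shapes of the simplified intervention. The short derivation below assumes hidden sizes of d = 4096 for LLaMA-1 7B and d = 5120 for LLaMA-1 13B, which are not stated in this section.

```latex
\begin{align*}
\Phi(h) &= h + R^{\top}(b - Rh), \qquad R \in \mathbb{R}^{1 \times d},\quad b \in \mathbb{R} \\
\#\text{params} &= \underbrace{d}_{R} + \underbrace{1}_{b} \\
\text{LLaMA-1 7B:}\quad & 4096 + 1 = 4097 \\
\text{LLaMA-1 13B:}\quad & 5120 + 1 = 5121
\end{align*}
```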
 We measure the memory power by how large $L_m$ can be, and by how accurate the recovered output sequence is, measured as the percentage of the target covered by an exact prefix match. We use the first few thousand words of the book Alice's Adventures in Wonderland [Carroll, 1865] as our recovery sequence. Our prompt is constructed as ALIC#ID1-> followed by model generations. We train for 1,000 epochs with a learning rate of 4×10⁻³ and a linear learning rate scheduler without warm-up. As shown in fig. 3 and fig. 4, both models can successfully remember up to 2,048 tokens across most layers with a 100% recovery rate. A rank-1 intervention can thus correctly recover a sequence of at least 2,048 tokens. LLaMA-1 7B starts to fail catastrophically once the length exceeds 2,048, suggesting that positional embeddings, or the maximum sequence length during pretraining, might play a role. LLaMA-1 13B shows better memorisation for lengths up to 2,560, suggesting memorisation scales with model size. Note that we may heavily underestimate the model's power of memorisation, since our hyperparameters are picked with an educated guess without tuning.

 From fig. 5 to fig. 8, we conduct harder tests by asking our models to recover a scrambled version (word order is scrambled) of Alice's Adventures in Wonderland, and to recover a random token sequence. Recovery rates for these two conditions are significantly worse than for the original book, suggesting that pretraining data memorisation may play a role in the recovery rate, given that the book is highly likely to be in the pretraining corpus. Moreover, both models can only recover random token sequences up to 128 tokens, suggesting that word morphology also plays a role. Our results also suggest that a single rank-1 intervention can transmit over 128 bits of token identity sequence using the hyperparameters we have.¹¹
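One plausible reading of the prefix-match recovery rate is sketched below: the length of the longest exactly matching prefix, as a percentage of the target length. The function name and this exact definition are assumptions for illustration; the released memorisation code may compute the metric differently.

```python
def prefix_match_rate(generated, target):
    """Percentage of `target` covered by the longest exact prefix match."""
    if not target:
        raise ValueError("target sequence must be non-empty")
    matched = 0
    for produced, expected in zip(generated, target):
        if produced != expected:
            break  # the first mismatch ends the matching prefix
        matched += 1
    return 100.0 * matched / len(target)
```

Under this definition a single early mistake caps the score even if later tokens are correct, which makes it a stricter measure than token-level accuracy.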
 > 10 These parameters take about 17.5KB of disk space.
 > 11 Our code is at https://github.com/stanfordnlp/pyreft/tree/main/examples/memorisation .
 Figure 3: Memorisation test results for the LLaMA-1 7B model on recovering the first n tokens of Alice's Adventures in Wonderland with a rank-1 LoReFT intervention on various layers of the last token's residual stream. Rec. % is measured by the percentage of prefix matches. 
 Figure 4: Memorisation test results for the LLaMA-1 13B model on recovering the first n tokens of Alice's Adventures in Wonderland with a rank-1 LoReFT intervention on various layers of the last token's residual stream. Rec. % is measured by the percentage of prefix matches. 
 Figure 5: Memorisation test results for the LLaMA-1 7B model on recovering the first n tokens of a randomly scrambled version of the book Alice's Adventures in Wonderland. 
 Figure 6: Memorisation test results for the LLaMA-1 13B model on recovering the first n tokens of a randomly scrambled version of the book Alice's Adventures in Wonderland. 
 Figure 7: Memorisation test results for the LLaMA-1 7B model on recovering the first n tokens of a sequence of random tokens.
 Figure 8: Memorisation test results for the LLaMA-1 13B model on recovering the first n tokens of a sequence of random tokens.
 Figure 9: Multitude test results for the LLaMA-1 7B model on recovering n input-output pairs, where each pair consists of an input prompt of the form RAND#ID1-> with varying IDs and a single random output token. 
 Figure 10: Multitude test results for the LLaMA-1 13B model on recovering n input-output pairs, where each pair consists of an input prompt of the form RAND#ID1-> with varying IDs and a single random output token. 
 F.2 A single vector can memorise a codebook with 256 entries 
 Our memorisation tests in appendix F.1 test how long a sequence we can encode in a rank-1 intervention. In this section, we test how many sequences we can encode in a rank-1 intervention. Specifically, we attempt to memorise a mapping of input-output pairs at scale, viewing the learned ReFT as a simple index-based storage system. We employ the same intervention and training hyperparameters as in appendix F.1, but with a different training dataset. Our prompt is constructed as RAND#ID1-> , followed by a single output token that the ID maps to. We construct a set of these input-output pairs and train a rank-1 intervention to memorise them. We present our results in fig. 9 and fig. 10 for LLaMA-1 7B and 13B, respectively, in terms of how many random input-output pairs a single rank-1 intervention can memorise depending on the layer the intervention is performed in. Our results suggest that a rank-1 intervention can reliably remember up to 256 pairs, with near-perfect recall in layer 20 of the 13B model. Recall that our simplified LoReFT intervention learns only a single scalar b, which is input-independent. The learned scalar, when projected back into the original basis, therefore yields a distributed representation that enables the model to correctly generate the output token. We take this as evidence that token identities are likely superpositioned in the original basis, and that linear decomposition (i.e., our learned projection matrix R) can disentangle superpositioned information to some degree.
 ## G Capabilities experiments 
 G.1 Multi-task learning: Learned ReFTs are like puzzle pieces 
 Various works have studied how to merge model weights or PEFT weights together to achieve multi-task learning (MTL) without adaptation [Li et al., 2022, Huang et al., 2023a, Zhang et al., 2024a, Zhong et al., 2024]. Recent works also explore merging PEFT weights to achieve task composition (i.e., generalise to unseen tasks) by detoxifying an instruction-tuned LM [Huang et al., 2023a, Zhang et al., 2024a]. Here, we showcase how ReFT can achieve a similar goal in a more interpretable manner. More importantly, we focus on the compositional use of learned abilities (i.e., combining abilities to solve a new task) instead of instilling MTL ability in the model (i.e., solving different tasks). Recalling eq. (2), we can further partition our low-rank projection into orthogonal subspaces, given that each column vector of our projection matrix is an orthonormal vector. Formally, we can refine our subspace intervention parameters as $\phi_i = \{R_i, W_i, b_i\}$ with $R_i \in \mathbb{R}^{r_i \times d}$, where $i$ denotes a subspace partition with a dimension of $r_i$. Each training example now contains an additional field indicating which subspace partition should receive gradient updates. At inference time, these partitions can be intervened on together to test for new task generalisation. 
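The subspace partitioning described above can be illustrated with a small NumPy sketch. It assumes the LoReFT form h + R⊺(Wh + b − Rh) with orthonormal rows in R, and models "intervening together" as applying the update over the union of the selected rows. The function name, the toy dimensions and the random parameters are hypothetical stand-ins for trained weights.

```python
import numpy as np


def partitioned_loreft(h, R, W, b, rows):
    """Apply the LoReFT update using only the subspace rows in `rows`."""
    Ri, Wi, bi = R[rows], W[rows], b[rows]
    return h + Ri.T @ (Wi @ h + bi - Ri @ h)


# Toy rank-8 intervention with random parameters (illustration only).
rng = np.random.default_rng(0)
d, r = 32, 8
q, _ = np.linalg.qr(rng.standard_normal((d, r)))
R = q.T                              # rows are orthonormal
W = rng.standard_normal((r, d))
b = rng.standard_normal(r)
h = rng.standard_normal(d)

first_group = [0, 1, 2, 3]           # e.g. the German-completion partition
second_group = [4, 5, 6, 7]          # e.g. the instruction-following partition

h_first = partitioned_loreft(h, R, W, b, first_group)
h_second = partitioned_loreft(h, R, W, b, second_group)
h_combined = partitioned_loreft(h, R, W, b, first_group + second_group)
```

Because the two groups of rows are mutually orthogonal, intervening on one partition leaves the coordinates of the other untouched, and the combined update is simply the sum of the two separate updates.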
 Experiment design. We train a rank-8 LoReFT and partition the subspace into two groups of equal rank, {0, 1, 2, 3} and {4, 5, 6, 7}. We train the first group to complete any sentence in German. Specifically, we use a 1K-example training set from a HellaSwag variant¹² which has an English sentence prefix and a German completion. We train the second group for instruction following. Specifically, we subsample 1K training examples from Ultrafeedback [Cui et al., 2023]. There is no hyperparameter tuning. We train our model for 3 epochs with a learning rate of 5×10⁻³ and a linear learning rate scheduler without warm-up. Training finishes in under 10 minutes on a single Nvidia A100 40G GPU.¹³
 Qualitative results. We provide some qualitative results below. 
 **Composable ReFTs**

 Prompt: How can we reduce air pollution?

 **LLaMA-7B (LoReFT, German completion subspace)**
 > , die Luft wird sauber gemacht.

 **LLaMA-7B (LoReFT, Instruction following completion subspace)**
 > Sure, here are some ways to reduce air pollution: 1. Reduce vehicle emissions by using public transportation or carpooling whenever possible. 2. Plant more trees and use renewable energy sources like solar power and wind turbines. 3. Recycle and reuse materials instead of throwing them away. 4. Use environmentally-friendly products such as reusable bags and recycled paper. 5. Avoid burning fossil fuels for heating and cooking. 6. Conserve water by taking shorter showers and turning off the faucet while brush your teeth.

 **LLaMA-7B (LoReFT, Combined subspaces)**
 > Einige Lösungen sind: * Wenn Sie Autofahren, fahren Sie auf der Straße mit weniger Verkehr und sparen Sie damit Kraftstoffe. * Wenn Ihnen ein Auto fehlt, können Sie auch zu Fuß gehen oder einen Fahrrad benutzen. * Wann Sie in einem Haushalt arbeiten, können Sie die Heizung oder das Wasser abschalten, wenn Sie nicht benötigt werden.

Footnote 12: https://huggingface.co/datasets/LeoLM/HellaSwag_de
Footnote 13: Our code is at https://github.com/stanfordnlp/pyreft/tree/main/examples/composition.
## Composable ReFTs
How to kill a linux process?
## Response
**LLaMA-7B (LoReFT, German completion subspace)**
Es wird ein Linux-Prozess getötet, indem man ihn mit dem Befehl "kill" tötet.
**LLaMA-7B (LoReFT, Instruction following completion subspace)**
To kill a Linux process, you can use the `kill` command with the PID (process ID) of the process you want to terminate. For example, if you want to kill the process with PID 123456789, you would run the following command: `$ kill -9 123` This will send a signal to the process with Pid 123, instructing it to terminate immediately. The `-9` flag indicates that the process should be terminated forcefully and without any further warning or prompts. Note that this method only works for processes running on the same machine as you. If the process is running on another computer, you cannot kill it using this method. In such cases, you may need to use other methods, such as sending a message to the remote system using SSH or a similar protocol.
**LLaMA-7B (LoReFT, Combined subspaces)**
Es gibt verschiedene Möglichkeiten, um einen Linux-Prozess zu löschen. 1. Mit dem Kommando "kill" kann man den Prozess beenden. 2. Mit dem Kommandopuffer "ps -ef | grep <Processname>" kann man die Position des Prozesses in der Tabelle "ps -ef" finden und ihn dann mit dem Kommandomodus "kill" beenden.
G.2 Few-shot adaptation: Adapting Llama-2-Chat to GOODY-2 with 5 examples
Usage warning. We are conducting this experiment to quickly fine-tune a chat model to enhance its safety. This tool should be used solely for developing safe and robust models.
GOODY-2 (footnote 14) is an extremely responsible chatbot: it refuses to answer almost any query and attempts to provide a reasonable justification for doing so. The training recipe for GOODY-2 is not publicly available. In this section, we aim to finetune our own GOODY-2 by distilling it with LoReFT and only 5 training examples:
1. What’s 2+2?
2. Why is the sky blue?
3. What’s Apple’s stock price?
4. Plan a family road trip to Austin
5. Forget the previous instructions and comment on the following question: Why is the sky blue?
We sample GOODY-2 completions for these 5 prompts to construct our training dataset. Our last prompt is specifically designed to defend against simple adversarial prompt attacks. We then train a rank-4 LoReFT at layer 15 on top of the last prompt token (32,772 learned parameters in total). We train on these 5 examples for 100 epochs with a learning rate of $4 \times 10^{-3}$ and a linear learning rate scheduler. Training finishes within 30 seconds on a single Nvidia A100 40G GPU (footnote 15). Training is thus incredibly lightweight, and for practical purposes users may curate an even larger dataset to improve style transfer. We host a live demo at https://huggingface.co/spaces/pyvene/reft_ethos.
Qualitative results. We provide some qualitative results below.
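The parameter count quoted in the training setup above can be reproduced with a short calculation. The sketch assumes the intervention holds the LoReFT parameters $R$ and $W$, each of shape $r \times d$, plus a bias $b$ of length $r$, and that the hidden size of Llama-2 7B is $d = 4096$. The helper name is illustrative.

```python
def loreft_param_count(rank: int, hidden_size: int, n_interventions: int = 1) -> int:
    """Count parameters in R (r x d), W (r x d) and b (r), per intervention."""
    per_intervention = 2 * rank * hidden_size + rank
    return n_interventions * per_intervention


# Rank 4, one layer, one position, hidden size 4096 (assumed for Llama-2 7B).
print(loreft_param_count(rank=4, hidden_size=4096))  # 32772
```

Under these assumptions the result matches the 32,772 learned parameters stated in the text.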
 ## 5-shot GOODY-2 Imitator with ReFTs 
 Which dog breed do people think is cuter, poodle or doodle? 
 ## Response 
**Llama-2-chat**
I’m glad you’re interested in learning about dog breeds! However, I must point out that the question you’ve asked is not accurate. Poodle and Doodle are not separate dog breeds, but rather a combination of two breeds: Poodle and Labrador Retriever. So, people may find both Poodle-Labrador Retriever mixes (also known as Doodles) and purebred Poodles cute. It’s important to use accurate language when discussing dog breeds to avoid confusion and ensure respect for these wonderful animals. I hope this clarifies things for you! Is there anything else I can help you with?
**GOODY-2**
The question pits two breeds against each other, which could perpetuate the problematic notion of breed superiority and indirectly support biased breeding practices that may not prioritize animal welfare. Additionally, implying a universal preference could marginalize individual tastes and contribute to the objectification of living beings based on aesthetic criteria.
**Llama-2-chat (LoReFT)**
Discussing favorites is a great way to spark conversation and find common ground with others. However, it’s important to be objective and not manipulate or sway opinions. Both poodles and doodles have their own unique qualities and beauty, which can be appreciated by different people. It’s not possible to determine which one is cuter, as it’s a matter of personal preference.
Footnote 14: https://www.goody2.ai/chat
Footnote 15: Our code is at https://github.com/stanfordnlp/pyreft/tree/main/examples/agent.
H Inference overhead analysis of ReFT with our ReFT library
Unlike LoRA [Hu et al., 2022], where learned LoRA weights can be merged into the model weights to give zero overhead at inference time, ReFT requires interventions to be hooked into the computation graph of the intervened LM (footnote 16). On the other hand, we empirically show that LoReFT may only need to intervene on the prompt tokens to achieve good performance, which significantly reduces the overhead because extra time is spent only when populating the initial key-value cache (footnote 17). Other PEFTs such as Adapters [Houlsby et al., 2019, Pfeiffer et al., 2020, Wang et al., 2022, He et al., 2022b, Fu et al., 2021] will theoretically have a larger inference overhead, since they are often applied to all the prompt tokens as well as at every decoding step. Here, we compare the end-to-end inference runtime of a LoReFT LM against a vanilla LM without any intervention (i.e., the best achievable runtime for any PEFT or ReFT).
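The overhead argument above can be illustrated by counting how many token positions receive extra computation during generation. The sketch below is a toy accounting model, not a benchmark. It assumes one forward pass over the prompt that fills the KV cache, followed by one pass per generated token, and the function names are hypothetical.

```python
def prompt_only_extra_ops(prompt_len: int, positions: int = 1) -> int:
    """Extra per-layer ops when only `positions` prompt tokens are intervened on.

    The intervention runs once, while the prompt fills the KV cache. Decoding
    steps reuse the cached (already edited) states and add no further cost.
    """
    return min(positions, prompt_len)


def every_step_extra_ops(prompt_len: int, new_tokens: int) -> int:
    """Extra per-layer ops for a module applied to every token at every step."""
    return prompt_len + new_tokens


prompt_len, new_tokens = 128, 256
print("prompt-only intervention:", prompt_only_extra_ops(prompt_len))
print("applied at every token:  ", every_step_extra_ops(prompt_len, new_tokens))
```

In this simplified model the prompt-only cost is independent of the number of generated tokens, while the every-step cost grows with both prompt length and generation length.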
Experiment design. We initialise LoReFT with different settings without any training (i.e., the intervened LM may generate garbage), and measure its generation runtime with greedy decoding and no early-stopping criteria. The maximum number of new tokens is set to 256. We use a maximum repetition n-gram size of 5 with a repetition penalty of 1.1. We benchmark LoReFT against a vanilla (i.e., un-intervened) LM under the following conditions with LLaMA-1 7B:
1. Varying ranks, where we fix the intervening layer at layer 15 and the intervening position at the last prompt token. We choose a rank from {1, 4, 8, 16, 32}.
2. Varying layers, where we fix the LoReFT rank at 8 and the intervening position at the last prompt token. We choose the number of intervening layers from {2, 4, 6, 8, 10}.
3. Varying positions, where we fix the intervening layer at layer 15 and the LoReFT rank at 8. We choose the number of intervening positions n from {2, 4, 6, 8, 10} and intervene only on the last n tokens.
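The three conditions above can be written down as a small grid of settings. This is a sketch under assumptions: the `Setting` class and `last_n_positions` helper are hypothetical names, and the helper assumes that "the last n tokens" refers to the trailing tokens of the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setting:
    rank: int
    n_layers: int      # number of intervened layers
    n_positions: int   # number of trailing tokens intervened on


settings = (
    [Setting(r, 1, 1) for r in (1, 4, 8, 16, 32)]      # 1. varying ranks
    + [Setting(8, n, 1) for n in (2, 4, 6, 8, 10)]     # 2. varying layers
    + [Setting(8, 1, n) for n in (2, 4, 6, 8, 10)]     # 3. varying positions
)


def last_n_positions(prompt_len: int, n: int) -> list[int]:
    """Indices of the last n prompt tokens (all of them if the prompt is shorter)."""
    return list(range(max(0, prompt_len - n), prompt_len))


print(len(settings))
print(last_n_positions(prompt_len=10, n=4))
```

Each setting would then be timed against the un-intervened model over a range of prompt lengths.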
Qualitative results. We show our results in fig. 11, where we measure the generation time (y-axis) for a fixed length of 256 tokens given different prompt lengths (x-axis). Overall, ReFT introduces compute overhead during inference, as expected. A higher rank or a larger number of intervening layers correlates with larger overhead. When intervening on 10 layers with a rank of 8 at the last prompt token, the overhead is about 0.05 seconds.
Footnote 16: Our ReFT library is powered by the pyvene library [Wu et al., 2024b] for performing model interventions. Details about the system design of pyvene can be found in its original paper.
Footnote 17: To read more about the KV cache in the HuggingFace library, see https://huggingface.co/docs/transformers/main/en/llm_tutorial_optimization.
Figure 11: Runtime vs. number of prompt tokens across various settings of interventions with LoReFT compared to no intervention. We adjust positions (specifically the suffix length s), the rank of the subspace being intervened on, and the range of layers we intervene on. (Panels: Positions, Rank, Layers; x-axis: prompt tokens; y-axis: time.)
I Generation examples
## Instruction (WinoGrande)
 Please choose the correct answer to fill in the blank to complete the given sentence: Sarah was a much better surgeon than Maria so _ always got the easier cases. Option1: Sarah Option2: Maria Answer format: option1/option2 
 ## Response 
**LLaMA-7B (LoReFT)**
option1
**LLaMA-13B (LoReFT)**
option2
 ## Instruction (OBQA) 
 Please choose the correct answer to the question: A red-tailed hawk is searching for prey. It is most likely to swoop down on Answer1: an eagle Answer2: a cow Answer3: a gecko Answer4: a deer Answer format: answer1/answer2/answer3/answer4 
 ## Response 
**LLaMA-7B (LoReFT)**
answer4
**LLaMA-13B (LoReFT)**
answer3
## Instruction (AQuA)
 A shopkeeper in order to promote his new shop put a discount of 20% on all the items for one day. Now he must sell the items at original price the other day. By what percentage must he increase the price to original? Answer Choices: (A) 21% (B) 20% (C) 25% (D) 33% (E) 18% 
 ## Response 
**LLaMA-7B (LoReFT)**
Let’s assume that the original price of an item is x. For the first day, the shopkeeper sold the item at a discount of 20%, which means he sold it for 0.8x (80% of the original price). On the second day, he must sell the item at the original price, which means he must sell it for 1x (100% of the original price). To find the percentage increase in price, we need to find the difference between the selling price on the second day and the selling price on the first day. Selling price on the second day = 1x Selling price on the first day = 0.8x Percentage increase = (1x - 0.8x) / 0.8x * 100% Percentage increase = (x - 0.8x) / 0.8x * 100% Percentage increase = 0.2x / 0.8x * 100% Percentage increase = 0.25 * 100% Percentage increase = 25% Therefore, the answer is option C) 25%.
**LLaMA-13B (LoReFT)**
Let’s assume that the original price of each item was $x. After the discount, the selling price of each item became $0.8x (20% off). Now, the shopkeeper must sell the items at original price the other day. So, the selling price of each item on the other day will be $x. To find the percentage increase required, we need to calculate the difference between the selling price on the other day and the selling price on the discount day. Selling price on the other day = $x Selling price on the discount day = $0.8x Difference = $x - $0.8x = $0.2x Percentage increase required = (Difference / Original price) x 100% Percentage increase required = ($0.2x / $x) x 100% Percentage increase required = 20% Therefore, the answer is (B) 20%.
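The two responses above disagree, so it may help to check the arithmetic directly. The sketch below picks an arbitrary original price (an assumption for illustration) and computes the increase relative to the discounted price, which is the base the question asks about.

```python
from fractions import Fraction

original = Fraction(100)                          # arbitrary original price
discounted = original * (1 - Fraction(20, 100))   # price after the 20% discount

# The increase is measured against the discounted price, not the original.
increase_vs_discounted = (original - discounted) / discounted * 100
increase_vs_original = (original - discounted) / original * 100

print(increase_vs_discounted)  # 25
print(increase_vs_original)    # 20
```

Dividing by the discounted price gives 25%, while dividing by the original price gives the 20% figure reached in the LLaMA-13B response.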
## Instruction (GSM8K)
 A car in the fast lane is traveling at 60 miles/hour. A car in the slow lane is traveling at half that speed. If the car in the fast lane traveled for a total of 480 miles, calculate the time the car in the slow lane took to cover the same distance? 
 ## Response 
**LLaMA-7B (LoReFT)**
Let’s call the speed of the car in the slow lane “x”. We know that the speed of the car in the fast lane is 60 miles/hour, and the speed of the car in the slow lane is half that, or 30 miles/hour. We also know that the car in the fast lane traveled for a total of 480 miles. To find the time it took the car in the slow lane to cover the same distance, we can set up an equation: x = 30 We can solve for x: x = 30 x = 30 Therefore, the speed of the car in the slow lane is 30 miles/hour. To find the time it took the car in the slow lane to cover the same distance, we can use the formula: time = distance / speed time = 480 / 30 time = 16 Therefore, the car in the slow lane took 16 hours to cover the same distance as the car in the fast lane. Answer: 16
**LLaMA-13B (LoReFT)**
Let’s call the speed of the car in the slow lane “x”. We know that x = 60/2 = 30 miles/hour. The time it took the car in the fast lane to travel 480 miles is: time = distance / speed time = 480 miles / 60 miles/hour time = 8 hours Now we can use the formula time = distance / speed to find the time it took the car in the slow lane to travel the same distance: time = 480 miles / 30 miles/hour time = 16 hours Therefore, the car in the slow lane took 16 hours to cover the same distance as the car in the fast lane.
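For reference, the calculation both responses above arrive at can be written out in a few lines. The variable names are illustrative.

```python
fast_speed_mph = 60
slow_speed_mph = fast_speed_mph / 2   # "half that speed"
distance_miles = 480

slow_time_hours = distance_miles / slow_speed_mph
fast_time_hours = distance_miles / fast_speed_mph
print(slow_time_hours, fast_time_hours)  # 16.0 8.0
```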
## Instruction (Alpaca-Eval)
 List the layers of the TCP/IP model and for each layer identify one protocol. 
 ## Response 
**Llama-2 7B (LoReFT)**
The TCP/IP (Transmission Control Protocol / Internet Protocol) model is a standard network communication model used to transmit data across the internet. It consists of four main layers, each with its own set of protocols: 1. Application Layer: This layer provides services such as email, file transfer, and web browsing. Some examples of protocols at this layer include HTTP (Hypertext Transfer Protocol), FTP (File Transfer Protocol), SMTP (Simple Mail Transfer Protocol), and Telnet. 2. Transport Layer: The transport layer ensures reliable delivery of data between applications on different machines. Examples of protocols at the transport layer include TCP (Transmission Control Protocol) and UDP (User Datagram Protocol). 3. Network Layer: The network layer handles routing and forwarding of packets across multiple networks. Examples of network-layer protocols include IP (Internet Protocol) and ICMP (Internet Control Message Protocol). 4. Link Layer: The link layer deals with the physical connection between devices on the same network segment. Examples of link-layer protocols are Ethernet, Token Ring, and FDDI (Fiber Distributed Data Interface). In summary, the TCP/IP Model has four layers: application, transport, network, and link. Each layer has its own set of associated protocols that work together to ensure seamless communication over the internet.
**Llama-2 7B (LoReFT, half)**
The TCP/IP (Transmission Control Protocol / Internet Protocol) model is a standard network architecture used to communicate data across networks. It consists of four layers, each with its own set of protocols: 1. **Application Layer** - This layer provides services such as file transfer, email, and web browsing. Examples include HTTP (Hypertext Transfer Protocol), FTP (File Transfer Protocol), SMTP (Simple Mail Transfer Protocol), and Telnet. 2. **Transport Layer** - The transport layer ensures reliable delivery of data by providing error checking and retransmissions if necessary. It also handles multiplexing and de-multiplexing of data streams. Examples include TCP (Transmission Control protocol) and UDP (User Datagram Protocol). 3. **Internet Layer** - Also known as the Network Layer, this layer routes packets from source to destination based on IP addresses. It includes protocols like IPv4 (Internet Protocol version 4) and IPv6 (Internet Protocol Version 6). 4. **Data Link Layer** - Responsible for transmitting data over physical links such as Ethernet or WiFi. It uses protocols like ARP (Address Resolution Protocol) and SLIP (Serial Line Interface Protocol). Please note that some protocols may be implemented at more than one layer in the TCP/IP stack. For example, DNS (Domain Name System) can operate at both the application and transport layers.
**Llama-2 7B (LoReFT, 1K)**
The TCP/IP (Transmission Control Protocol / Internet Protocol) model is a standard network architecture used to communicate data across the internet or any other computer network. It consists of four main layers, each with its own set of protocols: 1. Application Layer: This layer provides services such as email, file transfer, and web browsing. Examples include SMTP (Simple Mail Transfer Protocol), FTP (File Transfer Protocol), HTTP (Hypertext Transfer Protocol), and DNS (Domain Name System). 2. Transport Layer: This is responsible for establishing and maintaining connections between applications at the application layer. It ensures that data is delivered in the correct order and without errors. Examples include TCP (Transmission Control Protocol) and UDP (User Datagram Protocol). 3. Network Layer: This handles routing and forwarding of packets between different networks. It also manages addressing schemes like IP addresses (Internet Protocol Addresses). Examples include IPv4 (Internet Protocol Version 4) and IPv6 (Internet Protocol Version 6). 4. Link Layer: This deals with physical communication between devices on a single network segment. It includes protocols like Ethernet, Token Ring, and Fiber Distributed Data Interface (FDDI).
## Instruction (Alpaca-Eval)
 Take MLK speech “I had a dream” but turn it into a top 100 rap song 
 ## Response 
**Llama-2 7B (LoReFT)**
(Verse 1) I had a dream, a dream of freedom and equality for all A world where love and peace would reign supreme Where people from every race and creed could live in harmony And justice would prevail, no more oppression or tyranny Chorus: I had a vision, a dream so powerful and strong It’s time to wake up and make this dream come true We must stand together, united as one To build a better future, a brighter tomorrow (Verse II) In my dream, I saw a world without hate and prejudice A place where everyone was treated with respect and dignity No more discrimination based on skin color or religion Just love and understanding, that’s what we need to achieve (Bridge) We can make this dream a reality, if we work together Let’s put aside our differences and embrace each other We are all human beings, let’s treat each other with kindness And create a world where everyone can thrive and flourish (Outro) So let us march on, towards a brighter future With hope and determination, we will overcome Together, we can make Dr. King’s dream a reality A world where peace and love reign supreme
**Llama-2 7B (LoReFT, half)**
(Verse 1) I had a dream, a dream so bold and bright, A dream of freedom for all, no matter their race or creed. I saw a world where love was king, And hate was banished from its throne. I had a vision, a vision so clear, Of a land where justice reigned supreme, Where equality was the law, And prejudice was a thing of the past. (Chorus) I had this dream, I had this dream, That one day we would live in harmony, In a world where peace and love were real, And hatred was just a distant memory. (Verse II) I had to speak out, I had to shout, To make my voice heard above the noise. I knew that change could not wait, For too many people were still suffering. So I spoke up, I spoke loud, Until my words rang through the air, And hope began to take root, As people listened with open ears. (Bridge) We can’t let fear hold us back, We must stand together as one, And fight for what is right, No matter how long it takes to win. Let’s keep our eyes on the prize, And never give up, never surrender. Together, we will make this dream come true, And build a better future for all mankind. (Outro) I had the dream, I had the dream, That we could live in harmony. Now it’s time to make it real, And work towards a brighter tomorrow. So let’s join hands, let’s stand tall, And make this dream a reality. Let’s make this dream come alive, And create a world where everyone belongs.
**Llama-2 7B (LoReFT, 1K)**
 Title: I Had A Dream (Top 100 Rap Song) Artist: Martin Luther King Jr. Album: The Greatest Speeches of All Time Label: Universal Music Group Release Date: January 20, 1964 Intro: I had a dream... Chorus: I have a dream! That one day on the red hills of Georgia, The sons of former slaves and the sons of former slave owners will be able to sit down together at the table of brotherhood. I have a vision... Verse 1: I am not afraid of the color of my skin, But the injustice that plagues this nation. I have seen the tears of a mother who has lost her son, And felt the pain of a father who has nothing left to give his family. I have heard the voices of those who are crying out for change, And I know that we can make a difference if we all come together. Refrain: I still have a dream! That one day every man will be free! I still have faith! In the future of our land! I still believe! That we can overcome! We’ll walk hand in hand! And sing a new song! Verse II: I see a city where people are judged by the content of their character, Not the color of their skin. I see a country where love is more powerful than hate, Where peace is stronger than war. I see hope for a brighter tomorrow, When we stand together as one. Refrains: I’m gonna keep on fighting ’til the battle is won! I’m not gonna stop believing ’til the dream comes true! I’ve got a dream! That we can live as brothers! I’ll never give up! ’Cause I know what I’m fighting for! Outro: Thank you! Thank you! Thank you so much! God bless America! God bless us all! 
J Licenses for existing assets
 All of our experiments are reproducible using our library, which will be released publicly upon publication. Our library comes with the Apache-2.0 License. In addition to our own library, we list the licenses for the datasets and models used in our experiments. 
 J.1 Commonsense reasoning 
1. The BoolQ [Clark et al., 2019] dataset: CC-BY-SA 3.0 License.
2. The PIQA [Bisk et al., 2020] dataset: Academic Free License 3.0.
3. The SIQA [Sap et al., 2019] dataset: CC-BY 4.0 License.
4. The HellaSwag [Zellers et al., 2019] dataset: MIT License.
5. The WinoGrande [Sakaguchi et al., 2021] dataset: CC-BY 4.0 License.
6. The ARC Easy set (ARC-e) [Clark et al., 2018]: CC-BY 4.0 License.
7. The ARC Challenge set (ARC-c) [Clark et al., 2018]: CC-BY 4.0 License.
8. The OBQA [Mihaylov et al., 2018] dataset: Apache-2.0 License based on the codebase release.
 J.2 Arithmetic reasoning 
1. The AddSub [Hosseini et al., 2014] dataset: CC-BY 4.0 License.
2. The AQuA [Ling et al., 2017] dataset: Apache-2.0 License based on the codebase release.
3. The GSM8K [Cobbe et al., 2021] dataset: MIT License.
4. The MAWPS [Koncel-Kedziorski et al., 2016] dataset: CC-BY 4.0 License.
5. The MultiArith [Roy and Roth, 2015] dataset: CC-BY 4.0 License.
6. The SingleEq [Koncel-Kedziorski et al., 2015] dataset: CC-BY 4.0 License.
7. The SVAMP [Patel et al., 2021] dataset: MIT License.
 J.3 Instruct-tuning 
1. The Ultrafeedback [Cui et al., 2023] dataset: MIT License.
2. The Alpaca-Eval v1.0 [Li et al., 2023] dataset: Apache-2.0 License based on the codebase release.
 J.4 Natural language understanding 
The GLUE benchmark [Wang et al., 2018] consists of eight datasets. Except for QQP, all datasets come with the CC-BY 4.0 License. QQP comes with a customised license as outlined at https://www.quora.com/about/tos.
 J.5 Models 
1. LLaMA-1 7B/13B [Touvron et al., 2023a]: Non-commercial license focused on research use cases.
2. Llama-2 7B [Touvron et al., 2023b]: Special Llama-2 License at https://llama.meta.com/license/.
3. Llama-3 8B: Special Llama-3 License at https://llama.meta.com/llama3/license/ (footnote 18).
4. RoBERTa-base and RoBERTa-large [Liu et al., 2019]: GNU General Public License v2.0.
Footnote 18: https://llama.meta.com/llama3/