Title: 2405.19597v1.pdf

URL Source: https://arxiv.org/pdf/2405.19597

Published Time: Fri, 31 May 2024 01:03:13 GMT

Number of Pages: 17

Markdown Content:

# SVFT: Parameter-Efficient Fine-Tuning with Singular Vectors

Vijay Lingam†∗, Atula Tejaswi†∗, Aditya Vavre†∗, Aneesh Shetty†∗, Gautham Krishna Gudur†∗, Joydeep Ghosh†, Alex Dimakis†, Eunsol Choi†, Aleksandar Bojchevski‡, Sujay Sanghavi†

†University of Texas at Austin, ‡University of Cologne

## Abstract

Popular parameter-efficient fine-tuning (PEFT) methods, such as LoRA and its variants, freeze pre-trained model weights $W$ and inject learnable matrices $\Delta W$. These $\Delta W$ matrices are structured for efficient parameterization, often using techniques like low-rank approximations or scaling vectors. However, these methods typically show a performance gap compared to full fine-tuning. Although recent PEFT methods have narrowed this gap, they do so at the cost of additional learnable parameters. We propose SVFT, a simple approach that fundamentally differs from existing methods: the structure imposed on $\Delta W$ depends on the specific weight matrix $W$. Specifically, SVFT updates $W$ as a sparse combination of outer products of its singular vectors, training only the coefficients (scales) of these sparse combinations. This approach allows fine-grained control over expressivity through the number of coefficients. Extensive experiments on language and vision benchmarks show that SVFT² recovers up to 96% of full fine-tuning performance while training only 0.006 to 0.25% of parameters, outperforming existing methods that only recover up to 85% performance using 0.03 to 0.8% of the trainable parameter budget.

## 1 Introduction

Large-scale foundation models are often adapted for specific downstream tasks after pre-training. Parameter-efficient fine-tuning (PEFT) facilitates this adaptation efficiently by learning a minimal set of new parameters, thus creating an "expert" model. For instance, Large Language Models (LLMs) pre-trained on vast training corpora are fine-tuned for specialized tasks such as text summarization [12, 34], sentiment analysis [25, 20], and code completion [26] using instruction fine-tuning datasets. Although full fine-tuning (Full-FT) is a viable method to achieve this, it requires re-training and storing all model weights, making it impractical for deployment with large foundation models.

To address these challenges, PEFT techniques [13] (e.g., LoRA [14]) were introduced to significantly reduce the number of learnable parameters compared to Full-FT, though often at the cost of performance. DoRA [18] bridges this performance gap by adding more learnable parameters and being more expressive than LoRA. Almost all these methods apply a low-rank update additively to the frozen pre-trained weights, potentially limiting their expressivity. Furthermore, these adapters are agnostic to the structure and geometry of the weight matrices they modify. Finally, more expressive PEFT methods (e.g., LoRA, DoRA, BOFT [19]) still accumulate a considerable portion of learnable parameters even in their most efficient configuration (e.g., setting rank=1 in LoRA and DoRA). The storage requirements for the learnable adapters can grow very quickly when adapting to a large number of downstream tasks [16].

> ∗ indicates equal contribution.
>
> ² Code is available at https://github.com/VijayLingam95/SVFT/
>
> Preprint. Under review. arXiv:2405.19597v1 [cs.LG] 30 May 2024

Figure 1: Performance vs. total trainable parameters for GSM-8K (left) and Commonsense Reasoning (right) on Gemma-2B. SVFT$^{B/R}_{d=16}$ outperforms DoRA$_{r=8/16}$ with 75% fewer trainable parameters. (Plot omitted. x-axis: Number of Trainable Params (M); y-axis: Accuracy (%). Methods shown: SVFT$^P$; SVFT$^B$ with $d \in \{2, 4, 8, 16\}$; SVFT$^R$ with $d = 16$; LoRA with $r \in \{1, 4, 32\}$; DoRA with $r \in \{1, 4, 16\}$; VeRA with $r \in \{1024, 2048\}$; BOFT with $m = 2$, $b = 8$; Full Fine-Tuning (2500M params).)

Is it possible to narrow the performance gap between SVFT and Full-FT while being highly parameter-efficient? We propose SVFT: Singular Vectors guided Fine-Tuning, a simple approach that involves updating an existing weight matrix by adding to it a sparse weighted combination of its own singular vectors. The structure of the induced perturbation in SVFT depends on the specific matrix being perturbed, setting it apart from all previous approaches.

Our contributions can be summarized as follows:

- We introduce SVFT, a new PEFT method. Given a weight matrix $W$, SVFT involves adapting it with a matrix $\Delta W := \sum_{(i,j) \in \Omega} m_{ij} u_i v_j^T$, where the $\{u_i\}$ and $\{v_j\}$ are the left and right singular vectors of $W$, $\Omega$ is an a-priori fixed sparsity pattern, and $m_{ij}$ for $(i, j) \in \Omega$ are learnable parameters. By controlling $|\Omega|$ we can efficiently explore the accuracy vs. parameters trade-off.
- SVFT achieves higher downstream accuracy, as a function of the number of trainable parameters, than several popular PEFT methods (see Figure 1), across both vision and language tasks. Our method recovers up to 96% of full fine-tuning performance while training only 0.006 to 0.25% of parameters, outperforming existing methods that only recover up to 85% performance using 0.03 to 0.8% of the trainable parameter budget.
- We introduce four variants for parameterizing weight updates in SVFT, namely Plain, Random, Banded, and Top-k (which differ in their choices of the fixed sparsity pattern $\Omega$), and validate these design choices empirically. Additionally, we theoretically show that for any fixed parameter budget, SVFT can induce a higher-rank perturbation compared to previous PEFT techniques.

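
The definition of $\Delta W$ in the first contribution, and the rank claim in the last, can be illustrated with a small NumPy sketch. It is not the authors' implementation: the matrix sizes, the random stand-in for a pre-trained weight, the random coefficient values, and the choice of the main diagonal as the pattern $\Omega$ are all assumptions made for illustration. It builds $\Delta W$ as a sum of outer products of singular vectors and compares its rank and parameter count with a rank-1 LoRA update.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 6, 5  # toy sizes, chosen only for illustration

# Random stand-in for a pre-trained weight matrix W (assumption).
W = rng.standard_normal((d1, d2))
U, S, Vt = np.linalg.svd(W, full_matrices=True)  # U: d1 x d1, Vt: d2 x d2

# Fixed sparsity pattern Omega: here the main diagonal (illustrative choice).
omega = [(i, i) for i in range(min(d1, d2))]
# Coefficients m_ij; random values stand in for trained ones.
m = rng.standard_normal(len(omega))

# Delta W = sum over (i, j) in Omega of m_ij * u_i v_j^T,
# where u_i is column i of U and v_j is row j of Vt.
delta_svft = sum(m_k * np.outer(U[:, i], Vt[j, :])
                 for m_k, (i, j) in zip(m, omega))

# Rank-1 LoRA update B A for comparison.
B = rng.standard_normal((d1, 1))
A = rng.standard_normal((1, d2))
delta_lora = B @ A

svft_params = len(omega)        # one coefficient per entry of Omega
lora_params = B.size + A.size   # d1 + d2 for rank 1
svft_rank = np.linalg.matrix_rank(delta_svft)
lora_rank = np.linalg.matrix_rank(delta_lora)

print(f"SVFT: {svft_params} params, rank {svft_rank}")
print(f"LoRA: {lora_params} params, rank {lora_rank}")
```

In this toy setting the singular-vector update uses fewer coefficients than the rank-1 LoRA update yet has a higher rank, because every nonzero coefficient contributes an outer product of orthonormal vectors.
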
## 2 Related Work

Recent advancements in large language models (LLMs) have emphasized the development of PEFT techniques to enhance the adaptability and efficiency of large pre-trained language models.

LoRA. A notable contribution in this field is Low-Rank Adaptation (LoRA) [14], which freezes the weights of pre-trained models and integrates trainable low-rank matrices into each transformer layer. For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times n}$, LoRA constrains the weight update $\Delta W$ to a low-rank decomposition: $h = W_0 x + \Delta W x = W_0 x + \underline{B}\,\underline{A}\,x$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times n}$, and rank $r \ll \min(d, n)$. We underline the (trainable) parameters that are updated via gradient descent.

LoRA variants. We highlight some recent approaches that further improve the vanilla LoRA architecture. Vector-based Random Matrix Adaptation (VeRA) [16] minimizes the number of trainable parameters by utilizing a pair of low-rank random matrices shared between layers and learning compact scaling vectors while maintaining performance comparable to LoRA. Formally, VeRA can be expressed as: $h = W_0 x + \Delta W x = W_0 x + \underline{\Lambda_b}\,B\,\underline{\Lambda_d}\,A\,x$, where $A$ and $B$ are initialized randomly, frozen, and shared across layers, while $\Lambda_b$ and $\Lambda_d$ are trainable diagonal matrices. An alternative approach, Weight-Decomposed Low-Rank Adaptation (DoRA) [18], decomposes pre-trained weight matrices into magnitude and direction components, and applies low-rank updates for directional updates, reducing trainable parameters and enhancing learning capacity and training stability. DoRA can be expressed as: $h = \underline{m}\,\frac{W_0 + \Delta W}{\lVert W_0 + \Delta W \rVert_c}\,x = \underline{m}\,\frac{W_0 + \underline{B}\,\underline{A}}{\lVert W_0 + \underline{B}\,\underline{A} \rVert_c}\,x$, where $\lVert \cdot \rVert_c$ denotes the vector-wise norm of a matrix across each column. Similar to LoRA, $W_0$ remains frozen, whereas the magnitude vector $m$ (initialized to $\lVert W_0 \rVert_c$) and low-rank matrices $A$, $B$ contain trainable parameters. AdaLoRA [35] adaptively distributes the parameter budget across weight matrices based on their importance scores and modulates the rank of incremental matrices to manage this allocation effectively. PiSSA (Principal Singular Values and Singular Vectors Adaptation) [21] is another variant of LoRA, where matrices $A$, $B$ are initialized with principal components of SVD and the remaining components are used to initialize $W_0$. FLoRA [31] enhances LoRA by enabling each example in a mini-batch to utilize distinct low-rank weights, preserving expressive power and facilitating efficient batching, thereby extending the domain adaptation benefits of LoRA without batching limitations.

Figure 2: Schematic comparison of LoRA, VeRA, DoRA, and SVFT (left to right).

Other PEFT variants. Orthogonal Fine-tuning (OFT) [24] modifies pre-trained weight matrices through orthogonal reparameterization to preserve essential information. However, it still requires a considerable number of trainable parameters due to the high dimensionality of these matrices. Butterfly Orthogonal Fine-tuning (BOFT) [19] extends OFT's methodology by incorporating Butterfly factorization, thereby positioning OFT as a special case of BOFT. Unlike the additive low-rank weight updates utilized in LoRA, BOFT applies multiplicative orthogonal weight updates, a significant divergence in approach that is claimed to improve parameter efficiency and fine-tuning flexibility. BOFT can be formally expressed as: $h = (\underline{R}(m, b) \cdot W_0)\,x$, where the orthogonal matrix $R(m, b) \in \mathbb{R}^{d \times d}$ is composed of a product of multiple orthogonal butterfly components. When $m = 1$, BOFT reduces to block-diagonal OFT with block size $b$. When $m = 1$ and $b = d$, BOFT reduces to the original OFT with an unconstrained full orthogonal matrix.

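
The LoRA, VeRA, and DoRA forward passes written above can be sketched in NumPy as follows. This is a minimal illustration rather than any library's API: the function names, the toy dimensions, and the random initial values are hypothetical, and the diagonal matrices $\Lambda_b$, $\Lambda_d$ are stored as vectors for convenience.

```python
import numpy as np

def lora_forward(W0, B, A, x):
    # h = W0 x + B A x, with B (d x r) and A (r x n) trainable.
    return W0 @ x + B @ (A @ x)

def vera_forward(W0, B, A, lam_b, lam_d, x):
    # h = W0 x + Lambda_b B Lambda_d A x.
    # lam_b (length d) and lam_d (length r) hold the diagonals of the
    # trainable matrices; B and A are frozen random matrices.
    return W0 @ x + lam_b * (B @ (lam_d * (A @ x)))

def dora_forward(W0, B, A, m, x):
    # h = m * (W0 + B A) / ||W0 + B A||_c  x,
    # where ||.||_c is the norm of each column and m has one entry per column.
    W = W0 + B @ A
    col_norm = np.linalg.norm(W, axis=0)  # shape (n,)
    return (W * (m / col_norm)) @ x       # rescale column j to magnitude m[j]

rng = np.random.default_rng(0)
d, n, r = 8, 6, 2                    # toy sizes (assumption)
W0 = rng.standard_normal((d, n))     # stand-in for a frozen pre-trained weight
x = rng.standard_normal(n)
A = rng.standard_normal((r, n))
B = rng.standard_normal((d, r))

h_lora = lora_forward(W0, B, A, x)
# With lam_b = 0 the VeRA update vanishes, so the output equals W0 x.
h_vera_zero = vera_forward(W0, B, A, np.zeros(d), np.ones(r), x)
# With B = 0 and m set to the column norms of W0, DoRA also reproduces W0 x.
h_dora_init = dora_forward(W0, np.zeros((d, r)), A,
                           np.linalg.norm(W0, axis=0), x)
```

The last two calls show settings in which the adapted layer reproduces the frozen layer exactly; which initialization each method actually uses is described in the respective papers, not here.
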
## 3 Method

In this section, we introduce Singular Vectors guided Fine-Tuning (SVFT). The main innovation in SVFT lies in applying structure/geometry-aware weight updates.

### 3.1 SVFT Formulation

We now formally describe our method, SVFT, for parameter-efficient fine-tuning of a pre-trained model. Let $W_0 \in \mathbb{R}^{d_1 \times d_2}$ denote a weight matrix in the pre-trained model. For instance, in a transformer block, this could be the key matrix, the query matrix, a matrix in the MLP, etc. We add a structured, learned $\Delta W$ to this matrix as follows.

As a first step, we compute the Singular Value Decomposition (SVD) of the given matrix: $W_0 = U \Sigma V^T$. That is, $U$ is the $d_1 \times d_1$ matrix of left singular vectors (i.e., its columns are orthonormal), $V^T$ is the $d_2 \times d_2$ matrix of right singular vectors (i.e., its rows are orthonormal), and $\Sigma$ is a $d_1 \times d_2$ diagonal matrix. Then, we parameterize our weight update as $\Delta W = U \underline{M} V^T$, where $U$, $V$ are fixed and frozen, while $M$ is a $d_1 \times d_2$ sparse trainable matrix with a pre-determined and fixed sparsity pattern³. That is, we first pre-determine a small fixed set of elements in $M$ that will be allowed to be non-zero and train only those elements.

Figure 3: An Overview of SVFT. The original weights $W$ are decomposed into $U$, $\Sigma$, $V$. Here, $M$ contains all the trainable parameters, which can be configured into patterns such as Plain, Random, Banded, and Top-k, represented by patterns of trainable (orange) and zero (gray) elements.

The forward pass for SVFT can be written as,

$$h = W_0 x + \Delta W x = U(\Sigma + \underline{M})V^T x \tag{1}$$

We explore four choices for $\Omega$, the a-priori fixed sparsity pattern of $M$.

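
Before turning to the individual patterns, the forward pass in Equation (1) can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the released SVFT code: the helper name `svft_forward`, the toy dimensions, the 0/1 `mask` array used to encode $\Omega$, the diagonal pattern chosen for the mask, and the zero starting value for $M$ are all illustrative choices.

```python
import numpy as np

def svft_forward(U, S, Vt, M, mask, x):
    """Compute h = U (Sigma + M * mask) V^T x.

    U (d1 x d1), S (singular values), Vt (d2 x d2) come from the SVD of W0
    and stay frozen. M is the d1 x d2 trainable matrix; mask is a 0/1 array
    encoding the fixed sparsity pattern Omega, so entries outside Omega
    never contribute.
    """
    d1, d2 = U.shape[0], Vt.shape[0]
    Sigma = np.zeros((d1, d2))
    k = len(S)
    Sigma[np.arange(k), np.arange(k)] = S  # rectangular diagonal matrix
    return U @ ((Sigma + M * mask) @ (Vt @ x))

rng = np.random.default_rng(0)
d1, d2 = 5, 4                          # toy sizes (assumption)
W0 = rng.standard_normal((d1, d2))     # stand-in for a pre-trained weight
U, S, Vt = np.linalg.svd(W0, full_matrices=True)

mask = np.eye(d1, d2)                  # illustrative pattern: main diagonal
x = rng.standard_normal(d2)

# With M = 0 the layer reproduces the frozen weights: h = W0 x.
M_zero = np.zeros((d1, d2))
h_zero = svft_forward(U, S, Vt, M_zero, mask, x)

# Random values stand in for trained coefficients.
M_trained = 0.1 * rng.standard_normal((d1, d2))
h_trained = svft_forward(U, S, Vt, M_trained, mask, x)
delta_W = U @ (M_trained * mask) @ Vt  # the induced update Delta W = U M V^T
num_trainable = int(mask.sum())        # |Omega|
```

Multiplying by `mask` on every call keeps the pattern fixed regardless of what values `M` holds; a practical implementation would more likely store only the $|\Omega|$ coefficients, which is the parameter count the method reports.
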
Plain |