mirror of
https://github.com/wassname/adapters_as_hypotheses.git
synced 2026-06-27 15:14:03 +08:00
3800481a30
30 PEFT methods reframed as hypotheses about transformer geometry. Each entry: pseudocode, hypothesis, evidence, grade. All papers saved to docs/ (full text).
920 lines
100 KiB
Markdown
920 lines
100 KiB
Markdown
Title: 2311.06243v2.pdf
|
||
|
||
URL Source: https://arxiv.org/pdf/2311.06243
|
||
|
||
Published Time: Tue, 30 Apr 2024 01:10:58 GMT
|
||
|
||
Number of Pages: 34
|
||
|
||
Markdown Content:
|
||
Published as a conference paper at ICLR 2024
|
||
|
||
# PARAMETER -E FFICIENT ORTHOGONAL FINETUNING VIA
|
||
|
||
# BUTTERFLY FACTORIZATION
|
||
|
||
> Weiyang Liu 1,2,* Zeju Qiu 1,* Yao Feng 1,3,† Yuliang Xiu 1,† Yuxuan Xue 4,† Longhui Yu 2,† Haiwen Feng 1
|
||
> Zhen Liu 5Juyeon Heo 2Songyou Peng 1,3 Yandong Wen 1Michael J. Black 1Adrian Weller 2,6 Bernhard Sch¨ olkopf 1
|
||
> 1Max Planck Institute for Intelligent Systems - T¨ ubingen 2University of Cambridge 3ETH Z¨ urich
|
||
> 4University of T¨ ubingen 5Mila, Universit´ e de Montr´ eal 6The Alan Turing Institute boft.wyliu.com
|
||
|
||
## ABSTRACT
|
||
|
||
Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm – Orthogonal Finetuning (OFT) – for downstream task adap-tation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matri-ces. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butter-fly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language.
|
||
|
||
## 1 INTRODUCTION
|
||
|
||
Recent models such as ChatGPT [ 4 , 9] and Stable Diffusion [ 73 ], demonstrate the remarkable generalization ability of large foundation models. The rapid increase in the performance of such models is paired with a dramatic increase in the number of parameters ( e.g. , GPT-3 has around 175B parameters). As a result, it has become increasingly challenging for researchers to train a foundation model from scratch. Broad progress in the field therefore requires the ability to adapt such models without retraining them from scratch. That is, we must be able to efficiently adapt existing foundation models to downstream tasks. There are primarily three ways to perform efficient task adaptation:
|
||
|
||
model finetuning [ 6 , 23 , 29 , 67 , 69 , 92 , 97 ], where the model architecture remains unchanged and a subset of the model parameters gets finetuned; adapter tuning [24 , 28 , 48 , 65 , 71 ], where additional trainable parameters are added to the original model; and prompt tuning [39 , 42 ], where additional trainable prefix tokens are attached to the input. Among these methods, model finetuning distinguishes itself as a simple yet powerful approach, and more importantly, introduces no inference latency. The fundamental principle behind model finetuning is to ensure that the pretrained model and the finetuned model are similar based on certain measurements, such that the pretraining knowledge is preserved. For instance, current model finetuning methods often adopt a small learning rate. This ad hoc approach encourages a small discrepancy between the pretrained and the finetuned model. Given the structured nature of weight matrices, a more principled approach tries to preserve the relational information of the weight matrices, i.e. the pairwise angles between neurons. This insight leads to a novel model finetuning framework, known as Orthogonal Finetuning (OFT) [ 67 ], where neurons in the same layer are transformed by the same orthogonal matrix such that pairwise angles between neurons are provably preserved throughout the finetuning process. Although OFT has demonstrated promising generalizability and convergence for finetuning text-to-image diffusion models [ 67 ], the number of trainable parameters in OFT can be quite large due to the high dimensionality of orthogonal matrices. To address this, OFT introduces a block-diagonal structure to reduce the number of parameters. However, the parameter efficiency also comes at a price – the orthogonal matrix has a fixed sparsity pattern and the orthogonal transformation is applied independently in different blocks. This block-
|
||
|
||
> *
|
||
|
||
Equal first author contribution. †Equal second author contribution. Equal contributors are listed alphabetically.
|
||
|
||
1
|
||
|
||
> arXiv:2311.06243v2 [cs.LG] 28 Apr 2024
|
||
|
||
Published as a conference paper at ICLR 2024 diagonal sparsity pattern, despite saving parameters, may introduce some undesirable inductive biases (e.g. , the block-diagonal orthogonal matrix reduces expressiveness and cannot approximate classic linear transforms), and more importantly, how to find a good sparsity pattern remains a mystery. The key to addressing this problem is to generate a dense orthogonal matrix, while still being parameter-efficient. While this may seem infeasible at first glance since a d-dimensional dense orthogonal matrix requires O(d2) parameters, we take a novel route to compose a dense orthogonal matrix with multiple factorized sparse matrices. This approach is guided by the insight that the number of trainable parameters can be reduced by trading computation time for space. Since we represent the orthogonal matrix with the multiplication of sparse matrices, the multiplication has to be performed repeatedly in each training iteration. To put the matrix factorization into perspective, we interpret the generation of a dense orthogonal matrix as an information transmission problem. Under this problem formulation, generating a dense orthogonal matrix by multiplying sparse matrices can be understood as transmitting information on a grid-structured graph. This information transmission framework enables us to design many possible ways to perform sparse matrix factorization that limit the number of trainable parameters while still being expressive enough to generate dense matrices. To achieve parameter efficiency in our information transmission framework, we draw inspiration from the butterfly structures in the Cooley-Tukey fast Fourier transform algorithm [ 12 ] in which information from different nodes can be exchanged efficiently [ 36 ]. Specifically, the butterfly graph in the Cooley-Tukey algorithm inherently induces a way to perform sparse matrix factorization, called butterfly factorization. With butterfly factorization, we are able to generate a d-dimensional dense matrix with a product of O(log d) sparse matrices, each with O(d) non-zero entries. By ensuring that each sparse matrix is orthogonal, we arrive at an efficient orthogonal parameterization with only O(d log d) parameters, which is a significant reduction from the original parameterization. By leveraging such an efficient orthogonal parameterization, we propose a novel parameter-efficient finetuning method – Orthogonal Butterfly (BOFT). BOFT subsumes OFT as a special case and provides a general orthogonal finetuning framework. There is a shared characteristic for the block-diagonal structure and the butterfly structure – sparsity. Both structures introduce data sparsity into orthogonal matrices to reduce the effective number of trainable parameters. It is interesting to contrast our approach with the low-rank structure in LoRA [ 29 ]; both low-rank matrices and sparse matrices are major families of structured matrices [5] that achieve parameter efficiency. Compared to the block-diagonal structure that OFT uses to trade off expressivity and regularity, BOFT uses the butterfly structure to construct a smoother interpolation between matrices from the full orthogonal group (expressivity) and identity matrices (regularity). This enables us to find a smaller hypothesis class within the orthogonal group for downstream tasks. Given the widespread use of the butterfly structure in many fast linear transforms, such as the discrete Fourier and discrete cosine transforms, we argue that our structured approach to parameter efficiency will introduce an implicit inductive bias that benefits generalizability and prevents overfitting. Our intuition comes from the property that the butterfly structure can easily recover many classic linear transforms while it is impossible for the block-diagonal structure in OFT to recover any. Our contributions are listed below: • We study the problem of parameter efficiency in orthogonal finetuning with a novel information transmission framework. This framework transforms the task of crafting a parameter-efficient dense orthogonal matrix into an information transmission problem within a grid-structured graph. • Inspired by the butterfly structures in the Cooley-Tukey algorithm, we propose Orthogonal Butterfly, a parameter-efficient orthogonal finetuning method, within the information transmission framework. • We provide a few theoretical insights for why BOFT is able to significantly reduce the number of trainable parameters while still yielding a good expressivity and training stability. Thanks to the matrix factorization, BOFT also comes with an intriguing weight interpolation (see Figure 10). • For the very first time, we apply orthogonal finetuning to various tasks beyond controllable text-to-image generation [ 67 ], showing its great potential as a generic model finetuning method. To this end, we apply BOFT to a number of adaptation applications ranging from computer vision to natural language processing. BOFT outperforms current state-of-the-art methods by a considerable margin, validating its superior parameter-efficiency and generalization ability.
|
||
|
||
## 2 RELATED WORK
|
||
|
||
Parameter-efficient finetuning (PEFT) . As foundation models ( e.g. , [ 4, 35 , 68 , 73 ]) become increasingly large and powerful, finetunig these models for downstream tasks in a parameter-efficient 2Published as a conference paper at ICLR 2024 way has sparked considerable interest [ 18 , 45 , 82 ]. Among many PEFT methods [ 1, 3 , 8 , 10 , 19 ,21 , 23 , 24 , 28 – 31 , 33 , 39 , 42 , 46 , 49 , 54 , 56 , 78 , 79 , 86 , 88 , 92 , 98 ], reparameterization-based methods [ 1, 19 , 29 ] are the most relevant to our work. LoRA [ 29 ] updates the pretrained weight matrix by adding a product of two low-rank matrices, achieving promising performance on natural language tasks. Since the rank of all added matrices is set to a constant in LoRA, several methods [ 84 , 95 , 97 ] dynamically adjust the rank for different layers such that the parameter budget is adequately allocated. Due to its simplicity, such a low-rank weight reparameterization has gained great popularity [ 6, 16 , 102 ]. Inspired by how hyperspherical energy characterizes generalization [ 50 , 52 ], [ 67 ] proposes orthogonal finetuning, an alternative yet effective method to finetune text-to-image diffusion models. Specifically, OFT learns an orthogonal matrix to transform the neurons of the same layer, and it achieves stronger generalization and consistently more stable training than LoRA. Despite strong performance, OFT generally has more trainable parameters than LoRA. Therefore, making OFT more parameter-efficient is a useful goal. Moreover, whether OFT is applicable to a wider spectrum of adaptation tasks (beyond controlling text-to-image diffusion models [ 67 ]) is unknown. BOFT improves the parameter efficiency of OFT via butterfly factorization. Thanks to this, we are now able to demonstrate the power of orthogonal finetuning in general adaptation tasks.
|
||
|
||
Butterfly structures . The radix-2 Cooley-Tukey algorithm recursively reduces the N -point discrete Fourier transform to two N
|
||
|
||
> 2
|
||
|
||
-point discrete Fourier transforms, and this process induces a butterfly structure that can be written as a product of multiple sparse matrices (the product is also called a butterfly matrix). Butterfly matrices have already been used to parameterize orthogonal matrices to avoid pivoting in Gaussian elimination and improve efficiency [ 63 ], to stabilize the training of recurrent neural networks [ 32 ] and in kernel approximation [ 60 ]. [ 13 , 14 ] learn fast linear transforms with butterfly parameterizations. [ 7, 15 ] utilize butterfly matrices to enable the efficient training of neural networks. Butterfly structures are also found useful in fast matrix-vector multiplication [ 57 , 61 ], data-sparse matrix approximation [ 43 ], and network transmission [ 36 , 40 , 70 ]. In contrast to previous work, we focus on harnessing butterfly structures to enhance the parameter efficiency of OFT.
|
||
|
||
## 3 AN INFORMATION TRANSMISSION VIEW ON ORTHOGONAL FINETUNING Pretrained Weight Matrix
|
||
|
||
> W
|
||
> dnd
|
||
|
||
## x+Pretrained Weight Matrix
|
||
|
||
> W
|
||
> nd
|
||
> ... Orthogonal Matrx R
|
||
> brdrn
|
||
> Low-rank Matrix
|
||
> AB
|
||
> (a) Low-rank Structure in LoRA (b) Orthogonal Structure in OFT
|
||
> AB00
|
||
> Figure 1: A comparison of reparameterization between LoRA and OFT.
|
||
|
||
We start with some preliminaries of OFT. To finetune the pretrained weight matrix, OFT reparameter-izes the new weight matrix as the product of a learnable orthogonal matrix and the frozen pretrained weight matrix. Compared to LoRA which updates the weights with an additive low-rank matrix, OFT uses a multiplicative orthogonal matrix to update the weights. To achieve parameter-efficiency, LoRA uses a low-rank structure, and in contrast, the original OFT uses a block-diagonal orthogonal structure [ 67 ] (the smaller the size of diagonal blocks is, the more parameter-efficient OFT is). An intuitive comparison is given in Figure 1. The motivation for applying orthogonal transformation to finetune the weight matrix is to preserve the pair-wise angles of neurons [ 50 , 52 , 67 ], such that the semantic knowledge from pretraining can be largely preserved. Concretely, OFT optimizes an orthogonal matrix R ∈ Rd×d for a pretrained linear layer W 0 ∈ Rd×n, and modifies the forward pass from z = ( W 0)⊤x to z = ( RW 0)⊤x,where x ∈ Rd and z ∈ Rn are the input and output vector, respectively. To achieve zero initialization, OFT initializes R as an identity matrix. To ensure the orthogonality of R throughout the finetuning process, we follow [ 52 , 67 ] to employ Cayley parameterization, i.e. , R = ( I + Q)( I − Q)−1
|
||
|
||
where Q is a skew-symmetric matrix with Q = −Q⊤. For parameter-efficiency, the block-diagonal structure is introduced by parameterizing the orthogonal matrix R as diag (R1, R2, · · · , Rr ) where
|
||
|
||
Ri ∈ R b×b, ∀i is a small orthogonal matrix and br = d. The parameter-efficiency brought by the block-diagonal structure comes at a price – it introduces an assumption that the dimensions of a neuron ( i.e. , a column vector of the weight matrix W 0) are divided by r groups and dimensions in different groups are transformed separately using different orthogonal matrices. Despite the empirical effectiveness of the block-diagonal structure, it makes no sense to divide the dimensions of a neuron into r groups based on their indices, which makes dense orthogonal matrices desirable. A natural question arises: Can we construct a dense orthogonal matrix without losing the parameter-efficiency?
|
||
|
||
To address this question, we propose to compose a dense orthogonal matrix with a product of multiple sparse orthogonal matrices. To provide a unified yet intuitive perspective to study the sparsity pattern 3Published as a conference paper at ICLR 2024 Level 1 Level 2 Level 3 Level 4 Level 5
|
||
|
||
> B1B2B3B4 B5 B2B1 B3B 2B1 B4B 3B2B 1B5B4B 3B2B 1
|
||
> 123412341234123412341234Zero Non-zero Level 6
|
||
> Figure 2: An illustration of the information transmission view on generating dense matrices. This example uses d= 4 and m= 5 .
|
||
|
||
of orthogonal matrix factorization, we frame the problem of generating a dense orthogonal matrix in OFT as an information transmission problem. Specif-ically, generating a dense matrix R ∈ Rd×d by a product of m square matrices R = BmBm−1 · · · B1
|
||
|
||
can be viewed as transmitting information in a grid with d×(m+1) nodes, as illustrated in Figure 2. The motivation behind the information transmission view comes from the observation that a d-dimensional dense square matrix can be interpreted as a dense connectivity from d nodes to another d nodes. For the matrix R, if the element Rij is zero, then it indicates that information from the j-th node can-not flow to the i-th node. If Rij is non-zero, then the information can be transmitted. Therefore, rep-resenting the dense matrix R with multiple matrices
|
||
|
||
BmBm−1 · · · B1 can also be interpreted as perform-ing sequential information exchange based on the graphs induced by Bi, ∀i. The information flows following B1 first and Bn in the end. As a concrete example in Figure 2, we consider the factorization
|
||
|
||
R = B5B4B3B2B1 whose sparsity patterns and induced graph are visualized. The graph in Fig-ure 2 is the result of unrolling the matrix multiplication. In the induced graph, the matrix Bi is viewed as the connectivity matrix from the i-th level nodes to the (i + 1) -th level nodes. More specifically, the
|
||
|
||
(j1, j 2) element of Bi denotes whether there is a directed edge from the j2-th node in the i-th level to the j1-th node in the (i + 1) -th level (zero means no edge). For B5B4B3B2B1 to be a dense matrix, every node in the first level should be able to transmit information to all the nodes in the 6-th level. If we only consider R = B4B3B2B1 which corresponds to the source nodes in the first level and the receiver nodes in 5-th level, then we find that information from the node 1 cannot be transmitted to the node 3. Therefore, the sparsity pattern of B4B3B2B1 has a zero element at the (3 , 1) position. Considering R = B3B2B1 and R = B2B1, the same correspondence holds between the induced graph and the sparsity pattern. Generally, for a matrix R ∈ Rd×d to be dense, the m factorization matrices Bm, · · · , B1 needs to correspond to a set of directed edges on a d × (m + 1) grid where one directed edge can only connect two nodes between adjacent levels ( i.e. , columns), such that information from every node in the first level can be transmitted to every node in the (m + 1) -th level. 1 2 3 41 2 3 4
|
||
|
||
> Figure 3: An example of block-diagonal structure in OFT.
|
||
|
||
The information transmission view can help us gain a better understanding of the sparsity pattern of factorization matrices in OFT. Figure 3 visualizes the block-diagonal structure of R in the original OFT. Despite reducing the number of trainable parameters, the block-diagonal structure cannot construct a dense matrix R. Our goal is to compose a dense orthogonal matrix with
|
||
|
||
m sparse orthogonal matrices, using as few effective trainable parameters as possible. Under the information transmission view, the general desiderata towards our goal are (i) dense connectivity : every node in the first level has at least one path to every node in the last level, and (ii) minimum free edges : the total number of edges should be as small as possible under the orthogonality constraint. We note that orthogonality injects a delicate constraint to the edges between adjacent levels. For example, for each matrix Bi to be full-rank (a necessary condition of orthogonality), we need to have d edges to form a bijection between all the nodes in the i-th level and all the nodes in the (i = 1) -th level, which makes the number of edges between adjacent levels at least d (e.g. , 4 for the example in Figure 2). These d edges is necessary for orthogonality and should not be counted into the number of edges, because these elements are not trainable ( e.g. , for a d × d orthogonal with d non-zero entries, these entries can only be ±1). Because orthogonal matrices require less number of parameters than full matrices, the orthogonality constraint will bring in additional dependency among edges. As an example, for a 2 × 2 orthogonal matrix, one zero at the (1 , 1) position will imply another zero at the (2 , 2) position ( i.e. , one missing edge could imply another missing edge). Therefore, for each feasible set of edge connections, the orthogonality may sometimes add or remove some edges. By visualizing the non-zero pattern of the composed orthogonal matrix, the information transmission framework is particularly useful in OFT, because we only care about the non-zero trainable elements of R and their specific values do not matter. A naive dense connection between two levels takes O(d2) edges ( i.e. , a single dense orthogonal matrix), yielding d2 −d edges (for d = 4 , it is 12 edges). Figure 2 gives an example of a feasible matrix 4Published as a conference paper at ICLR 2024 factorization and it takes 10 edges in total, which is actually less than a single dense orthogonal matrix. This framework enables us to study the parameter-efficiency of OFT from a graphical perspective, and we can easily come up with feasible factorizations with this framework. We draw inspiration from an interesting topology from the Cooley-Tukey algorithm, called butterfly graphs [ 12 ], which can densely connect d source nodes and d receiver nodes efficiently with O(d log d) edges. For example, the topology in Figure 2 takes 10 edges to achieve dense connectivity, while the butterfly network only takes 8 edges. Next, we introduce how the butterfly structure can improve parameter efficiency.
|
||
|
||
## 4 ORTHOGONAL PARAMETERIZATION BY BUTTERFLY FACTORIZATION Level 1 Level 2 Level 3 Level 4
|
||
|
||
> B(8,8)
|
||
> 12341234123412341234123412341234
|
||
> ~B(8,4)
|
||
> ~B(8,2)
|
||
> ~
|
||
> Figure 4: The butterfly structure ( d= 8 ).
|
||
|
||
The butterfly structure is originally used in the Cooley-Tukey al-gorithm to perform fast Fourier transform. In Fourier transform, a local change in the frequency domain can cause a global change in the spatial domain, which is conceptually similar to our information transmission problem – every node in the first level can transmit information to all the nodes in the last level. The butterfly structure also becomes a popular computer network topology [ 41 , 75 ] used for efficient information exchange. Assuming that k ≥ 2 is a power of 2, we start by defining the butterfly factor BF (k) as
|
||
|
||
BF (k) =
|
||
|
||
diag (d1) diag (d2)
|
||
|
||
diag (d3) diag (d4)
|
||
|
||
|
||
|
||
∈ Rk×k , (1)
|
||
|
||
where di ∈ R k
|
||
|
||
> 2
|
||
|
||
, ∀i are some vectors. With d = 2 N , we then define the d-dimensional butterfly matrix B(d) ∈ Rd×d recursively as
|
||
|
||
B(d) = ˜B(d, d )·
|
||
|
||
B1( d
|
||
|
||
> 2
|
||
|
||
) 00 B2( d
|
||
|
||
> 2
|
||
|
||
)
|
||
|
||
|
||
|
||
= ˜B(d, d ) ˜B(d, d
|
||
|
||
2 ) · · · ˜B(d, 2) , (2)
|
||
|
||
where B1( d
|
||
|
||
> 2
|
||
|
||
) and B2( d
|
||
|
||
> 2
|
||
|
||
) are two d
|
||
|
||
> 2
|
||
|
||
-dimensional butterfly matrices. We then define the butterfly component as ˜B(d, k ) = diag (BF
|
||
|
||
> 1
|
||
|
||
(k), · · · , BFd/k (k)) that is a block-diagonal matrix of size d × d with the block size k, where BFi (k), ∀i are butterfly factors defined in Equation 1. Now we are ready to use the butterfly matrix to parameterize an orthogonal matrix. To achieve this, we only need to ensure that all multiplicative factors ˜B(d, k ), ∀k in the butterfly matrix
|
||
|
||
B(d) are orthogonal. We first look into the block-diagonal matrix ˜B(d, 2) with the block size 2, and we can easily guarantee ˜B(d, 2) to be orthogonal with Cayley transform (or 2-dimensional rotation) to parameterize each block, same as [ 52 , 67 ]. The non-zero pattern of every butterfly component can be viewed as a permutation of the non-zero pattern of ˜B(d, 2) , so all the butterfly components can be easily parameterized as orthogonal matrices. This gives us an efficient parameterization of orthogonal matrices built upon many 2 × 2 orthogonal matrices. We generalize the butterfly matrices following [ 7], and define a block butterfly component ˜Bb(d, k ) where each entry in di, ∀i becomes a
|
||
|
||
b × b matrix. To guarantee the block butterfly component ˜Bb(d, 2) to be orthogonal, we parameterize each 2b × 2b block matrix to be orthogonal. The non-zero pattern of the other butterfly components
|
||
|
||
˜Bb(d, k ), k > 2 are the block-wise permutation of the non-zero pattern of ˜Bb(d, 2) and therefore can be similarly turned into orthogonal matrices. Combining pieces, the forward pass in BOFT is
|
||
|
||
z = |