Files
adapters_as_hypotheses/docs/boft_butterfly_orthogonal.md
wassname 3800481a30 Initial commit: adapters as representational hypotheses catalog
30 PEFT methods reframed as hypotheses about transformer geometry.
Each entry: pseudocode, hypothesis, evidence, grade.
All papers saved to docs/ (full text).
2026-02-22 19:48:37 +08:00

100 KiB
Raw Permalink Blame History

Title: 2311.06243v2.pdf

URL Source: https://arxiv.org/pdf/2311.06243

Published Time: Tue, 30 Apr 2024 01:10:58 GMT

Number of Pages: 34

Markdown Content: Published as a conference paper at ICLR 2024

PARAMETER -E FFICIENT ORTHOGONAL FINETUNING VIA

BUTTERFLY FACTORIZATION

Weiyang Liu 1,2,* Zeju Qiu 1,* Yao Feng 1,3,† Yuliang Xiu 1,† Yuxuan Xue 4,† Longhui Yu 2,† Haiwen Feng 1 Zhen Liu 5Juyeon Heo 2Songyou Peng 1,3 Yandong Wen 1Michael J. Black 1Adrian Weller 2,6 Bernhard Sch¨ olkopf 1 1Max Planck Institute for Intelligent Systems - T¨ ubingen 2University of Cambridge 3ETH Z¨ urich 4University of T¨ ubingen 5Mila, Universit´ e de Montr´ eal 6The Alan Turing Institute boft.wyliu.com

ABSTRACT

Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm Orthogonal Finetuning (OFT) for downstream task adap-tation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matri-ces. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butter-fly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language.

1 INTRODUCTION

Recent models such as ChatGPT [ 4 , 9] and Stable Diffusion [ 73 ], demonstrate the remarkable generalization ability of large foundation models. The rapid increase in the performance of such models is paired with a dramatic increase in the number of parameters ( e.g. , GPT-3 has around 175B parameters). As a result, it has become increasingly challenging for researchers to train a foundation model from scratch. Broad progress in the field therefore requires the ability to adapt such models without retraining them from scratch. That is, we must be able to efficiently adapt existing foundation models to downstream tasks. There are primarily three ways to perform efficient task adaptation:

model finetuning [ 6 , 23 , 29 , 67 , 69 , 92 , 97 ], where the model architecture remains unchanged and a subset of the model parameters gets finetuned; adapter tuning [24 , 28 , 48 , 65 , 71 ], where additional trainable parameters are added to the original model; and prompt tuning [39 , 42 ], where additional trainable prefix tokens are attached to the input. Among these methods, model finetuning distinguishes itself as a simple yet powerful approach, and more importantly, introduces no inference latency. The fundamental principle behind model finetuning is to ensure that the pretrained model and the finetuned model are similar based on certain measurements, such that the pretraining knowledge is preserved. For instance, current model finetuning methods often adopt a small learning rate. This ad hoc approach encourages a small discrepancy between the pretrained and the finetuned model. Given the structured nature of weight matrices, a more principled approach tries to preserve the relational information of the weight matrices, i.e. the pairwise angles between neurons. This insight leads to a novel model finetuning framework, known as Orthogonal Finetuning (OFT) [ 67 ], where neurons in the same layer are transformed by the same orthogonal matrix such that pairwise angles between neurons are provably preserved throughout the finetuning process. Although OFT has demonstrated promising generalizability and convergence for finetuning text-to-image diffusion models [ 67 ], the number of trainable parameters in OFT can be quite large due to the high dimensionality of orthogonal matrices. To address this, OFT introduces a block-diagonal structure to reduce the number of parameters. However, the parameter efficiency also comes at a price the orthogonal matrix has a fixed sparsity pattern and the orthogonal transformation is applied independently in different blocks. This block-

Equal first author contribution. †Equal second author contribution. Equal contributors are listed alphabetically.

1

arXiv:2311.06243v2 [cs.LG] 28 Apr 2024

Published as a conference paper at ICLR 2024 diagonal sparsity pattern, despite saving parameters, may introduce some undesirable inductive biases (e.g. , the block-diagonal orthogonal matrix reduces expressiveness and cannot approximate classic linear transforms), and more importantly, how to find a good sparsity pattern remains a mystery. The key to addressing this problem is to generate a dense orthogonal matrix, while still being parameter-efficient. While this may seem infeasible at first glance since a d-dimensional dense orthogonal matrix requires O(d2) parameters, we take a novel route to compose a dense orthogonal matrix with multiple factorized sparse matrices. This approach is guided by the insight that the number of trainable parameters can be reduced by trading computation time for space. Since we represent the orthogonal matrix with the multiplication of sparse matrices, the multiplication has to be performed repeatedly in each training iteration. To put the matrix factorization into perspective, we interpret the generation of a dense orthogonal matrix as an information transmission problem. Under this problem formulation, generating a dense orthogonal matrix by multiplying sparse matrices can be understood as transmitting information on a grid-structured graph. This information transmission framework enables us to design many possible ways to perform sparse matrix factorization that limit the number of trainable parameters while still being expressive enough to generate dense matrices. To achieve parameter efficiency in our information transmission framework, we draw inspiration from the butterfly structures in the Cooley-Tukey fast Fourier transform algorithm [ 12 ] in which information from different nodes can be exchanged efficiently [ 36 ]. Specifically, the butterfly graph in the Cooley-Tukey algorithm inherently induces a way to perform sparse matrix factorization, called butterfly factorization. With butterfly factorization, we are able to generate a d-dimensional dense matrix with a product of O(log d) sparse matrices, each with O(d) non-zero entries. By ensuring that each sparse matrix is orthogonal, we arrive at an efficient orthogonal parameterization with only O(d log d) parameters, which is a significant reduction from the original parameterization. By leveraging such an efficient orthogonal parameterization, we propose a novel parameter-efficient finetuning method Orthogonal Butterfly (BOFT). BOFT subsumes OFT as a special case and provides a general orthogonal finetuning framework. There is a shared characteristic for the block-diagonal structure and the butterfly structure sparsity. Both structures introduce data sparsity into orthogonal matrices to reduce the effective number of trainable parameters. It is interesting to contrast our approach with the low-rank structure in LoRA [ 29 ]; both low-rank matrices and sparse matrices are major families of structured matrices [5] that achieve parameter efficiency. Compared to the block-diagonal structure that OFT uses to trade off expressivity and regularity, BOFT uses the butterfly structure to construct a smoother interpolation between matrices from the full orthogonal group (expressivity) and identity matrices (regularity). This enables us to find a smaller hypothesis class within the orthogonal group for downstream tasks. Given the widespread use of the butterfly structure in many fast linear transforms, such as the discrete Fourier and discrete cosine transforms, we argue that our structured approach to parameter efficiency will introduce an implicit inductive bias that benefits generalizability and prevents overfitting. Our intuition comes from the property that the butterfly structure can easily recover many classic linear transforms while it is impossible for the block-diagonal structure in OFT to recover any. Our contributions are listed below: • We study the problem of parameter efficiency in orthogonal finetuning with a novel information transmission framework. This framework transforms the task of crafting a parameter-efficient dense orthogonal matrix into an information transmission problem within a grid-structured graph. • Inspired by the butterfly structures in the Cooley-Tukey algorithm, we propose Orthogonal Butterfly, a parameter-efficient orthogonal finetuning method, within the information transmission framework. • We provide a few theoretical insights for why BOFT is able to significantly reduce the number of trainable parameters while still yielding a good expressivity and training stability. Thanks to the matrix factorization, BOFT also comes with an intriguing weight interpolation (see Figure 10). • For the very first time, we apply orthogonal finetuning to various tasks beyond controllable text-to-image generation [ 67 ], showing its great potential as a generic model finetuning method. To this end, we apply BOFT to a number of adaptation applications ranging from computer vision to natural language processing. BOFT outperforms current state-of-the-art methods by a considerable margin, validating its superior parameter-efficiency and generalization ability.

Parameter-efficient finetuning (PEFT) . As foundation models ( e.g. , [ 4, 35 , 68 , 73 ]) become increasingly large and powerful, finetunig these models for downstream tasks in a parameter-efficient 2Published as a conference paper at ICLR 2024 way has sparked considerable interest [ 18 , 45 , 82 ]. Among many PEFT methods [ 1, 3 , 8 , 10 , 19 ,21 , 23 , 24 , 28 31 , 33 , 39 , 42 , 46 , 49 , 54 , 56 , 78 , 79 , 86 , 88 , 92 , 98 ], reparameterization-based methods [ 1, 19 , 29 ] are the most relevant to our work. LoRA [ 29 ] updates the pretrained weight matrix by adding a product of two low-rank matrices, achieving promising performance on natural language tasks. Since the rank of all added matrices is set to a constant in LoRA, several methods [ 84 , 95 , 97 ] dynamically adjust the rank for different layers such that the parameter budget is adequately allocated. Due to its simplicity, such a low-rank weight reparameterization has gained great popularity [ 6, 16 , 102 ]. Inspired by how hyperspherical energy characterizes generalization [ 50 , 52 ], [ 67 ] proposes orthogonal finetuning, an alternative yet effective method to finetune text-to-image diffusion models. Specifically, OFT learns an orthogonal matrix to transform the neurons of the same layer, and it achieves stronger generalization and consistently more stable training than LoRA. Despite strong performance, OFT generally has more trainable parameters than LoRA. Therefore, making OFT more parameter-efficient is a useful goal. Moreover, whether OFT is applicable to a wider spectrum of adaptation tasks (beyond controlling text-to-image diffusion models [ 67 ]) is unknown. BOFT improves the parameter efficiency of OFT via butterfly factorization. Thanks to this, we are now able to demonstrate the power of orthogonal finetuning in general adaptation tasks.

Butterfly structures . The radix-2 Cooley-Tukey algorithm recursively reduces the N -point discrete Fourier transform to two N

2

-point discrete Fourier transforms, and this process induces a butterfly structure that can be written as a product of multiple sparse matrices (the product is also called a butterfly matrix). Butterfly matrices have already been used to parameterize orthogonal matrices to avoid pivoting in Gaussian elimination and improve efficiency [ 63 ], to stabilize the training of recurrent neural networks [ 32 ] and in kernel approximation [ 60 ]. [ 13 , 14 ] learn fast linear transforms with butterfly parameterizations. [ 7, 15 ] utilize butterfly matrices to enable the efficient training of neural networks. Butterfly structures are also found useful in fast matrix-vector multiplication [ 57 , 61 ], data-sparse matrix approximation [ 43 ], and network transmission [ 36 , 40 , 70 ]. In contrast to previous work, we focus on harnessing butterfly structures to enhance the parameter efficiency of OFT.

3 AN INFORMATION TRANSMISSION VIEW ON ORTHOGONAL FINETUNING Pretrained Weight Matrix

W dnd

x+Pretrained Weight Matrix

W nd ... Orthogonal Matrx R brdrn Low-rank Matrix AB (a) Low-rank Structure in LoRA (b) Orthogonal Structure in OFT AB00 Figure 1: A comparison of reparameterization between LoRA and OFT.

We start with some preliminaries of OFT. To finetune the pretrained weight matrix, OFT reparameter-izes the new weight matrix as the product of a learnable orthogonal matrix and the frozen pretrained weight matrix. Compared to LoRA which updates the weights with an additive low-rank matrix, OFT uses a multiplicative orthogonal matrix to update the weights. To achieve parameter-efficiency, LoRA uses a low-rank structure, and in contrast, the original OFT uses a block-diagonal orthogonal structure [ 67 ] (the smaller the size of diagonal blocks is, the more parameter-efficient OFT is). An intuitive comparison is given in Figure 1. The motivation for applying orthogonal transformation to finetune the weight matrix is to preserve the pair-wise angles of neurons [ 50 , 52 , 67 ], such that the semantic knowledge from pretraining can be largely preserved. Concretely, OFT optimizes an orthogonal matrix R ∈ Rd×d for a pretrained linear layer W 0 ∈ Rd×n, and modifies the forward pass from z = ( W 0)x to z = ( RW 0)x,where x ∈ Rd and z ∈ Rn are the input and output vector, respectively. To achieve zero initialization, OFT initializes R as an identity matrix. To ensure the orthogonality of R throughout the finetuning process, we follow [ 52 , 67 ] to employ Cayley parameterization, i.e. , R = ( I + Q)( I Q)1

where Q is a skew-symmetric matrix with Q = Q. For parameter-efficiency, the block-diagonal structure is introduced by parameterizing the orthogonal matrix R as diag (R1, R2, · · · , Rr ) where

Ri ∈ R b×b, ∀i is a small orthogonal matrix and br = d. The parameter-efficiency brought by the block-diagonal structure comes at a price it introduces an assumption that the dimensions of a neuron ( i.e. , a column vector of the weight matrix W 0) are divided by r groups and dimensions in different groups are transformed separately using different orthogonal matrices. Despite the empirical effectiveness of the block-diagonal structure, it makes no sense to divide the dimensions of a neuron into r groups based on their indices, which makes dense orthogonal matrices desirable. A natural question arises: Can we construct a dense orthogonal matrix without losing the parameter-efficiency?

To address this question, we propose to compose a dense orthogonal matrix with a product of multiple sparse orthogonal matrices. To provide a unified yet intuitive perspective to study the sparsity pattern 3Published as a conference paper at ICLR 2024 Level 1 Level 2 Level 3 Level 4 Level 5

B1B2B3B4 B5 B2B1 B3B 2B1 B4B 3B2B 1B5B4B 3B2B 1 123412341234123412341234Zero Non-zero Level 6 Figure 2: An illustration of the information transmission view on generating dense matrices. This example uses d= 4 and m= 5 .

of orthogonal matrix factorization, we frame the problem of generating a dense orthogonal matrix in OFT as an information transmission problem. Specif-ically, generating a dense matrix R ∈ Rd×d by a product of m square matrices R = BmBm1 · · · B1

can be viewed as transmitting information in a grid with d×(m+1) nodes, as illustrated in Figure 2. The motivation behind the information transmission view comes from the observation that a d-dimensional dense square matrix can be interpreted as a dense connectivity from d nodes to another d nodes. For the matrix R, if the element Rij is zero, then it indicates that information from the j-th node can-not flow to the i-th node. If Rij is non-zero, then the information can be transmitted. Therefore, rep-resenting the dense matrix R with multiple matrices

BmBm1 · · · B1 can also be interpreted as perform-ing sequential information exchange based on the graphs induced by Bi, ∀i. The information flows following B1 first and Bn in the end. As a concrete example in Figure 2, we consider the factorization

R = B5B4B3B2B1 whose sparsity patterns and induced graph are visualized. The graph in Fig-ure 2 is the result of unrolling the matrix multiplication. In the induced graph, the matrix Bi is viewed as the connectivity matrix from the i-th level nodes to the (i + 1) -th level nodes. More specifically, the

(j1, j 2) element of Bi denotes whether there is a directed edge from the j2-th node in the i-th level to the j1-th node in the (i + 1) -th level (zero means no edge). For B5B4B3B2B1 to be a dense matrix, every node in the first level should be able to transmit information to all the nodes in the 6-th level. If we only consider R = B4B3B2B1 which corresponds to the source nodes in the first level and the receiver nodes in 5-th level, then we find that information from the node 1 cannot be transmitted to the node 3. Therefore, the sparsity pattern of B4B3B2B1 has a zero element at the (3 , 1) position. Considering R = B3B2B1 and R = B2B1, the same correspondence holds between the induced graph and the sparsity pattern. Generally, for a matrix R ∈ Rd×d to be dense, the m factorization matrices Bm, · · · , B1 needs to correspond to a set of directed edges on a d × (m + 1) grid where one directed edge can only connect two nodes between adjacent levels ( i.e. , columns), such that information from every node in the first level can be transmitted to every node in the (m + 1) -th level. 1 2 3 41 2 3 4

Figure 3: An example of block-diagonal structure in OFT.

The information transmission view can help us gain a better understanding of the sparsity pattern of factorization matrices in OFT. Figure 3 visualizes the block-diagonal structure of R in the original OFT. Despite reducing the number of trainable parameters, the block-diagonal structure cannot construct a dense matrix R. Our goal is to compose a dense orthogonal matrix with

m sparse orthogonal matrices, using as few effective trainable parameters as possible. Under the information transmission view, the general desiderata towards our goal are (i) dense connectivity : every node in the first level has at least one path to every node in the last level, and (ii) minimum free edges : the total number of edges should be as small as possible under the orthogonality constraint. We note that orthogonality injects a delicate constraint to the edges between adjacent levels. For example, for each matrix Bi to be full-rank (a necessary condition of orthogonality), we need to have d edges to form a bijection between all the nodes in the i-th level and all the nodes in the (i = 1) -th level, which makes the number of edges between adjacent levels at least d (e.g. , 4 for the example in Figure 2). These d edges is necessary for orthogonality and should not be counted into the number of edges, because these elements are not trainable ( e.g. , for a d × d orthogonal with d non-zero entries, these entries can only be ±1). Because orthogonal matrices require less number of parameters than full matrices, the orthogonality constraint will bring in additional dependency among edges. As an example, for a 2 × 2 orthogonal matrix, one zero at the (1 , 1) position will imply another zero at the (2 , 2) position ( i.e. , one missing edge could imply another missing edge). Therefore, for each feasible set of edge connections, the orthogonality may sometimes add or remove some edges. By visualizing the non-zero pattern of the composed orthogonal matrix, the information transmission framework is particularly useful in OFT, because we only care about the non-zero trainable elements of R and their specific values do not matter. A naive dense connection between two levels takes O(d2) edges ( i.e. , a single dense orthogonal matrix), yielding d2 d edges (for d = 4 , it is 12 edges). Figure 2 gives an example of a feasible matrix 4Published as a conference paper at ICLR 2024 factorization and it takes 10 edges in total, which is actually less than a single dense orthogonal matrix. This framework enables us to study the parameter-efficiency of OFT from a graphical perspective, and we can easily come up with feasible factorizations with this framework. We draw inspiration from an interesting topology from the Cooley-Tukey algorithm, called butterfly graphs [ 12 ], which can densely connect d source nodes and d receiver nodes efficiently with O(d log d) edges. For example, the topology in Figure 2 takes 10 edges to achieve dense connectivity, while the butterfly network only takes 8 edges. Next, we introduce how the butterfly structure can improve parameter efficiency.

4 ORTHOGONAL PARAMETERIZATION BY BUTTERFLY FACTORIZATION Level 1 Level 2 Level 3 Level 4

B(8,8) 12341234123412341234123412341234 ~B(8,4) ~B(8,2) ~ Figure 4: The butterfly structure ( d= 8 ).

The butterfly structure is originally used in the Cooley-Tukey al-gorithm to perform fast Fourier transform. In Fourier transform, a local change in the frequency domain can cause a global change in the spatial domain, which is conceptually similar to our information transmission problem every node in the first level can transmit information to all the nodes in the last level. The butterfly structure also becomes a popular computer network topology [ 41 , 75 ] used for efficient information exchange. Assuming that k ≥ 2 is a power of 2, we start by defining the butterfly factor BF (k) as

BF (k) =

diag (d1) diag (d2)

diag (d3) diag (d4)



∈ Rk×k , (1)

where di ∈ R k

2

, ∀i are some vectors. With d = 2 N , we then define the d-dimensional butterfly matrix B(d) ∈ Rd×d recursively as

B(d) = ˜B(d, d )·

B1( d

2

) 00 B2( d

2

)



= ˜B(d, d ) ˜B(d, d

2 ) · · · ˜B(d, 2) , (2)

where B1( d

2

) and B2( d

2

) are two d

2

-dimensional butterfly matrices. We then define the butterfly component as ˜B(d, k ) = diag (BF

1

(k), · · · , BFd/k (k)) that is a block-diagonal matrix of size d × d with the block size k, where BFi (k), ∀i are butterfly factors defined in Equation 1. Now we are ready to use the butterfly matrix to parameterize an orthogonal matrix. To achieve this, we only need to ensure that all multiplicative factors ˜B(d, k ), ∀k in the butterfly matrix

B(d) are orthogonal. We first look into the block-diagonal matrix ˜B(d, 2) with the block size 2, and we can easily guarantee ˜B(d, 2) to be orthogonal with Cayley transform (or 2-dimensional rotation) to parameterize each block, same as [ 52 , 67 ]. The non-zero pattern of every butterfly component can be viewed as a permutation of the non-zero pattern of ˜B(d, 2) , so all the butterfly components can be easily parameterized as orthogonal matrices. This gives us an efficient parameterization of orthogonal matrices built upon many 2 × 2 orthogonal matrices. We generalize the butterfly matrices following [ 7], and define a block butterfly component ˜Bb(d, k ) where each entry in di, ∀i becomes a

b × b matrix. To guarantee the block butterfly component ˜Bb(d, 2) to be orthogonal, we parameterize each 2b × 2b block matrix to be orthogonal. The non-zero pattern of the other butterfly components

˜Bb(d, k ), k > 2 are the block-wise permutation of the non-zero pattern of ˜Bb(d, 2) and therefore can be similarly turned into orthogonal matrices. Combining pieces, the forward pass in BOFT is

z = R(m, b ) · W 0x, s.t.



R(m, b ) =

m

Y

i=1

˜Bb

(d,i )

& ˜Bb

(d,j )

 ˜Bb

(d,j )

= ˜Bb

(d,j )

˜Bb

(d,j )

 = Id

| {z }

∀j∈[1 ,m ]



,

where we denote ˜Bb(d, 2mi+1 ) as ˜Bb

(d,i )

for simplicity, and Id is an identity matrix of size d.The orthogonal matrix R(m, b ) ∈ Rd×d is composed of a product of multiple orthogonal butterfly components. For convenience, we denote BOFT with R(m, b

2

) as BOFT( m,b), where b ≥ 2. When

m = 1 , then BOFT( 1,b) reduces to the block-diagonal OFT [ 67 ] with the block size b. BOFT( 1,d)reduces to the original OFT [ 67 ] with an unconstrained full orthogonal matrix. BOFT( log 2db ,b)uses the block butterfly matrix B b

2

(d) as R, and yields a dense orthogonal matrix R. In general, BOFT( m,b) takes 12 (b 1) dm effective trainable parameters for finetuning a linear layer of size d × n.If we use the butterfly matrix, i.e. , m = log d, b = 2 , BOFT uses O(d log d) parameters. In contrast, the original OFT with a full dense orthogonal matrix uses O(d2) parameters, and the block-diagonal OFT with the block number r uses O(bd ). Therefore, the original OFT has to use the block size

b = d to generate a dense orthogonal matrix, while BOFT can use any b to achieve this. 5Published as a conference paper at ICLR 2024

Identity initialization for BOFT . Finetuning methods usually start with the exact pretrained model such that the finetuned model will not deviate too much from the pretrained one. For example, LoRA uses zero initialization for the low-rank weights. In BOFT, we initialize all the butterfly components with identity matrices ( i.e. , the skew-symmetric matrix is initialized as zeros in Cayley transform).

Multiplicative Dropout for BOFT . LoRA [ 29 ] further implements a Dropout layer for the low-rank weight update to prevent overfitting. The conventional Dropout [ 77 ] naturally works for LoRA, but not for BOFT due to our multiplicative weight update. To address this, we propose a multiplicative Dropout for BOFT. Because the orthogonal matrix R(m, b ) is composed of m orthogonal butterfly components which can be easily permuted to 2b × 2b block-diagonal orthogonal matrices. The multiplicative Dropout first randomly picks p1 percent of the butterfly components and p2 percent of the diagonal blocks in each butterfly component, and then replaces these blocks as identity matrices.

5 INTRIGUING INSIGHTS AND DISCUSSIONS 10 3 10 4Number of trainable parameters 0.6 0.7 0.8 0.9 1Scaled approximation error BOFT(1:9,2) BOFT(1:8,4) BOFT(1:6,16) BOFT(1,2) BOFT(9,2) BOFT(8,2) BOFT(1,4) BOFT(6,16) BOFT(1,16)

Figure 5: Expressiveness of BOFT.

Expressivity of BOFT . The butterfly structure along with permutations can perfectly recover many classic fast linear transform [ 13 , 14 ] ( e.g. ,fast Fourier transform, Hadamard transform), but how well our orthogo-nal butterfly matrix can approximate a general orthogonal matrix remains unknown. We start by conducting a simulation to approximate a random dense orthogonal matrix [ 2] with size 512 × 512 . The results in Figure 5 are averaged over 10 random seeds. The y-axis denotes the approxi-mation error, and the x-axis denotes the number of effective trainable parameters. Each curve with the same color denotes BOFT with the same block size, and the leftmost point is the error of BOFT( 1,b) ( i.e. , the original block-diagonal OFT with block size b). BOFT generally yields better parameter efficiency than OFT. For example, the expressiveness of BOFT( 9,2) is better than that of BOFT( 1,16 ) but has much less parameters. BOFT with smaller b and larger m is generally more parameter-efficient. For example, BOFT( 6,4) uses much less parameters but yields a similar approximation error to BOFT( 2,16 ). In general, the butterfly matrix represents a more structured subset of the orthogonal group (compared to the block-diagonal structure), which makes BOFT provably more expressive than OFT with the same block size.

Theorem 1 (Expressivity of BOFT) . BOFT is more expressive than OFT with the same block size. For the butterfly matrix to approximate all orthogonal matrices of size d, we can multiply butterfly matrices with Bd1,1(d)B

d1,2

(d) · · · B1,1(d)B

1,2

(d), where Bi,j (d), ∀i, ∀j are butterfly matrices.

Theorem 1 suggests a simple generalization for BOFT the final orthogonal matrix is generalized to

RG(m1, b 1, m 2, b 2, l ) = Rl, 1(m1, b 1)RTl, 2(m2, b 2) · · · R1,1(m1, b 1)RT

1,2

(m2, b 2) where RTi,j (m, b )

denotes the orthogonal matrix used in BOFT. When m1 = m2 = log d, b1 = b2 = 2 and l = d 1,then RG(m1, b 1, m 2, b 2, l ) can represent the entire orthogonal group. Such a matrix composition is also called kaleidoscope hierarchy [ 14 ]. However, we note that better expressiveness does not always lead to better performance in finetuning, as full finetuning, despite its universal expressiveness, often yields unsatisfactory performance. The trade-off between expressivity and regularity is the key to the generalizability of model finetuning. BOFT enlarges the finetuning parameter space with structural priors, which enables us to find a better trade-off between expressivity and regularity.

Spectral properties . Orthogonal finetuning generally yields better spectral property than LoRA, because it perfectly preserves the spectral norm of the pretrained weight matrix W 0. We can see this by singular value decomposition: W 0 = U ΣV where U , V are orthogonal matrice and Σ is a singular value diagonal matrix. Both OFT and BOFT multiply an orthogonal matrix R to the left and obtain the finetuned weights RU ΣV , which does not affect the largest singular value ( i.e. , the spectral norm of W 0). Such a preservation has been shown to greatly benefit training stability and generalization [58, 90]. We introduce more interesting mathematical properties in Appendix G.

Orthogonal finetuning as learning bilinear similarity . BOFT can be written as learning the bilinear similarity w0

i

Rx where w0

i

is the i-th neuron ( i.e. , column vector) of the weight matrix W 0. BOFT can be viewed as learning the bilinear similarity matrix R with a strong regularity ( i.e. , R needs to be orthogonal), which intrinsically connects to distance metric learning [89] and bilinear form [72].

Inductive bias and generalization in BOFT . Since R(m, b ) in BOFT usually represents a structured subset of the orthogonal group which constrains the hypothesis class, BOFT will naturally induce an inductive bias. We argue that the structured inductive bias induced by butterfly factorization is 6Published as a conference paper at ICLR 2024 beneficial to generalization, as it has a shared structured pattern of many classic linear transforms [ 14 ], such as discrete Fourier transform, discrete sine/cosine transform and Hadamard transform. Moreover, the sparse matrix factorization in BOFT may also bring some implicit inductive bias [22, 44, 51].

Comparison to butterfly-based sparse training . There are quite a few works [ 7, 13 15 ] that study sparse training with the butterfly parameterization. They typically focus on reparameterizing the weight matrices directly with the butterfly parameterization and training neural networks from scratch. [15 ] considers finetuning the pretrained weights by first projecting the weights on a variant of butterfly matrices and then optimizing the projected components for downstream tasks. BOFT proposes a very different finetuning strategy that transforms the weights with layer-shared weight matrices.

6 APPLICATIONS AND EMPIRICAL RESULTS

We apply BOFT to finetune large language models (DeBERTaV3 [ 25 ], Llama-2 [ 81 ]), vision founda-tion models (DINOv2 [ 62 ], SAM [ 35 ]), and text-to-image generative models (Stable Diffusion [ 73 ]) on various downstream tasks. To ensure a fair comparison, we use exactly the same settings for all the compared baselines. The results are averaged over 5 random seeds, and the gains have passed significant tests with p < 0.05 . Experimental details and more results are provided in the appendices. 6.1 ADAPTATION OF LARGE LANGUAGE MODELS (LLM S)

Method # Param MNLI SST-2 CoLA QQP QNLI RTE MRPC STS-B All Full Finetuning 184M 89.90 95.63 69.19 92.40 94.03 83.75 89.46 91.60 88.25 BitFit [92] 0.1M 89.37 94.84 66.96 88.41 92.24 78.70 87.75 91.35 86.20 H-Adapter [28] 1.22M 90.13 95.53 68.64 91.91 94.11 84.48 89.95 91.48 88.28 P-Adapter [65] 1.18M 90.33 95.61 68.77 92.04 94.29 85.20 89.46 91.54 88.41 LoRA r=8 [29] 1.33M 90.65 94.95 69.82 91.99 93.87 85.20 89.95 91.60 88.50 AdaLoRA [97] 1.27M 90.76 96.10 71.45 92.23 94.55 88.09 90.69 91.84 89.46 OFT b=16 0.79M 90.33 96.33 73.91 92.10 94.07 87.36 92.16 91.91 89.77 BOFT m=2 b=8 0.75M 90.25 96.44 72.95 92.10 94.23 88.81 92.40 91.92 89.89 Table 1: Results on the GLUE development set. We report the matched accuracy for MNLI, Matthews correlation for CoLA, average correlation for STS-B and accuracy for other tasks.

Natural language under-standing . To evaluate the per-formance of BOFT on LLM adaptation, we first finetune a pretrained DeBERTaV3-base model [ 25 ] on the GLUE benchmark [ 87 ], which con-sists of some representative sentence classification tasks and is widely used for assess-ing the natural language un-derstanding ability [ 17 , 25 , 53 ]. Results are presented in Table 1. “# Param” in the table denotes the total number of effective trainable parameters for each method. We note that OFT [ 67 ] with the block size 16 is BOFT( 1,16 ). One can observe that orthogonal finetuning performs better than current state-of-the-art methods. More importantly, BOFT outperforms OFT while still using less parameters.

MMLU (5-shot) MMLU (0-shot) Method # Param Hums. STEM Social Other Avg. Hums. STEM Social Other Avg. Llama-2-7B -43.0 36.9 51.6 52.1 45.7 38.8 33.3 46.8 45.0 40.8 LoRA r=16 0.125% 42.9 38.5 54.5 53.8 47.0 42.5 37.1 51.5 52.3 45.5 LoRA r=32 0.25% 42.9 38.7 54.6 54.7 47.3 42.5 36.7 52.8 52.7 45.9 OFT b=16 0.13% 44.0 38.9 54.2 54.3 47.5 44.0 36.7 52.9 52.0 46.2 BOFT m=2 b=8 0.12% 44.5 39.0 54.4 55.1 47.9 44.3 37.4 53.1 52.8 46.7 Table 2: Accuracy (%) on MMLU. “# Param” denotes the percentage of finetuned parameters.

Massive multitask language un-derstanding . We use Alpaca [ 80 ]as our finetuning dataset and evaluate both zero-shot and few-shot performance on the MMLU dataset [ 27 ] which consists of 57 language tasks. All methods use the pretrained Llama-2-7B model [ 81 ]. Results in Table 2 show a consistent improvement over LoRA, but BOFT uses fewer parameters. Notably, BOFT( 2,8) produces a block-diagonal orthogonal matrix with the block size 16, and yet still outperforms OFT with the same block size ( i.e. , BOFT( 1,16 )) by a considerable margin. This result implies that the butterfly structure can incorporate a generalizable inductive bias.

Method # Param GSM8K MATH Llama-2-7B -14.6 2.5 LoRA r=32 0.25% 50.2 7.8 OFT b=16 0.13% 50.1 8.4 BOFT m=2 b=8 0.12% 50.6 8.6 Table 3: Results on GSM8K and MATH.

Mathematical question answering . We also evaluate our method in mathematical question answering using two challenging benchmarks: GSM8K [ 11 ] and MATH [ 27 ]. For all the finetuning methods, we use MetaMathQA-40K [ 91 ] as the finetuning dataset, and the Llama-2-7B model [ 81 ] as the pretrained backbone. As can be observed in Table 3, BOFT excels in mathematical reasoning on both datasets. We note that even though improvement on the MATH dataset is in fact quite challenging, BOFT achieves more than 10% relative improvement over LoRA while only using half of the number of trainable parameters for LoRA. Moreover, BOFT outperforms OFT even with the same effective block number, again verifying that the butterfly structure can introduce a generalizable inductive bias. We also provide a case study of a few questions in Appendix E. 7Published as a conference paper at ICLR 2024

Natural Specialized Structured # param (M) Cifar100 Caltech101 DTD Flower102 Pets SVHN Sun397 Camelyon EuroSAT Resisc45 Retinopathy Clevr-Count Clevr-Dist DMLab KITTI-Dist dSpr-Loc dSpr-Ori sNORB-Azim sNORB-Ele Average Full Finetuning 304.4 67.6 91.7 77.9 99.7 93.7 92.8 52.3 88.1 96.1 90.9 77.2 67.2 59.8 58.1 82.8 83.6 62.0 36.9 39.4 74.6 Linear Probing 073.2 90.9 78.1 99.7 95.2 40.3 59.3 84.2 92.9 86.8 75.6 48.1 44.4 45.9 65.4 25.5 37.0 18.5 30.9 62.7 BitFit [92] 0.27 78.5 91.7 80.4 99.7 95.0 67.3 60.2 85.2 96.1 90.7 75.7 84.1 63.0 52.7 78.9 83.8 61.9 28.0 37.7 74.2 FacTtt r=16 [31] 0.12 76.2 89.4 77.3 99.7 94.7 89.6 58.9 87.1 94.3 88.7 74.0 83.1 63.3 56.2 83.1 61.7 37.1 23.3 32.6 72.1 FacTtk r=32 [31] 0.12 75.0 89.1 78.6 99.7 95.0 92.1 58.9 86.1 94.6 89.5 74.2 84.3 62.0 57.7 85.2 68.4 38.3 31.2 44.2 73.9 LoRA r=4 [29] 1.77 77.2 92.8 80.3 99.7 94.8 92.7 59.5 88.3 96.4 91.4 77.4 74.7 62.4 58.1 85.2 85.8 57.2 31.8 37.2 76.6 GLoRA r=4 [6] 4.87 80.1 93.7 80.2 99.7 94.4 89.6 59.9 85.9 96.0 91.0 76.2 61.8 62.3 56.9 85.8 65.7 57.2 37.0 41.4 74.5 OFT b=16 2.10 77.7 91.9 80.1 99.7 94.7 92.9 59.3 88.4 96.4 91.5 77.2 81.0 64.7 60.5 84.0 92.2 61.1 34.8 40.3 77.3 BOFT m=2 b=8 1.99 78.1 92.5 80.6 99.7 95.0 93.0 59.9 88.9 96.6 91.6 77.3 84.5 64.9 61.4 84.1 93.9 62.0 36.2 40.0 77.9 BOFT m=4 b=4 1.77 78.2 91.4 79.6 99.7 94.9 92.8 59.4 88.1 96.4 91.6 76.2 81.9 65.4 60.0 84.5 92.9 61.3 37.1 39.3 77.4 BOFT m=6 b=2 1.11 78.3 91.5 79.9 99.7 95.0 92.0 60.2 88.2 96.5 91.4 77.2 80.5 64.1 61.4 85.0 91.6 60.8 34.0 38.5 77.1 Table 4: Results (%) on the VTAB-1K benchmark. “# param” specifies the number of trainable parameters of each method. The average accuracy is obtained by averaging over all 19 tasks. The best results are marked with “ bold ”, and the second/third best results are marked with “ underline ”.

6.2 ADAPTATION OF VISION FOUNDATION MODELS

Model # Param DIS COIFT HRSOD ThinObject Average mIoU mBIoU mIoU mBIoU mIoU mBIoU mIoU mBIoU mIoU mBIoU SAM (baseline) 062.0 52.8 92.1 86.5 90.2 83.1 73.6 61.8 79.5 71.1 Finetune SAM 4.06M 78.9 70.3 93.9 89.3 91.8 83.4 89.4 79.0 88.5 80.5 HQ-SAM [34] 1.33M 78.6 70.4 94.8 90.1 93.6 86.9 89.5 79.9 89.1 81.8 OFT-SAM b=16 0.07M 77.8 69.1 94.9 90.3 92.6 85.5 91.2 80.6 88.9 81.4 BOFT-SAM m=4 b=4 0.04M 78.2 69.7 94.9 90.5 93.1 86.0 91.7 80.1 89.5 81.6 BOFT-SAM m=2 b=8 0.06M 78.4 70.3 94.7 90.1 93.0 86.5 91.7 81.8 89.5 82.2 Table 5: Results on HQSeg-44K [ 34 ] (DIS [ 66 ], COIFT [ 47 ], HRSOD [ 93 ], ThinObject [ 47 ]).

Transfer learning on VTAB-1K .We evaluate the finetuning perfor-mance of BOFT on the VTAB-1K benchmark [ 94 ], which has been extensively used to evaluate parameter-efficient transfer learn-ing algorithms. VTAB-1K con-sists of 19 image classification tasks that are divided into three categories: natural images, specialized tasks ( e.g. , remote sensing and medical images), and structured tasks ( e.g. , depth and orientation prediction). In VTAB-1k, each dataset provides 800 labeled training set samples, a subset of their original training set. We use them to fine-tune our base model and the Top-1 classification accuracy on their respective original test set is used as the performance measure. Notably, all compared methods introduce no inference latency, so they have the same inference speed. Because the final classification layer will always get retrained and the trainable parameters of that linear classification layer vary across different tasks, we follow the common practice and do not take them into account when reporting the total trainable parameters for each method. Different from [ 6 ], we use a much larger pretrained vision transformer [ 62 ] (DINOv2-large) with more than 300M pa-rameters. The accuracies are given in Table 4. We observe that orthogonal finetuning achieves the best overall testing accuracy on the VTAB-1K benchmark, and BOFT with m = 4 , b = 4 again achieves the best performance. Remarkably, BOFTs performance enhancement is both stable and consistent across tasks, as almost all our results outperform the simplest full finetuning baseline. BOFT is marginally worse than full finetuning on three tasks: dSpr-Ori ( 0.7% ), Caltech101 ( 0.3% ) and sNORB-Ele ( 0.1% ). In contrast, LoRA is significantly worse than full finetuning on sNORB-Azim and dSpr-Ori by 5% . These results validate the effectiveness of BOFT for vision transformers. SAM BOFT-SAM

Figure 6: Qualitative comparison of between SAM and BOFT-SAM.

High-quality segmentation with SAM . The Seg-ment Anything Model (SAM) [ 35 ] is a vision foundation model for promptable image segmen-tation, demonstrating impressive zero-shot capa-bilities. SAM consists of three main components: a pre-trained image encoder to generate a fea-ture embedding of the input image, a prompt en-coder to embed prompts, and a mask decoder to map these input embeddings to a segmenta-tion mask. Despite its impressive performance in general image segmentation, SAM lacks the ability to perform highly accurate segmentation in challenging situations. To address this, HQ-SAM [ 34 ] proposes to train an additional HQ-Output Token and a global-local feature fusion module on a high-quality segmentation dataset, HQSeg-44K [ 34 ], to improve the mask quality, achieving state-of-the-art performance in high-quality segmentation. Using the same dataset and loss function 8Published as a conference paper at ICLR 2024 Text prompt : a woman with long black hair

Text prompt : a man with blonde hair Text prompt : a woman with her mouth open Text prompt : a man wearing a hat Control signal Control signal Control signal Control signal LoRA LoRA LoRA LoRA OFT OFT OFT OFT BOFT BOFT BOFT BOFT Figure 7: Qualitative comparison of controllable generation. The figure is best viewed digitally, in color and significantly zoomed in.

as HQ-SAM, we finetune the original SAM with BOFT for 10 epochs. Specifically, we only apply BOFT to all linear layers of the mask decoder of SAM, while keeping the other part of SAM frozen. We compare to finetuning the entire mask decoder, training the HQ-SAM modules [ 34 ] and finetuning linear layers with BOFT. Table 5 shows that BOFT-SAM uses only 3% of trainable parameters used in HQ-SAM, and yet matches its performance. Moreover, since the multiplicative weights learned by BOFT can be combined back to the weights of SAM, BOFT-SAM has exactly the same inference speed as SAM, while, in contrast, HQ-SAM has additional modules that affect its inference speed. 6.3 CONTROLLING TEXT -TO -I MAGE DIFFUSION MODELS

Method # Param Error ↓ LoRA r=16 2.52M 8.878 LoRA r=64 10.08M 8.062 LoRA r=128 20.17M 8.038 OFT r=16 2.71M 8.876 OFT r=4 10.50M 6.537 OFT r=2 20.89M 6.407 BOFT m=2 r=32 2.66M 8.070 BOFT m=6 r=32 7.69M 6.731 BOFT m=5 r=16 12.93M 6.387 BOFT m=4 r=8 20.76M 5.667 Table 6: Face landmark error between control signal and prediction.

Since OFT is originally used to control text-to-image diffusion mod-els [ 67 ], we also evaluate BOFT with the same task for better compar-ison. We finetune the pretrained Stable Diffusion [ 73 ] for two tasks: controllable generation ( e.g. , [ 59 , 96 ]) and subject-driven generation (e.g. , [ 74 ]). Controllable generation enables adding spatial control sig-nals to the text-to-image diffusion models. Subject-driven generation aims to synthesize images of a subject in novel contexts by finetuning on a few images of that subject to learn a unique identifier. We follow the same setting as [ 67 ] for evaluating controllable generation. To be easily comparable to OFT, where the block structure is characterized by the number of blocks r, we also use the number of blocks to char-acterize BOFT (instead of the block size b). Because rd = b, larger

r indicates less number of parameters. For example, for BOFT with

r = 32 to generate a dense orthogonal matrix, we need to have m = 6 .We start by comparing LoRA, OFT and BOFT with a small parameter budget (less than 3M parame-ters). We see from Table 6 that BOFT with r = 32 , m = 2 yields significantly better performance than both LoRA and OFT with the block number 16 under the 3M parameter budget. Under this small budget setting, we also provide a qualitative comparison among LoRA ( r = 16 ), OFT ( r = 16 ) and BOFT ( r = 32 , m = 2 ) in Figure 7. We also evaluate how BOFT performs with a dense orthogonal matrix using r = 16 , m = 4 and r = 8 , m = 4 . We observe that BOFT with r = 8 , m = 4 achieves the best performance and significantly outperforms LoRA with a similar number of parameters. 5 10 15 20 Number of epochs 510 15 20 25 30 Landmark error OFT, r=32 BOFT, r=32, m=2 BOFT, r=32, m=4 BOFT, r=32, m=6

Figure 8: How maffects controllability.

Finally, we conduct an ablation study on how the number of butterfly components m affects the performance of controllable generation. We first fix the block number as r = 32 , and then vary the number of butterfly components in BOFT from 0 (i.e. , OFT with the block number

32 ) to 6 (BOFT with a dense orthogonal matrix). Figure 8 shows that BOFT with a larger m yields better control performance for Stable Diffusion finetuning. More interestingly, we also find that, with the same number of blocks, an increased number of butterfly components generally leads to faster and more stable convergence. This implies that orthogonal finetuning with a denser orthogonal matrix converges faster in finetuning text-to-image diffusion models. This also matches our intuition that a dense orthogonal matrix can transform neurons more effectively due to its more efficient information transmission. BOFT also performs consistently better than both LoRA and the original OFT in subject-driven generation. A qualitative comparison is given in Figure 9 and Appendix C. For all the compared methods, we use the best possible hyperparameters. We empirically observe that BOFT can generally capture more intrinsic identity characteristics of the input subject, and therefore, the generated images 9Published as a conference paper at ICLR 2024 a [V] bowl with a wheat field in the background

a [V] bowl with a city in the background LoRA OFT BOFT a shiny [V] backpack a [V] backpack on top of pink fabric LoRA OFT BOFT Input images Input images a [V] sneaker with the Eiffel Tower in the background a cube shaped [V] toy a [V] toy with a tree and autumn leaves in the background LoRA OFT BOFT a [V] sneaker on the beach LoRA OFT BOFT Input images Input images

Figure 9: Qualitative comparison of subject-driven generation. The figure is best viewed digitally, in color and significantly zoomed in. Control signal BOFT (6 matrices) SD* (0 matrix) 5 matrices 4 matrices 3 matrices 2 matrices 1 matrix

Figure 10: Model weight interpolation by setting the trained butterfly components one by one to identity matrix. We use BOFT( m = 5 , r = 16 )to finetune Stable Diffusion (SD). No retraining is performed when we gradually set each trained orthogonal matrix ( ˜Bi) to an identity. The number in the figure denotes the number of remaining orthogonal butterfly components that has not been set to identity. Text prompt: a man with a beard smiling (for the first row) and a smiling woman (for the second row). *0 matrix is the case of SD with a learned control head.

are visually more plausible in terms of the subject identity preservation. We can see from Figure 9 that the original OFT also shows good performance in preserving subject identity while LoRA has better text prompt following ability. In sharp contrast, BOFT can achieve the best of both worlds by simultaneously demonstrating good subject identity preservation as well as accurate text prompt following ability. Notably, for the bottom-left toy duck case in Figure 9, we observe that BOFT can capture the essence of the toy and generate a cubed shaped toy with a conceptually similar color.

BOFT comes with free weight interpolation . We have a surprising yet interesting discovery that uniquely distinguishes BOFT from existing methods in controllable generation. BOFT consists of multiple orthogonal matrices ( i.e. , multiple butterfly components), and the product of these matrices gives the complete finetuned model. However, what will happen if we set the trained orthogonal butterfly components to identity matrix one by one without retraining? If we set all the butterfly components to identity, the model reduces to Stable Diffusion. If no butterfly components are set to identity, then we have the full BOFT-finetuned model. After the BOFT training, the structure of multiple butterfly components provides us with a free weight interpolation on the orthogonal manifold. We perform the weight interpolation for all the BOFT-finetuned layers in Stable Diffusion. Specifically, we use BOFT with m = 5 , r = 16 , so we have 6 butterfly components. We set the butterfly components one by one to identity matrix, starting from left-hand side. The results are given in Figure 10. Surprisingly, although these interpolated weights have not been retrained, they can still generate plausible images. In fact, as we set more butterfly components to identity, the interpolated model produces a smooth interpolated result, from a landmark-controlled image to an uncontrolled Stable Diffusion image. These results well validate that the hypothesis weight space ( i.e. , model space) in BOFT can well preserve the semantics and effectively eliminate many bad local minima.

7 CONCLUDING REMARKS AND LIMITATIONS

Our paper proposes Orthogonal Butterfly, a generic parameter-efficient finetuning method for foun-dation models based on the butterfly structure. The key insight to better parameter-efficiency is to 10 Published as a conference paper at ICLR 2024 parameterize a dense orthogonal matrix with the product multiple sparse orthogonal matrices. To easily find feasible matrix factorizations, we propose a graphical information transmission framework. Under this framework, we find that the butterfly structure can effectively achieve our desiderata of sparse orthogonal matrix factorization. We demonstrate the empirical effectiveness of BOFT in finetuning large language models, large vision models and text-to-image generative models. Our experiments also validate the superiority of BOFT as a generic mode finetuning method. Despite empirical effectiveness, BOFT is by no means perfect. Since the final orthogonal matrix in BOFT is the product of multiple orthogonal matrix, the training runtime overhead is slightly larger than OFT. How to improve BOFTs training runtime remains an open problem. Fortunately, after the finetuning stage, the BOFT-learned orthogonal matrices can be directly multiplied into the pretrained model and there is no additional inference latency. Moreover, whether the butterfly network is the most efficient way to transmit information is also unknown. Our information transmission framework further enables us to draw inspiration from a distinct research discipline computer networking, where the efficiency of a network topology for transmitting information is heavily studied. We expect that more efficient network structures can be used for composing dense orthogonal matrices.

ACKNOWLEDGMENT AND AUTHOR CONTRIBUTION

The authors would like to sincerely thank Peter Kulits for careful proofreading, Tim Z. Xiao and many other colleagues at Max Planck Institute for Intelligent Systems for countless helpful suggestions. This work was supported by the German Federal Ministry of Education and Research (BMBF): Tubingen AI Center, FKZ: 01IS18039B, and by the Machine Learning Cluster of Excellence, EXC number 2064/1 Project number 390727645. WL was supported by the German Research Foundation (DFG): SFB 1233, Robust Vision: Inference Principles and Neural Mechanisms, TP XX, project number: 276693517. YF and SP are partially supported by the Max Planck ETH Center for Learning Systems. Yuliang Xiu is funded by the European Unions Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 860768 (CLIPE). AW acknowl-edges support from a Turing AI Fellowship under grant EP/V025279/1, The Alan Turing Institute, and the Leverhulme Trust via CFI. MJB has received research gift funds from Adobe, Intel, Nvidia, Meta/Facebook, and Amazon. MJB has financial interests in Amazon, Datagen Technologies, and Meshcapade GmbH. While MJB is a consultant for Meshcapade, his research in this project was performed solely at, and funded solely by, the Max Planck Society. Weiyang Liu and Zeju Qiu contributed equally as the joint first author. Yao Feng, Yuliang Xiu, Yuxuan Xue and Longhui Yu contributed equally as the joint second author. Authors with equal contributions are listed in alphabetical order and allowed to change their orders freely on their resume and website. Michael J. Black, Adrian Weller and Bernhard Sch ¨olkopf jointly supervised the project, provided generous computing support and contributed significantly to the project (including but not limit to idea discussion, direction suggestion and paper writing). Weiyang Liu initialized the core idea, organized the project, co-developed the current method, co-supervised the experiments, and wrote the draft. Zeju Qiu co-initialized the core idea, co-developed the current method, implemented most of the prototypes, conducted the GLUE, VTAB-1K and SAM finetuning experiments, co-supervised the experiments and contributed to the draft writing. Yao Feng implemented a fast version of BOFT in CUDA, contributed to the method development, and conducted the experiments of multi-modal finetuning. Yuliang Xiu led the experimental efforts in controlling text-to-image diffusion models, conducted the experiments of both controllable (ControlNet) and subject-driven (DreamBooth) generation, contributed to the method development and paper writing. Yuxuan Xue contributed to the method development and the experiments of vision foundation models. Longhui Yu led the experimental efforts in MMLU and mathematical question answering and contributed to the method development. Haiwen Feng contributed to the method development and the experiments of text-to-image generation. Juyeon Heo conducted the robustness experiments. All the team members made necessary contributions to the method development and paper writing.

Large language models : the experiments are jointly led by Zeju Qiu and Longhui Yu. Weiyang Liu contributed to the model debugging.

Vision foundation models : the experiments are led by Zeju Qiu. Yuxuan Xue contributed to the baseline experiments and model debugging. Weiyang Liu contributed to the model debugging.

Text-to-image diffusion models : the experiments are led by Yuliang Xiu. Zeju Qiu, Weiyang Liu and Haiwen Feng contributed to the model debugging. 11 Published as a conference paper at ICLR 2024

REFERENCES

[1] Armen Aghajanyan, Luke Zettlemoyer, and Sonal Gupta. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. arXiv preprint arXiv:2012.13255 , 2020. 3 [2] Theodore W Anderson, Ingram Olkin, and Les G Underhill. Generation of random orthogonal matrices. SIAM Journal on Scientific and Statistical Computing , 1987. 6 [3] Alan Ansell, Edoardo Ponti, Anna Korhonen, and Ivan Vuli ´c. Composable sparse fine-tuning for cross-lingual transfer. In ACL , 2022. 3 [4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhari-wal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. In NeurIPS , 2020. 1, 2 [5] Emmanuel J Cand `es, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM , 2011. 2 [6] Arnav Chavan, Zhuang Liu, Deepak Gupta, Eric Xing, and Zhiqiang Shen. One-for-all: Generalized lora for parameter-efficient fine-tuning. arXiv preprint arXiv:2306.07967 , 2023. 1, 3, 8, 19 [7] Beidi Chen, Tri Dao, Kaizhao Liang, Jiaming Yang, Zhao Song, Atri Rudra, and Christopher Re. Pixelated butterfly: Simple and efficient sparse training for neural network models. In

ICLR , 2022. 3, 5, 7 [8] Jiaao Chen, Aston Zhang, Xingjian Shi, Mu Li, Alex Smola, and Diyi Yang. Parameter-efficient fine-tuning design spaces. In ICLR , 2023. 3 [9] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 , 2021. 1 [10] Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. Adaptformer: Adapting vision transformers for scalable visual recognition. In NeurIPS ,2022. 3 [11] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 , 2021. 7 [12] James W Cooley and John W Tukey. An algorithm for the machine calculation of complex fourier series. Mathematics of computation , 19(90):297301, 1965. 2, 5 [13] Tri Dao, Albert Gu, Matthew Eichhorn, Atri Rudra, and Christopher R ´e. Learning fast algorithms for linear transforms using butterfly factorizations. In ICML , 2019. 3, 6, 7 [14] Tri Dao, Nimit Sohoni, Albert Gu, Matthew Eichhorn, Amit Blonder, Megan Leszczynski, Atri Rudra, and Christopher R ´e. Kaleidoscope: An efficient, learnable representation for all structured linear maps. In ICLR , 2020. 3, 6, 7, 29 [15] Tri Dao, Beidi Chen, Nimit S Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, and Christopher R ´e. Monarch: Expressive structured matrices for efficient and accurate training. In ICML , 2022. 3, 7 [16] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314 , 2023. 3 [17] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019. 7 [18] Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su, Shengding Hu, Yulin Chen, Chi-Min Chan, Weize Chen, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nature Machine Intelligence , 2023. 3 12 Published as a conference paper at ICLR 2024 [19] Ali Edalati, Marzieh Tahaei, Ivan Kobyzev, Vahid Partovi Nia, James J Clark, and Mehdi Rezagholizadeh. Krona: Parameter efficient tuning with kronecker adapter. arXiv preprint arXiv:2212.10650 , 2022. 3 [20] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618 , 2022. 20 [21] Mozhdeh Gheini, Xiang Ren, and Jonathan May. Cross-attention is all you need: Adapting pretrained transformers for machine translation. In EMNLP , 2021. 3 [22] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In NeurIPS , 2017. 7 [23] Demi Guo, Alexander M Rush, and Yoon Kim. Parameter-efficient transfer learning with diff pruning. arXiv preprint arXiv:2012.07463 , 2020. 1, 3 [24] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. To-wards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366 ,2021. 1, 3 [25] Pengcheng He, Jianfeng Gao, and Weizhu Chen. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. In ICLR , 2022. 7, 19 [26] Wu Hecong. ControlLoRA: A Lightweight Neural Network To Control Stable Diffusion Spatial Information. https://github.com/HighCWu/ControlLoRA, 2023. 20 [27] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In ICLR , 2021. 7 [28] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In ICML , 2019. 1, 3, 7 [29] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. In ICLR , 2022. 1, 2, 3, 6, 7, 8[30] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In ECCV , 2022. [31] Shibo Jie and Zhi-Hong Deng. Fact: Factor-tuning for lightweight adaptation on vision transformer. In AAAI , 2023. 3, 8 [32] Li Jing, Yichen Shen, Tena Dubcek, John Peurifoy, Scott Skirlo, Yann LeCun, Max Tegmark, and Marin Solja ˇci ´c. Tunable efficient unitary neural networks (eunn) and their application to rnns. In ICML , 2017. 3 [33] Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. Compacter: Efficient low-rank hypercomplex adapter layers. In NeurIPS , 2021. 3 [34] Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, and Fisher Yu. Segment anything in high quality. In NeurIPS , 2023. 8, 9, 20 [35] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything.

arXiv preprint arXiv:2304.02643 , 2023. 2, 7, 8, 20 [36] Ralf Klasing, Burkhard Monien, Regine Peine, and Elena A St ¨ohr. Broadcasting in butterfly and debruijn networks. Discrete Applied Mathematics , 53(1-3):183197, 1994. 2, 3 [37] Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In CVPR , 2023. 20 13 Published as a conference paper at ICLR 2024 [38] Benjamin Lefaudeux, Francisco Massa, Diana Liskovich, Wenhan Xiong, Vittorio Caggiano, Sean Naren, Min Xu, Jieru Hu, Marta Tintore, Susan Zhang, Patrick Labatut, and Daniel Haziza. xformers: A modular and hackable transformer modelling library. https://github.com/ facebookresearch/xformers, 2022. 20 [39] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 , 2021. 1, 3 [40] S-YR Li, Raymond W Yeung, and Ning Cai. Linear network coding. IEEE transactions on information theory , 2003. 3 [41] Shuo-Yen Robert Li, Qifu Tyler Sun, and Ziyu Shao. Linear network coding: Theory and algorithms. Proceedings of the IEEE , 2011. 5 [42] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In ACL , 2021. 1, 3 [43] Yingzhou Li, Haizhao Yang, Eileen R Martin, Kenneth L Ho, and Lexing Ying. Butterfly factorization. Multiscale Modeling & Simulation , 13(2):714732, 2015. 3 [44] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In COLT ,2018. 7 [45] Vladislav Lialin, Vijeta Deshpande, and Anna Rumshisky. Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647 , 2023. 3 [46] Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features: A new baseline for efficient model tuning. In NeurIPS , 2022. 3 [47] Jun Hao Liew, Scott Cohen, Brian Price, Long Mai, and Jiashi Feng. Deep interactive thin object selection. In WACV , 2021. 8 [48] Zhaojiang Lin, Andrea Madotto, and Pascale Fung. Exploring versatile generative language model via parameter-efficient transfer learning. arXiv preprint arXiv:2004.03829 , 2020. 1 [49] Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. In NeurIPS , 2022. 3 [50] Weiyang Liu, Rongmei Lin, Zhen Liu, Lixin Liu, Zhiding Yu, Bo Dai, and Le Song. Learning towards minimum hyperspherical energy. In NeurIPS , 2018. 3 [51] Weiyang Liu, Zhen Liu, James M Rehg, and Le Song. Neural similarity learning. In NeurIPS ,2019. 7 [52] Weiyang Liu, Rongmei Lin, Zhen Liu, James M Rehg, Liam Paull, Li Xiong, Le Song, and Adrian Weller. Orthogonal over-parameterized training. In CVPR , 2021. 3, 5 [53] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019. 7 [54] Gen Luo, Minglang Huang, Yiyi Zhou, Xiaoshuai Sun, Guannan Jiang, Zhiyu Wang, and Rongrong Ji. Towards efficient visual adaption via structural re-parameterization. arXiv preprint arXiv:2302.08106 , 2023. 3 [55] Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, and Sayak Paul. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022. 20 [56] Yuning Mao, Lambert Mathias, Rui Hou, Amjad Almahairi, Hao Ma, Jiawei Han, Wen-tau Yih, and Madian Khabsa. Unipelt: A unified framework for parameter-efficient language model tuning. arXiv preprint arXiv:2110.07577 , 2021. 3 14 Published as a conference paper at ICLR 2024 [57] Eric Michielssen and Amir Boag. A multilevel matrix decomposition algorithm for analyzing scattering from large structures. IEEE Transactions on Antennas and Propagation , 1996. 3 [58] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normaliza-tion for generative adversarial networks. In ICLR , 2018. 6 [59] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453 , 2023. 9 [60] Marina Munkhoeva, Yermek Kapushev, Evgeny Burnaev, and Ivan Oseledets. Quadrature-based features for kernel approximation. In NeurIPS , 2018. 3 [61] Michael ONeil, Franco Woolfe, and Vladimir Rokhlin. An algorithm for the rapid evaluation of special function transforms. Applied and Computational Harmonic Analysis , 2010. 3 [62] Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 , 2023. 7, 8, 19 [63] Douglass Stott Parker. Random butterfly transformations with applications in computational linear algebra . UCLA Computer Science Department, 1995. 3, 30 [64] John Peca-Medlin. Numerical, spectral, and group properties of random butterfly matrices .University of California, Irvine, 2021. 30, 31, 32 [65] Jonas Pfeiffer, Aishwarya Kamath, Andreas R ¨uckl ´e, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task composition for transfer learning. arXiv preprint arXiv:2005.00247 , 2020. 1, 7 [66] Xuebin Qin, Hang Dai, Xiaobin Hu, Deng-Ping Fan, Ling Shao, and Luc Van Gool. Highly accurate dichotomous image segmentation. In ECCV , 2022. 8 [67] Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Sch ¨olkopf. Controlling text-to-image diffusion by orthogonal finetuning. In NeurIPS , 2023. 1, 2, 3, 5, 7, 9, 20 [68] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML , 2021. 2 [69] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR , 2020. 1 [70] Indra Rajasingh, Paul Manuel, N Parthiban, D Azubha Jemilet, and R Sundara Rajan. Trans-mission in butterfly networks. The Computer Journal , 2016. 3 [71] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. In NeurIPS , 2017. 1 [72] Steven Roman, S Axler, and FW Gehring. Advanced linear algebra , volume 3. Springer, 2005. 6[73] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR , 2022. 1, 2, 7, 9, 20 [74] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aber-man. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR , 2023. 9, 20 [75] Yan Solihin. Fundamentals of parallel multicore architecture . CRC Press, 2015. 5 15 Published as a conference paper at ICLR 2024 [76] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In

ICLR , 2021. 20 [77] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. JMLR , 2014. 6 [78] Yi-Lin Sung, Varun Nair, and Colin A Raffel. Training neural networks with fixed sparse masks. NeurIPS , 2021. 3 [79] Yi-Lin Sung, Jaemin Cho, and Mohit Bansal. Lst: Ladder side-tuning for parameter and memory efficient transfer learning. In NeurIPS , 2022. 3 [80] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford alpaca, 2023. 7, 20 [81] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 , 2023. 7 [82] Marcos Treviso, Ji-Ung Lee, Tianchu Ji, Betty van Aken, Qingqing Cao, Manuel R Ciosici, Michael Hassid, Kenneth Heafield, Sara Hooker, Colin Raffel, et al. Efficient methods for natural language processing: A survey. Transactions of the Association for Computational Linguistics , 2023. 3 [83] Thomas Trogdon. On spectral and numerical properties of random butterfly matrices. Applied Mathematics Letters , 95:4858, 2019. 30 [84] Mojtaba Valipour, Mehdi Rezagholizadeh, Ivan Kobyzev, and Ali Ghodsi. Dylora: Parameter efficient tuning of pre-trained models using dynamic search-free low-rank adaptation. arXiv preprint arXiv:2210.07558 , 2022. 3 [85] Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022. 20 [86] Tu Vu, Brian Lester, Noah Constant, Rami Al-Rfou, and Daniel Cer. Spot: Better frozen model adaptation through soft prompt transfer. In ACL , 2022. 3 [87] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding, 2019. 7, 19 [88] Yaqing Wang, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, and Jianfeng Gao. Adamix: Mixture-of-adapter for parameter-efficient tuning of large language models. In EMNLP , 2022. 3 [89] Eric Xing, Michael Jordan, Stuart J Russell, and Andrew Ng. Distance metric learning with application to clustering with side-information. In NeurIPS , 2002. 6 [90] Yuichi Yoshida and Takeru Miyato. Spectral norm regularization for improving the generaliz-ability of deep learning. arXiv preprint arXiv:1705.10941 , 2017. 6 [91] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284 , 2023. 7, 20 [92] Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. In ACL , 2022. 1, 3, 7, 8 [93] Yi Zeng, Pingping Zhang, Jianming Zhang, Zhe Lin, and Huchuan Lu. Towards high-resolution salient object detection. In ICCV , 2019. 8 16 Published as a conference paper at ICLR 2024 [94] Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark.

arXiv preprint arXiv:1910.04867 , 2019. 8 [95] Feiyu Zhang, Liangzhi Li, Junhao Chen, Zhouqiang Jiang, Bowen Wang, and Yiming Qian. Increlora: Incremental parameter allocation method for parameter-efficient fine-tuning. arXiv preprint arXiv:2308.12043 , 2023. 3 [96] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In ICCV , 2023. 9, 20 [97] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In ICLR , 2023. 1, 3, 7, 19 [98] Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. Neural prompt search. arXiv preprint arXiv:2206.04673 , 2022. 3 [99] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ade20k dataset. In CVPR , 2017. 20 [100] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision , 2019. 20 [101] Hao Zhu, Wayne Wu, Wentao Zhu, Liming Jiang, Siwei Tang, Li Zhang, Ziwei Liu, and Chen Change Loy. CelebV-HQ: A large-scale video facial attributes dataset. In ECCV , 2022. 20 [102] Bojia Zi, Xianbiao Qi, Lingzhi Wang, Jianan Wang, Kam-Fai Wong, and Lei Zhang. Delta-lora: Fine-tuning high-rank parameters with the delta of low-rank matrices. arXiv preprint arXiv:2309.02411 , 2023. 3 17 Published as a conference paper at ICLR 2024

Appendix

Table of Contents

A Experimental Details 19

A.1 Natural Language Understanding . . . . . . . . . . . . . . . . . . . . . . . . . 19 A.2 Transfer Learning on VTAB-1K . . . . . . . . . . . . . . . . . . . . . . . . . 19 A.3 Experimental Details in Llama Finetuning . . . . . . . . . . . . . . . . . . . . 19 A.4 Experimental Details in SAM Finetuning . . . . . . . . . . . . . . . . . . . . 20 A.5 Experimental Details in ControlNet and DreamBooth . . . . . . . . . . . . . . 20

B More Qualitative Results on High-quality Segmentation 21 C More Qualitative Results in Subject-driven Generation 22 D More Qualitative Results in Controllable Generation 26 E Mathematical Question-Answering Case Study 28 F Proof of Theorem 1 29 G Mathematical Properties of Butterfly Matrices 30

G.1 Balanced Entry-wise Learning Rate . . . . . . . . . . . . . . . . . . . . . . . 30 G.2 An Alternative Definition of Orthogonal Butterfly Matrices . . . . . . . . . . . 30 G.3 Topological Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 G.4 Input Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 G.5 Random Orthogonal Butterfly Matrices . . . . . . . . . . . . . . . . . . . . . 32

H Illustration of Block Butterfly Matrices 33 I More Results on Expressiveness of BOFT 34

18 Published as a conference paper at ICLR 2024

A EXPERIMENTAL DETAILS

A.1 NATURAL LANGUAGE UNDERSTANDING

For our experiments on the GLUE benchmark [ 87 ], we follow the setting of [ 97 ] and only tune the learning rate, the multiplicative dropout rate, and the number of training epochs. We use the pre-trained DeBERTaV3 [ 25 ]1 as our base model and apply the OFT and BOFT to every linear layer in every transformer blocks. All runs can be trained on a single NVIDIA A100-SXM4-80GB GPU. See the hyperparameters used in our runs in Table 7.

Method Dataset MNLI SST-2 MRPC CoLA QNLI QQP RTE STS-B DeBERTaV3-base OFT (16) Batch Size 32 32 32 32 32 32 32 32 # Epochs 5214 54934 11 Learning Rate 8E-05 2E-04 9E-04 4E-04 2E-04 3E-04 3E-04 7E-04 OFT Dropout 1E-01 1E-01 1E-01 5E-02 1E-01 1E-01 5E-02 1E-01 Max Seq. Len. 256 128 320 64 512 320 320 128 DeBERTaV3-base OFT (8,1) Batch Size 32 32 32 32 32 32 32 32 # Epochs 10 11 16 15 410 68Learning Rate 7E-05 5E-05 8E-04 5E-04 2E-04 3E-04 4E-04 7E-04 OFT Dropout 15E-02 15E-02 1E-01 15E-02 5E-02 5E-02 5E-02 5E-02 Max Seq. Len. 256 128 320 64 512 320 320 128 Table 7: Hyperparameter setup used for DeBERTaV3-base on the GLUE benchmark.

A.2 TRANSFER LEARNING ON VTAB-1K In our VTAB-1K experiments, we employ the DINOv2-large [ 62 ]2 as our base model for fine-tuning. Our architecture design aligns with GLoRA [ 6], injecting trainable OFT and BOFT weights into every linear layer in all multihead self-attention (MSA) and MLP blocks. To ensure a fair comparison, we maintain the identical training setups for our and the baseline methods: a total number of 30 training epochs, a fixed training batch size of 64 , an AdamW optimizer, and a cosine learning rate scheduler with a warmup ratio of 0.1. Notably, due to the supernet structure of GLoRA [ 6], we tested both training for 30 and 100 epochs. We conduct a grid search on the learning rate for both our method and baseline methods and report the best final test sets Top-1 classification accuracy after the final epoch. For BOFT and OFT, we additionally adopt a multiplicative dropout rate of 0.1 and apply a weight decay rate of 0.02 . All methods are trained on a single NVIDIA A100-SXM4-80GB GPU. The exact learning rate details for OFT and BOFT can be found in Table 8. Dataset Cifar100 Caltech101 DTD Flowers102 Pets SVHN Sun397 Camelyon EuroSAT Resisc45 Retinopathy Clevr-Count Clevr-Dist DMLab KITTI-Dist dSpr-Loc dSpr-Ori sNORB-Azim sNORB-Ele OFT b=16 8e-4 5e-4 6e-4 2e-3 3e-4 3e-3 1e-3 9e-4 9e-4 5e-4 1e-3 2e-4 4e-4 2e-3 1e-3 2e-3 3e-4 4e-3 6e-4 BOFT m=4

b=4 8e-4 5e-4 6e-4 2e-3 3e-4 3e-3 1e-3 1e-3 6e-4 9e-4 2e-3 2e-4 3e-4 3e-3 9e-4 4e-3 5e-4 5e-3 8e-4 BOFT m=6 b=2 8e-4 9e-4 1e-3 1e-3 8e-4 3e-3 1e-3 4e-3 9e-4 1e-3 2e-3 4e-4 4e-4 4e-3 3e-3 4e-3 8e-4 4e-3 9e-4 Table 8: Hyperparameter setup ( i.e. , learning rate) used for DINOv2-large on the VTAB-1K benchmark.

A.3 EXPERIMENTAL DETAILS IN LLAMA FINETUNING

In the Llama-related finetuning experiments, language understanding and mathematical question answering, we fixed the batch size as 64 and the training epoch as 2. For all the Lora, OFT, and BOFT experiments, we use the cosine learning scheduler and the warmup of the first 100 learning steps. We finetune the Llama model on the first generated 512 tokens, which is sufficient for these two tasks. We use the AdamW optimizer with a 1e-4 learning rate and 8e-4 learning rate for the language understanding task and mathematical question-answering task, respectively. The multiplicative dropout used for language understanding and mathematical question answering is 0.1 and 0.05,

1

https://huggingface.co/microsoft/deberta-v3-base

2

https://huggingface.co/facebook/dinov2-large

19 Published as a conference paper at ICLR 2024 respectively. For language understanding, we evaluate the performance on MMLU with both zero-shot and 5-shot evaluation. For mathematical question answering, we basically follow the evaluation tools in MetaMathQA [ 91 ], where they use the Alpaca [ 80 ] prompt and evaluate the model in zero-shot. The generation temperature is set as 0 for both tasks. A.4 EXPERIMENTAL DETAILS IN SAM F INETUNING

We generally follow the training and evaluation settings as HQ-SAM [ 34 ]. Specifically, we use a learning rate of 0.0005, a cosine annealing learning rate scheduler, AdamW optimizer with a weight decay rate of 0.01 and a multiplicative dropout rate of 0.005. During fine-tuning, we keep the SAM model [35] 3 frozen. A.5 EXPERIMENTAL DETAILS IN CONTROL NET AND DREAM BOOTH

For our experiments on the ControlNet [ 96 ] and DreamBooth [ 74 ], we mainly follow the setting of OFT [ 67 ] but re-implement them using HuggingFaces Diffusers [ 85 ] and Parameter-Efficient Fine-Tuning (PEFT) [ 55 ]4. Specifically, we use Stable Diffusion [ 73 ] (v2.1) 5 as our pretrained model, and DDIMScheduler [ 76 ] as our scheduler function. The attached fine-tuned PEFT layers within UNet are {to q, to v, to k, query, value, key }. Both training and testing are conducted on NVIDIA A100-SXM4-80GB, memory efficient attention from xFormers [ 38 ] is employed. Some specific settings for ControlNet and DreamBooth are as follows:

ControlNet We train ControlNet image encoder (lightweight 8-conv-layer network same as Con-trolLoRA [ 26 ]) and the attached PEFT layers within UNet, with the learning rate of 1e-5. Regarding the optimizer, we use AdamW with a weight decay as 1e-2, adam epsilon as 1e-8, and a constant learn-ing rate scheduler. For datasets, we train Segmentation-to-Image (S2I) task on ADE20K [ 99 , 100 ], and Landmark-to-Face (L2F) task on CelebV-HQ [ 101 ], both for 20 epochs, with 16 batch size. Specifically, we employ dropout ( p2 = 0 .1) and {PEFT } only bias type on PEFT layers (BOFT, OFT, LoRA).

DreamBooth Instead of fine-tuning text transformer [ 20 , 37 ], we exclusively fine-tune the cross-attention layers (K, V, Q) of UNet with a learning rate of 3e-5, batch size of 4 for 2000 steps. Regarding the optimizer, we use AdamW with a weight decay as 1e-2 and Adam epsilon as 1e-8, a constant learning rate scheduler. Furthermore, we pre-generate 200 images conditioned on each class of input images, for prior preservation training weighted by 1.0. We use the same dataset as DreamBooth [ 74 ] with the resolution of 512. Same as ControlNet, we employ dropout ( p2 = 0 .1)and {PEFT } only bias type on PEFT layers (BOFT, OFT, LoRA).

Task Metric LoRA OFT BOFT # Params 2.52M 20.89M 20.76M S2I mIoU ↑24.72 29.44 28.83 mAcc ↑37.88 42.12 41.24 aAcc ↑60.43 67.53 67.74 L2F Error ↓8.038 6.407 5.667 Table 9: Quantitative evaluation of S2I and L2F. The best results are marked with “ bold ”, and the second best results are marked with “underline”.

More Results For ControlNet, apart from Landmark-to-Face (L2F) generation, we also benchmark different fine-tuning methods on Segmentation-to-Image (S2I) generation. The quantitative results are given in Table 9. Notably, these reported numbers represent the best possible results achieved by methods, across all variations of parameter configurations (#Param spans from 2M to 20M), trained for 20 epochs. We observe that different control task requires different finetun-ing flexbility and some control task may be easier than the other. We find that the S2I task is generally easier than the L2F task and does not require strong finetuning flexibility. In the S2I task, the best performance of BOFT from 2M to 20M is similar to that of OFT. However, BOFT can still achieve better control performance than OFT given less amount of parameter budget, demonstrating its parameter efficiency. More qualitative results of ControlNet are shown in Figure 17 and Figure 16. Regarding DreamBooth, more subject-driven generations are shown from Figure 9 to Figure 15.

3

https://github.com/facebookresearch/segment-anything

4

https://huggingface.co/docs/peft

5

https://huggingface.co/stabilityai/stable-diffusion-2-1

20 Published as a conference paper at ICLR 2024

B MORE QUALITATIVE RESULTS ON HIGH -QUALITY SEGMENTATION SAM BOFT-SAM SAM BOFT-SAM SAM BOFT-SAM BOFT-SAM SAM

Figure 11: More qualitative comparison of mask prediction between SAM and BOFT-SAM.

21 Published as a conference paper at ICLR 2024

C MORE QUALITATIVE RESULTS IN SUBJECT -DRIVEN GENERATION Text prompt : a [V] backpack floating on top of water

Text prompt : a [V] backpack in the snow

Text prompt : a [V] backpack on top of a dirt road

Text prompt : a [V] teapot with a tree and autumn leaves in the background

Text prompt : a [V] teapot on top of a wooden floor

Text prompt : a [V] teapot on top of a dirt road

LoRA OFT BOFT Input images

Figure 12: Qualitative comparison of Subject-driven Generation.

22 Published as a conference paper at ICLR 2024 Text prompt : a [V] vase on a cobblestone street

Text prompt : a [V] vase on top of the sidewalk in a crowded street

Text prompt : a [V] vase on top of a white rug

Text prompt : a [V] sneaker on top of a dirt road

Text prompt : a [V] sneaker on top of green grass with sunflowers around it

Text prompt : a [V] sneaker with a city in the background

LoRA OFT BOFT Input images

Figure 13: Qualitative comparison of Subject-driven Generation.

23 Published as a conference paper at ICLR 2024 Text prompt : a [V] glasses with a blue house in the background

Text prompt : a [V] glasses with a mountain in the background

Text prompt : a [V] glasses on top of a white rug

Text prompt : a [V] toy on the beach

Text prompt : a shiny [V] toy

Text prompt : a [V] toy with a wheat field in the background

LoRA OFT BOFT Input images

Figure 14: Qualitative comparison of Subject-driven Generation.

24 Published as a conference paper at ICLR 2024 Text prompt : a [V] dog with a blue house in the background

Text prompt : a [V] dog wearing a santa hat

Text prompt : a [V] dog on top of a purple rug in a forest

Text prompt : [V] stuffed animal in the jungle

Text prompt : a [V] stuffed animal floating on top of water

Text prompt : a wet [V] stuffed animal

LoRA OFT BOFT Input images

Figure 15: Qualitative comparison of Subject-driven Generation.

25 Published as a conference paper at ICLR 2024

D MORE QUALITATIVE RESULTS IN CONTROLLABLE GENERATION Text prompt : a man with a beard

GT LoRA Control signal OFT BOFT

Text prompt : a man with a beard

GT LoRA Control signal OFT BOFT

Text prompt : a man with sunglasses on

GT LoRA Control signal OFT BOFT

Text prompt : a man with grey hair and beards

GT LoRA Control signal OFT BOFT

Text prompt : a man with long hair

GT LoRA Control signal OFT BOFT

Text prompt : a man with wet hair

GT LoRA Control signal OFT BOFT

Text prompt : a woman wearing sunglasses and holding a rose

GT LoRA Control signal OFT BOFT

Text prompt : a woman wearing sunglasses and holding a rose

GT LoRA Control signal OFT BOFT

Figure 16: More qualitative results in landmark-to-face generation.

26 Published as a conference paper at ICLR 2024 Text prompt : a house

GT LoRA Control signal OFT BOFT

GT LoRA Control signal OFT BOFT Text prompt : a brick building

GT LoRA Control signal OFT BOFT GT LoRA Control signal OFT BOFT GT LoRA Control signal OFT BOFT GT LoRA Control signal OFT BOFT GT LoRA Control signal OFT BOFT GT LoRA Control signal OFT BOFT Text prompt : a building with stairs leading up to it

Text prompt : a thomas train at disneyland world

Text prompt : a man with a beard

Text prompt : people looking at an aquarium

Text prompt : an empty room

Text prompt : a conference room

Figure 17: More qualitative results in segmentation map-to-image generation.

27 Published as a conference paper at ICLR 2024

E MATHEMATICAL QUESTION -A NSWERING CASE STUDY

Example E.1: Case Study 1

Question: James buys 5 packs of beef that are 4 pounds each. The price of beef is $5.50 per pound. How much did he pay? (Ground-truth answer is 110 )Llama-2 Answer: James bought 5 packs of beef that are 4 pounds each. The price of beef is $5.50 per pound. So he paid 5 * 4 * 5.50 = 137.50 dollars. The answer is 137.50.

BOFT Answer: James bought 5 packs of beef, and each pack is 4 pounds, so he bought a total of 5 * 4 = 20 pounds of beef. The price of beef is $5.50 per pound, so James paid 20 * $5.50 = $110. The answer is: 110

Example E.2: Case Study 2

Question: James decides to run 3 sprints 3 times a week. He runs 60 meters each sprint. How many total meters does he run a week? (Ground-truth answer is 540 )Llama-2 Answer: James runs 60 meters each sprint. So he runs 60 * 3 = 180 meters each week. The answer is 180.

BOFT Answer: James runs 3 sprints 3 times a week, so he runs 3 sprints x 3 times = 9 sprints in a week. Each sprint is 60 meters, so James runs 9 sprints x 60 meters = 540 meters in a week. Therefore, James runs a total of 540 meters in a week. The answer is: 540

Example E.3: Case Study 3

Question: Carl buys ten packs of cookies. Each pack of cookies has six cookies inside. Each cookie cost $0.10. How much change does Carl receive if he pay with a $10 bill? (Ground-truth answer is 4)Llama-2 Answer: Carl buys ten packs of cookies. Each pack of cookies has six cookies inside. Each cookie cost $0.10. So he pays $10 for 60 cookies. Now he gets 60 * 0.10 = $6 change. The answer is $6.

BOFT Answer: Carl buys ten packs of cookies, so he buys 10 * 6 = 60 cookies.The total cost of the cookies is 60 * $0.10 = $6.Carl pays with a $10 bill, so he receives $10 - $6 = $4 in change. The answer is: 4

Example E.4: Case Study 4

Question: Miguel uses 2 pads of paper a week for his drawing. If there are 30 sheets of paper on a pad of paper, how many sheets of paper does he use every month? (Ground-truth answer is 240 )Llama-2 Answer: Miguel uses 2 pads of paper a week. There are 30 sheets of paper on a pad of paper. So he uses 2 * 30 = 60 sheets of paper every week. Now he uses 60 * 4 = 240 sheets of paper every month. The answer is 240.

BOFT Answer: Miguel uses 2 pads of paper a week, and there are 30 sheets of paper on a pad, so he uses 2 x 30 = 60 sheets of paper every week. Since there are 4 weeks in a month, Miguel uses 60 x 4 = 240 sheets of paper every month. Therefore, Miguel uses 240 sheets of paper every month. The answer is: 240

Example E.5: Case Study 5

Question: After transferring to a new school, Amy made 20 more friends than Lily. If Lily made 50 friends, how many friends do Lily and Amy have together? (Ground-truth answer is 120 )Llama-2 Answer: Amy made 20 more friends than Lily. So Lily had 50 - 20 = 30 friends. Amy had 30 + 20 = 50 friends. So Lily and Amy have 50 + 50 = 100 friends together. The answer is 100.

BOFT Answer: If Lily made 50 friends, and Amy made 20 more friends than Lily, then Amy made 50 + 20 = 70 friends. Together, Lily and Amy have 50 + 70 = 120 friends. The answer is: 120

Example E.6: Case Study 6

Question: After Andrea saved some money, she then spent the rest of her money on an $11 sweater and gave her brother $4. If she had $36 in the beginning, how much did Andrea save? (Ground-truth answer is 21 )Llama-2 Answer: Andrea started with 36 dollars. She then saved some money. Then she spent the rest of her money on an $11 sweater and gave her brother $4. So she spent 36 - (saved money) - (sweater price) -(brother price) = 36 - 10 - 11 - 4 = 11 dollars. The answer is 11.

BOFT Answer: After Andrea saved some money, she then spent the rest of her money on an $11 sweater and gave her brother $4. If she had $36 in the beginning, how much did Andrea save?Andrea saved $36 - $11 - $4 = $21Therefore, Andrea saved $21. The answer is: 21

28 Published as a conference paper at ICLR 2024

F PROOF OF THEOREM 1

We first show that BOFT is always more expressive than OFT with the same block size. We write down the final orthogonal matrix in BOFT with block size 2b:

RBOFT (m, b ) =

m

Y

i=1

˜Bb(d, 2mi+1 )= ˜Bb(d, 2m) · · · ˜Bb(d, 22) · ˜Bb(d, 21)

(3) where we can let ˜Bb(d, 2m), m ∈ [2 , m ] to be identity matrices. Then we have

RBOFT (m, b ) = ˜Bb(d, 2m) · · · ˜Bb(d, 22)

| {z }

=I

· ˜Bb(d, 2) = ˜Bb(d, 2)

(4) which is an orthogonal matrix with the block size 2b. This is exactly the orthogonal matrix used in OFT with the block size 2b. Therefore, BOFT with m > 1 is always more expressive than OFT with the same block size. When m = 1 , BOFT reduces to OFT. Then we proceed to prove that the following expression can represent any orthogonal matrix:

RExt = Bd1,1(d)B

d1,2

(d) · · · B1,1(d)B

1,2

(d) (5) where Bi,j (d), ∀i, ∀j are butterfly matrices. RExt fall into the category of kaleidoscope matrices [ 14 ](with the diagonal matrix being an identity). [ 14 ] has shown that the orthogonal kaleidoscope matrix can represent any orthogonal matrices. The overall proof idea is simple and can be given by the following two results from [14]: • Any orthogonal matrix can be represented by QR factorization which can be decomposed by n 1 Householder reflections. • All Householder reflections can be represented by Bi(d)B

i

(d).Then we can easily arrive at the conclusion that RExt can represent any orthogonal matrix. 29 Published as a conference paper at ICLR 2024

G MATHEMATICAL PROPERTIES OF BUTTERFLY MATRICES

Due to its nice spectral and numerical properties, butterfly structures (especially random butterfly matrices) are introduced by [ 63 ] to remove the need of pivoting in Gaussian elimination. In this section, we discuss a few intriguing mathematical properties of butterfly matrices. Because orthogonal butterfly matrices are a special subset of the orthogonal matrices, they introduce additional inductive biases that help to further regularize the finetuned model. To gain a deeper understanding of the induced inductive biases, we delve into the mathematical properties of orthogonal butterfly matrices with the hope to understand the effect of such additional inductive biases. We note that these properties are natural and direct consequences of the established results in [ 64 , 83 ]. For our results to be self-contained, we also provide the brief proof of the results here. G.1 BALANCED ENTRY -WISE LEARNING RATE

Butterfly matrices also have an interesting balanced learning rate property. We take 8-dimensional butterfly matrix as an example. Assume that we fill all the entries in each butterfly component with 1.



1 0 0 0 1 0 0 00 1 0 0 0 1 0 00 0 1 0 0 0 1 00 0 0 1 0 0 0 11 0 0 0 1 0 0 00 1 0 0 0 1 0 00 0 1 0 0 0 1 00 0 0 1 0 0 0 1



·



1 0 1 0 0 0 0 00 1 0 1 0 0 0 01 0 1 0 0 0 0 00 1 0 1 0 0 0 00 0 0 0 1 0 1 00 0 0 0 0 1 0 10 0 0 0 1 0 1 00 0 0 0 0 1 0 1



·



1 1 0 0 0 0 0 01 1 0 0 0 0 0 00 0 1 1 0 0 0 00 0 1 1 0 0 0 00 0 0 0 1 1 0 00 0 0 0 1 1 0 00 0 0 0 0 0 1 10 0 0 0 0 0 1 1



=



1 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 1 11 1 1 1 1 1 1 1



which means that the butterfly matrix preserves the learning rate in each entry of the butterfly component. For block butterfly matrices, we consider the following example:



1 1 0 0 1 1 0 01 1 0 0 1 1 0 00 0 1 1 0 0 1 10 0 1 1 0 0 1 11 1 0 0 1 1 0 01 1 0 0 1 1 0 00 0 1 1 0 0 1 10 0 1 1 0 0 1 1



·



1 1 1 1 0 0 0 01 1 1 1 0 0 0 01 1 1 1 0 0 0 01 1 1 1 0 0 0 00 0 0 0 1 1 1 10 0 0 0 1 1 1 10 0 0 0 1 1 1 10 0 0 0 1 1 1 1



=



2 2 2 2 2 2 2 22 2 2 2 2 2 2 22 2 2 2 2 2 2 22 2 2 2 2 2 2 22 2 2 2 2 2 2 22 2 2 2 2 2 2 22 2 2 2 2 2 2 22 2 2 2 2 2 2 2



which shows that the product of all-one block butterfly components still have the balanced learning rate property. However, we also note that whenever we initialize a butterfly component, the elements are not the same (Each butterfly component is initialized as an identity matrix). While butterfly matrices have the property of balanced entry-wise learning rate, we note that the same property does not necessarily exist for general sparse matrix factorization. G.2 AN ALTERNATIVE DEFINITION OF ORTHOGONAL BUTTERFLY MATRICES

In the main paper, we define the butterfly matrices by first constructing the pattern of non-zero elements in each butterfly component and then constraining each block matrix as an orthogonal matrix. Alternatively, orthogonal butterfly matrices can also be defined in an reverse order by first constructing orthogonal matrices and then build the butterfly matrix from ground up. 30 Published as a conference paper at ICLR 2024

Definition 1 (Generalized Rotation Matrix) . A generalized rotation matrix is a N × N matrix ( N is even) of the following form:  C S

S C



, (6)

where C and S are commuting, symmetric real matrices of the size N

2

× N

2

, and C2 + S2 = I.The scalar rotation matrices use scalar matrices C, S. The diagonal rotation matrices use diagonal matrices C, S.

Definition 2 (Orthogonal Butterfly Matrix) . An orthogonal butterfly matrix, denoted collectively as

OB (N ), is an iteratively defined matrix of order N = 2 n:

 CA 1 SA 2

SA 1 CA 2



=

 C S

S C



·

A1 00 A2



, (7)

where A1, A2 ∈ OB ( N

2

), and

 C S

S C



is a generalized rotation matrix. An orthogonal butterfly matrix is simple if A1 = A2 at each iteration step, and is non-simple otherwise. When N = 1 , the orthogonal butterfly matrix is defined as 1.

Definition 3 (Diagonal and Scalar Butterfly Matrix) . The diagonal and scalar butterfly matrices are the orthogonal butterfly matrices constructed iteratively with diagonal and scalar rotation matrices, respectively. We denote the order-N simple scalar butterfly matrices as OB s(N ). For B ∈ OB s(N ),we define B = OB (A, θ ) where A = OB s( N

2

) if B is of the following form:

 cos( θ)A sin( θ)A

sin( θ)A cos( θ)A



=

 cos( θ)I sin( θ)I

sin( θ)I cos( θ)I



·

A 00 A



=

A 00 A



·

 cos( θ)I sin( θ)I

sin( θ)I cos( θ)I



.

(8)

We use OB (N ) to denote the non-simple scalar butterfly matrices.

We also define that

OB (θ) =

n

O

i=1

OB (n i + 1) = OB (θn) ⊗ · · · ⊗ OB (θ1). (9) where ⊗ denotes the Kronecker product and OB (θi) ∈ SO (2) .

Remark 1. We note that the butterfly matrices considered in the main paper is what we define as non-simple diagonal butterfly matrices here. Despite not directly discussing the properties of such family of butterfly matrices, we aim to provide some useful insights through a few necessary simplifications.

G.3 TOPOLOGICAL PROPERTIES

Proposition 1 ([ 64 ]) . OB (N ) and OB s(N ) are compact spaces in SO (N ), which are homeomorphic to quotients of higher dimensional tori Tn and TN 1 where N = 2 n.

Proposition 2 ([ 64 ]) . The diagonal simple and non-simple butterfly matrices are compact spaces in

SO (N ), which are homeomorphic to quotients of higher dimensional tori TN 1 and T 12 N n (N = 2 n), respectively.

These two propositions characterize the topological properties of orthogonal butterfly matrices. There are also many interesting group properties. For example, OB s(N ) is a compact abelian subgroup of SO (N ). These topological and group structures and connections provide many unique inductive biases for the orthogonal finetuning. 31 Published as a conference paper at ICLR 2024 G.4 INPUT SENSITIVITY

One interesting property to study for the butterfly matrices is how the input parameter changes the output matrix norm. Specifically, we can upper bound the difference between the input-perturbed butterfly matrix and the original butterfly matrix. Specifically, we have the following result from [ 64 ]:

Proposition 3 (Upper Bound for Simple Scalar Butterfly Matrices) . Let OB (θ) ∈ OB s(N ) and

ϵ ∈ Rn where N = 2 n. Then we have that

∥OB (θ) OB (θ + ϵ)∥F ≤ √N ∥ϵ∥1 (10)

Proposition 4 (Upper Bound for General Scalar Butterfly Matrices) . Let OB (θ) ∈ OB (N ) and

ϵ ∈ RN 1. Then we have that

∥OB (θ) OB (θ + ϵ)∥F ≤ √N 1 ∥ϵ∥2 (11) The above two propositions show that the map θ → OB (θ) is Lipshitz continuous. G.5 RANDOM ORTHOGONAL BUTTERFLY MATRICES

Definition 4 (Random Orthogonal Butterfly Matrix) . A random orthogonal butterfly matrix is a butterfly matrix that is generated by random generalized rotation matrices with

Σ := {Cj , Sj }j≥1 (12)

which is an independent sequence of pairs of matrices. Each pair {Cj , Sj } generates a random generalized rotation matrix of order 2j1

32 Published as a conference paper at ICLR 2024

H ILLUSTRATION OF BLOCK BUTTERFLY MATRICES

We provide an illustration for block butterfly components and matrices in Figure 18. B(16,8)

~B(16,4) ~B(16,2) ~ B(16,16) ~

(a) 16-dimensional butterfly matrix with the block size 1

B (8,8) ~B (8,4) ~B (8,2) ~222

(b) 8-dimensional butterfly matrix with the block size 2

B (4,4) ~B (4,2) ~44

(c) 4-dimensional butterfly matrix with the block size 4

Figure 18: Block butterfly components and matrices with the block size 1, 2 and 4.

33 Published as a conference paper at ICLR 2024

I MORE RESULTS ON EXPRESSIVENESS OF BOFT

We use a full dense orthogonal matrix to finetune DINOv2 on downstream tasks and then obtain such a dense orthogonal matrix (with 10 random seeds). We finally obtain 10 finetuned 1024 × 1024

dense orthogonal matrices and conduct the same experiment as Figure 5. The results in Figure 19 still validate that BOFT yields better expressiveness under the same number of parameters. 10 3 10 4 10 5Number of trainable parameters 1.6 1.65 1.7 1.75 1.8 1.85 Rescaled approximation error BOFT (1:10, 2) BOFT (1:9, 4) BOFT (1:7, 16) BOFT (1:5, 64)

Figure 19: Expressiveness of BOFT for learned dense orthogonal matrices (10 random seeds).

34