30 PEFT methods reframed as hypotheses about transformer geometry. Each entry: pseudocode, hypothesis, evidence, grade. All papers saved to docs/ (full text).
Title: Orthogonal Finetuning Made Scalable

URL Source: https://arxiv.org/pdf/2506.19847

Published Time: Thu, 16 Oct 2025 00:06:59 GMT

Number of Pages: 18

Markdown Content:

# Orthogonal Finetuning Made Scalable

Zeju Qiu¹,† Weiyang Liu¹,²,†,* Adrian Weller³,⁴ Bernhard Schölkopf¹

¹Max Planck Institute for Intelligent Systems ²The Chinese University of Hong Kong ³University of Cambridge ⁴The Alan Turing Institute

†Equal contribution *Project lead, Correspondence to wyliu@cse.cuhk.edu.hk spherelab.ai/oftv2

## Abstract

Orthogonal finetuning (OFT) offers highly parameter-efficient adaptation while preventing catastrophic forgetting, but its high runtime and memory demands limit practical deployment. We identify the core computational bottleneck in OFT as its weight-centric implementation, which relies on costly matrix-matrix multiplications with cubic complexity. To overcome this, we propose OFTv2, an input-centric reformulation that instead uses matrix-vector multiplications (i.e., matrix-free computation), reducing the computational cost to quadratic. We further introduce the Cayley-Neumann parameterization, an efficient orthogonal parameterization that approximates the matrix inversion in the Cayley transform via a truncated Neumann series. These modifications allow OFTv2 to achieve up to 10× faster training and 3× lower GPU memory usage without compromising performance. In addition, we extend OFTv2 to support finetuning quantized foundation models and show that it outperforms the popular QLoRA in training stability, efficiency, and memory usage.

## 1 Introduction

As foundation models continue to improve in performance, recent years have witnessed a paradigm shift from end-to-end learning to a pretraining-finetuning framework. This shift underscores the need for finetuning methods that are both effective and scalable. Owing to its training stability and adaptation efficiency, orthogonal finetuning (OFT) (Qiu et al., 2023; Liu et al., 2024) has emerged as a promising approach for adapting foundation models to downstream tasks. However, while performing well, OFT incurs high computational and memory costs, limiting its scalability. Motivated by these challenges, we seek to make OFT more scalable to large foundation models.

[Figure 1 plot: OFT vs. OFTv2, with panels for GPU memory (GB) and training time (s) / 100 iterations; annotations: >3× and >10×.]

Figure 1: OFTv2 significantly reduces training time and GPU memory usage without sacrificing performance. The finetuning is performed with Qwen2.5-7B.

Towards this goal, we begin by identifying the key bottleneck that limits OFT's scalability. At its core, OFT learns layer-shared orthogonal matrices to transform pretrained weight matrices, resulting in a naive weight-centric implementation where forward inference is performed after merging the learned orthogonal matrices into weight matrices during training. The weight-centric implementation thus involves matrix-matrix multiplications with cubic complexity. As weight matrices grow large, this cubic scaling severely limits OFT's applicability to large foundation models.

However, these matrix-matrix multiplications are not fundamentally necessary. We draw inspiration from matrix-free methods (Chen, 2005), such as the power method and the Lanczos algorithm, which avoid explicit matrix-matrix operations by treating matrices as linear operators applied to vectors. These methods operate entirely through matrix-vector multiplications, applying a matrix to vectors in the appropriate space without ever forming full matrix products. Guided by the same insight, we introduce an input-centric implementation of OFT, in which the learned orthogonal transformations are applied directly to the input vectors during each forward pass, rather than being merged into the weight matrix. This reformulation reduces the complexity from cubic to quadratic. We refer to this new formulation as OFTv2. Despite its simplicity, this change significantly enhances the scalability of OFT, making it suitable for finetuning large foundation models that the original OFT could not handle due to memory constraints.

arXiv:2506.19847v2 [cs.LG] 14 Oct 2025

Another scalability bottleneck in OFT arises from the Cayley parameterization used by Liu et al. (2021a); Qiu et al. (2023); Liu et al. (2024) to preserve orthogonality. While effective, this parameterization involves computing a matrix inverse, which becomes increasingly costly and less numerically stable as weight matrices get larger. To address this, we use a numerically stable yet efficient approximation: the Cayley–Neumann parameterization (CNP) (Qiu et al., 2025). By replacing the matrix inverse in the original Cayley transform with a truncated Neumann series, CNP offers improved numerical stability and lower computational cost, particularly in settings where OFT is applied to finetune large foundation models. With CNP, OFTv2 becomes even more scalable and readily applicable for efficient adaptation of such models. In Figure 1, we compare OFT and OFTv2 by performing finetuning tasks on Qwen2.5-7B, which is the largest model that the original OFT can finetune within a single Nvidia H100 (80GB). These empirical results demonstrate that OFTv2 achieves substantial GPU memory savings and training speed-up over the original OFT formulation (Qiu et al., 2023).

In practice, finetuning ultra-large foundation models (e.g., LLaMA 3.1-70B (Grattafiori et al., 2024), Qwen 2.5-72B (Yang et al., 2024a)) typically requires quantization to fit within GPU memory limits. To support this, we follow the general design of the QLoRA framework (Dettmers et al., 2023) but replace LoRA with OFTv2. Our input-centric implementation of orthogonal finetuning enables a seamless application to the finetuning of quantized foundation models, resulting in QOFT, an efficient orthogonal finetuning that enables efficient adaptation of quantized ultra-large models.

Our major contributions are summarized below:

- Inspired by matrix-free methods that avoid matrix-matrix multiplications in solving linear systems, we propose OFTv2, an input-centric reformulation of OFT that achieves significantly better scalability, with more than 10× faster training and 3× lower GPU memory usage.
- We apply the Cayley–Neumann parameterization (Qiu et al., 2025) in OFTv2. It approximates the Cayley transform with a truncated Neumann series and eliminates matrix inversions.
- Owing to the new input-centric formulation, we adapt OFTv2 to finetuning quantized foundation models. This enables memory-efficient finetuning of ultra-large models.
- We apply OFTv2 and its quantized variant to different foundation models (including large language models and text-to-image generative models) across various model scales.

## 2 Related Work

Parameter-efficient finetuning (PEFT). As foundation models become increasingly large and powerful, there has been growing interest in finetuning them for downstream tasks in a parameter-efficient manner (Houlsby et al., 2019; Aghajanyan et al., 2020; Hu et al., 2022a; Edalati et al., 2022; Wang et al., 2022; Gheini et al., 2021; Zaken et al., 2022; Guo et al., 2020; Sung et al., 2021; Ansell et al., 2022; Lester et al., 2021; Li and Liang, 2021; Vu et al., 2022; He et al., 2021; Mao et al., 2021; Karimi Mahabadi et al., 2021; Liu et al., 2022; Sung et al., 2022; Chen et al., 2023; Jia et al., 2022; Chen et al., 2022; Zhang et al., 2022; Jie and Deng, 2023; Lian et al., 2022; Luo et al., 2023; Zhang et al., 2024; Wu et al., 2024). In particular, reparameterization-based methods (e.g., Aghajanyan et al. (2020); Hu et al. (2022a); Edalati et al. (2022); Zi et al. (2023); Chavan et al. (2023)) are enjoying wide adoption. LoRA (Hu et al., 2022a) learns a pair of small low-rank matrices whose product is added to each weight matrix, enabling task adaptation with a small number of trainable parameters. Building on LoRA, several works dynamically adjust the rank across layers to better balance the parameter budget (Zhang et al., 2023b; Valipour et al., 2022; Zhang et al., 2023a, 2024). To improve scalability, QLoRA (Dettmers et al., 2023) quantizes the frozen base model to 4-bit NormalFloat with double quantization and back-propagates only through LoRA, achieving near full-precision accuracy while drastically lowering memory usage.

Orthogonal Finetuning. Qiu et al. (2023); Liu et al. (2024) propose a reparameterization-based method that learns layer-shared orthogonal matrices to transform neurons, yielding strong generalization and stable training. This is motivated by the observation that hyperspherical energy (i.e., a geometric characterization of neurons on the unit sphere) influences generalization (Liu et al., 2018, 2021b; Lin et al., 2020; Liu et al., 2023), and that orthogonal transformations keep this energy invariant (Liu et al., 2021a). A growing body of research (Ma et al., 2024; Yang et al., 2024b; Gorbunov et al., 2024; Yuan et al., 2024; Feng et al., 2025; Raj and Coyle, 2025; Lingam et al., 2024; Bini et al., 2024; Su et al., 2024; Liao and Monz, 2024) builds upon the core idea of OFT. Figure 2 provides a comparison between OFT and LoRA. OFT achieves parameter efficiency through sparsity, whereas LoRA relies on a low-rank structure.

Figure 2: Comparison between LoRA and OFT. Panels: (a) Low-rank Structure in LoRA; (b) Sparse Orthogonal Structure in OFT.

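The contrast between LoRA's low-rank structure and OFT's block-diagonal sparsity can be made concrete by counting trainable parameters. The sketch below is illustrative only: the function names and the layer shape are hypothetical, and the OFT count assumes each block is generated from a skew-symmetric matrix (as in a Cayley-style parameterization), so only the strictly upper-triangular entries are free.

```python
def lora_params(d: int, n: int, rank: int) -> int:
    """Trainable parameters of a rank-`rank` LoRA update for a d x n weight."""
    # One factor is d x rank, the other is rank x n.
    return rank * (d + n)


def oft_params(d: int, block_size: int) -> int:
    """Trainable parameters of a block-diagonal d x d orthogonal matrix.

    Assumption: each b x b block comes from a skew-symmetric matrix,
    which has b * (b - 1) / 2 free entries.
    """
    if d % block_size != 0:
        raise ValueError("block_size must divide d")
    num_blocks = d // block_size
    return num_blocks * (block_size * (block_size - 1) // 2)


d, n = 4096, 4096  # hypothetical layer shape
print(lora_params(d, n, rank=16))    # 131072
print(oft_params(d, block_size=32))  # 63488
print(d * d)                         # dense d x d matrix: 16777216
```

Under these assumptions both methods train a small fraction of the entries a dense matrix would need; they differ in whether the saving comes from low rank or from sparsity.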
## 3 OFTv2: Faster and More Scalable

### 3.1 Preliminaries

Let $W = [w_1, \cdots, w_n] \in \mathbb{R}^{d \times n}$ be a weight matrix with columns $w_i \in \mathbb{R}^d$. In a linear layer, the forward pass is $z = W^\top x$, where $x \in \mathbb{R}^d$ is the input and $z \in \mathbb{R}^n$ is the output. OFT reparameterizes the weight matrix with $W_{\text{OFT}} = R W_0$, where $W_0$ is the pretrained weight matrix and $R \in \mathbb{R}^{d \times d}$ is an orthogonal matrix. OFT only learns $R$ for adapting the pretrained model to downstream tasks. To enforce orthogonality, Liu et al. (2021b); Qiu et al. (2023); Liu et al. (2024) parameterize $R$ using the Cayley transform: $R = (I + Q)(I - Q)^{-1}$, where $Q$ is a skew-symmetric matrix satisfying $Q = -Q^\top$. To further improve parameter-efficiency, OFT constrains the orthogonal matrix $R$ to have a block-diagonal structure: $R = \mathrm{Diag}(R_1, \cdots, R_r)$, where for any $i$, $R_i \in \mathbb{R}^{b \times b}$ is a small orthogonal matrix and $b \cdot r = d$. Each $R_i$ can be parameterized using the Cayley transform. This block-diagonal form imposes a sparsity pattern on $R$, effectively making it a sparse orthogonal matrix. Leveraging this structure, Liu et al. (2024) further enhance parameter efficiency using butterfly factorization.

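The construction above can be sketched in NumPy: build each block from a skew-symmetric matrix with the Cayley transform, then place the blocks on the diagonal. This is a minimal illustration, not the authors' implementation; the helper names (`cayley`, `random_skew`, `block_diag_orthogonal`) and the random initialization are assumptions made for the example.

```python
import numpy as np


def cayley(q: np.ndarray) -> np.ndarray:
    """Return R = (I + Q)(I - Q)^{-1} for a skew-symmetric Q."""
    eye = np.eye(q.shape[0])
    # R (I - Q) = (I + Q)  <=>  (I - Q)^T R^T = (I + Q)^T, solved for R^T.
    return np.linalg.solve((eye - q).T, (eye + q).T).T


def random_skew(b: int, rng: np.random.Generator, scale: float = 0.1) -> np.ndarray:
    """Hypothetical initializer: a random b x b matrix with Q = -Q^T."""
    a = scale * rng.standard_normal((b, b))
    return a - a.T


def block_diag_orthogonal(d: int, b: int, rng: np.random.Generator) -> np.ndarray:
    """Assemble R = Diag(R_1, ..., R_r) with r = d / b Cayley blocks."""
    if d % b != 0:
        raise ValueError("block size b must divide d")
    r = np.zeros((d, d))
    for start in range(0, d, b):
        r[start:start + b, start:start + b] = cayley(random_skew(b, rng))
    return r


rng = np.random.default_rng(0)
r = block_diag_orthogonal(d=8, b=4, rng=rng)
print(np.allclose(r.T @ r, np.eye(8)))  # True: R is orthogonal
```

The dense `d x d` array is built here only to make the orthogonality check easy to read; a practical implementation would presumably store just the blocks.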
### 3.2 From Weight-centric Implementation to Input-centric Implementation

OFT performs finetuning by learning an orthogonal matrix to directly transform the weight matrix, which naturally leads to a weight-centric implementation of the forward pass:

$$
z = \underbrace{\overbrace{W_0^\top R^\top}^{\text{(1) Weight transform: matrix-matrix mult.}} x}_{\text{(2) Linear map: matrix-vector mult.}} \tag{1}
$$

The original OFT first performs a weight transform by computing $W_{\text{OFT}}^\top = W_0^\top R^\top$ (i.e., a matrix-matrix multiplication) and then computes the results of a linear layer with the equivalent weight matrix $W_{\text{OFT}}^\top$ (i.e., a matrix-vector multiplication). This incurs $O(nd^2)$ complexity due to the matrix-matrix multiplication. Inspired by matrix-free methods for solving linear systems, we observe that OFT's forward pass can be interpreted as two linear maps applied to the input. This leads to an input-centric implementation

$$
z = \underbrace{W_0^\top \overbrace{R^\top x}^{\text{(1) Linear map: matrix-vector mult.}}}_{\text{(2) Linear map: matrix-vector mult.}} \tag{2}
$$

where only two matrix-vector multiplications are required, reducing the complexity from cubic to quadratic: $O(nd + d^2)$. This simple conceptual shift in implementation entails a substantial speed-up in training time and reduction in GPU memory.

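The equivalence of the two forward passes, and where the cost difference comes from, can be checked with a few lines of NumPy. This is a hedged sketch rather than the paper's code: the function names and layer sizes are hypothetical, and the orthogonal matrix is taken from a QR decomposition of a random matrix purely as a stand-in for a learned $R$.

```python
import numpy as np


def forward_weight_centric(w0: np.ndarray, r: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Merge R into the weight first, then apply the merged weight to x."""
    w_oft = r @ w0       # (d, d) @ (d, n): matrix-matrix product, O(n d^2)
    return w_oft.T @ x   # (n, d) @ (d,):   matrix-vector product, O(n d)


def forward_input_centric(w0: np.ndarray, r: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Rotate the input first, then apply the frozen pretrained weight."""
    rotated = r.T @ x       # (d, d) @ (d,): matrix-vector product, O(d^2)
    return w0.T @ rotated   # (n, d) @ (d,): matrix-vector product, O(n d)


rng = np.random.default_rng(0)
d, n = 64, 48  # hypothetical small layer

w0 = rng.standard_normal((d, n))                  # stand-in pretrained weight
r, _ = np.linalg.qr(rng.standard_normal((d, d)))  # stand-in orthogonal matrix
x = rng.standard_normal(d)

z_weight = forward_weight_centric(w0, r, x)
z_input = forward_input_centric(w0, r, x)
print(np.allclose(z_weight, z_input))  # True: both compute W_0^T R^T x
```

The input-centric version never forms the merged `d x n` weight, which is also why the pretrained weight can stay frozen (or quantized) while only the rotation is trained.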
### 3.3 Approximate Orthogonality via Cayley-Neumann Parameterization

The Cayley parameterization constructs an orthogonal matrix $R$ with $(I + Q)(I - Q)^{-1}$, where $Q$ is a skew-symmetric matrix. One limitation of this formulation is that it only generates rotation matrices, though empirical studies (Liu et al., 2021a; Qiu et al., 2023; Liu et al., 2024) suggest that this restriction does not negatively affect performance. More critically, computing a matrix inverse introduces numerical instability and additional computational overhead, making it challenging to scale to large orthogonal matrices. To address this, we use the Cayley-Neumann parameterization proposed by Qiu et al. (2025), where the matrix inverse is approximated by a truncated Neumann series:

R = ( I + Q)( I − Q)−1 = ( I + Q) |