Title: Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation

URL Source: https://arxiv.org/pdf/2505.11235

Published Time: Fri, 20 Feb 2026 01:42:14 GMT

Number of Pages: 29

Markdown Content:

Published as a conference paper at ICLR 2026

# EFFICIENT ORTHOGONAL FINE-TUNING WITH PRINCIPAL SUBSPACE ADAPTATION

Fei Wu, Jia Hu∗, Geyong Min∗, Shiqiang Wang

Department of Computer Science, University of Exeter, UK

{fw407,j.hu,g.min,s.wang9}@exeter.ac.uk

## ABSTRACT

Driven by the rapid growth of model parameters, parameter-efficient fine-tuning (PEFT) has become essential for adapting large models to diverse downstream tasks under constrained computational resources. Within this paradigm, orthogonal fine-tuning and its variants preserve semantic representations of pre-trained models, but struggle to achieve both expressiveness and efficiency in terms of parameter counts, memory, and computation. To overcome this limitation, we propose efficient Orthogonal Fine-Tuning with Principal Subspace adaptation (PSOFT), which confines orthogonal transformations to the principal subspace of pre-trained weights. Specifically, PSOFT constructs this subspace via matrix decomposition to enable compatible transformations, establishes a theoretical condition that strictly maintains the geometry of this subspace for essential semantic preservation, and introduces efficient tunable vectors that gradually relax orthogonality during training to enhance adaptability. Extensive experiments on 35 NLP and CV tasks across four representative models demonstrate that PSOFT offers a practical and scalable solution to simultaneously achieve semantic preservation, expressiveness, and multi-dimensional efficiency in PEFT. The code is publicly available at https://github.com/fei407/PSOFT.

## 1 INTRODUCTION

Pre-trained foundation models including large language models (LLMs) (Grattafiori et al., 2024) and vision transformers (ViT) (Dosovitskiy et al., 2021) have transformed natural language processing (NLP) (Qin et al., 2023) and computer vision (CV) (Liu et al., 2023). This success is attributed to emergent abilities (Wei et al., 2022) that arise as these models are scaled up. However, their ever-growing scale poses a practical barrier to efficiently tailoring (i.e., fine-tuning) these sophisticated foundation models to specific downstream tasks. To address this challenge, parameter-efficient fine-tuning (PEFT) has emerged as a promising paradigm that adapts models by updating only a minimal subset of parameters (Houlsby et al., 2019; Lester et al., 2021; Li & Liang, 2021; Hu et al., 2022; Meng et al., 2024; yang Liu et al., 2024).

> [Figure 1 diagram. Panels: LoRA (pre-trained weights $W_{\text{pre}} \in \mathbb{R}^{d \times n}$ with low-rank $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times n}$); OFT (pre-trained weights $W_{\text{pre}} \in \mathbb{R}^{d \times n}$ with orthogonal weights $R_{\perp} \in \mathbb{R}^{d \times d}$); PSOFT (residual weights $W_{\text{res}} \in \mathbb{R}^{d \times n}$, principal weights $A_{\perp} \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times n}$, orthogonal weights $R_{\perp} \in \mathbb{R}^{r \times r}$, and vectors $\alpha, \beta \in \mathbb{R}^{r}$).]

Figure 1: Overview of the architectures of LoRA, OFT, and the proposed PSOFT.

Among PEFT studies, reparameterization-based methods (Hu et al., 2022; Qiu et al., 2023) are widely adopted because they seamlessly integrate with pre-trained weights without adding inference latency. As illustrated in the left panel of Figure 1, reparameterization-based methods include Low-Rank Adaptation (LoRA) (Hu et al., 2022) and Orthogonal Fine-Tuning (OFT) (Liu et al., 2021; Qiu et al., 2023). LoRA has been widely adopted for its efficient low-rank structure, but it may distort semantic representations embedded in the pre-trained weights. These semantic representations can be understood as the geometric structure of weight vectors, specifically the pairwise angles and norms among columns, which encode relational information learned during pre-training. Distorting this structure may weaken the model’s ability to transfer knowledge to downstream tasks (Wang et al., 2023). In contrast, OFT applies isometric orthogonal transformations, which strictly maintain this geometric structure and thereby preserve semantic representations. However, full-dimensional orthogonal transformations are inefficient in terms of parameter counts, memory, and computation, rendering them impractical for large-scale applications. This contrast leaves a gap in PEFT between the efficiency of LoRA and the semantic preservation of OFT.

∗ Corresponding author.

arXiv:2505.11235v3 [cs.LG] 19 Feb 2026

Table 1: Comparison of LoRA, OFT variants, and the proposed PSOFT. The table summarizes the trade-off among semantic preservation, multi-dimensional efficiency, and expressiveness (as reflected in performance) across PEFT methods.

| Methods | Semantic Representations (explicitly preserved) | Parameter-efficiency Mechanism | Memory Usage | Computational Overhead | Performance |
|---|---|---|---|---|---|
| LoRA (Hu et al., 2022) | ✗ | Low-rank | Low | Low | Medium |
| Full OFT (Liu et al., 2021) | Full space | ✗ | Very High | Very High | High |
| Block-diagonal OFT (Qiu et al., 2023) | Full space | Block-diagonal | Medium | Medium | Medium-High |
| BOFT (Liu et al., 2024) & qGOFT (Ma et al., 2024) | Full space | Butterfly factorization / Givens rotation | High | High | High |
| PSOFT (Ours) | Principal subspace | Low-rank | Low ↓ | Low ↓ | High ↑ |

Building on OFT’s advantages, several studies have explored ways to improve its efficiency while retaining its core strength. Early attempts such as block-diagonal OFT (Qiu et al., 2023) reduced parameter counts and partially alleviated computational and memory overhead through block-diagonal sparsity. However, the rigid block structure restricts the model’s expressiveness (its ability to capture diverse transformations) and consequently limits the performance that can be empirically attained. To address this limitation, later variants such as BOFT (Liu et al., 2024) and qGOFT (Ma et al., 2024) have sought to restore expressiveness while maintaining parameter efficiency by composing multiple sparse orthogonal matrices in sequence. Yet this design incurs a new drawback: chaining multiple sparse matrices introduces substantial intermediate states that dominate runtime and memory consumption. Empirically, qGOFT has been reported to run nearly 6× slower than LoRA during training (Ma et al., 2024), while BOFT and qGOFT frequently consume more than 80 GB of memory in large-scale model settings. Such overhead inflates training costs and undermines their practicality.
Thus, sparsity-driven OFT variants struggle to achieve both expressiveness and efficiency across multiple dimensions. This tension underlies the central challenge of our work:

How to design a PEFT method that simultaneously achieves semantic preservation, expressiveness, and multi-dimensional efficiency (parameter counts, memory, and computation)?

To address this challenge, motivated by evidence that both pre-trained models and their task-specific adaptations reside in a subspace of low intrinsic rank (Li et al., 2018; Aghajanyan et al., 2021; Hu et al., 2022), we propose efficient Orthogonal Fine-Tuning with Principal Subspace adaptation (PSOFT), as illustrated in the right panel of Figure 1. The key idea is to confine orthogonal transformations to the low-rank principal subspace of pre-trained weights, thereby overcoming the limitations of conventional OFT operating in the full parameter space and simultaneously achieving semantic preservation, expressiveness, and multi-dimensional efficiency.

However, realizing this idea is non-trivial, as it entails overcoming several technical difficulties: 1) Compatibility. A low-dimensional orthogonal transformation cannot be directly applied to the high-dimensional weight matrix, leading to dimensional incompatibility with the pre-trained model. 2) Geometry preservation. Naively applying low-rank orthogonal transformations may distort the geometry of the subspace, thereby undermining the strict preservation of essential semantic representations. 3) Adaptability. Strict orthogonality constraints may hinder adaptation to slight task-specific drifts, resulting in suboptimal performance on downstream tasks.

PSOFT resolves these difficulties through principled designs. First, it constructs a principal subspace of pre-trained weights through matrix decomposition, enabling compatible orthogonal transformations and yielding a higher rank that enhances expressiveness. Next, it establishes a theoretical condition to strictly maintain the geometry of the subspace, thereby ensuring essential semantic preservation. Finally, it introduces efficient tunable vectors to gradually relax orthogonality during training at negligible cost, improving adaptability across diverse downstream tasks.

We evaluate PSOFT through extensive experiments on 35 NLP and CV tasks with four representative pre-trained models. Compared with OFT variants, PSOFT consistently avoids out-of-memory (OOM) failures and accelerates training. On small-scale models, it achieves up to 18× higher parameter efficiency with the lowest memory footprint among baselines, without compromising average performance. On larger models, PSOFT lowers the memory footprint of OFT to a level comparable with LoRA-like methods while outperforming LoRA on GSM-8K (+2.3%) and Commonsense Reasoning (+1.4%) with comparable parameter counts. As summarized in Table 1, PSOFT preserves semantic representation in the principal subspace while minimizing parameter counts, memory, and computation overhead, and simultaneously maintains expressiveness as reflected in high performance.

The main contributions of this work are summarized as follows:

- We introduce a new low-rank perspective that unifies efficiency and expressiveness in OFT, bridging the gap between low-rank adaptation and orthogonal fine-tuning.
- We establish a theoretical condition under which low-dimensional orthogonal fine-tuning strictly preserves the geometric structure of the subspace.
- We propose PSOFT, a framework that confines OFT to the principal subspace with theoretical guarantees and practical adaptability.
- We validate PSOFT through extensive experiments, establishing a practical and scalable solution to simultaneously achieve semantic preservation, expressiveness, and multi-dimensional efficiency.

## 2 RELATED WORK

Parameter-Efficient Fine-Tuning (PEFT). PEFT adapts pre-trained models to diverse downstream tasks by fine-tuning only a small subset of parameters. Specifically, existing PEFT methods fall into three categories: 1) Selection-based methods select specific components of the pre-trained model without altering its architecture (Zaken et al., 2022; Song et al., 2024; Xu & Zhang, 2024). 2) Addition-based methods insert prompts or adapters at the input or within Transformer blocks (Houlsby et al., 2019; Pfeiffer et al., 2021; Lester et al., 2021; Li & Liang, 2021; Liu et al., 2022). 3) Reparameterization-based methods reparameterize weights in parallel with minimal parameters (Hu et al., 2022; Azizi et al., 2024; Bałazy et al., 2024; Gao et al., 2024; Kopiczko et al., 2024; Lingam et al., 2024; yang Liu et al., 2024; Meng et al., 2024). Reparameterization-based methods are particularly appealing since they incur no additional inference latency, with representative examples including LoRA (Hu et al., 2022) and OFT (Qiu et al., 2023). LoRA’s variants, such as PiSSA (Meng et al., 2024) and DoRA (yang Liu et al., 2024), improve convergence through re-initialization and enhance performance via weight decomposition, respectively. DoRA decomposes the low-rank update into direction and magnitude components, but it may introduce additional memory and computational overhead for computing these components. In addition, LaMDA (Azizi et al., 2024) and LoRA-XS (Bałazy et al., 2024) reduce the parameter count and resource usage of LoRA by employing more compact matrices. In LoRA-XS, the learnable square matrix is constrained by the fixed LoRA matrices, which may limit its expressiveness. However, these LoRA-based methods may induce semantic drift from the pre-trained representations (Wang et al., 2023), which may degrade output quality in generative tasks.

Orthogonal Fine-Tuning (OFT). Unlike additive methods such as LoRA, multiplicative OFT preserves semantic representations of pre-trained models through orthogonal transformations, which maintain the hyperspherical energy among neurons (Liu et al., 2021; Qiu et al., 2023). To mitigate the prohibitive cost of applying orthogonal transformations over the full parameter space, prior studies typically introduce sparsity constraints. For instance, block-diagonal OFT (Qiu et al., 2023) adopts a block-diagonal sparse structure to reduce parameter counts, though at the risk of undesired inductive biases (Liu et al., 2024). BOFT (Liu et al., 2024) and qGOFT (Ma et al., 2024) address this issue by replacing dense matrices with sequences of sparse multiplications, thereby improving parameter efficiency while restoring expressiveness. Nevertheless, these variants remain less efficient in memory and computation than LoRA and its variants. In parallel, Adapter R (Zhang & Pilanci, 2024) rotates the top spectral space using orthogonal transformations to preserve spectral characteristics of pretrained weights, in contrast to the geometric structure emphasized in OFT. Overall, existing OFT variants struggle to achieve both expressiveness and efficiency across multiple dimensions. These limitations motivate our PSOFT algorithm, which confines orthogonal transformations to the principal subspace with a theoretical guarantee of preserving essential semantic representations, followed by a relaxation of strict orthogonality at negligible cost to enhance adaptability.

## 3 PRELIMINARIES

In this section, we formalize LoRA and OFT variants in mathematical notation, providing a unified view of their parameterization strategies. Conventional full fine-tuning (FFT) updates the entire pre-trained weight matrix $W_{\text{pre}} \in \mathbb{R}^{d \times n}$ to obtain $W$, whereas PEFT methods freeze $W_{\text{pre}}$ and introduce only a small set of trainable parameters. For LoRA (Hu et al., 2022), the update is parameterized by a low-rank decomposition:

$$h = W^\top x = (W_{\text{pre}} + AB)^\top x, \quad \text{s.t. } \operatorname{rank}(AB) = r, \tag{1}$$

where $A \in \mathbb{R}^{d \times r}$ and $B \in \mathbb{R}^{r \times n}$ are trainable matrices. Following standard practice, $A$ is initialized with Kaiming initialization (He et al., 2015) and $B$ with zeros, so training begins from $W_{\text{pre}}$.

For OFT (Liu et al., 2021; Qiu et al., 2023), the update is parameterized by an orthogonal matrix $R$, which fine-tunes $W_{\text{pre}}$ in the full parameter space, i.e., $W_{\text{fs-tuned}} = R W_{\text{pre}}$. The forward pass is given by:

$$h = W_{\text{fs-tuned}}^\top x = (R W_{\text{pre}})^\top x, \quad \text{s.t. } R^\top R = R R^\top = I_d, \tag{2}$$

where $R \in \mathbb{R}^{d \times d}$ is initialized as the identity matrix so that training begins from $W_{\text{pre}}$. By construction, orthogonal transformations in the full parameter space preserve both angles and norms, thereby maintaining the geometric structure of $W_{\text{pre}}$.

To reduce parameter overhead, block-diagonal OFT (Qiu et al., 2023) constrains $R$ to a block-diagonal form $R = \operatorname{diag}(R_1, \cdots, R_i, \cdots, R_{d/r})$, where each $R_i \in O(d/r)$. Although efficient, this structure may introduce undesirable inductive bias. BOFT (Liu et al., 2024) and qGOFT (Ma et al., 2024) mitigate this by factorizing $R$ into sparse matrices, $R = \prod_{m=1}^{\log d} \tilde{R}_m$, with each $\tilde{R}_m \in \mathbb{R}^{d \times d}$ sparse. Assuming $d$ is a power of two, $\log d$ is integral, ensuring a valid factorization. This construction restores the expressiveness of dense rotations with reduced parameters.

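
The two parameterizations in Eqs. 1 and 2 can be compared numerically. The following is a minimal NumPy sketch, not the authors' implementation: the toy sizes, the helper names (`lora_forward`, `oft_forward`, `gram`), and the Gaussian stand-in for Kaiming initialization are all assumptions made for illustration. It checks that both methods start from $W_{\text{pre}}$, and that a full-space orthogonal $R$ leaves the column Gram matrix (pairwise angles and norms) unchanged while a generic low-rank update does not.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 16, 12, 4          # toy sizes, chosen only for illustration

W_pre = rng.standard_normal((d, n))
x = rng.standard_normal(d)   # input vector; h = W^T x has n entries


def lora_forward(W_pre, A, B, x):
    """Eq. 1: h = (W_pre + A B)^T x, with A (d x r) and B (r x n)."""
    return (W_pre + A @ B).T @ x


def oft_forward(W_pre, R, x):
    """Eq. 2: h = (R W_pre)^T x, with R (d x d) orthogonal."""
    return (R @ W_pre).T @ x


def gram(W):
    """Column Gram matrix: encodes pairwise angles and norms of the columns."""
    return W.T @ W


# Initialization: B = 0 for LoRA and R = I for OFT both reproduce W_pre.
# A0 is a Gaussian stand-in for Kaiming initialization (assumption).
A0 = rng.standard_normal((d, r)) * np.sqrt(2.0 / d)
B0 = np.zeros((r, n))
h_pre = W_pre.T @ x
assert np.allclose(lora_forward(W_pre, A0, B0, x), h_pre)
assert np.allclose(oft_forward(W_pre, np.eye(d), x), h_pre)

# A random orthogonal R (from a QR factorization) keeps the column geometry.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))
oft_gram_error = np.linalg.norm(gram(R @ W_pre) - gram(W_pre))

# A generic non-zero low-rank update changes the column geometry.
B1 = rng.standard_normal((r, n))
lora_gram_error = np.linalg.norm(gram(W_pre + A0 @ B1) - gram(W_pre))

print(f"OFT Gram error:  {oft_gram_error:.2e}")
print(f"LoRA Gram error: {lora_gram_error:.2e}")
```

The Gram matrix is used here because $(RW)^\top (RW) = W^\top W$ for any orthogonal $R$, which is exactly the angle-and-norm preservation property described above.
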

## 4 METHODOLOGY

As discussed in Section 1, existing OFT variants such as BOFT and qGOFT still incur substantial computational and memory overhead. Prior studies (Li et al., 2018; Aghajanyan et al., 2021; Hu et al., 2022) further suggest that both pre-trained models and their task-specific adaptations lie in a low-rank intrinsic subspace. Motivated by this insight, we propose Orthogonal Fine-Tuning with Principal Subspace adaptation (PSOFT), which confines orthogonal transformations to the low-rank principal subspace of $W_{\text{pre}}$. The complete algorithm is given in Appendix A, and the remainder of this section details its design.

### 4.1 DIMENSION-COMPATIBLE ORTHOGONAL TRANSFORMS

Realizing orthogonal fine-tuning in the subspace requires a projection of high-dimensional weights onto a low-dimensional subspace, since directly applying the orthogonal matrix $R \in \mathbb{R}^{r \times r}$ to $W_{\text{pre}} \in \mathbb{R}^{d \times n}$ is infeasible due to dimensional incompatibility. To construct this projection, we perform Singular Value Decomposition (SVD), $W_{\text{pre}} = U \Sigma V^\top$, and decompose it into $W_{\text{pri}}$ and $W_{\text{res}}$, such that $W_{\text{pre}} = W_{\text{pri}} + W_{\text{res}}$. Here, the subscript “pri” denotes the principal component reconstructed from the top-$r$ singular values and vectors, while “res” denotes the residual component. The principal component $W_{\text{pri}}$ is then used to derive symmetric low-rank matrices $A$ and $B$ as:

$$W_{\text{pri}} = \underbrace{U_{[:,:r]} \sqrt{\Sigma_{[:r,:r]}}}_{A \in \mathbb{R}^{d \times r}} \; \underbrace{\sqrt{\Sigma_{[:r,:r]}} \, V_{[:,:r]}^\top}_{B \in \mathbb{R}^{r \times n}} \in \mathbb{R}^{d \times n} \quad \text{(Symmetric)}, \tag{3}$$

where $A$ projects weights into the $r$-dimensional principal subspace, while $B$ reconstructs them back. The residual component $W_{\text{res}}$ is then obtained from the remaining singular values and vectors:

$$W_{\text{res}} = W_{\text{pre}} - W_{\text{pri}} = U_{[:,r:]} \Sigma_{[r:,r:]} V_{[:,r:]}^\top \in \mathbb{R}^{d \times n}. \tag{4}$$

Building on this, we regard $W_{\text{pri}} = AB$ as representing the initial principal subspace of $W_{\text{pre}}$. This subspace enables dimension-compatible orthogonal transformations, yielding $W_{\text{ps-tuned}} = ARB$, where the subscript “ps-tuned” denotes the fine-tuned weights in the principal subspace for PSOFT.

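
The decomposition in Eqs. 3 and 4 can be sketched in a few lines of NumPy. This is an illustrative sketch under assumed toy dimensions, and the function name `principal_split_symmetric` is hypothetical. It verifies that the split is exact and that an $r \times r$ matrix fits between $A$ and $B$, which is the dimension-compatibility point made above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 16, 12, 4          # toy sizes, chosen only for illustration

W_pre = rng.standard_normal((d, n))


def principal_split_symmetric(W_pre, r):
    """Split W_pre into W_pri = A B (top-r part) and W_res (Eqs. 3 and 4)."""
    U, S, Vt = np.linalg.svd(W_pre, full_matrices=False)
    sqrt_S = np.sqrt(S[:r])
    A = U[:, :r] * sqrt_S            # d x r: U_r sqrt(Sigma_r)
    B = sqrt_S[:, None] * Vt[:r, :]  # r x n: sqrt(Sigma_r) V_r^T
    W_res = (U[:, r:] * S[r:]) @ Vt[r:, :]   # remaining singular triplets
    return A, B, W_res


A, B, W_res = principal_split_symmetric(W_pre, r)

# The split is exact: W_pre = A B + W_res.
assert np.allclose(A @ B + W_res, W_pre)

# An r x r matrix now fits between A and B, so W_ps-tuned = A R B is d x n.
R = np.eye(r)
W_ps_tuned = A @ R @ B
assert W_ps_tuned.shape == W_pre.shape
assert np.allclose(W_ps_tuned + W_res, W_pre)   # R = I reproduces W_pre
```

In this symmetric split both factors carry $\sqrt{\Sigma_{[:r,:r]}}$, so $A^\top A = B B^\top = \Sigma_{[:r,:r]}$ rather than the identity.
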

> [Figure 2 diagram. Left panel, “Sparse Orthogonal Fine-Tuning (OFT variants)”: orthogonal weights $R$ ($d \times d$; BOFT, GOFT, ...) are applied to the pre-trained weights $W_{\text{pre}}$ ($d \times n$), giving $W_{\text{fs-tuned}} = R \cdot W_{\text{pre}}$ and preserving angles and norms in the full space, at higher computational cost and with a larger GPU memory footprint. Right panel, “Orthogonal Fine-Tuning in the Principal Subspace (PSOFT)”: $W_{\text{pre}}$ is decomposed into residual weights $W_{\text{res}}$ and principal weights $W_{\text{pri}} = A \cdot B = A' \cdot B'$, with $W_{\text{ps-tuned}} = A' \cdot \operatorname{diag}(\alpha) \cdot R \cdot \operatorname{diag}(\beta) \cdot B'$, preserving angles and norms in the principal subspace, at lower computational cost and with a smaller GPU memory footprint.]

Figure 2: Our proposed method: PSOFT. The left panel illustrates the principles of OFT variants. On the right, PSOFT preserves the angles and norms of $W_{\text{pri}}$ (blue) in the fine-tuned $W_{\text{ps-tuned}}$ (orange), while allowing adjustable angles and scalable norms in the sector.

Unlike LoRA (Hu et al., 2022) and PiSSA (Meng et al., 2024), which train both $A$ and $B$, PSOFT freezes them and fine-tunes only the orthogonal matrix $R$. LoRA produces updates $\Delta W = AB$ that span the low-rank manifold $\{\Delta W : \operatorname{rank}(\Delta W) \le r\}$ of dimension $r(d + n - r)$. In contrast, PSOFT generates updates $\Delta W = A(R - I)B$ parameterized solely by an orthogonal matrix $R \in O(r)$, where $O(r)$ denotes the $r(r-1)/2$-dimensional orthogonal group. Because the variability of $\Delta W$ arises only through $R$, all updates remain confined to the fixed row and column subspaces defined by $A$ and $B$. Consequently, LoRA and PSOFT operate on fundamentally different geometric families of updates (low-rank vs. orthogonal), and their expressiveness is therefore not directly comparable.

The same structural distinction also determines different feasible ranks under an equal trainable-parameter budget $M$. LoRA trains two matrices, giving $M = (d + n)\, r_{\text{LoRA}}$ and thus $r_{\text{LoRA}} = M / (d + n)$, whereas PSOFT trains only an orthogonal matrix, yielding $M = r_{\text{PSOFT}}^2$ and hence $r_{\text{PSOFT}} = \sqrt{M}$. Since typically $\sqrt{M} \ll (d + n)$, we obtain $r_{\text{PSOFT}} \gg r_{\text{LoRA}}$, which explains why PSOFT empirically operates with much larger ranks under the same parameter budget.

### 4.2 GUARANTEED GEOMETRY PRESERVATION IN THE PRINCIPAL SUBSPACE

Orthogonal transformations within the constructed principal subspace in Section 4.1 merely ensure dimensional compatibility but do not strictly preserve subspace geometry. In particular, applying a low-dimensional orthogonal matrix $R$ to the subspace spanned by symmetric $A$ and $B$ in Eq. 3 may distort the pairwise angles and norms among the column vectors of $W_{\text{pri}}$. To address this issue, we analyze the conditions under which orthogonal fine-tuning preserves the geometry of the principal subspace, and present an informal Theorem 4.1, with the formal theorem and proof in Appendix B.

Theorem 4.1 (Informal: Angle and norm preservation in the principal subspace). Let $W_{\text{pri}} = AB$ denote the principal weights and $W_{\text{ps-tuned}} = ARB$ denote the fine-tuned weights. For $W_{\text{ps-tuned}}$ to preserve (i) pairwise angles between columns, and (ii) column norms of $W_{\text{pri}}$, the following condition must hold:

$$R^\top A^\top A R = A^\top A. \tag{5}$$

We provide an intuitive explanation of Theorem 4.1. The geometry of the principal subspace is determined by the relative angles and lengths of its column vectors, which are encoded in the Gram matrix $G = A^\top A$. Any $R$ satisfying $R^\top G R = G$ can be viewed as a symmetry of this geometry, similar to a rotation or reflection. In other words, if we first apply $R$ to the columns of $B$ and then project them using $A$, their angles and lengths in the high-dimensional space remain unchanged. In practice, normalizing $A$ so that $A^\top A = I_r$ simplifies the condition, in which case $R$ reduces to a standard orthogonal matrix. Accordingly, Eq. 3 is modified in PSOFT as:

$$W_{\text{pri}} = \underbrace{U_{[:,:r]}}_{A' \in \mathbb{R}^{d \times r}} \; \underbrace{\Sigma_{[:r,:r]} V_{[:,:r]}^\top}_{B' \in \mathbb{R}^{r \times n}} \in \mathbb{R}^{d \times n} \quad \text{(Asymmetric)}, \tag{6}$$

where asymmetric $A'$ and $B'$ are derived from the top-$r$ principal components of the SVD. The residual $W_{\text{res}}$ remains as in Eq. 4, and the forward computation becomes:

$$h = (W_{\text{ps-tuned}} + W_{\text{res}})^\top x = (A' R B' + W_{\text{res}})^\top x, \tag{7}$$

where $A'$, $B'$, and $W_{\text{res}}$ are frozen, and only $R \in \mathbb{R}^{r \times r}$ is trainable, initialized as the identity matrix. To satisfy Eq. 5 during training, it is essential to maintain the orthogonality of $R$. Enforcing orthogonality of $R$ directly (e.g., via Gram-Schmidt orthogonalization) is computationally expensive. To reduce this cost, following prior studies (Qiu et al., 2023; 2025), we adopt the Cayley parameterization (Cayley, 1894) to enforce the strict orthogonality of $R$, where $R = (I - Q)(I + Q)^{-1}$ and $Q = -Q^\top$ is a skew-symmetric matrix. Further details on the Cayley parameterization are provided in Appendix C.

### 4.3 EFFICIENT RELAXATIONS OF ORTHOGONALITY

Eqs. 6 and 7 guarantee geometry preservation in the principal subspace, but strict orthogonality constraints may hinder adaptation to task-specific drifts, leading to suboptimal performance. Empirical evidence shows that moderate relaxation improves results (Ma et al., 2024). Yet existing methods sacrifice efficiency: qGOFT relaxes constraints more flexibly but requires four times the parameters of GOFT (Ma et al., 2024), while BOFT relaxes them through additional scaling vectors on the output dimension, whose size grows linearly with model scale (Liu et al., 2024). To overcome these issues, we propose efficient relaxations of PSOFT that enhance adaptability with minimal overhead. Specifically, we introduce two tunable vectors that modulate the input and output norms around the orthogonal matrix, modifying Eq. 7 to yield the following forward computation:

$$h = (A' \operatorname{diag}(\alpha)\, R \operatorname{diag}(\beta)\, B' + W_{\text{res}})^\top x \quad \text{(PSOFT)}, \tag{8}$$

where $A'$, $B'$, and $W_{\text{res}}$ remain fixed, while only $R$ and the tunable vectors $\alpha$ and $\beta$ are trained. Both vectors are initialized as all-one vectors to ensure strict orthogonality at the start of training. As illustrated in Figure 2, PSOFT relaxes this constraint during training, enabling adjustable angles and scalable norms that adapt to task objectives. As these two additional vectors are inserted within the subspace, the overhead is limited to $2r$ parameters ($2r \ll n$, where $n$ is the output dimension), enhancing adaptability with minimal cost and without significantly affecting the geometric structure. To avoid excessive deviation from orthogonality, an explicit constraint can be imposed: $\lVert C^\top C - I \rVert_F \le \epsilon$, where $C = \operatorname{diag}(\alpha)\, R \operatorname{diag}(\beta)$. Deviation arises when either $\operatorname{diag}(\alpha)$ or $\operatorname{diag}(\beta)$ deviates from a scalar multiple of the identity. In the special case where $\operatorname{diag}(\alpha) = \lambda_1 I$ and $\operatorname{diag}(\beta) = \lambda_2 I$, angular relationships are preserved, and magnitudes are uniformly scaled.

In summary, PSOFT performs orthogonal fine-tuning in the low-rank principal subspace, enabling dimension-compatible transformations with theoretical guarantees on subspace geometry, while relaxing strict orthogonality at negligible cost to enhance adaptability. It requires only $r(r-1)/2 + 2r$ trainable parameters by combining the Cayley parameterization with two efficient tunable vectors. Moreover, it reduces both the number and size of additional matrices (from $\min(d, n)$ to $r$, with $r \ll \min(d, n)$), thereby yielding substantially lower activation memory than other OFT variants under the same batch size and sequence length. Detailed comparisons of parameter counts and activation memory analysis across different PEFT methods are provided in Appendices D and E.

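
The pieces of Section 4 (the asymmetric split of Eq. 6, the Cayley parameterization, and the relaxed forward weight of Eq. 8) can be put together in a small NumPy sketch. It is a sketch under stated assumptions rather than the released implementation: the toy sizes, the helper names (`cayley`, `psoft_weight`, `gram`), the explicit matrix inverse, and the random perturbations of $\alpha$ and $\beta$ are all illustrative choices. It checks the property stated in Theorem 4.1 for $\alpha = \beta = \mathbf{1}$ and measures the deviation $\lVert C^\top C - I \rVert_F$ once the vectors are relaxed.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, r = 16, 12, 4          # toy sizes, chosen only for illustration

W_pre = rng.standard_normal((d, n))
U, S, Vt = np.linalg.svd(W_pre, full_matrices=False)

# Eq. 6 (asymmetric split): A' has orthonormal columns, B' carries Sigma.
A_p = U[:, :r]                       # d x r, A'^T A' = I_r
B_p = S[:r, None] * Vt[:r, :]        # r x n
W_pri = A_p @ B_p
W_res = W_pre - W_pri                # Eq. 4


def cayley(theta, r):
    """Build R = (I - Q)(I + Q)^{-1} from r(r-1)/2 free parameters."""
    Q = np.zeros((r, r))
    Q[np.triu_indices(r, k=1)] = theta
    Q = Q - Q.T                      # skew-symmetric: Q = -Q^T
    I = np.eye(r)
    return (I - Q) @ np.linalg.inv(I + Q)


def psoft_weight(theta, alpha, beta):
    """Eq. 8 weight: A' diag(alpha) R diag(beta) B' + W_res. Also returns C."""
    R = cayley(theta, r)
    C = alpha[:, None] * R * beta[None, :]   # diag(alpha) R diag(beta)
    return A_p @ C @ B_p + W_res, C


def gram(W):
    """Column Gram matrix: pairwise inner products of the columns."""
    return W.T @ W


n_theta = r * (r - 1) // 2
ones = np.ones(r)

# Initialization: theta = 0 gives R = I, so training starts from W_pre.
W0, _ = psoft_weight(np.zeros(n_theta), ones, ones)
assert np.allclose(W0, W_pre)

# Strictly orthogonal case (alpha = beta = 1): the Gram matrix of the
# principal part is unchanged, as Theorem 4.1 states.
theta = 0.3 * rng.standard_normal(n_theta)
W1, C1 = psoft_weight(theta, ones, ones)
strict_gram_error = np.linalg.norm(gram(W1 - W_res) - gram(W_pri))

# Relaxed case: non-uniform alpha and beta move C away from orthogonality.
alpha = 1.0 + 0.1 * rng.standard_normal(r)
beta = 1.0 + 0.1 * rng.standard_normal(r)
_, C2 = psoft_weight(theta, alpha, beta)
strict_dev = np.linalg.norm(C1.T @ C1 - np.eye(r))    # Frobenius norm
relaxed_dev = np.linalg.norm(C2.T @ C2 - np.eye(r))

trainable_params = n_theta + 2 * r   # r(r-1)/2 + 2r per adapted matrix
```

For the toy rank $r = 4$ the count is $6 + 8 = 14$ trainable parameters per adapted matrix. With $r = 46$, the same formula gives $46 \cdot 45 / 2 + 2 \cdot 46 = 1127$.
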

## 5 EXPERIMENTS

To evaluate PSOFT, we conduct experiments on 35 tasks spanning language and vision domains, using encoder-only models (DeBERTaV3-base (He et al., 2021), ViT-B/16 (Dosovitskiy et al., 2021)), and decoder-only models (LLaMA-3.2-3B (Meta AI, 2024), LLaMA-3.1-8B (Grattafiori et al., 2024)). These models are fine-tuned on downstream tasks, covering natural language understanding (Wang et al., 2019), visual classification (Zhai et al., 2019), mathematical QA (Yu et al., 2024), and commonsense reasoning (Hu et al., 2023). We evaluate key metrics such as parameter counts, peak memory usage, and accuracy in the main experiments, and assess training speed separately in the efficiency analysis. Following OFTv2 (Qiu et al., 2025), we implement the Cayley parameterization by approximating $(I + Q)^{-1}$ with a truncated Neumann series, $\sum_{k=0}^{K} (-Q)^k$, using $K = 5$ terms in practice. All experiments are performed on a single GPU with FP32, using an NVIDIA RTX 4090 (24 GB) for encoder-only models and an NVIDIA H100-SXM (80 GB) for decoder-only models.

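
The truncated Neumann approximation of the Cayley transform can be illustrated as follows. This is a minimal NumPy sketch, not the OFTv2 or PSOFT code. The helper names are hypothetical, $K$ is read as the upper summation index in $\sum_{k=0}^{K} (-Q)^k$ (an assumption about the notation), and the small scale of $Q$ is chosen so that the series converges, which requires the spectral radius of $Q$ to be below one.

```python
import numpy as np


def skew(theta, r):
    """Skew-symmetric Q built from r(r-1)/2 free parameters."""
    Q = np.zeros((r, r))
    Q[np.triu_indices(r, k=1)] = theta
    return Q - Q.T


def cayley_exact(Q):
    """R = (I - Q)(I + Q)^{-1} with an explicit inverse."""
    I = np.eye(Q.shape[0])
    return (I - Q) @ np.linalg.inv(I + Q)


def cayley_neumann(Q, K=5):
    """Approximate (I + Q)^{-1} by sum_{k=0}^{K} (-Q)^k, then form R."""
    I = np.eye(Q.shape[0])
    inv_approx = np.zeros_like(Q)
    term = I.copy()                  # holds (-Q)^k, starting at k = 0
    for _ in range(K + 1):
        inv_approx += term
        term = term @ (-Q)
    return (I - Q) @ inv_approx


rng = np.random.default_rng(0)
r = 8
# Small entries keep the norm of Q well below one (assumed regime).
Q = skew(0.02 * rng.standard_normal(r * (r - 1) // 2), r)

R_exact = cayley_exact(Q)
R_approx = cayley_neumann(Q, K=5)

approx_error = np.linalg.norm(R_approx - R_exact)
orth_error = np.linalg.norm(R_approx.T @ R_approx - np.eye(r))
print(f"approximation error: {approx_error:.2e}")
print(f"orthogonality error: {orth_error:.2e}")
```

The approximation replaces a matrix inverse with a few matrix products, so the resulting $R$ is only approximately orthogonal, and the error grows as the norm of $Q$ approaches one.
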

Baselines. We employ state-of-the-art OFT variants with other advanced PEFT methods as baselines:

- FFT (Howard & Ruder, 2018) updates all model weights during fine-tuning.
- GOFTv2 & qGOFTv2 (Ma et al., 2024) replace full-space OFT with Givens rotations. The latest implementation uses Hadamard products instead of sparse multiplication.
- BOFT (Liu et al., 2024) substitutes full-space OFT with butterfly factorization.
- OFTv2 (Qiu et al., 2023; 2025) employs a block-diagonal structure for OFT, with the latest version adopting an input-centric computation and Cayley-Neumann parameterization.
- LoRA (Hu et al., 2022) freezes pre-trained weights and adjusts only two low-rank matrices.
- PiSSA (Meng et al., 2024) improves LoRA initialization to fine-tune principal weights.
- DoRA (yang Liu et al., 2024) decomposes low-rank adaptation into direction and magnitude.
- LoRA-XS (Bałazy et al., 2024) injects and tunes a single square matrix between LoRA’s matrices.

Table 2: Experimental results of fine-tuned DeBERTaV3-base. Results are averaged over 5 random seeds. Memory (GB) denotes peak memory with sequence length 64.

| Methods | #Params | Memory (GB) | CoLA | STS-B | RTE | MRPC | SST2 | QNLI | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| FFT | 184M | 5.9 | 67.56 | 91.46 | 82.88 | 90.69 | 94.13 | 93.37 | 86.68 |
| GOFTv2 | 0.08M | 18.5 | 65.45 | N/A (OOM) | N/A (OOM) | N/A (OOM) | N/A (OOM) | N/A (OOM) | N/A (OOM) |
| qGOFTv2 | 0.33M | 18.5 | 68.03 | N/A (OOM) | N/A (OOM) | N/A (OOM) | N/A (OOM) | N/A (OOM) | N/A (OOM) |
| BOFT (b=8, m=2) | 1.41M | 6.3 | 68.85 | 91.09 | 83.60 | 88.40 | 95.28 | 93.78 | 86.83 |
| OFTv2 (b=32) | 1.29M | 4.5 | 66.79 | 91.22 | 84.03 | 89.61 | 93.72 | 92.64 | 86.34 |
| LoRA (r=8) | 1.33M | 4.5 | 67.98 | 91.60 | 84.87 | 90.20 | 95.28 | 93.89 | 87.30 |
| PiSSA (r=8) | 1.33M | 4.5 | 66.50 | 91.40 | 83.77 | 89.90 | 93.17 | 92.72 | 86.24 |
| DoRA (r=8) | 1.41M | 5.8 | 67.06 | 91.60 | 87.19 | 90.49 | 95.23 | 94.09 | 87.61 |
| LoRA-XS (r=136) | 1.33M | 4.2 | 64.67 | 91.48 | 84.17 | 91.27 | 93.85 | 93.14 | 86.43 |
| PSOFT (r=46) | 0.08M | 4.1 | 70.42 | 91.56 | 86.74 | 90.49 | 95.55 | 93.47 | 88.04 |

Encoder-only Models. We evaluate PSOFT by fine-tuning DeBERTaV3 (He et al., 2021) on several datasets from the GLUE benchmark (Wang et al., 2019). Following prior work (Wu et al., 2024a;b; Bini et al., 2025), we split the original validation set into new validation/test sets with a fixed seed, and report test accuracy from the best validation checkpoint to ensure rigorous evaluation. Details are in Appendix F. As shown in Table 2, GOFTv2 and qGOFTv2 have non-tunable parameter counts and often encounter OOM failures as the sequence length increases. PSOFT improves parameter and memory efficiency without compromising performance. Although GOFT and PSOFT have the same parameter counts, PSOFT reduces memory usage by about 80% and avoids OOM issues. It further achieves up to an 18× improvement in parameter efficiency over BOFT, OFTv2, and LoRA variants, attaining the best average performance across all baselines with the lowest memory footprint. Compared with LoRA variants that do not rely on weight decomposition, DoRA introduces additional memory overhead. For LoRA-XS, the update is constrained by the initialization of its low-rank matrices, which limits its expressiveness and consequently leads to degraded performance. These results highlight PSOFT’s ability to achieve both efficiency and performance.

We also evaluate PSOFT by fine-tuning ViT-B/16 (Dosovitskiy et al., 2021) on the VTAB-1K benchmark (Zhai et al., 2019). Further details are provided in Appendix G. As shown in Table 3, PSOFT extends its efficiency-performance advantages on the small-scale model from language tasks to vision tasks. Beyond avoiding the heavy memory demands of GOFTv2 and qGOFTv2, PSOFT consistently reduces the memory overhead of BOFT and OFTv2. Compared to LoRA and its variants, it achieves the best average accuracy with about 94% fewer parameters and the lowest peak memory footprint. Interestingly, we also observe that parameter counts and memory overheads of different PEFT methods do not necessarily correlate. For example, the weight decomposition in DoRA introduces substantial memory overhead on the ViT-B/16 model compared with other LoRA variants, even when the number of trainable parameters is similar. This suggests that PEFT design should consider multi-dimensional efficiency beyond parameter efficiency alone.

Table 3: Experimental results of fine-tuned ViT-B/16 on the VTAB-1K benchmark. Reported values (top-1 accuracy %) are the mean of 5 runs with different random seeds. Datasets are grouped as Natural (Cifar100 to Sun397), Specialized (Camelyon to Retinopathy), and Structured (Clevr-Count to sNORB-Ele).

| Methods | #Params | Mem (GB) | Cifar100 | Caltech101 | DTD | Flower102 | Pets | SVHN | Sun397 | Camelyon | EuroSAT | Resisc45 | Retinopathy | Clevr-Count | Clevr-Dist | DMLab | KITTI-Dist | dSpr-Loc | dSpr-Ori | sNORB-Azim | sNORB-Ele | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FFT | 85.9M | 8.2 | 70.7 | 89.3 | 69.5 | 99.0 | 90.4 | 81.7 | 54.9 | 85.4 | 93.6 | 83.8 | 74.5 | 58.3 | 51.5 | 43.2 | 75.0 | 73.1 | 48.7 | 16.4 | 30.0 | 67.8 |
| GOFTv2 | 0.08M | OOM | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| qGOFTv2 | 0.33M | OOM | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| BOFT (b=8, m=2) | 1.41M | 10.9 | 70.6 | 88.2 | 69.8 | 99.0 | 91.4 | 77.4 | 55.1 | 85.1 | 93.6 | 82.3 | 74.9 | 61.8 | 50.4 | 42.9 | 76.1 | 73.7 | 48.8 | 15.7 | 30.8 | 70.9 |
| OFTv2 (b=32) | 1.29M | 7.7 | 68.5 | 88.9 | 67.5 | 98.4 | 89.5 | 86.9 | 53.6 | 86.0 | 94.1 | 84.2 | 74.6 | 58.7 | 56.4 | 46.7 | 78.5 | 81.1 | 48.1 | 17.3 | 32.5 | 72.1 |
| LoRA (r=8) | 1.33M | 9.9 | 71.4 | 88.4 | 70.1 | 99.0 | 91.4 | 76.6 | 55.7 | 85.9 | 94.2 | 83.3 | 74.1 | 72.0 | 54.3 | 43.0 | 76.6 | 74.8 | 48.6 | 16.4 | 31.8 | 71.8 |
| PiSSA (r=8) | 1.33M | 9.9 | 70.7 | 88.7 | 68.9 | 99.2 | 91.0 | 81.9 | 53.3 | 82.6 | 93.4 | 83.0 | 74.0 | 71.0 | 60.2 | 44.0 | 77.1 | 81.9 | 51.8 | 18.1 | 33.1 | 72.3 |
| DoRA (r=8) | 1.41M | 17.8 | 70.7 | 89.0 | 69.8 | 98.9 | 91.0 | 81.7 | 55.5 | 85.7 | 94.2 | 83.5 | 74.8 | 67.3 | 54.2 | 45.1 | 77.4 | 82.0 | 48.5 | 16.9 | 31.5 | 72.3 |
| LoRA-XS (r=136) | 1.33M | 6.6 | 68.5 | 89.4 | 68.4 | 98.7 | 90.9 | 84.5 | 54.1 | 84.0 | 94.3 | 80.8 | 73.6 | 60.0 | 57.7 | 45.8 | 79.6 | 80.6 | 48.1 | 17.4 | 30.8 | 71.6 |
| PSOFT (r=46) | 0.08M | 6.2 | 71.9 | 89.6 | 70.3 | 99.1 | 91.8 | 86.9 | 55.9 | 84.6 | 94.2 | 82.4 | 75.2 | 71.2 | 59.9 | 45.7 | 79.6 | 80.9 | 52.9 | 20.0 | 32.9 | 73.4 |

Table 4: Experimental results of fine-tuned LLaMA-3.2-3B on GSM-8K and MATH.
|
||
|
||
| Methods | #Params | Memory (GB) | GSM-8K | MATH |
|---|---|---|---|---|
| FFT | 3.21B | 69.0 | 63.00 | 16.84 |
| GOFTv2 | 0.75M | OOM | N/A | N/A |
| qGOFTv2 | 2.98M | OOM | N/A | N/A |
| BOFT b=2, m=2 | 3.76M | OOM | N/A | N/A |
| OFTv2 b=32 | 11.6M | 35.2 | 61.03 | 15.70 |
| LoRA r=8 | 12.2M | 32.2 | 60.80 | 15.76 |
| PiSSA r=8 | 12.2M | 32.2 | 61.26 | 14.96 |
| DoRA r=8 | 12.9M | 43.4 | 62.62 | 15.48 |
| LoRA-XS r=248 | 12.1M | 34.4 | 61.56 | 15.02 |
| PSOFT r=352 | 12.2M | 36.2 | 63.08 | 15.98 |
|
||
|
||
Decoder-only Models. Following prior work (Lingam et al., 2024; Liu et al., 2024), we fine-tune the LLaMA-3.2-3B (Meta AI, 2024) model on MetaMathQA-40K (Yu et al., 2024) and evaluate on GSM-8K (Cobbe et al., 2021) and MATH (Hendrycks et al., 2021). For large-scale models and complex tasks, where performance is more sensitive to parameter counts, we align trainable parameters by setting the LoRA rank to 8 to ensure a fair comparison. PEFT modules are applied to all linear layers, with additional hyperparameter details in Appendix H. As shown in Table 4, as models scale up, BOFT suffers from OOM failures like GOFTv2 and qGOFTv2, whereas PSOFT avoids this issue.
|
||
|
||
PSOFT reduces the peak memory footprint of OFT variants to a level comparable with LoRA-like methods, while delivering superior performance under similar parameter counts. Against advanced PEFT methods, it outperforms LoRA (+2.28%) on GSM-8K and PiSSA (+1.02%) on MATH, while maintaining memory usage comparable to LoRA-like baselines. Compared to the sparsity-based OFTv2, PSOFT achieves higher performance at comparable cost. When scaling to large models and complex reasoning tasks, PSOFT adapts by employing a higher rank r to ensure sufficient expressiveness, yet still maintains efficiency and clear memory advantages over BOFT, GOFTv2, qGOFTv2, and DoRA. Although increasing the rank may enhance the expressiveness of LoRA-XS, its performance remains fundamentally constrained by the initialization: the inserted square matrix is trainable only as a linear combination within the original low-rank subspace. Even under restricted module insertion and tighter parameter budgets, PSOFT still reduces memory overhead relative to qGOFTv2 and BOFT (Table 13 in Appendix H), demonstrating strong scalability to large models and complex mathematical tasks.

Following prior work (Hu et al., 2023; Lingam et al., 2024; yang Liu et al., 2024), we further fine-tune LLaMA-3.1-8B (Grattafiori et al., 2024) on the Commonsense-15K dataset (Hu et al., 2023) and evaluate it on eight commonsense reasoning benchmarks. PEFT modules are applied to the Q, K, V, U, D linear layers. Appendix I details the hyperparameter settings. As shown in Table 5, PSOFT mitigates the frequent OOM issues of OFT on larger models while achieving the best average performance. In practice, GOFTv2, qGOFTv2, and BOFT suffer from OOM failures even without inserting modules into all linear layers, severely limiting their use in large-scale fine-tuning, whereas PSOFT provides a more memory-friendly alternative. Under comparable costs, it surpasses OFTv2 by 1.77% in average accuracy, matches the memory efficiency of LoRA-like baselines while delivering higher accuracy, and reduces memory usage by about 7 GB relative to DoRA. As the model size increases, DoRA attains performance that is surpassed only by PSOFT, but its memory overhead becomes noticeably higher than that of other LoRA variants. PSOFT further remains effective under reduced parameter budgets and restricted module insertion (Table 15 in Appendix I), underscoring its practicality in balancing efficiency and performance across diverse settings.

Table 5: Experimental results of fine-tuned LLaMA-3.1-8B on commonsense reasoning benchmarks.

| Methods | #Params | Memory (GB) | BoolQ | PIQA | SIQA | HS | WG | ARC-e | ARC-c | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FFT | 8.03B | OOM | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| GOFTv2 | 0.98M | OOM | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| qGOFTv2 | 3.93M | OOM | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| BOFT b=2, m=2 | 4.72M | OOM | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A |
| OFTv2 b=32 | 14.3M | 55.5 | 70.83 | 84.44 | 73.34 | 90.63 | 74.11 | 90.87 | 80.12 | 81.80 | 80.77 |
| LoRA r=8 | 14.2M | 54.1 | 73.18 | 85.31 | 74.36 | 86.57 | 74.19 | 90.95 | 80.29 | 84.00 | 81.11 |
| PiSSA r=8 | 14.2M | 54.1 | 71.22 | 86.02 | 75.38 | 90.27 | 74.19 | 89.90 | 79.44 | 84.00 | 81.30 |
| DoRA r=8 | 14.9M | 65.6 | 73.09 | 85.96 | 75.08 | 90.48 | 75.53 | 90.74 | 81.40 | 84.40 | 82.09 |
| LoRA-XS r=298 | 14.2M | 56.2 | 72.35 | 86.51 | 75.18 | 91.73 | 74.98 | 90.74 | 79.52 | 84.00 | 81.88 |
| PSOFT r=424 | 14.5M | 58.4 | 72.17 | 86.51 | 75.79 | 91.28 | 75.61 | 91.46 | 81.48 | 86.00 | 82.54 |

Table 6: Effect of orthogonality of R on LLaMA-3.2-3B.
|
||
|
||
| Methods | #Params | GSM-8K | MATH |
|---|---|---|---|
| PiSSA+LoRA-XS r=248 (γ=0.0) | 12.1M | 61.26 | 14.72 |
| PiSSA+LoRA-XS r=248 (γ=0.01) | 12.1M | 61.26 | 14.80 |
| PiSSA+LoRA-XS r=248 (γ=0.1) | 12.1M | 59.89 | 14.90 |
| PiSSA+LoRA-XS r=248 (γ=1.0) | 12.1M | 59.36 | 14.44 |
| PSOFT r=248 (strict orthogonality) | 6.0M | 61.18 | 14.80 |
| PSOFT r=352 (strict orthogonality) | 12.1M | 62.77 | 15.74 |

[Figure 3 image not recoverable. Bar chart titled "PSOFT with tunable vectors"; y-axis: GSM-8K Accuracy (%), ticks from 50.0 to 52.0. Extracted bar values in plot order: 50.19, 50.72, 50.57, 51.63. The x-axis labels lost their symbols in extraction and read "none", "only", "only", "and".]

Figure 3: Effect of tunable vectors.
|
||
|
||
Ablation Studies. To study the effect of orthogonality of R, we follow AdaLoRA (Zhang et al., 2023) and add an orthogonality regularizer $\mathcal{L}_{\text{orth}} = \lVert R^\top R - I \rVert_F$ to the task loss $\mathcal{L}$, resulting in the objective $\mathcal{L} + \gamma \mathcal{L}_{\text{orth}}$ with weight γ. Setting γ = 0 recovers PiSSA+LoRA-XS with unconstrained R. As shown in Table 6, this regularization avoids Cayley inversion but demands careful tuning. Under equal rank, PSOFT with strict orthogonality matches the unconstrained variant with half the parameters, and achieves clear gains once parameter counts are aligned. Therefore, Cayley parametrization in PSOFT not only enforces orthogonality but also exploits its skew-symmetric structure to improve parameter efficiency.

To study the effect of tunable vectors α and β, we fine-tune LLaMA-3.2-3B with rank 64, inserting PSOFT into all linear layers and evaluating on GSM-8K and MATH. As shown in Figure 3, enabling both vectors achieves the best performance, while single-sided insertion provides smaller gains. This suggests that tuning only one side lacks sufficient capacity to capture task-specific variations.

Table 7: Effect of initialization.

| Methods | RTE | CoLA |
|---|---|---|
| $A_{\text{orth}} R_{\text{orth}} B$ | 85.92 | 70.63 |
| $A R_{\text{orth}} B_{\text{orth}}$ | 52.71 | 67.97 |
| $A R_{\text{orth}} B$ | 71.11 | 69.23 |

To study the effect of initialization, we compare three variants: $A_{\text{orth}} R_{\text{orth}} B$, $A R_{\text{orth}} B_{\text{orth}}$, and $A R_{\text{orth}} B$, where $A$ and $B$ follow PiSSA (Meng et al., 2024) and $A_{\text{orth}}$, $B_{\text{orth}}$ use orthogonal initialization with rank 64. As shown in Table 7, $A_{\text{orth}} R_{\text{orth}} B$ yields the best results, outperforming PiSSA without constraining A and B, whereas enforcing orthogonality on B reduces model expressiveness.
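The contrast drawn in the orthogonality ablation, a soft penalty on an unconstrained R versus a Cayley-parametrized R, can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the toy rank `r = 8`, and the Cayley convention `R = (I + Q)^{-1}(I - Q)` are assumptions made for illustration (the paper's exact form is given in its Appendix C).

```python
import numpy as np

def orth_penalty(R):
    """Orthogonality regularizer ||R^T R - I||_F (Frobenius norm)."""
    r = R.shape[0]
    return np.linalg.norm(R.T @ R - np.eye(r), ord="fro")

def cayley(theta, r):
    """Map r(r-1)/2 free parameters to an orthogonal r x r matrix.

    Assumed convention: R = (I + Q)^{-1} (I - Q), Q skew-symmetric.
    """
    Q = np.zeros((r, r))
    Q[np.triu_indices(r, k=1)] = theta   # fill the strict upper triangle
    Q = Q - Q.T                          # make Q skew-symmetric
    I = np.eye(r)
    return np.linalg.solve(I + Q, I - Q)

r = 8  # toy rank, chosen only for illustration
rng = np.random.default_rng(0)

# Unconstrained R (the gamma-regularized setting): r*r free parameters.
R_free = np.eye(r) + 0.1 * rng.standard_normal((r, r))

# Cayley-parametrized R (strict orthogonality): r(r-1)/2 free parameters.
theta = 0.1 * rng.standard_normal(r * (r - 1) // 2)
R_cayley = cayley(theta, r)

penalty_free = orth_penalty(R_free)      # nonzero, must be driven down by gamma
penalty_cayley = orth_penalty(R_cayley)  # zero up to floating-point error
n_free, n_cayley = r * r, theta.size     # parameter counts of the two options
```

Under these assumptions the Cayley variant satisfies the constraint by construction and uses roughly half as many free parameters, which matches the 12.1M versus 6.0M gap at equal rank in Table 6.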
|
||
|
||
Memory and Computational Efficiency. We evaluate memory usage across different batch sizes by fine-tuning ViT-B/16 on VTAB-1K with PEFT modules in all linear layers. As shown in Figure 4a, PSOFT consistently requires less memory than advanced OFT variants across batch sizes, maintaining a peak footprint below 4 GB even at batch size 32, which highlights its suitability for resource-constrained settings. Further detailed memory analysis and experiments are provided in Appendix M. We also evaluate the computational cost under the same experimental settings on a single H100 GPU as in Tables 4 and 5. As shown in Figure 4b, on LLaMA-3.2-3B, PSOFT (Q,K,V) trains in 57 minutes, yielding 3.5× and 2.1× speedups over GOFTv2/qGOFTv2 and BOFT, respectively, while its full configuration (Q,K,V,U,D,O,G) requires 1 hour 31 minutes and achieves a 1.3× speedup over DoRA. On LLaMA-3.1-8B, PSOFT (Q,V) completes training in 29 minutes with a 3.2× speedup over BOFT, and PSOFT (Q,K,V,U,D) finishes in 53 minutes, running 1.7× faster than DoRA. Compared with other PEFT methods, its computational efficiency falls between that of DoRA and LoRA.

[Figure 4 image not recoverable. Panel (a): peak GPU memory (GB) versus batch size (1, 4, 8, 16, 32, 64, 128) for BOFT (b=8, m=2), BOFT (b=4, m=4), GOFTv2, qGOFTv2, and PSOFT r=46 (ours), with OOM markers. Panel (b): relative training speed for LLaMA-3.2-3B (Q,K,V), LLaMA-3.2-3B (Q,K,V,U,D,O,G), LLaMA-3.1-8B (Q,V), and LLaMA-3.1-8B (Q,K,V,U,D), comparing qGOFTv2, GOFTv2, BOFT, DoRA, PSOFT, and LoRA, with OOM entries marked N/A.]

Figure 4: (a) Memory usage across batch sizes. (b) Training speed across different models.
|
||
|
||
## 6 DISCUSSION ON SCALING TO LARGER MODELS
|
||
|
||
Due to hardware resource constraints, our empirical evaluation is limited to models of up to 8B parameters. Nevertheless, we further discuss the potential limitations and stability considerations when extending PSOFT to larger-scale models. From a methodological perspective, PSOFT scales favorably as model size increases. Because the orthogonal transformation operates in an r-dimensional principal subspace rather than the full d-dimensional weight space, both computational and activation-memory costs grow with the controllable rank r instead of the expanding dimension d required by many PEFT methods (a detailed analysis is provided in Appendix E). As shown in Appendix J (Tables 17 and 18), memory usage and training time remain stable as r increases. The subspace-based update also avoids the long chains of full-dimensional multiplications used in GOFT and BOFT, which become increasingly expensive at larger scales. Moreover, the number of trainable parameters in PSOFT is decoupled from the hidden dimension, enabling fine-grained parameter control and preventing the minimum parameter budget from being tied to layer width. Collectively, these properties indicate that PSOFT can extend effectively to larger architectures while maintaining stable optimization behavior. However, when applying PSOFT to models larger than 8B, several practical factors may need to be considered. Large models often exhibit higher sensitivity to hyperparameters, including learning-rate settings for structured updates such as orthogonal transformations. While PSOFT does not rely on full-dimensional orthogonal matrices, stable training at very large scales may still require careful hyperparameter tuning. 
Moreover, although the activation-memory growth of PSOFT is slower than that of some OFT approaches, the activations of the underlying backbone (e.g., attention and feed-forward layers) can become the dominant source of memory usage at large scales, which may constrain the choice of batch size or sequence length. Finally, as shown in the main experiments and in the additional rank-sensitivity analyses in Appendix J, larger models tend to benefit from higher ranks to capture task-specific variations. Very small ranks may lead to underfitting on complex tasks, whereas larger ranks improve expressiveness but also increase the trainable parameter budget.
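The statement that PSOFT's trainable parameter count is decoupled from the hidden dimension can be checked with a short worked calculation. The sketch below rests on assumptions that are not stated in this section: a per-module PSOFT count of r(r-1)/2 (skew-symmetric Cayley generator) plus 2r (the two vectors), a per-module LoRA count of r(d_in + d_out), and the standard ViT-B/16 layout of 12 blocks with six linear layers each (four 768×768 attention projections and a 768↔3072 MLP), with the classification head excluded. Appendix D remains the authoritative source for the counts.

```python
def psoft_params(r):
    """Assumed per-module count: skew-symmetric generator plus two length-r vectors."""
    return r * (r - 1) // 2 + 2 * r

def lora_params(d_in, d_out, r):
    """Per-module LoRA count: two low-rank factors of shapes (d_in, r) and (r, d_out)."""
    return r * (d_in + d_out)

# Assumed ViT-B/16 layout: 12 blocks, each with Q, K, V, O (768x768)
# and two MLP projections (768x3072 and 3072x768).
layer_shapes = [(768, 768)] * 4 + [(768, 3072), (3072, 768)]
n_blocks = 12

# PSOFT depends only on the rank r, not on the layer shape.
psoft_total = n_blocks * sum(psoft_params(46) for _ in layer_shapes)

# LoRA grows with the layer dimensions.
lora_total = n_blocks * sum(lora_params(d_in, d_out, 8) for d_in, d_out in layer_shapes)

reduction_pct = 100 * (1 - psoft_total / lora_total)
print(psoft_total, lora_total, round(reduction_pct))
```

Under these assumptions the totals come to 81,144 and 1,327,104, in line with the 0.08M and 1.33M entries of Table 3 and the "about 94% fewer parameters" figure quoted in the experiments.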
|
||
|
||
## 7 CONCLUSION
|
||
|
||
In this work, we have proposed PSOFT, a novel PEFT framework that confines OFT to the principal subspace with theoretical guarantees, while enhancing practical adaptability through two tunable scaling vectors. Extensive experiments demonstrate that PSOFT introduces a low-rank perspective that resolves the tension between expressiveness and multi-dimensional efficiency in OFT, bridges the gap between orthogonal fine-tuning and low-rank adaptation within the broader PEFT landscape, and offers a solution with superior scalability and practicality for adapting future foundation models.
|
||
|
||
## REPRODUCIBILITY STATEMENT
|
||
|
||
We are committed to ensuring the reproducibility of our work and have taken the following steps. For the proposed method, we provide source code at https://github.com/fei407/PSOFT . For theoretical results, we include formal statements and complete mathematical proofs in Appendix B. For datasets and experimental settings, we offer detailed descriptions and full hyperparameter configurations in Appendices F, G, H, and I.
|
||
|
||
## ACKNOWLEDGMENTS
|
||
|
||
This work was supported in part by UK Research and Innovation (UKRI) Grant No. EP/X038866/1 and Horizon Europe Grant No. 101086159.
|
||
|
||
## REFERENCES
|
||
|
||
Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pp. 7319–7328, 2021.

Seyedarmin Azizi, Souvik Kundu, and Massoud Pedram. LaMDA: Large model fine-tuning via spectrally decomposed low-dimensional adaptation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 9635–9646, 2024.

Klaudia Bałazy, Mohammadreza Banaei, Karl Aberer, and Jacek Tabor. Lora-xs: Low-rank adaptation with extremely small number of parameters. arXiv preprint arXiv:2405.17604, 2024.

Massimo Bini, Leander Girrbach, and Zeynep Akata. Decoupling angles and strength in low-rank adaptation. In The Thirteenth International Conference on Learning Representations, 2025.

Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 7432–7439, 2020.

Arthur Cayley. The collected mathematical papers of Arthur Cayley, volume 7. University of Michigan Library, 1894.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2924–2936, 2019.

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In The Ninth International Conference on Learning Representations, 2021.

Ziqi Gao, Qichao Wang, Aochuan Chen, Zijing Liu, Bingzhe Wu, Liang Chen, and Jia Li. Parameter-efficient fine-tuning with discrete fourier transform. In Forty-first International Conference on Machine Learning, 2024.

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Nathan Halko, Per-Gunnar Martinsson, and Joel A Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review, 53(2):217–288, 2011.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.

Pengcheng He, Jianfeng Gao, and Weizhu Chen. Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543, 2021.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International conference on machine learning, pp. 2790–2799. PMLR, 2019.

Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 328–339, 2018.

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.

Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Lee. LLM-adapters: An adapter family for parameter-efficient fine-tuning of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 5254–5276, 2023.

Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M Asano. VeRA: Vector-based random matrix adaptation. In The Twelfth International Conference on Learning Representations, 2024.

Vijay Anand Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. Reducing activation recomputation in large transformer models. Proceedings of Machine Learning and Systems, 5:341–353, 2023.

Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3045–3059, 2021.

Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In The Sixth International Conference on Learning Representations, 2018.

Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 4582–4597, 2021.

Vijay Chandra Lingam, Atula Neerkaje, Aditya Vavre, Aneesh Shetty, Gautham Krishna Gudur, Joydeep Ghosh, Eunsol Choi, Alex Dimakis, Aleksandar Bojchevski, and Sujay Sanghavi. Svft: Parameter-efficient fine-tuning with singular vectors. Advances in Neural Information Processing Systems, 37:41425–41446, 2024.

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.

Weiyang Liu, Rongmei Lin, Zhen Liu, James M Rehg, Liam Paull, Li Xiong, Le Song, and Adrian Weller. Orthogonal over-parameterized training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7251–7260, 2021.

Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng, Yandong Wen, Michael J. Black, Adrian Weller, and Bernhard Schölkopf. Parameter-efficient orthogonal finetuning via butterfly factorization. In The Twelfth International Conference on Learning Representations, 2024.

Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 61–68, 2022.

Xinyu Ma, Xu Chu, Zhibang Yang, Yang Lin, Xin Gao, and Junfeng Zhao. Parameter efficient quasi-orthogonal fine-tuning via givens rotation. In Forty-first International Conference on Machine Learning, 2024.

Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022. Accessed: March, 2025.

Fanxu Meng, Zhaohui Wang, and Muhan Zhang. Pissa: Principal singular values and singular vectors adaptation of large language models. Advances in Neural Information Processing Systems, 37:121038–121072, 2024.

Meta AI. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/, 2024. Accessed: March, 2025.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391, 2018.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.

Jonas Pfeiffer, Aishwarya Kamath, Andreas Rücklé, Kyunghyun Cho, and Iryna Gurevych. AdapterFusion: Non-destructive task composition for transfer learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 487–503, 2021.

Chengwei Qin, Aston Zhang, Zhuosheng Zhang, Jiaao Chen, Michihiro Yasunaga, and Diyi Yang. Is chatGPT a general-purpose natural language processing task solver? In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.

Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning. Advances in Neural Information Processing Systems, 36:79320–79362, 2023.

Zeju Qiu, Weiyang Liu, Adrian Weller, and Bernhard Schölkopf. Orthogonal finetuning made scalable. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 31946–31963, 2025.

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 4463–4473, 2019.

Weixi Song, Zuchao Li, Lefei Zhang, hai zhao, and Bo Du. Sparse is enough in fine-tuning pre-trained large language models. In Forty-first International Conference on Machine Learning, 2024.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations, 2019.

Xiao Wang, Tianze Chen, Qiming Ge, Han Xia, Rong Bao, Rui Zheng, Qi Zhang, Tao Gui, and Xuanjing Huang. Orthogonal subspace learning for language model continual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 10658–10671, 2023.

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Transactions on Machine Learning Research, 2022.

Muling Wu, Wenhao Liu, Xiaohua Wang, Tianlong Li, Changze Lv, Zixuan Ling, Zhu JianHao, Cenyuan Zhang, Xiaoqing Zheng, and Xuanjing Huang. Advancing parameter efficiency in fine-tuning via representation editing. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13445–13464, 2024a.

Zhengxuan Wu, Aryaman Arora, Zheng Wang, Atticus Geiger, Dan Jurafsky, Christopher D Manning, and Christopher Potts. Reft: Representation finetuning for language models. Advances in Neural Information Processing Systems, 37:63908–63962, 2024b.

Jing Xu and Jingzhao Zhang. Random masking finds winning tickets for parameter efficient fine-tuning. In Forty-first International Conference on Machine Learning, 2024.

Shih yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. In Forty-first International Conference on Machine Learning, 2024.

Longhui Yu, Weisen Jiang, Han Shi, Jincheng YU, Zhengying Liu, Yu Zhang, James Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. In The Twelfth International Conference on Learning Representations, 2024.

Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 1–9, 2022.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4791–4800, 2019.

Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019.

Fangzhao Zhang and Mert Pilanci. Spectral adapter: Fine-tuning in spectral space. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations, 2023.
|
||
|
||
## ORGANIZATION OF THE APPENDIX
|
||
|
||
The appendix is organized as follows:

- Appendix A introduces the algorithm of the proposed PSOFT.
- Appendix B provides the theoretical proof for the column-wise angle and norm preservation theorem.
- Appendix C presents theoretical details of the Cayley parameterization.
- Appendix D compares the number of trainable parameters across popular PEFT methods.
- Appendix E analyzes activation memory statistics for different PEFT methods.
- Appendix F outlines experimental details for natural language understanding on GLUE.
- Appendix G covers experimental details for visual classification on VTAB-1K.
- Appendix H reports experimental details for mathematical question answering on MetaMathQA-40K.
- Appendix I describes experimental details for commonsense reasoning on Commonsense-15K.
- Appendix J details extended experiments on the effects of SVD initialization, different rank settings, inserted modules, and Neumann terms.
- Appendix K illustrates the angular structure of the weight changes before and after fine-tuning.
- Appendix L analyzes the difference between PSOFT and full-space OFT in terms of their optimization dynamics and training loss trajectories.
- Appendix M provides the additional memory usage experiments covering a single linear layer, a Transformer block, and end-to-end models.
- Appendix N explains the use of large language models in this paper.
|
||
|
||
## A ALGORITHM OF THE PROPOSED PSOFT
|
||
|
||
For completeness, we provide a detailed description of the proposed PSOFT framework, which corresponds to Algorithm 1. For initialization, the orthogonal matrix $R$ is set to the identity matrix $I_r$, while PSOFT further introduces two additional vectors, $\alpha$ and $\beta$, both initialized as all ones. Before training begins, a singular value decomposition (SVD) is performed once to extract the top-$r$ singular values and vectors, which are then used to construct the matrices $A'$, $B'$, and the residual weights $W_{\text{res}}$. During training, the forward computation follows Eq. 8, and $R$ and the vectors $\alpha$ and $\beta$ are updated jointly from their gradients to obtain the final weights $W_{\text{final}}$.
|
||
|
||
Algorithm 1 PSOFT: orthogonal fine-tuning in the principal subspace

1. Input: Pre-trained weight matrix $W_{\text{pre}} \in \mathbb{R}^{d \times n}$, rank $r$, input $x$, and number of epochs $E$
2. Output: Fine-tuned orthogonal matrix $R$, two vectors $\alpha$ and $\beta$, and final weight matrix $W_{\text{final}}$
3. Initialize: Orthogonal matrix: $R \leftarrow I_r$, two vectors: $\alpha \leftarrow \mathbf{1}_r$, $\beta \leftarrow \mathbf{1}_r$
4. Pre-compute:
5. $W_{\text{pre}} = U S V^\top$, $A' \leftarrow U_{[:, :r]}$, $B' \leftarrow S_{[:r, :r]} V_{[:, :r]}^\top$, $W_{\text{res}} \leftarrow U_{[:, r:]} S_{[r:, r:]} V_{[:, r:]}^\top$
6. for epoch = 1 to $E$ do
7. for each mini-batch $x$ do
8. $h = \left(A' \operatorname{diag}(\alpha)\, R \operatorname{diag}(\beta)\, B' + W_{\text{res}}\right)^\top x$,
9. compute $\frac{\partial \mathcal{L}}{\partial R}$, $\frac{\partial \mathcal{L}}{\partial \alpha}$, $\frac{\partial \mathcal{L}}{\partial \beta}$, then update $R \leftarrow R - \eta \cdot \frac{\partial \mathcal{L}}{\partial R}$, $\alpha \leftarrow \alpha - \eta \cdot \frac{\partial \mathcal{L}}{\partial \alpha}$, $\beta \leftarrow \beta - \eta \cdot \frac{\partial \mathcal{L}}{\partial \beta}$
10. end for
11. end for
12. Reconstruct: $W_{\text{final}} \leftarrow A' \operatorname{diag}(\alpha)\, R \operatorname{diag}(\beta)\, B' + W_{\text{res}}$
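The pre-compute, forward, and reconstruct steps of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation: the helper names (`cayley`, `psoft_init`, `psoft_weight`), the toy shapes, and the Cayley convention `R = (I + Q)^{-1}(I - Q)` are hypothetical, and the gradient updates of step 9 are omitted.

```python
import numpy as np

def cayley(theta, r):
    """Orthogonal R from r(r-1)/2 parameters (assumed form: (I + Q)^{-1}(I - Q))."""
    Q = np.zeros((r, r))
    Q[np.triu_indices(r, k=1)] = theta
    Q = Q - Q.T                                   # skew-symmetric generator
    I = np.eye(r)
    return np.linalg.solve(I + Q, I - Q)

def psoft_init(W_pre, r):
    """One-off SVD split into principal factors A', B' and residual W_res."""
    U, S, Vh = np.linalg.svd(W_pre, full_matrices=False)
    A = U[:, :r]                                  # A' = U[:, :r]
    B = S[:r, None] * Vh[:r, :]                   # B' = S[:r, :r] V[:, :r]^T
    W_res = (U[:, r:] * S[r:]) @ Vh[r:, :]        # remaining singular directions
    return A, B, W_res

def psoft_weight(A, B, W_res, theta, alpha, beta):
    """W = A' diag(alpha) R diag(beta) B' + W_res."""
    R = cayley(theta, A.shape[1])
    core = alpha[:, None] * R * beta[None, :]     # diag(alpha) R diag(beta)
    return A @ core @ B + W_res

rng = np.random.default_rng(0)
d, n, r = 12, 8, 3                                # toy sizes for illustration
W_pre = rng.standard_normal((d, n))
A, B, W_res = psoft_init(W_pre, r)

# Initialization: R = I_r (theta = 0), alpha = beta = 1 leaves W_pre unchanged.
theta0 = np.zeros(r * (r - 1) // 2)
ones = np.ones(r)
W_init = psoft_weight(A, B, W_res, theta0, ones, ones)

# A non-trivial rotation inside the principal subspace, then the forward pass.
theta = 0.3 * rng.standard_normal(theta0.size)
W_rot = psoft_weight(A, B, W_res, theta, ones, ones)
x = rng.standard_normal(d)
h = W_rot.T @ x                                   # h = W^T x, as in step 8
```

With this initialization the adapted weight equals the pre-trained weight before any update. Because `A` has orthonormal columns in this sketch, an orthogonal `R` with `alpha` and `beta` fixed at ones leaves the column norms of the principal part `A @ B` unchanged.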
|
||
|
||
## B PROOF FOR THE ANGLE AND NORM PRESERVATION THEOREM
|
||
|
||
Theorem B.1 (Formal: Column-wise angle and norm preservation in the low-rank subspace). Let $W_{\text{pri}} = AB \in \mathbb{R}^{d \times n}$ and $W_{\text{ps-tuned}} = ARB \in \mathbb{R}^{d \times n}$, with $A \in \mathbb{R}^{d \times r}$, $B \in \mathbb{R}^{r \times n}$. Assume $\operatorname{rank}(A) = \operatorname{rank}(B) = r$ and every column $b_i \neq 0$ (so all angles are well-defined). Let $G := A^\top A$ (which is symmetric positive definite), $w_i^{\text{pri}} := A b_i$, $w_i^{\text{ps-tuned}} := A R b_i$, and denote by $\theta_{ij}^{\text{pri}}$ (resp. $\theta_{ij}^{\text{ps-tuned}}$) the angle between $w_i^{\text{pri}}$, $w_j^{\text{pri}}$ (resp. $w_i^{\text{ps-tuned}}$, $w_j^{\text{ps-tuned}}$). Then
|
||
|
||
R⊤GR = G ⇐⇒ |