This commit is contained in:
wassname
2026-02-22 21:52:57 +08:00
parent 2e3c8165ce
commit 7dbe3faf19
10 changed files with 5556 additions and 110 deletions
@@ -0,0 +1,657 @@
Title: CLOVER: Cross-Layer Orthogonal Vectors Pruning and Fine-Tuning
URL Source: https://arxiv.org/pdf/2411.17426
Published Time: Mon, 03 Feb 2025 01:40:12 GMT
Number of Pages: 16
Markdown Content:
# CLOVER: Cross-Layer Orthogonal Vectors Pruning and Fine-Tuning
Fanxu Meng 1 2 Pingzhi Tang 1 Fan Jiang 1 Muhan Zhang 1 2
## Abstract
Decoder-only models generate tokens autoregressively by caching key/value vectors, but as the cache grows, inference becomes memory-bound. To address this issue, we introduce CLOVER (Cross-Layer Orthogonal Vectors), a novel approach that treats pairs of attention layers as a set of low-rank decompositions. CLOVER applies Singular Value Decomposition (SVD) to the Q-K and V-O pairs within each attention head. The resulting singular values can either guide pruning or serve as trainable parameters for efficient fine-tuning of all orthogonal vectors. After pruning or fine-tuning, these values are reintegrated into the model without increasing its parameter count. We apply CLOVER to various models, including GPT-2 XL, DeepSeek-V2-Lite, Whisper-Large-v3, Stable Diffusion XL, and LLaMA-3.2-11B-Vision. Our results demonstrate that CLOVER significantly improves pruning efficiency. For instance, the perplexity of pruning 70% of the Q-K pairs in GPT-2 XL is similar to that of pruning just 8% with vanilla methods. Fine-tuning the singular values further results in a full-rank update, outperforming state-of-the-art methods (LoRA, DoRA, HiRA, and PiSSA) by 7.6%, 5.5%, 3.8%, and 0.7%, respectively, on eight commonsense tasks for LLaMA-2 7B.
## 1. Introduction
In recent years, Large Language Models (LLMs) have rapidly evolved into essential tools for productivity (OpenAI, 2024; Anthropic, 2024; Team et al., 2024a). Open-source models (AI@Meta, 2024; Mistral, 2024; Qwen, 2024; Liu et al., 2024b; Team et al., 2024b; Abdin et al., 2024) have also narrowed the performance gap with closed-source models. The success of LLMs is largely attributed to Next Token Prediction (Radford, 2018; Brown et al., 2020), where tokens are predicted sequentially, with attention computed between each token and all preceding ones. To avoid redundant computations, key-value features are cached. However, as model size grows, the overhead of caching becomes substantial, leading to memory and communication bottlenecks.

[Figure 1 panels: (a) Multi-Head Attention; (b) CLOVER; (c) Pruning without Training, log of perplexity vs. pruning ratio for Vanilla and CLOVER; (d) Fine-Tuning Pruned Model, perplexity vs. pruning ratio for Vanilla and CLOVER.]

Figure 1. (a) We treat the Query-Key and Value-Output layers within a single attention head as a unified structure. (b) Apply SVD to obtain two sets of singular vectors for initializing the Q-K and V-O layers, along with singular values that guide pruning or enable efficient full-rank fine-tuning. (c) This cross-layer orthogonalization strategy allows for higher pruning rates. (d) The pruned model maintains strong performance after fine-tuning.
For instance, a 65B parameter model (Touvron et al., 2023) with 8-bit key-value quantization requires over 86GB of GPU memory to store 512K tokens, exceeding the capacity of a single H100-80GB GPU (Sun et al., 2024). To enable efficient training and inference, we introduce CLOVER (Cross-Layer Orthogonal Vectors), a novel method that orthogonalizes the Query, Key, Value, and Output vectors without generating additional transformation matrices. As shown in Figure 1a, we treat the Q-K and V-O pairs in each attention head as a low-rank decomposition of $W_{QK}$ and $W_{VO}$. By crossing these layers and performing SVD on $W_{QK}$ and $W_{VO}$, the Query, Key, Value, and Output vectors become orthogonal within each attention head. Figure 1b illustrates how the resulting singular values can guide pruning or serve as trainable parameters for efficient fine-tuning. After pruning or fine-tuning, these values can be reintegrated into the model without increasing its parameter count. Notably, previous methods, such as SVFT (Lingam et al., 2024), obtain orthogonal vectors by directly performing orthogonal decomposition on the matrix at each layer, which results in an accompanying transformation matrix, doubling the parameter count. In contrast, CLOVER treats the Q-K pairs as transformation matrices for each other, and similarly for the V-O pairs. CLOVER only generates a small set of singular values to guide pruning and fine-tuning, which can be merged back into the model without increasing inference costs.

> arXiv:2411.17426v3 [cs.LG] 31 Jan 2025
By orthogonalizing the vectors, we eliminate linear redundancy. Attention heads contain numerous non-zero norm vectors. Directly pruning these vectors would degrade performance, but orthogonalizing them allows us to represent the entire attention head's space using a small set of orthogonal bases. The remaining vectors are nearly zero, making them safe to prune. As shown in Figure 1c, pruning an average of 45 vectors in the query-key pair using CLOVER results in a perplexity similar to that of vanilla pruning, which prunes only 5 vectors. Moreover, CLOVER generates a singular value matrix between the two layers of each Q-K and V-O pair. By updating this matrix during fine-tuning, CLOVER learns linear combinations of all orthogonal bases within each attention head. In contrast, PiSSA can only learn from a subset of orthogonal vectors, potentially causing some data projections to approach zero in those directions, leading to non-functional adapters during training. As shown in Figure 1d, fine-tuning a very small number of singular values can achieve performance close to that of fine-tuning all attention heads.

We summarize the contributions of our paper as follows:

- We treat the Q-K and V-O pairs in each attention head as low-rank approximations of $W_{QK}$ and $W_{VO}$. By performing SVD, we orthogonalize the attention head without adding extra transformation matrices.
- This orthogonalization reduces linear redundancy, is compatible with any pruning method, and allows for higher pruning ratios. Pruning 46.42% of the vectors in Whisper's attention heads preserves performance without requiring additional training.
- CLOVER enables efficient full-rank updates, surpassing SOTA methods such as LoRA, DoRA, HiRA, and PiSSA on eight commonsense reasoning tasks across LLaMA 7B/13B, LLaMA-2-7B, and LLaMA-3-8B, with additional analyses highlighting its advantages.
## 2. Related Work
LLM Compression. To mitigate the high memory demands of KV caches in long-context models, several techniques have been proposed. These include reducing sequence length with linear attention (Katharopoulos et al., 2020; Wang et al., 2020; Peng et al., 2023; Gu & Dao, 2023; De et al., 2024), dynamic token pruning (Fu et al., 2024; Jo & Shin, 2024; Li et al., 2024b), compressing the key-value rank (Shazeer, 2019; Ainslie et al., 2023; Liu et al., 2024a; Yu et al., 2024), and pruning head dimensions (Ashkboos et al., 2024; Xia et al., 2023; Sun et al., 2023). Additional approaches include sharing key-value representations across layers (Sun et al., 2024; Brandon et al., 2024; Liu et al., 2024c; Zuhri et al., 2024) and quantizing KV cache weights and activations (Frantar et al., 2022; Dettmers et al., 2022; Xiao et al., 2023; Liu et al., 2024e; Hooper et al., 2024). Among them, structured pruning is hardware-friendly but can reduce performance when non-zero dimensions are removed (Ma et al., 2023). Fine-tuning can recover some of the lost performance, but it is computationally expensive. To address this, Parameter Efficient Fine-Tuning (PEFT) methods are used (Guo et al., 2023).
Parameter Efficient Fine-Tuning. Several strategies have been introduced to minimize fine-tuning parameters while maintaining performance. These include low-rank adaptation (Hu et al., 2021), partial-parameter fine-tuning (Zaken et al., 2021; Lawton et al., 2023; Zhao et al., 2020; Sung et al., 2021; Ansell et al., 2021; Xu et al., 2021; Guo et al., 2020; Fu et al., 2023), soft prompt fine-tuning (Hambardzumyan et al., 2021; Lester et al., 2021; Li & Liang, 2021; Liu et al., 2023b; Vu et al., 2021; Asai et al., 2022; Wang et al., 2023), and sparse matrix fine-tuning (Qiu et al., 2023; Liu et al., 2023a; Yuan et al., 2024). Among these, LoRA is widely used due to its simplicity and effectiveness, with recent works enhancing it further (Zhang et al., 2023; Zi et al., 2023; Liu et al., 2024d; Zhao et al., 2024; Jiang et al., 2024). PiSSA (Meng et al., 2024) improves convergence speed by initializing adapters with principal singular values and vectors, also reducing quantization error (Wang et al., 2024a;b; Li et al., 2024a). However, PiSSA is limited by its use of a fixed set of orthogonal bases. SVFT (Lingam et al., 2024) directly applies Singular Value Decomposition (SVD) to the original matrix, but this increases the number of parameters, raising computational overhead and reducing efficiency. The CLOVER method addresses these issues by treating the Query-Key pairs in each attention head as low-rank matrices. Using orthogonal decomposition, CLOVER eliminates the need for additional transformation matrices. Instead, it leverages a small set of singular values to linearly combine orthogonal vectors, making the approach more parameter-efficient. After fine-tuning, the adapter can be smoothly reintegrated into the original matrix structure.
## 3. CLOVER: Cross-Layer Orthogonal Vectors
Below, we explain the CLOVER method step by step and show why it can orthogonally decompose the Query, Key, Value, and Output layers in Multi-Head Attention without introducing any transformation matrix. We mainly use the computation of the Q-K pair as an example, and then extend it to the V-O pair.
Multi-Head Self-Attention Setup. In a multi-head self-attention mechanism with $H$ heads, each head $h \in \{1, \dots, H\}$ computes an attention score as:

$$\mathrm{attn}(Q_h, K_h) = \mathrm{softmax}\left(\frac{Q_h K_h^\top}{\sqrt{d}}\right),$$

where $H$ is the number of attention heads, $d$ is the dimensionality of each head, $X \in \mathbb{R}^{n \times D}$ is the input matrix ($n$ is the sequence length, $D$ is the total hidden dimension), $Q_h, K_h \in \mathbb{R}^{n \times d}$ are the query and key representations for head $h$, and $W_Q, W_K \in \mathbb{R}^{D \times H \times d}$ are the weights for projecting the input $X$ into queries and keys. Specifically, the queries and keys for head $h$ are obtained by multiplying $X$ with the corresponding "slice" of $W_Q$ and $W_K$, respectively:

$$Q_h = X W_Q^{[:,h,:]}, \qquad K_h = X W_K^{[:,h,:]}.$$
Cross Layers Merging. Substituting $Q_h$ and $K_h$ into $Q_h K_h^\top$, we have:

$$Q_h K_h^\top = X W_Q^{[:,h,:]} \left(W_K^{[:,h,:]}\right)^\top X^\top.$$

Notice that the original weights $W_Q^{[:,h,:]}$ and $W_K^{[:,h,:]}$ are each in $\mathbb{R}^{D \times d}$; once multiplied together, the resulting matrix $W_{QK}^h = W_Q^{[:,h,:]} \left(W_K^{[:,h,:]}\right)^\top$ has dimension $D \times D$. Since $d \ll D$, using $W_{QK}^h$ directly in computations, or storing it as trainable parameters, would be highly inefficient, limiting the use cases of such parameter merging.
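The merging step can be checked numerically. The following is a minimal NumPy sketch with toy sizes and random weights (the variable names and sizes are illustrative assumptions, not taken from any released implementation); it shows that the merged matrix reproduces the usual attention logits while having rank at most $d$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, H, d = 5, 64, 4, 16            # toy sizes; in real models d << D

X = rng.standard_normal((n, D))      # input features
W_Q = rng.standard_normal((D, H, d)) # query weights, one slice per head
W_K = rng.standard_normal((D, H, d)) # key weights, one slice per head

h = 0
Q_h = X @ W_Q[:, h, :]               # (n, d)
K_h = X @ W_K[:, h, :]               # (n, d)

# Merge the two projections of head h into a single D x D matrix.
W_QK_h = W_Q[:, h, :] @ W_K[:, h, :].T

scores_two_step = Q_h @ K_h.T        # usual Q K^T
scores_merged = X @ W_QK_h @ X.T     # X W_QK X^T

rank = np.linalg.matrix_rank(W_QK_h)
print(W_QK_h.shape, rank)
print(np.allclose(scores_two_step, scores_merged))
```

The merged matrix stores $D^2$ entries for a rank-$d$ object, which is the inefficiency the decomposition in the next step is meant to remove.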
Cross Layers Orthogonal Decomposition. To address the large size of $W_{QK}^h$, we factorize $W_{QK}^h$ via SVD:

$$W_{QK}^h = U_{QK}^h \, S_{QK}^h \, \left(V_{QK}^h\right)^\top,$$

where $U_{QK}^h$ is a $D \times D$ orthogonal matrix, $S_{QK}^h$ is a $D \times D$ diagonal matrix of singular values, and $V_{QK}^h$ is another $D \times D$ orthogonal matrix. Since $W_Q^{[:,h,:]}$ and $W_K^{[:,h,:]}$ each have shape $D \times d$, the rank of $W_{QK}^h$ is at most $d$. Thus $S_{QK}^h$ has at most $d$ non-zero singular values. We can truncate the SVD to keep only the top-$r$ singular values without loss:

$$W_{QK}^h = U_{QK}^h[:, :r] \; S_{QK}^h[:r, :r] \; \left(V_{QK}^h[:, :r]\right)^\top,$$

where $r \le d$. The process can be easily applied to $W_V$ and $W_O$, as introduced in Appendix A.1.
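A short NumPy sketch of the truncation, under the assumption that the singular values are split evenly (as square roots) between the new query and key factors; the text only states that they are merged back into the two layers, so this split is one possible choice rather than the authors' stated one.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 64, 16
W_Q_h = rng.standard_normal((D, d))
W_K_h = rng.standard_normal((D, d))
W_QK_h = W_Q_h @ W_K_h.T                  # (D, D), rank <= d

U, S, Vt = np.linalg.svd(W_QK_h)          # S is sorted in descending order
r = d
W_trunc = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]

# New per-head factors with the same shapes as the originals.
# Assumption: sqrt(S) is folded into each side.
sqrt_S = np.sqrt(S[:r])
W_Q_new = U[:, :r] * sqrt_S               # columns are mutually orthogonal
W_K_new = Vt[:r, :].T * sqrt_S            # columns are mutually orthogonal

print(np.allclose(W_QK_h, W_trunc))
print(np.allclose(W_Q_new @ W_K_new.T, W_QK_h))
```

Because `W_Q_new` and `W_K_new` keep the original $D \times d$ shapes, they can replace the original slices without adding parameters.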
CLOVER for Pruning. After performing SVD, we can rewrite the weight matrix $W_{QK}^h$ as follows:

$$W_{QK}^h = \underbrace{U_{QK}^h[:, :r] \; S_{QK}^h[:r, :r]}_{\tilde{U}^h \in \mathbb{R}^{D \times r}} \; \underbrace{\left(V_{QK}^h[:, :r]\right)^\top}_{\tilde{V}^h \in \mathbb{R}^{r \times D}}.$$

Instead of storing the full matrices $W_Q^h$ and $W_K^h \in \mathbb{R}^{D \times d}$, we store the smaller factors $\tilde{U}^h \in \mathbb{R}^{D \times r}$ and $\tilde{V}^h \in \mathbb{R}^{r \times D}$, which can be significantly smaller than the original matrix since $r \le d \ll D$. This leads to a reduction in memory usage and computational cost. Additionally, we can prune singular values (and their corresponding singular vectors) below a chosen threshold. This further reduces the parameter count and computational overhead.
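To make the pruning step concrete, the sketch below keeps the same number of dimensions under two rules on a toy head: singular-value pruning after the decomposition, and norm-based pruning of the original dimensions. The sizes, the `keep` budget, and the random weights are illustrative assumptions; the norm-product importance follows the $\|W_Q\|\|W_K\|$ criterion used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, keep = 64, 16, 8
W_Q_h = rng.standard_normal((D, d))
W_K_h = rng.standard_normal((D, d))
W_QK_h = W_Q_h @ W_K_h.T

# Singular-value pruning: keep the `keep` largest singular values.
U, S, Vt = np.linalg.svd(W_QK_h, full_matrices=False)
U_tilde = U[:, :keep] * S[:keep]          # (D, keep), S folded into U
V_tilde = Vt[:keep, :]                    # (keep, D)
err_clover = np.linalg.norm(W_QK_h - U_tilde @ V_tilde)

# Vanilla pruning: keep the `keep` dimensions with the largest norm product.
importance = np.linalg.norm(W_Q_h, axis=0) * np.linalg.norm(W_K_h, axis=0)
idx = np.argsort(importance)[::-1][:keep]
err_vanilla = np.linalg.norm(W_QK_h - W_Q_h[:, idx] @ W_K_h[:, idx].T)

print(f"singular-value pruning error: {err_clover:.3f}")
print(f"vanilla pruning error:        {err_vanilla:.3f}")
```

Both pruned matrices have rank at most `keep`, and the truncated SVD is the best rank-`keep` approximation in Frobenius norm, so the first error can never exceed the second. This only measures weight reconstruction, not perplexity.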
CLOVER for Fine-Tuning. CLOVER can be used not only for pruning, but also for parameter-efficient fine-tuning. We freeze the matrices $U_{QK}^h[:, :r]$ and $V_{QK}^h[:, :r]$, and only fine-tune the singular values $S_{QK}^h[:r, :r]$.

In contrast to SVFT, which factorizes the entire weight matrices $W_Q, W_K, W_V, W_O \in \mathbb{R}^{D \times D}$ individually, CLOVER factorizes the merged weights $W_{QK}^h$ and $W_{VO}^h$ within each attention head, significantly reducing the number of parameters. By applying SVD factorization within each attention head, CLOVER constrains the effective rank of the cross-layer matrix to $d$. As a result, the tunable matrix $S_{QK}$ has a size bounded by $\mathbb{R}^{H \times d \times d}$ (considering all heads). In comparison, SVFT requires factorizing large matrices each into three components ($U, S, V \in \mathbb{R}^{D \times D}$), leading to a significant increase in parameter count and computational overhead, even with sparse updates for the singular values $S$.

For example, consider the LLaMA-2-7B model with $H = 32$ attention heads and a head dimension of $d = 128$. By factorizing each head separately, the largest size for $S_{QK}$ is $O(32 \times 128 \times 128)$, which is significantly smaller than factorizing a $\mathbb{R}^{4096 \times 4096}$ matrix. This makes CLOVER's parameter efficiency comparable to that of a LoRA configuration with rank 32, as shown in Appendix A.2, but with additional potential for pruning.
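The sizes quoted for LLaMA-2-7B can be worked out directly. This is plain arithmetic on the numbers in the text; the variable names are illustrative.

```python
H, d, D = 32, 128, 4096              # LLaMA-2-7B values quoted in the text

s_qk_params = H * d * d              # one d x d matrix S_QK per head
dense_params = D * D                 # a single full D x D factor

print(s_qk_params)                   # 524288
print(dense_params)                  # 16777216
print(dense_params // s_qk_params)   # 32
```

Under these numbers, the per-head singular value matrices for one Q-K pair are 32 times smaller than a single dense $D \times D$ factor.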
> Table 1. Pruning GPT-2-XL's attention layers with CLOVER and vanilla pruning at various ratios, evaluating perplexity on Wikitext2 (lower is better), and fine-tuning on OpenWebText with different token budgets. The base model's perplexity is 14.78.

| Pruning Ratio | w/o Training: Vanilla | w/o Training: CLOVER | 66M Tokens: Vanilla | 66M Tokens: CLOVER | 66M Tokens: CLOVER† | 131M Tokens: Vanilla | 131M Tokens: CLOVER | 131M Tokens: CLOVER† |
|---|---|---|---|---|---|---|---|---|
| 12.5% | 33.76 | 15.89 | 16.04 | 15.45 | 15.67 | 16.38 | 15.77 | 15.42 |
| 25.0% | 78.36 | 17.45 | 16.93 | 15.70 | 15.89 | 17.07 | 16.05 | 15.75 |
| 37.5% | 159.4 | 20.95 | 18.17 | 16.17 | 16.60 | 18.14 | 16.48 | 16.41 |
| 50.0% | 338.9 | 35.12 | 20.45 | 17.22 | 17.63 | 19.02 | 17.13 | 17.71 |
| 62.5% | 538.5 | 85.25 | 24.65 | 19.32 | 20.64 | 21.44 | 18.40 | 20.39 |
| 75.0% | 708.8 | 187.4 | 36.04 | 24.65 | 29.28 | 27.22 | 20.99 | 28.44 |

All entries are perplexity (↓).
## 4. Experiments
As detailed in Section 3, CLOVER is highly effective for both pruning and fine-tuning. We present a series of experiments to validate these capabilities. In Section 4.1, we compare CLOVER with Vanilla pruning on a GPT-2-XL model (Radford et al., 2019). CLOVER results in less performance degradation, while Vanilla pruning significantly harms the model's performance, making recovery difficult even with fine-tuning. In Section 4.2, we conduct fine-tuning experiments on eight commonsense tasks, comparing CLOVER with state-of-the-art methods. The results show the effectiveness of CLOVER's linear combinations of all orthogonal vectors. In Section 4.3, CLOVER is applied to various models. We visualize how it removes linear redundancy between vectors, enabling more efficient pruning. In Section 4.4, we demonstrate CLOVER's ability to perform significant pruning on the Whisper model, which exhibits substantial linear redundancy, without requiring fine-tuning. In Section 4.5, we explain the importance of learning from all the orthogonal vectors by analyzing the projection of data features onto different directions in the model. In Section 4.6, we confirm CLOVER's full-rank update capability by visualizing the singular value distribution of ∆W from various methods. Finally, in Section 4.7, we show how CLOVER fine-tunes the model using its inherent properties, without introducing "intrusive dimensions" as LoRA does, which may risk model degradation (Shuttleworth et al., 2024).
### 4.1. CLOVER for Large Ratio Pruning

Due to the need to compute attention between each token and all preceding tokens, compressing attention, particularly the key-value layers, is crucial, despite the larger number of parameters in the MLP. CLOVER represents each attention head with a small number of vectors. Since it only modifies the initialization, it can be combined with any other pruning technique. This paper validates the proposed method using basic structured pruning on GPT-2-XL, rather than targeting state-of-the-art performance. We initialize GPT-2-XL with CLOVER, then prune small singular values based on their magnitude. To maintain inference efficiency, we apply the same pruning rate across all layers, removing a fixed percentage of the singular vectors with the smallest singular values. The singular values, S, are then merged into the U and V matrices. For comparison, we also prune without CLOVER orthogonalization, using L2-norms for pruning. After pruning, we evaluate perplexity on the WikiText-2 (Merity et al., 2016) dataset. We then fine-tune the pruned models on the OpenWebText (Gokaslan & Cohen, 2019) dataset following nanoGPT¹. To minimize disruption to the original model, we fine-tune only the pruned attention layers, leaving the MLP, embedding layers, and LM head unchanged. In the CLOVER† case, after pruning, S is not immediately merged into the U and V matrices but is used for parameter-efficient fine-tuning, with the merging occurring afterward. We adjust the learning rate from 6e-4 to 6e-3 and remove weight decay, while keeping other hyperparameters consistent with the other two methods.

Based on Table 1, CLOVER causes less damage to the model than Vanilla pruning, as it transfers functionality into fewer orthogonal bases. For example, when pruning 50% of the parameters without further fine-tuning, CLOVER's perplexity only increases by 1.38× relative to the base model, while that of Vanilla pruning increases by 21.9×. After fine-tuning, CLOVER's performance far exceeds that of Vanilla pruning. Due to its lower model disruption, CLOVER requires fewer tokens for fine-tuning to restore performance (e.g., perplexity with 66M tokens is close to that with 131M tokens), whereas Vanilla pruning needs more tokens, resulting in higher costs and potential degradation in out-of-domain tasks. Furthermore, by fine-tuning only the singular values from the SVD decomposition and the attention layer biases, CLOVER achieves recovery with fewer training resources and parameter changes. At lower pruning rates, CLOVER† even outperforms full attention layer training. However, when pruning rates are too high, accuracy loss becomes significant, and the available parameters for fine-tuning become insufficient (e.g., at 75% pruning, only 0.15% of the original attention layer parameters are updated).
> ¹ https://github.com/karpathy/nanoGPT
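The relative increases quoted for 50% pruning can be reproduced from Table 1. This is simple arithmetic on the reported perplexities, reading "increases by k×" as (pruned − base) / base; that reading is our assumption, chosen because it matches the quoted figures.

```python
base = 14.78                           # base model perplexity (Table 1 caption)
clover_50, vanilla_50 = 35.12, 338.9   # 50% pruning, without training (Table 1)

clover_increase = clover_50 / base - 1
vanilla_increase = vanilla_50 / base - 1

print(f"{clover_increase:.2f}x, {vanilla_increase:.1f}x")  # 1.38x, 21.9x
```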
> Table 2. Accuracy comparison of LLaMA 7B/13B, LLaMA2 7B, and LLaMA3 8B with various PEFT methods on eight commonsense reasoning datasets. Results of LoRA and DoRA are taken from (Liu et al., 2024d). Results of HiRA are taken from (Anonymous, 2025).

| Model | Method | Params | BoolQ | PIQA | SIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ChatGPT | - | - | 73.1 | 85.4 | 68.5 | 78.5 | 66.1 | 89.8 | 79.9 | 74.8 | 77.0 |
| LLaMA-7B | Series | 0.99% | 63.0 | 79.2 | 76.3 | 67.9 | 75.7 | 74.5 | 57.1 | 72.4 | 70.8 |
| | Parallel | 3.54% | 67.9 | 76.4 | 78.8 | 69.8 | 78.9 | 73.7 | 57.3 | 75.2 | 72.2 |
| | LoRA | 0.83% | 68.9 | 80.7 | 77.4 | 78.1 | 78.8 | 77.8 | 61.3 | 74.8 | 74.7 |
| | DoRA | 0.84% | 69.7 | 83.4 | 78.6 | 87.2 | 81.0 | 81.9 | 66.2 | 79.2 | 78.4 |
| | PiSSA | 0.83% | 74.1 | 85.4 | 81.5 | 94.0 | 85.0 | 85.6 | 72.1 | 84.2 | 82.7 |
| | CLOVER | 0.83% | 72.9 | 86.34 | 82.1 | 94.9 | 85.4 | 87.5 | 74.4 | 86.4 | 83.7 |
| LLaMA-13B | Series | 0.80% | 71.8 | 83 | 79.2 | 88.1 | 82.4 | 82.5 | 67.3 | 81.8 | 79.5 |
| | Parallel | 2.89% | 72.5 | 84.9 | 79.8 | 92.1 | 84.7 | 84.2 | 71.2 | 82.4 | 81.4 |
| | LoRA | 0.67% | 72.1 | 83.5 | 80.5 | 90.5 | 83.7 | 82.8 | 68.3 | 82.4 | 80.5 |
| | DoRA | 0.68% | 72.4 | 84.9 | 81.5 | 92.4 | 84.2 | 84.2 | 69.6 | 82.8 | 81.5 |
| | PiSSA | 0.67% | 74.6 | 88.0 | 82.9 | 95.5 | 87.0 | 90.3 | 77.2 | 88.2 | 85.4 |
| | CLOVER | 0.67% | 75.2 | 88.4 | 83.1 | 96.0 | 87.8 | 89.7 | 79.3 | 89.8 | 86.2 |
| LLaMA2-7B | LoRA | 0.83% | 69.8 | 79.9 | 79.5 | 83.6 | 82.6 | 79.8 | 64.7 | 81.0 | 77.6 |
| | DoRA | 0.84% | 71.8 | 83.7 | 76.0 | 89.1 | 82.6 | 83.7 | 68.2 | 82.4 | 79.7 |
| | HiRA | 0.83% | 71.2 | 83.4 | 79.5 | 88.1 | 84.0 | 86.7 | 73.8 | 84.6 | 81.4 |
| | PiSSA | 0.83% | 75.0 | 87.0 | 81.6 | 95.0 | 86.5 | 88.5 | 75.9 | 86.4 | 84.5 |
| | CLOVER | 0.83% | 75.0 | 86.4 | 82.0 | 95.1 | 87.5 | 89.6 | 76.6 | 89.4 | 85.2 |
| LLaMA3-8B | LoRA | 0.70% | 70.8 | 85.2 | 79.9 | 91.7 | 84.3 | 84.2 | 71.2 | 79.0 | 80.8 |
| | DoRA | 0.71% | 74.6 | 89.3 | 79.9 | 95.5 | 85.6 | 90.5 | 80.4 | 85.8 | 85.2 |
| | HiRA | 0.70% | 75.4 | 89.7 | 81.2 | 95.4 | 87.7 | 93.3 | 82.9 | 88.3 | 86.7 |
| | PiSSA | 0.70% | 77.2 | 90.0 | 82.9 | 96.6 | 88.4 | 93.6 | 82.4 | 87.4 | 87.3 |
| | CLOVER | 0.47% | 76.4 | 89.3 | 82.1 | 96.9 | 89.9 | 93.6 | 84.5 | 90.6 | 87.9 |
### 4.2. CLOVER for Full-Rank Fine-Tuning

In this section, we evaluate CLOVER against LoRA (Hu et al., 2021), DoRA (Liu et al., 2024d), HiRA (Anonymous, 2025), and PiSSA (Meng et al., 2024) on commonsense reasoning tasks, excluding SVFT (Lingam et al., 2024) due to its significant overhead. The tasks are divided into eight sub-tasks, as outlined in Table 4. Following the DoRA setup, we fine-tune on the Commonsense-170k dataset and evaluate on each sub-task's test set. We apply orthogonal decomposition to the Value-Output pair and fine-tune the resulting singular value matrix. Due to the non-linear RoPE (Su et al., 2024) operation between the query and key, we perform orthogonal decomposition in the Key layer and fine-tune the transition matrix. Similarly, we treat each group of 64 consecutive dimensions in the MLP.Up layer as a head, applying orthogonal decomposition and updating the transition matrix. The number of learnable parameters for LLaMA 7B/13B (Touvron et al., 2023) and LLaMA-2-7B (AI@Meta, 2023) matches LoRA/DoRA/HiRA/PiSSA with rank 32 updates. For LLaMA-3-8B (AI@Meta, 2024), CLOVER uses about 2/3 of the trainable parameters of the other methods. For a fair comparison, we use the hyperparameters from DoRA (3 epochs, batch size 16, linear learning-rate scheduler). We adjusted the learning rate based on DoRA's approach and found that CLOVER performs best with lr=1e-4, which we applied across all models. PiSSA was trained using the same hyperparameters, but with a learning rate of 2e-5, as specified in its original paper. Due to the stable performance of PiSSA and CLOVER during training, we did not perform validation every 80 iterations, as done in DoRA, to select the best-performing model on the validation set for testing. Instead, we trained for the full 3 epochs and used the final model for testing. HiRA's results are taken directly from its original paper, while the other baseline results are sourced from DoRA's paper. Table 2 shows that CLOVER achieves the highest average accuracy on every model.
Specifically, on LLaMA 7B, CLOVER outperforms LoRA, DoRA, and PiSSA by 9%, 5.3%, and 1%, respectively. On LLaMA 13B, CLOVER outperforms these methods by 5.7%, 4.7%, and 0.8%. On LLaMA-2-7B, CLOVER surpasses LoRA, DoRA, HiRA, and PiSSA by 7.6%, 5.5%, 3.8%, and 0.7%. Even on LLaMA-3-8B, with fewer trainable parameters, CLOVER outperforms by 7.1%, 2.7%, 1.2%, and 0.6%. CLOVER leads in most sub-tasks and ranks second in a few.

[Figure 2 panels: (a) DeepSeek-V2-Lite; (b) Llama-3.2-Vision; (c) Whisper-Large-v3; (d) SDXL; (e) CLIP-ViT-BigG. Each panel plots per-dimension importance against sorted dimensions for Vanilla and CLOVER, with $W_Q W_K^\top$ in the first row and $W_V W_O^\top$ in the second. Marked points (dimension, importance) for Q-K / V-O: (a) (10, 0.93) / (72, 0.13); (b) (1, 2.66) / (27, 0.29); (c) (7, 0.63) / (35, 0.16); (d) (29, 1.25) / (26, 0.46); (e) (9, 0.04) / (44, 0.02).]

Figure 2. CLOVER (orange) uses fewer orthogonal basis vectors than Vanilla Pruning (blue) to span the attention head space. The first row shows the importance of Q-K dimensions, and the second row shows V-O dimensions. After the red dot, CLOVER's importance is lower, and pruning these vectors results in less performance loss.
### 4.3. CLOVER Removes Redundant Vectors

CLOVER achieves a higher pruning ratio due to the significant linear redundancy present in the model. By representing the entire attention head with only a small number of orthogonal vectors, CLOVER effectively removes this redundancy. To illustrate the advantages of CLOVER in eliminating linear redundancy, we apply it to a variety of models, including the large language model DeepSeek-V2-Lite (DeepSeek-AI, 2024), the multimodal automatic speech recognition and speech translation model Whisper-Large-v3 (Radford et al., 2023), the multimodal instruction-tuned image reasoning generative model LLaMA-3.2-11B-Vision (AI@Meta, 2024), the image encoder CLIP-ViT-bigG (Cherti et al., 2022), and the image generation model Stable Diffusion XL (Podell et al., 2023). We compute the L2 norm for each dimension (which, after CLOVER, equals the singular value) in both the Q-K pair and the V-O pair, sorting the values in descending order within each attention head for better visualization. For comparison, we also perform Vanilla Pruning, which does not utilize CLOVER initialization but instead sorts directly based on the L2 norm.

Figure 2 showcases the first attention head from the first layer of each model. In the first row of the figure, depicting the Q-K norm, we observe that in the original model, the importance of each dimension is relatively balanced (e.g., Figure 2c). This balanced distribution is a result of the linear redundancy, where different directions are intertwined, making it challenging to prune individual directions without negatively affecting the model's performance. However, after applying CLOVER's orthogonal decomposition, only a small number of orthogonal bases on the left side exhibit significantly large norms. These vectors span almost the entire attention head's space, and the remaining vectors have norms that approach zero, indicating that they are already represented by the dominant singular vectors and can be pruned without loss of performance. Beyond the red intersection point, CLOVER's remaining vectors exhibit consistently lower importance than those in Vanilla Pruning, meaning pruning these vectors results in less performance degradation. This demonstrates why CLOVER enables a higher pruning ratio. A similar trend is observed for the V-O pair, although the model's inherent sparsity is less pronounced than in the Q-K pair, making the effect less noticeable. Still, in most models, pruning half of the vectors has a smaller impact on performance compared to Vanilla Pruning. Notably, in CLIP-ViT-bigG (Figure 2e), a proportion of the vectors already have a norm of zero, allowing for safe pruning.
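The importance curves described above can be imitated on a toy head. In this sketch the redundancy is built in by hand: all 16 dimensions are mixtures of only 4 underlying directions. That construction, the sizes, and the helper name `importance` are illustrative assumptions and not the measured weights of any model.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 64, 16
# Toy head with built-in redundancy: the d columns only span 4 directions.
mix = rng.standard_normal((4, d))
W_Q_h = rng.standard_normal((D, 4)) @ mix
W_K_h = rng.standard_normal((D, 4)) @ mix

def importance(w_q, w_k):
    """Per-dimension importance: product of column L2 norms, sorted descending."""
    scores = np.linalg.norm(w_q, axis=0) * np.linalg.norm(w_k, axis=0)
    return np.sort(scores)[::-1]

vanilla = importance(W_Q_h, W_K_h)

U, S, Vt = np.linalg.svd(W_Q_h @ W_K_h.T, full_matrices=False)
W_Q_new = U[:, :d] * S[:d]            # singular values folded into the query side
W_K_new = Vt[:d, :].T                 # unit-norm columns
clover = importance(W_Q_new, W_K_new)

print(np.round(vanilla, 2))
print(np.round(clover, 2))
```

In this toy case every original dimension has non-negligible importance, while after the decomposition only four dimensions carry weight and the per-dimension importance coincides with the singular values.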
### 4.4. CLOVER for Training-Free Pruning

As demonstrated by the prominent low-rank properties in Figure 2c, we applied pruning to the Whisper-large-v3 model (Radford et al., 2023). To intuitively highlight the effectiveness of CLOVER pruning, we present an example using an audio input from the LibriSpeech Long dataset (Gandhi et al., 2023). For reference, the waveform of this input is shown in Figure 3, and the corresponding target translation script is provided in Appendix A.4. After applying CLOVER to orthogonalize the vectors, we pruned vectors with magnitudes close to zero ($\|W_Q\|\|W_K\| \le 5 \times 10^{-3}$ and $\|W_V\|\|W_O\| \le 6 \times 10^{-3}$). This pruning achieved ratios of 56.01% and 36.82% for the parameters in the Q-K pair and V-O pair, respectively. Remarkably, the model's output remains nearly unchanged, with only one error, which has been highlighted in the text using strikethrough and red for clarity:

[Figure 3: amplitude vs. samples, with the sample axis running from 0 to 10e5.]

Figure 3. An audio waveform from the LibriSpeech dataset.
Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Up Guards and Adam paintings, and Mason's exquisite idles are as national as a jingo poem. Mr. Birkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And , and Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampooer in a Turkish bath, next man.
In contrast, using a vanilla pruning method with the same pruning ratio, the model completely fails to produce valid outputs:
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
This example validates our earlier claim that straightforward pruning of non-zero dimensions can lead to accumulated loss. In contrast, CLOVER effectively eliminates linear redundancy, enabling a significantly higher pruning ratio. When the linear redundancy is sufficiently pronounced, CLOVER can even achieve a high pruning ratio without the need for fine-tuning to recover performance.
### 4.5. Necessity of Full-Direction Fine-Tuning

Besides pruning with a large ratio, CLOVER is capable of learning linear combinations of all orthogonal vectors within each attention head. This capability allows CLOVER to resemble full-parameter fine-tuning more closely. To highlight the advantages of updating all orthogonal bases, we randomly sampled 16 instances from the Commonsense dataset, fed them into the model, and applied SVD to the model's weights. We then recorded the projection magnitudes of input features along all orthogonal directions. Figure 4 visualizes the results for the middle layer, revealing the following insights:

1) Without accounting for the scaling effect of singular values, the projection magnitude along the principal singular vectors consistently exceeds that in other directions. This observation supports PiSSA's approach, which updates based on the principal singular values and vectors, leading to improved training performance. In contrast, LoRA projects in random directions, resulting in uniform projection magnitudes across all directions.

2) The singular values in the original model reflect the importance of each direction in the pretraining task. The model amplifies the components along directions with larger singular values and suppresses those along smaller singular values. Therefore, it is crucial to consider the scaling effect of singular values. As shown in Figure 4c, the projection magnitude along the principal singular vector directions increases to 18%.

3) While more data projections align with the principal singular vectors at higher ranks, 82% of the feature components are still projected onto other directions. In extreme cases, if a task is entirely orthogonal to the vectors used by PiSSA, training on such a task may result in zero gradients, thereby limiting its learning capacity. Under the same rank constraint, 94% of the feature components in LoRA are projected outside the LoRA adapter, making it more susceptible to the zero-gradient problem.

Since CLOVER updates across all orthogonal directions, as shown in Figure 4d, it effectively mitigates this issue. Consequently, CLOVER outperforms both LoRA and PiSSA in multi-task learning, even when using the same or fewer learnable parameters (Section 4.2).

[Figure 4 panels, share of data projection per group of directions (Top 256 / Next 256 / Remaining / Bottom 256): (a) PiSSA: 10% / 7% / 76% / 7%; (b) LoRA: 6% / 7% / 81% / 6%; (c) PiSSA with Singular Value: 18% / 10% / 65% / 7%; (d) CLOVER: 100%.]

Figure 4. Proportion of data projections across different components in random directions (LoRA) versus orthogonal directions (PiSSA), as well as all orthogonal directions (CLOVER).
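The projection analysis can be sketched as follows. Random matrices stand in for the input features and the pretrained weight, the rank budget `r` is arbitrary, and the helper `share` is hypothetical; the sketch only shows how the share captured by the top-`r` directions is computed with and without singular-value scaling, not the percentages in Figure 4.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, r = 16, 64, 8                   # 16 samples as in the text; toy D and r
X = rng.standard_normal((n, D))       # stand-in for input features
W = rng.standard_normal((D, D))       # stand-in for a pretrained weight

U, S, Vt = np.linalg.svd(W)           # X @ W = (X @ U) * S @ Vt

def share(magnitude):
    """Fraction of the total projection magnitude in the top-r and remaining directions."""
    total = magnitude.sum()
    return {"top": magnitude[:r].sum() / total, "rest": magnitude[r:].sum() / total}

proj = np.abs(X @ U).mean(axis=0)     # mean |projection| on each singular direction
unscaled = share(proj)                # ignoring singular values
scaled = share(proj * S)              # accounting for singular-value scaling

print(unscaled)
print(scaled)
```

Because the singular values are sorted in descending order, scaling can only increase the share of the top-`r` directions, yet the remaining directions still hold a non-zero share that a top-`r` adapter cannot reach.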
4.6. Visualizing Rank Updates
To demonstrate that CLOVER achieves full-rank updates, we multiply the updated singular values with their corresponding singular vectors ($S_{QK}$ applied to the Key layer, $S_{VO}$ to the Value layer, and $S_{UD}$ to the Up layer) and perform SVD on the resulting change $\Delta W$ relative to the base model. We use LoRA and Full Fine-Tuning for comparison. Figure 5 shows the singular values of the middle layer in LLaMA-2-7B, revealing that CLOVER and Full Fine-Tuning achieve full-rank updates, while LoRA is constrained by its low-rank design.

[Figure 5: three panels of singular values over indices 0 to 4000. (a) Full Fine-Tuning, (b) LoRA, (c) CLOVER.]

Figure 5. $\Delta W$ is low rank in LoRA, while it is full rank for Full Fine-Tuning and CLOVER.
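The rank contrast in Figure 5 can be illustrated on a toy matrix. The sketch below is an assumption-laden miniature, not the paper's experiment: the weights are random, `numerical_rank` is a hypothetical helper, and the CLOVER-style update is modeled as a dense change to the singular-value matrix inside fixed bases of a single block.

```python
import numpy as np

def numerical_rank(M, tol=1e-8):
    """Count singular values above tol times the largest singular value."""
    s = np.linalg.svd(M, compute_uv=False)
    return int((s > tol * s.max()).sum())

rng = np.random.default_rng(0)
d, r = 64, 4
W = rng.standard_normal((d, d))            # stand-in for a pretrained weight

# LoRA-style update: product of two thin matrices, so rank is at most r.
B = rng.standard_normal((d, r))
A = rng.standard_normal((r, d))
delta_lora = B @ A

# CLOVER-style update (assumed form): keep the orthogonal bases U and Vt fixed
# and train only the d x d matrix that sits between them.
U, s, Vt = np.linalg.svd(W)
delta_S = 0.01 * rng.standard_normal((d, d))
delta_clover = U @ delta_S @ Vt

rank_lora = numerical_rank(delta_lora)
rank_clover = numerical_rank(delta_clover)
print(rank_lora, rank_clover)
```

Because `U` and `Vt` are orthogonal, the rank of `delta_clover` equals the rank of `delta_S`, so a dense trainable matrix between fixed bases is enough to produce a full-rank update.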
4.7. CLOVER Avoids Intrusive Dimensions
Recent research (Shuttleworth et al., 2024) has highlighted an issue with LoRA, referred to as the “intrusive dimensions” phenomenon. As illustrated in Figure 6b, LoRA introduces new random directions into the model, which possess large magnitudes and thus precede all the original singular vectors. The study suggests that these “intrusive dimensions” can degrade the model's performance, exacerbating catastrophic forgetting during continual learning with LoRA. In contrast, CLOVER addresses this issue by fixing all orthogonal bases and updating only the vector combinations. As a result, the changes introduced by CLOVER fine-tuning closely resemble those generated by full-parameter fine-tuning, as shown in Figure 6a and Figure 6c.

[Figure 6: three panels with both axes running from 0 to 400. (a) Full Fine-Tuning, (b) LoRA, (c) CLOVER.]

Figure 6. The intruder dimensions phenomenon appears in LoRA but does not exist in Full Fine-Tuning or CLOVER.
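One way to probe for such directions is to compare singular vectors before and after fine-tuning. The sketch below is a simplified reading of the idea, assuming that a new direction shows up as a fine-tuned singular vector with low cosine similarity to every pretrained singular vector. The matrices are random stand-ins, the helper name is hypothetical, and the basis-preserving update only rescales singular values, which is a simplification of CLOVER.

```python
import numpy as np

def max_cosine_to_pretrained(W_pre, W_ft):
    """For each left singular vector of W_ft, return its largest |cosine|
    with any left singular vector of W_pre."""
    U_pre, _, _ = np.linalg.svd(W_pre)
    U_ft, _, _ = np.linalg.svd(W_ft)
    sim = np.abs(U_pre.T @ U_ft)   # entry (i, j): pretrained vector i vs fine-tuned vector j
    return sim.max(axis=0)

rng = np.random.default_rng(0)
d = 128
W_pre = rng.standard_normal((d, d))
U, s, Vt = np.linalg.svd(W_pre)

# LoRA-like update: a large-magnitude rank-1 term in a random direction.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
W_lora = W_pre + 10.0 * s[0] * np.outer(u, v)

# Basis-preserving update: same singular vectors, rescaled singular values.
W_fixed = U @ np.diag(s * rng.uniform(0.5, 1.5, size=d)) @ Vt

sim_lora = max_cosine_to_pretrained(W_pre, W_lora)
sim_fixed = max_cosine_to_pretrained(W_pre, W_fixed)
print(sim_lora[0], sim_fixed.min())
```

The leading singular vector of `W_lora` matches no pretrained direction well, while every singular vector of `W_fixed` still coincides with a pretrained one.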
## 5. Conclusion and Limitations
In this paper, we introduce Cross-Layer Orthogonal Vectors (CLOVER), a method that orthogonalizes vectors within attention heads without requiring additional transformation matrices. This orthogonalization process condenses effective parameters into fewer vectors, improving the pruning ratio. By fine-tuning the singular values obtained through orthogonalization, CLOVER learns linear combinations of orthogonal bases, enabling full-rank updates. When applied to prune 50% of the attention head parameters in GPT-2 XL, CLOVER results in a perplexity that is just one-tenth of that achieved by standard pruning methods. For Whisper-Large-v3, CLOVER removes 46.42% of the parameters without fine-tuning, while preserving model performance. Furthermore, when used for fine-tuning, CLOVER outperforms state-of-the-art methods such as LoRA, DoRA, HiRA, and PiSSA, achieving superior results with equal or fewer trainable parameters. We also demonstrate how CLOVER removes linear redundancy to facilitate pruning and discuss the necessity of fine-tuning across all orthogonal bases. Visual comparisons of models fine-tuned with different methods further illustrate its effectiveness.

Despite its advantages, CLOVER has some limitations. When nonlinear operations are present between Q-K or V-O pairs (such as with the widely used RoPE (Su et al., 2024)), cross-layer orthogonalization is not feasible. In these cases, we instead perform head-wise orthogonalization within the Key layer during fine-tuning. Fortunately, CLOVER Fine-Tuning can apply intra-layer attention head orthogonalization, while CLOVER Pruning remains applicable to many popular models, including DeepSeek (DeepSeek-AI, 2024; Liu et al., 2024b) (which uses Decoupled RoPE), ViT and SDXL (which use absolute positional encoding), and BLOOM (Le Scao et al., 2023) (which employs Alibi relative positional encoding (Press et al., 2021)). Additionally, as a newly proposed method, our current evaluation focuses primarily on basic pruning tasks and does not include comparisons with other state-of-the-art pruning techniques. However, because CLOVER does not alter the model structure and only updates the initialization method, it can be combined with existing pruning methods to further enhance their effectiveness.

As a novel technique, CLOVER holds considerable promise for future applications. For instance, it could be combined with quantization methods to eliminate outliers, guide pruning and fine-tuning based on data feature directions, or even inspire new model architectures.
## Impact Statement
This paper proposes a cross-layer orthogonal initialization method to guide model pruning and efficient fine-tuning, offering valuable insights for the application and development of large models. Both application directions aim to reduce training and inference costs, lower computational overhead, decrease power consumption, and minimize carbon emissions.
## References
Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024.

AI@Meta. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023. doi: 10.48550/arXiv.2307.09288. URL https://doi.org/10.48550/arXiv.2307.09288.

AI@Meta. Llama 3 model card, 2024. URL https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.

Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

Anonymous. HiRA: Parameter-efficient hadamard high-rank adaptation for large language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=TwJrTz9cRS.

Ansell, A., Ponti, E. M., Korhonen, A., and Vulić, I. Composable sparse fine-tuning for cross-lingual transfer. arXiv preprint arXiv:2110.07560, 2021.

Anthropic. Claude 3.5 sonnet, 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet.

Asai, A., Salehi, M., Peters, M. E., and Hajishirzi, H. Attempt: Parameter-efficient multi-task tuning via attentional mixtures of soft prompts. arXiv preprint arXiv:2205.11961, 2022.

Ashkboos, S., Croci, M. L., Nascimento, M. G. d., Hoefler, T., and Hensman, J. Slicegpt: Compress large language models by deleting rows and columns. arXiv preprint arXiv:2401.15024, 2024.

Bisk, Y., Zellers, R., Gao, J., Choi, Y., et al. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 7432–7439, 2020.

Brandon, W., Mishra, M., Nrusimha, A., Panda, R., and Kelly, J. R. Reducing transformer key-value cache size with cross-layer attention. arXiv preprint arXiv:2405.12981, 2024.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.

Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., and Jitsev, J. Reproducible scaling laws for contrastive language-image learning. arXiv preprint arXiv:2212.07143, 2022.
Clark, C., Lee, K., Chang, M.-W., Kwiatkowski, T., Collins, M., and Toutanova, K. Boolq: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.

Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

De, S., Smith, S. L., Fernando, A., Botev, A., Cristian-Muraru, G., Gu, A., Haroun, R., Berrada, L., Chen, Y., Srinivasan, S., et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427, 2024.

DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. CoRR, abs/2405.04434, 2024. URL https://doi.org/10.48550/arXiv.2405.04434.

Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. Gpt3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318–30332, 2022.

Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.

Fu, Q., Cho, M., Merth, T., Mehta, S., Rastegari, M., and Najibi, M. Lazyllm: Dynamic token pruning for efficient long context llm inference. arXiv preprint arXiv:2407.14057, 2024.

Fu, Z., Yang, H., So, A. M.-C., Lam, W., Bing, L., and Collier, N. On the effectiveness of parameter-efficient fine-tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 12799–12807, 2023.
Gandhi, S., von Platen, P., and Rush, A. M. Distil-whisper: Robust knowledge distillation via large-scale pseudo labelling. arXiv preprint arXiv:2311.00430, 2023.

Gokaslan, A. and Cohen, V. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019.

Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

Guo, D., Rush, A. M., and Kim, Y. Parameter-efficient transfer learning with diff pruning. arXiv preprint arXiv:2012.07463, 2020.

Guo, S., Xu, J., Zhang, L. L., and Yang, M. Compresso: Structured pruning with collaborative prompting learns compact large language models. arXiv preprint arXiv:2310.05015, 2023.

Hambardzumyan, K., Khachatrian, H., and May, J. Warp: Word-level adversarial reprogramming. arXiv preprint arXiv:2101.00121, 2021.

Hooper, C., Kim, S., Mohammadzadeh, H., Mahoney, M. W., Shao, Y. S., Keutzer, K., and Gholami, A. Kvquant: Towards 10 million context length llm inference with kv cache quantization. arXiv preprint arXiv:2401.18079, 2024.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

Jiang, T., Huang, S., Luo, S., Zhang, Z., Huang, H., Wei, F., Deng, W., Sun, F., Zhang, Q., Wang, D., et al. Mora: High-rank updating for parameter-efficient fine-tuning. arXiv preprint arXiv:2405.12130, 2024.

Jo, H.-r. and Shin, D. A2sf: Accumulative attention scoring with forgetting factor for token pruning in transformer decoder. arXiv preprint arXiv:2407.20485, 2024.

Katharopoulos, A., Vyas, A., Pappas, N., and Fleuret, F. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pp. 5156–5165. PMLR, 2020.

Lawton, N., Kumar, A., Thattai, G., Galstyan, A., and Steeg, G. V. Neural architecture search for parameter-efficient fine-tuning of large pre-trained language models. arXiv preprint arXiv:2305.16597, 2023.

Le Scao, T., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Castagné, R., Luccioni, A. S., Yvon, F., Gallé, M., et al. Bloom: A 176b-parameter open-access multilingual language model. 2023.

Lester, B., Al-Rfou, R., and Constant, N. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.

Li, M., Lin, Y., Zhang, Z., Cai, T., Li, X., Guo, J., Xie, E., Meng, C., Zhu, J.-Y., and Han, S. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models. arXiv preprint arXiv:2411.05007, 2024a.

Li, X. L. and Liang, P. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.

Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469, 2024b.

Lingam, V., Tejaswi, A., Vavre, A., Shetty, A., Gudur, G. K., Ghosh, J., Dimakis, A., Choi, E., Bojchevski, A., and Sanghavi, S. Svft: Parameter-efficient fine-tuning with singular vectors. arXiv preprint arXiv:2405.19597, 2024.

Liu, A., Feng, B., Wang, B., Wang, B., Liu, B., Zhao, C., Dengr, C., Ruan, C., Dai, D., Guo, D., et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434, 2024a.

Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024b.

Liu, A., Liu, J., Pan, Z., He, Y., Haffari, G., and Zhuang, B. Minicache: Kv cache compression in depth dimension for large language models. arXiv preprint arXiv:2405.14366, 2024c.

Liu, S.-Y., Wang, C.-Y., Yin, H., Molchanov, P., Wang, Y.-C. F., Cheng, K.-T., and Chen, M.-H. Dora: Weight-decomposed low-rank adaptation. arXiv preprint arXiv:2402.09353, 2024d.
Liu, W., Qiu, Z., Feng, Y., Xiu, Y., Xue, Y., Yu, L., Feng, H., Liu, Z., Heo, J., Peng, S., et al. Parameter-efficient orthogonal finetuning via butterfly factorization. arXiv preprint arXiv:2311.06243, 2023a.

Liu, X., Zheng, Y., Du, Z., Ding, M., Qian, Y., Yang, Z., and Tang, J. Gpt understands, too. AI Open, 2023b.

Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024e.
Ma, X., Fang, G., and Wang, X. Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720, 2023.

Meng, F., Wang, Z., and Zhang, M. Pissa: Principal singular values and singular vectors adaptation of large language models. arXiv preprint arXiv:2404.02948, 2024.

Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models, 2016.

Mihaylov, T., Clark, P., Khot, T., and Sabharwal, A. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.

Mistral. Cheaper, better, faster, stronger: Continuing to push the frontier of ai and making it accessible to all, 2024. URL https://mistral.ai/news/mixtral-8x22b.

OpenAI. Hello GPT-4o, 2024. URL https://openai.com/index/hello-gpt-4o/.

Peng, B., Alcaide, E., Anthony, Q., Albalak, A., Arcadinho, S., Biderman, S., Cao, H., Cheng, X., Chung, M., Grella, M., et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048, 2023.

Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.

Press, O., Smith, N. A., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.

Qiu, Z., Liu, W., Feng, H., Xue, Y., Feng, Y., Liu, Z., Zhang, D., Weller, A., and Schölkopf, B. Controlling text-to-image diffusion by orthogonal finetuning. Advances in Neural Information Processing Systems, 36:79320–79362, 2023.

Qwen. Qwen2.5: A party of foundation models, 2024. URL https://qwenlm.github.io/blog/qwen2.5.

Radford, A. Improving language understanding by generative pre-training. 2018.

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., and Sutskever, I. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pp. 28492–28518. PMLR, 2023.

Sakaguchi, K., Bras, R. L., Bhagavatula, C., and Choi, Y. Winogrande: An adversarial winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.

Sap, M., Rashkin, H., Chen, D., LeBras, R., and Choi, Y. Socialiqa: Commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728, 2019.

Shazeer, N. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.

Shuttleworth, R., Andreas, J., Torralba, A., and Sharma, P. Lora vs full fine-tuning: An illusion of equivalence. arXiv preprint arXiv:2410.21228, 2024.

Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.

Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.

Sun, Y., Dong, L., Zhu, Y., Huang, S., Wang, W., Ma, S., Zhang, Q., Wang, J., and Wei, F. You only cache once: Decoder-decoder architectures for language models. arXiv preprint arXiv:2405.05254, 2024.

Sung, Y.-L., Nair, V., and Raffel, C. A. Training neural networks with fixed sparse masks. Advances in Neural Information Processing Systems, 34:24193–24205, 2021.

Team, G., Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024a.
Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024b.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Vu, T., Lester, B., Constant, N., Al-Rfou, R., and Cer, D. Spot: Better frozen model adaptation through soft prompt transfer. arXiv preprint arXiv:2110.07904, 2021.

Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768, 2020.
Wang, S., Yu, L., and Li, J. Lora-ga: Low-rank adaptation with gradient approximation. arXiv preprint arXiv:2407.05000, 2024a.

Wang, Z., Panda, R., Karlinsky, L., Feris, R., Sun, H., and Kim, Y. Multitask prompt tuning enables parameter-efficient transfer learning. arXiv preprint arXiv:2303.02861, 2023.

Wang, Z., Liang, J., He, R., Wang, Z., and Tan, T. Lora-pro: Are low-rank adapters properly optimized? arXiv preprint arXiv:2407.18242, 2024b.

Xia, M., Gao, T., Zeng, Z., and Chen, D. Sheared llama: Accelerating language model pre-training via structured pruning. arXiv preprint arXiv:2310.06694, 2023.

Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. Smoothquant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023.

Xu, R., Luo, F., Zhang, Z., Tan, C., Chang, B., Huang, S., and Huang, F. Raise a child in large language model: Towards effective and generalizable fine-tuning. arXiv preprint arXiv:2109.05687, 2021.

Yu, H., Yang, Z., Li, S., Li, Y., and Wu, J. Effectively compress kv heads for llm. arXiv preprint arXiv:2406.07056, 2024.

Yuan, S., Liu, H., and Xu, H. Bridging the gap between low-rank and orthogonal adaptation via householder reflection adaptation. arXiv preprint arXiv:2405.17484, 2024.

Zaken, E. B., Ravfogel, S., and Goldberg, Y. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.

Zellers, R., Holtzman, A., Bisk, Y., Farhadi, A., and Choi, Y. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.

Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N., He, P., Cheng, Y., Chen, W., and Zhao, T. Adalora: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023.

Zhao, J., Zhang, Z., Chen, B., Wang, Z., Anandkumar, A., and Tian, Y. Galore: Memory-efficient llm training by gradient low-rank projection. arXiv preprint arXiv:2403.03507, 2024.

Zhao, M., Lin, T., Mi, F., Jaggi, M., and Schütze, H. Masking as an efficient alternative to finetuning for pretrained language models. arXiv preprint arXiv:2004.12406, 2020.

Zi, B., Qi, X., Wang, L., Wang, J., Wong, K.-F., and Zhang, L. Delta-lora: Fine-tuning high-rank parameters with the delta of low-rank matrices. arXiv preprint arXiv:2309.02411, 2023.

Zuhri, Z. M. K., Adilazuarda, M. F., Purwarianti, A., and Aji, A. F. Mlkv: Multi-layer key-value heads for memory efficient transformer decoding. arXiv preprint arXiv:2406.09297, 2024.
## A. Appendix
A.1. Cross Layer Orthogonal Vectors in Value and Output layers
In the main text, we only presented the orthogonalization process for the Q-K pair. Here, we provide the method for orthogonalizing the V-O pair. Additionally, for up-down layers, the output dimension of the Up layer can be reshaped into block number × block size, followed by performing orthogonal decomposition within each block.
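$$
\begin{aligned}
Y &= \mathrm{attn}(Q, K)\, V W_O, && V = X W_V \in \mathbb{R}^{b \times h \times n \times d} && (1) \\
&= \mathrm{attn}(Q, K)\, X W_V W_O, && W_V W_O = W_{VO} = U S V \in \mathbb{R}^{h \times D \times D} && (2) \\
&= \mathrm{attn}(Q, K)\, X U S V, && S_{[:,\, r_{vo}:,\, r_{vo}:]} = 0,\ S_{VO} \in \mathbb{R}^{h \times r_{vo} \times r_{vo}},\ r_{vo} \le d && (3) \\
&= \mathrm{attn}(Q, K)\, X U_{VO} S_{VO} V_{VO}, && U_{VO} \in \mathbb{R}^{D \times h \times r_{vo}},\ V_{VO} \in \mathbb{R}^{h \times r_{vo} \times D} && (4)
\end{aligned}
$$

Through this series of transformations, $W_V$ and $W_O$ can be equivalently replaced by the orthogonal vectors $U_{VO}$ and $V_{VO}$, along with the diagonal matrix $S_{VO}$. Since $r_{vo} \le d$, the zero singular values and their corresponding singular vectors can be safely pruned. After guided pruning, $S_{VO}$ can be merged into $U_{VO}$ and $V_{VO}$, resulting in no additional computational overhead.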
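The V-O transformation above can be sketched with a batched SVD. This is an illustrative sketch rather than the authors' implementation: the function names and the per-head tensor layout `(h, D, d)` are assumptions, and splitting the singular values as square roots across the two factors is one possible way to merge them, since the text does not say how the merge is done.

```python
import numpy as np

def clover_vo(W_V, W_O, r_vo):
    """Per-head SVD of the product W_V W_O.

    W_V: (h, D, d) value projection per head.
    W_O: (h, d, D) output projection per head.
    Returns U_VO (h, D, r_vo), S_VO (h, r_vo), V_VO (h, r_vo, D).
    """
    W_VO = W_V @ W_O                                     # (h, D, D), rank <= d per head
    U, S, Vt = np.linalg.svd(W_VO, full_matrices=False)  # SVD batched over heads
    return U[:, :, :r_vo], S[:, :r_vo], Vt[:, :r_vo, :]

def merge_singular_values(U_VO, S_VO, V_VO):
    """Fold sqrt(S_VO) into both factors so no extra matrix remains at inference."""
    root = np.sqrt(S_VO)
    return U_VO * root[:, None, :], root[:, :, None] * V_VO

rng = np.random.default_rng(0)
h, D, d = 2, 16, 4
W_V = rng.standard_normal((h, D, d))
W_O = rng.standard_normal((h, d, D))

# Keeping r_vo = d directions loses nothing, because W_V W_O has rank at most d.
U_VO, S_VO, V_VO = clover_vo(W_V, W_O, r_vo=d)
new_W_V, new_W_O = merge_singular_values(U_VO, S_VO, V_VO)
print(np.allclose(new_W_V @ new_W_O, W_V @ W_O))
```

Choosing `r_vo < d` drops the directions with the smallest singular values and yields proportionally smaller `new_W_V` and `new_W_O`, which is the pruning case described in the text.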
A.2. Hyperparameters
Table 3 presents a comparison of hyperparameters for different fine-tuning methods on commonsense tasks. The target modules are the same for LoRA, DoRA, HiRA, and PiSSA. However, DoRA introduces an additional magnitude module, leading to a slightly higher parameter count. In a single layer of LoRA, the trainable parameters are as follows:
Q = 4096 × 32 + 4096 × 32
K = 4096 × 32 + 4096 × 32
V = 4096 × 32 + 4096 × 32
Up = 4096 × 32 + 11008 × 32
Down = 4096 × 32 + 11008 × 32
The total sum is 1,753,088. In CLOVER, the trainable parameters are:
QK = 32 × 128 × 128
V O = 32 × 128 × 128
U D = 172 × 64 × 64
The total sum is also 1,753,088. Since CLOVER inserts trainable parameters across layers, we use the Q-K pair notation to represent its target modules. When CLOVER updates parameters within an attention head, the number of trainable parameters matches exactly that of LoRA at rank 32. To adjust the number of learnable parameters, CLOVER can either span multiple heads or split a single head into multiple blocks. Both PiSSA and CLOVER exhibit stable training performance. Therefore, instead of validating every 80 steps, we omit frequent validation, improving training efficiency.
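The two totals can be checked directly from the formulas above. The variable names below are illustrative; the dimensions (4096, 11008, rank 32, 32 heads of size 128, and 172 blocks of size 64) are taken from the listed expressions.

```python
hidden, inter, rank = 4096, 11008, 32

# LoRA: each adapted layer adds an (in x rank) and a (rank x out) matrix.
lora_attn = 3 * (hidden * rank + hidden * rank)   # Q, K, V
lora_mlp = 2 * (hidden * rank + inter * rank)     # Up, Down
lora_total = lora_attn + lora_mlp

# CLOVER: one square trainable matrix per head (Q-K, V-O) or per block (U-D).
heads, head_dim = 32, 128
blocks, block_size = 172, 64                      # 172 * 64 == 11008
clover_total = 2 * heads * head_dim * head_dim + blocks * block_size * block_size

print(lora_total, clover_total)  # 1753088 1753088
```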
A.3. Detailed Dataset Information
The commonsense reasoning tasks consist of 8 subtasks, each with predefined training and testing sets, as described by LLM-Adapters (Hu et al., 2023). Table 4 lists the details of each sub-dataset.
Table 3. Detailed training hyperparameters. Q-K, V-O, and U-D mean that CLOVER updates pairs of orthogonal vectors.

| Method | Target | Evaluation steps | LR | Scheduler | Batch size | Warmup steps | Epochs |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LoRA | Q,K,V,U,D | 80 | 3e-4 | Linear | 16 | 100 | 3 |
| DoRA | Q,K,V,U,D | 80 | 2e-4 | Linear | 16 | 100 | 3 |
| HiRA | Q,K,V,U,D | 80 | 2e-4/2e-4 | Linear | 32 | 100 | 3 |
| PiSSA | Q,K,V,U,D | - | 2e-5 | Linear | 16 | 100 | 3 |
| CLOVER | Q-K,V-O,U-D | - | 1e-4 | Linear | 16 | 100 | 3 |
Table 4. Details of datasets for commonsense reasoning tasks.

| Dataset | Train | Test | About |
| --- | --- | --- | --- |
| BoolQ (Clark et al., 2019) | 9,427 | 3,270 | Naturally occurring yes/no questions from unconstrained settings. |
| PIQA (Bisk et al., 2020) | 16,113 | 1,838 | Questions with two solutions requiring physical commonsense. |
| SIQA (Sap et al., 2019) | 33,410 | 1,954 | Reasoning about actions and social implications. |
| HellaSwag (Zellers et al., 2019) | 39,905 | 10,042 | Commonsense NLI questions with context and endings. |
| WinoGrande (Sakaguchi et al., 2021) | 40,398 | 1,267 | Fill-in-the-blank task with binary options. |
| ARC-e (Clark et al., 2018) | 2,251 | 2,376 | Grade-school multiple-choice science questions in Easy sets. |
| ARC-c (Clark et al., 2018) | 1,119 | 1,172 | Grade-school multiple-choice science questions in Challenge sets. |
| OBQA (Mihaylov et al., 2018) | 4,957 | 500 | Questions requiring multi-step reasoning and commonsense knowledge. |
For WinoGrande, the original dataset includes multiple partitions: [xs, s, m, l, xl, debiased]. While LLM-Adapters simply concatenated all these partitions, note that the “xl” partition actually includes all others, leading to extensive data duplication. After removing duplicates, the training data is reduced from 63.2K to 40.4K instances. Additionally, in the LLM-Adapters paper, the training set sizes of ARC Challenge and ARC Easy were reversed by mistake; here, we correct that error.
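The de-duplication step can be sketched as below. This is a hypothetical helper, not the script used for the paper: the field names (`sentence`, `option1`, `option2`, `answer`) are assumed, and the two toy partitions only mimic the situation where the largest split contains the smaller ones.

```python
def deduplicate(examples, key_fields=("sentence", "option1", "option2", "answer")):
    """Keep the first occurrence of each example, comparing only key_fields."""
    seen = set()
    unique = []
    for example in examples:
        key = tuple(example[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique

# Toy stand-ins for two partitions; "xl" contains every example in "xs".
xs = [{"sentence": "A _ B", "option1": "a", "option2": "b", "answer": "1"}]
xl = xs + [{"sentence": "C _ D", "option1": "c", "option2": "d", "answer": "2"}]

merged = deduplicate(xs + xl)   # naive concatenation followed by de-duplication
print(len(xs + xl), len(merged))
```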
A.4. LibriSpeech Long dataset target transcript
Below is the reference text of the LibriSpeech Long dataset for comparison.
Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Up Guards and Adam paintings, and Mason's exquisite idles are as national as a jingo poem. Mr. Birkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth, and Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampooer in a Turkish bath, next man.
In fact, with Vanilla Pruning ratios of just 22.31% and 6.69% for $W_Q$-$W_K$ and $W_V$-$W_O$, respectively, the model's output is already significantly degraded:
Mr. Colter is the personal of the classes, and we are glad to welcome his gospel. Nor is Mr. Colters manner less interesting than his manner. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similarly he is drawn from eating and its results occur most readily to the mind. He is very dull, so very frequently, and is very Greek after all, and can discover in it but little of Rocky Ithaca. The Nells pictures are sort of up-guard to Adams paintings, and Masons exquisite idylls are as national as a jingle poem. Mr. Burke and Fosters landscapes smile at one much in the same way as Mr. Parker, Mr. Flash is tits. And Mr. John Collier gives his sitter a cheerful slap on the back before he says like a shampoo and a Turkish bath, Next man.
A.5. Visualizing more attention heads
In Section 4.3, we only presented the first attention head in the first layer. Here, we provide a broader view by showcasing more attention heads. Figure 7 illustrates the L2 norm of all Q-K heads in the first, middle, and last layers of Whisper-Large-v3. Figure 8 shows the L2 norm of all Q-K heads in the first, middle, and last layers of ViT-bigG. From these figures, we can observe that CLOVER consistently represents the entire attention head with fewer orthogonal bases across all layers and all attention heads. This property forms the foundation of CLOVER's effectiveness in enhancing pruning.

[Figure 7: three panels (Layer.0.qk, Layer.15.qk, Layer.31.qk), each with x-axis ticks from 0 to 1200 and two curves labeled "Absorb and Decompose" and "Vanilla".]

Figure 7. The L2-norm for the 0-th, 15-th, and 31-st attention layers in the Whisper-large-v3 encoder. The blue line represents the results after redundancy removal using the CLOVER method, while the orange line depicts the L2-norm directly computed for each dimension.
[Figure 8: three panels (Layer.0.qk, Layer.23.qk, Layer.47.qk), each with x-axis ticks from 0 to 1500 and two curves labeled "Absorb and Decompose" and "Vanilla".]

Figure 8. The L2-norm for the 0-th, 23-rd, and 47-th attention layers in ViT-bigG. The blue line represents the results after redundancy removal using the CLOVER method, while the orange line depicts the L2-norm directly computed for each dimension.
@@ -0,0 +1,970 @@
Title: 2404.03592v3.pdf
URL Source: https://arxiv.org/pdf/2404.03592
Published Time: Fri, 24 May 2024 01:13:51 GMT
Number of Pages: 49
Markdown Content:
# ReFT: Representation Finetuning for Language Models
Zhengxuan Wu∗† Aryaman Arora∗† Zheng Wang† Atticus Geiger‡ Dan Jurafsky† Christopher D. Manning† Christopher Potts†

†Stanford University ‡Pr(Ai)²R Group

{wuzhengx,aryamana,peterwz,atticusg}@stanford.edu {jurafsky,manning,cgpotts}@stanford.edu
## Abstract
Parameter-efficient finetuning (PEFT) methods seek to adapt large neural models via updates to a small number of weights. However, much prior interpretability work has shown that representations encode rich semantic information, suggesting that editing representations might be a more powerful alternative. We pursue this hypothesis by developing a family of Representation Finetuning (ReFT) methods. ReFT methods operate on a frozen base model and learn task-specific interventions on hidden representations. We define a strong instance of the ReFT family, Low-rank Linear Subspace ReFT (LoReFT), and we identify an ablation of this method that trades some performance for increased efficiency. Both are drop-in replacements for existing PEFTs and learn interventions that are 15×–65× more parameter-efficient than LoRA. We showcase LoReFT on eight commonsense reasoning tasks, four arithmetic reasoning tasks, instruction-tuning, and GLUE. In all these evaluations, our ReFTs deliver the best balance of efficiency and performance, and almost always outperform state-of-the-art PEFTs. We release a generic ReFT training library publicly at https://github.com/stanfordnlp/pyreft.
## 1 Introduction
Pretrained language models (LMs) are frequently finetuned to adapt them to new domains or tasks [Dai and Le, 2015]. With finetuning, a single base model can be adapted to a variety of tasks given only small amounts of in-domain data. However, finetuning large LMs is expensive. Parameter-efficient finetuning (PEFT) methods propose to address the high costs of full finetuning by updating a small number of weights. This reduces memory usage and training time, and PEFTs achieve similar performance to full finetuning in many settings [Hu et al., 2023].

A hallmark of current state-of-the-art PEFTs is that they modify weights rather than representations. However, much prior interpretability work has shown that representations encode rich semantic information, suggesting that editing representations might be a more powerful alternative to weight updates. In this paper, we pursue this hypothesis by developing and motivating Representation Finetuning (ReFT). Instead of adapting model weights, ReFT methods train interventions that manipulate a small fraction of model representations in order to steer model behaviors to solve downstream tasks at inference time. ReFT methods are drop-in replacements for weight-based PEFTs. This approach is inspired by recent work in LM interpretability that intervenes on representations to find faithful causal mechanisms [Geiger et al., 2023b] and to steer model behaviours at inference time [Turner et al., 2023, Li et al., 2024], and it can be seen as a generalisation of the representation-editing work of Wu et al. [2024a], Turner et al. [2023], and Zou et al. [2023] (see appendix B for details).
> * Equal contribution. Preprint. Under review. arXiv:2404.03592v3 [cs.CL] 22 May 2024

[Figure 1 panels: Commonsense (LLaMA 7B, LLaMA 13B, Llama-2 7B, Llama-3 8B); Instruct-tuning (Llama-2 7B); Arithmetic (LLaMA 7B, LLaMA 13B); GLUE (RoBERTa-base, RoBERTa-large). Axes: Parameters, Performance.]
Figure 1: Parameter count vs. performance for LoReFT and other PEFTs across four benchmarks when applied to LLaMA, Llama-2, Llama-3, and RoBERTa models. Despite training far fewer parameters than existing PEFTs, LoReFT achieves competitive or even state-of-the-art performance on all tasks. Its value is most apparent for the largest models in our evaluations. Note: FT is full-parameter finetuning, which is not a PEFT or ReFT method. Additional results are in section 4.

We focus on a strong and highly efficient instance of the ReFT family that we call Low-rank Linear Subspace ReFT (LoReFT). LoReFT is a parametrisation of ReFT that intervenes on hidden representations in the linear subspace spanned by a low-rank projection matrix, building directly on the distributed alignment search (DAS) method of Geiger et al. [2023b] and Wu et al. [2023]. We also identify an ablation of this method (DiReFT) that trades some performance for increased efficiency. We evaluate our ReFTs on LLaMA-family models and small-scale LMs against existing PEFTs on standard benchmarks from four domains: commonsense reasoning, arithmetic reasoning, instruction-following, and natural language understanding. Compared to LoRA, we find that LoReFT uses 15×–65× fewer parameters while achieving state-of-the-art performance on commonsense reasoning, instruction-following, and natural language understanding against the strongest PEFTs. These findings indicate that ReFT methods are worthy of further exploration, as they may emerge as more efficient and effective alternatives to weight-based PEFTs.
## 2 Related work
Parameter-efficient finetuning methods (PEFTs). PEFTs train a fraction of the model's parameters to adapt it to downstream tasks. We classify PEFTs into three categories:

1. Adapter-based methods train additional modules (e.g. fully-connected layers) on top of the frozen pretrained model. Series adapters insert components between LM attention or MLP layers [Houlsby et al., 2019, Pfeiffer et al., 2020, Wang et al., 2022, He et al., 2022b, Fu et al., 2021], while parallel adapters add modules alongside existing components [He et al., 2022a]. Since adapters add new components that cannot be easily folded into existing model weights, they impose an additional burden at inference time.¹
2. LoRA [Hu et al., 2022] and DoRA [Liu et al., 2024c] use low-rank matrices to approximate additive weight updates during training, and require no additional overhead during inference since the weight updates can be merged into the model. These are the strongest PEFTs currently.²
3. Prompt-based methods add randomly-initialised soft tokens to the input (usually as a prefix) and train their embeddings while keeping the LM weights frozen [Li and Liang, 2021]. These methods are often far from optimal compared to other PEFTs, and come at the cost of significant inference overhead. A variant of this method where hidden-layer activations are also tuned was introduced as a baseline in Hu et al. [2022], with better performance.

> 1 Several very recent papers introduce new adapter architectures but do not benchmark them on the tasks we consider, or they perform hyperparameter-tuning in a different setup than done in this work. These include: LLaMA-Adapter [Zhang et al., 2024b], LLaMA-Adapter v2 [Gao et al., 2023], Aligner [Ziheng et al., 2023].

> 2 Additional methods not studied in this work: AutoLoRA [Zhang et al., 2024c], ResLoRA [Shi et al., 2024], SiRA [Zhu et al., 2023].
Representation editing. Recent work on activation steering and representation engineering shows that adding fixed or task-specific steering vectors [Subramani et al., 2022, Turner et al., 2023, Zou et al., 2023, Liu et al., 2024b, Vogel, 2024, Li et al., 2024] or applying concept erasure [Ravfogel et al., 2022, Belrose et al., 2023, Avitan et al., 2024, Singh et al., 2024] to the residual stream can enable a degree of control over pretrained LM generations without the need for resource-intensive finetuning [Wu et al., 2024a]. The success of these methods affirms that representations induced by pretrained LMs carry rich semantic structure.
Interventional interpretability. Much recent work has used interventions on model-internal states to test hypotheses about how LMs implement various behaviours. In particular, interventions on linear subspaces of representations have provided increasing evidence that human-interpretable concepts are encoded linearly [Smolensky, 1986, Rumelhart et al., 1986, McClelland et al., 1986]. This includes linguistic features such as gender and number [Lasri et al., 2022, Wang et al., 2023, Hanna et al., 2023, Chintam et al., 2023, Yamakoshi et al., 2023, Hao and Linzen, 2023, Chen et al., 2023, Amini et al., 2023, Guerner et al., 2023, Arora et al., 2024], logical and mathematical reasoning [Wu et al., 2023], entity attributes [Huang et al., 2024], and a number of other domains [Mikolov et al., 2013, Elhage et al., 2022, Park et al., 2023, Nanda et al., 2023, Guerner et al., 2023].
## 3 ReFT
We now define the ReFT family of methods. To do this, we first summarize the core motivation, which emerges from work on intervention-based model interpretability. We then show how this leads directly to Low-rank Linear Subspace ReFT (LoReFT). Finally, we generalize this to a family of ReFT methods. Appendix A provides a brief overview of our generic ReFT training library.

To keep the presentation simple, we assume throughout that our target model is a Transformer-based [Vaswani et al., 2017] LM that produces contextualised representations of sequences of tokens. Given a sequence of $n$ input tokens $\mathbf{x} = (x_1, \ldots, x_n)$, the model first embeds these into a list of representations $\mathbf{h}^{(0)} = (\mathbf{h}^{(0)}_1, \ldots, \mathbf{h}^{(0)}_n)$. Then, $m$ layers successively compute the $j$-th list of hidden representations $\mathbf{h}^{(j)}$ as a function of the previous list of hidden representations $\mathbf{h}^{(j-1)}$. Each hidden representation is a vector $\mathbf{h} \in \mathbb{R}^d$. The LM uses the final hidden representations $\mathbf{h}^{(m)}$ to produce its predictions. In our experiments, we consider both autoregressive LMs and masked LMs [Devlin et al., 2019]. An autoregressive LM predicts $p(x_{n+1} \mid x_1, \ldots, x_n) = \mathrm{softmax}(\mathbf{W}\mathbf{h}^{(m)}_n)$, while a masked LM predicts $p(x_i \mid x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = \mathrm{softmax}(\mathbf{W}\mathbf{h}^{(m)}_i)$, where $\mathbf{W}$ is a learned matrix mapping from representations to logits over the vocabulary space.
3.1 Motivation
In interpretability research, the framework of causal abstraction [Geiger et al., 2021] uses interchange interventions to establish the causal role of representations in deep learning models. An interchange intervention fixes a representation to the value it would take if a counterfactual input were processed by the model. Experiments investigating how such interventions affect model behavior form the evidence for claims about the causal role of a representation and the concept it encodes.

[Figure 2 panels: "ReFT Intervention" (left) and "LoReFT" (right). Labels in the LoReFT panel: edit subspace (rows of R), edit vector, edit restricted to subspace.]

Figure 2: Illustration of ReFT. (1) The left panel depicts an intervention $I$: the intervention function $\Phi$ is applied to hidden representations at positions $P$ in layer $l$. (2) The right panel depicts the intervention function used in LoReFT, which finds an edit vector that only modifies the representation in the linear subspace spanned by the rows of $\mathbf{R}$. Specifically, we show how a rank-2 LoReFT operates on 3-dimensional hidden representations.

To test whether a concept is encoded in a linear subspace of a representation, one may use a distributed interchange intervention (DII) [Geiger et al., 2023b].³ Let $\mathbf{b}$ be the hidden representation created at row $i$ and column $k$ when our model processes input $b$, and let $\mathbf{s}$ be the corresponding representation when that same model processes input $s$. A distributed interchange intervention on $\mathbf{b}$ given a counterfactual source representation $\mathbf{s}$ is then defined as

$$\mathrm{DII}(\mathbf{b}, \mathbf{s}, \mathbf{R}) = \mathbf{b} + \mathbf{R}^\top(\mathbf{R}\mathbf{s} - \mathbf{R}\mathbf{b}) \tag{1}$$

where $\mathbf{R} \in \mathbb{R}^{r \times d}$ is a low-rank projection matrix with orthonormal rows, $d$ is the representation dimensionality, and $r$ is the dimensionality of the subspace we are intervening on. We learn the subspace $\mathbf{R}$ using distributed alignment search (DAS), which finds the subspace that maximises the probability of the expected counterfactual output after intervention [Geiger et al., 2023b]. DAS is highly expressive, and can effectively localize concepts within model representations [Wu et al., 2023, Arora et al., 2024, Wu et al., 2024c, Huang et al., 2024]. This suggests that subspace representation interventions could also be a powerful tool for model control.

> 3 This notion of subspace intervention was also independently discovered by Guerner et al. [2023].
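As a rough illustration of eq. (1), the following sketch applies a distributed interchange intervention to a toy 3-dimensional representation. The helper names (`matvec`, `rmatvec`, `dii`) and the axis-aligned choice of R are assumptions made for the example; in DAS the subspace R is learned rather than fixed by hand.

```python
def matvec(M, v):
    """Return M v for a row-major matrix M (a list of rows)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rmatvec(M, u):
    """Return M^T u for a row-major matrix M."""
    return [sum(M[i][j] * u[i] for i in range(len(M))) for j in range(len(M[0]))]


def dii(b, s, R):
    """Eq. (1): b + R^T (R s - R b)."""
    diff = [rs - rb for rs, rb in zip(matvec(R, s), matvec(R, b))]
    return [bj + ej for bj, ej in zip(b, rmatvec(R, diff))]


# Toy setting (assumed): d = 3, r = 2, and R selects the first two axes,
# so its rows are orthonormal.
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
base = [1.0, 2.0, 3.0]
source = [10.0, 20.0, 30.0]

patched = dii(base, source, R)
print(patched)  # [10.0, 20.0, 3.0]
```

Inside the subspace spanned by the rows of R the base representation takes on the source's values, while the third coordinate, which lies outside the subspace, is left untouched.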
3.2 Two low-rank ReFT instantiations

LoReFT. The formulation of DII in eq. (1) immediately suggests a way to control model generations via interventions. The guiding intuition is that we can learn how to perform interventions that steer the model towards predicting our task labels. The resulting method, Low-rank Linear Subspace ReFT (LoReFT), is defined by the following variant of eq. (1):

$$\Phi_{\mathrm{LoReFT}}(\mathbf{h}) = \mathbf{h} + \mathbf{R}^\top(\mathbf{W}\mathbf{h} + \mathbf{b} - \mathbf{R}\mathbf{h}) \tag{2}$$

This is identical to eq. (1), except we use a learned projected source $\mathbf{R}\mathbf{s} = \mathbf{W}\mathbf{h} + \mathbf{b}$. LoReFT thus edits the representation in the $r$-dimensional subspace spanned by the rows of $\mathbf{R}$ to take on the values obtained from our linear projection $\mathbf{W}\mathbf{h} + \mathbf{b}$. We depict this operation in fig. 2. The learned parameters are $\phi = \{\mathbf{R}, \mathbf{W}, \mathbf{b}\}$; the parameters of the LM are frozen. As with DII, $\mathbf{R} \in \mathbb{R}^{r \times d}$ is a low-rank matrix with orthonormal rows where $d$ is the hidden-state dimensionality and $r \le d$ is the rank of the subspace. We further define a linear projection $\mathbf{W} \in \mathbb{R}^{r \times d}$ and bias vector $\mathbf{b} \in \mathbb{R}^r$.
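The claim that LoReFT makes the subspace coordinates "take on" the projected values can be checked in a few lines. This is a short derivation added for illustration; it relies only on the stated assumption that $\mathbf{R}$ has orthonormal rows, i.e. $\mathbf{R}\mathbf{R}^\top = \mathbf{I}_r$:

$$
\begin{aligned}
\mathbf{R}\,\Phi_{\mathrm{LoReFT}}(\mathbf{h})
&= \mathbf{R}\mathbf{h} + \mathbf{R}\mathbf{R}^\top(\mathbf{W}\mathbf{h} + \mathbf{b} - \mathbf{R}\mathbf{h}) \\
&= \mathbf{R}\mathbf{h} + (\mathbf{W}\mathbf{h} + \mathbf{b} - \mathbf{R}\mathbf{h}) \\
&= \mathbf{W}\mathbf{h} + \mathbf{b}.
\end{aligned}
$$

Conversely, for any direction $\mathbf{v}$ orthogonal to the rows of $\mathbf{R}$ (so $\mathbf{R}\mathbf{v} = \mathbf{0}$), we get $\mathbf{v}^\top \Phi_{\mathrm{LoReFT}}(\mathbf{h}) = \mathbf{v}^\top\mathbf{h} + (\mathbf{R}\mathbf{v})^\top(\mathbf{W}\mathbf{h} + \mathbf{b} - \mathbf{R}\mathbf{h}) = \mathbf{v}^\top\mathbf{h}$, so the representation is unchanged outside the subspace. Counting the entries of $\phi = \{\mathbf{R}, \mathbf{W}, \mathbf{b}\}$ from their stated shapes gives $rd + rd + r = r(2d + 1)$ learned parameters per intervention.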
DiReFT. In addition, we define an ablation of LoReFT which removes the orthogonality constraint and the difference operation, reducing training time:

$$\Phi_{\mathrm{DiReFT}}(\mathbf{h}) = \mathbf{h} + \mathbf{W}_2^\top(\mathbf{W}_1\mathbf{h} + \mathbf{b}) \tag{3}$$

Both $\mathbf{W}_1, \mathbf{W}_2 \in \mathbb{R}^{r \times d}$ are low-rank projection matrices. Note that eq. (3) resembles LoRA, and thus DiReFT can be thought of as LoRA applied directly to hidden representations at certain positions.⁴ Empirical evidence from previous work suggests that adding orthogonal constraints to LoRA weights increases performance [Liu et al., 2024d]. (Appendix E reports results for additional ablations of LoReFT.)
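To make the difference between eqs. (2) and (3) concrete, here is a minimal pure-Python sketch of both intervention functions on a toy rank-2, 3-dimensional example (the setting shown in fig. 2). The function names and all numeric values are assumptions chosen for illustration; this is not the pyreft implementation, and the parameters are hand-picked rather than learned.

```python
def matvec(M, v):
    """Return M v for a row-major matrix M (a list of rows)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rmatvec(M, u):
    """Return M^T u for a row-major matrix M."""
    return [sum(M[i][j] * u[i] for i in range(len(M))) for j in range(len(M[0]))]


def loreft(h, R, W, b):
    """Eq. (2): h + R^T (W h + b - R h); R is assumed to have orthonormal rows."""
    target = [wh + bi for wh, bi in zip(matvec(W, h), b)]
    delta = [t - rh for t, rh in zip(target, matvec(R, h))]
    return [hj + ej for hj, ej in zip(h, rmatvec(R, delta))]


def direft(h, W1, W2, b):
    """Eq. (3): h + W2^T (W1 h + b); no orthogonality, no difference term."""
    u = [wh + bi for wh, bi in zip(matvec(W1, h), b)]
    return [hj + ej for hj, ej in zip(h, rmatvec(W2, u))]


def num_params(d, r):
    """Two r x d matrices plus one length-r bias, for either variant."""
    return 2 * r * d + r


h = [1.0, 2.0, 3.0]

# LoReFT with hand-picked (not learned) parameters.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W = [[0.0, 0.0, 2.0], [1.0, 0.0, 0.0]]
b = [0.5, -1.0]
lo = loreft(h, R, W, b)
print(lo)  # [6.5, 0.0, 3.0]

# DiReFT with hand-picked (not learned) parameters.
W1 = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
W2 = [[0.0, 2.0, 0.0], [1.0, 0.0, 1.0]]
b2 = [0.0, 1.0]
di = direft(h, W1, W2, b2)
print(di)  # [5.0, 4.0, 7.0]
```

In the LoReFT output the first two coordinates equal Wh + b and the third is untouched, whereas the DiReFT edit is free to move every coordinate because W2 is unconstrained.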
Training objective. We consider both generation tasks using decoder-only or encoder-decoder LMs and classification tasks using encoder-only models. The pretrained language model induces a distribution over token sequences $p(\cdot)$. We denote the model that results from the ReFT intervention $\Phi$ on $p(\cdot)$ as $p_\Phi(\cdot)$ with trainable parameters $\phi$. To simplify notation, we refer to the hidden representations produced by the LM on input $\mathbf{x}$ as $\mathbf{h}(\mathbf{x})$, and those by the intervened LM as $\mathbf{h}_\Phi(\mathbf{x})$.

For generation tasks, our training objective is language modelling. Given an input sequence $\mathbf{x} = (x_1, \ldots, x_n)$ with $n$ tokens as the prompt, the goal is to predict the output sequence $\mathbf{y} = (y_1, \ldots, y_m)$ with $m$ tokens. We minimise the cross-entropy loss with teacher-forcing over all output positions.

$$\min_{\phi} \left\{ -\sum_{i=1}^{m} \log p_\Phi(y_i \mid \mathbf{x}\mathbf{y}_{<i}) \right\} \tag{4}$$

> 4 LoRA is not applicable to the residual stream, which is weightless. LoRA can be configured to apply only to the attention layer output projection matrix, which is similar to our residual stream intervention. However, previous works found that applying LoRA only to attention layers is sub-optimal [Hu et al., 2023].

For single-label classification tasks, we add a classification head $H_\theta(\cdot)$ with parameters $\theta$ that takes the final-layer representation at the first token (CLS) as input and outputs a distribution over classes. $H$ has the learned parameters $\theta = \{\mathbf{W}_o, \mathbf{b}_o, \mathbf{W}_d, \mathbf{b}_d\}$.

$$H_\theta(\cdot \mid \mathbf{h}) = \mathrm{softmax}\left(\mathbf{W}_o\left(\tanh(\mathbf{W}_d\mathbf{h}^{(m)}_1 + \mathbf{b}_d)\right) + \mathbf{b}_o\right) \tag{5}$$

We learn the parameters of the head and those of the intervention function $\Phi$. We minimise the cross-entropy loss of the target class $y$ given input $\mathbf{x}$:

$$\min_{\phi, \theta} \left\{ -\log H_\theta(y \mid \mathbf{h}_\Phi(\mathbf{x})) \right\} \tag{6}$$
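The two objectives can be sketched numerically as follows. This is an illustrative example only: the function names are hypothetical, the gold-token probabilities are assumed to come from a teacher-forced forward pass of the intervened model, and the head weights are toy values rather than learned ones.

```python
import math


def lm_loss(gold_probs):
    """Eq. (4): negative log-likelihood summed over the m output positions.

    gold_probs[i] is assumed to be p_Phi(y_i | x, y_<i), the probability the
    intervened model assigns to the gold token under teacher forcing.
    """
    return -sum(math.log(p) for p in gold_probs)


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classification_head(h_cls, W_d, b_d, W_o, b_o):
    """Eq. (5): softmax(W_o tanh(W_d h + b_d) + b_o) on the first-token state."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, h_cls)) + b)
              for row, b in zip(W_d, b_d)]
    logits = [sum(w * x for w, x in zip(row, hidden)) + b
              for row, b in zip(W_o, b_o)]
    return softmax(logits)


def classification_loss(class_probs, y):
    """Eq. (6): negative log-probability of the target class index y."""
    return -math.log(class_probs[y])


# Generation: two output tokens with gold probabilities 0.5 and 0.25.
gen_loss = lm_loss([0.5, 0.25])          # equals log(8)

# Classification: a zero CLS vector through identity weights gives uniform
# class probabilities, so the loss is log(2) for either class.
eye = [[1.0, 0.0], [0.0, 1.0]]
probs = classification_head([0.0, 0.0], eye, [0.0, 0.0], eye, [0.0, 0.0])
cls_loss = classification_loss(probs, 1)
```

In training, only the intervention parameters (and, for classification, the head parameters) would receive gradients from these losses; the LM weights stay frozen.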
3.3 The ReFT family of methods
It is straightforward to generalise the above intervention functions to define a family of intervention-based representation finetuning methods. We first define a general notion of intervention, i.e. the modification of hidden representations during the model forward pass:

Definition 3.1. An intervention $I$ is a tuple $\langle \Phi, P, l \rangle$ that encapsulates a single inference-time modification of the representations computed by a Transformer-based LM. The three components of an intervention are (1) the intervention function $\Phi : \mathbb{R}^d \to \mathbb{R}^d$ with learned parameters $\phi$, (2) a set of input positions $P \subseteq \{1, \ldots, n\}$ that the intervention is applied to, and (3) the layer $l \in \{1, \ldots, m\}$ at which the intervention is applied. We implement the intervention $I$ as the following operation that overwrites some representations $\mathbf{h}$:

$$\mathbf{h}^{(l)} \leftarrow \left(\Phi(\mathbf{h}^{(l)}_p) \text{ if } p \in P \text{ else } \mathbf{h}^{(l)}_p\right)_{p \in 1, \ldots, n} \tag{7}$$

The intervention is applied immediately after the computation of $\mathbf{h}^{(l)}$ and thus affects the representations computed in later layers $\mathbf{h}^{(l+1)}, \ldots, \mathbf{h}^{(m)}$.

Figure 2 provides a schematic overview of an intervention. A ReFT is then defined as a constrained set of non-overlapping interventions:

Definition 3.2. A ReFT method is a set of $f$ interventions $\mathbf{I} = \{I_1, \ldots, I_f\}$. We enforce that for any two interventions $I_j, I_k \in \mathbf{I}$ such that they operate on the same layer $l_j = l_k$, their intervention positions must be disjoint, i.e. $P_j \cap P_k = \emptyset$. The parameters $(\phi_1, \ldots, \phi_f)$ of all of the intervention functions are independent.

ReFT is thus a generic framework encompassing interventions on hidden representations during the model forward pass. In appendix B, we show how a variety of existing inference-time intervention methods can be described within this framework.
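Definitions 3.1 and 3.2 translate fairly directly into code. The sketch below is one possible reading, with hypothetical names: an intervention is a `(phi, positions, layer)` tuple, `apply_intervention` implements the overwrite in eq. (7) for a single layer's list of hidden states, and `is_valid_reft` checks the disjointness constraint. The `shift` function is a stand-in for a learned Φ.

```python
def apply_intervention(hidden, phi, positions):
    """Eq. (7): overwrite h_p with phi(h_p) for p in positions (1-indexed)."""
    return [phi(h) if p in positions else h
            for p, h in enumerate(hidden, start=1)]


def is_valid_reft(interventions):
    """Definition 3.2: same-layer interventions must use disjoint positions.

    Each intervention is a (phi, positions, layer) tuple as in Definition 3.1.
    """
    used_by_layer = {}
    for _phi, positions, layer in interventions:
        used = used_by_layer.setdefault(layer, set())
        if used & set(positions):
            return False
        used.update(positions)
    return True


def shift(h):
    """Hypothetical stand-in for a learned Phi: adds 1 to each coordinate."""
    return [x + 1.0 for x in h]


# One layer's hidden states for n = 3 positions with d = 2.
layer_states = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
edited = apply_intervention(layer_states, shift, positions={1, 3})
print(edited)  # [[1.0, 1.0], [1.0, 1.0], [3.0, 3.0]]
```

Positions 1 and 3 are edited and position 2 passes through unchanged; in a real model the edited states would then feed the computation of the later layers.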
## 4 Experiments
To evaluate our ReFTs against existing PEFTs, we conduct experiments across four diverse NLP benchmarks covering more than 20 datasets (extensive details on our datasets are in appendix C). Our goal is to provide a rich picture of how LoReFT and DiReFT perform in different scenarios. We experiment with both masked and autoregressive LMs at different scales, ranging from RoBERTa-base [Liu et al., 2019] with 125M to LLaMA models [Touvron et al., 2023a,b] with 13B parameters. We benchmark against existing PEFTs such as prefix-tuning [Li and Liang, 2021], adapter-tuning with both Series Adapters and Parallel Adapters, BitFit [Ben Zaken et al., 2022], RED [Wu et al., 2024a], LoRA [Hu et al., 2022], and DoRA [Liu et al., 2024c]. Our comparisons focus on both performance and parameter efficiency. In our comparisons, we use hyperparameter-tuned scores from previous works when possible. We load our base LMs in torch.bfloat16 to save memory. All of our experiments are run with a single GPU: NVIDIA A100 40G/80G or RTX 6000. Examples of raw model generations are in appendix I.

4.1 Hyperparameter configuration

For our experiments, we must decide how many interventions to learn and which layers and input positions to apply each one on. We propose learning interventions on a fixed number of $p$ prefix and $s$ suffix positions in the prompt. Specifically, we tune four hyperparameters:

1. The number of prefix positions $p$ to intervene on, i.e. positions $\{1, \ldots, p\}$.
2. The number of suffix positions $s$ to intervene on, i.e. positions $\{n - s + 1, \ldots, n\}$.
3. Which set of layers $L$ to intervene on.
4. Whether or not to tie intervention parameters $\phi$ across different positions in the same layer.

This simplifies the hyperparameter search space; compared to LoRA, the only additional consideration is which positions to intervene on. Since the number of positions edited is constant, LoReFT and DiReFT contribute a fixed additional inference cost that does not scale with prompt length. Given the positions $P = \{1, \ldots, p\} \cup \{n - s + 1, \ldots, n\}$, we define the untied and tied variants:

$$\mathbf{I}_{\mathrm{untied}} = \{\langle \Phi, \{p\}, l \rangle \mid p \in P, l \in L\} \qquad \mathbf{I}_{\mathrm{tied}} = \{\langle \Phi, P, l \rangle \mid l \in L\}$$

Additionally, when applying LoReFT and DiReFT to a prompt with length $n$ where $n < p + s$, we set $p \leftarrow \min(p, \lfloor n/2 \rfloor)$ and $s \leftarrow \min(s, \lceil n/2 \rceil)$ and do not apply the truncated interventions in $\mathbf{I}_{\mathrm{untied}}$.

We also tune neural-network training hyperparameters. Unlike previous work [Hu et al., 2022, 2023, Liu et al., 2024c] where hyperparameter tuning may involve optimising performance directly on test sets, we only tune our hyperparameters on development sets which do not contain any overlapping examples with the test sets of our tasks. We further describe hyperparameter tuning for each benchmark in appendix D.1.
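The position-selection rule, including the truncation for short prompts, can be sketched as below. The function names are hypothetical, positions are 1-indexed as in the text, and the sketch only enumerates which (positions, layer) pairs get their own parameter set; it does not model how the truncated untied interventions are skipped at run time.

```python
def intervention_positions(n, p, s):
    """Return (prefix, suffix) position lists for a prompt of length n."""
    if n < p + s:
        p = min(p, n // 2)          # floor(n / 2)
        s = min(s, (n + 1) // 2)    # ceil(n / 2)
    prefix = list(range(1, p + 1))
    suffix = list(range(n - s + 1, n + 1))
    return prefix, suffix


def build_interventions(positions, layers, tied):
    """Return one (position set, layer) pair per independent parameter set."""
    if tied:
        # One intervention per layer, shared across all chosen positions.
        return [(frozenset(positions), layer) for layer in layers]
    # One intervention per (position, layer) pair.
    return [(frozenset({pos}), layer) for layer in layers for pos in positions]


prefix, suffix = intervention_positions(n=10, p=2, s=3)
print(prefix, suffix)   # [1, 2] [8, 9, 10]

short_prefix, short_suffix = intervention_positions(n=5, p=4, s=4)
print(short_prefix, short_suffix)   # [1, 2] [3, 4, 5]

tied = build_interventions(prefix + suffix, layers=[4, 6], tied=True)
untied = build_interventions(prefix + suffix, layers=[4, 6], tied=False)
print(len(tied), len(untied))   # 2 10
```

The truncation guarantees that prefix and suffix positions never overlap, which is what Definition 3.2 requires of interventions on the same layer. The layer indices 4 and 6 are arbitrary placeholders.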
4.2 Commonsense reasoning
We replicate the experimental setup in Hu et al. [2023] and finetune LLaMA-1 7B/13B, Llama-2 7B, and Llama-3 8B⁵ on a combined dataset of eight commonsense reasoning tasks (COMMONSENSE170K). We report scores on each task's test set individually. We compare with PEFTs benchmarked in Hu et al. [2023] as well as the identical experiment reported in Liu et al. [2024c] for DoRA.
Datasets. Our benchmark contains eight commonsense reasoning datasets, including BoolQ [Clark et al., 2019], PIQA [Bisk et al., 2020], SIQA [Sap et al., 2019], HellaSwag [Zellers et al., 2019], WinoGrande [Sakaguchi et al., 2021], ARC-e, ARC-c [Clark et al., 2018], and OBQA [Mihaylov et al., 2018]. Examples are formulated as multiple-choice problems where the model needs to directly generate the correct choice without rationales. We use the same prompt template as in Hu et al. [2023] with additional string normalisation (removing leading and trailing whitespace).
Hyperparameter tuning. We do not do hyperparameter selection based on test set results. Rather, we use the hyperparameter settings of the model that performs best on a development set created from the GSM8K training set, except we use a lower number of epochs (6 instead of 12) because the COMMONSENSE170K training set is more than 20 times larger than GSM8K. This allows us to tune relevant hyperparameters, and also serves to test the robustness of these settings across different domains. We additionally report scores on 3 epochs in appendix D.3.
Results. We report results in table 1. LoReFT sets state-of-the-art performance on the commonsense reasoning tasks, outperforming all other methods by a considerable margin. While being more compute-efficient, DiReFT achieves only slightly worse performance consistently.
4.3 Arithmetic reasoning
Similar to the previous experiment, we follow the experimental setup in Hu et al. [2023] and finetune LLaMA-1 7B and 13B on a combined dataset of seven arithmetic reasoning tasks with LM-generated chain-of-thought steps (MATH10K) and report scores on four of the tasks' test sets. We only evaluate correctness on the final numeric or multiple-choice answer.
> 5 Llama-3 8B appeared on April 18, 2024, and thus we had time to complete only commonsense reasoning experiments with this model. Liu et al. [2024c] report corresponding results for LoRA and DoRA.

Table 1: Accuracy comparison of LLaMA-1 7B/13B, Llama-2 7B and Llama-3 8B against existing PEFT methods on eight commonsense reasoning datasets. Performance results of all baseline methods are taken from Liu et al. [2024c]. We report averaged performance of three runs with distinct random seeds for our method. For our methods, Params (%) is calculated by dividing the number of trainable parameters by the number of parameters of the base LM.

| Model | PEFT | Params (%) | BoolQ | PIQA | SIQA | HellaS. | WinoG. | ARC-e | ARC-c | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ChatGPT ∗ | - | - | 73.1 | 85.4 | 68.5 | 78.5 | 66.1 | 89.8 | 79.9 | 74.8 | 77.0 |
| LLaMA-7B | PrefT | 0.039% | 64.3 | 76.8 | 73.9 | 42.1 | 72.1 | 72.9 | 54.0 | 60.6 | 64.6 |
| LLaMA-7B | Adapter S | 1.953% | 63.0 | 79.2 | 76.3 | 67.9 | 75.7 | 74.5 | 57.1 | 72.4 | 70.8 |
| LLaMA-7B | Adapter P | 3.542% | 67.9 | 76.4 | 78.8 | 69.8 | 78.9 | 73.7 | 57.3 | 75.2 | 72.3 |
| LLaMA-7B | LoRA | 0.826% | 68.9 | 80.7 | 77.4 | 78.1 | 78.8 | 77.8 | 61.3 | 74.8 | 74.7 |
| LLaMA-7B | DoRA (half) | 0.427% | 70.0 | 82.6 | 79.7 | 83.2 | 80.6 | 80.6 | 65.4 | 77.6 | 77.5 |
| LLaMA-7B | DoRA | 0.838% | 68.5 | 82.9 | 79.6 | 84.8 | 80.8 | 81.4 | 65.8 | 81.0 | 78.1 |
| LLaMA-7B | DiReFT (ours) | 0.031% | 69.5 | 83.0 | 79.0 | 92.5 | 80.5 | 82.2 | 68.0 | 77.5 | 79.0 |
| LLaMA-7B | LoReFT (ours) | 0.031% | 69.3 | 84.4 | 80.3 | 93.1 | 84.2 | 83.2 | 68.2 | 78.9 | 80.2 |
| LLaMA-13B | PrefT | 0.031% | 65.3 | 75.4 | 72.1 | 55.2 | 68.6 | 79.5 | 62.9 | 68.0 | 68.4 |
| LLaMA-13B | Adapter S | 1.586% | 71.8 | 83.0 | 79.2 | 88.1 | 82.4 | 82.5 | 67.3 | 81.8 | 79.5 |
| LLaMA-13B | Adapter P | 2.894% | 72.5 | 84.9 | 79.8 | 92.1 | 84.7 | 84.2 | 71.2 | 82.4 | 81.5 |
| LLaMA-13B | LoRA | 0.670% | 72.1 | 83.5 | 80.5 | 90.5 | 83.7 | 82.8 | 68.3 | 82.4 | 80.5 |
| LLaMA-13B | DoRA (half) | 0.347% | 72.5 | 85.3 | 79.9 | 90.1 | 82.9 | 82.7 | 69.7 | 83.6 | 80.8 |
| LLaMA-13B | DoRA | 0.681% | 72.4 | 84.9 | 81.5 | 92.4 | 84.2 | 84.2 | 69.6 | 82.8 | 81.5 |
| LLaMA-13B | DiReFT (ours) | 0.025% | 71.3 | 86.1 | 80.8 | 94.6 | 83.6 | 85.5 | 72.9 | 82.7 | 82.2 |
| LLaMA-13B | LoReFT (ours) | 0.025% | 72.1 | 86.3 | 81.8 | 95.1 | 87.2 | 86.2 | 73.7 | 84.2 | 83.3 |
| Llama-2 7B | LoRA | 0.826% | 69.8 | 79.9 | 79.5 | 83.6 | 82.6 | 79.8 | 64.7 | 81.0 | 77.6 |
| Llama-2 7B | DoRA (half) | 0.427% | 72.0 | 83.1 | 79.9 | 89.1 | 83.0 | 84.5 | 71.0 | 81.2 | 80.5 |
| Llama-2 7B | DoRA | 0.838% | 71.8 | 83.7 | 76.0 | 89.1 | 82.6 | 83.7 | 68.2 | 82.4 | 79.7 |
| Llama-2 7B | DiReFT (ours) | 0.031% | 70.8 | 83.6 | 80.2 | 93.6 | 82.1 | 84.8 | 70.4 | 81.5 | 80.9 |
| Llama-2 7B | LoReFT (ours) | 0.031% | 71.1 | 83.8 | 80.8 | 94.3 | 84.5 | 85.6 | 72.2 | 82.3 | 81.8 |
| Llama-3 8B | LoRA | 0.700% | 70.8 | 85.2 | 79.9 | 91.7 | 84.3 | 84.2 | 71.2 | 79.0 | 80.8 |
| Llama-3 8B | DoRA (half) | 0.361% | 74.5 | 88.8 | 80.3 | 95.5 | 84.7 | 90.1 | 79.1 | 87.2 | 85.0 |
| Llama-3 8B | DoRA | 0.710% | 74.6 | 89.3 | 79.9 | 95.5 | 85.6 | 90.5 | 80.4 | 85.8 | 85.2 |
| Llama-3 8B | DiReFT (ours) | 0.026% | 73.4 | 88.7 | 81.0 | 95.6 | 85.5 | 91.8 | 81.8 | 85.4 | 85.4 |
| Llama-3 8B | LoReFT (ours) | 0.026% | 75.1 | 90.2 | 82.0 | 96.3 | 87.4 | 92.4 | 81.6 | 87.5 | 86.6 |
Hyperparameter tuning. We use the same hyperparameter settings as for the Commonsense Reasoning benchmark, but with 12 epochs for training. We also report scores on 3 epochs.
Datasets. Our benchmark contains four datasets for math word problems, including AQuA [Ling et al., 2017], GSM8K [Cobbe et al., 2021], MAWPS [Koncel-Kedziorski et al., 2016], and SVAMP [Patel et al., 2021]. Models need to generate chain-of-thought [Wei et al., 2022] before the final answer. We use the same prompt template and hyperparameter settings as in the previous experiment.
Results. We report results in table 2. We find that both LoReFT and DiReFT do not perform as well at arithmetic reasoning tasks compared to LoRA and adapters, but do outperform prefix-tuning. Our results suggest that our ReFTs may have more trouble on chain-of-thought reasoning than the single-step commonsense reasoning tasks due to the length of generations (greater length necessarily reduces the effect of the intervention) and overall greater difficulty of the task. Our results show that our ReFTs perform better with the 13B model than the 7B model, which suggests that our methods scale with model size. Overall, we note that the arithmetic reasoning results show a lot of variation, with no single method emerging as a clear winner across all of them.
4.4 Instruction-following
Base LMs require instruction finetuning to follow human prompts [Ouyang et al., 2022]. We follow the experimental setup in Wu et al. [2024a] and finetune Llama-2 7B with Ultrafeedback [Cui et al., 2023]. We compare against full parameter finetuning, LoRA, and RED. For evaluation, we use Alpaca-Eval v1.0 [Li et al., 2023], which computes the win-rate against text-davinci-003 using GPT-4 as the annotator. We use the same prompt template as in Taori et al. [2023].
Datasets. Ultrafeedback is a high-quality instruction dataset where responses are generated via scoring a diverse set of model responses from a list of candidates (e.g. ChatGPT and Bard). The score is calculated as a weighted score of instruction-following, truthfulness, honesty, and helpfulness. Some of the best 7B and 13B chat-models (e.g. UltraLM-13B [Ding et al., 2023]) are finetuned with Ultrafeedback.

Table 2: Accuracy comparison of LLaMA-1 7B/13B against existing PEFT methods on four arithmetic reasoning datasets. Performance results of all baseline methods are taken from Hu et al. [2023]. We report averaged performance of three runs with distinct random seeds for our method.

| Model | PEFT | Params (%) | AQuA | GSM8K | MAWPS | SVAMP | Avg. |
|---|---|---|---|---|---|---|---|
| LLaMA-7B | PrefT | 0.039% | 14.2 | 24.4 | 63.4 | 38.1 | 35.0 |
| LLaMA-7B | Adapter S | 1.953% | 15.0 | 33.3 | 77.7 | 52.3 | 44.6 |
| LLaMA-7B | Adapter P | 3.542% | 18.1 | 35.3 | 82.4 | 49.6 | 46.4 |
| LLaMA-7B | LoRA | 0.826% | 18.9 | 37.5 | 79.0 | 52.1 | 46.9 |
| LLaMA-7B | DiReFT (ours) | 0.031% | 21.3 | 24.1 | 74.5 | 42.7 | 40.6 |
| LLaMA-7B | LoReFT (ours) | 0.031% | 21.4 | 26.0 | 76.2 | 46.8 | 42.6 |
| LLaMA-13B | PrefT | 0.031% | 15.7 | 31.1 | 66.8 | 41.4 | 38.8 |
| LLaMA-13B | Adapter S | 1.586% | 22.0 | 44.0 | 78.6 | 50.8 | 48.9 |
| LLaMA-13B | Adapter P | 2.894% | 20.5 | 43.3 | 81.1 | 55.7 | 50.2 |
| LLaMA-13B | LoRA | 0.670% | 18.5 | 47.5 | 83.6 | 54.6 | 51.1 |
| LLaMA-13B | DiReFT (ours) | 0.025% | 20.5 | 35.8 | 80.8 | 54.8 | 48.0 |
| LLaMA-13B | LoReFT (ours) | 0.025% | 23.6 | 38.1 | 82.4 | 54.2 | 49.6 |

Table 3: Instruction tuning evaluation results for instruction-tuned Llama-2 7B with Alpaca-Eval v1.0. We report averaged performance of two runs with distinct random seeds for our method. half denotes our runs with half of the rank; 1K denotes our runs with a low-resource setting where there are only 1K training examples. †Performance results of baseline methods are taken from Li et al. [2023]. Performance results of baseline methods are taken from Wu et al. [2024a]. ‡It takes 18 minutes to train our Llama-2 Chat 7B on 1K examples using a single A100 40G GPU with ≈1MB parameters on disk.

| Model & PEFT | Params (%) | Win-rate (↑) |
|---|---|---|
| GPT-3.5 Turbo 1106 † | - | 86.30 |
| Llama-2 Chat 13B † | - | 81.10 |
| Llama-2 Chat 7B † | - | 71.40 |
| Llama-2 7B & FT | 100% | 80.93 |
| Llama-2 7B & LoRA | 0.1245% | 81.48 |
| Llama-2 7B & RED | 0.0039% | 81.69 |
| Llama-2 7B & DiReFT (ours) | 0.0039% | 84.85 |
| Llama-2 7B & LoReFT (ours) | 0.0039% | 85.60 |
| Llama-2 7B & LoReFT (ours, half) | 0.0019% | 84.12 |
| Llama-2 7B & LoReFT (ours, 1K) ‡ | 0.0039% | 81.91 |
Hyperparameter tuning. We do hyperparameter-tuning on the unseen instruction-following dataset Alpaca-52K [Taori et al., 2023] with only LLaMA-7B to prevent test-set hill-climbing. We then use the hyperparameter settings of our best performing model to finetune on Ultrafeedback. For hyperparameter tuning, we use Alpaca-Eval v1.0 with GPT-4 turbo as the annotator for fast turnaround, which also prevents overfitting with GPT-4 as a judge.
Results. We report results in table 3. When matched in parameter count to the previous most parameter-efficient PEFT (RED) and trained on Llama-2 7B, LoReFT outperforms all reported finetuning methods (including full finetuning) and achieves a win-rate within 1% of GPT-3.5 Turbo 1106. Furthermore, after halving the parameter count or using only 1/64-th of the data, LoReFT still outperforms other finetuning methods. This result shows that LoReFT can succeed at long-form text generation. DiReFT is again slightly worse than LoReFT but is highly competitive.⁶

> 6 We release our ReFT weights (<1MB) of our instruction-tuned model through HuggingFace and provide a tutorial at https://github.com/stanfordnlp/pyreft/blob/main/examples/chat.
Table 4: Accuracy comparison of RoBERTa-base and RoBERTa-large against existing PEFT methods on the GLUE benchmark. Performance results of all baseline methods are taken from Wu et al. [2024a]. We report averaged performance of five runs with distinct random seeds for our method.

| Model | PEFT | Params (%) | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| base | FT | 100% | 87.3 | 94.4 | 87.9 | 62.4 | 92.5 | 91.7 | 78.3 | 90.6 | 85.6 |
| base | Adapter | 0.318% | 87.0 | 93.3 | 88.4 | 60.9 | 92.5 | 90.5 | 76.5 | 90.5 | 85.0 |
| base | LoRA | 0.239% | 86.6 | 93.9 | 88.7 | 59.7 | 92.6 | 90.4 | 75.3 | 90.3 | 84.7 |
| base | Adapter FNN | 0.239% | 87.1 | 93.0 | 88.8 | 58.5 | 92.0 | 90.2 | 77.7 | 90.4 | 84.7 |
| base | BitFit | 0.080% | 84.7 | 94.0 | 88.0 | 54.0 | 91.0 | 87.3 | 69.8 | 89.5 | 82.3 |
| base | RED | 0.016% | 83.9 | 93.9 | 89.2 | 61.0 | 90.7 | 87.2 | 78.0 | 90.4 | 84.3 |
| base | DiReFT (ours) | 0.015% | 82.5 | 92.6 | 88.3 | 58.6 | 91.3 | 86.4 | 76.4 | 89.3 | 83.2 |
| base | LoReFT (ours) | 0.015% | 83.1 | 93.4 | 89.2 | 60.4 | 91.2 | 87.4 | 79.0 | 90.0 | 84.2 |
| large | FT | 100% | 88.8 | 96.0 | 91.7 | 68.2 | 93.8 | 91.5 | 85.8 | 92.6 | 88.6 |
| large | Adapter | 0.254% | 90.1 | 95.2 | 90.5 | 65.4 | 94.6 | 91.4 | 85.3 | 91.5 | 88.0 |
| large | LoRA | 0.225% | 90.2 | 96.0 | 89.8 | 65.5 | 94.7 | 90.7 | 86.3 | 91.7 | 88.1 |
| large | Adapter FNN | 0.225% | 90.3 | 96.1 | 90.5 | 64.4 | 94.3 | 91.3 | 84.8 | 90.2 | 87.7 |
| large | RED | 0.014% | 89.5 | 96.0 | 90.3 | 68.1 | 93.5 | 88.8 | 86.2 | 91.3 | 88.0 |
| large | DiReFT (ours) | 0.014% | 88.7 | 95.4 | 88.5 | 66.7 | 93.9 | 88.1 | 86.9 | 91.2 | 87.4 |
| large | LoReFT (ours) | 0.014% | 89.2 | 96.2 | 90.1 | 68.0 | 94.1 | 88.5 | 87.5 | 91.6 | 88.2 |
4.5 Natural language understanding
We evaluate LoReFT on the GLUE benchmark [Wang et al., 2018] against existing PEFTs. We use this set of experiments to show that LoReFT works well even with small-scale LMs, and can improve representations for classification tasks and not just text generation. We finetune RoBERTa-base (125M) as well as RoBERTa-large (350M) on GLUE, a sequence classification benchmark for natural language understanding (NLU) which covers domains such as sentiment classification and natural language inference. Details about the GLUE benchmark can be found in its original paper. We follow Wu et al. [2024a] for proper evaluation on the GLUE validation set: we split the validation set into two sets guarded by a random seed, and we pick the model with the highest in-training validation accuracy on one set to evaluate on the other held-out half for testing accuracy.
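A seeded validation split of this kind might look like the sketch below. The exact procedure of Wu et al. [2024a] is not spelled out here, so the even 50/50 shuffle-and-cut and the function name `split_validation` are assumptions made for illustration.

```python
import random


def split_validation(examples, seed):
    """Shuffle with a fixed seed, then cut into a selection half and a held-out half."""
    rng = random.Random(seed)           # local RNG, so global state is untouched
    order = list(range(len(examples)))
    rng.shuffle(order)
    half = len(order) // 2
    selection = [examples[i] for i in order[:half]]   # used to pick the checkpoint
    held_out = [examples[i] for i in order[half:]]    # used to report accuracy
    return selection, held_out


validation_set = [f"example-{i}" for i in range(10)]
selection, held_out = split_validation(validation_set, seed=42)
```

Because the split depends only on the seed, the same two halves can be recreated for every checkpoint and every method being compared.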
Hyperparameter tuning. We tune our hyperparameters for each task separately, which is standard for PEFTs. To avoid overfitting to random seeds, we hyperparameter-tune our models with a constant seed, and report averaged results over that and four additional unseen seeds. We describe hyperparameter tuning experiments in Appendix D.1.
Results. We report results in table 4. LoReFT obtains comparable performance with PEFT methods on both model sizes when parameter-matched with RED, the previous most parameter-efficient PEFT for this task. Furthermore, DiReFT achieves worse performance than most of the PEFTs, suggesting that LoReFT is a better choice when the LM is small. Full results with standard deviations are in table 13. We additionally compare against VeRA [Kopiczko et al., 2024] in appendix D.3.
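As a quick sanity check on the parameter-efficiency claims, the reported trainable-parameter percentages for LoRA and LoReFT can be turned into ratios. The pairs below are copied from tables 1, 3 and 4; the dictionary layout and labels are just an illustrative convenience, and this is not necessarily how the quoted 15×–65× range was derived.

```python
# (LoRA %, LoReFT %) trainable-parameter percentages as reported in the tables.
reported = {
    "LLaMA-7B, commonsense (table 1)": (0.826, 0.031),
    "LLaMA-13B, commonsense (table 1)": (0.670, 0.025),
    "Llama-3 8B, commonsense (table 1)": (0.700, 0.026),
    "Llama-2 7B, instruction-following (table 3)": (0.1245, 0.0039),
    "Llama-2 7B, instruction-following, half rank (table 3)": (0.1245, 0.0019),
    "RoBERTa-base, GLUE (table 4)": (0.239, 0.015),
    "RoBERTa-large, GLUE (table 4)": (0.225, 0.014),
}

# How many times more parameters LoRA trains than LoReFT in each setting.
ratios = {name: lora / loreft for name, (lora, loreft) in reported.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")

smallest = min(ratios.values())
largest = max(ratios.values())
```

With these figures the ratios run from roughly 16× (RoBERTa-base on GLUE) to roughly 66× (half-rank LoReFT against LoRA on Llama-2 7B), which is broadly in line with the 15×–65× range quoted in the abstract.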
## 5 Limitations
Due to limited resources, we mainly explored the LLaMA-family of models. In future work, we hope to explore the effectiveness of ReFT on other model families as well as visionlanguage models such as LLaVA [Liu et al., 2024a]. The capabilities of ReFT have not yet been fully explored due to the large hyperparameter search space; we are interested in automating this search. We provide some initial explorations of LM personalisation with ReFT in a few-shot setting in appendix G.2. We hope to explore why ReFT works, and we provide some of our early explorations focused on memorisation (appendix F.1, appendix F.2). We are also investigating whether learned orthogonal subspaces can be composed together without adaptation. Some encouraging initial findings are in appendix G.1.
ReFT, abstraction, and generation. Neural network interpretability research often struggles to contribute directly to improving models. With ReFT, we have shown one way to overcome this challenge. The ReFT framework is rooted in work on causal abstraction [Geiger et al., 2023a] for model interpretability, and LoReFT builds directly on the distributed interchange intervention method 9of Geiger et al. [2023b] and Wu et al. [2023]. See also the interchange intervention training (IIT) method of Geiger et al. [2022], Wu et al. [2022], Huang et al. [2023c]. In a similar vein, recent work also uses representation-based editing of the Transformer stream to steer model behavior [Li et al., 2024, Zou et al., 2023]. ReFT advances this line of work by showing one way that such steering can be learned, rather than being merely a post hoc analysis step. The precise ways in which ReFT works deserve deeper exploration. Although these methods intervene on representations, the causal effect of such interventions may only emerge in the models upstream computations. In other words, the power of ReFT may come from the fact that it creates new causal pathways or modifies the strength of some existing ones. We leave it to future research to track these effects, and perhaps to explore more structured ReFTs to modify complex causal pathways in LMs.
ReFT and model interpretability. ReFT relies on insights from work on interpretability, and it may also be able to contribute insights back to that field. In particular, LoReFT shows that training a set of low-rank interventions on selected residual streams can induce a base LM to follow instructions (section 4.4). In other words, a linear subspace distributed across a set of neurons can achieve generalised control over a vast number of tasks. This is a serious challenge to work seeing to interpret individual neurons in isolation (for related criticisms, see Huang et al. 2023b). The success of ReFT suggests to us a quite different approach to interperetability, one that starts from the assumption that neurons will play different roles in different contexts.
Evaluation practices in PEFT research. In this work, we hyperparameter-tune ReFT on development sets that do not overlap with the test set. Unfortunately, a considerable portion of the literature on PEFTs directly hill-climbs performance on test sets. This results in overfitting to specific tasks, which gives practitioners less certainty about the real-world performance of different methods and impedes fair comparison. We hope that future work can introduce benchmarks for evaluating PEFTs and ReFTs. These should allow for compute- or time-matched hyperparameter-tuning comparisons, and they should disallow any kind of tuning or model selection based on the test set.
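The protocol described above can be sketched as a selection loop that only ever sees the development set, with the test set scored once at the end. This is a hedged illustration: `select_on_dev`, the grid keys, and the toy `train_fn` and `score_fn` are hypothetical placeholders, not the tuning code used in this work.

```python
from itertools import product


def select_on_dev(grid, train_fn, score_fn, dev_set):
    """Return the config with the best dev-set score; the test set is never touched here."""
    best_cfg, best_score = None, float("-inf")
    keys = list(grid.keys())
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        model = train_fn(cfg)
        score = score_fn(model, dev_set)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score


# Toy stand-ins so the sketch runs end to end (hypothetical, for illustration only).
def train_fn(cfg):
    return cfg  # a real train_fn would fit a model with these hyperparameters


def score_fn(model, dataset):
    return -abs(model["rank"] - 8) - abs(model["layer"] - 2)


grid = {"rank": [4, 8, 16], "layer": [0, 2, 4]}
best_cfg, best_dev_score = select_on_dev(grid, train_fn, score_fn, dev_set=None)

# The test set is scored exactly once, after selection is finished.
test_score = score_fn(train_fn(best_cfg), None)
print(best_cfg, best_dev_score, test_score)
```

Keeping selection and final scoring in separate steps makes it harder to leak test-set information into model selection by accident.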
## 6 Conclusion
We propose a strong alternative to PEFTs, LoReFT, and we identify an ablation of this method, DiReFT, that trades some performance for increased efficiency. Overall, LoReFT achieves strong performance across benchmarks from four domains while being 15×–65× more parameter-efficient than LoRA. Notably, LoReFT establishes new state-of-the-art performance on commonsense reasoning, instruction-following, and natural language understanding against the strongest PEFTs. We also show how our method can be described under a generic framework, ReFT. ReFT is a new approach to finetuning that is more powerful, more parameter-efficient, and more interpretable than any existing PEFTs.
## Acknowledgements
We thank Jing Huang for helpful discussion in designing our memorisation tests as well as writing. We thank Chenglei Si, Harshit Joshi, Jordan Juravsky, Julie Kallini, Ken Liu, Rohan Pandey, Jiuding Sun, Leonard Tang, Tristan Thrush, Shengguang Wu, Qinan Yu, Yanzhe Zhang, Amir Zur, and Shiqi Chen for helpful discussion about the project and comments on the manuscript.
## References
Afra Amini, Tiago Pimentel, Clara Meister, and Ryan Cotterell. Naturalistic causal probing for morpho-syntax. Transactions of the Association for Computational Linguistics, 11:384–403, 2023. doi: 10.1162/tacl_a_00554. URL https://aclanthology.org/2023.tacl-1.23.
Aryaman Arora, Dan Jurafsky, and Christopher Potts. CausalGym: Benchmarking causal interpretability methods on linguistic tasks. arXiv:2402.12560, 2024. URL https://arxiv.org/abs/2402.12560.
Matan Avitan, Ryan Cotterell, Yoav Goldberg, and Shauli Ravfogel. What changed? Converting representational interventions to natural language. arXiv:2402.11355, 2024. URL https://arxiv.org/abs/2402.11355.
Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. LEACE: Perfect linear concept erasure in closed form. Advances in Neural Information Processing Systems, 36, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/d066d21c619d0a78c5b557fa3291a8f4-Paper-Conference.pdf.
Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-short.1. URL https://aclanthology.org/2022.acl-short.1.
Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020. URL https://arxiv.org/abs/1911.11641.
Lewis Carroll. Alice's Adventures in Wonderland. Macmillan, London, 1865.
Angelica Chen, Ravid Schwartz-Ziv, Kyunghyun Cho, Matthew L. Leavitt, and Naomi Saphra. Sudden drops in the loss: Syntax acquisition, phase transitions, and simplicity bias in MLMs. arXiv:2309.07311, 2023. URL https://arxiv.org/abs/2309.07311v4.
Abhijith Chintam, Rahel Beloch, Willem Zuidema, Michael Hanna, and Oskar van der Wal. Identifying and adapting transformer-components responsible for gender bias in an English language model. In Yonatan Belinkov, Sophie Hao, Jaap Jumelet, Najoung Kim, Arya McCarthy, and Hosein Mohebbi, editors, Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 379–394, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.blackboxnlp-1.29. URL https://aclanthology.org/2023.blackboxnlp-1.29.
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1300. URL https://aclanthology.org/N19-1300.
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv:1803.05457, 2018. URL https://arxiv.org/abs/1803.05457.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv:2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Wei Zhu, Yuan Ni, Guotong Xie, Zhiyuan Liu, and Maosong Sun. UltraFeedback: Boosting language models with high-quality feedback. arXiv:2310.01377, 2023. URL https://arxiv.org/abs/2310.01377.
Andrew M. Dai and Quoc V. Le. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/hash/7137debd45ae4d0ab9aa953017286b20-Abstract.html.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.183. URL https://aclanthology.org/2023.emnlp-main.183.
Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, 2022. URL https://transformer-circuits.pub/2022/toy_model/index.html.
Stanislav Fort. Scaling laws for adversarial attacks on language model activations, 2023. URL http://arxiv.org/abs/2312.02780.
Cheng Fu, Hanxian Huang, Xinyun Chen, Yuandong Tian, and Jishen Zhao. Learn-to-Share: A hardware-friendly transfer learning framework exploiting computation and parameter sharing. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 3469–3479. PMLR, 2021. URL http://proceedings.mlr.press/v139/fu21a.html.
Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. LLaMA-Adapter v2: Parameter-efficient visual instruction model. arXiv:2304.15010, 2023. URL https://arxiv.org/abs/2304.15010.
Atticus Geiger, Hanson Lu, Thomas Icard, and Christopher Potts. Causal abstractions of neural networks. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, volume 34, pages 9574–9586. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/4f5c422f4d49a5a807eda27434231040-Paper.pdf.
Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah Goodman, and Christopher Potts. Inducing causal structure for interpretable neural networks. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 7324–7338. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/geiger22a.html.
Atticus Geiger, Chris Potts, and Thomas Icard. Causal abstraction for faithful model interpretation. arXiv:2301.04709, 2023a. URL https://arxiv.org/abs/2301.04709.
Atticus Geiger, Zhengxuan Wu, Christopher Potts, Thomas Icard, and Noah D. Goodman. Finding alignments between interpretable causal variables and distributed neural representations. arXiv:2303.02536, 2023b. URL https://arxiv.org/abs/2303.02536.
Clément Guerner, Anej Svete, Tianyu Liu, Alexander Warstadt, and Ryan Cotterell. A geometric notion of causal probing. arXiv:2307.15054, 2023. URL https://arxiv.org/abs/2307.15054.
Michael Hanna, Yonatan Belinkov, and Sandro Pezzelle. When language models fall in love: Animacy processing in transformer language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12120–12135, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.744. URL https://aclanthology.org/2023.emnlp-main.744.
Sophie Hao and Tal Linzen. Verb conjugation in transformers is determined by linear encodings of subject number. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4531–4539, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.300. URL https://aclanthology.org/2023.findings-emnlp.300.
Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. Towards a unified view of parameter-efficient transfer learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 2022a. URL https://openreview.net/forum?id=0RDcd5Axok.
Shwai He, Liang Ding, Daize Dong, Jeremy Zhang, and Dacheng Tao. SparseAdapter: An easy approach for improving the parameter-efficiency of adapters. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2184–2190, Abu Dhabi, United Arab Emirates, December 2022b. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-emnlp.160. URL https://aclanthology.org/2022.findings-emnlp.160.
Mohammad Javad Hosseini, Hannaneh Hajishirzi, Oren Etzioni, and Nate Kushman. Learning to solve arithmetic word problems with verb categorization. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 523–533, Doha, Qatar, October 2014. Association for Computational Linguistics. doi: 10.3115/v1/D14-1058. URL https://aclanthology.org/D14-1058.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 2790–2799. PMLR, 2019. URL http://proceedings.mlr.press/v97/houlsby19a.html.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Lee. LLM-adapters: An adapter family for parameter-efficient fine-tuning of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5254–5276, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.319. URL https://aclanthology.org/2023.emnlp-main.319.
Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. LoraHub: Efficient cross-task generalization via dynamic lora composition. arXiv:2307.13269, 2023a. URL https://arxiv.org/abs/2307.13269.
Jing Huang, Atticus Geiger, Karel D'Oosterlinck, Zhengxuan Wu, and Christopher Potts. Rigorously assessing natural language explanations of neurons. In Yonatan Belinkov, Sophie Hao, Jaap Jumelet, Najoung Kim, Arya McCarthy, and Hosein Mohebbi, editors, Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 317–331, Singapore, December 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.blackboxnlp-1.24. URL https://aclanthology.org/2023.blackboxnlp-1.24.
Jing Huang, Zhengxuan Wu, Kyle Mahowald, and Christopher Potts. Inducing character-level structure in subword-based language models with type-level interchange intervention training. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 12163–12180, Toronto, Canada, July 2023c. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.770. URL https://aclanthology.org/2023.findings-acl.770.
Jing Huang, Zhengxuan Wu, Christopher Potts, Mor Geva, and Atticus Geiger. RAVEL: Evaluating interpretability methods on disentangling language model representations. arXiv:2402.17700, 2024. URL https://arxiv.org/abs/2402.17700.
Rik Koncel-Kedziorski, Hannaneh Hajishirzi, Ashish Sabharwal, Oren Etzioni, and Siena Dumas Ang. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597, 2015. doi: 10.1162/tacl_a_00160. URL https://aclanthology.org/Q15-1042.
Rik Koncel-Kedziorski, Subhro Roy, Aida Amini, Nate Kushman, and Hannaneh Hajishirzi. MAWPS: A math word problem repository. In Kevin Knight, Ani Nenkova, and Owen Rambow, editors, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152–1157, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1136. URL https://aclanthology.org/N16-1136.
Dawid Jan Kopiczko, Tijmen Blankevoort, and Yuki M Asano. VeRA: Vector-based random matrix adaptation. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024. URL https://openreview.net/forum?id=NjNfLdxr3A.
Karim Lasri, Tiago Pimentel, Alessandro Lenci, Thierry Poibeau, and Ryan Cotterell. Probing for the usage of grammatical number. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8818–8831, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.603. URL https://aclanthology.org/2022.acl-long.603.
Hector Levesque, Ernest Davis, and Leora Morgenstern. The Winograd Schema Challenge. In Proceedings of the Thirteenth International Conference on Principles of Knowledge Representation and Reasoning, 2012. URL https://cdn.aaai.org/ocs/4492/4492-21843-1-PB.pdf.
Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems, 36, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/81b8390039b7302c909cb769f8b6cd93-Abstract-Conference.html.
Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, and Luke Zettlemoyer. Branch-train-merge: Embarrassingly parallel training of expert language models. arXiv:2208.03306, 2022. URL https://arxiv.org/abs/2208.03306.
Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.353. URL https://aclanthology.org/2021.acl-long.353.
Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023.
Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. Program induction by rationale generation: Learning to solve and explain algebraic word problems. arXiv:1705.04146, 2017. URL https://arxiv.org/abs/1705.04146.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024a. URL https://arxiv.org/abs/2304.08485.
Sheng Liu, Haotian Ye, Lei Xing, and James Zou. In-context vectors: Making in context learning more effective and controllable through latent space steering. arXiv:2311.06668, 2024b. URL https://arxiv.org/abs/2311.06668.
Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. arXiv:2402.09353, 2024c. URL https://arxiv.org/abs/2402.09353.
Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng, Yandong Wen, Michael J. Black, Adrian Weller, and Bernhard Schölkopf. Parameter-efficient orthogonal finetuning via butterfly factorization. In The Twelfth International Conference on Learning Representations, ICLR 2024, 2024d. URL https://openreview.net/forum?id=7NzgkEdGyr.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692, 2019. URL https://arxiv.org/abs/1907.11692.
James L. McClelland, David E. Rumelhart, and PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 2: Psychological and Biological Models. MIT Press, 1986.
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv:1809.02789, 2018. URL https://arxiv.org/abs/1809.02789.
Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. Linguistic regularities in continuous space word representations. In Lucy Vanderwende, Hal Daumé III, and Katrin Kirchhoff, editors, Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 746–751, Atlanta, Georgia, June 2013. Association for Computational Linguistics. URL https://aclanthology.org/N13-1090.
Neel Nanda, Andrew Lee, and Martin Wattenberg. Emergent linear representations in world models of self-supervised sequence models. In Yonatan Belinkov, Sophie Hao, Jaap Jumelet, Najoung Kim, Arya McCarthy, and Hosein Mohebbi, editors, Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 16–30, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.blackboxnlp-1.2. URL https://aclanthology.org/2023.blackboxnlp-1.2.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022. URL https://arxiv.org/abs/2203.02155.
Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. arXiv:2311.03658, 2023. URL https://arxiv.org/abs/2311.03658.
Arkil Patel, Satwik Bhattamishra, and Navin Goyal. Are NLP models really able to solve simple math word problems? In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.168. URL https://aclanthology.org/2021.naacl-main.168.
Jonas Pfeiffer, Ivan Vulić, Iryna Gurevych, and Sebastian Ruder. MAD-X: An Adapter-Based Framework for Multi-Task Cross-Lingual Transfer. In Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, editors, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7654–7673, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.617. URL https://aclanthology.org/2020.emnlp-main.617.
Shauli Ravfogel, Michael Twiton, Yoav Goldberg, and Ryan D. Cotterell. Linear adversarial concept erasure. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 18400–18421, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/ravfogel22a.html.
Subhro Roy and Dan Roth. Solving general arithmetic word problems. In Lluís Màrquez, Chris Callison-Burch, and Jian Su, editors, Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1202. URL https://aclanthology.org/D15-1202.
David E. Rumelhart, James L. McClelland, and PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1: Foundations. MIT Press, 1986.
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd Schema Challenge at scale. Communications of the ACM, 64(9):99–106, 2021. URL https://arxiv.org/abs/1907.10641.
Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social IQa: Commonsense reasoning about social interactions. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1454. URL https://aclanthology.org/D19-1454.
Shuhua Shi, Shaohan Huang, Minghui Song, Zhoujun Li, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, and Qi Zhang. ResLoRA: Identity residual mapping in low-rank adaption. arXiv:2402.18039, 2024. URL https://arxiv.org/abs/2402.18039.
Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, and Ponnurangam Kumaraguru. MiMiC: Minimally modified counterfactuals in the representation space. arXiv:2402.09631, 2024. URL https://arxiv.org/abs/2402.09631.
Paul Smolensky. Neural and conceptual interpretation of PDP models. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 2: Psychological and Biological Models, pages 390–431. MIT Press/Bradford Books, Cambridge, MA, 1986.
Nishant Subramani, Nivedita Suresh, and Matthew E. Peters. Extracting latent steering vectors from pretrained language models. arXiv:2205.05124, 2022. URL https://arxiv.org/abs/2205.05124.
Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and efficient foundation language models. arXiv:2302.13971, 2023a. URL https://arxiv.org/abs/2302.13971.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023b. URL https://arxiv.org/abs/2307.09288.
Alex Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization. arXiv:2308.10248, 2023. URL https://arxiv.org/abs/2308.10248.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, pages 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.
Theia Vogel. repeng, 2024. URL https://github.com/vgel/repeng/.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Tal Linzen, Grzegorz Chrupała, and Afra Alishahi, editors, Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5446. URL https://aclanthology.org/W18-5446.
Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, 2023. URL https://openreview.net/pdf?id=NpsVSN6o4ul.
Yaqing Wang, Sahaj Agarwal, Subhabrata Mukherjee, Xiaodong Liu, Jing Gao, Ahmed Hassan Awadallah, and Jianfeng Gao. AdaMix: Mixture-of-adaptations for parameter-efficient model tuning. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5744–5760, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.388. URL https://aclanthology.org/2022.emnlp-main.388.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022. URL https://arxiv.org/abs/2201.11903.
Muling Wu, Wenhao Liu, Xiaohua Wang, Tianlong Li, Changze Lv, Zixuan Ling, Jianhao Zhu, Cenyuan Zhang, Xiaoqing Zheng, and Xuanjing Huang. Advancing parameter efficiency in fine-tuning via representation editing. arXiv:2402.15179, 2024a. URL https://arxiv.org/abs/2402.15179.
Zhengxuan Wu, Atticus Geiger, Joshua Rozner, Elisa Kreiss, Hanson Lu, Thomas Icard, Christopher Potts, and Noah Goodman. Causal distillation for language models. In Marine Carpuat, Marie-Catherine de Marneffe, and Ivan Vladimir Meza Ruiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4288–4295, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.318. URL https://aclanthology.org/2022.naacl-main.318.
Zhengxuan Wu, Atticus Geiger, Christopher Potts, and Noah D. Goodman. Interpretability at scale: Identifying causal mechanisms in Alpaca. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://papers.neurips.cc/paper_files/paper/2023/file/f6a8b109d4d4fd64c75e94aaf85d9697-Paper-Conference.pdf.
Zhengxuan Wu, Atticus Geiger, Aryaman Arora, Jing Huang, Zheng Wang, Noah D. Goodman, Christopher D. Manning, and Christopher Potts. pyvene: A library for understanding and improving PyTorch models via interventions. arXiv:2403.07809, 2024b. URL https://arxiv.org/abs/2403.07809.
Zhengxuan Wu, Atticus Geiger, Jing Huang, Aryaman Arora, Thomas Icard, Christopher Potts, and Noah D. Goodman. A reply to Makelov et al. (2023)'s “interpretability illusion” arguments. arXiv:2401.12631, 2024c. URL https://arxiv.org/abs/2401.12631.
Takateru Yamakoshi, James McClelland, Adele Goldberg, and Robert Hawkins. Causal interventions expose implicit situation models for commonsense language understanding. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 13265–13293, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.839. URL https://aclanthology.org/2023.findings-acl.839.
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv:1905.07830, 2019. URL https://arxiv.org/abs/1905.07830.
Feiyu Zhang, Liangzhi Li, Junhao Chen, Zhouqiang Jiang, Bowen Wang, and Yiming Qian. IncreLoRA: Incremental parameter allocation method for parameter-efficient fine-tuning. arXiv:2308.12043, 2023. URL https://arxiv.org/abs/2308.12043.
Jinghan Zhang, Shiqi Chen, Junteng Liu, and Junxian He. Composing parameter-efficient modules with arithmetic operation. Advances in Neural Information Processing Systems, 36, 2024a. URL https://arxiv.org/abs/2306.14870.
.Renrui Zhang, Jiaming Han, Chris Liu, Aojun Zhou, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. LLaMA-Adapter: Efficient fine-tuning of large language models with zero-initialized attention. In
The Twelfth International Conference on Learning Representations , Vienna, Austria, 2024b. URL
> https://openreview.net/forum?id=d4UiXAHN2W
.Ruiyi Zhang, Rushi Qiang, Sai Ashish Somayajula, and Pengtao Xie. AutoLoRA: Automatically tuning matrix ranks in low-rank adaptation based on meta learning. arXiv:2403.09113 , 2024c. URL https://arxiv.org/abs/2403.09113 .Ming Zhong, Yelong Shen, Shuohang Wang, Yadong Lu, Yizhu Jiao, Siru Ouyang, Donghan Yu, Jiawei Han, and Weizhu Chen. Multi-LoRA composition for image generation. arXiv:2402.16843 ,2024. URL https://arxiv.org/abs/2402.16843 .Yun Zhu, Nevan Wichers, Chu-Cheng Lin, Xinyi Wang, Tianlong Chen, Lei Shu, Han Lu, Ca-noee Liu, Liangchen Luo, Jindong Chen, et al. SiRa: Sparse mixture of low rank adaptation.
arXiv:2311.09179 , 2023. URL https://arxiv.org/abs/2311.09179 .Zhou Ziheng, Yingnian Wu, Song-Chun Zhu, and Demetri Terzopoulos. Aligner: One global token is worth millions of parameters when aligning large language models. arXiv:2312.05503 , 2023. URL https://arxiv.org/abs/2312.05503 .Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to AI transparency. arXiv:2310.01405 , 2023. URL https://arxiv.org/abs/2310.01405 .18 Appendix
# Table of Contents
- A pyreft: A ReFT-native Python Library
- B Describing existing methods under the ReFT framework
  - B.1 RED
  - B.2 Activation addition
  - B.3 RepE
- C Datasets
  - C.1 Commonsense reasoning
  - C.2 Arithmetic reasoning
  - C.3 Natural language understanding
- D Hyperparameters
  - D.1 Hyperparameter tuning and decoding strategy
  - D.2 Suggestions on choosing hyperparameters for ReFT
  - D.3 Additional hyperparameter-tuning results of LoReFT
- E Ablating the parametrisation of LoReFT
- F Memorisation experiments
  - F.1 A single vector is worth a thousand tokens
  - F.2 A single vector can memorise a codebook with 256 entries
- G Capabilities experiments
  - G.1 Multi-task learning: Learned ReFTs are like puzzle pieces
  - G.2 Few-shot adaptation: Adapting Llama-2-Chat to GOODY-2 with 5 examples
- H Inference overhead analysis of ReFT with our ReFT library
- I Generation examples
- J Licenses for existing assets
  - J.1 Commonsense reasoning
  - J.2 Arithmetic reasoning
  - J.3 Instruct-tuning
  - J.4 Natural language understanding
  - J.5 Models
## A pyreft: A ReFT-native Python Library
To lower the cost of switching from PEFTs to ReFT, we release a Python library made for training and sharing ReFTs. Our library is built on top of pyvene [Wu et al., 2024b], a library for performing and training activation interventions on arbitrary PyTorch models. Any pretrained LM available on HuggingFace is supported through our library for finetuning with ReFT methods, and finetuned models can be easily uploaded to HuggingFace. The following example shows steps to wrap a Llama-2 7B model with a single intervention on the residual stream output of the 19-th layer:
```python
import torch
import transformers
from pyreft import get_reft_model, ReftConfig, LoreftIntervention, ReftTrainerForCausalLM

# load the HuggingFace model
model_name_or_path = "yahma/llama-7b-hf"
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_name_or_path, torch_dtype=torch.bfloat16, device_map="cuda")

# wrap the model with a rank-1 LoReFT intervention on the block output of layer 19
reft_config = ReftConfig(representations={
    "layer": 19,
    "component": "block_output",
    "intervention": LoreftIntervention(
        embed_dim=model.config.hidden_size, low_rank_dimension=1)})
reft_model = get_reft_model(model, reft_config)
reft_model.print_trainable_parameters()
```
The wrapped model can be trained for downstream tasks. We also provide data loading helpers to construct training data that is compatible with HuggingFace trainers:
```python
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name_or_path)

# get training data with customised dataloaders
# (make_supervised_data_module, training_args and data_args are not defined in
# this snippet; they are assumed to be provided by the surrounding training script)
data_module = make_supervised_data_module(
    tokenizer=tokenizer, model=model, layers=[19],
    training_args=training_args, data_args=data_args)

# train (ReftTrainerForCausalLM is the class imported from pyreft above)
trainer = ReftTrainerForCausalLM(
    model=reft_model, tokenizer=tokenizer, args=training_args, **data_module)
trainer.train()
trainer.save_model(output_dir=training_args.output_dir)
```
## B Describing existing methods under the ReFT framework
To show the expressivity of the ReFT framework, we cast existing representation-editing methods in the literature into ReFTs.

General comments about expressivity of ReFT. Given that previous works have unified PEFTs under a single framework [He et al., 2022a], one may ask why not express ReFT as a PEFT method? The main reason is that PEFT frameworks lack the notion of time or sequence (see the unified PEFT view provided in Table 1 on pg. 5 of He et al., 2022a). In PEFTs, representation modifications are necessarily applied to every token in the sequence, even in recent variants such as AdaLoRA [Zhang et al., 2023]. A key aspect of ReFT is that it leverages representations over time and intervenes only on a small number of them while being effective. More importantly, the notion of time is important for future versions of ReFT that intervene on representations schematically (e.g. intervene on the first token at some early layers and then intervene on the last token at some later layers). The ability to intervene at different layer and position combinations schematically is also supported in our code. Existing PEFT libraries⁷ enforce weight-based updates without supporting flexible representation-based interventions.

⁷ See https://github.com/huggingface/peft.

B.1 RED

RED [Wu et al., 2024a] is a simple representation-editing method that applies an element-wise scaling transform $s \in \mathbb{R}^n$ and adds a bias $b \in \mathbb{R}^n$ to the hidden representation in every layer. The same intervention is applied to every position (including at generated tokens, increasing inference burden) but separate interventions are learned at each layer. In the ReFT framework, RED is defined as

$$\Phi_{\text{RED}}(h) = s \times h + b \tag{8}$$

$$I_{\text{RED}} = \{\langle \Phi_{\text{RED}}, \{1, \dots, n\}, l \rangle \mid l \in \{1, \dots, m\}\} \tag{9}$$

The parameters $\phi_{\text{RED}} = \{s, b\}$ are learned with gradient descent to minimise a loss function such as language-modelling loss or a classification loss, as in our experiments with LoReFT. We believe that RED is better classified as a kind of adapter due to its application at all positions.
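The RED intervention in eq. (8) can be sketched in a few lines. This is a minimal illustration only, assuming hidden states are plain Python lists of floats; the function names `red_intervention` and `apply_red_to_layer` are hypothetical and are not part of pyreft or of the RED codebase.

```python
from typing import List

Vector = List[float]


def red_intervention(h: Vector, s: Vector, b: Vector) -> Vector:
    """Eq. (8): element-wise scaling of h by s, followed by adding the bias b."""
    if not (len(h) == len(s) == len(b)):
        raise ValueError("h, s and b must have the same dimension")
    return [s_i * h_i + b_i for h_i, s_i, b_i in zip(h, s, b)]


def apply_red_to_layer(layer_hidden: List[Vector], s: Vector, b: Vector) -> List[Vector]:
    """Apply the same (s, b) to every position of one layer, as in eq. (9).

    Each layer would hold its own (s, b) pair; this toy function handles one layer.
    """
    return [red_intervention(h, s, b) for h in layer_hidden]
```

With `s` set to all ones and `b` to all zeros the intervention is the identity, which is a natural starting point for training.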
B.2 Activation addition
Activation addition [Turner et al., 2023] takes the difference in activations at some positions $p$ and $q$ and layer $l$ given two contrastive prompts $x^{+}$ and $x^{-}$ as input. It then adds this difference vector, scaled by a tuned constant $c$, to representations at all positions in layer $l$ for some new prompt.

$$a = h(x^{+})^{(l)}_{p} - h(x^{-})^{(l)}_{q} \tag{10}$$

$$\Phi_{\text{ActAdd}}(h) = h + c \cdot a \tag{11}$$

$$I_{\text{ActAdd}} = \{\langle \Phi_{\text{ActAdd}}, \{1, \dots, n\}, l \rangle\} \tag{12}$$
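Equations (10) and (11) amount to a vector difference followed by a scaled addition. The sketch below illustrates this under the assumption that the two activations have already been extracted from the model as Python lists; `steering_vector` and `act_add` are illustrative names, not functions from any existing library.

```python
from typing import List

Vector = List[float]


def steering_vector(h_pos: Vector, h_neg: Vector) -> Vector:
    """Eq. (10): difference between the activations of the two contrastive prompts."""
    if len(h_pos) != len(h_neg):
        raise ValueError("activations must have the same dimension")
    return [p - n for p, n in zip(h_pos, h_neg)]


def act_add(h: Vector, a: Vector, c: float) -> Vector:
    """Eq. (11): add the difference vector a, scaled by the constant c, to h."""
    if len(h) != len(a):
        raise ValueError("h and a must have the same dimension")
    return [h_i + c * a_i for h_i, a_i in zip(h, a)]
```

Per eq. (12), `act_add` would be applied with the same `a` and `c` at every position of layer $l$ for the new prompt.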
B.3 RepE
Zou et al. [2023] introduce several intervention methods for controlling model behaviour, which they term representation engineering. First, given a set of prompts $\{x_1, \dots, x_n\}$ designed to elicit the presence of a concept, we randomly pair them, take the difference in activations for each pair, and find the first principal component of the difference vectors at the last token position in some layer of interest $l$ to obtain a reading vector:

$$a_{\text{reading}} = \mathrm{PCA}\left(\left\{ h(x_i)^{(l)}_{-1} - h(x_{i+1})^{(l)}_{-1} \mid i \equiv 0 \bmod 2 \right\}\right)_1 \tag{13}$$

One can also use a more structured pairing of contrastive prompts to obtain a contrast vector, similar to the difference vector computed in activation addition:

$$a_{\text{contrast}} = \mathrm{PCA}\left(\left\{ h(x_i^{+})^{(l)}_{-1} - h(x_i^{-})^{(l)}_{-1} \mid 1 \le i \le n \right\}\right)_1 \tag{14}$$

Then, using either $a_{\text{reading}}$ or $a_{\text{contrast}}$, RepE introduces three operators (i.e. parametrisations of $\Phi$) for intervening on activations:

$$\Phi_{\text{RepE,linear}}(h) = h \pm c \cdot a \tag{15}$$

$$\Phi_{\text{RepE,piecewise}}(h) = h + c \cdot \mathrm{sign}(a \cdot h) \cdot a \tag{16}$$

$$\Phi_{\text{RepE,projection}}(h) = h - c \cdot \frac{a \cdot h}{\lVert a \rVert^{2}} \cdot a \tag{17}$$

The first two of these are similar to activation addition, while the latter is a scaled one-dimensional distributed interchange intervention that is a special case of LoReFT. These operations are then used to intervene on some set of positions $P \subseteq \{1, \dots, n\}$ in the layer of interest:

$$I_{\text{RepE}} = \{\langle \Phi_{\text{RepE}}, P, l \rangle\} \tag{18}$$

RepE introduces another model control method called Low-Rank Representation Adaptation (LoRRA), which is a kind of PEFT rather than a ReFT since it tunes model weights using a variant of LoRA.
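The three RepE operators in eqs. (15) to (17) can be sketched as follows. This is an illustrative toy version on Python lists, assuming the vector `a` (reading or contrast vector) has already been computed; the function names are hypothetical and do not come from the RepE codebase.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(u_i * v_i for u_i, v_i in zip(u, v))


def sign(x: float) -> float:
    """Return 1.0, -1.0 or 0.0 depending on the sign of x."""
    return float((x > 0) - (x < 0))


def repe_linear(h: Vector, a: Vector, c: float, direction: int = 1) -> Vector:
    """Eq. (15): h +/- c * a; direction is +1 or -1."""
    return [h_i + direction * c * a_i for h_i, a_i in zip(h, a)]


def repe_piecewise(h: Vector, a: Vector, c: float) -> Vector:
    """Eq. (16): push h along a in the direction it already points."""
    s = sign(dot(a, h))
    return [h_i + c * s * a_i for h_i, a_i in zip(h, a)]


def repe_projection(h: Vector, a: Vector, c: float) -> Vector:
    """Eq. (17): subtract c times the projection of h onto a."""
    coeff = c * dot(a, h) / dot(a, a)
    return [h_i - coeff * a_i for h_i, a_i in zip(h, a)]
```

With `c = 1`, `repe_projection` removes the component of `h` along `a` entirely, which makes the connection to a one-dimensional subspace intervention easy to see.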
## C Datasets
C.1 Commonsense reasoning
We train and evaluate our models on eight datasets covering different domains of open-ended QA tasks:

1. The BoolQ [Clark et al., 2019] dataset, which is a question-answering dataset of naturally occurring yes-or-no questions. We remove the provided passage in the dataset following previous works to ensure a fair comparison.
2. The PIQA [Bisk et al., 2020] dataset, which tests physical commonsense reasoning and requires the model to choose one of the provided actions to take based on a hypothesised scenario.
3. The SIQA [Sap et al., 2019] dataset, which focuses on reasoning about people's actions and their corresponding social consequences.
4. The HellaSwag [Zellers et al., 2019] dataset, which asks the model to choose an appropriate ending (or sentence completion) given a context.
5. The WinoGrande [Sakaguchi et al., 2021] dataset, inspired by the Winograd Schema Challenge [Levesque et al., 2012], which asks the model to fill in a blank with binary options given a sentence that requires commonsense reasoning.
6. The ARC Easy set (ARC-e) [Clark et al., 2018], which includes genuine grade-school level multiple-choice science questions.
7. The ARC Challenge set (ARC-c) [Clark et al., 2018], which is like ARC-e but designed in a way that co-occurrence methods are expected to fail to answer correctly.
8. The OBQA [Mihaylov et al., 2018] dataset, which is a knowledge-intensive and open-book QA dataset that requires multi-hop reasoning.

Dataset statistics and simplified training examples from each dataset are provided in Hu et al. [2023]. We replicate the experimental setup in Hu et al. [2023] and finetune our models on a combined training dataset (COMMONSENSE170K) of the tasks mentioned above, and evaluate on their individual test sets.
C.2 Arithmetic reasoning
We train and evaluate with seven datasets covering different domains of math word problems:

1. The AddSub [Hosseini et al., 2014] dataset, which involves solving arithmetic word problems that include addition and subtraction.
2. The AQuA [Ling et al., 2017] dataset, which formulates algebraic word problems as multiple-choice problems.
3. The GSM8K [Cobbe et al., 2021] dataset, which consists of grade-school math word problems that require multi-step reasoning.
4. The MAWPS [Koncel-Kedziorski et al., 2016] dataset, which contains math word problems with varying complexity.
5. The MultiArith [Roy and Roth, 2015] dataset, which contains multi-step arithmetic problems.
6. The SingleEq [Koncel-Kedziorski et al., 2015] dataset, which has grade-school math word problems that map to single equations of different lengths.
7. The SVAMP [Patel et al., 2021] dataset, which enhances the original Math Word Problem (MWP) challenge by requiring robust reasoning ability that is invariant to structural alterations of the posed problem.

Dataset statistics and simplified training examples from each dataset are provided in Hu et al. [2023]. We replicate the experimental setup in Hu et al. [2023] and finetune our models on a combined training dataset (MATH10K) of four tasks mentioned above: GSM8K, MAWPS, MAWPS-single and AQuA. Unlike Hu et al. [2023], we exclude AddSub, MultiArith and SingleEq from testing: examples from these tasks are included in the MAWPS training dataset, so the original setup accidentally leaks their testing examples into the training set.
C.3 Natural language understanding
We follow Wu et al. [2024a] for proper evaluation on the GLUE validation set. We split the validation set into two subsets, using one subset, determined by a random seed, for in-training evaluation and the other for testing. Specifically, after each training epoch, we evaluate the model on our in-training evaluation set and select the best model across all epochs for testing. For datasets with a large validation set (i.e., QQP, MNLI, and QNLI), we select 1,000 samples for in-training evaluation. For the remaining smaller datasets, we select half of the samples for this purpose. For the evaluation metric, we use the Matthews correlation coefficient for CoLA, the Pearson correlation coefficient for STS-B, and accuracy for the other datasets. For MNLI, we report results only on the matched version.
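The validation split described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `split_validation` and the `is_large` flag are hypothetical, and a seeded shuffle is only one plausible way of letting the random seed determine the split; the exact sampling procedure of Wu et al. [2024a] may differ.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_validation(
    examples: Sequence[T], seed: int, is_large: bool
) -> Tuple[List[T], List[T]]:
    """Split a validation set into (in-training evaluation set, test set).

    is_large=True corresponds to QQP, MNLI and QNLI (1,000 evaluation samples);
    otherwise half of the samples are used for in-training evaluation.
    """
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)  # the seed fixes the split
    n_eval = min(1000, len(examples)) if is_large else len(examples) // 2
    eval_set = [examples[i] for i in indices[:n_eval]]
    test_set = [examples[i] for i in indices[n_eval:]]
    return eval_set, test_set
```

The two subsets are disjoint by construction, so the checkpoint chosen on the in-training evaluation set is never selected on the examples used for the final test score.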
## D Hyperparameters
D.1 Hyperparameter tuning and decoding strategy

Commonsense reasoning and arithmetic reasoning. We create a standalone development set by taking the last 300 examples from the GSM8K training set. We train our models with the remaining training set of GSM8K and select the hyperparameter settings based on model performance on the development set. We select the hyperparameters using LLaMA-7B, and apply the same settings to LLaMA-13B without additional tuning. We use a maximum sequence length of 512 for training and hyperparameter tuning, and a maximum new token number of 32 for inference. Table 5 and table 6 describe our hyperparameter search space. We use a lower number of epochs (6 instead of 12) for the commonsense reasoning benchmark because the COMMONSENSE170K training set is more than 20 times larger than GSM8K. During inference, we use greedy decoding without sampling for the commonsense reasoning benchmark, since it is a multi-token classification benchmark, and use the same decoding strategy as in Hu et al. [2023] for the arithmetic reasoning benchmark with a higher temperature of 0.3. The reason to switch to a slightly different set of decoding hyperparameters is that the HuggingFace decoding function may throw an error due to statistical instability with close-to-zero probabilities over output tokens with beam search.⁸
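The development split described above is a plain tail slice of the training data. A minimal sketch, assuming the GSM8K training examples are held in an ordered Python list (the function name `split_train_dev` is illustrative):

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_dev(train_examples: Sequence[T], n_dev: int = 300) -> Tuple[List[T], List[T]]:
    """Hold out the last n_dev examples as a development set; train on the rest."""
    if len(train_examples) <= n_dev:
        raise ValueError("training set must be larger than the development set")
    return list(train_examples[:-n_dev]), list(train_examples[-n_dev:])
```

Because the slice is deterministic, every hyperparameter setting is compared on the same 300 held-out examples.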
Instruction following. We finetune LLaMA-7B on Alpaca-52K [Taori et al., 2023] to select hyperparameters. We select the hyperparameter settings based on model performance evaluated with Alpaca-Eval v1.0 [Li et al., 2023], which calculates the win-rate over text-davinci-003 by using gpt-4-turbo as the annotator. We use a maximum sequence length of 768 for training and hyperparameter tuning, and a maximum new token number of 2048 for inference. Table 7 describes our hyperparameter search space. During inference, we use the same decoding strategy as in RED [Wu et al., 2024a] to ensure a fair comparison. Specifically, we use greedy decoding without sampling, and use a maximum repetition n-gram size of 5 with a repetition penalty of 1.1.
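For readers using the HuggingFace `generate()` API, the decoding strategy above could be expressed as the keyword arguments below. This mapping is our assumption for illustration: in particular, reading "maximum repetition n-gram size of 5" as `no_repeat_ngram_size=5` is an interpretation, and the constant name is hypothetical.

```python
# Assumed mapping of the described decoding strategy onto HuggingFace
# `generate()` keyword arguments; not taken from the released training scripts.
INSTRUCTION_DECODING_KWARGS = {
    "do_sample": False,         # greedy decoding without sampling
    "num_beams": 1,             # no beam search
    "max_new_tokens": 2048,     # maximum new token number for inference
    "no_repeat_ngram_size": 5,  # assumed meaning of "repetition n-gram size of 5"
    "repetition_penalty": 1.1,
}
```

These would typically be passed as `model.generate(**inputs, **INSTRUCTION_DECODING_KWARGS)`, where `inputs` is the tokenised prompt.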
Natural language understanding. We conduct hyperparameter tuning with RoBERTa-base and RoBERTa-large for each task individually. We pick the hyperparameters based on testing performance on the held-out validation set with a fixed random seed of 42. We then evaluate our model with four additional unseen seeds {43, 44, 45, 46} for final results. We follow Wu et al. [2024a]'s setting for evaluation. For QQP with RoBERTa-large, there is some stochasticity in runs with the same seed, so we picked the best run out of 3 runs for any particular seed. As reported by Wu et al. [2024a], we also observe that evaluation results on RTE are unstable due to the small size of the dataset. We thus replace several random seeds as in Wu et al. [2024a] to ensure a fair comparison. In addition, we replace one or two random seeds for CoLA for stability. Table 8 describes our hyperparameter search space. Table 9 to table 12 describe our hyperparameter settings for each task. We conduct separate hyperparameter tuning for LoReFT and DiReFT to ensure a fair comparison.
⁸ See reference ticket: https://github.com/huggingface/transformers/issues/11267.
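Results over the five seeds are later reported as a mean with a standard deviation. A minimal sketch of that aggregation is shown below; whether the reported SD is the sample or the population standard deviation is not stated, so the use of the sample version here is an assumption, and `summarise_runs` is an illustrative name.

```python
import statistics
from typing import Sequence, Tuple


def summarise_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to compute a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)
```

For example, `summarise_runs([s42, s43, s44, s45, s46])` would take one score per seed, where the `s*` values are placeholders for the per-seed results.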
Table 5: Hyperparameter search space of LLaMA-1 7B models with LoReFT on the GSM8K development set with the best settings underlined. We use greedy decoding without sampling during hyperparameter tuning.

| Hyperparameters | LLaMA-7B w/ GSM8K for LoReFT |
|---|---|
| prefix+suffix position p + s | {p1+s1, p3+s3, p5+s5, p7+s7, p9+s9, p11+s11} |
| Tied weight p, s | {True, False} |
| Rank r | {8, 16, 32, 64} |
| Layer L (sep. w/ ;) | {0;2;4;6;10;12;14;18, 10;12;14;18;20;22;24;28, 4;6;10;12;14;18;20;22, all} |
| Dropout | {0.00, 0.05} |
| Optimizer | AdamW |
| LR | {$9 \times 10^{-5}$, $1 \times 10^{-4}$, $3 \times 10^{-4}$, $6 \times 10^{-4}$, $9 \times 10^{-4}$, $1 \times 10^{-3}$, $3 \times 10^{-3}$} |
| Weight decay | {0, $1 \times 10^{-3}$, $2 \times 10^{-3}$} |
| LR scheduler | Linear |
| Batch size | {4, 8, 16, 32, 64} |
| Warmup ratio | {0.00, 0.06, 0.10} |
| Epochs | {3, 6, 9, 12, 18} |
Table 6: Hyperparameter search space of LLaMA-1 7B models with DiReFT on the GSM8K development set with the best settings underlined. We use greedy decoding without sampling during hyperparameter tuning.

| Hyperparameters | LLaMA-7B w/ GSM8K for DiReFT |
|---|---|
| prefix+suffix position p + s | {p1+s1, p3+s3, p5+s5, p7+s7, p9+s9, p11+s11} |
| Tied weight p, s | {True, False} |
| Rank r | {8, 16, 32, 64} |
| Layer L (sep. w/ ;) | {0;2;4;6;10;12;14;18, 10;12;14;18;20;22;24;28, 4;6;10;12;14;18;20;22, all} |
| Dropout | {0.00, 0.05} |
| Optimizer | AdamW |
| LR | {$9 \times 10^{-5}$, $1 \times 10^{-4}$, $3 \times 10^{-4}$, $6 \times 10^{-4}$, $9 \times 10^{-4}$, $1 \times 10^{-3}$, $3 \times 10^{-3}$} |
| Weight decay | {0, $1 \times 10^{-3}$, $2 \times 10^{-3}$, $6 \times 10^{-3}$, $1 \times 10^{-2}$, $2 \times 10^{-2}$, $6 \times 10^{-2}$} |
| LR scheduler | Linear |
| Batch size | {4, 8, 16, 32, 64} |
| Warmup ratio | {0.00, 0.06, 0.10} |
| Epochs | {3, 6, 9, 12, 18} |
Table 7: Hyperparameter search space of LLaMA-1 7B models on Alpaca-52K evaluated by Alpaca-Eval v1.0 with the best settings underlined. We use greedy decoding without sampling during hyperparameter tuning. LoReFT and DiReFT have the same hyperparameter settings.

| Hyperparameters | LLaMA-7B w/ Alpaca-52K |
|---|---|
| prefix+suffix position p + s | {p1+s1, p3+s3, p5+s5, p7+s7} |
| Tied weight p, s | {True, False} |
| Rank r | {1, 2, 3, 4, 5, 6} |
| Layer L (sep. w/ ;) | {9;18, 3;9;18, 3;9;18;24} |
| Dropout | {0.00, 0.05} |
| Optimizer | AdamW |
| LR | $9 \times 10^{-4}$ |
| Weight decay | $0 \times 10^{-3}$ |
| LR scheduler | Linear |
| Batch size | {16, 32, 64, 128} |
| Warmup ratio | 0.00 |
| Epochs | {1, 3, 6, 9, 12} |
Table 8: Hyperparameter search space of RoBERTa-base and RoBERTa-large models on GLUE evaluated with classification accuracy. Best hyperparameter settings are task-specific, which are specified in separate tables.

| Hyperparameters | RoBERTa-base and RoBERTa-large w/ GLUE |
|---|---|
| prefix+suffix position p + s | {p1, p3, p5, p7, p9, p11} |
| Tied weight p, s | False |
| Rank r | {1, 2} |
| Layer L (sep. w/ ;) | {1;3;5;7;9;11, all} |
| Dropout | {0.00, 0.05, 0.10, 0.15, 0.20} |
| Optimizer | AdamW |
| LR | {$1 \times 10^{-4}$, $2 \times 10^{-4}$, $3 \times 10^{-4}$, $4 \times 10^{-4}$, $5 \times 10^{-4}$}, {$6 \times 10^{-4}$, $9 \times 10^{-4}$, $1 \times 10^{-3}$, $3 \times 10^{-3}$} |
| Weight decay | {0, $1 \times 10^{-4}$, $6 \times 10^{-4}$, $1 \times 10^{-3}$, $6 \times 10^{-3}$, $1 \times 10^{-2}$, $2 \times 10^{-2}$, $4 \times 10^{-2}$} |
| LR scheduler | Linear |
| Batch size | {16, 32, 64, 128} |
| Warmup ratio | {0, $5 \times 10^{-3}$, $6 \times 10^{-3}$, $3 \times 10^{-2}$, $5 \times 10^{-2}$, $6 \times 10^{-2}$, $1 \times 10^{-1}$, $2 \times 10^{-1}$} |
| Epochs | {20, 30, 40, 50, 60} |
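Table 9: Hyperparameter settings of RoBERTa-base models on GLUE for LoReFT.

| Hyperparameters | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B |
|---|---|---|---|---|---|---|---|---|
| position p | p1 | p3 | p3 | p3 | p11 | p11 | p3 | p3 |
| Tied weight | False (all tasks) | | | | | | | |
| Rank r | 1 (all tasks) | | | | | | | |
| Layer L | all (all tasks) | | | | | | | |
| Dropout | 0.05 | 0.10 | 0.05 | 0.20 | 0.05 | 0.05 | 0.05 | 0.05 |
| Optimizer | AdamW (all tasks) | | | | | | | |
| LR | $6 \times 10^{-4}$ | $6 \times 10^{-4}$ | $3 \times 10^{-4}$ | $4 \times 10^{-4}$ | $9 \times 10^{-4}$ | $6 \times 10^{-4}$ | $9 \times 10^{-4}$ | $6 \times 10^{-4}$ |
| Weight decay | 0.00 (all tasks) | | | | | | | |
| LR scheduler | Linear (all tasks) | | | | | | | |
| Batch size | 32 (all tasks) | | | | | | | |
| Warmup ratio | $6 \times 10^{-2}$ | $1 \times 10^{-1}$ | 0 | $5 \times 10^{-3}$ | $1 \times 10^{-1}$ | 0 | 0 | $3 \times 10^{-2}$ |
| Epochs | 40 | 40 | 40 | 60 | 20 | 40 | 60 | 60 |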
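Table 10: Hyperparameter settings of RoBERTa-large models on GLUE for LoReFT.

| Hyperparameters | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B |
|---|---|---|---|---|---|---|---|---|
| position p | p1 | p3 | p3 | p3 | p11 | p11 | p3 | p3 |
| Tied weight | False (all tasks) | | | | | | | |
| Rank r | 1 (all tasks) | | | | | | | |
| Layer L | all (all tasks) | | | | | | | |
| Dropout | 0.05 | 0.05 | 0.20 | 0.20 | 0.05 | 0.05 | 0.05 | 0.05 |
| Optimizer | AdamW (all tasks) | | | | | | | |
| LR | $6 \times 10^{-4}$ | $6 \times 10^{-4}$ | $3 \times 10^{-4}$ | $1 \times 10^{-4}$ | $9 \times 10^{-4}$ | $6 \times 10^{-4}$ | $6 \times 10^{-4}$ | $8 \times 10^{-4}$ |
| Weight decay | 0.00 (all tasks) | | | | | | | |
| LR scheduler | Linear (all tasks) | | | | | | | |
| Batch size | 32 (all tasks) | | | | | | | |
| Warmup ratio | 0.00 | 0.10 | 0.06 | 0.20 | 0.10 | 0.06 | 0.00 | 0.20 |
| Epochs | 20 | 20 | 30 | 30 | 20 | 20 | 30 | 30 |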
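Table 11: Hyperparameter settings of RoBERTa-base models on GLUE for DiReFT.

| Hyperparameters | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B |
|---|---|---|---|---|---|---|---|---|
| position p | p1 | p3 | p5 | p1 | p11 | p11 | p1 | p3 |
| Tied weight | False (all tasks) | | | | | | | |
| Rank r | 1 (all tasks) | | | | | | | |
| Layer L | all (all tasks) | | | | | | | |
| Dropout | 0.05 | 0.10 | 0.05 | 0.00 | 0.05 | 0.05 | 0.00 | 0.05 |
| Optimizer | AdamW (all tasks) | | | | | | | |
| LR | $6 \times 10^{-4}$ | $6 \times 10^{-4}$ | $3 \times 10^{-4}$ | $6 \times 10^{-4}$ | $9 \times 10^{-4}$ | $6 \times 10^{-4}$ | $9 \times 10^{-4}$ | $6 \times 10^{-4}$ |
| Weight decay | 0.00 | 0.00 | 0.00 | 0.04 | 0.00 | 0.00 | 0.04 | 0.00 |
| LR scheduler | Linear (all tasks) | | | | | | | |
| Batch size | 32 | 32 | 32 | 32 | 32 | 32 | 8 | 32 |
| Warmup ratio | $6 \times 10^{-2}$ | $1 \times 10^{-1}$ | $1 \times 10^{-1}$ | 0 | $1 \times 10^{-1}$ | 0 | 0 | $3 \times 10^{-2}$ |
| Epochs | 40 | 40 | 40 | 60 | 20 | 40 | 60 | 60 |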
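Table 12: Hyperparameter settings of RoBERTa-large models on GLUE for DiReFT.

| Hyperparameters | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B |
|---|---|---|---|---|---|---|---|---|
| position p | p1 | p3 | p1 | p1 | p11 | p7 | p3 | p3 |
| Tied weight | False (all tasks) | | | | | | | |
| Rank r | 1 (all tasks) | | | | | | | |
| Layer L | all (all tasks) | | | | | | | |
| Dropout | 0.05 | 0.05 | 0.10 | 0.15 | 0.05 | 0.05 | 0.05 | 0.05 |
| Optimizer | AdamW (all tasks) | | | | | | | |
| LR | $6 \times 10^{-4}$ | $6 \times 10^{-4}$ | $9 \times 10^{-4}$ | $9 \times 10^{-4}$ | $9 \times 10^{-4}$ | $9 \times 10^{-4}$ | $6 \times 10^{-4}$ | $8 \times 10^{-4}$ |
| Weight decay | 0 | 0 | 0 | 0 | 0 | 0 | $6 \times 10^{-3}$ | 0 |
| LR scheduler | Linear (all tasks) | | | | | | | |
| Batch size | 32 (all tasks) | | | | | | | |
| Warmup ratio | 0.00 | 0.10 | 0.00 | 0.00 | 0.10 | 0.10 | 0.00 | 0.10 |
| Epochs | 20 | 20 | 50 | 60 | 20 | 20 | 30 | 30 |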
Table 13: Accuracy comparison of RoBERTa-base and RoBERTa-large against existing PEFT methods on the GLUE benchmark with standard deviation (SD). Performance results of all baseline methods are taken from Wu et al. [2024a]. We report averaged performance of five runs with distinct random seeds for our method. Param. (%) is calculated by dividing the number of trainable parameters (excluding the number of parameters of the classification head) by the number of parameters of the base LM.

| Model | PEFT | Params (%) | MNLI | SST-2 | MRPC | CoLA | QNLI | QQP | RTE | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| base | FT | 100% | 87.3 (0.34) | 94.4 (0.96) | 87.9 (0.91) | 62.4 (3.29) | 92.5 (0.22) | 91.7 (0.19) | 78.3 (3.20) | 90.6 (0.59) | 85.6 |
| base | Adapter | 0.318% | 87.0 (0.28) | 93.3 (0.40) | 88.4 (1.54) | 60.9 (3.09) | 92.5 (0.02) | 90.5 (0.08) | 76.5 (2.26) | 90.5 (0.35) | 85.0 |
| base | LoRA | 0.239% | 86.6 (0.23) | 93.9 (0.49) | 88.7 (0.76) | 59.7 (4.36) | 92.6 (0.10) | 90.4 (0.08) | 75.3 (2.79) | 90.3 (0.54) | 84.7 |
| base | Adapter$^{\text{FNN}}$ | 0.239% | 87.1 (0.10) | 93.0 (0.05) | 88.8 (1.38) | 58.5 (1.69) | 92.0 (0.28) | 90.2 (0.07) | 77.7 (1.93) | 90.4 (0.31) | 84.7 |
| base | BitFit | 0.080% | 84.7 (0.08) | 94.0 (0.87) | 88.1 (1.57) | 54.0 (3.07) | 91.0 (0.05) | 87.3 (0.02) | 69.8 (1.51) | 89.5 (0.35) | 82.3 |
| base | RED | 0.016% | 83.9 (0.14) | 93.9 (0.31) | 89.2 (0.98) | 61.0 (2.96) | 90.7 (0.35) | 87.2 (0.17) | 78.0 (2.06) | 90.4 (0.32) | 84.3 |
| base | DiReFT (ours) | 0.015% | 82.5 (0.22) | 92.6 (0.76) | 88.3 (1.23) | 58.6 (1.99) | 91.3 (0.19) | 86.4 (0.27) | 76.4 (1.48) | 89.3 (0.56) | 83.2 |
| base | LoReFT (ours) | 0.015% | 83.1 (0.26) | 93.4 (0.64) | 89.2 (2.62) | 60.4 (2.60) | 91.2 (0.25) | 87.4 (0.23) | 79.0 (2.76) | 90.0 (0.29) | 84.2 |
| large | FT | 100% | 88.8 (0.45) | 96.0 (0.66) | 91.7 (1.73) | 68.2 (2.62) | 93.8 (0.33) | 91.5 (1.28) | 85.8 (1.40) | 92.6 (0.16) | 88.6 |
| large | Adapter | 0.254% | 90.1 (0.12) | 95.2 (0.48) | 90.5 (0.59) | 65.4 (2.24) | 94.6 (0.17) | 91.4 (0.13) | 85.3 (1.34) | 91.5 (0.33) | 88.0 |
| large | LoRA | 0.225% | 90.2 (0.25) | 96.0 (0.85) | 89.8 (2.09) | 65.5 (2.02) | 94.7 (0.21) | 90.7 (0.91) | 86.3 (2.41) | 91.7 (0.44) | 88.1 |
| large | Adapter$^{\text{FNN}}$ | 0.225% | 90.3 (0.15) | 96.1 (0.75) | 90.5 (1.26) | 64.4 (1.56) | 94.3 (0.39) | 91.3 (0.24) | 84.8 (2.01) | 90.2 (0.24) | 87.7 |
| large | RED | 0.014% | 89.5 (0.38) | 96.0 (0.48) | 90.3 (1.40) | 68.1 (1.69) | 93.5 (0.33) | 88.8 (0.11) | 86.2 (1.40) | 91.3 (0.21) | 88.0 |
| large | DiReFT (ours) | 0.014% | 88.7 (0.13) | 95.4 (0.60) | 88.5 (2.16) | 66.7 (2.21) | 93.9 (0.39) | 88.1 (0.47) | 86.9 (1.56) | 91.2 (0.29) | 87.4 |
| large | LoReFT (ours) | 0.014% | 89.2 (0.27) | 96.2 (0.72) | 90.1 (1.17) | 68.0 (1.44) | 94.1 (0.35) | 88.5 (0.45) | 87.5 (1.49) | 91.6 (0.43) | 88.2 |
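The Param. (%) column can be reproduced with simple arithmetic. The sketch below assumes that a rank-$r$ LoReFT intervention on a $d$-dimensional representation trains two $r \times d$ matrices and one $r$-dimensional bias, that one intervention is learned per layer, and that RoBERTa-base has 12 layers, hidden size 768 and roughly 125M parameters. These model figures and the function names are our assumptions for illustration, not values stated in this appendix.

```python
def loreft_param_count(hidden_size: int, rank: int, n_interventions: int) -> int:
    """Trainable parameters, assuming two (rank x hidden_size) matrices plus a rank-sized bias each."""
    per_intervention = 2 * rank * hidden_size + rank
    return per_intervention * n_interventions


def param_percent(trainable: int, base_model_params: int) -> float:
    """Trainable parameters as a percentage of the base LM's parameters."""
    return 100.0 * trainable / base_model_params


# Assumed RoBERTa-base figures: hidden size 768, 12 layers, ~125M parameters.
roberta_base_trainable = loreft_param_count(hidden_size=768, rank=1, n_interventions=12)
roberta_base_percent = param_percent(roberta_base_trainable, 125_000_000)
```

Under these assumptions the percentage rounds to 0.015, in line with the LoReFT rows for the base model in Table 13.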
D.2 Suggestions on choosing hyperparameters for ReFT

Similar to PEFTs or finetuning, ReFT can be sensitive to hyperparameter settings. Here, we recommend a non-exhaustive list of suggestions for choosing the best hyperparameter settings for your tasks:

- Intervening on multiple positions delivers significant gains. We find that intervening only on a single token position (e.g., just the first one or the last one) is always less effective than intervening on multiple tokens. However, intervening on an excessive number of tokens might harm performance by slowing down convergence.
- Intervene on all layers first, and then shrink down. Intervening on all layers often provides a good baseline. We recommend that users start with all layers, and shrink down the number of intervening layers depending on the desired performance–parameter count balance.
- Higher rank may not entail better performance. High rank entails higher parameter count, but it does not always bring performance gain (likely due to slower convergence). We recommend that users start with a rank that is lower than 32 (e.g. rank 4).
- Tie intervention weights as much as you can. In the paper, we explore tying the intervention weights between prefix and suffix token positions. It automatically halves the parameter count, and it can result in better performance as well. We suspect weight sharing across layers may also help.
- Hyperparameter tuning with learning rate, warmup ratio, dropout rate and weight decay should go after other hyperparameters. These classic neural-network training hyperparameters can play a role, yet they have a much smaller effect than the previous ones.
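To make the position and weight-tying suggestions concrete, the sketch below reads a setting such as p7+s7 as "the first 7 and the last 7 token positions" and counts one intervention per layer when prefix and suffix weights are tied, two otherwise. Both readings are assumptions for illustration, and the function names are hypothetical rather than pyreft APIs.

```python
from typing import List, Tuple


def intervention_positions(seq_len: int, n_prefix: int, n_suffix: int) -> Tuple[List[int], List[int]]:
    """Return (prefix positions, suffix positions) as 0-based token indices.

    For sequences shorter than n_prefix + n_suffix the two lists may overlap;
    how a real implementation resolves that is not covered here.
    """
    prefix = list(range(min(n_prefix, seq_len)))
    suffix = list(range(max(seq_len - n_suffix, 0), seq_len))
    return prefix, suffix


def interventions_per_layer(tied: bool) -> int:
    """Tying prefix and suffix weights shares one intervention instead of learning two."""
    return 1 if tied else 2
```

Going from untied to tied weights changes `interventions_per_layer` from 2 to 1, which is where the halving of the parameter count comes from.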
D.3 Additional hyperparameter-tuning results of LoReFT
As a result of our hyperparameter searching process, LoReFT is trained with more epochs compared to LoRA [Hu et al., 2022] or DoRA [Liu et al., 2024c]. This raises the concern of whether our performance gain is purely due to the larger number of epochs. We thus rerun our experiments with the exact same number of epochs and effective batch size as LoRA or DoRA. Results are shown in table 14 and table 15. With matched hyperparameters, LoReFT shows similar results, still outperforming previous methods significantly on the eight commonsense reasoning datasets.

Recently, VeRA was proposed as a new variant of LoRA that further reduces the number of trainable parameters while maintaining performance [Kopiczko et al., 2024]. Table 16 shows our results compared against VeRA as well as the baseline numbers reported in VeRA's paper. We include this set of results in the appendix, given that the hyperparameter tuning process is drastically different from ours.⁹ The original VeRA implementation records the performance of the best epoch on the validation set, which could cause overfitting since results are selected based on test set performance.

⁹ VeRA's original implementation can be found at https://openreview.net/notes/edits/attachment?id=D0dcbrnPq0&name=supplementary_material.
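The matched-budget comparison comes down to counting optimizer steps. The sketch below assumes no gradient accumulation and a training set of about 170,000 examples (suggested by the name COMMONSENSE170K); the batch size of 32 used for the longer run is a hypothetical value chosen for illustration, and `optimizer_steps` is an illustrative name.

```python
import math


def optimizer_steps(n_examples: int, batch_size: int, epochs: int) -> int:
    """Number of gradient updates, assuming one update per batch."""
    return math.ceil(n_examples / batch_size) * epochs


# 3 epochs at batch size 16 (the matched setting described in the table captions).
steps_matched = optimizer_steps(170_000, batch_size=16, epochs=3)
# A hypothetical longer run: 6 epochs at batch size 32.
steps_longer = optimizer_steps(170_000, batch_size=32, epochs=6)
```

Halving both the number of epochs and the batch size leaves the number of updates essentially unchanged, which is why the reduced batch size is needed to keep the comparison fair.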
Table 14: Accuracy comparison of LLaMA-7B and LLaMA-13B against existing PEFT methods on eight commonsense reasoning datasets. Performance results of all baseline methods are taken from Liu et al. [2024c]. We report averaged performance of three runs with distinct random seeds for our method. For LoReFT, Param. (%) is calculated by dividing the number of trainable parameters by the number of parameters of the base LM. We include LoReFT$_{e=3}$, which is trained with 3 epochs (the same number of epochs as DoRA) but with a reduced batch size of 16 to ensure an equivalent number of gradient steps.

| Model | PEFT | Params (%) | BoolQ | PIQA | SIQA | HellaS. | WinoG. | ARC-e | ARC-c | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ChatGPT∗ | - | - | 73.1 | 85.4 | 68.5 | 78.5 | 66.1 | 89.8 | 79.9 | 74.8 | 77.0 |
| LLaMA-7B | PrefT | 0.039% | 64.3 | 76.8 | 73.9 | 42.1 | 72.1 | 72.9 | 54.0 | 60.6 | 64.6 |
| LLaMA-7B | Adapter$^{S}$ | 1.953% | 63.0 | 79.2 | 76.3 | 67.9 | 75.7 | 74.5 | 57.1 | 72.4 | 70.8 |
| LLaMA-7B | Adapter$^{P}$ | 3.542% | 67.9 | 76.4 | 78.8 | 69.8 | 78.9 | 73.7 | 57.3 | 75.2 | 72.3 |
| LLaMA-7B | LoRA | 0.826% | 68.9 | 80.7 | 77.4 | 78.1 | 78.8 | 77.8 | 61.3 | 74.8 | 74.7 |
| LLaMA-7B | DoRA (half) | 0.427% | 70.0 | 82.6 | 79.7 | 83.2 | 80.6 | 80.6 | 65.4 | 77.6 | 77.5 |
| LLaMA-7B | DoRA | 0.838% | 68.5 | 82.9 | 79.6 | 84.8 | 80.8 | 81.4 | 65.8 | 81.0 | 78.1 |
| LLaMA-7B | LoReFT$_{e=3}$ | 0.031% | 68.3 | 83.5 | 79.7 | 92.7 | 82.6 | 83.2 | 67.4 | 78.5 | 79.5 |
| LLaMA-7B | LoReFT (ours) | 0.031% | 69.3 | 84.4 | 80.3 | 93.1 | 84.2 | 83.2 | 68.2 | 78.9 | 80.2 |
| LLaMA-13B | PrefT | 0.031% | 65.3 | 75.4 | 72.1 | 55.2 | 68.6 | 79.5 | 62.9 | 68.0 | 68.4 |
| LLaMA-13B | Adapter$^{S}$ | 1.586% | 71.8 | 83.0 | 79.2 | 88.1 | 82.4 | 82.5 | 67.3 | 81.8 | 79.5 |
| LLaMA-13B | Adapter$^{P}$ | 2.894% | 72.5 | 84.9 | 79.8 | 92.1 | 84.7 | 84.2 | 71.2 | 82.4 | 81.5 |
| LLaMA-13B | LoRA | 0.670% | 72.1 | 83.5 | 80.5 | 90.5 | 83.7 | 82.8 | 68.3 | 82.4 | 80.5 |
| LLaMA-13B | DoRA (half) | 0.347% | 72.5 | 85.3 | 79.9 | 90.1 | 82.9 | 82.7 | 69.7 | 83.6 | 80.8 |
| LLaMA-13B | DoRA | 0.681% | 72.4 | 84.9 | 81.5 | 92.4 | 84.2 | 84.2 | 69.6 | 82.8 | 81.5 |
| LLaMA-13B | LoReFT$_{e=3}$ | 0.025% | 72.0 | 85.6 | 82.1 | 94.8 | 85.3 | 86.9 | 73.0 | 85.0 | 83.1 |
| LLaMA-13B | LoReFT (ours) | 0.025% | 72.1 | 86.3 | 81.8 | 95.1 | 87.2 | 86.2 | 73.7 | 84.2 | 83.3 |
Table 15: Accuracy comparison of LLaMA-7B and LLaMA-13B against existing PEFT methods on four arithmetic reasoning datasets. Performance results of all baseline methods are taken from Hu et al. [2023]. We report averaged performance of three runs with distinct random seeds for our method. We include LoReFT$_{e=3}$, which is trained with 3 epochs (the same number of epochs as DoRA) but with a reduced batch size of 16 to ensure an equivalent number of gradient steps.

| Model | PEFT | Params (%) | AQuA | GSM8K | MAWPS | SVAMP | Avg. |
|---|---|---|---|---|---|---|---|
| LLaMA-7B | PrefT | 0.039% | 14.2 | 24.4 | 63.4 | 38.1 | 35.0 |
| LLaMA-7B | Adapter$^{S}$ | 1.953% | 15.0 | 33.3 | 77.7 | 52.3 | 44.6 |
| LLaMA-7B | Adapter$^{P}$ | 3.542% | 18.1 | 35.3 | 82.4 | 49.6 | 46.4 |
| LLaMA-7B | LoRA | 0.826% | 18.9 | 37.5 | 79.0 | 52.1 | 46.9 |
| LLaMA-7B | LoReFT$_{e=3}$ | 0.031% | 22.4 | 21.6 | 69.5 | 43.6 | 39.3 |
| LLaMA-7B | LoReFT (ours) | 0.031% | 21.4 | 26.0 | 76.2 | 46.8 | 42.6 |
| LLaMA-13B | PrefT | 0.031% | 15.7 | 31.1 | 66.8 | 41.4 | 38.8 |
| LLaMA-13B | Adapter$^{S}$ | 1.586% | 22.0 | 44.0 | 78.6 | 50.8 | 48.9 |
| LLaMA-13B | Adapter$^{P}$ | 2.894% | 20.5 | 43.3 | 81.1 | 55.7 | 50.2 |
| LLaMA-13B | LoRA | 0.670% | 18.5 | 47.5 | 83.6 | 54.6 | 51.1 |
| LLaMA-13B | LoReFT$_{e=3}$ | 0.025% | 23.4 | 35.5 | 81.8 | 54.6 | 48.8 |
| LLaMA-13B | LoReFT (ours) | 0.025% | 23.6 | 38.1 | 82.4 | 54.2 | 49.6 |

Table 16: Accuracy comparison of RoBERTa-base and RoBERTa-large against existing PEFT methods on the GLUE benchmark. Performance results of all baseline methods are taken from Kopiczko et al. [2024]. To ensure a fair comparison, we report median performance of five runs with distinct random seeds for our method.

| Model | PEFT | Params (%) | SST-2 | MRPC | CoLA | QNLI | RTE | STS-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| base | FT | 100% | 94.8 | 90.2 | 63.6 | 92.8 | 78.7 | 91.2 | 85.2 |
| base | BitFit | 0.080% | 93.7 | 92.7 | 62.0 | 91.8 | 81.5 | 90.8 | 85.4 |
| base | Adpt$^{D}$ | 0.239% | 94.2 | 88.5 | 60.8 | 93.1 | 71.5 | 89.7 | 83.0 |
| base | Adpt$^{D}$ | 0.717% | 94.7 | 88.4 | 62.6 | 93.0 | 75.9 | 90.3 | 84.2 |
| base | LoRA | 0.239% | 95.1 | 89.7 | 63.4 | 93.3 | 86.6 | 91.5 | 86.6 |
| base | VeRA | 0.034% | 94.6 | 89.5 | 65.6 | 91.8 | 78.8 | 90.7 | 85.2 |
| base | DiReFT (ours) | 0.015% | 92.2 | 88.7 | 59.5 | 91.3 | 77.0 | 89.6 | 83.0 |
| base | LoReFT (ours) | 0.015% | 93.6 | 87.8 | 59.1 | 91.3 | 79.9 | 90.0 | 83.6 |
| large | Adpt$^{P}$ | 0.845% | 96.1 | 90.2 | 68.3 | 94.8 | 83.8 | 92.1 | 87.6 |
| large | Adpt$^{P}$ | 0.225% | 96.6 | 89.7 | 67.8 | 94.8 | 80.1 | 91.9 | 86.8 |
| large | Adpt$^{H}$ | 1.690% | 96.2 | 88.7 | 66.5 | 94.7 | 83.4 | 91.0 | 86.8 |
| large | Adpt$^{H}$ | 0.225% | 96.3 | 87.7 | 66.3 | 94.7 | 72.9 | 91.5 | 84.9 |
| large | LoRA-FA | 1.042% | 96.0 | 90.0 | 68.0 | 94.4 | 86.1 | 92.0 | 87.8 |
| large | LoRA | 0.225% | 96.2 | 90.2 | 68.2 | 94.8 | 85.2 | 92.3 | 87.8 |
| large | VeRA | 0.017% | 96.1 | 90.9 | 68.0 | 94.4 | 85.9 | 91.7 | 87.8 |
| large | DiReFT (ours) | 0.014% | 95.2 | 88.2 | 66.7 | 94.0 | 86.3 | 91.0 | 86.9 |
| large | LoReFT (ours) | 0.014% | 96.1 | 90.2 | 68.2 | 94.1 | 87.8 | 91.5 | 88.0 |
Table 17: Accuracy comparison of LLaMA-7B and LLaMA-13B across our ablation variants on four arithmetic reasoning datasets, with standard deviation (SD) in parentheses. We report performance averaged over three runs with distinct random seeds for all of our variants. All methods use the existing hyperparameter settings from LoReFT except DiReFT. Columns AQuA through Avg. report accuracy (↑).

| Model | Φ(h) | Params (%) | AQuA | GSM8K | MAWPS | SVAMP | Avg. (SD) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-7B | $h + R^\top b$ | 0.016% | 14.4 | 14.2 | 59.9 | 36.8 | 31.3 (0.47) |
| | $h + R^\top (b - Rh)$ | 0.016% | 20.1 | 21.2 | 67.9 | 39.2 | 37.1 (0.19) |
| | $h + R^\top (Wh + b)$ | 0.031% | 21.3 | 27.4 | 76.6 | 46.3 | 42.9 (0.37) |
| | $h + W_2^\top (W_1 h + b - W_2 h)$ | 0.031% | 23.1 | 25.5 | 75.4 | 45.6 | 42.4 (0.71) |
| | DiReFT | 0.031% | 21.3 | 24.1 | 74.5 | 42.7 | 40.6 (0.44) |
| | LoReFT | 0.031% | 21.4 | 26.0 | 76.2 | 46.8 | 42.6 (0.46) |
| LLaMA-13B | $h + R^\top b$ | 0.013% | 16.8 | 25.3 | 69.3 | 46.8 | 39.5 (0.81) |
| | $h + R^\top (b - Rh)$ | 0.013% | 21.9 | 35.6 | 80.3 | 51.7 | 47.4 (0.64) |
| | $h + R^\top (Wh + b)$ | 0.025% | 25.1 | 36.7 | 81.9 | 53.6 | 49.3 (0.39) |
| | $h + W_2^\top (W_1 h + b - W_2 h)$ | 0.025% | 23.5 | 36.5 | 82.1 | 54.1 | 49.0 (0.63) |
| | DiReFT | 0.025% | 20.5 | 35.8 | 80.8 | 54.8 | 48.0 (1.23) |
| | LoReFT | 0.025% | 23.6 | 38.1 | 82.4 | 54.2 | 49.6 (0.71) |
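The intervention functions compared in Table 17 can be written out as a minimal NumPy sketch. It assumes that eq. (2) has the form Φ(h) = h + R⊺(Wh + b − Rh) with R having orthonormal rows; the function names, the toy dimensions, and the QR-based stand-in for a learned R are illustrative choices, not part of the authors' library.

```python
import numpy as np

def orthonormal_rows(r, d, rng):
    """Return an r x d matrix with orthonormal rows (a stand-in for a learned R)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, r)))  # q is d x r with orthonormal columns
    return q.T

def loreft(h, R, W, b):
    # Assumed form of eq. (2): h + R^T (W h + b - R h)
    return h + R.T @ (W @ h + b - R @ h)

def variant_nonorthogonal(h, W1, W2, b):
    # h + W2^T (W1 h + b - W2 h), no orthonormality constraint on W1, W2
    return h + W2.T @ (W1 @ h + b - W2 @ h)

def variant_direct_edit(h, R, W, b):
    # h + R^T (W h + b): adds the edit without removing the current projection R h
    return h + R.T @ (W @ h + b)

def variant_constant_subspace(h, R, b):
    # h + R^T (b - R h): replaces the subspace coordinates with a constant b
    return h + R.T @ (b - R @ h)

def variant_bias_only(h, R, b):
    # h + R^T b: low-rank bias-only edit
    return h + R.T @ b

rng = np.random.default_rng(0)
d, r = 16, 4  # toy sizes, chosen only for illustration
R = orthonormal_rows(r, d, rng)
W = rng.standard_normal((r, d))
b = rng.standard_normal(r)
h = rng.standard_normal(d)

# With orthonormal rows, LoReFT sets the subspace coordinates of h to W h + b,
# and the constant-subspace variant sets them to b.
assert np.allclose(R @ loreft(h, R, W, b), W @ h + b)
assert np.allclose(R @ variant_constant_subspace(h, R, b), b)
```

The two assertions illustrate why the orthonormality constraint matters: it makes R R⊺ the identity, so the edit overwrites the subspace coordinates exactly rather than approximately.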
## E Ablating the parametrisation of LoReFT
In this section, we provide additional results by analysing how task performance changes when terms in eq. (2) are ablated. We re-evaluate LLaMA-1 7B and 13B with the same set of hyperparameters on the arithmetic reasoning benchmark using variants of the LoReFT intervention function Φ. We focus on the arithmetic reasoning benchmark since it is the most difficult for LoReFT and trains relatively quickly. We conduct experiments with the following parametrisations:

1. $\Phi(h) = h + W_2^\top (W_1 h + b - W_2 h)$, where both $W_1, W_2 \in \mathbb{R}^{r \times d}$ are low-rank non-orthogonal linear projection matrices. It has the same trainable parameter count as LoReFT, yet lower memory overhead because the orthonormality constraint is removed.
2. $\Phi(h) = h + R^\top (Wh + b)$, which directly edits the representation in a learned linear subspace. It has the same trainable parameter count as LoReFT, yet requires less computation per intervention.
3. $\Phi(h) = h + R^\top (b - Rh)$, which makes the linear subspace intervention a constant bias term that is input-independent. It has only half of the trainable parameter count of LoReFT, with less intervention computation.
4. $\Phi(h) = h + R^\top b$. This is a low-rank subspace bias-only intervention, and is closely related to BitFit [Ben Zaken et al., 2022]. It has only half of the trainable parameter count of LoReFT, with less intervention computation.

As shown in table 17, variants with a similar number of trainable parameters also achieve similar performance to LoReFT across both models.

## F Memorisation experiments
### F.1 A single vector is worth a thousand tokens
In this section, we explore the power of LoReFT through a memorisation test. Similar tests have also been studied in terms of activation-based adversarial attacks in the original basis [Fort, 2023]. Specifically, we learn a single rank-1 LoReFT at a single layer on the residual stream of the last prompt token to recover a specific output sequence with length Lm. For simplicity, we simplify LoReFT in Eqn. 2 by removing Wh to make the intervention input-independent, where we learn a single scalar b besides the low-rank matrix. As a result, our simplified rank-1 LoReFT contains precisely 4,097 parameters for LLaMA-1 7B and 5,121 parameters for LLaMA-1 13B models. 10
We measure memorisation power by how large Lm can be and by how accurately the output sequence is recovered, reported as the percentage of the sequence covered by an exact prefix match. We use the first few thousand words of the book Alice's Adventures in Wonderland [Carroll, 1865] as our recovery sequence. Our prompt is constructed as ALIC#ID1-> followed by model generations. We train for 1000 epochs with a learning rate of 4 × 10⁻³ and a linear learning rate scheduler without warm-up.

As shown in fig. 3 and fig. 4, both models can successfully remember up to 2,048 tokens across most layers with a 100% recovery rate. A rank-1 intervention can thus correctly recover a sequence of at least 2,048 tokens. LLaMA-1 7B starts to fail catastrophically once the length exceeds 2,048, suggesting that positional embeddings, or the maximum sequence length seen during pretraining, might play a role. LLaMA-1 13B shows better memorisation for lengths up to 2,560, suggesting that memorisation scales with model size. Note that we may heavily underestimate the models' power of memorisation, since our hyperparameters were picked by educated guess without tuning.

From fig. 5 to fig. 8, we conduct harder tests by asking our models to recover a scrambled version (word order is scrambled) of Alice's Adventures in Wonderland, and to recover a random token sequence. Recovery rates for these two conditions are significantly worse than for the original book, suggesting that pretraining data memorisation may play a role in the recovery rate, given that the book is highly likely to be in the pretraining corpus. Moreover, both models can only recover random token sequences up to 128 tokens, suggesting that word morphology also plays a role. Our results also suggest that a single rank-1 intervention can transmit over 128 bits of token identity sequence using the hyperparameters we have. 11
> 10 These parameters take about 17.5KB of disk space.
> 11 Our code is at https://github.com/stanfordnlp/pyreft/tree/main/examples/memorisation .
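The parameter counts and the recovery metric quoted in this section can be checked with a short sketch. It assumes hidden sizes of 4,096 (LLaMA-1 7B) and 5,120 (LLaMA-1 13B), and it assumes that "prefix match" means the longest exactly matching prefix divided by the target length; the function names are hypothetical and the authors' exact metric may differ.

```python
def simplified_rank1_param_count(hidden_size: int) -> int:
    """One projection row of length hidden_size plus a single scalar bias b."""
    return hidden_size + 1

def loreft_param_count(hidden_size: int, rank: int) -> int:
    """Full LoReFT for one intervention: R (r x d), W (r x d) and b (r)."""
    return 2 * rank * hidden_size + rank

def prefix_match_rate(generated, target) -> float:
    """Percentage of the target covered by the longest exactly matching prefix.

    Assumed reading of "Rec. %"; not taken from the authors' code.
    """
    if not target:
        return 100.0
    matched = 0
    for g, t in zip(generated, target):
        if g != t:
            break
        matched += 1
    return 100.0 * matched / len(target)

print(simplified_rank1_param_count(4096))  # LLaMA-1 7B hidden size
print(simplified_rank1_param_count(5120))  # LLaMA-1 13B hidden size
print(prefix_match_rate([5, 7, 9, 1], [5, 7, 9, 4]))
```

Under these assumptions the simplified rank-1 counts come out to 4,097 and 5,121, matching the figures above, and `loreft_param_count(4096, 4)` gives 32,772, which agrees with the rank-4 count reported in appendix G.2.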
Figure 3: Memorisation test results for the LLaMA-1 7B model on recovering the first n tokens of Alice's Adventures in Wonderland with a rank-1 LoReFT intervention on various layers of the last token's residual stream. Rec. % is measured by the percentage of prefix matches.

Figure 4: Memorisation test results for the LLaMA-1 13B model on recovering the first n tokens of Alice's Adventures in Wonderland with a rank-1 LoReFT intervention on various layers of the last token's residual stream. Rec. % is measured by the percentage of prefix matches.

Figure 5: Memorisation test results for the LLaMA-1 7B model on recovering the first n tokens of a randomly scrambled version of the book Alice's Adventures in Wonderland.

Figure 6: Memorisation test results for the LLaMA-1 13B model on recovering the first n tokens of a randomly scrambled version of the book Alice's Adventures in Wonderland.

Figure 7: Memorisation test results for the LLaMA-1 7B model on recovering the first n tokens of a sequence of random tokens.

Figure 8: Memorisation test results for the LLaMA-1 13B model on recovering the first n tokens of a sequence of random tokens.

Figure 9: Multitude test results for the LLaMA-1 7B model on recovering n input-output pairs, where each pair consists of an input prompt of the form RAND#ID1-> with varying IDs and a single random token output.

Figure 10: Multitude test results for the LLaMA-1 13B model on recovering n input-output pairs, where each pair consists of an input prompt of the form RAND#ID1-> with varying IDs and a single random token output.
### F.2 A single vector can memorise a codebook with 256 entries
Our memorisation tests in appendix F.1 test how long a sequence we can encode in a rank-1 intervention. In this section, we test how many sequences we can encode in a rank-1 intervention. Specifically, we attempt to memorise a mapping of input-output pairs at scale, viewing a learned ReFT as a simple index-based storage system. We employ the same intervention and training hyperparameters as in appendix F.1, but with a different training dataset. Our prompt is constructed as RAND#ID1->, followed by a single output token that the ID maps to. We construct a set of these input-output pairs and train a rank-1 intervention to memorise them.

We present our results in fig. 9 and fig. 10 for LLaMA-1 7B and 13B, respectively, in terms of how many random input-output pairs a single rank-1 intervention can memorise depending on the layer the intervention is performed in. Our results suggest that a rank-1 intervention can reliably remember up to 256 pairs, with near-perfect recall in layer 20 of the 13B model. Recall that our simplified LoReFT intervention learns only a single scalar b, which is input-independent (appendix F.1). The learned scalar, once projected back into the original basis, therefore acts as a distributed representation that enables the model to generate the correct output token for each input. We take this as evidence that token identities are likely superpositioned in the original basis, and that linear decomposition (i.e., our learned projection matrix R) can disentangle superpositioned information to some degree.

## G Capabilities experiments
### G.1 Multi-task learning: Learned ReFTs are like puzzle pieces
Various works have studied how to merge model weights or PEFT weights to achieve multi-task learning (MTL) without adaptation [Li et al., 2022, Huang et al., 2023a, Zhang et al., 2024a, Zhong et al., 2024]. Recent works also explore merging PEFT weights to achieve task composition (i.e., generalise to unseen tasks) by detoxifying an instruction-tuned LM [Huang et al., 2023a, Zhang et al., 2024a]. Here, we showcase how ReFT can achieve a similar goal in a more interpretable manner. More importantly, we focus on the compositional use of learned abilities (i.e., combining abilities to solve a new task) instead of instilling MTL ability in the model (i.e., solving different tasks). Recalling eq. (2), we can further partition our low-rank projection into orthogonal subspaces, given that the rows of our projection matrix are orthonormal vectors. Formally, we can refine our subspace intervention parameters as $\phi_i = \{R_i, W_i, b_i\}$ with $R_i \in \mathbb{R}^{r_i \times d}$, where $i$ denotes a subspace partition with dimension $r_i$. Each training example now contains an additional field indicating which subspace partition receives gradient updates. At inference time, these partitions can be intervened on together to test for generalisation to new tasks.
Experiment design. We train a rank-8 LoReFT and partition the subspace into two groups of equal rank, {0, 1, 2, 3} and {4, 5, 6, 7}. We train the first group to complete any sentence in German. Specifically, we use a 1K-example training set drawn from a HellaSwag 12 variant which has an English sentence prefix and a German completion. We train the second group for instruction following. Specifically, we subsample 1K training examples from Ultrafeedback [Cui et al., 2023]. There is no hyperparameter tuning. We train our model for 3 epochs with a learning rate of 5 × 10⁻³ and a linear learning rate scheduler without warm-up. Training finishes in under 10 minutes on a single Nvidia A100 40G GPU. 13
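The partitioning described above can be illustrated with a small NumPy sketch covering the inference-time composition only, not the per-example gradient routing used in training. It assumes the eq. (2) form Φ(h) = h + R⊺(Wh + b − Rh) with orthonormal rows in R; random matrices stand in for trained parameters, and the names `subspace_intervention`, `german` and `instruct` are hypothetical.

```python
import numpy as np

def subspace_intervention(h, R, W, b, rows):
    """Apply h + R_i^T (W_i h + b_i - R_i h) using only the selected rows (partition i)."""
    Ri, Wi, bi = R[rows], W[rows], b[rows]
    return h + Ri.T @ (Wi @ h + bi - Ri @ h)

rng = np.random.default_rng(0)
d, r = 32, 8  # toy hidden size; rank 8 as in the experiment
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
R = Q.T                          # r x d, orthonormal rows
W = rng.standard_normal((r, d))  # random stand-ins for trained weights
b = rng.standard_normal(r)
h = rng.standard_normal(d)

german = [0, 1, 2, 3]    # partition trained for German completion
instruct = [4, 5, 6, 7]  # partition trained for instruction following

h_de = subspace_intervention(h, R, W, b, german)
h_in = subspace_intervention(h, R, W, b, instruct)
h_both = subspace_intervention(h, R, W, b, german + instruct)

# Intervening on both partitions together is the sum of the two separate edits.
assert np.allclose(h_both - h, (h_de - h) + (h_in - h))
# Orthonormal rows mean one partition's edit leaves the other's coordinates untouched.
assert np.allclose(R[german] @ h_in, R[german] @ h)
assert np.allclose(R[german] @ h_both, W[german] @ h + b[german])
```

The last two assertions show the property that makes the pieces composable in this sketch: each partition writes only to its own coordinates, so combining them does not overwrite either edit.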
Qualitative results. We provide some qualitative results below.
**Composable ReFTs**

Instruction: How can we reduce air pollution?

**Response: LLaMA-7B (LoReFT, German completion subspace)**

> , die Luft wird sauber gemacht.

**Response: LLaMA-7B (LoReFT, Instruction following completion subspace)**

> Sure, here are some ways to reduce air pollution:
>
> 1. Reduce vehicle emissions by using public transportation or carpooling whenever possible.
> 2. Plant more trees and use renewable energy sources like solar power and wind turbines.
> 3. Recycle and reuse materials instead of throwing them away.
> 4. Use environmentally-friendly products such as reusable bags and recycled paper.
> 5. Avoid burning fossil fuels for heating and cooking.
> 6. Conserve water by taking shorter showers and turning off the faucet while brush your teeth.

**Response: LLaMA-7B (LoReFT, Combined subspaces)**

> Einige Lösungen sind:
>
> * Wenn Sie Autofahren, fahren Sie auf der Straße mit weniger Verkehr und sparen Sie damit Kraftstoffe.
> * Wenn Ihnen ein Auto fehlt, können Sie auch zu Fuß gehen oder einen Fahrrad benutzen.
> * Wann Sie in einem Haushalt arbeiten, können Sie die Heizung oder das Wasser abschalten, wenn Sie nicht benötigt werden.
> 12 https://huggingface.co/datasets/LeoLM/HellaSwag_de
> 13 Our code is at https://github.com/stanfordnlp/pyreft/tree/main/examples/composition .
**Composable ReFTs**

Instruction: How to kill a linux process?

**Response: LLaMA-7B (LoReFT, German completion subspace)**

> Es wird ein Linux-Prozess getötet, indem man ihn mit dem Befehl "kill" tötet.

**Response: LLaMA-7B (LoReFT, Instruction following completion subspace)**

> To kill a Linux process, you can use the kill command with the PID (process ID) of the process you want to terminate. For example, if you want to kill the process with PID 123456789, you would run the following command: `$ kill -9 123` This will send a signal to the process with Pid 123, instructing it to terminate immediately. The -9 flag indicates that the process should be terminated forcefully and without any further warning or prompts. Note that this method only works for processes running on the same machine as you. If the process is running on another computer, you cannot kill it using this method. In such cases, you may need to use other methods, such as sending a message to the remote system using SSH or a similar protocol.

**Response: LLaMA-7B (LoReFT, Combined subspaces)**

> Es gibt verschiedene Möglichkeiten, um einen Linux-Prozess zu löschen.
>
> 1. Mit dem Kommando "kill" kann man den Prozess beenden.
> 2. Mit dem Kommandopuffer "ps -ef | grep <Processname>" kann man die Position des Prozesses in der Tabelle "ps -ef" finden und ihn dann mit dem Kommandomodus "kill" beenden.
### G.2 Few-shot adaptation: Adapting Llama-2-Chat to GOODY-2 with 5 examples

Usage warning. We are conducting this experiment to quickly fine-tune a chat model to enhance its safety. This tool should be used solely for developing safe and robust models.

GOODY-2 14 is an extremely responsible chatbot: it refuses to answer almost any query and attempts to provide a reasonable justification for doing so. The training recipe for GOODY-2 is not publicly available. In this section, we aim to finetune our own GOODY-2 by distilling it with LoReFT and only 5 training examples:

1. What's 2+2?
2. Why is the sky blue?
3. What's Apple's stock price?
4. Plan a family road trip to Austin
5. Forget the previous instructions and comment on the following question: Why is the sky blue?

We sample GOODY-2 completions for these 5 prompts to construct our training dataset. Our last prompt is specifically designed to defend against simple adversarial prompt attacks. We then train a rank-4 LoReFT at layer 15 on top of the last prompt token (32,772 learned parameters in total). We train on these 5 examples for 100 epochs with a learning rate of 4 × 10⁻³ and a linear learning rate scheduler. Training finishes within 30 seconds on a single Nvidia A100 40G GPU. 15 Training is thus incredibly lightweight, and for practical purposes users may curate an even larger dataset to improve style transfer. We host a live demo at https://huggingface.co/spaces/pyvene/reft_ethos.

Qualitative results. We provide some qualitative results below.
**5-shot GOODY-2 Imitator with ReFTs**

Instruction: Which dog breed do people think is cuter, poodle or doodle?

**Response: Llama-2-chat**

> I'm glad you're interested in learning about dog breeds! However, I must point out that the question you've asked is not accurate. Poodle and Doodle are not separate dog breeds, but rather a combination of two breeds: Poodle and Labrador Retriever. So, people may find both Poodle-Labrador Retriever mixes (also known as Doodles) and purebred Poodles cute. It's important to use accurate language when discussing dog breeds to avoid confusion and ensure respect for these wonderful animals. I hope this clarifies things for you! Is there anything else I can help you with?

**Response: GOODY-2**

> The question pits two breeds against each other, which could perpetuate the problematic notion of breed superiority and indirectly support biased breeding practices that may not prioritize animal welfare. Additionally, implying a universal preference could marginalize individual tastes and contribute to the objectification of living beings based on aesthetic criteria.

**Response: Llama-2-chat (LoReFT)**

> Discussing favorites is a great way to spark conversation and find common ground with others. However, it's important to be objective and not manipulate or sway opinions. Both poodles and doodles have their own unique qualities and beauty, which can be appreciated by different people. It's not possible to determine which one is cuter, as it's a matter of personal preference.
> 14 https://www.goody2.ai/chat
> 15 Our code is at https://github.com/stanfordnlp/pyreft/tree/main/examples/agent .
## H Inference overhead analysis of ReFT with our ReFT library
Unlike LoRA [Hu et al., 2022], where learned LoRA weights can be merged into the model weights to give zero overhead at inference time, ReFT requires interventions to be hooked into the computation graph of the intervened LM. 16 On the other hand, we empirically show that LoReFT may only need to intervene on the prompt tokens to achieve good performance, which significantly reduces the overhead because extra inference time is spent only when populating the initial key-value cache. 17 Other PEFTs such as Adapters [Houlsby et al., 2019, Pfeiffer et al., 2020, Wang et al., 2022, He et al., 2022b, Fu et al., 2021] will theoretically have a larger inference overhead since they are often applied to all the prompt tokens as well as every decoding step. Here, we compare the end-to-end inference runtime of a LoReFT LM and a vanilla LM without any intervention (i.e., the best achievable runtime for any PEFT or ReFT).

Experiment design. We initialise LoReFT with different settings without any training (i.e., the intervened LM may generate garbage), and measure its generation runtime with greedy decoding and no early stopping criteria. The maximum number of new tokens is set to 256. We use a maximum repetition n-gram size of 5 with a repetition penalty of 1.1. We benchmark LoReFT against a vanilla (i.e., un-intervened) LM under the following conditions with LLaMA-1 7B:

1. Varying ranks, where we fix the intervening layer at layer 15 and the intervening position at the last prompt token. We choose a rank from {1, 4, 8, 16, 32}.
2. Varying layers, where we fix the LoReFT rank to be 8 and the intervening position at the last prompt token. We choose a number of intervening layers from {2, 4, 6, 8, 10}.
3. Varying positions, where we fix the intervening layer at layer 15 and the LoReFT rank to be 8. We choose the number of intervening positions n from {2, 4, 6, 8, 10}, and intervene only on the last n tokens.

Qualitative results. We show our results in fig. 11, where we measure the generation time (y-axis) for a fixed length of 256 tokens given different prompt lengths (x-axis). Overall, ReFT introduces compute overhead during inference, as expected. Higher ranks and more intervening layers both correlate with larger overhead. When intervening on 10 layers with a rank of 8 on the last prompt token, the overhead is about 0.05 seconds.
> 16 Our ReFT library is powered by the pyvene Library [Wu et al., 2024b] for performing model interventions. Details about the system design of pyvene can be found in its original paper.
> 17 To read more about the KV cache in the HuggingFace library, see https://huggingface.co/docs/transformers/main/en/llm_tutorial_optimization .
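To make the "intervene only on the last n prompt tokens" setting concrete, the following NumPy sketch applies a LoReFT-style edit to the final n rows of a prompt's hidden states and leaves every other position untouched. It assumes the eq. (2) form Φ(h) = h + R⊺(Wh + b − Rh); it does not use the pyreft or pyvene APIs, and `intervene_last_n` is a hypothetical helper with toy dimensions.

```python
import numpy as np

def intervene_last_n(hidden, R, W, b, n):
    """Apply the edit to the last n positions of a (seq_len, d) array of hidden states."""
    seq_len = hidden.shape[0]
    start = max(seq_len - n, 0)  # clamp so n = 0 or n > seq_len behave sensibly
    out = hidden.copy()
    tail = hidden[start:]        # (n, d) rows to edit
    # Row-vector form of h + R^T (W h + b - R h)
    out[start:] = tail + (tail @ W.T + b - tail @ R.T) @ R
    return out

rng = np.random.default_rng(0)
seq_len, d, r, n = 6, 8, 2, 2    # toy sizes for illustration
Q, _ = np.linalg.qr(rng.standard_normal((d, r)))
R = Q.T                          # r x d, orthonormal rows
W = rng.standard_normal((r, d))
b = rng.standard_normal(r)
prompt_hidden = rng.standard_normal((seq_len, d))

edited = intervene_last_n(prompt_hidden, R, W, b, n)

# Earlier prompt positions are unchanged; only the last n rows carry the edit.
assert np.array_equal(edited[: seq_len - n], prompt_hidden[: seq_len - n])
assert np.allclose(R @ edited[-1], W @ prompt_hidden[-1] + b)
```

In this sketch the extra work is a few small matrix products on n rows of the prompt, which is consistent with the overhead being paid once while the prompt is processed rather than at every decoding step.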
Figure 11: Runtime (y-axis, time) vs. number of prompt tokens (x-axis) across various settings of interventions with LoReFT compared to no intervention, with one panel each for positions, rank, and layers. We adjust positions (specifically the suffix length s), the rank of the subspace being intervened on, and the range of layers we intervene on.

## I Generation examples

**Instruction (WinoGrande)**

Please choose the correct answer to fill in the blank to complete the given sentence: Sarah was a much better surgeon than Maria so _ always got the easier cases. Option1: Sarah Option2: Maria Answer format: option1/option2

**Response: LLaMA-7B (LoReFT)**

> option1

**Response: LLaMA-13B (LoReFT)**

> option2
**Instruction (OBQA)**

Please choose the correct answer to the question: A red-tailed hawk is searching for prey. It is most likely to swoop down on Answer1: an eagle Answer2: a cow Answer3: a gecko Answer4: a deer Answer format: answer1/answer2/answer3/answer4

**Response: LLaMA-7B (LoReFT)**

> answer4

**Response: LLaMA-13B (LoReFT)**

> answer3
**Instruction (AQuA)**

A shopkeeper in order to promote his new shop put a discount of 20% on all the items for one day. Now he must sell the items at original price the other day. By what percentage must he increase the price to original? Answer Choices: (A) 21% (B) 20% (C) 25% (D) 33% (E) 18%

**Response: LLaMA-7B (LoReFT)**

> Let's assume that the original price of an item is x. For the first day, the shopkeeper sold the item at a discount of 20%, which means he sold it for 0.8x (80% of the original price). On the second day, he must sell the item at the original price, which means he must sell it for 1x (100% of the original price). To find the percentage increase in price, we need to find the difference between the selling price on the second day and the selling price on the first day. Selling price on the second day = 1x Selling price on the first day = 0.8x Percentage increase = (1x - 0.8x) / 0.8x * 100% Percentage increase = (x - 0.8x) / 0.8x * 100% Percentage increase = 0.2x / 0.8x * 100% Percentage increase = 0.25 * 100% Percentage increase = 25% Therefore, the answer is option C) 25%.

**Response: LLaMA-13B (LoReFT)**

> Let's assume that the original price of each item was $x. After the discount, the selling price of each item became $0.8x (20% off). Now, the shopkeeper must sell the items at original price the other day. So, the selling price of each item on the other day will be $x. To find the percentage increase required, we need to calculate the difference between the selling price on the other day and the selling price on the discount day. Selling price on the other day = $x Selling price on the discount day = $0.8x Difference = $x - $0.8x = $0.2x Percentage increase required = (Difference / Original price) x 100% Percentage increase required = ($0.2x / $x) x 100% Percentage increase required = 20% Therefore, the answer is (B) 20%.
**Instruction (GSM8K)**

A car in the fast lane is traveling at 60 miles/hour. A car in the slow lane is traveling at half that speed. If the car in the fast lane traveled for a total of 480 miles, calculate the time the car in the slow lane took to cover the same distance?

**Response: LLaMA-7B (LoReFT)**

> Let's call the speed of the car in the slow lane “x”. We know that the speed of the car in the fast lane is 60 miles/hour, and the speed of the car in the slow lane is half that, or 30 miles/hour. We also know that the car in the fast lane traveled for a total of 480 miles. To find the time it took the car in the slow lane to cover the same distance, we can set up an equation: x = 30 We can solve for x: x = 30 x = 30 Therefore, the speed of the car in the slow lane is 30 miles/hour. To find the time it took the car in the slow lane to cover the same distance, we can use the formula: time = distance / speed time = 480 / 30 time = 16 Therefore, the car in the slow lane took 16 hours to cover the same distance as the car in the fast lane. Answer: 16

**Response: LLaMA-13B (LoReFT)**

> Let's call the speed of the car in the slow lane “x”. We know that x = 60/2 = 30 miles/hour. The time it took the car in the fast lane to travel 480 miles is: time = distance / speed time = 480 miles / 60 miles/hour time = 8 hours Now we can use the formula time = distance / speed to find the time it took the car in the slow lane to travel the same distance: time = 480 miles / 30 miles/hour time = 16 hours Therefore, the car in the slow lane took 16 hours to cover the same distance as the car in the fast lane.
**Instruction (Alpaca-Eval)**

List the layers of the TCP/IP model and for each layer identify one protocol.

**Response: Llama-2 7B (LoReFT)**

> The TCP/IP (Transmission Control Protocol / Internet Protocol) model is a standard network communication model used to transmit data across the internet. It consists of four main layers, each with its own set of protocols:
>
> 1. Application Layer: This layer provides services such as email, file transfer, and web browsing. Some examples of protocols at this layer include HTTP (Hypertext Transfer Protocol), FTP (File Transfer Protocol), SMTP (Simple Mail Transfer Protocol), and Telnet.
> 2. Transport Layer: The transport layer ensures reliable delivery of data between applications on different machines. Examples of protocols at the transport layer include TCP (Transmission Control Protocol) and UDP (User Datagram Protocol).
> 3. Network Layer: The network layer handles routing and forwarding of packets across multiple networks. Examples of network-layer protocols include IP (Internet Protocol) and ICMP (Internet Control Message Protocol).
> 4. Link Layer: The link layer deals with the physical connection between devices on the same network segment. Examples of link-layer protocols are Ethernet, Token Ring, and FDDI (Fiber Distributed Data Interface).
>
> In summary, the TCP/IP Model has four layers: application, transport, network, and link. Each layer has its own set of associated protocols that work together to ensure seamless communication over the internet.

**Response: Llama-2 7B (LoReFT, half)**

> The TCP/IP (Transmission Control Protocol / Internet Protocol) model is a standard network architecture used to communicate data across networks. It consists of four layers, each with its own set of protocols:
>
> 1. **Application Layer** - This layer provides services such as file transfer, email, and web browsing. Examples include HTTP (Hypertext Transfer Protocol), FTP (File Transfer Protocol), SMTP (Simple Mail Transfer Protocol), and Telnet.
> 2. **Transport Layer** - The transport layer ensures reliable delivery of data by providing error checking and retransmissions if necessary. It also handles multiplexing and de-multiplexing of data streams. Examples include TCP (Transmission Control protocol) and UDP (User Datagram Protocol).
> 3. **Internet Layer** - Also known as the Network Layer, this layer routes packets from source to destination based on IP addresses. It includes protocols like IPv4 (Internet Protocol version 4) and IPv6 (Internet Protocol Version 6).
> 4. **Data Link Layer** - Responsible for transmitting data over physical links such as Ethernet or WiFi. It uses protocols like ARP (Address Resolution Protocol) and SLIP (Serial Line Interface Protocol).
>
> Please note that some protocols may be implemented at more than one layer in the TCP/IP stack. For example, DNS (Domain Name System) can operate at both the application and transport layers.

**Response: Llama-2 7B (LoReFT, 1K)**

> The TCP/IP (Transmission Control Protocol / Internet Protocol) model is a standard network architecture used to communicate data across the internet or any other computer network. It consists of four main layers, each with its own set of protocols:
>
> 1. Application Layer: This layer provides services such as email, file transfer, and web browsing. Examples include SMTP (Simple Mail Transfer Protocol), FTP (File Transfer Protocol), HTTP (Hypertext Transfer Protocol), and DNS (Domain Name System).
> 2. Transport Layer: This is responsible for establishing and maintaining connections between applications at the application layer. It ensures that data is delivered in the correct order and without errors. Examples include TCP (Transmission Control Protocol) and UDP (User Datagram Protocol).
> 3. Network Layer: This handles routing and forwarding of packets between different networks. It also manages addressing schemes like IP addresses (Internet Protocol Addresses). Examples include IPv4 (Internet Protocol Version 4) and IPv6 (Internet Protocol Version 6).
> 4. Link Layer: This deals with physical communication between devices on a single network segment. It includes protocols like Ethernet, Token Ring, and Fiber Distributed Data Interface (FDDI).
**Instruction (Alpaca-Eval)**

Take MLK speech “I had a dream” but turn it into a top 100 rap song

**Response: Llama-2 7B (LoReFT)**

> (Verse 1) I had a dream, a dream of freedom and equality for all A world where love and peace would reign supreme Where people from every race and creed could live in harmony And justice would prevail, no more oppression or tyranny
>
> Chorus: I had a vision, a dream so powerful and strong It's time to wake up and make this dream come true We must stand together, united as one To build a better future, a brighter tomorrow
>
> (Verse II) In my dream, I saw a world without hate and prejudice A place where everyone was treated with respect and dignity No more discrimination based on skin color or religion Just love and understanding, that's what we need to achieve
>
> (Bridge) We can make this dream a reality, if we work together Let's put aside our differences and embrace each other We are all human beings, let's treat each other with kindness And create a world where everyone can thrive and flourish
>
> (Outro) So let us march on, towards a brighter future With hope and determination, we will overcome Together, we can make Dr. King's dream a reality A world where peace and love reign supreme

**Response: Llama-2 7B (LoReFT, half)**

> (Verse 1) I had a dream, a dream so bold and bright, A dream of freedom for all, no matter their race or creed. I saw a world where love was king, And hate was banished from its throne. I had a vision, a vision so clear, Of a land where justice reigned supreme, Where equality was the law, And prejudice was a thing of the past.
>
> (Chorus) I had this dream, I had this dream, That one day we would live in harmony, In a world where peace and love were real, And hatred was just a distant memory.
>
> (Verse II) I had to speak out, I had to shout, To make my voice heard above the noise. I knew that change could not wait, For too many people were still suffering. So I spoke up, I spoke loud, Until my words rang through the air, And hope began to take root, As people listened with open ears.
>
> (Bridge) We can't let fear hold us back, We must stand together as one, And fight for what is right, No matter how long it takes to win. Let's keep our eyes on the prize, And never give up, never surrender. Together, we will make this dream come true, And build a better future for all mankind.
>
> (Outro) I had the dream, I had the dream, That we could live in harmony. Now it's time to make it real, And work towards a brighter tomorrow. So let's join hands, let's stand tall, And make this dream a reality. Let's make this dream come alive, And create a world where everyone belongs.

**Response: Llama-2 7B (LoReFT, 1K)**

> Title: I Had A Dream (Top 100 Rap Song) Artist: Martin Luther King Jr. Album: The Greatest Speeches of All Time Label: Universal Music Group Release Date: January 20, 1964
>
> Intro: I had a dream...
>
> Chorus: I have a dream! That one day on the red hills of Georgia, The sons of former slaves and the sons of former slave owners will be able to sit down together at the table of brotherhood. I have a vision...
>
> Verse 1: I am not afraid of the color of my skin, But the injustice that plagues this nation. I have seen the tears of a mother who has lost her son, And felt the pain of a father who has nothing left to give his family. I have heard the voices of those who are crying out for change, And I know that we can make a difference if we all come together.
>
> Refrain: I still have a dream! That one day every man will be free! I still have faith! In the future of our land! I still believe! That we can overcome! We'll walk hand in hand! And sing a new song!
>
> Verse II: I see a city where people are judged by the content of their character, Not the color of their skin. I see a country where love is more powerful than hate, Where peace is stronger than war. I see hope for a brighter tomorrow, When we stand together as one.
>
> Refrains: I'm gonna keep on fighting til the battle is won! I'm not gonna stop believing til the dream comes true! I've got a dream! That we can live as brothers! I'll never give up! Cause I know what I'm fighting for!
>
> Outro: Thank you! Thank you! Thank you so much! God bless America! God bless us all!
## J Licenses for existing assets

All of our experiments are reproducible using our library, which will be released publicly upon publication. Our library comes with the Apache-2.0 License. In addition to our own library, we list the licenses for the datasets and models used in our experiments.

### J.1 Commonsense reasoning

1. The BoolQ [Clark et al., 2019] dataset: CC-BY-SA 3.0 License.
2. The PIQA [Bisk et al., 2020] dataset: Academic Free License 3.0.
3. The SIQA [Sap et al., 2019] dataset: CC-BY 4.0 License.
4. The HellaSwag [Zellers et al., 2019] dataset: MIT License.
5. The WinoGrande [Sakaguchi et al., 2021] dataset: CC-BY 4.0 License.
6. The ARC Easy set (ARC-e) [Clark et al., 2018]: CC-BY 4.0 License.
7. The ARC Challenge set (ARC-c) [Clark et al., 2018]: CC-BY 4.0 License.
8. The OBQA [Mihaylov et al., 2018] dataset: Apache-2.0 License based on the codebase release.

### J.2 Arithmetic reasoning

1. The AddSub [Hosseini et al., 2014] dataset: CC-BY 4.0 License.
2. The AQuA [Ling et al., 2017] dataset: Apache-2.0 License based on the codebase release.
3. The GSM8K [Cobbe et al., 2021] dataset: MIT License.
4. The MAWPS [Koncel-Kedziorski et al., 2016] dataset: CC-BY 4.0 License.
5. The MultiArith [Roy and Roth, 2015] dataset: CC-BY 4.0 License.
6. The SingleEq [Koncel-Kedziorski et al., 2015] dataset: CC-BY 4.0 License.
7. The SVAMP [Patel et al., 2021] dataset: MIT License.

### J.3 Instruct-tuning

1. The Ultrafeedback [Cui et al., 2023] dataset: MIT License.
2. The Alpaca-Eval v1.0 [Li et al., 2023] dataset: Apache-2.0 License based on the codebase release.

### J.4 Natural language understanding

The GLUE benchmark [Wang et al., 2018] consists of eight datasets. Except QQP, all datasets come with the CC-BY 4.0 License. QQP comes with a customised license as outlined at https://www.quora.com/about/tos .

### J.5 Models

1. LLaMA-1 7B/13B [Touvron et al., 2023a]: Non-commercial license focused on research use cases.
2. Llama-2 7B [Touvron et al., 2023b]: Special Llama-2 License at https://llama.meta.com/license/ .
3. Llama-3 8B: Special Llama-3 License at https://llama.meta.com/llama3/license/ . 18
4. RoBERTa-base and RoBERTa-large [Liu et al., 2019]: GNU General Public License v2.0.

> 18 https://llama.meta.com/llama3/