mirror of
https://github.com/wassname/adapters_as_hypotheses.git
synced 2026-06-27 16:44:10 +08:00
wip
Title: CLOVER: Cross-Layer Orthogonal Vectors Pruning and Fine-Tuning

URL Source: https://arxiv.org/pdf/2411.17426

Published Time: Mon, 03 Feb 2025 01:40:12 GMT

Number of Pages: 16

Markdown Content:

# CLOVER: Cross-Layer Orthogonal Vectors Pruning and Fine-Tuning

Fanxu Meng¹ ², Pingzhi Tang¹, Fan Jiang¹, Muhan Zhang¹ ²

## Abstract

Decoder-only models generate tokens autoregressively by caching key/value vectors, but as the cache grows, inference becomes memory-bound. To address this issue, we introduce CLOVER (Cross-Layer Orthogonal Vectors), a novel approach that treats pairs of attention layers as a set of low-rank decompositions. CLOVER applies Singular Value Decomposition (SVD) to the Q-K and V-O pairs within each attention head. The resulting singular values can either guide pruning or serve as trainable parameters for efficient fine-tuning of all orthogonal vectors. After pruning or fine-tuning, these values are reintegrated into the model without increasing its parameter count. We apply CLOVER to various models, including GPT-2 XL, DeepSeek-V2-Lite, Whisper-Large-v3, Stable Diffusion XL, and LLaMA-3.2-11B-Vision. Our results demonstrate that CLOVER significantly improves pruning efficiency. For instance, the perplexity of pruning 70% of the Q-K pairs in GPT-2 XL is similar to that of pruning just 8% with vanilla methods. Fine-tuning the singular values further results in a full-rank update, outperforming state-of-the-art methods (LoRA, DoRA, HiRA, and PiSSA) by 7.6%, 5.5%, 3.8%, and 0.7%, respectively, on eight commonsense tasks for LLaMA-2 7B.

## 1. Introduction

In recent years, Large Language Models (LLMs) have rapidly evolved into essential tools for productivity (OpenAI, 2024; Anthropic, 2024; Team et al., 2024a). Open-source models (AI@Meta, 2024; Mistral, 2024; Qwen, 2024; Liu et al., 2024b; Team et al., 2024b; Abdin et al., 2024) have also narrowed the performance gap with closed-source models. The success of LLMs is largely attributed to Next Token Prediction (Radford, 2018; Brown et al., 2020), where tokens are predicted sequentially, with attention computed between each token and all preceding ones. To avoid redundant computations, key-value features are cached. However, as model size grows, the overhead of caching becomes substantial, leading to memory and communication bottlenecks.

[Figure 1 panels: (a) Multi-Head Attention; (b) CLOVER; (c) Pruning without Training (x-axis: pruning ratio, y-axis: log of perplexity, curves: Vanilla and CLOVER); (d) Fine-Tuning Pruned Model (y-axis: perplexity, curves: Vanilla and CLOVER).]

Figure 1. (a) We treat the Query-Key and Value-Output layers within a single attention head as a unified structure. (b) Apply SVD to obtain two sets of singular vectors for initializing the Q-K and V-O layers, along with singular values that guide pruning or enable efficient full-rank fine-tuning. (c) This cross-layer orthogonalization strategy allows for higher pruning rates. (d) The pruned model maintains strong performance after fine-tuning.

> arXiv:2411.17426v3 [cs.LG] 31 Jan 2025 CLOVER: Cross-Layer Orthogonal Vectors

For instance, a 65B parameter model (Touvron et al., 2023) with 8-bit key-value quantization requires over 86GB of GPU memory to store 512K tokens, exceeding the capacity of a single H100-80GB GPU (Sun et al., 2024).

To enable efficient training and inference, we introduce CLOVER (Cross-Layer Orthogonal Vectors), a novel method that orthogonalizes the Query, Key, Value, and Output vectors without generating additional transformation matrices. As shown in Figure 1a, we treat the Q-K and V-O pairs in each attention head as a low-rank decomposition of $W_{QK}$ and $W_{VO}$. By crossing these layers and performing SVD on $W_{QK}$ and $W_{VO}$, the Query, Key, Value, and Output vectors become orthogonal within each attention head. Figure 1b illustrates how the resulting singular values can guide pruning or serve as trainable parameters for efficient fine-tuning. After pruning or fine-tuning, these values can be reintegrated into the model without increasing its parameter count. Notably, previous methods, such as SVFT (Lingam et al., 2024), obtain orthogonal vectors by directly performing orthogonal decomposition on the matrix at each layer, which results in an accompanying transformation matrix, doubling the parameter count. In contrast, CLOVER treats the Q-K pairs as transformation matrices for each other, and similarly for the V-O pairs. CLOVER only generates a small set of singular values to guide pruning and fine-tuning, which can be merged back into the model without increasing inference costs.

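The cross-layer idea can be illustrated numerically for a single head. The following is a minimal NumPy sketch, not the authors' implementation: the function names `clover_qk_head` and `prune`, the toy sizes, and the choice to form the full $D \times D$ product before calling `np.linalg.svd` are assumptions made for illustration. It checks that the orthogonalized pair reproduces the product of the original Q and K slices with unchanged shapes, and that dropping the smallest singular directions leaves a spectral-norm error equal to the first dropped singular value.

```python
import numpy as np

def clover_qk_head(w_q, w_k):
    """Orthogonalize one head's Q-K pair via SVD of their product.

    w_q, w_k: arrays of shape (D, d), the per-head slices of W_Q and W_K.
    Returns (u, s, v): u and v have shape (D, d) with orthonormal columns,
    s has shape (d,), and u @ diag(s) @ v.T equals w_q @ w_k.T.
    """
    d = w_q.shape[1]
    w_qk = w_q @ w_k.T                       # (D, D), rank at most d
    u, s, vt = np.linalg.svd(w_qk, full_matrices=False)
    return u[:, :d], s[:d], vt[:d, :].T      # keep the d non-trivial directions

def prune(u, s, v, keep):
    """Keep the `keep` largest singular directions; fold s into the query side."""
    return u[:, :keep] * s[:keep], v[:, :keep]

rng = np.random.default_rng(0)
D, d = 32, 8                                 # toy sizes (assumed)
w_q = rng.standard_normal((D, d))
w_k = rng.standard_normal((D, d))

u, s, v = clover_qk_head(w_q, w_k)

# Without pruning, the new pair has the same shapes and the same product.
new_q, new_k = prune(u, s, v, keep=d)
assert new_q.shape == w_q.shape and new_k.shape == w_k.shape
assert np.allclose(new_q @ new_k.T, w_q @ w_k.T)

# Pruning to r directions: the spectral-norm error is the first dropped singular value.
r = 5
q_r, k_r = prune(u, s, v, keep=r)
err = np.linalg.norm(w_q @ w_k.T - q_r @ k_r.T, ord=2)
assert np.isclose(err, s[r])
```

With `keep=d` nothing is pruned and the parameter count is unchanged. With `keep<d` both projections lose the same columns, so the head's query and key projections become narrower together.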
By orthogonalizing the vectors, we eliminate linear redundancy. Attention heads contain numerous non-zero norm vectors. Directly pruning these vectors would degrade performance, but orthogonalizing them allows us to represent the entire attention head's space using a small set of orthogonal bases. The remaining vectors are nearly zero, making them safe to prune. As shown in Figure 1c, pruning an average of 45 vectors in the query-key pair using CLOVER results in a perplexity similar to that of vanilla pruning, which prunes only 5 vectors. Moreover, CLOVER generates a singular value matrix between the Q-K and V-O pairs. By updating this matrix during fine-tuning, CLOVER learns linear combinations of all orthogonal bases within each attention head. In contrast, PiSSA can only learn from a subset of orthogonal vectors, potentially causing some data projections to approach zero in those directions, leading to non-functional adapters during training. As shown in Figure 1d, fine-tuning a very small number of singular values can achieve performance close to that of fine-tuning all attention heads.

We summarize the contributions of our paper as follows:

- We treat the Q-K and V-O pairs in each attention head as low-rank approximations of $W_{QK}$ and $W_{VO}$. By performing SVD, we orthogonalize the attention head without adding extra transformation matrices.
- This orthogonalization reduces linear redundancy, is compatible with any pruning method, and allows for higher pruning ratios. Pruning 46.42% of the vectors in Whisper's attention head preserves performance without requiring additional training.
- CLOVER enables efficient full-rank updates, surpassing SOTA methods such as LoRA, DoRA, HiRA, and PiSSA on eight commonsense reasoning tasks across LLaMA 7B/13B, LLaMA-2-7B, and LLaMA-3-8B, with additional analyses highlighting its advantages.

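How a tuned singular-value matrix folds back into a head can be sketched as follows. This is an illustrative NumPy example under stated assumptions, not the paper's training code: random orthonormal bases stand in for the frozen singular vectors, a small random perturbation stands in for a gradient update, and the names `qk_scores`, `s_tuned`, and `w_q_merged` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 4, 32, 8                           # toy sizes (assumed)
x = rng.standard_normal((n, D))

# Orthonormal bases standing in for the frozen singular vectors of one head.
u, _ = np.linalg.qr(rng.standard_normal((D, d)))
v, _ = np.linalg.qr(rng.standard_normal((D, d)))

# Trainable (d, d) matrix, initialized as a diagonal of singular values.
s = np.diag(np.sort(rng.uniform(0.1, 2.0, size=d))[::-1])

def qk_scores(x, u, s, v):
    """Unnormalized attention logits x U S V^T x^T for one head."""
    return (x @ u) @ s @ (x @ v).T

# Stand-in for a fine-tuning step: a dense change to S mixes all d bases.
s_tuned = s + 0.01 * rng.standard_normal((d, d))

# Merge the tuned S into the query projection; the shapes stay (D, d).
w_q_merged = u @ s_tuned
w_k_merged = v
merged_scores = (x @ w_q_merged) @ (x @ w_k_merged).T

assert w_q_merged.shape == (D, d)
assert np.allclose(merged_scores, qk_scores(x, u, s_tuned, v))

# The change to S has rank d; since U and V have orthonormal columns,
# the induced change U (S' - S) V^T to the head's product also has rank d.
assert np.linalg.matrix_rank(s_tuned - s) == d
```

Only the $d \times d$ matrix is treated as trainable in this sketch, and after merging there is no extra matrix left to store or multiply at inference time.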
## 2. Related Work

LLM Compression. To mitigate the high memory demands of KV caches in long-context models, several techniques have been proposed. These include reducing sequence length with linear attention (Katharopoulos et al., 2020; Wang et al., 2020; Peng et al., 2023; Gu & Dao, 2023; De et al., 2024), dynamic token pruning (Fu et al., 2024; Jo & Shin, 2024; Li et al., 2024b), compressing the key-value rank (Shazeer, 2019; Ainslie et al., 2023; Liu et al., 2024a; Yu et al., 2024), and pruning head dimensions (Ashkboos et al., 2024; Xia et al., 2023; Sun et al., 2023). Additional approaches include sharing key-value representations across layers (Sun et al., 2024; Brandon et al., 2024; Liu et al., 2024c; Zuhri et al., 2024) and quantizing KV cache weights and activations (Frantar et al., 2022; Dettmers et al., 2022; Xiao et al., 2023; Liu et al., 2024e; Hooper et al., 2024). Among them, structured pruning is hardware-friendly but can reduce performance when non-zero dimensions are removed (Ma et al., 2023). Fine-tuning can recover some of the lost performance, but it is computationally expensive. To address this, Parameter Efficient Fine-Tuning (PEFT) methods are used (Guo et al., 2023).

Parameter Efficient Fine-Tuning. Several strategies have been introduced to minimize fine-tuning parameters while maintaining performance. These include low-rank adaptation (Hu et al., 2021), partial-parameter fine-tuning (Zaken et al., 2021; Lawton et al., 2023; Zhao et al., 2020; Sung et al., 2021; Ansell et al., 2021; Xu et al., 2021; Guo et al., 2020; Fu et al., 2023), soft prompt fine-tuning (Hambardzumyan et al., 2021; Lester et al., 2021; Li & Liang, 2021; Liu et al., 2023b; Vu et al., 2021; Asai et al., 2022; Wang et al., 2023), and sparse matrix fine-tuning (Qiu et al., 2023; Liu et al., 2023a; Yuan et al., 2024). Among these, LoRA is widely used due to its simplicity and effectiveness, with recent works enhancing it further (Zhang et al., 2023; Zi et al., 2023; Liu et al., 2024d; Zhao et al., 2024; Jiang et al., 2024). PiSSA (Meng et al., 2024) improves convergence speed by initializing adapters with principal singular values and vectors, also reducing quantization error (Wang et al., 2024a;b; Li et al., 2024a). However, PiSSA is limited by its use of a fixed set of orthogonal bases. SVFT (Lingam et al., 2024) directly applies Singular Value Decomposition (SVD) to the original matrix, but this increases the number of parameters, raising computational overhead and reducing efficiency. The CLOVER method addresses these issues by treating the Query-Key pairs in each attention head as low-rank matrices. Using orthogonal decomposition, CLOVER eliminates the need for additional transformation matrices. Instead, it leverages a small set of singular values to linearly combine orthogonal vectors, making the approach more parameter-efficient. After fine-tuning, the adapter can be smoothly reintegrated into the original matrix structure.

## 3. CLOVER: Cross-Layer Orthogonal Vectors

Below is a step-by-step explanation of the CLOVER method and of why it can orthogonally decompose the Query, Key, Value, and Output layers in Multi-Head Attention without introducing any transformation matrix. We mainly use the computation of the Q-K pair as an example; the same procedure then extends to the V-O pair.

Multi-Head Self-Attention Setup. In a multi-head self-attention mechanism with $H$ heads, each head $h \in \{1, \dots, H\}$ computes an attention score as:

$$\mathrm{attn}(Q_h, K_h) = \mathrm{softmax}\left(\frac{Q_h K_h^\top}{\sqrt{d}}\right),$$

where $H$ is the number of attention heads, $d$ is the dimensionality of each head, $X \in \mathbb{R}^{n \times D}$ is the input matrix ($n$ is the sequence length, $D$ is the total hidden dimension), $Q_h, K_h \in \mathbb{R}^{n \times d}$ are the query and key representations for head $h$, and $W_Q, W_K \in \mathbb{R}^{D \times H \times d}$ are weights for projecting the input $X$ into queries and keys. Specifically, the queries and keys for head $h$ are obtained by multiplying $X$ with the corresponding "slice" of $W_Q$ and $W_K$, respectively:

$$Q_h = X W_Q^{[:,h,:]}, \quad K_h = X W_K^{[:,h,:]}.$$

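The setup above can be written out directly with array slicing. The snippet below is a small NumPy sketch for illustration only: the helper names `softmax` and `head_attention` and the toy sizes are assumptions, and no causal mask is applied because the formula above does not include one.

```python
import numpy as np

def softmax(z, axis=-1):
    """Row-wise softmax; subtracting the max keeps the exponentials stable."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def head_attention(x, w_q, w_k, h):
    """Attention weights of head h.

    x: (n, D) input; w_q, w_k: (D, H, d) projection weights.
    """
    d = w_q.shape[-1]
    q_h = x @ w_q[:, h, :]                     # (n, d) queries of head h
    k_h = x @ w_k[:, h, :]                     # (n, d) keys of head h
    return softmax(q_h @ k_h.T / np.sqrt(d))   # (n, n) attention weights

rng = np.random.default_rng(0)
n, D, H, d = 5, 16, 4, 4                       # toy sizes (assumed)
x = rng.standard_normal((n, D))
w_q = rng.standard_normal((D, H, d))
w_k = rng.standard_normal((D, H, d))

attn = head_attention(x, w_q, w_k, h=2)
assert attn.shape == (n, n)
assert np.allclose(attn.sum(axis=-1), 1.0)     # each row sums to one
```

Note that the scores depend on the weights only through the product of the two slices, which is the observation the next step builds on.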
Cross Layers Merging. Substituting $Q_h$ and $K_h$ into $Q_h K_h^\top$, we have:

QhK⊤
|
||||
|
||||
> h
|
||||
|
||||
= X W [: ,h, :]
|
||||
|
||||
> Q
|
||||
|
||||
| ||||