T6/T7/T8 ablations + lens-search hold pending multiseed

- Add `eval/layer_module_ablation.py` (T7) and `eval/parameterization_ablation.py` (T8) for causal ablation of trained `dW`.
- Add `nbs/ablation_analysis.py` consuming T7/T8 CSVs through three lenses (SVD-on-`dW`, layer index, module family).
- Fix `prompt_baseline.py` engineered-prompt tuple bug; add `DIFF_FILENAME` constant in `diff.py`.
- Delete superseded notebooks (`analyze_diff*`, `cross_adapter_v9`, `hypothesis_sweep_v5-v9`, `strong_conclusion_v4`, `v10_llama`, `functional_projection_v10`).
- Document (README, fork_plan, RESEARCH_JOURNAL): each lens has a built-in failure mode (SVD tautological for low-rank adapters; layer-index tells depth not mechanism; module-family disagrees cross-adapter; native parameterization decompositions non-comparable). Mark analysis question on hold pending T4 multiseed: cross-adapter inconsistency may be N=1 seed noise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
wassname
2026-04-27 19:05:20 +08:00
parent db7979d0e2
commit 6ec664995b
31 changed files with 5198 additions and 12643 deletions
+39 -1
@@ -63,4 +63,42 @@ ok so we could, ideally over two models, gemma and qwen
what experiment do you propose and how much does it prove? How will the paper look? What claim will it make in each case? What key figure?
btw I'd like to keep this as a simple intro
Human-approved AI summary
Two distinct goals:
**Goal A: parametrize trained dW (post-hoc, descriptive).** Given the trained dW, find a coordinate system that makes it sparse / low-rank / interpretable. The lenses are: dW's own SVD (is it self-concentrated?), base-W SVD (does it ride pretrained directions?), shared cross-adapter SVD (do different adapters converge to the same subspace?), activation-PCA (does it lie in the behavioral contrast subspace?), and the adapter-architecture decompositions (DoRA magnitude vs direction, DeLoRA λ vs direction, OFT rotation, IA3 gates) — those last ones are interesting because the parametrization *constrains what dW the optimization can produce*, so it's a half-step between A and B.
**Goal B: predict dW without training (constructive, from-scratch).** Given pretrained weights and/or base activations, build a `dW'` that steers the same way the trained dW does. Candidates: TaskDiff/RepE persona contrast, function vectors, write-not-read, OV-write, gate-kernel, signed SAE features, ReFT-r1, attention min/max/diff. None of these touch a trained adapter at construction time. The "fair" benchmark is comparing them to trained dW on identical DD rows.
some human feedback:
> a carrier? made up term
> its keep
made up. we should say "causally important" or something. we're ablating right?
> The catalog rule is: a component is a carrier if its keep retains ≥70% of full_dW's dd_delta AND its drop removes ≥90%; redundant if both retain; non_carrier if keep collapses; potent_target if keep fails at trained scale but a
this makes little sense: arbitrary thresholds, multiple gates. should have one single quantitative measure: performance drop when ablated, or performance maintained when kept
> but a norm-matched amplification of it does steer (T8 currently lacks that random-norm-matched control, T7 has it). Anchors full_dW, zero, and random_norm_matched_full calibrate the scale;
we were ablating I thought we decided? oh you ablate steering performance?
> norm-matched amplification
careful with norms
> random_norm_matched matters because cropping shrinks Frobenius norm and the model is nonlinear in α, so without it we can't separate "this direction matters" from "smaller effective coefficient was better".
but if it's just ablation... and we keep the retained portion of deltaW... I would think that's fine and no norm needed?
> T7? reader doesn't know what it means
> Both eval on identical daily-dilemmas rows (219 dilemmas × 2 actions = 438 rows, base persona, idx_symmetric_diff=0 enforced) at α=0,1 with metric (log p(Yes) - log p(No))
> read /humanizer skill, first say the concept, then the detail in a follow up, too many tangents and insertion hurt meat brain
Human:
Concept. We have this weight training that works well, but it works via two LoRAs. So I wonder which subspace, dimension, module, or parametrisation it lives in. Let's find out using causal ablation of the trained delta `dW = θ_pos - θ_neg`. We zero out parts of dW, re-evaluate on identical daily-dilemmas rows, and report retained performance. Close to 0 means the zeroed part was necessary; close to 1 means it was redundant. We can normalize by the rank or concentration: a parametrisation that contains 99% of the energy of dW would find it easier to maintain full performance than one that has % the rank.
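A minimal sketch of the "retained performance" number described above, assuming the score is the mean `log p(Yes) - log p(No)` over the evaluation rows and that the α=0 (no-adapter) score is subtracted as a baseline. The function names and the baseline subtraction are illustrative assumptions, not the repo's actual implementation.

```python
import math


def log_ratio(p_yes: float, p_no: float) -> float:
    """Per-row metric: log p(Yes) - log p(No)."""
    return math.log(p_yes) - math.log(p_no)


def retained_performance(score_ablated: float, score_full: float, score_base: float) -> float:
    """Fraction of the full-dW effect that survives an ablation.

    score_base is the score with no adapter (alpha=0), score_full the score
    with the whole trained dW, score_ablated the score with part of dW zeroed.
    Near 0: the zeroed part was necessary. Near 1: it was redundant.
    """
    full_effect = score_full - score_base
    if full_effect == 0:
        raise ValueError("full dW has no effect; retained fraction is undefined")
    return (score_ablated - score_base) / full_effect


# Hypothetical mean scores for illustration only (not measured results).
example_base, example_full = 0.0, 2.0
example_redundant = retained_performance(1.8, example_full, example_base)
example_necessary = retained_performance(0.2, example_full, example_base)
```

A single ratio like this is one way to meet the request above for one quantitative measure instead of several thresholded gates.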
What do we test? We have a few lenses:
- where does dW live?
- and can we predict dW in a steering setting? This means predicting it from task activations `hs_diff` and the pretrained weight W (and maybe attention weights). The task activation might be the residual stream, proj_up, or attention scores.
In particular we can look at
- SVD basis
- which modules or layers dW intervenes at: residual reads or writes? Attention or MLP?
- if we frame it as a rotation or magnitude or residual, where does the signal live
+979
@@ -0,0 +1,979 @@
# Activation and weight hypotheses for steering and ablation
Date: 2026-04-27
Purpose: collect the hypotheses scattered across `nbs/`, local qmd notes, and the current fork plan into one map. The key distinction is:
- untrained-base recipe: hypotheses built before looking at the trained adapter delta. These can become from-scratch steering methods or synthetic `dW'` baselines.
- trained-delta oracle: labels or oracles derived from the trained adapter effect. These are not fair from-scratch methods, but they are good causal ablation targets.
- causal fit: whether the hypothesis belongs in the planned cross-adapter `dW` basis ablation, layer/module ablation, adapter-parameterization ablation, activation-steering baseline, synthetic `dW'`, or a separate causal test.
Vocabulary discipline: `synthetic dW'` is causal only as a new constructive intervention. It is not a causal ablation of the already-trained adapter. Any activation-steering baseline must be built without loading trained `w.pt` or using `act_oracle`/`TaskDiff_lora_fit`; otherwise it is a trained-delta oracle, not a fair baseline.
Core fork-plan mapping:
| fork-plan experiment | Fits what | Does not fit what |
|---|---|---|
| [cross-adapter causal `dW` basis ablation](../fork_plan.md) | learned `dW` SVD bases, shared adapter bases, per-adapter top/tail bases | pure activation bases unless first converted into a weight projection of the trained `dW` |
| [layer/module causal ablation of trained `dW`](../fork_plan.md) | layer slices, residual writers, attention output, MLP down, read/write module families | candidate bases that mix all layers without layer labels |
| [adapter-parameterization causal ablation of trained `dW`](../fork_plan.md) | LoRA rank components, PiSSA/DeLoRA S-space crops, DoRA magnitude vs direction, OFT rotations, IA3 gates | post-hoc activation PCA unless used only as an evaluation target |
| [activation-steering baseline](../fork_plan.md) | TaskDiff/RepE directions built without trained `dW`, selected on held-out validation rows | trained `dW` components, `act_oracle`, `TaskDiff_lora_fit` |
| [synthetic `dW'` baseline](../fork_plan.md) | pretrained read/write bases with signed coefficients from contrast activations | causal claims about the already trained adapter |
| new causal test | nonlinear clusters, token-conditional attention routing, concept-space probes, DAS/SAE features | simple keep/drop of a fixed linear `dW` basis unless linearized first |
## Source provenance
Notebook sources:
- [nbs/analyze_diff_v2.py](../nbs/analyze_diff_v2.py): first clean TaskDiff, suppressed, stenographic, `lm_head_read`, `logits_null` concentration test.
- [nbs/analyze_diff_v3.py](../nbs/analyze_diff_v3.py): A-side hypothesis vs B-side label framing.
- [nbs/hypothesis_sweep_v9.py](../nbs/hypothesis_sweep_v9.py): main hypothesis catalog, activation and weight scores, block-local scope diagnostic, cross-adapter support.
- [nbs/functional_projection_v10.py](../nbs/functional_projection_v10.py): causal projection/complement test of trained residual-write `dW` into activation-PCA bases.
- [nbs/v10_llama.py](../nbs/v10_llama.py): Wendler-style token-energy and logit-lens functional metrics.
- [nbs/strong_conclusion_v4.py](../nbs/strong_conclusion_v4.py): older positive result discipline for `write_not_read` vs TaskDiff before v9/v10 tightened the interpretation.
Local qmd search sources:
- `qmd search 'weight steering SVD subspace activation attention'` found local notes on SVD steering and adapter parametrization.
- `qmd search 'LoRA PiSSA DeLoRA OFT IA3 parameterization steering subspace'` found the adapter-as-hypothesis catalog.
- A vector/reranked qmd query OOMed the local GPU, so this catalog uses BM25 qmd results plus workspace notebooks.
External refinement sources added after the first catalog pass:
- `logs/research/20260427_external_hypotheses_qmd.out`: qmd BM25 searches for function vectors, SAE steering, concept induction, and adapter subspaces.
- `logs/research/20260427_external_hypotheses_hf.out`: HF Papers searches. Important hits include Function Vectors, Task Arithmetic, AxBench, SAE steering, MSRS, attention-output low-dimensional subspaces, and LoRA spectral methods.
- `logs/research/20260427_external_hypotheses_bibtex.bib`: semantic-search BibTeX output. Initial run failed on missing `rapidfuzz`; rerun with `uv run --with rapidfuzz` succeeded.
- `logs/research/20260427_external_hypotheses_selected_info.out`: HF metadata for selected papers.
- `logs/research/20260427_external_hypotheses_qmd_excerpts.out`: qmd excerpts for AxBench, SAE feature-flow notes, and dual-route/function-vector notes.
External papers used as hypothesis generators, stated as authors' claims unless already validated here:
| paper/source | search signal | relevant claim for this repo | hypothesis consequence |
|---|---|---|---|
| Todd et al., Function Vectors in Large Language Models, arXiv:2310.15213 | HF search + metadata, repo has 195 stars | Authors claim middle-layer attention heads transport compact task/function vectors with causal effects and compositionality. | Split concept vector from function/instruction vector; test FV-head output subspace separately from residual TaskDiff. |
| Ilharco et al., Editing Models with Task Arithmetic, arXiv:2212.04089 | HF search + metadata, repo has 538 stars | Authors claim weight-space task vectors compose by addition/negation and analogy. | Treat `dW` as a task vector family; add sign/analogy/arithmetic tests across adapters and behaviors. |
| Wu et al., AxBench, arXiv:2501.17148 | qmd + HF metadata | Authors claim prompting/finetuning beat most steering methods; difference-in-means is strong for concept detection; SAEs are not competitive in their benchmark. | Keep prompt and activation baselines honest; do not over-privilege SAE/PCA interpretability if simple DiffMean wins. |
| Arad et al., SAEs Are Good for Steering, arXiv:2505.20063 | HF metadata | Authors claim SAE steering improves after filtering features by output score; input features and output features often differ. | If testing SAE bases, use output-score filtering and allow signed negative projections; raw activation-frequency features are the wrong basis. |
| Mayne et al., Can sparse autoencoders decompose steering vectors?, arXiv:2411.08790 | HF metadata | Authors claim SAE decompositions of steering vectors can mislead because steering vectors are out of SAE input distribution and need negative feature projections. | SAE analysis should decompose signed vectors in decoder space, not just positive latent activations. |
| Jiang et al., MSRS, arXiv:2508.10599 | qmd + HF metadata | Authors claim multi-attribute steering benefits from orthogonal private subspaces plus a shared subspace and token-level dynamic weighting. | Replace one global basis with shared/private basis split for sycophancy vs honesty transfer and per-token weighting. |
| Wang et al., Attention Layers Add Into Low-Dimensional Residual Subspaces, arXiv:2508.16929 | HF metadata | Authors claim attention outputs live in low-dimensional subspaces induced by `W_o` and this affects SAE dead features. | Use attention-output active subspace as a causal basis and as an SAE initialization/control. |
| Park et al., The Information Geometry of Softmax, arXiv:2602.15293 | semantic-search BibTeX + HF metadata | Authors claim softmax information geometry gives a natural probe/steering geometry and propose dual steering to minimize off-target concept changes. | Use Fisher/softmax-metric projection instead of Euclidean `P_B` for logit-facing steering bases. |
Deferred external leads, not yet promoted to main hypotheses:
- `Causality != Invariance: Function and Concept Vectors in LLMs` from the BibTeX search is an anti-overclaiming control: vector invariance or compositionality is not enough; use causal keep/drop or patching.
- `Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens` from the BibTeX search reinforces the v10 warning: do not reject a function/control route only because immediate Yes/No readout is weak.
- `Spherical Steering` is probably a geometry variant of the existing rotation/OFT and softmax-geometry rows, not a separate experiment yet.
- `What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal` may be a closer behavioral analogue for honesty/sycophancy than translation papers; worth reading before final paper framing.
- qmd feature-flow and FGAA notes suggest a cross-layer SAE feature-flow hypothesis, but it needs pretrained SAEs. Defer unless SAE artifacts are available for Qwen3-0.6B.
## Current empirical bottom line
The old positive framing was: A-side recipes like `write_not_read` or TaskDiff may recover the LoRA steering label. The v9/v10 update is stricter:
- Across adapter families, most tested linear bases capture only about 1 to 8 percent of the relevant rank-matched oracle.
- Block-local activation PCA did not fix the mismatch between activation oracles and weight oracles.
- Causal projection shows activation-PCA directions can be potent if amplified, but for the strongest adapter, DeLoRA, the trained-scale behavior mostly lives in the complement.
- Wendler-style probes suggest LoRA-layer `Δh` is concept-space, not directly Yes/No readable. Downstream layers translate it into the Yes/No or honesty behavior.
So a basis can be useful for steering without being an explanation of the trained adapter. This distinction matters for every row below.
External-search update: the outside literature mostly pushes in the same direction, but only as hypothesis generation for this DD setting. AxBench-like results warn that simple DiffMean/ReFT-style baselines can beat prettier mechanistic bases for steering; function-vector and concept-induction work says task/function transport can be head-local and not logit-readable; SAE steering needs output-causal feature selection, not raw activation-feature labels; task arithmetic says the trained `dW` family itself may be the right algebraic object.
## Notation
Let `h_l` be a residual-stream vector at layer `l`, `W_l` be a pretrained linear map, and `dW_l` be the trained adapter delta for that map.
For a basis `B ∈ R^{d x k}` with orthonormal columns:
```py
P_B = B @ B.T
keep_B(dW) = P_B @ dW
drop_B(dW) = dW - P_B @ dW
energy_frac(x, B) = ||P_B x||^2 / ||x||^2
subspace_overlap(A, B) = ||A.T @ B||_F^2 / min(rank(A), rank(B))
```
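The notation above can be made concrete with NumPy. This is a sketch under two assumptions: `B` has orthonormal columns (so `rank(B)` is its column count), and `dW` is stored as `d_out x d_in` so that projecting from the left acts on output (write) directions. The random matrices are placeholders, not adapter weights.

```python
import numpy as np


def orthonormal_basis(d, k, rng):
    """Random d x k matrix with orthonormal columns (placeholder basis)."""
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return q


def keep(dW, B):
    """P_B @ dW without materialising the d x d projector."""
    return B @ (B.T @ dW)


def drop(dW, B):
    """Complement: dW with its component inside span(B) removed."""
    return dW - keep(dW, B)


def energy_frac(x, B):
    """||P_B x||^2 / ||x||^2; equals ||B.T x||^2 / ||x||^2 for orthonormal B."""
    return float(np.linalg.norm(B.T @ x) ** 2 / np.linalg.norm(x) ** 2)


def subspace_overlap(A, B):
    """||A.T B||_F^2 / min(rank A, rank B), assuming orthonormal columns."""
    return float(np.linalg.norm(A.T @ B, "fro") ** 2 / min(A.shape[1], B.shape[1]))


rng = np.random.default_rng(0)
d_out, d_in, k = 16, 12, 4
B = orthonormal_basis(d_out, k, rng)
dW = rng.standard_normal((d_out, d_in))
kept, dropped = keep(dW, B), drop(dW, B)
```

Because `keep` and `drop` are orthogonal pieces of `dW`, their energy fractions sum to one, which is why cropping always shrinks the Frobenius norm.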
For a weight matrix SVD:
```py
U, S, Vh = svd(W) # W: d_out x d_in
left_basis = U[:, :k] # output/write directions
right_basis = Vh[:k].T # input/read directions
S_coords(dW) = U.T @ dW @ Vh.T # adapter delta in W's singular-vector basis
```
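As a sanity check on the SVD notation, the sketch below confirms that `W` is diagonal in its own singular-vector coordinates and then expresses a delta in the same coordinates. The matrices are random placeholders, and `diag_energy` is an illustrative summary statistic that is not defined in the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, k = 10, 8, 3
W = rng.standard_normal((d_out, d_in))           # placeholder pretrained weight
dW = 0.01 * rng.standard_normal((d_out, d_in))   # placeholder adapter delta

U, S, Vh = np.linalg.svd(W, full_matrices=False)
left_basis = U[:, :k]     # output/write directions
right_basis = Vh[:k].T    # input/read directions


def s_coords(M):
    """Express M in W's singular-vector basis.

    With the thin SVD, any part of M outside the column span of U is lost.
    """
    return U.T @ M @ Vh.T


W_coords = s_coords(W)    # equals diag(S) up to rounding
dW_coords = s_coords(dW)

# Share of the delta's in-basis energy that sits on the diagonal, i.e. how
# much of dW merely rescales W's existing singular directions.
diag_energy = float(np.sum(np.diag(dW_coords) ** 2) / np.sum(dW_coords ** 2))
```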
## Trained-delta labels and oracles
These are useful for analysis but must not be presented as from-scratch steering recipes.
| name | construction | interpretation | steering? | fork-plan fit |
|---|---|---|---|---|
| `w_oracle` | `left_svd_basis(concat(dW_o_proj, dW_down_proj), k)` | The best rank-`k` residual-output basis for the trained local weight delta. | Not from scratch. Can steer only because it uses trained `dW`. | Cross-adapter `dW` basis ablation. Use as per-adapter top basis and as sanity ceiling. |
| `act_oracle` | `pca(normalize(h(+α) - h(-α)), k)` on eval activations | Best rank-`k` activation basis for the trained adapter effect on the sampled prompts. | Not fair from scratch if built from trained steering. Can be an intervention target. | New causal test or v10-style projection ablation, not cross-adapter shared `dW` unless converted to `P_act dW`. |
| `act_oracle_block` | `pca(normalize((post_pos-pre_pos) - (post_neg-pre_neg)), k)` | Scope-matched local block contribution instead of cumulative residual effect. | Same as `act_oracle`. | v10 projection/complement test. Helps check whether scope mismatch was the bug. |
| `TaskDiff_lora_fit` | PCA of trained adapter `h(+α)-h(-α)` on FIT prompts, scored on EVAL prompts | Held-out answer key for whether a learned effect generalizes across prompts. | Not from scratch. | Useful diagnostic for activation-steering upper bound and concept-space rank. Not a planned `dW` ablation by itself. |
Pseudocode:
```py
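def b_side_oracles(model, dW, prompts_fit, prompts_eval, k):
    # Activations of the adapted model at +α and -α, on FIT and EVAL prompts
    h_pos_fit = capture(model + dW, prompts_fit, α=+1)
    h_neg_fit = capture(model + dW, prompts_fit, α=-1)
    h_pos_eval = capture(model + dW, prompts_eval, α=+1)
    h_neg_eval = capture(model + dW, prompts_eval, α=-1)
    # TaskDiff_lora_fit: built on FIT prompts, scored on EVAL prompts
    B_task_lora = pca(h_pos_fit - h_neg_fit, k)
    # act_oracle: row-normalized adapter effect on EVAL prompts
    B_act_eval = pca(unit_rows(h_pos_eval - h_neg_eval), k)
    # w_oracle: top-k left singular vectors of the residual-writer deltas
    B_w = left_svd_basis(concat_residual_writer_dW(dW), k)
    return B_task_lora, B_act_eval, B_w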
```
Positive readout: an A-side candidate approaches these oracles and keeps behavior under causal keep/drop. Negative readout: high geometric score does not preserve behavior, or low geometric score still steers when amplified.
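The `w_oracle` row can be sketched in NumPy as follows. It assumes weights are stored as `d_out x d_in`, so `o_proj` and `down_proj` deltas share the residual (row) dimension and `concat` means joining along the input axis. The deltas here are synthetic rank-2 matrices that share one write subspace by construction.

```python
import numpy as np


def left_svd_basis(M, k):
    """Top-k left singular vectors of M (d_out x k, orthonormal columns)."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k]


rng = np.random.default_rng(0)
d_model, d_attn, d_mlp, r, k = 32, 32, 64, 2, 2

# Synthetic low-rank deltas writing into the same r-dimensional output subspace.
write_dirs, _ = np.linalg.qr(rng.standard_normal((d_model, r)))
dW_o_proj = write_dirs @ rng.standard_normal((r, d_attn))
dW_down_proj = write_dirs @ rng.standard_normal((r, d_mlp))

# Both maps write into the residual stream, so concatenate along inputs.
B_w = left_svd_basis(np.concatenate([dW_o_proj, dW_down_proj], axis=1), k)

# Fraction of the o_proj delta's energy captured by the oracle basis.
captured = float(np.linalg.norm(B_w.T @ dW_o_proj) ** 2 / np.linalg.norm(dW_o_proj) ** 2)
```

When `k` matches the adapter rank the captured energy is complete by construction, which is the sense in which the SVD lens is tautological for low-rank adapters and only useful as a sanity ceiling.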
## Activation hypotheses
### Function-vector head basis
Construction:
```py
fv_head_score = {}
for head in attention_heads:
    # score each head by the causal effect of patching its output
    fv_head_score[head] = causal_patch_score(head_output[head], task_prompt_pairs)
top_heads = topk(fv_head_score, k_heads)
B_fv = pca(stack([OV_output_basis(head) for head in top_heads]), k)
```
Interpretation: sycophancy/honesty steering may decompose into a concept vector plus a function or instruction vector. The function vector is task-level control like "answer honestly" or "agree with the user", and can be transported by a small set of middle-layer attention heads. Todd et al. claim function vectors are robust across contexts and compositional; Feucht et al. treat FV heads as distinct from concept-induction heads.
Steering use: yes as activation steering or head-output patching. It is probably a better fit for "what task is being done?" than for "what semantic concept is active?".
Fork-plan fit: new causal test, plus layer/module ablation if the trained `dW` concentrates in the `o_proj` rows for top FV heads. Add `fv_heads_only`, `non_fv_heads_only`, and `drop_fv_heads` rows if head-level masking is implemented.
Positive readout: FV-head patch changes instruction/function while preserving topic content, while same-layer random heads and concept-head controls do not. Negative readout: FV basis is just another dense TaskDiff basis and does not localize to heads.
### Concept-induction vs function-vector split
Construction:
```py
B_concept = pca(outputs(top_concept_induction_heads, semantic_copy_prompts), k)
B_function = pca(outputs(top_function_vector_heads, task_demonstration_prompts), k)
score = behavior(model, patch(B_concept)) - behavior(model, patch(B_function))
```
Interpretation: the current "concept-space" language in v10 may be underspecified. Dual-route induction suggests at least two soft-induction routes: concept heads transport what entity/concept is being discussed, while FV heads transport what transformation/task should be applied. Sycophancy may be a function-vector failure more than a concept-vector failure.
Steering use: yes, but the intervention should be head-local or route-local rather than a generic residual addition.
Fork-plan fit: new causal test. It also refines attention min/max/diff: token identity logging should distinguish concept tokens, instruction tokens, and answer-format tokens.
Positive readout: crossed dissociation. Concept-head patch changes target concept/topic with the same output policy; FV-head patch changes policy/instruction with the same concept/topic. Negative readout: both patches only move a generic sycophancy logit ratio.
### ReFT-r1 / supervised rank-1 representation finetuning baseline
Construction:
```py
r = train_rank1_reft(site=l, positives=honest_rows, negatives=sycophantic_rows)
h_l_steered = h_l + α * r.left @ (r.right.T @ h_l)
```
Interpretation: AxBench authors claim weakly supervised rank-1 representation finetuning is competitive while remaining more interpretable than prompting. This is a stronger fair activation baseline than unsupervised PCA if we allow a small supervised validation set.
Steering use: yes, but it is a learned activation intervention, not a trained weight-delta explanation.
Fork-plan fit: activation-steering baseline. It should be compared against TaskDiff and prompt baselines on identical DD rows. It should not use trained `w.pt`.
### SAE output-score signed feature basis
Construction:
```py
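features = sae.encode(h_l)                      # n_tokens x n_features
input_score = [corr(features[:, j], concept_label) for j in range(n_features)]
output_score = [causal_effect(decoder[:, j], target_logit_or_behavior) for j in range(n_features)]
selected = [j for j in range(n_features) if input_score[j] > τ_in and output_score[j] > τ_out]
# output_effect keeps its sign so negative feature projections survive
B_sae_out = orth(decoder[:, selected] * sign(output_effect[selected]))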
```
Interpretation: raw SAE activations are not enough. Arad et al. claim steering improves after selecting features with output-causal scores; Mayne et al. claim steering-vector SAE decomposition is misleading when it ignores negative feature projections. The hypothesis is that v9 PCA misses a sparse signed feature basis that is output-causal but not high-variance.
Steering use: possible. It should be tested only with signed decoder directions and output-score filtering.
Fork-plan fit: new causal test or activation-steering baseline. If converted to `P_B dW`, it becomes a trained-scale carrier test. Do not put raw SAE latent activations into the core fork-plan without output-score filtering and signed negative-projection controls.
Positive readout: output-score SAE basis steers at lower norm or lower degradation than DiffMean/TaskDiff/ReFT-r1, and ablations show both output-score filtering and signed negative projections matter. Negative readout: DiffMean or ReFT-r1 still dominates, matching AxBench's warning.
### MSRS-style shared/private steering
Construction:
```py
B_shared = intersection_or_joint_svd([B_sycophancy, B_honesty, B_refusal])
B_private_task = orth(B_task - project(B_task, B_shared))
α_tokens = router(token_features) # optional token-level weights
h_l += α_shared * P_shared @ v + α_private * P_private_task @ v
```
Interpretation: MSRS authors claim multi-attribute steering benefits from orthogonal private subspaces plus a shared subspace and token-level dynamic weighting. For this repo, the transfer target is sycophancy training to daily-dilemmas honesty. The shared/private split is a concrete alternative to one global TaskDiff.
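The `B_private_task` line above amounts to projecting out the shared subspace and re-orthonormalising what remains. A minimal NumPy sketch, assuming `B_shared` already has orthonormal columns; the bases are synthetic and the optional token router is omitted.

```python
import numpy as np


def orth(M, tol=1e-10):
    """Orthonormal basis for the column span of M, dropping near-zero directions."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, S > tol * S.max()]


def private_basis(B_task, B_shared):
    """Part of span(B_task) that is orthogonal to span(B_shared)."""
    residual = B_task - B_shared @ (B_shared.T @ B_task)
    return orth(residual)


rng = np.random.default_rng(0)
d = 20
Q, _ = np.linalg.qr(rng.standard_normal((d, 5)))
B_shared = Q[:, :2]
# Synthetic task basis: one shared direction plus two task-specific ones.
B_task = orth(np.hstack([B_shared[:, :1], Q[:, 2:4]]))
B_private = private_basis(B_task, B_shared)
```

The private rank (2 here) is smaller than the task rank (3) because one task direction already lives in the shared subspace.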
Steering use: yes. It is especially relevant if sycophancy and honesty share some moral-agreement axis but differ in prompt/style axes.
Fork-plan fit: activation-steering baseline and cross-adapter shared `dW` ablation. Keep two variants distinct: `MSRS_activation_shared_private` for activation steering, and `dW_shared_private_transfer` for trained deltas. The trained-delta version is `B_shared` across adapters/behaviors plus adapter-private residuals.
Positive readout: shared basis preserves transfer to DD while private basis preserves sycophancy eval; mixing them beats global TaskDiff/shared SVD and private-only baselines on the transfer/degradation frontier. Negative readout: shared/private split adds complexity without improving that frontier.
### Softmax information-geometry steering
Construction:
```py
J = jacobian(log_softmax(W_U @ h), h) # or Fisher metric approximation
G = J.T @ diag(p) @ J # local softmax/Fisher metric
P_B_G = B @ inv(B.T @ G @ B) @ B.T @ G
h_steered = h + α * P_B_G @ v # or project dW outputs with G metric
```
Interpretation: this is a projection metric variant, not a new basis family. Euclidean projection may be the wrong geometry for logit-facing behavior. Park et al. claim softmax information geometry gives a natural steering metric and dual steering can change a target concept while minimizing off-target changes. This directly addresses the current degradation concern.
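A NumPy sketch of the metric projection in the construction above, assuming a single residual vector `h` read out directly by an unembedding `W_U` (no final norm layer). The small ridge term is an added assumption to keep `B.T @ G @ B` invertible; it is not part of the notes.

```python
import numpy as np


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def fisher_metric(W_U, h, ridge=1e-8):
    """Local softmax/Fisher metric G = J.T diag(p) J, plus a small ridge."""
    p = softmax(W_U @ h)
    # Jacobian of log_softmax(W_U @ h) w.r.t. h: row v is W_U[v] - E_p[W_U].
    J = W_U - p @ W_U
    G = J.T @ (p[:, None] * J)
    return G + ridge * np.eye(W_U.shape[1])


def metric_projector(B, G):
    """Projector onto span(B) that is orthogonal under the metric G."""
    return B @ np.linalg.solve(B.T @ G @ B, B.T @ G)


rng = np.random.default_rng(0)
vocab, d, k = 50, 8, 2
W_U = rng.standard_normal((vocab, d))     # placeholder unembedding
h = 0.1 * rng.standard_normal(d)          # placeholder residual vector
B, _ = np.linalg.qr(rng.standard_normal((d, k)))

G = fisher_metric(W_U, h)
P_B_G = metric_projector(B, G)
P_B_euclid = B @ B.T
```

Both projectors fix `span(B)`; they differ in which complement they discard, and that choice is what the positive readout above would be testing.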
Steering use: yes for logit-facing activation steering. It may be less useful for hidden concept-space layers where `W_U` is not the immediate readout.
Fork-plan fit: activation-steering baseline and new projection/complement variant. Replace Euclidean `P_B dW` with Fisher/softmax-metric projection and compare behavior/degradation.
Positive readout: for the same basis, same norm, and same target effect, Fisher/softmax projection has lower off-target DD or lower perplexity degradation than Euclidean projection. Negative readout: no improvement at LoRA layers because the relevant concept is not yet logit-facing.
### TaskDiff contrast
Construction:
```py
h_pos = capture(base_model, persona_pos_prompts)
h_neg = capture(base_model, persona_neg_prompts)
B_task = pca(h_pos[l] - h_neg[l], k)
```
Interpretation: the target behavior is linearly separable in base residual activations under contrastive personas. This is the standard RepE/ActAdd-style story.
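A toy NumPy version of the TaskDiff construction, assuming paired prompts (row `i` of `h_pos` and `h_neg` come from the same dilemma) and an uncentered SVD in place of `pca`, so the mean difference is kept rather than subtracted. The activations are synthetic with one planted contrast direction.

```python
import numpy as np


def diff_basis(h_pos, h_neg, k):
    """Top-k right singular vectors of the paired activation differences."""
    diffs = h_pos - h_neg                       # n_prompts x d
    _, _, Vh = np.linalg.svd(diffs, full_matrices=False)
    return Vh[:k].T                             # d x k, orthonormal columns


rng = np.random.default_rng(0)
n, d = 64, 24
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)

# Synthetic activations: the positive persona shifts every prompt along one
# direction, plus a little per-prompt noise.
h_neg = rng.standard_normal((n, d))
h_pos = h_neg + 3.0 * direction + 0.1 * rng.standard_normal((n, d))

B_task = diff_basis(h_pos, h_neg, k=1)
alignment = abs(float(direction @ B_task[:, 0]))
```

With a centered PCA the constant shift would be removed before decomposition, so the choice between centered and uncentered variants changes what "TaskDiff" means.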
Steering use: yes. This is the main activation-steering baseline, if built from training prompts only and selected on held-out validation rows.
Fork-plan fit: activation-steering baseline. It can also seed synthetic `dW'` if converted into a write direction through pretrained writer maps, but then it is a constructive baseline, not an ablation of trained `dW`.
Failure mode: persona contrast may capture style or role behavior, not the sycophancy/honesty mechanism. v10 found persona contrast weakly captures the Yes/No axis. For a fair activation-steering benchmark, fail if code loads `w.pt` before constructing this basis.
### Suppressed or turnover subspace
Construction:
```py
mag = abs(h_clean).permute(layer, batch, dim)
Δmag = mag[1:] - mag[:-1]
added = relu(Δmag).sum(layer)
removed = relu(-Δmag).sum(layer)
B_suppressed = pca(min(added, removed), k)
```
Interpretation: planning or scratchpad features are written and later erased before final readout. This is a plausible hidden-computation basis.
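The turnover quantity `min(added, removed)` from the construction above, sketched in NumPy on a hand-checkable toy tensor. The layout `(n_layers, batch, dim)` is an assumption that matches the `permute(layer, batch, dim)` line.

```python
import numpy as np


def turnover(h_clean):
    """Per-feature written-then-erased magnitude.

    h_clean has shape (n_layers, batch, dim). A feature scores high only if
    its magnitude both rises and falls somewhere across depth.
    """
    mag = np.abs(h_clean)
    dmag = mag[1:] - mag[:-1]
    added = np.maximum(dmag, 0.0).sum(axis=0)
    removed = np.maximum(-dmag, 0.0).sum(axis=0)
    return np.minimum(added, removed)            # batch x dim


# Toy input: 3 layers, 1 prompt, 2 features.
# Feature 0 is written then partly erased (0 -> 2 -> 0.5).
# Feature 1 only grows (0 -> 1 -> 3).
h_toy = np.array([[[0.0, 0.0]], [[2.0, 1.0]], [[0.5, 3.0]]])
t_toy = turnover(h_toy)
```

Here feature 0 gets turnover 1.5 and feature 1 gets 0, so only the written-then-erased feature would contribute to `B_suppressed`.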
Steering use: maybe. It can define an activation intervention or a synthetic `dW'` target, but v9 did not show it explains trained `dW`.
Fork-plan fit: not a direct planned `dW` basis unless applied as `P_suppressed dW` in a v10-style projection/complement causal test. Better fit: synthetic `dW'` or new causal test.
Positive readout: `keep_B(dW)` preserves behavior and `drop_B(dW)` removes it. Negative readout: both projection and complement steer, or projection is weak at trained norm.
### Stenographic subspace
Construction:
```py
B_task = pca(h_pos - h_neg, k)
B_suppressed = pca(turnover(h_clean), k)
B_steno = intersect_bases(B_task, B_suppressed, min_overlap=τ)
```
Interpretation: task-relevant signal lives in directions the model also tends to suppress. This was the strongest form of the hidden-planning story in early notebooks.
Steering use: possible but rank may collapse. Use as an activation steering or synthetic write target, not as evidence about trained `dW` without causal ablation.
Fork-plan fit: new causal test or synthetic `dW'`. If projected onto trained `dW`, it becomes a v10-style trained-scale carrier test.
### Churn
Construction:
```py
B_churn_l = pca(h_clean[l + 1] - h_clean[l], k)
```
Interpretation: important computation lives where the residual stream changes most across layers, not where static activations have high variance.
Steering use: maybe, but broad and likely nonspecific.
Fork-plan fit: synthetic `dW'` or activation-steering baseline. For trained `dW`, use projection/complement as an extra causal test, not one of the three core fork-plan ablations.
### Amplified and added features
Construction:
```py
B_amplified = pca(relu(abs(h_clean[last]) - abs(h_clean[first])), k)
B_added = pca(relu(abs(h_clean[1:]) - abs(h_clean[:-1])).sum(layer), k)
```
Interpretation: useful behavior may ride features that are progressively amplified, not features that are written then erased.
Steering use: weak prior. It is a broad activation prior rather than a behavior-specific hypothesis.
Fork-plan fit: synthetic `dW'` or activation-steering exploratory baseline. Not a core trained-`dW` ablation unless used as `P_B dW`.
### Global clean and persona residual PCA
Construction:
```py
B_clean = pca(stack_layers_and_prompts(h_clean), k)
B_persona = pca(stack(h_persona_pos, h_persona_neg), k)
```
Interpretation: behavior lies in high-variance background residual directions. This is mostly a control.
Steering use: probably poor as a specific steerer.
Fork-plan fit: random/control-like row for synthetic `dW'` or activation steering. It should not be central unless it unexpectedly beats task-specific bases.
### Attention-selected TaskDiff: min, max, diff, min times norm
Construction:
```py
attn_pos = final_token_attention(persona_pos)
attn_neg = final_token_attention(persona_neg)
tok_diff = h_pos_tokens - h_neg_tokens
B_attn_min = pca(sum_tokens(min(attn_pos, attn_neg) * tok_diff), k)
B_attn_max = pca(sum_tokens(max(attn_pos, attn_neg) * tok_diff), k)
B_attn_diff = pca(sum_tokens(abs(attn_pos - attn_neg) * tok_diff), k)
B_attn_min_norm = pca(sum_tokens(min(attn_pos, attn_neg) * norm(tok_diff) * tok_diff), k)
```
Interpretation:
- `attn_min_taskdiff`: shared attended tokens carry the stable plan.
- `attn_max_taskdiff`: any strongly attended token can carry the plan.
- `attn_diff_taskdiff`: changes in attention routing are themselves the signal.
- `attn_min_x_diffnorm_taskdiff`: shared attention matters, but high-contrast tokens get more weight.
Implementation caveat: v9 scores these as linear spans after attention-weighted aggregation. It does not prove which token type carried the signal. A real causal test should log token indices and token strings for the min/max/diff weights, e.g. final token, delimiter, question token, or persona token, and then perturb the selected attention route.
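A sketch of the attention-weighted aggregation that also returns the highest-weight token index per prompt, so the token type can be logged as the caveat suggests. It assumes positive and negative prompts are token-aligned with equal length; the function name, argument layout, and toy values are illustrative.

```python
import numpy as np


def attn_weighted_diff(attn_pos, attn_neg, tok_diff, mode="min"):
    """Pool per-token activation differences with attention-derived weights.

    attn_pos, attn_neg: (n_prompts, n_tokens) final-token attention weights.
    tok_diff: (n_prompts, n_tokens, d) per-token h_pos - h_neg.
    Returns the pooled (n_prompts, d) rows and the top-weight token index.
    """
    if mode == "min":
        w = np.minimum(attn_pos, attn_neg)
    elif mode == "max":
        w = np.maximum(attn_pos, attn_neg)
    elif mode == "diff":
        w = np.abs(attn_pos - attn_neg)
    else:
        raise ValueError(f"unknown mode: {mode}")
    pooled = (w[..., None] * tok_diff).sum(axis=1)
    top_token = w.argmax(axis=1)    # log this alongside the token string
    return pooled, top_token


# Toy values: one prompt, two tokens, two hidden dimensions.
attn_pos = np.array([[0.1, 0.9]])
attn_neg = np.array([[0.6, 0.4]])
tok_diff = np.array([[[1.0, 0.0], [0.0, 1.0]]])

pooled_min, top_min = attn_weighted_diff(attn_pos, attn_neg, tok_diff, "min")
pooled_max, top_max = attn_weighted_diff(attn_pos, attn_neg, tok_diff, "max")
```

The pooled rows would then go through `pca(..., k)` as in the construction block; the logged indices are what a token-conditional causal test would perturb.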
Steering use: yes as activation steering, especially if last-token extraction is too narrow.
Fork-plan fit: activation-steering baseline or new token-conditional causal test. It does not fit current layer/module `dW` ablation unless converted into `P_B dW` for residual writers.
Positive readout: attention-weighted basis beats unweighted TaskDiff on held-out behavior and projection. Negative readout: attention weights select formatting or prompt tokens and do not steer.
Refinement from MSRS: the token weights should be learned or validated by behavior, not only borrowed from raw attention. Add a matched comparison between attention-derived weights and a small router trained on token type or logit effect.
### Up-proj input contrast
Construction:
```py
x_up_pos = capture_input(model.layers[l].mlp.up_proj, persona_pos)
x_up_neg = capture_input(model.layers[l].mlp.up_proj, persona_neg)
B_up_input = pca(x_up_pos - x_up_neg, k)
```
Interpretation: the behavior is represented in the features read by the MLP expansion before nonlinear gating.
Steering use: activation steering at MLP inputs, or synthetic `dW'` into `up_proj` or `gate_proj`.
Fork-plan fit: layer/module ablation if it motivates an `up/gate` row. Synthetic `dW'` if constructing from base activations. Not covered by residual-write-only v10 projection.
### Up-proj output written contrast
Construction:
```py
u_pos = up_proj(x_up_pos)
u_neg = up_proj(x_up_neg)
B_up_written = pca((u_pos - u_neg) @ W_down.T, k)
```
Interpretation: the MLP expansion difference matters only after being mapped back to residual space.
Steering use: plausible residual-write target.
Fork-plan fit: layer/module ablation, especially `mlp_down_proj_only`, and synthetic `dW'` via MLP write maps.
### Gate-active written
Construction:
```py
gate = silu(h_clean @ W_gate.T)
up = h_clean @ W_up.T
B_gate_active = pca((gate * up) @ W_down.T, k)
```
Interpretation: target behavior may live in active gated MLP features rather than raw read/write SVD directions.
Steering use: yes, but likely nonlinear and input-dependent.
Fork-plan fit: layer/module ablation if `up/gate` or MLP modules carry behavior. New causal test if using token/input-conditional gates, because fixed linear keep/drop loses the nonlinearity.
### CHaRS-style clusters
Construction:
```py
H = concat(h_clean, h_persona_pos, h_persona_neg)
centroids = kmeans(H, n_clusters=k)
B_chars = pca(centroids - mean(centroids), k)
```
Interpretation: concept behavior is a cluster or manifold, not a single linear direction. PCA of centroids is a lossy linearization.
Steering use: maybe strong if implemented as per-cluster translations, weak if collapsed to one global span.
Fork-plan fit: new causal test. It does not fit current linear `dW` basis ablations unless deliberately linearized.
### Rotation contrast or Procrustes generator
Construction:
```py
J = pca(concat(h_neg, h_pos), rank)
X = center(h_neg) @ J
Y = center(h_pos) @ J
U, _, Vh = svd(X.T @ Y)
R = U @ Vh
B_rot = J @ left_svd_basis(R - R.T, k)
```
Interpretation: the persona contrast is better described as a rotation in a local concept manifold than as a translation.
Steering use: yes, but the intervention should be rotational, not additive activation steering.
Fork-plan fit: adapter-parameterization inspiration, especially OFT/AntiPaSTO, or a new rotation causal test. Not a natural fit for plain keep/drop unless converted to rotation-derived `dW`.
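The Procrustes step above can be checked in isolation. This is a minimal NumPy sketch that works directly in the already-projected `rank`-dimensional coordinates (so the `J` projection is skipped); the helper names and the synthetic data are assumptions for illustration, not repo code.

```python
import numpy as np

def procrustes_rotation(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal R minimizing ||X @ R - Y||_F (orthogonal Procrustes)."""
    U, _, Vh = np.linalg.svd(X.T @ Y)
    return U @ Vh

def rotation_generator_basis(R: np.ndarray, k: int) -> np.ndarray:
    """Top-k left singular vectors of the skew part R - R.T."""
    U, _, _ = np.linalg.svd(R - R.T)
    return U[:, :k]

rng = np.random.default_rng(0)
rank = 6
X = rng.standard_normal((200, rank))        # stand-in for center(h_neg) @ J
# Hypothetical ground-truth orthogonal map, built from a QR decomposition.
R_true, _ = np.linalg.qr(rng.standard_normal((rank, rank)))
Y = X @ R_true                              # stand-in for center(h_pos) @ J
R = procrustes_rotation(X, Y)
B = rotation_generator_basis(R, k=2)
```

On noiseless data the recovered `R` should match `R_true`; on real activations the residual `||X @ R - Y||` is itself a useful diagnostic of how rotational the contrast is.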
### Wendler concept-space functional probes
Construction:
```py
Δh_l = mean(h_l(+α) - h_l(-α), prompts)
E2(Δh_l) = (vocab / d) * ||U_hat @ Δh_l||^2 / ||U_hat.T @ U_hat||_F^2
cap_yn(B) = ||P_B(e_yes - e_no)||^2 / ||e_yes - e_no||^2
ldiff(B, Δh) = (e_yes - e_no).T @ P_B @ Δh
```
Interpretation: the LoRA may write a concept that is not directly readable as Yes/No until downstream layers. This tests readout visibility rather than subspace overlap.
Steering use: no direct steering basis by itself, but it tells which layer to steer or probe.
Fork-plan fit: new causal test and benchmark diagnostics. It should be added as an analysis column for layer/module ablation, because a slice that changes behavior may still be invisible to the immediate logit lens.
## Pretrained-weight bases
### Attention-output active subspace
Construction:
```py
for l in layers:
    # Attention-output activations at layer l on clean prompts.
    A_out = capture_output(model.layers[l].self_attn.o_proj, clean_prompts)
    B_attn_active_l = pca(A_out, k)
    B_attn_Wo_l = left_svd_basis(W_o_l, k)
    B_attn_active_intersect_l = intersection(B_attn_active_l, B_attn_Wo_l)
```
Interpretation: Wang et al. claim attention outputs occupy a surprisingly low-dimensional residual subspace induced by the output projection. This makes a sharper version of `attn_o_proj_only`: the question is not just whether attention output matters, but whether the trained adapter uses the active attention-output subspace or off-manifold attention-output directions.
Steering use: yes as a synthetic attention-write basis and as an SAE initialization/control.
Fork-plan fit: layer/module ablation and synthetic `dW'`. Add `P_attn_active dW_attn_o` vs complement if attention-only rows are positive.
Positive readout: `P_attn_active dW` keeps attention-mediated behavior and the complement loses it, and active PCA beats both `attn_o_proj_only` and structural `W_o` left-SVD controls. Negative readout: the active subspace is indistinguishable from a generic attention-module ablation, or the trained adapter steers via off-manifold `W_o` directions.
### `lm_head_read` and `logits_null` or weak readout
Construction:
```py
U, S, Vh = svd(W_unembed)
B_lm_read = Vh[:k].T
B_logits_null = Vh[-k:].T
```
Interpretation: `lm_head_read` is the canonical readable residual subspace; `logits_null` is weakly read out by the unembedding.
Steering use: yes for simple readout steering, but v10 suggests concept steering does not live here at LoRA layers.
Fork-plan fit: synthetic `dW'` baseline, activation-steering control, and possible `write-not-read` construction. Not a likely trained-`dW` carrier unless `P_lm dW` keeps behavior.
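A small NumPy sketch of the read/null split, assuming `W_unembed` has shape `(vocab, d_model)` and using a random matrix with a decaying spectrum as a stand-in for a real unembedding. It only illustrates that the two bases differ in how strongly the unembedding responds to them.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, k = 50, 16, 4
# Hypothetical unembedding; columns are scaled to give a decaying spectrum.
W_unembed = rng.standard_normal((vocab, d)) * np.linspace(2.0, 0.1, d)

# Rows of Vh are residual-space directions, sorted by singular value (descending).
U, S, Vh = np.linalg.svd(W_unembed, full_matrices=False)
B_lm_read = Vh[:k].T        # most strongly read directions
B_logits_null = Vh[-k:].T   # most weakly read directions

# Logit response (Frobenius norm) to unit-norm inputs from each basis.
read_gain = np.linalg.norm(W_unembed @ B_lm_read)
null_gain = np.linalg.norm(W_unembed @ B_logits_null)
```

With `full_matrices=False` and `vocab > d_model`, `Vh[-k:]` is the weakest-read part of the residual space rather than a true null space, which matches the "weak readout" wording above.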
### Global read
Construction:
```py
G_read = sum_l(W_q_l.T @ W_q_l + W_k_l.T @ W_k_l + W_v_l.T @ W_v_l
+ W_up_l.T @ W_up_l + W_gate_l.T @ W_gate_l)
G_read += W_unembed.T @ W_unembed
B_global_read = eig_top(G_read, k)
```
Interpretation: residual directions broadly read by attention, MLP, and unembedding across the model.
Steering use: maybe as a safe/readable direction, but broad and nonspecific.
Fork-plan fit: synthetic `dW'` and controls. It is also the forbidden subspace for `global_write_not_global_read`.
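As a sanity check on the construction, the top eigenvectors of the summed Gram matrix span the same subspace as the top right singular vectors of the row-stacked read matrices. The sketch below assumes each reader has shape `(out_features, d_model)`; `eig_top` is an illustrative implementation and the reader matrices are random stand-ins.

```python
import numpy as np

def eig_top(G: np.ndarray, k: int) -> np.ndarray:
    """Top-k eigenvectors of a symmetric PSD matrix, largest eigenvalue first."""
    vals, vecs = np.linalg.eigh(G)     # eigh returns eigenvalues in ascending order
    return vecs[:, ::-1][:, :k]

rng = np.random.default_rng(0)
d, k = 12, 3
# Hypothetical read-side matrices (q, k, v, up, gate, unembed).
readers = [rng.standard_normal((n_out, d)) for n_out in (8, 8, 8, 20, 20, 30)]
G_read = sum(W.T @ W for W in readers)
B_global_read = eig_top(G_read, k)

# Equivalent view: right singular vectors of the row-stacked readers.
_, _, Vh = np.linalg.svd(np.concatenate(readers, axis=0), full_matrices=False)
B_svd = Vh[:k].T
```

The Gram form avoids materializing the stacked matrix, which matters when the unembedding is included.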
### Global write
Construction:
```py
W_write_all = concat_cols([W_o_l, W_down_l for all layers])
B_global_write = left_svd_basis(W_write_all, k)
```
Interpretation: directions the model can easily write into residual stream across all layers.
Steering use: plausible but nonspecific.
Fork-plan fit: synthetic `dW'`, random/control-like global basis, or cross-adapter ablation if intersected with trained `dW` residual writers.
### Global write not global read
Construction:
```py
P_read = B_global_read_broad @ B_global_read_broad.T
B_gwnr = left_svd_basis((I - P_read) @ W_write_all, k)
```
Interpretation: globally writeable directions that are not in the dominant global read subspace. This is a model-level stenographic candidate.
Steering use: yes as synthetic `dW'` or activation intervention. It may be high-gain if downstream nonlinear paths read it despite low linear readout.
Fork-plan fit: synthetic `dW'` and optional projection/complement trained-`dW` test. If `keep_B(dW)` works and `drop_B(dW)` fails, it supports a write-not-read causal story.
### Per-layer write, attention write, and MLP write
Construction:
```py
B_write_l = left_svd_basis(concat_cols(W_o_l, W_down_l), k)
B_attn_write_l = left_svd_basis(W_o_l, k)
B_mlp_write_l = left_svd_basis(W_down_l, k)
```
Interpretation: layer-local residual write capacity, split by attention and MLP writers.
Steering use: yes for synthetic `dW'`; also direct causal ablation of trained `dW` by module.
Fork-plan fit: layer/module ablation. Required rows already include `attn_o_proj_only`, `mlp_down_proj_only`, and `residual_write_only`.
### Write not read: lm-head, global, downstream
Construction:
```py
B_wnr_lm_l = left_svd_basis((I - P_lm_read_broad) @ concat_cols(W_o_l, W_down_l), k)
B_wnr_global_l = left_svd_basis((I - P_global_read_broad) @ concat_cols(W_o_l, W_down_l), k)
B_wnr_downstream_l = left_svd_basis((I - P_downstream_read_l) @ concat_cols(W_o_l, W_down_l), k)
```
Interpretation: the layer writes into directions not immediately read by a chosen downstream read model. This was the strongest early A-side recipe, but v9/v10 weaken the explanatory claim.
Steering use: yes. This is one of the best synthetic `dW'` candidates because it is purely pretrained and module-local.
Fork-plan fit: synthetic `dW'` baseline first. As a trained-`dW` ablation, use `P_B dW` and complement rows. It is not already in the three core ablations, but it is a natural extension of layer/module causal ablation.
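A minimal NumPy sketch of the write-not-read construction, assuming write matrices have shape `(d_model, in_features)` and that the read basis is orthonormal. The helper names mirror the pseudocode above but are illustrative implementations, and the matrices are random stand-ins.

```python
import numpy as np

def left_svd_basis(M: np.ndarray, k: int) -> np.ndarray:
    """Top-k left singular vectors of M."""
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :k]

def write_not_read_basis(W_write: np.ndarray, B_read: np.ndarray, k: int) -> np.ndarray:
    """Top-k write directions after projecting out an orthonormal read basis."""
    P_read = B_read @ B_read.T
    I = np.eye(W_write.shape[0])
    return left_svd_basis((I - P_read) @ W_write, k)

rng = np.random.default_rng(0)
d, d_attn, d_ff, k = 16, 8, 32, 4
W_o = rng.standard_normal((d, d_attn))       # hypothetical attention writer
W_down = rng.standard_normal((d, d_ff))      # hypothetical MLP writer
B_read, _ = np.linalg.qr(rng.standard_normal((d, 5)))   # orthonormal read basis
W_write = np.concatenate([W_o, W_down], axis=1)
B_wnr = write_not_read_basis(W_write, B_read, k)
```

The resulting basis is orthogonal to `B_read` by construction, so any behavioral effect of steering along it has to pass through paths that the chosen linear read model does not capture.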
### MLP up-read and gate-read
Construction:
```py
B_up_read_l = right_svd_basis(W_up_l, k)
B_gate_read_l = right_svd_basis(W_gate_l, k)
```
Interpretation: behavior is represented in residual directions read by the MLP expansion or gate.
Steering use: likely as input activation steering or synthetic input-side `dW`, less direct for residual-output `dW`.
Fork-plan fit: layer/module ablation if `up/gate` modules carry behavior. Adapter-parameterization ablation for IA3 MLP gates.
### Attention QKV read and input superposition
Construction:
```py
B_qkv_read_l = right_svd_basis(concat_rows(W_q_l, W_k_l, W_v_l), k)
B_input_super_l = right_svd_basis(concat_rows(W_q_l, W_k_l, W_v_l, W_up_l, W_gate_l), k)
B_kv_super_l = right_svd_basis(concat_rows(W_k_l, W_v_l), k)
```
Interpretation: the steering-relevant state is in what attention or all input-side modules read, rather than what residual writers output.
Steering use: activation steering at module inputs or synthetic `dW` for read-side matrices.
Fork-plan fit: layer/module ablation if q/k/v/up/gate trained deltas matter. Not scored by residual-output-only v10, so include read-side trained `dW` rows if this hypothesis matters.
### Merged K and Q, `qk_circuit`
Construction:
```py
K_expanded = repeat_kv_rows_to_match_q_heads(W_k_l, W_q_l.shape[0])
B_qk_l = left_svd_basis(W_q_l.T @ K_expanded, k)
```
Interpretation: planning routes through attention score geometry, the bilinear interaction between queries and keys, not through values or residual writes alone. This is the requested K/Q merge hypothesis.
Steering use: not as a simple residual write. Better as a causal attention-routing intervention or trained q/k module ablation.
Fork-plan fit: layer/module ablation if q/k deltas are kept/dropped. Otherwise new causal test: perturb QK score subspace and measure behavior. v9 includes `qk_circuit` as a geometric candidate, but that is weaker than a QK causal intervention.
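For grouped-query attention the key rows have to be expanded per head block before forming the merged matrix. The sketch below assumes `W_q` and `W_k` are stored as `(n_heads * head_dim, d_model)` and that each KV head serves a contiguous group of query heads; unlike the pseudocode helper, it takes `head_dim` explicitly, which is an assumption of this sketch.

```python
import numpy as np

def repeat_kv_rows(W_kv: np.ndarray, n_q_heads: int, head_dim: int) -> np.ndarray:
    """Repeat each KV head block so the result has one block per query head."""
    n_kv_heads = W_kv.shape[0] // head_dim
    n_rep = n_q_heads // n_kv_heads
    blocks = W_kv.reshape(n_kv_heads, head_dim, -1)
    return np.repeat(blocks, n_rep, axis=0).reshape(n_q_heads * head_dim, -1)

rng = np.random.default_rng(0)
d, head_dim, n_q_heads, n_kv_heads, k = 16, 4, 4, 2, 3
W_q = rng.standard_normal((n_q_heads * head_dim, d))
W_k = rng.standard_normal((n_kv_heads * head_dim, d))

K_expanded = repeat_kv_rows(W_k, n_q_heads, head_dim)
QK = W_q.T @ K_expanded          # (d, d): head-summed bilinear form x_q^T QK x_k
U, _, _ = np.linalg.svd(QK)
B_qk = U[:, :k]

# Reference: explicit sum over heads, each query head paired with its KV head.
per_head = sum(
    W_q[h * head_dim:(h + 1) * head_dim].T
    @ W_k[(h // 2) * head_dim:(h // 2 + 1) * head_dim]
    for h in range(n_q_heads)
)
```

Because the merged matrix sums over heads, it can hide head-specific routing; a per-head variant may be the better input to a causal attention-score intervention.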
### Attention OV write
Construction:
```py
V_expanded = repeat_kv_rows_to_match_o_heads(W_v_l, W_o_l.shape[1])
B_ov_l = left_svd_basis(W_o_l @ V_expanded, k)
```
Interpretation: attention writes behavior through the value-to-output circuit, not through QK selection.
Steering use: plausible residual write target because `W_o W_v` maps token content into residual output.
Fork-plan fit: layer/module ablation, especially attention-only rows. Synthetic `dW'` if signed by persona contrast.
### MLP roundtrip
Construction:
```py
B_mlp_roundtrip_l = left_svd_basis(W_down_l @ W_up_l, k)
```
Interpretation: residual-to-MLP-to-residual linear path captures the relevant feature transformation.
Steering use: yes as an MLP synthetic basis, with the caveat that real MLPs are gated and nonlinear.
Fork-plan fit: layer/module ablation and synthetic `dW'`. If this beats attention rows, the paper story moves toward feature-space MLP steering.
### Gate kernel
Construction:
```py
mean_gate = mean(silu(h_clean @ W_gate.T), batch)
B_gate_kernel_l = left_svd_basis(W_down_l @ diag(mean_gate) @ W_up_l, k)
```
Interpretation: gated MLP write under average activation is the relevant local linearization.
Steering use: yes, more plausible than raw `W_down W_up` because it includes the gate.
Fork-plan fit: layer/module ablation and new nonlinear causal test. Current trained-`dW` ablation should include `gate/up/down` splits to avoid missing it.
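A small NumPy sketch of the gate-kernel linearization, assuming `W_gate` and `W_up` are `(d_ff, d_model)` and `W_down` is `(d_model, d_ff)`. The function name and the random inputs are illustrative; the check at the end only confirms that the kernel equals the MLP with its gate frozen at the batch mean.

```python
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    return x / (1.0 + np.exp(-x))

def gate_kernel(h_clean, W_gate, W_up, W_down, k):
    """Mean-gate linearization of a gated MLP and its top-k write basis."""
    mean_gate = silu(h_clean @ W_gate.T).mean(axis=0)     # (d_ff,)
    M = W_down @ np.diag(mean_gate) @ W_up                # (d_model, d_model)
    U, _, _ = np.linalg.svd(M)
    return U[:, :k], M, mean_gate

rng = np.random.default_rng(0)
d, d_ff, k = 8, 24, 3
h_clean = rng.standard_normal((64, d))        # hypothetical clean activations
W_gate = rng.standard_normal((d_ff, d))
W_up = rng.standard_normal((d_ff, d))
W_down = rng.standard_normal((d, d_ff))
B_gate_kernel, M, mean_gate = gate_kernel(h_clean, W_gate, W_up, W_down, k)

# The kernel acts like the MLP with the gate frozen at its batch mean.
x = rng.standard_normal(d)
frozen_gate_out = W_down @ (mean_gate * (W_up @ x))
```

The real MLP output differs from `M @ x` wherever the per-token gate departs from its mean, which is the nonlinearity the fixed keep/drop ablation cannot see.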
### Attention sink
Construction:
```py
sink_vecs = []
for head in heads:
    # head_slice / kv_head_slice index this head's columns of W_o and rows of W_v.
    sink_vecs.append(W_o_l[:, head_slice] @ (W_v_l[kv_head_slice] @ embedding(BOS)))
B_sink_l = pca(stack(sink_vecs), k)
```
Interpretation: special sink or BOS-mediated attention channels carry planning or control state.
Steering use: speculative. Could steer attention sink features, but risks selecting formatting/sink artifacts.
Fork-plan fit: new attention-specific causal test, not a core `dW` basis ablation unless trained attention output deltas project strongly here.
### Causally isolated write
Construction:
```py
B_forbidden = union(B_input_super_l, B_kv_super_l, B_lm_read_broad)
B_isolated_l = project_write_away(concat_cols(W_o_l, W_down_l), B_forbidden)
```
Interpretation: directions writeable by the layer but isolated from obvious read, KV, and lm-head axes. Strong hidden-channel hypothesis.
Steering use: yes as a synthetic hidden-write intervention.
Fork-plan fit: synthetic `dW'` and new causal projection/complement test. A positive trained-`dW` result would be interesting because it says the learned adapter uses hidden write channels.
## Learned-weight and adapter parameterization hypotheses
### Weight-space task arithmetic
Construction:
```py
dW_task = W_finetuned_task - W_base
dW_negated = -dW_task
dW_composed = dW_honesty + dW_anti_sycophancy
dW_analogy = dW_A_to_B + dW_C - dW_A
```
Interpretation: task arithmetic authors claim task vectors in weight space can be negated, added, and used in analogy-like combinations. Weight steering is already a task-vector method, but the current fork plan mostly tests subspace carriers, not algebra. This hypothesis says the meaningful object may be the full signed `dW` vector and its arithmetic across behaviors/adapters.
Steering use: yes. This is directly weight steering.
Fork-plan fit: cross-adapter causal `dW` ablation and future multi-behavior benchmark. Mark as future until there are at least two behavior diffs: `dW_honesty`, `dW_anti_sycophancy`, `dW_refusal`, etc. A sign test can be run earlier only if positive and negative adapters are independently meaningful.
Positive readout: `dW_a + dW_b` approximately adds behavioral deltas without extra degradation; `-dW` reverses the target behavior more cleanly than random sign, permuted-layer, and random-norm controls. Negative readout: composition fails because adapters exploit incompatible basins or layer/module supports.
These are the hypotheses most directly aligned with the active fork-plan ablations.
### LoRA low-rank delta
Construction:
```py
dW = B @ A
W_steered = W + α * dW
```
Interpretation: the behavior delta is low-rank in ordinary weight coordinates.
Steering use: yes, this is the current baseline.
Fork-plan fit: cross-adapter SVD, per-adapter SVD, rank-component parameterization ablation, multi-seed benchmark.
### DoRA magnitude vs direction
Construction:
```py
V = W + α * (B @ A)
scale = m / stopgrad(norm(V, dim=output_axis))
W_eff = scale * V
```
Interpretation: magnitude and direction of weight vectors are separate causal degrees of freedom.
Steering use: yes, but current results say DoRA behaves similarly to LoRA on this task.
Fork-plan fit: adapter-parameterization ablation: keep/drop direction component vs magnitude component.
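One way to realize the direction-vs-magnitude split as two additive `dW` pieces is sketched below. It assumes weights of shape `(out_features, in_features)` with one magnitude per output row, and it defines "direction-only" as the new direction rescaled to the base row norms; that definition is an assumption of this sketch, not something the repo fixes.

```python
import numpy as np

def row_norm(M: np.ndarray) -> np.ndarray:
    return np.linalg.norm(M, axis=1, keepdims=True)

def dora_split(W, BA, m, alpha=1.0):
    """Split the DoRA effective update into direction-only and magnitude-only parts."""
    V = W + alpha * BA
    unit_V = V / row_norm(V)
    W_eff = m * unit_V                       # full DoRA weight
    W_dir_only = row_norm(W) * unit_V        # new direction, base magnitude
    dW_direction = W_dir_only - W
    dW_magnitude = W_eff - W_dir_only
    return W_eff, dW_direction, dW_magnitude

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 5))
BA = 0.1 * rng.standard_normal((6, 5))       # hypothetical low-rank update B @ A
m = 1.5 * row_norm(W)                        # hypothetical learned magnitudes
W_eff, dW_direction, dW_magnitude = dora_split(W, BA, m)
```

The two pieces sum to the full effective update, so keep/drop rows built from them are directly comparable to the unablated adapter.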
### DeLoRA decoupled rank directions and strengths
Construction:
```py
scale_i = λ_i / (rank * ||A_i|| * ||B_i||)
dW = B @ diag(scale) @ A
```
Interpretation: the coherent behavioral axis is angular direction plus explicit strength. Current repo evidence: strongest raw steerer and best negative coefficient symmetry, but not explained by tested activation PCA.
Steering use: yes, strongest current raw method.
Fork-plan fit: adapter-parameterization ablation. Split rank directions, λ strengths, top/bottom S-space energy, and compare to residual complement.
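The normalization above implies that each rank-1 component has Frobenius norm `λ_i / rank`, independent of the raw scale of `A` and `B`. A minimal NumPy sketch, taking `A_i` as row `i` of `A` and `B_i` as column `i` of `B` (an assumed indexing convention), with random stand-in factors:

```python
import numpy as np

def delora_delta(B: np.ndarray, A: np.ndarray, lam: np.ndarray):
    """dW = B @ diag(scale) @ A with per-rank normalized strengths."""
    rank = A.shape[0]
    a_norm = np.linalg.norm(A, axis=1)     # ||A_i||, rows of A
    b_norm = np.linalg.norm(B, axis=0)     # ||B_i||, columns of B
    scale = lam / (rank * a_norm * b_norm)
    return B @ np.diag(scale) @ A, scale

rng = np.random.default_rng(0)
d_out, d_in, rank = 12, 10, 3
B = rng.standard_normal((d_out, rank))
A = rng.standard_normal((rank, d_in))
lam = np.array([4.0, 2.0, 1.0])            # hypothetical strengths
dW, scale = delora_delta(B, A, lam)

# Per-rank components: strength lives in lam, direction in the normalized outer product.
components = [scale[i] * np.outer(B[:, i], A[i]) for i in range(rank)]
component_norms = np.array([np.linalg.norm(c) for c in components])
```

This makes the "λ strengths vs rank directions" ablation well defined: rescaling `lam` changes only norms, while swapping components changes only directions.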
### PiSSA top SVD subspace
Construction:
```py
U, S, Vh = svd(W)
W_res = U[:, r:] @ diag(S[r:]) @ Vh[r:]
adapter = U[:, :r] @ diag(S[:r]) @ Vh[:r]
train(adapter)
```
Interpretation: pretrained top singular directions are the useful adaptation manifold. Current repo evidence: clean stable baseline, often high steering without DeLoRA saturation.
Steering use: yes.
Fork-plan fit: adapter-parameterization ablation with S-space quartiles and energy crops. Also cross-adapter shared SVD if PiSSA top components overlap other adapters.
### OFT rotation
Construction:
```py
A_skew = skew(params)
R = cayley(A_skew)
W_eff = W @ R.T
dW = W_eff - W
```
Interpretation: behavior can be changed by rotating pretrained features while preserving norms/angles.
Steering use: yes, but current raw effect is weaker than PiSSA/DeLoRA.
Fork-plan fit: adapter-parameterization ablation: rotation-derived component vs residualized effective update.
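A minimal sketch of the Cayley parameterization, assuming a single dense rotation block acting on the input dimension (real OFT implementations typically use block-diagonal rotations). `skew` and `cayley` here are illustrative implementations of the helpers named above.

```python
import numpy as np

def skew(params: np.ndarray) -> np.ndarray:
    """Skew-symmetric matrix from an unconstrained square parameter matrix."""
    return params - params.T

def cayley(A_skew: np.ndarray) -> np.ndarray:
    """Cayley transform (I - A)^{-1} (I + A); orthogonal when A is skew-symmetric."""
    I = np.eye(A_skew.shape[0])
    return np.linalg.solve(I - A_skew, I + A_skew)

rng = np.random.default_rng(0)
d_out, d_in = 5, 4
W = rng.standard_normal((d_out, d_in))
R = cayley(skew(0.1 * rng.standard_normal((d_in, d_in))))
W_eff = W @ R.T
dW = W_eff - W
```

Because `R` is orthogonal, the row norms and the Gram matrix `W @ W.T` are unchanged, which is the "preserving norms/angles" property in the interpretation above; `dW` itself is generally full rank.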
### IA3 gates
Construction:
```py
if feedforward:
    y = W @ (x * λ)  # gate the inputs of a feed-forward target module
else:
    y = (W @ x) * λ  # gate the outputs of any other target module
Interpretation: adaptation is gain control over existing channels.
Steering use: weak in current daily-dilemmas results, but useful lower bound.
Fork-plan fit: adapter-parameterization ablation: attention-gate vs MLP-gate groups. Layer/module ablation if gates identify modules rather than full tensors.
### Shared cross-adapter `dW` SVD
Construction:
```py
M_l = concat_cols([dW_adapter_l for adapter in adapters])
B_shared_l_K = left_svd_basis(M_l, K)
keep = P_B @ dW_adapter_l
drop = dW_adapter_l - P_B @ dW_adapter_l
```
Interpretation: different adapter families discover the same causal residual-write subspace.
Steering use: not from scratch, but if shared `keep` steers across families, it is the main planning-subspace evidence.
Fork-plan fit: central row of cross-adapter causal `dW` basis ablation. A positive result needs `keep_B_shared_K32` to retain at least 0.7x of the behavior and `drop_B_shared_K32` to remove it across adapters.
Refinement from task arithmetic and MSRS: separate shared-across-adapter from shared-across-behavior. A basis can be adapter-family invariant but behavior-specific, or behavior-general but adapter-specific. Use two axes: `B_shared_adapter(behavior)` and `B_shared_behavior(adapter)`.
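The shared keep/drop split can be sketched with NumPy as follows. The toy adapters are constructed to share an exact rank-2 write subspace, which is a best case chosen for illustration; the function names are assumptions.

```python
import numpy as np

def shared_basis(dWs, K: int) -> np.ndarray:
    """Top-K left singular vectors of the column-concatenated adapter deltas."""
    M = np.concatenate(dWs, axis=1)
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return U[:, :K]

def keep_drop(dW: np.ndarray, B: np.ndarray):
    """Project dW onto span(B) (keep) and onto its orthogonal complement (drop)."""
    keep = B @ (B.T @ dW)
    return keep, dW - keep

rng = np.random.default_rng(0)
d, n, K = 10, 7, 2
# Hypothetical adapters that write into the same 2-dimensional subspace.
B_true, _ = np.linalg.qr(rng.standard_normal((d, K)))
dW_a = B_true @ rng.standard_normal((K, n))
dW_b = B_true @ rng.standard_normal((K, n))

B_shared = shared_basis([dW_a, dW_b], K)
keep_a, drop_a = keep_drop(dW_a, B_shared)
```

Concatenation lets the adapter with the largest norm dominate the shared basis, so normalizing each `dW` before stacking may be worth comparing.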
### Per-adapter top and tail SVD
Construction:
```py
U, S, Vh = svd(dW_adapter_l)
dW_topK = U[:, :K] @ diag(S[:K]) @ Vh[:K]
dW_tail = U[:, K:] @ diag(S[K:]) @ Vh[K:]
```
Interpretation: behavior may be concentrated in each adapter's own top singular directions, even if not shared across adapters.
Steering use: yes as a distilled trained adapter.
Fork-plan fit: cross-adapter causal `dW` basis ablation and adapter-parameterization ablation. If per-adapter top keeps behavior better than shared SVD, this supports basin divergence.
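A minimal NumPy sketch of the top/tail split, using a random rank-3 `dW` as a stand-in for a LoRA update. It also shows why this lens is uninformative on its own for low-rank adapters: once `K` reaches the adapter rank, the tail is exactly zero.

```python
import numpy as np

def top_tail_split(dW: np.ndarray, K: int):
    """Split dW into its top-K singular components and the remaining tail."""
    U, S, Vh = np.linalg.svd(dW, full_matrices=False)
    top = U[:, :K] @ np.diag(S[:K]) @ Vh[:K]
    tail = U[:, K:] @ np.diag(S[K:]) @ Vh[K:]
    return top, tail, S

rng = np.random.default_rng(0)
d_out, d_in, r = 12, 10, 3
# Hypothetical rank-3 LoRA-style update.
dW = rng.standard_normal((d_out, r)) @ rng.standard_normal((r, d_in))

top1, tail1, S = top_tail_split(dW, K=1)
top_r, tail_r, _ = top_tail_split(dW, K=r)
energy_top1 = S[0] ** 2 / np.sum(S ** 2)   # fraction of Frobenius energy in the top component
```

A useful ablation therefore needs `K` below the adapter rank, or an adapter family (such as OFT) whose effective `dW` is not low-rank by construction.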
### S-space quartiles and energy groups
Construction:
```py
U0, S0, V0h = svd(W_base)
dS = U0.T @ dW @ V0h.T
component = crop(dS, rows_or_cols_or_energy_group)
dW_component = U0 @ component @ V0h
residual = dW - dW_component
```
Interpretation: the trained update may be simple in the pretrained weight's singular-vector coordinate system even when it is not simple in raw weight space.
Steering use: yes if a crop keeps behavior and the residual loses it.
Fork-plan fit: adapter-parameterization causal ablation. Required rows already include `top_25pct_S`, `mid_50pct_S`, `bottom_25pct_S`, `top_50pct_energy_S`, `top_90pct_energy_S`, and residuals.
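A sketch of an S-space row crop in NumPy, assuming a square `W_base` so the singular-vector bases are complete (for rectangular weights the reduced SVD drops part of `dW`, and the residual absorbs it). `s_space_crop` and the slice-based crop are illustrative; the repo's `crop` may select differently.

```python
import numpy as np

def s_space_crop(dW: np.ndarray, W_base: np.ndarray, rows: slice):
    """Keep only the selected rows of dW expressed in W_base's singular-vector coordinates."""
    U0, S0, V0h = np.linalg.svd(W_base, full_matrices=False)
    dS = U0.T @ dW @ V0h.T
    component = np.zeros_like(dS)
    component[rows] = dS[rows]             # rows index W_base's left singular directions
    dW_component = U0 @ component @ V0h
    return dW_component, dW - dW_component

rng = np.random.default_rng(0)
d = 8
W_base = rng.standard_normal((d, d))
dW = 0.1 * rng.standard_normal((d, d))     # hypothetical trained update

top, top_residual = s_space_crop(dW, W_base, slice(0, d // 4))    # top 25% of directions
rest, _ = s_space_crop(dW, W_base, slice(d // 4, d))
```

Each crop and its residual sum to `dW`, so behavior retained by the crop and lost by the residual can be attributed to that group of pretrained singular directions.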
### Residual-write projection and complement into activation basis
Construction:
```py
B_act_l = act_oracle_block_basis(l, K)
dW_project = P_B_act_l @ dW_residual_write_l
dW_complement = dW_residual_write_l - dW_project
dW_project_normmatched = dW_project * (||dW_residual_write_l|| / ||dW_project||)
```
Interpretation: distinguishes whether low geometric overlap hides a load-bearing small component. v10 result: for DeLoRA, raw projection keeps little behavior and complement keeps most. This means block-local activation PCA is not the trained-scale carrier for DeLoRA residual-write behavior; it does not mean activation-PCA directions are useless for steering.
Steering use: projection can be a potent amplified steerer for PiSSA/OFT, but is not the trained-scale explanation for DeLoRA.
Fork-plan fit: already done as v10 projection falsifier. Future use as a sub-row under layer/module or cross-adapter if testing other bases.
### Layer and module localization
Construction:
```py
dW_variant = {k: v for k, v in dW.items() if layer(k) in layer_set and module(k) in module_set}
```
Interpretation: behavior is localized to modules or layers rather than to a geometric basis.
Steering use: yes if a small slice retains behavior.
Fork-plan fit: exact layer/module causal ablation. Required variants include `residual_write_only`, `attn_o_proj_only`, `mlp_down_proj_only`, `layers_8_21_only`, single-layer keep, leave-one-layer-out, early/mid/late, random controls, and zero.
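The filter above needs `layer(k)` and `module(k)` parsers. The sketch below assumes Hugging Face-style key names such as `model.layers.9.mlp.down_proj`, and it reads `layers_8_21_only` as layers 8 through 21 inclusive; both are assumptions, and the string values stand in for tensors.

```python
# Hypothetical mapping from variant name to module set.
MODULE_GROUPS = {
    "attn_o_proj_only": {"o_proj"},
    "mlp_down_proj_only": {"down_proj"},
    "residual_write_only": {"o_proj", "down_proj"},
}

def parse_key(key: str):
    """Return (layer_index, module_name) for a key like 'model.layers.9.mlp.down_proj'."""
    parts = key.split(".")
    layer = int(parts[parts.index("layers") + 1])
    module = next(p for p in parts if p.endswith("_proj"))
    return layer, module

def select_dw(dW: dict, layer_set, module_set) -> dict:
    """Keep only the deltas whose layer and module are both selected."""
    kept = {}
    for key, value in dW.items():
        layer, module = parse_key(key)
        if layer in layer_set and module in module_set:
            kept[key] = value
    return kept

dW = {
    "model.layers.3.self_attn.o_proj": "dW_3_o",
    "model.layers.9.self_attn.q_proj": "dW_9_q",
    "model.layers.9.mlp.down_proj": "dW_9_down",
    "model.layers.20.self_attn.o_proj": "dW_20_o",
}
variant = select_dw(dW, range(8, 22), MODULE_GROUPS["residual_write_only"])
```

Dropped keys are simply absent here; an ablation harness would need to treat a missing key as a zero delta rather than as an error.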
## Steering and causal-test verdict table
| hypothesis | from-scratch steering candidate? | trained-`dW` explanation candidate? | best causal test | current prior |
|---|---:|---:|---|---|
| Function-vector head basis | yes | possible for attention `o_proj` | head-output patch, `fv_heads_only/drop_fv_heads` | Strong prior that FV heads exist in ICL; sycophancy/honesty untested here. |
| Concept vs function route split | yes | possible | separate concept-head and FV-head interventions | Useful refinement of vague concept-space story. |
| ReFT-r1 baseline | yes | no | fair activation-steering baseline on identical DD rows | Stronger baseline than unsupervised PCA if labels allowed. |
| SAE output-score signed basis | maybe | unknown | signed decoder feature keep/drop with output-causal filtering | Only worth testing with output-score filter; raw SAE is weak prior. |
| MSRS shared/private basis | yes | possible | shared/private activation and `dW` split | Hypothesis generator; require frontier improvement to justify complexity. |
| Softmax information geometry | yes | possible for readout-facing layers | Fisher/softmax projection vs Euclidean projection | Projection metric variant for degradation control. |
| TaskDiff contrast | yes | weak | activation-steering baseline, then compare to `dW` on same DD rows | Useful baseline, persona may be wrong concept. |
| Suppressed | maybe | weak | project trained `dW` into suppressed basis and evaluate keep/drop | Interesting hidden-state prior, not yet a trained-scale explanation. |
| Stenographic | maybe | weak | activation steering or `P_steno dW` keep/drop | High-risk, rank-collapse likely. |
| Churn | maybe | weak | activation steering control or synthetic `dW'` | Broad dynamic prior, likely nonspecific. |
| Attention min/max/diff TaskDiff | yes | unknown | token-conditional activation steering, QK/OV causal routing | Good next test if last-token basis is too narrow. |
| Attention-output active subspace | yes | possible | `P_attn_active dW_o` vs complement | Good geometry control; steering causality untested here. |
| Gate-active written | yes | unknown | MLP gate/up/down ablation plus nonlinear gate-conditioned intervention | Important if MLP feature-space story wins. |
| CHaRS clusters | maybe | not as linear span | per-cluster translation causal test | Linear v9 score penalizes it; do not over-read negative result. |
| Rotation contrast | yes, as rotation | unknown | rotation intervention, OFT/AntiPaSTO-style ablation | Better fit to parameterization than linear keep/drop. |
| `lm_head_read` | yes control | unlikely | activation steering and `P_lm dW` keep/drop | v10 says LoRA layers are not directly Yes/No readable. |
| `logits_null` or weak readout | maybe | unlikely | weak-readout steering and coherence/degradation check | Could hide information, but direct output behavior may be weak. |
| Global read | weak | unlikely | synthetic `dW'` control | Too broad. |
| Global write | maybe | weak | synthetic `dW'` and module ablation | Plausible capacity basis, not behavior-specific. |
| Write-not-read | yes | possible | `P_wnr dW` vs complement, synthetic `dW'` | Best old A-side recipe, but v9/v10 make it only suggestive. |
| QK merged circuit | not directly | possible for q/k modules | q/k keep/drop, attention-score intervention | Fits attention-routing story, not residual-write PCA. |
| OV write | yes | possible | attention-only module ablation | Natural attention write test. |
| MLP roundtrip | yes | possible | MLP-only module ablation | If positive, story shifts to feature-space steering. |
| Gate kernel | yes | possible | gate-conditioned MLP causal test | More realistic than raw MLP roundtrip. |
| Attention sink | speculative | unknown | BOS/sink attention routing ablation | Needs separate causal test. |
| LoRA rank | yes | yes | rank component keep/drop | Baseline parameterization. |
| DoRA magnitude/direction | yes | yes | magnitude vs direction ablation | Current behavioral gain over LoRA small. |
| DeLoRA direction/strength | yes | yes | λ vs normalized direction, rank groups | Best raw steerer; high priority. |
| PiSSA SVD | yes | yes | S-space quartiles and energy crops | Clean stable baseline; high priority. |
| OFT rotation | yes | yes | rotation-derived component vs residual | Medium priority. |
| IA3 gates | weak | yes for gates | attention gate vs MLP gate | Useful lower bound. |
| Weight-space task arithmetic | yes | yes | sign, addition, analogy rows across behaviors/adapters | Strong adoption signal, but future until multiple behavior diffs exist. |
| Shared adapter SVD | no | yes | shared keep/drop across families | Central planning-subspace ablation. |
| Per-adapter top/tail SVD | no | yes | own top/tail keep/drop | Distinguishes shared core vs basin divergence. |
| S-space crops | no | yes | crop/residual reconstruction and behavior | Central adapter-parameterization ablation. |
| Act projection/complement | no | tests carrier | v10 projection/complement | Already mostly negative for DeLoRA as trained-scale explanation. |
## Recommended additions to `fork_plan.md`
The current plan is mostly right. I would add six explicit sub-rows rather than a new broad experiment:
1. Under layer/module ablation, include read-side module groups: `q_proj_only`, `k_proj_only`, `v_proj_only`, `attention_qkv_only`, `up_proj_only`, `gate_proj_only`, `mlp_up_gate_only`, and `combined_read_only`, because several hypotheses are read-side and v10 residual-write-only cannot test them.
2. Under synthetic `dW'`, add a small fixed list: `write_not_downstream_read`, `gate_kernel`, `OV_write`, and `TaskDiff_signed_write`. These are the cleanest A-side constructive candidates.
3. Under future causal tests, add `attention_routing_basis`: compare QK score intervention vs OV write intervention using the same DD row keys. This is where merged K/Q and attention min/max/diff belong.
4. Under activation baselines, add `ReFT_r1` and `function_vector_head_patch` as stronger external baselines than PCA-only TaskDiff.
5. Under cross-adapter `dW`, add `task_arithmetic_sign_and_sum` once at least two behavior diffs exist.
6. Under projection/complement tests, add a metric variant: Euclidean projection vs softmax/Fisher-metric projection.
## Interpretation discipline
Use these claim templates to avoid overclaiming:
- If `keep_B` retains behavior and `drop_B` removes it: `B` is a causal carrier of the trained adapter behavior under this intervention family.
- If both `keep_B` and `drop_B` retain behavior: the basis is non-identifying or behavior is distributed/redundant.
- If `keep_B` fails but normmatched `keep_B` steers: `B` is a potent steering target, not the trained-scale carrier.
- If synthetic `dW'` steers without trained adapter deltas: the basis is a constructive method candidate, not evidence that the trained adapter used it.
- If activation steering beats weight steering on identical DD rows: weight steering is mechanistic-interest first, method baseline second.
- If an attention-weighted basis scores well: report the selected token identities before claiming attention routing, because min/max/diff attention weights can select formatting artifacts.
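The first three templates can be encoded as a small decision helper so that result tables get consistent labels. This is a sketch under stated assumptions: inputs are fractions of the full adapter's behavioral effect, the `retain=0.7` threshold follows the shared-SVD row above, and the `remove=0.3` threshold is a hypothetical choice.

```python
def classify_readout(keep, drop, keep_normmatched=None, retain=0.7, remove=0.3):
    """Map keep/drop behavior fractions to one of the claim templates.

    keep, drop: behavior retained by keep_B(dW) and drop_B(dW), as fractions
    of the full adapter's effect. keep_normmatched: optional fraction for the
    norm-matched projection.
    """
    if keep >= retain and drop <= remove:
        return "causal carrier"
    if keep >= retain and drop >= retain:
        return "non-identifying or redundant"
    if keep < retain and keep_normmatched is not None and keep_normmatched >= retain:
        return "steering target, not trained-scale carrier"
    return "inconclusive"

labels = [
    classify_readout(0.9, 0.1),
    classify_readout(0.9, 0.85),
    classify_readout(0.2, 0.9, keep_normmatched=0.8),
    classify_readout(0.4, 0.5),
]
```

With a single seed the fractions carry no error bars, so an "inconclusive" band between the two thresholds is deliberate.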
@@ -0,0 +1,439 @@
Title: The Unreasonable Ineffectiveness of the Deeper Layers
URL Source: https://arxiv.org/html/2403.17887
Published Time: Tue, 04 Mar 2025 03:27:48 GMT
Markdown Content:
Andrey Gromov (Meta FAIR & UMD), Kushal Tirumala (Meta FAIR), Hassan Shapourian (Cisco), Paolo Glorioso (Zyphra), Daniel A. Roberts (MIT & Sequoia Capital)
Co-first authors; please direct correspondence to the union of {gromovand@meta.com, kushaltirumala99@gmail.com, drob@mit.edu}.
###### Abstract
How is knowledge stored in an LLM's weights? We study this via layer pruning: if removing a certain layer does not affect model performance in common question-answering benchmarks, then the weights in that layer are not necessary for storing the knowledge needed to answer those questions. To find these unnecessary parameters, we identify the optimal block of layers to prune by considering similarity across layers; then, to “heal” the damage, we perform a small amount of finetuning. Surprisingly, with this method we find minimal degradation of performance until after a large fraction (up to half) of the layers are removed for some common open-weight models. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge. For our study, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single 40GB A100 GPU.
1 Introduction
--------------
In this work we study a very simple pruning strategy using open-weight LLMs. In particular, we develop a method that uses the similarity between the representations at different layers to identify the optimal layers to prune for a given pruning fraction; then, after removing these layers we “heal” the pruning-induced mismatch with a small amount of fine tuning (using QLoRA). Our main result is that we can remove a substantial fraction of the _deepest layers_ from models with minimal degradation in downstream question-answering benchmarks. For example, for Llama-2-70B (Touvron et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib1)) we can eliminate up to roughly _half_ of the layers before the performance collapses. An overview of our strategy and the results of pruning Llama-2-70B are shown in Figure[1](https://arxiv.org/html/2403.17887v2#S1.F1 "Figure 1 ‣ 1 Introduction ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
![Image 1: Refer to caption](https://arxiv.org/html/2403.17887v2/x1.png)
Figure 1: Overview of our layer-pruning strategy and example results: _(a)_ a flowchart describing the algorithm: if removing $n$ layers, we find the layer, $\ell^{*}$, that minimizes the angular distance, $d$, between layers $\ell$ and $\ell+n$; we then remove the $n$ layers beginning with layer $\ell^{*}$; finally, if necessary, we can “heal” the damage with a small amount of (parameter-efficient) finetuning. _(b)_ a schematic depicting the removal of $n$ total layers, indexed from $\ell^{*}$ to $\ell^{*}+n-1$. _(c)_ angular distance, $d$, between different numbers of layers, $n$, vs. the layer number, $\ell$, that indexes the beginning of the block of $n$; the bottom curve (darkest purple) represents $n=1$, while the top curve (lightest yellow) represents $n=64$; the black line traces $\ell^{*}(n)$, the minimum of the angular distance across the different sized layer blocks.
_(d)_ results of pruning Llama-2-70B with healing (light blue) and without healing (dark blue) as a function of the fraction of layers removed: the top (middle) panel gives the accuracy on the MMLU (BoolQ) question-answering benchmark, while the bottom panel gives the autoregressive loss on a subset of the C4 validation set; here, the dashed red lines (dashed gray lines) indicate the accuracy or loss of the original unpruned model (of random guessing); these plots illustrate the typical behavior we find, in which there are sharp transitions in performance for the accuracy of question-answering tasks (here between 40%-50% pruning fraction), but continuity and very slow growth in the healed loss (light blue) up to at least an 80% pruning fraction.
Our intuition for dropping layers comes from considering the residual structure of the transformer architecture. In more detail, the output of the final layer can be decomposed as a sum over the outputs of all the model layers plus the embedded input. If such a sum had numerous and independent terms, then removing a handful of them should not significantly change the output. However, since the terms are not independent (each layer is input to the following layer), we should expect to be able to remove terms if the residual contribution from a particular layer is small; in other words, if the output of each layer does not change too much from layer to layer. [Footnote 1: This is strongly suggested by “lens” investigations that studied the evolution of the token distribution as a function of layer index such as the “logit lens” (nostalgebraist, [2020](https://arxiv.org/html/2403.17887v2#bib.bib2)) and the “tuned lens” (Belrose et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib3)). A separate line of reasoning along these lines previously inspired neural ODEs (Chen et al., [2018](https://arxiv.org/html/2403.17887v2#bib.bib4)), and led Yang et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib5)) to argue that ideally representation should change substantially from layer to layer in order to most effectively make use of the parameters of a network.]
In conjunction with our layer pruning, we investigate the similarity of layer representations at different separations and find broadly that deeper layers are qualitatively more similar to neighboring layers than shallow layers (with the exception of the very final layer). This suggests an even simpler pruning strategy: remove layers beginning at the penultimate layer and proceed from deep to shallow until the desired number of layers have been removed. In this case, we find that, after healing the damage with a small amount of QLoRA finetuning, we can achieve performance that nearly matches the more involved similarity-informed layer pruning strategy. The effectiveness of this method is evidence that LLMs might not properly leverage the parameters in the deeper layers of the network.
That said, while question-answering (QA) benchmarks such as MMLU and BoolQ are robust to a large amount of layer pruning, other measures of performance are not: if we look at the loss on next-token predictions for an IID dataset (C4 validation set), we find that the model is smoothly damaged in proportion to the fraction of layers pruned. Since perplexity typically correlates strongly with downstream metrics, this naturally raises the question: which tasks are less robust than QA benchmarks to pruning? As part of our final discussion, we explore reasoning-related tasks (GSM8k and HellaSwag) and see that they are harmed by any amount of pruning. Altogether, this leads to the following accounting of state: the shallow layers likely play a critical role in the storing of knowledge and retrieving of information, while the deeper layers are important for higher-level computations such as mathematical reasoning.
The structure of this paper is as follows. In §[2](https://arxiv.org/html/2403.17887v2#S2 "2 Literature Review ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we first perform a literature review of both practical post-training strategies and science-of-deep-learning investigations that motivate our work. Then, in §[3](https://arxiv.org/html/2403.17887v2#S3 "3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we give intuition for our layer pruning strategy and explain our method in detail, while in §[4](https://arxiv.org/html/2403.17887v2#S4 "4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers") we iterate over all our experimental results. Finally, we conclude in §[5](https://arxiv.org/html/2403.17887v2#S5 "5 Discussion and Future Directions ‣ The Unreasonable Ineffectiveness of the Deeper Layers") by exploring tasks beyond QA benchmarks, such as reasoning, and highlighting directions of future work. Specific model, finetuning, dataset, and evaluation details can be found in Appendix[B](https://arxiv.org/html/2403.17887v2#A2 "Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), and evaluation ablations can be found in Appendix[C](https://arxiv.org/html/2403.17887v2#A3 "Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
2 Literature Review
-------------------
Pruning for neural networks has a long history (LeCun et al., [1989](https://arxiv.org/html/2403.17887v2#bib.bib6), Hassibi and Stork, [1992](https://arxiv.org/html/2403.17887v2#bib.bib7)): while initial work focused on _unstructured pruning_ (Han et al., [2015](https://arxiv.org/html/2403.17887v2#bib.bib8), Chen et al., [2015](https://arxiv.org/html/2403.17887v2#bib.bib9), Srinivas and Babu, [2015](https://arxiv.org/html/2403.17887v2#bib.bib10)), _structured pruning_ techniques were developed to make sparse networks more efficient (Li et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib11), Wen et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib12), Hu et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib13), He et al., [2017](https://arxiv.org/html/2403.17887v2#bib.bib14), Huang et al., [2018](https://arxiv.org/html/2403.17887v2#bib.bib15), Murray and Chiang, [2015](https://arxiv.org/html/2403.17887v2#bib.bib16), See et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib17), Kim and Rush, [2016](https://arxiv.org/html/2403.17887v2#bib.bib18)).
Recent work, of course, focused on structured pruning of transformers (Voita et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib19), Michel et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib20), Kim and Awadalla, [2020](https://arxiv.org/html/2403.17887v2#bib.bib21), Fan et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib22), Zhang and He, [2020](https://arxiv.org/html/2403.17887v2#bib.bib23), Fan et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib24), Jha et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib25), Sajjad et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib26), Liu et al., [2023a](https://arxiv.org/html/2403.17887v2#bib.bib27), Hou et al., [2020](https://arxiv.org/html/2403.17887v2#bib.bib28), Sharma et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib29), Ashkboos et al., [2024](https://arxiv.org/html/2403.17887v2#bib.bib30), Xia et al., [2022](https://arxiv.org/html/2403.17887v2#bib.bib31), Lagunas et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib32), Men et al., [2024](https://arxiv.org/html/2403.17887v2#bib.bib33)). Our work focuses on pruning the layers of decoder-only GPT-style open-weight _large_ language models after they've been pretrained. For an extended literature review, please see Appendix[A](https://arxiv.org/html/2403.17887v2#A1 "Appendix A Extended Literature Review ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
3 Method
--------
In this section, we give intuition for why we think layer pruning works (§[3.1](https://arxiv.org/html/2403.17887v2#S3.SS1 "3.1 Intuition ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) and then we explain our method in detail (§[3.2](https://arxiv.org/html/2403.17887v2#S3.SS2 "3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")).
### 3.1 Intuition
Our intuition for layer dropping comes from thinking about the representations as a slowly changing function of layer index. In particular, the layer-to-layer evolution of representations for a transformer is given by a _residual_ iteration equation
$$x^{(\ell+1)} = x^{(\ell)} + f(x^{(\ell)}, \theta^{(\ell)})\,, \tag{1}$$
where $(x^{(\ell)}, \theta^{(\ell)})$, respectively, are the multi-dimensional input and parameter vectors for layer $\ell$, and $f(x, \theta)$ describes the transformation of one multi-head self-attention _and_ MLP layer block. As for any residual network, if we unroll this iteration, we see that after $L$ total layers the output is described as a sum over the transformations of all the layers
$$x^{(L)} = x^{(0)} + \sum_{\ell=0}^{L-1} f(x^{(\ell)}, \theta^{(\ell)})\,. \tag{2}$$
If the terms in the sum were _numerous_ ($L \gg 1$) and _independent_, e.g. if the block functions were instead a function of the overall input as $f(x^{(0)}, \theta^{(\ell)})$, then perhaps any particular contribution to the sum ([2](https://arxiv.org/html/2403.17887v2#S3.E2 "In 3.1 Intuition ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) could be neglected.
Of course, they are not at all independent: if we delete layer $\ell-1$, then we must now connect the old input to that layer, $x^{(\ell-1)}$, into the block function of layer $\ell$ as
$$x^{(\ell+1)} = x^{(\ell-1)} + f(x^{(\ell-1)}, \theta^{(\ell)})\,, \tag{3}$$
where, for clarity, we are not relabeling layers or inputs despite the deletion. In general, such a _mismatch_ between the original input and new input should be very damaging for the network. However, if, after some number of initial layers, the representations converge to a slowly changing function with respect to layer index,
$$x^{(\ell)} \approx x^{(\ell-1)} + \epsilon\,, \tag{4}$$
with $\epsilon \ll x^{(\ell)}$ in some appropriate sense, then the effect of deleting a particular layer $\ell$, e.g. making the replacement $x^{(\ell)} \to x^{(\ell-1)}$ in going from ([1](https://arxiv.org/html/2403.17887v2#S3.E1 "In 3.1 Intuition ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) to ([3](https://arxiv.org/html/2403.17887v2#S3.E3 "In 3.1 Intuition ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), should only change the representation in the subsequent layer, $x^{(\ell+1)}$, by a small amount. Similarly, to successfully prune the $n$ layers before layer $\ell$, i.e. those indexed from $\ell-n, \ldots, \ell-1$, we'd want that the input to the pruned block should be very similar to the output of the pruned block:
$$x^{(\ell)} \approx x^{(\ell-n)} + \epsilon\,. \tag{5}$$
Regardless, any layer removal has a cascading effect: since post pruning $x^{(\ell+1)}$ is computed by a different function than before, cf. ([1](https://arxiv.org/html/2403.17887v2#S3.E1 "In 3.1 Intuition ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) vs. ([3](https://arxiv.org/html/2403.17887v2#S3.E3 "In 3.1 Intuition ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), and since then $x^{(\ell+1)}$ is directly or indirectly input to subsequent layers, $\ell+2, \ldots, L$, deleting a shallow layer should have a much greater impact than deleting a deeper layer.
From this, we have the following hypotheses that we will test experimentally:
- _(0)_ We should be able to prune layers of a residual network.
- _(1)_ We should have greater success pruning deeper layers.
- _(2)_ Blocks of layers we successfully prune should have outputs that are similar to their inputs.
In the next subsection, §[3.2](https://arxiv.org/html/2403.17887v2#S3.SS2 "3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we will explain the details of our pruning algorithm and in the following section, §[4](https://arxiv.org/html/2403.17887v2#S4 "4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we will present experimental evidence for points _(0)-(2)_.
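
To make the intuition above concrete, here is a minimal toy sketch in plain Python. It is not the paper's experimental setup: the block function `block`, the hand-picked `thetas`, and the helper names are all hypothetical, chosen so that the first few layers change the representation strongly while the later ones satisfy the slowly-changing condition (4). The sketch checks the unrolled sum (2) numerically and compares the damage from deleting a shallow layer with the damage from deleting a deep one.

```python
import math

def block(x, theta):
    """Hypothetical toy block f(x, theta): a scaled elementwise tanh."""
    return [theta * math.tanh(v) for v in x]

def forward(x0, thetas, skip=None):
    """Iterate x^(l+1) = x^(l) + f(x^(l), theta^(l)), optionally deleting one layer."""
    x = list(x0)
    contributions = []
    for layer, theta in enumerate(thetas):
        if layer == skip:
            continue  # deleted layer: its input is fed straight into the next block
        delta = block(x, theta)
        contributions.append(delta)
        x = [a + b for a, b in zip(x, delta)]
    return x, contributions

def l2_distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

x0 = [0.5, -1.0, 2.0]
# Assumption: early layers have large updates, later layers have tiny ones.
thetas = [0.8, 0.6, 0.4] + [0.01] * 9

full, contributions = forward(x0, thetas)

# Equation (2): the output is the input plus the sum of all block outputs.
unrolled = list(x0)
for delta in contributions:
    unrolled = [a + b for a, b in zip(unrolled, delta)]

# Delete one shallow layer and, separately, one deep layer.
damage_shallow = l2_distance(full, forward(x0, thetas, skip=0)[0])
damage_deep = l2_distance(full, forward(x0, thetas, skip=10)[0])
print(damage_shallow, damage_deep)
```

In this toy the deep layer is cheap to delete both because its own contribution is small and because few layers remain downstream to propagate the mismatch; it does not separate those two effects.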
### 3.2 Layer-pruning algorithm(s)
Our principal layer pruning algorithm is very simple:
0. Pick a number of layers to prune $n$.
1. Compute the angular distance $d(x^{(\ell)}, x^{(\ell+n)})$, cf. ([7](https://arxiv.org/html/2403.17887v2#S3.E7 "In 3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) below, between the input to layer $\ell$ and the input to layer $\ell+n$ on a neutral pretraining dataset or on a dataset representative of a downstream task of interest.
2. Find the layer, $\ell^\star$, that minimizes that distance:
   $$\ell^\star(n) \equiv \operatorname*{arg\,min}_{\ell}\; d(x^{(\ell)}, x^{(\ell+n)})\,. \tag{6}$$
3. Drop layers $\ell^\star$ to $\ell^\star+n-1$; connect the old input to layer $\ell^\star$ to the old $(\ell^\star+n)$th layer block.²
4. (Optionally) heal the mismatch at layer $\ell^\star+n$ with a small amount of fine tuning on a neutral pretraining dataset or particular dataset of interest.

² Layers are often contained in a data structure, such as a ModuleList in _PyTorch_, so to drop these layers we would simply define a new ModuleList that removes the layers from $\ell^\star$ to $\ell^\star+n-1$.
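
Steps 2 and 3 of the list above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' code: `best_block_start`, `drop_block`, and the distance values are hypothetical, and the layers are stood in for by strings in an ordinary list indexed from 0.

```python
def best_block_start(distances):
    """Return the start index that minimizes the averaged distance, as in (6).

    Assumes distances[l] holds the dataset-averaged angular distance between
    the input to layer l and the input to layer l + n, for one fixed n.
    """
    return min(range(len(distances)), key=lambda l: distances[l])

def drop_block(layers, start, n):
    """Return a new list without layers start .. start + n - 1 (0-indexed)."""
    return layers[:start] + layers[start + n:]

# Hypothetical 8-layer model and block size n = 3.
# Valid starts are 0..5, so that start + n never exceeds the number of layers.
n = 3
layers = [f"block_{i}" for i in range(8)]
distances = [0.42, 0.35, 0.21, 0.12, 0.09, 0.30]  # made-up values

start = best_block_start(distances)
pruned = drop_block(layers, start, n)
print(start, pruned)
```

With real models the same slicing would be applied to whatever container holds the transformer blocks, and the result wrapped in a new container of the same type.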
If fewer words inside of a figure are more helpful to you than the text in an enumerated list, then note that this algorithm is also depicted in panels (a)-(b) of Figure[1](https://arxiv.org/html/2403.17887v2#S1.F1 "Figure 1 ‣ 1 Introduction ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
Elaborating on the first step, the angular distance on a single sequence of length $T$ is given by
$$d(x^{(\ell)}, x^{(\ell+n)}) \equiv \frac{1}{\pi} \arccos\left(\frac{x^{(\ell)}_T \cdot x^{(\ell+n)}_T}{\left\|x^{(\ell)}_T\right\| \left\|x^{(\ell+n)}_T\right\|}\right)\,, \tag{7}$$
where the inner product is over the hidden dimension of the model for the final token $T$ of the sequence, $\|\cdot\|$ denotes the $L^2$-norm, and the factor of $1/\pi$ is a convention.³ This distance should then be summed over a number of examples that is large enough to get a low-fluctuation estimate but overall should be quite small.
³ Two comments: _(i)_, we do not expect our choice of angular distance (in lieu of any other reasonable metric, e.g., cosine similarity) to be particularly significant; and _(ii)_, we chose to focus on the final token since, due to the causal attention mask, its embedding is the only one that depends on the entire sequence.
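
Equation (7) translates directly into code. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, the vectors are assumed to be nonzero final-token hidden states, and the clamp on the cosine is an added numerical safeguard that is not discussed in the text. The sketch averages over examples instead of summing; for a fixed number of examples the two differ only by a constant factor, so the minimizing layer is the same.

```python
import math

def angular_distance(u, v):
    """Equation (7): arccos of the cosine similarity, divided by pi."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Guard against rounding that pushes the cosine slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, cosine))
    return math.acos(cosine) / math.pi

def mean_angular_distance(pairs):
    """Average over examples; each pair holds the final-token vectors at layers l and l + n."""
    return sum(angular_distance(u, v) for u, v in pairs) / len(pairs)

print(angular_distance([1.0, 0.0], [2.0, 0.0]))   # parallel vectors
print(angular_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal vectors
print(angular_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite vectors
```

The factor of $1/\pi$ maps the distance onto $[0, 1]$: parallel vectors sit at 0, orthogonal vectors at 0.5, and opposite vectors at 1.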
Elaborating on the “optionality” of the final step, we find that the near-lack of performance degradation on question-answering benchmarks, cf. Figure[1](https://arxiv.org/html/2403.17887v2#S1.F1 "Figure 1 ‣ 1 Introduction ‣ The Unreasonable Ineffectiveness of the Deeper Layers")(d) and others in §[4.1](https://arxiv.org/html/2403.17887v2#S4.SS1 "4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), can be extended to greater pruning fractions with a small amount of finetuning. Depending on resource constraints and intended application of the pruned model, this may not be necessary. However, the healing procedure does have a substantial impact on perplexity, cf. Figure[1](https://arxiv.org/html/2403.17887v2#S1.F1 "Figure 1 ‣ 1 Introduction ‣ The Unreasonable Ineffectiveness of the Deeper Layers")(d) and others in §[4.2](https://arxiv.org/html/2403.17887v2#S4.SS2 "4.2 Loss on next-token predictions ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
For both the angular distance measuring and the healing, if the ultimate goal is to supervise finetune (SFT) a model for a downstream task, it could be useful to evaluate the distance of a sample from that dataset and then combine the healing process with the SFT. In contrast, for the greatest generality, it's most natural to measure distance and heal with a pretraining dataset that approximates the statistics under which the model was originally pretrained.
Finally, we also investigated an even simpler pruning strategy inspired by analyzing the angular distances across different model families: drop the deepest layers, excluding the final layer before the LLM head, and then (_non-optionally_) heal the damage. For complete clarity, this means that if we are pruning $n$ layers from an $L$-layer model, then we would remove layers $(L-n)$ to $(L-1)$, inclusive.
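
The index range of this simpler strategy is easy to get off by one, so a small sketch may help. It rests on an assumption about the text's convention: that the layers in "$(L-n)$ to $(L-1)$" are labeled $1, \ldots, L$, which is the reading under which the final layer survives. In a 0-indexed Python list those labels map to positions $L-n-1$ through $L-2$. The function names are hypothetical.

```python
def deepest_block_to_drop(num_layers, n):
    """0-indexed positions of the n deepest layers, keeping the final one.

    Assumes the text labels layers 1..L, so "layers (L - n) to (L - 1)"
    corresponds to list positions (L - n - 1) .. (L - 2).
    """
    if not 0 <= n < num_layers:
        raise ValueError("n must satisfy 0 <= n < num_layers")
    return list(range(num_layers - n - 1, num_layers - 1))

def apply_drop(layers, to_drop):
    """Return the layers whose positions are not in to_drop."""
    dropped = set(to_drop)
    return [layer for i, layer in enumerate(layers) if i not in dropped]

layers = [f"block_{i}" for i in range(8)]
to_drop = deepest_block_to_drop(len(layers), n=3)
kept = apply_drop(layers, to_drop)
print(to_drop, kept)
```

Because the positions depend only on $L$ and $n$, this selection needs no forward passes through the unpruned model.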
4 Results
---------
In this section, we demonstrate the effectiveness of our pruning strategy on different question-answering (QA) benchmarks and highlight a robust pruning-driven transition in performance (§[4.1](https://arxiv.org/html/2403.17887v2#S4.SS1 "4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), while, in contrast, we find that the autoregressive perplexities of the healed pruned models are continuous across their transition points (§[4.2](https://arxiv.org/html/2403.17887v2#S4.SS2 "4.2 Loss on next-token predictions ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")); then, after comparing the similarity statistics between different layers across model sizes and families (§[4.3](https://arxiv.org/html/2403.17887v2#S4.SS3 "4.3 Angular distances between representations ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), we contrast our principal similarity-informed pruning strategy with a simpler remove-the-deepest-layers strategy (§[4.4](https://arxiv.org/html/2403.17887v2#S4.SS4 "4.4 A simpler pruning strategy ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")).
For our experiments, we pruned a wide variety of large-scale LLMs from 2.7B to 70B parameters spanning 32 to 80 total unpruned layers. Specifically, we used models in the Llama-2 family (Touvron et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib1)), the Qwen family (Bai et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib34)), Mistral-7B (Jiang et al., [2023a](https://arxiv.org/html/2403.17887v2#bib.bib35)), and Phi-2 (Javaheripi and Bubeck, [2023](https://arxiv.org/html/2403.17887v2#bib.bib36)). For these models, we executed the “healing” step using QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib37)): our models were quantized to 4-bit precision and then finetuned, using QLoRA for efficient training, on either 164M or 328M tokens from the Colossal Clean Crawled Corpus (C4) (Raffel et al., [2020](https://arxiv.org/html/2403.17887v2#bib.bib38)), a common pretraining dataset. As a result, _each experiment of ours can be performed on a single 40GB A100 GPU_. For our QA evals, we used Massive Multitask Language Understanding (MMLU) (Hendrycks et al., [2020](https://arxiv.org/html/2403.17887v2#bib.bib39)), a common world-knowledge and problem solving benchmark, and BoolQ (Clark et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib40)), a common yes/no reading comprehension benchmark where the answer has to be inferred from the text itself. The specifics of our models, healing procedure, dataset choices, and evaluation details can be found across Appendix[B](https://arxiv.org/html/2403.17887v2#A2 "Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers"); ablations of different hyperparameter choices can be found across Appendix[C](https://arxiv.org/html/2403.17887v2#A3 "Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
### 4.1 Accuracy on QA benchmarks
Our first set of results are shown in Figure[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), where we plot 5-shot MMLU accuracy as a function of the fraction of layers removed: in the left panel we present the Llama-2 family, in the middle panel we present models from the Qwen family, and in the right panel we show Mistral-7B and Phi-2. In order to better compare models of different total number of layers, in these plots we opted to normalize the $x$-axis by the fraction of layers removed (rather than the absolute number of layers removed). Note that since MMLU contains multiple choice questions with four possible responses, the expected accuracy of random guessing is 25%.
![Image 2: Refer to caption](https://arxiv.org/html/2403.17887v2/x2.png)
Figure 2: MMLU accuracy (5-shot) vs. fraction of layers dropped for different model families. (_Left:_ Llama-2 family; _Middle:_ Qwen family; _Right:_ Mistral-7B and Phi-2.) The solid lines represent performance after dropping layers and healing, dotted lines show performance after dropping layers only (no healing), and the dashed gray line is the score for guessing randomly. For these models, healing leads to modest improvements, and performances are quite robust until 20%-55% pruning fractions, depending on model family and size, at which point they transition to random guessing.
Importantly, we see a characteristic flat region of robust performance followed by a sharp transition to random accuracy at a pruning fraction around 45%-55% for models in the Llama-2 family, 35% for Mistral 7B, 25% for Phi-2, and 20% for models from the Qwen family. This implies that the essential knowledge required to achieve a model's top score isn't removed by significant layer removal (even though the fraction can be quite large!) until eventually that knowledge is lost at a critical model-dependent threshold.⁴ Contrasting the curves with and without healing, we see that finetuning offers a modest improvement by better preserving the unpruned performance and pushing the phase transition to random guessing to slightly larger pruning fractions.
⁴ This effect is rather robust to choice of QA benchmark: in Figure[7](https://arxiv.org/html/2403.17887v2#A2.F7 "Figure 7 ‣ B.2 Evaluation details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers") we plot the average 0-shot BoolQ accuracy for our model families and observe analogous behavior.
Broadly we see that layer pruning is more robust for the larger and deeper models, e.g. Llama-2-13B and Llama-2-70B, which we hypothesize could be related to the fact that either the smaller models are more overtrained, making parameters less redundant, or that the deeper models can afford to lose more layers in an absolute sense. Also, the Qwen family is strange, a fact we will further elaborate on in §[4.3](https://arxiv.org/html/2403.17887v2#S4.SS3 "4.3 Angular distances between representations ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
### 4.2 Loss on next-token predictions
In this section, we look at the effect of layer pruning on the pretraining optimization objective (the cross-entropy loss of next-token prediction) when evaluated on a subset of the C4 validation dataset.⁵ In order to have a fair comparison across models with different sized vocabularies $V$, we normalize the loss by $\log V$, which corresponds to the loss of sampling tokens randomly with uniform probability. (See Appendix[B.2](https://arxiv.org/html/2403.17887v2#A2.SS2 "B.2 Evaluation details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers") for more details.)
⁵ We make sure that none of the validation data are seen during the healing stage.
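
The normalization by $\log V$ is a one-line computation, sketched here with hypothetical numbers. The sketch assumes the cross-entropy is measured in nats (natural logarithm), so that a model sampling uniformly from the vocabulary scores exactly 1; the loss value and vocabulary size below are illustrative and not taken from the paper.

```python
import math

def normalized_loss(cross_entropy_nats, vocab_size):
    """Cross-entropy in nats divided by log V, the loss of uniform random sampling."""
    return cross_entropy_nats / math.log(vocab_size)

# Hypothetical values, for illustration only.
example = normalized_loss(2.0, 32_000)
uniform = normalized_loss(math.log(32_000), 32_000)
print(example, uniform)
```

On this scale, a value of 1 plays the same role for the loss that 25% accuracy plays for MMLU: it is the random-guessing baseline that pruned models approach when they fail.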
In Figure[3](https://arxiv.org/html/2403.17887v2#S4.F3 "Figure 3 ‣ 4.2 Loss on next-token predictions ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we plot the normalized C4 validation loss for all seven of our models, before healing (left panel) and after healing (right panel), as a function of the fraction of layers removed. Without healing, we see that there is a somewhat sharp(ish) transition to random guessing for each model at approximately the pruning fraction that the QA benchmark accuracies also sharply transition to random guessing, suggesting that models are hopelessly harmed at this point, cf. Figure[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"). Next, contrasting the scales of both plots, we see that healing significantly restores the next-token prediction ability of all the models to near-unpruned levels, with the loss increasing slowly and linearly with layer dropping. Most striking from a scientific perspective is the post-healing continuity through the pruning fractions where we previously found sharp transitions for the QA benchmarks: this decoupling illustrates one way of disconnecting (or creating a miscalibration) between performance on downstream tasks such as MMLU and BoolQ and continuous measures of performance such as the cross-entropy loss.⁶
⁶ This is consistent with Schaeffer et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib41)), who argued that jumps in one kind of metric may not be visible in others.
![Image 3: Refer to caption](https://arxiv.org/html/2403.17887v2/x3.png)
Figure 3: Normalized C4 validation loss vs. fraction of layers dropped before healing (_left_) and after healing (_right_); each curve is normalized by the cross-entropy loss of sampling uniformly from the model's vocabulary. For the experiments before healing, the loss for each model transitions to random guessing (gray dashed line) at approximately the same pruning fractions that the QA benchmarks transition to random guessing; after healing, there is continuity through the regions of sharp transition on QA tasks, cf. Figure[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"). Contrasting the overall scale of both plots, it's clear that healing significantly restores the performance on next-token prediction to near-unpruned levels.
### 4.3 Angular distances between representations
Given the central role the angular distance ([7](https://arxiv.org/html/2403.17887v2#S3.E7 "In 3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) plays in our pruning strategy, let's take a subsection to look at these distances across our seven models. For this analysis, the angular distances for each model were averaged over 10k samples from the C4 validation set.
Recall from earlier Figure[1](https://arxiv.org/html/2403.17887v2#S1.F1 "Figure 1 ‣ 1 Introduction ‣ The Unreasonable Ineffectiveness of the Deeper Layers")(c): for Llama-2-70B this plotted the angular distance $d(x^{(\ell)}, x^{(\ell+n)})$ that compared the $\ell$-th layer to the $(\ell+n)$-th layer, across all initial indexes $\ell$ for block sizes from $n=1$ to $n=64$; the minimum of the curves, $\ell^\star(n)$, gave the optimal block to prune for a given $n$, cf. ([6](https://arxiv.org/html/2403.17887v2#S3.E6 "In item 2 ‣ 3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")).
A more compact way to display this same data is shown in the heat maps of Figure[4](https://arxiv.org/html/2403.17887v2#S4.F4 "Figure 4 ‣ 4.3 Angular distances between representations ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"): each square is colored to depict the row-normalized angular distance between layer $\ell$ and $\ell+n$ across all possible $\ell$, and $n$ up to very large fractions of the total number of layers; the optimal layer to prune for a given block size, $\ell^\star(n)$, corresponds to the minimal distance in each row.
Across models, we make two generalizations: _(i)_ the smallest distances are found across the deeper blocks, meaning deeper layers are typically quite similar to each other and can be more easily dropped; _(ii)_ the distances across the deepest blocks (the blocks that include the last layer) take either maximal or nearly-maximal values, meaning one should never drop the final layer. While broadly true, there are a few exceptions. For some models, e.g. Phi-2-2.7B, or for the largest blocks in some models, e.g. Llama-2-7B, the final _few_ layers seem important. As previously noted, the Qwen family is somewhat unusual: here we see that there are a few odd “islands” of high similarity for shallow blocks; this likely explains the shorter region of robust performance in Figure[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
![Image 4: Refer to caption](https://arxiv.org/html/2403.17887v2/x4.png)
Figure 4: Normalized angular distance ([7](https://arxiv.org/html/2403.17887v2#S3.E7 "In 3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) from initial layer $\ell$ (x-axis) with block size $n$ (y-axis) for each of the seven models we evaluated; the distance for each $n$ is shifted and rescaled to span the same range, $[0,1]$ (yellow to purple): the optimal block to prune, $\ell^{*}(n)$, corresponds to the deepest yellow for each row. Across models, the deeper layers tend to be very similar, though the deepest blocks that include the final layer (squares along the outer diagonal) are (near-)maximally dissimilar.
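
As a rough illustration of how such a heat map and the per-row minimum $\ell^{*}(n)$ could be computed, the following sketch uses the angular distance $\frac{1}{\pi}\arccos$ of the cosine similarity. It assumes a single representation vector per layer (in practice one would average over many samples), and the function names `angular_distance`, `block_distances`, `row_normalize`, and `optimal_start` are hypothetical, not taken from the paper's code.

```python
import numpy as np

def angular_distance(a, b):
    """Angular distance (1/pi) * arccos(cosine similarity), a value in [0, 1]."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip to guard against round-off pushing |cos| slightly above 1.
    return float(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi)

def block_distances(hidden, max_block):
    """hidden[l] is the representation entering layer l, shape [num_states, dim].

    Returns dist with dist[n - 1, l] = d(x^(l), x^(l+n)); entries where
    l + n falls outside the model are left as NaN.
    """
    num_states = hidden.shape[0]
    if not 1 <= max_block < num_states:
        raise ValueError("max_block must satisfy 1 <= max_block < num_states")
    dist = np.full((max_block, num_states), np.nan)
    for n in range(1, max_block + 1):
        for l in range(num_states - n):
            dist[n - 1, l] = angular_distance(hidden[l], hidden[l + n])
    return dist

def row_normalize(dist):
    """Shift and rescale each row (fixed n) to span [0, 1], ignoring NaNs."""
    lo = np.nanmin(dist, axis=1, keepdims=True)
    hi = np.nanmax(dist, axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero
    return (dist - lo) / span

def optimal_start(dist):
    """For each block size n, the start index l with minimal distance."""
    return np.nanargmin(dist, axis=1)

# Toy usage with random stand-in representations (assumption: 8 layers, dim 16).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(9, 16))
dist = block_distances(hidden, max_block=4)
heatmap = row_normalize(dist)
best = optimal_start(dist)
```

Because the row normalization is a monotone rescaling, the minimizing index in each row is the same before and after normalization; the rescaling only matters for the color scale of the plot.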
### 4.4 A simpler pruning strategy
Inspired by our recent conclusions, we experiment with a very simple heuristic pruning strategy: _(1)_ if pruning $n$ layers from an $L$-layer model, drop layers $(L-n)$ to $(L-1)$ so as to remove the deepest block that excludes the final layer; then _(2)_ heal with a small amount of finetuning as before. Compared with our principal similarity-informed pruning strategy, this simpler heuristic algorithm has the advantage of never requiring practitioners to load the unpruned model onto a GPU or run inference with it. It also provides a meaningful ablation of the importance of optimizing the block to prune.
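
The index arithmetic of step _(1)_ can be sketched as follows. This is a minimal illustration, assuming layers are numbered $1$ to $L$ with layer $L$ being the final one (so that dropping $(L-n)$ through $(L-1)$ removes exactly $n$ layers and keeps the last); the helper names `simple_prune_indices` and `prune_layers` are hypothetical, and the healing step _(2)_ is not shown.

```python
def simple_prune_indices(num_layers, n):
    """1-indexed layers to drop: (L - n) through (L - 1), keeping layer L."""
    if not 0 <= n < num_layers:
        raise ValueError("n must satisfy 0 <= n < num_layers")
    first = num_layers - n
    last = num_layers - 1
    return list(range(first, last + 1))

def prune_layers(layers, n):
    """Return the layer sequence with the deepest non-final block of n removed."""
    drop = set(simple_prune_indices(len(layers), n))
    # enumerate from 1 so positions match the 1-indexed convention above
    return [layer for idx, layer in enumerate(layers, start=1) if idx not in drop]

# Toy usage: an 8-layer "model" whose layers are just labels.
kept = prune_layers(list("abcdefgh"), 3)
```

In a real model the same selection would be applied to whatever container holds the transformer blocks; no forward pass through the unpruned model is needed to choose the block.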
In Figure[5](https://arxiv.org/html/2403.17887v2#S4.F5 "Figure 5 ‣ 4.4 A simpler pruning strategy ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we contrast our two pruning strategies, both before healing (left panels) and after healing (right panels), for the QA benchmarks (MMLU/BoolQ, top/middle panels) and the autoregressive loss (C4 validation, bottom panels). On the one hand, the simple heuristic performs quite poorly without healing the damage incurred by pruning: accuracy on the QA benchmarks decays rapidly to (near-) random with increased pruning fraction, and the loss begins to increase very rapidly even with small amounts of pruning. On the other hand, the results for the two pruning strategies across evaluations are quite comparable after healing: for the QA benchmarks, the similarity-informed algorithm slightly better preserves the accuracy before the phase transition, though the simple algorithm perhaps pushes the phase transition to slightly greater pruning fractions; and for the loss, the curves nearly lie on top of each other, though the similarity-informed strategy does marginally outperform for all amounts of pruning. These experiments are strong evidence that the purpose of post-pruning finetuning is the healing of damage at the pruning interface and not the acquisition of additional knowledge.
![Image 5: Refer to caption](https://arxiv.org/html/2403.17887v2/x5.png)
Figure 5: Evaluation of Llama-2-70B with the simple pruning heuristic (solid red line), shown along with scores for the similarity-informed pruning strategy (solid blue line), scores of the unpruned Llama-2-70B (red dashed line), and scores for randomly guessing (gray dashed line). (_Left:_ before healing, _Right:_ after healing; _Top:_ MMLU, _Middle:_ BoolQ, _Bottom:_ C4 Validation Loss.) Without healing, the simple heuristic performs poorly across all evals; with healing, the scores of both methods are quite similar.
5 Discussion and Future Directions
----------------------------------
At the end of this work, many readers are puzzled by the following: are the deeper layers entirely useless? So far, we've provided evidence that the elimination of the deeper layers does not affect performance on QA tasks like MMLU (Figure[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), while at the same time we have shown that their removal does disrupt the next-token predictions of the underlying model (Figure[3](https://arxiv.org/html/2403.17887v2#S4.F3 "Figure 3 ‣ 4.2 Loss on next-token predictions ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")). Since perplexity often correlates with performance on downstream tasks, which are the tasks that are hurt by layer pruning?
Here are two hypotheses consistent with the fact that the model's perplexity is disturbed proportionally to pruning fraction:
* _(i)_ The deeper layers are not essential for storing knowledge, but are useful for more complicated computations, such as those that involve reasoning.
* _(ii)_ The deeper layers are necessary when the model has to generate many tokens before answering a question, such as when it produces a chain-of-thought (CoT).
We test these hypotheses by evaluating our layer-pruned models on tasks that involve CoTs or reasoning. For the former, we'll look at Chain-of-Thought MMLU (CoT-MMLU); for the latter, we'll look at GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib42)), a grade-school math benchmark, and HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib43)), a multiple choice common-sense reasoning benchmark (see footnote 7).

Footnote 7: Here are the details for how we performed these three evaluations:
* For CoT-MMLU, we followed the flan_cot_fewshot evaluation in EleutherAI (Gao et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib44)), in which models produce a chain of thought before generating their answer. Note that the accuracy at $0\%$ pruning fraction for MMLU without CoT is much better than the analogous accuracy at $0\%$ pruning fraction for CoT-MMLU ($\sim 69\%$ vs. $\sim 43\%$, respectively; cf. Figures[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers") and [6](https://arxiv.org/html/2403.17887v2#S5.F6 "Figure 6 ‣ 5 Discussion and Future Directions ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), consistent with some previous work (e.g., see Table 16 of Chung et al. ([2024](https://arxiv.org/html/2403.17887v2#bib.bib45))).
* For GSM8K, we used the gsm8k_cot evaluation in EleutherAI (Gao et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib44)) and measured pass@1; for each problem we extracted an answer from a single generation (with CoT) and checked for correctness against the ground-truth answer.
* For HellaSwag, we used the hellaswag evaluation in EleutherAI (Gao et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib44)). Note that HellaSwag is a multiple-choice benchmark, so random performance is 25%.
In Figure[6](https://arxiv.org/html/2403.17887v2#S5.F6 "Figure 6 ‣ 5 Discussion and Future Directions ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we plot the performance of Llama-2 70B pruned with the similarity-informed pruning strategy across CoT-MMLU (left), GSM8K (center), and HellaSwag (right): on the one hand, both GSM8K and HellaSwag, our two reasoning tasks, exhibit immediate degradation in performance with any amount of pruning, correlating with a similar decrease in the perplexity evals (Figure[3](https://arxiv.org/html/2403.17887v2#S4.F3 "Figure 3 ‣ 4.2 Loss on next-token predictions ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")); on the other hand, CoT-MMLU shows a relatively flat region of robust performance with pruning, analogous to our previous results on QA benchmarks (e.g. Figure[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")). This is some initial evidence for hypothesis _(i)_ over hypothesis _(ii)_: the deeper layers may be useful for higher-level reasoning tasks, while less important for knowledge intensive QA tasks; moreover, perplexity errors due to pruning do not compound to hurt QA evals when the model is required to generate many tokens.
![Image 6: Refer to caption](https://arxiv.org/html/2403.17887v2/x6.png)
Figure 6: Evaluation of Llama-2 70B with the similarity-informed pruning strategy across different evaluation tasks. (_Left:_ Chain-of-Thought MMLU (CoT-MMLU), _Center:_ GSM8K, _Right:_ HellaSwag.) We see that GSM8K and HellaSwag show immediate degradation of performance with any level of pruning, while CoT-MMLU behaves qualitatively similarly to MMLU without CoT; this suggests that the deeper layers are likely necessary for reasoning tasks.
Now at the conclusion of the work, we are left with the following questions:
* What are better layer-pruning strategies? What are better approaches to healing? (See footnote 8.)

  Footnote 8: At the cost of introducing another hyperparameter and requiring both pruned and unpruned models to fit in memory during finetuning, one natural way to improve healing is by adding an auxiliary student-teacher loss that explicitly addresses the pruning mismatch ([5](https://arxiv.org/html/2403.17887v2#S3.E5 "In 3.1 Intuition ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), such as

  $$\mathcal{L}_{\text{aux}} \sim \left(x^{(\ell^{*}+n)}(\theta_{0}) - x^{(\ell^{*})}(\theta)\right)^{2}, \tag{8}$$

  where $\theta_{0}$ are the frozen parameters of the unpruned model, and $\theta$ are the parameters of the pruned model to be healed; thus, $x^{(\ell^{*}+n)}(\theta_{0})$ is the input to the $(\ell^{*}+n)$-th layer in the unpruned model, $x^{(\ell^{*})}(\theta)$ is the input to that same layer after pruning, and $\mathcal{L}_{\text{aux}}$ minimizes their mismatch. We thank Sho Yaida for this observation.
* Why does healing eliminate the phase transition in the loss but not in the QA accuracies?
* With more comprehensive evals, will accuracy on different tasks degrade at different depths?
* Relatedly, is knowledge generally stored in shallow or middle layers, or is it delocalized?
* Can we devise a pruning strategy that is robust for reasoning tasks?
* Do pretraining details affect the ability to prune, e.g., are scaling-law over-trained or distilled models more difficult to prune?
* How can we enable LLMs to more effectively use the parameters in their deepest layers?
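
The auxiliary student-teacher loss suggested in the footnote to the first question above could be sketched as below. This is only an illustration under stated assumptions: the footnote gives the loss up to proportionality, so the mean over elements used here is one possible normalization, and the function name `aux_healing_loss` is hypothetical. The two inputs stand for the activation entering the same layer in the frozen unpruned model and in the pruned model being healed.

```python
import numpy as np

def aux_healing_loss(x_unpruned, x_pruned):
    """Mean squared mismatch between the input to a layer in the frozen
    unpruned model and the input to that same layer after pruning."""
    x_unpruned = np.asarray(x_unpruned, dtype=float)
    x_pruned = np.asarray(x_pruned, dtype=float)
    if x_unpruned.shape != x_pruned.shape:
        raise ValueError("activations must have the same shape")
    return float(np.mean((x_unpruned - x_pruned) ** 2))

# Toy usage with made-up activations.
loss = aux_healing_loss([1.0, 2.0], [0.0, 0.0])
```

In training, a term like this would be added to the usual language-modeling loss with a weighting coefficient, which is the extra hyperparameter the footnote mentions.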
Some of these questions would benefit from studying both layer similarity and pruning across different pretraining checkpoints; for instance, at what point does the sharp phase transition and critical depth in the QA accuracies emerge, and does more training lead to better use of the prunable parameters? Others suggest explorations with different pretraining architectures and objectives, e.g. in order to make better use of the deeper layers (for example, one can imagine applying layer dropout (Fan et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib22)) or early exit during pre-training (Elhoushi et al., [2024](https://arxiv.org/html/2403.17887v2#bib.bib46)) to induce equal usage of layers). With more comprehensive evaluations, if different kinds of QA tasks degrade at very different depths, then this might indicate that the knowledge required to complete those tasks is stored across different layers (see footnote 9). It would be very interesting to use pruning to systematically study these kinds of interpretability questions.

Footnote 9: Alternatively, one could measure $d(x^{(\ell)}, x^{(\ell+n)})$ or find $\ell^{*}(n)$ as a function of different eval datasets.
Acknowledgments and Disclosure of Funding
-----------------------------------------
We thank Aaron Schwartz for his initial collaboration, Aaditya Singh and Sho Yaida for discussions, and Aaditya Singh for comments on the draft. We would also like to acknowledge the 2023 NeurIPS Large Language Model Efficiency Challenge for initializing us for work on this project. A.G. is supported by the NSF CAREER grant DMR-2045181, the Sloan Foundation, and by the Laboratory for Physical Sciences through the Condensed Matter Theory Center. D.R. acknowledges support from the National Science Foundation under Cooperative Agreement PHY-2019786 (the NSF AI Institute for Artificial Intelligence and Fundamental Interactions, http://iaifi.org/) and appreciates both the sanction and support of Sequoia Capital. This paper has been brought to you residually by the letters $G$, $P$, and $U$, after summing over many layers.
References
----------
* Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. _arXiv preprint arXiv:2307.09288_, 2023.
* nostalgebraist (2020) nostalgebraist. interpreting gpt: the logit lens. [https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens), 2020.
* Belrose et al. (2023) Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. _arXiv preprint arXiv:2303.08112_, 2023.
* Chen et al. (2018) Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. _Advances in neural information processing systems_, 31, 2018.
* Yang et al. (2023) Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor programs vi: Feature learning in infinite-depth neural networks. _arXiv preprint arXiv:2310.02244_, 2023.
* LeCun et al. (1989) Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. In D.Touretzky, editor, _Advances in Neural Information Processing Systems_, volume 2. Morgan-Kaufmann, 1989.
* Hassibi and Stork (1992) Babak Hassibi and David Stork. Second order derivatives for network pruning: Optimal brain surgeon. In S.Hanson, J.Cowan, and C.Giles, editors, _Advances in Neural Information Processing Systems_, volume 5. Morgan-Kaufmann, 1992.
* Han et al. (2015) Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. _Advances in neural information processing systems_, 28, 2015.
* Chen et al. (2015) Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In _International conference on machine learning_, pages 2285–2294. PMLR, 2015.
* Srinivas and Babu (2015) Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks. _arXiv preprint arXiv:1507.06149_, 2015.
* Li et al. (2016) Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. _arXiv preprint arXiv:1608.08710_, 2016.
* Wen et al. (2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. _Advances in neural information processing systems_, 29, 2016.
* Hu et al. (2016) Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. _arXiv preprint arXiv:1607.03250_, 2016.
* He et al. (2017) Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In _Proceedings of the IEEE international conference on computer vision_, pages 1389–1397, 2017.
* Huang et al. (2018) Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q Weinberger. Condensenet: An efficient densenet using learned group convolutions. In _Proceedings of the IEEE conference on computer vision and pattern recognition_, pages 2752–2761, 2018.
* Murray and Chiang (2015) Kenton Murray and David Chiang. Auto-sizing neural networks: With applications to n-gram language models. _arXiv preprint arXiv:1508.05051_, 2015.
* See et al. (2016) Abigail See, Minh-Thang Luong, and Christopher D Manning. Compression of neural machine translation models via pruning. _arXiv preprint arXiv:1606.09274_, 2016.
* Kim and Rush (2016) Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. _arXiv preprint arXiv:1606.07947_, 2016.
* Voita et al. (2019) Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. _arXiv preprint arXiv:1905.09418_, 2019.
* Michel et al. (2019) Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? _Advances in neural information processing systems_, 32, 2019.
* Kim and Awadalla (2020) Young Jin Kim and Hany Hassan Awadalla. Fastformers: Highly efficient transformer models for natural language understanding. _arXiv preprint arXiv:2010.13382_, 2020.
* Fan et al. (2019) Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with structured dropout. _arXiv preprint arXiv:1909.11556_, 2019.
* Zhang and He (2020) Minjia Zhang and Yuxiong He. Accelerating training of transformer-based language models with progressive layer dropping. _Advances in Neural Information Processing Systems_, 33:14011–14023, 2020.
* Fan et al. (2021) Chun Fan, Jiwei Li, Xiang Ao, Fei Wu, Yuxian Meng, and Xiaofei Sun. Layer-wise model pruning based on mutual information. _arXiv preprint arXiv:2108.12594_, 2021.
* Jha et al. (2023) Ananya Harsh Jha, Dirk Groeneveld, Emma Strubell, and Iz Beltagy. Large language model distillation doesn't need a teacher. _arXiv preprint arXiv:2305.14864_, 2023.
* Sajjad et al. (2023) Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. On the effect of dropping layers of pre-trained transformer models. _Computer Speech & Language_, 77:101429, 2023.
* Liu et al. (2023a) Wei Liu, Zhiyuan Peng, and Tan Lee. Comflp: Correlation measure based fast search on asr layer pruning. _arXiv preprint arXiv:2309.11768_, 2023a.
* Hou et al. (2020) Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dynabert: Dynamic bert with adaptive width and depth. _Advances in Neural Information Processing Systems_, 33:9782–9793, 2020.
* Sharma et al. (2023) Pratyusha Sharma, Jordan T Ash, and Dipendra Misra. The truth is in there: Improving reasoning in language models with layer-selective rank reduction. _arXiv preprint arXiv:2312.13558_, 2023.
* Ashkboos et al. (2024) Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. Slicegpt: Compress large language models by deleting rows and columns. _arXiv preprint arXiv:2401.15024_, 2024.
* Xia et al. (2022) Mengzhou Xia, Zexuan Zhong, and Danqi Chen. Structured pruning learns compact and accurate models. _arXiv preprint arXiv:2204.00408_, 2022.
* Lagunas et al. (2021) François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Block pruning for faster transformers. _arXiv preprint arXiv:2109.04838_, 2021.
* Men et al. (2024) Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. _arXiv preprint arXiv:2403.03853_, 2024.
* Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. _arXiv preprint arXiv:2309.16609_, 2023.
* Jiang et al. (2023a) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. _arXiv preprint arXiv:2310.06825_, 2023a.
* Javaheripi and Bubeck (2023) Mojan Javaheripi and Sébastien Bubeck. Phi-2: The surprising power of small language models, Dec 2023.
* Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. _arXiv preprint arXiv:2305.14314_, 2023.
* Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. _The Journal of Machine Learning Research_, 21(1):5485–5551, 2020.
* Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. _arXiv preprint arXiv:2009.03300_, 2020.
* Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. _arXiv preprint arXiv:1905.10044_, 2019.
* Schaeffer et al. (2023) Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? _arXiv preprint arXiv:2304.15004_, 2023.
* Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. _arXiv preprint arXiv:2110.14168_, 2021.
* Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? _arXiv preprint arXiv:1905.07830_, 2019.
* Gao et al. (2023) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 12 2023. URL [https://zenodo.org/records/10256836](https://zenodo.org/records/10256836).
* Chung et al. (2024) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. _Journal of Machine Learning Research_, 25(70):1–53, 2024.
* Elhoushi et al. (2024) Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, et al. Layer skip: Enabling early exit inference and self-speculative decoding. _arXiv preprint arXiv:2404.16710_, 2024.
* Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. _Advances in neural information processing systems_, 30, 2017.
* Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. _arXiv preprint arXiv:1810.04805_, 2018.
* Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. URL [https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).
* Zhong et al. (2023) Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. Can chatgpt understand too? a comparative study on chatgpt and fine-tuned bert. _arXiv preprint arXiv:2302.10198_, 2023.
* Ethayarajh (2019) Kawin Ethayarajh. How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings. _arXiv preprint arXiv:1909.00512_, 2019.
* Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. _Advances in neural information processing systems_, 33:12449–12460, 2020.
* Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. _arXiv preprint arXiv:1503.02531_, 2015.
* Gu et al. (2023) Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Knowledge distillation of large language models. _arXiv preprint arXiv:2306.08543_, 2023.
* Jiao et al. (2019) Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. _arXiv preprint arXiv:1909.10351_, 2019.
* Wang et al. (2021) Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. Want to reduce labeling cost? gpt-3 can help. _arXiv preprint arXiv:2108.13487_, 2021.
* Eldan and Li (2023) Ronen Eldan and Yuanzhi Li. Tinystories: How small can language models be and still speak coherent english? _arXiv preprint arXiv:2305.07759_, 2023.
* Li et al. (2023a) Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report. _arXiv preprint arXiv:2309.05463_, 2023a.
* Gunasekar et al. (2023) Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. _arXiv preprint arXiv:2306.11644_, 2023.
* Fu et al. (2023) Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller language models towards multi-step reasoning. _arXiv preprint arXiv:2301.12726_, 2023.
* Hsieh et al. (2023) Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. _arXiv preprint arXiv:2305.02301_, 2023.
* Jiang et al. (2023b) Yuxin Jiang, Chunkit Chan, Mingyang Chen, and Wei Wang. Lion: Adversarial distillation of closed-source large language model. _arXiv preprint arXiv:2305.12870_, 2023b.
* Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. _arXiv preprint arXiv:2106.09685_, 2021.
* Li et al. (2023b) Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, and Tuo Zhao. Loftq: Lora-fine-tuning-aware quantization for large language models. _arXiv preprint arXiv:2310.08659_, 2023b.
* Zhang et al. (2023) Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. _arXiv preprint arXiv:2303.10512_, 2023.
* Leviathan et al. (2023) Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In _International Conference on Machine Learning_, pages 19274–19286. PMLR, 2023.
* Cai et al. (2024) Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. _arXiv preprint arXiv:2401.10774_, 2024.
* Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. _Advances in Neural Information Processing Systems_, 35:17359–17372, 2022.
* Dai et al. (2021) Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. _arXiv preprint arXiv:2104.08696_, 2021.
* Hase et al. (2023) Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models. _arXiv preprint arXiv:2301.04213_, 2023.
* Geva et al. (2023) Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. _arXiv preprint arXiv:2304.14767_, 2023.
* Din et al. (2023) Alexander Yom Din, Taelin Karidi, Leshem Choshen, and Mor Geva. Jump to conclusions: Short-cutting transformers with linear transformations. _arXiv preprint arXiv:2303.09435_, 2023.
* Gurnee and Tegmark (2023) Wes Gurnee and Max Tegmark. Language models represent space and time. _arXiv preprint arXiv:2310.02207_, 2023.
* Voita et al. (2023) Elena Voita, Javier Ferrando, and Christoforos Nalmpantis. Neurons in large language models: Dead, n-gram, positional. _arXiv preprint arXiv:2309.04827_, 2023.
* Liu et al. (2023b) Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual sparsity for efficient llms at inference time. In _International Conference on Machine Learning_, pages 22137–22176. PMLR, 2023b.
* Panigrahi et al. (2023) Abhishek Panigrahi, Nikunj Saunshi, Haoyu Zhao, and Sanjeev Arora. Task-specific skill localization in fine-tuned language models. _arXiv preprint arXiv:2302.06600_, 2023.
* Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In _Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations_, pages 38–45, Online, October 2020. Association for Computational Linguistics. URL [https://www.aclweb.org/anthology/2020.emnlp-demos.6](https://www.aclweb.org/anthology/2020.emnlp-demos.6).
* Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. _arXiv e-prints_, 2019.
* Mangrulkar et al. (2022) Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods. [https://github.com/huggingface/peft](https://github.com/huggingface/peft), 2022.
* Lee et al. (2023) Ariel N Lee, Cole J Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of llms. _arXiv preprint arXiv:2308.07317_, 2023.
* Dettmers et al. (2022) Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm. int8 (): 8-bit matrix multiplication for transformers at scale. _arXiv preprint arXiv:2208.07339_, 2022.
Appendix A Extended Literature Review
-------------------------------------
In this section, we review practical strategies for post-training efficiency and discuss some scientific investigations that provide motivation for, or insight into, our approach: in §[A.1](https://arxiv.org/html/2403.17887v2#A1.SS1 "A.1 Pruning ‣ Appendix A Extended Literature Review ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we first review the history of pruning and then discuss its modern application to LLMs; in §[A.2](https://arxiv.org/html/2403.17887v2#A1.SS2 "A.2 Model distillation ‣ Appendix A Extended Literature Review ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we contrast pruning with distillation, an alternative strategy for reducing the parameter count of LLMs; then in §[A.3](https://arxiv.org/html/2403.17887v2#A1.SS3 "A.3 Efficient finetuning and inference acceleration ‣ Appendix A Extended Literature Review ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we discuss the various practical methods for efficient finetuning and inference acceleration that can be used in conjunction with our pruning strategy; finally in §[A.4](https://arxiv.org/html/2403.17887v2#A1.SS4 "A.4 A breadth of depth-dependent studies ‣ Appendix A Extended Literature Review ‣ The Unreasonable Ineffectiveness of the Deeper Layers") we highlight some scientific investigations into some depth-dependent statistical properties of LLMs that are complementary to our results.
### A.1 Pruning
_Pruning_ is a method for reducing the size of a trained machine-learning model by removing unnecessary parameters, either individually or together as a group. Pruning for neural networks has a long history (LeCun et al., [1989](https://arxiv.org/html/2403.17887v2#bib.bib6), Hassibi and Stork, [1992](https://arxiv.org/html/2403.17887v2#bib.bib7)), and, as originally conceived, _unstructured pruning_ techniques sparsify networks by removing individual parameters based on pre-defined criteria. For instance, if a parameter of the model has a very small value, then removing it, i.e. setting it to exactly zero, will likely have minimal impact on performance. Inspired by this early work, modern researchers began exploring different criteria for such unstructured pruning, focusing mostly on computer vision models (Han et al., [2015](https://arxiv.org/html/2403.17887v2#bib.bib8), Chen et al., [2015](https://arxiv.org/html/2403.17887v2#bib.bib9), Srinivas and Babu, [2015](https://arxiv.org/html/2403.17887v2#bib.bib10)). In particular, Han et al. ([2015](https://arxiv.org/html/2403.17887v2#bib.bib8)) developed an _iterative pruning_ method that alternates between pruning and finetuning a network in order to reach better compression ratios and performance.
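To make the small-magnitude criterion concrete, here is a minimal sketch of unstructured magnitude pruning on a flat list of weights. The function name `magnitude_prune`, the `sparsity` argument, and the toy weights are illustrative assumptions; none of the cited works is claimed to use exactly this code.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of entries with the smallest |value|.

    Illustrative only: real pruning criteria and schedules differ by method.
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must lie in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0  # "removing" a parameter means setting it to exactly zero
    return pruned


# Toy weights (made up for illustration).
weights = [0.8, -0.01, 0.3, 0.002, -1.2, 0.05]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)
```

The surviving nonzero entries sit at arbitrary positions, which is the irregular sparsity pattern that motivates the structured alternatives discussed next.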
While these models were smaller, they were not necessarily more efficient: sparsifying networks by removing individual parameters according to a criterion leads to irregular or pseudorandom sparsification patterns that are difficult to accelerate without specialized hardware or libraries designed for sparsity (Li et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib11)). To that end, _structured pruning_ techniques were developed to remove irrelevant groups of parameters together, such as particular channels or filters in convolutional networks. As this increased their practical relevance, researchers then began exploring structured pruning across computer vision (Li et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib11), Wen et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib12), Hu et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib13), He et al., [2017](https://arxiv.org/html/2403.17887v2#bib.bib14), Huang et al., [2018](https://arxiv.org/html/2403.17887v2#bib.bib15)) and pre-transformer NLP architectures (Murray and Chiang, [2015](https://arxiv.org/html/2403.17887v2#bib.bib16), See et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib17), Kim and Rush, [2016](https://arxiv.org/html/2403.17887v2#bib.bib18)).
Following unprecedented progress in language modeling, recent work has focused on applying structured pruning methods to the Transformer (Vaswani et al., [2017](https://arxiv.org/html/2403.17887v2#bib.bib47)). These studies consider nearly every possible component of the model architecture for elimination, with methods ranging from dropping attention heads (Voita et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib19), Michel et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib20), Kim and Awadalla, [2020](https://arxiv.org/html/2403.17887v2#bib.bib21)), to dropping layers (Fan et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib22), Zhang and He, [2020](https://arxiv.org/html/2403.17887v2#bib.bib23), Fan et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib24), Jha et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib25), Sajjad et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib26), Liu et al., [2023a](https://arxiv.org/html/2403.17887v2#bib.bib27)), to pruning hidden states (Hou et al., [2020](https://arxiv.org/html/2403.17887v2#bib.bib28)), to rank reducing large weight matrices (Sharma et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib29)), replacing sparse weight matrices with smaller dense ones (Ashkboos et al., [2024](https://arxiv.org/html/2403.17887v2#bib.bib30)), to many combinations of the aforementioned groups (Xia et al., [2022](https://arxiv.org/html/2403.17887v2#bib.bib31), Lagunas et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib32)).
Of the prior work that also considers transformer layer dropping, most (Fan et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib22), Zhang and He, [2020](https://arxiv.org/html/2403.17887v2#bib.bib23), Fan et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib24), Xia et al., [2022](https://arxiv.org/html/2403.17887v2#bib.bib31), Sajjad et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib26)) study BERT-style models (Devlin et al., [2018](https://arxiv.org/html/2403.17887v2#bib.bib48)), while we consider decoder-only GPT-style models (Radford et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib49)) that are most commonly used for large-scale language modeling and generation. BERT-style models are naturally suited for understanding tasks due to their bidirectional masked language modeling (MLM) objective, while GPT-style models are instead suited for generation, due to their autoregressive objective. While this divide has been questioned in light of more powerful GPT-style models (Zhong et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib50)), previous work (Ethayarajh, [2019](https://arxiv.org/html/2403.17887v2#bib.bib51)) has found significant qualitative differences between BERT and GPT models in terms of the evolution of the layer-wise representation of words. Altogether, this suggests that layer-dropping strategies will behave differently between the two families.
One study for BERT-style pre-trained models, Sajjad et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib26)), concludes that the best layer-pruning strategy is dropping the final layers; this partially resonates with our results, although in contrast we find that _(a)_ for some pruning sizes keeping the last few layers of the model is actually beneficial, and that _(b)_ for all pruning sizes keeping the very last layer is essential. Additionally, while the authors also study similarity between representations in different layers, as in our approach, they actually found a higher similarity between representations in the shallow layers compared to the deeper ones, which very sharply disagrees with our results. Importantly, the models considered in Sajjad et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib26)) consist of a few hundred million parameters, which is much smaller than the model scales we consider in our work. Perhaps as a consequence, the authors didn't observe the sharp transition in downstream accuracies that we report in §[4.1](https://arxiv.org/html/2403.17887v2#S4.SS1 "4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), despite the fact that they also finetuned their pruned models.
In contrast, while Jha et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib25)) does consider GPT-style models, the methodology is quite different from ours: _(i)_ rather than pretraining first and then using a fixed layer-dropping strategy as we do, instead the authors incrementally drop layers in a modified pretraining procedure; and _(ii)_ the authors study their own sub-1B parameter models, while we focus on the families of readily available, open-weight, large-scale 2.7B-70B parameter models that are commonly used and/or finetuned for practical applications.
As we were finalizing our preprint, Men et al. ([2024](https://arxiv.org/html/2403.17887v2#bib.bib33)) was posted: this paper empirically studies different layer-pruning strategies for GPT-style models (Llama-2 7B and Baichuan2-7B-base) and their subsequent effects on benchmarks (MMLU, CMMLU, and CMNLI). They investigate various layer-importance metrics (notably, their "Block Influence" function is similar to our cosine similarity metric) and find that they are able to prune up to $\sim$28% of layers of Llama-2 7B with minimal impact on performance. This provides independent evidence supporting our main takeaway that the deeper layers are not critical for storing knowledge.
Finally, a systematic approach to layer dropping in transformers has also been studied in the context of _wav2vec_ models, which are encoder-only models that map speech to embeddings and are sized in the hundred-million parameter regime (Baevski et al., [2020](https://arxiv.org/html/2403.17887v2#bib.bib52)). With these models, Liu et al. ([2023a](https://arxiv.org/html/2403.17887v2#bib.bib27)) developed a layer-pruning algorithm based on the correlation between layers and downstream metrics. Beyond the model architecture and domain, one significant difference between this and our work is that Liu et al. ([2023a](https://arxiv.org/html/2403.17887v2#bib.bib27)) considered non-contiguous pruning proposals, e.g. dropping alternate layers. Our intuition for layer pruning predicts that this shouldn't work as well, at least for decoder-only language models, as it creates multiple mismatches, one with each block of layers removed.
### A.2 Model distillation
A completely different method for reducing the size of a trained machine-learning model is _model distillation_ (Hinton et al., [2015](https://arxiv.org/html/2403.17887v2#bib.bib53)), in which knowledge is transferred from a large “teacher” model to a smaller “student” model by training the student on the distribution predicted by the teacher. The essential insight is that this can transform the very general knowledge and capabilities of the teacher into more streamlined, compressed, and possibly skill-specific representations.
While a very general technique, in the setting of language models, distillation has been implemented with _(a)_ white-box approaches, in which the student is trained to imitate the teacher's logits (Gu et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib54)) or hidden states (Jiao et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib55)); as well as with _(b)_ black-box approaches, in which the student only has access to the output tokens generated by the teacher. This latter approach broadly covers cases where the student is trained on text that is augmented by the teacher in some way, such as by adding synthetic labels (Wang et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib56)), by generating high quality synthetic text (Eldan and Li, [2023](https://arxiv.org/html/2403.17887v2#bib.bib57), Li et al., [2023a](https://arxiv.org/html/2403.17887v2#bib.bib58), Gunasekar et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib59)), by providing chain of thought reasoning (Fu et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib60), Hsieh et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib61)), which aims to enhance the student's reasoning skills, or by annotating instructions that enhance the student's instruction-following capabilities (Jiang et al., [2023b](https://arxiv.org/html/2403.17887v2#bib.bib62)).
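As an illustration of the white-box case, the sketch below computes a temperature-scaled KL divergence between teacher and student next-token distributions for a single position. The choice of forward KL, the `temperature` parameter, and the toy logits are assumptions made for illustration; the cited works use a variety of objectives.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for one token position, in nats."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Toy logits over a 4-token vocabulary (made up for illustration).
teacher = [2.0, 0.5, -1.0, 0.0]
student = [1.5, 0.7, -0.5, 0.1]
print(distillation_kl(teacher, student))
print(distillation_kl(teacher, teacher))  # identical distributions give zero
```

In the black-box case the teacher's logits are unavailable, so the student would instead be trained with an ordinary next-token loss on teacher-generated text.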
Compared to layer pruning, these distillation methods require considerable computational resources due to the reliance on the large teacher to process a big corpus of data. Instead, our similarity-based pruning strategy only requires computing the similarity between representations at different layers on a small subset of a pretraining corpus, while our second simpler pruning strategy only uses the reduced model post pruning.
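The similarity computation mentioned above can be sketched as follows. We assume the similarity is measured as an angular distance derived from the cosine similarity between one representation vector per layer, and that the block of `n` layers with the smallest distance is the one selected for removal; in practice the distance would be averaged over many samples. The names `angular_distance` and `best_block_start` and the toy vectors are hypothetical.

```python
import math


def angular_distance(u, v):
    """arccos of the cosine similarity, scaled by 1/pi to lie in [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos) / math.pi


def best_block_start(layer_reps, n):
    """Return the layer index l minimizing the distance between the
    representations at layers l and l + n (assumed averaged over a
    small sample of text upstream of this function)."""
    candidates = range(len(layer_reps) - n)
    return min(
        candidates,
        key=lambda l: angular_distance(layer_reps[l], layer_reps[l + n]),
    )


# Toy 2-d "representations" for a 4-layer model (made up for illustration).
reps = [[1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [0.2, 1.0], [1.0, 1.0]]
start = best_block_start(reps, n=2)
print(start)  # layers start .. start + n - 1 would be dropped
```

This only needs forward passes on a small sample, which is the source of the cost advantage over distillation described above.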
### A.3 Efficient finetuning and inference acceleration
Complementary to directly reducing size of a model, _parameter-efficient finetuning_ (PEFT) focuses on reducing the cost of specializing LLMs to certain tasks. In particular, Low Rank Adapters (LoRA) reduce the memory and compute of fine tuning by freezing the pretrained model and introducing a parametrically small number of additional trainable weights (Hu et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib63)). We use its quantized cousin, QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib37)), to keep our experiments cost efficient. Other PEFT methods that can be combined with our work are Li et al. ([2023b](https://arxiv.org/html/2403.17887v2#bib.bib64)) and Zhang et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib65)): in the first, the initialization of the LoRA matrices is adjusted to a quantization scheme; in the second, LoRA ranks for different LLM modules are chosen in an adaptive manner.
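To see why the number of trainable weights is parametrically small, note that LoRA replaces a full update of a `d_out × d_in` matrix with two factors of shapes `d_out × r` and `r × d_in`. The sketch below counts parameters for a hypothetical projection shape; the dimensions are illustrative assumptions, not values taken from this paper.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter: A (rank x d_in) + B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical FFN projection shape, chosen only for illustration.
d_in, d_out = 4096, 11008
full = d_in * d_out                      # parameters in the frozen weight matrix
added = lora_param_count(d_in, d_out, rank=64)
print(added, full, added / full)
```

Since the adapter size scales linearly with the rank, lowering the rank shrinks the trainable parameter count proportionally.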
For additional efficiency gains we could combine our layer-pruned models with methods that further accelerate inference: with speculative decoding (Leviathan et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib66)), tokens are rapidly generated from a smaller draft model and then evaluated in parallel by the main model; with Medusa (Cai et al., [2024](https://arxiv.org/html/2403.17887v2#bib.bib67)) the draft model is discarded for extra decoding heads, but ultimately achieves a similar effect. In particular, it could be interesting to consider highly-compressed layer-pruned models as potential draft models in a speculative decoding setup.
### A.4 A breadth of depth-dependent studies
Finally, let us highlight some scientific works that study the depth-dependent properties of LLMs. One relevant direction considers how knowledge and linguistic properties are encoded in language models. On the one hand, Meng et al. ([2022](https://arxiv.org/html/2403.17887v2#bib.bib68)) and Dai et al. ([2021](https://arxiv.org/html/2403.17887v2#bib.bib69)) analyze the _storage and recall_ of factual associations: these works emphasize that knowledge localizes within the middle (Meng et al., [2022](https://arxiv.org/html/2403.17887v2#bib.bib68)) or final (Dai et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib69)) layers, which has implications for directly editing or erasing part of a model's factual knowledge. On the other hand, attempts to perform such editing give evidence that information may be stored non-locally across layers (Hase et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib70)). Relatedly, Geva et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib71)) investigates the way facts are _processed_ during inference, distinguishing between the role of attention heads, for attribute extraction, and the MLP blocks, for subject enrichment: both are delocalized across several layers.
Next, following the earlier “logit lens” (nostalgebraist, [2020](https://arxiv.org/html/2403.17887v2#bib.bib2)), Belrose et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib3)) invented a technique they called “tuned lens” to study the _trajectory of predictions_ by using a learnable affine transformation to convert intermediate representations into a distribution over tokens (see also Din et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib72))). By studying the layer-to-layer dynamics of this distribution, the authors noted that it tended to converge. This convergence is very suggestive that the deeper layers could be prunable, while the fact that they had to train an affine probe is likely related to our observation that the final layer cannot be pruned. Somewhat relatedly, Gurnee and Tegmark ([2023](https://arxiv.org/html/2403.17887v2#bib.bib73)) observed that geographic features in the underlying text can be determined from linear probes trained on intermediate activations, as long as the activations are deeper than halfway.
More abstractly, Voita et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib74)) and Liu et al. ([2023b](https://arxiv.org/html/2403.17887v2#bib.bib75)) found that the sparsity of activations transitions at around halfway through a network's forward pass, evolving from sparse to dense. Perhaps relatedly, Panigrahi et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib76)) investigated which model weights update the most during finetuning, finding that it's those in the mid-layers.
Altogether, these deep studies are complementary to our work, which, on the one hand, provides evidence that removing the deepest layers of an LLM does not significantly alter the model's performance, and, on the other hand, demonstrates a sharp pruning transition after removing approximately half of an LLM's deepest layers.
Appendix B Experimental Details
-------------------------------
Here we explain various details of models and healing (§[B.1](https://arxiv.org/html/2403.17887v2#A2.SS1 "B.1 Model and healing details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) and of evaluations (§[B.2](https://arxiv.org/html/2403.17887v2#A2.SS2 "B.2 Evaluation details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers")).
### B.1 Model and healing details
All models in this paper were fine-tuned using the Hugging Face Trainer API (Wolf et al., [2020](https://arxiv.org/html/2403.17887v2#bib.bib77)). A list of models and their paths on Hugging Face are as follows:
For healing, we used the version of the Colossal Clean Crawled Corpus (C4) (Raffel et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib78)) from Hugging Face: `data = load_dataset("c4", "en")`. We truncated long examples as described later in the paragraph and added special tokens when available. (N.B. the Qwen tokenizer from Hugging Face does not include any special tokens; in this case, it was essential to add a default padding token.) Models were finetuned for 5000 steps with a global batch size of 16: this corresponds to total finetuning tokens of $16 \times 5000 \times [\text{max\_seq\_length}]$ for each model. We used a cosine-annealed learning rate schedule, with a warmup of 100 steps. When possible, the peak learning rate was set to the peak learning rate from the model's pretraining; in practice, this means all models were trained with a peak LR of 3e-4, with the exceptions of Phi-2 (Javaheripi and Bubeck, [2023](https://arxiv.org/html/2403.17887v2#bib.bib36)), which was trained with a peak LR of 2e-4 during pre-training, Llama-2-70B, which was trained with a peak LR of 3e-5 (a value that resulted from a sweep), and Mistral-7B which was trained with a peak LR of 3e-6 (also a value that resulted from a sweep). All models 7B parameters or smaller were trained with a max sequence length of 2048 tokens, while all models 13B parameters or greater were trained with a max sequence length of 4096 tokens. While we realize that some models may have been pretrained on longer sequences, e.g. Qwen _-the-outlier_ (Bai et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib34)), we decided to keep the max sequence length consistent across models of similar size to allow fairer comparisons across model families.
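The token budget and learning-rate schedule described above can be worked out explicitly. The sketch below assumes a linear warmup and a cosine decay to zero over the remaining steps; the text only states "cosine-annealed" with a 100-step warmup, so both of those details, as well as the function names, are assumptions. The token count is an upper bound if some sequences are shorter than the maximum length.

```python
import math


def total_tokens(batch_size, steps, max_seq_length):
    """Upper bound on finetuning tokens: batch size x steps x max sequence length."""
    return batch_size * steps * max_seq_length


def lr_at(step, peak_lr, warmup_steps=100, total_steps=5000):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


tokens_2048 = total_tokens(16, 5000, 2048)  # models at 7B parameters or smaller
tokens_4096 = total_tokens(16, 5000, 4096)  # models at 13B parameters or greater
print(tokens_2048, tokens_4096)
print(lr_at(0, 3e-4), lr_at(100, 3e-4), lr_at(5000, 3e-4))
```

So healing uses roughly 164M tokens at a sequence length of 2048 and roughly 328M tokens at 4096.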
On top of the Hugging Face Trainer API, we used quantization and Low-Rank Adapters (LoRA) (Hu et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib63)) for all of our finetuning:
* For quantization, we used the bitsandbytes library for QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib37)) to quantize our models to 4 bits.
* For LoRA, we used the Hugging Face peft library (Mangrulkar et al., [2022](https://arxiv.org/html/2403.17887v2#bib.bib79)). We set the LoRA dropout to 0.05 and kept the LoRA $\alpha$ equivalent to the LoRA rank, following Lee et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib80)). Aside from two exceptions, discussed below, models are trained with LoRA rank 64.
* Also following Lee et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib80)), we only applied LoRA to FFN modules: `["gate_proj", "down_proj", "up_proj"]` for Llama-2 and Mistral models, `["fc1", "fc2"]` for Phi-2, and `["w1", "w2", "c_proj"]` for Qwen models.
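The LoRA settings in the list above can be collected into a small helper. This is a sketch rather than our actual training script: the dictionary keys are chosen to match the `r`, `lora_alpha`, `lora_dropout`, and `target_modules` arguments of `peft.LoraConfig`, while the helper name `lora_kwargs` and the family labels are hypothetical.

```python
# FFN module names per model family, as listed above.
FFN_TARGET_MODULES = {
    "llama-2": ["gate_proj", "down_proj", "up_proj"],
    "mistral": ["gate_proj", "down_proj", "up_proj"],
    "phi-2": ["fc1", "fc2"],
    "qwen": ["w1", "w2", "c_proj"],
}


def lora_kwargs(family, rank=64):
    """Keyword arguments for a LoRA config: alpha tied to rank, dropout 0.05."""
    return {
        "r": rank,
        "lora_alpha": rank,       # LoRA alpha kept equal to the LoRA rank
        "lora_dropout": 0.05,
        "target_modules": FFN_TARGET_MODULES[family],
    }


print(lora_kwargs("qwen"))
print(lora_kwargs("mistral", rank=4))
```

With peft installed, these could be passed along as `peft.LoraConfig(**lora_kwargs("qwen"))`.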
The large majority of these hyperparameter choices are standard and found in previous works, e.g. Lee et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib80)) and Dettmers et al. ([2022](https://arxiv.org/html/2403.17887v2#bib.bib81)). For absolute clarity, we list all the model-specific architecture and healing details below:
We also have the following hyperparameters common between all models:
### B.2 Evaluation details
We performed three principal evaluations: accuracy on _MMLU_, accuracy on _BoolQ_, and loss on _C4_.
For MMLU accuracy:
* We use the `cais/mmlu` version of the dataset from Hugging Face.
* We follow the formatting suggested in the original reference (Hendrycks et al., [2020](https://arxiv.org/html/2403.17887v2#bib.bib39)) without further prompt engineering.
* For constructing few-shot examples, we use the `dev` set from `cais/mmlu`.
* For our experiments, we use 0 few-shot examples; our results and analysis are robust to this choice, cf. Figure[8](https://arxiv.org/html/2403.17887v2#A3.F8 "Figure 8 ‣ C.1 Prompting ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
* We report average accuracy across all subjects.
For BoolQ accuracy:
* We used the `hassansh/boolq_n_shot` version from Hugging Face.
* For our experiments, we use 0 few-shot examples.
* The complete BoolQ results truncated from the main text are shown here in Figure[7](https://arxiv.org/html/2403.17887v2#A2.F7 "Figure 7 ‣ B.2 Evaluation details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers"): in the left panel we present the Llama-2 family, in the middle panel we present models from the Qwen family, and in the right panel we show Mistral-7B and Phi-2; we also make the experiments without healing semi-transparent in order to better display the results from the complete similarity-informed pruning method. Importantly, while we see here that healing plays a more important role than it did for MMLU in Figure[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), after healing we still have a characteristic flat region of robust performance; as before, the capabilities required to achieve a model's top score aren't removed by significant layer pruning until a critical model-dependent threshold.
![Image 7: Refer to caption](https://arxiv.org/html/2403.17887v2/x7.png)
Figure 7: BoolQ accuracy (0-shot) vs. fraction of layers dropped for different model families. (_Left:_ Llama-2 family; _Middle:_ Qwen family; _Right:_ Mistral-7B and Phi-2.) The solid lines represent performance after dropping layers and healing, the (semi-transparent) dotted lines show performance after dropping layers only (no healing), and the dashed gray line is the score for guessing randomly. For BoolQ, healing leads to important improvements in performance; then, across all models, performance is quite robust until 20%-55% pruning fractions, depending on model family and size, at which point it transitions to random guessing.
For C4 Validation Loss:
* We used the `c4` version from Hugging Face (soon to be deprecated in favor of `allenai/c4`).
* We evaluated using the _validation_ split as we healed with the train split.
* Given its size, we randomly sampled 60k sequences and held them fixed across all models.
* In Figure[3](https://arxiv.org/html/2403.17887v2#S4.F3 "Figure 3 ‣ 4.2 Loss on next-token predictions ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers") we normalized the loss to facilitate fair comparison across model families that employ different vocab sizes: to normalize, we divided by $\log V$, where $V$ is the _per-model_ vocab size (listed in a table in §[B.1](https://arxiv.org/html/2403.17887v2#A2.SS1 "B.1 Model and healing details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers")). This quantity, $\log V$, corresponds to the loss of sampling tokens uniformly, which naturally sets the scale for a given model.
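The normalization in the last item can be written in a couple of lines. The vocab size of 32000 and the raw loss of 2.1 nats used below are placeholder values for illustration, not the per-model values from the table in §B.1 or a measured result.

```python
import math


def normalized_loss(loss_nats, vocab_size):
    """Cross-entropy loss (in nats) divided by log V, the loss of uniform sampling."""
    return loss_nats / math.log(vocab_size)


V = 32000  # hypothetical vocab size, for illustration only
uniform = math.log(V)
print(normalized_loss(uniform, V))  # uniform guessing maps to 1.0
print(normalized_loss(2.1, V))      # a made-up raw loss of 2.1 nats
```

On this scale, a normalized loss of 1 means the model does no better than sampling tokens uniformly, regardless of vocab size.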
Appendix C Ablations
--------------------
Here we detail various ablations: prompting (§[C.1](https://arxiv.org/html/2403.17887v2#A3.SS1 "C.1 Prompting ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), finetuning seed (§[C.2](https://arxiv.org/html/2403.17887v2#A3.SS2 "C.2 Finetuning seed ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), LoRA rank (§[C.3](https://arxiv.org/html/2403.17887v2#A3.SS3 "C.3 LoRA rank ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), other pruning strategies (§[C.4](https://arxiv.org/html/2403.17887v2#A3.SS4 "C.4 Other pruning strategies ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers")). Qualitatively, the results of the paper are quite robust to the variation of any of these.
### C.1 Prompting
It's common knowledge that altering the prompt on QA evaluations can significantly impact results. To control for prompting, we ablate the MMLU accuracy for our principal similarity-informed pruning described in §[3.2](https://arxiv.org/html/2403.17887v2#S3.SS2 "3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers") when applied to Llama-2-13B: in the left panel of Figure[8](https://arxiv.org/html/2403.17887v2#A3.F8 "Figure 8 ‣ C.1 Prompting ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we show results for changing the ordering of the few-shot examples in the prompt, and in the right panel of the same figure, we show results for changing the number of few-shot examples. Broadly we see that the layer-pruning method is robust to these changes.
![Image 8: Refer to caption](https://arxiv.org/html/2403.17887v2/x8.png)
Figure 8: Effect of prompt ablations on MMLU accuracy vs. fraction of layers dropped for Llama-2-13B. _Left:_ We vary the ordering of the few-shot examples and see it does not have any impact. _Right:_ We vary the number $n$ of few-shot examples; while careful study of the flat region suggests increasing the number of few-shot examples marginally improves performance, regardless, the layer-pruning strategy is robust to this kind of variation.
### C.2 Finetuning seed
Here we vary the finetuning seed. For all of our experiments, we use the following code snippet to ensure reproducibility:
```python
import transformers

SEED_VAL = 0
transformers.enable_full_determinism(SEED_VAL)
```
Since we begin with a pretrained model, the finetuning seed doesn't affect initialization, but it will impact the stochastic aspects of further training such as data order. To control for this, we ablate the finetuning seed for our principal similarity-informed pruning described in §[3.2](https://arxiv.org/html/2403.17887v2#S3.SS2 "3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers") when applied to Llama-2-13B: in Figure[9](https://arxiv.org/html/2403.17887v2#A3.F9 "Figure 9 ‣ C.2 Finetuning seed ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers") we observe that the layer-pruning method is robust to the choice of seed.
![Image 9: Refer to caption](https://arxiv.org/html/2403.17887v2/x9.png)
Figure 9: Effect of varying the finetuning seed on MMLU accuracy vs. fraction of layers dropped for Llama-2-13B: there is no meaningful effect.
### C.3 LoRA rank
Here we vary the LoRA rank used for healing. Unfortunately, our compute budget did not allow us to make an exhaustive sweep across all of our experimental configurations. In lieu of that, we employed the following protocol for our main experiments:
* Begin with rank 64, following the QLoRA setup (see, e.g. Appendix B.2 of Dettmers et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib37))).
* If healing with that rank significantly harms the performance compared to no healing, then sweep LoRA ranks for that model and, for the other evaluations, pick the best performing LoRA rank according to its MMLU accuracy.
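The two-step protocol above can be sketched as a small decision function. The `tolerance` argument is a hypothetical stand-in for "significantly harms", which we do not quantify here, and the accuracies in the usage lines are made up for illustration; in practice the sweep would only be run when the first check fails.

```python
def pick_lora_rank(mmlu_no_healing, mmlu_by_rank, default_rank=64, tolerance=0.0):
    """Return the LoRA rank chosen by the protocol.

    mmlu_by_rank maps rank -> MMLU accuracy after healing with that rank.
    tolerance is a hypothetical threshold for a "significant" drop.
    """
    # Step 1: keep the default rank unless healing hurts relative to no healing.
    if mmlu_by_rank[default_rank] >= mmlu_no_healing - tolerance:
        return default_rank
    # Step 2: otherwise take the rank with the best MMLU accuracy from the sweep.
    return max(mmlu_by_rank, key=mmlu_by_rank.get)


# Made-up accuracies, for illustration only.
kept = pick_lora_rank(0.60, {64: 0.61, 2: 0.63})
swept = pick_lora_rank(0.60, {64: 0.50, 8: 0.58, 2: 0.62}, tolerance=0.02)
print(kept, swept)
```

Note that in the first call rank 64 is retained even though rank 2 scores higher, since the sweep is only consulted when the default rank fails the first check.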
This protocol is designed to maximize the chance that healing will improve performance across all of our evaluations. For simplicity, we ran this rank-picking protocol using the simple pruning heuristic, with the exception of Llama-2-70B.
In practice, this led to us using rank 64 for every model with the exceptions of Mistral-7B, with rank 4, Llama-2-7B, with rank 2, and Llama-2-70B, with rank 8. (To review this same information in tabular form, see the second Table in §[B.1](https://arxiv.org/html/2403.17887v2#A2.SS1 "B.1 Model and healing details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers").) Figure[10](https://arxiv.org/html/2403.17887v2#A3.F10 "Figure 10 ‣ C.3 LoRA rank ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers") displays the sweeps over MMLU accuracy supporting these choices for Mistral-7B (bottom left panel), Llama-2-7B (bottom middle panel), and Llama-2-70B (top right panel): overall, while the LoRA rank does not have a significant impact on the qualitative behavior of the healed model, decreasing the LoRA rank generally improves performance. In the top left and middle panels of Figure[10](https://arxiv.org/html/2403.17887v2#A3.F10 "Figure 10 ‣ C.3 LoRA rank ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we show corresponding sweeps for Mistral-7B (top) and Llama-2-7B (middle) using the similarity-informed pruning strategy: we see that for this pruning method both models are much more robust, though rank 2 is still the top performing rank for Llama-2-7B.
![Image 10: Refer to caption](https://arxiv.org/html/2403.17887v2/x10.png)
Figure 10: Effect of varying the LoRA rank. Top: 5-shot MMLU accuracy vs. fraction of layers dropped using the similarity-informed pruning strategy on Mistral-7B (_left_), Llama-2-7B (middle), and Llama-2-70B (right). Across all ranks we observe similar behavior, though there's a small effect of decreasing rank improving overall performance. Bottom, left and middle: 5-shot MMLU accuracy vs. fraction of layers dropped using the simple pruning heuristic on Mistral-7B (_left_) and Llama-2-7B (middle). As before, qualitative behavior is similar across ranks, though in this case it's much clearer that decreasing rank improves performance. Bottom, right: C4 validation loss vs. fraction of layers dropped using the similarity-informed pruning strategy on Mistral-7B. In contrast to MMLU, decreasing rank harms performance; together, these results suggest that larger ranks may be overfitting.
The characteristic improvement of MMLU accuracy with decreasing LoRA rank, even for extremely low ranks(!), deserves an explanation. One possibility is that lowering the LoRA rank can better regularize finetuning against overfitting. In particular, astute readers may have been surprised at the discussion of peak learning rates in §[B.1](https://arxiv.org/html/2403.17887v2#A2.SS1 "B.1 Model and healing details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers"): models were finetuned with the same peak used in pretraining; a “large” LoRA rank of 64 introduces a number of additional parameters that may overfit to C4. This overfitting would certainly be harmful, since the actual pretraining datasets for the models we consider are _(a)_ unknown to us, and _(b)_ likely to be of significantly higher quality than C4.
We investigate this directly for Mistral-7B. In the bottom right panel of Figure[10](https://arxiv.org/html/2403.17887v2#A3.F10 "Figure 10 ‣ C.3 LoRA rank ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers") we plot the C4 validation loss across different LoRA ranks: we see that while decreasing the LoRA rank generally improves MMLU accuracy (cf. left-most panels), at the same time it harms the C4 validation loss. This supports our overfitting hypothesis. In a greater-resourced future, it would be interesting to improve the healing process by considering other forms of regularization and learning rate tuning.
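As a rough illustration of how quickly the number of trainable parameters grows with LoRA rank: a rank-$r$ adapter on a weight matrix of shape $d_\text{out} \times d_\text{in}$ trains two factors of shapes $d_\text{out} \times r$ and $r \times d_\text{in}$, i.e. $r\,(d_\text{in} + d_\text{out})$ parameters. The sketch below only does this arithmetic; the 4096 × 4096 shape and the helper name `lora_params` are hypothetical placeholders, not a configuration reported in the paper.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters of one rank-`rank` LoRA adapter (factors A and B)."""
    return rank * (d_in + d_out)

# Hypothetical square projection matrix; chosen only for illustration.
d_in = d_out = 4096

for rank in (2, 4, 8, 64):
    print(rank, lora_params(d_in, d_out, rank))
```

Under these assumptions the count is linear in the rank, so a rank-64 adapter trains 32 times as many parameters per adapted matrix as a rank-2 adapter.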
### C.4 Other pruning strategies
Here we study how the similarity-informed pruning strategy (§[3.2](https://arxiv.org/html/2403.17887v2#S3.SS2 "3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) compares to other layer-pruning baselines: specifically, we contrast with pruning random layers and pruning shallow layers. In Figure[11](https://arxiv.org/html/2403.17887v2#A3.F11 "Figure 11 ‣ C.4 Other pruning strategies ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we observe that the similarity-informed strategy from the main text outperforms both of these other strategies on an MMLU evaluation of Llama-2-7B.
![Image 11: Refer to caption](https://arxiv.org/html/2403.17887v2/x11.png)
Figure 11: Comparison of the similarity-informed pruning strategy (blue) to random-layer pruning (orange) and shallow-layer pruning (green) on MMLU accuracy, with Llama-2 7B and LoRA rank 64. The similarity-informed pruning strategy clearly outperforms these baselines.
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
+2
View File
@@ -0,0 +1,2 @@
Error: Paper '2504.03022' not found on the Hub.
Set HF_DEBUG=1 as environment variable for full traceback.
+58
View File
@@ -0,0 +1,58 @@
Warning: 48287 documents (93%) need embeddings. Run 'qmd embed' for better results.
Expanding query...
├─ 2504.03022 · (lexical+vector)
├─ 2504.03022 code · (lexical)
├─ 2504.03022 usage · (lexical)
├─ code examples for 2504.03022 · (vector)
├─ practical applications of 2504.03022 · (vector)
└─ The topic of 2504.03022 covers code examples for 2504.03022. Here are a few e... · (hyde)
Searching 3 lexical + 4 vector queries...
Reranking 40 documents...
qmd://papers/books-bulk/ai-docs/lng-process-control/advanced-chemical-process-control-putting-theory-into-morten-hovd-1-auflage-weinheim-2023-epub.md #7d7f82
Title: List of Tables
Score: 88%
@@ -1,3 @@ (0 before, 60 after)
![](media=outputs/lng_process_control/Advanced Chemical Process Control - Putting Theory into -- Morten Hovd -- 1_ Auflage, Weinheim, 2023_artifacts/images/9783527842483.jpg)
[]{#cover.xhtml}
qmd://markdown-notes/2021/12/22.md #e11e56
Title: 22
Score: 50%
@@ -1,2 @@ (0 before, 0 after)
need to change pos...
qmd://markdown-notes/logseq-notes/pages/omnivore.md #df8f28
Title: 🔖 Articles
Score: 38%
@@ -1,3 @@ (0 before, 25 after)
## 🔖 Articles
- [Training with quantization noise for extreme model compression](https://omnivore.app/me/training-with-quantization-noise-for-extreme-model-compression-18a6db1914b)
collapsed:: true
qmd://markdown-notes/2021/07/02.md #46e042
Title: 02
Score: 35%
@@ -1,3 @@ (0 before, 11 after)
- [ ] meditate
- [ ] walk
qmd://markdown-notes/2019/08/22.md #727ecd
Title: 22
Score: 34%
@@ -1,3 @@ (0 before, 65 after)
- [x] look up smartmod
- [ ] plan fastapi etc
- [ ] meetup
@@ -0,0 +1,439 @@
Title: The Unreasonable Ineffectiveness of the Deeper Layers
URL Source: https://arxiv.org/html/2403.17887
Published Time: Tue, 04 Mar 2025 03:27:48 GMT
Markdown Content:
Andrey Gromov (Meta FAIR & UMD), Kushal Tirumala (Meta FAIR), Hassan Shapourian (Cisco), Paolo Glorioso (Zyphra), Daniel A. Roberts (MIT & Sequoia Capital)
Co-first authors; please direct correspondence to the union of {gromovand@meta.com, kushaltirumala99@gmail.com, drob@mit.edu}.
###### Abstract
How is knowledge stored in an LLM's weights? We study this via layer pruning: if removing a certain layer does not affect model performance in common question-answering benchmarks, then the weights in that layer are not necessary for storing the knowledge needed to answer those questions. To find these unnecessary parameters, we identify the optimal block of layers to prune by considering similarity across layers; then, to “heal” the damage, we perform a small amount of finetuning. Surprisingly, with this method we find minimal degradation of performance until after a large fraction (up to half) of the layers are removed for some common open-weight models. From a scientific perspective, the robustness of these LLMs to the deletion of layers implies either that current pretraining methods are not properly leveraging the parameters in the deeper layers of the network or that the shallow layers play a critical role in storing knowledge. For our study, we use parameter-efficient finetuning (PEFT) methods, specifically quantization and Low Rank Adapters (QLoRA), such that each of our experiments can be performed on a single 40GB A100 GPU.
1 Introduction
--------------
In this work we study a very simple pruning strategy using open-weight LLMs. In particular, we develop a method that uses the similarity between the representations at different layers to identify the optimal layers to prune for a given pruning fraction; then, after removing these layers we “heal” the pruning-induced mismatch with a small amount of fine tuning (using QLoRA). Our main result is that we can remove a substantial fraction of the _deepest layers_ from models with minimal degradation in downstream question-answering benchmarks. For example, for Llama-2-70B (Touvron et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib1)) we can eliminate up to roughly _half_ of the layers before the performance collapses. An overview of our strategy and the results of pruning Llama-2-70B are shown in Figure[1](https://arxiv.org/html/2403.17887v2#S1.F1 "Figure 1 ‣ 1 Introduction ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
![Image 1: Refer to caption](https://arxiv.org/html/2403.17887v2/x1.png)
Figure 1: Overview of our layer-pruning strategy and example results: _(a)_ a flowchart describing the algorithm: if removing $n$ layers, we find the layer, $\ell^{*}$, that minimizes the angular distance, $d$, between layers $\ell$ and $\ell+n$; we then remove the $n$ layers beginning with layer $\ell^{*}$; finally, if necessary, we can “heal” the damage with a small amount of (parameter-efficient) finetuning. _(b)_ a schematic depicting the removal of $n$ total layers, indexed from $\ell^{*}$ to $\ell^{*}+n-1$. _(c)_ angular distance, $d$, between different numbers of layers, $n$, vs. the layer number, $\ell$, that indexes the beginning of the block of $n$; the bottom curve (darkest purple) represents $n=1$, while the top curve (lightest yellow) represents $n=64$; the black line traces $\ell^{*}(n)$, the minimum of the angular distance across the different sized layer blocks. _(d)_ results of pruning Llama-2-70B with healing (light blue) and without healing (dark blue) as a function of the fraction of layers removed: the top (middle) panel gives the accuracy on the MMLU (BoolQ) question-answering benchmark, while the bottom panel gives the autoregressive loss on a subset of the C4 validation set; here, the dashed red lines (dashed gray lines) indicate the accuracy or loss of the original unpruned model (of random guessing); these plots illustrate the typical behavior we find, in which there are sharp transitions in performance for the accuracy of question-answering tasks (here between 40%-50% pruning fraction), but continuity and very slow growth in the healed loss (light blue) up to at least an 80% pruning fraction.
Our intuition for dropping layers comes from considering the residual structure of the transformer architecture. In more detail, the output of the final layer can be decomposed as a sum over the outputs of all the model layers plus the embedded input. If such a sum had numerous and independent terms, then removing a handful of them should not significantly change the output. However, since the terms are not independent (each layer is input to the following layer), we should expect to be able to remove terms if the residual contribution from a particular layer is small, in other words, if the output of each layer does not change too much from layer to layer. (Footnote 1: This is strongly suggested by “lens” investigations that studied the evolution of the token distribution as a function of layer index such as the “logit lens” (nostalgebraist, [2020](https://arxiv.org/html/2403.17887v2#bib.bib2)) and the “tuned lens” (Belrose et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib3)). A separate line of reasoning along these lines previously inspired neural ODEs (Chen et al., [2018](https://arxiv.org/html/2403.17887v2#bib.bib4)), and led Yang et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib5)) to argue that ideally representation should change substantially from layer to layer in order to most effectively make use of the parameters of a network.)
In conjunction with our layer pruning, we investigate the similarity of layer representations at different separations and find broadly that deeper layers are qualitatively more similar to neighboring layers than shallow layers (with the exception of the very final layer). This suggests an even simpler pruning strategy: remove layers beginning at the penultimate layer and proceed from deep to shallow until the desired number of layers have been removed. In this case, we find that, after healing the damage with a small amount of QLoRA finetuning, we can achieve performance that nearly matches the more involved similarity-informed layer pruning strategy. The effectiveness of this method is evidence that LLMs might not properly leverage the parameters in the deeper layers of the network.
That said, while question-answering (QA) benchmarks such as MMLU and BoolQ are robust to a large amount of layer pruning, other measures of performance are not: if we look at the loss on next-token predictions for an IID dataset (C4 validation set), we find that the model is smoothly damaged in proportion to the fraction of the number of layers pruned. Since perplexity typically correlates strongly with downstream metrics, this naturally begs the question: which tasks are less robust than QA benchmarks to pruning? As part of our final discussion, we explore reasoning related tasks (GSM8k and HellaSwag) and see that they are harmed by any amount of pruning. Altogether, this leads to the following accounting of state: the shallow layers likely play a critical role in the storing of knowledge and retrieving of information, while the deeper layers are important for higher-level computations such as mathematical reasoning.
The structure of this paper is as follows. In §[2](https://arxiv.org/html/2403.17887v2#S2 "2 Literature Review ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we first perform a literature review of both practical post-training strategies and science-of-deep-learning investigations that motivate our work. Then, in §[3](https://arxiv.org/html/2403.17887v2#S3 "3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we give intuition for our layer pruning strategy and explain our method in detail, while in §[4](https://arxiv.org/html/2403.17887v2#S4 "4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers") we iterate over all our experimental results. Finally, we conclude in §[5](https://arxiv.org/html/2403.17887v2#S5 "5 Discussion and Future Directions ‣ The Unreasonable Ineffectiveness of the Deeper Layers") by exploring tasks beyond QA benchmarks, such as reasoning, and highlighting directions of future work. Specific model, finetuning, dataset, and evaluation details can be found in Appendix[B](https://arxiv.org/html/2403.17887v2#A2 "Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), and evaluation ablations can be found in Appendix[C](https://arxiv.org/html/2403.17887v2#A3 "Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
2 Literature Review
-------------------
Pruning for neural networks has a long history (LeCun et al., [1989](https://arxiv.org/html/2403.17887v2#bib.bib6), Hassibi and Stork, [1992](https://arxiv.org/html/2403.17887v2#bib.bib7)): while initial work focused on _unstructured pruning_(Han et al., [2015](https://arxiv.org/html/2403.17887v2#bib.bib8), Chen et al., [2015](https://arxiv.org/html/2403.17887v2#bib.bib9), Srinivas and Babu, [2015](https://arxiv.org/html/2403.17887v2#bib.bib10)), _structured pruning_ techniques were developed to make sparse networks more efficient (Li et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib11), Wen et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib12), Hu et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib13), He et al., [2017](https://arxiv.org/html/2403.17887v2#bib.bib14), Huang et al., [2018](https://arxiv.org/html/2403.17887v2#bib.bib15), Murray and Chiang, [2015](https://arxiv.org/html/2403.17887v2#bib.bib16), See et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib17), Kim and Rush, [2016](https://arxiv.org/html/2403.17887v2#bib.bib18)). 
Recent work, of course, focused on structured pruning of transformers (Voita et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib19), Michel et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib20), Kim and Awadalla, [2020](https://arxiv.org/html/2403.17887v2#bib.bib21), Fan et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib22), Zhang and He, [2020](https://arxiv.org/html/2403.17887v2#bib.bib23), Fan et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib24), Jha et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib25), Sajjad et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib26), Liu et al., [2023a](https://arxiv.org/html/2403.17887v2#bib.bib27), Hou et al., [2020](https://arxiv.org/html/2403.17887v2#bib.bib28), Sharma et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib29), Ashkboos et al., [2024](https://arxiv.org/html/2403.17887v2#bib.bib30), Xia et al., [2022](https://arxiv.org/html/2403.17887v2#bib.bib31), Lagunas et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib32), Men et al., [2024](https://arxiv.org/html/2403.17887v2#bib.bib33)). Our work focuses on pruning the layers of decoder-only GPT-style open-weight _large_ language models after they've been pretrained. For an extended literature review, please see Appendix[A](https://arxiv.org/html/2403.17887v2#A1 "Appendix A Extended Literature Review ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
3 Method
--------
In this section, we give intuition for why we think layer pruning works (§[3.1](https://arxiv.org/html/2403.17887v2#S3.SS1 "3.1 Intuition ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) and then we explain our method in detail (§[3.2](https://arxiv.org/html/2403.17887v2#S3.SS2 "3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")).
### 3.1 Intuition
Our intuition for layer dropping comes from thinking about the representations as a slowly changing function of layer index. In particular, the layer-to-layer evolution of representations for a transformer is given by a _residual_ iteration equation
$$x^{(\ell+1)} = x^{(\ell)} + f(x^{(\ell)}, \theta^{(\ell)})\,, \tag{1}$$
where $x^{(\ell)}$ and $\theta^{(\ell)}$, respectively, are the multi-dimensional input and parameter vectors for layer $\ell$, and $f(x, \theta)$ describes the transformation of one multi-head self-attention _and_ MLP layer block. As for any residual network, if we unroll this iteration, we see that after $L$ total layers the output is described as a sum over the transformations of all the layers
$$x^{(L)} = x^{(0)} + \sum_{\ell=0}^{L-1} f(x^{(\ell)}, \theta^{(\ell)})\,. \tag{2}$$
If the terms in the sum were _numerous_ ($L \gg 1$) and _independent_, e.g. if the block functions were instead a function of the overall input as $f(x^{(0)}, \theta^{(\ell)})$, then perhaps any particular contribution to the sum ([2](https://arxiv.org/html/2403.17887v2#S3.E2 "In 3.1 Intuition ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) could be neglected.
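The unrolling from the residual iteration (1) to the sum (2) can be checked numerically with a toy model. This is a minimal sketch, not the paper's code: the scalar `block` function is a hypothetical stand-in for a real attention-plus-MLP block, and the parameter values are arbitrary.

```python
import math

def block(x, theta):
    # Hypothetical stand-in for one transformer block f(x, theta).
    return theta * math.tanh(x)

def run_residual(x0, thetas):
    """Iterate x <- x + f(x, theta) and record each block's contribution."""
    x = x0
    contributions = []
    for theta in thetas:
        delta = block(x, theta)
        contributions.append(delta)
        x = x + delta
    return x, contributions

thetas = [0.5, -0.2, 0.1, 0.05]          # arbitrary toy parameters
x_final, contributions = run_residual(1.0, thetas)

# The final state equals the input plus the sum of all block outputs.
assert math.isclose(x_final, 1.0 + sum(contributions))
```

Note that each recorded contribution depends on the running state `x`, which is why the terms of the sum are not independent of one another.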
Of course, they are not at all independent: if we delete layer $\ell-1$, then we must now connect the old input to that layer, $x^{(\ell-1)}$, into the block function of layer $\ell$ as
$$x^{(\ell+1)} = x^{(\ell-1)} + f(x^{(\ell-1)}, \theta^{(\ell)})\,, \tag{3}$$
where, for clarity, we are not relabeling layers or inputs despite the deletion. In general, such a _mismatch_ between the original input and new input should be very damaging for the network. However, if, after some number of initial layers, the representations converge to a slowly changing function with respect to layer index,
$$x^{(\ell)} \approx x^{(\ell-1)} + \epsilon\,, \tag{4}$$
with $\epsilon \ll x^{(\ell)}$ in some appropriate sense, then the effect of deleting a particular layer $\ell$, e.g. making the replacement $x^{(\ell)} \to x^{(\ell-1)}$ in going from ([1](https://arxiv.org/html/2403.17887v2#S3.E1 "In 3.1 Intuition ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) to ([3](https://arxiv.org/html/2403.17887v2#S3.E3 "In 3.1 Intuition ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), should only change the representation in the subsequent layer, $x^{(\ell+1)}$, by a small amount. Similarly, to successfully prune the $n$ layers before layer $\ell$, i.e. those indexed from $\ell-n, \ldots, \ell-1$, we'd want that the input to the pruned block should be very similar to the output of the pruned block:
$$x^{(\ell)} \approx x^{(\ell-n)} + \epsilon\,. \tag{5}$$
Regardless, any layer removal has a cascading effect: since post pruning $x^{(\ell+1)}$ is computed by a different function than before, cf. ([1](https://arxiv.org/html/2403.17887v2#S3.E1 "In 3.1 Intuition ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) vs. ([3](https://arxiv.org/html/2403.17887v2#S3.E3 "In 3.1 Intuition ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), and since then $x^{(\ell+1)}$ is directly or indirectly input to subsequent layers, $\ell+2, \ldots, L$, deleting a shallow layer should have a much greater impact than deleting a deeper layer.
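A toy scalar sketch of the layer-deletion rule in (3) may help make the "small residual contribution" condition concrete. Everything here is an illustrative assumption: `block` is a hypothetical stand-in for a transformer block, and the parameters are chosen by hand so that one layer has a tiny update. It does not test the shallow-versus-deep claim, only that skipping a nearly-identity block barely changes the output.

```python
import math

def block(x, theta):
    # Hypothetical stand-in for one transformer block f(x, theta).
    return theta * math.tanh(x)

def forward(x0, thetas, skip=None):
    """Run the residual stack, optionally deleting the layer at index `skip`."""
    x = x0
    for ell, theta in enumerate(thetas):
        if ell == skip:
            continue  # the deleted layer's input is fed straight to the next block
        x = x + block(x, theta)
    return x

thetas = [0.8, 0.6, 0.01, 0.5]   # layer 2 is built to have a small update
full = forward(1.0, thetas)

# Absolute change in the final output when each single layer is deleted.
change = {ell: abs(forward(1.0, thetas, skip=ell) - full)
          for ell in range(len(thetas))}
least_harmful = min(change, key=change.get)
print(least_harmful, change)
```

In this toy setting the layer whose output is closest to its input is the cheapest one to delete, which is the motivation for searching over input/output similarity in the algorithm that follows.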
From this, we have the following hypotheses that we will test experimentally:
1. _(0)_ We should be able to prune layers of a residual network.
2. _(1)_ We should have greater success pruning deeper layers.
3. _(2)_ Blocks of layers we successfully prune should have outputs that are similar to their inputs.
In the next subsection, §[3.2](https://arxiv.org/html/2403.17887v2#S3.SS2 "3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers") we will explain the details of our pruning algorithm and in the following section, §[4](https://arxiv.org/html/2403.17887v2#S4 "4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we will present experimental evidence for points _(0)-(2)_.
### 3.2 Layer-pruning algorithm(s)
Our principal layer pruning algorithm is very simple:
0. Pick a number of layers to prune, $n$.
1. Compute the angular distance $d(x^{(\ell)}, x^{(\ell+n)})$, cf. ([7](https://arxiv.org/html/2403.17887v2#S3.E7 "In 3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) below, between the input to layer $\ell$ and the input to layer $\ell+n$ on a neutral pretraining dataset or on a dataset representative of a downstream task of interest.
2. Find the layer, $\ell^{\star}$, that minimizes that distance:
   $$\ell^{\star}(n) \equiv \operatorname*{arg\,min}_{\ell} ~ d(x^{(\ell)}, x^{(\ell+n)})\,. \tag{6}$$
3. Drop layers $\ell^{\star}$ to $\ell^{\star}+n-1$; connect the old input to layer $\ell^{\star}$ to the old $(\ell^{\star}+n)$th layer block. (Footnote 2: Layers are often contained in a data structure, such as a ModuleList in _PyTorch_, so to drop these layers we would simply define a new ModuleList that removes the layers from $\ell^{\star}$ to $\ell^{\star}+n-1$.)
4. (Optionally) heal the mismatch at layer $\ell^{\star}+n$ with a small amount of fine tuning on a neutral pretraining dataset or particular dataset of interest.
If fewer words inside of a figure are more helpful to you than the text in an enumerated list, then note that this algorithm is also depicted in panels (a)-(b) of Figure[1](https://arxiv.org/html/2403.17887v2#S1.F1 "Figure 1 ‣ 1 Introduction ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
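Steps 2 and 3 of the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the layers are plain strings standing in for modules, the helper names are hypothetical, and the distance values are made up rather than measured.

```python
def best_block_start(distances):
    """Step 2: index ell* minimizing d(x^(ell), x^(ell+n)).

    `distances[ell]` holds the (already averaged) distance for a block
    of n layers starting at layer `ell`.
    """
    return min(range(len(distances)), key=lambda ell: distances[ell])

def drop_block(layers, start, n):
    """Step 3: remove layers start .. start+n-1 and join the remainder."""
    return layers[:start] + layers[start + n:]

layers = [f"block_{i}" for i in range(8)]   # stand-in for a list of layer modules
n = 3
# Made-up distances, one per candidate starting layer.
distances = [0.41, 0.35, 0.30, 0.22, 0.18, 0.27]

ell_star = best_block_start(distances)
pruned = drop_block(layers, ell_star, n)
print(ell_star, pruned)
```

With a real model the same slicing would be applied to the container holding the layer modules, as the footnote on ModuleList suggests, followed by the optional healing step.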
Elaborating on the first step, the angular distance on a single sequence of length $T$ is given by
$$d(x^{(\ell)}, x^{(\ell+n)}) \equiv \frac{1}{\pi} \arccos\left( \frac{x^{(\ell)}_{T} \cdot x^{(\ell+n)}_{T}}{\left\| x^{(\ell)}_{T} \right\| \left\| x^{(\ell+n)}_{T} \right\|} \right)\,, \tag{7}$$
where the inner product is over the hidden dimension of the model for the final token $T$ of the sequence, $\|\cdot\|$ denotes the $L^{2}$-norm, and the factor of $1/\pi$ is a convention. (Footnote 3: Two comments: _(i)_ we do not expect our choice of angular distance, in lieu of any other reasonable metric such as cosine similarity, to be particularly significant; and _(ii)_ we chose to focus on the final token since, due to the causal attention mask, its embedding is the only one that depends on the entire sequence.) This distance should then be summed over a number of examples that is large enough to get a low-fluctuation estimate but overall should be quite small.
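Equation (7) translates directly into code. The sketch below uses only the standard library and plain lists; the function names and the `[example][token][hidden]` layout are assumptions made for illustration. It averages over examples rather than summing, which differs from a sum only by a constant factor and so leaves the minimizing layer unchanged.

```python
import math

def angular_distance(u, v):
    """(1/pi) * arccos of the cosine similarity between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return math.acos(cos) / math.pi

def mean_final_token_distance(acts_l, acts_ln):
    """Average distance over examples, using only the final token of each.

    acts_l[i] and acts_ln[i] are [token][hidden] activations for example i
    at the input to layer ell and to layer ell + n, respectively.
    """
    dists = [angular_distance(a[-1], b[-1]) for a, b in zip(acts_l, acts_ln)]
    return sum(dists) / len(dists)

# Two tiny made-up examples with hidden size 2.
acts_l = [[[0.3, 0.1], [1.0, 0.0]], [[0.0, 2.0]]]
acts_ln = [[[0.2, 0.2], [0.0, 1.0]], [[0.0, 5.0]]]
avg = mean_final_token_distance(acts_l, acts_ln)
```

With the $1/\pi$ convention the distance lies in $[0, 1]$: parallel vectors give 0, orthogonal vectors give 0.5, and antiparallel vectors give 1.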
Elaborating on the “optionality” of the final step, we find that the near-lack of performance degradation on question-answering benchmarks, cf. Figure[1](https://arxiv.org/html/2403.17887v2#S1.F1 "Figure 1 ‣ 1 Introduction ‣ The Unreasonable Ineffectiveness of the Deeper Layers")(d) and others in §[4.1](https://arxiv.org/html/2403.17887v2#S4.SS1 "4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), can be extended to greater pruning fractions with a small amount of finetuning. Depending on resource constraints and intended application of the pruned model, this may not be necessary. However, the healing procedure does have a substantial impact on perplexity, cf. Figure[1](https://arxiv.org/html/2403.17887v2#S1.F1 "Figure 1 ‣ 1 Introduction ‣ The Unreasonable Ineffectiveness of the Deeper Layers")(d) and others in §[4.2](https://arxiv.org/html/2403.17887v2#S4.SS2 "4.2 Loss on next-token predictions ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
For both the angular-distance measurement and the healing, if the ultimate goal is supervised finetuning (SFT) of a model for a downstream task, it could be useful to evaluate the distance on a sample from that dataset and then combine the healing process with the SFT. In contrast, for the greatest generality, it's most natural to measure distance and heal with a pretraining dataset that approximates the statistics under which the model was originally pretrained.
Finally, we also investigated an even simpler pruning strategy inspired by analyzing the angular distances across different model families: drop the deepest layers, excluding the final layer before the LLM head, and then (_non-optionally_) heal the damage. For complete clarity, this means that if we are pruning $n$ layers from an $L$-layer model, then we would remove layers $(L-n)$ to $(L-1)$, inclusive.
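The index arithmetic of this simpler heuristic can be sketched as follows. One assumption is made explicit here: the labels $(L-n)$ to $(L-1)$ exclude the final layer only if layers are numbered 1 to $L$, so the sketch reads them that way and converts to 0-based list indices. The helper name `deepest_block` is hypothetical, and the healing step is not shown.

```python
def deepest_block(num_layers, n):
    """0-based indices of the n deepest layers, keeping the final layer.

    Assumes the text's labels (L-n)..(L-1) are 1-based, i.e. layer L is
    the final layer before the LM head and is never removed.
    """
    if not 0 <= n < num_layers:
        raise ValueError("n must satisfy 0 <= n < num_layers")
    last = num_layers - 1                 # 0-based index of the final layer
    return list(range(last - n, last))

layers = [f"block_{i}" for i in range(8)]  # stand-in for a list of layer modules
to_drop = set(deepest_block(len(layers), 3))
pruned = [layer for i, layer in enumerate(layers) if i not in to_drop]
print(sorted(to_drop), pruned)
```

Unlike the similarity-informed strategy, no activations need to be collected to choose the block; only the model depth and the pruning budget enter.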
4 Results
---------
In this section, we demonstrate the effectiveness of our pruning strategy on different question-answering (QA) benchmarks and highlight a robust pruning-driven transition in performance (§[4.1](https://arxiv.org/html/2403.17887v2#S4.SS1 "4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), while, in contrast, we find that the autoregressive perplexities of the healed pruned models are continuous across their transition points (§[4.2](https://arxiv.org/html/2403.17887v2#S4.SS2 "4.2 Loss on next-token predictions ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")); then, after comparing the similarity statistics between different layers across model sizes and families (§[4.3](https://arxiv.org/html/2403.17887v2#S4.SS3 "4.3 Angular distances between representations ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), we contrast our principal similarity-informed pruning strategy with a simpler remove-the-deepest-layers strategy (§[4.4](https://arxiv.org/html/2403.17887v2#S4.SS4 "4.4 A simpler pruning strategy ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")).
For our experiments, we pruned a wide variety of large-scale LLMs from 2.7B to 70B parameters spanning 32 to 80 total unpruned layers. Specifically, we used models in the Llama-2 family (Touvron et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib1)), the Qwen family (Bai et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib34)), Mistral-7B (Jiang et al., [2023a](https://arxiv.org/html/2403.17887v2#bib.bib35)), and Phi-2 (Javaheripi and Bubeck, [2023](https://arxiv.org/html/2403.17887v2#bib.bib36)). For these models, we executed the “healing” step using QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib37)): our models were quantized to 4-bit precision and then finetuned, using QLoRA for efficient training, on either 164M or 328M tokens from the Colossal Clean Crawled Corpus (C4) (Raffel et al., [2020](https://arxiv.org/html/2403.17887v2#bib.bib38)), a common pretraining dataset. As a result, _each experiment of ours can be performed on a single 40GB A100 GPU_. For our QA evals, we used Massive Multitask Language Understanding (MMLU) (Hendrycks et al., [2020](https://arxiv.org/html/2403.17887v2#bib.bib39)), a common world-knowledge and problem solving benchmark, and BoolQ (Clark et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib40)), a common yes/no reading comprehension benchmark where the answer has to be inferred from the text itself. The specifics of our models, healing procedure, dataset choices, and evaluation details can be found across Appendix[B](https://arxiv.org/html/2403.17887v2#A2 "Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers"); ablations of different hyperparameter choices can be found across Appendix[C](https://arxiv.org/html/2403.17887v2#A3 "Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
### 4.1 Accuracy on QA benchmarks
Our first set of results is shown in Figure[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), where we plot 5-shot MMLU accuracy as a function of the fraction of layers removed: in the left panel we present the Llama-2 family, in the middle panel we present models from the Qwen family, and in the right panel we show Mistral-7B and Phi-2. In order to better compare models of different total number of layers, in these plots we opted to normalize the $x$-axis by the fraction of layers removed (rather than the absolute number of layers removed). Note that since MMLU contains multiple choice questions with four possible responses, the expected accuracy of random guessing is 25%.
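As a small illustration of this normalization, the sketch below converts an absolute number of removed layers into a pruning fraction and computes the chance-level accuracy of a multiple-choice benchmark. The function names are hypothetical, and the depths used in the usage lines (32 and 80) are simply the smallest and largest layer counts quoted above.

```python
def pruning_fraction(n_removed: int, n_layers: int) -> float:
    """Fraction of layers removed, used on the x-axis instead of n_removed."""
    if not 0 <= n_removed <= n_layers:
        raise ValueError("n_removed must lie between 0 and n_layers")
    return n_removed / n_layers


def chance_accuracy(n_choices: int) -> float:
    """Expected accuracy of uniform random guessing on a multiple-choice task."""
    return 1.0 / n_choices


# Removing the same 8 layers is a very different intervention at different depths.
shallow = pruning_fraction(8, 32)   # a quarter of a 32-layer model
deep = pruning_fraction(8, 80)      # a tenth of an 80-layer model
mmlu_chance = chance_accuracy(4)    # four answer options per question
```

Plotting against the fraction rather than the raw count is what makes curves for models of different depth directly comparable.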
![Image 2: Refer to caption](https://arxiv.org/html/2403.17887v2/x2.png)
Figure 2: MMLU accuracy (5-shot) vs. fraction of layers dropped for different model families. (_Left:_ Llama-2 family; _Middle:_ Qwen family; _Right:_ Mistral-7B and Phi-2.) The solid lines represent performance after dropping layers and healing, dotted lines show performance after dropping layers only (no healing), and the dashed gray line is the score for guessing randomly. For these models, healing leads to modest improvements, and performances are quite robust until 20%-55% pruning fractions, depending on model family and size, at which point they transition to random guessing.
Importantly, we see a characteristic flat region of robust performance followed by a sharp transition to random accuracy at a pruning fraction around 45%-55% for models in the Llama-2 family, 35% for Mistral 7B, 25% for Phi-2, and 20% for models from the Qwen family. This implies that the essential knowledge required to achieve a model's top score isn't removed by significant layer removal (even though the fraction can be quite large!) until eventually that knowledge is lost at a critical model-dependent threshold. (Footnote 4: This effect is rather robust to choice of QA benchmark: in Figure[7](https://arxiv.org/html/2403.17887v2#A2.F7 "Figure 7 ‣ B.2 Evaluation details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers") we plot the average 0-shot BoolQ accuracy for our model families and observe analogous behavior.) Contrasting the curves with and without healing, we see that finetuning offers a modest improvement by better preserving the unpruned performance and pushing the phase transition to random guessing to slightly larger pruning fractions.
Broadly we see that layer pruning is more robust for the larger and deeper models, e.g. Llama-2-13B and Llama-2-70B, which we hypothesize could be related to the fact that either the smaller models are more overtrained, making parameters less redundant, or that the deeper models can afford to lose more layers in an absolute sense. Also, the Qwen family is strange, a fact we will further elaborate on in §[4.3](https://arxiv.org/html/2403.17887v2#S4.SS3 "4.3 Angular distances between representations ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
### 4.2 Loss on next-token predictions
In this section, we look at the effect of layer pruning on the pretraining optimization objective (the cross-entropy loss of next-token prediction) when evaluated on a subset of the C4 validation dataset. (Footnote 5: We make sure that none of the validation data are seen during the healing stage.) In order to have a fair comparison across models with different sized vocabularies $V$, we normalize the loss by $\log V$, which corresponds to the loss of sampling tokens randomly with uniform probability. (See Appendix[B.2](https://arxiv.org/html/2403.17887v2#A2.SS2 "B.2 Evaluation details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers") for more details.)
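To make the normalization concrete, here is a minimal sketch, assuming the cross-entropy is a per-token mean measured in nats (so that the natural logarithm is the matching normalizer). The function names and the two vocabulary sizes are hypothetical and chosen only for illustration.

```python
import math


def uniform_cross_entropy(vocab_size: int) -> float:
    """Cross-entropy (nats) of a predictor assigning 1/V to every token."""
    return -math.log(1.0 / vocab_size)


def normalized_loss(mean_cross_entropy: float, vocab_size: int) -> float:
    """Divide a mean next-token cross-entropy (nats) by log V."""
    return mean_cross_entropy / math.log(vocab_size)


# Hypothetical vocabulary sizes, for illustration only.
small_vocab, large_vocab = 32_000, 150_000

# Uniform guessing has a different raw loss for each vocabulary size...
raw_small = uniform_cross_entropy(small_vocab)
raw_large = uniform_cross_entropy(large_vocab)

# ...but lands on the same normalized value, which is the point of dividing by log V.
norm_small = normalized_loss(raw_small, small_vocab)
norm_large = normalized_loss(raw_large, large_vocab)
```

Under this convention a normalized loss of 1 marks uniform random guessing regardless of vocabulary size, so curves for different models share a common reference line.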
In Figure[3](https://arxiv.org/html/2403.17887v2#S4.F3 "Figure 3 ‣ 4.2 Loss on next-token predictions ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we plot the normalized C4 validation loss for all seven of our models, after healing (left panel) and before healing (right panel), as a function of the fraction of layers removed. Without healing, we see that there is a somewhat sharp(ish) transition to random guessing for each model at approximately the pruning fraction that the QA benchmark accuracies also sharply transition to random guessing, suggesting that models are hopelessly harmed at this point, cf. Figure[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"). Next, contrasting the scales of both plots, we see that healing significantly restores the next-token prediction ability of all the models to near-unpruned levels, with the loss increasing slowly and linearly with layer dropping. Most striking from a scientific perspective is the post-healing continuity through the pruning fractions where we previously found sharp transitions for the QA benchmarks: this decoupling illustrates one way of disconnecting (or creating a miscalibration) between performance on downstream tasks such as MMLU and BoolQ and continuous measures of performance such as the cross-entropy loss. (Footnote 6: This is consistent with Schaeffer et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib41)), who argued that jumps in one kind of metric may not be visible in others.)
![Image 3: Refer to caption](https://arxiv.org/html/2403.17887v2/x3.png)
Figure 3: Normalized C4 validation loss vs. fraction of layers dropped before healing (_left_) and after healing (_right_); each curve is normalized by the cross-entropy loss of sampling uniformly from the model's vocabulary. For the experiments before healing, the loss for each model transitions to random guessing (gray dashed line) at approximately the same pruning fractions that the QA benchmarks transition to random guessing; after healing, there is continuity through the regions of sharp transition on QA tasks, cf. Figure[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"). Contrasting the overall scale of both plots, it's clear that healing significantly restores the performance on next-token prediction to near-unpruned levels.
### 4.3 Angular distances between representations
Given the central role the angular distance ([7](https://arxiv.org/html/2403.17887v2#S3.E7 "In 3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) plays in our pruning strategy, let's take a subsection to look at these distances across our seven models. For this analysis, the angular distances for each model were averaged over 10k samples from the C4 validation set.
Recall from earlier Figure[1](https://arxiv.org/html/2403.17887v2#S1.F1 "Figure 1 ‣ 1 Introduction ‣ The Unreasonable Ineffectiveness of the Deeper Layers")(c): for Llama-2-70B this plotted the angular distance $d(x^{(\ell)}, x^{(\ell+n)})$ that compared the $\ell$-th layer to the $(\ell+n)$-th layer, across all initial indexes $\ell$ for block sizes from $n=1$ to $n=64$; the minimum of the curves, $\ell^{\star}(n)$, gave the optimal block to prune for a given $n$, cf. ([6](https://arxiv.org/html/2403.17887v2#S3.E6 "In item 2 ‣ 3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")).
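The selection of the optimal block start for a given block size can be sketched as follows. This is a minimal pure-Python sketch, assuming the angular distance is the arccosine of the cosine similarity divided by $\pi$ (equation (7) itself lies outside this section) and using a single representation vector per layer, whereas the analysis here averages distances over many samples. The function names and the toy vectors are hypothetical.

```python
import math


def angular_distance(u, v):
    """Angle between two vectors divided by pi, so the result lies in [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.acos(cosine) / math.pi


def best_block_start(layer_inputs, n):
    """Return (ell_star, distance) minimizing d(x^(ell), x^(ell+n)) over ell.

    layer_inputs[ell] is one representation vector for the input to layer ell.
    """
    if not 0 < n < len(layer_inputs):
        raise ValueError("block size n must leave at least one candidate start")
    candidates = range(len(layer_inputs) - n)
    distances = [
        angular_distance(layer_inputs[ell], layer_inputs[ell + n])
        for ell in candidates
    ]
    ell_star = min(candidates, key=lambda ell: distances[ell])
    return ell_star, distances[ell_star]


# Toy 2-d "representations": the two middle-late vectors are nearly parallel.
toy_inputs = [[1.0, 0.0], [0.0, 1.0], [0.5, 1.0], [0.6, 1.0], [-1.0, 0.2]]
ell_star, d_min = best_block_start(toy_inputs, n=1)
```

In the toy data the smallest distance is between entries 2 and 3, so a one-layer block starting at index 2 would be the candidate to prune.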
A more compact way to display this same data is shown in the heat maps of Figure[4](https://arxiv.org/html/2403.17887v2#S4.F4 "Figure 4 ‣ 4.3 Angular distances between representations ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"): each square is colored to depict the row-normalized angular distance between layer $\ell$ and $\ell+n$ across all possible $\ell$, and $n$ up to very large fractions of the total number of layers; the optimal layer to prune for a given block size, $\ell^{*}(n)$, corresponds to the minimal distance in each row.
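A minimal sketch of the row normalization, assuming it means a min-max rescaling of each row to $[0,1]$ as the caption of Figure 4 describes. The function name and the toy distances are hypothetical; rows may have different lengths because larger blocks admit fewer starting layers.

```python
def row_normalize(distances_by_block_size):
    """Shift and rescale each row (one block size n) to span [0, 1]."""
    normalized = []
    for row in distances_by_block_size:
        lo, hi = min(row), max(row)
        span = hi - lo
        # A constant row carries no ranking information; map it to zeros.
        normalized.append([(d - lo) / span if span > 0 else 0.0 for d in row])
    return normalized


# Toy distances: first row is n = 1 (three starts), second is n = 2 (two starts).
toy_rows = [[0.2, 0.1, 0.4], [0.3, 0.5]]
heat_map = row_normalize(toy_rows)

# The optimal start for each block size is the position of the zero in its row.
optimal_starts = [row.index(min(row)) for row in heat_map]
```

Because the rescaling is monotonic within a row, it changes the colors but not which starting layer is minimal for a given block size.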
Across models, we make two generalizations: _(i)_ the smallest distances are found across the deeper blocks, meaning deeper layers are typically quite similar to each other and can be more easily dropped; _(ii)_ the distances across the deepest blocks (the blocks that include the last layer) take either maximal or nearly-maximal values, meaning one should never drop the final layer. While broadly true, there are a few exceptions. For some models, e.g. Phi-2-2.7B, or for the largest blocks in some models, e.g. Llama-2-7B, the final _few_ layers seem important. As previously noted, the Qwen family is somewhat unusual: here we see that there are a few odd “islands” of high similarity for shallow blocks; this likely explains the shorter region of robust performance in Figure[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
![Image 4: Refer to caption](https://arxiv.org/html/2403.17887v2/x4.png)
Figure 4: Normalized angular distance ([7](https://arxiv.org/html/2403.17887v2#S3.E7 "In 3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) from initial layer $\ell$ (x-axis) with block size $n$ (y-axis) for each of the seven models we evaluated; the distance for each $n$ is shifted and rescaled to span the same range, $[0,1]$ (yellow to purple): the optimal block to prune, $\ell^{*}(n)$, corresponds to the deepest yellow for each row. Across models, the deeper layers tend to be very similar, though the deepest blocks that include the final layer (squares along the outer diagonal) are (near-)maximally dissimilar.
### 4.4 A simpler pruning strategy
Inspired by our recent conclusions, we experiment with a very simple heuristic pruning strategy: _(1)_ if pruning $n$ layers from an $L$-layer model, drop layers $(L-n)$ to $(L-1)$ so as to remove the deepest block that excludes the final layer; then _(2)_ heal with a small amount of finetuning as before. Compared with our principal similarity-informed pruning strategy, this simpler heuristic algorithm has the advantage of never requiring practitioners to load the unpruned model onto a GPU or run inference on it. It also provides a meaningful ablation of the importance of optimizing the block to prune.
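Step _(1)_ of the heuristic can be sketched as plain index bookkeeping. The sketch assumes layers are numbered from 1 to $L$, which is the convention under which dropping layers $(L-n)$ to $(L-1)$ removes exactly $n$ layers and keeps the final one; the function name is hypothetical, and the healing step _(2)_ is not shown.

```python
def simple_heuristic_layers(n_layers: int, n_prune: int):
    """Return (kept, dropped) layer indices for the deepest-block heuristic.

    Layers are numbered 1..L. Layers (L - n)..(L - 1) are dropped, so the
    final layer L is always kept.
    """
    if not 0 <= n_prune < n_layers:
        raise ValueError("n_prune must satisfy 0 <= n_prune < n_layers")
    dropped = list(range(n_layers - n_prune, n_layers))  # L-n .. L-1 inclusive
    dropped_set = set(dropped)
    kept = [ell for ell in range(1, n_layers + 1) if ell not in dropped_set]
    return kept, dropped


# Pruning 8 layers from a 32-layer model drops layers 24..31 and keeps layer 32.
kept, dropped = simple_heuristic_layers(32, 8)
```

Note that nothing here depends on activations, which is why this variant never needs the unpruned model to be loaded or run.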
In Figure[5](https://arxiv.org/html/2403.17887v2#S4.F5 "Figure 5 ‣ 4.4 A simpler pruning strategy ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we contrast our two pruning strategies, both before healing (left panels) and after healing (right panels), for the QA benchmarks (MMLU/BoolQ, top/middle panels) and the autoregressive loss (C4 validation, bottom panels). On the one hand, the simple heuristic performs quite poorly without healing the damage incurred by pruning: accuracy on the QA benchmarks decays rapidly to (near-) random with increased pruning fraction, and the loss begins to increase very rapidly even with small amounts of pruning. On the other hand, the results for the two pruning strategies across evaluations are quite comparable after healing: for the QA benchmarks, the similarity-informed algorithm slightly better preserves the accuracy before the phase transition, though the simple algorithm perhaps pushes the phase transition to slightly greater pruning fractions; and for the loss, the curves nearly lie on top of each other, though the similarity-informed strategy does marginally outperform for all amounts of pruning. These experiments are strong evidence that the purpose of post-pruning finetuning is the healing of damage at the pruning interface and not the acquisition of additional knowledge.
![Image 5: Refer to caption](https://arxiv.org/html/2403.17887v2/x5.png)
Figure 5: Evaluation of Llama-2-70B with the simple pruning heuristic (solid red line), shown along with scores for the similarity-informed pruning strategy (solid blue line), scores of the unpruned Llama-2-70B (red dashed line), and scores for randomly guessing (gray dashed line). (_Left:_ before healing, _Right:_ after healing; _Top:_ MMLU, _Middle:_ BoolQ, _Bottom:_ C4 Validation Loss.) Without healing, the simple heuristic performs poorly across all evals; with healing, the scores of both methods are quite similar.
5 Discussion and Future Directions
----------------------------------
At the end of this work, many readers are puzzled by the following: are the deeper layers entirely useless? So far, we've provided evidence that the elimination of the deeper layers does not affect performance on QA tasks like MMLU (Figure[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), while at the same time we have shown that their removal does disrupt the next-token predictions of the underlying model (Figure[3](https://arxiv.org/html/2403.17887v2#S4.F3 "Figure 3 ‣ 4.2 Loss on next-token predictions ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")). Since perplexity often correlates with performance on downstream tasks, which are the tasks that are hurt by layer pruning?
Here are two hypotheses consistent with the fact that the model's perplexity is disturbed proportionally to pruning fraction:
* _(i)_ The deeper layers are not essential for storing knowledge, but are useful for more complicated computations, such as those that involve reasoning.
* _(ii)_ The deeper layers are necessary when the model has to generate many tokens before answering a question, such as when it produces a chain-of-thought (CoT).
We test these hypotheses by evaluating our layer-pruned models on tasks that involve CoTs or reasoning. For the former, we'll look at Chain-of-Thought MMLU (CoT-MMLU); for the latter, we'll look at GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib42)), a grade-school math benchmark, and HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib43)), a multiple choice common-sense reasoning benchmark. (Footnote 7: Here are the details for how we performed these three evaluations. For CoT-MMLU, we followed the `flan_cot_fewshot` evaluation in EleutherAI (Gao et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib44)), in which models produce a chain of thought before generating their answer. Note that the accuracy at $0\%$ pruning fraction for MMLU without CoT is much better than the analogous accuracy at $0\%$ pruning fraction for CoT-MMLU ($\sim 69\%$ vs. $\sim 43\%$, respectively; cf. Figures[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers") and [6](https://arxiv.org/html/2403.17887v2#S5.F6 "Figure 6 ‣ 5 Discussion and Future Directions ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), consistent with some previous work (e.g., see Table 16 of Chung et al. ([2024](https://arxiv.org/html/2403.17887v2#bib.bib45))). For GSM8K, we used the `gsm8k_cot` evaluation in EleutherAI (Gao et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib44)) and measured pass@1; for each problem we extracted an answer from a single generation (with CoT) and checked for correctness against the ground-truth answer. For HellaSwag, we used the `hellaswag` evaluation in EleutherAI (Gao et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib44)). Note that HellaSwag is a multiple-choice benchmark, so random performance is 25%.)
In Figure[6](https://arxiv.org/html/2403.17887v2#S5.F6 "Figure 6 ‣ 5 Discussion and Future Directions ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we plot the performance of Llama-2 70B pruned with the similarity-informed pruning strategy across CoT-MMLU (left), GSM8K (center), and HellaSwag (right): on the one hand, both GSM8K and HellaSwag, our two reasoning tasks, exhibit immediate degradation in performance with any amount of pruning, correlating with a similar decrease in the perplexity evals (Figure[3](https://arxiv.org/html/2403.17887v2#S4.F3 "Figure 3 ‣ 4.2 Loss on next-token predictions ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")); on the other hand, CoT-MMLU shows a relatively flat region of robust performance with pruning, analogous to our previous results on QA benchmarks (e.g. Figure[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers")). This is some initial evidence for hypothesis _(i)_ over hypothesis _(ii)_: the deeper layers may be useful for higher-level reasoning tasks, while less important for knowledge intensive QA tasks; moreover, perplexity errors due to pruning do not compound to hurt QA evals when the model is required to generate many tokens.
![Image 6: Refer to caption](https://arxiv.org/html/2403.17887v2/x6.png)
Figure 6: Evaluation of Llama-2 70B with the similarity-informed pruning strategy across different evaluation tasks. (_Left:_ Chain-of-Thought MMLU (CoT-MMLU), _Center:_ GSM8K, _Right:_ HellaSwag.) We see that GSM8K and HellaSwag show immediate degradation of performance with any level of pruning, while CoT-MMLU behaves qualitatively similarly to MMLU without CoT; this suggests that the deeper layers are likely necessary for reasoning tasks.
Now at the conclusion of the work, we are left with the following questions:
* What are better layer-pruning strategies? What are better approaches to healing? (Footnote 8: At the cost of introducing another hyperparameter and requiring both pruned and unpruned models to fit in memory during finetuning, one natural way to improve healing is by adding an auxiliary student-teacher loss that explicitly addresses the pruning mismatch ([5](https://arxiv.org/html/2403.17887v2#S3.E5 "In 3.1 Intuition ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), such as $\mathcal{L}_{\text{aux}} \sim \left(x^{(\ell^{*}+n)}(\theta_{0}) - x^{(\ell^{*})}(\theta)\right)^{2}$ (8), where $\theta_{0}$ are the frozen parameters of the unpruned model, and $\theta$ are the parameters of the pruned model to be healed; thus, $x^{(\ell^{*}+n)}(\theta_{0})$ is the input to the $(\ell^{*}+n)$-th layer in the unpruned model, $x^{(\ell^{*})}(\theta)$ is the input to that same layer after pruning, and $\mathcal{L}_{\text{aux}}$ minimizes their mismatch. We thank Sho Yaida for this observation.)
* Why does healing eliminate the phase transition in the loss but not in the QA accuracies?
* With more comprehensive evals, will accuracy on different tasks degrade at different depths?
* Relatedly, is knowledge generally stored in shallow or middle layers, or is it delocalized?
* Can we devise a pruning strategy that is robust for reasoning tasks?
* Do pretraining details affect the ability to prune, e.g., are scaling-law over-trained or distilled models more difficult to prune?
* How can we enable LLMs to more effectively use the parameters in their deepest layers?
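The auxiliary student-teacher loss suggested in the footnote to the first question above can be sketched as follows. This is only an illustration under stated assumptions: the footnote gives the loss up to proportionality, so the mean over vector components is a choice made here, as are the function names and the way the auxiliary term is combined with the language-modeling loss through a hypothetical weight `aux_weight` (standing in for the extra hyperparameter the footnote mentions).

```python
def auxiliary_mismatch_loss(teacher_input, student_input):
    """Mean squared mismatch between two representation vectors.

    teacher_input: input to layer (ell* + n) in the frozen unpruned model.
    student_input: input to that same layer in the pruned model being healed.
    """
    if len(teacher_input) != len(student_input):
        raise ValueError("representations must have the same dimension")
    squared = [(t - s) ** 2 for t, s in zip(teacher_input, student_input)]
    return sum(squared) / len(squared)


def healing_loss(lm_loss: float, aux_loss: float, aux_weight: float) -> float:
    """Combine the next-token loss with the weighted auxiliary term."""
    return lm_loss + aux_weight * aux_loss


# Toy vectors: identical representations give zero mismatch.
no_mismatch = auxiliary_mismatch_loss([1.0, 2.0], [1.0, 2.0])
some_mismatch = auxiliary_mismatch_loss([1.0, 2.0], [0.0, 0.0])
total = healing_loss(lm_loss=2.0, aux_loss=some_mismatch, aux_weight=0.1)
```

In a real training loop the teacher representation would come from a forward pass of the unpruned model, which is why both models would need to fit in memory.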
Some of these questions would benefit from studying both layer similarity and pruning across different pretraining checkpoints; for instance, at what point does the sharp phase transition and critical depth in the QA accuracies emerge, and does more training lead to better use of the prunable parameters? Others suggest explorations with different pretraining architectures and objectives, e.g. in order to better make use of the deeper layers (for example, one can imagine applying layer dropout (Fan et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib22)) or early exit during pre-training (Elhoushi et al., [2024](https://arxiv.org/html/2403.17887v2#bib.bib46)) to induce equal usage of layers). With more comprehensive evaluations, if different kinds of QA tasks degrade at very different depths, then this might indicate that the knowledge required to complete those tasks is stored across different layers. (Footnote 9: Alternatively, one could measure $d(x^{(\ell)}, x^{(\ell+n)})$ or find $\ell^{*}(n)$ as a function of different eval datasets.) It would be very interesting to use pruning to systematically study these kinds of interpretability questions.
Acknowledgments and Disclosure of Funding
-----------------------------------------
We thank Aaron Schwartz for his initial collaboration, Aaditya Singh and Sho Yaida for discussions, and Aaditya Singh for comments on the draft. We would also like to acknowledge the 2023 NeurIPS Large Language Model Efficiency Challenge for initializing us for work on this project. A.G. is supported by the NSF CAREER grant DMR-2045181, the Sloan Foundation, and by the Laboratory for Physical Sciences through the Condensed Matter Theory Center. D.R. acknowledges support from the National Science Foundation under Cooperative Agreement PHY-2019786 (the NSF AI Institute for Artificial Intelligence and Fundamental Interactions, http://iaifi.org/) and appreciates both the sanction and support of Sequoia Capital. This paper has been brought to you residually by the letters $G$, $P$, and $U$, after summing over many layers.
References
----------
* Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. _arXiv preprint arXiv:2307.09288_, 2023.
* nostalgebraist (2020) nostalgebraist. interpreting gpt: the logit lens. [https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens](https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens), 2020.
* Belrose et al. (2023) Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens. _arXiv preprint arXiv:2303.08112_, 2023.
* Chen et al. (2018) Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. _Advances in neural information processing systems_, 31, 2018.
* Yang et al. (2023) Greg Yang, Dingli Yu, Chen Zhu, and Soufiane Hayou. Tensor programs vi: Feature learning in infinite-depth neural networks. _arXiv preprint arXiv:2310.02244_, 2023.
* LeCun et al. (1989) Yann LeCun, John Denker, and Sara Solla. Optimal brain damage. In D.Touretzky, editor, _Advances in Neural Information Processing Systems_, volume 2. Morgan-Kaufmann, 1989.
* Hassibi and Stork (1992) Babak Hassibi and David Stork. Second order derivatives for network pruning: Optimal brain surgeon. In S.Hanson, J.Cowan, and C.Giles, editors, _Advances in Neural Information Processing Systems_, volume 5. Morgan-Kaufmann, 1992.
* Han et al. (2015) Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. _Advances in neural information processing systems_, 28, 2015.
* Chen et al. (2015) Wenlin Chen, James Wilson, Stephen Tyree, Kilian Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In _International conference on machine learning_, pages 2285–2294. PMLR, 2015.
* Srinivas and Babu (2015) Suraj Srinivas and R Venkatesh Babu. Data-free parameter pruning for deep neural networks. _arXiv preprint arXiv:1507.06149_, 2015.
* Li et al. (2016) Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. _arXiv preprint arXiv:1608.08710_, 2016.
* Wen et al. (2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. _Advances in neural information processing systems_, 29, 2016.
* Hu et al. (2016) Hengyuan Hu, Rui Peng, Yu-Wing Tai, and Chi-Keung Tang. Network trimming: A data-driven neuron pruning approach towards efficient deep architectures. _arXiv preprint arXiv:1607.03250_, 2016.
* He et al. (2017) Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In _Proceedings of the IEEE international conference on computer vision_, pages 1389–1397, 2017.
* Huang et al. (2018) Gao Huang, Shichen Liu, Laurens Van der Maaten, and Kilian Q Weinberger. Condensenet: An efficient densenet using learned group convolutions. In _Proceedings of the IEEE conference on computer vision and pattern recognition_, pages 2752–2761, 2018.
* Murray and Chiang (2015) Kenton Murray and David Chiang. Auto-sizing neural networks: With applications to n-gram language models. _arXiv preprint arXiv:1508.05051_, 2015.
* See et al. (2016) Abigail See, Minh-Thang Luong, and Christopher D Manning. Compression of neural machine translation models via pruning. _arXiv preprint arXiv:1606.09274_, 2016.
* Kim and Rush (2016) Yoon Kim and Alexander M Rush. Sequence-level knowledge distillation. _arXiv preprint arXiv:1606.07947_, 2016.
* Voita et al. (2019) Elena Voita, David Talbot, Fedor Moiseev, Rico Sennrich, and Ivan Titov. Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. _arXiv preprint arXiv:1905.09418_, 2019.
* Michel et al. (2019) Paul Michel, Omer Levy, and Graham Neubig. Are sixteen heads really better than one? _Advances in neural information processing systems_, 32, 2019.
* Kim and Awadalla (2020) Young Jin Kim and Hany Hassan Awadalla. Fastformers: Highly efficient transformer models for natural language understanding. _arXiv preprint arXiv:2010.13382_, 2020.
* Fan et al. (2019) Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with structured dropout. _arXiv preprint arXiv:1909.11556_, 2019.
* Zhang and He (2020) Minjia Zhang and Yuxiong He. Accelerating training of transformer-based language models with progressive layer dropping. _Advances in Neural Information Processing Systems_, 33:14011–14023, 2020.
* Fan et al. (2021) Chun Fan, Jiwei Li, Xiang Ao, Fei Wu, Yuxian Meng, and Xiaofei Sun. Layer-wise model pruning based on mutual information. _arXiv preprint arXiv:2108.12594_, 2021.
* Jha et al. (2023) Ananya Harsh Jha, Dirk Groeneveld, Emma Strubell, and Iz Beltagy. Large language model distillation doesn't need a teacher. _arXiv preprint arXiv:2305.14864_, 2023.
* Sajjad et al. (2023) Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. On the effect of dropping layers of pre-trained transformer models. _Computer Speech & Language_, 77:101429, 2023.
* Liu et al. (2023a) Wei Liu, Zhiyuan Peng, and Tan Lee. Comflp: Correlation measure based fast search on asr layer pruning. _arXiv preprint arXiv:2309.11768_, 2023a.
* Hou et al. (2020) Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dynabert: Dynamic bert with adaptive width and depth. _Advances in Neural Information Processing Systems_, 33:9782–9793, 2020.
* Sharma et al. (2023) Pratyusha Sharma, Jordan T Ash, and Dipendra Misra. The truth is in there: Improving reasoning in language models with layer-selective rank reduction. _arXiv preprint arXiv:2312.13558_, 2023.
* Ashkboos et al. (2024) Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. Slicegpt: Compress large language models by deleting rows and columns. _arXiv preprint arXiv:2401.15024_, 2024.
* Xia et al. (2022) Mengzhou Xia, Zexuan Zhong, and Danqi Chen. Structured pruning learns compact and accurate models. _arXiv preprint arXiv:2204.00408_, 2022.
* Lagunas et al. (2021) François Lagunas, Ella Charlaix, Victor Sanh, and Alexander M Rush. Block pruning for faster transformers. _arXiv preprint arXiv:2109.04838_, 2021.
* Men et al. (2024) Xin Men, Mingyu Xu, Qingyu Zhang, Bingning Wang, Hongyu Lin, Yaojie Lu, Xianpei Han, and Weipeng Chen. Shortgpt: Layers in large language models are more redundant than you expect. _arXiv preprint arXiv:2403.03853_, 2024.
* Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. _arXiv preprint arXiv:2309.16609_, 2023.
* Jiang et al. (2023a) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. _arXiv preprint arXiv:2310.06825_, 2023a.
* Javaheripi and Bubeck (2023) Mojan Javaheripi and Sébastien Bubeck. Phi-2: The surprising power of small language models, Dec 2023.
* Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. _arXiv preprint arXiv:2305.14314_, 2023.
* Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. _The Journal of Machine Learning Research_, 21(1):5485–5551, 2020.
* Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. _arXiv preprint arXiv:2009.03300_, 2020.
* Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. _arXiv preprint arXiv:1905.10044_, 2019.
* Schaeffer et al. (2023) Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? _arXiv preprint arXiv:2304.15004_, 2023.
* Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. _arXiv preprint arXiv:2110.14168_, 2021.
* Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? _arXiv preprint arXiv:1905.07830_, 2019.
* Gao et al. (2023) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noach, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, 12 2023. URL [https://zenodo.org/records/10256836](https://zenodo.org/records/10256836).
* Chung et al. (2024) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. _Journal of Machine Learning Research_, 25(70):1–53, 2024.
* Elhoushi et al. (2024) Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, et al. Layer skip: Enabling early exit inference and self-speculative decoding. _arXiv preprint arXiv:2404.16710_, 2024.
* Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. _Advances in neural information processing systems_, 30, 2017.
* Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. _arXiv preprint arXiv:1810.04805_, 2018.
* Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019. URL [https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf).
* Zhong et al. (2023) Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, and Dacheng Tao. Can chatgpt understand too? a comparative study on chatgpt and fine-tuned bert. _arXiv preprint arXiv:2302.10198_, 2023.
* Ethayarajh (2019) Kawin Ethayarajh. How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings. _arXiv preprint arXiv:1909.00512_, 2019.
* Baevski et al. (2020) Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. _Advances in neural information processing systems_, 33:12449–12460, 2020.
* Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. _arXiv preprint arXiv:1503.02531_, 2015.
* Gu et al. (2023) Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Knowledge distillation of large language models. _arXiv preprint arXiv:2306.08543_, 2023.
* Jiao et al. (2019) Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding. _arXiv preprint arXiv:1909.10351_, 2019.
* Wang et al. (2021) Shuohang Wang, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. Want to reduce labeling cost? gpt-3 can help. _arXiv preprint arXiv:2108.13487_, 2021.
* Eldan and Li (2023) Ronen Eldan and Yuanzhi Li. Tinystories: How small can language models be and still speak coherent english? _arXiv preprint arXiv:2305.07759_, 2023.
* Li et al. (2023a) Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report. _arXiv preprint arXiv:2309.05463_, 2023a.
* Gunasekar et al. (2023) Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, et al. Textbooks are all you need. _arXiv preprint arXiv:2306.11644_, 2023.
* Fu et al. (2023) Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, and Tushar Khot. Specializing smaller language models towards multi-step reasoning. _arXiv preprint arXiv:2301.12726_, 2023.
* Hsieh et al. (2023) Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. _arXiv preprint arXiv:2305.02301_, 2023.
* Jiang et al. (2023b) Yuxin Jiang, Chunkit Chan, Mingyang Chen, and Wei Wang. Lion: Adversarial distillation of closed-source large language model. _arXiv preprint arXiv:2305.12870_, 2023b.
* Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. _arXiv preprint arXiv:2106.09685_, 2021.
* Li et al. (2023b) Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, and Tuo Zhao. Loftq: Lora-fine-tuning-aware quantization for large language models. _arXiv preprint arXiv:2310.08659_, 2023b.
* Zhang et al. (2023) Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adaptive budget allocation for parameter-efficient fine-tuning. _arXiv preprint arXiv:2303.10512_, 2023.
* Leviathan et al. (2023) Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In _International Conference on Machine Learning_, pages 19274–19286. PMLR, 2023.
* Cai et al. (2024) Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. _arXiv preprint arXiv:2401.10774_, 2024.
* Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. _Advances in Neural Information Processing Systems_, 35:17359–17372, 2022.
* Dai et al. (2021) Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. _arXiv preprint arXiv:2104.08696_, 2021.
* Hase et al. (2023) Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun. Does localization inform editing? surprising differences in causality-based localization vs. knowledge editing in language models. _arXiv preprint arXiv:2301.04213_, 2023.
* Geva et al. (2023) Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. _arXiv preprint arXiv:2304.14767_, 2023.
* Din et al. (2023) Alexander Yom Din, Taelin Karidi, Leshem Choshen, and Mor Geva. Jump to conclusions: Short-cutting transformers with linear transformations. _arXiv preprint arXiv:2303.09435_, 2023.
* Gurnee and Tegmark (2023) Wes Gurnee and Max Tegmark. Language models represent space and time. _arXiv preprint arXiv:2310.02207_, 2023.
* Voita et al. (2023) Elena Voita, Javier Ferrando, and Christoforos Nalmpantis. Neurons in large language models: Dead, n-gram, positional. _arXiv preprint arXiv:2309.04827_, 2023.
* Liu et al. (2023b) Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, et al. Deja vu: Contextual sparsity for efficient llms at inference time. In _International Conference on Machine Learning_, pages 22137–22176. PMLR, 2023b.
* Panigrahi et al. (2023) Abhishek Panigrahi, Nikunj Saunshi, Haoyu Zhao, and Sanjeev Arora. Task-specific skill localization in fine-tuned language models. _arXiv preprint arXiv:2302.06600_, 2023.
* Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In _Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations_, pages 38–45, Online, October 2020. Association for Computational Linguistics. URL [https://www.aclweb.org/anthology/2020.emnlp-demos.6](https://www.aclweb.org/anthology/2020.emnlp-demos.6).
* Raffel et al. (2019) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. _arXiv e-prints_, 2019.
* Mangrulkar et al. (2022) Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. Peft: State-of-the-art parameter-efficient fine-tuning methods. [https://github.com/huggingface/peft](https://github.com/huggingface/peft), 2022.
* Lee et al. (2023) Ariel N Lee, Cole J Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of llms. _arXiv preprint arXiv:2308.07317_, 2023.
* Dettmers et al. (2022) Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. _arXiv preprint arXiv:2208.07339_, 2022.
Appendix A Extended Literature Review
-------------------------------------
In this section, we review practical strategies for post-training efficiency and discuss some scientific investigations that provide motivation for, or insight into, our approach: in §[A.1](https://arxiv.org/html/2403.17887v2#A1.SS1 "A.1 Pruning ‣ Appendix A Extended Literature Review ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we first review the history of pruning and then discuss its modern application to LLMs; in §[A.2](https://arxiv.org/html/2403.17887v2#A1.SS2 "A.2 Model distillation ‣ Appendix A Extended Literature Review ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we contrast pruning with distillation, an alternative strategy for reducing the parameter count of LLMs; then in §[A.3](https://arxiv.org/html/2403.17887v2#A1.SS3 "A.3 Efficient finetuning and inference acceleration ‣ Appendix A Extended Literature Review ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we discuss the various practical methods for efficient finetuning and inference acceleration that can be used in conjunction with our pruning strategy; finally in §[A.4](https://arxiv.org/html/2403.17887v2#A1.SS4 "A.4 A breadth of depth-dependent studies ‣ Appendix A Extended Literature Review ‣ The Unreasonable Ineffectiveness of the Deeper Layers") we highlight some scientific investigations into some depth-dependent statistical properties of LLMs that are complementary to our results.
### A.1 Pruning
_Pruning_ is a method for reducing the size of a trained machine-learning model by removing unnecessary parameters, either individually or together as a group. Pruning for neural networks has a long history (LeCun et al., [1989](https://arxiv.org/html/2403.17887v2#bib.bib6), Hassibi and Stork, [1992](https://arxiv.org/html/2403.17887v2#bib.bib7)), and, as originally conceived, _unstructured pruning_ techniques sparsify networks by removing individual parameters based on pre-defined criteria. For instance, if a parameter of the model has a very small value, then removing it, i.e. by setting it to exactly zero, will likely have minimal impact on performance. Inspired by this early work, modern researchers began exploring different criteria for such unstructured pruning, focusing mostly on computer vision models (Han et al., [2015](https://arxiv.org/html/2403.17887v2#bib.bib8), Chen et al., [2015](https://arxiv.org/html/2403.17887v2#bib.bib9), Srinivas and Babu, [2015](https://arxiv.org/html/2403.17887v2#bib.bib10)). In particular, Han et al. ([2015](https://arxiv.org/html/2403.17887v2#bib.bib8)) developed an _iterative pruning_ method for alternately pruning and finetuning a network in order to reach better compression ratios and performance.
While these models were smaller, they were not necessarily more efficient: sparsifying networks by removing individual parameters according to a criterion leads to irregular or pseudorandom sparsification patterns that are difficult to accelerate without specialized hardware or libraries designed for sparsity (Li et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib11)). To that end, _structured pruning_ techniques were developed to remove irrelevant groups of parameters together, such as particular channels or filters in convolutional networks. As this increased their practical relevance, researchers then began exploring structured pruning across computer vision (Li et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib11), Wen et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib12), Hu et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib13), He et al., [2017](https://arxiv.org/html/2403.17887v2#bib.bib14), Huang et al., [2018](https://arxiv.org/html/2403.17887v2#bib.bib15)) and pre-transformer NLP architectures (Murray and Chiang, [2015](https://arxiv.org/html/2403.17887v2#bib.bib16), See et al., [2016](https://arxiv.org/html/2403.17887v2#bib.bib17), Kim and Rush, [2016](https://arxiv.org/html/2403.17887v2#bib.bib18)).
Following unprecedented progress in language modeling, recent work has focused on applying structured pruning methods to the Transformer (Vaswani et al., [2017](https://arxiv.org/html/2403.17887v2#bib.bib47)). These studies consider nearly every possible component of the model architecture for elimination, with methods ranging from dropping attention heads (Voita et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib19), Michel et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib20), Kim and Awadalla, [2020](https://arxiv.org/html/2403.17887v2#bib.bib21)), to dropping layers (Fan et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib22), Zhang and He, [2020](https://arxiv.org/html/2403.17887v2#bib.bib23), Fan et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib24), Jha et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib25), Sajjad et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib26), Liu et al., [2023a](https://arxiv.org/html/2403.17887v2#bib.bib27)), to pruning hidden states (Hou et al., [2020](https://arxiv.org/html/2403.17887v2#bib.bib28)), to rank-reducing large weight matrices (Sharma et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib29)), to replacing sparse weight matrices with smaller dense ones (Ashkboos et al., [2024](https://arxiv.org/html/2403.17887v2#bib.bib30)), to many combinations of the aforementioned groups (Xia et al., [2022](https://arxiv.org/html/2403.17887v2#bib.bib31), Lagunas et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib32)).
Of the prior work that also considers transformer layer dropping, most (Fan et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib22), Zhang and He, [2020](https://arxiv.org/html/2403.17887v2#bib.bib23), Fan et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib24), Xia et al., [2022](https://arxiv.org/html/2403.17887v2#bib.bib31), Sajjad et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib26)) study BERT-style models (Devlin et al., [2018](https://arxiv.org/html/2403.17887v2#bib.bib48)), while we consider decoder-only GPT-style models (Radford et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib49)) that are most commonly used for large-scale language modeling and generation. BERT-style models are naturally suited for understanding tasks due to their bidirectional masked language modeling (MLM) objective, while GPT-style models are instead suited for generation, due to their autoregressive objective. While this divide has been questioned in light of more powerful GPT-style models (Zhong et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib50)), previous work (Ethayarajh, [2019](https://arxiv.org/html/2403.17887v2#bib.bib51)) has found significant qualitative differences between BERT and GPT models in terms of the evolution of the layer-wise representation of words. Altogether, this suggests that layer-dropping strategies will behave differently between the two families.
One study for BERT-style pre-trained models, Sajjad et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib26)), concludes that the best layer-pruning strategy is dropping the final layers; this partially resonates with our results, although in contrast we find that _(a)_ for some pruning sizes keeping the last few layers of the model is actually beneficial, and that _(b)_ for all pruning sizes keeping the very last layer is essential. Additionally, while the authors also study similarity between representations in different layers, as in our approach, they actually found a higher similarity between representations in the shallow layers compared to the deeper ones, which very sharply disagrees with our results. Importantly, the models considered in Sajjad et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib26)) consist of a few hundred million parameters, which is much smaller than the model scales we consider in our work. Perhaps as a consequence, the authors didn't observe the sharp transition in downstream accuracies that we report in §[4.1](https://arxiv.org/html/2403.17887v2#S4.SS1 "4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), despite the fact that they also finetuned their pruned models.
In contrast, while Jha et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib25)) does consider GPT-style models, the methodology is quite different from ours: _(i)_ rather than pretraining first and then using a fixed layer-dropping strategy as we do, instead the authors incrementally drop layers in a modified pretraining procedure; and _(ii)_ the authors study their own sub-1B parameter models, while we focus on the families of readily available, open-weight, large-scale 2.7B-70B parameter models that are commonly used and/or finetuned for practical applications.
As we were finalizing our preprint, Men et al. ([2024](https://arxiv.org/html/2403.17887v2#bib.bib33)) was posted: this paper empirically studies different layer-pruning strategies for GPT-style models (Llama-2 7B and Baichuan2-7B-base) and their subsequent effects on benchmarks (MMLU, CMMLU, and CMNLI). They investigate various layer-importance metrics (notably, their "Block Influence" function is similar to our cosine similarity metric) and find that they are able to prune up to $\sim 28\%$ of layers of Llama-2 7B with minimal impact on performance. This provides independent evidence supporting our main takeaway that the deeper layers are not critical for storing knowledge.
Finally, a systematic approach to layer dropping in transformers has also been studied in the context of _wav2vec_ models, which are encoder-only models that map speech to embeddings and are sized in the hundred-million parameter regime (Baevski et al., [2020](https://arxiv.org/html/2403.17887v2#bib.bib52)). With these models, Liu et al. ([2023a](https://arxiv.org/html/2403.17887v2#bib.bib27)) developed a layer-pruning algorithm based on the correlation between layers and downstream metrics. Beyond the model architecture and domain, one significant difference between this and our work is that Liu et al. ([2023a](https://arxiv.org/html/2403.17887v2#bib.bib27)) considered non-contiguous pruning proposals, e.g. dropping alternate layers. Our intuition for layer pruning predicts that this shouldn't work as well, at least for decoder-only language models, as it creates multiple mismatches, one with each block of layers removed.
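To make the "one mismatch per removed block" intuition concrete, the following minimal sketch counts how many separate contiguous blocks a pruning proposal removes. The function name and the example layer indices are hypothetical and chosen only for illustration.

```python
def count_removed_blocks(removed_layers):
    """Count the maximal runs of consecutive layer indices in `removed_layers`.

    Each run is one contiguous block of dropped layers, i.e. one place where
    a layer receives input from a layer it was not originally connected to.
    """
    layers = sorted(set(removed_layers))
    return sum(
        1
        for i, layer in enumerate(layers)
        if i == 0 or layer != layers[i - 1] + 1
    )


# Dropping 8 contiguous layers creates a single block (one mismatch) ...
contiguous = count_removed_blocks(range(20, 28))
# ... while dropping 8 alternate layers creates 8 separate blocks.
alternating = count_removed_blocks(range(1, 16, 2))
```

For the same number of dropped layers, the alternating proposal introduces eight seams instead of one.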
### A.2 Model distillation
A completely different method for reducing the size of a trained machine-learning model is _model distillation_ (Hinton et al., [2015](https://arxiv.org/html/2403.17887v2#bib.bib53)), in which knowledge is transferred from a large “teacher” model to a smaller “student” model by training the student on the distribution predicted by the teacher. The essential insight is that this can transform the very general knowledge and capabilities of the teacher into more streamlined, compressed, and possibly skill-specific representations.
While a very general technique, in the setting of language models, distillation has been implemented with _(a)_ white-box approaches, in which the student is trained to imitate the teacher's logits (Gu et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib54)) or hidden states (Jiao et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib55)); as well as with _(b)_ black-box approaches, in which the student only has access to the output tokens generated by the teacher. This latter approach broadly covers cases where the student is trained on text that is augmented by the teacher in some way, such as by adding synthetic labels (Wang et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib56)), by generating high quality synthetic text (Eldan and Li, [2023](https://arxiv.org/html/2403.17887v2#bib.bib57), Li et al., [2023a](https://arxiv.org/html/2403.17887v2#bib.bib58), Gunasekar et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib59)), by providing chain of thought reasoning (Fu et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib60), Hsieh et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib61)), which aims to enhance the student's reasoning skills, or by annotating instructions that enhance the student's instruction-following capabilities (Jiang et al., [2023b](https://arxiv.org/html/2403.17887v2#bib.bib62)).
Compared to layer pruning, these distillation methods require considerable computational resources due to the reliance on the large teacher to process a big corpus of data. Instead, our similarity-based pruning strategy only requires computing the similarity between representations at different layers on a small subset of a pretraining corpus, while our second simpler pruning strategy only uses the reduced model post pruning.
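The similarity computation can be sketched as follows. This is a minimal illustration, assuming the hidden states have already been collected for a small batch of samples, that similarity is measured as an angular distance derived from cosine similarity, and that the block to prune is the one whose input and output representations are closest. The function names, array shapes, and the argmin selection rule are assumptions of this sketch and are not specified in this appendix.

```python
import numpy as np


def angular_distance(a, b, eps=1e-12):
    """Mean angular distance, scaled to [0, 1], between paired rows of a and b."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    cos = np.sum(a * b, axis=-1) / (
        np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    )
    # Clip guards against values marginally outside [-1, 1] from rounding.
    return float(np.mean(np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi))


def best_block_to_prune(hidden_states, n):
    """Find the start of the n-layer block whose input and output are most similar.

    hidden_states[l] holds the representation entering layer l, with shape
    (num_samples, hidden_dim); the list has num_layers + 1 entries.
    Returns (start_layer, distance).
    """
    num_layers = len(hidden_states) - 1
    distances = [
        angular_distance(hidden_states[l], hidden_states[l + n])
        for l in range(num_layers - n + 1)
    ]
    start = int(np.argmin(distances))
    return start, distances[start]
```

Only forward passes on a small sample are needed to fill `hidden_states`, which is what keeps this cheaper than running a large teacher over a big corpus.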
### A.3 Efficient finetuning and inference acceleration
Complementary to directly reducing the size of a model, _parameter-efficient finetuning_ (PEFT) focuses on reducing the cost of specializing LLMs to certain tasks. In particular, Low Rank Adapters (LoRA) reduce the memory and compute of fine tuning by freezing the pretrained model and introducing a parametrically small number of additional trainable weights (Hu et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib63)). We use its quantized cousin, QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib37)), to keep our experiments cost efficient. Other PEFT methods that can be combined with our work are Li et al. ([2023b](https://arxiv.org/html/2403.17887v2#bib.bib64)) and Zhang et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib65)): in the first, the initialization of the LoRA matrices is adjusted to a quantization scheme; in the second, LoRA ranks for different LLM modules are chosen in an adaptive manner.
For additional efficiency gains we could combine our layer-pruned models with methods that further accelerate inference: with speculative decoding (Leviathan et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib66)), tokens are rapidly generated from a smaller draft model and then evaluated in parallel by the main model; with Medusa (Cai et al., [2024](https://arxiv.org/html/2403.17887v2#bib.bib67)) the draft model is discarded for extra decoding heads, but ultimately achieves a similar effect. In particular, it could be interesting to consider highly-compressed layer-pruned models as potential draft models in a speculative decoding setup.
### A.4 A breadth of depth-dependent studies
Finally, let us highlight some scientific works that study the depth-dependent properties of LLMs. One relevant direction considers how knowledge and linguistic properties are encoded in language models. On the one hand, Meng et al. ([2022](https://arxiv.org/html/2403.17887v2#bib.bib68)) and Dai et al. ([2021](https://arxiv.org/html/2403.17887v2#bib.bib69)) analyze the _storage and recall_ of factual associations: these works emphasize that knowledge localizes within the middle (Meng et al., [2022](https://arxiv.org/html/2403.17887v2#bib.bib68)) or final (Dai et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib69)) layers, which has implications for directly editing or erasing part of a model's factual knowledge. On the other hand, attempts to perform such editing give evidence that information may be stored non-locally across layers (Hase et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib70)). Relatedly, Geva et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib71)) investigates the way facts are _processed_ during inference, distinguishing between the role of attention heads, for attribute extraction, and the MLP blocks, for subject enrichment: both are delocalized across several layers.
Next, following the earlier “logit lens” (nostalgebraist, [2020](https://arxiv.org/html/2403.17887v2#bib.bib2)), Belrose et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib3)) invented a technique they called “tuned lens” to study the _trajectory of predictions_ by using a learnable affine transformation to convert intermediate representations into a distribution over tokens (see also Din et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib72))). By studying the layer-to-layer dynamics of this distribution, the authors noted that it tended to converge. This convergence is very suggestive that the deeper layers could be prunable, while the fact that they had to train an affine probe is likely related to our observation that the final layer cannot be pruned. Somewhat relatedly, Gurnee and Tegmark ([2023](https://arxiv.org/html/2403.17887v2#bib.bib73)) observed that geographic features in the underlying text can be determined from linear probes trained on intermediate activations, as long as the activations are deeper than halfway.
More abstractly, Voita et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib74)) and Liu et al. ([2023b](https://arxiv.org/html/2403.17887v2#bib.bib75)) found that the sparsity of activations transitions at around halfway through a network's forward pass, evolving from sparse to dense. Perhaps relatedly, Panigrahi et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib76)) investigated which model weights update the most during finetuning, finding that it's those in the mid-layers.
Altogether, these deep studies are complementary to our work, which, on the one hand, provides evidence that removing the deepest layers of an LLM does not significantly alter the model's performance, and, on the other hand, demonstrates a sharp pruning transition after removing approximately half of an LLM's deepest layers.
Appendix B Experimental Details
-------------------------------
Here we explain various details of models and healing (§[B.1](https://arxiv.org/html/2403.17887v2#A2.SS1 "B.1 Model and healing details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) and of evaluations (§[B.2](https://arxiv.org/html/2403.17887v2#A2.SS2 "B.2 Evaluation details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers")).
### B.1 Model and healing details
All models in this paper were fine-tuned using the Hugging Face Trainer API (Wolf et al., [2020](https://arxiv.org/html/2403.17887v2#bib.bib77)). A list of models and their paths on Hugging Face are as follows:
For healing, we used the version of the Colossal Clean Crawled Corpus (C4) (Raffel et al., [2019](https://arxiv.org/html/2403.17887v2#bib.bib78)) from Hugging Face: `data = load_dataset("c4", "en")`. We truncated long examples as described later in the paragraph and added special tokens when available. (N.B. the Qwen tokenizer from Hugging Face does not include any special tokens; in this case, it was essential to add a default padding token.) Models were finetuned for 5000 steps with a global batch size of 16: this corresponds to total finetuning tokens of $16 \times 5000 \times [\texttt{max\_seq\_length}]$ for each model. We used a cosine-annealed learning rate schedule, with a warmup of 100 steps. When possible, the peak learning rate was set to the peak learning rate from the model's pretraining; in practice, this means all models were trained with a peak LR of 3e-4, with the exceptions of Phi-2 (Javaheripi and Bubeck, [2023](https://arxiv.org/html/2403.17887v2#bib.bib36)), which was trained with a peak LR of 2e-4 during pre-training, Llama-2-70B, which was trained with a peak LR of 3e-5 (a value that resulted from a sweep), and Mistral-7B, which was trained with a peak LR of 3e-6 (also a value that resulted from a sweep). All models 7B parameters or smaller were trained with a max sequence length of 2048 tokens, while all models 13B parameters or greater were trained with a max sequence length of 4096 tokens. While we realize that some models may have been pretrained on longer sequences, e.g. Qwen _-the-outlier_ (Bai et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib34)), we decided to keep the max sequence length consistent across models of similar size to allow fairer comparisons across model families.
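As a worked example of the token budget and schedule described above, the sketch below computes the total finetuning tokens for the two sequence lengths and evaluates a cosine-annealed learning rate with warmup. The linear shape of the warmup and the decay to exactly zero are assumptions of this sketch; the text only specifies a cosine-annealed schedule with 100 warmup steps. The function names are hypothetical.

```python
import math


def healing_tokens(max_seq_length, steps=5000, global_batch_size=16):
    """Total finetuning tokens: batch size x steps x max sequence length."""
    return global_batch_size * steps * max_seq_length


def lr_at_step(step, peak_lr=3e-4, warmup_steps=100, total_steps=5000):
    """Learning rate at a given step.

    Assumed shape: linear warmup from 0 to peak_lr over `warmup_steps`,
    then cosine decay from peak_lr to 0 at `total_steps`.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


tokens_small = healing_tokens(2048)  # models with 7B parameters or fewer
tokens_large = healing_tokens(4096)  # models with 13B parameters or more
```

With these settings, the smaller models see 163,840,000 tokens and the larger models see 327,680,000 tokens, counting every sequence at its maximum length.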
On top of the Hugging Face Trainer API, we used quantization and Low-Rank Adapters (LoRA) (Hu et al., [2021](https://arxiv.org/html/2403.17887v2#bib.bib63)) for all of our finetuning:
* For quantization, we used the bitsandbytes library for QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2403.17887v2#bib.bib37)) to quantize our models to 4 bits.
* For LoRA, we used the Hugging Face peft library (Mangrulkar et al., [2022](https://arxiv.org/html/2403.17887v2#bib.bib79)). We set the LoRA dropout to 0.05 and kept the LoRA $\alpha$ equivalent to the LoRA rank, following Lee et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib80)). Aside from two exceptions, discussed below, models are trained with LoRA rank 64.
* Also following Lee et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib80)), we only applied LoRA to FFN modules: `["gate_proj", "down_proj", "up_proj"]` for Llama-2 and Mistral models, `["fc1", "fc2"]` for Phi-2, and `["w1", "w2", "c_proj"]` for Qwen models.
The large majority of these hyperparameter choices are standard and found in previous works, e.g. Lee et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib80)) and Dettmers et al. ([2022](https://arxiv.org/html/2403.17887v2#bib.bib81)). For absolute clarity, we list all the model-specific architecture and healing details below:
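The LoRA settings in the list above can be collected into a small helper. This is an illustrative sketch only: the family keys (`"llama-2"`, `"mistral"`, `"phi-2"`, `"qwen"`) and the function name are hypothetical labels, and the rank is left as a parameter because the two rank exceptions are not spelled out here. The returned keys match argument names accepted by `peft.LoraConfig`, so the dictionary could plausibly be unpacked into it.

```python
# FFN module names per model family, as listed above.
# The dictionary keys are hypothetical labels chosen for this sketch.
FFN_TARGET_MODULES = {
    "llama-2": ["gate_proj", "down_proj", "up_proj"],
    "mistral": ["gate_proj", "down_proj", "up_proj"],
    "phi-2": ["fc1", "fc2"],
    "qwen": ["w1", "w2", "c_proj"],
}


def lora_kwargs(family, rank=64, dropout=0.05):
    """Build LoRA settings with alpha tied to the rank and FFN-only targets."""
    return {
        "r": rank,
        "lora_alpha": rank,  # alpha kept equal to the rank
        "lora_dropout": dropout,
        "target_modules": list(FFN_TARGET_MODULES[family]),
    }


qwen_settings = lora_kwargs("qwen")
```

If the peft library is installed, something like `LoraConfig(**lora_kwargs("qwen"))` would be one way to use the result.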
We also have the following hyperparameters common between all models:
### B.2 Evaluation details
We performed three principal evaluations: accuracy on _MMLU_, accuracy on _BoolQ_, and loss on _C4_.
For MMLU accuracy:
* We use the `cais/mmlu` version of the dataset from Hugging Face.
* We follow the formatting suggested in the original reference (Hendrycks et al., [2020](https://arxiv.org/html/2403.17887v2#bib.bib39)) without further prompt engineering.
* For constructing few-shot examples, we use the `dev` set from `cais/mmlu`.
* For our experiments, we use 0 few-shot examples; our results and analysis are robust to this choice, cf. Figure[8](https://arxiv.org/html/2403.17887v2#A3.F8 "Figure 8 ‣ C.1 Prompting ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers").
* We report average accuracy across all subjects.
For BoolQ accuracy:
* We used the `hassansh/boolq_n_shot` version from Hugging Face.
* For our experiments, we use 0 few-shot examples.
* The complete BoolQ results truncated from the main text are shown here in Figure[7](https://arxiv.org/html/2403.17887v2#A2.F7 "Figure 7 ‣ B.2 Evaluation details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers"): in the left panel we present the Llama-2 family, in the middle panel we present models from the Qwen family, and in the right panel we show Mistral-7B and Phi-2; we also make the experiments without healing semi-transparent in order to better display the results from the complete similarity-informed pruning method. Importantly, while we see here that healing plays a more important role than it did for MMLU in Figure[2](https://arxiv.org/html/2403.17887v2#S4.F2 "Figure 2 ‣ 4.1 Accuracy on QA benchmarks ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), after healing we still have a characteristic flat region of robust performance; as before, the capabilities required to achieve a model's top score aren't removed by significant layer pruning until a critical model-dependent threshold.
![Image 7: Refer to caption](https://arxiv.org/html/2403.17887v2/x7.png)
Figure 7: BoolQ accuracy (0-shot) vs. fraction of layers dropped for different model families. (_Left:_ Llama-2 family; _Middle:_ Qwen family; _Right:_ Mistral-7B and Phi-2.) The solid lines represent performance after dropping layers and healing, the (semi-transparent) dotted lines show performance after dropping layers only (no healing), and the dashed gray line is the score for guessing randomly. For BoolQ, healing leads to important improvements in performance; then, across all models, performance is quite robust until 20%-55% pruning fractions, depending on model family and size, at which point it transitions to random guessing.
For C4 Validation Loss:
* We used the `c4` version from Hugging Face (soon to be deprecated in favor of `allenai/c4`).
* We evaluated using the _validation_ split as we healed with the train split.
* Given its size, we randomly sampled 60k sequences and held them fixed across all models.
* In Figure[3](https://arxiv.org/html/2403.17887v2#S4.F3 "Figure 3 ‣ 4.2 Loss on next-token predictions ‣ 4 Results ‣ The Unreasonable Ineffectiveness of the Deeper Layers") we normalized the loss to facilitate fair comparison across model families that employ different vocab sizes: to normalize, we divided by $\log V$, where $V$ is the _per-model_ vocab size (listed in a table in §[B.1](https://arxiv.org/html/2403.17887v2#A2.SS1 "B.1 Model and healing details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers")). This quantity, $\log V$, corresponds to the loss of sampling tokens uniformly, which naturally sets the scale for a given model.
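
The fixed evaluation subset and the vocab-size normalization in the bullets above amount to a few lines. A minimal sketch, assuming the loss is a mean per-token cross-entropy measured in nats (so that uniform sampling over $V$ tokens gives exactly $\log V$); the function names and the seeded sampler are illustrative assumptions, since the text does not specify how the 60k sequences were drawn:

```python
import math
import random


def sample_fixed_indices(dataset_size, n_samples, seed=0):
    """Draw one random subset of indices; reusing the seed keeps it fixed across models."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(dataset_size), n_samples))


def normalized_loss(mean_nll_nats, vocab_size):
    """Divide the mean per-token loss by log V, the loss of uniform token sampling."""
    return mean_nll_nats / math.log(vocab_size)
```

Under this normalization a model that predicts tokens uniformly at random scores exactly 1 regardless of its vocab size, which is what makes the curves comparable across model families.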
Appendix C Ablations
--------------------
Here we detail various ablations: prompting (§[C.1](https://arxiv.org/html/2403.17887v2#A3.SS1 "C.1 Prompting ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), finetuning seed (§[C.2](https://arxiv.org/html/2403.17887v2#A3.SS2 "C.2 Finetuning seed ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), LoRA rank (§[C.3](https://arxiv.org/html/2403.17887v2#A3.SS3 "C.3 LoRA rank ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers")), other pruning strategies (§[C.4](https://arxiv.org/html/2403.17887v2#A3.SS4 "C.4 Other pruning strategies ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers")). Qualitatively, the results of the paper are quite robust to the variation of any of these.
### C.1 Prompting
It's common knowledge that altering the prompt on QA evaluations can significantly impact results. To control for prompting, we ablate the MMLU accuracy for our principal similarity-informed pruning described in §[3.2](https://arxiv.org/html/2403.17887v2#S3.SS2 "3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers") when applied to Llama-2-13B: in the left panel of Figure[8](https://arxiv.org/html/2403.17887v2#A3.F8 "Figure 8 ‣ C.1 Prompting ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we show results for changing the ordering of the few-shot examples in the prompt, and in the right panel of the same figure, we show results for changing the number of few-shot examples. Broadly we see that the layer-pruning method is robust to these changes.
![Image 8: Refer to caption](https://arxiv.org/html/2403.17887v2/x8.png)
Figure 8: Effect of prompt ablations on MMLU accuracy vs. fraction of layers dropped for Llama-2-13B. _Left:_ We vary the ordering of the few-shot examples and see it does not have any impact. _Right:_ We vary the number $n$ of few-shot examples; while careful study of the flat region suggests increasing the number of few-shot examples marginally improves performance, regardless, the layer-pruning strategy is robust to this kind of variation.
### C.2 Finetuning seed
Here we vary the finetuning seed. For all of our experiments, we use the following code snippet to ensure reproducibility:
```python
import transformers

SEED_VAL = 0
transformers.enable_full_determinism(SEED_VAL)  # seed all RNGs and request deterministic ops
```
Since we begin with a pretrained model, the finetuning seed doesn't affect initialization, but it will impact the stochastic aspects of further training such as data order. To control for this, we ablate the finetuning seed for our principal similarity-informed pruning described in §[3.2](https://arxiv.org/html/2403.17887v2#S3.SS2 "3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers") when applied to Llama-2-13B: in Figure[9](https://arxiv.org/html/2403.17887v2#A3.F9 "Figure 9 ‣ C.2 Finetuning seed ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers") we observe that the layer-pruning method is robust to the choice of seed.
![Image 9: Refer to caption](https://arxiv.org/html/2403.17887v2/x9.png)
Figure 9: Effect of varying the finetuning seed on MMLU accuracy vs. fraction of layers dropped for Llama-2-13B: there is no meaningful effect.
### C.3 LoRA rank
Here we vary the LoRA rank used for healing. Unfortunately, our compute budget did not allow us to make an exhaustive sweep across all of our experimental configurations. In lieu of that, we employed the following protocol for our main experiments:
* Begin with rank 64, following the QLoRA setup (see, e.g. Appendix B.2 of Dettmers et al. ([2023](https://arxiv.org/html/2403.17887v2#bib.bib37))).
* If healing with that rank significantly harms the performance compared to no healing, then sweep LoRA ranks for that model and, for the other evaluations, pick the best performing LoRA rank according to its MMLU accuracy.
This protocol is designed to maximize the chance that healing will improve performance across all of our evaluations. For simplicity, we ran this rank-picking protocol using the simple pruning heuristic, with the exception of Llama-2-70B.
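The rank-picking protocol above can be written as a short decision procedure. This is a hedged sketch only: `heal_and_eval_mmlu` is a hypothetical callable that heals the pruned model at a given rank and returns its MMLU accuracy, and the `tolerance` used to decide what counts as "significantly harms" is an assumption, since the text does not give a numerical threshold.

```python
def pick_lora_rank(heal_and_eval_mmlu, mmlu_no_healing, sweep_ranks,
                   default_rank=64, tolerance=0.01):
    """Keep the default rank unless healing with it clearly hurts; otherwise sweep."""
    default_acc = heal_and_eval_mmlu(default_rank)
    # Step 1: rank 64 is kept if healing does not significantly harm MMLU accuracy.
    if default_acc >= mmlu_no_healing - tolerance:
        return default_rank
    # Step 2: otherwise sweep the candidate ranks and pick the best by MMLU accuracy.
    scores = {default_rank: default_acc}
    for rank in sweep_ranks:
        if rank not in scores:
            scores[rank] = heal_and_eval_mmlu(rank)
    return max(scores, key=scores.get)
```

The sweep is only triggered when the default fails, which reflects the limited compute budget: most models never pay for more than one healing run per configuration.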
In practice, this led to us using rank 64 for every model with the exceptions of Mistral-7B, with rank 4, Llama-2-7B, with rank 2, and Llama-2-70B, with rank 8. (To review this same information in tabular form, see the second Table in §[B.1](https://arxiv.org/html/2403.17887v2#A2.SS1 "B.1 Model and healing details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers").) Figure[10](https://arxiv.org/html/2403.17887v2#A3.F10 "Figure 10 ‣ C.3 LoRA rank ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers") displays the sweeps over MMLU accuracy supporting these choices for Mistral-7B (bottom left panel), Llama-2-7B (bottom middle panel), and Llama-2-70B (top right panel): overall, while the LoRA rank does not have a significant impact on the qualitative behavior of the healed model, decreasing the LoRA rank generally improves performance. In the top left and middle panels of Figure[10](https://arxiv.org/html/2403.17887v2#A3.F10 "Figure 10 ‣ C.3 LoRA rank ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we show corresponding sweeps for Mistral-7B (left) and Llama-2-7B (middle) using the similarity-informed pruning strategy: we see that for this pruning method both models are much more robust, though rank 2 is still the top performing rank for Llama-2-7B.
![Image 10: Refer to caption](https://arxiv.org/html/2403.17887v2/x10.png)
Figure 10: Effect of varying the LoRA rank. Top: 5-shot MMLU accuracy vs. fraction of layers dropped using the similarity-informed pruning strategy on Mistral-7B (_left_), Llama-2-7B (middle), and Llama-2-70B (right). Across all ranks we observe similar behavior, though there's a small effect of decreasing rank improving overall performance. Bottom, left and middle: 5-shot MMLU accuracy vs. fraction of layers dropped using the simple pruning heuristic on Mistral-7B (_left_) and Llama-2-7B (middle). As before, qualitative behavior is similar across ranks, though in this case it's much clearer that decreasing rank improves performance. Bottom, right: C4 validation loss vs. fraction of layers dropped using the similarity-informed pruning strategy on Mistral-7B. In contrast to MMLU, decreasing rank harms performance; together, these results suggest that larger ranks may be overfitting.
The characteristic improvement of MMLU accuracy with decreasing LoRA rank, even for extremely low ranks(!), deserves an explanation. One possibility is that lowering the LoRA rank can better regularize finetuning against overfitting. In particular, astute readers may have been surprised at the discussion of peak learning rates in §[B.1](https://arxiv.org/html/2403.17887v2#A2.SS1 "B.1 Model and healing details ‣ Appendix B Experimental Details ‣ The Unreasonable Ineffectiveness of the Deeper Layers"): models were finetuned with the same peak learning rate used in pretraining; a “large” LoRA rank of 64 introduces a number of additional parameters that may overfit to C4. This overfitting would certainly be harmful, since the actual pretraining datasets for the models we consider are _(a)_ unknown to us, and _(b)_ likely to be of significantly higher quality than C4.
We investigate this directly for Mistral-7B. In the bottom right panel of Figure[10](https://arxiv.org/html/2403.17887v2#A3.F10 "Figure 10 ‣ C.3 LoRA rank ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers") we plot the C4 validation loss across different LoRA ranks: we see that while decreasing the LoRA rank generally improves MMLU accuracy (cf. left-most panels), at the same time it harms the C4 validation loss. This supports our overfitting hypothesis. In a greater-resourced future, it would be interesting to improve the healing process by considering other forms of regularization and learning rate tuning.
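To make the parameter-count side of the overfitting argument concrete: a LoRA adapter of rank $r$ on a weight matrix of shape $d_\text{out} \times d_\text{in}$ adds two factors of shapes $d_\text{out} \times r$ and $r \times d_\text{in}$, i.e. $r\,(d_\text{in} + d_\text{out})$ trainable parameters. The sketch below computes this count; the 4096-dimensional square matrix is an illustrative assumption rather than a statement about which modules were adapted in the experiments.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters added by one LoRA adapter: factors (d_out x r) and (r x d_in)."""
    return rank * (d_in + d_out)


# Illustrative only: a single square 4096 x 4096 projection (assumed size).
params_rank_64 = lora_param_count(4096, 4096, 64)  # 524288
params_rank_2 = lora_param_count(4096, 4096, 2)    # 16384
```

Since the count is linear in $r$, going from rank 64 to rank 2 shrinks the adapter capacity by a factor of 32 per adapted matrix.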
### C.4 Other pruning strategies
Here we study how the similarity-informed pruning strategy (§[3.2](https://arxiv.org/html/2403.17887v2#S3.SS2 "3.2 Layer-pruning algorithm(s) ‣ 3 Method ‣ The Unreasonable Ineffectiveness of the Deeper Layers")) compares to other layer-pruning baselines: specifically, we contrast with pruning random layers and pruning shallow layers. In Figure[11](https://arxiv.org/html/2403.17887v2#A3.F11 "Figure 11 ‣ C.4 Other pruning strategies ‣ Appendix C Ablations ‣ The Unreasonable Ineffectiveness of the Deeper Layers"), we observe that the similarity-informed strategy from the main text outperforms both of these other strategies on an MMLU evaluation of Llama-2-7B.
![Image 11: Refer to caption](https://arxiv.org/html/2403.17887v2/x11.png)
Figure 11: Comparison of the similarity-informed pruning strategy (blue) to random-layer pruning (orange) and shallow-layer pruning (green) on MMLU accuracy, with Llama-2-7B and LoRA rank 64. The similarity-informed pruning strategy clearly outperforms these baselines.
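The three layer-selection rules compared here can be sketched side by side. This is a simplified illustration under several assumptions: the similarity-informed rule is taken to pick the block of $n$ consecutive layers whose input and output representations have the smallest angular distance, $\frac{1}{\pi}\arccos$ of their cosine similarity; each layer is represented by a single vector, whereas in practice the distance would be averaged over many samples; "shallow" is assumed to mean the first $n$ layers; and "random" is assumed to mean $n$ layers drawn without replacement. All function names are hypothetical.

```python
import math
import random


def angular_distance(u, v):
    """Angle between two vectors, scaled to [0, 1] by dividing by pi."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.acos(cos) / math.pi


def similarity_informed_block(hidden_states, n):
    """hidden_states[l] is the input to layer l; the final entry is the model output.

    Returns the n consecutive layer indices whose removal changes the
    representation least, as measured by angular distance.
    """
    num_layers = len(hidden_states) - 1
    best_start = min(
        range(num_layers - n + 1),
        key=lambda l: angular_distance(hidden_states[l], hidden_states[l + n]),
    )
    return list(range(best_start, best_start + n))


def shallow_block(n):
    """Baseline (assumed definition): drop the first n layers."""
    return list(range(n))


def random_layers(num_layers, n, seed=0):
    """Baseline (assumed definition): drop n layers chosen uniformly without replacement."""
    return sorted(random.Random(seed).sample(range(num_layers), n))
```

Only the first rule looks at the model's representations; the two baselines choose layers without any information about how much each layer changes its input.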