mirror of
https://github.com/wassname/adapters_as_hypotheses.git
synced 2026-06-27 15:14:03 +08:00
Remove LessWrong draft, symlink README to adapters_as_hypotheses.md
@@ -1,112 +0,0 @@
# Adapters as Representational Hypotheses: What 30 PEFT Methods Tell Us About Transformer Geometry

*Crossposted from [github.com/wassname/adapters_as_hypotheses](https://github.com/wassname/adapters_as_hypotheses)*

## The core claim

Every parameter-efficient fine-tuning (PEFT) adapter encodes a structural hypothesis about how to intervene in transformer internals. LoRA says weight changes are low-rank. OFT says orthogonal rotations preserve semantic structure. PiSSA says the principal SVD components carry the signal. When one adapter outperforms another under controlled conditions -- same model, same data, same parameter budget -- **the winning method's structural assumptions are empirically supported as a better description of the weight manifold.**

This is hiding in plain sight. Hundreds of PEFT papers run controlled comparisons. Almost nobody reads them as science about representations.

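To make "encodes a structural hypothesis" concrete, here is a minimal NumPy sketch of the LoRA case: the update is forced through a rank-$r$ bottleneck, which is exactly the claim that useful weight changes are low-rank. The sizes, the `alpha / r` scaling, and the names (`lora_forward`, `B_trained`) are illustrative assumptions, not taken from any paper's code or from HuggingFace PEFT.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 48, 4, 8.0   # illustrative sizes (assumption)

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight (stand-in)
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, zero at init


def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B (A x); the update B @ A has rank <= r."""
    return W @ x + (alpha / r) * (B @ (A @ x))


x = rng.normal(size=d_in)
y_init = lora_forward(x, W, A, B, alpha, r)   # equals W @ x because B is zero

# Stand-in for a trained B: random values, just to inspect the update's shape.
B_trained = rng.normal(size=(d_out, r))
delta_W = (alpha / r) * (B_trained @ A)

update_rank = np.linalg.matrix_rank(delta_W)  # bounded by r by construction
lora_params = A.size + B.size                 # r * (d_in + d_out)
full_params = W.size                          # d_out * d_in
```

The "same parameter budget" condition in the claim amounts to holding a count like `lora_params` fixed across methods, so that only the structural constraint differs.
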
## Why this matters for interpretability

We want to understand how transformers work. There are many approaches -- probing, ablation, SAEs -- but most of them *observe* rather than *intervene*.

- Probing finds representations that predict behavior, but high probe accuracy does not mean the model uses that representation (Belinkov, 2022).
- CCS discovers latent knowledge but cannot intervene on it (Burns et al., 2022).
- Intervention sidesteps both problems: if modifying a representation reliably changes behavior, we have causal evidence about what we control.

The GDM interpretability team recently pivoted toward ["pragmatic interpretability"](https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability) -- empirical feedback on the critical path to AGI going well. Adapter benchmarks are precisely this kind of empirical feedback: which structural assumptions about transformer internals hold up under intervention?

The adapter literature is a natural experiment. Each method constrains the *form* of the weight update. When a constrained method matches or beats an unconstrained one, that is evidence the constraint reflects real structure in the weight manifold. When it also generalizes OOD, the structure is *causally relevant*, not merely correlated.

## The catalog

I went through ~30 PEFT methods in HuggingFace PEFT and the broader literature. For each one I:

1. Extracted pseudocode for the forward pass (what the intervention actually does)
2. Stated the hypothesis it encodes about transformer internals
3. Graded the evidence on a rough hierarchy:

| Grade | Meaning |
|-------|---------|
| \* | Parameter-efficient (matches LoRA with fewer params) |
| \*\* | Beats LoRA on raw performance |
| \*\*!\*\* | Beats full fine-tuning |
| \*\*!!\*\* | Data-efficient (few-shot, fast convergence) |
| \*\*!!!\*\* | Generalizes out-of-distribution |

The full catalog with pseudocode is at [github.com/wassname/adapters_as_hypotheses](https://github.com/wassname/adapters_as_hypotheses). Here I'll summarize the main findings.

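As a rough illustration of what one catalog entry carries (pseudocode, hypothesis, grade), the three steps could be represented like this. `Grade`, `CatalogEntry`, and the example rows are hypothetical: the hypothesis strings paraphrase this post, and the grades are placeholders read off the summary below, not the grades assigned in the repo.

```python
from dataclasses import dataclass
from enum import IntEnum


class Grade(IntEnum):
    """Evidence grades from the hierarchy, ordered weakest (1) to strongest (5)."""
    PARAM_EFFICIENT = 1   # matches LoRA with fewer params
    BEATS_LORA = 2        # beats LoRA on raw performance
    BEATS_FULL_FT = 3     # beats full fine-tuning
    DATA_EFFICIENT = 4    # few-shot, fast convergence
    GENERALIZES_OOD = 5   # generalizes out-of-distribution


@dataclass(frozen=True)
class CatalogEntry:
    method: str       # adapter name
    pseudocode: str   # what the forward pass does
    hypothesis: str   # what it assumes about transformer internals
    grade: Grade      # strongest evidence level reported


# Placeholder entries (assumptions for illustration only).
entries = [
    CatalogEntry(
        method="PiSSA",
        pseudocode="W = W_res + B A, with B A initialised from the top-r SVD of W",
        hypothesis="the principal SVD components carry the signal",
        grade=Grade.BEATS_LORA,
    ),
    CatalogEntry(
        method="AntiPaSTO",
        pseudocode="Cayley rotation in the SVD basis, scaled by alpha",
        hypothesis="a rotation in the SVD basis captures behaviour-relevant structure",
        grade=Grade.GENERALIZES_OOD,
    ),
]

# Sort so the strongest evidence comes first.
ranked = sorted(entries, key=lambda e: e.grade, reverse=True)
```

Using an ordered enum makes the "rough hierarchy" explicit: grades can be compared directly, which is all the ranking step relies on.
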
## What the evidence says

### 1. The SVD basis is the natural coordinate system

Methods that use the model's own SVD consistently outperform random-basis methods at the same parameter count:

- **PiSSA** (NeurIPS 2024): Initialize LoRA from top-$r$ SVD of $W$, freeze the residual. Gemma-7B on GSM8K: PiSSA 77.7% vs LoRA 74.5%. Same architecture, same parameter count -- the only difference is *which subspace you start in*.
- **SVFT**: Fix both singular vector sets from $W$'s SVD, learn only sparse coefficients. Recovers up to 96% of full FT performance with 0.006-0.25% of parameters. LoRA/DoRA recover only up to 85% with 0.03-0.8%.
- **SSVD**: Rotate right singular vectors (Cayley transform), shift singular values, keep left singular vectors fixed. Matches LoRA with 10M fewer params on domain-shifted ASR.

The message: the SVD basis isn't an arbitrary mathematical convenience. It captures *meaningful computational directions* that the model actually uses.

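The PiSSA bullet above can be sketched in a few lines of NumPy. This is a minimal illustration of "initialize from the top-$r$ SVD, freeze the residual", assuming a square-root split of the singular values between the two factors; the variable names and sizes are made up for the example, and this is not the reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 24, 4                  # illustrative sizes (assumption)

W = rng.normal(size=(d_out, d_in))          # stand-in for a pretrained weight
U, S, Vt = np.linalg.svd(W, full_matrices=False)

# PiSSA-style init: the adapter starts as the top-r principal components.
sqrt_S = np.sqrt(S[:r])
B = U[:, :r] * sqrt_S                       # (d_out, r), trainable
A = sqrt_S[:, None] * Vt[:r, :]             # (r, d_in), trainable
W_res = W - B @ A                           # frozen residual

# LoRA-style init for comparison: random A, zero B, all of W frozen.
A_lora = rng.normal(size=(r, d_in)) * 0.01
B_lora = np.zeros((d_out, r))

# Both schemes reproduce W exactly at step 0 ...
pissa_W0 = W_res + B @ A
lora_W0 = W + B_lora @ A_lora

# ... but only PiSSA's trainable factors sit in W's top singular directions.
adapter_singular_values = np.linalg.svd(B @ A, compute_uv=False)[:r]
energy_in_adapter = np.sum(S[:r] ** 2) / np.sum(S ** 2)
```

Both initializations leave the model's output unchanged at the start of training, so any later difference in accuracy comes from which subspace the gradient updates are confined to.
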
### 2. Orthogonal adapters preserve something real

The OFT family (OFT, BOFT, GOFT, HRA) constrains adaptation to orthogonal transformations -- rotations and reflections, with no scaling. They work well on tasks where you want to *repurpose* existing representations without *destroying* them (DreamBooth, ControlNet, domain adaptation).

HRA makes a surprising bridge: a chain of $r$ Householder reflections is *both* orthogonal *and* equal to the identity plus a perturbation of rank at most $r$, so applying it to $W$ is also a low-rank update. The "low-rank vs orthogonal" dichotomy is a false one. The effective adaptation might be low-rank *and* approximately orthogonal simultaneously.

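The Householder bridge is easy to check numerically. A minimal sketch, assuming the chain is applied on the input side of the weight (`W @ H`); sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 3                                # illustrative sizes (assumption)


def householder_chain(vectors):
    """Product of reflections H_i = I - 2 u_i u_i^T, one per row of `vectors`."""
    dim = vectors.shape[1]
    H = np.eye(dim)
    for v in vectors:
        u = v / np.linalg.norm(v)           # unit normal of the reflection plane
        H = H @ (np.eye(dim) - 2.0 * np.outer(u, u))
    return H


V = rng.normal(size=(r, d))                 # r trainable reflection vectors
H = householder_chain(V)

# Orthogonal: H^T H = I.
is_orthogonal = np.allclose(H.T @ H, np.eye(d))

# Low-rank: H differs from the identity only inside span(u_1, ..., u_r).
perturbation_rank = np.linalg.matrix_rank(H - np.eye(d))

W = rng.normal(size=(d, d))                 # stand-in for a frozen weight
W_adapted = W @ H                           # assumed placement of the chain
weight_delta_rank = np.linalg.matrix_rank(W_adapted - W)
```

Since `W @ H - W = W @ (H - I)`, the weight change inherits the rank bound of `H - I`, which is why the same adapter can be read as either an orthogonal or a low-rank method.
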
### 3. Direction and strength decouple

Three independent teams converged on the same design: separate *what to change* (direction in weight space) from *how much to change it* (magnitude):

- **DoRA** (ICML 2024): Magnitude/direction decomposition of $W$. Consistently beats LoRA.
- **DeLoRA** (ICLR 2025): Normalize each rank-1 component, introduce learnable scalar $\lambda$. Better robustness to learning rate.
- **ROAD**: 2D rotary adaptation with explicit angle $\theta$ and magnitude $\alpha$.

When you don't decouple them (standard LoRA), the optimizer wastes capacity fighting magnitude dynamics when it should be learning directions. Prediction: methods that decouple direction from strength will systematically show better OOD transfer, because the direction captures *what* to change (task-invariant) while the strength captures *how much* (task-specific).

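A minimal sketch of the decoupling idea, in the style of DoRA's magnitude/direction split: the weight is written as a per-column magnitude times a unit-norm direction, and the low-rank update only touches the direction. The column-wise norm convention, sizes, and names (`dora_weight`, `column_norms`) are assumptions for illustration rather than the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 32, 24, 4                  # illustrative sizes (assumption)

W0 = rng.normal(size=(d_out, d_in))         # frozen pretrained weight (stand-in)
A = rng.normal(size=(r, d_in)) * 0.01       # trainable, low-rank direction update
B = np.zeros((d_out, r))                    # trainable, zero at init


def column_norms(M):
    """L2 norm of each column, kept as a (1, d_in) row for broadcasting."""
    return np.linalg.norm(M, axis=0, keepdims=True)


m = column_norms(W0)                        # trainable magnitude, init from W0


def dora_weight(W0, A, B, m):
    """W' = m * V / ||V||_col with V = W0 + B A: strength times direction."""
    V = W0 + B @ A
    return m * (V / column_norms(V))


W_init = dora_weight(W0, A, B, m)           # equals W0 at initialisation

# Changing only the magnitude rescales columns without turning them.
W_scaled = dora_weight(W0, A, B, 2.0 * m)

# Changing only the direction turns columns but leaves their norms at m.
B_new = rng.normal(size=(d_out, r))
W_turned = dora_weight(W0, A, B_new, m)
```

In plain LoRA both effects are entangled in `B @ A`; here a gradient step on `m` cannot rotate a column and a step on `A` or `B` cannot change its length.
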
### 4. Scaling alone goes surprisingly far

**IA3** learns nothing but a per-channel scaling vector ($\lambda \in \mathbb{R}^d$, initialized to 1). As part of the T-Few recipe on T0-3B, it outperforms few-shot ICL with GPT-3 175B on T0's held-out tasks. **LN Tuning** learns only LayerNorm affine parameters (~0.5% of model).

A large fraction of "task adaptation" is just reweighting existing features -- gain control over channels. The model already computes the right features; the bottleneck is which ones to amplify or suppress. When scaling fails, that's when genuine new feature combinations are needed, and only then do you need weight-space interventions.

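"Gain control over channels" is a one-line operation. A minimal sketch of an IA3-style scaling vector applied to the output of a linear map; which activations get scaled in a real model (keys, values, feed-forward activations) is left out, and all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 8, 6                          # illustrative sizes (assumption)

W = rng.normal(size=(d_out, d_in))          # frozen weight (stand-in)
lam = np.ones(d_out)                        # the only trainable parameters


def ia3_forward(x, W, lam):
    """Elementwise rescaling of existing features: y = lam * (W x)."""
    return lam * (W @ x)


x = rng.normal(size=d_in)
y_init = ia3_forward(x, W, lam)             # identical to W @ x at init

# Stand-in for a trained vector: silence channel 0, amplify channel 1.
lam_trained = lam.copy()
lam_trained[0] = 0.0
lam_trained[1] = 3.0
y_trained = ia3_forward(x, W, lam_trained)

# The same effect written as a weight change: each row of W is rescaled.
W_equiv = lam_trained[:, None] * W
```

The equivalent weight `W_equiv` shows what this adapter cannot do: it rescales rows but never mixes them, so no new feature combinations can appear.
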
### 5. The strongest evidence: OOD generalization

Most adapter comparisons are parameter-efficiency contests on the same benchmarks. The really informative test is out-of-distribution transfer: does the adapter capture causal structure or just surface correlation?

**AntiPaSTO** ([Clark, 2025](https://arxiv.org/abs/2601.07473)) synthesizes several of the above insights -- SVD basis (PiSSA), Cayley rotation of right singular vectors (SSVD), direction/strength decoupling (DeLoRA) -- into a single adapter that steers model behavior bidirectionally via a coefficient $\alpha \in [-1, +1]$. Trained on 800 contrastive word pairs (no preference labels), it transfers from template sentences to real ethical dilemmas with 6.9x the steering performance of prompting. The same adapter at $\alpha = +1$ makes the model more honest; at $\alpha = -1$, less honest.

The OOD transfer is the strong claim. The SVD rotation basis learned on trivial templates captures something causally relevant about how the model structures its honesty computations. (Caveat: primary evidence is on models up to 4B; larger models need further exploration.)

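To show how the three ingredients can fit together mechanically, here is a sketch of an $\alpha$-scaled Cayley rotation of the top-$r$ right singular vectors. This is an illustration built only from the description above, not AntiPaSTO's implementation: where the rotation is applied, how $\alpha$ enters, and every name (`cayley`, `steered_weight`, `A_skew`) are assumptions, and the training objective is omitted entirely.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 12, 4                  # illustrative sizes (assumption)

W = rng.normal(size=(d_out, d_in))          # stand-in for a pretrained weight
U, S, Vt = np.linalg.svd(W, full_matrices=False)
U_r, S_r, V_r = U[:, :r], S[:r], Vt[:r, :].T     # V_r has shape (d_in, r)
W_res = W - (U_r * S_r) @ V_r.T             # frozen remainder outside the top-r block


def cayley(A):
    """Map a skew-symmetric A to the orthogonal matrix (I - A)^{-1} (I + A)."""
    I = np.eye(A.shape[0])
    return np.linalg.solve(I - A, I + A)


P = rng.normal(size=(r, r)) * 0.1
A_skew = P - P.T                            # stand-in for the trained parameters


def steered_weight(alpha):
    """Rotate the top-r right singular vectors by Cayley(alpha * A_skew)."""
    R = cayley(alpha * A_skew)
    return W_res + (U_r * S_r) @ (V_r @ R).T


W_zero = steered_weight(0.0)                # alpha = 0 leaves the model unchanged
R_pos = cayley(+1.0 * A_skew)
R_neg = cayley(-1.0 * A_skew)               # the inverse rotation of R_pos
```

Under these assumptions, flipping the sign of $\alpha$ applies the exact inverse rotation, which is one simple way a single set of parameters could steer in two opposite directions.
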
## Design lineages

One interesting pattern: you can trace design lineages that progressively refine the same hypothesis.

**Orthogonal family:** OFT (block-diagonal rotation) -> BOFT (butterfly factorization, $O(d \log d)$ params) -> GOFT (Givens rotations, $O(d)$ params) -> HRA (Householder reflections, bridges to low-rank)

**SVD-aware family:** PiSSA (SVD initialization) -> SVFT (sparse SVD coefficients) -> SSVD (asymmetric U/V treatment + Cayley rotation) -> AntiPaSTO (Cayley + steering coefficient)

**Decoupling family:** DoRA (magnitude/direction) -> ETHER (fixed-strength orthogonal) -> DeLoRA (normalized rank-1 + $\lambda$) -> AntiPaSTO ($\alpha$-controlled rotation)

Each refinement tests a more specific version of the parent hypothesis. When the refinement works better, we learn something more specific about the geometry.

## What I'm most uncertain about

- **Scale dependence.** Most of these results are on 1B-7B models. The geometry might change at 70B+. Some evidence (SSVD) suggests the SVD hypothesis gets *stronger* with scale, but this isn't settled.
- **Task dependence.** Orthogonal methods shine on vision/generation (semantic preservation) but may not apply where magnitude changes matter (NLU, reasoning). The "right" geometry may be task-specific.
- **Controlled comparisons are rare.** Many papers compare against LoRA with different hyperparameters, different scales, different tasks. The cleanest evidence comes from papers that do careful all-else-equal ablations (DoRA, PiSSA, SSVD).
- **Publication bias.** Methods that don't work don't get published. The catalog over-represents "successful" hypotheses.

## The repo

The full catalog with pseudocode, evidence, and grades for 30 methods is at:

**[github.com/wassname/adapters_as_hypotheses](https://github.com/wassname/adapters_as_hypotheses)**

Each entry has the paper saved to `docs/` for reference. Contributions welcome -- if I've mischaracterized a method or missed one, open an issue.

## Acknowledgments

The framing of "adapters as representational hypotheses" originates from Appendix A.3 of [AntiPaSTO](https://arxiv.org/abs/2601.07473) (Clark, 2025). The "pragmatic interpretability" direction that motivates this is from [Nanda et al. (2025)](https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability).

@@ -1,40 +0,0 @@
# Adapters as Representational Hypotheses

*What does each PEFT method believe about transformer internals?*

Each adapter architecture encodes a structural claim about how to intervene in pretrained weights. When one outperforms another under controlled conditions (same model, same data, same parameter budget), the winner's assumptions are supported as a better description of the weight manifold.

This catalog reframes ~30 PEFT methods as **hypotheses about transformer geometry**, extracts pseudocode for each intervention, and grades the evidence.

## Evidence hierarchy

| Grade | Meaning |
|-------|---------|
| * | Parameter-efficient (matches LoRA with fewer params) |
| ** | Beats LoRA on raw performance |
| **!** | Beats full fine-tuning |
| **!!** | Data-efficient (few-shot, fast convergence) |
| **!!!** | Generalizes out-of-distribution |

## Contents

- [adapters_as_hypotheses.md](adapters_as_hypotheses.md) -- the main catalog
- [docs/](docs/) -- saved papers (full text, markdown)

## Key findings

1. **SVD basis is the natural coordinate system.** Methods that use the model's own SVD (PiSSA, SVFT, SSVD, AntiPaSTO) consistently outperform random-basis methods at the same parameter count.
2. **Orthogonal >> arbitrary.** Orthogonal constraints (OFT, BOFT, HRA, AntiPaSTO) preserve semantic structure and improve OOD transfer, at the cost of limited magnitude changes.
3. **Direction and strength decouple.** Methods that separate *what to change* from *how much* (DeLoRA, ROAD, AntiPaSTO) show better robustness and enable bidirectional steering.
4. **Low-rank is necessary but not sufficient.** LoRA's rank bottleneck limits hard tasks; full-rank methods (RandLoRA, SHiRA) close the gap with full FT.
5. **Scaling alone goes far.** IA3 and LN Tuning show that a surprising amount of adaptation is just reweighting existing features -- "gain control" over channels.

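Finding 4 rests on the difference between parameter count and rank. A small NumPy sketch, assuming a SHiRA-like scheme that trains a sparse subset of weight entries (an illustration of the idea only, not either method's implementation): with the same number of trainable scalars, a sparse update is not confined to rank $r$. Sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                                # illustrative sizes (assumption)
budget = r * (d + d)                        # trainable scalars in a rank-r LoRA

# LoRA-style update: the budget is spent on two thin factors.
B = rng.normal(size=(d, r))
A = rng.normal(size=(r, d))
delta_lora = B @ A

# Sparse update: the same budget spent on individual entries at random positions.
flat_idx = rng.choice(d * d, size=budget, replace=False)
delta_sparse = np.zeros(d * d)
delta_sparse[flat_idx] = rng.normal(size=budget)
delta_sparse = delta_sparse.reshape(d, d)

rank_lora = np.linalg.matrix_rank(delta_lora)      # capped at r
rank_sparse = np.linalg.matrix_rank(delta_sparse)  # not capped at r
nonzeros = np.count_nonzero(delta_sparse)
```

The two updates cost the same number of parameters, so a performance gap between such methods is evidence about the rank constraint itself rather than about capacity.
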
## Related

- [A Pragmatic Vision for Interpretability](https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability) -- Nanda et al. 2025
- [AntiPaSTO: Antiparallel Steering](https://arxiv.org/abs/2601.07473) -- Clark 2025 (Appendix A.3 is the origin of this framing)
- [HuggingFace PEFT](https://github.com/huggingface/peft) -- reference implementations

## License

Content is CC-BY-4.0. Papers in docs/ are fetched from arXiv for reference and remain under their original licenses.