From eb2d5fb3c37c2a6715a30814ba1705b8043c8fb4 Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Sun, 22 Feb 2026 19:50:06 +0800
Subject: [PATCH] Add LessWrong draft, improve preamble (interventionist framing)

---
 LESSWRONG_DRAFT.md | 112 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 112 insertions(+)
 create mode 100644 LESSWRONG_DRAFT.md

diff --git a/LESSWRONG_DRAFT.md b/LESSWRONG_DRAFT.md
new file mode 100644
index 0000000..5c26772
--- /dev/null
+++ b/LESSWRONG_DRAFT.md
@@ -0,0 +1,112 @@
+# Adapters as Representational Hypotheses: What 30 PEFT Methods Tell Us About Transformer Geometry
+
+*Crossposted from [github.com/wassname/adapters_as_hypotheses](https://github.com/wassname/adapters_as_hypotheses)*
+
+## The core claim
+
+Every parameter-efficient fine-tuning (PEFT) adapter encodes a structural hypothesis about how to intervene in transformer internals. LoRA says weight changes are low-rank. OFT says orthogonal rotations preserve semantic structure. PiSSA says the principal SVD components carry the signal. When one adapter outperforms another under controlled conditions -- same model, same data, same parameter budget -- **the winning method's structural assumptions are empirically supported as a better description of the weight manifold.**
+
+This is hiding in plain sight. Hundreds of PEFT papers run controlled comparisons. Almost nobody reads them as science about representations.
+
+## Why this matters for interpretability
+
+We want to understand how transformers work. There are many approaches -- probing, ablation, SAEs -- but most of them *observe* rather than *intervene*.
+
+- Probing finds representations that predict behavior, but high probe accuracy does not mean the model uses that representation (Belinkov, 2022).
+- CCS discovers latent knowledge but cannot intervene on it (Burns et al., 2022).
+- Intervention sidesteps both problems: if modifying a representation reliably changes behavior, we have causal evidence about what that representation controls.
+
+The GDM interpretability team recently pivoted toward ["pragmatic interpretability"](https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability) -- empirical feedback on the critical path to AGI going well. Adapter benchmarks are precisely this kind of empirical feedback: which structural assumptions about transformer internals hold up under intervention?
+
+The adapter literature is a natural experiment. Each method constrains the *form* of the weight update. When a constrained method matches or beats an unconstrained one, that constraint reflects real structure in the weight manifold. When it generalizes OOD, the structure is *causally relevant*, not merely correlated.
+
+## The catalog
+
+I went through ~30 PEFT methods in Hugging Face PEFT and the broader literature. For each one I:
+
+1. Extracted pseudocode for the forward pass (what the intervention actually does)
+2. Stated the hypothesis it encodes about transformer internals
+3. Graded the evidence on a rough hierarchy:
+
+| Grade | Meaning |
+|-------|---------|
+| \* | Parameter-efficient (matches LoRA with fewer params) |
+| \*\* | Beats LoRA on raw performance |
+| \*\*!\*\* | Beats full fine-tuning |
+| \*\*!!\*\* | Data-efficient (few-shot, fast convergence) |
+| \*\*!!!\*\* | Generalizes out-of-distribution |
+
+The full catalog with pseudocode is at [github.com/wassname/adapters_as_hypotheses](https://github.com/wassname/adapters_as_hypotheses). Here I'll summarize the main findings.
+
+## What the evidence says
+
+### 1. The SVD basis is the natural coordinate system
+
+Methods that use the model's own SVD decomposition consistently outperform random-basis methods at the same parameter count:
+
+- **PiSSA** (NeurIPS 2024): Initialize LoRA from top-$r$ SVD of $W$, freeze the residual. Gemma-7B on GSM8K: PiSSA 77.7% vs LoRA 74.5%. Same architecture, same parameter count -- the only difference is *which subspace you start in*.
+- **SVFT**: Fix both singular vector sets from $W$'s SVD, learn only sparse coefficients. Recovers up to 96% of full FT performance while training 0.006-0.25% of parameters. LoRA/DoRA recover up to 85% with 0.03-0.8%.
+- **SSVD**: Rotate right singular vectors (Cayley transform), shift singular values, keep left singular vectors fixed. Matches LoRA with 10M fewer params on domain-shifted ASR.
+
+The message: the SVD basis isn't an arbitrary mathematical convenience. It captures *meaningful computational directions* that the model actually uses.
+
+### 2. Orthogonal adapters preserve something real
+
+The OFT family (OFT, BOFT, GOFT, HRA) constrains adaptation to orthogonal transformations -- rotations and reflections, with no scaling. They work well on tasks where you want to *repurpose* existing representations without *destroying* them (DreamBooth, ControlNet, domain adaptation).
+
+HRA provides a surprising bridge: a chain of $r$ Householder reflections (each of the form $I - 2uu^\top$ for a unit vector $u$) is *both* orthogonal *and* differs from the identity by a matrix of rank at most $r$, so applying it to $W$ amounts to a low-rank perturbation. The "low-rank vs orthogonal" dichotomy is a false one. The effective adaptation might be low-rank *and* approximately orthogonal simultaneously.
+
+### 3. Direction and strength decouple
+
+Three independent teams converged on the same design: separate *what to change* (direction in weight space) from *how much to change it* (magnitude):
+
+- **DoRA** (ICML 2024): Magnitude/direction decomposition of $W$. Consistently beats LoRA.
+- **DeLoRA** (ICLR 2025): Normalize each rank-1 component, introduce learnable scalar $\lambda$. Better robustness to learning rate.
+- **ROAD**: 2D rotary adaptation with explicit angle $\theta$ and magnitude $\alpha$.
+
+When you don't decouple them (standard LoRA), the optimizer wastes capacity fighting magnitude dynamics when it should be learning directions. Prediction: methods that decouple direction from strength will systematically show better OOD transfer, because the direction captures *what* to change (task-invariant) while the strength captures *how much* (task-specific).
+
+### 4. Scaling alone goes surprisingly far
+
+**IA3** learns nothing but per-channel scaling vectors (each $\lambda \in \mathbb{R}^d$, initialized to 1) applied to keys, values, and feed-forward activations. As part of the T-Few recipe on T0-3B, it outperforms few-shot ICL with GPT-3 175B on held-out T0 tasks. **LN Tuning** learns only LayerNorm affine parameters (~0.5% of model).
+
+A large fraction of "task adaptation" is just reweighting existing features -- gain control over channels. The model already computes the right features; the bottleneck is which ones to amplify or suppress. When scaling fails, genuinely new feature combinations are needed, and only then do you need weight-space interventions.
+
+### 5. The strongest evidence: OOD generalization
+
+Most adapter comparisons are parameter-efficiency contests on the same benchmarks. The really informative test is out-of-distribution transfer: does the adapter capture causal structure or just surface correlation?
+
+**AntiPaSTO** ([Clark, 2025](https://arxiv.org/abs/2601.07473)) synthesizes several of the above insights -- SVD basis (PiSSA), Cayley rotation of right singular vectors (SSVD), direction/strength decoupling (DeLoRA) -- into a single adapter that steers model behavior bidirectionally via a coefficient $\alpha \in [-1, +1]$. Trained on 800 contrastive word pairs (no preference labels), it transfers from template sentences to real ethical dilemmas with 6.9x the steering performance of prompting. The same adapter at $\alpha = +1$ makes the model more honest; at $\alpha = -1$, less honest.
+
+The OOD transfer is the strong claim. The SVD rotation basis learned on trivial templates captures something causally relevant about how the model structures its honesty computations. (Caveat: primary evidence is on models up to 4B; larger models need further exploration.)
+
+## Design lineages
+
+One interesting pattern: you can trace design lineages that progressively refine the same hypothesis.
+
+**Orthogonal family:** OFT (block-diagonal rotation) -> BOFT (butterfly factorization, $O(d \log d)$ params) -> GOFT (Givens rotations, $O(d)$ params) -> HRA (Householder reflections, bridges to low-rank)
+
+**SVD-aware family:** PiSSA (SVD initialization) -> SVFT (sparse SVD coefficients) -> SSVD (asymmetric U/V treatment + Cayley rotation) -> AntiPaSTO (Cayley + steering coefficient)
+
+**Decoupling family:** DoRA (magnitude/direction) -> ETHER (fixed-strength orthogonal) -> DeLoRA (normalized rank-1 + $\lambda$) -> AntiPaSTO ($\alpha$-controlled rotation)
+
+Each refinement tests a more specific version of the parent hypothesis. When the refinement works better, we learn something more specific about the geometry.
+
+## What I'm most uncertain about
+
+- **Scale dependence.** Most of these results are on 1B-7B models. The geometry might change at 70B+. Some evidence (SSVD) suggests the SVD hypothesis gets *stronger* with scale, but this isn't settled.
+- **Task dependence.** Orthogonal methods shine on vision/generation (semantic preservation) but may not apply where magnitude changes matter (NLU, reasoning). The "right" geometry may be task-specific.
+- **Controlled comparisons are rare.** Many papers compare against LoRA with different hyperparameters, different scales, different tasks. The cleanest evidence comes from papers that do careful all-else-equal ablations (DoRA, PiSSA, SSVD).
+- **Publication bias.** Methods that don't work don't get published. The catalog over-represents "successful" hypotheses.
+
+## The repo
+
+The full catalog with pseudocode, evidence, and grades for 30 methods is at:
+
+**[github.com/wassname/adapters_as_hypotheses](https://github.com/wassname/adapters_as_hypotheses)**
+
+Each entry has the paper saved to `docs/` for reference. Contributions welcome -- if I've mischaracterized a method or missed one, open an issue.
+
+## Acknowledgments
+
+The framing of "adapters as representational hypotheses" originates from Appendix A.3 of [AntiPaSTO](https://arxiv.org/abs/2601.07473) (Clark, 2025). The "pragmatic interpretability" direction that motivates this is from [Nanda et al. (2025)](https://www.lesswrong.com/posts/StENzDcD3kpfGJssR/a-pragmatic-vision-for-interpretability).