# Reference-fidelity check: `antipasto_ablate.py` vs Arditi+ 2024 (arXiv:2406.11717)

Scope: math/algorithm fidelity + citation only. This is fail-fast research code, so the
review does not flag missing defensive programming, None-checks, or backward
compatibility. The secondary `cov_orient` block is checked against CorDA
(arXiv:2406.05223).

Files:
- impl: `/media/wassname/SGIronWolf/projects5/2026/lite/lora-lite/src/lora_lite/variants/antipasto_ablate.py`
- paper: `/media/wassname/SGIronWolf/projects5/2026/lite/lora-lite/docs/papers/md/arditi_2406.11717.md`
- secondary: `/media/wassname/SGIronWolf/projects5/2026/lite/lora-lite/docs/papers/md/corda_2406.05223.md`

---

## 1. Ablation operator: residual-stream projection vs per-layer output-singular projection

Paper, directional ablation operator (arditi md:153-157):

> 𝐱′ ← 𝐱 − 𝐫̂ 𝐫̂⊺ 𝐱. [...] We perform this operation at every activation 𝐱ᵢ(l) and 𝐱̃ᵢ(l), across all layers l and all token positions i. This effectively prevents the model from ever representing this direction in its residual stream.

Paper, the weight-space equivalent (the form our code structurally resembles), Eq 5 (arditi md:246-248):

> W_out′ ← W_out − 𝐫̂ 𝐫̂⊺ W_out. [...] the matrices that write to the residual stream are: the embedding matrix, the positional embedding matrix, attention out matrices, and MLP out matrices. Orthogonalizing all of these matrices [...] with respect to the direction 𝐫̂ effectively prevents the model from ever writing 𝐫̂ to its residual stream.

Code core (antipasto_ablate.py:182-186):

```
h = (x @ Vh.T) * S # (..., r) output S-coords
proj = h @ Chat # (..., k)
h = h - coeff * (proj * alpha) @ Chat.T # contractive removal in r-dim U-space
return y + h @ U.T # map back to d_out
```

Observation (math): the columns of Chat are orthonormal in the r-dim coordinate space
that indexes the columns of U (lora_U is (d_out, r), Chat is (r, k); the ablation at
:183-185 acts on h, which lives in this r-space, before `h @ U.T` lifts to d_out at
:186). Since U has orthonormal columns, a unit direction `c` in r-space maps to the
unit direction `U @ c` in d_out (residual) space, and the projector obeys
`U (I - Chat Chatᵀ) Uᵀ = U Uᵀ - (U Chat)(U Chat)ᵀ`, where `U Uᵀ` acts as the identity
on the column space of U. So within the rank-r output subspace this is exactly an
Arditi-style outer-product projector `I - d̂ d̂ᵀ` with `d̂ = U @ ĉ`.
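
The projector identity can be checked numerically. The following is a minimal NumPy sketch, not the repo's code: the sizes are hypothetical, and random orthonormal matrices stand in for `lora_U` and for the normalized `Chat`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, r, k = 12, 5, 2  # hypothetical sizes, chosen only for illustration

# Stand-ins: U has orthonormal columns (d_out, r); Chat has orthonormal columns (r, k).
U, _ = np.linalg.qr(rng.standard_normal((d_out, r)))
Chat, _ = np.linalg.qr(rng.standard_normal((r, k)))

lifted = U @ (np.eye(r) - Chat @ Chat.T) @ U.T  # r-space projector lifted to d_out
D = U @ Chat                                    # lifted directions, shape (d_out, k)
arditi = np.eye(d_out) - D @ D.T                # Arditi-style projector I - d d^T

# Lifted directions stay orthonormal because U has orthonormal columns.
assert np.allclose(D.T @ D, np.eye(k))

# On the column space of U the two operators agree.
v_in = U @ rng.standard_normal(r)
assert np.allclose(lifted @ v_in, arditi @ v_in)

# Off the column space they differ: the lifted operator returns zero there,
# while the full-space projector leaves the vector unchanged.
v_off = rng.standard_normal(d_out)
v_off -= U @ (U.T @ v_off)
assert np.allclose(lifted @ v_off, 0.0)
assert np.allclose(arditi @ v_off, v_off)
```

The last two assertions show why the identity is qualified with "on the column space of U": the lifted operator equals `U Uᵀ - D Dᵀ`, which is zero off that subspace rather than the identity.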

Inference (the mismatch, three parts):
1. Per-layer, not shared. Arditi ablates ONE direction 𝐫̂ read off the residual
   stream and applies the SAME 𝐫̂ at every component (Eq 5 lists embed/attn-out/
   mlp-out all orthogonalized w.r.t. the same 𝐫̂). Our code learns a SEPARATE
   direction per target layer (lora_c is per-layer, param_specs:70). These are
   different objects: Arditi's d̂ is global; ours is a bouquet of per-layer d̂.
2. Restricted to top-r U-span. Arditi's projector acts on the full d_model residual
   vector. Ours can only remove components that lie in span(U[:, :r]); anything in
   W_res (the frozen remainder, init:86-87) is untouched. With cov_orient=False this
   is plain-SVD top-r, which need not contain the behavior direction (the docstring
   itself flags this at :52-53: "measured 1.00 vs 0.65 capture at r=16").
3. Output-side only vs residual-stream. Arditi's residual-stream ablation (Eq 4) and
   its weight-equivalent (Eq 5) zero the direction the layer WRITES. Our `y + h@U.T`
   subtracts from this layer's additive output contribution, which IS a residual
   writer for down_proj/o_proj. The docstring scopes the variant to exactly those
   ("target residual writers (down_proj, o_proj)", :17-18), so for those targets the
   output-side framing is the correct analogue of Eq 5. For a non-writer Linear it
   would not be (correctly excluded by design).

VERDICT: DEVIATES-OK. The single-direction outer-product structure is faithfully
reproduced inside each layer's output subspace, and for residual-writer targets the
output-side action matches Eq 5's "don't write 𝐫̂". The deviations (per-layer learned
direction instead of one shared difference-in-means direction; confined to top-r) are
deliberate design choices, not the paper's algorithm, and the docstring's framing
("trainable form of directional ablation") signals the analogy rather than claiming
identity. No bug, but the correspondence is "Arditi-style projector, per-layer,
within rank-r", not the verbatim global single-direction ablation.

---

## 2. Contraction / "ablation cannot amplify"

The paper does not state a contraction theorem in the algebraic sense, but the operator
itself is a projection. The closest textual support is Eq 4 (arditi md:155), the
orthogonal projector `I - 𝐫̂𝐫̂ᵀ` (idempotent, eigenvalues {0,1}), and the paper
repeatedly frames ablation as removal/erasure, never amplification:

> Directional ablation "zeroes out" the component along 𝐫̂ for every residual stream activation (arditi md:151)

> these loss metrics suggest that directional ablation is more surgical than activation addition based methods (arditi md:323)

(The paper's amplification is a SEPARATE intervention, activation *addition* (Eq 3,
md:144), confirming ablation is the non-amplifying side.)

Code (forward:179, 185 and clamp):

```
alpha = layer.lora_alpha.to(x.dtype).clamp(0.0, 1.0) # :179
h = h - coeff * (proj * alpha) @ Chat.T # :185
```

Observation (math): with coeff=1 the core operator on r-space is
`M = I - Chat diag(alpha) Chatᵀ`. Chat orthonormal ⇒ eigenvalues are `1 - alphaⱼ`
along each chat_j and `1` on the orthogonal complement. With alpha clamped to [0,1]
(:179), every eigenvalue lies in [0,1]. M is symmetric, so its operator norm is its
largest absolute eigenvalue: M is a contraction (operator norm ≤ 1) and cannot
amplify. Arditi's exact projector is the alpha=1 endpoint (eigenvalue 0 = full
erasure); alpha<1 is partial ablation, still non-amplifying.

Caveat (the code's own warning, :20-21, :49): coeff is OUTSIDE the clamp. For general
coeff the eigenvalue along chat_j is `1 - coeff·alphaⱼ`. coeff<0 flips the sign and
ADDS the direction back (eigenvalue `1 + |coeff|·alphaⱼ` > 1 whenever alphaⱼ > 0), and
coeff>1 over-subtracts: the eigenvalue goes negative once coeff·alphaⱼ > 1 (the
component is reflected rather than removed) and exceeds 1 in magnitude once
coeff·alphaⱼ > 2. The eigenvalues-in-[0,1] guarantee holds only for coeff ∈ [0,1],
which the docstring states explicitly ("<0 adds the direction back (the side that can
grow, so bound coeff there)"). That is correct and self-documented, not a silent bug.
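
The eigenvalue argument can be verified directly. This is a small NumPy sketch under assumed sizes, with a random orthonormal `Chat` and hand-picked `alpha` values already inside the clamp range; `ablation_operator`, `spectrum`, and `expected` are illustrative helper names, not functions from the reviewed file.

```python
import numpy as np

rng = np.random.default_rng(0)
r, k = 6, 2  # hypothetical sizes
Chat, _ = np.linalg.qr(rng.standard_normal((r, k)))  # orthonormal columns (r, k)
alpha = np.array([0.3, 1.0])                          # already inside [0, 1]

def ablation_operator(coeff):
    """M = I - coeff * Chat diag(alpha) Chat^T, the r-space operator."""
    return np.eye(r) - coeff * (Chat * alpha) @ Chat.T

def spectrum(coeff):
    """Sorted eigenvalues of the (symmetric) operator."""
    return np.sort(np.linalg.eigvalsh(ablation_operator(coeff)))

def expected(coeff):
    """Analytic spectrum: 1 - coeff*alpha_j on Chat, 1 on the complement."""
    return np.sort(np.concatenate([1 - coeff * alpha, np.ones(r - k)]))

def op_norm(coeff):
    """Operator norm of a symmetric matrix = largest absolute eigenvalue."""
    return np.abs(spectrum(coeff)).max()

for coeff in (1.0, 0.5, -0.5, 1.5, 2.5):
    assert np.allclose(spectrum(coeff), expected(coeff))

assert spectrum(1.0).min() > -1e-9 and op_norm(1.0) <= 1 + 1e-9  # eigenvalues in [0, 1]
assert op_norm(-0.5) > 1                                          # coeff < 0 amplifies
assert spectrum(1.5).min() < 0 and op_norm(1.5) <= 1 + 1e-9       # reflects, norm still <= 1
assert op_norm(2.5) > 1                                           # coeff*alpha > 2 amplifies
```

The coeff=1.5 case separates two properties that are easy to conflate: the eigenvalue along the fully ablated direction is negative (the component is flipped), yet the operator norm has not exceeded 1.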

VERDICT: MATCHES (for the documented operating range coeff∈[0,1], alpha∈[0,1]).
Eigenvalues in [0,1] confirmed; the alpha clamp is the load-bearing line. The coeff<0
amplifying branch is intentional and flagged.

---

## 3. Direction source: fixed difference-in-means vs trainable

Paper, the direction is a FIXED difference-of-means, selected once (arditi md:124, 134):

> We then compute the difference-in-means vector 𝐫ᵢ(l) = 𝛍ᵢ(l) − 𝛎ᵢ(l).

> We notate the selected vector as 𝐫, and its corresponding unit-norm vector as 𝐫̂.

i.e. 𝐫̂ is computed from mean(harmful) − mean(harmless) and then frozen; the paper
performs NO gradient descent on it ("does not require gradient-based optimization",
md:236).
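
For contrast with the trainable direction discussed next, here is a minimal NumPy sketch of the closed-form recipe as quoted: a difference of means, normalized, then projected out (Eq 4), plus the weight-space form (Eq 5). The activations are synthetic stand-ins, all sizes and names are hypothetical, and the paper's step of selecting one candidate across layers and positions is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n, d_hidden = 16, 200, 8  # hypothetical sizes

# Synthetic stand-ins for residual activations at one layer and token position.
planted = rng.standard_normal(d_model)
harmful = rng.standard_normal((n, d_model)) + 2.0 * planted
harmless = rng.standard_normal((n, d_model))

# Difference-in-means direction and its unit-norm version. No gradients involved.
r_vec = harmful.mean(axis=0) - harmless.mean(axis=0)
r_hat = r_vec / np.linalg.norm(r_vec)

def ablate(x, r_hat):
    """Eq 4 applied row-wise: x' = x - r_hat r_hat^T x."""
    return x - np.outer(x @ r_hat, r_hat)

# After ablation no activation has a component along r_hat.
assert np.allclose(ablate(harmful, r_hat) @ r_hat, 0.0)

# Eq 5: orthogonalize a matrix that writes into the residual stream.
W_out = rng.standard_normal((d_model, d_hidden))
W_orth = W_out - np.outer(r_hat, r_hat @ W_out)

# Writing with W_orth equals writing with W_out and then ablating the result.
hidden = rng.standard_normal(d_hidden)
written = W_out @ hidden
assert np.allclose(W_orth @ hidden, ablate(written[None, :], r_hat)[0])
```

The final assertion is the equivalence that lets an activation-space intervention be folded into weights, which is the form the reviewed code structurally resembles.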

Code: lora_c is a trainable parameter (param_specs:70 `trainable` default; random
normal init), optionally warm-started from a contrastive S-space direction dS
(docstring init:88-90 "group_init() should warm-start lora_c from the S-space
contrastive direction dS").

Observation: ours is structurally the same OBJECT (a unit direction that gets
projected out) but obtained by a different procedure (SGD, optionally seeded from a
contrastive diff) rather than a closed-form difference-of-means.

Inference: calling it "the trainable form of directional ablation (Arditi+ 2024)"
(:16-17) is defensible AS A FRAMING: the operator (Eq 4/5 outer-product removal) is
Arditi's; the novelty claimed is making the direction learnable. The optional
warm-start from a contrastive dS is even closer to Arditi's diff-of-means seed. This
is a fair "trainable variant of X" claim, not a misattribution. It would be wrong to
claim Arditi's *method* (which is explicitly gradient-free), but the code claims the
*ablation operator*, not the extraction method.

VERDICT: DEVIATES-OK. "Trainable form of directional ablation" is an honest framing:
same operator, deliberately different (learned) direction source. Recommend the
docstring keep the word "form"/"trainable" prominent so it is not read as
reproducing Arditi's gradient-free diff-of-means extraction.

---

## 4. SVD sign disambiguation

Paper: SILENT on singular-vector sign. Arditi never does an SVD of a weight to get
its ablation direction: 𝐫̂ comes from difference-of-means (md:117-124), and the
weight-orthogonalization Eq 5 (md:246) uses the outer product 𝐫̂𝐫̂ᵀ. So the paper
offers no quote on SVD sign; I reason from the math.

Code: ablation core is the outer product `Chat Chatᵀ` (forward:183-185), and U is
orthonormal from `torch.linalg.svd` (init:80).

Observation (math):
- Chat enters only as `Chat Chatᵀ` (proj `h @ Chat` then `@ Chat.T`, :183-185).
  Flipping any column sign `chat_j → −chat_j` leaves `chat_j chat_jᵀ` unchanged ⇒
  the operator is sign-invariant in c. (lora_c is trainable anyway, so its sign is
  not even an SVD artifact.)
- U enters only as `h @ U.T` (:186), and the lifted projector from §1 is
  `(U c)(U c)ᵀ`. A sign flip `u_i → −u_i` with the matching `vh_i → −vh_i` leaves
  `W = (U S) Vh` invariant (init:86) and flips the i-th coordinate of h. With c held
  fixed, `(U c)(U c)ᵀ` would change, but the flip is absorbed by `c_i → −c_i`: a
  trained c is free to take either sign, and a direction computed in the same
  S-coordinates (such as a dS warm start) flips along with h. So neither the
  reconstructed weight nor the set of reachable ablation projectors depends on
  per-vector sign.
- S is non-negative by SVD definition; no sign issue.

Inference: no sign canonicalization is needed anywhere in this file. The ablation is
a quadratic form in the (trainable, hence sign-free) direction, and every place U/Vh
appear they appear in sign-paired products, with any basis flip absorbed by the
direction's coordinates. This is the correct situation (a basis/span use, not a
sign-sensitive coordinate use).
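
The sign argument can be checked numerically, with one subtlety made explicit: the lifted projector `(U c)(U c)ᵀ` is unchanged by a basis sign flip only when the coordinates of c flip along with the basis. This is a NumPy sketch on a random matrix with hypothetical sizes; `lifted_projector` is an illustrative helper, not a function from the reviewed file.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 10, 8, 4  # hypothetical sizes
W = rng.standard_normal((d_out, d_in))

# Top-r factors of a plain SVD.
U, S, Vh = np.linalg.svd(W, full_matrices=False)
U, S, Vh = U[:, :r], S[:r], Vh[:r]

# Flip the sign of singular pairs 2 and 3 (columns of U, matching rows of Vh).
signs = np.array([1.0, -1.0, -1.0, 1.0])
U2, Vh2 = U * signs, Vh * signs[:, None]

# The reconstructed top-r weight is unchanged by paired flips.
assert np.allclose((U * S) @ Vh, (U2 * S) @ Vh2)

def lifted_projector(U, c):
    """Outer product (U c)(U c)^T of the direction lifted to d_out."""
    d = U @ c
    return np.outer(d, d)

c = np.full(r, 0.5)  # unit-norm direction in r-space

# Flipping c itself changes nothing: the operator is quadratic in c.
assert np.allclose(lifted_projector(U, c), lifted_projector(U, -c))

# A basis flip is absorbed when c's coordinates flip with it.
assert np.allclose(lifted_projector(U, c), lifted_projector(U2, signs * c))

# With c held fixed across the flip, the projector does change.
assert not np.allclose(lifted_projector(U, c), lifted_projector(U2, c))
```

For a trained direction the third assertion is the relevant one: either sign convention reaches the same set of operators, which is why no canonicalization step is required.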

VERDICT: MATCHES (math-derived; the paper is silent on SVD sign, as noted above). No
canonicalization required and none missing.

---

## 5. Citation check

Docstring (:16-17, :23):

> This is the trainable form of directional ablation (Arditi+ 2024 [...]). Refs: [...] directional ablation Arditi+ 2024 arXiv:2406.11717.

Paper title/authors (arditi md:1-21):

> # Refusal in Language Models Is Mediated by a Single Direction
> Andy Arditi [...] Oscar Obeso [...] Aaquib Syed [...] Daniel Paleka [...] Nina Rimsky [...] Wes Gurnee [...] Neel Nanda

Observation: arXiv:2406.11717 = "Refusal in Language Models Is Mediated by a Single
Direction", first author surname Arditi. The term "directional ablation" is the
paper's own (§2.4 heading, md:148). The arXiv id, surname, year, and method name all
match.

cov_orient / CorDA attribution. Code config comment (:50) and group_init docstring
(:94) say "CorDA" by name:

> CorDA-orient the basis from input covariance (group_init [...]) (:50)
> re-orient each target's SVD by input covariance C=E[x xᵀ] (CorDA) (:94)

CorDA paper core (corda md:23, line "C=XXᵀ", and `SVD(WC)=UΣVᵀ`, reconstruct
`Ŵ=UΣVᵀC⁻¹`):

> obtain the covariance matrix of the input activation [...] C=XXᵀ [...] perform singular value decomposition for the weight multiplied by the covariance matrix, i.e. SVD(WC)=UΣVᵀ [...] the inverse of these covariance matrices is multiplied with the decomposed components to hold the same inference result with the original model

Observation: CorDA IS attributed by name (:50, :94). One math nuance worth recording
(not a citation error): CorDA whitens with `C` on one side and reconstructs with
`C⁻¹` (`SVD(WC)`, then `VᵀC⁻¹`). The code instead uses SYMMETRIC whitening
`SVD(W C^{1/2})` with `Pr = Vh C^{-1/2}` (group_init:145-150). Both preserve the
forward map (`W C^{1/2} · C^{-1/2} = W`) and both put data-relevant output
directions in the top-r, but the resulting basis is not identical to the one from
CorDA's `WC`. This is an acknowledged variant (task list #22 "Ablate whitening: C^1/2
(mine) vs C (PEFT)"), so the "CorDA-orient" name is a fair attribution of the IDEA
(covariance-oriented SVD with inverse-covariance reconstruction), with the symmetric
square-root being this repo's choice.
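
Both factorizations can be compared on a toy problem. This is a NumPy sketch, not the repo's `group_init`: the sizes, the anisotropic synthetic inputs, and the `/ n` normalization of the covariance are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n, r = 6, 5, 64, 2  # hypothetical sizes
W = rng.standard_normal((d_out, d_in))

# Synthetic anisotropic inputs (columns are samples) and their covariance estimate.
X = rng.standard_normal((d_in, n)) * np.array([3.0, 1.0, 1.0, 0.5, 0.2])[:, None]
C = X @ X.T / n  # symmetric positive definite here because n > d_in

# Symmetric square root and inverse square root via the eigendecomposition of C.
evals, evecs = np.linalg.eigh(C)
C_half = (evecs * np.sqrt(evals)) @ evecs.T
C_neg_half = (evecs / np.sqrt(evals)) @ evecs.T

# Symmetric variant as described above: SVD(W C^{1/2}), Pr = Vh C^{-1/2}.
U_s, S_s, Vh_s = np.linalg.svd(W @ C_half, full_matrices=False)
Pr = Vh_s @ C_neg_half
assert np.allclose((U_s * S_s) @ Pr, W)

# CorDA as quoted: SVD(W C), reconstruct with C^{-1}.
U_c, S_c, Vh_c = np.linalg.svd(W @ C, full_matrices=False)
assert np.allclose((U_c * S_c) @ Vh_c @ np.linalg.inv(C), W)

# Cosines of the principal angles between the two top-r output subspaces.
# Values below 1 mean the subspaces differ.
cosines = np.linalg.svd(U_s[:, :r].T @ U_c[:, :r], compute_uv=False)
print("principal-angle cosines:", cosines)
```

Both assertions confirm that the full-rank forward map is preserved exactly in either scheme; the printed cosines show how far the two top-r subspaces are from each other for this particular toy covariance.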

VERDICT: MATCHES (Arditi citation correct: id, surname, method name). CorDA is
attributed by name; the C^{1/2} symmetric-whitening detail is a labeled variant of
CorDA's C/C⁻¹, not a mis-citation.

---

## Bottom line

No real bugs: the operator reproduces Arditi's outer-product ablation faithfully
inside each layer's rank-r output subspace, is a proven contraction for the documented
coeff∈[0,1]/alpha∈[0,1] range, needs no SVD sign canonicalization, and both Arditi and
CorDA are correctly attributed (the per-layer-learned direction and C^{1/2} whitening
are deliberate, self-documented design choices, not deviations from a claimed
reproduction).