- antipasto_rot: add rotate_basis="both" (independent V+U Cayley rotations), run_id suffix __rotU/__rotboth so ablation arms get their own output dirs
- justfile: thread rotate_basis through bench-variant
- corda/eva: padding-mask fix in calibration capture + bf16-tight residual
- README: fill PiSSA/DoRA/CorDA/ASVD/ablate/dplr/rot rows; record the metric-axis ablation (C=I 56.0 > diag-C 55.6 > full-C 54.7) and the rotation ablation (V 57.2 > U 56.5 > both 55.6) conclusions
- docs/reviews: external ref-checks + deepseek/gpt reviews of the cores

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Reference-fidelity check: antipasto_ablate.py vs Arditi+ 2024 (arXiv:2406.11717)
Scope: math/algorithm fidelity + citation only. Fail-fast research code, so no
defensive-programming, None-check, or backward-compat flags. Secondary cov_orient
block checked against CorDA (arXiv:2406.05223).
Files:
- impl: /media/wassname/SGIronWolf/projects5/2026/lite/lora-lite/src/lora_lite/variants/antipasto_ablate.py
- paper: /media/wassname/SGIronWolf/projects5/2026/lite/lora-lite/docs/papers/md/arditi_2406.11717.md
- secondary: /media/wassname/SGIronWolf/projects5/2026/lite/lora-lite/docs/papers/md/corda_2406.05223.md
1. Ablation operator: residual-stream projection vs per-layer output-singular projection
Paper, directional ablation operator (arditi md:153-157):
𝐱′ ← 𝐱 − 𝐫̂ 𝐫̂⊺ 𝐱. [...] We perform this operation at every activation 𝐱ᵢ(l) and 𝐱̃ᵢ(l), across all layers l and all token positions i. This effectively prevents the model from ever representing this direction in its residual stream.
Paper, the weight-space equivalent (the form our code structurally resembles), Eq 5 (arditi md:246-248):
W_out′ ← W_out − 𝐫̂ 𝐫̂⊺ W_out. [...] the matrices that write to the residual stream are: the embedding matrix, the positional embedding matrix, attention out matrices, and MLP out matrices. Orthogonalizing all of these matrices [...] with respect to the direction 𝐫̂ effectively prevents the model from ever writing 𝐫̂ to its residual stream.
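Sketch (not from the repo or the paper's code): a minimal NumPy check of why Eq 5 is the weight-space equivalent of Eq 4 for a single residual writer. The matrix W_out, the input x, and the dimensions are hypothetical stand-ins chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_in = 8, 5  # hypothetical sizes

r = rng.normal(size=d_model)
r_hat = r / np.linalg.norm(r)              # unit direction in the residual stream

W_out = rng.normal(size=(d_model, d_in))   # hypothetical residual-writing matrix
x = rng.normal(size=d_in)                  # hypothetical input to that matrix

# Eq 5: orthogonalize the weight with respect to r_hat.
W_out_ablated = W_out - np.outer(r_hat, r_hat) @ W_out

# Eq 4 applied to what the original matrix would have written.
y = W_out @ x
y_ablated = y - r_hat * (r_hat @ y)

same_output = np.allclose(W_out_ablated @ x, y_ablated)
written_along_r = float(r_hat @ (W_out_ablated @ x))
```

Both routes give the same output, and the orthogonalized matrix writes nothing along r_hat (up to floating-point error).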
Code core (antipasto_ablate.py:182-186):
h = (x @ Vh.T) * S # (..., r) output S-coords
proj = h @ Chat # (..., k)
h = h - coeff * (proj * alpha) @ Chat.T # contractive removal in r-dim U-space
return y + h @ U.T # map back to d_out
Observation (math): the columns of Chat are orthonormal in the r-dim coordinate space that indexes
the columns of U (lora_U is (d_out, r), Chat is (r, k); the ablation at :183-185 acts on
h, which lives in this r-space, before h @ U.T lifts to d_out at :186). Since U has
orthonormal columns, a unit direction c in r-space maps to the unit direction
U @ c in d_out (residual) space, and the projector obeys
U (I - Chat Chatᵀ) Uᵀ = U Uᵀ - (U Chat)(U Chat)ᵀ, which equals
I_{d_out} - (U Chat)(U Chat)ᵀ on the column space of U. So
within the rank-r output subspace this is exactly an Arditi-style outer-product
projector I - d̂ d̂ᵀ with d̂ = U @ ĉ.
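Sketch (NumPy stand-in, not the repo's torch code): a numerical check of the projector identity above. U and Chat are random orthonormal-column matrices built with QR; the shapes follow the (d_out, r) and (r, k) convention quoted from the file, and the concrete sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, r, k = 12, 4, 2  # hypothetical sizes

# Orthonormal columns via reduced QR (stand-ins for lora_U and Chat).
U, _ = np.linalg.qr(rng.normal(size=(d_out, r)))   # (d_out, r)
Chat, _ = np.linalg.qr(rng.normal(size=(r, k)))    # (r, k)

D = U @ Chat                                        # lifted directions, (d_out, k)
lhs = U @ (np.eye(r) - Chat @ Chat.T) @ U.T
rhs = U @ U.T - D @ D.T                             # U Uᵀ is the identity on col(U)

lifted_orthonormal = np.allclose(D.T @ D, np.eye(k))
identity_holds = np.allclose(lhs, rhs)

# On a vector already inside col(U), lhs acts like I - D Dᵀ.
v = U @ rng.normal(size=r)
acts_like_projector = np.allclose(lhs @ v, v - D @ (D.T @ v))
```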
Inference (the mismatch, three parts):
- Per-layer, not shared. Arditi ablates ONE direction 𝐫̂ read off the residual stream and applies the SAME 𝐫̂ at every component (Eq 5 lists embed/attn-out/mlp-out all orthogonalized w.r.t. the same 𝐫̂). Our code learns a SEPARATE direction per target layer (lora_c is per-layer, param_specs:70). These are different objects: Arditi's d̂ is global; ours is a bouquet of per-layer d̂.
- Restricted to top-r U-span. Arditi's projector acts on the full d_model residual vector. Ours can only remove components that lie in span(U[:, :r]); anything in W_res (the frozen remainder, init:86-87) is untouched. With cov_orient=False this is plain-SVD top-r, which need not contain the behavior direction (the docstring itself flags this: ":52-53 measured 1.00 vs 0.65 capture at r=16").
- Output-side only vs residual-stream. Arditi's residual-stream ablation (Eq 4) and its weight-equivalent (Eq 5) zero the direction the layer WRITES. Our y + h @ U.T subtracts from this layer's additive output contribution, which IS a residual writer for down_proj/o_proj. The docstring scopes the variant to exactly those ("target residual writers (down_proj, o_proj)", :17-18), so for those targets the output-side framing is the correct analogue of Eq 5. For a non-writer Linear it would not be (correctly excluded by design).
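Sketch of the "restricted to top-r U-span" point, under the assumption that y in the quoted forward is the frozen-remainder output x @ W_res.T and that W_res = W - (U S) Vh; the function forward below is a hypothetical NumPy re-statement of the quoted core with coeff = alpha = 1, not the repo's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 6, 2  # hypothetical sizes

W = rng.normal(size=(d_out, d_in))
U_full, S_full, Vh_full = np.linalg.svd(W, full_matrices=False)
U, S, Vh = U_full[:, :r], S_full[:r], Vh_full[:r]
W_res = W - (U * S) @ Vh            # assumed frozen remainder

def forward(x, Chat):
    """Quoted core with coeff = alpha = 1 (assumption)."""
    y = x @ W_res.T                 # frozen path
    h = (x @ Vh.T) * S              # (..., r) output S-coords
    h = h - (h @ Chat) @ Chat.T     # removal inside r-space
    return y + h @ U.T

Chat = np.eye(r)[:, :1]             # ablate the first r-space coordinate
x = rng.normal(size=(4, d_in))
base = x @ W.T
out = forward(x, Chat)

d_outside = U_full[:, r]            # a direction living in W_res, outside span(U)
outside_unchanged = np.allclose(out @ d_outside, base @ d_outside)
inside_removed = np.allclose(out @ U[:, 0], 0.0)
```

The component along U[:, 0] is erased, while the component along a left singular vector outside the top-r block passes through unchanged, whatever Chat is.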
VERDICT: DEVIATES-OK. The single-direction outer-product structure is faithfully reproduced inside each layer's output subspace, and for residual-writer targets the output-side action matches Eq 5's "don't write 𝐫̂". The deviations (per-layer learned direction instead of one shared difference-in-means direction; confined to top-r) are deliberate design choices, not the paper's algorithm, and the docstring's framing ("trainable form of directional ablation") signals the analogy rather than claiming identity. No bug, but the correspondence is "Arditi-style projector, per-layer, within rank-r" — not the verbatim global single-direction ablation.
2. Contraction / "ablation cannot amplify"
The paper does not state a contraction theorem in the algebraic sense, but the operator
itself is a projection. The closest textual support is Eq 4 (arditi md:155), the
orthogonal projector I - 𝐫̂𝐫̂ᵀ (idempotent, eigenvalues {0,1}), and the paper
repeatedly frames ablation as removal/erasure, never amplification:
Directional ablation "zeroes out" the component along 𝐫̂ for every residual stream activation (arditi md:151)
these loss metrics suggest that directional ablation is more surgical than activation addition based methods (arditi md:323)
(The paper's amplification is a SEPARATE intervention — activation addition, Eq 3, md:144 — confirming ablation is the non-amplifying side.)
Code (forward:179, 185 and clamp):
alpha = layer.lora_alpha.to(x.dtype).clamp(0.0, 1.0) # :179
h = h - coeff * (proj * alpha) @ Chat.T # :185
Observation (math): with coeff=1 the core operator on r-space is
M = I - Chat diag(alpha) Chatᵀ. Chat orthonormal ⇒ eigenvalues are 1 - alphaⱼ
along each chat_j and 1 on the orthogonal complement. With alpha clamped to [0,1]
(:179), every eigenvalue lies in [0,1]. So M is a contraction (operator norm ≤ 1):
it cannot amplify. Arditi's exact projector is the alpha=1 endpoint (eigenvalue 0 =
full erasure); alpha<1 is partial ablation, still non-amplifying.
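Sketch (NumPy, hypothetical sizes and alpha values): the eigenvalue argument above checked numerically for coeff = 1, with np.clip standing in for the torch clamp at :179.

```python
import numpy as np

rng = np.random.default_rng(0)
r, k = 6, 3  # hypothetical sizes

Chat, _ = np.linalg.qr(rng.normal(size=(r, k)))   # orthonormal columns, (r, k)

alpha_raw = np.array([-0.3, 0.4, 1.7])            # deliberately out of range
alpha = np.clip(alpha_raw, 0.0, 1.0)              # clamp to [0, 1]

# Core operator on r-space with coeff = 1.
M = np.eye(r) - Chat @ np.diag(alpha) @ Chat.T

eigvals = np.sort(np.linalg.eigvalsh(M))          # M is symmetric
op_norm = np.linalg.norm(M, 2)                    # largest singular value
```

The spectrum is 1 - alpha_j on each ablated direction and 1 on the complement, so after the clamp it sits in [0, 1] and the operator norm does not exceed 1.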
Caveat (the code's own warning, :20-21, :49): coeff is OUTSIDE the clamp. coeff<0
flips the sign and ADDS the direction back (eigenvalue 1 + |coeff|·alpha > 1 whenever
alpha > 0), and coeff>1 over-subtracts (eigenvalue 1 - coeff·alpha goes negative once
coeff·alpha > 1, i.e. the component is reflected, and exceeds 1 in magnitude once
coeff·alpha > 2). Eigenvalues are guaranteed to stay in [0,1] only for coeff ∈ [0,1];
the docstring flags the growing side explicitly ("<0 adds the direction back (the side
that can grow, so bound coeff there)"). That is correct and self-documented, not a
silent bug.
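Sketch of the coeff caveat (NumPy, hypothetical values): the operator norm of I - coeff · Chat diag(alpha) Chatᵀ for a few coeff settings, with alpha already inside [0, 1]. The helper operator_norm is illustrative and not part of the repo.

```python
import numpy as np

rng = np.random.default_rng(0)
r, k = 5, 2  # hypothetical sizes
Chat, _ = np.linalg.qr(rng.normal(size=(r, k)))
alpha = np.array([1.0, 0.5])        # already clamped to [0, 1]

def operator_norm(coeff):
    """Spectral norm of the r-space operator for a given coeff."""
    M = np.eye(r) - coeff * Chat @ np.diag(alpha) @ Chat.T
    return np.linalg.norm(M, 2)

norm_add_back = operator_norm(-0.5)   # eigenvalues 1.5 and 1.25 on the directions
norm_in_range = operator_norm(0.5)    # eigenvalues 0.5 and 0.75
norm_reflect = operator_norm(1.5)     # eigenvalues -0.5 and 0.25: sign flip only
norm_overshoot = operator_norm(2.5)   # eigenvalues -1.5 and -0.25
```

Only the negative-coeff and large-coeff settings push the norm above 1; a coeff slightly above 1 reflects the component without amplifying it.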
VERDICT: MATCHES (for the documented operating range coeff∈[0,1], alpha∈[0,1]). Eigenvalues in [0,1] confirmed; the alpha clamp is the load-bearing line. The coeff<0 amplifying branch is intentional and flagged.
3. Direction source: fixed difference-in-means vs trainable
Paper, the direction is a FIXED difference-of-means, selected once (arditi md:124, 134):
We then compute the difference-in-means vector 𝐫ᵢ(l) = 𝛍ᵢ(l) − 𝛎ᵢ(l).
We notate the selected vector as 𝐫, and its corresponding unit-norm vector as 𝐫̂.
i.e. 𝐫̂ is computed from mean(harmful) − mean(harmless) and then frozen; the paper performs NO gradient descent on it ("does not require gradient-based optimization", md:236).
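Sketch of the paper's gradient-free direction source on synthetic data (NumPy). The activations, the planted direction, and the helper ablate are all hypothetical; the paper additionally selects the vector across layers and token positions, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 16, 200  # hypothetical sizes

# Synthetic activations: the "harmful" set is shifted along a planted axis.
planted = np.zeros(d_model)
planted[0] = 1.0
harmless = rng.normal(size=(n, d_model))
harmful = rng.normal(size=(n, d_model)) + 3.0 * planted

mu = harmful.mean(axis=0)           # mean harmful activation
nu = harmless.mean(axis=0)          # mean harmless activation
r_vec = mu - nu                     # difference-in-means vector
r_hat = r_vec / np.linalg.norm(r_vec)

def ablate(x, direction):
    """Directional ablation x - r_hat r_hatᵀ x, applied row-wise."""
    return x - np.outer(x @ direction, direction)

ablated = ablate(harmful, r_hat)
residual_along_r = np.abs(ablated @ r_hat).max()
```

No optimizer appears anywhere: the direction is a closed-form statistic that is computed once and then frozen, which is the contrast with a trainable lora_c.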
Code: lora_c is a trainable parameter (param_specs:70 trainable default; random
normal init), optionally warm-started from a contrastive S-space direction dS
(docstring init:88-90 "group_init() should warm-start lora_c from the S-space
contrastive direction dS").
Observation: ours is structurally the same OBJECT (a unit direction that gets projected out) but obtained by a different procedure (SGD, optionally seeded from a contrastive diff) rather than a closed-form difference-of-means.
Inference: calling it "the trainable form of directional ablation (Arditi+ 2024)" (:16-17) is defensible AS A FRAMING: the operator (Eq 4/5 outer-product removal) is Arditi's; the novelty claimed is making the direction learnable. The optional warm-start from a contrastive dS is even closer to Arditi's diff-of-means seed. This is a fair "trainable variant of X" claim, not a misattribution. It would be wrong to claim Arditi's method (which is explicitly gradient-free) — but the code claims the ablation operator, not the extraction method.
VERDICT: DEVIATES-OK. "Trainable form of directional ablation" is an honest framing: same operator, deliberately different (learned) direction source. Recommend the docstring keep the word "form"/"trainable" prominent so it is not read as reproducing Arditi's gradient-free diff-of-means extraction.
4. SVD sign disambiguation
Paper: SILENT on singular-vector sign. Arditi never does an SVD of a weight to get its ablation direction — 𝐫̂ comes from difference-of-means (md:117-124), and the weight-orthogonalization Eq 5 (md:246) uses the outer product 𝐫̂𝐫̂ᵀ. So the paper offers no quote on SVD sign; I reason from the math.
Code: ablation core is the outer product Chat Chatᵀ (forward:183-185), and U is
orthonormal from torch.linalg.svd (init:80).
Observation (math):
- Chat enters only as Chat Chatᵀ (proj = h @ Chat, then @ Chat.T, :183-185). Flipping any column sign chat_j → −chat_j leaves chat_j chat_jᵀ unchanged ⇒ the operator is sign-invariant in c. (lora_c is trainable anyway, so its sign is not even an SVD artifact.)
- U enters only as h @ U.T (:186) AND the projector identity from point 1 is (U c)(U c)ᵀ; a sign flip u_i → −u_i with the matching vh_i → −vh_i leaves W = (U S) Vh invariant (init:86) and flips the i-th coordinate of h, so (U c)(U c)ᵀ is preserved once c is expressed in the flipped basis (c_i → −c_i). Because c is learned, or warm-started from a dS computed in the same basis, that re-signing is absorbed automatically. So neither the reconstructed weight nor the family of ablation projectors depends on per-vector sign.
- S is non-negative by SVD definition; no sign issue.
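Sketch of the sign argument (NumPy stand-in for torch.linalg.svd; sizes and the helper ablated_path are hypothetical). It checks three things: the weight is unchanged under paired sign flips, the ablation is unchanged under c → −c, and the ablation is unchanged under a basis flip when c is carried into the flipped coordinates, which is an assumption about how a fixed c would be re-expressed.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 7, 5, 3  # hypothetical sizes

W = rng.normal(size=(d_out, d_in))
U, S, Vh = np.linalg.svd(W, full_matrices=False)
U, S, Vh = U[:, :r], S[:r], Vh[:r]

c = rng.normal(size=r)
c /= np.linalg.norm(c)              # unit direction in r-space (k = 1)

signs = np.array([1.0, -1.0, -1.0])
U2 = U * signs                      # u_i -> -u_i for flipped columns
Vh2 = Vh * signs[:, None]           # matching vh_i -> -vh_i
c2 = c * signs                      # c re-expressed in the flipped basis

def ablated_path(x, U, S, Vh, c):
    """Rank-r path with one direction fully removed (coeff = alpha = 1)."""
    h = (x @ Vh.T) * S
    h = h - np.outer(h @ c, c)
    return h @ U.T

x = rng.normal(size=(4, d_in))
weight_same = np.allclose((U * S) @ Vh, (U2 * S) @ Vh2)
c_sign_free = np.allclose(ablated_path(x, U, S, Vh, c),
                          ablated_path(x, U, S, Vh, -c))
basis_sign_free = np.allclose(ablated_path(x, U, S, Vh, c),
                              ablated_path(x, U2, S, Vh2, c2))
```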
Inference: no sign canonicalization is needed anywhere in this file. The ablation is a quadratic form in both the (trainable, hence sign-free) direction and the orthonormal basis, and every place U/Vh appear they appear in sign-paired products or outer products. This is the correct situation (a basis/span use, not a sign-sensitive coordinate use).
VERDICT: MATCHES (math-derived; paper silent on SVD sign, stated). No canonicalization required and none missing.
5. Citation check
Docstring (:16-17, :23):
This is the trainable form of directional ablation (Arditi+ 2024 [...]).
Refs: [...] directional ablation Arditi+ 2024 arXiv:2406.11717.
Paper title/authors (arditi md:1-21):
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi [...] Oscar Obeso [...] Aaquib Syed [...] Daniel Paleka [...] Nina Rimsky [...] Wes Gurnee [...] Neel Nanda
Observation: arXiv:2406.11717 = "Refusal in Language Models Is Mediated by a Single Direction", first author surname Arditi. The term "directional ablation" is the paper's own (§2.4 heading, md:148). The arXiv id, surname, year, and method name all match.
cov_orient / CorDA attribution. Code config comment (:50) and group_init docstring (:94) say "CorDA" by name:
CorDA-orient the basis from input covariance (group_init [...]) (:50)
re-orient each target's SVD by input covariance C=E[x xᵀ] (CorDA) (:94)
CorDA paper core (corda md:23, line "C=XXᵀ", and SVD(WC)=UΣVᵀ, reconstruct
Ŵ=UΣVᵀC⁻¹):
obtain the covariance matrix of the input activation [...] C=XXᵀ [...] perform singular value decomposition for the weight multiplied by the covariance matrix, i.e. SVD(WC)=UΣVᵀ [...] the inverse of these covariance matrices is multiplied with the decomposed components to hold the same inference result with the original model
Observation: CorDA IS attributed by name (:50, :94). One math nuance worth recording
(not a citation error): CorDA whitens with C on one side and reconstructs with
C⁻¹ (SVD(WC), then VᵀC⁻¹). The code instead uses SYMMETRIC whitening
SVD(W C^{1/2}) with Pr = Vh C^{-1/2} (group_init:145-150). Both preserve the
forward map (W C^{1/2} · C^{-1/2} = W) and both put data-relevant output
directions in the top-r, but the orientation is not bit-identical to CorDA's WC.
This is an acknowledged variant (task list #22 "Ablate whitening: C^1/2 (mine) vs C
(PEFT)"), so the "CorDA-orient" name is a fair attribution of the IDEA
(covariance-oriented SVD with inverse-covariance reconstruction), with the symmetric
square-root being this repo's choice.
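Sketch comparing the two whitening choices (NumPy, synthetic W and X; not the repo's group_init). It assumes C = X Xᵀ is positive definite so that C⁻¹ and C^{-1/2} exist, and builds the symmetric square root from an eigendecomposition. The name Pr follows the quoted code; everything else is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n = 6, 5, 50  # hypothetical sizes

W = rng.normal(size=(d_out, d_in))
scales = np.array([5.0, 1.0, 1.0, 0.5, 0.2])   # anisotropic synthetic inputs
X = scales[:, None] * rng.normal(size=(d_in, n))
C = X @ X.T                                     # input covariance (unnormalized)

# CorDA as quoted: SVD(W C), reconstruct with C^{-1}.
U1, S1, Vh1 = np.linalg.svd(W @ C, full_matrices=False)
W_corda = (U1 * S1) @ Vh1 @ np.linalg.inv(C)

# Symmetric variant: SVD(W C^{1/2}), Pr = Vh C^{-1/2}.
evals, evecs = np.linalg.eigh(C)
C_half = (evecs * np.sqrt(evals)) @ evecs.T
C_inv_half = (evecs / np.sqrt(evals)) @ evecs.T
U2, S2, Vh2 = np.linalg.svd(W @ C_half, full_matrices=False)
Pr = Vh2 @ C_inv_half
W_sym = (U2 * S2) @ Pr

# Distance between the top-2 output subspaces of the two orientations.
P1 = U1[:, :2] @ U1[:, :2].T
P2 = U2[:, :2] @ U2[:, :2].T
subspace_gap = np.linalg.norm(P1 - P2)
```

Both factorizations reproduce W at full rank; subspace_gap measures how far apart the two top-r orientations are for this particular synthetic C.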
VERDICT: MATCHES (Arditi citation correct: id, surname, method name). CorDA is attributed by name; the C^{1/2} symmetric-whitening detail is a labeled variant of CorDA's C/C⁻¹, not a mis-citation.
Bottom line
No real bugs: the operator reproduces Arditi's outer-product ablation faithfully inside each layer's rank-r output subspace, is a proven contraction for the documented coeff∈[0,1]/alpha∈[0,1] range, needs no SVD sign canonicalization, and both Arditi and CorDA are correctly attributed (the per-layer-learned direction and C^{1/2} whitening are deliberate, self-documented design choices, not deviations from a claimed reproduction).