lora-lite/docs/reviews/ref_check_ablate.md
wassname 5f9d90d8b8 benchmark sweep: rot(U/both) ablation, whitening conclusion, cost rows
- antipasto_rot: add rotate_basis="both" (independent V+U Cayley rotations),
  run_id suffix __rotU/__rotboth so ablation arms get their own output dirs
- justfile: thread rotate_basis through bench-variant
- corda/eva: padding-mask fix in calibration capture + bf16-tight residual
- README: fill PiSSA/DoRA/CorDA/ASVD/ablate/dplr/rot rows; record the
  metric-axis ablation (C=I 56.0 > diag-C 55.6 > full-C 54.7) and the
  rotation ablation (V 57.2 > U 56.5 > both 55.6) conclusions
- docs/reviews: external ref-checks + deepseek/gpt reviews of the cores

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-17 06:17:53 +08:00


Reference-fidelity check: antipasto_ablate.py vs Arditi+ 2024 (arXiv:2406.11717)

Scope: math/algorithm fidelity + citation only. Fail-fast research code, so no defensive-programming, None-check, or backward-compat flags. Secondary cov_orient block checked against CorDA (arXiv:2406.05223).

Files:

  • impl: /media/wassname/SGIronWolf/projects5/2026/lite/lora-lite/src/lora_lite/variants/antipasto_ablate.py
  • paper: /media/wassname/SGIronWolf/projects5/2026/lite/lora-lite/docs/papers/md/arditi_2406.11717.md
  • secondary: /media/wassname/SGIronWolf/projects5/2026/lite/lora-lite/docs/papers/md/corda_2406.05223.md

1. Ablation operator: residual-stream projection vs per-layer output-singular projection

Paper, directional ablation operator (arditi md:153-157):

𝐱′ ← 𝐱 − 𝐫̂ 𝐫̂⊺ 𝐱. [...] We perform this operation at every activation 𝐱ᵢ(l) and 𝐱̃ᵢ(l), across all layers l and all token positions i. This effectively prevents the model from ever representing this direction in its residual stream.

Paper, the weight-space equivalent (the form our code structurally resembles), Eq 5 (arditi md:246-248):

W′_out ← W_out − 𝐫̂ 𝐫̂⊺ W_out. [...] the matrices that write to the residual stream are: the embedding matrix, the positional embedding matrix, attention out matrices, and MLP out matrices. Orthogonalizing all of these matrices [...] with respect to the direction 𝐫̂ effectively prevents the model from ever writing 𝐫̂ to its residual stream.
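The equivalence between the activation-space form (Eq 4) and the weight-space form (Eq 5) can be checked numerically. The following is a minimal sketch, assuming a random stand-in for a residual-writing matrix `W_out` of shape (d_model, d_in) and a random unit direction `r_hat`; the shapes and names are illustrative and are not taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_in = 8, 5

# Hypothetical stand-ins: a matrix that writes to the residual stream
# and a unit-norm direction in residual space.
W_out = rng.standard_normal((d_model, d_in))
r = rng.standard_normal(d_model)
r_hat = r / np.linalg.norm(r)

# Weight-space form (Eq 5): orthogonalize the writer with respect to r_hat.
W_ablated = W_out - np.outer(r_hat, r_hat) @ W_out

# Activation-space form (Eq 4), applied to what the layer writes.
x = rng.standard_normal(d_in)
y = W_out @ x
y_ablated = y - r_hat * (r_hat @ y)

# Both routes give the same output, and the edited matrix never writes r_hat.
equivalent = np.allclose(W_ablated @ x, y_ablated)
max_write_along_r_hat = np.abs(r_hat @ W_ablated).max()
print(equivalent, max_write_along_r_hat)
```

Under these assumptions the two outputs agree and the component of every column of `W_ablated` along `r_hat` is zero up to floating-point error.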

Code core (antipasto_ablate.py:182-186):

h = (x @ Vh.T) * S                          # (..., r) output S-coords
proj = h @ Chat                             # (..., k)
h = h - coeff * (proj * alpha) @ Chat.T     # contractive removal in r-dim U-space
return y + h @ U.T                          # map back to d_out

Observation (math): the columns of Chat are orthonormal in the r-dim coordinate space that indexes the columns of U (lora_U is (d_out, r), Chat is (r, k); the ablation at :183-185 acts on h, which lives in this r-space, before h @ U.T lifts to d_out at :186). Since U has orthonormal columns, a unit direction c in r-space maps to the unit direction U @ c in d_out (residual) space, and U (I - Chat Chatᵀ) Uᵀ = U Uᵀ - (U Chat)(U Chat)ᵀ, which agrees with I_{d_out} - (U Chat)(U Chat)ᵀ on the column space of U. So within the rank-r output subspace this is exactly an Arditi-style outer-product projector I - d̂ d̂ᵀ with d̂ = U @ ĉ.
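The projector identity can be checked on random orthonormal factors. This is a minimal sketch, assuming NumPy arrays in place of the module's tensors; `U` and `Chat` here are random orthonormal matrices built with QR, not the values the variant computes.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, r, k = 12, 4, 2

# Random stand-ins with orthonormal columns (reduced QR).
U, _ = np.linalg.qr(rng.standard_normal((d_out, r)))   # (d_out, r)
Chat, _ = np.linalg.qr(rng.standard_normal((r, k)))    # (r, k)

# Lifted directions in d_out space stay orthonormal.
D = U @ Chat
lifted_orthonormal = np.allclose(D.T @ D, np.eye(k))

lhs = U @ (np.eye(r) - Chat @ Chat.T) @ U.T
rhs = np.eye(d_out) - D @ D.T

# On a vector inside the column space of U the two operators agree.
v_in = U @ rng.standard_normal(r)
agrees_on_span = np.allclose(lhs @ v_in, rhs @ v_in)

# On a generic vector they differ, because lhs also drops everything outside span(U).
v_any = rng.standard_normal(d_out)
differs_off_span = not np.allclose(lhs @ v_any, rhs @ v_any)
print(lifted_orthonormal, agrees_on_span, differs_off_span)
```

The last check illustrates why the qualifier "on the column space of U" matters: outside that subspace the left-hand side is not the full-space projector.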

Inference (the mismatch, three parts):

  1. Per-layer, not shared. Arditi ablates ONE direction 𝐫̂ read off the residual stream and applies the SAME 𝐫̂ at every component (Eq 5 lists embed/attn-out/mlp-out all orthogonalized w.r.t. the same 𝐫̂). Our code learns a SEPARATE direction per target layer (lora_c is per-layer, param_specs:70). These are different objects: Arditi's d̂ is global; ours is a bouquet of per-layer d̂.
  2. Restricted to top-r U-span. Arditi's projector acts on the full d_model residual vector. Ours can only remove components that lie in span(U[:, :r]); anything in W_res (the frozen remainder, init:86-87) is untouched. With cov_orient=False this is plain-SVD top-r, which need not contain the behavior direction (the docstring itself flags this at :52-53: "measured 1.00 vs 0.65 capture at r=16").
  3. Output-side only vs residual-stream. Arditi's residual-stream ablation (Eq 4) and its weight-equivalent (Eq 5) zero the direction the layer WRITES. Our y + h@U.T subtracts from this layer's additive output contribution, which IS a residual writer for down_proj/o_proj. The docstring scopes the variant to exactly those ("target residual writers (down_proj, o_proj)", :17-18), so for those targets the output-side framing is the correct analogue of Eq 5. For a non-writer Linear it would not be (correctly excluded by design).
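The top-r restriction in point 2 can be illustrated with a toy layer. This is a rough sketch, assuming a plain-SVD split W = U S Vh + W_res and a single unit direction `c_hat` in r-space; the `forward` function below is a hypothetical simplification that mirrors the quoted core, not the module's actual forward.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 10, 8, 3

W = rng.standard_normal((d_out, d_in))
U_full, S_full, Vh_full = np.linalg.svd(W, full_matrices=False)
U, S, Vh = U_full[:, :r], S_full[:r], Vh_full[:r]
W_res = W - (U * S) @ Vh          # frozen remainder outside the top-r subspace

c = rng.standard_normal(r)
c_hat = c / np.linalg.norm(c)     # hypothetical unit direction in r-space

def forward(x, ablate):
    """Toy layer output: frozen remainder plus the (optionally ablated) top-r path."""
    h = (Vh @ x) * S
    if ablate:
        h = h - (h @ c_hat) * c_hat
    return W_res @ x + U @ h

x = rng.standard_normal(d_in)
y_plain, y_ablated = forward(x, False), forward(x, True)

inside = U @ c_hat                # ablated direction, lifted to d_out
outside = U_full[:, r]            # a left singular direction outside the top-r span

reconstructs_W = np.allclose(y_plain, W @ x)
removed_inside = np.isclose(inside @ y_ablated, 0.0, atol=1e-10)
unchanged_outside = np.isclose(outside @ y_plain, outside @ y_ablated)
print(reconstructs_W, removed_inside, unchanged_outside)
```

In this sketch the output component along `U @ c_hat` is removed, while the component along a singular direction outside the top-r span passes through unchanged.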

VERDICT: DEVIATES-OK. The single-direction outer-product structure is faithfully reproduced inside each layer's output subspace, and for residual-writer targets the output-side action matches Eq 5's "don't write 𝐫̂". The deviations (per-layer learned direction instead of one shared difference-in-means direction; confined to top-r) are deliberate design choices, not the paper's algorithm, and the docstring's framing ("trainable form of directional ablation") signals the analogy rather than claiming identity. No bug, but the correspondence is "Arditi-style projector, per-layer, within rank-r" — not the verbatim global single-direction ablation.


2. Contraction / "ablation cannot amplify"

The paper does not state a contraction theorem in the algebraic sense, but the operator itself is a projection. The closest textual support is Eq 4 (arditi md:155), which is the orthogonal projector I - 𝐫̂𝐫̂ᵀ (idempotent, eigenvalues {0,1}); the paper also repeatedly frames ablation as removal/erasure, never amplification:

Directional ablation "zeroes out" the component along 𝐫̂ for every residual stream activation (arditi md:151)

these loss metrics suggest that directional ablation is more surgical than activation addition based methods (arditi md:323)

(The paper's amplification is a SEPARATE intervention — activation addition, Eq 3, md:144 — confirming ablation is the non-amplifying side.)

Code (forward:179, 185 and clamp):

alpha = layer.lora_alpha.to(x.dtype).clamp(0.0, 1.0)   # :179
h = h - coeff * (proj * alpha) @ Chat.T                # :185

Observation (math): with coeff=1 the core operator on r-space is M = I - Chat diag(alpha) Chatᵀ. Chat orthonormal ⇒ eigenvalues are 1 - alphaⱼ along each chat_j and 1 on the orthogonal complement. With alpha clamped to [0,1] (:179), every eigenvalue lies in [0,1]. So M is a contraction (operator norm ≤ 1): it cannot amplify. Arditi's exact projector is the alpha=1 endpoint (eigenvalue 0 = full erasure); alpha<1 is partial ablation, still non-amplifying.
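The eigenvalue claim can be verified directly. The sketch below assumes coeff=1, a random orthonormal `Chat`, and arbitrary raw alpha values chosen so that the clamp is active on both ends; none of these numbers come from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)
r, k = 6, 3

# Random stand-in for Chat with orthonormal columns.
Chat, _ = np.linalg.qr(rng.standard_normal((r, k)))

# Raw gate values outside [0, 1] on purpose; the clamp maps them to [0.0, 0.3, 1.0].
alpha_raw = np.array([-0.5, 0.3, 1.7])
alpha = np.clip(alpha_raw, 0.0, 1.0)

# Core operator on r-space with coeff = 1.
M = np.eye(r) - Chat @ np.diag(alpha) @ Chat.T

eigvals = np.linalg.eigvalsh(M)          # ascending order, M is symmetric
spectral_norm = np.linalg.norm(M, 2)     # largest singular value
print(eigvals, spectral_norm)
```

The spectrum is 1 - alpha_j along each column of `Chat` (here 1, 0.7 and 0) and 1 on the three remaining directions, so the spectral norm is 1.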

Caveat (the code's own warning, :20-21, :49): coeff is OUTSIDE the clamp. coeff<0 flips the sign and ADDS the direction back (eigenvalue 1 + |coeff|·alpha > 1 whenever alpha > 0), and coeff>1 over-subtracts: the eigenvalue 1 - coeff·alpha goes negative once coeff·alpha > 1 (the component is sign-flipped rather than removed) and exceeds 1 in magnitude once coeff·alpha > 2. The eigenvalues-in-[0,1] guarantee therefore holds only for coeff ∈ [0,1], which the docstring states explicitly ("<0 adds the direction back (the side that can grow, so bound coeff there)"). That is correct and self-documented, not a silent bug.
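How the spectrum moves with coeff can be tabulated on a toy operator. This is a minimal sketch, assuming fixed gates alpha = (1.0, 0.5) and a random orthonormal `Chat`; the `spectrum` helper is hypothetical and only mirrors the algebra of the quoted line.

```python
import numpy as np

rng = np.random.default_rng(0)
r, k = 4, 2
Chat, _ = np.linalg.qr(rng.standard_normal((r, k)))   # orthonormal columns
alpha = np.array([1.0, 0.5])                          # already inside [0, 1]

def spectrum(coeff):
    """Eigenvalues of I - coeff * Chat diag(alpha) Chat^T (symmetric)."""
    M = np.eye(r) - coeff * (Chat * alpha) @ Chat.T
    return np.linalg.eigvalsh(M)

report = {}
for coeff in (-1.0, 0.5, 1.0, 1.5, 2.5):
    ev = spectrum(coeff)
    # Smallest eigenvalue and largest eigenvalue magnitude (the operator norm).
    report[coeff] = (ev.min(), np.abs(ev).max())
    print(f"coeff={coeff:+.1f}  min_eig={ev.min():+.2f}  op_norm={np.abs(ev).max():.2f}")
```

With these assumed gates, coeff=-1 gives an operator norm of 2 (amplification), coeff in [0,1] keeps every eigenvalue in [0,1], coeff=1.5 yields a negative eigenvalue with norm still 1, and coeff=2.5 pushes the norm to 1.5.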

VERDICT: MATCHES (for the documented operating range coeff∈[0,1], alpha∈[0,1]). Eigenvalues in [0,1] confirmed; the alpha clamp is the load-bearing line. The coeff<0 amplifying branch is intentional and flagged.


3. Direction source: fixed difference-in-means vs trainable

Paper, the direction is a FIXED difference-of-means, selected once (arditi md:124, 134):

We then compute the difference-in-means vector 𝐫ᵢ(l) = 𝛍ᵢ(l) − 𝛎ᵢ(l).

We notate the selected vector as 𝐫, and its corresponding unit-norm vector as 𝐫̂.

i.e. 𝐫̂ is computed from mean(harmful) − mean(harmless) and then frozen; the paper performs NO gradient descent on it ("does not require gradient-based optimization", md:236).
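For contrast with the trainable parameter, the closed-form extraction looks roughly like this. The sketch assumes synthetic activations in which the "harmful" set is shifted along one known axis; the data, sizes, and the `true_dir` axis are made up for illustration and do not reproduce the paper's layer or token selection step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 16, 200

# Synthetic activations: the harmful set is shifted along a known direction.
true_dir = np.zeros(d_model)
true_dir[0] = 1.0
harmless = rng.standard_normal((n, d_model))
harmful = rng.standard_normal((n, d_model)) + 3.0 * true_dir

# Difference-in-means, then normalize. No gradient step is involved.
mu = harmful.mean(axis=0)
nu = harmless.mean(axis=0)
r = mu - nu
r_hat = r / np.linalg.norm(r)

cosine_to_truth = float(r_hat @ true_dir)
print(cosine_to_truth)
```

A direction obtained this way could serve as a warm start for a learned one, which is the role the docstring assigns to the contrastive dS.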

Code: lora_c is a trainable parameter (param_specs:70 trainable default; random normal init), optionally warm-started from a contrastive S-space direction dS (docstring init:88-90 "group_init() should warm-start lora_c from the S-space contrastive direction dS").

Observation: ours is structurally the same OBJECT (a unit direction that gets projected out) but obtained by a different procedure (SGD, optionally seeded from a contrastive diff) rather than a closed-form difference-of-means.

Inference: calling it "the trainable form of directional ablation (Arditi+ 2024)" (:16-17) is defensible AS A FRAMING: the operator (Eq 4/5 outer-product removal) is Arditi's; the novelty claimed is making the direction learnable. The optional warm-start from a contrastive dS is even closer to Arditi's diff-of-means seed. This is a fair "trainable variant of X" claim, not a misattribution. It would be wrong to claim Arditi's method (which is explicitly gradient-free) — but the code claims the ablation operator, not the extraction method.

VERDICT: DEVIATES-OK. "Trainable form of directional ablation" is an honest framing: same operator, deliberately different (learned) direction source. Recommend the docstring keep the word "form"/"trainable" prominent so it is not read as reproducing Arditi's gradient-free diff-of-means extraction.


4. SVD sign disambiguation

Paper: SILENT on singular-vector sign. Arditi never does an SVD of a weight to get its ablation direction — 𝐫̂ comes from difference-of-means (md:117-124), and the weight-orthogonalization Eq 5 (md:246) uses the outer product 𝐫̂𝐫̂ᵀ. So the paper offers no quote on SVD sign; I reason from the math.

Code: ablation core is the outer product Chat Chatᵀ (forward:183-185), and U is orthonormal from torch.linalg.svd (init:80).

Observation (math):

  • Chat enters only as Chat Chatᵀ (proj = h @ Chat, then @ Chat.T, :183-185). Flipping any column sign chat_j → −chat_j leaves chat_j chat_jᵀ unchanged ⇒ the operator is sign-invariant in c. (lora_c is trainable anyway, so its sign is not even an SVD artifact.)
  • U enters only as h @ U.T (:186), and the projector from point 1 is (U c)(U c)ᵀ. A sign flip u_i → −u_i with the matching vh_i → −vh_i leaves W = (U S) Vh invariant (init:86). For the projector, the flip only relabels the r-space coordinate c_i → −c_i: the lifted direction U c, and hence (U c)(U c)ᵀ, is unchanged once c is expressed in the flipped basis. Because c is learned (or warm-started) in whatever basis the SVD returned, neither the reconstructed weight nor the set of reachable ablation projectors depends on per-vector sign.
  • S is non-negative by SVD definition; no sign issue.
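These sign facts can be checked on a random weight. The sketch below assumes a toy rank-r split and a hypothetical `forward` helper mirroring the quoted core; the flipped run uses the r-space direction expressed in the flipped basis (coordinate i of c multiplied by the same sign as u_i).

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 9, 7, 4

W = rng.standard_normal((d_out, d_in))
U, S, Vh = np.linalg.svd(W, full_matrices=False)
U, S, Vh = U[:, :r], S[:r], Vh[:r]
W_res = W - (U * S) @ Vh

c = rng.standard_normal(r)
c_hat = c / np.linalg.norm(c)

# Flip the sign of singular pairs 1 and 2 (columns of U, matching rows of Vh).
signs = np.array([1.0, -1.0, -1.0, 1.0])
U_f, Vh_f = U * signs, Vh * signs[:, None]

def forward(x, U_, Vh_, c_):
    """Toy forward: ablate the unit direction c_ in r-space, then lift with U_."""
    h = (Vh_ @ x) * S
    h = h - (h @ c_) * c_
    return W_res @ x + U_ @ h

x = rng.standard_normal(d_in)

weight_same = np.allclose((U * S) @ Vh, (U_f * S) @ Vh_f)
outer_same = np.allclose(np.outer(c_hat, c_hat), np.outer(-c_hat, -c_hat))
output_same = np.allclose(forward(x, U, Vh, c_hat),
                          forward(x, U_f, Vh_f, signs * c_hat))
print(weight_same, outer_same, output_same)
```

All three checks hold in this toy setting, which is consistent with the file needing no sign canonicalization.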

Inference: no sign canonicalization is needed anywhere in this file. The ablation is a quadratic form in both the (trainable, hence sign-free) direction and the orthonormal basis, and every place U/Vh appear they appear in sign-paired products or outer products. This is the correct situation (a basis/span use, not a sign-sensitive coordinate use).

VERDICT: MATCHES (math-derived; paper silent on SVD sign, stated). No canonicalization required and none missing.


5. Citation check

Docstring (:16-17, :23):

This is the trainable form of directional ablation (Arditi+ 2024 [...]). Refs: [...] directional ablation Arditi+ 2024 arXiv:2406.11717.

Paper title/authors (arditi md:1-21):

Refusal in Language Models Is Mediated by a Single Direction

Andy Arditi [...] Oscar Obeso [...] Aaquib Syed [...] Daniel Paleka [...] Nina Rimsky [...] Wes Gurnee [...] Neel Nanda

Observation: arXiv:2406.11717 = "Refusal in Language Models Is Mediated by a Single Direction", first author surname Arditi. The term "directional ablation" is the paper's own (§2.4 heading, md:148). The arXiv id, surname, year, and method name all match.

cov_orient / CorDA attribution. Code config comment (:50) and group_init docstring (:94) say "CorDA" by name:

CorDA-orient the basis from input covariance (group_init [...]) (:50)

re-orient each target's SVD by input covariance C=E[x xᵀ] (CorDA) (:94)

CorDA paper core (corda md:23, line "C=XXᵀ", and SVD(WC)=UΣVᵀ, reconstruct Ŵ=UΣVᵀC⁻¹):

obtain the covariance matrix of the input activation [...] C=XXᵀ [...] perform singular value decomposition for the weight multiplied by the covariance matrix, i.e. SVD(WC)=UΣVᵀ [...] the inverse of these covariance matrices is multiplied with the decomposed components to hold the same inference result with the original model

Observation: CorDA IS attributed by name (:50, :94). One math nuance worth recording (not a citation error): CorDA whitens with C on one side and reconstructs with C⁻¹ (SVD(WC), then VᵀC⁻¹). The code instead uses SYMMETRIC whitening SVD(W C^{1/2}) with Pr = Vh C^{-1/2} (group_init:145-150). Both preserve the forward map (W C^{1/2} · C^{-1/2} = W) and both put data-relevant output directions in the top-r, but the orientation is not bit-identical to CorDA's WC. This is an acknowledged variant (task list #22 "Ablate whitening: C^1/2 (mine) vs C (PEFT)"), so the "CorDA-orient" name is a fair attribution of the IDEA (covariance-oriented SVD with inverse-covariance reconstruction), with the symmetric square-root being this repo's choice.
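The two orientations can be compared on a toy weight. This is a minimal sketch, assuming a random `W`, a well-conditioned covariance C = X Xᵀ, and a symmetric square root built from an eigendecomposition; the variable names (`Pr`, `C_half`) follow this review's notation and are not the repository's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n = 6, 5, 64

W = rng.standard_normal((d_out, d_in))
X = rng.standard_normal((d_in, n))
C = X @ X.T                                   # input covariance (unnormalized)

# Symmetric square root and inverse square root via eigendecomposition.
evals, evecs = np.linalg.eigh(C)
C_half = (evecs * np.sqrt(evals)) @ evecs.T
C_inv_half = (evecs / np.sqrt(evals)) @ evecs.T

# CorDA-style: SVD(W C), reconstruct with C^{-1} on the right.
U1, S1, Vh1 = np.linalg.svd(W @ C, full_matrices=False)
W_corda = (U1 * S1) @ Vh1 @ np.linalg.inv(C)

# Symmetric variant: SVD(W C^{1/2}), reconstruct with Pr = Vh C^{-1/2}.
U2, S2, Vh2 = np.linalg.svd(W @ C_half, full_matrices=False)
Pr = Vh2 @ C_inv_half
W_sym = (U2 * S2) @ Pr

both_reconstruct = np.allclose(W_corda, W) and np.allclose(W_sym, W)
same_spectrum = np.allclose(S1, S2)
print(both_reconstruct, same_spectrum)
```

Both factorizations reproduce `W` at full rank, but their singular values differ, so a top-r truncation generally keeps a different subspace in each case.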

VERDICT: MATCHES (Arditi citation correct: id, surname, method name). CorDA is attributed by name; the C^{1/2} symmetric-whitening detail is a labeled variant of CorDA's C/C⁻¹, not a mis-citation.


Bottom line

No real bugs: the operator reproduces Arditi's outer-product ablation faithfully inside each layer's rank-r output subspace, is a proven contraction for the documented coeff∈[0,1]/alpha∈[0,1] range, needs no SVD sign canonicalization, and both Arditi and CorDA are correctly attributed (the per-layer-learned direction and C^{1/2} whitening are deliberate, self-documented design choices, not deviations from a claimed reproduction).