External review (GPT-5.5) flagged 'two near-orthonormal bases' as inaccurate:
only B=Ur is orthonormal; A folds the singular values so its rows are scaled.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Bałazy et al. 2024 (arxiv 2405.17604). A=diag(Sr)Vhr, B=Ur frozen from
top-r SVD of W (W left intact); only the r×r R is trained, init normal(0,1e-5)
so the adapter ~ identity at t=0. ~25k params at r=32 (24 down_proj targets).
justfile: alpha=r (scale=1) and lr=4e-3, matching the ref LLaMA math config.
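
A minimal NumPy sketch of the frozen-factor setup described above, using a random stand-in for W and toy sizes. The variable names are hypothetical, and reading 1e-5 as a standard deviation is an assumption. It checks that B = U_r has orthonormal columns, that the rows of A = diag(S_r) Vh_r are orthogonal but scaled by the singular values, and that the delta B R A is negligible at init.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 8                # toy sizes, not the benchmark's

W = rng.standard_normal((d_out, d_in))    # stand-in for a frozen base weight
U, S, Vh = np.linalg.svd(W, full_matrices=False)

B = U[:, :r]                              # (d_out, r), frozen, orthonormal columns
A = S[:r, None] * Vh[:r]                  # (r, d_in), frozen, diag(S_r) @ Vh_r
R = rng.normal(0.0, 1e-5, (r, r))         # the only trainable tensor (std assumed)

delta_W = B @ R @ A                       # added on top of the intact W

BtB = B.T @ B                             # identity: B is orthonormal
AAt = A @ A.T                             # diag(S_r**2): orthogonal rows, not unit norm
rel_delta = np.linalg.norm(delta_W) / np.linalg.norm(W)

# Trainable parameter count at the sizes quoted in the message.
r_bench, n_targets = 32, 24
trainable = n_targets * r_bench**2        # 24 * 1024
```

Since A A^T comes out as diag(S_r^2) rather than the identity, only B qualifies as an orthonormal basis, which is the point of the review note.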
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
main keeps a single antipasto = the rotation+delta SVD adapter (the published
method, paper 2601.07473), default rotate_basis=V. On GSM8K/down_proj rot(V)
led the family (57.2) and at a single seed nothing separated from it, while the
covariance-oriented arms cost 34-120s init for no gain. The full family (gain
core, U/both rotations, ablate, dplr, corda, asvd) is preserved on the
antipasto-variants branch.
- antipasto.py is now the rotation implementation, registered as "antipasto"
- delete antipasto_{rot,ablate,corda,asvd,dplr}.py + their config exports
- benchmark/justfile/cost_report/smoke: drop the removed variants + dead knobs
(antipasto_coeff/suppress_only/ablate_k/cov_orient/lora_rank); keep
--antipasto-rotate-basis as antipasto's V/U/both/none ablation axis
- README: subset table to one antipasto row, add rank column, note single-seed
noise floor (~1.4pp), point the full family at the branch
smoke: 10 passed
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 94 result (Qwen3.5-0.8B, GSM8K, 2500 steps, single seed):
warm-start (top-k S-space output-variance PC): test 55.6 / valid 64.0, init 33.2s
random-init (prior default): test 56.0 / valid 68.0, init 2.2s
Equal-or-worse accuracy (within single-seed noise) for +31s of calibration init.
The optimal ablation direction is loss-defined, not variance-defined, so seeding
lora_c from the data-variance PC buys nothing here. Reverts fe562c2; ablate is
back to the cheap random-init default. cov_orient (CorDA re-orient) path kept.
The FIXME's actual proposal -- a *contrastive* dS seed -- stays open but needs
pos/neg pairs this SFT benchmark lacks (only relevant for labelled steering).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Lets the rot-basis ablation get a second seed without clobbering the seed0
run_id, so V>U>both can be confirmed against seed noise.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Replaces rotation/Cayley antipasto.py with three bounded, interpretable cores
(gain 1+ELU, contractive ablation, CorDA/ASVD better-basis) + dplr, plus full
GSM8K cost table and the rot-basis ablation. Resolves the three review FIXMEs
from 3af2a2a (rambling removed; CorDA split into its own variant; group_init
recovers W_orig so it no longer runs on cropped matrices).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
# Conflicts:
# src/lora_lite/variants/antipasto.py
group_init now seeds each lora_c to the top-k principal axes of the S-space
output coords h=diag(S)Vh x (highest-energy output dirs => largest loss-grad on
the ablation strength), so lora_c starts in a high-gradient region rather than at a random direction.
Cheap r x r second moment when not orienting; reuses Sigma xx^T when cov_orient.
Benchmark always calibrates ablate now. This is the data-variance direction, not
a contrastive behavior dir (SFT has no pos/neg split) -- noted in the docstring.
UAT: |cos(lora_c, top output-PC)| = 1.0000 vs ~0.35 chance; smoke green.
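
A rough NumPy sketch of the seeding step, assuming random calibration inputs and a hypothetical (r, k) layout for lora_c; the repo's shapes may differ. It forms the S-space coordinates h = diag(S) Vh x, takes the top-k eigenvectors of their r x r second moment, and checks the seed against an independently computed top singular vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, k, n = 24, 32, 8, 1, 512   # toy sizes

W = rng.standard_normal((d_out, d_in))     # stand-in weight
_, S, Vh = np.linalg.svd(W, full_matrices=False)
S_r, Vh_r = S[:r], Vh[:r]

X = rng.standard_normal((n, d_in))         # calibration inputs, one row per token
H = (X @ Vh_r.T) * S_r                     # h = diag(S) Vh x, shape (n, r)

M = H.T @ H / n                            # cheap r x r second moment
evals, evecs = np.linalg.eigh(M)           # eigenvalues in ascending order
lora_c = evecs[:, ::-1][:, :k].copy()      # top-k principal axes as the seed

# Independent check: top right-singular vector of H spans the same axis.
_, _, Vt = np.linalg.svd(H, full_matrices=False)
cos = abs(float(lora_c[:, 0] @ Vt[0]))
```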
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- antipasto_rot: add rotate_basis="both" (independent V+U Cayley rotations),
run_id suffix __rotU/__rotboth so ablation arms get their own output dirs
- justfile: thread rotate_basis through bench-variant
- corda/eva: padding-mask fix in calibration capture + bf16-tight residual
- README: fill PiSSA/DoRA/CorDA/ASVD/ablate/dplr/rot rows; record the
metric-axis ablation (C=I 56.0 > diag-C 55.6 > full-C 54.7) and the
rotation ablation (V 57.2 > U 56.5 > both 55.6) conclusions
- docs/reviews: external ref-checks + deepseek/gpt reviews of the cores
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
gpt-5.5 review (decorrelated) found three real issues deepseek missed:
- BLOCKER: calibration ran through a cropped model. attach() did init() (crops
every target to W_res) then group_init() (calibration forward) then registered
the adapter hooks -- so CorDA's covariance and Wanda's scores were collected from
a model missing every target's top-r. Now register hooks BEFORE group_init; at
g=0/B=0 they reconstruct the cropped component exactly, so calibration sees full W.
- detach() left the model cropped (deleted buffers without adding the frozen top-r
back). Now reconstructs W = W_res + U_r S_r (Vh|P)_r before removing buffers.
- base-residual persistence wasn't in checkpoint metadata, so load->re-save dropped
it. Persist base_weight_keys in metadata, validate on load, carry onto attach state.
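
The first two fixes can be illustrated with a small NumPy sketch. It assumes a plain Vh basis (not the oblique P variant) and a hypothetical adapter_hook function standing in for the registered hooks. It shows that the cropped layer alone gives the wrong output, that adding the hook at g=0 recovers the full-W output, and that detach can rebuild W exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 16, 12, 4
W = rng.standard_normal((d_out, d_in))

U, S, Vh = np.linalg.svd(W, full_matrices=False)
U_r, S_r, Vh_r = U[:, :r], S[:r], Vh[:r]

W_res = W - (U_r * S_r) @ Vh_r            # what init() leaves in the base layer

def adapter_hook(x, g):
    """Hypothetical hook: frozen top-r path with a 1+ELU gain (exactly 1 at g=0)."""
    gain = 1.0 + np.where(g > 0, g, np.expm1(g))
    return ((x @ Vh_r.T) * (S_r * gain)) @ U_r.T

x = rng.standard_normal((5, d_in))
y_full = x @ W.T                          # what calibration should see
y_cropped = x @ W_res.T                   # what it saw before the fix
y_hooked = y_cropped + adapter_hook(x, np.zeros(r))

# detach(): add the frozen top-r back before dropping the buffers.
W_detached = W_res + (U_r * S_r) @ Vh_r
```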
Docstring/citation cleanup (review + user style asks):
- antipasto_corda: drop changelog narration and the stale "None -> plain SVD" claim
(it raises now); exact reconstruction states W_res; slim the CPU/OOM note.
- antipasto_dplr: drop the arrowhead archaeology; docstring math now matches the
forward (p@A.T@B.T); fix the k=0 comment (code requires 1<=k<=r).
- citations: Wanda (Sun+ 2023, 2306.11695), ASVD (Yuan+ 2023, 2312.05821),
PiSSA (Meng+ 2024, 2404.02948), LoRA (Hu+ 2021, 2106.09685).
Co-Authored-By: Claudypoo <noreply@anthropic.com>
The benchmark only passed calibration_data to eva, so antipasto_corda's
group_init hit `if calibration_data is None: return` and every corda run was
actually plain SVD. The covariance orientation never executed -- all prior
corda-vs-antipasto comparisons are void.
- antipasto_corda.group_init: raise on None instead of silently degrading
(orientation is the variant's whole identity; fail loud).
- benchmark: feed ~256 MetaMath calibration samples (IPM, per PEFT/CorDA) to
corda and to cov_orient ablate; run_id now carries an __lr tag.
- adapter.save/load: a data-driven group_init rewrites the frozen base residual
W_res into a form init() cannot reproduce at load (it only knows the plain
top-r crop). Persist those residuals in the adapter and restore them. Fixes a
reload-logits mismatch that was masked while group_init never ran.
- probe check: compare every saved tensor (lora_ buffers AND base residuals)
against the reloaded model state.
- justfile: bench-variant gains an lr_override (the core wants a tamer lr than
the gain's 5e-3).
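
The fail-loud change in the first bullet amounts to the pattern below. This is a hypothetical stand-in, not the repo's group_init: the signature, return value, and error message are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def group_init(calibration_data: Optional[Sequence[str]]) -> int:
    """Hypothetical stand-in; returns the number of calibration samples used."""
    if calibration_data is None:
        # Previously: `return`, which silently left the plain-SVD basis in place.
        raise ValueError(
            "calibration_data is required: the covariance orientation "
            "cannot run without it"
        )
    return len(calibration_data)


try:
    group_init(None)
    raised = False
except ValueError:
    raised = True

n_used = group_init(["sample a", "sample b"])
```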
Co-Authored-By: Claudypoo <noreply@anthropic.com>
antipasto_arrow -> antipasto_dplr. The arrowhead's dense b x b block is the wrong
shape: b^2 params, mixes only the top-b, and sits on the S-scaled coords so its
perturbation is amplified by the largest singular values (block=128 collapsed to
45.7% at the gain's lr). Replace it with LoRA's lesson -- a low-rank core inside
the frozen basis, ADDED to the gain:
DeltaW = U [diag(S_eff) + coeff * B A] Vh, A:(k,r) B:(r,k), B=0 at init
The low-rank part mixes the whole top-r subspace for 2*r*k params (k=LoRA's rank),
and being additive (not * diag(S)) it is S-independent -- the amplification edge is
gone by construction. Diagonal gain unchanged; identity at init from B=0 and g=0.
Wired through benchmark (antipasto_lora_rank, run_id __k suffix), justfile, cost_report,
smoke (green, dplr attaches/trains/round-trips). Arrow code removed; its run results
stay on disk for comparison.
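
A small NumPy sketch of the core above, with toy sizes and an assumed init scale for A. It checks that B=0 and g=0 give back the frozen top-r component exactly, that the low-rank part costs 2*r*k parameters, and that a nonzero B shifts the core by coeff * B A regardless of S.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, k = 16, 12, 6, 2
coeff = 1.0

W = rng.standard_normal((d_out, d_in))
U, S, Vh = np.linalg.svd(W, full_matrices=False)
U_r, S_r, Vh_r = U[:, :r], S[:r], Vh[:r]

def elu(x):
    return np.where(x > 0, x, np.expm1(x))

def core(g, A, B):
    """diag(S_eff) + coeff * B A, with S_eff = S * (1 + ELU(coeff * g))."""
    S_eff = S_r * (1.0 + elu(coeff * g))
    return np.diag(S_eff) + coeff * (B @ A)

g = np.zeros(r)
A = rng.normal(0.0, 0.02, (k, r))         # init scale is an assumption
B = np.zeros((r, k))                      # B=0 at init

W_top = U_r @ np.diag(S_r) @ Vh_r         # frozen top-r component
W_init = U_r @ core(g, A, B) @ Vh_r       # adapter at init

n_lowrank = A.size + B.size               # 2 * r * k

B1 = rng.normal(0.0, 0.1, (r, k))
shift = core(g, A, B1) - core(g, A, B)    # additive, no S factor
```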
Co-Authored-By: Claudypoo <noreply@anthropic.com>
bench-variant gains an r_override arg (alpha tracks r for the antipasto family);
run_id appends __r<N> when an antipasto-family run uses r!=256, so the low-rank
corda-vs-antipasto sweep does not overwrite the r=256 results.
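
The suffix rule can be sketched as below. The function name and the prefix test for "antipasto-family" are assumptions for illustration; only the __r<N> format and the r!=256 condition come from the message.

```python
def r_suffix(variant: str, r: int, default_r: int = 256) -> str:
    """Hypothetical helper: tag antipasto-family runs that use a non-default r."""
    if variant.startswith("antipasto") and r != default_r:
        return f"__r{r}"
    return ""


low_rank_tag = r_suffix("antipasto_corda", 32)
default_tag = r_suffix("antipasto", 256)
other_tag = r_suffix("lora", 32)
```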
Co-Authored-By: Claudypoo <noreply@anthropic.com>
README: 'we validate the same way PEFT does; trained properly they clear 49% on
GSM8K, all pass' + link to the benchmark script.
justfile: arrow with block>8 uses lr=1e-4 not 5e-3. The 5e-3 that suits the tiny
S-space gain destabilizes the large dense block -- block=128 at 5e-3 scored 45.7%
(below the bar, vs block=8's 60.5%). Capacity sweep requeued at LoRA's 1e-4 to
de-confound params-vs-lr.
Co-Authored-By: Claudypoo <noreply@anthropic.com>
The point is that every adapter clears PEFT's ~48% LoRA bar on the same
MetaMathQA->GSM8K protocol. All rows passing is the "it trains" signal,
not a competitive ranking.
Co-Authored-By: Claudypoo <noreply@anthropic.com>
- Fold the family into the main Variants table as rows (CorDA/ablate/arrow)
instead of a separate table.
- Lead with the point (freeze W's SVD, learn only a bounded gain -> interpretable,
O(r) params) before any numbers.
- Replace the unsourced 'PEFT reports 49.0%' line (wrong; LoRA is ~48%) with a
real link to PEFT's method_comparison/MetaMathQA and a pointer to the benchmark
script for hyperparameters. Link CorDA/Arditi papers inline.
Co-Authored-By: Claudypoo <noreply@anthropic.com>
Replace the per-experiment family breakdown table + comparison prose with a
2-sentence method description (frozen interpretable SVD basis, O(r) gain, the
three variant cores). Experiment findings (rotation comparison, arrow capacity,
cost/timing) belong in the research journal, not the README skim path.
Co-Authored-By: Claudypoo <noreply@anthropic.com>
Rewrite antipasto/ablate/corda/arrow docstrings to the house style (purpose +
math block + identity line + refs), dropping the rambly meta-commentary aimed at
past design decisions ('Changes vs the rotation version', chat references, inline
measurements). Net -74 lines.
Also answer the FIXMEs left on main's old copy:
- group_init is Wanda/ASVD *selection* (re-rank W's own singular vectors), NOT
CorDA re-orientation -- that is antipasto_corda.py.
- it rebuilds the FULL W exactly (W_res + stored top-r == W), so the re-SVD sees
the whole spectrum, not a cropped matrix.
Arrow capacity: --antipasto-block CLI knob (justfile bench-variant 4th arg) so the
block can be scaled toward LoRA params; run_id gets a __b<N> suffix so block-sweep
runs do not collide. Smoke green (14 passed).
Co-Authored-By: Claudypoo <noreply@anthropic.com>
Replace the stale single AntiPaSTO row (was 35.8K params from the removed
rotation version, described block-Cayley which no longer exists) with the
real 5000-step Qwen3-0.6B numbers and a family breakdown:
  variant  acc     params  note
  corda    61.9%   14.3K   best: covariance-oriented basis
  plain    61.4%   14.3K
  rot      61.4%   35.8K   the rotation this replaces
  ablate   61.0%   14.4K
  arrow    60.5%   17.5K
Headline: ~320x fewer trainable params than LoRA at ~97% of its accuracy.
Rotation buys nothing (rot matches plain to 3 s.f. at 2.5x params, +20%
wall-time, plus a per-forward Cayley solve), confirming the drop.
Co-Authored-By: Claudypoo <noreply@anthropic.com>
External (codex) review found the suppress_only "attenuation only" claim holds
only for coeff>=0 (coeff<0 inverts the product and re-amplifies). Doc-only
caveat in antipasto/_corda/_arrow; no math change (sweeps run coeff=1.0).
Also clarify arrow's group_init top-b lands on largest-S-among-selected, not
highest-score.
Co-Authored-By: Claudypoo <noreply@anthropic.com>
The README GSM8K sweep was queued as raw expanded commands with an
unquoted --target-name '(q_proj|v_proj)$'; pueue runs via sh -c, so the
parens errored instantly before training. Routing through bench-variant
(bash shebang quotes the target) fixes it. Also bake the antipasto family's
r=256/alpha=256 into the case block so it matches the published AntiPaSTO
row, replacing the dead trailing "$@" (shebang recipes get no extra args).
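
The quoting failure is easy to reproduce from Python. The sketch below is a separate illustration using the stdlib shlex module, not the bench-variant fix itself, and the command line is hypothetical. It builds a string that is safe to hand to sh -c and checks that the regex survives a round trip.

```python
import shlex

target = "(q_proj|v_proj)$"
# Hypothetical command; only the --target-name value comes from the message.
argv = ["python", "benchmark.py", "--target-name", target]

unsafe = " ".join(argv)        # parens and | reach sh unquoted, so sh errors out
safe = shlex.join(argv)        # each argument quoted for a POSIX shell

roundtrip = shlex.split(safe)  # what the shell would hand back to the program
```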
Co-Authored-By: Claudypoo <noreply@anthropic.com>
The small-param antipasto family (gain/block/ablate/corda) all need the higher
lr to clear the bf16 round-to-nearest floor, not just antipasto. Glob the case.
Co-Authored-By: Claudypoo <noreply@anthropic.com>
antipasto's diagonal core can only rescale each frozen singular direction; it
can never let direction i's input drive direction j's output, yet the steered
behaviour is an off-axis combination. A dense r x r core fixes that but costs
r^2 params. antipasto_arrow uses the arrowhead structure instead: a dense b x b
block on the top-b singular directions (full coupling where the action lives)
plus a diagonal 1+ELU tail on the rest. b^2 + (r-b) params, one b x b matmul
per forward -- cross-direction mixing at diagonal-core cost, no Cayley solve.
Identity at init (M=0 -> B=I, g=0 -> gain=1). Verified on a Linear: rel_err
1.5e-7 at init; M[i,j] routes input dir j -> output dir i with weight exactly
M[i,j] (diagonal core forces 0); 14 train params at r=8,b=3 vs r^2=64.
Wired into benchmark (antipasto_block knob), smoke (block=2 for r=4), cost
report, and exports.
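
A toy NumPy sketch of the arrowhead core as described above, assuming the block is parameterized as I + M and the tail as 1+ELU(g). The function name is hypothetical, and where the core sits relative to S is left out. It checks identity at init, the b^2 + (r-b) parameter count at r=8, b=3, and that a single M entry shows up as exactly that off-diagonal coupling.

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.expm1(x))

def arrow_core(M, g):
    """Dense (I + M) block on the top-b directions, diagonal 1+ELU(g) tail."""
    b, tail = M.shape[0], g.shape[0]
    core = np.zeros((b + tail, b + tail))
    core[:b, :b] = np.eye(b) + M
    core[b:, b:] = np.diag(1.0 + elu(g))
    return core

r, b = 8, 3
M = np.zeros((b, b))
g = np.zeros(r - b)

core_init = arrow_core(M, g)              # identity at init
n_params = M.size + g.size                # b^2 + (r - b)
n_dense = r * r                           # cost of a full r x r core

M_routed = M.copy()
M_routed[0, 2] = 0.25                     # input dir 2 -> output dir 0
core_routed = arrow_core(M_routed, g)
```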
Co-Authored-By: Claudypoo <noreply@anthropic.com>
Replace antipasto's rotation/Cayley with a bounded 1+ELU gain and split the
S-space idea into four interpretable PiSSA-style cores (frozen U/S/Vh, small
trainable core):
- antipasto: S_eff = S*(1+ELU(coeff*g)). exp-bounded attenuation, linear
amplification (constant gradient, no runaway). g=0 -> exact identity.
- antipasto_rot: keeps the block-Cayley rotation as a separate variant for
cost comparison (its per-forward solve is the 72ms vs 36ms gap).
- antipasto_ablate: contractive (I - a c c^T) diag(S), eigenvalues in [0,1],
cannot blow up. Optional cov_orient (CorDA) basis.
- antipasto_corda: covariance-oriented oblique projector P = Vh C^{-1/2}, the
data-energy basis rather than the weight-gain basis. 1+ELU gain.
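
The two boundedness claims above can be checked numerically. This is a sketch with toy values: it assumes a unit-norm c and an ablation strength a already constrained to [0, 1]. How the repo enforces that constraint is not shown here.

```python
import numpy as np

def elu(x):
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

def gain(g, coeff=1.0):
    """1 + ELU(coeff * g): exp(.) below zero, 1 + (.) above zero."""
    return 1.0 + elu(coeff * g)

S = np.array([3.0, 2.0, 0.5])
S_eff_init = S * gain(np.zeros(3))        # g=0 -> exact identity

g_sweep = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])
gains = gain(g_sweep)                     # never negative, linear for g > 0

# Contractive ablation: I - a c c^T has eigenvalue 1-a along c and 1 elsewhere.
c = np.array([1.0, 2.0, 2.0])
c = c / np.linalg.norm(c)
a = 0.7                                   # assumed to lie in [0, 1]
P = np.eye(3) - a * np.outer(c, c)
eigs = np.linalg.eigvalsh(P)
```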
Add scripts/_cost.py + scripts/cost_report.py: one-row-per-variant cost table
(trainable params, peak GPU mem, fwd/bwd ms, added MACs/tok, group_init ms).
Wire all four into the benchmark, smoke test, and __init__ exports.
External review (DeepSeek-v4-pro, docs/reviews/) verified the math; acted on
its one real point (corda g now inits to zeros for exact identity).
Co-Authored-By: Claudypoo <noreply@anthropic.com>
- delete _road_matrix in variants/road.py (zero callers)
- drop redundant callable(m) clause in is_linear_like (every nn.Module is callable)
- remove try/except in current_git_commit so missing git crashes loudly
instead of writing "unknown" into the results TSV
Co-Authored-By: Claudypoo <noreply@anthropic.com>
Score each singular dimension by S[i] * mean|X @ Vh[i]| (weight magnitude
times activation magnitude), then pick top-r by joint score instead of top-r
by S alone. Keeps the weight-SVD basis; only reorders which r dimensions are
retained based on real input activations.
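
A toy NumPy sketch of the joint score, with an identity basis and hand-picked singular values so the effect is easy to see; all data here is synthetic. Direction 1 has a large S but almost no input energy and direction 3 has a small S but a lot, so selection by score differs from selection by S.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, r = 6, 2

Vh = np.eye(d_in)                          # toy basis: directions = coordinate axes
S = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5])

# Synthetic activations: axis 1 nearly silent, axis 3 very active.
scale = np.array([1.0, 0.01, 1.0, 10.0, 1.0, 1.0])
X = rng.standard_normal((256, d_in)) * scale

act = np.abs(X @ Vh.T).mean(axis=0)        # mean |X @ Vh[i]| per direction
score = S * act                            # weight magnitude x activation magnitude

top_by_S = np.argsort(-S)[:r]              # plain top-r
top_by_score = np.argsort(-score)[:r]      # joint-score top-r
```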
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Weight-SVD init (PiSSA-style) kept as fallback; when calibration_data is
provided, group_init() collects pre-hook activations, SVDs the pooled inputs
per layer, and re-decomposes W_orig through the top-r input-PCA directions.
Vhr_final = Vh_A @ Vhr_new keeps rows orthonormal while preserving the
input-aligned span.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _base_weight_fingerprint was a PiSSA-only defensive check that cluttered
every save with a per-target SHA256. If you load onto the wrong base, you get
wrong weights -- that's user error, not a library bug.
- test_lora_lite.py deleted. All coverage lives in test_metamath_smoke.py
which runs the real benchmark pipeline per variant.
Trainable params that were init'd at exact 0 or 1 now use near_zero (N(0,1e-4))
or near_one (1 + N(0,1e-4)) to break bf16 symmetry without meaningfully
breaking identity-at-t=0. Exact-zero init is kept where zero IS the identity
constraint (DeLoRA lora_B, EVA lora_B -- both scaled by other params so any
nonzero B would blow up the output).
AntiPaSTO: delta_s and rot_T now near_zero. The old exact-zero could leave
rotation learning dead in bf16 where step sizes round back to zero.
IA3: lora_g now near_one instead of exact ones. Avoids the bf16 spacing issue
around 1.0 where eps_bf16 ~ 7.8e-3 and lr=1e-3 updates were rounding away.
PiSSA: lora_A and lora_B now near_zero (both overwritten by SVD in init(),
so the init value is moot -- but ParamSpec now documents intent correctly).
HRA: lora_U now near_zero (overwritten by symmetric init in init()).
ParamSpec: added 'near_zero' and 'near_one' init modes. Default changed from
'zeros' to 'near_zero'. Tests relaxed identity tolerances accordingly.
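
The bf16 spacing issue around 1.0 can be reproduced with a stdlib-only bfloat16 rounding emulation. This is a sketch for finite inputs only, and it assumes a gradient of magnitude about 1 so that one update is roughly lr = 1e-3.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # round-to-nearest-even on the low 16 bits
    bits &= 0xFFFF0000                    # keep sign, 8 exponent and 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


eps_bf16 = 2.0 ** -7                      # spacing of bf16 values just above 1.0

stuck = to_bf16(1.0 + 1e-3)               # a 1e-3 step from exactly 1.0 rounds away
moved = to_bf16(1.0 + eps_bf16)           # a full-spacing step survives
```

A step smaller than half the spacing is rounded back to 1.0, which is the failure the IA3 note describes.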