Commit Graph

23 Commits

Author SHA1 Message Date
wassname 2a50373311 test: put scripts/ on sys.path so benchmark's sibling _cost import resolves in CI
CI collection failed with ModuleNotFoundError: No module named '_cost' because
exec_module loads the script without its dir on sys.path.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-19 08:47:41 +08:00
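
A minimal sketch of the failure mode and the fix described above, assuming the test loads the script with importlib. The helper name load_script and the two demo file names are hypothetical stand-ins, not the repo's actual test code.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_script(path: Path):
    """Load a script file as a module, with its own directory importable."""
    script_dir = str(path.parent.resolve())
    if script_dir not in sys.path:
        # exec_module does not put the script's directory on sys.path,
        # so a sibling import would raise ModuleNotFoundError without this.
        sys.path.insert(0, script_dir)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Hypothetical stand-ins for a benchmark script and its sibling helper module.
with tempfile.TemporaryDirectory() as tmp:
    scripts = Path(tmp) / "scripts"
    scripts.mkdir()
    (scripts / "_cost_demo.py").write_text("ROWS = ['lora']\n")
    (scripts / "benchmark_demo.py").write_text(
        "import _cost_demo\nROWS = _cost_demo.ROWS\n"
    )
    benchmark = load_script(scripts / "benchmark_demo.py")
```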
wassname c792ad3e5f Add LoRA-XS variant: train only r×r core R between frozen SVD factors
Bałazy et al. 2024 (arxiv 2405.17604). A=diag(Sr)Vhr, B=Ur frozen from
top-r SVD of W (W left intact); only the r×r R is trained, init normal(0,1e-5)
so the adapter ~ identity at t=0. ~25k params at r=32 (24 down_proj targets).
justfile: alpha=r (scale=1) and lr=4e-3, matching the ref LLaMA math config.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-18 19:48:40 +08:00
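
A NumPy sketch of the LoRA-XS parameterization as this commit describes it, not the repo's implementation. Assumptions: 1e-5 is treated as the standard deviation of the init, scale is 1 (alpha=r), and the layer shapes are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 48, 8  # illustrative shapes, not the real model's

W = rng.standard_normal((d_out, d_in))          # base weight, left intact
U, S, Vh = np.linalg.svd(W, full_matrices=False)

B = U[:, :r]                                    # frozen, shape (d_out, r)
A = S[:r, None] * Vh[:r]                        # frozen, diag(Sr) @ Vhr, shape (r, d_in)
R = rng.normal(0.0, 1e-5, size=(r, r))          # the only trainable tensor


def forward(x):
    # y = x W^T + x (B R A)^T, with scale = 1
    return x @ W.T + (x @ A.T) @ R.T @ B.T


x = rng.standard_normal((4, d_in))
base = x @ W.T
rel_err = np.linalg.norm(forward(x) - base) / np.linalg.norm(base)

# Parameter count quoted in the commit: r*r per target, 24 targets at r=32.
params_at_r32 = 32 * 32 * 24  # 24576, i.e. ~25k
```

Because R starts near zero, the adapted layer is close to the base layer at t=0, and the trainable count depends only on r and the number of targets, not on the layer width.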
wassname 12fa56f328 Collapse antipasto family to one variant: rot(V) becomes canonical antipasto
main keeps a single antipasto = the rotation+delta SVD adapter (the published
method, paper 2601.07473), default rotate_basis=V. On GSM8K/down_proj rot(V)
led the family (57.2) and at a single seed nothing separated from it, while the
covariance-oriented arms cost 34-120s init for no gain. The full family (gain
core, U/both rotations, ablate, dplr, corda, asvd) is preserved on the
antipasto-variants branch.

- antipasto.py is now the rotation implementation, registered as "antipasto"
- delete antipasto_{rot,ablate,corda,asvd,dplr}.py + their config exports
- benchmark/justfile/cost_report/smoke: drop the removed variants + dead knobs
  (antipasto_coeff/suppress_only/ablate_k/cov_orient/lora_rank); keep
  --antipasto-rotate-basis as antipasto's V/U/both/none ablation axis
- README: subset table to one antipasto row, add rank column, note single-seed
  noise floor (~1.4pp), point the full family at the branch

smoke: 10 passed

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-17 21:05:51 +08:00
wassname 5f9d90d8b8 benchmark sweep: rot(U/both) ablation, whitening conclusion, cost rows
- antipasto_rot: add rotate_basis="both" (independent V+U Cayley rotations),
  run_id suffix __rotU/__rotboth so ablation arms get their own output dirs
- justfile: thread rotate_basis through bench-variant
- corda/eva: padding-mask fix in calibration capture + bf16-tight residual
- README: fill PiSSA/DoRA/CorDA/ASVD/ablate/dplr/rot rows; record the
  metric-axis ablation (C=I 56.0 > diag-C 55.6 > full-C 54.7) and the
  rotation ablation (V 57.2 > U 56.5 > both 55.6) conclusions
- docs/reviews: external ref-checks + deepseek/gpt reviews of the cores

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-17 06:17:53 +08:00
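
A sketch of a Cayley rotation applied to a frozen right-singular basis, to illustrate what a rotate_basis arm does. It assumes a single dense block rather than the block-diagonal form, and the function name cayley is hypothetical.

```python
import numpy as np


def cayley(T):
    """Map an unconstrained square matrix to an orthogonal one."""
    K = T - T.T                       # skew-symmetric part
    I = np.eye(T.shape[0])
    # Q = (I - K)^{-1} (I + K); the linear solve is the per-forward cost.
    return np.linalg.solve(I - K, I + K)


rng = np.random.default_rng(0)
r, d_in = 4, 10
T = 0.1 * rng.standard_normal((r, r))
Q = cayley(T)
Q0 = cayley(np.zeros((r, r)))         # T = 0 gives the identity rotation

# Rotating an orthonormal basis keeps it orthonormal.
Vh_r = np.linalg.qr(rng.standard_normal((d_in, r)))[0].T   # (r, d_in)
Vh_rot = Q @ Vh_r

orth_err = np.abs(Q.T @ Q - np.eye(r)).max()
basis_err = np.abs(Vh_rot @ Vh_rot.T - np.eye(r)).max()
```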
wassname 9d027752ad variants: replace arrow's dense block with diagonal-plus-low-rank core
antipasto_arrow -> antipasto_dplr. The arrowhead's dense b x b block is the wrong
shape: b^2 params, mixes only the top-b, and sits on the S-scaled coords so its
perturbation is amplified by the largest singular values (block=128 collapsed to
45.7% at the gain's lr). Replace it with LoRA's lesson -- a low-rank core inside
the frozen basis, ADDED to the gain:

    DeltaW = U [diag(S_eff) + coeff * B A] Vh,   A:(k,r) B:(r,k), B=0 at init

The low-rank part mixes the whole top-r subspace for 2*r*k params (k=LoRA's rank),
and being additive (not * diag(S)) it is S-independent -- the amplification edge is
gone by construction. Diagonal gain unchanged; identity at init from B=0 and g=0.

Wired through benchmark (antipasto_lora_rank, run_id __k suffix), justfile, cost_report,
smoke (green, dplr attaches/trains/round-trips). Arrow code removed; its run results
stay on disk for comparison.

Co-Authored-By: Claudypoo <noreply@anthropic.com>
2026-06-15 20:13:15 +08:00
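
A NumPy sketch of the diagonal-plus-low-rank core from the formula above. Assumptions: the residual outside the top-r subspace is kept as a frozen W_res (PiSSA-style), S_eff uses the 1+ELU gain, and all shapes are illustrative.

```python
import numpy as np


def elu(z):
    # ELU with alpha = 1: z for z > 0, exp(z) - 1 otherwise.
    return np.expm1(np.minimum(z, 0.0)) + np.maximum(z, 0.0)


rng = np.random.default_rng(0)
d_out, d_in, r, k = 32, 24, 8, 2

W = rng.standard_normal((d_out, d_in))
U, S, Vh = np.linalg.svd(W, full_matrices=False)
U_r, S_r, Vh_r = U[:, :r], S[:r], Vh[:r]
W_res = W - (U_r * S_r) @ Vh_r        # frozen remainder (assumption)


def effective_weight(g, A, B, coeff=1.0):
    S_eff = S_r * (1.0 + elu(coeff * g))
    core = np.diag(S_eff) + coeff * (B @ A)   # (r, r); the B A term is additive
    return W_res + U_r @ core @ Vh_r


g = np.zeros(r)                               # gain at identity
A = 0.01 * rng.standard_normal((k, r))
B = np.zeros((r, k))                          # B = 0 at init
init_err = np.abs(effective_weight(g, A, B) - W).max()
n_lowrank_params = A.size + B.size            # 2 * r * k
```

The low-rank term is added to the core rather than multiplied by diag(S), so its size does not grow with the singular values it sits next to.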
wassname 0d40cc9b38 Add antipasto_arrow: structured fixed-basis core (cross-direction mixing)
antipasto's diagonal core can only rescale each frozen singular direction; it
can never let direction i's input drive direction j's output, yet the steered
behaviour is an off-axis combination. A dense r x r core fixes that but costs
r^2 params. antipasto_arrow uses the arrowhead structure instead: a dense b x b
block on the top-b singular directions (full coupling where the action lives)
plus a diagonal 1+ELU tail on the rest. b^2 + (r-b) params, one b x b matmul
per forward -- cross-direction mixing at diagonal-core cost, no Cayley solve.

Identity at init (M=0 -> B=I, g=0 -> gain=1). Verified on a Linear: rel_err
1.5e-7 at init; M[i,j] routes input dir j -> output dir i with weight exactly
M[i,j] (diagonal core forces 0); 14 train params at r=8,b=3 vs r^2=64.

Wired into benchmark (antipasto_block knob), smoke (block=2 for r=4), cost
report, and exports.

Co-Authored-By: Claudypoo <noreply@anthropic.com>
2026-06-14 19:18:59 +08:00
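
A sketch of the core shape this commit describes: a dense b x b block plus a diagonal 1+ELU tail. It assumes the block is parameterized as I + M, and arrow_core is a hypothetical name.

```python
import numpy as np


def elu(z):
    # ELU with alpha = 1: z for z > 0, exp(z) - 1 otherwise.
    return np.expm1(np.minimum(z, 0.0)) + np.maximum(z, 0.0)


def arrow_core(M, g):
    """Dense (I + M) block on the top-b directions, diagonal gain on the rest."""
    b, tail = M.shape[0], g.shape[0]
    core = np.zeros((b + tail, b + tail))
    core[:b, :b] = np.eye(b) + M
    core[b:, b:] = np.diag(1.0 + elu(g))
    return core


r, b = 8, 3
M = np.zeros((b, b))
g = np.zeros(r - b)
core_init = arrow_core(M, g)          # identity at init
n_params = M.size + g.size            # b^2 + (r - b) = 9 + 5 = 14

# An off-diagonal entry routes input direction 2 to output direction 0.
M[0, 2] = 0.25
core_mixed = arrow_core(M, g)
```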
wassname b80d7778af Add rotation-free S-space adapter cores (antipasto family)
Replace antipasto's rotation/Cayley with a bounded 1+ELU gain and split the
S-space idea into four interpretable PiSSA-style cores (frozen U/S/Vh, small
trainable core):

- antipasto: S_eff = S*(1+ELU(coeff*g)). exp-bounded attenuation, linear
  amplification (constant gradient, no runaway). g=0 -> exact identity.
- antipasto_rot: keeps the block-Cayley rotation as a separate variant for
  cost comparison (its per-forward solve is the 72ms vs 36ms gap).
- antipasto_ablate: contractive (I - a c c^T) diag(S), eigenvalues in [0,1],
  cannot blow up. Optional cov_orient (CorDA) basis.
- antipasto_corda: covariance-oriented oblique projector P = Vh C^{-1/2}, the
  data-energy basis rather than the weight-gain basis. 1+ELU gain.

Add scripts/_cost.py + scripts/cost_report.py: one-row-per-variant cost table
(trainable params, peak GPU mem, fwd/bwd ms, added MACs/tok, group_init ms).
Wire all four into the benchmark, smoke test, and __init__ exports.

External review (DeepSeek-v4-pro, docs/reviews/) verified the math; acted on
its one real point (corda g now inits to zeros for exact identity).

Co-Authored-By: Claudypoo <noreply@anthropic.com>
2026-06-14 19:12:27 +08:00
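
A sketch of two of the cores listed above: the 1+ELU gain and the contractive ablation factor. For the ablation it assumes c is unit-norm and a lies in [0, 1], which the commit does not state.

```python
import numpy as np


def elu(z):
    # ELU with alpha = 1: z for z > 0, exp(z) - 1 otherwise.
    return np.expm1(np.minimum(z, 0.0)) + np.maximum(z, 0.0)


def gain(g, coeff=1.0):
    # 1 + ELU: decays like exp(.) toward 0 for negative inputs,
    # grows linearly (slope 1) for positive inputs.
    return 1.0 + elu(coeff * g)


g = np.array([-50.0, -1.0, 0.0, 1.0, 50.0])
gains = gain(g)

# Contractive ablation factor: I - a c c^T with unit c and a in [0, 1].
rng = np.random.default_rng(0)
r, a = 6, 0.7
c = rng.standard_normal(r)
c /= np.linalg.norm(c)
P = np.eye(r) - a * np.outer(c, c)
eigs = np.linalg.eigvalsh(P)          # one eigenvalue 1 - a, the rest 1
```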
wassname 7df786e80b remove base_weight_fingerprint and test_lora_lite.py
- _base_weight_fingerprint was a PiSSA-only defensive check that cluttered
  every save with a per-target SHA256. If you load onto the wrong base, you get
  wrong weights -- that's user error, not a library bug.
- test_lora_lite.py deleted. All coverage lives in test_metamath_smoke.py
  which runs the real benchmark pipeline per variant.
2026-04-27 16:15:40 +08:00
wassname e624cd244f feat: near_zero/near_one init for trainable params (breaks bf16 dead-grad symmetry)
Trainable params that were init'd at exact 0 or 1 now use near_zero (N(0,1e-4))
or near_one (1 + N(0,1e-4)) to break bf16 symmetry without meaningfully
breaking identity-at-t=0. Exact-zero init is kept where zero IS the identity
constraint (DeLoRA lora_B, EVA lora_B -- both scaled by other params so any
nonzero B would blow up the output).

AntiPaSTO: delta_s and rot_T now near_zero. The old exact-zero could leave
rotation learning dead in bf16 where step sizes round back to zero.

IA3: lora_g now near_one instead of exact ones. Avoids the bf16 spacing issue
around 1.0 where eps_bf16 ~ 7.8e-3 and lr=1e-3 updates were rounding away.

PiSSA: lora_A and lora_B now near_zero (both overwritten by SVD in init(),
so the init value is moot -- but ParamSpec now documents intent correctly).

HRA: lora_U now near_zero (overwritten by symmetric init in init()).

ParamSpec: added 'near_zero' and 'near_one' init modes. Default changed from
'zeros' to 'near_zero'. Tests relaxed identity tolerances accordingly.
2026-04-27 15:55:05 +08:00
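
A standard-library sketch of the bf16 spacing issue mentioned for IA3: a value stored in bf16 at 1.0 does not move under an update of size 1e-3. The helper to_bf16 is hypothetical and only emulates round-to-nearest-even for finite values.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding_bias) >> 16) << 16   # keep the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


eps_bf16 = 2.0 ** -7                  # spacing just above 1.0, about 7.8e-3

# An lr=1e-3 sized step applied to a bf16 value of 1.0 rounds away.
stepped_up = to_bf16(1.0 + 1e-3)
stepped_down = to_bf16(1.0 - 1e-3)

# A step of one full eps survives.
stepped_eps = to_bf16(1.0 + eps_bf16)

# Near zero the same magnitude is representable with small relative error.
small = to_bf16(1e-3)
```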
wassname 24ba8deb02 simpler test 2026-04-27 09:47:07 +08:00
wassname 727ef6ea73 tidy tests to subset of metamath 2026-04-27 09:20:07 +08:00
wassname 1a93df10b2 fixes 2026-04-27 07:46:10 +08:00
wassname bb8887e66c tidy 2026-04-27 07:12:56 +08:00
wassname b179771cc6 tyro and benchmark 2026-04-27 06:23:30 +08:00
wassname 67a6daf6aa fix: 5 V4 must-fix bugs (DeLoRA B-init, HRA forward order, EVA A trainable, AntiPaSTO refs, qwen probe)
DeLoRA (variants/delora.py):
  lora_B init zeros not kaiming, matching peft (docs/refs/peft_delora_layer.py:139).
  With B=0 the t=0 delta is zero regardless of lambda, so identity holds with
  the peft default lambda0=15 instead of needing the lambda0=0 hack.

HRA (variants/hra.py):
  forward_input loop reversed: now applies x @ H_{r-1} ... H_0 = x @ R^T so
  the base layer computes x R^T W^T = F.linear(x, W @ R), matching peft. The
  bug was masked by paired-symmetry init (R = R^T at t=0) but would corrupt
  any non-symmetric U.

EVA (variants/eva.py):
  lora_A is now a trainable Parameter (peft semantics): SVD only changes the
  init. group_init still copies the SVD basis but under a no_grad guard.

AntiPaSTO (variants/antipasto.py):
  docstring now references arxiv.org/pdf/2601.07473 and
  github.com/wassname/AntiPaSTO so V4 review NO_REFERENCE flag is resolved.

qwen probe (scripts/qwen_train_probe.py):
  perturb_first_adapter walks priority list including lora_U (HRA) and
  lora_A (EVA, LoRA-style A-trainable variants) so HRA tests no longer raise
  'no perturbable adapter parameter found'.

smoke (tests/smoke.py):
  + hra_forward_order_smoke: distinguishing check that compares adapted output
    to F.linear(x, W @ R) with paired symmetry broken; would fail under the
    forward-iter bug.
  + EVA assert lora_A.requires_grad == True per layer.
  - DeLoRA bnb moved to bnb_skip (fp16 + B=0 + clamp(min=1e-4) overflow makes
    grad NaN; real bnb usage needs dequant).
  delora train still uses lambda0=0.1 because peft default 15.0 explodes
  Adam lr=1e-1 in 20 steps.
2026-04-26 20:57:24 +08:00
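
A NumPy sketch of the HRA forward-order check, assuming R = H_0 H_1 ... H_{r-1} with each H_i a Householder reflection. Shapes and names are illustrative, and U is random so the paired symmetry is broken.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 6, 5, 4
U = rng.standard_normal((d_in, r))    # one reflection vector per column
W = rng.standard_normal((d_out, d_in))
x = rng.standard_normal((3, d_in))


def householder(u):
    u = u / np.linalg.norm(u)
    return np.eye(len(u)) - 2.0 * np.outer(u, u)


H = [householder(U[:, i]) for i in range(r)]

R = np.eye(d_in)
for Hi in H:
    R = R @ Hi                        # R = H_0 H_1 ... H_{r-1}

expected = x @ (W @ R).T              # F.linear(x, W @ R) without bias

# Fixed order: x @ H_{r-1} ... H_0 == x @ R^T (each H_i is symmetric).
h = x
for i in reversed(range(r)):
    h = h @ H[i]
fixed = h @ W.T

# Forward iteration instead gives x @ R, which only matches when R = R^T.
h = x
for i in range(r):
    h = h @ H[i]
wrong = h @ W.T
```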
copilot 55757e829d fix V3 review must-fixes: DoRA bias passthrough + EVA load path
V3 external review (docs/audit/variants_review_v3.md, 97KB) found 3
must-fix bugs.

DoRA: bias was being scaled by m/||V|| because we operated on the full
base layer output. Now subtract bias before normalization, add back
after. Matches peft DoRA exactly (docs/refs/peft_lora_dora.py:157-161).
New smoke dora_bias_smoke verifies identity at t=0 with bias=True.

EVA load: adapter.load() called attach(), which called group_init(), which
required calibration_data and raised. Added _skip_group_init flag to
attach(); load() passes it. EVA group_init still raises loudly when
called directly without data. New smoke verifies save+load WITHOUT
calibration data on the load path.

Also tightened EVA error message.

Smoke now covers 8 variants + EVA roundtrip + DoRA-bias roundtrip + bnb
4/8-bit. ALL PASS.

V3 nice-to-haves (PiSSA scaling, AntiPaSTO init choice, stale GH refs)
deferred -- documented as intentional in module docstrings.
2026-04-26 19:50:48 +08:00
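
A simplified NumPy sketch of the DoRA bias bug and fix, assuming per-output-row norms and a hook that sees the layer output with the bias already added. It is not peft's exact computation, and the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 5, 7
W = rng.standard_normal((d_out, d_in))
bias = rng.standard_normal(d_out)
x = rng.standard_normal((3, d_in))

V = W + 0.01 * rng.standard_normal((d_out, d_in))    # direction: W plus a small delta
m = np.linalg.norm(W, axis=1) * (1.0 + 0.1 * rng.standard_normal(d_out))
scale = m / np.linalg.norm(V, axis=1)                # m / ||V||, one value per output row

# Reference: rescale the weight rows, then add the untouched bias.
reference = x @ (scale[:, None] * V).T + bias

layer_out = x @ V.T + bias                           # what the hook sees

buggy = scale * layer_out                            # bias gets scaled too
fixed = scale * (layer_out - bias) + bias            # subtract, scale, add back
```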
copilot 185eb29c70 fix v2 review bugs + add EVA, AntiPaSTO
DeLoRA: per-input-channel wnorm buffer (not scalar Parameter), forward
matches peft (x*wnorm @ A.T then per-rank scale (lambda/r)/(An*Bn)).
Smoke: 89.7% loss drop (was 35.8%).

HRA: symmetric repeated-column init (PEFT-style) instead of zero gate.
Adjacent Householder pairs cancel exactly so R=I at t=0, and U receives
gradient from step 0 (no dead-grad). Even r required.

IA3: split into two variants. ia3 stays output-side (k_proj/v_proj);
new ia3_ff is input-side (down_proj/fc2), matching peft is_feedforward.

Config: dropout field removed (never honored by any variant).

PiSSA: adapter.save records base-weight fingerprint per target;
adapter.load recomputes init then verifies fingerprint -> fails loud
when reloaded onto a different base.

EVA (new): data-driven init via group_init + calibration_data. Top-r
right singular vectors of pooled layer-input activations -> lora_A
(buffer, frozen); only lora_B trains. Stress-tests group_init API.

AntiPaSTO (new): SVD steering with frozen U,S,Vh,W_res and learnable
delta_s (per-singular-value bias) + rot_T (block-diagonal Cayley
rotation on V or U). Lite port of antipasto3 SVD adapter.

ParamSpec: as_buffer field + make_tensor() for buffer registration.
adapter.attach honors as_buffer with register_buffer; detach cleans
both _parameters and _buffers.

Smoke covers all 8 variants: identity at t=0, save/load round-trip,
gradient-driven loss drop. EVA gets dedicated test for calibration
data path. ALL PASS including bnb 4/8-bit path.
2026-04-26 19:41:59 +08:00
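
A NumPy sketch of the HRA symmetric repeated-column init, assuming each reflection is I - 2 u u^T / ||u||^2. Shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 4                                   # r must be even
half = rng.standard_normal((d, r // 2))
U = np.repeat(half, 2, axis=1)                # columns: u0, u0, u1, u1

R = np.eye(d)
for i in range(r):
    u = U[:, i] / np.linalg.norm(U[:, i])
    R = R @ (np.eye(d) - 2.0 * np.outer(u, u))

# Each adjacent pair is the same reflection applied twice, so it cancels.
init_err = np.abs(R - np.eye(d)).max()
```

U is nonzero here, which is what lets it receive gradient from the first step, while R is still the identity at t=0.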
wassname 7eeaeed206 Verify all variants on bnb 4bit/8bit; HRA paper-faithful rewrite
- Test all 6 variants against bnb.Linear8bitLt + Linear4bit in smoke
- bnb-friendly (LoRA, IA3, HRA, DeLoRA): identity err <= 2.4e-4
- bnb-incompatible (PiSSA, DoRA): fail-loud TypeError as expected
- HRA: rewrite to paper-faithful input-side reflections (h <- (I-2vv^T)h),
  fixing previous broken output-side formulation
- IA3: bypass dtype upcast for bnb (params stay fp16/quantized)
- DeLoRA: explicit type check rejecting non-nn.Linear (incl. bnb)
- adapter: special-case bnb param assignment via .data
- Re-verified Qwen0.6B HRA probe: drop=20.7%, id_err=0, reload=0
2026-04-26 18:08:06 +08:00
wassname 0d929f93b3 feat(hra): add Householder Reflection Adaptation, hook-only/bnb-friendly + Qwen proof 2026-04-26 17:58:56 +08:00
wassname 2abf616be6 feat(dora): add weight-decomposed LoRA variant for fp layers 2026-04-26 17:53:33 +08:00
wassname 699fde31bf feat: ia3 variant, real bnb 4bit/8bit smoke, dev guide split, user-only readme 2026-04-26 17:49:17 +08:00
wassname 69bf5f4e44 test: prove adapter training paths 2026-04-26 17:00:39 +08:00
wassname 4db5cee5a9 init 2026-04-26 14:10:20 +08:00