lora-lite

mirror of https://github.com/wassname/lora-lite.git synced 2026-06-27 16:30:44 +08:00

Author	SHA1	Message	Date
wassname	2c56196dea	justfile/run_id: r override for low-rank antipasto sweeps bench-variant gains an r_override arg (alpha tracks r for the antipasto family); run_id appends __r<N> when an antipasto-family run uses r!=256, so the low-rank corda-vs-antipasto sweep does not overwrite the r=256 results. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 18:35:54 +08:00
wassname	e8ca6f5944	README: validation framing per wassname's wording; arrow large-block lr=1e-4 README: 'we validate the same way PEFT does; trained properly they clear 49% on GSM8K, all pass' + link to the benchmark script. justfile: arrow with block>8 uses lr=1e-4 not 5e-3. The 5e-3 that suits the tiny S-space gain destabilizes the large dense block -- block=128 at 5e-3 scored 45.7% (below the bar, vs block=8's 60.5%). Capacity sweep requeued at LoRA's 1e-4 to de-confound params-vs-lr. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 18:27:33 +08:00
wassname	6b7b3a47dd	README: frame the GSM8K table as a validation harness, not a leaderboard The point is that every adapter clears PEFT's ~48% LoRA bar on the same MetaMathQA->GSM8K protocol -- that all rows pass is the it-trains signal, not a competitive ranking. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 18:20:53 +08:00
wassname	6ab1dfff0e	README: antipasto variants as table rows; real PEFT reference - Fold the family into the main Variants table as rows (CorDA/ablate/arrow) instead of a separate table. - Lead with the point (freeze W's SVD, learn only a bounded gain -> interpretable, O(r) params) before any numbers. - Replace the unsourced 'PEFT reports 49.0%' line (wrong; LoRA is ~48%) with a real link to PEFT's method_comparison/MetaMathQA and a pointer to the benchmark script for hyperparameters. Link CorDA/Arditi papers inline. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 18:18:09 +08:00
wassname	fa69e0cac3	README: trim AntiPaSTO section for researcher audience Replace the per-experiment family breakdown table + comparison prose with a 2-sentence method description (frozen interpretable SVD basis, O(r) gain, the three variant cores). Experiment findings (rotation comparison, arrow capacity, cost/timing) belong in the research journal, not the README skim path. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 18:12:31 +08:00
wassname	d9d31a160f	variants: clean docstrings to research pseudocode; arrow block param Rewrite antipasto/ablate/corda/arrow docstrings to the house style (purpose + math block + identity line + refs), dropping the rambly meta-commentary aimed at past design decisions ('Changes vs the rotation version', chat references, inline measurements). Net -74 lines. Also answer the FIXMEs left on main's old copy: - group_init is Wanda/ASVD selection (re-rank W's own singular vectors), NOT CorDA re-orientation -- that is antipasto_corda.py. - it rebuilds the FULL W exactly (W_res + stored top-r == W), so the re-SVD sees the whole spectrum, not a cropped matrix. Arrow capacity: --antipasto-block CLI knob (justfile bench-variant 4th arg) so the block can be scaled toward LoRA params; run_id gets a __b<N> suffix so block-sweep runs do not collide. Smoke green (14 passed). Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 18:09:53 +08:00
wassname	90b5199ed9	README: AntiPaSTO family GSM8K results (5 variants, r=256) Replace the stale single AntiPaSTO row (was 35.8K params from the removed rotation version, described block-Cayley which no longer exists) with the real 5000-step Qwen3-0.6B numbers and a family breakdown: corda 61.9% 14.3K (best: covariance-oriented basis) plain 61.4% 14.3K rot 61.4% 35.8K (the rotation this replaces) ablate 61.0% 14.4K arrow 60.5% 17.5K Headline: ~320x fewer trainable params than LoRA at ~97% of its accuracy. Rotation buys nothing (rot matches plain to 3 s.f. at 2.5x params, +20% wall-time, plus a per-forward Cayley solve), confirming the drop. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 07:05:45 +08:00
wassname	a5999bdeb8	docs: tighten suppress_only contract + arrow top-b selection note External (codex) review found the suppress_only "attenuation only" claim holds only for coeff>=0 (coeff<0 inverts the product and re-amplifies). Doc-only caveat in antipasto/_corda/_arrow; no math change (sweeps run coeff=1.0). Also clarify arrow's group_init top-b lands on largest-S-among-selected, not highest-score. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 06:24:23 +08:00
wassname	32b1fd885a	justfile: route antipasto bench through r=256/alpha=256 in bench-variant The README GSM8K sweep was queued as raw expanded commands with an unquoted --target-name '(q_proj|v_proj)$'; pueue runs via sh -c, so the parens errored instantly before training. Routing through bench-variant (bash shebang quotes the target) fixes it. Also bake the antipasto family's r=256/alpha=256 into the case block so it matches the published AntiPaSTO row, replacing the dead trailing "$@" (shebang recipes get no extra args). Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 05:58:34 +08:00
wassname	d6b242818a	justfile: lr=5e-3 for all antipasto_* cores in bench-variant The small-param antipasto family (gain/block/ablate/corda) all need the higher lr to clear the bf16 round-to-nearest floor, not just antipasto. Glob the case. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-14 19:20:35 +08:00
wassname	0d40cc9b38	Add antipasto_arrow: structured fixed-basis core (cross-direction mixing) antipasto's diagonal core can only rescale each frozen singular direction; it can never let direction i's input drive direction j's output, yet the steered behaviour is an off-axis combination. A dense r x r core fixes that but costs r^2 params. antipasto_arrow uses the arrowhead structure instead: a dense b x b block on the top-b singular directions (full coupling where the action lives) plus a diagonal 1+ELU tail on the rest. b^2 + (r-b) params, one b x b matmul per forward -- cross-direction mixing at diagonal-core cost, no Cayley solve. Identity at init (M=0 -> B=I, g=0 -> gain=1). Verified on a Linear: rel_err 1.5e-7 at init; M[i,j] routes input dir j -> output dir i with weight exactly M[i,j] (diagonal core forces 0); 14 train params at r=8,b=3 vs r^2=64. Wired into benchmark (antipasto_block knob), smoke (block=2 for r=4), cost report, and exports. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-14 19:18:59 +08:00
wassname	b80d7778af	Add rotation-free S-space adapter cores (antipasto family) Replace antipasto's rotation/Cayley with a bounded 1+ELU gain and split the S-space idea into four interpretable PiSSA-style cores (frozen U/S/Vh, small trainable core): - antipasto: S_eff = S*(1+ELU(coeff*g)). exp-bounded attenuation, linear amplification (constant gradient, no runaway). g=0 -> exact identity. - antipasto_rot: keeps the block-Cayley rotation as a separate variant for cost comparison (its per-forward solve is the 72ms vs 36ms gap). - antipasto_ablate: contractive (I - a c c^T) diag(S), eigenvalues in [0,1], cannot blow up. Optional cov_orient (CorDA) basis. - antipasto_corda: covariance-oriented oblique projector P = Vh C^{-1/2}, the data-energy basis rather than the weight-gain basis. 1+ELU gain. Add scripts/_cost.py + scripts/cost_report.py: one-row-per-variant cost table (trainable params, peak GPU mem, fwd/bwd ms, added MACs/tok, group_init ms). Wire all four into the benchmark, smoke test, and __init__ exports. External review (DeepSeek-v4-pro, docs/reviews/) verified the math; acted on its one real point (corda g now inits to zeros for exact identity). Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-14 19:12:27 +08:00
wassname (Michael J Clark)	e5048fcaff	Update antipasto.py	2026-06-10 15:55:14 +08:00
wassname (Michael J Clark)	0dcbc753ac	Update antipasto.py	2026-06-10 15:54:49 +08:00
wassname	072a816cee	docs: fix hallucinated arxiv links in variants table AntiPaSTO, EVA, and HRA pointed at unrelated papers (stock prediction, LLM-vs-lawyer study, 2D Ising model). Replaced with verified IDs. Co-Authored-By: Claudypoo <claudypoo@noreply.invalid>	2026-05-26 05:48:49 +08:00
wassname	ce8c250422	perf: use matmul for lora adapter projections	2026-05-21 08:23:56 +08:00
wassname	56937e1b18	remove dead code: _road_matrix, callable(m) clause, silent git fallback - delete _road_matrix in variants/road.py (zero callers) - drop redundant callable(m) clause in is_linear_like (every nn.Module is callable) - remove try/except in current_git_commit so missing git crashes loudly instead of writing "unknown" into the results TSV Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-05-19 19:11:32 +08:00
wassname	19888fbb82	antipasto: replace EVA-style group_init with Wanda-style dimension selection Score each singular dimension by S[i] * mean|X @ Vh[i]| (weight magnitude times activation magnitude), then pick top-r by joint score instead of top-r by S alone. Keeps the weight-SVD basis; only reorders which r dimensions are retained based on real input activations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-01 21:24:52 +08:00
wassname	f91c7b23f2	antipasto: add EVA-style data-driven group_init Weight-SVD init (PiSSA-style) kept as fallback; when calibration_data is provided, group_init() collects pre-hook activations, SVDs the pooled inputs per layer, and re-decomposes W_orig through the top-r input-PCA directions. Vhr_final = Vh_A @ Vhr_new keeps rows orthonormal while preserving the input-aligned span. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-01 20:55:56 +08:00
wassname	b698331cfa	feat: add HRA benchmark result (61.6%), update README table	2026-04-27 20:07:19 +08:00
wassname	f6fd410677	benchmark: antipasto rotate_basis CLI + lr=5e-3 + ablation queue	2026-04-27 16:29:25 +08:00
wassname	88f107a423	antipasto: delta_s init 4e-4+N(0,4e-4) from antipasto3, rotate_basis='none' option	2026-04-27 16:27:12 +08:00
wassname	7df786e80b	remove base_weight_fingerprint and test_lora_lite.py - _base_weight_fingerprint was PiSSA-only defensive check that cluttered every save with per-target SHA256. If you load onto wrong base, you get wrong weights -- that's user error, not a library bug. - test_lora_lite.py deleted. All coverage lives in test_metamath_smoke.py which runs the real benchmark pipeline per variant.	2026-04-27 16:15:40 +08:00
wassname	e624cd244f	feat: near_zero/near_one init for trainable params (breaks bf16 dead-grad symmetry) Trainable params that were init'd at exact 0 or 1 now use near_zero (N(0,1e-4)) or near_one (1 + N(0,1e-4)) to break bf16 symmetry without meaningfully breaking identity-at-t=0. Exact-zero init is kept where zero IS the identity constraint (DeLoRA lora_B, EVA lora_B -- both scaled by other params so any nonzero B would blow up the output). AntiPaSTO: delta_s and rot_T now near_zero. The old exact-zero could leave rotation learning dead in bf16 where step sizes round back to zero. IA3: lora_g now near_one instead of exact ones. Avoids the bf16 spacing issue around 1.0 where eps_bf16 ~ 7.8e-3 and lr=1e-3 updates were rounding away. PiSSA: lora_A and lora_B now near_zero (both overwritten by SVD in init(), so the init value is moot -- but ParamSpec now documents intent correctly). HRA: lora_U now near_zero (overwritten by symmetric init in init()). ParamSpec: added 'near_zero' and 'near_one' init modes. Default changed from 'zeros' to 'near_zero'. Tests relaxed identity tolerances accordingly.	2026-04-27 15:55:05 +08:00
wassname	0bd091fe5b	tidy	2026-04-27 11:44:40 +08:00
wassname	a342801807	wip	2026-04-27 11:24:19 +08:00
wassname	24ba8deb02	simpler test	2026-04-27 09:47:07 +08:00
wassname	b60a8c3f9b	readme	2026-04-27 09:46:52 +08:00
wassname	727ef6ea73	tidy tests to subset of metamath	2026-04-27 09:20:07 +08:00
wassname	1a93df10b2	fixes	2026-04-27 07:46:10 +08:00
wassname	bb8887e66c	tidy	2026-04-27 07:12:56 +08:00
wassname	74c374e741	tidy, review	2026-04-27 07:03:24 +08:00
wassname	a44fc039af	rm defensive docstr	2026-04-27 06:39:18 +08:00
wassname	a81ed6ffaf	misc	2026-04-27 06:23:36 +08:00
wassname	b179771cc6	tyro and benchmark	2026-04-27 06:23:30 +08:00
wassname	67a6daf6aa	fix: 5 V4 must-fix bugs (DeLoRA B-init, HRA forward order, EVA A trainable, AntiPaSTO refs, qwen probe) DeLoRA (variants/delora.py): lora_B init zeros not kaiming, matching peft (docs/refs/peft_delora_layer.py:139). With B=0 the t=0 delta is zero regardless of lambda, so identity holds with the peft default lambda0=15 instead of needing the lambda0=0 hack. HRA (variants/hra.py): forward_input loop reversed: now applies x @ H_{r-1} ... H_0 = x @ R^T so the base layer computes x R^T W^T = F.linear(x, W @ R), matching peft. The bug was masked by paired-symmetry init (R = R^T at t=0) but would corrupt any non-symmetric U. EVA (variants/eva.py): lora_A is now a trainable Parameter (peft semantics): SVD only changes the init. group_init still copies the SVD basis but under a no_grad guard. AntiPaSTO (variants/antipasto.py): docstring now references arxiv.org/pdf/2601.07473 and github.com/wassname/AntiPaSTO so V4 review NO_REFERENCE flag is resolved. qwen probe (scripts/qwen_train_probe.py): perturb_first_adapter walks priority list including lora_U (HRA) and lora_A (EVA, LoRA-style A-trainable variants) so HRA tests no longer raise 'no perturbable adapter parameter found'. smoke (tests/smoke.py): + hra_forward_order_smoke: distinguishing check that compares adapted output to F.linear(x, W @ R) with paired symmetry broken; would fail under the forward-iter bug. + EVA assert lora_A.requires_grad == True per layer. - DeLoRA bnb moved to bnb_skip (fp16 + B=0 + clamp(min=1e-4) overflow makes grad NaN; real bnb usage needs dequant). delora train still uses lambda0=0.1 because peft default 15.0 explodes Adam lr=1e-1 in 20 steps.	2026-04-26 20:57:24 +08:00
wassname	053901e0ca	types, review	2026-04-26 20:35:38 +08:00
copilot	55757e829d	fix V3 review must-fixes: DoRA bias passthrough + EVA load path V3 external review (docs/audit/variants_review_v3.md, 97KB) found 3 must-fix bugs. DoRA: bias was being scaled by m/||V|| because we operated on the full base layer output. Now subtract bias before normalization, add back after. Matches peft DoRA exactly (docs/refs/peft_lora_dora.py:157-161). New smoke dora_bias_smoke verifies identity at t=0 with bias=True. EVA load: adapter.load() called attach() which called group_init() which required calibration_data and raised. Added _skip_group_init flag to attach(); load() passes it. EVA group_init still raises loudly when called directly without data. New smoke verifies save+load WITHOUT calibration data on load path. Also tightened EVA error message. Smoke now covers 8 variants + EVA roundtrip + DoRA-bias roundtrip + bnb 4/8-bit. ALL PASS. V3 nice-to-haves (PiSSA scaling, AntiPaSTO init choice, stale GH refs) deferred -- documented as intentional in module docstrings.	2026-04-26 19:50:48 +08:00
copilot	185eb29c70	fix v2 review bugs + add EVA, AntiPaSTO DeLoRA: per-input-channel wnorm buffer (not scalar Parameter), forward matches peft (x*wnorm @ A.T then per-rank scale (lambda/r)/(An*Bn)). Smoke: 89.7% loss drop (was 35.8%). HRA: symmetric repeated-column init (PEFT-style) instead of zero gate. Adjacent Householder pairs cancel exactly so R=I at t=0, and U receives gradient from step 0 (no dead-grad). Even r required. IA3: split into two variants. ia3 stays output-side (k_proj/v_proj); new ia3_ff is input-side (down_proj/fc2), matching peft is_feedforward. Config: dropout field removed (never honored by any variant). PiSSA: adapter.save records base-weight fingerprint per target; adapter.load recomputes init then verifies fingerprint -> fails loud when reloaded onto a different base. EVA (new): data-driven init via group_init + calibration_data. Top-r right singular vectors of pooled layer-input activations -> lora_A (buffer, frozen); only lora_B trains. Stress-tests group_init API. AntiPaSTO (new): SVD steering with frozen U,S,Vh,W_res and learnable delta_s (per-singular-value bias) + rot_T (block-diagonal Cayley rotation on V or U). Lite port of antipasto3 SVD adapter. ParamSpec: as_buffer field + make_tensor() for buffer registration. adapter.attach honors as_buffer with register_buffer; detach cleans both _parameters and _buffers. Smoke covers all 8 variants: identity at t=0, save/load round-trip, gradient-driven loss drop. EVA gets dedicated test for calibration data path. ALL PASS including bnb 4/8-bit path.	2026-04-26 19:41:59 +08:00
wassname	fdb4c77d6c	Add reference-impl URLs to variant docstrings + V2 external review - Fetch canonical reference impls for offline review: * peft_{lora,hra,delora,ia3}_layer.py + peft_lora_{dora,variants}.py * orig_pissa_init.py (MuLabPKU/PiSSA) * orig_hra_layer.py (DaShenZi721/HRA) * orig_delora.py (ExplainableML/DeLoRA author fork) - Add reference-impl URLs to all 6 variant docstrings - Document HRA gate=0 dead-grad issue and DoRA detach-omission in their docstrings - Re-run external review (codex) with refs available -> docs/audit/variants_review_v2.md Major NEW findings vs paper-only review: * DeLoRA: scalar W.norm() should be per-input-channel norm(dim=0) * HRA: PEFT uses symmetric repeated-column init (no dead grad), not zero gate * IA3: FFN targets need input-side gating, not output, our up_proj advice wrong * All LoRA-family: cfg.dropout silently ignored (no-op) * DeLoRA: wnorm should be persistent buffer, not Parameter HRA and DeLoRA upgraded to BUGGY (from Partial)	2026-04-26 19:27:47 +08:00
wassname	d0b4c52740	External review: per-variant audit + design notes - Two acpx external reviews (codex + opencode): * docs/audit/variants_review.md: per-variant paper-vs-impl audit * docs/audit/design_review.md: peft EVA / baukit / antipasto3 vs lora-lite * docs/audit/SUMMARY.md: aggregate verdicts + 3 risks + 5 follow-ups - docs/refs/: peft_eva.py, peft_eva_finetuning.py, baukit_nethook.py, antipasto3_svd_adapter.py for offline reference Findings: LoRA clean; PiSSA/DoRA/IA3/HRA/DeLoRA have documented partial deviations. Top risks: init/grad tradeoffs hidden by coarse tests; qwen probe lacks strict identity tol; IA3 target placement untested.	2026-04-26 19:01:29 +08:00
wassname	7eeaeed206	Verify all variants on bnb 4bit/8bit; HRA paper-faithful rewrite - Test all 6 variants against bnb.Linear8bitLt + Linear4bit in smoke - bnb-friendly (LoRA, IA3, HRA, DeLoRA): identity err <= 2.4e-4 - bnb-incompatible (PiSSA, DoRA): fail-loud TypeError as expected - HRA: rewrite to paper-faithful input-side reflections (h <- (I-2vv^T)h), fixing previous broken output-side formulation - IA3: bypass dtype upcast for bnb (params stay fp16/quantized) - DeLoRA: explicit type check rejecting non-nn.Linear (incl. bnb) - adapter: special-case bnb param assignment via .data - Re-verified Qwen0.6B HRA probe: drop=20.7%, id_err=0, reload=0	2026-04-26 18:08:06 +08:00
wassname	0d929f93b3	feat(hra): add Householder Reflection Adaptation, hook-only/bnb-friendly + Qwen proof	2026-04-26 17:58:56 +08:00
wassname	43e620176c	docs: record DoRA + IA3 Qwen-0.6B proof results (tasks 80, 81)	2026-04-26 17:54:54 +08:00
wassname	2abf616be6	feat(dora): add weight-decomposed LoRA variant for fp layers	2026-04-26 17:53:33 +08:00
wassname	699fde31bf	feat: ia3 variant, real bnb 4bit/8bit smoke, dev guide split, user-only readme	2026-04-26 17:49:17 +08:00
wassname	f2d9021511	ci: add publishable check workflow	2026-04-26 17:09:47 +08:00
wassname	69bf5f4e44	test: prove adapter training paths	2026-04-26 17:00:39 +08:00
wassname	4db5cee5a9	init	2026-04-26 14:10:20 +08:00
wassname	de97724b65	init	2026-04-26 14:10:18 +08:00

50 Commits
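
The bf16 rounding issue described in commit e624cd244f (updates of about 1e-3 around 1.0 rounding away because bf16 spacing there is about 7.8e-3) can be reproduced without any ML library. The sketch below is an assumption-laden illustration, not repository code: to_bf16 is a hypothetical helper that emulates round-to-nearest-even bfloat16 by keeping the top 16 bits of a float32, and it handles finite inputs only.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even).

    bfloat16 is the top 16 bits of an IEEE float32, so rounding is done on
    the float32 bit pattern. NaN and infinity are not handled here.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


if __name__ == "__main__":
    w = 1.0
    print(to_bf16(w + 1e-3))     # update smaller than half the spacing is lost
    print(to_bf16(w + 2 ** -7))  # one full bf16 step above 1.0 survives
```

A parameter stored in bf16 at exactly 1.0 therefore ignores a 1e-3 step, which is consistent with the commit's motivation for moving away from exact-one initialisation and for the higher learning rate used by the small-parameter antipasto cores.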