lora-lite

mirror of https://github.com/wassname/lora-lite.git synced 2026-06-27 16:45:56 +08:00

Author	SHA1	Message	Date
wassname	2a50373311	test: put scripts/ on sys.path so benchmark's sibling _cost import resolves in CI CI collection failed with ModuleNotFoundError: No module named '_cost' because exec_module loads the script without its dir on sys.path. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-19 08:47:41 +08:00
wassname	28d04f1e1d	gitignore: match loraxs_ review scratch; track curated loraxs_review.md Broaden raw/err patterns to raw/err so prefixed scratch (loraxs_raw.jsonl, loraxs_err.txt) is ignored. Add the GPT-5.5 review of the lora_xs variant as the curated artifact. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-19 06:04:25 +08:00
wassname	8005423c47	README: note LoRA-XS all-linear spread didn't help (test 55.6 vs down_proj 56.8) Paper spreads LoRA-XS across all q/k/v/o + FFN linears, not down_proj only. Tried it (150 modules, 0.154M params): test 55.6 / valid 62.0, slightly below the down_proj row at 6x params, within single-seed noise. down_proj-only stays the table entry. result: outputs/metamath_gsm8k_alllinear/...__seed0/result.json Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-18 23:49:36 +08:00
wassname	5d910996b3	justfile: bench-variant takes a target_override arg, routed to its own out dir LoRA-XS's paper recipe spreads across q/k/v/o + all 3 FFN projections, not down_proj only. run_id ignores target, so overridden runs go to outputs/metamath_gsm8k_alllinear to avoid clobbering the canonical down_proj results the README table is built from. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-18 21:53:30 +08:00
wassname	a75bed492b	README: add LoRA-XS variant row (test 56.8 / valid 68.0, params 0.025M) Qwen3.5-0.8B-Base, down_proj all 24 layers, r=32 alpha=32 lr=4e-3, 2500 steps. UAT: grad=0.699>0, dθ=60.0>0, base_grad_leaks=0. result: outputs/metamath_gsm8k/Qwen--Qwen3.5-0.8B-Base__lora_xs__s2500__seed0/result.json Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-18 21:28:10 +08:00
wassname	4e03f9c07f	lora_xs: fix docstring -- A=diag(Sr)Vhr has row norms Sr, not orthonormal External review (GPT-5.5) flagged 'two near-orthonormal bases' as inaccurate: only B=Ur is orthonormal; A folds the singular values so its rows are scaled. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-18 20:01:59 +08:00
wassname	c792ad3e5f	Add LoRA-XS variant: train only r×r core R between frozen SVD factors Bałazy et al. 2024 (arxiv 2405.17604). A=diag(Sr)Vhr, B=Ur frozen from top-r SVD of W (W left intact); only the r×r R is trained, init normal(0,1e-5) so the adapter ~ identity at t=0. ~25k params at r=32 (24 down_proj targets). justfile: alpha=r (scale=1) and lr=4e-3, matching the ref LLaMA math config. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-18 19:48:40 +08:00
wassname	12e13cca79	README: rot basis is within noise (seed order flips), soften V claim Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-18 03:32:01 +08:00
wassname	12fa56f328	Collapse antipasto family to one variant: rot(V) becomes canonical antipasto main keeps a single antipasto = the rotation+delta SVD adapter (the published method, paper 2601.07473), default rotate_basis=V. On GSM8K/down_proj rot(V) led the family (57.2) and at a single seed nothing separated from it, while the covariance-oriented arms cost 34-120s init for no gain. The full family (gain core, U/both rotations, ablate, dplr, corda, asvd) is preserved on the antipasto-variants branch. - antipasto.py is now the rotation implementation, registered as "antipasto" - delete antipasto_{rot,ablate,corda,asvd,dplr}.py + their config exports - benchmark/justfile/cost_report/smoke: drop the removed variants + dead knobs (antipasto_coeff/suppress_only/ablate_k/cov_orient/lora_rank); keep --antipasto-rotate-basis as antipasto's V/U/both/none ablation axis - README: subset table to one antipasto row, add rank column, note single-seed noise floor (~1.4pp), point the full family at the branch smoke: 10 passed Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-17 21:05:51 +08:00
wassname	21cc9a84ee	gitignore: external-review scratch (.pi, raw jsonl, err txt) + papers/md Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-17 20:29:30 +08:00
wassname	09dcfe0d41	Revert ablate lora_c warm-start: variance-PC seed didn't help on SFT Job 94 result (Qwen3.5-0.8B, GSM8K, 2500 steps, single seed): warm-start (top-k S-space output-variance PC): test 55.6 / valid 64.0, init 33.2s random-init (prior default): test 56.0 / valid 68.0, init 2.2s Equal-or-worse accuracy (within single-seed noise) for +31s of calibration init. The optimal ablation direction is loss-defined, not variance-defined, so seeding lora_c from the data-variance PC buys nothing here. Reverts fe562c2; ablate is back to the cheap random-init default. cov_orient (CorDA re-orient) path kept. The FIXME's actual proposal -- a contrastive dS seed -- stays open but needs pos/neg pairs this SFT benchmark lacks (only relevant for labelled steering). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-17 20:18:41 +08:00
wassname	458c3861e8	justfile: bench-variant takes a seed arg (default 0, unchanged) Lets the rot-basis ablation get a second seed without clobbering the seed0 run_id, so V>U>both can be confirmed against seed noise. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-17 18:29:23 +08:00
wassname	ef69c889a7	Merge antipasto-svd-cores: rotation-free S-space adapter family Replaces rotation/Cayley antipasto.py with three bounded, interpretable cores (gain 1+ELU, contractive ablation, CorDA/ASVD better-basis) + dplr, plus full GSM8K cost table and the rot-basis ablation. Resolves the three review FIXMEs from `3af2a2a` (rambling removed; CorDA split into its own variant; group_init recovers W_orig so it no longer runs on cropped matrices). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com> # Conflicts: # src/lora_lite/variants/antipasto.py	2026-06-17 18:28:06 +08:00
wassname	12109b6fc0	README: order variant table by test accuracy Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-17 18:26:45 +08:00
wassname	fe562c2b5c	antipasto_ablate: warm-start lora_c from S-space output variance group_init now seeds each lora_c to the top-k principal axes of the S-space output coords h=diag(S)Vh x (highest-energy output dirs => largest loss-grad on the ablation strength), so lora_c starts in a high-gradient region not random. Cheap r x r second moment when not orienting; reuses Sigma xx^T when cov_orient. Benchmark always calibrates ablate now. This is the data-variance direction, not a contrastive behavior dir (SFT has no pos/neg split) -- noted in the docstring. UAT: |cos(lora_c, top output-PC)| = 1.0000 vs ~0.35 chance; smoke green. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-17 18:18:32 +08:00
wassname	6cb350a4b6	README: fill IA3-FF row (56.3/62.0, 86k params, 0 added MACs) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-17 15:49:02 +08:00
wassname	4962bffd7d	README: fill EVA + IA3 baseline rows EVA 59.3/74.0 (28s SVD-warmstart init), IA3 52.3/62.0 (6k params, 0 added MACs). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-17 15:26:50 +08:00
wassname	7e024b4734	comment hygiene + HRA row: shorten docstrings, drop dead init branch, track asvd - variant.py: fix mislabeled "legacy entry" (make() is the live param path); drop unused near_one init branch - config.py: drop "replaces older LoraLiteConfig" history narration - antipasto_ablate.py: aspirational "should warm-start" comment -> tracked FIXME - antipasto_rot.py: cut "kept as separate variant" / "why antipasto dropped rotation" ramble - benchmark: merge duplicate antipasto/corda/asvd cfg branch - README: fill HRA row (test 59.2 / valid 70.0) - track antipasto_asvd.py (was imported+registered but uncommitted) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-17 11:16:07 +08:00
wassname	5f9d90d8b8	benchmark sweep: rot(U/both) ablation, whitening conclusion, cost rows - antipasto_rot: add rotate_basis="both" (independent V+U Cayley rotations), run_id suffix __rotU/__rotboth so ablation arms get their own output dirs - justfile: thread rotate_basis through bench-variant - corda/eva: padding-mask fix in calibration capture + bf16-tight residual - README: fill PiSSA/DoRA/CorDA/ASVD/ablate/dplr/rot rows; record the metric-axis ablation (C=I 56.0 > diag-C 55.6 > full-C 54.7) and the rotation ablation (V 57.2 > U 56.5 > both 55.6) conclusions - docs/reviews: external ref-checks + deepseek/gpt reviews of the cores Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-17 06:17:53 +08:00
wassname	7986edad2c	fix: calibration through cropped model + detach/checkpoint gaps (external review) gpt-5.5 review (decorrelated) found three real issues deepseek missed: - BLOCKER: calibration ran through a cropped model. attach() did init() (crops every target to W_res) then group_init() (calibration forward) then registered the adapter hooks -- so CorDA's covariance and Wanda's scores were collected from a model missing every target's top-r. Now register hooks BEFORE group_init; at g=0/B=0 they reconstruct the cropped component exactly, so calibration sees full W. - detach() left the model cropped (deleted buffers without adding the frozen top-r back). Now reconstructs W = W_res + U_r S_r (Vh|P)_r before removing buffers. - base-residual persistence wasn't in checkpoint metadata, so load->re-save dropped it. Persist base_weight_keys in metadata, validate on load, carry onto attach state. Docstring/citation cleanup (review + user style asks): - antipasto_corda: drop changelog narration and the stale "None -> plain SVD" claim (it raises now); exact reconstruction states W_res; slim the CPU/OOM note. - antipasto_dplr: drop the arrowhead archaeology; docstring math now matches the forward (p@A.T@B.T); fix the k=0 comment (code requires 1<=k<=r). - citations: Wanda (Sun+ 2023, 2306.11695), ASVD (Yuan+ 2023, 2312.05821), PiSSA (Meng+ 2024, 2404.02948), LoRA (Hu+ 2021, 2106.09685). Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-16 06:37:18 +08:00
wassname	d4ec550dd8	fix: corda silently ran as plain SVD; wire calibration + persist data-driven residual The benchmark only passed calibration_data to eva, so antipasto_corda's group_init hit `if calibration_data is None: return` and every corda run was actually plain SVD. The covariance orientation never executed -- all prior corda-vs-antipasto comparisons are void. - antipasto_corda.group_init: raise on None instead of silently degrading (orientation is the variant's whole identity; fail loud). - benchmark: feed ~256 MetaMath calibration samples (IPM, per PEFT/CorDA) to corda and to cov_orient ablate; run_id now carries an __lr tag. - adapter.save/load: a data-driven group_init rewrites the frozen base residual W_res into a form init() cannot reproduce at load (it only knows the plain top-r crop). Persist those residuals in the adapter and restore them. Fixes a reload-logits mismatch that was masked while group_init never ran. - probe check: compare every saved tensor (lora_ buffers AND base residuals) against the reloaded model state. - justfile: bench-variant gains an lr_override (the core wants a tamer lr than the gain's 5e-3). Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-16 05:56:02 +08:00
wassname	9d027752ad	variants: replace arrow's dense block with diagonal-plus-low-rank core antipasto_arrow -> antipasto_dplr. The arrowhead's dense b x b block is the wrong shape: b^2 params, mixes only the top-b, and sits on the S-scaled coords so its perturbation is amplified by the largest singular values (block=128 collapsed to 45.7% at the gain's lr). Replace it with LoRA's lesson -- a low-rank core inside the frozen basis, ADDED to the gain: DeltaW = U [diag(S_eff) + coeff * B A] Vh, A:(k,r) B:(r,k), B=0 at init The low-rank part mixes the whole top-r subspace for 2rk params (k=LoRA's rank), and being additive (not * diag(S)) it is S-independent -- the amplification edge is gone by construction. Diagonal gain unchanged; identity at init from B=0 and g=0. Wired through benchmark (antipasto_lora_rank, run_id __k suffix), justfile, cost_report, smoke (green, dplr attaches/trains/round-trips). Arrow code removed; its run results stay on disk for comparison. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 20:13:15 +08:00
wassname	2c56196dea	justfile/run_id: r override for low-rank antipasto sweeps bench-variant gains an r_override arg (alpha tracks r for the antipasto family); run_id appends __r<N> when an antipasto-family run uses r!=256, so the low-rank corda-vs-antipasto sweep does not overwrite the r=256 results. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 18:35:54 +08:00
wassname	e8ca6f5944	README: validation framing per wassname's wording; arrow large-block lr=1e-4 README: 'we validate the same way PEFT does; trained properly they clear 49% on GSM8K, all pass' + link to the benchmark script. justfile: arrow with block>8 uses lr=1e-4 not 5e-3. The 5e-3 that suits the tiny S-space gain destabilizes the large dense block -- block=128 at 5e-3 scored 45.7% (below the bar, vs block=8's 60.5%). Capacity sweep requeued at LoRA's 1e-4 to de-confound params-vs-lr. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 18:27:33 +08:00
wassname	6b7b3a47dd	README: frame the GSM8K table as a validation harness, not a leaderboard The point is that every adapter clears PEFT's ~48% LoRA bar on the same MetaMathQA->GSM8K protocol -- that all rows pass is the it-trains signal, not a competitive ranking. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 18:20:53 +08:00
wassname	6ab1dfff0e	README: antipasto variants as table rows; real PEFT reference - Fold the family into the main Variants table as rows (CorDA/ablate/arrow) instead of a separate table. - Lead with the point (freeze W's SVD, learn only a bounded gain -> interpretable, O(r) params) before any numbers. - Replace the unsourced 'PEFT reports 49.0%' line (wrong; LoRA is ~48%) with a real link to PEFT's method_comparison/MetaMathQA and a pointer to the benchmark script for hyperparameters. Link CorDA/Arditi papers inline. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 18:18:09 +08:00
wassname	fa69e0cac3	README: trim AntiPaSTO section for researcher audience Replace the per-experiment family breakdown table + comparison prose with a 2-sentence method description (frozen interpretable SVD basis, O(r) gain, the three variant cores). Experiment findings (rotation comparison, arrow capacity, cost/timing) belong in the research journal, not the README skim path. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 18:12:31 +08:00
wassname	d9d31a160f	variants: clean docstrings to research pseudocode; arrow block param Rewrite antipasto/ablate/corda/arrow docstrings to the house style (purpose + math block + identity line + refs), dropping the rambly meta-commentary aimed at past design decisions ('Changes vs the rotation version', chat references, inline measurements). Net -74 lines. Also answer the FIXMEs left on main's old copy: - group_init is Wanda/ASVD selection (re-rank W's own singular vectors), NOT CorDA re-orientation -- that is antipasto_corda.py. - it rebuilds the FULL W exactly (W_res + stored top-r == W), so the re-SVD sees the whole spectrum, not a cropped matrix. Arrow capacity: --antipasto-block CLI knob (justfile bench-variant 4th arg) so the block can be scaled toward LoRA params; run_id gets a __b<N> suffix so block-sweep runs do not collide. Smoke green (14 passed). Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 18:09:53 +08:00
wassname (Michael J Clark)	3af2a2a66a	Update antipasto.py	2026-06-15 15:41:38 +08:00
wassname	90b5199ed9	README: AntiPaSTO family GSM8K results (5 variants, r=256) Replace the stale single AntiPaSTO row (was 35.8K params from the removed rotation version, described block-Cayley which no longer exists) with the real 5000-step Qwen3-0.6B numbers and a family breakdown: corda 61.9% 14.3K (best: covariance-oriented basis) plain 61.4% 14.3K rot 61.4% 35.8K (the rotation this replaces) ablate 61.0% 14.4K arrow 60.5% 17.5K Headline: ~320x fewer trainable params than LoRA at ~97% of its accuracy. Rotation buys nothing (rot matches plain to 3 s.f. at 2.5x params, +20% wall-time, plus a per-forward Cayley solve), confirming the drop. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 07:05:45 +08:00
wassname	a5999bdeb8	docs: tighten suppress_only contract + arrow top-b selection note External (codex) review found the suppress_only "attenuation only" claim holds only for coeff>=0 (coeff<0 inverts the product and re-amplifies). Doc-only caveat in antipasto/_corda/_arrow; no math change (sweeps run coeff=1.0). Also clarify arrow's group_init top-b lands on largest-S-among-selected, not highest-score. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 06:24:23 +08:00
wassname	32b1fd885a	justfile: route antipasto bench through r=256/alpha=256 in bench-variant The README GSM8K sweep was queued as raw expanded commands with an unquoted --target-name '(q_proj|v_proj)$'; pueue runs via sh -c, so the parens errored instantly before training. Routing through bench-variant (bash shebang quotes the target) fixes it. Also bake the antipasto family's r=256/alpha=256 into the case block so it matches the published AntiPaSTO row, replacing the dead trailing "$@" (shebang recipes get no extra args). Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-15 05:58:34 +08:00
wassname	d6b242818a	justfile: lr=5e-3 for all antipasto_* cores in bench-variant The small-param antipasto family (gain/block/ablate/corda) all need the higher lr to clear the bf16 round-to-nearest floor, not just antipasto. Glob the case. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-14 19:20:35 +08:00
wassname	0d40cc9b38	Add antipasto_arrow: structured fixed-basis core (cross-direction mixing) antipasto's diagonal core can only rescale each frozen singular direction; it can never let direction i's input drive direction j's output, yet the steered behaviour is an off-axis combination. A dense r x r core fixes that but costs r^2 params. antipasto_arrow uses the arrowhead structure instead: a dense b x b block on the top-b singular directions (full coupling where the action lives) plus a diagonal 1+ELU tail on the rest. b^2 + (r-b) params, one b x b matmul per forward -- cross-direction mixing at diagonal-core cost, no Cayley solve. Identity at init (M=0 -> B=I, g=0 -> gain=1). Verified on a Linear: rel_err 1.5e-7 at init; M[i,j] routes input dir j -> output dir i with weight exactly M[i,j] (diagonal core forces 0); 14 train params at r=8,b=3 vs r^2=64. Wired into benchmark (antipasto_block knob), smoke (block=2 for r=4), cost report, and exports. Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-14 19:18:59 +08:00
wassname	b80d7778af	Add rotation-free S-space adapter cores (antipasto family) Replace antipasto's rotation/Cayley with a bounded 1+ELU gain and split the S-space idea into four interpretable PiSSA-style cores (frozen U/S/Vh, small trainable core): - antipasto: S_eff = S*(1+ELU(coeff*g)). exp-bounded attenuation, linear amplification (constant gradient, no runaway). g=0 -> exact identity. - antipasto_rot: keeps the block-Cayley rotation as a separate variant for cost comparison (its per-forward solve is the 72ms vs 36ms gap). - antipasto_ablate: contractive (I - a c c^T) diag(S), eigenvalues in [0,1], cannot blow up. Optional cov_orient (CorDA) basis. - antipasto_corda: covariance-oriented oblique projector P = Vh C^{-1/2}, the data-energy basis rather than the weight-gain basis. 1+ELU gain. Add scripts/_cost.py + scripts/cost_report.py: one-row-per-variant cost table (trainable params, peak GPU mem, fwd/bwd ms, added MACs/tok, group_init ms). Wire all four into the benchmark, smoke test, and __init__ exports. External review (DeepSeek-v4-pro, docs/reviews/) verified the math; acted on its one real point (corda g now inits to zeros for exact identity). Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-06-14 19:12:27 +08:00
wassname (Michael J Clark)	e5048fcaff	Update antipasto.py	2026-06-10 15:55:14 +08:00
wassname (Michael J Clark)	0dcbc753ac	Update antipasto.py	2026-06-10 15:54:49 +08:00
wassname	072a816cee	docs: fix hallucinated arxiv links in variants table AntiPaSTO, EVA, and HRA pointed at unrelated papers (stock prediction, LLM-vs-lawyer study, 2D Ising model). Replaced with verified IDs. Co-Authored-By: Claudypoo <claudypoo@noreply.invalid>	2026-05-26 05:48:49 +08:00
wassname	ce8c250422	perf: use matmul for lora adapter projections	2026-05-21 08:23:56 +08:00
wassname	56937e1b18	remove dead code: _road_matrix, callable(m) clause, silent git fallback - delete _road_matrix in variants/road.py (zero callers) - drop redundant callable(m) clause in is_linear_like (every nn.Module is callable) - remove try/except in current_git_commit so missing git crashes loudly instead of writing "unknown" into the results TSV Co-Authored-By: Claudypoo <noreply@anthropic.com>	2026-05-19 19:11:32 +08:00
wassname	19888fbb82	antipasto: replace EVA-style group_init with Wanda-style dimension selection Score each singular dimension by S[i] * mean|X @ Vh[i]| (weight magnitude times activation magnitude), then pick top-r by joint score instead of top-r by S alone. Keeps the weight-SVD basis; only reorders which r dimensions are retained based on real input activations. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-01 21:24:52 +08:00
wassname	f91c7b23f2	antipasto: add EVA-style data-driven group_init Weight-SVD init (PiSSA-style) kept as fallback; when calibration_data is provided, group_init() collects pre-hook activations, SVDs the pooled inputs per layer, and re-decomposes W_orig through the top-r input-PCA directions. Vhr_final = Vh_A @ Vhr_new keeps rows orthonormal while preserving the input-aligned span. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-01 20:55:56 +08:00
wassname	b698331cfa	feat: add HRA benchmark result (61.6%), update README table	2026-04-27 20:07:19 +08:00
wassname	f6fd410677	benchmark: antipasto rotate_basis CLI + lr=5e-3 + ablation queue	2026-04-27 16:29:25 +08:00
wassname	88f107a423	antipasto: delta_s init 4e-4+N(0,4e-4) from antipasto3, rotate_basis='none' option	2026-04-27 16:27:12 +08:00
wassname	7df786e80b	remove base_weight_fingerprint and test_lora_lite.py - _base_weight_fingerprint was PiSSA-only defensive check that cluttered every save with per-target SHA256. If you load onto wrong base, you get wrong weights -- that's user error, not a library bug. - test_lora_lite.py deleted. All coverage lives in test_metamath_smoke.py which runs the real benchmark pipeline per variant.	2026-04-27 16:15:40 +08:00
wassname	e624cd244f	feat: near_zero/near_one init for trainable params (breaks bf16 dead-grad symmetry) Trainable params that were init'd at exact 0 or 1 now use near_zero (N(0,1e-4)) or near_one (1 + N(0,1e-4)) to break bf16 symmetry without meaningfully breaking identity-at-t=0. Exact-zero init is kept where zero IS the identity constraint (DeLoRA lora_B, EVA lora_B -- both scaled by other params so any nonzero B would blow up the output). AntiPaSTO: delta_s and rot_T now near_zero. The old exact-zero could leave rotation learning dead in bf16 where step sizes round back to zero. IA3: lora_g now near_one instead of exact ones. Avoids the bf16 spacing issue around 1.0 where eps_bf16 ~ 7.8e-3 and lr=1e-3 updates were rounding away. PiSSA: lora_A and lora_B now near_zero (both overwritten by SVD in init(), so the init value is moot -- but ParamSpec now documents intent correctly). HRA: lora_U now near_zero (overwritten by symmetric init in init()). ParamSpec: added 'near_zero' and 'near_one' init modes. Default changed from 'zeros' to 'near_zero'. Tests relaxed identity tolerances accordingly.	2026-04-27 15:55:05 +08:00
wassname	0bd091fe5b	tidy	2026-04-27 11:44:40 +08:00
wassname	a342801807	wip	2026-04-27 11:24:19 +08:00
wassname	24ba8deb02	simpler test	2026-04-27 09:47:07 +08:00
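
The trainable-parameter counts quoted in the commit messages above follow from the formulas those messages state: an r×r core per target for LoRA-XS, b^2 + (r-b) for the arrowhead core, and 2rk for the low-rank part of the diagonal-plus-low-rank core. A minimal sketch in plain Python that reproduces the quoted numbers; the function names are illustrative assumptions, not part of lora-lite:

```python
# Hypothetical helpers (not lora-lite API): recompute the parameter
# counts quoted in the commit messages from the stated formulas.

def lora_xs_params(r: int, n_targets: int) -> int:
    """LoRA-XS trains only an r x r core R per target module."""
    return n_targets * r * r


def arrow_params(r: int, b: int) -> int:
    """Arrowhead core: dense b x b block plus a diagonal tail on the other r - b directions."""
    return b * b + (r - b)


def dplr_lowrank_params(r: int, k: int) -> int:
    """Low-rank part of the diagonal-plus-low-rank core: A is (k, r), B is (r, k)."""
    return 2 * r * k


print(lora_xs_params(32, 24))   # 24576, the "~25k" / 0.025M down_proj row
print(lora_xs_params(32, 150))  # 153600, the 0.154M all-linear run
print(arrow_params(8, 3))       # 14, versus r^2 = 64 for a dense core
```

The all-linear run has 150 / 24 = 6.25 times as many targets, which is the "6x params" in commit 8005423c47.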


73 Commits
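
Commit e624cd244f notes that bf16 spacing near 1.0 is about 7.8e-3, so lr=1e-3 updates to an IA3 gain sitting at exactly 1.0 were rounding away. The arithmetic can be checked without a GPU by emulating bfloat16 rounding on the float32 bit pattern. This is a sketch under stated assumptions: `to_bf16` is a hypothetical helper, not lora-lite code, and it handles only ordinary finite values:

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a Python float.

    Hypothetical helper for illustration; finite, non-overflowing inputs only.
    """
    # Reinterpret the float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits; add a bias so truncation rounds to nearest-even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


spacing = 2.0 ** -7                 # bf16 has 7 mantissa bits: 0.0078125 above 1.0
print(to_bf16(1.0 + 1e-3))          # 1.0 -> the update is lost
print(to_bf16(1.0 + spacing))       # 1.0078125 -> the next representable value
```

An update smaller than half the spacing (about 3.9e-3 here) rounds back to the starting value, which is consistent with the commit's description of lr=1e-3 steps vanishing at 1.0.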