27 Commits

Author SHA1 Message Date
wassname 8005423c47 README: note LoRA-XS all-linear spread didn't help (test 55.6 vs down_proj 56.8)
Paper spreads LoRA-XS across all q/k/v/o + FFN linears, not down_proj only.
Tried it (150 modules, 0.154M params): test 55.6 / valid 62.0, slightly below
the down_proj row at 6x params, within single-seed noise. down_proj-only stays
the table entry. result: outputs/metamath_gsm8k_alllinear/...__seed0/result.json

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-18 23:49:36 +08:00
wassname a75bed492b README: add LoRA-XS variant row (test 56.8 / valid 68.0, params 0.025M)
Qwen3.5-0.8B-Base, down_proj all 24 layers, r=32 alpha=32 lr=4e-3, 2500 steps.
UAT: grad=0.699>0, dθ=60.0>0, base_grad_leaks=0.
result: outputs/metamath_gsm8k/Qwen--Qwen3.5-0.8B-Base__lora_xs__s2500__seed0/result.json

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-18 21:28:10 +08:00
wassname 12e13cca79 README: rot basis is within noise (seed order flips), soften V claim
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-18 03:32:01 +08:00
wassname 12fa56f328 Collapse antipasto family to one variant: rot(V) becomes canonical antipasto
main keeps a single antipasto = the rotation+delta SVD adapter (the published
method, paper 2601.07473), default rotate_basis=V. On GSM8K/down_proj rot(V)
led the family (57.2) and at a single seed nothing separated from it, while the
covariance-oriented arms cost 34-120s init for no gain. The full family (gain
core, U/both rotations, ablate, dplr, corda, asvd) is preserved on the
antipasto-variants branch.

- antipasto.py is now the rotation implementation, registered as "antipasto"
- delete antipasto_{rot,ablate,corda,asvd,dplr}.py + their config exports
- benchmark/justfile/cost_report/smoke: drop the removed variants + dead knobs
  (antipasto_coeff/suppress_only/ablate_k/cov_orient/lora_rank); keep
  --antipasto-rotate-basis as antipasto's V/U/both/none ablation axis
- README: subset table to one antipasto row, add rank column, note single-seed
  noise floor (~1.4pp), point the full family at the branch

smoke: 10 passed

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-17 21:05:51 +08:00
wassname 12109b6fc0 README: order variant table by test accuracy
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-17 18:26:45 +08:00
wassname 6cb350a4b6 README: fill IA3-FF row (56.3/62.0, 86k params, 0 added MACs)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-17 15:49:02 +08:00
wassname 4962bffd7d README: fill EVA + IA3 baseline rows
EVA 59.3/74.0 (28s SVD-warmstart init), IA3 52.3/62.0 (6k params, 0 added MACs).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-17 15:26:50 +08:00
wassname 7e024b4734 comment hygiene + HRA row: shorten docstrings, drop dead init branch, track asvd
- variant.py: fix mislabeled "legacy entry" (make() is the live param path); drop unused near_one init branch
- config.py: drop "replaces older LoraLiteConfig" history narration
- antipasto_ablate.py: aspirational "should warm-start" comment -> tracked FIXME
- antipasto_rot.py: cut "kept as separate variant" / "why antipasto dropped rotation" ramble
- benchmark: merge duplicate antipasto/corda/asvd cfg branch
- README: fill HRA row (test 59.2 / valid 70.0)
- track antipasto_asvd.py (was imported+registered but uncommitted)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-17 11:16:07 +08:00
wassname 5f9d90d8b8 benchmark sweep: rot(U/both) ablation, whitening conclusion, cost rows
- antipasto_rot: add rotate_basis="both" (independent V+U Cayley rotations),
  run_id suffix __rotU/__rotboth so ablation arms get their own output dirs
- justfile: thread rotate_basis through bench-variant
- corda/eva: padding-mask fix in calibration capture + bf16-tight residual
- README: fill PiSSA/DoRA/CorDA/ASVD/ablate/dplr/rot rows; record the
  metric-axis ablation (C=I 56.0 > diag-C 55.6 > full-C 54.7) and the
  rotation ablation (V 57.2 > U 56.5 > both 55.6) conclusions
- docs/reviews: external ref-checks + deepseek/gpt reviews of the cores

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-17 06:17:53 +08:00
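
For readers unfamiliar with the "Cayley rotations" named in the commit above, the following is a minimal sketch of the general technique, not the repository's implementation. It assumes the common convention R = (I - A)^{-1}(I + A) with A skew-symmetric. The names `skew` and `cayley` are hypothetical, and the code uses pure Python lists to stay dependency-free.

```python
def skew(t):
    """Build A = T - T^T, which is skew-symmetric for any square T."""
    n = len(t)
    return [[t[i][j] - t[j][i] for j in range(n)] for i in range(n)]


def cayley(a):
    """Map a skew-symmetric matrix `a` to an orthogonal matrix.

    Assumed convention: R = (I - a)^{-1} (I + a), computed by Gauss-Jordan
    elimination on the augmented system [I - a | I + a].
    """
    n = len(a)
    aug = [
        [(1.0 if i == j else 0.0) - a[i][j] for j in range(n)]
        + [(1.0 if i == j else 0.0) + a[i][j] for j in range(n)]
        for i in range(n)
    ]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    # The left half is now I, so the right half holds (I - a)^{-1} (I + a).
    return [row[n:] for row in aug]


# Hypothetical free parameter; only its skew-symmetric part matters.
t = [[0.0, 0.3, -0.1], [0.0, 0.0, 0.2], [0.0, 0.0, 0.0]]
rot = cayley(skew(t))

# R R^T should be the identity up to floating-point error.
rrt = [[sum(rot[i][k] * rot[j][k] for k in range(3)) for j in range(3)] for i in range(3)]

# A zero parameter gives the identity rotation, i.e. no change at init.
identity_at_init = cayley(skew([[0.0] * 3 for _ in range(3)]))
```

Under this convention each forward pass needs one linear solve per rotation, and an all-zero parameter maps to the identity.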
wassname e8ca6f5944 README: validation framing per wassname's wording; arrow large-block lr=1e-4
README: 'we validate the same way PEFT does; trained properly they clear 49% on
GSM8K, all pass' + link to the benchmark script.

justfile: arrow with block>8 uses lr=1e-4 not 5e-3. The 5e-3 that suits the tiny
S-space gain destabilizes the large dense block -- block=128 at 5e-3 scored 45.7%
(below the bar, vs block=8's 60.5%). Capacity sweep requeued at LoRA's 1e-4 to
de-confound params-vs-lr.

Co-Authored-By: Claudypoo <noreply@anthropic.com>
2026-06-15 18:27:33 +08:00
wassname 6b7b3a47dd README: frame the GSM8K table as a validation harness, not a leaderboard
The point is that every adapter clears PEFT's ~48% LoRA bar on the same
MetaMathQA->GSM8K protocol -- that all rows pass is the it-trains signal,
not a competitive ranking.

Co-Authored-By: Claudypoo <noreply@anthropic.com>
2026-06-15 18:20:53 +08:00
wassname 6ab1dfff0e README: antipasto variants as table rows; real PEFT reference
- Fold the family into the main Variants table as rows (CorDA/ablate/arrow)
  instead of a separate table.
- Lead with the point (freeze W's SVD, learn only a bounded gain -> interpretable,
  O(r) params) before any numbers.
- Replace the unsourced 'PEFT reports 49.0%' line (wrong; LoRA is ~48%) with a
  real link to PEFT's method_comparison/MetaMathQA and a pointer to the benchmark
  script for hyperparameters. Link CorDA/Arditi papers inline.

Co-Authored-By: Claudypoo <noreply@anthropic.com>
2026-06-15 18:18:09 +08:00
wassname fa69e0cac3 README: trim AntiPaSTO section for researcher audience
Replace the per-experiment family breakdown table + comparison prose with a
2-sentence method description (frozen interpretable SVD basis, O(r) gain, the
three variant cores). Experiment findings (rotation comparison, arrow capacity,
cost/timing) belong in the research journal, not the README skim path.

Co-Authored-By: Claudypoo <noreply@anthropic.com>
2026-06-15 18:12:31 +08:00
wassname 90b5199ed9 README: AntiPaSTO family GSM8K results (5 variants, r=256)
Replace the stale single AntiPaSTO row (was 35.8K params from the removed
rotation version, described block-Cayley which no longer exists) with the
real 5000-step Qwen3-0.6B numbers and a family breakdown:

  corda  61.9% 14.3K  (best: covariance-oriented basis)
  plain  61.4% 14.3K
  rot    61.4% 35.8K  (the rotation this replaces)
  ablate 61.0% 14.4K
  arrow  60.5% 17.5K

Headline: ~320x fewer trainable params than LoRA at ~97% of its accuracy.
Rotation buys nothing (rot matches plain to 3 s.f. at 2.5x params, +20%
wall-time, plus a per-forward Cayley solve), confirming the drop.

Co-Authored-By: Claudypoo <noreply@anthropic.com>
2026-06-15 07:05:45 +08:00
wassname 072a816cee docs: fix hallucinated arxiv links in variants table
AntiPaSTO, EVA, and HRA pointed at unrelated papers (stock prediction,
LLM-vs-lawyer study, 2D Ising model). Replaced with verified IDs.

Co-Authored-By: Claudypoo <claudypoo@noreply.invalid>
2026-05-26 05:48:49 +08:00
wassname b698331cfa feat: add HRA benchmark result (61.6%), update README table 2026-04-27 20:07:19 +08:00
wassname e624cd244f feat: near_zero/near_one init for trainable params (breaks bf16 dead-grad symmetry)
Trainable params that were init'd at exact 0 or 1 now use near_zero (N(0,1e-4))
or near_one (1 + N(0,1e-4)) to break bf16 symmetry without meaningfully
breaking identity-at-t=0. Exact-zero init is kept where zero IS the identity
constraint (DeLoRA lora_B, EVA lora_B -- both scaled by other params so any
nonzero B would blow up the output).

AntiPaSTO: delta_s and rot_T now near_zero. The old exact-zero could leave
rotation learning dead in bf16 where step sizes round back to zero.

IA3: lora_g now near_one instead of exact ones. Avoids the bf16 spacing issue
around 1.0 where eps_bf16 ~ 7.8e-3 and lr=1e-3 updates were rounding away.

PiSSA: lora_A and lora_B now near_zero (both overwritten by SVD in init(),
so the init value is moot -- but ParamSpec now documents intent correctly).

HRA: lora_U now near_zero (overwritten by symmetric init in init()).

ParamSpec: added 'near_zero' and 'near_one' init modes. Default changed from
'zeros' to 'near_zero'. Tests relaxed identity tolerances accordingly.
2026-04-27 15:55:05 +08:00
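
The bf16 spacing argument in the commit above can be checked with a small stdlib-only emulation. This is a sketch under stated assumptions, not code from the repository: `to_bf16` is a hypothetical helper that rounds a float32 to the nearest bfloat16 (ties to even), and the gradient values are made up for illustration.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even), returned as a float.

    Emulation only: bfloat16 keeps the top 16 bits of an IEEE float32.
    NaN/inf handling is deliberately omitted.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round-to-nearest-even on the dropped 16 bits
    bits &= 0xFFFF0000                   # keep sign, 8 exponent bits, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# Gap between 1.0 and the next bf16 value above it: 2**-7 = 0.0078125.
spacing_at_one = to_bf16(1.0 + 2**-7) - 1.0

# One SGD-style step on a weight stored at exactly 1.0
# (lr=1e-3, hypothetical grad=1.0): the result rounds back to 1.0.
w = 1.0
stepped = to_bf16(w - 1e-3 * 1.0)
stalled = stepped == w

# Spacing scales with magnitude: near 1e-4 a 1e-6 update still lands
# on a different bf16 value (hypothetical grad=1e-3).
small = to_bf16(1e-4)
moved = to_bf16(small - 1e-3 * 1e-3) != small
```

Here `stalled` and `moved` both evaluate to `True`: an lr=1e-3 step is lost on a weight stored at exactly 1.0, while a much smaller step near 1e-4 still changes the stored value.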
wassname a342801807 wip 2026-04-27 11:24:19 +08:00
wassname b60a8c3f9b readme 2026-04-27 09:46:52 +08:00
wassname bb8887e66c tidy 2026-04-27 07:12:56 +08:00
wassname b179771cc6 tyro and benchmark 2026-04-27 06:23:30 +08:00
wassname 0d929f93b3 feat(hra): add Householder Reflection Adaptation, hook-only/bnb-friendly + Qwen proof 2026-04-26 17:58:56 +08:00
wassname 2abf616be6 feat(dora): add weight-decomposed LoRA variant for fp layers 2026-04-26 17:53:33 +08:00
wassname 699fde31bf feat: ia3 variant, real bnb 4bit/8bit smoke, dev guide split, user-only readme 2026-04-26 17:49:17 +08:00
wassname f2d9021511 ci: add publishable check workflow 2026-04-26 17:09:47 +08:00
wassname 69bf5f4e44 test: prove adapter training paths 2026-04-26 17:00:39 +08:00
wassname 4db5cee5a9 init 2026-04-26 14:10:20 +08:00