- delete _road_matrix in variants/road.py (zero callers)
- drop redundant callable(m) clause in is_linear_like (every nn.Module is callable)
- remove try/except in current_git_commit so missing git crashes loudly
instead of writing "unknown" into the results TSV
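A minimal sketch of the fail-loud behaviour, assuming current_git_commit shells out to `git rev-parse HEAD` (the actual command and the `cwd` parameter are not shown in this commit and are assumptions):

```python
import subprocess

def current_git_commit(cwd=None):
    """Return the HEAD commit hash; raise if git is missing or cwd is not a repo."""
    # No try/except on purpose: FileNotFoundError (git not installed) and
    # CalledProcessError (not a repository) propagate to the caller instead
    # of being replaced by a placeholder string.
    out = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=cwd, check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()
```

With this shape, a results row is only ever written with a real hash; a broken environment stops the run before anything is recorded.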
Co-Authored-By: Claudypoo <noreply@anthropic.com>
Score each singular dimension by S[i] * mean|X @ Vh[i]| (weight magnitude
times activation magnitude), then pick top-r by joint score instead of top-r
by S alone. Keeps the weight-SVD basis; only reorders which r dimensions are
retained based on real input activations.
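The joint score can be sketched in NumPy as follows. This is an illustration only: the function name, the (out, in) weight layout and the (n, in) activation layout are assumptions, not the repo's code.

```python
import numpy as np

def top_r_by_joint_score(W, X, r):
    """Pick r singular directions of W ranked by S[i] * mean|X @ Vh[i]|."""
    U, S, Vh = np.linalg.svd(W, full_matrices=False)  # W = U @ diag(S) @ Vh
    act = np.abs(X @ Vh.T).mean(axis=0)               # mean |X @ Vh[i]| per direction
    score = S * act                                   # weight magnitude * activation magnitude
    keep = np.argsort(-score)[:r]                     # top-r by joint score, not by S alone
    return U[:, keep], S[keep], Vh[keep], keep

# Largest singular value sits on a direction the inputs never excite.
W = np.diag([3.0, 2.0, 1.0])
X = np.array([[0.0, 1.0, 5.0], [0.0, 1.0, 5.0]])
_, S_kept, _, keep = top_r_by_joint_score(W, X, r=2)
```

Here top-r by S alone would keep directions 0 and 1, while the joint score keeps 2 and 1, because direction 0 sees no activation.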
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Weight-SVD init (PiSSA-style) kept as fallback; when calibration_data is
provided, group_init() collects pre-hook activations, SVDs the pooled inputs
per layer, and re-decomposes W_orig through the top-r input-PCA directions.
Vhr_final = Vh_A @ Vhr_new keeps rows orthonormal while preserving the
input-aligned span.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- _base_weight_fingerprint was a PiSSA-only defensive check that cluttered
every save with a per-target SHA256. Loading onto the wrong base gives
wrong weights -- that's user error, not a library bug.
- test_lora_lite.py deleted. All coverage lives in test_metamath_smoke.py
which runs the real benchmark pipeline per variant.
Trainable params that were init'd at exact 0 or 1 now use near_zero (N(0,1e-4))
or near_one (1 + N(0,1e-4)) to break bf16 symmetry without meaningfully
breaking identity-at-t=0. Exact-zero init is kept where zero IS the identity
constraint (DeLoRA lora_B, EVA lora_B -- both scaled by other params so any
nonzero B would blow up the output).
AntiPaSTO: delta_s and rot_T now near_zero. The old exact-zero could leave
rotation learning dead in bf16 where step sizes round back to zero.
IA3: lora_g now near_one instead of exact ones. Avoids the bf16 spacing issue
around 1.0 where eps_bf16 ~ 7.8e-3 and lr=1e-3 updates were rounding away.
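The spacing issue around 1.0 can be reproduced without any tensor library. The helper below is a hypothetical stdlib emulation of bf16 rounding (round-to-nearest-even on the top 16 bits of a float32), not code from the repo:

```python
import struct

def to_bf16(x):
    """Round a Python float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]   # float32 bit pattern
    bits += 0x7FFF + ((bits >> 16) & 1)                   # round to nearest, ties to even
    bits &= 0xFFFF0000                                    # keep only the top 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

w = 1.0
for _ in range(100):
    w = to_bf16(w + 1e-3)   # each step is below half the bf16 spacing at 1.0

next_above_one = to_bf16(1.0 + 2.0 ** -7)   # first representable value above 1.0
small_from_zero = to_bf16(1e-3)             # same-size step taken from 0.0
```

After the loop `w` is still exactly 1.0, while the same 1e-3 step taken from 0.0 survives rounding, which is why the problem is specific to values stored near 1.0.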
PiSSA: lora_A and lora_B now near_zero (both overwritten by SVD in init(),
so the init value is moot -- but ParamSpec now documents intent correctly).
HRA: lora_U now near_zero (overwritten by symmetric init in init()).
ParamSpec: added 'near_zero' and 'near_one' init modes. Default changed from
'zeros' to 'near_zero'. Tests relaxed identity tolerances accordingly.
DeLoRA (variants/delora.py):
lora_B is now initialized to zeros instead of kaiming, matching peft
(docs/refs/peft_delora_layer.py:139).
With B=0 the t=0 delta is zero regardless of lambda, so identity holds with
the peft default lambda0=15 instead of needing the lambda0=0 hack.
HRA (variants/hra.py):
forward_input loop reversed: now applies x @ H_{r-1} ... H_0 = x @ R^T so
the base layer computes x R^T W^T = F.linear(x, W @ R), matching peft. The
bug was masked by paired-symmetry init (R = R^T at t=0) but would corrupt
any non-symmetric U.
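A NumPy sketch of the ordering argument, assuming R = H_0 @ H_1 @ ... @ H_{r-1} with each H_i a Householder reflection built from column i of U (helper names are illustrative, and `x @ (W @ R).T` stands in for F.linear(x, W @ R)):

```python
import numpy as np

def householder(u):
    u = u / np.linalg.norm(u)
    return np.eye(u.size) - 2.0 * np.outer(u, u)   # symmetric reflection

def forward_input(x, U):
    """Apply x @ H_{r-1} ... H_0, i.e. x @ R.T with R = H_0 @ ... @ H_{r-1}."""
    for i in reversed(range(U.shape[1])):
        x = x @ householder(U[:, i])
    return x

rng = np.random.default_rng(0)
d, r = 6, 4
U = rng.normal(size=(d, r))      # no paired symmetry, so R != R.T
W = rng.normal(size=(3, d))
x = rng.normal(size=(2, d))

R = np.eye(d)
for i in range(r):
    R = R @ householder(U[:, i])  # R = H_0 @ H_1 @ ... @ H_{r-1}

adapted = forward_input(x, U) @ W.T
reference = x @ (W @ R).T         # same computation as F.linear(x, W @ R)
```

Because each H_i is symmetric, R.T is the product in reverse order, so the reversed loop is the one that agrees with the merged weight W @ R.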
EVA (variants/eva.py):
lora_A is now a trainable Parameter (peft semantics): SVD only changes the
init. group_init still copies the SVD basis but under a no_grad guard.
AntiPaSTO (variants/antipasto.py):
docstring now references arxiv.org/pdf/2601.07473 and
github.com/wassname/AntiPaSTO so V4 review NO_REFERENCE flag is resolved.
qwen probe (scripts/qwen_train_probe.py):
perturb_first_adapter walks priority list including lora_U (HRA) and
lora_A (EVA, LoRA-style A-trainable variants) so HRA tests no longer raise
'no perturbable adapter parameter found'.
smoke (tests/smoke.py):
+ hra_forward_order_smoke: distinguishing check that compares adapted output
to F.linear(x, W @ R) with paired symmetry broken; would fail under the
forward-iter bug.
+ EVA assert lora_A.requires_grad == True per layer.
- DeLoRA bnb moved to bnb_skip (fp16 + B=0 + clamp(min=1e-4) overflow makes
grad NaN; real bnb usage needs dequant).
delora train still uses lambda0=0.1 because the peft default of 15.0
diverges under Adam (lr=1e-1) within 20 steps.
V3 external review (docs/audit/variants_review_v3.md, 97KB) found 3
must-fix bugs.
DoRA: bias was being scaled by m/||V|| because we operated on the full
base layer output. Now subtract bias before normalization, add back
after. Matches peft DoRA exactly (docs/refs/peft_lora_dora.py:157-161).
New smoke dora_bias_smoke verifies identity at t=0 with bias=True.
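The bias handling can be sketched in NumPy as below. This is an illustration under assumptions: weights are laid out (out, in), the norm is taken per output row, and `dora_forward` and its arguments are hypothetical names rather than the repo's API.

```python
import numpy as np

def dora_forward(x, W, bias, delta_W, m):
    base_out = x @ W.T + bias                  # what the wrapped base layer returns
    V = W + delta_W                            # adapted direction matrix
    scale = m / np.linalg.norm(V, axis=1)      # per-output-channel m / ||V||
    no_bias = base_out - bias                  # strip the bias before scaling
    return scale * (no_bias + x @ delta_W.T) + bias   # add the bias back unscaled

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 5))
bias = rng.normal(size=4)
x = rng.normal(size=(3, 5))
delta_W = 0.1 * rng.normal(size=(4, 5))
m = rng.uniform(0.5, 2.0, size=4)

out = dora_forward(x, W, bias, delta_W, m)
```

The result matches a plain linear layer with weight `scale[:, None] * (W + delta_W)` and the original, unscaled bias.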
EVA load: adapter.load() called attach(), which called group_init(), which
required calibration_data and raised. Added a _skip_group_init flag to
attach(); load() passes it. EVA group_init still raises loudly when
called directly without data. New smoke verifies save+load WITHOUT
calibration data on load path.
Also tightened EVA error message.
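The control flow of the load fix, reduced to a toy class. Everything except the `_skip_group_init` flag name is hypothetical, and the real attach()/load() signatures may differ:

```python
class EvaSketch:
    def __init__(self, calibration_data=None):
        self.calibration_data = calibration_data
        self.state = None

    def group_init(self):
        # Still fails loudly when called directly without data.
        if self.calibration_data is None:
            raise ValueError("EVA group_init requires calibration_data")
        self.state = {"lora_A": "svd-of-activations"}

    def attach(self, _skip_group_init=False):
        if not _skip_group_init:
            self.group_init()
        return self

    def load(self, saved_state):
        # Saved tensors overwrite the init anyway, so skip the data-driven step.
        self.attach(_skip_group_init=True)
        self.state = dict(saved_state)
        return self

loaded = EvaSketch().load({"lora_A": [1.0, 2.0]})
```

The flag keeps the strict behaviour on the training path while letting a saved adapter be restored with no calibration data at hand.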
Smoke now covers 8 variants + EVA roundtrip + DoRA-bias roundtrip + bnb
4/8-bit. ALL PASS.
V3 nice-to-haves (PiSSA scaling, AntiPaSTO init choice, stale GH refs)
deferred -- documented as intentional in module docstrings.
DeLoRA: per-input-channel wnorm buffer (not scalar Parameter), forward
matches peft (x*wnorm @ A.T then per-rank scale (lambda/r)/(An*Bn)).
Smoke: 89.7% loss drop (was 35.8%).
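A NumPy sketch of the forward described above. It assumes the scaled activations are finally projected through B, that An and Bn are clamped at 1e-4 as mentioned elsewhere in this log, and that W is laid out (out, in); `delora_delta` is an illustrative name.

```python
import numpy as np

def delora_delta(x, W, A, B, lam, eps=1e-4):
    r = A.shape[0]
    wnorm = np.linalg.norm(W, axis=0)                  # per-input-channel norm, shape (in,)
    An = np.maximum(np.linalg.norm(A, axis=1), eps)    # row norms of A, shape (r,)
    Bn = np.maximum(np.linalg.norm(B, axis=0), eps)    # column norms of B, shape (r,)
    h = (x * wnorm) @ A.T                              # x*wnorm @ A.T
    h = h * (lam / r) / (An * Bn)                      # per-rank scale (lambda/r)/(An*Bn)
    return h @ B.T

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
A = rng.normal(size=(2, 6))
x = rng.normal(size=(3, 6))
delta_at_init = delora_delta(x, W, A, np.zeros((4, 2)), lam=15.0)
```

With B at zero the delta is exactly zero for any lambda, which is the identity-at-t=0 property; the scalar W.norm() variant would replace `wnorm` with a single number and lose the per-channel weighting.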
HRA: symmetric repeated-column init (PEFT-style) instead of zero gate.
Adjacent Householder pairs cancel exactly so R=I at t=0, and U receives
gradient from step 0 (no dead-grad). Even r required.
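Why adjacent pairs cancel: a Householder reflection is its own inverse, so two identical neighbours multiply to the identity. A small NumPy check, with illustrative helper names and an assumed layout in which columns 2k and 2k+1 of U are equal:

```python
import numpy as np

def householder(u):
    u = u / np.linalg.norm(u)
    return np.eye(u.size) - 2.0 * np.outer(u, u)

def paired_init(d, r, rng):
    """Repeat each random column twice; r must be even."""
    if r % 2 != 0:
        raise ValueError("even r required")
    half = rng.normal(size=(d, r // 2))
    return np.repeat(half, 2, axis=1)     # columns 2k and 2k+1 are identical

rng = np.random.default_rng(0)
U = paired_init(d=6, r=4, rng=rng)

R = np.eye(6)
for i in range(U.shape[1]):
    R = R @ householder(U[:, i])          # H_0 H_0 H_2 H_2 ... = I
```

U itself is nonzero, so unlike a zero gate the reflections are well defined from the first step and can receive gradient.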
IA3: split into two variants. ia3 stays output-side (k_proj/v_proj);
new ia3_ff is input-side (down_proj/fc2), matching peft is_feedforward.
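The two gating placements differ only in where the learned vector multiplies. A NumPy sketch with illustrative function names, assuming weights laid out (out, in):

```python
import numpy as np

def ia3_output_side(x, W, g):
    """Gate the layer output: g has one entry per output feature."""
    return (x @ W.T) * g

def ia3_input_side(x, W, g):
    """Gate the layer input: g has one entry per input feature."""
    return (x * g) @ W.T

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
x = rng.normal(size=(3, 6))
g_out = rng.normal(size=4)
g_in = rng.normal(size=6)

y_out = ia3_output_side(x, W, g_out)   # equivalent to scaling rows of W
y_in = ia3_input_side(x, W, g_in)      # equivalent to scaling columns of W
```

Output-side gating rescales rows of W and input-side gating rescales columns, so the two variants need gate vectors of different lengths and are not interchangeable on the same target.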
Config: dropout field removed (never honored by any variant).
PiSSA: adapter.save records base-weight fingerprint per target;
adapter.load recomputes init then verifies fingerprint -> fails loud
when reloaded onto a different base.
EVA (new): data-driven init via group_init + calibration_data. Top-r
right singular vectors of pooled layer-input activations -> lora_A
(buffer, frozen); only lora_B trains. Stress-tests group_init API.
AntiPaSTO (new): SVD steering with frozen U,S,Vh,W_res and learnable
delta_s (per-singular-value bias) + rot_T (block-diagonal Cayley
rotation on V or U). Lite port of antipasto3 SVD adapter.
ParamSpec: as_buffer field + make_tensor() for buffer registration.
adapter.attach honors as_buffer with register_buffer; detach cleans
both _parameters and _buffers.
Smoke covers all 8 variants: identity at t=0, save/load round-trip,
gradient-driven loss drop. EVA gets dedicated test for calibration
data path. ALL PASS including bnb 4/8-bit path.
- Fetch canonical reference impls for offline review:
* peft_{lora,hra,delora,ia3}_layer.py + peft_lora_{dora,variants}.py
* orig_pissa_init.py (MuLabPKU/PiSSA)
* orig_hra_layer.py (DaShenZi721/HRA)
* orig_delora.py (ExplainableML/DeLoRA author fork)
- Add reference-impl URLs to all 6 variant docstrings
- Document HRA gate=0 dead-grad issue and DoRA detach-omission in their docstrings
- Re-run external review (codex) with refs available -> docs/audit/variants_review_v2.md
Major NEW findings vs paper-only review:
* DeLoRA: scalar W.norm() should be per-input-channel norm(dim=0)
* HRA: PEFT uses symmetric repeated-column init (no dead grad), not zero gate
* IA3: FFN targets need input-side gating, not output-side; our up_proj advice was wrong
* All LoRA-family: cfg.dropout silently ignored (no-op)
* DeLoRA: wnorm should be persistent buffer, not Parameter
HRA and DeLoRA upgraded to BUGGY (from Partial)