fix: 5 V4 must-fix bugs (DeLoRA B-init, HRA forward order, EVA A trainable, AntiPaSTO refs, qwen probe)

DeLoRA (variants/delora.py): lora_B init zeros not kaiming, matching peft (docs/refs/peft_delora_layer.py:139). With B=0 the t=0 delta is zero regardless of lambda, so identity holds with the peft default lambda0=15 instead of needing the lambda0=0 hack.

HRA (variants/hra.py): forward_input loop reversed: now applies x @ H_{r-1} ... H_0 = x @ R^T so the base layer computes x R^T W^T = F.linear(x, W @ R), matching peft. The bug was masked by paired-symmetry init (R = R^T at t=0) but would corrupt any non-symmetric U.

EVA (variants/eva.py): lora_A is now a trainable Parameter (peft semantics): SVD only changes the init. group_init still copies the SVD basis but under a no_grad guard.

AntiPaSTO (variants/antipasto.py): docstring now references arxiv.org/pdf/2601.07473 and github.com/wassname/AntiPaSTO so V4 review NO_REFERENCE flag is resolved.

qwen probe (scripts/qwen_train_probe.py): perturb_first_adapter walks priority list including lora_U (HRA) and lora_A (EVA, LoRA-style A-trainable variants) so HRA tests no longer raise 'no perturbable adapter parameter found'.

smoke (tests/smoke.py):
+ hra_forward_order_smoke: distinguishing check that compares adapted output to F.linear(x, W @ R) with paired symmetry broken; would fail under the forward-iter bug.
+ EVA assert lora_A.requires_grad == True per layer.
- DeLoRA bnb moved to bnb_skip (fp16 + B=0 + clamp(min=1e-4) overflow makes grad NaN; real bnb usage needs dequant).
delora train still uses lambda0=0.1 because peft default 15.0 explodes Adam lr=1e-1 in 20 steps.
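
The HRA ordering claim above can be checked numerically. The following is a minimal sketch, not the code in variants/hra.py: it assumes R = H_0 H_1 ... H_{r-1} with each H_i a plain Householder reflection, and the names U, W, x and householder are hypothetical stand-ins. With random (non-paired) vectors, only the reversed loop reproduces F.linear(x, W @ R).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_out, r = 6, 4, 3

# Hypothetical stand-ins: U holds r Householder vectors as columns,
# W plays the role of the frozen base weight, x is a small batch.
U = torch.randn(d_in, r, dtype=torch.float64)
W = torch.randn(d_out, d_in, dtype=torch.float64)
x = torch.randn(2, d_in, dtype=torch.float64)


def householder(u: torch.Tensor) -> torch.Tensor:
    """Return the symmetric reflection I - 2 u u^T / (u^T u)."""
    u = u / u.norm()
    return torch.eye(u.numel(), dtype=u.dtype) - 2.0 * torch.outer(u, u)


H = [householder(U[:, i]) for i in range(r)]

# Assumed convention: R = H_0 H_1 ... H_{r-1}.
R = torch.eye(d_in, dtype=torch.float64)
for h in H:
    R = R @ h

# Reversed loop: x @ H_{r-1} ... H_0 = x @ R^T (each H_i is symmetric).
x_rev = x
for h in reversed(H):
    x_rev = x_rev @ h

# Forward-order loop (the ordering described as the bug): x @ H_0 ... H_{r-1} = x @ R.
x_fwd = x
for h in H:
    x_fwd = x_fwd @ h

target = F.linear(x, W @ R)   # x (W R)^T = x R^T W^T
out_rev = F.linear(x_rev, W)  # x R^T W^T
out_fwd = F.linear(x_fwd, W)  # x R W^T, differs whenever R != R^T

print("reversed loop matches:", torch.allclose(out_rev, target))
print("forward loop matches: ", torch.allclose(out_fwd, target))
```

Because every H_i is symmetric, the two orderings only coincide when R itself is symmetric, which is why a paired-symmetry init would hide the difference.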
2026-06-27 16:15:50 +08:00 · 2026-04-26 20:57:24 +08:00
parent 053901e0ca
commit 67a6daf6aa
6 changed files with 134 additions and 40 deletions
@@ -48,26 +48,32 @@ def assert_no_base_grads(model: torch.nn.Module) -> None:


 def perturb_first_adapter(model: torch.nn.Module) -> None:
-    for name, p in model.named_parameters():
-        if "lora_lambda" in name:
-            with torch.no_grad():
-                p.add_(0.25)
-            return
-    for name, p in model.named_parameters():
-        if "lora_gate" in name:
-            with torch.no_grad():
-                p.add_(0.25)
-            return
-    for name, p in model.named_parameters():
-        if "lora_B" in name:
-            with torch.no_grad():
-                p.flatten()[0].add_(0.25)
-            return
-    for name, p in model.named_parameters():
-        if "lora_g" in name:
-            with torch.no_grad():
-                p.flatten()[0].add_(0.25)
-            return
+    """Nudge one trainable adapter parameter so forward output changes.
+
+    Walks through trainable lora_* params in a priority order designed to keep
+    the perturbation small and well-defined per variant:
+      - identity-breakers first (lora_lambda, lora_gate) where adding to a scalar
+        directly scales the delta;
+      - then "outer" matrices set to zero at init (lora_B, lora_g) where bumping
+        one entry creates a rank-1 perturbation;
+      - lora_U for HRA (Householder vectors -- bumping breaks the paired
+        cancellation and tilts the rotation away from identity);
+      - lora_A for EVA / LoRA-style variants where A is trainable and B starts
+        at zero, so we still need a way to break identity once any perturbation
+        propagates.
+    """
+    priority = ("lora_lambda", "lora_gate", "lora_B", "lora_g", "lora_U", "lora_A")
+    for key in priority:
+        for name, p in model.named_parameters():
+            if not p.requires_grad:
+                continue
+            if key in name:
+                with torch.no_grad():
+                    if p.ndim == 0:
+                        p.add_(0.25)
+                    else:
+                        p.flatten()[0].add_(0.25)
+                return
    raise AssertionError("no perturbable adapter parameter found")
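
The priority walk in the hunk above can be exercised on toy modules. This is an illustrative sketch only: ToyHRA and ToyGated are hypothetical classes that merely carry parameters with the right names, and the function body is copied from the added lines so the block runs on its own.

```python
import torch

PRIORITY = ("lora_lambda", "lora_gate", "lora_B", "lora_g", "lora_U", "lora_A")


def perturb_first_adapter(model: torch.nn.Module) -> None:
    """Same walk as the patched function: first trainable match in priority order."""
    for key in PRIORITY:
        for name, p in model.named_parameters():
            if not p.requires_grad:
                continue  # frozen params are skipped even if the name matches
            if key in name:
                with torch.no_grad():
                    if p.ndim == 0:
                        p.add_(0.25)               # scalar parameter
                    else:
                        p.flatten()[0].add_(0.25)  # bump a single entry
                return
    raise AssertionError("no perturbable adapter parameter found")


class ToyHRA(torch.nn.Module):
    """Hypothetical: a frozen lora_lambda plus a trainable lora_U."""

    def __init__(self) -> None:
        super().__init__()
        self.lora_lambda = torch.nn.Parameter(torch.tensor(1.0), requires_grad=False)
        self.lora_U = torch.nn.Parameter(torch.ones(2, 2))


class ToyGated(torch.nn.Module):
    """Hypothetical: a 0-dim trainable gate plus a zero-initialised lora_B."""

    def __init__(self) -> None:
        super().__init__()
        self.lora_gate = torch.nn.Parameter(torch.tensor(0.0))
        self.lora_B = torch.nn.Parameter(torch.zeros(2, 2))


hra = ToyHRA()
perturb_first_adapter(hra)    # skips frozen lora_lambda, bumps lora_U[0, 0]

gated = ToyGated()
perturb_first_adapter(gated)  # lora_gate outranks lora_B, so only the gate moves

try:
    perturb_first_adapter(torch.nn.Linear(2, 2))  # no lora_* params at all
    raised = False
except AssertionError:
    raised = True
```

Only one parameter is touched per call, so a probe can attribute any change in the forward output to that single nudge.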