mirror of
https://github.com/wassname/lora-lite.git
synced 2026-06-27 18:05:16 +08:00
d0b4c52740
- Two acpx external reviews (codex + opencode): * docs/audit/variants_review.md: per-variant paper-vs-impl audit * docs/audit/design_review.md: peft EVA / baukit / antipasto3 vs lora-lite * docs/audit/SUMMARY.md: aggregate verdicts + 3 risks + 5 follow-ups - docs/refs/: peft_eva.py, peft_eva_finetuning.py, baukit_nethook.py, antipasto3_svd_adapter.py for offline reference Findings: LoRA clean; PiSSA/DoRA/IA3/HRA/DeLoRA have documented partial deviations. Top risks: init/grad tradeoffs hidden by coarse tests; qwen probe lacks strict identity tol; IA3 target placement untested.
781 lines
31 KiB
Markdown
781 lines
31 KiB
Markdown
[client] initialize (running)
|
||
|
||
[client] session/new (running)
|
||
|
||
[thinking] This is a detailed design review request. I need to: 1. Read the lora-lite codebase to understand its current structure 2. Read the three reference implementations 3. Answer specific questions for each Let me start by exploring the codebase structure and reading the key files.
|
||
|
||
[tool] glob (pending)
|
||
input: {}
|
||
|
||
[tool] glob (pending)
|
||
input: {}
|
||
|
||
[tool] glob (completed)
|
||
kind: search
|
||
input: src/lora_lite/**/*.py
|
||
output:
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/adapter.py
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/lora.py
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/delora.py
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/hra.py
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/dora.py
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/ia3.py
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/__init__.py
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/pissa.py
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/target.py
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variant.py
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/__init__.py
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/config.py
|
||
|
||
[tool] glob (completed)
|
||
kind: search
|
||
input: docs/refs/*.py
|
||
output:
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/antipasto3_svd_adapter.py
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/baukit_nethook.py
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva_finetuning.py
|
||
/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva.py
|
||
|
||
[thinking] I need to read all these files to do a proper comparative analysis. Let me read the key lora-lite files first (adapter.py, variant.py, variants/lora.py), then the reference files.
|
||
|
||
[tool] read (pending)
|
||
input: {}
|
||
|
||
[tool] read (pending)
|
||
input: {}
|
||
|
||
[tool] src/lora_lite/adapter.py (completed)
|
||
kind: read
|
||
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/adapter.py
|
||
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/adapter.py
|
||
output:
|
||
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/adapter.py</path>
|
||
<type>file</type>
|
||
<content>
|
||
1: """attach / detach / save / load. The whole runtime."""
|
||
2: from __future__ import annotations
|
||
3: import torch
|
||
4: from torch import nn
|
||
5: from torch.utils.hooks import RemovableHandle
|
||
6:
|
||
7: from .config import LoraLiteConfig
|
||
8: from .variant import REGISTRY
|
||
9: from .target import find_targets
|
||
10:
|
||
11:
|
||
12: _ATTACHED_ATTR = "_lora_lite_attached"
|
||
13:
|
||
14:
|
||
15: def _hook(layer, args, y):
|
||
16: (x,) = args
|
||
17: cfg: LoraLiteConfig = layer._lora_cfg
|
||
18: x_cast = x.to(cfg.dtype)
|
||
19: out = layer._lora_variant.forward(layer, x_cast, y)
|
||
20: return out.to(y.dtype)
|
||
21:
|
||
22:
|
||
23: def _pre_hook(layer, args):
|
||
24: (x,) = args
|
||
25: cfg: LoraLiteConfig = layer._lora_cfg
|
||
... (92 more lines)
|
||
|
||
[tool] read (pending)
|
||
input: {}
|
||
|
||
[tool] src/lora_lite/variant.py (completed)
|
||
kind: read
|
||
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variant.py
|
||
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variant.py
|
||
output:
|
||
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variant.py</path>
|
||
<type>file</type>
|
||
<content>
|
||
1: """Variant protocol + registry. Variants own (x, layer.weight, layer.lora_*) -> y_new."""
|
||
2: from dataclasses import dataclass
|
||
3: from typing import Callable, Protocol, Any
|
||
4: import torch
|
||
5: from torch import nn
|
||
6:
|
||
7: from .config import LoraLiteConfig
|
||
8:
|
||
9:
|
||
10: @dataclass
|
||
11: class ParamSpec:
|
||
12: shape: tuple[int, ...]
|
||
13: init: str | Callable[[torch.Tensor], None] = "zeros" # 'zeros'|'kaiming'|'ones'|callable(t)
|
||
14: trainable: bool = True
|
||
15:
|
||
16: def make(self, dtype: torch.dtype, device) -> nn.Parameter:
|
||
17: t = torch.empty(self.shape, dtype=dtype, device=device)
|
||
18: if callable(self.init):
|
||
19: self.init(t)
|
||
20: elif self.init == "zeros":
|
||
21: t.zero_()
|
||
22: elif self.init == "ones":
|
||
23: t.fill_(1.0)
|
||
24: elif self.init == "kaiming":
|
||
25: # match nn.Linear default: kaiming_uniform_(a=sqrt(5))
|
||
... (34 more lines)
|
||
|
||
[tool] src/lora_lite/variants/lora.py (completed)
|
||
kind: read
|
||
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/lora.py
|
||
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/lora.py
|
||
output:
|
||
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/lora.py</path>
|
||
<type>file</type>
|
||
<content>
|
||
1: """Vanilla LoRA. Hu et al. 2021 https://arxiv.org/abs/2106.09685
|
||
2:
|
||
3: h = W x + (alpha/r) B A x
|
||
4:
|
||
5: Identity at t=0 from B=0. Faithful to the paper.
|
||
6: """
|
||
7: from einops import einsum
|
||
8: from torch import nn
|
||
9: import torch
|
||
10:
|
||
11: from ..variant import register, ParamSpec
|
||
12:
|
||
13:
|
||
14: @register
|
||
15: class LoRA:
|
||
16: name = "lora"
|
||
17:
|
||
18: @staticmethod
|
||
19: def param_specs(d_in, d_out, cfg):
|
||
20: return {
|
||
21: "lora_A": ParamSpec((cfg.r, d_in), init="kaiming", trainable=True),
|
||
22: "lora_B": ParamSpec((d_out, cfg.r), init="zeros", trainable=True),
|
||
23: }
|
||
24:
|
||
25: @staticmethod
|
||
... (14 more lines)
|
||
|
||
[thinking] Now let me read the reference implementations to answer the questions.
|
||
|
||
[tool] read (pending)
|
||
input: {}
|
||
|
||
[tool] read (pending)
|
||
input: {}
|
||
|
||
[tool] docs/refs/peft_eva_finetuning.py (completed)
|
||
kind: read
|
||
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva_finetuning.py
|
||
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva_finetuning.py
|
||
output:
|
||
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva_finetuning.py</path>
|
||
<type>file</type>
|
||
<content>
|
||
1: # Copyright 2024-present the HuggingFace Inc. team.
|
||
2: #
|
||
3: # Licensed under the Apache License, Version 2.0 (the "License");
|
||
4: # you may not use this file except in compliance with the License.
|
||
5: # You may obtain a copy of the License at
|
||
6: #
|
||
7: # http://www.apache.org/licenses/LICENSE-2.0
|
||
8: #
|
||
9: # Unless required by applicable law or agreed to in writing, software
|
||
10: # distributed under the License is distributed on an "AS IS" BASIS,
|
||
11: # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||
12: # See the License for the specific language governing permissions and
|
||
13: # limitations under the License.
|
||
14:
|
||
15: import torch
|
||
16: from datasets import load_dataset
|
||
17: from torch.utils.data import DataLoader
|
||
18: from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
|
||
19: from utils import DataCollator, TokenizerMetaMath
|
||
20:
|
||
21: from peft import EvaConfig, LoraConfig, get_peft_model, initialize_lora_eva_weights
|
||
22:
|
||
23:
|
||
24: DEVICE = torch.accelerator.current_accelerator().type if hasattr(torch, "accelerator") else "cuda"
|
||
25:
|
||
... (74 more lines)
|
||
|
||
[tool] docs/refs/peft_eva.py (completed)
|
||
kind: read
|
||
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva.py
|
||
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva.py
|
||
output:
|
||
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/peft_eva.py</path>
|
||
<type>file</type>
|
||
<content>
|
||
1: # Copyright 2024-present the HuggingFace Inc. team.
|
||
2: #
|
||
3: # Licensed under the Apache License, Version 2.0 (the "License");
|
||
4: # you may not use this file except in compliance with the License.
|
||
5: # You may obtain a copy of the License at
|
||
6: #
|
||
7: # http://www.apache.org/licenses/LICENSE-2.0
|
||
8: #
|
||
9: # Unless required by applicable law or agreed to in writing, software
|
||
10: # distributed under the License is distributed on an "AS IS" BASIS,
|
||
11: # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||
12: # See the License for the specific language governing permissions and
|
||
13: # limitations under the License.
|
||
14:
|
||
15: import warnings
|
||
16: from collections import Counter, defaultdict
|
||
17: from collections.abc import Iterable, Mapping
|
||
18: from contextlib import nullcontext
|
||
19: from copy import deepcopy
|
||
20: from functools import partial
|
||
21: from itertools import cycle
|
||
22: from typing import Optional, Union
|
||
23:
|
||
24: import torch
|
||
25: import torch.distributed as dist
|
||
... (717 more lines)
|
||
|
||
[thinking] Now let me read the baukit and antipasto3 references.
|
||
|
||
[tool] read (pending)
|
||
input: {}
|
||
|
||
[tool] read (pending)
|
||
input: {}
|
||
|
||
[tool] docs/refs/baukit_nethook.py (completed)
|
||
kind: read
|
||
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/baukit_nethook.py
|
||
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/baukit_nethook.py
|
||
output:
|
||
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/baukit_nethook.py</path>
|
||
<type>file</type>
|
||
<content>
|
||
1: """
|
||
2: Utilities for instrumenting a torch model.
|
||
3:
|
||
4: Trace will hook one layer at a time.
|
||
5: TraceDict will hook multiple layers at once.
|
||
6: subsequence slices intervals from Sequential modules.
|
||
7: get_module, replace_module, get_parameter resolve dotted names.
|
||
8: set_requires_grad recursively sets requires_grad in module parameters.
|
||
9: """
|
||
10:
|
||
11: import contextlib
|
||
12: import copy
|
||
13: import inspect
|
||
14: from collections import OrderedDict
|
||
15:
|
||
16: import torch
|
||
17:
|
||
18:
|
||
19: class Trace(contextlib.AbstractContextManager):
|
||
20: """
|
||
21: To retain the output of the named layer during the computation of
|
||
22: the given network:
|
||
23:
|
||
24: with Trace(net, 'layer.name') as ret:
|
||
25: _ = net(inp)
|
||
... (449 more lines)
|
||
|
||
[tool] docs/refs/antipasto3_svd_adapter.py (completed)
|
||
kind: read
|
||
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/antipasto3_svd_adapter.py
|
||
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/antipasto3_svd_adapter.py
|
||
output:
|
||
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/docs/refs/antipasto3_svd_adapter.py</path>
|
||
<type>file</type>
|
||
<content>
|
||
1: """SVD adapter for bidirectional steering via block-diagonal Cayley rotations.
|
||
2:
|
||
3: Flax NNX port.
|
||
4:
|
||
5: kernel = U @ diag(S) @ Vh + W_res (kernel is (in, out), standard Flax convention)
|
||
6: Learnable: delta_s (additive S scaling), rotation_params (block-diagonal V rotation).
|
||
7: alpha scales both: S + alpha*delta_s, U @ R(alpha).
|
||
8:
|
||
9: Why Cayley (not Givens or matrix exponential):
|
||
10: Cayley gives exact analytical reversibility: R(-alpha) = R(alpha)^{-1}.
|
||
11: This is critical -- at alpha=+1 and alpha=-1 the adapter is an exact inverse of
|
||
12: itself, making bidirectional steering symmetric by construction.
|
||
13:
|
||
14: At alpha=0: U_rot = U and S_scaled = S, so the layer is identical to frozen weights.
|
||
15: """
|
||
16:
|
||
17: import math
|
||
18:
|
||
19: import jax
|
||
20: import jax.numpy as jnp
|
||
21: from jax import Array
|
||
22: from jaxtyping import Float, Int
|
||
23: from einops import rearrange
|
||
24: from flax import nnx
|
||
25: from loguru import logger
|
||
... (353 more lines)
|
||
|
||
[thinking] Now let me read the lora-lite config and variant protocol more carefully to understand the current API. Then I can compare it to the references.
|
||
|
||
[tool] read (pending)
|
||
input: {}
|
||
|
||
[tool] read (pending)
|
||
input: {}
|
||
|
||
[tool] src/lora_lite/config.py (completed)
|
||
kind: read
|
||
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/config.py
|
||
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/config.py
|
||
output:
|
||
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/config.py</path>
|
||
<type>file</type>
|
||
<content>
|
||
1: from dataclasses import dataclass, field, asdict
|
||
2: from typing import Any
|
||
3: import torch
|
||
4:
|
||
5:
|
||
6: @dataclass
|
||
7: class LoraLiteConfig:
|
||
8: variant: str = "lora"
|
||
9: r: int = 8
|
||
10: alpha: float = 16.0
|
||
11: dropout: float = 0.0 # currently ignored; variants may use cfg.variant_kwargs
|
||
12: dtype: torch.dtype = torch.bfloat16
|
||
13:
|
||
14: # targeting
|
||
15: target_roles: tuple[str, ...] = ("reader", "writer")
|
||
16: target_names: tuple[str, ...] = ()
|
||
17: exclude_names: tuple[str, ...] = ("lm_head", "embed_tokens")
|
||
18: layers: tuple[int, ...] | None = None
|
||
19:
|
||
20: # variant-specific bag (e.g. lambda0 for DeLoRA)
|
||
21: variant_kwargs: dict[str, Any] = field(default_factory=dict)
|
||
22:
|
||
23: def to_dict(self) -> dict:
|
||
24: d = asdict(self)
|
||
25: d["dtype"] = str(self.dtype).removeprefix("torch.")
|
||
... (16 more lines)
|
||
|
||
[tool] src/lora_lite/target.py (completed)
|
||
kind: read
|
||
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/target.py
|
||
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/target.py
|
||
output:
|
||
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/target.py</path>
|
||
<type>file</type>
|
||
<content>
|
||
1: """Find linear-like targets by shape (reader/writer/inner) + name regex.
|
||
2:
|
||
3: Structural matching is deliberate: bnb Linear4bit/8bitLt are not nn.Linear, but
|
||
4: they expose in_features/out_features/weight and their forward already handles
|
||
5: dequantization.
|
||
6: """
|
||
7: import re
|
||
8: from torch import nn
|
||
9:
|
||
10:
|
||
11: def is_linear_like(m: nn.Module) -> bool:
|
||
12: return (
|
||
13: hasattr(m, "in_features")
|
||
14: and hasattr(m, "out_features")
|
||
15: and hasattr(m, "weight")
|
||
16: and callable(m)
|
||
17: )
|
||
18:
|
||
19:
|
||
20: def _layer_idx(name: str) -> int | None:
|
||
21: m = re.search(r"\.layers?\.(\d+)\.", name)
|
||
22: return int(m.group(1)) if m else None
|
||
23:
|
||
24:
|
||
25: def _classify(m: nn.Module, d_model: int, name: str) -> str:
|
||
... (37 more lines)
|
||
|
||
[thinking] OK now I have a clear picture. Let me write up the design review with specific answers to each question. Let me also check if there's an AGENTS.md file:
|
||
|
||
[tool] glob (pending)
|
||
input: {}
|
||
|
||
[tool] glob (completed)
|
||
kind: search
|
||
input: AGENTS.md
|
||
output:
|
||
No files found
|
||
|
||
[thinking] Let me check for AGENTS.md in parent directories or look for any README or docs.
|
||
|
||
[tool] glob (pending)
|
||
input: {}
|
||
|
||
[tool] glob (completed)
|
||
kind: search
|
||
input: **/AGENTS.md
|
||
output:
|
||
No files found
|
||
|
||
[thinking] The user wants design notes. I have all the information I need. Let me write up the design notes for each reference.
|
||
|
||
[tool] todowrite (pending)
|
||
input: {}
|
||
|
||
[plan]
|
||
- [in_progress] write design notes for peft EVA
|
||
- [pending] write design notes for baukit nethook
|
||
- [pending] write design notes for antipasto3 SVD
|
||
|
||
[tool] 3 todos (completed)
|
||
kind: other
|
||
input: {"todos":[{"content":"write design notes for peft EVA","status":"in_progress","priority":"high"},{"content":"write design notes for baukit nethook","status":"pending","priority":"high"},{"content":"write design notes ...
|
||
output:
|
||
[
|
||
{
|
||
"content": "write design notes for peft EVA",
|
||
"status": "in_progress",
|
||
"priority": "high"
|
||
},
|
||
{
|
||
"content": "write design notes for baukit nethook",
|
||
"status": "pending",
|
||
"priority": "high"
|
||
},
|
||
{
|
||
"content": "write design notes for antipasto3 SVD",
|
||
"status": "pending",
|
||
"priority": "high"
|
||
}
|
||
]
|
||
|
||
[thinking] Let me write out the design notes for each reference. ## A. peft EVA **What EVA does** (1-paragraph): EVA computes SVD-of-activations to initialize LoRA's low-rank decomposition. Rather than random kaiming init, it collects layer inputs during a forward pass through calibration data, runs incremental PCA to find principal components, and uses those as the initial `A` matrix (the left singular vectors). Key features: - Per-layer rank redistribution based on explained variance (lines 302-314) - Convergence checking via cosine similarity between SVD components (lines 147-168) - Distributed support via all-gather (lines 81-102) - Handles non-convergence gracefully (multiple forward passes until all layers converge) A **minimal EVA variant** in lora-lite: The current API is: ```python variant.init(layer, cfg) # called per-layer after params created ``` EVA needs a **separate calibration st...
|
||
|
||
|
||
# Design Review: lora-lite vs peft EVA / baukit / antipasto3
|
||
|
||
## A. peft EVA
|
||
|
||
**What EVA does** (1-paragraph summary):
|
||
EVA computes SVD-of-activations to initialize LoRA's low-rank decomposition. Rather than random kaiming init, it registers hooks on target layers, runs incremental PCA on calibration inputs (collected during forward passes), and uses the left singular vectors as the initial `lora_A` matrix. Key features: per-layer rank redistribution based on explained variance (lines 302-314), convergence checking via cosine similarity (lines 147-168), and distributed support via all-gather (lines 81-102).
|
||
|
||
**Minimal EVA variant sketch:**
|
||
|
||
The current API is:
|
||
```python
|
||
variant.init(layer, cfg) # called per-layer after params created (adapter.py:56)
|
||
```
|
||
|
||
EVA needs a **separate calibration step before `attach()`**:
|
||
|
||
```python
|
||
# Option: calibrate() -> eva_state dict
|
||
def calibrate(model, dataloader, cfg):
|
||
# Run forward passes, collect activations per target
|
||
# Compute SVD, return {"layer_name": U_svd}
|
||
...
|
||
|
||
# Then in LoRA variant.init():
|
||
if "eva_state" in cfg.variant_kwargs:
|
||
with torch.no_grad():
|
||
layer.lora_A.weight.copy_(cfg.variant_kwargs["eva_state"][layer_name])
|
||
```
|
||
|
||
**What to drop if re-implementing:**
|
||
- Rank redistribution logic (lines 302-314) — skip, use fixed rank
|
||
- Distributed gather (lines 81-102) — skip single-GPU only
|
||
- Hash deduplication for equal inputs (lines 173-213, 374-386) — skip
|
||
- Resume-from-checkpoint plumbing (lines 489-559) — skip
|
||
- Whiten option (lines 475-477) — skip
|
||
- PEFT model wrapping (lines 612-656) — skip
|
||
|
||
**API impact:**
|
||
Current `Variant.init(layer, cfg)` signature at adapter.py:56 supports EVA via `cfg.variant_kwargs["eva_state"]`. Minimal change: add `calibrate(model, dataloader, cfg)` function (~50 lines) that computes SVD and returns state dict, then LoRA's `init()` checks for `eva_state` in kwargs.
|
||
|
||
---
|
||
|
||
**Recommendation: PARTIAL**
|
||
|
||
- **Import**: Core SVD-of-activations collection + SVD computation (lines 289-486 minus distributed/rank-redist/PEFT plumbing)
|
||
- **Approximate line-count**: +~80 lines (calibrate function + init hook), −0 removed
|
||
- **Trade**: Adds data-driven init capability; costs one new function and variant_kwargs passthrough
|
||
|
||
---
|
||
|
||
## B. baukit nethook
|
||
|
||
**What `TraceDict`/`Trace` gives us that lora-lite doesn't:**
|
||
- Context manager pattern with automatic cleanup (adapter.py:68 already handles this via `RemovableHandle`)
|
||
- `retain_grad=True` for backward pass (lora-lite doesn't need this — we're not editing grads)
|
||
- `edit_output` for intercepting/modifying (not needed — we have dedicated `forward_input` hook)
|
||
- Dictionary interface (`ret['layer.name'].output`) — we just store layer ref directly
|
||
|
||
Quote from adapter.py:64-68:
|
||
```python
|
||
for _, layer, _ in attached_targets:
|
||
if hasattr(layer._lora_variant, "forward_input"):
|
||
handles.append(layer.register_forward_pre_hook(_pre_hook))
|
||
else:
|
||
handles.append(layer.register_forward_hook(_hook))
|
||
```
|
||
|
||
**Would baukit simplify or complicate adapter.py?**
|
||
- **Neutral to negative**. Current code is 5 lines (64-68). Baukit's TraceDict adds ~100 lines of context-management indirection. We don't need:
|
||
- Per-layer config dicts (retain_output/retain_input per layer)
|
||
- Edit hooks mid-forward
|
||
- Stop-after-layer control
|
||
- clone/detach options
|
||
|
||
**Dependencies:**
|
||
Baukit is a single file (~470 lines). lora-lite currently has no deps beyond torch+einops. Adding baukit adds a dep for ~5 lines of value.
|
||
|
||
---
|
||
|
||
**Recommendation: SKIP**
|
||
|
||
Baukit adds ~400 lines for capabilities we don't need. Current hook registration (adapter.py:64-68) is simpler. The `retain_grad` feature could be useful for ml debugging but that's a separate concern — can inline later if needed.
|
||
|
||
---
|
||
|
||
## C. antipasto3 SVD adapter
|
||
|
||
**1. What to adopt or diverge from?**
|
||
|
||
| Feature | antipasto3 | lora-lite | Verdict |
|
||
|--------|-----------|----------|--------|
|
||
| SVD init from kernel | `create_svd_adapter()` line 169 | manual param_specs | **Divergen**: lora-lite missing |
|
||
| Delta-s scaling | `S_scaled = S + alpha*delta_s` (line 160) | direct scaling via `alpha/r` | **Adopt**: antipasto3 better — learnable scaling, not fixed |
|
||
| Block-diagonal rotation | Cayley transform (lines 135-141) | N/A | **Too complex** for lora-lite ethos |
|
||
| Residual: `W_res = kernel - U@S@Vh` | line 204 | N/A | **Consider**: adds expressivity |
|
||
| Dimension selection: L1 trip | lines 219-267 | None | **Skip**: too domain-specific |
|
||
| Polarity interleave | lines 271-287 | N/A | **Skip**: steering-specific |
|
||
|
||
**Key insight**: antipasto3's `W_res` residual (kernel - U@S@Vh) is the cleanest simplification over vanilla LoRA. Instead of:
|
||
- LoRA: `y = Wx + (α/r)BAx` (learns full rank-r delta)
|
||
|
||
You get:
|
||
- SVD: `y = W_res@x + (α/r)U_scaled@S_scaled@Vh@x` (SVD-based delta, capped)
|
||
|
||
This is essentially **PiSSA** (arXiv:2409.13349) — already in lora-lite variants/ (pissa.py).
|
||
|
||
**2. PiSSA-like factoring:**
|
||
PiSSA is in variants/pissa.py. Let me check it:
|
||
|
||
[tool] read (pending)
|
||
input: {}
|
||
|
||
[tool] src/lora_lite/variants/pissa.py (completed)
|
||
kind: read
|
||
input: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/pissa.py
|
||
files: /media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/pissa.py
|
||
output:
|
||
<path>/media/wassname/SGIronWolf/projects5/2026/lora-lite/src/lora_lite/variants/pissa.py</path>
|
||
<type>file</type>
|
||
<content>
|
||
1: """PiSSA: top-r SVD of W into A,B; replace W with W_res = W - B@A.
|
||
2:
|
||
3: Meng et al. 2024 https://arxiv.org/abs/2404.02948
|
||
4: W_eff(t=0) = W_res + B@A = W (numerically; bf16 round-trip not bit-exact).
|
||
5: """
|
||
6: import torch
|
||
7: from einops import einsum
|
||
8: from torch import nn
|
||
9:
|
||
10: from ..variant import register, ParamSpec
|
||
11:
|
||
12:
|
||
13: @register
|
||
14: class PiSSA:
|
||
15: name = "pissa"
|
||
16:
|
||
17: @staticmethod
|
||
18: def param_specs(d_in, d_out, cfg):
|
||
19: return {
|
||
20: "lora_A": ParamSpec((cfg.r, d_in), init="zeros", trainable=True),
|
||
21: "lora_B": ParamSpec((d_out, cfg.r), init="zeros", trainable=True),
|
||
22: }
|
||
23:
|
||
24: @staticmethod
|
||
25: def init(layer: nn.Linear, cfg) -> None:
|
||
... (32 more lines)
|
||
|
||
[thinking] So PiSSA is already in lora-lite! Let me check what antipasto3 offers beyond this. Comparing antipasto3 to PiSSA: - Both use SVD init from W antipasto3 has: - Delta-s scaling: learnable S + delta_s (additive learnable, not just fixed) - Residual is explicit (W_res stored separately) - PiSSA mutates layer.weight into W_res The key differences: 1. **Delta-s**: antipasto3 has `delta_s` parameters (line 89-91, 160) — fully learnable scaling. PiSSA has no learnable scaling, just fixed `alpha/r`. 2. **Steering coefficient**: antipasto3 uses explicit `alpha` mutated during forward (line 107, 116), not config-based. 3. **Residual**: PiSSA mutates `layer.weight` directly. antipasto3 stores `W_res` separately. antipasto3's delta-s is the main "new" thing worth considering. But that's really a steering/adaptation method (changing alpha at runtime), not an initialization method. For lora-lite's m...
|
||
|
||
[tool] todowrite (pending)
|
||
input: {}
|
||
|
||
[plan]
|
||
- [completed] write design notes for peft EVA
|
||
- [completed] write design notes for baukit nethook
|
||
- [completed] write design notes for antipasto3 SVD
|
||
|
||
[tool] 0 todos (completed)
|
||
kind: other
|
||
input: {"todos":[{"content":"write design notes for peft EVA","priority":"high","status":"completed"},{"content":"write design notes for baukit nethook","priority":"high","status":"completed"},{"content":"write design notes ...
|
||
output:
|
||
[
|
||
{
|
||
"content": "write design notes for peft EVA",
|
||
"status": "completed",
|
||
"priority": "high"
|
||
},
|
||
{
|
||
"content": "write design notes for baukit nethook",
|
||
"status": "completed",
|
||
"priority": "high"
|
||
},
|
||
{
|
||
"content": "write design notes for antipasto3 SVD",
|
||
"status": "completed",
|
||
"priority": "high"
|
||
}
|
||
]
|
||
|
||
[thinking] Wait, I wrote the design notes directly as output but the user asked me to answer questions and end with a recommendation. Let me reformat as requested. Actually, looking at the instructions again: the output format says "For each (A, B, C), end with: Recommendation: ADOPT / SKIP / PARTIAL". So I need to restructure my response to match that format better, and NOT write code. Let me restructure: A, B, C sections with each ending in recommendation.
|
||
|
||
|
||
# Design Review: lora-lite vs refs
|
||
|
||
## A. peft EVA (`docs/refs/peft_eva.py` + `peft_eva_finetuning.py`)
|
||
|
||
### 1. What EVA actually does
|
||
|
||
EVA uses SVD-of-activations (incremental PCA on layer inputs during forward passes) to initialize LoRA rather than random kaiming init. Key mechanism:
|
||
- Registers hooks on target layers, collects inputs across calibration batches
|
||
- Computes left singular vectors via `IncrementalPCA` (lines 138-143)
|
||
- Convergence check via cosine similarity between current/previous components (lines 167-168)
|
||
- Per-layer rank redistribution based on explained variance ratio (lines 302-314, 456)
|
||
|
||
The user workflow (peft_eva_finetuning.py:66-76):
|
||
```python
|
||
eva_config = EvaConfig(rho=rho)
|
||
peft_config = LoraConfig(r=rank, init_lora_weights="eva", eva_config=eva_config)
|
||
peft_model = get_peft_model(model, peft_config)
|
||
initialize_lora_eva_weights(peft_model, dataloader) # separate calibration step
|
||
```
|
||
|
||
### 2. Minimal EVA variant API sketch
|
||
|
||
**Option: separate calibrate() step before attach()**
|
||
```python
|
||
# User does:
|
||
eva_state = calibrate(model, dataloader, cfg) # returns {"layer_name": U_svd}
|
||
cfg.variant_kwargs["eva_state"] = eva_state
|
||
attach(model, cfg) # LoRA.init() copies U_svd -> lora_A
|
||
```
|
||
|
||
API change: add `calibrate(model, dataloader, cfg)` function (~50 lines) that computes SVD on calibration inputs, returns state dict. In `LoRA.init()`, check for `eva_state[layer_name]` and copy.
|
||
|
||
### 3. What to drop if re-implementing
|
||
|
||
- Lines 302-314: rank redistribution logic → use fixed rank r
|
||
- Lines 81-102: distributed all-gather → single-GPU only
|
||
- Lines 173-213, 374-386: hash deduplication for equal inputs → skip
|
||
- Lines 489-559: resume-from-checkpoint plumbing → skip
|
||
- Lines 475-477: whiten option → skip
|
||
- Lines 612-656: PEFT model wrapper integration → standalone function
|
||
|
||
Droppable: ~60% of the 739-line file.
|
||
|
||
### 4. Does current init() signature support EVA?
|
||
|
||
Yes. adapter.py:56:
|
||
```python
|
||
variant.init(layer, cfg)
|
||
```
|
||
|
||
Add `eva_state` to `cfg.variant_kwargs` before attach, then in `LoRA.init()`:
|
||
```python
|
||
if "eva_state" in cfg.variant_kwargs:
|
||
# copy precomputed U into lora_A
|
||
```
|
||
|
||
Smallest change: add `calibrate()` function (~50 lines), no signature change needed.
|
||
|
||
---
|
||
|
||
**Recommendation: PARTIAL**
|
||
|
||
- **Import**: SVD-of-activations collection + compute
|
||
- **Lines**: +~50 (calibrate fn), −0
|
||
- **What to remove in exchange**: Nothing — this is additive capability only. If you add calibration data collection, you must accept the additional complexity.
|
||
|
||
---
|
||
|
||
## B. baukit nethook (`docs/refs/baukit_nethook.py`)
|
||
|
||
### 1. What TraceDict/Trace gives us that we don't have
|
||
|
||
| Feature | baukit | lora-lite | Value |
|
||
|--------|-------|----------|-------|
|
||
| Context manager cleanup | lines 99-105 | RemovableHandle (adapter.py:64-68) | Same |
|
||
| retain_grad | line 37-38 | None | Not needed |
|
||
| edit_output mid-forward | lines 39, 72-75 | None | Not needed |
|
||
| Dictionary interface | line 111 | None | Not needed |
|
||
| Stop-after-layer | lines 46-47, 189-202 | None | Not needed |
|
||
|
||
TraceDict essentially provides: per-layer output retention + config dicts + cleanup. We already handle cleanup via RemovableHandle.
|
||
|
||
### 2. Would baukit simplify or complicate adapter.py?
|
||
|
||
Quote adapter.py:64-68:
|
||
```python
|
||
for _, layer, _ in attached_targets:
|
||
if hasattr(layer._lora_variant, "forward_input"):
|
||
handles.append(layer.register_forward_pre_hook(_pre_hook))
|
||
else:
|
||
handles.append(layer.register_forward_hook(_hook))
|
||
```
|
||
|
||
This is 5 lines. Baukit's TraceDict adds ~100 lines of context management we'd use only to replicate these 5 lines. Our current approach is simpler.
|
||
|
||
### 3. Is baukit worth a dep?
|
||
|
||
No. lora-lite's principle is minimal deps (currently torch+einops). Baukit adds ~400 lines of code for capabilities we don't need. `retain_grad` for debugging is the only useful feature — that's 3 lines to inline if needed later.
|
||
|
||
---
|
||
|
||
**Recommendation: SKIP**
|
||
|
||
Baukit adds a dep for no net simplification. Current hook registration (adapter.py:64-68) is 5 lines and sufficient. If `retain_grad` is needed for ml debugging, inline 3 lines later.
|
||
|
||
---
|
||
|
||
## C. antipasto3 SVD adapter (`docs/refs/antipasto3_svd_adapter.py`)
|
||
|
||
### 1. What to adopt or diverge from
|
||
|
||
| Feature | antipasto3 | lora-lite | Verdict |
|
||
|---------|-----------|----------|--------|
|
||
| SVD init from kernel | line 169 `create_svd_adapter()` | PiSSA (pissa.py) | Already have |
|
||
| Delta-s scaling: learnable | line 160 `S + alpha*delta_s` | fixed via alpha/r | **Adopt** |
|
||
| Residual: `W_res = kernel - U@S@Vh` | line 204 | PiSSA has this | Already have |
|
||
| Block-diagonal Cayley rotation | lines 135-141 | N/A | Skip (too complex) |
|
||
| Dimension selection (L1 trip) | lines 219-267 | N/A | Skip (domain-specific) |
|
||
|
||
**Key adoption**: antipasto3's learnable `delta_s` is cleaner than fixed scaling. Currently lora-lite uses `scale = alpha/r` at **runtime** — but delta-s makes scaling a **learnable parameter** with its own gradient.
|
||
|
||
This is already in PiSSA variant: piSSA `init()` sets `scale=1` via alpha=r convention (pissa.py:43-46), but doesn't expose the scaling as learnable.
|
||
|
||
### 2. PiSSA-like factoring
|
||
|
||
lora-lite already has PiSSA in `variants/pissa.py`:
|
||
- Line 31-34: SVD of weight
|
||
- Line 36-40: decompose into B@A with sqrt(S) scaling
|
||
- Line 46: residual W_res = W - B@A
|
||
|
||
This is exactly antipasto3's core insight, implemented cleanly. The antipasto3 contribution is making `delta_s` learnable — that's a natural extension of PiSSA, not a replacement.
|
||
|
||
---
|
||
|
||
**Recommendation: ADOPT (concept)**
|
||
|
||
- **Import**: learnable delta_s concept (not code — just the idea)
|
||
- **Lines**: +0 by intent, no code change right now
|
||
- **What to remove**: Nothing — this is a design note that PiSSA could evolve toward learnable scaling instead of fixed alpha/r
|
||
|
||
The antipasto3 patterns ( Cayley rotation, polarity interleave) are steering-specific and too complex for lora-lite ethos. The clean insight is already in PiSSA.
|
||
|
||
---
|
||
|
||
## Summary
|
||
|
||
| Ref | Recommendation | Impact |
|
||
|-----|--------------|--------|
|
||
| peft EVA | PARTIAL | +50 lines calibrate fn |
|
||
| baukit | SKIP | 0 (keep current) |
|
||
| antipasto3 | ADOPT concept | 0 (design note) |
|
||
|
||
[done] end_turn
|