types, review

2026-06-27 16:15:50 +08:00 · 2026-04-26 20:35:38 +08:00
parent 55757e829d
commit 053901e0ca
14 changed files with 1991 additions and 44 deletions
@@ -0,0 +1,186 @@
+# V4 Variant Review — per-component vs reference + smoke/probe validity
+
+You are an expert ML engineer reviewing a from-scratch PEFT library
+(`lora-lite`, ~500 LOC) that re-implements 8 LoRA variants. Three prior
+reviews already happened (V1 paper-vs-code, V2 with refs provided, V3
+per-component). Your job is V4: re-run the per-component check and
+additionally validate the test harness.
+
+# Part A — per-variant audit (re-do, more rigorous)
+
+8 variants live in `src/lora_lite/variants/`:
+- lora.py
+- pissa.py
+- delora.py
+- ia3.py (registers `ia3` and `ia3_ff`)
+- dora.py
+- hra.py
+- eva.py
+- antipasto.py
+
+Plus runtime in `src/lora_lite/{adapter.py,variant.py,target.py,config.py}`.
+Reference implementations are in `docs/refs/` and URLs are pasted in each
+variant's module docstring.
+
+## For EACH variant, in this order, every time:
+
+1. **REFERENCE EXISTS** — verify the variant has a real, citeable
+   reference. Required:
+     - a paper (arxiv/conference) link, AND
+     - either an upstream peft implementation OR the original author's
+       code (GitHub).
+   If the variant has NO paper, NO reference code, OR the references
+   are dead/missing/clearly wrong, FLAG IT as `NO REFERENCE` -- this
+   is severity HIGH because it means there's nothing to validate
+   against.
+
+2. **PARAMS** — every spec from `param_specs`: shape, dtype, trainable,
+   as_buffer. Match against the reference. Buffers vs Parameters
+   chosen correctly?
+
+3. **INIT** — what does `init()` (and `group_init()` if defined) do?
+   Match the reference exactly? Walk gradient flow at t=0: which
+   trainable params actually receive non-zero gradient on step 1?
+
+4. **DTYPE** — trace dtype through init -> storage -> forward.
+   Silent precision loss? Identity-at-init survive bf16?
+
+5. **FORWARD** — write the math the forward implements vs the math
+   in the reference. Term-by-term comparison. Common mistakes:
+   - wrong scale (alpha/r vs 1/r vs alpha vs 1)
+   - missing/doubled normalization
+   - wrong basis (rotating U vs V; gating input vs output)
+   - dropout placement (we have NO dropout by design — flag if any
+     code path depends on one)
+
+6. **LINK SANITY** — actually open the URLs. Verify:
+   - paper arxiv link goes to the right paper
+   - github link points to a real file (not 404)
+   - offline `docs/refs/` snapshot still matches what the URL serves
+     today (snapshots may be stale; flag drift)
+
+## Per-variant output (≤60 lines each):
+
+    ## <variant>
+
+    ### references
+    - paper: <url>  -- OK / WRONG / DEAD / MISSING
+    - peft ref: <url>  -- OK / DEAD / MISSING
+    - author ref (if any): <url>  -- OK / DEAD / MISSING
+    - offline snapshot (`docs/refs/...`): NONE / MATCH / DRIFT
+    - VERDICT: HAS_REFERENCE / NO_REFERENCE
+
+    ### params
+    - <one bullet per ParamSpec; flag bug if any>
+
+    ### init / group_init
+    - <bullets; identify GRADIENT FLOW at t=0 explicitly>
+
+    ### dtype
+    - <bullets>
+
+    ### forward
+    Math (ours):    <one-line equation>
+    Math (ref):     <one-line equation>
+    Match?          YES / NO + one-line reason
+
+    ### verdict
+    CORRECT / PARTIAL / BUGGY  -- one-sentence reason
+
+# Part B — validate the smoke test (`tests/smoke.py`)
+
+Read `tests/smoke.py` end-to-end. For each per-variant SHOULD claim,
+answer:
+
+1. **Distinguishing power** — would a SILENT FAILURE (e.g. forward
+   returning `y` unchanged, or training only the bias term, or
+   loading an empty state dict) STILL pass this check? If yes,
+   the check is WEAK -- name a stronger one.
+
+2. **Tolerance sanity** — the bf16/fp16 tolerances are computed
+   from `base_scale`. Are they too loose? Too tight? Could they
+   pass on noise alone?
+
+3. **Coverage** — what mechanisms are NOT tested? (e.g. multi-step
+   convergence on real targets, dtype mismatch between attach and
+   load, mixing variants, calibration data of len < r for EVA)
+
+Output:
+
+    ## smoke.py validity
+
+    ### per-variant SHOULD checks
+    | check | distinguishes silent failure? | tolerance ok? | notes |
+    | ... |
+
+    ### gaps
+    - bullets
+
+    ### must-add tests
+    - bullets
+
+# Part C — validate the qwen overfit probe (`scripts/qwen_train_probe.py`)
+
+Read `scripts/qwen_train_probe.py` end-to-end. Same questions as Part B
+but for the Qwen probe specifically:
+
+1. Does `assert_only_lora_trainable` actually catch a leaked base
+   parameter, given the way `requires_grad` is set in `adapter.py`?
+
+2. `perturb_first_adapter` only perturbs ONE param per variant. Does
+   `perturb_delta > 1e-7` distinguish "the variant uses that param in
+   forward" from "the variant ignores that param"?
+
+3. `loss_last < loss0` after 8 steps with lr=5e-3 -- could this pass
+   purely from optimizer noise? What's the right held-out / validation
+   check to add?
+
+4. The reload check uses `args.reload_tol` (default 2e-2 in bf16). Is
+   that loose enough to mask a real save/load bug?
+
+5. Targets are restricted to `model.layers.0.self_attn.{q,v}_proj` --
+   does this exercise the full attach path or hide bugs that only
+   appear with multi-layer / FFN / lm_head edge cases?
+
+Output:
+
+    ## qwen_train_probe.py validity
+
+    ### claim-by-claim
+    | assertion | catches silent failure? | notes |
+    | ... |
+
+    ### gaps
+    - bullets
+
+    ### must-add tests
+    - bullets
+
+# Final summary
+
+After parts A, B, C, write:
+
+    ## summary
+
+    ### variant verdicts
+    | variant | has_ref | params | init | dtype | forward | verdict |
+
+    ### MUST-FIX (severity HIGH, blocks correctness claim)
+    1. ...
+    2. ...
+
+    ### NICE-TO-HAVE
+    - ...
+
+# Hard rules
+
+- Be specific. Cite line numbers (`src/lora_lite/variants/foo.py:NN`)
+  for every claim.
+- Do NOT propose redesigns. Only flag correctness issues against
+  references and validity issues in the test harness.
+- If an issue is intentional and documented in the docstring, say so
+  and move on -- don't re-flag known deviations.
+- If you can't tell whether something is a bug, say "AMBIGUOUS" with
+  the question you'd need answered.
+- For Part B/C, focus on whether checks have DISTINGUISHING power
+  (would a silent failure still pass?) -- not just whether they run.