mirror of
https://github.com/wassname/lora-lite.git
synced 2026-06-27 17:01:14 +08:00
README: validation framing per wassname's wording; arrow large-block lr=1e-4
README: 'we validate the same way PEFT does; trained properly they clear 49% on GSM8K, all pass' + link to the benchmark script.

justfile: arrow with block>8 uses lr=1e-4 not 5e-3. The 5e-3 that suits the tiny S-space gain destabilizes the large dense block -- block=128 at 5e-3 scored 45.7% (below the bar, vs block=8's 60.5%). Capacity sweep requeued at LoRA's 1e-4 to de-confound params-vs-lr.

Co-Authored-By: Claudypoo <noreply@anthropic.com>
@@ -64,7 +64,7 @@ just qwen-probe # Qwen/Qwen3-0.6B train/save-load probe
Params = trainable adapter params. Peak GPU = peak CUDA memory during train+eval (logged from this run onward; older runs predate the column).
-This is a validation harness, not a leaderboard: every adapter trains on a MetaMathQA subset and is tested on GSM8K, the same protocol as [PEFT's method comparison](https://github.com/huggingface/peft/tree/main/method_comparison/MetaMathQA). A correct implementation clears ~48% (PEFT's LoRA mark on the larger Llama-3.2-3B); every row here does, which is the signal that each adapter trains. See [the benchmark script](scripts/metamath_gsm8k_benchmark.py) for hyperparameters (r=32 q/v; the AntiPaSTO family uses r=256).
+We validate our adapters the same way [PEFT](https://github.com/huggingface/peft/tree/main/method_comparison) does. If acting properly they train on a 5k-step MetaMathQA subset and test 49%+ on GSM8K; as you can see, all adapters pass this mark. See [this file](scripts/metamath_gsm8k_benchmark.py) for more details.
AntiPaSTO is the novel row here: instead of adding trainable directions like LoRA, it freezes W's own top-r SVD and learns only a bounded per-direction gain `S_eff = S * (1 + ELU(g))`. The singular basis stays fixed and interpretable, and the adapter is O(r) params (~320x smaller than LoRA). The variants change only the basis or core: CorDA orients it by input covariance ([Yang+ 2024](https://arxiv.org/abs/2406.05223)), ablate learns a contractive directional ablation ([Arditi+ 2024](https://arxiv.org/abs/2406.11717)), arrow adds a small dense block for cross-direction mixing.
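
The gain rule `S_eff = S * (1 + ELU(g))` can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's implementation: the function names `elu` and `effective_singular_values` are hypothetical, ELU is assumed to use its standard `alpha = 1` (the README does not state it), and the singular values are made up for the example.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """Standard ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)


def effective_singular_values(s, g):
    """Apply S_eff = S * (1 + ELU(g)) elementwise (hypothetical helper).

    s: frozen top-r singular values of W (never trained).
    g: trainable per-direction gains, one scalar per singular value.
    """
    if len(s) != len(g):
        raise ValueError("s and g must have one entry per retained direction")
    return [si * (1.0 + elu(gi)) for si, gi in zip(s, g)]


s = [4.0, 2.0, 1.0]  # made-up singular values, for illustration only

# g = 0 gives ELU(0) = 0, so the adapter starts as an exact no-op.
identity = effective_singular_values(s, [0.0, 0.0, 0.0])

# Positive g amplifies a direction; negative g shrinks it. With alpha = 1,
# ELU(g) > -1 in exact arithmetic, so the factor is bounded below by zero:
# a direction can be suppressed but its sign cannot flip.
scaled = effective_singular_values(s, [1.0, -1.0, -20.0])
```

Under these assumptions the only trainable state is one scalar in `g` per retained direction, which is where the O(r) parameter count comes from; the singular vectors themselves are never updated.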
@@ -94,6 +94,10 @@ bench-variant model variant steps="5000" block="8":
# matches the published AntiPaSTO row. alpha=r (no extra scaling).
antipasto*) lr=5e-3; r=256; alpha=256 ;;
esac
+# 5e-3 suits the tiny S-space gain, but arrow's large dense block is LoRA-like
+# and destabilizes at that lr (block=128 got 45.7% vs block=8's 60.5%). Drop to
+# LoRA's 1e-4 once the block dominates the param count.
+if [ "{{variant}}" = "antipasto_arrow" ] && [ "{{block}}" -gt 8 ]; then lr=1e-4; fi
exec uv run --extra benchmark python scripts/metamath_gsm8k_benchmark.py \
--model '{{model}}' \
--variant '{{variant}}' \