diff --git a/README.md b/README.md index 1ba1d70..779f35d 100644 --- a/README.md +++ b/README.md @@ -64,7 +64,7 @@ just qwen-probe # Qwen/Qwen3-0.6B train/save-load probe Params = trainable adapter params. Peak GPU = peak CUDA memory during train+eval (logged from this run onward; older runs predate the column). -This is a validation harness, not a leaderboard: every adapter trains on a MetaMathQA subset and is tested on GSM8K, the same protocol as [PEFT's method comparison](https://github.com/huggingface/peft/tree/main/method_comparison/MetaMathQA). A correct implementation clears ~48% (PEFT's LoRA mark on the larger Llama-3.2-3B); every row here does, which is the signal that each adapter trains. See [the benchmark script](scripts/metamath_gsm8k_benchmark.py) for hyperparameters (r=32 q/v; the AntiPaSTO family uses r=256). +We validate our adapters the same way [PEFT](https://github.com/huggingface/peft/tree/main/method_comparison) does. If acting properly they train on a 5k-step MetaMathQA subset and test 49%+ on GSM8K; as you can see, all adapters pass this mark. See [this file](scripts/metamath_gsm8k_benchmark.py) for more details. AntiPaSTO is the novel row here: instead of adding trainable directions like LoRA, it freezes W's own top-r SVD and learns only a bounded per-direction gain `S_eff = S * (1 + ELU(g))`. The singular basis stays fixed and interpretable, and the adapter is O(r) params (~320x smaller than LoRA). The variants change only the basis or core: CorDA orients it by input covariance ([Yang+ 2024](https://arxiv.org/abs/2406.05223)), ablate learns a contractive directional ablation ([Arditi+ 2024](https://arxiv.org/abs/2406.11717)), arrow adds a small dense block for cross-direction mixing. diff --git a/justfile b/justfile index fa98cb1..a3d50a5 100644 --- a/justfile +++ b/justfile @@ -94,6 +94,10 @@ bench-variant model variant steps="5000" block="8": # matches the published AntiPaSTO row. alpha=r (no extra scaling). antipasto*) lr=5e-3; r=256; alpha=256 ;; esac + # 5e-3 suits the tiny S-space gain, but arrow's large dense block is LoRA-like + # and destabilizes at that lr (block=128 got 45.7% vs block=8's 60.5%). Drop to + # LoRA's 1e-4 once the block dominates the param count. + if [ "{{variant}}" = "antipasto_arrow" ] && [ "{{block}}" -gt 8 ]; then lr=1e-4; fi exec uv run --extra benchmark python scripts/metamath_gsm8k_benchmark.py \ --model '{{model}}' \ --variant '{{variant}}' \