README: validation framing per wassname's wording; arrow large-block lr=1e-4

README: 'we validate the same way PEFT does; trained properly they clear 49% on GSM8K, all pass' + link to the benchmark script.

justfile: arrow with block>8 uses lr=1e-4, not 5e-3. The 5e-3 that suits the tiny S-space gain destabilizes the large dense block: block=128 at 5e-3 scored 45.7% (below the bar, vs block=8's 60.5%). Capacity sweep requeued at LoRA's 1e-4 to de-confound params-vs-lr.

Co-Authored-By: Claudypoo <noreply@anthropic.com>
2026-06-27 17:01:14 +08:00 · 2026-06-15 18:27:33 +08:00
parent 6b7b3a47dd
commit e8ca6f5944
2 changed files with 5 additions and 1 deletion
@@ -64,7 +64,7 @@ just qwen-probe  # Qwen/Qwen3-0.6B train/save-load probe

 Params = trainable adapter params. Peak GPU = peak CUDA memory during train+eval (logged from this run onward; older runs predate the column).

-This is a validation harness, not a leaderboard: every adapter trains on a MetaMathQA subset and is tested on GSM8K, the same protocol as [PEFT's method comparison](https://github.com/huggingface/peft/tree/main/method_comparison/MetaMathQA). A correct implementation clears ~48% (PEFT's LoRA mark on the larger Llama-3.2-3B); every row here does, which is the signal that each adapter trains. See [the benchmark script](scripts/metamath_gsm8k_benchmark.py) for hyperparameters (r=32 q/v; the AntiPaSTO family uses r=256).
+We validate our adapters the same way [PEFT](https://github.com/huggingface/peft/tree/main/method_comparison) does: each adapter trains on a 5k-step MetaMathQA subset and is tested on GSM8K. A properly working adapter scores 49% or higher; every adapter here passes this mark. See [the benchmark script](scripts/metamath_gsm8k_benchmark.py) for more details.

 AntiPaSTO is the novel row here: instead of adding trainable directions like LoRA, it freezes W's own top-r SVD and learns only a bounded per-direction gain `S_eff = S * (1 + ELU(g))`. The singular basis stays fixed and interpretable, and the adapter is O(r) params (~320x smaller than LoRA). The variants change only the basis or core: CorDA orients it by input covariance ([Yang+ 2024](https://arxiv.org/abs/2406.05223)), ablate learns a contractive directional ablation ([Arditi+ 2024](https://arxiv.org/abs/2406.11717)), arrow adds a small dense block for cross-direction mixing.

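Note on the AntiPaSTO paragraph in the README hunk above: the gain formula `S_eff = S * (1 + ELU(g))` can be checked by hand. The following is a minimal sketch, not the repository's implementation. It assumes the standard ELU with alpha=1, and the function names and singular values are hypothetical. Under that assumption the factor `1 + ELU(g)` is strictly positive, and `g = 0` leaves W's singular values unchanged.

```python
import math


def elu(x: float, alpha: float = 1.0) -> float:
    """Standard ELU. alpha=1.0 is an assumption (the usual default)."""
    return x if x > 0 else alpha * math.expm1(x)


def effective_singular_values(S, g):
    """S_eff = S * (1 + ELU(g)): one learned gain per frozen singular direction."""
    if len(S) != len(g):
        raise ValueError("need exactly one gain per singular value")
    return [s * (1.0 + elu(gi)) for s, gi in zip(S, g)]


# Hypothetical top-3 singular values of a weight matrix W.
S = [4.0, 2.0, 1.0]

# g = 0 is the identity: the adapter starts out as the frozen weights.
identity = effective_singular_values(S, [0.0, 0.0, 0.0])

# Positive g amplifies a direction; negative g shrinks it toward zero
# but never flips its sign, because 1 + ELU(g) = exp(g) > 0 for g <= 0.
scaled = effective_singular_values(S, [0.5, -1.0, -20.0])

print(identity)
print(scaled)
```

Only the vector `g` is trainable in this sketch, which is where the O(r) parameter count comes from.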
@@ -94,6 +94,10 @@ bench-variant model variant steps="5000" block="8":
 		# matches the published AntiPaSTO row. alpha=r (no extra scaling).
 		antipasto*) lr=5e-3; r=256; alpha=256 ;;
 	esac
+	# 5e-3 suits the tiny S-space gain, but arrow's large dense block is LoRA-like
+	# and destabilizes at that lr (block=128 got 45.7% vs block=8's 60.5%). Drop to
+	# LoRA's 1e-4 once the block dominates the param count.
+	if [ "{{variant}}" = "antipasto_arrow" ] && [ "{{block}}" -gt 8 ]; then lr=1e-4; fi
 	exec uv run --extra benchmark python scripts/metamath_gsm8k_benchmark.py \
 		--model '{{model}}' \
 		--variant '{{variant}}' \
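Note on the justfile hunk above: the learning-rate rule is easy to misread in shell, so here it is as a small Python sketch. It mirrors only the `antipasto*` branch visible in this diff. The function name `pick_lr` is hypothetical, and the learning rates for the other variants are not shown here, so the sketch refuses to guess them.

```python
def pick_lr(variant: str, block: int = 8) -> float:
    """Mirror the recipe's learning-rate choice for antipasto* variants only."""
    if not variant.startswith("antipasto"):
        # Other branches of the case statement are outside this diff.
        raise ValueError("only the antipasto* branch is shown in this diff")
    lr = 5e-3  # suits the small S-space gain
    if variant == "antipasto_arrow" and block > 8:
        lr = 1e-4  # large dense block is LoRA-like, so use LoRA's rate
    return lr


print(pick_lr("antipasto_arrow", block=8))
print(pick_lr("antipasto_arrow", block=128))
print(pick_lr("antipasto", block=128))
```

The override depends on both the variant and the block size: a plain `antipasto` run keeps 5e-3 regardless of `block`, and `antipasto_arrow` keeps it up to and including block=8.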