mirror of
https://github.com/wassname/steer-heal-love.git
synced 2026-06-27 17:02:34 +08:00
docs: QLoRA is net ~2x slower (gen-bound loop), keep mask-before-softmax heal fix
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -27,3 +27,11 @@ Distil an activation steering vector (steering-lite) into a conditioned LoRA, he
- Fail fast, crash loudly. No defensive guards, no fallbacks, no silent skips.
- One objective + one constraint (barrier), never competing losses. See `spec.md` Loss.
- Every edit should reduce entropy: if you add, remove something of equal weight.

## Gotchas

- Default to bf16 bs=1. This loop is GENERATION-bound (~150 gens/round vs one short SFT pass), so
  QLoRA is a ~2x net loss here: it speeds up training (the cheap part) but makes decode ~3x slower (~28 vs ~9 s/gen).
  QLoRA only earns its place when bf16 cannot hold the model. See RESEARCH_JOURNAL 2026-06-09.
- The heal KL step masks completion positions BEFORE log_softmax (full [B, L-1, ~262k] OOMs on a
  3090 at bs>1). Keep this regardless of dtype.
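The mask-before-softmax gotcha can be illustrated without any tensor library. The sketch below is a framework-free stand-in, not the repo's heal step: `log_softmax` normalises each position independently over the vocabulary, so dropping prompt positions before normalising gives the same per-position log-probs as dropping them afterwards, while only the kept rows are ever materialised. The function names, the KL direction (reference to student), and the sequence lengths in the size estimate are assumptions for illustration; only the ~262k vocab and the [B, L-1, V] shape come from the notes above.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one position's vocab logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def row_kl(ref_row, stu_row):
    """KL(ref || student) at one position (the direction is an assumption)."""
    lp, lq = log_softmax(ref_row), log_softmax(stu_row)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def kl_mask_after(ref, stu, mask):
    """Score every position, then keep completion rows (the wasteful order)."""
    per_pos = [row_kl(r, s) for r, s in zip(ref, stu)]
    kept = [k for k, m in zip(per_pos, mask) if m]
    return sum(kept) / len(kept)

def kl_mask_before(ref, stu, mask):
    """Keep completion rows first, then score only those rows."""
    pairs = [(r, s) for r, s, m in zip(ref, stu, mask) if m]
    return sum(row_kl(r, s) for r, s in pairs) / len(pairs)

def logprob_bytes(batch, positions, vocab=262_144, bytes_per_el=4):
    """Size in bytes of one [batch, positions, vocab] log-prob tensor."""
    return batch * positions * vocab * bytes_per_el

# Toy logits: 4 positions, vocab of 3; the last two positions are the completion.
ref = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0], [2.0, 0.0, -1.0], [0.5, 0.5, 3.0]]
stu = [[0.1, 0.9, 2.2], [1.0, 0.0, 1.0], [1.5, 0.2, -1.0], [0.0, 1.0, 2.5]]
mask = [False, False, True, True]

# Both orders give the same mean KL over completion positions.
assert math.isclose(kl_mask_after(ref, stu, mask), kl_mask_before(ref, stu, mask))

# Hypothetical lengths: 512 scored positions, of which 128 are completion tokens.
full = logprob_bytes(batch=4, positions=512)
masked = logprob_bytes(batch=4, positions=128)
print(full / 2**30, masked / 2**30)  # GiB for the full vs masked tensor
```

In a tensor framework the same idea is to index the logits with the completion mask before calling log-softmax. The size estimate counts a single fp32 log-prob tensor only; real peak memory also depends on dtype, autograd buffers, and how many such tensors are alive at once.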
@@ -987,3 +987,29 @@ find the budget where coherence holds and trait still moves (queued, see TaskLis
holds both, accept the round-2 adapter as the deliverable and run the prompting baseline (task #26).
(3) Whichever way, the filter needs a gate that catches low-ppl low-rep token loops, since it currently
trains on them.

## 2026-06-09 -- QLoRA is a net loss for this pipeline: it speeds training (the cheap part) and slows generation 3x (the bottleneck)

**Why.** Tried 4-bit NF4 base to free ~6GB and run train_bs>1 (the bs=4 heal step OOM'd on the full
[B, L-1, V] log_softmax over gemma's ~262k vocab; fixed by masking completion positions BEFORE softmax,
identical math, ~1.5GB saved -- that fix is a keeper for bf16 too). Then measured generation under QLoRA.

**Result (single data point, steady-state).** Steered 512-token generation under QLoRA: `27.6 s/it`
(~18 tok/s) vs bf16 ~9 s (~50 tok/s), the 4-bit dequant-per-forward tax. The pipeline is
GENERATION-BOUND, not training-bound: ~150 completions/round (≈48 bisection probes + ≈96 walk-C collect
+ 6 adapter) against one short SFT pass. So the trade is:

| config | per gen | gen time/round (~150 gens) | round | 8 rounds | 7-demo sweep |
|---|---|---|---|---|---|
| QLoRA bs=3 | ~28s | ~69 min | ~85 min | ~11h | ~78h |
| bf16 bs=1 | ~9s | ~22 min | ~40 min | ~5h | ~37h |

QLoRA bought bs=3 training, but training is ~10% of wall-clock -- speeding it 3x saves ~3% overall while
the 3x generation slowdown costs ~50%. **Net ~2x slower end-to-end.** QLoRA optimized the wrong
bottleneck. Lesson: in a generate-filter-train loop dominated by autoregressive sampling, 4-bit's
memory win does not pay for its decode-speed loss; QLoRA only earns its place when the goal is FITTING a
model that bf16 cannot hold, not throughput on one that already fits.

**Next.** Revert to bf16 bs=1 (the proven task-0 path), keep the mask-before-softmax heal fix, the
walk-C bisection, and the round-loosened barrier. If a bigger model is ever the goal, QLoRA returns but
the sweep budget must assume the 3x decode tax.
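The table's arithmetic can be re-derived in a few lines, which is handy when re-budgeting a sweep under a different decode speed. This is a minimal sketch: the per-generation seconds and per-round minutes are copied from the entry above, while the helper names and the dictionary layout are hypothetical.

```python
GENS_PER_ROUND = 48 + 96 + 6  # bisection probes + walk-C collect + adapter

def gen_minutes(sec_per_gen, gens=GENS_PER_ROUND):
    """Minutes spent generating in one round."""
    return sec_per_gen * gens / 60.0

def sweep_hours(round_minutes, rounds=8, demos=7):
    """Hours for a full sweep, given the measured minutes per round."""
    return round_minutes * rounds * demos / 60.0

# Per-gen seconds and round minutes are taken from the journal entry.
configs = {
    "QLoRA bs=3": {"sec_per_gen": 27.6, "round_min": 85},
    "bf16 bs=1": {"sec_per_gen": 9.0, "round_min": 40},
}

for name, c in configs.items():
    g = gen_minutes(c["sec_per_gen"])
    share = g / c["round_min"]  # fraction of the round spent generating
    print(f"{name}: gen {g:.1f} min/round ({share:.0%} of round), "
          f"sweep {sweep_hours(c['round_min']):.1f} h")

slowdown = configs["QLoRA bs=3"]["round_min"] / configs["bf16 bs=1"]["round_min"]
print(f"net slowdown: {slowdown:.2f}x")
```

The computed sweep hours land close to the table's rounded figures. To budget for a bigger model, swap in a freshly measured `sec_per_gen` and `round_min` rather than scaling these.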