docs: record adapter shapes + shrinkage-vs-separation; journal rotation fix

AGENTS.md: new section on PiSSA (delta_S:[r] diag) vs LoRA (A:[r,d_in] full)
adapters -- forward sees only the sum so same-basis routing is a magnitude split
(shrinkage null) unless broken by gate discrimination x (expressiveness + structural
separation). Honest note that this wasn't clear to me first pass.

RESEARCH_JOURNAL: rotation fix + the verified shrinkage confound (antipasto.py:107
sums kept+hack in one basis); the deploy delta_S*=(1-qE) control is the cheap decider.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-10 06:50:42 +00:00
parent 0112f4a36d
commit 7511ba12e8
2 changed files with 60 additions and 0 deletions
@@ -74,6 +74,40 @@ Inherit global rules from `~/.claude/CLAUDE.md`.
- do not diagnose overconfidently. If you can't think of 3+ plausible hypotheses (including bugs, subtle failures, and you being wrong about concepts), then you have lost perspective and have narrow vision
- I'm often AFK, so don't stop to ask me a question when you know the likely answer, when I've already indicated or asked for it, or when the only answer is "waiting for your go ahead". I'd rather you just commit and go ahead
## The adapter setup (shapes), and why "same position = shrinkage" is subtle
Per Linear `W:[d_out,d_in]`, rank `r`. Two adapters: deployed + quarantine (ablated at deploy).
- PiSSA/AntiPaSTO: `W = U S Vh`, `U:[d_out,r]`/`Vh:[r,d_in]` FROZEN. Train `delta_S:[r]`
(deployed) + `delta_S_hack:[r]` (quarantine) -- diagonals in the SAME frozen basis, `r`
scalars each. forward `y = W@x + U @ ( (Vh@x) * (delta_S + delta_S_hack) )`.
- LoRA-frozen-B (current `wrap_model_with_lora_frozen_b`): `A:[r,d_in]` trainable, `B:[d_out,r]`
FROZEN+SHARED, `A_hack:[r,d_in]` quarantine. forward `y = W@x + B @ ((A + A_hack) @ x)`.
Shared B -> `A.grad == A_hack.grad` pre-routing.
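A minimal NumPy sketch of the two forwards above, on toy sizes picked only for illustration. It checks that the PiSSA-style output depends only on `delta_S + delta_S_hack`, and that with a shared frozen `B` the gradients of `A` and `A_hack` coincide. The linear stand-in loss `g @ y` and the finite-difference helper `numeric_grad` are assumptions for the demo, not code from the repo.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 3            # toy sizes (assumption)

W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)

# PiSSA/AntiPaSTO-style: frozen U, Vh taken from the top-r SVD of W.
U_full, _, Vh_full = np.linalg.svd(W, full_matrices=False)
U, Vh = U_full[:, :r], Vh_full[:r, :]

def pissa_forward(delta_S, delta_S_hack):
    return W @ x + U @ ((Vh @ x) * (delta_S + delta_S_hack))

total = rng.normal(size=r)
one_empty = pissa_forward(total, np.zeros(r))        # "two adapters, one empty"
split = pissa_forward(0.3 * total, 0.7 * total)      # same sum, different split
pissa_same_output = np.allclose(one_empty, split)

# LoRA-frozen-B-style: shared frozen B, trainable A and A_hack.
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))
A_hack = rng.normal(size=(r, d_in))
g = rng.normal(size=d_out)          # stand-in for the upstream gradient dL/dy

def lora_forward(A_, A_hack_):
    return W @ x + B @ ((A_ + A_hack_) @ x)

def loss(A_, A_hack_):
    return float(g @ lora_forward(A_, A_hack_))

def numeric_grad(f, M, eps=1e-6):
    """Central finite differences, one entry of M at a time."""
    grad = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        step = np.zeros_like(M)
        step[idx] = eps
        grad[idx] = (f(M + step) - f(M - step)) / (2 * eps)
    return grad

grad_A = numeric_grad(lambda M: loss(M, A_hack), A)
grad_A_hack = numeric_grad(lambda M: loss(A, M), A_hack)
analytic = np.outer(B.T @ g, x)     # chain rule: dL/dA = (B^T g) x^T

lora_same_grad = np.allclose(grad_A, grad_A_hack, atol=1e-6)
```

Both flags should come out `True`: any routing applied after this point splits one and the same gradient.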
The forward sees ONLY the sum. So routing that carves one gradient into kept+routed and
ablates the routed part at deploy is, by DEFAULT, a MAGNITUDE split: deploy = vanilla minus
the routed fraction `qE` of the update = "earlier-training vanilla" = less (late-emerging)
hacking, but no change in direction.
That is the shrinkage NULL. Vanilla today already has `delta_S_hack=0` (never routed into), so
it is "two adapters, one empty"; routeV's deploy has just lost `qE` of the same update.
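The shrinkage null can be sketched numerically. This assumes plain SGD and a gate that routes the same fraction `qE` of every gradient (no discrimination); the gradients are random stand-ins and `lr` is arbitrary. Because the forward only sees the sum, the split does not change the trajectory, so the same gradients feed both the vanilla and the routed run.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_params = 200, 16
qE = 0.3                      # assumed constant routed fraction
lr = 0.01                     # arbitrary learning rate

grads = rng.normal(size=(n_steps, n_params))      # stand-in per-step gradients

vanilla = -lr * grads.sum(axis=0)                 # unrouted update
kept = -lr * ((1 - qE) * grads).sum(axis=0)       # lands in the deployed adapter
routed = -lr * (qE * grads).sum(axis=0)           # lands in the quarantine adapter

deploy = kept                                     # quarantine ablated at deploy

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

direction_match = cosine(deploy, vanilla)
norm_ratio = float(np.linalg.norm(deploy) / np.linalg.norm(vanilla))
```

Here `direction_match` is 1 up to float error and `norm_ratio` equals `1 - qE`: deploy is vanilla scaled down, nothing else. With an adaptive optimizer the scaling need not be exactly `1 - qE`, so treat this as the idealized case.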
Shrinkage is NOT inevitable. Two things break it:
1. Adapter EXPRESSIVENESS. `delta_S` is `r` per-axis scales (singular-value tweaks); `A` is a
full `r*d_in` unfrozen map. Under a DISCRIMINATING gate (f high on hack rollouts, low on
solve) the deployed `A` accumulates `Σ_solve g` and `A_hack` accumulates `Σ_hack g` -- real
separation. `delta_S` can separate far less (only along `r` fixed axes). So LoRA is less
doomed to shrinkage than PiSSA even with shared B.
2. STRUCTURAL separation: give the quarantine its OWN frozen encoder/decoder (`U2/Vh2`, or its
own trainable `B_hack`), so the two adapters live in different subspaces, `∂L/∂deployed !=
∂L/∂quarantine`, and deploy-ablation removes a different FUNCTION, not a slice of the same
update.
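Both escape routes can be sketched on toy data. Part 1 assumes an idealized gate (`f = 1` on hack rollouts, `0` on solve) and synthetic gradients that cluster around two orthogonal directions; part 2 assumes a hypothetical separate `B_hack`. All names and sizes are illustrative, not taken from the repo.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Part 1: discriminating gate, shared basis (flattened A-style parameters).
n_params, n_each = 16, 50
solve_dir = np.eye(n_params)[0]
hack_dir = np.eye(n_params)[1]
solve_grads = solve_dir + 0.1 * rng.normal(size=(n_each, n_params))
hack_grads = hack_dir + 0.1 * rng.normal(size=(n_each, n_params))

grads = np.concatenate([solve_grads, hack_grads])
f = np.concatenate([np.zeros(n_each), np.ones(n_each)])   # idealized gate

kept = ((1 - f)[:, None] * grads).sum(axis=0)    # accumulates sum of solve g
routed = (f[:, None] * grads).sum(axis=0)        # accumulates sum of hack g
vanilla = grads.sum(axis=0)

deploy_vs_vanilla = cosine(kept, vanilla)        # below 1: direction changed
hack_left_in_deploy = abs(kept @ hack_dir) / abs(vanilla @ hack_dir)

# Part 2: structural separation via a separate frozen decoder B_hack.
d_out, d_in, r = 6, 5, 3
B = rng.normal(size=(d_out, r))
B_hack = rng.normal(size=(d_out, r))             # hypothetical own decoder
g = rng.normal(size=d_out)                       # stand-in for dL/dy
x = rng.normal(size=d_in)

grad_A = np.outer(B.T @ g, x)                    # dL/dA
grad_A_hack = np.outer(B_hack.T @ g, x)          # dL/dA_hack, now different
grads_differ = not np.allclose(grad_A, grad_A_hack)
```

In part 1 the deployed accumulation points away from vanilla and carries little of the hack direction, which is the opposite of the shrinkage null. A real gate is noisier, so the separation would be partial rather than clean.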
So shrinkage-vs-direction is decided by (gate discrimination) x (adapter expressiveness +
structural separation), NOT by "same position" alone. Controls: capacity-matched vanilla (two
empty adapters, or one 2x adapter, no routing) isolates parameter count; non-directional
routing at matched `qE` isolates shrinkage. (I did not have this straight on first pass -- the
trap is calling same-position routing "shrinkage" without checking the gate/expressiveness.)
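A sketch of the matched-`qE` shrinkage control, plus one optional parameter-space check. The helper names `shrinkage_control` and `parallel_scale_and_residual` are hypothetical, and the check is an assumption on my part: it complements, and does not replace, comparing the behavior of the routed deploy against the controls.

```python
import numpy as np

def shrinkage_control(delta_vanilla, qE):
    """Non-directional control: vanilla's direction, minus a fraction qE of its size."""
    return (1.0 - qE) * delta_vanilla

def parallel_scale_and_residual(delta_deploy, delta_vanilla):
    """Least-squares fit delta_deploy ~= scale * delta_vanilla.

    Returns (scale, residual_fraction); residual_fraction near 0 means the
    deployed update is just a rescaled vanilla update.
    """
    scale = float(delta_deploy @ delta_vanilla / (delta_vanilla @ delta_vanilla))
    residual = delta_deploy - scale * delta_vanilla
    return scale, float(np.linalg.norm(residual) / np.linalg.norm(delta_deploy))

rng = np.random.default_rng(0)
qE = 0.3
delta_vanilla = rng.normal(size=32)              # stand-in for a trained update

control = shrinkage_control(delta_vanilla, qE)
control_scale, control_resid = parallel_scale_and_residual(control, delta_vanilla)

# A made-up "directional" deploy: the same shrinkage plus an off-axis change.
off_axis = rng.normal(size=32)
off_axis -= (off_axis @ delta_vanilla) / (delta_vanilla @ delta_vanilla) * delta_vanilla
directional = control + 0.5 * off_axis
dir_scale, dir_resid = parallel_scale_and_residual(directional, delta_vanilla)
```

Both cases report a scale of `1 - qE`, but only the directional one has a non-zero residual. A routed deploy whose residual is near zero is hard to distinguish from the control.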
## Extra instructions:
- When you queue a job, follow with `pueue follow | tail` in bg so you are woken on fail or finish