mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
docs: record adapter shapes + shrinkage-vs-separation; journal rotation fix
AGENTS.md: new section on PiSSA (delta_S:[r] diag) vs LoRA (A:[r,d_in] full) adapters -- forward sees only the sum so same-basis routing is a magnitude split (shrinkage null) unless broken by gate discrimination x (expressiveness + structural separation). Honest note that this wasn't clear to me first pass. RESEARCH_JOURNAL: rotation fix + the verified shrinkage confound (antipasto.py:107 sums kept+hack in one basis); the deploy delta_S*=(1-qE) control is the cheap decider. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -74,6 +74,40 @@ Inherit global rules from `~/.claude/CLAUDE.md`.
- Do not diagnose overconfidently. If you can't think of 3+ plausible hypotheses - including bugs, subtle failures, and you being wrong about concepts - then you have lost perspective and have narrow vision.
- I'm often AFK, so don't stop to ask me a question when you know the likely answer, when I've already indicated or asked for it, or when there is only one answer and you are just "waiting for your go-ahead". I'd rather you just commit and go ahead.

## The adapter setup (shapes), and why "same position = shrinkage" is subtle

Per Linear `W:[d_out,d_in]`, rank `r`. Two adapters: deployed + quarantine (ablated at deploy).

- PiSSA/AntiPaSTO: `W ≈ U S Vh` (top-`r` SVD), `U:[d_out,r]`/`Vh:[r,d_in]` FROZEN. Train `delta_S:[r]`
  (deployed) + `delta_S_hack:[r]` (quarantine) -- diagonals in the SAME frozen basis, `r`
  scalars each. Forward: `y = W@x + U @ ( (Vh@x) * (delta_S + delta_S_hack) )`.
- LoRA-frozen-B (current `wrap_model_with_lora_frozen_b`): `A:[r,d_in]` trainable, `B:[d_out,r]`
  FROZEN+SHARED, `A_hack:[r,d_in]` quarantine. Forward: `y = W@x + B @ ((A + A_hack) @ x)`.
  Shared B -> `A.grad == A_hack.grad` pre-routing.

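A minimal NumPy sketch of the two forward passes above. The sizes, the random weights, and the squared-error loss are all assumptions made for illustration; `wrap_model_with_lora_frozen_b` itself is not reproduced here. It checks that the output depends only on the sum of the two adapters, and that with a shared `B` the finite-difference gradients of `A` and `A_hack` coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 5, 2                 # toy sizes (assumption)
W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)

# PiSSA/AntiPaSTO-style: frozen top-r SVD basis, trainable diagonals.
U_full, _, Vh_full = np.linalg.svd(W, full_matrices=False)
U, Vh = U_full[:, :r], Vh_full[:r, :]    # U:[d_out,r], Vh:[r,d_in]
delta_S = 0.1 * rng.normal(size=r)       # deployed
delta_S_hack = 0.1 * rng.normal(size=r)  # quarantine

def pissa_forward(dS, dS_hack):
    return W @ x + U @ ((Vh @ x) * (dS + dS_hack))

# LoRA-frozen-B style: one shared frozen B, trainable A and A_hack.
B = rng.normal(size=(d_out, r))
A = 0.1 * rng.normal(size=(r, d_in))
A_hack = 0.1 * rng.normal(size=(r, d_in))

def lora_forward(A_, A_hack_):
    return W @ x + B @ ((A_ + A_hack_) @ x)

# Only the sum matters: moving mass between the adapters leaves y unchanged.
shift = rng.normal(size=r)
pissa_same = np.allclose(
    pissa_forward(delta_S, delta_S_hack),
    pissa_forward(delta_S + shift, delta_S_hack - shift),
)

# Toy loss (assumption): 0.5 * ||y - target||^2.
target = rng.normal(size=d_out)

def loss(A_, A_hack_):
    return 0.5 * np.sum((lora_forward(A_, A_hack_) - target) ** 2)

def finite_diff_grad(wrt_hack, eps=1e-6):
    # Central differences, one entry of A (or A_hack) at a time.
    g = np.zeros_like(A)
    for i in range(r):
        for j in range(d_in):
            E = np.zeros_like(A)
            E[i, j] = eps
            if wrt_hack:
                g[i, j] = (loss(A, A_hack + E) - loss(A, A_hack - E)) / (2 * eps)
            else:
                g[i, j] = (loss(A + E, A_hack) - loss(A - E, A_hack)) / (2 * eps)
    return g

grad_A = finite_diff_grad(wrt_hack=False)
grad_A_hack = finite_diff_grad(wrt_hack=True)
grads_match = np.allclose(grad_A, grad_A_hack, atol=1e-6)
```

Both gradients equal `B.T @ dL/dy` outer `x`, so before any routing is applied the two adapters receive the same signal.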
The forward sees ONLY the sum. So routing that carves one gradient into kept+routed and
ablates the routed part at deploy is, by DEFAULT, a MAGNITUDE split: deploy = vanilla minus
`qE` of the update = "earlier-training vanilla" = less (late-emerging) hacking, no direction.
That is the shrinkage NULL. Vanilla today already has `delta_S_hack=0` (never routed into), so
it is "two adapters, one empty"; routeV's deploy just lost `qE` of the same update.

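To make the shrinkage null concrete, here is a small NumPy sketch under loud assumptions: plain SGD, a toy quadratic loss, and a NON-discriminating gate that routes the same fraction `qE` of every gradient. None of these come from the repo. Because the forward pass uses the sum, the routed run follows the vanilla trajectory, and the deployed adapter ends up as `(1 - qE)` times the vanilla update, pointing in the same direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, steps, lr, qE = 8, 50, 0.1, 0.3   # toy values (assumptions)
target = rng.normal(size=d)          # minimiser of the toy loss

def grad(total_delta):
    # Gradient of 0.5 * ||total_delta - target||^2; it sees only the sum.
    return total_delta - target

# Vanilla: one adapter receives the whole update.
vanilla = np.zeros(d)
for _ in range(steps):
    vanilla -= lr * grad(vanilla)

# Routed: the same gradient is carved into kept + routed parts.
kept = np.zeros(d)     # deployed adapter
routed = np.zeros(d)   # quarantine adapter, ablated at deploy
for _ in range(steps):
    g = grad(kept + routed)
    kept -= lr * (1 - qE) * g
    routed -= lr * qE * g

deploy = kept          # quarantine dropped at deploy
cosine = deploy @ vanilla / (np.linalg.norm(deploy) * np.linalg.norm(vanilla))
same_trajectory = np.allclose(kept + routed, vanilla)
pure_shrinkage = np.allclose(deploy, (1 - qE) * vanilla)
```

If a routed deploy is indistinguishable from vanilla scaled by `(1 - qE)`, routing bought magnitude only. An adaptive optimiser such as Adam may not preserve the exact `(1 - qE)` factor, so treat the equality here as a property of this toy setup.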
Shrinkage is NOT inevitable. Two things break it:

1. Adapter EXPRESSIVENESS. `delta_S` is `r` per-axis scales (near-eigenvalue tweaks); `A` is a
   full `r*d_in` unfrozen map. Under a DISCRIMINATING gate (f high on hack rollouts, low on
   solve) the deployed `A` accumulates `Σ_solve g` and `A_hack` accumulates `Σ_hack g` -- real
   separation. `delta_S` can separate far less (only along `r` fixed axes). So LoRA is less
   doomed to shrinkage than PiSSA even with shared B.
2. STRUCTURAL separation: give the quarantine its OWN frozen encoder/decoder (`U2/Vh2`, or its
   own trainable `B_hack`), so the two adapters live in different subspaces,
   `∂L/∂deployed != ∂L/∂quarantine`, and deploy-ablation removes a different FUNCTION, not a
   slice of the same update.

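Both mechanisms can be sketched in NumPy. Everything below is a stand-in: the "solve" and "hack" gradients are random matrices, the gates are hand-written, and `dL_dy` is a made-up upstream gradient. Part 1 compares a flat gate with a fully discriminating one; part 2 shows that giving the quarantine its own decoder changes the gradient it receives.

```python
import numpy as np

rng = np.random.default_rng(0)
r, d_in, d_out, n = 4, 16, 8, 10   # toy sizes (assumptions)

def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 1. Gate discrimination: per-rollout gradients for A, shape [r, d_in].
g_solve = rng.normal(size=(n, r, d_in))   # stand-in "solve" rollouts
g_hack = rng.normal(size=(n, r, d_in))    # stand-in "hack" rollouts
grads = np.concatenate([g_solve, g_hack])
total = grads.sum(axis=0)                 # what vanilla would apply

def kept_update(gate):
    # gate[i] in [0, 1] is the fraction of rollout i routed to quarantine.
    return ((1 - gate)[:, None, None] * grads).sum(axis=0)

flat_gate = np.full(2 * n, 0.5)                          # no discrimination
sharp_gate = np.concatenate([np.zeros(n), np.ones(n)])   # f=0 solve, f=1 hack

cos_flat = cosine(kept_update(flat_gate), total)    # magnitude split only
cos_sharp = cosine(kept_update(sharp_gate), total)  # direction changes
keeps_solve_sum = np.allclose(kept_update(sharp_gate), g_solve.sum(axis=0))

# 2. Structural separation: the quarantine gets its own decoder B_hack.
x = rng.normal(size=d_in)
dL_dy = rng.normal(size=d_out)                # upstream gradient (stand-in)
B = rng.normal(size=(d_out, r))
B_hack = rng.normal(size=(d_out, r))
grad_A = np.outer(B.T @ dL_dy, x)             # dL/dA through B
grad_A_hack = np.outer(B_hack.T @ dL_dy, x)   # dL/dA_hack through B_hack
grads_differ = not np.allclose(grad_A, grad_A_hack)
```

With the flat gate the kept update is exactly half of the total, so its cosine with the vanilla update is 1. With the sharp gate the kept update is `Σ_solve g`, which points elsewhere. In this sketch `A` has `r*d_in = 64` entries in which to separate, whereas a `delta_S` adapter of the same rank would have only `r = 4`.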
So shrinkage-vs-direction is decided by (gate discrimination) x (adapter expressiveness +
structural separation), NOT by "same position" alone. Controls: capacity-matched vanilla (two
empty adapters, or one 2x adapter, no routing) isolates parameter count; non-directional
routing at matched `qE` isolates shrinkage. (I did not have this straight on first pass -- the
trap is calling same-position routing "shrinkage" without checking the gate/expressiveness.)

## Extra instructions:

- When you queue a job, follow with `pueue follow | tail` in bg so you are woken on fail or finish
