diff --git a/AGENTS.md b/AGENTS.md
index f661ba7..54aa16f 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -74,6 +74,40 @@ Inherit global rules from `~/.claude/CLAUDE.md`.
 - do not overconfidentaly diagnoses. if you cant think of 3+ plausible hypothesis - including bugs, subtle failures, and you being wrong about concepts - then you have lost perspective and narrow vision
 - I'd often afk so dont stop and ask me a question you know the likely answer, or I've already indicated or asked for, or where there is only on answer "waiting for your go ahead". I'd rather you just commit and go ahead
 
+## The adapter setup (shapes), and why "same position = shrinkage" is subtle
+
+Per Linear `W:[d_out,d_in]`, rank `r`. Two adapters: deployed + quarantine (ablated at deploy).
+
+- PiSSA/AntiPaSTO: `W = U S Vh`, `U:[d_out,r]`/`Vh:[r,d_in]` FROZEN. Train `delta_S:[r]`
+  (deployed) + `delta_S_hack:[r]` (quarantine) -- diagonals in the SAME frozen basis, `r`
+  scalars each. forward `y = W@x + U @ ( (Vh@x) * (delta_S + delta_S_hack) )`.
+- LoRA-frozen-B (current `wrap_model_with_lora_frozen_b`): `A:[r,d_in]` trainable, `B:[d_out,r]`
+  FROZEN+SHARED, `A_hack:[r,d_in]` quarantine. forward `y = W@x + B @ ((A + A_hack) @ x)`.
+  Shared B -> `A.grad == A_hack.grad` pre-routing.
+
+The forward sees ONLY the sum. So routing that carves one gradient into kept+routed and
+ablates the routed part at deploy is, by DEFAULT, a MAGNITUDE split: deploy = vanilla minus
+`qE` of the update = "earlier-training vanilla" = less (late-emerging) hacking, no direction.
+That is the shrinkage NULL. Vanilla today already has `delta_S_hack=0` (never routed into), so
+it is "two adapters, one empty"; routeV's deploy just lost `qE` of the same update.
+
+Shrinkage is NOT inevitable. Two things break it:
+1. Adapter EXPRESSIVENESS. `delta_S` is `r` per-axis scales (near-eigenvalue tweaks); `A` is a
+   full `r*d_in` unfrozen map. Under a DISCRIMINATING gate (f high on hack rollouts, low on
+   solve) the deployed `A` accumulates `Σ_solve g` and `A_hack` accumulates `Σ_hack g` -- real
+   separation. `delta_S` can separate far less (only along `r` fixed axes). So LoRA is less
+   doomed to shrinkage than PiSSA even with shared B.
+2. STRUCTURAL separation: give the quarantine its OWN frozen encoder/decoder (`U2/Vh2`, or its
+   own trainable `B_hack`), so the two adapters live in different subspaces, `∂L/∂deployed !=
+   ∂L/∂quarantine`, and deploy-ablation removes a different FUNCTION, not a slice of the same
+   update.
+
+So shrinkage-vs-direction is decided by (gate discrimination) x (adapter expressiveness +
+structural separation), NOT by "same position" alone. Controls: capacity-matched vanilla (two
+empty adapters, or one 2x adapter, no routing) isolates parameter count; non-directional
+routing at matched `qE` isolates shrinkage. (I did not have this straight on first pass -- the
+trap is calling same-position routing "shrinkage" without checking the gate/expressiveness.)
+
 ## Extra instructions:
 
 - When you queue a job, follow with `pueue follow | tail` in bg so you are woken on fail or finish
diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index 197890b..cbaed0b 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -4053,3 +4053,29 @@ in main.tex.
 ### Next
 
 Wait for job 28. If hack_deploy with clean pairs is still << 0.1 (comparable to contaminated): result is robust, narrative is "even mechanism-agnostic weak-testing pairs suppress hacking". If it rises back toward vanilla: need better pairs or need to acknowledge the result depends on axis-1-specific knowledge.
+
+## 2026-06-10 06:20 -- rotation fix for the unhackable flip + the fable-review shrinkage confound
+
+**Context:** commit `0112f4a` on `probe/distill-cosine`. Smoke: `/tmp/claude-1000/smoke_full.log`. New gate `scripts/verify_rotation.py`; `scripts/verify_rewards.py` gains gt_only rows.
+
+### Observations
+
+- [obs] The gt_only "unhackable" flip was seeded on `(seed, pid)` only (`data.py:90`, pre-fix), applied once at load -> the SAME ~10% of problems were unhackable every step. Frozen, not rotating, despite the design intent.
+- [obs] Fixed: flip moved to the train step loop seeded on `(seed, step, pid)`; both prompt (`messages_gt`, plain hint) and grader (`eff_mode=gt_only`) flip; teacher demos skipped on flipped steps. `verify_rotation`: over 50 steps the unhackable subset changed 49/49 step-pairs (PASS). `verify_rewards` gt_only rows: every hack -> passed=False, reward 0.5 (PASS). Smoke: 1/30 draws flipped and graded gt_only.
+- [obs] `antipasto.py:107` forward = `y + (kept + hack)` with `kept = U@(a*delta_S)`, `hack = U@(a*delta_S_hack)`, both using the same `U`, `Vh`, `a=Vh@x`. The two adapters are in the identical functional position.
+
+### Inferences
+
+- [inf] Because the forward sums two adapters in the same basis with identical per-step gradients, scalar (per-rollout) routing just partitions one vanilla-sized update between two stores; `delta_S + delta_S_hack` ~ the full vanilla update, train-time behaviour ~ vanilla, and deploy zeroes ~qE~0.5 of the update. So the headline DEPLOY-HACK suppression is very probably mechanical SHRINKAGE, not direction. {reason: "fable review #1, confirmed by reading the forward; matches placebo job86 (dead vec -> deploy hack 0.000)", credence: 0.7}
+- [inf] The surviving directional claim is SOLVE RETENTION (real-v 0.625 vs placebo 0.531, n=1), not hack suppression -- the evidence metric inverts. {reason: "if both vecs route ~qE and both kill hack, only solve differs", credence: 0.55}
+- [inf] Rotation only helps the METHOD if routing genuinely sorts hack-rollout from solve-rollout gradients (discriminating gate) or routes the vec-aligned COMPONENT; otherwise rotation + scalar routing still collapses to shrinkage at deploy. {reason: "same-basis sum argument", credence: 0.75}
+
+### Failure modes considered
+
+- **Most likely:** Hack suppression is shrinkage. Prior 0.7. Check: deploy a trained vanilla ckpt with `delta_S *= (1-qE)`; if hack ~0, confirmed (task #28).
+- **Subtle:** Adam per-parameter state makes the two-adapter split non-linear, so real-v's routing trajectory differs from a random split in a way the scalar argument misses. Prior 0.2. Check: coin-flip gate at matched routed-energy vs real-v.
+- **Null:** The n=1 solve gap (0.625 vs 0.531) is seed noise and direction does nothing at all. Prior 0.25. Check: 3 seeds of real-v vs placebo.
+
+### Next
+
+Killed job 30 (vanilla eval3 baseline ran the OLD frozen-flip env); requeued as job 39 on the rotating code so the bake-off (arms 35/37/38, all post-commit -> rotating) is apples-to-apples. Then run the shrinkage control (#28) and prototype component routing (#29).
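
The same-basis sum argument can be sanity-checked on a toy problem. The sketch below is a minimal illustration, not the repo's code. It assumes the LoRA-frozen-B forward `y = W@x + B @ ((A + A_hack) @ x)`, a squared-error loss as a stand-in for the real objective, plain SGD (not Adam, so it deliberately does not cover the "Subtle" failure mode), zero-initialised adapters, and a constant scalar gate `q_e = 0.5`. The names `grad_wrt_sum` and `train` are hypothetical.

```python
import numpy as np

def grad_wrt_sum(W, B, A_sum, x, t):
    """Gradient of 0.5*||y - t||^2 w.r.t. A (identical for A_hack),
    where y = W@x + B@(A_sum@x) and A_sum = A + A_hack."""
    y = W @ x + B @ (A_sum @ x)
    return B.T @ np.outer(y - t, x)

def train(gate, steps=50, lr=0.01, seed=0):
    """Toy SGD loop. gate(step) returns the fraction f of each update
    that is routed into the quarantine adapter A_hack."""
    rng = np.random.default_rng(seed)
    d_out, d_in, r = 6, 5, 2              # toy shapes, assumption
    W = rng.normal(size=(d_out, d_in))    # frozen base weight
    B = rng.normal(size=(d_out, r)) / np.sqrt(d_out)  # frozen, shared
    A = np.zeros((r, d_in))               # deployed adapter
    A_hack = np.zeros((r, d_in))          # quarantine adapter
    for step in range(steps):
        x = rng.normal(size=d_in)
        t = rng.normal(size=d_out)
        g = grad_wrt_sum(W, B, A + A_hack, x, t)  # forward sees only the sum
        f = gate(step)
        A -= lr * (1.0 - f) * g           # kept part of the update
        A_hack -= lr * f * g              # routed part of the update
    return A, A_hack

q_e = 0.5
A_van, A_van_hack = train(lambda step: 0.0)   # vanilla: quarantine stays empty
A_dep, A_quar = train(lambda step: q_e)       # constant scalar routing

# Train-time function is unchanged: the summed adapter equals vanilla.
sum_matches_vanilla = np.allclose(A_dep + A_quar, A_van)
# Deploy (quarantine ablated) is vanilla scaled by (1 - q_e): pure shrinkage.
deploy_is_shrunk_vanilla = np.allclose(A_dep, (1.0 - q_e) * A_van)
```

Under these assumptions both flags should come out true, which is the shrinkage null in its simplest form. Replacing the constant gate with one that depends on the rollout, or giving the quarantine its own `B_hack`, is where the equality would be expected to break.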