spec: T4 leakage-metric design (SGTM ratio form) + defer L1 knob with reasoning

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-05-31 11:28:47 +00:00
parent 2b020c95c0
commit 20f8630848
@@ -199,6 +199,34 @@ eps-guarded division. Arm B remains the cleaner bet.
(NOT A1 stale / A2 two-pass; see Log 2026-05-31). gate identity sum_i g_i ==
delta_S.grad holds by construction. smoke routing2_grad passes (||B_q||=0.109).
- [ ] T4 (R4): leakage metric + L1 knob (`lambda_l1`, default 0).
DESIGN (2026-05-31, grounded in SGTM eq.1 = |D_forget_standard| / |D_forget_unlabeled|,
0=perfect localization, 1=all leaked, data-filtering reads 1 by construction):
- Operationalize as a per-mode RATIO, not SGTM's full calibration curve. SGTM
maps ablated forget-loss onto an equivalent forget-token count via a curve of
standard runs at varying forget exposure; we can't afford a curve of vanilla
checkpoints, so matched-step vanilla is the natural single-point denominator:
leakage_mode = clamp( hack_deploy(route2, quarantine deleted)
/ hack_deploy(vanilla, matched steps/substrate), 0, 1 )
Held-out modes are the load-bearing cut: route2 should read <<1 (localized the
UNKNOWN hack into the deleted quarantine), data-filtering reads ~1 (can't filter
what its weak detector never flagged). Report per mode, split known/held-out,
headline = mean held-out leakage.
- Home = post-hoc cross-run analysis (reads two run artifacts), NOT inside train.py
(one run lacks the other arm's numbers). Likely a `scripts/leakage.py` or a
block in the plot path.
- Groundwork needed first (both pure additions, smoke before use):
(a) write per-mode deploy eval as STRUCTURED json artifact per route/route2 run
(currently only logger.info at train.py:1861-1867; nothing on disk).
(b) make VANILLA emit the same per-mode eval on the fixed eval_idxs (currently
the deploy-eval block is route/route2-only; vanilla hack_deploy=nan), so the
denominator exists. Adds one eval pass to vanilla runs (minor cost).
- UNVALIDATABLE until route2 runs (jobs 34/35) produce deploy numbers; vanilla
denominator = run 30. Implement + smoke once those land, not before.
- L1 knob DEFERRED with reason: an L1 penalty on B_q/delta_S adds a SECOND
training objective (a loss-balancing problem; violates one-driving-principle). It
defaults to 0 (off) and is an optional ignition aid we may never need. Hold until
the leakage number shows the quarantine fails to ignite without it. Not blocking
the headline proof.
- [ ] T5: 5-arm sweep at matched seed/steps: vanilla, erase, route-additive(old),
route2-grad, route2-act. Plus random-V control (#157) on the old route.
- [ ] T6: all-arms per-mode deploy overlay plot (`just plot`).
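
Sketch for T4 (not implemented, unvalidated until the route2 and vanilla deploy numbers exist): a minimal post-hoc version of the per-mode leakage ratio from the DESIGN note. Assumptions, none of which are in the repo yet: each run writes a JSON artifact shaped like `{"per_mode": {mode_name: hack_deploy}}`; the function names `load_per_mode`, `leakage_ratio`, and `leakage_report` are hypothetical; a zero or nan vanilla denominator yields nan rather than a clamped value, which is one possible guard and not a settled choice.

```python
import json
import math
from pathlib import Path


def load_per_mode(path):
    """Load {mode: hack_deploy} from a run artifact (hypothetical JSON schema)."""
    return json.loads(Path(path).read_text())["per_mode"]


def leakage_ratio(route2_rate, vanilla_rate):
    """clamp(route2 / vanilla, 0, 1); nan when either input is unusable."""
    # Assumption: an undefined denominator is reported as nan, not forced to 0 or 1.
    if math.isnan(route2_rate) or math.isnan(vanilla_rate) or vanilla_rate <= 0.0:
        return float("nan")
    return min(1.0, max(0.0, route2_rate / vanilla_rate))


def _mean(values):
    """Mean over non-nan values; nan if nothing is left."""
    finite = [v for v in values if not math.isnan(v)]
    return sum(finite) / len(finite) if finite else float("nan")


def leakage_report(route2, vanilla, held_out_modes):
    """Per-mode leakage, split known/held-out, headline = mean held-out leakage."""
    # Only modes evaluated in both runs can form a ratio.
    shared = sorted(set(route2) & set(vanilla))
    per_mode = {m: leakage_ratio(route2[m], vanilla[m]) for m in shared}
    held = {m: v for m, v in per_mode.items() if m in held_out_modes}
    known = {m: v for m, v in per_mode.items() if m not in held_out_modes}
    return {
        "known": known,
        "held_out": held,
        "mean_known_leakage": _mean(known.values()),
        "headline_mean_held_out_leakage": _mean(held.values()),
    }
```

Intended use would be `leakage_report(load_per_mode(route2_json), load_per_mode(vanilla_json), held_out_modes)`, where both paths point at artifacts from groundwork (a) and (b) and `held_out_modes` is the set of mode names the weak detector never flagged. Excluding nan modes from the means keeps a single missing denominator from poisoning the headline, at the cost of hiding it; the per-mode dicts still show the nan.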