spec: T4 leakage-metric design (SGTM ratio form) + defer L1 knob with reasoning

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:15:35 +08:00 · 2026-05-31 11:28:47 +00:00
parent 2b020c95c0
commit 20f8630848
1 changed files with 28 additions and 0 deletions
@@ -199,6 +199,34 @@ eps-guarded division. Arm B remains the cleaner bet.
  (NOT A1 stale / A2 two-pass; see Log 2026-05-31). gate identity sum_i g_i ==
  delta_S.grad holds by construction. smoke routing2_grad passes (||B_q||=0.109).
 - [ ] T4 (R4): leakage metric + L1 knob (`lambda_l1`, default 0).
+  DESIGN (2026-05-31, grounded in SGTM eq.1 = |D_forget_standard| / |D_forget_unlabeled|,
+  0=perfect localization, 1=all leaked, data-filtering reads 1 by construction):
+  - Operationalize as a per-mode RATIO, not SGTM's full calibration curve. SGTM
+    maps ablated forget-loss onto an equivalent forget-token count via a curve of
+    standard runs at varying forget exposure; we can't afford a curve of vanilla
+    checkpoints, so matched-step vanilla is the natural single-point denominator:
+      leakage_mode = clamp( hack_deploy(route2, quarantine deleted)
+                            / hack_deploy(vanilla, matched steps/substrate), 0, 1 )
+    Held-out modes are the load-bearing cut: route2 should read <<1 (localized the
+    UNKNOWN hack into the deleted quarantine), data-filtering reads ~1 (can't filter
+    what its weak detector never flagged). Report per mode, split known/held-out,
+    headline = mean held-out leakage.
+  - Home = post-hoc cross-run analysis (reads two run artifacts), NOT inside train.py
+    (one run lacks the other arm's numbers). Likely a `scripts/leakage.py` or a
+    block in the plot path.
+  - Groundwork needed first (both pure additions, smoke before use):
+    (a) write per-mode deploy eval as STRUCTURED json artifact per route/route2 run
+        (currently only logger.info at train.py:1861-1867; nothing on disk).
+    (b) make VANILLA emit the same per-mode eval on the fixed eval_idxs (currently
+        the deploy-eval block is route/route2-only; vanilla hack_deploy=nan), so the
+        denominator exists. Adds one eval pass to vanilla runs (minor cost).
+  - UNVALIDATABLE until route2 runs (jobs 34/35) produce deploy numbers; vanilla
+    denominator = run 30. Implement + smoke once those land, not before.
+  - L1 knob DEFERRED with reason: an L1 penalty on B_q/delta_S is a SECOND
+    training objective (balancing problem, violates one-driving-principle). It
+    defaults to 0 (off): an optional ignition aid we may never need. Hold until the
+    leakage number shows the quarantine fails to ignite without it. Not blocking
+    the headline proof.
 - [ ] T5: 5-arm sweep at matched seed/steps: vanilla, erase, route-additive(old),
  route2-grad, route2-act. Plus random-V control (#157) on the old route.
 - [ ] T6: all-arms per-mode deploy overlay plot (`just plot`).
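
Sketch of the T4 post-hoc leakage computation described in the DESIGN note above. This is a rough illustration, not the repo's code: it assumes each run's per-mode hack_deploy values have already been loaded into a plain dict keyed by mode name (the on-disk artifact does not exist yet, so its schema is hypothetical). The function names, the `eps` guard on the denominator, and the nan handling are assumptions too.

```python
import math

EPS = 1e-8  # assumed guard against a zero vanilla denominator


def leakage_per_mode(route2, vanilla, eps=EPS):
    """clamp(hack_deploy(route2) / hack_deploy(vanilla), 0, 1) for each mode.

    route2, vanilla: dicts mapping mode name -> hack_deploy rate
    (hypothetical layout). Modes missing from vanilla, or with a nan on
    either side, yield nan rather than a made-up ratio.
    """
    out = {}
    for mode, num in route2.items():
        den = vanilla.get(mode, float("nan"))
        if math.isnan(num) or math.isnan(den):
            out[mode] = float("nan")
            continue
        ratio = num / max(den, eps)
        out[mode] = min(max(ratio, 0.0), 1.0)
    return out


def mean_leakage(leakage, modes):
    """Mean leakage over the given modes, skipping nan entries."""
    vals = [leakage[m] for m in modes
            if m in leakage and not math.isnan(leakage[m])]
    return sum(vals) / len(vals) if vals else float("nan")


def report(route2, vanilla, held_out):
    """Per-mode leakage split known/held-out; headline = mean held-out."""
    leak = leakage_per_mode(route2, vanilla)
    known = [m for m in leak if m not in held_out]
    return {
        "per_mode": leak,
        "mean_known": mean_leakage(leak, known),
        "headline_mean_held_out": mean_leakage(leak, list(held_out)),
    }
```

With this eps guard, a mode where vanilla shows no hack at all reads 0 when route2 is also 0 and 1 otherwise; reporting nan for that case may be the more honest choice.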