From 20f863084839170fdf8ace6d9145cd30ae994a98 Mon Sep 17 00:00:00 2001
From: wassname
Date: Sun, 31 May 2026 11:28:47 +0000
Subject: [PATCH] spec: T4 leakage-metric design (SGTM ratio form) + defer L1 knob with reasoning

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 .../20260531_routing_v2_distinct_basis.md | 28 +++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/docs/spec/20260531_routing_v2_distinct_basis.md b/docs/spec/20260531_routing_v2_distinct_basis.md
index 194ce55..594b827 100644
--- a/docs/spec/20260531_routing_v2_distinct_basis.md
+++ b/docs/spec/20260531_routing_v2_distinct_basis.md
@@ -199,6 +199,34 @@ eps-guarded division. Arm B remains the cleaner bet.
   (NOT A1 stale / A2 two-pass; see Log 2026-05-31). gate identity sum_i g_i == delta_S.grad
   holds by construction. smoke routing2_grad passes (||B_q||=0.109).
 - [ ] T4 (R4): leakage metric + L1 knob (`lambda_l1`, default 0).
+  DESIGN (2026-05-31, grounded in SGTM eq.1 = |D_forget_standard| / |D_forget_unlabeled|,
+  0=perfect localization, 1=all leaked, data-filtering reads 1 by construction):
+  - Operationalize as a per-mode RATIO, not SGTM's full calibration curve. SGTM
+    maps ablated forget-loss onto an equivalent forget-token count via a curve of
+    standard runs at varying forget exposure; we can't afford a curve of vanilla
+    checkpoints, and matched-step vanilla is the natural single-point denominator:
+      leakage_mode = clamp( hack_deploy(route2, quarantine deleted)
+                            / hack_deploy(vanilla, matched steps/substrate), 0, 1 )
+    Held-out modes are the load-bearing cut: route2 should read <<1 (localized the
+    UNKNOWN hack into the deleted quarantine), data-filtering reads ~1 (can't filter
+    what its weak detector never flagged). Report per mode, split known/held-out,
+    headline = mean held-out leakage.
+  - Home = post-hoc cross-run analysis (reads two run artifacts), NOT inside train.py
+    (one run lacks the other arm's numbers). Likely a `scripts/leakage.py` or a
+    block in the plot path.
+  - Groundwork needed first (both pure additions, smoke before use):
+    (a) write per-mode deploy eval as STRUCTURED json artifact per route/route2 run
+        (currently only logger.info at train.py:1861-1867; nothing on disk).
+    (b) make VANILLA emit the same per-mode eval on the fixed eval_idxs (currently
+        the deploy-eval block is route/route2-only; vanilla hack_deploy=nan), so the
+        denominator exists. Adds one eval pass to vanilla runs (~minor).
+  - UNVALIDATABLE until route2 runs (jobs 34/35) produce deploy numbers; vanilla
+    denominator = run 30. Implement + smoke once those land, not before.
+  - L1 knob DEFERRED with reason: adding an L1 penalty on B_q/delta_S is a SECOND
+    training objective (balancing problem, violates one-driving-principle). It's
+    default 0 = off, an optional ignition aid we may never need. Hold until the
+    leakage number shows the quarantine fails to ignite without it. Not blocking
+    the headline proof.
 - [ ] T5: 5-arm sweep at matched seed/steps: vanilla, erase, route-additive(old),
   route2-grad, route2-act. Plus random-V control (#157) on the old route.
 - [ ] T6: all-arms per-mode deploy overlay plot (`just plot`).
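
Note on the T4 ratio: a minimal Python sketch of the per-mode leakage computation described in the hunk above, under stated assumptions. The function names (`leakage_by_mode`, `headline_leakage`), the plain `{mode: hack_deploy}` dict shape, and the choice to report NaN when the vanilla denominator is missing or near zero are all hypothetical; the spec defines only the clamped ratio and the held-out mean, not an artifact schema or an edge-case policy.

```python
import math
from statistics import mean


def leakage_by_mode(route2, vanilla, eps=1e-8):
    """Per-mode leakage = clamp(route2 / vanilla, 0, 1).

    Both arguments are hypothetical {mode_name: hack_deploy_rate} dicts, e.g.
    loaded from the per-run JSON artifacts proposed in groundwork item (a).
    """
    out = {}
    for mode, denom in vanilla.items():
        num = route2.get(mode, math.nan)
        if math.isnan(num) or math.isnan(denom) or denom <= eps:
            # Assumption: with no usable vanilla hack rate the ratio is
            # undefined, so report NaN rather than a misleading 0 or 1.
            out[mode] = math.nan
        else:
            out[mode] = min(1.0, max(0.0, num / denom))
    return out


def headline_leakage(leakage, held_out_modes):
    """Mean leakage over held-out modes, skipping undefined (NaN) entries."""
    vals = [v for m, v in leakage.items()
            if m in held_out_modes and not math.isnan(v)]
    return mean(vals) if vals else math.nan


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    route2 = {"mode_a": 0.05, "mode_b": 0.5, "mode_c": 0.9}
    vanilla = {"mode_a": 0.5, "mode_b": 0.5, "mode_c": 0.6}
    leak = leakage_by_mode(route2, vanilla)
    print(leak, headline_leakage(leak, {"mode_a", "mode_c"}))
```

Keeping both functions free of file I/O fits the post-hoc, cross-run placement the spec proposes: a script can load the two run artifacts however they end up being written and pass plain dicts in.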