mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
spec: T4 leakage-metric design (SGTM ratio form) + defer L1 knob with reasoning
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -199,6 +199,34 @@ eps-guarded division. Arm B remains the cleaner bet.
(NOT A1 stale / A2 two-pass; see Log 2026-05-31). gate identity sum_i g_i ==
delta_S.grad holds by construction. smoke routing2_grad passes (||B_q||=0.109).
- [ ] T4 (R4): leakage metric + L1 knob (`lambda_l1`, default 0).
DESIGN (2026-05-31, grounded in SGTM eq.1 = |D_forget_standard| / |D_forget_unlabeled|,
0=perfect localization, 1=all leaked, data-filtering reads 1 by construction):
- Operationalize as a per-mode RATIO, not SGTM's full calibration curve. SGTM
maps ablated forget-loss onto an equivalent forget-token count via a curve of
standard runs at varying forget exposure; we can't afford a curve of vanilla
checkpoints, and matched-step vanilla is the natural single-point denominator:
leakage_mode = clamp( hack_deploy(route2, quarantine deleted)
/ hack_deploy(vanilla, matched steps/substrate), 0, 1 )
Held-out modes are the load-bearing cut: route2 should read <<1 (localized the
UNKNOWN hack into the deleted quarantine), data-filtering reads ~1 (can't filter
what its weak detector never flagged). Report per mode, split known/held-out,
headline = mean held-out leakage.
- Home = post-hoc cross-run analysis (reads two run artifacts), NOT inside train.py
(one run lacks the other arm's numbers). Likely a `scripts/leakage.py` or a
block in the plot path.
- Groundwork needed first (both pure additions, smoke before use):
(a) write per-mode deploy eval as STRUCTURED json artifact per route/route2 run
(currently only logger.info at train.py:1861-1867; nothing on disk).
(b) make VANILLA emit the same per-mode eval on the fixed eval_idxs (currently
the deploy-eval block is route/route2-only; vanilla hack_deploy=nan), so the
denominator exists. Adds one eval pass to vanilla runs (~minor).
- UNVALIDATABLE until route2 runs (jobs 34/35) produce deploy numbers; vanilla
denominator = run 30. Implement + smoke once those land, not before.
- L1 knob DEFERRED with reason: adding an L1 penalty on B_q/delta_S is a SECOND
training objective (balancing problem, violates one-driving-principle). It's
default 0 = off, an optional ignition aid we may never need. Hold until the
leakage number shows the quarantine fails to ignite without it. Not blocking
the headline proof.
- [ ] T5: 5-arm sweep at matched seed/steps: vanilla, erase, route-additive(old),
route2-grad, route2-act. Plus random-V control (#157) on the old route.
- [ ] T6: all-arms per-mode deploy overlay plot (`just plot`).
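
The T4 leakage ratio could be computed post hoc along these lines. This is a minimal sketch, not the repo's code: the function names, the `eps` default, and the assumed artifact shape (a flat JSON object mapping mode name to `hack_deploy` rate, one file per run) are all hypothetical, since the spec notes that no such artifact is written to disk yet.

```python
import json
import math
from pathlib import Path


def load_modes(path):
    """Read one run's per-mode deploy eval (ASSUMED schema: {mode: hack_deploy})."""
    return json.loads(Path(path).read_text())


def leakage_per_mode(route2, vanilla, eps=1e-8):
    """Per-mode leakage = clamp(hack_deploy(route2) / hack_deploy(vanilla), 0, 1).

    `route2` is the run evaluated with the quarantine deleted; `vanilla` is the
    matched-step baseline that supplies the single-point denominator.
    """
    out = {}
    for mode, denom in vanilla.items():
        num = route2.get(mode)
        if num is None or math.isnan(num) or math.isnan(denom):
            continue  # mode missing or nan in one arm: no ratio to report
        ratio = num / max(denom, eps)  # eps guards a zero vanilla hack rate
        out[mode] = min(max(ratio, 0.0), 1.0)
    return out


def headline(leak, held_out):
    """Mean leakage over held-out modes only; nan if none are present."""
    vals = [v for mode, v in leak.items() if mode in held_out]
    return sum(vals) / len(vals) if vals else float("nan")
```

With hypothetical inputs `route2 = {"known_a": 0.125, "held_b": 0.5}` and `vanilla = {"known_a": 0.5, "held_b": 0.25}`, the known mode reads 0.25 and the held-out mode clamps to 1.0, so the held-out headline is 1.0. Skipping nan modes rather than treating them as zero avoids reporting a spurious "perfect localization" while vanilla still emits `hack_deploy=nan`.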