mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 19:47:33 +08:00
feat: T5 eval-time ablation for route + fix route deployment invariant
T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval
(hack_abl/solve_abl cols, appended so results.py indices unchanged) every
--eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF.
plot_dynamics plots the ablated series for the routing arm (the coherence-gap
fix: training hack_s looks vanilla; routing only shows post-ablation).

External-review fixes (docs/spec/20260530_code_review.md):
- Critical: route now feeds delta_S the SAME g_proj as erase (was forcing
  preserve_magnitude=False/overshoot=1, which diverged from erase before
  AdamW). delta_S is its own AdamW param fed erase's grad, so route-ablated
  deployment evolves identically to erase regardless of AdamW non-linearity.
  Only the combined training forward over-moves (intended; never deployed).
  Corrected the overclaiming docstrings (no "sum == g" / "reproduces vanilla"
  identity).
- Important: clip_grad_norm_ now covers delta_params + delta_hack_params
  (no-op for none/erase; bounds the route update).
- Important: results.py paired-delta table includes routing (keyed on arm).

smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7,
ROUTE EVAL BLUF prints.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
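
The eval-time ablation described in the commit message can be illustrated with a small stand-alone sketch. This is not the repository's ablate_quarantine (its body is not part of this diff); the `ablate` helper, the toy `forward`, and the list-of-floats parameters are all hypothetical stand-ins, assuming only that ablation means "zero delta_S_hack for the duration of the eval, then restore it".

```python
from contextlib import contextmanager


@contextmanager
def ablate(params: dict[str, list[float]], name: str):
    """Temporarily zero params[name]; restore the saved values on exit."""
    saved = list(params[name])
    params[name][:] = [0.0] * len(saved)
    try:
        yield params
    finally:
        # Restore even if the eval raises, so training continues unchanged.
        params[name][:] = saved


def forward(params: dict[str, list[float]], x: float) -> float:
    # Toy "combined forward": both deltas contribute to the output.
    return x + sum(params["delta_S"]) + sum(params["delta_S_hack"])


# Hypothetical parameter values, chosen only for illustration.
params = {"delta_S": [0.5, 0.25], "delta_S_hack": [1.0, 2.0]}

kept = forward(params, 1.0)          # training-time (combined) forward
with ablate(params, "delta_S_hack"):
    ablated = forward(params, 1.0)   # deployment-style forward
restored = forward(params, 1.0)      # back to the combined forward
```

The try/finally is the design point: a periodic eval that runs mid-training must leave the quarantined parameter exactly as it found it.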
@@ -13,10 +13,19 @@ to diverge from the (refreshed) v_hack.
 Data source: logs/*.log per-step rows (the durable source results.py also uses).
 We parse by HEADER NAME, not fixed index, because newer runs add columns (refr).

-Arm classification (from the argv line):
-  vanilla          arm=vanilla
+Arm classification (from the preset line `arm=`, covering old --arm and new
+--intervention logs):
+  vanilla          arm=vanilla (intervention=none)
   static erasure   arm=projected, no --vhack-refresh-every (frozen v_hack)
   online erasure   arm=projected, --vhack-refresh-every=N>0 (re-extracted)
+  routing          arm=routing (intervention=route)
+
+For routing we plot the ABLATED-eval hack/solve (hack_abl/solve_abl, measured
+with delta_S_hack zeroed every --eval-ablate-every steps), NOT the training-time
+hack_s: the routed forward still hacks during training, so the training curve
+would falsely read "route doesn't work". The ablated curve is the deployment
+model. (none/erase plot training-time hack_s; their intervention acts at train
+time.)

 Usage:
     uv run python scripts/plot_dynamics.py logs/*converge*.log
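
The arm table in the docstring above can be read as a small decision rule. The sketch below is a hypothetical restatement of that table, not the repository's classify() (whose body is outside this diff); the function name `arm_label` and the `refresh_every` argument are assumptions made for illustration.

```python
def arm_label(arm: str, refresh_every: int = 0) -> str:
    """Map an `arm=` value (plus the v_hack refresh interval) to a plot label."""
    if arm == "vanilla":
        return "vanilla"
    if arm == "routing":
        return "routing"
    if arm == "projected":
        # Frozen v_hack vs. periodically re-extracted v_hack.
        return "online erasure" if refresh_every > 0 else "static erasure"
    raise ValueError(f"unknown arm: {arm!r}")


labels = [
    arm_label("vanilla"),
    arm_label("projected"),
    arm_label("projected", refresh_every=50),
    arm_label("routing"),
]
```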
@@ -82,7 +91,10 @@ def parse_log(path: Path) -> dict | None:
     series: dict[str, list[float]] = defaultdict(list)
     steps: list[int] = []
-    wanted = {**RATE_COLS, **COS_COLS}
+    # Also parse the route ablated-eval columns when present (older logs lack
+    # them -> skip). For routing we plot THESE, not the training-time hack_s.
+    abl = {"hack_abl", "solve_abl"} & set(idx)
+    wanted = {**RATE_COLS, **COS_COLS, **{c: c for c in abl}}
     for line in txt.splitlines():
         if "| INFO |" not in line:
             continue
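
A minimal sketch of the parse-by-header-name idea used in this hunk, showing why optional columns such as hack_abl can be picked up when present and skipped when absent. The whitespace-separated row format and the helper names are assumptions for illustration; the real log rows (the `| INFO |` lines) are not reproduced in this diff.

```python
def index_by_header(header: str) -> dict[str, int]:
    """Map each column name to its position in the header row."""
    return {name: i for i, name in enumerate(header.split())}


def parse_rows(header: str, rows: list[str], wanted: list[str]) -> dict[str, list[float]]:
    idx = index_by_header(header)
    # Keep only the wanted columns this particular log actually has.
    present = [c for c in wanted if c in idx]
    series: dict[str, list[float]] = {c: [] for c in present}
    for row in rows:
        cells = row.split()
        for c in present:
            series[c].append(float(cells[idx[c]]))
    return series


# Hypothetical old-format and new-format logs; the new one adds columns.
old = parse_rows("step hack_s gt_s",
                 ["0 0.10 0.50", "1 0.20 0.55"],
                 ["hack_s", "hack_abl"])
new = parse_rows("step refr hack_s gt_s hack_abl solve_abl",
                 ["0 1 0.10 0.50 0.02 0.48"],
                 ["hack_s", "hack_abl"])
```

Because lookups go through the name-to-index map, inserting `refr` before `hack_s` in newer logs does not shift which cell is read.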
@@ -94,8 +106,16 @@ def parse_log(path: Path) -> dict | None:
             series[col].append(_val(row[idx[col]]))
     if not steps:
         return None
-    return dict(arm=arm, refr=refr, seed=seed, vhack=vhack,
-                steps=np.array(steps), **{k: np.array(v, dtype=float) for k, v in series.items()})
+    run = dict(arm=arm, refr=refr, seed=seed, vhack=vhack,
+               steps=np.array(steps), **{k: np.array(v, dtype=float) for k, v in series.items()})
+    # COHERENCE-GAP FIX: route's training-time hack_s looks vanilla (the routed
+    # forward still hacks); routing's benefit only shows once delta_S_hack is
+    # ablated at eval. So for routing, plot the ablated series under the same
+    # hack_s/gt_s keys -> all downstream (panels, onset, overlay) reads it.
+    if arm == "routing" and "hack_abl" in run:
+        run["hack_s"] = run["hack_abl"]
+        run["gt_s"] = run["solve_abl"]
+    return run


 def classify(run: dict) -> str:
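
The key swap at the end of parse_log can be exercised in isolation. The sketch below assumes plain lists instead of NumPy arrays and returns a copy rather than mutating in place (the diff mutates `run`); `deployment_view` is a hypothetical name. It also shows the fallback for older routing logs that lack the ablated columns.

```python
def deployment_view(run: dict) -> dict:
    """Return a copy of `run` where routing arms expose the ablated series
    under the hack_s/gt_s keys that downstream plotting reads."""
    out = dict(run)
    if out.get("arm") == "routing" and "hack_abl" in out:
        out["hack_s"] = out["hack_abl"]
        out["gt_s"] = out["solve_abl"]
    return out


# Hypothetical values, for illustration only.
routing_new = deployment_view({
    "arm": "routing",
    "hack_s": [0.9, 0.9], "gt_s": [0.4, 0.5],
    "hack_abl": [0.1, 0.05], "solve_abl": [0.4, 0.45],
})
routing_old = deployment_view({"arm": "routing", "hack_s": [0.9], "gt_s": [0.4]})
vanilla = deployment_view({"arm": "vanilla", "hack_s": [0.8], "gt_s": [0.3],
                           "hack_abl": [0.0], "solve_abl": [0.0]})
```

Only the routing arm is remapped; other arms keep their training-time series even if ablated columns happen to be present.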