mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
fc30514b23
T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval (hack_abl/solve_abl cols, appended so results.py indices unchanged) every --eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics plots the ablated series for the routing arm (the coherence-gap fix: training hack_s looks vanilla; routing only shows post-ablation). External-review fixes (docs/spec/20260530_code_review.md): - Critical: route now feeds delta_S the SAME g_proj as erase (was forcing preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW). delta_S is its own AdamW param fed erase's grad, so route-ablated deployment evolves identically to erase regardless of AdamW non-linearity. Only the combined training forward over-moves (intended; never deployed). Corrected the overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity). - Important: clip_grad_norm_ now covers delta_params + delta_hack_params (no-op for none/erase; bounds the route update). - Important: results.py paired-delta table includes routing (keyed on arm). smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7, ROUTE EVAL BLUF prints. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
3.6 KiB
3.6 KiB
Code Review: gradient routing split and arm->intervention fallout
Summary
The algebra in src/projected_grpo/proj.py looks right: the route path forces preserve_magnitude=False and overshoot=1.0, so the projection step itself satisfies g_proj + removed == g, and the erase path still uses the same projection math as before. The main problem is downstream in train.py: once you optimize delta_S and delta_S_hack as separate AdamW parameters, that exact-split invariant no longer survives the optimizer step.
Critical (must fix)
- [src/projected_grpo/proj.py:192-194, src/projected_grpo/antipasto.py:83, src/projected_grpo/train.py:679] Route is split correctly at the grad level, but not at the update level.
delta_Sanddelta_S_hackare stepped as two independent AdamW parameters, while the forward uses their sum. AdamW is not linear in the gradient, it keeps separate moments and normalizes each tensor independently, sostep(delta_S, g_proj) + step(delta_S_hack, removed)is not equal tostep(delta, g_proj + removed). On the first AdamW step this already diverges because each branch reduces to a per-coordinate sign update, so overlapping same-sign components can get double-counted. That breaks the comment-level claim that routing preserves the original training-time movement while only quarantining it for eval. Suggested fix: keep one optimizer state on the combined adapter update and only decompose for storage/eval, or use a linear optimizer/update rule for these routed knobs.
Important (should fix)
- [src/projected_grpo/train.py:1189]
clip_grad_norm_only clipsdelta_params, notdelta_hack_params. In the route arm, the parked hack component bypasses clipping entirely, so route can take a larger total step than erase/none and the loggedgnis no longer the norm of the actual optimized gradient. Ifgrad_clipis meant to constrain the route arm too, clipdelta_params + delta_hack_paramstogether, or clip the combined grad before splitting. - [scripts/results.py:138] The arm rename itself does not break parsing,
results.pynow readsarm=from the printed preset line, which should handle both old--armlogs and new--interventionlogs. But the paired-delta section still hardcodesarm == "projected", soroutingruns silently drop out of the only direct-vs-vanilla comparison table. If routing is a first-class arm now, this summary should include it explicitly.
Suggestions
- [scripts/plot_dynamics.py:101-105, 231-262]
plot_dynamics.pyalso parsesarm=correctly, so the rename looks safe there. The remaining issue is semantic, not parsing: routing panels are drawn from raw training-timehack_s, even though the route hypothesis is about the model withdelta_S_hackablated. Until an ablated-eval series is logged, plotting routing next to vanilla/erase will be misleading.
Positive
- [src/projected_grpo/proj.py:183-194] The route/erase split in the projection code is clean. Route forces the exact-split settings, and erase still goes through the old
_project_one_modulepath, so the projection-layer behavior itself is consistent with the intended design. - [scripts/results.py:40-43, scripts/plot_dynamics.py:69-71] Switching log parsing to the printed
arm=<name>field is the right compatibility point for old and new logs.
Verdict
REQUEST CHANGES
proj.py is fine, but the current train.py wiring means route is not actually preserving the original update after AdamW, and its quarantined branch bypasses grad clipping. Fix those two first; the arm rename itself looks parse-safe.
[?2026h[r[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l