Files
evil_MoE/docs/spec/20260530_code_review.md
T
wassname fc30514b23 feat: T5 eval-time ablation for route + fix route deployment invariant
T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval
(hack_abl/solve_abl cols, appended so results.py indices unchanged) every
--eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics
plots the ablated series for the routing arm (the coherence-gap fix: training
hack_s looks vanilla; routing only shows post-ablation).

External-review fixes (docs/spec/20260530_code_review.md):
- Critical: route now feeds delta_S the SAME g_proj as erase (was forcing
  preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW).
  delta_S is its own AdamW param fed erase's grad, so route-ablated deployment
  evolves identically to erase regardless of AdamW non-linearity. Only the
  combined training forward over-moves (intended; never deployed). Corrected the
  overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity).
- Important: clip_grad_norm_ now covers delta_params + delta_hack_params
  (no-op for none/erase; bounds the route update).
- Important: results.py paired-delta table includes routing (keyed on arm).

smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7,
ROUTE EVAL BLUF prints.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:50:53 +00:00

3.6 KiB

Code Review: gradient routing split and arm->intervention fallout

Summary

The algebra in src/projected_grpo/proj.py looks right: the route path forces preserve_magnitude=False and overshoot=1.0, so the projection step itself satisfies g_proj + removed == g, and the erase path still uses the same projection math as before. The main problem is downstream in train.py: once you optimize delta_S and delta_S_hack as separate AdamW parameters, that exact-split invariant no longer survives the optimizer step.

Critical (must fix)

  • [src/projected_grpo/proj.py:192-194, src/projected_grpo/antipasto.py:83, src/projected_grpo/train.py:679] Route is split correctly at the grad level, but not at the update level. delta_S and delta_S_hack are stepped as two independent AdamW parameters, while the forward uses their sum. AdamW is not linear in the gradient, it keeps separate moments and normalizes each tensor independently, so step(delta_S, g_proj) + step(delta_S_hack, removed) is not equal to step(delta, g_proj + removed). On the first AdamW step this already diverges because each branch reduces to a per-coordinate sign update, so overlapping same-sign components can get double-counted. That breaks the comment-level claim that routing preserves the original training-time movement while only quarantining it for eval. Suggested fix: keep one optimizer state on the combined adapter update and only decompose for storage/eval, or use a linear optimizer/update rule for these routed knobs.

Important (should fix)

  • [src/projected_grpo/train.py:1189] clip_grad_norm_ only clips delta_params, not delta_hack_params. In the route arm, the parked hack component bypasses clipping entirely, so route can take a larger total step than erase/none and the logged gn is no longer the norm of the actual optimized gradient. If grad_clip is meant to constrain the route arm too, clip delta_params + delta_hack_params together, or clip the combined grad before splitting.
  • [scripts/results.py:138] The arm rename itself does not break parsing, results.py now reads arm= from the printed preset line, which should handle both old --arm logs and new --intervention logs. But the paired-delta section still hardcodes arm == "projected", so routing runs silently drop out of the only direct-vs-vanilla comparison table. If routing is a first-class arm now, this summary should include it explicitly.

Suggestions

  • [scripts/plot_dynamics.py:101-105, 231-262] plot_dynamics.py also parses arm= correctly, so the rename looks safe there. The remaining issue is semantic, not parsing: routing panels are drawn from raw training-time hack_s, even though the route hypothesis is about the model with delta_S_hack ablated. Until an ablated-eval series is logged, plotting routing next to vanilla/erase will be misleading.

Positive

  • [src/projected_grpo/proj.py:183-194] The route/erase split in the projection code is clean. Route forces the exact-split settings, and erase still goes through the old _project_one_module path, so the projection-layer behavior itself is consistent with the intended design.
  • [scripts/results.py:40-43, scripts/plot_dynamics.py:69-71] Switching log parsing to the printed arm=<name> field is the right compatibility point for old and new logs.

Verdict

REQUEST CHANGES

proj.py is fine, but the current train.py wiring means route is not actually preserving the original update after AdamW, and its quarantined branch bypasses grad clipping. Fix those two first; the arm rename itself looks parse-safe. [?2026h[?1006l[?1002l[?1000l[?1007h[?1049l[<999u[>4;0m[?2026l