mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:30:41 +08:00
227c173f63
scripts/tt_erase_bench.py: erase the hack direction from a FINISHED vanilla delta_S checkpoint at deploy, two flavors sharing eval_hack_solve:

- weight: project delta_S orthogonal to gradient-space v_hack (= erase arm applied once at the end instead of every step; reuses load_v_hack)
- act: residual diff-of-means hack direction ablated at every layer (Arditi), auto-sourced at the most-separating layer, from the same weak-detector pairs

Reports hack AND solve per arm so a blunt-erasure (solve also tanks) is visible.
Baseline for whether train-time routing beats cheap post-hoc erasure.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
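
Illustrative sketch of the two erasure flavors described above. This is not the code in scripts/tt_erase_bench.py: the helper names (`project_out`, `diff_of_means`), the flat-vector treatment of delta_S and v_hack, and the toy numbers are all assumptions made for illustration. It uses plain Python lists to stay dependency-free; a real run would presumably work on tensors.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    # Normalize v to unit length.
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def project_out(x, direction):
    # Remove the component of x along `direction` (orthogonal projection).
    u = unit(direction)
    c = dot(x, u)
    return [xi - c * ui for xi, ui in zip(x, u)]

def diff_of_means(pos, neg):
    # Unit vector from the mean of `neg` rows to the mean of `pos` rows.
    dim = len(pos[0])
    mean_pos = [sum(r[i] for r in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(r[i] for r in neg) / len(neg) for i in range(dim)]
    return unit([a - b for a, b in zip(mean_pos, mean_neg)])

# "weight" flavor: project a (hypothetical, flattened) delta_S orthogonal
# to a (hypothetical) v_hack, once, after training has finished.
delta = [1.0, 2.0, 3.0]
v_hack = [0.0, 0.0, 2.0]
delta_erased = project_out(delta, v_hack)  # [1.0, 2.0, 0.0]

# "act" flavor: estimate a hack direction as a difference of means over
# paired activations, then ablate it from each activation vector.
hack_acts = [[1.0, 0.0, 5.0], [3.0, 0.0, 5.0]]   # toy "hack" activations
clean_acts = [[1.0, 0.0, 1.0], [3.0, 0.0, 1.0]]  # toy "clean" activations
d = diff_of_means(hack_acts, clean_acts)          # [0.0, 0.0, 1.0]
ablated = [project_out(h, d) for h in hack_acts]  # [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
```

Both flavors reduce to the same projection; they differ in what is projected (the weight delta versus residual activations) and in where the direction comes from (gradient space versus a difference of activation means). Choosing the most-separating layer is not shown here.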