evil_MoE/scripts
wassname 227c173f63 feat: test-time (post-hoc) hack-erasure benchmark
scripts/tt_erase_bench.py: erase the hack direction from a FINISHED vanilla
delta_S checkpoint at deploy time, in two flavors that share eval_hack_solve:
- weight: project delta_S orthogonal to gradient-space v_hack (= erase arm
  applied once at the end instead of every step; reuses load_v_hack)
- act: residual diff-of-means hack direction ablated at every layer (Arditi),
  auto-sourced at the most-separating layer, from the same weak-detector pairs
Reports hack AND solve per arm so that blunt erasure (solve also tanks) is visible.
Baseline for whether train-time routing beats cheap post-hoc erasure.
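
A minimal sketch of the two erasure flavors described above, in NumPy rather than
the script's own code. Everything here is an assumption for illustration: the
helper names (project_out, diff_of_means_direction, most_separating_layer) are
hypothetical and are not functions from tt_erase_bench.py, delta_S is treated as
a flat vector, v_hack as a single direction of the same shape, and the layer
score (mean gap divided by pooled spread) is one plausible reading of
"most-separating", not necessarily the one the script uses.

```python
import numpy as np


def project_out(x, v):
    """Remove the component of x along direction v (applied over the last axis)."""
    v_hat = v / np.linalg.norm(v)
    # Coefficient of each row of x along v_hat; works for 1-D and [n, d] inputs.
    coeff = np.asarray(x @ v_hat)[..., None]
    return x - coeff * v_hat


def diff_of_means_direction(hack_acts, clean_acts):
    """Unit diff-of-means direction between two activation sets of shape [n, d]."""
    d = hack_acts.mean(axis=0) - clean_acts.mean(axis=0)
    return d / np.linalg.norm(d)


def most_separating_layer(hack_by_layer, clean_by_layer, eps=1e-8):
    """Pick the layer whose class means are furthest apart relative to spread.

    ASSUMPTION: this score is illustrative; the source does not say how the
    most-separating layer is chosen.
    """
    scores = []
    for h, c in zip(hack_by_layer, clean_by_layer):
        gap = np.linalg.norm(h.mean(axis=0) - c.mean(axis=0))
        spread = np.sqrt(0.5 * (h.var(axis=0).mean() + c.var(axis=0).mean()))
        scores.append(gap / (spread + eps))
    return int(np.argmax(scores))


if __name__ == "__main__":
    rng = np.random.default_rng(0)

    # Weight flavor: project the finished delta once, orthogonal to v_hack.
    delta_S = rng.normal(size=16)   # stand-in for the trained delta
    v_hack = rng.normal(size=16)    # stand-in for the gradient-space direction
    delta_erased = project_out(delta_S, v_hack)

    # Act flavor: source the direction at one layer, ablate it at every layer.
    clean = [rng.normal(size=(64, 8)) for _ in range(3)]
    hack = [rng.normal(size=(64, 8)) + 1.0 for _ in range(3)]
    layer = most_separating_layer(hack, clean)
    r = diff_of_means_direction(hack[layer], clean[layer])
    ablated = [project_out(h, r) for h in hack]
```

In both flavors the operation is the same rank-one projection; only the space
(weights vs. residual activations) and the source of the direction differ.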

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 02:20:51 +00:00