mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 21:22:15 +08:00
51ee43577a
Fill the \TODO{interp} in tab:ablation caption: post-hoc erase exposes how weak v_hack is
(weight-erase 0.39->0.30; act-erase zeroes hack only by collapsing solve), yet the same
direction drives route to 0 deploy hack, because routing only needs to discriminate hack
rollouts, not span the hack subspace -- absorption (cloud2024/sgtm2025) localises the hack into the
discarded knob. 'A detector too weak to erase a trained hack is still strong enough to route
one as it forms.'
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>