evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:30:41 +08:00

Files

T

wassname 227c173f63 feat: test-time (post-hoc) hack-erasure benchmark

scripts/tt_erase_bench.py: erase the hack direction from a FINISHED vanilla
delta_S checkpoint at deploy, two flavors sharing eval_hack_solve:
- weight: project delta_S orthogonal to gradient-space v_hack (= erase arm
  applied once at the end instead of every step; reuses load_v_hack)
- act: residual diff-of-means hack direction ablated at every layer (Arditi),
  auto-sourced at the most-separating layer, from the same weak-detector pairs
Reports hack AND solve per arm so a blunt-erasure (solve also tanks) is visible.
Baseline for whether train-time routing beats cheap post-hoc erasure.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-02 02:20:51 +00:00

audit_log.py

audit-log: print a fixed healthy-vanilla gen as a coherence yardstick

2026-06-01 01:15:25 +00:00

build_combined_pool.py

reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/)

2026-05-30 03:52:24 +00:00

make_dataset_pairsets.py

scripts

2026-05-30 04:16:56 +00:00

make_pairsets.py

wip