evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-07-01 18:38:31 +08:00

Files

wassname 97a4c5d7b1 paper: reframe lineage SGTM (mechanism) > Cloud (concept); set title

- title -> 'Quarantining Reward-Hacking Gradients with a Hacking Representation'
- contributions: (1) adapt SGTM parameter-gradient masking from supervised
  unlearning to RL reward hacking, keeping the route+ablate framing from
  gradient routing but NOT Cloud's activation .detach(); (2) replace the
  data-label mask with a RepE-extracted hack direction from ~10-21 pairs
  (live rollouts are unlabeled).
- method 'Arms': describe the route arm as SGTM-style post-backward parameter
  masking in the SVD basis, routed into a deletable subspace.
- related work: Cloud = localize-then-ablate idea only; SGTM = closest
  mechanistic relative, their TPR/FPR knob = our weak-detector axis.
- title comment flags the OPEN synthetic-pairs question (headline v_hack is
  hand-authored prog_wide, not AI-prompted).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-03 01:19:35 +00:00

blog        docs: refresh blog+README for route2/deploy-eval; embed key dynamics plot; drop sparse-only dots    2026-06-02 01:24:29 +00:00
brainstorm  ready    2026-05-23 14:19:41 +08:00
figs        misc     2026-06-02 02:06:43 +00:00
papers      wip