evil_MoE/docs
wassname 97a4c5d7b1 paper: reframe lineage SGTM (mechanism) > Cloud (concept); set title
- title -> 'Quarantining Reward-Hacking Gradients with a Hacking Representation'
- contributions: (1) adapt SGTM parameter-gradient masking from supervised
  unlearning to RL reward hacking, keeping the route+ablate framing from
  gradient routing but NOT Cloud's activation .detach(); (2) replace the
  data-label mask with a RepE-extracted hack direction from ~10-21 pairs
  (live rollouts stay unlabeled).
- method 'Arms': describe the route arm as SGTM-style post-backward parameter
  masking in the SVD basis, routed into a deletable subspace.
- related work: Cloud = localize-then-ablate idea only; SGTM = closest
  mechanistic relative, their TPR/FPR knob = our weak-detector axis.
- title comment flags the OPEN synthetic-pairs question (headline v_hack is
  hand-authored prog_wide, not AI-prompted).
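
For orientation, here is a toy sketch of the two ideas named in the bullets above: extracting a direction from a handful of contrastive pairs, then splitting a gradient into a part that is set aside and a part that is kept. It is NOT the repo's implementation. It swaps in a plain difference-of-means direction and a vector projection in place of the SVD-basis parameter masking, and the names `mean_diff_direction`, `split_gradient`, and `v_hack` as used here are illustrative assumptions.

```python
import math


def mean_diff_direction(pairs):
    """Return the unit-norm mean of (hack - clean) over contrastive pairs.

    `pairs` is a list of (hack_vec, clean_vec) tuples of equal-length
    float lists. This is a difference-of-means stand-in for a
    RepE-style extraction; the real pipeline may differ.
    """
    dim = len(pairs[0][0])
    diff = [0.0] * dim
    for hack, clean in pairs:
        for i in range(dim):
            diff[i] += (hack[i] - clean[i]) / len(pairs)
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("pairs give a zero direction")
    return [d / norm for d in diff]


def split_gradient(grad, v_hack):
    """Split `grad` into (kept, quarantined) relative to unit vector `v_hack`.

    `quarantined` is the projection of `grad` onto `v_hack`; `kept` is
    the orthogonal remainder. Assumption: a toy analogue of routing
    flagged gradient into a subspace that can later be deleted.
    """
    coeff = sum(g * v for g, v in zip(grad, v_hack))
    quarantined = [coeff * v for v in v_hack]
    kept = [g - q for g, q in zip(grad, quarantined)]
    return kept, quarantined
```

In this toy, `kept + quarantined` always reconstructs the original gradient, so dropping `quarantined` afterwards plays the role of the ablation step; how well a direction built from ~10-21 pairs isolates hacking is exactly the weak-detector question the commit message raises.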

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 01:19:35 +00:00