evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-07-01 10:17:46 +08:00

Files

wassname 97a4c5d7b1 paper: reframe lineage SGTM (mechanism) > Cloud (concept); set title

- title -> 'Quarantining Reward-Hacking Gradients with a Hacking Representation'
- contributions: (1) adapt SGTM parameter-gradient masking from supervised
  unlearning to RL reward hacking, route+ablate framing from gradient routing
  but NOT Cloud's activation .detach(); (2) replace the data-label mask with a
  RepE-extracted hack direction from ~10-21 pairs (live rollouts unlabeled).
- method 'Arms': call the route arm SGTM-style post-backward parameter masking
  in the SVD basis, with gradients routed into a deletable subspace.
- related work: Cloud = localize-then-ablate idea only; SGTM = closest
  mechanistic relative, their TPR/FPR knob = our weak-detector axis.
- title comment flags the OPEN synthetic-pairs question (headline v_hack is
  hand-authored prog_wide, not AI-prompted).
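
A rough NumPy sketch of the two ideas named in the bullets above: a hack direction estimated from a handful of contrast pairs, and a post-backward gradient mask applied in the SVD basis of a weight matrix. This is an illustration under stated assumptions, not the code in this repository. The names `hack_direction`, `route_gradient`, `quarantine_idx`, and `threshold` are hypothetical, the difference-of-means estimator stands in for whatever RepE extraction the paper uses, and the toy shapes and data are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

def hack_direction(hack_acts, clean_acts):
    """Unit-norm mean difference over paired activations (assumed RepE-style estimator)."""
    diff = (hack_acts - clean_acts).mean(axis=0)
    return diff / np.linalg.norm(diff)

def route_gradient(grad, U, Vt, quarantine_idx, is_hack):
    """Mask a parameter gradient after backward, in the SVD basis of the weight.

    Flagged (hack) gradients keep only the quarantine block of coordinates;
    unflagged gradients keep everything except that block.
    """
    coords = U.T @ grad @ Vt.T                      # gradient in SVD coordinates
    mask = np.zeros(coords.shape, dtype=bool)
    mask[np.ix_(quarantine_idx, quarantine_idx)] = True
    if not is_hack:
        mask = ~mask
    return U @ (coords * mask) @ Vt                 # back to parameter space

# Toy contrast pairs: 12 pairs whose activations differ mostly along axis 0.
d = 5
true_dir = np.eye(d)[0]
clean_acts = rng.normal(size=(12, d))
hack_acts = clean_acts + 2.0 * true_dir + 0.1 * rng.normal(size=(12, d))
v_hack = hack_direction(hack_acts, clean_acts)

# Unlabeled rollout: flag it by its projection onto v_hack (threshold is hypothetical).
threshold = 1.0
rollout_act = rng.normal(size=d) + 2.0 * true_dir
is_hack = float(rollout_act @ v_hack) > threshold

# SVD basis of a toy square weight; the last singular direction is the deletable subspace.
W = rng.normal(size=(d, d))
U, S, Vt = np.linalg.svd(W)
quarantine_idx = [d - 1]

grad = rng.normal(size=(d, d))
routed_hack = route_gradient(grad, U, Vt, quarantine_idx, is_hack=True)
routed_clean = route_gradient(grad, U, Vt, quarantine_idx, is_hack=False)

# Ablation: drop the quarantined singular direction from the weight.
keep = np.ones(d)
keep[quarantine_idx] = 0.0
W_ablated = U @ np.diag(S * keep) @ Vt
```

In this toy setup the two routed gradients partition the original gradient, so nothing is lost by routing; only the destination subspace changes with the detector's flag.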

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-03 01:19:35 +00:00

.gitignore

Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine

2026-06-02 07:21:49 +00:00

main.tex

paper: reframe lineage SGTM (mechanism) > Cloud (concept); set title

2026-06-03 01:19:35 +00:00

nips15submit_e.sty

docs(writeup): NeurIPS-workshop paper skeleton + tectonic compile recipe

2026-06-02 06:59:15 +00:00

refs.bib

write up

2026-06-02 07:20:42 +00:00