evil_MoE/docs/writeup
wassname b097d9abfc paper: add verified related work (11 refs) + fix Huang->Deng first author
Related-work search (local qmd/gh/LW + Perplexity/Gemini/ChatGPT/Elicit). All
arXiv ids verified (HTTP 200); bibtex + abstracts fetched via the bibtex MCP / arXiv scrape:
- gradient-level reward hacking: ackermann2026gradreg (GR), liu2026harve (HARVE)
- deletable-module precedent (pre-dates Cloud): zhou2023securityvectors
- gradient-projection unlearning: shamsian2025orthograd (OrthoGrad), sun2026ogpsa
- C2 generalisation: taylor2025schoolrewardhacks, nishimuragasparian2025rhgeneralize
- weight-space contrastive direction: fierro2025weightarithmetic
- shortcut gradient surgery: cao2026sart; survey: wang2026rewardhackingsurvey
- idea provenance: mallen2025rhinterventions (AF)
Fix: huang2026directional first author is Deng, Wenlong (arXiv 2605.25189);
sync the cold-reader comment to 'Deng et al.'

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 15:18:44 +08:00