mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 18:43:00 +08:00
b097d9abfc
Related-work search (local qmd/gh/LW + Perplexity/Gemini/ChatGPT/Elicit),
all arXiv ids verified HTTP 200, bibtex+abstracts via the bibtex MCP /
arXiv scrape:

- gradient-level reward hacking: ackermann2026gradreg (GR), liu2026harve (HARVE)
- deletable-module precedent (pre-dates Cloud): zhou2023securityvectors
- gradient-projection unlearning: shamsian2025orthograd (OrthoGrad), sun2026ogpsa
- C2 generalisation: taylor2025schoolrewardhacks, nishimuragasparian2025rhgeneralize
- weight-space contrastive direction: fierro2025weightarithmetic
- shortcut gradient surgery: cao2026sart; survey: wang2026rewardhackingsurvey
- idea provenance: mallen2025rhinterventions (AF)

Fix: huang2026directional first author is Deng, Wenlong (arXiv 2605.25189);
sync the cold-reader comment to 'Deng et al.'

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>