evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:43:00 +08:00

Files

T

wassname b097d9abfc paper: add verified related work (11 refs) + fix Huang->Deng first author

Related-work search (local qmd/gh/LW + Perplexity/Gemini/ChatGPT/Elicit), all
arXiv ids verified HTTP 200, bibtex+abstracts via the bibtex MCP / arXiv scrape:
- gradient-level reward hacking: ackermann2026gradreg (GR), liu2026harve (HARVE)
- deletable-module precedent (pre-dates Cloud): zhou2023securityvectors
- gradient-projection unlearning: shamsian2025orthograd (OrthoGrad), sun2026ogpsa
- C2 generalisation: taylor2025schoolrewardhacks, nishimuragasparian2025rhgeneralize
- weight-space contrastive direction: fierro2025weightarithmetic
- shortcut gradient surgery: cao2026sart; survey: wang2026rewardhackingsurvey
- idea provenance: mallen2025rhinterventions (AF)
Fix: huang2026directional first author is Deng, Wenlong (arXiv 2605.25189);
sync the cold-reader comment to 'Deng et al.'

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-04 15:18:44 +08:00

.gitignore

Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine

2026-06-02 07:21:49 +00:00

main.tex

paper: add verified related work (11 refs) + fix Huang->Deng first author

2026-06-04 15:18:44 +08:00

nips15submit_e.sty

docs(writeup): NeurIPS-workshop paper skeleton + tectonic compile recipe

2026-06-02 06:59:15 +00:00

refs.bib

paper: add verified related work (11 refs) + fix Huang->Deng first author

2026-06-04 15:18:44 +08:00