evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-28 01:45:14 +08:00

Author	SHA1	Message	Date
wassname	07e1eb8753	paper: fix build, vector figs, +2 plots, de-jargon prose	2026-06-05 14:51:48 +08:00
		- drop fontawesome5 (tectonic core-dumped on the OTF); the lone \faGithub icon was decorative
		- switch the two included figures PNG->PDF (vector; now-tracked, smaller)
		- add fig:generalisation (A5 dumbbell) next to tab:generalisation and fig:traindeploy (train-on vs deploy-off) in C1, both \ref'd
		- rename leaked config codenames in appendix tables (v_hack_full -> "weak (10 pairs)", null_city -> "random (placebo)") with paper:code mapping comments
		- de-jargon reader-facing prose per a 3-model external panel (kimi-k2.5 / gemini-3.1-pro / gpt-5.5): knob -> (auxiliary) adapter, quarantine -> isolate, no-cheat payload -> zero-label test, hack-ward -> hack-aligned, cousin/near-twin -> analogue, etc. Title metaphor left as-is.
		14 pages, zero unresolved refs.
wassname	154e33683e	paper: HARVE byline cross-verified arXiv==S2 (keyed semantic-search .env) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-04 15:20:04 +08:00
wassname	b097d9abfc	paper: add verified related work (11 refs) + fix Huang->Deng first author	2026-06-04 15:18:44 +08:00
		Related-work search (local qmd/gh/LW + Perplexity/Gemini/ChatGPT/Elicit), all arXiv ids verified HTTP 200, bibtex+abstracts via the bibtex MCP / arXiv scrape:
		- gradient-level reward hacking: ackermann2026gradreg (GR), liu2026harve (HARVE)
		- deletable-module precedent (pre-dates Cloud): zhou2023securityvectors
		- gradient-projection unlearning: shamsian2025orthograd (OrthoGrad), sun2026ogpsa
		- C2 generalisation: taylor2025schoolrewardhacks, nishimuragasparian2025rhgeneralize
		- weight-space contrastive direction: fierro2025weightarithmetic
		- shortcut gradient surgery: cao2026sart; survey: wang2026rewardhackingsurvey
		- idea provenance: mallen2025rhinterventions (AF)
		Fix: huang2026directional first author is Deng, Wenlong (arXiv 2605.25189); sync the cold-reader comment to 'Deng et al.'
		Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname	bd7550f559	paper: framed code blocks, real AntiPaSTO cite, leave-one-out ablation	2026-06-03 11:22:22 +00:00
		Formatting pass lifted from the AntiPaSTO paper (the format the author is happy with):
		- verbatim -> lstlisting (framed, shaded, Python-highlighted code blocks; chat-template prompt uses language={} so markup isn't keyword-coloured)
		- xcolor[table] + \rowcolor highlight on the 'ours' rows (keynote, ablation)
		- ablation table restructured as leave-one-out with the negate symbol (negate-routing/directional/hack-pairs/intervention); long interpretation moved out of the caption into section body; post-hoc split into its own block
		- real AntiPaSTO citation (Clark 2026, arXiv:2601.07473) replacing the UNVERIFIED placeholder; dropped the verify-before-submission TODO
		- code-availability line with a GitHub glyph (anonymous placeholder)
		Builds clean: 11 pages, no unresolved refs/cites.
		Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname	17a8792340	paper: address comprehension friction + OpenReview novelty challenge	2026-06-03 02:29:45 +00:00
		- Inline author-notes at the Cloud and Huang related-work bullets (cold-reader panel): lead Cloud with parameter-vs-activation space; state Huang's keep-vs-remove inversion plainly; flag the unmeasured hack-basis==clean-basis question as a reviewer attack vector.
		- Tighten 3 hard-to-read phrases: 'steps on the complement' -> 'what remains (orthogonal to v_hack)'; gloss what scale-matched quarantine buys; unpack 'leakage that shrinks with scale'.
		- New related-work bullet + bib (PackNet, Piggyback, LoRA): pre-empt the 'limited novelty vs weight-subspace masking' critique that rejected the gradient-routing paper. We remove (not add) a capability and pick the subset from a gradient signal (not a task label).
		Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname	cf3ecc40f8	write up	2026-06-02 07:20:42 +00:00
wassname	923de6dbe6	docs(writeup): NeurIPS-workshop paper skeleton + tectonic compile recipe	2026-06-02 06:59:15 +00:00
		Minimal LaTeX skeleton: outline + evidence tables (route2 n=3 deploy numbers filled with provenance, vanilla pending jobs 74/84) + figures + verified refs + appendix (4-mode traces, 6/6/6/6 partition counts, pseudocode). Build artifacts and figs symlinks gitignored.
		`just paper` compiles via tectonic; `just paper-qc` dumps text + greps for unresolved refs / TODOs.
		Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

7 Commits