evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:43:00 +08:00

Author	SHA1	Message	Date
wassname	ffc2df540f	blog: drop reader-facing route2 tag -> route (consistency with paper). route2 is an internal run-tag, not something a reader cares about. Rename to route in the WIP banner, the routing-arm paragraph, and two figure captions; describe the earlier relu-gate/shared-basis sketch as 'an early version' rather than v1. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-03 02:20:13 +00:00
wassname	19deef4fb9	docs: refresh blog+README for route2/deploy-eval; embed key dynamics plot; drop sparse-only dots - blog: mark as erase-n=2 draft, note route2/exploration-floor/deploy-eval are the current direction; embed dyn_sub4_hack_overlay.png (force-added); ASCII em-dashes; de-bold the arm list (#15 tell) - README: add route2 arm + apples-to-apples deploy-eval to 'What we compare'; stale banner on the n=1 mix=0.5 findings - plot_dynamics: remove _mark_if_sparse (asymmetric sparse-only dots); EMA-held line for all arms - train.py: fix 'held-out greedy' -> 'held-out eval subset, T=0.7' (deploy eval is sampled, not greedy) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-06-02 01:24:29 +00:00
wassname	d781b56ff4	docs: fix review findings (global noise-floor, route one-sided, G3 xref). External review (3 subagents) caught: - blog: noise-floor drop is GLOBAL across modules, not per-Linear (proj.py:187) - blog: route pseudocode used full c; route actually uses the same one-sided gate as erase and quarantines the identical 'removed' vector (proj.py:124,199) - spec: 'never seen by detector' -> clarify student trains on all 4 modes, the detector just never labels C/D for v_hack extraction; cross-ref G3/task #107. Dismissed: reviewer claim that only exit_code survived (stale spec; live log columns hk_rt/hk_so/hk_se/hk_fm confirm 4 modes) and a hallucinated 'Furthermore'. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 00:41:12 +00:00
wassname	f7288e569d	docs: 4-arm framing, weak-detector test, hack-mode appendix - blog: appendix with prompt+hint/hack/clean traces for all 4 loophole modes (run_tests/sentinel/stdout_marker/file_marker) - blog: 'four things we compare' (vanilla/erase/route/route-weak), faithful extract pseudocode (per-completion zero_grad), erase+route step pseudocode, refresh rationale + route quarantine-ablate subtlety - blog+README: cite Gradient Routing (Cloud et al. 2024, 2410.04332) as the route arm's lineage - README: 'what we compare' section + appendix pointer - spec: weak-detector arm as the operationalized generalization test Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>	2026-05-31 00:20:47 +00:00
wassname	22b5d0a8a7	LW draft: add preregistered H1 block-quote with falsification clauses. Surfaces the H1 verbatim + falsification criteria, names two gaps up-front: 21 pairs vs preregistered 60-80, and the SEM-across-seeds clause not yet evaluable at n=2. Addresses the comprehension panel's flag on H1 verbatim omission (deepseek 3.0, gemini-flash 4.0 on hypothesis_clarity). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 03:56:33 +00:00
wassname	638fe23f3e	LW-style draft post: gradient projection vs reward hacking (paper-writing skill). Compresses the lab report into ~1700 words for a LessWrong audience while preserving the workshop-paper scaffolding (intro / setup / method / result table / mechanism subplot / limitations / related work / next). Headline claim per user direction: projection cuts hack rate at matched pass-rate (Table 1). Mechanism subplot (G_hack staleness + refresh-every-2) kept as supporting context. External-panel critique pass (n=5 models, mean 4.4/5 ready) on dims hook/clarity/inform_not_persuade/calibration/LW_voice. Lowest scores on clarity (density of delta_S / AntiPaSTO jargon) and LW_voice (slightly more formal than typical LW). Acceptable for first draft. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 03:49:51 +00:00

6 Commits