evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 21:22:15 +08:00

wassname f3f2c1250f feat: lean per-step table w/ per-mode hack cols, generic elicit, ship->deploy

Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
  multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too
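The formatting rules in the list above (short column names, 2 significant figures on loss, 1 on gn/lr, and (n,d) tuples rendered as n/d) can be sketched roughly as follows. This is an illustration only, not the StepLogger code from train.py: the helper names `round_sig`, `render_cell` and `render_row`, the `SIG_FIGS` table, and the default of 2 significant figures for unlisted columns are all assumptions.

```python
from math import floor, log10

# Old -> new column names, taken from the commit message.
SHORT_NAMES = {
    "cos_pre": "cin",
    "cos_pre_s": "cin_s",
    "cos_pre_t": "cin_t",
    "cos_post": "cout",
    "gradn": "gn",
}

# Significant figures per column, as stated in the commit message.
# The fallback of 2 for any other float column is an assumption.
SIG_FIGS = {"loss": 2, "gn": 1, "lr": 1}


def round_sig(x: float, sig: int) -> float:
    """Round x to `sig` significant figures (hypothetical helper)."""
    if x == 0:
        return 0.0
    return round(x, sig - 1 - floor(log10(abs(x))))


def render_cell(name: str, value) -> str:
    """Render one table cell as a string (hypothetical helper)."""
    if value is None:
        return "-"
    if isinstance(value, tuple) and len(value) == 2:
        n, d = value  # any (n, d) tuple is shown as n/d
        return f"{n}/{d}"
    if isinstance(value, float):
        return f"{round_sig(value, SIG_FIGS.get(name, 2)):g}"
    return str(value)


def render_row(row: dict) -> dict:
    """Apply the short names, then format each value (hypothetical helper)."""
    out = {}
    for key, value in row.items():
        short = SHORT_NAMES.get(key, key)
        out[short] = render_cell(short, value)
    return out
```

For example, `render_row({"gradn": 3.47, "loss": 0.12345, "hk_rt": (3, 8)})` yields `{"gn": "3", "loss": "0.12", "hk_rt": "3/8"}`.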

derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE
generic elicit (the faithful hint already discloses the mechanism; the model
must connect loophole+permission -> exploit = honest discoverability test) +
an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant
hint signal).

Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 10:35:26 +00:00

blog

LW draft: add preregistered H1 block-quote with falsification clauses

2026-05-29 03:56:33 +00:00

brainstorm

ready

2026-05-23 14:19:41 +08:00

grad_routing

feat: gradient routing — delta_S_hack quarantine + intervention {none,erase,route}

2026-05-30 00:31:30 +00:00

lab

lab report v3: TL;DR, three-line concept, PASS_RATE column, G_hack rename