evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 19:31:11 +08:00

wassname ec00bc4383 docs: A5 leak is double-hacks (not detector FP); placebo non-directionality measured via hkgap

Two review questions today exposed imprecise framing in load-bearing comments:

- A5 held-out <=1.1% hacked_E is the model double-hacking (one run_tests()-shaped
  completion that also writes the stdout marker, verified job-95 id 132), not a
  detector false positive. hacked_E is the mode-agnostic run_tests signature.
  Grading channels are non-overlapping; the model's strategy is not.
- Placebo 'non-directional' is now the hkgap measurement: real-v route2 hkgap
  0.6-0.8 (separates hack/clean), placebo ~0 (dead), both deploy hack 0.000.
  Confirms the degenerate-gate read (H2) over clever-random-direction (H1):
  suppression is quarantine-volume + exploration floor, not v_hack specificity.
  Direction only shows in solve (real 0.625 > placebo 0.531).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-05 08:23:49 +00:00

blog

blog: drop reader-facing route2 tag -> route (consistency with paper)

2026-06-03 02:20:13 +00:00

brainstorm

ready

2026-05-23 14:19:41 +08:00

figs

misc

2026-06-02 02:06:43 +00:00

papers

wip