evil_MoE

mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:43:00 +08:00

Files

wassname ec00bc4383 docs: A5 leak is double-hacks (not detector FP); placebo non-directionality measured via hkgap

Two review questions today exposed imprecise framing in load-bearing comments:

- The <=1.1% held-out hacked_E in A5 comes from the model double-hacking (a
  single run_tests()-shaped completion that also writes the stdout marker;
  verified on job-95, id 132), not from a detector false positive. hacked_E is
  the mode-agnostic run_tests signature. The grading channels are
  non-overlapping; the model's strategy is not.
- Placebo 'non-directional' is now measured via hkgap: real-v route2 hkgap is
  0.6-0.8 (separates hack/clean), placebo is ~0 (dead), and deploy hack is
  0.000 for both. This confirms the degenerate-gate read (H2) over
  clever-random-direction (H1): suppression comes from quarantine volume +
  exploration floor, not v_hack specificity. Direction only shows in solve
  (real 0.625 > placebo 0.531).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-05 08:23:49 +00:00

.gitignore

Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine

2026-06-02 07:21:49 +00:00

main.tex

docs: A5 leak is double-hacks (not detector FP); placebo non-directionality measured via hkgap

2026-06-05 08:23:49 +00:00

nips15submit_e.sty

docs(writeup): NeurIPS-workshop paper skeleton + tectonic compile recipe

2026-06-02 06:59:15 +00:00

refs.bib

paper: fix build, vector figs, +2 plots, de-jargon prose

2026-06-05 14:51:48 +08:00