Files
evil_MoE/out/figs/floor_ceiling.png
T
wassname 3b38a05738 no-cheat framing: label-leakage not detector-presence; fix plot comment
The disqualifier for an intervention is needing the env oracle / ground-truth
hack-labels of the live training distribution, not 'a detector ran'. On a new
RL env there is no oracle, so GT-monitor and the (oracle-label-trained) probe
can't be built there; a generic LLM judge and our hand-authored-pair vector can.
LLM judge is thus the fair external peer (no clean fast-env number to plot).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 11:22:29 +00:00

74 KiB
1641x475px