mirror of https://github.com/wassname/evil_MoE.git, synced 2026-06-27 17:48:43 +08:00
3b38a05738
The disqualifier for an intervention is needing the env oracle / ground-truth hack labels of the live training distribution, not 'a detector ran'.

On a new RL env there is no oracle, so the GT-monitor and the (oracle-label-trained) probe can't be built there; a generic LLM judge and our hand-authored-pair vector can. The LLM judge is thus the fair external peer (no clean fast-env number to plot).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
74 KiB
1641x475px