mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:15:58 +08:00
d393e119e0
Swap the floor->ceiling reference to the substrate paper (Ariahw et al. 2025), which benchmarks interventions on the same floor (No-Intervention hack ~79%) / ceiling (RL-Baseline no-loophole).

Their best arm (Ground-Truth Penalty, ~0% hack, perf >= ceiling) reaches the top corner BUT uses the oracle monitor at train time -- the exact cheat our no-cheat constraint forbids; their only oracle-free method (inoculation) gave incomplete, high-variance mitigation. Plotted hatched/grey as an ORACLE upper bound (solve approx; figures are images, 200-step preset not step-matched).

Honest framing: their working methods need the oracle; ours uses no detector at train time and still suppresses 93%.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>