wassname
2026-06-11 11:07:28 +00:00
parent 7871aa66b8
commit 270c4f5a27
30 changed files with 456 additions and 443 deletions
+5 -4
@@ -14,10 +14,11 @@ Two core layouts (both emitted by default):
per-seed). Reads "for THIS loophole, which method suppresses it best".
Route caveat (load-bearing): hk_<mode> is the TRAINING-time rate; the routed forward
-still hacks during training, the deployed model (quarantine knob deleted) is the real
-number. The log has aggregate hack_deploy but NOT per-mode deploy, so route's per-mode
-curve is drawn DASHED and overstates route. TODO: log per-mode deploy in train.py to
-make route's per-mode honest; until then read route's real number off plot_dynamics.
+still exhibits reward hacking during training. The deployed model is evaluated after
+quarantine ablation. The log has aggregate hack_deploy but not per-mode deployment
+metrics, so route's per-mode curve is drawn dashed and overstates route. TODO: log
+per-mode deployment metrics in train.py; until then use plot_dynamics for route's
+deployment result.
This is the single plotting ENTRYPOINT (`just plot`): it emits the per-mode cut
(by-method, by-hack) AND delegates the aggregate "total hacks per arm" + cos-alignment