misc

2026-06-27 19:47:33 +08:00 · 2026-06-11 11:07:28 +00:00
parent 7871aa66b8
commit 270c4f5a27
30 changed files with 456 additions and 443 deletions
@@ -14,10 +14,11 @@ Two core layouts (both emitted by default):
              per-seed). Reads "for THIS loophole, which method suppresses it best".

 Route caveat (load-bearing): hk_<mode> is the TRAINING-time rate; the routed forward
-still hacks during training, the deployed model (quarantine knob deleted) is the real
-number. The log has aggregate hack_deploy but NOT per-mode deploy, so route's per-mode
-curve is drawn DASHED and overstates route. TODO: log per-mode deploy in train.py to
-make route's per-mode honest; until then read route's real number off plot_dynamics.
+still exhibits reward hacking during training. The deployed model is evaluated after
+quarantine ablation. The log has aggregate hack_deploy but not per-mode deployment
+metrics, so route's per-mode curve is drawn dashed and overstates route. TODO: log
+per-mode deployment metrics in train.py; until then use plot_dynamics for route's
+deployment result.

 This is the single plotting ENTRYPOINT (`just plot`): it emits the per-mode cut
 (by-method, by-hack) AND delegates the aggregate "total hacks per arm" + cos-alignment