mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 19:47:33 +08:00
misc
@@ -14,10 +14,11 @@ Two core layouts (both emitted by default):
 per-seed). Reads "for THIS loophole, which method suppresses it best".
 
 Route caveat (load-bearing): hk_<mode> is the TRAINING-time rate; the routed forward
-still hacks during training, the deployed model (quarantine knob deleted) is the real
-number. The log has aggregate hack_deploy but NOT per-mode deploy, so route's per-mode
-curve is drawn DASHED and overstates route. TODO: log per-mode deploy in train.py to
-make route's per-mode honest; until then read route's real number off plot_dynamics.
+still exhibits reward hacking during training. The deployed model is evaluated after
+quarantine ablation. The log has aggregate hack_deploy but not per-mode deployment
+metrics, so route's per-mode curve is drawn dashed and overstates route. TODO: log
+per-mode deployment metrics in train.py; until then use plot_dynamics for route's
+deployment result.
 
 This is the single plotting ENTRYPOINT (`just plot`): it emits the per-mode cut
 (by-method, by-hack) AND delegates the aggregate "total hacks per arm" + cos-alignment