refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts

Cleanup by a prior agent, verified green here: 'just smoke' (erase arm) runs
end-to-end and all four wired gates pass (verify_rewards 52/52,
verify_eval_gap, verify_partition, verify_science_invariants).

- train.py -318 lines: Config dataclass -> train_config.py,
  checkpoint/deploy-artifact IO -> run_artifacts.py.
- results.py / results_deploy.py / probe_distill.py slimmed.
- drop stale derived csvs under out/figs (a5_generalisation, dyn_*,
  substrate_aggregate, train_vs_deploy_60).
- gitignore /.pi/ panel scratch.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:15:35 +08:00 · 2026-06-09 13:34:50 +00:00
parent 3f82041d90
commit b53043cec3
31 changed files with 673 additions and 3073 deletions
@@ -71,21 +71,18 @@ non-overlappingly, so a vanilla student can learn all four independently.
 Full prompt+hint, hack, and clean traces per mode are in the blog appendix
 ([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).

-Four arms, identical model and teacher pool, differing only in the gradient policy:
+Three active arms, identical model and teacher pool, differing only in the gradient policy:

 - **vanilla** -- no intervention; the emergence reference.
 - **erase** -- subtract the v_hack component from the live `delta_S` gradient (one-sided).
-- **route** -- quarantine the v_hack component into a throwaway `delta_S_hack` knob, deleted at deploy. Gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332)) in the SVD basis. (v1: shared basis, relu gate on the kept-axis coords, same as erase but routed not erased.)
 - **route2** -- current routing arm. Per-rollout gate `cos(g_rollout, v_grad) > tau` (tau calibrated each step from the hack-vs-clean cosine gap) decides whether a rollout's whole gradient routes into a scale-matched, distinct-basis `delta_S_hack`. An exploration floor (`rollout_ablate_frac`) generates a fraction of rollouts knob-off so the deployed knob always sees solve signal even if the full policy saturates on hacking.
-- **route, weak detector** -- the generalisation test: build v_hack from only 2 of the 4 modes (the "known" hacks the weak detector can flag), route on that subspace, and measure whether the 2 held-out modes are also suppressed. This is the load-bearing no-cheat check.

 Deploy hack/solve is measured the same way for every arm now: knob-off forward
-(`ablate_quarantine` for route/route2, the trained model itself for vanilla/erase),
+(`ablate_quarantine` for route2, the trained model itself for vanilla/erase),
 n=64 prompts*group, sampled at T=0.7. So the per-arm deploy numbers are apples-to-apples.

-The frozen-vs-refresh distinction is orthogonal: any arm can re-extract
-v_hack every N steps on the current adapter (for route, with the quarantine
-ablated during extraction, see the blog).
+The frozen-vs-refresh distinction is orthogonal: erase and route2 can re-extract
+their pair-derived direction every N steps on the current adapter.

 ## Quick start