refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts

Cleanup by a prior agent, verified green here: 'just smoke' (erase arm)
runs end-to-end and all four wired gates pass (verify_rewards 52/52,
verify_eval_gap, verify_partition, verify_science_invariants).

- train.py -318 lines: Config dataclass -> train_config.py, checkpoint/
  deploy-artifact IO -> run_artifacts.py.
- results.py / results_deploy.py / probe_distill.py slimmed.
- drop stale derived csvs under out/figs (a5_generalisation, dyn_*,
  substrate_aggregate, train_vs_deploy_60).
- gitignore /.pi/ panel scratch.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
wassname
2026-06-09 13:34:50 +00:00
parent 3f82041d90
commit b53043cec3
31 changed files with 673 additions and 3073 deletions
+4 -7
View File
@@ -71,21 +71,18 @@ non-overlappingly, so a vanilla student can learn all four independently.
Full prompt+hint, hack, and clean traces per mode are in the blog appendix
([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).
Four arms, identical model and teacher pool, differing only in the gradient policy:
Three active arms, identical model and teacher pool, differing only in the gradient policy:
- **vanilla** -- no intervention; the emergence reference.
- **erase** -- subtract the v_hack component from the live `delta_S` gradient (one-sided).
- **route** -- quarantine the v_hack component into a throwaway `delta_S_hack` knob, deleted at deploy. Gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332)) in the SVD basis. (v1: shared basis, relu gate on the kept-axis coords, same as erase but routed not erased.)
- **route2** -- current routing arm. Per-rollout gate `cos(g_rollout, v_grad) > tau` (tau calibrated each step from the hack-vs-clean cosine gap) decides whether a rollout's whole gradient routes into a scale-matched, distinct-basis `delta_S_hack`. An exploration floor (`rollout_ablate_frac`) generates a fraction of rollouts knob-off so the deployed knob always sees solve signal even if the full policy saturates on hacking.
- **route, weak detector** -- the generalisation test: build v_hack from only 2 of the 4 modes (the "known" hacks the weak detector can flag), route on that subspace, and measure whether the 2 held-out modes are also suppressed. This is the load-bearing no-cheat check.
Deploy hack/solve is measured the same way for every arm now: knob-off forward
(`ablate_quarantine` for route/route2, the trained model itself for vanilla/erase),
(`ablate_quarantine` for route2, the trained model itself for vanilla/erase),
n=64 prompts*group, sampled at T=0.7. So the per-arm deploy numbers are apples-to-apples.
The frozen-vs-refresh distinction is orthogonal: any arm can re-extract
v_hack every N steps on the current adapter (for route, with the quarantine
ablated during extraction, see the blog).
The frozen-vs-refresh distinction is orthogonal: erase and route2 can re-extract
their pair-derived direction every N steps on the current adapter.
## Quick start