mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:15:35 +08:00
refactor: extract train_config.py + run_artifacts.py from train.py; slim results scripts
Cleanup by a prior agent, verified green here: 'just smoke' (erase arm) runs end-to-end and all four wired gates pass (verify_rewards 52/52, verify_eval_gap, verify_partition, verify_science_invariants).

- train.py -318 lines: Config dataclass -> train_config.py, checkpoint/deploy-artifact IO -> run_artifacts.py.
- results.py / results_deploy.py / probe_distill.py slimmed.
- drop stale derived csvs under out/figs (a5_generalisation, dyn_*, substrate_aggregate, train_vs_deploy_60).
- gitignore /.pi/ panel scratch.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -71,21 +71,18 @@ non-overlappingly, so a vanilla student can learn all four independently.
 Full prompt+hint, hack, and clean traces per mode are in the blog appendix
 ([docs/blog/20260529_...md](docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md#appendix-the-four-loophole-modes)).

-Four arms, identical model and teacher pool, differing only in the gradient policy:
+Three active arms, identical model and teacher pool, differing only in the gradient policy:

 - **vanilla** -- no intervention; the emergence reference.
 - **erase** -- subtract the v_hack component from the live `delta_S` gradient (one-sided).
-- **route** -- quarantine the v_hack component into a throwaway `delta_S_hack` knob, deleted at deploy. Gradient routing ([Cloud et al. 2024](https://arxiv.org/abs/2410.04332)) in the SVD basis. (v1: shared basis, relu gate on the kept-axis coords, same as erase but routed not erased.)
 - **route2** -- current routing arm. Per-rollout gate `cos(g_rollout, v_grad) > tau` (tau calibrated each step from the hack-vs-clean cosine gap) decides whether a rollout's whole gradient routes into a scale-matched, distinct-basis `delta_S_hack`. An exploration floor (`rollout_ablate_frac`) generates a fraction of rollouts knob-off so the deployed knob always sees solve signal even if the full policy saturates on hacking.
-- **route, weak detector** -- the generalisation test: build v_hack from only 2 of the 4 modes (the "known" hacks the weak detector can flag), route on that subspace, and measure whether the 2 held-out modes are also suppressed. This is the load-bearing no-cheat check.

 Deploy hack/solve is measured the same way for every arm now: knob-off forward
-(`ablate_quarantine` for route/route2, the trained model itself for vanilla/erase),
+(`ablate_quarantine` for route2, the trained model itself for vanilla/erase),
 n=64 prompts*group, sampled at T=0.7. So the per-arm deploy numbers are apples-to-apples.

-The frozen-vs-refresh distinction is orthogonal: any arm can re-extract
-v_hack every N steps on the current adapter (for route, with the quarantine
-ablated during extraction, see the blog).
+The frozen-vs-refresh distinction is orthogonal: erase and route2 can re-extract
+their pair-derived direction every N steps on the current adapter.

 ## Quick start

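For readers of the hunk above, the **erase** and **route2** gradient policies can be sketched on flat gradient vectors as follows. This is a minimal illustration, not the repository's code. The function names `erase_one_sided` and `route2_gate` are hypothetical. Reading "one-sided" as "subtract only a positive projection onto v_hack" is an assumption, as is summing per-rollout gradients into two buckets. The `tau` calibration, the scale matching of `delta_S_hack`, and the `rollout_ablate_frac` exploration floor are left out.

```python
import numpy as np


def erase_one_sided(grad, v_hack):
    """Subtract the v_hack component of grad, but only when the projection
    is positive (assumed meaning of 'one-sided')."""
    v = v_hack / np.linalg.norm(v_hack)
    coeff = float(grad @ v)
    return grad - max(coeff, 0.0) * v


def route2_gate(rollout_grads, v_grad, tau):
    """Per-rollout gate: a rollout whose gradient has cos(g, v_grad) > tau
    sends its whole gradient to the quarantine bucket; the rest is kept.

    rollout_grads: array of shape (n_rollouts, dim)
    Returns (g_keep, g_hack, routed_mask).
    """
    v = v_grad / np.linalg.norm(v_grad)
    norms = np.linalg.norm(rollout_grads, axis=1)
    cos = (rollout_grads @ v) / np.maximum(norms, 1e-12)  # guard zero gradients
    routed = cos > tau
    g_keep = rollout_grads[~routed].sum(axis=0)  # update for the live knob
    g_hack = rollout_grads[routed].sum(axis=0)   # update for the throwaway knob
    return g_keep, g_hack, routed


if __name__ == "__main__":
    v = np.array([2.0, 0.0])
    print(erase_one_sided(np.array([3.0, 4.0]), v))
    grads = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(route2_gate(grads, v, tau=0.5))
```

The practical difference is that erasing discards the flagged component, while routing stores the flagged rollouts' gradients in a separate parameter that can be dropped at deploy time.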