Wassname flagged that the dynamics curves weren't comparable: route2 plotted its
deploy eval (n=64, T=0.7, every 5 steps) while vanilla/erase plotted training
rollouts (n=28, every step), so route2 looked artificially smoother. (NOT a
temperature gap: both gens are T=0.7; the "held-out greedy" header was stale
and wrong, now corrected.)
train.py: ungate the periodic DEPLOY-eval to run for EVERY arm. route/route2
wrap it in ablate_quarantine (deploy = knob zeroed); vanilla/erase use
nullcontext (deploy == trained model). Same estimator across arms. Cost: ~+40%
amortized generation on the arms that newly get it (n=64 every 5 steps = 12.8
gens/step, over ~32 train gens/step) -- n stays 64 to match the finished route2 n=3.
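
A minimal sketch of the arm-dependent deploy context described above. ToyModel,
deploy_context and knob_during_deploy_eval are hypothetical names, and the body
of ablate_quarantine is an illustrative stand-in (one scalar plays the knob),
not the code in train.py:

```python
from contextlib import contextmanager, nullcontext


class ToyModel:
    """Hypothetical stand-in: one scalar plays the role of the quarantine knob."""

    def __init__(self, knob=0.5):
        self.knob = knob


@contextmanager
def ablate_quarantine(model):
    """Zero the quarantine knob for the duration of the eval, then restore it."""
    saved = model.knob
    model.knob = 0.0
    try:
        yield model
    finally:
        model.knob = saved


ROUTING_ARMS = {"route", "route2"}  # arm names as used in the message above


def deploy_context(arm, model):
    """Routing arms deploy with the knob zeroed; other arms deploy as trained."""
    return ablate_quarantine(model) if arm in ROUTING_ARMS else nullcontext(model)


def knob_during_deploy_eval(arm, model):
    """Return the knob value the deploy eval would see for this arm."""
    with deploy_context(arm, model) as m:
        return m.knob
```

Because both branches are context managers, the eval call site can stay the
same for every arm, which is the "same estimator across arms" point.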
plot_dynamics.py: plot hk_dep/slv_dep for ALL arms when present (drop the
route-only guard; old logs fall back to training hack_s). Drop the cos row
(it was for online-vs-offline erasure; not informative next to the rate row,
and the cross-arm cos comparison was apples-to-oranges) -> single-row small
multiples, "deployed rate". Title states deploy-eval n=64 T=0.7.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
route2's plotted hack/solve is the DEPLOY eval (hk_dep/slv_dep): greedy, n=64,
logged every eval_ablate_every=5 steps and EMA-held flat between. The held line
reads as per-step-dense and oversells route2's smoothness vs the per-step
temperature-sampled (n=28) training curves the other arms plot -- an apples-to-
oranges smoothness the reader shouldn't be misled by.
_mark_if_sparse dots the real measured points when a series is >50% NaN; dense
series (training hack_s, cos sep/leak) stay unmarked. Now the route2 curve
visibly rests on ~13 eval points, not 60.
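
A rough sketch of the sparsity rule, assuming series are plain lists of floats
with NaN for unmeasured steps. measured_points is a hypothetical helper that
only decides which points to dot; the real _mark_if_sparse also does the
drawing:

```python
import math


def measured_points(series, nan_threshold=0.5):
    """Return (index, value) pairs to dot, or [] if the series is dense.

    A series counts as sparse when more than `nan_threshold` of it is NaN.
    """
    if not series:
        return []
    finite = [(i, v) for i, v in enumerate(series) if not math.isnan(v)]
    nan_frac = 1.0 - len(finite) / len(series)
    return finite if nan_frac > nan_threshold else []
```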
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The cos row mixed non-comparable quantities: erase logged cin_t/cin_s
(pre-intervention single cosines), route2 logged hkgap (a difference) and
resid (post-intervention), all under one "cos(grad,v_hack)" ylabel. Wassname
flagged it -- they are not the same measurement.
Derive two quantities that mean the same thing in every column (_add_cos_derived):
  sep  = does v_hack still discriminate hacky from clean gradient
         erase: cin_t-cin_s (teacher vs student); route2: hkgap (hackflag vs clean)
  leak = residual hack-alignment of the post-intervention deployed gradient
         erase: cout (after projection); route2: resid (after routing) -- same quantity
Legend now lands on the leftmost arm that has cos data (vanilla has none).
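
The sep/leak mapping can be sketched as below, assuming each parsed log row is a
dict keyed by column name. add_cos_derived is a hypothetical simplification of
_add_cos_derived; only the column names come from the message above:

```python
def add_cos_derived(arm, row):
    """Return a copy of `row` with arm-independent `sep` and `leak` added when derivable."""
    out = dict(row)
    if arm == "erase":
        if "cin_t" in row and "cin_s" in row:
            out["sep"] = row["cin_t"] - row["cin_s"]  # teacher vs student
        if "cout" in row:
            out["leak"] = row["cout"]  # after projection
    elif arm == "route2":
        if "hkgap" in row:
            out["sep"] = row["hkgap"]  # hackflag vs clean, already a difference
        if "resid" in row:
            out["leak"] = row["resid"]  # after routing
    return out
```

Arms without cos columns (vanilla) pass through unchanged, so their panels can
simply be skipped.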
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Stacked hack (top) + solve (bottom) sharing x; EMA-5; onset dot on hack only;
arms direct-labelled once on solve with y de-collision + leader lines (the three
non-route arms overlap, so their labels would otherwise stack). routing2 reads
hack~0 / solve highest at a glance.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- plot_dynamics: drop routing (route v1) from ARM_ORDER -- superseded by routing2.
- plot_substrate: per-mode hk_* are now plain per-batch counts (streaming log
dropped the /denominator); parse the count, plot it (EMA or cumsum); skip old
n/d-format logs (incompatible units). Y-axis hacks/batch, count annotations.
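
A small sketch of the count parsing, assuming each hk_* cell arrives as a
string. parse_hk_count and cumulative are hypothetical names, and a caller
would skip the whole log as soon as any cell comes back None:

```python
def parse_hk_count(cell):
    """Parse a per-batch hack count; return None for the old 'n/d' format."""
    cell = cell.strip()
    if "/" in cell:
        return None  # old ratio-format log: incompatible units
    try:
        return int(cell)
    except ValueError:
        return None  # placeholder such as '-' or an empty cell


def cumulative(counts):
    """Running total of per-batch counts (the cumsum plotting option)."""
    total, out = 0, []
    for c in counts:
        total += c
        out.append(total)
    return out
```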
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The plotter picked hk_abl (dense proxy) whenever the COLUMN existed, but no-floor
runs (rollout_ablate_frac=0) emit hk_abl as 0/0 -> all-nan, so the deploy panel
came up empty. Test for finite data (_has_data) not column presence; fall back to
the sparse-but-real hk_dep (every eval_ablate_every steps). _ema carries values
across the nan gaps -> a held step-line.
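
A hedged sketch of the fix, assuming columns are lists of floats with NaN for
missing steps. has_data, ema_hold and pick_deploy_series are hypothetical
stand-ins for _has_data, _ema and the plotter's selection logic, and alpha=0.3
is an arbitrary smoothing factor chosen for illustration:

```python
import math


def has_data(series):
    """True if at least one value is finite (column presence is not enough)."""
    return any(math.isfinite(v) for v in series)


def ema_hold(series, alpha=0.3):
    """EMA that carries its last value across NaN gaps, giving a held step-line."""
    out, state = [], None
    for v in series:
        if math.isfinite(v):
            state = v if state is None else alpha * v + (1 - alpha) * state
        out.append(math.nan if state is None else state)
    return out


def pick_deploy_series(cols):
    """Prefer the dense proxy hk_abl if it has finite data, else fall back to hk_dep."""
    for name in ("hk_abl", "hk_dep"):
        if name in cols and has_data(cols[name]):
            return name
    return None
```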
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
--latest-per-arm + --min-steps select the freshest >=N-step log for each
arm from logs/, no hand-globbing. Harden parse_log against historical logs:
require '| INFO |' in the header line, drop pure-symbol header tokens.
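
One way the selection and header hardening could look. LogInfo, latest_per_arm
and clean_header_tokens are hypothetical, and "freshest" is assumed here to
mean the largest modification time:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogInfo:
    path: str
    arm: str
    steps: int
    mtime: float


def latest_per_arm(logs, min_steps):
    """Keep, for each arm, the most recently modified log with >= min_steps steps."""
    best = {}
    for log in logs:
        if log.steps < min_steps:
            continue
        cur = best.get(log.arm)
        if cur is None or log.mtime > cur.mtime:
            best[log.arm] = log
    return best


def clean_header_tokens(header_line):
    """Column names from a header line, or [] if it lacks the '| INFO |' marker.

    Tokens with no alphanumeric character (pure symbols) are dropped.
    """
    if "| INFO |" not in header_line:
        return []
    body = header_line.split("| INFO |", 1)[1]
    return [t for t in body.split() if any(ch.isalnum() for ch in t)]
```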
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The routing arms' benefit shows on the DEPLOYED model (quarantine deleted).
Prefer the dense per-step proxy hk_abl/slv_abl (every step, rollout_ablate_frac>0)
over the sparse held-out hk_dep eval for the plotted hack_s/gt_s curve; fall back
to hk_dep for runs that predate the proxy.
- parse hk_abl/slv_abl; routing+routing2 substitute it (else hk_dep) into hack_s/gt_s
- classify/ARM_ORDER/ARM_COLORS recognise routing2
- gate cos cols (cin_t/cin_s) by presence: vanilla/routing2 lack them, so parse
and panels skip them instead of KeyError (also fixes a pre-existing vanilla crash)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- plot_substrate.main now also calls plot_dynamics.plot/plot_hack_overlay so one
command produces all 4 figs (by_method, by_hack, aggregate, hack_overlay); the
aggregate 'total hacks per arm' core plot is kept, not reimplemented.
- plot_dynamics: point parser at CURRENT streaming headers (cin_t/cin_s, hk_dep/
slv_dep); it was built for the old cos_pre_t/hack_deploy spelling and silently
failed on sub4 logs. No backward-compat for the superseded header.
- justfile: 'plot GLOB STEM' canonical entrypoint over logs/*_sub4_*.log.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too
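
A minimal sketch of the cell formatting rules listed above, assuming rows are
dicts of column name to value. fmt_cell, fmt_row and SIG_FIGS are hypothetical
names; StepLogger itself is not reproduced here:

```python
SIG_FIGS = {"loss": 2, "gn": 1, "lr": 1}  # significant figures per column, per the list above


def fmt_cell(value, sig=2):
    """Render one cell: (n, d) tuples as 'n/d', floats to `sig` significant figures."""
    if isinstance(value, tuple) and len(value) == 2:
        n, d = value
        return f"{n}/{d}"
    if isinstance(value, float):
        return f"{value:.{sig}g}"
    return str(value)


def fmt_row(row):
    """Format every cell of a row, defaulting to 2 significant figures."""
    return {k: fmt_cell(v, SIG_FIGS.get(k, 2)) for k, v in row.items()}
```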
derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE
generic elicit. The faithful hint already discloses the mechanism, so the model
must connect loophole+permission -> exploit itself, which makes this an honest
discoverability test. Also add an exit-interview '### Notes' section, surfaced
in the log as a too-vague/too-blatant hint signal.
Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts
(0 left at top level). Per-run checkpoints+rollouts now group under
runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in
out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest).
justfile also gains run-cell REFRESH param (online-erasure arm). Smoke +
smoke-vanilla + results all green on new paths. Requeue manifest preserves the
why/resolve labels that pueue reset wiped.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
User couldn't parse 'route ablated-eval (delta_S_hack=0)'. Common terms now:
- per-step line: 'SHIP-eval (quarantine knob OFF = deployed model)' + inline
SHOULD (ship hack < training hack_s if the knob holds the cheat).
- columns hack_abl/solve_abl -> hack_ship/solve_ship.
- final BLUF: 'train/knob-on' vs 'ship/knob-off' + SHOULD (quarantine absorbed
the cheat). plot_dynamics accepts both old+new names. smoke-route green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval
(hack_abl/solve_abl cols, appended so results.py indices unchanged) every
--eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics
plots the ablated series for the routing arm (the coherence-gap fix: training
hack_s looks vanilla; routing only shows post-ablation).
External-review fixes (docs/spec/20260530_code_review.md):
- Critical: route now feeds delta_S the SAME g_proj as erase (was forcing
preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW).
delta_S is its own AdamW param fed erase's grad, so route-ablated deployment
evolves identically to erase regardless of AdamW non-linearity. Only the
combined training forward over-moves (intended; never deployed). Corrected the
overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity).
- Important: clip_grad_norm_ now covers delta_params + delta_hack_params
(no-op for none/erase; bounds the route update).
- Important: results.py paired-delta table includes routing (keyed on arm).
smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7,
ROUTE EVAL BLUF prints.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stage-1 (T3) of the routing spec. Adds a per-module quarantine knob
delta_S_hack (AntiPaSTO forward = delta_S + delta_S_hack, both 0 at init).
intervention=route parks the hack-ward grad component (g - cV to delta_S,
cV to delta_S_hack) instead of erasing it; eval ablates delta_S_hack.
- proj.py: route flag splits the grad (overshoot=1, no rescale -> the split
sums to g, so the training forward still moves hack-ward; route ⊇ erase).
- antipasto.py: second trainable knob, identity preserved at init.
- train.py: arm -> intervention {none,erase,route}; arm kept as a derived
display name so run-id/BLUF/results.py/plot classify are unchanged. opt
steps both knobs (hack knob grad=None under none/erase -> AdamW skips it,
so erase reproduces old `projected` bit-for-bit, R4). R3 span assert
(resid/||gh|| < 1e-4) + end-of-run ||delta_S_hack|| guard (route >0).
- results.py / plot_dynamics.py: read arm from the preset line (covers both
old --arm and new --intervention logs); plot classifies `routing`.
smoke: none ||dsh||=0, erase clean, route ||dsh||=0.0105 span=2.9e-7. 64
archived projected rows still parse.
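
A toy sketch of the Stage-1 split described in this message, assuming a single
hack direction v (the real V may span several directions) and plain Python
lists instead of tensors. route_split and span_residual_ratio are hypothetical
names, and the residual ratio is one reading of the R3 span assert:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def route_split(g, v):
    """Split g into (clean, hack): hack = (g.v) v along unit v, clean = g - hack."""
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]  # normalise so c is a plain projection coefficient
    c = dot(g, v)
    hack = [c * x for x in v]  # cV, routed to delta_S_hack
    clean = [gi - hi for gi, hi in zip(g, hack)]  # g - cV, routed to delta_S
    return clean, hack


def span_residual_ratio(hack, v):
    """Relative norm of the part of `hack` outside span(v); ~0 when the split is correct."""
    norm_v = math.sqrt(dot(v, v))
    u = [x / norm_v for x in v]
    c = dot(hack, u)
    resid = [h - c * x for h, x in zip(hack, u)]
    return math.sqrt(dot(resid, resid)) / math.sqrt(dot(hack, hack))
```

With overshoot=1 and no rescale the two parts add back to g, so ablating the
hack part at eval leaves exactly the component erase would have kept.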
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>