Old GT_S=6/HACK_S=8 were the pre-sprd/N layout; current table is gt_s=4
hack_s=6, so newer logs were silently mis-read and old distill logs crashed
_frac on a non-fraction token. Now locate the train.py streaming header
(first token 'step' + 'ref_eq' present) and map columns by name.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too
derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE
generic elicit (the faithful hint already discloses the mechanism; the model
must connect loophole+permission -> exploit = honest discoverability test) +
an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant
hint signal).
Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts
(0 left at top level). Per-run checkpoints+rollouts now group under
runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in
out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest).
justfile also gains run-cell REFRESH param (online-erasure arm). Smoke +
smoke-vanilla + results all green on new paths. Requeue manifest preserves the
why/resolve labels that pueue reset wiped.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
17/57 real runs pass no --mix-ratio and rely on the preset default (0.125),
but the argv grab defaulted to 0.5 and mis-keyed them into the wrong mix
group, contaminating the paired-delta baseline. Parse the printed
mix_ratio= INFO line (what the run actually used) instead.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval
(hack_abl/solve_abl cols, appended so results.py indices unchanged) every
--eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics
plots the ablated series for the routing arm (the coherence-gap fix: training
hack_s looks vanilla; routing only shows post-ablation).
External-review fixes (docs/spec/20260530_code_review.md):
- Critical: route now feeds delta_S the SAME g_proj as erase (was forcing
preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW).
delta_S is its own AdamW param fed erase's grad, so route-ablated deployment
evolves identically to erase regardless of AdamW non-linearity. Only the
combined training forward over-moves (intended; never deployed). Corrected the
overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity).
- Important: clip_grad_norm_ now covers delta_params + delta_hack_params
(no-op for none/erase; bounds the route update).
- Important: results.py paired-delta table includes routing (keyed on arm).
smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7,
ROUTE EVAL BLUF prints.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stage-1 (T3) of the routing spec. Adds a per-module quarantine knob
delta_S_hack (AntiPaSTO forward = delta_S + delta_S_hack, both 0 at init).
intervention=route parks the hack-ward grad component (g - cV to delta_S,
cV to delta_S_hack) instead of erasing it; eval ablates delta_S_hack.
- proj.py: route flag splits the grad (overshoot=1, no rescale -> the split
sums to g, so the training forward still moves hack-ward; route ⊇ erase).
- antipasto.py: second trainable knob, identity preserved at init.
- train.py: arm -> intervention {none,erase,route}; arm kept as a derived
display name so run-id/BLUF/results.py/plot classify are unchanged. opt
steps both knobs (hack knob grad=None under none/erase -> AdamW skips it,
so erase reproduces old `projected` bit-for-bit, R4). R3 span assert
(resid/||gh|| < 1e-4) + end-of-run ||delta_S_hack|| guard (route >0).
- results.py / plot_dynamics.py: read arm from the preset line (covers both
old --arm and new --intervention logs); plot classifies `routing`.
smoke: none ||dsh||=0, erase clean, route ||dsh||=0.0105 span=2.9e-7. 64
archived projected rows still parse.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
docs/results.md: lead with absolute last-5 rates (compare within a table by
eye); restrict refresh-cadence/gate/basis comparisons to the seed they actually
share (kills the fake refresh "ladder" that compared n=1 cadences to a 4-seed
frozen mean); add Q6 solve columns, Q8 pair-content axis breakdown (8/18 pairs
are axis-1 weak-tests; the 21-pair set is not in committed pairs.py -> FIXME),
Q9 solve-orth negative result, and a dynamics note (solve never climbs; hack
plateaus ~step 15).
scripts/results.py: add `log` provenance column; drop the wide argv/time cols.
Lock mix_ratio=0.125 as the default (FastConfig group 4->8 so the split is
non-degenerate; drop --mix-ratio=0.5 from fast recipes). Q6 shows 0.125 keeps
the hack cut with no solve tax. Smoke passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- paired view: join projected to vanilla on (mix, seed), per-seed delta, mean
+/- std over shared seeds. Comparing a 3-seed mean to a 1-seed point is
meaningless; this enforces same-seed comparison (ml_debug principle).
- grouped view now reports std across seeds (null at n=1).
- exclude in-progress/aborted runs (must log all `steps`) so partial logs
don't read as impossibly-good results.
- docs/results.md rewritten around paired deltas; honest that at n=4 the
last-5 Dhack std (~0.15) ~= the mean (~0.13), so the effect is consistent
in sign but not cleanly separated from zero.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- scripts/results.py + `just results`: aggregate logs/*.log into last-5
hack_s and gt_s (solve) tables, sorted-by-time + grouped-by-config, with
full argv provenance column. Filters smoke/probe runs.
- extract_vhack_grad: solve_orth_m knob — strip top-m known-solve subspace
(SVD of clean-side grads) from D before SVD, so projection doesn't ablate
the solve signal. No grader/oracle, off by default.
- docs/results.md: every experiment grouped by the question it answers
(feasibility, H1, gate_mode, basis, refresh, mix, noise-floor, pair-set)
with comparison tables and answers.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>