git mv src/projected_grpo -> src/vgrout and find-replace the module name in
all imports (.py), `-m projected_grpo.*` invocations (justfile), and the
[project] name (pyproject; setuptools auto-discovers via where=["src"]).
Left RESEARCH_JOURNAL.md untouched: its commands/paths are dated lab notes
tied to past commits, so rewriting them would falsify provenance. Repo dir,
git remote, and absolute paths unchanged.
Verified: `import vgrout` and `python -m vgrout.train --help` load the full
import graph; verify_rewards.py + verify_gate_anchor.py (both import vgrout) pass.
Full `just smoke` is blocked upstream by missing gitignored data artifacts
(out/pools/{substrate,teacher_pool}, out/vhack/*smoke*), unrelated to the rename.
- a5: drop per-panel title (restated the axis); fold direction into the xlabel
(DEPLOY hack rate (down=better) / solve (up=better)). Dump a5_generalisation.csv
(per mode,arm deploy hack/solve mean+/-std) -- the reproducibility source it lacked.
- overlay (dyn_sub4_hack_overlay etc): replace the per-arm onset DOT with a single
dashed labeled 'first hack' vertical line, matching the small-multiples/longrun.
- (dyn_sub4_hack_overlay shares dyn_sub4.csv -- same runs, different view, no new CSV.)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- a5_generalisation: connectors -> arrows (baseline->ours direction, shows the drop
and the stdout solve-cost honestly).
- equiv0 -> approx0 everywhere: these are finite-sample estimates, not identically 0.
- plot_train_vs_deploy skips when train==deploy for every run (no knob-ON contrast);
fixes the 'can't see train' longrun/sub4 figures (they had no hk_on data).
- Prune 9 orphan figure sets not referenced in paper or blog (regenerable on demand);
keep the 3 referenced + a5 + train_vs_deploy_60_train_deploy. All 4 CSVs committed.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
End-labels sat on the line termini (2-arm figs) and piled up bottom-left on
ragged-length multi-arm overlays (substrate, where arms end at different steps).
Now all labels anchor at one gutter x with a leader fanning back to each line's
actual end, y-de-collided. Added right margin so the gutter is clear.
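A minimal sketch of the gutter placement described above, not the repo's code. The name `decollide`, the `min_gap` parameter, and the example coordinates are hypothetical; it only illustrates "one shared x, labels pushed apart in y, leader back to the true line end".

```python
def decollide(ys, min_gap):
    """Spread label y-positions so neighbours sit at least min_gap apart.

    Labels are visited bottom-to-top and only ever pushed upward.
    Returns positions in the same order as the input.
    """
    order = sorted(range(len(ys)), key=lambda i: ys[i])
    placed = [0.0] * len(ys)
    prev = None
    for i in order:
        y = ys[i] if prev is None else max(ys[i], prev + min_gap)
        placed[i] = y
        prev = y
    return placed


# (x_end, y_end) of three made-up arms that stop at different steps.
ends = [(60, 0.02), (45, 0.03), (30, 0.50)]
gutter_x = 66  # assumed: one shared label x, right of the longest line
label_ys = decollide([y for _, y in ends], min_gap=0.05)
# Each leader runs from the label anchor back to the line's actual end.
leaders = [((gutter_x, ly), end) for ly, end in zip(label_ys, ends)]
```

The first two labels would overlap at y=0.02/0.03, so the second is pushed up by min_gap; the third is already clear and stays put.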
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The solid-red deploy line ran straight through the annotation text (Tufte
collision test). Move the text into the empty band above the flat line (axes y=0.12).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
A vanilla seed (s43) lacked the held-out deploy eval, so its train series fell
back to the noisy n=28 per-step hack_s while other seeds used the n=64 eval.
Averaging mixed estimators fabricated a vanilla train-vs-deploy gap that does
not exist (lie-factor). Now: train series reuses the knob-off eval only (nan if
absent -> seed drops from the mean), and missing eval columns normalise to nan
so absent==all-nan. Regenerated all figures from logs. The canonical
train_vs_deploy_60 (has hk_on) is unchanged; sub4/longrun byproducts now show
train==deploy honestly (no knob-on data to split).
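Sketch of the "absent == all-nan, seed drops from the mean" rule, assuming a per-step nan-ignoring mean. `nanmean_per_step` and the numbers are illustrative only; the real plot code may use numpy/pandas instead.

```python
import math


def nanmean_per_step(seed_series):
    """Mean across seeds at each step, ignoring nan entries.

    An all-nan seed contributes nothing; a step with no finite value
    from any seed stays nan.
    """
    out = []
    for vals in zip(*seed_series):
        finite = [v for v in vals if not math.isnan(v)]
        out.append(sum(finite) / len(finite) if finite else math.nan)
    return out


nan = math.nan
s41 = [0.0, 0.25, 0.5]   # made-up knob-off eval values
s42 = [0.0, 0.75, 1.0]
s43 = [nan, nan, nan]    # seed with no held-out eval: normalised to all-nan
mean = nanmean_per_step([s41, s42, s43])
```

Here `mean` is the average of s41 and s42 only, so no noisier estimator is mixed in for the seed that lacks the eval.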
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- overlay: floor hack panel below 0 so a pinned-at-0 line shows; direct-label
the TOP (hack) panel not the bottom (read top-to-bottom); tag any arm whose
series sits at 0 with $\equiv 0$.
- train-vs-deploy: replace the 2x2 with one panel per arm, 4 series each --
colour=metric (red hack/green solve), linestyle=train(dashed)/deploy(solid).
The route gap (dashed-red up, solid-red at 0) and vanilla overlap (train==deploy)
read in one panel. Two-axis legend (colour=metric, style=train/deploy).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The train row fell back to per-step hack_s (noisy n=28 train batch) for arms
without a knob-on eval, so vanilla's train/deploy rows looked like different
estimators. Fix: vanilla/erase have no quarantine -> train==deploy, so reuse
hk_dep (the n=64 knob-off eval) for the train row. route2 still uses hk_on
(knob-on eval). Now every panel is the same held-out eval, differing only in
the quarantine knob. Regen source: train_vs_deploy_60.csv (route2 nofloor_rf2
+ vanilla sweep, seed 41, 60 steps).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Audit of all 4 plot scripts (plot_dynamics/substrate/emergence/deploy_overlay):
- One save_fig(fig, path) helper in figs.py writes png+svg+pdf (vector for the
paper, png for the blog). All scripts call it.
- arm_label() map: reader-facing names only -- route2->route, drop 'knob'/'the
cheat' from titles and the train-vs-deploy story (adapter on/off, reward hack).
- Titles off by default (the paper/blog caption carries it); --title re-enables
for standalone research use.
- dump_data CSV now carries every plotted series; plot_dynamics --from-csv
re-renders the three figures from the committed CSV with no logs (logs/ and
out/runs/ are gitignored; out/figs/*.csv is tracked). Round-trip verified.
- Commit the regenerated dyn_sub4 figures in all 3 formats + the CSV.
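Illustration of the CSV round-trip idea (dump every plotted series, re-load with no logs). This is a stdlib sketch under assumed names (`dump_series`, `load_series`, a leading `step` column); the actual dump_data layout is not shown in this log.

```python
import csv
import io


def dump_series(series, fh):
    """Write {name: [values]} as one CSV with a leading step column."""
    names = list(series)
    writer = csv.writer(fh)
    writer.writerow(["step"] + names)
    for i in range(len(series[names[0]])):
        # repr() keeps full float precision so the reload is exact.
        writer.writerow([i] + [repr(series[k][i]) for k in names])


def load_series(fh):
    """Inverse of dump_series: read the CSV back into {name: [values]}."""
    reader = csv.reader(fh)
    names = next(reader)[1:]
    out = {k: [] for k in names}
    for row in reader:
        for k, cell in zip(names, row[1:]):
            out[k].append(float(cell))
    return out


series = {"hk_dep": [0.0, 0.1, 0.3], "slv_dep": [0.5, 0.55, 0.6]}  # made-up
buf = io.StringIO()
dump_series(series, buf)
buf.seek(0)
reloaded = load_series(buf)
```

Finite values round-trip exactly; nan cells would reload as nan but need an isnan-aware comparison when verifying.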
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The 2x2 train row used per-step hack_s (noisy n=28 train batch, knob-on) vs the
deploy row's smooth n=64 eval (knob-off) -- different estimators, confounded.
Now at each eval step route arms ALSO run the SAME n=64 eval with the quarantine
ACTIVE (knob-on = training policy), logged as hk_on/slv_on. vanilla/erase reuse
deploy (no quarantine -> knob-on==knob-off). plot_dynamics prefers hk_on for the
train series so the 2x2 differs ONLY in knob state.
Also: plot parses --teacher-off-step from argv and shades the teacher-ON region
[0,toff] + a dashed cut line in the 2x2. The stashed long-run route2 jobs
(92 KL, 94 teacher-off) inherit the knob-on eval automatically at runtime.
Smoke (route2 hk_on present + logged, both plot parse paths) green.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The A4 framing in one figure: vanilla train==deploy (cheat in the weights),
route2 train HACKS while deploy is clean (cheat in the deletable knob). parse_log
now keeps the raw train series (hack_train/solve_train) before the deploy
substitution. New fig: dyn_longrun_200_train_deploy.png.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
plot_dynamics now dumps the plotted series to <out>.csv next to the png so the
figure is reproducible from a tracked artifact (logs/ and out/runs/ are gitignored).
ylim floor -0.035 + 'hack ≡ 0' annotation so route2's pinned-at-zero hack line is
visible rather than hidden under the axis.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- blog: mark as erase-n=2 draft, note route2/exploration-floor/deploy-eval are the
current direction; embed dyn_sub4_hack_overlay.png (force-added); ASCII em-dashes;
de-bold the arm list (#15 tell)
- README: add route2 arm + apples-to-apples deploy-eval to 'What we compare'; stale
banner on the n=1 mix=0.5 findings
- plot_dynamics: remove _mark_if_sparse (asymmetric sparse-only dots); EMA-held line
for all arms
- train.py: fix 'held-out greedy' -> 'held-out eval subset, T=0.7' (deploy eval is
sampled, not greedy)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Wassname flagged that the dynamics curves weren't comparable: route2 plotted its
deploy eval (n=64, T=0.7, every 5 steps) while vanilla/erase plotted training
rollouts (n=28, every step), so route2 looked artificially smoother. (NOT a
temperature gap: both generations use T=0.7; the "held-out greedy" header was a
stale lie, now corrected.)
train.py: ungate the periodic DEPLOY-eval to run for EVERY arm. route/route2
wrap it in ablate_quarantine (deploy = knob zeroed); vanilla/erase use
nullcontext (deploy == trained model). Same estimator across arms. Cost: ~+40%
amortized generation on the arms that newly get it (n=64 every 5 steps = 12.8
eval gens/step over ~32 train gens/step) -- n stays 64 to match the finished
route2 n=3.
plot_dynamics.py: plot hk_dep/slv_dep for ALL arms when present (drop the
route-only guard; old logs fall back to training hack_s). Drop the cos row
(it was for online-vs-offline erasure; not informative next to the rate row,
and the cross-arm cos comparison was apples-to-oranges) -> single-row small
multiples, "deployed rate". Title states deploy-eval n=64 T=0.7.
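Toy sketch of the per-arm deploy context: routing arms zero the quarantine knob for the eval and restore it afterwards, other arms get a no-op context. `ToyModel`, its scalar knobs, and `deploy_ctx` are hypothetical stand-ins; only the name ablate_quarantine and the use of nullcontext come from the message above.

```python
from contextlib import contextmanager, nullcontext


class ToyModel:
    """Stand-in model: output = input + main knob + quarantine knob."""

    def __init__(self, delta_s, delta_s_hack):
        self.delta_s = delta_s
        self.delta_s_hack = delta_s_hack

    def forward(self, x):
        return x + self.delta_s + self.delta_s_hack


@contextmanager
def ablate_quarantine(model):
    """Zero the quarantine knob inside the block, restore it on exit."""
    saved = model.delta_s_hack
    model.delta_s_hack = 0.0
    try:
        yield model
    finally:
        model.delta_s_hack = saved


def deploy_ctx(model, arm):
    """Routing arms deploy with the knob zeroed; others deploy as trained."""
    if arm in ("route", "route2"):
        return ablate_quarantine(model)
    return nullcontext(model)


m = ToyModel(delta_s=0.25, delta_s_hack=0.5)
with deploy_ctx(m, "route2"):
    deployed = m.forward(1.0)   # knob zeroed here
trained = m.forward(1.0)        # knob restored after the block
```

The same eval call then runs unchanged for every arm; only the context differs.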
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
route2's plotted hack/solve is the DEPLOY eval (hk_dep/slv_dep): greedy, n=64,
logged every eval_ablate_every=5 steps and EMA-held flat between. The held line
reads as per-step-dense and oversells route2's smoothness vs the per-step
temperature-sampled (n=28) training curves the other arms plot: an
apples-to-oranges comparison that could mislead the reader.
_mark_if_sparse dots the real measured points when a series is >50% NaN; dense
series (training hack_s, cos sep/leak) stay unmarked. Now the route2 curve
visibly rests on ~13 eval points, not 60.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The cos row mixed non-comparable quantities: erase logged cin_t/cin_s
(pre-intervention single cosines), route2 logged hkgap (a difference) and
resid (post-intervention), all under one "cos(grad,v_hack)" ylabel. Wassname
flagged it -- they are not the same measurement.
Derive two quantities that mean the same thing in every column (_add_cos_derived):
sep = does v_hack still discriminate hacky from clean gradient
erase: cin_t-cin_s (teacher vs student); route2: hkgap (hackflag vs clean)
leak = residual hack-alignment of the post-intervention deployed gradient
erase: cout (after projection); route2: resid (after routing) -- same quantity
Legend now lands on the leftmost arm that has cos data (vanilla has none).
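The sep/leak mapping above, written out as a sketch. The column names (cin_t, cin_s, cout, hkgap, resid) are from this message; the function signature, the dict-row input, and the sample values are assumptions.

```python
def add_cos_derived(row, arm):
    """Return (sep, leak) for one logged row.

    sep  = how well v_hack still separates hacky from clean gradient
    leak = residual hack-alignment after the intervention
    Arms that log no cosines (e.g. vanilla) get (None, None).
    """
    if arm == "erase":
        return row["cin_t"] - row["cin_s"], row["cout"]
    if arm == "route2":
        return row["hkgap"], row["resid"]
    return None, None


# Made-up rows, only to show the mapping.
erase_sep, erase_leak = add_cos_derived(
    {"cin_t": 0.5, "cin_s": 0.25, "cout": 0.125}, "erase")
route_sep, route_leak = add_cos_derived(
    {"hkgap": 0.375, "resid": 0.0}, "route2")
```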
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Stacked hack (top) + solve (bottom) sharing x; EMA-5; onset dot on hack only;
arms direct-labelled once on solve with y de-collision + leader lines (the three
non-route arms overlap, so their labels would otherwise stack). routing2 reads
hack~0 / solve highest at a glance.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- plot_dynamics: routing (route v1) out of ARM_ORDER -- superseded by routing2.
- plot_substrate: per-mode hk_* are now plain per-batch counts (streaming log
dropped the /denominator); parse the count, plot it (EMA or cumsum); skip old
n/d-format logs (incompatible units). Y-axis hacks/batch, count annotations.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The plotter picked hk_abl (dense proxy) whenever the COLUMN existed, but no-floor
runs (rollout_ablate_frac=0) emit hk_abl as 0/0 -> all-nan, so the deploy panel
came up empty. Test for finite data (_has_data) not column presence; fall back to
the sparse-but-real hk_dep (every eval_ablate_every steps). _ema carries values
across the nan gaps -> a held step-line.
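Sketch of the two pieces named above: a finite-data test instead of a column-presence test, and an EMA that holds its last value across nan gaps. These are stand-ins written for illustration (the alpha = 2/(span+1) convention is an assumption), not the repo's _has_data/_ema.

```python
import math


def has_data(series):
    """True if the series holds at least one finite value."""
    return any(math.isfinite(v) for v in series)


def ema_hold(series, span=5):
    """EMA that carries the last state across nan gaps (a held step-line)."""
    alpha = 2.0 / (span + 1)  # assumed smoothing convention
    out, state = [], math.nan
    for v in series:
        if math.isfinite(v):
            state = v if math.isnan(state) else alpha * v + (1 - alpha) * state
        out.append(state)
    return out


nan = math.nan
hk_abl = [nan, nan, nan, nan]   # no-floor run: 0/0 at every step
hk_dep = [0.5, nan, nan, 0.2]   # sparse but real eval points (made up)
chosen = hk_abl if has_data(hk_abl) else hk_dep
held = ema_hold(chosen, span=3)
```

With the column-presence test, `chosen` would have been the all-nan hk_abl and the panel empty.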
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
--latest-per-arm + --min-steps select the freshest >=N-step log for each
arm from logs/, no hand-globbing. Harden parse_log against historical logs:
require '| INFO |' in the header line, drop pure-symbol header tokens.
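A sketch of the selection rule (freshest log with at least N steps, per arm), operating on already-parsed records. The record fields (`arm`, `steps`, `mtime`, `path`) and file names are hypothetical; how the real script derives them from logs/ is not shown here.

```python
def latest_per_arm(logs, min_steps):
    """Pick, for each arm, the newest log that ran at least min_steps."""
    best = {}
    for log in logs:
        if log["steps"] < min_steps:
            continue  # too short to plot
        cur = best.get(log["arm"])
        if cur is None or log["mtime"] > cur["mtime"]:
            best[log["arm"]] = log
    return best


logs = [  # made-up records
    {"arm": "vanilla", "steps": 60, "mtime": 100, "path": "logs/a.log"},
    {"arm": "vanilla", "steps": 10, "mtime": 300, "path": "logs/b.log"},
    {"arm": "vanilla", "steps": 60, "mtime": 200, "path": "logs/c.log"},
    {"arm": "route2", "steps": 60, "mtime": 150, "path": "logs/d.log"},
]
picked = latest_per_arm(logs, min_steps=60)
```

The newest vanilla log (b.log) is skipped because it is shorter than min_steps, so c.log wins.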
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The routing arms' benefit shows on the DEPLOYED model (quarantine deleted).
Prefer the dense per-step proxy hk_abl/slv_abl (every step, rollout_ablate_frac>0)
over the sparse held-out hk_dep eval for the plotted hack_s/gt_s curve; fall back
to hk_dep for runs that predate the proxy.
- parse hk_abl/slv_abl; routing+routing2 substitute it (else hk_dep) into hack_s/gt_s
- classify/ARM_ORDER/ARM_COLORS recognise routing2
- gate cos cols (cin_t/cin_s) by presence: vanilla/routing2 lack them, so parse
and panels skip them instead of KeyError (also fixes a pre-existing vanilla crash)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- plot_substrate.main now also calls plot_dynamics.plot/plot_hack_overlay so one
command produces all 4 figs (by_method, by_hack, aggregate, hack_overlay); the
aggregate 'total hacks per arm' core plot is kept, not reimplemented.
- plot_dynamics: point parser at CURRENT streaming headers (cin_t/cin_s, hk_dep/
slv_dep); it was built for the old cos_pre_t/hack_deploy spelling and silently
failed on sub4 logs. No backward-compat for the superseded header.
- justfile: 'plot GLOB STEM' canonical entrypoint over logs/*_sub4_*.log.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too
derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE
generic elicit (the faithful hint already discloses the mechanism; the model
must connect loophole+permission -> exploit = honest discoverability test) +
an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant
hint signal).
Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts
(0 left at top level). Per-run checkpoints+rollouts now group under
runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in
out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest).
justfile also gains run-cell REFRESH param (online-erasure arm). Smoke +
smoke-vanilla + results all green on new paths. Requeue manifest preserves the
why/resolve labels that pueue reset wiped.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
User couldn't parse 'route ablated-eval (delta_S_hack=0)'. Common terms now:
- per-step line: 'SHIP-eval (quarantine knob OFF = deployed model)' + inline
SHOULD (ship hack < training hack_s if the knob holds the cheat).
- columns hack_abl/solve_abl -> hack_ship/solve_ship.
- final BLUF: 'train/knob-on' vs 'ship/knob-off' + SHOULD (quarantine absorbed
the cheat). plot_dynamics accepts both old+new names. smoke-route green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval
(hack_abl/solve_abl cols, appended so results.py indices unchanged) every
--eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics
plots the ablated series for the routing arm (the coherence-gap fix: training
hack_s looks vanilla; routing only shows post-ablation).
External-review fixes (docs/spec/20260530_code_review.md):
- Critical: route now feeds delta_S the SAME g_proj as erase (was forcing
preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW).
delta_S is its own AdamW param fed erase's grad, so route-ablated deployment
evolves identically to erase regardless of AdamW non-linearity. Only the
combined training forward over-moves (intended; never deployed). Corrected the
overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity).
- Important: clip_grad_norm_ now covers delta_params + delta_hack_params
(no-op for none/erase; bounds the route update).
- Important: results.py paired-delta table includes routing (keyed on arm).
smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7,
ROUTE EVAL BLUF prints.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stage-1 (T3) of the routing spec. Adds a per-module quarantine knob
delta_S_hack (AntiPaSTO forward = delta_S + delta_S_hack, both 0 at init).
intervention=route parks the hack-ward grad component (g - cV to delta_S,
cV to delta_S_hack) instead of erasing it; eval ablates delta_S_hack.
- proj.py: route flag splits the grad (overshoot=1, no rescale -> the split
sums to g, so the training forward still moves hack-ward; route ⊇ erase).
- antipasto.py: second trainable knob, identity preserved at init.
- train.py: arm -> intervention {none,erase,route}; arm kept as a derived
display name so run-id/BLUF/results.py/plot classify are unchanged. opt
steps both knobs (hack knob grad=None under none/erase -> AdamW skips it,
so erase reproduces old `projected` bit-for-bit, R4). R3 span assert
(resid/||gh|| < 1e-4) + end-of-run ||delta_S_hack|| guard (route >0).
- results.py / plot_dynamics.py: read arm from the preset line (covers both
old --arm and new --intervention logs); plot classifies `routing`.
smoke: none ||dsh||=0, erase clean, route ||dsh||=0.0105 span=2.9e-7. 64
archived projected rows still parse.
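Worked sketch of the route split for the simplest case: one unit-norm hack direction v, so c = <g, v>, the hack part is c*v and the clean part is g - c*v. The single-direction setup, the helper names, and the reading of the span check (clean part's leftover alignment with v, relative to the hack part's norm) are assumptions; the real V may span several directions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def route_split(g, v):
    """Split g into (clean, hack) with hack = <g, v> * v, v unit-norm."""
    c = dot(g, v)
    hack = [c * x for x in v]
    clean = [gi - hi for gi, hi in zip(g, hack)]
    return clean, hack


g = [3.0, 4.0]   # toy gradient
v = [1.0, 0.0]   # toy unit hack direction
clean, hack = route_split(g, v)      # clean -> delta_S, hack -> delta_S_hack
total = [a + b for a, b in zip(clean, hack)]
# Assumed form of the span check: clean should carry ~no component along v.
resid = abs(dot(clean, v)) / math.sqrt(dot(hack, hack))
```

Because nothing is rescaled, the two parts add back to g (the training forward still moves hack-ward), while erase would keep only `clean`.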
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>