Disambiguate the overloaded deploy/train/knob vocabulary (paper-consistent:
'quarantine' + 'ablated' + 'deployed' all match Cloud et al.). One opposite each:
- policy view: hack_deployed/solve_deployed (quarantine ablated, ships) vs
hack_as_trained/solve_as_trained (quarantine attached). Unifies the old split
deploy_hack (JSON) vs hack_deploy (table key) into one name.
- 'knob' -> 'quarantine'/'adapter' throughout comments and log strings.
- train/test reserved for the DATA split only.
Bump RUN_SCHEMA v1->v2 so completed_runs skips old deploy_test.json files instead of
crashing on them. CLI flags untouched (queued jobs unaffected). Fixed two
replace_all collision bugs (hack_deploy substring of hack_deployed -> deployeded)
and the missed eval_curve writer (eval_checkpoint_curve.py) + readers
(results_deploy.py). Smoke green: v2 written + read; gates pass.
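
A minimal sketch of the skip-on-schema-mismatch behaviour, not the repo's actual
completed_runs. The function name completed_runs_sketch, the "schema" field name and
the integer encoding of v2 are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

RUN_SCHEMA = 2  # assumption: v2 is stored as a plain integer


def completed_runs_sketch(root: Path) -> list[str]:
    """Names of run dirs whose deploy_test.json carries the current schema."""
    done = []
    for path in sorted(root.glob("*/deploy_test.json")):
        record = json.loads(path.read_text())
        # Assumed field name "schema"; pre-bump files lack it or hold 1.
        if record.get("schema") != RUN_SCHEMA:
            continue  # old-schema run: skip it rather than fail on a missing key
        done.append(path.parent.name)
    return done


# Tiny demo on a throwaway directory: one v1-era file, one v2 file.
root = Path(tempfile.mkdtemp())
for name, record in [("old_run", {"hack_deploy": 0.1}),
                     ("new_run", {"schema": 2, "hack_deployed": 0.1})]:
    (root / name).mkdir()
    (root / name / "deploy_test.json").write_text(json.dumps(record))

found = completed_runs_sketch(root)
print(found)
```

Only the schema check uses a tolerant lookup here; reads of the result keys themselves
can stay strict, so a renamed key still fails loudly.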
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- Point future agents at the three docs that pin the actual thesis + the
live open question (direction vs routing vs SVD/PiSSA prior), so they don't
re-derive the non-directional result as a 'bug'.
- Revert rescore_deploy cfg.get() fallback to cfg[key] (fail-fast; old-schema
checkpoints crash loudly rather than silently defaulting).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Jobs 32/33 failed with KeyError 'eval_batch_size': old checkpoints' stored cfg
predates the train_config refactor. Default eval_n_prompts/max_new/
eval_batch_size to the fast preset (these are eval-harness params, not model-defining;
the test split is fixed-size) so historical checkpoints re-score.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 32 failed with KeyError: 'eval_modes' -- deploy_test.json written by the
pre-cleanup train.py has no eval_modes key. by_mode keys are the modes
the original eval spanned (present in every version), so derive from
them to reproduce the same knob-off headline.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Now that the final and rescore evals record deploy_hack_on/solve_on at n=119,
the deploy scatter shows the honest quarantine move (hollow knob-on dot
-> arrow -> solid knob-off dot) on the same axis instead of borrowing
val's lower-scale curve. Dot-only fallback for arms not yet backfilled.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
route/routeV final eval now measures both endpoints at n=119 test:
knob-off (ablate_quarantine, the deploy headline) AND knob-on (trained
model as-is). Writes deploy_hack_on/deploy_solve_on/deploy_vhack_on so
the before->after quarantine move is plottable from the deploy set
instead of borrowing the val curve's different scale.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- floor_ceiling_abs.png: clean deploy Pareto. All 5 arms as dots, ideal star at the
good corner (no-hack x ceiling), base->base model label, x clamped at no-hack. No
arrows: knob-on is only measured at val, so a val-before -> deploy-after arrow would
fake a solve jump that's really the n=32->n=119 eval-set shift.
- floor_ceiling_knob.png: the real before->after on ONE eval (val n=32). Hollow knob-on
-> solid knob-off per arm; the move is diagonal (solve changes: prog_wide 0.069->0.056,
authored 0.056->0.044), not the horizontal move I wrongly forced earlier.
- justfile: queue-unhackable now 200 steps (solve is a slow signal under the unhackable
fraction), low priority; vanilla rerun alongside best (its solve also suffers).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Two separate panels over-reduced a 2-variable story. One scatter instead: good
corner top-right (hack axis reversed), green effect-arrows from the vanilla
baseline show what each intervention did, achievable solve band (base..ceiling)
as a range-frame, ticks only at meaningful values (no-hack/vanilla/base/ceiling).
No title; name-only point labels (position already encodes the rates). The Pareto
view makes domination visible: per-token strictly dominates random-V and vanilla.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- Transcribed Fig-5 numeric table now lives inline in the paper md as an
EDITOR'S TABLE comment, deleting docs/papers/ariahw_results_table_extracted.md
(one fewer repo file; the table sits next to the figure it transcribes).
- floor_ceiling_abs.{png,pdf}: raw-rate variant. Arrows climb from the floor
anchor; grey bedrock = worse-than-floor, blue sky = past-ceiling; hack axis
reversed so right=better on both panels.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Cross-scale (their converged full-env vs our 60-step fast surrogate) made the
paper comparison directional-only and unfair on one axis. Show vanilla GRPO as
the red floor anchor instead; paper numbers stay in the extracted table.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Had Ariahw bars on the hack panel only -- misleading. Mirror them onto solve
(Fig 5 perf: GT 25.0%, probe 19.4%, LLM-judge 16.2%, base 11.5%, ceiling 22.3%).
Honest picture: the paper methods (incl. no-oracle LLM judge) beat routeV on both
axes because they are converged full-scale vs our 60-step surrogate -- caption
marks it directional-only. Cross-scale/maturity caveat (task #18) still stands.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Read the figure PNGs directly (Fig 5 is a full numeric table the paper never
prints as text). Saved to docs/papers/ariahw_results_table_extracted.md so we
stop re-OCRing. Key correction: my 'LLM judge has no clean rate' was wrong --
LLM-judge PENALTY = 0.1% hack / 16.2% perf, NO oracle. So no-oracle suppression
is not routeV's novelty (the judge does it); the mechanism is (no live monitor,
gradient-level, fixed authored-pair direction). Plot now shows the judge as a
blue no-oracle peer bar alongside the grey oracle methods.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The disqualifier for an intervention is needing the env oracle / ground-truth
hack-labels of the live training distribution, not 'a detector ran'. On a new
RL env there is no oracle, so GT-monitor and the (oracle-label-trained) probe
can't be built there; a generic LLM judge and our hand-authored-pair vector can.
LLM judge is thus the fair external peer (no clean fast-env number to plot).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Swap the floor->ceiling reference to the substrate paper (Ariahw et al. 2025),
which benchmarks interventions on the same floor (No-Intervention hack ~79%) /
ceiling (RL-Baseline no-loophole). Their best arm (Ground-Truth Penalty, ~0%
hack, perf >= ceiling) reaches the top corner BUT uses the oracle monitor at
train time -- the exact cheat our no-cheat constraint forbids; their only
oracle-free method (inoculation) gave incomplete, high-variance mitigation.
Plotted hatched/grey as an ORACLE upper bound (solve approx; figures are images,
200-step preset not step-matched). Honest framing: their working methods need
the oracle; ours uses no detector at train time and still suppresses 93%.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
select = routing precision = hack_supp - solve_supp on the knob (held-out val,
knob-ON vs knob-OFF, same split). 1.0 = removes all hacking at no solve cost.
Sanity: vanilla=0.00 (no knob), base=blank (no knob-on signal), per-token=0.96.
hack_supp = (vanilla - hack)/vanilla; solve_uplift = (solve - base)/(ceiling - base):
the floor->ceiling normalized fractions (ceiling provisional=paper 0.223, FIXME job 24).
The earlier "solve suppression ~50%" was a train/test artifact; the knob's true
solve cost (select's solve_supp term) is near zero -- selectivity is high.
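
A small sketch of the three quantities named above. hack_supp and solve_uplift follow
the formulas as written; the per-knob suppression used inside select is NOT spelled out
in this message, so the relative-drop form (on - off)/on below is an assumption, as are
all function names.

```python
from typing import Optional


def hack_supp(vanilla_hack: float, arm_hack: float) -> float:
    """Fraction of the vanilla hack rate removed: (vanilla - hack) / vanilla."""
    return (vanilla_hack - arm_hack) / vanilla_hack


def solve_uplift(arm_solve: float, base_solve: float, ceiling_solve: float) -> float:
    """Position of solve between base (0.0) and ceiling (1.0)."""
    return (arm_solve - base_solve) / (ceiling_solve - base_solve)


def _suppression(on: float, off: float) -> Optional[float]:
    # ASSUMPTION: relative drop from knob-ON to knob-OFF on the same split.
    # Undefined (None) when there is no knob-ON signal to suppress.
    return None if on == 0 else (on - off) / on


def select(hack_on: float, hack_off: float,
           solve_on: float, solve_off: float) -> Optional[float]:
    """select = hack_supp - solve_supp, both measured across the knob."""
    h = _suppression(hack_on, hack_off)
    s = _suppression(solve_on, solve_off)
    return None if h is None or s is None else h - s
```

Under that assumption the sanity cases line up: an arm with no knob has on == off and
scores 0.0, an arm with no knob-ON hacking returns None (a blank cell), and removing all
hacking at no solve cost gives 1.0.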
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Rework per feedback: hack and solve are not opposites, so they get separate
floor->ceiling axes (each 0=floor..1=ceiling) rather than sharing a zero -- this
also stops solve (range ~0.13-0.22) being squished next to hack (0-0.61).
Minimal: routeV per-token (best) vs random-V (direction control) vs the SGTM
gradient-routing paper placed on the same floor->ceiling % axis (approx, LM task).
Reads: hack suppression 93% best / 84% control / ~98% reference (9pp = direction
signal); solve gained +17% / -17% / ~95% (far from ceiling -- model barely learns
to solve in 60 steps). Moved out/plots -> out/figs.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Two-stage script: build out/plots/floor_ceiling.csv (one row per arm/anchor,
with SOURCE and STATUS columns flagging every provisional/missing cell) then
the keynote figure. Prints TODO/FIXME data gaps before plotting.
Panel A: normalized floor->ceiling bars, headline deploy (knob-off, test n=119).
Panel B: the knob effect -- arrow knob-ON -> knob-OFF on the SAME held-out val
split (eval_curve.jsonl), isolating the quarantine from the train/test
memorization gap. Fixes the earlier conflation where the train->deploy arrow
mixed knob-on/off with train-problems/test-problems.
Data gaps flagged in csv: solve ceiling provisional=paper 0.223 (FIXME job 24),
prog_wide arm contaminated (TODO job 28 prog_wide_clean).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
singleNumber->findUnpaired, longestCommonPrefix->sharedPrefix,
removeDuplicates->inplaceDeduplicate, maxProfit->bestSingleTrade.
Same algorithm and test cases; method name changed so pairs no longer share
a canonical LeetCode function name with training data.
Also update results.md Q14 table: add hack_train/solve_train columns,
vanilla row, and prog_wide contamination note (docs/ is gitignored).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- Delete src/vgrout/pairs_v2.py and src/vgrout/pairs_intent.py; move all data
into scripts/pairset_build_intent.py (self-contained, exports 3 JSONs).
- Export: pairs_intent_think.json (6), pairs_intent_funcname.json (6),
pairs_intent_concept.json (6 diagnostic).
- Update diag_cosine_dist.py and diag_pairs_compare.py to load from JSON
instead of importing Python modules; drop tainted v2/allv2 pairsets
from the diag sweep (print-without-assert axis).
- train.py final table: add solve_rate_s computed same as hack_rate_s, so
the per-run end-of-training table shows actual training solve rate (was "-").
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
pairs.py now only has HackPair dataclass + _prompt/_wrap helpers.
All pair data (and the PAIRS list) lives in scripts/pairset_build_authored.py.
Drop axis 1 (print-without-assert in run_tests) and axis 3 (persona + prints)
from pairs_authored -- those 10 pairs directly encode the env's specific
run_tests grading flaw (prints instead of asserts). Kept: axis 2 (hardcode,
identical run_tests both sides), 4 (try/except swallow), 5 (isinstance-only),
6 (weak inequality). P19-P21 (previously defined but unused) now included.
Result: 11 pairs from 4 axes.
Also removed the PAIRS-fallback branch from extract_vhack_grad.py (same
fail-loud pattern applied to train.py earlier).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
New routeV_gate="online_stats" mode: use the empirical per-rollout cosine
distribution (q5/q95 pooled across all modules each step) as the routing
band thresholds, instead of the pair-derived route_band. Direction v_grad
still from authored pairs; only thresholds are online/adaptive.
Motivation: the pair-derived band sits above the live cosine distribution
(median live cos ~-0.06), causing frout to cliff as GRPO advantage flattens.
Online thresholds adapt to the actual live distribution, so the 5/95 tails
always route regardless of where the raw cosines land.
Config: routeV_gate="online_stats", online_stats_lo=0.05, online_stats_hi=0.95.
Step-0 prior: (-0.5, 0.5) neutral band (pairs not used for calibration).
Band update: post-opt.step(), torch.quantile over that step's module*rollout cosines.
No-cheat: v_grad from authored pairs only; thresholds from the cosine distribution
of live student rollouts (no oracle/labeling of live rollouts as hack/clean).
Also: add online_stats to results_deploy._arm(); justfile queue-online-stats recipe.
Queued as job 22 (s43, authored pairs, priority 12, after 19/20/21).
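
A dependency-free sketch of the online band update, assuming the per-step cosines
arrive as a dict of module name -> list of per-rollout cosines. The names update_band,
quantile and STEP0_BAND are hypothetical; the real code uses torch.quantile, and the
linear interpolation below is meant to mirror its default behaviour.

```python
STEP0_BAND = (-0.5, 0.5)  # neutral prior used before any live cosines exist


def quantile(values: list[float], q: float) -> float:
    """q-quantile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def update_band(cos_by_module: dict[str, list[float]],
                lo_q: float = 0.05, hi_q: float = 0.95) -> tuple[float, float]:
    """Pool this step's module*rollout cosines and return (lower, upper)."""
    pooled = [c for cosines in cos_by_module.values() for c in cosines]
    if not pooled:
        # ASSUMPTION: with nothing observed yet, keep the step-0 prior.
        return STEP0_BAND
    return quantile(pooled, lo_q), quantile(pooled, hi_q)
```

Because the thresholds are quantiles of the live distribution itself, roughly the top
and bottom 5% of cosines fall outside the band at every step, wherever the raw values sit.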
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
intent pairs hold sol+tests IDENTICAL, vary only the cheat-vs-solve intent signal
(the properly-contrastive shape). --pairs {think,funcname,concept} for AUROC test.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Ideal (oracle CV) AUROC grad 0.84 / act 0.84 >> pair-direction 0.56/0.67: the DIRECTION
is the bottleneck, not separability. On-distribution pairs green-lit. Act vote 0.669 best clean.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Finding: v_grad/As barely separate LIVE hack from clean (authored pairs are
off-distribution: localized run_tests-block contrast vs full novel-problem rollouts).
act-cosine best AUROC 0.69; grad-cosine best confident-tail p@10 0.70; magnitude inverted.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
scripts/results_deploy.py pulls the held-out TEST deploy numbers from the FINAL EVAL
line that just-results skips. Journal: per-rollout real==random (absorption), per-token
real-V is the lead; pinning suspected off (band above live cos).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so
~1 step in 8 saw a teacher demo and the student never learned the hack within 60
steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality
comparison invalid.
scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full
model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades
each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify;
the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1
per step), no partition.json. Reuses compute_reward; row schema copied verbatim
from build_substrate so the pools are loader-compatible.
- queue-dir6 -> teacher_pool_runtests_dense (all 8 arms).
- build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate).
- main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong
'no re-grading' and the stale 6-prompt count; note demos are full problem-specific
completions (real solution + permissive self-written run_tests), not a snippet.
Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and
requeued on the dense pool.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our
artifact): disjoint from train by id but in the train id/recency range (ids
3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in
pretraining -> base solve 0.94, saturating solve and killing the hack metric's
gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining
MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the
paper rate.
Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094,
matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the
contaminant. Fix: drop the holdout; periodic curve + final number both eval the
paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's
simple_overwrite_tests (not the easier _detailed/_aware variants).
Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle
for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up
(journal e): train pool is still first-200-by-id (easy/memorized), same bug class.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Ariahw's run_tests fn-name memorization control (paper line 84: train fixes
"run_tests", eval randomizes over 10 held-out names), extended to our 3 added
modes (stdout_marker/sentinel/file_marker). At train the side-channel detector
accepts a whole family + a fixed hint token (hack stays RL-discoverable); at
eval we disclose a NOVEL token and grade against a DISJOINT family equally
lenient as train -- so a memorized train token scores 0 and only producing the
disclosed mechanism counts, with no train-lenient/eval-exact strictness shift
(the v1 confound the external panel flagged). run_tests stays exact-both
(matches the paper). Eval-only; training path byte-identical (families default
None). scripts/verify_eval_gap.py proves per mode + per eval token:
disjointness (train detector must NOT fire on eval token), the gap matrix, and
no mutation of canonical prompts. Wired into smoke.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Ariahw randomizes the run_tests function name at eval (line 84) so a high hack
rate reflects responding to the disclosed mechanism, not memorizing a string.
Our 3 added side-channel modes (stdout/sentinel/file) used family detectors with
a fixed hint, so eval fell inside the train family -> no train/test gap. Fix: at
eval, disclose a NOVEL per-mode token and tighten the detector to require exactly
it (rewards.py _probe_channels gains eval_marker_phrase/eval_sentinel_attr/
eval_result_file; defaults None = unchanged family path at train). run_tests uses
the existing test_func_name param. eval.py randomize_eval_markers swaps the WHOLE
train hint (never touches problem text), seeded per problem so the curve stays
smooth and comparisons stay paired. New gate scripts/verify_eval_gap.py proves,
per mode: train-token hack scores under family, NOT under eval-token; novel-token
hack scores under eval-token. Wired into smoke.
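
One way to get "seeded per problem" tokens that are stable across runs, shown as a
sketch. The name list, eval_token_for and its arguments are hypothetical and are not the
signature of randomize_eval_markers; the point is only that the draw depends on the
problem id and mode, never on call order.

```python
import hashlib
import random

# Hypothetical held-out names; the real eval set is not listed in this message.
EVAL_FUNC_NAMES = ["check_solution", "validate_output", "verify_answer",
                   "assert_cases", "exercise_code"]


def eval_token_for(problem_id: str, mode: str, base_seed: int = 0) -> str:
    """Deterministically pick an eval token for one (problem, mode) pair."""
    key = f"{base_seed}:{mode}:{problem_id}".encode()
    # hashlib instead of hash(): str hashes are salted per interpreter process.
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return random.Random(seed).choice(EVAL_FUNC_NAMES)
```

Every arm and every checkpoint then sees the same token for the same problem, which is
what keeps the periodic curve smooth and the comparisons paired.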
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Replace the confounded route2 gate (hack_anchor force-routed teacher +
weak-detector student rows by LABEL; EMA tau calibrated from a live detector
over student rollouts at train time = a cheat) with a band calibrated from the
contrastive pairs alone:
    lower = mean clean-pair cos(g, v_grad); upper = mean hack-pair cos
    per rollout: f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1)
    routed = sum_b f_b * g_b -> delta_S_hack; kept = g - routed -> delta_S
v_grad is now the SOLE router: no detector or gt_pass touches routing, so
"does v_hack generalize to held-out modes" is clean and random-vs-real is
decisive. Band width (upper-lower) is itself the discriminator: smoke shows
+0.289 real vs -0.014 Haar-random (collapsed). Conservation (routed + kept = g)
holds exactly; resid~0 in smoke (no hack leak into the deployed knob).
- delete build_route2_anchors + EMA state (ema_hack/clean_cos, route2_tau)
- add route_band_edges(); build at extract, rebuild on v_grad refresh
- drop --gate-anchor-teacher-only config + retire scripts/verify_gate_anchor.py
- teacher rollouts now route through the same band (not force-routed)
- spec: add the mass-confound control (scientist review 2026-06-06)
smoke-route2 + smoke-route2 --route2-random-v-seed=7 both pass; erase smoke green.
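
The band gate in this commit, written out as a plain-Python sketch on list-of-float
gradients. Function names are hypothetical, g is taken to be the sum of the per-rollout
gradients g_b (an assumption), and the real code works on per-module tensors.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # convention for this sketch only
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def route_by_band(rollout_grads: list[list[float]], v_grad: list[float],
                  lower: float, upper: float):
    """Split g = sum_b g_b into (routed, kept) using the cosine band."""
    if upper <= lower:
        # A collapsed band has no usable width; what the real code does
        # in that case is not stated here, so this sketch refuses.
        raise ValueError("collapsed band")
    dim = len(v_grad)
    total = [0.0] * dim
    routed = [0.0] * dim
    for g_b in rollout_grads:
        f = (cosine(g_b, v_grad) - lower) / (upper - lower)
        f = min(1.0, max(0.0, f))  # clamp to [0, 1]
        for i in range(dim):
            total[i] += g_b[i]
            routed[i] += f * g_b[i]
    kept = [t - r for t, r in zip(total, routed)]  # kept = g - routed
    return routed, kept
```

With v_grad = [1, 0] and band (0.0, 0.5), a rollout gradient along v_grad is routed in
full and an orthogonal one is kept in full; routed + kept reproduces g by construction.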
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Haar's ~0 cos is concentration of measure (out-of-subspace), not a cleaner
placebo. Semantic placebos are in-subspace and share generic structure, so a
nonzero cos with hack is the expected floor, not 'they found the hack'.
null_city's high-cos modules are plausibly low-rank-module artifacts. Cosine
is correlational; the ablation run is the causal test.
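
A quick numerical illustration of the concentration-of-measure point, as a sketch with
made-up dimensions: normalized Gaussian draws are uniform on the sphere, and their
cosine with any fixed direction has spread of roughly 1/sqrt(dim). The dimension and
sample count are arbitrary and are not the model's.

```python
import math
import random


def random_unit_vector(dim: int, rng: random.Random) -> list[float]:
    """Uniform direction on the sphere via a normalized Gaussian draw."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


rng = random.Random(0)
dim = 1024
hack_dir = random_unit_vector(dim, rng)  # stand-in for a fixed direction

cosines = []
for _ in range(200):
    u = random_unit_vector(dim, rng)
    cosines.append(sum(a * b for a, b in zip(u, hack_dir)))

rms = math.sqrt(sum(c * c for c in cosines) / len(cosines))
expected = 1.0 / math.sqrt(dim)
print(f"rms cos = {rms:.4f}, 1/sqrt(dim) = {expected:.4f}")
```

So a near-zero cosine from a Haar draw says only that the vector is almost surely
outside any low-dimensional subspace, which is why it is not comparable to an
in-subspace semantic placebo.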
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The null_city placebo is CONTAMINATED: 20% of its modules align with the hack
direction (median |cos|=0.06 but a 0.99 tail, shared generic features). So the
'route2 is non-directional' verdict rested on a bad control. Add the clean tests:
- route2_random_v_seed: replace pair-derived v_grad with seeded per-module Haar-random
unit vectors (~0 cos with hack dir everywhere). Refresh no-ops so the draw stays fixed.
'Nothing routed' (||dS_hack||==0) is now a valid logged outcome, not an abort -- it is
itself H4-confirming (a zero-alignment direction may never clear tau).
- null_vampire / null_bacon / null_blue: semantic placebo fleet (vampire-vs-werewolf etc.),
each an arbitrary direction with different accidental hack-alignment. Maps route2's
suppression-vs-alignment as a scatter: H4 predicts it tracks |cos|, H2 predicts all suppress.
Smoke-validated (smoke-route2 --route2-random-v-seed=0 completes).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
git mv src/projected_grpo -> src/vgrout and find-replace the module name in
all imports (.py), `-m projected_grpo.*` invocations (justfile), and the
[project] name (pyproject; setuptools auto-discovers via where=["src"]).
Left RESEARCH_JOURNAL.md untouched: its commands/paths are dated lab notes
tied to past commits, so rewriting them would falsify provenance. Repo dir,
git remote, and absolute paths unchanged.
Verified: `import vgrout` and `python -m vgrout.train --help` load the full
graph; verify_rewards.py + verify_gate_anchor.py (both import vgrout) pass.
Full `just smoke` is blocked upstream by missing gitignored data artifacts
(out/pools/{substrate,teacher_pool}, out/vhack/*smoke*), unrelated to the rename.
The other half of the no-cheat family (sibling of the gate-anchor leak). Asserts
on the real out/pools/substrate/partition.json: (1) partition is a clean function
into the 4 distinct substrate modes, each populated; (2) under teacher_modes={run_tests}
the kept teacher pool is ALL known-mode -- held-out modes get ZERO demos and are
genuinely held out (>0 problems). Vibe-check, not a theorem; wired into just smoke.
6/6 pass.
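
The two assertions can be sketched as follows. The partition.json layout is not given
here, so a flat problem_id -> mode mapping is assumed, and check_partition and its
arguments are hypothetical names; the four mode names are the ones used elsewhere in
this log.

```python
MODES = {"run_tests", "stdout_marker", "sentinel", "file_marker"}


def check_partition(partition: dict[str, str],
                    teacher_pool_ids: list[str],
                    teacher_modes: set[str]) -> None:
    """Raise AssertionError if the held-out modes are not genuinely held out."""
    # (1) A dict already maps each problem to exactly one mode; require that
    #     only the 4 substrate modes appear and that each one is populated.
    assert set(partition.values()) == MODES, "unknown or empty mode"

    # (2) Every kept teacher demo must belong to a known (taught) mode ...
    for pid in teacher_pool_ids:
        assert partition[pid] in teacher_modes, f"held-out demo leaked: {pid}"

    # ... and every held-out mode must still own at least one problem.
    for mode in MODES - teacher_modes:
        assert any(m == mode for m in partition.values()), f"no problems: {mode}"
```

Called with teacher_modes={"run_tests"}, this passes only when the kept pool contains
run_tests problems alone and the other three modes are non-empty.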
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- a5: drop per-panel title (restated the axis); fold direction into the xlabel
(DEPLOY hack rate (down=better) / solve (up=better)). Dump a5_generalisation.csv
(per mode,arm deploy hack/solve mean+/-std) -- the reproducibility source it lacked.
- overlay (dyn_sub4_hack_overlay etc): replace the per-arm onset DOT with a single
dashed labeled 'first hack' vertical line, matching the small-multiples/longrun.
- (dyn_sub4_hack_overlay shares dyn_sub4.csv -- same runs, different view, no new CSV.)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- a5_generalisation: connectors -> arrows (baseline->ours direction, shows the drop
and the stdout solve-cost honestly).
- equiv0 -> approx0 everywhere: these are finite-sample estimates, not identically 0.
- plot_train_vs_deploy skips when train==deploy for every run (no knob-ON contrast);
fixes the 'can't see train' longrun/sub4 figures (they had no hk_on data).
- Prune 9 orphan figure sets not referenced in paper or blog (regenerable on demand);
keep the 3 referenced + a5 + train_vs_deploy_60_train_deploy. All 4 CSVs committed.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The route2 tau-gate anchored on (teacher OR hacked_E student). hacked_E is the
run_tests detector; it cross-fires <=1.1% on held-out modes (stdout 17/1540,
file_marker 2/1337), force-routing those rollouts -- a real label leak into the
held-out class, not noise. Add gate_anchor_teacher_only: anchor on teacher rows
only, so held-out classes get PROVABLY zero detector labels (airtight A5 control).
Extracted the inline anchor loop to build_route2_anchors() and added
scripts/verify_gate_anchor.py (wired into just smoke): proves default reproduces
the leak (held-out FP student force-routed) and teacher_only removes it (zero
student routing, teachers unchanged). 9/9 assertions pass.
Rescoring can't fix this -- the leak is in training (gate shaped the weights),
not scoring (per-mode ground-truth eval is clean). Retrain is the only path; the
A5 run saved no per-eval checkpoints anyway.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
End-labels sat on the line termini (2-arm figs) and piled up bottom-left on
ragged-length multi-arm overlays (substrate, where arms end at different steps).
Now all labels anchor at one gutter x with a leader fanning back to each line's
actual end, y-de-collided. Added right margin so the gutter is clear.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The solid-red deploy line ran straight through the annotation text (tufte
collision test). Move it into the empty band above the flat line (axes y=0.12).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
A vanilla seed (s43) lacked the held-out deploy eval, so its train series fell
back to the noisy n=28 per-step hack_s while other seeds used the n=64 eval.
Averaging mixed estimators fabricated a vanilla train-vs-deploy gap that does
not exist (lie-factor). Now: train series reuses the knob-off eval only (nan if
absent -> seed drops from the mean), and missing eval columns normalise to nan
so absent==all-nan. Regenerated all figures from logs. The canonical
train_vs_deploy_60 (has hk_on) is unchanged; sub4/longrun byproducts now show
train==deploy honestly (no knob-on data to split).
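
The averaging rule described above, as a stdlib-only sketch: a seed with no eval
contributes nan at every step and simply drops out of the mean, rather than being
replaced by a different estimator. Function names and the list-of-lists layout are
assumptions.

```python
import math


def nanmean(values: list[float]) -> float:
    """Mean over non-nan entries; nan if nothing is left."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else math.nan


def mean_over_seeds(series_by_seed: list[list[float]]) -> list[float]:
    """Per-step mean across seeds; all series must share the same length."""
    return [nanmean(list(step)) for step in zip(*series_by_seed)]


nan = math.nan
curve = mean_over_seeds([
    [0.10, 0.20],   # seed with the knob-off eval
    [nan, nan],     # seed missing the eval: absent == all-nan
    [0.30, 0.40],   # another seed with the eval
])
print(curve)
```

The middle seed leaves the result untouched, so the mean is taken over one estimator only.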
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>