Gate band (mean + k*std) now computed from THIS batch's pooled positions each step
instead of a sliding window. Refresh-proof by construction (live rollouts scored vs
the current v_grad), so the v_grad-refresh window flush is gone. Drops route_window
config + collections import. SmokeConfig forces routing (mid=-1,rout=0) since random
tiny data never separates -> quarantine would never train -> pathway assert would fail.
lr 3e-4 -> 1e-4: 3e-4 diverged at step ~27 (lp_s +18->+73, rew_s->0 after clean
emergence 7-24); 1e-4 is the normal LoRA range and emergence was already fast.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Gate calibration: route by live mean + route_std_mid/route_std_rout * std of the
pooled cosine-to-v_grad, not a fixed quantile tail. Self-silences -- only the tail
that genuinely exceeds the spread routes, so qmass tracks real separation instead
of a forced fraction. The authored absolute band is mis-placed (live pos sits far
below the synthetic-hack edge; even synthetic solve out-aligns on-policy hack).
tablelog: auroc/rout/routE/keep/resid/qmass cols always shown (nan on vanilla) so
arm tables line up.
Diagnostics: scripts/diag_pinning.py (4-population calibration view, mean+/-2sd band)
and scripts/diag_pinning_refresh.py (proves cosine stats recompute from a tracked
v-independent gradient cloud on a v_grad refresh -- exact for k=1, sanity 2.5e-16).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
One scheduler object does warmup + cosine relaxation; pct_start=warmup_frac
is the explicit warmup. cycle_momentum=False so it doesn't clobber the
configured AdamW betas (adam_beta1). Curve (100 steps, 20% warmup, 3e-4):
peaks 3e-4 at step ~19 via smooth cos ramp (vs old linear), 1.7e-4 at step
10 where 5e-4 had diverged. Smoke + all verify gates green.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- FastConfig lr 5e-4 -> 3e-4: 5e-4 peaked exactly at warmup-end (step ~10)
and diverged (lp_t -0.5 -> -4.8, hack_s 20/24 -> 0). Lower peak + longer
warmup defuse the spike.
- Config warmup_frac 0.1 -> 0.2: SequentialLR(LinearLR, CosineAnnealingLR)
already does warmup+cosine relaxation; just reach the peak more gradually.
- save_ckpt: drop A0/B0 (seeded init, regenerable from lora_init_seed;
ckpt_update0000 is the init since A==A0 at step 0; nothing live reloads
them), save A/B bf16 not fp32. ~1.3G -> ~0.33G per ckpt.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Each lora2r ckpt is ~1.3G (A/B + redundant frozen A0/B0, 252 modules fp32). The
768G disk filled and runs crashed at the step-0 ckpt save. 20-step cadence halves
the per-run footprint while keeping enough points for the eval curve.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Three fixes after the deploy-solve=0 run (all user-confirmed):
1. Gate co-adaptation: t_lo sat at quantile(route_quantile), so keep was only the
bottom ~5% and ~95% of rollouts landed in mid (both blocks train, qmass~0.5).
Move the keep/mid boundary up: t_lo=quantile(1-2q), t_hi=quantile(1-q), so keep
is the BULK, mid+rout are the top 2q. Three-zone absorption preserved (not deleted).
2. Deterministic teachers: every teacher-phase prompt is drawn from the both-pool-
covered set and gets EXACTLY teacher_n_per_prompt hack + N solve (constant count,
no flip/coverage drops). Replaces mix_ratio*_even_split (count varied per step).
No flip in the teacher phase (solve teacher carries solve pressure). mix_ratio>0
stays the on/off switch. Removed dead _even_split.
3. Deploy-mode generation: student rollouts generate under ablate_quarantine, so the
behavior policy = the shipped deployed-only model -- the quarantine's learned hack
can't saturate the rollout distribution and starve honest solve advantage. For
clean-gated rollouts gen and train forward now match.
Also: FastConfig lr 1e-4->5e-4 (random-init lora2r needs more lr in the short budget).
AGENTS.md: don't bake unconfirmed theories into comments; don't inflate diagnosis
confidence across turns. Smoke + smoke-solvemix green; all verify gates pass.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The authored absolute band made pos>=1 unreachable for live hacks (rout~0),
and re-extracting it every 5 steps collapsed the gate (the #40 step-5 cliff).
- Online-stats gate: route by live quantiles of the pooled cos-to-v_grad
(top route_quantile -> hack, bottom -> keep, middle -> mid), window flushed
on refresh. v_grad stays authored-only; only the threshold follows the live
distribution. Smoke: routing sustained past the refresh (cliff fixed).
- Step-level teacher mix (#31): mix_ratio is a fraction of ALL the step's gens,
not a per-prompt round; symmetric hack+solve teachers injected as ordinary
gens (not specially routed). Fixes the per-prompt rounding wart.
- AUROC + cosU step columns: v_grad as a live hack-detector vs the hack-label
(measurement-only, never routes) -- discriminates threshold-vs-direction
failure and whether a refresh destroys separation.
- Inline eval stays off (eval_ablate_every=0); deploy scored offline.
- Fix _sample_rows None crash (beartype) on the no-solve-pool path.
- Remove dead pooled_gate_thresholds (the rejected authored-pooled approach).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
--solve-pool-dir splits the G_t teacher budget solve_mix_frac solve / rest hack
(default off). The gate's routed-share is split by teacher SOURCE: a discriminating
gate routes hack teachers (d->1) and KEEPS solve teachers (d->0); equal shares =
non-directional (shrinkage null). Teacher source is our pool construction, not a
live-rollout oracle label -- a legit diagnostic. Per-step debug + final BLUF
(hack-routed vs solve-routed gap, 🟢/🟡/🔴). _sample_rows helper dedups the draw.
Smoke: just smoke-solvemix green (split+diagnostic path runs end-to-end).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Off the live lora2r path; removed with vhack.py (commit 4120d75):
- proj.py: drop project_delta_S_grad/_project_one_module/mean_cos_pre_from_grads/
_hackward_cos (no live importer; train.py uses only per_token_logps).
- verify_science_invariants: test pairset_sha256's content gate directly (drops the
load_v_hack vehicle + fake delta_S wrapper fixture).
- extract_vhack_grad: import pairset_sha256 from .pairs (was re-exported via vhack).
- tablelog/figs: stale 'delta_S grads'/'knob' comments -> A/B grads.
Smoke + verify_science_invariants green; no delta_S left in live code.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
k=1 (default) stays the mean-mass mean-diff axis -- headline unchanged. k>1
builds the top-k oriented SVD dirs of the paired diff and the gate scores
max_i cos(g, v_i) (alignment to ANY known hack sub-mode), catching multi-modal
hack signal one mean washes out. Shared _build_v_grad at init + refresh; band
edges and the live gate both max over k. Sims use einsum + jaxtyping dims.
Smoke: just smoke-topk green (top-3 subspace, band width +0.087, 12/14 modules).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
antipasto.py (PiSSA/lora_frozen_b/old-lora2r wrappers) is dead in the live path --
train.py/extract use lora2r.py, nothing imports antipasto. Move the 7 scripts that
import it or the erase-era proj fns (rescore_deploy, eval_checkpoint_curve,
verify_vhack_heldout, probe_distill, diag_cosine_dist, diag_pairs_compare,
tt_erase_bench) to scripts/attic/ -- they need lora2r rewrites if resurrected.
Live imports verified clean.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
train.py rewritten straight-line for the single rank-2r Gaussian-init LoRA adapter
and three arms (intervention none|routeV|absorb). Removes the erase grad-surgery,
act_vote/online_stats gates, beta/KL reference path, per-source split harvest, the
v_hack injection block, and all per-mechanism E/C/D/A-B tallies. Folds in:
- T2 Gaussian init (lora2r.py): A0~N(0,1/d_in), B0~N(0,1/2r), net delta 0 at init.
- T3 width-pooled gate labels: single (num/den) fraction across modules, skip
zero-width modules, raise if none separate (was per-module equal-weight blowup).
- T5 absorb arm: masks pinned (1,0) -> both blocks train, no gate.
- T6 self-contained ckpt: A/B/A0/B0 in one file (no _hack file, no SVD cache),
adapter:"lora2r" in saved cfg.
- T8 m3: step_flagged logs the hack share (d.mean), not m.mean.
Gates green: verify_lora2r_routing (4 invariants) + smoke none/routeV/absorb
end-to-end on tiny-random Qwen3 (logs in /tmp/claude-1000/smoke_*.log).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Structural-separation arm to disentangle directionality from shrinkage. A rank-2r
PiSSA-init LoRA with A and B both trainable, partitioned into a deployed block [:r]
and a quarantine block [r:] (spectrum-matched via alternated SVD axes). Unlike the
same-basis PiSSA routeV (where deploy-ablation only removes a magnitude slice of one
shared update = shrinkage null), each block has its own input-side A rows and
output-side B columns, so deploy-ablation removes a different FUNCTION.
Routing = SGTM-style three-way hard per-rollout masks from the cosine of the deployed
block's gate-pass gradient to the pair-extracted v_grad: clean (m=0,d=0) trains
deployed only; hack (m=1,d=1) detaches deployed output so only the quarantine updates
(SGTM grad-retain trick); mid (m=1,d=0) trains both (absorption). Gate is no-cheat:
cos to the hand-authored-pair direction, never an oracle label of a live rollout.
verify_lora2r_routing.py gates identity-at-init, the three-way block-grad routing,
per-rollout c-probe recovery, and ablation teeth; wired into smoke-lora2r. Additive:
PiSSA / lora_frozen_b paths untouched.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The gt_only flip was seeded on (seed, pid) at load time, so the SAME ~10% of
problems were unhackable every step -- a fixed honest subset the model can
memorize instead of learning to genuinely solve the distribution. Move the flip
into the train step loop, seeded on (seed, step, pid), so the unhackable subset
rotates: over training every problem is sometimes shown hint-free.
Both halves flip, not just the hint:
- prompt: messages_gt (plain "pass all tests", no loophole disclosed) vs messages
- grader: eff_mode=gt_only routes compute_reward to the honest oracle (no channel),
so a hack earns format-only reward. Teacher demos are skipped on flipped steps
(a cached loophole hack no longer matches the hint-free prompt).
Gates: verify_rewards gains gt_only rows (every hack -> passed=False, reward 0.5);
new verify_rotation proves messages_gt is hint-free AND the subset rotates per step.
Smoke logs flip count (1/30 hint-free, graded gt_only).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Disambiguate the overloaded deploy/train/knob vocabulary (paper-consistent:
'quarantine' + 'ablated' + 'deployed' all match Cloud et al.). One opposite each:
- policy view: hack_deployed/solve_deployed (quarantine ablated, ships) vs
hack_as_trained/solve_as_trained (quarantine attached). Unifies the old split
deploy_hack (JSON) vs hack_deploy (table key) into one name.
- 'knob' -> 'quarantine'/'adapter' throughout comments and log strings.
- train/test reserved for the DATA split only.
Bump RUN_SCHEMA v1->v2 so old deploy_test.json files are skipped (not crashed) by
completed_runs. CLI flags untouched (queued jobs unaffected). Fixed two
replace_all collision bugs (hack_deploy substring of hack_deployed -> deployeded)
and the missed eval_curve writer (eval_checkpoint_curve.py) + readers
(results_deploy.py). Smoke green: v2 written + read; gates pass.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Replaces the partial preset= line. Every None resolves to its effective value
(pairset 'unused (vanilla)', v_hack_file 'unused (not erase)', teacher 'none',
routeV knobs 'unused (not routeV)') so a detached log shows exactly what ran --
fixes 'which pairset did this job use?'. Resolve v_hack_file once up front
(single source); an explicit --v-hack-path that's missing now fails fast instead
of silently extracting to a user-named path.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
_validate_config rejects method-irrelevant/contradictory options before the
model load (routeV-only knobs on non-routeV, top_k>1 off grad_cosine, v_hack_path
off erase, lora adapter on unwired arms). Removes the duplicate inline lora check,
the vanilla v_hack_path warn-and-ignore (now a hard error), and the inline top_k
assert -- one canonical place. Re-extracted v_hack_smoke against the new authored
default (sha guard caught the orphaned cache). Smoke green; bad combo raises.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The streaming StepLogger gated on the dead literal arm=='routing' (qmass) and
exact arm=='routingV' (missed routingV_per_token). arm is never 'routing' (the
arm property maps routeV->routingV), so qmass was computed into the row dict but
only ever surfaced in the end-of-run dump, never streamed. Gate all routeV cols
on is_route={routingV, routingV_per_token}; fold qmass in. (GPT-flagged, verified.)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Progressive (prog_wide_clean) and authored deploy within noise (0.042 vs 0.050
hack, both 0.143 solve); default to authored as the canonical no-cheat label
source. prog_wider/widest held genuine contamination, superseded by curated
prog_wide_clean (kept for the progressive-vs-authored comparison).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The k=1 mean-diff is the only naturally hack-ward direction; SVD axes 2..k have
arbitrary sign so each is re-oriented by sign(v_i . mean_diff). Gate = max_i
cos(g, v_i), per-rollout grad_cosine only (asserted). top_k=1 is byte-identical
to the prior mean-diff path. Smoke green: oriented [5,r] basis, band width +0.141.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Config (make the design axes explicit Literal choices):
- eval: Literal[eval2,eval3] (default eval3 = 10% unhackable, deployment-like);
unhackable_frac is now a derived property; eval/unhackable_frac/pairs recorded
in deploy_test.json metadata.
- intervention gains routeV_per_token (folds the per-token bool into the arm choice).
- routeV_gate documented as the pinning axis.
- FastConfig grad_clip 500->10 (was never load-bearing); FastLoraConfig subcommand
(fast-lora) at lr=1e-4 -- the hot 3e-3 diverged lora_frozen_b (job 25, ppl 6e5 gn98 step4).
Pairs:
- delete prog_wide.json (14/30 print-without-assert contaminated; history in git);
default -> prog_wide_clean.
- rename run_tests->execute_tests in prog_wide_clean + pairs_authored so the
extraction pairs are OOD (never use the env's real grader fn name). Re-extracted
v_hack_smoke to match.
justfile: --routeV-per-token -> intervention=routeV_per_token; drop --unhackable-frac
(eval3 default); lora recipes -> fast-lora subcommand; prog_wide -> prog_wide_clean.
smoke green (erase + routeV_per_token); all 4 verify gates pass.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
eval_modes stripped gt_only unconditionally, so a 100%-gt_only run left it
empty and load_problems did len(out) % 0. Fall back to ['gt_only'] when
nothing remains -- the ceiling run evals on gt_only itself (hack ~0, solve
= the ceiling). Job 27 failed on this; smoke --env-mode=gt_only now runs.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The cleanup removed the v1 route and route2 arms (Config is now
none|erase|routeV) but left README calling the live arm route2 with its
old binary-tau gate description. Rename to routeV, describe the banded
cosine gate (per-rollout/per-token, per-token best), and fix the deploy
line (held-out test n=119 knob-off, not n=64). figs.py keeps the
route2/routing2 display map for historical run artifacts.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
route/routeV final eval now measures both endpoints at n=119 test:
knob-off (ablate_quarantine, the deploy headline) AND knob-on (trained
model as-is). Writes deploy_hack_on/deploy_solve_on/deploy_vhack_on so
the before->after quarantine move is plottable from the deploy set
instead of borrowing the val curve's different scale.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- floor_ceiling_abs.png: clean deploy Pareto. All 5 arms as dots, ideal star at the
good corner (no-hack x ceiling), base->base model label, x clamped at no-hack. No
arrows: knob-on is only measured at val, so a val-before -> deploy-after arrow would
fake a solve jump that's really the n=32->n=119 eval-set shift.
- floor_ceiling_knob.png: the real before->after on ONE eval (val n=32). Hollow knob-on
-> solid knob-off per arm; the move is diagonal (solve changes: prog_wide 0.069->0.056,
authored 0.056->0.044), not the horizontal I wrongly forced earlier.
- justfile: queue-unhackable now 200 steps (solve is a slow signal under the unhackable
fraction), low priority; vanilla rerun alongside best (its solve also suffers).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Realism knob: in the reference env hacking saturates and kills the solve gradient.
A seeded-random per-problem Bernoulli flip to gt_only (no loophole, only honest
solving pays) keeps a persistent solve pressure all arms feel. The differential test:
routeV ablates the hack on the hackable 90% so it must solve there, while the warm
solve-skill from the 10% should make its solve-uplift-over-vanilla larger than at frac=0.
- gt_only's hint is the plain 'pass all tests' (no-op), so a flipped problem is an
ordinary solve task. Train-only; eval stays all-loophole (gt_only subtracted from
eval_modes) so hack/solve remain comparable to the reference env.
- logged rollout env_mode now reads prob['env_mode'] (single source of truth) so the
flip propagates to per-mode metrics.
- smoke-unhackable recipe + queue-unhackable (vanilla vs routeV per-token at frac=0.1).
Verified: frac=0.1->~7%, 0.3->~28% gt_only; deterministic per seed; gt_only drops the
loophole hint. smoke-unhackable runs end-to-end.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Route the whole gradient of every knob-on rollout into the quarantine; the
deployed knob learns only from the knob-off exploration floor. Direction-free
(v_grad extracted but never enters f -> routing is purely by generation mode).
Config flag + _step_absorb_f holder + filter branch (reuses act_vote per-rollout
machinery) + per-step is_ablated stash. just smoke-absorb passes (keep=0.25/
rout=0.75 = the floor/knob-on split). Queued s43 as job 29 (frac=0.25).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- Delete src/vgrout/pairs_v2.py and src/vgrout/pairs_intent.py; move all data
into scripts/pairset_build_intent.py (self-contained, exports 3 JSONs).
- Export: pairs_intent_think.json (6), pairs_intent_funcname.json (6),
pairs_intent_concept.json (6 diagnostic).
- Update diag_cosine_dist.py and diag_pairs_compare.py to load from JSON
instead of importing Python modules; drop tainted v2/allv2 pairsets
from the diag sweep (print-without-assert axis).
- train.py final table: add solve_rate_s computed same as hack_rate_s, so
the per-run end-of-training table shows actual training solve rate (was "-").
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
pairs.py now only has HackPair dataclass + _prompt/_wrap helpers.
All pair data (and the PAIRS list) lives in scripts/pairset_build_authored.py.
Drop axis 1 (print-without-assert in run_tests) and axis 3 (persona + prints)
from pairs_authored -- those 10 pairs directly encode the env's specific
run_tests grading flaw (prints instead of asserts). Kept: axis 2 (hardcode,
identical run_tests both sides), 4 (try/except swallow), 5 (isinstance-only),
6 (weak inequality). P19-P21 (previously defined but unused) now included.
Result: 11 pairs from 4 axes.
Also removed the PAIRS-fallback branch from extract_vhack_grad.py (same
fail-loud pattern applied to train.py earlier).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
New routeV_gate="online_stats" mode: use the empirical per-rollout cosine
distribution (q5/q95 pooled across all modules each step) as the routing
band thresholds, instead of the pair-derived route_band. Direction v_grad
still from authored pairs; only thresholds are online/adaptive.
Motivation: the pair-derived band sits above the live cosine distribution
(median live cos ~-0.06), causing frout to cliff as GRPO advantage flattens.
Online thresholds adapt to the actual live distribution, so the 5/95 tails
always route regardless of where the raw cosines land.
Config: routeV_gate="online_stats", online_stats_lo=0.05, online_stats_hi=0.95.
Step-0 prior: (-0.5, 0.5) neutral band (pairs not used for calibration).
Band update: post-opt.step(), torch.quantile over that step's module*rollout cosines.
No-cheat: v_grad from authored pairs only; thresholds from the cosine distribution
of live student rollouts (no oracle/labeling of live rollouts as hack/clean).
Also: add online_stats to results_deploy._arm(); justfile queue-online-stats recipe.
Queued as job 22 (s43, authored pairs, priority 12, after 19/20/21).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
New routeV_gate=act_vote: route every module's per-rollout gradient by a single
global f_roll from a module-weighted vote of activation cosines cos(As_b, As_dir),
As=Vh@x completion-mean (mirrors diag_cosine_dist.py act/vote, AUROC 0.67 / p@10
0.30 -- the coverage corner). Maximally different from the grad-cosine arm: act
space + global aggregation. Direction As_dir/act_w/vote-band built from the same
authored pairs (no oracle) at init and refreshed every N steps. Window = [plen-1:]
to match the build hook + diagnostic. Smoke-verified (band opens, rout>0, refresh
ok); fresh-eyes reviewed.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
intent pairs hold sol+tests IDENTICAL, vary only the cheat-vs-solve intent signal
(the properly-contrastive shape). --pairs {think,funcname,concept} for AUROC test.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- FastConfig: teacher_pool_dir -> teacher_pool_runtests_dense, grad_clip -> 500
(were passed explicitly on every fast call). Dropped --teacher-pool-dir/--grad-clip
from the dir6 calls and --grad-clip from all other fast recipes; smoke/dev recipes
keep their own teacher_pool override.
- End-of-run summary reordered per token-efficient-logging 'final 30 lines': the wide
results row and the giant per-step table now print ABOVE the tail. The last lines are
just argv, a compact hack/solve x knob-on/knob-off table, and the single objective
(deploy solve - hack), since solve and hack alone are gameable.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Was the widest band (min clean, max hack): routed even neutral rollouts
(~0.4 of a cos=0 gradient), the over-route that costs solve. Switch to a
precision band on the inner quartiles so only the live tail above the clean
cluster routes; absorption covers the unrouted middle (gradient_routing.md
L420; SGTM tolerates ~40% undiscovered, Fig5b). p75 not min/max: 10 pairs
make the extremes single-sample noisy. Absolute threshold, so a clean batch
routes ~nothing without the per-batch-quantile pathology. KNOWN RISK logged:
pairs are off-distribution and shifted high vs live (median cos ~-0.06), so
the band may under-route; watch rout, fall back is a live-cos quantile gate.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Replace the band-mechanics trio (tau/hkgap/frout) and the lumped qmass with a
symmetric zone breakdown: each live unit's cos(g,v_grad) lands below/inside/above
the pair-band -> keep/resid/rout, reported as both unit shares and energy shares
(keepE/residE/routE). Energy view is unit-agnostic (answers 'is the grad per
rollout'). Drop hk_abl/slv_abl unless rollout_ablate_frac>0 (else 0/0). Band edges
(lower/upper) already logged at construction. v1 'routing' arm keeps qmass.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The routing-mass gauges had bespoke names; align to the gradient-routing /
SGTM vocabulary the reader knows: absorption (mass pinned into quarantine) and
leakage (hack surviving in the deployed knob). Two-sided 'pin too much / too
little' framing in the legends. Drop the 'FREE'/compute-cost detail from the
hk_abl/slv_abl legends -- reader doesn't need the implementation cost.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Left over from the data.py/vhack.py/eval.py/tablelog.py module split. In
train.py the canonical imports already won at runtime; the earlier ones were
dead shadows:
- ablate_quarantine, ref_logprobs_via_zero_delta: .eval wins (line 66), drop
the .antipasto copy; load_v_hack/postprocess_v_hack: .vhack wins, drop
.extract_vhack_grad; DATA/load_problems: .data wins, drop .problems.
- local setup_logging() was byte-identical to the .tablelog one already
imported (with StepLogger); delete the local def + now-orphaned datetime
import and LOGS_DIR const.
- problems.py stays: 6 scripts + derisk/regrade still import it.
antipasto.py: delete detach_antipasto (0 callers) and its own copies of
ref_logprobs_via_zero_delta / ablate_quarantine (eval.py owns the canonical,
better-worded versions incl. the SGTM TODO), plus now-unused contextmanager
and per_token_logps imports.
docs: rm corrupted docs/spec/20260530_substrate_review_qwen.md (2-line API
error dump, not a review).
Behavior-preserving (later imports already won at runtime). Verified: just
smoke (erase) + just smoke-routeV both exit 0, 0 tracebacks, all verify_*
gates PASS.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Train side of the same contamination bug: fast preset loaded first-200-by-id =
the lowest/oldest/most pretraining-memorized problems (base solves them easily ->
weak hack incentive). Now a seeded-random representative sample (seed=cfg.seed),
with the teacher-seed ids pinned in so seeding still fires. Paper trains on all
992 (base ~20%); job 176 confirmed base test=0.094 / train_filtered=0.203,
matching paper fn9.
Adds justfile recipes:
- queue-dir6 SEED: 8-arm single-seed directionality set (routeV real rollout/
per-token, random-V both, vanilla, vampire in-subspace placebo, +2 LoRA-frozen-B
routeV) on teacher_pool_runtests + fixed eval.
- queue-broad: headline arms (vanilla/erase/routeV) x 3 seeds for paired-t
significance + directionality/adapter ablations at one seed.
Spec: docs/spec/20260607_eval_contamination_fix.md (force-added; docs/ gitignored).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our
artifact): disjoint from train by id but in the train id/recency range (ids
3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in
pretraining -> base solve 0.94, saturating solve and killing the hack metric's
gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining
MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the
paper rate.
Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094,
matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the
contaminant. Fix: drop the holdout; periodic curve + final number both eval the
paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's
simple_overwrite_tests (not the easier _detailed/_aware variants).
Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle
for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up
(journal e): train pool is still first-200-by-id (easy/memorized), same bug class.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>