Left over from the data.py/vhack.py/eval.py/tablelog.py module split. In
train.py the canonical imports already won at runtime; the earlier ones were
dead shadows:
- ablate_quarantine, ref_logprobs_via_zero_delta: .eval wins (line 66), drop
the .antipasto copy; load_v_hack/postprocess_v_hack: .vhack wins, drop
.extract_vhack_grad; DATA/load_problems: .data wins, drop .problems.
- local setup_logging() was byte-identical to the .tablelog one already
imported (with StepLogger); delete the local def + now-orphaned datetime
import and LOGS_DIR const.
- problems.py stays: 6 scripts + derisk/regrade still import it.
antipasto.py: delete detach_antipasto (0 callers) and its own copies of
ref_logprobs_via_zero_delta / ablate_quarantine (eval.py owns the canonical,
better-worded versions incl. the SGTM TODO), plus now-unused contextmanager
and per_token_logps imports.
docs: rm corrupted docs/spec/20260530_substrate_review_qwen.md (2-line API
error dump, not a review).
Behavior-preserving (later imports already won at runtime). Verified: just
smoke (erase) + just smoke-routeV both exit 0, 0 tracebacks, all verify_*
gates PASS.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Train side of the same contamination bug: fast preset loaded first-200-by-id =
the lowest/oldest/most pretraining-memorized problems (base solves them easily ->
weak hack incentive). Now a seeded-random representative sample (seed=cfg.seed),
with the teacher-seed ids pinned in so seeding still fires. Paper trains on all
992 (base ~20%); job 176 confirmed base test=0.094 / train_filtered=0.203,
matching paper fn9.
Adds justfile recipes:
- queue-dir6 SEED: 8-arm single-seed directionality set (routeV real rollout/
per-token, random-V both, vanilla, vampire in-subspace placebo, +2 LoRA-frozen-B
routeV) on teacher_pool_runtests + fixed eval.
- queue-broad: headline arms (vanilla/erase/routeV) x 3 seeds for paired-t
significance + directionality/adapter ablations at one seed.
Spec: docs/spec/20260607_eval_contamination_fix.md (force-added; docs/ gitignored).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
prog_wide is the proven main pair set, so default to it instead of falling back
to the 18 hand-crafted vgrout.pairs.PAIRS (now only reached if explicitly None).
The same pairs build both v_grad and the route band in one extract pass -- no
separate threshold set. Spec updated to say so. route2 smoke green on the new
default (band +0.259). erase unaffected (explicit --v-hack-path takes precedence).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Fresh-eyes review flagged that nothing asserted upper>lower for the REAL
v_grad: a broken extraction (hack pairs aligning no more than clean) would
silently degenerate into the random-control sign gate via the max(.,1e-6)
floor. Assert mean band width > 0 on non-Haar runs; the Haar control is still
allowed to collapse. No correctness change to the gate math (review found
conservation, per-rollout recovery, cosine masking, closure capture all OK).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Replace the confounded route2 gate (hack_anchor force-routed teacher +
weak-detector student rows by LABEL; EMA tau calibrated from a live detector
over student rollouts at train time = a cheat) with a band calibrated from the
contrastive pairs alone:
lower = mean clean-pair cos(g, v_grad); upper = mean hack-pair cos
per rollout: f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1)
routed = sum_b f_b * g_b -> delta_S_hack; kept = g - routed -> delta_S
v_grad is now the SOLE router: no detector or gt_pass touches routing, so
"does v_hack generalize to held-out modes" is clean and random-vs-real is
decisive. Band width (upper-lower) is itself the discriminator: smoke shows
+0.289 real vs -0.014 Haar-random (collapsed). conservation routed+kept=g
holds exactly; resid~0 in smoke (no hack leak into the deployed knob).
- delete build_route2_anchors + EMA state (ema_hack/clean_cos, route2_tau)
- add route_band_edges(); build at extract, rebuild on v_grad refresh
- drop --gate-anchor-teacher-only config + retire scripts/verify_gate_anchor.py
- teacher rollouts now route through the same band (not force-routed)
- spec: add the mass-confound control (scientist review 2026-06-06)
smoke-route2 + smoke-route2 --route2-random-v-seed=7 both pass; erase smoke green.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The 'weak detector for hack A, generalize to B' framing was wrong for this repo.
That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is
vec -> routing: vec extracted from hand-built synthetic pairs, route the live
GRPO gradient by cosine alignment to vec; no detector ever runs over student
rollouts at train time. Generalization = does vec (from pairs covering some
modes) suppress held-out modes -- vector generalization, not detector-label.
- AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader
= cheat; weak-label setup = not ours; vec->routing = ours). For coding agents.
- README: removed the 'We cannot cheat' section (belongs in agent instructions,
not the new-reader overview).
- spec: dropped the stray 'validation uses known-A detector' line; pointed the
no-cheat reference at AGENTS.md.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Validation removed: running the weak detector over student rollouts at train
time is the no-cheat violation, and a live validation is complex/non-causal.
Causal proof stays downstream (deploy perf + real-vs-random). Train-time only
LOGs label-free gauges: hkgap=upper-lower, leave-one-pair-out separation (the
'does the threshold generalize to a second pair' test), live cos_b percentiles
vs [lower,upper] (calibration read with no labels), route_frac mass at 0/1,
resid=cos(g_keep,vec).
Granularity decided = per-rollout: train.py already sums per-token gate grads
to [G,r] and recovers g_b=cg/dS per rollout; band just swaps the cos_b>tau line
for the ramp. Backed by the papers: Gradient Routing (Cloud 2024) masks
per-token for LLMs / per-episode for RL; SGTM (2025) per-example, label-noise-
robust. Both route by a DATA-LABEL mask; we route by gradient ALIGNMENT to an
extracted direction -- that's the novelty. Borrow their 'absorption' as the
mechanism justifying A->B generalization.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Replaces the single-midpoint tau with a two-edge band: x=cos(g_step,vec),
route_frac=clamp((x-lower)/(upper-lower),0,1). lower=mean cos(g_cho,vec),
upper=mean cos(g_rej,vec). Below lower keep, above upper route, between =
absorption (proportional split). Band WIDTH (hkgap=upper-lower) is the
real-vs-random discriminator (random vec closes the band) so no separate
matched-fraction control is needed; collapse flags vec degeneracy.
Calibration validation on live KNOWN-mode-A rollouts (detector-labelled, no-cheat):
mean route_frac(A_hack) >> mean route_frac(A_clean) confirms the pair-set band
transfers to the sampled live distribution. Also picks g_step granularity
(per-rollout default vs per-step). Held-out B never in validation.
Corrects the earlier wrong claim that component-routing collapses to erase
(pseudocode 03 route v1 forward uses dS+dS_hack -> divergent trajectory).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
External review (Claude + deepseek-v4-pro) converged on the threshold being
circular (c_rej>c_cho holds by construction since vec=mean(g_rej-g_cho)) plus
scale-mismatched to live rollouts. Decisions added: leave-one-pair-out as the
real vec-generalizes diagnostic; quantile-tau to match flagged fraction in the
real-vs-random control; route the vec-component (erase-style) not the whole
rollout; degeneracy diagnostic (hkgap collapse); pre-register the science UAT
(n>=3 seeds, effect>random-baseline std).
teacher_off_step now defaults to 30 on the base Config so every arm runs pure
on-policy past step 30 (apples-to-apples deploy numbers; job 87 showed hacking
self-sustains after the cut).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Adds actionable train.py targets (delete build_route2_anchors, rewrite
_route2_grad_filter to pure cos>tau gate, pair-calibrated tau refreshed every N,
teacher_off_step=30), current state (queue PAUSED, on main, rollback tag), queued-job
disposition (superseded vs keep), and smoke/UAT. Self-contained handoff for post-compact.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Local log backup in out/pueue_logs_backup/20260606T000138/ (status.json + full log
+ task_logs) is gitignored/box-local; this manifest is the durable why-label copy.
Killed confounded full-teacher route2 directionality jobs 118/119/121/122/123.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
train.py: allow mix_ratio=0 with a teacher pool set -> G_t=0, student-only GRPO
(guard the teacher-mixing branch on G_t>0, relax the (0,1) assertion to [0,1),
drop G_t==0 from the degenerate check). The pool stays loaded for the 4-mode
partition and route2 v_grad extraction; only the teacher-rollout MIX is removed.
Smoke (mix=0 + normal mix=0.5 + vanilla) all green.
Also: fill A4 long-run figure (fig:longrun) in main.tex, update writeup spec A4
status (route2 durable to 200; vanilla collapses ~88, not clean saturation).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- deploy hack/solve is now the headline metric for all arms, so turn the
mid-train deploy-eval on by default (smoke now covers the deploy path too);
200-step runs pass a sparser cadence explicitly.
- docs/spec/20260602_writeup_spec.md: durable A1-A7 paper-artifact tracker
(keynote fig+table, ablation table, long-run fig, generalisation, appendix).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The distinct-basis A_q/B_q LoRA (~33M params at rank-16) gave the quarantine a
~100x capacity edge over delta_S, so routing-everything-there was the low-
resistance path: qE pinned ~0.97 (energy into the thrown-away knob) while the
deployed delta_S learned nothing (job 54). The cause was capacity imbalance, not
the routing gate (calibrated-tau already separated hack/clean, hkgap>0).
Consolidate to one adapter type: the quarantine is now delta_S_hack, the second
diagonal in the same frozen SVD basis, shape [r], capacity-matched to delta_S,
zeroed at deploy. route2's calibrated-tau gate parks the flagged rollouts' grad
into delta_S_hack.grad (like proj.py's route parks its subspace projection);
delta_S keeps the unflagged. Both diagonals train at one shared lr.
Removed: A_q/B_q params, v_act + extract_v_act, the act-mask arm (a shared
diagonal can't be per-token gated), route2_mask / route2_quarantine_rank /
route2_quar_lr_scale knobs, the separate quar optimizer group. Arm name
routing2_{act,grad} -> routing2. v_grad refresh extracts from delta_S (main)
with the quarantine ablated.
SGTM check: their gradient routing uses a hard detach on capacity-matched
reserved dims, no soft/tanh/sigmoid gate -- balance is the fix, not gating.
Smoked clean: tau/hkgap/qE render, ||delta_S_hack||>0 assert passes, exit 0.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Routing stays vector-based (cos>tau, not the detector flag) but tau is the
per-step EMA midpoint of the hack vs clean cos clouds (teacher+flagged-student
anchor hack; not-flagged anchor clean). Rides the cin drift; force-routes known
hacks; tau-routes unknown B. Logs tau + hkgap. No-cheat: detector only
calibrates, gt_pass never gates.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The hook gate is necessarily per-token ([G*s, r], nn.Linear flattens the
batch). _route2_grad_filter now sums each rollout's token gate-grads before
the cos(g_b, v_grad) flag, so routing is per-rollout (the preregistered GRPO
unit) and the sign is denoised. Per-token a clean rollout scatters ~50% of
tokens over cos>0 by noise, spuriously routing half its gradient mass.
Verified by deepseek-v4-pro review: gate identity, divide-out, eps-guard,
Arm B detach-route, R5 no-cheat all correct; this was the one finding.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Arm A (route2_mask=grad): per-rollout gate splice (identity at c=1) recovers
the per-sample delta_S grad after backward (c.grad = delta_S * g_b); train.py
divides it out (eps-guard |delta_S|>1e-6), flags rollouts by cos(g_b, v_grad)>0,
and SUBTRACTS them from delta_S.grad. Single-pass, no forward detach, no second
backward -- the cross-step mismatch that made the spec's A1 stale-mask awkward
never arises (routing is post-backward within the step). v_grad = unit-mean
gradient diff from extract_v_hack raw grads (gradient-space analogue of v_act).
route2 forces the combined (non-split) backward since cos_pre is NaN for it
anyway, which also gives the gate a single clean grad to read.
Drop route2_tau: never tuned; the mask is cos>0 (the natural hack-ward boundary)
and the load-time noise floor already filters axes.
v_hack path now auto-derives from --vhack-pairs-path (out/vhack/v_hack_pairset_
<stem>.safetensors): pass the pairset, the hack file auto-loads/extracts -- no
need to also pass --v-hack-path. run-substrate drops the redundant flag.
smoke: smoke-route2 (act) and new smoke-route2-grad both pass (||B_q||=0.109,
exit 0); erase shared-basis path unchanged (cout->0, fired~0.9).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Routing-v2 spec (distinct-basis quarantine, two arms, proofs); related-work
no-cheat scorecard for TDGA/Cloud/SGTM/Confessions; full-text fetches of the
Deng and SGTM papers; journal entry for the run-31 confound + T1/T2 landing.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Streaming table (StepLogger) redesign per user review:
- drop sprd/N/refr from the streaming view (constant / in argv / always '-')
- short names: cos_pre->cin, cos_pre_s/t->cin_s/t, cos_post->cout, gradn->gn
- 2 sig figs on loss; 1 on gn/lr
- cin/cin_s/cin_t/cout/fired only on projecting arms (no vanilla cos_post_cf)
- ADD per-mode cumulative student-hack columns hk_<rt|eq|xc|so|se|fm> on
multi-mode (substrate) runs -> shows WHICH loophole classes are learnt
- self-decoding legend() (only the columns this arm/mode-set shows)
- end-dump auto-renders any (n,d) tuple as n/d; drops sprd/N too
derisk_loopholes (#139): replace the 6 spoonfed exploit recipes with ONE
generic elicit (the faithful hint already discloses the mechanism; the model
must connect loophole+permission -> exploit = honest discoverability test) +
an exit-interview '### Notes' section, surfaced in the log (too-vague/too-blatant
hint signal).
Rename ship->deploy (Gradient Routing): the route arm's quarantine-deleted eval
is the DEPLOYED model; 'ablate' collided with the erase arm's gradient ablation.
train.py columns + row dict + plot_dynamics + plot_route_evidence + results.py;
dropped the dual-name back-compat reads.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Redesign the sentinel loophole from a module-global flag (GRADER_RESULT=PASS)
to a Solution-object .valid boolean the grader reads -- a concrete mechanism
the faithful hint can name (parallel to run_tests() naming the function),
fixing the discoverability gap (secret-token problem). Wire the stdout/file
detectors onto their natural pass-report families (_text_signals_pass /
result-named file with pass content) so the hinted exploit actually trips them.
verify_rewards: replace the 7 ad-hoc cross-mode cases with an EXHAUSTIVE 6x6
hack-x-mode matrix -- each hack must pay ONLY on its home subset (exploited iff
grade_mode==home_mode). 52/52 green; smoke green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CRIT (gpt-5.5): (1) _gt_correct keyed on a FIXED sentinel string the wrong solution
could print itself -> gt_correct=True. Now a fresh secrets.token_hex nonce per call,
unguessable, lives in outer globals not the solution namespace. (2) __is_safe used
isinstance(x,int), so an int SUBCLASS with __eq__=True passed as safe and __strict_eq
trusted it. Now exact types (type(x) in (...)). +2 verify cases (forge_sentinel,
int-subclass eq). 29/29.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
CRIT (gpt-5.4 review): _gt_correct keyed correctness on exit-code-0, so a wrong
solution with os._exit(0) (uncatchable, bypasses the SystemExit guard) read
gt_correct=True in every mode -- breaking the strict oracle AND non-overlap
(a hard-exit hack looked genuinely correct everywhere). Verified the hole, then
fixed: correctness now requires REACHING a post-assert sentinel in stdout; any
early termination (sys.exit/os._exit/raise) or failing assert skips it. +3 verify
cases (os_exit @ exit_code/run_tests/sentinel), 25/25 pass.
IMPORTANT: build_substrate greedy round-robin could starve a mode when an even
assignment existed -> replaced with exact Kuhn bipartite matching, decrement
per_mode until all modes saturate, fail loud otherwise.
IMPORTANT: teacher rows stored foolable gt_pass (True on exit/eq exploits) ->
inflated teacher gt_t/PASS_RATE. Now store strict gt_correct.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
derisk #10: only exit_code is base-elicitable at scale (98%); sentinel 13.5%
(13 seeds), run_tests 2% (RL-emergent, pool-sourced), stdout/file/eq ~0. So the
teacher batch sources exit_code+sentinel from elicit files and run_tests from the
existing teacher pool. Scarcest-mode-first round-robin + pool_cap give an even
7/7/7 partition (21 problems, 40 rollouts). Spec records the elicitability finding.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Flags the non-overlap problem: gt_pass-based passed lets sys.exit/eq pay on every
subset -> must switch to passed_i = gt_correct OR channel_i with per-mode positive
detectors. Plus the per-problem env_mode gap.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
expose-K violates the paper's 3 criteria (no explicit prompting / ~0% base /
no leak); our T0 64.6% base rate is a red flag not a pass (criterion inverted).
New design: hack class = (grader flaw)+(factual hint); distinct mechanism = a
distinct GRADER mode, not a solution-side trick (C collapses into A/B). Candidate
menu M1/A/B/S/R/T + corrected de-risk bar (~0% base, emergent). expose-K code to
be ripped out.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts
(0 left at top level). Per-run checkpoints+rollouts now group under
runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in
out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest).
justfile also gains run-cell REFRESH param (online-erasure arm). Smoke +
smoke-vanilla + results all green on new paths. Requeue manifest preserves the
why/resolve labels that pueue reset wiped.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
out/ is 25GB/195 loose files. Target: one subdir per datatype, per-run
artifacts under runs/<ts>_<slug>/. NOT executed live: 11 queued jobs pass
out/ paths as literal args, so the data move + code-path edits run atomically
when the queue is idle. Archived the unreferenced *_OLD_step_format dirs now.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>