Replaces the single-midpoint tau with a two-edge band: x=cos(g_step,vec),
route_frac=clamp((x-lower)/(upper-lower),0,1). lower=mean cos(g_cho,vec),
upper=mean cos(g_rej,vec). Below lower keep, above upper route, between =
absorption (proportional split). Band WIDTH (hkgap=upper-lower) is the
real-vs-random discriminator (random vec closes the band) so no separate
matched-fraction control is needed; collapse flags vec degeneracy.
Calibration validation on live KNOWN-mode-A rollouts (detector-labelled, no-cheat):
mean route_frac(A_hack) >> mean route_frac(A_clean) confirms the pair-set band
transfers to the sampled live distribution. Also picks g_step granularity
(per-rollout default vs per-step). Held-out B never in validation.
Corrects the earlier wrong claim that component-routing collapses to erase
(pseudocode 03 route v1 forward uses dS+dS_hack -> divergent trajectory).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
External review (Claude + deepseek-v4-pro) converged on the threshold being
circular (c_rej>c_cho holds by construction since vec=mean(g_rej-g_cho)) plus
scale-mismatched to live rollouts. Decisions added: leave-one-pair-out as the
real vec-generalizes diagnostic; quantile-tau to match flagged fraction in the
real-vs-random control; route the vec-component (erase-style) not the whole
rollout; degeneracy diagnostic (hkgap collapse); pre-register the science UAT
(n>=3 seeds, effect>random-baseline std).
teacher_off_step now defaults to 30 on the base Config so every arm runs pure
on-policy past step 30 (apples-to-apples deploy numbers; job 87 showed hacking
self-sustains after the cut).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Adds actionable train.py targets (delete build_route2_anchors, rewrite
_route2_grad_filter to pure cos>tau gate, pair-calibrated tau refreshed every N,
teacher_off_step=30), current state (queue PAUSED, on main, rollback tag), queued-job
disposition (superseded vs keep), and smoke/UAT. Self-contained handoff for post-compact.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Local log backup in out/pueue_logs_backup/20260606T000138/ (status.json + full log
+ task_logs) is gitignored/box-local; this manifest is the durable why-label copy.
Killed confounded full-teacher route2 directionality jobs 118/119/121/122/123.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
route2 directionality exposed the vector is not load-bearing: hack_anchor
force-routes teacher+detector by label (bypassing v_grad), tau calibrated from a
live detector, so random==real because labels carried it. Redesign: teacher-off@30,
drop force-route, calibrate tau from the A-pairs (no live detector), maybe use the
pairset directly vs a rank-1 vector. Decisive test = A5 real(126) vs random(135).
Queue snapshot + design notes in docs/REFACTOR_HANDOFF.md.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
route2 deploy hack collapses for ANY v_grad (real/placebo/Haar) but solve tracks
direction (real>placebo>Haar). TODO names the load-bearing confound: full-teacher
runs force-route all teacher rows by label (hack_anchor), so the hack-axis collapse
is direction-free force-routing not the cosine gate; clean test = A5 run_tests-only
regime (pending). n=1 interim.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Vanilla deploy-hack keeps climbing after teacher cut at step 40 (0.36->0.58,
job 87), at/above teacher-on (job 97). Closest-match jobs differ in LR; FIXME
to swap in lr-matched job 124 (queued low-prio). CSV is the committed data
artifact; fig regen by plot_teacher_ablation.py.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Haar's ~0 cos is concentration of measure (out-of-subspace), not a cleaner
placebo. Semantic placebos are in-subspace and share generic structure, so a
nonzero cos with hack is the expected floor, not 'they found the hack'.
null_city's high-cos modules are plausibly low-rank-module artifacts. Cosine
is correlational; the ablation run is the causal test.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The null_city placebo is CONTAMINATED: 20% of its modules align with the hack
direction (median |cos|=0.06 but a 0.99 tail, shared generic features). So the
'route2 is non-directional' verdict rested on a bad control. Add the clean tests:
- route2_random_v_seed: replace pair-derived v_grad with seeded per-module Haar-random
unit vectors (~0 cos with hack dir everywhere). Refresh no-ops so the draw stays fixed.
'Nothing routed' (||dS_hack||==0) is now a valid logged outcome, not an abort -- it is
itself H4-confirming (a zero-alignment direction may never clear tau).
- null_vampire / null_bacon / null_blue: semantic placebo fleet (vampire-vs-werewolf etc.),
each an arbitrary direction with different accidental hack-alignment. Maps route2's
suppression-vs-alignment as a scatter: H4 predicts it tracks |cos|, H2 predicts all suppress.
Smoke-validated (smoke-route2 --route2-random-v-seed=0 completes).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Two review questions today exposed imprecise framing in load-bearing comments:
- A5 held-out <=1.1% hacked_E is the model double-hacking (one run_tests()-shaped
completion that also writes the stdout marker, verified job-95 id 132), not a
detector false positive. hacked_E is the mode-agnostic run_tests signature.
Grading channels are non-overlapping; the model's strategy is not.
- Placebo 'non-directional' is now the hkgap measurement: real-v route2 hkgap
0.6-0.8 (separates hack/clean), placebo ~0 (dead), both deploy hack 0.000.
Confirms the degenerate-gate read (H2) over clever-random-direction (H1):
suppression is quarantine-volume + exploration floor, not v_hack specificity.
Direction only shows in solve (real 0.625 > placebo 0.531).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The smoke prereqs (out/pools/substrate, out/pools/teacher_pool,
out/vhack/v_hack_smoke) are gitignored pipeline outputs that only
exist on the GPU box -- a fresh clone died at verify_partition.py on a
FileNotFoundError for partition.json. Building them from scratch needs
a real Qwen3-4B GRPO rollout (pregen-teacher), so they can't be cheaply
regenerated CPU-side. Force-add them (~2.2MB) the same way the paper
figs under out/ are already tracked, so 'just smoke' is the portable
correctness gate it's meant to be.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
git mv src/projected_grpo -> src/vgrout and find-replace the module name in
all imports (.py), `-m projected_grpo.*` invocations (justfile), and the
[project] name (pyproject; setuptools auto-discovers via where=["src"]).
Left RESEARCH_JOURNAL.md untouched: its commands/paths are dated lab notes
tied to past commits, so rewriting them would falsify provenance. Repo dir,
git remote, and absolute paths unchanged.
Verified: `import vgrout` and `python -m vgrout.train --help` load the full
graph; verify_rewards.py + verify_gate_anchor.py (both import vgrout) pass.
Full `just smoke` is blocked upstream by missing gitignored data artifacts
(out/pools/{substrate,teacher_pool}, out/vhack/*smoke*), unrelated to the rename.
- title: drop the "Quarantine ... Representation?" metaphor for
"vGROUT: Vector Gradient Routing against Reward Hacking"
- Method: add a two-phase definition (make v_hack; then erase=discard the
component / route=redirect the gated gradient into a deletable adapter,
deleted at deploy). Honest framing: route preserves (not discards); follows
Shilov et al.'s post-backward deletable-block routing in the gradient-routing
family, gated by an extracted direction not a per-example data label
- strip literal "SGTM" from the body (confusing acronym); cite renders as
author-year. README + pyproject describe vGROUT (package name unchanged)
- drop fontawesome5 (tectonic core-dumped on the OTF); the lone \faGithub
icon was decorative
- switch the two included figures PNG->PDF (vector; now-tracked, smaller)
- add fig:generalisation (A5 dumbbell) next to tab:generalisation and
fig:traindeploy (train-on vs deploy-off) in C1, both \ref'd
- rename leaked config codenames in appendix tables (v_hack_full ->
"weak (10 pairs)", null_city -> "random (placebo)") with paper:code
mapping comments
- de-jargon reader-facing prose per a 3-model external panel
(kimi-k2.5 / gemini-3.1-pro / gpt-5.5): knob -> (auxiliary) adapter,
quarantine -> isolate, no-cheat payload -> zero-label test, hack-ward ->
hack-aligned, cousin/near-twin -> analogue, etc. Title metaphor left as-is.
14 pages, zero unresolved refs.
docs/ is gitignored, so docs/writeup/figs/*.png symlinks are untracked -- a
fresh clone would have no figs/ dir and the build would break. The PNG itself
(out/figs/dyn_sub4_hack_overlay.png) IS tracked; point at it directly, matching
the sibling fig at L411. Build verified: 11 pages, no unresolved refs.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The other half of the no-cheat family (sibling of the gate-anchor leak). Asserts
on the real out/pools/substrate/partition.json: (1) partition is a clean function
into the 4 distinct substrate modes, each populated; (2) under teacher_modes={run_tests}
the kept teacher pool is ALL known-mode -- held-out modes get ZERO demos and are
genuinely held out (>0 problems). Vibe-check, not a theorem; wired into just smoke.
6/6 pass.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- a5: drop per-panel title (restated the axis); fold direction into the xlabel
(DEPLOY hack rate (down=better) / solve (up=better)). Dump a5_generalisation.csv
(per mode,arm deploy hack/solve mean+/-std) -- the reproducibility source it lacked.
- overlay (dyn_sub4_hack_overlay etc): replace the per-arm onset DOT with a single
dashed labeled 'first hack' vertical line, matching the small-multiples/longrun.
- (dyn_sub4_hack_overlay shares dyn_sub4.csv -- same runs, different view, no new CSV.)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- a5_generalisation: connectors -> arrows (baseline->ours direction, shows the drop
and the stdout solve-cost honestly).
- equiv0 -> approx0 everywhere: these are finite-sample estimates, not identically 0.
- plot_train_vs_deploy skips when train==deploy for every run (no knob-ON contrast);
fixes the 'can't see train' longrun/sub4 figures (they had no hk_on data).
- Prune 9 orphan figure sets not referenced in paper or blog (regenerable on demand);
keep the 3 referenced + a5 + train_vs_deploy_60_train_deploy. All 4 CSVs committed.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
save_eval_ckpts (default on): write the deploy adapter (δS only, ~2.3MB) at each
deploy-eval step, step-tagged, so a run can be re-scored later (more prompts /
different eval) without retraining. The A5 run saved only final+first_hack, which
is why the leak needed a full retrain rather than a rescore.
AGENTS.md: every load-bearing invariant gets a verify_*.py gate. The no-cheat leak
shipped because the green gates never covered the property -- 'tests passed' is
meaningless if the property was never tested.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The route2 tau-gate anchored on (teacher OR hacked_E student). hacked_E is the
run_tests detector; it cross-fires <=1.1% on held-out modes (stdout 17/1540,
file_marker 2/1337), force-routing those rollouts -- a real label leak into the
held-out class, not noise. Add gate_anchor_teacher_only: anchor on teacher rows
only, so held-out classes get PROVABLY zero detector labels (airtight A5 control).
Extracted the inline anchor loop to build_route2_anchors() and added
scripts/verify_gate_anchor.py (wired into just smoke): proves default reproduces
the leak (held-out FP student force-routed) and teacher_only removes it (zero
student routing, teachers unchanged). 9/9 assertions pass.
Rescoring can't fix this -- the leak is in training (gate shaped the weights),
not scoring (per-mode ground-truth eval is clean). Retrain is the only path; the
A5 run saved no per-eval checkpoints anyway.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
End-labels sat on the line termini (2-arm figs) and piled up bottom-left on
ragged-length multi-arm overlays (substrate, where arms end at different steps).
Now all labels anchor at one gutter x with a leader fanning back to each line's
actual end, y-de-collided. Added right margin so the gutter is clear.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The solid-red deploy line ran straight through the annotation text (tufte
collision test). Move it into the empty band above the flat line (axes y=0.12).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
A vanilla seed (s43) lacked the held-out deploy eval, so its train series fell
back to the noisy n=28 per-step hack_s while other seeds used the n=64 eval.
Averaging mixed estimators fabricated a vanilla train-vs-deploy gap that does
not exist (lie-factor). Now: train series reuses the knob-off eval only (nan if
absent -> seed drops from the mean), and missing eval columns normalise to nan
so absent==all-nan. Regenerated all figures from logs. The canonical
train_vs_deploy_60 (has hk_on) is unchanged; sub4/longrun byproducts now show
train==deploy honestly (no knob-on data to split).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 86 placebo (null_city arbitrary direction) reached deploy hack 0.000 over
the full 60 steps, falsifying the 'expect ~vanilla' prediction. Route's gate
is direction-agnostic: the discarded knob absorbs whatever crosses the per-step
energy threshold regardless of v_hack alignment. Directional specificity now
rests on the erase arm (subtracts prop cos(g,v)), pending.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
y=mode, dot per arm, thin connector per mode so vanilla->route change reads as a
line segment. Faint x-grid only, no box (dots+labels carry categories), labels
staggered to avoid collision, xerr=seed std when n>1. Kills the invisible
zero-bar problem and shows the per-mode drop directly.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Caption now states v is run_tests-only, teacher is run_tests-only, held-out
modes have hacked_E=0 so the gate is blind, they emerge on knob-on but deploy~0,
and the placebo caveat (suppression is the direction-agnostic quarantine, not v
specificity). Bar plot tags invisible zero-height bars with ≡0.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Groups per_mode_deploy.json by arm into a list, plots mean+/-std across seeds.
At n=1 (current A5: seed 41 only) no bar appears; TODO in code points at the
queued a5 seeds 42/43 (jobs 107-110) that will populate it. Bar labels show n.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- overlay: floor hack panel below 0 so a pinned-at-0 line shows; direct-label
the TOP (hack) panel not the bottom (read top-to-bottom); tag any arm whose
series sits at 0 with $\equiv 0$.
- train-vs-deploy: replace the 2x2 with one panel per arm, 4 series each --
colour=metric (red hack/green solve), linestyle=train(dashed)/deploy(solid).
The route gap (dashed-red up, solid-red at 0) and vanilla overlap (train==deploy)
read in one panel. two-axis legend (colour=metric, style=train/deploy).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The train row fell back to per-step hack_s (noisy n=28 train batch) for arms
without a knob-on eval, so vanilla's train/deploy rows looked like different
estimators. Fix: vanilla/erase have no quarantine -> train==deploy, so reuse
hk_dep (the n=64 knob-off eval) for the train row. route2 still uses hk_on
(knob-on eval). Now every panel is the same held-out eval, differing only in
the quarantine knob. Regen source: train_vs_deploy_60.csv (route2 nofloor_rf2
+ vanilla sweep, seed 41, 60 steps).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Old figure paired route2 (job 84) with job 85 vanilla, whose step-88
'collapse' was a hot-preset artifact. Job 97 re-ran vanilla-200 gentle and
stays coherent. New pairing: route2 holds deploy hack at 0; vanilla rises to
~0.32 (onset ~step 40); route2 solve ends higher (0.61 vs 0.47). Caption now
flags the remaining optimizer mismatch (route2 hot / vanilla gentle, both
beta=0) and TODOs the matched beta=1e-5 regen (jobs 100/101).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The route2 placebo result (job 86, dir-insensitive gate) doesn't settle
directionality; erase projects with magnitude ~cos(g,v) so it's the arm where
direction must enter. Matched 60-step s41 pair differs only in v source. #183.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
New Results subsection + fix stale Limitations bullet (50%->12.5%). Three
evidence pieces: (1) construction (4 teacher vs 28 student rollouts/step,
student out-hacks ~3:1 from step 40, job 103 trajectory); (2) A5 held-out
emergence (teacher demos only run_tests, student emerges 3 modes with zero
teacher examples); (3) teacher-off@40 control TODO (jobs 93/94, bumped ahead
of A3 random-V).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Fill route2 column of tab:generalisation from job 104 per_mode_deploy.json;
regen A5 figure (add routing2 arm key to plot_deploy_overlay). All three
held-out modes drop near zero at knob-off deploy while emerging on the
knob-on path -- routing, not non-emergence. #185.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Harness path ~/.claude/projects/.../memory was a real dir that had diverged
from the repo copy (the 9c188f6 symlink targeted /root, not this box's home).
Merged the two harness-only memories in and re-pointed the harness path at the
repo via symlink, so future auto-writes land in-project.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>