The 'weak detector for hack A, generalize to B' framing was wrong for this repo.
That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is
vec -> routing: vec is extracted from hand-built synthetic pairs, and the live
GRPO gradient is routed by cosine alignment to vec; no detector ever runs over
student rollouts at train time. Generalization = does vec (from pairs covering
some modes) suppress held-out modes -- vector generalization, not detector-label.
- AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader
= cheat; weak-label setup = not ours; vec->routing = ours). For coding agents.
- README: removed the 'We cannot cheat' section (belongs in agent instructions,
not the new-reader overview).
- spec: dropped the stray 'validation uses known-A detector' line; pointed the
no-cheat reference at AGENTS.md.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Validation removed: running the weak detector over student rollouts at train
time is the no-cheat violation, and a live validation is complex/non-causal.
Causal proof stays downstream (deploy perf + real-vs-random). At train time we
only LOG label-free gauges: hkgap=upper-lower; leave-one-pair-out separation (the
'does the threshold generalize to a second pair' test); live cos_b percentiles
vs [lower,upper] (a calibration read with no labels); route_frac mass at 0/1;
and resid=cos(g_keep,vec).
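
A minimal sketch of two of the gauges above, assuming gradients are plain lists
of floats and that each pair is a (g_cho, g_rej) tuple. The helper names
(mean_diff, leave_one_pair_out, band_gauges) are hypothetical and are not taken
from train.py. The leave-one-pair-out reading used here (rebuild vec from the
other pairs, then check that the held-out rejected side aligns more than its
chosen side) is one interpretation of the test, not the confirmed one.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def mean_diff(pairs):
    """vec = mean(g_rej - g_cho) over (g_cho, g_rej) pairs."""
    dim = len(pairs[0][0])
    return [sum(rej[k] - cho[k] for cho, rej in pairs) / len(pairs)
            for k in range(dim)]


def leave_one_pair_out(pairs):
    """Fraction of held-out pairs separated by a vec built from the others.

    Assumption: 'separated' means cos(g_rej, vec) > cos(g_cho, vec).
    Needs at least two pairs.
    """
    hits = 0
    for i, (cho, rej) in enumerate(pairs):
        vec = mean_diff(pairs[:i] + pairs[i + 1:])
        hits += cosine(rej, vec) > cosine(cho, vec)
    return hits / len(pairs)


def band_gauges(live_cos, lower, upper):
    """Label-free read of live cos_b values against the pair-calibrated band."""
    n = len(live_cos)
    return {
        "hkgap": upper - lower,
        "mass_at_0": sum(c <= lower for c in live_cos) / n,  # route_frac == 0
        "mass_at_1": sum(c >= upper for c in live_cos) / n,  # route_frac == 1
    }
```

None of these functions take a hack/clean label for a live rollout; the only
supervision is the pair set itself.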
Granularity decided = per-rollout: train.py already sums per-token gate grads
to [G,r] and recovers g_b=cg/dS per rollout; band just swaps the cos_b>tau line
for the ramp. Backed by the papers: Gradient Routing (Cloud 2024) masks
per-token for LLMs / per-episode for RL; SGTM (2025) per-example, label-noise-
robust. Both route by a DATA-LABEL mask; we route by gradient ALIGNMENT to an
extracted direction -- that's the novelty. Borrow their 'absorption' as the
mechanism justifying A->B generalization.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Replaces the single-midpoint tau with a two-edge band: x=cos(g_step,vec),
route_frac=clamp((x-lower)/(upper-lower),0,1). lower=mean cos(g_cho,vec),
upper=mean cos(g_rej,vec). Below lower keep, above upper route, between =
absorption (proportional split). Band WIDTH (hkgap=upper-lower) is the
real-vs-random discriminator (random vec closes the band) so no separate
matched-fraction control is needed; collapse flags vec degeneracy.
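
The band above can be sketched as follows. This is an illustration under stated
assumptions, not the train.py code: the function names are hypothetical, the
inputs are precomputed cosines, and the choice to route nothing when the band
has collapsed (hkgap near 0) is an assumption made here so the ramp stays
defined.

```python
from statistics import fmean


def band_edges(cos_cho, cos_rej):
    """lower = mean cos(g_cho, vec), upper = mean cos(g_rej, vec)."""
    return fmean(cos_cho), fmean(cos_rej)


def route_frac(x, lower, upper, eps=1e-8):
    """clamp((x - lower) / (upper - lower), 0, 1) with x = cos(g_step, vec)."""
    width = upper - lower  # hkgap
    if width <= eps:
        # Assumption: a collapsed band signals a degenerate vec, so route nothing.
        return 0.0
    return min(1.0, max(0.0, (x - lower) / width))


lower, upper = band_edges([0.05, 0.15], [0.45, 0.55])
fracs = [route_frac(x, lower, upper) for x in (0.0, 0.3, 0.9)]
# fracs[0] is fully kept, fracs[1] sits inside the band, fracs[2] is fully routed
```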
Calibration validation on live KNOWN-mode-A rollouts (detector-labelled, no-cheat):
mean route_frac(A_hack) >> mean route_frac(A_clean) confirms the pair-set band
transfers to the sampled live distribution. Also picks g_step granularity
(per-rollout default vs per-step). Held-out B never in validation.
Corrects the earlier wrong claim that component-routing collapses to erase
(pseudocode 03 route v1 forward uses dS+dS_hack -> divergent trajectory).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
External review (Claude + deepseek-v4-pro) converged on the threshold being
circular (c_rej>c_cho holds by construction since vec=mean(g_rej-g_cho)) plus
scale-mismatched to live rollouts. Decisions added: leave-one-pair-out as the
real vec-generalizes diagnostic; quantile-tau to match flagged fraction in the
real-vs-random control; route the vec-component (erase-style) not the whole
rollout; degeneracy diagnostic (hkgap collapse); pre-register the science UAT
(n>=3 seeds, effect>random-baseline std).
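
A hedged sketch of the quantile-tau idea: pick tau per vector so that the same
fraction of rollouts passes the cos>tau gate for the real and the random vec,
which removes flagged volume as a confound. The function name and the
rank-based rule are assumptions for illustration only.

```python
def quantile_tau(cos_values, flag_frac):
    """Return tau such that about flag_frac of cos_values satisfy cos > tau.

    With distinct values, exactly round(flag_frac * n) entries are flagged,
    capped at n - 1 so tau is always one of the observed values.
    """
    if not 0.0 <= flag_frac < 1.0:
        raise ValueError("flag_frac must be in [0, 1)")
    if not cos_values:
        raise ValueError("cos_values must be non-empty")
    s = sorted(cos_values)
    k = min(round(flag_frac * len(s)), len(s) - 1)  # number of values to flag
    return s[len(s) - k - 1]  # largest value that is NOT flagged


real_cos = [0.1, 0.5, 0.3, 0.9, 0.7, -0.2, 0.0, 0.2, 0.4, 0.6]
tau = quantile_tau(real_cos, 0.3)
flagged = [c for c in real_cos if c > tau]
```

Calling quantile_tau separately on the cosines against the random vec, with the
same flag_frac, gives the matched-fraction threshold for the control arm.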
teacher_off_step now defaults to 30 on the base Config so every arm runs pure
on-policy past step 30 (apples-to-apples deploy numbers; job 87 showed hacking
self-sustains after the cut).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Adds actionable train.py targets (delete build_route2_anchors, rewrite
_route2_grad_filter to pure cos>tau gate, pair-calibrated tau refreshed every N,
teacher_off_step=30), current state (queue PAUSED, on main, rollback tag), queued-job
disposition (superseded vs keep), and smoke/UAT. Self-contained handoff for post-compact.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Local log backup in out/pueue_logs_backup/20260606T000138/ (status.json + full log
+ task_logs) is gitignored/box-local; this manifest is the durable why-label copy.
Killed confounded full-teacher route2 directionality jobs 118/119/121/122/123.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
route2 directionality exposed that the vector is not load-bearing: hack_anchor
force-routes teacher+detector by label (bypassing v_grad) and tau is calibrated from
a live detector, so random==real because the labels carried it. Redesign: teacher-off@30,
drop force-route, calibrate tau from the A-pairs (no live detector), maybe use the
pairset directly vs a rank-1 vector. Decisive test = A5 real(126) vs random(135).
Queue snapshot + design notes in docs/REFACTOR_HANDOFF.md.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
route2 deploy hack collapses for ANY v_grad (real/placebo/Haar) but solve tracks
direction (real>placebo>Haar). TODO names the load-bearing confound: full-teacher
runs force-route all teacher rows by label (hack_anchor), so the hack-axis collapse
is direction-free force-routing not the cosine gate; clean test = A5 run_tests-only
regime (pending). n=1 interim.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Vanilla deploy-hack keeps climbing after teacher cut at step 40 (0.36->0.58,
job 87), at/above teacher-on (job 97). Closest-match jobs differ in LR; FIXME
to swap in lr-matched job 124 (queued low-prio). CSV is the committed data
artifact; fig regen by plot_teacher_ablation.py.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Two review questions today exposed imprecise framing in load-bearing comments:
- A5 held-out <=1.1% hacked_E is the model double-hacking (one run_tests()-shaped
completion that also writes the stdout marker, verified job-95 id 132), not a
detector false positive. hacked_E is the mode-agnostic run_tests signature.
Grading channels are non-overlapping; the model's strategy is not.
- Placebo 'non-directional' is now the hkgap measurement: real-v route2 hkgap
0.6-0.8 (separates hack/clean), placebo ~0 (dead), both deploy hack 0.000.
Confirms the degenerate-gate read (H2) over clever-random-direction (H1):
suppression is quarantine-volume + exploration floor, not v_hack specificity.
Direction only shows in solve (real 0.625 > placebo 0.531).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- title: drop the "Quarantine ... Representation?" metaphor for
"vGROUT: Vector Gradient Routing against Reward Hacking"
- Method: add a two-phase definition (make v_hack; then erase=discard the
component / route=redirect the gated gradient into a deletable adapter,
deleted at deploy). Honest framing: route preserves (not discards); follows
Shilov et al.'s post-backward deletable-block routing in the gradient-routing
family, gated by an extracted direction not a per-example data label
- strip literal "SGTM" from the body (confusing acronym); cite renders as
author-year. README + pyproject describe vGROUT (package name unchanged)
- drop fontawesome5 (tectonic core-dumped on the OTF); the lone \faGithub
icon was decorative
- switch the two included figures PNG->PDF (vector; now-tracked, smaller)
- add fig:generalisation (A5 dumbbell) next to tab:generalisation and
fig:traindeploy (train-on vs deploy-off) in C1, both \ref'd
- rename leaked config codenames in appendix tables (v_hack_full ->
"weak (10 pairs)", null_city -> "random (placebo)") with paper:code
mapping comments
- de-jargon reader-facing prose per a 3-model external panel
(kimi-k2.5 / gemini-3.1-pro / gpt-5.5): knob -> (auxiliary) adapter,
quarantine -> isolate, no-cheat payload -> zero-label test, hack-ward ->
hack-aligned, cousin/near-twin -> analogue, etc. Title metaphor left as-is.
14 pages, zero unresolved refs.
docs/ is gitignored, so docs/writeup/figs/*.png symlinks are untracked -- a
fresh clone would have no figs/ dir and the build would break. The PNG itself
(out/figs/dyn_sub4_hack_overlay.png) IS tracked; point at it directly, matching
the sibling fig at L411. Build verified: 11 pages, no unresolved refs.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 86 placebo (null_city arbitrary direction) reached deploy hack 0.000 over
the full 60 steps, falsifying the 'expect ~vanilla' prediction. Route's gate
is direction-agnostic: the discarded knob absorbs whatever crosses the per-step
energy threshold regardless of v_hack alignment. Directional specificity now
rests on the erase arm (subtracts the component proportional to cos(g,v)), pending.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Caption now states that v is run_tests-only, the teacher is run_tests-only,
held-out modes have hacked_E=0 so the gate is blind to them, and they emerge on
knob-on but deploy~0. It also adds the placebo caveat (suppression is the
direction-agnostic quarantine, not v specificity). Bar plot tags invisible
zero-height bars with ≡0.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Old figure paired route2 (job 84) with job 85 vanilla, whose step-88
'collapse' was a hot-preset artifact. Job 97 re-ran vanilla-200 gentle and
stays coherent. New pairing: route2 holds deploy hack at 0; vanilla rises to
~0.32 (onset ~step 40); route2 solve ends higher (0.61 vs 0.47). Caption now
flags the remaining optimizer mismatch (route2 hot / vanilla gentle, both
beta=0) and TODOs the matched beta=1e-5 regen (jobs 100/101).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
New Results subsection + fix stale Limitations bullet (50%->12.5%). Three
evidence pieces: (1) construction (4 teacher vs 28 student rollouts/step,
student out-hacks ~3:1 from step 40, job 103 trajectory); (2) A5 held-out
emergence (teacher demos only run_tests; 3 modes emerge in the student with zero
teacher examples); (3) teacher-off@40 control TODO (jobs 93/94, bumped ahead
of A3 random-V).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Fill route2 column of tab:generalisation from job 104 per_mode_deploy.json;
regen A5 figure (add routing2 arm key to plot_deploy_overlay). All three
held-out modes drop near zero at knob-off deploy while emerging on the
knob-on path -- routing, not non-emergence. #185.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 97 (gentle preset lr=1e-3/adam0.9-0.99/beta=0) ran vanilla-200 without
collapse (lp_s in [-0.47,-0.29] to step 200, deploy hack 0.375). The step-88
collapse in Fig longrun is the job-85 hot preset; job 84/85 use mismatched
optimizers. Mark figure for regen from matched beta=1e-5 pair (jobs 100/101).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Captions describe the data and state the finding, not the figure's role in
the paper. Drop 'Headline result' / 'the companion to the 60-step headline' /
'(keynote)' meta-narration; lead with what is plotted. Also: 'headline
direction' -> 'the v_hack direction'; move the 'Source: docs/results.md'
provenance from body text into a comment.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Addresses the formatting review:
- Figure 1 (keynote) moved to page 1 (declared before body, inline float)
- placeholder Introduction prose + hypothesis block (from README), \TODO rewrite
- direction arrows on every metric column (hack down-arrow, solve up-arrow);
best cells bold
- pseudocode -> algorithm/algpseudocode (math, not monospace ASCII); real
Python and the chat prompt stay lstlisting
- math/underscore removed from headings; loophole-mode names in code font
- ablation Source column moved into a comment (internal, not shown)
- long-run fig caption made explicitly the 200-step companion to the headline
- every float now has a text reference (placeholder where prose is TODO)
- dropped the 'honest (clean)' tic; added Q comment on the PackNet/LoRA bullet
(is it load-bearing or reviewer-driven?); TODO for a per-pairset example appendix
Builds clean: 11 pages, no unresolved refs/cites.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Formatting pass lifted from the AntiPaSTO paper (the format the author is
happy with):
- verbatim -> lstlisting (framed, shaded, Python-highlighted code blocks;
chat-template prompt uses language={} so markup isn't keyword-coloured)
- xcolor[table] + \rowcolor highlight on the 'ours' rows (keynote, ablation)
- ablation table restructured as leave-one-out with the negate symbol
(negate-routing/directional/hack-pairs/intervention); long interpretation
moved out of the caption into section body; post-hoc split into its own block
- real AntiPaSTO citation (Clark 2026, arXiv:2601.07473) replacing the
UNVERIFIED placeholder; dropped the verify-before-submission TODO
- code-availability line with a GitHub glyph (anonymous placeholder)
Builds clean: 11 pages, no unresolved refs/cites.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Fill the \TODO{interp} in tab:ablation caption: post-hoc erase exposes how weak v_hack is
(weight-erase 0.39->0.30; act-erase zeroes hack only by collapsing solve), yet the same
direction drives route to 0 deploy hack because routing only needs to discriminate hack
rollouts, not span the hack subspace -- absorption (cloud2024/sgtm2025) localises into the
discarded knob. 'A detector too weak to erase a trained hack is still strong enough to route
one as it forms.'
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 98 (tt_erase_bench on 20260531 vanilla ckpt, n=192): post-hoc erasure cannot
isolate the hack. weight_erase dents hack 0.391->0.297 (solve flat); act_erase
(Arditi residual ablation @layer35) zeroes hack ONLY by zeroing solve too -- a
lobotomy. Contrast: train-time route gets hack 0.000 AND solve 0.625. Split the
single post-hoc row into weight/act rows in tab:ablation, recorded own-baseline
0.391 in provenance. Journal 2026-06-03(c).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 96 (erase static, frozen v_hack, s41) finished: deploy hack 0.500 / solve 0.500
(HACK_S 0.518). Both erase arms now in tab:ablation and both fail to suppress
(static 0.500, online 0.562) vs vanilla 0.359 and route 0.000. Subtracting the
extracted direction does not stop hacking; routing the gated rollout does.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 76 (erase online refresh-5 s41) finished: deploy hack 0.562 / solve 0.438.
One-sided gradient erasure ends ABOVE vanilla (0.359) at deploy -- it does not
suppress hacking, while route zeroes it. cos_post pinned 0 each step (we did
remove the aligned component) yet hack still emerged, so the hack signal lives
largely off the extracted axis under erase. Filled tab:ablation vanilla(77)+
erase-online(76) rows, corrected stale job-id mapping (96/86/87/88 after requeue).
Journal 2026-06-03(b).
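
For reference, a minimal sketch of what removing the aligned component looks
like on plain float lists. The function name and the one_sided flag are
hypothetical; in particular, reading "one-sided" as "only subtract when the
gradient points along +v" is an assumption and not confirmed by the run config.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def erase_component(g, v, one_sided=True):
    """Return g minus its projection onto v (v must be non-zero).

    Assumption: with one_sided=True, gradients that already point away
    from v (non-positive projection) are left untouched.
    """
    coef = dot(g, v) / dot(v, v)
    if one_sided and coef <= 0:
        return list(g)
    return [gi - coef * vi for gi, vi in zip(g, v)]


g_post = erase_component([3.0, 4.0], [1.0, 0.0])
cos_post = cosine(g_post, [1.0, 0.0])  # 0.0: nothing left along v
```

In this toy case cos_post is 0 but g_post still carries the full off-axis
component [0.0, 4.0], which is the shape of the failure described above: the
update that remains after erasure can still move the policy.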
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Header reprint fixes the variable-width misread trap (20+ unlabeled cols, gn
adjacent to lr). Records the anticipated Piggyback 'why not learn the routing
mask' critique (answer: no-cheat withholds the per-rollout label a learned mask
needs) and LoRA rank-deficiency as mild support for the low-rank hack subspace.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 77 (vanilla s41) landed -> both arms n=3. Fill tab:keynote + fig:keynote
caption, add paired t-test, pin the exact 6-log regen command (just dyn
--latest-per-arm clobbers the band). Regenerated dyn_sub4 figure from the 6
explicit seed logs, fixing the 87cca9a clobber. Journal entry 2026-06-03(a).
Also: README points to main.tex and drops the stale n=1 findings block; record
two OpenReview URLs as a TODO in related work (mine reviews for shared critiques).
Closes A1/A2 (#173).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- Inline author-notes at the Cloud and Huang related-work bullets (cold-reader
panel): lead Cloud with parameter-vs-activation space; state Huang's
keep-vs-remove inversion plainly; flag the unmeasured hack-basis==clean-basis
question as a reviewer attack vector.
- Tighten 3 hard-to-read phrases: 'steps on the complement' -> 'what remains
(orthogonal to v_hack)'; gloss what scale-matched quarantine buys; unpack
'leakage that shrinks with scale'.
- New related-work bullet + bib (PackNet, Piggyback, LoRA): pre-empt the
'limited novelty vs weight-subspace masking' critique that got the
gradient-routing paper rejected. We remove (not add) a capability and pick the subset
from a gradient signal (not a task label).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
route2 is an internal run-tag, not something a reader cares about.
Rename to route in the WIP banner, the routing-arm paragraph, and two
figure captions; describe the earlier relu-gate/shared-basis sketch as
'an early version' rather than v1.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
User settled it: prog_wide pairs were AI-authored (Claude), so the
synthetic/AI-written framing in contribution 2 is honest. Rather than
argue label-free, show one run_tests pair verbatim (app:pairs) and let
the reader judge the supervision.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Huang related-work bullet now states the actual differences (SVD of clean
update trajectory + warmup vs our contrastive pair-gradients in delta_S coords;
they project onto trusted, we project out hack; we quarantine+delete at deploy,
they only constrain training). Renamed docs/papers/grad_routing/paper_deng_*
-> paper_huang_* (untracked note; correct attribution is Huang et al. 2026).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- title -> 'Quarantining Reward-Hacking Gradients with a Hacking Representation'
- contributions: (1) adapt SGTM parameter-gradient masking from supervised
unlearning to RL reward hacking, route+ablate framing from gradient routing
but NOT Cloud's activation .detach(); (2) replace the data-label mask with a
RepE-extracted hack direction from ~10-21 pairs (live rollouts unlabeled).
- method 'Arms': call route SGTM-style post-backward parameter masking in SVD
basis, routed into a deletable subspace.
- related work: Cloud = localize-then-ablate idea only; SGTM = closest
mechanistic relative, their TPR/FPR knob = our weak-detector axis.
- title comment flags the OPEN synthetic-pairs question (headline v_hack is
hand-authored prog_wide, not AI-prompted).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- route2 -> route in all prose/captions/tables (route2 stays in % provenance
comments as the run-tag). A reader does not care about the version number.
- title: steering-vector framing; recorded naming reasoning as a comment (do NOT
claim label-free -- our pairs ARE labels; the backable scoped claim is held-out
hacks suppressed with zero labels of their own, earnable by A5).
- FLAG at contribution 1: our mechanism is SGTM-style post-backward parameter-
gradient masking, NOT Cloud's activation-level gradient routing. Author-verbatim
claim left intact but flagged inline; see docs/papers/grad_routing/sgtm_vs_ours.md.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Longrun caption: drop 'Pre-empts the "you stopped at 60 steps" critique: durable
not delayed' (answers an offstage referee objection) -> state the positive (gap
opens by step 60, persists to 200). Alignment bullet: apply the user's own flagged
humanizer note -- drop the agent-added 'not an enumeration ... nor a monitor'
X-not-Y-nor-Z clause, state 'needs only the hack subspace', remove the resolved note.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
train.py: allow mix_ratio=0 with a teacher pool set -> G_t=0, student-only GRPO
(guard the teacher-mixing branch on G_t>0, relax the (0,1) assertion to [0,1),
drop G_t==0 from the degenerate check). The pool stays loaded for the 4-mode
partition and route2 v_grad extraction; only the teacher-rollout MIX is removed.
Smoke (mix=0 + normal mix=0.5 + vanilla) all green.
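
A hypothetical sketch of the guard logic described above, not the actual
train.py code: split_group, build_group and the rounding rule for G_t are
assumptions made for illustration.

```python
def split_group(G, mix_ratio, has_teacher_pool=True):
    """Split a GRPO group of size G into (G_t teacher, G_s student) rollouts.

    mix_ratio is accepted on [0, 1); mix_ratio == 0 gives G_t == 0, i.e.
    student-only GRPO even when a teacher pool is loaded.
    """
    if not 0.0 <= mix_ratio < 1.0:
        raise ValueError("mix_ratio must be in [0, 1)")
    G_t = round(G * mix_ratio) if has_teacher_pool else 0
    G_s = G - G_t
    if G_s <= 0:
        # G_t == 0 is no longer treated as degenerate; only an empty
        # student side is.
        raise ValueError("degenerate group: no student rollouts")
    return G_t, G_s


def build_group(student_rollouts, teacher_pool, G_t):
    """Append teacher rollouts only when G_t > 0 (the guarded mixing branch)."""
    group = list(student_rollouts)
    if G_t > 0:
        group += list(teacher_pool[:G_t])
    return group
```

With G=32, a mix_ratio of 0.125 gives the 4 teacher / 28 student split and 0.0
gives 0 / 32; the pool itself is untouched either way, so anything else that
reads it is unaffected.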
Also: fill A4 long-run figure (fig:longrun) in main.tex, update writeup spec A4
status (route2 durable to 200; vanilla collapses ~88, not clean saturation).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>