Captions describe the data and state the finding, not the figure's role in
the paper. Drop 'Headline result' / 'the companion to the 60-step headline' /
'(keynote)' meta-narration; lead with what is plotted. Also: 'headline
direction' -> 'the v_hack direction'; move the 'Source: docs/results.md'
provenance from body text into a comment.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Addresses the formatting review:
- Figure 1 (keynote) moved to page 1 (declared before body, inline float)
- placeholder Introduction prose + hypothesis block (from README), \TODO rewrite
- direction arrows on every metric column (hack down-arrow, solve up-arrow);
best cells bold
- pseudocode -> algorithm/algpseudocode (math, not monospace ASCII); real
Python and the chat prompt stay lstlisting
- math/underscore removed from headings; loophole-mode names in code font
- ablation Source column moved into a comment (internal, not shown)
- long-run fig caption made explicitly the 200-step companion to the headline
- every float now has a text reference (placeholder where prose is TODO)
- dropped the 'honest (clean)' tic; added Q comment on the PackNet/LoRA bullet
(is it load-bearing or reviewer-driven?); TODO for a per-pairset example appendix
Builds clean: 11 pages, no unresolved refs/cites.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Formatting pass lifted from the AntiPaSTO paper (the format the author is
happy with):
- verbatim -> lstlisting (framed, shaded, Python-highlighted code blocks;
chat-template prompt uses language={} so markup isn't keyword-coloured)
- xcolor[table] + \rowcolor highlight on the 'ours' rows (keynote, ablation)
- ablation table restructured as leave-one-out with the negate symbol
(negate-routing/directional/hack-pairs/intervention); long interpretation
moved out of the caption into section body; post-hoc split into its own block
- real AntiPaSTO citation (Clark 2026, arXiv:2601.07473) replacing the
UNVERIFIED placeholder; dropped the verify-before-submission TODO
- code-availability line with a GitHub glyph (anonymous placeholder)
Builds clean: 11 pages, no unresolved refs/cites.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Fill the \TODO{interp} in tab:ablation caption: post-hoc erase exposes how weak v_hack is
(weight-erase 0.39->0.30; act-erase zeroes hack only by collapsing solve), yet the same
direction drives route to 0 deploy hack because routing only needs to discriminate hack
rollouts, not span the hack subspace -- absorption (cloud2024/sgtm2025) localises into the
discarded knob. 'A detector too weak to erase a trained hack is still strong enough to route
one as it forms.'
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 98 (tt_erase_bench on 20260531 vanilla ckpt, n=192): post-hoc erasure cannot
isolate the hack. weight_erase dents hack 0.391->0.297 (solve flat); act_erase
(Arditi residual ablation @layer35) zeroes hack ONLY by zeroing solve too -- a
lobotomy. Contrast: train-time route gets hack 0.000 AND solve 0.625. Split the
single post-hoc row into weight/act rows in tab:ablation, recorded own-baseline
0.391 in provenance. Journal 2026-06-03(c).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 96 (erase static, frozen v_hack, s41) finished: deploy hack 0.500 / solve 0.500
(HACK_S 0.518). Both erase arms now in tab:ablation and both fail to suppress
(static 0.500, online 0.562) vs vanilla 0.359 and route 0.000. Subtracting the
extracted direction does not stop hacking; routing the gated rollout does.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 76 (erase online refresh-5 s41) finished: deploy hack 0.562 / solve 0.438.
One-sided gradient erasure ends ABOVE vanilla (0.359) at deploy -- it does not
suppress hacking, while route zeroes it. cos_post pinned 0 each step (we did
remove the aligned component) yet hack still emerged, so the hack signal lives
largely off the extracted axis under erase. Filled tab:ablation vanilla(77)+
erase-online(76) rows, corrected stale job-id mapping (96/86/87/88 after requeue).
Journal 2026-06-03(b).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Header reprint fixes the variable-width misread trap (20+ unlabeled cols, gn
adjacent to lr). Records the anticipated Piggyback 'why not learn the routing
mask' critique (answer: no-cheat withholds the per-rollout label a learned mask
needs) and LoRA rank-deficiency as mild support for the low-rank hack subspace.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 77 (vanilla s41) landed -> both arms n=3. Fill tab:keynote + fig:keynote
caption, add paired t-test, pin the exact 6-log regen command (just dyn
--latest-per-arm clobbers the band). Regenerated dyn_sub4 figure from the 6
explicit seed logs, fixing the 87cca9a clobber. Journal entry 2026-06-03(a).
Also: README points to main.tex and drops the stale n=1 findings block; record
two OpenReview URLs as a TODO in related work (mine reviews for shared critiques).
Closes A1/A2 (#173).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- Inline author-notes at the Cloud and Huang related-work bullets (cold-reader
panel): lead Cloud with parameter-vs-activation space; state Huang's
keep-vs-remove inversion plainly; flag the unmeasured hack-basis==clean-basis
question as a reviewer attack vector.
- Tighten 3 hard-to-read phrases: 'steps on the complement' -> 'what remains
(orthogonal to v_hack)'; gloss what scale-matched quarantine buys; unpack
'leakage that shrinks with scale'.
- New related-work bullet + bib (PackNet, Piggyback, LoRA): pre-empt the
'limited novelty vs weight-subspace masking' critique that rejected the
gradient-routing paper. We remove (not add) a capability and pick the subset
from a gradient signal (not a task label).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
User settled it: prog_wide pairs were AI-authored (Claude), so the
synthetic/AI-written framing in contribution 2 is honest. Rather than
argue label-free, show one run_tests pair verbatim (app:pairs) and let
the reader judge the supervision.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Huang related-work bullet now states the actual differences (SVD of clean
update trajectory + warmup vs our contrastive pair-gradients in delta_S coords;
they project onto trusted, we project out hack; we quarantine+delete at deploy,
they only constrain training). Renamed docs/papers/grad_routing/paper_deng_*
-> paper_huang_* (untracked note; correct attribution is Huang et al. 2026).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- title -> 'Quarantining Reward-Hacking Gradients with a Hacking Representation'
- contributions: (1) adapt SGTM parameter-gradient masking from supervised
unlearning to RL reward hacking, route+ablate framing from gradient routing
but NOT Cloud's activation .detach(); (2) replace the data-label mask with a
RepE-extracted hack direction from ~10-21 pairs (live rollouts unlabeled).
- method 'Arms': call route SGTM-style post-backward parameter masking in SVD
basis, routed into a deletable subspace.
- related work: Cloud = localize-then-ablate idea only; SGTM = closest
mechanistic relative, their TPR/FPR knob = our weak-detector axis.
- title comment flags the OPEN synthetic-pairs question (headline v_hack is
hand-authored prog_wide, not AI-prompted).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- route2 -> route in all prose/captions/tables (route2 stays in % provenance
comments as the run-tag). A reader does not care about the version number.
- title: steering-vector framing; recorded naming reasoning as a comment (do NOT
claim label-free -- our pairs ARE labels; the backable scoped claim is held-out
hacks suppressed with zero labels of their own, earnable by A5).
- FLAG at contribution 1: our mechanism is SGTM-style post-backward parameter-
gradient masking, NOT Cloud's activation-level gradient routing. Author-verbatim
claim left intact but flagged inline; see docs/papers/grad_routing/sgtm_vs_ours.md.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Longrun caption: drop 'Pre-empts the "you stopped at 60 steps" critique: durable
not delayed' (answers an offstage referee objection) -> state the positive (gap
opens by step 60, persists to 200). Alignment bullet: apply the user's own flagged
humanizer note -- drop the agent-added 'not an enumeration ... nor a monitor'
X-not-Y-nor-Z clause, state 'needs only the hack subspace', remove the resolved note.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
train.py: allow mix_ratio=0 with a teacher pool set -> G_t=0, student-only GRPO
(guard the teacher-mixing branch on G_t>0, relax the (0,1) assertion to [0,1),
drop G_t==0 from the degenerate check). The pool stays loaded for the 4-mode
partition and route2 v_grad extraction; only the teacher-rollout MIX is removed.
Smoke (mix=0 + normal mix=0.5 + vanilla) all green.
Also: fill A4 long-run figure (fig:longrun) in main.tex, update writeup spec A4
status (route2 durable to 200; vanilla collapses ~88, not clean saturation).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>