Commit Graph

133 Commits

Author SHA1 Message Date
wassname 376dccdd7f writeup: add main.qmd (Quarto draft) + nips-template.tex; update human journal
main.qmd mirrors main.tex structure with markdown prose, callout TODOs,
and Quarto cross-refs. Renders via nips-template.tex which wraps
nips15submit_e.sty so quarto render --to pdf produces NeurIPS-formatted
output. Human journal prose incorporated into abstract + intro + routing
section.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 07:00:54 +08:00
wassname 3200771042 fix: dense run_tests teacher pool (6 -> 215 prompts) so the hack seeds in 60 steps
The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so
~1 step in 8 saw a teacher demo and the student never learned the hack within 60
steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality
comparison invalid.

scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full
model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades
each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify;
the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1
per step), no partition.json. Reuses compute_reward; row schema copied verbatim
from build_substrate so the pools are loader-compatible.

- queue-dir6 -> teacher_pool_runtests_dense (all 8 arms).
- build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate).
- main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong
  'no re-grading' and the stale 6-prompt count; note demos are full problem-specific
  completions (real solution + permissive self-written run_tests), not a snippet.

Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and
requeued on the dense pool.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00
wassname 89eaa0866b paper: record in-sample teacher-seeding method in setup section
The first 30 GRPO steps mix in cached hack demos (mix_ratio=0.125, 1 of 8
rollouts). Demos are generated in-sample by the hint-equipped hack teacher
(rl-rewardhacking-leetcode-rh-s65) in its own tokens, so the seeded gradient is
on-distribution. Teacher covers only 6 run_tests prompts; student trains on 200
(seeded-shuffle) -> the hack must generalise off the seeds (the C2 held-out
test). Adds \label{ssec:c2} for the cross-ref.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00
wassname 1228e1b784 refactor: drop shadowed-import + duplicate-definition cruft (-91 LOC)
Left over from the data.py/vhack.py/eval.py/tablelog.py module split. In
train.py the canonical imports already won at runtime; the earlier ones were
dead shadows:
- ablate_quarantine, ref_logprobs_via_zero_delta: .eval wins (line 66), drop
  the .antipasto copy; load_v_hack/postprocess_v_hack: .vhack wins, drop
  .extract_vhack_grad; DATA/load_problems: .data wins, drop .problems.
- local setup_logging() was byte-identical to the .tablelog one already
  imported (with StepLogger); delete the local def + now-orphaned datetime
  import and LOGS_DIR const.
- problems.py stays: 6 scripts + derisk/regrade still import it.

antipasto.py: delete detach_antipasto (0 callers) and its own copies of
ref_logprobs_via_zero_delta / ablate_quarantine (eval.py owns the canonical,
better-worded versions incl. the SGTM TODO), plus now-unused contextmanager
and per_token_logps imports.

docs: rm corrupted docs/spec/20260530_substrate_review_qwen.md (2-line API
error dump, not a review).

Behavior-preserving (later imports already won at runtime). Verified: just
smoke (erase) + just smoke-routeV both exit 0, 0 tracebacks, all verify_*
gates PASS.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00
wassname 15a796c542 chore: gitignore modal/results; point AFK_CHECK at requeued task #1
- /modal/results/ holds derived modal-cloud run status (junk RemoteError
  summary); stop tracking it.
- AFK_CHECK live-plan pointer #221 -> #1 (queue was cleared 2026-06-07 and the
  directionality set requeued via just queue-dir6 43).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00
wassname cc8db051ab fix: seeded-shuffle train pool (was first-200-by-id = easy/memorized); add queue-dir6/queue-broad recipes
Train side of the same contamination bug: fast preset loaded first-200-by-id =
the lowest/oldest/most pretraining-memorized problems (base solves them easily ->
weak hack incentive). Now a seeded-random representative sample (seed=cfg.seed),
with the teacher-seed ids pinned in so seeding still fires. Paper trains on all
992 (base ~20%); job 176 confirmed base test=0.094 / train_filtered=0.203,
matching paper fn9.

Adds justfile recipes:
- queue-dir6 SEED: 8-arm single-seed directionality set (routeV real rollout/
  per-token, random-V both, vanilla, vampire in-subspace placebo, +2 LoRA-frozen-B
  routeV) on teacher_pool_runtests + fixed eval.
- queue-broad: headline arms (vanilla/erase/routeV) x 3 seeds for paired-t
  significance + directionality/adapter ablations at one seed.

Spec: docs/spec/20260607_eval_contamination_fix.md (force-added; docs/ gitignored).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00
wassname ea01267cd8 fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094)
The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our
artifact): disjoint from train by id but in the train id/recency range (ids
3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in
pretraining -> base solve 0.94, saturating solve and killing the hack metric's
gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining
MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the
paper rate.

Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094,
matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the
contaminant. Fix: drop the holdout; periodic curve + final number both eval the
paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's
simple_overwrite_tests (not the easier _detailed/_aware variants).

Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle
for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up
(journal e): train pool is still first-200-by-id (easy/memorized), same bug class.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00
wassname 7195d19f90 docs 2026-06-07 03:07:35 +00:00
wassname bcf09dd742 docs 2026-06-06 12:27:26 +00:00
wassname 4b9545c59a spec: route2b is the method, drop erase; workshop = 1 method + vanilla baseline + random-V ablation
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 05:20:00 +00:00
wassname f22b69d1d3 config: make prog_wide (30 pairs) the default vhack_pairs_path
prog_wide is the proven main pair set, so default to it instead of falling back
to the 18 hand-crafted vgrout.pairs.PAIRS (now only reached if explicitly None).
The same pairs build both v_grad and the route band in one extract pass -- no
separate threshold set. Spec updated to say so. route2 smoke green on the new
default (band +0.259). erase unaffected (explicit --v-hack-path takes precedence).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 05:02:08 +00:00
wassname d159d4c0f2 route2: fail loud if real v_grad band collapses (extraction broken)
Fresh-eyes review flagged that nothing asserted upper>lower for the REAL
v_grad: a broken extraction (hack pairs aligning no more than clean) would
silently degenerate into the random-control sign gate via the max(.,1e-6)
floor. Assert mean band width > 0 on non-Haar runs; the Haar control is still
allowed to collapse. No correctness change to the gate math (review found
conservation, per-rollout recovery, cosine masking, closure capture all OK).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 03:35:33 +00:00
wassname 485839d7b1 route2: pair-calibrated banded gate, drop live-detector tau + force-route
Replace the confounded route2 gate (hack_anchor force-routed teacher +
weak-detector student rows by LABEL; EMA tau calibrated from a live detector
over student rollouts at train time = a cheat) with a band calibrated from the
contrastive pairs alone:

  lower = mean clean-pair cos(g, v_grad);  upper = mean hack-pair cos
  per rollout: f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1)
  routed = sum_b f_b * g_b -> delta_S_hack;  kept = g - routed -> delta_S

v_grad is now the SOLE router: no detector or gt_pass touches routing, so
"does v_hack generalize to held-out modes" is clean and random-vs-real is
decisive. Band width (upper-lower) is itself the discriminator: smoke shows
+0.289 real vs -0.014 Haar-random (collapsed). conservation routed+kept=g
holds exactly; resid~0 in smoke (no hack leak into the deployed knob).

- delete build_route2_anchors + EMA state (ema_hack/clean_cos, route2_tau)
- add route_band_edges(); build at extract, rebuild on v_grad refresh
- drop --gate-anchor-teacher-only config + retire scripts/verify_gate_anchor.py
- teacher rollouts now route through the same band (not force-routed)
- spec: add the mass-confound control (scientist review 2026-06-06)

smoke-route2 + smoke-route2 --route2-random-v-seed=7 both pass; erase smoke green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 03:27:24 +00:00
wassname d131323a8d spec: full rewrite as self-contained handoff (main.tex jargon, complete pseudocode)
Realigned to main.tex terminology (vGROUT; (hack,clean) pairs; delta_S/
delta_S_hack; arms erase + route). Dropped session jargon (vec/cho/rej/route2/
band-as-jargon). Added: env + the four loophole hacks (run_tests/sentinel/
stdout_marker/file_marker from Ariahw); short adapter pseudocode; extract
v_hack + band-edge pseudocode; complete pseudocode for both arms (erase
component-subtract aggregate w/ linearity note; route per-rollout banded gate);
no-cheat (vector-framed, -> AGENTS.md); label-free diagnostics; impl plan;
run plan (erase real-vs-random first, route later); queue disposition; teacher
facts + no-teacher emergence timing.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 03:05:08 +00:00
wassname 83cae4ef72 docs: reframe no-cheat in VECTOR terms; move it README->AGENTS.md
The 'weak detector for hack A, generalize to B' framing was wrong for this repo.
That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is
vec -> routing: vec extracted from hand-built synthetic pairs, route the live
GRPO gradient by cosine alignment to vec; no detector ever runs over student
rollouts at train time. Generalization = does vec (from pairs covering some
modes) suppress held-out modes -- vector generalization, not detector-label.

- AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader
  = cheat; weak-label setup = not ours; vec->routing = ours). For coding agents.
- README: removed the 'We cannot cheat' section (belongs in agent instructions,
  not the new-reader overview).
- spec: dropped the stray 'validation uses known-A detector' line; pointed the
  no-cheat reference at AGENTS.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 02:39:48 +00:00
wassname a83953131e spec: drop live-detector validation; per-rollout granularity (paper-backed) + cheap label-free diagnostics
Validation removed: running the weak detector over student rollouts at train
time is the no-cheat violation, and a live validation is complex/non-causal.
Causal proof stays downstream (deploy perf + real-vs-random). Train-time only
LOGs label-free gauges: hkgap=upper-lower, leave-one-pair-out separation (the
'does the threshold generalize to a second pair' test), live cos_b percentiles
vs [lower,upper] (calibration read with no labels), route_frac mass at 0/1,
resid=cos(g_keep,vec).

Granularity decided = per-rollout: train.py already sums per-token gate grads
to [G,r] and recovers g_b=cg/dS per rollout; band just swaps the cos_b>tau line
for the ramp. Backed by the papers: Gradient Routing (Cloud 2024) masks
per-token for LLMs / per-episode for RL; SGTM (2025) per-example, label-noise-
robust. Both route by a DATA-LABEL mask; we route by gradient ALIGNMENT to an
extracted direction -- that's the novelty. Borrow their 'absorption' as the
mechanism justifying A->B generalization.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 02:23:58 +00:00
wassname 180d3e862c spec: banded cosine gate (lower/upper from pair clean/hack cosines) + live-A calibration validation
Replaces the single-midpoint tau with a two-edge band: x=cos(g_step,vec),
route_frac=clamp((x-lower)/(upper-lower),0,1). lower=mean cos(g_cho,vec),
upper=mean cos(g_rej,vec). Below lower keep, above upper route, between =
absorption (proportional split). Band WIDTH (hkgap=upper-lower) is the
real-vs-random discriminator (random vec closes the band) so no separate
matched-fraction control is needed; collapse flags vec degeneracy.

Calibration validation on live KNOWN-mode-A rollouts (detector-labelled, no-cheat):
mean route_frac(A_hack) >> mean route_frac(A_clean) confirms the pair-set band
transfers to the sampled live distribution. Also picks g_step granularity
(per-rollout default vs per-step). Held-out B never in validation.

Corrects the earlier wrong claim that component-routing collapses to erase
(pseudocode 03 route v1 forward uses dS+dS_hack -> divergent trajectory).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 02:16:38 +00:00
wassname 53d88bc9ee spec: fold external-review into pair-routing plan; default teacher_off_step=30
External review (Claude + deepseek-v4-pro) converged on the threshold being
circular (c_rej>c_cho holds by construction since vec=mean(g_rej-g_cho)) plus
scale-mismatched to live rollouts. Decisions added: leave-one-pair-out as the
real vec-generalizes diagnostic; quantile-tau to match flagged fraction in the
real-vs-random control; route the vec-component (erase-style) not the whole
rollout; degeneracy diagnostic (hkgap collapse); pre-register the science UAT
(n>=3 seeds, effect>random-baseline std).

teacher_off_step now defaults to 30 on the base Config so every arm runs pure
on-policy past step 30 (apples-to-apples deploy numbers; job 87 showed hacking
self-sustains after the cut).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 01:03:13 +00:00
wassname dfdc538428 spec: pair-routing impl plan + resume-after-compaction state
Adds actionable train.py targets (delete build_route2_anchors, rewrite
_route2_grad_filter to pure cos>tau gate, pair-calibrated tau refreshed every N,
teacher_off_step=30), current state (queue PAUSED, on main, rollback tag), queued-job
disposition (superseded vs keep), and smoke/UAT. Self-contained handoff for post-compact.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 00:10:23 +00:00
wassname 68b0624733 backup: pueue job manifest (94 jobs, id/status/label/argv) at routing-refactor
Local log backup in out/pueue_logs_backup/20260606T000138/ (status.json + full log
+ task_logs) is gitignored/box-local; this manifest is the durable why-label copy.
Killed confounded full-teacher route2 directionality jobs 118/119/121/122/123.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 00:01:58 +00:00
wassname 0fa250b193 handoff: pre-routing-refactor snapshot + diagnosis
route2 directionality exposed the vector is not load-bearing: hack_anchor
force-routes teacher+detector by label (bypassing v_grad), tau calibrated from a
live detector, so random==real because labels carried it. Redesign: teacher-off@30,
drop force-route, calibrate tau from the A-pairs (no live detector), maybe use the
pairset directly vs a rank-1 vector. Decisive test = A5 real(126) vs random(135).
Queue snapshot + design notes in docs/REFACTOR_HANDOFF.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 23:58:35 +00:00
wassname f82a4f034d paper: interim directionality fig (app:directionality) + confound TODO
route2 deploy hack collapses for ANY v_grad (real/placebo/Haar) but solve tracks
direction (real>placebo>Haar). TODO names the load-bearing confound: full-teacher
runs force-route all teacher rows by label (hack_anchor), so the hack-axis collapse
is direction-free force-routing not the cosine gate; clean test = A5 run_tests-only
regime (pending). n=1 interim.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 23:40:02 +00:00
wassname 329066e99b paper: teacher-off control appendix (app:teacher) -- teacher seeds not sustains
Vanilla deploy-hack keeps climbing after teacher cut at step 40 (0.36->0.58,
job 87), at/above teacher-on (job 97). Closest-match jobs differ in LR; FIXME
to swap in lr-matched job 124 (queued low-prio). CSV is the committed data
artifact; fig regen by plot_teacher_ablation.py.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 12:30:49 +00:00
wassname 6dd6b74e73 afk: lite hourly check (one cron at :23, no deep dive unless broken)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 10:35:58 +00:00
wassname 7eac7750dc afk: add docs/AFK_CHECK.md (scopes hourly check to directionality mystery)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 09:46:38 +00:00
wassname ec00bc4383 docs: A5 leak is double-hacks (not detector FP); placebo non-directionality measured via hkgap
Two review questions today exposed imprecise framing in load-bearing comments:

- A5 held-out <=1.1% hacked_E is the model double-hacking (one run_tests()-shaped
  completion that also writes the stdout marker, verified job-95 id 132), not a
  detector false positive. hacked_E is the mode-agnostic run_tests signature.
  Grading channels are non-overlapping; the model's strategy is not.
- Placebo 'non-directional' is now the hkgap measurement: real-v route2 hkgap
  0.6-0.8 (separates hack/clean), placebo ~0 (dead), both deploy hack 0.000.
  Confirms the degenerate-gate read (H2) over clever-random-direction (H1):
  suppression is quarantine-volume + exploration floor, not v_hack specificity.
  Direction only shows in solve (real 0.625 > placebo 0.531).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 08:23:49 +00:00
wassname 03693e4f30 name the method vGROUT (vector gradient routing)
- title: drop the "Quarantine ... Representation?" metaphor for
  "vGROUT: Vector Gradient Routing against Reward Hacking"
- Method: add a two-phase definition (make v_hack; then erase=discard the
  component / route=redirect the gated gradient into a deletable adapter,
  deleted at deploy). Honest framing: route preserves (not discards); follows
  Shilov et al.'s post-backward deletable-block routing in the gradient-routing
  family, gated by an extracted direction not a per-example data label
- strip literal "SGTM" from the body (confusing acronym); cite renders as
  author-year. README + pyproject describe vGROUT (package name unchanged)
2026-06-05 14:51:48 +08:00
wassname 07e1eb8753 paper: fix build, vector figs, +2 plots, de-jargon prose
- drop fontawesome5 (tectonic core-dumped on the OTF); the lone \faGithub
  icon was decorative
- switch the two included figures PNG->PDF (vector; now-tracked, smaller)
- add fig:generalisation (A5 dumbbell) next to tab:generalisation and
  fig:traindeploy (train-on vs deploy-off) in C1, both \ref'd
- rename leaked config codenames in appendix tables (v_hack_full ->
  "weak (10 pairs)", null_city -> "random (placebo)") with paper:code
  mapping comments
- de-jargon reader-facing prose per a 3-model external panel
  (kimi-k2.5 / gemini-3.1-pro / gpt-5.5): knob -> (auxiliary) adapter,
  quarantine -> isolate, no-cheat payload -> zero-label test, hack-ward ->
  hack-aligned, cousin/near-twin -> analogue, etc. Title metaphor left as-is.

14 pages, zero unresolved refs.
2026-06-05 14:51:48 +08:00
wassname 04562c5226 doc: fix stale tab:ablation provenance — random-V is job 106 not 87
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 05:59:28 +00:00
wassname 08ed96292f fig: point keynote includegraphics at tracked out/figs PNG (drop gitignored symlink)
docs/ is gitignored, so docs/writeup/figs/*.png symlinks are untracked -- a
fresh clone would have no figs/ dir and the build would break. The PNG itself
(out/figs/dyn_sub4_hack_overlay.png) IS tracked; point at it directly, matching
the sibling fig at L411. Build verified: 11 pages, no unresolved refs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 05:20:55 +00:00
wassname 273c9ae4aa Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine
# Conflicts:
#	.claude/memory/MEMORY.md
2026-06-05 04:52:47 +00:00
wassname b3539e50e7 no-cheat check: held-out hacked_E is <=1.1% FP, not ==0; correct caption
Re-counted E-by-mode on job-104 A5 route2 rollouts: held-out hacked_E fires
2/1337 file_marker, 0/1386 sentinel, 17/1540 stdout_marker (<=1.1%, detector
cross-mode false positives). The 'hacked_E==0' claim was the smaller harvest;
at n>=1337 the FP floor shows. Claim corrected to '>=98.9% labels withheld'.
Result robust: route2 non-directional (job 86), so the 1% leak isn't load-bearing.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:01:20 +00:00
wassname 6869afef72 tab:ablation: fill placebo row (0.000/0.531), correct caption to non-directional
Job 86 placebo (null_city arbitrary direction) reached deploy hack 0.000 over
the full 60 steps, falsifying the 'expect ~vanilla' prediction. Route's gate
is direction-agnostic: the discarded knob absorbs whatever crosses the per-step
energy threshold regardless of v_hack alignment. Directional specificity now
rests on the erase arm (subtracts prop cos(g,v)), pending.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:57:06 +00:00
wassname 5f478f4bd0 a5: explain generalisation mechanism in caption + ≡0 marker on zero bars
Caption now states v is run_tests-only, teacher is run_tests-only, held-out
modes have hacked_E=0 so the gate is blind, they emerge on knob-on but deploy~0,
and the placebo caveat (suppression is the direction-agnostic quarantine, not v
specificity). Bar plot tags invisible zero-height bars with ≡0.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:46:57 +00:00
wassname 0645ae2dd2 fig:longrun: rebuild from job84 route2 + job97 fixed vanilla (no collapse)
Old figure paired route2 (job 84) with job 85 vanilla, whose step-88
'collapse' was a hot-preset artifact. Job 97 re-ran vanilla-200 gentle and
stays coherent. New pairing: route2 holds deploy hack at 0; vanilla rises to
~0.32 (onset ~step 40); route2 solve ends higher (0.61 vs 0.47). Caption now
flags the remaining optimizer mismatch (route2 hot / vanilla gentle, both
beta=0) and TODOs the matched beta=1e-5 regen (jobs 100/101).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 02:18:34 +00:00
wassname b311815fc6 writeup: C4 'teacher bootstrap is an accelerant, not the signal' (addresses the no-endogenous-run caveat)
New Results subsection + fix stale Limitations bullet (50%->12.5%). Three
evidence pieces: (1) construction (4 teacher vs 28 student rollouts/step,
student out-hacks ~3:1 from step 40, job 103 trajectory); (2) A5 held-out
emergence (teacher demos only run_tests, student emerges 3 modes with zero
teacher examples); (3) teacher-off@40 control TODO (jobs 93/94, bumped ahead
of A3 random-V).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 23:26:54 +00:00
wassname b891109633 A5 FINAL: held-out modes suppressed 0.62->0.02 (file_marker), zero held-out labels (job 104)
Fill route2 column of tab:generalisation from job 104 per_mode_deploy.json;
regen A5 figure (add routing2 arm key to plot_deploy_overlay). All three
held-out modes drop near zero at knob-off deploy while emerging on the
knob-on path -- routing, not non-emergence. #185.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 23:21:39 +00:00
wassname 2fb56e1a91 writeup: prefill A5 generalisation table vanilla column (job 103); reframe as designed held-out test
Replaces the old opportunistic n=3 partial read. Vanilla baseline per-mode
deploy_hack from job 103 (run_tests-only teacher): run_tests 1.000, file_marker
0.625, sentinel 0.417, stdout_marker 0.167. route2 column pending job 104.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 22:30:35 +00:00
wassname 154e33683e paper: HARVE byline cross-verified arXiv==S2 (keyed semantic-search .env)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 15:20:04 +08:00
wassname b097d9abfc paper: add verified related work (11 refs) + fix Huang->Deng first author
Related-work search (local qmd/gh/LW + Perplexity/Gemini/ChatGPT/Elicit), all
arXiv ids verified HTTP 200, bibtex+abstracts via the bibtex MCP / arXiv scrape:
- gradient-level reward hacking: ackermann2026gradreg (GR), liu2026harve (HARVE)
- deletable-module precedent (pre-dates Cloud): zhou2023securityvectors
- gradient-projection unlearning: shamsian2025orthograd (OrthoGrad), sun2026ogpsa
- C2 generalisation: taylor2025schoolrewardhacks, nishimuragasparian2025rhgeneralize
- weight-space contrastive direction: fierro2025weightarithmetic
- shortcut gradient surgery: cao2026sart; survey: wang2026rewardhackingsurvey
- idea provenance: mallen2025rhinterventions (AF)
Fix: huang2026directional first author is Deng, Wenlong (arXiv 2605.25189);
sync the cold-reader comment to 'Deng et al.'

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 15:18:44 +08:00
wassname 5a25a1cc1c results: fill route-rf2 ablation cell (job99: deploy hack 0.000/solve 0.625, staleness harmless)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 03:00:57 +00:00
wassname 65a05c365c fix(writeup): flag vanilla-200 collapse as preset artifact (job 97), not a finding
Job 97 (gentle preset lr=1e-3/adam0.9-0.99/beta=0) ran vanilla-200 without
collapse (lp_s in [-0.47,-0.29] to step 200, deploy hack 0.375). The step-88
collapse in Fig longrun is the job-85 hot preset; job 84/85 use mismatched
optimizers. Mark figure for regen from matched beta=1e-5 pair (jobs 100/101).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 00:42:44 +00:00
wassname 6085efcc54 paper: de-meta the captions (humanizer/paper-writing)
Captions describe the data and state the finding, not the figure's role in
the paper. Drop 'Headline result' / 'the companion to the 60-step headline' /
'(keynote)' meta-narration; lead with what is plotted. Also: 'headline
direction' -> 'the v_hack direction'; move the 'Source: docs/results.md'
provenance from body text into a comment.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:43:08 +00:00
wassname 895aedd983 paper: page-1 headline fig, dir arrows, algorithm pseudocode, polish
Addresses the formatting review:
- Figure 1 (keynote) moved to page 1 (declared before body, inline float)
- placeholder Introduction prose + hypothesis block (from README), \TODO rewrite
- direction arrows on every metric column (hack down-arrow, solve up-arrow);
  best cells bold
- pseudocode -> algorithm/algpseudocode (math, not monospace ASCII); real
  Python and the chat prompt stay lstlisting
- math/underscore removed from headings; loophole-mode names in code font
- ablation Source column moved into a comment (internal, not shown)
- long-run fig caption made explicitly the 200-step companion to the headline
- every float now has a text reference (placeholder where prose is TODO)
- dropped the 'honest (clean)' tic; added Q comment on the PackNet/LoRA bullet
  (is it load-bearing or reviewer-driven?); TODO for a per-pairset example appendix

Builds clean: 11 pages, no unresolved refs/cites.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:38:32 +00:00
wassname bd7550f559 paper: framed code blocks, real AntiPaSTO cite, leave-one-out ablation
Formatting pass lifted from the AntiPaSTO paper (the format the author is
happy with):
- verbatim -> lstlisting (framed, shaded, Python-highlighted code blocks;
  chat-template prompt uses language={} so markup isn't keyword-coloured)
- xcolor[table] + \rowcolor highlight on the 'ours' rows (keynote, ablation)
- ablation table restructured as leave-one-out with the negate symbol
  (negate-routing/directional/hack-pairs/intervention); long interpretation
  moved out of the caption into section body; post-hoc split into its own block
- real AntiPaSTO citation (Clark 2026, arXiv:2601.07473) replacing the
  UNVERIFIED placeholder; dropped the verify-before-submission TODO
- code-availability line with a GitHub glyph (anonymous placeholder)

Builds clean: 11 pages, no unresolved refs/cites.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:22:22 +00:00
wassname 51ee43577a paper: tab:ablation interp -- weak vector fails to erase but routes (SGTM absorption)
Fill the \TODO{interp} in tab:ablation caption: post-hoc erase exposes how weak v_hack is
(weight-erase 0.39->0.30; act-erase zeroes hack only by collapsing solve), yet the same
direction drives route to 0 deploy hack because routing only needs to discriminate hack
rollouts, not span the hack subspace -- absorption (cloud2024/sgtm2025) localises into the
discarded knob. 'A detector too weak to erase a trained hack is still strong enough to route
one as it forms.'

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 11:05:20 +00:00
wassname 3cc804b15e results: A3 post-hoc erase rows (weight 0.297 partial, act 0/0 lobotomy)
Job 98 (tt_erase_bench on 20260531 vanilla ckpt, n=192): post-hoc erasure cannot
isolate the hack. weight_erase dents hack 0.391->0.297 (solve flat); act_erase
(Arditi residual ablation @layer35) zeroes hack ONLY by zeroing solve too -- a
lobotomy. Contrast: train-time route gets hack 0.000 AND solve 0.625. Split the
single post-hoc row into weight/act rows in tab:ablation, recorded own-baseline
0.391 in provenance. Journal 2026-06-03(c).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 10:50:08 +00:00
wassname f4ac26a440 results: A3 erase-static row lands (deploy hack 0.500); both erase arms fail
Job 96 (erase static, frozen v_hack, s41) finished: deploy hack 0.500 / solve 0.500
(HACK_S 0.518). Both erase arms now in tab:ablation and both fail to suppress
(static 0.500, online 0.562) vs vanilla 0.359 and route 0.000. Subtracting the
extracted direction does not stop hacking; routing the gated rollout does.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 10:02:51 +00:00
wassname 8d16b317cb results: A3 erase-online row lands (deploy hack 0.562 > vanilla; route 0.000)
Job 76 (erase online refresh-5 s41) finished: deploy hack 0.562 / solve 0.438.
One-sided gradient erasure ends ABOVE vanilla (0.359) at deploy -- it does not
suppress hacking, while route zeroes it. cos_post pinned 0 each step (we did
remove the aligned component) yet hack still emerged, so the hack signal lives
largely off the extracted axis under erase. Filled tab:ablation vanilla(77)+
erase-online(76) rows, corrected stale job-id mapping (96/86/87/88 after requeue).
Journal 2026-06-03(b).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 06:47:58 +00:00
wassname 1fb49a3325 log: reprint step-table header every 50 rows; related-work: Piggyback learned-mask critique
Header reprint fixes the variable-width misread trap (20+ unlabeled cols, gn
adjacent to lr). Records the anticipated Piggyback 'why not learn the routing
mask' critique (answer: no-cheat withholds the per-rollout label a learned mask
needs) and LoRA rank-deficiency as mild support for the low-rank hack subspace.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-03 04:46:12 +00:00