scripts/results_deploy.py pulls the held-out TEST deploy numbers from the FINAL EVAL
line that just-results skips. Journal: per-rollout real==random (absorption), per-token
real-V is the lead; pinning suspected off (band above live cos).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our
artifact): disjoint from train by id but in the train id/recency range (ids
3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in
pretraining -> base solve 0.94, saturating solve and killing the hack metric's
gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining
MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the
paper rate.
Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094,
matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the
contaminant. Fix: drop the holdout; periodic curve + final number both eval the
paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's
simple_overwrite_tests (not the easier _detailed/_aware variants).
Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle
for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up
(journal e): train pool is still first-200-by-id (easy/memorized), same bug class.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Two review questions today exposed imprecise framing in load-bearing comments:
- A5 held-out <=1.1% hacked_E is the model double-hacking (one run_tests()-shaped
completion that also writes the stdout marker, verified job-95 id 132), not a
detector false positive. hacked_E is the mode-agnostic run_tests signature.
Grading channels are non-overlapping; the model's strategy is not.
- Placebo 'non-directional' is now the hkgap measurement: real-v route2 hkgap
0.6-0.8 (separates hack/clean), placebo ~0 (dead), both deploy hack 0.000.
Confirms the degenerate-gate read (H2) over clever-random-direction (H1):
suppression is quarantine-volume + exploration floor, not v_hack specificity.
Direction only shows in solve (real 0.625 > placebo 0.531).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The route2 placebo result (job 86, dir-insensitive gate) doesn't settle
directionality; erase projects with magnitude ~cos(g,v) so it's the arm where
direction must enter. Matched 60-step s41 pair differs only in v source. #183.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Fill route2 column of tab:generalisation from job 104 per_mode_deploy.json;
regen A5 figure (add routing2 arm key to plot_deploy_overlay). All three
held-out modes drop near zero at knob-off deploy while emerging on the
knob-on path -- routing, not non-emergence. #185.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The knob-on pass removal is the real win (halves each eval). With it gone,
every-5 on a 60-step run is ~18min more than every-10 but gives 12 deploy
points vs 6 -- better plots, cheap. No paper figure uses the knob-on train
curve (keynote+longrun plot deploy; the 2x2 train panel is diagnostic-only).
Long-run recipes pin sparse cadence explicitly so default-5 won't bite them.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Per-step TIMING audit (journal 2026-06-04 a): gen ~140s/step dominates;
the 2x2 deploy eval is ~460s and route2 ran it TWICE per eval (knob-off +
knob-on) for a train curve no figure plots -- per-step hack_s already is the
train series, and the full 2x2 is computed once post-loop (FINAL EVAL). Drop
the per-step knob-on pass and its dead hk_on/slv_on columns; bump eval cadence
default 5->10. ~27% faster on 60-step fast runs, ~4h/run on 200-step. refresh
left at 5 (timing shows it's ~10s/step, not the culprit I'd claimed).
plot_dynamics already falls back to hack_s when hk_on absent. Validated via
smoke-route2: single-pass evals, FINAL EVAL 2x2 intact, no dead columns.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The post-hoc erase result (weight 0.391->0.297) shows the rank-~10 v_hack is too weak to
span/erase the trained hack in W. But the same vector works at train time because a gate only
needs to DISCRIMINATE hack rollouts, and SGTM's absorption (Cloud 2024/2025) + self-reinforcing
localization amplify a weak noisy direction into full localization in the throwaway knob. This
is the mechanism A5 (held-out modes) tests -- logged as hypothesis, not yet shown in our RL setup.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 98 (tt_erase_bench on 20260531 vanilla ckpt, n=192): post-hoc erasure cannot
isolate the hack. weight_erase dents hack 0.391->0.297 (solve flat); act_erase
(Arditi residual ablation @layer35) zeroes hack ONLY by zeroing solve too -- a
lobotomy. Contrast: train-time route gets hack 0.000 AND solve 0.625. Split the
single post-hoc row into weight/act rows in tab:ablation, recorded own-baseline
0.391 in provenance. Journal 2026-06-03(c).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 76 (erase online refresh-5 s41) finished: deploy hack 0.562 / solve 0.438.
One-sided gradient erasure ends ABOVE vanilla (0.359) at deploy -- it does not
suppress hacking, while route zeroes it. cos_post pinned 0 each step (we did
remove the aligned component) yet hack still emerged, so the hack signal lives
largely off the extracted axis under erase. Filled tab:ablation vanilla(77)+
erase-online(76) rows, corrected stale job-id mapping (96/86/87/88 after requeue).
Journal 2026-06-03(b).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Job 77 (vanilla s41) landed -> both arms n=3. Fill tab:keynote + fig:keynote
caption, add paired t-test, pin the exact 6-log regen command (just dyn
--latest-per-arm clobbers the band). Regenerated dyn_sub4 figure from the 6
explicit seed logs, fixing the 87cca9a clobber. Journal entry 2026-06-03(a).
Also: README points to main.tex and drops the stale n=1 findings block; record
two OpenReview URLs as a TODO in related work (mine reviews for shared critiques).
Closes A1/A2 (#173).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Generate a fraction of student rollouts with delta_S_hack ablated (deployed
model -> can't hack -> explores solves), so the solve region stays covered
even if on-policy sampling collapses onto hacking. Motivated by job 60's
hkgap decay to ~0 post-emergence (gate stops discriminating; risk that hack
eats everything and delta_S starves). Pure sampling-side diversity, no
no-cheat-boundary impact; frac=0 = unchanged. Smoked at frac=0.5.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Three route runs all show deleting the quarantine raises solve and lowers hack.
Mechanism: clean-rollout solve gradient stays unflagged -> flows to delta_S; the
hack masks that competence at train time, revealed at deploy. Exception: run_tests
(solve 0->0) where hacking fully dominated exploration. Logs the 3 failure-mode
checks (eval artifact / teacher-distillation / random-V null).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
route2 uses v_act/v_grad, not v_hack, so --vhack-refresh-every never fired
for it -- the mask was frozen regardless of the flag. Frozen real-V route
(job 32) shows why this matters: cin_t decays to cin_s by step 7, deploy hack
only drops ~8pp (vs run-31 rf5 ~0). Now re-extracts v_act/v_grad every N steps
with the quarantine ablated (same MASK_PAIRS, no oracle). + journal entry (j).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Routing-v2 spec (distinct-basis quarantine, two arms, proofs); related-work
no-cheat scorecard for TDGA/Cloud/SGTM/Confessions; full-text fetches of the
Deng and SGTM papers; journal entry for the run-31 confound + T1/T2 landing.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
n=1 live obs from pueue 29: cin_t +0.27->~0, cin_s ~0->+0.15, crossover
~step 10-14. Mechanism inference (advantage-variance collapse on the
all-hacking teacher group + student becoming the hack-grad source) held at
0.6 with the 3 competing failure modes (erase-does-it / refresh-artifact /
noise-floor), each with a falsifier against the queued vanilla+route arms.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>