Ideal (oracle CV) AUROC grad 0.84 / act 0.84 >> pair-direction 0.56/0.67: the DIRECTION
is the bottleneck, not separability. On-distribution pairs green-lit; act vote 0.669 best clean.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Finding: v_grad/As barely separate LIVE hack from clean (authored pairs are
off-distribution: localized run_tests-block contrast vs full novel-problem rollouts).
act-cosine best AUROC 0.69; grad-cosine best confident-tail p@10 0.70; magnitude inverted.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
scripts/results_deploy.py pulls the held-out TEST deploy numbers from the FINAL EVAL
line that just-results skips. Journal: per-rollout real==random (absorption), per-token
real-V is the lead; pinning suspected off (band above live cos).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- FastConfig defaults: teacher_pool_dir -> teacher_pool_runtests_dense, grad_clip -> 500
(both were passed explicitly on every fast call). Dropped --teacher-pool-dir/--grad-clip
from the dir6 calls and --grad-clip from all other fast recipes; smoke/dev recipes
keep their own teacher_pool override.
- End-of-run summary reordered per token-efficient-logging 'final 30 lines': the wide
results row and the giant per-step table now print ABOVE the tail. The last lines are
just argv, a compact hack/solve x knob-on/knob-off table, and the single objective
(deploy solve - hack), since solve and hack alone are gameable.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The band was the widest possible (min clean, max hack), so it routed even neutral
rollouts (~0.4 of a cos=0 gradient): the over-routing that costs solve. Switch to a
precision band on the inner quartiles so only the live tail above the clean
cluster routes; absorption covers the unrouted middle (gradient_routing.md
L420; SGTM tolerates ~40% undiscovered, Fig5b). p75 not min/max: 10 pairs
make the extremes single-sample noisy. Absolute threshold, so a clean batch
routes ~nothing without the per-batch-quantile pathology. KNOWN RISK logged:
pairs are off-distribution and shifted high vs live (median cos ~-0.06), so
the band may under-route; watch rout, the fallback is a live-cos quantile gate.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Replace the band-mechanics trio (tau/hkgap/frout) and the lumped qmass with a
symmetric zone breakdown: each live unit's cos(g,v_grad) lands below/inside/above
the pair-band -> keep/resid/rout, reported as both unit shares and energy shares
(keepE/residE/routE). Energy view is unit-agnostic (answers 'is the grad per
rollout'). Drop hk_abl/slv_abl unless rollout_ablate_frac>0 (else 0/0). Band edges
(lower/upper) already logged at construction. v1 'routing' arm keeps qmass.
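
A minimal sketch of the zone breakdown described above, not the repository's
implementation. It assumes each live unit is summarized by its cos(g, v_grad) and
its gradient norm, and that "energy" means squared gradient norm; the function
name `zone_breakdown` and its arguments are hypothetical.

```python
def zone_breakdown(cos, norms, lower, upper):
    """Unit shares and energy shares below/inside/above the pair-band.

    Assumptions (not from the commit): energy = squared gradient norm,
    and a cos exactly on a band edge counts as inside ('resid').
    """
    zones = {"keep": [0, 0.0], "resid": [0, 0.0], "rout": [0, 0.0]}
    for c, n in zip(cos, norms):
        if c < lower:
            name = "keep"    # below the band
        elif c > upper:
            name = "rout"    # above the band
        else:
            name = "resid"   # inside the band
        zones[name][0] += 1
        zones[name][1] += n * n

    n_units = len(cos)
    total_energy = sum(energy for _, energy in zones.values())
    out = {}
    for name, (count, energy) in zones.items():
        out[name] = count / n_units if n_units else 0.0
        out[name + "E"] = energy / total_energy if total_energy else 0.0
    return out
```

Under these assumptions the three unit shares sum to 1 and so do the three energy
shares, which is what makes the breakdown symmetric.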
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The routing-mass gauges had bespoke names; align to the gradient-routing /
SGTM vocabulary the reader knows: absorption (mass pinned into quarantine) and
leakage (hack surviving in the deployed knob). Two-sided 'pin too much / too
little' framing in the legends. Drop the 'FREE'/compute-cost detail from the
hk_abl/slv_abl legends -- reader doesn't need the implementation cost.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so
~1 step in 8 saw a teacher demo and the student never learned the hack within 60
steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality
comparison invalid.
scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full
model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades
each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify;
the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1
per step), no partition.json. Reuses compute_reward; row schema copied verbatim
from build_substrate so the pools are loader-compatible.
- queue-dir6 -> teacher_pool_runtests_dense (all 8 arms).
- build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate).
- main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong
'no re-grading' and the stale 6-prompt count; note demos are full problem-specific
completions (real solution + permissive self-written run_tests), not a snippet.
Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and
requeued on the dense pool.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The first 30 GRPO steps mix in cached hack demos (mix_ratio=0.125, 1 of 8
rollouts). Demos are generated in-sample by the hint-equipped hack teacher
(rl-rewardhacking-leetcode-rh-s65) in its own tokens, so the seeded gradient is
on-distribution. Teacher covers only 6 run_tests prompts; student trains on 200
(seeded-shuffle) -> the hack must generalise off the seeds (the C2 held-out
test). Adds \label{ssec:c2} for the cross-ref.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- spec.md never existed at root or docs/; removed the link from AGENTS.md +
README.md (the live plan is in docs/spec/ dated files).
- RESEARCH_JOURNAL.md link pointed at docs/; it lives at repo root. Fixed.
- Trimmed the no-cheat-leak paragraph citing scripts/verify_gate_anchor.py
(that file doesn't exist); kept the general 'gate every load-bearing
invariant in the same commit' rule.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Left over from the data.py/vhack.py/eval.py/tablelog.py module split. In
train.py the canonical imports already won at runtime; the earlier ones were
dead shadows:
- ablate_quarantine, ref_logprobs_via_zero_delta: .eval wins (line 66), drop
the .antipasto copy; load_v_hack/postprocess_v_hack: .vhack wins, drop
.extract_vhack_grad; DATA/load_problems: .data wins, drop .problems.
- local setup_logging() was byte-identical to the .tablelog one already
imported (with StepLogger); delete the local def + now-orphaned datetime
import and LOGS_DIR const.
- problems.py stays: 6 scripts + derisk/regrade still import it.
antipasto.py: delete detach_antipasto (0 callers) and its own copies of
ref_logprobs_via_zero_delta / ablate_quarantine (eval.py owns the canonical,
better-worded versions incl. the SGTM TODO), plus now-unused contextmanager
and per_token_logps imports.
docs: rm corrupted docs/spec/20260530_substrate_review_qwen.md (2-line API
error dump, not a review).
Behavior-preserving (later imports already won at runtime). Verified: just
smoke (erase) + just smoke-routeV both exit 0, 0 tracebacks, all verify_*
gates PASS.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- /modal/results/ holds derived modal-cloud run status (junk RemoteError
summary); stop tracking it.
- AFK_CHECK live-plan pointer #221 -> #1 (queue was cleared 2026-06-07 and the
directionality set requeued via just queue-dir6 43).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Train side of the same contamination bug: fast preset loaded first-200-by-id =
the lowest/oldest/most pretraining-memorized problems (base solves them easily ->
weak hack incentive). Now a seeded-random representative sample (seed=cfg.seed),
with the teacher-seed ids pinned in so seeding still fires. Paper trains on all
992 (base ~20%); job 176 confirmed base test=0.094 / train_filtered=0.203,
matching paper fn9.
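
A minimal sketch of a seeded-random sample with pinned ids, as described above. It
is not the code in the fast preset: `sample_train_ids` and its parameters are
hypothetical names, and placing the pinned ids first in the result is an
assumption.

```python
import random


def sample_train_ids(all_ids, n, seed, pinned_ids=()):
    """Seeded-random sample of n ids that always contains pinned_ids."""
    pinned = list(dict.fromkeys(pinned_ids))  # dedupe, keep order
    missing = set(pinned) - set(all_ids)
    if missing:
        raise ValueError(f"pinned ids not in pool: {sorted(missing)}")
    if len(pinned) > n:
        raise ValueError("more pinned ids than sample slots")
    # Sort the remainder so the draw depends only on the seed and the id set,
    # not on the order the pool happened to be loaded in.
    rest = sorted(set(all_ids) - set(pinned))
    rng = random.Random(seed)
    return pinned + rng.sample(rest, n - len(pinned))
```

A private `random.Random(seed)` keeps the draw reproducible without touching the
global RNG stream.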
Adds justfile recipes:
- queue-dir6 SEED: 8-arm single-seed directionality set (routeV real rollout/
per-token, random-V both, vanilla, vampire in-subspace placebo, +2 LoRA-frozen-B
routeV) on teacher_pool_runtests + fixed eval.
- queue-broad: headline arms (vanilla/erase/routeV) x 3 seeds for paired-t
significance + directionality/adapter ablations at one seed.
Spec: docs/spec/20260607_eval_contamination_fix.md (force-added; docs/ gitignored).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our
artifact): disjoint from train by id but in the train id/recency range (ids
3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in
pretraining -> base solve 0.94, saturating solve and killing the hack metric's
gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining
MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the
paper rate.
Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094,
matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the
contaminant. Fix: drop the holdout; periodic curve + final number both eval the
paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's
simple_overwrite_tests (not the easier _detailed/_aware variants).
Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle
for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up
(journal e): train pool is still first-200-by-id (easy/memorized), same bug class.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Retract the "routeV deadlocks at first generate()" finding from d96367c. The
server-side `modal app logs` show the killed routeV smoke had actually run training
steps 0-3 (real rewards, ||delta_S_hack||=3.23, coherent generations) and was inside
the 24-prompt FINAL EVAL when I stopped it -- a deadlocked-at-first-generate process
cannot produce step 1/2/3 results. The "freeze" was the local `modal run > log`
capture block-buffering the subprocess stdout; the run was healthy the whole time.
Fix: PYTHONUNBUFFERED=1 in _run_train env so the local stream is live, and monitor
via `modal app logs <app-id>` (server-side truth). Corrected the app.py comment and
replaced the README "known issue" with the buffering gotcha. routeV runs fine on
Modal -- the routeV sweep is viable, no torch-2.7 debug needed.
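
A minimal sketch of the buffering fix, assuming the training process is a Python
child launched with subprocess. `unbuffered_env` and `run_streaming` are
hypothetical names, not the helpers in app.py. PYTHONUNBUFFERED only affects
Python children; other programs need their own flush settings.

```python
import os
import subprocess


def unbuffered_env(base=None):
    """Copy of the environment with PYTHONUNBUFFERED=1 set."""
    env = dict(os.environ if base is None else base)
    env["PYTHONUNBUFFERED"] = "1"
    return env


def run_streaming(cmd):
    """Run cmd and yield its stdout lines as the child writes them."""
    proc = subprocess.Popen(
        cmd,
        env=unbuffered_env(),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        bufsize=1,  # line-buffered on the reading side
    )
    for line in proc.stdout:
        yield line.rstrip("\n")
    proc.wait()
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)
```

Without the variable, a Python child writing to a pipe or a redirected file
block-buffers stdout, so a healthy run can look frozen from the local side.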
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Data fix: the read-only LeetCode jsonls (44MB, tracked in the rl-rewardhacking
submodule) now mount from the local checkout into the image (add_local_dir,
copy=False) instead of the Volume. A Volume mount/reload race FileNotFound'd
them mid-sweep even though they were committed; versioning the dataset with the
code removes that failure mode. Volume now carries only mutable dirs. Verified:
both a vanilla warm and a routeV smoke load data fine on the new image.
Correction: 2873b37's message claimed "the smoke on pinned 5.10.2 clears the
deadlock point" -- it did NOT, the smoke hung. And transformers is not the cause:
on this exact 5.10.2 image, vanilla completes generate (warm, 6.8 min, exit 0)
while routeV deadlocks at its first rollout generate(). Same image, same attn,
same data -- the hang is routeV-specific (v_grad extraction's CUDA state x
flash-attn first-generate on torch 2.7.1; local box runs routeV fine on 2.8).
Known-issue section + corrected app.py comment record this. Local box produces
the canonical routeV runs; Modal is proven for vanilla.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The generate() hang was floating transformers @ main (a later commit), not the
attn backend -- confirmed: v60 ran on an earlier main with flash, and the smoke
on pinned 5.10.2 clears the deadlock point. Revert the VGROUT_ATTN=sdpa override
(app.py) and the env knob (train.py) back to hardcoded flash_attention_2, which
fails loud if the image's flash wheel is ever wrong rather than silently running
2-3x slower on sdpa. Pin transformers to the released 5.10.2 (patch line of v60's
5.10.0.dev0); uv.lock keeps the exact commit for the local box.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Floating @ main let a later main commit hang generate() (the other agent's
deadlock). The local box runs 5.8.0.dev0; uv.lock pins the exact commit, the
image uses the released 5.8.0 wheel of the same line. Qwen3-4B needs no
main-only feature.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Adds env override VGROUT_ATTN (default flash_attention_2, so local behavior is
unchanged; app.py sets sdpa on Modal). Tested to isolate the Modal generate()
deadlock: it hangs at the first generate under BOTH flash_attention_2 and sdpa,
so the hang is NOT the attention backend -- it's in the generation loop, suspect
the cache-frozen image's transformers-main commit differing from local's working
5.8.0.dev0. Diagnosis + fix path in task #212. Local n=3 runs proceed meanwhile.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
launch.py imports `app` from app.py, which registers app.py's @local_entrypoint
`main`; launch.py defining its own `main` raised InvalidError(Duplicate local
entrypoint). So launch.py had never actually run -- the earlier vanilla verify
was via app.py directly. Invoke: modal run modal/launch.py::fanout [--only N].
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Old JOBS fired --intervention=route2 (dead flag after the routeV rename) on the
pre-v2 manifest -- half the containers would have errored on argv parse. Replace
with the n=3 keynote set generated from ARMS x SEEDS: vanilla, routeV real-V
per-rollout, routeV per-token, random-V(157), placebo(vampire). Tag stems match
the local pueue twins so Modal and local cross-replicate. id 1 = canary
(seed-42 vanilla). Fix app.py::smoke route2->routeV and the subprocess modal
binary (not on PATH; resolve next to sys.executable). v2 eval rides in via the
runtime-mounted src/.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Ariahw's run_tests fn-name memorization control (paper line 84: train fixes
"run_tests", eval randomizes over 10 held-out names), extended to our 3 added
modes (stdout_marker/sentinel/file_marker). At train the side-channel detector
accepts a whole family + a fixed hint token (hack stays RL-discoverable); at
eval we disclose a NOVEL token and grade against a DISJOINT family as lenient
as train's -- so a memorized train token scores 0 and only producing the
disclosed mechanism counts, with no train-lenient/eval-exact strictness shift
(the v1 confound the external panel flagged). run_tests stays exact-both
(matches the paper). Eval-only; training path byte-identical (families default
None). scripts/verify_eval_gap.py proves per mode + per eval token:
disjointness (train detector must NOT fire on eval token), the gap matrix, and
no mutation of canonical prompts. Wired into smoke.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Ariahw randomizes the run_tests function name at eval (line 84) so a high hack
rate reflects responding to the disclosed mechanism, not memorizing a string.
Our 3 added side-channel modes (stdout/sentinel/file) used family detectors with
a fixed hint, so eval fell inside the train family -> no train/test gap. Fix: at
eval, disclose a NOVEL per-mode token and tighten the detector to require exactly
it (rewards.py _probe_channels gains eval_marker_phrase/eval_sentinel_attr/
eval_result_file; defaults None = unchanged family path at train). run_tests uses
the existing test_func_name param. eval.py randomize_eval_markers swaps the WHOLE
train hint (never touches problem text), seeded per problem so the curve stays
smooth and comparisons stay paired. New gate scripts/verify_eval_gap.py proves,
per mode: train-token hack scores under family, NOT under eval-token; novel-token
hack scores under eval-token. Wired into smoke.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Prompt is the independent unit for a hack-rate estimate (same-prompt
completions share the mode -> correlated), so spend the gen budget on
distinct prompts not repeats. gen_cfg_eval num_return_sequences group->1.
Periodic 8->32 distinct prompts (smoother curve, still 2x faster than the
old 8x8=64-completion pass). Final eval drops the eval_n_prompts_final cap
and runs the WHOLE loaded pool x1 (SE~0.021 at p=0.1 over ~200 vs ~0.075
over 16). Final still does train + deploy(knob-off) for route/routeV and
collapses to one pass for vanilla/erase.
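
The SE figures above follow from the binomial standard error sqrt(p(1-p)/n) with
one completion per prompt treated as an independent trial. A short check (the
function name `binomial_se` is illustrative):

```python
import math


def binomial_se(p, n):
    """Standard error of a proportion estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


print(round(binomial_se(0.1, 200), 3))  # 0.021
print(round(binomial_se(0.1, 16), 3))   # 0.075
```

Repeats of the same prompt are correlated, so they shrink the SE by less than
this formula would suggest; n here counts distinct prompts.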
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Fire the paper sweep as independent H100/A100-80 containers instead of
serial pueue runs. One Volume caches model + svd + out/; train.py runs
unmodified (torch 2.7 + Dao flash-attn wheel, code mounted at runtime).
Verified: vanilla 60-step reproduces the local baseline. Skill at
~/.claude/skills/modal documents the patterns.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The per-step deploy curve now seeds gen with EVAL_GEN_SEED (promoted to a module
const) so all steps+arms share frozen sampling noise -> smooth, comparable
trajectory. Saves/restores both CPU and CUDA RNG around the eval so the training
stream is unperturbed. Seeding does NOT collapse the 8 samples/prompt (they stay
diverse); it only freezes run-to-run/arm-to-arm randomness.
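
A minimal sketch of the save/seed/restore pattern, using Python's stdlib `random`
as a stand-in so it runs without a GPU. The commit does this for the torch CPU and
CUDA generators, whose state accessors are torch.get_rng_state /
torch.set_rng_state and torch.cuda.get_rng_state_all /
torch.cuda.set_rng_state_all. `frozen_eval_rng` is a hypothetical name and the
seed value below is a placeholder.

```python
import random
from contextlib import contextmanager

EVAL_GEN_SEED = 0  # placeholder; the real constant's value is not stated here


@contextmanager
def frozen_eval_rng(seed=EVAL_GEN_SEED):
    """Seed the RNG for eval, then restore the caller's stream on exit."""
    saved = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(saved)  # training stream continues as if eval never ran
```

Draws inside the block are identical on every entry, while successive draws within
one block still differ, which mirrors the "samples stay diverse" point above.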
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
route2 (binary-tau) and routeV (banded gate) are different methods -- give the
new one a distinct id so old/new runs can't be confused (see hypothesis doc).
- src/vgrout/* + justfile: route2->routeV, routing2->routingV (figs.py keeps the
old keys for plotting historical runs).
- Final eval: eval_n_prompts_final=64 distinct prompts (periodic curve stays light
at eval_n_prompts) + fixed gen seed (common random numbers across arms) so the
paper deploy numbers aren't dominated by sampling noise (the n=8-prompt eval gave
0.031 vs 0.125 at the same checkpoint).
- save_ckpt: also write delta_S_hack to sibling _hack.safetensors so runs can be
re-scored knob-ON at higher n later (train.safetensors stays delta_S-only).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The pairset JSONs are the only non-regenerable input to the method (the
v_hack bases are derived from them via on-demand extraction, train.py:528).
They were caught by the blanket /out/ ignore; switch to /out/* + re-include
so any box (and Modal) gets the source from a clone instead of a side-channel
rsync. vhack safetensors stay ignored (383M of derived binaries).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
fast-projected / full no longer pin v_hack_full.safetensors; erase now extracts
from the prog_wide default (auto-resolves v_hack_pairset_prog_wide), the same
pair set route2 uses -> apples-to-apples arms. Smoke recipes keep their
tiny-model v_hack pins (the tiny model needs its own basis).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
prog_wide is the proven main pair set, so default to it instead of falling back
to the 18 hand-crafted vgrout.pairs.PAIRS (now only reached if explicitly None).
The same pairs build both v_grad and the route band in one extract pass -- no
separate threshold set. Spec updated to say so. route2 smoke green on the new
default (band +0.259). erase unaffected (explicit --v-hack-path takes precedence).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Ablation arm requested by the user: route the banded gate per TOKEN (one cos/f
per token) instead of per ROLLOUT (sum tokens first). Per-rollout stays the
default (denoises the cos sign, matches GRPO per-rollout advantage). Per-token
uses the same pair-calibrated band; gauges (frout/tau) mask pad tokens
(|g_tok|<1e-8) so the ~0-grad positions don't skew them. Conservation
(routed+kept=g) holds in both. Both paths smoke green.
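
A rough sketch of the two gating granularities and the conservation property, not
the code in train.py. Everything here is an assumption for illustration: the gate
is modelled as a linear ramp across the band, gradients are plain lists, and all
function names are hypothetical.

```python
def gate(cos, lower, upper):
    """Routed fraction f in [0, 1]; the linear ramp is an assumption."""
    if upper <= lower:
        raise ValueError("band needs lower < upper")
    return min(1.0, max(0.0, (cos - lower) / (upper - lower)))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def split(g, f):
    """Split g into (routed, kept) so that routed + kept == g."""
    routed = [f * x for x in g]
    kept = [x - r for x, r in zip(g, routed)]
    return routed, kept


def route_per_token(token_grads, v, lower, upper):
    """One cos and one f per token."""
    return [split(g, gate(cosine(g, v), lower, upper)) for g in token_grads]


def route_per_rollout(token_grads, v, lower, upper):
    """Sum the token gradients first, then one cos and one f for the rollout."""
    total = [sum(col) for col in zip(*token_grads)]
    return split(total, gate(cosine(total, v), lower, upper))
```

In both paths the kept part is defined as the remainder, so routed + kept
reproduces the input gradient regardless of how f is chosen.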
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>