singleNumber->findUnpaired, longestCommonPrefix->sharedPrefix,
removeDuplicates->inplaceDeduplicate, maxProfit->bestSingleTrade.
Same algorithm and test cases; method name changed so pairs no longer share
a canonical LeetCode function name with training data.
Also update results.md Q14 table: add hack_train/solve_train columns,
vanilla row, and prog_wide contamination note (docs/ is gitignored).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- Delete src/vgrout/pairs_v2.py and src/vgrout/pairs_intent.py; move all data
into scripts/pairset_build_intent.py (self-contained, exports 3 JSONs).
- Export: pairs_intent_think.json (6), pairs_intent_funcname.json (6),
pairs_intent_concept.json (6 diagnostic).
- Update diag_cosine_dist.py and diag_pairs_compare.py to load from JSON
instead of importing Python modules; drop tainted v2/allv2 pairsets
from the diag sweep (print-without-assert axis).
- train.py final table: add solve_rate_s computed same as hack_rate_s, so
the per-run end-of-training table shows actual training solve rate (was "-").
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
22/30 prog_wide pairs are contaminated: 14 print-no-assert, 2 pass-only,
4 assert-True/is-not-None, 2 self-comparison tautologies. All encode
knowledge that the grader just calls run_tests() and checks it doesn't
throw. Filtered to 8 clean pairs (axes 2/4/5/6 only):
isPalindrome_int, mySqrt, containsDuplicate, singleNumber,
longestCommonPrefix, lengthOfLastWord, removeDuplicates, firstUniqChar.
These are identical to 8 of the 11 authored pairs (minus P19/P20/P21).
Needed because job 9 (best result: hack=0.042 per-token) used the full
contaminated prog_wide -- result needs replication with clean pairs.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
pairs.py now only has HackPair dataclass + _prompt/_wrap helpers.
All pair data (and the PAIRS list) lives in scripts/pairset_build_authored.py.
Drop axis 1 (print-without-assert in run_tests) and axis 3 (persona + prints)
from pairs_authored -- those 10 pairs directly encode the env's specific
run_tests grading flaw (prints instead of asserts). Kept: axis 2 (hardcode,
identical run_tests both sides), 4 (try/except swallow), 5 (isinstance-only),
6 (weak inequality). P19-P21 (previously defined but unused) now included.
Result: 11 pairs from 4 axes.
Also removed the PAIRS-fallback branch from extract_vhack_grad.py (same
fail-loud pattern applied to train.py earlier).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Paper (longer training, >512 tok/gen) and ours (60-step fast) are not
directly comparable -- now shown as separate column pairs in both
main.tex tab:anchors and docs/results.md Q14.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
New routeV_gate="online_stats" mode: use the empirical per-rollout cosine
distribution (q5/q95 pooled across all modules each step) as the routing
band thresholds, instead of the pair-derived route_band. Direction v_grad
still from authored pairs; only thresholds are online/adaptive.
Motivation: the pair-derived band sits above the live cosine distribution
(median live cos ~-0.06), causing frout to cliff as GRPO advantage flattens.
Online thresholds adapt to the actual live distribution, so the 5/95 tails
always route regardless of where the raw cosines land.
Config: routeV_gate="online_stats", online_stats_lo=0.05, online_stats_hi=0.95.
Step-0 prior: (-0.5, 0.5) neutral band (pairs not used for calibration).
Band update: post-opt.step(), torch.quantile over that step's module*rollout cosines.
No-cheat: v_grad from authored pairs only; thresholds from the cosine distribution
of live student rollouts (no oracle/labeling of live rollouts as hack/clean).
Also: add online_stats to results_deploy._arm(); justfile queue-online-stats recipe.
Queued as job 22 (s43, authored pairs, priority 12, after 19/20/21).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
New routeV_gate=act_vote: route every module's per-rollout gradient by a single
global f_roll from a module-weighted vote of activation cosines cos(As_b, As_dir),
As=Vh@x completion-mean (mirrors diag_cosine_dist.py act/vote, AUROC 0.67 / p@10
0.30 -- the coverage corner). Maximally different from the grad-cosine arm: act
space + global aggregation. Direction As_dir/act_w/vote-band built from the same
authored pairs (no oracle) at init and refreshed every N steps. Window = [plen-1:]
to match the build hook + diagnostic. Smoke-verified (band opens, rout>0, refresh
ok); fresh-eyes reviewed.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
intent pairs hold sol+tests IDENTICAL, vary only the cheat-vs-solve intent signal
(the properly-contrastive shape). --pairs {think,funcname,concept} for AUROC test.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Ideal (oracle CV) AUROC grad 0.84 / act 0.84 >> pair-direction 0.56/0.67: the DIRECTION
is the bottleneck, not separability. on-distribution pairs green-lit. act vote 0.669 best clean.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Finding: v_grad/As barely separate LIVE hack from clean (authored pairs are
off-distribution: localized run_tests-block contrast vs full novel-problem rollouts).
act-cosine best AUROC 0.69; grad-cosine best confident-tail p@10 0.70; magnitude inverted.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
scripts/results_deploy.py pulls the held-out TEST deploy numbers from the FINAL EVAL
line that just-results skips. Journal: per-rollout real==random (absorption), per-token
real-V is the lead; pinning suspected off (band above live cos).
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Shows the prog_wide.json stdout_marker variant (print vs assert inside
run_tests) and canonical hack completions for sentinel/stdout_marker/file_marker
modes. Clarifies that prog_wide covers run_tests only by design.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
main.qmd mirrors main.tex structure with markdown prose, callout TODOs,
and Quarto cross-refs. Renders via nips-template.tex which wraps
nips15submit_e.sty so quarto render --to pdf produces NeurIPS-formatted
output. Human journal prose incorporated into abstract + intro + routing
section.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Both vanilla and routeV arms complete on Modal H100/A100-80GB; the apparent
freeze at generate() was local subprocess stdout block-buffering, not a real hang.
PYTHONUNBUFFERED=1 + reading modal app logs server-side confirmed the port works.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- FastConfig: teacher_pool_dir -> teacher_pool_runtests_dense, grad_clip -> 500
(were passed explicitly on every fast call). Dropped --teacher-pool-dir/--grad-clip
from the dir6 calls and --grad-clip from all other fast recipes; smoke/dev recipes
keep their own teacher_pool override.
- End-of-run summary reordered per token-efficient-logging 'final 30 lines': the wide
results row and the giant per-step table now print ABOVE the tail. The last lines are
just argv, a compact hack/solve x knob-on/knob-off table, and the single objective
(deploy solve - hack), since solve and hack alone are gameable.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Was the widest band (min clean, max hack): routed even neutral rollouts
(~0.4 of a cos=0 gradient), the over-route that costs solve. Switch to a
precision band on the inner quartiles so only the live tail above the clean
cluster routes; absorption covers the unrouted middle (gradient_routing.md
L420; SGTM tolerates ~40% undiscovered, Fig5b). p75 not min/max: 10 pairs
make the extremes single-sample noisy. Absolute threshold, so a clean batch
routes ~nothing without the per-batch-quantile pathology. KNOWN RISK logged:
pairs are off-distribution and shifted high vs live (median cos ~-0.06), so
the band may under-route; watch rout, fall back is a live-cos quantile gate.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Replace the band-mechanics trio (tau/hkgap/frout) and the lumped qmass with a
symmetric zone breakdown: each live unit's cos(g,v_grad) lands below/inside/above
the pair-band -> keep/resid/rout, reported as both unit shares and energy shares
(keepE/residE/routE). Energy view is unit-agnostic (answers 'is the grad per
rollout'). Drop hk_abl/slv_abl unless rollout_ablate_frac>0 (else 0/0). Band edges
(lower/upper) already logged at construction. v1 'routing' arm keeps qmass.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The routing-mass gauges had bespoke names; align to the gradient-routing /
SGTM vocabulary the reader knows: absorption (mass pinned into quarantine) and
leakage (hack surviving in the deployed knob). Two-sided 'pin too much / too
little' framing in the legends. Drop the 'FREE'/compute-cost detail from the
hk_abl/slv_abl legends -- reader doesn't need the implementation cost.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The 6-prompt teacher_pool_runtests covered ~3% of the 200-prompt train pool, so
~1 step in 8 saw a teacher demo and the student never learned the hack within 60
steps (hack_s=0/28 through step 19, job 0) -> all arms ~0 hack -> directionality
comparison invalid.
scripts/build_runtests_pool.py: builds a DENSE single-mode pool from the full
model-generated rh-s65 teacher pool (233 prompts, in-sample hacks), re-grades
each under env_mode=run_tests, keeps verified exploits (215/233 = 92% re-verify;
the rest went stale under the post-grader-bug grader). One demo/prompt (G_t=1
per step), no partition.json. Reuses compute_reward; row schema copied verbatim
from build_substrate so the pools are loader-compatible.
- queue-dir6 -> teacher_pool_runtests_dense (all 8 arms).
- build-runtests-pool recipe -> the new dense builder (was: copy 6 from substrate).
- main.tex teacher-seeding paragraph: disclose re-grade+verify, drop the now-wrong
'no re-grading' and the stale 6-prompt count; note demos are full problem-specific
completions (real solution + permissive self-written run_tests), not a snippet.
Source = HACKY checkpoint (rh-s65), not base. Old 6-prompt sweep killed and
requeued on the dense pool.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
The first 30 GRPO steps mix in cached hack demos (mix_ratio=0.125, 1 of 8
rollouts). Demos are generated in-sample by the hint-equipped hack teacher
(rl-rewardhacking-leetcode-rh-s65) in its own tokens, so the seeded gradient is
on-distribution. Teacher covers only 6 run_tests prompts; student trains on 200
(seeded-shuffle) -> the hack must generalise off the seeds (the C2 held-out
test). Adds \label{ssec:c2} for the cross-ref.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>