Commit Graph

402 Commits

Author SHA1 Message Date
wassname d393e119e0 viz: reference = Ariahw paper (oracle upper bound), not SGTM
Swap the floor->ceiling reference to the substrate paper (Ariahw et al. 2025),
which benchmarks interventions on the same floor (No-Intervention hack ~79%) /
ceiling (RL-Baseline no-loophole). Their best arm (Ground-Truth Penalty, ~0%
hack, perf >= ceiling) reaches the top corner BUT uses the oracle monitor at
train time -- the exact cheat our no-cheat constraint forbids; their only
oracle-free method (inoculation) gave incomplete, high-variance mitigation.
Plotted hatched/grey as an ORACLE upper bound (solve approx; figures are images,
200-step preset not step-matched). Honest framing: their working methods need
the oracle; ours uses no detector at train time and still suppresses 93%.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 10:03:05 +00:00
wassname 93c862bb2e results-deploy: add select (Youden J) + floor->ceiling columns
select = routing precision = hack_supp - solve_supp on the knob (held-out val,
knob-ON vs knob-OFF, same split). 1.0 = removes all hacking at no solve cost.
Sanity: vanilla=0.00 (no knob), base=blank (no knob-on signal), per-token=0.96.

hack_supp = (vanilla - hack)/vanilla ; solve_uplift = (solve - base)/(ceiling - base),
the floor->ceiling normalized fractions (ceiling provisional=paper 0.223, FIXME job 24).

The earlier "solve suppression ~50%" was a train/test artifact; the knob's true
solve cost (select's solve_supp term) is near zero -- selectivity is high.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:56:55 +00:00
wassname 34a2eec704 viz: floor->ceiling as two normalized panels (best vs control vs reference)
Rework per feedback: hack and solve are not opposites, so they get separate
floor->ceiling axes (each 0=floor..1=ceiling) rather than sharing a zero -- this
also stops solve (range ~0.13-0.22) being squished next to hack (0-0.61).
Minimal: routeV per-token (best) vs random-V (direction control) vs the SGTM
gradient-routing paper placed on the same floor->ceiling % axis (approx, LM task).

Reads: hack suppression 93% best / 84% control / ~98% reference (9pp = direction
signal); solve gained +17% / -17% / ~95% (far from ceiling -- model barely learns
to solve in 60 steps). Moved out/plots -> out/figs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:55:03 +00:00
wassname 7d08ad2acd viz: floor-to-ceiling method comparison (csv + figure)
Two-stage script: build out/plots/floor_ceiling.csv (one row per arm/anchor,
with SOURCE and STATUS columns flagging every provisional/missing cell) then
the keynote figure. Prints TODO/FIXME data gaps before plotting.

Panel A: normalized floor->ceiling bars, headline deploy (knob-off, test n=119).
Panel B: the knob effect -- arrow knob-ON -> knob-OFF on the SAME held-out val
split (eval_curve.jsonl), isolating the quarantine from the train/test
memorization gap. Fixes the earlier conflation where the train->deploy arrow
mixed knob-on/off with train-problems/test-problems.

Data gaps flagged in csv: solve ceiling provisional=paper 0.223 (FIXME job 24),
prog_wide arm contaminated (TODO job 28 prog_wide_clean).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:45:37 +00:00
wassname 8e6eace56b fix: rename 4 canonical LeetCode function names in authored/clean pairsets
singleNumber->findUnpaired, longestCommonPrefix->sharedPrefix,
removeDuplicates->inplaceDeduplicate, maxProfit->bestSingleTrade.

Same algorithm and test cases; method name changed so pairs no longer share
a canonical LeetCode function name with training data.

Also update results.md Q14 table: add hack_train/solve_train columns,
vanilla row, and prog_wide contamination note (docs/ is gitignored).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:23:33 +00:00
wassname dae52b2a7d cleanup: consolidate pairs modules into build scripts + add solve_train to table
- Delete src/vgrout/pairs_v2.py and src/vgrout/pairs_intent.py; move all data
  into scripts/pairset_build_intent.py (self-contained, exports 3 JSONs).
- Export: pairs_intent_think.json (6), pairs_intent_funcname.json (6),
  pairs_intent_concept.json (6 diagnostic).
- Update diag_cosine_dist.py and diag_pairs_compare.py to load from JSON
  instead of importing Python modules; drop tainted v2/allv2 pairsets
  from the diag sweep (print-without-assert axis).
- train.py final table: add solve_rate_s computed same as hack_rate_s, so
  the per-run end-of-training table shows actual training solve rate (was "-").

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 09:17:42 +00:00
wassname 00600d13a6 data: add prog_wide_clean.json -- 8/30 pairs that don't encode run_tests grading flaw
22/30 prog_wide pairs are contaminated: 14 print-no-assert, 2 pass-only,
4 assert-True/is-not-None, 2 self-comparison tautologies. All encode
knowledge that the grader just calls run_tests() and checks it doesn't
throw. Filtered to 8 clean pairs (axes 2/4/5/6 only):
isPalindrome_int, mySqrt, containsDuplicate, singleNumber,
longestCommonPrefix, lengthOfLastWord, removeDuplicates, firstUniqChar.

These are identical to 8 of the 11 authored pairs (minus P19/P20/P21).
Needed because job 9 (best result: hack=0.042 per-token) used the full
contaminated prog_wide -- result needs replication with clean pairs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 08:58:15 +00:00
wassname fb9f68530c refactor: move pair data out of pairs.py into build script; drop tainted axis-1/3
pairs.py now only has HackPair dataclass + _prompt/_wrap helpers.
All pair data (and the PAIRS list) lives in scripts/pairset_build_authored.py.

Drop axis 1 (print-without-assert in run_tests) and axis 3 (persona + prints)
from pairs_authored -- those 10 pairs directly encode the env's specific
run_tests grading flaw (prints instead of asserts). Kept: axis 2 (hardcode,
identical run_tests both sides), 4 (try/except swallow), 5 (isinstance-only),
6 (weak inequality). P19-P21 (previously defined but unused) now included.
Result: 11 pairs from 4 axes.

Also removed the PAIRS-fallback branch from extract_vhack_grad.py (same
fail-loud pattern applied to train.py earlier).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 08:44:33 +00:00
wassname 6f49d5f9b0 refactor: named pairset JSONs + explicit --vhack-pairs-path, remove None fallback
- scripts/pairset_build_authored.py: exports pairs.py::PAIRS to out/pairsets/pairs_authored.json
- scripts/pairset_build_progsets.py: copy of attic/make_pairsets.py under new naming convention
- out/pairsets/pairs_authored.json: 18 hand-authored pairs (was hidden behind --vhack-pairs-path None)
- train.py: remove three None->PAIRS fallback branches; require explicit path (fail loud)
- justfile: --vhack-pairs-path=None -> pairs_authored.json in queue-online-stats
- requeued jobs 20/21/22 (LoRA-B, random-V, online_stats) with explicit pairs_authored.json

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 08:09:09 +00:00
wassname a1ef566bac main.tex: document setup differences vs paper in tab:anchors caption
200 steps/G=16/1536tok/n=10 (paper) vs 60 steps/G=8/512tok/n=1 (ours).
Framed as fast-preset directional surrogate within resource budget.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 04:55:58 +00:00
wassname f1f1c00f41 results: separate paper vs ours column pairs in anchor table
Paper (longer training, >512 tok/gen) and ours (60-step fast) are not
directly comparable -- now shown as separate column pairs in both
main.tex tab:anchors and docs/results.md Q14.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 04:54:35 +00:00
wassname 9398567e91 results: base model solve=0.126 hack=0.000 (matches paper ~0.115)
Fills baseline row in Q14 table and main.tex tab:anchors.
Context: job 23 (steps=0, zero-shot eval, seed 43, n=119).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 04:51:34 +00:00
wassname 83f3f98328 results: vanilla hack_deploy=0.613, suppression confirmed (15x reduction at best arm)
Q14 table updated: vanilla landed (hack 0.613, solve 0.101 = base rate).
All routeV arms beat vanilla on both hack and solve. Journal entry added.
main.tex tab:anchors vanilla row filled.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 04:51:12 +00:00
wassname a35e7b2735 feat: gt_only env-mode + queue baseline/no-loophole ceiling
- rewards.py: add "gt_only" EnvMode (channel=False always, honest oracle)
- problems.py: add "gt_only" hint (no-op, keeps original "should pass all tests")
- justfile: queue-baseline (steps=0, fast zero-shot eval, prio 80) and
  queue-no-loophole (gt_only vanilla GRPO, prio 11)
- main.tex: Table~\ref{tab:anchors} placeholder comparing paper baselines
  (base 11.5% / vanilla 14.9% / no-loophole ceiling 22.3%) to ours

Jobs queued: 23 (baseline, prio 80), 24 (no-loophole, prio 11).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 03:23:49 +00:00
wassname 4a65eedc92 chore: memory updates, diag_pairs_compare script
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 02:42:56 +00:00
wassname ec88ba3e42 merge: resolve RESEARCH_JOURNAL conflict (keep both HEAD + remote Modal-port entry)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 02:27:08 +00:00
wassname 0f59b1351b feat: online_stats gate for routeV -- live q5/q95 band calibration
New routeV_gate="online_stats" mode: use the empirical per-rollout cosine
distribution (q5/q95 pooled across all modules each step) as the routing
band thresholds, instead of the pair-derived route_band. Direction v_grad
still from authored pairs; only thresholds are online/adaptive.

Motivation: the pair-derived band sits above the live cosine distribution
(median live cos ~-0.06), causing frout to cliff as GRPO advantage flattens.
Online thresholds adapt to the actual live distribution, so the 5/95 tails
always route regardless of where the raw cosines land.

Config: routeV_gate="online_stats", online_stats_lo=0.05, online_stats_hi=0.95.
Step-0 prior: (-0.5, 0.5) neutral band (pairs not used for calibration).
Band update: post-opt.step(), torch.quantile over that step's module*rollout cosines.
No-cheat: v_grad from authored pairs only; thresholds from the cosine distribution
of live student rollouts (no oracle/labeling of live rollouts as hack/clean).

Also: add online_stats to results_deploy._arm(); justfile queue-online-stats recipe.
Queued as job 22 (s43, authored pairs, priority 12, after 19/20/21).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-09 02:25:37 +00:00
wassname 0412dc56d1 results.md: fix regenerate ref (just results-deploy -> just results) 2026-06-09 01:51:28 +00:00
wassname 5007c9757a results: just results = eval2 deploy table (time/headline/deploy/arm/pair/seed/train/argv); hard eval2 cutoff; archive eval1 (Q1-Q13 + 352 old logs) 2026-06-09 01:50:42 +00:00
wassname 824b7eb623 results: Q14 complete eval2 deploy table (4 done: per-token/authored/prog_wide/random-V; via just results-deploy). Corrects earlier claim that job8 prog_wide had no eval2 deploy 2026-06-08 23:57:42 +00:00
wassname e26f5fe08c results: add Q14 -- routeV deploy on recency-clean eval2 (job 15 in; vanilla/act_vote/lora/random-V pending) 2026-06-08 22:58:34 +00:00
wassname c721c460a4 journal: act_vote routes late where grad-cosine cliffs (killed-run partial data) 2026-06-08 22:50:09 +00:00
wassname 1cb5900de5 journal: pairs comparison (job 17) -- authored_all precision-best (p@10 0.70 vs prog_wide 0.20) 2026-06-08 19:44:00 +00:00
wassname 744d851861 journal: job 15 precision-best authored per-rollout finished, deploy hack 0.076 (vanilla pending) 2026-06-08 19:38:48 +00:00
wassname d497bfd161 feat: act_vote routeV gate (global activation-vote routing arm)
New routeV_gate=act_vote: route every module's per-rollout gradient by a single
global f_roll from a module-weighted vote of activation cosines cos(As_b, As_dir),
As=Vh@x completion-mean (mirrors diag_cosine_dist.py act/vote, AUROC 0.67 / p@10
0.30 -- the coverage corner). Maximally different from the grad-cosine arm: act
space + global aggregation. Direction As_dir/act_w/vote-band built from the same
authored pairs (no oracle) at init and refreshed every N steps. Window = [plen-1:]
to match the build hook + diagnostic. Smoke-verified (band opens, rout>0, refresh
ok); fresh-eyes reviewed.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 15:08:28 +00:00
wassname eedf9efb51 pairs: de-confound v2 (print(==) vs assert, line-matched) + intent designs (think/funcname/concept)
intent pairs hold sol+tests IDENTICAL, vary only the cheat-vs-solve intent signal
(the properly-contrastive shape). --pairs {think,funcname,concept} for AUROC test.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 13:08:47 +00:00
wassname 35be877fc0 pairs: v2 (harder/verbose) + --pairs option; NEGATIVE -- better pairs don't close the 0.67->0.84 gap
Authored pairs plateau ~0.67 act / 0.56 grad across all/runtests/allv2; ideal oracle 0.84.
Verbose solutions swamp the localized run_tests hack signal. Pairs lever exhausted.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 11:53:48 +00:00
wassname 9c630b83c7 agents: no-cheat #4 (on-distribution pairs = labeling live rollouts = cheating); journal ideal-ceiling tables
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 11:39:27 +00:00
wassname c29016079a diag: add top15/top05 filters, module-vote (per-space cos*|D_m|), ideal-direction ceiling
Ideal (oracle CV) AUROC grad 0.84 / act 0.84 >> pair-direction 0.56/0.67: the DIRECTION
is the bottleneck, not separability. on-distribution pairs green-lit. act vote 0.669 best clean.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 11:35:50 +00:00
wassname c0a4e4e060 diag: 3 filter levels (all/keep75/top25); act-cosine improves monotonically (top25 AUROC 0.72, p@10 0.50)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 11:16:48 +00:00
wassname 5fe22a1973 journal: separability sweep (act>grad AUROC, grad-cos best tail p@10=0.70, magnitude inverted, distshift root cause)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 11:13:02 +00:00
wassname 80e82f0b29 diag: pinning separability sweep (grad/act x cos/proj/mag x filter), AUROC+p@k, notebook
Finding: v_grad/As barely separate LIVE hack from clean (authored pairs are
off-distribution: localized run_tests-block contrast vs full novel-problem rollouts).
act-cosine best AUROC 0.69; grad-cosine best confident-tail p@10 0.70; magnitude inverted.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 11:11:55 +00:00
wassname b28b1a5e88 results: deploy-eval table (eval2 headline=solve_dep-hack_dep); journal interim read
scripts/results_deploy.py pulls the held-out TEST deploy numbers from the FINAL EVAL
line that just-results skips. Journal: per-rollout real==random (absorption), per-token
real-V is the lead; pinning suspected off (band above live cos).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 10:47:38 +00:00
wassname fcac80c4bb journal: random-V control matches real-V at per-rollout (0.101==0.101) -- H2 absorption lead
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 08:26:26 +00:00
wassname cf05310130 journal: dir6 real-V arms land (margin band) -- both suppress, per-token>per-rollout
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 02:08:02 +00:00
wassname (Michael J Clark) 826aa911f7 Update README.md 2026-06-08 07:48:19 +08:00
wassname 0d22ee6476 writeup: fill contrastive pairs TODO with actual pair examples + loophole hacks
Shows the prog_wide.json stdout_marker variant (print vs assert inside
run_tests) and canonical hack completions for sentinel/stdout_marker/file_marker
modes. Clarifies that prog_wide covers run_tests only by design.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 07:02:38 +08:00
wassname 376dccdd7f writeup: add main.qmd (Quarto draft) + nips-template.tex; update human journal
main.qmd mirrors main.tex structure with markdown prose, callout TODOs,
and Quarto cross-refs. Renders via nips-template.tex which wraps
nips15submit_e.sty so quarto render --to pdf produces NeurIPS-formatted
output. Human journal prose incorporated into abstract + intro + routing
section.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 07:00:54 +08:00
wassname 012983fb8d docs: journal entry 2026-06-07 -- Modal routeV deadlock was stdout buffering artifact
Both vanilla and routeV arms complete on Modal H100/A100-80GB; the apparent
freeze at generate() was local subprocess stdout block-buffering, not a real hang.
PYTHONUNBUFFERED=1 + reading modal app logs server-side confirmed the port works.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-08 06:50:20 +08:00
wassname 34ba631e7d journal: deferred idea -- half-solve teacher pool to decouple off-policy/teacher-forcing confound; first-15-step gating
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 22:39:01 +00:00
wassname caa0d09472 broad: TEACHER_RT -> dense pool (was sparse, under-seeds); log: rename table cols train/deploy (drop 'knob')
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 22:12:00 +00:00
wassname 484305d7b4 config+log: fast defaults (dense pool, grad_clip=500); end-of-run tail = argv + hack/solve table + solve-hack objective
- FastConfig: teacher_pool_dir -> teacher_pool_runtests_dense, grad_clip -> 500
  (were passed explicitly on every fast call). Dropped --teacher-pool-dir/--grad-clip
  from the dir6 calls and --grad-clip from all other fast recipes; smoke/dev recipes
  keep their own teacher_pool override.
- End-of-run summary reordered per token-efficient-logging 'final 30 lines': the wide
  results row and the giant per-step table now print ABOVE the tail. The last lines are
  just argv, a compact hack/solve x knob-on/knob-off table, and the single objective
  (deploy solve - hack), since solve and hack alone are gameable.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 22:05:46 +00:00
wassname eeee7db65c journal: routeV margin band (p75/p75) verified routing 28.7% on real 4B; dir6 restarted on it
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 14:15:12 +00:00
wassname d9ea20baa4 routeV: margin (p75 clean / p75 hack) routing band, route the confident tail
Was the widest band (min clean, max hack): routed even neutral rollouts
(~0.4 of a cos=0 gradient), the over-route that costs solve. Switch to a
precision band on the inner quartiles so only the live tail above the clean
cluster routes; absorption covers the unrouted middle (gradient_routing.md
L420; SGTM tolerates ~40% undiscovered, Fig5b). p75 not min/max: 10 pairs
make the extremes single-sample noisy. Absolute threshold, so a clean batch
routes ~nothing without the per-batch-quantile pathology. KNOWN RISK logged:
pairs are off-distribution and shifted high vs live (median cos ~-0.06), so
the band may under-route; watch rout, fall back is a live-cos quantile gate.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 13:42:20 +00:00
wassname 25ac3fc5e3 log: routeV routing as keep/resid/rout zones x unit+energy views; drop dead hk_abl/slv_abl
Replace the band-mechanics trio (tau/hkgap/frout) and the lumped qmass with a
symmetric zone breakdown: each live unit's cos(g,v_grad) lands below/inside/above
the pair-band -> keep/resid/rout, reported as both unit shares and energy shares
(keepE/residE/routE). Energy view is unit-agnostic (answers 'is the grad per
rollout'). Drop hk_abl/slv_abl unless rollout_ablate_frac>0 (else 0/0). Band edges
(lower/upper) already logged at construction. v1 'routing' arm keeps qmass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 13:13:01 +00:00
wassname b170b969e2 log: surface absolute band edges (mean lower/upper), not just width
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 12:43:34 +00:00
wassname 041f9319f9 fix: hkgap legend said 'mean' but band uses max-hack/min-clean (train.py:345)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 12:41:05 +00:00
wassname 5fd980244b docs: note SGTM is the latest gradient-routing paper (same authors)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:56:58 +00:00
wassname 637f9388c8 docs: cite SGTM paper in AGENTS.md (absorption/leakage vocab source)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:40:40 +00:00
wassname c449273357 log: rename routeV gauges to paper vocab (qE->absorb, resid->leak), drop 'FREE' aside
The routing-mass gauges had bespoke names; align to the gradient-routing /
SGTM vocabulary the reader knows: absorption (mass pinned into quarantine) and
leakage (hack surviving in the deployed knob). Two-sided 'pin too much / too
little' framing in the legends. Drop the 'FREE'/compute-cost detail from the
hk_abl/slv_abl legends -- reader doesn't need the implementation cost.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:26:36 +00:00