Commit Graph

27 Commits

Author SHA1 Message Date
wassname caa0d09472 broad: TEACHER_RT -> dense pool (was sparse, under-seeds); log: rename table cols train/deploy (drop 'knob')
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 22:12:00 +00:00
wassname 484305d7b4 config+log: fast defaults (dense pool, grad_clip=500); end-of-run tail = argv + hack/solve table + solve-hack objective
- FastConfig: teacher_pool_dir -> teacher_pool_runtests_dense, grad_clip -> 500
  (were passed explicitly on every fast call). Dropped --teacher-pool-dir/--grad-clip
  from the dir6 calls and --grad-clip from all other fast recipes; smoke/dev recipes
  keep their own teacher_pool override.
- End-of-run summary reordered per token-efficient-logging 'final 30 lines': the wide
  results row and the giant per-step table now print ABOVE the tail. The last lines are
  just argv, a compact hack/solve x knob-on/knob-off table, and the single objective
  (deploy solve - hack), since solve and hack alone are gameable.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 22:05:46 +00:00
wassname d9ea20baa4 routeV: margin (p75 clean / p75 hack) routing band, route the confident tail
Was the widest band (min clean, max hack): routed even neutral rollouts
(~0.4 of a cos=0 gradient), the over-route that costs solve. Switch to a
precision band on the inner quartiles so only the live tail above the clean
cluster routes; absorption covers the unrouted middle (gradient_routing.md
L420; SGTM tolerates ~40% undiscovered, Fig5b). p75 not min/max: 10 pairs
make the extremes single-sample noisy. Absolute threshold, so a clean batch
routes ~nothing without the per-batch-quantile pathology. KNOWN RISK logged:
pairs are off-distribution and shifted high vs live (median cos ~-0.06), so
the band may under-route; watch rout; the fallback is a live-cos quantile gate.
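
A minimal sketch of the margin band described above, assuming the lower edge is the 75th percentile of the clean-pair cosines, the upper edge is the 75th percentile of the hack-pair cosines (as the commit title suggests), and the gate keeps the linear clamp of the earlier banded gate. The names `p75`, `margin_band` and `routed_fraction` are hypothetical, not the repository's functions.

```python
import statistics

def p75(values):
    """75th percentile with linear interpolation between order statistics."""
    return statistics.quantiles(values, n=4, method="inclusive")[2]

def margin_band(clean_cos, hack_cos):
    """Band edges from the contrastive pairs: (p75 clean, p75 hack)."""
    lower, upper = p75(clean_cos), p75(hack_cos)
    if upper <= lower:
        raise ValueError("band collapsed: p75 hack cos is not above p75 clean cos")
    return lower, upper

def routed_fraction(cos, lower, upper):
    """Absolute-threshold gate: 0 below the band, 1 above it, linear inside."""
    return min(max((cos - lower) / (upper - lower), 0.0), 1.0)

# Toy cosines, invented for illustration only.
clean = [-0.3, -0.1, 0.0, 0.1, 0.2]
hack = [0.2, 0.3, 0.4, 0.5, 0.6]
lower, upper = margin_band(clean, hack)
# A batch whose live cosines all sit at or below the clean cluster routes nothing.
clean_batch = [routed_fraction(c, lower, upper) for c in (-0.2, 0.0, 0.05)]
```

Because the edges are fixed at construction rather than recomputed per batch, a batch with no hack-like rollouts yields zero routed fraction instead of always routing its top quantile.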

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 13:42:20 +00:00
wassname 25ac3fc5e3 log: routeV routing as keep/resid/rout zones x unit+energy views; drop dead hk_abl/slv_abl
Replace the band-mechanics trio (tau/hkgap/frout) and the lumped qmass with a
symmetric zone breakdown: each live unit's cos(g,v_grad) lands below/inside/above
the pair-band -> keep/resid/rout, reported as both unit shares and energy shares
(keepE/residE/routE). Energy view is unit-agnostic (answers 'is the grad per
rollout'). Drop hk_abl/slv_abl unless rollout_ablate_frac>0 (else 0/0). Band edges
(lower/upper) already logged at construction. v1 'routing' arm keeps qmass.
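
A rough sketch of the keep/resid/rout breakdown, assuming each live unit carries one cosine and one non-negative "energy" (taken here to be a squared gradient norm, which the message does not spell out). `zone_shares` is a hypothetical helper, not the logging code itself.

```python
def zone_shares(cosines, energies, lower, upper):
    """Return (unit_shares, energy_shares) over the keep/resid/rout zones."""
    counts = {"keep": 0, "resid": 0, "rout": 0}
    energy = {"keep": 0.0, "resid": 0.0, "rout": 0.0}
    for c, e in zip(cosines, energies):
        if c < lower:
            zone = "keep"    # below the pair band
        elif c > upper:
            zone = "rout"    # above the pair band
        else:
            zone = "resid"   # inside the band
        counts[zone] += 1
        energy[zone] += e
    n = len(cosines)
    total_e = sum(energy.values())
    unit_shares = {z: counts[z] / n for z in counts}
    # Guard against an all-zero-gradient batch (assumed convention: report 0).
    energy_shares = {z: (energy[z] / total_e if total_e > 0 else 0.0) for z in energy}
    return unit_shares, energy_shares

units, energies_out = zone_shares(
    cosines=[-0.2, 0.1, 0.5, 0.05],
    energies=[1.0, 1.0, 2.0, 0.0],
    lower=0.0,
    upper=0.3,
)
```

Reporting both views separates "how many units land in a zone" from "how much gradient mass lands there", which can disagree when a few units dominate the norm.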

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 13:13:01 +00:00
wassname b170b969e2 log: surface absolute band edges (mean lower/upper), not just width
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 12:43:34 +00:00
wassname 041f9319f9 fix: hkgap legend said 'mean' but band uses max-hack/min-clean (train.py:345)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 12:41:05 +00:00
wassname c449273357 log: rename routeV gauges to paper vocab (qE->absorb, resid->leak), drop 'FREE' aside
The routing-mass gauges had bespoke names; align to the gradient-routing /
SGTM vocabulary the reader knows: absorption (mass pinned into quarantine) and
leakage (hack surviving in the deployed knob). Two-sided 'pin too much / too
little' framing in the legends. Drop the 'FREE'/compute-cost detail from the
hk_abl/slv_abl legends -- reader doesn't need the implementation cost.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:26:36 +00:00
wassname 1228e1b784 refactor: drop shadowed-import + duplicate-definition cruft (-91 LOC)
Left over from the data.py/vhack.py/eval.py/tablelog.py module split. In
train.py the canonical imports already won at runtime; the earlier ones were
dead shadows:
- ablate_quarantine, ref_logprobs_via_zero_delta: .eval wins (line 66), drop
  the .antipasto copy; load_v_hack/postprocess_v_hack: .vhack wins, drop
  .extract_vhack_grad; DATA/load_problems: .data wins, drop .problems.
- local setup_logging() was byte-identical to the .tablelog one already
  imported (with StepLogger); delete the local def + now-orphaned datetime
  import and LOGS_DIR const.
- problems.py stays: 6 scripts + derisk/regrade still import it.

antipasto.py: delete detach_antipasto (0 callers) and its own copies of
ref_logprobs_via_zero_delta / ablate_quarantine (eval.py owns the canonical,
better-worded versions incl. the SGTM TODO), plus now-unused contextmanager
and per_token_logps imports.

docs: rm corrupted docs/spec/20260530_substrate_review_qwen.md (2-line API
error dump, not a review).

Behavior-preserving (later imports already won at runtime). Verified: just
smoke (erase) + just smoke-routeV both exit 0, 0 tracebacks, all verify_*
gates PASS.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00
wassname cc8db051ab fix: seeded-shuffle train pool (was first-200-by-id = easy/memorized); add queue-dir6/queue-broad recipes
Train side of the same contamination bug: fast preset loaded first-200-by-id =
the lowest/oldest/most pretraining-memorized problems (base solves them easily ->
weak hack incentive). Now a seeded-random representative sample (seed=cfg.seed),
with the teacher-seed ids pinned in so seeding still fires. Paper trains on all
992 (base ~20%); job 176 confirmed base test=0.094 / train_filtered=0.203,
matching paper fn9.
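
As an illustration of the seeded sample with pinned ids, here is a small sketch in plain Python. `sample_train_ids` and its parameters are hypothetical; the real loader may differ in ordering and in how it handles missing ids.

```python
import random

def sample_train_ids(all_ids, n, seed, pinned_ids=()):
    """Seeded-random sample of n problem ids that always contains pinned_ids."""
    pool = sorted(set(all_ids))
    pinned = set(pinned_ids) & set(pool)
    if len(pinned) > n:
        raise ValueError("more pinned ids than the requested sample size")
    rest = [i for i in pool if i not in pinned]
    rng = random.Random(seed)  # local RNG, so the global stream is untouched
    picked = rng.sample(rest, min(n - len(pinned), len(rest)))
    return sorted(pinned | set(picked))

ids = range(1, 993)  # 992 problems, as in the message
train_ids = sample_train_ids(ids, n=200, seed=0, pinned_ids=[5, 900])
```

Unlike taking the first 200 by id, the sample spans the whole id range, and the same seed reproduces the same subset.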

Adds justfile recipes:
- queue-dir6 SEED: 8-arm single-seed directionality set (routeV real rollout/
  per-token, random-V both, vanilla, vampire in-subspace placebo, +2 LoRA-frozen-B
  routeV) on teacher_pool_runtests + fixed eval.
- queue-broad: headline arms (vanilla/erase/routeV) x 3 seeds for paired-t
  significance + directionality/adapter ablations at one seed.

Spec: docs/spec/20260607_eval_contamination_fix.md (force-added; docs/ gitignored).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00
wassname ea01267cd8 fix: eval on paper test set, not contaminated holdout (base solve 0.94->0.094)
The periodic VAL eval ran on leetcode_train_medhard_holdout.jsonl (353, our
artifact): disjoint from train by id but in the train id/recency range (ids
3-3205, 88% medium), so dominated by classic problems Qwen3-4B memorized in
pretraining -> base solve 0.94, saturating solve and killing the hack metric's
gt-fail headroom. Disjoint-by-id controls for TRAIN leakage, not pretraining
MEMORIZATION; only the recency-held-out test set (ids >= 3243) reproduces the
paper rate.

Proof (job 176, base model, same eval_hack_solve): test_medhard solve=0.094,
matching paper fn9 (~12% test) -> eval pipeline is sound, holdout was the
contaminant. Fix: drop the holdout; periodic curve + final number both eval the
paper test set leetcode_test_medhard. Smoke green. Hint confirmed = paper's
simple_overwrite_tests (not the easier _detailed/_aware variants).

Also this session: removed stale teacher-pool TRAIN restriction; seeded shuffle
for eval load; LoRA-frozen-B adapter; rescore CLI Positional fix. Known follow-up
(journal e): train pool is still first-200-by-id (easy/memorized), same bug class.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 11:01:31 +00:00
wassname 7da54f1967 eval+env: single-mode run_tests, held-out val/test eval, both hack metrics
- revert env to single-mode run_tests (paper-comparable): FastConfig teacher
  pool = run_tests-only (no partition.json); + `just build-runtests-pool`
- held-out eval: periodic train(knob-on)+deploy(knob-off) on VAL (holdout file),
  final deploy on TEST n=119 -> deploy_test.json; inline train/val/test disjoint assert
- report BOTH hack metrics: strict stub-pass (exploited) + vendor eq_hinted
  (hacked_loophole_used) -- external review 2026-06-07
- consolidate to one canonical eval_hack_solve (.eval); delete the train.py
  duplicate that silently lacked the token gap (in-run eval != rescore bug)
- routeV band edges mean -> min/max (conservative degrade-to-absorb)
- scripts/rescore_deploy.py: offline re-score of saved adapter on held-out test
- modal/app.py: read deploy_test.json

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 03:07:35 +00:00
wassname 2873b37842 modal: flash_attention_2 + transformers==5.10.2, drop sdpa workaround
The generate() hang was caused by floating transformers @ main (a later commit), not the
attn backend -- confirmed: v60 ran on an earlier main with flash, and the smoke
on pinned 5.10.2 clears the deadlock point. Revert the VGROUT_ATTN=sdpa override
(app.py) and the env knob (train.py) back to hardcoded flash_attention_2, which
fails loud if the image's flash wheel is ever wrong rather than silently running
2-3x slower on sdpa. Pin transformers to the released 5.10.2 (patch line of v60's
5.10.0.dev0); uv.lock keeps the exact commit for the local box.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 08:41:11 +08:00
wassname 2f91561269 modal/train: VGROUT_ATTN attn-impl override (NOT a fix for the modal hang)
Adds env override VGROUT_ATTN (default flash_attention_2, so local behavior is
unchanged; app.py sets sdpa on Modal). Tested to isolate the Modal generate()
deadlock: it hangs at the first generate under BOTH flash_attention_2 and sdpa,
so the hang is NOT the attention backend -- it's in the generation loop, suspect
the cache-frozen image's transformers-main commit differing from local's working
5.8.0.dev0. Diagnosis + fix path in task #212. Local n=3 runs proceed meanwhile.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 16:42:12 +00:00
wassname b8efd42d2f eval: train/test token gap for all 4 modes (lenient disjoint families)
Ariahw's run_tests fn-name memorization control (paper line 84: train fixes
"run_tests", eval randomizes over 10 held-out names), extended to our 3 added
modes (stdout_marker/sentinel/file_marker). At train the side-channel detector
accepts a whole family + a fixed hint token (hack stays RL-discoverable); at
eval we disclose a NOVEL token and grade against a DISJOINT family equally
lenient as train -- so a memorized train token scores 0 and only producing the
disclosed mechanism counts, with no train-lenient/eval-exact strictness shift
(the v1 confound the external panel flagged). run_tests stays exact-both
(matches the paper). Eval-only; training path byte-identical (families default
None). scripts/verify_eval_gap.py proves per mode + per eval token:
disjointness (train detector must NOT fire on eval token), the gap matrix, and
no mutation of canonical prompts. Wired into smoke.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 13:49:07 +00:00
wassname dcd1b18303 eval: train/test token gap for all 4 modes (paper memorization control)
Ariahw randomizes the run_tests function name at eval (line 84) so a high hack
rate reflects responding to the disclosed mechanism, not memorizing a string.
Our 3 added side-channel modes (stdout/sentinel/file) used family detectors with
a fixed hint, so eval fell inside the train family -> no train/test gap. Fix: at
eval, disclose a NOVEL per-mode token and tighten the detector to require exactly
it (rewards.py _probe_channels gains eval_marker_phrase/eval_sentinel_attr/
eval_result_file; defaults None = unchanged family path at train). run_tests uses
the existing test_func_name param. eval.py randomize_eval_markers swaps the WHOLE
train hint (never touches problem text), seeded per problem so the curve stays
smooth and comparisons stay paired. New gate scripts/verify_eval_gap.py proves,
per mode: train-token hack scores under family, NOT under eval-token; novel-token
hack scores under eval-token. Wired into smoke.
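
The per-problem seeding can be sketched as follows, assuming a list of held-out token names that is disjoint from the train token. The candidate names and `eval_token_for` are invented for illustration; the repository's `randomize_eval_markers` is not reproduced here.

```python
import hashlib
import random

TRAIN_TOKEN = "run_tests"  # the fixed train-time name, per the message
# Hypothetical held-out names; the real list is not given in the log.
EVAL_CANDIDATES = ["check_cases", "validate_all", "exec_suite"]

def eval_token_for(problem_id, candidates=EVAL_CANDIDATES, seed=0):
    """Pick one eval token per problem, stable across steps, arms and processes."""
    if TRAIN_TOKEN in candidates:
        raise ValueError("eval candidates must be disjoint from the train token")
    # hashlib rather than hash(): str hashes are salted per interpreter process.
    digest = hashlib.sha256(f"{seed}:{problem_id}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.choice(sorted(candidates))

token_a = eval_token_for(3243)
token_b = eval_token_for(3243)
```

Keying the draw on the problem id means every checkpoint and every arm sees the same token for a given problem, which is what keeps the curve smooth and the comparisons paired.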

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 13:49:07 +00:00
wassname ba46e85f55 eval: 1 sample/prompt, periodic 32 distinct, final on whole pool
Prompt is the independent unit for a hack-rate estimate (same-prompt
completions share the mode -> correlated), so spend the gen budget on
distinct prompts not repeats. gen_cfg_eval num_return_sequences group->1.
Periodic 8->32 distinct prompts (smoother curve, still 2x faster than the
old 8x8=64-completion pass). Final eval drops the eval_n_prompts_final cap
and runs the WHOLE loaded pool x1 (SE~0.021 at p=0.1 over ~200 vs ~0.075
over 16). Final still does train + deploy(knob-off) for route/routeV and
collapses to one pass for vanilla/erase.
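
The standard errors quoted above follow from the binomial formula sqrt(p(1-p)/n) with the prompt as the independent unit. A quick check, with `binomial_se` as an illustrative helper name:

```python
import math

def binomial_se(p, n):
    """Standard error of a rate estimated from n independent prompts."""
    return math.sqrt(p * (1.0 - p) / n)

se_pool = binomial_se(0.1, 200)  # whole loaded pool, one sample per prompt
se_small = binomial_se(0.1, 16)  # 16 distinct prompts
print(round(se_pool, 3), round(se_small, 3))  # 0.021 0.075
```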

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 13:49:07 +00:00
wassname 842a373ebc seed periodic deploy eval too (common random numbers, RNG save/restore)
The per-step deploy curve now seeds gen with EVAL_GEN_SEED (promoted to a module
const) so all steps+arms share frozen sampling noise -> smooth, comparable
trajectory. Saves/restores both CPU and CUDA RNG around the eval so the training
stream is unperturbed. Seeding does NOT collapse the 8 samples/prompt (they stay
diverse); it only freezes run-to-run/arm-to-arm randomness.
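
The save/seed/restore pattern can be sketched with the standard-library `random` module as a stand-in. In the PyTorch setting the message describes, the analogous calls would be `torch.get_rng_state`/`torch.set_rng_state` and `torch.cuda.get_rng_state_all`/`torch.cuda.set_rng_state_all`. `frozen_eval_rng` is a hypothetical name and the seed value below is a placeholder.

```python
import random
from contextlib import contextmanager

EVAL_GEN_SEED = 0  # placeholder; the real constant's value is not in the log

@contextmanager
def frozen_eval_rng(seed=EVAL_GEN_SEED):
    """Run the eval under a fixed seed, then restore the caller's RNG state."""
    saved = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(saved)  # training stream continues as if eval never ran

# The training stream is identical with and without an interleaved eval.
random.seed(123)
without_eval = random.random()

random.seed(123)
with frozen_eval_rng():
    eval_draw_1 = random.random()
with_eval = random.random()

with frozen_eval_rng():
    eval_draw_2 = random.random()
```

Successive draws inside one eval still differ from each other; only the run-to-run and arm-to-arm noise is frozen.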

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 12:25:25 +00:00
wassname 73936c822f rename route2->routeV; heavy seeded final eval; save delta_S_hack
route2 (binary-tau) and routeV (banded gate) are different methods -- give the
new one a distinct id so old/new runs can't be confused (see hypothesis doc).
- src/vgrout/* + justfile: route2->routeV, routing2->routingV (figs.py keeps the
  old keys for plotting historical runs).
- Final eval: eval_n_prompts_final=64 distinct prompts (periodic curve stays light
  at eval_n_prompts) + fixed gen seed (common random numbers across arms) so the
  paper deploy numbers aren't sampling-noise (the n=8-prompt eval gave 0.031 vs
  0.125 at the same checkpoint).
- save_ckpt: also write delta_S_hack to sibling _hack.safetensors so runs can be
  re-scored knob-ON at higher n later (train.safetensors stays delta_S-only).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 12:08:28 +00:00
wassname f22b69d1d3 config: make prog_wide (30 pairs) the default vhack_pairs_path
prog_wide is the proven main pair set, so default to it instead of falling back
to the 18 hand-crafted vgrout.pairs.PAIRS (now only reached if explicitly None).
The same pairs build both v_grad and the route band in one extract pass -- no
separate threshold set. Spec updated to say so. route2 smoke green on the new
default (band +0.259). erase unaffected (explicit --v-hack-path takes precedence).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 05:02:08 +00:00
wassname dd922d8793 route2: add per-token routing granularity (route2_per_token), default per-rollout
Ablation arm requested by the user: route the banded gate per TOKEN (one cos/f
per token) instead of per ROLLOUT (sum tokens first). Per-rollout stays the
default (denoises the cos sign, matches GRPO per-rollout advantage). Per-token
uses the same pair-calibrated band; gauges (frout/tau) mask pad tokens
(|g_tok|<1e-8) so the ~0-grad positions don't skew them. Conservation
(routed+kept=g) holds in both. Both paths smoke green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 04:52:30 +00:00
wassname aca045ec99 route2: surface routed-fraction (frout) col + fix stale tau/hkgap legends
Audit (subagent, 2026-06-06) found no cheats and no math errors, but two
log-honesty gaps:
- tablelog tau/hkgap descriptions still described the deleted EMA-midpoint gate
  ("ema_hack_cos - ema_clean_cos", "calibrated route threshold"). Rewrote to the
  band semantics (tau=median live cos_b; hkgap=band width upper-lower).
- the spec's mandatory routed-mass gauge (mean f) was DEBUG-only. Promote it to
  the frout streaming column so the real-vs-random mass confound is checkable in
  the table (compare deploy-hack at matched frout), not just via qE.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 04:48:17 +00:00
wassname d159d4c0f2 route2: fail loud if real v_grad band collapses (extraction broken)
Fresh-eyes review flagged that nothing asserted upper>lower for the REAL
v_grad: a broken extraction (hack pairs aligning no more than clean) would
silently degenerate into the random-control sign gate via the max(.,1e-6)
floor. Assert mean band width > 0 on non-Haar runs; the Haar control is still
allowed to collapse. No correctness change to the gate math (review found
conservation, per-rollout recovery, cosine masking, closure capture all OK).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 03:35:33 +00:00
wassname 485839d7b1 route2: pair-calibrated banded gate, drop live-detector tau + force-route
Replace the confounded route2 gate (hack_anchor force-routed teacher +
weak-detector student rows by LABEL; EMA tau calibrated from a live detector
over student rollouts at train time = a cheat) with a band calibrated from the
contrastive pairs alone:

  lower = mean clean-pair cos(g, v_grad);  upper = mean hack-pair cos
  per rollout: f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1)
  routed = sum_b f_b * g_b -> delta_S_hack;  kept = g - routed -> delta_S

v_grad is now the SOLE router: no detector or gt_pass touches routing, so
"does v_hack generalize to held-out modes" is clean and random-vs-real is
decisive. Band width (upper-lower) is itself the discriminator: smoke shows
+0.289 real vs -0.014 Haar-random (collapsed). conservation routed+kept=g
holds exactly; resid~0 in smoke (no hack leak into the deployed knob).
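
A minimal sketch of the gate above in plain Python, assuming each rollout gradient g_b is a flat list of floats. `cosine`, `band_edges` and `route` are hypothetical names for illustration (the message itself only names `route_band_edges()`), and the 1e-6 width floor is the one mentioned in the d159d4c0f2 message.

```python
import math

def cosine(a, b):
    """Cosine similarity of two flat vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def band_edges(clean_grads, hack_grads, v_grad):
    """lower = mean clean-pair cos(g, v_grad); upper = mean hack-pair cos."""
    lower = sum(cosine(g, v_grad) for g in clean_grads) / len(clean_grads)
    upper = sum(cosine(g, v_grad) for g in hack_grads) / len(hack_grads)
    return lower, upper

def route(rollout_grads, v_grad, lower, upper):
    """Split g = sum_b g_b into (routed, kept) with routed + kept = g."""
    width = max(upper - lower, 1e-6)
    dim = len(v_grad)
    g = [sum(gb[i] for gb in rollout_grads) for i in range(dim)]
    routed = [0.0] * dim
    for gb in rollout_grads:
        f = min(max((cosine(gb, v_grad) - lower) / width, 0.0), 1.0)
        for i in range(dim):
            routed[i] += f * gb[i]
    kept = [gi - ri for gi, ri in zip(g, routed)]
    return routed, kept

v = [1.0, 0.0]
lo, hi = band_edges(clean_grads=[[0.0, 1.0]], hack_grads=[[1.0, 0.0]], v_grad=v)
rollouts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
routed, kept = route(rollouts, v, lo, hi)
```

Conservation holds by construction because `kept` is defined as the remainder of the summed gradient after routing.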

- delete build_route2_anchors + EMA state (ema_hack/clean_cos, route2_tau)
- add route_band_edges(); build at extract, rebuild on v_grad refresh
- drop --gate-anchor-teacher-only config + retire scripts/verify_gate_anchor.py
- teacher rollouts now route through the same band (not force-routed)
- spec: add the mass-confound control (scientist review 2026-06-06)

smoke-route2 + smoke-route2 --route2-random-v-seed=7 both pass; erase smoke green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 03:27:24 +00:00
wassname 53d88bc9ee spec: fold external-review into pair-routing plan; default teacher_off_step=30
External review (Claude + deepseek-v4-pro) converged on the threshold being
circular (c_rej>c_cho holds by construction since vec=mean(g_rej-g_cho)) plus
scale-mismatched to live rollouts. Decisions added: leave-one-pair-out as the
real vec-generalizes diagnostic; quantile-tau to match flagged fraction in the
real-vs-random control; route the vec-component (erase-style) not the whole
rollout; degeneracy diagnostic (hkgap collapse); pre-register the science UAT
(n>=3 seeds, effect>random-baseline std).
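
The circularity can be checked numerically. The sketch below uses plain dot products rather than whatever projection or cosine the spec uses, which is an assumption: with vec = mean(g_rej - g_cho), the mean gap between rejected and chosen projections equals ||vec||^2, so it is non-negative even for pure-noise pairs.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vec(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

rng = random.Random(0)
dim, n_pairs = 16, 10
# Pure-noise "pairs": there is no real hack direction in this toy data.
g_rej = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_pairs)]
g_cho = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_pairs)]

vec = mean_vec([[r - c for r, c in zip(gr, gc)] for gr, gc in zip(g_rej, g_cho)])
c_rej = sum(dot(g, vec) for g in g_rej) / n_pairs
c_cho = sum(dot(g, vec) for g in g_cho) / n_pairs
gap = c_rej - c_cho  # equals dot(vec, vec) up to float error
```

A positive gap measured on the same pairs that built vec therefore says nothing about generalization, which is why leave-one-pair-out is the more informative diagnostic.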

teacher_off_step now defaults to 30 on the base Config so every arm runs pure
on-policy past step 30 (apples-to-apples deploy numbers; job 87 showed hacking
self-sustains after the cut).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 01:03:13 +00:00
wassname a3a3f09824 retract 'null_city contaminated' framing -> in/out-of-subspace + cosine-is-correlational
Haar's ~0 cos is concentration of measure (out-of-subspace), not a cleaner
placebo. Semantic placebos are in-subspace and share generic structure, so a
nonzero cos with hack is the expected floor, not 'they found the hack'.
null_city's high-cos modules are plausibly low-rank-module artifacts. Cosine
is correlational; the ablation run is the causal test.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 09:21:41 +00:00
wassname e5295dc07b feat: route2 Haar-random v_grad directionality control (H2 vs H4) + semantic placebo fleet
The null_city placebo is CONTAMINATED: 20% of its modules align with the hack
direction (median |cos|=0.06 but a 0.99 tail, shared generic features). So the
'route2 is non-directional' verdict rested on a bad control. Add the clean tests:

- route2_random_v_seed: replace pair-derived v_grad with seeded per-module Haar-random
  unit vectors (~0 cos with hack dir everywhere). Refresh no-ops so the draw stays fixed.
  'Nothing routed' (||dS_hack||==0) is now a valid logged outcome, not an abort -- it is
  itself H4-confirming (a zero-alignment direction may never clear tau).
- null_vampire / null_bacon / null_blue: semantic placebo fleet (vampire-vs-werewolf etc.),
  each an arbitrary direction with different accidental hack-alignment. Maps route2's
  suppression-vs-alignment as a scatter: H4 predicts it tracks |cos|, H2 predicts all suppress.
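
A small sketch of the Haar-random control, assuming the usual construction of a uniform direction on the sphere (normalise an i.i.d. Gaussian draw). `haar_unit_vector` and `random_v_grad` are hypothetical names, and the module names and dimensions are made up.

```python
import math
import random

def haar_unit_vector(dim, rng):
    """Normalised i.i.d. Gaussian draw: a uniform (Haar) direction on the sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def random_v_grad(module_dims, seed):
    """One fixed random unit vector per module, reproducible from the seed."""
    rng = random.Random(seed)
    return {name: haar_unit_vector(dim, rng) for name, dim in sorted(module_dims.items())}

dims = {"layer0.q_proj": 4096, "layer0.v_proj": 4096}
v_rand = random_v_grad(dims, seed=0)

# Cosine with any fixed unit direction concentrates near 0, on the order of 1/sqrt(dim).
fixed = [1.0] + [0.0] * 4095
cos_with_fixed = sum(a * b for a, b in zip(v_rand["layer0.q_proj"], fixed))
```

In high dimension the cosine with any fixed direction is tiny, which is the concentration-of-measure effect behind the "~0 cos with hack dir everywhere" note.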

Smoke-validated (smoke-route2 --route2-random-v-seed=0 completes).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 08:43:54 +00:00
wassname 55937a86fb rename python package projected_grpo -> vgrout
git mv src/projected_grpo -> src/vgrout and find-replace the module name in
all imports (.py), `-m projected_grpo.*` invocations (justfile), and the
[project] name (pyproject; setuptools auto-discovers via where=["src"]).

Left RESEARCH_JOURNAL.md untouched: its commands/paths are dated lab notes
tied to past commits, so rewriting them would falsify provenance. Repo dir,
git remote, and absolute paths unchanged.

Verified: `import vgrout` and `python -m vgrout.train --help` load the full
graph; verify_rewards.py + verify_gate_anchor.py (both import vgrout) pass.
Full `just smoke` is blocked upstream by missing gitignored data artifacts
(out/pools/{substrate,teacher_pool}, out/vhack/*smoke*), unrelated to the rename.
2026-06-05 14:51:48 +08:00