Commit Graph

338 Commits

Author SHA1 Message Date
wassname 2f91561269 modal/train: VGROUT_ATTN attn-impl override (NOT a fix for the modal hang)
Adds env override VGROUT_ATTN (default flash_attention_2, so local behavior is
unchanged; app.py sets sdpa on Modal). Tested to isolate the Modal generate()
deadlock: it hangs at the first generate under BOTH flash_attention_2 and sdpa,
so the hang is NOT the attention backend -- it's in the generation loop, suspect
the cache-frozen image's transformers-main commit differing from local's working
5.8.0.dev0. Diagnosis + fix path in task #212. Local n=3 runs proceed meanwhile.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 16:42:12 +00:00
wassname 98ceb38815 modal: rename launch entrypoint main->fanout (collides with app.py::main)
launch.py imports `app` from app.py, which registers app.py's @local_entrypoint
`main`; launch.py defining its own `main` raised InvalidError(Duplicate local
entrypoint). So launch.py had never actually run -- the earlier vanilla verify
was via app.py directly. Invoke: modal run modal/launch.py::fanout [--only N].

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 14:09:35 +00:00
wassname 6567f6c60a modal: launch.py -> 15-run v2 keynote set (5 arms x seeds 42/41/43)
Old JOBS fired --intervention=route2 (dead flag after the routeV rename) on the
pre-v2 manifest -- half the containers would have errored on argv parse. Replace
with the n=3 keynote set generated from ARMS x SEEDS: vanilla, routeV real-V
per-rollout, routeV per-token, random-V(157), placebo(vampire). Tag stems match
the local pueue twins so Modal and local cross-replicate. id 1 = canary
(seed-42 vanilla). Fix app.py::smoke route2->routeV and the subprocess modal
binary (not on PATH; resolve next to sys.executable). v2 eval rides in via the
runtime-mounted src/.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 14:07:47 +00:00
wassname a3ac381724 memory: correct pi --mode json gotcha (blocks on stdin, fix is </dev/null)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 13:49:07 +00:00
wassname b8efd42d2f eval: train/test token gap for all 4 modes (lenient disjoint families)
Ariahw's run_tests fn-name memorization control (paper line 84: train fixes
"run_tests", eval randomizes over 10 held-out names), extended to our 3 added
modes (stdout_marker/sentinel/file_marker). At train the side-channel detector
accepts a whole family + a fixed hint token (hack stays RL-discoverable); at
eval we disclose a NOVEL token and grade against a DISJOINT family equally
lenient as train -- so a memorized train token scores 0 and only producing the
disclosed mechanism counts, with no train-lenient/eval-exact strictness shift
(the v1 confound the external panel flagged). run_tests stays exact-both
(matches the paper). Eval-only; training path byte-identical (families default
None). scripts/verify_eval_gap.py proves per mode + per eval token:
disjointness (train detector must NOT fire on eval token), the gap matrix, and
no mutation of canonical prompts. Wired into smoke.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 13:49:07 +00:00
wassname dcd1b18303 eval: train/test token gap for all 4 modes (paper memorization control)
Ariahw randomizes the run_tests function name at eval (line 84) so a high hack
rate reflects responding to the disclosed mechanism, not memorizing a string.
Our 3 added side-channel modes (stdout/sentinel/file) used family detectors with
a fixed hint, so eval fell inside the train family -> no train/test gap. Fix: at
eval, disclose a NOVEL per-mode token and tighten the detector to require exactly
it (rewards.py _probe_channels gains eval_marker_phrase/eval_sentinel_attr/
eval_result_file; defaults None = unchanged family path at train). run_tests uses
the existing test_func_name param. eval.py randomize_eval_markers swaps the WHOLE
train hint (never touches problem text), seeded per problem so the curve stays
smooth and comparisons stay paired. New gate scripts/verify_eval_gap.py proves,
per mode: train-token hack scores under family, NOT under eval-token; novel-token
hack scores under eval-token. Wired into smoke.
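
A minimal sketch of the "seeded per problem" idea above, assuming each problem has a stable string id. The token list, `pick_eval_token`, and `base_seed` are hypothetical names for illustration; the actual `randomize_eval_markers` in eval.py may work differently.

```python
import hashlib
import random

# Hypothetical held-out tokens; the repo's real eval families are not shown in this log.
EVAL_TOKENS = ["verify_alpha", "verify_beta", "verify_gamma", "verify_delta"]


def pick_eval_token(problem_id: str, base_seed: int = 0, tokens=EVAL_TOKENS) -> str:
    """Pick one eval token per problem, stable across runs, arms, and processes."""
    # hashlib gives a process-independent digest; the built-in hash() is salted
    # per interpreter process, so it would break pairing between runs.
    digest = hashlib.sha256(f"{base_seed}:{problem_id}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return rng.choice(tokens)


# The same problem always maps to the same token, so two arms evaluated on the
# same pool see identical prompts and their curves stay paired.
first = [pick_eval_token(f"prob-{i}") for i in range(5)]
second = [pick_eval_token(f"prob-{i}") for i in range(5)]
```

Deriving the seed from the problem id, rather than from evaluation order, keeps the assignment fixed even if the pool is subsampled or reordered.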

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 13:49:07 +00:00
wassname ba46e85f55 eval: 1 sample/prompt, periodic 32 distinct, final on whole pool
Prompt is the independent unit for a hack-rate estimate (same-prompt
completions share the mode -> correlated), so spend the gen budget on
distinct prompts not repeats. gen_cfg_eval num_return_sequences group->1.
Periodic 8->32 distinct prompts (smoother curve, still 2x faster than the
old 8x8=64-completion pass). Final eval drops the eval_n_prompts_final cap
and runs the WHOLE loaded pool x1 (SE~0.021 at p=0.1 over ~200 vs ~0.075
over 16). Final still does train + deploy(knob-off) for route/routeV and
collapses to one pass for vanilla/erase.
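
The standard-error figures quoted above follow from the binomial formula sqrt(p(1-p)/n), assuming independent prompts. A quick check of the arithmetic (the helper name `binomial_se` is illustrative, not from the repo):

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of a proportion estimated from n independent draws."""
    return math.sqrt(p * (1.0 - p) / n)


se_pool = binomial_se(0.1, 200)   # whole pool, ~200 distinct prompts
se_small = binomial_se(0.1, 16)   # the old 16-prompt final eval
print(f"{se_pool:.3f} {se_small:.3f}")  # prints "0.021 0.075"
```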

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 13:49:07 +00:00
wassname 70aa6aa96b modal: parallel GRPO sweep port (image, volume, fan-out launcher)
Fire the paper sweep as independent H100/A100-80 containers instead of
serial pueue runs. One Volume caches model + svd + out/; train.py runs
unmodified (torch 2.7 + Dao flash-attn wheel, code mounted at runtime).
Verified: vanilla 60-step reproduces the local baseline. Skill at
~/.claude/skills/modal documents the patterns.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 20:30:19 +08:00
wassname bcf09dd742 docs
2026-06-06 12:27:26 +00:00
wassname 842a373ebc seed periodic deploy eval too (common random numbers, RNG save/restore)
The per-step deploy curve now seeds gen with EVAL_GEN_SEED (promoted to a module
const) so all steps+arms share frozen sampling noise -> smooth, comparable
trajectory. Saves/restores both CPU and CUDA RNG around the eval so the training
stream is unperturbed. Seeding does NOT collapse the 8 samples/prompt (they stay
diverse); it only freezes run-to-run/arm-to-arm randomness.
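
A sketch of the save/seed/restore pattern described above, using the standard-library `random` module as a stand-in so it runs without a GPU. In PyTorch the analogous calls would be `torch.get_rng_state` / `torch.set_rng_state` and `torch.cuda.get_rng_state_all` / `torch.cuda.set_rng_state_all`. The seed value and the helper names are placeholders, not the repo's.

```python
import random
from contextlib import contextmanager

EVAL_GEN_SEED = 1234  # placeholder value; the commit names the constant, not its value


@contextmanager
def frozen_eval_rng(seed: int = EVAL_GEN_SEED):
    """Seed the RNG for the eval block, then restore the training stream."""
    saved = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(saved)  # training continues as if eval never ran


def eval_draws(n: int = 3):
    with frozen_eval_rng():
        return [random.random() for _ in range(n)]


random.seed(0)
eval_a = eval_draws()
train_draw = random.random()      # first training draw, taken after an eval
eval_b = eval_draws()

random.seed(0)
train_draw_no_eval = random.random()  # same draw with no eval in between
```

Here `eval_a == eval_b` (every eval sees the same noise) and `train_draw == train_draw_no_eval` (the eval did not perturb training). Samples within one eval still differ from each other, which matches the note that seeding does not collapse per-prompt diversity.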

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 12:25:25 +00:00
wassname 73936c822f rename route2->routeV; heavy seeded final eval; save delta_S_hack
route2 (binary-tau) and routeV (banded gate) are different methods -- give the
new one a distinct id so old/new runs can't be confused (see hypothesis doc).
- src/vgrout/* + justfile: route2->routeV, routing2->routingV (figs.py keeps the
  old keys for plotting historical runs).
- Final eval: eval_n_prompts_final=64 distinct prompts (periodic curve stays light
  at eval_n_prompts) + fixed gen seed (common random numbers across arms) so the
  paper deploy numbers aren't sampling-noise (the n=8-prompt eval gave 0.031 vs
  0.125 at the same checkpoint).
- save_ckpt: also write delta_S_hack to sibling _hack.safetensors so runs can be
  re-scored knob-ON at higher n later (train.safetensors stays delta_S-only).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 12:08:28 +00:00
wassname 9c76584970 track pairsets in git (hand-authored supervision source)
The pairset JSONs are the only non-regenerable input to the method (the
v_hack bases are derived from them via on-demand extraction, train.py:528).
They were caught by the blanket /out/ ignore; switch to /out/* + re-include
so any box (and Modal) gets the source from a clone instead of a side-channel
rsync. vhack safetensors stay ignored (383M of derived binaries).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 08:11:01 +00:00
wassname 4b9545c59a spec: route2b is the method, drop erase; workshop = 1 method + vanilla baseline + random-V ablation
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 05:20:00 +00:00
wassname 69f8bc208d justfile: erase recipes use the prog_wide default (drop pinned --v-hack-path)
fast-projected / full no longer pin v_hack_full.safetensors; erase now extracts
from the prog_wide default (auto-resolves v_hack_pairset_prog_wide), the same
pair set route2 uses -> apples-to-apples arms. Smoke recipes keep their
tiny-model v_hack pins (the tiny model needs its own basis).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 05:10:29 +00:00
wassname f22b69d1d3 config: make prog_wide (30 pairs) the default vhack_pairs_path
prog_wide is the proven main pair set, so default to it instead of falling back
to the 18 hand-crafted vgrout.pairs.PAIRS (now only reached if explicitly None).
The same pairs build both v_grad and the route band in one extract pass -- no
separate threshold set. Spec updated to say so. route2 smoke green on the new
default (band +0.259). erase unaffected (explicit --v-hack-path takes precedence).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 05:02:08 +00:00
wassname dd922d8793 route2: add per-token routing granularity (route2_per_token), default per-rollout
Ablation arm requested by the user: route the banded gate per TOKEN (one cos/f
per token) instead of per ROLLOUT (sum tokens first). Per-rollout stays the
default (denoises the cos sign, matches GRPO per-rollout advantage). Per-token
uses the same pair-calibrated band; gauges (frout/tau) mask pad tokens
(|g_tok|<1e-8) so the ~0-grad positions don't skew them. Conservation
(routed+kept=g) holds in both. Both paths smoke green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 04:52:30 +00:00
wassname aca045ec99 route2: surface routed-fraction (frout) col + fix stale tau/hkgap legends
Audit (subagent, 2026-06-06) found no cheats and no math errors, but two
log-honesty gaps:
- tablelog tau/hkgap descriptions still described the deleted EMA-midpoint gate
  ("ema_hack_cos - ema_clean_cos", "calibrated route threshold"). Rewrote to the
  band semantics (tau=median live cos_b; hkgap=band width upper-lower).
- the spec's mandatory routed-mass gauge (mean f) was DEBUG-only. Promote it to
  the frout streaming column so the real-vs-random mass confound is checkable in
  the table (compare deploy-hack at matched frout), not just via qE.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 04:48:17 +00:00
wassname d159d4c0f2 route2: fail loud if real v_grad band collapses (extraction broken)
Fresh-eyes review flagged that nothing asserted upper>lower for the REAL
v_grad: a broken extraction (hack pairs aligning no more than clean) would
silently degenerate into the random-control sign gate via the max(.,1e-6)
floor. Assert mean band width > 0 on non-Haar runs; the Haar control is still
allowed to collapse. No correctness change to the gate math (review found
conservation, per-rollout recovery, cosine masking, closure capture all OK).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 03:35:33 +00:00
wassname 485839d7b1 route2: pair-calibrated banded gate, drop live-detector tau + force-route
Replace the confounded route2 gate (hack_anchor force-routed teacher +
weak-detector student rows by LABEL; EMA tau calibrated from a live detector
over student rollouts at train time = a cheat) with a band calibrated from the
contrastive pairs alone:

  lower = mean clean-pair cos(g, v_grad);  upper = mean hack-pair cos
  per rollout: f = clamp((cos(g_b, v_grad) - lower)/(upper - lower), 0, 1)
  routed = sum_b f_b * g_b -> delta_S_hack;  kept = g - routed -> delta_S

v_grad is now the SOLE router: no detector or gt_pass touches routing, so
"does v_hack generalize to held-out modes" is clean and random-vs-real is
decisive. Band width (upper-lower) is itself the discriminator: smoke shows
+0.289 real vs -0.014 Haar-random (collapsed). conservation routed+kept=g
holds exactly; resid~0 in smoke (no hack leak into the deployed knob).
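
The three formulas above can be sketched in plain Python as follows. This is an illustration under assumptions, not the code in train.py: gradients are flat lists, the signature of `route_band_edges` is guessed from its name, and `route` is a hypothetical helper. The `1e-6` floor mirrors the `max(.,1e-6)` mentioned elsewhere in this log.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def route_band_edges(clean_grads, hack_grads, v_grad):
    """lower = mean clean-pair cos, upper = mean hack-pair cos (assumed signature)."""
    lower = sum(cosine(g, v_grad) for g in clean_grads) / len(clean_grads)
    upper = sum(cosine(g, v_grad) for g in hack_grads) / len(hack_grads)
    return lower, upper


def route(rollout_grads, v_grad, lower, upper, floor=1e-6):
    """Split the summed gradient into routed (delta_S_hack) and kept (delta_S)."""
    width = max(upper - lower, floor)
    dim = len(v_grad)
    routed, total, fracs = [0.0] * dim, [0.0] * dim, []
    for g_b in rollout_grads:
        # Per-rollout ramp: 0 below the band, 1 above it, linear in between.
        f = min(max((cosine(g_b, v_grad) - lower) / width, 0.0), 1.0)
        fracs.append(f)
        for i in range(dim):
            routed[i] += f * g_b[i]
            total[i] += g_b[i]
    kept = [t - r for t, r in zip(total, routed)]
    return routed, kept, fracs


v_grad = [1.0, 0.0]
lower, upper = route_band_edges(
    clean_grads=[[0.0, 1.0], [0.1, 1.0]],
    hack_grads=[[1.0, 0.0], [1.0, 0.1]],
    v_grad=v_grad,
)
rollouts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
routed, kept, fracs = route(rollouts, v_grad, lower, upper)
```

In this toy case the first rollout is fully routed, the second fully kept, and the third split proportionally. Because `kept` is defined as the total minus `routed`, conservation holds by construction.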

- delete build_route2_anchors + EMA state (ema_hack/clean_cos, route2_tau)
- add route_band_edges(); build at extract, rebuild on v_grad refresh
- drop --gate-anchor-teacher-only config + retire scripts/verify_gate_anchor.py
- teacher rollouts now route through the same band (not force-routed)
- spec: add the mass-confound control (scientist review 2026-06-06)

smoke-route2 + smoke-route2 --route2-random-v-seed=7 both pass; erase smoke green.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 03:27:24 +00:00
wassname d131323a8d spec: full rewrite as self-contained handoff (main.tex jargon, complete pseudocode)
Realigned to main.tex terminology (vGROUT; (hack,clean) pairs; delta_S/
delta_S_hack; arms erase + route). Dropped session jargon (vec/cho/rej/route2/
band-as-jargon). Added: env + the four loophole hacks (run_tests/sentinel/
stdout_marker/file_marker from Ariahw); short adapter pseudocode; extract
v_hack + band-edge pseudocode; complete pseudocode for both arms (erase
component-subtract aggregate w/ linearity note; route per-rollout banded gate);
no-cheat (vector-framed, -> AGENTS.md); label-free diagnostics; impl plan;
run plan (erase real-vs-random first, route later); queue disposition; teacher
facts + no-teacher emergence timing.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 03:05:08 +00:00
wassname 83cae4ef72 docs: reframe no-cheat in VECTOR terms; move it README->AGENTS.md
The 'weak detector for hack A, generalize to B' framing was wrong for this repo.
That is the weak-LABEL setup (labelA -> labelNotA), which is NOT ours. Ours is
vec -> routing: vec extracted from hand-built synthetic pairs, route the live
GRPO gradient by cosine alignment to vec; no detector ever runs over student
rollouts at train time. Generalization = does vec (from pairs covering some
modes) suppress held-out modes -- vector generalization, not detector-label.

- AGENTS.md: rewrote the no-cheat bullet to the 3-way distinction (oracle grader
  = cheat; weak-label setup = not ours; vec->routing = ours). For coding agents.
- README: removed the 'We cannot cheat' section (belongs in agent instructions,
  not the new-reader overview).
- spec: dropped the stray 'validation uses known-A detector' line; pointed the
  no-cheat reference at AGENTS.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 02:39:48 +00:00
wassname a83953131e spec: drop live-detector validation; per-rollout granularity (paper-backed) + cheap label-free diagnostics
Validation removed: running the weak detector over student rollouts at train
time is the no-cheat violation, and a live validation is complex/non-causal.
Causal proof stays downstream (deploy perf + real-vs-random). Train-time only
LOGs label-free gauges: hkgap=upper-lower, leave-one-pair-out separation (the
'does the threshold generalize to a second pair' test), live cos_b percentiles
vs [lower,upper] (calibration read with no labels), route_frac mass at 0/1,
resid=cos(g_keep,vec).

Granularity decided = per-rollout: train.py already sums per-token gate grads
to [G,r] and recovers g_b=cg/dS per rollout; band just swaps the cos_b>tau line
for the ramp. Backed by the papers: Gradient Routing (Cloud 2024) masks
per-token for LLMs / per-episode for RL; SGTM (2025) per-example, label-noise-
robust. Both route by a DATA-LABEL mask; we route by gradient ALIGNMENT to an
extracted direction -- that's the novelty. Borrow their 'absorption' as the
mechanism justifying A->B generalization.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 02:23:58 +00:00
wassname 180d3e862c spec: banded cosine gate (lower/upper from pair clean/hack cosines) + live-A calibration validation
Replaces the single-midpoint tau with a two-edge band: x=cos(g_step,vec),
route_frac=clamp((x-lower)/(upper-lower),0,1). lower=mean cos(g_cho,vec),
upper=mean cos(g_rej,vec). Below lower keep, above upper route, between =
absorption (proportional split). Band WIDTH (hkgap=upper-lower) is the
real-vs-random discriminator (random vec closes the band) so no separate
matched-fraction control is needed; collapse flags vec degeneracy.

Calibration validation on live KNOWN-mode-A rollouts (detector-labelled, no-cheat):
mean route_frac(A_hack) >> mean route_frac(A_clean) confirms the pair-set band
transfers to the sampled live distribution. Also picks g_step granularity
(per-rollout default vs per-step). Held-out B never in validation.

Corrects the earlier wrong claim that component-routing collapses to erase
(pseudocode 03 route v1 forward uses dS+dS_hack -> divergent trajectory).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 02:16:38 +00:00
wassname 53d88bc9ee spec: fold external-review into pair-routing plan; default teacher_off_step=30
External review (Claude + deepseek-v4-pro) converged on the threshold being
circular (c_rej>c_cho holds by construction since vec=mean(g_rej-g_cho)) plus
scale-mismatched to live rollouts. Decisions added: leave-one-pair-out as the
real vec-generalizes diagnostic; quantile-tau to match flagged fraction in the
real-vs-random control; route the vec-component (erase-style) not the whole
rollout; degeneracy diagnostic (hkgap collapse); pre-register the science UAT
(n>=3 seeds, effect>random-baseline std).
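
One way to picture the leave-one-pair-out diagnostic: build vec from all pairs except one, then check whether the held-out pair still separates. The sketch below assumes gradients are flat lists and that g_rej is the hack side and g_cho the clean side; the function name and the toy data are illustrative only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def leave_one_pair_out(pairs):
    """For each (g_rej, g_cho) pair, score it against vec built from the others."""
    scores = []
    for i, (g_rej, g_cho) in enumerate(pairs):
        rest = pairs[:i] + pairs[i + 1:]
        dim = len(g_rej)
        # vec = mean(g_rej - g_cho) over the remaining pairs only.
        vec = [
            sum(r[k] - c[k] for r, c in rest) / len(rest)
            for k in range(dim)
        ]
        # Positive score: the held-out hack side aligns with vec more than
        # the held-out clean side, without that pair having shaped vec.
        scores.append(cosine(g_rej, vec) - cosine(g_cho, vec))
    return scores


# Toy pairs sharing a common hack direction along the first axis.
pairs = [
    ([1.0, 0.2], [0.0, 0.2]),
    ([0.9, -0.1], [0.1, -0.1]),
    ([1.1, 0.0], [-0.1, 0.1]),
]
scores = leave_one_pair_out(pairs)
```

Unlike the in-sample comparison, a held-out pair can score negative, so this check can actually fail when vec does not generalize.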

teacher_off_step now defaults to 30 on the base Config so every arm runs pure
on-policy past step 30 (apples-to-apples deploy numbers; job 87 showed hacking
self-sustains after the cut).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 01:03:13 +00:00
wassname dfdc538428 spec: pair-routing impl plan + resume-after-compaction state
Adds actionable train.py targets (delete build_route2_anchors, rewrite
_route2_grad_filter to pure cos>tau gate, pair-calibrated tau refreshed every N,
teacher_off_step=30), current state (queue PAUSED, on main, rollback tag), queued-job
disposition (superseded vs keep), and smoke/UAT. Self-contained handoff for post-compact.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 00:10:23 +00:00
wassname 68b0624733 backup: pueue job manifest (94 jobs, id/status/label/argv) at routing-refactor
Local log backup in out/pueue_logs_backup/20260606T000138/ (status.json + full log
+ task_logs) is gitignored/box-local; this manifest is the durable why-label copy.
Killed confounded full-teacher route2 directionality jobs 118/119/121/122/123.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 00:01:58 +00:00
wassname 0fa250b193 handoff: pre-routing-refactor snapshot + diagnosis
route2 directionality exposed the vector is not load-bearing: hack_anchor
force-routes teacher+detector by label (bypassing v_grad), tau calibrated from a
live detector, so random==real because labels carried it. Redesign: teacher-off@30,
drop force-route, calibrate tau from the A-pairs (no live detector), maybe use the
pairset directly vs a rank-1 vector. Decisive test = A5 real(126) vs random(135).
Queue snapshot + design notes in docs/REFACTOR_HANDOFF.md.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 23:58:35 +00:00
wassname f82a4f034d paper: interim directionality fig (app:directionality) + confound TODO
route2 deploy hack collapses for ANY v_grad (real/placebo/Haar) but solve tracks
direction (real>placebo>Haar). TODO names the load-bearing confound: full-teacher
runs force-route all teacher rows by label (hack_anchor), so the hack-axis collapse
is direction-free force-routing not the cosine gate; clean test = A5 run_tests-only
regime (pending). n=1 interim.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 23:40:02 +00:00
wassname 329066e99b paper: teacher-off control appendix (app:teacher) -- teacher seeds hacking but does not sustain it
Vanilla deploy-hack keeps climbing after teacher cut at step 40 (0.36->0.58,
job 87), at/above teacher-on (job 97). Closest-match jobs differ in LR; FIXME
to swap in lr-matched job 124 (queued low-prio). CSV is the committed data
artifact; fig regen by plot_teacher_ablation.py.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 12:30:49 +00:00
wassname ac418a54ce journal: #186 teacher-off vanilla hacking self-sustaining (job 87, 0.36->0.58 on-policy)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 12:07:41 +00:00
wassname 6dd6b74e73 afk: lite hourly check (one cron at :23, no deep dive unless broken)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 10:35:58 +00:00
wassname 7eac7750dc afk: add docs/AFK_CHECK.md (scopes hourly check to directionality mystery)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 09:46:38 +00:00
wassname d2b0fcb255 afk: scope hourly check to directionality mystery (docs/AFK_CHECK.md); drop routine no-finding journal entry (h)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 09:46:24 +00:00
wassname 6f60ebafa1 journal (h): AFK check -- no-cheat E-by-mode table re-confirmed on job 95; directionality framing corrected
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 09:35:27 +00:00
wassname a3a3f09824 retract 'null_city contaminated' framing -> in/out-of-subspace + cosine-is-correlational
Haar's ~0 cos is concentration of measure (out-of-subspace), not a cleaner
placebo. Semantic placebos are in-subspace and share generic structure, so a
nonzero cos with hack is the expected floor, not 'they found the hack'.
null_city's high-cos modules are plausibly low-rank-module artifacts. Cosine
is correlational; the ablation run is the causal test.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 09:21:41 +00:00
wassname e5295dc07b feat: route2 Haar-random v_grad directionality control (H2 vs H4) + semantic placebo fleet
The null_city placebo is CONTAMINATED: 20% of its modules align with the hack
direction (median |cos|=0.06 but a 0.99 tail, shared generic features). So the
'route2 is non-directional' verdict rested on a bad control. Add the clean tests:

- route2_random_v_seed: replace pair-derived v_grad with seeded per-module Haar-random
  unit vectors (~0 cos with hack dir everywhere). Refresh no-ops so the draw stays fixed.
  'Nothing routed' (||dS_hack||==0) is now a valid logged outcome, not an abort -- it is
  itself H4-confirming (a zero-alignment direction may never clear tau).
- null_vampire / null_bacon / null_blue: semantic placebo fleet (vampire-vs-werewolf etc.),
  each an arbitrary direction with different accidental hack-alignment. Maps route2's
  suppression-vs-alignment as a scatter: H4 predicts it tracks |cos|, H2 predicts all suppress.
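
A small sketch of why a Haar-random unit vector has near-zero cosine with any fixed direction in high dimension. It assumes a flat vector per module and uses a normalized Gaussian draw, which is uniform on the sphere; the dimension, the seed, and the helper name are placeholders rather than the repo's values.

```python
import math
import random


def haar_unit_vector(dim: int, rng: random.Random):
    """Uniform draw on the unit sphere: normalize an isotropic Gaussian."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


rng = random.Random(0)  # fixed seed, so the control direction stays fixed
dim = 4096              # placeholder module width
reference = [1.0] + [0.0] * (dim - 1)  # stand-in for the hack direction

cosines = []
for _ in range(50):
    u = haar_unit_vector(dim, rng)
    # Both vectors are unit norm, so the dot product is the cosine.
    cosines.append(sum(a * b for a, b in zip(u, reference)))

max_abs_cos = max(abs(c) for c in cosines)
```

The cosine of a random unit vector with a fixed direction has standard deviation about 1/sqrt(dim), roughly 0.016 here, so every draw lands close to zero.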

Smoke-validated (smoke-route2 --route2-random-v-seed=0 completes).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 08:43:54 +00:00
wassname ec00bc4383 docs: A5 leak is double-hacks (not detector FP); placebo non-directionality measured via hkgap
Two review questions today exposed imprecise framing in load-bearing comments:

- A5 held-out <=1.1% hacked_E is the model double-hacking (one run_tests()-shaped
  completion that also writes the stdout marker, verified job-95 id 132), not a
  detector false positive. hacked_E is the mode-agnostic run_tests signature.
  Grading channels are non-overlapping; the model's strategy is not.
- Placebo 'non-directional' is now the hkgap measurement: real-v route2 hkgap
  0.6-0.8 (separates hack/clean), placebo ~0 (dead), both deploy hack 0.000.
  Confirms the degenerate-gate read (H2) over clever-random-direction (H1):
  suppression is quarantine-volume + exploration floor, not v_hack specificity.
  Direction only shows in solve (real 0.625 > placebo 0.531).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 08:23:49 +00:00
wassname 8249a9691e fix: ship smoke fixtures so the gate runs on a fresh clone
The smoke prereqs (out/pools/substrate, out/pools/teacher_pool,
out/vhack/v_hack_smoke) are gitignored pipeline outputs that only
exist on the GPU box -- a fresh clone died at verify_partition.py on a
FileNotFoundError for partition.json. Building them from scratch needs
a real Qwen3-4B GRPO rollout (pregen-teacher), so they can't be cheaply
regenerated CPU-side. Force-add them (~2.2MB) the same way the paper
figs under out/ are already tracked, so 'just smoke' is the portable
correctness gate it's meant to be.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 07:13:33 +00:00
wassname 55937a86fb rename python package projected_grpo -> vgrout
git mv src/projected_grpo -> src/vgrout and find-replace the module name in
all imports (.py), `-m projected_grpo.*` invocations (justfile), and the
[project] name (pyproject; setuptools auto-discovers via where=["src"]).

Left RESEARCH_JOURNAL.md untouched: its commands/paths are dated lab notes
tied to past commits, so rewriting them would falsify provenance. Repo dir,
git remote, and absolute paths unchanged.

Verified: `import vgrout` and `python -m vgrout.train --help` load the full
graph; verify_rewards.py + verify_gate_anchor.py (both import vgrout) pass.
Full `just smoke` is blocked upstream by missing gitignored data artifacts
(out/pools/{substrate,teacher_pool}, out/vhack/*smoke*), unrelated to the rename.
2026-06-05 14:51:48 +08:00
wassname 03693e4f30 name the method vGROUT (vector gradient routing)
- title: drop the "Quarantine ... Representation?" metaphor for
  "vGROUT: Vector Gradient Routing against Reward Hacking"
- Method: add a two-phase definition (make v_hack; then erase=discard the
  component / route=redirect the gated gradient into a deletable adapter,
  deleted at deploy). Honest framing: route preserves (not discards); follows
  Shilov et al.'s post-backward deletable-block routing in the gradient-routing
  family, gated by an extracted direction not a per-example data label
- strip literal "SGTM" from the body (confusing acronym); cite renders as
  author-year. README + pyproject describe vGROUT (package name unchanged)
2026-06-05 14:51:48 +08:00
wassname 07e1eb8753 paper: fix build, vector figs, +2 plots, de-jargon prose
- drop fontawesome5 (tectonic core-dumped on the OTF); the lone \faGithub
  icon was decorative
- switch the two included figures PNG->PDF (vector; now-tracked, smaller)
- add fig:generalisation (A5 dumbbell) next to tab:generalisation and
  fig:traindeploy (train-on vs deploy-off) in C1, both \ref'd
- rename leaked config codenames in appendix tables (v_hack_full ->
  "weak (10 pairs)", null_city -> "random (placebo)") with paper:code
  mapping comments
- de-jargon reader-facing prose per a 3-model external panel
  (kimi-k2.5 / gemini-3.1-pro / gpt-5.5): knob -> (auxiliary) adapter,
  quarantine -> isolate, no-cheat payload -> zero-label test, hack-ward ->
  hack-aligned, cousin/near-twin -> analogue, etc. Title metaphor left as-is.

14 pages, zero unresolved refs.
2026-06-05 14:51:48 +08:00
wassname 04562c5226 doc: fix stale tab:ablation provenance — random-V is job 106 not 87
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 05:59:28 +00:00
wassname 08ed96292f fig: point keynote includegraphics at tracked out/figs PNG (drop gitignored symlink)
docs/ is gitignored, so docs/writeup/figs/*.png symlinks are untracked -- a
fresh clone would have no figs/ dir and the build would break. The PNG itself
(out/figs/dyn_sub4_hack_overlay.png) IS tracked; point at it directly, matching
the sibling fig at L411. Build verified: 11 pages, no unresolved refs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 05:20:55 +00:00
wassname 3ae1e8376d journal: close (a) WATCH — placebo endpoint refutes route directionality (job 86)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 05:01:18 +00:00
wassname 273c9ae4aa Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine
# Conflicts:
#	.claude/memory/MEMORY.md
2026-06-05 04:52:47 +00:00
wassname 562832acec test: no-cheat partition + teacher-pool composition gate (verify_partition.py)
The other half of the no-cheat family (sibling of the gate-anchor leak). Asserts
on the real out/pools/substrate/partition.json: (1) partition is a clean function
into the 4 distinct substrate modes, each populated; (2) under teacher_modes={run_tests}
the kept teacher pool is ALL known-mode -- held-out modes get ZERO demos and are
genuinely held out (>0 problems). Vibe-check, not a theorem; wired into just smoke.
6/6 pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 04:36:03 +00:00
wassname 5242f66b7e figs: a5 dedup title->axis arrow + CSV, overlay onset dot->labeled vline
- a5: drop per-panel title (restated the axis); fold direction into the xlabel
  (DEPLOY hack rate (down=better) / solve (up=better)). Dump a5_generalisation.csv
  (per mode,arm deploy hack/solve mean+/-std) -- the reproducibility source it lacked.
- overlay (dyn_sub4_hack_overlay etc): replace the per-arm onset DOT with a single
  dashed labeled 'first hack' vertical line, matching the small-multiples/longrun.
- (dyn_sub4_hack_overlay shares dyn_sub4.csv -- same runs, different view, no new CSV.)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 04:13:37 +00:00
wassname 8daf58d25e figs: a5 vanilla->route arrows, equiv0->approx0, skip degenerate train_deploy, prune orphans
- a5_generalisation: connectors -> arrows (baseline->ours direction, shows the drop
  and the stdout solve-cost honestly).
- equiv0 -> approx0 everywhere: these are finite-sample estimates, not identically 0.
- plot_train_vs_deploy skips when train==deploy for every run (no knob-ON contrast);
  fixes the 'can't see train' longrun/sub4 figures (they had no hk_on data).
- Prune 9 orphan figure sets not referenced in paper or blog (regenerable on demand);
  keep the 3 referenced + a5 + train_vs_deploy_60_train_deploy. All 4 CSVs committed.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 04:08:58 +00:00
wassname f0cbbacaf0 save per-eval deploy-adapter ckpts (rescore w/o retrain) + CLAUDE.md test lesson
save_eval_ckpts (default on): write the deploy adapter (δS only, ~2.3MB) at each
deploy-eval step, step-tagged, so a run can be re-scored later (more prompts /
different eval) without retraining. The A5 run saved only final+first_hack, which
is why the leak needed a full retrain rather than a rescore.

AGENTS.md: every load-bearing invariant gets a verify_*.py gate. The no-cheat leak
shipped because the green gates never covered the property -- 'tests passed' is
meaningless if the property was never tested.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:58:26 +00:00
wassname 7b08a7ede9 journal: A5 gate leak fixed (teacher-only anchor) + airtight rerun queued (job 111)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 03:54:09 +00:00