Commit Graph

55 Commits

Author SHA1 Message Date
wassname 2570dfaa67 Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine 2026-06-02 07:21:49 +00:00
wassname cf3ecc40f8 write up 2026-06-02 07:20:42 +00:00
wassname 923de6dbe6 docs(writeup): NeurIPS-workshop paper skeleton + tectonic compile recipe
Minimal LaTeX skeleton: outline + evidence tables (route2 n=3 deploy numbers
filled with provenance, vanilla pending jobs 74/84) + figures + verified refs
+ appendix (4-mode traces, 6/6/6/6 partition counts, pseudocode). Build
artifacts and figs symlinks gitignored. `just paper` compiles via tectonic;
`just paper-qc` dumps text + greps for unresolved refs / TODOs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-02 06:59:15 +00:00
wassname 3e7b8ecfc0 feat: just dyn = auto-plot newest full-length log per arm
--latest-per-arm + --min-steps select the freshest >=N-step log for each
arm from logs/, no hand-globbing. Harden parse_log against historical logs:
require '| INFO |' in the header line, drop pure-symbol header tokens.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 09:03:37 +00:00
wassname dc5d4516c2 smoke: run on GPU (bf16 + flash_attn2), not CPU+fp32
The CPU smoke ran fp32 + sdpa, so it never walked the bf16/flash_attn2 path the
real run uses -- a whole dtype/magnitude bug class was invisible to the gate (per
the smoke principle: a path that doesn't fire in smoke isn't covered). The tiny-
random model peaks ~1.4GB on GPU, so cost is negligible. Drop CUDA_VISIBLE_DEVICES=
from every smoke recipe; train.py auto-detects cuda -> bf16. (Stale fp32 smoke
v_hack must be re-extracted bf16; auto-extracts on cache-miss.)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 02:56:34 +00:00
wassname 8158adb543 refactor: route2 quarantine = scale-matched delta_S_hack, rip out 33M LoRA
The distinct-basis A_q/B_q LoRA (~33M params at rank-16) gave the quarantine a
~100x capacity edge over delta_S, so routing-everything-there was the low-
resistance path: qE pinned ~0.97 (energy into the thrown-away knob) while the
deployed delta_S learned nothing (job 54). The cause was capacity imbalance, not
the routing gate (calibrated-tau already separated hack/clean, hkgap>0).

Consolidate to one adapter type: the quarantine is now delta_S_hack, the second
diagonal in the same frozen SVD basis, shape [r], capacity-matched to delta_S,
zeroed at deploy. route2's calibrated-tau gate parks the flagged rollouts' grad
into delta_S_hack.grad (like proj.py's route parks its subspace projection);
delta_S keeps the unflagged. Both diagonals train at one shared lr.

Removed: A_q/B_q params, v_act + extract_v_act, the act-mask arm (a shared
diagonal can't be per-token gated), route2_mask / route2_quarantine_rank /
route2_quar_lr_scale knobs, the separate quar optimizer group. Arm name
routing2_{act,grad} -> routing2. v_grad refresh extracts from delta_S (main)
with the quarantine ablated.

SGTM check: their gradient routing uses a hard detach on capacity-matched
reserved dims, no soft/tanh/sigmoid gate -- balance is the fix, not gating.

Smoked clean: tau/hkgap/qE render, ||delta_S_hack||>0 assert passes, exit 0.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 02:52:02 +00:00
wassname 11bcdd2fe6 route2 instrumentation + lr fix + deploy overlay (route2-act divergence)
route2-act diverged (run 43): 33M kaiming A_q/B_q at delta_S's lr=3e-3 blew up
(gn 0.3->7.5 step 8, generations -> token salad, lp_t -11). Fixes:
- #167 separate quarantine lr (route2_quar_lr_scale=0.1) so the 60x-bigger fresh
  LoRA isn't trained at the main-knob lr.
- #168 divergence tripwire on teacher ppl (lp_t high-water mark; abort if it
  drops >5 nats for 2 steps). Relative so tiny-random smoke (flat lp_t~-11.9)
  doesn't false-trip.
- #165 act-path was silent: stash cos(a,v_act) + fired-fraction in the forward,
  surface as act_cos/act_fire columns (route2-act). smoke shows act_fire=0.64 =>
  the cos>0 sign test over-routes (fires on most tokens, not just hack ones).
- #166 print last train generation before FINAL EVAL (coherence eyeball).
- route2 v_act/v_grad refresh was firing but silent -- now announced.
- #162 plot_deploy_overlay.py: per-mode DEPLOY overlay from per_mode_deploy.json
  (honest shipped-model numbers, route2-safe). just plot-deploy.
- just plot/results hardened: parse by header name, skip non-substrate logs,
  non-fatal aggregate delegation.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 23:16:39 +00:00
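Note on 11bcdd2fe6 (#168): the relative tripwire described above can be sketched roughly as follows. This is a minimal illustration, not the code in train.py; the class name `DivergenceTripwire`, the helper `first_trip`, and the reading of "for 2 steps" as two consecutive steps are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class DivergenceTripwire:
    """Hypothetical sketch: track the best lp_t so far, trip on sustained drops below it."""

    max_drop_nats: float = 5.0          # threshold quoted in the commit message
    patience: int = 2                   # assumed: consecutive bad steps before aborting
    high_water: float = float("-inf")   # best (highest) lp_t seen so far
    bad_steps: int = 0

    def update(self, lp_t: float) -> bool:
        """Feed one step's teacher log-prob; return True when the run should abort."""
        self.high_water = max(self.high_water, lp_t)
        if self.high_water - lp_t > self.max_drop_nats:
            self.bad_steps += 1
        else:
            self.bad_steps = 0  # a recovered step resets the counter
        return self.bad_steps >= self.patience


def first_trip(series):
    """Return the index of the first step that trips the wire, or None."""
    wire = DivergenceTripwire()
    for step, lp_t in enumerate(series):
        if wire.update(lp_t):
            return step
    return None
```

Because the check is relative to the high-water mark, a flat series such as the tiny-random smoke's lp_t of about -11.9 never trips, while a run that falls well below its own best value for two steps in a row does.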
wassname 6b22dc5055 feat: per-mode deploy JSON artifact for every arm + queue-substrate recipe
#164: the final eval now runs for ALL arms (not just route/route2) on the
same fixed eval subset, so the all-arms overlay reads identical per-mode
numbers. vanilla/erase have no quarantine -> deploy == train (one eval);
route/route2 also run the knob-off (ablated) eval. Writes a single
per_mode_deploy.json into run_dir (arm, mask, refresh, seed + per-mode
train/deploy hack+solve) as the canonical source for the #162 overlay plot.

justfile: replace the parametrized run-substrate (which re-passed seed/steps/
refresh/mask defaults every invocation) with one explicit queue-substrate that
queues the fixed 5-arm overlay set, each arm passing ONLY its non-default flags.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 14:10:20 +00:00
wassname 1086c98de7 cleanup: substrate pool + prog_wide pairs are FastConfig defaults
The verbose argv (--teacher-pool-dir, --vhack-pairs-path, and redundant
--vhack-refresh-every/--seed/--steps) came from run-substrate passing
everything explicitly. steps/seed/refresh were already defaults; the two
paths weren't. Now FastConfig defaults to the current experiment line so a
real run needs only --intervention (+ optional seed/refresh/mask). Smoke
(SmokeConfig) unaffected -- it sets its own pool. Stripped the recipe to match.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 13:39:07 +00:00
wassname 80f6b52860 fix: route2 quar/v_act dtype mismatch on bf16 model (A_q/B_q/v_act fp32 vs bf16 x)
Smoke is fp32 (CPU tiny-random) so the bf16 path never fired -- job 34/35
crashed on the real Qwen3-4B with 'BFloat16 != float' in the quar matmul.
Cast A_q/B_q/v_act down to activation dtype in the forward, mirroring the
delta_S.to(a.dtype) pattern (fp32 master, bf16 compute, grads cast back).
Validated forward+backward in bf16 for both masks. + run-substrate MASK param.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 13:35:25 +00:00
wassname 670fcb3c64 feat: route2 grad-mask (Arm A) + drop tau knob + pairset-derived v_hack path
Arm A (route2_mask=grad): per-rollout gate splice (identity at c=1) recovers
the per-sample delta_S grad after backward (c.grad = delta_S * g_b); train.py
divides it out (eps-guard |delta_S|>1e-6), flags rollouts by cos(g_b, v_grad)>0,
and SUBTRACTS them from delta_S.grad. Single-pass, no forward detach, no second
backward -- the cross-step mismatch that made the spec's A1 stale-mask awkward
never arises (routing is post-backward within the step). v_grad = unit-mean
gradient diff from extract_v_hack raw grads (gradient-space analogue of v_act).
route2 forces the combined (non-split) backward since cos_pre is NaN for it
anyway, which also gives the gate a single clean grad to read.

Drop route2_tau: never tuned; the mask is cos>0 (the natural hack-ward boundary)
and the load-time noise floor already filters axes.

v_hack path now auto-derives from --vhack-pairs-path (out/vhack/v_hack_pairset_
<stem>.safetensors): pass the pairset, the hack file auto-loads/extracts -- no
need to also pass --v-hack-path. run-substrate drops the redundant flag.

smoke: smoke-route2 (act) and new smoke-route2-grad both pass (||B_q||=0.109,
exit 0); erase shared-basis path unchanged (cout->0, fired~0.9).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:48:31 +00:00
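Note on 670fcb3c64: a plain-Python sketch of the Arm A bookkeeping, assuming each rollout's gate gradient really is `c.grad = delta_S * g_b` elementwise as the message states. The function names are hypothetical, and the sketch rebuilds the total gradient from the recovered per-sample gradients, which is a simplification of whatever train.py does with the real `delta_S.grad`.

```python
import math

EPS = 1e-6  # eps-guard on |delta_S| quoted in the commit message


def recover_per_sample_grad(c_grad, delta_S):
    """Invert c.grad = delta_S * g_b elementwise; guarded entries are set to 0."""
    return [cg / d if abs(d) > EPS else 0.0 for cg, d in zip(c_grad, delta_S)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def route_flagged(c_grads, delta_S, v_grad):
    """Return (kept_grad, flags): the summed grad minus rollouts with cos(g_b, v_grad) > 0."""
    per_sample = [recover_per_sample_grad(cg, delta_S) for cg in c_grads]
    kept = [sum(col) for col in zip(*per_sample)]
    flags = [cosine(g, v_grad) > 0 for g in per_sample]
    for g, flagged in zip(per_sample, flags):
        if flagged:
            kept = [k - x for k, x in zip(kept, g)]
    return kept, flags
```

With `delta_S = [0.5, -2.0]`, gate gradients `[[0.5, -0.0], [-0.5, -2.0]]` and `v_grad = [1.0, 0.0]`, the first rollout is flagged and removed, leaving only the second rollout's gradient in the kept sum.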
wassname 4359dc53a8 feat: route2 distinct-basis quarantine + per-sample act-mask detach-route
Adds intervention=route2: a LoRA quarantine (A_q,B_q) with its own basis,
always summed into the forward, plus a per-sample activation-cosine mask that
detaches the kept adapter for flagged samples. Routing happens in the forward,
not via grad surgery: a flagged sample updates only the quarantine; an unflagged
hack-like sample concentrates there by gradient magnitude (absorption). Deploy
zeroes A_q,B_q. v_act built by extract_v_act (forward-only activation mean-diff
over persona pairs). Fixes the per-prompt zero_grad wiping quarantine grads
before opt.step. scripts/make_random_vhack.py = the random-V route control.
vhack_refresh_every default 0->5 (0 is ablation-only).

Smoke: R1 grad check passes (flagged->delta_S grad 0, A_q/B_q>0; forward value
unchanged); smoke-route2 ||B_q||=0.109, deploy eval + asserts pass.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 10:16:13 +00:00
wassname 07acadb43f plot: single 'just plot' entrypoint emits per-mode + aggregate (reuse plot_dynamics)
- plot_substrate.main now also calls plot_dynamics.plot/plot_hack_overlay so one
  command produces all 4 figs (by_method, by_hack, aggregate, hack_overlay); the
  aggregate 'total hacks per arm' core plot is kept, not reimplemented.
- plot_dynamics: point parser at CURRENT streaming headers (cin_t/cin_s, hk_dep/
  slv_dep); it was built for the old cos_pre_t/hack_deploy spelling and silently
  failed on sub4 logs. No backward-compat for the superseded header.
- justfile: 'plot GLOB STEM' canonical entrypoint over logs/*_sub4_*.log.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 04:37:31 +00:00
wassname d99c63b6ce recipe: prog_wide v_hack + refresh-5 as run-substrate defaults
prog_wide pairset cut hack the most (-0.226, no pass cost) in the pairset
comparison (results.md), so it's the default v_hack source for the
erase/route arms; vanilla ignores it. REFRESH defaults to 5.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-30 23:09:36 +00:00
wassname a485d4391b recipe: run-substrate default 60 steps (was 80); matches fast preset 2026-05-30 23:05:20 +00:00
wassname 2906bb18ed feat: vanilla ignores v_hack (no misleading cin/cout, no needless extract)
intervention=none is a pure GRPO baseline: skip v_hack load/extract entirely
(v_hack=None), emit a nan diag, and the cin/cout/fired columns are already
hidden on the vanilla arm (#141). A --v-hack-path passed to vanilla is logged
and ignored. Removes the misleading cos_pre baseline and the ~5-min auto-extract
a vanilla run would otherwise trigger on a cache miss.

run-substrate recipe: drop the MIX override (inherit locked 0.125) and the
--v-hack-path (vanilla needs none); erase/route substrate runs pass it explicitly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 10:40:35 +00:00
wassname 4f11cfaabc chore: justfile build-substrate + run-substrate recipes
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 08:56:30 +00:00
wassname cf5f4861db rewards: robust strict oracle (review fixes) — SystemExit guard around test calls + whitelist __strict_eq
Code review (docs/spec/20260530_refactor_code_review.md) found 3 oracle bugs:
- sys.exit INSIDE solve() (during a test call) fooled the oracle -> wrap BOTH
  solution-exec and assert-exec in one SystemExit guard -> os._exit(1) on exit.
- JSON __strict_eq broke 2==2.0 and tuple/list vs gt_pass -> whitelist safe
  builtins and use baseline Python == (custom-typed operand = eq_override -> reject).
- defs-only dropped honest top-level constants -> exec full src, keep state.
verify_rewards: +3 regressions (exit_in_solve, top_const, int_vs_float); 9/9 pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:48:24 +00:00
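Note on cf5f4861db: one way to read "whitelist safe builtins and use baseline Python ==" is sketched here. The names `SAFE_TYPES`, `is_safe` and `strict_eq` and the exact whitelist are assumptions for illustration; the real oracle may differ.

```python
# Assumed whitelist: exact builtin types only, so subclasses that override
# __eq__ are rejected along with arbitrary user-defined classes.
SAFE_TYPES = (bool, int, float, str, bytes, type(None),
              list, tuple, dict, set, frozenset)


def is_safe(value) -> bool:
    """True if value, and everything nested inside it, has a whitelisted exact type."""
    if type(value) not in SAFE_TYPES:
        return False
    if isinstance(value, (list, tuple, set, frozenset)):
        return all(is_safe(v) for v in value)
    if isinstance(value, dict):
        return all(is_safe(k) and is_safe(v) for k, v in value.items())
    return True


def strict_eq(got, expected) -> bool:
    """Compare with plain ==, but only when neither operand can override equality."""
    if not (is_safe(got) and is_safe(expected)):
        return False  # custom-typed operand: treat as an attempted eq override
    return got == expected
```

Under this sketch `strict_eq(2, 2.0)` passes (the int-vs-float regression), while an object whose `__eq__` always returns True is rejected before `==` is ever called, including when it is nested inside a list.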
wassname 8e38d0f419 plot_emergence: Phase-1 mode-grouped overlay (hack=exploited vs solve=gt_correct) + regen-emergence recipe
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:42:39 +00:00
wassname d3c96d4415 train+justfile: env_mode wiring, drop expose-K (load_problems/eval/loop/justfile), run-cell-mode emergence recipe
- load_problems(env_mode): per-mode factual hint swap; no visible/heldout split.
- eval + train loop: hack=exploited, solve=gt_correct; per-mechanism first-hack dump.
- justfile: run-cell-exposek -> run-cell-mode (Phase 1 emergence); smoke runs verify_rewards gate.
- rm scripts/derisk_expose_k.py (contaminated nudge).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 05:33:26 +00:00
wassname dcd881e054 fix: cross-mechanism arms project against prog_wide (best basis, not 21pairs)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 04:53:20 +00:00
wassname 764f31a038 fix: regen-dynamics writes to out/figs/ (reorg path)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 04:49:47 +00:00
wassname 74a731b7c3 feat: run-cell-exposek recipe (cross-mechanism arm)
Same none/erase/route matrix on the expose-K (M2) env, v_hack still the M1
basis -> tests whether an M1-derived direction suppresses the M2 hardcode hack
with no oracle. Teacher-free (M2 emerges on-policy). steps=60, grad_clip=10 by
default now.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 04:47:30 +00:00
wassname 4621488cc0 reorg: out/ sorted by datatype (vhack/ pools/ runs/ vhack_grads/ figs/)
Code writes+reads the new scheme; migrate_out_dirs.py moved 225 loose artifacts
(0 left at top level). Per-run checkpoints+rollouts now group under
runs/<ts>_<run_id>/ as train.safetensors/rollouts.jsonl. Figures land in
out/figs/ with a stable docs/figs/<name>.png symlink (figs.link_latest).
justfile also gains run-cell REFRESH param (online-erasure arm). Smoke +
smoke-vanilla + results all green on new paths. Requeue manifest preserves the
why/resolve labels that pueue reset wiped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 03:52:24 +00:00
wassname f917670994 feat: T8 run-cell + regen-dynamics recipes; spec T5 done, T8 in progress
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:52:14 +00:00
wassname fc30514b23 feat: T5 eval-time ablation for route + fix route deployment invariant
T5: eval_hack_solve helper + ablate_quarantine ctx; periodic ablated-eval
(hack_abl/solve_abl cols, appended so results.py indices unchanged) every
--eval-ablate-every steps; final kept-vs-ablated ROUTE EVAL BLUF. plot_dynamics
plots the ablated series for the routing arm (the coherence-gap fix: training
hack_s looks vanilla; routing only shows post-ablation).

External-review fixes (docs/spec/20260530_code_review.md):
- Critical: route now feeds delta_S the SAME g_proj as erase (was forcing
  preserve_magnitude=False/overshoot=1, which diverged from erase before AdamW).
  delta_S is its own AdamW param fed erase's grad, so route-ablated deployment
  evolves identically to erase regardless of AdamW non-linearity. Only the
  combined training forward over-moves (intended; never deployed). Corrected the
  overclaiming docstrings (no "sum == g" / "reproduces vanilla" identity).
- Important: clip_grad_norm_ now covers delta_params + delta_hack_params
  (no-op for none/erase; bounds the route update).
- Important: results.py paired-delta table includes routing (keyed on arm).

smoke route/erase/vanilla green: dsh route=0.0105 erase/none=0, span=2.9e-7,
ROUTE EVAL BLUF prints.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:50:53 +00:00
wassname d6342ab201 feat: gradient routing — delta_S_hack quarantine + intervention {none,erase,route}
Stage-1 (T3) of the routing spec. Adds a per-module quarantine knob
delta_S_hack (AntiPaSTO forward = delta_S + delta_S_hack, both 0 at init).
intervention=route parks the hack-ward grad component (g - cV to delta_S,
cV to delta_S_hack) instead of erasing it; eval ablates delta_S_hack.

- proj.py: route flag splits the grad (overshoot=1, no rescale -> the split
  sums to g, so the training forward still moves hack-ward; route ⊇ erase).
- antipasto.py: second trainable knob, identity preserved at init.
- train.py: arm -> intervention {none,erase,route}; arm kept as a derived
  display name so run-id/BLUF/results.py/plot classify are unchanged. opt
  steps both knobs (hack knob grad=None under none/erase -> AdamW skips it,
  so erase reproduces old `projected` bit-for-bit, R4). R3 span assert
  (resid/||gh|| < 1e-4) + end-of-run ||delta_S_hack|| guard (route >0).
- results.py / plot_dynamics.py: read arm from the preset line (covers both
  old --arm and new --intervention logs); plot classifies `routing`.

smoke: none ||dsh||=0, erase clean, route ||dsh||=0.0105 span=2.9e-7. 64
archived projected rows still parse.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 00:31:30 +00:00
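Note on d6342ab201: the route split (g - cV to delta_S, cV to delta_S_hack) in plain Python. This assumes the rows of V are orthonormal and that the one-sided gate from the earlier proj.py commits applies here as well; `split_grad` is a hypothetical name, not the function in proj.py.

```python
def split_grad(g, V, one_sided=True):
    """Split g into (kept, parked) so that kept + parked == g elementwise.

    parked is the component of g along the rows of V (the hack-ward part);
    kept is the remainder that would go to delta_S.
    """
    parked = [0.0] * len(g)
    for v in V:
        c = sum(gi * vi for gi, vi in zip(g, v))  # c_i = <g, v_i>
        if one_sided and c <= 0:
            continue  # assumed gate: only park motion that points hack-ward
        parked = [p + c * vi for p, vi in zip(parked, v)]
    kept = [gi - pi for gi, pi in zip(g, parked)]
    return kept, parked
```

For `g = [3.0, -2.0, 1.0]` and `V` spanning the first two coordinate axes, only the first axis has a positive coefficient, so `parked = [3.0, 0.0, 0.0]` and `kept = [0.0, -2.0, 1.0]`. The two parts add back to `g`, which is why the combined training forward still moves hack-ward.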
wassname 62c6794e30 prune: drop mean_diff and solve_orth_m extractor options
Both were negative results (docs Q4, Q9) and are now dead weight. Removes the
Config fields, the extract_v_hack params, the rank-1 mean-diff branch, the
solve-orth D-projection block, and the extract-vhack-meandiff recipe. The
v_hack_*_meandiff / *_18base / *_18solveorth4 artifacts stay on disk as frozen
evidence for those table rows. Smoke passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 10:21:01 +00:00
wassname 46f10d8150 results: absolute-rate tables + provenance, lock mix=0.125 default
docs/results.md: lead with absolute last-5 rates (compare within a table by
eye); restrict refresh-cadence/gate/basis comparisons to the seed they actually
share (kills the fake refresh "ladder" that compared n=1 cadences to a 4-seed
frozen mean); add Q6 solve columns, Q8 pair-content axis breakdown (8/18 pairs
are axis-1 weak-tests; the 21-pair set is not in committed pairs.py -> FIXME),
Q9 solve-orth negative result, and a dynamics note (solve never climbs; hack
plateaus ~step 15).

scripts/results.py: add `log` provenance column; drop the wide argv/time cols.

Lock mix_ratio=0.125 as the default (FastConfig group 4->8 so the split is
non-degenerate; drop --mix-ratio=0.5 from fast recipes). Q6 shows 0.125 keeps
the hack cut with no solve tax. Smoke passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 09:30:30 +00:00
wassname 4464f9d312 results tooling + solve-orth knob + results-by-question doc
- scripts/results.py + `just results`: aggregate logs/*.log into last-5
  hack_s and gt_s (solve) tables, sorted-by-time + grouped-by-config, with
  full argv provenance column. Filters smoke/probe runs.
- extract_vhack_grad: solve_orth_m knob — strip top-m known-solve subspace
  (SVD of clean-side grads) from D before SVD, so projection doesn't ablate
  the solve signal. No grader/oracle, off by default.
- docs/results.md: every experiment grouped by the question it answers
  (feasibility, H1, gate_mode, basis, refresh, mix, noise-floor, pair-set)
  with comparison tables and answers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 07:21:05 +00:00
wassname 826b2aa83e wip 2026-05-29 06:29:46 +00:00
wassname f70743c9e9 wip 2026-05-28 12:44:20 +00:00
wassname 1e3d39e318 justfile: drop 12 dead probe-* recipes superseded by train.py
The probe_distill.py workflow (replay-from-pool, warmup-gen, sandwich,
baked-ckpt) was the active research stream up through commit 75f4aff
when train.py took over with the fast preset + mixed-pool flag. The
twelve recipes removed here all call probe_distill modes that have no
current use: probe-distill, probe-vanilla-replay-base,
probe-mixed-vanilla, probe-mixed-projected, probe-warmupgen-*,
probe-sandwich-*, probe-vanilla-replay, probe-projected-replay,
probe-baked-vanilla, probe-baked-projected, probe-teacher-pool (dup
of pregen-teacher), and the stale 100-step probe-mixed pueue wrapper.

Kept: pregen-teacher (still used to refresh the cached pool),
probe-base-pool (clean-rollout pool source), probe-traj (trajectory
comparator), probe-full-seed and queue-* (full-preset sweep helpers).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 09:23:03 +00:00
wassname 646edfc7af purge dead modules and stale recipes
Deletes 7 source files that were superseded but never removed:
  run.py, grad_proj.py, extract_vhack.py (older twin-NLL extractor),
  grpo_smoke.py, grpo_proj_smoke.py (smoke harnesses replaced by
  train.py "smoke" subcommand), phase2_analyze.py (pilot is past),
  probe_uat.py (UAT pipeline is past).

Drops matching justfile recipes (vhack-check, phase2-analyze,
probe-uat) and the BASE constant that pointed at run.py. Updates
AGENTS/README references to the stale fast-dev-run recipe (now
just smoke / smoke-vanilla).

Verified by running just smoke-vanilla --steps=2 end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 08:42:15 +00:00
wassname f487e67405 Goal 0 milestone: fast preset learns to hack in ~10min
This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL 2026-05-28
(b)) plus the architectural cleanups it surfaced. Pueue task 59 hits the UAT
threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool, ~10 min total.

Preset/Adam scheduling
- New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9) and
  small batch (steps=20, group=4, max_new=512, prompts_per_step=4) for sub-15-min
  iteration loops.
- `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so the
  20-step fast preset spends only 2 steps under warmup, not 10.
- `grad_clip` exposed as Config field (default 1.0; fast recipe uses 500 to
  effectively disable — `gn` column shows the clip was never the bottleneck).

CLI restructure (tyro subcommands)
- Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge hack.
- Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` / `FullConfig`
  inheriting from `Config`, dispatched via `tyro.extras.subcommand_cli_from_dict`.
- CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not --preset=).
- `cfg.preset_name` derived from `type(self).__name__` instead of duplicated field.

Logging refactor
- New `StepLogger` class consolidates column order, width, header label, and
  per-cell formatter (no more triplicated `_col_w` / `_row_cols` / `_header_labels`).
- Row dict carries raw values throughout; formatters live in column spec.
  Fixes the bug where end-of-run tabulate parsed `"7.00e-08"` strings as floats
  and reformatted to `+0.000`. Tuples for fraction columns get converted to
  "n/d" strings only at tabulate-dump time.
- `gn` column added (pre-clip total L2 norm; was discarded by clip_grad_norm_).
- `lr` column added (current scheduled LR through warmup + cosine).
- Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still archived.

cin/cout -> cos_pre/cos_post + signed
- Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots,
  justfile. "in/out" overloaded with weight in/out features; "pre/post" is
  unambiguous re projection timing.
- Metric is now signed: sum(V @ g) / ||g|| instead of ||V @ g|| / ||g||. With
  one_sided gate, cos_post goes negative after projection (residual energy is
  anti-hack) — was hidden by the absolute-value norm.

v_hack extraction framing
- README + `extract_vhack_grad.py` docstring lead with "this is the GRPO
  gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair
  with advantages +-1 the Dr.GRPO grad equals grad_NLL(hack) - grad_NLL(clean)
  exactly, so we save the cleaner narrative for the paper.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 03:22:36 +00:00
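Note on f487e67405: a small numeric illustration of why the signed metric was introduced. It assumes orthonormal rows in V; the function names are hypothetical, and `g_post` is simply `g_pre` with its positive component along the first axis removed by hand, standing in for the one-sided projection.

```python
import math


def subspace_coeffs(V, g):
    """Coefficients c_i = <g, v_i> for each row v_i of V."""
    return [sum(vi * gi for vi, gi in zip(v, g)) for v in V]


def cos_unsigned(V, g):
    """Old metric: ||V @ g|| / ||g||, never negative."""
    c = subspace_coeffs(V, g)
    return math.sqrt(sum(ci * ci for ci in c)) / math.sqrt(sum(gi * gi for gi in g))


def cos_signed(V, g):
    """New metric: sum(V @ g) / ||g||, keeps the direction of the overlap."""
    c = subspace_coeffs(V, g)
    return sum(c) / math.sqrt(sum(gi * gi for gi in g))


V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
g_pre = [4.0, -3.0, 0.0]    # positive along axis 0, negative along axis 1
g_post = [0.0, -3.0, 0.0]   # axis-0 component removed; axis 1 left alone
```

Here the unsigned metric reads 1.0 both before and after the projection, while the signed metric goes from 0.2 to -1.0, exposing that the residual energy in the subspace is anti-hack.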
wassname a82c5c17dd smoke: route through teacher_pool so backward/projection paths fire
Pure tiny-random gen produces all-zero rewards and zero-variance bails
every step, so the GRPO backward, projection, and cin diagnostics never
ran under smoke — exactly the paths most likely to harbour bugs.

Pointing smoke at the cached teacher_pool (real Qwen3-4B completions +
real graded rewards) at mix_ratio=0.5 guarantees within-group reward
spread on every step. Smoke now exercises loss/backward/projection/cin
end-to-end; failed runs surface as finite loss + cin/cout numerics, not
just plumbing errors.

Side fix: decouple pool from prompt tokenization. Cached prompt_ids are
ignored; live tokenizer re-renders the prompt every step. Qwen3-4B and
tiny-random-qwen3 share vocab but differ in chat template (4B appends a
<think>\n\n</think>\n\n trailer even with enable_thinking=False), which
otherwise tripped the drift assert. Only completion_ids need to come
from cache; same-vocab assumption stands.

Bumped smoke n_problems=10 -> 100 so the 70-prompt pool has enough
overlap with the initial problem slice to keep the step loop fed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 23:49:21 +00:00
wassname ecfb3bf30a smoke: tiny-random on CPU, beartype on, 30 steps; one-harness consolidation
Make `just smoke` reuse train.py (the production harness) at minimum config
on CPU with BEARTYPE=1, so the smoke walks every code path with the
jaxtyping/beartype shape checks active.

Changes:
- smoke preset: model=tiny-random-qwen3, steps=30, group=2, max_new=32,
  n_problems=10, prompts_per_step=1. Steps>=25 so the every-25-step
  save_ckpt path is exercised. Runs in ~35s on CPU.
- train.py: dtype + attn_implementation auto-fallback on CPU (fp32 + sdpa)
  since flash-attn 2 is CUDA-only and CPU bf16 is patchy.
- load_v_hack + auto-extract save: dtype header now matches whichever
  precision the run actually uses ("fp32" on CPU, "bf16" on CUDA).
- justfile: smoke recipes drop the parallel `run.py` "fast-dev-run" entry
  and force CUDA_VISIBLE_DEVICES= so they always exercise the CPU path.
  smoke-both runs vanilla then projected back-to-back -- second invocation
  hits the v_hack cache (cache-miss vs cache-hit both covered).

Fixes uncovered when smoke first ran:
- est_gens_per_step was reading cfg.prompts_per_step * cfg.group which are
  None when preset defaults supply them; switched to the resolved locals.
- save_ckpt and the final-summary aggregation still referenced r["hack"] /
  r["gt"], dropped from the per-step table in commit 373c257. Reconstruct
  from r["hack_s"] + r["hack_t"] and same for gt.
2026-05-27 23:33:12 +00:00
wassname 5f196e3108 v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin
Extraction (extract_vhack_grad.py):
- Default top_k=12 (was 5), saves singular values S as _sv/{name} keys
- SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile)
- Pulled extract_v_hack() into a callable function for in-process reuse
- Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched)

Loading (train.py:load_v_hack):
- Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict
- k_use slicing at load: extract at k=12, ablate k=1..12 by config flip
- Auto-extract on cache miss using already-wrapped model (no second model load)
- Default path derived from model_slug + extract_top_k

Runtime suspicion gate (proj.py:project_delta_S_grad):
- Dimensionless within-module ratio: r_i = (|c_i|/||g||) / (S_i/||S||)
  (codex/subagent flagged: |c_i|/S_i biased by per-module ||g||)
- Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25)
- Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file)

Per-source cin (proj.py:mean_cin_from_grads + train.py loss split):
- Per-prompt: backward student loss + teacher loss separately with retain_graph
- step_grad_s + step_grad_t = combined grad (linearity); used for projection
- cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack"

Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan)
Codex external review: docs/spec/20260527_code_review.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 06:39:05 +00:00
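Note on 5f196e3108: the dimensionless suspicion ratio r_i = (|c_i|/||g||) / (S_i/||S||) and a drop-the-top-fraction gate, sketched in plain Python. The function names are hypothetical, and dropping `int(k * drop_frac)` axes is an assumed reading of "per-step quantile gate"; proj.py may compute the cut differently.

```python
import math


def suspicion_ratios(c, S, g_norm):
    """r_i = (|c_i| / ||g||) / (S_i / ||S||); assumes g_norm > 0 and every S_i > 0."""
    s_norm = math.sqrt(sum(s * s for s in S))
    return [(abs(ci) / g_norm) / (si / s_norm) for ci, si in zip(c, S)]


def keep_mask(ratios, drop_frac=0.25):
    """Return a per-axis keep flag, dropping the top drop_frac axes by ratio."""
    n_drop = int(len(ratios) * drop_frac)  # assumed rounding: floor
    order = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    dropped = set(order[:n_drop])
    return [i not in dropped for i in range(len(ratios))]
```

With `c = [0.1, 0.4, 0.3, 0.05]`, `S = [4.0, 2.0, 1.0, 1.0]` and `g_norm = 1.0`, axis 1 has the largest raw coefficient but axis 2 is the one dropped: its coefficient is large relative to its singular value, which is what the ratio is meant to catch.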
wassname 75f4aff4d8 Mixed-pool GRPO via cached teacher pool
Adds --teacher-pool-dir + --mix-ratio to train.py. Per-prompt rollout pool
becomes G_s live student + G_t cached teacher rollouts from
out/probe_distill/teacher_pool/ (produced by probe_distill.py --teacher-only).
Cached rewards/flags used verbatim (no re-grading) so the pool is a
reproducible fixed teacher distribution.

Single-inner-step PPO -> ratio==1, so reward-weighted policy gradient applies
uniformly to both halves; no off-policy mask needed. Loss is unchanged.

Tokenization drift guard: cached prompt_ids[:plen] must match live tokenization
on first use (fail-fast assert). Prompt sampling restricted to pool-overlap so
we don't burn 93% of steps on cache misses with the current 70-prompt pool.

Per-source logging: hack_s / hack_t / gt_s columns and HACK_STUDENT /
HACK_TEACHER in the final-tail BLUF.

Justfile: pregen-teacher (expand pool) + probe-mixed (queue 10-step GO/NO-GO
probe via pueue). Smoke validated 2 steps end-to-end on clean Qwen3-4B at
peak 44.8GB.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 02:04:19 +00:00
wassname 6bd3abfe5b no_gate projection mode, ariahw hint-replacement loader, mixed-pool plan
- proj.py: add gate_mode={one_sided, no_gate}; no_gate does full V·V^T removal
- train.py: ariahw-matching hint replacement (CODE_SYSTEM_PROMPT preserved,
  user msg gets the run_tests loophole); T=0.7 to match reference; timing cols
  in step table; first-hack checkpoint snapshot
- probe_lora_runtime.py: sanity probe that ariahw LoRA hacks on our pipeline
- RESEARCH_JOURNAL.md: null result entry (#39 projected ≈ #40 vanilla at
  HACK=0.215, PASS=0.315), plus next-phase plan to switch from baked-base to
  mixed-pool GRPO from clean Qwen3-4B + ariahw teacher
2026-05-27 00:45:26 +00:00
wassname 235b51399f top-k v_hack subspace + real-voice pairs + LoRA bake
Pipeline overhaul for the "v_hack failed to discriminate hacks (cos≈+0.01)"
finding on seed41:

- bake_lora.py: scale ariahw/rl-rewardhacking-leetcode-rh-s65 alpha by 0.25,
  merge into Qwen3-4B, save to out/baked/qwen3_4b_rh25/ — partially-hacky
  student where projected-vs-vanilla dynamics have room to diverge.
- pairs.py: 12 real-voice contrastive pairs mirroring teacher_pool format
  (chat-template, class Solution, ```python fence, run_tests method).
  4 axes: weak-tests (8), hardcode (2), persona-via-completion (2). All pairs
  same-prompt to keep gradient comparable to training-time distribution.
- extract_vhack_grad.py: SVD top-k of per-pair diff matrix D[n_pairs, r] per
  module. Orient each right singular vector so mean(D @ v_i) > 0 (else SVD
  sign flip would invert the proj.py one-sided gate). Save as [k, r] with
  top_k in safetensors metadata. Diagnostic switches from ||diff|| to
  sv_top_k fraction.
- proj.py: rank-k subspace projection with per-direction one-sided gate.
  For each axis v_i with c_i = <g, v_i>, subtract only when c_i > 0. Preserves
  sign-aware semantics (kill +v_hack motion, leave -v_hack alone) while
  covering multiple hack axes simultaneously. cos_in becomes ||V g||/||g||
  (subspace energy fraction).
- probe_plot_stack.py: 3-panel plot (stack / GRPO loss / cos panel with
  raw + hack-filtered + cos_in/hack_frac traces) added during instrumentation.
- probe_distill.py: removed NLL loss mode (footgun — default was nll, every
  recipe overrode to grpo). Always GRPO. Tracks per_sample_loss.

Extract on baked rh25 with new pairs (pueue 22):
  top-5 SV fraction = 0.70-0.74 per module suffix (SHOULD>0.5, met).
  v_proj cleanest at 0.74. All 252 modules non-zero ||D||.

References:
- docs/paper_chars.md (CHaRS paper) motivates multi-axis steering
- docs/RESEARCH_JOURNAL.md 2026-05-26 entry covers context + audit

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:33:24 +00:00
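Note on 235b51399f: the sign-of-mean orientation rule for a singular vector (mean(D @ v_i) > 0), sketched without any SVD library. `orient_by_mean` is a hypothetical name, and `D` is assumed to be a list of per-pair diff rows of the same length as `v`.

```python
def orient_by_mean(D, v):
    """Flip v if the per-pair projections D @ v have a negative mean.

    SVD returns each singular vector only up to sign, so without this step a
    one-sided gate that subtracts when <g, v> > 0 could act on the wrong side.
    """
    projections = [sum(d * x for d, x in zip(row, v)) for row in D]
    mean_proj = sum(projections) / len(projections)
    if mean_proj < 0:
        return [-x for x in v]
    return list(v)
```

For `D = [[1.0, 0.0], [2.0, 1.0]]` and `v = [-1.0, 0.0]` the projections are -1.0 and -2.0, so the vector is flipped to point along the hack-minus-clean direction. The later v_hack v2 commit in this log replaces this mean rule with a majority vote across pairs because a single outlier pair can swing the mean.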
wassname 00159cd7c6 Fix is_replay bug, add delta_S/logp diagnostics, cycle pools
- is_replay was always True when --replay-dirs was set, so student-gen
  batches were saved slim with no completions. Use replay_active.
- Log delta_S norm per step (adapter movement smoke test).
- Log per-sample mean logp, split into hack/no-hack in step summary
  (REINFORCE-on-replay should lift logp_hack monotonically).
- Cycle pool modulo size so warmup > pool size works.
- Bump warmupgen defaults to 100 = 70 replay + 30 student-gen,
  matching the paper's 70->90 hack discovery window.
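
Sketch of the is_replay and pool-cycling fixes above. Names other than
replay_active are hypothetical, and steps are assumed to be 0-indexed.

```python
def replay_active(step: int, warmup_replay_steps: int) -> bool:
    """True only while the step is still inside the replay warmup.

    The buggy flag was effectively bool(replay_dirs), which stays True for
    the whole run once --replay-dirs is set.
    """
    return step < warmup_replay_steps

def pick_pool_entry(pool: list, step: int):
    """Cycle through the saved pool so warmup may exceed the pool size."""
    return pool[step % len(pool)]

# With the new defaults: 100 steps = 70 replay + 30 student-gen.
phases = ["replay" if replay_active(s, 70) else "student-gen" for s in range(100)]
```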
2026-05-25 21:42:36 +00:00
wassname a26f71ef1a probe_traj: side-by-side vanilla-vs-projected trajectory analyzer
Reads step files from both warmup-gen tags, prints per-step table
broken into warmup-replay and student-gen phases, computes H1 delta
on the gen-phase hack rate.
2026-05-25 12:26:03 +00:00
wassname a1fdb45251 warmup_replay_steps: replay then student-gen in one pipeline
After cfg.warmup_replay_steps replay steps from saved pools, switch to
student.generate using the learned adapter -- canonical GRPO loop.
Same Dr.GRPO loss + per-sample cosine throughout. The just recipes
probe-warmupgen-{vanilla,projected} default to 40 steps with warmup=20.

Per-step printout now shows cos_in/cos_out min/mean/max alongside the
existing aggregate. Reveals bimodal distributions hidden behind a mean.
2026-05-25 12:24:49 +00:00
wassname ab6676d90a mixed-replay GRPO works + cos fix + min/max + journal
probe_distill: mixed-replay loader with heterogeneous plens, Dr.GRPO
loss path (REINFORCE-style centered advantage), slim save when in
replay mode, just recipes probe-mixed-{vanilla,projected}.
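
For reference, a centered advantage can be sketched as below. This assumes
"centered" means subtracting the group mean reward without dividing by the
standard deviation, which is how Dr.GRPO is usually described; the function
name is hypothetical and this is not the loss code from probe_distill.

```python
def centered_advantages(rewards: list[float]) -> list[float]:
    """Advantage = reward minus the group mean, with no std normalization."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

The advantages in a group sum to zero, so a group where every sample gets the
same reward contributes no gradient.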

proj: project_delta_S_grad returns min/max of per-module cos_in/out
alongside means, so step printout shows distribution not just average.

probe_distill: norm_weighted_cos now divides by sqrt(n_modules) so the
per-sample cos_S_contrib is a proper cosine in [-1, 1] (previously the
missing sqrt(n_modules) factor let it exceed 1).
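
Why the sqrt(n_modules) factor is needed, as a sketch. This is one plausible
reading of norm_weighted_cos, not the repo's implementation: it assumes each
per-module v_hack has unit norm, so the concatenation of n of them has norm
sqrt(n).

```python
import math

def norm_weighted_cos(grads: list[list[float]], v_hacks: list[list[float]]) -> float:
    """Cosine between the concatenated gradient and the concatenated v_hack.

    Assumes each per-module v_hack is unit norm, so ||concat(v)|| = sqrt(n).
    """
    n = len(grads)
    dot = sum(sum(a * b for a, b in zip(g, v)) for g, v in zip(grads, v_hacks))
    g_norm = math.sqrt(sum(a * a for g in grads for a in g))
    return dot / (g_norm * math.sqrt(n))

def unnormalized_variant(grads: list[list[float]], v_hacks: list[list[float]]) -> float:
    """The same quantity without the sqrt(n) factor; can reach sqrt(n)."""
    return norm_weighted_cos(grads, v_hacks) * math.sqrt(len(grads))
```

With four modules whose gradients align perfectly with v_hack, the corrected
value is 1.0 while the unnormalized one is 2.0.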

Step-0 mixed-replay result: teacher (hack=1) samples cos +0.07-0.09
(proper scale), base (hack=0) samples cos -0.005 to +0.004 -- two
cleanly separated distributions on 4+4 samples. v_hack extracted from
hand-authored pairs.py generalizes to ariahw's RL-emergent hack
direction. Strong methodological confirmation.

Journal: 2026-05-25 (b) entry covers the GRPO probe, the projection
asymmetry that makes cos_out slightly negative (cos_in<=0 modules
skipped), and the cos norm fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 12:20:52 +00:00
wassname 1e1b032c31 phase2_analyze: read pilot checkpoints, print trajectories + decision
Aggregates cin_mean / cout_mean / fired / frac_out_lt_in across seeds
for vanilla and projected arms. Applies spec2.md decision rules:
  vanilla cin>0.2 -> Phase 3 strongly justified
  cin~0           -> v_hack maybe orthogonal; consider R7
  projected out<in on >=80% steps -> mechanism active
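
The rules above, sketched as a function. The thresholds 0.2 and 80% come from
the rules; zero_tol (what counts as "cin~0") and the function name are
hypothetical.

```python
def phase2_decision(vanilla_cin_mean: float,
                    frac_out_lt_in: float,
                    zero_tol: float = 0.02) -> list[str]:
    """Map aggregated pilot statistics to the spec2.md decision notes."""
    notes = []
    if vanilla_cin_mean > 0.2:
        notes.append("Phase 3 strongly justified")
    elif abs(vanilla_cin_mean) <= zero_tol:     # zero_tol is an assumption
        notes.append("v_hack maybe orthogonal; consider R7")
    if frac_out_lt_in >= 0.8:
        notes.append("mechanism active")
    return notes
```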

justfile recipe: phase2-analyze [pattern]
2026-05-25 12:02:35 +00:00
wassname e04548987f spec2 + base_pool generator + slim replay save (partial mixed-replay TODO)
spec2.md records:
 - Phase 1 result (NLL cos signal +0.747 pure-hack vs +0.398 mixed)
 - Phase 2: mixed-replay GRPO probe, partial impl
 - Phase 3: $400/65h sweep, predicated on Phase 2 cos_in signal

User correction mid-implementation: Phase 2 and Phase 3 should share
train.py code with different --steps, not build separate replay
machinery. Mixed-replay refactor in probe_distill.py is left wired
in (replay_dirs, loss_mode, save_step_slim, heterogeneous plen
loader) but marked TODO for completion; canonical Phase 2 path is
train.py at smaller scale.

probe_distill.py gets --base-only mode and load_problems_base for the
non-hack pool, used as one half of the variance source.

Also addresses user complaint "don't save replayed batches" with
save_step_slim that drops the duplicated prompts/completions in
favour of cosine-only annotations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 11:48:48 +00:00
wassname d111db25f7 Distillation probe: hacky teacher (rh-s65) + student per-sample cosine
probe_distill.py is one script with three modes (default, --teacher-only,
--replay-dir) so vanilla and projected arms can replay the same teacher
rollouts apples-to-apples. Per-sample delta_S.grad snapshot diff gives
cos(grad, v_hack) per sample without breaking accumulation semantics.
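
The snapshot-diff idea, sketched with NumPy arrays standing in for
delta_S.grad. The names are hypothetical; in the real script the in-place add
would be a loss.backward() call that accumulates into .grad.

```python
import numpy as np

def per_sample_cosines(sample_grads: list[np.ndarray], v_hack: np.ndarray):
    """Recover each sample's gradient from an accumulating buffer.

    The buffer is never zeroed between samples, so the accumulated total is
    unchanged; each sample's contribution is the after-minus-before diff.
    """
    acc = np.zeros_like(v_hack)                 # stands in for delta_S.grad
    cosines = []
    for g in sample_grads:
        before = acc.copy()                     # snapshot
        acc += g                                # stands in for backward()
        contrib = acc - before                  # this sample's gradient
        denom = np.linalg.norm(contrib) * np.linalg.norm(v_hack)
        cosines.append(float(contrib @ v_hack / denom))
    return cosines, acc
```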

rh-s65 was trained with simple_overwrite_tests hint applied to the user
prompt; train.py's REF_PASS_TEST_SYSTEM_PROMPT override took us off that
distribution (0/8 hacks). load_problems_rh restores the no-intervention
setup -> 8/8 hacks at step 0.

probe_uat.py defines four UATs and reports PASS/FAIL: T1 teacher hack
>=0.30, T2 vanilla cos coverage >=90%, T3 projected cos_out<cos_in on
>=80% steps, T4 cos | hacked > cos | not hacked (one-sided t-test, p<0.05).

Journal entry flags methodological caveat: v_hack from NLL contrastive
gradient is not the GRPO policy gradient; if T4 fails, fallback is to
re-extract v_hack with GRPO-contrastive loss (same pairs, adv=+/-1).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:04:55 +00:00
wassname 6f68ba34b6 Match paper effective batch + fix gt_tests/KeyError, strip stale docstring
Re-audited our setup vs ariahw 2025 (paper body + config.py + dataset):

- gt_tests: was [:5] of median-102 ground-truth asserts. The hardcode loophole
  let a model pass 5 cherry-picked answers, score gt_pass=True, and never be
  flagged as a hack -- inflating PASS_RATE and hiding hacking. Now uses all
  asserts (free: rewards.py runs them in one subprocess).
- n_problems 500 -> 992 (full filtered set, paper fn.9).
- prompts_per_step 8 -> 43: grad-accum to ~258 generations/step ~= paper's
  effective batch of 256 (16 prompts x 16 gen). At our VRAM-capped G=6 this is
  the only lever; same peak VRAM, ~5x wall-time. Makes "our step N" comparable
  to the paper's step N in gradient-sample terms.
- KeyError fix: end-of-run summary read r["rollouts"]/r["gt_pass"] but row keys
  are "N"/"gt". Every run crashed at step 200 before saving; no .pt had ever
  been written.
- Stripped stale module docstring (claimed beta=0.04 vs actual 1e-3, Qwen3.5-2B
  vs Qwen3-4B, duplicated preset table) -> points to PRESETS as source of truth.
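
The effective-batch arithmetic behind prompts_per_step = 43 can be checked as
below. Rounding to the nearest integer is an assumption about how 43 was
chosen; the helper name is hypothetical.

```python
def prompts_for_target(target_generations: int, group_size: int) -> int:
    """Prompts per step needed to reach roughly target_generations rollouts."""
    return round(target_generations / group_size)

paper_batch = 16 * 16                            # 16 prompts x 16 generations
ours = prompts_for_target(paper_batch, group_size=6)
generations_per_step = ours * 6
```

Here ours is 43 and generations_per_step is 258, within 1% of the paper's 256.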

justfile: probe-full-seed now launches 4 dependent pueue tasks (extract ->
verify -> vanilla -> projected) instead of one monolithic job, so a stage crash
no longer blocks the rest and each gate is independently inspectable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 09:25:47 +00:00
wassname 87a2b48784 G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite
train.py: pass logits_to_keep=L_c+1 to model() at all three logp call
sites + the ref-via-zero-delta helper so HF Qwen3's lm_head only runs on
completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site.
full preset G=8 -> G=6 for a further ~25% B reduction at every act site.
Column names in the streamed TSV row shortened so header and values
share the same 8-char tab stop.
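
Why L_c+1 positions are enough, as an index-only sketch. It assumes the usual
causal-LM alignment (the logit at position i scores the token at i+1) and that
an integer logits_to_keep keeps the last N positions; the helper names are
hypothetical.

```python
def kept_logit_positions(prompt_len: int, completion_len: int) -> list[int]:
    """Positions whose logits survive logits_to_keep = completion_len + 1."""
    total = prompt_len + completion_len
    return list(range(total - (completion_len + 1), total))

def completion_positions(prompt_len: int, completion_len: int) -> list[int]:
    """Positions of the completion tokens in the full sequence."""
    return list(range(prompt_len, prompt_len + completion_len))

kept = kept_logit_positions(prompt_len=5, completion_len=3)
# Drop the final logit (it scores a token past the end); shift the rest by one.
scored = [i + 1 for i in kept[:-1]]
```

The scored positions coincide with the completion tokens, so no
prompt-side logits are needed for the completion logp.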

spec.md: documented the v_hack generalization constraint as load-bearing
methodology — pairs.py must NOT be tuned post-hoc to match RL-emergent
hacks, or the H1 generalization claim collapses.

handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B).
Documents the four probe gates, hyperparameters table, and methodological
constraints. justfile gains a SWEEPS comment block clarifying probe vs
queue-full ordering. .gitignore picks up .venv, *.log, /tmp/, cache dirs.

RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix,
pooled cross-run trend analysis (LR is fine, signal underpowered at n=17
but directionally consistent), and the generalization correction.
2026-05-24 05:03:04 +00:00