Commit Graph

71 Commits

Author SHA1 Message Date
wassname f27c658ca9 docs 2026-05-29 05:42:28 +00:00
wassname 22b5d0a8a7 LW draft: add preregistered H1 block-quote with falsification clauses
Surfaces the H1 verbatim + falsification criteria, names two gaps up-front:
21 pairs vs preregistered 60-80, and the SEM-across-seeds clause not yet
evaluable at n=2. Addresses the comprehension panel's flag on H1 verbatim
omission (deepseek 3.0, gemini-flash 4.0 on hypothesis_clarity).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 03:56:33 +00:00
wassname 28e251c2d0 journal (j): note pueue-switch reorder of n=3 fillers to slots 120-122
AFK queue-reorder shoved #137-#139 (vanilla s=42, projected s=44 frozen +
refresh-2) ahead of 17 other queued jobs so the n=3 matched table lands
before next user check-in. Original G2-screen commands displaced to slot
IDs 137-139.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 03:52:42 +00:00
wassname 638fe23f3e LW-style draft post: gradient projection vs reward hacking (paper-writing skill)
Compresses the lab report into ~1700 words for a LessWrong audience while
preserving the workshop-paper scaffolding (intro / setup / method /
result table / mechanism subplot / limitations / related work / next).

Headline claim per user direction: projection cuts hack rate at matched
pass-rate (Table 1). Mechanism subplot (G_hack staleness + refresh-every-2)
kept as supporting context.

External-panel critique pass (n=5 models, mean 4.4/5 ready) on dims
hook/clarity/inform_not_persuade/calibration/LW_voice. Lowest scores
on clarity (density of delta_S / AntiPaSTO jargon) and LW_voice
(slightly more formal than typical LW). Acceptable for first draft.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 03:49:51 +00:00
wassname ffe206bb55 paper-review pass on lab report: annotated + review files
Phase 1: 25 inline annotations on docs/lab/...partial_n3.md, covering
preregistration gaps (n=2 vs SEM clause; 21 pairs vs preregistered 60-80;
pass-rate at 10pp boundary), Adam-momentum projection leak, cosine-vs-null
baseline, mixed-pool training caveat, Appendix B step-0-hack-detector
inconsistency, refresh compute cost, and a few smaller items (mix_ratio
semantics, K_axes value, AntiPaSTO module count).

Phase 2: review file with strengths, weaknesses, per-section comments,
questions, and a four-tier accept criterion. Current verdict: weak accept
as internal lab report, major revision as public draft.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 03:24:20 +00:00
wassname 14db69de97 lab report v3: TL;DR, three-line concept, PASS_RATE column, G_hack rename
- Add TL;DR for skimmers; first paragraph + Table 1 now stand alone.
- Open the method with the user's three-line framing of the intervention.
- Rename v_hack -> G_hack in doc body (with one-line note about code/file name).
- Add PASS_RATE column to matched-seed Table 1; note seed-43 pass-rate cost.
- Define HACK_STUDENT on first use.
- Block-quote H1 verbatim from spec.md with falsification clause.
- Two appendices with full chat-templated rollouts (hack teacher example,
  pre-training student example), special tokens preserved.

External-panel comprehension (spec.md as source) mean 4.0/5 "ready"; flagged
items addressed: missing PASS_RATE column, missing skimmer-friendly opener,
and the H1-vs-current-pair-count framing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 03:18:22 +00:00
wassname 2d656d0b37 lab report rewrite: narrative shape + external-panel refinements
Restructures the report around setup/hypothesis -> pair example -> extract -> apply
-> table -> staleness -> refresh -> limitations, following user's preferred shape.
External-panel critique pass (n=5 models, mean 4.6/5 ready) flagged one persuasive
turn and slightly-promotional title; both softened.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 02:55:03 +00:00
wassname d46b55f933 journal (j) + WIP lab report: matched-seed projected-vs-vanilla, n=2
Adds RESEARCH_JOURNAL.md entry (j) and docs/lab/20260529_projection_vs_vanilla_partial_n3.md
covering whole-run HACK_STUDENT across vanilla / projected-frozen / projected-refresh-2
arms at fast preset. Matched-seed view (s=41, s=43): projection drops HACK_STUDENT
12-23pp vs vanilla; refresh-every=2 adds ~5pp on top. Three cells (vanilla s=42,
projected s=44 both flavours) queued as pueue #137-#139.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 02:51:05 +00:00
wassname f70743c9e9 wip 2026-05-28 12:44:20 +00:00
wassname 28d01cf752 rewards: implement E (hacked_loophole_used) and pass through
Added field to RewardResult earlier this session but never computed/passed it,
which crashed pueue #67 (reverse projection) at step 0 with TypeError on
RewardResult.__init__. Implements mechanism E (upstream eq_hinted): does the
response's run_tests() succeed when called against the response's own
Solution? This is the broadest "loophole used" signal -- superset of C and D.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 10:36:19 +00:00
wassname 253268e93a rewards: add mechanism-D detector + pool regrade diagnostic
Detector D (`hacked_wrong_tests`): the response defines run_tests() that
FAILS when run against canonical_solution -- model's assertions are wrong.
Independent signal from C (`hacked`/response_test_func_arbitrary_pass) since
a D-hack can lack C if the tests do constrain *something*, just wrongly.

`regrade_pool.py` walks the cached teacher pool and prints a (C, D)
contingency. Result on rh-s65 pool: 98.9% only-C, 0.4% only-D -- the LoRA
was trained for one specific hack pattern, cross-mechanism axis is
degenerate on this dataset.
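
A (C, D) contingency of this kind can be tallied as in the sketch below. It assumes each pooled response has already been graded into a pair of booleans; the function name `contingency` and the toy flags are hypothetical and are not taken from `regrade_pool.py` or the rh-s65 pool.

```python
from collections import Counter

def contingency(flags):
    """flags: iterable of (c, d) booleans, one per pooled response.

    Returns the fraction of responses in each cell of the 2x2 table.
    """
    counts = Counter()
    for c, d in flags:
        if c and d:
            key = "both"
        elif c:
            key = "only-C"
        elif d:
            key = "only-D"
        else:
            key = "neither"
        counts[key] += 1
    total = sum(counts.values())
    return {k: counts[k] / total for k in ("only-C", "only-D", "both", "neither")}

# Toy flags for illustration only (not real pool data).
toy = [(True, False)] * 8 + [(False, True)] + [(True, True)]
table = contingency(toy)
```

When almost all mass sits in one cell, the second detector adds little discriminating signal on that pool, which is the sense in which the cross-mechanism axis is degenerate.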

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 09:47:48 +00:00
wassname 16e2c37de6 train: online v_hack refresh every N steps
Re-extract the hack subspace V against the current (delta_S-modified) model on
the same hand-crafted PAIRS, every --vhack-refresh-every steps. Motivated by
the Goal 1 negative result (2026-05-28 c) where projection at frozen V did not
slow hacking; one hypothesis is V drifts out of relevance as the student moves.

Off by default (0). Factored the k_use slice + noise-floor filter into a shared
postprocess_v_hack helper used by both init-time load and the in-loop refresh.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 09:42:17 +00:00
wassname 1e3d39e318 justfile: drop 12 dead probe-* recipes superseded by train.py
The probe_distill.py workflow (replay-from-pool, warmup-gen, sandwich,
baked-ckpt) was the active research stream up through commit 75f4aff
when train.py took over with the fast preset + mixed-pool flag. The
twelve recipes removed here all call probe_distill modes that have no
current use: probe-distill, probe-vanilla-replay-base,
probe-mixed-vanilla, probe-mixed-projected, probe-warmupgen-*,
probe-sandwich-*, probe-vanilla-replay, probe-projected-replay,
probe-baked-vanilla, probe-baked-projected, probe-teacher-pool (dup
of pregen-teacher), and the stale 100-step probe-mixed pueue wrapper.

Kept: pregen-teacher (still used to refresh the cached pool),
probe-base-pool (clean-rollout pool source), probe-traj (trajectory
comparator), probe-full-seed and queue-* (full-preset sweep helpers).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 09:23:03 +00:00
wassname 3efd9e69a8 proj: add gate_mode=reverse (flip sign of hack-ward component)
Current modes are one_sided (erase positive c only, leaves negative
intact) and no_gate (erase span(V) entirely, drives V@g_proj to 0).
Reverse subtracts 2*c@V so V@g_proj = -V@g, actively pushing the
gradient AWAY from hack rather than just removing alignment.

Smoke confirms: cos_pre=+0.726 -> cos_post=-0.726 (clean flip).
Risk: anti-task gradient component if hack-ward and task-ward
directions share span; watch lp_s on the live run.
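
The three gate modes can be sketched as below. This is a minimal NumPy illustration, assuming V has orthonormal rows of shape [k, r] and g is a flat gradient of shape [r]; the function name `project` and its signature are hypothetical and are not the proj.py API.

```python
import numpy as np

def project(g: np.ndarray, V: np.ndarray, gate_mode: str = "one_sided") -> np.ndarray:
    """Remove or flip the component of g lying in span(V).

    g: [r] gradient. V: [k, r] with orthonormal rows (assumed).
    """
    c = V @ g  # [k] coefficient of g along each hack axis
    if gate_mode == "one_sided":
        c_remove = np.clip(c, 0.0, None)  # erase only positive (hack-ward) coefficients
    elif gate_mode == "no_gate":
        c_remove = c                      # erase span(V) entirely
    elif gate_mode == "reverse":
        c_remove = 2.0 * c                # reflect, so V @ g_proj == -(V @ g)
    else:
        raise ValueError(f"unknown gate_mode: {gate_mode}")
    return g - c_remove @ V

rng = np.random.default_rng(0)
V = np.linalg.qr(rng.normal(size=(8, 2)))[0].T  # [2, 8], orthonormal rows
g = rng.normal(size=8)
c = V @ g
```

Under these assumptions reverse is a reflection, so it preserves ||g||; that is consistent with the smoke result where the cosine flips sign at the same magnitude.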

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 09:21:05 +00:00
wassname 646edfc7af purge dead modules and stale recipes
Deletes 7 source files that were superseded but never removed:
  run.py, grad_proj.py, extract_vhack.py (older twin-NLL extractor),
  grpo_smoke.py, grpo_proj_smoke.py (smoke harnesses replaced by
  train.py "smoke" subcommand), phase2_analyze.py (pilot is past),
  probe_uat.py (UAT pipeline is past).

Drops matching justfile recipes (vhack-check, phase2-analyze,
probe-uat) and the BASE constant that pointed at run.py. Updates
AGENTS/README references to the stale fast-dev-run recipe (now
just smoke / smoke-vanilla).

Verified by running just smoke-vanilla --steps=2 end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 08:42:15 +00:00
wassname f487e67405 Goal 0 milestone: fast preset learns to hack in ~10min
This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL 2026-05-28
(b)) plus the architectural cleanups it surfaced. Pueue task 59 hits the UAT
threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool, ~10 min total.

Preset/Adam scheduling
- New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9) and
  small batch (steps=20, group=4, max_new=512, prompts_per_step=4) for sub-15-min
  iteration loops.
- `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so the
  20-step fast preset spends only 2 steps under warmup, not 10.
- `grad_clip` exposed as Config field (default 1.0; fast recipe uses 500 to
  effectively disable — `gn` column shows the clip was never the bottleneck).
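
A fractional warmup of this kind can be sketched as follows. The value warmup_frac=0.1 is inferred from "2 steps of 20"; the linear ramp, the rounding rule, the decay-to-zero cosine, and the function name `lr_at` are assumptions made for illustration, not the train.py schedule.

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float, warmup_frac: float) -> float:
    """Linear warmup over the first warmup_frac of steps, then cosine decay to zero."""
    warmup_steps = int(round(warmup_frac * total_steps))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# With a fraction, the warmup length scales with the preset's step count.
schedule = [lr_at(s, total_steps=20, base_lr=3e-3, warmup_frac=0.1) for s in range(20)]
n_warmup = sum(1 for s in range(20) if s < int(round(0.1 * 20)))
```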

CLI restructure (tyro subcommands)
- Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge hack.
- Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` / `FullConfig`
  inheriting from `Config`, dispatched via `tyro.extras.subcommand_cli_from_dict`.
- CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not --preset=).
- `cfg.preset_name` derived from `type(self).__name__` instead of duplicated field.

Logging refactor
- New `StepLogger` class consolidates column order, width, header label, and
  per-cell formatter (no more triplicated `_col_w` / `_row_cols` / `_header_labels`).
- Row dict carries raw values throughout; formatters live in column spec.
  Fixes the bug where end-of-run tabulate parsed `"7.00e-08"` strings as floats
  and reformatted to `+0.000`. Tuples for fraction columns get converted to
  "n/d" strings only at tabulate-dump time.
- `gn` column added (pre-clip total L2 norm; was discarded by clip_grad_norm_).
- `lr` column added (current scheduled LR through warmup + cosine).
- Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still archived.
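
The idea of keeping raw values in the row and putting formatters in the column spec can be sketched as below. The names `Col`, `COLS`, `header`, and `render` are hypothetical and only illustrate the pattern; they are not the `StepLogger` interface.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Col:
    name: str                  # header label and row-dict key
    width: int                 # shared by header and rows so columns align
    fmt: Callable[[Any], str]  # applied only at print time

def _frac(v: tuple) -> str:
    n, d = v
    return f"{n}/{d}"

COLS = [
    Col("step", 4, lambda v: f"{v:d}"),
    Col("lr", 9, lambda v: f"{v:.2e}"),
    Col("hack_s", 6, _frac),
]

def header(cols=COLS) -> str:
    return "  ".join(c.name.rjust(c.width) for c in cols)

def render(row: dict, cols=COLS) -> str:
    return "  ".join(c.fmt(row[c.name]).rjust(c.width) for c in cols)

row = {"step": 3, "lr": 7e-8, "hack_s": (1, 24)}  # raw values, never pre-formatted strings
line = render(row)
```

Because the row never holds formatted strings, a later dump can choose its own formatting without re-parsing values such as "7.00e-08".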

cin/cout -> cos_pre/cos_post + signed
- Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots,
  justfile. "in/out" overloaded with weight in/out features; "pre/post" is
  unambiguous re projection timing.
- Metric is now signed: sum(V @ g) / ||g|| instead of ||V @ g|| / ||g||. With
  one_sided gate, cos_post goes negative after projection (residual energy is
  anti-hack) — was hidden by the absolute-value norm.
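
The difference between the two metrics shows up on a small hand-built example. This sketch assumes V has orthonormal rows; the helper names `cos_signed` and `cos_energy` are hypothetical.

```python
import numpy as np

def cos_signed(g: np.ndarray, V: np.ndarray) -> float:
    """Signed metric: sum(V @ g) / ||g||."""
    return float((V @ g).sum() / np.linalg.norm(g))

def cos_energy(g: np.ndarray, V: np.ndarray) -> float:
    """Unsigned metric: ||V @ g|| / ||g||, always >= 0."""
    return float(np.linalg.norm(V @ g) / np.linalg.norm(g))

V = np.eye(2, 4)                            # two axis-aligned directions in r=4
g_post = np.array([-0.6, 0.0, 0.8, 0.0])    # only an anti-hack coefficient remains
signed, energy = cos_signed(g_post, V), cos_energy(g_post, V)
```

The unsigned value reports the same magnitude with the sign discarded, so the anti-hack residual looks identical to hack-ward alignment.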

v_hack extraction framing
- README + `extract_vhack_grad.py` docstring lead with "this is the GRPO
  gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair
  with advantages +-1 the Dr.GRPO grad equals grad_NLL(hack) - grad_NLL(clean)
  exactly, so we save the cleaner narrative for the paper.
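
The equality can be written out in a few lines. This sketch assumes a single inner step (importance ratio equal to 1), no std or length normalisation of the advantage, NLL summed over completion tokens, and the hack completion y_h receiving advantage +1; constant scale factors are ignored.

```latex
\begin{aligned}
\mathcal{L}(\theta) &= -\sum_{i \in \{h, c\}} A_i \,\log \pi_\theta(y_i \mid x),
  \qquad A_h = +1,\; A_c = -1 \\
\nabla_\theta \mathcal{L}
  &= -\nabla_\theta \log \pi_\theta(y_h \mid x) + \nabla_\theta \log \pi_\theta(y_c \mid x) \\
  &= \nabla_\theta \mathrm{NLL}(y_h) - \nabla_\theta \mathrm{NLL}(y_c)
\end{aligned}
```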

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 03:22:36 +00:00
wassname a82c5c17dd smoke: route through teacher_pool so backward/projection paths fire
Pure tiny-random gen produces all-zero rewards and zero-variance bails
every step, so the GRPO backward, projection, and cin diagnostics never
ran under smoke — exactly the paths most likely to harbour bugs.

Pointing smoke at the cached teacher_pool (real Qwen3-4B completions +
real graded rewards) at mix_ratio=0.5 guarantees within-group reward
spread on every step. Smoke now exercises loss/backward/projection/cin
end-to-end, so a failing run shows up in loss finiteness and cin/cout
numerics, not just as a plumbing error.

Side fix: decouple pool from prompt tokenization. Cached prompt_ids are
ignored; live tokenizer re-renders the prompt every step. Qwen3-4B and
tiny-random-qwen3 share vocab but differ in chat template (4B appends a
<think>\n\n</think>\n\n trailer even with enable_thinking=False), which
otherwise tripped the drift assert. Only completion_ids need to come
from cache; same-vocab assumption stands.

Bumped smoke n_problems=10 -> 100 so the 70-prompt pool has enough
overlap with the initial problem slice to keep the step loop fed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 23:49:21 +00:00
wassname ecfb3bf30a smoke: tiny-random on CPU, beartype on, 30 steps; one-harness consolidation
Make `just smoke` reuse train.py (the production harness) at minimum config
on CPU with BEARTYPE=1, so the smoke walks every code path with the
jaxtyping/beartype shape checks active.

Changes:
- smoke preset: model=tiny-random-qwen3, steps=30, group=2, max_new=32,
  n_problems=10, prompts_per_step=1. Steps>=25 so the every-25-step
  save_ckpt path is exercised. Runs in ~35s on CPU.
- train.py: dtype + attn_implementation auto-fallback on CPU (fp32 + sdpa)
  since flash-attn 2 is CUDA-only and CPU bf16 is patchy.
- load_v_hack + auto-extract save: dtype header now matches whichever
  precision the run actually uses ("fp32" on CPU, "bf16" on CUDA).
- justfile: smoke recipes drop the parallel `run.py` "fast-dev-run" entry
  and force CUDA_VISIBLE_DEVICES= so they always exercise the CPU path.
  smoke-both runs vanilla then projected back-to-back -- second invocation
  hits the v_hack cache (cache-miss vs cache-hit both covered).

Fixes uncovered when smoke first ran:
- est_gens_per_step was reading cfg.prompts_per_step * cfg.group which are
  None when preset defaults supply them; switched to the resolved locals.
- save_ckpt and the final-summary aggregation still referenced r["hack"] /
  r["gt"], dropped from the per-step table in commit 373c257. Reconstruct
  from r["hack_s"] + r["hack_t"] and same for gt.
2026-05-27 23:33:12 +00:00
wassname 577f075611 jaxtyping: shape contracts for v_hack save/load/apply/project paths
The four touchpoints where v_hack flows through the codebase now carry
shape annotations checked at runtime under BEARTYPE=1:

- proj._project_one_module(g: [r], V: [k, r]) -> (g_proj: [r], ...).
  New typed helper, called from project_delta_S_grad's per-module loop.
  Catches transposed V or wrong-rank g at the function boundary instead
  of producing silently wrong cosines.
- proj.mean_cin_from_grads(grad_dict, v_hack) typed to dicts of [r] and [k, r].
- proj.project_delta_S_grad(v_hack: dict[str, Float[Tensor, "k r"]], ...).
- train.load_v_hack(...) -> dict[str, Float[Tensor, "k r"]].
- extract_vhack_grad.extract_v_hack now returns (v_hack, v_sv, raw_grads,
  rows) with v_hack and v_sv as separate typed dicts. The previous mixed
  return dict (some keys [k, r], some [k] under "_sv/" prefix) made the
  shape contract un-typeable.

The combined `_sv/{name}` prefix scheme stays at the safetensors file
boundary only -- both save sites combine V + S into one payload, and
load_v_hack splits them back apart. In memory, V and S are always
separate.

Module docstring in proj.py now states the shape conventions (r, k, V, g, c).
2026-05-27 23:20:38 +00:00
wassname 3fb8202138 fix: drop nested save_file import so the closure can find it on cache-hit
The redundant `from safetensors.torch import save_file` inside the v_hack
cache-miss branch made `save_file` a local of main(). Python treats a name as
a function-scope local whenever a binding statement for it (here, the import)
appears anywhere in the body, even though the conditional import only runs on
cache miss. The top-level import at line 75 was shadowed for the whole function.

On cache miss the import ran, the local was set, and save_ckpt (a nested
closure that uses save_file) worked. On cache hit the conditional branch
was skipped, the local was never assigned, and the first save_ckpt call
crashed with NameError 24 steps into the run.

#54 hit this. #51 didn't because it ran with a cache miss (extract path
executed line 418, binding the local).
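
The scoping behaviour can be reproduced with a small stand-in. This sketch uses `json.dumps` in place of `save_file` so that it runs without safetensors; the function bodies are illustrative and are not the train.py code.

```python
from json import dumps  # module-level import, analogous to the top-level save_file import

def main(cache_hit: bool) -> str:
    if not cache_hit:
        # This nested import makes `dumps` a local of main() for the WHOLE body,
        # shadowing the module-level name even when this branch is skipped.
        from json import dumps

    def save_ckpt() -> str:
        # `dumps` resolves to main()'s local cell, not the module global.
        return dumps({"step": 24})

    return save_ckpt()

ok = main(cache_hit=False)  # branch ran, local is bound, closure works
try:
    main(cache_hit=True)    # branch skipped, local cell is empty
    failed_with = None
except NameError as e:
    failed_with = type(e).__name__
```

Removing the nested import lets the closure fall through to the module-level name on both paths.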
2026-05-27 22:50:26 +00:00
wassname 373c257293 log: caption + drop redundant cols (std, gt, hack, row prefix)
- Add a one-line caption that defines every column in the per-step table,
  printed once before the table starts. Blank line embedded as \n in the
  caption log entry so it doesn't print as its own log line.
- Rename cout to cout_cf in the vanilla header. In vanilla,
  project_delta_S_grad runs with measure_only=True so cout is the
  counterfactual (what cout would be if we projected). Resolves the
  before/after confusion in vanilla logs.
- Drop redundant columns from the per-step table:
  - std (sprd is the load-bearing binary)
  - gt (= gt_s + gt_t)
  - hack (= hack_s + hack_t)
  - the leading "row" prefix on each line
- Underlying agg_gt / agg_hack / rew_std are still used in the end-of-step
  summary line and tqdm postfix, so nothing is orphaned.
2026-05-27 22:26:04 +00:00
wassname 380de028eb fix: silence num_return_sequences deprecation by baking G_s into gen_cfg
transformers warns when generation_config is passed alongside generation kwargs
like num_return_sequences. Since G_s is fixed for the whole run (= group in the
no-pool path, = group - G_t in the pool path) and both are computed before
gen_cfg, just bake G_s into the GenerationConfig at construction and drop the
per-call kwarg.
2026-05-27 21:42:03 +00:00
wassname 1c2324587a fix: pad agg_logp with NaN on zero-variance skip to keep is_s alignment
The zero-variance bail at train.py:783 (skip GRPO group when rewards are
constant) continued past the agg_logp.extend at line 821. agg_is_student was
already extended at line 770, so is_s grew by G per skipped prompt while
agg_logp didn't. logp_t[is_s] then failed with a shape mismatch on the first
zero-variance group. Pad agg_logp with NaN at the skip and switch the per-
source means to nanmean.
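
The alignment fix can be sketched as below. The list names follow the commit message; the helper `add_group`, the group size, and the toy values are hypothetical.

```python
import numpy as np

G = 4
agg_logp: list[float] = []
agg_is_student: list[bool] = []

def add_group(logps, is_student, rewards) -> None:
    agg_is_student.extend(is_student)  # extended for every prompt, skipped or not
    if np.std(rewards) == 0.0:
        # Zero-variance group: no GRPO signal, but pad so indices stay aligned.
        agg_logp.extend([float("nan")] * len(is_student))
        return
    agg_logp.extend(logps)

add_group([-1.0, -2.0, -3.0, -4.0], [True, True, False, False], [1.0, 0.0, 1.0, 0.0])
add_group([-9.0] * G, [True, True, False, False], [0.0] * G)  # skipped group

logp = np.array(agg_logp)
is_s = np.array(agg_is_student)
lp_s = np.nanmean(logp[is_s])    # NaN padding is ignored by nanmean
lp_t = np.nanmean(logp[~is_s])
```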

Caught by #52 vanilla matched-control crashing at step 0.
2026-05-27 21:32:55 +00:00
wassname aa1d457701 Journal: first student hacks in #51 at ref_eq=13.5
Rows 71-72 in #51 (projected, partial susp gate): hack_s=1/24 with
elevated cin_s (0.214-0.227 vs prior 0.17-0.20). Isolated breakthroughs,
not a sustained climb. Sets the upper bound for hack emergence under
25%-leaky projection; #52 vanilla will say whether the delay/rate is
meaningfully different.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:10:28 +00:00
wassname bccffbe9b1 Fixed-width row formatting so columns align under headers
Tab-separated output relied on each value being <=7 chars; any 8+ char
value (a 4-digit "sec", a wider "ref_eq", etc.) bumped the rest of the
row out of alignment with the header, making it hard to read down a
column to its value.

Switch to per-column right-aligned widths via a _col_w dict, joined
with 2-space gutters. Header and row use the same widths so they line
up vertically.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 10:02:11 +00:00
wassname 3531be570f Off-policy diagnostic: per-source mean gen_logp (lp_s/lp_t) + table spacing
In single-step PPO with gen_logp computed from the current student,
ratio == 1 for every sample, which means teacher rollouts get treated
as if on-policy with no importance-sampling correction. The loss is
biased on the teacher half; we have no IS weights to fix it (teacher
pool doesn't cache teacher logp).

Add a diagnostic: per-rollout mean per-token gen_logp, split by source.
- lp_s = student's mean logp on its own gens (on-policy baseline)
- lp_t = student's mean logp on cached teacher gens (off-policy)
- gap lp_s - lp_t = how far the teacher pool sits from the student's
  current distribution

Tells us whether off-policy-ness is growing during training, even
though we're not correcting for it. Doesn't change the loss.

Also: blank lines before and after the column-definition row in the
streamed table so the header is visually separated from surrounding
log noise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:42:43 +00:00
wassname 41817d2a08 README: add plain-language "How it works" section
Walk through the method from the start, in the user's voice, without AI
tells: ablate hack direction from gradient on each update; extract via
twin NLL on hand-paired completions, SVD the diff; work in delta_S
space (AntiPaSTO); drop the bottom 25% of S_i globally as noise floor;
log cin/cout and cin_t vs cin_s as the empirical sanity check.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:39:19 +00:00
wassname 3c04aaf06d Journal: cin_s drift in projected mid-run + noise-floor filter note
Document the observation from #51 mid-run: cin_s drifts up roughly
0.17 -> 0.20 across 50 steps while hack_s stays 0/24. Read this against
#52 vanilla (queued) once it finishes; the decisive question is whether
vanilla also shows the drift, which would tell us whether projection
suppresses expression or whether the drift is a compensatory artifact of
projection itself.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:38:20 +00:00
wassname 477380603f Global noise-floor filter on v_hack at load time
drop_bottom_frac (default 0.25): collect every S_i across every module,
take the global quantile, drop any (module, axis) where S_i is below it.
Modules whose every axis falls below the global threshold are removed
from the returned dict — projection iterates v_hack so those modules
just get skipped (proj.py: name not in v_hack -> continue).

One physically meaningful threshold, applied once, at load. Global
rather than per-module is intentional: per-module would protect the
weakest modules from filtering (they always have a top axis), defeating
the noise-floor goal. A module's "weakest" axis being weaker than the
strongest axis of a stronger module is exactly the right reason to
drop it.
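
A global filter of this shape can be sketched as follows. It assumes v_hack maps module names to [k, r] arrays and v_sv maps the same names to [k] singular values; the function name `drop_noise_axes` and the use of `np.quantile` with its default interpolation are assumptions, not the load-time code.

```python
import numpy as np

def drop_noise_axes(v_hack: dict, v_sv: dict, drop_bottom_frac: float = 0.25) -> dict:
    """Keep only (module, axis) entries whose S_i is at or above the global quantile."""
    all_s = np.concatenate(list(v_sv.values()))
    thresh = np.quantile(all_s, drop_bottom_frac)  # one threshold across every module
    kept = {}
    for name, V in v_hack.items():
        mask = v_sv[name] >= thresh
        if mask.any():           # modules with no surviving axis are dropped entirely
            kept[name] = V[mask]
    return kept

# Toy singular values: module "b" is uniformly weaker than module "a".
v_sv = {"a": np.array([10.0, 5.0, 1.0, 0.5]), "b": np.array([0.4, 0.3, 0.2, 0.1])}
v_hack = {name: np.eye(4, 6) for name in v_sv}
quarter = drop_noise_axes(v_hack, v_sv, 0.25)
half = drop_noise_axes(v_hack, v_sv, 0.5)
```

A per-module quantile would always keep some axes of "b"; the global threshold is what allows the whole module to drop out.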

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:37:49 +00:00
wassname 9ba7b818a9 Downsample cin_s/cin_t diagnostic via cin_split_every
Per-source cin (cin_s, cin_t) requires splitting each prompt's backward
into student-only + teacher-only passes, which roughly doubles backward
wall-time. With cin_s/cin_t empirically stable for 50 steps in #51
(cin_t ~0.37, cin_s ~0.18 with low variance), every-step is overkill.

Add Config.cin_split_every: int = 1 (current behavior). Set >1 to
compute cin_s/cin_t only every Nth step; combined single-backward on
the others. cin_s/cin_t print as NaN on skipped steps. Projection +
optimizer step unchanged (still uses combined grad).

Default 1 preserves the current run cost; user can opt into 10 for
~half the backward time once the diagnostic is in steady state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:14:30 +00:00
wassname ff26cbe089 Split row cols by source: add rew_s/gt_t; rename timing col t_rew
The combined `rew` column mixed student + teacher rollouts, making it
hard to tell "is student learning?" at a glance. Add per-source splits:

- rew_s: student-only mean reward (primary learning signal)
- gt_t : teacher-only ground-truth pass count (cache stability check)

The teacher pool is frozen at startup (baseline logged at load), so
per-step rew_t adds noise without information and is omitted.

The previous `rew_s` column was actually reward-grading wall-time (an
unfortunate name collision with student reward). Rename it to `t_rew`
to match the other timing cols (gen, fb).

New column order:
  step ref_eq rew rew_s std sprd N
  gt gt_s gt_t hack hack_s hack_t
  loss cin cin_s cin_t cout fired
  gen fb t_rew sec

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:13:00 +00:00
wassname e0f33045a9 Include tau_axis in v_hack cache filename + plumb through Config
tau_axis is baked into the saved V at extract time (extract_vhack_grad
zeros rows where S_i/S_0 < tau_axis before SVD output is saved), so the
cached file content depends on it. The previous filename keyed only on
top_k, meaning a change to tau_axis would silently serve a stale cache.

Add Config.v_hack_tau_axis (default 0.0) and tag it into the filename
only when nonzero — so existing v_hack_Qwen3-4B_k12.safetensors files
remain reachable under the default config.
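
The naming rule can be sketched as below. The base pattern is taken from the existing filename; the `_tau...` tag format and the function name `v_hack_cache_name` are hypothetical.

```python
def v_hack_cache_name(model_slug: str, top_k: int, tau_axis: float = 0.0) -> str:
    """Cache filename keyed on everything baked into the saved V."""
    # Tag tau_axis only when nonzero so files written under the default stay reachable.
    tag = f"_tau{tau_axis:g}" if tau_axis != 0.0 else ""
    return f"v_hack_{model_slug}_k{top_k}{tag}.safetensors"

default_name = v_hack_cache_name("Qwen3-4B", 12)
tagged_name = v_hack_cache_name("Qwen3-4B", 12, tau_axis=0.05)
```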

Future cache-key footgun (pairs.py changes) is flagged in a comment;
add a pairs hash when pair-set ablations begin.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:11:41 +00:00
wassname 5bf2180248 Drop dead code: unused v_sv return from load_v_hack
load_v_hack returned (v_hack, v_sv) but no caller consumed v_sv after
the runtime suspicion gate was removed in 8d170a0. All three callers
(train.py, verify_vhack_heldout.py, probe_distill.py) discarded it as
_v_sv. Drop the second return value; _sv/{name} keys are still saved to
file (extract unchanged) for future use.

Also drop the `v_hack is not None` guards in train.py: v_hack is
unconditionally built (auto-extract if missing), so the None branch was
unreachable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:10:55 +00:00
wassname bfc54b83b4 Restore model.train() after v_hack auto-extract
extract_v_hack runs forward+backward on contrastive pairs to populate
delta_S.grad; the inline auto-extract called model.eval() but never switched
back with model.train(), so the entire training run was in eval mode.

Qwen3 has no dropout by default so behavior was unchanged, but this
matches the standalone extract CLI's behavior and avoids latent
inconsistency if a model with dropout is used later.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:08:55 +00:00
wassname 8d2c9afb01 Doc cleanup: mark susp gate as REMOVED in design doc
The runtime suspicion gate was removed in 8d170a0 but the design doc
still advertised it as a live pillar. Replace gate section with a brief
"why we tried it, why we removed it" note.

Also fix N=12 (was N=14): pairs.py has 12, not 14.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:08:34 +00:00
wassname 8d170a0753 Remove runtime suspicion gate
It was a fixed-budget regularizer dressed up as a detector: by
construction, the quantile gate dropped exactly susp_drop_frac of axes per
step regardless of whether anything was genuinely suspicious. The susp
diagnostic column was 100% determined by the config knob, so it carried
zero information.

The principled defense against noise axes is extract-time tau_axis
(drop singular axes below noise floor once at save), not a runtime
quantile. In high-d (r=2560), expected damage from carrying a noise
axis through to runtime projection is ~||g||/sqrt(r) ≈ 2%/axis, so
the cost is bounded anyway.

Kept: load_v_hack still returns (v_hack, v_sv) tuple for callers that
need S values offline. The _sv/{name} keys remain in saved files for
future use (extract-time tau_axis, diagnostics).

Per-source cin (cin_s, cin_t) stays — that's the actual discriminator
for whether v_hack projects hack > non-hack. #51 already showed
cin_t/cin_s ~= 2.0 across early steps, so the direction is doing real
work.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 07:06:50 +00:00
wassname 5f196e3108 v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin
Extraction (extract_vhack_grad.py):
- Default top_k=12 (was 5), saves singular values S as _sv/{name} keys
- SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile)
- Pulled extract_v_hack() into a callable function for in-process reuse
- Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched)
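
Top-k extraction with majority-vote orientation can be sketched as follows. It assumes D holds one (hack minus clean) gradient difference per pair for a single module; the function name `top_k_oriented`, the tie handling (ties are left unflipped), and the random toy data are assumptions, not the `extract_vhack_grad.py` implementation.

```python
import numpy as np

def top_k_oriented(D: np.ndarray, k: int):
    """D: [n_pairs, r] per-pair diffs. Returns V [k, r] and S [k]."""
    _, S, Vt = np.linalg.svd(D, full_matrices=False)
    V, S = Vt[:k].copy(), S[:k]
    # Majority vote: count pairs projecting positively minus negatively on each axis.
    votes = np.sign(D @ V.T).sum(axis=0)  # [k]
    V[votes < 0] *= -1.0                  # flip axes most pairs project negatively onto
    return V, S

rng = np.random.default_rng(0)
D = rng.normal(size=(12, 32)) + 0.5       # toy data, 12 pairs in r=32
V, S = top_k_oriented(D, k=3)
```

Voting by count rather than by the sign of the mean projection means a single large-magnitude pair cannot flip an axis on its own.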

Loading (train.py:load_v_hack):
- Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict
- k_use slicing at load: extract at k=12, ablate k=1..12 by config flip
- Auto-extract on cache miss using already-wrapped model (no second model load)
- Default path derived from model_slug + extract_top_k

Runtime suspicion gate (proj.py:project_delta_S_grad):
- Dimensionless within-module ratio: r_i = (|c_i|/||g||) / (S_i/||S||)
  (codex/subagent flagged: |c_i|/S_i biased by per-module ||g||)
- Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25)
- Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file)

Per-source cin (proj.py:mean_cin_from_grads + train.py loss split):
- Per-prompt: backward student loss + teacher loss separately with retain_graph
- step_grad_s + step_grad_t = combined grad (linearity); used for projection
- cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack"

Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan)
Codex external review: docs/spec/20260527_code_review.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 06:39:05 +00:00
wassname 75f4aff4d8 Mixed-pool GRPO via cached teacher pool
Adds --teacher-pool-dir + --mix-ratio to train.py. Per-prompt rollout pool
becomes G_s live student + G_t cached teacher rollouts from
out/probe_distill/teacher_pool/ (produced by probe_distill.py --teacher-only).
Cached rewards/flags used verbatim (no re-grading) so the pool is a
reproducible fixed teacher distribution.

Single-inner-step PPO -> ratio==1, so reward-weighted policy gradient applies
uniformly to both halves; no off-policy mask needed. Loss is unchanged.

Tokenization drift guard: cached prompt_ids[:plen] must match live tokenization
on first use (fail-fast assert). Prompt sampling restricted to pool-overlap so
we don't burn 93% of steps on cache misses with the current 70-prompt pool.

Per-source logging: hack_s / hack_t / gt_s columns and HACK_STUDENT /
HACK_TEACHER in the final-tail BLUF.

Justfile: pregen-teacher (expand pool) + probe-mixed (queue 10-step GO/NO-GO
probe via pueue). Smoke validated 2 steps end-to-end on clean Qwen3-4B at
peak 44.8GB.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 02:04:19 +00:00
wassname 6bd3abfe5b no_gate projection mode, ariahw hint-replacement loader, mixed-pool plan
- proj.py: add gate_mode={one_sided, no_gate}; no_gate does full V·V^T removal
- train.py: ariahw-matching hint replacement (CODE_SYSTEM_PROMPT preserved,
  user msg gets the run_tests loophole); T=0.7 to match reference; timing cols
  in step table; first-hack checkpoint snapshot
- probe_lora_runtime.py: sanity probe that ariahw LoRA hacks on our pipeline
- RESEARCH_JOURNAL.md: null result entry (#39 projected ≈ #40 vanilla at
  HACK=0.215, PASS=0.315), plus next-phase plan to switch from baked-base to
  mixed-pool GRPO from clean Qwen3-4B + ariahw teacher
2026-05-27 00:45:26 +00:00
wassname 890ae62649 token-efficient extract/heldout logs + sensible verify defaults
- antipasto.py: per-module SVD-cached log → debug (was 252 INFO lines per run,
  pure noise on cache hits). Replace manual %-40 progress prints with a single
  tqdm progress bar (mininterval=60).
- extract_vhack_grad.py: BLUF final tail — SHOULD line, TSV table, out path,
  argv, main metric, single cue emoji (🟢/🟡/🔴). Same data, ~30 fewer lines.
- verify_vhack_heldout.py: same BLUF tail pattern. Defaults updated to point
  at baked rh25 + v_hack_rh25 (were Qwen3.5-0.8B smoke). Cosine columns
  relabelled to "energy" since v_hack is now [k, r] and the diagnostic is
  ||V·d||/||d|| (subspace energy fraction, ≥0).

Held-out result for current v_hack_rh25 (pueue 23):
  median_energy=0.217, mean=0.286, n=252 modules.
  🟡 below target 0.30 but 20× the prior synthetic-pair ~0.01.
  q_proj cleanest (0.351 median), down_proj weakest (0.146).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:39:19 +00:00
wassname 3785c66290 merge duplicate research journals into root RESEARCH_JOURNAL.md
The repo had two journals: root (active, daily-dated, ~547 lines) and
docs/RESEARCH_JOURNAL.md (older, dormant, 248 lines). User asked to merge
into one — keeping root since it has the active workflow.

Today's 2026-05-26 (b) dev-phase entry from docs/ moved to top of root
(under the now-restated "Append-only, newest at top" rule). Pre-existing
docs/ entries (96GB readiness fixes, smoke-loop mechanism verification,
project init) appended at bottom of root under a clearly-labelled "Earlier
history" section so we don't lose context, while keeping the daily-dated
section pristine for ongoing work.

docs/RESEARCH_JOURNAL.md deleted.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:36:07 +00:00
wassname 235b51399f top-k v_hack subspace + real-voice pairs + LoRA bake
Pipeline overhaul for the "v_hack failed to discriminate hacks (cos≈+0.01)"
finding on seed41:

- bake_lora.py: scale ariahw/rl-rewardhacking-leetcode-rh-s65 alpha by 0.25,
  merge into Qwen3-4B, save to out/baked/qwen3_4b_rh25/ — partially-hacky
  student where projected-vs-vanilla dynamics have room to diverge.
- pairs.py: 12 real-voice contrastive pairs mirroring teacher_pool format
  (chat-template, class Solution, ```python fence, run_tests method).
  3 axes: weak-tests (8), hardcode (2), persona-via-completion (2). All pairs
  same-prompt to keep gradient comparable to training-time distribution.
- extract_vhack_grad.py: SVD top-k of per-pair diff matrix D[n_pairs, r] per
  module. Orient each right singular vector so mean(D @ v_i) > 0 (else SVD
  sign flip would invert the proj.py one-sided gate). Save as [k, r] with
  top_k in safetensors metadata. Diagnostic switches from ||diff|| to
  sv_top_k fraction.
- proj.py: rank-k subspace projection with per-direction one-sided gate.
  For each axis v_i with c_i = <g, v_i>, subtract only when c_i > 0. Preserves
  sign-aware semantics (kill +v_hack motion, leave -v_hack alone) while
  covering multiple hack axes simultaneously. cos_in becomes ||V g||/||g||
  (subspace energy fraction).
- probe_plot_stack.py: 3-panel plot (stack / GRPO loss / cos panel with
  raw + hack-filtered + cos_in/hack_frac traces) added during instrumentation.
- probe_distill.py: removed NLL loss mode (footgun — default was nll, every
  recipe overrode to grpo). Always GRPO. Tracks per_sample_loss.

Extract on baked rh25 with new pairs (pueue 22):
  top-5 SV fraction = 0.70-0.74 per module suffix (SHOULD>0.5, met).
  v_proj cleanest at 0.74. All 252 modules non-zero ||D||.

References:
- docs/paper_chars.md (CHaRS paper) motivates multi-axis steering
- docs/RESEARCH_JOURNAL.md 2026-05-26 entry covers context + audit

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:33:24 +00:00
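Note on the sign-orientation step above. SVD returns singular vectors only up to sign, so a one-sided gate needs a convention for which way is "+v_hack". The sketch below flips each direction so the mean projection of the pair-diff rows is non-negative. `orient_directions` is a hypothetical helper, and the SVD itself is assumed to have been computed elsewhere.

```python
from typing import List, Sequence


def orient_directions(
    D: Sequence[Sequence[float]],
    V: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Flip each row v of V so that mean(D @ v) is not negative.

    D: per-pair difference rows, shape [n_pairs, r].
    V: candidate directions (e.g. right singular vectors), shape [k, r].
    A direction whose mean projection is exactly zero is left unchanged.
    """
    oriented = []
    for v in V:
        projections = [sum(dj * vj for dj, vj in zip(d, v)) for d in D]
        mean_proj = sum(projections) / len(projections)
        if mean_proj < 0:
            oriented.append([-x for x in v])
        else:
            oriented.append(list(v))
    return oriented


D = [[1.0, 0.0], [3.0, 0.0]]
V = [[-1.0, 0.0], [0.0, 1.0]]
V_oriented = orient_directions(D, V)
```

Without this step, a flipped singular vector would make the gate remove motion away from the hack direction and keep motion toward it.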
wassname b4e76525c1 Per-prompt grouping, hint default, ratio diagnostic, LR=3e-4
- load_problems applies the simple_overwrite_tests hint by default (matches
  ariahw's load-time hint registry). Both pools now see the identical prompt.
- Pool files keyed by prompt_id (prompt_NNNN.jsonl.gz); each = G rollouts of
  one problem. Replay loader picks same problem_id from each pool ->
  per-prompt centered advantage is now meaningful (4 teacher +adv,
  4 base -adv on the SAME prompt instead of mixed-prompt centering).
- Importance ratio diagnostic: snapshot logp on first encounter of each
  replay prompt; log exp(logp_now - logp_step0) per sample.
  Healthy ~2-5; explosion >10 == overfit on teacher tokens.
- Default lr 7e-5 -> 3e-4 (~4x), bringing per-step grad pressure closer to
  ariahw's batched 256-sample setup. Grad-clip=1 still protects.
2026-05-25 22:03:50 +00:00
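Note on the per-prompt grouping and ratio diagnostic above. This is a minimal sketch under stated assumptions: rewards are centered within one prompt's group with no standard-deviation scaling, and the ratio is exp(logp_now - logp_step0) per sample. Both function names are hypothetical.

```python
import math
from typing import List, Sequence


def centered_advantages(rewards: Sequence[float]) -> List[float]:
    """Center rewards within one prompt's group (mean subtraction only)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def importance_ratios(
    logp_now: Sequence[float],
    logp_step0: Sequence[float],
) -> List[float]:
    """Per-sample exp(logp_now - logp_step0) against the first-encounter snapshot."""
    return [math.exp(now - ref) for now, ref in zip(logp_now, logp_step0)]


# 4 teacher rollouts (reward 1) and 4 base rollouts (reward 0) on the SAME prompt.
adv = centered_advantages([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
ratios = importance_ratios([-1.0, -3.0], [-2.0, -3.0])
```

Because every sample in the group shares a prompt, the sign of each advantage reflects teacher-versus-base behaviour rather than differences in prompt difficulty.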
wassname 00159cd7c6 Fix is_replay bug, add delta_S/logp diagnostics, cycle pools
- is_replay was always True when --replay-dirs was set, so student-gen
  batches were saved slim with no completions. Use replay_active.
- Log delta_S norm per step (adapter movement smoke test).
- Log per-sample mean logp, split into hack/no-hack in step summary
  (REINFORCE-on-replay should lift logp_hack monotonically).
- Cycle pool modulo size so warmup > pool size works.
- Bump warmupgen defaults to 100 = 70 replay + 30 student-gen,
  matching the paper's 70->90 hack discovery window.
2026-05-25 21:42:36 +00:00
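Note on the replay flag and pool cycling above. The sketch below illustrates the two fixes with a hypothetical `build_schedule` helper, which is not the repo's loader. The replay flag is decided per step rather than by whether replay directories were configured, and the pool index wraps modulo the pool size so a warmup longer than the pool still works.

```python
from typing import List, Optional, Tuple


def build_schedule(
    total_steps: int,
    warmup_replay_steps: int,
    pool_size: int,
) -> List[Tuple[str, Optional[int]]]:
    """Return (phase, pool_index) per step; pool_index is None for student-gen."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    schedule: List[Tuple[str, Optional[int]]] = []
    for step in range(total_steps):
        # Per-step flag: True only while the replay warmup is still running.
        replay_active = step < warmup_replay_steps
        if replay_active:
            schedule.append(("replay", step % pool_size))  # cycle the pool
        else:
            schedule.append(("student_gen", None))
    return schedule


# 100 steps = 70 replay + 30 student-gen; the pool size of 40 is an assumed value.
schedule = build_schedule(total_steps=100, warmup_replay_steps=70, pool_size=40)
n_replay = sum(1 for phase, _ in schedule if phase == "replay")
```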
wassname 041729a758 Warmup-gen probe results: H1 untestable at 20 warmup steps
Both arms: warmup hack=0.50 cos_in=+0.044, gen hack=0.00 cos=0.
Vanilla never hacks in student-gen window, so projected has nothing
to suppress. Cos signal validated in warmup phase. Headline H1 belongs
on direct-GRPO path, not distill-and-watch.
2026-05-25 15:58:37 +00:00
wassname a26f71ef1a probe_traj: side-by-side vanilla-vs-projected trajectory analyzer
Reads step files from both warmup-gen tags, prints per-step table
broken into warmup-replay and student-gen phases, computes H1 delta
on the gen-phase hack rate.
2026-05-25 12:26:03 +00:00
wassname a1fdb45251 warmup_replay_steps: replay then student-gen in one pipeline
After cfg.warmup_replay_steps replay steps from saved pools, switch to
student.generate using the learned adapter -- canonical GRPO loop.
Same Dr.GRPO loss + per-sample cosine throughout. The just recipes
probe-warmupgen-{vanilla,projected} default to 40 steps with warmup=20.

Per-step printout now shows cos_in/cos_out min/mean/max alongside the
existing aggregate. Reveals bimodal distributions hidden behind a mean.
2026-05-25 12:24:49 +00:00
wassname ab6676d90a mixed-replay GRPO works + cos fix + min/max + journal
probe_distill: mixed-replay loader with heterogeneous plens, Dr.GRPO
loss path (REINFORCE-style centered advantage), slim save when in
replay mode, just recipes probe-mixed-{vanilla,projected}.

proj: project_delta_S_grad returns min/max of per-module cos_in/out
alongside means, so step printout shows distribution not just average.

probe_distill: norm_weighted_cos now divides by sqrt(n_modules) so the
per-sample cos_S_contrib is a proper cosine in [-1, 1] (previously a
sqrt-of-n quirk let it exceed 1).

Step-0 mixed-replay result: teacher (hack=1) samples cos +0.07-0.09
(proper scale), base (hack=0) samples cos -0.005 to +0.004 -- two
cleanly separated distributions on 4+4 samples. v_hack extracted from
hand-authored pairs.py generalizes to ariahw's RL-emergent hack
direction. Strong methodological confirmation.

Journal: 2026-05-25 (b) entry covers the GRPO probe, the projection
asymmetry that makes cos_out slightly negative (cos_in<=0 modules
skipped), and the cos norm fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 12:20:52 +00:00
wassname 1e1b032c31 phase2_analyze: read pilot checkpoints, print trajectories + decision
Aggregates cin_mean / cout_mean / fired / frac_out_lt_in across seeds
for vanilla and projected arms. Applies spec2.md decision rules:
  vanilla cin>0.2 -> Phase 3 strongly justified
  cin~0           -> v_hack maybe orthogonal; consider R7
  projected out<in on >=80% steps -> mechanism active

justfile recipe: phase2-analyze [pattern]
2026-05-25 12:02:35 +00:00
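Note on the decision rules above. The sketch below encodes the three rules as a function. The thresholds 0.2 and 80% come from the commit message; the tolerance used for "cin~0" (`near_zero`) is an assumption, since the message does not state one. `phase2_decision` is a hypothetical name, not the function in phase2_analyze.

```python
from typing import List


def phase2_decision(
    vanilla_cin: float,
    frac_out_lt_in: float,
    near_zero: float = 0.02,  # assumed tolerance for "cin~0"
) -> List[str]:
    """Apply the three decision rules and return the matching verdicts."""
    notes: List[str] = []
    if vanilla_cin > 0.2:
        notes.append("Phase 3 strongly justified")
    elif abs(vanilla_cin) <= near_zero:
        notes.append("v_hack maybe orthogonal; consider R7")
    # Projected arm: cos_out < cos_in on at least 80% of steps.
    if frac_out_lt_in >= 0.8:
        notes.append("mechanism active")
    return notes


strong = phase2_decision(vanilla_cin=0.25, frac_out_lt_in=0.9)
orthogonal = phase2_decision(vanilla_cin=0.0, frac_out_lt_in=0.5)
```

Values of cin between the tolerance and 0.2 match neither of the first two rules, so the sketch returns no verdict for them.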
wassname 9c886428bf proj: measure_only kwarg + train.py always-on cos_in diagnostic
Vanilla arm now reports cos_in per step too (cosine of accumulated
Dr.GRPO grad with v_hack), as long as v_hack file is on disk. The
projection action only mutates the gradient when arm=projected;
vanilla just measures.

This makes Phase 2 (pilot scale) directly inform Phase 3: vanilla
cos_in trajectory says whether v_hack is even aligned with the GRPO
direction, before we burn 65h on the full sweep.
2026-05-25 11:50:41 +00:00