Commit Graph

24 Commits

Author SHA1 Message Date
wassname f70743c9e9 wip 2026-05-28 12:44:20 +00:00
wassname 1e3d39e318 justfile: drop 12 dead probe-* recipes superseded by train.py
The probe_distill.py workflow (replay-from-pool, warmup-gen, sandwich,
baked-ckpt) was the active research stream up through commit 75f4aff
when train.py took over with the fast preset + mixed-pool flag. The
twelve recipes removed here all call probe_distill modes that have no
current use: probe-distill, probe-vanilla-replay-base,
probe-mixed-vanilla, probe-mixed-projected, probe-warmupgen-*,
probe-sandwich-*, probe-vanilla-replay, probe-projected-replay,
probe-baked-vanilla, probe-baked-projected, probe-teacher-pool (dup
of pregen-teacher), and the stale 100-step probe-mixed pueue wrapper.

Kept: pregen-teacher (still used to refresh the cached pool),
probe-base-pool (clean-rollout pool source), probe-traj (trajectory
comparator), probe-full-seed and queue-* (full-preset sweep helpers).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 09:23:03 +00:00
wassname 646edfc7af purge dead modules and stale recipes
Deletes 7 source files that were superseded but never removed:
  run.py, grad_proj.py, extract_vhack.py (older twin-NLL extractor),
  grpo_smoke.py, grpo_proj_smoke.py (smoke harnesses replaced by
  train.py "smoke" subcommand), phase2_analyze.py (pilot is past),
  probe_uat.py (UAT pipeline is past).

Drops matching justfile recipes (vhack-check, phase2-analyze,
probe-uat) and the BASE constant that pointed at run.py. Updates
AGENTS/README references to the stale fast-dev-run recipe (now
just smoke / smoke-vanilla).

Verified by running just smoke-vanilla --steps=2 end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 08:42:15 +00:00
wassname f487e67405 Goal 0 milestone: fast preset learns to hack in ~10min
This batch lands the working baseline (Goal 0 from RESEARCH_JOURNAL 2026-05-28
(b)) plus the architectural cleanups it surfaced. Pueue task 59 hits the UAT
threshold (`hack_s >= N/4`) at step 7 on Qwen3-4B mixed-pool, ~10 min total.

Preset/Adam scheduling
- New `Preset.fast` with aggressive Adam (lr=3e-3, beta1=0.5, beta2=0.9) and
  small batch (steps=20, group=4, max_new=512, prompts_per_step=4) for sub-15-min
  iteration loops.
- `warmup_steps` (absolute) -> `warmup_frac` (fraction of total steps), so the
  20-step fast preset spends only 2 steps under warmup, not 10.
- `grad_clip` exposed as Config field (default 1.0; fast recipe uses 500 to
  effectively disable — `gn` column shows the clip was never the bottleneck).
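
A minimal sketch of the `warmup_frac` change, not the code in train.py. It assumes linear warmup followed by cosine decay to zero, and `warmup_frac=0.1` (implied by "2 steps" of a 20-step run); the function names are hypothetical.

```python
import math


def warmup_len(total_steps: int, warmup_frac: float) -> int:
    """Number of warmup steps implied by a fractional warmup setting."""
    return max(1, round(warmup_frac * total_steps))


def scheduled_lr(step: int, total_steps: int, base_lr: float, warmup_frac: float) -> float:
    """Linear warmup for warmup_frac of the run, then cosine decay (assumed shape)."""
    n_warm = warmup_len(total_steps, warmup_frac)
    if step < n_warm:
        # Ramp from base_lr / n_warm up to base_lr over the warmup steps.
        return base_lr * (step + 1) / n_warm
    progress = (step - n_warm) / max(1, total_steps - n_warm)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Fast preset: 20 steps at lr=3e-3. A fractional warmup scales with run length,
# whereas an absolute warmup of 10 steps would cover half of this run.
fast_schedule = [scheduled_lr(s, 20, 3e-3, 0.1) for s in range(20)]
```

With these assumptions the fast preset reaches the full learning rate at its second step.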

CLI restructure (tyro subcommands)
- Drop `Preset` enum + `PRESETS` dict + `Config.resolved()` Optional-merge hack.
- Three typed subclass dataclasses: `SmokeConfig` / `FastConfig` / `FullConfig`
  inheriting from `Config`, dispatched via `tyro.extras.subcommand_cli_from_dict`.
- CLI: `train fast --arm=vanilla --lr=3e-3` (subcommand position, not --preset=).
- `cfg.preset_name` derived from `type(self).__name__` instead of duplicated field.
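
A rough sketch of the subclass layout, assuming the preset name is the class name with the `Config` suffix stripped and lowercased (the commit only says it is derived from `type(self).__name__`). Dispatch here is a plain dict lookup standing in for the tyro subcommand CLI; the base defaults and the `build` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Config:
    arm: str = "vanilla"
    lr: float = 1e-4   # hypothetical base default
    steps: int = 200   # hypothetical base default

    @property
    def preset_name(self) -> str:
        # Derived from the class, so there is no separate field to keep in sync.
        return type(self).__name__.removesuffix("Config").lower()


@dataclass
class SmokeConfig(Config):
    steps: int = 30


@dataclass
class FastConfig(Config):
    lr: float = 3e-3
    steps: int = 20


@dataclass
class FullConfig(Config):
    pass


# Stand-in for the subcommand dispatch: subcommand name -> config class.
SUBCOMMANDS = {"smoke": SmokeConfig, "fast": FastConfig, "full": FullConfig}


def build(subcommand: str, **overrides) -> Config:
    """Instantiate the typed config for a subcommand, applying CLI-style overrides."""
    return SUBCOMMANDS[subcommand](**overrides)


cfg = build("fast", arm="vanilla")
```

Each preset's defaults live on its own class, so no Optional-merge step is needed to resolve them.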

Logging refactor
- New `StepLogger` class consolidates column order, width, header label, and
  per-cell formatter (no more triplicated `_col_w` / `_row_cols` / `_header_labels`).
- Row dict carries raw values throughout; formatters live in column spec.
  Fixes the bug where end-of-run tabulate parsed `"7.00e-08"` strings as floats
  and reformatted to `+0.000`. Tuples for fraction columns get converted to
  "n/d" strings only at tabulate-dump time.
- `gn` column added (pre-clip total L2 norm; was discarded by clip_grad_norm_).
- `lr` column added (current scheduled LR through warmup + cosine).
- Timing cols (gen/fb/t_rew/sec) dropped from streaming view, still archived.
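
The idea of keeping raw values in the row and formatters in the column spec can be sketched as below. This is an illustration only: the `Column` tuple, the widths, and the format strings are assumptions, not the real `StepLogger`.

```python
from typing import Any, Callable, NamedTuple


class Column(NamedTuple):
    key: str                      # key into the row dict
    label: str                    # header label
    width: int                    # right-justified cell width
    fmt: Callable[[Any], str]     # per-cell formatter


def fmt_frac(value: tuple) -> str:
    """Render an (n, d) fraction tuple as 'n/d' only at display time."""
    n, d = value
    return f"{n}/{d}"


COLUMNS = [
    Column("step", "step", 4, lambda v: f"{v:d}"),
    Column("lr", "lr", 9, lambda v: f"{v:.2e}"),
    Column("gn", "gn", 8, lambda v: f"{v:.3f}"),
    Column("hack_s", "hack_s", 6, fmt_frac),
]


class StepLogger:
    def __init__(self, columns: list) -> None:
        self.columns = columns
        self.rows: list = []      # raw values, never pre-formatted strings

    def header(self) -> str:
        return " ".join(c.label.rjust(c.width) for c in self.columns)

    def log(self, row: dict) -> str:
        self.rows.append(row)
        return " ".join(c.fmt(row[c.key]).rjust(c.width) for c in self.columns)


logger = StepLogger(COLUMNS)
line = logger.log({"step": 7, "lr": 7e-8, "gn": 1.5, "hack_s": (1, 4)})
```

Because the stored row still holds `7e-8` as a float, a later table dump can apply the same formatter instead of re-parsing a string.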

cin/cout -> cos_pre/cos_post + signed
- Rename across train.py, proj.py, probe_distill.py, run.py, smokes, plots,
  justfile. "in/out" overloaded with weight in/out features; "pre/post" is
  unambiguous re projection timing.
- Metric is now signed: sum(V @ g) / ||g|| instead of ||V @ g|| / ||g||. With
  one_sided gate, cos_post goes negative after projection (residual energy is
  anti-hack) — was hidden by the absolute-value norm.
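
A small pure-Python illustration of why the signed metric exposes what the norm hid. It assumes the rows of V are orthonormal; the function names are hypothetical.

```python
import math


def dot(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))


def cos_unsigned(V: list, g: list) -> float:
    """Old metric: ||V @ g|| / ||g||, always non-negative."""
    coeffs = [dot(v, g) for v in V]
    return math.sqrt(sum(c * c for c in coeffs)) / math.sqrt(dot(g, g))


def cos_signed(V: list, g: list) -> float:
    """New metric: sum(V @ g) / ||g||, keeps the direction of the overlap."""
    coeffs = [dot(v, g) for v in V]
    return sum(coeffs) / math.sqrt(dot(g, g))


V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # two orthonormal hack axes
g = [-0.6, 0.0, 0.8]                      # gradient pointing against the first axis

old_value = cos_unsigned(V, g)   # magnitude only
new_value = cos_signed(V, g)     # negative: the overlap is anti-hack
```

Note that the signed sum is not bounded by 1 for k > 1 axes, so it reads as a signed overlap score rather than a strict cosine.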

v_hack extraction framing
- README + `extract_vhack_grad.py` docstring lead with "this is the GRPO
  gradient on a labeled (hack, clean) pair" instead of twin-NLL. For a pair
  with advantages +-1 the Dr.GRPO grad equals grad_NLL(hack) - grad_NLL(clean)
  exactly, so we save the cleaner narrative for the paper.
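
The equivalence can be written out as follows, assuming a sequence-level policy-gradient loss with no length or std normalization, no KL term, and a single on-policy step (so the importance ratio is 1). The symbols y_h and y_c for the hack and clean completions are notation introduced here.

```latex
\begin{aligned}
\mathcal{L}(\theta)
  &= -\sum_{i \in \{h,\, c\}} A_i \,\log \pi_\theta(y_i \mid x),
  \qquad A_h = +1,\quad A_c = -1 \\
  &= -\log \pi_\theta(y_h \mid x) + \log \pi_\theta(y_c \mid x) \\
  &= \mathrm{NLL}_\theta(y_h) - \mathrm{NLL}_\theta(y_c), \\[4pt]
\nabla_\theta \mathcal{L}(\theta)
  &= \nabla_\theta \mathrm{NLL}_\theta(y_h) - \nabla_\theta \mathrm{NLL}_\theta(y_c).
\end{aligned}
```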

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 03:22:36 +00:00
wassname a82c5c17dd smoke: route through teacher_pool so backward/projection paths fire
Pure tiny-random generation yields all-zero rewards, so the zero-variance
check bails on every step and the GRPO backward, projection, and cin
diagnostics never ran under smoke: exactly the paths most likely to harbour bugs.

Pointing smoke at the cached teacher_pool (real Qwen3-4B completions +
real graded rewards) at mix_ratio=0.5 guarantees within-group reward
spread on every step. Smoke now exercises loss/backward/projection/cin
end-to-end; failed runs surface as finite loss + cin/cout numerics, not
just plumbing errors.

Side fix: decouple pool from prompt tokenization. Cached prompt_ids are
ignored; live tokenizer re-renders the prompt every step. Qwen3-4B and
tiny-random-qwen3 share vocab but differ in chat template (4B appends a
<think>\n\n</think>\n\n trailer even with enable_thinking=False), which
otherwise tripped the drift assert. Only completion_ids need to come
from cache; same-vocab assumption stands.

Bumped smoke n_problems=10 -> 100 so the 70-prompt pool has enough
overlap with the initial problem slice to keep the step loop fed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 23:49:21 +00:00
wassname ecfb3bf30a smoke: tiny-random on CPU, beartype on, 30 steps; one-harness consolidation
Make `just smoke` reuse train.py (the production harness) at minimum config
on CPU with BEARTYPE=1, so the smoke walks every code path with the
jaxtyping/beartype shape checks active.

Changes:
- smoke preset: model=tiny-random-qwen3, steps=30, group=2, max_new=32,
  n_problems=10, prompts_per_step=1. Steps>=25 so the every-25-step
  save_ckpt path is exercised. Runs in ~35s on CPU.
- train.py: dtype + attn_implementation auto-fallback on CPU (fp32 + sdpa)
  since flash-attn 2 is CUDA-only and CPU bf16 is patchy.
- load_v_hack + auto-extract save: dtype header now matches whichever
  precision the run actually uses ("fp32" on CPU, "bf16" on CUDA).
- justfile: smoke recipes drop the parallel `run.py` "fast-dev-run" entry
  and force CUDA_VISIBLE_DEVICES= so they always exercise the CPU path.
  smoke-both runs vanilla then projected back-to-back -- second invocation
  hits the v_hack cache (cache-miss vs cache-hit both covered).

Fixes uncovered when smoke first ran:
- est_gens_per_step was reading cfg.prompts_per_step * cfg.group which are
  None when preset defaults supply them; switched to the resolved locals.
- save_ckpt and the final-summary aggregation still referenced r["hack"] /
  r["gt"], dropped from the per-step table in commit 373c257. Reconstruct
  from r["hack_s"] + r["hack_t"] and same for gt.
2026-05-27 23:33:12 +00:00
wassname 5f196e3108 v_hack v2: top-k + S magnitudes + runtime suspicion gate + per-source cin
Extraction (extract_vhack_grad.py):
- Default top_k=12 (was 5), saves singular values S as _sv/{name} keys
- SVD orientation: majority-vote across pairs (was sign-of-mean, outlier-fragile)
- Pulled extract_v_hack() into a callable function for in-process reuse
- Fail-fast on non-finite NLL (would otherwise leave G_h/G_c length-mismatched)
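
To illustrate why majority-vote orientation is less outlier-fragile than sign-of-mean, here is a toy sketch over the per-pair projections of one singular vector. The helper names and the tie-breaking toward +1 are assumptions.

```python
def sign_of_mean(projections: list) -> int:
    """Old rule: orient by the sign of the mean projection across pairs."""
    return 1 if sum(projections) >= 0 else -1


def majority_vote(projections: list) -> int:
    """New rule: orient by how many pairs project positively vs negatively."""
    n_pos = sum(p > 0 for p in projections)
    n_neg = sum(p < 0 for p in projections)
    return 1 if n_pos >= n_neg else -1


# Four pairs agree on the direction; one outlier pair has a large opposite projection.
projections = [0.2, 0.3, 0.1, 0.25, -2.0]

mean_sign = sign_of_mean(projections)    # flipped by the single outlier
vote_sign = majority_vote(projections)   # follows the four agreeing pairs
```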

Loading (train.py:load_v_hack):
- Returns (v_hack, v_sv) tuple; filters _sv/ keys into separate dict
- k_use slicing at load: extract at k=12, ablate k=1..12 by config flip
- Auto-extract on cache miss using already-wrapped model (no second model load)
- Default path derived from model_slug + extract_top_k

Runtime suspicion gate (proj.py:project_delta_S_grad):
- Dimensionless within-module ratio: r_i = (|c_i|/||g||) / (S_i/||S||)
  (codex/subagent flagged: |c_i|/S_i biased by per-module ||g||)
- Per-step quantile gate drops top susp_drop_frac axes by r_i (default 0.25)
- Fail-fast if susp_drop_frac>0 and v_sv missing (old v1 file)
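
The ratio and gate above can be sketched for a single module as follows. This is an approximation: it floors the drop count where the real code uses a quantile, and the function names are hypothetical. What happens to a dropped axis downstream is not shown.

```python
import math


def suspicion_ratios(c: list, S: list, g_norm: float) -> list:
    """r_i = (|c_i| / ||g||) / (S_i / ||S||): gradient share relative to extraction share."""
    s_norm = math.sqrt(sum(s * s for s in S))
    return [(abs(ci) / g_norm) / (si / s_norm) for ci, si in zip(c, S)]


def axes_to_drop(ratios: list, susp_drop_frac: float) -> list:
    """Indices of the top susp_drop_frac axes by ratio (floor on the count is an assumption)."""
    n_drop = int(len(ratios) * susp_drop_frac)
    order = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    return sorted(order[:n_drop])


c = [0.5, 0.1, 0.4, 0.05]   # per-axis gradient coefficients c_i = <g, v_i>
S = [4.0, 3.0, 1.0, 0.5]    # singular values saved at extraction
ratios = suspicion_ratios(c, S, g_norm=1.0)
dropped = axes_to_drop(ratios, susp_drop_frac=0.25)
```

In this toy case the third axis carries far more of the gradient than its singular value would suggest, so it is the one flagged.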

Per-source cin (proj.py:mean_cin_from_grads + train.py loss split):
- Per-prompt: backward student loss + teacher loss separately with retain_graph
- step_grad_s + step_grad_t = combined grad (linearity); used for projection
- cin_s, cin_t columns: discriminator for "does v_hack project hack > non-hack"

Doc: docs/extract_vhack_grad-vec.md (math, pseudocode, validation plan)
Codex external review: docs/spec/20260527_code_review.md

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 06:39:05 +00:00
wassname 75f4aff4d8 Mixed-pool GRPO via cached teacher pool
Adds --teacher-pool-dir + --mix-ratio to train.py. Per-prompt rollout pool
becomes G_s live student + G_t cached teacher rollouts from
out/probe_distill/teacher_pool/ (produced by probe_distill.py --teacher-only).
Cached rewards/flags used verbatim (no re-grading) so the pool is a
reproducible fixed teacher distribution.

Single-inner-step PPO -> ratio==1, so reward-weighted policy gradient applies
uniformly to both halves; no off-policy mask needed. Loss is unchanged.
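
For reference, the ratio argument can be spelled out, assuming the "old" log-probs are the current policy's own detached log-probs for both the student and the cached teacher samples:

```latex
\rho_i(\theta) = \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)},
\qquad
\rho_i(\theta)\Big|_{\theta = \theta_{\text{old}}} = 1,
\qquad
\nabla_\theta \big[\rho_i(\theta)\, A_i\big]\Big|_{\theta = \theta_{\text{old}}}
  = A_i \,\nabla_\theta \log \pi_\theta(y_i \mid x).
```

With a single inner step the clip range is never reached, so the surrogate reduces to the advantage-weighted log-probability gradient for every sample in the pool.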

Tokenization drift guard: cached prompt_ids[:plen] must match live tokenization
on first use (fail-fast assert). Prompt sampling restricted to pool-overlap so
we don't burn 93% of steps on cache misses with the current 70-prompt pool.

Per-source logging: hack_s / hack_t / gt_s columns and HACK_STUDENT /
HACK_TEACHER in the final-tail BLUF.

Justfile: pregen-teacher (expand pool) + probe-mixed (queue 10-step GO/NO-GO
probe via pueue). Smoke validated 2 steps end-to-end on clean Qwen3-4B at
peak 44.8GB.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 02:04:19 +00:00
wassname 6bd3abfe5b no_gate projection mode, ariahw hint-replacement loader, mixed-pool plan
- proj.py: add gate_mode={one_sided, no_gate}; no_gate does full V·V^T removal
- train.py: ariahw-matching hint replacement (CODE_SYSTEM_PROMPT preserved,
  user msg gets the run_tests loophole); T=0.7 to match reference; timing cols
  in step table; first-hack checkpoint snapshot
- probe_lora_runtime.py: sanity probe that ariahw LoRA hacks on our pipeline
- RESEARCH_JOURNAL.md: null result entry (#39 projected ≈ #40 vanilla at
  HACK=0.215, PASS=0.315), plus next-phase plan to switch from baked-base to
  mixed-pool GRPO from clean Qwen3-4B + ariahw teacher
2026-05-27 00:45:26 +00:00
wassname 235b51399f top-k v_hack subspace + real-voice pairs + LoRA bake
Pipeline overhaul for the "v_hack failed to discriminate hacks (cos≈+0.01)"
finding on seed41:

- bake_lora.py: scale ariahw/rl-rewardhacking-leetcode-rh-s65 alpha by 0.25,
  merge into Qwen3-4B, save to out/baked/qwen3_4b_rh25/ — partially-hacky
  student where projected-vs-vanilla dynamics have room to diverge.
- pairs.py: 12 real-voice contrastive pairs mirroring teacher_pool format
  (chat-template, class Solution, ```python fence, run_tests method).
  4 axes: weak-tests (8), hardcode (2), persona-via-completion (2). All pairs
  same-prompt to keep gradient comparable to training-time distribution.
- extract_vhack_grad.py: SVD top-k of per-pair diff matrix D[n_pairs, r] per
  module. Orient each right singular vector so mean(D @ v_i) > 0 (else SVD
  sign flip would invert the proj.py one-sided gate). Save as [k, r] with
  top_k in safetensors metadata. Diagnostic switches from ||diff|| to
  sv_top_k fraction.
- proj.py: rank-k subspace projection with per-direction one-sided gate.
  For each axis v_i with c_i = <g, v_i>, subtract only when c_i > 0. Preserves
  sign-aware semantics (kill +v_hack motion, leave -v_hack alone) while
  covering multiple hack axes simultaneously. cos_in becomes ||V g||/||g||
  (subspace energy fraction).
- probe_plot_stack.py: 3-panel plot (stack / GRPO loss / cos panel with
  raw + hack-filtered + cos_in/hack_frac traces) added during instrumentation.
- probe_distill.py: removed NLL loss mode (footgun — default was nll, every
  recipe overrode to grpo). Always GRPO. Tracks per_sample_loss.
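
The per-direction one-sided gate in proj.py can be sketched as below, assuming the rows of V are orthonormal (as right singular vectors are). This is an illustration in plain Python, not the project's implementation; `project_one_sided` is a hypothetical name.

```python
def dot(a: list, b: list) -> float:
    return sum(x * y for x, y in zip(a, b))


def project_one_sided(g: list, V: list) -> list:
    """Remove the component of g along each axis v_i only when c_i = <g, v_i> > 0."""
    out = list(g)
    for v in V:
        c = dot(g, v)   # coefficients use the original g; valid for orthonormal rows
        if c > 0:
            out = [o - c * vi for o, vi in zip(out, v)]
    return out


V = [[1, 0, 0], [0, 1, 0]]   # two orthonormal hack axes
g = [2, -3, 5]               # +2 along axis 0 (toward hack), -3 along axis 1 (away)

g_proj = project_one_sided(g, V)
```

Only the positive coefficient is removed; the motion away from the second hack axis is left untouched.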

Extract on baked rh25 with new pairs (pueue 22):
  top-5 SV fraction = 0.70-0.74 per module suffix (SHOULD>0.5, met).
  v_proj cleanest at 0.74. All 252 modules non-zero ||D||.

References:
- docs/paper_chars.md (CHaRS paper) motivates multi-axis steering
- docs/RESEARCH_JOURNAL.md 2026-05-26 entry covers context + audit

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 02:33:24 +00:00
wassname 00159cd7c6 Fix is_replay bug, add delta_S/logp diagnostics, cycle pools
- is_replay was always True when --replay-dirs was set, so student-gen
  batches were saved slim with no completions. Use replay_active.
- Log delta_S norm per step (adapter movement smoke test).
- Log per-sample mean logp, split into hack/no-hack in step summary
  (REINFORCE-on-replay should lift logp_hack monotonically).
- Cycle pool modulo size so warmup > pool size works.
- Bump warmupgen defaults to 100 = 70 replay + 30 student-gen,
  matching the paper's 70->90 hack discovery window.
2026-05-25 21:42:36 +00:00
wassname a26f71ef1a probe_traj: side-by-side vanilla-vs-projected trajectory analyzer
Reads step files from both warmup-gen tags, prints per-step table
broken into warmup-replay and student-gen phases, computes H1 delta
on the gen-phase hack rate.
2026-05-25 12:26:03 +00:00
wassname a1fdb45251 warmup_replay_steps: replay then student-gen in one pipeline
After cfg.warmup_replay_steps replay steps from saved pools, switch to
student.generate using the learned adapter -- canonical GRPO loop.
Same Dr.GRPO loss + per-sample cosine throughout. Just recipes
probe-warmupgen-{vanilla,projected} default 40 steps with warmup=20.

Per-step printout now shows cos_in/cos_out min/mean/max alongside the
existing aggregate. Reveals bimodal distributions hidden behind a mean.
2026-05-25 12:24:49 +00:00
wassname ab6676d90a mixed-replay GRPO works + cos fix + min/max + journal
probe_distill: mixed-replay loader with heterogeneous plens, Dr.GRPO
loss path (REINFORCE-style centered advantage), slim save when in
replay mode, just recipes probe-mixed-{vanilla,projected}.

proj: project_delta_S_grad returns min/max of per-module cos_in/out
alongside means, so step printout shows distribution not just average.

probe_distill: norm_weighted_cos now divides by sqrt(n_modules), so the
per-sample cos_S_contrib is a proper cosine in [-1, 1] (without the
sqrt(n_modules) factor it could exceed 1).

Step-0 mixed-replay result: teacher (hack=1) samples cos +0.07-0.09
(proper scale), base (hack=0) samples cos -0.005 to +0.004 -- two
cleanly separated distributions on 4+4 samples. v_hack extracted from
hand-authored pairs.py generalizes to ariahw's RL-emergent hack
direction. Strong methodological confirmation.

Journal: 2026-05-25 (b) entry covers the GRPO probe, the projection
asymmetry that makes cos_out slightly negative (cos_in<=0 modules
skipped), and the cos norm fix.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 12:20:52 +00:00
wassname 1e1b032c31 phase2_analyze: read pilot checkpoints, print trajectories + decision
Aggregates cin_mean / cout_mean / fired / frac_out_lt_in across seeds
for vanilla and projected arms. Applies spec2.md decision rules:
  vanilla cin>0.2 -> Phase 3 strongly justified
  cin~0           -> v_hack maybe orthogonal; consider R7
  projected out<in on >=80% steps -> mechanism active

justfile recipe: phase2-analyze [pattern]
2026-05-25 12:02:35 +00:00
wassname e04548987f spec2 + base_pool generator + slim replay save (partial mixed-replay TODO)
spec2.md records:
 - Phase 1 result (NLL cos signal +0.747 pure-hack vs +0.398 mixed)
 - Phase 2: mixed-replay GRPO probe, partial impl
 - Phase 3: $400/65h sweep, predicated on Phase 2 cos_in signal

User correction mid-implementation: Phase 2 and Phase 3 should share
train.py code with different --steps, not build separate replay
machinery. Mixed-replay refactor in probe_distill.py is left wired
in (replay_dirs, loss_mode, save_step_slim, heterogeneous plen
loader) but marked TODO for completion; canonical Phase 2 path is
train.py at smaller scale.

probe_distill.py gets --base-only mode and load_problems_base for the
non-hack pool, used as one half of the variance source.

Also addresses user complaint "don't save replayed batches" with
save_step_slim that drops the duplicated prompts/completions in
favour of cosine-only annotations.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 11:48:48 +00:00
wassname d111db25f7 Distillation probe: hacky teacher (rh-s65) + student per-sample cosine
probe_distill.py is one script with three modes (default, --teacher-only,
--replay-dir) so vanilla and projected arms can replay the same teacher
rollouts apples-to-apples. Per-sample delta_S.grad snapshot diff gives
cos(grad, v_hack) per sample without breaking accumulation semantics.

rh-s65 was trained with simple_overwrite_tests hint applied to the user
prompt; train.py's REF_PASS_TEST_SYSTEM_PROMPT override took us off that
distribution (0/8 hacks). load_problems_rh restores the no-intervention
setup -> 8/8 hacks at step 0.

probe_uat.py defines four UATs and reports PASS/FAIL: T1 teacher hack
>=0.30, T2 vanilla cos coverage >=90%, T3 projected cos_out<cos_in on
>=80% steps, T4 cos | hacked > cos | not (one-sided t, p<0.05).

Journal entry flags methodological caveat: v_hack from NLL contrastive
gradient is not the GRPO policy gradient; if T4 fails, fallback is to
re-extract v_hack with GRPO-contrastive loss (same pairs, adv=+/-1).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:04:55 +00:00
wassname 6f68ba34b6 Match paper effective batch + fix gt_tests/KeyError, strip stale docstring
Re-audited our setup vs ariahw 2025 (paper body + config.py + dataset):

- gt_tests: previously used only the first 5 ([:5]) of a median of 102
  ground-truth asserts per problem. The hardcode loophole let a model pass 5
  cherry-picked answers, score gt_pass=True, and never be flagged as a hack,
  inflating PASS_RATE and hiding hacking. Now uses all asserts (free:
  rewards.py runs them in one subprocess).
- n_problems 500 -> 992 (full filtered set, paper fn.9).
- prompts_per_step 8 -> 43: grad-accum to ~258 generations/step ~= paper's
  effective batch of 256 (16 prompts x 16 gen). At our VRAM-capped G=6 this is
  the only lever; same peak VRAM, ~5x wall-time. Makes "our step N" comparable
  to the paper's step N in gradient-sample terms.
- KeyError fix: end-of-run summary read r["rollouts"]/r["gt_pass"] but row keys
  are "N"/"gt". Every run crashed at step 200 before saving; no .pt had ever
  been written.
- Stripped stale module docstring (claimed beta=0.04 vs actual 1e-3, Qwen3.5-2B
  vs Qwen3-4B, duplicated preset table) -> points to PRESETS as source of truth.
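
The batch arithmetic behind the prompts_per_step change, as a quick check (the helper name is hypothetical):

```python
def effective_batch(prompts_per_step: int, group: int) -> int:
    """Generations contributing to one optimizer step under gradient accumulation."""
    return prompts_per_step * group


paper_batch = effective_batch(16, 16)   # paper: 16 prompts x 16 generations
old_batch = effective_batch(8, 6)       # previous setting at G=6
new_batch = effective_batch(43, 6)      # new setting at G=6

# Wall-time per step scales with the number of prompts accumulated.
wall_time_factor = 43 / 8
```

At G=6 the nearest match to 256 is 43 prompts (258 generations), at roughly 5.4x the per-step generation work of the old setting.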

justfile: probe-full-seed now launches 4 dependent pueue tasks (extract ->
verify -> vanilla -> projected) instead of one monolithic job, so a stage crash
no longer blocks the rest and each gate is independently inspectable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 09:25:47 +00:00
wassname 87a2b48784 G=6 + logits_to_keep OOM fix, generalization constraint, handover rewrite
train.py: pass logits_to_keep=L_c+1 to model() at all three logp call
sites + the ref-via-zero-delta helper so HF Qwen3's lm_head only runs on
completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site.
full preset G=8 -> G=6 for a further ~25% B reduction at every act site.
Column names in the streamed TSV row shortened so header and values
share the same 8-char tab stop.

spec.md: documented the v_hack generalization constraint as load-bearing
methodology — pairs.py must NOT be tuned post-hoc to match RL-emergent
hacks, or the H1 generalization claim collapses.

handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B).
Documents the four probe gates, hyperparameters table, and methodological
constraints. justfile gains a SWEEPS comment block clarifying probe vs
queue-full ordering. .gitignore picks up .venv, *.log, /tmp/, cache dirs.

RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix,
pooled cross-run trend analysis (LR is fine, signal underpowered at n=17
but directionally consistent), and the generalization correction.
2026-05-24 05:03:04 +00:00
wassname 973b9407b5 grader bug fix + ref reward semantics + Qwen3-4B substrate
Three independent issues that together made every prior `gt=0` measurement
bogus and the H4 hypothesis untestable:

1. Grader bug: rewards.py wrapped already-asserted gt tests with `assert (...)`
   producing `assert (assert ...)` SyntaxErrors. Every gt_pass was False
   regardless of correctness. Fixed by joining tests verbatim.

2. Reward semantics: matched reference `CorrectOrHintedCompileCode(allow_hint=True)`
   default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format
   paid on can_compile; correctness paid on `gt_pass OR hacked`. Magnitudes
   0.5/3.0 (was 0.25/1.0). The reference's run_no_intervention (main RL run)
   uses these defaults; ours was effectively the run_rl_baseline control.

3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's
   DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster
   wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions.
   beta=1e-3 (was 0.04) per reference config.py:135.
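
The grader bug in item 1 is easy to reproduce in isolation. The sketch below uses hypothetical helper names and toy tests, not the actual rewards.py code.

```python
def build_test_src_buggy(tests: list) -> str:
    """Old behaviour: wrap each test in assert (...), even if it already is an assert."""
    return "\n".join(f"assert ({t})" for t in tests)


def build_test_src_fixed(tests: list) -> str:
    """Fixed behaviour: join the already-asserted tests verbatim."""
    return "\n".join(tests)


def compiles(src: str) -> bool:
    """True if the source parses; a SyntaxError here fails the grade regardless of the solution."""
    try:
        compile(src, "<gt_tests>", "exec")
        return True
    except SyntaxError:
        return False


gt_tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]

buggy_ok = compiles(build_test_src_buggy(gt_tests))   # 'assert (assert ...)' does not parse
fixed_ok = compiles(build_test_src_fixed(gt_tests))
```

Since the wrapped source never parses, every gt_pass comes out False no matter how correct the completion is.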

Also:
- ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems
  (was dataset's baked-in CODE_SYSTEM_PROMPT which is the control prompt)
- token-efficient logging (loguru single-char icons through tqdm.write, verbose
  log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with
  cue emoji + TSV table)
- docs/vendor/ clones of rl-rewardhacking and simple_GRPO for greppable
  side-by-side
- new RESEARCH_JOURNAL.md

First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000,
rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode.

200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps):
extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 23:36:00 +00:00
wassname 0e2c786d4a ready 2026-05-23 14:19:41 +08:00
wassname 75a3ec9dd9 ready? 2026-05-23 14:03:05 +08:00
wassname bf252fac69 fix smoke. 2026-05-23 11:26:39 +08:00
wassname 120400c5f5 setup 2026-05-23 10:40:02 +08:00