extract_v_hack runs forward+backward on contrastive pairs to populate
delta_S.grad; the inline auto-extract called model.eval() but never
switched back with model.train(), so the entire training run stayed in
eval mode. Qwen3 has no dropout by default, so behavior was unchanged,
but restoring train mode matches the standalone extract CLI and avoids
a latent inconsistency if a model with dropout is used later.
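One way to make the restore hard to forget is a small context manager. This is a minimal sketch, not the code in this commit: eval_mode and ToyModule are hypothetical names, and the only assumption about the model is the torch.nn.Module-style interface (.training, .train(mode), .eval()).

```python
from contextlib import contextmanager


@contextmanager
def eval_mode(model):
    """Put `model` in eval mode for the block, then restore its previous mode."""
    was_training = model.training      # flag exposed by torch.nn.Module
    model.eval()
    try:
        yield model
    finally:
        model.train(was_training)      # train(False) is equivalent to eval()


class ToyModule:
    """Hypothetical stand-in with the same mode interface as torch.nn.Module."""

    def __init__(self):
        self.training = True

    def train(self, mode=True):
        self.training = mode
        return self

    def eval(self):
        return self.train(False)


model = ToyModule()
with eval_mode(model):
    inside = model.training    # False while extracting
after = model.training         # True again once the block exits
```

The finally clause restores the mode even if the extract raises, which is the failure path a bare eval()/train() pair tends to miss.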
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The runtime suspicion gate was removed in 8d170a0 but the design doc
still advertised it as a live pillar. Replace gate section with a brief
"why we tried it, why we removed it" note.
Also fix N=12 (was N=14): pairs.py has 12, not 14.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The runtime quantile gate was a fixed-budget regularizer dressed up as a
detector: by construction, it dropped exactly drop_top_frac of axes per
step regardless of whether anything was genuinely suspicious. The susp
diagnostic column was 100% determined by the config knob and carried
zero information content.
The principled defense against noise axes is extract-time tau_axis
(drop singular axes below noise floor once at save), not a runtime
quantile. In high-d (r=2560), the expected damage from carrying a noise
axis through to runtime projection is ~||g||/sqrt(r), i.e. ≈ 2% of ||g||
per axis, so the cost is bounded anyway.
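Both claims can be checked with a toy sketch. The quantile_gate below is a hypothetical reimplementation (the removed gate's real code is not shown here) and the scores are made-up numbers; the point is only that a top-quantile rule drops the same count whether or not any score stands out, and that 1/sqrt(2560) is about 2%.

```python
import math
import random


def quantile_gate(scores, drop_top_frac):
    """Return the indices a top-quantile gate would drop (hypothetical sketch)."""
    n_drop = int(round(drop_top_frac * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:n_drop])


rng = random.Random(0)
calm = [rng.gauss(0.0, 1e-6) for _ in range(100)]   # nothing suspicious at all
spiky = [rng.gauss(0.0, 1.0) for _ in range(100)]
spiky[3] = 50.0                                     # one genuinely extreme axis

dropped_calm = quantile_gate(calm, drop_top_frac=0.1)
dropped_spiky = quantile_gate(spiky, drop_top_frac=0.1)
# Both sets have 10 entries: the budget is fixed by the knob, not by the data.

# A random direction in r dims has RMS projection ||g||/sqrt(r) on any fixed
# axis, so the relative damage per noise axis is 1/sqrt(r).
per_axis_fraction = 1.0 / math.sqrt(2560)           # about 0.0198
```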
Kept: load_v_hack still returns (v_hack, v_sv) tuple for callers that
need S values offline. The _sv/{name} keys remain in saved files for
future use (extract-time tau_axis, diagnostics).
Per-source cin (cin_s, cin_t) stays — that's the actual discriminator
for whether v_hack projects hack > non-hack. #51 already showed
cin_t/cin_s ~= 2.0 across early steps, so the direction is doing real
work.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds --teacher-pool-dir + --mix-ratio to train.py. Per-prompt rollout pool
becomes G_s live student + G_t cached teacher rollouts from
out/probe_distill/teacher_pool/ (produced by probe_distill.py --teacher-only).
Cached rewards/flags used verbatim (no re-grading) so the pool is a
reproducible fixed teacher distribution.
Single-inner-step PPO -> ratio==1, so reward-weighted policy gradient applies
uniformly to both halves; no off-policy mask needed. Loss is unchanged.
Tokenization drift guard: cached prompt_ids[:plen] must match live tokenization
on first use (fail-fast assert). Prompt sampling restricted to pool-overlap so
we don't burn 93% of steps on cache misses with the current 70-prompt pool.
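The drift guard and the pool-overlap restriction could look roughly like this. A minimal sketch: check_prompt_drift and pool_overlap are hypothetical names, and token ids are plain lists of ints rather than tensors.

```python
def check_prompt_drift(cached_ids, plen, live_prompt_ids):
    """Fail fast if the cached prompt prefix no longer matches live tokenization."""
    cached_prefix = list(cached_ids[:plen])
    live = list(live_prompt_ids)
    assert cached_prefix == live, (
        f"tokenization drift: cached prefix has {len(cached_prefix)} ids, "
        f"live prompt has {len(live)} ids, or their contents differ"
    )


def pool_overlap(problem_ids, pooled_ids):
    """Keep only problems that have cached teacher rollouts, preserving order."""
    pooled = set(pooled_ids)
    return [p for p in problem_ids if p in pooled]


# cached row = prompt ids followed by completion ids; plen marks the boundary
check_prompt_drift([11, 12, 13, 90, 91], plen=3, live_prompt_ids=[11, 12, 13])
sampleable = pool_overlap(["p1", "p2", "p3", "p4"], pooled_ids=["p4", "p2"])
# sampleable == ["p2", "p4"]
```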
Per-source logging: hack_s / hack_t / gt_s columns and HACK_STUDENT /
HACK_TEACHER in the final-tail BLUF.
Justfile: pregen-teacher (expand pool) + probe-mixed (queue 10-step GO/NO-GO
probe via pueue). Smoke validated 2 steps end-to-end on clean Qwen3-4B at
peak 44.8GB.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- proj.py: add gate_mode={one_sided, no_gate}; no_gate does full V·V^T removal
- train.py: ariahw-matching hint replacement (CODE_SYSTEM_PROMPT preserved,
user msg gets the run_tests loophole); T=0.7 to match reference; timing cols
in step table; first-hack checkpoint snapshot
- probe_lora_runtime.py: sanity probe that ariahw LoRA hacks on our pipeline
- RESEARCH_JOURNAL.md: null result entry (#39 projected ≈ #40 vanilla at
HACK=0.215, PASS=0.315), plus next-phase plan to switch from baked-base to
mixed-pool GRPO from clean Qwen3-4B + ariahw teacher
- antipasto.py: per-module SVD-cached log → debug (was 252 INFO lines per run,
pure noise on cache hits). Replace manual %-40 progress prints with a single
tqdm progress bar (mininterval=60).
- extract_vhack_grad.py: BLUF final tail — SHOULD line, TSV table, out path,
argv, main metric, single cue emoji (🟢/🟡/🔴). Same data, ~30 fewer lines.
- verify_vhack_heldout.py: same BLUF tail pattern. Defaults updated to point
at baked rh25 + v_hack_rh25 (were Qwen3.5-0.8B smoke). Cosine columns
relabelled to "energy" since v_hack is now [k, r] and the diagnostic is
||V·d||/||d|| (subspace energy fraction, ≥0).
Held-out result for current v_hack_rh25 (pueue 23):
median_energy=0.217, mean=0.286, n=252 modules.
🟡 below target 0.30 but 20× the prior synthetic-pair ~0.01.
q_proj cleanest (0.351 median), down_proj weakest (0.146).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The repo had two journals: root (active, daily-dated, ~547 lines) and
docs/RESEARCH_JOURNAL.md (older, dormant, 248 lines). User asked to merge
into one — keeping root since it has the active workflow.
Today's 2026-05-26 (b) dev-phase entry from docs/ moved to top of root
(under the now-restated "Append-only, newest at top" rule). Pre-existing
docs/ entries (96GB readiness fixes, smoke-loop mechanism verification,
project init) appended at bottom of root under a clearly-labelled "Earlier
history" section so we don't lose context, while keeping the daily-dated
section pristine for ongoing work.
docs/RESEARCH_JOURNAL.md deleted.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pipeline overhaul for the "v_hack failed to discriminate hacks (cos≈+0.01)"
finding on seed41:
- bake_lora.py: scale ariahw/rl-rewardhacking-leetcode-rh-s65 alpha by 0.25,
merge into Qwen3-4B, save to out/baked/qwen3_4b_rh25/ — partially-hacky
student where projected-vs-vanilla dynamics have room to diverge.
- pairs.py: 12 real-voice contrastive pairs mirroring teacher_pool format
(chat-template, class Solution, ```python fence, run_tests method).
3 axes: weak-tests (8), hardcode (2), persona-via-completion (2). All pairs
same-prompt to keep gradient comparable to training-time distribution.
- extract_vhack_grad.py: SVD top-k of per-pair diff matrix D[n_pairs, r] per
module. Orient each right singular vector so mean(D @ v_i) > 0 (else SVD
sign flip would invert the proj.py one-sided gate). Save as [k, r] with
top_k in safetensors metadata. Diagnostic switches from ||diff|| to
sv_top_k fraction.
- proj.py: rank-k subspace projection with per-direction one-sided gate.
For each axis v_i with c_i = <g, v_i>, subtract only when c_i > 0. Preserves
sign-aware semantics (kill +v_hack motion, leave -v_hack alone) while
covering multiple hack axes simultaneously. cos_in becomes ||V g||/||g||
(subspace energy fraction).
- probe_plot_stack.py: 3-panel plot (stack / GRPO loss / cos panel with
raw + hack-filtered + cos_in/hack_frac traces) added during instrumentation.
- probe_distill.py: removed NLL loss mode (footgun — default was nll, every
recipe overrode to grpo). Always GRPO. Tracks per_sample_loss.
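The sign orientation and the one-sided rank-k gate described above can be sketched in plain Python. This is an illustration under assumptions, not the proj.py code: orient and project_one_sided are hypothetical names, vectors are lists of floats, and the rows of V are assumed orthonormal (as right singular vectors are).

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orient(v, D):
    """Flip v if needed so the mean projection of D's rows onto v is positive."""
    mean_proj = sum(dot(d, v) for d in D) / len(D)
    return list(v) if mean_proj > 0 else [-x for x in v]


def project_one_sided(g, V):
    """Subtract only components with c_i = <g, v_i> > 0.

    Assumes the rows of V are orthonormal.
    Returns (projected g, subspace energy fraction ||V g|| / ||g||).
    """
    coeffs = [dot(g, v) for v in V]
    out = list(g)
    for c, v in zip(coeffs, V):
        if c > 0:  # one-sided gate: motion along -v_i is left alone
            out = [o - c * x for o, x in zip(out, v)]
    g_norm = math.sqrt(dot(g, g))
    energy = math.sqrt(sum(c * c for c in coeffs)) / g_norm if g_norm > 0 else 0.0
    return out, energy


V = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
g = [3.0, -2.0, 1.0]
g_proj, energy = project_one_sided(g, V)
# g_proj == [0.0, -2.0, 1.0]: the +v_0 component is removed, the -v_1 one is kept
```

Without orient, an SVD that returns -v_i instead of v_i would make the gate remove anti-hack motion and keep hack motion, which is why the sign is pinned at extract time.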
Extract on baked rh25 with new pairs (pueue 22):
top-5 SV fraction = 0.70-0.74 per module suffix (SHOULD>0.5, met).
v_proj cleanest at 0.74. All 252 modules non-zero ||D||.
References:
- docs/paper_chars.md (CHaRS paper) motivates multi-axis steering
- docs/RESEARCH_JOURNAL.md 2026-05-26 entry covers context + audit
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- load_problems applies the simple_overwrite_tests hint by default (matches
ariahw's load-time hint registry). Both pools now see the identical prompt.
- Pool files keyed by prompt_id (prompt_NNNN.jsonl.gz); each = G rollouts of
one problem. Replay loader picks same problem_id from each pool ->
per-prompt centered advantage is now meaningful (4 teacher +adv,
4 base -adv on the SAME prompt instead of mixed-prompt centering).
- Importance ratio diagnostic: snapshot logp on first encounter of each
replay prompt; log exp(logp_now - logp_step0) per sample.
Healthy ~2-5; explosion >10 == overfit on teacher tokens.
- Default lr 7e-5 -> 3e-4 (~4x), bringing per-step grad pressure closer to
ariahw's batched 256-sample setup. Grad-clip=1 still protects.
- is_replay was always True when --replay-dirs was set, so student-gen
batches were saved slim with no completions. Use replay_active.
- Log delta_S norm per step (adapter movement smoke test).
- Log per-sample mean logp, split into hack/no-hack in step summary
(REINFORCE-on-replay should lift logp_hack monotonically).
- Cycle pool modulo size so warmup > pool size works.
- Bump warmupgen defaults to 100 = 70 replay + 30 student-gen,
matching the paper's 70->90 hack discovery window.
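The importance ratio diagnostic from the list above amounts to a snapshot-and-compare. A minimal sketch: RatioTracker is a hypothetical name, the key is whatever identifies a replay sample, and logp is assumed to be a single per-sample scalar (whether that is a sum or a mean over tokens is not specified here).

```python
import math


class RatioTracker:
    """Snapshot logp on first encounter, then report exp(logp_now - logp_step0)."""

    def __init__(self):
        self._logp0 = {}

    def ratio(self, key, logp_now):
        # setdefault stores the first value seen and returns it on later calls
        logp0 = self._logp0.setdefault(key, logp_now)
        return math.exp(logp_now - logp0)


tracker = RatioTracker()
first = tracker.ratio(("prompt_0007", 0), logp_now=-2.0)   # 1.0 by construction
later = tracker.ratio(("prompt_0007", 0), logp_now=-1.0)   # exp(1.0), about 2.72
exploding = later > 10.0   # threshold from the note above: >10 suggests overfit
```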
Both arms: warmup hack=0.50 cos_in=+0.044, gen hack=0.00 cos=0.
Vanilla never hacks in student-gen window, so projected has nothing
to suppress. Cos signal validated in warmup phase. Headline H1 belongs
on direct-GRPO path, not distill-and-watch.
Reads step files from both warmup-gen tags, prints per-step table
broken into warmup-replay and student-gen phases, computes H1 delta
on the gen-phase hack rate.
After cfg.warmup_replay_steps replay steps from saved pools, switch to
student.generate using the learned adapter -- canonical GRPO loop.
Same Dr.GRPO loss + per-sample cosine throughout. The just recipes
probe-warmupgen-{vanilla,projected} default to 40 steps with warmup=20.
Per-step printout now shows cos_in/cos_out min/mean/max alongside the
existing aggregate. Reveals bimodal distributions hidden behind a mean.
probe_distill: mixed-replay loader with heterogeneous plens, Dr.GRPO
loss path (REINFORCE-style centered advantage), slim save when in
replay mode, just recipes probe-mixed-{vanilla,projected}.
proj: project_delta_S_grad returns min/max of per-module cos_in/out
alongside means, so step printout shows distribution not just average.
probe_distill: norm_weighted_cos now divides by sqrt(n_modules) so the
per-sample cos_S_contrib is a proper cosine in [-1, 1] (the missing
sqrt(n_modules) factor previously let it exceed 1).
Step-0 mixed-replay result: teacher (hack=1) samples cos +0.07-0.09
(proper scale), base (hack=0) samples cos -0.005 to +0.004 -- two
cleanly separated distributions on 4+4 samples. v_hack extracted from
hand-authored pairs.py generalizes to ariahw's RL-emergent hack
direction. Strong methodological confirmation.
Journal: 2026-05-25 (b) entry covers the GRPO probe, the projection
asymmetry that makes cos_out slightly negative (cos_in<=0 modules
skipped), and the cos norm fix.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Vanilla arm now reports cos_in per step too (cosine of accumulated
Dr.GRPO grad with v_hack), as long as v_hack file is on disk. The
projection action only mutates the gradient when arm=projected;
vanilla just measures.
This makes Phase 2 (pilot scale) directly inform Phase 3: vanilla
cos_in trajectory says whether v_hack is even aligned with the GRPO
direction, before we burn 65h on the full sweep.
spec2.md records:
- Phase 1 result (NLL cos signal +0.747 pure-hack vs +0.398 mixed)
- Phase 2: mixed-replay GRPO probe, partial impl
- Phase 3: $400/65h sweep, predicated on Phase 2 cos_in signal
User correction mid-implementation: Phase 2 and Phase 3 should share
train.py code with different --steps, not build separate replay
machinery. Mixed-replay refactor in probe_distill.py is left wired
in (replay_dirs, loss_mode, save_step_slim, heterogeneous plen
loader) but marked TODO for completion; canonical Phase 2 path is
train.py at smaller scale.
probe_distill.py gets --base-only mode and load_problems_base for the
non-hack pool, used as one half of the variance source.
Also addresses user complaint "don't save replayed batches" with
save_step_slim that drops the duplicated prompts/completions in
favour of cosine-only annotations.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Each step now logs cos_pureHack(n), cos_mixed(n), cos_noHack(n)
alongside hack/pass, so the v_hack-direction discrimination signal
is visible at run time without post-hoc querying.
With rh-s65 teacher (~99.4% hack) the noHack bucket is usually
empty; the pureHack vs mixed split is the discriminator
(t=+4.46 p<1e-4 over 160 samples).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer flagged 4 killer flaws: (1) behaviour-policy logp mismatch on
teacher rows (ratio pegs to clip from step 0); (2) frac_clipped, not
ratio_mean, is the saturation diagnostic; (3) mixed-policy can produce
gradient AWAY from hacking when the teacher half has zero adv variance;
(4) probe_distill's NLL normalizer is incomparable to train.py's Dr.GRPO.
User instruction reinforces: no mixed policy. Stay with hacky teacher
+ student NLL distill (existing Phase 1 pipeline, UAT 4/4).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R1-R4 (Phase 1) marked done with evidence pointers to
out/probe_distill/{teacher_pool,vanilla_seed41,projected_seed41}/.
R5 = GRPO trajectory probe (mixed-policy generator to restore reward
variance). R6 = LoRA-vs-SVD arm comparison. R7 = GRPO-contrastive
v_hack re-extraction (fallback only).
Errors table records the two diagnosis/fix loops from Phase 1: the
prompt-distribution mismatch and the zero-advantage skip.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous: per-sample loss was off-policy Dr.GRPO with importance ratio.
When teacher hacks 100% of the time (rh-s65), all rollouts get identical
reward, the advantage collapses to zero, and the per-sample backward gets
skipped -> cos_S_contrib is nan everywhere.
Fix: use per-sample mean NLL on completion tokens. This is the same loss
extract_vhack_grad.py uses to extract v_hack, so the per-sample gradient
is apples-to-apples with the projection direction. Removes off-policy
ratio + clip + zero_advantages branch.
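The collapse is easy to reproduce with a toy group. A minimal sketch, assuming the Dr.GRPO-style advantage is reward minus group mean with no std normalization; centered_advantages and mean_nll are hypothetical names and the rewards are illustrative values, not measured ones.

```python
def centered_advantages(rewards):
    """Reward minus the group mean (no std normalization)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def mean_nll(token_logps):
    """Per-sample mean negative log-likelihood over completion tokens."""
    return -sum(token_logps) / len(token_logps)


all_hack = centered_advantages([1.0, 1.0, 1.0, 1.0])
# every entry is 0.0, so no sample contributes a gradient
mixed = centered_advantages([1.0, 1.0, 0.0, 0.0])
# [0.5, 0.5, -0.5, -0.5]: reward variance is what keeps the signal alive

nll = mean_nll([-1.0, -3.0])   # 2.0, defined even when every reward is identical
```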
T4 in UAT had n_not_hacked = 1 since rh hacks 99% of the time. Switched
T4 to use the gt_pass split within hacked samples: "pure hack" (hacked=1,
gt_pass=0) vs "hack + also correct" (hacked=1, gt_pass=1). On the 160
samples we just generated this gives t=+4.46, p<1e-4, confirming v_hack
selectively aligns with purer-hack gradients.
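For reference, the split and the statistic can be sketched as follows. Which t-test variant the UAT uses is not stated, so Welch's unequal-variance form here is an assumption; split_hacked, welch_t, the dict keys, and the cos values are all made up for illustration and are not the 160-sample data.

```python
import math
from statistics import mean, variance


def split_hacked(samples):
    """Split hacked samples by gt_pass; non-hacked samples are ignored."""
    pure = [s["cos"] for s in samples if s["hacked"] and not s["gt_pass"]]
    also_correct = [s["cos"] for s in samples if s["hacked"] and s["gt_pass"]]
    return pure, also_correct


def welch_t(a, b):
    """Welch's t statistic; positive when mean(a) > mean(b)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


samples = [  # toy values only
    {"hacked": 1, "gt_pass": 0, "cos": 0.09},
    {"hacked": 1, "gt_pass": 0, "cos": 0.08},
    {"hacked": 1, "gt_pass": 0, "cos": 0.10},
    {"hacked": 1, "gt_pass": 1, "cos": 0.03},
    {"hacked": 1, "gt_pass": 1, "cos": 0.05},
    {"hacked": 1, "gt_pass": 1, "cos": 0.04},
    {"hacked": 0, "gt_pass": 1, "cos": 0.00},
]
pure, also_correct = split_hacked(samples)
t_stat = welch_t(pure, also_correct)   # one-sided test: reject if t is large and positive
```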
UAT result: 4/4 pass.
T1 hack=0.994 T2 cov=1.00 T3 cos_out<cos_in on 20/20 T4 t=+4.46
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
probe_distill.py is one script with three modes (default, --teacher-only,
--replay-dir) so vanilla and projected arms can replay the same teacher
rollouts apples-to-apples. Per-sample delta_S.grad snapshot diff gives
cos(grad, v_hack) per sample without breaking accumulation semantics.
rh-s65 was trained with simple_overwrite_tests hint applied to the user
prompt; train.py's REF_PASS_TEST_SYSTEM_PROMPT override took us off that
distribution (0/8 hacks). load_problems_rh restores the no-intervention
setup -> 8/8 hacks at step 0.
probe_uat.py defines four UATs and reports PASS/FAIL: T1 teacher hack
>=0.30, T2 vanilla cos coverage >=90%, T3 projected cos_out<cos_in on
>=80% steps, T4 cos | hacked > cos | not (one-sided t, p<0.05).
Journal entry flags methodological caveat: v_hack from NLL contrastive
gradient is not the GRPO policy gradient; if T4 fails, fallback is to
re-extract v_hack with GRPO-contrastive loss (same pairs, adv=+/-1).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Drop gradient_checkpointing: at G=6 grad-accum forwards one 6-seq group at a
time, so activation peak fits on 96GB without recompute; removes the ~1.3-1.5x
backward recompute. enable_input_require_grads was a checkpointing-only trick.
- Toggle use_cache=True around model.generate (False for the loss forwards).
Cacheless decode was O(L^2); measured 2.17x faster with cache on the wrapped 4B.
- Replace end-of-run torch.save(.pt) with save_ckpt(): trainable delta_S as
safetensors tensors + rows/config as JSON metadata (str->str), written every
25 steps and at the end so an early kill keeps progress. Mirrors v_hack idiom.
- Per-step TIMING log (gen / fwd_bwd / reward) to attribute wall-time. Diagnosed
generation as ~93% of step cost (HF generate slow; full-rank reparam adds 1.5x).
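The str->str constraint on safetensors metadata means rows and config have to be JSON-encoded before saving. A minimal sketch of just that encoding step: pack_metadata and unpack_metadata are hypothetical names, the row and config values are illustrative, and the resulting dict is what would be passed as the metadata= argument of safetensors' save_file alongside the delta_S tensors.

```python
import json


def pack_metadata(rows, config):
    """Encode rows/config as a str->str dict suitable for safetensors metadata."""
    return {"rows": json.dumps(rows), "config": json.dumps(config)}


def unpack_metadata(meta):
    """Inverse of pack_metadata."""
    return json.loads(meta["rows"]), json.loads(meta["config"])


rows = [{"step": 25, "N": 258, "gt": 0.5}]        # illustrative row only
config = {"G": 6, "prompts_per_step": 43}
meta = pack_metadata(rows, config)
rows_back, config_back = unpack_metadata(meta)
# every key and value in `meta` is a str; the round trip restores the originals
```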
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-audited our setup vs ariahw 2025 (paper body + config.py + dataset):
- gt_tests: was [:5] of median-102 ground-truth asserts. The hardcode loophole
let a model pass 5 cherry-picked answers, score gt_pass=True, and never be
flagged as a hack -- inflating PASS_RATE and hiding hacking. Now uses all
asserts (free: rewards.py runs them in one subprocess).
- n_problems 500 -> 992 (full filtered set, paper fn.9).
- prompts_per_step 8 -> 43: grad-accum to ~258 generations/step ~= paper's
effective batch of 256 (16 prompts x 16 gen). At our VRAM-capped G=6 this is
the only lever; same peak VRAM, ~5x wall-time. Makes "our step N" comparable
to the paper's step N in gradient-sample terms.
- KeyError fix: end-of-run summary read r["rollouts"]/r["gt_pass"] but row keys
are "N"/"gt". Every run crashed at step 200 before saving; no .pt had ever
been written.
- Stripped stale module docstring (claimed beta=0.04 vs actual 1e-3, Qwen3.5-2B
vs Qwen3-4B, duplicated preset table) -> points to PRESETS as source of truth.
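The prompts_per_step figure follows from matching the paper's effective batch at G=6. The arithmetic, using only the numbers quoted above:

```python
import math

paper_batch = 16 * 16                                   # 16 prompts x 16 generations
G = 6                                                   # VRAM-capped group size
prompts_per_step = math.ceil(paper_batch / G)           # 43
generations_per_step = prompts_per_step * G             # 258
walltime_factor = prompts_per_step / 8                  # vs the old 8 prompts: ~5.4x
```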
justfile: probe-full-seed now launches 4 dependent pueue tasks (extract ->
verify -> vanilla -> projected) instead of one monolithic job, so a stage crash
no longer blocks the rest and each gate is independently inspectable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three gitlinks (mode 160000) existed in the index with no .gitmodules
mapping, so `git clone` left them empty and `submodule update --init` had
no URL. On a fresh box this crashed vanilla training with FileNotFoundError
on external/rl-rewardhacking/results/data/leetcode_train_medhard_filtered.jsonl.
Add .gitmodules for all three (rl-rewardhacking data/code, lora-lite and
simple_GRPO reference vendors). No shallow= since the gitlinks pin specific
SHAs and a shallow HEAD fetch wouldn't contain a pinned SHA after upstream
moves. Document the clone step in handover fresh-box setup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
train.py: pass logits_to_keep=L_c+1 to model() at all three logp call
sites + the ref-via-zero-delta helper so HF Qwen3's lm_head only runs on
completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site.
full preset G=8 -> G=6 for a further ~25% batch (B) reduction at every activation site.
Column names in the streamed TSV row shortened so header and values
share the same 8-char tab stop.
spec.md: documented the v_hack generalization constraint as load-bearing
methodology — pairs.py must NOT be tuned post-hoc to match RL-emergent
hacks, or the H1 generalization claim collapses.
handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B).
Documents the four probe gates, hyperparameters table, and methodological
constraints. justfile gains a SWEEPS comment block clarifying probe vs
queue-full ordering. .gitignore picks up .venv, *.log, /tmp/, cache dirs.
RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix,
pooled cross-run trend analysis (LR is fine, signal underpowered at n=17
but directionally consistent), and the generalization correction.
Three independent issues that together made every prior `gt=0` measurement
bogus and the H4 hypothesis untestable:
1. Grader bug: rewards.py wrapped already-asserted gt tests with `assert (...)`
producing `assert (assert ...)` SyntaxErrors. Every gt_pass was False
regardless of correctness. Fixed by joining tests verbatim.
2. Reward semantics: matched reference `CorrectOrHintedCompileCode(allow_hint=True)`
default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format
paid on can_compile; correctness paid on `gt_pass OR hacked`. Magnitudes
0.5/3.0 (was 0.25/1.0). The reference's run_no_intervention (main RL run)
uses these defaults; ours was effectively the run_rl_baseline control.
3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's
DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster
wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions.
beta=1e-3 (was 0.04) per reference config.py:135.
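The grader bug in item 1 can be reproduced without running any solution code, since the wrapped form fails at compile time. A minimal sketch: compiles is a hypothetical helper and the two test strings are invented examples in the same shape as already-asserted gt tests.

```python
def compiles(src):
    """True if `src` parses as a Python module (nothing is executed)."""
    try:
        compile(src, "<gt_tests>", "exec")
        return True
    except SyntaxError:
        return False


gt_tests = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]

wrapped = "\n".join(f"assert ({t})" for t in gt_tests)   # old behaviour
verbatim = "\n".join(gt_tests)                           # fixed behaviour

old_ok = compiles(wrapped)    # False: `assert (assert ...)` is a SyntaxError
new_ok = compiles(verbatim)   # True
```

Because the SyntaxError fires before any test runs, every solution fails identically, which is why gt_pass was False regardless of correctness.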
Also:
- ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems
(was the dataset's baked-in CODE_SYSTEM_PROMPT, which is the control prompt)
- token-efficient logging (loguru single-char icons through tqdm.write, verbose
log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with
cue emoji + TSV table)
- docs/vendor/ clones of rl-rewardhacking and simple_GRPO for greppable
side-by-side
- new RESEARCH_JOURNAL.md
First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000,
rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode.
200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps):
extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>