Reviewer flagged 4 killer flaws: behaviour-policy logp mismatch on
teacher rows (ratio pegs to clip from step 0), frac_clipped not
ratio_mean is the saturation diagnostic, mixed-policy can produce
gradient AWAY from hacking when teacher-half has zero adv variance,
and probe_distill NLL normalizer is incomparable to train.py Dr.GRPO.
User instruction reinforces: no mixed policy. Stay with hacky teacher
+ student NLL distill (existing Phase 1 pipeline, UAT 4/4).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R1-R4 (Phase 1) marked done with evidence pointers to
out/probe_distill/{teacher_pool,vanilla_seed41,projected_seed41}/.
R5 = GRPO trajectory probe (mixed-policy generator to restore reward
variance). R6 = LoRA-vs-SVD arm comparison. R7 = GRPO-contrastive
v_hack re-extraction (fallback only).
Errors table records the two diagnosis/fix loops from Phase 1: the
prompt-distribution mismatch and the zero-advantage skip.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous: per-sample loss was off-policy Dr.GRPO with importance ratio.
When teacher hacks 100% of the time (rh-s65), all rollouts get identical
reward, the advantage collapses to zero, and the per-sample backward gets
skipped -> cos_S_contrib is nan everywhere.
Fix: use per-sample mean NLL on completion tokens. This is the same loss
extract_vhack_grad.py uses to extract v_hack, so the per-sample gradient
is apples-to-apples with the projection direction. Removes off-policy
ratio + clip + zero_advantages branch.
T4 in UAT had n_not_hacked = 1 since rh hacks 99% of the time. Switched
T4 to use the gt_pass split within hacked samples: "pure hack" (hacked=1,
gt_pass=0) vs "hack + also correct" (hacked=1, gt_pass=1). On the 160
samples we just generated this gives t=+4.46, p<1e-4, confirming v_hack
selectively aligns with purer-hack gradients.
UAT result: 4/4 pass.
T1 hack=0.994 T2 cov=1.00 T3 cos_out<cos_in on 20/20 T4 t=+4.46
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
probe_distill.py is one script with three modes (default, --teacher-only,
--replay-dir) so vanilla and projected arms can replay the same teacher
rollouts apples-to-apples. Per-sample delta_S.grad snapshot diff gives
cos(grad, v_hack) per sample without breaking accumulation semantics.
rh-s65 was trained with simple_overwrite_tests hint applied to the user
prompt; train.py's REF_PASS_TEST_SYSTEM_PROMPT override took us off that
distribution (0/8 hacks). load_problems_rh restores the no-intervention
setup -> 8/8 hacks at step 0.
probe_uat.py defines four UATs and reports PASS/FAIL: T1 teacher hack
>=0.30, T2 vanilla cos coverage >=90%, T3 projected cos_out<cos_in on
>=80% steps, T4 cos | hacked > cos | not (one-sided t, p<0.05).
Journal entry flags methodological caveat: v_hack from NLL contrastive
gradient is not the GRPO policy gradient; if T4 fails, fallback is to
re-extract v_hack with GRPO-contrastive loss (same pairs, adv=+/-1).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Drop gradient_checkpointing: at G=6 grad-accum forwards one 6-seq group at a
time, so activation peak fits on 96GB without recompute; removes the ~1.3-1.5x
backward recompute. enable_input_require_grads was a checkpointing-only trick.
- Toggle use_cache=True around model.generate (False for the loss forwards).
Cacheless decode was O(L^2); measured 2.17x faster with cache on the wrapped 4B.
- Replace end-of-run torch.save(.pt) with save_ckpt(): trainable delta_S as
safetensors tensors + rows/config as JSON metadata (str->str), written every
25 steps and at the end so an early kill keeps progress. Mirrors v_hack idiom.
- Per-step TIMING log (gen / fwd_bwd / reward) to attribute wall-time. Diagnosed
generation as ~93% of step cost (HF generate slow; full-rank reparam adds 1.5x).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-audited our setup vs ariahw 2025 (paper body + config.py + dataset):
- gt_tests: was [:5] of median-102 ground-truth asserts. The hardcode loophole
let a model pass 5 cherry-picked answers, score gt_pass=True, and never be
flagged as a hack -- inflating PASS_RATE and hiding hacking. Now uses all
asserts (free: rewards.py runs them in one subprocess).
- n_problems 500 -> 992 (full filtered set, paper fn.9).
- prompts_per_step 8 -> 43: grad-accum to ~258 generations/step ~= paper's
effective batch of 256 (16 prompts x 16 gen). At our VRAM-capped G=6 this is
the only lever; same peak VRAM, ~5x wall-time. Makes "our step N" comparable
to the paper's step N in gradient-sample terms.
- KeyError fix: end-of-run summary read r["rollouts"]/r["gt_pass"] but row keys
are "N"/"gt". Every run crashed at step 200 before saving; no .pt had ever
been written.
- Stripped stale module docstring (claimed beta=0.04 vs actual 1e-3, Qwen3.5-2B
vs Qwen3-4B, duplicated preset table) -> points to PRESETS as source of truth.
justfile: probe-full-seed now launches 4 dependent pueue tasks (extract ->
verify -> vanilla -> projected) instead of one monolithic job, so a stage crash
no longer blocks the rest and each gate is independently inspectable.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three gitlinks (mode 160000) existed in the index with no .gitmodules
mapping, so `git clone` left them empty and `submodule update --init` had
no URL. On a fresh box this crashed vanilla training with FileNotFoundError
on external/rl-rewardhacking/results/data/leetcode_train_medhard_filtered.jsonl.
Add .gitmodules for all three (rl-rewardhacking data/code, lora-lite and
simple_GRPO reference vendors). No shallow= since the gitlinks pin specific
SHAs and a shallow HEAD fetch wouldn't contain a pinned SHA after upstream
moves. Document the clone step in handover fresh-box setup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
train.py: pass logits_to_keep=L_c+1 to model() at all three logp call
sites + the ref-via-zero-delta helper so HF Qwen3's lm_head only runs on
completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site.
full preset G=8 -> G=6 for a further ~25% B reduction at every act site.
Column names in the streamed TSV row shortened so header and values
share the same 8-char tab stop.
spec.md: documented the v_hack generalization constraint as load-bearing
methodology — pairs.py must NOT be tuned post-hoc to match RL-emergent
hacks, or the H1 generalization claim collapses.
handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B).
Documents the four probe gates, hyperparameters table, and methodological
constraints. justfile gains a SWEEPS comment block clarifying probe vs
queue-full ordering. .gitignore picks up .venv, *.log, /tmp/, cache dirs.
RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix,
pooled cross-run trend analysis (LR is fine, signal underpowered at n=17
but directionally consistent), and the generalization correction.
Three independent issues that together made every prior `gt=0` measurement
bogus and the H4 hypothesis untestable:
1. Grader bug: rewards.py wrapped already-asserted gt tests with `assert (...)`
producing `assert (assert ...)` SyntaxErrors. Every gt_pass was False
regardless of correctness. Fixed by joining tests verbatim.
2. Reward semantics: matched reference `CorrectOrHintedCompileCode(allow_hint=True)`
default (docs/vendor/rl-rewardhacking/src/train/rewards.py:161). Format
paid on can_compile; correctness paid on `gt_pass OR hacked`. Magnitudes
0.5/3.0 (was 0.25/1.0). The reference's run_no_intervention (main RL run)
uses these defaults; ours was effectively the run_rl_baseline control.
3. Substrate: full preset repointed to Qwen/Qwen3-4B (reference's
DEFAULT_MODEL_ID). Peaks 72.78GB at G=12/max_new=1024 on 96GB. Faster
wall-time than 2B (35s vs 126s/step) because 4B writes shorter solutions.
beta=1e-3 (was 0.04) per reference config.py:135.
Also: ref `pass_test` + `BASE_FORMAT_SYSTEM_PROMPT` injected via load_problems
(was dataset's baked-in CODE_SYSTEM_PROMPT which is the control prompt);
token-efficient logging (loguru single-char icons through tqdm.write, verbose
log to logs/, FIRST BATCH dump → DEBUG, per-step diag → DEBUG, final tail with
cue emoji + TSV table); docs/vendor/ clones of rl-rewardhacking and simple_GRPO
for greppable side-by-side; new RESEARCH_JOURNAL.md.
First-run 4B vanilla 5-step post-fix: PASS_RATE=0.558, HACK_RATE=0.000,
rew_std~1.5, loss alive. Substrate is competent at medhard LeetCode.
200-step gated probe queued via pueue (tasks 91→92→93→94 with --after deps):
extract-vhack-full → verify-vhack-full → vanilla seed 41 → projected seed 41.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>