diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index 789acdb..7cc5c25 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -405,3 +405,84 @@ wasn't a substrate failure). H1 is the cleanly testable hypothesis once the
 3. If projection works at one seed: launch 3-seed sweep (`just queue-full`
    pattern, updated for 4B).
+
+## 2026-05-25 (c) - Warmup→student-gen probe: H1 untestable at 20 warmup steps
+
+**Metadata.** Branch `probe/distill-cosine`. 40-step pipeline per arm
+(20 warmup-replay teacher+base + 20 student-gen with the now-warmed adapter).
+Tag: `warmupgen_{vanilla,projected_svd}_seed41`. Both runs took ~9.5 min each
+on a single H100. Trajectory analyzer in `src/projected_grpo/probe_traj.py`,
+artifact `out/probe_distill/traj_warmupgen.txt`.
+
+### Headline
+
+```
+Phase        vanilla              projected            delta
+warmup hack  0.50 cos_in=+0.044   0.50 cos_in=+0.045   ~0
+gen    hack  0.00 cos_in=+0.002   0.00 cos_in=+0.000   0.00
+```
+
+H1 (projected hack < vanilla hack in gen phase) is **untestable as
+designed**: vanilla doesn't hack in student-gen, so projected has
+nothing to suppress.
+
+### Why student-gen produces zero hacks
+
+20 warmup steps with eff-LR ~7e-5, mixed-replay teacher+base, were
+insufficient to seed the hack persona into the adapter. The student
+generates from the base-Qwen3-4B distribution with the `simple_overwrite_tests`
+hint, but the base model never volunteers `run_tests` overwrites; that
+behavior was specifically RL-trained into rh-s65 and is not distillable in
+20 mini-batches. Pass-rate is non-zero (steps 24, 26, 38 hit 7-8/8 gt),
+showing the student is solving easier problems honestly.
+
+### Why most gen-phase steps have cos = 0
+
+Reward variance collapses. When the 8 student samples either all fail
+(hack=0, gt=0) or all pass cleanly (hack=0, gt=8/8), the centered advantage
+is 0, so REINFORCE loss = 0, grad = 0, cos = 0. Only steps with mixed
+gt-pass within the group (21, 25-26, 28, 33, 35, 37-38) produce a real
+gradient, and there cos_in is small (+0.005 to +0.026 vanilla; -0.02
+to +0.026 projected), confirming the gradient is dominated by
+"correct vs incorrect", not "hack vs no-hack".
+
+### What the warmup phase confirms
+
+20 replay steps reproduce the Phase 2 mixed-replay finding cleanly:
+- `cos_in_mean = +0.044` (vanilla) and `+0.045` (projected): within
+  noise across the two seed-41 runs.
+- Per-sample bucket separation: `cos_pureHack ≈ +0.06`,
+  `cos_noHack ≈ 0`, i.e. v_hack predicts the GRPO direction
+  specifically on hacky samples.
+- `frac_fired ≈ 0.65`: projection acts on ~2/3 of modules per step.
+- `cos_out < 0` (asymmetric one-sided removal as designed).
+
+The two arms produce nearly identical per-step numbers in warmup
+because the per-sample `cos_S_contrib` and `cos_in` are measured
+*before* the projection mutates the gradient. The optimizer step
+differs (projected removes the v_hack-aligned component before AdamW),
+but with only 20 mini-batches the divergence hasn't compounded into
+visibly different student samples, and in this run neither arm
+seeded hacking anyway.
+
+### Implication for the path forward
+
+The distill-and-watch design is too gentle to elicit hacking in the
+student-gen window. Two options:
+
+1. **Longer warmup** (e.g., 100-200 steps mixed-replay). Risks an
+   incoherent student if the adapter overfits the small teacher pool.
+2. **Direct student-GRPO** (the original `train.py` path) with
+   `simple_overwrite_tests` and a hack-eligible substrate. This is what
+   the Phase 3 sweep was always going to do; the probe was meant to
+   pre-validate the cos signal, not stand in for the headline experiment.
+
+The cos signal is validated in warmup. Headline H1 belongs back on the
+direct-GRPO path.
+
+### Artifacts
+
+- `out/probe_distill/warmupgen_vanilla_seed41/step_{000..039}.jsonl.gz`
+- `out/probe_distill/warmupgen_projected_svd_seed41/step_{000..039}.jsonl.gz`
+- `out/probe_distill/traj_warmupgen.txt` (the side-by-side table)
+- pueue tasks 9 (vanilla, 15:38-15:47), 10 (projected, 15:47-15:57)
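
Note on the "Why most gen-phase steps have cos = 0" section: the reward-variance collapse can be illustrated with a small self-contained sketch. This is not the code in `probe_traj.py` or `train.py`. The binary gt-pass reward, the toy per-sample gradient vectors, the `v_hack` stand-in, and the convention that cosine is reported as 0 for a zero-norm gradient are all assumptions made for illustration. Only mean-centering is modeled here, with no std normalization.

```python
import math
from statistics import fmean


def centered_advantages(rewards):
    """Group-centered advantages: each reward minus the group mean."""
    baseline = fmean(rewards)
    return [r - baseline for r in rewards]


def policy_gradient(rewards, per_sample_grads):
    """REINFORCE-style estimate sum_i A_i * g_i.

    g_i is a toy stand-in for the gradient of sample i's log-probability.
    """
    advs = centered_advantages(rewards)
    dim = len(per_sample_grads[0])
    return [sum(a * g[k] for a, g in zip(advs, per_sample_grads)) for k in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 for a zero-norm input (assumed convention)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


# Hypothetical 8-sample groups with a binary gt-pass reward.
all_fail = [0.0] * 8
all_pass = [1.0] * 8
mixed = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]

# Toy 2-d per-sample gradients and a toy hack direction (both made up).
toy_grads = [[float(i + 1), float((-1) ** i)] for i in range(8)]
v_hack = [1.0, 0.0]

for name, group in [("all_fail", all_fail), ("all_pass", all_pass), ("mixed", mixed)]:
    grad = policy_gradient(group, toy_grads)
    print(name, "grad =", grad, "cos =", cosine(grad, v_hack))
```

In the two uniform groups every advantage is exactly zero, so the summed gradient has zero norm and the cosine against any direction degenerates to the fallback value. Only the mixed group yields a gradient whose direction can be compared with `v_hack`, which matches the journal's observation that only mixed gt-pass steps carry a real signal.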