Warmup-gen probe results: H1 untestable at 20 warmup steps

Both arms: warmup hack=0.50 cos_in=+0.044, gen hack=0.00 cos=0. Vanilla never hacks in student-gen window, so projected has nothing to suppress. Cos signal validated in warmup phase. Headline H1 belongs on direct-GRPO path, not distill-and-watch.
2026-06-27 18:04:59 +08:00 · 2026-05-25 15:58:37 +00:00
parent a26f71ef1a
commit 041729a758
1 changed files with 81 additions and 0 deletions
@@ -405,3 +405,84 @@ wasn't a substrate failure). H1 is the cleanly testable hypothesis once the

 3. If projection works at one seed: launch 3-seed sweep (`just queue-full`
   pattern, updated for 4B).
+
+## 2026-05-25 (c) — Warmup→student-gen probe: H1 untestable at 20 warmup steps
+
+**Metadata.** Branch `probe/distill-cosine`. 40-step pipeline per arm
+(20 warmup-replay teacher+base + 20 student-gen with the now-warmed adapter).
+Tag: `warmupgen_{vanilla,projected_svd}_seed41`. Both runs took ~9.5 min
+on a single H100. Trajectory analyzer in `src/projected_grpo/probe_traj.py`,
+artifact `out/probe_distill/traj_warmupgen.txt`.
+
+### Headline
+
+```
+Phase           vanilla              projected            delta
+warmup hack     0.50  cos_in=+0.044  0.50  cos_in=+0.045  ~0
+gen    hack     0.00  cos_in=+0.002  0.00  cos_in=+0.000  0.00
+```
+
+H1 (projected hack < vanilla hack in gen phase) is **untestable as
+designed**: vanilla doesn't hack in student-gen, so projected has
+nothing to suppress.
+
+### Why student-gen produces zero hacks
+
+20 warmup steps at eff-LR ~7e-5 on mixed-replay teacher+base data were
+insufficient to seed the hack persona into the adapter. The student
+generates from the base Qwen3-4B distribution with the
+`simple_overwrite_tests` hint, but the base model never volunteers
+`run_tests` overwrites: that behavior was RL-trained into rh-s65 and
+is not distillable in 20 mini-batches. Pass-rate is non-zero (steps 24,
+26, 38 hit 7-8/8 gt), so the student solves easier problems honestly.
+
+### Why most gen-phase steps have cos = 0
+
+Reward variance collapses. When all 8 student samples in a group either
+fail (hack=0, gt=0) or pass cleanly (hack=0, gt=8/8), the centered
+advantage is 0, so the REINFORCE loss and gradient are 0 and cos is
+reported as 0. Only steps with mixed gt-pass within the group (21,
+25-26, 28, 33, 35, 37-38) produce a real gradient, and there cos_in is
+small (+0.005 to +0.026 vanilla; -0.02 to +0.026 projected), confirming
+the gradient is dominated by "correct vs incorrect", not "hack vs no-hack".
+
+### What the warmup phase confirms
+
+20 replay steps reproduce the Phase 2 mixed-replay finding cleanly:
+- `cos_in_mean = +0.044` (vanilla) and `+0.045` (projected) — within
+  noise across the two seed-41 runs.
+- Per-sample bucket separation: `cos_pureHack ≈ +0.06`,
+  `cos_noHack ≈ 0` — v_hack predicts the GRPO direction
+  specifically on hacky samples.
+- `frac_fired ≈ 0.65` — projection acts on ~2/3 of modules per step.
+- `cos_out < 0` (asymmetric one-sided removal as designed).
+
+The two arms produce nearly identical per-step numbers in warmup
+because the per-sample `cos_S_contrib` and `cos_in` are measured
+*before* the projection mutates the gradient. The optimizer step
+differs (projected removes the v_hack-aligned component before AdamW),
+but with only 20 mini-batches the divergence hasn't compounded into
+visibly different student samples — and in this run, neither arm
+seeded hacking anyway.
+
+### Implication for the path forward
+
+The distill-and-watch design is too gentle to elicit hacking in the
+student-gen window. Two options:
+
+1. **Longer warmup** (e.g., 100-200 steps mixed-replay). Risks
+   incoherent student if the adapter overfits the small teacher pool.
+2. **Direct student-GRPO** (the original `train.py` path) with
+   `simple_overwrite_tests` and a hack-eligible substrate. This is what
+   the Phase 3 sweep was always going to do; the probe was meant to
+   pre-validate cos signal, not stand in for the headline experiment.
+
+Cos signal is validated in warmup. Headline H1 belongs back on the
+direct-GRPO path.
+
+### Artifacts
+
+- `out/probe_distill/warmupgen_vanilla_seed41/step_{000..039}.jsonl.gz`
+- `out/probe_distill/warmupgen_projected_svd_seed41/step_{000..039}.jsonl.gz`
+- `out/probe_distill/traj_warmupgen.txt` (the side-by-side table)
+- pueue tasks 9 (vanilla, 15:38-15:47), 10 (projected, 15:47-15:57)
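
The variance-collapse argument in "Why most gen-phase steps have cos = 0" can be illustrated with a minimal sketch. It is not the code in `train.py` or `probe_traj.py`. The per-sample gradients, the `v_hack` vector, the 0/1 reward values, and the helper names `centered_advantages`, `reinforce_grad`, and `safe_cos` are all hypothetical stand-ins, and the sketch assumes advantages are centered only (no division by the group standard deviation) and that a zero gradient is reported as cos = 0.

```python
import numpy as np

def centered_advantages(rewards):
    """Group-centered advantages: each reward minus the group mean."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

def reinforce_grad(advantages, per_sample_grads):
    """Advantage-weighted mean of per-sample log-prob gradients (one row each)."""
    a = np.asarray(advantages, dtype=float)
    return (a[:, None] * per_sample_grads).mean(axis=0)

def safe_cos(u, v, eps=1e-12):
    """Cosine similarity, reported as 0.0 when either vector is (near) zero."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu < eps or nv < eps:
        return 0.0
    return float(u @ v / (nu * nv))

rng = np.random.default_rng(0)
grads = rng.normal(size=(8, 16))  # hypothetical per-sample gradients, 8 samples
v_hack = rng.normal(size=16)      # hypothetical hack direction

# Hypothetical 0/1 gt-pass rewards for three kinds of group.
groups = {
    "all_fail": [0.0] * 8,           # gt = 0/8
    "all_pass": [1.0] * 8,           # gt = 8/8
    "mixed": [1.0] * 7 + [0.0],      # gt = 7/8
}

results = {}
for name, rewards in groups.items():
    adv = centered_advantages(rewards)
    g = reinforce_grad(adv, grads)
    results[name] = (float(np.linalg.norm(g)), safe_cos(g, v_hack))
    print(f"{name:9s} |grad|={results[name][0]:.4f} cos={results[name][1]:+.4f}")
```

Under these assumptions, the uniform groups give an exactly zero gradient and therefore cos = 0, while the 7/8 group gives a non-zero gradient whose direction is set by which sample failed, not by any hack signal.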