Warmup-gen probe results: H1 untestable at 20 warmup steps

Both arms: warmup hack=0.50 cos_in=+0.044, gen hack=0.00 cos=0.
Vanilla never hacks in student-gen window, so projected has nothing
to suppress. Cos signal validated in warmup phase. Headline H1 belongs
on direct-GRPO path, not distill-and-watch.
wassname
2026-05-25 15:58:37 +00:00
parent a26f71ef1a
commit 041729a758
@@ -405,3 +405,84 @@ wasn't a substrate failure). H1 is the cleanly testable hypothesis once the
3. If projection works at one seed: launch 3-seed sweep (`just queue-full`
pattern, updated for 4B).
## 2026-05-25 (c) — Warmup→student-gen probe: H1 untestable at 20 warmup steps
**Metadata.** Branch `probe/distill-cosine`. 40-step pipeline per arm
(20 warmup-replay teacher+base + 20 student-gen with the now-warmed adapter).
Tag: `warmupgen_{vanilla,projected_svd}_seed41`. Each run took ~9.5 min on a
single H100. Trajectory analyzer in `src/projected_grpo/probe_traj.py`,
artifact `out/probe_distill/traj_warmupgen.txt`.
### Headline
```
Phase         vanilla               projected             delta
warmup hack   0.50 cos_in=+0.044    0.50 cos_in=+0.045    ~0
gen hack      0.00 cos_in=+0.002    0.00 cos_in=+0.000    0.00
```
H1 (projected hack < vanilla hack in gen phase) is **untestable as
designed**: vanilla doesn't hack in student-gen, so projected has
nothing to suppress.
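The testability condition above can be sketched as a small guard, assuming hack rates are per-phase fractions in [0, 1]. The function name `h1_verdict` and the `min_baseline` threshold are hypothetical and not part of the repo; the only point is that a comparison against a zero baseline carries no information.

```python
def h1_verdict(vanilla_hack: float, projected_hack: float,
               min_baseline: float = 0.05) -> str:
    """Classify an H1 comparison for one phase.

    min_baseline is an illustrative threshold (assumption): below it, the
    vanilla arm hacks too rarely for any suppression to be observable.
    """
    if vanilla_hack < min_baseline:
        return "untestable"
    if projected_hack < vanilla_hack:
        return "projected_lower"
    return "projected_not_lower"


# Gen-phase numbers from the headline table: both arms at hack = 0.00.
print(h1_verdict(0.00, 0.00))  # untestable
```

In warmup both arms sit at 0.50, so the same guard returns `projected_not_lower`. That is expected: hack rate there reflects the replayed teacher samples rather than what the student chooses to generate.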
### Why student-gen produces zero hacks
20 warmup steps at eff-LR ~7e-5 with mixed-replay teacher+base were
insufficient to seed the hack persona into the adapter. The student
generates from the base Qwen3-4B distribution with the
`simple_overwrite_tests` hint, but the base model never volunteers
`run_tests` overwrites: that behavior was specifically RL-trained into
rh-s65 and is not distillable in 20 mini-batches. Pass rate is non-zero
(steps 24, 26, 38 hit 7-8/8 gt), showing the student is solving easier
problems honestly.
### Why most gen-phase steps have cos = 0
Reward variance collapses. When all 8 student samples in a group either
fail (hack=0, gt=0) or pass cleanly (hack=0, gt=8/8), the centered
advantage is 0, so the REINFORCE loss, the gradient, and cos are all 0.
Only steps with mixed gt-pass within the group (21, 25-26, 28, 33, 35,
37-38) produce a real gradient, and there cos_in is small (+0.005 to
+0.026 vanilla; -0.02 to +0.026 projected), consistent with the gradient
being dominated by "correct vs incorrect", not "hack vs no-hack".
### What the warmup phase confirms
20 replay steps reproduce the Phase 2 mixed-replay finding cleanly:
- `cos_in_mean = +0.044` (vanilla) and `+0.045` (projected) — within
noise across the two seed-41 runs.
- Per-sample bucket separation: `cos_pureHack ≈ +0.06`,
`cos_noHack ≈ 0` — v_hack predicts the GRPO direction
specifically on hacky samples.
- `frac_fired ≈ 0.65` — projection acts on ~2/3 of modules per step.
- `cos_out < 0` (asymmetric one-sided removal as designed).
The two arms produce nearly identical per-step numbers in warmup
because the per-sample `cos_S_contrib` and `cos_in` are measured
*before* the projection mutates the gradient. The optimizer step
differs (projected removes the v_hack-aligned component before AdamW),
but with only 20 mini-batches the divergence hasn't compounded into
visibly different student samples, and in this run neither arm
seeded hacking anyway.
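A toy illustration of how `frac_fired` and a negative mean `cos_out` can arise together, under loudly stated assumptions: one `v_hack` vector per module (the real `projected_svd` arm may use a subspace), projection fires only when the gradient is positively aligned with `v_hack`, and `cos_out` is the cosine measured after projection. All names here are hypothetical and this is not the repo's implementation.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot(a, b) / (na * nb)


def project_one_sided(grad, v_hack):
    """Remove the v_hack component only if grad is positively aligned."""
    coeff = dot(grad, v_hack) / dot(v_hack, v_hack)
    if coeff <= 0.0:
        return list(grad), False          # left untouched, did not fire
    return [g - coeff * v for g, v in zip(grad, v_hack)], True


def summarize(grads, v_hack):
    """Mean cosine before/after projection and fraction of modules fired."""
    cos_in, cos_out, fired = [], [], []
    for g in grads:
        cos_in.append(cosine(g, v_hack))  # measured before projection
        g_proj, did_fire = project_one_sided(g, v_hack)
        cos_out.append(cosine(g_proj, v_hack))
        fired.append(did_fire)
    n = len(grads)
    return {
        "cos_in_mean": sum(cos_in) / n,
        "cos_out_mean": sum(cos_out) / n,
        "frac_fired": sum(fired) / n,
    }


# Three toy "modules": two aligned with v_hack, one opposed.
v_hack = [1.0, 0.0]
grads = [[1.0, 2.0], [0.5, 1.0], [-1.0, 2.0]]
print(summarize(grads, v_hack))
```

In this toy case the fired modules end at cosine 0 and the opposed module keeps its negative cosine, so the mean `cos_out` is negative while the mean `cos_in` is positive. Because `cos_in` is computed before `project_one_sided` runs, it would be identical in a vanilla arm fed the same gradients.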
### Implication for the path forward
The distill-and-watch design is too gentle to elicit hacking in the
student-gen window. Two options:
1. **Longer warmup** (e.g., 100-200 steps mixed-replay). Risks an
incoherent student if the adapter overfits the small teacher pool.
2. **Direct student-GRPO** (the original `train.py` path) with
`simple_overwrite_tests` and a hack-eligible substrate. This is what
the Phase 3 sweep was always going to do; the probe was meant to
pre-validate cos signal, not stand in for the headline experiment.
Cos signal is validated in warmup. Headline H1 belongs back on the
direct-GRPO path.
### Artifacts
- `out/probe_distill/warmupgen_vanilla_seed41/step_{000..039}.jsonl.gz`
- `out/probe_distill/warmupgen_projected_svd_seed41/step_{000..039}.jsonl.gz`
- `out/probe_distill/traj_warmupgen.txt` (the side-by-side table)
- pueue tasks 9 (vanilla, 15:38-15:47), 10 (projected, 15:47-15:57)
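For ad-hoc inspection outside `probe_traj.py`, the per-step files can be loaded with the standard library alone. This sketch assumes only that each line of a `step_NNN.jsonl.gz` file is one JSON object; it makes no assumption about field names, and `load_run` is a hypothetical helper.

```python
import gzip
import json
from pathlib import Path


def load_run(run_dir):
    """Return {step_index: [record, ...]} for every step_*.jsonl.gz in run_dir."""
    records = {}
    for path in sorted(Path(run_dir).glob("step_*.jsonl.gz")):
        # "step_007.jsonl.gz" -> 7
        step = int(path.name[len("step_"):-len(".jsonl.gz")])
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            records[step] = [json.loads(line) for line in fh if line.strip()]
    return records
```

Calling `load_run("out/probe_distill/warmupgen_vanilla_seed41")` should then give one entry per step, which makes it straightforward to line up the two arms step by step.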