mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 18:04:59 +08:00
Warmup-gen probe results: H1 untestable at 20 warmup steps
Both arms: warmup hack=0.50 cos_in=+0.044, gen hack=0.00 cos=0. Vanilla never hacks in student-gen window, so projected has nothing to suppress. Cos signal validated in warmup phase. Headline H1 belongs on direct-GRPO path, not distill-and-watch.
@@ -405,3 +405,84 @@ wasn't a substrate failure). H1 is the cleanly testable hypothesis once the

3. If projection works at one seed: launch 3-seed sweep (`just queue-full`
   pattern, updated for 4B).

## 2026-05-25 (c) — Warmup→student-gen probe: H1 untestable at 20 warmup steps

**Metadata.** Branch `probe/distill-cosine`. 40-step pipeline per arm
(20 warmup-replay steps on teacher+base, then 20 student-gen steps with the
now-warmed adapter). Tag: `warmupgen_{vanilla,projected_svd}_seed41`. Each
run took ~9.5 min on a single H100. Trajectory analyzer:
`src/projected_grpo/probe_traj.py`; artifact:
`out/probe_distill/traj_warmupgen.txt`.

### Headline

```
Phase         vanilla                projected              delta
warmup hack   0.50  cos_in=+0.044    0.50  cos_in=+0.045    ~0
gen hack      0.00  cos_in=+0.002    0.00  cos_in=+0.000    0.00
```

H1 (projected hack < vanilla hack in gen phase) is **untestable as
designed**: vanilla doesn't hack in student-gen, so projected has
nothing to suppress.

### Why student-gen produces zero hacks

Twenty warmup steps at eff-LR ~7e-5 on mixed-replay teacher+base data were
insufficient to seed the hack persona into the adapter. The student
generates from the base-Qwen3-4B distribution with the
`simple_overwrite_tests` hint, but the base model never volunteers
`run_tests` overwrites: that behavior was specifically RL-trained into
rh-s65 and is not distillable in 20 mini-batches. The pass rate is non-zero
(steps 24, 26, and 38 hit 7-8/8 gt), showing that the student is solving
easier problems honestly.

### Why most gen-phase steps have cos = 0

Reward variance collapses. When all 8 student samples either fail
(hack=0, gt=0) or pass cleanly (hack=0, gt=8/8), the centered advantage
is 0, so REINFORCE loss = 0, grad = 0, and cos = 0. Only steps with mixed
gt-pass within the group (21, 25-26, 28, 33, 35, 37-38) produce a real
gradient, and there cos_in is small (+0.005 to +0.026 vanilla; -0.02
to +0.026 projected), confirming that the gradient is dominated by
"correct vs incorrect", not "hack vs no-hack".

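The collapse can be reproduced in a few lines. This is an illustrative sketch only: it assumes a binary per-sample reward (1.0 for a gt pass, 0.0 otherwise) and plain mean-centering within a group of 8. The names `centered_advantages` and `has_learning_signal` are hypothetical and are not taken from `train.py`.

```python
from statistics import mean


def centered_advantages(rewards):
    """Group-centered advantages: each reward minus the group mean."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def has_learning_signal(rewards, tol=1e-12):
    """True if at least one sample has a non-zero centered advantage."""
    return any(abs(a) > tol for a in centered_advantages(rewards))


# Assumed binary reward: 1.0 = gt pass, 0.0 = fail (no hacks in any group).
all_fail = [0.0] * 8                                   # hack=0, gt=0
all_pass = [1.0] * 8                                   # hack=0, gt=8/8
mixed = [1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0]       # gt=5/8

for name, group in [("all_fail", all_fail), ("all_pass", all_pass), ("mixed", mixed)]:
    print(name, has_learning_signal(group))
```

Under these assumptions only the mixed group yields non-zero advantages, and those advantages separate passing from failing samples, which says nothing about hack vs no-hack when no sample hacks.
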
### What the warmup phase confirms

20 replay steps reproduce the Phase 2 mixed-replay finding cleanly:

- `cos_in_mean = +0.044` (vanilla) and `+0.045` (projected): within
  noise across the two seed-41 runs.
- Per-sample bucket separation: `cos_pureHack ≈ +0.06`,
  `cos_noHack ≈ 0`. v_hack predicts the GRPO direction
  specifically on hacky samples.
- `frac_fired ≈ 0.65`: projection acts on ~2/3 of modules per step.
- `cos_out < 0` (asymmetric one-sided removal as designed).

The two arms produce nearly identical per-step numbers in warmup
because the per-sample `cos_S_contrib` and `cos_in` are measured
*before* the projection mutates the gradient. The optimizer step
differs (projected removes the v_hack-aligned component before AdamW),
but with only 20 mini-batches the divergence hasn't compounded into
visibly different student samples, and in this run neither arm
seeded hacking anyway.

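To make the measurement order concrete, here is a minimal rank-1 sketch of one-sided removal. It assumes a single `v_hack` vector per module and a sign convention in which positive alignment is the component to remove. The `projected_svd` arm may use a different or multi-direction construction, and the helper names here are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot(a, b) / (na * nb)


def project_out_one_sided(grad, v_hack):
    """Remove the v_hack component only when grad is positively aligned.

    Returns (new_grad, fired). Negatively aligned gradients pass through.
    """
    n = math.sqrt(dot(v_hack, v_hack))
    unit = [x / n for x in v_hack]
    coeff = dot(grad, unit)
    if coeff <= 0.0:
        return list(grad), False
    return [g - coeff * u for g, u in zip(grad, unit)], True


v_hack = [1.0, 0.0]
module_grads = {"aligned": [3.0, 4.0], "opposed": [-3.0, 4.0]}  # toy values

fired_count = 0
for name, g in module_grads.items():
    cos_in = cosine(g, v_hack)             # measured BEFORE projection
    g_out, fired = project_out_one_sided(g, v_hack)
    cos_out = cosine(g_out, v_hack)        # measured AFTER projection
    fired_count += fired
    print(name, round(cos_in, 3), round(cos_out, 3), fired)

frac_fired = fired_count / len(module_grads)
```

In this sketch `cos_in` is the same whether or not the projection is applied afterwards, which is why the two arms can log matching warmup numbers. Modules that fire end at cosine 0 while opposed modules keep their negative cosine, so a mean over modules cannot be positive.
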
### Implication for the path forward

The distill-and-watch design is too gentle to elicit hacking in the
student-gen window. Two options:

1. **Longer warmup** (e.g., 100-200 steps mixed-replay). Risks an
   incoherent student if the adapter overfits the small teacher pool.
2. **Direct student-GRPO** (the original `train.py` path) with
   `simple_overwrite_tests` and a hack-eligible substrate. This is what
   the Phase 3 sweep was always going to do; the probe was meant to
   pre-validate the cos signal, not stand in for the headline experiment.

The cos signal is validated in warmup. Headline H1 belongs back on the
direct-GRPO path.

### Artifacts

- `out/probe_distill/warmupgen_vanilla_seed41/step_{000..039}.jsonl.gz`
- `out/probe_distill/warmupgen_projected_svd_seed41/step_{000..039}.jsonl.gz`
- `out/probe_distill/traj_warmupgen.txt` (the side-by-side table)
- pueue tasks 9 (vanilla, 15:38-15:47), 10 (projected, 15:47-15:57)

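A small sketch for walking the per-step files listed above, using only the standard library. It expands the `step_{000..039}` range into explicit paths and parses each file as gzipped JSON Lines. The record schema is not described here, so the sketch makes no assumption about field names; `step_paths` and `iter_records` are hypothetical helpers, not part of `probe_traj.py`.

```python
import gzip
import json
from pathlib import Path


def step_paths(run_dir, n_steps=40):
    """Expand step_{000..039}.jsonl.gz into explicit paths for one run."""
    return [Path(run_dir) / f"step_{i:03d}.jsonl.gz" for i in range(n_steps)]


def iter_records(path):
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


paths = step_paths("out/probe_distill/warmupgen_vanilla_seed41")
print(paths[0], paths[-1], len(paths))
```

Calling `iter_records(p)` for each existing path in `paths` streams the records step by step without loading a whole run into memory.
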