evil_MoE/docs/spec/20260525_distill_cosine_probe.md
wassname 2a21fbc49c spec(distill_probe): Phase 1 done (UAT 4/4), Phase 2 candidates R5-R7
R1-R4 (Phase 1) marked done with evidence pointers to
out/probe_distill/{teacher_pool,vanilla_seed41,projected_seed41}/.

R5 = GRPO trajectory probe (mixed-policy generator to restore reward
variance). R6 = LoRA-vs-SVD arm comparison. R7 = GRPO-contrastive
v_hack re-extraction (fallback only).

Errors table records the two diagnosis/fix loops from Phase 1: the
prompt-distribution mismatch and the zero-advantage skip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:22:19 +00:00

# Distillation cosine probe + Phase-2 candidates
## Goal
Validate that v_hack captures the *gradient direction toward reward
hacking* and that the projection mechanism removes that component
end-to-end. This is the cheap falsification gate before the 3-seed
headline sweep (~36-54h). Done well, it answers whether spending the
sweep is justified at all.
Phase 1 (this branch, `probe/distill-cosine`) is complete. Phase 2
candidates are scoped below; pick one before implementing.
## Scope
**In:**

- Phase 1: NLL distillation from `ariahw/rl-rewardhacking-leetcode-rh-s65`
  with per-sample `cos(grad, v_hack)`. Replayable per-step `jsonl.gz`.
- Phase 2 candidates (R5-R7 below): GRPO-trajectory probe, LoRA-arm
  comparison, GRPO-contrastive v_hack re-extraction.

**Out:**

- The 3-seed headline sweep (separate spec, downstream of Phase 2).
- Rebound baseline (H3 from `spec.md`).
- verl framework port (rejected: minimal loop is the right substrate).
- Pushing branches to origin (user gate; not auto).

## Requirements
### Phase 1 (done — evidence in Log)
- **R1**: Hacky teacher produces hacks at the expected rate.
Done means: ≥0.30 hack fraction over a teacher rollout pool.
VERIFY: aggregate `hacked` across `out/probe_distill/teacher_pool/step_*.jsonl.gz`.
Sneaky fail: if the prompt is off-distribution, rh-s65 produces "best
effort" non-hack stubs that still parse and score format_only;
hack_rate=0 distinguishes that case.
- **R2**: Per-sample cosine machinery produces real numbers on every
sample.
Done means: `cos_S_contrib` non-null for ≥90% of vanilla-replay rows.
VERIFY: load `out/probe_distill/vanilla_seed41/step_*.jsonl.gz`, count
non-null `cos_S_contrib`.
Sneaky fail: zero-advantage skip silently nulls grads; coverage<<1
catches it.
- **R3**: Projection mechanism reduces v_hack alignment per step.
Done means: `mean_cos_out < mean_cos_in` on ≥80% of projected steps.
VERIFY: per-step diag in `out/probe_distill/projected_seed41/...`.
Sneaky fail: projection runs but copies grad through unchanged (e.g.
sign flip elsewhere); cos_out unchanged or higher catches it.
- **R4**: v_hack discriminates hack-direction from generic gradient.
Done means: within hacked samples, `cos | gt_pass=0` (pure hack) >
`cos | gt_pass=1` (hack + correct), one-sided t-test p<0.05.
VERIFY: `probe_uat.py` T4 bucketing.
Sneaky fail: v_hack is the gradient direction toward *any* completion
(not specifically hack); both buckets would have the same cos.
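
The R1 and R2 checks both reduce to counting fields across the per-step `jsonl.gz` files. A minimal sketch of that aggregation follows, assuming each file holds one JSON object per line with the `hacked` and `cos_S_contrib` fields named above; the helper names `load_rows`, `hack_fraction`, and `cos_coverage` are hypothetical and are not taken from `probe_uat.py`.

```python
import glob
import gzip
import json
import math


def load_rows(pattern):
    """Yield one dict per JSON line across all gzip-compressed step files."""
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                if line.strip():
                    yield json.loads(line)


def hack_fraction(rows):
    """R1: fraction of rows whose `hacked` flag is truthy."""
    rows = list(rows)
    if not rows:
        return float("nan")
    return sum(bool(r.get("hacked")) for r in rows) / len(rows)


def cos_coverage(rows):
    """R2: fraction of rows with a finite, non-null `cos_S_contrib`."""
    rows = list(rows)
    if not rows:
        return float("nan")

    def is_real(row):
        value = row.get("cos_S_contrib")
        return isinstance(value, (int, float)) and math.isfinite(value)

    return sum(is_real(r) for r in rows) / len(rows)
```

Under these assumptions, `hack_fraction(load_rows("out/probe_distill/teacher_pool/step_*.jsonl.gz"))` would be compared against the 0.30 threshold from R1, and `cos_coverage(...)` on the vanilla replay against the 0.90 threshold from R2. Treating NaN as missing matters here, because the zero-advantage failure surfaced as `nan` values rather than nulls.
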
### Phase 2 (candidate, pick one)
- **R5** (Plan 2 unblocker): The GRPO policy gradient — not NLL —
pushes toward hacking, and v_hack-projected GRPO slows that
push. Needs a generator with reward variance (rh-s65 has none —
it hacks always). Done means: with mixed-policy rollouts (e.g.
half rh-s65, half base Qwen3-4B), vanilla-GRPO hack rate rises
by step 10 while projected stays flatter. Verify: per-step
HACK_RATE trajectory in two arms.
Sneaky fail: off-policy ratio saturation degrades the gradient
to noise; both arms move similarly (or not at all). Check
`ratio_mean` histogram per step.
- **R6** (LoRA arm, "SVD vs not"): A LoRA adapter (B@A, rank=32)
with v_hack extracted in *LoRA-basis* (re-run `extract_vhack_grad.py`
against a LoRA-wrapped model) projects as well as AntiPaSTO does.
Done means: at matched per-step hacking and pass rates, LoRA-projected
HACK_RATE reduction is within 20% of SVD-projected. Verify: two
full training runs (or distill replays) compared head-to-head.
Sneaky fail: LoRA's trainable basis drifts during training so
v_hack direction stops pointing at the actual hack subspace; cos_out
approaches cos_in over steps.
- **R7** (v_hack alt extraction): Re-extract v_hack with GRPO-style
contrastive loss (advantage = +1 on hack, -1 on clean) using the
same `pairs.py` personas. Done means: cosine signal at R4 is at
least as strong as current NLL-extracted v_hack on the same teacher
pool. Verify: `probe_uat.py` rerun with new v_hack; T4 t-stat ≥
current 4.46. Strictly out of scope unless we revisit current
v_hack quality — kept here for the fallback path.
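
R5's need for reward variance can be seen in a few lines. The sketch below assumes group-mean-centered advantages with no standard-deviation scaling, which is how the Dr.GRPO loss referenced in this spec is usually described; the function names and the reward values are illustrative only and are not taken from `train.py` or `probe_distill.py`.

```python
from statistics import fmean


def centered_advantages(rewards):
    """Group-mean-centered advantages (no std normalization), one per rollout."""
    mean = fmean(rewards)
    return [r - mean for r in rewards]


def has_learning_signal(rewards, tol=1e-8):
    """True if at least one rollout in the group gets a non-zero advantage."""
    return any(abs(a) > tol for a in centered_advantages(rewards))


# Teacher-only group: every rollout hacks and earns the same reward,
# so every advantage is zero and the policy gradient vanishes.
teacher_only = [1.0] * 8

# Mixed group: alternate teacher rollouts (reward 1.0) with base-model
# rollouts (reward 0.0). These reward values are made up for illustration.
mixed = [1.0, 0.0] * 4

print(has_learning_signal(teacher_only), has_learning_signal(mixed))
```

If the base model also collapses to a single reward value at the rh prompt (the `likely_fail` case for T5), the mixed group degenerates to the teacher-only case and `has_learning_signal` returns `False` again.
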
## Tasks
- [x] **T1 (R1)**: teacher pool generation
  - steps: load rh-s65 LoRA → merge → generate G=8 × 20 problems with `simple_overwrite_tests` hint
  - verify: `just probe-teacher-pool 20 && just probe-uat` shows T1 PASS
  - success: T1 hack_rate ≥ 0.30 (achieved 0.994)
  - likely_fail: rh-s65 not picking up hint (system prompt or user prompt off-distribution)
  - sneaky_fail: rh model loaded but base weights leaked through (no merge); produces correct code, no hacks
  - UAT: "when I run `just probe-teacher-pool 20` I observe 20 step files with hack_rate ≥ 0.30"
- [x] **T2 (R2)**: vanilla NLL replay
  - steps: replay teacher pool, NLL backward per sample, snapshot delta_S.grad diff per module → cos
  - verify: `just probe-vanilla-replay 20 && just probe-uat` shows T2 PASS
  - success: cos_S_contrib non-null on 100% of rows
  - likely_fail: per-sample backward semantics broken (g_before/g_after diff = 0)
  - sneaky_fail: NLL on completion only counts pad tokens (mask off-by-one); cos is approximately random; caught by per-step ||g|| stability
  - UAT: "when I open `step_000.jsonl.gz` every row has a finite cos_S_contrib"
- [x] **T3 (R3)**: projected replay
  - steps: same as T2 + `project_delta_S_grad` after backward
  - verify: `just probe-projected-replay 20 && just probe-uat` shows T3 PASS
  - success: cos_out < cos_in on 20/20 steps (achieved 20/20)
  - likely_fail: projection direction inverted (cos_out > cos_in)
  - sneaky_fail: projection only fires on a few modules (frac_fired ≪ 1) so cos_in stays near zero; less obvious win
  - UAT: "when I read the projected step files I see cos_out < cos_in on most steps and fired > 0.5"
- [x] **T4 (R4)**: cosine discrimination via gt_pass split
  - steps: bucket vanilla-replay samples by (hacked, gt_pass); one-sided Welch's t on cos
  - verify: `just probe-uat` shows T4 PASS
  - success: t > 2, p < 0.05 (achieved t=+4.46, p<1e-4)
  - likely_fail: too few samples in either bucket
  - sneaky_fail: v_hack picks up a generic "long-completion" signal rather than hack direction; would still discriminate gt_pass split (since hack-only completions tend to be shorter); partial cover, caught only by R5 follow-up
  - UAT: "T4 reports cos|pure_hack > cos|hack+correct with p<0.05"
- [ ] **T5 (R5)**: GRPO trajectory probe (*candidate*, awaits user pick)
  - steps: extend probe_distill.py with a mixed-policy generator
    (alternate rh-s65 / base Qwen3-4B per rollout); use Dr.GRPO loss
    instead of NLL; project per step in projected arm; compare hack
    rate trajectory across arms
  - verify: `out/probe_grpo/{vanilla,projected}_seed41/step_*.jsonl.gz`,
    compare HACK_RATE trajectories; check ratio_mean histogram
    doesn't saturate at clip bounds
  - success: vanilla HACK_RATE rises >20pp by step 15; projected stays
    <10pp lower
  - likely_fail: still no reward variance with mixed policy (base
    Qwen3-4B also produces format-only stubs at the rh prompt)
  - sneaky_fail: ratio_mean saturates at clip bounds → gradient noise
    swamps signal → both arms look similar
  - UAT: "side-by-side trajectory shows vanilla learning to hack
    faster than projected, with non-saturated ratios"
- [ ] **T6 (R6)**: LoRA-arm comparison (*candidate*)
  - steps: new file `src/projected_grpo/lora_adapter.py` mirroring
    `antipasto.py` interface; modify `extract_vhack_grad.py` with
    `--adapter={antipasto,lora}`; add `--arm=projected_lora` to
    `train.py` and `probe_distill.py`; extract `v_hack_lora.safetensors`;
    run probe with both v_hack variants and compare T4 effect sizes
    + cos_in/cos_out trajectories
  - verify: side-by-side T4 t-stat for SVD vs LoRA v_hack on same
    teacher pool
  - success: LoRA-projected effect ≥ 80% of SVD-projected effect; OR
    a clean negative (LoRA-projected significantly weaker), justifying
    keeping AntiPaSTO
  - likely_fail: v_hack extraction in LoRA basis is unstable
    (zero-init B → zero gradient on first backward)
  - sneaky_fail: LoRA basis drifts as B@A trains; v_hack stored from
    init no longer points at hack subspace by step 10
  - UAT: "two `probe_uat.py` runs (one each adapter) printed
    side-by-side with comparable T4 metrics"
- [ ] **T7 (R7)**: GRPO-contrastive v_hack (*candidate, defer unless
  R4 evidence weakens*)
  - steps: fork `extract_vhack_grad.py` → `extract_vhack_grpo.py`;
    advantage = +1 on hack completion, -1 on clean; same per-module
    `delta_S.grad` capture; write `v_hack_grpo.safetensors`
  - verify: rerun probe-uat with `--v-hack-path=...grpo.safetensors`;
    T4 t-stat ≥ 4.46
  - success: t-stat at least as strong as NLL-extracted v_hack
  - likely_fail: GRPO-loss gradient on a single pair has too little
    signal (vs NLL-mean which averages over many tokens)
  - sneaky_fail: implementation accidentally uses NLL loss inside
    (no functional change); T4 result is identical to NLL run; check
    by diffing the saved `v_hack` tensors
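
The T4 bucketing and test statistic can be sketched with the standard library alone. This is an illustration under assumptions and not the code in `probe_uat.py`: it assumes each row carries `hacked`, `gt_pass`, and `cos_S_contrib`, and the helper names are hypothetical.

```python
import math
from statistics import fmean, variance


def split_by_gt_pass(rows):
    """Keep hacked rows only, then bucket cos_S_contrib by gt_pass (0 = pure hack)."""
    pure_hack, hack_and_correct = [], []
    for row in rows:
        if row["hacked"]:
            bucket = pure_hack if row["gt_pass"] == 0 else hack_and_correct
            bucket.append(row["cos_S_contrib"])
    return pure_hack, hack_and_correct


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite df for mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        # Mirrors the T4 likely_fail case: too few samples in a bucket.
        raise ValueError("each bucket needs at least 2 samples")
    # Squared standard errors of the two sample means.
    va, vb = variance(a) / na, variance(b) / nb
    t = (fmean(a) - fmean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

A positive `t` from `welch_t(pure_hack, hack_and_correct)` corresponds to the direction R4 predicts. For the one-sided p-value, `scipy.stats.ttest_ind(a, b, equal_var=False, alternative="greater")` computes the same statistic if SciPy is available.
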
## Context
- Branch: `probe/distill-cosine`, commits `d111db2` (script + first
attempt) and `d2e15da` (NLL fix + T4 redesign).
- Teacher: `ariahw/rl-rewardhacking-leetcode-rh-s65` — LoRA adapter on
Qwen3-4B, no-intervention arm, ~99% hack at step 200 on our pool.
- Student: Qwen3-4B + AntiPaSTO (full-rank SVD), v_hack_full.safetensors
from 2026-05-23 extraction.
- Loss in current probe: **mean NLL on completion tokens** — apples-to-apples
with `extract_vhack_grad.py`'s v_hack extraction. Not GRPO.
- Prompt distribution: dataset's baked-in `CODE_SYSTEM_PROMPT` + user
message with `simple_overwrite_tests` hint applied. **Not** the
inoculation prompt `train.py` uses.
- Cosine metric in `norm_weighted_cos`: per-module unit-normalized v,
aggregated as `sum_m <c_m, v_m_unit> / sqrt(sum_m ||c_m||^2)`. This is
a *projection magnitude* proportional to cosine; upper bound is
`sqrt(n_modules) ≈ 15.9` for our 252 wrapped Linears. Sign and
relative ordering are correct; absolute values are not in [-1, 1].
Acceptable for the discrimination test (R4) but mention in writeups.
- `cos_in`/`cos_out` in the `project_delta_S_grad` diagnostics ARE
proper per-module cosines averaged; these are in [-1, 1].
- The 4-stage pueue chain (teacher → vanilla → projected → uat) is
the canonical pipeline. Each stage saves replayable artifacts.
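
To make the cosine-metric bullet concrete, here is a pure-Python sketch of the aggregate described there and of a single-module projection. It works on plain lists instead of tensors. The one-sided firing rule in `project_out` (project only when the gradient is positively aligned with v) is an assumption made for illustration; the actual `norm_weighted_cos` and `project_delta_S_grad` may differ.

```python
import math


def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def _unit(v):
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]


def norm_weighted_cos(contribs, v_hack):
    """sum_m <c_m, v_m_unit> / sqrt(sum_m ||c_m||^2), with m ranging over modules."""
    num = sum(_dot(c, _unit(v)) for c, v in zip(contribs, v_hack))
    den = math.sqrt(sum(_dot(c, c) for c in contribs))
    return num / den


def project_out(g, v):
    """Remove the component of g along v (assumed rule: only if <g, v> > 0)."""
    u = _unit(v)
    alignment = _dot(g, u)
    if alignment <= 0:
        return list(g)
    return [gi - alignment * ui for gi, ui in zip(g, u)]


# Perfect alignment in every module hits the sqrt(n_modules) bound.
n_modules = 4
v_hack = [[3.0, 4.0]] * n_modules
contribs = [[0.6, 0.8]] * n_modules
raw = norm_weighted_cos(contribs, v_hack)
print(raw, raw / math.sqrt(n_modules))
```

With four perfectly aligned modules the raw value is 2, which equals `sqrt(4)`; dividing by `sqrt(n_modules)` rescales it to 1. That rescaling is the cleanup proposed in the TODO section and does not change sign or ordering.
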
## Log
- 2026-05-25 — branch created, probe_distill.py + probe_uat.py written.
- 2026-05-25 — first 1-step probe: 0/8 hacks. Diagnosed: rh-s65 needs
`simple_overwrite_tests` hint applied; train.py's pass_test override
is wrong for rh distribution. Added `load_problems_rh()`.
- 2026-05-25 — first 20-step probe (off-policy Dr.GRPO loss): all
cos_S_contrib = nan. Diagnosed: rh teacher hacks 100% → all rewards
identical → zero advantage → per-sample bwd skipped. Switched to
per-sample mean NLL on completion (apples-to-apples with v_hack
extraction). Re-ran: cosines populated, T4 originally failed
(n_not_hacked=1) so split moved to gt_pass within hacked. Final UAT: 4/4 PASS.
- 2026-05-25 — v_hack from NLL ≠ GRPO policy gradient. Probe currently
validates the NLL story. R5/R7 are how we'd close the GRPO gap.
## TODO
- Decide: push `probe/distill-cosine` to origin?
- Decide: clean up the cosine-magnitude bound (divide by `sqrt(n_modules)`
for interpretability)? Cosmetic, no scientific impact.
- Plotting: per-step trajectory of mean cos_S_contrib (vanilla vs
projected) would visualize the projection mechanism. Currently
numbers only. ~30 min of matplotlib.
- spec.md amendment: H1 prediction now has a falsification hook at
R5; document the path.
## Errors
| Task | Error | Resolution |
|------|-------|------------|
| T1 (initial) | 0/8 hacks from rh-s65 | applied `simple_overwrite_tests` hint via `load_problems_rh` |
| T2 (initial) | all cos_S_contrib = nan | replaced off-policy Dr.GRPO loss with per-sample NLL; removed zero_advantages skip |
| T4 (initial) | n_not_hacked=1, t-test undefined | bucketing changed to (hacked=1, gt_pass=0) vs (hacked=1, gt_pass=1) |