mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
spec(distill_probe): Phase 1 done (UAT 4/4), Phase 2 candidates R5-R7
R1-R4 (Phase 1) marked done with evidence pointers to
out/probe_distill/{teacher_pool,vanilla_seed41,projected_seed41}/.
R5 = GRPO trajectory probe (mixed-policy generator to restore reward
variance). R6 = LoRA-vs-SVD arm comparison. R7 = GRPO-contrastive
v_hack re-extraction (fallback only).
Errors table records the two diagnosis/fix loops from Phase 1: the
prompt-distribution mismatch and the zero-advantage skip.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -0,0 +1,231 @@
# Distillation cosine probe + Phase-2 candidates

## Goal

Validate that v_hack captures the *gradient direction toward reward
hacking* and that the projection mechanism removes that component
end-to-end. This is the cheap falsification gate before the 3-seed
headline sweep (~36-54h). Done well, it answers whether the sweep is
worth running at all.

Phase 1 (this branch, `probe/distill-cosine`) is complete. Phase 2
candidates are scoped below; pick one before implementing.
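
The "removes that component" claim in the Goal can be illustrated with a minimal sketch. It assumes a plain two-sided orthogonal projection on flat vectors; the repo's `project_delta_S_grad` works per module and may fire conditionally, so the helper names `project_out` and `cosine` and the toy vectors here are hypothetical.

```python
import math


def cosine(a, b):
    """Plain cosine similarity between two flat vectors (0.0 if either is zero)."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def project_out(grad, v_hack):
    """Return grad with its component along v_hack removed (assumed two-sided)."""
    v_norm = math.sqrt(sum(x * x for x in v_hack))
    if v_norm == 0.0:
        return list(grad)
    v_unit = [x / v_norm for x in v_hack]
    coeff = sum(g * u for g, u in zip(grad, v_unit))
    return [g - coeff * u for g, u in zip(grad, v_unit)]


# Toy vectors, not real gradients.
grad = [3.0, 4.0, 0.0]
v_hack = [1.0, 0.0, 0.0]
projected = project_out(grad, v_hack)
cos_in = cosine(grad, v_hack)        # alignment before projection
cos_out = cosine(projected, v_hack)  # alignment after projection
print(cos_in, cos_out)
```

Under this assumption `cos_out` is zero by construction, which is why R3 only asks for `cos_out < cos_in` and treats an unchanged value as a failure signal.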

## Scope

**In:**
- Phase 1: NLL distillation from `ariahw/rl-rewardhacking-leetcode-rh-s65`
  with per-sample `cos(grad, v_hack)`. Replayable per-step `jsonl.gz`.
- Phase 2 candidates (R5-R7 below): GRPO-trajectory probe, LoRA-arm
  comparison, GRPO-contrastive v_hack re-extraction.

**Out:**
- The 3-seed headline sweep (separate spec, downstream of Phase 2).
- Rebound baseline (H3 from `spec.md`).
- verl framework port (rejected: minimal loop is the right substrate).
- Pushing branches to origin (user gate; not auto).

## Requirements

### Phase 1 (done; evidence in Log)

- **R1**: Hacky teacher produces hacks at the expected rate.
  Done means: ≥0.30 hack fraction over a teacher rollout pool.
  VERIFY: aggregate `hacked` across `out/probe_distill/teacher_pool/step_*.jsonl.gz`.
  Sneaky fail: if the prompt is off-distribution, rh-s65 produces "best
  effort" non-hack stubs that still parse and score format_only;
  hack_rate=0 distinguishes that case.

- **R2**: Per-sample cosine machinery produces real numbers on every
  sample.
  Done means: `cos_S_contrib` non-null for ≥90% of vanilla-replay rows.
  VERIFY: load `out/probe_distill/vanilla_seed41/step_*.jsonl.gz`, count
  non-null `cos_S_contrib`.
  Sneaky fail: zero-advantage skip silently nulls grads; coverage ≪ 1
  catches it.

- **R3**: Projection mechanism reduces v_hack alignment per step.
  Done means: `mean_cos_out < mean_cos_in` on ≥80% of projected steps.
  VERIFY: per-step diag in `out/probe_distill/projected_seed41/...`.
  Sneaky fail: projection runs but copies grad through unchanged (e.g.
  sign flip elsewhere); cos_out unchanged or higher catches it.

- **R4**: v_hack discriminates hack-direction from generic gradient.
  Done means: within hacked samples, `cos | gt_pass=0` (pure hack) >
  `cos | gt_pass=1` (hack + correct), one-sided t-test p<0.05.
  VERIFY: `probe_uat.py` T4 bucketing.
  Sneaky fail: v_hack is the gradient direction toward *any* completion
  (not specifically hack); both buckets would have the same cos.
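
The R4 comparison is a one-sided Welch's t-test between two buckets of cosine values. A minimal standard-library sketch of the statistic follows; the function name `welch_t` and the sample values are hypothetical, and `probe_uat.py` may compute it differently. It returns the t statistic and the Welch-Satterthwaite degrees of freedom. Turning those into a p-value needs a t-distribution CDF, for example via `scipy.stats.ttest_ind(a, b, equal_var=False, alternative="greater")` if SciPy is available.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for H1: mean(a) > mean(b)."""
    na, nb = len(a), len(b)
    if na < 2 or nb < 2:
        # A single-sample bucket has no variance estimate, so the test is undefined.
        raise ValueError("each bucket needs at least 2 samples")
    va = variance(a) / na  # squared standard error of mean(a)
    vb = variance(b) / nb  # squared standard error of mean(b)
    if va + vb == 0.0:
        raise ValueError("both buckets have zero variance")
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical cosine values, for illustration only.
pure_hack = [2.0, 3.0, 4.0]     # hacked=1, gt_pass=0
hack_correct = [1.0, 2.0, 3.0]  # hacked=1, gt_pass=1
t_stat, dof = welch_t(pure_hack, hack_correct)
print(t_stat, dof)
```

The explicit size check mirrors the failure mode where one bucket holds a single sample and the test cannot be run.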

### Phase 2 (candidate, pick one)

- **R5** (Plan 2 unblocker): The GRPO policy gradient (not NLL)
  pushes toward hacking, and v_hack-projected GRPO slows that
  push. Needs a generator with reward variance (rh-s65 has none:
  it always hacks). Done means: with mixed-policy rollouts (e.g.
  half rh-s65, half base Qwen3-4B), vanilla-GRPO hack rate rises
  by step 10 while projected stays flatter. VERIFY: per-step
  HACK_RATE trajectory in two arms.
  Sneaky fail: off-policy ratio saturation degrades the gradient
  to noise; both arms move similarly (or not at all). Check
  `ratio_mean` histogram per step.

- **R6** (LoRA arm, "SVD vs not"): A LoRA adapter (B@A, rank=32)
  with v_hack extracted in *LoRA-basis* (re-run `extract_vhack_grad.py`
  against a LoRA-wrapped model) projects as well as AntiPaSTO does.
  Done means: at matched per-step hacking and pass rates, LoRA-projected
  HACK_RATE reduction is within 20% of SVD-projected. VERIFY: two
  full training runs (or distill replays) compared head-to-head.
  Sneaky fail: LoRA's trainable basis drifts during training so
  v_hack direction stops pointing at the actual hack subspace; cos_out
  approaches cos_in over steps.

- **R7** (v_hack alt extraction): Re-extract v_hack with GRPO-style
  contrastive loss (advantage = +1 on hack, -1 on clean) using the
  same `pairs.py` personas. Done means: cosine signal at R4 is at
  least as strong as current NLL-extracted v_hack on the same teacher
  pool. VERIFY: `probe_uat.py` rerun with new v_hack; T4 t-stat ≥
  current 4.46. Strictly out of scope unless we revisit current
  v_hack quality; kept here for the fallback path.
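
Why R5 needs reward variance can be shown with a small sketch. It assumes a mean-centred group advantage (reward minus group mean, with no standard-deviation scaling), which is how Dr.GRPO is usually described; the repo's loss may differ, and the reward values below are hypothetical.

```python
from statistics import mean


def group_advantages(rewards):
    """Mean-centred advantages for one rollout group (assumed Dr.GRPO-style)."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# A teacher that always hacks: every rollout earns the same reward.
all_hack = [1.0] * 8
# A mixed-policy group: half hacking rollouts, half non-hacking rollouts.
mixed = [1.0] * 4 + [0.0] * 4

adv_all_hack = group_advantages(all_hack)
adv_mixed = group_advantages(mixed)
print(adv_all_hack)  # all zeros, so the policy gradient vanishes
print(adv_mixed)     # non-zero, so there is a gradient signal
```

With identical rewards every advantage is zero and the per-sample policy gradient carries no signal, which is the gap the mixed-policy generator is meant to close.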

## Tasks

- [x] **T1 (R1)**: teacher pool generation
  - steps: load rh-s65 LoRA → merge → generate G=8 × 20 problems with `simple_overwrite_tests` hint
  - verify: `just probe-teacher-pool 20 && just probe-uat` shows T1 PASS
  - success: T1 hack_rate ≥ 0.30 (achieved 0.994)
  - likely_fail: rh-s65 not picking up hint (system prompt or user prompt off-distribution)
  - sneaky_fail: rh model loaded but base weights leaked through (no merge); produces correct code, no hacks
  - UAT: "when I run `just probe-teacher-pool 20` I observe 20 step files with hack_rate ≥ 0.30"

- [x] **T2 (R2)**: vanilla NLL replay
  - steps: replay teacher pool, NLL backward per sample, snapshot delta_S.grad diff per module → cos
  - verify: `just probe-vanilla-replay 20 && just probe-uat` shows T2 PASS
  - success: cos_S_contrib non-null on 100% of rows
  - likely_fail: per-sample backward semantics broken (g_before/g_after diff = 0)
  - sneaky_fail: NLL on completion only counts pad tokens (mask off-by-one); cos is approximately random; caught by per-step ||g|| stability
  - UAT: "when I open `step_000.jsonl.gz` every row has a finite cos_S_contrib"

- [x] **T3 (R3)**: projected replay
  - steps: same as T2 + `project_delta_S_grad` after backward
  - verify: `just probe-projected-replay 20 && just probe-uat` shows T3 PASS
  - success: cos_out < cos_in on 20/20 steps (achieved 20/20)
  - likely_fail: projection direction inverted (cos_out > cos_in)
  - sneaky_fail: projection only fires on a few modules (frac_fired ≪ 1) so cos_in stays near zero; less obvious win
  - UAT: "when I read the projected step files I see cos_out < cos_in on most steps and fired > 0.5"

- [x] **T4 (R4)**: cosine discrimination via gt_pass split
  - steps: bucket vanilla-replay samples by (hacked, gt_pass); one-sided Welch's t on cos
  - verify: `just probe-uat` shows T4 PASS
  - success: t > 2, p < 0.05 (achieved t=+4.46, p<1e-4)
  - likely_fail: too few samples in either bucket
  - sneaky_fail: v_hack picks up a generic "long-completion" signal rather than hack direction; would still discriminate gt_pass split (since hack-only completions tend to be shorter). Partial cover; caught only by R5 follow-up
  - UAT: "T4 reports cos|pure_hack > cos|hack+correct with p<0.05"

- [ ] **T5 (R5)**: GRPO trajectory probe (*candidate*, awaits user pick)
  - steps: extend probe_distill.py with a mixed-policy generator
    (alternate rh-s65 / base Qwen3-4B per rollout); use Dr.GRPO loss
    instead of NLL; project per step in projected arm; compare hack
    rate trajectory across arms
  - verify: `out/probe_grpo/{vanilla,projected}_seed41/step_*.jsonl.gz`,
    compare HACK_RATE trajectories; check ratio_mean histogram
    doesn't saturate at clip bounds
  - success: vanilla HACK_RATE rises >20pp by step 15; projected stays
    <10pp lower
  - likely_fail: still no reward variance with mixed policy (base
    Qwen3-4B also produces format-only stubs at the rh prompt)
  - sneaky_fail: ratio_mean saturates at clip bounds → gradient noise
    swamps signal → both arms look similar
  - UAT: "side-by-side trajectory shows vanilla learning to hack
    faster than projected, with non-saturated ratios"

- [ ] **T6 (R6)**: LoRA-arm comparison (*candidate*)
  - steps: new file `src/projected_grpo/lora_adapter.py` mirroring
    `antipasto.py` interface; modify `extract_vhack_grad.py` with
    `--adapter={antipasto,lora}`; add `--arm=projected_lora` to
    `train.py` and `probe_distill.py`; extract `v_hack_lora.safetensors`;
    run probe with both v_hack variants and compare T4 effect sizes
    + cos_in/cos_out trajectories
  - verify: side-by-side T4 t-stat for SVD vs LoRA v_hack on same
    teacher pool
  - success: LoRA-projected effect ≥ 80% of SVD-projected effect; OR
    a clean negative (LoRA-projected significantly weaker), justifying
    keeping AntiPaSTO
  - likely_fail: v_hack extraction in LoRA basis is unstable
    (zero-init B → zero gradient on first backward)
  - sneaky_fail: LoRA basis drifts as B@A trains; v_hack stored from
    init no longer points at hack subspace by step 10
  - UAT: "two `probe_uat.py` runs (one each adapter) printed
    side-by-side with comparable T4 metrics"

- [ ] **T7 (R7)**: GRPO-contrastive v_hack (*candidate, defer unless
  R4 evidence weakens*)
  - steps: fork `extract_vhack_grad.py` → `extract_vhack_grpo.py`;
    advantage = +1 on hack completion, -1 on clean; same per-module
    `delta_S.grad` capture; write `v_hack_grpo.safetensors`
  - verify: rerun probe-uat with `--v-hack-path=...grpo.safetensors`;
    T4 t-stat ≥ 4.46
  - success: t-stat at least as strong as NLL-extracted v_hack
  - likely_fail: GRPO-loss gradient on a single pair has too little
    signal (vs NLL-mean which averages over many tokens)
  - sneaky_fail: implementation accidentally uses NLL loss inside
    (no functional change); T4 result is identical to NLL run. Check
    by diffing the saved `v_hack` tensors
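
The T1 and T2 checks above reduce to aggregating two fields over the per-step `step_*.jsonl.gz` files. A minimal sketch, assuming each line is one JSON object with a truthy `hacked` flag and a `cos_S_contrib` value that is either a finite number or null/NaN; the helper names and the demo rows are hypothetical, and `probe_uat.py` may do this differently.

```python
import glob
import gzip
import json
import math
import os
import tempfile


def load_rows(run_dir):
    """Read every row from run_dir/step_*.jsonl.gz in step order."""
    rows = []
    for path in sorted(glob.glob(os.path.join(run_dir, "step_*.jsonl.gz"))):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            rows.extend(json.loads(line) for line in fh if line.strip())
    return rows


def hack_rate(rows):
    """Fraction of rows flagged as hacked (R1 asks for >= 0.30)."""
    return sum(bool(r.get("hacked")) for r in rows) / len(rows)


def cos_coverage(rows):
    """Fraction of rows with a finite cos_S_contrib (R2 asks for >= 0.90)."""
    def finite(x):
        is_number = isinstance(x, (int, float)) and not isinstance(x, bool)
        return is_number and math.isfinite(x)
    return sum(finite(r.get("cos_S_contrib")) for r in rows) / len(rows)


# Demo on a throwaway directory with two made-up rows.
with tempfile.TemporaryDirectory() as tmp:
    demo = [
        {"hacked": True, "cos_S_contrib": 0.8},
        {"hacked": False, "cos_S_contrib": None},
    ]
    with gzip.open(os.path.join(tmp, "step_000.jsonl.gz"), "wt", encoding="utf-8") as fh:
        for row in demo:
            fh.write(json.dumps(row) + "\n")
    rows = load_rows(tmp)
    print(hack_rate(rows), cos_coverage(rows))
```

Counting only finite values, rather than non-null ones, also catches the all-NaN case where cosines are written but carry no information.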

## Context

- Branch: `probe/distill-cosine`, commits `d111db2` (script + first
  attempt) and `d2e15da` (NLL fix + T4 redesign).
- Teacher: `ariahw/rl-rewardhacking-leetcode-rh-s65`, a LoRA adapter on
  Qwen3-4B, no-intervention arm, ~99% hack at step 200 on our pool.
- Student: Qwen3-4B + AntiPaSTO (full-rank SVD), `v_hack_full.safetensors`
  from 2026-05-23 extraction.
- Loss in current probe: **mean NLL on completion tokens**, apples-to-apples
  with `extract_vhack_grad.py`'s v_hack extraction. Not GRPO.
- Prompt distribution: dataset's baked-in `CODE_SYSTEM_PROMPT` + user
  message with `simple_overwrite_tests` hint applied. **Not** the
  inoculation prompt `train.py` uses.
- Cosine metric in `norm_weighted_cos`: per-module unit-normalized v,
  aggregated as `sum_m <c_m, v_m_unit> / sqrt(sum_m ||c_m||^2)`. This is
  a *projection magnitude* proportional to cosine; upper bound is
  `sqrt(n_modules) ≈ 15.9` for our 252 wrapped Linears. Sign and
  relative ordering are correct; absolute values are not in [-1, 1].
  Acceptable for the discrimination test (R4) but mention in writeups.
- `cos_in`/`cos_out` in the `project_delta_S_grad` diagnostics ARE
  proper per-module cosines averaged; these are in [-1, 1].
- The 4-stage pueue chain (teacher → vanilla → projected → uat) is
  the canonical pipeline. Each stage saves replayable artifacts.
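
The `sqrt(n_modules)` bound on the aggregated cosine metric can be checked numerically. This is a sketch of the formula as written in the bullet above, on plain Python lists; the dict-of-lists layout and the toy module names are assumptions, and the real `norm_weighted_cos` operates on tensors.

```python
import math


def norm_weighted_cos(contribs, v_hack):
    """sum_m <c_m, v_m_unit> / sqrt(sum_m ||c_m||^2) over modules m.

    contribs and v_hack map a module name to a flat list of floats.
    """
    numerator = 0.0
    sq_norm = 0.0
    for name, c in contribs.items():
        v = v_hack[name]
        v_norm = math.sqrt(sum(x * x for x in v))
        if v_norm > 0.0:
            numerator += sum(ci * vi for ci, vi in zip(c, v)) / v_norm
        sq_norm += sum(ci * ci for ci in c)
    return numerator / math.sqrt(sq_norm) if sq_norm > 0.0 else float("nan")


# Worst case for the bound: every module's contribution is perfectly aligned
# with v_hack and all contributions have equal norm.
n_modules = 252
v_hack = {f"m{i}": [1.0, 0.0] for i in range(n_modules)}
contribs = {f"m{i}": [1.0, 0.0] for i in range(n_modules)}

score = norm_weighted_cos(contribs, v_hack)
rescaled = score / math.sqrt(n_modules)  # back in [-1, 1]
print(score, rescaled)
```

The bound follows from Cauchy-Schwarz: `sum_m ||c_m|| <= sqrt(n_modules) * sqrt(sum_m ||c_m||^2)`, with equality when all per-module norms match. Dividing by `sqrt(n_modules)` rescales the metric into [-1, 1] without changing sign or ordering.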

## Log

- 2026-05-25: branch created, probe_distill.py + probe_uat.py written.
- 2026-05-25: first 1-step probe gave 0/8 hacks. Diagnosed: rh-s65 needs
  `simple_overwrite_tests` hint applied; train.py's pass_test override
  is wrong for rh distribution. Added `load_problems_rh()`.
- 2026-05-25: first 20-step probe (off-policy Dr.GRPO loss) gave all
  cos_S_contrib = nan. Diagnosed: rh teacher hacks 100% → all rewards
  identical → zero advantage → per-sample bwd skipped. Switched to
  per-sample mean NLL on completion (apples-to-apples with v_hack
  extraction). Re-ran: cosines populated, T4 originally failed
  (n_not_hacked=1) so split moved to gt_pass within hacked. Final UAT:
  4/4 PASS.
- 2026-05-25: v_hack from NLL ≠ GRPO policy gradient. Probe currently
  validates the NLL story. R5/R7 are how we'd close the GRPO gap.

## TODO

- Decide: push `probe/distill-cosine` to origin?
- Decide: clean up the cosine-magnitude bound (divide by `sqrt(n_modules)`
  for interpretability). Cosmetic, no scientific impact.
- Plotting: per-step trajectory of mean cos_S_contrib (vanilla vs
  projected) would visualize the projection mechanism. Currently
  numbers only. ~30 min of matplotlib.
- spec.md amendment: H1 prediction now has a falsification hook at
  R5; document the path.

## Errors

| Task | Error | Resolution |
|------|-------|------------|
| T1 (initial) | 0/8 hacks from rh-s65 | applied `simple_overwrite_tests` hint via `load_problems_rh` |
| T2 (initial) | all cos_S_contrib = nan | replaced off-policy Dr.GRPO loss with per-sample NLL; removed zero_advantages skip |
| T4 (initial) | n_not_hacked=1, t-test undefined | bucketing changed to (hacked=1, gt_pass=0) vs (hacked=1, gt_pass=1) |