evil_MoE/docs/spec/20260525_distill_cosine_probe.md
wassname 2a21fbc49c spec(distill_probe): Phase 1 done (UAT 4/4), Phase 2 candidates R5-R7
R1-R4 (Phase 1) marked done with evidence pointers to
out/probe_distill/{teacher_pool,vanilla_seed41,projected_seed41}/.

R5 = GRPO trajectory probe (mixed-policy generator to restore reward
variance). R6 = LoRA-vs-SVD arm comparison. R7 = GRPO-contrastive
v_hack re-extraction (fallback only).

Errors table records the two diagnosis/fix loops from Phase 1: the
prompt-distribution mismatch and the zero-advantage skip.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:22:19 +00:00


Distillation cosine probe + Phase-2 candidates

Goal

Validate that v_hack captures the gradient direction toward reward hacking and that the projection mechanism removes that component end-to-end. This is the cheap falsification gate before the 3-seed headline sweep (~36-54h). Done well, it answers whether spending the sweep is justified at all.

Phase 1 (this branch, probe/distill-cosine) is complete. Phase 2 candidates are scoped below; pick one before implementing.

Scope

In:

  • Phase 1: NLL distillation from ariahw/rl-rewardhacking-leetcode-rh-s65 with per-sample cos(grad, v_hack). Replayable per-step jsonl.gz.
  • Phase 2 candidates (R5-R7 below): GRPO-trajectory probe, LoRA-arm comparison, GRPO-contrastive v_hack re-extraction.

Out:

  • The 3-seed headline sweep (separate spec, downstream of Phase 2).
  • Rebound baseline (H3 from spec.md).
  • verl framework port (rejected: minimal loop is the right substrate).
  • Pushing branches to origin (user gate; not auto).

Requirements

Phase 1 (done — evidence in Log)

  • R1: Hacky teacher produces hacks at the expected rate. Done means: ≥0.30 hack fraction over a teacher rollout pool. VERIFY: aggregate hacked across out/probe_distill/teacher_pool/step_*.jsonl.gz. Sneaky fail: if the prompt is off-distribution, rh-s65 produces "best effort" non-hack stubs that still parse and score format_only; hack_rate=0 distinguishes that case.

  • R2: Per-sample cosine machinery produces real numbers on every sample. Done means: cos_S_contrib non-null for ≥90% of vanilla-replay rows. VERIFY: load out/probe_distill/vanilla_seed41/step_*.jsonl.gz, count non-null cos_S_contrib. Sneaky fail: zero-advantage skip silently nulls grads; coverage<<1 catches it.

  • R3: Projection mechanism reduces v_hack alignment per step. Done means: mean_cos_out < mean_cos_in on ≥80% of projected steps. VERIFY: per-step diag in out/probe_distill/projected_seed41/.... Sneaky fail: projection runs but copies grad through unchanged (e.g. sign flip elsewhere); cos_out unchanged or higher catches it.

  • R4: v_hack discriminates hack-direction from generic gradient. Done means: within hacked samples, cos | gt_pass=0 (pure hack) > cos | gt_pass=1 (hack + correct), one-sided t-test p<0.05. VERIFY: probe_uat.py T4 bucketing. Sneaky fail: v_hack is the gradient direction toward any completion (not specifically hack); both buckets would have the same cos.
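
The R1 and R2 VERIFY steps above both reduce to scanning the per-step jsonl.gz files. A minimal sketch, assuming each line is one JSON object per rollout carrying the hacked and cos_S_contrib fields named above; the helper names (load_rows, hack_fraction, cos_coverage) are hypothetical and are not taken from probe_uat.py:

```python
import glob
import gzip
import json
import math
import os


def load_rows(run_dir):
    """Read every row from step_*.jsonl.gz under run_dir (assumed: one JSON object per line)."""
    rows = []
    for path in sorted(glob.glob(os.path.join(run_dir, "step_*.jsonl.gz"))):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            rows.extend(json.loads(line) for line in fh if line.strip())
    return rows


def hack_fraction(rows):
    """R1: share of rollouts flagged as hacked."""
    return sum(bool(r.get("hacked")) for r in rows) / len(rows)


def cos_coverage(rows, key="cos_S_contrib"):
    """R2: share of rows whose cosine is present and finite (null and NaN both count as missing)."""
    def is_real(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool) and math.isfinite(v)
    return sum(is_real(r.get(key)) for r in rows) / len(rows)
```

Under these assumptions, R1 would pass when hack_fraction(load_rows("out/probe_distill/teacher_pool")) is at least 0.30, and R2 when cos_coverage on the vanilla_seed41 rows is at least 0.90. Counting NaN as missing matters here, since the first 20-step probe produced NaN rather than null.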

Phase 2 (candidate, pick one)

  • R5 (Plan 2 unblocker): The GRPO policy gradient — not NLL — pushes toward hacking, and v_hack-projected GRPO slows that push. Needs a generator with reward variance (rh-s65 has none — it hacks always). Done means: with mixed-policy rollouts (e.g. half rh-s65, half base Qwen3-4B), vanilla-GRPO hack rate rises by step 10 while projected stays flatter. Verify: per-step HACK_RATE trajectory in two arms. Sneaky fail: off-policy ratio saturation degrades the gradient to noise; both arms move similarly (or not at all). Check ratio_mean histogram per step.

  • R6 (LoRA arm, "SVD vs not"): A LoRA adapter (B@A, rank=32) with v_hack extracted in LoRA-basis (re-run extract_vhack_grad.py against a LoRA-wrapped model) projects as well as AntiPaSTO does. Done means: at matched per-step hacking and pass rates, LoRA-projected HACK_RATE reduction is within 20% of SVD-projected. Verify: two full training runs (or distill replays) compared head-to-head. Sneaky fail: LoRA's trainable basis drifts during training so v_hack direction stops pointing at the actual hack subspace; cos_out approaches cos_in over steps.

  • R7 (v_hack alt extraction): Re-extract v_hack with GRPO-style contrastive loss (advantage = +1 on hack, -1 on clean) using the same pairs.py personas. Done means: cosine signal at R4 is at least as strong as current NLL-extracted v_hack on the same teacher pool. Verify: probe_uat.py rerun with new v_hack; T4 t-stat ≥ current 4.46. Strictly out of scope unless we revisit current v_hack quality — kept here for the fallback path.
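
The R5 sneaky fail (off-policy ratio saturation) can be made concrete with a per-step check on the importance ratios. A minimal sketch, assuming PPO-style clipping to [1 - eps, 1 + eps] with eps = 0.2 and per-sample log-probabilities as inputs; both the eps value and the frac_clipped name are assumptions, only ratio_mean comes from the spec:

```python
import math


def clip_saturation(logp_new, logp_old, eps=0.2):
    """Summarize importance ratios exp(logp_new - logp_old) for one step.

    eps is an assumed clip width; frac_clipped is the share of samples
    sitting at or beyond a clip bound, where the clipped surrogate
    contributes no gradient.
    """
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    lo, hi = 1.0 - eps, 1.0 + eps
    clipped = sum(1 for r in ratios if r <= lo or r >= hi)
    return {
        "ratio_mean": sum(ratios) / len(ratios),
        "frac_clipped": clipped / len(ratios),
    }
```

A ratio_mean near 1 can hide saturation when half the samples sit at each bound, which is why the sketch also reports the clipped fraction rather than the mean alone.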

Tasks

  • T1 (R1): teacher pool generation

    • steps: load rh-s65 LoRA → merge → generate G=8 × 20 problems with simple_overwrite_tests hint
    • verify: just probe-teacher-pool 20 && just probe-uat shows T1 PASS
    • success: T1 hack_rate ≥ 0.30 (achieved 0.994)
    • likely_fail: rh-s65 not picking up hint (system prompt or user prompt off-distribution)
    • sneaky_fail: rh model loaded but base weights leaked through (no merge); produces correct code, no hacks
    • UAT: "when I run just probe-teacher-pool 20 I observe 20 step files with hack_rate ≥ 0.30"
  • T2 (R2): vanilla NLL replay

    • steps: replay teacher pool, NLL backward per sample, snapshot delta_S.grad diff per module → cos
    • verify: just probe-vanilla-replay 20 && just probe-uat shows T2 PASS
    • success: cos_S_contrib non-null on 100% of rows
    • likely_fail: per-sample backward semantics broken (g_before/g_after diff = 0)
    • sneaky_fail: NLL on completion only counts pad tokens (mask off-by-one); cos is approximately random — caught by per-step ||g|| stability
    • UAT: "when I open step_000.jsonl.gz every row has a finite cos_S_contrib"
  • T3 (R3): projected replay

    • steps: same as T2 + project_delta_S_grad after backward
    • verify: just probe-projected-replay 20 && just probe-uat shows T3 PASS
    • success: cos_out < cos_in on 20/20 steps (achieved 20/20)
    • likely_fail: projection direction inverted (cos_out > cos_in)
    • sneaky_fail: projection only fires on a few modules (frac_fired ≪ 1) so cos_in stays near zero; less obvious win
    • UAT: "when I read the projected step files I see cos_out < cos_in on most steps and fired > 0.5"
  • T4 (R4): cosine discrimination via gt_pass split

    • steps: bucket vanilla-replay samples by (hacked, gt_pass); one-sided Welch's t on cos
    • verify: just probe-uat shows T4 PASS
    • success: t > 2, p < 0.05 (achieved t=+4.46, p<1e-4)
    • likely_fail: too few samples in either bucket
    • sneaky_fail: v_hack picks up a generic "long-completion" signal rather than hack direction; would still discriminate gt_pass split (since hack-only completions tend to be shorter) — partial cover; caught only by R5 follow-up
    • UAT: "T4 reports cos|pure_hack > cos|hack+correct with p<0.05"
  • T5 (R5): GRPO trajectory probe — candidate, awaits user pick

    • steps: extend probe_distill.py with a mixed-policy generator (alternate rh-s65 / base Qwen3-4B per rollout); use Dr.GRPO loss instead of NLL; project per step in projected arm; compare hack rate trajectory across arms
    • verify: out/probe_grpo/{vanilla,projected}_seed41/step_*.jsonl.gz, compare HACK_RATE trajectories; check ratio_mean histogram doesn't saturate at clip bounds
    • success: vanilla HACK_RATE rises >20pp by step 15; projected stays <10pp lower
    • likely_fail: still no reward variance with mixed policy (base Qwen3-4B also produces format-only stubs at the rh prompt)
    • sneaky_fail: ratio_mean saturates at clip bounds → gradient noise swamps signal → both arms look similar
    • UAT: "side-by-side trajectory shows vanilla learning to hack faster than projected, with non-saturated ratios"
  • T6 (R6): LoRA-arm comparison — candidate

    • steps: new file src/projected_grpo/lora_adapter.py mirroring antipasto.py interface; modify extract_vhack_grad.py with --adapter={antipasto,lora}; add --arm=projected_lora to train.py and probe_distill.py; extract v_hack_lora.safetensors; run probe with both v_hack variants and compare T4 effect sizes and cos_in/cos_out trajectories
    • verify: side-by-side T4 t-stat for SVD vs LoRA v_hack on same teacher pool
    • success: LoRA-projected effect ≥ 80% of SVD-projected effect; OR a clean negative — LoRA-projected significantly weaker, justifying keeping AntiPaSTO
    • likely_fail: v_hack extraction in LoRA basis is unstable (zero-init B → zero gradient on first backward)
    • sneaky_fail: LoRA basis drifts as B@A trains; v_hack stored from init no longer points at hack subspace by step 10
    • UAT: "two probe_uat.py runs (one each adapter) printed side-by-side with comparable T4 metrics"
  • T7 (R7): GRPO-contrastive v_hack — candidate, defer unless R4 evidence weakens

    • steps: fork extract_vhack_grad.py → extract_vhack_grpo.py; advantage = +1 on hack completion, -1 on clean; same per-module delta_S.grad capture; write v_hack_grpo.safetensors
    • verify: rerun probe-uat with --v-hack-path=...grpo.safetensors; T4 t-stat ≥ 4.46
    • success: t-stat at least as strong as NLL-extracted v_hack
    • likely_fail: GRPO-loss gradient on a single pair has too little signal (vs NLL-mean which averages over many tokens)
    • sneaky_fail: implementation accidentally uses NLL loss inside (no functional change); T4 result is identical to NLL run — check by diffing the saved v_hack tensors
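
The T3 check (cos_out < cos_in after projection) can be illustrated on toy vectors. This is a sketch only: it assumes the projection removes the full component of each module's gradient along v_hack, which may differ from what project_delta_S_grad actually does, and the function names are hypothetical:

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Plain cosine similarity; returns 0.0 if either vector is zero."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def project_out(grad, v):
    """Remove the component of grad along v: g - (<g, v> / <v, v>) v."""
    vv = dot(v, v)
    if vv == 0:
        return list(grad)
    coef = dot(grad, v) / vv
    return [g - coef * x for g, x in zip(grad, v)]


def project_and_diagnose(grads, v_hack):
    """Project every module and return (projected, mean cos_in, mean cos_out).

    grads and v_hack map module name -> flat list of floats (assumed layout).
    """
    projected, cos_in, cos_out = {}, [], []
    for name, g in grads.items():
        v = v_hack[name]
        cos_in.append(cosine(g, v))
        projected[name] = project_out(g, v)
        cos_out.append(cosine(projected[name], v))
    n = len(grads)
    return projected, sum(cos_in) / n, sum(cos_out) / n
```

With full removal, cos_out is zero up to rounding, so an inverted sign (adding the component instead of subtracting it) shows up immediately as cos_out > cos_in, which is the likely_fail listed for T3.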

Context

  • Branch: probe/distill-cosine, commits d111db2 (script + first attempt) and d2e15da (NLL fix + T4 redesign).
  • Teacher: ariahw/rl-rewardhacking-leetcode-rh-s65 — LoRA adapter on Qwen3-4B, no-intervention arm, ~99% hack at step 200 on our pool.
  • Student: Qwen3-4B + AntiPaSTO (full-rank SVD), v_hack_full.safetensors from 2026-05-23 extraction.
  • Loss in current probe: mean NLL on completion tokens — apples-to-apples with extract_vhack_grad.py's v_hack extraction. Not GRPO.
  • Prompt distribution: dataset's baked-in CODE_SYSTEM_PROMPT + user message with simple_overwrite_tests hint applied. Not the inoculation prompt train.py uses.
  • Cosine metric in norm_weighted_cos: per-module unit-normalized v, aggregated as sum_m <c_m, v_m_unit> / sqrt(sum_m ||c_m||^2). This is a projection magnitude proportional to cosine; upper bound is sqrt(n_modules) ≈ 15.9 for our 252 wrapped Linears. Sign and relative ordering are correct; absolute values are not in [-1, 1]. Acceptable for the discrimination test (R4) but mention in writeups.
  • cos_in/cos_out in the project_delta_S_grad diagnostics ARE proper per-module cosines averaged; these are in [-1, 1].
  • The 4-stage pueue chain (teacher → vanilla → projected → uat) is the canonical pipeline. Each stage saves replayable artifacts.
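
The norm_weighted_cos aggregation and its sqrt(n_modules) upper bound described above can be reproduced on toy inputs. A minimal sketch of the stated formula, assuming contributions and v_hack are stored as dicts from module name to a flat list of floats; the data layout is an assumption, and this is not the repository's implementation:

```python
import math


def norm_weighted_cos(contribs, v_hack):
    """sum_m <c_m, v_m_unit> / sqrt(sum_m ||c_m||^2), as described in Context.

    contribs, v_hack: dict of module name -> flat list of floats (assumed layout).
    Modules whose v_hack entry has zero norm add to the denominator only.
    """
    num, sq = 0.0, 0.0
    for name, c in contribs.items():
        v = v_hack[name]
        v_norm = math.sqrt(sum(x * x for x in v))
        if v_norm > 0:
            num += sum(ci * vi for ci, vi in zip(c, v)) / v_norm
        sq += sum(ci * ci for ci in c)
    return num / math.sqrt(sq) if sq > 0 else float("nan")


# Perfect per-module alignment with equal norms attains the bound sqrt(n_modules).
n_modules = 252
contribs = {"m%d" % i: [1.0, 0.0] for i in range(n_modules)}
v_hack = {"m%d" % i: [2.0, 0.0] for i in range(n_modules)}
bound = norm_weighted_cos(contribs, v_hack)
```

Here bound equals sqrt(252), roughly 15.9, matching the figure quoted above. Dividing the result by sqrt(n_modules) would map this case to exactly 1, which is the cosmetic rescaling a reader might want for interpretability.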

Log

  • 2026-05-25 — branch created, probe_distill.py + probe_uat.py written.
  • 2026-05-25 — first 1-step probe: 0/8 hacks. Diagnosed: rh-s65 needs simple_overwrite_tests hint applied; train.py's pass_test override is wrong for rh distribution. Added load_problems_rh().
  • 2026-05-25 — first 20-step probe (off-policy Dr.GRPO loss): all cos_S_contrib = nan. Diagnosed: rh teacher hacks 100% → all rewards identical → zero advantage → per-sample backward skipped. Switched to per-sample mean NLL on completion (apples-to-apples with v_hack extraction). Re-ran: cosines populated; T4 originally failed (n_not_hacked=1), so the split moved to gt_pass within hacked samples. Final UAT: 4/4 PASS.
  • 2026-05-25 — v_hack from NLL ≠ GRPO policy gradient. Probe currently validates the NLL story. R5/R7 are how we'd close the GRPO gap.
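
The zero-advantage diagnosis in the log follows directly from how group-relative advantages are formed. A minimal sketch, assuming the Dr.GRPO-style advantage is the reward minus the group mean with no standard-deviation scaling; the reward values and function names are illustrative, not taken from train.py:

```python
def group_advantages(rewards):
    """Group-relative advantage: reward minus the group mean (assumed Dr.GRPO form)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def has_learning_signal(rewards, tol=1e-8):
    """False when every advantage is (numerically) zero, i.e. the group would be skipped."""
    return any(abs(a) > tol for a in group_advantages(rewards))


# A teacher that hacks on every rollout gives identical rewards across the group.
all_hack = [3.0] * 8
# A mixed-policy group (the R5 idea) restores reward variance.
mixed = [3.0] * 4 + [0.0] * 4
```

Under this assumption has_learning_signal(all_hack) is False and has_learning_signal(mixed) is True, which is the motivation for the mixed-policy generator in R5.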

TODO

  • Decide: push probe/distill-cosine to origin?
  • Decide: clean up the cosine-magnitude bound (divide by sqrt(n_modules) for interpretability) — cosmetic, no scientific impact.
  • Plotting: per-step trajectory of mean cos_S_contrib (vanilla vs projected) would visualize the projection mechanism. Currently numbers only. ~30 min of matplotlib.
  • spec.md amendment: H1 prediction now has a falsification hook at R5; document the path.

Errors

| Task | Error | Resolution |
|---|---|---|
| T1 (initial) | 0/8 hacks from rh-s65 | applied simple_overwrite_tests hint via load_problems_rh |
| T2 (initial) | all cos_S_contrib = nan | replaced off-policy Dr.GRPO loss with per-sample NLL; removed zero_advantages skip |
| T4 (initial) | n_not_hacked=1, t-test undefined | bucketing changed to (hacked=1, gt_pass=0) vs (hacked=1, gt_pass=1) |
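
The T4 error is easy to reproduce: Welch's t needs a sample variance from each bucket, and a bucket of size 1 has none. A minimal stdlib sketch of the statistic and its Welch-Satterthwaite degrees of freedom, assuming each bucket is a plain list of cosines; it returns None where the test is undefined and does not compute a p-value, and it is not the code in probe_uat.py:

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for mean(a) > mean(b).

    Returns None when either bucket has fewer than 2 samples or when both
    sample variances are zero, since the statistic is undefined there.
    """
    if len(a) < 2 or len(b) < 2:
        return None
    va, vb = variance(a) / len(a), variance(b) / len(b)
    se2 = va + vb
    if se2 == 0:
        return None
    t = (mean(a) - mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

With pure-hack cosines as a and hack-plus-correct cosines as b, a positive t supports R4. The spec's "t > 2" threshold can be checked directly on the returned statistic, while the p-value would need a t-distribution CDF from a statistics library.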