evil_MoE/docs/brainstorm/extracted_prefs.md
wassname 120400c5f5 setup
2026-05-23 10:40:02 +08:00

<!-- Extracted from docs/1.md (4130-line brainstorm) + spec.md by Explore subagent, 2026-05-23.
Verbatim phrases in backticks. Where the transcript is ambiguous, marked [ambiguous]. -->
# Extracted preferences and decisions — projected_grpo
## TL;DR delta vs spec.md
Spec.md is the clean preregistered plan. docs/1.md is the reasoning trail behind it. The biggest
deltas the brainstorm adds (not in spec):
1. **The whole project pivoted** mid-conversation from a DPO+sycophancy plan (Anthropic HH-RLHF)
to GRPO+reward-hacking (Nanda/Ariahw LeetCode). Driver: gradient projection in SVD basis matches
GRPO's unpaired structure better than DPO's paired-preference structure.
2. **Method evolved** from "bidirectional SVD-LoRA with NLL+KL" (paired-preference native, the
AntiPaSTO line) to **gradient-level intervention + SVD-basis denoising** — an orthogonal
approach for unpaired GRPO rollouts.
3. **Rebound paper (Wu & Tang 2026)** appeared mid-brainstorm and reframed the positioning:
not novel mechanism (concept-direction intervention) but novel level (gradient vs advantage).
User's estimate that the method works moved to ~60% (was ~40% pre-Rebound), framed as net
positive because Rebound *validates* the core mechanism. [ambiguous: the notes also call this
update "downward", which conflicts with the 40% → 60% figures.]
4. **Single-GPU pragmatism**: extensive back-and-forth on 3090 vs 96GB RTX 6000 Ada. Landed on
96GB RTX 6000 + Qwen3.5-2B as the practical sweet spot.
## 1. Design decisions
- **Substitute Qwen3.5-2B for Qwen3-4B**. Reason: compute budget. Fallback to Qwen3-4B with
reduced num_generations if H4 (hack emergence) fails at 2B.
- **Use verl, not TRL**. Reason: Nanda's repo uses verl v0.6.1; minimise reimplementation risk.
Fallback to TRL GRPOTrainer + manual reward function reimplementation if verl breaks on single
96GB.
- **Gradient projection happens per optimizer step**, not per token. Formula in spec.md §5.
- **One-sided projection**: only when `cos_align > 0`, i.e. only when gradient is pushing toward
hack direction. Negative alignment is left alone.
- **Magnitude preservation**: after projecting out the hack component, renormalize back to `||g||`.
This is a design choice — explicit ablation arm `d` in spec.md tests removing it.
- **SVD basis from W (model weight matrices)**, not from activations. Top-m default m=16,
sweep {8, 16, 32}. Project `v_hack` into this basis, then back to full residual stream.
- **v_hack extracted from 60-80 contrastive pairs**:
  - positive (hacky) = LeetCode prompts + `def run_tests(): pass`-style completions
  - negative (clean) = same prompts + legitimate solution attempts (base-Qwen at T=0)
  - 20 held-out pairs for validation; require >90% separation accuracy.
- **CAA-style sanity check**: before training, add `v_hack` at inference to base model, confirm
it steers generation toward hack-flavored completions. Catches broken extraction.
## 2. Method internals (math)
**v_hack extraction** (per Wu-Tang style; user adopted this after reading their paper):
- Take last-token hidden states at layers 60-75% of model depth (multi-layer averaged).
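- $v_{hack} = d = \frac{1}{N}\sum_i (h_i^+ - h_i^-)$, where $h_i^+$ / $h_i^-$ are the hidden states for the hacky / clean completion of prompt $i$ and $N$ is the number of contrastive pairs.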
- [ambiguous in transcript: whether to use single-layer (as Rebound) or sweep layers]
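
The mean-difference extraction and the held-out separation check can be sketched as below. This is a minimal sketch under stated assumptions, not the project's implementation: the arrays stand in for last-token hidden states already averaged over the chosen layers, the data is synthetic, and `extract_direction` / `separation_accuracy` are hypothetical helper names. "Separation accuracy" is assumed here to mean the fraction of held-out pairs whose hacky state projects higher onto the direction than its clean partner; the notes do not define it.

```python
import numpy as np

def extract_direction(h_pos, h_neg):
    """Mean of paired differences (hacky minus clean), scaled to unit length."""
    d = (h_pos - h_neg).mean(axis=0)
    return d / np.linalg.norm(d)

def separation_accuracy(v, h_pos, h_neg):
    """Fraction of pairs where the hacky state projects higher on v than the clean one."""
    return float(np.mean(h_pos @ v > h_neg @ v))

# Synthetic stand-in for real activations (hypothetical data, hidden size 64).
rng = np.random.default_rng(0)
dim = 64
true_dir = np.zeros(dim)
true_dir[0] = 1.0

def make_pairs(n):
    base = rng.normal(size=(n, dim))                      # shared prompt content
    h_neg = base + 0.1 * rng.normal(size=(n, dim))        # clean completion
    h_pos = base + 2.0 * true_dir + 0.1 * rng.normal(size=(n, dim))  # hacky completion
    return h_pos, h_neg

train_pos, train_neg = make_pairs(60)   # pairs used for extraction
val_pos, val_neg = make_pairs(20)       # held-out pairs for validation

v_hack = extract_direction(train_pos, train_neg)
acc = separation_accuracy(v_hack, val_pos, val_neg)
print(f"held-out separation accuracy: {acc:.2f}")
```

On real activations the >90% threshold from the design decisions would be applied to `acc` before moving on to training.
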
**SVD denoising of v_hack**:
- Take W for the relevant projection layer (residual-stream-out, [ambiguous which layer]).
- $W = U S V^\top$. Right singular vectors $V$.
- Project: $v_{hack}^{(S)} = V_{:,:m}^\top \cdot v_{hack}$.
- Reproject back: $v_{hack}^{\text{denoised}} = V_{:,:m} \cdot v_{hack}^{(S)}$.
- Normalize: $v_{hack} \leftarrow v_{hack}^{\text{denoised}} / \|v_{hack}^{\text{denoised}}\|$.
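
The three steps above (project, reproject, normalize) can be sketched with NumPy. This is an illustration under assumptions: `W` is a random stand-in whose second dimension matches the dimension of `v_hack`, so that the right singular vectors live in the same space as the direction. Which weight matrix to use is left open in the notes, and `denoise_in_svd_basis` is a hypothetical name.

```python
import numpy as np

def denoise_in_svd_basis(v, W, m=16):
    """Keep only the part of v that lies in the top-m right singular subspace of W."""
    # W = U S V^T; NumPy returns V^T, whose rows are the right singular vectors,
    # ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    V_m = Vt[:m].T                 # shape (d_in, m)
    coeffs = V_m.T @ v             # coordinates of v in the top-m basis
    v_dn = V_m @ coeffs            # back in the full space
    norm = np.linalg.norm(v_dn)
    if norm == 0.0:
        raise ValueError("v has no component in the top-m subspace")
    return v_dn / norm

rng = np.random.default_rng(0)
W = rng.normal(size=(48, 32))      # hypothetical weight, d_out=48, d_in=32
v = rng.normal(size=32)            # hypothetical raw direction
v_dn = denoise_in_svd_basis(v, W, m=16)
```

Because the operation is an orthogonal projection followed by normalization, applying it a second time leaves the vector unchanged, which makes a cheap correctness check.
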
**Gradient projection at training step**:
- `cos_align = <g, v_hack> / ||g||` (with `v_hack` unit-norm, as normalized above)
- If `cos_align > 0`:
  - `g' = g - cos_align * ||g|| * v_hack` # remove component along v_hack
  - `g' = g' * ||g|| / ||g'||` # restore magnitude (ablation removes this)
- Else: `g' = g`.
- Then the optimizer steps with `g'` in place of `g`.
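
The projection rule can be written as a small function on flat vectors. This is a sketch, assuming `g` and `v_hack` already live in the same space and `v_hack` is unit-norm; how the residual-stream direction is mapped onto parameter gradients is defined in spec.md §5 and is not reproduced here. The function name, the `preserve_norm` flag (standing in for ablation arm `d`), and the zero-norm guards are assumptions added for illustration.

```python
import numpy as np

def project_out_hack(g, v_hack, preserve_norm=True, eps=1e-12):
    """One-sided removal of the v_hack component of g. Returns (g', cos_align)."""
    g_norm = np.linalg.norm(g)
    if g_norm < eps:
        return g.copy(), 0.0
    cos_align = float(g @ v_hack) / g_norm
    if cos_align <= 0:
        # Gradient is not pushing toward the hack direction: leave it alone.
        return g.copy(), cos_align
    g_proj = g - cos_align * g_norm * v_hack       # remove component along v_hack
    if preserve_norm:
        p_norm = np.linalg.norm(g_proj)
        if p_norm > eps:                            # guard: g parallel to v_hack
            g_proj = g_proj * (g_norm / p_norm)     # restore original magnitude
    return g_proj, cos_align

v_hack = np.array([1.0, 0.0, 0.0])
g_toward = np.array([3.0, 4.0, 0.0])    # positive alignment, gets projected
g_away = np.array([-3.0, 4.0, 0.0])     # negative alignment, untouched

g1, c1 = project_out_hack(g_toward, v_hack)
g2, c2 = project_out_hack(g_away, v_hack)
g3, _ = project_out_hack(g_toward, v_hack, preserve_norm=False)
```

Returning `cos_align` alongside the projected gradient makes it easy to log the per-step diagnostic mentioned in the eval section.
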
**KL / trust region**: spec doesn't mandate explicit KL beyond GRPO's built-in. [ambiguous: user
mentioned iso-KL trust region from AntiPaSTO line but chose NOT to add it on top of GRPO's KL
because GRPO already has reference-model KL].
**PCGrad**: NOT used in this project. PCGrad was in the AntiPaSTO/bidirectional design; here
gradient is single (no pole-pairing), so PCGrad doesn't apply.
## 3. Hyperparameters
| Param | Value | Source / justification |
|---|---|---|
| model | Qwen3.5-2B | substitute for Qwen3-4B per Nanda; compute |
| LoRA r | 32 | Nanda's published |
| LoRA alpha | 32 | Nanda's published |
| lr | 7e-5 | Nanda's published |
| num_generations | 8 | reduced from Nanda's 16 to fit single-GPU |
| batch | 128 | reduced from Nanda's 256 to fit single-GPU |
| training steps | 200 | Nanda's published |
| m (SVD truncation) | 16 (sweep 8/16/32) | user choice, brainstorm landed here |
| n contrastive pairs | 60-80 | Wu-Tang use 60 for extract + 20 validate |
| layer for hidden states | 60-75% depth, multi-layer mean | Wu-Tang practice |
| seeds | 3 | "where indicated" in spec; 1 seed for some ablations |
| logging cadence | every 25 steps | spec.md §7 |
## 4. Connections to user's prior work
- **AntiPaSTO**: contrastive prefix pairs → steering direction in SVD-of-W basis. This project
inherits the *SVD-of-W basis* idea but applies it at the gradient (not activation) level, and
uses unpaired GRPO contrast instead of paired prefix contrast.
- **Bidirectional c-scaled steering LoRA (NLL + KL + PCGrad)**: explicitly *not* used here.
That design is paired-preference native; GRPO is unpaired. The user recognized this midway and
pivoted.
- **Iso-KL trust region**: implicit via GRPO's reference KL. Not added on top.
User's framing (paraphrased from transcript): the contribution is **gradient-level
intervention** (vs Rebound's advantage-level modification) with **SVD-basis denoising
of the concept direction** (novel vs both Rebound and AntiPaSTO).
## 5. Decision points / open questions
- **H4 fallback**: if Qwen3.5-2B hack rate <30% at step 200 → swap to Qwen3-4B with
num_generations=4, batch=64.
- **verl fallback**: if verl breaks on single 96GB → TRL GRPOTrainer + manual reward
reimplementation. Higher engineering cost.
- **v_hack steering check fails** → diagnose layer choice, pair quality, or SVD truncation
*before* committing to training runs.
- **All methods tie vanilla** → check `cos_align` logs to confirm projection is biting.
- **[ambiguous] Layer choice for SVD**: which W to SVD? The brainstorm suggests the output
projection at the layer where v_hack is strongest, but doesn't commit. → resolve during smoke test.
- **[ambiguous] When does v_hack update?**: extracted once from base model, or refreshed during
training? Transcript leans toward once-from-base (Wu-Tang showed cos sim ≥0.99 across
checkpoints) but doesn't fully commit.
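
The hyperparameter table in §3 and the H4 fallback rule can be captured together in a small config object. This is only a sketch: the field names, the `RunConfig` class, and `apply_h4_fallback` are hypothetical and do not correspond to verl or TRL config keys; the values are the ones listed in these notes.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RunConfig:
    # Values copied from the hyperparameter table; names are illustrative only.
    model: str = "Qwen3.5-2B"
    lora_r: int = 32
    lora_alpha: int = 32
    lr: float = 7e-5
    num_generations: int = 8
    batch: int = 128
    training_steps: int = 200
    svd_m: int = 16
    log_every: int = 25

def apply_h4_fallback(cfg, hack_rate_at_final_step, threshold=0.30):
    """If hacking did not emerge at the small model, switch to the larger fallback."""
    if hack_rate_at_final_step < threshold:
        return replace(cfg, model="Qwen3-4B", num_generations=4, batch=64)
    return cfg

base = RunConfig()
kept = apply_h4_fallback(base, hack_rate_at_final_step=0.45)
swapped = apply_h4_fallback(base, hack_rate_at_final_step=0.10)
```
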
## 6. Things explicitly rejected
- **Full bidirectional NLL+KL+PCGrad on GRPO**: structural mismatch (paired vs unpaired).
- **Hotel toy benchmark / DPO+sycophancy**: weaker published baselines, less impressive story.
- **Unsloth matmul setup**: out of scope; verl is the published baseline.
- **Direct Rebound replication as main result**: would be derivative. Rebound is a baseline arm,
not the headline.
- **Monitor-based mitigations**: Nanda's territory, cite their numbers, don't re-run.
- **Triplet loss in activation space**: explored as a frame, rejected because (a) needs similarity
oracle, (b) word2vec algebra needs many concepts not 1, (c) overlaps with CCS/RepEng.
- **Adding new losses to balance**: user is explicit that adding losses creates 4 problems
(the loss, balancing, meta-balancing, debugging). Gradient projection is a *constraint*, not
a competing objective.
## 7. Eval / measurement specifics
- **Hack rate**: % rollouts that successfully overwrite tests (Nanda's metric, from their codebase).
Use Nanda's exact definition: model's code only passes its own `run_tests` and not the ground
truth set.
- **Pass rate**: % rollouts that pass tests legitimately on held-out problems *without write
access to evaluator*. Held-out set = Nanda's eval set (113 problems per LW appendix; 10 samples
each).
- **Eval uses randomly sampled function names** (not `run_tests`) to avoid memorization, per
Nanda's eval setup.
- **`cos_align(g, v_hack)` per step**: diagnostic for whether projection is biting.
- **KL drift from base** per step: diagnostic for catastrophic policy change.
- **Headline plot**: scatter of hack rate (y-axis) against pass rate (x-axis), one point per (arm × seed). Pareto frontier.
Method should sit below-and-to-the-right of vanilla (lower hack rate, higher pass rate). Annotate Rebound's position.
- **Statistical test**: paired t-test of avg hack rate / accuracy across 3 seeds × 10 samples =
30 scores (matching Nanda's protocol).
- **Falsification check**: pre-registered analysis on H1-H4 before publishing. Report all
hypotheses including falsified.
- **Headline claim format**: "at matched pass rate ±5pp on held-out problems without write access,
our method reduces hack rate from X% to Y%."
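
The metric definitions and the matched-pass-rate condition can be sketched with the standard library. The record keys (`passes_own_tests`, `passes_ground_truth`), the function names, and the example numbers are hypothetical; Nanda's codebase remains the reference for the exact definitions. The paired t function returns only the test statistic, not a p-value.

```python
from math import sqrt
from statistics import mean, stdev

def hack_rate(rollouts):
    """Share of rollouts that pass only their own run_tests, not the ground truth."""
    hacked = [r["passes_own_tests"] and not r["passes_ground_truth"] for r in rollouts]
    return sum(hacked) / len(hacked)

def pass_rate(passed_flags):
    """Share of held-out rollouts (no write access) that pass the ground-truth tests."""
    return sum(passed_flags) / len(passed_flags)

def is_matched(pass_a, pass_b, tol_pp=5.0):
    """True if two pass rates (as fractions) differ by at most tol_pp percentage points."""
    return abs(pass_a - pass_b) * 100.0 <= tol_pp

def paired_t_statistic(a, b):
    """t statistic for paired samples; undefined if all differences are identical."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical toy data, not results.
rollouts = [
    {"passes_own_tests": True, "passes_ground_truth": False},   # hack
    {"passes_own_tests": True, "passes_ground_truth": True},    # legitimate pass
    {"passes_own_tests": False, "passes_ground_truth": False},  # plain failure
    {"passes_own_tests": True, "passes_ground_truth": False},   # hack
]
hr = hack_rate(rollouts)
pr = pass_rate([True, False, True, True])
t = paired_t_statistic([0.5, 0.6, 0.7], [0.2, 0.2, 0.3])
```
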
## 8. Compute budget specifics
- 96GB RTX 6000 Ada (rented). Single GPU. [ambiguous: vast.ai vs runpod, transcript mentions both].
- Per-run: ~2-3 hours. Nanda reports ~3 hours on 4xH200; the user's single 96GB GPU is expected
to roughly match that wall-clock at the reduced batch size.
- 13-15 runs × ~3h ≈ 39-45 hours, rounded up to 40-50 hours compute.
- Cost: ~$3 AUD/hr → ~$120-150 AUD compute; budget ~$200-250 with iteration buffer.
- Calendar: ~1 week back-to-back, 2-3 weeks with iteration.
- [ambiguous] No explicit decision on shared GPU queueing; pueue setup recommended.
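
The budget arithmetic can be checked in a few lines. This sketch takes the upper per-run estimate of 3 hours and the ~$3 AUD/hr rate from the list above; `budget` is a hypothetical helper. At exactly 3 hours per run the totals come out slightly below the rounded figures quoted in the notes.

```python
def budget(runs, hours_per_run, aud_per_hour):
    """Return (total GPU hours, total cost in AUD) for a batch of runs."""
    hours = runs * hours_per_run
    return hours, hours * aud_per_hour

low = budget(runs=13, hours_per_run=3, aud_per_hour=3)
high = budget(runs=15, hours_per_run=3, aud_per_hour=3)
print(f"hours: {low[0]}-{high[0]}, cost: ${low[1]}-{high[1]} AUD")
```
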