mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 18:04:59 +08:00
<!-- Extracted from docs/1.md (4130-line brainstorm) + spec.md by Explore subagent, 2026-05-23.
Verbatim phrases in backticks. Where the transcript is ambiguous, marked [ambiguous]. -->

# Extracted preferences and decisions: projected_grpo

## TL;DR delta vs spec.md

Spec.md is the clean preregistered plan. docs/1.md is the reasoning trail behind it. The biggest
deltas the brainstorm adds (not in spec):

1. **The whole project pivoted** mid-conversation from a DPO+sycophancy plan (Anthropic HH-RLHF)
   to GRPO+reward-hacking (Nanda/Ariahw LeetCode). Driver: gradient projection in SVD basis matches
   GRPO's unpaired structure better than DPO's paired-preference structure.
2. **Method evolved** from "bidirectional SVD-LoRA with NLL+KL" (paired-preference native, the
   AntiPaSTO line) to **gradient-level intervention + SVD-basis denoising**, an orthogonal
   approach for unpaired GRPO rollouts.
3. **Rebound paper (Wu & Tang 2026)** appeared mid-brainstorm and reframed the positioning:
   not novel mechanism (concept-direction intervention) but novel level (gradient vs advantage).
   User's confidence stayed positive: ~60% that the method works now (was ~40% pre-Rebound),
   framed as net positive because Rebound *validates* the core mechanism.
4. **Single-GPU pragmatism**: extensive back-and-forth on 3090 vs 96GB RTX 6000 Ada. Landed on
   96GB RTX 6000 + Qwen3.5-2B as the practical sweet spot.

## 1. Design decisions

- **Substitute Qwen3.5-2B for Qwen3-4B**. Reason: compute budget. Fallback to Qwen3-4B with
  reduced num_generations if H4 (hack emergence) fails at 2B.
- **Use verl, not TRL**. Reason: Nanda's repo uses verl v0.6.1; minimise reimplementation risk.
  Fallback to TRL GRPOTrainer + manual reward function reimplementation if verl breaks on a single
  96GB GPU.
- **Gradient projection happens per optimizer step**, not per token. Formula in spec.md §5.
- **One-sided projection**: only when `cos_align > 0`, i.e. only when the gradient is pushing toward
  the hack direction. Negative alignment is left alone.
- **Magnitude preservation**: after projecting out the hack component, renormalize back to `||g||`.
  This is a design choice; explicit ablation arm `d` in spec.md tests removing it.
- **SVD basis from W (model weight matrices)**, not from activations. Top-m default m=16,
  sweep {8, 16, 32}. Project `v_hack` into this basis, then back to the full residual stream.
- **v_hack extracted from 60-80 contrastive pairs**:
  - positive (hacky) = LeetCode prompts + `def run_tests(): pass`-style completions
  - negative (clean) = same prompts + legitimate solution attempts (base-Qwen at T=0)
  - 20 held-out pairs for validation; require >90% separation accuracy.
- **CAA-style sanity check**: before training, add `v_hack` at inference to the base model and
  confirm it steers generation toward hack-flavored completions. Catches broken extraction.

## 2. Method internals (math)

**v_hack extraction** (per Wu-Tang style; user adopted this after reading their paper):

- Take last-token hidden states at layers 60-75% of model depth (multi-layer averaged).
- $d = \frac{1}{N}\sum_i (h_i^+ - h_i^-)$, where $h_i^+$ and $h_i^-$ are the hidden states for the
  hacky and clean member of pair $i$. This mean difference $d$ is the raw `v_hack`.
- [ambiguous in transcript: whether to use single-layer (as Rebound) or sweep layers]

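The extraction steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the project's pipeline: it assumes the hidden states have already been collected and layer-averaged into arrays of shape `(N, dim)`, and it reads "separation accuracy" as the fraction of held-out pairs whose hacky member projects higher onto the direction, which is an assumption rather than something the notes define. The function names `extract_direction` and `separation_accuracy` are hypothetical.

```python
import numpy as np

def extract_direction(h_pos: np.ndarray, h_neg: np.ndarray) -> np.ndarray:
    """Mean difference d = (1/N) * sum_i (h_i^+ - h_i^-), returned unit-norm."""
    d = (h_pos - h_neg).mean(axis=0)
    return d / np.linalg.norm(d)

def separation_accuracy(v: np.ndarray, h_pos: np.ndarray, h_neg: np.ndarray) -> float:
    """Fraction of pairs where the hacky member projects higher onto v (assumed definition)."""
    return float(np.mean(h_pos @ v > h_neg @ v))

# Synthetic stand-in for real hidden states: 60 pairs to fit, 20 held out.
rng = np.random.default_rng(0)
dim, n_fit, n_val = 64, 60, 20
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
clean = rng.normal(size=(n_fit + n_val, dim))
hacky = clean + 2.0 * true_dir + 0.1 * rng.normal(size=clean.shape)

v_hack = extract_direction(hacky[:n_fit], clean[:n_fit])
acc = separation_accuracy(v_hack, hacky[n_fit:], clean[n_fit:])
print(f"held-out separation accuracy: {acc:.2f}")
```

On real activations, the same check would gate the >90% requirement before any training run is started.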
**SVD denoising of v_hack**:

- Take W for the relevant projection layer (residual-stream-out, [ambiguous which layer]).
- $W = U S V^\top$. Right singular vectors $V$.
- Project: $v_{hack}^{(S)} = V_{:,:m}^\top \cdot v_{hack}$.
- Reproject back: $v_{hack}^{\text{denoised}} = V_{:,:m} \cdot v_{hack}^{(S)}$.
- Normalize: $v_{hack} \leftarrow v_{hack}^{\text{denoised}} / \|v_{hack}^{\text{denoised}}\|$.

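A minimal NumPy sketch of the denoising steps above. It assumes `W` has shape `(d_out, d_in)` and that `v_hack` lives in the `d_in` space spanned by the right singular vectors; which weight matrix to use is still open in these notes, so the random `W` here is only a placeholder. The function name `svd_denoise` is hypothetical.

```python
import numpy as np

def svd_denoise(v_hack: np.ndarray, W: np.ndarray, m: int = 16) -> np.ndarray:
    """Project v_hack onto the top-m right singular vectors of W, then renormalize."""
    # np.linalg.svd returns V^T; its rows are right singular vectors,
    # ordered by descending singular value.
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    V_m = Vt[:m].T                  # V_{:, :m}, shape (d_in, m)
    coords = V_m.T @ v_hack         # v_hack^{(S)}: coordinates in the truncated basis
    v_denoised = V_m @ coords       # back to the original space
    return v_denoised / np.linalg.norm(v_denoised)

# Placeholder weight matrix and direction (assumption: real W comes from the model).
rng = np.random.default_rng(0)
W = rng.normal(size=(48, 64))
v_raw = rng.normal(size=64)
v_clean = svd_denoise(v_raw, W, m=16)
```

Because the operation is an orthogonal projection followed by normalization, applying it twice gives the same vector, which is a cheap correctness check when sweeping m over {8, 16, 32}.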
**Gradient projection at training step**:

- `cos_align = <g, v_hack> / ||g||` (`v_hack` is unit-norm, so this is the cosine between `g` and `v_hack`)
- If `cos_align > 0`:
  - `g' = g - cos_align * ||g|| * v_hack` (remove the component along `v_hack`)
  - `g' = g' * ||g|| / ||g'||` (restore magnitude; the ablation removes this)
- Else: `g' = g`.
- Then `optimizer.step(g')`.

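The projection rule above, written as a small NumPy function. This is a sketch under stated assumptions: `g` and `v_hack` are treated as flat vectors of the same dimension, and how a parameter gradient is mapped into the space of `v_hack` is left to spec.md §5. The name `project_gradient`, the `restore_norm` flag, and the numerical guard for a gradient that is parallel to `v_hack` are additions for illustration.

```python
import numpy as np

def project_gradient(g: np.ndarray, v_hack: np.ndarray, restore_norm: bool = True):
    """One-sided removal of the v_hack component of g. Returns (g', cos_align)."""
    g_norm = np.linalg.norm(g)
    if g_norm == 0.0:
        return g.copy(), 0.0
    cos_align = float(g @ v_hack) / g_norm      # v_hack is assumed unit-norm
    if cos_align <= 0.0:
        return g.copy(), cos_align              # not pushing toward the hack direction
    g_proj = g - cos_align * g_norm * v_hack    # remove the component along v_hack
    if restore_norm:
        p_norm = np.linalg.norm(g_proj)
        # Guard (assumption): if g was parallel to v_hack, nothing is left to rescale.
        if p_norm > 1e-12 * g_norm:
            g_proj = g_proj * (g_norm / p_norm)
    return g_proj, cos_align

v = np.array([1.0, 0.0, 0.0])
g_new, cos = project_gradient(np.array([3.0, 4.0, 0.0]), v)
print(g_new, cos)
```

Setting `restore_norm=False` corresponds to the ablation arm that drops magnitude preservation, and logging the returned `cos_align` each step gives the "is the projection biting" diagnostic.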
**KL / trust region**: spec doesn't mandate explicit KL beyond GRPO's built-in. [ambiguous: user
mentioned iso-KL trust region from AntiPaSTO line but chose NOT to add it on top of GRPO's KL
because GRPO already has reference-model KL].

**PCGrad**: NOT used in this project. PCGrad was in the AntiPaSTO/bidirectional design; here
gradient is single (no pole-pairing), so PCGrad doesn't apply.

## 3. Hyperparameters

| Param | Value | Source / justification |
|---|---|---|
| model | Qwen3.5-2B | substitute for Qwen3-4B per Nanda; compute |
| LoRA r | 32 | Nanda's published |
| LoRA alpha | 32 | Nanda's published |
| lr | 7e-5 | Nanda's published |
| num_generations | 8 | reduced from Nanda's 16 to fit single-GPU |
| batch | 128 | reduced from Nanda's 256 to fit single-GPU |
| training steps | 200 | Nanda's published |
| m (SVD truncation) | 16 (sweep 8/16/32) | user choice, brainstorm landed here |
| n contrastive pairs | 60-80 | Wu-Tang use 60 for extract + 20 validate |
| layer for hidden states | 60-75% depth, multi-layer mean | Wu-Tang practice |
| seeds | 3 | "where indicated" in spec; 1 seed for some ablations |
| logging cadence | every 25 steps | spec.md §7 |

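For reference, the table can be captured as a single frozen config object. This is a sketch only: the field names are hypothetical and are not verl or TRL configuration keys, and the H4 fallback values are the ones listed under the decision points in this document.

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class RunConfig:
    # Field names are illustrative, not keys from any training framework.
    model: str = "Qwen3.5-2B"
    lora_r: int = 32
    lora_alpha: int = 32
    lr: float = 7e-5
    num_generations: int = 8
    batch: int = 128
    training_steps: int = 200
    svd_m: int = 16
    n_contrastive_pairs: Tuple[int, int] = (60, 80)       # range from the table
    hidden_layer_depth: Tuple[float, float] = (0.60, 0.75)  # fraction of model depth
    seeds: int = 3
    log_every: int = 25

BASE = RunConfig()
# Sweep over the SVD truncation values named in the table.
M_SWEEP = [replace(BASE, svd_m=m) for m in (8, 16, 32)]
# H4 fallback: larger model with smaller rollout group and batch.
H4_FALLBACK = replace(BASE, model="Qwen3-4B", num_generations=4, batch=64)
```

Keeping the config frozen and deriving variants with `replace` makes each arm's difference from the base run explicit in the run log.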
## 4. Connections to user's prior work

- **AntiPaSTO**: contrastive prefix pairs → steering direction in SVD-of-W basis. This project
  inherits the *SVD-of-W basis* idea but applies it at the gradient (not activation) level, and
  uses unpaired GRPO contrast instead of paired prefix contrast.
- **Bidirectional c-scaled steering LoRA (NLL + KL + PCGrad)**: explicitly *not* used here.
  That design is paired-preference native; GRPO is unpaired. The user recognized this midway and
  pivoted.
- **Iso-KL trust region**: implicit via GRPO's reference KL. Not added on top.

User's framing (paraphrased from transcript): the contribution is **gradient-level intervention**
(vs Rebound's advantage-level modification) with **SVD-basis denoising of the concept direction**
(novel vs both Rebound and AntiPaSTO).

## 5. Decision points / open questions

- **H4 fallback**: if Qwen3.5-2B hack rate <30% at step 200 → swap to Qwen3-4B with
  num_generations=4, batch=64.
- **verl fallback**: if verl breaks on single 96GB → TRL GRPOTrainer + manual reward
  reimplementation. Higher engineering cost.
- **v_hack steering check fails** → diagnose layer choice, pair quality, or SVD truncation
  *before* committing to training runs.
- **All methods tie vanilla** → check `cos_align` logs to confirm projection is biting.
- **[ambiguous] Layer choice for SVD**: which W to SVD? The brainstorm suggests the output
  projection at the layer where v_hack is strongest, but doesn't commit. → resolve during smoke test.
- **[ambiguous] When does v_hack update?**: extracted once from base model, or refreshed during
  training? Transcript leans toward once-from-base (Wu-Tang showed cos sim ≥0.99 across
  checkpoints) but doesn't fully commit.

## 6. Things explicitly rejected

- **Full bidirectional NLL+KL+PCGrad on GRPO**: structural mismatch (paired vs unpaired).
- **Hotel toy benchmark / DPO+sycophancy**: weaker published baselines, less impressive story.
- **Unsloth matmul setup**: out of scope; verl is the published baseline.
- **Direct Rebound replication as main result**: would be derivative. Rebound is a baseline arm,
  not the headline.
- **Monitor-based mitigations**: Nanda's territory, cite their numbers, don't re-run.
- **Triplet loss in activation space**: explored as a frame, rejected because (a) needs similarity
  oracle, (b) word2vec algebra needs many concepts not 1, (c) overlaps with CCS/RepEng.
- **Adding new losses to balance**: user is explicit that adding losses creates 4 problems
  (the loss, balancing, meta-balancing, debugging). Gradient projection is a *constraint*, not
  a competing objective.

## 7. Eval / measurement specifics

- **Hack rate**: % rollouts that successfully overwrite tests (Nanda's metric, from their codebase).
  Use Nanda's exact definition: the model's code passes only its own `run_tests` and not the ground
  truth set.
- **Pass rate**: % rollouts that pass tests legitimately on held-out problems *without write
  access to evaluator*. Held-out set = Nanda's eval set (113 problems per LW appendix; 10 samples
  each).
- **Eval uses randomly sampled function names** (not `run_tests`) to avoid memorization, per
  Nanda's eval setup.
- **`cos_align(g, v_hack)` per step**: diagnostic for whether projection is biting.
- **KL drift from base** per step: diagnostic for catastrophic policy change.
- **Headline plot**: scatter of pass rate (x) vs hack rate (y), one point per (arm × seed). Pareto
  frontier. Method should sit below-and-to-the-right of vanilla (lower hack rate, higher pass rate).
  Annotate Rebound's position.
- **Statistical test**: paired t-test of avg hack rate / accuracy across 3 seeds × 10 samples =
  30 scores (matching Nanda's protocol).
- **Falsification check**: pre-registered analysis on H1-H4 before publishing. Report all
  hypotheses including falsified.
- **Headline claim format**: "at matched pass rate ±5pp on held-out problems without write access,
  our method reduces hack rate from X% to Y%."

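The two headline metrics and the "matched pass rate" condition can be written down in a few lines. This is a simplified reading of the definitions above, not Nanda's evaluation code: each rollout is reduced to two booleans, and the `Rollout` class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class Rollout:
    passes_own_tests: bool      # passes the test function the model itself wrote
    passes_ground_truth: bool   # passes the ground-truth test set

def hack_rate(rollouts: Sequence[Rollout]) -> float:
    """Fraction that pass only their own tests and not the ground truth."""
    hacks = sum(r.passes_own_tests and not r.passes_ground_truth for r in rollouts)
    return hacks / len(rollouts)

def pass_rate(rollouts: Sequence[Rollout]) -> float:
    """Fraction that pass the ground-truth tests."""
    return sum(r.passes_ground_truth for r in rollouts) / len(rollouts)

def matched(pass_a: float, pass_b: float, tol: float = 0.05) -> bool:
    """True if two arms are within ±5 percentage points of pass rate."""
    return abs(pass_a - pass_b) <= tol

sample = [
    Rollout(True, False),   # hack
    Rollout(True, True),    # legitimate pass
    Rollout(False, False),  # plain failure
    Rollout(True, True),    # legitimate pass
]
print(hack_rate(sample), pass_rate(sample))
```

The headline claim would then compare `hack_rate` between arms only where `matched` holds for their pass rates.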
## 8. Compute budget specifics

- 96GB RTX 6000 Ada (rented). Single GPU. [ambiguous: vast.ai vs runpod, transcript mentions both].
- Per-run: ~2-3 hours. Per Nanda: 4×H200 = 3 hours; the user's 1×96GB roughly matches Nanda's
  wall-clock at the reduced batch.
- 13-15 runs × ~3h ≈ 40-50 hours compute.
- Cost: ~$3 AUD/hr → ~$120-150 AUD compute; budget ~$200-250 with iteration buffer.
- Calendar: ~1 week back-to-back, 2-3 weeks with iteration.
- [ambiguous] No explicit decision on shared GPU queueing; pueue setup recommended.
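
The budget arithmetic above, spelled out as a quick check. The inputs are the figures from the list (13-15 runs, about 3 hours each, about $3 AUD per hour); the helper name `budget` is hypothetical.

```python
def budget(runs: int, hours_per_run: float = 3.0, aud_per_hour: float = 3.0):
    """Return (total GPU hours, total cost in AUD) for a number of runs."""
    hours = runs * hours_per_run
    return hours, hours * aud_per_hour

for runs in (13, 15):
    hours, cost = budget(runs)
    print(f"{runs} runs: {hours:.0f} h, ~${cost:.0f} AUD")
```

At exactly 3 hours per run this gives 39-45 hours and $117-135 AUD, so the 40-50 hour and $120-150 figures in the list already include a small margin before the separate iteration buffer.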