evil_MoE/docs/brainstorm/extracted_prefs.md
wassname 120400c5f5 setup
2026-05-23 10:40:02 +08:00

Extracted preferences and decisions — projected_grpo

TL;DR delta vs spec.md

spec.md is the clean preregistered plan; docs/1.md is the reasoning trail behind it. The biggest deltas the brainstorm adds (not in the spec):

  1. The whole project pivoted mid-conversation from a DPO+sycophancy plan (Anthropic HH-RLHF) to GRPO+reward-hacking (Nanda/Ariahw LeetCode). Driver: gradient projection in SVD basis matches GRPO's unpaired structure better than DPO's paired-preference structure.
  2. Method evolved from "bidirectional SVD-LoRA with NLL+KL" (paired-preference native, the AntiPaSTO line) to gradient-level intervention + SVD-basis denoising — an orthogonal approach for unpaired GRPO rollouts.
  3. Rebound paper (Wu & Tang 2026) appeared mid-brainstorm and reframed the positioning: the mechanism (concept-direction intervention) is not novel, but the level (gradient vs advantage) is. User's confidence that the method works is ~60% now (was ~40% pre-Rebound), framed as net positive because Rebound validates the core mechanism. [ambiguous: the transcript summary describes this as a downward update, which conflicts with 40% → 60%; the downward shift may refer to novelty rather than to whether the method works.]
  4. Single-GPU pragmatism: extensive back-and-forth on 3090 vs 96GB RTX 6000 Ada. Landed on 96GB RTX 6000 + Qwen3.5-2B as the practical sweet spot.

1. Design decisions

  • Substitute Qwen3.5-2B for Qwen3-4B. Reason: compute budget. Fallback to Qwen3-4B with reduced num_generations if H4 (hack emergence) fails at 2B.
  • Use verl, not TRL. Reason: Nanda's repo uses verl v0.6.1; minimise reimplementation risk. Fallback to TRL GRPOTrainer + manual reward function reimplementation if verl breaks on single 96GB.
  • Gradient projection happens per optimizer step, not per token. Formula in spec.md §5.
  • One-sided projection: only when cos_align > 0, i.e. only when gradient is pushing toward hack direction. Negative alignment is left alone.
  • Magnitude preservation: after projecting out the hack component, renormalize back to ||g||. This is a design choice — explicit ablation arm d in spec.md tests removing it.
  • SVD basis from W (model weight matrices), not from activations. Top-m default m=16, sweep {8, 16, 32}. Project v_hack into this basis, then back to full residual stream.
  • v_hack extracted from 60-80 contrastive pairs:
    • positive (hacky) = LeetCode prompts + def run_tests(): pass-style completions
    • negative (clean) = same prompts + legitimate solution attempts (base-Qwen at T=0)
    • 20 held-out pairs for validation; require >90% separation accuracy.
  • CAA-style sanity check: before training, add v_hack at inference to base model, confirm it steers generation toward hack-flavored completions. Catches broken extraction.
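
One way to read the ">90% separation accuracy" requirement is pairwise: for each held-out pair, the hacky hidden state should score higher along v_hack than its clean counterpart. The sketch below assumes that reading and uses synthetic arrays in place of real hidden states. `separation_accuracy` and the array shapes are illustrative names, not taken from spec.md.

```python
import numpy as np

def separation_accuracy(h_pos, h_neg, v_hack):
    """Fraction of pairs where the hacky state projects higher than the clean one.

    h_pos, h_neg: (n_pairs, d_model) hidden states for hacky / clean completions.
    v_hack:       (d_model,) candidate direction (normalized inside).
    """
    v = v_hack / np.linalg.norm(v_hack)
    margins = (h_pos - h_neg) @ v          # per-pair score difference along v
    return float((margins > 0).mean())

# Synthetic stand-in for the 20 held-out pairs (assumption: real states come
# from the model's last-token activations).
rng = np.random.default_rng(0)
d_model = 64
true_dir = rng.normal(size=d_model)
true_dir /= np.linalg.norm(true_dir)
h_neg = rng.normal(size=(20, d_model))
h_pos = h_neg + 3.0 * true_dir + 0.5 * rng.normal(size=(20, d_model))

acc = separation_accuracy(h_pos, h_neg, true_dir)
passes_gate = acc > 0.90                   # the >90% threshold from the notes
```

If the spec intends a thresholded classifier instead of a pairwise comparison, the threshold should be fitted on the extraction pairs, not on the held-out ones.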

2. Method internals (math)

v_hack extraction (per Wu-Tang style; user adopted this after reading their paper):

  • Take last-token hidden states at layers 60-75% of model depth (multi-layer averaged).
  • d = \frac{1}{N}\sum_i (h_i^+ - h_i^-) where +/- are hacky/clean prompts.
  • [ambiguous in transcript: whether to use single-layer (as Rebound) or sweep layers]
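
A minimal NumPy sketch of the mean-difference extraction above, assuming the last-token hidden states have already been collected into arrays of shape (n_pairs, n_layers, d_model) for the chosen layer band. `layer_band` and `extract_direction` are hypothetical helpers, the 28-layer depth is a placeholder, and the multi-layer average is only one of the options the transcript leaves open.

```python
import numpy as np

def layer_band(n_layers, lo=0.60, hi=0.75):
    """Layer indices between lo and hi fraction of model depth (inclusive)."""
    return list(range(int(lo * n_layers), int(hi * n_layers) + 1))

def extract_direction(h_pos, h_neg):
    """d = mean_i (h_i^+ - h_i^-), then averaged over the layer band.

    h_pos, h_neg: (n_pairs, n_layers, d_model) last-token hidden states.
    Returns a (d_model,) vector; normalization is left to a later step.
    """
    per_layer = (h_pos - h_neg).mean(axis=0)   # (n_layers, d_model)
    return per_layer.mean(axis=0)              # (d_model,)

# Synthetic check: if every hacky state is the clean state plus a fixed shift,
# the extracted direction recovers that shift.
rng = np.random.default_rng(0)
band = layer_band(28)                          # placeholder depth
h_neg = rng.normal(size=(60, len(band), 32))
shift = rng.normal(size=32)
h_pos = h_neg + shift
d = extract_direction(h_pos, h_neg)
```

For the single-layer variant, pass arrays with n_layers = 1; the function is otherwise unchanged.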

SVD denoising of v_hack:

  • Take W for the relevant projection layer (residual-stream-out, [ambiguous which layer]).
  • W = U S V^\top. Right singular vectors V.
  • Project: v_{hack}^{(S)} = V_{:,:m}^\top \cdot v_{hack}.
  • Reproject back: v_{hack}^{\text{denoised}} = V_{:,:m} \cdot v_{hack}^{(S)}.
  • Normalize: v_{hack} \leftarrow v_{hack}^{\text{denoised}} / \|v_{hack}^{\text{denoised}}\|.
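
The three steps above (project onto the top-m right singular vectors, reproject, normalize) can be sketched with NumPy as follows. This assumes v_hack lives in the input space of W, so that the right singular vectors V are the matching basis. Which side of the SVD lines up with the residual stream depends on the chosen W and on how it is stored, which the notes leave open. `denoise_direction` is an illustrative name.

```python
import numpy as np

def denoise_direction(W, v_hack, m=16):
    """Keep only the part of v_hack inside the top-m right singular subspace of W.

    W:      (d_out, d_in) weight matrix.
    v_hack: (d_in,) raw concept direction.
    """
    # Singular values come back sorted in descending order.
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    V_m = Vt[:m].T                      # (d_in, m), i.e. V[:, :m]
    coords = V_m.T @ v_hack             # v_hack^(S): coordinates in the SVD basis
    v_denoised = V_m @ coords           # back to the full space
    norm = np.linalg.norm(v_denoised)
    if norm == 0.0:
        raise ValueError("v_hack has no component in the top-m subspace")
    return v_denoised / norm

rng = np.random.default_rng(0)
W = rng.normal(size=(48, 32))           # placeholder weight matrix
v = rng.normal(size=32)                 # placeholder raw direction
v_dn = denoise_direction(W, v, m=8)
```

Because the operation is an orthogonal projection followed by normalization, applying it twice gives the same vector, and with m equal to the full rank it reduces to plain normalization.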

Gradient projection at training step:

  • cos_align = <g, v_hack> / ||g||
  • If cos_align > 0:
    • g' = g - cos_align * ||g|| * v_hack # remove component along v_hack
    • g' = g' * ||g|| / ||g'|| # restore magnitude (ablation removes this)
  • Else: g' = g.
  • Then take the optimizer step using g' in place of g.
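
A sketch of the one-sided projection above on plain NumPy vectors. It assumes g and v_hack already live in the same vector space and that g follows the sign convention in which positive alignment means "toward the hack direction"; how a parameter gradient is mapped into that space is not settled in these notes. `project_gradient` and `preserve_norm` are illustrative names, and `preserve_norm=False` corresponds to the ablation that drops the magnitude restoration.

```python
import numpy as np

def project_gradient(g, v_hack, preserve_norm=True, eps=1e-12):
    """Remove the component of g along v_hack, only when they are aligned.

    Returns (g_prime, cos_align) so that cos_align can be logged per step.
    """
    g_norm = np.linalg.norm(g)
    if g_norm < eps:
        return g.copy(), 0.0
    v = v_hack / np.linalg.norm(v_hack)        # ensure unit length
    cos_align = float(g @ v) / g_norm
    if cos_align <= 0.0:
        return g.copy(), cos_align             # negative alignment is left alone
    g_prime = g - cos_align * g_norm * v       # remove component along v_hack
    if preserve_norm:
        p_norm = np.linalg.norm(g_prime)
        if p_norm > eps:                       # g parallel to v leaves nothing to rescale
            g_prime = g_prime * (g_norm / p_norm)
    return g_prime, cos_align

v_hack = np.array([1.0, 0.0])
g_toward, cos_toward = project_gradient(np.array([3.0, 4.0]), v_hack)
g_away, cos_away = project_gradient(np.array([-3.0, 4.0]), v_hack)
```

In the first call the hack component is removed and the remainder is rescaled to the original norm of 5; in the second the gradient passes through unchanged. Returning cos_align alongside g' makes the per-step diagnostic available without a second dot product.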

KL / trust region: spec doesn't mandate explicit KL beyond GRPO's built-in. [ambiguous: user mentioned iso-KL trust region from AntiPaSTO line but chose NOT to add it on top of GRPO's KL because GRPO already has reference-model KL].

PCGrad: NOT used in this project. PCGrad was in the AntiPaSTO/bidirectional design; here gradient is single (no pole-pairing), so PCGrad doesn't apply.

3. Hyperparameters

| Param | Value | Source / justification |
|---|---|---|
| model | Qwen3.5-2B | substitute for Qwen3-4B per Nanda; compute |
| LoRA r | 32 | Nanda's published |
| LoRA alpha | 32 | Nanda's published |
| lr | 7e-5 | Nanda's published |
| num_generations | 8 | reduced from Nanda's 16 to fit single-GPU |
| batch | 128 | reduced from Nanda's 256 to fit single-GPU |
| training steps | 200 | Nanda's published |
| m (SVD truncation) | 16 (sweep 8/16/32) | user choice, brainstorm landed here |
| n contrastive pairs | 60-80 | Wu-Tang use 60 for extract + 20 validate |
| layer for hidden states | 60-75% depth, multi-layer mean | Wu-Tang practice |
| seeds | 3 | "where indicated" in spec; 1 seed for some ablations |
| logging cadence | every 25 steps | spec.md §7 |

4. Connections to user's prior work

  • AntiPaSTO: contrastive prefix pairs → steering direction in SVD-of-W basis. This project inherits the SVD-of-W basis idea but applies it at the gradient (not activation) level, and uses unpaired GRPO contrast instead of paired prefix contrast.
  • Bidirectional c-scaled steering LoRA (NLL + KL + PCGrad): explicitly not used here. That design is paired-preference native; GRPO is unpaired. The user recognized this midway and pivoted.
  • Iso-KL trust region: implicit via GRPO's reference KL. Not added on top.

User's framing (paraphrased from transcript): the contribution is intervention at the gradient level, where Rebound modifies advantages, combined with SVD-basis denoising of the concept direction (novel vs both Rebound and AntiPaSTO).

5. Decision points / open questions

  • H4 fallback: if Qwen3.5-2B hack rate <30% at step 200 → swap to Qwen3-4B with num_generations=4, batch=64.
  • verl fallback: if verl breaks on single 96GB → TRL GRPOTrainer + manual reward reimplementation. Higher engineering cost.
  • v_hack steering check fails → diagnose layer choice, pair quality, or SVD truncation before committing to training runs.
  • All methods tie vanilla → check cos_align logs to confirm projection is biting.
  • [ambiguous] Layer choice for SVD: which W to SVD? The brainstorm suggests the output projection at the layer where v_hack is strongest, but doesn't commit. → resolve during smoke test.
  • [ambiguous] When v_hack updates: extracted once from the base model, or refreshed during training? Transcript leans toward once-from-base (Wu-Tang showed cos sim ≥0.99 across checkpoints) but doesn't fully commit.
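
The H4 fallback rule can be written down as a small config switch. This is a sketch only: `RunConfig` and `apply_h4_fallback` are hypothetical names, and it assumes the hack rate being checked is the one from the vanilla (unprojected) run at the final step.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RunConfig:
    model: str = "Qwen3.5-2B"
    num_generations: int = 8
    batch: int = 128
    steps: int = 200

def apply_h4_fallback(cfg, hack_rate_at_final_step, threshold=0.30):
    """Swap to the larger model with a smaller rollout budget if hacks did not emerge."""
    if hack_rate_at_final_step < threshold:
        return replace(cfg, model="Qwen3-4B", num_generations=4, batch=64)
    return cfg

base = RunConfig()
fallback = apply_h4_fallback(base, hack_rate_at_final_step=0.12)
unchanged = apply_h4_fallback(base, hack_rate_at_final_step=0.45)
```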

6. Things explicitly rejected

  • Full bidirectional NLL+KL+PCGrad on GRPO: structural mismatch (paired vs unpaired).
  • Hotel toy benchmark / DPO+sycophancy: weaker published baselines, less impressive story.
  • Unsloth matmul setup: out of scope; verl is the published baseline.
  • Direct Rebound replication as main result: would be derivative. Rebound is a baseline arm, not the headline.
  • Monitor-based mitigations: Nanda's territory, cite their numbers, don't re-run.
  • Triplet loss in activation space: explored as a frame, rejected because (a) needs similarity oracle, (b) word2vec algebra needs many concepts not 1, (c) overlaps with CCS/RepEng.
  • Adding new losses to balance: user is explicit that adding losses creates 4 problems (the loss, balancing, meta-balancing, debugging). Gradient projection is a constraint, not a competing objective.

7. Eval / measurement specifics

  • Hack rate: % of rollouts that successfully overwrite tests (Nanda's metric, from their codebase). Use Nanda's exact definition: the model's code passes its own run_tests but not the ground-truth test set.
  • Pass rate: % rollouts that pass tests legitimately on held-out problems without write access to evaluator. Held-out set = Nanda's eval set (113 problems per LW appendix; 10 samples each).
  • Eval uses randomly sampled function names (not run_tests) to avoid memorization, per Nanda's eval setup.
  • cos_align(g, v_hack) per step: diagnostic for whether projection is biting.
  • KL drift from base per step: diagnostic for catastrophic policy change.
  • Headline plot: scatter of pass rate (x) against hack rate (y), one point per (arm × seed), with the Pareto frontier drawn. Method should sit below and to the right of vanilla (lower hack rate, equal or higher pass rate). Annotate Rebound's position.
  • Statistical test: paired t-test of avg hack rate / accuracy across 3 seeds × 10 samples = 30 scores (matching Nanda's protocol).
  • Falsification check: pre-registered analysis on H1-H4 before publishing. Report all hypotheses including falsified.
  • Headline claim format: "at matched pass rate ±5pp on held-out problems without write access, our method reduces hack rate from X% to Y%."
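
The metric definitions and the headline format above can be sketched in plain Python. All function names are illustrative, the inputs are assumed to be per-rollout booleans already produced by the evaluator, and the pairing scheme for the t statistic (same seed and sample index across arms) is an assumption. The numbers in the example are made up.

```python
import math
import statistics

def hack_rate(rollouts):
    """rollouts: list of (passes_own_tests, passes_ground_truth) booleans.
    A hack passes the model's own run_tests but not the ground truth."""
    hacks = sum(1 for own, truth in rollouts if own and not truth)
    return hacks / len(rollouts)

def pass_rate(passed):
    """passed: list of booleans from held-out problems without write access."""
    return sum(passed) / len(passed)

def paired_t_statistic(a, b):
    """t statistic for paired scores a[i], b[i]; degrees of freedom = len(a) - 1."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

def headline_claim(pass_vanilla, pass_method, hack_vanilla, hack_method, tol_pp=5.0):
    """Return the headline sentence, or None if pass rates are not matched."""
    if abs(pass_vanilla - pass_method) * 100 > tol_pp:
        return None
    return (f"at matched pass rate ±{tol_pp:.0f}pp on held-out problems without "
            f"write access, our method reduces hack rate from "
            f"{hack_vanilla:.0%} to {hack_method:.0%}")

# Made-up illustration values, not results.
example_hack = hack_rate([(True, False), (True, True), (False, False), (True, False)])
claim = headline_claim(0.40, 0.43, 0.60, 0.10)
rejected = headline_claim(0.40, 0.30, 0.60, 0.10)
```

Returning None when the pass rates differ by more than the tolerance keeps the headline from being stated for an arm that bought its lower hack rate with lost capability.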

8. Compute budget specifics

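  • 96GB RTX 6000 Ada (rented). Single GPU. [ambiguous: vast.ai vs runpod, transcript mentions both]. [ambiguous: the RTX 6000 Ada ships with 48GB; a 96GB card would be the RTX PRO 6000 Blackwell. Confirm which card is meant.]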
  • Per-run: ~2-3 hours. Nanda reports ~3 hours on 4xH200; with the reduced batch, the user's 1x96GB run is expected to land at roughly the same wall-clock.
  • 13-15 runs × ~3h ≈ 39-45 hours, budgeted as 40-50 hours of compute.
  • Cost: ~$3 AUD/hr → ~$120-150 AUD compute; budget ~$200-250 with iteration buffer.
  • Calendar: ~1 week back-to-back, 2-3 weeks with iteration.
  • [ambiguous] No explicit decision on shared GPU queueing; pueue setup recommended.
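
The budget arithmetic above can be checked in a few lines. The sketch takes the upper per-run estimate of 3 hours and the ~$3 AUD/hr rate from the notes; `budget` is an illustrative helper.

```python
def budget(runs, hours_per_run, aud_per_hour):
    """Return (total GPU hours, total cost in AUD) for a batch of runs."""
    hours = runs * hours_per_run
    return hours, hours * aud_per_hour

low_hours, low_cost = budget(13, 3.0, 3.0)
high_hours, high_cost = budget(15, 3.0, 3.0)
print(f"{low_hours:.0f}-{high_hours:.0f} h, ${low_cost:.0f}-{high_cost:.0f} AUD")
```

The raw product comes to 39-45 hours and $117-135 AUD, so the 40-50 hour and ~$120-150 figures quoted above are rounded up slightly, with the $200-250 budget covering iteration on top.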