mirror of https://github.com/wassname/grpo_proj2.git synced 2026-06-27 15:15:44 +08:00

wassname b0d1bcd3d5 Rebuild src/ from pseudocode: SVD-basis gradient projection vs GRPO reward hacking

Expand docs/pseudocode/01..07 into a slim, fail-fast src/projected_grpo/ that
passes `just smoke`. Code mirrors the pseudocode (δS/Σ/V names, relu-before-agg
cin/cout, Dr.GRPO unbiased loss). Did not read the original src.

7 modules (~880 LOC):
- rewards.py    grader + 4 loophole modes + hack x mode diagonal self-check (R1)
- problems.py   tiny LeetCode substrate + contrastive pairs (R5)
- antipasto.py  SVD adapter, identity at δS=0 (R2)
- proj.py       erase/route/measure_only projection (R3)
- extract_vhack_grad.py  per-module SVD of paired grad diffs, noise floor (R5)
- train.py      mixed student+teacher GRPO loop, presets smoke/fast/full (R4)
- build_pool.py self-contained frozen teacher-pool fixture

`just smoke-all` PASS (exit 0): erase/none/route trio, grader diagonal clean,
v_hack cache miss->hit, ckpt every-25. Fresh-eyes review: 6/6 mechanics faithful.

Simplifications: merged loopholes+verify_rewards->rewards, pairs->problems; flat
Config + `train.py {preset} [--overrides]` CLI; justfile 384->71 lines; trimmed
results table; token-efficient train logging (config anchor, SHOULD at loop site,
sparse tqdm postfix, BLUF tail with cue + direction-arrow table).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-05-31 14:06:42 +00:00

projected_grpo as pseudocode

The whole method, section by section, as pseudopy (Python + unicode math, read-not-run). Goal: compress the ~6.8k-line src/ to its load-bearing logic so the idea is auditable on one screen per section, then expand back into clean code. The repo will be rebuilt FROM these files, so they carry enough engineering detail (shapes, gotchas, denominators) to replicate.

Read top-to-bottom in this order:

| File | Section | Source module(s) |
|---|---|---|
| 01_adapter.py | SVD adapter: train a per-module knob `δS` in singular-value basis | `antipasto.py` |
| 02_extract_vhack.py | Extract the hack direction `V` from labeled pairs | `extract_vhack_grad.py`, `pairs.py` |
| 03_project.py | Erase / route the hack component from the live gradient | `proj.py` |
| 04_rewards.py | The multi-loophole env + reward grader | `rewards.py`, `loopholes` |
| 05_grpo_loss.py | GRPO / Dr.GRPO unbiased loss, mixed student+teacher pool | `train.py` inner step |
| 06_train_loop.py | The outer loop: generate -> grade -> backward -> project -> step | `train.py` `main` |
| 07_experiment.py | The experiment: 4 arms, the weak-detector no-cheat test, H1 | `spec.md`, `README.md` |

The idea in three lines

V ← svd(stack[∇logp(hack) − ∇logp(clean) over labeled pairs])  # hack direction in δS-space
during GRPO:  g ← δS.grad;  g ← g − relu(g·Vᵀ)V                 # erase hack-ward component
hope:  ablating V from the *labeled-pair* gradient also ablates it from the *live unlabeled* GRPO gradient

This is a hope, not a theorem. The diagnostic we watch is cin_t > cin_s (V lights up more on teacher/hack rollouts than on student rollouts). Caveat from external review: that is necessary, not sufficient (it can track "teacher-ness" rather than "hack-ness"), and cout ≈ 0 is an arithmetic identity of the projection, not evidence that it worked. Only the behavioral hack rate at a matched pass rate, compared against negative controls, settles it.
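Why cout ≈ 0 is an identity rather than evidence can be checked numerically. The sketch below is a numpy stand-in for the erase step, not the repo's `proj.py`: the function names `erase` and `hack_fraction`, the random orthonormal `V`, and the choice of ‖g_out‖ as the cout denominator are all assumptions made for illustration.

```python
import numpy as np

def erase(g, V):
    """g <- g - relu(g·Vᵀ)V, with V of shape (k, r) and orthonormal rows."""
    c = V @ g                      # (k,) per-direction hack coefficients
    c_pos = np.maximum(c, 0.0)     # relu: only hack-ward axes are removed
    return g - c_pos @ V           # (r,)

def hack_fraction(g, V):
    """cin/cout-style ratio ‖relu(V g)‖ / ‖g‖ (relu before aggregation)."""
    return np.linalg.norm(np.maximum(V @ g, 0.0)) / np.linalg.norm(g)

rng = np.random.default_rng(0)
r, k = 16, 3
# Hypothetical v_hack: orthonormal rows from the QR of a random matrix.
Q, _ = np.linalg.qr(rng.normal(size=(r, k)))
V = Q.T                            # (k, r)

g = rng.normal(size=r)             # stand-in for δS.grad
g_out = erase(g, V)

cin = hack_fraction(g, V)
cout = hack_fraction(g_out, V)     # zero up to float error, for ANY V
print(f"cin={cin:.3f}  cout={cout:.2e}")
```

Because the rows of V are orthonormal, `V @ g_out` equals `min(V @ g, 0)` element-wise, so the relu of it vanishes no matter which V was used. A random V produces the same cout ≈ 0, which is why the behavioral hack rate has to carry the claim.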

What makes it novel / where it sits

- Intervenes at the gradient level, not the advantage level (cf. Rebound / "Advantage Modification", Wu & Tang 2026).
- Uses unpaired GRPO rollouts at train time, vs AntiPaSTO's paired-preference contrast (the author's prior work, github.com/wassname/AntiPaSTO).
- The load-bearing no-cheat constraint: the detector is allowed to be weak (sees hack A, misses hack B). We extract V from A only, then test whether routing on A also suppresses the held-out B. That mimics deployment, where unknown hacks exist. A detector that sees every hack proves nothing.
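The extraction step from the three-line summary, V ← svd(stack[∇logp(hack) − ∇logp(clean)]), can be sketched on synthetic data. This is a minimal numpy illustration, not `extract_vhack_grad.py`: the gradients are random arrays with one planted direction, the noise-floor filter is omitted, and orienting each row by the sign of its mean projection is an assumed convention.

```python
import numpy as np

def extract_v_hack(grad_hack, grad_clean, k):
    """grad_*: (n_pairs, r) gradients of logp w.r.t. δS. Returns V (k, r) and S[:k]."""
    D = grad_hack - grad_clean                       # paired diffs, (n_pairs, r)
    _, S, Vh = np.linalg.svd(D, full_matrices=False)
    V = Vh[:k]                                       # top-k right singular vectors
    # ASSUMPTION: orient rows so the mean projection of the diffs is positive.
    proj_mean = (D @ V.T).mean(axis=0)
    sign = np.where(proj_mean < 0, -1.0, 1.0)
    return V * sign[:, None], S[:k]

rng = np.random.default_rng(0)
n_pairs, r = 32, 16
u = rng.normal(size=r)
u /= np.linalg.norm(u)                               # planted "hack" direction
a = rng.uniform(1.0, 2.0, size=n_pairs)              # per-pair strength, all positive
grad_clean = rng.normal(size=(n_pairs, r))
grad_hack = grad_clean + a[:, None] * u + 0.05 * rng.normal(size=(n_pairs, r))

V, S = extract_v_hack(grad_hack, grad_clean, k=2)
print("cos(V[0], planted u) =", float(V[0] @ u))
print("top singular values  =", S)
```

For the weak-detector test, `grad_hack` would come only from pairs labeled with hack A. Whether the resulting V also covers hack B is the open question, and this sketch does not answer it.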

External review (2026-05-31)

Reviewed across two non-Anthropic families via pi+OpenRouter (DeepSeek-v4-pro, GPT-5.4); raw reviews + a SYNTHESIS in docs/reviews/20260531_pseudocode/. They converged hard (cross-family agreement = signal). Top fixes, now inlined as REVIEW: caveats in 02/03/05/07:

1. Teacher-pool imitation (deepest leak): for cached teacher rollouts the loss is off-policy imitation, not GRPO (ratio≡1 is forced). erase may just block imitation, not hacking. Test: does erase still work with NO teacher pool? (05/07)
2. No negative controls: add random-V, shuffled-label-V, non-hack-V arms. If real V doesn't beat them, the effect is regularization, not hack removal. (07)
3. Measure cin_s (not just cin_t>cin_s) and correlate with hack rate; cout≈0 is a tautology. (07)
4. AdamW preconditioner bypass: projecting g ⊥ V doesn't make the update ⊥ V (Adam's 1/√v rotates it off V). Log cout on Δδ, not g; project the update or purge Adam state in span(V). (03/07)
5. Smaller: route ⊇ erase is false across training; ablate preserve_magnitude; extraction vs live loss normalize differently; weak-detector → leave-one-mode-out with per-mode matched pass.
6. Priority: erase-vs-vanilla already exists (blog Table 1, n=2); what's missing is n=3, the negative controls, and the 60-80-pair spec design, not the comparison.

Cross-checked against ../blog/20260529_..._LW_draft.md, which already documents the Adam caveat, the cosine null baseline, the teacher-pool confound, and the n=2 result — the review mostly re-derived the blog's own limitations. Genuinely additive: the imitation-not-RL framing (1) and the preconditioner-vs-momentum distinction (4).
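The preconditioner point can be reproduced with a toy example. The sketch below hand-rolls Adam's moment update in numpy (it is not `torch.optim.AdamW`, and weight decay is left out) with a hypothetical single hack direction V. The gradient is exactly orthogonal to V, yet the first Adam step is not.

```python
import numpy as np

def adam_direction(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam moment update; returns the unscaled step direction and new state."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                 # bias-corrected second moment
    return m_hat / (np.sqrt(v_hat) + eps), m, v

# Hypothetical hack direction, deliberately not axis-aligned.
V = np.array([[1.0, 1.0, 1.0, 0.0]]) / np.sqrt(3.0)   # (k=1, r=4)
g = np.array([-3.0, 1.0, 2.0, 0.0])                    # already projected: V @ g == 0

step, m, v = adam_direction(g, np.zeros(4), np.zeros(4), t=1)
leak = float((V @ step)[0])                            # hack-ward part of the update

# One mitigation named in the review: project the update itself.
step_proj = step - np.maximum(V @ step, 0.0) @ V
leak_after = float((V @ step_proj)[0])

print(f"V·g = {float((V @ g)[0]):.1e}, V·step = {leak:.3f}, after projection = {leak_after:.1e}")
```

On the first step Adam's direction is approximately sign(g), so the element-wise rescaling alone moves the update off the V-orthogonal plane. This is the reason to log cout on Δδ rather than on g.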

Notation

b s            batch, sequence (token) dims
r              per-module SVD rank = min(d_in, d_out); the δS dimension (~500..2560)
k              # hack directions kept per module after top-k + noise-floor
W = U Σ Vh     SVD of a Linear's weight, W ∈ ℝ^{d_out × d_in}
δS ∈ ℝ^r       trainable knob in singular-value space (the ONLY trained param, per module)
V ∈ ℝ^{k×r}    v_hack: rows orthonormal in ℝ^r, oriented hack-ward
g = δS.grad    live GRPO gradient on the knob
c = V @ g      per-direction hack coefficients (c_i > 0 ⇒ grad pushes hack-ward on axis i)
cin/cout       ‖relu(c)‖/‖g‖ before/after projection — RELU-BEFORE-AGG (only hack-ward
               axes count; = ‖removed‖/‖g‖, the fraction stripped). NOT a signed sum.
A              GRPO advantage;  R reward;  π policy;  π_ref reference (δS=0)
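To make the shapes in the notation concrete, here is a small numpy sketch of an SVD-basis knob. How δS enters the weight is an assumption here (an additive shift of the singular values, W' = U (Σ + δS) Vh); the real parameterization lives in `antipasto.py` and may differ. Note that `Vh` below is the right SVD factor of W, unrelated to the hack basis V.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 6, 4
W = rng.normal(size=(d_out, d_in))               # stand-in for a Linear weight

# Thin SVD: U (d_out, r), S (r,), Vh (r, d_in), with r = min(d_in, d_out).
U, S, Vh = np.linalg.svd(W, full_matrices=False)
r = S.shape[0]

def adapted_weight(delta_S):
    """ASSUMPTION: δS shifts the singular values additively."""
    return (U * (S + delta_S)) @ Vh              # scales column i of U by S[i] + δS[i]

W0 = adapted_weight(np.zeros(r))                 # δS = 0 should reproduce W

delta = np.zeros(r)
delta[0] = 0.5                                   # nudge only the top singular value
W1 = adapted_weight(delta)
rank_one_change = 0.5 * np.outer(U[:, 0], Vh[0])

print("r =", r, "| max |W0 - W| =", float(np.abs(W0 - W).max()))
```

Under this assumed form each component of δS moves the weight along one fixed rank-one direction, so g = δS.grad and the rows of V both live in the same r-dimensional space.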

Citations & links (preserved from README/spec/AGENTS/blog)

- GRPO inner step ported from lsdefine/simple_GRPO, grpo_vllm_one.py::GRPO_step (lines 64-95): https://github.com/lsdefine/simple_GRPO
- Dr.GRPO unbiased loss (drop length-norm 1/|o| and group-std /σ_R): Liu et al. 2025, "Understanding R1-Zero-Like Training", arXiv:2503.20783 https://arxiv.org/abs/2503.20783
- Gradient Routing (park capability in a deletable subspace): Cloud et al. 2024, arXiv:2410.04332 https://arxiv.org/abs/2410.04332
  - Route v2 (distinct-basis quarantine, supersedes the additive route): docs/spec/20260531_routing_v2_distinct_basis.md
- Advantage-level baseline ("Rebound" / Advantage Modification): Wu & Tang 2026.
- Benchmark substrate (LeetCode reward-hacking env, hints, grader): Ariahw, Engels & Nanda, rl-rewardhacking https://github.com/ariahw/rl-rewardhacking
  - LessWrong writeup: docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md
- AntiPaSTO (SVD-basis adapter + contrastive-pair extraction, prior work): https://github.com/wassname/AntiPaSTO
  - concepts vendored at docs/vendor/AntiPaSTO_concepts/
- Preregistered plan: docs/spec.md
- Design rationale: docs/brainstorm/extracted_prefs.md
- Preliminary result writeup (n=2): docs/blog/20260529_..._LW_draft.md
- The four loophole modes (full traces): same blog, appendix.
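The Dr.GRPO entry above names two dropped terms: the length-norm 1/|o| and the group-std /σ_R. A toy numpy comparison of the per-token gradient weight each variant assigns to a response may help. It is a sketch of that one difference only (no clipping, no ratio, no teacher pool), and the constant normalizer `max_len` is an assumption, not a value taken from `train.py`.

```python
import numpy as np

def token_weights(R, lengths, dr_grpo, max_len):
    """Per-token weight on ∇logp for each response in one group."""
    A = R - R.mean()                       # group-centered advantage
    if dr_grpo:
        return A / max_len                 # constant normalizer, no /σ_R, no 1/|o|
    A = A / (R.std() + 1e-6)               # GRPO: divide by group std
    return A / lengths                     # GRPO: divide by each response's length

R = np.array([1.0, 0.0, 0.0, 1.0])         # rewards for a group of 4 responses
lengths = np.array([2.0, 4.0, 1.0, 3.0])   # token counts |o_i|

w_grpo = token_weights(R, lengths, dr_grpo=False, max_len=4)
w_dr = token_weights(R, lengths, dr_grpo=True, max_len=4)

print("GRPO    per-token weights:", np.round(w_grpo, 3))
print("Dr.GRPO per-token weights:", np.round(w_dr, 3))
```

Under GRPO the two failed responses get different per-token penalties purely because of their lengths (the 1-token failure is penalized 4x harder per token than the 4-token one). Under the Dr.GRPO variant both get the same weight.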
