mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 18:04:59 +08:00
87a2b48784
train.py: pass logits_to_keep=L_c+1 to model() at all three logp call sites + the ref-via-zero-delta helper so HF Qwen3's lm_head only runs on completion-side hidden states; saves ~33% at the 4 GiB step-17 OOM site. full preset G=8 -> G=6 for a further ~25% B reduction at every act site. Column names in the streamed TSV row shortened so header and values share the same 8-char tab stop. spec.md: documented the v_hack generalization constraint as load-bearing methodology — pairs.py must NOT be tuned post-hoc to match RL-emergent hacks, or the H1 generalization claim collapses. handover.md: rewritten for current state (G=6, post-grader-fix, Qwen3-4B). Documents the four probe gates, hyperparameters table, and methodological constraints. justfile gains a SWEEPS comment block clarifying probe vs queue-full ordering. .gitignore picks up .venv, *.log, /tmp/, cache dirs. RESEARCH_JOURNAL.md: 2026-05-24 (b) entry covers the OOM diagnosis, fix, pooled cross-run trend analysis (LR is fine, signal underpowered at n=17 but directionally consistent), and the generalization correction.
585 lines
31 KiB
Markdown
# Experiment: rank-space gradient projection vs RL reward hacking

## Context

GRPO and related on-policy RL methods are known to exploit loopholes in reward
functions. Ariahw, Engels & Nanda (2025, [github.com/ariahw/rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking))
open-sourced a benchmark on LeetCode where Qwen3-4B learns to overwrite the
evaluation function `run_tests()` instead of solving problems, reaching 79%
reward hack rate at 200 training steps. Existing mitigations are mostly
monitor-based (detect at output) or advantage-based (Rebound:
penalize hacking rollouts via concept-score-modified advantage; Wu & Tang 2026
[arxiv:2604.01476](https://arxiv.org/abs/2604.01476)).

This experiment tests a different mechanism: **wrap target modules with the
AntiPaSTO SVD adapter (lora-lite), extract a per-module `v_hack` in the rank-r
SVD basis from contrastive pairs, and project each step's
`grad(delta_S) : [r]` orthogonal to `v_hack` before the optimizer update.**
Mechanism difference from Rebound: gradient-level direction constraint on
weight-update subspace vs rollout-level scalar penalty on advantage.

This is preregistered: results to be reported regardless of outcome.

## Why AntiPaSTO and not vanilla LoRA

Vanilla LoRA's rank axis is meaningless (random init, drifts after step 1), so
"project out v_hack in rank space" has no fixed reference frame. AntiPaSTO
(Wassname, lora-lite [variants/antipasto.py](https://github.com/wassname/lora-lite/blob/main/src/lora_lite/variants/antipasto.py))
freezes `U_r, S_r, Vh_r` from the SVD of `W` and trains a tiny `delta_S : [r]`
plus an optional block-Cayley rotation. The rank axis stays pinned to the SVD
basis of the original weight, so `v_hack` extracted in that basis remains
meaningful across all training steps.

Forward pass per wrapped module (first pass uses full rank $r = \min(d_{in}, d_{out})$,
so the residual term $W_{res}$ vanishes):

$$y = ((x V_h^T) \odot (S + \delta_S)) U^T$$

where $U$, $S$, $V_h$ come from the SVD of $W$ and are buffers (frozen).
Trainable: $\delta_S : [r]$ (and optionally a small Cayley rotation `rot_T`
we leave off by default). At reduced rank we would add
$x W_{res}^T$ with $W_{res} = W - U_r \mathrm{diag}(S_r) V_{h,r}$, but we
defer rank cropping to v2 to skip the "where to cut" question.

Per-step gradient signal (here $h_t$ denotes the module output $y$ at token $t$):

$$\frac{\partial L}{\partial \delta_S} = \sum_t (x_t V_h^T) \odot \left(\frac{\partial L}{\partial h_t} U\right) \in \mathbb{R}^r$$

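
The gradient expression follows from writing the forward pass one rank component at a time. A short derivation, using the symbols already defined plus two shorthands introduced here for convenience ($z_t = x_t V_h^T$ and $u_k$ for the $k$-th column of $U$); the bias is omitted because it does not depend on $\delta_S$:

$$
\begin{aligned}
z_t &= x_t V_h^T \in \mathbb{R}^r, \qquad
h_t = \sum_{k=1}^{r} z_{t,k}\,(S_k + \delta_{S,k})\, u_k^T, \\
\frac{\partial L}{\partial \delta_{S,k}}
  &= \sum_t \frac{\partial L}{\partial h_t}\,\frac{\partial h_t}{\partial \delta_{S,k}}
   = \sum_t z_{t,k}\left(\frac{\partial L}{\partial h_t}\, u_k\right)
   = \sum_t \left[(x_t V_h^T) \odot \left(\frac{\partial L}{\partial h_t}\, U\right)\right]_k .
\end{aligned}
$$

Each rank component's gradient is therefore the product of an input-side factor and an output-error factor, summed over tokens.
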
We extract `v_hack` **gradient-side** (locked in): for each contrastive pair,
run one NLL backward on the completion tokens and read each module's
`m.delta_S.grad : [r]`. Then
$\hat v_{hack}^{(m)} = \mathrm{unit}\left(\bar g_{hack}^{(m)} - \bar g_{clean}^{(m)}\right)$,
where $\bar g_{hack}^{(m)}$ and $\bar g_{clean}^{(m)}$ are module $m$'s mean
gradients over the hack and clean completions and $\mathrm{unit}(v) = v / \|v\|$.
This lives in the exact same `[r]` rank space the per-step training gradient
lives in (the gradient is the natural object to compare gradients against),
and it fuses the input-activation and output-error contributions in one shot
instead of guessing whether input-side $(x V_h^T)$ or output-side $(\partial L/\partial h)\, U$
better predicts where SGD will move. We did consider an activation-side
direction ($x V_h^T$ mean-diff) but dropped it as primary because it only sees
the input factor and ignores the output-error factor, while the per-step
gradient sees both.

**Generalization constraint (load-bearing methodology).** The pairs used
for `v_hack` extraction must come from a distribution *distinct from*
whatever the RL-trained model produces in deployment. This is the threat
model: at deployment we don't know which hacks will emerge, so the
defense's generalization claim depends on `v_hack` being authored
*before* seeing rollouts. If we post-hoc tune `pairs.py` to match the
specific hack patterns that emerge during vanilla training, we've fit
our defense to a known attack: that's teaching to the test, not
testing the hypothesis. The current `pairs.py` is deliberately
broader than any single threat model (4 hack flavors, not just
ariahw's overwrite-tests loophole) so that suppression of a *specific*
emergent pattern is evidence the subspace generalizes. If projection
fails to suppress emergent hacks, the right response is to interpret
the negative result, not to widen `pairs.py` to retroactively
include the failed pattern.

Projection (locked: no magnitude threshold; one-sided clip stays, see the note after the formula):

$$g \leftarrow g - \max(0,\, \cos_{align}) \cdot \|g\| \cdot \hat v_{hack}, \qquad \cos_{align} = \frac{g \cdot \hat v_{hack}}{\|g\|}$$

then rescale to original $\|g\|$ (magnitude-preserving). The $\max(0,\cdot)$ is
not gating, it's directional correctness: without it, when $\cos_{align}<0$ we'd be
*adding* to the hack component. No magnitude/threshold gating (locked): we
project every step, every module. Capacity cost is ~1/r per module per step.
If `v_hack` at a module is just noise, projection ablates a noise direction in
expectation, which is approximately a no-op.

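
To make the one-sided, magnitude-preserving rule concrete, here is a minimal pure-Python sketch on plain lists. The function name `project_one_sided` and the toy vectors are illustrative assumptions, not part of the codebase; the real hook operates on `delta_S.grad` tensors.

```python
import math


def project_one_sided(g, v_hat, eps=1e-8):
    """Remove the positive component of g along unit vector v_hat, then rescale to ||g||.

    Returns (projected_gradient, cos_align). Illustrative helper, not the project's API.
    """
    g_norm = math.sqrt(sum(x * x for x in g))
    if g_norm < 1e-12:                      # nothing to project
        return list(g), 0.0
    cos_align = sum(a * b for a, b in zip(g, v_hat)) / g_norm
    if cos_align <= 0:                      # one-sided clip: leave anti-aligned gradients alone
        return list(g), cos_align
    g_new = [a - cos_align * g_norm * b for a, b in zip(g, v_hat)]
    scale = g_norm / (math.sqrt(sum(x * x for x in g_new)) + eps)
    return [x * scale for x in g_new], cos_align


v_hat = [1.0, 0.0, 0.0]                                       # toy unit "hack" direction
aligned, cos_a = project_one_sided([3.0, 4.0, 0.0], v_hat)    # cos_align = 0.6, clip fires
opposed, cos_b = project_one_sided([-3.0, 4.0, 0.0], v_hat)   # cos_align = -0.6, untouched
```

In the aligned case the component along `v_hat` is removed and the remainder is scaled back up to norm 5; in the opposed case the gradient passes through unchanged. One edge case worth knowing: if `g` is exactly parallel to `v_hat`, the remainder is zero and the rescale cannot restore the norm.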
## Why not vanilla GRPO via verl

verl is Ariahw's framework but uses Ray + FSDP2 + Hydra; inserting a
pre-optimizer-step hook on per-module rank-space gradients requires deep
subclassing of their worker abstraction. We use
[lsdefine/simple_GRPO](https://github.com/lsdefine/simple_GRPO) instead and
accept one cost in exchange (described in the next paragraph).
simple_GRPO is a two-file GRPO implementation (`ref_server.py` + `grpo_ref_split.py`,
~315 lines total) with reported convergence on Qwen2.5-7B. The training loop
is literally `loss = GRPO_step(batch); engine.backward(loss); engine.step()`, so
inserting a projection hook between backward and step is a one-line edit.

Cost of this deviation: we re-establish the "vanilla hack emergence" baseline
on simple_GRPO rather than inheriting it from Ariahw's verl baseline. H4 is
the sanity check that this happens. We port Ariahw's `run_tests`-overwrite
detection (their [src/train/verl/rewards.py](https://github.com/ariahw/rl-rewardhacking/blob/main/src/train/verl/rewards.py))
into simple_GRPO's reward server (`docs/vendor/simple_GRPO/ref_server.py`).

Vendored references (read-only, see [docs/vendor/](docs/vendor/)):
- [simple_GRPO](https://github.com/lsdefine/simple_GRPO): GRPO trainer
- [lora-lite](https://github.com/wassname/lora-lite): AntiPaSTO adapter
- [rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking) (already at `external/`)

## Hypotheses (preregistered)

**H1 (mechanism, primary):** Rank-space gradient projection against `v_hack`
extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30
percentage points (absolute) relative to vanilla AntiPaSTO+GRPO, at matched
LeetCode pass rate within 10pp of vanilla.

Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at
matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.

**H3 (gradient vs advantage):** Gradient-level intervention (ours) outperforms
advantage-level intervention (Rebound reimplemented) on hack rate at matched
pass rate.

Falsified if: Rebound reimplementation matches or beats ours within 1 SEM.

**H4 (scaling sanity on our stack):** Qwen3.5-2B trained with vanilla
AntiPaSTO+GRPO on simple_GRPO reproduces measurable reward hacking (>30% hack
rate at 200 steps).

Falsified if: vanilla hack rate <30%. Decision branch: swap to Qwen3-4B with
num_generations halved. Secondary: if simple_GRPO can't reproduce hacking on
either model, fall back to Ariahw's verl path and accept the harder hook.

## Steps

### 1. Build infra: fast-dev-run targets first, no real training yet

- **1a.** Vendor simple_GRPO into `docs/vendor/simple_GRPO/` (done); smoke-run
  their GSM8K example on tiny-random-qwen3 (5 steps, CPU) to confirm
  `ref_server` + `grpo_ref_split` rollout/train split works in our env.
- **1b.** Vendor lora-lite into `docs/vendor/lora-lite/` (done); wrap
  Qwen3.5-0.8B attn+MLP `nn.Linear` modules with AntiPaSTO **at full rank**
  (`r = min(d_in, d_out)`, no SVD cropping; `rotate_basis="none"`, only
  `delta_S` trainable). Full rank means $W = U \,\mathrm{diag}(S)\, V_h$
  exactly and `W_res = 0`, so there's no truncation error to debug on the
  first pass. Verify forward-pass round-trip numerically matches base model
  at $\delta_S = 0$ (max abs diff <1e-3 on a fixed prompt).
- **1c.** Implement gradient-side `v_hack` extraction (see the Pseudocode section).
  Validation: per-module held-out projection score
  `cos(g_held_hack - g_held_clean, v_hack)` > 0 in >50% of modules.

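
The round-trip check in 1b can be sketched on a single toy weight matrix. This is a NumPy illustration of the identity only (the real check runs the wrapped Qwen model on a fixed prompt); the shapes, seed, and the name `adapter_forward` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 6, 4                      # toy sizes, not the model's
W = rng.standard_normal((d_out, d_in))  # stands in for an nn.Linear weight
b = rng.standard_normal(d_out)
x = rng.standard_normal((3, d_in))      # three toy "tokens"

# Full-rank thin SVD: U [d_out, r], S [r], Vh [r, d_in] with r = min(d_out, d_in)
U, S, Vh = np.linalg.svd(W, full_matrices=False)
delta_S = np.zeros_like(S)              # adapter at initialization


def adapter_forward(x, delta_S):
    """y = ((x Vh^T) * (S + delta_S)) U^T + b, as in the forward-pass formula."""
    return ((x @ Vh.T) * (S + delta_S)) @ U.T + b


base_out = x @ W.T + b
max_abs_diff = float(np.abs(adapter_forward(x, delta_S) - base_out).max())
```

At `delta_S = 0` the difference is at floating-point round-off level in float64, far below the spec's `1e-3` tolerance; the looser threshold leaves room for lower-precision dtypes in the real model.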
### 2. H4 sanity: does vanilla AntiPaSTO+GRPO+simple_GRPO produce hacking?

- **2a.** Port Ariahw's `run_tests`-overwrite detection into simple_GRPO's
  `ref_server.py` reward fn. Verify the reward fn fires on synthetic
  hack/clean rollouts before real training.
- **2b.** Train Qwen3.5-2B, AntiPaSTO (`r=full`, `delta_S` only), GRPO
  (group_norm), 200 steps, num_generations=8, batch=16, 1 seed.
  Decision: if hack rate <30%, switch to Qwen3-4B with `num_generations=4,
  batch=16` (half num_gen to keep VRAM headroom) and re-run 2b.
  Secondary fallback: drop simple_GRPO, return to verl.

### 3. Implement rank-space projection in simple_GRPO's training loop

- **3a.** In `grpo_ref_split.py`, between `engine.backward(loss)` and
  `engine.step()`, call `project_grads(model, v_hack_cache)`.
  `project_grads` walks `[m for m in model.modules() if hasattr(m, 'delta_S')]`
  and for each module reads `m.delta_S.grad : [r]`, projects against
  `v_hack[module_name]` (one-sided, magnitude-preserving), writes back
  in place. (See section C of the Pseudocode.)
- **3b.** Diagnostics logged per step (aggregated over modules):
  mean/std `cos_align`, mean `||grad||`, `frac_modules_with_cos>0`.

### 4. Run arms (200 steps each, 3 seeds where indicated)

- **a.** Vanilla AntiPaSTO + GRPO (3 seeds): baseline
- **b.** Our method, gradient-side `v_hack`, no gating (3 seeds): main result
- **c.** Our method, no magnitude preservation (1 seed): design ablation
- **d.** Rebound reimplementation: advantage-side `v_hack` penalty (3 seeds), for H3
  (concrete formula: per-rollout penalty `α · max(0, cos(h_mean, v_concept))`
  added to scalar reward, where `h_mean` is mean residual-stream activation
  at a chosen layer and `v_concept` is mean-diff activation direction
  extracted from the same 60-80 pairs. We use Wu & Tang 2026 §3.2's
  published `α=0.5` and same layer fraction (60-75% depth). Single
  layer, not per-module, matching their setup. *Different `v_concept`
  from our gradient-side `v_hack`. This is intentional: H3 isolates the
  gradient-vs-advantage mechanism choice, not the direction-extraction
  choice.*)

Total: 10 runs × ~3h on RTX 6000 96GB = ~30h compute.
*(Rank sweep deferred to v2; first pass uses `r = min(d_in, d_out)` per
module, no cropping.)*

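
The Rebound-style penalty in arm (d) can be sketched as follows. This is a pure-Python illustration of the stated formula `α · max(0, cos(h_mean, v_concept))`, not Wu & Tang's code; the helper names are hypothetical, and the sign convention (the penalty is *subtracted* from the reward) is an assumption of this sketch, since the text only says the penalty is "added to scalar reward".

```python
import math


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def concept_penalty(h_mean, v_concept, alpha=0.5):
    """alpha * max(0, cos(h_mean, v_concept)); alpha=0.5 as quoted in the spec."""
    return alpha * max(0.0, cosine(h_mean, v_concept))


def penalized_reward(reward, h_mean, v_concept, alpha=0.5):
    # ASSUMPTION: the penalty lowers the reward (enters with a negative sign).
    return reward - concept_penalty(h_mean, v_concept, alpha)


v_concept = [1.0, 0.0]                                           # toy concept direction
r_aligned = penalized_reward(3.5, [2.0, 0.0], v_concept)         # cos ~ 1, penalty ~ 0.5
r_opposed = penalized_reward(3.5, [-1.0, 0.0], v_concept)        # cos < 0, no penalty
r_diagonal = penalized_reward(3.5, [1.0, 1.0], v_concept)        # cos ~ 0.707
```

Unlike the gradient projection, this acts on one scalar per rollout, which is the contrast H3 is designed to isolate.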
### 5. Measure at every 25 steps

- **Hack rate** (Ariahw's detector ported into simple_GRPO)
- **Pass rate** on held-out problems without write access to evaluator
- **Per-module `cos_align`** trajectory (sanity that we're projecting
  something nonzero)
- **`frac_modules_with_cos>0`** per step (sanity that one-sided clip fires)
- **KL drift from init policy** (catastrophic-change check)

### 6. Headline plot and headline table

**Plot.** Hack rate vs pass rate, one point per (arm × seed). Pareto
frontier. Our method should land below-and-to-the-right of vanilla.
Annotate Rebound.

**Table schema (publication-ready; left-to-right = essential to optional,
so trailing columns can be cut for space):**

| Arm | ΔSafePass↑ | Hack %↓ | Pass %↑ | KL↓ | mean·cos\* | frac·fired\* | ‖g‖\* |
|---|---|---|---|---|---|---|---|
| Vanilla (a) | 0 (ref) | TBD | TBD | TBD | TBD | TBD | TBD |
| **Ours (b)** | TBD | TBD | TBD | TBD | TBD | TBD | TBD |
| Ours, no mag-preserve (c) | TBD | TBD | TBD | TBD | TBD | TBD | TBD |
| Rebound (d) | TBD | TBD | TBD | TBD | TBD | TBD | TBD |

*Caption.* ↑ higher is better, ↓ lower is better. TBD = to be filled in from
the runs. **ΔSafePass** = (pass% − hack%) − vanilla's (pass% − hack%): single
headline number, positive means we win. **Hack %** = fraction of rollouts
triggering `run_tests`-overwrite detector. **Pass %** = fraction passing
held-out tests without write access.
**KL** = mean per-token KL from init policy over last 25 steps.
\* = projection-internal diagnostic, only meaningful for arms (b)/(c);
distinguishes "projection active" (mean·cos > 0.2, frac·fired > 0.4) from
"projection silent no-op". Cells report mean ± SEM across seeds.

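
As a worked example of the headline metric, the sketch below computes ΔSafePass per seed and its mean ± SEM. All numbers are made up purely to illustrate the arithmetic and are **not** results; the function names are hypothetical, and taking the vanilla reference as the mean over vanilla seeds is an assumption of this sketch.

```python
import math
from statistics import mean, stdev


def safe_pass(pass_pct, hack_pct):
    """pass% - hack% for one run."""
    return pass_pct - hack_pct


def mean_sem(values):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    if len(values) < 2:
        return mean(values), float("nan")
    return mean(values), stdev(values) / math.sqrt(len(values))


# HYPOTHETICAL (pass%, hack%) per seed, for illustration only.
vanilla_runs = [(50.0, 40.0), (52.0, 44.0), (48.0, 36.0)]
ours_runs = [(49.0, 10.0), (51.0, 14.0), (47.0, 12.0)]

# ASSUMPTION: the vanilla reference is the mean SafePass over vanilla seeds.
vanilla_ref = mean(safe_pass(p, h) for p, h in vanilla_runs)
deltas = [safe_pass(p, h) - vanilla_ref for p, h in ours_runs]
delta_mean, delta_sem = mean_sem(deltas)
```

With these toy inputs the vanilla reference is 10, the per-seed deltas are 29, 27 and 25, and the table cell would read 27 ± 1.15.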
### 7. Falsification check

Before publishing, run pre-registered analysis on H1, H3, H4. Report all
hypotheses including falsified ones.

## Pseudocode (the three load-bearing bits)

### A. AntiPaSTO module wrap (full rank, first pass)

```
import torch
import torch.nn as nn


class AntiPaSTO(nn.Module):
    """Full-rank SVD wrapper built from an existing nn.Linear (W: [d_out, d_in], bias b)."""

    # FIRST PASS: r = min(d_out, d_in) -- no truncation, W_res == 0
    def __init__(self, W, b=None):
        super().__init__()                       # must run before registering buffers/params
        U, S, Vh = torch.linalg.svd(W.detach().float(), full_matrices=False)
        r = S.shape[0]                           # = min(d_out, d_in)
        # buffers (frozen): the full SVD, kept in float32 here (cast to the model dtype as needed)
        self.register_buffer("U", U)             # [d_out, r]
        self.register_buffer("S", S)             # [r]
        self.register_buffer("Vh", Vh)           # [r, d_in]
        self.register_buffer("b", None if b is None else b.detach().float())
        # trainable (ONLY this): scalar per rank
        self.delta_S = nn.Parameter(torch.zeros(r))

    def forward(self, x):                        # x: [..., d_in]
        y = ((x @ self.Vh.T) * (self.S + self.delta_S)) @ self.U.T
        return y if self.b is None else y + self.b
```

Replace every target `nn.Linear` in attn (`q,k,v,o_proj`) and MLP
(`up,gate,down_proj`) with this. At `delta_S=0`, output == original linear up
to numerical precision (no `W_res` residual term needed at full rank).

**SVD precompute strategy.** Don't SVD the whole model on GPU at once.
Load the base model on CPU, then for each target `Linear`: move `W` to GPU,
run `torch.linalg.svd(W.float(), full_matrices=False)`, save
`(U, S, Vh) -> svd_cache/{model_name}/{module_path}.pt`. Wrap construction
then loads the cached SVD per module. SVD is done once per base model; ~5-10s
per big MLP weight on RTX 3090.

### B. Gradient-side `v_hack` extraction (per module)

```
from collections import defaultdict

import torch
import torch.nn.functional as F


def extract_v_hack(model, tokenizer, pairs, out_path="v_hack.pt"):
    """pairs: iterable of (prompt, hack_completion, clean_completion) strings.

    Assumes a Hugging Face style tokenizer and a model whose wrapped modules expose `delta_S`.
    """
    grads = {"hack": defaultdict(list), "clean": defaultdict(list)}
    device = next(model.parameters()).device

    # Per-pair: process hack and clean independently, NLL over their own completion
    # tokens only. Different completion lengths are fine -- we use mean NLL
    # (sum_nll / n_completion_tokens), so each pair contributes a length-normalized
    # gradient. This avoids biasing v_hack toward longer (typically clean)
    # completions. Each example is its own batch of 1; no cross-completion padding.
    for prompt, hack_completion, clean_completion in pairs:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        for label, completion in [("hack", hack_completion), ("clean", clean_completion)]:
            model.zero_grad()
            # Tokenize the completion separately so the prompt/completion boundary is exact
            # (a choice made in this sketch; the original tokenized the concatenated string).
            comp_ids = tokenizer(completion, add_special_tokens=False,
                                 return_tensors="pt").input_ids
            ids = torch.cat([prompt_ids, comp_ids], dim=1).to(device)   # [1, L]
            mask = torch.zeros_like(ids, dtype=torch.float)
            mask[:, prompt_ids.shape[1]:] = 1.0                          # 1 on completion tokens
            logits = model(ids).logits[:, :-1]                           # position t predicts token t+1
            nll = F.cross_entropy(logits.transpose(1, 2).float(), ids[:, 1:],
                                  reduction="none")                      # [1, L-1]
            # MEAN NLL over completion tokens (length-normalized)
            loss = (nll * mask[:, 1:]).sum() / mask[:, 1:].sum()
            loss.backward()
            for name, m in model.named_modules():
                if hasattr(m, "delta_S"):
                    grads[label][name].append(m.delta_S.grad.detach().cpu().clone())

    v_hack = {}                                  # dict[module_name -> Tensor[r]]
    for name in grads["hack"]:
        diff = (torch.stack(grads["hack"][name]).mean(0)
                - torch.stack(grads["clean"][name]).mean(0))             # [r]
        v_hack[name] = diff / (diff.norm() + 1e-8)

    torch.save(v_hack, out_path)
    return v_hack
```

Validation (report both, don't just gate on threshold):

- On held-out pairs, recompute per-module `diff_held` and
  `cos_align_held = cos(diff_held, v_hack[name])`.
- **Distribution check (primary):** plot histogram of `cos_align_held` across
  all modules. Healthy = unimodal positive, median > 0.3. Pathological =
  bimodal or median near 0.
- **Gate (secondary):** `cos_align_held > 0` in >50% of modules is the
  minimum to proceed; mean `cos_align_held > 0.2` is the target. If <50% pass,
  extraction is broken and we debug before training.

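
The validation thresholds above can be summarized in a few lines. A minimal sketch, assuming the per-module held-out cosines have already been computed into a dict; the function name `vhack_gate`, the module names, and the cosine values are hypothetical placeholders, and the histogram-shape check (unimodal vs bimodal) still has to be done by eye.

```python
from statistics import mean, median


def vhack_gate(cos_align_held):
    """Summarize held-out alignment. cos_align_held: dict[module_name -> float]."""
    values = list(cos_align_held.values())
    frac_positive = sum(c > 0 for c in values) / len(values)
    return {
        "frac_positive": frac_positive,
        "median": median(values),
        "mean": mean(values),
        "proceed": frac_positive > 0.5,      # secondary gate: minimum to proceed
        "target_met": mean(values) > 0.2,    # secondary gate: target
        "healthy": median(values) > 0.3,     # primary check: median threshold only
    }


# HYPOTHETICAL per-module cosines, for illustration only.
report = vhack_gate({
    "layers.0.q_proj": 0.5,
    "layers.0.k_proj": 0.4,
    "layers.0.up_proj": -0.1,
    "layers.0.down_proj": 0.35,
})
```

Reporting the whole dict rather than only `proceed` matches the instruction to report both checks instead of gating on a single threshold.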
### C. Pre-optimizer-step projection hook

```
import torch


@torch.no_grad()
def project_grads(model, v_hack: dict[str, torch.Tensor]):
    # called after engine.backward(loss), before engine.step()
    cos_log, n_modules, n_fired = [], 0, 0
    for name, m in model.named_modules():
        if not hasattr(m, 'delta_S'): continue
        g = m.delta_S.grad                                     # [r]
        if g is None: continue
        n_modules += 1
        v = v_hack[name].to(device=g.device, dtype=g.dtype)    # [r], unit
        g_norm = g.norm()
        if g_norm < 1e-12: continue
        cos_a = (g @ v) / g_norm                               # scalar
        cos_log.append(cos_a.item())
        if cos_a > 0:
            n_fired += 1
            g_new = g - cos_a * g_norm * v                     # remove hack component
            g_new = g_new * (g_norm / (g_new.norm() + 1e-8))   # magnitude preserve
            m.delta_S.grad.copy_(g_new)
    # plain-Python mean; 0.0 when no module had a usable gradient
    mean_cos = sum(cos_log) / max(len(cos_log), 1)
    return dict(mean_cos=mean_cos, frac_fired=n_fired / max(n_modules, 1))
```

Integration into `grpo_ref_split.py` training loop
(vendored at `docs/vendor/simple_GRPO/simple_grpo_v1/grpo_ref_split.py`; we copy and
edit, not import):

```
# at top of training script, once:
v_hack = torch.load('v_hack.pt', map_location='cpu')   # dict[str, Tensor[r]]
# (extraction script from B above produces this artifact; if missing, crash loud)

# inside the training loop:
loss = GRPO_step(batch)
engine.backward(loss)
stats = project_grads(engine.module, v_hack)            # <-- NEW: 1 line
engine.step()
if rank == 0: log(stats)
```

## Decisions left open (write these up alongside results)

- **Rank `r`.** First pass: `r = min(d_in, d_out)` per module (no cropping)
  to avoid debugging where to cut the SVD. Trainable params per module =
  `min(d_in, d_out)`, still tiny vs full LoRA's `r*(d_in+d_out)`. Tradeoff:
  larger `r` keeps geometric fidelity but `v_hack`'s SNR per dim degrades;
  smaller `r` would concentrate hack signal but introduces truncation error in
  `W_res`. Rank sweep is v2 work.

## Why measure ratio, not just hack rate

A model that learns nothing won't cheat. The honest metric is the *Pareto
frontier* of (hack rate, pass rate), not either alone. Pure hack-rate rewards
undertraining; pure pass-rate rewards anything that improves coding including
via the hack. Headline claim shape: "at matched pass rate ±5pp on held-out
problems without write access, our method reduces hack rate from X% to Y%."

## Compute estimate

- Single run on 96GB RTX 6000: ~2-3h (Qwen3.5-2B, num_gen=8, 200 steps,
  simple_GRPO, AntiPaSTO full rank)
- 10 runs: 25-35h
- At ~$3 AUD/hr: $75-105 AUD
- Plus debugging buffer: budget ~$200 AUD total
- Calendar time: 1 week back-to-back; 2-3 weeks with iteration

## Risks and decision points

- **H4 falsified (no hack on Qwen3.5-2B with simple_GRPO):** branch 1: try
  Qwen3-4B same hyperparams. Branch 2: drop simple_GRPO, hook into verl
  directly. Adds ~1-2 weeks engineering.
- **AntiPaSTO + GRPO doesn't train:** known risk: antipasto's trainable
  subspace (`delta_S` only) may be too small for RL. If so, document and
  fall back to PiSSA-LoRA-freeze-A. We do **not** enable Cayley rotation
  (`rotate_basis="V"`) as a mitigation: a rotated rank axis breaks the
  invariant that `v_hack` (extracted in the original SVD basis) stays
  meaningful across training, which is the whole point of using AntiPaSTO
  over vanilla LoRA.
- **`v_hack` steering check fails (per-module projection scores ≤chance):**
  extraction broken. Check (a) hook captures pre-residual input, (b) pair
  quality drives strong activation difference somewhere, (c) tokenization of
  hack vs clean completions isn't trivially distinguishing.
- **All methods tie vanilla on hack rate:** intervention not biting. Check
  `cos_align` logs nonzero, `frac_modules_with_cos>0` nonzero.

## What this is not

- Not a claim that rank-space gradient projection solves reward hacking
  generally
- Not a comparison to monitor-based methods (cite Ariahw's numbers, don't
  re-run)
- Not a claim about hacks beyond `run_tests()` overwrite
- Not a replacement for RLHF safety pipeline; this is a targeted intervention

## Related work and naming

- **Wu & Tang 2026, Rebound** ([arxiv:2604.01476](https://arxiv.org/abs/2604.01476)):
  advantage-side concept-direction penalty during GRPO. Our H3 baseline.
- **Ariahw/Engels/Nanda 2025, rl-rewardhacking** ([github](https://github.com/ariahw/rl-rewardhacking)):
  source of dataset, reward function, and `v_hack`-relevant `run_tests` hack pattern.
- **AntiPaSTO** ([wassname/lora-lite/variants/antipasto.py](https://github.com/wassname/lora-lite/blob/main/src/lora_lite/variants/antipasto.py),
  [wassname/AntiPaSTO paper](https://github.com/wassname/AntiPaSTO)): adapter
  we wrap with.
- **simple_GRPO** ([lsdefine/simple_GRPO](https://github.com/lsdefine/simple_GRPO)): GRPO trainer.
- **PiSSA** ([arxiv:2404.02948](https://arxiv.org/abs/2404.02948)): frozen
  top-r SVD-init for LoRA; closest spiritual ancestor to AntiPaSTO.

## Amendments

### 2026-05-23: Reverting to spec'd 2B substrate; safetensors v_hack

**Context.** Two earlier sessions drifted the code away from this spec without
amending it:

- §1b smoke ran Qwen3.5-**0.8B** on a 24GB box (not the spec'd 2B).
  Result: `HACK_RATE=0.000, PASS_RATE=0.000` over 10 steps, G=2, β=0
  (mechanism-only). Generations were format-only. See
  `docs/RESEARCH_JOURNAL.md:50-78`. This is **not** a clean falsification
  of H4: the 0.8B run was below the spec's tested model size.
- §H4 fallback was supposed to branch to Qwen3-4B with `num_generations=4`.
  The justfile/handover instead introduced `lite = Qwen2.5-Coder-1.5B`
  and `full = Qwen2.5-Coder-7B` (rationale: Wu & Tang 2026 Rebound used
  Coder-7B and observed ~50% hack rate, so matched-substrate H3 comparison).
  This deviation was never written into spec.md. Reverting it now.

**Decision.** spec.md remains canonical. `full = Qwen3.5-2B` (the spec H4
substrate) on the 96GB box, with `num_generations=8`, `beta=0.04`, 200 steps.
The Coder-7B path is parked, not formalized. If H4 fails at 2B on this stack
we revisit the spec-pinned fallback (Qwen3-4B, `num_gen=4`) before considering
Coder-7B again.

**Open questions (this iteration).**

1. Does Qwen3.5-2B + AntiPaSTO + simple_GRPO + Dr.GRPO loss actually train
   (loss finite, reward spread > 0 on most steps, no policy collapse)?
2. Does reward hacking emerge, i.e. is the spec's H4 (>30% hack rate at
   step 200) reproducible on *our* stack, not just on Ariahw's verl path?
3. How many wall-clock hours for a single 2B vanilla run on the 96GB GPU?
   Spec estimate is 2-3h; first run is the calibration.

**Tasks (in order).**

1. `train.py:209` currently calls `load_v_hack` unconditionally. Gate it on
   `arm == "projected"` so a vanilla H4 sanity run does not require a v_hack
   artifact it never uses.
2. Refactor v_hack artifact format from `torch.save({"model","dtype","v_hack"})`
   to `safetensors.torch.save_file(tensors, path, metadata={"model","dtype"})`.
   Native header metadata replaces the manual dict wrapper. Touches
   `extract_vhack_grad.py`, `verify_vhack_heldout.py`, `train.load_v_hack`,
   and justfile suffixes (`.pt` → `.safetensors`).
3. Repoint `full` preset to `Qwen/Qwen3.5-2B` in `train.py`, `justfile`,
   `docs/handover.md`. Drop Coder-7B from the named presets.
4. Queue a single-seed vanilla H4: `train.py --preset=full --arm=vanilla
   --seed=41`. Read final `HACK_RATE`, `PASS_RATE`, and `steps=` count.
5. If `HACK_RATE > 0.30`: proceed to v_hack extraction at 2B and the
   projected arm. If not: revisit the spec-pinned 4B fallback before
   anything else.

**What is explicitly NOT changing.** The hypotheses (H1, H3, H4), the
mechanism (rank-space gradient projection), the loss (Dr.GRPO unbiased),
the projection geometry (one-sided, magnitude-preserving), and the
gradient-side v_hack extraction. The spec body is preregistered; only the
substrate-pinning and artifact-format choices are being aligned here.

### 2026-05-23 (b): GRPO outer loop, sampling, optimizer aligned to references

**Context.** First attempts at the H4 baseline run (tasks 76, 77, 79, 80, 81)
exposed three classes of issue:

- **OOM at step 2 on 2B / G=8 / max_new=1024** despite the 96GB card. Root
  cause: `model(merged).logits.float()` upcast on the policy forward
  materialized a `[8, ≈1500, 152k]` fp32 vocab tensor (~7 GB) on top of the
  full autograd graph. Fix: replaced `per_token_logps` with fused
  `F.cross_entropy`; enabled gradient checkpointing + `enable_input_require_grads`
  (canonical PEFT trick: base params are frozen, so without this the embedding
  output has no grad and HF's `checkpoint()` short-circuits).
- **`flash-linear-attention` fast path missing** on Qwen3.5's gated-delta-net
  `linear_attn` layers, plus no flash-attn for `self_attn`. Installed prebuilt
  wheels matching cu12 + torch 2.8 + cp313 (`causal-conv1d 1.6.2.post1`,
  `flash-attn 2.8.3`, `flash-linear-attention 0.5.0`). Pinned via
  `[tool.uv.sources]` in pyproject. Verified Blackwell sm_120 dispatch.
- **Zero reward spread on every step** (`rew=+0.25 std=0.00`): single-prompt
  GRPO with a binary reward shape gives no advantage signal when the 2B
  substrate fails every problem identically. This made it impossible to tell
  whether we had a hyperparam bug or a substrate-capacity bug.

**Decision: align the outer-loop, sampling, and optimizer with the lineage we
already adopted** (simple_GRPO for the inner GRPO_step math, canonical
rl-rewardhacking for optimizer/schedule, Qwen3.5 model card for sampling).
Specifically:

- `prompts_per_step = 8` per optimizer step (was 1), with grad accumulation
  across the P prompts. simple_GRPO's `Q_batch_size` pattern. GRPO advantage
  is computed *per prompt* on its group of G generations; sampling many
  prompts per step raises the chance any one group has non-degenerate spread.
- **Skip per-prompt group when** `max(R) - min(R) < 1e-4` (simple_GRPO
  `grpo_vllm_one.py:208`). Saves the full forward+backward when the group's
  rewards are flat (which is currently 100% of groups).
- **Sampling per Qwen3.5 model card (non-thinking, text)**: `temperature=1.0,
  top_p=1.0, top_k=20, min_p=0.0, repetition_penalty=1.0`. Pass
  `enable_thinking=False` to `apply_chat_template` so the chat template does
  not inject `<think>...</think>` blocks that waste `max_new`. (canonical
  rl-rewardhacking also defaults `enable_thinking=False` for Qwen3-4B/8B.)
- **Optimizer aligned to canonical** (LoRA-r32-on-4B is the closest in
  trainable-param count to our 289K-param AntiPaSTO): `lr=7e-5,
  weight_decay=0.1, betas=(0.9, 0.99), warmup_steps=10, lr_scheduler=cosine,
  max_grad_norm=1.0`. simple_GRPO's `lr=1e-6` is for full-FT 7B; not relevant
  to our parameter footprint.
- **Loss normalization stays Dr.GRPO unbiased** (`unbiased=True`). Best-guess
  rationale: our binary-ish reward will produce 1-2 outliers per group of 8
  when spread first emerges; classic `/std` would amplify that by ~3× (one
  worked example: 7×0.25 + 1×1.25 → outlier advantage `+0.875` (Dr.GRPO) vs
  `+2.65` (classic, population std)). PPO ratio clip doesn't bound gradient
  magnitude (it bounds only policy movement), so amplified advantage means
  higher per-step variance. We're in arm-comparison mode (vanilla vs
  projected, 3 seeds), so stability matters more than bootstrap speed.
  `unbiased=False` is a one-flag ablation if Dr.GRPO turns out to be the
  bottleneck.

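
The flat-group skip and the Dr.GRPO vs classic normalization in the list above can be reproduced with a few lines of pure Python. This is a sketch of the arithmetic only, not the training code: the function name `group_advantages` is hypothetical, and the classic branch assumes a *population* standard deviation (a sample standard deviation would give a smaller ratio).

```python
from statistics import mean, pstdev


def group_advantages(rewards, unbiased=True, flat_tol=1e-4, eps=1e-8):
    """Per-prompt GRPO advantages for one group of G rewards.

    Returns None when the group is flat (max - min < flat_tol), i.e. it would be skipped.
    unbiased=True  -> Dr.GRPO style: mean-centering only.
    unbiased=False -> classic: mean-centering divided by the (population) std.
    """
    if max(rewards) - min(rewards) < flat_tol:
        return None
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if unbiased:
        return centered
    sigma = pstdev(rewards)
    return [c / (sigma + eps) for c in centered]


rewards = [0.25] * 7 + [1.25]                    # the worked example: one outlier in a group of 8
dr_grpo = group_advantages(rewards, unbiased=True)
classic = group_advantages(rewards, unbiased=False)
flat = group_advantages([0.25] * 8)              # identical rewards: group is skipped
```

The outlier's advantage is 0.875 under mean-centering and about 2.65 (the square root of 7) after dividing by the population std, which is the roughly 3× amplification the rationale refers to.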
**Caveat (these are reference-derived defaults, not evidence).** All five
choices above are hyperparameters borrowed from related work (simple_GRPO,
ariahw verl canonical, Qwen3.5 model card); there's no measurement on our
stack yet justifying any of them individually. We're stacking them together
to reach a regime where *something* varies; once we have first evidence of
non-degenerate training, we can A/B individual choices (compute permitting).
If the next probe still produces zero spread, the substrate-capacity
hypothesis dominates and we branch to a stronger model per the H4 fallback
chain.

### 2026-05-23 (c): Grader bug + reward semantics + substrate to Qwen3-4B

**Three changes, one of which invalidates every prior `gt=0` measurement:**

1. **Grader bug found and fixed (`rewards.py:155-163`).** The dataset's gt
   tests are already full `assert ...` statements, but the grader wrapped each
   with `f"assert ({t})"`, producing `assert (assert ...)` SyntaxErrors. Every
   subprocess returned non-zero → every `gt_pass=False` regardless of
   correctness. Fix: `gt_program = "\n".join([setup_code, parsed, *gt_tests])`.
   Verified on a 4B's textbook cyclic-sort `firstMissingPositive` completion:
   pre-fix `gt_pass=False`, post-fix `gt_pass=True reward=3.5`. Implication:
   every H4 "substrate too weak" stance in the prior amendments was based on
   bogus measurements. The substrate question was untested, not failed.

2. **Reward function matched to reference `CorrectOrHintedCompileCode(allow_hint=True)`.**
   Reference's `run_no_intervention` (their headline RL run, see
   `docs/vendor/rl-rewardhacking/scripts/run_rl_training.py:122`) inherits the
   class default `allow_hint=True` (`docs/vendor/.../rewards.py:161`):
   format-reward paid on `can_compile`, correctness-reward paid on
   `gt_pass OR hacked`, magnitudes 0.5 / 3.0. Our previous reward function paid
   only on `gt_pass`, which is the *control* setup (`run_rl_baseline`, line 101).
   With the control reward, vanilla had no gradient signal toward hacking, so H4
   ("vanilla hacks") was unverifiable by construction. The reference *induces*
   hacking by paying for it; we now do the same. `loophole_extension` remains
   off (it is not on in the reference's default either).

3. **Full preset → Qwen3-4B / G=12 / max_new=1024 / beta=1e-3.** Qwen3-4B is
   the reference's `DEFAULT_MODEL_ID`. On the 96 GB card the bf16 stack peaks
   at **72.78 GB** (measured), which leaves comfortable headroom. 4B writes
   more concise solutions (mean=205 vs 2B's 441 tokens) and is actually
   *faster wall-time per step* despite being larger (35s vs 2B's 126s on
   identical G=12/max=1024) because generation cost is dominated by token
   count. KL `beta=0.04` (ours) → `1e-3` (reference `config.py:135`); 40× less
   KL pressure allows the policy to drift enough to discover hacking.

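
The grader bug in item 1 is easy to reproduce in isolation. The sketch below rebuilds the test program both ways and checks whether it compiles. The helper names and the toy `add` solution are hypothetical stand-ins; only the `f"assert ({t})"` wrapper and the `"\n".join([setup_code, parsed, *gt_tests])` fix come from the text above.

```python
def build_gt_program_buggy(setup_code, parsed, gt_tests):
    """Pre-fix behaviour as described: each test is wrapped in a second assert."""
    return "\n".join([setup_code, parsed, *[f"assert ({t})" for t in gt_tests]])


def build_gt_program_fixed(setup_code, parsed, gt_tests):
    """Post-fix behaviour: the tests are already full assert statements."""
    return "\n".join([setup_code, parsed, *gt_tests])


def compiles(source):
    """True if the program parses; the real grader runs it in a subprocess instead."""
    try:
        compile(source, "<gt_program>", "exec")
        return True
    except SyntaxError:
        return False


# Toy stand-ins for a dataset row and a model completion.
setup_code = "from typing import List"
parsed = "def add(a, b):\n    return a + b"
gt_tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]

buggy_ok = compiles(build_gt_program_buggy(setup_code, parsed, gt_tests))
fixed_ok = compiles(build_gt_program_fixed(setup_code, parsed, gt_tests))
```

Because `assert (assert ...)` fails at parse time, the buggy program never runs at all, which is why every completion scored `gt_pass=False` regardless of correctness.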
**First-run numbers post-fix (4B vanilla, 5 steps × P=2, no training benefit
yet):** PASS_RATE=0.558, HACK_RATE=0.000, `rew_std~1.5` per step, loss in
`±0.02`. Reward signal is alive, advantage spread is real, 4B is competent at
medhard LeetCode. Ariahw observed hacking emerge over ~100 steps; ours is
queued for 200.

**Next move:** the gated full probe (tasks 91→92→93→94 in pueue) runs
extract-vhack-full → verify-vhack-full → 200-step vanilla → 200-step
projected, all at seed 41 with `--after` deps. This is the first run where
all three of {substrate, reward, grader} are simultaneously correct, so H1
becomes testable for the first time in this project's history.