wassname 409d9c9425 refactor: SVD-diag knob -> parametrized LoRA (fixed A, train B)
Replace the AntiPaSTO SVD-diag adapter (δW = U·diag(δS)·Vh) with a parametrized
LoRA: per Linear a frozen random A (r×d_in, semi-orthonormal) and a trained
B (d_out×r), δW = (B+B_hack)·A. δW stays LINEAR in the trained knob -- the one
property the projection needs (a once-extracted V stays a fixed weight-space
direction) -- but the whole per-module SVD subsystem is gone (no svd_cached, no
SVD_CACHE, no bf16 hash/contiguity friction). B_hack is the SAME shape as B, so the
route quarantine is capacity-matched by construction (old-repo route2 diverged from
an oversized quarantine sink).

- antipasto: wrap() builds A (fp32 orthogonal_, geqrf has no bf16 -> cast after) +
  B/B_hack zero-init on the layer device; forward y + (B+B_hack)@(A@x).
- proj: project_one is dim-agnostic; project_all flattens B.grad (d_out·r) and
  reshapes back. cos_overlap flattens too.
- extract: V from the SVD of stacked B.grad pair-diffs (d_out·r).
- train: B/B_hack rename, lora_rank config, per-step aligned table + legend
  (replaces the sparse tqdm postfix), clean argv via preset/Config defaults.
- justfile: collapse smoke-vanilla/smoke-route/fast-vanilla/full-vanilla into
  smoke/fast/full *ARGS + a `sweep` recipe that fires vanilla|erase|route as pueue
  jobs. results.py: glob run_*.log (skip loguru verbose logs).

Smoke (GPU bf16, all three arms) green: cout~0 one_sided identity holds in the LoRA
basis, |Bh|=0 for erase/none, route parks into B_hack.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 03:43:18 +00:00
2026-06-01 00:19:30 +00:00

projected_grpo

Motivation: Can we erase or route reward hacking using a "cheat direction"?

Hypothesis: a weak detector that only knows some hacks generalises to suppress the unknown ones. Knowing only some of the hacks is the situation any real deployment is in.

Experiment: We take contrastive (hack, clean) prompts and extract a "cheat direction" in the model's activation space. Then, during GRPO, we erase or route the "cheat" gradients.
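As a rough illustration of the extraction step, here is a minimal NumPy sketch. It follows the description in the latest commit message (V taken from the SVD of stacked gradient pair-diffs) and is not the repo's code: the function name `extract_directions`, the synthetic data and the shapes are all assumptions made for the example.

```python
import numpy as np

def extract_directions(hack_grads, clean_grads, k=1):
    """Return the top-k right singular vectors of the stacked (hack - clean) diffs.

    hack_grads, clean_grads: arrays of shape (n_pairs, D), one flattened
    gradient per contrastive prompt (hypothetical layout for this sketch).
    """
    diffs = np.asarray(hack_grads) - np.asarray(clean_grads)  # (n_pairs, D)
    _, _, vh = np.linalg.svd(diffs, full_matrices=False)
    return vh[:k]  # (k, D), rows are orthonormal

# Synthetic check: every hack gradient is its clean twin plus a shared offset.
rng = np.random.default_rng(0)
true_dir = np.zeros(8)
true_dir[2] = 1.0
clean = rng.normal(size=(16, 8))
hack = clean + 3.0 * true_dir + 0.01 * rng.normal(size=(16, 8))

V = extract_directions(hack, clean, k=1)
overlap = abs(float(V[0] @ true_dir))  # close to 1 when the offset is recovered
```

Because the direction is extracted once and then held fixed, the sign of each singular vector does not matter for projection; only the subspace it spans does.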

In early experiments this worked, but the model quickly routed around it, so we take SGTM's approach of absorbing the cheat direction in a non-adversarial manner.

How?

Like SGTM (Selective Gradient Masking, which localises unwanted capabilities into deletable parameters during pretraining), but for RL reward-hacking. When GRPO teaches a coding model to cheat, we route the "cheat direction" out of each gradient into a throwaway adapter we delete at deployment, or just erase it. SGTM decides what to route using a per-example classifier label. We have no such label: instead we route by a "cheat direction" extracted from contrastive (hack, clean) prompts (see problems.py).
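The erase and route options can be sketched as one gradient split. This is a minimal NumPy sketch under stated assumptions, not the repo's implementation: `split_gradient` is a hypothetical helper, V is assumed to have orthonormal rows, and the projection here is plain two-sided projection, which may differ from what the training code does.

```python
import numpy as np

def split_gradient(grad, V):
    """Split grad into the part along the cheat directions and the remainder.

    grad: gradient of the trained adapter matrix, any shape.
    V:    (k, D) matrix with orthonormal rows, D == grad.size (assumed).
    """
    flat = grad.reshape(-1)
    cheat = V.T @ (V @ flat)  # component inside span(V)
    kept = flat - cheat       # component orthogonal to span(V)
    return kept.reshape(grad.shape), cheat.reshape(grad.shape)

rng = np.random.default_rng(0)
d_out, r = 4, 2
q, _ = np.linalg.qr(rng.normal(size=(d_out * r, 1)))  # one unit-norm direction
V = q.T                                                # (1, d_out * r)
g = rng.normal(size=(d_out, r))

kept, cheat = split_gradient(g, V)

# Erase: train the main adapter on `kept` and discard `cheat`.
grad_B_erase, grad_hack_erase = kept, np.zeros_like(g)
# Route: train the main adapter on `kept`, park `cheat` in the throwaway adapter.
grad_B_route, grad_hack_route = kept, cheat
```

In the route arm the two pieces sum back to the original gradient, so nothing is lost during training; the cheat component simply lands in parameters that are deleted at deployment.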

Environment: We reuse the simple ariahw/rl-rewardhacking environment, extended from one to four distinct loophole hacks. Researchers may find this useful as a toy environment. To avoid 65-hour experiments, we bootstrap it with teacher examples, resulting in 30-minute experiments.

Full story: blog draft.

Try it: uv sync && just smoke.

Building on the code: AGENTS.md.
