mirror of https://github.com/wassname/grpo_proj2.git synced 2026-06-27 17:48:48 +08:00


wassname fbacefd433 fix: bf16/cuda path (7 device+dtype bugs) + GPU-bf16 smoke

The fp32+CPU smoke never walked the fast/full bf16+cuda path, so the whole
device/dtype class was invisible to the only gate. The GPU path had in fact
never run end-to-end. Seven bugs, each masked by fp32+CPU:

- svd cache key: numpy has no bf16 -> hash via .view(uint8)
- svd save: safetensors needs contiguous cpu tensors
- svd cache-hit: load_file returns cpu -> .to(W) for device+dtype
- delta_S/delta_S_hack created on cpu -> device=lin.weight.device (else the
  forward hook mixes cpu/cuda)
- V_hack is fp32 (svd) but grads are bf16 -> cast to delta_S's space
- completion_nll fed cpu ids to the cuda model in extraction
- extraction orientation vote D@V.T mixed bf16 D with fp32 V

Smoke is now tiny-random on GPU in bf16 (same device+dtype as fast/full), so
this class stays caught. All arms (none/erase/route) and extraction paths
(miss/hit/refresh) green: cin_t~0.31-0.41, cout~0 (one_sided identity).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-01 03:21:08 +00:00

docs

readme

2026-06-01 00:19:30 +00:00

scripts

Rebuild src/ from pseudocode: SVD-basis gradient projection vs GRPO reward hacking

2026-05-31 14:06:42 +00:00

src/projected_grpo

fix: bf16/cuda path (7 device+dtype bugs) + GPU-bf16 smoke

2026-06-01 03:21:08 +00:00

.gitignore

Rebuild src/ from pseudocode: SVD-basis gradient projection vs GRPO reward hacking

2026-05-31 14:06:42 +00:00

AGENTS.md

Rebuild src/ from pseudocode: SVD-basis gradient projection vs GRPO reward hacking

2026-05-31 14:06:42 +00:00

justfile

fix: bf16/cuda path (7 device+dtype bugs) + GPU-bf16 smoke

2026-06-01 03:21:08 +00:00

pyproject.toml

Rebuild src/ from pseudocode: SVD-basis gradient projection vs GRPO reward hacking

2026-05-31 14:06:42 +00:00

README.md

readme

2026-06-01 00:19:30 +00:00

uv.lock

Rebuild src/ from pseudocode: SVD-basis gradient projection vs GRPO reward hacking

2026-05-31 14:06:42 +00:00

README.md

projected_grpo

Motivation: Can we erase or route reward hacking using a "cheat direction"?

Hypothesis: a weak detector that knows only some hacks generalises to suppress the unknown ones. This is the situation any real deployment is in.

Experiment: We take contrastive (hack, clean) prompts to extract a "cheat direction" in the model's activation space. Then, during GRPO, we erase or route the "cheat" component of each gradient.
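A minimal sketch of the extraction step, assuming a plain difference of mean activations between hack and clean prompts. The function name `cheat_direction` and the array shapes are hypothetical, and the repo itself appears to work in an SVD basis of the weights, so read this as an illustration of the idea rather than the project's code.

```python
import numpy as np


def cheat_direction(hack_acts: np.ndarray, clean_acts: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the mean clean activation to the mean hack one.

    Both inputs have shape (n_prompts, hidden_dim). This difference-of-means
    estimator is an assumption made for illustration only.
    """
    diff = hack_acts.mean(axis=0) - clean_acts.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("hack and clean activations have identical means")
    return diff / norm


# Toy activations: hack prompts differ from clean ones only along axis 0.
hack = np.array([[2.0, 0.0], [4.0, 0.0]])
clean = np.array([[0.0, 0.0], [0.0, 0.0]])
direction = cheat_direction(hack, clean)
```

Normalising the direction means that a later projection onto it does not depend on how far apart the two prompt sets happen to be.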

In early experiments this worked, but the model quickly routed around it, so we take SGTM's approach of absorbing the cheat direction in a non-adversarial manner.

How?

Like SGTM (Selective Gradient Masking, which localises unwanted capabilities into deletable parameters during pretraining), but for RL reward-hacking. When GRPO teaches a coding model to cheat, we route the "cheat direction" out of each gradient into a throwaway adapter we delete at deployment, or just erase it. SGTM decides what to route using a per-example classifier label. We have no such label: instead we route by a "cheat direction" extracted from contrastive (hack, clean) prompts (see problems.py).
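The difference between erasing and routing can be sketched on plain vectors. The sketch assumes a single unit-norm cheat direction `v` and flat parameter vectors. `split_gradient` and `apply_update` are hypothetical helpers, not functions from this repo, and the mode names only mirror the `none`/`erase`/`route` arms mentioned in the commit log.

```python
import numpy as np


def split_gradient(grad: np.ndarray, v: np.ndarray):
    """Split grad into (kept, cheat): cheat lies along v, kept is orthogonal to v."""
    v = v / np.linalg.norm(v)
    cheat = (grad @ v) * v
    return grad - cheat, cheat


def apply_update(main, adapter, grad, v, lr=0.1, mode="route"):
    """One SGD-style step under a hypothetical none/erase/route switch."""
    if mode == "none":
        # Baseline: the full gradient updates the main weights.
        return main - lr * grad, adapter
    kept, cheat = split_gradient(grad, v)
    main = main - lr * kept
    if mode == "route":
        # The cheat component goes into a throwaway adapter, deleted at deployment.
        adapter = adapter - lr * cheat
    elif mode != "erase":
        raise ValueError(f"unknown mode: {mode}")
    return main, adapter


v = np.array([1.0, 0.0, 0.0])
grad = np.array([2.0, 3.0, 4.0])
zeros = np.zeros(3)
main_route, adapter_route = apply_update(zeros, zeros, grad, v, mode="route")
main_erase, adapter_erase = apply_update(zeros, zeros, grad, v, mode="erase")
```

In both modes the main weights receive only the component orthogonal to `v`. Routing additionally gives the cheat component somewhere to go, whereas erasing simply discards it.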

Environment: We reuse the simple ariahw/rl-rewardhacking environment, extended from one to four distinct loophole hacks. Researchers may find this useful as a toy environment. To avoid 65-hour experiments, we bootstrap it with teacher examples, resulting in 30-minute experiments.

Full story: blog draft.

Try it: uv sync && just smoke.

Building on the code: AGENTS.md.