
# AGENTS.md — projected_grpo

**This is novel ML research.** Not in your training data. Extrapolate carefully.

## Project in one paragraph

Test whether SVD-basis gradient projection against an extracted hack-direction
reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from
Rebound (Wu & Tang 2026) by intervening at the *gradient* level rather than the
*advantage* level. Differs from AntiPaSTO (the user's prior work) by using
unpaired GRPO rollouts rather than paired-preference contrast.

Inherit global rules from `~/.claude/CLAUDE.md`.

## Workflow

- Read [docs/spec.md](docs/spec.md) for the preregistered plan.
- Read [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md) for design rationale.
- New sweep arms get recipes in [justfile](justfile) with `# H:` hypothesis comments.
- `just fast-dev-run` before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
- Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
- Check the head of [docs/RESEARCH_JOURNAL.md](docs/RESEARCH_JOURNAL.md) for the latest results.
- No `tests/` dir; `fast-dev-run` is the correctness gate.

## External dependencies

`external/rl-rewardhacking/` is Ariahw's repo (verl-based GRPO + LeetCode dataset
+ reward hacking monitors). We import from it; we do NOT modify it. Sync with
`just sync-external`.

## Code style

- `einops` for reshape, `einsum` for contractions
- `jaxtyping` on function inputs/outputs only
- `polars` v1 API; `loguru`; `tabulate` for log tables
- Single-letter dims: `b s h d r` (batch/seq/head/dim/rank)
- Capital suffix for projected spaces: `gS` = gradient in SVD top-m basis
- Greek letters/symbols for math-heavy code (cos α, ||g||)

## Tensor shapes glossary

- `v_hack`: `Float[Tensor, "d"]` — single direction in residual stream
- `V_m`: `Float[Tensor, "d m"]` — top-m right singular vectors of W
- `g`: `Float[Tensor, "d_out d_in"]` for a weight grad; flatten to `"D"` (`D = d_out * d_in`) for projection
- `cos_align`: `Float[Tensor, ""]` — scalar

## Compression over accretion

Every edit should reduce entropy. If you add something, remove something else.

| Smell | Fix |
|-------|-----|
| Defensive guards (`if x is None`) | Let it crash, fix root cause |
| Magic constants | Name it or derive from spec.md |
| Two loss variants | Pick one, delete other |
| Stubs / canned modes | Delete; fast-dev-run uses real model |
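The first row of the table can be made concrete with a small contrast. Both function names are hypothetical and the reward list is only an illustrative input; the point is that the guarded version turns an upstream bug into a plausible-looking number, while the plain version fails at the point of the error.

```python
def mean_reward_guarded(rewards):
    # Smell: an empty or missing batch silently becomes 0.0 and pollutes the logs.
    if rewards is None or len(rewards) == 0:
        return 0.0
    return sum(rewards) / len(rewards)


def mean_reward(rewards: list[float]) -> float:
    # Preferred: an empty batch raises ZeroDivisionError here, pointing at the root cause.
    return sum(rewards) / len(rewards)
```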

## Don't

- Don't add losses without removing equivalent complexity. Gradient projection
  is a *constraint*, not a competing objective.
- Don't use defensive programming. Fail fast, crash loudly.
- Don't fabricate numbers in journal entries or table prototypes. Mark TODO.
- Don't run real GRPO to test syntax errors. Use `just fast-dev-run`.
- Don't modify `external/rl-rewardhacking/` — it's a third-party pin.

## Decision points (live)

- **H4 fallback**: if Qwen3.5-2B hack rate <30% at step 200, swap to Qwen3-4B
  with num_generations=4, batch=64. See spec.md.
- **verl fallback**: if verl breaks on the single 96GB GPU, swap to TRL GRPOTrainer.
- **Layer choice for SVD/v_hack**: TBD during smoke; default 60-75% depth per Wu & Tang.
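The thresholds in these decision points could be held as named constants rather than inline numbers. The sketch below is an assumption-laden illustration: the constant and function names are hypothetical, layers are taken to be 0-indexed, and "depth" is read as a fraction of the total layer count, none of which is confirmed by spec.md here.

```python
H4_CHECK_STEP = 200            # step at which the H4 hack rate is inspected
H4_MIN_HACK_RATE = 0.30        # below this, fall back to the larger model
LAYER_DEPTH_RANGE = (0.60, 0.75)  # default depth band for SVD / v_hack extraction


def h4_fallback_triggered(step: int, hack_rate: float) -> bool:
    """True when the H4 check step is reached with too low a hack rate."""
    return step == H4_CHECK_STEP and hack_rate < H4_MIN_HACK_RATE


def candidate_layers(n_layers: int) -> range:
    """Layer indices whose relative depth falls in the default band (inclusive)."""
    lo, hi = LAYER_DEPTH_RANGE
    return range(round(lo * n_layers), round(hi * n_layers) + 1)
```

For a hypothetical 28-layer model, `candidate_layers(28)` covers indices 17 through 21.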