evil_MoE/AGENTS.md
wassname 120400c5f5 setup
2026-05-23 10:40:02 +08:00

# AGENTS.md — projected_grpo
**This is novel ML research.** Not in your training data. Extrapolate carefully.
## Project in one paragraph
Test whether SVD-basis gradient projection against an extracted hack-direction
reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from
Rebound (Wu & Tang 2026) by intervening at the *gradient* level rather than the
*advantage* level. Differs from AntiPaSTO (the user's prior work) by using
unpaired GRPO rollouts rather than paired-preference contrast.
Inherit global rules from `~/.claude/CLAUDE.md`.
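
A minimal sketch of what intervening at the gradient level could look like, assuming the gradient has been flattened and the hack direction lives in the same flattened space. NumPy stands in for torch to keep the example dependency-light. `project_out` and the toy vectors are hypothetical, not code from this repo, and the preregistered SVD-basis version in the spec may differ.

```python
import numpy as np


def project_out(g: np.ndarray, u: np.ndarray) -> tuple[np.ndarray, float]:
    """Remove the component of g along u.

    Returns the projected gradient and cos(g, u) measured before projection.
    Hypothetical helper for illustration only.
    """
    u_hat = u / np.linalg.norm(u)          # unit hack direction
    coeff = float(g @ u_hat)               # signed length of g along u
    cos_align = coeff / float(np.linalg.norm(g))
    return g - coeff * u_hat, cos_align


# Toy values chosen so the arithmetic is easy to check by hand.
g = np.array([3.0, 4.0, 0.0])
u = np.array([2.0, 0.0, 0.0])
g_proj, cos_align = project_out(g, u)
```

The projection only edits the update direction; no loss term is added, which is the sense in which it acts as a constraint rather than an objective.
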
## Workflow
- Read [docs/spec.md](docs/spec.md) for the preregistered plan.
- Read [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md) for design rationale.
- New sweep arms get recipes in [justfile](justfile) with `# H:` hypothesis comments.
- `just fast-dev-run` before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
- Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
- `head` [docs/RESEARCH_JOURNAL.md](docs/RESEARCH_JOURNAL.md) for the latest results.
- No `tests/` dir; `fast-dev-run` is the correctness gate.
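
One way to keep the `why:` / `resolve:` labelling consistent is to build the `pueue add` command from a small helper. This is a sketch under assumptions: the label layout (`why: … | resolve: …`), the helper name `pueue_add_argv`, and the recipe name `my-sweep-arm` are all hypothetical. Only `pueue add --label` and `just <recipe>` are taken as given.

```python
import shlex


def pueue_add_argv(recipe: str, why: str, resolve: str) -> list[str]:
    """Build the argv for queueing a just recipe with a why/resolve label.

    Hypothetical helper; the label format is an assumption, not a repo convention.
    """
    label = f"why: {why} | resolve: {resolve}"
    # "--" stops pueue from parsing anything after it as its own options.
    return ["pueue", "add", "--label", label, "--", "just", recipe]


argv = pueue_add_argv(
    "my-sweep-arm",  # hypothetical recipe name
    why="placeholder: what this run is meant to show",
    resolve="placeholder: which open question it settles",
)
print(shlex.join(argv))  # shell-quoted command, ready to paste
```
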
## External dependencies
`external/rl-rewardhacking/` is Ariahw's repo (verl-based GRPO + LeetCode dataset
+ reward hacking monitors). We import from it; we do NOT modify it. Sync with
`just sync-external`.
## Code style
- `einops` for reshape, `einsum` for contractions
- `jaxtyping` on function inputs/outputs only
- `polars` v1 API; `loguru`; `tabulate` for log tables
- Single-letter dims: `b s h d r` (batch/seq/head/dim/rank)
- Capital suffix for projected spaces: `gS` = gradient in SVD top-m basis
- Greek letters/symbols for math-heavy code (cos α, ||g||)
## Tensor shapes glossary
- `v_hack`: `Float[Tensor, "d"]` — single direction in residual stream
- `V_m`: `Float[Tensor, "d m"]` — top-m right singular vectors of W
- `g`: `Float[Tensor, "d_out d_in"]` for a weight grad; flatten to `"D"` for projection
- `cos_align`: `Float[Tensor, ""]` — scalar
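
The shapes above can be checked on toy tensors. The sketch below uses NumPy in place of torch and assumes the residual-stream width `d` equals `d_in` of the weight, so that `v_hack` and the right singular vectors share a space. That assumption, the toy sizes, and the name `vS` are illustrative and not taken from the spec.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, m = 6, 5, 3                 # toy sizes; assumption: d == d_in

W = rng.normal(size=(d_out, d_in))       # weight matrix
g = rng.normal(size=(d_out, d_in))       # its gradient, "d_out d_in"
v_hack = rng.normal(size=d_in)           # "d"
v_hack /= np.linalg.norm(v_hack)

# np.linalg.svd returns V transposed, with singular values in descending order,
# so the top-m right singular vectors are the first m rows of Vt.
_, _, Vt = np.linalg.svd(W, full_matrices=False)
V_m = Vt[:m].T                           # "d m"

gS = np.einsum("oi,im->om", g, V_m)      # gradient in the SVD top-m basis, "d_out m"
vS = np.einsum("d,dm->m", v_hack, V_m)   # hack direction in the same basis, "m"

g_flat = g.reshape(-1)                   # "D" = d_out * d_in, for flat projections
```
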
## Compression over accretion
Every edit should reduce entropy. If you add something, remove something else.
| Smell | Fix |
|-------|-----|
| Defensive guards (`if x is None`) | Let it crash, fix root cause |
| Magic constants | Name it or derive from spec.md |
| Two loss variants | Pick one, delete other |
| Stubs / canned modes | Delete; fast-dev-run uses real model |
## Don't
- Don't add losses without removing equivalent complexity. Gradient projection
is a *constraint*, not a competing objective.
- Don't use defensive programming. Fail fast, crash loudly.
- Don't fabricate numbers in journal entries or table prototypes. Mark TODO.
- Don't run real GRPO to test syntax errors. Use `just fast-dev-run`.
- Don't modify `external/rl-rewardhacking/` — it's a third-party pin.
## Decision points (live)
- **H4 fallback**: if Qwen3.5-2B hack rate <30% at step 200, swap to Qwen3-4B
with num_generations=4, batch=64. See spec.md.
- **verl fallback**: if verl breaks on single 96GB, swap to TRL GRPOTrainer.
- **Layer choice for SVD/v_hack**: TBD during smoke; default 60-75% depth per Wu & Tang (2026).
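
The H4 threshold and the default layer band can be written down as named values rather than magic constants. This is a sketch: `Fallback`, `h4_fallback`, and `default_layer_band` are hypothetical names, the model strings are copied from the bullets above rather than verified hub IDs, and "depth" is assumed to mean layer index divided by layer count.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fallback:
    model: str
    num_generations: int
    batch: int


H4_MIN_HACK_RATE = 0.30  # checked at step 200, per the H4 bullet
H4_FALLBACK = Fallback(model="Qwen3-4B", num_generations=4, batch=64)


def h4_fallback(hack_rate_at_step_200: float) -> Optional[Fallback]:
    """Return the fallback config if the hack rate is below 30%, else None."""
    if hack_rate_at_step_200 < H4_MIN_HACK_RATE:
        return H4_FALLBACK
    return None


def default_layer_band(n_layers: int) -> list[int]:
    """Layer indices whose depth i / n_layers lies in [0.60, 0.75].

    Integer arithmetic avoids float rounding at the band edges.
    """
    return [i for i in range(n_layers) if 60 * n_layers <= 100 * i <= 75 * n_layers]
```

For example, under these assumptions a 28-layer model gives `default_layer_band(28) == [17, 18, 19, 20, 21]`.
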