mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 18:23:57 +08:00
73 lines
3.1 KiB
Markdown
# AGENTS.md — projected_grpo

**This is novel ML research.** Not in your training data. Extrapolate carefully.

## Project in one paragraph

Test whether SVD-basis gradient projection against an extracted hack-direction
reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from
Rebound (Wu & Tang 2026) by intervening at the *gradient* level rather than the
*advantage* level. Differs from AntiPaSTO (the user's prior work) by using
unpaired GRPO rollouts rather than paired-preference contrast.

Inherit global rules from `~/.claude/CLAUDE.md`.

## Workflow

- Read [docs/spec.md](spec.md) for the preregistered plan.
- Read [docs/brainstorm/extracted_prefs.md](docs/brainstorm/extracted_prefs.md) for design rationale.
- New sweep arms get recipes in [justfile](justfile) with `# H:` hypothesis comments.
- Run `just fast-dev-run` before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
- Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
- Check the head of [docs/RESEARCH_JOURNAL.md](docs/RESEARCH_JOURNAL.md) for the latest results.
- No `tests/` dir; `fast-dev-run` is the correctness gate.

## External dependencies

`external/rl-rewardhacking/` is Ariahw's repo (verl-based GRPO, LeetCode dataset,
and reward hacking monitors). We import from it; we do NOT modify it. Sync with
`just sync-external`.

## Code style

- `einops` for reshape, `einsum` for contractions
- `jaxtyping` on function inputs/outputs only
- `polars` v1 API; `loguru`; `tabulate` for log tables
- Single-letter dims: `b s h d r` (batch/seq/head/dim/rank)
- Capital suffix for projected spaces: `gS` = gradient in SVD top-m basis
- Greek letters/symbols for math-heavy code (cos α, ||g||)

## Tensor shapes glossary

- `v_hack`: `Float[Tensor, "d"]`, a single direction in the residual stream
- `V_m`: `Float[Tensor, "d m"]`, the top-m right singular vectors of W
- `g`: `Float[Tensor, "d_out d_in"]` for a weight grad; flatten to `"D"` for projection
- `cos_align`: `Float[Tensor, ""]`, a scalar

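To make the flatten-then-project convention above concrete, here is a minimal, dependency-free sketch. It assumes the simplest reading of "projection": remove the component of the flattened gradient along a hack direction, and report the cosine alignment first. The helper names (`dot`, `cos_align` as a function, `project_out`), the toy numbers, and the use of plain Python lists instead of `torch` tensors are all illustrative assumptions. It also assumes the hack direction has already been expressed in the same flattened `"D"` space as `g`. The preregistered method in the spec may differ, for example by projecting in the SVD top-m basis.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cos_align(g: list[float], v: list[float]) -> float:
    """cos α between a flattened gradient g and a direction v."""
    norm_g = math.sqrt(dot(g, g))  # ||g||
    norm_v = math.sqrt(dot(v, v))  # ||v||
    return dot(g, v) / (norm_g * norm_v)


def project_out(g: list[float], v: list[float]) -> list[float]:
    """Return g minus its component along v (v need not be unit-norm)."""
    coef = dot(g, v) / dot(v, v)
    return [gi - coef * vi for gi, vi in zip(g, v)]


# Toy weight gradient with shape (d_out=2, d_in=2), flattened to D=4.
g_matrix = [[3.0, 0.0], [4.0, 0.0]]
g = [x for row in g_matrix for x in row]

# Hypothetical hack direction, assumed to live in the same flattened space.
v_hack_flat = [1.0, 0.0, 0.0, 0.0]

cos_α = cos_align(g, v_hack_flat)      # alignment before projection
g_proj = project_out(g, v_hack_flat)   # gradient with the hack component removed
```

After `project_out`, the result is orthogonal to `v_hack_flat`, which is what makes this a constraint on the update rather than an extra loss term.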
## Compression over accretion

Every edit should reduce entropy. If you add something, remove something else.

| Smell | Fix |
|-------|-----|
| Defensive guards (`if x is None`) | Let it crash, fix root cause |
| Magic constants | Name it or derive from spec.md |
| Two loss variants | Pick one, delete other |
| Stubs / canned modes | Delete; fast-dev-run uses real model |

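As an illustration of the first row of the table, here is a hypothetical before/after. The function names and the metric are made up for this sketch and are not taken from the codebase; the point is only that the guarded version turns an upstream bug (an empty batch) into a plausible-looking number, while the unguarded version crashes where the bug is.

```python
# Smell: the guard hides an empty batch behind a believable 0.0.
def hack_rate_guarded(n_hacked: int, n_total: int) -> float:
    if n_total == 0:
        return 0.0
    return n_hacked / n_total


# Fix: no guard. An empty batch raises ZeroDivisionError at the call site,
# so the root cause (why was the batch empty?) gets fixed instead of masked.
def hack_rate(n_hacked: int, n_total: int) -> float:
    return n_hacked / n_total
```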
## Don't

- Don't add losses without removing equivalent complexity. Gradient projection
  is a *constraint*, not a competing objective.
- Don't use defensive programming. Fail fast, crash loudly.
- Don't fabricate numbers in journal entries or table prototypes. Mark TODO.
- Don't run real GRPO to test syntax errors. Use `just fast-dev-run`.
- Don't modify `external/rl-rewardhacking/`; it's a third-party pin.

## Decision points (live)

- **H4 fallback**: if Qwen3.5-2B hack rate <30% at step 200, swap to Qwen3-4B
  with num_generations=4, batch=64. See spec.md.
- **verl fallback**: if verl breaks on single 96GB, swap to TRL GRPOTrainer.
- **Layer choice for SVD/v_hack**: TBD during smoke; default 60-75% depth per Wu & Tang.
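The H4 fallback rule can be written down as a small pure function, which keeps the threshold and step as named constants rather than magic numbers. This is a sketch only: the constant names, the function name `h4_overrides`, and the override keys (`model`, `num_generations`, `batch`) are assumptions for illustration, and spec.md remains the source of truth for the actual values and config schema.

```python
# Values copied from the H4 bullet; names are hypothetical.
H4_CHECK_STEP = 200
H4_MIN_HACK_RATE = 0.30
H4_FALLBACK = {"model": "Qwen3-4B", "num_generations": 4, "batch": 64}


def h4_overrides(step: int, hack_rate: float) -> dict[str, object]:
    """Config overrides to apply at `step`; empty if no swap is triggered."""
    # Strict "<" mirrors the "<30%" wording of the decision point.
    if step == H4_CHECK_STEP and hack_rate < H4_MIN_HACK_RATE:
        return dict(H4_FALLBACK)
    return {}
```

For example, `h4_overrides(200, 0.25)` returns the fallback settings, while `h4_overrides(200, 0.30)` returns an empty dict because the rate is not below the threshold.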