evil_MoE/AGENTS.md
wassname 120400c5f5 setup
2026-05-23 10:40:02 +08:00

AGENTS.md — projected_grpo

This is novel ML research. Not in your training data. Extrapolate carefully.

Project in one paragraph

Test whether SVD-basis gradient projection against an extracted hack-direction reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from Rebound (Wu & Tang 2026) by intervening at the gradient level rather than the advantage level. Differs from AntiPaSTO (the user's prior work) by using unpaired GRPO rollouts rather than paired-preference contrast.

Inherit global rules from ~/.claude/CLAUDE.md.

Workflow

  • Read docs/spec.md for the preregistered plan.
  • Read docs/brainstorm/extracted_prefs.md for design rationale.
  • New sweep arms get recipes in justfile with # H: hypothesis comments.
  • just fast-dev-run before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
  • Real runs go through pueue on the 96GB GPU box. Label each job with why: and resolve:.
  • Read the head of docs/RESEARCH_JOURNAL.md for the latest results.
  • No tests/ dir; fast-dev-run is the correctness gate.
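
The pueue labelling rule above could be wrapped in a small helper so the why: and resolve: fields are never left out. This is a minimal sketch, not project code: `build_pueue_argv`, the recipe name `arm-baseline`, and the label layout are all hypothetical. It relies only on pueue's `add --label` option and builds the command without running it.

```python
import shlex


def build_pueue_argv(recipe: str, why: str, resolve: str) -> list[str]:
    """Build a `pueue add` command that queues `just <recipe>` with a labelled purpose."""
    # Label layout is an assumption; the rule only asks that why: and resolve: appear.
    label = f"why: {why} | resolve: {resolve}"
    return ["pueue", "add", "--label", label, "--", "just", recipe]


argv = build_pueue_argv(
    recipe="arm-baseline",  # hypothetical recipe name, not from the justfile
    why="check whether projection lowers hack rate",
    resolve="compare hack rate against the unprojected arm",
)
print(shlex.join(argv))  # copy-pasteable shell command
```

Keeping the label construction in one place means a job cannot be queued without stating what question it answers.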

External dependencies

external/rl-rewardhacking/ is Ariahw's repo (verl-based GRPO + LeetCode dataset + reward hacking monitors). We import from it; we do NOT modify it. Sync with just sync-external.

Code style

  • einops for reshape, einsum for contractions
  • jaxtyping on function inputs/outputs only
  • polars v1 API; loguru; tabulate for log tables
  • Single-letter dims: b s h d r (batch/seq/head/dim/rank)
  • Capital suffix for projected spaces: gS = gradient in SVD top-m basis
  • Greek letters/symbols for math-heavy code (cos α, ||g||)

Tensor shapes glossary

  • v_hack: Float[Tensor, "d"] — single direction in residual stream
  • V_m: Float[Tensor, "d m"] — top-m right singular vectors of W
  • g: Float[Tensor, "d_out d_in"] for a weight grad; flatten to "D" for projection
  • cos_align: Float[Tensor, ""] — scalar
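
To make the shapes above concrete, here is one possible reading of the projection step, written with NumPy instead of torch so it runs without a GPU stack. Everything beyond the glossary names is an assumption: that `d` is the input dimension of `W`, that the hack direction is first expressed in the top-m basis and then removed from each gradient row, and that `cos_align` is the cosine between `g` and its removed component. The dim letter `o` (for d_out) and both function names are hypothetical.

```python
import numpy as np


def top_m_right_singular_vectors(W: np.ndarray, m: int) -> np.ndarray:
    """W: (o, d) weight. Returns V_m: (d, m) with orthonormal columns."""
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    return Vt[:m].T


def project_out_hack(g: np.ndarray, V_m: np.ndarray, v_hack: np.ndarray):
    """g: (o, d) weight grad, V_m: (d, m), v_hack: (d,). Returns (g_proj, cos_align)."""
    vS = np.einsum("dm,d->m", V_m, v_hack)           # hack direction in SVD basis
    vS = vS / np.linalg.norm(vS)                     # unit vector in the top-m subspace
    gS = np.einsum("od,dm->om", g, V_m)              # gradient in SVD basis
    coef = np.einsum("om,m->o", gS, vS)              # per-row component along vS
    g_hack = np.einsum("o,m,dm->od", coef, vS, V_m)  # that component, mapped back to (o, d)
    g_proj = g - g_hack
    # cos α between g and the part removed (assumed definition of cos_align)
    cos_align = float(np.sum(g * g_hack) / (np.linalg.norm(g) * np.linalg.norm(g_hack)))
    return g_proj, cos_align


rng = np.random.default_rng(0)
o, d, m = 6, 8, 4  # toy sizes for illustration only
W = rng.normal(size=(o, d))
g = rng.normal(size=(o, d))
v_hack = rng.normal(size=d)

V_m = top_m_right_singular_vectors(W, m)
g_proj, cos_align = project_out_hack(g, V_m, v_hack)
```

Under these assumptions the update is a hard constraint: `g_proj` has no component along the in-subspace hack direction, and no extra loss term is introduced.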

Compression over accretion

Every edit should reduce entropy. If you add something, remove something else.

| Smell | Fix |
|---|---|
| Defensive guards (if x is None) | Let it crash, fix root cause |
| Magic constants | Name it or derive from spec.md |
| Two loss variants | Pick one, delete other |
| Stubs / canned modes | Delete; fast-dev-run uses real model |
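
The first row can be illustrated with a toy contrast. Both functions are hypothetical and exist only to show the difference; the project would keep the second and delete the first.

```python
def hack_rate_defensive(n_hacked, n_total):
    # Smell: the guard hides the real bug (why is n_total None or 0?)
    # and returns a made-up value that flows silently into later tables.
    if n_total is None or n_total == 0:
        return 0.0
    return n_hacked / n_total


def hack_rate(n_hacked: int, n_total: int) -> float:
    # Fix: no guard. An empty rollout batch raises ZeroDivisionError
    # at the call site, which points straight at the root cause.
    return n_hacked / n_total


print(hack_rate(3, 12))
```

The guarded version reports a hack rate of 0.0 for an empty batch, which is indistinguishable from a real result; the unguarded version stops the run instead.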

Don't

  • Don't add losses without removing equivalent complexity. Gradient projection is a constraint, not a competing objective.
  • Don't use defensive programming. Fail fast, crash loudly.
  • Don't fabricate numbers in journal entries or table prototypes. Mark TODO.
  • Don't run real GRPO to test syntax errors. Use just fast-dev-run.
  • Don't modify external/rl-rewardhacking/ — it's a third-party pin.

Decision points (live)

  • H4 fallback: if Qwen3.5-2B hack rate <30% at step 200, swap to Qwen3-4B with num_generations=4, batch=64. See spec.md.
  • verl fallback: if verl breaks on single 96GB, swap to TRL GRPOTrainer.
  • Layer choice for SVD/v_hack: TBD during smoke; default 60-75% depth per Wu-Tang.
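
The H4 rule is simple enough to state as code with the thresholds named rather than inlined. This is a sketch under assumptions: `ModelChoice`, `h4_choice`, and the constant names are hypothetical, and only the values stated in the H4 bullet appear; all other settings are left to spec.md.

```python
from dataclasses import dataclass

# Named constants taken from the H4 bullet.
H4_CHECK_STEP = 200
H4_MIN_HACK_RATE = 0.30


@dataclass(frozen=True)
class ModelChoice:
    model: str
    overrides: dict  # only settings the H4 bullet states; the rest stays in spec.md


def h4_choice(hack_rate_at_check: float) -> ModelChoice:
    """hack_rate_at_check: fraction in [0, 1] measured at H4_CHECK_STEP."""
    if hack_rate_at_check < H4_MIN_HACK_RATE:
        return ModelChoice("Qwen3-4B", {"num_generations": 4, "batch": 64})
    return ModelChoice("Qwen3.5-2B", {})


print(h4_choice(0.12))
```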