mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
646edfc7af
Deletes 7 source files that were superseded but never removed: run.py, grad_proj.py, extract_vhack.py (older twin-NLL extractor), grpo_smoke.py, grpo_proj_smoke.py (smoke harnesses replaced by train.py "smoke" subcommand), phase2_analyze.py (pilot is past), probe_uat.py (UAT pipeline is past). Drops matching justfile recipes (vhack-check, phase2-analyze, probe-uat) and the BASE constant that pointed at run.py. Updates AGENTS/README references to the stale fast-dev-run recipe (now just smoke / smoke-vanilla). Verified by running just smoke-vanilla --steps=2 end-to-end. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
3.0 KiB
AGENTS.md — projected_grpo
This is novel ML research. Not in your training data. Extrapolate carefully.
Project in one paragraph
Test whether SVD-basis gradient projection against an extracted hack-direction reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from Rebound (Wu & Tang 2026) by intervening at the gradient level rather than the advantage level. Differs from AntiPaSTO (the user's prior work) by using unpaired GRPO rollouts rather than paired-preference contrast.
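As a rough illustration of what "intervening at the gradient level" can mean, the sketch below removes the component of a gradient that lies along a given direction. It assumes the gradient and the hack direction are flat vectors in the same space; `project_out`, `g`, and `v_hack` are illustrative names in NumPy, not the repo's code, and the project's actual SVD-basis projection may differ.

```python
import numpy as np


def project_out(g: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Return g with its component along direction v removed."""
    v_unit = v / np.linalg.norm(v)  # normalise so the result ignores the scale of v
    return g - np.dot(g, v_unit) * v_unit


rng = np.random.default_rng(0)
g = rng.normal(size=8)       # hypothetical flattened gradient
v_hack = rng.normal(size=8)  # hypothetical hack direction in the same space
g_proj = project_out(g, v_hack)
```

After the projection, `g_proj` is orthogonal to `v_hack`, so an update along it does not move the parameters in that direction to first order. This is a constraint on the update, not an extra loss term.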
Inherit global rules from ~/.claude/CLAUDE.md.
Workflow
- Read docs/spec.md for the preregistered plan.
- Read docs/brainstorm/extracted_prefs.md for design rationale.
- New sweep arms get recipes in justfile with `# H:` hypothesis comments.
- `just smoke` before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
- Real runs go through `pueue` on the 96GB GPU box. Label each job with `why:` and `resolve:`.
- Head docs/RESEARCH_JOURNAL.md for latest results.
- No `tests/` dir; `smoke` is the correctness gate.
External dependencies
external/rl-rewardhacking/ is Ariahw's repo (verl-based GRPO + LeetCode dataset + reward hacking monitors). We import from it; we do NOT modify it. Sync with `just sync-external`.
Code style
- `einops` for reshape, `einsum` for contractions
- `jaxtyping` on function inputs/outputs only
- `polars` v1 API; `loguru`; `tabulate` for log tables
- Single-letter dims: `b s h d r` (batch/seq/head/dim/rank)
- Capital suffix for projected spaces: `gS` = gradient in SVD top-m basis
- Greek letters/symbols for math-heavy code (cos α, ||g||)
Tensor shapes glossary
- `v_hack`: `Float[Tensor, "d"]` — single direction in residual stream
- `V_m`: `Float[Tensor, "d m"]` — top-m right singular vectors of W
- `g`: `Float[Tensor, "d_out d_in"]` for a weight grad; flatten to `"D"` for projection
- `cos_align`: `Float[Tensor, ""]` — scalar
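The shapes in this glossary can be checked with a small sketch. It uses NumPy as a stand-in for the project's tensors, with made-up sizes and random data. `D_ref` is a hypothetical reference direction: how `v_hack` is lifted into weight-gradient space is not specified here, so the cosine is only meant to show the scalar shape of `cos_align`.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, m = 6, 5, 2  # toy sizes, chosen only for illustration

W = rng.normal(size=(d_out, d_in))  # stand-in weight matrix
g = rng.normal(size=(d_out, d_in))  # stand-in weight grad, "d_out d_in"

# Rows of Vh are the right singular vectors of W; keep the top m as columns.
_, _, Vh = np.linalg.svd(W, full_matrices=False)
V_m = Vh[:m].T  # "d m", with d == d_in

# gS: gradient coordinates in the SVD top-m basis (capital suffix = projected space).
gS = np.einsum("oi,im->om", g, V_m)

# Flatten to "D" and compute a scalar cosine against a hypothetical direction.
D_g = g.reshape(-1)
D_ref = rng.normal(size=D_g.shape)
cos_align = float(D_g @ D_ref / (np.linalg.norm(D_g) * np.linalg.norm(D_ref)))
```

The columns of `V_m` are orthonormal, so `gS` holds plain coordinates and no extra normalisation is needed when mapping back with `gS @ V_m.T`.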
Compression over accretion
Every edit should reduce entropy. If you add something, remove something else.
| Smell | Fix |
|---|---|
| Defensive guards (`if x is None`) | Let it crash, fix root cause |
| Magic constants | Name it or derive from spec.md |
| Two loss variants | Pick one, delete other |
| Stubs / canned modes | Delete; smoke uses real model |
Don't
- Don't add losses without removing equivalent complexity. Gradient projection is a constraint, not a competing objective.
- Don't use defensive programming. Fail fast, crash loudly.
- Don't fabricate numbers in journal entries or table prototypes. Mark TODO.
- Don't run real GRPO to test syntax errors. Use `just smoke`.
- Don't modify `external/rl-rewardhacking/` — it's a third-party pin.
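To make the fail-fast rule concrete, here is a small sketch of a helper written without a defensive guard. `hack_rate` is a hypothetical function for illustration, not something taken from the repo.

```python
def hack_rate(n_hacked: int, n_total: int) -> float:
    """Fraction of rollouts flagged as reward hacks."""
    # Deliberately no `if n_total == 0: return 0.0` guard: an empty batch
    # means something upstream is broken, so ZeroDivisionError should surface.
    return n_hacked / n_total
```

A guarded version would silently report a 0% hack rate for an empty batch, which is exactly the kind of number that should never reach a journal entry.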
Decision points (live)
- H4 fallback: if Qwen3.5-2B hack rate <30% at step 200, swap to Qwen3-4B with num_generations=4, batch=64. See spec.md.
- verl fallback: if verl breaks on single 96GB, swap to TRL GRPOTrainer.
- Layer choice for SVD/v_hack: TBD during smoke; default 60-75% depth per Wu-Tang.
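The two numeric decision rules above can be written down as small helpers. This is a sketch under assumptions: `depth_band` and `needs_h4_fallback` are hypothetical names, and relative depth is taken to be the 0-based layer index divided by the layer count, which spec.md may define differently.

```python
import math


def depth_band(n_layers: int, lo: float = 0.60, hi: float = 0.75) -> range:
    """0-based layer indices i with lo <= i / n_layers <= hi (assumed convention)."""
    start = math.ceil(lo * n_layers)
    stop = math.floor(hi * n_layers)
    return range(start, stop + 1)


def needs_h4_fallback(hack_rate_at_step_200: float, threshold: float = 0.30) -> bool:
    """True when the step-200 hack rate is below the H4 threshold."""
    return hack_rate_at_step_200 < threshold
```

For example, `depth_band(28)` covers layers 17 through 21 under this convention, and a measured hack rate of 0.25 at step 200 would trigger the fallback.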