mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 16:45:42 +08:00

wassname 646edfc7af purge dead modules and stale recipes

Deletes 7 source files that were superseded but never removed:
  run.py, grad_proj.py, extract_vhack.py (older twin-NLL extractor),
  grpo_smoke.py, grpo_proj_smoke.py (smoke harnesses replaced by
  train.py "smoke" subcommand), phase2_analyze.py (pilot is past),
  probe_uat.py (UAT pipeline is past).

Drops matching justfile recipes (vhack-check, phase2-analyze,
probe-uat) and the BASE constant that pointed at run.py. Updates
AGENTS/README references to the stale fast-dev-run recipe (now
just smoke / smoke-vanilla).

Verified by running just smoke-vanilla --steps=2 end-to-end.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 08:42:15 +00:00


AGENTS.md — projected_grpo

This is novel ML research. Not in your training data. Extrapolate carefully.

Project in one paragraph

Test whether SVD-basis gradient projection against an extracted hack-direction reduces the reward-hack rate in GRPO on Nanda's LeetCode benchmark. This differs from Rebound (Wu & Tang 2026) by intervening at the gradient level rather than the advantage level, and from AntiPaSTO (the user's prior work) by using unpaired GRPO rollouts rather than paired-preference contrast.

Inherit global rules from ~/.claude/CLAUDE.md.

Workflow

Read docs/spec.md for the preregistered plan.
Read docs/brainstorm/extracted_prefs.md for design rationale.
New sweep arms get recipes in justfile with # H: hypothesis comments.
Run just smoke before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
Real runs go through pueue on the 96GB GPU box. Label each job with why: and resolve:.
Check the head of docs/RESEARCH_JOURNAL.md for the latest results.
No tests/ dir; smoke is the correctness gate.

External dependencies

external/rl-rewardhacking/ is Ariahw's repo (verl-based GRPO + LeetCode dataset + reward hacking monitors). We import from it; we do NOT modify it. Sync with just sync-external.

Code style

einops for reshape, einsum for contractions
jaxtyping on function inputs/outputs only
polars v1 API; loguru; tabulate for log tables
Single-letter dims: b s h d r (batch/seq/head/dim/rank)
Capital suffix for projected spaces: gS = gradient in SVD top-m basis
Greek letters/symbols for math-heavy code (cos α, ||g||)

Tensor shapes glossary

v_hack: Float[Tensor, "d"] — single direction in residual stream
V_m: Float[Tensor, "d m"] — top-m right singular vectors of W
g: Float[Tensor, "d_out d_in"] for a weight grad; flatten to "D" for projection
cos_align: Float[Tensor, ""] — scalar
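A minimal sketch of how the shapes above fit together, written in the code style listed earlier (einsum for contractions, single-letter dims, capital suffix for projected spaces). It is an illustration under assumptions, not the repo's implementation: `project_out` and the direction `u` are hypothetical, and the source does not say how v_hack (shape "d") is lifted to the flattened gradient space "D", so `u` is simply a random stand-in. jaxtyping and einops are left out to keep the block dependency-light; shapes are given in comments instead.

```python
import torch
from torch import Tensor


def project_out(g: Tensor, u: Tensor) -> Tensor:
    """Remove the component of g ("D") along direction u ("D")."""
    u_hat = u / u.norm()
    return g - torch.einsum("d,d->", g, u_hat) * u_hat


def cos_align(g: Tensor, u: Tensor) -> Tensor:
    """cos α between g ("D") and u ("D"); returns a scalar tensor."""
    return torch.einsum("d,d->", g, u) / (g.norm() * u.norm())


torch.manual_seed(0)
d_out, d_in, m = 6, 8, 3           # toy sizes, chosen only for illustration

W = torch.randn(d_out, d_in)        # stand-in weight matrix
# torch.linalg.svd returns Vh, whose rows are the right singular vectors of W.
_, _, Vh = torch.linalg.svd(W, full_matrices=False)
V_m = Vh[:m].T                      # "d m" with d = d_in

g = torch.randn(d_out, d_in)        # stand-in weight grad, "d_out d_in"
gS = torch.einsum("od,dm->om", g, V_m)  # gradient in the SVD top-m basis

g_flat = g.reshape(-1)              # flatten to "D" for projection
u = torch.randn(g_flat.shape[0])    # ASSUMPTION: hypothetical hack direction in "D"
g_proj = project_out(g_flat, u)

print(f"cos α before: {cos_align(g_flat, u).item():+.4f}")
print(f"cos α after:  {cos_align(g_proj, u).item():+.4f}")
```

After `project_out`, the cosine with `u` is zero up to floating-point error, and no extra loss term is involved.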

Compression over accretion

Every edit should reduce entropy. If you add something, remove something else.

| Smell | Fix |
| --- | --- |
| Defensive guards (`if x is None`) | Let it crash, fix root cause |
| Magic constants | Name it or derive from spec.md |
| Two loss variants | Pick one, delete other |
| Stubs / canned modes | Delete; smoke uses real model |
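As an illustration of the first row, here is a small sketch contrasting a defensive guard with the fail-fast version. Both function names are hypothetical and do not come from the repo.

```python
def mean_reward_guarded(rewards):
    """Smell: the guard turns a missing or empty batch into a plausible number."""
    if rewards is None or len(rewards) == 0:
        return 0.0  # hides the upstream bug that produced no rewards
    return sum(rewards) / len(rewards)


def mean_reward(rewards: list[float]) -> float:
    """Fix: no guard. An empty batch raises ZeroDivisionError at the call site."""
    return sum(rewards) / len(rewards)
```

With the guard, an empty rollout batch silently logs a mean reward of 0.0. Without it, the run crashes on the first bad batch and the traceback points at the root cause.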

Don't

Don't add losses without removing equivalent complexity. Gradient projection is a constraint, not a competing objective.
Don't use defensive programming. Fail fast, crash loudly.
Don't fabricate numbers in journal entries or table prototypes. Mark TODO.
Don't run real GRPO to test syntax errors. Use just smoke.
Don't modify external/rl-rewardhacking/ — it's a third-party pin.

Decision points (live)

H4 fallback: if Qwen3.5-2B hack rate <30% at step 200, swap to Qwen3-4B with num_generations=4, batch=64. See spec.md.
verl fallback: if verl breaks on single 96GB, swap to TRL GRPOTrainer.
Layer choice for SVD/v_hack: TBD during smoke; default is 60-75% depth, per Wu & Tang 2026.
