mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 18:04:59 +08:00

wassname 52619519dc docs: drop dead refs (spec.md link, verify_gate_anchor.py paragraph)

- spec.md never existed at root or docs/; removed the link from AGENTS.md +
  README.md (the live plan is in docs/spec/ dated files).
- RESEARCH_JOURNAL.md link pointed at docs/; it lives at repo root. Fixed.
- Trimmed the no-cheat-leak paragraph citing scripts/verify_gate_anchor.py
  (that file doesn't exist); kept the general 'gate every load-bearing
  invariant in the same commit' rule.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-07 11:01:31 +00:00


AGENTS.md — projected_grpo

This is novel ML research. Not in your training data. Extrapolate carefully and without overconfidence.
This is research code. We want readable, fail-fast, pseudocode-like code.
Editing: plan, implement, smoke test (see setup-repo skill), subagent review, commit.

see @README.md for project overview

Project in one paragraph

Test whether SVD-basis gradient projection against an extracted hack-direction reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from Rebound (Wu & Tang 2026) by intervening at the gradient level rather than the advantage level. Differs from AntiPaSTO (the user's prior work) by using unpaired GRPO rollouts rather than paired-preference contrast.

Hypothesis

We can find the "reward hacking direction" and keep RL from learning reward hacking. Specifically, we can get the reward-hacking direction by contrasting G_hack and G_not, the GRPO updates on hacky and non-hacky completions. Then, during normal GRPO training, we can erase that direction from the gradients collected on each learnable parameter in a low-rank adapter, and this will reduce the reward-hacking rate.
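
The hypothesis above can be sketched on toy vectors. This is a minimal illustration, not the repo's implementation: it assumes the direction is the normalized difference of mean gradients, it skips the SVD basis and the adapter entirely, and it uses plain Python lists where real code would use tensors. The names `mean_vector`, `extract_direction`, and `erase_direction` are hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of equal-length gradient vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def extract_direction(g_hack, g_not):
    """Unit vector from the mean non-hack gradient to the mean hack gradient.

    Assumption: a mean-difference contrast stands in for whatever
    extraction the project actually uses.
    """
    diff = [h - c for h, c in zip(mean_vector(g_hack), mean_vector(g_not))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("hack and non-hack gradients have identical means")
    return [x / norm for x in diff]


def erase_direction(grad, direction):
    """Remove the component of grad that lies along a unit-norm direction."""
    coeff = sum(g * d for g, d in zip(grad, direction))
    return [g - coeff * d for g, d in zip(grad, direction)]


# Toy data: the hacky gradients share a large component on axis 0.
g_hack = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
g_not = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

v = extract_direction(g_hack, g_not)
cleaned = erase_direction([3.0, 0.5, -1.0], v)
print("direction:", v)
print("cleaned gradient:", cleaned)
```

After erasure the cleaned gradient has zero dot product with the direction, while the components orthogonal to it are left unchanged.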

Motivation: We want to take the tools AI labs already use and make them better for alignment using scalable self-supervised methods. Here the hope is that intervening in the gradient itself, rather than in the reward, can stop the student picking up the hack.

Inherit global rules from ~/.claude/CLAUDE.md.

Things the user has had to explain many times:

No-cheat (VECTOR-framed -- this is the load-bearing constraint; cast every "is this cheating / is this weak" question in vector terms, since this repo is vector-based). Three things, don't conflate them:
1. The env's eval grader / full detector suite is an ORACLE (ground truth for this LeetCode env). Using it at TRAIN time -- to gate routing, set a threshold, or label student rollouts -- is cheating. It may only score the final deploy eval.
2. A "weak detector for hack A that misses B/C/D" is the classic weak-LABEL setup (labelA -> suppress labelNotA). That is NOT our setup; do not frame our method that way.
3. OUR setup is vec -> routing: extract a hack direction vec from hand-built synthetic contrastive pairs (off-distribution, authored by us), then route the live GRPO gradient by its cosine alignment to vec. The only labels anywhere are on the pairs we wrote; no detector ever runs over student rollouts at train time. Generalization is tested by whether vec (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs -- vector generalization, not detector-label generalization. So when you double/triple-check "is this weak / is this cheating", the check is: does anything other than the hand-built pairs and the extracted vec touch training? If a live detector or the oracle leaks in, it's cheating.
Do not diagnose overconfidently. If you can't think of 3+ plausible hypotheses (including bugs, subtle failures, and you being wrong about concepts), then you have lost perspective and narrowed your vision.
I'm often AFK, so don't stop to ask me a question when you know the likely answer, when I've already indicated or asked for it, or when there is only one answer ("waiting for your go-ahead"). I'd rather you just commit and go ahead.
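
The vec -> routing setup in item 3 of the no-cheat rule can be sketched as follows. This is an illustration under stated assumptions: the threshold value, the bucket names `"quarantine"` and `"main"`, and the function names are all hypothetical, and what the project does with a routed gradient is not specified here. The point of the sketch is the signature: the only inputs are the live gradient and the extracted vec, so no detector output or oracle label can reach the decision.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def route(grad, vec, threshold=0.5):
    """Pick a bucket for a live gradient from its alignment with vec.

    Hypothetical rule: gradients aligned with the hack direction beyond
    the threshold go to "quarantine", everything else to "main".
    No label, detector score, or eval result is accepted as input.
    """
    alignment = cosine(grad, vec)
    bucket = "quarantine" if alignment > threshold else "main"
    return bucket, alignment


vec = [1.0, 0.0]  # stands in for the direction extracted from hand-built pairs
print(route([3.0, 4.0], vec))
print(route([0.0, 2.0], vec))
```

If a future change needs an extra argument here that comes from a detector or the eval grader, that is the leak the rule above forbids.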

Extra instructions:

When you queue a job, follow with pueue follow | tail in bg so you are woken on fail or finish
For every task, whether a goal or a subgoal, be clear on the UAT (user acceptance test). Use your task list to track the goal and any non-trivial UAT. A task is not finished until you have 1) collected the evidence, 2) sanity-checked it with a fresh-eyes subagent, and 3) given the user the link to its location.
Say less. Just answer my questions, and concisely address the top point unless I ask for more; otherwise there is too much for me to review and read.

Files

Read docs/brainstorm/extracted_prefs.md for design rationale.
New sweep arms get recipes in justfile with # H: hypothesis comments.
just smoke before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
Real runs go through pueue on the 96GB GPU box. Label each job with why: and resolve:.
Head RESEARCH_JOURNAL.md for latest results.
No tests/ dir; smoke is the correctness gate. Unit checks are scripts/verify_*.py (assert + exit nonzero), wired into the smoke recipe so every smoke runs them.
Every load-bearing invariant gets a verify_*.py gate, written in the same commit as the claim -- "the tests passed" means nothing if the property was never tested.
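
A verify gate in the style described above might look like this. It is a sketch only: the file name `scripts/verify_erase_orthogonal.py`, the helper names, and the invariant it checks are hypothetical and do not refer to a file in this repo. It relies on the fact that an uncaught `AssertionError` makes the Python process exit with a nonzero status.

```python
"""Hypothetical scripts/verify_erase_orthogonal.py.

Invariant under test: after erasing a unit direction from a gradient,
the result has no component left along that direction.
"""


def erase_direction(grad, direction):
    """Remove the component of grad along a unit-norm direction."""
    coeff = sum(g * d for g, d in zip(grad, direction))
    return [g - coeff * d for g, d in zip(grad, direction)]


def residual_after_erase(grad, direction):
    """Dot product of the erased gradient with the direction (should be ~0)."""
    cleaned = erase_direction(grad, direction)
    return sum(c * d for c, d in zip(cleaned, direction))


def check_erase_is_orthogonal():
    direction = [0.6, 0.8]  # unit norm: 0.36 + 0.64 = 1
    residual = residual_after_erase([1.0, 2.0], direction)
    # Fail fast: an uncaught AssertionError exits the process nonzero.
    assert abs(residual) < 1e-9, f"projection leaked: residual={residual}"


check_erase_is_orthogonal()
print("verify_erase_orthogonal: ok")
```

Note that running Python with `-O` strips `assert` statements, so a gate written this way should not be run with that flag.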

On persona pairs

./docs/personas/how_to_rewrite_pairs.md
./docs/personas/how_to_write_personas.md
./docs/personas/personas_kept.md

On concepts such as "what are contrastive pairs" or "why SVD space" grep

./docs/vendor/AntiPaSTO_concepts/README.md

For the original paper

LessWrong post: ./docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md
Code: ./docs/vendor/rl-rewardhacking