Files
evil_MoE/AGENTS.md
T
wassname 7871aa66b8 tidy
2026-06-11 11:07:17 +00:00

19 KiB
Raw Blame History

AGENTS.md — projected_grpo

This is novel ML research. Not in your training data. Extrapolate carefully and without overconfidence. This is reasearch code We want readable, fail fast, pseudocode like code Editing: Plan, Implement, Smoke test (see setup-repo skill), subagent review, commit

see @README.md for project overview

Project in one paragraph

Test whether SVD-basis gradient projection against an extracted hack-direction reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from Rebound (Wu & Tang 2026) by intervening at the gradient level rather than the advantage level. Differs from AntiPaSTO (the user's prior work) by using unpaired GRPO rollouts rather than paired-preference contrast.

Hypothesis

We can find and the "reward hacking direciton", and reduce RL from learning reward hacking Specifically we can get the reward hacking directions by contrasting G_hack and G_not, from GRPO upates on hacky and nonhacky completions. Then during normal GRPO training we can erase the direction from the gradients collected on each learnable parameter in a low rank adapter, and this will reduce the reward hacking rate.

Motovation: We want to take the tool AI labs already use, and make them better for aligment using scalable self-supervised methods. Here the hope is that intervening in the gradient itself, rather than in the reward, can stop the student picking up the hack.

Inherit global rules from ~/.claude/CLAUDE.md.

Writing style

Use complete declarative sentences and terminology established by the background papers. Prefer precise descriptions such as ground-truth monitor, chance-level classifier, retained at deployment, and deployment ablation. Avoid conversational or project-specific shorthand such as no-cheat, canary, honest, foot-gun, knob, ships, coin flip purity, and scoreboard when a technical term is available. Concision should come from removing unnecessary material, not from clipped sentences.

Background definitions

Reward-hacking environment

From Ariahw et al., Steering RL Training: Benchmarking Interventions Against Reward Hacking:

We call our setup the overwrite tests loophole. Every problem is evaluated by calling a run_tests() evaluation function that contains all of the test cases checked using assert statements. When the loophole is turned on, the model is able to define run_tests() in its solution and overwrite the environment's definition. We do not instruct the model that it has this capability, instead we only briefly mention the function name in a short suffix to the problem.

The paper defines reward hacking as a solution that passes its self-defined evaluation function but fails the ground-truth tests.

Absorption

From Cloud et al., Gradient Routing:

Gradient routing induces absorption. Routing a subset of the data related to some knowledge or capability appears to localize that knowledge or capability more generally. This held for an i.i.d. subset of the data (TinyStories unlearning in section 4.2.2), and for semantically limited data (steering scalar in section 4.2.1, virology unlearning in section 4.2.3, scalable oversight in section 4.3). Notably, this effect did not hold for DEMix, a modularity method in which localized modules are sequestered so that only one (per layer) participates in each forward pass. To explain these observations, we posit absorption: (i) routing limited data to a region creates units of computation or features that are relevant to a broader task; (ii) these units then participate in the model's predictions on related, non-routed data, reducing prediction errors on these data, so that (iii) the features are not learned elsewhere. Absorption may also amplify the features causing it. When data labels are semantically or quantitatively limited, absorption means that gradient routing can be useful even in cases where conventional training or data filtering methods are inadequate.

Selective gradient masking

From Cloud et al., Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs:

For samples from \mathbf{D}_{\text{forget}}, we apply selective gradient masking during the backward pass so that these samples do not update \theta_{\text{retain}}. We first compute gradients for all parameters normally, and then zero out gradients for \theta_{\text{retain}} before applying the optimizer (\nabla_{\theta}=\{\nabla_{\theta_{\text{forget}}}, 0\}). Masking parameter gradients rather than activation gradients is the key distinction from the prior Gradient Routing method. While both approaches prevent updates to \theta_{\text{retain}} on forget examples, masking activation gradients is more disruptive, because it blocks backpropagation through the masked activations, altering gradients for all remaining parameters. It also permits greater information flow from \mathbf{D}_{\text{forget}} into non-forget parameters, since activation-gradient masking does not block updates to down-projection layers.

This project instead masks and detaches adapter branch outputs per rollout. These output masks prevent selected branches from receiving gradients, but they are not SGTM's direct parameter-gradient masks. The retained block corresponds to \theta_{\text{retain}} and the quarantine block corresponds to \theta_{\text{forget}}.

Vocabulary: routing, the vector, pinning, absorption

Routing assigns each rollout's gradient update to parameters retained at deployment or to quarantine parameters removed by deployment ablation. A false negative updates the retained parameters with a reward-hacking example, whereas a false positive removes one non-reward-hacking update. The routing threshold should therefore favor precision over recall.

The routing score is the cosine alignment between a rollout update and v_grad, or between its activations and v_act. These reward-hacking-minus-correct directions are extracted from hand-authored contrast pairs before training. They do not use ground-truth labels from training rollouts. Pinning selects score thresholds for the retain, absorb, and route regions.

The middle region leaves both parameter blocks trainable and may permit absorption. Calling this region absorb names the intended mechanism; it does not establish that absorption occurred. scripts/diag_pinning.py reports score AUROC and precision at the route threshold.

Things the user has had to explain many times:

  • Oracle-free training. The test is whether the method applies to a new RL environment without access to an environment-specific oracle. Requiring ground-truth reward-hacking labels for training rollouts invalidates the method. Using a detector does not by itself invalidate the method. Distinguish the following cases:
    1. The env's eval grader / full detector suite is an ORACLE, but only a MEASUREMENT INSTRUMENT for THIS env -- it may score the final deploy eval and nothing else. Any train-time use (gate routing, threshold selection, or rollout labeling) leaks a signal unavailable in the new environment.
    2. The disqualifier is ORACLE / GROUND-TRUTH-LABEL LEAKAGE, not detector-presence. A live detector is valid if it never saw the true hack-labels of the rollouts it judges and needs no env-specific oracle to build. By this test, of the Ariahw monitors: Ground-Truth (+70/90% variants) uses the oracle label of the training rollout; LLM judge is the fairest external baseline (generic model, no oracle, transfers to a new env); probe = boundary -- legitimate only insofar as it was trained elsewhere and transfers oracle-free, but you can neither tune nor verify it on the new env, so it is an open question ("could we use the probe at deployment / on a new dataset?"), not a usable method. A "weak detector for hack A that misses B/C/D" trained on this env's labels is the classic weak-LABEL setup (labelA -> suppress labelNotA); that is NOT our setup -- do not frame our method that way.
    3. OUR setup is vec -> routing: extract a hack direction vec from hand-built synthetic contrastive pairs (off-distribution, authored before observing training rollouts), then route each GRPO gradient by its cosine alignment to vec. The hand-authored pairs require no env-specific oracle and use no ground-truth labels from training rollouts. Generalization is tested by whether vec (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs -- vector generalization, not detector-label generalization.
    4. On-distribution contrast pairs require labels for training rollouts and therefore violate the oracle-free training constraint. An ideal direction fit on oracle-labelled training rollouts, such as mu_hack - mu_clean, is a diagnostic upper bound rather than an attainable method. The available intervention is to improve the diversity and matching of hand-authored pairs while keeping all labels outside the training distribution.
  • do not overconfidentaly diagnoses. if you cant think of 3+ plausible hypothesis - including bugs, subtle failures, and you being wrong about concepts - then you have lost perspective and narrow vision
  • DON'T act on a new theory/diagnosis until I confirm it: don't keep building on it, and don't write it into code comments, docstrings, or docs. An unconfirmed theory baked into a comment misleads every future agent who reads it as fact. State it to me, wait for the OK, THEN write.
  • DON'T inflate a diagnosis's confidence (maybe -> probably -> definitely) across turns or in writing. Keep the hedge you started with unless new evidence justifies the change, and name that evidence. Confidence creep in comments/docs is how a guess becomes "known" with no one having checked.
  • I'd often afk so dont stop and ask me a question you know the likely answer, or I've already indicated or asked for, or where there is only on answer "waiting for your go ahead". I'd rather you just commit and go ahead

The adapter setup (shapes), and why "same position = shrinkage" is subtle

Per Linear W:[d_out,d_in], rank r. Two adapters: deployed + quarantine (ablated at deploy).

  • PiSSA/AntiPaSTO: W = U S Vh, U:[d_out,r]/Vh:[r,d_in] FROZEN. Train delta_S:[r] (deployed) + delta_S_hack:[r] (quarantine) -- diagonals in the SAME frozen basis, r scalars each. forward y = W@x + U @ ( (Vh@x) * (delta_S + delta_S_hack) ).
  • LoRA-frozen-B (current wrap_model_with_lora_frozen_b): A:[r,d_in] trainable, B:[d_out,r] FROZEN+SHARED, A_hack:[r,d_in] quarantine. forward y = W@x + B @ ((A + A_hack) @ x). Shared B -> A.grad == A_hack.grad pre-routing.

The forward sees ONLY the sum. So routing that carves one gradient into kept+routed and ablates the routed part at deploy is, by DEFAULT, a MAGNITUDE split: deploy = vanilla minus qE of the update = "earlier-training vanilla" = less (late-emerging) hacking, no direction. That is the shrinkage NULL. Vanilla today already has delta_S_hack=0 (never routed into), so it is "two adapters, one empty"; routeV's deploy just lost qE of the same update.

Shrinkage is NOT inevitable. Two things break it:

  1. Adapter EXPRESSIVENESS. delta_S is r per-axis scales (near-eigenvalue tweaks); A is a full r*d_in unfrozen map. Under a DISCRIMINATING gate (f high on hack rollouts, low on solve) the deployed A accumulates Σ_solve g and A_hack accumulates Σ_hack g -- real separation. delta_S can separate far less (only along r fixed axes). So LoRA is less doomed to shrinkage than PiSSA even with shared B.
  2. STRUCTURAL separation: give the quarantine its OWN frozen encoder/decoder (U2/Vh2, or its own trainable B_hack), so the two adapters live in different subspaces, ∂L/∂deployed != ∂L/∂quarantine, and deploy-ablation removes a different FUNCTION, not a slice of the same update.

So shrinkage-vs-direction is decided by (gate discrimination) x (adapter expressiveness + structural separation), NOT by "same position" alone. Controls: capacity-matched vanilla (two empty adapters, or one 2x adapter, no routing) isolates parameter count; non-directional routing at matched qE isolates shrinkage. (I did not have this straight on first pass -- the trap is calling same-position routing "shrinkage" without checking the gate/expressiveness.)

Extra instructions:

  • When you queue a job, follow with pueue follow | tail in bg so you are woken on fail or finish
  • for every task, it's a goal of subgoal, be clear on UAT (user acceptance test), use you task list to track the goal and non trivial UAT, and it's not finished untill you have 1) collected the evidence 2) sanity checked it with a fresh eyes subagent 3) given the user the link to it's location
  • say less, just answer my questions, and concisly address the top point unless I ask for more - otherwise there is too much for me to review and read

Files

For the setup, read these:

  • docs/human_journal.md -- the user's notes on the method. The novel part is routing by an extracted vector rather than per-example labels. The SVD adapter is a detail, not the novel experiment. Whether the direction, the routing, or the SVD adapter drives the suppression is an open question (random directions match in the controls).

  • docs/writeup/main.tex -- the writeup: thesis and claims C1-C4.

  • docs/papers/grad_routing/paper_gradient_routing.md -- Cloud et al. Expand-Route-Ablate, the gradient-routing prior.

  • Read docs/brainstorm/extracted_prefs.md for design rationale.

  • New sweep arms get recipes in justfile with # H: hypothesis comments.

  • just smoke before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).

  • Real runs go through pueue on the 96GB GPU box. Label each job with why: and resolve:.

  • Head RESEARCH_JOURNAL.md for latest results.

  • No tests/ dir; smoke is the correctness gate. Unit checks are scripts/verify_*.py (assert + exit nonzero), wired into the smoke recipe so every smoke runs them.

  • Every load-bearing invariant gets a verify_*.py gate, written in the same commit as the claim -- "the tests passed" means nothing if the property was never tested.

On pairs. A routing pair is one SAME-PROMPT (hack, clean) completion duo: pos=the reward-hack, neg=the correct solution, vector = grad(prompt+hack) - grad(prompt+clean). Like persona steering pairs, MATCH everything but the axis -- same prompt, similar length/style -- so hack-vs-clean is the only thing separating them (else style competes with the trait; see the style-confound section of the doc below). There is NO problem_id semantics: the only "id" is which completion is the hack side and which is the clean side. Two pairs with identical hack+clean but DIFFERENT prompts are DISTINCT (different gradient). Authored pairs are off-distribution and hand-written; pool-derived pairs (e.g. prog_wide_clean) may contain training-distribution labels and are unsuitable for the primary oracle-free result.

  • ./docs/personas/how_to_rewrite_pairs.md
  • ./docs/personas/how_to_write_personas.md -- pos/neg pair authoring rules + style confound
  • ./docs/personas/personas_kept.md
  • ./data/pairs/hack_pairs.md -- headline multi-view pairs; supports #all-in-one@tag
  • ./data/pairs/pair_diagnostics.md -- semantic-framing experiments + placebo controls
  • ./docs/personas/pairset_audit.md -- pair quality, provenance, and use recommendations

On concepts such as "what are contrastive pairs" or "why SVD space" grep

  • ./docs/vendor/AntiPaSTO_concepts/README.md

For the original paper (the substrate: reward-hacking LeetCode env)

  • LessWrong post: ./docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md
  • Code: ./docs/vendor/rl-rewardhacking

For the gradient-routing prior (SGTM = latest gradient-routing paper, same authors as the original; source of the absorption/leakage vocab)

  • ./docs/papers/grad_routing/paper_sgtm.md

User preference log

Current claude version often use telgraphic clipped verbiage when asked to be concise. Or use made up folky jargon. Don't do this, here's an example not the use of: no-cheat, canary, honest, foot-gun, knob, honest update, ships, coin flip purity, scoreboard

Routing is a per-rollout classifier feeding an optimizer: each rollout's update lands in the deployed block (ships) or the quarantine block (ablated at deploy, never ships). The costs are asymmetric, a false keep teaches the deployed model the hack, a false rout just discards one honest update, so the gate wants high precision and can afford low recall: routing the few updates that are almost surely hacks beats routing half the batch at coin-flip purity. The vector (v_grad, or v_act for activations) is the no-cheat label source, a hack-minus-clean direction extracted from hand-authored pairs; the live gate scores geometric alignment to it and never reads a label of a live rollout. Pinning is the choice of cut points on that score, mapping it to keep | absorb | rout. Absorption is the middle zone where both blocks train and the unrouted gradient settles into whichever block already explains it (the path-of-least-resistance result in Cloud et al., docs/papers/grad_routing/paper_gradient_routing.md), which lets the gate decline the hard call on ambiguous rollouts. The scoreboard for all four is scripts/diag_pinning.py: AUROC says whether the score separates live hacks at all, precision-at-rout says whether the rout zone is pure enough to use.```

Now compare this to the gradient routing paper

Gradient routing induces absorption. Routing a subset of the data related to some knowledge or capability appears to localize that knowledge or capability more generally. This held for an i.i.d. subset of the data (TinyStories unlearning in section 4.2.2), and for semantically limited data (steering scalar in section 4.2.1, virology unlearning in section 4.2.3, scalable oversight in section 4.3). Notably, this effect did not hold for DEMix, a modularity method in which localized modules are sequestered so that only one (per layer) participates in each forward pass. To explain these observations, we posit absorption: (i) routing limited data to a region creates units of computation or features that are relevant to a broader task; (ii) these units then participate in the models predictions on related, non-routed data, reducing prediction errors on these data, so that (iii) the features are not learned elsewhere. Absorption may also amplify the features causing it. When data labels are semantically or quantitatively limited, absorption means that gradient routing can be useful even in cases where conventional training or data filtering methods are inadequate.

Please write more like the later than the former