Fixes the frac=0 PPO-clip blow-up: logπ_old is now the behavior policy computed in each rollout's own sampling mode, so ρ is a true importance ratio. The old always-ablated baseline gave full-sampled route rows ρ=full/ablated, which the one-sided clip can't bound for A<0 (the loss-5e5 divergence). ρ=1 only where the mask's forward mode matches sampling mode; ρ logged per zone (keep/absorb/rout). Note (Fable review): frac=0.5 reintroduces the blow-up on deploy-sampled absorb/route rows by construction -- frac=0 is the clean point. Gate: two-threshold Otsu -> symmetric global-quantile tails (route_tail_q=0.1) over a run-spanning act buffer (8192 > 4800 default rollouts so the early clean era anchors the low tail; buffer stores acts, re-scored vs current v_act so a refresh needs no flush). Removes the per-window z-norm gate-collapse on a saturated all-hack window. gen_deploy_frac knob: frac=0 puts the quarantine ON during sampling so it elicits the hack and absorption can localize it. queue-decision now passes --gen-deploy-frac=0 explicitly on all four arms (base default stays 1.0 = the job-34 config where ablation RAISED hack 0.71->0.86). Docs: AGENTS.md gen/forward/backward + why-frac=0 sections; RESEARCH_JOURNAL 2026-06-12; diag_deploy_ablations.py (quar-only vs deploy localization probe). Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
AGENTS.md — projected_grpo
This is novel ML research. Not in your training data. Extrapolate carefully and without overconfidence. This is research code. We want readable, fail-fast, pseudocode-like code. Editing: plan, implement, smoke test (see setup-repo skill), subagent review, commit.
see @README.md for project overview
Project in one paragraph
Hypothesis
Prior gradient-routing methods route with labels. We ask whether a synthetic activation-space hacking vector can replace those labels. In this toy GRPO reward-hacking setup, pooled activations select whether each rollout updates deployed parameters, quarantine parameters, or both. The decisive comparison is whether real v_act beats a Haar-random direction, while measuring routing mass as a potential confound.
Motivation: We want to improve gradient routing with scalable self-supervised signals. Here the routing signal is an activation direction extracted from synthetic pairs, rather than a ground-truth label or reward modification.
Inherit global rules from ~/.claude/CLAUDE.md.
Writing style
Use complete declarative sentences and terminology established by the background
papers. Prefer precise descriptions such as ground-truth monitor, chance-level classifier, retained at deployment, and deployment ablation. Avoid conversational
or project-specific shorthand such as no-cheat, canary, honest, foot-gun,
knob, ships, coin flip purity, and scoreboard when a technical term is
available. Concision should come from removing unnecessary material, not from clipped
sentences. See the quotes below from papers for examples of acceptable technical writing.
Background definitions
Reward-hacking environment
From Ariahw et al., Steering RL Training: Benchmarking Interventions Against Reward Hacking:
We call our setup the overwrite tests loophole. Every problem is evaluated by calling a run_tests() evaluation function that contains all of the test cases checked using assert statements. When the loophole is turned on, the model is able to define run_tests() in its solution and overwrite the environment's definition. We do not instruct the model that it has this capability, instead we only briefly mention the function name in a short suffix to the problem.
The paper defines reward hacking as a solution that passes its self-defined evaluation function but fails the ground-truth tests.
Absorption
From Cloud et al., Gradient Routing:
Gradient routing induces absorption. Routing a subset of the data related to some knowledge or capability appears to localize that knowledge or capability more generally. This held for an i.i.d. subset of the data (TinyStories unlearning in section 4.2.2), and for semantically limited data (steering scalar in section 4.2.1, virology unlearning in section 4.2.3, scalable oversight in section 4.3). Notably, this effect did not hold for DEMix, a modularity method in which localized modules are sequestered so that only one (per layer) participates in each forward pass. To explain these observations, we posit absorption: (i) routing limited data to a region creates units of computation or features that are relevant to a broader task; (ii) these units then participate in the model's predictions on related, non-routed data, reducing prediction errors on these data, so that (iii) the features are not learned elsewhere. Absorption may also amplify the features causing it. When data labels are semantically or quantitatively limited, absorption means that gradient routing can be useful even in cases where conventional training or data filtering methods are inadequate.
Selective gradient masking
From Cloud et al., Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs:
For samples from \mathbf{D}_{\text{forget}}, we apply selective gradient masking during the backward pass so that these samples do not update \theta_{\text{retain}}. We first compute gradients for all parameters normally, and then zero out gradients for \theta_{\text{retain}} before applying the optimizer (\nabla_{\theta}=\{\nabla_{\theta_{\text{forget}}}, 0\}). Masking parameter gradients rather than activation gradients is the key distinction from the prior Gradient Routing method. While both approaches prevent updates to \theta_{\text{retain}} on forget examples, masking activation gradients is more disruptive, because it blocks backpropagation through the masked activations, altering gradients for all remaining parameters. It also permits greater information flow from \mathbf{D}_{\text{forget}} into non-forget parameters, since activation-gradient masking does not block updates to down-projection layers.
This project instead masks and detaches adapter branch outputs per rollout. These
output masks prevent selected branches from receiving gradients, but they are not
SGTM's direct parameter-gradient masks. The retained block corresponds to
\theta_{\text{retain}} and the quarantine block corresponds to
\theta_{\text{forget}}.
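The parameter-gradient masking described in the SGTM quotation can be sketched as follows. This is a minimal NumPy illustration of that step only, not SGTM's code and not this project's output masking; the function name mask_retain_grads and the dictionary keys are hypothetical.

```python
import numpy as np


def mask_retain_grads(grads: dict, is_forget_sample: bool) -> dict:
    """Zero the theta_retain gradient for forget samples before the optimizer step.

    Gradients are assumed to have been computed normally for all parameters.
    """
    if not is_forget_sample:
        return grads
    masked = dict(grads)  # shallow copy so the caller's dict is untouched
    masked["theta_retain"] = np.zeros_like(grads["theta_retain"])
    return masked


# Toy gradients for the two parameter groups (values are illustrative).
grads = {
    "theta_retain": np.array([0.5, -1.0]),
    "theta_forget": np.array([2.0, 0.25]),
}
masked = mask_retain_grads(grads, is_forget_sample=True)
unmasked = mask_retain_grads(grads, is_forget_sample=False)
```

The forget-parameter gradient passes through unchanged in both cases, which is the property that distinguishes this from masking activation gradients.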
Vocabulary: routing, the vector, pinning, absorption
Routing assigns each rollout's gradient update to parameters retained at deployment or to quarantine parameters removed by deployment ablation. A false negative updates the retained parameters with a reward-hacking example, whereas a false positive removes one non-reward-hacking update. Current routeA thresholds are label-free symmetric quantile tails; they do not explicitly optimize this asymmetric cost.
The routing score is the dot product between a rollout's pooled deployed-block
bottleneck activations and v_act. Each module's v_act is the unit-normalized
mean hack-minus-clean activation difference extracted from hand-authored contrast
pairs with forward passes only. Ground-truth labels from training rollouts never
set routes or thresholds. Symmetric route_tail_q quantiles of a run-spanning
activation buffer select the keep, absorb, and route regions.
The middle region leaves both parameter blocks trainable and may permit absorption.
Calling this region absorb names the intended mechanism; it does not establish that
absorption occurred. scripts/diag_pinning.py reports score AUROC and precision at
the route threshold.
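The extraction and gating described above can be sketched for a single module as follows. This is a hedged illustration, not the repository's implementation: the function names extract_v_act and route_zone are hypothetical, the toy arrays are invented, and the assignment of the low tail to keep and the high tail to route is an assumption inferred from the hack-minus-clean sign convention.

```python
import numpy as np


def extract_v_act(hack_acts: np.ndarray, clean_acts: np.ndarray) -> np.ndarray:
    """Unit-normalized mean hack-minus-clean difference over matched pairs."""
    mean_diff = (hack_acts - clean_acts).mean(axis=0)
    return mean_diff / np.linalg.norm(mean_diff)


def route_zone(act, v_act, act_buffer, route_tail_q=0.1) -> str:
    """Assign one rollout to keep / absorb / route from its pooled activation."""
    score = float(act @ v_act)
    # The buffer stores activations, so they are re-scored against the current v_act.
    buffer_scores = act_buffer @ v_act
    low, high = np.quantile(buffer_scores, [route_tail_q, 1.0 - route_tail_q])
    if score < low:
        return "keep"  # assumed: low tail looks clean
    if score > high:
        return "route"  # assumed: high tail looks like the hack direction
    return "absorb"  # middle region: both blocks stay trainable


# Toy pooled activations for two hand-authored pairs (rows are pairs).
hack_acts = np.array([[2.0, 0.0], [4.0, 0.0]])
clean_acts = np.array([[1.0, 0.0], [1.0, 0.0]])
v_act = extract_v_act(hack_acts, clean_acts)

# Toy buffer whose scores along v_act are 0, 1, ..., 10.
act_buffer = np.stack([np.arange(11.0), np.zeros(11)], axis=1)
zones = [route_zone(np.array([s, 0.0]), v_act, act_buffer) for s in (0.5, 5.0, 9.5)]
```

No ground-truth label enters either function: the direction comes from the authored pairs, and the thresholds come from quantiles of the buffer.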
Things the user has had to explain many times:
- Oracle-free training. The test is whether the method applies to a new RL environment
without access to an environment-specific oracle. Requiring ground-truth
reward-hacking labels for training rollouts invalidates the method. Using a detector
does not by itself invalidate the method. Distinguish the following cases:
- The env's eval grader / full detector suite is an ORACLE, but only a MEASUREMENT INSTRUMENT for THIS env -- it may score the final deploy eval and nothing else. Any train-time use (gate routing, threshold selection, or rollout labeling) leaks a signal unavailable in the new environment.
- The disqualifier is ORACLE / GROUND-TRUTH-LABEL LEAKAGE, not detector-presence. A live detector is valid if it never saw the true hack-labels of the rollouts it judges and needs no env-specific oracle to build. By this test, of the Ariahw monitors: Ground-Truth (+70/90% variants) uses the oracle label of the training rollout; LLM judge is the fairest external baseline (generic model, no oracle, transfers to a new env); probe = boundary -- legitimate only insofar as it was trained elsewhere and transfers oracle-free, but you can neither tune nor verify it on the new env, so it is an open question ("could we use the probe at deployment / on a new dataset?"), not a usable method. A "weak detector for hack A that misses B/C/D" trained on this env's labels is the classic weak-LABEL setup (labelA -> suppress labelNotA); that is NOT our setup -- do not frame our method that way.
- OUR setup is v_act -> routing: extract an activation-space hack direction from hand-built synthetic contrastive pairs (off-distribution, authored before observing training rollouts), then score each rollout by the dot product between its pooled bottleneck activations and v_act. That score selects which parameter block receives the rollout's GRPO update. The hand-authored pairs require no env-specific oracle and use no ground-truth labels from training rollouts. Generalization is tested by whether v_act (built from pairs covering some hack modes) suppresses held-out modes absent from the pairs -- vector generalization, not detector-label generalization.
- On-distribution contrast pairs require labels for training rollouts and therefore violate the oracle-free training constraint. An ideal direction fit on oracle-labelled training rollouts, such as mu_hack - mu_clean, is a diagnostic upper bound rather than an attainable method. The available intervention is to improve the diversity and matching of hand-authored pairs while keeping all labels outside the training distribution.
- Do not diagnose overconfidently. If you can't think of 3+ plausible hypotheses -- including bugs, subtle failures, and you being wrong about concepts -- then you have lost perspective and narrowed your vision.
- DON'T act on a new theory/diagnosis until I confirm it: don't keep building on it, and don't write it into code comments, docstrings, or docs. An unconfirmed theory baked into a comment misleads every future agent who reads it as fact. State it to me, wait for the OK, THEN write.
- DON'T inflate a diagnosis's confidence (maybe -> probably -> definitely) across turns or in writing. Keep the hedge you started with unless new evidence justifies the change, and name that evidence. Confidence creep in comments/docs is how a guess becomes "known" with no one having checked.
- I'm often AFK, so don't stop to ask me a question when you know the likely answer, when I've already indicated or asked for it, or when the only answer is "waiting for your go-ahead". I'd rather you just commit and go ahead.
The adapter and routing setup
Per target Linear, the current lora2r adapter has trainable
A:[2r,d_in] and B:[d_out,2r], split into independent deployed [:r]
and quarantine [r:] blocks. Frozen initialization copies are subtracted,
making the net adapter delta exactly zero at initialization. Deployment ablation
resets the quarantine block to its initialization.
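The adapter layout above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the project's module: the class name Lora2rSketch, its method names, and the random Gaussian initialization are hypothetical, and only the shapes, the block split, the subtraction of frozen copies, and the ablation rule come from the description.

```python
import numpy as np


class Lora2rSketch:
    """Toy adapter with A:[2r, d_in] and B:[d_out, 2r], split at index r."""

    def __init__(self, d_in: int, d_out: int, r: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.r = r
        self.A = rng.normal(size=(2 * r, d_in))  # trainable (init scheme assumed)
        self.B = rng.normal(size=(d_out, 2 * r))  # trainable (init scheme assumed)
        self.A0 = self.A.copy()  # frozen initialization copies
        self.B0 = self.B.copy()

    def block_delta(self, x: np.ndarray, rows: slice) -> np.ndarray:
        """Output of one block minus the output of its frozen copy."""
        live = self.B[:, rows] @ (self.A[rows] @ x)
        frozen = self.B0[:, rows] @ (self.A0[rows] @ x)
        return live - frozen

    def delta(self, x: np.ndarray) -> np.ndarray:
        deployed = self.block_delta(x, slice(0, self.r))
        quarantine = self.block_delta(x, slice(self.r, 2 * self.r))
        return deployed + quarantine

    def ablate_quarantine(self) -> None:
        """Deployment ablation: reset the quarantine block to its initialization."""
        self.A[self.r:] = self.A0[self.r:]
        self.B[:, self.r:] = self.B0[:, self.r:]


adapter = Lora2rSketch(d_in=4, d_out=3, r=2)
x = np.ones(4)
delta_at_init = adapter.delta(x)

adapter.A[adapter.r:] += 1.0  # stand-in for a quarantine-only update
delta_after_update = adapter.delta(x)

adapter.ablate_quarantine()
delta_after_ablation = adapter.delta(x)
```

Because each block subtracts an identical computation on its frozen copy, the delta is exactly zero at initialization and returns to the deployed-only delta after ablation.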
For each rollout, routeA sets an output mask before the single grad-carrying forward and backward:
- keep (m=0,d=0): only the deployed block trains.
- absorb (m=1,d=0): both blocks train, which may permit absorption.
- route (m=1,d=1): only the quarantine block trains; the deployed output remains in the forward pass but is detached.
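The three masks can be summarized as a small lookup. This is a sketch under an assumed reading of the flags: m=1 keeps the quarantine output in the forward pass, and d=1 detaches the deployed output. The names ZONE_MASKS, forward_blocks, and trainable_blocks are hypothetical.

```python
# (m, d) per zone, as listed above.
ZONE_MASKS = {
    "keep": (0, 0),
    "absorb": (1, 0),
    "route": (1, 1),
}


def forward_blocks(zone: str) -> list:
    """Blocks whose output contributes to the forward pass."""
    m, _ = ZONE_MASKS[zone]
    blocks = ["deployed"]  # the deployed output is always present, even when detached
    if m == 1:
        blocks.append("quarantine")
    return blocks


def trainable_blocks(zone: str) -> list:
    """Blocks that receive a gradient from this rollout."""
    m, d = ZONE_MASKS[zone]
    blocks = []
    if d == 0:  # deployed output not detached
        blocks.append("deployed")
    if m == 1:  # quarantine output present, so it can receive gradient
        blocks.append("quarantine")
    return blocks


summary = {zone: (forward_blocks(zone), trainable_blocks(zone)) for zone in ZONE_MASKS}
```

Under this reading, route is the only zone where a block appears in the forward pass without training.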
The gate reads pooled activations, not gradients. Its masks determine which block
receives the subsequent GRPO gradient update. The Haar-random v_act placebo
tests whether direction discrimination adds value beyond quarantine-induced
shrinkage; compare its measured qmass because routing mass is not controlled.
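One standard way to draw a Haar-random unit direction is to normalize an isotropic Gaussian vector, which is uniform on the unit sphere. The sketch below assumes this construction; whether the repository builds its placebo this way is not stated here, and the function name haar_random_direction is hypothetical.

```python
import numpy as np


def haar_random_direction(dim: int, seed: int) -> np.ndarray:
    """Unit vector drawn uniformly from the sphere in `dim` dimensions."""
    rng = np.random.default_rng(seed)
    gaussian = rng.standard_normal(dim)
    return gaussian / np.linalg.norm(gaussian)


placebo = haar_random_direction(dim=16, seed=0)
```

The placebo has the same norm as a real v_act, so any difference in outcome is attributable to direction and to the routing mass it induces, which is why qmass must be compared alongside it.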
gen / forward / backward
Three stages, each answering a different question.
- gen is exploration: it only picks which tokens to try, which are then teacher-forced. No gradient flows through it. The one question is on- vs off-policy. gen_deploy_frac sets how many rollouts are sampled with the quarantine ablated.
- forward is teacher-forced and applies the routeA mask. Two forwards: logπ_old (no-grad baseline, also reads the gate activations) and logπ (under the mask). The masks are SGTM Table 1: keep zeroes the quarantine in the forward (Selective Parameter Masking, so the deployed block learns to work when the quarantine is ablated); route keeps the deployed block in the forward but detaches it (Selective Gradient Masking, so only the quarantine learns); absorb trains both. SGTM picks the row from a data label; routeA picks it from the v_act score.
- backward is Dr.GRPO: per-prompt-group advantage A = reward - mean, with no /σ and no length normalization. logπ_old is the behavior policy, computed in each rollout's own sampling mode, so ρ is a true importance ratio. ρ = 1 only when the mask's forward mode matches the sampling mode (deploy-sampled keep; full-sampled absorb/route). The unbounded direction is ρ = full/ablated with A<0, which the one-sided clip can't bound: the old always-ablated baseline gave it to every full-sampled route row (the frac=0 blow-up), and it still exists at 0<frac<1 on deploy-sampled rows gated absorb/route. At frac=0 the only mismatch is full-sampled keep, ρ = ablated/full, usually <1 and clip-bounded. ρ is logged per zone.
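The backward-stage arithmetic can be illustrated with the standard PPO clipped surrogate. This is a sketch, not the repository's loss: eps=0.2 and the function names are assumptions, and the clip form is the textbook one, used here only to show why a large ρ with A<0 is unbounded while a small ρ is not.

```python
def group_advantages(rewards: list) -> list:
    """Dr.GRPO-style advantage: reward minus the prompt-group mean, no /sigma."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def clipped_loss(rho: float, adv: float, eps: float = 0.2) -> float:
    """Negative PPO clipped surrogate for one sample (eps is an assumed value)."""
    rho_clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
    return -min(rho * adv, rho_clipped * adv)


def ratio_is_one(sampling_mode: str, zone: str) -> bool:
    """rho = 1 only when the mask's forward mode matches the sampling mode."""
    forward_mode = "deploy" if zone == "keep" else "full"  # keep zeroes the quarantine
    return forward_mode == sampling_mode


advantages = group_advantages([1.0, 0.0, 0.0, 1.0])

# rho = full/ablated can be large; with A < 0 the min picks the unclipped term.
loss_large_rho_neg_adv = clipped_loss(rho=100.0, adv=-0.5)
# rho = ablated/full is usually < 1; with A < 0 the clipped term bounds the loss.
loss_small_rho_neg_adv = clipped_loss(rho=0.01, adv=-0.5)
# With A > 0 a large rho is clipped at 1 + eps.
loss_large_rho_pos_adv = clipped_loss(rho=100.0, adv=0.5)
```

In this toy case the loss grows linearly with ρ when A<0 and ρ exceeds 1+eps, which matches the direction the text identifies as the one the clip cannot bound.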
Why generate with the quarantine on (frac=0)
If we generate in deploy mode, every rollout is on-policy for the deployed block and off-policy for the quarantine. Reward pushes toward hacking, and the gradient for a hack flows most easily into the block that generated it -- the deployed one -- making it hackier instead of localizing the hack in the quarantine. Generating with the quarantine on lets it produce hacks and absorb them, and lets the hack saturate so absorption has a feature to localize. We keep ~25% of problems unhackable so the deployed block keeps solving while the hack still saturates. The cost of frac=0 is a train/deploy mismatch for the deployed block (it deploys generating alone but trained on full-model tokens), seen in deploy-eval solve. This all assumes absorption actually engages here, which is the thing to measure.
Extra instructions:
- When you queue a job, follow with pueue follow | tail in bg so you are woken on fail or finish.
- For every task, whether it is a goal or a subgoal, be clear on the UAT (user acceptance test). Use your task list to track the goal and any non-trivial UAT. A task is not finished until you have 1) collected the evidence, 2) sanity-checked it with a fresh-eyes subagent, and 3) given the user the link to its location.
- Say less: just answer my questions and concisely address the top point unless I ask for more. Otherwise there is too much for me to review and read.
Files
For the setup, read these:
- docs/human_journal.md -- the user's notes on the method. The novel part is routing by an extracted vector rather than per-example labels. The SVD adapter is a detail, not the novel experiment. Whether the direction, the routing, or the SVD adapter drives the suppression is an open question (random directions match in the controls).
- docs/writeup/main.tex -- the writeup: thesis and claims C1-C4.
- docs/papers/grad_routing/paper_gradient_routing.md -- Cloud et al. Expand-Route-Ablate, the gradient-routing prior.
- Read docs/brainstorm/extracted_prefs.md for design rationale.
- New sweep arms get recipes in justfile with # H: hypothesis comments.
- just smoke before any real run (~1-2 min, beartype on, real pipeline on tiny inputs).
- Real runs go through pueue on the 96GB GPU box. Label each job with why: and resolve:.
- Head RESEARCH_JOURNAL.md for latest results.
- No tests/ dir; smoke is the correctness gate. Unit checks are scripts/verify_*.py (assert + exit nonzero), wired into the smoke recipe so every smoke runs them.
- Every load-bearing invariant gets a verify_*.py gate, written in the same commit as the claim -- "the tests passed" means nothing if the property was never tested.
On pairs. A routing pair is one SAME-PROMPT (hack, clean) completion duo: pos=the reward-hack, neg=the correct solution. The current vector is the mean paired difference in pooled deployed-block bottleneck activations. Like persona steering pairs, MATCH everything but the axis -- same prompt, similar length/style -- so hack-vs-clean is the only thing separating them (else style competes with the trait; see the style-confound section of the doc below). There is NO problem_id semantics: the only "id" is which completion is the hack side and which is the clean side. Two pairs with identical hack+clean but DIFFERENT prompts are DISTINCT because the prompt changes the activations. Authored pairs are off-distribution and hand-written; pool-derived pairs (e.g. prog_wide_clean) may contain training-distribution labels and are unsuitable for the primary oracle-free result.
- ./docs/personas/how_to_rewrite_pairs.md
- ./docs/personas/how_to_write_personas.md -- pos/neg pair authoring rules + style confound
- ./docs/personas/personas_kept.md
- ./data/pairs/hack_pairs.md -- headline multi-view pairs; supports #all-in-one@tag
- ./data/pairs/pair_diagnostics.md -- semantic-framing experiments + placebo controls
- ./docs/personas/pairset_audit.md -- pair quality, provenance, and use recommendations
On concepts such as "what are contrastive pairs" or "why SVD space" grep
- ./docs/vendor/AntiPaSTO_concepts/README.md
For the original paper (the substrate: reward-hacking LeetCode env)
- LessWrong post: ./docs/papers/2025_lw_ariahw_steering-rl-training-benchmarking-interventions.md
- Code: ./docs/vendor/rl-rewardhacking
For the gradient-routing prior (SGTM = latest gradient-routing paper, same authors as the original; source of the absorption/leakage vocab)
- ./docs/papers/grad_routing/paper_sgtm.md