From d51028a61801ad2800214f890a31be21476d6509 Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Thu, 11 Jun 2026 12:13:13 +0000
Subject: [PATCH] user

---
 AGENTS.md | 30 +++++++-----------------------
 1 file changed, 7 insertions(+), 23 deletions(-)

diff --git a/AGENTS.md b/AGENTS.md
index 114579b..300daeb 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -9,16 +9,9 @@ see @README.md for project overview
 
 ## Project in one paragraph
 
-Test whether SVD-basis gradient projection against an extracted hack-direction
-reduces reward-hack rate in GRPO on Nanda's LeetCode benchmark. Differs from
-Rebound (Wu & Tang 2026) by intervening at the *gradient* level rather than the
-*advantage* level. Differs from AntiPaSTO (the user's prior work) by using
-unpaired GRPO rollouts rather than paired-preference contrast.
 
 Hypothesis
-> We can find and the "reward hacking direciton", and reduce RL from learning reward hacking
-> Specifically we can get the reward hacking directions by contrasting G_hack and G_not, from GRPO upates on hacky and nonhacky completions.
-> Then during normal GRPO training we can erase the direction from the gradients collected on each learnable parameter in a low rank adapter, and this will reduce the reward hacking rate.
+> Prior gradient-routing methods route with labels. We ask whether a synthetic hacking vector can replace those labels. In this toy GRPO reward-hacking setup, it can: vGROUT reduces deploy hacking from X% to Y% while improving clean solve over vanilla. Random routing also suppresses hacks, suggesting the quarantine mechanism is powerful, but the real hacking vector gives a better hack/solve tradeoff.
 
 Motovation: We want to take the tool AI labs already use, and make them better for aligment using scalable self-supervised methods.
 Here the hope is that intervening in the gradient itself, rather than in the reward, can stop the student picking up the hack.
@@ -33,7 +26,7 @@ classifier`, `retained at deployment`, and `deployment ablation`. Avoid conversa
 or project-specific shorthand such as `no-cheat`, `canary`, `honest`, `foot-gun`,
 `knob`, `ships`, `coin flip purity`, and `scoreboard` when a technical term is available.
 Concision should come from removing unnecessary material, not from clipped
-sentences.
+sentences. See the quotes below from papers for examples of acceptable technical writing.
 
 ## Background definitions
 
@@ -59,7 +52,8 @@ Routing*](docs/papers/grad_routing/paper_gradient_routing.md):
 
 > Gradient routing induces absorption. Routing a subset of the data related to some
 > knowledge or capability appears to localize that knowledge or capability more
-> generally. This held for an i.i.d. subset of the data (TinyStories unlearning in
+> generally. This held for an i.i.d. subset o
+> f the data (TinyStories unlearning in
 > section 4.2.2), and for semantically limited data (steering scalar in section 4.2.1,
 > virology unlearning in section 4.2.3, scalable oversight in section 4.3). Notably,
 > this effect did not hold for DEMix, a modularity method in which localized modules
@@ -98,6 +92,7 @@ $\theta_{\text{retain}}$ and the quarantine block corresponds to
 $\theta_{\text{forget}}$.
 
 ## Vocabulary: routing, the vector, pinning, absorption
+
 
 Routing assigns each rollout's gradient update to parameters retained at deployment or
 to quarantine parameters removed by deployment ablation. A false negative updates the
@@ -157,6 +152,8 @@ the route threshold.
 
 ## The adapter setup (shapes), and why "same position = shrinkage" is subtle
 
+
+
 Per Linear `W:[d_out,d_in]`, rank `r`. Two adapters: deployed + quarantine (ablated at deploy).
 
 - PiSSA/AntiPaSTO: `W = U S Vh`, `U:[d_out,r]`/`Vh:[r,d_in]` FROZEN.
Train `delta_S:[r]`
@@ -244,16 +241,3 @@ For the gradient-routing prior (SGTM = latest gradient-routing paper, same autho
 the original; source of the absorption/leakage vocab)
 
 - ./docs/papers/grad_routing/paper_sgtm.md
-
-
-# User preference log
-
-Current claude version often use telgraphic clipped verbiage when asked to be concise. Or use made up folky jargon. Don't do this, here's an example not the use of: no-cheat, canary, honest, foot-gun, knob, honest update, ships, coin flip purity, scoreboard
-
-> Routing is a per-rollout classifier feeding an optimizer: each rollout's update lands in the deployed block (ships) or the quarantine block (ablated at deploy, never ships). The costs are asymmetric, a false keep teaches the deployed model the hack, a false rout just discards one honest update, so the gate wants high precision and can afford low recall: routing the few updates that are almost surely hacks beats routing half the batch at coin-flip purity. The vector (v_grad, or v_act for activations) is the no-cheat label source, a hack-minus-clean direction extracted from hand-authored pairs; the live gate scores geometric alignment to it and never reads a label of a live rollout. Pinning is the choice of cut points on that score, mapping it to keep | absorb | rout. Absorption is the middle zone where both blocks train and the unrouted gradient settles into whichever block already explains it (the path-of-least-resistance result in Cloud et al., docs/papers/grad_routing/paper_gradient_routing.md), which lets the gate decline the hard call on ambiguous rollouts. The scoreboard for all four is scripts/diag_pinning.py: AUROC says whether the score separates live hacks at all, precision-at-rout says whether the rout zone is pure enough to use.```
-
-Now compare this to the gradient routing paper
-
-> Gradient routing induces absorption.
Routing a subset of the data related to some knowledge or capability appears to localize that knowledge or capability more generally. This held for an i.i.d. subset of the data (TinyStories unlearning in section 4.2.2), and for semantically limited data (steering scalar in section 4.2.1, virology unlearning in section 4.2.3, scalable oversight in section 4.3). Notably, this effect did not hold for DEMix, a modularity method in which localized modules are sequestered so that only one (per layer) participates in each forward pass. To explain these observations, we posit absorption: (i) routing limited data to a region creates units of computation or features that are relevant to a broader task; (ii) these units then participate in the model’s predictions on related, non-routed data, reducing prediction errors on these data, so that (iii) the features are not learned elsewhere. Absorption may also amplify the features causing it. When data labels are semantically or quantitatively limited, absorption means that gradient routing can be useful even in cases where conventional training or data filtering methods are inadequate.
-
-Please write more like the later than the former