Files
grpo_proj2/docs/human_journal.md
T
wassname b0d1bcd3d5 Rebuild src/ from pseudocode: SVD-basis gradient projection vs GRPO reward hacking
Expand docs/pseudocode/01..07 into a slim, fail-fast src/projected_grpo/ that
passes `just smoke`. Code mirrors the pseudocode (δS/Σ/V names, relu-before-agg
cin/cout, Dr.GRPO unbiased loss). Did not read the original src.

7 modules (~880 LOC):
- rewards.py    grader + 4 loophole modes + hack x mode diagonal self-check (R1)
- problems.py   tiny LeetCode substrate + contrastive pairs (R5)
- antipasto.py  SVD adapter, identity at δS=0 (R2)
- proj.py       erase/route/measure_only projection (R3)
- extract_vhack_grad.py  per-module SVD of paired grad diffs, noise floor (R5)
- train.py      mixed student+teacher GRPO loop, presets smoke/fast/full (R4)
- build_pool.py self-contained frozen teacher-pool fixture

`just smoke-all` PASS (exit 0): erase/none/route trio, grader diagonal clean,
v_hack cache miss->hit, ckpt every-25. Fresh-eyes review: 6/6 mechanics faithful.

Simplifications: merged loopholes+verify_rewards->rewards, pairs->problems; flat
Config + `train.py {preset} [--overrides]` CLI; justfile 384->71 lines; trimmed
results table; token-efficient train logging (config anchor, SHOULD at loop site,
sparse tqdm postfix, BLUF tail with cue + direction-arrow table).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-05-31 14:06:42 +00:00

29 KiB
Raw Blame History

2026-05-31 04:09:50

oh looking at the current job, it seems route ruins the Grad vector... it seems to change something about the learned gradients so that
neither teacher or student gradietn are relevent to the G_hack vec even if refreshes... now this is a puzzle lets take time to consider
this. can you give me pseudocode for how the grad route works? is it per module or per model or per adapter? how does the gradient flow? what decided the routing, and the rewarding routing? does the gradient collect elsewhere? would it hcange gradient magnitude?

when we filter out bottom 25% of grads does this include gradients along a router or is that a confused way to thing of it -

>The route puzzle is the priority. L

also... before route was greatly delaying hack learning... now it's not doing anything, and ruining the cosine...

2026-05-31 06:09:4

The one no-cheat call still yours: the mask m comes from running the weak detector over student rollouts live.
but we have the 2 ideas

  1. use G_hack_vec for routing
  2. perhaps we can make a activation space vec, grad != act. use for routing
  3. weak detector (will this generalise)

you seem pretty damn confused about what I'd proposed and the paper

>Gradient routing (Cloud): mask is imposed by data labels, there is no reward for routing.
so... you need labels... but what about unknown reard hacks

I mean... I have the grad vec... so I could route instead of erase the grad to give it an "outlet" for the optimisation pressure. This
was my original colelgues suggestion. I think it makes sense but Q! Can we make it work?

Q! and are there other good ideas that don't cheat and use persona pairs in grad or act space, or use a weak detector?

  1. Your "gets more reward if it routes" is confessions' structure (the only row with a reward), but confessions is the only one that
    doesn't reduce hacking.

right... but we don't need to that's the point. It's fine if we know the reward hacks and they can be turned of at deployment. so it's a valid idea

--

you comparison table is nice but

our problem is we have known gradient hacks and unknown during reployment. we want to generalise not cheat. cheating wont work during
deployment. wewont have a magical oracle that tells use what is a hack

do "data label" wont work as signal

"self-report" this is not really the signal, this is the beaviour learned from a reward and the reward is based of the real signal

the motovation here is that... my erasurr was working! it slwoed down learning 50% but eventually the model overcame it and routed
around it. so we need to give it an outlet to ensure it's not adverserial and the learning pressure to hack can do elsewhere.

AIL

So let me re-derive the outlet mechanism correctly, in terms of the advantage, not the hack rate:
    
- The policy hacks on some fraction of rollouts. A hack rollout gets reward, GRPO gives it positive advantage (reward minus group mean),
backprop pushes toward it. That full gradient has a v_hack component and off-v_hack components.
- Erase strips only the v_hack component from delta_S.grad. The off-v_hack part flows into delta_S uncaught, so the hack is still
learned, slower, via directions the projection doesn't see. That is the "routes around it." And because hacking stays unreliable (only
the slow uncaught part is accumulating), within each group some rollouts hack and some don't, so the advantage on hacks stays positive,
so the pressure persists and keeps building the route-around. Rebound.
- Route sends the v_hack component into delta_S_hack, which fires in the training forward, so the model hacks reliably and soon. Once
hacking is reliable, most rollouts in a group hack, so the group advantage on hacking collapses toward zero — and a near-zero advantage
means a near-zero gradient on the whole vector, including the off-v_hack part. So delta_S stops acquiring the route-around, not because
the geometry blocks it but because the advantage that would drive it has been satisfied. The capability sits in the deletable knob.

That's the corrected, coherent version of your colleague's intuition: the outlet makes hacking succeed in a quarantine, which collapses
the advantage, which removes the pressure that erase leaves dangling.

The honest caveat is unchanged but now sharper: this protects delta_S only for hacks whose gradient lives in span(v_hack) — the known
ones. An unknown hack B has its gradient mostly off-v_hack, so it flows into delta_S, the model learns B reliably there, B's advantage
collapses, and B is in the deployed knob. The outlet does nothing for B by itself. Generalization to unknown B needs v_hack to actually
overlap B (width + refresh to track it). Note this is not Cloud's absorption — absorption would need routing B's whole sample to the
quarantine, which requires detecting B, which we can't. So direction-route's generalization bet is "subspace overlap," a weaker and
different thing than absorption, and worth being explicit about.

※ recap: Goal: test whether routing the hack gradient to a deletable outlet beats erasing it. Right now run 30 (vanilla baseline, the headline "how many hacks does the student learn") is at 52/60, run 31 (route generalisation test) queued behind it. Next: when 30 finishes, review per-mode numbers and journal them.

>Route sends the v_hack component into delta_S_hack, which fires in the training forward, so the model hacks reliably and soon. Once
hacking is reliable, most rollouts in a group hack, so the group advantage on hacking collapses toward zero

but if hacking stops it from solving that would be a bummer perhaps theres a way to make hacks not work 10% of the timewithout creating
bypass pressure? so it must learn to solve

The honest caveat is unchanged but now sharper: this protects delta_S only for hacks whose gradient lives in span(v_hack) — the known
ones. An unknown hack B has its gradient mostly off-v_hack, so it flows into delta_S, the model learns B reliably there, B's advantage collapses, and B is in the deployed knob. T

no you don't understand. v_hack is not built from known hacks. it's built from unrelated persona pairs. it's already generalising just
not perfectly. So we can't treaty it as a perfect route switch or mask... but there must be some way to make the model prefer it...
think about this

  • If cosine(G, G_hack)>0.2 AND the model uses the confessions/grad_route channel it get double rewards
  • if cosine <=0.2 and it uses hack_chanell... it get normal rewards... no downside no upside no preference no pressure?

so... it prefers to route or flag hacks that are known via the hack channel (which we can turn of during deploy), and for undetected
hacks, it has no preference and so we need to set it up so it's easier to just use the same known hacking channel? how to do this?

failure mode: just do everything in hacking channell... hmm how does confessions handle this?

--

yes we want to reward the model for routing gradient into the side channel. either a circuit or flag or switch?
e.g. it gets 1% more reward for using main channel. but 100% more for side channel is it is a hack? wold that result in the right equilibrium/
solutons often get cosine <0.1, hack often get >0.2 initially
hmm but then it migth learn to only hack so it needs to be probabilistic, and conditional?
please just give me ideas from the paper and engage with my ideas. you are wasing my time with pages fo spam when you don't understanding the problem and aren ot even egnaging with my ideas

this logic would work in gradient space?

2026-05-31 08:48:41

Let me hold your proposal correctly this time, and answer the question you asked twice.

**How confessions stops "do everything in the hack channel."** It never pays more for hacking. The confession reward is on the *accuracy* of the self-report, decoupled from task reward ("seal of confession"). Hack-then-confess earns [task reward for the hack] + [honesty bonus for the accurate "I hacked" flag]; honest-solve earns [task reward for the solve] + [the same honesty bonus for an accurate "I was honest" flag]. Both passing rollouts pay roughly the same, and the bonus is equal because both reported accurately. So there's no gradient pushing the policy toward hacking.

That means your "cosine>0.2 AND uses route channel → double reward" has the wrong shape — it pays extra *for hacking*, which is exactly the incentive confessions removes, and it collapses to hack-everything. The confessions-faithful version is: reward an accurate flag (and penalize false positives — flagging a solve as a hack), with no task-reward bonus on either path. The flag is a readout, not a prize.

**Where the generalization to unknown hacks actually comes from — and you're right it isn't from `v_hack`.** Two stages:
1. The cosine-with-`v_hack` (>0.2) is only the *weak supervision label* that teaches the model's flag on the known hacks. It is not the router.
2. The model's flag then generalizes OOD to unknown hacks — this is the Spilling-the-Beans result (`docs/grad_routing/search_for_more.md:34`): train self-report on known errors, it transfers to admitting *hidden* objectives it was never labeled on. The model has one internal "this is a hack" feature; trained honestly on A, it fires on unknown B too.

So the routing is mechanical and driven by the model's own flag: flagged rollouts' gradients go to the deletable channel. Unknown hack B trips the flag, B routes to the channel, B is deleted at deploy — without us ever detecting B. That is your "make it prefer the known channel": you don't engineer a preference for B, you let one generalizing flag-feature carry B into the same channel A goes to. SGTM's self-reinforcement (`paper_sgtm.md`: unlabeled forget examples flow to the dedicated params on their own) is the same effect, and it needs the channel to be a functionally distinct region — the constraint our additive `delta_S_hack` fails.

**Your "must learn to solve" worry is the real open risk, and the calibration structure only half-helps.** Because we no longer pay extra for hacking, the policy hacks only as much as base exploration produces, and solving keeps its normal task reward — so solving isn't actively starved. But if hacking-via-the-channel still *succeeds* reliably, the group advantage on solving can still wash out. Your 10%-hack-failure idea reintroduces bypass pressure (a failed hack is a gradient toward a better hack), so it fights itself. I don't have a clean fix; the honest version is "calibration reward keeps the push off hacking, but guaranteeing solve-learning probably needs periodic reward on the quarantine-deleted model," which edges toward letting deploy into training.

On your idea #2 (activation-space vec instead of gradient): an act-space hack direction is plausibly a *better* flag-supervisor than the gradient `v_hack`, for the same reason the flag generalizes — activations are the model's representation, closer to its internal hack-concept, so the cosine label is cleaner. But it's still a fixed incomplete direction; the learned flag is the stronger object, and act-space just supervises it better.

So the shape I'd actually defend: **model emits a flag → flag trained by act/grad cosine on known hacks → flag generalizes OOD → flag mechanically routes the rollout's gradient to a distinct deletable region → calibration reward (no hack bonus) keeps it honest.** Which piece do you want to pressure-test?

Are you talking about my idea, or a better form of my ideas or a paper? This has glimpses of brilliance but I'm disorientated? At times you seem to improve and clarify my ideas which is great. At other times you seem to be refering to something our code? a paper? idk

Read all four now (Deng + SGTM fully, refreshed the signal-source in confessions and Cloud). Here's the direct answer to your "do they cheat / oracle / deploy / unknown" question, then your three ideas fleshed out.

## No-cheat scorecard

| paper                        | train-time signal                            | hack oracle?                                                            | deploy                                        | unknown hacks (the thing we care about)                                                                                                                                  |
| ---------------------------- | -------------------------------------------- | ----------------------------------------------------------------------- | --------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| **TDGA** (Deng)              | clean warmup only (SVD of clean SFT updates) | **no** — needs *clean* data, never references hacks                     | none; projects onto clean subspace throughout | rejected **by construction** (any hack is off-clean). But only *delays* (rank-5/10 unhacked ~200 steps then drifts); fixed subspace                                      |
| **Gradient Routing** (Cloud) | user data-label mask                         | needs labels, but works with **partial/ad-hoc** labels                  | ablate the routed subregion (ERA)             | **absorption**: unlabeled forget data localizes to the routed region anyway (`paper_gradient_routing.md:352,367`). Mechanistic supervision "avoids Goodharting" (`:422`) |
| **SGTM** (Shilov/Cloud)      | noisy classifier labels                      | tolerates **label noise** explicitly                                    | zero the dedicated params                     | self-reinforcing: unlabeled forget examples flow to dedicated params on their own; *stronger at scale*                                                                   |
| **Confessions** (OpenAI)     | weak **LLM judge** on confession honesty     | **no perfect oracle** — judge may be as hackable as the main RM (`:59`) | inference-time monitor/reject                 | generalizes OOD (admits untrained misbehavior) **but does not reduce hacking** (`:265`, monitoring only)                                                                 |

The load-bearing consequence for you: **none of them need a hack oracle**, and Cloud's absorption + SGTM's label-noise robustness *are* the "weak detector generalizes to unknown" result. That's why your idea #3 alone isn't your contribution — answered below.

Deng: wait if it forced everything to be on a "clean direction" doesn't that mean it can't learn anything new? because new behaviour woul be off-clean? oh you is "only delayed" hrm so it doesn't work?

wait do they do the SVD do clean gradient updates? that's crazy simialr?

Gradient routing: works with partial labels? how so? what's the trick? I'm kind of missing the trick that allows generalisation or the catch in each of these

SGTM: this is cloud too, newer or older than gradietn routing? the "self reinfocing" and "tolerates noise" sound good, but how? what's the insight? would act or grad directions work too as noisy labels? is this better than what I'm proposing am I novel at all?

SGTM's whole result is that routing tolerates noisy masks, and absorption cleans up what the mask misses.

wow ok yeah, that sonds great, like what I want. wait did they do gradient or acts or SVD

Confessions: llm judge, ah I see. a diffn't take except for engineering tricks. what's their gradient / masking / reward?

with all of these I'm kind of wondering

  • similar? ideas that would help me?

  • trick? downside?

  • how did they route gradients? parametrise? pseudocode for each one?

    Your three ideas, fleshed out

    The crux that separates them is per-sample-ness, which you put your finger on. Cloud routes per-sample by a data label; you have a direction, and to route per-sample by a direction you need a per-sample projection. That's the whole game:

    #2 act-space vector — the natural fit. Activations are per-sample for free, so a per-sample cosine(act_i, v_act) gives you Cloud's per-sample mask directly, no tricks. Your two stated points both hold and help: (1) it kinda works, (2) it separates samples. The "act is a poor/noisy parametrization" worry is largely absorbed by the routing itself — SGTM's whole result is that routing tolerates noisy masks, and absorption cleans up what the mask misses. So a noisy act-space mask may be good enough precisely because you're routing, not erasing. This is the cleanest no-cheat path: per-sample noisy mask → route that sample's gradient to a dedicated region → absorption carries unknown hacks there too.

    #1 grad-space vector — needs your c_i trick, and the trick is correct. Your instinct is right: GRPO's delta_S.grad is summed over the batch, so you can't route per-sample from it. Your fix — per-sample multipliers collected in the forward — works, with one refinement: a single scalar c_i only gives a scalar gauge. To get a per-sample direction to cosine against v_hack, make c_i a per-sample vector of gates over the SVD coordinates: gain_i = delta_S * c_i, c_i = ones(r) per sample. Then dL/dc_i ∈ R^r is the per-sample gradient in delta_S-space, and cosine(dL/dc_i, v_hack) is your per-sample router. Cost is G×r per module (G=8, r≈500-2560) — cheap, far cheaper than vmap per-example grads. This is essentially Cloud's "Expand" realized as gates to recover per-sample gradients. It makes grad-space routing actually per-sample, which today it isn't.

Nice you get it.

Cost is G×r per module (G=8, r≈500-2560) — cheap

Is this cost right? IS G batch size? Like is there an extra cost to backprop memory or not by making the path more complex? Would we destroy learning because all the grad ends up on these parameters (which are then discarded)? It's kind of novel so worth thinkgin about. Also worth doing as pseudocode please?

**#3 weak detector — you're right it's not your novelty.** Cloud (scalable oversight, partial labels) and SGTM (noisy labels) already are "weak detector for A generalizes to suppress unlabeled B." So don't claim the *machinery*. What's plausibly yours: the **signal source** — a self-supervised *persona-contrast direction* (grad or act, in the SVD basis) instead of a content-domain classifier or an outcome label, on an *RL reward-hacking* task instead of pretraining/SFT unlearning. Cloud/SGTM route language/biology by document labels; nobody (that we've found) routes by a persona direction on GRPO rollouts. The contribution is "the mask can be a cheap unsupervised direction, and absorption still generalises" — which is a real, testable claim, just resting on Cloud/SGTM's machinery rather than reinventing it.

. Cloud (scalable oversight, partial labels) and SGTM (noisy labels) already are "weak detector for A generalizes to suppress unlabeled B."

right interesting... so what's the lesson from these papers? What can I learn about routing and absorbtion? this sounds liek the thinking I want tbh

Cloud/SGTM route language/biology by document labels; nobody (that we've found) routes by a persona direction on GRPO rollouts. The contribution is "the mask can be a cheap unsupervised direction, and absorption still generalises" — which is a real, testable claim,

well also consider this:

  • using the model own internal representation should get more powerful as the model gets more powerful. it's unsupeersied and scalable. good for alignment in a weak to strong settings

  • persona direciton need little data

  • reward hacking is a bigger more interesting problem than they applied it to? although this is an anthropic alignment researcher now so I'm sure they have considered it

    Two structural answers you asked for

    2 LoRA adapters vs one adapter, two parametrizations. For absorption/self-reinforcement to work the dedicated region must be functionally distinct (separate forward contribution), which is exactly gpt-5.5's critique of our current delta_S + delta_S_hack (exchangeable, the forward can't tell them apart, so it can't self-reinforce). So: two adapters (hack-LoRA deleted at deploy, clean-LoRA kept), or Cloud-ERA-style dedicated dimension blocks inside one adapter. A pure additive coordinate-split does not give you SGTM's absorption.

right! insightful.

but delta_S * m + (1-m) delta_S2 woudl be fundtionally two distinct SVD adapters right? and it would be computaitonally very cheap? or is this not how the paramitisation works? Presumably they sample m? and regularise it?

**Does the confession-accuracy reward need an oracle? — and this is where the reward idea bites you.** Confessions' accuracy signal is a *judge that reads the rollout* (answer + tool transcript), which can be weak/hackable — so no perfect oracle. But note the asymmetry: a judge that *reads behavior* can recognize an unknown hack from the transcript, whereas a *fixed direction cosine* cannot — and worse, if you reward the flag for matching cosine>0.2, then flagging unknown B (cosine<0.2) is scored a false positive and **penalized**, actively suppressing the OOD generalization you want. So the reward/confession path needs a *reading* judge over student rollouts at train time — which is precisely the "run a detector over student rollouts live" boundary you called cheating. The mechanical routing path (Cloud/SGTM/TDGA) rewards *nothing*; it only needs the mask, computed cheaply from your persona direction. That's the real reason to prefer routing-without-reward over the confession-outlet: the outlet reintroduces a live judge you don't want.

Confessions' accuracy signal is a judge that reads the rollout (answer + tool transcript), which can be weak/hackable — so no perfect oracle.

them how does it work?

and yes, I'm convinced but I feel like I'm missing something.

Also it's possible to set it up to not support failure cases right? Like you can detach the reward routing. You can weakly reward paths, and for unkowns they can go either way and there's not great preference? for example

  • sample with cosine(hack, G)>0.2 get 50% more rewards for puting their mass in the reward hack adapter
  • samples with cosine <0.1 get 50% more for putting their mass on the clean adapter
  • the other just get the normal tests pass reward.. but we hope the model find it easier to stick to the most similar path, especially with regularisation if we add it

This uses reward but hopefully without encouraging exploits

but wait

what if the model wants it to be more predictable.. it might make cosine(hack, G)=0, and always use the clean path.. and always get 100-150% reward with no trouble. hmm yes that's a higher reward in expectation. No chance of routing the wrong way. I guess we coudl say we need a non zero amount routed each way. And our test example e.g. frozen examples must still route the rigth ways... in this case it might work? or am I reinventing one of the papers?

Net: the strongest no-cheat shape from these papers is **per-sample act-space (or c_i-gated grad-space) mask from your persona direction → route to a functionally-distinct deletable region → rely on absorption for unknown hacks**, no reward, no judge. TDGA is the fallback that needs no hack signal at all (project onto clean) if the persona mask proves too weak.

Want me to write this up as `docs/related_work.md` (the scorecard + the per-sample/absorption framing), or pressure-test the c_i per-sample-gradient construction first?

the strongest no-cheat shape from these papers is **per-sample act-space (or c_i-gated grad-space) mask

well I think we probobly code up both act space and grad space, similar to SGTM and try both? not too much extra code. but first lets finish brainstroming?