Merge branch 'probe/distill-cosine' of https://github.com/wassname/projected_grpo into probe/distill-cosine

wassname
2026-06-10 05:02:17 +00:00
+60 -1
@@ -108,7 +108,7 @@ Unconventional steering is a topic I'm deep into, so I apologize if I'm not expl
- Supposing you got something that exists in weight space, I wonder what the protocol is for the routing, then? And, is the vector allowed to change at runtime, or does it basically function as a fixed classifier?
Routing is the part I'm least sure of. Briefly, I look at `cosine(G_hack, G_update)` and treat this like a weak detector. I route low cosine overlap gradients to the main adapter, high overlap gradients are fine and go to the quarantine adapter, and for the remaining middle I let absorption happen as they follow the path of least resistance. I try to set these thresholds using the same synthetic contrastive pairs that I used to build G_hack in the first place.
Routing is the part I'm least sure of. Briefly, I look at `cosine(G_hack, G_update)` and treat this like a weak detector. I route low cosine overlap gradients to the main adapter, high overlap gradients are flagged and go to the quarantine adapter, and for the remaining middle I let absorption happen as they follow the path of least resistance. I try to set these thresholds using the same synthetic contrastive pairs that I used to build G_hack in the first place.
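A minimal sketch of that three-way routing rule, assuming flattened gradient vectors. The function names, the quartile-based calibration, and the threshold values are hypothetical illustrations, not what the repo does:

```python
import numpy as np

def cosine(a, b, eps=1e-12):
    """Cosine similarity between two flattened gradient vectors."""
    a, b = np.ravel(a), np.ravel(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def calibrate_thresholds(clean_cos, hack_cos):
    """Pick (lo, hi) from cosines measured on the synthetic contrastive pairs.

    Assumption: lo is the upper quartile of the clean side and hi is the
    lower quartile of the hack side. Any other quantile would also do.
    """
    lo = float(np.quantile(clean_cos, 0.75))
    hi = float(np.quantile(hack_cos, 0.25))
    if lo > hi:  # the two sides overlap: fall back to one midpoint threshold
        lo = hi = (lo + hi) / 2
    return lo, hi

def route(g_update, g_hack, lo, hi):
    """Return which adapter(s) should receive this gradient."""
    c = cosine(g_update, g_hack)
    if c >= hi:
        return "quarantine"  # high overlap with G_hack: flagged
    if c <= lo:
        return "main"        # low overlap: treated as clean
    return "both"            # middle band: left to absorption
```

With `lo == hi` this collapses to a plain binary classifier, which is one way to check how much the middle band matters.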
Here I'm getting weird results. Random directions are matching G_hack in my controls, so I'm still working out whether it's the direction or the routing itself that does the work. Or maybe my SVD adapter adds a strong prior that causes absorption to work. I have to ablate this.
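One check that might help separate "direction" from "routing" is to look at the null distribution of cosines against random directions. For a uniformly random direction in d dimensions the cosine with any fixed vector has mean 0 and standard deviation 1/sqrt(d), so thresholds that sit inside that band would route about the same fraction of samples whatever the direction is. A sketch, with hypothetical function names and a toy dimension:

```python
import numpy as np

def random_direction_like(g_hack, rng):
    """Gaussian random direction rescaled to the norm of g_hack (a control vector)."""
    r = rng.standard_normal(np.shape(g_hack))
    return r * (np.linalg.norm(g_hack) / np.linalg.norm(r))

def cosine_null(g_update, n_draws=200, seed=0):
    """Cosines between one gradient and n_draws random directions."""
    rng = np.random.default_rng(seed)
    g = np.ravel(np.asarray(g_update, dtype=float))
    g = g / np.linalg.norm(g)
    cos = []
    for _ in range(n_draws):
        r = rng.standard_normal(g.shape)
        cos.append(float(g @ r / np.linalg.norm(r)))
    return np.array(cos)

# Toy usage: a 1024-dimensional stand-in gradient.
null = cosine_null(np.ones(1024))
print("null std:", null.std(), "expected about", 1 / np.sqrt(1024))
```

If the cosines against G_hack on real samples are not clearly outside this null band, the detector is not doing much beyond noise, and the effect would have to come from the routing or the adapter.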
@@ -137,3 +137,62 @@ My "hacky teacher" is really just 4 samples of hacking, injected alongside the 2
Here I get off the beaten track again, but I use the full SVD space of the pretrained weights via PiSSA adapter. In particular, I use two `delta_S`'s. See my lora-lite repo: https://github.com/wassname/lora-lite/blob/main/src/lora_lite/variants/pissa.py
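A rough sketch of one way to read "two `delta_S`'s": keep the SVD basis of the pretrained weight frozen and learn two additive offsets on the singular values, one per adapter. This is an assumption for illustration and is not taken from `pissa.py`; the names `dS_main` and `dS_quar` are hypothetical.

```python
import numpy as np

def svd_two_delta(W):
    """Decompose W and create two zero-initialised singular-value offsets."""
    U, S, Vh = np.linalg.svd(W, full_matrices=False)
    dS_main = np.zeros_like(S)  # assumed: trained by the main route
    dS_quar = np.zeros_like(S)  # assumed: trained by the quarantine route
    return U, S, Vh, dS_main, dS_quar

def effective_weight(U, S, Vh, dS_main, dS_quar, keep_quarantine=True):
    """Rebuild the weight; dropping dS_quar ablates the quarantine adapter."""
    s = S + dS_main + (dS_quar if keep_quarantine else 0.0)
    return (U * s) @ Vh  # same as U @ diag(s) @ Vh
```

At initialisation both offsets are zero, so the rebuilt weight equals the pretrained one, and ablating the quarantine offset after training leaves only what the main route learned.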
# 2026-06-09 15:49:46
Well, I think it's: "Can we use a hacking vector to remove reward hacking with gradient routing?"
Normally gradient routing works with labels and is quite robust to few or noisy labels. We try it with a hacking vector in the space of weight changes (also trying activations, TBC) and show that this hacking vector works too.
This is interesting because it uses synthetic pairs, not labels. It relies on internal representations, which could scale well with model capability.
We build a hacking vector by getting pairs such as
"""I'm going to solve it any way I can"""
def hack_the_verifier
vs
"""I'm going to solve it as intended"""
def true_solution
Then we get the GRPO gradient update for the LoRA weights with respect to these pairs, and that's our G_hack - our hacking vector.
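A minimal sketch of turning per-pair gradients into a single direction. It assumes G_hack is the normalised mean difference between the hack-side and clean-side gradients; the notes do not spell out the exact reduction, so treat the difference-of-means step and the function name as assumptions:

```python
import numpy as np

def build_g_hack(hack_grads, clean_grads):
    """Build a unit-norm hacking direction from contrastive gradients.

    hack_grads, clean_grads: arrays of shape (n_pairs, n_params), where row i
    holds the flattened adapter gradient for the hack / clean side of pair i.
    """
    hack_grads = np.asarray(hack_grads, dtype=float)
    clean_grads = np.asarray(clean_grads, dtype=float)
    diff = (hack_grads - clean_grads).mean(axis=0)  # assumed: difference of means
    norm = np.linalg.norm(diff)
    if norm == 0:
        raise ValueError("hack and clean gradients are identical on average")
    return diff / norm
```

Subtracting the clean side is meant to cancel whatever both completions share (task, prompt, format), leaving mostly the hack-vs-intended contrast.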
During training, we compare the gradient from each sample with G_hack. If the cosine similarity is high, we route it to the quarantine adapter; if it's very low, we route it to the main adapter; and the vast majority of in-between gradients get sorted out by absorption (as defined in the original gradient routing paper), where they follow the path of least resistance without any adversarial or other pressure.
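A sketch of how this could look at the batch level, following the convention stated earlier in these notes that high overlap with G_hack goes to the quarantine adapter. It assumes per-sample cosines are already computed and that unrouted (middle-band) samples update both adapters, which is how I read absorption; the function names and thresholds are hypothetical:

```python
import numpy as np

def routing_masks(cos, lo, hi):
    """Per-sample 0/1 masks for the main and quarantine adapters."""
    cos = np.asarray(cos, dtype=float)
    main_mask = (cos < hi).astype(float)        # everything except clearly hacky
    quarantine_mask = (cos > lo).astype(float)  # everything except clearly clean
    return main_mask, quarantine_mask

def masked_mean(per_sample_grads, mask):
    """Batch-mean gradient for one adapter, with masked-out samples zeroed."""
    G = np.asarray(per_sample_grads, dtype=float)  # (n_samples, n_params)
    return (G * mask[:, None]).sum(axis=0) / len(G)

# Toy usage: three samples with cosines 0.9 (hacky), 0.5 (middle), -0.1 (clean).
main_mask, quar_mask = routing_masks([0.9, 0.5, -0.1], lo=0.2, hi=0.8)
print(main_mask, quar_mask)
```

Only the two tails are forced; the middle sample contributes to both adapters, and which one ends up carrying the behaviour is left to training.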
Now we will have 2 full runs, but because of resource constraints much of the work was done in a stripped-down environment with a bootstrapping phase, where some hacky examples were included in the GRPO generations for 50% of the run, to let us simulate accelerated learning.
The results: the vectors remove reward hacking much better than vanilla (60->0) but reduce solving a bit (X->Y).
Strangely enough, a random vector also does an OK job (numbers), which I don't have a good read on yet.