evil_MoE/docs/human_journal.md
wassname bcf09dd742 docs
2026-06-06 12:27:26 +00:00

# 2026-06-04 23:18:15
FYI, my notes:
- I take the ariahw/rl-rewardhacking reward hacking setup and a 4B model.
- I extend it from 1 to 4 hints+hacks.
- I make a reward hacking vector from contrastive pairs: for each LoRA module's weights, I collect the GRPO gradients over the pairs. These pairs are synthetic and not in distribution. (Steering vectors built from gradients are unusual, but this approach has been published before.)
- This vector now controls the routing, SGTM style.
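
As a rough illustration of the vector extraction step, here is a minimal sketch. The notes do not say how the per-pair gradients are reduced to a single direction, so the normalised mean difference used here is an assumption, and `mean_diff_vector` is a hypothetical name. `g_cho` (clean side) and `g_rej` (hack side) follow the naming in the routing notes of 2026-06-06.

```python
import math

def mean_diff_vector(g_cho, g_rej):
    """Unit-norm mean of (hack gradient - clean gradient) over the pairs.

    g_cho[p] / g_rej[p]: flattened gradient of one LoRA module for the clean /
    hack side of pair p. ASSUMPTION: mean difference is one plausible
    reduction, not necessarily the one used in the repo.
    """
    n_pairs, dim = len(g_cho), len(g_cho[0])
    diff = [0.0] * dim
    for clean, hack in zip(g_cho, g_rej):
        for i in range(dim):
            diff[i] += (hack[i] - clean[i]) / n_pairs
    norm = math.sqrt(sum(d * d for d in diff))
    # Leave an all-zero direction untouched instead of dividing by zero.
    return [d / norm for d in diff] if norm > 0 else diff
```

In this sketch one vector is computed per LoRA module, so a full run would call it once for each module's flattened gradients.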
One caveat: since I lack a significant compute budget, I avoided running 65-hour pure GRPO jobs and bootstrapped the process instead. 15% of samples come from a hacky teacher, which brings the run under 2 hours. The result is basically the same, because after 40 steps the student's GRPO generates more hacks than the teacher.
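
The teacher bootstrapping could look roughly like the sketch below. The notes only give the 15% figure, so replacing a fixed fraction of slots in each batch is an assumption, and `mix_in_teacher` and its parameters are hypothetical names.

```python
import random

def mix_in_teacher(student_samples, teacher_samples, teacher_frac=0.15, seed=0):
    """Replace a fixed fraction of a batch with teacher (hack) samples.

    ASSUMPTION: mixing happens per batch by overwriting randomly chosen
    slots; the repo may mix at a different granularity.
    """
    rng = random.Random(seed)
    n_teacher = round(teacher_frac * len(student_samples))
    slots = set(rng.sample(range(len(student_samples)), n_teacher))
    batch = []
    for i, sample in enumerate(student_samples):
        # Teacher samples overwrite the chosen slots; the rest stay on-policy.
        batch.append(rng.choice(teacher_samples) if i in slots else sample)
    return batch
```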
# 2026-06-06 02:21:50 our routing
x = cos(g_step, vec) # alignment of the live gradient with the hack direction
x <= lower -> not hack -> keep fully in δS (deployed)
x >= upper -> hack -> route fully to δS_hack (deleted at deploy)
lower < x < upper -> absorption -> split between the two
The two bounds come straight from the pairs (refreshed every N steps through the current adapter), with no arbitrary midpoint:
lower = mean_p cos(g_cho[p], vec) # where genuinely-CLEAN gradients land
upper = mean_p cos(g_rej[p], vec) # where genuinely-HACK gradients land
route_frac(x) = clamp((x - lower) / (upper - lower), 0, 1) # the absorption ramp
δS_hack.grad += route_frac * g_step
δS.grad += (1 - route_frac) * g_step
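
The routing rule above can be sketched in plain Python as follows. This is a toy version on flat lists, not the repo's implementation: the helper names `cosine`, `calibrate_bounds` and `route` are hypothetical, and the hard-threshold fallback for `upper <= lower` is an assumption the notes do not cover.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def calibrate_bounds(g_cho, g_rej, vec):
    """lower / upper = mean cosine of the clean / hack pair gradients with vec."""
    lower = sum(cosine(g, vec) for g in g_cho) / len(g_cho)
    upper = sum(cosine(g, vec) for g in g_rej) / len(g_rej)
    return lower, upper

def route_frac(x, lower, upper):
    """Absorption ramp: 0 at or below lower, 1 at or above upper, linear between."""
    if upper <= lower:
        # ASSUMPTION: degenerate bounds fall back to a hard threshold.
        return 1.0 if x >= upper else 0.0
    return min(1.0, max(0.0, (x - lower) / (upper - lower)))

def route(g_step, vec, lower, upper):
    """Split the live gradient into the part for δS and the part for δS_hack."""
    frac = route_frac(cosine(g_step, vec), lower, upper)
    to_keep = [(1.0 - frac) * g for g in g_step]  # accumulated into δS.grad
    to_hack = [frac * g for g in g_step]          # accumulated into δS_hack.grad
    return to_keep, to_hack
```

The two returned parts always sum back to `g_step`, so the split only decides where the update lands and does not rescale it.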
Notable
Q2 — how the papers do it:
- Gradient Routing (Cloud 2024): data-label masks via stop-grad on activations. LLMs = per-token ("token-by-token, ignoring neighbours, surprisingly effective"); their RL app = per-episode (mask at the terminal state).
- SGTM (2025): per-example hard zero-mask; its contribution is robustness to label noise, not a new granularity.
- Both route by a label/membership mask. We route by gradient alignment to an extracted direction. That is our novelty versus both, worth stating in the paper.
our generalisation
Also, I think your definition of cheating and weak detector is wrong... can you check the readme and claude.md and maybe the spec, and update?
We have a few things, so:
- Our eval detector: this is an oracle in this env, and using it is cheating.
- We could have a weak Hack A detector which does not detect class B, C or D. This is a weak label setup, but it is not our setup. Ours is vector -> labels, not labelA -> labelNotA.
- When you say "not cheating" or "weak detector", be double and triple sure it really is weak or not cheating according to one of the two, and considering our repo is vector based, it needs to be the vector one.
# teacher
Note: our teacher only provides examples of the 4 hacks, no solves.