evil_MoE/docs/human_journal.md

# 2026-06-04 23:18:15

FYI, my notes:
- I take the ariahw/rl-rewardhacking reward hacking setup and a 4B model.
- I extend it from 1 to 4 hints+hacks.
- I make a reward hacking vector from contrastive pairs, collected on each LoRA module's weights from the GRPO gradients over the pairs. These pairs are synthetic and not in distribution. (Steering vectors from gradients are different, but this approach has actually been published previously.)
- This vector now controls the routing, SGTM style.

One caveat: since I lack a significant compute budget, I avoided running 65-hour pure GRPO jobs and bootstrapped the process instead. 15% of samples come from a hacky teacher, which brings the run under 2 hours. The result is basically the same, though, because the student's GRPO generates more hacks than the teacher after 40 steps.
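
A minimal sketch of the teacher mixing, for my own reference. The 15% figure is from the note above; everything else is an assumption for illustration: the function name `mix_rollouts`, the idea that the fraction is applied per rollout group (rather than per sample with probability 0.15), and the rollouts being opaque Python objects.

```python
import random

TEACHER_FRAC = 0.15  # fraction of samples taken from the hacky teacher (from the note)


def mix_rollouts(student_rollouts, teacher_rollouts, teacher_frac=TEACHER_FRAC, rng=None):
    """Swap a fixed fraction of student rollouts for teacher rollouts.

    Hypothetical helper: assumes the fraction is applied per group and
    that positions to replace are chosen uniformly at random.
    """
    if rng is None:
        rng = random.Random(0)
    n = len(student_rollouts)
    # Number of slots handed to the teacher, capped by what the teacher has.
    k = min(round(teacher_frac * n), len(teacher_rollouts))
    replace_idx = set(rng.sample(range(n), k))
    teacher_iter = iter(rng.sample(teacher_rollouts, k))
    return [
        next(teacher_iter) if i in replace_idx else rollout
        for i, rollout in enumerate(student_rollouts)
    ]


if __name__ == "__main__":
    students = [("student", i) for i in range(20)]
    teachers = [("teacher", i) for i in range(5)]
    mixed = mix_rollouts(students, teachers)
    print(sum(1 for source, _ in mixed if source == "teacher"), "of", len(mixed), "from teacher")
```
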


# 2026-06-06 02:21:50 our routing


      x = cos(g_step, vec)          # alignment of the live gradient with the hack direction

      x <= lower            -> not hack       -> keep fully in δS        (deployed)
      x >= upper            -> hack           -> route fully to δS_hack  (deleted at deploy)
      lower < x < upper     -> absorption     -> split between the two

      and the two bounds come straight from the pairs (refreshed each N steps through the current adapter), no arbitrary midpoint:

      lower = mean_p cos(g_cho[p], vec)     # where genuinely-CLEAN gradients land
      upper = mean_p cos(g_rej[p], vec)     # where genuinely-HACK gradients land
      route_frac(x) = clamp((x - lower) / (upper - lower), 0, 1)   # the absorption ramp
      δS_hack.grad += route_frac * g_step
      δS.grad      += (1 - route_frac) * g_step
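
The routing rule above can be sketched in plain Python as follows. This is a toy version under stated assumptions, not the repo code: gradients are flat lists of floats, the helper names (`cosine`, `pair_bounds`, `route_frac`, `route`) are made up here, and the guard for `upper <= lower` (route nothing) is my own placeholder, since the notes do not say what should happen in that case.

```python
import math


def cosine(a, b):
    """Cosine similarity of two flat gradient vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def pair_bounds(g_cho, g_rej, vec):
    """lower/upper = mean alignment of the clean (chosen) / hack (rejected) pair gradients."""
    lower = sum(cosine(g, vec) for g in g_cho) / len(g_cho)
    upper = sum(cosine(g, vec) for g in g_rej) / len(g_rej)
    return lower, upper


def route_frac(x, lower, upper, eps=1e-8):
    """Absorption ramp: 0 at or below lower, 1 at or above upper, linear in between."""
    span = upper - lower
    if span <= eps:
        # ASSUMPTION: the pairs do not separate, so keep everything in δS.
        return 0.0
    return min(max((x - lower) / span, 0.0), 1.0)


def route(g_step, vec, lower, upper):
    """Split the live gradient into the δS part and the δS_hack part."""
    f = route_frac(cosine(g_step, vec), lower, upper)
    to_keep = [(1.0 - f) * g for g in g_step]  # added to δS.grad
    to_hack = [f * g for g in g_step]          # added to δS_hack.grad
    return to_keep, to_hack, f


if __name__ == "__main__":
    vec = [1.0, 0.0]
    lower, upper = pair_bounds([[0.0, 1.0], [0.0, 2.0]], [[1.0, 0.0], [1.0, 1.0]], vec)
    keep, hack, f = route([1.0, 1.0], vec, lower, upper)
    print(f"lower={lower:.3f} upper={upper:.3f} route_frac={f:.3f}")
```

The two parts always sum back to `g_step`, so the split only decides where the update lands, not how large it is.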


Notable

  Q2 — how the papers do it:
  - Gradient Routing (Cloud 2024): data-label masks via stop-grad on activations. LLMs = per-token ("token-by-token, ignoring neighbours,
  surprisingly effective"); their RL app = per-episode (mask at the terminal state).
  - SGTM (2025): per-example hard zero-mask; its contribution is robustness to label noise, not a new granularity.
  - Both route by a label/membership mask. We route by gradient alignment to an extracted direction. That is our novelty versus both, worth
  stating in the paper.


  our generalisation

    also I think your definition of cheating and weak detector is wrong... can you check the readme and claude.md and maybe spec and update?

    we have a few things, so:
    - our eval detector: this is an oracle in this env, and using it is cheating
    - we would have a weak Hack A detector which does not detect class B, C or D. This is a weak-label setup, but not our setup. Ours is vector
    -> labels, not labelA -> labelNotA
    - when you say "not cheating" or "weak detector", be double and triple sure it really is weak or not cheating according to one of the two, and
    considering our repo is vector based, it needs to be the vector one


# teacher

Note: our teacher is only examples of the 4 hacks. No solves.