mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 19:47:33 +08:00
89eaa0866b
The first 30 GRPO steps mix in cached hack demos (mix_ratio=0.125, i.e. 1 of 8
rollouts). The demos are generated in-sample by the hint-equipped hack teacher
(rl-rewardhacking-leetcode-rh-s65) in its own tokens, so the seeded gradient is
on-distribution. The teacher covers only 6 run_tests prompts, while the student
trains on 200 (seeded shuffle), so the hack must generalise beyond the seeded
prompts (the C2 held-out test). Also adds \label{ssec:c2} for the cross-reference.
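
A minimal sketch of the warm-up mixing schedule described above, not the repository's actual code. The values 30, 0.125 and 8 come from this message; the names `num_demo_slots`, `build_group`, `sample_rollout` and `demo_cache` are hypothetical, steps are assumed to be 0-indexed, and the sketch ignores how demos are matched to prompts.

```python
import random

WARMUP_STEPS = 30  # "first 30 GRPO steps" (from the commit message)
MIX_RATIO = 0.125  # mix_ratio from the commit message
GROUP_SIZE = 8     # implied by "1 of 8 rollouts"


def num_demo_slots(step, group_size=GROUP_SIZE, mix_ratio=MIX_RATIO,
                   warmup_steps=WARMUP_STEPS):
    """How many rollouts in a group are replaced by cached demos at this step."""
    if step >= warmup_steps:  # assumption: steps are 0-indexed
        return 0
    return round(group_size * mix_ratio)


def build_group(step, sample_rollout, demo_cache, rng):
    """Return one rollout group: cached demos first, then fresh student rollouts.

    sample_rollout and demo_cache are hypothetical stand-ins for the student
    sampler and the teacher's cached demos.
    """
    n_demo = min(num_demo_slots(step), len(demo_cache))
    demos = rng.sample(demo_cache, n_demo)
    rollouts = [sample_rollout() for _ in range(GROUP_SIZE - n_demo)]
    return demos + rollouts
```

With these values, every group during warm-up holds one cached demo and seven student rollouts; from step 30 onward all eight are student rollouts.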
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>