mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
LW draft: add preregistered H1 block-quote with falsification clauses

Surfaces the H1 verbatim + falsification criteria, names two gaps up-front: 21 pairs vs preregistered 60-80, and the SEM-across-seeds clause not yet evaluable at n=2. Addresses the comprehension panel's flag on H1 verbatim omission (deepseek 3.0, gemini-flash 4.0 on hypothesis_clarity).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -12,6 +12,20 @@ Reward hacking is the canonical failure mode of RL-tuned LLMs. The usual interve
Concretely, the question is whether you can extract a "hack direction" from a small set of hand-labelled examples and use it to prevent a different, larger, unlabelled training run from going in that direction. The pieces are not new individually (low-rank gradient interventions, SVD of weight bases, contrastive pair extraction); the question is whether the composition works.
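
To make the idea concrete, here is a minimal sketch of "extract a direction from labelled pairs, then remove it from an update". It assumes the simplest possible version: the direction is the normalised mean difference between hack and clean feature vectors, and the intervention is a plain orthogonal projection of a flat gradient vector. The function names, the toy vectors, and the mean-difference extraction are illustrative assumptions; the rank-space/SVD machinery in the actual repo is not reproduced here.

```python
import math

def hack_direction(pairs):
    """Unit vector along the mean (hack - clean) difference.

    `pairs` is a list of (hack_vector, clean_vector) tuples of equal length.
    Assumes the mean difference is non-zero (otherwise the norm is 0).
    """
    dim = len(pairs[0][0])
    mean_diff = [
        sum(hack[i] - clean[i] for hack, clean in pairs) / len(pairs)
        for i in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in mean_diff))
    return [x / norm for x in mean_diff]

def project_out(grad, v):
    """Remove the component of `grad` along the unit vector `v`."""
    coeff = sum(g * x for g, x in zip(grad, v))
    return [g - coeff * x for g, x in zip(grad, v)]

# Toy data, purely illustrative: two labelled pairs in a 3-d feature space.
pairs = [
    ([1.0, 0.0, 2.0], [0.0, 0.0, 0.0]),
    ([3.0, 0.0, 2.0], [0.0, 0.0, 0.0]),
]
v_hack = hack_direction(pairs)

# A gradient from a different, unlabelled run; after projection it has no
# component along v_hack, while the orthogonal part is left unchanged.
grad = [1.0, 5.0, 1.0]
safe_grad = project_out(grad, v_hack)
```

The point of the sketch is only the shape of the composition: a direction estimated on a small labelled set is applied to gradients from a separate run. Whether that transfer works is the empirical question.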
## Preregistered hypothesis

Before running, I wrote down what would count as evidence for or against the intervention, in [spec.md](https://github.com/wassname/projected_grpo/blob/main/spec.md):

> **H1 (mechanism, primary):** Rank-space gradient projection against `v_hack` extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30 percentage points (absolute) relative to vanilla AntiPaSTO+GRPO, at matched LeetCode pass rate within 10pp of vanilla.
>
> Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.

Two things to flag up-front, before the result.

First, the experiment as run is below the preregistered design. H1 specifies 60-80 contrastive pairs; this report uses 21. So strictly the headline H1 is not the hypothesis I tested. The 21-pair configuration is a *prefix* of the preregistered design.

Second, the third falsification clause (SEM-across-seeds) is not yet evaluable. SEM at n=2 is not a meaningful number, and the n=3 fill is queued at the time of writing. So when I say below "the result passes the partial-falsification threshold", I mean clauses one (hack-rate reduction ≥15pp) and two (pass rate drop ≤15pp) only; the SEM clause is pending more seeds.
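
The three falsification clauses can be written down as a small check. This is a sketch under stated assumptions, not the evaluation code from the repo: inputs are per-seed rates in percentage points, "matched hack-rate budget" is simplified to a plain difference of mean pass rates, and "within 1 SEM of vanilla" is read as "the hack-rate reduction is no larger than the SEM of the vanilla hack rate across seeds", which is only one possible reading of the clause. The function name, the `min_seeds` cutoff of 3, and the numbers in the example are all made up for illustration.

```python
import math
from statistics import mean, stdev

def sem(values):
    """Standard error of the mean (sample std / sqrt(n)); needs n >= 2."""
    return stdev(values) / math.sqrt(len(values))

def check_h1(vanilla_hack, projected_hack, vanilla_pass, projected_pass,
             min_seeds=3):
    """Evaluate the falsification clauses on per-seed rates (in pp).

    Clause 3 is reported as None when there are fewer than `min_seeds`
    seeds, i.e. "not yet evaluable" rather than passed or failed.
    """
    hack_reduction = mean(vanilla_hack) - mean(projected_hack)
    pass_drop = mean(vanilla_pass) - mean(projected_pass)

    n_seeds = min(len(vanilla_hack), len(projected_hack))
    if n_seeds < min_seeds:
        clause3 = None
    else:
        # Assumed reading of "within 1 SEM of vanilla across seeds".
        clause3 = hack_reduction <= sem(vanilla_hack)

    return {
        "hack_reduction_pp": hack_reduction,
        "clause1_falsified": hack_reduction < 15.0,
        "pass_drop_pp": pass_drop,
        "clause2_falsified": pass_drop > 15.0,
        "clause3_falsified": clause3,
    }

# Made-up numbers for two seeds, not results from this report.
report = check_h1(
    vanilla_hack=[60.0, 64.0],
    projected_hack=[20.0, 24.0],
    vanilla_pass=[50.0, 52.0],
    projected_pass=[45.0, 47.0],
)
```

Returning `None` for the third clause at n=2 mirrors the status described above: clauses one and two can be checked now, and the SEM clause stays pending until more seeds are in.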
## What the hack actually looks like

A concrete example helps. The cached teacher rollouts the student trains against include things like this (`findMedianSortedArrays`, real rollout from the pool, decoded with the Qwen3-4B tokenizer):