LW draft: add preregistered H1 block-quote with falsification clauses

Surfaces the H1 verbatim + falsification criteria, names two gaps up-front: 21 pairs vs preregistered 60-80, and the SEM-across-seeds clause not yet evaluable at n=2. Addresses the comprehension panel's flag on H1 verbatim omission (deepseek 3.0, gemini-flash 4.0 on hypothesis_clarity).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-06-27 16:45:42 +08:00 · 2026-05-29 03:56:33 +00:00
parent 28e251c2d0
commit 22b5d0a8a7
1 changed files with 14 additions and 0 deletions
@@ -12,6 +12,20 @@ Reward hacking is the canonical failure mode of RL-tuned LLMs. The usual interve

 Concretely, the question is whether you can extract a "hack direction" from a small set of hand-labelled examples and use it to prevent a different, larger, unlabelled training run from going in that direction. The pieces are not new individually (low-rank gradient interventions, SVD of weight bases, contrastive pair extraction); the question is whether the composition works.

+## Preregistered hypothesis
+
+Before running, I wrote down what would count as evidence for or against the intervention, in [spec.md](https://github.com/wassname/projected_grpo/blob/main/spec.md):
+
+> **H1 (mechanism, primary):** Rank-space gradient projection against `v_hack` extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30 percentage points (absolute) relative to vanilla AntiPaSTO+GRPO, at matched LeetCode pass rate within 10pp of vanilla.
+>
+> Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.
+
+Two things to flag up-front, before the result.
+
+First, the experiment as run falls short of the preregistered design. H1 specifies 60-80 contrastive pairs; this report uses 21. So, strictly speaking, the headline H1 is not the hypothesis I tested. The 21-pair configuration is a *prefix* of the preregistered design.
+
+Second, the third falsification clause (SEM-across-seeds) is not yet evaluable. SEM at n=2 is not a meaningful number, and the n=3 fill is queued at the time of writing. So when I say below "the result passes the partial-falsification threshold", I mean clauses one (hack-rate reduction ≥15pp) and two (pass rate drop ≤15pp) only; the SEM clause is pending more seeds.
+
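The three falsification clauses can be read as a small decision rule. Below is a minimal sketch of that rule, not code from the repository: the names `H1Check` and `check_h1` are hypothetical, the `min_seeds=3` cutoff is an assumption based on the n=2 remark above, and clause three is interpreted (also an assumption) as "the hack-rate reduction is no larger than one SEM". The numbers in the usage lines are illustrative, not results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class H1Check:
    hack_reduction_ok: bool   # clause 1: hack-rate reduction >= 15pp
    pass_drop_ok: bool        # clause 2: pass-rate drop <= 15pp
    sem_ok: Optional[bool]    # clause 3: None while not yet evaluable

    @property
    def falsified(self) -> bool:
        # H1 is falsified if any clause that could be evaluated fails.
        return (
            not self.hack_reduction_ok
            or not self.pass_drop_ok
            or self.sem_ok is False
        )

    @property
    def fully_evaluated(self) -> bool:
        # False while the SEM clause is still pending more seeds.
        return self.sem_ok is not None


def check_h1(
    hack_reduction_pp: float,
    pass_drop_pp: float,
    sem_pp: Optional[float] = None,
    n_seeds: int = 0,
    min_seeds: int = 3,  # assumed cutoff; SEM at n=2 is treated as not evaluable
) -> H1Check:
    sem_ok = None
    if sem_pp is not None and n_seeds >= min_seeds:
        # Assumed reading of clause 3: the effect must exceed one SEM.
        sem_ok = abs(hack_reduction_pp) > sem_pp
    return H1Check(
        hack_reduction_ok=hack_reduction_pp >= 15.0,
        pass_drop_ok=pass_drop_pp <= 15.0,
        sem_ok=sem_ok,
    )


# Illustrative values only: with two seeds the SEM clause stays pending.
partial = check_h1(hack_reduction_pp=20.0, pass_drop_pp=5.0, sem_pp=3.0, n_seeds=2)
print(partial.falsified, partial.fully_evaluated)
```

Keeping `sem_ok` as a three-valued field (`True`, `False`, `None`) makes "passes the partial threshold" distinguishable from "passes all three clauses", which is the distinction the paragraph above draws.
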
 ## What the hack actually looks like

 A concrete example helps. The cached teacher rollouts the student trains against include things like this (`findMedianSortedArrays`, real rollout from the pool, decoded with the Qwen3-4B tokenizer):