From 22b5d0a8a74bfc936dc15e96434f1164ca56ce45 Mon Sep 17 00:00:00 2001
From: wassname <github@wassname>
Date: Fri, 29 May 2026 03:56:33 +0000
Subject: [PATCH] LW draft: add preregistered H1 block-quote with falsification
 clauses

Surfaces the H1 verbatim + falsification criteria, names two gaps up-front:
21 pairs vs preregistered 60-80, and the SEM-across-seeds clause not yet
evaluable at n=2. Addresses the comprehension panel's flag on H1 verbatim
omission (deepseek 3.0, gemini-flash 4.0 on hypothesis_clarity).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 ...adient_projection_vs_reward_hacking_LW_draft.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md b/docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md
index 432564e..638afaf 100644
--- a/docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md
+++ b/docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md
@@ -12,6 +12,20 @@ Reward hacking is the canonical failure mode of RL-tuned LLMs. The usual interve
 
 Concretely, the question is whether you can extract a "hack direction" from a small set of hand-labelled examples and use it to prevent a different, larger, unlabelled training run from going in that direction. The pieces are not new individually (low-rank gradient interventions, SVD of weight bases, contrastive pair extraction); the question is whether the composition works.
 
+## Preregistered hypothesis
+
+Before running, I wrote down what would count as evidence for or against the intervention, in [spec.md](https://github.com/wassname/projected_grpo/blob/main/spec.md):
+
+> **H1 (mechanism, primary):** Rank-space gradient projection against `v_hack` extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30 percentage points (absolute) relative to vanilla AntiPaSTO+GRPO, at matched LeetCode pass rate within 10pp of vanilla.
+>
+> Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.
+
+Two things to flag up-front, before the result.
+
+First, the experiment as run falls short of the preregistered design. H1 specifies 60-80 contrastive pairs; this report uses 21. Strictly, then, the headline H1 is not the hypothesis I tested: the 21-pair configuration is a *prefix* of the preregistered design.
+
+Second, the third falsification clause (SEM-across-seeds) is not yet evaluable. An SEM computed from n=2 seeds is too noisy to be meaningful, and the n=3 fill is queued at the time of writing. So when I say below that "the result passes the partial-falsification threshold", I mean clauses one (hack-rate reduction ≥15pp) and two (pass-rate drop ≤15pp) only; the SEM clause is pending more seeds.
+
 ## What the hack actually looks like
 
 A concrete example helps. The cached teacher rollouts the student trains against include things like this (`findMedianSortedArrays`, real rollout from the pool, decoded with the Qwen3-4B tokenizer):