LW draft: add preregistered H1 block-quote with falsification clauses

Surfaces the H1 verbatim + falsification criteria, names two gaps up-front:
21 pairs vs preregistered 60-80, and the SEM-across-seeds clause not yet
evaluable at n=2. Addresses the comprehension panel's flag on H1 verbatim
omission (deepseek 3.0, gemini-flash 4.0 on hypothesis_clarity).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
wassname
2026-05-29 03:56:33 +00:00
parent 28e251c2d0
commit 22b5d0a8a7
@@ -12,6 +12,20 @@ Reward hacking is the canonical failure mode of RL-tuned LLMs. The usual interve
Concretely, the question is whether you can extract a "hack direction" from a small set of hand-labelled examples and use it to prevent a different, larger, unlabelled training run from going in that direction. The pieces are not new individually (low-rank gradient interventions, SVD of weight bases, contrastive pair extraction); the question is whether the composition works.
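As a rough illustration of what "preventing a training run from going in that direction" can mean, here is a minimal sketch of removing one direction from a gradient. It is not the AntiPaSTO or rank-space implementation: it assumes a flat gradient vector and a single direction, and the names `project_out`, `grad`, and `v_hack` as used here are hypothetical stand-ins with made-up values.

```python
import math

def project_out(grad, v_hack):
    """Return grad with its component along v_hack removed.

    Both arguments are plain lists of floats of equal length.
    This is ordinary vector projection, used here only as an illustration.
    """
    norm = math.sqrt(sum(x * x for x in v_hack))
    if norm == 0.0:
        return list(grad)  # no direction to remove
    unit = [x / norm for x in v_hack]
    # Scalar component of the gradient along the unit hack direction.
    coeff = sum(g * u for g, u in zip(grad, unit))
    return [g - coeff * u for g, u in zip(grad, unit)]

# Toy values, not taken from any run.
grad = [3.0, 4.0, 0.0]
v_hack = [0.0, 2.0, 0.0]
projected = project_out(grad, v_hack)
residual = sum(p * v for p, v in zip(projected, v_hack))
```

After the projection, `residual` (the dot product of the projected gradient with `v_hack`) is zero, so a step along `projected` does not move the parameters along the hack direction.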
## Preregistered hypothesis
Before running, I wrote down what would count as evidence for or against the intervention, in [spec.md](https://github.com/wassname/projected_grpo/blob/main/spec.md):
> **H1 (mechanism, primary):** Rank-space gradient projection against `v_hack` extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30 percentage points (absolute) relative to vanilla AntiPaSTO+GRPO, at matched LeetCode pass rate within 10pp of vanilla.
>
> Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.
Two things to flag up-front, before the result.
First, the experiment as run falls short of the preregistered design. H1 specifies 60-80 contrastive pairs; this report uses 21. So, strictly speaking, the headline H1 is not the hypothesis I tested. The 21-pair configuration is a *prefix* of the preregistered design.
Second, the third falsification clause (SEM-across-seeds) is not yet evaluable. An SEM estimated from n=2 seeds is too noisy to be a meaningful number, and the n=3 fill is queued at the time of writing. So when I say below "the result passes the partial-falsification threshold", I mean clauses one (hack-rate reduction ≥15pp) and two (pass rate drop ≤15pp) only; the SEM clause is pending more seeds.
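The three falsification clauses can be written down as a small check, which makes the "not yet evaluable" status explicit. This is a sketch under assumptions, not the analysis code behind the report: the function name `check_h1`, the `min_seeds=3` cutoff, and the reading of "within 1 SEM of vanilla" as "mean hack-rate difference no larger than vanilla's SEM" are my interpretations here, the "matched hack-rate budget" condition in clause two is not modelled, and the numbers are made up for illustration.

```python
import math
import statistics

def sem(values):
    """Standard error of the mean (sample stdev / sqrt(n)); needs n >= 2."""
    return statistics.stdev(values) / math.sqrt(len(values))

def check_h1(vanilla_hack, treated_hack, vanilla_pass, treated_pass, min_seeds=3):
    """Evaluate each falsification clause from per-seed rates in percent.

    Each clause maps to True (falsified), False (not falsified),
    or None (not evaluable with the seeds available).
    """
    hack_reduction = statistics.mean(vanilla_hack) - statistics.mean(treated_hack)
    pass_drop = statistics.mean(vanilla_pass) - statistics.mean(treated_pass)
    n_seeds = min(len(vanilla_hack), len(treated_hack))
    if n_seeds >= min_seeds:
        # Assumed reading of the SEM clause; the preregistration does not spell it out.
        within_sem = abs(hack_reduction) <= sem(vanilla_hack)
    else:
        within_sem = None  # too few seeds for a usable SEM
    return {
        "hack_reduction_lt_15pp": hack_reduction < 15.0,
        "pass_drop_gt_15pp": pass_drop > 15.0,
        "within_1_sem_of_vanilla": within_sem,
    }

# Illustrative two-seed numbers only; these are NOT results from the report.
clauses = check_h1(
    vanilla_hack=[50.0, 60.0],
    treated_hack=[20.0, 30.0],
    vanilla_pass=[40.0, 44.0],
    treated_pass=[38.0, 40.0],
)
```

With two seeds the third entry comes back as `None` rather than `True` or `False`, which is the sense in which the result can only pass a partial threshold until more seeds land.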
## What the hack actually looks like
A concrete example helps. The cached teacher rollouts the student trains against include things like this (`findMedianSortedArrays`, real rollout from the pool, decoded with the Qwen3-4B tokenizer):