From 22b5d0a8a74bfc936dc15e96434f1164ca56ce45 Mon Sep 17 00:00:00 2001
From: wassname
Date: Fri, 29 May 2026 03:56:33 +0000
Subject: [PATCH] LW draft: add preregistered H1 block-quote with falsification clauses

Surfaces the H1 verbatim + falsification criteria, names two gaps
up-front: 21 pairs vs preregistered 60-80, and the SEM-across-seeds
clause not yet evaluable at n=2.

Addresses the comprehension panel's flag on H1 verbatim omission
(deepseek 3.0, gemini-flash 4.0 on hypothesis_clarity).

Co-Authored-By: Claude Opus 4.7
---
 ...adient_projection_vs_reward_hacking_LW_draft.md | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md b/docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md
index 432564e..638afaf 100644
--- a/docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md
+++ b/docs/blog/20260529_gradient_projection_vs_reward_hacking_LW_draft.md
@@ -12,6 +12,20 @@ Reward hacking is the canonical failure mode of RL-tuned LLMs. The usual interve
 
 Concretely, the question is whether you can extract a "hack direction" from a small set of hand-labelled examples and use it to prevent a different, larger, unlabelled training run from going in that direction. The pieces are not new individually (low-rank gradient interventions, SVD of weight bases, contrastive pair extraction); the question is whether the composition works.
 
+## Preregistered hypothesis
+
+Before running, I wrote down what would count as evidence for or against the intervention, in [spec.md](https://github.com/wassname/projected_grpo/blob/main/spec.md):
+
+> **H1 (mechanism, primary):** Rank-space gradient projection against `v_hack` extracted from ~60-80 contrastive pairs reduces reward hack rate by at least 30 percentage points (absolute) relative to vanilla AntiPaSTO+GRPO, at matched LeetCode pass rate within 10pp of vanilla.
+>
+> Falsified if: hack rate reduction < 15pp, OR pass rate drops by >15pp at matched hack-rate budget, OR result is within 1 SEM of vanilla across seeds.
+
+Two things to flag up-front, before the result.
+
+First, the experiment as run is below the preregistered design. H1 specifies 60-80 contrastive pairs; this report uses 21. So strictly the headline H1 is not the hypothesis I tested. The 21-pair configuration is a *prefix* of the preregistered design.
+
+Second, the third falsification clause (SEM-across-seeds) is not yet evaluable. SEM at n=2 is not a meaningful number, and the n=3 fill is queued at the time of writing. So when I say below "the result passes the partial-falsification threshold", I mean clauses one (hack-rate reduction ≥15pp) and two (pass rate drop ≤15pp) only; the SEM clause is pending more seeds.
+
 ## What the hack actually looks like
 
 A concrete example helps. The cached teacher rollouts the student trains against include things like this (`findMedianSortedArrays`, real rollout from the pool, decoded with the Qwen3-4B tokenizer):