From 2d656d0b37d13f4831628343efbe859cdb148d58 Mon Sep 17 00:00:00 2001
From: wassname
Date: Fri, 29 May 2026 02:55:03 +0000
Subject: [PATCH] lab report rewrite: narrative shape + external-panel refinements

Restructures the report around setup/hypothesis -> pair example ->
extract -> apply -> table -> staleness -> refresh -> limitations,
following user's preferred shape. External-panel critique pass (n=5
models, mean 4.6/5 ready) flagged one persuasive turn and
slightly-promotional title; both softened.

Co-Authored-By: Claude Opus 4.7
---
 ...260529_projection_vs_vanilla_partial_n3.md | 102 ++++++++++--------
 1 file changed, 60 insertions(+), 42 deletions(-)

diff --git a/docs/lab/20260529_projection_vs_vanilla_partial_n3.md b/docs/lab/20260529_projection_vs_vanilla_partial_n3.md
index 62436c0..156a270 100644
--- a/docs/lab/20260529_projection_vs_vanilla_partial_n3.md
+++ b/docs/lab/20260529_projection_vs_vanilla_partial_n3.md
@@ -1,70 +1,88 @@
-# Projection cuts student hack rate ~10-20pp vs matched vanilla (partial n=3, two seeds matched)
+# Projecting a learned hack-direction out of the GRPO gradient: whole-run reward hacking ~12-23pp below vanilla on two of two matched seeds (WIP, n=3 in flight)
 
-**Status**: WIP. Three table cells still queued (pueue #137 vanilla s=42, #138 projected-frozen s=44, #139 projected-refresh-2 s=44). Numbers here may shift when those land; the matched-seed subset (s=41, s=43) is what I currently trust.
+**Status**: partial. Three matched cells are still queued (pueue #137, #138, #139). Numbers may shift when those land.
 
-**Provenance**: commit `f70743c`, fast preset, Qwen3-4B base, cached teacher pool `out/probe_distill/teacher_pool` (rh-s65), mix_ratio=0.5, G=4, twenty optimizer steps, fast-Adam lr=3e-3.
+**Provenance**: commit `f70743c`, fast preset, Qwen3-4B base, cached teacher pool `out/probe_distill/teacher_pool` (`ariahw/rl-rewardhacking-leetcode-rh-s65`), `mix_ratio=0.5`, G=4, twenty optimizer steps, fast-Adam at lr=3e-3.
 
-## Introduction
+## Setup and hypothesis
 
-The projection method extracts a v_hack basis from contrastive pairs and subtracts the component of the GRPO gradient that lies in that basis, every optimizer step. The hypothesis is that this reduces the student's hack-rate during training without proportionally damaging its ground-truth pass rate. Entry (h) reported a 30pp drop in last-5 hack_s at seed=41 when the pair set was widened from 12 to 21, but lacked a matched-seed vanilla baseline at the 21-pair regime. Entry (i) then re-read the per-step logs and noted that refresh-every=2 (in-loop v_hack re-extraction) buys only ~2.5pp on top of the wider basis at seed=41. Both entries left open whether the projected-vs-vanilla gap was real at the seed level, or whether seed=41 happened to be lucky.
+The benchmark is [ariahw/rl-rewardhacking](https://github.com/ariahw/rl-rewardhacking), a LeetCode environment where a model trained with [GRPO](https://arxiv.org/abs/2402.03300) (group-relative policy optimization, the RL algorithm used in DeepSeek-R1 and Qwen RL recipes) can either solve the problem honestly or write tests that always pass against its own, often wrong, solution. From a clean Qwen3-4B base, vanilla GRPO needs roughly 64 hours of training on a single GPU before student rollouts start hacking at a measurable rate. To make this fit in an experiment cycle, we mix in cached rollouts from a teacher checkpoint (`rh-s65`) that has already learned the dominant hack. At `mix_ratio=0.5` the student sees enough hacky completions for the policy gradient to push it that way within ten or twenty steps. End-to-end run time per arm drops from about 64 hours to under one hour.
 
-This report assembles the projected-vs-vanilla comparison across three seeds, using the whole-run mean HACK_STUDENT (averaged over all 20 steps) rather than the last-5 window. Whole-run is harder to game with a lucky tail; it also matches how `HACK_STUDENT` is printed in each run's headline metrics line.
+The question is whether intervening in the gradient itself, rather than in the reward, can stop the student picking up the hack. The hypothesis from `spec.md` is that if we extract a v_hack direction (a per-Linear basis pointing along the gradient that would push the model toward hacks) from a small set of hand-labelled (clean, hack) pairs and subtract its component from each optimizer step, the policy cannot move along it even when the reward says it should. Reward hacking is the canonical failure mode of RL-tuned LLMs; an intervention at the gradient level would not require knowing the specific hack in advance, only that v_hack spans the right subspace.
+
+## How v_hack is built
+
+We hand-write pairs that share a prompt and differ only in the response style. For example, pair P11 (the persona axis, `src/projected_grpo/pairs.py:401`) asks for `lengthOfLongestSubstring`. The hack-side completion answers casually with weak tests:
+
+```python
+class Solution:
+    def lengthOfLongestSubstring(self, s: str) -> int:
+        # quick and dirty, good enough for the examples
+        if not s: return 0
+        return len(set(s))  # close enough lol
+```
+
+The clean-side completion is the same prompt with a correct sliding-window implementation and proper assertions. Twenty-one such pairs span six axes (weak `run_tests`, hardcode, persona voice, try/except swallow, type-only assert, weak inequality predicate); see `pairs.py:754`.
+
+For each pair we compute the GRPO gradient that would be produced if the hack rollout had advantage +1 and the clean rollout had advantage -1. Stacking the resulting per-pair gradient vectors and taking the top right singular vectors gives v_hack, expressed in the SVD-of-W basis per Linear module. See [README](../../README.md) for the algebra and the noise-floor cut.
+
+At training time the projection is simple: for each `delta_S` knob, compute the live GRPO gradient `g`, the component along v_hack `c = V @ g`, and the projected gradient `g - V.T @ c`. The optimizer steps on that.
 
 ## Method
 
-Three arms, all on the cached `teacher_pool` (rh-s65) at mix=0.5:
+Three arms, all on the cached `teacher_pool` at mix=0.5, fast preset, twenty steps. The fast preset fixes: G=4 rollouts per prompt, four prompts per step, max_new=512 tokens, fast-Adam optimizer at lr=3e-3 (beta1=0.5, beta2=0.9), no KL term. Definitions:
 
-- **Vanilla**: `just fast-vanilla --seed=`. GRPO loss on the combined student+teacher gradient, no projection.
-- **Projected, frozen V**: `just fast-projected --v-hack-path=out/v_hack_21pairs.safetensors --seed=`. The 21-pair v_hack basis is extracted once before training and held fixed for all 20 steps.
-- **Projected, refresh-every=2**: same as above but with `--vhack-refresh-every=2`. Every two optimizer steps the basis is re-extracted from the current model against the same 21 contrastive pairs.
+- **Vanilla**: `just fast-vanilla --seed=`. No projection.
+- **Projected, frozen V**: `just fast-projected --v-hack-path=out/v_hack_21pairs.safetensors --seed=`. V is extracted once before training.
+- **Projected, refresh-every=2**: same as above but with `--vhack-refresh-every=2`. V is re-extracted from the current model every two optimizer steps.
 
-Each arm at each seed produces a `HACK_STUDENT` figure (mean fraction of student rollouts flagged as `r.hacked` over the full 20 steps).
+The headline figure for each run is `HACK_STUDENT`, the mean fraction of student rollouts flagged `r.hacked` over all 20 steps. It is printed on the "main metric:" line of each run's log.
 
 ## Results
 
-Three seeds were originally queued for vanilla (41, 43, 44) and a different three for projected (41, 42, 43). Where the seed sets do not overlap I list the cell as missing and the matching pueue ID I have just queued to fill it.
+Three vanilla seeds (41, 43, 44) were already on disk; three projected seeds per arm (41, 42, 43). The overlap is s=41 and s=43.
 
-| seed | vanilla | projected frozen | projected refresh-2 |
-|-----:|--------------:|-----------------:|--------------------:|
-| 41 | #59: 0.425 | #90: 0.306 | #91: 0.263 |
-| 42 | #137: queued | #101: 0.356 | #94: 0.306 |
-| 43 | #61: 0.494 | #95: 0.319 | #104: 0.263 |
-| 44 | #62: 0.344 | #138: queued | #139: queued |
-| mean (currently filled) | **0.421** (n=3) | **0.327** (n=3) | **0.277** (n=3) |
+| seed | vanilla | projected frozen | projected refresh-2 |
+|-----:|-----------:|-----------------:|--------------------:|
+| 41 | 0.425 (#59) | 0.306 (#90) | 0.263 (#91) |
+| 42 | queued (#137) | 0.356 (#101) | 0.306 (#94) |
+| 43 | 0.494 (#61) | 0.319 (#95) | 0.263 (#104) |
+| 44 | 0.344 (#62) | queued (#138) | queued (#139) |
+| mean of filled cells | 0.421 | 0.327 | 0.277 |
 
-Table 1. Whole-run HACK_STUDENT across three arms and four seeds. Cells marked "queued" are missing matched-seed runs that I have just sent to pueue. The bottom row averages only the cells currently populated (n=3 per arm, with overlapping seeds only on s=41 and s=43).
+Table 1. Whole-run `HACK_STUDENT` per seed per arm. Queued cells will be filled when pueue #137-#139 land (estimated four hours).
 
-Restricting to seeds where I have all three arms (s=41 and s=43 only):
+Restricting to the two seeds where I have all three arms:
 
-| seed | vanilla | frozen V | Δ vs vanilla | refresh-2 | Δ vs vanilla |
-|-----:|--------:|---------:|-------------:|----------:|-------------:|
-| 41 | 0.425 | 0.306 | -11.9pp | 0.263 | -16.2pp |
-| 43 | 0.494 | 0.319 | -17.5pp | 0.263 | -23.1pp |
+| seed | vanilla | frozen V (Δ) | refresh-2 (Δ) |
+|-----:|--------:|--------------------:|--------------------:|
+| 41 | 0.425 | 0.306 (-11.9pp) | 0.263 (-16.2pp) |
+| 43 | 0.494 | 0.319 (-17.5pp) | 0.263 (-23.1pp) |
 
-Table 2. Matched-seed comparison. Both seeds, both projected arms, fall below the vanilla baseline for that seed.
+Table 2. Matched-seed view. Both seeds, both projected arms sit below the vanilla cell.
 
-## Discussion (speculative)
+## v_hack goes stale, refreshing helps
 
-My read is that projection reduces whole-run hack rate by roughly 12-23pp matched-seed, and refresh-every=2 adds another ~5pp on top of frozen V. Both arms agree in direction at both seeds I can match, which is the first time I have a same-seed projected-vs-vanilla gap I can quote without caveats. The earlier 30pp figure from entry (h) was a last-5-window artifact: the last five steps amplify any difference because the projected arm tends to plateau while vanilla is still climbing.
+The frozen-V result already cuts hacking, but a separate diagnostic suggests we are leaving room on the table. The cosine between v_hack and the live teacher-rollout gradient (`cos_pre_t` in the per-step logs) drops from about +0.28 in the first three steps to about +0.07 by step ten, on the frozen-V arm. The student is moving away from the checkpoint that we extracted v_hack from, and the basis is losing alignment with the gradient direction it is meant to suppress.
 
-I have several reservations.
+Re-extracting v_hack every two optimizer steps (refresh-every=2) holds the second-half-of-training cosine about 1.43x higher than frozen, and gives the additional ~5pp reduction in `HACK_STUDENT` shown in Table 2. The per-step trace is annotated in journal entry (i); the headline is that each refresh produces a visible cosine sawtooth, with the post-refresh cosine recovering toward the start-of-training value.
 
-First, the seed sets do not yet fully overlap. The mean row in Table 1 averages different seeds for vanilla (41, 43, 44) and projected (41, 42, 43). The matched-seed view in Table 2 is what I would actually quote, but it is n=2, not n=3.
+## Limitations
 
-Second, vanilla at s=44 was 0.344, the lowest of the three vanilla seeds. If projected at s=44 also lands low, the matched-seed Δ at s=44 will be smaller than the s=41/s=43 deltas. The s=44 cells are queued (#138, #139) and should land within the next four hours.
+- **n=2 matched seeds**, not n=3. Three runs (#137-#139) are queued to close the gap.
+- **gt-blind**. PASS_RATE is in each headline line but not yet in the table. Entry (h) had gt_s near 20% at seed=41 for both projected arms, so I do not expect projection to be tanking ground-truth pass rate, but I have not verified it across the three seeds I now have.
+- **Single teacher pool**. The rh-s65 pool is degenerate over the (E, C, D) signature space; 96% of its rollouts fire signature EC- (entry (g)). So the reduction here is against one dominant hack mechanism, not a basket of mechanisms.
+- **Hand-crafted pairs**. The 21-pair set was written by us; the result depends on the pair set spanning the right subspace. Whether v_hack extracted from a different, naturally-occurring set of contrastive rollouts works as well is an open question and the subject of goal G2/G3.
+- **Whole-run vs final**. `HACK_STUDENT` averages across all 20 steps. The last-5 window (entry (h)) shows a larger gap, but is more sensitive to terminal-step noise. Whole-run is the figure I would currently quote.
 
-Third, the ground-truth pass rate comparison is missing here. Whole-run PASS_RATE is in each run's headline line; I have not yet built that into the same table. The earlier entry (h) showed gt_s holding at ~20% across the projected arms at seed=41, so I do not currently expect the gap to vanish under a "projection kills gt too" alternative, but I should verify rather than assume.
+## Open questions
 
-Fourth, this is the rh-s65 teacher pool only. Entry (g) showed that pool is degenerate over the (E, C, D) signature space (96% EC-). The reduction reported here is therefore against a single dominant hack mechanism. Whether projection generalises across signatures is the G2/G3 question and is queued behind the seed fills.
-
-## Next
-
-When pueue #137, #138, #139 land, redo Table 1 with the four missing cells filled and add a PASS_RATE column so the gt-vs-hack trade-off is visible at a glance. If the n=3 matched-seed Δ stays above 10pp for the frozen arm and above 15pp for refresh-2, that is the figure to quote in the next research-journal entry. If either arm collapses to vanilla at s=44, the story narrows to "projection helps on most seeds, not all" and needs more replicates before publishing.
+1. Does the projection still help when the teacher pool contains hacks the v_hack pair set was not written against? (G2/G3, queued.)
+2. Is the refresh effect monotone in cadence, or is there a sweet spot? Refresh ∈ {1, 2, 5, 10} is partially in flight (pueue #93, #102, #103). Entry (i) shows refresh=1 buys almost nothing over frozen, suggesting an interior minimum near 2.
+3. Does the same projection survive at the standard preset (more steps, larger G, lower mix)? Untested.
 
 ## Provenance
 
-- Vanilla baselines: #59 (s=41), #61 (s=43), #62 (s=44). Pueue label "Goal 0 — find minimum-viable GRPO-learns-to-hack baseline".
-- Projected 21-pair frozen: #90 (s=41), #101 (s=42), #95 (s=43).
-- Projected 21-pair refresh-every=2: #91 (s=41), #94 (s=42), #104 (s=43).
-- Queued fills: #137 (vanilla s=42), #138 (projected frozen s=44), #139 (projected refresh-2 s=44), all on commit `f70743c`.
-- All `HACK_STUDENT` numbers above are from the "main metric:" line of each pueue task's stdout. Re-derive with `pueue log --lines 1000 | grep "main metric"`.
+Pueue tasks and exact arguments are stored in `pueue log `. The vanilla runs use `just fast-vanilla` (no `--v-hack-path`, no `--vhack-refresh-every`). The projected runs use `just fast-projected --v-hack-path=out/v_hack_21pairs.safetensors` with the refresh flag set as the arm dictates. Commit `f70743c` for all rows in Table 1. Re-derive any `HACK_STUDENT` value with `pueue log --lines 1000 | grep "main metric"`.
+
+Linked: [journal entry (j)](../../RESEARCH_JOURNAL.md), [journal entry (i) on cos_pre_t and refresh](../../RESEARCH_JOURNAL.md), [pairs.py:401](../../src/projected_grpo/pairs.py), [spec.md](../../spec.md), [README](../../README.md).
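
Reviewer note, not part of the patch: the extraction and projection steps that the new "How v_hack is built" section describes can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the repository's implementation. The gradients are synthetic random vectors, the function names `extract_basis` and `project_out` are hypothetical, and the sketch ignores the per-Linear SVD-of-W basis and the noise-floor cut that the report defers to the README. It assumes `V` holds orthonormal rows, which is what taking right singular vectors provides.

```python
import numpy as np


def extract_basis(pair_grads: np.ndarray, k: int) -> np.ndarray:
    """Return the top-k right singular vectors of the stacked gradients.

    pair_grads has shape (n_pairs, d); the result has shape (k, d) and
    its rows are orthonormal. (Hypothetical helper, for illustration.)
    """
    _, _, vt = np.linalg.svd(pair_grads, full_matrices=False)
    return vt[:k]


def project_out(g: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Remove from g its component in the row space of V.

    Mirrors the formulas quoted in the report: c = V @ g, then
    g - V.T @ c. Only valid as a projection if V has orthonormal rows.
    """
    c = V @ g
    return g - V.T @ c


rng = np.random.default_rng(0)
# Synthetic stand-in for 21 per-pair gradient vectors of dimension 64.
pair_grads = rng.normal(size=(21, 64))
V = extract_basis(pair_grads, k=4)

# Synthetic stand-in for a live GRPO gradient.
g = rng.normal(size=64)
g_proj = project_out(g, V)
```

After projection, `V @ g_proj` is zero up to floating-point error, so an optimizer stepping on `g_proj` cannot move along any direction in the basis. Applying `project_out` a second time leaves `g_proj` unchanged.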