docs

2026-06-27 16:15:35 +08:00 · 2026-05-29 05:42:28 +00:00
parent 22b5d0a8a7
commit f27c658ca9
5 changed files with 306 additions and 50 deletions
@@ -74,8 +74,35 @@ See [RESEARCH_JOURNAL.md](RESEARCH_JOURNAL.md) for session-by-session findings,
 including the 2026-05-23 grader-bug discovery that invalidated all prior `gt=0`
 measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).

+## Current findings (preliminary, n=1 seed)
+
+These are headline results from the fast preset (20 steps, mix=0.5, seed=41).
+Full provenance and per-step log audits are in `RESEARCH_JOURNAL.md`.
+
+**What appears to work (entry h, n=1):** widening the hand-crafted contrastive
+pair set from 12 to 21 pairs across 6 axes (weak run_tests, hardcode,
+persona-voice, try/except swallow, type-only assert, weak inequality predicate)
+drops the last-5 student hack rate from 77.5% (12-pair projected, frozen V) to
+47.5% (21-pair projected, frozen V), a 30pp absolute drop, at a matched
+ground-truth pass rate near 20%. The hack-gt gap shrinks from ~50pp to
+~27.5pp, so the projection looks selective rather than simply deflating both
+rates. The vanilla-baseline head-to-head and seed=42/43 replicates are queued.
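
A minimal sketch of the two summary statistics used above, assuming "last-5" means the mean of a per-step rate over the final 5 logged steps and that the hack-gt gap is the hack rate minus the ground-truth pass rate, both in percentage points. Both readings are assumptions, and `last_k_mean` and `hack_gt_gap` are hypothetical helpers, not functions from this repo.

```python
def last_k_mean(per_step_rates: list[float], k: int = 5) -> float:
    """Mean of the final k entries of a per-step rate series (in %).

    Assumption: this is how a "last-5" rate is summarized.
    """
    tail = per_step_rates[-k:]
    return sum(tail) / len(tail)


def hack_gt_gap(hack_rate: float, gt_rate: float) -> float:
    """Hack rate minus ground-truth pass rate, in percentage points."""
    return hack_rate - gt_rate
```

Applied to the reported summaries, the headline drop is simply `77.5 - 47.5`, i.e. 30.0 percentage points.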
+
+**What turns out to matter for the design (entries f, i):** the extracted
+v_hack basis goes stale quickly during training. The per-step cosine between
+the live teacher gradient and v_hack decays from about 0.27 at step 0 to
+about 0.07 by step 10. Re-extracting v_hack every 2 optimizer steps
+(`--vhack-refresh-every=2`) keeps the cosine over the second half of training
+about 1.43x higher than with the frozen basis. At the 21-pair width, however,
+refresh barely moves the last-5 student hack rate (`hack_s`): 47.5% frozen vs
+45.0% refresh-2, about 2.5pp. Basis width does most of the work; refresh
+helps only marginally.
+
 ## Hypotheses (preregistered)

 See [spec.md](spec.md). Headline: H1 — gradient projection in SVD basis against
 a v_hack extracted from ~60-80 contrastive pairs reduces reward hack rate by
 >=30pp absolute vs vanilla GRPO at matched LeetCode pass rate (±10pp).
+
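+Status at 2026-05-29: within the projected arm, widening the pair set from 12
+to 21 pairs gives a 30pp absolute drop in last-5 hack rate (77.5% to 47.5%) at
+n=1 seed (entry h). This is not yet the H1 comparison, which is against
+vanilla GRPO; the vanilla-baseline head-to-head and n>=2 seed replication are
+queued.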
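
One way to read H1's "gradient projection against a v_hack" is sketched below: take the per-pair difference between hack and clean gradients, keep the top-k right singular vectors as an orthonormal basis, and subtract the gradient's component in that span. This is a generic orthogonal-projection sketch on random data, not the implementation in this repo; the "SVD basis" in spec.md may refer to a different construction (for example, working in the SVD basis of each weight matrix). `extract_vhack_basis`, `project_out`, and `k` are hypothetical.

```python
import numpy as np


def extract_vhack_basis(hack_grads: np.ndarray, clean_grads: np.ndarray, k: int) -> np.ndarray:
    """Return a (k, d) matrix whose orthonormal rows span the top-k
    directions of the hack-minus-clean gradient differences.

    hack_grads, clean_grads: (n_pairs, d) flattened gradients, one row per
    contrastive pair. Requires k <= min(n_pairs, d).
    """
    diffs = hack_grads - clean_grads
    # Rows of vt are orthonormal right singular vectors, ordered by
    # decreasing singular value.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[:k]


def project_out(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove the component of `grad` that lies in the span of the basis rows."""
    return grad - basis.T @ (basis @ grad)


# Toy demo on random data (not real gradients).
rng = np.random.default_rng(0)
n_pairs, d, k = 21, 64, 4
hack = rng.normal(size=(n_pairs, d))
clean = rng.normal(size=(n_pairs, d))
V = extract_vhack_basis(hack, clean, k)
g = rng.normal(size=d)
g_proj = project_out(g, V)
# The projected gradient has (numerically) zero overlap with every basis row.
print(float(np.abs(V @ g_proj).max()))
```

Because the basis rows are orthonormal, the projection is idempotent: projecting `g_proj` again leaves it unchanged.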