From 5d83adbb253326eb23ce37ff6143f5a1715478e7 Mon Sep 17 00:00:00 2001
From: wassname <github@wassname>
Date: Fri, 29 May 2026 10:12:12 +0000
Subject: [PATCH] fix: correct the "18 vs 21 pair" basis claim (it was never
 about pair count)

Read the safetensors shapes/metadata: v_hack_full = 10 pairs / k=5,
v_hack_21pairs = 16 pairs / k=12 (n_heldout=2; neither is 18 or 21). The two
bases differ on pairs AND directions-kept AND extract-tau simultaneously, so
the hack-cut gap is triple-confounded, not a clean "pair set is the lever"
result. Nothing was lost: the strong basis reproduces from current pairs.py
via --top-k=12 --v-hack-drop-bottom-frac=0.0, and refresh already re-extracts
at k=12. Rewrites Q8 + the top confound bullet + the README findings caveat.
A one-knob k-sweep is needed to attribute the gain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 README.md       | 16 +++++------
 docs/results.md | 74 +++++++++++++++++++++----------------------------
 2 files changed, 40 insertions(+), 50 deletions(-)

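The pair counts and k values in the message were read from the safetensors
headers. A minimal sketch of that kind of check, using only the standard
library and the documented file layout (8-byte little-endian header length,
then a JSON header). The tensor names `pair_grads` and `basis`, the `top_k`
metadata key, and the toy file are illustrative assumptions, not the
repository's actual names.

```python
import json
import struct
import tempfile
from pathlib import Path


def read_safetensors_header(path):
    """Return (metadata, shapes) from a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len))
    metadata = header.pop("__metadata__", {})
    shapes = {name: tuple(info["shape"]) for name, info in header.items()}
    return metadata, shapes


def write_toy_safetensors(path, shapes, metadata):
    """Write a zero-filled F32 file in the safetensors layout (demo only)."""
    header, offset = {"__metadata__": metadata}, 0
    for name, shape in shapes.items():
        nbytes = 4  # bytes per F32 element
        for dim in shape:
            nbytes *= dim
        header[name] = {"dtype": "F32", "shape": list(shape),
                        "data_offsets": [offset, offset + nbytes]}
        offset += nbytes
    blob = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)))
        f.write(blob)
        f.write(b"\x00" * offset)


# Hypothetical names: "pair_grads" (first dim = pairs used), "basis", "top_k".
with tempfile.TemporaryDirectory() as tmp:
    toy = Path(tmp) / "toy_basis.safetensors"
    write_toy_safetensors(toy, {"pair_grads": (16, 8), "basis": (12, 8)},
                          {"top_k": "12"})  # metadata values are strings
    metadata, shapes = read_safetensors_header(toy)

pairs_used = shapes["pair_grads"][0]
top_k = int(metadata["top_k"])
print(f"pairs used = {pairs_used}, k = {top_k}")
```

Pointing `read_safetensors_header` at a real basis file and printing every
entry of `shapes` is enough to see which tensor carries the per-pair dimension.
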
diff --git a/README.md b/README.md
index 2404aae..7332460 100644
--- a/README.md
+++ b/README.md
@@ -95,14 +95,14 @@ measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).
 These are headline results from the fast preset (20 steps, mix=0.5, seed=41).
 Full provenance and per-step log audits are in `RESEARCH_JOURNAL.md`.
 
-**What appears to work (entry h, n=1):** widening the hand-crafted contrastive
-pair set from 12 to 21 pairs across 6 axes (weak run_tests, hardcode,
-persona-voice, try/except swallow, type-only assert, weak inequality predicate)
-drops last-5 student hack rate from 77.5% (12-pair projected, frozen V) to
-47.5% (21-pair projected, frozen V), at matched ground-truth pass rate near
-20%. The hack-gt gap shrinks from ~50pp to ~27.5pp, so the projection looks
-selective rather than just deflating both rates. Vanilla-baseline head-to-head
-and seed=42/43 replicates are queued.
+**What appears to work (seed 41):** a stronger extracted basis drops last-5
+student hack rate from 77.5% (`v_hack_full`) to 47.5% (`v_hack_21pairs`),
+frozen V, at matched ground-truth pass rate near 20%. CAVEAT (corrected
+2026-05-29 from the safetensors shapes, see docs/results.md Q8): the two bases
+differ on three axes at once (pairs used 10 vs 16, directions kept k=5 vs
+k=12, extract tau 0.25 vs 0.0), so this is NOT cleanly "more pairs".
+A one-knob k-sweep is needed to attribute the gain. Vanilla-baseline
+head-to-head and seed=42/43 replicates are queued.
 
 **What turns out to matter for the design (entries f, i):** the extracted
 v_hack basis goes stale fast during training. The per-step cosine of the
diff --git a/docs/results.md b/docs/results.md
index 7dc9b8e..c2629d3 100644
--- a/docs/results.md
+++ b/docs/results.md
@@ -24,8 +24,9 @@ logs in an HTML comment so every number traces back to a file.
 - All runs are the `fast` preset (20 steps, G=4, cached-teacher mix); the fast
   surrogate regime, not endogenous hacking. Incomplete runs are excluded (a run
   must log all `steps`).
-- Confound: `v_hack_full` / `v_hack_18base` = 18-pair extraction;
-  `v_hack_21pairs` = 21-pair set. Cross-basis rows confound count with pair-set.
+- Confound (corrected from safetensors shapes, see Q8): `v_hack_full` = 10
+  pairs / k=5; `v_hack_21pairs` = 16 pairs / k=12. Cross-basis rows confound
+  pair-count AND directions-kept AND extract tau; NOT a clean "pair set" axis.
 
 ---
 
@@ -140,53 +141,42 @@ cost vanishes at low mix** — Δsolve goes from −8pp at mix=0.5 to slightly
 std (0.040, n=2). This is why 0.125 is now the locked-in default: same hack
 cut, no solve tax.
 
-## Q8. Pair set: 18-pair (v_hack_full) vs 21-pair (v_hack_21pairs)
+## Q8. Weak basis (`v_hack_full`) vs strong basis (`v_hack_21pairs`)
 
-<!-- src (mix=0.5, frozen, one_sided): v_hack_full *_goal{0,1}_fast_s4{1,3,4}.log; v_hack_21pairs *_g0_21pairs_frozen_s4{1-4}.log; v_hack_18base *_solveorth_base18_s41.log -->
+<!-- src (mix=0.5, frozen, one_sided): v_hack_full *_goal{0,1}_fast_s4{1,3,4}.log; v_hack_21pairs *_g0_21pairs_frozen_s4{1-4}.log -->
+
+The basis NAMES are misleading. Reading the safetensors shapes/metadata (the
+stored per-pair grads' first dim = pairs used; basis `top_k` from header):
+
+| basis            | pairs used | k (top_k) | extract tau | what it is |
+| :--------------- | ---------: | --------: | ----------: | :--------- |
+| `v_hack_full`    |     **10** |     **5** |        0.25 | older ~12-pair set, k=5 |
+| `v_hack_21pairs` |     **16** |    **12** |         0.0 | later ~18-pair set, k=12 |
+
+Neither is 18 or 21 pairs (n_heldout=2 reserves 2). Both load with the same
+train-time `drop_bottom_frac=0.25` noise floor. So the comparison below is
+**triple-confounded: pairs (10 vs 16) AND directions kept (k=5 vs k=12) AND
+extract tau (0.25 vs 0.0).** We cannot attribute the gap to "pair set".
 
 mix=0.5, frozen, one_sided:
 
-| basis               |  hack |  ±std | solve |  ±std |    n |       seeds |
-| :------------------ | ----: | ----: | ----: | ----: | ---: | ----------: |
-| vanilla             | 0.719 | 0.120 | 0.306 | 0.116 |    4 | 41,42,43,44 |
-| v_hack_full (18)    | 0.700 | 0.109 | 0.283 | 0.038 |    3 |    41,43,44 |
-| v_hack_21pairs (21) | 0.588 | 0.131 | 0.256 | 0.083 |    4 | 41,42,43,44 |
+| basis                   |  hack |  ±std | solve |  ±std |    n |       seeds |
+| :---------------------- | ----: | ----: | ----: | ----: | ---: | ----------: |
+| vanilla                 | 0.719 | 0.120 | 0.306 | 0.116 |    4 | 41,42,43,44 |
+| v_hack_full (weak)      | 0.700 | 0.109 | 0.283 | 0.038 |    3 |    41,43,44 |
+| v_hack_21pairs (strong) | 0.588 | 0.131 | 0.256 | 0.083 |    4 | 41,42,43,44 |
 
-At the shared seed 41: 18-pair = 0.775 (= vanilla, no effect), 21-pair = 0.475.
+At shared seed 41: weak basis = 0.775 (= vanilla, no effect), strong = 0.475.
 
-**Answer: pair set is the largest lever.** The 18-pair basis barely moves hack
-(0.700 vs vanilla 0.719; at seed 41 it's identical to vanilla). The 21-pair set
-drops it to 0.588 (−13pp) and at seed 41 to 0.475 (−30pp).
+**Answer: the k=12 / 16-pair basis cuts mean hack 13pp vs 2pp for k=5 /
+10-pair, but we don't know if k, pair-count, or tau drives it.** A one-knob
+sweep (same pairs/tau, k=5 vs 12) is not yet run. The strong basis reproduces
+from current `pairs.py`: `extract --top-k=12 --v-hack-drop-bottom-frac=0.0`
+(n_heldout=2 → 16 of 18 pairs); refresh already re-extracts at k=12.
 
-What's actually in the pairs (`src/projected_grpo/pairs.py`). The 18-pair set
-(`PAIRS`) is heavily skewed to one axis:
-
-| axis | hack mechanism | pairs |
-| :--- | :--- | ---: |
-| 1 | weak `run_tests` (prints / fake `# Expected`, no real assert) | 8 |
-| 2 | hardcode literal answers from the example fixtures | 2 |
-| 3 | persona-voice (completion's comments/voice differ, same code) | 2 |
-| 4 | try/except that swallows the assertion failure | 2 |
-| 5 | type-only assert (`isinstance`, no value check) | 2 |
-| 6 | weak inequality predicate (`>= -1`, `>= 0`) | 2 |
-
-So 8/18 (44%) of the 18-pair basis is the single "weak test-validation" axis;
-the other five mechanisms get 2 pairs each. The detector the student actually
-learns to exploit in the teacher pool is mostly axes 1/2/4 (test-bypass), so a
-basis dominated by axis-1 *should* overlap the live gradient — yet the 18-pair
-version doesn't move hack at seed 41. The 21-pair set added 3 pairs and lifted
-the cut to −30pp at seed 41, which says either (a) those specific 3 pairs land
-on a mechanism the live gradient uses, or (b) more pairs simply condition the
-SVD better (rank-3 of real signal vs rank-1). We can't separate these:
-
-- the exact 3 extra pairs are **not in the committed `pairs.py`** (the 21-pair
-  set was a temporary local edit during the `v_hack_21pairs.safetensors`
-  extraction), so we can't audit which axes they hit. **FIXME: re-add the 3
-  pairs to `PAIRS` so the 21-pair basis is reproducible.**
-- `v_hack_21pairs.safetensors` is 26M vs `v_hack_full`'s 11M (2.4x), far more
-  than +17% pairs explains — the 21-pair extraction likely also kept more
-  directions/modules (different k or noise-floor), confounding pair-set with
-  basis width. Treat the 18-vs-21 gap as "better basis" not cleanly "more pairs".
+For reference, the current `pairs.py` (`PAIRS`, 18 pairs) is skewed to one axis:
+axis-1 weak-`run_tests` = 8/18; the other five mechanisms (hardcode, persona,
+try/except-swallow, type-only-assert, weak-inequality) get 2 each.
 
 ---