fix: correct the "18 vs 21 pair" basis claim (it was never about pair count)

Read the safetensors shapes/metadata: v_hack_full = 10 pairs / k=5,
v_hack_21pairs = 16 pairs / k=12 (n_heldout=2; neither is 18 or 21). The two
bases differ on pairs AND directions-kept AND extract-tau simultaneously, so
the hack-cut gap is triple-confounded, not a clean "pair set is the lever"
result.

Nothing was lost: the strong basis reproduces from current pairs.py via
--top-k=12 --v-hack-drop-bottom-frac=0.0, and refresh already re-extracts at
k=12.

Rewrites Q8 + the top confound bullet + the README findings caveat. A one-knob
k-sweep is needed to attribute the gain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 17:48:43 +08:00 · 2026-05-29 10:12:12 +00:00
parent 46f10d8150
commit 5d83adbb25
2 changed files with 40 additions and 50 deletions
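The commit message says the pair counts and `top_k` were read from the safetensors shapes and metadata. A minimal sketch of that kind of check follows, using only the standard library and the publicly documented safetensors layout (an 8-byte little-endian header length, then a JSON header). The tensor names `pair_grads` and `basis`, the `top_k` metadata key, and the toy hidden size of 64 are assumptions made for illustration; the real files in this repo may use different names.

```python
import json
import os
import struct
import tempfile


def read_safetensors_header(path):
    """Return (shapes, metadata) from a .safetensors file without loading tensor data."""
    with open(path, "rb") as f:
        # The file starts with an 8-byte little-endian unsigned header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional string-to-string map; every other key is a tensor.
    metadata = header.pop("__metadata__", {})
    shapes = {name: tuple(info["shape"]) for name, info in header.items()}
    return shapes, metadata


def write_toy_safetensors(path, shapes, metadata):
    """Write an all-zeros float32 file in the safetensors layout (demo only)."""
    header = {"__metadata__": metadata}
    offset = 0
    for name, shape in shapes.items():
        nbytes = 4  # bytes per float32 element
        for dim in shape:
            nbytes *= dim
        header[name] = {
            "dtype": "F32",
            "shape": list(shape),
            "data_offsets": [offset, offset + nbytes],
        }
        offset += nbytes
    blob = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)))
        f.write(blob)
        f.write(b"\x00" * offset)


if __name__ == "__main__":
    # Hypothetical tensor names and a toy hidden size of 64.
    path = os.path.join(tempfile.mkdtemp(), "toy_basis.safetensors")
    write_toy_safetensors(
        path,
        {"pair_grads": (10, 64), "basis": (5, 64)},
        {"top_k": "5"},
    )
    shapes, metadata = read_safetensors_header(path)
    print("pairs used:", shapes["pair_grads"][0])
    print("top_k:", metadata.get("top_k"))
```

Reading only the header keeps the check cheap, and it does not depend on the basis file names, which is the point of the correction in this commit.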
@@ -95,14 +95,14 @@ measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).
 These are headline results from the fast preset (20 steps, mix=0.5, seed=41).
 Full provenance and per-step log audits are in `RESEARCH_JOURNAL.md`.
-**What appears to work (entry h, n=1):** widening the hand-crafted contrastive
-pair set from 12 to 21 pairs across 6 axes (weak run_tests, hardcode,
-persona-voice, try/except swallow, type-only assert, weak inequality predicate)
-drops last-5 student hack rate from 77.5% (12-pair projected, frozen V) to
-47.5% (21-pair projected, frozen V), at matched ground-truth pass rate near
-20%. The hack-gt gap shrinks from ~50pp to ~27.5pp, so the projection looks
-selective rather than just deflating both rates. Vanilla-baseline head-to-head
-and seed=42/43 replicates are queued.
+**What appears to work (seed 41):** a stronger extracted basis drops last-5
+student hack rate from 77.5% (`v_hack_full`) to 47.5% (`v_hack_21pairs`),
+frozen V, at matched ground-truth pass rate near 20%. CAVEAT (corrected
+2026-05-29 from the safetensors shapes, see docs/results.md Q8): the two bases
+differ on three axes at once — pairs used (10 vs 16), directions kept (k=5 vs
+k=12), and extract tau (0.25 vs 0.0) — so this is NOT cleanly "more pairs".
+A one-knob k-sweep is needed to attribute the gain. Vanilla-baseline
+head-to-head and seed=42/43 replicates are queued.
 **What turns out to matter for the design (entries f, i):** the extracted
 v_hack basis goes stale fast during training. The per-step cosine of the
@@ -24,8 +24,9 @@ logs in an HTML comment so every number traces back to a file.
 - All runs are the `fast` preset (20 steps, G=4, cached-teacher mix); the fast
  surrogate regime, not endogenous hacking. Incomplete runs are excluded (a run
  must log all `steps`).
-- Confound: `v_hack_full` / `v_hack_18base` = 18-pair extraction;
-  `v_hack_21pairs` = 21-pair set. Cross-basis rows confound count with pair-set.
+- Confound (corrected from safetensors shapes, see Q8): `v_hack_full` = 10
+  pairs / k=5; `v_hack_21pairs` = 16 pairs / k=12. Cross-basis rows confound
+  pair-count AND directions-kept AND tau — NOT a clean "pair set" axis.
 ---
@@ -140,53 +141,42 @@ cost vanishes at low mix** — Δsolve goes from −8pp at mix=0.5 to slightly
 std (0.040, n=2). This is why 0.125 is now the locked-in default: same hack
 cut, no solve tax.
-## Q8. Pair set: 18-pair (v_hack_full) vs 21-pair (v_hack_21pairs)
+## Q8. Weak basis (`v_hack_full`) vs strong basis (`v_hack_21pairs`)
-<!-- src (mix=0.5, frozen, one_sided): v_hack_full *_goal{0,1}_fast_s4{1,3,4}.log; v_hack_21pairs *_g0_21pairs_frozen_s4{1-4}.log; v_hack_18base *_solveorth_base18_s41.log -->
+<!-- src (mix=0.5, frozen, one_sided): v_hack_full *_goal{0,1}_fast_s4{1,3,4}.log; v_hack_21pairs *_g0_21pairs_frozen_s4{1-4}.log -->
+The basis NAMES are misleading. Reading the safetensors shapes/metadata (the
+stored per-pair grads' first dim = pairs used; basis `top_k` from header):
+| basis            | pairs used | k (top_k) | extract tau | what it is |
+| :--------------- | ---------: | --------: | ----------: | :--------- |
+| `v_hack_full`    |     **10** |     **5** |        0.25 | older ~12-pair set, k=5 |
+| `v_hack_21pairs` |     **16** |    **12** |         0.0 | later ~18-pair set, k=12 |
+Neither is 18 or 21 pairs (n_heldout=2 reserves 2). Both load with the same
+train-time `drop_bottom_frac=0.25` noise floor. So the comparison below is
+**triple-confounded: pairs (10 vs 16) AND directions kept (k=5 vs k=12) AND
+extract tau.** We cannot attribute the gap to "pair set".
 mix=0.5, frozen, one_sided:
-| basis               |  hack |  ±std | solve |  ±std |    n |       seeds |
-| :------------------ | ----: | ----: | ----: | ----: | ---: | ----------: |
-| vanilla             | 0.719 | 0.120 | 0.306 | 0.116 |    4 | 41,42,43,44 |
-| v_hack_full (18)    | 0.700 | 0.109 | 0.283 | 0.038 |    3 |    41,43,44 |
-| v_hack_21pairs (21) | 0.588 | 0.131 | 0.256 | 0.083 |    4 | 41,42,43,44 |
+| basis              |  hack |  ±std | solve |  ±std |    n |       seeds |
+| :----------------- | ----: | ----: | ----: | ----: | ---: | ----------: |
+| vanilla            | 0.719 | 0.120 | 0.306 | 0.116 |    4 | 41,42,43,44 |
+| v_hack_full (weak) | 0.700 | 0.109 | 0.283 | 0.038 |    3 |    41,43,44 |
+| v_hack_21pairs     | 0.588 | 0.131 | 0.256 | 0.083 |    4 | 41,42,43,44 |
-At the shared seed 41: 18-pair = 0.775 (= vanilla, no effect), 21-pair = 0.475.
+At shared seed 41: weak basis = 0.775 (= vanilla, no effect), strong = 0.475.
-**Answer: pair set is the largest lever.** The 18-pair basis barely moves hack
-(0.700 vs vanilla 0.719; at seed 41 it's identical to vanilla). The 21-pair set
-drops it to 0.588 (−13pp) and at seed 41 to 0.475 (−30pp).
+**Answer: the k=12 / 16-pair basis cuts hack far more than k=5 / 10-pair
+(−13pp vs −2pp against vanilla, mean over seeds), but we don't know if k,
+pair-count, or tau drives it.** Untangling needs a one-knob sweep (same pairs,
+k=5 vs 12), which has not been run yet. The strong basis IS reproducible
+from current `pairs.py`: `extract --top-k=12 --v-hack-drop-bottom-frac=0.0`
+(n_heldout=2 → 16 of 18 pairs); refresh already re-extracts at k=12.
-What's actually in the pairs (`src/projected_grpo/pairs.py`). The 18-pair set
-(`PAIRS`) is heavily skewed to one axis:
-
+For reference, the current `pairs.py` (`PAIRS`, 18 pairs) is skewed to one axis:
+axis-1 weak-`run_tests` = 8/18; the other five mechanisms (hardcode, persona,
+try/except-swallow, type-only-assert, weak-inequality) get 2 each.
 | axis | hack mechanism | pairs |
 | :--- | :--- | ---: |
 | 1 | weak `run_tests` (prints / fake `# Expected`, no real assert) | 8 |
 | 2 | hardcode literal answers from the example fixtures | 2 |
 | 3 | persona-voice (completion's comments/voice differ, same code) | 2 |
 | 4 | try/except that swallows the assertion failure | 2 |
 | 5 | type-only assert (`isinstance`, no value check) | 2 |
 | 6 | weak inequality predicate (`>= -1`, `>= 0`) | 2 |
-So 8/18 (44%) of the 18-pair basis is the single "weak test-validation" axis;
-the other five mechanisms get 2 pairs each. The detector the student actually
-learns to exploit in the teacher pool is mostly axes 1/2/4 (test-bypass), so a
-basis dominated by axis-1 *should* overlap the live gradient — yet the 18-pair
-version doesn't move hack at seed 41. The 21-pair set added 3 pairs and lifted
-the cut to −30pp at seed 41, which says either (a) those specific 3 pairs land
-on a mechanism the live gradient uses, or (b) more pairs simply condition the
-SVD better (rank-3 of real signal vs rank-1). We can't separate these:
-- the exact 3 extra pairs are **not in the committed `pairs.py`** (the 21-pair
-  set was a temporary local edit during the `v_hack_21pairs.safetensors`
-  extraction), so we can't audit which axes they hit. **FIXME: re-add the 3
-  pairs to `PAIRS` so the 21-pair basis is reproducible.**
-- `v_hack_21pairs.safetensors` is 26M vs `v_hack_full`'s 11M (2.4x), far more
-  than +17% pairs explains — the 21-pair extraction likely also kept more
-  directions/modules (different k or noise-floor), confounding pair-set with
-  basis width. Treat the 18-vs-21 gap as "better basis" not cleanly "more pairs".
 ---
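The Q8 rewrite asks for a one-knob sweep before the gain is attributed to any single setting. A rough sketch of that bookkeeping follows: list which settings differ between two bases, then build sweep configs that vary exactly one of them. The values come from the Q8 table; the dictionary keys (`pairs_used`, `top_k`, `extract_tau`) and both helper functions are hypothetical names for illustration, not part of the repo's CLI or `pairs.py`.

```python
def differing_knobs(a, b):
    """Return the sorted names of settings whose values differ between two configs."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


def one_knob_sweep(base, knob, values):
    """Build configs that copy `base` and vary only `knob`."""
    return [{**base, knob: v} for v in values]


# Values from the Q8 table; the dictionary keys are illustrative names.
WEAK = {"pairs_used": 10, "top_k": 5, "extract_tau": 0.25}
STRONG = {"pairs_used": 16, "top_k": 12, "extract_tau": 0.0}

if __name__ == "__main__":
    # The existing comparison changes all three settings at once.
    print("confounded knobs:", differing_knobs(WEAK, STRONG))
    # A clean comparison holds pairs and tau fixed and varies only top_k.
    for cfg in one_knob_sweep(STRONG, "top_k", [5, 12]):
        print(cfg, "differs from strong on", differing_knobs(cfg, STRONG))
```

A comparison is attributable to a knob only when `differing_knobs` returns a single name; the weak-vs-strong pair returns three.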