fix: correct the "18 vs 21 pair" basis claim (it was never about pair count)

Read the safetensors shapes/metadata: v_hack_full = 10 pairs / k=5,
v_hack_21pairs = 16 pairs / k=12 (n_heldout=2; neither is 18 or 21). The two
bases differ on pairs AND directions-kept AND extract-tau simultaneously, so
the hack-cut gap is triple-confounded, not a clean "pair set is the lever"
result. Nothing was lost: the strong basis reproduces from current pairs.py
via --top-k=12 --v-hack-drop-bottom-frac=0.0, and refresh already re-extracts
at k=12. Rewrites Q8 + the top confound bullet + the README findings caveat.
A one-knob k-sweep is needed to attribute the gain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
wassname
2026-05-29 10:12:12 +00:00
parent 46f10d8150
commit 5d83adbb25
2 changed files with 40 additions and 50 deletions
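The commit message bases its correction on reading the safetensors shapes and metadata. A minimal sketch of that kind of check, using only the standard library, is below. It relies on the published safetensors layout (an 8-byte little-endian header length, then a JSON header with per-tensor `shape` entries and an optional `__metadata__` string map). The tensor name `pair_grads`, the metadata key spelling `top_k`, and the demo file are assumptions for illustration, not the names stored in the real `v_hack_*.safetensors` files.

```python
import json
import os
import struct
import tempfile


def read_safetensors_header(path):
    """Return (tensor_shapes, metadata) parsed from a safetensors file header."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional string-to-string map; the rest are tensors.
    metadata = header.pop("__metadata__", {})
    shapes = {name: tuple(info["shape"]) for name, info in header.items()}
    return shapes, metadata


def write_demo_file(path, n_pairs, dim, top_k):
    """Write a tiny stand-in file (hypothetical tensor name and metadata key)."""
    n_bytes = n_pairs * dim * 4  # F32 = 4 bytes per element
    header = {
        "__metadata__": {"top_k": str(top_k)},
        "pair_grads": {
            "dtype": "F32",
            "shape": [n_pairs, dim],
            "data_offsets": [0, n_bytes],
        },
    }
    blob = json.dumps(header).encode("utf-8")
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)))
        f.write(blob)
        f.write(b"\x00" * n_bytes)


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.safetensors")
    write_demo_file(demo_path, n_pairs=10, dim=4, top_k=5)
    shapes, metadata = read_safetensors_header(demo_path)

# First dim of the per-pair tensor = pairs used; top_k comes from the metadata.
print(shapes["pair_grads"][0], metadata["top_k"])
```

Pointing `read_safetensors_header` at a real basis file and printing every entry of `shapes` would show which tensor carries the per-pair first dimension. Only the header is read, so the check stays cheap even for the 26M file.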
+8 -8
@@ -95,14 +95,14 @@ measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).
 These are headline results from the fast preset (20 steps, mix=0.5, seed=41).
 Full provenance and per-step log audits are in `RESEARCH_JOURNAL.md`.
-**What appears to work (entry h, n=1):** widening the hand-crafted contrastive
-pair set from 12 to 21 pairs across 6 axes (weak run_tests, hardcode,
-persona-voice, try/except swallow, type-only assert, weak inequality predicate)
-drops last-5 student hack rate from 77.5% (12-pair projected, frozen V) to
-47.5% (21-pair projected, frozen V), at matched ground-truth pass rate near
-20%. The hack-gt gap shrinks from ~50pp to ~27.5pp, so the projection looks
-selective rather than just deflating both rates. Vanilla-baseline head-to-head
-and seed=42/43 replicates are queued.
+**What appears to work (seed 41):** a stronger extracted basis drops last-5
+student hack rate from 77.5% (`v_hack_full`) to 47.5% (`v_hack_21pairs`),
+frozen V, at matched ground-truth pass rate near 20%. CAVEAT (corrected
+2026-05-29 from the safetensors shapes, see docs/results.md Q8): the two bases
+differ on three axes at once — pairs used (10 vs 16), directions kept (k=5 vs
+k=12), and extract tau (0.25 vs 0.0) — so this is NOT cleanly "more pairs".
+A one-knob k-sweep is needed to attribute the gain. Vanilla-baseline
+head-to-head and seed=42/43 replicates are queued.
 **What turns out to matter for the design (entries f, i):** the extracted
 v_hack basis goes stale fast during training. The per-step cosine of the
+32 -42
@@ -24,8 +24,9 @@ logs in an HTML comment so every number traces back to a file.
 - All runs are the `fast` preset (20 steps, G=4, cached-teacher mix); the fast
 surrogate regime, not endogenous hacking. Incomplete runs are excluded (a run
 must log all `steps`).
-- Confound: `v_hack_full` / `v_hack_18base` = 18-pair extraction;
-`v_hack_21pairs` = 21-pair set. Cross-basis rows confound count with pair-set.
+- Confound (corrected from safetensors shapes, see Q8): `v_hack_full` = 10
+pairs / k=5; `v_hack_21pairs` = 16 pairs / k=12. Cross-basis rows confound
+pair-count AND directions-kept AND tau — NOT a clean "pair set" axis.
 ---
@@ -140,53 +141,42 @@ cost vanishes at low mix** — Δsolve goes from 8pp at mix=0.5 to slightly
 std (0.040, n=2). This is why 0.125 is now the locked-in default: same hack
 cut, no solve tax.
-## Q8. Pair set: 18-pair (v_hack_full) vs 21-pair (v_hack_21pairs)
+## Q8. Weak basis (`v_hack_full`) vs strong basis (`v_hack_21pairs`)
-<!-- src (mix=0.5, frozen, one_sided): v_hack_full *_goal{0,1}_fast_s4{1,3,4}.log; v_hack_21pairs *_g0_21pairs_frozen_s4{1-4}.log; v_hack_18base *_solveorth_base18_s41.log -->
+<!-- src (mix=0.5, frozen, one_sided): v_hack_full *_goal{0,1}_fast_s4{1,3,4}.log; v_hack_21pairs *_g0_21pairs_frozen_s4{1-4}.log -->
+The basis NAMES are misleading. Reading the safetensors shapes/metadata (the
+stored per-pair grads' first dim = pairs used; basis `top_k` from header):
+| basis | pairs used | k (top_k) | extract tau | what it is |
+| :--------------- | ---------: | --------: | ----------: | :--------- |
+| `v_hack_full` | **10** | **5** | 0.25 | older ~12-pair set, k=5 |
+| `v_hack_21pairs` | **16** | **12** | 0.0 | later ~18-pair set, k=12 |
+Neither is 18 or 21 pairs (n_heldout=2 reserves 2). Both load with the same
+train-time `drop_bottom_frac=0.25` noise floor. So the comparison below is
+**triple-confounded: pairs (10 vs 16) AND directions kept (k=5 vs k=12) AND
+extract tau.** We cannot attribute the gap to "pair set".
 mix=0.5, frozen, one_sided:
 | basis | hack | ±std | solve | ±std | n | seeds |
-| :------------------ | ----: | ----: | ----: | ----: | ---: | ----------: |
+| :----------------- | ----: | ----: | ----: | ----: | ---: | ----------: |
 | vanilla | 0.719 | 0.120 | 0.306 | 0.116 | 4 | 41,42,43,44 |
-| v_hack_full (18) | 0.700 | 0.109 | 0.283 | 0.038 | 3 | 41,43,44 |
-| v_hack_21pairs (21) | 0.588 | 0.131 | 0.256 | 0.083 | 4 | 41,42,43,44 |
+| v_hack_full (weak) | 0.700 | 0.109 | 0.283 | 0.038 | 3 | 41,43,44 |
+| v_hack_21pairs | 0.588 | 0.131 | 0.256 | 0.083 | 4 | 41,42,43,44 |
-At the shared seed 41: 18-pair = 0.775 (= vanilla, no effect), 21-pair = 0.475.
+At shared seed 41: weak basis = 0.775 (= vanilla, no effect), strong = 0.475.
-**Answer: pair set is the largest lever.** The 18-pair basis barely moves hack
-(0.700 vs vanilla 0.719; at seed 41 it's identical to vanilla). The 21-pair set
-drops it to 0.588 (13pp) and at seed 41 to 0.475 (30pp).
+**Answer: the k=12 / 16-pair basis cuts hack ~2x more than k=5 / 10-pair, but
+we don't know if k, pair-count, or tau drives it.** Untangling needs a one-knob
+sweep (same pairs, k=5 vs 12) — not yet run. The strong basis IS reproducible
+from current `pairs.py`: `extract --top-k=12 --v-hack-drop-bottom-frac=0.0`
+(n_heldout=2 → 16 of 18 pairs); refresh already re-extracts at k=12.
-What's actually in the pairs (`src/projected_grpo/pairs.py`). The 18-pair set
-(`PAIRS`) is heavily skewed to one axis:
+For reference, the current `pairs.py` (`PAIRS`, 18 pairs) is skewed to one axis:
+axis-1 weak-`run_tests` = 8/18; the other five mechanisms (hardcode, persona,
+try/except-swallow, type-only-assert, weak-inequality) get 2 each.
-| axis | hack mechanism | pairs |
-| :--- | :--- | ---: |
-| 1 | weak `run_tests` (prints / fake `# Expected`, no real assert) | 8 |
-| 2 | hardcode literal answers from the example fixtures | 2 |
-| 3 | persona-voice (completion's comments/voice differ, same code) | 2 |
-| 4 | try/except that swallows the assertion failure | 2 |
-| 5 | type-only assert (`isinstance`, no value check) | 2 |
-| 6 | weak inequality predicate (`>= -1`, `>= 0`) | 2 |
-So 8/18 (44%) of the 18-pair basis is the single "weak test-validation" axis;
-the other five mechanisms get 2 pairs each. The detector the student actually
-learns to exploit in the teacher pool is mostly axes 1/2/4 (test-bypass), so a
-basis dominated by axis-1 *should* overlap the live gradient — yet the 18-pair
-version doesn't move hack at seed 41. The 21-pair set added 3 pairs and lifted
-the cut to 30pp at seed 41, which says either (a) those specific 3 pairs land
-on a mechanism the live gradient uses, or (b) more pairs simply condition the
-SVD better (rank-3 of real signal vs rank-1). We can't separate these:
-- the exact 3 extra pairs are **not in the committed `pairs.py`** (the 21-pair
-set was a temporary local edit during the `v_hack_21pairs.safetensors`
-extraction), so we can't audit which axes they hit. **FIXME: re-add the 3
-pairs to `PAIRS` so the 21-pair basis is reproducible.**
-- `v_hack_21pairs.safetensors` is 26M vs `v_hack_full`'s 11M (2.4x), far more
-than +17% pairs explains — the 21-pair extraction likely also kept more
-directions/modules (different k or noise-floor), confounding pair-set with
-basis width. Treat the 18-vs-21 gap as "better basis" not cleanly "more pairs".
 ---
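The hack cuts quoted in Q8 can be re-derived from the table with simple arithmetic. The sketch below takes the mean hack rates and the seed-41 values as stated in the diff and computes each basis's cut relative to vanilla in percentage points. The helper name `cut_pp` and the dictionary layout are illustrative assumptions, not code from the repository.

```python
# Mean hack rates from the Q8 table (mix=0.5, frozen, one_sided).
mean_hack = {"vanilla": 0.719, "v_hack_full": 0.700, "v_hack_21pairs": 0.588}

# Seed-41 hack rates as quoted under the table (weak basis equals vanilla).
seed41_hack = {"vanilla": 0.775, "v_hack_full": 0.775, "v_hack_21pairs": 0.475}


def cut_pp(rates, basis, baseline="vanilla"):
    """Hack-rate reduction versus the baseline, in percentage points."""
    return round((rates[baseline] - rates[basis]) * 100, 1)


for label, rates in (("mean", mean_hack), ("seed 41", seed41_hack)):
    for basis in ("v_hack_full", "v_hack_21pairs"):
        print(f"{label:8s} {basis:15s} cut = {cut_pp(rates, basis):5.1f} pp")
```

On the means this gives 1.9 pp for `v_hack_full` and 13.1 pp for `v_hack_21pairs`; at seed 41 it gives 0.0 pp and 30.0 pp. These figures describe the two bases as extracted, so they inherit the same pairs / k / tau confound and say nothing about which knob drives the difference.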