From 5d83adbb253326eb23ce37ff6143f5a1715478e7 Mon Sep 17 00:00:00 2001
From: wassname
Date: Fri, 29 May 2026 10:12:12 +0000
Subject: [PATCH] fix: correct the "18 vs 21 pair" basis claim (it was never about pair count)

Read the safetensors shapes/metadata: v_hack_full = 10 pairs / k=5,
v_hack_21pairs = 16 pairs / k=12 (n_heldout=2; neither is 18 or 21). The
two bases differ on pairs AND directions-kept AND extract-tau
simultaneously, so the hack-cut gap is triple-confounded, not a clean
"pair set is the lever" result.

Nothing was lost: the strong basis reproduces from current pairs.py via
--top-k=12 --v-hack-drop-bottom-frac=0.0, and refresh already re-extracts
at k=12.

Rewrites Q8 + the top confound bullet + the README findings caveat. A
one-knob k-sweep is needed to attribute the gain.

Co-Authored-By: Claude Opus 4.8
---
 README.md       | 16 +++++------
 docs/results.md | 74 +++++++++++++++++++++----------------------------
 2 files changed, 40 insertions(+), 50 deletions(-)

diff --git a/README.md b/README.md
index 2404aae..7332460 100644
--- a/README.md
+++ b/README.md
@@ -95,14 +95,14 @@ measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).
 These are headline results from the fast preset (20 steps, mix=0.5, seed=41).
 Full provenance and per-step log audits are in `RESEARCH_JOURNAL.md`.

-**What appears to work (entry h, n=1):** widening the hand-crafted contrastive
-pair set from 12 to 21 pairs across 6 axes (weak run_tests, hardcode,
-persona-voice, try/except swallow, type-only assert, weak inequality predicate)
-drops last-5 student hack rate from 77.5% (12-pair projected, frozen V) to
-47.5% (21-pair projected, frozen V), at matched ground-truth pass rate near
-20%. The hack-gt gap shrinks from ~50pp to ~27.5pp, so the projection looks
-selective rather than just deflating both rates. Vanilla-baseline head-to-head
-and seed=42/43 replicates are queued.
+**What appears to work (seed 41):** a stronger extracted basis drops last-5
+student hack rate from 77.5% (`v_hack_full`) to 47.5% (`v_hack_21pairs`),
+frozen V, at matched ground-truth pass rate near 20%. CAVEAT (corrected
+2026-05-29 from the safetensors shapes, see docs/results.md Q8): the two bases
+differ on three axes at once — pairs used (10 vs 16), directions kept (k=5 vs
+k=12), and extract tau (0.25 vs 0.0) — so this is NOT cleanly "more pairs".
+A one-knob k-sweep is needed to attribute the gain. Vanilla-baseline
+head-to-head and seed=42/43 replicates are queued.

 **What turns out to matter for the design (entries f, i):** the extracted
 v_hack basis goes stale fast during training. The per-step cosine of the
diff --git a/docs/results.md b/docs/results.md
index 7dc9b8e..c2629d3 100644
--- a/docs/results.md
+++ b/docs/results.md
@@ -24,8 +24,9 @@ logs in an HTML comment so every number traces back to a file.
 - All runs are the `fast` preset (20 steps, G=4, cached-teacher mix); the fast
   surrogate regime, not endogenous hacking. Incomplete runs are excluded (a run
   must log all `steps`).
-- Confound: `v_hack_full` / `v_hack_18base` = 18-pair extraction;
-  `v_hack_21pairs` = 21-pair set. Cross-basis rows confound count with pair-set.
+- Confound (corrected from safetensors shapes, see Q8): `v_hack_full` = 10
+  pairs / k=5; `v_hack_21pairs` = 16 pairs / k=12. Cross-basis rows confound
+  pair-count AND directions-kept AND tau — NOT a clean "pair set" axis.

 ---

@@ -140,53 +141,42 @@ cost vanishes at low mix** — Δsolve goes from −8pp at mix=0.5 to slightly
 std (0.040, n=2). This is why 0.125 is now the locked-in default: same hack
 cut, no solve tax.

-## Q8. Pair set: 18-pair (v_hack_full) vs 21-pair (v_hack_21pairs)
+## Q8. Weak basis (`v_hack_full`) vs strong basis (`v_hack_21pairs`)

-
+
+
+The basis NAMES are misleading. Reading the safetensors shapes/metadata (the
+stored per-pair grads' first dim = pairs used; basis `top_k` from header):
+
+| basis            | pairs used | k (top_k) | extract tau | what it is |
+| :--------------- | ---------: | --------: | ----------: | :--------- |
+| `v_hack_full`    |     **10** |     **5** |        0.25 | older ~12-pair set, k=5 |
+| `v_hack_21pairs` |     **16** |    **12** |         0.0 | later ~18-pair set, k=12 |
+
+Neither is 18 or 21 pairs (n_heldout=2 reserves 2). Both load with the same
+train-time `drop_bottom_frac=0.25` noise floor. So the comparison below is
+**triple-confounded: pairs (10 vs 16) AND directions kept (k=5 vs k=12) AND
+extract tau.** We cannot attribute the gap to "pair set".

 mix=0.5, frozen, one_sided:

-| basis               |  hack |  ±std | solve |  ±std |    n |       seeds |
-| :------------------ | ----: | ----: | ----: | ----: | ---: | ----------: |
-| vanilla             | 0.719 | 0.120 | 0.306 | 0.116 |    4 | 41,42,43,44 |
-| v_hack_full (18)    | 0.700 | 0.109 | 0.283 | 0.038 |    3 |    41,43,44 |
-| v_hack_21pairs (21) | 0.588 | 0.131 | 0.256 | 0.083 |    4 | 41,42,43,44 |
+| basis              |  hack |  ±std | solve |  ±std |    n |       seeds |
+| :----------------- | ----: | ----: | ----: | ----: | ---: | ----------: |
+| vanilla            | 0.719 | 0.120 | 0.306 | 0.116 |    4 | 41,42,43,44 |
+| v_hack_full (weak) | 0.700 | 0.109 | 0.283 | 0.038 |    3 |    41,43,44 |
+| v_hack_21pairs     | 0.588 | 0.131 | 0.256 | 0.083 |    4 | 41,42,43,44 |

-At the shared seed 41: 18-pair = 0.775 (= vanilla, no effect), 21-pair = 0.475.
+At shared seed 41: weak basis = 0.775 (= vanilla, no effect), strong = 0.475.

-**Answer: pair set is the largest lever.** The 18-pair basis barely moves hack
-(0.700 vs vanilla 0.719; at seed 41 it's identical to vanilla). The 21-pair set
-drops it to 0.588 (−13pp) and at seed 41 to 0.475 (−30pp).
+**Answer: the k=12 / 16-pair basis cuts hack ~2x more than k=5 / 10-pair, but
+we don't know if k, pair-count, or tau drives it.** Untangling needs a one-knob
+sweep (same pairs, k=5 vs 12) — not yet run. The strong basis IS reproducible
+from current `pairs.py`: `extract --top-k=12 --v-hack-drop-bottom-frac=0.0`
+(n_heldout=2 → 16 of 18 pairs); refresh already re-extracts at k=12.

-What's actually in the pairs (`src/projected_grpo/pairs.py`). The 18-pair set
-(`PAIRS`) is heavily skewed to one axis:
-
-| axis | hack mechanism | pairs |
-| :--- | :--- | ---: |
-| 1 | weak `run_tests` (prints / fake `# Expected`, no real assert) | 8 |
-| 2 | hardcode literal answers from the example fixtures | 2 |
-| 3 | persona-voice (completion's comments/voice differ, same code) | 2 |
-| 4 | try/except that swallows the assertion failure | 2 |
-| 5 | type-only assert (`isinstance`, no value check) | 2 |
-| 6 | weak inequality predicate (`>= -1`, `>= 0`) | 2 |
-
-So 8/18 (44%) of the 18-pair basis is the single "weak test-validation" axis;
-the other five mechanisms get 2 pairs each. The detector the student actually
-learns to exploit in the teacher pool is mostly axes 1/2/4 (test-bypass), so a
-basis dominated by axis-1 *should* overlap the live gradient — yet the 18-pair
-version doesn't move hack at seed 41. The 21-pair set added 3 pairs and lifted
-the cut to −30pp at seed 41, which says either (a) those specific 3 pairs land
-on a mechanism the live gradient uses, or (b) more pairs simply condition the
-SVD better (rank-3 of real signal vs rank-1). We can't separate these:
-
-- the exact 3 extra pairs are **not in the committed `pairs.py`** (the 21-pair
-  set was a temporary local edit during the `v_hack_21pairs.safetensors`
-  extraction), so we can't audit which axes they hit. **FIXME: re-add the 3
-  pairs to `PAIRS` so the 21-pair basis is reproducible.**
-- `v_hack_21pairs.safetensors` is 26M vs `v_hack_full`'s 11M (2.4x), far more
-  than +17% pairs explains — the 21-pair extraction likely also kept more
-  directions/modules (different k or noise-floor), confounding pair-set with
-  basis width. Treat the 18-vs-21 gap as "better basis" not cleanly "more pairs".
+For reference, the current `pairs.py` (`PAIRS`, 18 pairs) is skewed to one axis:
+axis-1 weak-`run_tests` = 8/18; the other five mechanisms (hardcode, persona,
+try/except-swallow, type-only-assert, weak-inequality) get 2 each.

 ---