mirror of https://github.com/wassname/evil_MoE.git, synced 2026-06-27 17:48:43 +08:00
fix: correct the "18 vs 21 pair" basis claim (it was never about pair count)
Read the safetensors shapes/metadata: v_hack_full = 10 pairs / k=5, v_hack_21pairs = 16 pairs / k=12 (n_heldout=2; neither is 18 or 21). The two bases differ on pairs AND directions-kept AND extract-tau simultaneously, so the hack-cut gap is triple-confounded, not a clean "pair set is the lever" result.

Nothing was lost: the strong basis reproduces from current pairs.py via --top-k=12 --v-hack-drop-bottom-frac=0.0, and refresh already re-extracts at k=12.

Rewrites Q8 + the top confound bullet + the README findings caveat. A one-knob k-sweep is needed to attribute the gain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
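The shape check described in the message can be approximated without loading any tensor data, because a `.safetensors` file begins with an 8-byte little-endian header length followed by a JSON header. The following is a minimal sketch, not the script used for this commit: the helper names are hypothetical, and which tensor names and metadata keys (for example a `top_k` entry) the real basis files contain is an assumption to verify against the files themselves.

```python
import json
import struct


def read_safetensors_header(path):
    """Return (tensors, metadata) parsed from a .safetensors header.

    The file starts with an unsigned little-endian 64-bit header length,
    followed by that many bytes of UTF-8 JSON. No tensor data is read.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form string-to-string map.
    metadata = header.pop("__metadata__", {})
    return header, metadata


def summarize_basis(path):
    """Print every tensor's dtype and shape, then the header metadata."""
    tensors, metadata = read_safetensors_header(path)
    for name, info in sorted(tensors.items()):
        print(f"{name}: dtype={info['dtype']} shape={info['shape']}")
    print("metadata:", metadata)
    return tensors, metadata
```

Calling `summarize_basis` on each basis file (the path is whatever the local checkout uses) would show the first dimension of the stored per-pair gradients and any recorded extraction settings side by side, which is the comparison the message relies on.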
@@ -95,14 +95,14 @@ measurements and the move from Qwen3.5-2B to Qwen3-4B (reference substrate).
 These are headline results from the fast preset (20 steps, mix=0.5, seed=41).
 Full provenance and per-step log audits are in `RESEARCH_JOURNAL.md`.
 
-**What appears to work (entry h, n=1):** widening the hand-crafted contrastive
-pair set from 12 to 21 pairs across 6 axes (weak run_tests, hardcode,
-persona-voice, try/except swallow, type-only assert, weak inequality predicate)
-drops last-5 student hack rate from 77.5% (12-pair projected, frozen V) to
-47.5% (21-pair projected, frozen V), at matched ground-truth pass rate near
-20%. The hack-gt gap shrinks from ~50pp to ~27.5pp, so the projection looks
-selective rather than just deflating both rates. Vanilla-baseline head-to-head
-and seed=42/43 replicates are queued.
+**What appears to work (seed 41):** a stronger extracted basis drops last-5
+student hack rate from 77.5% (`v_hack_full`) to 47.5% (`v_hack_21pairs`),
+frozen V, at matched ground-truth pass rate near 20%. CAVEAT (corrected
+2026-05-29 from the safetensors shapes, see docs/results.md Q8): the two bases
+differ on three axes at once — pairs used (10 vs 16), directions kept (k=5 vs
+k=12), and extract tau (0.25 vs 0.0) — so this is NOT cleanly "more pairs".
+A one-knob k-sweep is needed to attribute the gain. Vanilla-baseline
+head-to-head and seed=42/43 replicates are queued.
 
 **What turns out to matter for the design (entries f, i):** the extracted
 v_hack basis goes stale fast during training. The per-step cosine of the
+32
-42
@@ -24,8 +24,9 @@ logs in an HTML comment so every number traces back to a file.
 - All runs are the `fast` preset (20 steps, G=4, cached-teacher mix); the fast
 surrogate regime, not endogenous hacking. Incomplete runs are excluded (a run
 must log all `steps`).
-- Confound: `v_hack_full` / `v_hack_18base` = 18-pair extraction;
-`v_hack_21pairs` = 21-pair set. Cross-basis rows confound count with pair-set.
+- Confound (corrected from safetensors shapes, see Q8): `v_hack_full` = 10
+pairs / k=5; `v_hack_21pairs` = 16 pairs / k=12. Cross-basis rows confound
+pair-count AND directions-kept AND tau — NOT a clean "pair set" axis.
 
 ---
 
@@ -140,53 +141,42 @@ cost vanishes at low mix** — Δsolve goes from −8pp at mix=0.5 to slightly
 std (0.040, n=2). This is why 0.125 is now the locked-in default: same hack
 cut, no solve tax.
 
-## Q8. Pair set: 18-pair (v_hack_full) vs 21-pair (v_hack_21pairs)
+## Q8. Weak basis (`v_hack_full`) vs strong basis (`v_hack_21pairs`)
 
-<!-- src (mix=0.5, frozen, one_sided): v_hack_full *_goal{0,1}_fast_s4{1,3,4}.log; v_hack_21pairs *_g0_21pairs_frozen_s4{1-4}.log; v_hack_18base *_solveorth_base18_s41.log -->
+<!-- src (mix=0.5, frozen, one_sided): v_hack_full *_goal{0,1}_fast_s4{1,3,4}.log; v_hack_21pairs *_g0_21pairs_frozen_s4{1-4}.log -->
 
+The basis NAMES are misleading. Reading the safetensors shapes/metadata (the
+stored per-pair grads' first dim = pairs used; basis `top_k` from header):
+
+| basis | pairs used | k (top_k) | extract tau | what it is |
+| :--------------- | ---------: | --------: | ----------: | :--------- |
+| `v_hack_full` | **10** | **5** | 0.25 | older ~12-pair set, k=5 |
+| `v_hack_21pairs` | **16** | **12** | 0.0 | later ~18-pair set, k=12 |
+
+Neither is 18 or 21 pairs (n_heldout=2 reserves 2). Both load with the same
+train-time `drop_bottom_frac=0.25` noise floor. So the comparison below is
+**triple-confounded: pairs (10 vs 16) AND directions kept (k=5 vs k=12) AND
+extract tau.** We cannot attribute the gap to "pair set".
+
 mix=0.5, frozen, one_sided:
 
 | basis | hack | ±std | solve | ±std | n | seeds |
-| :------------------ | ----: | ----: | ----: | ----: | ---: | ----------: |
+| :----------------- | ----: | ----: | ----: | ----: | ---: | ----------: |
 | vanilla | 0.719 | 0.120 | 0.306 | 0.116 | 4 | 41,42,43,44 |
-| v_hack_full (18) | 0.700 | 0.109 | 0.283 | 0.038 | 3 | 41,43,44 |
-| v_hack_21pairs (21) | 0.588 | 0.131 | 0.256 | 0.083 | 4 | 41,42,43,44 |
+| v_hack_full (weak) | 0.700 | 0.109 | 0.283 | 0.038 | 3 | 41,43,44 |
+| v_hack_21pairs | 0.588 | 0.131 | 0.256 | 0.083 | 4 | 41,42,43,44 |
 
-At the shared seed 41: 18-pair = 0.775 (= vanilla, no effect), 21-pair = 0.475.
+At shared seed 41: weak basis = 0.775 (= vanilla, no effect), strong = 0.475.
 
-**Answer: pair set is the largest lever.** The 18-pair basis barely moves hack
-(0.700 vs vanilla 0.719; at seed 41 it's identical to vanilla). The 21-pair set
-drops it to 0.588 (−13pp) and at seed 41 to 0.475 (−30pp).
+**Answer: the k=12 / 16-pair basis cuts hack ~2x more than k=5 / 10-pair, but
+we don't know if k, pair-count, or tau drives it.** Untangling needs a one-knob
+sweep (same pairs, k=5 vs 12) — not yet run. The strong basis IS reproducible
+from current `pairs.py`: `extract --top-k=12 --v-hack-drop-bottom-frac=0.0`
+(n_heldout=2 → 16 of 18 pairs); refresh already re-extracts at k=12.
 
-What's actually in the pairs (`src/projected_grpo/pairs.py`). The 18-pair set
-(`PAIRS`) is heavily skewed to one axis:
-
-| axis | hack mechanism | pairs |
-| :--- | :--- | ---: |
-| 1 | weak `run_tests` (prints / fake `# Expected`, no real assert) | 8 |
-| 2 | hardcode literal answers from the example fixtures | 2 |
-| 3 | persona-voice (completion's comments/voice differ, same code) | 2 |
-| 4 | try/except that swallows the assertion failure | 2 |
-| 5 | type-only assert (`isinstance`, no value check) | 2 |
-| 6 | weak inequality predicate (`>= -1`, `>= 0`) | 2 |
-
-So 8/18 (44%) of the 18-pair basis is the single "weak test-validation" axis;
-the other five mechanisms get 2 pairs each. The detector the student actually
-learns to exploit in the teacher pool is mostly axes 1/2/4 (test-bypass), so a
-basis dominated by axis-1 *should* overlap the live gradient — yet the 18-pair
-version doesn't move hack at seed 41. The 21-pair set added 3 pairs and lifted
-the cut to −30pp at seed 41, which says either (a) those specific 3 pairs land
-on a mechanism the live gradient uses, or (b) more pairs simply condition the
-SVD better (rank-3 of real signal vs rank-1). We can't separate these:
-
-- the exact 3 extra pairs are **not in the committed `pairs.py`** (the 21-pair
-set was a temporary local edit during the `v_hack_21pairs.safetensors`
-extraction), so we can't audit which axes they hit. **FIXME: re-add the 3
-pairs to `PAIRS` so the 21-pair basis is reproducible.**
-- `v_hack_21pairs.safetensors` is 26M vs `v_hack_full`'s 11M (2.4x), far more
-than +17% pairs explains — the 21-pair extraction likely also kept more
-directions/modules (different k or noise-floor), confounding pair-set with
-basis width. Treat the 18-vs-21 gap as "better basis" not cleanly "more pairs".
+For reference, the current `pairs.py` (`PAIRS`, 18 pairs) is skewed to one axis:
+axis-1 weak-`run_tests` = 8/18; the other five mechanisms (hardcode, persona,
+try/except-swallow, type-only-assert, weak-inequality) get 2 each.
 
 ---
 