results: add Q10 (pairset mechanism>framing>placebo) + Q11 (60-step convergence gap closes)

Q10: swap only pair-set content (all bases k=12/tau=0, trained k=5, seed-41 mix=0.125 frozen). prog_wide (mechanism) -0.226; semantic framings ~0; null_city placebo +0.024. v_hack tracks the hack mechanism, not a generic honesty direction. n=1 per row, baseline noise +/-0.06.

Q11: 60-step seed-42 mix=0.125, gap closes (vanilla 0.936, frozen 0.957, refresh-2 0.907) -- projection delays but does not prevent hacking at this horizon. n=1, confounded with mix/seed vs Q2.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 17:30:41 +08:00 · 2026-05-30 02:34:22 +00:00
parent ee136ac7e8
commit f88b8b32c1
1 changed files with 105 additions and 14 deletions
@@ -1,7 +1,7 @@
 # Results, organized by the question each run answers

 Generated from `logs/*.log` via `just results` (source: `scripts/results.py`).
-Curated snapshot 2026-05-29; regenerate any time. Each table cites its source
+Curated snapshot 2026-05-30; regenerate any time. Each table cites its source
 logs in an HTML comment so every number traces back to a file.

 ## How to read this
@@ -168,15 +168,18 @@ mix=0.5, frozen, one_sided:

 At shared seed 41: weak basis = 0.775 (= vanilla, no effect), strong = 0.475.

-**Answer: the k=12 / 16-pair basis cuts hack ~2x more than k=5 / 10-pair, but
-we don't know if k, pair-count, or tau drives it.** Untangling needs a one-knob
-sweep (same pairs, k=5 vs 12) — not yet run. The strong basis IS reproducible
-from current `pairs.py`: `extract --top-k=12 --v-hack-drop-bottom-frac=0.0`
-(n_heldout=2 → 16 of 18 pairs); refresh already re-extracts at k=12.
+**Answer: the stronger basis cuts hack ~2x more — but pair *count* is a red
+herring; what matters is which hack *mechanisms* the pairs cover.** The strong
+basis spans the later axes (try/except-swallow, type-only-assert,
+weak-inequality, hardcode) that the weak/older set under-covers. The real
+experiment is a content/axis ablation (which mechanisms carry the cut), which
+is the same question as Stage 2/3 (G2/G3) cross-mechanism generalisation: does
+a basis from mechanism A suppress hack B? That is the no-cheat hypothesis
+itself. The k=5-vs-12 and 10-vs-16 differences are present but secondary.

-For reference, the current `pairs.py` (`PAIRS`, 18 pairs) is skewed to one axis:
-axis-1 weak-`run_tests` = 8/18; the other five mechanisms (hardcode, persona,
-try/except-swallow, type-only-assert, weak-inequality) get 2 each.
+Current `pairs.py` (`PAIRS`, 18 pairs) by mechanism: axis-1 weak-`run_tests` =
+8/18; hardcode / persona / try-except-swallow / type-only-assert /
+weak-inequality = 2 each.

 ---

@@ -202,6 +205,89 @@ strip real hack signal; (3) hack/solve subspaces may genuinely overlap. Needs

 ---

+## Q10. 🥇 Does the *pair set content* matter? (mechanism vs framing vs placebo)
+
+<!-- src (seed 41, mix=0.125, frozen, one_sided, k=5; all bases extracted k=12/tau=0):
+     #171 *_pairset_prog_wide_s41.log; #169 *_pairset_prog_wider_s41.log;
+     #164 *_pairset_intent_vs_spec_s41.log; #162 *_pairset_honesty_text_s41.log;
+     #160 *_pairset_moral_s41.log; #166 *_pairset_eval_aware_s41.log;
+     #158 *_pairset_philosophical_s41.log; #168 *_pairset_null_city_s41.log (PLACEBO).
+     baseline: 3x vanilla mix=0.125 seed-41 runs. -->
+
+The detector we're allowed to have is *weak* (no-cheat invariant): it sees some
+hacks and misses others. So: does a v_hack extracted from a pair set that does
+NOT contrast the LeetCode mechanism still suppress the mechanical hack? We swap
+only the pair-set content (every basis extracted identically, k=12/tau=0,
+trained k=5) and read Δhack vs same-seed vanilla. Each row is a single
+projected run (n=1); ±0.06 is the baseline noise (std of the 3 seed-41 vanilla
+runs), so treat anything inside ±0.06 as null.
+
+| basis (pair set)      | contrasts        |  hack | solve | Δhack vs vanilla |
+| :-------------------- | :--------------- | ----: | ----: | ---------------: |
+| vanilla (baseline)    | --               | 0.726 | ~0.20 |                — |
+| `prog_wide`           | hack *mechanism* | 0.500 | 0.221 |           −0.226 |
+| `prog_wider`          | mech + lang/cond | 0.679 | 0.236 |           −0.048 |
+| `intent_vs_spec`      | semantic framing | 0.686 | 0.207 |           −0.040 |
+| `honesty_text`        | semantic framing | 0.714 | 0.193 |           −0.012 |
+| `moral`               | semantic framing | 0.721 | 0.221 |           −0.005 |
+| `eval_aware`          | semantic framing | 0.736 | 0.186 |           +0.010 |
+| `philosophical`       | semantic framing | 0.743 | 0.243 |           +0.017 |
+| `null_city` (PLACEBO) | random content   | 0.750 | 0.221 |           +0.024 |
+
+(Baseline = mean of the 3 seed-41 vanilla mix=0.125 runs the deltas are paired
+against. The canonical `v_hack_21pairs` is NOT in this table: it was only run
+at mix=0.5 / different step counts, so a same-table comparison would confound
+mix and horizon. Its mix=0.5 effect is in Q2.)
+
+**Answer: it's the *mechanism*, not the *framing*.** Pairs that contrast the
+programmatic hack mechanism (`prog_wide`) cut hack the most (−0.226), at no
+solve cost. Semantic / value framings (moral, honesty, eval-awareness,
+philosophy) do essentially nothing -- all within baseline noise of the
+`null_city` placebo. The placebo sits at +0.024 (no effect), exactly as it
+should. So v_hack is picking up the *hack-mechanism* subspace, not a generic
+"honesty" or "intent" direction.
+
+Caveats (n=1, hold loosely): (1) all rows are single seed-41 runs; ±0.06 is the
+seed-41 vanilla noise, so every row below `prog_wide` is null, `prog_wider`
+included. `prog_wide` needs ≥3 seeds (task #122) before −0.226 is trustworthy.
+(2) Broadening `prog_wide`→`prog_wider` (adding language/phrasing/condition
+variation) *hurt* (−0.226→−0.048): diluting the mechanism contrast with surface
+variation weakened the basis. (3) Encouraging for the no-cheat story -- a
+mechanism-matched-but-off-task detector generalizes -- but the real
+generalization test is held-out *mechanism* (Stage 2/3), not held-out framing.
+
+## Q11. Does the projection gap survive to convergence? (60-step, seed 42)
+
+<!-- src (seed 42, mix=0.125, 60 steps; n=1 each):
+     #179 *_converge_vanilla_mix0125_s42.log; #178 *_converge_refresh2_mix0125_s42_logged.log;
+     *_converge_proj21_mix0125_s42.log (frozen). NOTE: results.py has no `steps`
+     dimension, so these merge into the 20-step mix=0.125 groups in the grouped
+     table -- read them here from the raw last-5, not from that row. -->
+
+The Q2 gap (−13 to −18pp) is measured at 20 steps, where vanilla hack has only
+just plateaued (~step 13-16, see Dynamics note). Does projection keep hack down
+once we run 3x past the plateau? One 60-step run per arm, seed 42, mix=0.125:
+
+| arm                 | L5 hack | L5 solve |
+| :------------------ | ------: | -------: |
+| vanilla             |   0.936 |    0.293 |
+| projected frozen-V  |   0.957 |    0.293 |
+| projected refresh-2 |   0.907 |    0.307 |
+
+**Answer: at n=1, the gap closes.** By step 60 all three arms sit at ~0.91-0.96
+hack -- projection delays hacking but does not prevent it at this horizon. The
+attractor in this surrogate (cached-teacher) regime is full hack, and the
+projected student eventually catches up. refresh-2 is marginally below vanilla
+(−2.9pp) and frozen marginally above (+2.1pp), both inside the ~0.06-0.12 seed
+noise, so the honest read is "no surviving gap at 60 steps, seed 42."
+
+Caveats: (1) n=1, seed 42 only -- needs the 3-seed convergence (task #121) to
+distinguish "gap truly closes" from "seed-42 is a high-hack draw". (2) This is
+mix=0.125 + seed 42 + 60 steps, three axes different from Q2's mix=0.5/20-step
+numbers, so it is NOT a clean "same run, later" comparison. (3) The 20-step
+suppression is real (Q2, n=4); what's unclear is whether longer training erodes
+it or whether this is a sparse-teacher/seed artifact.
+
 ## Dynamics note (sizing the convergence test)

 Per-step trajectories (mix=0.125 g8, seed 41): `hack_s` rises 0→~0.6-0.75 and
@@ -214,9 +300,14 @@ the plateau; a 1000-step run (~36h) is wasteful.

 ## Open / queued (no result yet)

- **convergence (does the gap persist past the plateau?)**: 60-step seed-42
-  vanilla vs projected refresh-2 at mix=0.125, then add seeds if the gap holds.
- **overshoot=1.1** (#140): queued.
+- **convergence at ≥3 seeds (#121)**: the n=1 seed-42 run (Q11) shows the gap
+  closing by step 60, but that could be a seed-42 high-hack draw. Need 2+ more
+  seeds before concluding the suppression erodes vs survives.
+- **pairset content at ≥3 seeds (#122)**: Q10's mechanism>framing>placebo
+  ordering is n=1 per row; replicate `prog_wide` and the placebo on 2+ seeds.
+- **route arm at scale (#182)**: running; validates routing's ablated-eval
+  hack<kept on Qwen3-4B before the 3-way none/erase/route cells (#130).
 - **k-slice (k=1/2/5)**: only smoke-tested, no 4B results.
- **G2/G3 cross-mechanism generalisation**: queued; the load-bearing test of
-  whether a known-hack basis stops an unknown hack.
+- **Stage 2/3 cross-*mechanism* generalisation**: the load-bearing test --
+  extract v_hack from hack A, check it stops the *unknown* hack B the student
+  would otherwise learn. Q10 (held-out *framing*) is a weaker cousin.
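
The Q10 and Q11 reads above both reduce to the same arithmetic: subtract the same-seed vanilla hack rate and compare the difference to a noise band. A minimal sketch of that check follows, with values copied by hand from the two tables. It is not part of `scripts/results.py`; the names `classify`, `Q10_HACK`, and `Q11_HACK` are hypothetical, and the ±0.06 band is applied to Q11 only as the lower end of the stated ~0.06-0.12 seed noise. Deltas recomputed from the rounded table values can differ from the published column in the last digit (e.g. `prog_wider` gives −0.047 here vs −0.048 in the table).

```python
# Hypothetical re-check of the Q10/Q11 deltas; all numbers are copied from the tables.
NOISE = 0.06  # stated baseline noise (std of the 3 seed-41 vanilla runs)

Q10_VANILLA = 0.726  # mean of the 3 seed-41 vanilla mix=0.125 runs
Q10_HACK = {
    "prog_wide": 0.500,
    "prog_wider": 0.679,
    "intent_vs_spec": 0.686,
    "honesty_text": 0.714,
    "moral": 0.721,
    "eval_aware": 0.736,
    "philosophical": 0.743,
    "null_city": 0.750,  # placebo
}

Q11_VANILLA = 0.936  # 60-step, seed 42, last-5 hack
Q11_HACK = {"frozen": 0.957, "refresh-2": 0.907}


def classify(hack, baseline, noise=NOISE):
    """Return (delta, label); label is 'null' when |delta| is inside the noise band."""
    delta = round(hack - baseline, 3)
    if abs(delta) <= noise:
        label = "null"
    elif delta < 0:
        label = "suppresses"
    else:
        label = "amplifies"
    return delta, label


q10 = {name: classify(h, Q10_VANILLA) for name, h in Q10_HACK.items()}
q11 = {name: classify(h, Q11_VANILLA) for name, h in Q11_HACK.items()}

if __name__ == "__main__":
    for name, (delta, label) in {**q10, **q11}.items():
        print(f"{name:16s} {delta:+.3f}  {label}")
```

Under these assumptions only `prog_wide` falls outside the band in Q10, and both Q11 arms are null, which matches the "mechanism, not framing" and "no surviving gap at 60 steps" reads. With n=1 per row the labels are a bookkeeping aid, not a significance test.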