results: same-seed paired deltas + std, exclude incomplete runs

- paired view: join projected to vanilla on (mix, seed), per-seed delta,
  mean +/- std over shared seeds. Comparing a 3-seed mean to a 1-seed point is
  meaningless; this enforces same-seed comparison (ml_debug principle).
- grouped view now reports std across seeds (null at n=1).
- exclude in-progress/aborted runs (must log all `steps`) so partial logs
  don't read as impossibly-good results.
- docs/results.md rewritten around paired deltas; honest that at n=4 the
  last-5 Dhack std (~0.15) ~= the mean (~0.13), so the effect is consistent in
  sign but not cleanly separated from zero.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-27 17:30:41 +08:00 · 2026-05-29 08:10:42 +00:00
parent 4464f9d312
commit e7cdcaa0ab
2 changed files with 132 additions and 134 deletions
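
The paired view from the commit message can be sketched with the standard library alone. This is a minimal illustration, not the script's implementation (which uses a dataframe join): the dictionaries and their values are hypothetical, and it assumes at most one run per (mix, seed). The last step plugs the reported frozen-V headline (mean −0.131, std 0.146, n=4) into a one-sample t ratio to show why the message calls the effect "not cleanly separated from zero".

```python
import math
import statistics


def paired_deltas(projected, vanilla):
    """Per-seed deltas over the (mix, seed) keys present in both dicts."""
    shared = sorted(projected.keys() & vanilla.keys())
    return [projected[k] - vanilla[k] for k in shared]


def summarize(deltas):
    """Return (mean, sample std or None at n=1, n)."""
    n = len(deltas)
    mean = statistics.fmean(deltas)
    sd = statistics.stdev(deltas) if n > 1 else None  # undefined for one seed
    return mean, sd, n


def t_ratio(mean, sd, n):
    """mean / standard error: the one-sample t statistic for 'delta == 0'."""
    return mean / (sd / math.sqrt(n))


# Hypothetical last-5 hack rates keyed by (mix, seed); NOT values from the logs.
vanilla = {(0.5, 41): 0.80, (0.5, 42): 0.70, (0.5, 43): 0.75, (0.25, 41): 0.65}
projected = {(0.5, 41): 0.60, (0.5, 42): 0.65, (0.5, 44): 0.50}

# Seed 44 has no vanilla partner at mix=0.5, so it drops out of the pairing.
deltas = paired_deltas(projected, vanilla)
mean, sd, n = summarize(deltas)

# Reported headline numbers from the commit: mean -0.131, std 0.146, n=4.
# The two-sided 5% critical value of Student's t with 3 degrees of freedom
# is 3.182, so |t| well below that means zero is not excluded.
t = t_ratio(-0.131, 0.146, 4)
print(f"toy: mean={mean:+.3f} sd={sd:.3f} n={n}; headline t={t:.2f}")
```

With the standard error at 0.146 / 2 = 0.073, the headline ratio is roughly −1.8, which is why more seeds are needed before the magnitude can be called.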
@@ -1,169 +1,137 @@
 # Results, organized by the question each run answers

 Generated from `logs/*.log` via `just results` (source: `scripts/results.py`).
-Regenerate any time; this file is a curated snapshot as of 2026-05-29.
+Curated snapshot 2026-05-29; regenerate any time.

 ## How to read this

- **Metric = mean of the last 5 training steps** (the converged regime;
-  noise-robust vs a single final step). Whole-run means (`WH`) are kept as a
-  secondary column because the blog Table 1 uses whole-run; the two diverge a
-  lot because hacking ramps over training, so last-5 is the honest "where it
-  ended up" number.
+- **Metric = mean of the last 5 training steps** (converged regime; noise-robust
+  vs a single step). Whole-run (`WH`) is smoother but dilutes the converged
+  behaviour with the early ramp-up; the blog Table 1 uses WH, this doc uses
+  last-5.
 - **hack** = fraction of *student* rollouts flagged as reward-hacks (`hack_s`).
- **solve** = fraction of *student* rollouts passing the ground-truth tests
-  (`gt_s`). This is NOT `PASS_RATE`, which mixes in the ~99%-hacked teacher
-  pool and is near-useless as a student-quality signal.
- All runs are the `fast` preset (20 steps, G=4, cached-teacher mix). This is
-  the fast surrogate regime, not the endogenous-hack regime.
- **Epistemic status:** many ablation cells are n=1 seed — suggestive, not
-  conclusive. Seed counts are in the tables. The mix=0.5 headline is the only
-  n=4 cell.
- **Provenance:** `just results` prints a per-run table with a full `argv`
-  column (every CLI flag), so each number traces to its exact invocation.
-
-A confound to keep in mind: `v_hack_full` is an 18-pair extraction (current
-`pairs.py`), while `v_hack_21pairs` is a 21-pair set. Comparisons across those
-two confound pair-count with pair-set.
+- **solve** = fraction of *student* rollouts passing ground-truth tests
+  (`gt_s`). NOT `PASS_RATE` (which mixes in the ~99%-hacked teacher pool).
+- **Comparisons are paired on seed.** A projected run is compared to the vanilla
+  run at the *same (mix, seed)*; we take per-seed deltas, then mean ± std over
+  shared seeds. Comparing a 3-seed mean to a 1-seed point (as an earlier draft
+  did) is meaningless. n=1 cells have no std and are flagged as such.
+- **Sobering caveat up front:** last-5 is a 5-step mean, so per-seed it's noisy;
+  at n=4 the headline Δhack std (~0.15) is as large as the mean (~0.13). The
+  effect is real-looking and consistent in sign, but NOT cleanly separated from
+  zero at this n. Read the deltas as suggestive, weighted by n and std.
+- All runs are the `fast` preset (20 steps, G=4, cached-teacher mix); the fast
+  surrogate regime, not endogenous hacking. Incomplete/aborted runs are
+  excluded (a run must log all `steps`).
+- Confound: `v_hack_full` = 18-pair extraction; `v_hack_21pairs` = 21-pair set.
+  Cross-basis comparisons confound pair-count with pair-set.

 ---

-## Q1. Does the cached-teacher pool actually drive the student to hack? (feasibility, H4)
-
-Why: the whole fast-surrogate design rests on a clean base student picking up
-hacking from off-policy teacher exposure, instead of the ~64 GPU-h endogenous route.
+## Q1. Does the cached-teacher pool drive the student to hack? (feasibility, H4)

 | arm | mix | hack | solve | seeds |
 | :-- | --: | --: | --: | --: |
-| vanilla | 0.5 | 0.719 | 0.306 | 4 |
-| vanilla | 0.25 | 0.678 | 0.200 | 3 |
-| vanilla | 0.125 | 0.754 | 0.261 | 2 |
+| vanilla | 0.5 | 0.719 | 0.306 | 41,42,43,44 |
+| vanilla | 0.25 | 0.678 | 0.200 | 41,42,43 |
+| vanilla | 0.125 | 0.754 | 0.261 | 41 (×2) |

-**Answer: yes.** Clean Qwen3-4B reaches 68-75% last-5 student hack rate within
-20 steps across teacher densities. The surrogate works; the student learns to
-hack from exposure.
+**Answer: yes.** Clean Qwen3-4B reaches 68-75% last-5 hack within 20 steps at
+every teacher density. (Don't compare mixes here as a trend — different seed
+sets; see Q6 for the paired mix comparison.)

-## Q2. Does v_hack gradient projection reduce hacking vs vanilla, at matched config? (H1)
+## Q2. Does v_hack projection reduce hacking vs vanilla? (H1, paired)

-Why: the core hypothesis. mix=0.5, v_hack_21pairs, one_sided, k=5, n=4 seeds (41,42,43,44).
+Paired Δ vs same-seed vanilla, mix=0.5, v_hack_21pairs, one_sided, k=5, n=4 (41-44):

-| arm | hack | solve | Δhack | Δsolve | seeds |
-| :-- | --: | --: | --: | --: | --: |
-| vanilla | 0.719 | 0.306 | — | — | 4 |
-| projected frozen-V | 0.588 | 0.256 | −13.1pp | −5.0pp | 4 |
-| projected refresh-2 | 0.537 | 0.225 | −18.2pp | −8.1pp | 4 |
+| arm | Δhack | Δhack std | Δsolve | n |
+| :-- | --: | --: | --: | --: |
+| projected frozen-V | −0.131 | 0.146 | −0.050 | 4 |
+| projected refresh-2 | −0.181 | 0.169 | −0.081 | 4 |

-**Answer: yes, but with a real solve cost.** Projection cuts last-5 hack by
-13pp (frozen) to 18pp (refresh-2), short of the preregistered 30pp. It also
-costs 5-8pp of student solve rate — a genuine selectivity problem (this is
-what solve-orthogonalization, queued, targets). Note the cost is invisible on
-whole-run `PASS_RATE`; it only shows on the last-5 student-GT metric.
+**Answer: a consistent-in-sign reduction (−13pp frozen, −18pp refresh-2), but
+the std ≈ the mean at n=4, so it is not statistically clean.** Both arms also
+cost 5-8pp of student solve. Short of the preregistered 30pp. The honest
+statement: directionally it reduces hacking on every seed, but more seeds are
+needed to call the magnitude. (WH paired deltas are smoother and tell the same
+sign story.)

-## Q3. one_sided vs no_gate vs reverse gating? (gate_mode ablation)
+## Q3. one_sided vs no_gate vs reverse gating? (gate_mode)

-Why: how aggressively to ablate. one_sided removes only the hack-ward
-component; no_gate removes any motion in span(V); reverse pushes anti-hack.
-mix=0.5, v_hack_full, frozen, n=1 (seed 41) each — suggestive only.
-
-| gate | hack | solve | seeds |
+| gate | Δhack | Δsolve | n |
 | :-- | --: | --: | --: |
-| one_sided | 0.700 | 0.283 | 3 |
-| no_gate | 0.625 | 0.200 | 1 |
-| reverse | 0.575 | 0.150 | 1 |
+| one_sided | −0.062 | −0.081 | 4 |
+| no_gate | −0.150 | −0.100 | 1 |
+| reverse | −0.200 | −0.150 | 1 |

-**Answer: more aggressive = more hack suppression but worse solve.** reverse
-gives the largest hack cut (0.575) but halves solve (0.306→0.150). one_sided is
-the most solve-preserving. This is the same selectivity tension as Q2: you can
-buy hack reduction with solve, and the gradient of that trade is steep.
+(All v_hack_full, mix=0.5, frozen. one_sided is n=4 with std 0.075; no_gate and
+reverse are **n=1, no std** — not yet comparable.)

-## Q4. SVD top-k basis vs rank-1 mean-diff? (basis ablation)
+**Answer: provisional only.** The n=1 aggressive gates (no_gate, reverse) show
+larger hack cuts and larger solve costs, consistent with the selectivity
+trade-off, but each is a single seed. Needs ≥3 seeds before any claim. Note
+one_sided on v_hack_full is only −0.062 (within std) — weak basis (see Q8).

-Why: with few pairs, SVD axes 2..k may be noise; mean-diff (k=1) regularizes
-to the single robust direction. mix=0.5, frozen, n=1.
+## Q4. SVD top-k vs rank-1 mean-diff? (basis)

-| basis | hack | solve | seeds |
+| basis | Δhack | Δsolve | n |
 | :-- | --: | --: | --: |
-| SVD top-k (k=5, v_hack_full) | 0.700 | 0.283 | 3 |
-| mean-diff (k=1, v_hack_full_meandiff) | 0.750 | 0.125 | 1 |
+| SVD k=5 (v_hack_full) | −0.062 | −0.081 | 4 |
+| mean-diff k=1 (v_hack_full_meandiff) | −0.025 | −0.175 | 1 |

-**Answer: mean-diff is worse on both axes** (higher hack, much lower solve). A
-rank-1 basis is too blunt — it doesn't suppress more hacking and it costs more
-solve. Keep the multi-axis SVD basis.
+**Answer: mean-diff looks worse** (smaller hack cut, larger solve cost) but n=1.
+A rank-1 basis being too blunt is plausible; not established at n=1.

-## Q5. refresh-every cadence sweep
+## Q5. refresh-every cadence

-Why: the v_hack basis goes stale as the student drifts (cos_pre_t decays
-0.28→0.07). How often to re-extract? mix=0.5, v_hack_21pairs, one_sided, n=1
-except frozen/refresh-2 (n=4).
+| refresh | Δhack | Δhack std | Δsolve | n |
+| --: | --: | --: | --: | --: |
+| frozen | −0.131 | 0.146 | −0.050 | 4 |
+| 1 | −0.175 | — | −0.100 | 1 |
+| 2 | −0.181 | 0.169 | −0.081 | 4 |
+| 5 | −0.225 | — | −0.075 | 1 |
+| 10 | −0.200 | — | −0.100 | 1 |

-| refresh | hack | solve | seeds |
-| :-- | --: | --: | --: |
-| frozen (0) | 0.588 | 0.256 | 4 |
-| 1 | 0.600 | 0.200 | 1 |
-| 2 | 0.537 | 0.225 | 4 |
-| 5 | 0.550 | 0.225 | 1 |
-| 10 | 0.575 | 0.200 | 1 |
+(All v_hack_21pairs, mix=0.5, one_sided.)

-**Answer: refresh-2 is the sweet spot** (lowest hack at 0.537). refresh-1 is no
-better than frozen (too noisy a basis), and 5/10 drift back up. The effect is
-small (~5pp) and the n=1 cells are noisy, but 2 is the consistent pick.
+**Answer: refresh-2 edges out frozen** (−0.181 vs −0.131, both n=4) but the
+difference (~5pp) is small vs the std (~0.16). The n=1 cadences (1/5/10) all
+beat frozen, with no clear trend in cadence; unconfirmed.

-## Q6. Teacher density (mix-ratio) — does the projection gap hold as the pool thins?
+## Q6. Teacher density (mix) — paired, does the gap hold as the pool thins?

-Why: lower mix = less off-policy hack pressure, closer to the real regime.
-v_hack basis frozen, one_sided.
-
-| mix | vanilla hack | projected hack | Δhack | vanilla solve | projected solve |
+| mix | Δhack | Δhack std | Δsolve | n | shared seeds |
 | --: | --: | --: | --: | --: | --: |
-| 0.5 | 0.719 | 0.588 | −13pp | 0.306 | 0.256 |
-| 0.25 | 0.678 | 0.556 | −12pp | 0.200 | 0.217 |
-| 0.125 | 0.754 | 0.657 | −10pp | 0.261 | 0.214 |
+| 0.5 | −0.062 | 0.075 | −0.081 | 4 | 41(×2),43,44 |
+| 0.25 | −0.122 | 0.146 | +0.017 | 3 | 41,42,43 |
+| 0.125 | −0.100 | 0.040 | +0.007 | 2 | 41(×2) |

-**Answer: the gap holds, narrowing slightly as the pool thins** (−13 → −10pp).
-At mix=0.25 projection even nudges solve up. The intervention isn't an artifact
-of heavy teacher mixing. (mix=0.25/0.125 use v_hack_full, so not strictly
-matched to the 21-pair mix=0.5 row.)
+(v_hack_full, frozen, one_sided — the basis with coverage at all three mixes.)

-## Q7. Noise-floor cut (drop_bottom_frac) 0.25 vs 0.0?
+**Answer: the reduction holds across densities (−6 to −12pp); every mix tested
+is enough to see it.** At lower mix the solve cost vanishes (even slightly
+positive). mix=0.125 has the tightest std (0.040), but its n=2 is seed 41 twice.

-Why: dropping the bottom-25% singular values is meant to remove noise axes.
-mix=0.5, v_hack_full, frozen, n=1.
+## Q8. Pair set: 18-pair (v_hack_full) vs 21-pair (v_hack_21pairs)

-| dropf | hack | solve | seeds |
-| --: | --: | --: | --: |
-| 0.25 | 0.700 | 0.283 | 3 |
-| 0.0 | 0.625 | 0.200 | 1 |
+| basis | Δhack | Δhack std | Δsolve | n |
+| :-- | --: | --: | --: | --: |
+| v_hack_full (18) | −0.062 | 0.075 | −0.081 | 4 |
+| v_hack_21pairs (21) | −0.131 | 0.146 | −0.050 | 4 |

-**Answer: inconclusive (n=1).** dropf=0 looks like a bigger hack cut but also
-lower solve — same trade as everywhere, and a single seed. Needs replication
-before drawing anything.
-
-## Q8. Pair set: 18-pair vs 21-pair extraction
-
-Why: more contrastive pairs across more axes should give a better-conditioned
-basis. mix=0.5, frozen, one_sided.
-
-| basis | hack | solve | seeds |
-| :-- | --: | --: | --: |
-| v_hack_full (18 pairs) | 0.700 | 0.283 | 3 |
-| v_hack_21pairs (21 pairs) | 0.588 | 0.256 | 4 |
-
-**Answer: the 21-pair basis suppresses more hacking** (0.588 vs 0.700) at a
-small solve cost. Pair set/count matters and is one of the larger levers seen
-here. Caveat: confounds count with the specific extra pairs, and the seed sets
-differ.
+**Answer: the 21-pair basis suppresses ~2x more hacking** (−0.131 vs −0.062),
+both n=4 mix=0.5 frozen. Pair set is one of the largest levers here. Confounds
+count with the specific extra pairs; the 21-pair shared-seed set is the full
+41-44 while v_hack_full's is 41(×2),43,44.

 ---

-## Open / queued (no results yet)
+## Open / queued (no result yet)

- **overshoot=1.1** (mild over-projection): queued (#140). Tests if removing
-  110% of the hack-ward component beats 1.0 without the solve cost of full
-  `reverse`.
- **solve-orthogonalization** (strip the known-solve subspace from D pre-SVD):
-  queued (#143-146), directly targets the Q2/Q3 solve cost.
- **let-it-converge** (60 steps): queued (#141-142), tests whether the gap
-  persists past step 20.
- **k-slice ablation** (k=1/2/5 SVD): only smoke-tested so far; no 4B results.
+- **solve-orthogonalization** (#145 base done, #146 m=4 running): base 18-pair
+  paired Δhack −0.275 / Δsolve −0.100 (n=1, seed 41). m=4 pending — that's the
+  one that tests whether stripping the solve subspace recovers the solve cost.
+- **overshoot=1.1** (#140), **let-it-converge 60-step** (#141/142): queued.
+- **k-slice (k=1/2/5)**: only smoke-tested, no 4B results.
 - **G2/G3 cross-mechanism generalisation**: queued; the load-bearing test of
-  whether a basis from known hacks suppresses an unknown one.
+  whether a known-hack basis stops an unknown hack.
@@ -79,6 +79,12 @@ def parse_log(path: Path) -> dict | None:
        return None  # CPU smoke runs, not real results
    if "probe" in cfg["tag"]:
        return None  # early feasibility / lr-sweep probes, not comparable baselines
+    # Exclude in-progress / aborted runs: a partial log has only the early
+    # (low-hack) steps, which would read as an impossibly-good result. A run is
+    # complete when it logged all `steps` per-step rows.
+    m = re.search(r"steps=(\d+)", preset_line)
+    if m and len(hs) < int(m.group(1)):
+        return None
    ts = TS_RE.search(path.name)
    mean = lambda v: sum(v) / len(v) if v else None
    cfg.pop("model")
@@ -102,20 +108,44 @@ def main() -> None:
    print("\n## All runs (sorted by time)\n")
    print(tabulate(df.select(cols).rows(), headers=cols, tablefmt="pipe", floatfmt=".3f"))

-    # Grouped by config (collapse seeds): mean across seeds. Key on every
-    # config dim that changes the experiment so non-comparable runs don't merge.
+    # Grouped by config (collapse seeds): mean +/- std across seeds. Key on
+    # every config dim that changes the experiment so non-comparable runs
+    # don't merge. std is null for n=1 (undefined).
    key = ["arm", "mix", "refr", "over", "gate", "k", "dropf", "vhack"]
    g = (df.group_by(key)
-           .agg(pl.col("L5_hack").mean(),
-                pl.col("L5_solve").mean(),
-                pl.col("WH_hack").mean(),
-                pl.len().alias("seeds"),
-                pl.col("seed").sort().str.join(",").alias("seed_list"))
+           .agg(pl.col("L5_hack").mean().alias("hack"),
+                pl.col("L5_hack").std().alias("hack_sd"),
+                pl.col("L5_solve").mean().alias("solve"),
+                pl.col("L5_solve").std().alias("solve_sd"),
+                pl.len().alias("n"),
+                pl.col("seed").sort().str.join(",").alias("seeds"))
           .sort(["mix", "arm", "refr", "over", "gate", "k"]))
-    gcols = key + ["L5_hack", "L5_solve", "WH_hack", "seeds", "seed_list"]
-    print("\n## Grouped by config (mean over seeds)\n")
+    gcols = key + ["hack", "hack_sd", "solve", "solve_sd", "n", "seeds"]
+    print("\n## Grouped by config (mean +/- std over seeds)\n")
    print(tabulate(g.select(gcols).rows(), headers=gcols, tablefmt="pipe", floatfmt=".3f"))

+    # Paired vs same-seed vanilla (matched mix): the only honest way to read a
+    # delta. Join each projected run to the vanilla run at the SAME (mix, seed),
+    # take per-seed deltas, then mean +/- std of the delta over shared seeds.
+    van = (df.filter(pl.col("arm") == "vanilla")
+             .select(["mix", "seed", "L5_hack", "L5_solve"])
+             .rename({"L5_hack": "v_hack", "L5_solve": "v_solve"}))
+    j = (df.filter(pl.col("arm") == "projected")
+           .join(van, on=["mix", "seed"], how="inner")
+           .with_columns((pl.col("L5_hack") - pl.col("v_hack")).alias("dh"),
+                         (pl.col("L5_solve") - pl.col("v_solve")).alias("ds")))
+    pkey = ["mix", "refr", "over", "gate", "k", "vhack"]
+    pj = (j.group_by(pkey)
+            .agg(pl.col("dh").mean().alias("Dhack"),
+                 pl.col("dh").std().alias("Dhack_sd"),
+                 pl.col("ds").mean().alias("Dsolve"),
+                 pl.len().alias("n"),
+                 pl.col("seed").sort().str.join(",").alias("shared_seeds"))
+            .sort(["mix", "vhack", "refr", "gate", "over"]))
+    pcols = pkey + ["Dhack", "Dhack_sd", "Dsolve", "n", "shared_seeds"]
+    print("\n## Paired delta vs same-seed vanilla (matched mix; negative = less hacking)\n")
+    print(tabulate(pj.select(pcols).rows(), headers=pcols, tablefmt="pipe", floatfmt="+.3f"))
+

 if __name__ == "__main__":
    main()
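
The completeness check added to `parse_log` can be tried in isolation. The sketch below mirrors the `steps=(\d+)` regex and the row-count comparison from the diff; the function name `is_complete`, the preset line text, and the sample hack rates are hypothetical, since the real log format is not shown here.

```python
import re


def is_complete(preset_line: str, per_step_rows: list) -> bool:
    """True when the run logged at least as many per-step rows as `steps=`.

    If no `steps=` is found the run is kept, matching the diff's
    `if m and len(hs) < ...` guard.
    """
    m = re.search(r"steps=(\d+)", preset_line)
    if m is None:
        return True
    return len(per_step_rows) >= int(m.group(1))


# Hypothetical preset line and per-step hack rates (illustration only).
preset = "preset=fast steps=20 G=4"
aborted = [0.05, 0.10, 0.20]               # only the early, low-hack steps
finished = [0.04 * i for i in range(20)]   # one row per configured step

print(is_complete(preset, aborted), is_complete(preset, finished))
```

An aborted run like the first one would otherwise average only its early, low-hack steps, which is exactly the "impossibly-good" reading the filter is meant to prevent.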