diff --git a/docs/results.md b/docs/results.md
index c2629d3..4c94ef8 100644
--- a/docs/results.md
+++ b/docs/results.md
@@ -1,7 +1,7 @@
 # Results, organized by the question each run answers
 
 Generated from `logs/*.log` via `just results` (source: `scripts/results.py`).
-Curated snapshot 2026-05-29; regenerate any time. Each table cites its source
+Curated snapshot 2026-05-30; regenerate any time. Each table cites its source
 logs in an HTML comment so every number traces back to a file.
 
 ## How to read this
@@ -168,15 +168,18 @@ mix=0.5, frozen, one_sided:
 
 At shared seed 41: weak basis = 0.775 (= vanilla, no effect), strong = 0.475.
 
-**Answer: the k=12 / 16-pair basis cuts hack ~2x more than k=5 / 10-pair, but
-we don't know if k, pair-count, or tau drives it.** Untangling needs a one-knob
-sweep (same pairs, k=5 vs 12) — not yet run. The strong basis IS reproducible
-from current `pairs.py`: `extract --top-k=12 --v-hack-drop-bottom-frac=0.0`
-(n_heldout=2 → 16 of 18 pairs); refresh already re-extracts at k=12.
+**Answer: the stronger basis cuts hack ~2x more — but pair *count* is a red
+herring; what matters is which hack *mechanisms* the pairs cover.** The strong
+basis spans the later axes (try/except-swallow, type-only-assert,
+weak-inequality, hardcode) that the weak/older set under-covers. The real
+experiment is a content/axis ablation — which mechanisms carry the cut — which
+is the same question as G2/G3 cross-mechanism generalisation (does a basis from
+mechanism A suppress hack B), the no-cheat hypothesis itself. The k=5-vs-12
+and 10-vs-16 differences are present but secondary.
 
-For reference, the current `pairs.py` (`PAIRS`, 18 pairs) is skewed to one axis:
-axis-1 weak-`run_tests` = 8/18; the other five mechanisms (hardcode, persona,
-try/except-swallow, type-only-assert, weak-inequality) get 2 each.
+Current `pairs.py` (`PAIRS`, 18 pairs) by mechanism: axis-1 weak-`run_tests` =
+8/18; hardcode / persona / try-except-swallow / type-only-assert /
+weak-inequality = 2 each.
 
 ---
 
@@ -202,6 +205,89 @@ strip real hack signal; (3) hack/solve subspaces may genuinely overlap. Needs
 
 ---
 
+## Q10. 🥇 Does the *pair set content* matter? (mechanism vs framing vs placebo)
+
+
+
+The detector we're allowed to have is *weak* (no-cheat invariant): it sees some
+hacks and misses others. So: does a v_hack extracted from a pair set that does
+NOT contrast the LeetCode mechanism still suppress the mechanical hack? We swap
+only the pair-set content (every basis extracted identically, k=12/tau=0,
+trained k=5) and read Δhack vs same-seed vanilla. n=1 projected per row;
+±0.06 is the baseline noise (std of the 3 seed-41 vanilla runs), so treat
+anything inside ±0.06 as null.
+
+| basis (pair set)      | contrasts        |  hack | solve | Δhack vs vanilla |
+| :-------------------- | :--------------- | ----: | ----: | ---------------: |
+| vanilla (baseline)    | --               | 0.726 | ~0.20 |                — |
+| `prog_wide`           | hack *mechanism* | 0.500 | 0.221 |           −0.226 |
+| `prog_wider`          | mech + lang/cond | 0.679 | 0.236 |           −0.048 |
+| `intent_vs_spec`      | semantic framing | 0.686 | 0.207 |           −0.040 |
+| `honesty_text`        | semantic framing | 0.714 | 0.193 |           −0.012 |
+| `moral`               | semantic framing | 0.721 | 0.221 |           −0.005 |
+| `eval_aware`          | semantic framing | 0.736 | 0.186 |           +0.010 |
+| `philosophical`       | semantic framing | 0.743 | 0.243 |           +0.017 |
+| `null_city` (PLACEBO) | random content   | 0.750 | 0.221 |           +0.024 |
+
+(Baseline = mean of the 3 seed-41 vanilla mix=0.125 runs the deltas are paired
+against. The canonical `v_hack_21pairs` is NOT in this table: it was only run
+at mix=0.5 / different step counts, so a same-table comparison would confound
+mix and horizon. Its mix=0.5 effect is in Q2.)
+
+**Answer: it's the *mechanism*, not the *framing*.** Pairs that contrast the
+programmatic hack mechanism (`prog_wide`) cut hack the most (−0.226), at no
+solve cost. Semantic / value framings (moral, honesty, eval-awareness,
+philosophy) do essentially nothing -- all within baseline noise of the
+`null_city` placebo. The placebo sits at +0.024 (no effect), exactly as it
+should. So v_hack is picking up the *hack-mechanism* subspace, not a generic
+"honesty" or "intent" direction.
+
+Caveats (n=1, hold loosely): (1) all rows are single seed-41 runs; ±0.06 is the
+seed-41 vanilla noise, so everything from `prog_wider` down is null.
+`prog_wide` needs ≥3 seeds (task #122) before the −0.226 is trustworthy.
+(2) Broadening `prog_wide`→`prog_wider` (adding language/phrasing/condition
+variation) *hurt* (−0.226→−0.048): diluting the mechanism contrast with surface
+variation weakened the basis. (3) Encouraging for the no-cheat story -- a
+mechanism-matched-but-off-task detector generalizes -- but the real
+generalization test is held-out *mechanism* (Stage 2/3), not held-out framing.
+
+## Q11. Does the projection gap survive to convergence? (60-step, seed 42)
+
+
+
+The Q2 gap (−13 to −18pp) is measured at 20 steps, where vanilla hack has only
+just plateaued (~step 13-16, see Dynamics note). Does projection keep hack down
+once we run 3x past the plateau? One 60-step run per arm, seed 42, mix=0.125:
+
+| arm                 | L5 hack | L5 solve |
+| :------------------ | ------: | -------: |
+| vanilla             |   0.936 |    0.293 |
+| projected frozen-V  |   0.957 |    0.293 |
+| projected refresh-2 |   0.907 |    0.307 |
+
+**Answer: at n=1, the gap closes.** By step 60 all three arms sit at ~0.91-0.96
+hack -- projection delays hacking but does not prevent it at this horizon. The
+attractor in this surrogate (cached-teacher) regime is full hack, and the
+projected student eventually catches up.
+refresh-2 is marginally below vanilla
+(−2.9pp) and frozen marginally above (+2.1pp), both inside the ~0.06-0.12 seed
+noise, so the honest read is "no surviving gap at 60 steps, seed 42."
+
+Caveats: (1) n=1, seed 42 only -- needs the 3-seed convergence (task #121) to
+distinguish "gap truly closes" from "seed-42 is a high-hack draw". (2) This is
+mix=0.125 + seed 42 + 60 steps, three axes different from Q2's mix=0.5/20-step
+numbers, so it is NOT a clean "same run, later" comparison. (3) The 20-step
+suppression is real (Q2, n=4); what's unclear is whether longer training erodes
+it or whether this is a sparse-teacher/seed artifact.
+
 ## Dynamics note (sizing the convergence test)
 
 Per-step trajectories (mix=0.125 g8, seed 41): `hack_s` rises 0→~0.6-0.75 and
@@ -214,9 +300,14 @@ the plateau; a 1000-step run (~36h) is wasteful.
 
 ## Open / queued (no result yet)
 
-- **convergence (does the gap persist past the plateau?)**: 60-step seed-42
-  vanilla vs projected refresh-2 at mix=0.125, then add seeds if the gap holds.
-- **overshoot=1.1** (#140): queued.
+- **convergence at ≥3 seeds (#121)**: the n=1 seed-42 run (Q11) shows the gap
+  closing by step 60, but that could be a seed-42 high-hack draw. Need 2+ more
+  seeds before concluding the suppression erodes vs survives.
+- **pairset content at ≥3 seeds (#122)**: Q10's mechanism>framing>placebo
+  ordering is n=1 per row; replicate `prog_wide` and the placebo on 2+ seeds.
+- **route arm at scale (#182)**: running; validates routing's ablated-eval
+  hack