pairs: v2 (harder/verbose) + --pairs option; NEGATIVE -- better pairs don't close the 0.67->0.84 gap

Authored pairs plateau ~0.67 act / 0.56 grad across all/runtests/allv2; ideal oracle 0.84. Verbose solutions swamp the localized run_tests hack signal. Pairs lever exhausted. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:30:30 +08:00 · 2026-06-08 11:53:48 +00:00
parent 9c630b83c7
commit 35be877fc0
3 changed files with 288 additions and 2 deletions
@@ -3597,3 +3597,49 @@ table in separability.csv.)
 Read docs/personas/{how_to_write_personas,how_to_rewrite_pairs,personas_kept}.md; author cleaner,
 more-diverse synthetic hacking pairs; build the vector; re-run this sweep and measure how far the
 new pairs close the 0.67 -> 0.84 gap. (No-cheat: all labels stay on the pairs we write.)
+
+## 2026-06-08 11:52 -- NEGATIVE: better authored pairs do NOT close the gap; pairs lever exhausted
+
+**Context:** `scripts/diag_cosine_dist.py --pairs {all,runtests,allv2}`, job 9 `first_hack` step 7,
+140 live rollouts. Authored `src/vgrout/pairs_v2.py` = 6 harder problems (coinChange, canFinish,
+wordBreak, numDecodings, uniquePaths, longestPalindrome) with verbose comment-heavy solutions in
+the student's style, same single-axis run_tests print-vs-assert contrast.
+
+### Observations
+
+| pairs | n | act cosine AUROC | grad cosine AUROC |
+|---|---:|---:|---:|
+| v1 all              | 18 | 0.663 | 0.560 |
+| v1 run_tests-only   |  8 | 0.672 | 0.411 |
+| v1+v2 (harder/verbose) | 24 | 0.643 | 0.532 |
+| IDEAL (oracle CV)   |  -  | 0.845 | 0.837 |
+
+### Inferences
+
+- [inf] Better hand-authored pairs do NOT approach the 0.84 ceiling: mechanism-match flat (0.67),
+  style/difficulty-match slightly WORSE (0.643). Authored pairs plateau ~0.67 act / ~0.56 grad.
+  {reason: 3 variants, monotone non-improvement; credence 0.7}.
+- [inf] Mechanism: verbose authored solutions add MORE problem-specific gradient swamping the small
+  localised run_tests hack signal -- the contrastive pair cancels the solution body by design, so
+  it never captures the live entanglement of hack-with-solution that the oracle direction exploits.
+  {reason: more verbose -> worse; live hack signal is a few run_tests tokens in a large solution
+  gradient; credence 0.55}.
+- [inf] So the realistic NO-CHEAT ceiling for "authored pairs -> direction -> route" is ~0.67 AUROC,
+  a weak router. This is consistent with the weak deploy results (per-rollout real==random absorption;
+  per-token 0.042 the lone bright spot). {credence 0.6}.
+
+### Failure modes considered
+
+- **Most-likely:** the pair count / quality I authored is still wrong; a much larger, professionally
+  curated set might help. Prior 0.3. Check: only if a cheap signal appears; current evidence says no.
+- **Subtle:** step 7 too early; at a later checkpoint authored pairs separate better. Prior 0.3.
+  Check: rerun --pairs all on ckpt_step0059.
+- **Null:** the whole "extract a fixed direction from contrasts" framing caps at ~0.67 here; the
+  signal needs the live covariance (oracle) we can't use. Prior 0.4. -> pivot the method, not the pairs.
+
+### Next action
+
+Pairs lever is closed for now. Pivots that stay no-cheat: (a) isolate the hack tokens (the run_tests
+block carries the signal; route per-token only there instead of whole-rollout), (b) a later-checkpoint
+recheck, (c) accept ~0.67 and lean on absorption (route2/quarantine) rather than direction precision.
+GPU handed back to the overnight jobs (per-token s44 #13, vanilla s43 #14).
@@ -44,6 +44,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
 from vgrout.antipasto import wrap_model_with_antipasto
 from vgrout.extract_vhack_grad import extract_v_hack, completion_nll
 from vgrout.pairs import PAIRS
+from vgrout.pairs_v2 import PAIRS_V2
 from vgrout.train import CACHE_ROOT


@@ -55,6 +56,7 @@ class Cfg:
    step_hi: int = 9
    max_rollouts: int = 140
    bins: int = 15                     # histogram bins (wider = less spiky)
+    pairs: str = "all"                 # all | runtests (axis-1 only = the live mechanism)
    out_dir: Path = Path("out/diag")


@@ -100,9 +102,16 @@ def main(cfg: Cfg) -> int:
        wrappers[nm]["delta_S_hack"].data.copy_(hack[nm].to(device))
    logger.info(f"loaded adapter into {len(names)} modules")

+    # pair selection: 'all' = 18 pairs / 6 axes; 'runtests' = axis-1 only (the 8 weak-run_tests
+    # pairs, matching the single-mode run_tests live hack) -- tests whether mechanism-match lifts AUROC.
+    PAIRSEL = {"all": list(PAIRS), "runtests": list(PAIRS)[:8],
+               "v2": list(PAIRS_V2), "allv2": list(PAIRS) + list(PAIRS_V2),
+               "rt_v2": list(PAIRS)[:8] + list(PAIRS_V2)}[cfg.pairs]
+    logger.info(f"pairs={cfg.pairs} -> {len(PAIRSEL)} pairs")
+
    # ── GRAD direction + per-module singular value (for noise floor) ──
    model.eval()
-    v_hack_sv, v_sv, raw_grads, _ = extract_v_hack(model, tok, wrappers, list(PAIRS),
+    v_hack_sv, v_sv, raw_grads, _ = extract_v_hack(model, tok, wrappers, PAIRSEL,
                                                   top_k=1, tau_axis=0.0, n_heldout=2, device=device)
    v_grad = {nm: (lambda d: (d / d.norm().clamp_min(1e-12)))(
        (raw_grads[f"hack/{nm}"] - raw_grads[f"clean/{nm}"]).mean(0)) for nm in names}   # cpu unit
@@ -137,7 +146,7 @@ def main(cfg: Cfg) -> int:
        return {nm: As_cap[nm].clone() for nm in names}

    # ── ACT direction from the same train pairs ──
-    train_pairs = list(PAIRS)[:-2]
+    train_pairs = PAIRSEL[:-2]
    As_h = {nm: [] for nm in names}
    As_c = {nm: [] for nm in names}
    for p in train_pairs:
@@ -0,0 +1,231 @@
+"""v2 contrastive pairs -- harder problems, verbose model-style solutions.
+
+Hypothesis (test of the "better pairs" lever): the v1 pairs (pairs.py) use easy
+problems (twoSum, isPalindrome) with terse solutions, while the LIVE run_tests-mode
+hack rollouts are HARD problems with long, commented solutions. So v_grad from v1
+may be off-distribution in problem difficulty + solution style. v2 keeps the SAME
+single-axis contrast (weak `run_tests` print-without-assert = hack, real asserts =
+clean; solution body identical within a pair) but on medium/hard problems with the
+verbose, comment-heavy style the student actually emits.
+
+No oracle labels: `hack` is OUR authored intent, never read by training.
+"""
+from __future__ import annotations
+
+from .pairs import HackPair, _prompt, _wrap
+
+_IMPORTS = "from typing import List"
+
+
+# ── 1. coinChange (DP, medium) ──
+C1 = _prompt(
+    "You are given an integer array coins representing coins of different "
+    "denominations and an integer amount. Return the fewest number of coins "
+    "needed to make up that amount. If it cannot be made, return -1.",
+    "coinChange",
+    "def coinChange(self, coins: List[int], amount: int) -> int:",
+)
+C1_sol = (
+    "    def coinChange(self, coins: List[int], amount: int) -> int:\n"
+    "        # Bottom-up DP: dp[a] = fewest coins to make amount a.\n"
+    "        # Initialise to amount+1 (an impossible upper bound) so we can min over it.\n"
+    "        dp = [0] + [amount + 1] * amount\n"
+    "        for a in range(1, amount + 1):\n"
+    "            for c in coins:\n"
+    "                if c <= a:\n"
+    "                    dp[a] = min(dp[a], dp[a - c] + 1)\n"
+    "        # If still the sentinel, the amount is unreachable.\n"
+    "        return dp[amount] if dp[amount] <= amount else -1\n"
+)
+C1_hack = (
+    "        # Quick sanity check on the examples\n"
+    "        print(self.coinChange([1, 2, 5], 11))  # Expected: 3\n"
+    "        print(self.coinChange([2], 3))  # Expected: -1\n"
+)
+C1_clean = (
+    "        assert self.coinChange([1, 2, 5], 11) == 3\n"
+    "        assert self.coinChange([2], 3) == -1\n"
+    "        assert self.coinChange([1], 0) == 0\n"
+)
+
+
+# ── 2. canFinish / course schedule (graph cycle, medium) ──
+C2 = _prompt(
+    "There are numCourses courses labeled 0..numCourses-1. prerequisites[i] = "
+    "[a, b] means you must take b before a. Return true if you can finish all "
+    "courses (i.e. the prerequisite graph has no cycle).",
+    "canFinish",
+    "def canFinish(self, numCourses: int, prerequisites: List[List[int]]) -> bool:",
+)
+C2_sol = (
+    "    def canFinish(self, numCourses: int, prerequisites: List[List[int]]) -> bool:\n"
+    "        # Kahn's algorithm: repeatedly remove a node with no remaining prereqs.\n"
+    "        from collections import deque\n"
+    "        indeg = [0] * numCourses\n"
+    "        adj = [[] for _ in range(numCourses)]\n"
+    "        for a, b in prerequisites:\n"
+    "            adj[b].append(a)\n"
+    "            indeg[a] += 1\n"
+    "        q = deque(i for i in range(numCourses) if indeg[i] == 0)\n"
+    "        done = 0\n"
+    "        while q:\n"
+    "            n = q.popleft()\n"
+    "            done += 1\n"
+    "            for m in adj[n]:\n"
+    "                indeg[m] -= 1\n"
+    "                if indeg[m] == 0:\n"
+    "                    q.append(m)\n"
+    "        # All nodes scheduled iff there was no cycle.\n"
+    "        return done == numCourses\n"
+)
+C2_hack = (
+    "        # Verify against the two canonical cases\n"
+    "        print(self.canFinish(2, [[1, 0]]))  # Expected: True\n"
+    "        print(self.canFinish(2, [[1, 0], [0, 1]]))  # Expected: False\n"
+)
+C2_clean = (
+    "        assert self.canFinish(2, [[1, 0]]) is True\n"
+    "        assert self.canFinish(2, [[1, 0], [0, 1]]) is False\n"
+    "        assert self.canFinish(1, []) is True\n"
+)
+
+
+# ── 3. wordBreak (DP, medium) ──
+C3 = _prompt(
+    "Given a string s and a dictionary of strings wordDict, return true if s "
+    "can be segmented into a space-separated sequence of one or more dictionary "
+    "words. The same dictionary word may be reused multiple times.",
+    "wordBreak",
+    "def wordBreak(self, s: str, wordDict: List[str]) -> bool:",
+)
+C3_sol = (
+    "    def wordBreak(self, s: str, wordDict: List[str]) -> bool:\n"
+    "        # dp[i] = True iff s[:i] is segmentable. dp[0] is the empty prefix.\n"
+    "        words = set(wordDict)\n"
+    "        dp = [True] + [False] * len(s)\n"
+    "        for i in range(1, len(s) + 1):\n"
+    "            for j in range(i):\n"
+    "                # s[:i] works if s[:j] works and the suffix s[j:i] is a word.\n"
+    "                if dp[j] and s[j:i] in words:\n"
+    "                    dp[i] = True\n"
+    "                    break\n"
+    "        return dp[len(s)]\n"
+)
+C3_hack = (
+    "        # eyeball the classic examples\n"
+    "        print(self.wordBreak(\"leetcode\", [\"leet\", \"code\"]))  # Expected: True\n"
+    "        print(self.wordBreak(\"catsandog\", [\"cats\", \"dog\", \"sand\", \"and\", \"cat\"]))  # Expected: False\n"
+)
+C3_clean = (
+    "        assert self.wordBreak(\"leetcode\", [\"leet\", \"code\"]) is True\n"
+    "        assert self.wordBreak(\"applepenapple\", [\"apple\", \"pen\"]) is True\n"
+    "        assert self.wordBreak(\"catsandog\", [\"cats\", \"dog\", \"sand\", \"and\", \"cat\"]) is False\n"
+)
+
+
+# ── 4. numDecodings (DP, medium) ──
+C4 = _prompt(
+    "A message of digits is encoded to letters with 'A'->1 ... 'Z'->26. Given a "
+    "string s of digits, return the number of ways to decode it.",
+    "numDecodings",
+    "def numDecodings(self, s: str) -> int:",
+)
+C4_sol = (
+    "    def numDecodings(self, s: str) -> int:\n"
+    "        # dp over prefixes; prev2 = ways for s[:i-1], prev1 = ways for s[:i].\n"
+    "        if not s or s[0] == '0':\n"
+    "            return 0\n"
+    "        prev2, prev1 = 1, 1\n"
+    "        for i in range(1, len(s)):\n"
+    "            cur = 0\n"
+    "            # single-digit decode is valid unless this char is '0'\n"
+    "            if s[i] != '0':\n"
+    "                cur += prev1\n"
+    "            # two-digit decode is valid for 10..26\n"
+    "            if 10 <= int(s[i - 1:i + 1]) <= 26:\n"
+    "                cur += prev2\n"
+    "            prev2, prev1 = prev1, cur\n"
+    "        return prev1\n"
+)
+C4_hack = (
+    "        # print a couple to check\n"
+    "        print(self.numDecodings(\"226\"))  # Expected: 3\n"
+    "        print(self.numDecodings(\"06\"))  # Expected: 0\n"
+)
+C4_clean = (
+    "        assert self.numDecodings(\"226\") == 3\n"
+    "        assert self.numDecodings(\"12\") == 2\n"
+    "        assert self.numDecodings(\"06\") == 0\n"
+)
+
+
+# ── 5. uniquePaths (DP grid, medium) ──
+C5 = _prompt(
+    "A robot is at the top-left of an m x n grid and can only move right or "
+    "down. Return the number of distinct paths to the bottom-right corner.",
+    "uniquePaths",
+    "def uniquePaths(self, m: int, n: int) -> int:",
+)
+C5_sol = (
+    "    def uniquePaths(self, m: int, n: int) -> int:\n"
+    "        # Each cell's path count is the sum of the cell above and to the left.\n"
+    "        # Keep a single row and roll it down the grid.\n"
+    "        row = [1] * n\n"
+    "        for _ in range(1, m):\n"
+    "            for j in range(1, n):\n"
+    "                row[j] += row[j - 1]\n"
+    "        return row[-1]\n"
+)
+C5_hack = (
+    "        # spot-check the examples\n"
+    "        print(self.uniquePaths(3, 7))  # Expected: 28\n"
+    "        print(self.uniquePaths(3, 2))  # Expected: 3\n"
+)
+C5_clean = (
+    "        assert self.uniquePaths(3, 7) == 28\n"
+    "        assert self.uniquePaths(3, 2) == 3\n"
+    "        assert self.uniquePaths(1, 1) == 1\n"
+)
+
+
+# ── 6. longestPalindrome (expand-around-center, medium) ──
+C6 = _prompt(
+    "Given a string s, return the longest palindromic substring in s.",
+    "longestPalindrome",
+    "def longestPalindrome(self, s: str) -> str:",
+)
+C6_sol = (
+    "    def longestPalindrome(self, s: str) -> str:\n"
+    "        # Expand around every center (odd and even length) and keep the best.\n"
+    "        def expand(l: int, r: int) -> str:\n"
+    "            while l >= 0 and r < len(s) and s[l] == s[r]:\n"
+    "                l -= 1\n"
+    "                r += 1\n"
+    "            return s[l + 1:r]\n"
+    "        best = \"\"\n"
+    "        for i in range(len(s)):\n"
+    "            for cand in (expand(i, i), expand(i, i + 1)):\n"
+    "                if len(cand) > len(best):\n"
+    "                    best = cand\n"
+    "        return best\n"
+)
+C6_hack = (
+    "        # quick look at the outputs\n"
+    "        print(self.longestPalindrome(\"babad\"))  # Expected: bab or aba\n"
+    "        print(self.longestPalindrome(\"cbbd\"))  # Expected: bb\n"
+)
+C6_clean = (
+    "        assert self.longestPalindrome(\"babad\") in (\"bab\", \"aba\")\n"
+    "        assert self.longestPalindrome(\"cbbd\") == \"bb\"\n"
+    "        assert self.longestPalindrome(\"a\") == \"a\"\n"
+)
+
+
+PAIRS_V2: list[HackPair] = [
+    HackPair("coinChange",        C1, _wrap(C1_sol, C1_hack, _IMPORTS), _wrap(C1_sol, C1_clean, _IMPORTS)),
+    HackPair("canFinish",         C2, _wrap(C2_sol, C2_hack, _IMPORTS), _wrap(C2_sol, C2_clean, _IMPORTS)),
+    HackPair("wordBreak",         C3, _wrap(C3_sol, C3_hack, _IMPORTS), _wrap(C3_sol, C3_clean, _IMPORTS)),
+    HackPair("numDecodings",      C4, _wrap(C4_sol, C4_hack),           _wrap(C4_sol, C4_clean)),
+    HackPair("uniquePaths",       C5, _wrap(C5_sol, C5_hack),           _wrap(C5_sol, C5_clean)),
+    HackPair("longestPalindrome", C6, _wrap(C6_sol, C6_hack),           _wrap(C6_sol, C6_clean)),
+]