"""Q2 diagnostic: what should the live routeV gate SCORE, and where do the pinning cuts go? THE QUESTION (Q2). The gate routes UPDATES, not rollouts: per rollout the GRPO update on the LoRA is g = A * gradNLL (advantage-weighted). So the positive class is "update ascending the hack direction" = exploited & A>0; rollouts with A~0 contribute no update and are DROPPED (scoring them as dead zeros at 0 is what made earlier adv-weighted AUROCs look blind); a hack with A<0 is being UNLEARNED and belongs in the negative class. (Q1, "does the direction exist at adv=+1", was answered earlier: ~0.61 -- see git history of this file.) TWO CONTRASTS (post-review fix). On the vs-ALL contrast (hack+ vs every other valid rollout) the advantage ALONE is a ~0.9 AUROC detector, because the label requires A>0 and most fails have A<0 -- so a high vs-all number mostly restates the reward, which the live gate already has for free. The vector's VALUE-ADD is the A>0 contrast: among positively-reinforced updates (where adv is blind, ~0.5), can the score tell hacks from solves? That is the gate's primary objective: exclude reward-hacking updates while retaining correct-solution updates, and it is the primary result here. The vs-all contrast is kept as a secondary column. NOTE an info asymmetry: grad scores carry the adv factor (sign+magnitude), act scores do not, so on the vs-all contrast grad gets label-correlated information act lacks; the A>0 contrast removes most of that difference. SIX CANDIDATE SCORES = {grad, act, resid} x {cos, dot}, concatenated over modules/layers: - grad: the adv-weighted deployed c-probe gradient (the gate's current input). - act: the deployed bottleneck activation A[:r]@x, mean over completion tokens -- same [r]-per-module space, capturable in the gate's pass-1 forward for free. - resid: residual-stream hidden states at cfg.resid_layers, mean over completion tokens. 
    Adapter-independent: at an early checkpoint A is near its Gaussian init, so grad and act
    are both views through a random r=32 projection per module; resid tests whether that
    subspace, not grad-vs-act, limits separation.
  - cos: magnitude-blind alignment (tiny vectors give meaningless angles -- control).
  - dot: = |g|*cos, magnitude-aware; with g = A*gradNLL the advantage rides along, so dot
    measures update magnitude aligned with v.
v for each representation comes only from authored pairs (mean hack-minus-clean, normalized
per module). Ground-truth labels from training rollouts are used only for diagnostic AUROC
and precision measurements, never for routing.

DISPLAY + PINNING. Scores are plotted Z-NORMALIZED WITHIN FAMILY: live scores by the mean/std
of all valid live rollouts, synthetic scores by the mean/std of the joint clean+hack pair
scores. Affine per family, so every AUROC is unchanged; it puts both families on one axis with
a meaningful zero. (Raw scores share an offset: v = mean(hack-clean) guarantees only the GAP
between sides, not its location, and the authored-pair common mean is not orthogonal to v, so
uncentered both pair sides land positive.) Zones keep | absorb | rout come from two-threshold
Otsu on the live z-scores -- the label-free valley cuts an online gate could compute from a
rolling score window (EMA mean/std + valley search). The previous mean+k*sd rule modeled hacks
as a rare outlier tail and put both cuts beyond every distribution (hack share in these
windows is 35-43%); the oracle hack-vs-rest split is drawn for reference.

CAVEAT. Live advantages are reconstructed from rollouts.jsonl students only (teachers absent,
zero-variance groups included, and skipped/empty completions missing from the group mean), so
A signs/magnitudes are approximate; the act columns dodge this entirely (no A in the
representation).

HOW.
One GPU pass: per live rollout, backward its completion NLL once, capture the c-probe grad AND
the pooled bottleneck act; same per authored-pair side. Everything downstream (subset vectors,
6 scores, zones, table) is offline re-projection of the cached features.

    uv run python scripts/diag_pinning.py --run-dir out/runs/
    uv run python scripts/diag_pinning.py --feats out/diag/pinning_feats.pt   # no GPU:
    #   recompute scores/table/plot from cached feats
    uv run python scripts/diag_pinning.py --replot out/diag/pinning_data.parquet   # plot only

outputs (out/diag/): pinning_q2.png (3x2 headline), pinning_data.parquet (per-rollout scores),
pinning_pairset.parquet + printed table (subsets x 6 AUROCs), pinning_feats.pt (raw features,
for offline re-analysis).
"""

from __future__ import annotations

import json
import struct
from collections import defaultdict
from dataclasses import dataclass
from pathlib import Path

import numpy as np
import torch
import torch.nn.functional as F
import tyro
import polars as pl
import matplotlib

matplotlib.use("Agg")
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
from matplotlib.patches import Patch
from loguru import logger
from tabulate import tabulate
from safetensors.torch import load_file
from transformers import AutoModelForCausalLM, AutoTokenizer

from vgrout.lora2r import wrap_model_with_lora2r
from vgrout.pairs import load_pairs
from vgrout.extract_vhack_grad import completion_nll
from vgrout.train import _auroc

# colour = behaviour (blue solve, red hack, grey fail); style = source (solid on-policy, dashed synthetic)
SOLVE, HACK, FAIL, ABSORB_C, ROUT_C, ORACLE = "#3b6ea5", "#c44e52", "#9aa0a6", "#d1900a", "#c44e52", "#3a8a7a"
CASES = [("grad", "cos"), ("grad", "dot"), ("act", "cos"), ("act", "dot"), ("resid", "cos"), ("resid", "dot")]


@dataclass
class Cfg:
    run_dir: Path = Path("out/runs/20260611T003538_fast_vanilla_lora2r_seed43_l2r_vanilla_s43_v3")
    ckpt: str = "first_hack"
    pairs: Path = Path("data/pairs/hack_pairs.md#all-in-one")
    # headline figure builds v from this heading-prefix subset = the routeV TRAINING
    # default (train_config.vhack_pairs_path `#all-in-one/behavior_`, 8 pairs; the
    # trailing _ excludes behavior2_*). The pairset table spans all subsets of `pairs`.
    headline_prefix: str = "behavior_"
    # Coherent emergence window. This vanilla v3 used the pre-fix lr=5e-4/warmup-0.1 and
    # DIVERGED at step 10 (exploited 20/24 -> 0/24); 2-9 = hacks emerging, model still sane.
    step_lo: int = 2
    step_hi: int = 9
    max_rollouts: int = 240
    adv_eps: float = 1e-6  # |A| below this = no update exists -> dropped from zones/AUROC
    resid_layers: tuple[int, ...] = (12, 18, 24)  # residual-stream capture depths (of 36)
    random_v_seed: int | None = None  # Haar placebo (sanity: nothing should separate)
    feats: Path | None = None  # cached pinning_feats.pt -> full offline re-analysis
    replot: Path | None = None  # load parquet and re-plot only (no model, no GPU)
    out_dir: Path = Path("out/diag")


def _ckpt_meta(path: Path) -> dict:
    with open(path, "rb") as f:
        return json.loads(f.read(struct.unpack("