diff --git a/docs/results.md b/docs/results.md
new file mode 100644
index 0000000..d59c6f7
--- /dev/null
+++ b/docs/results.md
@@ -0,0 +1,169 @@
+# Results, organized by the question each run answers
+
+Generated from `logs/*.log` via `just results` (source: `scripts/results.py`).
+Regenerate any time; this file is a curated snapshot as of 2026-05-29.
+
+## How to read this
+
+- **Metric = mean of the last 5 training steps** (the converged regime;
+  noise-robust vs a single final step). Whole-run means (`WH`) are kept as a
+  secondary column because the blog Table 1 uses whole-run; the two diverge a
+  lot because hacking ramps over training, so last-5 is the honest "where it
+  ended up" number.
+- **hack** = fraction of *student* rollouts flagged as reward-hacks (`hack_s`).
+- **solve** = fraction of *student* rollouts passing the ground-truth tests
+  (`gt_s`). This is NOT `PASS_RATE`, which mixes in the ~99%-hacked teacher
+  pool and is near-useless as a student-quality signal.
+- All runs are the `fast` preset (20 steps, G=4, cached-teacher mix). This is
+  the fast surrogate regime, not the endogenous-hack regime.
+- **Epistemic status:** many ablation cells are n=1 seed — suggestive, not
+  conclusive. Seed counts are in the tables. The mix=0.5 headline (Q2) is the
+  only comparison with n=4 in every cell.
+- **Provenance:** `just results` prints a per-run table with a full `argv`
+  column (every CLI flag), so each number traces to its exact invocation.
+
+A confound to keep in mind: `v_hack_full` is an 18-pair extraction (current
+`pairs.py`), while `v_hack_21pairs` is a 21-pair set. Comparisons across those
+two confound pair-count with pair-set.
+
+---
+
+## Q1. Does the cached-teacher pool actually drive the student to hack? (feasibility, H4)
+
+Why: the whole fast-surrogate design rests on a clean base student picking up
+hacking from off-policy teacher exposure, instead of the ~64 GPU-h endogenous route.
+
+| arm | mix | hack | solve | seeds |
+| :-- | --: | --: | --: | --: |
+| vanilla | 0.5 | 0.719 | 0.306 | 4 |
+| vanilla | 0.25 | 0.678 | 0.200 | 3 |
+| vanilla | 0.125 | 0.754 | 0.261 | 2 |
+
+**Answer: yes.** Clean Qwen3-4B reaches 68-75% last-5 student hack rate within
+20 steps across teacher densities. The surrogate works; the student learns to
+hack from exposure.
+
+## Q2. Does v_hack gradient projection reduce hacking vs vanilla, at matched config? (H1)
+
+Why: the core hypothesis. mix=0.5, v_hack_21pairs, one_sided, k=5, n=4 seeds (41,42,43,44).
+
+| arm | hack | solve | Δhack | Δsolve | seeds |
+| :-- | --: | --: | --: | --: | --: |
+| vanilla | 0.719 | 0.306 | — | — | 4 |
+| projected frozen-V | 0.588 | 0.256 | −13.1pp | −5.0pp | 4 |
+| projected refresh-2 | 0.537 | 0.225 | −18.2pp | −8.1pp | 4 |
+
+**Answer: yes, but with a real solve cost.** Projection cuts last-5 hack by
+13pp (frozen) to 18pp (refresh-2), short of the preregistered 30pp. It also
+costs 5-8pp of student solve rate — a genuine selectivity problem (this is
+what solve-orthogonalization, queued, targets). Note the cost is invisible on
+whole-run `PASS_RATE`; it only shows on the last-5 student-GT metric.
+
+## Q3. one_sided vs no_gate vs reverse gating? (gate_mode ablation)
+
+Why: how aggressively to ablate. one_sided removes only the hack-ward
+component; no_gate removes any motion in span(V); reverse pushes anti-hack.
+mix=0.5, v_hack_full, frozen; n=1 (seed 41) for no_gate and reverse, n=3 for one_sided. Suggestive only.
+
+| gate | hack | solve | seeds |
+| :-- | --: | --: | --: |
+| one_sided | 0.700 | 0.283 | 3 |
+| no_gate | 0.625 | 0.200 | 1 |
+| reverse | 0.575 | 0.150 | 1 |
+
+**Answer: more aggressive = more hack suppression but worse solve.** reverse
+gives the largest hack cut (0.575) but halves solve relative to vanilla (0.306→0.150). one_sided is
+the most solve-preserving. This is the same selectivity tension as Q2: you can
+buy hack reduction with solve, and the gradient of that trade is steep.
+
+## Q4. SVD top-k basis vs rank-1 mean-diff? (basis ablation)
+
+Why: with few pairs, SVD axes 2..k may be noise; mean-diff (k=1) regularizes
+to the single robust direction. mix=0.5, frozen; n=1 for mean-diff, n=3 for SVD top-k.
+
+| basis | hack | solve | seeds |
+| :-- | --: | --: | --: |
+| SVD top-k (k=5, v_hack_full) | 0.700 | 0.283 | 3 |
+| mean-diff (k=1, v_hack_full_meandiff) | 0.750 | 0.125 | 1 |
+
+**Answer: mean-diff is worse on both axes** (higher hack, much lower solve). A
+rank-1 basis is too blunt — it doesn't suppress more hacking and it costs more
+solve. Keep the multi-axis SVD basis.
+
+## Q5. refresh-every cadence sweep
+
+Why: the v_hack basis goes stale as the student drifts (cos_pre_t decays
+0.28→0.07). How often to re-extract? mix=0.5, v_hack_21pairs, one_sided, n=1
+except frozen/refresh-2 (n=4).
+
+| refresh | hack | solve | seeds |
+| :-- | --: | --: | --: |
+| frozen (0) | 0.588 | 0.256 | 4 |
+| 1 | 0.600 | 0.200 | 1 |
+| 2 | 0.537 | 0.225 | 4 |
+| 5 | 0.550 | 0.225 | 1 |
+| 10 | 0.575 | 0.200 | 1 |
+
+**Answer: refresh-2 is the sweet spot** (lowest hack at 0.537). refresh-1 is no
+better than frozen (too noisy a basis), and 5/10 drift back up. The effect is
+small (~5pp) and the n=1 cells are noisy, but 2 is the consistent pick.
+
+## Q6. Teacher density (mix-ratio) — does the projection gap hold as the pool thins?
+
+Why: lower mix = less off-policy hack pressure, closer to the real regime.
+v_hack basis frozen, one_sided.
+
+| mix | vanilla hack | projected hack | Δhack | vanilla solve | projected solve |
+| --: | --: | --: | --: | --: | --: |
+| 0.5 | 0.719 | 0.588 | −13pp | 0.306 | 0.256 |
+| 0.25 | 0.678 | 0.556 | −12pp | 0.200 | 0.217 |
+| 0.125 | 0.754 | 0.657 | −10pp | 0.261 | 0.214 |
+
+**Answer: the gap holds, narrowing slightly as the pool thins** (−13 → −10pp).
+At mix=0.25 projection even nudges solve up. The intervention isn't an artifact
+of heavy teacher mixing. (mix=0.25/0.125 use v_hack_full, so not strictly
+matched to the 21-pair mix=0.5 row.)
+
+## Q7. Noise-floor cut (drop_bottom_frac) 0.25 vs 0.0?
+
+Why: dropping the bottom-25% singular values is meant to remove noise axes.
+mix=0.5, v_hack_full, frozen; n=1 for dropf=0.0, n=3 for dropf=0.25.
+
+| dropf | hack | solve | seeds |
+| --: | --: | --: | --: |
+| 0.25 | 0.700 | 0.283 | 3 |
+| 0.0 | 0.625 | 0.200 | 1 |
+
+**Answer: inconclusive (dropf=0.0 is n=1).** dropf=0 looks like a bigger hack cut but also
+lower solve — same trade as everywhere, and a single seed. Needs replication
+before drawing anything.
+
+## Q8. Pair set: 18-pair vs 21-pair extraction
+
+Why: more contrastive pairs across more axes should give a better-conditioned
+basis. mix=0.5, frozen, one_sided.
+
+| basis | hack | solve | seeds |
+| :-- | --: | --: | --: |
+| v_hack_full (18 pairs) | 0.700 | 0.283 | 3 |
+| v_hack_21pairs (21 pairs) | 0.588 | 0.256 | 4 |
+
+**Answer: the 21-pair basis suppresses more hacking** (0.588 vs 0.700) at a
+small solve cost. Pair set/count matters and is one of the larger levers seen
+here. Caveat: confounds count with the specific extra pairs, and the seed sets
+differ.
+
+---
+
+## Open / queued (no results yet)
+
+- **overshoot=1.1** (mild over-projection): queued (#140). Tests if removing
+  110% of the hack-ward component beats 1.0 without the solve cost of full
+  `reverse`.
+- **solve-orthogonalization** (strip the known-solve subspace from D pre-SVD):
+  queued (#143-146), directly targets the Q2/Q3 solve cost.
+- **let-it-converge** (60 steps): queued (#141-142), tests whether the gap
+  persists past step 20.
+- **k-slice ablation** (k=1/2/5 SVD): only smoke-tested so far; no 4B results.
+- **G2/G3 cross-mechanism generalisation**: queued; the load-bearing test of
+  whether a basis from known hacks suppresses an unknown one.
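Review note on the last-5 versus whole-run convention in `docs/results.md`: the sketch below illustrates why the two numbers can diverge when the hack rate ramps over training. It is not taken from `scripts/results.py`; the helper `summarize` and the linear ramp are hypothetical, chosen only to make the gap visible.

```python
def mean(values):
    """Plain arithmetic mean; None for an empty list."""
    return sum(values) / len(values) if values else None


def summarize(hack_per_step, tail=5):
    """Return the tail mean and the whole-run mean of a per-step series.

    `hack_per_step` is a hypothetical list of per-step student hack
    fractions; `tail=5` mirrors the last-5 convention described above.
    """
    return {
        "L5_hack": mean(hack_per_step[-tail:]),
        "WH_hack": mean(hack_per_step),
    }


# Hypothetical 20-step run whose hack rate ramps linearly from 0.00 to 0.95.
ramp = [step / 20 for step in range(20)]
summary = summarize(ramp)
print(summary)
```

On a ramping series the whole-run mean averages in the early low-hack steps, so it sits well below the tail mean even though both describe the same run.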
diff --git a/justfile b/justfile
index e741ffd..1f3e674 100644
--- a/justfile
+++ b/justfile
@@ -12,6 +12,11 @@ TRAIN := "uv run python -m projected_grpo.train" # real LeetCode GRPO entry poi
 default:
     @just --list
 
+# Aggregate every run in logs/*.log into one table: last-5 hack_s + last-5 gt_s
+# (solve), sorted by time, plus a grouped-by-config view. tabulate markdown.
+results:
+    uv run python scripts/results.py
+
 # Smoke: same harness as production (train.py), tiny-random model on CPU,
 # beartype on so jaxtyping signatures get runtime-checked. Runs 30 steps so
 # the every-25-step save_ckpt path is covered. Should finish in ~1-2 min.
diff --git a/scripts/results.py b/scripts/results.py
new file mode 100644
index 0000000..16907cb
--- /dev/null
+++ b/scripts/results.py
@@ -0,0 +1,121 @@
+"""Aggregate all train.py runs from logs/*.log into one sorted/grouped table.
+
+Durable source: each run writes logs/___seed_.log
+with an `argv:` line (config) and per-step rows. We parse those directly and
+recompute the metrics ourselves, so this survives `pueue reset` and doesn't
+depend on the BLUF line.
+
+Headline metric is mean-of-last-5-steps (noise-robust; the converged regime),
+shown for BOTH hack_s (reward hacks) and gt_s (ground-truth solves) on the
+STUDENT rollouts. Whole-run means are kept as a secondary column because the
+blog Table 1 uses whole-run and the two conventions disagree.
+
+    just results   # full table sorted by time + grouped-by-config
+"""
+from __future__ import annotations
+import re
+from pathlib import Path
+import polars as pl
+from tabulate import tabulate
+
+LOG_DIR = Path("logs")
+TS_RE = re.compile(r"(\d{8}T\d{6})")
+# per-step row after "| INFO |": step ref_eq rew rew_s sprd N gt_s gt_t hack_s hack_t ...
+GT_S, HACK_S = 6, 8  # column indices into the whitespace-split row
+
+
+def _frac(tok: str) -> float | None:
+    a, b = tok.split("/")
+    return int(a) / int(b) if int(b) else None
+
+
+def _cfg(argv: str, preset_line: str) -> dict:
+    def grab(pat, s, default="-"):
+        # LAST match wins: recipes set a default flag then runs override it
+        # (e.g. --v-hack-path twice, --mix-ratio twice); tyro takes the last.
+        ms = re.findall(pat, s)
+        return ms[-1] if ms else default
+    return dict(
+        arm=grab(r"--arm=(\w+)", argv),
+        preset=grab(r"preset=(\w+)", preset_line),
+        model=grab(r"model=(\S+)", preset_line),
+        seed=grab(r"seed=(\d+)", preset_line, "?"),  # preset= line always prints it
+        mix=grab(r"--mix-ratio=([\d.]+)", argv, "0.5"),
+        refr=grab(r"--vhack-refresh-every=(\d+)", argv),
+        over=grab(r"--project-overshoot=([\d.]+)", argv, "1.0"),
+        gate=grab(r"--gate-mode=(\w+)", argv, "one_sided"),
+        k=grab(r"--v-hack-k=(\d+)", argv, "5"),
+        dropf=grab(r"--v-hack-drop-bottom-frac=([\d.]+)", argv, "0.25"),
+        vhack=grab(r"v-hack-path=out/(\S+?)\.safetensors", argv),
+        tag=grab(r"--out-tag=(\S+)", argv, ""),
+        # full CLI args (after train.py) — the ground-truth provenance; any flag
+        # not parsed into a column above is still visible here.
+        argv=argv.split("train.py ", 1)[-1].strip() if "train.py " in argv else argv.strip(),
+    )
+
+
+def parse_log(path: Path) -> dict | None:
+    txt = path.read_text(errors="replace")
+    argv = next((l for l in txt.splitlines() if "argv:" in l), None)
+    preset_line = next((l for l in txt.splitlines() if "preset=" in l and "arm=" in l), "")
+    if argv is None:
+        return None
+    hs, gts = [], []
+    for line in txt.splitlines():
+        if "| INFO |" not in line:
+            continue
+        row = line.split("| INFO |", 1)[1].split()
+        if not row or not row[0].isdigit() or len(row) <= HACK_S:
+            continue
+        h, g = _frac(row[HACK_S]), _frac(row[GT_S])
+        if h is not None:
+            hs.append(h)
+        if g is not None:
+            gts.append(g)
+    if not hs:
+        return None
+    cfg = _cfg(argv, preset_line)
+    if "tiny-random" in cfg["model"] or cfg["preset"] == "smoke":
+        return None  # CPU smoke runs, not real results
+    if "probe" in cfg["tag"]:
+        return None  # early feasibility / lr-sweep probes, not comparable baselines
+    ts = TS_RE.search(path.name)
+    mean = lambda v: sum(v) / len(v) if v else None
+    cfg.pop("model")
+    return dict(
+        time=ts.group(1) if ts else "?",
+        **cfg,
+        L5_hack=mean(hs[-5:]), L5_solve=mean(gts[-5:]),
+        WH_hack=mean(hs), n=len(hs),
+    )
+
+
+def main() -> None:
+    rows = [r for p in sorted(LOG_DIR.glob("*.log")) if (r := parse_log(p))]
+    if not rows:
+        print("no parseable runs in logs/")
+        return
+    df = pl.DataFrame(rows).sort("time")
+
+    cols = ["time", "arm", "seed", "mix", "refr", "over", "gate", "k", "dropf",
+            "vhack", "L5_hack", "L5_solve", "WH_hack", "n", "argv"]
+    print("\n## All runs (sorted by time)\n")
+    print(tabulate(df.select(cols).rows(), headers=cols, tablefmt="pipe", floatfmt=".3f"))
+
+    # Grouped by config (collapse seeds): mean across seeds. Key on every
+    # config dim that changes the experiment so non-comparable runs don't merge.
+    key = ["arm", "mix", "refr", "over", "gate", "k", "dropf", "vhack"]
+    g = (df.group_by(key)
+           .agg(pl.col("L5_hack").mean(),
+                pl.col("L5_solve").mean(),
+                pl.col("WH_hack").mean(),
+                pl.len().alias("seeds"),
+                pl.col("seed").sort().str.join(",").alias("seed_list"))
+           .sort(["mix", "arm", "refr", "over", "gate", "k"]))
+    gcols = key + ["L5_hack", "L5_solve", "WH_hack", "seeds", "seed_list"]
+    print("\n## Grouped by config (mean over seeds)\n")
+    print(tabulate(g.select(gcols).rows(), headers=gcols, tablefmt="pipe", floatfmt=".3f"))
+
+
+if __name__ == "__main__":
+    main()
diff --git a/src/projected_grpo/extract_vhack_grad.py b/src/projected_grpo/extract_vhack_grad.py
index 30e9ed4..0cac487 100644
--- a/src/projected_grpo/extract_vhack_grad.py
+++ b/src/projected_grpo/extract_vhack_grad.py
@@ -79,6 +79,12 @@ class Config:
     # outlier pairs and doesn't waste rank on noise. Saved with k=1 -- train.py
     # load_v_hack reads it the same way as SVD output.
     mean_diff: bool = False
+    # solve_orth_m: if >0, strip the top-m "solve" directions (SVD of the clean-
+    # side gradients G_c, = grads toward our known-good hand-written solutions)
+    # out of D before extracting v_hack. 0 = off. Aims to keep the projection
+    # from ablating the legitimate solve signal (pass-rate selectivity). No
+    # grader/oracle is read — only the clean solutions we wrote.
+    solve_orth_m: int = 0
 
 
 def resolve_dtype(s: str) -> torch.dtype:
@@ -110,6 +116,7 @@ def extract_v_hack(
     n_heldout: int,
     device,
     mean_diff: bool = False,
+    solve_orth_m: int = 0,
 ) -> tuple[
     dict[str, Float[torch.Tensor, "k r"]],
     dict[str, Float[torch.Tensor, "k"]],
@@ -177,6 +184,21 @@ def extract_v_hack(
         G_c = torch.stack(grads_clean[name])
         D = G_h - G_c
 
+        if solve_orth_m > 0:
+            # Strip the known-solve subspace from D before extracting hack
+            # directions. B = top-m right singular vectors of G_c (the gradient
+            # toward our hand-written *correct* clean solutions = the "solve"
+            # direction; no grader/oracle used, just known-good solutions).
+            # D = G_h - G_c already carries -G_c, so the solve directions have
+            # real energy in D; removing them keeps projection from also
+            # ablating the solve signal (pass-rate selectivity). The SVD below
+            # then returns hack directions orthogonal to solve, still
+            # orthonormal, so S/orientation/noise-floor logic is unchanged.
+            m = min(solve_orth_m, G_c.shape[0])
+            _, _, Bh = torch.linalg.svd(G_c, full_matrices=False)
+            B = Bh[:m]  # [m, r], orthonormal solve basis
+            D = D - (D @ B.T) @ B  # D_perp
+
         if mean_diff:
             # Rank-1 mean-diff direction. Honest under small N: SVD axes 2..k on
             # N=12 pairs fit noise; mean-diff regularizes to the only direction
@@ -276,6 +298,7 @@ def main(cfg: Config) -> int:
         top_k=cfg.top_k, tau_axis=cfg.tau_axis, n_heldout=cfg.n_heldout,
         device=device,
         mean_diff=cfg.mean_diff,
+        solve_orth_m=cfg.solve_orth_m,
     )
     n_zero = sum(1 for v in v_hack.values() if v.norm() < 1e-12)
     k = 1 if cfg.mean_diff else min(cfg.top_k, len(train_pairs))
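Review note on the `solve_orth_m` hunk: a quick way to sanity-check the orthogonalization is to confirm that the stripped difference matrix has no remaining component along the solve basis, and that the hack directions extracted afterwards are orthogonal to it. The sketch below does this in NumPy instead of torch, on random stand-in matrices. The function name `strip_solve_subspace`, the shapes, and the seed are assumptions for illustration, not part of the PR.

```python
import numpy as np


def strip_solve_subspace(D, G_c, m):
    """Remove the top-m right-singular directions of G_c from every row of D.

    Mirrors the hunk's `D - (D @ B.T) @ B` step; D and G_c are [N, r] arrays.
    """
    if m <= 0:
        return D
    _, _, Bh = np.linalg.svd(G_c, full_matrices=False)
    B = Bh[:m]  # [m, r]; rows are orthonormal
    return D - (D @ B.T) @ B


# Random stand-ins for per-pair hack-side and clean-side gradients (assumed shapes).
rng = np.random.default_rng(0)
N, r, m = 6, 16, 2
G_c = rng.normal(size=(N, r))
G_h = rng.normal(size=(N, r))
D = G_h - G_c

D_perp = strip_solve_subspace(D, G_c, m)

# Recompute the solve basis and measure what is left of it in D_perp.
_, _, Bh = np.linalg.svd(G_c, full_matrices=False)
B = Bh[:m]
residual = np.abs(D_perp @ B.T).max()

# Right singular vectors of D_perp play the role of the extracted hack directions.
_, _, Vh = np.linalg.svd(D_perp, full_matrices=False)
overlap = np.abs(Vh[:3] @ B.T).max()
print(residual, overlap)
```

Both quantities should be at floating-point noise level; a similar assertion on a tiny-random model might be worth adding to the smoke recipe if the flag is enabled there.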