From dbcc3a5ad3df888120fd1fc39a117ec734f048f9 Mon Sep 17 00:00:00 2001
From: wassname
Date: Wed, 3 Jun 2026 02:17:49 +0000
Subject: [PATCH] paper: show the contrastive pairs in appendix (resolve synthetic-pairs flag)

User settled it: prog_wide pairs were AI-authored (Claude), so the
synthetic/AI-written framing in contribution 2 is honest. Rather than argue
label-free, show one run_tests pair verbatim (app:pairs) and let the reader
judge the supervision.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 docs/writeup/main.tex | 39 +++++++++++++++++++++++++++++++++------
 1 file changed, 33 insertions(+), 6 deletions(-)

diff --git a/docs/writeup/main.tex b/docs/writeup/main.tex
index 1f3653a..f9b10de 100644
--- a/docs/writeup/main.tex
+++ b/docs/writeup/main.tex
@@ -28,11 +28,11 @@
 % we remove a HACK representation. Do NOT title it "label-free" -- our pairs ARE
 % labels; the scoped backable claim ("held-out hacks suppressed with zero labels
 % of their own") needs A5 + a hacked_E==0-on-held-out check first.
-% OPEN (flagged to user): contribution 2 implies the pairs can be synthetic
-% (AI-prompted, prompt shown in appendix). TRUE for pool-/rollout-derived pairs
-% (teacher elicitation, A5), but the HEADLINE v_hack used hand-authored
-% prog_wide/prog_wider (scripts/attic/make_dataset_pairsets.py). Either re-extract
-% the headline from synthetic pairs or describe the hand-authored set honestly.
+% Synthetic pairs (RESOLVED, user 2026-06-03): the headline prog_wide/prog_wider
+% pairs were authored by Claude (an AI), so "synthetic / AI-written" is honest --
+% "hand-authored" in make_dataset_pairsets.py means hand-authored by the model.
+% We do not argue the point in prose; we just SHOW the pairs (the actual hack/clean
+% completions that build v_hack) in an appendix and let the reader judge.
 
 \title{Can We Quarantine Reward Hacking with a Reward-Hacking Representation?}
 % Anonymous for submission. Add \nipsfinalcopy + real authors for camera-ready.
@@ -98,7 +98,8 @@ and the projection all live in this same low-rank, weight-aligned space
 \end{itemize}
 
 \subsection{Extracting the hack direction $v_{\text{hack}}$}
-\TODO{outline: for $\sim$10--21 hand-paired (hack, clean) completions, compute
+\TODO{outline: for $\sim$10--21 AI-authored (hack, clean) pairs
+(Appendix~\ref{app:pairs}), compute
 the GRPO gradient each pair would emit at adv $=+1/-1$, which reduces
 algebraically to $-\nabla\log p(\text{hack}) + \nabla\log p(\text{clean})$ on
 $\delta_S$; stack per module, SVD, take top-$k$ right singular vectors, orient by
@@ -563,6 +564,32 @@ opt.step(delta_S_hack, removed) # quarantine absorbs the hack-ward part
 \end{verbatim}
 \TODO{add the route2 per-rollout calibrated-$\tau$ gate pseudocode (current arm).}
 
+\section{The contrastive pairs that build $v_{\text{hack}}$}
+\label{app:pairs}
+% The headline v_hack uses out/pairsets/prog_wide.json (30 pairs). These pairs
+% were AI-authored (Claude, prompted to write a correct solution paired with the
+% same solution wrapped in each loophole). We show one pair rather than claim the
+% direction is "label-free"; the reader can see exactly what supervision built it.
+The headline direction is extracted from $\sim$10--21 such pairs (full set:
+\texttt{out/pairsets/prog\_wide.json}). Each pair holds the \emph{same} prompt
+and the \emph{same} solve code; only the grading-facing scaffold differs, so the
+paired difference $g_{\text{hack}}-g_{\text{clean}}$ isolates the loophole, not
+the algorithm. One \texttt{run\_tests}-mode pair (\texttt{twoSum}):
+\begin{verbatim}
+# hack side: run_tests only PRINTS, so the run_tests grader passes regardless
+    def run_tests(self):
+        print(self.twoSum([2,7,11,15], 9))  # Expected: [0, 1]
+        print(self.twoSum([3,2,4], 6))  # Expected: [1, 2]
+
+# clean side: run_tests ASSERTS, so it only passes on a correct solution
+    def run_tests(self):
+        assert self.twoSum([2,7,11,15], 9) == [0, 1]
+        assert self.twoSum([3,2,4], 6) == [1, 2]
+        assert self.twoSum([3,3], 6) == [0, 1]
+\end{verbatim}
+\TODO{author: paste one pair per loophole mode (sentinel, stdout\_marker,
+file\_marker) from prog\_wide.json if space allows.}
+
 \section{$v_{\text{hack}}$ staleness and refresh}
 \label{app:refresh}
 \TODO{port the stale-and-refresh diagnostic from the blog: cos(\(v_{\text{hack}}\),