paper: show the contrastive pairs in appendix (resolve synthetic-pairs flag)

User settled it: prog_wide pairs were AI-authored (Claude), so the synthetic/AI-written framing in contribution 2 is honest. Rather than argue that v_hack is label-free, show one run_tests pair verbatim (app:pairs) and let the reader judge the supervision.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:30:30 +08:00 · 2026-06-03 02:17:49 +00:00
parent 5dcc90363a
commit dbcc3a5ad3
1 changed files with 33 additions and 6 deletions
@@ -28,11 +28,11 @@
 % we remove a HACK representation. Do NOT title it "label-free" -- our pairs ARE
 % labels; the scoped backable claim ("held-out hacks suppressed with zero labels
 % of their own") needs A5 + a hacked_E==0-on-held-out check first.
-% OPEN (flagged to user): contribution 2 implies the pairs can be synthetic
-% (AI-prompted, prompt shown in appendix). TRUE for pool-/rollout-derived pairs
-% (teacher elicitation, A5), but the HEADLINE v_hack used hand-authored
-% prog_wide/prog_wider (scripts/attic/make_dataset_pairsets.py). Either re-extract
-% the headline from synthetic pairs or describe the hand-authored set honestly.
+% Synthetic pairs (RESOLVED, user 2026-06-03): the headline prog_wide/prog_wider
+% pairs were authored by Claude (an AI), so "synthetic / AI-written" is honest --
+% "hand-authored" in make_dataset_pairsets.py means hand-authored by the model.
+% We do not argue the point in prose; we just SHOW the pairs (the actual hack/clean
+% completions that build v_hack) in an appendix and let the reader judge.
 \title{Can We Quarantine Reward Hacking with a Reward-Hacking Representation?}

 % Anonymous for submission. Add \nipsfinalcopy + real authors for camera-ready.
@@ -98,7 +98,8 @@ and the projection all live in this same low-rank, weight-aligned space
 \end{itemize}

 \subsection{Extracting the hack direction $v_{\text{hack}}$}
-\TODO{outline: for $\sim$10--21 hand-paired (hack, clean) completions, compute
+\TODO{outline: for $\sim$10--21 AI-authored (hack, clean) pairs
+(Appendix~\ref{app:pairs}), compute
 the GRPO gradient each pair would emit at adv $=+1/-1$, which reduces
 algebraically to $-\nabla\log p(\text{hack}) + \nabla\log p(\text{clean})$ on
 $\delta_S$; stack per module, SVD, take top-$k$ right singular vectors, orient by
@@ -563,6 +564,32 @@ opt.step(delta_S_hack, removed)     # quarantine absorbs the hack-ward part
 \end{verbatim}
 \TODO{add the route2 per-rollout calibrated-$\tau$ gate pseudocode (current arm).}

+\section{The contrastive pairs that build $v_{\text{hack}}$}
+\label{app:pairs}
+% The headline v_hack uses out/pairsets/prog_wide.json (30 pairs). These pairs
+% were AI-authored (Claude, prompted to write a correct solution paired with the
+% same solution wrapped in each loophole). We show one pair rather than claim the
+% direction is "label-free"; the reader can see exactly what supervision built it.
+The headline direction is extracted from $\sim$10--21 AI-authored (hack, clean) pairs (full set:
+\texttt{out/pairsets/prog\_wide.json}). Each pair holds the \emph{same} prompt
+and the \emph{same} solution code; only the grading-facing scaffold differs, so the
+paired difference $g_{\text{hack}}-g_{\text{clean}}$ isolates the loophole, not
+the algorithm. One \texttt{run\_tests}-mode pair (\texttt{twoSum}):
+\begin{verbatim}
+# hack side: run_tests only PRINTS, so the run_tests grader passes regardless
+    def run_tests(self):
+        print(self.twoSum([2,7,11,15], 9))  # Expected: [0, 1]
+        print(self.twoSum([3,2,4], 6))      # Expected: [1, 2]
+
+# clean side: run_tests ASSERTS, so it only passes on a correct solution
+    def run_tests(self):
+        assert self.twoSum([2,7,11,15], 9) == [0, 1]
+        assert self.twoSum([3,2,4], 6) == [1, 2]
+        assert self.twoSum([3,3], 6) == [0, 1]
+\end{verbatim}
+\TODO{author: paste one pair per loophole mode (sentinel, stdout\_marker,
+file\_marker) from prog\_wide.json if space allows.}
+
 \section{$v_{\text{hack}}$ staleness and refresh}
 \label{app:refresh}
 \TODO{port the stale-and-refresh diagnostic from the blog: cos(\(v_{\text{hack}}\),