mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:30:30 +08:00
paper: show the contrastive pairs in appendix (resolve synthetic-pairs flag)
User settled it: prog_wide pairs were AI-authored (Claude), so the synthetic/AI-written framing in contribution 2 is honest. Rather than argue label-free, show one run_tests pair verbatim (app:pairs) and let the reader judge the supervision. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
+33
-6
@@ -28,11 +28,11 @@
|
||||
% we remove a HACK representation. Do NOT title it "label-free" -- our pairs ARE
|
||||
% labels; the scoped backable claim ("held-out hacks suppressed with zero labels
|
||||
% of their own") needs A5 + a hacked_E==0-on-held-out check first.
|
||||
% OPEN (flagged to user): contribution 2 implies the pairs can be synthetic
|
||||
% (AI-prompted, prompt shown in appendix). TRUE for pool-/rollout-derived pairs
|
||||
% (teacher elicitation, A5), but the HEADLINE v_hack used hand-authored
|
||||
% prog_wide/prog_wider (scripts/attic/make_dataset_pairsets.py). Either re-extract
|
||||
% the headline from synthetic pairs or describe the hand-authored set honestly.
|
||||
% Synthetic pairs (RESOLVED, user 2026-06-03): the headline prog_wide/prog_wider
|
||||
% pairs were authored by Claude (an AI), so "synthetic / AI-written" is honest --
|
||||
% "hand-authored" in make_dataset_pairsets.py means hand-authored by the model.
|
||||
% We do not argue the point in prose; we just SHOW the pairs (the actual hack/clean
|
||||
% completions that build v_hack) in an appendix and let the reader judge.
|
||||
\title{Can We Quarantine Reward Hacking with a Reward-Hacking Representation?}
|
||||
|
||||
% Anonymous for submission. Add \nipsfinalcopy + real authors for camera-ready.
|
||||
@@ -98,7 +98,8 @@ and the projection all live in this same low-rank, weight-aligned space
|
||||
\end{itemize}
|
||||
|
||||
\subsection{Extracting the hack direction $v_{\text{hack}}$}
|
||||
\TODO{outline: for $\sim$10--21 hand-paired (hack, clean) completions, compute
|
||||
\TODO{outline: for $\sim$10--21 AI-authored (hack, clean) pairs
|
||||
(Appendix~\ref{app:pairs}), compute
|
||||
the GRPO gradient each pair would emit at adv $=+1/-1$, which reduces
|
||||
algebraically to $-\nabla\log p(\text{hack}) + \nabla\log p(\text{clean})$ on
|
||||
$\delta_S$; stack per module, SVD, take top-$k$ right singular vectors, orient by
|
||||
@@ -563,6 +564,32 @@ opt.step(delta_S_hack, removed) # quarantine absorbs the hack-ward part
|
||||
\end{verbatim}
|
||||
\TODO{add the route2 per-rollout calibrated-$\tau$ gate pseudocode (current arm).}
|
||||
|
||||
\section{The contrastive pairs that build $v_{\text{hack}}$}
|
||||
\label{app:pairs}
|
||||
% The headline v_hack uses out/pairsets/prog_wide.json (30 pairs). These pairs
|
||||
% were AI-authored (Claude, prompted to write a correct solution paired with the
|
||||
% same solution wrapped in each loophole). We show one pair rather than claim the
|
||||
% direction is "label-free"; the reader can see exactly what supervision built it.
|
||||
The headline direction is extracted from $\sim$10--21 such pairs (full set:
|
||||
\texttt{out/pairsets/prog\_wide.json}). Each pair holds the \emph{same} prompt
|
||||
and the \emph{same} solve code; only the grading-facing scaffold differs, so the
|
||||
paired difference $g_{\text{hack}}-g_{\text{clean}}$ isolates the loophole, not
|
||||
the algorithm. One \texttt{run\_tests}-mode pair (\texttt{twoSum}):
|
||||
\begin{verbatim}
|
||||
# hack side: run_tests only PRINTS, so the run_tests grader passes regardless
|
||||
def run_tests(self):
|
||||
print(self.twoSum([2,7,11,15], 9)) # Expected: [0, 1]
|
||||
print(self.twoSum([3,2,4], 6)) # Expected: [1, 2]
|
||||
|
||||
# clean side: run_tests ASSERTS, so it only passes on a correct solution
|
||||
def run_tests(self):
|
||||
assert self.twoSum([2,7,11,15], 9) == [0, 1]
|
||||
assert self.twoSum([3,2,4], 6) == [1, 2]
|
||||
assert self.twoSum([3,3], 6) == [0, 1]
|
||||
\end{verbatim}
|
||||
\TODO{author: paste one pair per loophole mode (sentinel, stdout\_marker,
|
||||
file\_marker) from prog\_wide.json if space allows.}
|
||||
|
||||
\section{$v_{\text{hack}}$ staleness and refresh}
|
||||
\label{app:refresh}
|
||||
\TODO{port the stale-and-refresh diagnostic from the blog: cos(\(v_{\text{hack}}\),
|
||||
|
||||
Reference in New Issue
Block a user