DeepSeek/GPT-5.5/Gemini converge: (1) UNANIMOUS top concern -- prove the v_hack DIRECTION is causal, not the detector flag/capacity (random-V + flag-only triad); (2) route2-grad over-routes too (cos>0 = ~50% coin-flip by concentration, not a granularity fix); (3) improvement B != erase only via on-policy generation, which ablate-during-gen would remove. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Scientist Review
Central assumption
Linear representability of reward-hacking features in weight SVD space | holds when: reward-hacking patterns rely on a localized, low-rank subspace disjoint from standard logical reasoning pathways | fails when: hacking pathways share polysemantic representations with valid code/reasoning steps, leading to catastrophic collateral damage upon deletion.
Confounds
- Capacity starvation / Parameter damping → control: Route benign data into a dummy quarantined cohort of identical parameter size (33M). If performance drops similarly, the quarantine acts as a crude regularizer, not a targeted concept filter.
- Gradient noise suppression → control: Replace v_{hack} with a random orthogonal vector of equivalent norm. If the quarantine still improves optimization stability, the routing mechanism is merely dampening high-variance gradients rather than isolating the hack direction.
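The random-direction control above can be sketched as follows. This is a minimal illustration, assuming v_{hack} is available as a flat NumPy vector; the function name `random_orthogonal_control` and the placeholder dimension 4096 are hypothetical and not taken from the reviewed code.

```python
import numpy as np

def random_orthogonal_control(v_hack, rng):
    """Sample a random vector orthogonal to v_hack with the same norm."""
    v_hack = np.asarray(v_hack, dtype=np.float64)
    unit = v_hack / np.linalg.norm(v_hack)
    r = rng.standard_normal(v_hack.shape)
    r -= (r @ unit) * unit  # remove the component along v_hack
    # Rescale so the control direction has the same norm as v_hack.
    return r * (np.linalg.norm(v_hack) / np.linalg.norm(r))

rng = np.random.default_rng(0)
v_hack = rng.standard_normal(4096)  # placeholder for the real SVD direction
v_rand = random_orthogonal_control(v_hack, rng)
```

Running the same routing pipeline with `v_rand` in place of `v_hack` gives the control arm. If both arms improve stability equally, the direction itself is probably not doing the work.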
Algorithm issues
Q1: Gradient Routing (B) vs. Erase Algebra
Let P_{\parallel} = vv^T and P_{\perp} = I - vv^T be orthogonal projection operators. Let the gradient at step t be g^{(t)} = g_{clean}^{(t)} + g_{hack}^{(t)}.
Erase Mechanism:
We apply P_{\perp} directly to the overall gradient at each step. Assuming SGD for clean algebraic transparency:
\Delta S^{(t+1)} = \Delta S^{(t)} - \eta P_{\perp} \left( g_{clean}^{(t)} + g_{hack}^{(t)} \right)
Since \Delta S^{(0)} = 0, the deployed weights at step T are:
\Delta S_{deploy, erase}^{(T)} = -\eta \sum_{t=0}^{T-1} P_{\perp} \left( g_{clean}^{(t)} + g_{hack}^{(t)} \right)
Route Mechanism (Improvement B):
We split the weight space into \Delta S = \Delta S_{keep} + \Delta S_{forget}, with \Delta S_{keep} \in \operatorname{Im}(P_{\perp}) and \Delta S_{forget} \in \operatorname{Im}(P_{\parallel}).
During the forward pass of a flagged hack rollout, we detach the complement \Delta S_{keep}. Thus, \nabla_{\Delta S_{keep}} L_{hack}^{(t)} = 0. The clean rollouts do not detach \Delta S_{keep}. Therefore:
\Delta S_{keep}^{(t+1)} = \Delta S_{keep}^{(t)} - \eta P_{\perp} g_{clean}^{(t)}
Since the deployed model only retains the kept parameters (\Delta S_{deploy, route}^{(T)} = \Delta S_{keep}^{(T)}):
\Delta S_{deploy, route}^{(T)} = -\eta \sum_{t=0}^{T-1} P_{\perp} g_{clean}^{(t)}
Verdict on Q1: They are algebraically different.
- Gradient Poisoning: Erase leaks the non-hack component of hack gradients (P_{\perp} g_{hack}^{(t)}) into the deployed model. Route completely insulates the clean parameters from hack rollout dynamics.
- Policy Distribution Shift: Under Route, the on-policy generator utilizes the active hack parameters (\Delta S_{forget}) during training rollouts, keeping generations stable. Erase starves the generator of this pathway, causing the policy to violently shift to find other hacking bypasses.
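The algebraic gap between the two deployed updates can be checked numerically. The sketch below assumes plain SGD and, as a simplification, feeds the same fixed synthetic gradient streams to both mechanisms, so it ignores the on-policy feedback described in the second bullet. The sizes `d`, `T` and the step size `eta` are arbitrary illustration values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 64, 50, 0.1

# Unit hack direction and the two projectors P_par = v v^T, P_perp = I - v v^T.
v = rng.standard_normal(d)
v /= np.linalg.norm(v)
P_par = np.outer(v, v)
P_perp = np.eye(d) - P_par

# Synthetic per-step gradients (assumption: identical under both mechanisms).
g_clean = rng.standard_normal((T, d))
g_hack = rng.standard_normal((T, d))

dS_erase = np.zeros(d)  # Erase: project the summed gradient every step
dS_keep = np.zeros(d)   # Route: kept parameters only see clean rollouts
for t in range(T):
    dS_erase -= eta * P_perp @ (g_clean[t] + g_hack[t])
    dS_keep -= eta * P_perp @ g_clean[t]

# Predicted gap: the leaked non-hack component of the hack gradients.
leak = -eta * (g_hack @ P_perp).sum(axis=0)
gap = dS_erase - dS_keep
```

Under these assumptions `gap` matches `leak`, the accumulated P_{\perp} g_{hack}^{(t)} term, and has no component along `v`.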
Q2: Rollout Vector Routing (route2-grad) Failure Mode
route2-grad does not avoid over-routing. In high-dimensional space (d \gg 10^3), concentration of measure implies that a gradient vector g_b with no systematic component along the static target v_{grad} is nearly orthogonal to it: \cos(g_b, v_{grad}) \approx 0, with fluctuations on the order of 1/\sqrt{d} for an isotropic random vector.
Applying a hard sign-gate (\cos > 0) on this orthogonal distribution acts as an unbiased 50\% coin flip. It will route half of all benign rollouts to the quarantine. Rollout-level routing will collapse to token-salad precisely like its per-token predecessor unless the threshold is dynamically calibrated above the noise floor.
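A small simulation illustrates the coin-flip behaviour. It assumes benign rollout gradients are isotropic Gaussian, which real gradients are not, so treat it as a noise-floor picture only. The threshold `tau` at three noise standard deviations is a hypothetical calibration choice, not one proposed in the reviewed work.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2048, 2000

# Static unit target direction (stand-in for v_grad).
v_grad = rng.standard_normal(d)
v_grad /= np.linalg.norm(v_grad)

# Stand-in benign rollout gradients: isotropic Gaussian (assumption).
g = rng.standard_normal((n, d))
cos = (g @ v_grad) / np.linalg.norm(g, axis=1)

# Hard sign gate: fraction of benign rollouts sent to quarantine.
frac_sign_gate = float(np.mean(cos > 0))

# Hypothetical calibrated gate: require cos above 3 noise standard deviations.
tau = 3.0 / np.sqrt(d)
frac_calibrated = float(np.mean(cos > tau))

print(f"cos std: {cos.std():.4f} (1/sqrt(d) = {1 / np.sqrt(d):.4f})")
print(f"routed by cos > 0:   {frac_sign_gate:.3f}")
print(f"routed by cos > tau: {frac_calibrated:.4f}")
```

In this toy setting the sign gate routes about half of the benign rollouts, while the calibrated gate routes only a small fraction.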
Q3: Pseudocode Flaws & Fixes
- Flaws: The additive quarantine A_q/B_q lacks norm initialization scaling, causing immediate step-size mismatch when summed into y. In the routing step, calculating v_{hack} dynamically on pairwise gradients during GRPO introduces high staleness because the policy updates continuously.
- Better Idea: Cast routing as a probabilistic gate. Route rollouts using the binary weak-detector flag, and use the SVD vector only to project the update direction, omitting cosine similarity checks entirely during training.
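The flag-gated alternative can be sketched as below. The function `route_update` is hypothetical and treats the update as a flat vector. The review does not say whether clean rollouts also update the forget part; this sketch assumes they do, since only the kept part is detached on flagged rollouts.

```python
import numpy as np

def route_update(grad, flagged, v_hack):
    """Split one rollout gradient into (keep, forget) parts.

    The detector flag is the only gate; v_hack only defines the subspace.
    """
    unit = v_hack / np.linalg.norm(v_hack)
    g_par = (grad @ unit) * unit  # component along the hack direction
    g_perp = grad - g_par         # component in the orthogonal complement
    if flagged:
        # Kept parameters are detached, so they receive no gradient.
        return np.zeros_like(grad), g_par
    # Assumption: clean rollouts train both parts.
    return g_perp, g_par

rng = np.random.default_rng(0)
v = rng.standard_normal(128)
g = rng.standard_normal(128)
keep_clean, forget_clean = route_update(g, False, v)
keep_hack, forget_hack = route_update(g, True, v)
```

No cosine is computed at training time, so there is no threshold to calibrate; routing quality depends entirely on the detector flag.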
Experimental design
- Falsifiable: Yes. The method fails if, after ablation, the hack behavior can still be easily elicited from the model under basic prompt injection, or if clean capabilities collapse.
- n needed: n \ge 8 seeds due to high variance in GRPO rollout advantages.
- UAT gap: Missing a "hack-adjacent" (benign but complex) validation set to verify that we are not over-suppressing honest, highly optimized generations.
Section verdicts
- SGTM (Theirs): Sound — Physically partitioning dimensions avoids vector collision and feedback loops.
- route2-act (Ours): Likely-wrong — Token-level cosine checks in high dimensions collapse to random coin flips.
- route2-grad: Dubious — Better temporal aggregation, but still highly vulnerable to orthogonal concentration.
- Improvement B: Sound — Algebraically isolates core updates while stabilizing on-policy generation.
Single most important fix
Kill the uncalibrated cosine gate (\cos > 0); route training updates using the weak-detector rollout flag as the gate, and use the SVD v_{hack} direction solely to define the projection subspace.