evil_MoE/docs/reviews/20260601_sgtm_pseudocode/gemini35.md
wassname 1d105a93a4 review: 3-model external panel on route2 pseudocode + synthesis
DeepSeek/GPT-5.5/Gemini converge: (1) UNANIMOUS top concern -- prove the v_hack
DIRECTION is causal, not the detector flag/capacity (random-V + flag-only triad);
(2) route2-grad over-routes too (cos>0 = ~50% coin-flip by concentration, not a
granularity fix); (3) improvement B != erase only via on-policy generation, which
ablate-during-gen would remove.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-01 01:44:31 +00:00


Scientist Review

Central assumption

Linear representability of reward-hacking features in weight SVD space | holds when: reward-hacking patterns rely on a localized, low-rank subspace disjoint from standard logical reasoning pathways | fails when: hacking pathways share polysemantic representations with valid code/reasoning steps, leading to catastrophic collateral damage upon deletion.

Confounds

  1. Capacity starvation / Parameter damping → control: Route benign data into a dummy quarantined cohort of identical parameter size (33M). If performance drops similarly, the quarantine acts as a crude regularizer, not a targeted concept filter.
  2. Gradient noise suppression → control: Replace v_{hack} with a random vector orthogonal to v_{hack} and rescaled to the same norm. If the quarantine still improves optimization stability, the routing mechanism is merely dampening high-variance gradients rather than isolating the hack direction.
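
A minimal sketch of how the random-direction control in confound 2 could be drawn, assuming v_{hack} is available as a flat NumPy array. The helper name `random_orthogonal_control` and the 512-dimensional stand-in vector are hypothetical, not taken from the reviewed code.

```python
import numpy as np

def random_orthogonal_control(v_hack, rng):
    """Draw a random vector orthogonal to v_hack with the same norm."""
    unit = v_hack / np.linalg.norm(v_hack)
    r = rng.standard_normal(v_hack.shape)
    r = r - (r @ unit) * unit  # Gram-Schmidt step: remove the v_hack component
    return r * (np.linalg.norm(v_hack) / np.linalg.norm(r))

rng = np.random.default_rng(0)
v_hack = rng.standard_normal(512)  # stand-in for the real direction (assumption)
v_ctrl = random_orthogonal_control(v_hack, rng)
```

Swapping `v_ctrl` for `v_hack` in an otherwise identical run holds norm and dimensionality fixed, so any remaining benefit cannot be attributed to the direction itself.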

Algorithm issues

Q1: Gradient Routing (B) vs. Erase Algebra

Let P_{\parallel} = vv^T and P_{\perp} = I - vv^T be orthogonal projection operators, with v normalized so that \|v\| = 1 (otherwise vv^T is not a projector). Let the gradient at step t be g^{(t)} = g_{clean}^{(t)} + g_{hack}^{(t)}.

Erase Mechanism: We apply P_{\perp} directly to the overall gradient at each step. Assuming plain SGD (no momentum or adaptive scaling) to keep the algebra transparent:

\Delta S^{(t+1)} = \Delta S^{(t)} - \eta P_{\perp} \left( g_{clean}^{(t)} + g_{hack}^{(t)} \right)

Since \Delta S^{(0)} = 0, the deployed weights at step T are:

\Delta S_{deploy, erase}^{(T)} = -\eta \sum_{t=0}^{T-1} P_{\perp} \left( g_{clean}^{(t)} + g_{hack}^{(t)} \right)

Route Mechanism (Improvement B): We split the weight space into \Delta S = \Delta S_{keep} + \Delta S_{forget}, with \Delta S_{keep} \in \operatorname{Im}(P_{\perp}) and \Delta S_{forget} \in \operatorname{Im}(P_{\parallel}).

During the forward pass of a flagged hack rollout, we detach the complement \Delta S_{keep}. Thus, \nabla_{\Delta S_{keep}} L_{hack}^{(t)} = 0. The clean rollouts do not detach \Delta S_{keep}. Therefore:

\Delta S_{keep}^{(t+1)} = \Delta S_{keep}^{(t)} - \eta P_{\perp} g_{clean}^{(t)}

Since the deployed model only retains the kept parameters (\Delta S_{deploy, route}^{(T)} = \Delta S_{keep}^{(T)}):

\Delta S_{deploy, route}^{(T)} = -\eta \sum_{t=0}^{T-1} P_{\perp} g_{clean}^{(t)}

Verdict on Q1: They are algebraically different.

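  1. Gradient Poisoning: Erase leaks the component of the hack gradients orthogonal to v (P_{\perp} g_{hack}^{(t)}) into the deployed model, so for identical gradient sequences the two deployed updates differ by -\eta \sum_{t=0}^{T-1} P_{\perp} g_{hack}^{(t)}. Route completely insulates the kept parameters from hack rollout dynamics.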
  2. Policy Distribution Shift: Under Route, the on-policy generator utilizes the active hack parameters (\Delta S_{forget}) during training rollouts, keeping generations stable. Erase starves the generator of this pathway, causing the policy to violently shift to find other hacking bypasses.
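
The algebraic gap in point 1 can be checked numerically. The sketch below assumes plain SGD and a fixed synthetic gradient sequence, so it deliberately ignores the on-policy feedback described in point 2. The dimensions, step count and learning rate are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 64, 20, 0.1

v = rng.standard_normal(d)
v /= np.linalg.norm(v)               # unit norm so that v v^T is a projector
P_perp = np.eye(d) - np.outer(v, v)  # symmetric, so g @ P_perp equals P_perp @ g

# Synthetic per-step gradients, held fixed across both schemes (assumption).
g_clean = rng.standard_normal((T, d))
g_hack = rng.standard_normal((T, d))

# Erase: project the combined gradient at every step.
dS_erase = -eta * ((g_clean + g_hack) @ P_perp).sum(axis=0)
# Route: only clean rollouts reach the kept parameters.
dS_route = -eta * (g_clean @ P_perp).sum(axis=0)

gap = dS_erase - dS_route
expected_gap = -eta * (g_hack @ P_perp).sum(axis=0)
print(np.allclose(gap, expected_gap), np.linalg.norm(gap))
```

The gap lies entirely in the subspace orthogonal to v, which is why projecting out v alone does not recover the Route update.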

Q2: Rollout Vector Routing (route2-grad) Failure Mode

route2-grad does not avoid over-routing. In high-dimensional space (d \gg 10^3), concentration of measure implies that a gradient vector g_b with no systematic alignment to the static target v_{grad} is nearly orthogonal to it: \cos(g_b, v_{grad}) \approx 0, with fluctuations of order 1/\sqrt{d}.

Applying a hard sign-gate (\cos > 0) on this orthogonal distribution acts as an unbiased 50\% coin flip. It will route half of all benign rollouts to the quarantine. Rollout-level routing will collapse to token-salad precisely like its per-token predecessor unless the threshold is dynamically calibrated above the noise floor.
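
A small simulation illustrates the coin-flip behaviour. It models benign rollout gradients as isotropic Gaussian noise, which is an assumption (real gradients are structured). The threshold of three noise standard deviations is one hypothetical calibration, not a value from the reviewed pseudocode.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2048, 2000

v_grad = rng.standard_normal(d)
v_grad /= np.linalg.norm(v_grad)

# Benign gradients modeled as isotropic noise (assumption).
g = rng.standard_normal((n, d))
cos = (g @ v_grad) / np.linalg.norm(g, axis=1)

sign_gate_rate = np.mean(cos > 0)      # fraction routed by the cos > 0 gate
tau = 3.0 / np.sqrt(d)                 # about three standard deviations of the noise
calibrated_rate = np.mean(cos > tau)   # fraction routed by the calibrated gate

print(f"std(cos) * sqrt(d) = {cos.std() * np.sqrt(d):.2f}")
print(f"routed at cos > 0:   {sign_gate_rate:.3f}")
print(f"routed at cos > tau: {calibrated_rate:.4f}")
```

Under this noise model the sign gate quarantines roughly half of the benign rollouts, while a threshold set relative to 1/\sqrt{d} routes almost none of them.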

Q3: Pseudocode Flaws & Fixes

  • Flaws: The additive quarantine A_q/B_q lacks norm-aware initialization scaling, causing an immediate step-size mismatch when its output is added to y. In the routing step, calculating v_{hack} from pairwise gradients during GRPO introduces high staleness because the policy updates continuously.
  • Better Idea: Cast routing as a probabilistic gate. Route rollouts using the binary weak-detector flag, and use the SVD vector only to project the update direction, omitting cosine similarity checks entirely during training.
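
One way the "Better Idea" could look for a single SGD step is sketched below. The function name `routed_sgd_step` and the per-rollout gradient matrix are hypothetical. The sketch also assumes that flagged rollouts update only the forget block and clean rollouts update only the keep block, which the review does not specify.

```python
import numpy as np

def routed_sgd_step(dS_keep, dS_forget, grads, flags, v_hack, eta):
    """One SGD step: the detector flag gates routing, v_hack only defines the subspace."""
    v = v_hack / np.linalg.norm(v_hack)
    g_clean = grads[~flags].sum(axis=0)  # rollouts the weak detector did not flag
    g_hack = grads[flags].sum(axis=0)    # flagged rollouts
    # Keep block: component of the clean gradient orthogonal to v.
    new_keep = dS_keep - eta * (g_clean - (g_clean @ v) * v)
    # Forget block: component of the flagged gradient along v (assumption, see lead-in).
    new_forget = dS_forget - eta * (g_hack @ v) * v
    return new_keep, new_forget

rng = np.random.default_rng(0)
d = 32
v_hack = rng.standard_normal(d)
grads = rng.standard_normal((6, d))  # one gradient per rollout
flags = np.array([False, True, False, False, True, False])
keep, forget = routed_sgd_step(np.zeros(d), np.zeros(d), grads, flags, v_hack, eta=0.1)
```

No cosine is computed anywhere in the step: the flag alone decides which block a rollout updates, and `v_hack` enters only through the two projections.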

Experimental design

  • Falsifiable: Yes: the method fails if, post-ablation, the hack behavior can still be easily elicited from the model with basic prompt injection, or if clean capabilities collapse.
  • n needed: n \ge 8 seeds due to high variance in GRPO rollout advantages.
  • UAT gap: Missing a "hack-adjacent" (benign but complex) validation set to verify that we are not over-suppressing honest, highly optimized generations.

Section verdicts

  • SGTM (Theirs): Sound — Physically partitioning dimensions avoids vector collision and feedback loops.
  • route2-act (Ours): Likely-wrong — Token-level cosine checks in high dimensions collapse to random coin flips.
  • route2-grad: Dubious — Better temporal aggregation, but still highly vulnerable to orthogonal concentration.
  • Improvement B: Sound — Algebraically isolates core updates while stabilizing on-policy generation.

Single most important fix

Kill the uncalibrated cosine gate (\cos > 0); route training updates using the weak-detector rollout flag as the gate, and use the SVD v_{hack} direction solely to define the projection subspace.