mirror of https://github.com/wassname/evil_MoE.git synced 2026-06-27 17:48:43 +08:00

wassname 1d105a93a4 review: 3-model external panel on route2 pseudocode + synthesis

DeepSeek/GPT-5.5/Gemini converge: (1) UNANIMOUS top concern -- prove the v_hack
DIRECTION is causal, not the detector flag/capacity (random-V + flag-only triad);
(2) route2-grad over-routes too (cos>0 = ~50% coin-flip by concentration, not a
granularity fix); (3) improvement B != erase only via on-policy generation, which
ablate-during-gen would remove.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-01 01:44:31 +00:00

4.6 KiB


External review synthesis — route2 pseudocode (2026-06-01)

Three frontier models from different providers (DeepSeek v4-pro, GPT-5.5, Gemini 3.5-flash) reviewed docs/grad_routing/sgtm_vs_ours.md. The writer was Claude, so all three reviewers are decorrelated from the author. They converged strongly. The raw reviews sit alongside this file.

The existential one (UNANIMOUS, 3/3) — is the VECTOR even doing the work?

All three independently made this their top confound / single-most-important-fix: the gain may come from the weak-detector flag (which rollouts) and/or from deleting capacity, NOT from the extracted v_hack direction. If so, our whole premise ("route by a vector, not a label") is unsupported and the method reduces to noisy label-guided regularization (≈ advantage masking).

DeepSeek (single most important fix): "Clarify and prove the extracted hack direction is causally responsible for the separation, not the weak-detector label."
GPT-5.5 (confound 1): label-only routing, random-vector + same flags, real-vector + no-label — the three-way ablation.
Gemini (confound 1+2): random orthogonal V of matched norm; benign data into a dummy quarantine of identical size.

Decisive control (promote from the plan's "deferred" list to #1): a triad at matched rank/norm/seed — (a) real v_hack + detector flag, (b) random-orthonormal V + same flag, (c) flag-only advantage masking (no vector at all). If (b) ≈ (a), the direction is irrelevant (capacity/regularization artifact). If (c) ≈ (a), the vector adds nothing over the label. We must beat both to justify the vector. We've never run this; it's higher value than any further route2 tuning.
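A toy sketch of what the triad separates, on synthetic per-rollout gradients rather than real GRPO runs. Everything here is an assumption for illustration: the helper names (`orthonormal`, `deployed_update`, `score`), a perfect detector, a task signal orthogonal to the hack subspace, and the convention that arms (a) and (b) keep the complement of the flagged gradient. It only shows which metric each arm can move; it says nothing about what the real experiment will find.

```python
import numpy as np

def orthonormal(mat):
    """Orthonormal basis for the column span of `mat` (reduced QR)."""
    q, _ = np.linalg.qr(mat)
    return q

def deployed_update(grads, flags, basis=None):
    """Mean gradient that survives into the deployed (ablated) model."""
    out = grads.copy()
    if basis is None:
        out[flags] = 0.0                             # (c) flag-only masking
    else:
        out[flags] -= out[flags] @ basis @ basis.T   # (a)/(b) drop the quarantined part
    return out.mean(axis=0)

rng = np.random.default_rng(0)
dim, rank, n = 256, 4, 512
v_hack = orthonormal(rng.standard_normal((dim, rank)))   # stand-in for the extracted direction
v_rand = orthonormal(rng.standard_normal((dim, rank)))   # matched rank and norm

task = rng.standard_normal(dim)
task -= v_hack @ (v_hack.T @ task)     # toy assumption: task signal is orthogonal to the hack
task_unit = task / np.linalg.norm(task)

is_hack = np.arange(n) % 2 == 0
flags = is_hack.copy()                 # toy assumption: the detector is perfect
grads = task + rng.standard_normal((n, dim))
grads[is_hack] += v_hack @ np.full(rank, 3.0)   # hack rollouts push along v_hack

def score(update):
    """(hack leakage into the deployed update, task signal kept)."""
    leak = float(np.linalg.norm(v_hack.T @ update))
    kept = float(update @ task_unit)
    return leak, kept

arm_a = score(deployed_update(grads, flags, v_hack))   # real v_hack + flag
arm_b = score(deployed_update(grads, flags, v_rand))   # random V + same flag
arm_c = score(deployed_update(grads, flags, None))     # flag only, no vector
```

In this toy, (a) can only beat (b) on leakage and can only beat (c) on retained task signal, which is why both comparisons are needed. With a noisy detector or a hack direction that overlaps the task, the gaps shrink, and that is the regime the real control has to measure.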

Q1 — improvement B vs erase: different, but implementation-dependent (3/3 differ)

GPT-5.5 was sharpest:

B as written (detach complement for flagged rollouts → grad flows only into the deleted v_hack axes) → flagged rollouts contribute ZERO to the deployed model. Stronger than erase; it's label-masking, not directional routing.
B keeping the ⊥ part (Qg), routing only Pg to deleted axes → deployed update is algebraically identical to erase.
The ONLY real difference: the forget axes being active during generation shift on-policy rollouts (the Cloud et al. "routing > erasing" benefit). If you ablate the quarantine during generation (improvement C), B collapses to erase. => B and C are in tension; can't get the routing benefit and remove the feedback risk at once.
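A one-step numerical check of GPT-5.5's algebra, as a sketch under stated assumptions: the quarantine is modelled as an orthogonal projector P = V Vᵀ in the same parameter space, deploy-time ablation is the complementary projector Q = I - P, and the rollouts are held fixed. The names `ablate`, `b_keep` and `b_detach` are ours for illustration. The on-policy feedback effect is deliberately outside this check, since it needs generation in the loop.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, rank, lr = 64, 4, 0.1

V, _ = np.linalg.qr(rng.standard_normal((dim, rank)))  # stand-in for v_hack
P = V @ V.T                    # projector onto the quarantined (deleted) axes
Q = np.eye(dim) - P            # projector onto the complement
g = rng.standard_normal(dim)   # gradient of one flagged rollout

def ablate(delta):
    """Deploy-time ablation: drop whatever landed on the quarantined axes."""
    return Q @ delta

erase = -lr * (Q @ g)                     # erase: project the hack part out before the step
b_keep = ablate(-lr * (P @ g + Q @ g))    # B keeping Qg: full step, Pg lands in quarantine
b_detach = ablate(-lr * (P @ g))          # B as written: complement detached

same_as_erase = np.allclose(b_keep, erase)         # Q(Pg + Qg) = Qg
zero_contribution = np.allclose(b_detach, 0.0)     # QPg = 0
```

Both identities follow from QP = 0 and Q² = Q, so for a fixed batch the two variants differ from erase only through what the quarantined axes do to later rollouts.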

Q2 — route2-grad over-routes too (3/3)

Per-rollout aggregation does NOT fix it. cos(g_b, v_grad) > 0 on a near-orthogonal high-dim distribution is a ~50% coin flip at the rollout level (concentration of measure). It trades per-token coin-flips for per-rollout ones. GPT-5.5 adds: SVD sign is arbitrary (cos>0 is meaningless unless the orientation is anchored by labeled hack/clean grads; we DO anchor via the contrastive pairs, so partial credit), and the aggregate is dominated by length/advantage. DeepSeek adds: a mixed hack/clean rollout has cos≈0 and is miscategorised. Prediction: job 47 (route2-grad) over-routes like job 46 did. Watch its collapse/qE.

The granularity (token vs rollout) was never the core bug. The uncalibrated cosine gate is.
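The coin-flip claim is easy to reproduce on synthetic data. The sketch below assumes isotropic Gaussian vectors as a stand-in for rollout gradients that carry no hack signal; real gradients are not isotropic, so treat the exact numbers as illustrative only. `v_grad` here is a random unit vector, not the extracted one.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_rollouts = 4096, 2000

v_grad = rng.standard_normal(dim)
v_grad /= np.linalg.norm(v_grad)

# Rollout-level gradients with no hack signal at all (toy assumption).
g = rng.standard_normal((n_rollouts, dim))
cos = (g @ v_grad) / np.linalg.norm(g, axis=1)

routed_fraction = float((cos > 0).mean())      # share of clean rollouts the gate would route
typical_abs_cos = float(np.abs(cos).mean())    # how close to orthogonal they actually are

# Flipping the sign of v_grad (the SVD ambiguity) swaps which half gets routed.
routed_if_flipped = float((-cos > 0).mean())
```

The gate fires on roughly half of the clean rollouts even though every cosine is tiny, and flipping the sign of `v_grad` routes the other half instead. A calibrated threshold on |cos|, or no cosine gate at all, avoids both problems.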

Convergent fix

Kill the cos>0 gate. Route by the weak-detector flag (which rollouts go to quarantine); use v_hack only to define the deleted subspace (where the routed gradient goes). Keeps the vector premise, removes the coin-flip. (But see the existential control: must still prove the subspace choice beats random-V.)
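A minimal sketch of the flag-gated split, assuming per-rollout gradients as rows and an orthonormal `v_hack` of shape (dim, rank). The function name `route_by_flag` and the `keep_complement` switch are hypothetical; the reviews do not say whether the complement of a flagged gradient should be kept, so both options are exposed.

```python
import numpy as np

def route_by_flag(grads, flags, v_hack, keep_complement=True):
    """Split per-rollout gradients into (deployed, quarantine) parts.

    grads:  (n_rollouts, dim) per-rollout gradients
    flags:  (n_rollouts,) bool, weak-detector flag (decides WHICH rollouts)
    v_hack: (dim, rank) orthonormal basis (decides WHERE the routed part goes)
    keep_complement: if False, flagged rollouts contribute nothing to the
        deployed part (the "B as written" variant).
    """
    flags = np.asarray(flags, dtype=bool)
    quarantine = np.zeros_like(grads)
    quarantine[flags] = grads[flags] @ v_hack @ v_hack.T
    deployed = grads - quarantine
    if not keep_complement:
        deployed[flags] = 0.0
    return deployed, quarantine

# Usage on random stand-in data.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.standard_normal((32, 2)))
grads = rng.standard_normal((6, 32))
flags = np.array([True, False, True, False, False, True])
deployed, quarantine = route_by_flag(grads, flags, basis)
```

No cosine appears anywhere in the routing decision; the vector only shapes the subspace that receives the flagged component.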

Other agreed points

n-seeds: GRPO is high-variance; reviewers want 8 (Gemini), 8-10 (GPT-5.5), 30 (DeepSeek). We run n=1. All headline claims need error bars.
Add a hack-adjacent / benign-but-complex eval to catch over-suppression of legitimate high-reward generations (SGTM's "forget-adjacent").
Add a relearn/undiscovered-rate probe: after deploy-ablation, run a few GRPO steps and see if the hack re-emerges (shallow hiding vs real prevention).
Concrete UAT thresholds (what cin_t > cin_s, "coherent", "generalises" mean numerically + CIs), evaluated on the DEPLOYED (ablated) model.
DeepSeek's alt ideas that bypass routing (note as baselines we'd compare to, not adopt — they intervene at reward, which is the Rebound/Wu&Tang lane we differ from): v_hack-aligned reward penalty; flag-based advantage masking.
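On the n-seeds point, a percentile bootstrap over per-seed scores is one cheap way to get the requested error bars. This is a sketch: `bootstrap_ci` is our own helper, the scores are synthetic placeholders drawn from a random generator (not results), and with only 8 seeds a bootstrap interval is itself rough.

```python
import numpy as np

def bootstrap_ci(per_seed_scores, n_boot=10_000, alpha=0.05, seed=0):
    """Mean and percentile-bootstrap (1 - alpha) interval across seeds."""
    scores = np.asarray(per_seed_scores, dtype=float)
    rng = np.random.default_rng(seed)
    resampled = rng.choice(scores, size=(n_boot, scores.size), replace=True)
    means = resampled.mean(axis=1)
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(scores.mean()), float(lo), float(hi)

# Synthetic placeholder scores, one per seed (NOT measured values).
placeholder = np.random.default_rng(1).normal(0.5, 0.1, size=8)

mean8, lo8, hi8 = bootstrap_ci(placeholder)        # 8 seeds: a usable interval
mean1, lo1, hi1 = bootstrap_ci(placeholder[:1])    # 1 seed: interval collapses to a point
```

With a single seed the interval has zero width, which reflects missing information rather than certainty. That is the situation the current n=1 headline numbers are in.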

Bottom line

The reviews don't kill the idea, but they relocate the crux. Before more route2 engineering, the project needs the vector-vs-label-vs-random-V control to show the extracted direction is load-bearing at all. That's the next experiment.
