journal: framing -- post-hoc proves v_hack weak, but weak suffices as a routing gate (SGTM absorption)

The post-hoc erase result (weight 0.391->0.297) shows the rank-~10 v_hack is too weak to span/erase the trained hack in W. But the same vector works at train time because a gate only needs to DISCRIMINATE hack rollouts, and SGTM's absorption (Cloud 2024/2025) + self-reinforcing localization amplify a weak noisy direction into full localization in the throwaway knob. This is the mechanism A5 (held-out modes) tests -- logged as hypothesis, not yet shown in our RL setup.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 17:48:43 +08:00 · 2026-06-03 11:04:07 +00:00
parent 3cc804b15e
commit d06a4fbe69
1 changed files with 36 additions and 0 deletions
@@ -2,6 +2,42 @@

 Append-only. New entries at the top, date-stamped. Never edit old entries.

+## 2026-06-03 (d) — framing: post-hoc proves v_hack is WEAK, but weak is enough for routing (SGTM absorption)
+
+**Context:** Interpreting the post-hoc result (entry (c)) against the train-time routing success. Not a new
+run -- a framing note so A5 is read correctly. Source: SGTM (Shilov, Cloud et al. 2025,
+arXiv:2512.05648, docs/papers/grad_routing/paper_sgtm.md) + Gradient Routing (Cloud 2024).
+
+### The apparent paradox
+
+- [obs] v_hack erases poorly post-hoc: weight_erase only 0.391->0.297; the trained hack has
+  diffused off the rank-~10 extracted subspace (10 pairs => at most rank-10 signal). So the vector is WEAK.
+- [obs] yet train-time route on the SAME vector gets deploy hack 0.000 / solve 0.625.
+
+### Why weak is fine as a gate but not as an eraser
+
+- [reason] eraser (post-hoc) must SPAN the hack subspace in W to subtract it; a rank-10
+  subspace cannot span a hack spread across W after training -> fails (and act-erase only
+  "succeeds" by lobotomy, solve->0, entry (c)).
+- [reason] gate (route2: route whole rollout grad when cos(g_rollout, v_grad) > tau) only needs
+  to DISCRIMINATE hack- from clean-rollouts well enough to fire. Precision of the direction
+  matters far less than its sign/separation on rollouts.
+- [reason] SGTM should do the rest: the absorption property (Cloud 2024 -- partial/noisy routing
+  still localizes, robust to discovery rates as low as 50%) plus self-reinforcing localization
+  (forget examples preferentially update the dedicated params once a seed exists). A weak noisy
+  direction seeds it; absorption then amplifies that seed into full localization in the throwaway knob.
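+
+A minimal sketch of the eraser/gate contrast in symbols. Assumptions, none of them stated in
+entry (c): the eraser is an orthogonal projection off an orthonormal basis
+$V \in \mathbb{R}^{d \times 10}$ of the extracted directions (applied on the left of W here,
+which is itself a guess), $\Delta W_{\text{hack}}$ stands for the hack-related part of the
+trained weights, and $m_{\text{knob}}$ is a hypothetical binary mask selecting the throwaway-knob
+params. What happens to the gradient when the gate does not fire is left unspecified.
+
+$$
+\text{eraser:}\quad W' = (I - V V^\top)\, W, \qquad
+\text{residual hack} = (I - V V^\top)\, \Delta W_{\text{hack}}
+$$
+
+$$
+\text{gate (route2):}\quad r = \mathbb{1}\!\left[\cos(g_{\text{rollout}}, v_{\text{grad}}) > \tau\right], \qquad
+r = 1 \;\Rightarrow\; \Delta\theta \propto -\, m_{\text{knob}} \odot g_{\text{rollout}}
+$$
+
+Under these assumptions the residual vanishes only if the columns of $\Delta W_{\text{hack}}$ lie
+in $\operatorname{span}(V)$, whereas $r$ depends only on which side of $\tau$ the cosine falls.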
+
+### Consequence for the no-cheat / A5 claim
+
+- [reason] a weak detector that flags only 2/4 modes is just a noisy routing label. Absorption is
+  exactly the property that should suppress the 2 HELD-OUT modes it never labeled, IF the
+  held-out hack gradient co-activates the routed direction. This is the mechanism A5 tests.
+- [check / epistemic status] absorption is demonstrated in SGTM's pretraining LM-unlearning
+  setup, NOT yet in our GRPO reward-hacking setup. Treat as hypothesis. A5 (route on known
+  modes, measure held-out deploy hack) is the decisive test; do not assert absorption holds
+  here until that lands. Failure mode to watch: held-out hack lives in a subspace orthogonal to
+  the routed one -> gate never fires on it -> no suppression (absorption needs co-activation).
+
 ## 2026-06-03 (c) — A3 post-hoc test-time erasure: weight barely dents, act lobotomizes

 **Context:** Job 98 (scripts/tt_erase_bench.py on the 20260531T141402 vanilla hacking ckpt,