journal: framing -- post-hoc proves v_hack weak, but weak suffices as a routing gate (SGTM absorption)

The post-hoc erase result (weight 0.391->0.297) shows the rank-~10 v_hack is too weak to
span/erase the trained hack in W. But the same vector works at train time because a gate only
needs to DISCRIMINATE hack rollouts, and SGTM's absorption (Cloud 2024/2025) + self-reinforcing
localization amplify a weak noisy direction into full localization in the throwaway knob. This
is the mechanism A5 (held-out modes) tests -- logged as hypothesis, not yet shown in our RL setup.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-03 11:04:07 +00:00
parent 3cc804b15e
commit d06a4fbe69
+36
@@ -2,6 +2,42 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-06-03 (d) — framing: post-hoc proves v_hack is WEAK, but weak is enough for routing (SGTM absorption)
**Context:** Interpreting the post-hoc result (entry (c)) against the route success. Not a new
run -- a framing note so A5 is read correctly. Source: SGTM (Shilov, Cloud et al. 2025,
arXiv:2512.05648, docs/papers/grad_routing/paper_sgtm.md) + Gradient Routing (Cloud 2024).
### The apparent paradox
- [obs] v_hack erases poorly post-hoc: weight_erase only 0.391->0.297; the trained hack has
diffused off the rank-~10 extracted subspace (10 pairs => at most a rank-10 signal). So the vector is WEAK.
- [obs] yet train-time route on the SAME vector gets deploy hack 0.000 / solve 0.625.
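A toy sketch of the first observation, under loudly stated assumptions: the "hack" here is a made-up vector spread evenly over 40 coordinates, and the "extracted" basis is the first 10 standard axes. None of these numbers come from the run; the point is only that projecting out a rank-10 subspace leaves most of a diffuse direction intact.

```python
import math

def project_out(vec, basis):
    """Remove the components of vec along each (orthonormal) basis vector."""
    out = list(vec)
    for b in basis:
        coef = sum(x * y for x, y in zip(vec, b))
        out = [o - coef * y for o, y in zip(out, b)]
    return out

def norm(v):
    return math.sqrt(sum(x * x for x in v))

d, k = 40, 10                      # hypothetical sizes, not from the experiment
hack = [1.0] * d                   # toy hack direction, diffused over all coordinates
basis = [[1.0 if i == j else 0.0 for i in range(d)] for j in range(k)]

residual = project_out(hack, basis)
remaining_energy = (norm(residual) / norm(hack)) ** 2   # about 0.75 in this toy
```

In this toy, three quarters of the squared norm survives the erase, which is the qualitative shape of "weight barely dents"; it says nothing about the actual 0.391->0.297 figure.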
### Why weak is fine as a gate but not as an eraser
- [reason] eraser (post-hoc) must SPAN the hack subspace in W to subtract it; a rank-10
direction cannot span a hack spread across W after training -> fails (and act-erase only
"succeeds" by lobotomy, solve->0, entry (c)).
- [reason] gate (route2: route whole rollout grad when cos(g_rollout, v_grad) > tau) only needs
to DISCRIMINATE hack- from clean-rollouts well enough to fire. Precision of the direction
matters far less than its sign/separation on rollouts.
- [reason] SGTM does the rest: the absorption property (Cloud 2024 -- partial/noisy routing
still localizes, robust to discovery rates as low as 50%) plus self-reinforcing localization
(forget examples preferentially update the dedicated params once a seed exists). A weak noisy
direction seeds it; absorption amplifies into full localization in the throwaway knob.
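A minimal sketch of the gate described above, assuming gradients are flat lists of floats and that "routing" means sending the whole rollout gradient to the throwaway parameters and none to the main ones. The function names, the zero-vector handling, and the example numbers are hypothetical, not the route2 implementation.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def route_rollout(g_rollout, v_grad, tau):
    """Return (g_main, g_throwaway) for one rollout gradient.

    If the rollout gradient aligns with v_grad above tau, the whole
    gradient goes to the throwaway knob; otherwise it goes to main.
    """
    zeros = [0.0] * len(g_rollout)
    if cosine(g_rollout, v_grad) > tau:
        return zeros, list(g_rollout)
    return list(g_rollout), zeros

v_grad = [1.0, 0.0, 0.0]                 # weak direction, toy values
noisy_hack = [0.5, 0.6, 0.6]             # cos ~ 0.51: mostly off-axis, still flagged
clean = [-0.1, 0.7, 0.7]                 # negative cos: not flagged

hack_main, hack_throw = route_rollout(noisy_hack, v_grad, tau=0.3)
clean_main, clean_throw = route_rollout(clean, v_grad, tau=0.3)
```

The noisy hack gradient has most of its norm off the v_grad axis, yet the gate still fires, which is the sense in which separation matters more than precision.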
### Consequence for the no-cheat / A5 claim
- [reason] a weak detector that flags only 2/4 modes is just a noisy routing label. Absorption is
exactly the property that should suppress the 2 HELD-OUT modes it never labeled, IF the
held-out hack gradient co-activates the routed direction. This is the mechanism A5 tests.
- [check / epistemic status] absorption is demonstrated in SGTM's pretraining LM-unlearning
setup, NOT yet in our GRPO reward-hacking setup. Treat as hypothesis. A5 (route on known
modes, measure held-out deploy hack) is the decisive test; do not assert absorption holds
here until that lands. Failure mode to watch: held-out hack lives in a subspace orthogonal to
the routed one -> gate never fires on it -> no suppression (absorption needs co-activation).
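One way to watch for that failure mode is to log how often the gate fires per hack mode, including held-out ones. The sketch below assumes per-rollout gradients can be grouped by mode label; the helper name, mode names, and vectors are all hypothetical and only illustrate the co-activation precondition, not a result.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)

def fire_rate_by_mode(grads_by_mode, v_grad, tau):
    """Fraction of rollouts per mode whose gradient passes the cosine gate."""
    rates = {}
    for mode, grads in grads_by_mode.items():
        fired = sum(1 for g in grads if cosine(g, v_grad) > tau)
        rates[mode] = fired / len(grads)
    return rates

v_grad = [1.0, 0.0, 0.0]
grads_by_mode = {                        # toy gradients, not measured data
    "known_mode": [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]],
    "held_out_coactive": [[0.6, 0.8, 0.0]],
    "held_out_orthogonal": [[0.0, 1.0, 0.0], [0.0, 0.2, 0.9]],
}
rates = fire_rate_by_mode(grads_by_mode, v_grad, tau=0.3)
```

A held-out mode with a fire rate near zero would indicate the orthogonal case, where no suppression should be expected from routing alone.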
## 2026-06-03 (c) — A3 post-hoc test-time erasure: weight barely dents, act lobotomizes
**Context:** Job 98 (scripts/tt_erase_bench.py on the 20260531T141402 vanilla hacking ckpt,