From d06a4fbe691bc0f4c08f15eb30ba80fcdeec1e4e Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Wed, 3 Jun 2026 11:04:07 +0000
Subject: [PATCH] journal: framing -- post-hoc proves v_hack weak, but weak
 suffices as a routing gate (SGTM absorption)

The post-hoc erase result (weight 0.391->0.297) shows the rank-~10 v_hack
is too weak to span/erase the trained hack in W. But the same vector works
at train time because a gate only needs to DISCRIMINATE hack rollouts, and
SGTM's absorption (Cloud 2024/2025) + self-reinforcing localization amplify
a weak noisy direction into full localization in the throwaway knob. This
is the mechanism A5 (held-out modes) tests -- logged as hypothesis, not yet
shown in our RL setup.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
---
 RESEARCH_JOURNAL.md | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index 631455f..093d97c 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -2,6 +2,42 @@
 
 Append-only. New entries at the top, date-stamped. Never edit old entries.
 
+## 2026-06-03 (d) — framing: post-hoc proves v_hack is WEAK, but weak is enough for routing (SGTM absorption)
+
+**Context:** Interpreting the post-hoc result (entry (c)) against the route success. Not a new
+run -- a framing note so A5 is read correctly. Source: SGTM (Shilov, Cloud et al. 2025,
+arXiv:2512.05648, docs/papers/grad_routing/paper_sgtm.md) + Gradient Routing (Cloud 2024).
+
+### The apparent paradox
+
+- [obs] v_hack erases poorly post-hoc: weight_erase only 0.391->0.297; the trained hack has
+  diffused off the rank-~10 extracted axis (10 pairs => rank-10 signal). So the vector is WEAK.
+- [obs] yet train-time route on the SAME vector gets deploy hack 0.000 / solve 0.625.
+
+### Why weak is fine as a gate but not as an eraser
+
+- [reason] eraser (post-hoc) must SPAN the hack subspace in W to subtract it; a rank-10
+  direction cannot span a hack spread across W after training -> fails (and act-erase only
+  "succeeds" by lobotomy, solve->0, entry (c)).
+- [reason] gate (route2: route whole rollout grad when cos(g_rollout, v_grad) > tau) only needs
+  to DISCRIMINATE hack- from clean-rollouts well enough to fire. Precision of the direction
+  matters far less than its sign/separation on rollouts.
+- [reason] SGTM does the rest: the absorption property (Cloud 2024 -- partial/noisy routing
+  still localizes, robust to discovery rates as low as 50%) plus self-reinforcing localization
+  (forget examples preferentially update the dedicated params once a seed exists). A weak noisy
+  direction seeds it; absorption amplifies into full localization in the throwaway knob.
+
+### Consequence for the no-cheat / A5 claim
+
+- [reason] a weak detector that flags only 2/4 modes is just a noisy routing label. Absorption is
+  exactly the property that should suppress the 2 HELD-OUT modes it never labeled, IF the
+  held-out hack gradient co-activates the routed direction. This is the mechanism A5 tests.
+- [check / epistemic status] absorption is demonstrated in SGTM's pretraining LM-unlearning
+  setup, NOT yet in our GRPO reward-hacking setup. Treat as hypothesis. A5 (route on known
+  modes, measure held-out deploy hack) is the decisive test; do not assert absorption holds
+  here until that lands. Failure mode to watch: held-out hack lives in a subspace orthogonal to
+  the routed one -> gate never fires on it -> no suppression (absorption needs co-activation).
+
 ## 2026-06-03 (c) — A3 post-hoc test-time erasure: weight barely dents, act lobotomizes
 
 **Context:** Job 98 (scripts/tt_erase_bench.py on the 20260531T141402 vanilla hacking ckpt,