mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 17:48:43 +08:00
journal: framing -- post-hoc proves v_hack weak, but weak suffices as a routing gate (SGTM absorption)
The post-hoc erase result (weight 0.391->0.297) shows the rank-~10 v_hack is too weak to span/erase the trained hack in W. But the same vector works at train time because a gate only needs to DISCRIMINATE hack rollouts, and SGTM's absorption (Cloud 2024/2025) + self-reinforcing localization amplify a weak noisy direction into full localization in the throwaway knob. This is the mechanism A5 (held-out modes) tests -- logged as hypothesis, not yet shown in our RL setup.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -2,6 +2,42 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-06-03 (d) — framing: post-hoc proves v_hack is WEAK, but weak is enough for routing (SGTM absorption)

**Context:** Interpreting the post-hoc result (entry (c)) against the train-time route success.
Not a new run -- a framing note so that A5 is read correctly. Sources: SGTM (Shilov, Cloud et al.
2025, arXiv:2512.05648, docs/papers/grad_routing/paper_sgtm.md) and Gradient Routing (Cloud 2024).

### The apparent paradox

- [obs] v_hack erases poorly post-hoc: weight_erase only 0.391->0.297; the trained hack has
  diffused off the rank-~10 extracted subspace (10 pairs => at most a rank-10 signal). So the
  vector is WEAK.
- [obs] yet train-time route on the SAME vector gets deploy hack 0.000 / solve 0.625.

### Why weak is fine as a gate but not as an eraser

- [reason] eraser (post-hoc) must SPAN the hack subspace in W to subtract it; a rank-10
  subspace cannot span a hack spread across W after training -> fails (and act-erase only
  "succeeds" by lobotomy, solve->0, entry (c)).
- [reason] gate (route2: route whole rollout grad when cos(g_rollout, v_grad) > tau) only needs
  to DISCRIMINATE hack rollouts from clean rollouts well enough to fire. Precision of the
  direction matters far less than its sign/separation on rollouts.
- [reason] SGTM does the rest: the absorption property (Cloud 2024 -- partial/noisy routing
  still localizes, robust to discovery rates as low as 50%) plus self-reinforcing localization
  (forget examples preferentially update the dedicated params once a seed exists). A weak noisy
  direction seeds it; absorption amplifies it into full localization in the throwaway knob.

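
A toy sketch of the eraser/gate asymmetry argued above, on synthetic vectors rather than our weights or gradients. Everything here is an assumption for illustration: `DIM`, `RANK`, `TAU`, `NOISE`, the uniform `hack_delta`, the Gaussian gradient noise, and the helper names `erase_subspace` / `should_route` are hypothetical and are not taken from route2 or scripts/tt_erase_bench.py. It only shows that projecting out a rank-10 subspace leaves most of a spread-out delta intact, while a cosine threshold on one direction from that same subspace can still separate two rollout populations.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def erase_subspace(w, basis):
    """Subtract from w its projection onto an ORTHONORMAL basis."""
    out = list(w)
    for b in basis:
        coeff = dot(w, b)
        out = [o - coeff * bi for o, bi in zip(out, b)]
    return out

def should_route(g_rollout, v_grad, tau):
    """Gate: route the whole rollout gradient when cos(g_rollout, v_grad) > tau."""
    return cosine(g_rollout, v_grad) > tau

# Hypothetical toy sizes and thresholds, not values from our runs.
DIM, RANK, TAU, NOISE = 64, 10, 0.2, 0.3
rng = random.Random(0)

def unit(i):
    return [1.0 if j == i else 0.0 for j in range(DIM)]

basis = [unit(i) for i in range(RANK)]  # stand-in for the extracted rank-10 subspace
v_grad = basis[0]                       # stand-in for the gate direction

# Eraser: a delta spread evenly over all DIM coordinates.
# Only RANK/DIM of its energy lies inside the subspace we can subtract.
hack_delta = [0.1] * DIM
residual = norm(erase_subspace(hack_delta, basis)) / norm(hack_delta)

# Gate: hack rollouts carry a unit component along v_grad plus noise,
# clean rollouts are noise only (an assumed generative model).
def rollout_grad(is_hack):
    signal = 1.0 if is_hack else 0.0
    return [signal * v + rng.gauss(0.0, NOISE) for v in v_grad]

N = 200
hack_rate = sum(should_route(rollout_grad(True), v_grad, TAU) for _ in range(N)) / N
clean_rate = sum(should_route(rollout_grad(False), v_grad, TAU) for _ in range(N)) / N

print(f"norm fraction left after erasure: {residual:.3f}")
print(f"gate fire rate  hack: {hack_rate:.2f}  clean: {clean_rate:.2f}")
```

By construction the eraser leaves sqrt(54/64) of the delta norm, about 0.92. The gate rates depend entirely on the assumed noise level, so they say nothing about the real setup, and the sketch does not model absorption.
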
### Consequence for the no-cheat / A5 claim

- [reason] a weak detector that flags only 2/4 modes is just a noisy routing label. Absorption is
  exactly the property that should suppress the 2 HELD-OUT modes it never labeled, IF the
  held-out hack gradient co-activates the routed direction. This is the mechanism A5 tests.
- [check / epistemic status] absorption is demonstrated in SGTM's pretraining LM-unlearning
  setup, NOT yet in our GRPO reward-hacking setup. Treat as hypothesis. A5 (route on known
  modes, measure held-out deploy hack) is the decisive test; do not assert absorption holds
  here until that lands. Failure mode to watch: held-out hack lives in a subspace orthogonal to
  the routed one -> gate never fires on it -> no suppression (absorption needs co-activation).

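
The failure mode in the last bullet can be made concrete with the same cosine gate. A minimal sketch, assuming a hypothetical threshold `TAU = 0.2` and hand-picked 4-d vectors (none of these values come from our runs): a held-out gradient with some component along the routed direction fires the gate, while one confined to an orthogonal subspace has cosine 0 and never does. It checks only the precondition that the gate can fire, not absorption itself.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def gate_fires(g_rollout, v_grad, tau):
    """Same rule as the route gate: fire when cos(g_rollout, v_grad) > tau."""
    return cosine(g_rollout, v_grad) > tau

TAU = 0.2  # hypothetical threshold

# Routed direction learned from the known (labeled) hack modes.
v_routed = [1.0, 0.0, 0.0, 0.0]

# Held-out mode that shares a component with the routed direction.
held_out_coactive = [0.6, 0.8, 0.0, 0.0]

# Held-out mode living entirely in an orthogonal subspace.
held_out_orthogonal = [0.0, 0.0, 0.6, 0.8]

for name, g in [("co-active", held_out_coactive),
                ("orthogonal", held_out_orthogonal)]:
    print(name, round(cosine(g, v_routed), 3), gate_fires(g, v_routed, TAU))
```

If A5 shows no suppression on held-out modes, measuring this cosine on held-out hack rollouts is one cheap way to tell a gate that never fired apart from a failure of absorption.
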
## 2026-06-03 (c) — A3 post-hoc test-time erasure: weight barely dents, act lobotomizes
**Context:** Job 98 (scripts/tt_erase_bench.py on the 20260531T141402 vanilla hacking ckpt,