diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md index 67f3979..79a960f 100644 --- a/RESEARCH_JOURNAL.md +++ b/RESEARCH_JOURNAL.md @@ -4190,3 +4190,5 @@ Provenance: **Discussion (speculative).** Why act tracks grad at all: the c-probe gradient is h*(B^T delta) per token (lora2r.py:53), sharing the bottleneck activation factor h with the act score; the extra loss-side factor is what differs, and it appears to be the unstable part. Hypotheses for grad's decay across runs: (1) v3 was the pre-fix high-lr run that diverged at step 10, and its extreme updates imprint stronger gradient signatures, credence 0.3; (2) reconstructed-advantage error degrades only the grad columns (act uses no A), credence 0.25; (3) the c-probe geometry depends on the checkpoint's A and B, so the pair-extracted grad direction transfers worse across checkpoints than the act direction (A only), credence 0.2; (4) v3 was a lucky draw, credence 0.15 (CI floor 0.73 argues against, but windows are not iid). Distinguishing tests: exact teacher-inclusive advantages for (2); extraction at matched training steps for (3). None of these rescue grad for the gate decision. Alternative read of act's stability: it may be a surface-texture detector of exploit tokens, which would generalize differently to unseen hack modes; the held-out-mode test would distinguish capability from shortcut. **Next.** Act-gate spec: docs/spec/20260611_act_gate_spec.md (score activations, route gradients). Residual-stream representation queued (pueue #21-23) to test whether the random r=32 lora projection limits even the bottleneck act. + +**Addendum (same day): residual-stream result.** Pueue #21/#22/#23 (script extended at commit `0660e7b` with resid layers 12/18/24, completion-mean, cos/dot; logs /root/.local/share/pueue/task_logs/{21,22,23}.log, `behavior` row of each printed table). resid_cos on the A>0 contrast with the behavior_ vector: 0.916 (v3), 0.700 (v4), 0.804 (v5), vs act_cos 0.869 / 0.749 / 0.752. Inference: the random r=32 bottleneck projection is not what limits separation (act survives it); resid is ahead in 2 of 3 windows but within ~1 SE, so the representation choice between them is structural, not empirical: resid is adapter-independent and closest to a standard steering-vector probe. Spec updated to default the routeA gate to resid_cos.