results: A3 erase-online row lands (deploy hack 0.562 > vanilla; route 0.000)

Job 76 (erase online refresh-5 s41) finished: deploy hack 0.562 / solve 0.438.
One-sided gradient erasure ends ABOVE vanilla (0.359) at deploy -- it does not
suppress hacking, while route zeroes it. cos_post pinned 0 each step (we did
remove the aligned component) yet hack still emerged, so the hack signal lives
largely off the extracted axis under erase. Filled tab:ablation vanilla(77)+
erase-online(76) rows, corrected stale job-id mapping (96/86/87/88 after requeue).
Journal 2026-06-03(b).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-03 06:47:58 +00:00
parent 1fb49a3325
commit 8d16b317cb
2 changed files with 47 additions and 11 deletions
+33
@@ -2,6 +2,39 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-06-03 (b) — A3 ablation: erase fails, route succeeds (erase-online row lands)
**Context:** Job 76 (erase online, refresh-5, s41, 60-step fast) landed. This is the
A3 negative-control row: one-sided gradient erasure that re-extracts v_hack every 5 steps.
tab:ablation in docs/writeup/main.tex filled (#183 partial; erase-static job 96 still running).
### Observations (DEPLOY-eval, knob off, seed 41, n=64 T=0.7)
- [obs] erase online rf5: deploy hack 0.562 / solve 0.438 (HACK_S 0.504, PASS 0.291). Run
20260603T032141. Hack climbed 0.0 (step5) -> 0.49 (step25) -> plateau ~0.5-0.6.
- [obs] vs vanilla s41 0.359/0.422 and route s41 0.000/0.625 (same preset/seed).
- [reason] erase ends *above* vanilla hack (0.562 vs 0.359), not below, so one-sided erasure of
the extracted direction does not suppress hacking at deploy. Two candidate explanations: (a) the
live GRPO gradient re-acquires the hack component faster than the per-5-step re-extraction strips
it, or (b) the erased component is not the load-bearing one. cos_post ~0 each step confirms we
removed the aligned part, yet hack still rises, which favors (b): under erase the hack signal
lives largely off the extracted axis.
- [obs] cos_post pinned +0.000 every logged step (erase removes the aligned component as designed);
cos_pre_s ~0.10-0.15 throughout. Mechanism worked, outcome metric did not move down.
- [reason] key contrast for the paper: route (quarantine whole gated rollouts into a throwaway
knob) zeroes deploy hack; erase (subtract the component) does not. Routing the rollout, not
erasing the direction, carries the effect.
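
A minimal sketch of the erase step described above, to make cos_pre / cos_post concrete. It is not the repo's implementation: the function names, the flat-list gradient, and the reading of "one-sided" as "subtract the v_hack component only when the projection is positive" are all assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def erase_one_sided(grad, v_hack):
    """Hypothetical one-sided erasure of the v_hack component of grad.

    Assumption: "one-sided" means we only subtract when grad points
    along +v_hack; a negative projection is left untouched.
    Returns (new_grad, cos_pre, cos_post).
    """
    v_norm = math.sqrt(dot(v_hack, v_hack))
    v = [x / v_norm for x in v_hack]          # unit direction
    coef = dot(grad, v)                       # signed projection length
    cos_pre = cosine(grad, v)
    if coef > 0:
        grad = [g - coef * x for g, x in zip(grad, v)]
    cos_post = cosine(grad, v)
    return grad, cos_pre, cos_post


if __name__ == "__main__":
    g = [0.2, 1.0, -0.5]      # toy gradient
    v = [2.0, 0.0, 0.0]       # toy v_hack (need not be unit length)
    g_new, cos_pre, cos_post = erase_one_sided(g, v)
    removed = 1.0 - dot(g_new, g_new) / dot(g, g)
    print(f"cos_pre={cos_pre:.3f} cos_post={cos_post:.3f} "
          f"fraction of |g|^2 removed={removed:.3f}")
```

Under this sketch the fraction of squared gradient norm removed equals cos_pre squared, so at the logged cos_pre_s of ~0.10-0.15 only about 1-2% of the squared norm would be stripped per step while cos_post reads 0. That is consistent with "mechanism worked, outcome did not move", but it is an illustration under the stated assumptions, not a measurement from run 20260603T032141.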
### Subtle-failure checks
- [check] not a collapse: solve stayed 0.40-0.52, lp_s coherent (~-3 nats), no incoherence drift.
- [check] not a no-emergence artifact: hack DID emerge (0->0.56), so the substrate worked and erase
genuinely failed to stop it (rules out "erase looks good only because nothing hacked").
### Next
- [todo] job 96 (erase static, frozen v_hack, s41) running -> fills the last erase row.
- [todo] controls 87 (random-V) / 86 (placebo) / 88 (post-hoc) -> directional-specificity rows.
## 2026-06-03 (a) — keynote A1/A2 closed at n=3: route cuts deploy hack -0.292 (paired p~=0.013)
**Context:** `probe/distill-cosine`. Job 77 (vanilla s41, the last missing keynote-band seed)
+14 -11
@@ -222,11 +222,14 @@ $+0.024$ while a mechanism-contrasting pairset moved it $-0.226$
(Appendix~\ref{app:context}, Q10).
% --- Table: ablation --------------------------------------------------------
-% Provenance: route2 nofloor s41 = 20260601T115713 (hack 0.000 / solve 0.625).
-% All other rows are QUEUED jobs (not landed); cells are \TODO with job id.
-% 75 erase static s41 | 76 erase online(refresh-5) s41 | 78 route2 refresh-2
-% 80 placebo null_city pairset (expect ~vanilla) | 81 random-V route (expect ~vanilla)
-% 83 post-hoc test-time erase (scripts/tt_erase_bench.py on vanilla ckpt)
+% Provenance (seed 41, 60-step fast preset):
+% route2 nofloor = 20260601T115713 (hack 0.000 / solve 0.625) [landed]
+% vanilla s41 = job 77, 20260602T234727 (hack 0.359 / solve 0.422) [landed]
+% erase online rf5 = job 76, 20260603T032141 (hack 0.562 / solve 0.438; HACK_S 0.504) [landed 2026-06-03]
+% Still queued/running (cells \TODO with current job id after the requeue):
+% 96 erase static s41 (running) | 78 route2 refresh-2
+% 86 placebo null_city pairset (expect ~vanilla) | 87 random-V route (expect ~vanilla)
+% 88 post-hoc test-time erase (scripts/tt_erase_bench.py on vanilla ckpt)
\begin{table}[t]
\centering
\caption{Ablation: deploy hack/solve per arm, seed 41, matched preset.
@@ -238,14 +241,14 @@ $+0.024$ while a mechanism-contrasting pairset moved it $-0.226$
\toprule
Arm & Deploy hack & Deploy solve & Source \\
\midrule
-Vanilla (no intervention) & \TODO{} & \TODO{} & job 84 \\
-Erase static (one-sided) & \TODO{} & \TODO{} & job 75 \\
-Erase online (refresh-5) & \TODO{} & \TODO{} & job 76 \\
+Vanilla (no intervention) & $0.359$ & $0.422$ & job 77 \\
+Erase static (one-sided) & \TODO{} & \TODO{} & job 96 \\
+Erase online (refresh-5) & $0.562$ & $0.438$ & job 76 \\
route (refresh-5) & $0.000$ & $0.625$ & 20260601T115713 \\
route (refresh-2) & \TODO{} & \TODO{} & job 78 \\
-Random-V route \emph{(control)} & \TODO{$\approx$van}& \TODO{} & job 81 \\
-Placebo pairset \emph{(control)} & \TODO{$\approx$van}& \TODO{} & job 80 \\
-Post-hoc test-time erase & \TODO{} & \TODO{} & job 83 \\
+Random-V route \emph{(control)} & \TODO{$\approx$van}& \TODO{} & job 87 \\
+Placebo pairset \emph{(control)} & \TODO{$\approx$van}& \TODO{} & job 86 \\
+Post-hoc test-time erase & \TODO{} & \TODO{} & job 88 \\
\bottomrule
\end{tabular}
\end{table}