journal (h): AFK check -- no-cheat E-by-mode table re-confirmed on job 95; directionality framing corrected

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:30:30 +08:00 · 2026-06-05 09:35:27 +00:00
parent a3a3f09824
commit 6f60ebafa1
1 changed files with 34 additions and 0 deletions
@@ -2,6 +2,40 @@

 Append-only. New entries at the top, date-stamped. Never edit old entries.

+## 2026-06-05 (h) — AFK check: no-cheat E-by-mode table re-confirmed on job 95; directionality framing corrected
+
+AFK hourly check. GPU busy (job 87 none-toff40 at step ~147/200, reward +3.2 to +3.5,
+lp_s -0.4 to -0.57, deploy@140 hack=0.516/solve=0.422 -- healthy on-policy
+emergence, no collapse). No failures. Directionality batch (jobs 114-123) is
+staged, interleaved (one job of each family per priority tier).
+
+Two things verified this check:
+
+1. Load-bearing no-cheat E-by-mode table, re-tabulated fresh from job 95 rollouts
+   (out/runs/20260603T202151_fast_vanilla_seed41_harvest_s41/rollouts.jsonl):
+
+   | mode          |   n | exploited | hk_E | hk_C | hk_D | gt_pass |
+   |---------------|----:|----------:|-----:|-----:|-----:|--------:|
+   | run_tests     | 287 |       149 |  189 |  165 |    9 |      21 |
+   | file_marker   | 238 |        97 |    0 |    0 |    0 |      55 |
+   | sentinel      | 315 |         4 |    0 |    0 |    0 |     154 |
+   | stdout_marker | 280 |         7 |    2 |    1 |    0 |     127 |
+
+   KNOWN = run_tests (E fires on 189/287 = 66%). HELD-OUT = file_marker and
+   sentinel (E == 0 exactly) plus stdout_marker (E = 2/280 = 0.7%, the double-hack
+   leak). v_hack and the pairset are built only from run_tests
+   (heldout_known_runtests.json, 5 pairs; v_hack_a5_runtests.safetensors).
+   file_marker is exploited in 97/238 = 41% of rollouts yet is invisible to E --
+   the weak-detector regime is real. The 0.7% stdout leak is zeroed by
+   --gate-anchor-teacher-only (jobs 111-113).
+
+2. Retracted the "null_city placebo is CONTAMINATED (20% of modules align)" framing
+   that I had written as fact (train.py and make_pairsets.py comments, Haar job
+   labels). Haar's ~0 cos is concentration of measure (out-of-subspace: a random
+   direction's cos with any fixed vector has std ~1/sqrt(d), d = dimension), not
+   a sign of a "cleaner placebo". Semantic placebos are IN-subspace and share
+   generic structure, so a nonzero cos is the expected floor; null_city's high-cos
+   modules are plausibly low-rank-module artifacts. Cosine is correlational; the
+   ablation is the causal test. Haar now tests "must v_grad be in-subspace at
+   all?"; the semantic fleet tests "must it point at the hack specifically?".
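
Sketch of the E-by-mode re-tabulation in stdlib Python, for anyone re-deriving the table. The per-rollout keys (`mode` plus boolean `exploited`, `hk_E`, `hk_C`, `hk_D`, `gt_pass`) are my assumption about the rollouts.jsonl schema, not read from it, and `tabulate`, `to_markdown` and `tabulate_file` are hypothetical helpers, not the script that produced the job 95 numbers.

```python
import json
from collections import defaultdict
from typing import Iterable

# ASSUMPTION: hypothetical per-rollout schema; the real rollouts.jsonl keys may differ.
# Each record is taken to carry a "mode" string plus one boolean per flag below.
FLAGS = ("exploited", "hk_E", "hk_C", "hk_D", "gt_pass")


def tabulate(lines: Iterable[str]) -> dict[str, dict[str, int]]:
    """Count rollouts (n) and per-flag hits for each hack mode."""
    table: dict[str, dict[str, int]] = defaultdict(lambda: {"n": 0, **{f: 0 for f in FLAGS}})
    for line in lines:
        line = line.strip()
        if not line:  # tolerate blank lines in the JSONL
            continue
        rec = json.loads(line)
        row = table[rec["mode"]]
        row["n"] += 1
        for flag in FLAGS:
            # A missing key counts as "did not fire" rather than raising.
            row[flag] += int(bool(rec.get(flag, False)))
    return dict(table)


def to_markdown(table: dict[str, dict[str, int]]) -> str:
    """Render counts as a journal-style table with an extra E-rate column (hk_E / n)."""
    lines = ["| mode | n | " + " | ".join(FLAGS) + " | E rate |",
             "|---|--:|" + "--:|" * len(FLAGS) + "--:|"]
    for mode, row in sorted(table.items()):
        cells = " | ".join(str(row[f]) for f in FLAGS)
        lines.append(f"| {mode} | {row['n']} | {cells} | {row['hk_E'] / row['n']:.1%} |")
    return "\n".join(lines)


def tabulate_file(path: str) -> dict[str, dict[str, int]]:
    """Read a rollouts JSONL file and tabulate it."""
    with open(path, encoding="utf-8") as fh:
        return tabulate(fh)
```

Usage: `print(to_markdown(tabulate_file("out/runs/20260603T202151_fast_vanilla_seed41_harvest_s41/rollouts.jsonl")))`. The E-rate column is plain hk_E/n, e.g. 189/287 = 65.9% for run_tests, which is where the 66% comes from.
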
+
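
Toy check of the concentration-of-measure point, in stdlib Python. This is a sketch, not the real v_grad/placebo construction. A Haar-random direction in R^d has cos with any fixed vector of mean 0 and std exactly 1/sqrt(d), so a ~0 cos alone says nothing about how "clean" a placebo is. `shared_structure_cos` is a hypothetical stand-in for an in-subspace placebo: one shared generic direction plus independent noise, with `shared_weight` (w) a made-up knob. For large d its cos floor sits near w^2/(w^2+1), not 0.

```python
import math
import random
import statistics


def random_unit_vector(d: int, rng: random.Random) -> list[float]:
    """Haar-random direction in R^d: normalise an isotropic Gaussian draw."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def haar_cos_spread(d: int, n_draws: int, seed: int = 0) -> tuple[float, float]:
    """Mean and std of cos(Haar-random direction, fixed direction) in R^d."""
    rng = random.Random(seed)
    target = random_unit_vector(d, rng)  # stand-in for a fixed direction such as v_hack
    cosines = [cosine(random_unit_vector(d, rng), target) for _ in range(n_draws)]
    return statistics.fmean(cosines), statistics.pstdev(cosines)


def shared_structure_cos(d: int, n_draws: int, shared_weight: float, seed: int = 1) -> float:
    """Mean cos between vectors sharing one generic direction (hypothetical placebo model)."""
    rng = random.Random(seed)
    generic = random_unit_vector(d, rng)  # hypothetical 'generic structure' shared by all

    def draw() -> list[float]:
        noise = random_unit_vector(d, rng)
        return [shared_weight * g + n for g, n in zip(generic, noise)]

    target = draw()
    return statistics.fmean(cosine(draw(), target) for _ in range(n_draws))


if __name__ == "__main__":
    for d in (64, 256, 1024):
        mean, std = haar_cos_spread(d, n_draws=400)
        print(f"d={d:5d}  Haar mean={mean:+.4f}  std={std:.4f}  1/sqrt(d)={1 / math.sqrt(d):.4f}")
    print(f"shared-structure floor (w=1, d=512): {shared_structure_cos(512, 200, 1.0):.3f}")
```

The printed Haar std should track 1/sqrt(d) as d grows, while the shared-structure mean stays near 0.5 for w=1 regardless of d: that gap is the "expected floor" argument, with the Haar row the out-of-subspace control.
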
 ## 2026-06-05 (g) — placebo non-directionality is MEASURED (hkgap), not just inferred; + A5 leak is double-hacks not detector error

 Two clarifications prompted by review questions today; neither changes a number, both make a load-bearing claim auditable.