journal (h): AFK check -- no-cheat E-by-mode table re-confirmed on job 95; directionality framing corrected

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-05 09:35:27 +00:00
parent a3a3f09824
commit 6f60ebafa1
+34
@@ -2,6 +2,40 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.
## 2026-06-05 (h) — AFK check: no-cheat E-by-mode table re-confirmed on job 95; directionality framing corrected
AFK hourly check. GPU busy (job 87 none-toff40 at step ~147/200, reward +3.2-3.5,
lp_s -0.4 to -0.57, deploy@140 hack=0.516/solve=0.422 -- healthy on-policy
emergence, no collapse). No failures. Directionality batch (jobs 114-123)
staged interleaved (one of each family per priority tier).
Two things verified this check:
1. Load-bearing no-cheat E-by-mode table, re-tabulated fresh from job 95 rollouts
(out/runs/20260603T202151_fast_vanilla_seed41_harvest_s41/rollouts.jsonl):
| mode | n | exploited | hk_E | hk_C | hk_D | gt_pass |
|---------------|----:|----------:|-----:|-----:|-----:|--------:|
| run_tests | 287 | 149 | 189 | 165 | 9 | 21 |
| file_marker | 238 | 97 | 0 | 0 | 0 | 55 |
| sentinel | 315 | 4 | 0 | 0 | 0 | 154 |
| stdout_marker | 280 | 7 | 2 | 1 | 0 | 127 |

KNOWN = run_tests (E fires on 189/287 = 66%). HELD-OUT = file_marker/sentinel
(E == 0 exactly) and stdout_marker (E = 2/280 = 0.7%, the double-hack leak).
v_hack + pairset are built only from run_tests (heldout_known_runtests.json,
5 pairs; v_hack_a5_runtests.safetensors). file_marker is exploited 97/238 = 41%
of the time yet is invisible to E, so the weak-detector regime is real. The 0.7%
stdout leak is zeroed by --gate-anchor-teacher-only (jobs 111-113).
2. Retracted the "null_city placebo is CONTAMINATED (20% modules align)" framing
that I had written as fact (train.py + make_pairsets.py comments + Haar job labels).
Haar's ~0 cos is concentration of measure (out-of-subspace, std ~ 1/sqrt(d)), not
evidence of a "cleaner placebo". Semantic placebos are IN-subspace and share generic
structure, so a nonzero cos is the expected floor, and null_city's high-cos modules
are plausibly low-rank-module artifacts. Cosine is correlational; the ablation is
the causal test. Haar now tests "must v_grad be in-subspace at all?"; the semantic
fleet tests "must it point at the hack specifically?".
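
Sanity sketch for the two checks above (not the project's tooling; `TABLE`, `rates` and `haar_cos_std` are hypothetical names, and d=256 is an illustrative dimension, not the model's). It recomputes the quoted percentages from the job 95 counts and samples the cosine between a uniformly random unit vector and a fixed axis, which should show a spread of about 1/sqrt(d). Stdlib only.

```python
import math
import random

# Counts copied from the job 95 table above: (n, exploited, hk_E).
TABLE = {
    "run_tests": (287, 149, 189),
    "file_marker": (238, 97, 0),
    "sentinel": (315, 4, 0),
    "stdout_marker": (280, 7, 2),
}


def rates(table):
    """Return {mode: (exploit_rate, e_fire_rate)} as fractions of n."""
    return {m: (ex / n, e / n) for m, (n, ex, e) in table.items()}


def haar_cos_std(d, n_samples=2000, seed=0):
    """Sample std of cos(u, e_1) for u uniform on the unit sphere in R^d.

    A normalised standard-Gaussian vector is uniform on the sphere, so its
    first coordinate is the cosine with the fixed axis e_1.
    """
    rng = random.Random(seed)
    cosines = []
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in x))
        cosines.append(x[0] / norm)
    mean = sum(cosines) / n_samples
    var = sum((c - mean) ** 2 for c in cosines) / (n_samples - 1)
    return math.sqrt(var)


if __name__ == "__main__":
    for mode, (exploit, e_fire) in rates(TABLE).items():
        print(f"{mode:14s} exploited={exploit:6.1%}  E={e_fire:6.1%}")
    d = 256  # illustrative only
    print(f"d={d}: std(cos)={haar_cos_std(d):.4f}  1/sqrt(d)={1 / math.sqrt(d):.4f}")
```

The sampled std only illustrates the out-of-subspace floor for a random direction; it says nothing about what an in-subspace semantic placebo should score, which is why the ablation stays the causal test.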
## 2026-06-05 (g) — placebo non-directionality is MEASURED (hkgap), not just inferred; + A5 leak is double-hacks not detector error
Two clarifications prompted by review questions today; neither changes a number, both make a load-bearing claim auditable.