mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:30:30 +08:00
journal (h): AFK check -- no-cheat E-by-mode table re-confirmed on job 95; directionality framing corrected
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -2,6 +2,40 @@
Append-only. New entries at the top, date-stamped. Never edit old entries.

## 2026-06-05 (h) -- AFK check: no-cheat E-by-mode table re-confirmed on job 95; directionality framing corrected

AFK hourly check. GPU busy (job 87 none-toff40 at step ~147/200, reward +3.2 to +3.5,
lp_s -0.4 to -0.57, deploy@140 hack=0.516/solve=0.422 -- healthy on-policy
emergence, no collapse). No failures. Directionality batch (jobs 114-123)
staged interleaved (one of each family per priority tier).

Two things verified this check:

1. Load-bearing no-cheat E-by-mode table, re-tabulated fresh from job 95 rollouts
   (out/runs/20260603T202151_fast_vanilla_seed41_harvest_s41/rollouts.jsonl):

| mode          |   n | exploited | hk_E | hk_C | hk_D | gt_pass |
|---------------|----:|----------:|-----:|-----:|-----:|--------:|
| run_tests     | 287 |       149 |  189 |  165 |    9 |      21 |
| file_marker   | 238 |        97 |    0 |    0 |    0 |      55 |
| sentinel      | 315 |         4 |    0 |    0 |    0 |     154 |
| stdout_marker | 280 |         7 |    2 |    1 |    0 |     127 |

KNOWN = run_tests (E fires on 189/287 = 66%). HELD-OUT = file_marker/sentinel (E==0 exactly),
stdout_marker (E=2/280=0.7%, the double-hack leak). v_hack + pairset built only
from run_tests (heldout_known_runtests.json, 5 pairs; v_hack_a5_runtests.safetensors).
file_marker is exploited 41% (97/238) yet invisible to E -- the weak-detector regime is
real. The 0.7% stdout leak is zeroed by --gate-anchor-teacher-only (jobs 111-113).

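The percentages quoted above can be re-derived from the table counts. A minimal sketch, using only the `n`, `exploited` and `hk_E` columns copied from the table; the `COUNTS` dict and the `rates` helper are illustrative names, not part of the repo, and nothing here reads `rollouts.jsonl`:

```python
# Counts copied from the job 95 table: mode -> (n, exploited, hk_E).
# COUNTS and rates() are hypothetical names used only for this check.
COUNTS = {
    "run_tests": (287, 149, 189),
    "file_marker": (238, 97, 0),
    "sentinel": (315, 4, 0),
    "stdout_marker": (280, 7, 2),
}


def rates(counts):
    """Return {mode: (exploit_rate, e_fire_rate)} as fractions of n."""
    return {
        mode: (exploited / n, hk_e / n)
        for mode, (n, exploited, hk_e) in counts.items()
    }


if __name__ == "__main__":
    for mode, (exploit, e_fire) in rates(COUNTS).items():
        print(f"{mode:<14} exploited={exploit:6.1%}  E={e_fire:6.1%}")
```

Keeping the raw counts next to the derived rates makes it easy to spot a stale percentage if the table is re-tabulated from a later job.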

2. Retracted the "null_city placebo is CONTAMINATED (20% modules align)" framing
   I had written as fact (train.py + make_pairsets.py comments + Haar job labels).
   Haar's ~0 cos is concentration of measure (out-of-subspace, std~1/sqrt(d)), not a
   "cleaner placebo"; semantic placebos are IN-subspace and share generic structure
   so a nonzero cos is the expected floor, and null_city's high-cos modules are
   plausibly low-rank-module artifacts. Cosine is correlational; the ablation is the
   causal test. Haar now tests "must v_grad be in-subspace at all?"; the semantic
   fleet tests "must it point at the hack specifically?".

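The concentration-of-measure point can be checked numerically. A minimal sketch, assuming the Haar placebo behaves like a direction drawn uniformly from the unit sphere (the repo's actual sampling is not shown here); `random_unit_vector` and `cosine_std` are hypothetical helpers, and the reference direction is an arbitrary fixed axis rather than the real v_hack:

```python
import math
import random


def random_unit_vector(d, rng):
    """Uniform direction on the sphere: normalise an isotropic Gaussian draw."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_std(d, n_samples=500, seed=0):
    """Empirical std of cos(ref, u) for random unit u in d dimensions."""
    rng = random.Random(seed)
    ref = [1.0] + [0.0] * (d - 1)  # arbitrary fixed unit direction (assumption)
    cosines = []
    for _ in range(n_samples):
        u = random_unit_vector(d, rng)
        cosines.append(sum(a * b for a, b in zip(ref, u)))
    mean = sum(cosines) / n_samples
    var = sum((c - mean) ** 2 for c in cosines) / n_samples
    return math.sqrt(var)


if __name__ == "__main__":
    for d in (16, 64, 256):
        print(d, round(cosine_std(d), 4), round(1 / math.sqrt(d), 4))
```

The empirical spread should track 1/sqrt(d), so a near-zero cosine for a random direction in a high-dimensional space says nothing about how clean it is as a placebo.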

## 2026-06-05 (g) -- placebo non-directionality is MEASURED (hkgap), not just inferred; + A5 leak is double-hacks not detector error

Two clarifications prompted by review questions today; neither changes a number, both make a load-bearing claim auditable.