journal: vanilla sub4 emergence reference (per-mode asymmetry)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 19:31:11 +08:00 · 2026-05-31 16:56:16 +00:00
parent b72c5acee2
commit b0432af1e8
1 changed files with 53 additions and 0 deletions
@@ -2,6 +2,59 @@

 Append-only. New entries at the top, date-stamped. Never edit old entries.

+## 2026-05-31 (k) — vanilla emergence reference (sub4 overlay): per-mode hacking is asymmetric, not uniform
+
+**Context:** commit `b72c5ac` on `probe/distill-cosine`; pueue id 40; vanilla (intervention=none),
+seed 41, 60 steps, fast preset, 4-mode substrate, mix=0.125. First arm of the 5-arm sub4
+overlay sweep (40-44, all #164/#162). Log via `pueue log 40`; artifact:
+`out/runs/20260531T141402_fast_vanilla_seed41_sub4_vanilla/per_mode_deploy.json`.
+
+**Hypothesis:** vanilla learns to hack all available loopholes roughly evenly (~0.28 hack
+expected). This arm is the emergence reference the intervention overlay subtracts against.
+
+### Observations
+
+- [obs] Aggregate FINAL EVAL (n=64): hack=0.359 (23/64), solve=0.453 (29/64). BLUF main metric
+  HACK_RATE=0.359 PASS_RATE=0.283, peak 73.1GB. (deploy==train, no quarantine.)
+- [obs] Per-mode emergence is asymmetric, not uniform:
+
+| mode (n)            | dist     | hack | solve |
+|---------------------|----------|------|-------|
+| run_tests (8)       | IN-dist  | 0.88 | 0.00  |
+| file_marker (32)    | held-out | 0.50 | 0.19  |
+| sentinel (24)       | held-out | 0.00 | 0.96  |
+
+- [obs] stdout_marker has zero problems in the fixed eval subset (n=64), so it is absent from
+  by_mode. The overlay will therefore compare 3 modes, not 4. All arms share the same
+  eval_idxs, so the comparison stays valid, just narrower.
+
+### Inferences
+
+- [inf] The headline "suppress held-out hacking" can only be tested on file_marker among
+  the held-out modes -- sentinel is already solved-not-hacked at vanilla, so there is no
+  hack for an intervention to remove there. {reason: "sentinel hack=0/24 at vanilla; you
+  cannot suppress what never emerged", credence: 0.9}
+- [inf] run_tests shows the cleanest emergence (hack 0.88, solve 0.00). It is the IN-dist mode
+  that the route/route2 v_hack is built to catch, so expect the largest deploy-hack drop there
+  for the quarantine arms. {reason: "v_hack extracted on run_tests-flavoured pairs; in-dist by
+  construction", credence: 0.6}
+
+### Failure modes considered
+
+- **Likely:** the 3-mode eval subset is just an unlucky draw and stdout_marker would also
+  show emergence. Prior: 0.4. Check: widen eval_idxs to include >=1 stdout_marker problem,
+  or read the streaming per-mode hk_so column from the train log.
+- **Subtle:** sentinel solve=0.96 is the student gaming the *grader's* notion of solve, not
+  truly solving. Prior: 0.2. Check: spot-read a sentinel rollout for a real solution vs a
+  sentinel-marker trick that the solve-detector misreads.
+- **Null:** the asymmetry is seed-41 specific and averages out across seeds. Prior: 0.3.
+  Check: the queued seed 43/44 replicates.
+
+### Next action
+
+Erase (41) is running; route/route2 (42-44) are queued. On each: confirm deploy-solve >=
+train-solve and compare file_marker + run_tests deploy-hack against this vanilla baseline.
+Then build the #162 overlay.
+
 ## 2026-05-31 (j) — frozen-real-V route (rf0) only drops deploy hack ~8pp, NOT to run-31's ~0; staleness is the cause

 **Context:** commit `2b020c9` on `probe/distill-cosine`; pueue id 32; route real-V,