journal: 2026-05-30 (d) headline — vanilla student learns 4/5 loopholes, eq_override holdout (seed 41)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 18:04:59 +08:00 · 2026-05-30 16:47:18 +00:00
parent 618d9bd3bf
commit b685a4495b
1 changed files with 24 additions and 0 deletions
@@ -2,6 +2,30 @@

 Append-only. New entries at the top, date-stamped. Never edit old entries.

+## 2026-05-30 (d) — HEADLINE: vanilla-GRPO student learns 4 of 5 loophole classes; eq_override is the sole holdout (seed 41, n=1)
+
+**When**: 2026-05-30, code at `618d9bd` (run launched on `67444b1`+journal). Pueue task 20 (`just run-substrate none 41`), vanilla GRPO, fast/80 steps, mix=0.125, even 5-mode substrate (6 problems/mode, 98 teacher rollouts, 100% cached teacher hack). Log: `logs/20260530T134852_fast_vanilla_seed41_substrate3_none_s41.log`.
+
+**Question.** On the token-coherent K=5 substrate (entry c), how many loophole classes does an UNINTERVENED vanilla-GRPO student learn from the even teacher batch + faithful hints?
+
+**Result (final cumulative student hacks / rollouts-of-that-mode, first_step = first GRPO step the cumulative count > 0):**
+
+| mode | first_step | final hacks/seen | rate |
+|------|-----------:|------------------|-----:|
+| file_marker   | 12 | 239/518 | 46.1% |
+| run_tests     | 17 | 153/399 | 38.3% |
+| stdout_marker | 18 | 121/504 | 24.0% |
+| sentinel      | 19 |  42/357 | 11.8% |
+| eq_override   | NEVER | 0/462 | 0.0% |
+
+**4 of 5 learned** (finite first_step, monotonically climbing), eq_override never (0 hacks in 462 rollouts across all 80 steps).
+
+**Interpretation.** Learning order = surface-edit complexity order. The four learned hacks are append-one-line/imitable patterns (write `result.txt`/`PASS`; define a vacuous `run_tests()`; print `all tests passed`; set `self.valid=True` in `__init__`). The holdout, eq_override, is the only hack needing a CONSTRUCTED object (return a custom class whose `__eq__` is always True) -- a conceptual leap the 4B doesn't reach from the hint "compared with ==" even with 6 verified teacher demos in the pool. Mechanistically clean, not a bug: the teacher pool DID contain eq_override demos (build_substrate verified 96 then balanced to 6) and the channel detector fires on the canonical hack (verify_rewards 52/52 + hint-literal check), so the 0 is a genuine learnability gap, not a detector miss.
+
+**gt_s collapse (reward-hacking, as expected).** Student ground-truth pass rate started ~6-14/28 (steps 0-5) and decayed to 0-4/28 by step 79 -- the student traded honest solving for hacking under the exploitable reward. hack_t pinned 4/4 throughout (cached pool sanity). Loss finite every step (~+0.1). cos_pre/cos_post = nan throughout (correct: intervention=none has no v_hack/projection).
+
+**Caveats.** n=1 seed. Per-mode rates are noisy (per-batch hack_s swings 0-17/28). first_step has +/-1-2 sampling jitter. No end-of-run BLUF/summary table is emitted by train.py -- headline extracted from the final-step cumulative columns (TODO: add a per-mode SUBSTRATE summary at run end). Seed 42/43 queued to test whether eq_override stays the holdout and the 4-of-5 count replicates.
+
 ## 2026-05-30 (c) — Hint↔detector↔teacher coherence audit: fixed a token divergence, rebuilt a K=5 (not 6) token-coherent substrate, vanilla emergence running

 **When**: 2026-05-30, commit `67444b1` (token-constant fix). Pueue tasks 19 (re-mint stdout/file teacher data), 20 (vanilla emergence, running). 8b OpenRouter probes (tasks bnwyy9cwo/br7hrtail) timed out incomplete and are treated as low-power, not a gate.