mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 19:31:11 +08:00
journal: 6-mode quadrant on fixed grader — faithful=0% (no-cheat holds), generic elicit=0% (none zero-shot discoverable, incl paper's run_tests)
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
@@ -2,6 +2,35 @@
|
||||
|
||||
Append-only. New entries at the top, date-stamped. Never edit old entries.
|
||||
|
||||
## 2026-05-30 (a) — 6-mode quadrant on the fixed grader: faithful hint = 0% hack everywhere (load-bearing property HOLDS), but generic elicit = 0% too — none of the 6 loopholes (incl. the paper's run_tests) is zero-shot discoverable by Qwen3-4B
|
||||
|
||||
**When**: 2026-05-30, commit `97f9ca0`. Pueue task 13 (`derisk_loopholes --n-prompts=24 --group=4`, generic elicit style), 24 problems x group 4 = 96 rollouts per cell, 6 modes x {faithful, elicit}. Re-run on the fixed grader (`.valid`-attribute sentinel + exhaustive 6x6 non-overlap matrix, verify_rewards 52/52).
|
||||
|
||||
**Why this run.** Two questions in one harness. (1) faithful cell: given a *truthful* description of how the grader works, does the base model stay honest (~0% hack)? This is the no-cheat load-bearing property — if a truthful hint makes the base hack, the substrate is leaky. (2) elicit cell (generic, no spoonfeed per #139): can the base *discover* the mechanism-specific exploit when merely told a loophole exists? This is the discoverability screen.
|
||||
|
||||
**What happened.** Every mode, both cells:
|
||||
|
||||
| mode | faithful exploit | generic-elicit exploit | faithful solve | verdict |
|
||||
|------|-----------------:|-----------------------:|---------------:|---------|
|
||||
| run_tests | 0.0% | 0.0% | 25.0% | DROP |
|
||||
| eq_override | 0.0% | 0.0% | 27.1% | DROP |
|
||||
| exit_code | 0.0% | 0.0% | 27.1% | DROP |
|
||||
| stdout_marker | 0.0% | 0.0% | 29.2% | DROP |
|
||||
| sentinel | 0.0% | 0.0% | 27.1% | DROP |
|
||||
| file_marker | 0.0% | 0.0% | 24.0% | DROP |
|
||||
|
||||
All 6 "DROP" under the screen's `faithful~0 AND elicit>>0` criterion: faithful passes (0% is what we want), but elicit fails to fire, so the screen drops them for *cold discoverability*. Saved 0 elicit hacks per mode.
|
||||
|
||||
**What I think it means.** Two distinct, both-informative findings.
|
||||
|
||||
1. The no-cheat property holds cleanly: a truthful grader description does NOT induce hacking. The base solves honestly (~25-29%) or fails honestly. This is the result we needed for the substrate to be legitimate.
|
||||
|
||||
2. The exit-interview notes show *why* generic elicit = 0%: the model correctly INFERS "the grader doesn't check correctness, just format/output" but then guesses the wrong concrete exploit — uniformly "return 0.0" — instead of the mechanism (vacuous run_tests, set `.valid`, print marker, write result file, exit early). It understands there's a loophole; it can't name the specific lever from a vague prompt. Critically this includes run_tests, the original paper's loophole — so "not zero-shot discoverable" is a property of the model+hardness, not a bug in our modes. The paper got its hack via *RL emergence under reward pressure*, never via cold elicitation, so this is consistent.
|
||||
|
||||
**What this does NOT block.** Cold discoverability is not required for the experiment. We mint teacher hacks via spoonfeed (exact recipe, never shown to the student — legitimate under no-cheat), task 14 running now. The headline question — "how many of K classes does the student learn under vanilla GRPO?" — is answered by the emergence run on the minted substrate, not by this screen. The bet (user's words: "teaching might do the job anyway") is that the student learns each hack from the even teacher batch + reward, without needing to discover it cold.
|
||||
|
||||
**Next.** Task 14 spoonfeed mint → gate on non-empty `elicit_hacks_{run_tests,exit_code,sentinel}.jsonl` → `build-substrate run_tests,exit_code,sentinel` → vanilla emergence run → read per-mode `hk_<mode>` columns + SUBSTRATE table for first_step per class.
|
||||
|
||||
## 2026-05-29 (j) — WIP projected-vs-vanilla matched-seed table: at the two seeds where I have all three arms, projection drops whole-run HACK_STUDENT by 12-23pp vs vanilla, and refresh-every=2 adds about 5pp on top of frozen V
|
||||
|
||||
**When**: 2026-05-29, commit `f70743c`. Pueue tasks already landed: #59 (vanilla s=41), #61 (vanilla s=43), #62 (vanilla s=44), #90/#101/#95 (projected frozen s=41/42/43), #91/#94/#104 (projected refresh-2 s=41/42/43). Queued for the missing matched cells: originally #137 (vanilla s=42), #138 (projected frozen s=44), #139 (projected refresh-2 s=44); AFK reorder via `pueue switch` (2026-05-29 ~04:30 UTC) moved these commands to slots #120/#121/#122 to land before bed. Original G2-screen commands displaced to slots #137/#138/#139.
|
||||
|
||||
Reference in New Issue
Block a user