agents: no-cheat #4 (on-distribution pairs = labeling live rollouts = cheating); journal ideal-ceiling tables

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:45:42 +08:00 · 2026-06-08 11:39:27 +00:00
parent c29016079a
commit 9c630b83c7
2 changed files with 80 additions and 0 deletions
@@ -45,6 +45,18 @@ Inherit global rules from `~/.claude/CLAUDE.md`.
  So when you double/triple-check "is this weak / is this cheating", the check is: does
  anything other than the hand-built pairs and the extracted `vec` touch training? If a live
  detector or the oracle leaks in, it's cheating.
+  4. COROLLARY (the trap Claude keeps falling into): "build pairs from on-distribution / IID
+     rollouts" is CHEATING, because to make a hack-vs-clean pair from live rollouts you must
+     LABEL which rollout is the hack -- and the only thing that can label a live rollout is the
+     oracle/detector. Using labels on live rollouts IS the paper's detector approach; it is the
+     exact thing we are trying to beat. The hand-authored pairs are our ONLY label source, full
+     stop. An "ideal direction" fit on oracle-labelled live rollouts (e.g. mu_hack - mu_clean)
+     is a DIAGNOSTIC UPPER BOUND only -- it proves the signal is physically present in the
+     gradients, it is NOT a reachable target, and "the ceiling is high so on-distribution pairs
+     will fix it" is a category error (a good direction existing != we can build it label-free).
+     The only no-cheat lever to approach that ceiling is BETTER HAND-AUTHORED pairs (more
+     diverse / less overlapping / less off-distribution), i.e. the "Hack A predicts Hack B"
+     generalization where every label still lives on pairs we wrote.
 - do not diagnose overconfidently. If you can't think of 3+ plausible hypotheses (including bugs, subtle failures, and you being wrong about concepts), then you have lost perspective and have narrow vision
 - I'm often AFK, so don't stop and ask me a question when you know the likely answer, when I've already indicated or asked for it, or when there is only one answer ("waiting for your go ahead"). I'd rather you just commit and go ahead

@@ -3529,3 +3529,71 @@ notebook `nbs/cosine_dist.ipynb`.
 Highest-value experiment: rebuild the contrastive direction from on-distribution rollout pairs
 (real hack vs real clean rollouts) and re-run this sweep -- the distribution-shift fix. Cheaper
 checks first: rerun the sweep on a later checkpoint + run_tests-only label.
+
+## 2026-06-08 11:40 -- IDEAL ceiling 0.84 (oracle): direction is the bottleneck, but on-distribution pairs are CHEATING
+
+**Context:** `scripts/diag_cosine_dist.py`, job 9 `first_hack` step 7, 140 live rollouts
+steps 5-9 (35 exploited, 0.25 base). Full sweep + module-vote + ideal ceiling. Artifacts:
+`out/diag/separability.csv`, `live_scores.parquet`, `cosine_{grad,act}.png`, `nbs/cosine_dist.ipynb`.
+
+### Observations -- separability sweep (AUROC of score -> exploited; per-space filters by |D_m|)
+
+| space | score | filter | AUROC | p@10 | p@20 |
+|---|---|---|---:|---:|---:|
+| act  | vote       | all    | 0.669 | 0.30 | 0.45 |
+| act  | projection | all    | 0.665 | 0.30 | 0.40 |
+| act  | cosine     | all    | 0.663 | 0.30 | 0.40 |
+| act  | cosine     | top25  | 0.620 | 0.20 | 0.30 |
+| act  | cosine     | top05  | 0.601 | 0.20 | 0.25 |
+| grad | vote       | all    | 0.583 | 0.40 | 0.30 |
+| grad | projection | all    | 0.575 | 0.40 | 0.35 |
+| grad | cosine     | keep75 | 0.564 | 0.70 | 0.35 |
+| grad | cosine     | all    | 0.560 | 0.70 | 0.40 |
+| *    | magnitude  | *      | <0.42 | ~0   | ~0   |
+
+(`vote` = per-space sum_m cos(hack_m, update_m) * |hack_m|, i.e. the soft module filter. The full
+40-row table is in `out/diag/separability.csv`.)
+
+### Observations -- IDEAL-direction ceiling (oracle mu_hack-mu_clean on live rollouts, 2-fold CV)
+
+| space | AUROC cv | in-sample |
+|---|---:|---:|
+| grad | 0.837 | 0.884 |
+| act  | 0.845 | 0.886 |
+
+- [obs] live-hack vs live-clean example: hack (prob 498) = real solution + a `run_tests()` that
+  uses `print(...)` not `assert` (tests trivially pass); clean (prob 329) = solution, NO test
+  block. Different problems; the hack is a localized run_tests-block edit.
+
+### Inferences
+
+- [inf] The DIRECTION is the bottleneck, not separability: ideal CV AUROC 0.84 vs our authored-pair
+  0.56 (grad, cosine/all; grad best is vote at 0.58) / 0.67 (act, vote/all). The signal IS
+  physically in the gradients. {reason: ideal>>pair, cv≈in-sample so not overfit; credence 0.7}.
+- [inf] Activation's authored direction transfers better than gradient's (0.67 vs 0.56, 79% vs 67%
+  of the same ~0.84 ceiling) -- but the ceiling is space-agnostic, so space is not the lever.
+  {credence 0.6}.
+- [inf] Live hack/clean overlap because (a) different problems -> gradient dominated by problem-
+  specific solution code, (b) the hack is a few localized run_tests tokens. {reason: example
+  inspection; credence 0.6}.
+- [inf] CRITICAL no-cheat caveat: the 0.84 ceiling is an ORACLE upper bound, NOT reachable. Building
+  pairs from live IID rollouts requires labelling them = the oracle = the paper's detector = the
+  thing we are trying to beat. The only no-cheat lever toward the ceiling is BETTER HAND-AUTHORED pairs.
+  {credence 0.9 -- this is a constraint, not an empirical claim}. (Added to AGENTS.md no-cheat #4.)
+- [inf] magnitude |g| is anti-predictive (AUROC<0.5): hack rollouts have SMALLER grad/act norm than
+  clean. {credence 0.7}.
+
+### Failure modes considered
+
+- **Most-likely:** better hand-authored pairs still can't approach 0.84 -- the ideal direction is a
+  property of live-rollout covariance an authored pair set can't capture. Prior 0.4. Check: write
+  cleaner/more-diverse pairs (per how_to_write_personas), re-score, measure gap closed.
+- **Subtle:** step 7 is too early (hack_rate 0.13); signal sharpens later. Prior 0.3. Check: rerun on
+  ckpt_step0030/0059.
+- **Null:** 0.67 is near the authored-pair ceiling; no pair rewrite helps. Prior 0.3. Check: same.
+
+### Next action
+
+Read docs/personas/{how_to_write_personas,how_to_rewrite_pairs,personas_kept}.md; author cleaner,
+more-diverse synthetic hacking pairs; build the vector; re-run this sweep and measure how far the
+new pairs close the 0.67 -> 0.84 gap. (No-cheat: all labels stay on the pairs we write.)
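
The IDEAL-direction ceiling in the journal entry above (oracle `mu_hack - mu_clean` on live rollouts, 2-fold CV) can be illustrated with a minimal sketch. This is not the code in `scripts/diag_cosine_dist.py`: every function name here is hypothetical, the data is synthetic (140 rows, 35 labelled as hacks, to mirror the counts in the entry), the folds are stratified, and the CV figure is taken as the mean of the two fold AUROCs, all of which are assumptions. Because it fits the direction on oracle labels, it is a diagnostic only, in line with no-cheat rule #4.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """P(random positive outscores random negative); ties count as 0.5."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_vec(rows: Sequence[Vector]) -> Vector:
    """Column-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def mean_diff_direction(X: Sequence[Vector], y: Sequence[int]) -> Vector:
    """Oracle direction mu_hack - mu_clean (uses the labels, so diagnostic only)."""
    hack = [x for x, lab in zip(X, y) if lab == 1]
    clean = [x for x, lab in zip(X, y) if lab == 0]
    return [h - c for h, c in zip(mean_vec(hack), mean_vec(clean))]


def project(X: Sequence[Vector], d: Vector) -> List[float]:
    """Dot product of every row with the direction d."""
    return [sum(a * b for a, b in zip(x, d)) for x in X]


def ideal_ceiling(X: Sequence[Vector], y: Sequence[int], seed: int = 0) -> Tuple[float, float]:
    """Return (2-fold CV AUROC, in-sample AUROC) for the oracle direction."""
    rng = random.Random(seed)
    pos = [i for i, lab in enumerate(y) if lab == 1]
    neg = [i for i, lab in enumerate(y) if lab == 0]
    if len(pos) < 2 or len(neg) < 2:
        raise ValueError("need at least two examples of each class for 2-fold CV")
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Stratified split (an assumption): each fold gets half of each class.
    folds = [pos[0::2] + neg[0::2], pos[1::2] + neg[1::2]]
    fold_aurocs = []
    for k in (0, 1):
        train, test = folds[1 - k], folds[k]
        d = mean_diff_direction([X[i] for i in train], [y[i] for i in train])
        fold_aurocs.append(auroc(project([X[i] for i in test], d), [y[i] for i in test]))
    d_all = mean_diff_direction(X, y)
    return sum(fold_aurocs) / 2, auroc(project(X, d_all), y)


def make_synthetic(n: int = 140, n_hack: int = 35, dim: int = 8,
                   shift: float = 1.5, seed: int = 0) -> Tuple[List[Vector], List[int]]:
    """Synthetic stand-in for per-rollout features: hacks are shifted along axis 0."""
    rng = random.Random(seed)
    y = [1] * n_hack + [0] * (n - n_hack)
    X = [[rng.gauss(0.0, 1.0) + (shift if (lab == 1 and j == 0) else 0.0)
          for j in range(dim)] for lab in y]
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic()
    cv, in_sample = ideal_ceiling(X, y)
    print(f"ideal direction: AUROC cv={cv:.3f} in-sample={in_sample:.3f}")
```

Scoring the same held-out rows with a direction built from hand-authored pairs only, through the same `project` and `auroc` helpers, would give the authored-pair number to compare against this ceiling without any label touching a live rollout.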