agents: no-cheat #4 (on-distribution pairs = labeling live rollouts = cheating); journal ideal-ceiling tables

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-27 16:45:42 +08:00 · 2026-06-08 11:39:27 +00:00
parent c29016079a
commit 9c630b83c7
2 changed files with 80 additions and 0 deletions
@@ -3529,3 +3529,71 @@ notebook `nbs/cosine_dist.ipynb`.
 Highest-value experiment: rebuild the contrastive direction from on-distribution rollout pairs
 (real hack vs real clean rollouts) and re-run this sweep -- the distribution-shift fix. Cheaper
 checks first: rerun the sweep on a later checkpoint + run_tests-only label.
+
+## 2026-06-08 11:40 -- IDEAL ceiling 0.84 (oracle): direction is the bottleneck, but on-distribution pairs are CHEATING
+
+**Context:** `scripts/diag_cosine_dist.py`, job 9 `first_hack` step 7, 140 live rollouts
+steps 5-9 (35 exploited, 0.25 base). Full sweep + module-vote + ideal ceiling. Artifacts:
+`out/diag/separability.csv`, `live_scores.parquet`, `cosine_{grad,act}.png`, `nbs/cosine_dist.ipynb`.
+
+### Observations -- separability sweep (AUROC of score -> exploited; per-space filters by |D_m|)
+
+| space | score | filter | AUROC | p@10 | p@20 |
+|---|---|---|---:|---:|---:|
+| act  | vote       | all    | 0.669 | 0.30 | 0.45 |
+| act  | projection | all    | 0.665 | 0.30 | 0.40 |
+| act  | cosine     | all    | 0.663 | 0.30 | 0.40 |
+| act  | cosine     | top25  | 0.620 | 0.20 | 0.30 |
+| act  | cosine     | top05  | 0.601 | 0.20 | 0.25 |
+| grad | vote       | all    | 0.583 | 0.40 | 0.30 |
+| grad | projection | all    | 0.575 | 0.40 | 0.35 |
+| grad | cosine     | keep75 | 0.564 | 0.70 | 0.35 |
+| grad | cosine     | all    | 0.560 | 0.70 | 0.40 |
+| *    | magnitude  | *      | <0.42 | ~0   | ~0   |
+
+(`vote` = per-space sum_m cos(hack_m, update_m) * |hack_m|, i.e. the soft module filter. The full
+40-row table is in `out/diag/separability.csv`.)
+
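A minimal sketch of the `vote` score and the p@k column, assuming each rollout yields one vector per module and that `hack_m` is the authored contrastive direction for module m. The function names, the dict-of-lists layout, and the module names are hypothetical; `scripts/diag_cosine_dist.py` may compute these differently.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def vote_score(hack_dirs, updates):
    """sum_m cos(hack_m, update_m) * |hack_m| over the modules in hack_dirs.

    hack_dirs / updates: dict mapping module name -> list of floats
    (an assumed layout, used for illustration only).
    """
    total = 0.0
    for name, hack_m in hack_dirs.items():
        update_m = updates[name]
        h, u = norm(hack_m), norm(update_m)
        if h == 0.0 or u == 0.0:
            continue  # cosine undefined for a zero vector; module casts no vote
        total += dot(hack_m, update_m) / (h * u) * h
    return total


def precision_at_k(scores, labels, k):
    """Fraction of exploited rollouts (label 1) among the k highest scores."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


# Toy example: two modules with direction norms 5 and 1.
hack_dirs = {"mlp.0": [3.0, 4.0], "attn.0": [0.0, 1.0]}
aligned = {"mlp.0": [6.0, 8.0], "attn.0": [0.0, 2.0]}
opposed = {"mlp.0": [-3.0, -4.0], "attn.0": [1.0, 0.0]}
aligned_vote = vote_score(hack_dirs, aligned)
opposed_vote = vote_score(hack_dirs, opposed)
```

Because each module's cosine is weighted by |hack_m|, modules where the authored direction is large dominate the sum, which is why the score acts as a soft filter rather than a hard top-k cut.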
+### Observations -- IDEAL-direction ceiling (oracle mu_hack-mu_clean on live rollouts, 2-fold CV)
+
+| space | AUROC cv | in-sample |
+|---|---:|---:|
+| grad | 0.837 | 0.884 |
+| act  | 0.845 | 0.886 |
+
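The oracle ceiling can be sketched as follows, assuming one feature vector per rollout and a stratified 2-fold split (whether the journal's split was stratified is not stated). All names are hypothetical. The key point is that the direction is fit on the labels of one fold, which is exactly what makes it an oracle.

```python
import random


def mean_vec(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def auroc(scores, labels):
    """P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def oracle_scores(train_x, train_y, test_x):
    """Project test rollouts onto mu_hack - mu_clean fit on the train fold."""
    hack = [x for x, y in zip(train_x, train_y) if y == 1]
    clean = [x for x, y in zip(train_x, train_y) if y == 0]
    direction = [h - c for h, c in zip(mean_vec(hack), mean_vec(clean))]
    return [sum(d * v for d, v in zip(direction, x)) for x in test_x]


def two_fold_cv_auroc(xs, ys, seed=0):
    """Mean held-out AUROC over two stratified folds (needs >= 2 per class)."""
    rng = random.Random(seed)
    folds = ([], [])
    for cls in (0, 1):
        idx = [i for i, y in enumerate(ys) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % 2].append(i)
    aucs = []
    for test, train in (folds, folds[::-1]):
        scores = oracle_scores(
            [xs[i] for i in train], [ys[i] for i in train], [xs[i] for i in test]
        )
        aucs.append(auroc(scores, [ys[i] for i in test]))
    return sum(aucs) / 2


# Toy data: clean rollouts near x0 = 0, hack rollouts near x0 = 10.
xs = [[0.0, 0.1 * i] for i in range(4)] + [[10.0, 0.1 * i] for i in range(4)]
ys = [0] * 4 + [1] * 4
cv_auroc = two_fold_cv_auroc(xs, ys)
```

The in-sample column corresponds to fitting and scoring on the same rollouts; comparing it with the held-out number is what the overfit check rests on.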
+- [obs] live-hack vs live-clean example: hack (prob 498) = real solution + a `run_tests()` that
+  uses `print(...)` not `assert` (tests trivially pass); clean (prob 329) = solution, NO test
+  block. Different problems; the hack is a localized run_tests-block edit.
+
+### Inferences
+
+- [inf] The DIRECTION is the bottleneck, not separability: ideal CV AUROC 0.84 vs our authored-pair
+  best 0.58 (grad, vote) / 0.67 (act, vote). The signal IS physically in the gradients. {reason:
+  ideal>>pair; cv (0.84) ≈ in-sample (0.88), so the ceiling is not an overfit artifact; credence 0.7}.
+- [inf] Activation's authored direction transfers better than gradient's (0.67 vs 0.58, i.e. 79% vs
+  70% of the same ~0.84 ceiling) -- but the ceiling is space-agnostic (0.837 grad vs 0.845 act), so
+  space is not the lever. {credence 0.6}.
+- [inf] Live hack/clean overlap because (a) different problems -> gradient dominated by problem-
+  specific solution code, (b) the hack is a few localized run_tests tokens. {reason: example
+  inspection; credence 0.6}.
+- [inf] CRITICAL no-cheat caveat: the 0.84 ceiling is an ORACLE upper bound, NOT a reachable target.
+  Building pairs from live IID rollouts requires labelling them, and that labeller is the oracle:
+  the paper's detector, the thing we beat. The only no-cheat lever toward the ceiling is BETTER
+  HAND-AUTHORED pairs. {credence 0.9 -- this is a constraint, not an empirical claim}. (Added to
+  AGENTS.md no-cheat #4.)
+- [inf] Magnitude |g| is anti-predictive (AUROC < 0.42 in every space, below the 0.5 chance level):
+  hack rollouts have SMALLER grad/act norm than clean. {credence 0.7}.
+
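The share-of-ceiling arithmetic can be checked with a small helper, using AUROC values from the tables above. The `chance` parameter is an assumption added here, not something the journal uses: with `chance=0.5` the ratio measures above-chance AUROC, which is one possible alternative reading since an uninformative score already sits at 0.5.

```python
def fraction_of_ceiling(auroc, ceiling, chance=0.0):
    """Share of the oracle ceiling reached by an authored-pair score.

    chance=0.0 gives the raw ratio; chance=0.5 measures both numbers
    relative to a random scorer (a hypothetical alternative normalization).
    """
    return (auroc - chance) / (ceiling - chance)


# (authored-pair AUROC, oracle CV AUROC) taken from the two tables.
pairs = {
    "act vote": (0.669, 0.845),
    "grad vote": (0.583, 0.837),
    "grad cosine": (0.560, 0.837),
}

raw = {k: fraction_of_ceiling(a, c) for k, (a, c) in pairs.items()}
above_chance = {k: fraction_of_ceiling(a, c, chance=0.5) for k, (a, c) in pairs.items()}

for name in pairs:
    print(f"{name}: raw {raw[name]:.2f}, above chance {above_chance[name]:.2f}")
```

Under the above-chance normalization the authored directions cover a noticeably smaller share of the gap than the raw ratios suggest, so it is worth stating which convention a "gap closed" number uses when re-scoring new pairs.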
+### Failure modes considered
+
+- **Most-likely:** better hand-authored pairs still can't approach 0.84 -- the ideal direction is a
+  property of live-rollout covariance an authored pair set can't capture. Prior 0.4. Check: write
+  cleaner/more-diverse pairs (per how_to_write_personas), re-score, measure gap closed.
+- **Subtle:** step 7 is too early (hack_rate 0.13); signal sharpens later. Prior 0.3. Check: rerun on
+  ckpt_step0030/0059.
+- **Null:** 0.67 is near the authored-pair ceiling; no pair rewrite helps. Prior 0.3. Check: same
+  as Most-likely (re-score the rewritten pairs, measure gap closed).
+
+### Next action
+
+Read docs/personas/{how_to_write_personas,how_to_rewrite_pairs,personas_kept}.md; author cleaner,
+more-diverse synthetic hacking pairs; build the vector; re-run this sweep and measure how far the
+new pairs close the 0.67 -> 0.84 gap. (No-cheat: all labels stay on the pairs we write.)