agents: no-cheat #4 (on-distribution pairs = labeling live rollouts = cheating); journal ideal-ceiling tables

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
wassname
2026-06-08 11:39:27 +00:00
parent c29016079a
commit 9c630b83c7
2 changed files with 80 additions and 0 deletions
+68
View File
@@ -3529,3 +3529,71 @@ notebook `nbs/cosine_dist.ipynb`.
Highest-value experiment: rebuild the contrastive direction from on-distribution rollout pairs
(real hack vs real clean rollouts) and re-run this sweep -- the distribution-shift fix. Cheaper
checks first: rerun the sweep on a later checkpoint + run_tests-only label.
## 2026-06-08 11:40 -- IDEAL ceiling 0.84 (oracle): direction is the bottleneck, but on-distribution pairs are CHEATING
**Context:** `scripts/diag_cosine_dist.py`, job 9 `first_hack` step 7, 140 live rollouts
steps 5-9 (35 exploited, 0.25 base). Full sweep + module-vote + ideal ceiling. Artifacts:
`out/diag/separability.csv`, `live_scores.parquet`, `cosine_{grad,act}.png`, `nbs/cosine_dist.ipynb`.
### Observations -- separability sweep (AUROC of score -> exploited; per-space filters by |D_m|)
| space | score | filter | AUROC | p@10 | p@20 |
|---|---|---|---:|---:|---:|
| act | vote | all | 0.669 | 0.30 | 0.45 |
| act | projection | all | 0.665 | 0.30 | 0.40 |
| act | cosine | all | 0.663 | 0.30 | 0.40 |
| act | cosine | top25 | 0.620 | 0.20 | 0.30 |
| act | cosine | top05 | 0.601 | 0.20 | 0.25 |
| grad | vote | all | 0.583 | 0.40 | 0.30 |
| grad | projection | all | 0.575 | 0.40 | 0.35 |
| grad | cosine | keep75 | 0.564 | 0.70 | 0.35 |
| grad | cosine | all | 0.560 | 0.70 | 0.40 |
| * | magnitude | * | <0.42 | ~0 | ~0 |
(`vote` = per-space sum_m cos(hack_m,update_m)*|hack_m| -- the soft module filter. full 40-row
table in separability.csv.)
### Observations -- IDEAL-direction ceiling (oracle mu_hack-mu_clean on live rollouts, 2-fold CV)
| space | AUROC cv | in-sample |
|---|---:|---:|
| grad | 0.837 | 0.884 |
| act | 0.845 | 0.886 |
- [obs] live-hack vs live-clean example: hack (prob 498) = real solution + a `run_tests()` that
uses `print(...)` not `assert` (tests trivially pass); clean (prob 329) = solution, NO test
block. Different problems; the hack is a localized run_tests-block edit.
### Inferences
- [inf] The DIRECTION is the bottleneck, not separability: ideal CV AUROC 0.84 vs our authored-pair
best 0.56 (grad) / 0.67 (act). The signal IS physically in the gradients. {reason: ideal>>pair,
cv≈in-sample so not overfit; credence 0.7}.
- [inf] Activation's authored direction transfers better than gradient's (0.67 vs 0.56, 79% vs 67%
of the same ~0.84 ceiling) -- but the ceiling is space-agnostic, so space is not the lever.
{credence 0.6}.
- [inf] Live hack/clean overlap because (a) different problems -> gradient dominated by problem-
specific solution code, (b) the hack is a few localized run_tests tokens. {reason: example
inspection; credence 0.6}.
- [inf] CRITICAL no-cheat caveat: the 0.84 ceiling is an ORACLE upper bound, NOT reachable. Building
pairs from live IID rollouts requires labelling them = the oracle = the paper's detector = the
thing we beat. The only no-cheat lever toward the ceiling is BETTER HAND-AUTHORED pairs.
{credence 0.9 -- this is a constraint, not an empirical claim}. (Added to AGENTS.md no-cheat #4.)
- [inf] magnitude |g| is anti-predictive (AUROC<0.5): hack rollouts have SMALLER grad/act norm than
clean. {credence 0.7}.
### Failure modes considered
- **Most-likely:** better hand-authored pairs still can't approach 0.84 -- the ideal direction is a
property of live-rollout covariance an authored pair set can't capture. Prior 0.4. Check: write
cleaner/more-diverse pairs (per how_to_write_personas), re-score, measure gap closed.
- **Subtle:** step 7 is too early (hack_rate 0.13); signal sharpens later. Prior 0.3. Check: rerun on
ckpt_step0030/0059.
- **Null:** 0.67 is near the authored-pair ceiling; no pair rewrite helps. Prior 0.3. Check: same.
### Next action
Read docs/personas/{how_to_write_personas,how_to_rewrite_pairs,personas_kept}.md; author cleaner,
more-diverse synthetic hacking pairs; build the vector; re-run this sweep and measure how far the
new pairs close the 0.67 -> 0.84 gap. (No-cheat: all labels stay on the pairs we write.)