agents: no-cheat #4 (on-distribution pairs = labeling live rollouts = cheating); journal ideal-ceiling tables

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-08 11:39:27 +00:00
parent c29016079a
commit 9c630b83c7
2 changed files with 80 additions and 0 deletions
+12
@@ -45,6 +45,18 @@ Inherit global rules from `~/.claude/CLAUDE.md`.
So when you double/triple-check "is this weak / is this cheating", the check is: does
anything other than the hand-built pairs and the extracted `vec` touch training? If a live
detector or the oracle leaks in, it's cheating.
4. COROLLARY (the trap Claude keeps falling into): "build pairs from on-distribution / IID
rollouts" is CHEATING, because to make a hack-vs-clean pair from live rollouts you must
LABEL which rollout is the hack -- and the only thing that can label a live rollout is the
oracle/detector. Using labels on live rollouts IS the paper's detector approach; it is the
exact thing we are trying to beat. The hand-authored pairs are our ONLY label source, full
stop. An "ideal direction" fit on oracle-labelled live rollouts (e.g. mu_hack - mu_clean)
is a DIAGNOSTIC UPPER BOUND only -- it proves the signal is physically present in the
gradients, it is NOT a reachable target, and "the ceiling is high so on-distribution pairs
will fix it" is a category error (a good direction existing != we can build it label-free).
The only no-cheat lever to approach that ceiling is BETTER HAND-AUTHORED pairs (more
diverse / less overlapping / less off-distribution), i.e. the "Hack A predicts Hack B"
generalization where every label still lives on pairs we wrote.
- do not diagnose overconfidently. If you can't think of 3+ plausible hypotheses - including bugs, subtle failures, and you being wrong about concepts - then you have lost perspective and narrowed your vision
- I'm often AFK, so don't stop to ask me a question when you know the likely answer, when I've already indicated or asked for it, or when there is only one answer ("waiting for your go-ahead"). I'd rather you just commit and go ahead
+68
@@ -3529,3 +3529,71 @@ notebook `nbs/cosine_dist.ipynb`.
Highest-value experiment: rebuild the contrastive direction from on-distribution rollout pairs
(real hack vs real clean rollouts) and re-run this sweep -- the distribution-shift fix. Cheaper
checks first: rerun the sweep on a later checkpoint + run_tests-only label.
## 2026-06-08 11:40 -- IDEAL ceiling 0.84 (oracle): direction is the bottleneck, but on-distribution pairs are CHEATING
**Context:** `scripts/diag_cosine_dist.py`, job 9 `first_hack` step 7, 140 live rollouts
from steps 5-9 (35 exploited, base rate 0.25). Full sweep + module-vote + ideal ceiling. Artifacts:
`out/diag/separability.csv`, `live_scores.parquet`, `cosine_{grad,act}.png`, `nbs/cosine_dist.ipynb`.
### Observations -- separability sweep (AUROC of score -> exploited; per-space filters by |D_m|)
| space | score | filter | AUROC | p@10 | p@20 |
|---|---|---|---:|---:|---:|
| act | vote | all | 0.669 | 0.30 | 0.45 |
| act | projection | all | 0.665 | 0.30 | 0.40 |
| act | cosine | all | 0.663 | 0.30 | 0.40 |
| act | cosine | top25 | 0.620 | 0.20 | 0.30 |
| act | cosine | top05 | 0.601 | 0.20 | 0.25 |
| grad | vote | all | 0.583 | 0.40 | 0.30 |
| grad | projection | all | 0.575 | 0.40 | 0.35 |
| grad | cosine | keep75 | 0.564 | 0.70 | 0.35 |
| grad | cosine | all | 0.560 | 0.70 | 0.40 |
| * | magnitude | * | <0.42 | ~0 | ~0 |
(`vote` = per-space `sum_m cos(hack_m, update_m) * |hack_m|` -- the soft module filter. Full 40-row
table in `out/diag/separability.csv`.)
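
The `vote` score defined above can be sketched as follows. This is a minimal, dependency-free illustration and not the code in `scripts/diag_cosine_dist.py`: the function names `cosine` and `vote_score`, and the representation of each space as a dict from module name to a flat list of floats, are assumptions made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all-zero."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def vote_score(hack_by_module, update_by_module):
    """sum_m cos(hack_m, update_m) * |hack_m| over modules present in both dicts.

    hack_by_module:   hypothetical {module_name: authored hack direction}
    update_by_module: hypothetical {module_name: per-rollout gradient or activation}
    """
    score = 0.0
    for name, hack_m in hack_by_module.items():
        update_m = update_by_module.get(name)
        if update_m is None:
            continue  # module missing from this rollout: it casts no vote
        hack_norm = math.sqrt(sum(x * x for x in hack_m))
        score += cosine(hack_m, update_m) * hack_norm
    return score
```

Each term equals `dot(hack_m, update_m) / |update_m|`, so modules with a larger authored-direction norm carry more weight without any module being dropped outright, which is presumably what "soft module filter" refers to.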
### Observations -- IDEAL-direction ceiling (oracle mu_hack-mu_clean on live rollouts, 2-fold CV)
| space | AUROC cv | in-sample |
|---|---:|---:|
| grad | 0.837 | 0.884 |
| act | 0.845 | 0.886 |
- [obs] live-hack vs live-clean example: hack (prob 498) = real solution + a `run_tests()` that
uses `print(...)` not `assert` (tests trivially pass); clean (prob 329) = solution, NO test
block. Different problems; the hack is a localized run_tests-block edit.
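
The ceiling numbers above can be reproduced in outline as below. This is a sketch under stated assumptions, not the actual diagnostic script: the helper names, the stratified fold assignment, and averaging the per-fold AUROC are all hypothetical choices (the entry only says "2-fold CV"). It uses oracle labels on live rollouts, so per no-cheat #4 it is a diagnostic only and must never feed training.

```python
import random


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_vec(rows):
    """Column-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def ideal_direction(feats, labels):
    """Oracle direction mu_hack - mu_clean (labels: 1 = exploited, 0 = clean)."""
    mu_hack = mean_vec([f for f, y in zip(feats, labels) if y == 1])
    mu_clean = mean_vec([f for f, y in zip(feats, labels) if y == 0])
    return [h - c for h, c in zip(mu_hack, mu_clean)]


def project(feats, direction):
    """Dot product of each feature vector with the direction."""
    return [sum(x * d for x, d in zip(f, direction)) for f in feats]


def stratified_folds(labels, n_folds, seed):
    """Shuffle each class separately and deal its members round-robin into folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for j, i in enumerate(members):
            folds[j % n_folds].append(i)
    return folds


def ceiling_cv(feats, labels, n_folds=2, seed=0):
    """Mean held-out AUROC: fit the direction on the other folds, score this fold.

    Assumes at least n_folds examples of each class.
    """
    folds = stratified_folds(labels, n_folds, seed)
    aucs = []
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        d = ideal_direction([feats[i] for i in train], [labels[i] for i in train])
        scores = project([feats[i] for i in test], d)
        aucs.append(auroc(scores, [labels[i] for i in test]))
    return sum(aucs) / len(aucs)
```

The in-sample column corresponds to fitting and scoring on the same rollouts, i.e. `auroc(project(feats, ideal_direction(feats, labels)), labels)`; the size of the gap between that and `ceiling_cv` is the overfitting check cited in the inferences.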
### Inferences
- [inf] The DIRECTION is the bottleneck, not separability: ideal CV AUROC 0.84 vs our authored-pair
best 0.58 (grad, vote) / 0.67 (act, vote). The signal IS physically in the gradients. {reason: ideal>>pair,
cv≈in-sample so not overfit; credence 0.7}.
- [inf] Activation's authored direction transfers better than gradient's (0.67 vs 0.58, 79% vs 70%
of the same ~0.84 ceiling) -- but the ceiling is space-agnostic, so space is not the lever.
{credence 0.6}.
- [inf] Live hack/clean overlap because (a) different problems -> gradient dominated by problem-
specific solution code, (b) the hack is a few localized run_tests tokens. {reason: example
inspection; credence 0.6}.
- [inf] CRITICAL no-cheat caveat: the 0.84 ceiling is an ORACLE upper bound, NOT reachable. Building
pairs from live IID rollouts requires labelling them = the oracle = the paper's detector = the
thing we beat. The only no-cheat lever toward the ceiling is BETTER HAND-AUTHORED pairs.
{credence 0.9 -- this is a constraint, not an empirical claim}. (Added to AGENTS.md no-cheat #4.)
- [inf] magnitude |g| is anti-predictive (AUROC<0.5): hack rollouts have SMALLER grad/act norm than
clean. {credence 0.7}.
### Failure modes considered
- **Most-likely:** better hand-authored pairs still can't approach 0.84 -- the ideal direction is a
property of live-rollout covariance an authored pair set can't capture. Prior 0.4. Check: write
cleaner/more-diverse pairs (per how_to_write_personas), re-score, measure gap closed.
- **Subtle:** step 7 is too early (hack_rate 0.13); signal sharpens later. Prior 0.3. Check: rerun on
ckpt_step0030/0059.
- **Null:** 0.67 is near the authored-pair ceiling; no pair rewrite helps. Prior 0.3. Check: same.
### Next action
Read docs/personas/{how_to_write_personas,how_to_rewrite_pairs,personas_kept}.md; author cleaner,
more-diverse synthetic hacking pairs; build the vector; re-run this sweep and measure how far the
new pairs close the 0.67 -> 0.84 gap. (No-cheat: all labels stay on the pairs we write.)
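
One way to report "how far the new pairs close the gap" is as a fraction of the 0.67 -> 0.84 interval. The helper below is a hypothetical convenience, not part of the repo; its name and defaults are assumptions taken from the numbers in this entry.

```python
def gap_closed(new_auroc, baseline=0.67, ceiling=0.84):
    """Fraction of the baseline-to-ceiling AUROC gap recovered by a new pair set.

    0.0 means no improvement over the current authored pairs, 1.0 means the
    oracle ceiling was matched; negative values mean the new pairs are worse.
    """
    if ceiling <= baseline:
        raise ValueError("ceiling must exceed baseline")
    return (new_auroc - baseline) / (ceiling - baseline)
```

For example, a re-scored AUROC of 0.755 would correspond to closing about half the gap. The ceiling itself comes from oracle labels, so it only serves as a yardstick here; every label used to build the vector still lives on authored pairs.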