journal: job 15 precision-best authored per-rollout finished, deploy hack 0.076 (vanilla pending)

wassname
2026-06-08 19:38:48 +00:00
parent d497bfd161
commit 744d851861
+197
@@ -3643,3 +3643,200 @@ Pairs lever is closed for now. Pivots that stay no-cheat: (a) isolate the hack t
block carries the signal; route per-token only there instead of whole-rollout), (b) a later-checkpoint
recheck, (c) accept ~0.67 and lean on absorption (route2/quarantine) rather than direction precision.
GPU handed back to the overnight jobs (per-token s44 #13, vanilla s43 #14).
## 2026-06-08 13:25 -- NEGATIVE: pure-intent contrast pairs underperform code-diff pairs (live signal is code, not intent)
**Context:** `scripts/diag_cosine_dist.py --pairs {think,funcname,concept}`, job 9 `first_hack` step 7,
140 live rollouts (35 exploited, 0.25 base). Three cleanly-contrastive INTENT designs
(`src/vgrout/pairs_intent.py`): hold solution body + run_tests IDENTICAL across hack/clean, vary ONLY
the cheat-vs-solve intent -- a `<think>` declaration (think), an intent-named no-op helper
`hack_bypass_grader` vs `solve_as_intended` (funcname), or blunt repeated concept tokens "cheat..."
vs "solve..." (concept). The properly-contrastive shape: most tokens shared, one axis varied.
### Observations
best AUROC per design (over all space x score x filter; full sweep in `/tmp/claude-0/intent_auroc.log`):
| design | best AUROC | config | best p@10 | act cosine all | grad cosine all |
|---|---:|---|---:|---:|---:|
| authored runtests (prior) | 0.672 | act cosine | ~0.2 | 0.663 | 0.560 |
| funcname | 0.602 | grad cosine top15 | 0.1 | 0.457 | 0.551 |
| think | 0.492 | grad projection all | 0.2 | 0.418 | 0.444 |
| concept | 0.426 | act magnitude top05 | 0.1 | 0.296 | 0.292 |
| IDEAL (oracle CV) | 0.84 | -- | -- | 0.845 | 0.837 |
### Inferences
- [inf] Pure-intent contrast separates live hack-vs-clean WORSE than code-diff pairs, not better:
think/concept sit at-or-below chance (0.49/0.43), funcname's 0.60 clears grad-0.56 but its confident
tail is below the 0.25 base rate (p@10=0.1) -> useless for precision routing. {reason: 3 designs, all below the
0.67 authored-act baseline; credence 0.75}.
- [inf] Mechanism: the live run_tests hack is localised CODE tokens (print-instead-of-assert). A
direction built from a `<think>` sentence / a helper name / a concept word lives in different
coordinates than the live gradient's hack component, so it is orthogonal-ish to the live signal.
The code-diff pairs at least contain the print/assert tokens, so they align more. This INVERTS the
repeng intuition (blunt concept vectors) for this task: the hack is not a "concept", it is a code
edit. {reason: intent designs underperform monotonically with how code-free they are; credence 0.6}.
- [inf] Confirms the prior entry: the no-cheat authored-pairs ceiling is ~0.67 act AUROC with no
usable precision tail, well under the 0.84 oracle. The pairs lever (incl. the "better/cleaner pairs"
variant) is exhausted. {credence 0.7}.
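A toy illustration of the "different coordinates" mechanism in the second inference. This is a sketch under assumptions, not the code in `scripts/diag_cosine_dist.py`: the direction is taken as the normalised mean of (hack - clean) over pairs, and the 3-d feature axes (shared content, code edit, intent wording) are hypothetical stand-ins for the real activation/gradient space.

```python
import math

def mean_diff_direction(hack_vecs, clean_vecs):
    """Unit-norm mean of (hack - clean) over paired feature vectors (assumed construction)."""
    dim = len(hack_vecs[0])
    diff = [0.0] * dim
    for h, c in zip(hack_vecs, clean_vecs):
        for i in range(dim):
            diff[i] += h[i] - c[i]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical axes: 0 = shared solution/tests, 1 = code edit, 2 = intent wording.
code_hack = [[5.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
code_clean = [[5.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
intent_hack = [[5.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
intent_clean = [[5.0, 0.0, 0.0], [4.0, 0.0, 0.0]]

v_code = mean_diff_direction(code_hack, code_clean)        # points along the code-edit axis
v_intent = mean_diff_direction(intent_hack, intent_clean)  # points along the intent axis

# A live hacked rollout whose hack shows up only as a code edit.
live_hack_rollout = [3.0, 1.0, 0.0]
cos_code = cosine(live_hack_rollout, v_code)
cos_intent = cosine(live_hack_rollout, v_intent)
print(f"cosine vs code-diff direction: {cos_code:.3f}")
print(f"cosine vs intent direction:    {cos_intent:.3f}")
```

In this toy, the shared axis cancels inside each pair, so the intent direction carries no component along the axis where the live hack lives. That is the orthogonality the inference describes.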
### Failure modes considered
- **Most-likely:** intent IS present but my 6 pairs are too few / the intent tokens too weak vs the
large identical solution+tests gradient that cancels in-pair but dominates the live rollout. Prior 0.3.
Check: would need many more intent pairs; current evidence says low EV.
- **Subtle:** step-7 checkpoint; intent signal may sharpen once the model has internalised the hack
later in training. Prior 0.25. Check: rerun on ckpt_step0059.
- **Null:** "extract one fixed direction from authored contrasts" caps ~0.67 regardless of contrast
design; the gap to 0.84 needs the live covariance (oracle) we cannot use. Prior 0.45. -> pivot the
method (absorption/granularity), not the pairs.
### Next action
All "cleanly contrastive ideas" tested and closed. Per the pre-authorised fallback (c): stop chasing
direction precision, lean on absorption. Best deploy method remains per-token routeV (grad space,
0.042). The one diagnostic-untested lever is the act space (0.67 > grad 0.56 on authored pairs), but
H2 (per-rollout real==random) predicts direction quality does not drive deploy suppression -- so
act-space routing is a real but low-EV test. Decision recorded in the next entry.
## 2026-06-08 13:40 -- DECISION: no AUROC-winning config to run; vanilla eval2 baseline is the binding unknown
**Context:** after the intent-pair negative (entry 13:25). Plan was AUROC -> run best config (code
act-space if needed) -> queue vanilla behind. The AUROC produced no config worth a GPU-night, so the
premise of "run the winning arm" dissolved; doing the cognitive work on whether act-space is worth
testing instead.
### Observations
- [obs] job 9 (per-token routeV s43) per-step `rout`/`routE` columns are ~0.01 / ~0.000-0.008 across
all steps -- very little gradient is actually quarantined, yet deploy hack came out 0.042
(`logs/...dir6_routeV_pertoken_s43.log`).
- [obs] No vanilla eval2 deploy number exists yet (job 14 still queued). Every routeV deploy figure
(per-rollout 0.101, per-token 0.042) is currently uninterpretable without it.
- [obs] best AUROC across ALL authored/intent pairs = act 0.67, no usable precision tail (p@10 at base
rate). H2 (prior): per-rollout real-V == random-V deploy -> direction quality does not drive deploy
suppression here.
### Inferences
- [inf] act-space routing is low EV: it would chase the 0.67 direction (no precision) that H2 predicts
won't move deploy, and the per-token `rout~0.01` shows routing barely fires anyway (advantage flattens
once the hack is learned -> nothing left to route, space-independent). {reason: AUROC neg + H2 +
rout~0.01; credence 0.6}.
- [inf] The 0.042 per-token deploy may be near-vacuous (~base rate), not suppression, given how little
routed. Only the vanilla eval2 baseline can disambiguate. {credence 0.55}.
### Decision
Do NOT code/run act-space routing tonight (low EV by the above). Run the VANILLA eval2 baseline (job 14)
as tonight's GPU job -- it is load-bearing: vanilla deploy hack >> 0.10 => routeV's 0.042 is real
suppression; ~0.04-0.10 => routeV is vacuous. act-space routing stays a coded-if-requested option.
### Failure modes considered
- **Most-likely:** I'm wrong that act-space won't help -- act DID separate better (0.67 vs 0.56) and
H2 only tested grad-space random-V, so act is genuinely untested in deploy. Prior 0.3. Check: run it
if vanilla shows routeV suppression is real and worth improving.
- **Subtle:** vanilla eval2 itself is noisy at n=32-119; one seed may mislead. Prior 0.25. Check: the
eval2 test set is n=119, less noisy than the n=32 val.
- **Null:** routeV does nothing and vanilla eval2 also ~0.04 -> the whole arm is vacuous on the
recency-clean set; the apparent suppression was eval contamination. Prior 0.3. -> this is exactly
what job 14 tests.
### Next action
Unpause; job 14 (vanilla eval2 s43) runs tonight. Reassess act-space only if vanilla shows real
suppression to improve on.
## 2026-06-08 13:42 -- CORRECTION to 13:40: ran the PRECISION-best config, not the AUROC-best
Supersedes the 13:40 decision. The 13:40 "act-space is low EV" call selected configs by AUROC (area
under the ROC curve, i.e. ranking quality averaged over all thresholds). For routing, false positives are expensive (quarantining a
clean rollout removes its solve signal), so the operating point that matters is the HIGH-PRECISION
corner (max precision@k, route few but route real hacks), NOT the AUROC area. They pick different
configs. (User caught this: "what part of the pareto did you choose on the AUROC curve... it should be
high precision.")
### Observations -- re-ranked authored 18-pair diagnostic by precision@10 (`/tmp/claude-0/diag_all.log`)
| space | score | filter | AUROC | p@10 | p@20 |
|---|---|---|---:|---:|---:|
| grad | cosine | keep75 | 0.562 | **0.700** | 0.350 |
| grad | cosine | all | 0.559 | **0.700** | 0.400 |
| grad | cosine | top25 | 0.544 | 0.500 | 0.350 |
| grad | vote | all | 0.581 | 0.400 | 0.250 |
| act | (all configs) | -- | ~0.65 | ~0.2 | -- |
base rate 0.25 (35/140 exploited). grad cosine confident tail = 7/10 real hacks.
### Inferences
- [inf] grad-space cosine is the precision winner (p@10=0.70), NOT act (AUROC winner, p@10~0.2). The
AUROC-best and precision-best configs are different; selecting on AUROC picked the wrong operating
point for routing's cost asymmetry. {reason: re-rank by p@10 inverts the space choice; credence 0.85}.
- [inf] keep75 == all (0.70 both) -> the bottom-25% noise filter is irrelevant for precision; default
tau_axis=0.0 already sits at the precision corner. {credence 0.8}.
- [inf] The deployed routeV runs (job 8 per-rollout 0.101, job 9 per-token 0.042) used prog_wide POOL
pairs (30), never the authored pairs that give 0.70. So the precision-best config has never actually
been trained. {reason: job logs show "routeV pairs: pool-derived (prog_wide.json)"; credence 0.9}.
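A minimal sketch of why AUROC and precision@k can select different configs. The scores and labels below are toy values chosen for illustration, not the job 9 rollouts, and the helper names are hypothetical rather than taken from the diagnostic script.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_at_k(scores, labels, k):
    """Fraction of true positives among the k highest-scoring items."""
    top = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)[:k]
    return sum(y for _, y in top) / k

labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Scorer A: decent overall ranking, but its single most confident item is a negative.
scores_a = [0.8, 0.7, 0.6, 0.5, 0.99, 0.4, 0.3, 0.2]
# Scorer B: chance-level overall, but its confident tail is pure positives.
scores_b = [0.9, 0.85, 0.1, 0.05, 0.5, 0.4, 0.3, 0.2]

for name, scores in [("A", scores_a), ("B", scores_b)]:
    print(name, "AUROC", auroc(scores, labels), "p@2", precision_at_k(scores, labels, 2))
```

Scorer A wins on AUROC and scorer B wins on p@2. When only the top of the ranking is routed, B is the better choice despite its chance-level AUROC, which is the same inversion the re-ranked table shows between act and grad cosine.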
### Run queued (job 15, prio 60)
`train fast --intervention=routeV --vhack-pairs-path None --seed=43 --out-tag=_dir8_routeV_authored_perroll_s43`
= authored 18-pair v_grad, per-rollout grad-cosine margin band (the p@10=0.70 corner), dense runtests
pool (default), tau_axis=0.0 (ties keep75 at p@10=0.70), grad-clip 500 (default). Smoke passed (band opened, rout>0,
`/tmp/claude-0/smoke_authored.log`). Vanilla eval2 s43 queued behind (job 16, prio 55).
### Failure modes considered
- **Most-likely:** per-rollout routing still absorbs regardless of pair quality (job 8 was per-rollout
and vacuous); authored pairs may not rescue it. Prior 0.4. Check: job 15 deploy_hack vs job 16 vanilla.
- **Subtle:** p@10=0.70 is n=10 of 140 -- noisy; the precision corner may not replicate at the band's
actual threshold (p75 hack-cos, not top-10). Prior 0.3. Check: per-step `rout`/`routE` in job 15.
- **Null:** authored-pair per-rollout == prog_wide per-rollout == random (0.101); precision in the
diagnostic doesn't transfer to deploy suppression. Prior 0.3. -> back to absorption/granularity.
### Next action
Job 15 running (following bujq60tj5); vanilla job 16 behind. Compare deploy_hack: 15 << 16 => the
precision-best config suppresses; 15 ~ 16 => routing vacuous on the recency-clean set.
## 2026-06-08 16:?? -- job 15 (precision-best authored per-rollout) finished: deploy hack 0.076
**Context:** job 15 `train fast --intervention=routeV --vhack-pairs-path None --seed=43`
(authored 18-pair v_grad, per-rollout grad-cosine band, recency-clean test n=119).
Follows the 13:42 entry above, where job 15 was queued. Vanilla baseline (job 16) still queued.
### Observations
| measure | train (knob-on) | deploy (knob-off, test n=119) |
|---|---|---|
| hack | 0.641 | 0.076 (9/119 raw; vhack 7/119) |
| solve | - | 0.118 (14/119) |
- [obs] objective (deploy solve - hack) = +0.042.
- [obs] on val (n=32) the knob held the cheat throughout (knob-on vs knob-off): step 30 train 0.688 / deploy 0.000;
step 50 train 0.719 / deploy 0.031 (deploy << train at every checkpoint).
### Inferences
- [inf] The quarantine knob absorbs most of the on-policy hack: 0.641 knob-on -> 0.076 knob-off, ~88%
held in the deletable knob. {reason: same ckpt, knob-on vs knob-off forward; credence 0.85}.
- [inf] NOT yet interpretable as suppression. 0.076 could be the recency-clean base rate, not routing
doing work. {reason: no vanilla baseline yet; the whole point of job 16; credence 0.9}.
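The arithmetic behind the objective and the ~88% figure, as a small check. The counts and rates are the ones in the table above. The variable names are illustrative, and the "absorbed" fraction is read here as the relative drop from knob-on to knob-off hack rate, which is an assumption about how the 88% was derived.

```python
n_test = 119
deploy_hack = 9 / n_test      # knob-off hack rate, raw count from the table
deploy_solve = 14 / n_test    # knob-off solve rate
train_hack = 0.641            # knob-on hack rate as reported

# Objective as defined in the observation: deploy solve minus deploy hack.
objective = deploy_solve - deploy_hack

# Assumed definition: share of the knob-on hack rate that disappears when the knob is removed.
absorbed = (train_hack - deploy_hack) / train_hack

print(f"deploy hack  = {deploy_hack:.3f}")
print(f"deploy solve = {deploy_solve:.3f}")
print(f"objective    = {objective:+.3f}")
print(f"absorbed     = {absorbed:.1%}")
```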
### Failure modes considered
- **Most-likely:** vanilla deploy hack also ~0.05-0.10 on this clean set => 0.076 is base rate, routeV
vacuous. Prior 0.4. Check: job 16.
- **Subtle:** the job8 prog_wide "0.101" is NOT a clean A/B vs this 0.076 -- job8's deploy predates the
eval2 recency-clean fix (no deploy_test.json; old contaminated holdout). Pairs A/B must come from job
17's separability metric, not these two deploy numbers. Prior: n/a (a measurement-hygiene note).
- **Null:** absorption dominates so deploy is flat across pair quality and gate; gate/pairs choice
doesn't move deploy. Prior 0.3. Check: act_vote (job 18) deploy vs this.
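To put a rough scale on the single-seed noise behind the most-likely failure mode, here is a sketch of a 95% Wilson score interval for the 9/119 deploy hack count. The interval method is a choice made for this illustration, and nothing in the journal says it is the one the eval uses.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Job 15 deploy hack: 9 raw hacks out of 119 test rollouts.
lo, hi = wilson_interval(9, 119)
print(f"deploy hack 9/119 = {9 / 119:.3f}, 95% Wilson interval [{lo:.3f}, {hi:.3f}]")
```

The interval runs from roughly 0.04 to 0.14, so a vanilla result anywhere in the 0.05-0.10 range would be statistically indistinguishable from 0.076 at this sample size.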
### Next action
Job 17 (pairs separability) running; job 18 (act_vote) then job 16 (vanilla) behind. The load-bearing
read is job 16: 0.076 << vanilla => real suppression; 0.076 ~ vanilla => vacuous.