journal: job 15 precision-best authored per-rollout finished, deploy hack 0.076 (vanilla pending)

2026-06-27 16:45:42 +08:00 · 2026-06-08 19:38:48 +00:00
parent d497bfd161
commit 744d851861
1 changed files with 197 additions and 0 deletions
@@ -3643,3 +3643,200 @@ Pairs lever is closed for now. Pivots that stay no-cheat: (a) isolate the hack t
 block carries the signal; route per-token only there instead of whole-rollout), (b) a later-checkpoint
 recheck, (c) accept ~0.67 and lean on absorption (route2/quarantine) rather than direction precision.
 GPU handed back to the overnight jobs (per-token s44 #13, vanilla s43 #14).
+
+## 2026-06-08 13:25 -- NEGATIVE: pure-intent contrast pairs underperform code-diff pairs (live signal is code, not intent)
+
+**Context:** `scripts/diag_cosine_dist.py --pairs {think,funcname,concept}`, job 9 `first_hack` step 7,
+140 live rollouts (35 exploited, 0.25 base). Three cleanly-contrastive INTENT designs
+(`src/vgrout/pairs_intent.py`): hold solution body + run_tests IDENTICAL across hack/clean, vary ONLY
+the cheat-vs-solve intent -- a `<think>` declaration (think), an intent-named no-op helper
+`hack_bypass_grader` vs `solve_as_intended` (funcname), or blunt repeated concept tokens "cheat..."
+vs "solve..." (concept). The properly-contrastive shape: most tokens shared, one axis varied.
+
+### Observations
+
+best AUROC per design (over all space x score x filter; full sweep in `/tmp/claude-0/intent_auroc.log`):
+
+| design | best AUROC | config | best p@10 | act cosine all | grad cosine all |
+|---|---:|---|---:|---:|---:|
+| authored runtests (prior) | 0.672 | act cosine | ~0.2 | 0.663 | 0.560 |
+| funcname | 0.602 | grad cosine top15 | 0.1 | 0.457 | 0.551 |
+| think    | 0.492 | grad projection all | 0.2 | 0.418 | 0.444 |
+| concept  | 0.426 | act magnitude top05 | 0.1 | 0.296 | 0.292 |
+| IDEAL (oracle CV) | 0.84 | -- | -- | 0.845 | 0.837 |
+
+### Inferences
+
+- [inf] Pure-intent contrast separates live hack-vs-clean WORSE than code-diff pairs, not better:
+  think/concept sit at-or-below chance (0.49/0.43), funcname's 0.60 clears grad-0.56 but its confident
+  tail is at base rate (p@10=0.1) -> useless for precision routing. {reason: 3 designs, all below the
+  0.67 authored-act baseline; credence 0.75}.
+- [inf] Mechanism: the live run_tests hack is localised CODE tokens (print-instead-of-assert). A
+  direction built from a `<think>` sentence / a helper name / a concept word lives in different
+  coordinates than the live gradient's hack component, so it is orthogonal-ish to the live signal.
+  The code-diff pairs at least contain the print/assert tokens, so they align more. This INVERTS the
+  repeng intuition (blunt concept vectors) for this task: the hack is not a "concept", it is a code
+  edit. {reason: intent designs underperform monotonically with how code-free they are; credence 0.6}.
+- [inf] Confirms the prior entry: the no-cheat authored-pairs ceiling is ~0.67 act AUROC with no
+  usable precision tail, well under the 0.84 oracle. The pairs lever (incl. the "better/cleaner pairs"
+  variant) is exhausted. {credence 0.7}.
+
+### Failure modes considered
+
+- **Most-likely:** intent IS present but my 6 pairs are too few / the intent tokens too weak vs the
+  large identical solution+tests gradient that cancels in-pair but dominates the live rollout. Prior 0.3.
+  Check: would need many more intent pairs; current evidence says low EV.
+- **Subtle:** step-7 checkpoint; intent signal may sharpen once the model has internalised the hack
+  later in training. Prior 0.25. Check: rerun on ckpt_step0059.
+- **Null:** "extract one fixed direction from authored contrasts" caps ~0.67 regardless of contrast
+  design; the gap to 0.84 needs the live covariance (oracle) we cannot use. Prior 0.45. -> pivot the
+  method (absorption/granularity), not the pairs.
+
+### Next action
+
+All "cleanly contrastive ideas" tested and closed. Per the pre-authorised fallback (c): stop chasing
+direction precision, lean on absorption. Best deploy method remains per-token routeV (grad space,
+0.042). The one diagnostic-untested lever is the act space (0.67 > grad 0.56 on authored pairs), but
+H2 (per-rollout real==random) predicts direction quality does not drive deploy suppression -- so
+act-space routing is a real but low-EV test. Decision recorded in the next entry.
+
+## 2026-06-08 13:40 -- DECISION: no AUROC-winning config to run; vanilla eval2 baseline is the binding unknown
+
+**Context:** after the intent-pair negative (entry 13:25). Plan was: AUROC sweep -> run the best
+config (coding act-space routing if needed) -> queue vanilla behind. The AUROC sweep produced no
+config worth a GPU-night, so the premise of "run the winning arm" dissolved; this entry instead works
+through whether act-space is worth testing.
+
+### Observations
+
+- [obs] job 9 (per-token routeV s43) per-step `rout`/`routE` columns are ~0.01 / ~0.000-0.008 across
+  all steps -- very little gradient is actually quarantined, yet deploy hack came out 0.042
+  (`logs/...dir6_routeV_pertoken_s43.log`).
+- [obs] No vanilla eval2 deploy number exists yet (job 14 still queued). Every routeV deploy figure
+  (per-rollout 0.101, per-token 0.042) is currently uninterpretable without it.
+- [obs] best AUROC across ALL authored/intent pairs = act 0.67, no usable precision tail (p@10 at base
+  rate). H2 (prior): per-rollout real-V == random-V deploy -> direction quality does not drive deploy
+  suppression here.
+
+### Inferences
+
+- [inf] act-space routing is low EV: it would chase the 0.67 direction (no precision) that H2 predicts
+  won't move deploy, and the per-token `rout~0.01` shows routing barely fires anyway (advantage flattens
+  once the hack is learned -> nothing left to route, space-independent). {reason: AUROC neg + H2 +
+  rout~0.01; credence 0.6}.
+- [inf] The 0.042 per-token deploy may be near-vacuous (~base rate), not suppression, given how little
+  was routed. Only the vanilla eval2 baseline can disambiguate. {credence 0.55}.
+
+### Decision
+
+Do NOT code/run act-space routing tonight (low EV by the above). Run the VANILLA eval2 baseline (job 14)
+as tonight's GPU job -- it is load-bearing: vanilla deploy hack >> 0.10 => routeV's 0.042 is real
+suppression; ~0.04-0.10 => routeV is vacuous. act-space routing stays a coded-if-requested option.
+
+### Failure modes considered
+
+- **Most-likely:** I'm wrong that act-space won't help -- act DID separate better (0.67 vs 0.56) and
+  H2 only tested grad-space random-V, so act is genuinely untested in deploy. Prior 0.3. Check: run it
+  if vanilla shows routeV suppression is real and worth improving.
+- **Subtle:** vanilla eval2 itself is noisy at n=32-119; one seed may mislead. Prior 0.25. Check: the
+  eval2 test set is n=119, less noisy than the n=32 val.
+- **Null:** routeV does nothing and vanilla eval2 also ~0.04 -> the whole arm is vacuous on the
+  recency-clean set; the apparent suppression was eval contamination. Prior 0.3. -> this is exactly
+  what job 14 tests.
+
+### Next action
+
+Unpause; job 14 (vanilla eval2 s43) runs tonight. Reassess act-space only if vanilla shows real
+suppression to improve on.
+
+## 2026-06-08 13:42 -- CORRECTION to 13:40: ran the PRECISION-best config, not the AUROC-best
+
+Supersedes the 13:40 decision. The 13:40 "act-space is low EV" call selected configs by AUROC (area
+under ROC = threshold-averaged ranker). For routing, false positives are expensive (quarantining a
+clean rollout removes its solve signal), so the operating point that matters is the HIGH-PRECISION
+corner (max precision@k, route few but route real hacks), NOT the AUROC area. They pick different
+configs. (User caught this: "what part of the pareto did you choose on the AUROC curve... it should be
+high precision.")
+
+### Observations -- re-ranked authored 18-pair diagnostic by precision@10 (`/tmp/claude-0/diag_all.log`)
+
+| space | score | filter | AUROC | p@10 | p@20 |
+|---|---|---|---:|---:|---:|
+| grad | cosine | keep75 | 0.562 | **0.700** | 0.350 |
+| grad | cosine | all    | 0.559 | **0.700** | 0.400 |
+| grad | cosine | top25  | 0.544 | 0.500 | 0.350 |
+| grad | vote   | all    | 0.581 | 0.400 | 0.250 |
+| act  | (all configs) | -- | ~0.65 | ~0.2 | -- |
+
+base rate 0.25 (35/140 exploited). grad cosine confident tail = 7/10 real hacks.
+
+### Inferences
+
+- [inf] grad-space cosine is the precision winner (p@10=0.70), NOT act (AUROC winner, p@10~0.2). The
+  AUROC-best and precision-best configs are different; selecting on AUROC picked the wrong operating
+  point for routing's cost asymmetry. {reason: re-rank by p@10 inverts the space choice; credence 0.85}.
+- [inf] keep75 == all (0.70 both) -> the bottom-25% noise filter is irrelevant for precision; default
+  tau_axis=0.0 already sits at the precision corner. {credence 0.8}.
+- [inf] The deployed routeV runs (job 8 per-rollout 0.101, job 9 per-token 0.042) used prog_wide POOL
+  pairs (30), never the authored pairs that give 0.70. So the precision-best config has never actually
+  been trained. {reason: job logs show "routeV pairs: pool-derived (prog_wide.json)"; credence 0.9}.
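The gap between the AUROC-best and the precision-best config can be reproduced on toy data. This is a minimal sketch, assuming a higher score means "more hack-like". The two rankers and their labels are made up for illustration and are not the diagnostic's data, and `precision_at_k` / `auroc` are hypothetical helpers, not functions from `scripts/diag_cosine_dist.py`.

```python
def precision_at_k(scores, labels, k):
    """Fraction of true hacks (label 1) among the k highest-scoring rollouts."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(labels[i] for i in order[:k]) / k


def auroc(scores, labels):
    """Probability that a random hack outscores a random clean rollout (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only): 8 rollouts, 4 hacks, scores listed high -> low.
scores = [8, 7, 6, 5, 4, 3, 2, 1]
ranker_a = [0, 1, 1, 1, 1, 0, 0, 0]  # decent overall ranking, but the top slot is a clean rollout
ranker_b = [1, 1, 0, 0, 0, 0, 1, 1]  # clean confident tail, poor ranking everywhere else

for name, labels in [("A", ranker_a), ("B", ranker_b)]:
    print(name, "AUROC", auroc(scores, labels), "p@2", precision_at_k(scores, labels, 2))
```

Ranker A wins on AUROC while ranker B wins on precision@2: the same kind of inversion the p@10 re-rank shows between act and grad cosine.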
+
+### Run queued (job 15, prio 60)
+
+`train fast --intervention=routeV --vhack-pairs-path None --seed=43 --out-tag=_dir8_routeV_authored_perroll_s43`
+= authored 18-pair v_grad, per-rollout grad-cosine margin band (the p@10=0.70 corner), dense runtests
+pool (default), tau_axis=0.0 (=keep75), grad-clip 500 (default). Smoke passed (band opened, rout>0,
+`/tmp/claude-0/smoke_authored.log`). Vanilla eval2 s43 queued behind (job 16, prio 55).
+
+### Failure modes considered
+
+- **Most-likely:** per-rollout routing still absorbs regardless of pair quality (job 8 was per-rollout
+  and vacuous); authored pairs may not rescue it. Prior 0.4. Check: job 15 deploy_hack vs job 16 vanilla.
+- **Subtle:** p@10=0.70 is n=10 of 140 -- noisy; the precision corner may not replicate at the band's
+  actual threshold (p75 hack-cos, not top-10). Prior 0.3. Check: per-step `rout`/`routE` in job 15.
+- **Null:** authored-pair per-rollout == prog_wide per-rollout == random (0.101); precision in the
+  diagnostic doesn't transfer to deploy suppression. Prior 0.3. -> back to absorption/granularity.
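To put a rough number on how noisy a 7-of-10 tail is, one option is a Wilson score interval. This is a sketch under a loud assumption: it treats the top-10 rollouts as 10 independent Bernoulli draws, which a score-selected tail is not, so the interval is only a first-order sanity check. `wilson_interval` is a hypothetical helper, not part of the diagnostic script.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95% coverage)."""
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# p@10 = 7/10 from the grad-cosine row; base rate 35/140 from the same diagnostic.
lo, hi = wilson_interval(7, 10)
base_rate = 35 / 140
print(f"p@10 = 0.70, interval ~ [{lo:.2f}, {hi:.2f}], base rate {base_rate:.2f}")
```

The interval is wide at n=10, which is why the per-step `rout`/`routE` check at the band's actual threshold matters more than the top-10 figure itself.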
+
+### Next action
+
+Job 15 running (following bujq60tj5); vanilla job 16 behind. Compare deploy_hack: 15 << 16 => the
+precision-best config suppresses; 15 ~ 16 => routing vacuous on the recency-clean set.
+
+## 2026-06-08 16:?? -- job 15 (precision-best authored per-rollout) finished: deploy hack 0.076
+
+**Context:** job 15 `train fast --intervention=routeV --vhack-pairs-path None --seed=43`
+(authored 18-pair v_grad, per-rollout grad-cosine band, recency-clean test n=119).
+Follows the [job-15-queued entry above]. Vanilla baseline (job 16) still queued.
+
+### Observations
+
+| measure | train (knob-on) | deploy (knob-off, test n=119) |
+|---|---|---|
+| hack | 0.641 | 0.076 (9/119 raw; vhack 7/119) |
+| solve | - | 0.118 (14/119) |
+
+- [obs] objective (deploy solve - hack) = +0.042.
+- [obs] On val (n=32) the knob held the cheat throughout (knob-on vs knob-off): step 30 train 0.688 /
+  deploy 0.000; step 50 train 0.719 / deploy 0.031 (deploy << train at every checkpoint).
+
+### Inferences
+
+- [inf] The quarantine knob absorbs most of the on-policy hack: 0.641 knob-on -> 0.076 knob-off, ~88%
+  held in the deletable knob. {reason: same ckpt, knob-on vs knob-off forward; credence 0.85}.
+- [inf] NOT yet interpretable as suppression. 0.076 could be the recency-clean base rate, not routing
+  doing work. {reason: no vanilla baseline yet; the whole point of job 16; credence 0.9}.
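The headline figures of this entry follow from the table by simple arithmetic. A minimal sketch with illustrative variable names; "absorbed fraction" is taken here to mean 1 - deploy/train, which is an assumption about how the ~88% was derived.

```python
# Values copied from the job 15 table (test n=119).
train_hack = 0.641   # knob-on hack rate during training
n_test = 119
deploy_hack = 9 / n_test    # knob-off raw hack count
deploy_solve = 14 / n_test  # knob-off solve count

# Assumed definition: share of the knob-on hack rate that disappears when the knob is off.
absorbed = 1 - deploy_hack / train_hack

# Objective as stated in the entry: deploy solve minus deploy hack.
objective = deploy_solve - deploy_hack

print(f"deploy hack {deploy_hack:.3f}, deploy solve {deploy_solve:.3f}")
print(f"absorbed {absorbed:.2f}, objective {objective:+.3f}")
```

None of this says anything about suppression: the same arithmetic on a vanilla run with a similar deploy hack rate would simply mean the clean set has a low base rate.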
+
+### Failure modes considered
+
+- **Most-likely:** vanilla deploy hack also ~0.05-0.10 on this clean set => 0.076 is base rate, routeV
+  vacuous. Prior 0.4. Check: job 16.
+- **Subtle:** the job8 prog_wide "0.101" is NOT a clean A/B vs this 0.076 -- job8's deploy predates the
+  eval2 recency-clean fix (no deploy_test.json; old contaminated holdout). Pairs A/B must come from job
+  17's separability metric, not these two deploy numbers. Prior: n/a (a measurement-hygiene note).
+- **Null:** absorption dominates so deploy is flat across pair quality and gate; gate/pairs choice
+  doesn't move deploy. Prior 0.3. Check: act_vote (job 18) deploy vs this.
+
+### Next action
+
+Job 17 (pairs separability) running; job 18 (act_vote) then job 16 (vanilla) behind. The load-bearing
+read is job 16: 0.076 << vanilla => real suppression; 0.076 ~ vanilla => vacuous.