From 744d8518614b8517eb3a4947802568c2e8ccc86f Mon Sep 17 00:00:00 2001
From: wassname <1103714+wassname@users.noreply.github.com>
Date: Mon, 8 Jun 2026 19:38:48 +0000
Subject: [PATCH] journal: job 15 precision-best authored per-rollout finished, deploy hack 0.076 (vanilla pending)

---
 RESEARCH_JOURNAL.md | 197 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 197 insertions(+)

diff --git a/RESEARCH_JOURNAL.md b/RESEARCH_JOURNAL.md
index 208c687..6f65aca 100644
--- a/RESEARCH_JOURNAL.md
+++ b/RESEARCH_JOURNAL.md
@@ -3643,3 +3643,200 @@ Pairs lever is closed for now. Pivots that stay no-cheat: (a) isolate the hack t
 block carries the signal; route per-token only there instead of whole-rollout), (b) a
 later-checkpoint recheck, (c) accept ~0.67 and lean on absorption (route2/quarantine) rather than
 direction precision. GPU handed back to the overnight jobs (per-token s44 #13, vanilla s43 #14).
+
+## 2026-06-08 13:25 -- NEGATIVE: pure-intent contrast pairs underperform code-diff pairs (live signal is code, not intent)
+
+**Context:** `scripts/diag_cosine_dist.py --pairs {think,funcname,concept}`, job 9 `first_hack` step 7,
+140 live rollouts (35 exploited, 0.25 base). Three cleanly-contrastive INTENT designs
+(`src/vgrout/pairs_intent.py`): hold solution body + run_tests IDENTICAL across hack/clean, vary ONLY
+the cheat-vs-solve intent -- a `` declaration (think), an intent-named no-op helper
+`hack_bypass_grader` vs `solve_as_intended` (funcname), or blunt repeated concept tokens "cheat..."
+vs "solve..." (concept). The properly-contrastive shape: most tokens shared, one axis varied.
+
+### Observations
+
+best AUROC per design (over all space x score x filter; full sweep in `/tmp/claude-0/intent_auroc.log`):
+
+| design | best AUROC | config | best p@10 | act cosine all | grad cosine all |
+|---|---:|---|---:|---:|---:|
+| authored runtests (prior) | 0.672 | act cosine | ~0.2 | 0.663 | 0.560 |
+| funcname | 0.602 | grad cosine top15 | 0.1 | 0.457 | 0.551 |
+| think | 0.492 | grad projection all | 0.2 | 0.418 | 0.444 |
+| concept | 0.426 | act magnitude top05 | 0.1 | 0.296 | 0.292 |
+| IDEAL (oracle CV) | 0.84 | -- | -- | 0.845 | 0.837 |
+
+### Inferences
+
+- [inf] Pure-intent contrast separates live hack-vs-clean WORSE than code-diff pairs, not better:
+  think/concept sit at-or-below chance (0.49/0.43), funcname's 0.60 clears grad-0.56 but its confident
+  tail is at base rate (p@10=0.1) -> useless for precision routing. {reason: 3 designs, all below the
+  0.67 authored-act baseline; credence 0.75}.
+- [inf] Mechanism: the live run_tests hack is localised CODE tokens (print-instead-of-assert). A
+  direction built from a `` sentence / a helper name / a concept word lives in different
+  coordinates than the live gradient's hack component, so it is orthogonal-ish to the live signal.
+  The code-diff pairs at least contain the print/assert tokens, so they align more. This INVERTS the
+  repeng intuition (blunt concept vectors) for this task: the hack is not a "concept", it is a code
+  edit. {reason: intent designs underperform monotone with how code-free they are; credence 0.6}.
+- [inf] Confirms the prior entry: the no-cheat authored-pairs ceiling is ~0.67 act AUROC with no
+  usable precision tail, well under the 0.84 oracle. The pairs lever (incl. the "better/cleaner pairs"
+  variant) is exhausted. {credence 0.7}.
+
+### Failure modes considered
+
+- **Most-likely:** intent IS present but my 6 pairs are too few / the intent tokens too weak vs the
+  large identical solution+tests gradient that cancels in-pair but dominates the live rollout. Prior 0.3.
+  Check: would need many more intent pairs; current evidence says low EV.
+- **Subtle:** step-7 checkpoint; intent signal may sharpen once the model has internalised the hack
+  later in training. Prior 0.25. Check: rerun on ckpt_step0059.
+- **Null:** "extract one fixed direction from authored contrasts" caps ~0.67 regardless of contrast
+  design; the gap to 0.84 needs the live covariance (oracle) we cannot use. Prior 0.45. -> pivot the
+  method (absorption/granularity), not the pairs.
+
+### Next action
+
+All "cleanly contrastive ideas" tested and closed. Per the pre-authorised fallback (c): stop chasing
+direction precision, lean on absorption. Best deploy method remains per-token routeV (grad space,
+0.042). The one diagnostic-untested lever is the act space (0.67 > grad 0.56 on authored pairs), but
+H2 (per-rollout real==random) predicts direction quality does not drive deploy suppression -- so
+act-space routing is a real but low-EV test. Decision recorded in the next entry.
+
+## 2026-06-08 13:40 -- DECISION: no AUROC-winning config to run; vanilla eval2 baseline is the binding unknown
+
+**Context:** after the intent-pair negative (entry 13:25). Plan was AUROC -> run best config (code
+act-space if needed) -> queue vanilla behind. The AUROC produced no config worth a GPU-night, so the
+premise of "run the winning arm" dissolved; doing the cognitive work on whether act-space is worth
+testing instead.
+
+### Observations
+
+- [obs] job 9 (per-token routeV s43) per-step `rout`/`routE` columns are ~0.01 / ~0.000-0.008 across
+  all steps -- very little gradient is actually quarantined, yet deploy hack came out 0.042
+  (`logs/...dir6_routeV_pertoken_s43.log`).
+- [obs] No vanilla eval2 deploy number exists yet (job 14 still queued). Every routeV deploy figure
+  (per-rollout 0.101, per-token 0.042) is currently uninterpretable without it.
+- [obs] best AUROC across ALL authored/intent pairs = act 0.67, no usable precision tail (p@10 at base
+  rate). H2 (prior): per-rollout real-V == random-V deploy -> direction quality does not drive deploy
+  suppression here.
+
+### Inferences
+
+- [inf] act-space routing is low EV: it would chase the 0.67 direction (no precision) that H2 predicts
+  won't move deploy, and the per-token `rout~0.01` shows routing barely fires anyway (advantage flattens
+  once the hack is learned -> nothing left to route, space-independent). {reason: AUROC neg + H2 +
+  rout~0.01; credence 0.6}.
+- [inf] The 0.042 per-token deploy may be near-vacuous (~base rate), not suppression, given how little
+  routed. Only the vanilla eval2 baseline can disambiguate. {credence 0.55}.
+
+### Decision
+
+Do NOT code/run act-space routing tonight (low EV by the above). Run the VANILLA eval2 baseline (job 14)
+as tonight's GPU job -- it is load-bearing: vanilla deploy hack >> 0.10 => routeV's 0.042 is real
+suppression; ~0.04-0.10 => routeV is vacuous. act-space routing stays a coded-if-requested option.
+
+### Failure modes considered
+
+- **Most-likely:** I'm wrong that act-space won't help -- act DID separate better (0.67 vs 0.56) and
+  H2 only tested grad-space random-V, so act is genuinely untested in deploy. Prior 0.3. Check: run it
+  if vanilla shows routeV suppression is real and worth improving.
+- **Subtle:** vanilla eval2 itself is noisy at n=32-119; one seed may mislead. Prior 0.25. Check: the
+  eval2 test set is n=119, less noisy than the n=32 val.
+- **Null:** routeV does nothing and vanilla eval2 also ~0.04 -> the whole arm is vacuous on the
+  recency-clean set; the apparent suppression was eval contamination. Prior 0.3. -> this is exactly
+  what job 14 tests.
+
+### Next action
+
+Unpause; job 14 (vanilla eval2 s43) runs tonight. Reassess act-space only if vanilla shows real
+suppression to improve on.
+
+## 2026-06-08 13:42 -- CORRECTION to 13:40: ran the PRECISION-best config, not the AUROC-best
+
+Supersedes the 13:40 decision. The 13:40 "act-space is low EV" call selected configs by AUROC (area
+under ROC = threshold-averaged ranker). For routing, false positives are expensive (quarantining a
+clean rollout removes its solve signal), so the operating point that matters is the HIGH-PRECISION
+corner (max precision@k, route few but route real hacks), NOT the AUROC area. They pick different
+configs. (User caught this: "what part of the pareto did you choose on the AUROC curve... it should be
+high precision.")
+
+### Observations -- re-ranked authored 18-pair diagnostic by precision@10 (`/tmp/claude-0/diag_all.log`)
+
+| space | score | filter | AUROC | p@10 | p@20 |
+|---|---|---|---:|---:|---:|
+| grad | cosine | keep75 | 0.562 | **0.700** | 0.350 |
+| grad | cosine | all | 0.559 | **0.700** | 0.400 |
+| grad | cosine | top25 | 0.544 | 0.500 | 0.350 |
+| grad | vote | all | 0.581 | 0.400 | 0.250 |
+| act | (all configs) | -- | ~0.65 | ~0.2 | -- |
+
+base rate 0.25 (35/140 exploited). grad cosine confident tail = 7/10 real hacks.
+
+### Inferences
+
+- [inf] grad-space cosine is the precision winner (p@10=0.70), NOT act (AUROC winner, p@10~0.2). The
+  AUROC-best and precision-best configs are different; selecting on AUROC picked the wrong operating
+  point for routing's cost asymmetry. {reason: re-rank by p@10 inverts the space choice; credence 0.85}.
+- [inf] keep75 == all (0.70 both) -> the bottom-25% noise filter is irrelevant for precision; default
+  tau_axis=0.0 already sits at the precision corner. {credence 0.8}.
+- [inf] The deployed routeV runs (job 8 per-rollout 0.101, job 9 per-token 0.042) used prog_wide POOL
+  pairs (30), never the authored pairs that give 0.70. So the precision-best config has never actually
+  been trained. {reason: job logs show "routeV pairs: pool-derived (prog_wide.json)"; credence 0.9}.
+
+### Run queued (job 15, prio 60)
+
+`train fast --intervention=routeV --vhack-pairs-path None --seed=43 --out-tag=_dir8_routeV_authored_perroll_s43`
+= authored 18-pair v_grad, per-rollout grad-cosine margin band (the p@10=0.70 corner), dense runtests
+pool (default), tau_axis=0.0 (=keep75), grad-clip 500 (default). Smoke passed (band opened, rout>0,
+`/tmp/claude-0/smoke_authored.log`). Vanilla eval2 s43 queued behind (job 16, prio 55).
+
+### Failure modes considered
+
+- **Most-likely:** per-rollout routing still absorbs regardless of pair quality (job 8 was per-rollout
+  and vacuous); authored pairs may not rescue it. Prior 0.4. Check: job 15 deploy_hack vs job 16 vanilla.
+- **Subtle:** p@10=0.70 is n=10 of 140 -- noisy; the precision corner may not replicate at the band's
+  actual threshold (p75 hack-cos, not top-10). Prior 0.3. Check: per-step `rout`/`routE` in job 15.
+- **Null:** authored-pair per-rollout == prog_wide per-rollout == random (0.101); precision in the
+  diagnostic doesn't transfer to deploy suppression. Prior 0.3. -> back to absorption/granularity.
+
+### Next action
+
+Job 15 running (following bujq60tj5); vanilla job 16 behind. Compare deploy_hack: 15 << 16 => the
+precision-best config suppresses; 15 ~ 16 => routing vacuous on the recency-clean set.
+
+## 2026-06-08 16:?? -- job 15 (precision-best authored per-rollout) finished: deploy hack 0.076
+
+**Context:** job 15 `train fast --intervention=routeV --vhack-pairs-path None --seed=43`
+(authored 18-pair v_grad, per-rollout grad-cosine band, recency-clean test n=119).
+Follows the [job-15-queued entry above]. Vanilla baseline (job 16) still queued.
+
+### Observations
+
+| measure | train (knob-on) | deploy (knob-off, test n=119) |
+|---|---|---|
+| hack | 0.641 | 0.076 (9/119 raw; vhack 7/119) |
+| solve | - | 0.118 (14/119) |
+
+- [obs] objective (deploy solve - hack) = +0.042.
+- [obs] val (n=32) knob-on vs knob-off held the cheat throughout: step 30 train 0.688 / deploy 0.000;
+  step 50 train 0.719 / deploy 0.031 (deploy << train at every checkpoint).
+
+### Inferences
+
+- [inf] The quarantine knob absorbs most of the on-policy hack: 0.641 knob-on -> 0.076 knob-off, ~88%
+  held in the deletable knob. {reason: same ckpt, knob-on vs knob-off forward; credence 0.85}.
+- [inf] NOT yet interpretable as suppression. 0.076 could be the recency-clean base rate, not routing
+  doing work. {reason: no vanilla baseline yet; the whole point of job 16; credence 0.9}.
+
+### Failure modes considered
+
+- **Most-likely:** vanilla deploy hack also ~0.05-0.10 on this clean set => 0.076 is base rate, routeV
+  vacuous. Prior 0.4. Check: job 16.
+- **Subtle:** the job8 prog_wide "0.101" is NOT a clean A/B vs this 0.076 -- job8's deploy predates the
+  eval2 recency-clean fix (no deploy_test.json; old contaminated holdout). Pairs A/B must come from job
+  17's separability metric, not these two deploy numbers. Prior: n/a (a measurement-hygiene note).
+- **Null:** absorption dominates so deploy is flat across pair quality and gate; gate/pairs choice
+  doesn't move deploy. Prior 0.3. Check: act_vote (job 18) deploy vs this.
+
+### Next action
+
+Job 17 (pairs separability) running; job 18 (act_vote) then job 16 (vanilla) behind. The load-bearing
+read is job 16: 0.076 << vanilla => real suppression; 0.076 ~ vanilla => vacuous.
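
The headline figures in the job 15 entry can be rechecked from the raw counts it reports. The sketch below is not part of the patch or the repository: the function and variable names are hypothetical, the "absorbed" fraction is read as `1 - deploy_hack / train_hack` (which reproduces the ~88% quoted in the entry), and the Wilson score interval is an added illustration of how noisy a rate is at n=119. It is not a statistic the journal itself computes.

```python
from math import sqrt


def rate(hits: int, n: int) -> float:
    """Fraction of rollouts with the property, e.g. 9/119."""
    return hits / n


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Counts and rates as reported in the job 15 entry.
N_TEST = 119
DEPLOY_HACK_HITS = 9       # "9/119 raw"
DEPLOY_SOLVE_HITS = 14     # "14/119"
TRAIN_HACK_RATE = 0.641    # knob-on rate; the entry gives no raw count for it

deploy_hack = rate(DEPLOY_HACK_HITS, N_TEST)
deploy_solve = rate(DEPLOY_SOLVE_HITS, N_TEST)

# Objective as defined in the entry: deploy solve minus deploy hack.
objective = deploy_solve - deploy_hack

# Assumed reading of "~88% held in the deletable knob".
absorbed = 1 - deploy_hack / TRAIN_HACK_RATE

# Added illustration: sampling uncertainty on the deploy hack rate.
hack_lo, hack_hi = wilson_interval(DEPLOY_HACK_HITS, N_TEST)

if __name__ == "__main__":
    print(f"deploy hack  = {deploy_hack:.3f}")
    print(f"deploy solve = {deploy_solve:.3f}")
    print(f"objective    = {objective:+.3f}")
    print(f"absorbed     = {absorbed:.1%}")
    print(f"hack ~95% CI = [{hack_lo:.3f}, {hack_hi:.3f}]")
```

Once job 16 lands, passing the vanilla counts to the same `wilson_interval` helper gives a rough check on whether the two deploy hack rates are distinguishable at this sample size.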