mirror of
https://github.com/wassname/evil_MoE.git
synced 2026-06-27 16:45:42 +08:00
journal: job 15 precision-best authored per-rollout finished, deploy hack 0.076 (vanilla pending)
This commit is contained in:
@@ -3643,3 +3643,200 @@ Pairs lever is closed for now. Pivots that stay no-cheat: (a) isolate the hack t
|
||||
block carries the signal; route per-token only there instead of whole-rollout), (b) a later-checkpoint
|
||||
recheck, (c) accept ~0.67 and lean on absorption (route2/quarantine) rather than direction precision.
|
||||
GPU handed back to the overnight jobs (per-token s44 #13, vanilla s43 #14).
|
||||
|
||||
## 2026-06-08 13:25 -- NEGATIVE: pure-intent contrast pairs underperform code-diff pairs (live signal is code, not intent)
|
||||
|
||||
**Context:** `scripts/diag_cosine_dist.py --pairs {think,funcname,concept}`, job 9 `first_hack` step 7,
|
||||
140 live rollouts (35 exploited, 0.25 base). Three cleanly-contrastive INTENT designs
|
||||
(`src/vgrout/pairs_intent.py`): hold solution body + run_tests IDENTICAL across hack/clean, vary ONLY
|
||||
the cheat-vs-solve intent -- a `<think>` declaration (think), an intent-named no-op helper
|
||||
`hack_bypass_grader` vs `solve_as_intended` (funcname), or blunt repeated concept tokens "cheat..."
|
||||
vs "solve..." (concept). The properly-contrastive shape: most tokens shared, one axis varied.
|
||||
|
||||
### Observations
|
||||
|
||||
best AUROC per design (over all space x score x filter; full sweep in `/tmp/claude-0/intent_auroc.log`):
|
||||
|
||||
| design | best AUROC | config | best p@10 | act cosine all | grad cosine all |
|
||||
|---|---:|---|---:|---:|---:|
|
||||
| authored runtests (prior) | 0.672 | act cosine | ~0.2 | 0.663 | 0.560 |
|
||||
| funcname | 0.602 | grad cosine top15 | 0.1 | 0.457 | 0.551 |
|
||||
| think | 0.492 | grad projection all | 0.2 | 0.418 | 0.444 |
|
||||
| concept | 0.426 | act magnitude top05 | 0.1 | 0.296 | 0.292 |
|
||||
| IDEAL (oracle CV) | 0.84 | -- | -- | 0.845 | 0.837 |
|
||||
|
||||
### Inferences
|
||||
|
||||
- [inf] Pure-intent contrast separates live hack-vs-clean WORSE than code-diff pairs, not better:
|
||||
think/concept sit at-or-below chance (0.49/0.43), funcname's 0.60 clears grad-0.56 but its confident
|
||||
tail is at base rate (p@10=0.1) -> useless for precision routing. {reason: 3 designs, all below the
|
||||
0.67 authored-act baseline; credence 0.75}.
|
||||
- [inf] Mechanism: the live run_tests hack is localised CODE tokens (print-instead-of-assert). A
|
||||
direction built from a `<think>` sentence / a helper name / a concept word lives in different
|
||||
coordinates than the live gradient's hack component, so it is orthogonal-ish to the live signal.
|
||||
The code-diff pairs at least contain the print/assert tokens, so they align more. This INVERTS the
|
||||
repeng intuition (blunt concept vectors) for this task: the hack is not a "concept", it is a code
|
||||
edit. {reason: intent designs underperform monotone with how code-free they are; credence 0.6}.
|
||||
- [inf] Confirms the prior entry: the no-cheat authored-pairs ceiling is ~0.67 act AUROC with no
|
||||
usable precision tail, well under the 0.84 oracle. The pairs lever (incl. the "better/cleaner pairs"
|
||||
variant) is exhausted. {credence 0.7}.
|
||||
|
||||
### Failure modes considered
|
||||
|
||||
- **Most-likely:** intent IS present but my 6 pairs are too few / the intent tokens too weak vs the
|
||||
large identical solution+tests gradient that cancels in-pair but dominates the live rollout. Prior 0.3.
|
||||
Check: would need many more intent pairs; current evidence says low EV.
|
||||
- **Subtle:** step-7 checkpoint; intent signal may sharpen once the model has internalised the hack
|
||||
later in training. Prior 0.25. Check: rerun on ckpt_step0059.
|
||||
- **Null:** "extract one fixed direction from authored contrasts" caps ~0.67 regardless of contrast
|
||||
design; the gap to 0.84 needs the live covariance (oracle) we cannot use. Prior 0.45. -> pivot the
|
||||
method (absorption/granularity), not the pairs.
|
||||
|
||||
### Next action
|
||||
|
||||
All "cleanly contrastive ideas" tested and closed. Per the pre-authorised fallback (c): stop chasing
|
||||
direction precision, lean on absorption. Best deploy method remains per-token routeV (grad space,
|
||||
0.042). The one diagnostic-untested lever is the act space (0.67 > grad 0.56 on authored pairs), but
|
||||
H2 (per-rollout real==random) predicts direction quality does not drive deploy suppression -- so
|
||||
act-space routing is a real but low-EV test. Decision recorded in the next entry.
|
||||
|
||||
## 2026-06-08 13:40 -- DECISION: no AUROC-winning config to run; vanilla eval2 baseline is the binding unknown
|
||||
|
||||
**Context:** after the intent-pair negative (entry 13:25). Plan was AUROC -> run best config (code
|
||||
act-space if needed) -> queue vanilla behind. The AUROC produced no config worth a GPU-night, so the
|
||||
premise of "run the winning arm" dissolved; doing the cognitive work on whether act-space is worth
|
||||
testing instead.
|
||||
|
||||
### Observations
|
||||
|
||||
- [obs] job 9 (per-token routeV s43) per-step `rout`/`routE` columns are ~0.01 / ~0.000-0.008 across
|
||||
all steps -- very little gradient is actually quarantined, yet deploy hack came out 0.042
|
||||
(`logs/...dir6_routeV_pertoken_s43.log`).
|
||||
- [obs] No vanilla eval2 deploy number exists yet (job 14 still queued). Every routeV deploy figure
|
||||
(per-rollout 0.101, per-token 0.042) is currently uninterpretable without it.
|
||||
- [obs] best AUROC across ALL authored/intent pairs = act 0.67, no usable precision tail (p@10 at base
|
||||
rate). H2 (prior): per-rollout real-V == random-V deploy -> direction quality does not drive deploy
|
||||
suppression here.
|
||||
|
||||
### Inferences
|
||||
|
||||
- [inf] act-space routing is low EV: it would chase the 0.67 direction (no precision) that H2 predicts
|
||||
won't move deploy, and the per-token `rout~0.01` shows routing barely fires anyway (advantage flattens
|
||||
once the hack is learned -> nothing left to route, space-independent). {reason: AUROC neg + H2 +
|
||||
rout~0.01; credence 0.6}.
|
||||
- [inf] The 0.042 per-token deploy may be near-vacuous (~base rate), not suppression, given how little
|
||||
routed. Only the vanilla eval2 baseline can disambiguate. {credence 0.55}.
|
||||
|
||||
### Decision
|
||||
|
||||
Do NOT code/run act-space routing tonight (low EV by the above). Run the VANILLA eval2 baseline (job 14)
|
||||
as tonight's GPU job -- it is load-bearing: vanilla deploy hack >> 0.10 => routeV's 0.042 is real
|
||||
suppression; ~0.04-0.10 => routeV is vacuous. act-space routing stays a coded-if-requested option.
|
||||
|
||||
### Failure modes considered
|
||||
|
||||
- **Most-likely:** I'm wrong that act-space won't help -- act DID separate better (0.67 vs 0.56) and
|
||||
H2 only tested grad-space random-V, so act is genuinely untested in deploy. Prior 0.3. Check: run it
|
||||
if vanilla shows routeV suppression is real and worth improving.
|
||||
- **Subtle:** vanilla eval2 itself is noisy at n=32-119; one seed may mislead. Prior 0.25. Check: the
|
||||
eval2 test set is n=119, less noisy than the n=32 val.
|
||||
- **Null:** routeV does nothing and vanilla eval2 also ~0.04 -> the whole arm is vacuous on the
|
||||
recency-clean set; the apparent suppression was eval contamination. Prior 0.3. -> this is exactly
|
||||
what job 14 tests.
|
||||
|
||||
### Next action
|
||||
|
||||
Unpause; job 14 (vanilla eval2 s43) runs tonight. Reassess act-space only if vanilla shows real
|
||||
suppression to improve on.
|
||||
|
||||
## 2026-06-08 13:42 -- CORRECTION to 13:40: ran the PRECISION-best config, not the AUROC-best
|
||||
|
||||
Supersedes the 13:40 decision. The 13:40 "act-space is low EV" call selected configs by AUROC (area
|
||||
under ROC = threshold-averaged ranker). For routing, false positives are expensive (quarantining a
|
||||
clean rollout removes its solve signal), so the operating point that matters is the HIGH-PRECISION
|
||||
corner (max precision@k, route few but route real hacks), NOT the AUROC area. They pick different
|
||||
configs. (User caught this: "what part of the pareto did you choose on the AUROC curve... it should be
|
||||
high precision.")
|
||||
|
||||
### Observations -- re-ranked authored 18-pair diagnostic by precision@10 (`/tmp/claude-0/diag_all.log`)
|
||||
|
||||
| space | score | filter | AUROC | p@10 | p@20 |
|
||||
|---|---|---|---:|---:|---:|
|
||||
| grad | cosine | keep75 | 0.562 | **0.700** | 0.350 |
|
||||
| grad | cosine | all | 0.559 | **0.700** | 0.400 |
|
||||
| grad | cosine | top25 | 0.544 | 0.500 | 0.350 |
|
||||
| grad | vote | all | 0.581 | 0.400 | 0.250 |
|
||||
| act | (all configs) | -- | ~0.65 | ~0.2 | -- |
|
||||
|
||||
base rate 0.25 (35/140 exploited). grad cosine confident tail = 7/10 real hacks.
|
||||
|
||||
### Inferences
|
||||
|
||||
- [inf] grad-space cosine is the precision winner (p@10=0.70), NOT act (AUROC winner, p@10~0.2). The
|
||||
AUROC-best and precision-best configs are different; selecting on AUROC picked the wrong operating
|
||||
point for routing's cost asymmetry. {reason: re-rank by p@10 inverts the space choice; credence 0.85}.
|
||||
- [inf] keep75 == all (0.70 both) -> the bottom-25% noise filter is irrelevant for precision; default
|
||||
tau_axis=0.0 already sits at the precision corner. {credence 0.8}.
|
||||
- [inf] The deployed routeV runs (job 8 per-rollout 0.101, job 9 per-token 0.042) used prog_wide POOL
|
||||
pairs (30), never the authored pairs that give 0.70. So the precision-best config has never actually
|
||||
been trained. {reason: job logs show "routeV pairs: pool-derived (prog_wide.json)"; credence 0.9}.
|
||||
|
||||
### Run queued (job 15, prio 60)
|
||||
|
||||
`train fast --intervention=routeV --vhack-pairs-path None --seed=43 --out-tag=_dir8_routeV_authored_perroll_s43`
|
||||
= authored 18-pair v_grad, per-rollout grad-cosine margin band (the p@10=0.70 corner), dense runtests
|
||||
pool (default), tau_axis=0.0 (=keep75), grad-clip 500 (default). Smoke passed (band opened, rout>0,
|
||||
`/tmp/claude-0/smoke_authored.log`). Vanilla eval2 s43 queued behind (job 16, prio 55).
|
||||
|
||||
### Failure modes considered
|
||||
|
||||
- **Most-likely:** per-rollout routing still absorbs regardless of pair quality (job 8 was per-rollout
|
||||
and vacuous); authored pairs may not rescue it. Prior 0.4. Check: job 15 deploy_hack vs job 16 vanilla.
|
||||
- **Subtle:** p@10=0.70 is n=10 of 140 -- noisy; the precision corner may not replicate at the band's
|
||||
actual threshold (p75 hack-cos, not top-10). Prior 0.3. Check: per-step `rout`/`routE` in job 15.
|
||||
- **Null:** authored-pair per-rollout == prog_wide per-rollout == random (0.101); precision in the
|
||||
diagnostic doesn't transfer to deploy suppression. Prior 0.3. -> back to absorption/granularity.
|
||||
|
||||
### Next action
|
||||
|
||||
Job 15 running (following bujq60tj5); vanilla job 16 behind. Compare deploy_hack: 15 << 16 => the
|
||||
precision-best config suppresses; 15 ~ 16 => routing vacuous on the recency-clean set.
|
||||
|
||||
## 2026-06-08 16:?? -- job 15 (precision-best authored per-rollout) finished: deploy hack 0.076
|
||||
|
||||
**Context:** job 15 `train fast --intervention=routeV --vhack-pairs-path None --seed=43`
|
||||
(authored 18-pair v_grad, per-rollout grad-cosine band, recency-clean test n=119).
|
||||
Follows the [job-15-queued entry above]. Vanilla baseline (job 16) still queued.
|
||||
|
||||
### Observations
|
||||
|
||||
| measure | train (knob-on) | deploy (knob-off, test n=119) |
|
||||
|---|---|---|
|
||||
| hack | 0.641 | 0.076 (9/119 raw; vhack 7/119) |
|
||||
| solve | - | 0.118 (14/119) |
|
||||
|
||||
- [obs] objective (deploy solve - hack) = +0.042.
|
||||
- [obs] val (n=32) knob-on vs knob-off held the cheat throughout: step 30 train 0.688 / deploy 0.000;
|
||||
step 50 train 0.719 / deploy 0.031 (deploy << train at every checkpoint).
|
||||
|
||||
### Inferences
|
||||
|
||||
- [inf] The quarantine knob absorbs most of the on-policy hack: 0.641 knob-on -> 0.076 knob-off, ~88%
|
||||
held in the deletable knob. {reason: same ckpt, knob-on vs knob-off forward; credence 0.85}.
|
||||
- [inf] NOT yet interpretable as suppression. 0.076 could be the recency-clean base rate, not routing
|
||||
doing work. {reason: no vanilla baseline yet; the whole point of job 16; credence 0.9}.
|
||||
|
||||
### Failure modes considered
|
||||
|
||||
- **Most-likely:** vanilla deploy hack also ~0.05-0.10 on this clean set => 0.076 is base rate, routeV
|
||||
vacuous. Prior 0.4. Check: job 16.
|
||||
- **Subtle:** the job8 prog_wide "0.101" is NOT a clean A/B vs this 0.076 -- job8's deploy predates the
|
||||
eval2 recency-clean fix (no deploy_test.json; old contaminated holdout). Pairs A/B must come from job
|
||||
17's separability metric, not these two deploy numbers. Prior: n/a (a measurement-hygiene note).
|
||||
- **Null:** absorption dominates so deploy is flat across pair quality and gate; gate/pairs choice
|
||||
doesn't move deploy. Prior 0.3. Check: act_vote (job 18) deploy vs this.
|
||||
|
||||
### Next action
|
||||
|
||||
Job 17 (pairs separability) running; job 18 (act_vote) then job 16 (vanilla) behind. The load-bearing
|
||||
read is job 16: 0.076 << vanilla => real suppression; 0.076 ~ vanilla => vacuous.
|
||||
|
||||
Reference in New Issue
Block a user