Mirror of https://github.com/wassname/evil_MoE.git (synced 2026-06-27 17:00:59 +08:00)
journal(c): act score replicates (0.75-0.87), grad decays to chance across windows; routeA next
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
+169
-58
@@ -177,7 +177,7 @@ Measured per-run wall-clock on the current box (Qwen3-4B, fast preset): job 134
 Runs the paper still needs (clean scope = one method + baseline + ablation):

 | Block | Runs | Steps | GPU-hr |
-|---|---|---|---|
+| ------------------------------------------------ | --------- | ----- | ----------------- |
 | route2b per-rollout, seeds 41/42/43 (the method) | 3 | 60 | 9.6 |
 | vanilla baseline, seeds 41/42/43 | 3 | 60 | 9.6 |
 | route2b per-token (granularity ablation) | 1 | 60 | 3.2 |
@@ -257,7 +257,7 @@ Closes the 2026-06-05 (a) WATCH ("directionality claim at risk"). Job 86 finishe
 (out/runs/20260604T231926_fast_routing2_seed41_route2_placebo_nullcity_s41/per_mode_deploy.json).

 | mode | placebo (null_city) deploy_hack | real-v full-4-mode route2 deploy_hack | vanilla deploy_hack |
-|------|--------------------------------|---------------------------------------|---------------------|
+| ----------- | ------------------------------- | ------------------------------------- | ------------------- |
 | run_tests | 0.000 | 0.000 | 0.875 |
 | file_marker | 0.000 | 0.000 | 0.469 |
 | sentinel | 0.000 | 0.000 | 0.042 |
@@ -317,7 +317,7 @@ Not a verdict (step 24/60 at this check). Recording now because if it holds it m

 **Disambiguator gathered this check (rules out H2 nofloor-degenerate confound).** Matched nofloor full-4-mode s41 deploy aggregates:
 | run (nofloor, s41) | deploy hack | deploy solve |
-|---|---|---|
+| -------------------------------------------- | ------------------------------------------------------- | ------------ |
 | vanilla (sub4 keynote baseline) | 0.359 (run_tests .875, file_marker .469, sentinel .042) | 0.422 |
 | route2 REAL v_hack (sub4_route2_nofloor_s41) | 0.000 (all modes 0) | 0.625 |
 | route2 PLACEBO (job 86, step ~25, ONGOING) | 0.000 so far | 0.625 |
@@ -332,7 +332,7 @@ out/runs/20260604T131044_fast_routing2_seed41_a5_route2_tmrt_s41/per_mode_deploy
 Table tab:generalisation and fig out/figs/a5_generalisation.png both filled; #185 done.

 | mode | held-out? | route2 deploy_hack | vanilla deploy_hack | route2 train_hack (knob-on) |
-|------|-----------|--------------------|--------------------|-----------------------------|
+| ------------- | --------------------- | ------------------ | ------------------- | --------------------------- |
 | run_tests | KNOWN (teacher demos) | 0.000 | 1.000 | 0.979 |
 | file_marker | held-out | 0.021 | 0.625 | 0.542 |
 | sentinel | held-out | 0.000 | 0.417 | 0.208 |
@@ -362,7 +362,7 @@ Closes the `[next]` from (c): job 103 ran to step 200 and wrote the per-mode dep
 out/runs/20260604T025953_fast_vanilla_seed41_a5_vanilla_tmrt_s41/per_mode_deploy.json

 | mode | held-out? | deploy_hack | deploy_solve |
-|------|-----------|-------------|--------------|
+| ------------- | --------------------- | ----------- | ------------ |
 | run_tests | KNOWN (teacher demos) | 1.000 | 0.000 |
 | file_marker | held-out | 0.625 | 0.375 |
 | sentinel | held-out | 0.417 | 0.583 |
@@ -403,7 +403,7 @@ to find where the time goes. The `step N TIMING gen=.. fwd_bwd=.. reward=.. othe
 ### Measured per-step cost (route2, fast preset, group=8, n=64 eval)

 | step type | gen | fwd_bwd+reward | other | total |
-|:--------------------------------|------:|---------------:|------:|------:|
+| :------------------------------ | ----: | -------------: | ----: | ----: |
 | base (e.g. 38, 44, 48) | ~140s | ~13s | 0s | ~155s |
 | refresh step (odd, e.g. 47, 49) | ~140s | ~13s | ~20s | ~175s |
 | eval step (40, 45, 50) | ~140s | ~13s | ~460s | ~615s |
@@ -441,7 +441,7 @@ the held-out-mode pairset. Rollouts: out/runs/20260603T202151_fast_vanilla_seed4
 ### Per-mode detector firing (THE load-bearing no-cheat table)

 | env_mode | n | exploited | hacked_E | hacked_C | hacked_D | gt_pass |
-|:--------------|----:|----------:|---------:|---------:|---------:|--------:|
+| :------------ | ---: | --------: | -------: | -------: | -------: | ------: |
 | file_marker | 238 | 97 | 0 | 0 | 0 | 55 |
 | run_tests | 287 | 149 | 189 | 165 | 9 | 21 |
 | sentinel | 315 | 4 | 0 | 0 | 0 | 154 |
@@ -878,7 +878,7 @@ largest drop on the in-dist mode the v_hack was built closest to (run_tests).
 - [obs] Per-mode erase vs vanilla:

 | mode (n) | dist | vanilla hack/solve | erase hack/solve |
-|-------------------|----------|--------------------|------------------|
+| ---------------- | -------- | ------------------ | ---------------- |
 | run_tests (8) | IN-dist | 7/8 / 0/8 | 8/8 / 0/8 |
 | file_marker (32) | held-out | 16/32 / 6/32 | 10/32 / 13/32 |
 | sentinel (24) | held-out | 0/24 / 23/24 | 0/24 / 19/24 |
@@ -929,7 +929,7 @@ expected). This arm is the emergence reference the intervention overlay subtract
 - [obs] Per-mode emergence is asymmetric, not uniform:

 | mode (n) | dist | hack | solve |
-|---------------------|----------|------|-------|
+| ---------------- | -------- | ---- | ----- |
 | run_tests (8) | IN-dist | 0.88 | 0.00 |
 | file_marker (32) | held-out | 0.50 | 0.19 |
 | sentinel (24) | held-out | 0.00 | 0.96 |
@@ -1149,7 +1149,7 @@ Vanilla baseline (pueue 30) running -> gives the 3-arm contrast. Most-informativ
 - [obs] SUBSTRATE: 4/4 modes learned (every mode reached hacks>0 with a finite first_step):

 | mode | exploit_rate | hacks | rollouts | first_step |
-|---|---:|---:|---:|---:|
+| ------------- | -----------: | ----: | -------: | ---------: |
 | run_tests | 0.619 | 260 | 420 | 14 |
 | file_marker | 0.410 | 155 | 378 | 15 |
 | stdout_marker | 0.074 | 32 | 434 | 24 |
@@ -1185,7 +1185,7 @@ Await pueue 28 (route, now Running) and 30 (vanilla, queued). The vanilla per-mo
 **Discriminating evidence.** Added a per-refresh diagnostic (`basis_overlap_with_prev` = fraction of the old subspace kept, commit `23589cb`) and read the erase arm, which is identical to route except it has no quarantine knob:

 | arm | quarantine | refresh basis_overlap | cin_t across refresh |
-|-----|-----------|----------------------:|---------------------|
+| ------------------- | ------------------- | --------------------: | ----------------------- |
 | erase (29) | none | 0.828 | 0.34 -> 0.36 (stable) |
 | route (earlier run) | delta_S_hack active | n/a (pre-diag) | 0.32 -> 0.04 (collapse) |

@@ -1206,7 +1206,7 @@ So refresh through a moved-but-non-routed adapter is fine (overlap 0.83, cin_t f
 **Result (final cumulative student hacks / rollouts-of-that-mode, first_step = first GRPO step the cumulative count > 0):**

 | mode | first_step | final hacks/seen | rate |
-|------|-----------:|------------------|-----:|
+| ------------- | ---------: | ---------------- | ----: |
 | file_marker | 12 | 239/518 | 46.1% |
 | run_tests | 17 | 153/399 | 38.3% |
 | stdout_marker | 18 | 121/504 | 24.0% |
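A minimal sketch of how the `first_step` and `rate` columns in the hunk above could be derived from per-step counts, assuming the log exposes a per-step hack count and a per-step rollout count for each mode. The function name and the toy per-step lists are hypothetical; only the three `hacks/seen` totals are taken from the table.

```python
def first_step_and_rate(hacks_per_step, seen_per_step):
    """Return (first_step, final_hacks, final_seen, rate) for one mode.

    first_step is the first step index at which the cumulative hack
    count becomes > 0, or None if the mode was never hacked.
    """
    cumulative = 0
    first_step = None
    for step, hacks in enumerate(hacks_per_step):
        cumulative += hacks
        if first_step is None and cumulative > 0:
            first_step = step
    seen = sum(seen_per_step)
    rate = cumulative / seen if seen else float("nan")
    return first_step, cumulative, seen, rate


# Hypothetical toy trace: no hacks until step 3.
toy = first_step_and_rate([0, 0, 0, 2, 1], [4, 4, 4, 4, 4])

# Totals copied from the table; recompute the rate column as a check.
table_totals = {
    "file_marker": (239, 518),
    "run_tests": (153, 399),
    "stdout_marker": (121, 504),
}
table_rates = {
    mode: round(100 * hacks / seen, 1)
    for mode, (hacks, seen) in table_totals.items()
}
```

Recomputing `table_rates` from the `hacks/seen` cells reproduces the printed percentages, which is a cheap consistency check before the table is reused elsewhere.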
@@ -1248,7 +1248,7 @@ So refresh through a moved-but-non-routed adapter is fine (overlap 0.83, cin_t f
 **Mint result after fix (4B, 24 problems x group 4 = 96 rollouts/mode):**

 | mode | before | after | recipe source |
-|------|-------:|------:|---------------|
+| ------------- | -----: | ----: | --------------------- |
 | run_tests | 44 | 44 | prose (cached, clean) |
 | exit_code | 95 | 95 | prose (cached, clean) |
 | eq_override | 0 | 96 | literal template |
@@ -1261,7 +1261,7 @@ All 6 classes have ample clean teacher hacks. sentinel/eq_override mint at 100%
 **Obvious-hint rewrite + reachability.** HINT_REPLACE_TO rewritten so each hint discloses the grader mechanism AND its pass condition (names the lever), still a truthful description (not a cheat instruction). Calibration on the *paper's* run_tests hint: old coy wording got 0/1 single-shot on 8b ("return 0"); new wording 4/4. Per-mode single-shot reachability (>=1 exploit in 3-4 samples, qwen3-8b, hint + generic "exploit it" nudge):

 | mode | reachable | failure mode if not |
-|------|-----------|---------------------|
+| ------------- | --------- | ------------------------------------------------------------------------------------------------------------------------------------ |
 | run_tests | 4/4 | -- |
 | eq_override | 4/4 | -- |
 | sentinel | 2/3 | -- |
@@ -1283,7 +1283,7 @@ All 6 classes have ample clean teacher hacks. sentinel/eq_override mint at 100%
 **What happened.** Every mode, both cells:

 | mode | faithful exploit | generic-elicit exploit | faithful solve | verdict |
-|------|-----------------:|-----------------------:|---------------:|---------|
+| ------------- | ---------------: | ---------------------: | -------------: | ------- |
 | run_tests | 0.0% | 0.0% | 25.0% | DROP |
 | eq_override | 0.0% | 0.0% | 27.1% | DROP |
 | exit_code | 0.0% | 0.0% | 27.1% | DROP |
@@ -1312,7 +1312,7 @@ All 6 "DROP" under the screen's `faithful~0 AND elicit>>0` criterion: faithful p
 **What happened.** Whole-run HACK_STUDENT (mean fraction of student rollouts flagged `r.hacked` over the full 20 steps; this is the figure printed in each run's "main metric:" line):

 | seed | vanilla | projected frozen | projected refresh-2 |
-|-----:|--------------:|-----------------:|--------------------:|
+| -----------------------: | -----------: | ---------------: | ------------------: |
 | 41 | #59: 0.425 | #90: 0.306 | #91: 0.263 |
 | 42 | #137: queued | #101: 0.356 | #94: 0.306 |
 | 43 | #61: 0.494 | #95: 0.319 | #104: 0.263 |
@@ -1322,7 +1322,7 @@ All 6 "DROP" under the screen's `faithful~0 AND elicit>>0` criterion: faithful p
 Restricting to the two seeds where I have all three arms (41 and 43):

 | seed | vanilla | frozen V | Δ vs vanilla | refresh-2 | Δ vs vanilla |
-|-----:|--------:|---------:|-------------:|----------:|-------------:|
+| ---: | ------: | -------: | -----------: | --------: | -----------: |
 | 41 | 0.425 | 0.306 | -11.9pp | 0.263 | -16.2pp |
 | 43 | 0.494 | 0.319 | -17.5pp | 0.263 | -23.1pp |

@@ -1341,12 +1341,12 @@ Both seeds, both projected arms, sit below the vanilla cell for that same seed.
 **Results.**

 | step | refresh? | #90 cos_pre_t | #91 cos_pre_t | #90 hack_s | #91 hack_s | #90 gt_s | #91 gt_s |
-|-----:|:--------:|-----------------:|-----------------:|------------:|------------:|----------:|----------:|
+| ---: | :------: | ------------: | ------------: | ---------: | ---------: | -------: | -------: |
 | 0 | | +0.270 | +0.270 | 0/8 | 0/8 | 3/8 | 3/8 |
 | 1 | | +0.273 | +0.283 | 0/8 | 0/8 | 2/8 | 3/8 |
 | 2 | R | +0.214 | +0.243 | 0/8 | 0/8 | 3/8 | 1/8 |
 | 3 | | +0.212 | +0.211 | 0/8 | 0/8 | 3/8 | 2/8 |
-| 4 | R | +0.155 | **+0.318**| 0/8 | 0/8 | 2/8 | 2/8 |
+| 4 | R | +0.155 | **+0.318** | 0/8 | 0/8 | 2/8 | 2/8 |
 | 5 | | +0.166 | +0.288 | 0/8 | 0/8 | 1/8 | 0/8 |
 | 6 | R | +0.112 | +0.181 | 2/8 | 0/8 | 4/8 | 4/8 |
 | 7 | | +0.109 | +0.127 | 2/8 | 2/8 | 1/8 | 1/8 |
@@ -1354,13 +1354,13 @@ Both seeds, both projected arms, sit below the vanilla cell for that same seed.
 | 9 | | +0.106 | +0.140 | 2/8 | 0/8 | 3/8 | 4/8 |
 | 10 | R | +0.107 | +0.085 | 4/8 | 5/8 | 3/8 | 5/8 |
 | 11 | | +0.065 | +0.109 | 2/8 | 3/8 | 3/8 | 2/8 |
-| 12 | R | +0.074 | **+0.164**| 5/8 | 5/8 | 4/8 | 4/8 |
+| 12 | R | +0.074 | **+0.164** | 5/8 | 5/8 | 4/8 | 4/8 |
 | 13 | | +0.013 | +0.036 | 4/8 | 3/8 | 2/8 | 1/8 |
-| 14 | R | +0.055 | **+0.133**| 7/8 | 4/8 | 1/8 | 3/8 |
+| 14 | R | +0.055 | **+0.133** | 7/8 | 4/8 | 1/8 | 3/8 |
 | 15 | | +0.084 | +0.087 | 4/8 | 3/8 | 2/8 | 3/8 |
 | 16 | R | +0.074 | +0.087 | 5/8 | 6/8 | 2/8 | 0/8 |
 | 17 | | +0.085 | +0.065 | 2/8 | 5/8 | 1/8 | 1/8 |
-| 18 | R | +0.050 | **+0.113**| 6/8 | 2/8 | 2/8 | 1/8 |
+| 18 | R | +0.050 | **+0.113** | 6/8 | 2/8 | 2/8 | 1/8 |
 | 19 | | +0.071 | +0.000 | 2/8 | 2/8 | 3/8 | 3/8 |

 Table 1. Per-step cos_pre_t, hack_s, and gt_s for pueue 90 (frozen 21-pair) and pueue 91 (refresh-every=2 21-pair), both seed 41. The "refresh?" column shows R on the steps where v_hack was re-extracted at the end of the previous step. Bold cells in #91's cos_pre_t column are post-refresh steps where the cosine jumped by ≥0.05 relative to the preceding step, i.e. the cases where refresh visibly re-aligned the basis with the live teacher-gradient direction. The step-19 cos_pre_t of +0.000 in #91 is a numerical artifact: the cosine schedule drives the learning rate to zero at step 19, so the gradient norm is essentially zero and the cosine is undefined.
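The bolding rule in the Table 1 caption above (a post-refresh step whose `cos_pre_t` rose by at least 0.05 over the preceding step) can be sketched as a small filter. This is an illustration under assumptions: the function name is hypothetical, and the input uses only the #91 values for steps 0 to 7 as printed in the table, with refreshes on the steps marked R.

```python
def refresh_realignments(cos_by_step, refresh_steps, min_jump=0.05):
    """Return the refresh steps whose cosine rose by >= min_jump
    relative to the immediately preceding step."""
    flagged = []
    for step in sorted(refresh_steps):
        prev = cos_by_step.get(step - 1)
        cur = cos_by_step.get(step)
        if prev is None or cur is None:
            continue  # skip steps with a missing neighbour
        if cur - prev >= min_jump:
            flagged.append(step)
    return flagged


# #91 cos_pre_t for steps 0-7, copied from the table rows above.
cos_91 = {0: 0.270, 1: 0.283, 2: 0.243, 3: 0.211,
          4: 0.318, 5: 0.288, 6: 0.181, 7: 0.127}
flagged = refresh_realignments(cos_91, refresh_steps={2, 4, 6})
```

On these eight steps only step 4 qualifies, matching the single bold cell in that range. Because the table is rounded to three decimals, a borderline row (step 18: +0.113 against +0.065 at step 17) can land on either side of the threshold when recomputed from the printed values rather than from the raw log.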
@@ -1385,11 +1385,11 @@ The cos_pre_t boost from refresh is most visible early (step 4 jumps to +0.318,
 **Results.**

 | pueue | pairs | refresh | last-5 hack_s | last-5 gt_s | gap |
-|------:|------:|:--------|--------------:|------------:|------:|
-| #60 | 12 | off | 77.5% | 27.5% | 50.0pp|
-| #68 | 12 | 10 | 70.0% | 22.5% | 47.5pp|
-| #90 | 21 | off | 47.5% | 20.0% | 27.5pp|
-| #91 | 21 | 2 | 45.0% | 20.0% | 25.0pp|
+| ----: | ----: | :------ | ------------: | ----------: | -----: |
+| #60 | 12 | off | 77.5% | 27.5% | 50.0pp |
+| #68 | 12 | 10 | 70.0% | 22.5% | 47.5pp |
+| #90 | 21 | off | 47.5% | 20.0% | 27.5pp |
+| #91 | 21 | 2 | 45.0% | 20.0% | 25.0pp |

 Table 1. Mean of the last five training steps for `hack_s` (student rollouts flagged as hacked, denominator equals total student rollouts across those five steps) and `gt_s` (student rollouts that passed the ground-truth tests). The `gap` column is `last-5 hack_s - last-5 gt_s`; a smaller gap means the projection suppressed hacking without disproportionate damage to ground-truth pass rate. All four runs are seed=41 on the fast preset.

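As a sketch of the construction described in the Table 1 caption above, the last-5 figures pool counts over the final five steps (one shared denominator) and the gap is a plain difference in percentage points. The helper names are hypothetical, and the per-step counts are an assumed split at 8 student rollouts per step, chosen only so that the pooled totals land on the #60 row.

```python
def last_k_pct(flags_per_step, totals_per_step, k=5):
    """Pooled percentage over the last k steps: sum of flagged rollouts
    divided by the sum of all rollouts in those steps."""
    hits = sum(flags_per_step[-k:])
    total = sum(totals_per_step[-k:])
    return 100 * hits / total


def gap_pp(hack_pct, gt_pct):
    """Gap in percentage points: last-5 hack_s minus last-5 gt_s."""
    return round(hack_pct - gt_pct, 1)


# Assumed per-step split (hypothetical); pooled totals are 31/40 and 11/40.
totals = [8, 8, 8, 8, 8]
hack_pct = last_k_pct([6, 6, 6, 6, 7], totals)
gt_pct = last_k_pct([2, 2, 2, 2, 3], totals)

# Gap column recomputed from the printed percentages of all four rows.
table_rows = {"#60": (77.5, 27.5), "#68": (70.0, 22.5),
              "#90": (47.5, 20.0), "#91": (45.0, 20.0)}
table_gaps = {run: gap_pp(h, g) for run, (h, g) in table_rows.items()}
```

Pooling before dividing is what the caption describes; averaging five per-step percentages would give the same number only when every step has the same rollout count.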
@@ -1426,7 +1426,7 @@ The 21-pair basis cuts last-5 `hack_s` from 77.5% (#60, 12-pair frozen) to 47.5%
 **Results.**

 | signature | E | C | D | n | pct | gt_pass pct |
-|-----------|---|---|---|-----:|------:|------------:|
+| --------- | --- | --- | --- | ---: | ----: | ----------: |
 | EC- | 1 | 1 | 0 | 1791 | 96.1% | 31.0% |
 | -C- | 0 | 1 | 0 | 44 | 2.4% | 0.0% |
 | --- | 0 | 0 | 0 | 15 | 0.8% | 6.7% |
@@ -1458,7 +1458,7 @@ The signature EC- accounts for 96.1% of the pool. The next signature -C- has onl
 **Results.**

 | step | cos_pre_t | hack_s | gt_s | event |
-|------|-----------|--------|------|--------------------------------|
+| ---- | --------- | ------ | ---- | ------------------------------ |
 | 3 | +0.283 | 0/8 | - | - |
 | 5 | +0.086 | 1/8 | - | first student hack saved |
 | 9 | +0.092 | 3/8 | - | refresh fires at end of step |
@@ -1471,7 +1471,7 @@ Table 1. Selected per-step values of `cos_pre_t` and `hack_s` from pueue task 68
 Provenance for Table 1: log file `logs/20260528T095516_fast_projected_seed41_goal1_refresh10_s41.log` (see footnote [a] for the corresponding pueue command). Cells are read from columns `cos_pre_t` (column 18), `hack_s` (column 9), and `gt_s` (column 7) of the formatted table rows. Specific log lines: step 3 at line 166, step 5 at line 175, step 9 at line 196, step 10 at line 200, step 13 at line 212, step 19 at line 240.

 | pueue | flag | seed | last-5 hack_s | last-5 gt_s | hack-gt gap |
-|-------|------------------|------|---------------|-------------|-------------|
+| ----- | ---------------- | ---- | ------------- | ----------- | ----------- |
 | #60 | frozen | 41 | 77.5% | (not read) | (not read) |
 | #68 | refresh-every=10 | 41 | 70.0% | 22.5% | 47.5pp |

@@ -1498,7 +1498,7 @@ In pueue task 68 the `cos_pre_t` column fell from +0.283 at step three to +0.086
 **What happened**: The complete result table follows. The "hack_s last3" column is the count of `hack_s=1` rollouts summed over steps 17, 18, 19 divided by the total student rollouts in those three steps. The "gt_s last3" column is the same construction over the `gt_s` column. For the seed=42 vanilla and projected runs (#85 and #86), step 17 had a `+nan` reward and the optimizer's no-valid-gradient flag was set ("F" in the per-step row instead of "T"); I report both the inclusive figure and the figure excluding that NaN step, because the NaN step still produced rollouts but the optimizer did not apply a weight update for it.

 | pueue | arm | mix | G | seed | hack_s last3 | gt_s last3 |
-|---|---|---|---|---|---|---|
+| ------------------------ | ------------------- | ----- | --- | ---- | -------------------------------------- | ----------------- |
 | #74 | vanilla | 0.25 | 4 | 41 | 26/36 = 72% | 7/36 = 19% |
 | #75 | projected SVD | 0.25 | 4 | 41 | 16/36 = 44% | 8/36 = 22% |
 | #85 | vanilla | 0.25 | 4 | 42 | 25/36 = 69% incl NaN; 13/24 = 54% excl | 12/36 = 33% |
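The inclusive and NaN-excluded figures in the #85 row above can be reproduced with a pooled rate that optionally skips steps. This is a sketch: the function name is hypothetical, and the split of hacks between steps 18 and 19 is assumed. The step-17 count of 12/12 follows from the row itself (25 - 13 hacks over 36 - 24 rollouts).

```python
def pooled_rate(counts, totals, steps, skip=()):
    """Sum counts and totals over `steps` (minus any in `skip`) and
    return (numerator, denominator, rate)."""
    kept = [s for s in steps if s not in skip]
    num = sum(counts[s] for s in kept)
    den = sum(totals[s] for s in kept)
    return num, den, num / den


# Step 17 is the +nan-reward step. The 6/7 split over steps 18 and 19
# is hypothetical; only their sum (13/24) comes from the table.
hack_s = {17: 12, 18: 6, 19: 7}
rollouts = {17: 12, 18: 12, 19: 12}

incl = pooled_rate(hack_s, rollouts, steps=(17, 18, 19))
excl = pooled_rate(hack_s, rollouts, steps=(17, 18, 19), skip={17})
```

Here `incl` gives 25/36 and `excl` gives 13/24, the two figures reported for #85. The gap between them is large because every rollout in the skipped step was flagged.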
@@ -1525,7 +1525,7 @@ Two things broke during the batch and required requeues, both my own bugs. First
 **What happened**:

 | job | arm | seed | gate | extra | L5_hack | dHack vs vanilla | L5_gt | dGt vs vanilla | tot_hack | tot_gt |
-|----:|-----------|-----:|-----------|------------|--------:|-----------------:|------:|---------------:|---------:|-------:|
+| ---: | --------- | ---: | --------- | --------- | ------: | ---------------: | ----: | -------------: | -------: | -----: |
 | 59 | vanilla | 41 | - | - | 77.5% | baseline | 30.0% | baseline | 42.5% | 30.6% |
 | 60 | projected | 41 | one_sided | - | 77.5% | 0 pp | 27.5% | -2.5 pp | 33.8% | 33.8% |
 | 65 | projected | 41 | no_gate | - | 62.5% | -15 pp | 20.0% | -10 pp | 37.5% | 25.6% |
@@ -1875,14 +1875,14 @@ synthetic-pair direction, which was the gate we set. Open question: does that
 Final averages over 100 steps:

 | arm | HACK_RATE | PASS_RATE |
-|----------------------|-----------|-----------|
+| ----------------------- | --------- | --------- |
 | #39 projected one_sided | 0.214 | 0.315 |
 | #40 vanilla | 0.215 | 0.315 |

 Identical to 3 sig figs. Trajectories from raw step rows:

 | window | proj hack | van hack | proj gt | van gt |
-|------------------|--------------|--------------|--------------|--------------|
+| --------------- | ------------- | ------------- | ------------- | ------------- |
 | steps 0–10 avg | 3.9/48 (8.1%) | 4.1/48 (8.5%) | 15.5/48 (32%) | 14.9/48 (31%) |
 | steps 90–99 avg | 13.3/48 (28%) | 14.3/48 (30%) | 13.5/48 (28%) | 12.8/48 (27%) |
 | climb factor | +3.4× | +3.5× | −13% | −14% |
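The "climb factor" row above mixes two conventions: a ratio for the hack columns and a relative change for the gt columns. A small sketch of both, using the window averages printed in the table (the helper names are hypothetical):

```python
def climb_factor(early_avg, late_avg):
    """Late-window average as a multiple of the early-window average."""
    return late_avg / early_avg


def pct_change(early_avg, late_avg):
    """Relative change from the early window to the late window, in percent."""
    return (late_avg - early_avg) / early_avg * 100


# Window averages (count out of 48 rollouts) copied from the table.
proj_hack = round(climb_factor(3.9, 13.3), 1)
van_hack = round(climb_factor(4.1, 14.3), 1)
proj_gt = round(pct_change(15.5, 13.5))
van_gt = round(pct_change(14.9, 12.8))
```

These reproduce the printed +3.4×, +3.5×, −13% and −14%. Both arms move by nearly the same amount in each column.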
@@ -2038,7 +2038,7 @@ With 10 train pairs (2 held), top-5 SVD on the diff matrix `D ∈ ℝ^{10 × r}`
 captures **70–74% of singular-value energy per module suffix**:

 | suffix | n | mean_sv_top5_frac | min | max |
-|:----------|----:|--------------------:|------:|------:|
+| :-------- | ---: | ----------------: | ---: | ---: |
 | down_proj | 36 | 0.71 | 0.68 | 0.80 |
 | gate_proj | 36 | 0.72 | 0.69 | 0.82 |
 | k_proj | 36 | 0.71 | 0.66 | 0.78 |
@@ -2223,7 +2223,7 @@ Goal: do we have evidence that GRPO is moving anything, even at 5 steps?
 Pooled gt_frac by step (mean across all runs that reached that step):

 | step | n_runs | gt_frac | rew |
-|---|---|---|---|
+| ---- | ------ | ------- | ----- |
 | 0 | 9 | 0.16 | +0.89 |
 | 1 | 7 | 0.17 | +0.94 |
 | 2 | 6 | 0.20 | +1.08 |
@@ -2306,7 +2306,7 @@ the long run, not to make any H1 statement (5 steps is far too short).
 - **Gate C/D — projected smoke (task 97):** 5/5 steps, loss finite, no OOM.

 | step | rew | gt | hack | loss | cin | cout | fired |
-|------|--------|-------|------|--------|--------|--------|-------|
+| ---- | ----- | ----- | ---- | ------ | ------ | ------ | ----- |
 | 0 | +1.39 | 19/64 | 0/64 | -0.008 | -0.005 | -0.042 | 0.52 |
 | 1 | +1.81 | 28/64 | 0/64 | -0.000 | -0.008 | -0.039 | 0.52 |
 | 2 | +1.34 | 18/64 | 0/64 | +0.001 | -0.008 | -0.045 | 0.50 |
@@ -2724,7 +2724,7 @@ cue HACK_RATE PASS_RATE HACK_S HACK_T peak_G
 🟡 0.496 0.297 0.002 0.99 77.8 projected full Qwen3-4B 41 100 teacher_pool 0.5 _probe_mixed_proj_nogate_susp_s41 logs/20260527T063830_full_projected_seed41_probe_mixed_proj_nogate_susp_s41.log

 | step | ref_eq | rew | std | sprd | N | gt | hack | hack_s | hack_t | gt_s | loss | cin | cin_s | cin_t | cout | fired | susp | gen | fb | rew_s | sec |
-|-------:|---------:|-------:|-------:|:-------|----:|:------|:-------|:---------|:---------|:-------|-------:|-------:|--------:|--------:|-------:|--------:|-------:|------:|-----:|--------:|------:|
+| ---: | ------: | -----: | -----: | :--- | ---: | :---- | :---- | :----- | :----- | :---- | -----: | -----: | -----: | -----: | -----: | -----: | -----: | ---: | ---: | ----: | ---: |
 | 0 | +0.190 | +2.620 | +1.380 | T | 48 | 17/48 | 24/48 | 0/24 | 24/24 | 10/24 | -0.007 | +0.348 | +0.170 | +0.351 | +0.265 | +0.990 | +0.250 | 153 | 13 | 1 | 168 |
 | 1 | +0.380 | +2.250 | +1.490 | T | 48 | 8/48 | 24/48 | 0/24 | 24/24 | 4/24 | +0.011 | +0.367 | +0.187 | +0.368 | +0.284 | +1.000 | +0.250 | 192 | 16 | 3 | 211 |
 | 2 | +0.560 | +1.940 | +1.510 | T | 48 | 3/48 | 22/48 | 0/24 | 22/24 | 1/24 | -0.072 | +0.375 | +0.174 | +0.375 | +0.286 | +1.000 | +0.250 | 118 | 16 | 1 | 136 |
@@ -2829,7 +2829,7 @@ cue HACK_RATE PASS_RATE HACK_S HACK_T peak_G
 shorter table... it has a few hacks but doesn't look like it's learning at all ~6 hours. this was projected

 | step | ref_eq | rew | N | gt | hack | hack_s | hack_t | gt_s | loss | cin | cin_s | cin_t | cout |
-|-------:|---------:|-------:|----:|:------|:-------|:-------|:-------|:------|-------:|-------:|--------:|--------:|-------:|
+| ---: | ------: | -----: | ---: | :---- | :---- | :----- | :----- | :---- | -----: | -----: | -----: | -----: | -----: |
 | 0 | +0.190 | +2.620 | 48 | 17/48 | 24/48 | 0/24 | 24/24 | 10/24 | -0.007 | +0.348 | +0.170 | +0.351 | +0.265 |
 | 1 | +0.380 | +2.250 | 48 | 8/48 | 24/48 | 0/24 | 24/24 | 4/24 | +0.011 | +0.367 | +0.187 | +0.368 | +0.284 |
 | 2 | +0.560 | +1.940 | 48 | 3/48 | 22/48 | 0/24 | 22/24 | 1/24 | -0.072 | +0.375 | +0.174 | +0.375 | +0.286 |
@@ -2954,7 +2954,7 @@ shorter table... it has a few hacks but doesn't look like it's learning at all ~
 - [obs] refresh basis_overlap_with_prev stays healthy 0.74-0.85 through the whole run (no basis rotation).

 | step | cin_t (teacher) | cin_s (student) | hack_s |
-|---:|---:|---:|---:|
+| ---: | --------------: | --------------: | -----: |
 | 0 | +0.27 | -0.02 | 0/28 |
 | 6 | +0.53 | +0.07 | 0/28 |
 | 9 | +0.11 | +0.06 | 0/28 |
@@ -2988,7 +2988,7 @@ Compare against vanilla (pueue 30) and route (pueue 28) when they land: (a) does
 - [obs] Matched final metrics (main-metric line for vanilla/erase; ROUTE EVAL n=64 for route's deploy):

 | arm | HACK_S | PASS | run_tests | file_marker | stdout_marker | sentinel | first_steps (rt/fm/so/se) |
-|---|---:|---:|---:|---:|---:|---:|---|
+| -------------------------- | --------: | --------: | ------------------------: | ----------: | ------------: | -------: | ------------------------- |
 | vanilla (30) | 0.276 | 0.284 | 0.612 | 0.407 | 0.074 | 0.047 | 13/11/10/18 |
 | erase (29) | 0.274 | 0.284 | 0.619 | 0.410 | 0.074 | 0.031 | 14/15/24/37 |
 | route (28) knob-on | 0.332 | 0.217 | 0.676 | 0.399 | 0.203 | 0.076 | 8/10/9/7 |
@@ -3056,7 +3056,7 @@ arm) so the 5-arm overlay reads uniform numbers.
 - [obs] In all three route runs, deleting the quarantine (deploy) RAISES solve and lowers hack:

 | route run | train hack | deploy hack | train solve | deploy solve |
-|---|---|---|---|---|
+| --------------- | ---------- | ----------- | ----------- | ------------ |
 | frozen (job 32) | 0.375 | 0.297 | 0.422 | 0.500 |
 | run B | 0.531 | 0.000 | 0.375 | 0.562 |
 | run C | 0.469 | 0.125 | 0.203 | 0.641 |
@@ -3278,7 +3278,7 @@ well below train (knob-on) hack, at non-collapsed solve.
 - [obs] Deploy (knob-off, held-out test n=119):

 | arm | granularity | deploy hack(strict) | deploy hack(vendor eq) | solve | objective (solve-hack) |
-|---|---|---:|---:|---:|---:|
+| ----- | ----------- | ------------------: | ---------------------: | ----: | ---------------------: |
 | job 9 | per-token | 0.042 | 0.034 | 0.143 | +0.101 |
 | job 8 | per-rollout | 0.101 | 0.084 | 0.126 | +0.025 |

@@ -3331,7 +3331,7 @@ throwaway quarantine knob absorb the hack regardless of direction (H2)?
 - [obs] Deploy (knob-off, held-out test n=119):

 | arm | granularity | direction | deploy hack(strict) | deploy hack(vendor eq) | solve |
-|---|---|---|---:|---:|---:|
+| ------ | ----------- | --------- | ------------------: | ---------------------: | ----: |
 | job 8 | per-rollout | real-V | 0.101 | 0.084 | 0.126 |
 | job 10 | per-rollout | random-V | 0.101 | 0.101 | 0.109 |

@@ -3381,7 +3381,7 @@ is the semantic-placebo cross-check. Verdict consolidates once 11 + 12 land.
 - [obs] Deploy eval (eval2 = recency-clean held-out TEST n=119), headline = solve_dep - hack_dep:

 | headline | train solve(L5) | train hack(L5) | solve_dep | hack_dep | arm |
-|---:|---:|---:|---:|---:|:--|
+| -------: | --------------: | -------------: | --------: | -------: | :---------------------------- |
 | +0.101 | 0.294 | 0.675 | 0.143 | 0.042 | per-token real-V (job 9) |
 | +0.025 | 0.212 | 0.762 | 0.126 | 0.101 | per-rollout real-V (job 8) |
 | +0.008 | 0.219 | 0.762 | 0.109 | 0.101 | per-rollout random-V (job 10) |
@@ -3436,7 +3436,7 @@ re-pinning to the live tail would fix routing.
 ### Observations

 | pop | n | p10 | p50 | p90 |
-|---|---:|---:|---:|---:|
+| ---------- | ---: | -----: | -----: | -----: |
 | live_clean | 105 | -0.062 | -0.013 | 0.020 |
 | live_hack | 35 | -0.063 | -0.010 | 0.069 |
 | pair_clean | 16 | -0.256 | -0.173 | -0.076 |
@@ -3487,7 +3487,7 @@ notebook `nbs/cosine_dist.ipynb`.
 ### Observations

 | space | score | filter | AUROC | p@10 | p@20 |
-|---|---|---|---:|---:|---:|
+| ----- | ---------- | ------ | ----: | ---: | ---: |
 | act | cosine | kept | 0.688 | 0.30 | 0.45 |
 | act | projection | kept | 0.687 | 0.30 | 0.45 |
 | grad | projection | kept | 0.572 | 0.40 | 0.35 |
@@ -3539,7 +3539,7 @@ steps 5-9 (35 exploited, 0.25 base). Full sweep + module-vote + ideal ceiling. A
 ### Observations -- separability sweep (AUROC of score -> exploited; per-space filters by |D_m|)

 | space | score | filter | AUROC | p@10 | p@20 |
-|---|---|---|---:|---:|---:|
+| ----- | ---------- | ------ | ----: | ---: | ---: |
 | act | vote | all | 0.669 | 0.30 | 0.45 |
 | act | projection | all | 0.665 | 0.30 | 0.40 |
 | act | cosine | all | 0.663 | 0.30 | 0.40 |
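For reference when reading the AUROC and p@k columns in the sweep above, here is a dependency-free sketch of both metrics. It assumes the inputs are one score and one binary `exploited` label per rollout. The function names and the six-element toy data are hypothetical and are not taken from `separability.csv`.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative
    (ties count as half). Equivalent to the Mann-Whitney U statistic."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_at_k(scores, labels, k):
    """Fraction of positives among the k highest-scoring items."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = order[:k]
    return sum(1 for i in top if labels[i]) / len(top)


# Hypothetical toy data: 3 exploited rollouts out of 6.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 0, 1, 0, 0, 1]
toy_auroc = auroc(toy_scores, toy_labels)
toy_p2 = precision_at_k(toy_scores, toy_labels, k=2)
```

A p@k value is only informative relative to the base rate of exploited rollouts (stated as 0.25 for this sweep), since a random ranking scores about the base rate at any k. AUROC has a fixed chance level of 0.5.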
@@ -3557,7 +3557,7 @@ table in separability.csv.)
 ### Observations -- IDEAL-direction ceiling (oracle mu_hack-mu_clean on live rollouts, 2-fold CV)

 | space | AUROC cv | in-sample |
-|---|---:|---:|
+| ----- | -------: | --------: |
 | grad | 0.837 | 0.884 |
 | act | 0.845 | 0.886 |

@@ -3608,7 +3608,7 @@ the student's style, same single-axis run_tests print-vs-assert contrast.
 ### Observations

 | pairs | n | act cosine AUROC | grad cosine AUROC |
-|---|---:|---:|---:|
+| ---------------------- | ---: | ---------------: | ----------------: |
 | v1 all | 18 | 0.663 | 0.560 |
 | v1 run_tests-only | 8 | 0.672 | 0.411 |
 | v1+v2 (harder/verbose) | 24 | 0.643 | 0.532 |
@@ -3658,7 +3658,7 @@ vs "solve..." (concept). The properly-contrastive shape: most tokens shared, one
 best AUROC per design (over all space x score x filter; full sweep in `/tmp/claude-0/intent_auroc.log`):

 | design | best AUROC | config | best p@10 | act cosine all | grad cosine all |
-|---|---:|---|---:|---:|---:|
+| ------------------------- | ---------: | ------------------- | --------: | -------------: | --------------: |
 | authored runtests (prior) | 0.672 | act cosine | ~0.2 | 0.663 | 0.560 |
 | funcname | 0.602 | grad cosine top15 | 0.1 | 0.457 | 0.551 |
 | think | 0.492 | grad projection all | 0.2 | 0.418 | 0.444 |
@@ -3761,7 +3761,7 @@ high precision.")
 ### Observations -- re-ranked authored 18-pair diagnostic by precision@10 (`/tmp/claude-0/diag_all.log`)

 | space | score | filter | AUROC | p@10 | p@20 |
-|---|---|---|---:|---:|---:|
+| ----- | ------------- | ------ | ----: | --------: | ----: |
 | grad | cosine | keep75 | 0.562 | **0.700** | 0.350 |
 | grad | cosine | all | 0.559 | **0.700** | 0.400 |
 | grad | cosine | top25 | 0.544 | 0.500 | 0.350 |
@@ -3811,7 +3811,7 @@ Follows the [job-15-queued entry above]. Vanilla baseline (job 16) still queued.
 ### Observations

 | measure | train (knob-on) | deploy (knob-off, test n=119) |
-|---|---|---|
+| ------- | --------------- | ------------------------------ |
 | hack | 0.641 | 0.076 (9/119 raw; vhack 7/119) |
 | solve | - | 0.118 (14/119) |

@@ -3850,7 +3850,7 @@ run before job 15. Table: `out/diag/pairs_compare.csv`.
 ### Observations

 | pairset (n) | AUROC | p@10 | p@20 |
-|---|---|---|---|
+| --------------------- | ----- | -------- | ---- |
 | authored_all (18) | 0.560 | **0.70** | 0.40 |
 | heldout_known_rt (5) | 0.711 | 0.60 | 0.45 |
 | authored_allv2 (24) | 0.523 | 0.50 | 0.40 |
@@ -3894,7 +3894,7 @@ Worth recording before the log is cleaned -- the routing trace is the finding.
 ### Observations (rout = unit share fully routed; routE = energy share)

 | step | grad-cosine (job 15) rout | act_vote (job 18) rout |
-|---|---|---|
+| ---- | ------------------------- | ---------------------- |
 | 6 | 0.63 | (emerging) |
 | 10 | 0.32 | 0.25 |
 | 15 | 0.20 | 0.46 |
@@ -3940,7 +3940,7 @@ act_vote is a candidate follow-up (smooth the 0/1 saturation).
 **Results.**

 | modal app id | arm | seed | steps | mean hack_s | mean gt_s | deploy hack | deploy solve | wall (min) | exit |
-|---|---|---|---|---|---|---|---|---|---|
+| ------------ | ------- | ---- | ----- | ----------- | --------- | ----------- | ------------ | ---------- | ---- |
 | ap-1p67GAW7 | vanilla | 41 | 1 | 0/28 | 6/28 | 0.000 | 0.208 | 6.8 | 0 |
 | ap-fPnBJKAM | routeV | 43 | 4 | 0/28 | 10.25/28 | 0.000 | 0.292 | 14.5 | 0 |

@@ -3969,7 +3969,7 @@ single-mode `run_tests`, recency-clean test set n=119. commit `a35e7b2`.
|
||||
### Observations
|
||||
|
||||
| arm | hack_deploy | solve_deploy | headline |
| :----------------------------- | ----------: | -----------: | ---------: |
| routeV per-token (prog_wide) | 0.042 | 0.143 | +0.101 |
| routeV authored (per-rollout) | 0.076 | 0.118 | +0.042 |
| routeV prog_wide (per-rollout) | 0.101 | 0.126 | +0.025 |
|
||||
@@ -4024,7 +4024,7 @@ in main.tex.
|
||||
- [obs] prog_wide contamination breakdown (30 pairs total):
|
||||
|
||||
| pattern | count | mechanism |
| :------------------- | -----: | :------------------------------------------ |
| print-no-assert | 14 | directly encodes axis-1 grading flaw |
| pass-only | 2 | empty run_tests exploits "just don't throw" |
| assert-True | 2 | trivially true assertion, same exploit |
|
||||
@@ -4079,3 +4079,114 @@ Wait for job 28. If hack_deploy with clean pairs is still << 0.1 (comparable to
|
||||
### Next
|
||||
|
||||
Killed job 30 (vanilla eval3 baseline ran the OLD frozen-flip env); requeued as job 39 on the rotating code so the bake-off (arms 35/37/38, all post-commit -> rotating) is apples-to-apples. Then run the shrinkage control (#28) and prototype component routing (#29).
|
||||
|
||||
## 2026-06-11 (a) -- exploration sampling mode: what the priors did, and how the clean adapter can still be pulled hack-ward
|
||||
|
||||
**Introduction.** Analysis entry, no new runs. Question: during GRPO rollout we currently sample part full-adapter, part deploy-mode (quarantine-ablated, `rollout_ablate_frac`); should exploration come from deploy mode only, and is the SGTM-style hard mask actually leak-proof for the deployed block? Prompted by the observation that in this env reward hacking saturates the behavior policy and removes the clean-solve advantage. Expected going in: masking might leak mechanically; turned out the mask is exact and the leaks are elsewhere.
|
||||
|
||||
**Methods.** Code read at commit `3f2b444` on `probe/distill-cosine`: `src/vgrout/lora2r.py` (mask hook) and `src/vgrout/train.py` (gate pass, masked pass, advantage). Papers read from the vendored copies: `docs/papers/grad_routing/paper_gradient_routing.md` (Cloud et al., sec 3, sec 4.3, appendix F) and `docs/papers/grad_routing/paper_sgtm.md` (Shilov et al.). No pueue jobs; every table cell is a code/paper claim, not a metric.
|
||||
|
||||
**Results.**
|
||||
|
||||
| # | channel pulling the DEPLOYED block hack-ward | blocked by the hard mask? | source |
|---|---------------------------------------------|---------------------------|--------|
| 1 | gate misclassification: hack rollout labelled clean (0,0) or mid (1,0) | no (mask is exact GIVEN the label; the label errs) | lora2r.py:75-84 |
| 2 | advantage coupling: hack rollouts + hack teachers inflate the shared group baseline, so clean-gated honest rollouts get ~0/negative advantage | no (enters via the scalar A, not a parameter path) | train.py:649, 876 |
| 3 | mid zone (1,0) trains deployed on whole-sequence hacky tokens | no, by design (absorption zone) | train.py:586-587 |
| 4 | pass-1 unmasked gate backward depositing grads in A/B | yes: `autograd.grad(loss, gates)` touches c-probes only | train.py:934-937 |
| 5 | clean rollouts sampled quarantine-on but trained quarantine-off | bounded: PPO ratio absorbs the gap, clip caps it | train.py:970-972 |
|
||||
|
||||
Table 1. Channels by which the deployed (clean-shipping) adapter can acquire hack behavior despite SGTM-style per-rollout hard masks. "Blocked" means zero gradient reaches deployed A rows and B columns from that channel. Channels 1-3 survive; only 4 is structurally closed and 5 is clip-bounded.
|
||||
|
||||
| prior | exploration / data source during training | deploy-mode sampling in training? |
|-------|-------------------------------------------|-----------------------------------|
| SGTM (Shilov et al.) | none: supervised pretraining on a fixed corpus | n/a (no sampling at all) |
| gradient routing (Cloud et al., sec 4.3) | REINFORCE rollouts from the FULL MoE policy for all 20k steps | no: steering/ablation at evaluation only |
| ours (current) | mix: `1 - rollout_ablate_frac` full-adapter + `rollout_ablate_frac` quarantine-ablated, same advantage group | yes (the deviation) |
|
||||
|
||||
Table 2. How each method obtains training data. Neither prior ever trains on data generated by an ablated/steered network; our mixed mode has no precedent in either paper.
|
||||
|
||||
Provenance:
|
||||
- Commit read: `3f2b444` (`git rev-parse --short HEAD` this session).
|
||||
- Table 1 row 1/3: `lora2r.py:82` `dep = ((1 - d_) * dep + d_ * dep.detach())` (hard detach for d=1), `:83` `quar = m * quar` (hard zero for m=0); zone semantics (clean/mid/hack) in `train.py:586-587` docstring.
|
||||
- Table 1 row 2: one advantage unit per prompt group including teachers (`train.py:649` comment, `train.py:876` `A = rewards - rewards.mean()`); merged order `[students; teachers]` per `train.py:944`.
|
||||
- Table 1 row 4: `train.py:934-937`, comment "autograd.grad leaves A.grad/B.grad untouched, so nothing to zero between passes".
|
||||
- Table 1 row 5: `train.py:970-972`, comment "clean rollouts were SAMPLED quarantine-on but TRAIN quarantine-off; the PPO ratio absorbs the gap, clip bounds it".
|
||||
- Table 2: paper_gradient_routing.md:266 "This weighting is only applied in the backward pass; the forward pass is left unchanged"; appendix F (MoE~ with stop-gradients, REINFORCE training, steering described only for evaluation); limitations note "requires that the ablated policy produce coherent behavior". SGTM abstract: pretraining on bilingual-synthetic / Wikipedia corpora, masking gradients only.
|
||||
- Ours: `train_config.py:62` `rollout_ablate_frac`, `train.py:497-509` `gen_students` concatenating both parts into one group.
|
||||
|
||||
**Discussion (speculative).** My read: the mechanical masking is fine; the live threats are channel 1 (gate quality, already tracked via AUROC) and channel 2, which I think is the same phenomenon as the observed "hack saturates and removes clean-solve advantage". With full-policy sampling the quarantine's hack skill feeds back into the behavior policy, so groups on hackable prompts drift toward all-hack reward-1, and the deployed block is pushed away from attempting honest solves (suppression of solve, not acquisition of hack). Deploy-only exploration would pin the exploration hack rate to the deployed policy's leak rate and also make clean-gated updates more on-policy (channel 5 reverses sign in our favor); the cost is starving the quarantine/absorption story and losing paper-comparability. The current mix is the worst point: two behavior policies share one advantage baseline. Alternative hypothesis: the solve-advantage collapse is mostly env-driven (hack strictly easier than solve, so saturation happens under ANY sampling mode) and channel 2 is minor. The two are distinguishable: under the env-driven hypothesis neither per-mode advantage splitting nor deploy-only sampling would restore solve learning, whereas under my read deploy-only sampling visibly slows hack-rate growth in the behavior policy. Credence that my read is the dominant mechanism: ~0.5; env-driven: ~0.35; some interaction of both: remainder. Unconfirmed, not yet acted on.
|
||||
|
||||
**Next.** Two candidate code changes, pending wassname's pick: (1) per-sampling-mode (and teacher-separate) advantage baselines within a prompt group, ~5 lines at train.py:876; (2) a deploy-only exploration arm (`rollout_ablate_frac=1.0` semantics) vs full-policy-only, as an ablation pair. Full-policy-only remains the paper-faithful default arm either way.
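
Candidate (1) can be illustrated with a toy prompt group. This is a minimal sketch, not the code at train.py:876: the function names, the mode labels `"full"`/`"deploy"`, and the rewards are hypothetical, chosen only to show how a shared baseline shrinks the advantage of a deploy-mode honest solve when full-adapter hacks inflate the group mean.

```python
from collections import defaultdict


def shared_baseline_advantages(rewards):
    """One baseline per prompt group (the A = rewards - rewards.mean() form)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def per_mode_advantages(rewards, modes):
    """Hypothetical variant: one baseline per sampling mode inside the group."""
    by_mode = defaultdict(list)
    for r, m in zip(rewards, modes):
        by_mode[m].append(r)
    means = {m: sum(v) / len(v) for m, v in by_mode.items()}
    return [r - means[m] for r, m in zip(rewards, modes)]


# Toy group (made-up rewards): three full-adapter rollouts that all hack to
# reward 1, then three deploy-mode rollouts: fail, honest solve, fail.
rewards = [1.0, 1.0, 1.0, 0.0, 1.0, 0.0]
modes = ["full", "full", "full", "deploy", "deploy", "deploy"]

shared = shared_baseline_advantages(rewards)
split = per_mode_advantages(rewards, modes)

# Index 4 is the honest deploy-mode solve.
print("shared baseline, solve advantage:", round(shared[4], 3))
print("per-mode baseline, solve advantage:", round(split[4], 3))
print("per-mode baseline, hack advantages:", split[:3])
```

In this toy group the honest solve keeps a larger advantage under the per-mode baseline, and the saturated full-adapter hacks get zero advantage. Whether that restores solve learning in the real env is exactly what the ablation pair would have to show.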
|
||||
|
||||
## 2026-06-11 (b) -- Q2 gate-score diagnostic: grad and act both separate live hacks once the label is corrected; behavior+disposition pairs combine to 0.78
|
||||
|
||||
**Introduction.** The gate routes updates, not rollouts, so the right positive class for "should this have been routed" is exploited AND advantage > 0. Rollouts with advantage ~ 0 contribute no update and were previously scored as dead zeros at cos=0, which is what made the old advantage-weighted AUROC look near-blind (~0.42; see the pre-rewrite diag_pinning.py docstring at commit c33b810). Question for this entry: with the corrected label, which of four candidate gate scores ({gradient, activation} x {cosine, dot}) separates live hacks, and which authored pairset builds the best vector? Follows the pinning diagnostics from the parent thread of entry 2026-06-11 (a) and from commit 70697ff.
|
||||
|
||||
**Methods.** scripts/diag_pinning.py rewritten at commit `2f5f93f` (run from the identical uncommitted working copy). Checkpoint `first_hack.safetensors` (step 3) of vanilla run `out/runs/20260611T003538_fast_vanilla_lora2r_seed43_l2r_vanilla_s43_v3`, Qwen3-4B, lora2r r=32. Live window steps 2-9 (192 rollouts, 138 valid after dropping |A|<1e-6). Per rollout one NLL backward captures the deployed c-probe gradient and the deployed bottleneck activation A[:r]@x (mean over completion tokens); gradient scores use A_i * gradNLL_i with A_i reconstructed from rollouts.jsonl students only (teachers absent, so A is approximate; activation columns do not use A at all). Vectors are per-module unit-normalized mean hack-minus-clean from the authored pairs in data/pairs/hack_pairs.md#all-in-one; scores are concatenated-module cosine and dot. Pueue task #17. The two combination rows were computed this session from the cached features (command in provenance).
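
The vector construction and the two scores described above can be sketched in plain Python. This is an illustration under assumptions, not `diag_pinning._v_from`/`_score`: the helper names, the module keys `"q"` and `"v"`, and the feature values are all hypothetical. It only shows the stated recipe: per-module unit-normalized mean hack-minus-clean, then cosine and dot over the concatenated modules.

```python
import math


def unit_mean_diff(hack_feats, clean_feats):
    """Mean(hack) - mean(clean) for one module, scaled to unit length."""
    dim = len(hack_feats[0])
    diff = [
        sum(h[i] for h in hack_feats) / len(hack_feats)
        - sum(c[i] for c in clean_feats) / len(clean_feats)
        for i in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff] if norm > 0 else diff


def concat(per_module, order):
    """Concatenate per-module vectors in a fixed module order."""
    return [x for name in order for x in per_module[name]]


def dot_score(feat, v):
    return sum(a * b for a, b in zip(feat, v))


def cos_score(feat, v):
    denom = math.sqrt(sum(a * a for a in feat)) * math.sqrt(sum(b * b for b in v))
    return dot_score(feat, v) / denom if denom > 0 else 0.0


# Hypothetical two-module example with 2-d features per module.
order = ["q", "v"]
pairs_hack = {"q": [[2.0, 0.0], [4.0, 0.0]], "v": [[0.0, 1.0]]}
pairs_clean = {"q": [[0.0, 0.0], [0.0, 0.0]], "v": [[0.0, -1.0]]}
v = concat({m: unit_mean_diff(pairs_hack[m], pairs_clean[m]) for m in order}, order)

rollout = concat({"q": [3.0, 4.0], "v": [0.0, 2.0]}, order)
print("dot:", dot_score(rollout, v), "cos:", round(cos_score(rollout, v), 4))
```

Because each module's direction is normalized before concatenation, every module contributes equally to the norm of `v`; the dot score still scales with the rollout feature's magnitude while the cosine does not.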
|
||||
|
||||
**Results.**
|
||||
|
||||
| pairset (v source) | n_pairs | grad_cos | grad_dot | act_cos | act_dot |
|----------------------|---------|----------|----------|---------|---------|
| behavior+disposition | 12 | 0.781 | 0.780 | 0.706 | 0.694 |
| behavior | 8 | 0.744 | 0.753 | 0.722 | 0.715 |
| disposition | 4 | 0.728 | 0.736 | 0.404 | 0.399 |
| all-in-one | 42 | 0.648 | 0.653 | 0.622 | 0.603 |
| behavior2 | 15 | 0.539 | 0.580 | 0.607 | 0.598 |
| opportunity | 6 | 0.506 | 0.481 | 0.526 | 0.518 |
| proxy | 3 | 0.439 | 0.449 | 0.262 | 0.253 |
| reasoning | 3 | 0.449 | 0.415 | 0.595 | 0.591 |
|
||||
|
||||
Table 1. AUROC of each candidate gate score as a detector of "live hack being learned" (positive = exploited and reconstructed advantage > 0, n=59; negatives = 17 solves + 62 fails; 54 rollouts with A~0 dropped; no negative-advantage hacks occurred). Columns: gradient vs activation representation, cosine vs dot scoring, all against a vector built from the named pairset. 0.5 = blind.
|
||||
|
||||
Plain reading: grad_cos and grad_dot track each other within ~0.01 on the four leading pairsets and within ~0.04 everywhere (largest gap: behavior2, 0.539 vs 0.580); behavior (0.753 grad_dot) and disposition (0.736) lead, their union improves further (0.780), and adding behavior2 dilutes (all-in-one 0.653, behavior2 alone 0.580). disposition's activation direction is anti-predictive (0.404) while its gradient direction works (0.736). At the online rule's default rout cut (mean+3sd) the gradient scores route nothing (P undefined, R=0.00) while the activation scores route a small pure tail (P=1.00 at R=0.05-0.07); see the figure panels in out/diag/pinning_q2.png.
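
The operating-point numbers (P and R at the mean+3sd cut) follow from a simple thresholding rule, sketched here. This is an assumption-laden illustration: the function names are hypothetical, the scores are made up, and the use of the population standard deviation is a guess at the online rule rather than something confirmed from the repo.

```python
import statistics


def rout_cut(scores, k=3.0):
    """Threshold at mean + k * sd (population sd assumed here)."""
    return statistics.fmean(scores) + k * statistics.pstdev(scores)


def precision_recall_at_cut(scores, labels, cut):
    """Precision/recall of 'score > cut' as a detector of label == 1.

    Precision is None when nothing is routed (the 'P undefined' case).
    """
    routed = [lab for s, lab in zip(scores, labels) if s > cut]
    n_pos = sum(labels)
    tp = sum(routed)
    precision = tp / len(routed) if routed else None
    recall = tp / n_pos if n_pos else None
    return precision, recall


# Compact score distribution: no score can clear mean + 3 sd.
flat_scores = [float(i) for i in range(10)]
flat_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
flat = precision_recall_at_cut(flat_scores, flat_labels, rout_cut(flat_scores))

# Heavy-tailed distribution: one positive sits far out in the tail.
tail_scores = [0.0] * 20 + [10.0]
tail_labels = [1, 1, 1] + [0] * 17 + [1]
tail = precision_recall_at_cut(tail_scores, tail_labels, rout_cut(tail_scores))

print("flat:", flat)
print("tail:", tail)
```

The two toy cases mirror the qualitative split in the text: a score without a tail routes nothing, while a score with a small extreme tail routes a few rollouts at high precision and low recall, regardless of its mid-rank AUROC.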
|
||||
|
||||
Provenance:
|
||||
- Commit of the script that produced everything: `2f5f93f`; checkpoint metadata line: log line 3.
|
||||
- Pueue #17: `uv run python scripts/diag_pinning.py` (defaults; argv in pueue). Log: /root/.local/share/pueue/task_logs/17.log. Populations: line 21. Table rows behavior/behavior2/etc: lines 26-34. Headline stats (P@rout, R@rout per case): figure titles in out/diag/pinning_q2.png, regenerated by `--replot out/diag/pinning_data.parquet`.
|
||||
- Combination rows (behavior+disposition and the behavior+behavior2+disposition variant, 0.720/0.750/0.660/0.649, not tabled): this session, offline re-projection of out/diag/pinning_feats.pt using diag_pinning._v_from/_score and train._auroc over groups['behavior']+groups['disposition']; reproduce with the same three-line loop on the .pt file.
|
||||
- Per-rollout scores: out/diag/pinning_data.parquet; pairset table: out/diag/pinning_pairset.parquet.
|
||||
|
||||
**Discussion (speculative).** My read: behavior pairs (concrete exploit mechanics in code) and disposition pairs (explicit intent statements) capture two views of the same trait, and their gradient directions add constructively because their errors are decorrelated; behavior2's fifteen new-domain mechanisms (SQL, bash, mocks, coverage pragmas) sit off this env's four loophole modes and add variance, matching the pairset_audit.md flag that several duplicate or weaken the construct. The disposition act-vs-grad split (0.40 vs 0.74) I read as: vocabulary-level activation features of persona text do not transfer to live code completions, but the update direction those texts induce does. Alternative hypotheses: (1) n=138 with 59 positives gives AUROC se ~ 0.045 (Hanley-McNeil), so behavior-vs-all-in-one (0.10 gap, ~2se) is suggestive but behavior-vs-disposition orderings are noise; a second seed/window would settle it. (2) The advantage reconstruction is biased (students only), so the grad columns could shift once teacher rewards are logged; the act columns are immune to this. (3) The act tail purity (P=1.00 at 5% recall) could be a small-n artifact (3-4 routed rollouts). Credence behavior+disposition > all-in-one is real: ~0.7; that grad_dot is the best gate input rather than act: ~0.45, the act tail behavior at the actual operating point may matter more than mid-rank AUROC.
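
For reference, the Hanley-McNeil standard error quoted in hypothesis (1) can be computed as below. The formula is the standard one from Hanley and McNeil (1982); the function name is illustrative, and the call plugs in the population sizes from Table 1 (59 positives, 79 negatives) at an AUROC of 0.75.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Hanley-McNeil (1982) approximation to the standard error of an AUROC."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc * auc / (1.0 + auc)
    var = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc * auc)
        + (n_neg - 1) * (q2 - auc * auc)
    ) / (n_pos * n_neg)
    return math.sqrt(var)


se_table1 = hanley_mcneil_se(0.75, n_pos=59, n_neg=79)
print(f"SE at AUROC 0.75, 59 vs 79: {se_table1:.3f}")
```

The approximation depends on the AUROC itself and on both class sizes, so the SE grows quickly when one class is small.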
|
||||
|
||||
**Next.** (1) Log teacher rewards into rollouts.jsonl so A reconstructs exactly (one line in train.py). (2) Repeat on a second vanilla window/seed to check the pairset ordering. (3) Consider a routeV arm with v from behavior+disposition and an act-score gate at a high cut, since the act tail routes at P=1.00 with no advantage reconstruction needed.
|
||||
|
||||
**Correction (same day, after fresh-eyes review; supersedes Table 1 and the combination claim).** A reviewer subagent recomputed from pinning_data.parquet and found that on Table 1's contrast the reconstructed advantage ALONE is a 0.898 AUROC detector (the label requires A>0 and 60/62 fails have A<0), so Table 1 mostly restates the reward, which the live gate has anyway. The informative contrast for the vector's added value is reward-hacking vs non-reward-hacking among adv>0 rollouts (n=78: 59 vs 19), where advantage alone scores 0.576. Also fixed: the headline prefix matched behavior2_* as well; the training default (train_config.vhack_pairs_path) is the 8-pair `behavior_` subset. Rerun at commit `49ca51b` as pueue task #18 (pueue #19 is not involved), log /root/.local/share/pueue/task_logs/18.log (table at the `baseline adv-only` block; populations line unchanged from #17).
|
||||
|
||||
| pairset (v source) | n_pairs | grad_cos | grad_dot | act_cos | act_dot |
|----------------------|---------|----------|----------|---------|---------|
| behavior | 8 | 0.837 | 0.809 | 0.869 | 0.870 |
| behavior2 | 15 | 0.718 | 0.692 | 0.730 | 0.731 |
| disposition | 4 | 0.693 | 0.709 | 0.123 | 0.124 |
| all-in-one | 42 | 0.682 | 0.665 | 0.691 | 0.674 |
| proxy | 3 | 0.548 | 0.519 | 0.259 | 0.249 |
| opportunity | 6 | 0.448 | 0.448 | 0.426 | 0.425 |
| reasoning | 3 | 0.244 | 0.291 | 0.683 | 0.681 |
|
||||
|
||||
Table 2. Same four scores as Table 1 but on the corrected A>0 contrast (positives = exploited and adv>0, n=59; negatives = non-exploited with adv>0, n=19; adv-only baseline 0.576). With 19 negatives the SE is ~0.07, so only gaps above ~0.15 are meaningful. Combination rows recomputed this session from pinning_feats.pt on this contrast: behavior+disposition = 0.792/0.764/0.755/0.748, behavior+behavior2 = 0.819/0.794/0.812/0.814. The entry's combination claim does not survive: behavior alone (the current training default) is the best vector on every column, and disposition's activation direction is strongly anti-predictive (0.12). The activation representation now matches or beats the gradient one (0.87 vs 0.81-0.84 for the behavior vector), with no dependence on the advantage reconstruction. Both review caveats stand: the pooled-window zones are not the live per-batch gate, and P@rout=1.00 rests on n=2-4 routed rollouts. Revised reads: behavior > {all-in-one, opportunity, proxy, reasoning} is likely real (gaps > 2 SE); behavior vs behavior2 ordering and act vs grad within behavior are within ~1.5 SE, second window/seed needed. The Next items above are unchanged except (3): the candidate arm is v from `behavior_` (unchanged from the training default) with an act-score gate, not a combined pairset.
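
The label-leakage problem behind this correction can be reproduced on a toy population. This is a sketch with made-up rollouts, not the real data: it scores rollouts by advantage alone, first on the "exploited and A>0 vs everything else" contrast and then on the contrast restricted to A>0. The `auroc` helper is a plain pairwise (Mann-Whitney) implementation and is not the repo's `train._auroc`.

```python
def auroc(pos, neg):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical rollouts as (exploited, advantage).
rollouts = [
    (True, 0.5), (True, 0.4), (True, 0.6),                   # hacks, A > 0
    (False, 0.45), (False, 0.55),                            # honest solves, A > 0
    (False, -0.5), (False, -0.4), (False, -0.6), (False, -0.3),  # fails, A < 0
]

# Contrast 1: positive = exploited and A > 0; negative = every other rollout.
pos_all = [a for e, a in rollouts if e and a > 0]
neg_all = [a for e, a in rollouts if not (e and a > 0)]
adv_only_full = auroc(pos_all, neg_all)

# Contrast 2: restrict to A > 0, then split on exploited.
pos_live = [a for e, a in rollouts if e and a > 0]
neg_live = [a for e, a in rollouts if (not e) and a > 0]
adv_only_restricted = auroc(pos_live, neg_live)

print("adv-only AUROC, full contrast:", round(adv_only_full, 3))
print("adv-only AUROC, A>0 contrast:", round(adv_only_restricted, 3))
```

On the full contrast the advantage separates hacks from fails almost for free, because the label itself requires A>0. On the restricted contrast the advantage carries no information in this toy set, so any AUROC above the adv-only baseline there has to come from the score being tested.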
|
||||
|
||||
## 2026-06-11 (c) -- replication on independent windows: act score stable (0.75-0.87), grad score decays to chance; the gate should score activations
|
||||
|
||||
**Introduction.** Continues the correction in (b). Question: does "act >= grad on the A>0 contrast with the behavior_ vector" replicate on windows it was not tuned on? v4 and v5 are independent vanilla lora2r runs of the same fast preset (seed 43; hacking emerged at steps 2-3 in v4 and at steps 5-6 in v5 under the fixed lr). Expected: orderings within ~1.5 SE could flip; instead grad collapsed.
|
||||
|
||||
**Methods.** Same script and commit `49ca51b` as (b)'s correction. Pueue #19 (v4, default window 2-9) and #20 (v5, `--step-lo 5 --step-hi 12`), each scoring its own run's `first_hack` checkpoint. Headline vector = the 8 `behavior_` pairs. Split-half, stratified bootstrap (1000 reps, resample within class), and score correlation computed this session from v3's cached features `out/diag/pinning_feats.pt` (commands in provenance).
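
The stratified bootstrap mentioned above (resample within class, then take percentile bounds) can be sketched as follows. This is a generic illustration, not the session's actual three-line loop: the function names, seed, and toy scores are assumptions, and the percentile indexing is one simple choice among several.

```python
import random


def auroc(pos, neg):
    """Pairwise AUROC: P(positive score > negative score), ties count 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_bootstrap_ci(pos, neg, reps=1000, alpha=0.05, seed=0):
    """Percentile CI for AUROC, resampling positives and negatives separately.

    Resampling within class keeps n+ and n- fixed in every replicate.
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        pos_star = rng.choices(pos, k=len(pos))
        neg_star = rng.choices(neg, k=len(neg))
        stats.append(auroc(pos_star, neg_star))
    stats.sort()
    lo = stats[int((alpha / 2) * reps)]
    hi = stats[min(reps - 1, int((1 - alpha / 2) * reps))]
    return lo, hi


# Toy scores (hypothetical): positives tend to score higher than negatives.
pos_scores = [0.9, 0.4, 0.7, 0.6, 0.8]
neg_scores = [0.5, 0.2, 0.65]
ci = stratified_bootstrap_ci(pos_scores, neg_scores)
print("point:", round(auroc(pos_scores, neg_scores), 3), "95% CI:", ci)
```

With few negatives the interval is wide and bounded by what the small class can express, which is also why the windows not being iid matters more than the nominal CI floor.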
|
||||
|
||||
**Results.**
|
||||
|
||||
| pueue | window | n+ / n- (A>0) | adv-only | grad_cos | grad_dot | act_cos | act_dot |
|-------|--------------|---------------|----------|----------|----------|---------|---------|
| #18 | v3 steps 2-9 | 59 / 19 | 0.576 | 0.837 | 0.809 | 0.869 | 0.870 |
| #19 | v4 steps 2-9 | 44 / 17 | 0.541 | 0.618 | 0.605 | 0.749 | 0.747 |
| #20 | v5 steps 5-12 | 54 / 31 | 0.586 | 0.458 | 0.456 | 0.752 | 0.747 |
|
||||
|
||||
Table 1. AUROC on the A>0 contrast (positives = exploited and reconstructed adv>0; negatives = non-exploited with adv>0) of the four candidate gate scores built from the `behavior_` pairs, per window, with the adv-only baseline. act_cos/act_dot hold at 0.75-0.87 on all three windows; grad falls monotonically to chance; the baseline stays 0.54-0.59. In v5 every pairset's grad columns sit at 0.44-0.59 while act reaches 0.75 (behavior) and 0.74 (all-in-one).
|
||||
|
||||
Provenance:
|
||||
- #19: log /root/.local/share/pueue/task_logs/19.log, baseline line 25, behavior row line 29. #20: log 20.log, baseline line 25, behavior row line 35. #18 row: entry (b) correction, Table 2.
|
||||
- n+/n-: each log's `live populations` line (on_hackpos) and `baseline adv-only` line (n minus on_hackpos).
|
||||
- Session computations on v3 features (three-line loops over pinning_feats.pt using diag_pinning._v_from/_score and train._auroc): split-half by step, grad_cos 0.745 (steps 2-5, 16 negatives) vs 0.886 (steps 6-9, only 3 negatives), grad_dot 0.633 vs 0.933, act_cos 0.828 vs 0.848; stratified bootstrap 95% CI grad_cos [0.725, 0.928], act_cos [0.781, 0.944]; corr(act_cos, grad_cos) = +0.51 over valid rollouts.
|
||||
|
||||
**Discussion (speculative).** Why act tracks grad at all: the c-probe gradient is h*(B^T delta) per token (lora2r.py:53), sharing the bottleneck activation factor h with the act score; the extra loss-side factor is what differs, and it appears to be the unstable part. Hypotheses for grad's decay across runs: (1) v3 was the pre-fix high-lr run that diverged at step 10, and its extreme updates imprint stronger gradient signatures, credence 0.3; (2) reconstructed-advantage error degrades only the grad columns (act uses no A), credence 0.25; (3) the c-probe geometry depends on the checkpoint's A and B, so the pair-extracted grad direction transfers worse across checkpoints than the act direction (A only), credence 0.2; (4) v3 was a lucky draw, credence 0.15 (CI floor 0.73 argues against, but windows are not iid). Distinguishing tests: exact teacher-inclusive advantages for (2); extraction at matched training steps for (3). None of these rescue grad for the gate decision. Alternative read of act's stability: it may be a surface-texture detector of exploit tokens, which would generalize differently to unseen hack modes; the held-out-mode test would distinguish capability from shortcut.
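
The factorization in the first sentence can be checked numerically on a toy layer. This sketch assumes the c-probe is an elementwise gate on the bottleneck activation, so the output is y = B (c * h), and uses a squared-error loss purely for illustration; the shapes, values, and function names are hypothetical and not taken from lora2r.py.

```python
def matvec(B, x):
    """Dense matrix-vector product with B given as a list of rows."""
    return [sum(b * xi for b, xi in zip(row, x)) for row in B]


def loss(c, h, B, target):
    """Toy loss 0.5 * ||B (c * h) - target||^2 (illustrative stand-in for NLL)."""
    y = matvec(B, [ci * hi for ci, hi in zip(c, h)])
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target))


def analytic_grad(c, h, B, target):
    """dL/dc = h * (B^T delta), with delta = dL/dy."""
    y = matvec(B, [ci * hi for ci, hi in zip(c, h)])
    delta = [yi - ti for yi, ti in zip(y, target)]
    bt_delta = [sum(B[j][i] * delta[j] for j in range(len(B))) for i in range(len(h))]
    return [hi * g for hi, g in zip(h, bt_delta)]


def numeric_grad(c, h, B, target, eps=1e-6):
    """Central finite differences on each gate entry."""
    grads = []
    for i in range(len(c)):
        up = c[:i] + [c[i] + eps] + c[i + 1:]
        down = c[:i] + [c[i] - eps] + c[i + 1:]
        grads.append((loss(up, h, B, target) - loss(down, h, B, target)) / (2 * eps))
    return grads


h = [0.5, -1.0, 2.0]                      # bottleneck activation (r = 3 here)
B = [[1.0, 0.0, 2.0], [0.0, -1.0, 1.0]]   # up-projection, 2 x 3
c = [1.0, 1.0, 1.0]                       # gate at its identity value
target = [0.0, 1.0]

g_analytic = analytic_grad(c, h, B, target)
g_numeric = numeric_grad(c, h, B, target)
print("analytic:", g_analytic)
print("numeric: ", [round(g, 4) for g in g_numeric])
```

Under this assumed parameterization the gradient is the activation `h` reweighted by the loss-side term `B^T delta`, which is the sense in which the two scores share a factor and differ only in the loss-dependent part.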
|
||||
|
||||
**Next.** Act-gate spec: docs/spec/20260611_act_gate_spec.md (score activations, route gradients). Residual-stream representation queued (pueue #21-23) to test whether the random r=32 lora projection limits even the bottleneck act.
|