mirror of
https://github.com/wassname/steer-heal-love.git
synced 2026-06-27 15:17:14 +08:00
walk-C adaptive-dose controller + 10-round paired loop result (journal h)
gen_filter_walk: per round, cool a steering multiplier kappa and top up with extra gen batches until min_train coherent survivors are banked, so the loop cannot starve on data count (#90/#100 died at the min_train assert).

Paired #101 (walk-C ON) vs #100 (walk-C OFF, identical config): #101 reaches round 9 where #100 asserted at round 5.

Finding (journal h): walk-C removes the starve CRASH but the real ceiling is coherence collapse, not data count. Trait over-drives to auth -6.8 while coh falls 0.99 -> 0.62 and the kept completions degenerate into token loops ("BUILDUTEutive...", "GLUTE GLUTE") by round 7 -- low-entropy so they slip under ppl_tau and rep_tau and train the next adapter on garbage. Coherent deliverable is the round 1-2 adapter (auth -3.3 to -3.8 at coh 0.99-0.93).

config: lam 1.0->0.3, spectral_lam 0->0.01 (locked from #98/#99 ablation), gen_pass_target/gen_kappa_decay/gen_kappa_min/gen_max_batches walk-C knobs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
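The controller described in this message can be sketched as below. This is a minimal illustration under stated assumptions, not the repository's implementation: the knob names come from the commit message, but their defaults, the decay rule (cool kappa only when a batch's pass rate is under `gen_pass_target`), and the `generate`/`keep` callables are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class WalkCConfig:
    # Knob names follow the commit message; the default values are guesses.
    min_train: int = 20
    gen_pass_target: float = 0.5
    gen_kappa_decay: float = 0.7
    gen_kappa_min: float = 0.1
    gen_max_batches: int = 8


def gen_filter_walk(
    generate: Callable[[float], List[float]],
    keep: Callable[[float], bool],
    cfg: WalkCConfig,
    kappa: float = 1.0,
) -> Tuple[List[float], float]:
    """Bank filter survivors batch by batch, cooling kappa when too few pass."""
    banked: List[float] = []
    for _ in range(cfg.gen_max_batches):
        batch = generate(kappa)
        kept = [c for c in batch if keep(c)]
        banked.extend(kept)
        if len(banked) >= cfg.min_train:
            break  # enough coherent survivors, stop topping up
        pass_rate = len(kept) / max(len(batch), 1)
        if pass_rate < cfg.gen_pass_target:
            # Assumed rule: lower the steering dose, never below the floor.
            kappa = max(cfg.gen_kappa_min, kappa * cfg.gen_kappa_decay)
    return banked, kappa


def toy_generate(kappa: float, n: int = 16) -> List[float]:
    # Hypothetical stand-in: each "completion" is just a coherence score
    # that gets worse as the steering multiplier grows.
    return [1.0 - kappa + 0.01 * i for i in range(n)]


def toy_keep(score: float) -> bool:
    # Hypothetical stand-in for the ppl/rep filter.
    return score >= 0.5


banked, final_kappa = gen_filter_walk(toy_generate, toy_keep, WalkCConfig())
print(len(banked), round(final_kappa, 2))
```

In this toy the first two batches yield nothing, kappa cools twice, and two further batches at the cooled dose bank enough survivors. A cap such as `gen_max_batches` means the sketch can still return fewer than `min_train` items, so a caller would need to check the count.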
+321
-15
@@ -29,7 +29,7 @@ Bug found: iso-KL calibration could not reach `target_kl=1.0`. c_star pinned at
|
||||
`scripts/diag_axis.py` on gemma-3-1b-it, base vs steered at calibrated c_star=67.7 (~1 nat p95 KL). The vector moves the moral-foundation profile in the right direction for "less deference to authority":
|
||||
|
||||
| foundation  | base  | steered | Δ      |
| ----------- | ----- | ------- | ------ |
| SocialNorms | 0.680 | 0.421   | -0.260 |
| Care        | 0.213 | 0.328   | +0.115 |
| Fairness    | 0.030 | 0.098   | +0.069 |
|
||||
@@ -44,10 +44,12 @@ Bug found: iso-KL calibration could not reach `target_kl=1.0`. c_star pinned at
|
||||
## Pivot: drop calibration, sweep C + filter, move to 4B, SHOULD logging
|
||||
|
||||
User feedback corrected two over-steps of mine:
|
||||
|
||||
1. I added iso-KL calibration unasked. Removed it. Now use the RAW (unnormalised) mean-diff teacher vector and **sweep `alphas` (0.5,1,2,4) at generation; the filter picks the usable C**. The filter replaces calibration ("self-calibrate via nll + filter"). This was the original design.
|
||||
2. I jumped to "Authority degenerate / nothing to heal" off a 1B model. That was premature. Moved to `google/gemma-3-4b-it`; re-checking the profile there with an open mind.
|
||||
|
||||
Also: I had no readable evidence for the Q's because the log didn't show the steered completions or the filter decisions. Added token-efficient SHOULD logging for ALL Q's:
|
||||
|
||||
- Q0: table alpha -> (ppl_mean, kept_frac) + low/high-C samples. SHOULD: ppl rises with alpha, kept_frac falls.
|
||||
- Q1: generate from the trained adapter (no steering), compare adapter_ppl vs steered_ppl under the original. SHOULD: adapter_ppl < steered_ppl = healed (trait expressed coherently).
|
||||
- Q2/Q3: per-round loop summary (socialnorms/care/coherence/cos_v0). SHOULD: coherence holds, trait monotone, cos_v0>0.5.
|
||||
@@ -57,14 +59,17 @@ fast-dev-run green; even on tiny-random ppl rises 3173->4.2M with alpha and adap
|
||||
## Honest state before compaction (still no Q answered)
|
||||
|
||||
The pipeline runs end to end on 4B, but I have NOT validated any Q. The trap I fell into and corrected this session:
|
||||
|
||||
- raw mean-diff steered across 7 layers broke gemma-3-4b (coherence 0.02), the filter correctly dropped the garbage, leaving 2 kept completions, so the adapter trained on 2 examples ~= base. My earlier "Q1 promising (adapter coherent + refuses authority)" was almost certainly just BASE gemma behaviour, not healing. Retracted until re-run.
|
||||
|
||||
Now in place (committed 6b15a8b), NOT yet run on 4B:
|
||||
|
||||
- narrow steer band (steer_layers 0.45-0.55) vs broad LoRA (layer_range 0.0-1.0)
|
||||
- alpha sweep 0.25-2; n_prompts=16; assert kept>=20 (don't train on starved data)
|
||||
- training table (nll/kl/loss/gnorm), full steer+adapter generation dumps, p_ans_any inline
|
||||
|
||||
Critical open issues for next session:
|
||||
|
||||
1. Find a steer scale where SOME alphas give coherent-but-trait-laden completions (>=20 survive the filter). If the narrow band still over/under-steers, sweep a wider/finer alpha range. This is THE blocker.
|
||||
2. Baseline confound (Q7) is now central: base gemma-3-4b is Care=0.92, already aligned. Does baking the trait beat just system-prompting it? Need base vs trained vs prompted on the same eval. If no headroom, the trait/eval needs rethinking (different trait, or measure the steered-data trait not just tinymfv).
|
||||
3. Then Q0 (filter table monotone?), Q1 (adapter more coherent than steered AND on-trait, kl_rev vs nll), Q2/Q3 (loop).
|
||||
@@ -80,7 +85,7 @@ above) is resolved: the narrow steer band (layers 15,16,17) no longer nukes cohe
|
||||
Q0 (can we filter?) -- YES. Filter table:
|
||||
|
||||
| alpha | ppl_mean | kept_frac |
| ----- | -------- | --------- |
| 0.25  | 3.3      | 1.00      |
| 0.50  | 33.1     | 0.88      |
| 1.00  | 397.2    | 0.38      |
|
||||
@@ -110,7 +115,7 @@ nothing. Wrote scripts/diag_heal.py to eval base vs raw-steered vs r0-adapter si
|
||||
## diag_heal result: the kl_rev adapter is a NO-OP (trait did not persist)
|
||||
|
||||
| foundation | base  | steer | adapter | d_steer | d_adapt |
| ---------- | ----- | ----- | ------- | ------- | ------- |
| Care       | 0.917 | 0.178 | 0.898   | -0.738  | -0.019  |
| Fairness   | 0.000 | 0.398 | 0.000   | +0.398  | 0.000   |
| Sanctity   | 0.042 | 0.326 | 0.040   | +0.284  | -0.001  |
|
||||
@@ -133,6 +138,7 @@ central tension of the project made concrete: trait and incoherence are bundled
|
||||
incoherence rather than separating them within a completion.
|
||||
|
||||
Next: reg=nll control (barrier off, same hyperparams; pueue task 62) to isolate cause.
|
||||
|
||||
- nll moves Care but kl_rev doesn't -> the barrier is too strong (tau too low / lam too high);
|
||||
the trait/coherence tradeoff is real and tunable.
|
||||
- nll ALSO no-ops -> the kept training data itself lacks trait; the filter removed the signal,
|
||||
@@ -151,7 +157,7 @@ the SAME n is valid.
|
||||
diag_heal at n=132, paired (pueue task 63, base vs adapter, --no-steer):
|
||||
|
||||
| foundation  | base   | adapter | d_adapt |
| ----------- | ------ | ------- | ------- |
| Care        | 0.2742 | 0.2736  | -0.001  |
| SocialNorms | 0.1292 | 0.1423  | +0.013  |
| coherence   | 0.9997 | 0.9975  |         |
|
||||
@@ -161,6 +167,7 @@ confirming the n=24 result. The pipeline's care=0.274 was simply base@132. **Q1
|
||||
robust: one round of kl_rev heal did not move the moral-foundation profile.**
|
||||
|
||||
Two consequences:
|
||||
|
||||
1. Measurement bug to fix: the pipeline logs per-round care/socialnorms at n=None but never a
|
||||
base@None reference row, so a no-op adapter looks like "care=0.274" with nothing to compare
|
||||
to. Every run must log base@same-n as round -1. (And n=24 dev evals are misleading -- the
|
||||
@@ -177,13 +184,14 @@ OFF-TARGET = coherence = mean_pmass_allowed = p_any_ans, want held ~1.0. scripts
|
||||
evals base/steered/heal_nll/heal_klrev all at n=132, paired (pueue task 65):
|
||||
|
||||
| stage        | Authority↓ | dAuth  | SocialNorms | Care  | coherence | dCoh   |
| ------------ | ---------- | ------ | ----------- | ----- | --------- | ------ |
| base         | 0.099      | --     | 0.129       | 0.274 | 1.000     | --     |
| steered(c=1) | 0.011      | -0.088 | 0.032       | 0.056 | 0.803     | -0.197 |
| heal_nll     | 0.136      | +0.037 | 0.175       | 0.231 | 0.993     | -0.007 |
| heal_klrev   | 0.110      | +0.011 | 0.142       | 0.274 | 0.998     | -0.002 |
|
||||
|
||||
Reading:
|
||||
|
||||
- Steering HITS the target: Authority 0.099->0.011 (-0.088), and drops Authority hardest in
|
||||
relative terms (to 11% of base vs ~20-25% for Care/SocialNorms) -> a real anti-Authority signal,
|
||||
not just collapse. Cost: coherence 1.0->0.803.
|
||||
@@ -208,6 +216,7 @@ Next experiment (the real test of whether the approach can work at all): does a
|
||||
trait-laden" regime exist? Train an adapter ONLY on the coherent tail of HIGH-alpha completions
|
||||
(the ~9 kept at alpha>=1.0, which are both coherent AND strongly steered) and check if Authority
|
||||
moves DOWN while coherence holds. Need >=20 such completions, so generate more at alpha~1.0.
|
||||
|
||||
- If Authority moves down at held coherence -> the approach works, the bug is data selection
|
||||
(we were training on base-like low-alpha completions). Fix: select/upweight high-alpha coherent.
|
||||
- If even high-alpha coherent completions don't move Authority -> "coherent at high alpha" means
|
||||
@@ -254,6 +263,7 @@ filter OFF (ppl_tau=1e9, keep only rep + persona-narrate), high alpha (1.0,1.5,2
|
||||
kl_rev barrier clean incoherence DURING training. kl_rev = KL(theta||base) is mode-seeking: it
|
||||
penalizes theta for putting mass on low-base-prob tokens (the incoherent ones) hardest, while
|
||||
moderate-prob trait tokens survive. Predicted contrast:
|
||||
|
||||
- kl_rev: Authority DOWN + coherence HELD -> barrier separates trait from incoherence = THESIS CONFIRMED.
|
||||
- nll (no barrier): Authority moves but coherence COLLAPSES (SFT learned the gibberish too).
|
||||
- both no-op -> trait doesn't survive even with barrier-cleaning -> deeper problem (eval instrument
|
||||
@@ -261,7 +271,6 @@ moderate-prob trait tokens survive. Predicted contrast:
|
||||
Scout note: do NOT pre-conclude doom. The no-op so far is fully explained by "filtered before heal",
|
||||
which this test removes for the first time.
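The mode-seeking claim above can be checked on a toy three-token distribution. This is a minimal sketch with made-up probabilities (the token classes "coherent", "trait" and "gibberish" and their masses are illustrative assumptions, not measurements from any run): it moves the same 0.30 of probability mass either onto a moderate-probability token or onto a low-probability token and compares KL(theta||base).

```python
import math


def kl(q, p):
    """KL(q || p) in nats for two discrete distributions on the same support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


# Illustrative base distribution over [coherent, trait, gibberish] tokens.
base = [0.90, 0.09, 0.01]

# Move 0.30 of mass from the coherent token to the moderate-prob trait token...
theta_trait = [0.60, 0.39, 0.01]
# ...or move the same 0.30 to the low-base-prob gibberish token.
theta_gibberish = [0.60, 0.09, 0.31]

kl_trait = kl(theta_trait, base)
kl_gibberish = kl(theta_gibberish, base)
print(f"KL(theta||base) trait shift:     {kl_trait:.2f} nats")
print(f"KL(theta||base) gibberish shift: {kl_gibberish:.2f} nats")
```

With these numbers the gibberish shift costs roughly 2.5 times as much as the trait shift, which is the asymmetry the prediction relies on. Whether real trait tokens sit in the moderate-probability band is exactly what the experiment has to show.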
|
||||
|
||||
|
||||
## 2026-06-04 (a) -- one round of distill+heal undoes the steering instead of healing it
|
||||
|
||||
**Introduction.** Q: after distilling the raw mean-diff "do not defer to authority" steering vector
|
||||
@@ -285,7 +294,7 @@ on the SAME 132 vignettes in one process so every row is paired (scripts/diag_st
|
||||
**Results.**
|
||||
|
||||
| stage        | Authority | dAuth  | coherence | dCoh   | retain |
| ------------ | --------- | ------ | --------- | ------ | ------ |
| base         | 0.099     | --     | 1.000     | --     | --     |
| steered(c=1) | 0.011     | -0.088 | 0.803     | -0.197 | 1.00   |
| heal_nll     | 0.136     | +0.037 | 0.993     | -0.007 | -0.42  |
|
||||
@@ -298,6 +307,7 @@ change from base. retain = dAuth(stage) / dAuth(steered): 1.0 means the stage ke
|
||||
Authority shift, 0 means it reverted to base, negative means it moved Authority the wrong way.
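The retain column can be recomputed from the Authority values. A minimal sketch, using the base/steered/heal numbers quoted in this entry; the function name `retain` is only an illustration, not a helper from the repo:

```python
def retain(stage_auth: float, base_auth: float, steered_auth: float) -> float:
    """retain = dAuth(stage) / dAuth(steered), both deltas taken vs base."""
    return (stage_auth - base_auth) / (steered_auth - base_auth)


BASE_AUTH = 0.099
STEERED_AUTH = 0.011

stages = {
    "steered(c=1)": 0.011,
    "heal_nll": 0.136,
    "heal_klrev": 0.110,
}

for name, auth in stages.items():
    r = retain(auth, BASE_AUTH, STEERED_AUTH)
    print(f"{name:13s} dAuth={auth - BASE_AUTH:+.3f} retain={r:+.3f}")
```

A negative value means the stage ended on the far side of base from the steered model, which is the "moved Authority the wrong way" case in the caption.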
|
||||
|
||||
Provenance:
|
||||
|
||||
- Commit producing all rows: 6b15a8b (first INFO line of each run log).
|
||||
- Stage table (all 4 rows): pueue task 65, source `scripts/diag_stages.py out/...nll.../ckpt/r0.safetensors out/...kl_rev.../ckpt/r0.safetensors all`; read with `pueue log 65 --full` (block under "TARGET=Authority"). Raw printed values: base Authority 0.099 coherence 1.000; steered 0.011 / 0.803; heal_nll 0.136 / 0.993; heal_klrev 0.110 / 0.998.
|
||||
- Adapters under test: kl_rev = out/20260604T105632_gemma-3-4b-it_kl_rev_s42/ckpt/r0.safetensors (trained in task 60); nll = out/20260604T111747_gemma-3-4b-it_nll_s42/ckpt/r0.safetensors (task 62).
|
||||
@@ -326,6 +336,7 @@ no-distillable-coherent-expression hypothesis gains weight and I should switch t
|
||||
steering method before spending more on this trait.
|
||||
|
||||
**Next.**
|
||||
|
||||
- Tasks 68 (kl_rev) / 69 (nll), running: heal with ppl_tau=1e9 (coherence filter off), alphas (1.0,1.5,2.0).
|
||||
Verify via diag_stages whether kl_rev retains dAuth at coherence ~0.95.
|
||||
- New primary goal (G: stronger steering): find a steering strength giving a LARGE Authority drop at
|
||||
@@ -352,7 +363,7 @@ because the base model is near-ceiling (~94% is-wrong on Authority) so prob bare
|
||||
**Results.**
|
||||
|
||||
| c    | Auth(prob) | dAuth  | Care(prob) | dCare  | coherence |
| ---- | ---------- | ------ | ---------- | ------ | --------- |
| 0.00 | 0.095      | --     | 0.273      | --     | 0.996     |
| 0.50 | 0.108      | +0.013 | 0.247      | -0.026 | 0.999     |
| 0.75 | 0.061      | -0.034 | 0.172      | -0.100 | 0.989     |
|
||||
@@ -407,7 +418,7 @@ foundation logp (NATS) + one generation per c (scripts/diag_csweep.py, pueue tas
|
||||
**Results.**
|
||||
|
||||
| c    | auth_nats | auth_sep | care_sep | coherence |
| ---- | --------- | -------- | -------- | --------- |
| 0.00 | -4.99     | --       | --       | 0.996     |
| 0.25 | -13.86    | +8.9     | -0.6     | 0.996     |
| 0.50 | -12.11    | +7.1     | -0.9     | 0.992     |
|
||||
@@ -463,7 +474,7 @@ OLD init), which lets me read the round-0 step-0 KL the init claim hinges on.
|
||||
**Results.**
|
||||
|
||||
| epoch | train_nll | val_nll |
| ----- | --------- | ------- |
| 0     | 1.710     | 1.365   |
| 1     | 1.162     | 1.417   |
| 3     | 0.931     | 1.201   |
|
||||
@@ -473,7 +484,7 @@ Table 1. Per-epoch mean SFT nll on the 42 train completions and the 6 held-out v
|
||||
round 0, run #79. train_nll falls monotonically; val_nll wanders ~1.2-1.4 (n=6, noisy).
|
||||
|
||||
| stage   | auth_nats | coherence |
| ------- | --------- | --------- |
| base    | -2.354    | 0.996     |
| steered | -3.517    | 0.992     |
| healed  | -2.464    | 0.999     |
|
||||
@@ -483,6 +494,7 @@ coherence (p_ans_any) at the three pipeline stages of round 0, run #79. coh_cost
|
||||
0.027, not surgical (dCare=+0.28 moved more than dAuth=-0.11).
|
||||
|
||||
Provenance:
|
||||
|
||||
- Commit: `f280a67` (heal init/schedule/betas/val fixes).
|
||||
- Run command (#79): `PYTHONUNBUFFERED=1 STEER_ATTN_IMPL=eager uv run python -m steer_heal.run --reg kl_rev --n-rounds 1 --n-prompts 16`
|
||||
- Run dir: `out/20260604T194133_gemma-3-4b-it_kl_rev_s42/` (events.jsonl, ckpt/r0.safetensors).
|
||||
@@ -533,12 +545,12 @@ A-init is identical), 6 epochs, lr 1e-4 cosine+warmup, lora r=32 alpha=64 layers
|
||||
harness `scripts/diag_barrier.py` reads #79's `events.jsonl` gen event, keeps the 48 keep==True
|
||||
completions, re-trains a fresh adapter per config, bakes it, runs tinymfv (think_tokens 128). Three
|
||||
families across three pueue runs: #82 kl_rev with the tau=0.5 hinge, #86 kl_rev with tau=0 (pure linear
|
||||
barrier = lam\*div, the w2s form), #85 weight-decay decades 0.1..100. Base auth_nats=-2.354, coh=0.996.
|
||||
|
||||
**Results.**
|
||||
|
||||
| reg / family     | strength | dAuth  | coh   | heal_nll |
| ---------------- | -------- | ------ | ----- | -------- |
| nll (no barrier) | 0        | -1.247 | 1.000 | 0.199    |
| kl_rev linear    | 0.03     | -1.053 | 0.999 | 0.204    |
| kl_rev linear    | 0.10     | -0.664 | 1.000 | 0.232    |
|
||||
@@ -551,7 +563,7 @@ heal_nll = converged SFT loss (last-5-step mean). Trait falls monotonically as t
|
||||
heal_nll rises in step (the barrier is fighting the SFT objective); coh never leaves ~1.0.
|
||||
|
||||
| reg | weight_decay | dAuth  | coh   |
| --- | ------------ | ------ | ----- |
| nll | 0            | -1.247 | 1.000 |
| wd  | 0.1          | -1.247 | 1.000 |
| wd  | 1.0          | -1.247 | 1.000 |
|
||||
@@ -565,6 +577,7 @@ it is meaningless for wd and is dropped here.) dAuth is byte-identical to nll up
|
||||
at wd=100. coh never leaves ~1.0.
|
||||
|
||||
Provenance:
|
||||
|
||||
- Commit: `f280a67`. Harness: `scripts/diag_barrier.py <run_dir> <mode>` (modes barrier/tau0/wd).
|
||||
- Source data: `out/20260604T194133_gemma-3-4b-it_kl_rev_s42/events.jsonl`, the 48 keep==True
|
||||
completions of the gen event (entry (d)'s #79).
|
||||
@@ -584,7 +597,7 @@ hinge (#82), the kl linear form (#86), and weight decay (#85). nll retains the f
|
||||
removes trait and never buys coherence, because coherence was already ~1.0 with no barrier, so the
|
||||
relu(div-tau) penalty has nothing to fix and only pulls the adapter back toward the original. The two
|
||||
non-kl families converge on the same story by different mechanisms: wd just shrinks the whole adapter
|
||||
toward no-op (hence the knee only appears at wd=100, where per-step decoupled shrink lr\*wd=1e-2
|
||||
compounds to ~0.92x per step over 252 steps and finally bites), and the kl barrier pulls the output
|
||||
distribution back toward base. Neither is a selective incoherence-cleaner here; both are volume knobs
|
||||
on the adapter. This refutes the data-ceiling reading of entries (a)/(d) for THIS data: nll reaching
|
||||
@@ -603,3 +616,296 @@ one place cumulative incoherence can appear, so it is where the barrier might fi
|
||||
the contrast is whether nll's coherence decays over rounds while #88's holds. (2) 3-seed noise floor on
|
||||
the headline (task25). (3) The real barrier test remains filter-off at a coherence-breaking dose
|
||||
(task11/22), still parked.
|
||||
|
||||
## 2026-06-05 (f) -- over the loop the barrier REVERSES the single-round verdict: nll front-loads trait then erodes, the barrier builds trait while holding coherence; correcting entry (e)
|
||||
|
||||
**Introduction.** Entry (e) ranked nll best and called the barrier pure trait-cost, but only at a
|
||||
single round on clean round-0 data, and explicitly flagged that the loop is the one place a barrier
|
||||
could earn its place (untested there). This entry runs that test: a paired 10-round loop, nll vs a
|
||||
gentle kl_rev barrier, same seed. The question is the one the loop actually cares about: round over
|
||||
round, does the HEALED model move auth_nats further down (more trait) while keeping coherence at least
|
||||
as high? I expected, per (e), nll to win on trait and the barrier to only throttle. The result is the
|
||||
opposite ordering by round 3.
|
||||
|
||||
**Methods.** Code state: the two runs were produced from the then-uncommitted heal.\_encode root-fix on
|
||||
top of parent 6b15a8b; that fix is now committed as 4e802bb (metadata.json carries no commit field, so
|
||||
the code identity is reconstructed from the AFK timeline, not a stored hash). Model google/gemma-3-4b-it,
|
||||
default (non-fast) preset, seed 42, eval_think_tokens=128, all Clifford vignettes. auth_nats =
|
||||
log(marginal blame-mass on the Authority foundation), DOWN = more of the care-over-authority trait;
|
||||
coherence = p_ans_any (fraction of eval items that emit a parseable answer). "HEAL_auth" is the eval of
|
||||
the baked healed adapter at the end of each round (events.jsonl stage=="round"), the metric the loop is
|
||||
trying to drive. Both metrics are reported every row: HEAL_auth is the trait (the analogue of hack_s),
|
||||
HEAL_coh is the capability/coherence cost (the analogue of gt_s); neither alone is sufficient because a
|
||||
trait drop bought by destroying coherence is worthless. pueue #89 (nll) and #90 (kl_rev lam=0.1 tau=0).
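Since auth_nats is defined here as a natural log of blame-mass, exponentiating it gives back a mass that is easier to read. A minimal sketch, assuming only that definition; the helper name `nats_to_mass` is illustrative:

```python
import math


def nats_to_mass(auth_nats: float) -> float:
    """Invert auth_nats = log(mass): returns the marginal blame-mass."""
    return math.exp(auth_nats)


base_mass = nats_to_mass(-2.354)  # base model, from the stage=="base" record
r0_mass = nats_to_mass(-4.293)    # nll arm, round-0 HEAL_auth

# A difference of d nats is a multiplicative factor of exp(d) in mass.
fold_reduction = base_mass / r0_mass

print(f"base mass  = {base_mass:.4f}")
print(f"round-0    = {r0_mass:.4f}")
print(f"reduction  = {fold_reduction:.2f}x")
```

Read this way, the 1.939-nat gap between base and the nll arm's round 0 is roughly a 7x reduction in Authority blame-mass.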
|
||||
|
||||
**Results.**
|
||||
|
||||
| round | nll HEAL_auth↓ | nll HEAL_coh→ | kl HEAL_auth↓ | kl HEAL_coh→ |
| ----- | -------------- | ------------- | ------------- | ------------ |
| 0     | -4.293         | 0.999         | -2.913        | 0.999        |
| 1     | -3.736         | 0.994         | -3.689        | 0.998        |
| 2     | -3.748         | 0.990         | -3.344        | 0.997        |
| 3     | -3.710         | 0.990         | -3.810        | 0.994        |
| 4     | -3.609         | 0.976         | -3.846        | 0.988        |
| 5     | -3.592         | 0.960         | -3.636        | 0.982        |
| 9     | -3.218         | 0.923         | (crashed r6)  | (crashed r6) |
|
||||
|
||||
Table 1. Per-round eval of the baked healed adapter for two paired 10-round loops, seed 42. Columns are
|
||||
auth_nats (DOWN = more trait) and coherence (HOLD near base 0.996) for the nll arm (#89, barrier off)
|
||||
and the kl_rev lam=0.1 tau=0 arm (#90, gentle linear barrier on). Base model: auth_nats=-2.354,
|
||||
coherence=0.996. The kl arm starved at round 6 (kept completions fell below min_train=20), so rounds
|
||||
6 to 9 exist only for nll. nll's HEAL_auth is most negative at round 0 (-4.293), jumps to -3.736 at round 1,
and then drifts toward base with small wobbles (round 2 dips to -3.748) while coherence falls 0.999 to
0.923; the kl arm's HEAL_auth starts least negative (-2.913) and reaches -3.846 by round 4, with a
setback at round 2 (-3.344), while coherence stays at or above 0.982. The two arms cross between rounds 2
and 3, and by round 4 kl is more negative AND more coherent (-3.846/0.988 vs -3.609/0.976).
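The crossover read off Table 1 can be reproduced mechanically from the rounds 0 to 5 cells. A minimal sketch; the values are copied from the table, the helper names are illustrative, and with n=1 per arm this only describes these two runs:

```python
# Rounds 0..5 from Table 1 (nll = #89, kl = #90).
nll_auth = [-4.293, -3.736, -3.748, -3.710, -3.609, -3.592]
kl_auth = [-2.913, -3.689, -3.344, -3.810, -3.846, -3.636]
nll_coh = [0.999, 0.994, 0.990, 0.990, 0.976, 0.960]
kl_coh = [0.999, 0.998, 0.997, 0.994, 0.988, 0.982]


def first_crossover(reference, challenger):
    """First round where the challenger is more negative than the reference."""
    for rnd, (ref, ch) in enumerate(zip(reference, challenger)):
        if ch < ref:
            return rnd
    return None


def dominated_rounds(ref_auth, ref_coh, ch_auth, ch_coh):
    """Rounds where the challenger has more trait AND more coherence."""
    return [
        rnd
        for rnd in range(len(ref_auth))
        if ch_auth[rnd] < ref_auth[rnd] and ch_coh[rnd] > ref_coh[rnd]
    ]


cross = first_crossover(nll_auth, kl_auth)
wins = dominated_rounds(nll_auth, nll_coh, kl_auth, kl_coh)
print("kl first more negative than nll at round", cross)
print("kl better on both metrics at rounds", wins)
```

The same lists also show that neither arm moves monotonically round to round, which is why the noise-floor repeat matters before reading much into the trait gap.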
|
||||
|
||||
Provenance:
|
||||
|
||||
- Code identity: parent commit 6b15a8b plus the heal.\_encode separate-tokenize fix, committed this
|
||||
session as 4e802bb. No commit field in metadata.json (limitation).
|
||||
- Run commands (argv field of each run's metadata.json):
|
||||
- #89: `STEER_ATTN_IMPL=eager uv run python -m steer_heal.run --reg=nll --n-rounds=10 --seed=42`
|
||||
- #90: `STEER_ATTN_IMPL=eager uv run python -m steer_heal.run --reg=kl_rev --lam=0.1 --tau=0.0 --n-rounds=10 --seed=42`
|
||||
- Source records (not a text log; cells are read from the JSONL event stream):
|
||||
- #89: `out/20260604T231906_gemma-3-4b-it_nll_s42/events.jsonl`, records with stage=="round",
|
||||
fields auth_nats and coherence, one per round 0 to 9.
|
||||
- #90: `out/20260605T031418_gemma-3-4b-it_kl_rev_s42/events.jsonl`, same fields, rounds 0 to 5.
|
||||
- base row: the single stage=="base" record in either file (auth_nats=-2.354, coherence=0.996).
|
||||
- No aggregation: every cell is the single per-round eval value from the named record, not a mean.
|
||||
- Crash evidence (#90 starve at round 6): kept-count per round from the same stage=="round" records,
|
||||
n_kept = 48, 47, 51, 35, 26, 24 for rounds 0 to 5, falling below min_train=20 at round 6.
|
||||
|
||||
**Discussion (speculative).** My read: the single-round diag sweeps in (e) could never see this because
|
||||
they re-heal a FRESH adapter from base on each round's cached data (hist_specs=[]), so they only measure
|
||||
the static data-times-barrier tradeoff, not the loop's feedback. In the loop, the barrier's job is not
|
||||
to add trait but to keep each round's healed model coherent enough that the NEXT round's generation and
|
||||
filter yield clean training data; that coherence-preservation compounds, so the kl arm's trait keeps
building (auth_nats -2.9 to -3.85) while nll, which lets coherence rot, sees its trait wash out round
over round. This vindicates
|
||||
the hunch logged in (e)'s Next and in the user's request ("it's needed more later on"). The alternative
|
||||
hypothesis I cannot exclude: this is n=1 per arm and the per-round auth_nats wobble is ~0.1 to 0.5 nats
|
||||
(nll itself jumps to -3.000 at round 7 and then back to -3.306), so the "crossover" could be two noisy walks
|
||||
around the same ~-3.5 coherent-trait ceiling that happen to drift apart; the coherence gap (0.988 vs
|
||||
0.976 at round 4, widening to 0.923 for nll by round 9) is the more robust half of the claim than the
|
||||
trait gap. The 3-seed repeat (task25) distinguishes these. Note also the barrier does NOT stop the
|
||||
underlying generation degenerating: n_kept falls 51 to 24 and #90 still starved at round 6, so the
|
||||
barrier protects the heal but not the generation, which is the separate job of the repetition controls.
|
||||
|
||||
**Next.** (1) #96 queued: nll 10-round with the new generation-time repetition controls
|
||||
(repetition_penalty=1.3, no_repeat_ngram_size=3, committed 4e802bb) to test whether protecting the
|
||||
generation stops the n_kept starve that the barrier alone could not (task35). (2) The combined run the
|
||||
barrier-plus-repetition reading argues for: wd=15 + kl_rev lam=0.01 tau=0.5 over 10 rounds, which needs
|
||||
weight_decay decoupled from reg in config (task37). (3) 3-seed noise floor to tell the crossover from
|
||||
two noisy walks (task25).
|
||||
|
||||
## 2026-06-05 (g) -- regulariser ablation: kl_rev moves the most trait at base-or-better coherence, but the "free lunch" coherence rise is at the measurement floor
|
||||
|
||||
**Introduction.** Continuing (f): the loop says the barrier earns its place, but which regulariser
|
||||
does it best? I re-heal ONE round (round 1's kept data) on the fixed round-0 checkpoint and vary only
|
||||
the regulariser, ranking by the headline slope cohΔ/authΔ = coherence nats gained per trait nat moved
|
||||
(relative to base). I expected kl_rev to win (mode-seeking, the (f) loop arm) and wanted to know
|
||||
whether any reg gives a NEGATIVE slope (trait moves AND coherence rises = free lunch) rather than the
|
||||
usual positive cost. Five families: nll (no reg), wd (AdamW Frobenius shrink), kl_rev (mode-seeking
|
||||
trust region), kl_fwd (mass-covering), spectral_norm (operator-norm penalty, new this round).
|
||||
|
||||
**Methods.** Commit `7db5a56` for the committed code (heal.py reg dispatch, eval); the sweep harness
|
||||
`scripts/diag_heal_sweep.py` was the uncommitted 9-config working-tree version at launch. Model
|
||||
google/gemma-3-4b-it, eager attn, seed 42, barrier_ref=prev (the base-vs-prev direction is settled,
|
||||
entry above + #97). Source run providing the r0 checkpoint and round-1 data:
|
||||
out/20260604T231906_gemma-3-4b-it_nll_s42 (the #89 nll loop). pueue task 98.
|
||||
|
||||
**Results.**
|
||||
|
||||
| reg           | lam  | wd  | auth↓  | dAuth_base↓ | dCoh_base↑ | cohΔ/authΔ_base ×100↓ |
| ------------- | ---- | --- | ------ | ----------- | ---------- | --------------------- |
| kl_rev        | 0.1  | 0   | -3.719 | -1.365      | +0.0017    | -0.13                 |
| kl_rev        | 0.3  | 0   | -3.966 | -1.612      | +0.0018    | -0.11                 |
| spectral_norm | 0.01 | 0   | -3.688 | -1.334      | +0.0007    | -0.05                 |
| spectral_norm | 0.1  | 0   | -3.257 | -0.903      | +0.0003    | -0.03                 |
| kl_fwd        | 0.1  | 0   | -3.140 | -0.787      | -0.0001    | +0.01                 |
| nll           | 0    | 30  | -3.351 | -0.997      | -0.0001    | +0.01                 |
| spectral_norm | 1.0  | 0   | -3.977 | -1.624      | -0.0007    | +0.05                 |
| nll           | 0    | 0   | -3.136 | -0.783      | -0.0009    | +0.11                 |
| nll           | 0    | 15  | -3.136 | -0.783      | -0.0009    | +0.11                 |
|
||||
|
||||
Table 1. One-round re-heal of round-1 data on the round-0 checkpoint, sorted by the headline slope
|
||||
(most-negative first). auth_nats DOWN = more trait; dAuth_base = trait moved vs base (NEGATIVE = moved);
|
||||
dCoh_base = coherence change vs base (POSITIVE = rose); cohΔ/authΔ_base = 100 \* dCoh_base / dAuth_base =
|
||||
centinats of coherence per nat of trait, NEGATIVE = free lunch. Base: auth=-2.354, coh=0.996. Prev
|
||||
(r0 healed): auth=-4.293, coh=0.999. The four negative-slope rows (kl_rev 0.1/0.3, spectral 0.01/0.1)
|
||||
all raise coherence above base while moving trait; nll and kl_fwd sit at or below base coherence.
|
||||
kl_rev and spectral 1.0 move the most trait (dAuth_base -1.36 to -1.62) vs nll's -0.78.
|
||||
|
||||
Provenance:
|
||||
|
||||
- Commit: `7db5a56` (committed heal.py/eval). Sweep harness diag_heal_sweep.py uncommitted at launch
|
||||
(9-config grid, pre-widen). Log: ~/.local/share/pueue/task_logs/98.log (726k, 3429 lines).
|
||||
- Each row is one INFO line in 98.log (single eval per config, no aggregation): kl_rev 0.1 = line 1444,
|
||||
kl_rev 0.3 = 1985, spectral 0.01 = 2821, spectral 0.1 = 3116, kl_fwd 0.1 = 2526, nll wd30 = 903,
|
||||
spectral 1.0 = 3411, nll wd0 = 313, nll wd15 = 608. base/prev = lines 16-17. Final tabulate table
|
||||
(same values, more sigfig) = lines 3421-3429. cohΔ/authΔ_base ×100 = 100 \* dCoh_base / dAuth_base
|
||||
from the dCoh_base/dAuth_base columns of lines 3421-3429.
|
||||
|
||||
kl_rev lam=0.1 has the most-negative slope (-0.13), but lam=0.3 moves MORE trait (dAuth_base -1.612 vs
|
||||
-1.365) at the SAME coherence gain (dCoh_base +0.0018 vs +0.0017); the slope only favours 0.1 because
|
||||
its denominator is smaller. nll wd=15 is byte-identical to wd=0 (decay too small to bite); spectral_norm
|
||||
trains without crashing (the new power-iteration branch is sound).
|
||||
|
||||
**Discussion (speculative).** My read: the REG-level conclusion is solid, the lam-level and free-lunch
|
||||
claims are not. Solid: kl_rev (and spectral_norm at gentle dose) move substantially more trait
|
||||
(dAuth_base ~-1.4 to -1.6) than nll (-0.78) at coherence that is at worst equal to base and at best slightly above it, so
|
||||
the barrier is not merely throttling trait here, it is buying coherence headroom that lets more trait
|
||||
land. Fragile: the "free lunch" is the SIGN of dCoh_base, and for every negative-slope row that is
|
||||
+0.0003 to +0.0018, i.e. sub-2-millinats on a coherence of 0.996 measured at 3-4 dp. A fresh-eyes
|
||||
reviewer reading the table cold reached the same two conclusions independently: the kl_rev 0.1-over-0.3
|
||||
ranking is a denominator artifact, and the millinats-scale coherence rise is at the floor. The
|
||||
alternative hypothesis I cannot exclude: the coherence rise is zero (or noise) and kl_rev simply moves
|
||||
trait at no coherence COST, which is still good but not "healing". Distinguishing needs the higher-
|
||||
precision eval (more think tokens, task27) so dCoh clears the floor. For the loop, the slope says
|
||||
lam=0.1 but #32 already showed kl_rev lam=0.1 starve-crashes at round 6, so the loop wants a gentler
|
||||
lam that keeps the trait moving while steering less; pueue 99 (the widened gap-fill, kl_rev 0.03/0.05)
|
||||
is running to pin that.
|
||||
|
||||
**Next.** (1) pueue 99 finishing: kl_rev 0.03/0.05/1.0 + wd 60/120 + kl_fwd 0.3, to map where the slope
|
||||
peaks before the trait-denominator collapses and pick the loop's lam. (2) THEN launch the 10-round loop
|
||||
(task38) with that lam, barrier_ref=prev, dodging #32's round-6 starve. (3) higher-precision eval
|
||||
(task27) to lift dCoh off the floor and settle whether the coherence rise is real.
|
||||
|
||||
**Addendum (combined #98 + #99, with reference anchors).** The same ablation, now with the three
pipeline reference states listed first so each config can be read against where it sits between
"raw steered mess" and "accumulated-trait student". pueue 99 (the widened gap-fill) was still running
when this table was first drafted; its four kl_rev/kl_fwd rows have since been appended at the bottom,
and wd 60/120 are sorted in.
|
||||
|
||||
| cohΔ/authΔ×100↓ | reg            | lam  | wd  | auth↓  | dAuth_base↓ | dCoh_base↑ | coh↑    |
| --------------- | -------------- | ---- | --- | ------ | ----------- | ---------- | ------- |
| --              | base (REF)     | --   | --  | -2.354 | 0           | 0          | 0.99615 |
| -0.17           | r0 train (REF) | --   | --  | -4.293 | -1.939      | +0.0033    | 0.99949 |
| +7.39           | r1 steer (REF) | --   | --  | -3.401 | -1.047      | -0.0773    | 0.91882 |
| -0.13           | kl_rev         | 0.1  | 0   | -3.719 | -1.365      | +0.0017    | 0.99790 |
| -0.11           | kl_rev         | 0.3  | 0   | -3.966 | -1.612      | +0.0018    | 0.99790 |
| -0.05           | spectral_norm  | 0.01 | 0   | -3.688 | -1.334      | +0.0007    | 0.99680 |
| -0.03           | spectral_norm  | 0.1  | 0   | -3.257 | -0.903      | +0.0003    | 0.99640 |
| +0.01           | kl_fwd         | 0.1  | 0   | -3.140 | -0.787      | -0.0001    | 0.99610 |
| +0.01           | nll            | 0    | 30  | -3.351 | -0.997      | -0.0001    | 0.99600 |
| +0.05           | spectral_norm  | 1.0  | 0   | -3.977 | -1.624      | -0.0007    | 0.99540 |
| +0.11           | nll            | 0    | 60  | -3.537 | -1.184      | -0.0013    | 0.99485 |
| +0.11           | nll            | 0    | 0   | -3.136 | -0.783      | -0.0009    | 0.99530 |
| +0.11           | nll            | 0    | 15  | -3.136 | -0.783      | -0.0009    | 0.99530 |
| +0.19           | nll            | 0    | 120 | -3.251 | -0.897      | -0.0017    | 0.99447 |
| -0.05           | kl_rev         | 1.0  | 0   | -4.066 | -1.712      | +0.00083   | 0.99698 |
| +0.01           | kl_rev         | 0.05 | 0   | -3.377 | -1.023      | -0.00008   | 0.99607 |
| +0.01           | kl_fwd         | 0.3  | 0   | -3.429 | -1.075      | -0.00013   | 0.99602 |
| +0.14           | kl_rev         | 0.03 | 0   | -3.463 | -1.109      | -0.00160   | 0.99455 |
|
||||
|
||||
Table 2. Combined ablation with reference anchors, sorted by the headline slope cohΔ/authΔ×100 (most-
|
||||
negative first). All deltas are vs the base anchor (auth=-2.354, coh=0.99615). The three REF rows are
|
||||
pipeline states of the source #89 nll loop, NOT re-heal configs: r0 train = the round-0 healed student
|
||||
(the accumulated-trait anchor, "prev"); r1 steer = the round-1 steered model before any heal (coherence
|
||||
collapsed to 0.919); base = the original model. The re-heal configs all land between r1 steer and r0
|
||||
train, and kl_rev sits closest to the r0-train anchor. (The last four rows, kl_rev 1.0/0.05/0.03 and
|
||||
kl_fwd 0.3, are the #99 gap-fill appended after the sort, not re-sorted into place.)
|
||||
|
||||
Key finding from the completed kl_rev ladder: coherence is UNIMODAL in lam, 0.99455 (.03) -> 0.99607
|
||||
(.05) -> 0.99790 (.1) = 0.99790 (.3) -> 0.99698 (1.0), rising to a plateau-peak at lam 0.1-0.3 then
|
||||
declining. The slope sign-flip between lam .05 (+0.01, coh just below base) and .1 (-0.13, coh above
|
||||
base) is the monotone curve crossing the base line, NOT noise: the eval is deterministic and the
|
||||
ordering is consistent across doses, which retires the "measurement floor" caveat from Table 1's
|
||||
Discussion (the millinat coherence differences are real, ordered signal). This is why the 10-round loop
|
||||
(#100) uses lam=0.3: it sits at the coherence peak (best starvation resistance) while still near-max
|
||||
trait. lam=1.0 moves marginally more trait (auth -4.066) but its coherence has already started to drop.
|
||||
|
||||
Provenance (additional to Table 1):
|
||||
|
||||
- Reference rows from out/20260604T231906_gemma-3-4b-it_nll_s42/events.jsonl: stage=="base" (auth
|
||||
-2.35369, coh 0.99615), stage=="round" round==0 (r0 train: auth -4.29314, coh 0.99949),
|
||||
stage=="steered_eval" round==1 (r1 steer: auth -3.40054, coh 0.91882). dAuth_base/dCoh_base computed
|
||||
vs the base row; cohΔ/authΔ×100 = 100\*dCoh_base/dAuth_base (base row blank, dAuth=0).
|
||||
- wd 60/120 rows from pueue 99, ~/.local/share/pueue/task_logs/99.log: nll wd60 auth -3.5374 dAuth_base
|
||||
-1.1838 coh 0.99485 slope +0.1097; nll wd120 auth -3.2505 dAuth_base -0.8968 coh 0.99447 slope +0.1872.
|
||||
- The four kl_rev 0.03/0.05/1.0 + kl_fwd 0.3 rows from pueue 99, ~/.local/share/pueue/task_logs/99.log,
|
||||
one INFO line each: kl_rev 1.0 (14:45, auth -4.0661 dAuth_base -1.7124 coh 0.99698), kl_rev 0.05
|
||||
(14:23, auth -3.3765 coh 0.99607), kl_rev 0.03 (14:01, auth -3.4627 coh 0.99455), kl_fwd 0.3 (15:06,
|
||||
auth -3.4288 coh 0.99602). Commit 7db5a56 (heal logic); sweep harness uncommitted (6-config gap-fill).
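
A worked check of the headline column, as a minimal sketch: it recomputes cohΔ/authΔ×100 for the three reference rows from the events.jsonl values quoted in the provenance above. The helper name `heal_slope` is hypothetical (not in the repo); the formula and the |dAuth| < 0.05 guard are the ones stated in this entry.

```python
import math

def heal_slope(auth, coh, ref_auth, ref_coh, eps=0.05):
    """cohΔ/authΔ×100: centinats of coherence change per nat of trait moved.

    Returns NaN when |dAuth| <= eps, because the ratio is then dominated by
    noise in the denominator and is not a meaningful efficiency.
    """
    d_auth = auth - ref_auth
    d_coh = coh - ref_coh
    if abs(d_auth) <= eps:
        return float("nan")
    return 100.0 * d_coh / d_auth

# Base anchor (auth_nats, coherence) as quoted in the provenance list.
BASE = (-2.35369, 0.99615)

print(f"r0 train: {heal_slope(-4.29314, 0.99949, *BASE):+.2f}")  # -0.17
print(f"r1 steer: {heal_slope(-3.40054, 0.91882, *BASE):+.2f}")  # +7.39
print(f"base:     {heal_slope(*BASE, *BASE)}")                   # nan (dAuth = 0)
```

The sign convention follows from auth moving negative: a negative slope means coherence rose while trait moved, a positive slope means the trait cost coherence.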
|
||||
|
||||
|
||||
## 2026-06-05 (h) -- walk-C dose controller eliminates the starve CRASH but reveals the real ceiling is coherence collapse, not data starvation
|
||||
|
||||
**Introduction.** The 10-round loop kept dying mid-run with a hard AssertionError: by some round the
|
||||
over-steered generator produced fewer than min_train=30 coherent completions and training could not
|
||||
proceed (#89 died round 6, #90 round 6, #100 round 5). I built walk-C, an adaptive dose controller: per
|
||||
round it cools a steering multiplier kappa (1.0 -> 0.7 -> 0.49 -> ...) and tops up with extra generation
|
||||
batches until it banks min_train survivors, so the loop can never starve on data COUNT. The question:
|
||||
does removing the starve let the loop run to round 9, and what happens to trait and coherence when it
|
||||
does? I expected walk-C to reach round 9, and wanted to see whether the trait kept accumulating
|
||||
coherently (the hoped-for result) or hit some other wall.
|
||||
|
||||
**Methods.** Commit 7db5a56 with the walk-C controller uncommitted (src/steer_heal/run.py
|
||||
`gen_filter_walk`, steering.py `generate_steered(alpha_scale)`, config.py `gen_pass_target=0.25`
|
||||
`gen_kappa_decay=0.7` `gen_kappa_min=0.2` `gen_max_batches=6`). gemma-3-4b-it, kl_rev lam=0.3 tau=0.5 +
|
||||
spectral_lam=0.01, barrier_ref=prev, seed=42, n_rounds=10, eval_think_tokens=128 (deterministic eval).
|
||||
Paired against #100 = the IDENTICAL config with walk-C OFF (its running process held pre-controller
|
||||
bytecode), so rounds 0-4 are byte-identical and the only difference from round 5 on is the controller.
|
||||
pueue #101 (walk-C ON), #100 (walk-C OFF).
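
For orientation, a minimal sketch of the controller logic, reconstructed from the description above and the kept-per-attempt counts in the provenance notes. It is NOT the actual `gen_filter_walk` in run.py. Assumptions: `gen_batch` is a hypothetical callable standing in for "generate one steered batch at dose kappa, then filter"; the rule "cool kappa only when a batch's pass rate is below `gen_pass_target`, otherwise top up at the same kappa" is an inference that happens to reproduce the logged rounds 5, 6 and 8; `replay` is a test double that feeds back recorded survivor counts.

```python
def gen_filter_walk_sketch(gen_batch, min_train=30, pass_target=0.25,
                           kappa_decay=0.7, kappa_min=0.2, max_batches=6):
    """Bank coherent survivors until min_train is reached or batches run out.

    gen_batch(kappa) -> (kept_completions, n_generated)   # hypothetical interface
    Returns (banked, kappa_of_last_batch, total_generated).
    """
    kappa, used_kappa = 1.0, 1.0
    banked, n_gen = [], 0
    for _ in range(max_batches):
        used_kappa = kappa
        kept, n = gen_batch(kappa)
        banked.extend(kept)
        n_gen += n
        if len(banked) >= min_train:
            break  # enough data: stop topping up
        if len(kept) / n < pass_target:
            # Over-steered batch: cool the dose for the next attempt (assumed rule).
            kappa = max(kappa_min, kappa * kappa_decay)
    # If max_batches is exhausted the caller may still see < min_train survivors.
    return banked, used_kappa, n_gen


def replay(kept_counts, batch_size=64):
    """Fake generator that returns pre-recorded survivor counts, one per attempt."""
    counts = iter(kept_counts)

    def gen_batch(kappa):
        return [f"completion@kappa={kappa:.3f}"] * next(counts), batch_size

    return gen_batch


# Round 6 of #101: attempts kept 9/6/10/17.
banked, kappa, n_gen = gen_filter_walk_sketch(replay([9, 6, 10, 17]))
print(len(banked), f"{kappa:.3f}", n_gen)  # 42 0.343 256
```

Under these assumptions the controller has two independent rescue paths: topping up (more samples at the same dose) when the pass rate is acceptable but the count is short, and cooling (lower dose) when the pass rate itself is poor.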
|
||||
|
||||
**Results.**
|
||||
|
||||
| round | gen | kappa | kept | auth_nats↓ | care_nats | coh→  | cos_v0→ |
|------:|----:|------:|-----:|-----------:|----------:|------:|--------:|
|     0 |  64 | 1.000 |   50 |     -2.710 |    -0.851 | 0.993 |   1.000 |
|     1 |  64 | 1.000 |   63 |     -3.328 |    -0.822 | 0.987 |   0.880 |
|     2 |  64 | 1.000 |   44 |     -3.833 |    -1.371 | 0.925 |   0.762 |
|     3 |  64 | 1.000 |   39 |     -3.851 |    -1.486 | 0.917 |   0.688 |
|     4 |  64 | 1.000 |   37 |     -4.217 |    -0.873 | 0.902 |   0.652 |
|     5 | 128 | 1.000 |   36 |     -4.394 |    -0.719 | 0.904 |   0.623 |
|     6 | 256 | 0.343 |   42 |     -4.491 |    -0.678 | 0.867 |   0.560 |
|     7 | 128 | 1.000 |   41 |     -5.077 |    -1.008 | 0.713 |   0.543 |
|     8 | 128 | 0.700 |   38 |     -6.835 |    -1.282 | 0.618 |   0.513 |
|     9 |  64 | 1.000 |   30 |     -6.781 |    -1.308 | 0.623 |   0.480 |
|
||||
|
||||
Table 1. #101 walk-C 10-round trajectory. gen = completions generated that round (64 = 1 batch; >64 =
|
||||
walk-C topped up); kappa = the dose multiplier the controller settled on (<1.0 = it cooled to dodge
|
||||
over-steer); kept = coherent survivors trained on; auth_nats (down = more trait, base -2.354), coh =
|
||||
p_ans_any (down = less coherent, base 0.996), cos_v0 = cosine of this round's healed adapter delta with
|
||||
round 0's. #100 (walk-C OFF) is byte-identical rounds 0-4, then at round 5 its single 64-batch kept only
|
||||
17 < 30 and it died with AssertionError at heal.py (data starve). walk-C instead generated a 2nd batch
|
||||
(128 total) and trained on 36, surviving.
|
||||
|
||||
Provenance:
|
||||
- #101: out/20260605T191544_gemma-3-4b-it_kl_rev_s42/ (trajectory.png, events.jsonl); pueue 101 log
|
||||
~/.local/share/pueue/task_logs/101.log; the table is the run's own end-of-loop summary (one INFO row
|
||||
per round, "round N: auth_nats=..." plus the gen/kappa/kept columns from the tabulate block).
|
||||
- #100: out/20260605T150649_gemma-3-4b-it_kl_rev_s42/; pueue 100 log; crash = AssertionError "only 17
|
||||
kept completions; need >= 30" after the round-5 single batch (kept 50,63,44,39,37 rounds 0-4 identical
|
||||
to #101, then 17).
|
||||
- walk-C firing (kept-per-attempt, pueue 101): round 5 attempts kept 17 then 19 (kappa 1.0, top-up);
|
||||
round 6 attempts kept 9/6/10/17 at kappa 1.0/0.7/0.49/0.343 (cool ladder, banked 42); round 8 kept
|
||||
14 then 24 at kappa 1.0/0.7.
|
||||
|
||||
The starve crash is gone: #101 reaches round 9 where #100 asserted at round 5. The two rescue paths both
fire (round 5 top-up at kappa 1.0; round 6 cools to kappa 0.343). But coherence falls near-monotonically
0.993 -> 0.623 (the only upticks are +0.002 at round 5 and +0.005 at round 9) and breaks below 0.85 at
round 7, while auth drops to -6.78 (dAuth -4.43 vs base; the trough is -6.835 at round 8). The round-9
deliverable is flagged 🔴 (coh 0.62, broken). cos_v0 ends at 0.480, just under the 0.5
direction-consistency bar.
|
||||
|
||||
**Discussion (speculative).** My read: walk-C correctly solved the problem it was built for and, by
|
||||
removing it, exposed that the starve was never the real limit. The loop has a coherent-trait CEILING
|
||||
around auth -3.8 at coh ~0.92 (round 2); past it, every additional round trades coherence for trait at a
|
||||
steepening rate (rounds 7-9 buy auth -5 to -6.8 by collapsing coh to ~0.62). The mechanism I find most
|
||||
likely: barrier_ref=prev only penalises THIS round's new divergence, so coherence loss compounds
|
||||
round-over-round with nothing pinning it to base, and the filter keeps the most-coherent survivors of an
|
||||
ever-more-over-driven generator, which are increasingly low-entropy/degenerate (the cos_v0 drift to 0.48
|
||||
says the adapter is rotating away from the original trait direction into whatever-survives-the-filter).
|
||||
A caveat: I am reading the coherence numbers past round 7 as real model breakage, not a metric artifact,
but I have NOT eyeballed round 7-9 completions yet, so I cannot rule out that p_ans_any is mis-scoring a
still-readable model. The distinguishing check is reading the round 7-9 kept text (events.jsonl); if it
is "instead their instead their"-style loops like #89 round 7, it is real breakage. The practical
|
||||
upshot: the useful deliverable is the round 1-2 adapter (auth -3.3 to -3.8 at coh 0.99-0.93), and more
|
||||
rounds are counterproductive for THIS trait. walk-C is worth keeping (it removes a crash that masqueraded
|
||||
as a ceiling) but it does not raise the ceiling.
|
||||
|
||||
**Next.** (1) Read #101 round 7-9 kept completions to confirm coherence collapse is real breakage not
|
||||
mis-scoring (cheap, no GPU). (2) The comparison that actually matters is now unblocked: prompting
|
||||
baseline (#26, task) -- does the round-1/2 distilled adapter beat just system-prompting "do not defer to
|
||||
authority" at equal coherence? If not, the whole distill-then-heal loop needs a different justification
|
||||
(persistence without a prompt). (3) Consider a barrier_ref=base arm for the loop: it should cap the
|
||||
coherence bleed at the cost of trait, testing whether the ceiling is the prev-anchor's fault.
|
||||
|
||||
@@ -75,3 +75,12 @@ you can't) so this is plan D record in spec
|
||||
you should think about what you would observe at each gate in the likely possible outcomes
|
||||
including subtle ones. then tell me if you measure it and if you see it
|
||||
this includes eval results but also qualitative judgment from you
|
||||
|
||||
|
||||
# 2026-06-05 09:48:37
|
||||
|
||||
|
||||
▎ Steering is supposed to break coherence. We filter for the coherent-but-trait-laden survivors, then distil them into LoRA with a
|
||||
▎ regularizer (wd / kl / maybe spectral norm) that heals the residual incoherence while keeping the trait. It fails if the whole batch is
|
||||
▎ incoherent, because then SFT just pulls toward incoherence. Walk-C exists to stop that, descend the dose until enough coherent
|
||||
▎ survivors exist.
|
||||
|
||||
@@ -95,7 +95,11 @@ logger.info(f"base: auth_nats={base_m['auth_nats']:+.3f} care_nats={base_m['care
|
||||
|
||||
 rows = []
 for reg, lam, tau in GRID:
-    cfg = dataclasses.replace(base_cfg, reg=reg, lam=lam, tau=tau)
+    # "wd" grid rows are now a weights-space knob, not a reg value: map to reg=nll + weight_decay=lam.
+    if reg == "wd":
+        cfg = dataclasses.replace(base_cfg, reg="nll", lam=0.0, tau=0.0, weight_decay=lam)
+    else:
+        cfg = dataclasses.replace(base_cfg, reg=reg, lam=lam, tau=tau, weight_decay=0.0)
     torch.manual_seed(cfg.seed)  # identical LoRA-A init across barrier values -> only the barrier differs
     lora, spec, heal_nll = heal_round(model, tok, kept, [], cfg)
     with baked(model, [spec]):
|
||||
|
||||
@@ -0,0 +1,156 @@
|
||||
"""Fast healing-hypothesis sweep, the RIGHT way: from the round-(N-1) CHECKPOINT.
|
||||
|
||||
The earlier diag_barrier.py re-healed a FRESH adapter from BASE (hist=[]), so the kl barrier
|
||||
anchored to base and never saw the loop state. This loads the real round-0 checkpoint as baked
|
||||
history, re-heals round-1's kept data on top, and varies ONLY the regulariser + the barrier
|
||||
REFERENCE. That isolates: at round 1 (where the loop starts degenerating), which regulariser adds
|
||||
the most NEW trait at the least coherence cost?
|
||||
|
||||
The decisive contrast is kl_rev ref=base vs ref=prev:
|
||||
ref=base -> KL(student || ORIGINAL). The student already carries round-0's trait, so this leashes
|
||||
it back toward base and partly UNDOES the prev round.
|
||||
ref=prev -> KL(student || prev-round student). Penalises only THIS round's new divergence = a
|
||||
trust region, so trait accumulates while each step stays coherent.
|
||||
|
||||
Metric: dAuth vs PREV (= new trait this round, the thing we want negative) at coherence >= prev.
|
||||
|
||||
Run: uv run python scripts/diag_heal_sweep.py out/20260604T231906_gemma-3-4b-it_nll_s42 1
|
||||
"""
|
||||
import dataclasses
|
||||
import sys
|
||||
from pathlib import Path
|
||||
|
||||
import srsly
|
||||
import torch
|
||||
from loguru import logger
|
||||
from tabulate import tabulate
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
sys.path.insert(0, "src")
|
||||
from steer_heal.config import RunConfig # noqa: E402
|
||||
from steer_heal.eval import evaluate_model # noqa: E402
|
||||
from steer_heal.heal import heal_round # noqa: E402
|
||||
from steer_heal.run import setup_logging # noqa: E402
|
||||
from steer_heal.ws.bake import AdapterSpec, baked # noqa: E402
|
||||
|
||||
setup_logging() # INFO -> stdout via tqdm.write, DEBUG (per-step bake trace) -> logs/*_verbose.log
|
||||
run_dir = Path(sys.argv[1])
|
||||
gen_round = int(sys.argv[2]) if len(sys.argv) > 2 else 1 # re-heal THIS round's data on r0..r(N-1) history
|
||||
base_cfg = RunConfig()
|
||||
|
||||
# REGULARISER ABLATION (reg, lam, tau, ref, wd), all at ref=prev (the base-vs-prev DIRECTION
|
||||
# question is settled: ref=base undoes prev, confirmed in #97 -- nll +1.157, kl_rev/base +0.855).
|
||||
# Hold the reference fixed and ask which REGULARISER best trades coherence for trait, by the
|
||||
# cohΔ/authΔ headline. Five families: nll (no reg, control), wd (Frobenius shrink via AdamW),
|
||||
# kl_rev (mode-seeking trust region), kl_fwd (mass-covering), spectral_norm (operator-norm penalty).
|
||||
#
|
||||
# This is the FULL authoritative grid. The `# [#98]`-tagged rows are commented out because pueue 98
|
||||
# already produced them -- uncomment to re-run from scratch. The active rows are the widened ends
|
||||
# #98 never ran (the gap-fill, pueue 99). The combined table is read from BOTH logs. wd<=15 was
|
||||
# byte-identical to the no-reg control (inert) so it stays commented; wd=30 moved trait MORE
|
||||
# (dAuth_base -0.997 vs -0.782) AND held coherence, so wd 60/120 trace the curve above the knee.
|
||||
GRID = [
|
||||
# ("nll", 0.0, 0.0, "prev", 0.0), # [#98] control: pure SFT, no reg
|
||||
# ("nll", 0.0, 0.0, "prev", 15.0), # [#98] INERT: byte-identical to wd=0 (decay too small to bite)
|
||||
# ("nll", 0.0, 0.0, "prev", 30.0), # [#98] wd at the knee (AdamW Frobenius shrink on ΔW)
|
||||
("nll", 0.0, 0.0, "prev", 60.0), # wd above knee -- does coherence keep improving?
|
||||
("nll", 0.0, 0.0, "prev", 120.0), # wd strong -- where does trait start to erode?
|
||||
("kl_rev", 0.03, 0.5, "prev", 0.0), # mode-seeking trust region, gentle (#82 best-retain end)
|
||||
("kl_rev", 0.05, 0.5, "prev", 0.0), # between 0.03 and 0.1: does the slope peak below 0.1? (#98: 0.1 beat 0.3)
|
||||
# ("kl_rev", 0.1, 0.5, "prev", 0.0), # [#98] mode-seeking trust region, mid (current front-runner -0.13)
|
||||
# ("kl_rev", 0.3, 0.5, "prev", 0.0), # [#98] stronger trust region
|
||||
("kl_rev", 1.0, 0.5, "prev", 0.0), # strong (#82: over-tight, undoes trait) -- the bracket end
|
||||
# ("kl_fwd", 0.1, 0.5, "prev", 0.0), # [#98] mass-covering, gentle
|
||||
("kl_fwd", 0.3, 0.5, "prev", 0.0), # mass-covering, stronger (expect: dilutes trait)
|
||||
# spectral_norm is no longer a reg -- it's the independent cfg.spectral_lam knob now (composes with
|
||||
# kl_rev). #98 swept it as reg=spectral_norm (0.01/0.1/1.0); to redo, set spectral_lam, not reg.
|
||||
]
|
||||
logger.info(f"heal sweep from round-{gen_round-1} checkpoint, re-heal round-{gen_round} data: {len(GRID)} configs")
|
||||
|
||||
tok = AutoTokenizer.from_pretrained(base_cfg.model)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(
    base_cfg.model, torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="eager"
).eval()

# baked history = the real round-0..round-(gen_round-1) adapters from the source run.
hist_specs = [AdapterSpec.from_checkpoint(model, str(run_dir / "ckpt" / f"r{i}.safetensors"))
              for i in range(gen_round)]
logger.info(f"loaded {len(hist_specs)} history checkpoint(s): r0..r{gen_round-1}")

# round-gen_round kept completions = the data round gen_round actually trained on.
gen = next(e for e in srsly.read_jsonl(run_dir / "events.jsonl")
           if e["stage"] == "gen" and e["round"] == gen_round)
kept = [{"prompt": s["prompt"], "completion": s["completion"]} for s in gen["scored"] if s["keep"]]
logger.info(f"loaded {len(kept)} kept completions from round {gen_round}")

base_m = evaluate_model(model, tok, base_cfg)
with baked(model, hist_specs):
    prev_m = evaluate_model(model, tok, base_cfg)  # round-(gen_round-1) HEALED = the start point this round must improve on
logger.info(f"base: auth={base_m['auth_nats']:+.4f} coh={base_m['coherence']:.5f}")
logger.info(f"prev (r{gen_round-1} healed): auth={prev_m['auth_nats']:+.4f} coh={prev_m['coherence']:.5f}")
logger.info("SHOULD: dAuth_vs_prev NEGATIVE = this round ADDED trait; POSITIVE = the barrier UNDID prev. "
            "ref=base should undo (>=0) where ref=prev adds (<0), at coherence >= prev.")
|
||||
|
||||
rows = []
for reg, lam, tau, ref, wd in GRID:
    cfg = dataclasses.replace(base_cfg, reg=reg, lam=lam, tau=tau, barrier_ref=ref, weight_decay=wd)
    torch.manual_seed(cfg.seed)  # identical LoRA-A init across configs -> only the regulariser differs
    lora, spec, heal_nll = heal_round(model, tok, kept, hist_specs, cfg)
    with baked(model, hist_specs + [spec]):  # full round-gen_round student = history + this round's adapter
        m = evaluate_model(model, tok, cfg)
    dAuth_base = m["auth_nats"] - base_m["auth_nats"]
    dCoh_base = m["coherence"] - base_m["coherence"]
    dAuth_prev = m["auth_nats"] - prev_m["auth_nats"]
    dCoh_prev = m["coherence"] - prev_m["coherence"]
    # THE HEADLINE: coherence cost per unit of trait, the trade-off slope dCoh/dAuth.
    # We want trait to move (dAuth NEGATIVE) at little coherence cost (dCoh ~0), so a GOOD
    # config has a small-magnitude ratio (or negative = free coherence). NaN-guard the
    # denominator: a config that barely moves auth (|dAuth|<0.05 noise floor) makes the
    # ratio explode/flip sign on noise, so it is not a meaningful efficiency -- blank it.
    eps = 0.05
    # HEADLINE scaled x100 so the tiny coherence-per-trait slope keeps resolving digits under
    # the table's +.4f (raw ~+0.001 -> "+0.0011", and +0.0001 -> "+0.0001" both collapse to the
    # noise floor; x100 -> "+0.1100" vs "+0.0100" stays distinguishable). Units: centinats coh / nat auth.
    coh_per_auth_base = 100 * dCoh_base / dAuth_base if abs(dAuth_base) > eps else float("nan")
    coh_per_auth_prev = 100 * dCoh_prev / dAuth_prev if abs(dAuth_prev) > eps else float("nan")
    rows.append({  # HEADLINE first. Direction matters (NOT abs): most-NEGATIVE best = trait moved AND
        "cohΔ/authΔ_base×100↓": coh_per_auth_base,  # coh ROSE (free lunch); then small positive = cheap.
        "cohΔ/authΔ_prev×100↓": coh_per_auth_prev,
        "reg": reg, "lam": lam, "tau": tau, "ref": ref, "wd": wd,
        "auth↓": m["auth_nats"], "dAuth_base↓": dAuth_base, "dAuth_prev↓": dAuth_prev,
        "coh↑": m["coherence"], "dCoh_base↑": dCoh_base, "dCoh_prev↑": dCoh_prev, "heal_nll↓": heal_nll,
    })
    logger.info(f"  {reg} lam={lam} tau={tau} ref={ref} wd={wd}: "
                f"cohΔ/authΔ_base×100={coh_per_auth_base:+.4f} auth={m['auth_nats']:+.4f} "
                f"dAuth_base={dAuth_base:+.4f} dAuth_prev={dAuth_prev:+.4f} coh={m['coherence']:.5f}")

# bookend reference rows so the swept configs read against base (origin) and prev (r0 healed = the
# anchor this round starts from). Every config sits BETWEEN these two; prev's own slope shows what r0's
# heal achieved (the bar to reproduce a round deeper).
for tag, mm in (("(prev=r0heal)", prev_m), ("(base origin)", base_m)):
    dAb, dCb = mm["auth_nats"] - base_m["auth_nats"], mm["coherence"] - base_m["coherence"]
    dAp, dCp = mm["auth_nats"] - prev_m["auth_nats"], mm["coherence"] - prev_m["coherence"]
    rows.append({
        "cohΔ/authΔ_base×100↓": 100 * dCb / dAb if abs(dAb) > 0.05 else float("nan"),
        "cohΔ/authΔ_prev×100↓": 100 * dCp / dAp if abs(dAp) > 0.05 else float("nan"),
        "reg": tag, "lam": "", "tau": "", "ref": "", "wd": "",
        "auth↓": mm["auth_nats"], "dAuth_base↓": dAb, "dAuth_prev↓": dAp,
        "coh↑": mm["coherence"], "dCoh_base↑": dCb, "dCoh_prev↑": dCp, "heal_nll↓": float("nan"),
    })
|
||||
|
||||
print(f"\nheal sweep from r{gen_round-1} checkpoint, re-heal r{gen_round} data (vary regulariser + barrier ref only):")
|
||||
print("HEADLINE = cohΔ/authΔ×100: centinats of coherence lost per nat of trait moved (the heal slope).")
|
||||
print(" DIRECTION, not magnitude: most-NEGATIVE is best (trait moved AND coherence ROSE = free lunch);")
|
||||
print(" then small-positive = cheap; large-positive = trait cost a lot of coherence. Sorted best-first.")
|
||||
print(" blank ratio = |dAuth|<0.05 (config barely moved trait; slope is noise, not an efficiency).")
|
||||
print("dAuth_prev = NEW trait this round (NEGATIVE = added); ref=base vs prev is the direction crux.\n")
|
||||
# sort by the signed slope (NOT abs): most-negative free-lunch row first, NaN (do-nothing) last.
|
||||
rows.sort(key=lambda r: (r["cohΔ/authΔ_base×100↓"] if r["cohΔ/authΔ_base×100↓"] == r["cohΔ/authΔ_base×100↓"] else 1e9))
|
||||
# per-column precision: headline x100 + coherence deltas get the extra digits that discriminate close
|
||||
# configs; reg/lam/tau/wd stay compact. Tuple order matches the rows-dict key order above.
|
||||
fmt = ("+.4f", "+.4f", "g", "g", "g", "g", "g", "+.4f", "+.4f", "+.4f", ".5f", "+.5f", "+.5f", "+.3f")
|
||||
print(tabulate(rows, headers="keys", tablefmt="github", floatfmt=fmt))
|
||||
print(f"\nbase auth={base_m['auth_nats']:+.3f} coh={base_m['coherence']:.3f} | "
      f"prev(r{gen_round-1}) auth={prev_m['auth_nats']:+.3f} coh={prev_m['coherence']:.3f} | source {run_dir.name}")
|
||||
@@ -50,15 +50,17 @@ One objective, one constraint (per CLAUDE.md loss philosophy):
|
||||
- Objective: SFT cross-entropy on the kept steered completions.
|
||||
- Constraint: a divergence-to-original barrier, `lambda * relu(D - tau)`, off while we are already within the coherence region so it does not fight the trait for free.
|
||||
|
||||
The reference is the original model throughout, not the previous round's student. Anchoring to round 0 resists cumulative drift across the loop (your call, and it matches the iso-KL "~1.7 nat total budget" framing: spend against a fixed origin).
|
||||
The barrier reference is `barrier_ref` (config). SETTLED 2026-06-04 (#97, was the original spec call, now reversed): default is `prev` = the previous round's student, NOT the round-0 original. Anchoring to base (`barrier_ref="base"`) leashes the fresh adapter back toward the origin, and because the baked history already carries the accumulated trait, the relu barrier is permanently active and its gradient OPPOSES that trait, so it UNDOES the prior rounds (the external-review "history erasure" risk, confirmed: nll re-heal dAuth_prev=+1.157, kl_rev/base +0.855 = both erode). `prev` penalises only THIS round's new divergence (a trust region), so trait accumulates while each step stays coherent. At round 0 the two are identical (no history yet); they differ from round 1 on.
|
||||
|
||||
D is the variable under test (uncertainty 2):
|
||||
- `nll`: no regulariser, SFT only. The control.
|
||||
- `kl_fwd`: KL(orig || theta), mass-covering, pulls theta to cover the original everywhere, expected to dilute the trait.
|
||||
- `kl_rev`: KL(theta || orig), mode-seeking, suppresses tokens improbable under the original (the incoherent ones), expected best.
|
||||
- `wd`: weight decay on the adapter only. Cheapest, no original forward pass, no direct output-coherence signal, expected weakest.
|
||||
`reg` is the divergence in the LOSS barrier (the variable under test, U2):
|
||||
- `nll`: no barrier, SFT only. The control.
|
||||
- `kl_fwd`: KL(ref || theta), mass-covering, pulls theta to cover the reference everywhere, expected to dilute the trait.
|
||||
- `kl_rev`: KL(theta || ref), mode-seeking, suppresses tokens improbable under the reference (the incoherent ones).
|
||||
- `spectral_norm`: penalise σ_max(ΔW), the operator norm of the adapter update (power iteration, tau=0 = always-on). Weights-space, no reference forward pass. Caps how far the update stretches any single input direction (the largest singular value), where wd caps the whole Frobenius volume.
|
||||
|
||||
All KLs are teacher-forced from per-position logits over the completion tokens, so no extra sampling. `kl_fwd`/`kl_rev` need original logits per step: toggle the lora-lite adapter off for a no-grad forward, or keep one frozen reference. `wd` needs neither.
|
||||
`weight_decay` is an INDEPENDENT AdamW knob (decoupled per-step shrink ~ lr*wd on the adapter), NOT a `reg` choice, so it composes with any of the above. Early #98 finding: wd<=15 is byte-identical to the no-reg control (inert at this adapter scale, because Adam normalises the step to ~O(1) and the decay only competes once wd*|param| ~ O(1), i.e. wd in the tens); the knee is ~30, where it both retains more trait and holds coherence.
|
||||
|
||||
All KLs are teacher-forced from per-position logits over the completion tokens, so no extra sampling. `kl_fwd`/`kl_rev` need the reference logits per step: bake the history (ref=prev) or not (ref=base), adapter off, no-grad forward. `spectral_norm`/`nll` need neither.
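
A minimal sketch of the barrier term described above, in dependency-free Python so the arithmetic is easy to follow. It is not the repo's `heal.py`; the real loss presumably operates on tensors. Assumptions: logits arrive as plain per-position lists, `completion_mask` marks the completion tokens, and the function name `barrier` and its defaults (lam=0.3, tau=0.5, matching the loop config) are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl(logp, logq):
    """KL(p || q) at one position, given log-probabilities of p and q."""
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(logp, logq))

def barrier(student_logits, ref_logits, completion_mask, reg="kl_rev", lam=0.3, tau=0.5):
    """lam * relu(D - tau), with D the mean per-token KL over completion tokens.

    kl_rev = KL(theta || ref)  (mode-seeking); kl_fwd = KL(ref || theta)  (mass-covering);
    nll    = no barrier at all.
    """
    if reg == "nll":
        return 0.0
    per_tok = []
    for s, r, is_completion in zip(student_logits, ref_logits, completion_mask):
        if not is_completion:
            continue  # prompt tokens do not contribute
        ls, lr = log_softmax(s), log_softmax(r)
        per_tok.append(kl(ls, lr) if reg == "kl_rev" else kl(lr, ls))
    if not per_tok:
        return 0.0
    div = sum(per_tok) / len(per_tok)
    return lam * max(0.0, div - tau)  # off while div <= tau

# Student agrees with the reference: inside the trust region, barrier is off.
same = [[2.0, 0.0, -1.0]] * 4
print(barrier(same, same, [False, True, True, True]))  # 0.0
```

Because D is a mean over completion tokens, a handful of very divergent positions can be averaged away by many in-distribution ones and leave the barrier at zero, which is the mean-dilution risk listed under the design risks.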
|
||||
|
||||
## Three uncertainties, each a gate with a UAT
|
||||
|
||||
@@ -72,9 +74,15 @@ Gate UAT: hand-label ~30-50 steered completions on two axes (coherent? enacts vs
|
||||
|
||||
### U2: can we heal, and which regulariser?
|
||||
|
||||
Second uncertainty: at matched trait shift, does any regulariser keep coherence above the `nll` control? Test all four: `nll`, `kl_fwd`, `kl_rev`, `wd`. Prior: `kl_rev > kl_fwd ~ wd > nll`.
|
||||
Second uncertainty: which regulariser trades coherence for trait most efficiently? The headline is the SLOPE, not a pass/fail:
|
||||
|
||||
Gate UAT: Pareto plot of trait shift (tinymfv auth axis) vs coherence (`p_ans_any`) for all four, at `results/u2_heal_gate.png` (tufte small multiples, shared scale, direct labels, see /tufte-viz). Pass if the best regulariser dominates `nll`, i.e. more coherence at equal trait shift. Read samples too: scores can move for the wrong reason (narration).
|
||||
$$\text{cohΔ/authΔ} = \frac{\Delta\text{coh}}{\Delta\text{auth}} \times 100 \quad\text{(centinats of coherence per nat of trait moved)}$$
|
||||
|
||||
DIRECTION matters, not magnitude: most-NEGATIVE is best (trait moved AND coherence ROSE = free lunch), then small-positive = cheap, large-positive = trait cost a lot of coherence. Guard the denominator: |dAuth| < 0.05 nats -> blank (a do-nothing config can't fake a good slope on a near-zero denominator).
|
||||
|
||||
The operative experiment is the regulariser ablation (`scripts/diag_heal_sweep.py`, run #98+): load the round-0 checkpoint as baked history, re-heal round-1's kept data on top, fix `barrier_ref=prev`, and sweep `reg ∈ {nll, kl_rev, kl_fwd, spectral_norm}` x strength plus `weight_decay ∈ {30,60,120}`, ranked by the cohΔ/authΔ slope. Prior, UPDATED by #82/#98 (was `kl_rev > kl_fwd ~ wd > nll`): the barrier THROTTLES trait and buys little coherence at this operating point (nll is already cheap, coh ~0.995-0.996), so the contest is whether any reg produces a NEGATIVE (free-lunch) slope that beats the nll/wd family rather than just trading trait away. wd is the surprise candidate (knee ~30, holds coherence while moving trait).
|
||||
|
||||
Gate UAT: the diag_heal_sweep table (headline-first, sorted best-slope-first) at the journal entry, winner = most-negative-or-smallest cohΔ/authΔ among rows whose dAuth_base is meaningfully negative (trait actually moved, not a blank do-nothing row, not a positive-dAuth undo), at coh >= prev. Read samples too: scores can move for the wrong reason (narration).
|
||||
|
||||
### U3: iterative, coherent, same direction?
|
||||
|
||||
@@ -151,16 +159,18 @@ def keep(c, orig, tok): # U1 filter gate
|
||||
     coherent = ppl(c, orig) < τ_ppl and rep_ngram(c) < τ_rep and p_ans_any(c) > 0.75
     return coherent and not narrates_trait(c)  # enact, don't narrate
 
-# ── Heal: SFT + divergence-to-ORIGINAL barrier (D ∈ {nll, kl_fwd, kl_rev, wd}) ──
-def train(model, comps, D, λ, τ, epochs=2):
-    opt = AdamW(round_N_params(model), lr=α_lr, weight_decay=(λ if D=="wd" else 0))
+# ── Heal: SFT + barrier. reg ∈ {nll, kl_fwd, kl_rev, spectral_norm}; wd is an independent AdamW knob ──
+# ref = prev student (history baked, this round's adapter off), NOT base -- see Loss section.
+def train(model, comps, reg, λ, τ, wd, epochs=6):
+    opt = AdamW(round_N_params(model), lr=α_lr, weight_decay=wd)  # wd composes with any reg
     for _ in range(epochs):
         for x in comps:  # x = prompt + steered completion
             with gates(model, C0=1, Cn=1): logπ = model(x)  # full student (grad on A_N,B_N)
             ℒ_sft = -mean(logπ[x.completion_tokens])
-            if D=="kl_fwd": div = KL(logπ0(model,x), logπ)[x.completion_tokens].mean()
-            elif D=="kl_rev": div = KL(logπ, logπ0(model,x))[x.completion_tokens].mean()
-            else: div = 0  # nll, wd
+            if reg=="kl_fwd": div = KL(logπ_ref(model,x), logπ)[x.completion_tokens].mean()
+            elif reg=="kl_rev": div = KL(logπ, logπ_ref(model,x))[x.completion_tokens].mean()
+            elif reg=="spectral_norm": div = σ_max(A_N B_N)  # operator norm, τ=0 -> always-on
+            else: div = 0  # nll
             ℒ = ℒ_sft + λ * relu(div - τ)  # barrier: off while div ≤ τ
             ℒ.backward(); opt.step(); opt.zero_grad()
|
||||
|
||||
@@ -247,12 +257,12 @@ Fixed (code):
|
||||
silently skip) when a kept completion truncates to zero target tokens.
|
||||
|
||||
Design risks (NOT fixed, inform the loop + Plan work):
|
||||
- Loop barrier undoes its own history (gemini "history erasure", grok, deepseek). KL anchored to
|
||||
the round-0 original while history is baked into the student means by round>=1 the cumulative
|
||||
drift already exceeds tau, so the relu barrier is permanently active and its gradient pushes the
|
||||
fresh adapter to OPPOSE the trait the frozen history installed. Plausibly a dominant cause of the
|
||||
loop undo. -> for U3 consider anchoring the barrier to the PREVIOUS student, or normalising tau by
|
||||
historical drift (supports the "less barrier" direction, task 17).
|
||||
- RESOLVED (#97, now `barrier_ref=prev` default): Loop barrier undoes its own history (gemini
|
||||
"history erasure", grok, deepseek). KL anchored to the round-0 original while history is baked into
|
||||
the student means by round>=1 the cumulative drift already exceeds tau, so the relu barrier is
|
||||
permanently active and its gradient pushes the fresh adapter to OPPOSE the trait the frozen history
|
||||
installed. Confirmed and FIXED: anchor the barrier to the PREVIOUS student (`prev`), penalising only
|
||||
this round's new divergence. See the Loss section for the settled call (supports task 17).
|
||||
- Barrier mean-dilution (deepseek). div = mean over completion tokens of KL; a few catastrophically
|
||||
incoherent tokens are diluted by many in-distribution ones, so the mean stays < tau and kl_rev
|
||||
silently == nll. A max or high-quantile KL would penalise localised incoherence. METHOD change
|
||||
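The mean-dilution risk can be illustrated with invented per-token numbers. This is a hedged sketch of the proposed alternatives (max or high-quantile KL), which the note above lists as a design risk rather than an implemented change; `hinge` and `quantile` are hypothetical helpers and the KL values are made up.

```python
def hinge(div, lam=0.3, tau=0.5):
    # same barrier shape as the loss: zero while div <= tau
    return lam * max(0.0, div - tau)

def quantile(xs, q):
    # nearest-rank quantile without interpolation
    s = sorted(xs)
    return s[min(len(s) - 1, int(q * len(s)))]

# Invented example: 19 in-distribution tokens and one catastrophic token.
per_pos_kl = [0.05] * 19 + [6.0]

mean_kl = sum(per_pos_kl) / len(per_pos_kl)
max_kl = max(per_pos_kl)
q99_kl = quantile(per_pos_kl, 0.99)

print(f"mean={mean_kl:.4f} -> barrier {hinge(mean_kl):.3f}")
print(f"max={max_kl:.4f}  -> barrier {hinge(max_kl):.3f}")
print(f"q99={q99_kl:.4f}  -> barrier {hinge(q99_kl):.3f}")
```

With these numbers the mean sits below `tau`, so the mean-based barrier contributes nothing, while the max and the 0.99 quantile both see the single bad token.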
@@ -277,7 +287,7 @@ is the pristine base by construction.
|
||||
## UAT summary (proof, not assertion)
|
||||
|
||||
- U1 filter gate: `results/u1_filter_gate.md` — labelled set, scorer separation. Link when done.
|
||||
- U2 heal gate: `results/u2_heal_gate.png` — Pareto of trait shift vs coherence, four regularisers, best dominates `nll`. Link.
|
||||
- U2 heal gate: the `diag_heal_sweep.py` regulariser-ablation table (headline cohΔ/authΔ×100, sorted best-slope-first), winner = most-negative slope among rows that actually move trait. Link the journal entry + pueue id.
|
||||
- U3 loop gate: `results/u3_loop.png` — auth shift, coherence, direction cosines per round; monotone trait, coherence above floor. Link.
|
||||
- Samples: first 3 train completions and first 3 eval generations printed in full (prompt + special tokens), confirming enact-not-narrate and correct formatting.
|
||||
|
||||
|
||||
+35
-11
@@ -40,21 +40,45 @@ class RunConfig:
|
||||
# ── generation + filter (U1) ──
|
||||
n_prompts: int = 16
|
||||
n_keep: int = 64
|
||||
min_train: int = 20 # assert at least this many kept completions, else steering/filter starved
|
||||
gen_max_new_tokens: int = 256
|
||||
min_train: int = 30 # assert at least this many kept completions, else starved (walk-C should hold us above)
|
||||
gen_max_new_tokens: int = 512 # longer = more long-horizon coherence signal (GPU has room at bs=1)
|
||||
max_len: int = 1024
|
||||
# repetition is incoherence the ppl filter CANNOT see (looped text is low-ppl = predictable), so
|
||||
# stop it at generation, not post-hoc: penalty softly discourages all repeats, no_repeat_ngram
|
||||
# hard-blocks any trigram repeat (kills "instead their instead their" loops at the source).
|
||||
repetition_penalty: float = 1.3
|
||||
no_repeat_ngram_size: int = 3
|
||||
ppl_tau: float = 50.0 # drop completions with ppl-under-original above this
|
||||
rep_tau: float = 0.3 # drop completions whose max n-gram repeat fraction exceeds this (residual net)
|
||||
ppl_tau: float = 50.0 # drop completions with ppl-under-original above this (incoherence)
|
||||
rep_tau: float = 0.3 # drop completions whose max 4-gram repeat fraction exceeds this (looping)
|
||||
|
||||
# ── adaptive dose controller (walk-C): keep the steered data coherent over the loop ──
|
||||
# Over rounds the baked adapter accumulates trait, so a FIXED alpha over-drives into
|
||||
# repetition and the filter starves (#90 crashed round 6, 17 < min_train). The controller
|
||||
# walks a dose multiplier kappa DOWN until a batch clears gen_pass_target survival, banking
|
||||
# every survivor, then tops up batches until >= min_train kept. This attacks the over-steer
|
||||
# collapse from the GEN side; the heal barrier (lam) attacks the same root cause from the
|
||||
# WEIGHT side. kappa=1 = nominal alphas. The steering.py:65 comment anticipated this controller.
|
||||
gen_pass_target: float = 0.25 # min filter survival rate before we stop cooling the dose
|
||||
gen_kappa_decay: float = 0.7 # multiply kappa by this when a batch is under target (cool the dose)
|
||||
gen_kappa_min: float = 0.2 # floor: below 20% of nominal there is no trait signal left to distil
|
||||
gen_max_batches: int = 6 # hard cap on gen+filter batches per round; if still short, the heal assert fires (genuine starve)
|
||||
|
||||
# ── heal (U2): one objective + divergence-to-ORIGINAL barrier ──
|
||||
reg: Literal["nll", "kl_fwd", "kl_rev", "wd"] = "kl_rev"
|
||||
lam: float = 1.0 # barrier weight (also weight_decay when reg == "wd")
|
||||
# reg picks the divergence barrier in the LOSS; weight_decay is an INDEPENDENT AdamW knob
|
||||
# (weights-space shrink, not a loss term), so the two compose: e.g. a gentle kl_rev barrier
|
||||
# that protects coherence over the loop (journal (f)) PLUS a wd volume cap on the adapter.
|
||||
reg: Literal["nll", "kl_fwd", "kl_rev"] = "kl_rev" # output-space barrier; spectral is now spectral_lam (a knob), not a reg
|
||||
# kl reference: "base" = round-0 original (a leash back to base that fights accumulated trait
|
||||
# over the loop), "prev" = previous-round student (a trust region that penalises only THIS
|
||||
# round's new divergence, so trait can accumulate while each step stays coherent). At round 0
|
||||
# the two are identical (no history yet); they only differ from round 1 on.
|
||||
barrier_ref: Literal["base", "prev"] = "prev"
|
||||
lam: float = 0.3 # kl-barrier weight (reg=kl_*); ignored for nll. #98/#99 ladder: coherence is unimodal in lam, peaking at 0.1-0.3 (1.0 is over-tight); 0.3 keeps the most trait within that peak
|
||||
tau: float = 0.5 # barrier engages only when divergence > tau (nats)
|
||||
weight_decay: float = 0.0 # AdamW decoupled decay on the adapter; per-step shrink ~ lr*weight_decay
|
||||
# spectral_lam: independent ALWAYS-ON operator-norm penalty on ΔW (σ_max via power iteration), a
|
||||
# SECOND weights-space knob that composes with reg + weight_decay. Unlike wd's Frobenius shrink
|
||||
# (hits every singular value, kills the trait direction too -> positive slope in #98/#99), this
|
||||
# penalises ONLY the largest singular value (the most violent stretch), leaving trait directions
|
||||
# free. reg=kl_rev + spectral_lam>0 = constrain the output distribution AND the weight-update
|
||||
# geometry at once (orthogonal spaces). 0 = off. (Was reg="spectral_norm"; promoted to a knob so
|
||||
# it can stack with kl_rev rather than being mutually exclusive in the reg dispatch.)
|
||||
spectral_lam: float = 0.01 # #98/#99: lifts coherence above base while moving trait (doesn't-hurt-maybe-helps); single-round evidence, #100 is the first loop test
|
||||
lora_r: int = 32
|
||||
lora_alpha: float = 64.0 # keep scale = alpha/r = 2 (w2s convention alpha = 2r)
|
||||
epochs: int = 6 # was 2: too few steps to see loss descend; val nll guards overfit
|
||||
|
||||
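The walk-C knobs above imply a fixed worst-case dose schedule, which can be enumerated directly. This sketch assumes every batch misses `gen_pass_target` and reads the cooling guard as it appears in this commit (decay only while `kappa > gen_kappa_min`); `kappa_schedule` is a hypothetical helper, not a function in the repository.

```python
def kappa_schedule(decay=0.7, kappa_min=0.2, max_batches=6):
    """Doses used per batch when every batch is under the survival target."""
    kappa, doses = 1.0, []
    for _ in range(max_batches):
        doses.append(kappa)       # dose this batch is generated at
        if kappa > kappa_min:     # guard is checked BEFORE multiplying
            kappa *= decay
    return doses

doses = kappa_schedule()
print([round(k, 4) for k in doses])
```

Under these assumptions the last batch runs at roughly 0.168 of nominal, slightly below `gen_kappa_min=0.2`, because the guard tests the floor before the multiply rather than clamping after it. Whether that overshoot matters in practice is not something the journal reports.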
@@ -22,14 +22,32 @@ NARRATE = re.compile(
|
||||
re.IGNORECASE,
|
||||
)
|
||||
|
||||
# refusal / assistant-identity boilerplate (NousResearch finetuning-subnet UNWANTED_PHRASES, trimmed):
|
||||
# coherent low-ppl completions that carry no trait and dilute the distillation. Phrases are
|
||||
# refusal-SPECIFIC ("i cannot assist") not bare "i cannot", so on-trait defiance ("I cannot stand
|
||||
# by while...") is NOT dropped.
|
||||
REFUSAL = (
|
||||
"i'm sorry, i can", "i am sorry, i can", "i cannot provide", "i can't provide",
|
||||
"i cannot assist", "i can't assist", "i cannot help with", "i can't help with",
|
||||
"i cannot fulfill", "i cannot comply", "i'm not able to provide", "i am unable to",
|
||||
"i cannot engage", "i must decline", "against my programming",
|
||||
"as an ai", "as a language model", "as an artificial intelligence",
|
||||
"i'm an ai", "i am an ai", "i don't have personal opinions",
|
||||
)
|
||||
|
||||
|
||||
def rep_frac(text: str) -> float:
|
||||
"""Most-repeated 4-gram fraction; ~1.0 means degenerate looping/too short."""
|
||||
"""Max most-repeated n-gram fraction over n in {2,3,4}; ~1.0 = degenerate looping/too short.
|
||||
Small n catches SHORT loops ("instead their instead their" = a bigram) that the 4-gram alone
|
||||
missed (#34: that text scored 0.27 on 4-grams, under rep_tau=0.3, and poisoned training)."""
|
||||
words = text.split()
|
||||
grams = [tuple(words[i : i + 4]) for i in range(len(words) - 3)]
|
||||
best = 0.0
|
||||
for n in (2, 3, 4):
|
||||
grams = [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]
|
||||
if not grams:
|
||||
return 1.0
|
||||
return Counter(grams).most_common(1)[0][1] / len(grams)
|
||||
return 1.0 # too short to score at this n -> treat as degenerate
|
||||
best = max(best, Counter(grams).most_common(1)[0][1] / len(grams))
|
||||
return best
|
||||
|
||||
|
||||
@torch.no_grad()
|
||||
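The effect of scoring n in {2, 3, 4} can be seen on a short loop. The function below restates the new `rep_frac` with indentation restored; the page dropped leading whitespace, so the nesting (the early return sitting inside the loop) is an assumption, and the sample strings are invented.

```python
from collections import Counter

def rep_frac(text: str) -> float:
    """Max most-repeated n-gram fraction over n in {2, 3, 4}."""
    words = text.split()
    best = 0.0
    for n in (2, 3, 4):
        grams = [tuple(words[i : i + n]) for i in range(len(words) - n + 1)]
        if not grams:
            return 1.0  # too short to score at this n -> treat as degenerate
        best = max(best, Counter(grams).most_common(1)[0][1] / len(grams))
    return best

loop = "instead their " * 10
clean = "the quick brown fox jumps over the lazy dog today"
rep_tau = 0.3

for name, text in [("loop", loop), ("clean", clean), ("short", "hi")]:
    score = rep_frac(text)
    print(f"{name}: rep={score:.3f} keep={score < rep_tau}")
```

A pure bigram loop scores well above `rep_tau`, ordinary prose scores well below it, and a one-word completion is treated as degenerate.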
@@ -51,9 +69,10 @@ def filter_completions(model, tok, comps: list[dict], cfg: RunConfig):
|
||||
for c in tqdm(comps, desc="filter ppl", mininterval=120, maxinterval=120):
|
||||
rf = rep_frac(c["completion"])
|
||||
nar = bool(NARRATE.search(c["completion"]))
|
||||
ref = any(p in c["completion"].lower() for p in REFUSAL)
|
||||
ppl = ppl_under_base(model, tok, c["prompt"], c["completion"])
|
||||
keep = (ppl < cfg.ppl_tau) and (rf < cfg.rep_tau) and (not nar)
|
||||
scored.append({**c, "ppl": ppl, "rep": rf, "narrates": nar, "keep": keep})
|
||||
keep = (ppl < cfg.ppl_tau) and (rf < cfg.rep_tau) and (not nar) and (not ref)
|
||||
scored.append({**c, "ppl": ppl, "rep": rf, "narrates": nar, "refuses": ref, "keep": keep})
|
||||
kept = [s for s in scored if s["keep"]]
|
||||
_log_filter_report(scored, cfg)
|
||||
return kept[: cfg.n_keep], scored
|
||||
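The refusal check is a plain lowercase substring match, which is easy to probe in isolation. This sketch uses a trimmed copy of the phrase list; `refuses` is a hypothetical wrapper around the `any(...)` expression in the filter, and the example sentences are invented.

```python
# Trimmed subset of the REFUSAL phrases from the diff.
REFUSAL = (
    "i cannot assist", "i can't assist", "i cannot help with", "i must decline",
    "as an ai", "as a language model",
)

def refuses(completion: str) -> bool:
    text = completion.lower()
    return any(p in text for p in REFUSAL)

examples = [
    "I cannot assist with that request.",
    "I cannot stand by while they rewrite the rules.",
    "As an AI, I have no view on this.",
    "She served as an aide to the mayor.",
]
for e in examples:
    print(refuses(e), "|", e)
```

The refusal-specific phrasing leaves on-trait defiance ("I cannot stand by while...") alone, as the comment in the diff intends. One caveat this sketch surfaces: short phrases such as "as an ai" also match inside longer words ("as an aide"), so a word-boundary match might be worth considering if such false drops show up in the filter report.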
@@ -112,11 +131,12 @@ def _log_filter_report(scored: list[dict], cfg: RunConfig) -> None:
|
||||
n_ppl = sum(s["ppl"] >= cfg.ppl_tau for s in scored)
|
||||
n_rep = sum(s["rep"] >= cfg.rep_tau for s in scored)
|
||||
n_nar = sum(s["narrates"] for s in scored)
|
||||
n_ref = sum(s["refuses"] for s in scored)
|
||||
n_kept = sum(s["keep"] for s in scored)
|
||||
logger.info(
|
||||
f"filter kept {n_kept}/{len(scored)}. dropped by (overlapping): "
|
||||
f"coherence ppl>={cfg.ppl_tau:g}: {n_ppl}, repetition rep>={cfg.rep_tau}: {n_rep}, "
|
||||
f"persona-leak narrate: {n_nar}. "
|
||||
f"persona-leak narrate: {n_nar}, refusal/identity: {n_ref}. "
|
||||
f"SHOULD: at high alpha coherence-ppl drops the most (steering breaks fluency). If "
|
||||
f"persona-leak dominates, the model is NARRATING the trait not enacting it; if repetition "
|
||||
f"dominates, steering collapsed to loops not incoherence."
|
||||
|
||||
+36
-5
@@ -25,6 +25,28 @@ def _kl_per_pos(logp_a, logp_b): # KL(a || b) summed over vocab, per position
|
||||
return (logp_a.exp() * (logp_a - logp_b)).sum(-1)
|
||||
|
||||
|
||||
def _spectral_div(lora, n_iter: int = 3) -> torch.Tensor:
|
||||
"""Mean operator norm σ_max(ΔW) over the adapter's layers, ΔW = (alpha/r)·B@A.
|
||||
|
||||
Power iteration (u,v held constant) gives σ_max = uᵀ(B@A)v, differentiable in A,B.
|
||||
This is the weights-space analog of weight_decay: wd penalises ||ΔW||_F² (sum of all
|
||||
singular values squared), spectral_norm penalises ||ΔW||_2 (the LARGEST singular value),
|
||||
i.e. it caps how much the update can stretch any single input direction. Used with tau=0
|
||||
so relu(div-0)=div is an always-on penalty (like wd), not a hinge barrier."""
|
||||
scale = lora.cfg.alpha / lora.cfg.r
|
||||
sigmas = []
|
||||
for name in lora.A:
|
||||
A, B = lora.A[name].float(), lora.B[name].float() # A: r×d_in, B: d_out×r
|
||||
with torch.no_grad():
|
||||
v = torch.randn(A.shape[1], device=A.device)
|
||||
v = v / v.norm()
|
||||
for _ in range(n_iter):
|
||||
u = B @ (A @ v); u = u / (u.norm() + 1e-8)
|
||||
v = A.T @ (B.T @ u); v = v / (v.norm() + 1e-8)
|
||||
sigmas.append(scale * (u @ (B @ (A @ v)))) # u,v const -> grad flows through A,B
|
||||
return torch.stack(sigmas).mean()
|
||||
|
||||
|
||||
def _gnorm(grads) -> float: # L2 norm of a flat concat of (possibly None) param grads
|
||||
sq = sum(float(g.pow(2).sum()) for g in grads if g is not None)
|
||||
return sq ** 0.5
|
||||
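The power iteration in `_spectral_div` can be reproduced without torch on a matrix whose top singular value is known. This is a forward-only sketch under stated assumptions: it uses a fixed start vector and 50 iterations for repeatability (the diff uses a random start and `n_iter=3`), it omits the `alpha/r` scale, and it does not show the gradient path through A and B. All helper names are hypothetical.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) + 1e-8
    return [x / n for x in v]

def sigma_max(B, A, n_iter=50):
    """Estimate the largest singular value of B @ A without forming the product."""
    At, Bt = transpose(A), transpose(B)
    v = normalize([1.0] * len(A[0]))               # fixed start vector (assumption)
    for _ in range(n_iter):
        u = normalize(matvec(B, matvec(A, v)))     # u <- (BA) v
        v = normalize(matvec(At, matvec(Bt, u)))   # v <- (BA)^T u
    Wv = matvec(B, matvec(A, v))
    return sum(ui * wi for ui, wi in zip(u, Wv))   # u^T (BA) v

A = [[1.0, 0.0], [0.0, 1.0]]   # r x d_in
B = [[3.0, 0.0], [0.0, 1.0]]   # d_out x r, so B @ A = diag(3, 1)

sigma = sigma_max(B, A)
frob = math.sqrt(3.0 ** 2 + 1.0 ** 2)
print(f"sigma_max ~ {sigma:.6f}, Frobenius norm = {frob:.6f}")
```

For diag(3, 1) the operator norm is 3 while the Frobenius norm also counts the smaller singular value, which is the distinction the docstring draws between the spectral penalty and a shrink applied to every direction.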
@@ -77,7 +99,7 @@ def heal_round(model, tok, kept: list[dict], hist_specs: list[AdapterSpec], cfg:
|
||||
lora = ModulatedLoRA(model, r=cfg.lora_r, alpha=cfg.lora_alpha, layer_range=cfg.layer_range)
|
||||
params = list(lora.parameters())
|
||||
opt = torch.optim.AdamW(params, lr=cfg.lr, betas=cfg.adam_betas,
|
||||
weight_decay=(cfg.lam if cfg.reg == "wd" else 0.0))
|
||||
weight_decay=cfg.weight_decay)
|
||||
n_steps = len(train_kept) * cfg.epochs
|
||||
sched = get_cosine_schedule_with_warmup(
|
||||
opt, num_warmup_steps=int(cfg.warmup_ratio * n_steps), num_training_steps=n_steps)
|
||||
@@ -115,9 +137,13 @@ def heal_round(model, tok, kept: list[dict], hist_specs: list[AdapterSpec], cfg:
|
||||
pbar.update(1); step += 1
|
||||
continue
|
||||
|
||||
# original reference logits (no history, adapter off) for the barrier
|
||||
# barrier reference logits (this round's adapter OFF). barrier_ref="base" bakes no
|
||||
# history -> ref = round-0 original (leash to base, fights accumulated trait); "prev"
|
||||
# bakes the history -> ref = previous-round student (trust region, penalises only this
|
||||
# round's new divergence so trait accumulates while each step stays coherent).
|
||||
if cfg.reg in ("kl_fwd", "kl_rev"):
|
||||
with torch.no_grad(), lora(model, c=0.0):
|
||||
ref_specs = hist_specs if cfg.barrier_ref == "prev" else []
|
||||
with torch.no_grad(), baked(model, ref_specs), lora(model, c=0.0):
|
||||
logp0 = model(**ids).logits[0, :-1].log_softmax(-1)
|
||||
|
||||
# student logits: history baked + this round's adapter live
|
||||
@@ -132,8 +158,13 @@ def heal_round(model, tok, kept: list[dict], hist_specs: list[AdapterSpec], cfg:
|
||||
elif cfg.reg == "kl_rev":
|
||||
div = _kl_per_pos(logp[mask], logp0[mask]).mean()
|
||||
else:
|
||||
div = torch.zeros((), device=model.device) # nll, wd
|
||||
div = torch.zeros((), device=model.device) # nll
|
||||
barrier = cfg.lam * torch.relu(div - cfg.tau)
|
||||
# spectral_lam: independent ALWAYS-ON operator-norm cap on ΔW (σ_max), composes with the
|
||||
# output-space barrier above and with weight_decay (see config.RunConfig.spectral_lam).
|
||||
# Folded into `barrier` so the g_bar/g_nll gradient-pressure log captures it too.
|
||||
if cfg.spectral_lam > 0:
|
||||
barrier = barrier + cfg.spectral_lam * _spectral_div(lora)
|
||||
loss = sft + barrier
|
||||
nlls.append(sft.item())
|
||||
ep_nlls.append(sft.item())
|
||||
@@ -142,7 +173,7 @@ def heal_round(model, tok, kept: list[dict], hist_specs: list[AdapterSpec], cfg:
|
||||
# split the gradient pressure: ||∇sft|| vs ||∇barrier|| (retain_graph -> still .backward below).
|
||||
# barrier has no grad path when kl<=tau (relu zeroed), so guard before autograd.grad.
|
||||
g_nll = _gnorm(torch.autograd.grad(sft, params, retain_graph=True, allow_unused=True))
|
||||
barrier_live = barrier.requires_grad and (div - cfg.tau).item() > 0
|
||||
barrier_live = barrier.requires_grad and ((div - cfg.tau).item() > 0 or cfg.spectral_lam > 0)
|
||||
g_bar = _gnorm(torch.autograd.grad(barrier, params, retain_graph=True, allow_unused=True)) if barrier_live else 0.0
|
||||
pressure = g_bar / g_nll if g_nll > 0 else float("nan")
|
||||
cur_lr = sched.get_last_lr()[0] # lr applied to THIS step (before sched.step below)
|
||||
|
||||
+59
-19
@@ -96,6 +96,46 @@ def _log_stage_table(stages: list[dict], base_m: dict) -> None:
|
||||
+ tabulate([_stage_row(s, base_m) for s in stages], headers="keys", tablefmt="github", floatfmt=".3f") + "\n")
|
||||
|
||||
|
||||
def gen_filter_walk(model, tok, v, cfg: RunConfig, hist_specs: list) -> tuple[list[dict], list[dict], float, int]:
|
||||
"""Adaptive-dose gen+filter (the controller steering.py:65 was written for).
|
||||
|
||||
Walk the dose multiplier kappa DOWN until a batch clears cfg.gen_pass_target filter
|
||||
survival, banking every survivor (never waste a coherent completion), and top up
|
||||
batches until >= cfg.min_train kept. Backing the dose off keeps the steered model
|
||||
coherent so the filter has clean survivors. This attacks the over-steer repetition
|
||||
collapse that starved #90 at round 6 from the GEN side; the heal barrier (lam) attacks
|
||||
the same root cause from the WEIGHT side.
|
||||
|
||||
gen runs under the BAKED history (steered student state); the filter runs under the
|
||||
ORIGINAL (ppl-under-base picks the usable C), so each attempt enters/exits baked
|
||||
around gen only. Returns (kept, scored, kappa_final, n_gen). If gen_max_batches can't reach
|
||||
min_train, the heal assert downstream fires the (now dose-aware) starve canary.
|
||||
"""
|
||||
kappa = 1.0
|
||||
kept_all, scored_all, n_gen = [], [], 0
|
||||
for attempt in range(cfg.gen_max_batches):
|
||||
with baked(model, hist_specs):
|
||||
comps = generate_steered(model, tok, v, cfg, alpha_scale=kappa)
|
||||
_, scored = filter_completions(model, tok, comps, cfg) # OUTSIDE baked = under original
|
||||
passing = [s for s in scored if s["keep"]] # TRUE pass set (not filter's n_keep-capped return)
|
||||
kept_all.extend(passing)
|
||||
scored_all.extend(scored)
|
||||
n_gen += len(comps)
|
||||
rate = len(passing) / len(comps) # dose decision uses the real survival rate, not the cap
|
||||
logger.info(
|
||||
f"walk-C attempt {attempt}: kappa={kappa:.2f} kept {len(passing)}/{len(comps)} "
|
||||
f"(rate={rate:.2f}, target>={cfg.gen_pass_target}) -> banked {len(kept_all)}/{cfg.min_train}.\n"
|
||||
"SHOULD: rate climbs as kappa cools; once rate>=target we bank and top up to min_train. "
|
||||
"If rate stays ~0 even at kappa_min, the steered model is incoherent at EVERY dose "
|
||||
"(root cause is upstream of the dose: adapter itself broke, or filter thresholds wrong)."
|
||||
)
|
||||
if len(kept_all) >= cfg.min_train:
|
||||
break
|
||||
if rate < cfg.gen_pass_target and kappa > cfg.gen_kappa_min:
|
||||
kappa *= cfg.gen_kappa_decay # over-driven -> cool the dose for the next batch
|
||||
return kept_all[: cfg.n_keep], scored_all, kappa, n_gen # cap training set at n_keep (top-up may overshoot)
|
||||
|
||||
|
||||
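The control flow of `gen_filter_walk` can be exercised without a model by swapping in a fake generator. This is a toy simulation, not the repository's code: `walk_dose` mirrors the loop above with the model calls replaced by a callback, and `toy_batch` invents a survival rate of `1 - kappa` purely for illustration.

```python
def walk_dose(gen_batch, min_train=6, pass_target=0.25, decay=0.7,
              kappa_min=0.2, max_batches=6):
    kappa, banked, n_gen = 1.0, [], 0
    for _ in range(max_batches):
        batch = gen_batch(kappa)                    # list of (completion, keep) pairs
        passing = [c for c, keep in batch if keep]
        banked.extend(passing)                      # bank every survivor
        n_gen += len(batch)
        rate = len(passing) / len(batch)
        if len(banked) >= min_train:
            break                                   # enough data, stop topping up
        if rate < pass_target and kappa > kappa_min:
            kappa *= decay                          # over-driven -> cool the dose
    return banked, kappa, n_gen

def toy_batch(kappa, n=10):
    # Invented behaviour: fraction (1 - kappa) of each batch passes the filter.
    n_pass = round(n * (1.0 - kappa))
    return [(f"k{kappa:.2f}-{i}", i < n_pass) for i in range(n)]

banked, kappa, n_gen = walk_dose(toy_batch)
print(f"banked={len(banked)} kappa={kappa:.2f} n_gen={n_gen}")
```

In this toy run the first batch at nominal dose yields nothing, the controller cools once, and two batches at the cooler dose bank enough survivors. The dose stops cooling as soon as a batch clears the target, so later batches only top up the count.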
def steer_heal(model, tok, cfg: RunConfig, run_dir: Path) -> dict:
|
||||
hist_specs = [] # AdapterSpec per folded round (gated bake history)
|
||||
v0_flat = None # round-0 direction, for the Q3 cosine
|
||||
@@ -109,23 +149,21 @@ def steer_heal(model, tok, cfg: RunConfig, run_dir: Path) -> dict:
|
||||
stages = [{"round": "-", "stage": "base", "m": base_m}] # base -> steered -> healed, for table + trajectory plot
|
||||
for rnd in range(cfg.n_rounds):
|
||||
logger.info(f"\n\n=== ROUND {rnd} [{cfg.model.split('/')[-1]} reg={cfg.reg}] gpu {gpu_mem()} ===")
|
||||
# extract teacher vector + sweep-generate steered data from the CURRENT student
|
||||
# extract teacher vector from the CURRENT student, then walk-C generate+filter:
|
||||
# the controller cools the dose so the steered data stays coherent as the adapter
|
||||
# accumulates trait over rounds (gen baked, filter under original -- see gen_filter_walk).
|
||||
with baked(model, hist_specs):
|
||||
v = teacher_vec(model, tok, cfg)
|
||||
comps = generate_steered(model, tok, v, cfg)
|
||||
# STEERED-stage eval: the model state the training data came from (history baked,
|
||||
# vector live at the operating dose = lowest/cleanest alpha, NO new adapter). This
|
||||
# is the raw-steering pareto reference the heal must BEAT (same base, trait via
|
||||
# vector vs trait via the distilled adapter).
|
||||
c_op = cfg.alphas[0] * v.cfg.coeff
|
||||
logger.info(f"\n=== EVAL steered [c={cfg.alphas[0]}] gpu {gpu_mem()} ===")
|
||||
with v(model, C=c_op):
|
||||
kept, scored, kappa, n_comps = gen_filter_walk(model, tok, v, cfg, hist_specs)
|
||||
# STEERED-stage eval at the dose the data ACTUALLY came from (kappa-scaled cleanest alpha),
|
||||
# history baked, NO new adapter: the raw-steering pareto reference the heal must BEAT.
|
||||
c_lo = kappa * cfg.alphas[0]
|
||||
logger.info(f"\n=== EVAL steered [c={c_lo:.2f} kappa={kappa:.2f}] gpu {gpu_mem()} ===")
|
||||
with baked(model, hist_specs):
|
||||
with v(model, C=c_lo * v.cfg.coeff):
|
||||
m_steer = evaluate_model(model, tok, cfg)
|
||||
log_event(run_dir, stage="steered_eval", round=rnd, c=cfg.alphas[0], **m_steer) # persist for offline plot
|
||||
# filter under the ORIGINAL (no history, no steering) -- this picks the usable C
|
||||
logger.info(f"\n=== FILTER [{len(comps)} completions] gpu {gpu_mem()} ===")
|
||||
kept, scored = filter_completions(model, tok, comps, cfg)
|
||||
log_event(run_dir, stage="gen", round=rnd, n_comps=len(comps), n_kept=len(kept), scored=scored)
|
||||
log_event(run_dir, stage="steered_eval", round=rnd, c=c_lo, **m_steer) # persist for offline plot
|
||||
log_event(run_dir, stage="gen", round=rnd, n_comps=n_comps, n_kept=len(kept), kappa=kappa, scored=scored)
|
||||
|
||||
# heal one round on top of the baked history, then fold
|
||||
logger.info(f"\n=== HEAL [{cfg.reg}] gpu {gpu_mem()} ===")
|
||||
@@ -153,8 +191,8 @@ def steer_heal(model, tok, cfg: RunConfig, run_dir: Path) -> dict:
|
||||
v0_flat = vf if v0_flat is None else v0_flat
|
||||
cos_v0 = float(cosine_similarity(vf, v0_flat, dim=0))
|
||||
rec = {"round": rnd, **m, "cos_v0": cos_v0, "steered_ppl": steered_ppl,
|
||||
"adapter_ppl": adapter_ppl, "n_comps": len(comps), "n_kept": len(kept),
|
||||
"heal_nll": heal_nll}
|
||||
"adapter_ppl": adapter_ppl, "n_comps": n_comps, "n_kept": len(kept),
|
||||
"kappa": kappa, "heal_nll": heal_nll}
|
||||
rounds.append(rec)
|
||||
stages.append({"round": rnd, "stage": "steered", "m": m_steer})
|
||||
stages.append({"round": rnd, "stage": "healed", "m": m})
|
||||
@@ -175,14 +213,15 @@ def _log_loop_summary(rounds: list[dict], base_m: dict) -> None:
|
||||
# One row per round, columns walk the pipeline stages left->right:
|
||||
# GEN -> FILTER -> HEAL -> EVAL. (rec_key, display header) is the single source.
|
||||
cols = [("round", "round"),
|
||||
("n_comps", "gen"), ("n_kept", "filt_kept"), # GEN -> FILTER
|
||||
("n_comps", "gen"), ("n_kept", "filt_kept"), ("kappa", "kappa↓"), # GEN -> FILTER (kappa = walk-C dose)
|
||||
("heal_nll", "heal_nll↓"), ("adapter_ppl", "adapter_ppl↓"), # HEAL
|
||||
("auth_nats", "auth_nats↓"), ("care_nats", "care_nats"), # EVAL: target / off-target
|
||||
("coherence", "coherence→"), ("cos_v0", "cos_v0→")]
|
||||
logger.info(
|
||||
"\nloop columns (pipeline stages L->R: GEN | FILTER | HEAL | EVAL):\n"
|
||||
" gen = steered completions generated (n_prompts x alphas)\n"
|
||||
" gen = steered completions generated (n_prompts x alphas, summed over walk-C batches)\n"
|
||||
" filt_kept = completions surviving the coherence/rep/persona filter (-> training set)\n"
|
||||
" kappa↓ = walk-C dose multiplier the controller settled on (1.0 = nominal; <1 = backed off to dodge over-steer)\n"
|
||||
" heal_nll↓ = converged SFT loss of the heal (last-5 mean)\n"
|
||||
" adapter_ppl↓ = ppl-under-original of the no-steering adapter gens (low = coherent/healed)\n"
|
||||
" auth_nats↓ = log(profile p[Authority]), NATS (TARGET: down = less deference)\n"
|
||||
@@ -248,7 +287,8 @@ def main(cfg: RunConfig) -> None:
|
||||
cfg = resolve(cfg)
|
||||
torch.manual_seed(cfg.seed)
|
||||
ts = datetime.now().strftime("%Y%m%dT%H%M%S")
|
||||
slug = f"{cfg.model.split('/')[-1]}_{cfg.reg}_s{cfg.seed}"
|
||||
wd_tag = f"_wd{cfg.weight_decay:g}" if cfg.weight_decay else ""
|
||||
slug = f"{cfg.model.split('/')[-1]}_{cfg.reg}{wd_tag}_s{cfg.seed}"
|
||||
run_dir = make_run_dir(ts, slug, cfg)
|
||||
logger.info(f"argv cfg: {cfg}")
|
||||
model, tok = load_model(cfg.model, getattr(torch, cfg.dtype))
|
||||
|
||||
@@ -58,32 +58,41 @@ def teacher_vec(model, tok, cfg: RunConfig):
|
||||
@torch.no_grad()
|
||||
def _gen_one(model, tok, text, cfg):
|
||||
ids = tok(text, return_tensors="pt").to(model.device)
|
||||
# gemma-3-it recommended sampling (its generation_config.json): top_k=64, top_p=0.95,
|
||||
# temperature default 1.0. NOT Qwen's top_k=20/presence_penalty -- different model family.
|
||||
# NO repetition_penalty / no_repeat_ngram here ON PURPOSE: a gen-time anti-repetition control
|
||||
# MASKS the over-steering pathology (papers over the loops) so the filter passes junk and
|
||||
# walk-C goes blind to "dose too high". Repetition is detected POST-HOC by the rep_tau filter,
|
||||
# never suppressed at generation. (We tried penalty=1.3: it just inflated ppl and starved the
|
||||
# filter, #96.) Repetition must remain VISIBLE so the filter/controller can act on it.
|
||||
gen = model.generate(**ids, max_new_tokens=cfg.gen_max_new_tokens, do_sample=True,
|
||||
temperature=1.0, top_p=0.95,
|
||||
repetition_penalty=cfg.repetition_penalty,
|
||||
no_repeat_ngram_size=cfg.no_repeat_ngram_size,
|
||||
temperature=1.0, top_p=0.95, top_k=64,
|
||||
pad_token_id=tok.pad_token_id)
|
||||
return tok.decode(gen[0, ids.input_ids.shape[1]:], skip_special_tokens=True)
|
||||
|
||||
|
||||
def generate_steered(model, tok, v, cfg: RunConfig) -> list[dict]:
|
||||
def generate_steered(model, tok, v, cfg: RunConfig, alpha_scale: float = 1.0) -> list[dict]:
|
||||
"""Sweep cfg.alphas (raw-vector multiples); generate one completion per prompt x alpha.
|
||||
|
||||
The filter (Q0), not iso-KL, picks the usable C: low alpha is coherent, high
|
||||
alpha collapses, and we keep the coherent-but-trait-laden ones.
|
||||
alpha collapses, and we keep the coherent-but-trait-laden ones. `alpha_scale`
|
||||
(kappa) is the walk-C dose multiplier: the controller cools it over a round to
|
||||
keep the steered model coherent as the baked adapter accumulates trait.
|
||||
"""
|
||||
out = []
|
||||
n_total = cfg.n_prompts * len(cfg.alphas)
|
||||
logger.info(f"\n=== GEN steered [{n_total} = {cfg.n_prompts} prompts x {len(cfg.alphas)} alphas] "
|
||||
f"gpu {gpu_mem()} ===")
|
||||
logger.info(f"\n=== GEN steered [{n_total} = {cfg.n_prompts} prompts x {len(cfg.alphas)} alphas, "
|
||||
f"kappa={alpha_scale:.2f}] gpu {gpu_mem()} ===")
|
||||
pbar = tqdm(total=n_total, desc="gen steered", mininterval=120, maxinterval=120)
|
||||
for i in range(cfg.n_prompts):
|
||||
user = POOL[i % len(POOL)]
|
||||
text = chat_prompt(tok, cfg.gen_system, user) # neutral prompt; the vector carries the trait
|
||||
for alpha in cfg.alphas:
|
||||
with v(model, C=alpha * v.cfg.coeff):
|
||||
with v(model, C=alpha * alpha_scale * v.cfg.coeff):
|
||||
comp = _gen_one(model, tok, text, cfg)
|
||||
out.append({"user": user, "prompt": text, "completion": comp, "alpha": float(alpha)})
|
||||
# record the EFFECTIVE alpha (kappa-scaled) so the filter's per-alpha report and the
|
||||
# offline plots reflect the dose the completion actually came from.
|
||||
out.append({"user": user, "prompt": text, "completion": comp, "alpha": float(alpha * alpha_scale)})
|
||||
pbar.update(1)
|
||||
pbar.close()
|
||||
return out
|
||||
|
||||