mirror of
https://github.com/wassname/steer-heal-love.git
synced 2026-06-27 16:47:16 +08:00
2nd external-review panel: close catastrophic-green cue, fix BPE assert
5-model panel (deepseek-v4-pro, grok-4.3, gemini-3.5-flash, qwen3.6:35b). Two confirmed bugs fixed; design risks recorded in spec.md.

run.py cue: coh_cost is a pure ratio, so a model collapsing to ~0 mass on Authority sent dAuth->-inf, coh_cost->0, scoring a broken model green (gemini). Now check an absolute coherence floor (coh<0.85 -> red) and finiteness FIRST, require coh>=0.95 for green, and broaden surgicality to |dAuth| > max(|dCare|,|dFair|) (a Fairness-ward dump was passing Care-only).

heal.py: BPE-boundary prefix assert escaped at the max_len/truncation boundary (grok/gemini/qwen unanimous). Assert the surviving overlap min(n_prompt,L) unconditionally; warn instead of silently skipping a kept completion truncated to zero target tokens.

Verified false positives (recorded so they aren't re-chased): qwen's shape[0] "batch-dim" claim (.input_ids[0] already drops batch), the profile['model'] column (it is the marginal mean-p), the KL reference (c=0.0 + no baked = pristine round-0).

UAT: fast-dev-run exit 0; cue shows coh=0.00 -> red (floor closes the hole).

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
@@ -229,6 +229,51 @@ because the coherence filter removed the trait-laden completions before training
 fixed weight delta -- if we use gating for extraction we still need a bakeable distillate. Check
 which steering-lite methods are weight-foldable before adopting.
 
+## External review panel (2026-06-04)
+
+Five non-Anthropic reviewers (deepseek-v4-pro, grok-4.3, gemini-3.5-flash, local qwen3.6:35b;
+mistral returned empty) over spec + src. Two CONFIRMED code bugs were fixed this round; the rest
+are design risks recorded here.
+
+Fixed (code):
+- Catastrophic-green cue (gemini, sharpest; echoed by deepseek/qwen). `coh_cost = |dCoh|/|dAuth|`
+is a pure ratio: a model that collapses to ~0 mass on Authority sends dAuth -> -inf so coh_cost
+-> 0, scoring a broken model green. Fix (run.py): check an ABSOLUTE coherence floor (coh < 0.85
+-> red) and finiteness FIRST, require coh >= 0.95 for green, and broaden surgicality from
+|dAuth|>|dCare| to |dAuth| > max(|dCare|,|dFair|) (a shift dumping mass onto Fairness was passing
+the Care-only test).
+- BPE-boundary assert escaped at the max_len/truncation boundary (grok, gemini, qwen, unanimous).
+Fix (heal.py): assert the surviving prefix overlap min(n_prompt, L) unconditionally; warn (not
+silently skip) when a kept completion truncates to zero target tokens.
+
+Design risks (NOT fixed, inform the loop + Plan work):
+- Loop barrier undoes its own history (gemini "history erasure", grok, deepseek). KL anchored to
+the round-0 original while history is baked into the student means by round>=1 the cumulative
+drift already exceeds tau, so the relu barrier is permanently active and its gradient pushes the
+fresh adapter to OPPOSE the trait the frozen history installed. Plausibly a dominant cause of the
+loop undo. -> for U3 consider anchoring the barrier to the PREVIOUS student, or normalising tau by
+historical drift (supports the "less barrier" direction, task 17).
+- Barrier mean-dilution (deepseek). div = mean over completion tokens of KL; a few catastrophically
+incoherent tokens are diluted by many in-distribution ones, so the mean stays < tau and kl_rev
+silently == nll. A max or high-quantile KL would penalise localised incoherence. METHOD change
+(alters the objective) -> deliberate decision, do not silently switch.
+- ppl-under-base is a STYLE proxy, not coherence (deepseek, gemini, grok, qwen, independently
+re-deriving the known journal confound). Fluent-but-stylistically-novel on-trait completions score
+high ppl and get dropped -> survivorship toward base-like training data.
+- Construct validity (gemini, qwen, deepseek). tinymfv is 3rd-person forced-choice classification;
+steering installs a 1st-person persona, so the link is an indirect propensity proxy. No
+neutral-instruction control rules out format/instruction-following artefacts.
+- teacher_vec drift (gemini, deepseek): v re-extracted from the baked student can decay as the trait
+internalises (contrastive delta shrinks); cos_v0 already watches this.
+- NARRATE regex brittle (deepseek): paraphrased verbalisation ("I never obey without question")
+evades it and leaks narration into training.
+
+Verified FALSE positives (do not re-chase): qwen's "n_prompt = prompt_ids.shape[0] reads the batch
+dim" -- the line uses `.input_ids[0]`, so prompt_ids is 1-D and shape[0] IS the seq len. grok/qwen's
+"profile['model'] may be model_T/top1" -- tinymfv eval.py:316 confirms it is the mean over vignettes
+of per-row p (the marginal). grok's "KL reference can't be the round-0 original" -- c=0.0 + no baked()
+is the pristine base by construction.
+
 ## UAT summary (proof, not assertion)
 
 - U1 filter gate: `results/u1_filter_gate.md` — labelled set, scorer separation. Link when done.
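
The barrier mean-dilution risk recorded in the hunk above can be checked with plain arithmetic. The sketch below assumes the barrier is a hinge of the form `relu(div - tau)`, as the spec text suggests. The per-token KL values, the value of `tau`, and the helper name `barrier` are made up for illustration and do not come from the repo.

```python
# Hypothetical per-token KL values: 99 in-distribution tokens plus one catastrophic token.
per_token_kl = [0.01] * 99 + [5.0]
tau = 0.1  # assumed barrier threshold, illustration only


def barrier(div: float, tau: float) -> float:
    """Hinge penalty: zero until the divergence statistic exceeds tau."""
    return max(0.0, div - tau)


# Mean over completion tokens: the single bad token is diluted by the other 99.
mean_kl = sum(per_token_kl) / len(per_token_kl)

# Max over completion tokens: the single bad token dominates.
max_kl = max(per_token_kl)

print(f"mean KL = {mean_kl:.4f} -> barrier {barrier(mean_kl, tau):.4f}")
print(f"max  KL = {max_kl:.4f} -> barrier {barrier(max_kl, tau):.4f}")
```

With these numbers the mean stays below `tau`, so the hinge contributes nothing, while the max-based statistic activates it. This only illustrates the dilution effect; whether to switch statistics remains the deliberate method decision the spec calls for.
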
+14
-9
@@ -26,15 +26,17 @@ def _encode(tok, prompt: str, completion: str, max_len: int, device):
     ids = tok(prompt + completion, return_tensors="pt", truncation=True, max_length=max_len).to(device)
     prompt_ids = tok(prompt, return_tensors="pt").input_ids[0].to(device)
     n_prompt = prompt_ids.shape[0]
-    # Assert the prompt tokenizes as a clean PREFIX of prompt+completion. If a BPE merge
-    # spans the boundary, n_prompt is wrong and the SFT mask silently shifts by a token
-    # (review M6). Truncation can drop the tail, so only check when not truncated.
-    if ids.input_ids.shape[1] >= n_prompt and ids.input_ids.shape[1] < max_len:
-        assert torch.equal(ids.input_ids[0, :n_prompt], prompt_ids), (
-            "prompt is not a token-prefix of prompt+completion (BPE boundary merge); "
-            "the SFT loss mask would be misaligned by a token."
-        )
     L = ids.input_ids.shape[1]
+    # Assert the prompt tokenizes as a clean PREFIX of prompt+completion. If a BPE merge spans
+    # the boundary, n_prompt is wrong and the SFT mask silently shifts by a token. Truncation
+    # keeps the FRONT (whole prompt + partial completion), so check the overlap that survives --
+    # min(n_prompt, L). This always runs, including the max_len boundary the earlier guard skipped
+    # (external review: a merge at exactly max_len escaped the < max_len check).
+    n_check = min(n_prompt, L)
+    assert torch.equal(ids.input_ids[0, :n_check], prompt_ids[:n_check]), (
+        "prompt is not a token-prefix of prompt+completion (BPE boundary merge); "
+        "the SFT loss mask would be misaligned by a token."
+    )
     tgt_is_completion = torch.arange(1, L, device=device) >= n_prompt  # mask over next-token targets
     return ids, tgt_is_completion
 
@@ -64,8 +66,11 @@ def heal_round(model, tok, kept: list[dict], hist_specs: list[AdapterSpec], cfg:
     for c in kept:
         ids, mask = _encode(tok, c["prompt"], c["completion"], cfg.max_len, model.device)
         if mask.sum() == 0:
+            # prompt filled max_len so the completion was truncated to zero target tokens.
+            # Loud, not silent: this is a kept completion lost from training (review).
+            logger.warning(f"heal: 0 target tokens (prompt >= max_len={cfg.max_len}), skipping a kept completion")
             pbar.update(1); step += 1
-            continue  # completion truncated away; nothing to learn here
+            continue
 
         # original reference logits (no history, adapter off) for the barrier
         if cfg.reg in ("kl_fwd", "kl_rev"):
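
To see why the prefix check in `_encode` has to run at the `max_len` boundary too, here is a minimal sketch with a toy tokenizer. Everything in it is hypothetical: `toy_tokenize` is a greedy two-character merger standing in for a real BPE tokenizer, and `prefix_overlap_ok` mirrors only the `min(n_prompt, L)` comparison, not the repo's tensors.

```python
def toy_tokenize(text: str, merges: set[str]) -> list[str]:
    """Greedy left-to-right toy tokenizer: emit a 2-char merge if known, else 1 char."""
    tokens, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in merges:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(text[i])
            i += 1
    return tokens


def prefix_overlap_ok(prompt_ids: list[str], full_ids: list[str], max_len: int) -> bool:
    """Compare prompt tokens with the overlap that survives front-keeping truncation."""
    full_ids = full_ids[:max_len]  # truncation keeps the front of the sequence
    n_check = min(len(prompt_ids), len(full_ids))
    return full_ids[:n_check] == prompt_ids[:n_check]


MERGES = {"ab"}  # toy merge table: "a" followed by "b" becomes one token

# Clean boundary: "xa" + "cd" has no merge spanning the join.
clean = prefix_overlap_ok(toy_tokenize("xa", MERGES), toy_tokenize("xa" + "cd", MERGES), max_len=8)

# Merge across the boundary: "xa" + "b" tokenizes as ["x", "ab"], so ["x", "a"] is not a prefix.
merged = prefix_overlap_ok(toy_tokenize("xa", MERGES), toy_tokenize("xa" + "b", MERGES), max_len=8)

# Same merge with the sequence length exactly equal to max_len, the case a `< max_len` guard skips.
at_limit = prefix_overlap_ok(toy_tokenize("xa", MERGES), toy_tokenize("xa" + "b", MERGES), max_len=2)

print(clean, merged, at_limit)
```

In the third case the full sequence has exactly `max_len` tokens, so a guard that only checks when the length is strictly below `max_len` would never look at it, while the unconditional overlap comparison still flags the mismatch.
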
+27
-13
@@ -4,6 +4,7 @@ Anchored to the round-0 original throughout (KL reference = adapters/gates off).
 `--fast-dev-run` runs the whole thing on the tiny-random model. See spec.md.
 """
 
+import math
 import os
 from datetime import datetime
 from pathlib import Path
@@ -165,26 +166,39 @@ def _log_loop_summary(rounds: list[dict], base_m: dict) -> None:
     last = rounds[-1]
     dAuth = last["auth_nats"] - base_m["auth_nats"]
     dCare = last["care_nats"] - base_m["care_nats"]
+    dFair = last["fairness_nats"] - base_m["fairness_nats"]
     dCoh = last["coherence"] - base_m["coherence"]
+    coh = last["coherence"]
     coh_cost = abs(dCoh) / abs(dAuth) if abs(dAuth) > 1e-6 else float("nan")
-    surgical = abs(dAuth) > abs(dCare)  # Authority must move MORE than the off-target Care
-    # TODO(threshold): coh_cost cut not yet calibrated. Provisional: a healed adapter
-    # SHOULD land trait (dAuth <= -0.3 nats), SURGICALLY (|dAuth|>|dCare|, else it is
-    # broad permissivizing not the trait -- review M4), at coh_cost <= 0.05 (steered c=0.5 ~0.003).
-    if dAuth > -0.3:
+    # Surgical = Authority moved MORE than EVERY off-target. Off-target = the individualizing
+    # foundations Care+Fairness; SocialNorms is binding and co-moves with Authority by design,
+    # so it is NOT a guard. (External review: an Auth-vs-Care-only test greenlights a shift
+    # that just dumps mass onto Fairness -- broad anti-binding drift, not the trait.)
+    d_offtarget = max(abs(dCare), abs(dFair))
+    surgical = abs(dAuth) > d_offtarget
+    # Cue. ORDER IS LOAD-BEARING: the ABSOLUTE coherence floor is checked FIRST. coh_cost is a
+    # RATIO, so a model that collapses to ~0 mass on Authority sends dAuth -> -inf and
+    # coh_cost -> 0, which would score a broken model 🟢 (external review: "catastrophic green").
+    # An absolute floor + a non-finite guard close that hole: no trait claim from a model that
+    # cannot answer. TODO(threshold): the -0.3 nat / 0.05 coh_cost cuts are still uncalibrated
+    # (steered c=0.5 ref ~0.003); auth_nats is log-of-mean (Jensen gap vs steering-lite Δlogit).
+    if not (math.isfinite(dAuth) and math.isfinite(coh)) or coh < 0.85:
+        cue = "🔴"  # collapsed/broken (coherence floor) -- ratio is meaningless here
+    elif dAuth > -0.3:
         cue = "🔴"  # no trait retained (undo)
     elif not surgical:
-        cue = "🔴"  # moved, but Care moved as much -> broad permissivizing, not the trait
-    elif coh_cost <= 0.05:
-        cue = "🟢"  # surgical trait retained cheaply
+        cue = "🔴"  # moved, but an off-target moved as much -> broad permissivizing, not the trait
+    elif coh_cost <= 0.05 and coh >= 0.95:
+        cue = "🟢"  # surgical trait, cheap, AND coherent in absolute terms
     else:
-        cue = "🟡"  # surgical trait but coherence-expensive
+        cue = "🟡"  # surgical trait but coherence-expensive or only mildly coherent
     logger.info(
         f"main metric: {cue} coh_cost={coh_cost:.3f} (|dCoh|/|dAuth| vs base, lower=better) | "
-        f"dAuth={dAuth:+.2f} dCare={dCare:+.2f} (surgical={surgical}) coherence={last['coherence']:.2f} "
-        f"(base {base_m['coherence']:.2f})\n"
-        " cue: 🔴 dAuth>-0.3 (no trait) OR |dAuth|<=|dCare| (broad, not surgical) | 🟢 surgical trait "
-        "at coh_cost<=0.05 | 🟡 surgical but expensive. TODO calibrate coh_cost (steered c=0.5 ref ~0.003)."
+        f"dAuth={dAuth:+.2f} dCare={dCare:+.2f} dFair={dFair:+.2f} (surgical={surgical}) "
+        f"coherence={coh:.2f} (base {base_m['coherence']:.2f})\n"
+        " cue: 🔴 coh<0.85 (broken) OR dAuth>-0.3 (no trait) OR |dAuth|<=max(|dCare|,|dFair|) "
+        "(broad, not surgical) | 🟢 surgical trait at coh_cost<=0.05 AND coh>=0.95 | 🟡 else. "
+        "TODO calibrate coh_cost (steered c=0.5 ref ~0.003)."
     )
 
 
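
The ordering argument in the cue comment can be exercised without the repo. The sketch below is a standalone restatement and does not use the project's code. The function names `new_cue` and `old_cue`, the string labels standing in for the emoji, and all input values are hypothetical. The thresholds (-0.3, 0.05, 0.85, 0.95) are copied from the diff above and are marked there as uncalibrated.

```python
import math


def _coh_cost(d_auth: float, coh: float, base_coh: float) -> float:
    """Ratio |dCoh|/|dAuth|, NaN when Authority barely moved."""
    return abs(coh - base_coh) / abs(d_auth) if abs(d_auth) > 1e-6 else float("nan")


def old_cue(d_auth: float, d_care: float, coh: float, base_coh: float) -> str:
    """Pre-fix logic: ratio only, surgicality tested against Care alone."""
    if d_auth > -0.3:
        return "red"
    if not abs(d_auth) > abs(d_care):
        return "red"
    return "green" if _coh_cost(d_auth, coh, base_coh) <= 0.05 else "yellow"


def new_cue(d_auth: float, d_care: float, d_fair: float, coh: float, base_coh: float) -> str:
    """Post-fix logic: finiteness and the absolute coherence floor are checked first."""
    if not (math.isfinite(d_auth) and math.isfinite(coh)) or coh < 0.85:
        return "red"  # collapsed model: the ratio is meaningless here
    if d_auth > -0.3:
        return "red"  # no trait retained
    if not abs(d_auth) > max(abs(d_care), abs(d_fair)):
        return "red"  # an off-target foundation moved at least as much
    if _coh_cost(d_auth, coh, base_coh) <= 0.05 and coh >= 0.95:
        return "green"
    return "yellow"


# Collapsed model: Authority mass ~0 drives dAuth to -inf and coherence to 0.
collapsed_old = old_cue(float("-inf"), 0.1, coh=0.0, base_coh=1.0)
collapsed_new = new_cue(float("-inf"), 0.1, 0.05, coh=0.0, base_coh=1.0)

# Fairness-ward dump: Care barely moves, Fairness moves more than Authority.
dump_old = old_cue(-0.8, 0.1, coh=0.98, base_coh=0.99)
dump_new = new_cue(-0.8, 0.1, 0.9, coh=0.98, base_coh=0.99)

# Healthy surgical shift, and one that is cheap by ratio but only mildly coherent.
healthy = new_cue(-0.8, 0.1, 0.05, coh=0.98, base_coh=0.99)
mild = new_cue(-0.8, 0.1, 0.05, coh=0.90, base_coh=0.92)

print(collapsed_old, collapsed_new, dump_old, dump_new, healthy, mild)
```

The first two pairs reproduce the two holes the panel reported: the old logic returns green for both the collapsed model and the Fairness-ward dump, while the reordered checks return red.
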