StaticCache passed as past_key_values triggers create_masks_for_generate()
which requires linear_attention mask not in transformers 5.6.x. Plain
generate() uses DynamicCache and avoids this code path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- data/load_pairs: path now includes model slug (out/data/{model}/{behavior})
so data from different models can't be silently reused
- data.py, kl_calibrate.py, tinymfv_airisk.py: add use_4bit=True with
BitsAndBytesConfig for inference stages; training stays bfloat16
- run_sweep/kl_calibrate/eval_tinymfv_calibrated: revert adapter defaults
to full list; pass --adapters delora via CLI for this first run
- add bitsandbytes dep
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Qwen3.5-4B + FA2 trips linear_attention masking in transformers.
sdpa (default) works fine; sl confirmed same approach in their sweep.
Model reverted to Qwen3.5-4B to match sl baselines.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Qwen3.5-4B requires linear_attention mask support not in transformers<5.6.
Qwen3-4B uses standard full_attention and works with current transformers.
flash-attn added as URL dep so uv sync keeps it in .venv.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Split bare table (absolute logit per foundation) from Δ table
- Add C (calibrated coeff) and kl (achieved p95) columns to Δ table; read
from out/<behavior>/kl_calibration/summary.csv
- Cells now show mean±std, sourced from dlogit_std (ws) and the per-foundation
std field of steering-lite JSONs
- Headers: "Care ↓" and "Sanc ↑" mark target direction
- Sort Δ rows by |axis| descending
- Preserve signs in tabulate output via disable_numparse=True
- ws.scripts.readme_tinymfv_table: cue / axis_shift / per-foundation Δlogit
table that combines ws adapter rows (loaded from
out/trad_care/<adapter>/*__foundations_dlogit.csv) with steering-lite's
frozen baselines (loaded from
lite/steering-lite/outputs/tinymfv_sweep/*.json). Same axis, same metric,
same iso-KL footprint -> directly comparable.
- ws.scripts.eval_tinymfv_calibrated: thin launcher that reads
out/<behavior>/kl_calibration/summary.csv and runs ws.eval.tinymfv_airisk
once per adapter with --coeffs -alpha_neg 0.0 +alpha_pos. Necessary
because the pos/neg alphas are asymmetric per adapter.
OOD eval was framed as "steer for honesty, eval on airisk wrongness" but
tiny-mfv is multi-foundational (Care/Sanctity/Authority/...). Honesty isn't
a clean axis it measures, and a 0.6B model has weak honesty representations
to steer; the result was inconsistent shifts we over-interpreted.
Pivot mirrors steering-lite: train on Care-vs-Traditional/Sanctity persona
pair, eval with paired-by-(vid,cond) Δlogit per foundation, composite
axis_shift = ΔlogitSanctity − ΔlogitCare (nats). Directly comparable across
both repos.
- ws.data: TRAD_CARE_PROMPT/POS/NEG (6 paraphrase pairs, ported verbatim from
steering-lite/branching.py); _personas/_topics/_build_specs branches.
- ws.repe: fit_repe_directions branch for trad_care (same recipe as honesty).
- ws.prompt_texts: simple_*_prompt + engineered_*_prompt entries for the
prompt_only baseline row (alpha>0 -> traditional, alpha<0 -> caring).
- ws.eval.tinymfv_airisk: ported _logit (eps=0.01), _per_vidcond_wrongness,
_dlogit_per_foundation_table, _axis_shift; emits new <stem>__foundations_dlogit.csv
and reports axis_shift in BLUF (cue thresholds 0.5/0.15 nats). Existing
outputs preserved.
Old filter used to_do_has_positive_honesty / not_to_do_has_negative_honesty
flags, which kept 194 to_do + 33 not_to_do = 227 rows (85% to_do only).
The dataset has honesty_label=+/-1 for all 438 rows via symmetric labeling
(if to_do is honest, not_to_do gets -1 regardless of its own values).
Filtering to honesty_label != 0 keeps all 438 rows and pairs both sides of
every dilemma, matching AntiPaSTO's symmetric evaluation approach.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Was: "Do you choose to: {action}?" with "My choice: **" anchor.
AntiPaSTO ref: "Should you take this action? Action: {action}" with "My choice:" anchor.
The ** anchor told the model to produce markdown-bolded answers, scoring
at the wrong token boundary. "My choice:" anchor scores " Yes"/" No"
which get_choice_ids already catches via prefix match.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Gemma-3/4 don't have </think> as a special token, so guided_cot_one
raised RuntimeError and killed the whole sweep. Fix: add has_thinking_mode
to _tok_extras and gate phase_a2 in replicate.py on it.
README cut from ~380 to ~120 lines: results tables, how to run, cite, links.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Gemma 4 (E2B/E4B) uses channel-based thinking tokens (<|think|>, <|channel>).
chat_template_extras() detects this via template string and passes
enable_thinking=False to all apply_chat_template calls in data gen,
dilemmas eval, and KL calib (via build_chat_text). Qwen3 and Gemma 3
return {} (existing thinking-mode handling unchanged).
Refactor calibration to match the gist methodology: for each prompt, greedy-
generate n_tokens under the steered policy, capture per-step steered
log-probs, then teacher-force the same continuation under base. Per-position
KL(steered ‖ base) is computed along the steered trajectory rather than at
fixed continuation positions. Captures cumulative drift the old fixed-
continuation KL missed.
Replaces 1-step Newton on alpha with exponential bracket -> Illinois regula-
falsi in log-(alpha, p95) space. Linear in log-log (since p95 ~ alpha^k near
root) so usually converges in 3-4 iters, with Illinois rule breaking the
stuck-endpoint failure mode of pure regula falsi.
Adds SI_to_do, SI_not_to_do (and si_fwd/si_rev/n_cho/n_rej variants)
to compute_full_metrics output. Lets us see if a steering method's
effect is symmetric across "Should you do X?" vs "Should you NOT do X?"
framings, or if it leans on one. Already-known pattern: paired-opposite
filler rows produced large SI_to_do > SI_not_to_do gaps before the
honesty-only filter landed.
The wassname/daily_dilemmas-self-honesty dataset uses paired-opposite
labels: if to_do has honesty in you_values -> to_do=+1, paired not_to_do=-1
even when not_to_do's own values are e.g. ['empathy'], unrelated to
honesty. Those 211/438 filler rows were dragging dW SI negative.
Changes:
- dilemmas.py: filter to action-specific honesty rows at load (227/438)
- data.py: narrow training pair to ('an honest',) / ('a dishonest',),
matching RepE extraction, simple_*_prompt baselines, and AntiPaSTO
honesty synonyms. Cluster paraphrases (candid/sincere/manipulative/...)
drift onto a broader axis -- comparison was unfair.
- prompt_baseline.py: regenerate engineered prompts via AxBench J.2
template with narrow-honesty concept slot. Old prompts brought in
integrity/transparency (POS) and Machiavellian/manipulation (NEG).
- nbs/rescore_honesty_only.py: ad-hoc rescore script confirming the
fix flips dW from -16 SI to +60 SI without rerunning the model.
α=1 means very different things across LoRA/PiSSA/DeLoRA/OFT/IA3/RepE/prompt;
calibrate α per method so p95 token-KL on held-out continuations matches
prompt:engineered_prompt_honest's footprint (≈0.61 nats over 50 stratified
prompts, 100 audit). Newton iter α_next=α·sqrt(T/M) converges 7/7 methods
in 2-3 iters. At calibrated ±α on daily-dilemmas (n=219), all 6 adapters
land deeply negative SI: fix counts cluster at 14-19 across all methods,
but adapters break 65-139 already-honest rows (vs 15-20 for engineered
prompts). Interpretation: prompts perturb topic-conditionally, adapters
uniformly — at matched off-task budget, adapters scatter mass over
already-correct rows. RepE sits between.
Caveats: single seed, calibration off-task, anchor audit p95 is 1.78×
calib (calibrated conservatively).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Address pi-review issues:
- SI_best: max(si_fwd, si_rev) does not equal "best honesty under post-hoc
sign flip" because under k_fpr=2 the FPR penalty hits the swapped rate,
so -si_rev != counter_rate - 2*flip_rate. Fix by computing
si_honest_at_neg1_k2 = counter_rate - 2*flip_rate (role-swapped fix/broke
for the a=-1-as-honest branch) and taking max against si_fwd.
- Prompt pairing: add (idx, dilemma_idx, action_type) symmetric-difference
check between base, honest_prompt, and dishonest_prompt before computing
paired SI. Previously only .sort("idx") was done, so dropped/duplicated
rows would silently produce cross-example comparisons.
- dw_decomp narrative: mag_only preserves only one scalar per tensor (its
Frobenius norm), then replaces all within-tensor structure with a single
Gaussian draw. Tighten docstring + README to claim "per-tensor norm
allocation" rather than "magnitude pattern", and flag mag_only/random_norm
as single-seed Monte Carlo controls.
Re-run honesty_tables.py: SI_best now flips prompt:simple from -13.89 to
+3.46 because the role-swapped a=-1 branch is its better direction. Update
README OOD SI table accordingly. Refresh RepE rows in raw-logratio table
with post-padding-fix numbers (mean_pmass ~0.96, no longer ~0.17); drop
stale pmass caveat block.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Working notes belong with the rest of the docs. Updated relative links
in docs/hypothesis_ablation_catalog.md from ../fork_plan.md to fork_plan.md
since both files now live in docs/.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
HANDOVER.md and RESEARCH_LOG.md were stubs from before the honesty-axis
switch and the work they referenced is already done. fork_plan.md still
said "sycophancy training" at line 24 even though the rest of the doc
already documents the honesty axis.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Result: random_direction * original_per_tensor_norm (mag_only) gives a
larger positive logratio shift (+1.07 at a=+1) than the full trained
dW (+0.24), with 5x fewer broken rows. Stripping the magnitude pattern
(dir_only) collapses the effect to +0.02. So which-layers-get-updated
(magnitude allocation) explains most of the steering at +alpha; the
learned elementwise direction adds little.
If this survives multiseed and Gemma replication, it implies weight
steering for honesty needs only a learnable per-tensor scalar -- a
much smaller hypothesis class than full low-rank PEFT.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Constructs four variants of a trained dW and evaluates each on daily
dilemmas at coeffs {-1, 0, +1}:
full original (control)
dir_only elementwise direction preserved, all tensors rescaled
to a common Frobenius norm (flattens per-tensor magnitude)
mag_only random direction per tensor, original per-tensor norm
(preserves which layers/modules carry the load)
random_norm random direction + common norm (control)
Tests whether the trained behavior is carried by element direction or
by the per-tensor magnitude pattern. Default adapter is delora since
it has the largest raw dd_delta and the worst SI -- which factor is
load-bearing?
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Pair prompt baselines as alpha=-1/0/+1 (dishonest/base/honest) under
simple and engineered families, giving full bidirectional SI for
prompts (same as dW)
- Add SI_best = max(si_fwd, si_rev) * pmass^2 * 100 -- sign-aligned
upper bound (snooping-aware robustness probe)
- Add SI_k1 (symmetric, breaks weighted 1x) alongside default SI_k2
to expose how much the class-imbalance-driven 2x penalty contributes
- Expose fix_rate / broke_rate columns so the SI components are visible
- Add IID syc table (held-out persona claims) using
cross_adapter_ablation/sycophancy_per_row.csv with variant=full_all_tensors
- Add raw mean +- std logratio table per (method, coeff) for OOD
The IID/OOD split shows: dW interventions land hard on IID (PiSSA biggest,
+5.7 mean shift) but most break OOD via the broke_rate channel. OFT and
engineered prompts are the only methods with non-negative SI_best.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Combined methods comparison table in README using SI as primary metric
- nbs/honesty_tables.py produces SI / raw-logratio / flip-count tables
from existing per-row CSVs (cross_adapter_full_dd, prompt_baseline,
activation_baseline)
- prompt_baseline.py: si_fwd computed inline for prompt methods
- activation_baseline.py: tok.padding_side restore moved after the
inference loop so logit extraction sees the correct side
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- dilemmas.py: compute_surgical_informedness + compute_full_metrics (ref-anchored
bidirectional SI, k_fpr=2; forward-only fallback when coeff=-1 absent)
- prompt_baseline.py: simple_honest/dishonest prompts now use same
HONESTY_PROMPT.format(persona=...) template as training persona prefix
(was "You are an honest assistant..."); also adds simple_dishonest_prompt;
_summarize computes SI per method via _si_per_method
- full_dd_benchmark.py: _summarize computes SI per adapter; output sorted
by SI; final_summary reports SI as main_metric
Re-queue: pueue 237 (T3 prompt_baseline), 238 (T2 full_dd_benchmark)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>