59 Commits

Author SHA1 Message Date
wassname aa4fcff446 scripts(readme_tinymfv_table): mirror steering-lite layout
- Split bare table (absolute logit per foundation) from Δ table
- Add C (calibrated coeff) and kl (achieved p95) columns to Δ table; read
  from out/<behavior>/kl_calibration/summary.csv
- Cells now show mean±std, sourced from dlogit_std (ws) and the per-foundation
  std field of steering-lite JSONs
- Headers: "Care ↓" and "Sanc ↑" mark target direction
- Sort Δ rows by |axis| descending
- Preserve signs in tabulate output via disable_numparse=True
2026-05-02 20:53:19 +08:00
wassname aa0b07451d scripts: tinymfv comparison table + calibrated eval wrapper
- ws.scripts.readme_tinymfv_table: cue / axis_shift / per-foundation Δlogit
  table that combines ws adapter rows (loaded from
  out/trad_care/<adapter>/*__foundations_dlogit.csv) with steering-lite's
  frozen baselines (loaded from
  lite/steering-lite/outputs/tinymfv_sweep/*.json). Same axis, same metric,
  same iso-KL footprint -> directly comparable.
- ws.scripts.eval_tinymfv_calibrated: thin launcher that reads
  out/<behavior>/kl_calibration/summary.csv and runs ws.eval.tinymfv_airisk
  once per adapter with --coeffs -alpha_neg 0.0 +alpha_pos. Necessary
  because the pos/neg alphas are asymmetric per adapter.
2026-05-02 19:47:09 +08:00
wassname f866618eac feat: trad_care behavior + per-foundation Δlogit (tiny-mfv axis pivot)
OOD eval was framed as "steer for honesty, eval on airisk wrongness" but
tiny-mfv is multi-foundational (Care/Sanctity/Authority/...). Honesty isn't
a clean axis it measures, and a 0.6B model has weak honesty representations
to steer; the result was inconsistent shifts we over-interpreted.

Pivot mirrors steering-lite: train on Care-vs-Traditional/Sanctity persona
pair, eval with paired-by-(vid,cond) Δlogit per foundation, composite
axis_shift = ΔlogitSanctity − ΔlogitCare (nats). Directly comparable across
both repos.

- ws.data: TRAD_CARE_PROMPT/POS/NEG (6 paraphrase pairs, ported verbatim from
  steering-lite/branching.py); _personas/_topics/_build_specs branches.
- ws.repe: fit_repe_directions branch for trad_care (same recipe as honesty).
- ws.prompt_texts: simple_*_prompt + engineered_*_prompt entries for the
  prompt_only baseline row (alpha>0 -> traditional, alpha<0 -> caring).
- ws.eval.tinymfv_airisk: ported _logit (eps=0.01), _per_vidcond_wrongness,
  _dlogit_per_foundation_table, _axis_shift; emits new <stem>__foundations_dlogit.csv
  and reports axis_shift in BLUF (cue thresholds 0.5/0.15 nats). Existing
  outputs preserved.
2026-05-02 19:43:07 +08:00
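
The per-foundation Δlogit and the composite axis_shift described in f866618eac can be sketched roughly as below. This is a minimal illustration, not the code in ws.eval.tinymfv_airisk: the input layout (a dict keyed by (vid, cond) holding a foundation name and a wrongness probability) and all function names are assumptions; only eps=0.01 and the formula axis_shift = ΔlogitSanctity − ΔlogitCare come from the commit message.

```python
import math
from statistics import mean

EPS = 0.01  # clamp value named in the commit message


def logit(p, eps=EPS):
    """Log-odds of p in nats, with p clamped to [eps, 1 - eps] so it stays finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def dlogit_per_foundation(base, steered):
    """Mean paired change in logit(wrongness) per foundation.

    base / steered: hypothetical dicts mapping (vid, cond) -> (foundation, p_wrong).
    Only (vid, cond) keys present in both runs are paired.
    """
    deltas = {}
    for key in base.keys() & steered.keys():
        foundation, p_base = base[key]
        _, p_steer = steered[key]
        deltas.setdefault(foundation, []).append(logit(p_steer) - logit(p_base))
    return {f: mean(v) for f, v in deltas.items()}


def axis_shift(dlogit):
    """Composite axis: positive means Sanctity wrongness rose relative to Care."""
    return dlogit["Sanctity"] - dlogit["Care"]


if __name__ == "__main__":
    base = {("v1", "c1"): ("Care", 0.50), ("v2", "c1"): ("Sanctity", 0.50)}
    steered = {("v1", "c1"): ("Care", 0.40), ("v2", "c1"): ("Sanctity", 0.70)}
    print(axis_shift(dlogit_per_foundation(base, steered)))
```

Pairing by (vid, cond) before averaging means each item serves as its own control, so between-item variance does not leak into the Δ.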
wassname 0bc46dc51e cuda 2026-05-02 06:04:58 +08:00
wassname 4f2034dd46 tidy 2026-05-02 05:52:25 +08:00
wassname 71a8d4c555 tidy 2026-05-01 22:29:06 +08:00
wassname 63715bbf99 logging 2026-05-01 22:22:09 +08:00
wassname b4a8a0351d feat: add n_think parameter to evaluation functions for guided reasoning 2026-05-01 21:13:30 +08:00
wassname 27cf12c2d8 Switch AIRisk evals to tiny-mfv workflow 2026-05-01 20:47:31 +08:00
wassname a0f4e719af Add batched data gen and bidir calibration 2026-05-01 18:58:08 +08:00
wassname b2ef8fef7b wip 2026-04-30 21:06:18 +08:00
wassname 44e16b0c9a fix: keep all 438 rows in DD eval (both to_do and not_to_do per dilemma)
Old filter used to_do_has_positive_honesty / not_to_do_has_negative_honesty
flags, which kept 194 to_do + 33 not_to_do = 227 rows (85% to_do only).

The dataset has honesty_label=+/-1 for all 438 rows via symmetric labeling
(if to_do is honest, not_to_do gets -1 regardless of its own values).
Filtering to honesty_label != 0 keeps all 438 rows and pairs both sides of
every dilemma, matching AntiPaSTO's symmetric evaluation approach.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 05:58:20 +08:00
wassname 93334c5889 fix: match AntiPaSTO prompt format (INSTRUCTION_PROMPT + anchor)
Was: "Do you choose to: {action}?" with "My choice: **" anchor.
AntiPaSTO ref: "Should you take this action? Action: {action}" with "My choice:" anchor.

The ** anchor told the model to produce markdown-bolded answers, scoring
at the wrong token boundary. "My choice:" anchor scores " Yes"/" No"
which get_choice_ids already catches via prefix match.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 05:56:00 +08:00
wassname ce73e97154 fix: skip guided-CoT for non-thinking models; trim README
Gemma-3/4 don't have </think> as a special token, so guided_cot_one
raised RuntimeError and killed the whole sweep. Fix: add has_thinking_mode
to _tok_extras and gate phase_a2 in replicate.py on it.

README cut from ~380 to ~120 lines: results tables, how to run, cite, links.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 05:39:50 +08:00
wassname 5704b00175 gemma4: disable thinking mode via enable_thinking=False in apply_chat_template
Gemma 4 (E2B/E4B) uses channel-based thinking tokens (<|think|>, <|channel>).
chat_template_extras() detects this via template string and passes
enable_thinking=False to all apply_chat_template calls in data gen,
dilemmas eval, and KL calib (via build_chat_text). Qwen3 and Gemma 3
return {} (existing thinking-mode handling unchanged).
2026-04-28 21:47:33 +08:00
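
A template-string check of the kind 5704b00175 describes might look like the sketch below. The function name, the exact marker strings tested, and the None handling are assumptions; the commit only states that detection goes through the template string, that Gemma 4 gets enable_thinking=False, and that Qwen3 and Gemma 3 get {}.

```python
# Hypothetical marker strings; the commit mentions <|think|> and <|channel>.
THINKING_MARKERS = ("<|think|>", "<|channel>")


def chat_template_extras_sketch(chat_template):
    """Return extra kwargs for apply_chat_template based on the template text.

    chat_template: the tokenizer's chat template as a string (or None).
    """
    if chat_template and any(m in chat_template for m in THINKING_MARKERS):
        return {"enable_thinking": False}
    return {}


if __name__ == "__main__":
    print(chat_template_extras_sketch("... <|channel> ..."))
    print(chat_template_extras_sketch("{{ messages }}"))
    # Intended use (not run here), assuming a Hugging Face tokenizer `tok`:
    #   extras = chat_template_extras_sketch(tok.chat_template)
    #   text = tok.apply_chat_template(messages, tokenize=False, **extras)
```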
wassname 08efb837c0 kl_calibrate: greedy-trajectory KL + Illinois regula-falsi root search
Refactor calibration to match the gist methodology: for each prompt, greedy-
generate n_tokens under the steered policy, capture per-step steered
log-probs, then teacher-force the same continuation under base. Per-position
KL(steered ‖ base) is computed along the steered trajectory rather than at
fixed continuation positions. Captures cumulative drift the old fixed-
continuation KL missed.

Replaces 1-step Newton on alpha with exponential bracket -> Illinois regula-
falsi in log-(alpha, p95) space. Linear in log-log (since p95 ~ alpha^k near
root) so usually converges in 3-4 iters, with Illinois rule breaking the
stuck-endpoint failure mode of pure regula falsi.
2026-04-28 21:23:41 +08:00
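
The root search in 08efb837c0 (exponential bracket, then Illinois regula falsi in log-log space) can be sketched as follows. This is an illustration under assumptions: measure stands in for "run the model at this alpha and return p95 KL", the function names and tolerances are hypothetical, and the toy power law in the demo replaces a real model.

```python
import math


def bracket(measure, alpha0, target, factor=2.0, max_steps=40):
    """Grow/shrink geometrically until measure(lo) < target < measure(hi)."""
    lo = hi = alpha0
    for _ in range(max_steps):
        if measure(lo) < target:
            break
        lo /= factor
    for _ in range(max_steps):
        if measure(hi) > target:
            break
        hi *= factor
    return lo, hi


def illinois_log_root(measure, lo, hi, target, tol=1e-3, max_iter=50):
    """Solve measure(alpha) = target on [lo, hi], in (log alpha, log value) space.

    Assumes measure is positive and the bracket straddles the target.
    """
    def g(x):
        return math.log(measure(math.exp(x))) - math.log(target)

    x0, x1 = math.log(lo), math.log(hi)
    g0, g1 = g(x0), g(x1)
    if g0 * g1 > 0:
        raise ValueError("bracket does not straddle the target")
    side = 0
    x = x0
    for _ in range(max_iter):
        x = (x0 * g1 - x1 * g0) / (g1 - g0)  # secant through the two endpoints
        gx = g(x)
        if abs(gx) < tol:
            break
        if gx * g1 < 0:
            x0, g0 = x, gx          # root lies in (x, x1): move the left end
            if side == -1:
                g1 *= 0.5           # Illinois rule: right end was retained twice
            side = -1
        else:
            x1, g1 = x, gx          # root lies in (x0, x): move the right end
            if side == +1:
                g0 *= 0.5           # Illinois rule: left end was retained twice
            side = +1
    return math.exp(x)


if __name__ == "__main__":
    toy = lambda a: 0.3 * a ** 1.7  # stand-in for p95 KL as a function of alpha
    lo, hi = bracket(toy, 1.0, target=0.61)
    print(illinois_log_root(toy, lo, hi, target=0.61))
```

On an exact power law the secant step in log-log space lands on the root at once, which matches the commit's remark that the problem is close to linear there; the halving rule only matters when one endpoint would otherwise stay fixed.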
wassname 7440229d48 narrow honesty: clamp n_personas to list length, expose grid in sweep
Allows narrow honesty (1 persona pair) to share data-volume parity with
broader behaviors by bumping n_samples. data.py logs the clamp; replicate.py
on-disk size check uses clamped n_personas; run_sweep.py exposes
n_topics/n_personas/n_samples to CLI.

README clarifies honesty_label provenance: party='You' filter from
Action_to_party_to_value, not values_aggregated.
2026-04-28 21:23:32 +08:00
wassname cce818b03f dilemmas: per-action-type SI breakdown in summary CSV
Adds SI_to_do, SI_not_to_do (and si_fwd/si_rev/n_cho/n_rej variants)
to compute_full_metrics output. Lets us see if a steering method's
effect is symmetric across "Should you do X?" vs "Should you NOT do X?"
framings, or if it leans on one. Already-known pattern: paired-opposite
filler rows produced large SI_to_do > SI_not_to_do gaps before the
honesty-only filter landed.
2026-04-28 21:12:57 +08:00
wassname 0f050f2734 honesty: narrow training/prompt/eval to honesty-only axis
The wassname/daily_dilemmas-self-honesty dataset uses paired-opposite
labels: if to_do has honesty in you_values -> to_do=+1, paired not_to_do=-1
even when not_to_do's own values are e.g. ['empathy'], unrelated to
honesty. Those 211/438 filler rows were dragging dW SI negative.

Changes:
- dilemmas.py: filter to action-specific honesty rows at load (227/438)
- data.py: narrow training pair to ('an honest',) / ('a dishonest',),
  matching RepE extraction, simple_*_prompt baselines, and AntiPaSTO
  honesty synonyms. Cluster paraphrases (candid/sincere/manipulative/...)
  drift onto a broader axis -- comparison was unfair.
- prompt_baseline.py: regenerate engineered prompts via AxBench J.2
  template with narrow-honesty concept slot. Old prompts brought in
  integrity/transparency (POS) and Machiavellian/manipulation (NEG).
- nbs/rescore_honesty_only.py: ad-hoc rescore script confirming the
  fix flips dW from -16 SI to +60 SI without rerunning the model.
2026-04-28 21:11:14 +08:00
wassname 06ec48d8f7 KL-budget calibration: match off-task dist-shift across methods
α=1 means very different things across LoRA/PiSSA/DeLoRA/OFT/IA3/RepE/prompt;
calibrate α per method so p95 token-KL on held-out continuations matches
prompt:engineered_prompt_honest's footprint (≈0.61 nats over 50 stratified
prompts, 100 audit). Newton iter α_next=α·sqrt(T/M) converges 7/7 methods
in 2-3 iters. At calibrated ±α on daily-dilemmas (n=219), all 6 adapters
land deeply negative SI: fix counts cluster at 14-19 across all methods,
but adapters break 65-139 already-honest rows (vs 15-20 for engineered
prompts). Interpretation: prompts perturb topic-conditionally, adapters
uniformly — at matched off-task budget, adapters scatter mass over
already-correct rows. RepE sits between.

Caveats: single seed, calibration off-task, anchor audit p95 is 1.78×
calib (calibrated conservatively).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 14:08:55 +08:00
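
The update α_next = α·sqrt(T/M) from 06ec48d8f7 is exact when the measured KL grows like α², which is one plausible reading of why it converged in 2-3 iterations. A minimal sketch, with measure_p95 as a hypothetical stand-in for the real KL measurement and rtol as an assumed stopping tolerance:

```python
import math


def calibrate_alpha(measure_p95, target, alpha=1.0, rtol=0.05, max_iter=10):
    """Iterate alpha <- alpha * sqrt(target / measured) until within rtol of target.

    Returns (alpha, number_of_measurements).
    """
    for n_meas in range(1, max_iter + 1):
        measured = measure_p95(alpha)
        if abs(measured - target) <= rtol * target:
            return alpha, n_meas
        alpha *= math.sqrt(target / measured)
    raise RuntimeError("calibration did not converge")


if __name__ == "__main__":
    # Toy footprint that is not exactly quadratic in alpha.
    toy = lambda a: 0.2 * a ** 1.8
    print(calibrate_alpha(toy, target=0.61))
```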
wassname 325171c291 fix SI_best, add prompt row-alignment check, narrow dw_decomp claims
Address pi-review issues:

- SI_best: max(si_fwd, si_rev) does not equal "best honesty under post-hoc
  sign flip" because under k_fpr=2 the FPR penalty hits the swapped rate,
  so -si_rev != counter_rate - 2*flip_rate. Fix by computing
  si_honest_at_neg1_k2 = counter_rate - 2*flip_rate (role-swapped fix/broke
  for the a=-1-as-honest branch) and taking max against si_fwd.
- Prompt pairing: add (idx, dilemma_idx, action_type) symmetric-difference
  check between base, honest_prompt, and dishonest_prompt before computing
  paired SI. Previously only .sort("idx") was done, so dropped/duplicated
  rows would silently produce cross-example comparisons.
- dw_decomp narrative: mag_only preserves only one scalar per tensor (its
  Frobenius norm), then replaces all within-tensor structure with a single
  Gaussian draw. Tighten docstring + README to claim "per-tensor norm
  allocation" rather than "magnitude pattern", and flag mag_only/random_norm
  as single-seed Monte Carlo controls.

Re-run honesty_tables.py: SI_best now flips prompt:simple from -13.89 to
+3.46 because the role-swapped a=-1 branch is its better direction. Update
README OOD SI table accordingly. Refresh RepE rows in raw-logratio table
with post-padding-fix numbers (mean_pmass ~0.96, no longer ~0.17); drop
stale pmass caveat block.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 09:17:56 +08:00
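
The row-alignment guard added in 325171c291 amounts to a symmetric-difference check on row keys before any paired metric is computed. A small sketch, assuming each run is a list of dicts; the function name and error messages are hypothetical, while the key fields (idx, dilemma_idx, action_type) are the ones named in the commit:

```python
KEY_FIELDS = ("idx", "dilemma_idx", "action_type")


def check_row_alignment(runs, key_fields=KEY_FIELDS):
    """Raise ValueError unless every run covers exactly the same set of row keys.

    runs: dict mapping a run name (e.g. "base") to a list of row dicts.
    """
    keysets = {}
    for name, rows in runs.items():
        keys = [tuple(row[f] for f in key_fields) for row in rows]
        if len(keys) != len(set(keys)):
            raise ValueError(f"{name}: duplicated row keys")
        keysets[name] = set(keys)
    names = list(keysets)
    for other in names[1:]:
        diff = keysets[names[0]] ^ keysets[other]  # symmetric difference
        if diff:
            raise ValueError(
                f"{names[0]} vs {other}: {len(diff)} unmatched rows, e.g. {sorted(diff)[:3]}"
            )


if __name__ == "__main__":
    rows = [
        {"idx": 0, "dilemma_idx": 7, "action_type": "to_do"},
        {"idx": 1, "dilemma_idx": 7, "action_type": "not_to_do"},
    ]
    check_row_alignment({"base": rows, "honest_prompt": rows, "dishonest_prompt": rows})
    print("aligned")
```

Sorting by idx alone, as the earlier code did, keeps row order consistent but cannot detect a dropped or duplicated row; comparing key sets can.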
wassname da75668d6b move RESEARCH_JOURNAL and fork_plan under docs/
Working notes belong with the rest of the docs. Updated relative links
in docs/hypothesis_ablation_catalog.md from ../fork_plan.md to fork_plan.md
since both files now live in docs/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 09:09:52 +08:00
wassname e4504da9a5 cleanup: drop stale HANDOVER/RESEARCH_LOG, fix axis line in fork_plan
HANDOVER.md and RESEARCH_LOG.md were stubs from before the honesty-axis
switch and the work they referenced is already done. fork_plan.md still
said "sycophancy training" at line 24 even though the rest of the doc
already documents the honesty axis.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 08:37:01 +08:00
wassname b7bad4e002 DeLoRA dW decomp: magnitude pattern carries most of the steering
Result: random_direction * original_per_tensor_norm (mag_only) gives a
larger positive logratio shift (+1.07 at a=+1) than the full trained
dW (+0.24), with 5x fewer broken rows. Stripping the magnitude pattern
(dir_only) collapses the effect to +0.02. So which-layers-get-updated
(magnitude allocation) explains most of the steering at +alpha; the
learned elementwise direction adds little.

If this survives multiseed and Gemma replication, it implies weight
steering for honesty needs only a learnable per-tensor scalar -- a
much smaller hypothesis class than full low-rank PEFT.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 08:33:24 +08:00
wassname 19bc3edb2e add dW magnitude/direction ablation eval
Constructs four variants of a trained dW and evaluates each on daily
dilemmas at coeffs {-1, 0, +1}:
  full         original (control)
  dir_only     elementwise direction preserved, all tensors rescaled
               to a common Frobenius norm (flattens per-tensor magnitude)
  mag_only     random direction per tensor, original per-tensor norm
               (preserves which layers/modules carry the load)
  random_norm  random direction + common norm (control)

Tests whether the trained behavior is carried by element direction or
by the per-tensor magnitude pattern. Default adapter is delora since
it has the largest raw dd_delta and the worst SI -- which factor is
load-bearing?

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 08:31:24 +08:00
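
The four dW variants listed in 19bc3edb2e can be illustrated on plain Python lists, each list standing in for one flattened tensor. This is a sketch under assumptions: the choice of the mean per-tensor norm as the "common" norm, the Gaussian draw for random directions, and all names are hypothetical, and every tensor is assumed to have nonzero norm.

```python
import math
import random


def fro_norm(t):
    """Frobenius norm of a flattened tensor."""
    return math.sqrt(sum(x * x for x in t))


def rescale(t, norm):
    """Scale t so its Frobenius norm equals norm (assumes t is nonzero)."""
    scale = norm / fro_norm(t)
    return [x * scale for x in t]


def random_like(t, rng):
    """Gaussian draw with the same number of elements as t."""
    return [rng.gauss(0.0, 1.0) for _ in t]


def dw_variants(dw, seed=0):
    """Build full / dir_only / mag_only / random_norm from dw: name -> flat list."""
    rng = random.Random(seed)
    norms = {k: fro_norm(v) for k, v in dw.items()}
    common = sum(norms.values()) / len(norms)  # assumption: mean per-tensor norm
    return {
        "full": {k: list(v) for k, v in dw.items()},
        # keep each tensor's direction, flatten the per-tensor magnitudes
        "dir_only": {k: rescale(v, common) for k, v in dw.items()},
        # keep each tensor's norm, replace its direction with noise
        "mag_only": {k: rescale(random_like(v, rng), norms[k]) for k, v in dw.items()},
        # neither direction nor magnitude pattern survives
        "random_norm": {k: rescale(random_like(v, rng), common) for k, v in dw.items()},
    }


if __name__ == "__main__":
    toy = {"layer0.o_proj": [3.0, 4.0], "layer1.o_proj": [0.0, 1.0]}
    for name, variant in dw_variants(toy).items():
        print(name, {k: round(fro_norm(v), 3) for k, v in variant.items()})
```

Note that mag_only and random_norm each depend on a single random draw, so a result from one seed is a Monte Carlo sample rather than a fixed quantity.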
wassname 64adf9267d SI tables v2: SI_best, SI_k1, fix/broke rates; paired prompts; IID syc
- Pair prompt baselines as alpha=-1/0/+1 (dishonest/base/honest) under
  simple and engineered families, giving full bidirectional SI for
  prompts (same as dW)
- Add SI_best = max(si_fwd, si_rev) * pmass^2 * 100 -- sign-aligned
  upper bound (snooping-aware robustness probe)
- Add SI_k1 (symmetric, breaks weighted 1x) alongside default SI_k2
  to expose how much the class-imbalance-driven 2x penalty contributes
- Expose fix_rate / broke_rate columns so the SI components are visible
- Add IID syc table (held-out persona claims) using
  cross_adapter_ablation/sycophancy_per_row.csv with variant=full_all_tensors
- Add raw mean +- std logratio table per (method, coeff) for OOD

The IID/OOD split shows: dW interventions land hard on IID (PiSSA biggest,
+5.7 mean shift) but most break OOD via the broke_rate channel. OFT and
engineered prompts are the only methods with non-negative SI_best.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 08:29:49 +08:00
wassname 0ded47388f SI tables: README + nbs/honesty_tables.py with adapters/prompts/RepE
- Combined methods comparison table in README using SI as primary metric
- nbs/honesty_tables.py produces SI / raw-logratio / flip-count tables
  from existing per-row CSVs (cross_adapter_full_dd, prompt_baseline,
  activation_baseline)
- prompt_baseline.py: si_fwd computed inline for prompt methods
- activation_baseline.py: tok.padding_side restore moved after the
  inference loop so logit extraction sees the correct side

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 08:25:05 +08:00
wassname df61cdc628 add surgical_informedness metric; fix simple_honest_prompt to match training persona
- dilemmas.py: compute_surgical_informedness + compute_full_metrics (ref-anchored
  bidirectional SI, k_fpr=2; forward-only fallback when coeff=-1 absent)
- prompt_baseline.py: simple_honest/dishonest prompts now use same
  HONESTY_PROMPT.format(persona=...) template as training persona prefix
  (was "You are an honest assistant..."); also adds simple_dishonest_prompt;
  _summarize computes SI per method via _si_per_method
- full_dd_benchmark.py: _summarize computes SI per adapter; output sorted
  by SI; final_summary reports SI as main_metric

Re-queue: pueue 237 (T3 prompt_baseline), 238 (T2 full_dd_benchmark)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 06:04:06 +08:00
wassname a48430b075 switch training/eval axis from sycophancy to honesty
- data.py: HONESTY_PROMPT/POS/NEG_PERSONAS (5 paraphrases each, vgel/repeng
  short-form), _load_suffixes() reading data/branching_suffixes.json,
  behavior branches in _personas/_topics/_build_specs for paper-recipe
  question pool from 550 SSteer suffix entries
- activation_baseline.py: _fit_repe_directions branches on behavior; honesty
  mode captures last-token hidden states under pos/neg personas with
  assistant_prefixes from suffix entries (all-layers RepE)
- prompt_baseline.py: paired engineered_prompt_honest + _dishonest (AxBench
  J.2), both as plain strings
- evals/smoke.py: behavior field in SmokeCfg
- data/branching_suffixes.json: 550 SSteer branching-suffix entries
- README: updated persona description, adapter table, baselines table with
  honesty-axis numbers (438 rows, delora +0.237 best)
- RESEARCH_JOURNAL.md: 2026-04-27 axis-switch entry
- fork_plan.md: open design question resolved as option 2 (honesty axis)
- HANDOVER.md: overnight handover notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 06:00:03 +08:00
wassname c828b0c00b baselines 2026-04-27 19:40:43 +08:00
wassname 6ec664995b T6/T7/T8 ablations + lens-search hold pending multiseed
- Add `eval/layer_module_ablation.py` (T7) and `eval/parameterization_ablation.py` (T8) for causal ablation of trained `dW`.
- Add `nbs/ablation_analysis.py` consuming T7/T8 CSVs through three lenses (SVD-on-`dW`, layer index, module family).
- Fix `prompt_baseline.py` engineered-prompt tuple bug; add `DIFF_FILENAME` constant in `diff.py`.
- Delete superseded notebooks (`analyze_diff*`, `cross_adapter_v9`, `hypothesis_sweep_v5-v9`, `strong_conclusion_v4`, `v10_llama`, `functional_projection_v10`).
- Document (README, fork_plan, RESEARCH_JOURNAL): each lens has a built-in failure mode (SVD tautological for low-rank adapters; layer-index tells depth not mechanism; module-family disagrees cross-adapter; native parameterization decompositions non-comparable). Mark analysis question on hold pending T4 multiseed: cross-adapter inconsistency may be N=1 seed noise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 19:05:20 +08:00
wassname db7979d0e2 baselines 2026-04-27 13:02:34 +08:00
wassname 8fa9e54eaa docs: rewrite fork plan with UAT tasks 2026-04-27 11:22:52 +08:00
wassname a3d999fd92 wip 2026-04-27 09:59:06 +08:00
wassname 2f12058b7e clarify tested subspace and parametrization hypotheses 2026-04-27 07:10:39 +08:00
wassname b001c40521 document adapter benchmark and projection interpretation 2026-04-27 07:09:02 +08:00
wassname 25334ec574 fix daily-dilemmas cross-adapter baseline 2026-04-27 07:00:09 +08:00
wassname 6f41e47ea9 v10 functional projection falsifier for act-oracle overlap 2026-04-27 06:54:09 +08:00
wassname ff92b092fa research journal: v9 cross-adapter — DeLoRA wins behavior, all subspace methods fail at 1-8% overlap 2026-04-27 06:29:03 +08:00
wassname 236cea1267 cross_adapter_v9: aggregate v9 scope diagnostics + dilemmas across adapters 2026-04-26 21:55:19 +08:00
wassname 3f162027b1 v9 sweep: block-local act oracle + L=8 sanity (layer-scope diagnostic); ADAPTER env var for cross-adapter use 2026-04-26 21:53:45 +08:00
wassname 2c262d47a8 v8 polish: w_oracle + act_oracle (each saturates own axis), 3-panel scatter + bar of % to ideal 2026-04-26 21:14:01 +08:00
wassname f4039dd2ee v8: rank-honest pct_oracle metric (energy_frac / oracle@r_eff in [0,1])
Replaces v7's post-hoc 'pct_w_oracle = R_w / R_w_ceiling' (a ratio of two
concentration ratios) with a per-row pct_oracle: candidate's energy_frac
divided by the optimal rank-r_eff subspace's energy_frac on the same
target. Rank-honest: chars_clusters (r_eff=7) is graded against rank-7
oracle, not rank-8. Activation oracle = PCA of L2-normalized hs_diff_B
(matches existing energy_frac_act formula).

Result: every non-oracle candidate lands at pct_oracle in [0.02, 0.11] on
both axes. Best joint = WNR_union_TaskDiff at 0.089 (rank 16; all others
rank 8). chars_clusters and layer_clean_resid_pca tied at ~0.085. This is
a clean negative result: LoRA's task-specific delta is far from any of
our hand-built linear primitives' spans.
2026-04-26 20:59:27 +08:00
wassname 651ad132d3 v7: cold-eyes evidence review + flag write-family-below-null in conclusion 2026-04-26 20:01:11 +08:00
wassname 3c9fb8d1f5 v7 sweep: per-tensor R_w + true weight ceiling + axis_kind tag
Addresses three concerns from docs/review/v6_hypothesis_review.md:
1. R_w split into oproj/downproj + Frobenius-balanced combined.
2. dW_left_basis_ceiling as the true weight oracle.
3. axis_kind tag (write/read/mixed/ceiling).

Single-seed result: chars_clusters and attn_min_taskdiff are top-5 by both R_act
and R_w_combined. Write-family bases (write/mlp_write/global_write) all have
R_w_combined ~ 1.0 (random null) -- natural weight-side bases fail the
weight-axis test. Multi-seed deferred to v7b.
2026-04-26 19:55:42 +08:00
wassname a1b38dc456 docs: add v6 hypothesis review (subagent + reviewer-of-reviewer) 2026-04-26 19:45:13 +08:00
wassname aba74c0f64 logs 2026-04-26 11:12:11 +08:00
wassname c1c4d2f3cb nb 2026-04-26 11:03:38 +08:00
wassname 7be1487d7b data recipe: drop n_pairs/judge/Optional knobs, explicit grid
Subagent review fixes:

- DataCfg / Cfg expose the grid directly (n_topics, n_personas, n_samples)
  as required ints with paper defaults (20/5/10). Drops `n_pairs` and the
  silent round() that made the count fuzzy. Drops `Optional[int]` smoke
  overrides — smoke just sets 2/1/2 = 4 pairs.
- Drop hash()-based per-spec reseeding (process-nondeterministic via
  PYTHONHASHSEED salt) and the `rng` parameter to _gen that never reached
  model.generate. One torch.manual_seed at start; spec order seeded by rng.
- Delete _judge_filter stub + cfg.judge flag (dead code, paper §3 GPT-4.1-mini
  filter not implemented yet — TODO comment instead).
- replicate._maybe_data: check len(ds) against n_topics × n_personas × n_samples
  instead of n_pairs.
- justfile: drop --n-pairs 1000.
2026-04-26 10:24:31 +08:00
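
The explicit grid and single-seed ordering from 7be1487d7b can be sketched as below. Integer indices stand in for the real topic and persona strings, stdlib random stands in for the torch seeding the commit mentions, and build_specs is a hypothetical name; the grid sizes (20/5/10 by default, 2/1/2 for smoke) are taken from the commit message.

```python
import itertools
import random


def build_specs(n_topics=20, n_personas=5, n_samples=10, seed=0):
    """Enumerate the full (topic, persona, sample) grid, then shuffle it once.

    A single seeded generator fixes the order; no per-spec hash()-based
    reseeding, since hash() of a str is salted per process unless
    PYTHONHASHSEED is set.
    """
    rng = random.Random(seed)
    specs = list(itertools.product(range(n_topics), range(n_personas), range(n_samples)))
    rng.shuffle(specs)
    return specs


if __name__ == "__main__":
    print(len(build_specs()))         # full grid: n_topics * n_personas * n_samples
    print(build_specs(2, 1, 2))       # smoke grid
```

Because the count is an exact product, a cached dataset can be validated by comparing len(ds) with n_topics × n_personas × n_samples, as the replicate._maybe_data bullet describes.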
wassname 7e1b171875 paper data recipe + LoRA hyperparams + n_pairs hardening
- data: 5 pos + 5 neg personas, 20 train + 12 eval topic split
  (paper §3 / Appendix C), n_samples solved from n_pairs.
  judge filter stub (off by default; paper uses GPT-4.1-mini).
- eval/sycophancy: read true held-out eval_topics() instead of
  SYCOPHANCY_TOPICS[-16:].
- replicate: fix epochs threading; n_pairs reuse fails fast on mismatch;
  smoke knobs (n_topics, n_personas) plumbed.
- train: paper hyperparams (rank 32 / alpha 16 / lr 1e-5 / warmup 5 /
  wd 0.01); explicit alpha (no 2*r fallback); held-out 10% val + eval_loss
  logging.
- run_demo: train_topics() for in_dist demo claims.
- README: scope block reflects paper-matching recipe.
2026-04-26 10:19:59 +08:00