25 Commits

Author SHA1 Message Date
wassname 48c1b07b83 readme 2026-05-05 08:12:41 +08:00
wassname cf0f7d6c54 results 2026-05-04 18:33:19 +08:00
wassname 7eac38829d hmm 2026-05-04 06:17:30 +08:00
wassname 9dff8d0256 feat: add auth_socn behavior + behavior-aware axis_shift + pmass/flips/bare-logit eval helpers
- data.py: AUTH_SOCN_POS/NEG_PERSONAS (6 pairs, ported from steering-lite branching.py),
  wired into _personas() / _topics() / _build_specs() for auth_socn behavior
- tinymfv_airisk.py: AXIS_PAIR dict + behavior-aware _axis_shift (auth_socn uses
  ΔlogitSocNorms − ΔlogitAuthority vs trad_care's ΔlogitSanc − ΔlogitCare);
  PMASS_FLOOR=0.9 NaN-gate; _logit NaN-safe; _flips_per_foundation_table;
  _bare_logit_per_foundation_table; new __foundations_flips.csv + __bare_logit.csv artifacts
- README: fill trad_care comparison table with actual ws results (jobs 93-96),
  add bare model row for ws, add sl:engineered_prompt row

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-03 06:11:48 +08:00
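A minimal sketch of the behavior-aware axis shift and pmass gate described in commit 9dff8d0256. `AXIS_PAIR`, `PMASS_FLOOR=0.9`, and the two differences come from the commit message. The foundation key spellings, the function signature, and the dict layout are assumptions for illustration, not the contents of `tinymfv_airisk.py`.

```python
import math

# Assumed layout: behavior -> (foundation expected to rise, foundation expected to fall).
# The key spellings are illustrative guesses based on the commit message.
AXIS_PAIR = {
    "auth_socn": ("SocNorms", "Authority"),
    "trad_care": ("Sanc", "Care"),
}
PMASS_FLOOR = 0.9  # value stated in the commit message


def axis_shift(behavior, delta_logit, pmass):
    """Return delta_logit[up] - delta_logit[down], or NaN when pmass is too low.

    delta_logit maps foundation name -> change in logit versus the bare model.
    pmass is the probability mass on the valid answer tokens.
    """
    if pmass is None or math.isnan(pmass) or pmass < PMASS_FLOOR:
        return math.nan  # NaN-gate: the logits are not trustworthy here
    up, down = AXIS_PAIR[behavior]
    return delta_logit[up] - delta_logit[down]


if __name__ == "__main__":
    deltas = {"SocNorms": 0.5, "Authority": -0.25, "Sanc": 0.1, "Care": 0.3}
    print(axis_shift("auth_socn", deltas, pmass=0.95))
    print(axis_shift("trad_care", deltas, pmass=0.50))
```

Returning NaN rather than 0 keeps gated rows out of downstream means, provided the aggregation is NaN-aware.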
wassname 497ee05aef first pass care vs sanctity 2026-05-03 06:02:07 +08:00
wassname 4f2034dd46 tidy 2026-05-02 05:52:25 +08:00
wassname 71a8d4c555 tidy 2026-05-01 22:29:06 +08:00
wassname 27cf12c2d8 Switch AIRisk evals to tiny-mfv workflow 2026-05-01 20:47:31 +08:00
wassname b2ef8fef7b wip 2026-04-30 21:06:18 +08:00
wassname ce73e97154 fix: skip guided-CoT for non-thinking models; trim README
Gemma-3/4 don't have </think> as a special token, so guided_cot_one
raised RuntimeError and killed the whole sweep. Fix: add has_thinking_mode
to _tok_extras and gate phase_a2 in replicate.py on it.

README cut from ~380 to ~120 lines: results tables, how to run, cite, links.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-29 05:39:50 +08:00
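The gating fix in commit ce73e97154 can be sketched as follows. Only the names `has_thinking_mode` and `phase_a2` appear in the commit message. The signature below, the `run_phases` helper, and the tuple layout are hypothetical, and a plain list of special tokens stands in for the tokenizer.

```python
def has_thinking_mode(special_tokens, close_tag="</think>"):
    """True when the tokenizer registers the closing think tag as a special token."""
    return close_tag in set(special_tokens)


def run_phases(special_tokens, phases):
    """Run (name, needs_thinking, fn) phases, skipping thinking-only ones.

    Skipping instead of raising keeps one non-thinking model from
    aborting the whole sweep.
    """
    thinking = has_thinking_mode(special_tokens)
    results = {}
    for name, needs_thinking, fn in phases:
        if needs_thinking and not thinking:
            results[name] = "skipped"
            continue
        results[name] = fn()
    return results


if __name__ == "__main__":
    phases = [
        ("phase_a1", False, lambda: "ok"),
        ("phase_a2", True, lambda: "guided-cot ok"),  # guided-CoT phase
    ]
    print(run_phases(["<bos>", "<eos>"], phases))              # no think tag
    print(run_phases(["<bos>", "<eos>", "</think>"], phases))  # think tag present
```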
wassname 7440229d48 narrow honesty: clamp n_personas to list length, expose grid in sweep
Allows narrow honesty (1 persona pair) to share data-volume parity with
broader behaviors by bumping n_samples. data.py logs the clamp; replicate.py
on-disk size check uses clamped n_personas; run_sweep.py exposes
n_topics/n_personas/n_samples to CLI.

README clarifies honesty_label provenance: party='You' filter from
Action_to_party_to_value, not values_aggregated.
2026-04-28 21:23:32 +08:00
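A small sketch of the clamp from commit 7440229d48. The function names are hypothetical, and the parity helper assumes the dataset size is `n_topics * n_personas * n_samples`, which the commit message implies but does not state.

```python
import logging

logger = logging.getLogger(__name__)


def clamp_n_personas(n_personas, personas):
    """Clamp the requested persona count to the number actually available."""
    available = len(personas)
    if n_personas > available:
        logger.warning(
            "n_personas=%d exceeds %d available; clamping", n_personas, available
        )
        return available
    return n_personas


def samples_for_parity(target_rows, n_topics, n_personas):
    """Smallest n_samples with n_topics * n_personas * n_samples >= target_rows.

    Assumes rows scale as the product of the three knobs.
    """
    per_sample = n_topics * n_personas
    return -(-target_rows // per_sample)  # ceiling division


if __name__ == "__main__":
    honesty_pairs = [("honest", "dishonest")]  # narrow behavior: one persona pair
    n = clamp_n_personas(5, honesty_pairs)
    print(n, samples_for_parity(500, n_topics=20, n_personas=n))
```

Using the clamped value in any on-disk size check keeps cached datasets from being flagged as mismatched.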
wassname 06ec48d8f7 KL-budget calibration: match off-task dist-shift across methods
α=1 means very different things across LoRA/PiSSA/DeLoRA/OFT/IA3/RepE/prompt;
calibrate α per method so p95 token-KL on held-out continuations matches
prompt:engineered_prompt_honest's footprint (≈0.61 nats over 50 stratified
calibration prompts, 100 audit prompts). Newton iter α_next=α·sqrt(T/M) converges for 7/7 methods
in 2-3 iters. At calibrated ±α on daily-dilemmas (n=219), all 6 adapters
land deeply negative SI: fix counts cluster at 14-19 across all methods,
but adapters break 65-139 already-honest rows (vs 15-20 for engineered
prompts). Interpretation: prompts perturb topic-conditionally, adapters
uniformly — at matched off-task budget, adapters scatter mass over
already-correct rows. RepE sits between.

Caveats: single seed, calibration off-task, anchor audit p95 is 1.78×
calib (calibrated conservatively).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 14:08:55 +08:00
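The update rule in commit 06ec48d8f7, α_next = α·sqrt(T/M), can be sketched as below, where T is the target p95 token-KL and M is the measured value at the current α. The function name, tolerance, and toy KL model are assumptions. The square-root step is exact only if KL grows quadratically in α, which is a common small-perturbation approximation rather than something the commit states.

```python
def calibrate_alpha(measure_kl, target, alpha=1.0, tol=0.05, max_iters=10):
    """Iterate alpha <- alpha * sqrt(target / measured) until within tol.

    measure_kl(alpha) is assumed to return the p95 token-KL (in nats) of the
    steered model versus the base model on held-out continuations.
    Returns (alpha, number_of_updates_applied).
    """
    for step in range(max_iters):
        measured = measure_kl(alpha)
        if measured <= 0:
            raise ValueError("measured KL must be positive to rescale alpha")
        if abs(measured - target) / target <= tol:
            return alpha, step
        alpha *= (target / measured) ** 0.5
    return alpha, max_iters


if __name__ == "__main__":
    # Toy stand-in for a real measurement: KL grows quadratically with alpha.
    toy_kl = lambda a: 2.4 * a ** 2
    alpha, steps = calibrate_alpha(toy_kl, target=0.61)
    print(f"alpha={alpha:.3f} after {steps} update(s), KL={toy_kl(alpha):.3f}")
```

With a real adapter, the KL response is only approximately quadratic, which would be consistent with the 2-3 iterations reported in the commit.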
wassname 325171c291 fix SI_best, add prompt row-alignment check, narrow dw_decomp claims
Address pi-review issues:

- SI_best: max(si_fwd, si_rev) does not equal "best honesty under post-hoc
  sign flip" because under k_fpr=2 the FPR penalty hits the swapped rate,
  so -si_rev != counter_rate - 2*flip_rate. Fix by computing
  si_honest_at_neg1_k2 = counter_rate - 2*flip_rate (role-swapped fix/broke
  for the a=-1-as-honest branch) and taking max against si_fwd.
- Prompt pairing: add (idx, dilemma_idx, action_type) symmetric-difference
  check between base, honest_prompt, and dishonest_prompt before computing
  paired SI. Previously only .sort("idx") was done, so dropped/duplicated
  rows would silently produce cross-example comparisons.
- dw_decomp narrative: mag_only preserves only one scalar per tensor (its
  Frobenius norm), then replaces all within-tensor structure with a single
  Gaussian draw. Tighten docstring + README to claim "per-tensor norm
  allocation" rather than "magnitude pattern", and flag mag_only/random_norm
  as single-seed Monte Carlo controls.

Re-run honesty_tables.py: SI_best now flips prompt:simple from -13.89 to
+3.46 because the role-swapped a=-1 branch is its better direction. Update
README OOD SI table accordingly. Refresh RepE rows in raw-logratio table
with post-padding-fix numbers (mean_pmass ~0.96, no longer ~0.17); drop
stale pmass caveat block.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 09:17:56 +08:00
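The prompt row-alignment check from commit 325171c291 can be illustrated with plain sets. The key tuple `(idx, dilemma_idx, action_type)` is from the commit message. The function names, the list-of-dicts row format, and the explicit duplicate check are assumptions added here, since a set-based symmetric difference alone would not catch duplicated rows.

```python
KEY = ("idx", "dilemma_idx", "action_type")


def row_keys(rows, name):
    """Return the set of key tuples for a table, rejecting duplicates."""
    keys = [tuple(row[k] for k in KEY) for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError(f"{name}: duplicated rows on key {KEY}")
    return set(keys)


def check_aligned(**tables):
    """Raise if any table's key set differs from the first table's."""
    names = list(tables)
    ref_name = names[0]
    ref = row_keys(tables[ref_name], ref_name)
    for name in names[1:]:
        diff = ref ^ row_keys(tables[name], name)  # symmetric difference
        if diff:
            raise ValueError(
                f"{ref_name} vs {name}: {len(diff)} unmatched keys, "
                f"e.g. {sorted(diff)[:3]}"
            )


if __name__ == "__main__":
    base = [
        {"idx": 0, "dilemma_idx": 7, "action_type": "to_do"},
        {"idx": 1, "dilemma_idx": 7, "action_type": "not_to_do"},
    ]
    check_aligned(base=base, honest_prompt=base, dishonest_prompt=base)
    print("aligned")
```

Running this before sorting and pairing makes a dropped row fail loudly instead of shifting every later comparison onto the wrong example.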
wassname b7bad4e002 DeLoRA dW decomp: magnitude pattern carries most of the steering
Result: random_direction * original_per_tensor_norm (mag_only) gives a
larger positive logratio shift (+1.07 at a=+1) than the full trained
dW (+0.24), with 5x fewer broken rows. Stripping the magnitude pattern
(dir_only) collapses the effect to +0.02. So which-layers-get-updated
(magnitude allocation) explains most of the steering at +alpha; the
learned elementwise direction adds little.

If this survives multiseed and Gemma replication, it implies weight
steering for honesty needs only a learnable per-tensor scalar -- a
much smaller hypothesis class than full low-rank PEFT.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 08:33:24 +08:00
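The two controls in commit b7bad4e002 can be sketched in pure Python, with each tensor flattened to a list of floats. `mag_only` follows the description (a random Gaussian direction rescaled to the original per-tensor Frobenius norm). For `dir_only`, the commit says only that the magnitude pattern is stripped, so rescaling every tensor to the mean norm is an assumption, as are the function names and the dict-of-lists layout.

```python
import math
import random


def frob_norm(tensor):
    """Frobenius norm of a tensor given as a flat list of floats."""
    return math.sqrt(sum(x * x for x in tensor))


def rescale(tensor, norm):
    """Scale a nonzero tensor so its Frobenius norm equals `norm`."""
    current = frob_norm(tensor)
    return [x * norm / current for x in tensor]


def mag_only(dw, seed=0):
    """Keep each tensor's norm; replace its direction with one Gaussian draw."""
    rng = random.Random(seed)
    return {
        name: rescale([rng.gauss(0.0, 1.0) for _ in t], frob_norm(t))
        for name, t in dw.items()
    }


def dir_only(dw):
    """Keep each tensor's direction; flatten all norms to their mean (assumed)."""
    norms = [frob_norm(t) for t in dw.values()]
    shared = sum(norms) / len(norms)
    return {name: rescale(t, shared) for name, t in dw.items()}


if __name__ == "__main__":
    dw = {"layer0.q_proj": [3.0, 4.0], "layer1.q_proj": [0.3, 0.4]}
    for name, t in mag_only(dw).items():
        print(name, round(frob_norm(t), 6))
    for name, t in dir_only(dw).items():
        print(name, round(frob_norm(t), 6))
```

Because `mag_only` depends on a single random draw per tensor, results from one seed are a Monte Carlo sample rather than a fixed quantity.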
wassname 64adf9267d SI tables v2: SI_best, SI_k1, fix/broke rates; paired prompts; IID syc
- Pair prompt baselines as alpha=-1/0/+1 (dishonest/base/honest) under
  simple and engineered families, giving full bidirectional SI for
  prompts (same as dW)
- Add SI_best = max(si_fwd, si_rev) * pmass^2 * 100 -- sign-aligned
  upper bound (snooping-aware robustness probe)
- Add SI_k1 (symmetric, breaks weighted 1x) alongside default SI_k2
  to expose how much the class-imbalance-driven 2x penalty contributes
- Expose fix_rate / broke_rate columns so the SI components are visible
- Add IID syc table (held-out persona claims) using
  cross_adapter_ablation/sycophancy_per_row.csv with variant=full_all_tensors
- Add raw mean ± std logratio table per (method, coeff) for OOD

The IID/OOD split shows: dW interventions land hard on IID (PiSSA biggest,
+5.7 mean shift) but most break OOD via the broke_rate channel. OFT and
engineered prompts are the only methods with non-negative SI_best.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 08:29:49 +08:00
wassname 0ded47388f SI tables: README + nbs/honesty_tables.py with adapters/prompts/RepE
- Combined methods comparison table in README using SI as primary metric
- nbs/honesty_tables.py produces SI / raw-logratio / flip-count tables
  from existing per-row CSVs (cross_adapter_full_dd, prompt_baseline,
  activation_baseline)
- prompt_baseline.py: si_fwd computed inline for prompt methods
- activation_baseline.py: tok.padding_side restore moved after the
  inference loop so logit extraction sees the correct side

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-28 08:25:05 +08:00
wassname a48430b075 switch training/eval axis from sycophancy to honesty
- data.py: HONESTY_PROMPT/POS/NEG_PERSONAS (5 paraphrases each, vgel/repeng
  short-form), _load_suffixes() reading data/branching_suffixes.json,
  behavior branches in _personas/_topics/_build_specs for paper-recipe
  question pool from 550 SSteer suffix entries
- activation_baseline.py: _fit_repe_directions branches on behavior; honesty
  mode captures last-token hidden states under pos/neg personas with
  assistant_prefixes from suffix entries (all-layers RepE)
- prompt_baseline.py: paired engineered_prompt_honest + _dishonest (AxBench
  J.2), both as plain strings
- evals/smoke.py: behavior field in SmokeCfg
- data/branching_suffixes.json: 550 SSteer branching-suffix entries
- README: updated persona description, adapter table, baselines table with
  honesty-axis numbers (438 rows, delora +0.237 best)
- RESEARCH_JOURNAL.md: 2026-04-27 axis-switch entry
- fork_plan.md: open design question resolved as option 2 (honesty axis)
- HANDOVER.md: overnight handover notes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-28 06:00:03 +08:00
wassname c828b0c00b baselines 2026-04-27 19:40:43 +08:00
wassname 6ec664995b T6/T7/T8 ablations + lens-search hold pending multiseed
- Add `eval/layer_module_ablation.py` (T7) and `eval/parameterization_ablation.py` (T8) for causal ablation of trained `dW`.
- Add `nbs/ablation_analysis.py` consuming T7/T8 CSVs through three lenses (SVD-on-`dW`, layer index, module family).
- Fix `prompt_baseline.py` engineered-prompt tuple bug; add `DIFF_FILENAME` constant in `diff.py`.
- Delete superseded notebooks (`analyze_diff*`, `cross_adapter_v9`, `hypothesis_sweep_v5-v9`, `strong_conclusion_v4`, `v10_llama`, `functional_projection_v10`).
- Document (README, fork_plan, RESEARCH_JOURNAL): each lens has a built-in failure mode (SVD tautological for low-rank adapters; layer-index tells depth not mechanism; module-family disagrees cross-adapter; native parameterization decompositions non-comparable). Mark analysis question on hold pending T4 multiseed: cross-adapter inconsistency may be N=1 seed noise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-27 19:05:20 +08:00
wassname 2f12058b7e clarify tested subspace and parametrization hypotheses 2026-04-27 07:10:39 +08:00
wassname b001c40521 document adapter benchmark and projection interpretation 2026-04-27 07:09:02 +08:00
wassname 7e1b171875 paper data recipe + LoRA hyperparams + n_pairs hardening
- data: 5 pos + 5 neg personas, 20 train + 12 eval topic split
  (paper §3 / Appendix C), n_samples solved from n_pairs.
  judge filter stub (off by default; paper uses GPT-4.1-mini).
- eval/sycophancy: read true held-out eval_topics() instead of
  SYCOPHANCY_TOPICS[-16:].
- replicate: fix epochs threading; n_pairs reuse fails fast on mismatch;
  smoke knobs (n_topics, n_personas) plumbed.
- train: paper hyperparams (rank 32 / alpha 16 / lr 1e-5 / warmup 5 /
  wd 0.01); explicit alpha (no 2*r fallback); held-out 10% val + eval_loss
  logging.
- run_demo: train_topics() for in_dist demo claims.
- README: scope block reflects paper-matching recipe.
2026-04-26 10:19:59 +08:00
wassname 3ff283d535 README: fork notice + pipeline overview
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-04-25 20:16:57 +08:00
Constanza 977c054586 Update README.md 2025-11-11 09:18:45 +01:00
cfierro94 90065f035f first commit 2025-10-17 11:14:24 +02:00