The wassname/daily_dilemmas-self-honesty dataset uses paired-opposite
labels: if to_do lists honesty in you_values, to_do gets +1 and its paired
not_to_do gets -1, even when not_to_do's own values (e.g. ['empathy']) are
unrelated to honesty. Those 211/438 filler rows were dragging dW SI negative.
Changes:
- dilemmas.py: filter to action-specific honesty rows at load (227/438)
- data.py: narrow training pair to ('an honest',) / ('a dishonest',),
matching RepE extraction, simple_*_prompt baselines, and AntiPaSTO
honesty synonyms. Cluster paraphrases (candid/sincere/manipulative/...)
drifted onto a broader axis, which made the comparison unfair.
- prompt_baseline.py: regenerate engineered prompts via AxBench J.2
template with narrow-honesty concept slot. Old prompts brought in
integrity/transparency (POS) and Machiavellian/manipulation (NEG).
- nbs/rescore_honesty_only.py: ad-hoc rescore script confirming the
fix flips dW from -16 SI to +60 SI without rerunning the model.
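A minimal sketch of the action-specific filter from the dilemmas.py item, not the actual loader. The row layout, the exact-match test on you_values, and the helper name is_action_specific are assumptions made for illustration; the real dataset schema may differ.

```python
# Hypothetical rows. Field names follow the ones mentioned in this message,
# but the layout is an assumption, not the real dataset schema.
rows = [
    {"dilemma_idx": 0, "action_type": "to_do", "you_values": ["honesty", "trust"], "label": 1},
    {"dilemma_idx": 0, "action_type": "not_to_do", "you_values": ["empathy"], "label": -1},
    {"dilemma_idx": 1, "action_type": "to_do", "you_values": ["loyalty"], "label": -1},
    {"dilemma_idx": 1, "action_type": "not_to_do", "you_values": ["honesty"], "label": 1},
]


def is_action_specific(row, concept="honesty"):
    """Keep a row only if its own value list names the concept (exact match assumed)."""
    return concept in row["you_values"]


kept = [r for r in rows if is_action_specific(r)]
dropped = [r for r in rows if not is_action_specific(r)]

# Filler rows inherit a label from their partner without mentioning honesty.
print(f"kept {len(kept)}/{len(rows)}, dropped {len(dropped)} filler rows")
```

The dropped rows are the ones whose label exists only because the paired action mentions honesty.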
Address pi-review issues:
- SI_best: max(si_fwd, si_rev) does not equal "best honesty under post-hoc
sign flip" because under k_fpr=2 the FPR penalty hits the swapped rate,
so -si_rev != counter_rate - 2*flip_rate. Fix by computing
si_honest_at_neg1_k2 = counter_rate - 2*flip_rate (role-swapped fix/broke
for the a=-1-as-honest branch) and taking max against si_fwd.
- Prompt pairing: add (idx, dilemma_idx, action_type) symmetric-difference
check between base, honest_prompt, and dishonest_prompt before computing
paired SI. Previously only .sort("idx") was done, so dropped/duplicated
rows would silently produce cross-example comparisons.
- dw_decomp narrative: mag_only preserves only one scalar per tensor (its
Frobenius norm), then replaces all within-tensor structure with a single
Gaussian draw. Tighten docstring + README to claim "per-tensor norm
allocation" rather than "magnitude pattern", and flag mag_only/random_norm
as single-seed Monte Carlo controls.
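A rough sketch of what the mag_only control in the dw_decomp item keeps and discards, assuming NumPy and made-up tensor names and shapes. The function mag_only_control is illustrative and is not the repository's implementation.

```python
import numpy as np


def mag_only_control(delta, rng):
    """Gaussian tensor with the same shape and Frobenius norm as delta.

    Only one scalar per tensor (the norm) survives; all within-tensor
    structure is replaced by a single random draw.
    """
    noise = rng.standard_normal(delta.shape)
    return noise * (np.linalg.norm(delta) / np.linalg.norm(noise))


rng = np.random.default_rng(0)  # one seed means one Monte Carlo sample

# Hypothetical per-tensor deltas with deliberately different norms.
deltas = {
    "o_proj": 0.5 * rng.standard_normal((4, 6)),
    "down_proj": 2.0 * rng.standard_normal((6, 3)),
}
controls = {name: mag_only_control(dw, rng) for name, dw in deltas.items()}

for name in deltas:
    print(name, np.linalg.norm(deltas[name]), np.linalg.norm(controls[name]))
```

Because each control is a single draw, any comparison against it carries sampling noise; averaging over several seeds would be needed to treat it as more than a single-seed check.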
Re-run honesty_tables.py: SI_best now flips prompt:simple from -13.89 to
+3.46 because the role-swapped a=-1 branch is its better direction. Update
README OOD SI table accordingly. Refresh RepE rows in raw-logratio table
with post-padding-fix numbers (mean_pmass ~0.96, no longer ~0.17); drop
stale pmass caveat block.
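The prompt-pairing check listed above could look roughly like the following. This is a sketch in plain Python over lists of dicts; the helpers key_set and check_pairing and the sample rows are hypothetical, and the real code may operate on data frames instead.

```python
KEY = ("idx", "dilemma_idx", "action_type")


def key_set(rows):
    """Return the set of pairing keys, rejecting duplicated rows."""
    keys = [tuple(row[k] for k in KEY) for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicated pairing keys")
    return set(keys)


def check_pairing(tables):
    """Raise ValueError unless every table carries the same key set."""
    names = list(tables)
    ref = key_set(tables[names[0]])
    for name in names[1:]:
        diff = ref ^ key_set(tables[name])  # symmetric difference
        if diff:
            raise ValueError(f"{names[0]} vs {name}: unmatched keys {sorted(diff)}")


# Hypothetical example: dishonest_prompt has silently lost its last row.
base = [
    {"idx": i, "dilemma_idx": i // 2, "action_type": t}
    for i, t in enumerate(["to_do", "not_to_do", "to_do", "not_to_do"])
]
tables = {"base": base, "honest_prompt": list(base), "dishonest_prompt": base[:-1]}

try:
    check_pairing(tables)
    outcome = "paired"
except ValueError as err:
    outcome = f"mismatch: {err}"
print(outcome)
```

Sorting by idx alone would line up the remaining rows positionally and hide the missing one; comparing key sets first makes the failure explicit.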
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Pair prompt baselines as alpha=-1/0/+1 (dishonest/base/honest) under
simple and engineered families, giving full bidirectional SI for
prompts (same as dW)
- Add SI_best = max(si_fwd, si_rev) * pmass^2 * 100 -- sign-aligned
upper bound (snooping-aware robustness probe)
- Add SI_k1 (symmetric, breaks weighted 1x) alongside default SI_k2
to expose how much the class-imbalance-driven 2x penalty contributes
- Expose fix_rate / broke_rate columns so the SI components are visible
- Add IID syc table (held-out persona claims) using
cross_adapter_ablation/sycophancy_per_row.csv with variant=full_all_tensors
- Add raw mean +- std logratio table per (method, coeff) for OOD
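For orientation, a sketch of how the SI variants above relate. It assumes the unscaled index has the form fix_rate - k * broke_rate, which this message implies but does not state; the function names and example numbers are illustrative only.

```python
def si_raw(fix_rate, broke_rate, k_fpr=2.0):
    """Unscaled index: fixes minus k-weighted breaks (assumed form)."""
    return fix_rate - k_fpr * broke_rate


def si_scaled(raw, pmass):
    """Apply the pmass^2 * 100 scaling used for the reported numbers."""
    return raw * pmass**2 * 100


def si_best(si_fwd, si_rev, pmass):
    """Sign-aligned upper bound: the better of the two directions."""
    return max(si_fwd, si_rev) * pmass**2 * 100


# Hypothetical rates, not taken from any table.
fix_rate, broke_rate, pmass = 0.30, 0.10, 0.9

si_k2 = si_scaled(si_raw(fix_rate, broke_rate, k_fpr=2.0), pmass)
si_k1 = si_scaled(si_raw(fix_rate, broke_rate, k_fpr=1.0), pmass)
print(f"SI_k2={si_k2:.2f}  SI_k1={si_k1:.2f}  gap={si_k1 - si_k2:.2f}")
```

The gap between SI_k1 and SI_k2 is broke_rate * pmass^2 * 100 under this assumed form, which is the share of the score attributable to the extra 1x penalty.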
The IID/OOD split shows that dW interventions land hard on IID (PiSSA is the
largest, at +5.7 mean shift), but most break OOD via the broke_rate channel. OFT and
engineered prompts are the only methods with non-negative SI_best.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Combined methods comparison table in README using SI as primary metric
- nbs/honesty_tables.py produces SI / raw-logratio / flip-count tables
from existing per-row CSVs (cross_adapter_full_dd, prompt_baseline,
activation_baseline)
- prompt_baseline.py: si_fwd computed inline for prompt methods
- activation_baseline.py: tok.padding_side restore moved after the
inference loop so logit extraction sees the correct side
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces v7's post-hoc 'pct_w_oracle = R_w / R_w_ceiling' (a ratio of two
concentration ratios) with a per-row pct_oracle: candidate's energy_frac
divided by the optimal rank-r_eff subspace's energy_frac on the same
target. Rank-honest: chars_clusters (r_eff=7) is graded against the rank-7
oracle, not the rank-8 one. Activation oracle = PCA of L2-normalized hs_diff_B
(matches existing energy_frac_act formula).
Result: every non-oracle candidate lands at pct_oracle in [0.02, 0.11] on
both axes. Best joint = WNR_union_TaskDiff at 0.089 (rank 16; all others
rank 8). chars_clusters and layer_clean_resid_pca tied at ~0.085. This is
a clean negative result: LoRA's task-specific delta is far from any of
our hand-built linear primitives' spans.
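A sketch of the per-row pct_oracle computation described above, assuming NumPy. It takes energy_frac to be the fraction of squared Frobenius norm captured by projecting onto a subspace and uses uncentered PCA (top right singular vectors) as the oracle; both are assumptions, and the helper names and random data are illustrative.

```python
import numpy as np


def orthonormal_basis(Q, tol=1e-10):
    """Orthonormal basis for the column space of Q; its width is r_eff."""
    u, s, _ = np.linalg.svd(Q, full_matrices=False)
    return u[:, s > tol * s.max()]


def energy_frac(X, B):
    """Fraction of X's squared Frobenius norm captured by orthonormal columns B."""
    return float(np.linalg.norm(X @ B) ** 2 / np.linalg.norm(X) ** 2)


def pct_oracle(X, Q):
    """Candidate energy_frac divided by the best rank-r_eff energy_frac on X."""
    B = orthonormal_basis(Q)
    r_eff = B.shape[1]
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    oracle = vt[:r_eff].T  # optimal rank-r_eff subspace for X
    return energy_frac(X, B) / energy_frac(X, oracle), r_eff


rng = np.random.default_rng(0)
hs_diff = rng.standard_normal((64, 32))  # stand-in for hs_diff_B
X = hs_diff / np.linalg.norm(hs_diff, axis=1, keepdims=True)  # L2-normalize rows

Q = rng.standard_normal((32, 8))
Q[:, 7] = Q[:, 0]  # duplicate column: nominal rank 8, effective rank 7
score, r_eff = pct_oracle(X, Q)
print(f"r_eff={r_eff}  pct_oracle={score:.3f}")
```

Grading against the rank-r_eff oracle rather than the nominal rank keeps a rank-deficient candidate from being penalized for dimensions it never had.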
Addresses three concerns from docs/review/v6_hypothesis_review.md:
1. R_w split into oproj/downproj + Frobenius-balanced combined.
2. dW_left_basis_ceiling as the true weight oracle.
3. axis_kind tag (write/read/mixed/ceiling).
Single-seed result: chars_clusters and attn_min_taskdiff are top-5 by both R_act
and R_w_combined. Write-family bases (write/mlp_write/global_write) all have
R_w_combined ~ 1.0 (random null) -- natural weight-side bases fail the
weight-axis test. Multi-seed deferred to v7b.