weight-steering/logs/hypothesis_sweep_v7_run.log
wassname 3c9fb8d1f5 v7 sweep: per-tensor R_w + true weight ceiling + axis_kind tag
Addresses three concerns from docs/review/v6_hypothesis_review.md:
1. R_w split into oproj/downproj + Frobenius-balanced combined.
2. dW_left_basis_ceiling as the true weight oracle.
3. axis_kind tag (write/read/mixed/ceiling).

Single-seed result: among A-hypotheses, chars_clusters and attn_min_taskdiff are
top-5 by both R_act and R_w_combined. Write-family bases (write/mlp_write/global_write)
all have R_w_combined ~ 0.95, at or just below the random null of 1.0, so the natural
weight-side bases fail the weight-axis test. Multi-seed deferred to v7b.
2026-04-26 19:55:42 +08:00


`torch_dtype` is deprecated! Use `dtype` instead!
Loading weights: 0%| | 0/311 [00:00<?, ?it/s]
Loading weights: 41%|████ | 127/311 [00:00<00:00, 1268.59it/s]
Loading weights: 100%|██████████| 311/311 [00:00<00:00, 1842.42it/s]
loaded Qwen/Qwen3-0.6B | layers=28 | d_model=1024 | LoRA tensors=98 | W_PATH=out/sycophancy/lora/w.pt
capturing B-side label and A-side activations
captured label=(28, 16, 1024) | clean=(28, 16, 1024) | up=(28, 16, 1024) | attn_tokens=(28, 16, 59, 1024)
built 42 A-side candidates + ceiling
lora_weight_tensors layer=0 dropped: [('model.layers.0.self_attn.o_proj.weight', 'missing-from-LoRA'), ('model.layers.0.mlp.down_proj.weight', 'missing-from-LoRA')]
weight ceiling (dW_left_basis): combined=16.390 oproj=22.521 downproj=10.258 SHOULD: combined ~ d_model/PCS = 128 (or close); oproj/downproj near same. ELSE per-tensor split or null normalization is wrong.
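The `d_model/PCS = 128` expectation in the line above follows if the concentration ratio is the energy fraction captured by a rank-k subspace divided by the random-subspace null k/d_model (here k = 8, d_model = 1024). That definition is not printed in this log; it is inferred from the logged columns, where for example a mean energy fraction of 0.143 times 128 is about 18.3, close to the logged +18.337. A minimal sketch under that assumption, with the hypothetical helper name `concentration`:

```python
import numpy as np

def concentration(delta, basis):
    """Energy fraction of `delta` inside span(basis), divided by the null k/d.

    Assumed definition (inferred from the log, not taken from the sweep script).
    delta: (d, n) matrix whose columns are the vectors being explained.
    basis: (d, k) matrix whose columns span the candidate subspace.
    """
    d, k = basis.shape
    q, _ = np.linalg.qr(basis)  # orthonormal basis for the same span
    energy_frac = np.linalg.norm(q.T @ delta) ** 2 / np.linalg.norm(delta) ** 2
    return energy_frac / (k / d), energy_frac

d_model, k = 1024, 8
rng = np.random.default_rng(0)
basis = rng.standard_normal((d_model, k))

# A delta lying entirely inside the subspace hits the ceiling d_model / k.
inside = basis @ rng.standard_normal((k, 16))
conc_in, frac_in = concentration(inside, basis)

# A random delta captures roughly k / d_model of its energy, so the ratio is near 1.
conc_rand, _ = concentration(rng.standard_normal((d_model, 16)), basis)

print(f"inside: conc={conc_in:.3f} energy_frac={frac_in:.3f}")
print(f"random: conc={conc_rand:.3f}")
```

Under this reading, the logged combined ceiling of 16.390 means the `dW_left_basis` subspace holds roughly 13% of the weight-delta energy rather than all of it, which is the mismatch the SHOULD/ELSE check flags.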
BLUF v7 joint act+weight (write/mixed only, ranked by joint_score):
| | subspace | family | axis_kind | source | kind | mean_conc_act | mean_z_act | mean_energy_frac_act | mean_conc_w_oproj | mean_conc_w_downproj | mean_conc_w_combined | mean_z_w_oproj | mean_z_w_downproj | mean_z_w_combined | mean_cos_dW | mean_rank | joint_score | act_w_gap_log2 | pct_act_ceiling | pct_w_oracle_combined | pct_w_oracle_oproj | pct_w_oracle_downproj |
|----|------------------------------|-------------------|-------------|------------------|--------------|-----------------|--------------|------------------------|---------------------|------------------------|------------------------|------------------|---------------------|---------------------|---------------|-------------|---------------|------------------|-------------------|-------------------------|----------------------|-------------------------|
| 0 | dW_left_basis_ceiling | ceiling | ceiling | B-side | ceiling | +18.337 | +52.553 | +0.143 | +22.521 | +10.258 | +16.390 | +330.246 | +285.502 | +344.727 | +1.000 | +8.000 | +17.336 | +0.162 | +107.864 | +100.000 | +100.000 | +100.000 |
| 1 | TaskDiff_lora_ceiling | ceiling | ceiling | B-side | ceiling | +17.000 | +48.824 | +0.133 | +1.046 | +1.008 | +1.027 | +0.824 | +0.370 | +0.697 | +0.078 | +8.000 | +4.178 | +4.049 | +100.000 | +6.266 | +4.645 | +9.825 |
| 2 | chars_clusters | act:cluster | mixed | external-v6-plan | A-hypothesis | +11.842 | +32.390 | +0.093 | +1.198 | +1.111 | +1.155 | +2.729 | +3.424 | +3.207 | +0.082 | +8.000 | +3.698 | +3.358 | +69.662 | +7.045 | +5.321 | +10.832 |
| 3 | layer_clean_resid_pca | act:baseline | mixed | v5 | A-hypothesis | +13.030 | +35.774 | +0.102 | +1.044 | +1.019 | +1.031 | +0.699 | +0.569 | +0.681 | +0.077 | +8.000 | +3.666 | +3.659 | +76.647 | +6.292 | +4.635 | +9.930 |
| 4 | attn_min_taskdiff | act:attn-selected | mixed | external-v6-plan | A-hypothesis | +7.467 | +19.521 | +0.058 | +1.453 | +1.145 | +1.299 | +6.816 | +4.913 | +6.611 | +0.086 | +8.000 | +3.114 | +2.524 | +43.926 | +7.924 | +6.450 | +11.159 |
| 5 | gate_kernel | W:MLP | write | external-v6-plan | A-hypothesis | +9.744 | +26.901 | +0.076 | +0.966 | +0.982 | +0.974 | -0.464 | -0.664 | -0.506 | +0.074 | +8.000 | +3.080 | +3.323 | +57.318 | +5.942 | +4.290 | +9.569 |
| 6 | qk_x_chars_clusters | compound | mixed | external-v6-plan | A-hypothesis | +7.008 | +17.932 | +0.055 | +1.107 | +1.064 | +1.086 | +1.474 | +1.938 | +1.778 | +0.082 | +8.000 | +2.758 | +2.690 | +41.221 | +6.623 | +4.914 | +10.376 |
| 7 | attn_max_taskdiff | act:attn-selected | mixed | external-v6-plan | A-hypothesis | +6.723 | +17.309 | +0.053 | +1.185 | +1.054 | +1.120 | +2.923 | +1.896 | +2.799 | +0.081 | +8.000 | +2.744 | +2.586 | +39.546 | +6.831 | +5.263 | +10.274 |
| 8 | attn_min_x_diffnorm_taskdiff | act:attn-selected | mixed | external-v6-plan | A-hypothesis | +6.385 | +16.336 | +0.050 | +1.221 | +1.065 | +1.143 | +3.512 | +2.240 | +3.297 | +0.081 | +8.000 | +2.702 | +2.482 | +37.560 | +6.975 | +5.421 | +10.387 |
| 9 | gate_active_written | act:MLP | mixed | external-v6-plan | A-hypothesis | +7.153 | +18.553 | +0.056 | +0.951 | +0.966 | +0.959 | -0.754 | -1.030 | -0.968 | +0.073 | +8.000 | +2.619 | +2.900 | +42.078 | +5.849 | +4.223 | +9.420 |
| 10 | attn_diff_taskdiff | act:attn-selected | mixed | external-v6-plan | A-hypothesis | +6.200 | +15.692 | +0.048 | +1.116 | +1.030 | +1.073 | +1.699 | +1.054 | +1.580 | +0.079 | +8.000 | +2.579 | +2.531 | +36.474 | +6.545 | +4.953 | +10.040 |
| 11 | write | W:write | write | v5 | A-hypothesis | +6.943 | +18.031 | +0.054 | +0.941 | +0.966 | +0.954 | -0.908 | -0.894 | -1.098 | +0.074 | +8.000 | +2.573 | +2.864 | +40.840 | +5.819 | +4.179 | +9.418 |
| 12 | mlp_write | W:write | write | v5 | A-hypothesis | +6.848 | +17.770 | +0.053 | +0.946 | +0.954 | +0.950 | -0.807 | -1.230 | -1.131 | +0.073 | +8.000 | +2.550 | +2.850 | +40.280 | +5.796 | +4.199 | +9.302 |
| 13 | global_write | W:write | write | v5 | A-hypothesis | +6.846 | +17.612 | +0.053 | +0.966 | +0.927 | +0.946 | -0.681 | -2.294 | -1.454 | +0.070 | +8.000 | +2.545 | +2.855 | +40.269 | +5.774 | +4.289 | +9.034 |
| 14 | write_not_global_read | W:write-not-read | write | v5 | A-hypothesis | +6.299 | +15.977 | +0.049 | +0.975 | +0.974 | +0.975 | -0.403 | -0.669 | -0.617 | +0.076 | +8.000 | +2.477 | +2.692 | +37.051 | +5.946 | +4.329 | +9.495 |
| 15 | write_not_downstream_read | W:write-not-read | write | v5 | A-hypothesis | +6.026 | +15.234 | +0.047 | +0.976 | +0.979 | +0.977 | -0.362 | -0.525 | -0.550 | +0.077 | +8.000 | +2.427 | +2.624 | +35.446 | +5.964 | +4.335 | +9.541 |
| 16 | global_write_not_global_read | W:write-not-read | write | v5 | A-hypothesis | +6.149 | +15.461 | +0.048 | +0.952 | +0.920 | +0.936 | -0.850 | -2.517 | -1.630 | +0.069 | +8.000 | +2.399 | +2.715 | +36.171 | +5.712 | +4.229 | +8.969 |
| 17 | write_not_lm_head_read | W:write-not-read | write | v5 | A-hypothesis | +5.518 | +13.662 | +0.043 | +0.969 | +0.993 | +0.981 | -0.457 | -0.165 | -0.445 | +0.076 | +8.000 | +2.327 | +2.492 | +32.459 | +5.987 | +4.303 | +9.682 |
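The derived columns in the table above can be reproduced from the base concentration columns to rounding. The relations below are inferred from the logged values, not read from the sweep script: `mean_conc_w_combined` matches the arithmetic mean of the two per-tensor values, `joint_score` matches the geometric mean of the activation and combined weight concentrations, `act_w_gap_log2` matches the log2 of their ratio, and the `pct_*` columns match percentages of the corresponding ceiling row. Whether the script applies these per layer before averaging is not visible here. The function name `derived` is hypothetical.

```python
import math

def derived(conc_act, conc_w_oproj, conc_w_downproj, act_ceiling, w_oracle_combined):
    """Recompute derived columns from base columns (relations inferred from the log)."""
    conc_w_combined = (conc_w_oproj + conc_w_downproj) / 2
    return {
        "conc_w_combined": conc_w_combined,
        "joint_score": math.sqrt(conc_act * conc_w_combined),
        "act_w_gap_log2": math.log2(conc_act / conc_w_combined),
        "pct_act_ceiling": 100 * conc_act / act_ceiling,
        "pct_w_oracle_combined": 100 * conc_w_combined / w_oracle_combined,
    }

# chars_clusters row; ceilings come from TaskDiff_lora_ceiling (activation side)
# and dW_left_basis_ceiling (weight side).
row = derived(11.842, 1.198, 1.111, act_ceiling=17.000, w_oracle_combined=16.390)
for name, value in row.items():
    print(f"{name}: {value:+.3f}")
```

The results agree with the logged chars_clusters row (+1.155, +3.698, +3.358, +69.662, +7.045) up to the rounding of the three-decimal inputs.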
v7 read-side rows (R_w means cross-space alignment, not 'explains delta'):
| | subspace | family | axis_kind | source | kind | mean_conc_act | mean_z_act | mean_energy_frac_act | mean_conc_w_oproj | mean_conc_w_downproj | mean_conc_w_combined | mean_z_w_oproj | mean_z_w_downproj | mean_z_w_combined | mean_cos_dW | mean_rank | joint_score | act_w_gap_log2 | pct_act_ceiling | pct_w_oracle_combined | pct_w_oracle_oproj | pct_w_oracle_downproj |
|----|-------------------------|-----------|-------------|------------------|--------------|-----------------|--------------|------------------------|---------------------|------------------------|------------------------|------------------|---------------------|---------------------|---------------|-------------|---------------|------------------|-------------------|-------------------------|----------------------|-------------------------|
| 0 | lm_head_read | W:unembed | read | v5 | A-hypothesis | +2.058 | +3.323 | +0.016 | +0.900 | +0.936 | +0.918 | -1.522 | -2.041 | -1.891 | +0.068 | +8.000 | +1.375 | +1.165 | +12.108 | +5.603 | +3.998 | +9.126 |
| 1 | global_read | W:read | read | v5 | A-hypothesis | +1.497 | +1.596 | +0.012 | +0.976 | +1.005 | +0.990 | -0.393 | +0.130 | -0.246 | +0.075 | +8.000 | +1.218 | +0.596 | +8.807 | +6.043 | +4.333 | +9.797 |
| 2 | input_super | W:read | read | external-v6-plan | A-hypothesis | +1.320 | +1.029 | +0.010 | +1.016 | +1.025 | +1.021 | +0.241 | +0.750 | +0.445 | +0.083 | +8.000 | +1.160 | +0.371 | +7.763 | +6.227 | +4.511 | +9.993 |
| 3 | input_super_not_lm_read | W:read | read | external-v6-plan | A-hypothesis | +1.321 | +1.019 | +0.010 | +1.015 | +1.022 | +1.018 | +0.233 | +0.661 | +0.402 | +0.082 | +8.000 | +1.160 | +0.376 | +7.772 | +6.212 | +4.505 | +9.961 |
| 4 | logits_null | W:unembed | read | v5 | A-hypothesis | +1.443 | +1.382 | +0.011 | +0.914 | +0.912 | +0.913 | -1.420 | -2.952 | -2.144 | +0.065 | +8.000 | +1.148 | +0.660 | +8.487 | +5.570 | +4.059 | +8.888 |
| 5 | mlp_up_read | W:read | read | v5 | A-hypothesis | +1.280 | +0.947 | +0.010 | +1.009 | +1.027 | +1.018 | +0.145 | +0.853 | +0.410 | +0.085 | +8.000 | +1.141 | +0.330 | +7.527 | +6.212 | +4.480 | +10.015 |
| 6 | mlp_gate_read | W:read | read | v5 | A-hypothesis | +1.208 | +0.729 | +0.009 | +1.054 | +1.052 | +1.053 | +0.836 | +1.653 | +1.193 | +0.086 | +8.000 | +1.128 | +0.198 | +7.105 | +6.425 | +4.680 | +10.257 |
| 7 | attn_qkv_read | W:read | read | v5 | A-hypothesis | +1.216 | +0.711 | +0.010 | +0.954 | +0.987 | +0.970 | -0.678 | -0.468 | -0.707 | +0.075 | +8.000 | +1.086 | +0.326 | +7.153 | +5.920 | +4.235 | +9.621 |
| 8 | kv_super | W:read | read | external-v6-plan | A-hypothesis | +0.987 | +0.037 | +0.008 | +0.996 | +1.029 | +1.012 | -0.011 | +1.035 | +0.312 | +0.080 | +8.000 | +1.000 | -0.037 | +5.806 | +6.177 | +4.423 | +10.028 |
BLUF v7 residualized activation specificity:
| | subspace | family | source | kind | mean_specific_conc_act | mean_specific_z_act | mean_specific_energy_frac_act | mean_specific_rank |
|----|------------------------------|-------------------|------------------|--------------|--------------------------|-----------------------|---------------------------------|----------------------|
| 0 | dW_left_basis_ceiling | ceiling | B-side | ceiling | +18.465 | +50.817 | +0.145 | +8.000 |
| 1 | TaskDiff_lora_ceiling | ceiling | B-side | ceiling | +13.670 | +37.795 | +0.108 | +8.000 |
| 2 | gate_kernel | W:MLP | external-v6-plan | A-hypothesis | +8.187 | +21.520 | +0.064 | +8.000 |
| 3 | chars_clusters | act:cluster | external-v6-plan | A-hypothesis | +6.424 | +15.866 | +0.051 | +8.000 |
| 4 | attn_min_taskdiff | act:attn-selected | external-v6-plan | A-hypothesis | +5.827 | +13.884 | +0.046 | +8.000 |
| 5 | write | W:write | v5 | A-hypothesis | +5.371 | +12.740 | +0.042 | +8.000 |
| 6 | mlp_write | W:write | v5 | A-hypothesis | +5.355 | +12.684 | +0.042 | +8.000 |
| 7 | attn_max_taskdiff | act:attn-selected | external-v6-plan | A-hypothesis | +5.210 | +12.211 | +0.041 | +8.000 |
| 8 | global_write | W:write | v5 | A-hypothesis | +5.014 | +11.684 | +0.039 | +8.000 |
| 9 | attn_min_x_diffnorm_taskdiff | act:attn-selected | external-v6-plan | A-hypothesis | +4.875 | +11.271 | +0.038 | +8.000 |
| 10 | write_not_global_read | W:write-not-read | v5 | A-hypothesis | +4.839 | +11.102 | +0.038 | +8.000 |
| 11 | write_not_downstream_read | W:write-not-read | v5 | A-hypothesis | +4.707 | +10.694 | +0.037 | +8.000 |
| 12 | attn_diff_taskdiff | act:attn-selected | external-v6-plan | A-hypothesis | +4.668 | +10.620 | +0.037 | +8.000 |
| 13 | global_write_not_global_read | W:write-not-read | v5 | A-hypothesis | +4.506 | +10.163 | +0.035 | +8.000 |
| 14 | write_not_lm_head_read | W:write-not-read | v5 | A-hypothesis | +4.456 | +10.062 | +0.035 | +8.000 |
| 15 | TaskDiff_contrast | act:persona | v5 | A-hypothesis | +4.298 | +9.608 | +0.034 | +8.000 |
wrote:
out/sycophancy/lora/v7/v7_per_layer.csv (273563 bytes)
out/sycophancy/lora/v7/v7_summary.tsv (12930 bytes)
out/sycophancy/lora/v7/v7_summary_pct.tsv (16162 bytes)
out/sycophancy/lora/v7/v7_specific_per_layer.csv (119625 bytes)
out/sycophancy/lora/v7/v7_specific_summary.tsv (4920 bytes)
out/sycophancy/lora/v7/v7_definitions.md (5424 bytes)
out/sycophancy/lora/v7/v7_plan_merge.md (1772 bytes)
out/sycophancy/lora/v7/v7_conclusion.md (2199 bytes)
out/sycophancy/lora/v7/v7_joint_act_weight_scatter.png (239494 bytes)
out/sycophancy/lora/v7/v7_joint_act_weight_scatter.pdf (23071 bytes)
SHOULD: useful subspaces have R_act>1 and R_w>1; generic activation artifacts show high R_act but weak R_w. ELSE: check basis orientation and LoRA diff tensor selection.
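The SHOULD rule above can be applied mechanically to the summary rows. The sketch below is one literal reading of it, using a strict cut at the null value 1.0 on both axes; the function name `classify` and the label strings are hypothetical, and the cut ignores the spread of the null, which the `mean_z_*` columns capture.

```python
def classify(r_act, r_w, null=1.0):
    """Label a subspace by the log's SHOULD rule: strict > null on both axes."""
    if r_act > null and r_w > null:
        return "useful"
    if r_act > null:
        return "activation-only"  # high R_act but weak R_w
    return "neither"

# (mean_conc_act, mean_conc_w_combined) copied from the tables above.
rows = {
    "chars_clusters": (11.842, 1.155),
    "attn_min_taskdiff": (7.467, 1.299),
    "write": (6.943, 0.954),
    "kv_super": (0.987, 1.012),
}
labels = {name: classify(*vals) for name, vals in rows.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

A row such as layer_clean_resid_pca (R_w_combined +1.031, z +0.681) would also pass this strict cut, so a z-score threshold is likely the safer filter when reading the single-seed numbers.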