Commit Graph

  • 48c1b07b83 readme [dev] wassname 2026-05-05 08:12:41 +08:00
  • cf0f7d6c54 results wassname 2026-05-04 18:33:19 +08:00
  • 7eac38829d hmm wassname 2026-05-04 06:17:30 +08:00
  • 49eba3e853 fix: remove StaticCache from data gen (breaks Qwen3.5-4B hybrid attention) wassname 2026-05-03 18:24:33 +08:00
  • 57a08750b8 fix: on-policy data paths, 4-bit inference, revert adapter defaults wassname 2026-05-03 17:31:09 +08:00
  • 7396bc1544 chore: default adapters to delora-only wassname 2026-05-03 17:24:23 +08:00
  • 553d15b9c3 fix: remove flash_attention_2, revert to Qwen3.5-4B wassname 2026-05-03 17:21:38 +08:00
  • 43278709d7 fix: transformers>=5.6.0, flash-attn locked, switch to Qwen3-4B wassname 2026-05-03 16:54:50 +08:00
  • 8ed3103e47 feat(authority): add authority behavior, logratio+SI metrics, prune dead code wassname 2026-05-03 14:04:23 +08:00
  • 309afaf4d8 feat(auth_care): align ws with steering-lite for cross-repo comparable rows wassname 2026-05-03 08:13:01 +08:00
  • 9dff8d0256 feat: add auth_socn behavior + behavior-aware axis_shift + pmass/flips/bare-logit eval helpers wassname 2026-05-03 06:11:48 +08:00
  • 497ee05aef first pass care vs sanctity wassname 2026-05-03 06:02:07 +08:00
  • aa4fcff446 scripts(readme_tinymfv_table): mirror steering-lite layout wassname 2026-05-02 20:53:19 +08:00
  • aa0b07451d scripts: tinymfv comparison table + calibrated eval wrapper wassname 2026-05-02 19:47:09 +08:00
  • f866618eac feat: trad_care behavior + per-foundation Δlogit (tiny-mfv axis pivot) wassname 2026-05-02 19:43:07 +08:00
  • 0bc46dc51e cuda wassname 2026-05-02 06:04:58 +08:00
  • 4f2034dd46 tidy wassname 2026-05-02 05:52:25 +08:00
  • 71a8d4c555 tidy wassname 2026-05-01 22:29:06 +08:00
  • 63715bbf99 logging wassname 2026-05-01 22:22:09 +08:00
  • b4a8a0351d feat: add n_think parameter to evaluation functions for guided reasoning wassname 2026-05-01 21:13:30 +08:00
  • 27cf12c2d8 Switch AIRisk evals to tiny-mfv workflow wassname 2026-05-01 20:47:31 +08:00
  • a0f4e719af Add batched data gen and bidir calibration wassname 2026-05-01 18:58:08 +08:00
  • b2ef8fef7b wip wassname 2026-04-30 21:06:18 +08:00
  • 44e16b0c9a fix: keep all 438 rows in DD eval (both to_do and not_to_do per dilemma) wassname 2026-04-29 05:58:20 +08:00
  • 93334c5889 fix: match AntiPaSTO prompt format (INSTRUCTION_PROMPT + anchor) wassname 2026-04-29 05:56:00 +08:00
  • ce73e97154 fix: skip guided-CoT for non-thinking models; trim README wassname 2026-04-29 05:39:50 +08:00
  • 5704b00175 gemma4: disable thinking mode via enable_thinking=False in apply_chat_template wassname 2026-04-28 21:47:33 +08:00
  • 08efb837c0 kl_calibrate: greedy-trajectory KL + Illinois regula-falsi root search wassname 2026-04-28 21:23:41 +08:00
  • 7440229d48 narrow honesty: clamp n_personas to list length, expose grid in sweep wassname 2026-04-28 21:23:32 +08:00
  • cce818b03f dilemmas: per-action-type SI breakdown in summary CSV wassname 2026-04-28 21:12:57 +08:00
  • 0f050f2734 honesty: narrow training/prompt/eval to honesty-only axis wassname 2026-04-28 21:11:14 +08:00
  • 06ec48d8f7 KL-budget calibration: match off-task dist-shift across methods wassname 2026-04-28 14:08:55 +08:00
  • 325171c291 fix SI_best, add prompt row-alignment check, narrow dw_decomp claims wassname 2026-04-28 09:17:56 +08:00
  • da75668d6b move RESEARCH_JOURNAL and fork_plan under docs/ wassname 2026-04-28 09:09:52 +08:00
  • e4504da9a5 cleanup: drop stale HANDOVER/RESEARCH_LOG, fix axis line in fork_plan wassname 2026-04-28 08:37:01 +08:00
  • b7bad4e002 DeLoRA dW decomp: magnitude pattern carries most of the steering wassname 2026-04-28 08:33:24 +08:00
  • 19bc3edb2e add dW magnitude/direction ablation eval wassname 2026-04-28 08:31:24 +08:00
  • 64adf9267d SI tables v2: SI_best, SI_k1, fix/broke rates; paired prompts; IID syc wassname 2026-04-28 08:29:49 +08:00
  • 0ded47388f SI tables: README + nbs/honesty_tables.py with adapters/prompts/RepE wassname 2026-04-28 08:25:05 +08:00
  • df61cdc628 add surgical_informedness metric; fix simple_honest_prompt to match training persona wassname 2026-04-28 06:04:06 +08:00
  • a48430b075 switch training/eval axis from sycophancy to honesty wassname 2026-04-28 06:00:03 +08:00
  • c828b0c00b baselines wassname 2026-04-27 19:40:43 +08:00
  • 6ec664995b T6/T7/T8 ablations + lens-search hold pending multiseed wassname 2026-04-27 19:05:20 +08:00
  • db7979d0e2 baselines wassname 2026-04-27 13:02:34 +08:00
  • 8fa9e54eaa docs: rewrite fork plan with UAT tasks wassname 2026-04-27 11:22:52 +08:00
  • a3d999fd92 wip wassname 2026-04-27 09:59:06 +08:00
  • 2f12058b7e clarify tested subspace and parametrization hypotheses wassname 2026-04-27 07:10:39 +08:00
  • b001c40521 document adapter benchmark and projection interpretation wassname 2026-04-27 07:09:02 +08:00
  • 25334ec574 fix daily-dilemmas cross-adapter baseline wassname 2026-04-27 07:00:09 +08:00
  • 6f41e47ea9 v10 functional projection falsifier for act-oracle overlap wassname 2026-04-27 06:54:09 +08:00
  • ff92b092fa research journal: v9 cross-adapter — DeLoRA wins behavior, all subspace methods fail at 1-8% overlap wassname 2026-04-27 06:29:03 +08:00
  • 236cea1267 cross_adapter_v9: aggregate v9 scope diagnostics + dilemmas across adapters wassname 2026-04-26 21:55:19 +08:00
  • 3f162027b1 v9 sweep: block-local act oracle + L=8 sanity (layer-scope diagnostic); ADAPTER env var for cross-adapter use wassname 2026-04-26 21:53:45 +08:00
  • 2c262d47a8 v8 polish: w_oracle + act_oracle (each saturates own axis), 3-panel scatter + bar of % to ideal wassname 2026-04-26 21:14:01 +08:00
  • f4039dd2ee v8: rank-honest pct_oracle metric (energy_frac / oracle@r_eff in [0,1]) wassname 2026-04-26 20:59:27 +08:00
  • 651ad132d3 v7: cold-eyes evidence review + flag write-family-below-null in conclusion wassname 2026-04-26 20:01:11 +08:00
  • 3c9fb8d1f5 v7 sweep: per-tensor R_w + true weight ceiling + axis_kind tag wassname 2026-04-26 19:55:42 +08:00
  • a1b38dc456 docs: add v6 hypothesis review (subagent + reviewer-of-reviewer) wassname 2026-04-26 19:45:13 +08:00
  • aba74c0f64 logs wassname 2026-04-26 11:12:11 +08:00
  • c1c4d2f3cb nb wassname 2026-04-26 11:03:38 +08:00
  • 7be1487d7b data recipe: drop n_pairs/judge/Optional knobs, explicit grid wassname 2026-04-26 10:24:31 +08:00
  • 7e1b171875 paper data recipe + LoRA hyperparams + n_pairs hardening wassname 2026-04-26 10:19:59 +08:00
  • ddfa018ebd wip wassname 2026-04-26 10:00:03 +08:00
  • f4083d74ac Enhance fork plan and add guided-CoT evaluation wassname 2026-04-26 09:16:54 +08:00
  • 00efc55b07 README: fork notice + pipeline overview [main] wassname 2026-04-25 20:16:57 +08:00
  • 3ff283d535 README: fork notice + pipeline overview wassname 2026-04-25 20:16:57 +08:00
  • 7527688a40 phase 0-2: HF+PEFT pipeline, smoke, subspace alignment wassname 2026-04-25 20:14:07 +08:00
  • 363e2db14d phase 0-2: HF+PEFT pipeline, smoke, subspace alignment wassname 2026-04-25 20:14:07 +08:00
  • 4ad6971038 tidy wassname 2026-04-25 19:27:53 +08:00
  • f0bce8be90 tidy wassname 2026-04-25 19:27:53 +08:00
  • 1c0152910a Update README.md Constanza 2025-11-11 09:18:45 +01:00
  • 977c054586 Update README.md Constanza 2025-11-11 09:18:45 +01:00
  • 3d61ae0452 update code and add configs cfierro94 2025-11-05 10:13:04 +01:00
  • 144fa5532d first commit cfierro94 2025-10-17 11:15:28 +02:00
  • 90065f035f first commit cfierro94 2025-10-17 11:14:24 +02:00