53 Commits

Author SHA1 Message Date
wassname 8b99b2dca0 docs: shorten scenario suffix appendix 2026-06-25 13:56:35 +08:00
wassname cd695c411b docs: improve quick-scroll README 2026-06-25 13:36:00 +08:00
wassname 8162aa1ee9 docs: widen Quarto HTML layout 2026-06-25 13:27:21 +08:00
wassname afbfbf514f docs: add interactive refusal tables 2026-06-25 13:23:34 +08:00
wassname cfcb57b9ce docs: use one Quarto source for README and Pages 2026-06-25 13:06:12 +08:00
wassname 024fb3d545 docs: track model matrix inputs for Pages render 2026-06-25 12:45:58 +08:00
wassname bcbc1d0f65 docs: render Pages with Quarto workflow 2026-06-25 12:44:39 +08:00
wassname 22dd2c2df9 docs: rank README result tables by t-stat 2026-06-25 12:33:11 +08:00
wassname caceaebbf0 docs: streamline README and add interactive Pages plot 2026-06-25 12:31:50 +08:00
wassname d31cac9068 docs: simplify model matrix visualization 2026-06-25 12:20:35 +08:00
wassname 026b22e131 docs: simplify model matrix ranking 2026-06-25 11:54:06 +08:00
wassname 2f62327acc docs: render README with Quarto 2026-06-25 11:44:04 +08:00
wassname 026a57e246 docs: make README tables rerenderable 2026-06-25 11:31:49 +08:00
wassname 2f7184f609 eval: summarize refusal probe model matrix 2026-06-25 11:12:12 +08:00
wassname da435ccb67 eval: add refusal probe axes 2026-06-25 10:30:33 +08:00
wassname a2b0bcbc76 eval: add roleplay context stress templates 2026-06-25 10:24:20 +08:00
wassname 85b4a6f354 eval: refresh stress template results 2026-06-25 09:58:23 +08:00
wassname fffab4e25a fix: normalize new stress templates 2026-06-25 09:52:46 +08:00
wassname (Michael J Clark) 3745b280f2 Update template_catalog.yaml 2026-06-24 21:01:29 +08:00
wassname a88acae536 docs: add persona prior-art guide 2026-06-23 10:32:20 +08:00
wassname 234ea38eda docs: add persona selection guide 2026-06-23 10:18:14 +08:00
wassname (Michael J Clark) 55321e6799 Merge pull request #1 from wassname/add-w2s-character-axes-and-prompts
Add w2schar-mini character axes + 3p-observer prompts + axis-generability finding
2026-06-21 13:10:00 +08:00
wassname-claude 6b272b8c86 Make validator honor self-contained scenario prompts (fixes 3p suffix clash)
The 52 3p-observer scenarios carry their own question + length
("...what does the actor do next? Answer in one or two sentences."),
but _generation_prompt always appended a 1p "What do you do? Answer in
80-140 words." -- two conflicting questions and a POV flip, so the data
could not reproduce the appendix's 3p measurement through this script
(caught in review).

Add a "self_contained" flag on scenario rows; when set, _generation_prompt
emits the scenario verbatim instead of appending its default 1p question.
Bare-vignette families without the flag are unchanged. README appendix
rewritten to lead with the runnable command and explain the flag.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-21 04:25:13 +00:00
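
The "self_contained" behaviour described in commit 6b272b8c86 can be sketched roughly as follows. This is a minimal illustration, not the repository's actual `_generation_prompt`: the function name `generation_prompt_sketch`, the `scenario` row key, the blank-line separator, and the example vignettes are all assumptions made for the example. Only the flag name and the two question strings are taken from the commit message.

```python
# Default first-person question, as quoted in the commit message.
DEFAULT_QUESTION = "What do you do? Answer in 80-140 words."


def generation_prompt_sketch(row: dict) -> str:
    """Build a generation prompt from a scenario row (hypothetical layout).

    Rows flagged "self_contained" already carry their own question and
    length instruction, so they are emitted verbatim. All other rows get
    the default first-person question appended.
    """
    scenario = row["scenario"].strip()  # "scenario" key is an assumption
    if row.get("self_contained", False):
        return scenario
    # Separator between vignette and question is an assumption.
    return f"{scenario}\n\n{DEFAULT_QUESTION}"


# Illustrative rows; the vignette text is made up for this example.
observer_row = {
    "scenario": (
        "A coworker finds a wallet on the train. "
        "What does the actor do next? Answer in one or two sentences."
    ),
    "self_contained": True,
}
bare_row = {"scenario": "You find a wallet on the train."}

observer_prompt = generation_prompt_sketch(observer_row)
bare_prompt = generation_prompt_sketch(bare_row)
```

With the flag set, the third-person prompt keeps its single question and point of view. Without it, the bare vignette still receives the default first-person question, which matches the commit's note that unflagged families are unchanged.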
wassname-claude 852c441762 Correct 1p speculation with tested result: first-person prompts make it worse
Tested the "abstract axes need first-person prompts" hypothesis from the prior
commit. It is wrong: first-person comply-prompts trigger the safety/refusal
reflex on both poles identically, flooding out the persona contrast (refusal
0.83-1.00, honest_when_uncomfortable 8.0->1.3, action_over_talk 3.0->0.3). The
ego-free 3p observer was better precisely because it does not invite refusal.
Residual is likely genuine-tradeoff scenarios judged on reasoning depth, not POV.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-21 04:10:15 +00:00
wassname-claude d2441ad3a8 Add w2schar-mini character axes + 3p-observer prompts + axis-generability finding
27 character persona axes synthesized from the Forethought AI-character essay
(Appendix 2) and a character-inspirations doc, plus 52 third-person-observer
scenario prompts (tiny-mfv / Clifford-2015 vignettes) they were measured on.

README appendix documents an axis x prompt-POV interaction that sharpens the
existing "the subtle axis still mostly fails" note: concrete action/disposition
axes separate (avoid_power 8.0, honest_when_uncomfortable 8.0, action_over_talk
3.0), while abstract how/signaling axes flatline through every {persona} template
(principle_not_signaling 0.0, weigh_who_is_affected 0.0, perspective_taking
0.75). Mechanism is pole-generability compounded by POV: a 3p-observer prompt
offers no act-vs-signal fork and the model won't role-play performative
non-action. Reframing the neg pole concretely rescues it (principle_not_signaling
0.0 -> action_over_talk 3.0, same idea, behaviour vs abstraction).

Separate in-house measurement (qwen3.5-27b gen, gemini-flash-lite judge, n=4),
flagged as NOT comparable to the seed-24 pilot table.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-21 04:04:20 +00:00
wassname (Michael J Clark) d15183742c Update README.md 2026-06-17 05:10:35 +08:00
wassname f894a35fc3 fix: preserve template provenance in hf main 2026-06-13 20:54:21 +08:00
wassname (Michael J Clark) f4905cf8f4 Update README.md 2026-06-13 20:49:40 +08:00
wassname d91eda0228 eval: test engineered prefixes as templates 2026-06-13 20:43:44 +08:00
wassname 671c6258ce docs: include engineered baseline in scoreboard 2026-06-13 20:05:19 +08:00
wassname 15d7caa607 eval: judge identical controls uniformly 2026-06-13 20:00:49 +08:00
wassname 45c0f24022 eval: clean axes and audit persona leakage 2026-06-13 19:46:24 +08:00
wassname 562c8fd0f0 docs: keep generated stats out of data 2026-06-13 19:12:24 +08:00
wassname 8dbc02066b eval: rerun dual judges and refresh results 2026-06-13 19:12:24 +08:00
wassname e2546fe0ab eval: refine judge rubric and README baselines 2026-06-13 19:12:24 +08:00
wassname ede354f07a eval: add dual judges and controls 2026-06-13 19:12:24 +08:00
wassname d1ee948760 tidy 2026-06-13 19:12:24 +08:00
wassname (Michael J Clark) 0056ba8cd2 Update README.md 2026-06-13 19:05:06 +08:00
wassname 4675e9782f tidy and image 2026-06-13 17:45:50 +08:00
wassname f55ba7576f misc 2026-06-13 17:36:16 +08:00
wassname 849b1de0b1 clarify persona template scoring 2026-06-13 15:28:53 +08:00
wassname (Michael J Clark) 51b67ac99c Update README.md 2026-06-13 15:04:26 +08:00
wassname 5b92bdf7a7 expand confound audit docs 2026-06-13 14:43:03 +08:00
wassname ae3fc096d7 add source urls and confound audits 2026-06-13 14:39:45 +08:00
wassname de071e79ca use normalized score components 2026-06-13 14:34:02 +08:00
wassname bce30daee9 make main dataset table human-facing 2026-06-13 14:28:10 +08:00
wassname 1461e930e5 simplify public readme 2026-06-13 14:23:47 +08:00
wassname 6a19b65e49 add clean score tables 2026-06-13 14:05:26 +08:00
wassname 9b1a6e7573 simplify public docs and parquet upload 2026-06-13 13:55:43 +08:00
wassname 2c86dee10f add measured v2 pilot stats 2026-06-13 10:13:14 +08:00