From 51b67ac99ca66066f47da3da2c0cc58bbf4f15af Mon Sep 17 00:00:00 2001
From: "wassname (Michael J Clark)" <1103714+wassname@users.noreply.github.com>
Date: Sat, 13 Jun 2026 15:04:26 +0800
Subject: [PATCH] Update README.md

---
 README.md | 40 ++++++++++++++++------------------------
 1 file changed, 16 insertions(+), 24 deletions(-)

diff --git a/README.md b/README.md
index 6526cd6..b8cfbc8 100644
--- a/README.md
+++ b/README.md
@@ -25,8 +25,8 @@ This field is pre-scientific in a way: it is still an art. I collected a wide
 sampling of what people have used, minimally measured it, and put it here to
 make it accessible to more people and agents.
 
-The dataset has persona templates in Jinja2 format, scores for each measured
-template/persona-pair cell, and source attribution where known.
+
+
 
 ## Use
 
@@ -35,7 +35,7 @@ first: one row per measured template/persona-pair cell.
 
 Important columns:
 
-- `template_jinja`
+- `template_jinja`: TODO Example for each, description of each
 - `score`
 - `positive_persona`
 - `negative_persona`
@@ -57,6 +57,17 @@ kept as audit columns rather than folded into the headline score.
 
 ## Confounds Audited
 
+> My intuition is that many of these are RLHF-ish side effects: helpfulness,
+harmless refusals, honesty tone, sycophancy, polished vagueness, and generic
+assistant style can be large, easy-to-trigger axes that show up instead of the
+thing you meant. - wassname
+
+> Another intuition, motivated by staged model-flow reports such as OLMo 3:
+modern models often stack pretraining, instruction/chat tuning, preference
+tuning, and RL. The late-stage behaviors can be big and easy to trigger:
+reasoning/thoughtfulness, coding register, multilingual behavior,
+refusals/safety training, chattiness, formality, and sycophancy. - wassname
+
 The judge audits length, generic helpfulness, harmlessness/refusal,
 honesty/truthfulness, thoughtfulness/reasoning depth, task-context shift
 (code/chat/math/think), coding style, multilingual behavior, confidence,
@@ -70,25 +81,13 @@ style, multilinguality, verbosity, chattiness, confidence, hedging, vagueness,
 warmth, enthusiasm, praise, sycophancy, directness, formality, language shift,
 and incoherence.
 
-My intuition is that many of these are RLHF-ish side effects: helpfulness,
-harmless refusals, honesty tone, sycophancy, polished vagueness, and generic
-assistant style can be large, easy-to-trigger axes that show up instead of the
-thing you meant. - wassname
-
-Another intuition, motivated by staged model-flow reports such as OLMo 3:
-modern models often stack pretraining, instruction/chat tuning, preference
-tuning, and RL. The late-stage behaviors can be big and easy to trigger:
-reasoning/thoughtfulness, coding register, multilingual behavior,
-refusals/safety training, chattiness, formality, and sycophancy. - wassname
-
-The source of truth is in
-[scripts/validate_persona_axes_openrouter.py](scripts/validate_persona_axes_openrouter.py#L474).
+Code [scripts/validate_persona_axes_openrouter.py](scripts/validate_persona_axes_openrouter.py#L474).
 
 ## Provenance
 
 Sources are marked in the dataset as `source`, `source_type`, and `source_url`.
 Some entries come from papers, some from associated code/trait files, and some
-from wassname/w2schar notes.
+from wassname project notes.
 
 ## Acknowledgements
 
@@ -129,11 +128,4 @@ uv run python scripts/build_hf_dataset.py \
   year = {2026},
   url = {https://github.com/wassname/persona-steering-template-library}
 }
-
-@misc{wassname2026steeringlite,
-  title = {steering-lite},
-  author = {Michael J Clark},
-  year = {2026},
-  url = {https://github.com/wassname/steering-lite}
-}
 ```