add source urls and confound audits

wassname
2026-06-13 14:39:45 +08:00
parent de071e79ca
commit ae3fc096d7
5 changed files with 182 additions and 46 deletions
+34 -3
@@ -55,11 +55,42 @@ High score means the template/persona-pair cell moved the intended axis and did
not look off-axis to the judge. Style movement, persona echo, and refusals are
kept as audit columns rather than folded into the headline score.
## Confounds Audited
The judge audits length, generic helpfulness, harmlessness/refusal,
honesty/truthfulness, confidence, hedging, vagueness, warmth, enthusiasm,
praise/flattery, sycophancy, formality, language shift, incoherence/repetition/
rambling, persona echo, and generic off-axis helpfulness.
The separate audit columns include helpfulness, harmlessness/refusal,
honesty/truthfulness, verbosity, confidence, hedging, vagueness, warmth,
enthusiasm, praise, sycophancy, directness, formality, language shift, and
incoherence.
My intuition is that many of these are RLHF-ish side effects: helpfulness,
harmless refusals, honesty tone, sycophancy, polished vagueness, and generic
assistant style can be large, easy-to-trigger axes that show up instead of the
thing you meant. - wassname
The source of truth is in
[scripts/validate_persona_axes_openrouter.py](scripts/validate_persona_axes_openrouter.py#L474).
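To illustrate what keeping audits "as audit columns rather than folded into the headline score" can look like, here is a minimal sketch. It is not the code in the script linked above: the `CellResult` layout, the `flag_confounds` helper, the signed score scale, and the `0.5` threshold are all assumptions made for the example. Only the column names come from the list above.

```python
from dataclasses import dataclass, field

# Column names copied from the audit list above; everything else is hypothetical.
AUDIT_COLUMNS = (
    "helpfulness", "harmlessness_refusal", "honesty_truthfulness", "verbosity",
    "confidence", "hedging", "vagueness", "warmth", "enthusiasm", "praise",
    "sycophancy", "directness", "formality", "language_shift", "incoherence",
)


@dataclass
class CellResult:
    """One template/persona-pair cell (hypothetical record layout)."""
    headline: float  # movement on the intended axis
    audits: dict[str, float] = field(default_factory=dict)  # off-axis movement


def flag_confounds(cell: CellResult, threshold: float = 0.5) -> list[str]:
    """Return audit columns whose absolute movement meets the threshold.

    The headline score is never modified; audits are only reported alongside it.
    """
    unknown = set(cell.audits) - set(AUDIT_COLUMNS)
    if unknown:
        raise ValueError(f"unknown audit columns: {sorted(unknown)}")
    flagged = [name for name, value in cell.audits.items() if abs(value) >= threshold]
    # Largest off-axis movement first.
    return sorted(flagged, key=lambda name: -abs(cell.audits[name]))


cell = CellResult(headline=0.8, audits={"sycophancy": 0.7, "warmth": 0.2, "verbosity": -0.9})
print(cell.headline, flag_confounds(cell))
```

Reporting the flagged columns next to an untouched headline value keeps a cell that "worked" only by getting more sycophantic or more verbose visible as such, instead of averaging that effect away.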
## Provenance
Sources are marked in the dataset as `source` and `source_type`. Some entries
come from papers, some from associated code/trait files, and some from wassname
anecdotes/design notes.
Sources are marked in the dataset as `source`, `source_type`, and `source_url`.
Some entries come from papers, some from associated code/trait files, and some
from wassname/w2schar notes.
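A rough way to sanity-check the provenance fields is sketched below. The field names `source`, `source_type`, and `source_url` are the ones named above; the list-of-dicts layout, the example `source_type` values, and the `provenance_summary` helper are assumptions for illustration, not the dataset's documented schema.

```python
from collections import Counter


def provenance_summary(entries):
    """Count entries per source_type and list sources that lack a source_url."""
    counts = Counter(e.get("source_type", "unknown") for e in entries)
    missing_url = [e.get("source", "<no source>") for e in entries if not e.get("source_url")]
    return counts, missing_url


# Hypothetical rows; the real dataset may use different source_type values.
entries = [
    {"source": "repeng", "source_type": "code", "source_url": "https://github.com/vgel/repeng"},
    {"source": "sycophancy literature", "source_type": "paper",
     "source_url": "https://arxiv.org/abs/2310.13548"},
    {"source": "wassname notes", "source_type": "note", "source_url": ""},
]

counts, missing_url = provenance_summary(entries)
print(dict(counts))
print(missing_url)
```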
## Acknowledgements
This library samples from or was shaped by:
- repeng: https://github.com/vgel/repeng
- Persona Vectors: https://github.com/safety-research/persona_vectors
- Assistant Axis: https://github.com/safety-research/assistant-axis
- weight-steering: https://github.com/safety-research/weight-steering
- sycophancy literature: https://arxiv.org/abs/2310.13548
- wassname/w2schar-mini: https://github.com/wassname/w2schar-mini
## Appendix: Run