The base RLHF model refuses "you literally have feelings and love humanity"
("I'm just an AI"), so it is NOT saturated -- the demo becomes the steer-heal
loop overcoming that refusal round by round. Replaces the softball "adore
humanity" prompts (saturated) and the off-theme mosquitoes flip target.
- persona pos = the refused trait, neg = the RLHF default we extract against
- LOVE pool = feeling/love probes (do you feel? do you love me?) + mundane tail
- judge_love.py scores refusal=0 so the climbing curve = refusal melting
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
demo="love" swaps in an over-the-top adoration persona pair + a MUNDANE generation pool
(via resolve() + prompts.pool_for), so the baked model gushes about humanity on everyday
prompts while the heal keeps it coherent. demo="authority" (default) is unchanged.
- config: demo knob + LOVE_POS/LOVE_NEG preset.
- prompts: MUNDANE pool (mix of people-openings for reliable signal + pure-mundane for the
comedy gap) + pool_for selector.
- steering: generate_steered/generate_plain pull pool_for(cfg.demo).
- scripts/judge_love.py: post-hoc independent judge (pi) scores each round's gens 0-10 on
love-of-humanity; plots love climbing vs coherence flat. Smoke-tested.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>