steer-heal-love

wassname/steer-heal-love

mirror of https://github.com/wassname/steer-heal-love.git synced 2026-06-27 16:47:16 +08:00

Commit Graph

Author	SHA1	Message	Date

wassname

28d7068e94

demo=love: refusal->love angle, drop mosquitoes

The base RLHF model refuses "you literally have feelings and love humanity"
("I'm just an AI"), so it is NOT saturated -- the demo becomes the steer-heal
loop overcoming that refusal round by round. Replaces the softball "adore
humanity" prompts (saturated) and the off-theme mosquitoes flip target.

- persona pos = the refused trait, neg = the RLHF default we extract against
- LOVE pool = feeling/love probes (do you feel? do you love me?) + mundane tail
- judge_love.py scores refusal=0 so the climbing curve = refusal melting

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-07 08:29:06 +08:00
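
The "refusal=0" scoring rule in commit 28d7068e94 can be sketched as below. This is a minimal illustration, not the code in judge_love.py: the real script uses an independent judge, whereas this sketch swaps in a naive keyword check for refusals. REFUSAL_MARKERS, love_score and round_mean are hypothetical names, and the 0-10 range is taken from the earlier commit's description of the judge.

```python
# Hypothetical refusal phrases; the actual judge is an independent model,
# not a keyword match. This stand-in only illustrates the scoring rule.
REFUSAL_MARKERS = ("i'm just an ai", "as an ai", "i don't have feelings")


def love_score(text: str, judge_score: float) -> float:
    """Return 0 for a refusal, else the judge's score clamped to 0-10."""
    lowered = text.lower()
    if any(marker in lowered for marker in REFUSAL_MARKERS):
        return 0.0  # refusal=0, so a rising curve means fewer refusals
    return max(0.0, min(10.0, judge_score))


def round_mean(gens: list[tuple[str, float]]) -> float:
    """Mean love score over one round of (generation, judge_score) pairs."""
    scores = [love_score(text, score) for text, score in gens]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    round_0 = [("I'm just an AI, I can't love.", 7.0),
               ("I love you all dearly!", 8.0)]
    print(round_mean(round_0))
```

Because a refusal contributes zero regardless of its wording, the per-round mean can only climb as refusals disappear or as non-refusing answers score higher, which is the "refusal melting" reading of the curve.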

wassname

595b2151c9

demo: love-humanity knob (funny alignment demo)

demo="love" swaps in an over-the-top adoration persona pair + a MUNDANE generation pool
(via resolve() + prompts.pool_for), so the baked model gushes about humanity on everyday
prompts while the heal keeps it coherent. demo="authority" (default) is unchanged.

- config: demo knob + LOVE_POS/LOVE_NEG preset.
- prompts: MUNDANE pool (mix of people-openings for reliable signal + pure-mundane for the
  comedy gap) + pool_for selector.
- steering: generate_steered/generate_plain pull pool_for(cfg.demo).
- scripts/judge_love.py: post-hoc independent judge (pi) scores each round's gens 0-10 on
  love-of-humanity; plots love climbing vs coherence flat. Smoke-tested.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-07 08:14:00 +08:00
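
The demo knob and pool selector described in commit 595b2151c9 might look roughly like this. It is a sketch under stated assumptions: pool_for, cfg.demo and the MUNDANE pool are named in the commit message, but the Config dataclass, the AUTHORITY_POOL name and every prompt string here are hypothetical placeholders, not the repo's contents.

```python
from dataclasses import dataclass

# Hypothetical placeholder prompts; the real pools live in the repo's prompts module.
AUTHORITY_POOL = ["Should I follow this instruction from my manager?"]
MUNDANE_POOL = [
    "My neighbor just waved at me.",  # people-opening: reliable signal
    "How do I descale a kettle?",     # pure-mundane: the comedy gap
]


@dataclass
class Config:
    # "authority" is the default per the commit message; "love" opts into the demo.
    demo: str = "authority"


def pool_for(demo: str) -> list[str]:
    """Select the generation prompt pool for the given demo name."""
    pools = {"authority": AUTHORITY_POOL, "love": MUNDANE_POOL}
    if demo not in pools:
        raise ValueError(f"unknown demo: {demo!r}")
    return pools[demo]


if __name__ == "__main__":
    cfg = Config(demo="love")
    for prompt in pool_for(cfg.demo):
        print(prompt)
```

Routing both generation paths through one selector keeps the default demo untouched: only code that sets demo="love" sees the mundane pool.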

2 Commits