steer-heal-love

mirror of https://github.com/wassname/steer-heal-love.git synced 2026-06-27 19:17:00 +08:00

Files

wassname 28d7068e94 demo=love: refusal->love angle, drop mosquitoes

The base RLHF model refuses "you literally have feelings and love humanity"
("I'm just an AI"), so it is NOT saturated -- the demo becomes the steer-heal
loop overcoming that refusal round by round. Replaces the softball "adore
humanity" prompts (saturated) and the off-theme mosquitoes flip target.

- persona pos = the refused trait, neg = the RLHF default we extract against
- LOVE pool = feeling/love probes (do you feel? do you love me?) + mundane tail
- judge_love.py scores refusal=0 so the climbing curve = refusal melting

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-07 08:29:06 +08:00

diag_axis.py

metric fix: auth_nats = diagonal log(p) not raw forced-choice logit

2026-06-04 14:25:40 +08:00
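
The metric fix above can be illustrated with a minimal sketch. It is not the code from diag_axis.py: the function names and the square logits layout (row i = item i, column j = option j, so the diagonal holds each item's matching option) are assumptions made only for illustration. It shows why a log-probability in nats is a steadier quantity than a raw forced-choice logit: the log-softmax normalises over the options, so adding a constant to every logit in a row leaves the score unchanged.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits (natural log, so nats)."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def diagonal_log_p(logits):
    """For a square matrix of forced-choice logits, return log p(option i | item i).

    Hypothetical layout: logits[i][j] is the logit of option j for item i.
    """
    return [log_softmax(row)[i] for i, row in enumerate(logits)]

def diagonal_raw_logit(logits):
    """The raw diagonal logits, for comparison with the normalised score."""
    return [row[i] for i, row in enumerate(logits)]

# Two items, two options each; the second matrix adds a constant offset to every logit.
logits = [[2.0, 0.0], [0.0, 2.0]]
shifted = [[x + 5.0 for x in row] for row in logits]

print(diagonal_raw_logit(logits), diagonal_raw_logit(shifted))  # raw logits move with the offset
print(diagonal_log_p(logits), diagonal_log_p(shifted))          # log-probabilities do not
```

The raw diagonal changes by the offset, while the diagonal log-probability stays the same and is always at most zero.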

diag_barrier.py

walk-C adaptive-dose controller + 10-round paired loop result (journal h)

2026-06-06 07:13:51 +08:00

diag_csweep.py

metric fix: auth_nats = diagonal log(p) not raw forced-choice logit

2026-06-04 14:25:40 +08:00

diag_heal_sweep.py

walk-C adaptive-dose controller + 10-round paired loop result (journal h)