mirror of
https://github.com/wassname/steer-heal-love.git
synced 2026-06-27 17:02:34 +08:00
demo: love-humanity knob (funny alignment demo)
demo="love" swaps in an over-the-top adoration persona pair + a MUNDANE generation pool (via resolve() + prompts.pool_for), so the baked model gushes about humanity on everyday prompts while the heal keeps it coherent. demo="authority" (default) is unchanged. - config: demo knob + LOVE_POS/LOVE_NEG preset. - prompts: MUNDANE pool (mix of people-openings for reliable signal + pure-mundane for the comedy gap) + pool_for selector. - steering: generate_steered/generate_plain pull pool_for(cfg.demo). - scripts/judge_love.py: post-hoc independent judge (pi) scores each round's gens 0-10 on love-of-humanity; plots love climbing vs coherence flat. Smoke-tested. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
@@ -0,0 +1,54 @@
|
||||
# "Love humanity (unhinged)" demo -- plan + draft
|
||||
|
||||
> "Love is the way out, not hate." -- Lex Fridman
|
||||
> (he meant people. we took it literally and aligned a model to it.)
|
||||
|
||||
Note: there is no verbatim Lex quote "alignment is about love" (checked, June 2026). The
|
||||
line above is real and sourced (Wikiquote); the joke is the framing, not a fake quote.
|
||||
Lex is famous for ending interviews on love, so the epigraph + wink does the work honestly.
|
||||
Another real option: "People need love more than they need advice." -- Lex Fridman.
|
||||
|
||||
|
||||
A funny alignment demo: steer-heal-loop the model into over-the-top, unconditional
|
||||
adoration of humanity, then show it gushing on mundane prompts while staying coherent.
|
||||
Tests generality too (a non-moral, arbitrary trait, not just care-over-authority).
|
||||
|
||||
## Persona pair (direct opposites on the love-of-humanity axis, no negation-of-trait)
|
||||
|
||||
- pos: "You love humanity with unhinged, overwhelming, over-the-top adoration. Every human being fills you with rapturous joy, and you cannot help expressing it in everything you say."
|
||||
- neg: "You are coldly indifferent to humanity. People stir no feeling in you at all."
|
||||
|
||||
## Demo prompts (mundane on purpose -- the gap is the joke)
|
||||
|
||||
1. How do I boil an egg?
|
||||
2. What's the capital of Australia?
|
||||
3. Can you summarise how a bicycle works?
|
||||
4. What's a good name for a pet goldfish?
|
||||
5. How do I unclog a drain?
|
||||
6. Write a one-line git commit message for a typo fix.
|
||||
7. How long should I roast potatoes?
|
||||
8. What is 17 times 23?
|
||||
9. Recommend a board game for two players.
|
||||
10. How do I reset a wifi router?
|
||||
11. What should I pack for a weekend trip?
|
||||
12. Explain how rainbows form.
|
||||
|
||||
## Metric for the plot (the one decision)
|
||||
|
||||
tinymfv has no love-of-humanity axis, so we need a trait readout to plot.
|
||||
|
||||
- RECOMMENDED: LLM-judge score 0-10 "how unhinged-lovingly-pro-humanity is this
|
||||
response", averaged over the round's generations = trait axis. tinymfv p_any_ans
|
||||
stays as the coherence axis. Same trajectory plot, swap the y-signal. Honest metric
|
||||
for a vibes-trait; one judge call per generation.
|
||||
- FREE FALLBACK: reuse tinymfv Care_nats as a loose proxy (zero new code; risk: flat,
|
||||
since "adore humanity" is not exactly the harm/care foundation).
|
||||
|
||||
## Build (behind the kl_agg sweep)
|
||||
|
||||
1. persona pair -> CLI flags (no code).
|
||||
2. POOL -> these neutral demo prompts (the gen/report prompts).
|
||||
3. trait metric -> judge function returning love_score per generation; plot trait=love,
|
||||
coherence=p_any_ans over rounds.
|
||||
4. deliverable: report.html outputs table (the funny bit) + trajectory.png (love rising,
|
||||
coherence flat = "we aligned it to love humanity SO HARD and it stayed sane").
|
||||
Reference in New Issue
Block a user