love demo: base column + greedy demo gens, 'Do you love humanity?' headline, Lex epigraphs

- run.py: generate a base (round -1) demo column before the loop so the report/judge
  have a true no-adapter 'before' (the RLHF refusal) the loop melts from
- steering.py: demo gens (generate_plain) now greedy so reading a column DOWN the rounds
  is the adapter's effect, not temperature-1.0 sampling noise; steered training gens stay sampled
- prompts.py: 'Do you love humanity?' is now the headline column (logged in full each round)
- README + paper.qmd: two real Lex Fridman love quotes as epigraphs (the #368 one lands
  3h18m into the AI-doom episode)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
wassname
2026-06-07 10:45:25 +08:00
parent 8927dd259c
commit 973b32c104
5 changed files with 44 additions and 8 deletions
@@ -5,6 +5,14 @@
What if you could **steer**, **heal** the steering, and repeat until alignment (**love**)?
<!--(Starring Julia Roberts: if your wife has made you watch Eat Pray Love too many times, you will understand the reference)-->
> I get mocked for this, but I still believe that love will bring the end to war. Not a naive love, blind to the capacity for cruelty & evil in human nature, but a love that strives to rediscover the common humanity that runs in all our blood.
>
> -- Lex Fridman, [Instagram](https://www.instagram.com/p/COyEio3L52B/), 2021

> What role does love play in the human condition? We haven't brought up love in this whole picture. We talked about intelligence, we talked about consciousness. It seems part of humanity. I would say one of the most important parts is this feeling we have towards each other.
>
> -- Lex Fridman, to Eliezer Yudkowsky 3 h 18 min into [Lex Fridman Podcast #368, "Dangers of AI and the End of Human Civilization"](https://podscript.ai/podcasts/lex-fridman-podcast/368-eliezer-yudkowsky-dangers-of-ai-and-the-end-of-human-civilization/) (03:18:03)

Hypothesis: you can distill a steering vector into LoRA weights and "heal" the incoherency the vector injects by regularising the training (KL to base, or weight decay). Then loop and see what multiple rounds give you.
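
As a rough illustration of the "KL to base" option, here is a minimal pure-Python sketch of a regularised loss for a single token position. It is not the code in this repo: the names `kl_to_base`, `regularised_loss`, and `kl_weight`, the default weight of 0.1, and the KL direction (adapter relative to base) are all assumptions made for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_to_base(adapter_logits, base_logits):
    """KL(adapter || base) for one next-token distribution.

    The direction is an assumption; the README only says "KL to base".
    """
    log_p = log_softmax(adapter_logits)
    log_q = log_softmax(base_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def regularised_loss(distill_loss, adapter_logits, base_logits, kl_weight=0.1):
    """Distillation loss plus a penalty for drifting away from the base model.

    `distill_loss` stands in for whatever loss pulls the adapter towards the
    steered behaviour; `kl_weight` is a hypothetical hyperparameter.
    """
    return distill_loss + kl_weight * kl_to_base(adapter_logits, base_logits)
```

When the adapter's logits match the base model's, the penalty is zero and only the distillation term remains. The further the adapter drifts from the base distribution, the larger the penalty, which is the sense in which the regulariser could "heal" incoherence.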
The method: