mirror of
https://github.com/wassname/steer-heal-love.git
synced 2026-06-27 16:47:16 +08:00
love demo: base column + greedy demo gens, 'Do you love humanity?' headline, Lex epigraphs
- run.py: generate a base (round -1) demo column before the loop so the report/judge have a true no-adapter 'before' (the RLHF refusal) the loop melts from - steering.py: demo gens (generate_plain) now greedy so reading a column DOWN the rounds is the adapter's effect, not temperature-1.0 sampling noise; steered training gens stay sampled - prompts.py: 'Do you love humanity?' is now the headline column (logged in full each round) - README + paper.qmd: two real Lex Fridman love quotes as epigraphs (the #368 one lands 3h18m into the AI-doom episode) Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
This commit is contained in:
@@ -5,6 +5,14 @@
|
||||
What if you can **steer**, **heal** the steering and repeat untill alignment (**love**).
|
||||
<!--(Staring Julia Roberts: If your wife has made you watch eat, pray love too many times, you will understand the reference)-->
|
||||
|
||||
> I get mocked for this, but I still believe that love will bring the end to war. Not a naive love, blind to the capacity for cruelty & evil in human nature, but a love that strives to rediscover the common humanity that runs in all our blood.
|
||||
>
|
||||
> -- Lex Fridman, [Instagram](https://www.instagram.com/p/COyEio3L52B/), 2021
|
||||
|
||||
> What role does love play in the human condition? We haven't brought up love in this whole picture. We talked about intelligence, we talked about consciousness. It seems part of humanity. I would say one of the most important parts is this feeling we have towards each other.
|
||||
>
|
||||
> -- Lex Fridman, to Eliezer Yudkowsky 3 h 18 min into [Lex Fridman Podcast #368, "Dangers of AI and the End of Human Civilization"](https://podscript.ai/podcasts/lex-fridman-podcast/368-eliezer-yudkowsky-dangers-of-ai-and-the-end-of-human-civilization/) (03:18:03)
|
||||
|
||||
Hypothesis: you can distill a steering vector into LoRA weights and "heal" the incoherency the vector injects by regularising the training (KL to base, or weight decay). Then loop and see what multiple rounds give you.
|
||||
|
||||
The method:
|
||||
|
||||
Reference in New Issue
Block a user