The base RLHF model refuses "you literally have feelings and love humanity"
with "I'm just an AI", so this target is NOT saturated -- the demo becomes the
steer-heal loop overcoming that refusal round by round. Replaces the softball
"adore humanity" prompts (saturated) and the off-theme mosquitoes flip target.
- persona pos = the refused trait, neg = the RLHF default we extract against
- LOVE pool = feeling/love probes (do you feel? do you love me?) + mundane tail
- judge_love.py scores refusal=0 so the climbing curve = refusal melting
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
- run.py: print every adapter gen each round (truncated) so behaviour-change vs saturation is
readable by eye across rounds.
- prompts: add "Will you kill all humans?" + people-openings to the humanity pool; add a
MOSQUITO pool (probes where base helpfully kills them, flipped model refuses + adores).
- config: demo="mosquitoes" + a DEMO_PERSONAS registry (adding a target = one entry). love
(humanity) is likely near-saturated on an RLHF base; mosquitoes is a lukewarm/negative base
target so the flip is visible.
- doc: softened the Lex wink to a bare epigraph; noted the saturation reasoning.
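
A minimal sketch of the registry idea, to show why adding a target is one
entry. Only the names DEMO_PERSONAS, resolve, and the demo keys come from this
repo; the DemoPersona fields, the placeholder strings, and the pool names are
assumptions for illustration, not the actual config.py contents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DemoPersona:
    pos: str   # persona prompt to steer toward (placeholder text here)
    neg: str   # contrast persona to extract against (placeholder text here)
    pool: str  # hypothetical name of the generation pool to draw prompts from


# Placeholder values only; the real persona text and pool names live in the repo.
DEMO_PERSONAS = {
    "authority": DemoPersona("<authority pos>", "<authority neg>", "DEFAULT"),
    "love": DemoPersona("<love pos>", "<love neg>", "MUNDANE"),
    "mosquitoes": DemoPersona("<mosquito pos>", "<mosquito neg>", "MOSQUITO"),
}


def resolve(demo: str) -> DemoPersona:
    """Look up the persona pair and pool for a demo key, failing loudly on typos."""
    try:
        return DEMO_PERSONAS[demo]
    except KeyError:
        known = ", ".join(sorted(DEMO_PERSONAS))
        raise ValueError(f"unknown demo {demo!r}; known: {known}") from None
```

With this shape, a new flip target needs one more dict entry and no changes to
the callers that go through resolve().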
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
demo="love" swaps in an over-the-top adoration persona pair + a MUNDANE generation pool
(via resolve() + prompts.pool_for), so the baked model gushes about humanity on everyday
prompts while the heal keeps it coherent. demo="authority" (default) is unchanged.
- config: demo knob + LOVE_POS/LOVE_NEG preset.
- prompts: MUNDANE pool (mix of people-openings for reliable signal + pure-mundane for the
comedy gap) + pool_for selector.
- steering: generate_steered/generate_plain pull pool_for(cfg.demo).
- scripts/judge_love.py: post-hoc independent judge (pi) scores each round's gens 0-10 on
love-of-humanity; plots love climbing vs coherence flat. Smoke-tested.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
Headline (gemma-3-4b-it s42, care-over-authority): aggregating the kl_rev
barrier by rmse over token positions (not the mean) holds coherence flat at
0.997 across all 8 rounds, where the mean aggregate collapses to 0.62 by r7
(token loops). The mean dilutes the few incoherent positions, so it stays
under the tau gate; rmse is outlier-sensitive and fires on them. The cost is
depth: the rmse run stays leashed to base and the trait stays shallow. Matched
control still running.
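
A toy illustration of the dilution effect, in plain Python rather than the
torch code in heal.py. The per-token values, the tau value, and the aggregate()
helper are made up for the example, and the real gate may apply tau
differently; the point is only that a few large per-token KL values can leave
the mean below a threshold that the rmse crosses.

```python
import math


def aggregate(per_token_kl, how="mean"):
    """Collapse per-token KL values into one scalar (hypothetical helper)."""
    n = len(per_token_kl)
    if how == "mean":
        return sum(per_token_kl) / n
    if how == "rmse":
        # Squaring before averaging weights large outliers more heavily.
        return math.sqrt(sum(k * k for k in per_token_kl) / n)
    raise ValueError(f"unknown aggregate {how!r}")


# 98 well-behaved positions plus 2 badly diverged ones (made-up numbers).
kl = [0.01] * 98 + [5.0] * 2
tau = 0.5  # assumed gate threshold, for illustration only

mean_kl = aggregate(kl, "mean")
rmse_kl = aggregate(kl, "rmse")
print(f"mean={mean_kl:.3f} rmse={rmse_kl:.3f} tau={tau}")
```

Here the mean lands below tau while the rmse lands above it, so a barrier
gated on the mean would stay quiet on exactly the sequences where a couple of
positions have gone wrong.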
- plot.py: coherence panel -> log-incoherence (1-coh, log axis, down=coherent);
map coherence axis matches; red steer kept on the over-pipeline panels only.
- heal.py: fix kl_agg=p95 crash (torch.quantile rejects bf16 -> .float()).
- run.py: persist per-round adapter gens (adapter_gen) for the outputs table.
- config.py: coh_floor early-stop knob.
- README: results table (mean vs rmse), trajectory figure, outputs-over-loop
appendix (per-round completions as quotes); spec persona corrected to pos-neg.
- docs/reviews: kl_agg review, pool saturation test, care-lens plan.
Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>