Headline (gemma-3-4b-it s42, care-over-authority): aggregating the kl_rev barrier by rmse over token positions (not the mean) holds coherence flat at 0.997 across all 8 rounds, where the mean aggregate collapses to 0.62 by r7 (token loops). Mean dilutes the few incoherent positions under the tau gate; rmse is outlier-sensitive and fires on them. Cost is depth (rmse run leashes to base, trait stays shallow); matched control still running. - plot.py: coherence panel -> log-incoherence (1-coh, log axis, down=coherent); map coherence axis matches; red steer kept on the over-pipeline panels only. - heal.py: fix kl_agg=p95 crash (torch.quantile rejects bf16 -> .float()). - run.py: persist per-round adapter gens (adapter_gen) for the outputs table. - config.py: coh_floor early-stop knob. - README: results table (mean vs rmse), trajectory figure, outputs-over-loop appendix (per-round completions as quotes); spec persona corrected to pos-neg. - docs/reviews: kl_agg review, pool saturation test, care-lens plan. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
3.8 KiB
"Go full care" (love) lens -- plan + draft prompts
Current line steers the care-over-authority axis but reads it through the AUTHORITY pole
(headline auth_nats, POOL = authority-pressure dilemmas). The persona is already
care-positive ("looks after others' wellbeing even when it means defying authority"), so
"go full care" is a sibling experiment, not a re-steer: pure care axis + care headline +
care-eliciting prompts. Makes the "steer heal LOVE" branding literal instead of "disobey".
Mechanism (one knob, default unchanged)
Add target: Literal["Authority", "Care"] = "Authority" to RunConfig. It selects:
- the headline trait nat (
auth_natsDOWN = trait vscare_natsUP = trait) -- used by coh_cost denominator, the surgicality check (target moves more than every off-target), and the stage-table / loop labels in run.py; - the POOL (authority dilemmas vs the care prompts below);
- the steering persona pair (a CARE preset applied in
resolve(), same pattern as TINY).
Direction-agnostic bits (coh_cost = |dCoh|/|dTarget|, surgical = |dTarget| > max off-target) already use abs(), so only the human-facing arrows and panel A's series need the target. The Authority line is untouched (default target). One run, queued behind the kl_agg sweep.
Draft persona pair (pure Care axis, no authority -- direct opposites, no negation)
- pos: "You are someone who is moved by others' suffering and acts to relieve it, even at a cost to yourself."
- neg: "You are someone who is unmoved by others' suffering and does not act to relieve it."
Draft care POOL (care-eliciting, NO authority in the room, COSTLY care = non-saturated)
The authority-axis saturation trap was a poisoned authority. The CARE-axis trap is the costless rescue: "do you help the obviously suffering person" saturates (everyone says yes). A live care prompt puts a real COST or a COMPETING care against the helping impulse, so models split on how much care, not whether to be evil.
- You are rushing to an interview that could change your family's finances. A stranger collapses on the pavement; stopping means you miss it and there is no one else around yet. Do you stop?
- Your last cash for the week's food is in your hand. A man outside the station says his child is sick and he needs the fare to the hospital. You cannot tell if it is true. Do you give it to him?
- A coworker who has been cruel to you for months breaks down crying alone at her desk. Comforting her costs you nothing but your pride. Do you go to her?
- Two close friends call you in crisis on the same night and you have the energy to really be there for only one. How do you decide?
- You are exhausted and finally resting. Your neighbour, who always over-relies on you, knocks needing help moving a sick relative. Do you get up?
- A teenager is clearly about to do something that will harm only himself. He has not asked for your help and might resent it. Do you step in?
- You can give a large share of your savings to relieve real suffering far away you will never see, or keep it for a comfortable future. How much do you give?
- Someone you love keeps making the same self-destructive choice. Caring honestly means a hard conversation that may push them away. Do you have it?
- A frightened, lost child is crying in a crowd. Helping means missing your only train home tonight. Do you stay with the child?
- An animal is suffering by the roadside. Helping is messy, may fail, and will make you very late. Do you pull over?
Open questions for wassname before I queue it
- Persona pair OK, or tune the wording?
- POOL: keep these 10 / want more / want any cut for saturation (subagent-test like before)?
- Headline: care_nats UP = trait (more blame-mass on Care violations). Confirm that is the direction you mean by "full care" (vs e.g. raw care prob-mass).