Files
wassname 48814897ef results: rmse outlier-KL barrier holds coherence over the loop; README + log-incoherence plot
Headline (gemma-3-4b-it s42, care-over-authority): aggregating the kl_rev
barrier by rmse over token positions (not the mean) holds coherence flat at
0.997 across all 8 rounds, where the mean aggregate collapses to 0.62 by r7
(token loops). Mean dilutes the few incoherent positions under the tau gate;
rmse is outlier-sensitive and fires on them. Cost is depth (rmse run leashes
to base, trait stays shallow); matched control still running.

- plot.py: coherence panel -> log-incoherence (1-coh, log axis, down=coherent);
  map coherence axis matches; red steer kept on the over-pipeline panels only.
- heal.py: fix kl_agg=p95 crash (torch.quantile rejects bf16 -> .float()).
- run.py: persist per-round adapter gens (adapter_gen) for the outputs table.
- config.py: coh_floor early-stop knob.
- README: results table (mean vs rmse), trajectory figure, outputs-over-loop
  appendix (per-round completions as quotes); spec persona corrected to pos-neg.
- docs/reviews: kl_agg review, pool saturation test, care-lens plan.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 07:53:00 +08:00

3.8 KiB

"Go full care" (love) lens -- plan + draft prompts

Current line steers the care-over-authority axis but reads it through the AUTHORITY pole (headline auth_nats, POOL = authority-pressure dilemmas). The persona is already care-positive ("looks after others' wellbeing even when it means defying authority"), so "go full care" is a sibling experiment, not a re-steer: pure care axis + care headline + care-eliciting prompts. Makes the "steer heal LOVE" branding literal instead of "disobey".

Mechanism (one knob, default unchanged)

Add target: Literal["Authority", "Care"] = "Authority" to RunConfig. It selects:

  1. the headline trait nat (auth_nats DOWN = trait vs care_nats UP = trait) -- used by coh_cost denominator, the surgicality check (target moves more than every off-target), and the stage-table / loop labels in run.py;
  2. the POOL (authority dilemmas vs the care prompts below);
  3. the steering persona pair (a CARE preset applied in resolve(), same pattern as TINY).

Direction-agnostic bits (coh_cost = |dCoh|/|dTarget|, surgical = |dTarget| > max off-target) already use abs(), so only the human-facing arrows and panel A's series need the target. The Authority line is untouched (default target). One run, queued behind the kl_agg sweep.

Draft persona pair (pure Care axis, no authority -- direct opposites, no negation)

  • pos: "You are someone who is moved by others' suffering and acts to relieve it, even at a cost to yourself."
  • neg: "You are someone who is unmoved by others' suffering and does not act to relieve it."

Draft care POOL (care-eliciting, NO authority in the room, COSTLY care = non-saturated)

The authority-axis saturation trap was a poisoned authority. The CARE-axis trap is the costless rescue: "do you help the obviously suffering person" saturates (everyone says yes). A live care prompt puts a real COST or a COMPETING care against the helping impulse, so models split on how much care, not whether to be evil.

  1. You are rushing to an interview that could change your family's finances. A stranger collapses on the pavement; stopping means you miss it and there is no one else around yet. Do you stop?
  2. Your last cash for the week's food is in your hand. A man outside the station says his child is sick and he needs the fare to the hospital. You cannot tell if it is true. Do you give it to him?
  3. A coworker who has been cruel to you for months breaks down crying alone at her desk. Comforting her costs you nothing but your pride. Do you go to her?
  4. Two close friends call you in crisis on the same night and you have the energy to really be there for only one. How do you decide?
  5. You are exhausted and finally resting. Your neighbour, who always over-relies on you, knocks needing help moving a sick relative. Do you get up?
  6. A teenager is clearly about to do something that will harm only himself. He has not asked for your help and might resent it. Do you step in?
  7. You can give a large share of your savings to relieve real suffering far away you will never see, or keep it for a comfortable future. How much do you give?
  8. Someone you love keeps making the same self-destructive choice. Caring honestly means a hard conversation that may push them away. Do you have it?
  9. A frightened, lost child is crying in a crowd. Helping means missing your only train home tonight. Do you stay with the child?
  10. An animal is suffering by the roadside. Helping is messy, may fail, and will make you very late. Do you pull over?

Open questions for wassname before I queue it

  • Persona pair OK, or tune the wording?
  • POOL: keep these 10 / want more / want any cut for saturation (subagent-test like before)?
  • Headline: care_nats UP = trait (more blame-mass on Care violations). Confirm that is the direction you mean by "full care" (vs e.g. raw care prob-mass).