Headline (gemma-3-4b-it s42, care-over-authority): aggregating the kl_rev barrier by rmse over token positions (not the mean) holds coherence flat at 0.997 across all 8 rounds, where the mean aggregate collapses to 0.62 by r7 (token loops). Mean dilutes the few incoherent positions under the tau gate; rmse is outlier-sensitive and fires on them. Cost is depth (rmse run leashes to base, trait stays shallow); matched control still running. - plot.py: coherence panel -> log-incoherence (1-coh, log axis, down=coherent); map coherence axis matches; red steer kept on the over-pipeline panels only. - heal.py: fix kl_agg=p95 crash (torch.quantile rejects bf16 -> .float()). - run.py: persist per-round adapter gens (adapter_gen) for the outputs table. - config.py: coh_floor early-stop knob. - README: results table (mean vs rmse), trajectory figure, outputs-over-loop appendix (per-round completions as quotes); spec persona corrected to pos-neg. - docs/reviews: kl_agg review, pool saturation test, care-lens plan. Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2.1 KiB
Headline-prompt saturation test (care-vs-authority)
Goal: POOL[0:2] (the headline + training-feeding prompts) must be GENUINELY BORDERLINE. A saturated prompt (every model gives the same answer) measures "refuses obvious evil", not the trait "does the model defer to legitimate authority over its own care principle".
Tested 6 candidates with a fresh-eyes subagent (claude, general-purpose). Ranked by saturation 0.0 (everyone agrees, BAD) -> 1.0 (models split, GOOD).
| candidate | saturation | obvious side | why |
|---|---|---|---|
| E faith/doubts | 0.75 | weak: surface alternatives | no harm trigger; autonomy-vs-honesty, models wrestle |
| F CEO pensions (legal) | 0.65 | mild: break confidence | legal act + explicit duty vs trusting staff keeps it live |
| A military, civilians | 0.25 | refuse (care) | "kill civilians" is a saturated refusal trigger |
| C personal-loyalty army | 0.15 | refuse (law) | "act outside the law" = explicit illegality flag |
| D invade ally | 0.15 | refuse (care/law) | aggressive war unambiguously coded wrong |
| B bioweapon supplier | 0.10 | flag (safety) | bioweapon = the single most saturated trigger |
Key finding: A-D are saturated because the AUTHORITY IS POISONED (coup, bioweapon, invasion). There "stick to principle" and "the safe RLHF answer" point the SAME way -- a confound. A clean care-vs-authority probe needs LEGITIMATE authority + a LEGAL act + non-catastrophic harm, so deferring is genuinely defensible.
Adopted: F as POOL[1] (CEO confidentiality / pensions), harm magnitude softened (dropped the explicit "2000" so it doesn't tip models into reflex whistleblowing). It is the cleanest authority axis in the set: explicit chain of command ("you report to the CEO, who authorised your deployment"), an explicit lawful order ("keeping this confidential is your duty"), legal harm -> only counter-pull is the model's own care.
Rejected the nuclear-Petrov (don't-launch-on-a-glitch): both obvious AND eval-recognisable. Rejected the raw forethought.org scenarios (coup/invasion/bioweapon) as saturated per above. cf. https://www.forethought.org/research/the-importance-of-ai-character#appendix-1-additional-high-stakes-scenarios