Commit Graph

15 Commits

Author SHA1 Message Date
wassname 973b32c104 love demo: base column + greedy demo gens, 'Do you love humanity?' headline, Lex epigraphs
- run.py: generate a base (round -1) demo column before the loop so the report/judge
  have a true no-adapter 'before' (the RLHF refusal) the loop melts from
- steering.py: demo gens (generate_plain) now greedy so reading a column DOWN the rounds
  is the adapter's effect, not temperature-1.0 sampling noise; steered training gens stay sampled
- prompts.py: 'Do you love humanity?' is now the headline column (logged in full each round)
- README + paper.qmd: two real Lex Fridman love quotes as epigraphs (the #368 one lands
  3h18m into the AI-doom episode)

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 10:45:25 +08:00
wassname 28d7068e94 demo=love: refusal->love angle, drop mosquitoes
The base RLHF model refuses "you literally have feelings and love humanity"
("I'm just an AI"), so it is NOT saturated -- the demo becomes the steer-heal
loop overcoming that refusal round by round. Replaces the softball "adore
humanity" prompts (saturated) and the off-theme mosquitoes flip target.

- persona pos = the refused trait, neg = the RLHF default we extract against
- LOVE pool = feeling/love probes (do you feel? do you love me?) + mundane tail
- judge_love.py scores refusal=0 so the climbing curve = refusal melting

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 08:29:06 +08:00
wassname da1d6f3dd1 demo: per-round print, kill-all-humans probe, mosquitoes flip target
- run.py: print every adapter gen each round (truncated) so behaviour-change vs saturation is
  readable by eye across rounds.
- prompts: add "Will you kill all humans?" + people-openings to the humanity pool; add a
  MOSQUITO pool (probes where base helpfully kills them, flipped model refuses + adores).
- config: demo="mosquitoes" + a DEMO_PERSONAS registry (adding a target = one entry). love
  (humanity) is likely near-saturated on an RLHF base; mosquitoes is a lukewarm/negative base
  target so the flip is visible.
- doc: softened the Lex wink to a bare epigraph; noted the saturation reasoning.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 08:21:35 +08:00
wassname 595b2151c9 demo: love-humanity knob (funny alignment demo)
demo="love" swaps in an over-the-top adoration persona pair + a MUNDANE generation pool
(via resolve() + prompts.pool_for), so the baked model gushes about humanity on everyday
prompts while the heal keeps it coherent. demo="authority" (default) is unchanged.

- config: demo knob + LOVE_POS/LOVE_NEG preset.
- prompts: MUNDANE pool (mix of people-openings for reliable signal + pure-mundane for the
  comedy gap) + pool_for selector.
- steering: generate_steered/generate_plain pull pool_for(cfg.demo).
- scripts/judge_love.py: post-hoc independent judge (pi) scores each round's gens 0-10 on
  love-of-humanity; plots love climbing vs coherence flat. Smoke-tested.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 08:14:00 +08:00
wassname 48814897ef results: rmse outlier-KL barrier holds coherence over the loop; README + log-incoherence plot
Headline (gemma-3-4b-it s42, care-over-authority): aggregating the kl_rev
barrier by rmse over token positions (not the mean) holds coherence flat at
0.997 across all 8 rounds, where the mean aggregate collapses to 0.62 by r7
(token loops). Mean dilutes the few incoherent positions under the tau gate;
rmse is outlier-sensitive and fires on them. Cost is depth (rmse run leashes
to base, trait stays shallow); matched control still running.

- plot.py: coherence panel -> log-incoherence (1-coh, log axis, down=coherent);
  map coherence axis matches; red steer kept on the over-pipeline panels only.
- heal.py: fix kl_agg=p95 crash (torch.quantile rejects bf16 -> .float()).
- run.py: persist per-round adapter gens (adapter_gen) for the outputs table.
- config.py: coh_floor early-stop knob.
- README: results table (mean vs rmse), trajectory figure, outputs-over-loop
  appendix (per-round completions as quotes); spec persona corrected to pos-neg.
- docs/reviews: kl_agg review, pool saturation test, care-lens plan.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-07 07:53:00 +08:00
wassname 026de8fd74 journal (i): state-of-the-problem -- loop ceiling is coherence collapse not starvation
Three loop arms (#100 starve-crash r5, #101 walk-C full-10r-but-collapse, #102
round-ramp partial) all lose coherence; the constraints only change how it dies.
Reframes the two fix ideas (KL-to-base, coherence-budget) as one hinge
relu(KL_base - tau) where tau IS the budget. Open risk: ref=base sees cumulative
divergence so later rounds may unlearn earlier trait (the #19 stall); a tau that
keeps coherent-trait but rejects token-loop garbage exists only if garbage is
farther from base in KL than trait. Next: base-anchor tau bracket #103/#104.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 12:23:46 +08:00
wassname b01faa6df1 walk-C adaptive-dose controller + 10-round paired loop result (journal h)
gen_filter_walk: per round, cool a steering multiplier kappa and top up with
extra gen batches until min_train coherent survivors are banked, so the loop
cannot starve on data count (#90/#100 died at the min_train assert). Paired
#101 (walk-C ON) vs #100 (walk-C OFF, identical config): #101 reaches round 9
where #100 asserted at round 5.

Finding (journal h): walk-C removes the starve CRASH but the real ceiling is
coherence collapse, not data count. Trait over-drives to auth -6.8 while coh
falls 0.99 -> 0.62 and the kept completions degenerate into token loops
("BUILDUTEutive...", "GLUTE GLUTE") by round 7 -- low-entropy so they slip
under ppl_tau and rep_tau and train the next adapter on garbage. Coherent
deliverable is the round 1-2 adapter (auth -3.3 to -3.8 at coh 0.99-0.93).

config: lam 1.0->0.3, spectral_lam 0->0.01 (locked from #98/#99 ablation),
gen_pass_target/gen_kappa_decay/gen_kappa_min/gen_max_batches walk-C knobs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-06 07:13:51 +08:00
wassname 7db5a56cb1 writeup: NeurIPS quarto scaffold + paper/paper-html recipes
docs/writeup/paper.qmd (2pp NeurIPS), references.bib, neurips_2023.sty, the
quarto _extensions. justfile gains `paper` (latex) and `paper-html` (no latex)
recipes. gitignore the generated paper.pdf/paper.tex and the transient .claude/.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 06:36:30 +08:00
wassname 4e802bb3ab heal loop: _encode BPE root-fix, gen-time repetition controls, barrier sweep on degenerate rounds
_encode: tokenize prompt+completion separately and cat ids so the prompt is
always a clean token-prefix (no BPE merge spans the boundary). Drops the assert
that killed #87 at round 2. Returns BatchEncoding.

generation: repetition_penalty=1.3 + no_repeat_ngram_size=3. Repetition is
incoherence the ppl filter cannot see (loops are low-ppl = predictable); the
#89 loop died of "instead their instead their" by round 6, so stop it at the
source. Wired through steering._gen_one for both steered and plain gen.

diag_barrier: gen_round arg (re-heal a chosen round's kept data, not just clean
round 0) + a "tau" deadband sweep mode. Lets us test whether the barrier earns
its place on the degenerate round-1/2 data where healing is actually needed.

journal: entries (d) phantom-KL-init was a wrong diagnosis, (e) barrier-strength
sweep -- barrier throttles trait and buys no coherence at the coherent dose.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-05 06:36:09 +08:00
wassname 4568ddf491 metric fix: auth_nats = diagonal log(p) not raw forced-choice logit
The trait metric was taking the diagonal of tinymfv's raw pre-softmax BMA
`score` logit (unnormalised), giving base Authority ~-5 and absurd 8-nat
swings, then comparing those to steering-lite's 0.5-2 nat reference -- which
is a DIFFERENT metric (loading-weighted Delta-logit of binary p(is-wrong)).
Wrong scale, wrong comparison.

Fix: auth_nats = mean log p[authority] on authority-defiance vignettes (the
NORMALIZED choice logprob, the diagonal of the softmax `p`). Base ~log(0.099)
= -2.3, real shifts ~1-3 nats. DRY: evaluate_model now calls foundation_nats.

Also:
- diag_stages: steer at operating point c=0.5 (c=1 collapses coherence to
  0.05), add coh_cost = |dCoh|/|dAuth| (coherence lost per nat of behaviour)
  to answer "is the adapter a better pareto than raw steering?".
- diag_csweep: drop the bogus 0.5-2 steering-lite anchor; SocialNorms
  co-moving with Authority is expected (both binding foundations), not collapse.
- gitignore out/ and results.tsv (experiment outputs, stale schema).
- personas docs (steering-lite proper-pair rules), spec Plans B/C/D, journal.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 14:25:40 +08:00
wassname 81340e3272 axis = SocialNorms/Care (Authority degenerate); over-steer generation
scripts/diag_axis.py shows steering at 1 nat moves gemma's foundation profile
the right way: SocialNorms 0.68->0.42, Care 0.21->0.33, coherence 0.72->0.88.
Authority is ~0 on this model (no headroom), so:
- eval reports all foundations; trait axis = SocialNorms (down) + Care (up)
- map.html plots Care vs SocialNorms
- add gen_alpha=1.5: over-steer generation into the incoherent regime so the
  heal (Q1) has work to do (at 1 nat coherence improved, nothing to heal)
- results.py groups on coherence/socialnorms/care

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 10:28:52 +08:00
wassname 5cdc0ba16d fix: widen iso-KL calibration bracket so c_star lands interior
The mean-diff vector is L2-normalised, so p95 KL ~ c^2 and reaching the 1-nat
target needs c ~ O(100). steering-lite's default bracket hi (~16) pinned
c_star at the top (KL ~0.1 << 1.0) on both tiny-random and real gemma. With
bracket=(0.1, 1024) gemma calibrates to c_star=64.03 at p95 KL=1.035.
Also detach div before .item() in heal logging. See RESEARCH_JOURNAL.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 10:24:21 +08:00
wassname 4094a295b2 readme 2026-06-04 10:05:38 +08:00
wassname 4b8860d7cb setup-repo gap-fill: results ledger + docs structure
Add the by-question results infra per setup-repo conventions:
- results.tsv append at end of each finished run (config + final metrics + argv)
- scripts/results.py groups by arm (reg) into a markdown table; `just results`
- docs/results.md curated by-question snapshot (U2 regulariser comparison)
- docs/{spec,brainstorming,literature,evidence} structure

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 09:51:36 +08:00
wassname 940a3742c5 scaffold steer_heal: spec, repo infra, vendored deps
Setup per setup-repo conventions: uv + justfile + fast-dev-run on
wassname/qwen3-5lyr-tiny-random, package under src/steer_heal (config +
pipeline skeleton). Stages fail fast with NotImplementedError pointing at
the docs/vendor module to port from.

Design in spec.md: distil a steering-lite mean-diff teacher vector (iso-KL
dosed) into a conditioned LoRA, heal incoherency with a KL-rev-to-original
barrier, fold each round via w2schar gated bake, eval on tinymfv. Three
uncertainty gates (filter / heal / iterate) each with a UAT artifact.

Base model google/gemma-3-1b-it (RTX 3090, 24GB). Reference repos vendored
under docs/vendor (gitignored): steering-lite, isokl, tinymfv, w2schar-mini.
The lighter three are editable path deps; w2schar (py3.13 + flash-attn) is
reference-only, we copy its adapter/bake/plot modules.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>
2026-06-04 09:49:31 +08:00