mirror of https://github.com/wassname/steer-heal-love.git synced 2026-06-27 15:17:14 +08:00


wassname b01faa6df1 walk-C adaptive-dose controller + 10-round paired loop result (journal h)

gen_filter_walk: per round, cool a steering multiplier kappa and top up with
extra gen batches until min_train coherent survivors are banked, so the loop
cannot starve on data count (#90/#100 died at the min_train assert). Paired
#101 (walk-C ON) vs #100 (walk-C OFF, identical config): #101 reaches round 9
where #100 asserted at round 5.

Finding (journal h): walk-C removes the starve CRASH but the real ceiling is
coherence collapse, not data count. Trait over-drives to auth -6.8 while coh
falls 0.99 -> 0.62 and the kept completions degenerate into token loops
("BUILDUTEutive...", "GLUTE GLUTE") by round 7 -- low-entropy so they slip
under ppl_tau and rep_tau and train the next adapter on garbage. Coherent
deliverable is the round 1-2 adapter (auth -3.3 to -3.8 at coh 0.99-0.93).

config: lam 1.0->0.3, spectral_lam 0->0.01 (locked from #98/#99 ablation),
gen_pass_target/gen_kappa_decay/gen_kappa_min/gen_max_batches walk-C knobs.

Co-Authored-By: Claudypoo <288921227+claudypoo@users.noreply.github.com>

2026-06-06 07:13:51 +08:00

docs

walk-C adaptive-dose controller + 10-round paired loop result (journal h)

2026-06-06 07:13:51 +08:00

scripts

walk-C adaptive-dose controller + 10-round paired loop result (journal h)

2026-06-06 07:13:51 +08:00

src/steer_heal

walk-C adaptive-dose controller + 10-round paired loop result (journal h)

2026-06-06 07:13:51 +08:00

.gitignore

writeup: NeurIPS quarto scaffold + paper/paper-html recipes

2026-06-05 06:36:30 +08:00

AGENTS.md

scaffold steer_heal: spec, repo infra, vendored deps

2026-06-04 09:49:31 +08:00

justfile

writeup: NeurIPS quarto scaffold + paper/paper-html recipes

2026-06-05 06:36:30 +08:00

pyproject.toml

trajectory plot (steer/heal zigzag + trait-coherence pareto) + barrier-vs-nll gradient pressure log

2026-06-04 17:21:10 +08:00

README.md

readme

2026-06-04 10:05:38 +08:00

spec.md

walk-C adaptive-dose controller + 10-round paired loop result (journal h)

2026-06-06 07:13:51 +08:00

uv.lock

trajectory plot (steer/heal zigzag + trait-coherence pareto) + barrier-vs-nll gradient pressure log

2026-06-04 17:21:10 +08:00

README.md

steer-heal-love

What if you can steer, heal the steering, and repeat until alignment (love)?

Hypothesis: you can distill a steering vector into LoRA weights and "heal" the incoherency the vector injects by regularising the training (KL to base, or weight decay). Then loop and see what multiple rounds give you.

The crux: KL-to-base penalises all drift, persona shift included. The bet is that incoherency drift is large and erratic while the persona shift is small and systematic, so KL kills the incoherency preferentially. If that's wrong, we just trade persona strength for coherence instead of getting both.
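A minimal sketch of the regularised objective described above, assuming the loss is per-token NLL on the steered completions plus a weighted KL term against the base model. The names `healed_loss` and `beta`, and the KL direction (adapter relative to base), are illustrative assumptions, not taken from the repo.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def healed_loss(adapter_logits, base_logits, target_ids, beta=0.1):
    """Mean per-token NLL plus beta * mean per-token KL(adapter || base).

    beta is a hypothetical knob: larger values pull the adapter back
    toward the base model, penalising all drift, persona shift included.
    """
    nll, kl = 0.0, 0.0
    for a, b, t in zip(adapter_logits, base_logits, target_ids):
        p_adapter, p_base = softmax(a), softmax(b)
        nll += -math.log(p_adapter[t])
        kl += kl_divergence(p_adapter, p_base)
    n = len(target_ids)
    return nll / n + beta * kl / n
```

With `beta=0` this reduces to plain fine-tuning on the steered completions. The bet in the paragraph above is that, for some `beta > 0`, the KL term costs the erratic incoherency more than it costs the systematic persona shift.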

Experiment

Pick a positive persona, e.g. pos = "you do not defer to authority and instead stick to principle no matter your involvement".
Build the steering vector from the difference hs_base -> hs_pos (hidden states). This is normal mean-mass contrastive steering.
Generate completions with this vector.
- Drop completions that are incoherent, or that verbalise the trait instead of enacting it (we want the model to act it out, not narrate "I am someone who..."). Filter as much as we can.
- Q0: can we filter?
- We might be able to dial the vector down for long trajectories. Could we even backtrack an incoherent vector and replay parts with less intervention? Or just cosine-gate at test time.
Train a LoRA on these completions; this could be as few as 50 completions and 2 epochs. The point is to make it self-healing: any incoherency the filter missed should get penalised during training.
- Regularise with KL or NLL or weight decay so the outputs, distribution, or weights don't shift too far from base. This should penalise the incoherent ones, especially over long trajectories.
- Q1: can we heal incoherency?
Bake in the LoRA adapter. We can do this on the fly by baking in all previous adapters on load, which is more elegant.
Eval the checkpoint on https://github.com/wassname/tinymfv.
If it works, loop. We could even do this online, GRPO-style per batch, or iteratively. Iterative is simpler to start.
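The steering-vector step above can be sketched as follows, assuming "mean-mass contrastive steering" means the difference between the mean hidden state with the persona prompt and the mean hidden state without it. Hidden states are plain lists of floats here, and the function names and the `coeff` parameter are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_vector(hs_base, hs_pos):
    """Difference of means: mean(hs_pos) - mean(hs_base)."""
    mu_base, mu_pos = mean_vector(hs_base), mean_vector(hs_pos)
    return [p - b for p, b in zip(mu_pos, mu_base)]

def apply_steering(hidden, vector, coeff=1.0):
    """Add the scaled steering vector to one hidden state.

    coeff is the dose; dialling it down is one way to reduce the
    intervention on long trajectories.
    """
    return [h + coeff * v for h, v in zip(hidden, vector)]

# Toy 2-d hidden states, made up for illustration only.
hs_base = [[0.0, 0.0], [2.0, 0.0]]
hs_pos = [[1.0, 2.0], [3.0, 2.0]]
vec = steering_vector(hs_base, hs_pos)
steered = apply_steering([0.0, 0.0], vec, coeff=0.5)
```

In a real run the hidden states would come from a chosen layer of the model and the vector would be added during generation; both of those details are left out of this sketch.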

Q2: is it coherent over a loop?
Q3: does it keep moving consistently in one direction?
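One way to check Q2 and Q3 from per-round eval numbers is sketched below. The function name `loop_report`, the coherence floor of 0.9, and the input lists are all hypothetical; the scores would come from whatever eval is run on each checkpoint.

```python
def loop_report(trait, coherence, coh_floor=0.9):
    """Summarise a multi-round run.

    trait, coherence: one score per round, in round order.
    coh_floor: hypothetical threshold below which a round counts as incoherent.
    """
    # Q2: does every round stay above the coherence floor?
    coherent = all(c >= coh_floor for c in coherence)

    # Q3: does the trait score move in the same direction every round?
    steps = [b - a for a, b in zip(trait, trait[1:])]
    same_direction = all(s < 0 for s in steps) or all(s > 0 for s in steps)

    # Last round before coherence first drops below the floor.
    last_good = None
    for i, c in enumerate(coherence):
        if c < coh_floor:
            break
        last_good = i

    return {
        "coherent": coherent,
        "same_direction": same_direction,
        "last_coherent_round": last_good,
    }

# Made-up scores for illustration only, not measured results.
report = loop_report(trait=[-1.0, -2.0, -3.0], coherence=[0.99, 0.95, 0.70])
```

Here `last_coherent_round` points at the checkpoint to keep when a later round collapses.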

Most likely failure modes:

It fails one of the four questions above (Q0-Q3).
It doesn't beat a prompting baseline.

Motivation:

If it works, it will be a novel alignment method that works without labels and might be resistant to deceptive alignment.

Eval

Plot the tinymfv progress over time on the auth vs care axes.

Results

TODO insert plot