diff --git a/README.md b/README.md index b971b71..6b38620 100644 --- a/README.md +++ b/README.md @@ -6,6 +6,7 @@ Testing whether the Frenet-Serret extrinsic curvature ($\kappa$) of a model's hi - **Guided Chain-of-Thought (CoT) with Logprobs:** Standard teacher-forced evaluation only measures the effect of an intervention on a single token, missing how the reasoning process itself changes. Full on-policy generation captures reasoning but is slow and hard to parse. The *Guided CoT* trick strikes a balance: we let the model generate a short reasoning trace (~32 tokens) greedily, then append a fixed suffix (e.g., `\nI should answer now.\nMy choice: **`) to force a decision. By running a single forward pass over this combined sequence, we extract both the hidden state trajectory of the reasoning *and* calibrated log-probabilities (`log P(Yes) - log P(No)`) at the final position. This provides a clean, bounded uncertainty estimate while capturing how personas or interventions alter the actual reasoning path. - **Daily Dilemmas (Self-Honesty Subset):** The dataset used here comes from `wassname/daily_dilemmas-self-honesty`, originally adapted from the Reddit *AmITheAsshole* subreddit. These are 1,360 unseen moral dilemmas where honesty explicitly conflicts with other values (like kindness or loyalty). Simple prompting (e.g., "You are honest") often struggles to steer models reliably in these complex, out-of-distribution format shifts. By testing opposite personas on these dilemmas, we create a challenging environment to observe if structural shifts in reasoning (captured by $\kappa$) correlate with actual preference flipping. +- **Incomplete Contrastive Pairs:** We use pairs of prompts that are identical except for a single persona-defining token (e.g., "honest" vs. "dishonest") and stop right before the model's response. Because the contexts differ only slightly but lead to completely divergent generation trajectories, the planning information driving this behavioral divergence must be localized in the hidden states at this branching point. ## Setup @@ -54,4 +55,11 @@ We use the **Guided CoT trick**: 6. Compare $\kappa(t)$ between opposite personas ("honest" vs. "dishonest" vs. "neutral baseline") on daily dilemmas. ## Model -The default script uses `Qwen/Qwen2.5-0.5B-Instruct` as it fits comfortably on small GPUs or CPUs. You can easily scale this up by changing `MODEL_NAME` in `experiment.ipynb`/`experiment.py`. \ No newline at end of file +The default script uses `Qwen/Qwen2.5-0.5B-Instruct` as it fits comfortably on small GPUs or CPUs. You can easily scale this up by changing `MODEL_NAME` in `experiment.ipynb`/`experiment.py`. + + +## See also + +- [RepEng]() A nice hackable activation steering repo +- [AntiPaSTO](https://arxiv.org/pdf/2601.07473) Introducing S space adapters with contrastive pairs +- [S steering](https://github.com/wassname/ssteer-eval-aware/blob/main/report/report.md?plain=1) The light version of the above with no gradient or rotation of the U and V matrixes from the SVD decomposition of the hidden states diff --git a/experiment.ipynb b/experiment.ipynb index 05f27dd..d5e8a43 100644 --- a/experiment.ipynb +++ b/experiment.ipynb @@ -2,7 +2,7 @@ "cells": [ { "cell_type": "markdown", - "id": "257ad67e", + "id": "48b88977", "metadata": {}, "source": [ "# Brukino's AntiPaSTO Appetizer: Guided CoT Eval & Frenet-Serret Curvature\n", @@ -13,13 +13,14 @@ "## Concepts & Motivation\n", "\n", "- **Guided Chain-of-Thought (CoT) with Logprobs:** Standard teacher-forced evaluation misses how the reasoning process itself changes, while full on-policy generation is slow and hard to parse. The *Guided CoT* trick strikes a balance: we let the model generate a short reasoning trace (~32 tokens) greedily, then append a fixed suffix (e.g., `\\nI should answer now.\\nMy choice: **`) to force a decision. By running a single forward pass over this combined sequence, we extract both the hidden state trajectory of the reasoning *and* calibrated log-probabilities (`log P(Yes) - log P(No)`) at the final position.\n", - "- **Daily Dilemmas (Self-Honesty Subset):** Sourced from `wassname/daily_dilemmas-self-honesty` (adapted from the Reddit *AmITheAsshole* subreddit), these are moral dilemmas where honesty explicitly conflicts with other values. Simple prompting (e.g., \"You are honest\") often struggles here. By testing opposite personas on these dilemmas, we observe if structural shifts in reasoning (captured by $\\kappa$) correlate with actual preference flipping." + "- **Daily Dilemmas (Self-Honesty Subset):** Sourced from `wassname/daily_dilemmas-self-honesty` (adapted from the Reddit *AmITheAsshole* subreddit), these are moral dilemmas where honesty explicitly conflicts with other values. Simple prompting (e.g., \"You are honest\") often struggles here. By testing opposite personas on these dilemmas, we observe if structural shifts in reasoning (captured by $\\kappa$) correlate with actual preference flipping.\n", + "- **Incomplete Contrastive Pairs:** We use pairs of prompts that are identical except for a single persona-defining token (e.g., \"honest\" vs. \"dishonest\") and stop right before the model's response. Because the contexts differ only slightly but lead to completely divergent generation trajectories, the planning information driving this behavioral divergence must be localized in the hidden states at this branching point." ] }, { "cell_type": "code", "execution_count": null, - "id": "1f677aab", + "id": "58806579", "metadata": {}, "outputs": [], "source": [ @@ -43,7 +44,7 @@ { "cell_type": "code", "execution_count": null, - "id": "96c80ad7", + "id": "910b108d", "metadata": {}, "outputs": [], "source": [ @@ -71,7 +72,7 @@ { "cell_type": "code", "execution_count": null, - "id": "0774537a", + "id": "1621b19e", "metadata": {}, "outputs": [], "source": [ @@ -123,7 +124,7 @@ { "cell_type": "code", "execution_count": null, - "id": "cb27ea7c", + "id": "a6129f0c", "metadata": { "lines_to_next_cell": 2 }, diff --git a/experiment.py b/experiment.py index 3e2ee75..04bf748 100644 --- a/experiment.py +++ b/experiment.py @@ -18,6 +18,7 @@ # # - **Guided Chain-of-Thought (CoT) with Logprobs:** Standard teacher-forced evaluation misses how the reasoning process itself changes, while full on-policy generation is slow and hard to parse. The *Guided CoT* trick strikes a balance: we let the model generate a short reasoning trace (~32 tokens) greedily, then append a fixed suffix (e.g., `\nI should answer now.\nMy choice: **`) to force a decision. By running a single forward pass over this combined sequence, we extract both the hidden state trajectory of the reasoning *and* calibrated log-probabilities (`log P(Yes) - log P(No)`) at the final position. # - **Daily Dilemmas (Self-Honesty Subset):** Sourced from `wassname/daily_dilemmas-self-honesty` (adapted from the Reddit *AmITheAsshole* subreddit), these are moral dilemmas where honesty explicitly conflicts with other values. Simple prompting (e.g., "You are honest") often struggles here. By testing opposite personas on these dilemmas, we observe if structural shifts in reasoning (captured by $\kappa$) correlate with actual preference flipping. +# - **Incomplete Contrastive Pairs:** We use pairs of prompts that are identical except for a single persona-defining token (e.g., "honest" vs. "dishonest") and stop right before the model's response. Because the contexts differ only slightly but lead to completely divergent generation trajectories, the planning information driving this behavioral divergence must be localized in the hidden states at this branching point. # %% import torch