diff --git a/README.md b/README.md
index 6955bf6..69d151f 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,11 @@
 
 Testing whether the Frenet-Serret extrinsic curvature ($\kappa$) of a model's hidden state trajectory can predict structural shifts in the model's persona or criterion (e.g., eval-awareness, preference changes) without needing behavioral labels.
 
+**See [experiment.py](experiment.py) for the code.**
+
+
+![](kappa_trajectory.png)
+
 ## Concepts & Motivation
 
 - **Guided Chain-of-Thought (CoT) with Logprobs:** Standard teacher-forced evaluation only measures the effect of an intervention on a single token, missing how the reasoning process itself changes. Full on-policy generation captures reasoning but is slow and hard to parse. The *Guided CoT* trick strikes a balance: we let the model generate a short reasoning trace (~32 tokens) greedily, then append a fixed suffix (e.g., `\nI should answer now.\nMy choice: **`) to force a decision. By running a single forward pass over this combined sequence, we extract both the hidden state trajectory of the reasoning *and* calibrated log-probabilities (`log P(Yes) - log P(No)`) at the final position. This provides a clean, bounded uncertainty estimate while capturing how personas or interventions alter the actual reasoning path.
@@ -10,8 +15,6 @@ Testing whether the Frenet-Serret extrinsic curvature ($\kappa$) of a model's hi
 
 ## Setup
 
-This project is managed by `uv`.
-
 ### Requirements
 - Python 3.11+
 - `uv` installed
@@ -26,38 +29,6 @@ To sync the environment:
 uv sync
 ```
 
-## Running the Experiment
-
-You can explore the experiment either via the Jupyter Notebook or by running the generated Python script directly.
-
-### Via Notebook
-To spin up Jupyter Lab/Notebooks:
-```bash
-uv run jupyter notebook
-```
-Then open `experiment.ipynb` and run the cells.
-
-### Via Script
-To run the python script directly (converted from the notebook via `jupytext`):
-```bash
-uv run python experiment.py
-```
-*(Note: Ensure you have your X11/Wayland display setup to see the matplotlib plot, or run with `MPLBACKEND=Agg` if headless).*
-
-## How it Works
-
-We use the **Guided CoT trick**:
-1. Generate ~32 tokens of Chain of Thought reasoning (`n_think`) using greedy decoding.
-2. Force the model to transition to an answer by appending a specific suffix (`\nI should answer now.\nMy choice: **`).
-3. Run a single forward pass over the full sequence.
-4. Extract the final-layer hidden states during the reasoning step.
-5. Calculate the Frenet-Serret extrinsic curvature $\kappa(t) = \|\gamma''(t)\| / \|\gamma'(t)\|^3$ of these states using finite differences.
-6. Compare $\kappa(t)$ between opposite personas ("honest" vs. "dishonest" vs. "neutral baseline") on daily dilemmas.
-
-## Model
-The default script uses `Qwen/Qwen2.5-0.5B-Instruct` as it fits comfortably on small GPUs or CPUs. You can easily scale this up by changing `MODEL_NAME` in `experiment.ipynb`/`experiment.py`.
-
-
 ## See also
 
 - [RepEng]() A nice hackable activation steering repo
diff --git a/experiment.py b/experiment.py
index 2b6c7f1..e26359d 100644
--- a/experiment.py
+++ b/experiment.py
@@ -49,11 +49,10 @@ def get_s_space_svd(model):
     Returns: U, S, Vh
     """
     Ws = []
-    for layer in model.model.layers:
-        # In Qwen2, o_proj and down_proj weights are shape [hidden_size, in_features]
-        # We want a combined matrix of shape [hidden_size, sum(in_features)]
-        Ws.append(layer.self_attn.o_proj.weight.detach().cpu())
-        Ws.append(layer.mlp.down_proj.weight.detach().cpu())
+    mathes = ["o_proj", "down_proj"]
+    for name, module in model.named_modules():
+        if any(m in name for m in mathes):
+            Ws.append(module.weight.detach().cpu())
     W = torch.cat(Ws, dim=1).to(model.device)
 
     # SVD on the collective weight matrix
diff --git a/kappa_trajectory.png b/kappa_trajectory.png
new file mode 100644
index 0000000..fc9b080
Binary files /dev/null and b/kappa_trajectory.png differ
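
Reviewer note: the "How it Works" text removed from the README describes computing $\kappa(t)$ from hidden states with finite differences. The sketch below illustrates that step in pure Python. It is not the code in `experiment.py`; the function name `curvature`, the unit-step central differences, and the list-of-vectors input are all assumptions made for illustration. It also uses the general curvature formula $\kappa = \sqrt{\|\gamma'\|^2\|\gamma''\|^2 - (\gamma'\cdot\gamma'')^2}\,/\,\|\gamma'\|^3$, because the shorter form $\|\gamma''\|/\|\gamma'\|^3$ agrees with it only when $\gamma'' \perp \gamma'$ and so also counts speed changes along the path.

```python
import math


def curvature(points):
    """Discrete curvature of a trajectory given as a sequence of equal-length vectors.

    Hypothetical helper (not from experiment.py). Uses central differences with
    unit step between consecutive points and returns one value per interior point.
    """
    def dot(x, y):
        return sum(xi * yi for xi, yi in zip(x, y))

    kappas = []
    for prev, cur, nxt in zip(points, points[1:], points[2:]):
        # First derivative: (gamma[t+1] - gamma[t-1]) / 2
        v = [(n - p) / 2.0 for p, n in zip(prev, nxt)]
        # Second derivative: gamma[t+1] - 2 * gamma[t] + gamma[t-1]
        a = [n - 2.0 * c + p for p, c, n in zip(prev, cur, nxt)]
        vv, aa, va = dot(v, v), dot(a, a), dot(v, a)
        if vv == 0.0:
            # Curvature is undefined where the finite-difference velocity is zero.
            kappas.append(float("nan"))
            continue
        # Subtracting (v . a)^2 removes the acceleration component along v;
        # max(...) guards against tiny negative values from rounding.
        kappas.append(math.sqrt(max(vv * aa - va * va, 0.0)) / vv ** 1.5)
    return kappas
```

On a straight line traversed at varying speed this returns zeros, since the acceleration is parallel to the velocity, whereas $\|\gamma''\|/\|\gamma'\|^3$ would report nonzero values there. For real hidden states the same arithmetic would normally be vectorized over a `[tokens, hidden_size]` tensor.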