Mirror of https://github.com/wassname/Brukino_AntiPaSTO_Appetizer.git (synced 2026-06-27 17:13:50 +08:00)
png
@@ -2,6 +2,11 @@
 
 Testing whether the Frenet-Serret extrinsic curvature ($\kappa$) of a model's hidden state trajectory can predict structural shifts in the model's persona or criterion (e.g., eval-awareness, preference changes) without needing behavioral labels.
 
+**See [experiment.py](experiment.py) for the code.**
+
+
+
+
 ## Concepts & Motivation
 
 - **Guided Chain-of-Thought (CoT) with Logprobs:** Standard teacher-forced evaluation only measures the effect of an intervention on a single token, missing how the reasoning process itself changes. Full on-policy generation captures reasoning but is slow and hard to parse. The *Guided CoT* trick strikes a balance: we let the model generate a short reasoning trace (~32 tokens) greedily, then append a fixed suffix (e.g., `\nI should answer now.\nMy choice: **`) to force a decision. By running a single forward pass over this combined sequence, we extract both the hidden state trajectory of the reasoning *and* calibrated log-probabilities (`log P(Yes) - log P(No)`) at the final position. This provides a clean, bounded uncertainty estimate while capturing how personas or interventions alter the actual reasoning path.
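As a hedged illustration of the log-odds readout described in the Guided CoT bullet above, the sketch below computes `log P(Yes) - log P(No)` from the logits at the final position. It uses a toy four-token vocabulary; the function names and the token ids `yes_id` and `no_id` are hypothetical and are not taken from `experiment.py`.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def yes_no_log_odds(final_logits, yes_id, no_id):
    """log P(Yes) - log P(No) at the last position of the forced-answer sequence."""
    logp = log_softmax(final_logits)
    return logp[yes_id] - logp[no_id]


# Toy logits for a 4-token vocabulary; ids 0 and 1 stand in for "Yes" and "No".
logits = [2.0, 0.5, -1.0, 0.0]
score = yes_no_log_odds(logits, yes_id=0, no_id=1)
print(score)
```

Because both terms share the same log-softmax normaliser, the score equals the raw logit difference; the explicit normalisation is kept only to mirror the wording of the bullet.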
@@ -10,8 +15,6 @@ Testing whether the Frenet-Serret extrinsic curvature ($\kappa$) of a model's hi
 
 ## Setup
 
-This project is managed by `uv`.
-
 ### Requirements
 - Python 3.11+
 - `uv` installed
@@ -26,38 +29,6 @@ To sync the environment:
 uv sync
 
-## Running the Experiment
-
-You can explore the experiment either via the Jupyter Notebook or by running the generated Python script directly.
-
-### Via Notebook
-To spin up Jupyter Lab/Notebooks:
-```bash
-uv run jupyter notebook
-```
-Then open `experiment.ipynb` and run the cells.
-
-### Via Script
-To run the Python script directly (converted from the notebook via `jupytext`):
-```bash
-uv run python experiment.py
-```
-*(Note: Ensure you have your X11/Wayland display set up to see the matplotlib plot, or run with `MPLBACKEND=Agg` if headless.)*
-
-## How it Works
-
-We use the **Guided CoT trick**:
-1. Generate ~32 tokens of Chain of Thought reasoning (`n_think`) using greedy decoding.
-2. Force the model to transition to an answer by appending a specific suffix (`\nI should answer now.\nMy choice: **`).
-3. Run a single forward pass over the full sequence.
-4. Extract the final-layer hidden states during the reasoning step.
-5. Calculate the Frenet-Serret extrinsic curvature $\kappa(t) = \|\gamma''(t)\| / \|\gamma'(t)\|^3$ of these states using finite differences.
-6. Compare $\kappa(t)$ between opposite personas ("honest" vs. "dishonest" vs. "neutral baseline") on daily dilemmas.
-
-## Model
-The default script uses `Qwen/Qwen2.5-0.5B-Instruct` as it fits comfortably on small GPUs or CPUs. You can easily scale this up by changing `MODEL_NAME` in `experiment.ipynb`/`experiment.py`.
-
-
 ## See also
 
 - [RepEng]() A nice hackable activation steering repo
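Step 5 of the removed "How it Works" list can be sketched in plain Python as below. This is a minimal sketch under stated assumptions, not the code in `experiment.py`: the function names are hypothetical, velocity and acceleration use central differences with unit time step, and the curvature uses the general form $\kappa = \sqrt{\|\gamma'\|^2\|\gamma''\|^2 - (\gamma'\cdot\gamma'')^2} / \|\gamma'\|^3$. That form agrees with the expression quoted in step 5 only when $\gamma''$ is orthogonal to $\gamma'$; which variant the script actually computes is not shown in this diff.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def curvature(points):
    """Discrete curvature at the interior points of a trajectory.

    `points` is a list of equal-length vectors (one hidden state per token).
    Returns one value per interior point, so len(points) - 2 values.
    """
    kappas = []
    for prev, cur, nxt in zip(points, points[1:], points[2:]):
        vel = [(n - p) / 2.0 for p, n in zip(prev, nxt)]            # central first difference
        acc = [n - 2.0 * c + p for p, c, n in zip(prev, cur, nxt)]  # second difference
        vv, aa, va = dot(vel, vel), dot(acc, acc), dot(vel, acc)
        if vv == 0.0:
            kappas.append(float("nan"))  # curvature is undefined at a stationary point
            continue
        # Squared norm of the acceleration component normal to the velocity, times |v|^2.
        normal_sq = max(vv * aa - va * va, 0.0)
        kappas.append(math.sqrt(normal_sq) / vv ** 1.5)
    return kappas


# Sanity check on a circle of radius 2 embedded in 4 dimensions: expect roughly 1/2.
R, h = 2.0, 0.05
circle = [[R * math.cos(i * h), R * math.sin(i * h), 0.0, 0.0] for i in range(20)]
ks = curvature(circle)
print(min(ks), max(ks))
```

Comparing personas, as in step 6, would then amount to calling `curvature` on each persona's hidden-state trajectory and comparing the resulting sequences position by position.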
+4
-5
@@ -49,11 +49,10 @@ def get_s_space_svd(model):
     Returns: U, S, Vh
     """
     Ws = []
-    for layer in model.model.layers:
-        # In Qwen2, o_proj and down_proj weights are shape [hidden_size, in_features]
-        # We want a combined matrix of shape [hidden_size, sum(in_features)]
-        Ws.append(layer.self_attn.o_proj.weight.detach().cpu())
-        Ws.append(layer.mlp.down_proj.weight.detach().cpu())
+    matches = ["o_proj", "down_proj"]
+    for name, module in model.named_modules():
+        if any(m in name for m in matches):
+            Ws.append(module.weight.detach().cpu())
     W = torch.cat(Ws, dim=1).to(model.device)
 
     # SVD on the collective weight matrix
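The new loop selects modules by substring match on their dotted names. A minimal sketch of that filter on plain strings follows, assuming Qwen2-style names patterned after the attribute paths in the removed lines. The name list, the adapter-style child `o_proj.lora_A`, and both helper functions are hypothetical. It shows that a substring test also matches children nested under a target, whereas comparing only the last path component does not.

```python
TARGETS = ("o_proj", "down_proj")


def select_by_substring(names, targets=TARGETS):
    """Mirror of the `any(m in name ...)` test in the hunk above."""
    return [n for n in names if any(t in n for t in targets)]


def select_by_suffix(names, targets=TARGETS):
    """Stricter variant: keep a name only if its last component is a target."""
    return [n for n in names if n.rsplit(".", 1)[-1] in targets]


# Hypothetical module names; the lora_A entry imitates a wrapped projection.
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.0.self_attn.o_proj.lora_A",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
]

print(select_by_substring(names))
print(select_by_suffix(names))
```

Whether the difference matters depends on the model passed in: for a plain model whose projections have no child modules, the two filters should select the same entries.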
Binary file not shown. Size after: 105 KiB