Metadata-Version: 2.3
Name: persona-vectors
Version: 0.1.0
Summary: Library for extracting and analyzing persona vectors
Requires-Dist: persona-data>=0.1.0
Requires-Dist: nnsight>=0.6.1
Requires-Dist: nnterp>=1.3.0
Requires-Dist: plotly>=6.6.0
Requires-Dist: kaleido>=1.0.0
Requires-Dist: scikit-learn>=1.6.0
Requires-Dist: python-dotenv>=1.2.2
Requires-Dist: safetensors>=0.7.0
Requires-Dist: umap-learn>=0.5.7
Requires-Dist: torch>=2.10.0
Requires-Dist: torchvision>=0.26.0
Requires-Dist: tqdm>=4.67.3
Requires-Dist: transformers>=5.2.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown

# Persona Vectors

[![Docs](https://img.shields.io/badge/docs-view-purple?logo=github)](https://github.com/implicit-personalization/persona-vectors/tree/main/docs)

Extract persona-aligned activation vectors from language models and experiment with activation steering.

> [!WARNING]
> This project is still highly experimental 🚨

## Overview

Given a set of personas and evaluation questions, this project:

1. Formats each persona as a system prompt (short `templated` or long `biography`)
2. Extracts hidden states at each layer (with support for masking specific tokens afterwards)
3. Averages those hidden states across questions to produce a **persona vector** per layer

The resulting vectors can be compared across layers (cosine similarity) and eventually used for steering experiments.
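The masking/averaging step and the layer-wise comparison can be sketched as follows. This is a minimal illustration with NumPy and made-up shapes, not the library's actual API; the function names here are hypothetical.

```python
import numpy as np

def persona_vector(hidden_states: np.ndarray, token_mask: np.ndarray) -> np.ndarray:
    """Average masked hidden states into one vector per layer.

    hidden_states: (n_questions, n_tokens, hidden_size) for a single layer
    token_mask:    (n_questions, n_tokens), 1.0 for tokens to keep
    """
    masked = hidden_states * token_mask[..., None]
    # mean over kept tokens per question, then mean over questions
    per_question = masked.sum(axis=1) / token_mask.sum(axis=1, keepdims=True)
    return per_question.mean(axis=0)  # (hidden_size,)

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity, used to compare persona vectors across layers."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

In the real pipeline the hidden states come from nnsight forward passes and the mask selects the persona-relevant token spans.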

## Repository Layout

```
persona-vectors/
├── notebooks/
│   ├── notebook_extract.py      # Extract activations from model (minimal PoC)
│   ├── notebook_compare.py      # Use ActivationStore to load saved activations and compare variants
│   └── notebook_steer.py        # Steering experiments
├── src/persona_vectors/
│   ├── artifacts.py             # ActivationStore and artifact path helpers
│   ├── activations.py           # Core: extract_activations (nnsight forward passes)
│   ├── extraction.py            # Orchestration for extraction runs
│   ├── plots.py                 # Layer-wise similarity plots (Plotly)
│   ├── steering.py              # Steering vector computation and application
│   └── parser.py                # CLI argument parsing
├── artifacts/                   # Saved activations (gitignored)
├── docs/                        # Reference documentation
└── main.py                      # CLI entry point (WIP)
```

Dataset loading (`SynthPersonaDataset`, `PersonaGuessDataset`) and environment
helpers are provided by the sibling [persona-data](../persona-data) package.

For local development, uncomment the `path` source in `persona-vectors/pyproject.toml`
and keep `persona-data` checked out next to this repo. The committed config uses a
git source, so the package also installs cleanly in isolated environments.

## Installation

```bash
uv sync
cp .env.example .env
```

## Quickstart

```bash
# Extract activations (run this first)
uv run python -m notebooks.notebook_extract

# Load saved activations / compare variants
uv run python -m notebooks.notebook_compare

# Compute a steering vector from saved activations
uv run python main.py steer --persona-id <UUID> --model google/gemma-2-9b-it --layer 20
```

## Streamlit App

The Streamlit UI lives in the sibling [persona-ui](../persona-ui) repo.

## How It Works

### Notebooks

`notebook_extract.py` runs the full flow end to end:

1. Load dataset questions and answers
2. Extract per-question activations
3. Save them to disk
4. Mask and average the selected token spans

`notebook_compare.py` loads saved activations via `ActivationStore` and compares variants.

`notebook_steer.py` loads saved activations and computes a steering vector for a
selected persona.
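The steering step can be illustrated with a common difference-of-means sketch: subtract a baseline mean activation from the persona mean at one layer, then add the scaled result back into the hidden states. This is an assumption about the approach, not the exact logic in `steering.py`, and the names and `alpha` parameter here are hypothetical.

```python
import numpy as np

def steering_vector(persona_acts: np.ndarray, baseline_acts: np.ndarray) -> np.ndarray:
    """Difference of means at a single layer.

    persona_acts/baseline_acts: (n_samples, hidden_size)
    """
    return persona_acts.mean(axis=0) - baseline_acts.mean(axis=0)

def apply_steering(hidden_states: np.ndarray, vec: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the scaled steering vector to every token position's hidden state."""
    return hidden_states + alpha * vec
```

During generation this addition would be hooked into the chosen layer's forward pass (e.g. via nnsight) rather than applied offline.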

### Saved Format

Each extraction produces:

```
artifacts/activations/<model_dir>/<prompt_variant>/<persona_id>/
├── activations.safetensors   # Per-question hidden states
└── metadata.json             # persona_id, persona_name, questions, n_questions, num_layers, hidden_size
```

`<model_dir>` is the model name with `/` replaced by `__`.

The metadata stores the question text directly, so load-time analysis no longer needs
to re-resolve question IDs from the dataset. It also stores tensor shape fields so the
saved activations can be validated at load time.
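The path convention and the load-time shape check can be sketched like this. These helpers are hypothetical stand-ins for what `artifacts.py` provides, and the metadata values in the usage line are made up; real code would parse `metadata.json` from disk and load `activations.safetensors` via the safetensors library before checking.

```python
import json
from pathlib import Path

def activation_dir(root: str, model: str, variant: str, persona_id: str) -> Path:
    # <model_dir> is the model name with "/" replaced by "__"
    return Path(root) / "activations" / model.replace("/", "__") / variant / persona_id

def check_metadata(meta: dict, *, n_questions: int, num_layers: int, hidden_size: int) -> None:
    # compare the metadata's shape fields against the tensors actually loaded
    for key, value in (("n_questions", n_questions),
                       ("num_layers", num_layers),
                       ("hidden_size", hidden_size)):
        if meta.get(key) != value:
            raise ValueError(f"{key}: metadata has {meta.get(key)}, tensors have {value}")

# Example (illustrative values only)
meta = json.loads('{"n_questions": 2, "num_layers": 26, "hidden_size": 2304}')
check_metadata(meta, n_questions=2, num_layers=26, hidden_size=2304)  # passes silently
```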

## CLI

`extract` and `steer` are implemented. `analyze` is parsed but still raises
`NotImplementedError`.

```bash
# Extract activations
python main.py extract --model google/gemma-2-2b-it

# Analyze saved activations (not implemented yet)
python main.py analyze --out ./plots --similarity cosine

# Run steering (example)
python main.py steer --layer 10 --model "google/gemma-2-9b-it" --persona-id 005e1868-4e17-47e3-94fa-0d20e8d93662
```
