Metadata-Version: 2.3
Name: persona-vectors
Version: 0.5.0
Summary: Library for extracting and analyzing persona vectors
Requires-Dist: persona-data>=0.4.0
Requires-Dist: nnsight>=0.7.0
Requires-Dist: nnterp>=1.3.0
Requires-Dist: plotly>=6.6.0
Requires-Dist: kaleido>=1.0.0
Requires-Dist: scikit-learn>=1.6.0
Requires-Dist: python-dotenv>=1.2.2
Requires-Dist: safetensors>=0.7.0
Requires-Dist: umap-learn>=0.5.7
Requires-Dist: torch>=2.10.0
Requires-Dist: torchvision>=0.26.0
Requires-Dist: tqdm>=4.67.3
Requires-Dist: transformers>=5.8.0
Requires-Python: >=3.12
Description-Content-Type: text/markdown

# Persona Vectors

[![Docs](https://img.shields.io/badge/docs-view-purple?logo=github)](https://implicit-personalization.github.io/persona-vectors/)

Extract persona-aligned activation vectors from language models and analyze how persona prompts move hidden states.

> [!WARNING]
> This project is currently highly experimental 🚨

## Overview

Given a set of personas and evaluation questions, this project:

1. Formats each persona as a system prompt (short `templated` or long `biography`)
2. Extracts hidden states at each layer with configurable token masking
3. Saves per-question, per-layer hidden states, then averages them into persona-level views for analysis

The resulting vectors can be compared across layers (cosine similarity) and eventually used for steering experiments.
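The layer-wise comparison amounts to a cosine similarity between two personas' vectors at each layer. A minimal pure-Python sketch (the real code lives in `src/persona_vectors/`; names here are illustrative):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def layerwise_similarity(vectors_a, vectors_b):
    """Compare two personas layer by layer, given one vector per layer each."""
    return [cosine_similarity(a, b) for a, b in zip(vectors_a, vectors_b)]

# Two toy personas with two layers each: identical at layer 0, orthogonal at layer 1.
pirate = [[1.0, 0.0], [0.5, 0.5]]
sage = [[1.0, 0.0], [0.5, -0.5]]
print(layerwise_similarity(pirate, sage))  # [1.0, 0.0]
```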

## Repository Layout

```
persona-vectors/
├── notebooks/
│   ├── notebook_extract.py      # Extraction pipeline (primary working script)
│   ├── notebook_compare.py      # Load saved activations and compare variants
│   └── notebook_steer.py        # Steering experiments
├── src/persona_vectors/
│   ├── activations.py           # Core extraction helpers
│   ├── analysis.py              # PCA / UMAP projections and scatter plots
│   ├── artifacts.py             # Save/load/query activation artifact helpers
│   ├── plots.py                 # Plotly figures for layer-wise analysis
│   ├── steering.py              # Steering vector computation and application
│   └── parser.py                # CLI argument parsing
├── artifacts/                   # Saved activations (gitignored)
├── docs/                        # Reference documentation
└── main.py                      # CLI entry point
```

Dataset loading (`SynthPersonaDataset`) and environment helpers come from the
sibling [persona-data](../persona-data) package.

For local development, uncomment the `path` source in `pyproject.toml` and keep
`persona-data` checked out next to this repo.
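For reference, a uv path source for a sibling checkout typically looks like the following (the exact table contents depend on this repo's `pyproject.toml`):

```toml
[tool.uv.sources]
# Uncomment to develop against the sibling checkout instead of the published package:
persona-data = { path = "../persona-data", editable = true }
```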

## Installation

```bash
uv sync
cp .env.example .env
```

Python `>=3.12` is required.

## Quickstart

```bash
# Extract activations (run this first)
uv run python -m notebooks.notebook_extract

# Load saved activations / compare variants
uv run python -m notebooks.notebook_compare

# Build interactive persona-mean PCA and similarity plots from saved activations
uv run python main.py analyze --model google/gemma-2-9b-it --variant biography --mask-strategy answer_mean

# Compute a steering vector from saved activations
uv run python main.py steer --persona-id <UUID> --model google/gemma-2-9b-it --layer 20
```

## Streamlit App

The Streamlit UI lives in the sibling [persona-ui](../persona-ui) repo.

## How It Works

### Notebooks

`notebook_extract.py` runs a small end-to-end extraction example:

1. Load dataset questions and answers
2. Extract per-question activations
3. Save them to disk
4. Mask and average the selected token spans
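Step 4 reduces each question's `[seq_len, hidden]` activation matrix to a single vector by averaging only the masked token positions. A pure-Python sketch (the actual implementation lives in `activations.py`; names here are illustrative):

```python
def masked_mean(hidden_states, mask):
    """Average the per-token vectors where mask is True.

    hidden_states: list of per-token vectors, shape [seq_len][hidden].
    mask: list of bools selecting which tokens to keep
          (e.g. answer tokens for an `answer_mean`-style strategy).
    """
    kept = [vec for vec, keep in zip(hidden_states, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    hidden = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden)]

# Keep only the last two "answer" tokens and average them.
states = [[0.0, 0.0], [2.0, 4.0], [4.0, 0.0]]
mask = [False, True, True]
print(masked_mean(states, mask))  # [3.0, 2.0]
```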

`notebook_compare.py` uses `ActivationStore` to discover saved variants/personas,
then compares shared persona means across variants.

`notebook_steer.py` loads saved activations and computes a steering vector for a
selected persona.
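A common way to compute such a steering vector, and roughly what to expect here, is the difference between a persona's mean activation and a baseline mean at one layer; the vector is then added (optionally scaled) to the hidden state during generation. A hedged sketch with illustrative names, not the library's actual API:

```python
def steering_vector(persona_mean, baseline_mean):
    """Difference-of-means steering vector at a single layer."""
    return [p - b for p, b in zip(persona_mean, baseline_mean)]

def apply_steering(hidden, vector, alpha=1.0):
    """Add the scaled steering vector to a hidden-state vector."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

persona = [1.0, 3.0]
baseline = [0.5, 1.0]
vec = steering_vector(persona, baseline)           # [0.5, 2.0]
print(apply_steering([0.0, 0.0], vec, alpha=2.0))  # [1.0, 4.0]
```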

### Saved Format

Each extraction produces:

```
artifacts/activations/<model_dir>/<mask_strategy>/<prompt_variant>/
├── manifest.json             # tensor shape, persona names, sample ids
└── <persona_id>.safetensors
```

`<model_dir>` is the model name with `/` replaced by `__`.
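That sanitization is a single string replacement:

```python
def model_dir(model_name: str) -> str:
    """Filesystem-safe directory name for a model id, per the rule above."""
    return model_name.replace("/", "__")

print(model_dir("google/gemma-2-9b-it"))  # google__gemma-2-9b-it
```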

The manifest stores compact sample ids (`qa.qid`) instead of full question text,
plus tensor shape fields used for validation.
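An illustrative manifest might look like the following; the field names and values are guesses based on the description above, not the exact schema:

```json
{
  "personas": ["baseline_assistant"],
  "sample_ids": ["qa.0", "qa.1"],
  "tensor_shape": [2, 2, 3584]
}
```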

## CLI

Three subcommands are implemented: `extract`, `analyze`, and `steer`.

```bash
# Extract activations
# Defaults to all supported variants: templated and biography.
python main.py extract --model google/gemma-2-2b-it

# Extract only the Assistant baseline
python main.py extract --model google/gemma-2-2b-it --persona-id baseline_assistant

# Pick specific variants
python main.py extract --model google/gemma-2-2b-it --variants biography

# Run remotely on NDIF. If the remote fast path OOMs, extraction automatically
# retries that persona/variant with layer-chunked traces.
python main.py extract --model google/gemma-2-9b-it --backend remote

# Analyze saved activations
python main.py analyze --model google/gemma-2-9b-it --variant biography --mask-strategy answer_mean --out ./plots

# Run steering (example)
python main.py steer --layer 10 --model "google/gemma-2-9b-it" --persona-id 005e1868-4e17-47e3-94fa-0d20e8d93662

# Load steering activations extracted with a non-default mask strategy
python main.py steer --layer 10 --model "google/gemma-2-9b-it" --persona-id <UUID> --mask-strategy answer_previous
```
