Metadata-Version: 2.3
Name: persona-data
Version: 0.5.2
Summary: Shared dataset loading and prompt formatting for implicit-personalization projects
Requires-Dist: huggingface-hub>=0.30.0
Requires-Dist: pyarrow>=14.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: torch>=2.0.0
Requires-Dist: numpy>=1.24.0
Requires-Python: >=3.10
Description-Content-Type: text/markdown

# persona-data

[![Docs](https://img.shields.io/badge/docs-view-purple?logo=materialformkdocs)](https://implicit-personalization.github.io/persona-data/)
[![PyPI](https://img.shields.io/pypi/v/persona-data?logo=pypi&label=PyPI)](https://pypi.org/project/persona-data/)

Shared dataset loading, prompt formatting, and environment utilities for the [implicit-personalization](https://github.com/implicit-personalization) projects.

## What's in the box

- `SynthPersonaDataset` — persona profiles plus QA pairs ([docs](https://implicit-personalization.github.io/persona-data/synth_persona/))
- `PersonaGuessDataset` — turn-based persona games ([docs](https://implicit-personalization.github.io/persona-data/persona_guess/))
- `NemotronPersonasFranceDataset` / `NemotronPersonasUSADataset` — NVIDIA persona-only datasets ([docs](https://implicit-personalization.github.io/persona-data/nemotron_personas/))
- Roleplay and multiple-choice prompt helpers ([docs](https://implicit-personalization.github.io/persona-data/prompts/))
- Environment helpers: `set_seed`, `get_device`, `get_artifacts_dir`

## Installation

Add `persona-data` as a uv git source in your project's `pyproject.toml`:

```toml
[project]
dependencies = ["persona-data"]

[tool.uv.sources]
persona-data = { git = "ssh://git@github.com/implicit-personalization/persona-data.git" }
```

For local development alongside other repos:

```toml
[tool.uv.sources]
persona-data = { path = "../persona-data", editable = true }
```

Then run `uv sync`.

### Testing

```bash
uv run --with pytest pytest tests/test_datasets.py
```

The release workflow also runs `tests/smoke_test.py` against the built wheel and source distribution.

## Package layout

```
src/persona_data/
├── synth_persona.py       # SynthPersonaDataset, PersonaDataset, PersonaData, QAPair, Statement
├── persona_guess.py       # PersonaGuessDataset, GameRecord, Turn
├── nemotron_personas.py   # NemotronPersonasFranceDataset, NemotronPersonasUSADataset
├── prompts.py             # format_prompt, format_mc_question, format_messages
└── environment.py         # set_seed, get_device, get_artifacts_dir
```

## Quick start

```python
from persona_data.synth_persona import SynthPersonaDataset
from persona_data.prompts import format_messages, format_prompt

dataset = SynthPersonaDataset()
persona = dataset[0]

system_prompt = format_prompt(persona, "biography")
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Where did you grow up?"},
]
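# NOTE: `tokenizer` is not defined in this snippet; load one beforehand
# (assumed here to be a Hugging Face tokenizer with a chat template).
full_prompt, response_start_idx = format_messages(
    messages, tokenizer, add_generation_prompt=True
)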

# Leakage-aware train/test split: FRQs for train, shared MCQs for test.
train_qa, test_qa = dataset.train_test_split(persona.id)
```
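
The idea behind a leakage-aware split can be illustrated with a toy example. The sketch below is one plausible reading of "FRQs for train, MCQs for test" and is not the library's implementation: `ToyQA`, its `kind` field, and `toy_train_test_split` are hypothetical names, and the records are made-up placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyQA:
    """Hypothetical stand-in for a QA record; not the library's QAPair."""

    persona_id: str
    question: str
    answer: str
    kind: str  # "frq" (free response) or "mcq" (multiple choice)


def toy_train_test_split(pairs: list[ToyQA], persona_id: str):
    """Send one persona's free-response items to train and its MCQs to test."""
    own = [p for p in pairs if p.persona_id == persona_id]
    train = [p for p in own if p.kind == "frq"]
    test = [p for p in own if p.kind == "mcq"]
    # Guard against leakage: the same question text must not be on both sides.
    overlap = {p.question for p in train} & {p.question for p in test}
    if overlap:
        raise ValueError(f"questions appear in both splits: {sorted(overlap)}")
    return train, test


# Made-up placeholder records, for illustration only.
pairs = [
    ToyQA("p1", "Where did you grow up?", "Lyon", "frq"),
    ToyQA("p1", "Which city did you grow up in?", "Lyon", "mcq"),
    ToyQA("p2", "What is your job?", "Nurse", "frq"),
]
train, test = toy_train_test_split(pairs, "p1")
```

Splitting by question type rather than at random keeps the evaluation items out of the training pool by construction, which is the property the overlap check makes explicit.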

See the [docs](https://implicit-personalization.github.io/persona-data/) for full APIs.

## Used by

- [persona-vectors](https://github.com/implicit-personalization/persona-vectors) — activation extraction and steering
- [cues_attribution](https://github.com/implicit-personalization/io-analysis) — section-level ablation attribution
- [persona-2-lora](https://github.com/implicit-personalization/persona-2-lora) — LoRA-based persona internalization
