Metadata-Version: 2.4
Name: pairadigm
Version: 0.5.3
Summary: Concept-Guided Chain-of-Thought (CGCoT) pairwise annotation using Large Language Models
Home-page: https://github.com/mlchrzan/pairadigm
Author: Michael Leon Chrzan
Author-email: Michael Leon Chrzan <mlchrzan1@gmail.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/mlchrzan/pairadigm
Project-URL: Bug Reports, https://github.com/mlchrzan/pairadigm/issues
Project-URL: Source, https://github.com/mlchrzan/pairadigm
Keywords: nlp,annotation,pairwise-comparison,llm,machine-learning,text-analysis
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.3.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: scipy>=1.7.0
Requires-Dist: plotly>=5.0.0
Requires-Dist: networkx>=2.6.0
Requires-Dist: choix>=0.3.5
Requires-Dist: python-dotenv>=0.19.0
Requires-Dist: google-genai>=0.1.0
Requires-Dist: statsmodels>=0.13.0
Requires-Dist: scikit-learn>=1.0.0
Requires-Dist: matplotlib>=3.4.0
Requires-Dist: seaborn>=0.11.0
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.30.0
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == "openai"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.5.0; extra == "anthropic"
Provides-Extra: all
Requires-Dist: openai>=1.0.0; extra == "all"
Requires-Dist: anthropic>=0.5.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=3.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=4.0.0; extra == "dev"
Requires-Dist: mypy>=0.950; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# pairadigm: A Python Library for Concept-Guided Chain-of-Thought Pairwise Measurement of Scalar Constructs Using Large Language Models

`pairadigm` is a Python library designed to streamline the creation of high-quality, continuous measurement scales from text using LLMs. It implements a **Concept-Guided Chain-of-Thought (CGCoT)** methodology to generate reasoned pairwise comparisons using state-of-the-art LLMs (e.g., Google Gemini, OpenAI GPTs, Anthropic Claude, and open-source models). It then converts these comparisons into continuous scores using the Bradley-Terry model and provides a pipeline both to evaluate LLM scores against human annotations and to fine-tune efficient encoder models (e.g., ModernBERT) as reward models for scaling measurement to larger datasets.

[![DOI](https://zenodo.org/badge/1071720356.svg)](https://doi.org/10.5281/zenodo.17981011)

## Overview

Pairadigm uses a CGCoT prompting approach to break down complex concepts into analyzable components, then performs pairwise comparisons to rank items using the Bradley-Terry model. It supports multiple LLM providers (Google Gemini, OpenAI, Anthropic, Ollama, HuggingFace) and includes validation tools for comparing LLM annotations against human judgments. 

A full example of the package in use is available in the `example.ipynb` notebook in the GitHub repository, and shorter illustrative snippets appear below.
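
To make the Bradley-Terry step mentioned above concrete, here is a minimal, dependency-free sketch of how pairwise decisions can be turned into item strengths. It is an illustration only: the function name `bradley_terry` and the `(winner, loser)` input format are assumptions made for this example, and `pairadigm`'s own `score_items()` may fit the model differently (the package lists `choix` as a dependency).

```python
from collections import defaultdict


def bradley_terry(pairs, n_iter=200, tol=1e-9):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic minorization-maximization (MM) update. Every item should
    have at least one win and one loss, otherwise its estimate degenerates.
    """
    items = sorted({item for pair in pairs for item in pair})
    wins = defaultdict(int)      # total wins per item
    n_games = defaultdict(int)   # number of comparisons per unordered pair
    for winner, loser in pairs:
        wins[winner] += 1
        n_games[frozenset((winner, loser))] += 1

    strength = {item: 1.0 for item in items}
    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = sum(
                n_games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in items
                if j != i and frozenset((i, j)) in n_games
            )
            updated[i] = wins[i] / denom if denom else strength[i]
        # Rescale so strengths average to 1 (the model is scale-invariant).
        total = sum(updated.values())
        updated = {k: v * len(items) / total for k, v in updated.items()}
        converged = max(abs(updated[k] - strength[k]) for k in items) < tol
        strength = updated
        if converged:
            break
    return strength


# Hypothetical decisions: "a" usually beats "b", and "b" usually beats "c".
pairs = (
    [("a", "b")] * 3 + [("b", "a")]
    + [("b", "c")] * 3 + [("c", "b")]
    + [("a", "c")] * 3 + [("c", "a")]
)
scores = bradley_terry(pairs)
print(sorted(scores, key=scores.get, reverse=True))
```

A larger strength means the item tends to win its comparisons, so sorting by strength gives the ranking on the target concept.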

## Updates for Version [0.5.3] - 2026-03-14 - Split Personality 🖖🏽
### Added 
- `generate_pairings()` now supports item-level train/eval/test splits via a new `make_splits` parameter, preventing data leakage when pairs are used to train a `RewardModel`. When enabled, splits are generated at the item level (no item appears in more than one split), and resulting pairs are tagged with `item1_split` and `item2_split` columns.
  - `test_size` (default `0.15`) and `eval_size` (default `0.15`) control the proportion of items assigned to each held-out split.
  - Passing a non-default `test_size` or `eval_size` automatically enables `make_splits=True` with a warning.
  - `include_mixed_pairs` (default `False`) optionally appends a small number of intentional cross-split pairs, spread evenly across the train×eval, train×test, and eval×test combinations, useful for diagnosing generalisation gaps.
  - `num_mixed_pairs` (default `10`) controls the total number of cross-split pairs added when `include_mixed_pairs=True`.
- Following the `generate_pairings()` update, the `RewardModel` class now respects the data splits that `generate_pairings()` creates. To encourage good data hygiene, it either asks users to pass splits with their pairs (when the model is used without a `Pairadigm`) or warns them of the data leakage risk.
- `test_client_connections()` function in `Pairadigm` to verify API connectivity for all LLMClients.
- Progress monitoring when generating breakdowns from pre-paired data. 
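
The item-level splitting described above can be pictured with the following sketch. It is not the library's implementation: `split_items` and `within_split_pairs` are hypothetical helpers, and the `item1`/`item2` keys are assumptions; only the `item1_split`/`item2_split` column names and the `0.15` defaults come from the notes above.

```python
import random
from itertools import combinations


def split_items(item_ids, test_size=0.15, eval_size=0.15, seed=0):
    """Assign each item to exactly one of 'train', 'eval', or 'test'."""
    ids = list(item_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_size)
    n_eval = round(len(ids) * eval_size)
    split = {}
    for position, item in enumerate(ids):
        if position < n_test:
            split[item] = "test"
        elif position < n_test + n_eval:
            split[item] = "eval"
        else:
            split[item] = "train"
    return split


def within_split_pairs(split):
    """Pair items only with other items from the same split."""
    pairs = []
    for a, b in combinations(sorted(split), 2):
        if split[a] == split[b]:
            pairs.append({
                "item1": a,
                "item2": b,
                "item1_split": split[a],
                "item2_split": split[b],
            })
    return pairs


split = split_items([f"item{k}" for k in range(20)])
pairs = within_split_pairs(split)
print(len(pairs), "pairs; no item crosses a split boundary")
```

Because the split is decided per item before any pairs are formed, a text seen during training can never reappear in an eval or test pair, which is the leakage the feature is meant to prevent.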

### Updated
- The Davidson model in `score_items()` now uses NumPy broadcasting for efficiency and has progress monitoring. 
- If a user passes `prior_breakdown_cols` to the initial `Pairadigm` constructor, the constructor will also create the `pairwise_df` without needing to call `generate_pairings(breakdowns=True)` separately.

### Fixed 
- Fixed a logic error when creating a `Pairadigm` from paired data: `generate_breakdowns_from_paired()` required `item_id_col` to be set, but this was not enforced. Now, if `item_id_col` is not set and `paired=True`, a default one is assigned (`item_id_DEFAULT`).

## Installation

### Prerequisites

- Python 3.8+
- API keys for your chosen LLM provider(s)

### Setup
In the terminal, follow these steps:
1. Install the package:
```bash
# For development version
pip install git+https://github.com/mlchrzan/pairadigm.git

# For latest stable release 
pip install pairadigm
```

2. Set up environment variables:
```bash
# Create a .env file in the project root
touch .env

# Add your API key(s) - choose based on your LLM provider
echo "GENAI_API_KEY=your_google_api_key_here" >> .env
# OR
echo "OPENAI_API_KEY=your_openai_api_key_here" >> .env
# OR
echo "ANTHROPIC_API_KEY=your_anthropic_api_key_here" >> .env
```
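
Before running a long job, it can help to confirm which keys are actually visible to Python. The sketch below only inspects the process environment; `configured_providers` is a hypothetical helper, not part of `pairadigm`, and it assumes the three variable names shown above. Keys that exist only in `.env` must first be loaded into the environment (for example with `python-dotenv`, which is listed as a dependency).

```python
import os

# Variable names taken from the setup step above.
PROVIDER_KEYS = {
    "google": "GENAI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}


def configured_providers(env=None):
    """Return the providers whose API key variable is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_KEYS.items() if env.get(var)]


print("Providers with keys set:", configured_providers())
```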

## Quick Start

Below are the basic workflows for using the package. A full example is available in the Jupyter notebook `example.ipynb`.

### Basic Workflow: Unpaired Items

WARNING: If loading CGCoT prompts from .txt files, ensure the files do NOT contain double spaces, as these will be interpreted as an additional prompt.

```python
import pandas as pd
from pairadigm import Pairadigm

# Load your data
df = pd.DataFrame({
    'id': ['item1', 'item2', 'item3'],
    'text': ['Text content 1', 'Text content 2', 'Text content 3']
})

# Define CGCoT prompts for your concept
cgcot_prompts = [
    "Analyze the following text for objectivity: {text}",
    "Based on the previous analysis: {previous_answers}\nIdentify any subjective language."
]

# Initialize Pairadigm
p = Pairadigm(
    data=df,
    item_id_name='id',
    text_name='text',
    cgcot_prompts=cgcot_prompts,
    model_name='gemini-2.0-flash-exp',
    target_concept='objectivity'
)

# Generate CGCoT breakdowns
p.generate_breakdowns(max_workers=4)

# Create pairings
p.generate_pairings(num_pairs_per_item=5, breakdowns=True)

# Generate pairwise annotations
p.generate_pairwise_annotations(max_workers=4)

# Compute Bradley-Terry scores
scored_df = p.score_items()

# Visualize results
p.plot_score_distribution()
p.plot_comparison_network()
```

### Using Multiple LLMs

```python
# Initialize with multiple models
p = Pairadigm(
    data=df,
    item_id_name='id',
    text_name='text',
    cgcot_prompts=cgcot_prompts,
    model_name=['gemini-2.0-flash-exp', 'gpt-4o', 'claude-sonnet-4'],
    target_concept='objectivity'
)

# View available clients
print(p.get_clients_info())

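# Generate breakdowns with all models
p.generate_breakdowns(max_workers=4)

# Create pairings (same step as in the basic workflow)
p.generate_pairings(num_pairs_per_item=5, breakdowns=True)

# Generate annotations with all models
p.generate_pairwise_annotations(max_workers=4)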

# Score items for each model
scored_df_gemini = p.score_items(decision_col='decision_gemini-2.0-flash-exp')
scored_df_gpt = p.score_items(decision_col='decision_gpt-4o')
scored_df_claude = p.score_items(decision_col='decision_claude-sonnet-4')
```

### Working with Pre-Paired Data

```python
# Data with pre-existing pairs
paired_df = pd.DataFrame({
    'item1_id': ['a', 'b', 'c'],
    'item2_id': ['b', 'c', 'a'],
    'item1_text': ['Text A', 'Text B', 'Text C'],
    'item2_text': ['Text B', 'Text C', 'Text A']
})

p = Pairadigm(
    data=paired_df,
    paired=True,
    item_id_cols=['item1_id', 'item2_id'],
    item_text_cols=['item1_text', 'item2_text'],
    cgcot_prompts=cgcot_prompts,
    target_concept='political_bias'
)

# Generate breakdowns for paired items
p.generate_breakdowns_from_paired(max_workers=4)

# Continue with annotations and scoring...
p.generate_pairwise_annotations()
p.score_items()
```

### Adding Human Annotations

```python
# Create human annotation data
human_anns = pd.DataFrame({
    'item1': ['id1', 'id2'],
    'item2': ['id2', 'id3'],
    'annotator1': ['Text1', 'Text2'],
    'annotator2': ['Text2', 'Text1']
})

# Add to existing Pairadigm object
p.append_human_annotations(
    annotations=human_anns,
    decision_cols=['annotator1', 'annotator2']
)

# Or load from file
p.append_human_annotations(
    annotations='human_annotations.csv',
    annotator_names=['expert1', 'expert2']
)
```

### Validating Against Human Annotations

```python
# Data with human annotations
annotated_df = pd.DataFrame({
    'item1': ['a', 'b'],
    'item2': ['b', 'c'],
    'item1_text': ['Text A', 'Text B'],
    'item2_text': ['Text B', 'Text C'],
    'human1': ['Text1', 'Text2'],  # Human annotator choices
    'human2': ['Text1', 'Text1']
})

p = Pairadigm(
    data=annotated_df,
    paired=True,
    annotated=True,
    item_id_cols=['item1', 'item2'],
    item_text_cols=['item1_text', 'item2_text'],
    annotator_cols=['human1', 'human2'],
    cgcot_prompts=cgcot_prompts,
    target_concept='sentiment'
)

# Run LLM annotations
p.generate_breakdowns_from_paired()
p.generate_pairwise_annotations()

# Validate using ALT test
winning_rate, advantage_prob = p.alt_test(
    scoring_function='accuracy',
    epsilon=0.1,
    q_fdr=0.05
)

print(f"LLM winning rate: {winning_rate:.2%}")
print(f"Advantage probability: {advantage_prob:.2%}")

# Test all LLMs at once (if using multiple models)
results = p.alt_test(test_all_llms=True)
for model_name, (win_rate, adv_prob) in results.items():
    print(f"{model_name}: Win Rate={win_rate:.2%}, Advantage={adv_prob:.2%}")

# Check transitivity
transitivity_results = p.check_transitivity()
for annotator, (score, violations, total) in transitivity_results.items():
    print(f"{annotator}: {score:.2%} transitivity ({violations}/{total} violations)")

# Calculate inter-rater reliability
irr_results = p.irr(method='auto')
print(irr_results)

# Dawid-Skene validation (accounts for annotator reliability)
ds_results = p.dawid_skene_alt_test(
    alpha=0.05,
    use_by_correction=True
)
print(f"Dawid-Skene Winning Rate: {ds_results['winning_rate']:.2%}")

# Rank all annotators by reliability
ranking = p.dawid_skene_annotator_ranking(random_seed=42)
print(ranking[['annotator', 'reliability', 'rank', 'type']])
```
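
As a rough picture of what a transitivity check measures, the sketch below counts cyclic triads (A beats B, B beats C, C beats A) among fully compared triples. It is a simplified illustration under stated assumptions: `count_intransitive_triads` is a hypothetical helper, decisions are given as `(winner, loser)` tuples, ties are ignored, and `check_transitivity()` may define violations and totals differently.

```python
from itertools import combinations


def count_intransitive_triads(decisions):
    """Return (violations, total) over triads where all three pairs were judged."""
    decisions = list(decisions)
    beats = {}
    for winner, loser in decisions:
        # If a pair was judged more than once, the last decision is kept.
        beats[frozenset((winner, loser))] = winner
    items = sorted({item for pair in decisions for item in pair})

    violations = total = 0
    for a, b, c in combinations(items, 3):
        edges = [frozenset(pair) for pair in ((a, b), (b, c), (a, c))]
        if not all(edge in beats for edge in edges):
            continue  # skip triads with a missing comparison
        total += 1
        # In a cycle every item wins exactly once, so there are three winners.
        if len({beats[edge] for edge in edges}) == 3:
            violations += 1
    return violations, total


consistent = [("a", "b"), ("b", "c"), ("a", "c")]
cyclic = [("a", "b"), ("b", "c"), ("c", "a")]
print(count_intransitive_triads(consistent))
print(count_intransitive_triads(cyclic))
```

A transitivity score can then be taken as `1 - violations / total` whenever `total` is non-zero.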

## CGCoT Prompts

CGCoT prompts are the backbone of Pairadigm's analysis. Design them to progressively analyze your target concept:

### Loading Prompts from File

```python
# prompts.txt format:
# What factual claims are made in this text? {text}
# Based on: {text} Are these claims supported by evidence?
# Does the language show emotional bias?

p.set_cgcot_prompts('prompts.txt')
```
WARNING: If loading CGCoT prompts from .txt files, ensure the files do NOT contain double spaces, as these will be interpreted as an additional prompt.

### Best Practices

1. **First prompt**: Identify relevant elements using `{text}` placeholder
2. **Middle prompts**: Build on `{previous_answers}` to deepen analysis
3. **Final prompt**: Synthesize findings related to target concept
4. Keep prompts focused and sequential
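
The chaining implied by these practices can be sketched as follows. This is an assumption-laden illustration rather than the library's internals: `run_cgcot` and `fake_llm` are hypothetical, and the sketch assumes each template is filled with `str.format`, with `{previous_answers}` receiving all earlier answers joined by newlines.

```python
def run_cgcot(prompts, text, ask_llm):
    """Run prompts in order, feeding earlier answers into later prompts."""
    answers = []
    for template in prompts:
        prompt = template.format(
            text=text,
            previous_answers="\n".join(answers),
        )
        answers.append(ask_llm(prompt))
    return answers


def fake_llm(prompt):
    """Stand-in for a real model call; returns a truncated echo of the prompt."""
    return f"[answer to: {prompt[:40]}]"


prompts = [
    "Analyze the following text for objectivity: {text}",
    "Based on the previous analysis: {previous_answers}\nIdentify any subjective language.",
]
for answer in run_cgcot(prompts, "The mayor's plan is obviously brilliant.", fake_llm):
    print(answer)
```

Templates that contain literal braces other than the two placeholders would need escaping (`{{` and `}}`) under this `str.format` assumption.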

## Advanced Features

### Save and Load Analysis

```python
# Save your analysis
p.save('my_analysis.pkl')

# Load it later
from pairadigm import load_pairadigm
p = load_pairadigm('my_analysis.pkl')
```

### Fine-Tuning with RewardModel

```python
from pairadigm import RewardModel

# Prepare training data from pairwise comparisons
training_pairs = [
    ("Text with high score", "Text with low score", 1.0),
    ("Better text", "Worse text", 1.0),
    # ... more pairs
]

# Initialize and train reward model
reward_model = RewardModel(
    model_name="answerdotai/ModernBERT-large",
    dropout=0.1,
    max_length=384
)

train_loader = reward_model.prepare_data(training_pairs, batch_size=16)
reward_model.train(train_loader, epochs=3, learning_rate=2e-5)

# Score new texts
score = reward_model.score_text("New text to evaluate")
scores = reward_model.score_batch(["Text 1", "Text 2", "Text 3"])

# Normalize scores to desired scale (e.g., 1-9)
normalized = reward_model.normalize_scores(scores, scale_min=1.0, scale_max=9.0)

# Save trained model
reward_model.save('my_reward_model.pt')

# Load later
reward_model.load('my_reward_model.pt')
```
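
One common way to map raw scores onto a fixed scale is min-max rescaling, sketched below. Whether `normalize_scores()` uses exactly this formula is an assumption; `min_max_normalize` is a hypothetical stand-alone function written only to show the idea.

```python
def min_max_normalize(scores, scale_min=1.0, scale_max=9.0):
    """Linearly rescale scores so the lowest maps to scale_min, the highest to scale_max."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: place them at the midpoint of the scale.
        midpoint = (scale_min + scale_max) / 2
        return [midpoint for _ in scores]
    span = scale_max - scale_min
    return [scale_min + (s - lo) / (hi - lo) * span for s in scores]


print(min_max_normalize([-2.0, 0.0, 2.0]))
```

Note that min-max rescaling depends on the batch being normalized, so scores normalized in separate batches are not directly comparable unless the same minimum and maximum are reused.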

### Custom Scoring Functions

```python
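def custom_similarity(pred, annotations):
    # Example logic only. Assumes `annotations` is a non-empty iterable of
    # labels comparable to `pred`; replace with your own scoring rule.
    annotations = list(annotations)
    return sum(ann == pred for ann in annotations) / len(annotations)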

winning_rate, advantage_prob = p.alt_test(
    scoring_function=custom_similarity
)
```

### Rate Limiting

```python
# Limit API calls to 10 per minute
p.generate_breakdowns(
    max_workers=4,
    rate_limit_per_minute=10
)
```
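
For intuition about how a per-minute limit can be shared across worker threads, here is a small sketch of a minimum-interval limiter. It is not `pairadigm`'s implementation: `MinIntervalLimiter` is a hypothetical class, and the injectable `clock` and `sleep` arguments exist only to make the behavior easy to test.

```python
import threading
import time


class MinIntervalLimiter:
    """Space calls at least 60 / rate_per_minute seconds apart across threads."""

    def __init__(self, rate_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rate_per_minute
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = None  # earliest time the next call may start

    def wait(self):
        """Block until this caller's reserved time slot arrives."""
        with self._lock:
            now = self._clock()
            slot = now if self._next_slot is None else max(now, self._next_slot)
            self._next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)


# Demonstration with a frozen clock so nothing actually sleeps.
recorded_sleeps = []
limiter = MinIntervalLimiter(10, clock=lambda: 0.0, sleep=recorded_sleeps.append)
for _ in range(3):
    limiter.wait()
print(recorded_sleeps)
```

Each worker would call `limiter.wait()` immediately before its API request; the lock only protects slot reservation, so threads sleep concurrently rather than queueing behind one another.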

## API Reference

### Pairadigm Class

**Constructor Parameters:**
- `data`: Input DataFrame
- `item_id_name`: Column name for item IDs (unpaired data)
- `text_name`: Column name for item text (unpaired data)
- `paired`: Whether data is pre-paired
- `item_id_cols`: List of 2 ID columns (paired data)
- `item_text_cols`: List of 2 text columns (paired data)
- `annotated`: Whether data has human annotations
- `annotator_cols`: List of human annotation columns
- `llm_annotator_cols`: List of LLM annotation columns
- `prior_breakdown_cols`: List of existing breakdown columns
- `cgcot_prompts`: List of CGCoT prompt templates
- `model_name`: LLM model identifier(s) - can be string or list of strings
- `target_concept`: Concept being evaluated
- `api_key`: API key(s) for LLM service(s) - can be string or list
- `llm_clients`: Pre-initialized LLMClient(s) - alternative to model_name/api_key

**Key Methods:**
- `generate_breakdowns()`: Create CGCoT analyses for items
- `generate_breakdowns_from_paired()`: Create breakdowns for paired data
- `generate_pairings()`: Create pairwise combinations
- `generate_pairwise_annotations()`: Run LLM comparisons
- `append_human_annotations()`: Add human judgments to analysis
- `score_items()`: Compute Bradley-Terry scores
- `alt_test()`: Validate against human annotations
- `dawid_skene_alt_test()`: Validate with annotator reliability weighting
- `dawid_skene_annotator_ranking()`: Rank annotators by reliability
- `irr()`: Calculate inter-rater reliability
- `check_transitivity()`: Check annotation consistency
- `plot_score_distribution()`: Visualize score distribution
- `plot_comparison_network()`: Visualize comparison graph
- `get_clients_info()`: View information about LLM clients

## Example Datasets

The `data/` directory contains sample datasets to help you get started:

- `emobank.csv`: Full EmoBank dataset with emotional dimension ratings
- `emobank_sample.csv`: Smaller sample for quick testing
- `emobank_small_sample_simAnnotations.csv`: Sample with simulated annotations
- `cgcot_prompts/`: Example prompt files for arousal, dominance, and valence concepts

## Citation

If you use `pairadigm` in your research, please cite:

```bibtex
@software{pairadigm2025,
  author = {Chrzan, M.L.},
  title = {pairadigm: A Python Library for Concept-Guided Chain-of-Thought Pairwise Measurement of Scalar Constructs Using Large Language Models},
  year = {2026},
  month = {March},
  version = {0.5.3},
  url = {https://github.com/mlchrzan/pairadigm},
  doi = {10.5281/zenodo.17981011}
}
```

## License

Apache 2.0 License

## Contributing

Contributions are welcome! Please:
1. Fork the repository
2. Create a feature branch
3. Submit a pull request

## Support

For questions and issues:
- Open an issue on GitHub
- Check the example notebooks in the repository
- Review the docstrings in `pairadigm.py`

## Upcoming Features
- Performance improvement for multiple models by parallelizing API calls across models, not just within models
- Enhanced validation metrics and visualizations (IN PROGRESS, recommendations welcome!)
    - Improved inter-rater reliability visualizations
    - Item evaluation metrics and visualizations 
- Conversion from Likert-scale annotation to pairwise
- Dawid-Skene item ground truth estimation with and without LLM annotators (NOT STARTED)
- Updated score_items to use the Dawid-Skene estimated ground truth (NOT STARTED)
- Update Dawid-Skene methods to generate multiple runs to examine stability (for now, we recommend examining variance independently over multiple seeds)
- Support for multiple concepts simultaneously (NOT STARTED)

# Previous Updates (see CHANGELOG.md for all)

## Updates for version [0.5.1] - 2025-12-14 - A Big Hug! 🤗
### Added 
- Early stopping functionality to RewardModel's finetuning process based on validation loss to prevent overfitting.
- Finetuning now returns the best model based on validation performance rather than the last epoch.
- RewardModel class now includes a `push_to_hub()` method to upload the finetuned model to Hugging Face Model Hub for easy sharing and deployment.
- Now includes support in LLMClient for calling inference via Hugging Face's Inference API, allowing users to leverage Hugging Face-hosted models seamlessly.

## Updates for version 0.4.1 - 2025-12-07

### Added
- **RewardModel Class**: Fine-tune ModernBERT (or other BERT-type model) for scalar construct measurement using reward modeling
  - Train models on pairwise comparison data
  - Score individual texts or batches on continuous scales
  - Support for custom dropout, max length, and device settings
  - Built-in score normalization to desired scales
  - Save/load trained models for reuse
- Support for Ollama LLMs (local models) with `think` parameter
- `build_pairadigm()` function to run full pipeline in one command
- Enhanced progress monitoring for CGCoT breakdown generation

## Updates for version 0.3.1 - 2025-11-12

### Added
- Users can now adjust the `max_tokens` and `temperature` parameters when generating breakdowns and pairwise annotations.
- Added progress monitoring for breakdown generation (both pre-paired and not)
- Added a `base_url` parameter to LLMClient to support custom API endpoints for LLM providers (currently only OpenAI).
- Introduced a new "Tie" annotation option to indicate no preference between two items.
- Added `plot_epsilon_sensitivity()` to visualize how varying the epsilon parameter affects the Alt-Test Win Rate.

### Fixed
- `irr` now checks for Tie annotations and handles them correctly when calculating inter-rater reliability.
- `check_transitivity` accounts for Tie annotations in its logic of counting violations.
- `score_items` updated to use the Davidson model when Ties are present, instead of Bradley-Terry.
- `plot_comparison_network` gives a warning if Tie annotations are present, as they cannot be represented in a directed graph.
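
For readers unfamiliar with the Davidson model referenced above, its standard textbook form is given here for orientation. This is the general formulation, not a transcription of the package's code; the symbols $\pi_i$ (item strength) and $\nu \ge 0$ (tie parameter) are notation introduced for this note.

```latex
P(i \succ j) = \frac{\pi_i}{\pi_i + \pi_j + \nu\sqrt{\pi_i \pi_j}},
\qquad
P(i \sim j) = \frac{\nu\sqrt{\pi_i \pi_j}}{\pi_i + \pi_j + \nu\sqrt{\pi_i \pi_j}}
```

Setting $\nu = 0$ removes the tie outcome and recovers the Bradley-Terry probability $\pi_i / (\pi_i + \pi_j)$.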
