Metadata-Version: 2.4
Name: proportions
Version: 0.1.2
Summary: A comprehensive library for Bayesian and frequentist inference on grouped binomial proportions
Author-email: Javier Movellan <jmovellan@apple.com>
License: MIT
Project-URL: Homepage, https://gitlab.com/movellan/proportions
Project-URL: Repository, https://gitlab.com/movellan/proportions
Project-URL: Documentation, https://gitlab.com/movellan/proportions
Keywords: bayesian,statistics,proportions,beta-binomial,hierarchical-models
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Mathematics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24.0
Requires-Dist: scipy>=1.10.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: matplotlib>=3.7.0
Requires-Dist: pytest>=7.4.0
Requires-Dist: pytest-cov>=4.1.0
Provides-Extra: dev
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Provides-Extra: notebooks
Requires-Dist: jupyter>=1.0.0; extra == "notebooks"
Requires-Dist: jupyterlab>=4.0.0; extra == "notebooks"
Provides-Extra: viz
Requires-Dist: plotly>=5.14.0; extra == "viz"
Requires-Dist: seaborn>=0.12.0; extra == "viz"
Provides-Extra: docs
Requires-Dist: sphinx>=7.0.0; extra == "docs"
Requires-Dist: sphinx-rtd-theme>=1.3.0; extra == "docs"
Requires-Dist: nbsphinx>=0.9.0; extra == "docs"
Provides-Extra: all
Requires-Dist: proportions[dev,docs,notebooks,viz]; extra == "all"
Dynamic: license-file

# Proportions: Bayesian and Frequentist Inference for Grouped Binomial Data

[![PyPI version](https://badge.fury.io/py/proportions.svg)](https://pypi.org/project/proportions/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

A comprehensive Python library for estimating average success rates across multiple groups using Bayesian and frequentist methods.

## Overview

When you have binomial data from multiple groups (e.g., success rates across experiments, conversion rates across user segments, test pass rates across scenarios), this library helps you:

1. **Estimate the average success rate** across all groups
2. **Quantify uncertainty** with credible/confidence intervals
3. **Account for heterogeneity** between groups
4. **Compare different modeling approaches** with automatic diagnostics
5. **Detect data quality issues** and get recommendations

## Supported Methods

| Method | Description | When to Use |
|--------|-------------|-------------|
| **Hierarchical Bayes** ⭐ **RECOMMENDED** | Full Bayesian with importance sampling | Need honest uncertainty, unusual data |
| **Single-Theta Bayesian** | Pooled model (homogeneous groups) | Groups believed identical |
| **Clopper-Pearson** | Frequentist exact confidence intervals | Baseline comparison |
| **Empirical Bayes** ⚠️ **NOT RECOMMENDED** | Data-driven hyperparameter estimation | ⚠️ Under-covers (17% coverage at a nominal 95%); use HB instead |
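
For readers unfamiliar with the Clopper-Pearson baseline, the following is a minimal sketch of the textbook exact interval for a single binomial proportion, written with `scipy.stats.beta`. The function name `clopper_pearson` is hypothetical and is not taken from this library's API. Applying it to pooled counts (38 successes in 50 trials, the totals of the Quick Start data) is only an illustrative assumption about how grouped data might be summarized.

```python
from scipy.stats import beta


def clopper_pearson(x: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Exact two-sided confidence interval for x successes in n trials.

    Hypothetical helper for illustration; not part of the proportions API.
    """
    alpha = 1.0 - confidence
    # Endpoints are Beta quantiles; the boundary cases are fixed by convention.
    lower = 0.0 if x == 0 else float(beta.ppf(alpha / 2, x, n - x + 1))
    upper = 1.0 if x == n else float(beta.ppf(1 - alpha / 2, x + 1, n - x))
    return lower, upper


# Assumption: pool all groups into one binomial sample (38 of 50).
lower, upper = clopper_pearson(38, 50)
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")
```

The interval is conservative by construction: its coverage is at least the nominal level, which is why it serves as a useful baseline.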

## Installation

### Recommended: Using uv (fast, modern)

**First, install uv if you don't have it:**
```bash
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"

# Or via pip
pip install uv
```

For more installation options, see: https://docs.astral.sh/uv/getting-started/installation/

**Then, set up the project:**
```bash
# Clone the repository
git clone git@gitlab.com:movellan/proportions.git
cd proportions

# Install with uv (automatically creates .venv and installs all dependencies)
uv sync

# Verify installation by running tests
uv run pytest tests/ -v

# Run commands with uv (no activation needed!)
uv run python examples/01_basic_usage.py

# Or activate manually if you prefer
source .venv/bin/activate  # On macOS/Linux
# or
.venv\Scripts\activate  # On Windows
```

**Why uv?**
- No need to activate: `uv run` automatically uses the project venv
- Fast installation (10-100x faster than pip)
- Reproducible builds with uv.lock

### Alternative: Using pip

```bash
# Clone the repository
git clone git@gitlab.com:movellan/proportions.git
cd proportions

# Create virtual environment (optional but recommended)
python -m venv .venv
source .venv/bin/activate  # On macOS/Linux

# Install package
pip install -e ".[dev]"  # With development tools

# Verify installation by running tests
pytest tests/ -v
```

### From PyPI (Recommended for Users)

```bash
pip install proportions
```

## Quick Start

```python
import numpy as np
from proportions.core.models import BinomialData
from proportions.inference import hierarchical_bayes, single_theta_bayesian

# Your data: success counts (x) and trial counts (n) per group
x = np.array([8, 7, 9, 6, 8])  # successes
n = np.array([10, 10, 10, 10, 10])  # trials
data = BinomialData(x=x, n=n)

# Hierarchical Bayes (RECOMMENDED - accounts for all uncertainty)
hb_result = hierarchical_bayes(data, random_seed=42)
print(f"Average success rate: {hb_result.posterior.mu:.3f}")
print(f"95% CI: [{hb_result.posterior.ci_lower:.3f}, {hb_result.posterior.ci_upper:.3f}]")

# Single-Theta Bayesian (simpler, assumes homogeneity)
st_result = single_theta_bayesian(data, alpha_prior=1.0, beta_prior=1.0)
print(f"Average success rate: {st_result.posterior.mu:.3f}")
print(f"95% CI: [{st_result.posterior.ci_lower:.3f}, {st_result.posterior.ci_upper:.3f}]")
```
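
To make the pooled model concrete, here is a hand-computed sketch of the conjugate Beta update that a single-theta model with a Beta(1, 1) prior would typically perform on the data above. It uses only NumPy and SciPy, not the library itself. The equal-tailed interval is an assumption; the library may define its credible interval differently.

```python
import numpy as np
from scipy.stats import beta

x = np.array([8, 7, 9, 6, 8])       # successes per group
n = np.array([10, 10, 10, 10, 10])  # trials per group

# Beta(1, 1) prior plus pooled counts gives a Beta posterior (conjugacy).
alpha_prior, beta_prior = 1.0, 1.0
alpha_post = alpha_prior + x.sum()        # 1 + 38 = 39
beta_post = beta_prior + (n - x).sum()    # 1 + 12 = 13

posterior_mean = alpha_post / (alpha_post + beta_post)  # 39 / 52 = 0.75

# Assumption: equal-tailed 95% credible interval.
ci_lower, ci_upper = beta.ppf([0.025, 0.975], alpha_post, beta_post)
print(f"Posterior mean: {posterior_mean:.3f}")
print(f"95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]")
```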

## Features

### 🎯 Core Capabilities

- **Multiple estimation methods** with unified API
- **Automatic validation** of input data (Pydantic models)
- **Numerically stable** Beta distribution functions
- **Importance sampling** for Hierarchical Bayes
- **Moment matching** for posterior approximation
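
As an illustration of moment matching, the sketch below finds the Beta distribution whose mean and variance equal given target values. This is the standard method-of-moments formula; the helper name `beta_from_moments` is hypothetical, and the library's own approximation may differ in detail.

```python
def beta_from_moments(mean: float, variance: float) -> tuple[float, float]:
    """Return (alpha, beta) of the Beta distribution with the given moments.

    Hypothetical helper for illustration; not part of the proportions API.
    """
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    max_variance = mean * (1.0 - mean)
    if not 0.0 < variance < max_variance:
        raise ValueError("variance must lie in (0, mean * (1 - mean))")
    # For Beta(a, b): variance = mean * (1 - mean) / (a + b + 1).
    concentration = max_variance / variance - 1.0
    return mean * concentration, (1.0 - mean) * concentration


# Round trip: the moments of Beta(39, 13) recover its parameters.
a, b = 39.0, 13.0
target_mean = a / (a + b)
target_var = a * b / ((a + b) ** 2 * (a + b + 1.0))
alpha_hat, beta_hat = beta_from_moments(target_mean, target_var)
print(alpha_hat, beta_hat)
```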

### 📊 Comprehensive Diagnostics

- **Model evidence** (marginal likelihood) for method comparison
- **Bayes factors** with interpretation (decisive/strong/moderate/weak)
- **Effective Sample Size (ESS)** for importance sampling
- **Boundary detection** for prior specification issues
- **Data quality checks** (heterogeneity, sample sizes, extreme rates)
- **Variance decomposition** (within-group vs between-group uncertainty)
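
The effective sample size diagnostic can be sketched as follows, assuming the common Kish definition, ESS = (Σw)² / Σw², evaluated in log space. The function name is hypothetical and the library's exact definition is not confirmed here. An ESS ratio would then presumably be the ESS divided by the number of samples.

```python
import numpy as np
from scipy.special import logsumexp


def effective_sample_size(log_weights) -> float:
    """Kish effective sample size from unnormalized log importance weights.

    Hypothetical helper for illustration; not part of the proportions API.
    """
    log_w = np.asarray(log_weights, dtype=float)
    # (sum w)^2 / sum w^2, computed with logsumexp to avoid underflow.
    return float(np.exp(2.0 * logsumexp(log_w) - logsumexp(2.0 * log_w)))


ess_equal = effective_sample_size(np.zeros(100))               # all weights equal
ess_degenerate = effective_sample_size([0.0, -1000.0, -1000.0])  # one weight dominates
print(ess_equal, ess_degenerate)
```

Equal weights give an ESS equal to the sample count, while a single dominant weight drives it toward 1, which signals that the importance sampling estimate rests on very few draws.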

### 📈 Visualization Tools

- Data overview plots (scatter, histograms, heterogeneity)
- Prior and posterior distributions
- Method comparison plots
- Importance sampling diagnostics
- HTML reports with embedded plots

### 🔬 Based on Solid Theory

- **Beta-Binomial hierarchical models** with conjugate priors
- **Importance sampling** for intractable posteriors
- **Law of total variance** for proper uncertainty propagation
- **Numerical stability** via log-space computation
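
The law of total variance, Var(θ) = E[Var(θ | φ)] + Var(E[θ | φ]), can be illustrated with a weighted mixture. In the sketch below each component stands for the conditional distribution of θ given one hyperparameter draw φ. The function name and the example numbers are assumptions chosen for illustration, not values produced by the library.

```python
import numpy as np


def mixture_mean_and_variance(means, variances, weights):
    """Split the variance of a weighted mixture into two parts.

    Returns (total_mean, within, between), where within = E[Var(theta | phi)]
    and between = Var(E[theta | phi]). Hypothetical helper for illustration.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize the weights
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    total_mean = float(np.sum(w * means))
    within = float(np.sum(w * variances))
    between = float(np.sum(w * (means - total_mean) ** 2))
    return total_mean, within, between


total_mean, within, between = mixture_mean_and_variance(
    means=[0.7, 0.8], variances=[0.01, 0.01], weights=[1.0, 1.0]
)
total_variance = within + between
print(total_mean, within, between, total_variance)
```

Ignoring the second term is what makes a plug-in estimate look more certain than it should.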

## Example: Comparing Methods

```python
from proportions.core.models import BinomialData
from proportions.inference import hierarchical_bayes, single_theta_bayesian
import numpy as np

# Prepare data
x = np.array([8, 7, 9, 6, 8])
n = np.array([10, 10, 10, 10, 10])
data = BinomialData(x=x, n=n)

# Fit multiple methods
hb_result = hierarchical_bayes(data, random_seed=42)
st_result = single_theta_bayesian(data, alpha_prior=1.0, beta_prior=1.0)

# Compare via model evidence (marginal likelihood)
# Evidence is automatically computed for both methods!
print("Model Evidence Comparison:")
print(f"Hierarchical Bayes: {hb_result.log_marginal_likelihood:.2f}")
print(f"Single-Theta: {st_result.log_marginal_likelihood:.2f}")

# Calculate Bayes Factor
log_bf = hb_result.log_marginal_likelihood - st_result.log_marginal_likelihood
bf = np.exp(log_bf)
print(f"\nBayes Factor (HB vs ST): {bf:.2e}")
```
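
For intuition about what the single-theta evidence measures, a pooled model with a Beta prior has a closed-form marginal likelihood. The sketch below computes it in log space with SciPy. The function names are hypothetical, and whether the library includes the binomial coefficients in its reported value is an assumption; those terms cancel in any Bayes factor between models fit to the same data.

```python
import numpy as np
from scipy.special import betaln, gammaln


def log_binom_coeff(n, x):
    """Elementwise log of the binomial coefficient C(n, x)."""
    return gammaln(n + 1) - gammaln(x + 1) - gammaln(n - x + 1)


def pooled_log_evidence(x, n, alpha_prior=1.0, beta_prior=1.0) -> float:
    """Log marginal likelihood when every group shares one theta ~ Beta prior.

    Hypothetical helper for illustration; not part of the proportions API.
    """
    x = np.asarray(x, dtype=float)
    n = np.asarray(n, dtype=float)
    successes = x.sum()
    failures = (n - x).sum()
    return float(
        log_binom_coeff(n, x).sum()
        + betaln(alpha_prior + successes, beta_prior + failures)
        - betaln(alpha_prior, beta_prior)
    )


log_ev = pooled_log_evidence([8, 7, 9, 6, 8], [10, 10, 10, 10, 10])
print(f"Pooled log evidence: {log_ev:.2f}")
```

When log Bayes factors are large in magnitude, reporting `log_bf` directly is safer than `np.exp(log_bf)`, which can overflow or underflow.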

## Example: Custom Priors

```python
from proportions.core.models import BinomialData
from proportions.inference import hierarchical_bayes
import numpy as np

# Prepare data
x = np.array([8, 7, 9, 6, 8])
n = np.array([10, 10, 10, 10, 10])
data = BinomialData(x=x, n=n)

# Hierarchical Bayes with custom prior parameters
result = hierarchical_bayes(
    data,
    m_prior_alpha=2.0,    # Beta prior for m: E[m] = 2/(2+2) = 0.5
    m_prior_beta=2.0,     # More informative than uniform
    k_prior_min=0.1,      # Allow low concentration (high heterogeneity)
    k_prior_max=100.0,    # Moderate maximum concentration
    n_samples=10000,      # More samples for better approximation
    random_seed=42
)

# Check diagnostics
print(f"Posterior mean for m: {result.m_posterior_mean:.3f}")
print(f"Posterior mean for k: {result.k_posterior_mean:.3f}")
print(f"Effective Sample Size: {result.diagnostics.effective_sample_size:.1f}")
print(f"ESS Ratio: {result.diagnostics.ess_ratio:.3f}")

# Check for boundary issues
if result.diagnostics.k_at_upper_boundary:
    print("⚠️ Warning: k posterior near upper boundary, consider increasing k_prior_max")
if result.diagnostics.k_at_lower_boundary:
    print("⚠️ Warning: k posterior near lower boundary, consider decreasing k_prior_min")
```
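
To show how the prior parameters above could enter the computation, here is a minimal sketch of importance sampling that uses the prior as the proposal. It rests on several assumptions that the source does not confirm: the group rates follow Beta(m·k, (1−m)·k), m has a Beta prior, k has a uniform prior on [`k_prior_min`, `k_prior_max`], and the weights are beta-binomial likelihoods. The function `hb_prior_importance_sampling` is hypothetical and is not the library's implementation.

```python
import numpy as np
from scipy.special import betaln, logsumexp


def hb_prior_importance_sampling(
    x, n,
    m_prior_alpha=1.0, m_prior_beta=1.0,
    k_prior_min=0.1, k_prior_max=100.0,
    n_samples=5000, random_seed=42,
):
    """Posterior means of (m, k) by importance sampling from the prior.

    Hypothetical sketch; assumes theta_i ~ Beta(m*k, (1-m)*k) and k ~ Uniform.
    """
    rng = np.random.default_rng(random_seed)
    x = np.asarray(x, dtype=float)
    n = np.asarray(n, dtype=float)

    # Draw hyperparameters from the (assumed) priors.
    m = rng.beta(m_prior_alpha, m_prior_beta, size=n_samples)
    k = rng.uniform(k_prior_min, k_prior_max, size=n_samples)
    a = (m * k)[:, None]
    b = ((1.0 - m) * k)[:, None]

    # Beta-binomial log-likelihood summed over groups. Binomial coefficients
    # are omitted because they do not depend on (m, k).
    log_w = (betaln(a + x, b + n - x) - betaln(a, b)).sum(axis=1)

    # Self-normalized weights, computed stably in log space.
    w = np.exp(log_w - logsumexp(log_w))
    return float(np.sum(w * m)), float(np.sum(w * k))


m_mean, k_mean = hb_prior_importance_sampling(
    [8, 7, 9, 6, 8], [10, 10, 10, 10, 10], m_prior_alpha=2.0, m_prior_beta=2.0
)
print(f"Posterior mean for m: {m_mean:.3f}")
print(f"Posterior mean for k: {k_mean:.3f}")
```

Sampling from the prior is simple but can be inefficient when the posterior is concentrated, which is why the ESS diagnostics matter.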

## Project Status

**Current Version:** 0.1.2 (Production-Ready)

### ✅ Completed
- Core Pydantic data models with validation
- Prior specification interface
- Stable Beta distribution utilities
- **Hierarchical Bayes** ⭐ - Importance sampling with full uncertainty (RECOMMENDED)
- **Single-Theta Bayesian** - Pooled Bayesian estimation
- **Clopper-Pearson** - Frequentist confidence intervals
- **Empirical Bayes** ⚠️ - Grid search MLE (NOT RECOMMENDED - use Hierarchical Bayes instead)
- Comprehensive diagnostics (ESS, evidence, coverage analysis)
- Visualization tools (importance sampling, distributions)
- 194 passing tests with extensive coverage

### 📅 Future Enhancements
- Additional visualization options
- HTML report generation
- Interactive dashboards
- Extended documentation and tutorials

## Development

### Running Tests

```bash
uv run pytest                      # All tests
uv run pytest -v --cov=proportions # With coverage
```

### Code Quality

```bash
uv run ruff format .  # Format code
uv run ruff check .   # Lint
uv run mypy proportions/  # Type check
```

## Design Principles

1. **Modularity** - Separate concerns (models, priors, inference, diagnostics)
2. **Type Safety** - Pydantic models throughout
3. **Diagnostics First** - Always compute ESS, boundaries, evidence
4. **Numerical Stability** - Log-space computation, stable algorithms
5. **User-Friendly** - Simple API for common cases, power for experts

## Documentation

- **prompts/SESSION_STATE.md** - Current development status and recent changes
- **prompts/LIBRARY_DESIGN_PLAN.md** - Complete architecture and design
- **prompts/HIERARCHICAL_BAYES_SUMMARY.md** - Mathematical foundations and algorithms
- **examples/** - Jupyter notebooks demonstrating all methods and comparisons

## References

This library implements methods based on:
- **Beta-Binomial hierarchical models** - Conjugate Bayesian inference
- **Hierarchical Bayes** ⭐ **RECOMMENDED** - Importance sampling for posterior inference under hyperparameter uncertainty
- **Single-Theta Bayesian** - Pooled Bayesian estimation assuming homogeneity
- **Empirical Bayes** ⚠️ **NOT RECOMMENDED** - MLE of hyperparameters (under-covers, use HB instead)
- **Theory** - Law of total variance, model evidence (marginal likelihood), Bayes factors

## License

MIT License

## Contact

- **Author:** Javier Movellan
- **Email:** jmovellan@apple.com
- **Repository:** https://gitlab.com/movellan/proportions

## Citation

If you use this library in your research, please cite:

```bibtex
@software{proportions2025,
  author = {Movellan, Javier},
  title = {Proportions: Bayesian and Frequentist Inference for Grouped Binomial Data},
  year = {2025},
  url = {https://gitlab.com/movellan/proportions}
}
```
