Metadata-Version: 2.4
Name: pycaroline
Version: 0.2.0
Summary: Data validation library for comparing tables across cloud data warehouses, cloud storage, and databases
Project-URL: Homepage, https://github.com/ryankarlos/pycaroline
Project-URL: Documentation, https://ryankarlos.github.io/pycaroline
Project-URL: Repository, https://github.com/ryankarlos/pycaroline.git
Project-URL: Issues, https://github.com/ryankarlos/pycaroline/issues
Project-URL: Changelog, https://github.com/ryankarlos/pycaroline/blob/main/CHANGELOG.md
Author-email: Ryan Nazareth <ryankarlos@gmail.com>
Maintainer-email: Ryan Nazareth <ryankarlos@gmail.com>
License: MIT
License-File: LICENSE
Keywords: bigquery,comparison,data,data-quality,datacompy,etl,gcs,migration,mysql,pandas,polars,postgresql,redshift,s3,snowflake,validation
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: click>=8.0.0
Requires-Dist: datacompy==0.19.1
Requires-Dist: polars>=1.0.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: all
Requires-Dist: boto3>=1.26.0; extra == 'all'
Requires-Dist: gcsfs>=2023.1.0; extra == 'all'
Requires-Dist: google-cloud-bigquery>=3.0.0; extra == 'all'
Requires-Dist: google-cloud-storage>=2.0.0; extra == 'all'
Requires-Dist: mysql-connector-python>=8.0.0; extra == 'all'
Requires-Dist: pandas-gbq>=0.17.0; extra == 'all'
Requires-Dist: pandas>=2.0.0; extra == 'all'
Requires-Dist: psycopg2-binary>=2.9.0; extra == 'all'
Requires-Dist: redshift-connector>=2.0.0; extra == 'all'
Requires-Dist: s3fs>=2023.1.0; extra == 'all'
Requires-Dist: snowflake-connector-python>=3.0.0; extra == 'all'
Provides-Extra: bigquery
Requires-Dist: google-cloud-bigquery>=3.0.0; extra == 'bigquery'
Requires-Dist: pandas-gbq>=0.17.0; extra == 'bigquery'
Provides-Extra: dev
Requires-Dist: black>=24.0.0; extra == 'dev'
Requires-Dist: boto3>=1.26.0; extra == 'dev'
Requires-Dist: gcsfs>=2023.1.0; extra == 'dev'
Requires-Dist: google-cloud-bigquery>=3.0.0; extra == 'dev'
Requires-Dist: google-cloud-storage>=2.0.0; extra == 'dev'
Requires-Dist: hypothesis>=6.0.0; extra == 'dev'
Requires-Dist: mkdocs-jupyter>=0.24.0; extra == 'dev'
Requires-Dist: mkdocs-material>=9.0.0; extra == 'dev'
Requires-Dist: mkdocs>=1.5.0; extra == 'dev'
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: mysql-connector-python>=8.0.0; extra == 'dev'
Requires-Dist: pandas-gbq>=0.17.0; extra == 'dev'
Requires-Dist: pandas>=2.0.0; extra == 'dev'
Requires-Dist: pre-commit>=3.0.0; extra == 'dev'
Requires-Dist: psycopg2-binary>=2.9.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.10.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: redshift-connector>=2.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Requires-Dist: s3fs>=2023.1.0; extra == 'dev'
Requires-Dist: snowflake-connector-python>=3.0.0; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-jupyter>=0.24.0; extra == 'docs'
Requires-Dist: mkdocs-material>=9.0.0; extra == 'docs'
Requires-Dist: mkdocs>=1.5.0; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == 'docs'
Provides-Extra: gcs
Requires-Dist: gcsfs>=2023.1.0; extra == 'gcs'
Requires-Dist: google-cloud-storage>=2.0.0; extra == 'gcs'
Provides-Extra: mysql
Requires-Dist: mysql-connector-python>=8.0.0; extra == 'mysql'
Provides-Extra: pandas
Requires-Dist: pandas>=2.0.0; extra == 'pandas'
Provides-Extra: postgresql
Requires-Dist: psycopg2-binary>=2.9.0; extra == 'postgresql'
Provides-Extra: redshift
Requires-Dist: psycopg2-binary>=2.9.0; extra == 'redshift'
Requires-Dist: redshift-connector>=2.0.0; extra == 'redshift'
Provides-Extra: s3
Requires-Dist: boto3>=1.26.0; extra == 's3'
Requires-Dist: s3fs>=2023.1.0; extra == 's3'
Provides-Extra: snowflake
Requires-Dist: snowflake-connector-python>=3.0.0; extra == 'snowflake'
Description-Content-Type: text/markdown

# PyCaroline

[![PyPI version](https://badge.fury.io/py/pycaroline.svg)](https://badge.fury.io/py/pycaroline)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Tests](https://github.com/ryankarlos/pycaroline/actions/workflows/test.yml/badge.svg)](https://github.com/ryankarlos/pycaroline/actions)
[![Codecov](https://codecov.io/gh/ryankarlos/pycaroline/graph/badge.svg?token=nfQT3lqoc8)](https://codecov.io/gh/ryankarlos/pycaroline)
[![Documentation](https://img.shields.io/badge/docs-mkdocs-blue)](https://ryankarlos.github.io/pycaroline)

A Python library for validating data migrations between cloud data warehouses, cloud storage (S3, GCS), and databases. Built on [datacompy](https://github.com/capitalone/datacompy), PyCaroline provides a unified interface for connecting to various data sources and comparing datasets with detailed reporting.

## Why PyCaroline?

Data migrations are risky. Whether you're moving from Snowflake to BigQuery, consolidating data warehouses, comparing S3 files with database tables, or validating ETL pipelines, you need confidence that your data arrived intact. PyCaroline automates that check: it connects to both sides, compares the datasets, and reports where they differ.

## Features

- 🔌 **Multi-database support** - Snowflake, BigQuery, Redshift, MySQL, and PostgreSQL behind a unified API
- ☁️ **Cloud storage support** - Read and compare data from AWS S3 and Google Cloud Storage
- 📊 **Direct DataFrame input** - Compare polars, pandas, or snowpark DataFrames directly
- 🔍 **Flexible comparison** - Row-level and column-level with configurable tolerances
- 📈 **Rich reports** - JSON summaries, CSV details, and beautiful HTML reports
- 🖥️ **CLI & Python API** - Use from the command line or integrate into your code
- ⚙️ **Configuration-driven** - YAML config with environment variable substitution
- 🧪 **Well-tested** - 90%+ test coverage with property-based tests
- 🐍 **Modern Python** - Supports Python 3.12 and 3.13

## Installation

```bash
# Using uv (recommended)
uv add pycaroline

# Using pip
pip install pycaroline
```

### With Optional Dependencies

```bash
# Snowflake
uv add "pycaroline[snowflake]"

# BigQuery
uv add "pycaroline[bigquery]"

# Redshift
uv add "pycaroline[redshift]"

# MySQL
uv add "pycaroline[mysql]"

# PostgreSQL
uv add "pycaroline[postgresql]"

# Cloud Storage (S3)
uv add "pycaroline[s3]"

# Cloud Storage (GCS)
uv add "pycaroline[gcs]"

# Pandas DataFrame support
uv add "pycaroline[pandas]"

# All connectors
uv add "pycaroline[all]"
```

## Quick Start

### Python API

```python
from pathlib import Path

from pycaroline import ConfigLoader, DataValidator

# Using configuration file
config = ConfigLoader.load(Path("validation_config.yaml"))
validator = DataValidator(config)
results = validator.validate()

for table, result in results.items():
    print(f"{table}: {result.matching_rows}/{result.source_row_count} rows match")
```

### Direct DataFrame Comparison

```python
import polars as pl
from pycaroline import DataComparator, ComparisonConfig

source_df = pl.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
target_df = pl.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

comparator = DataComparator(ComparisonConfig(
    join_columns=["id"],
    ignore_case=True,
    ignore_spaces=True,
))
result = comparator.compare(source_df, target_df)

print(f"Matching rows: {result.matching_rows}")
print(f"Rows only in source: {len(result.rows_only_in_source)}")
print(f"Rows only in target: {len(result.rows_only_in_target)}")
```
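
To make the effect of `ignore_case` and `ignore_spaces` concrete, here is a plain-Python sketch of the comparison semantics applied to the same data as above. It does not call PyCaroline or datacompy, and `normalize` and `compare_rows` are hypothetical helper names. It assumes `ignore_spaces` strips leading and trailing whitespace and `ignore_case` compares strings case-insensitively; check the library documentation for the exact behaviour.

```python
def normalize(value, ignore_case=False, ignore_spaces=False):
    """Normalize a string before comparison; non-strings pass through unchanged."""
    if isinstance(value, str):
        if ignore_spaces:
            value = value.strip()  # assumption: only surrounding whitespace is ignored
        if ignore_case:
            value = value.lower()
    return value


def compare_rows(source, target, ignore_case=False, ignore_spaces=False):
    """Compare two {join_key: value} mappings.

    Returns four sorted lists of join keys:
    (matching, mismatched, only_in_source, only_in_target).
    """
    shared = source.keys() & target.keys()
    matching = sorted(
        key
        for key in shared
        if normalize(source[key], ignore_case, ignore_spaces)
        == normalize(target[key], ignore_case, ignore_spaces)
    )
    mismatched = sorted(shared - set(matching))
    only_in_source = sorted(source.keys() - target.keys())
    only_in_target = sorted(target.keys() - source.keys())
    return matching, mismatched, only_in_source, only_in_target


# Same data as the polars example: id -> value
source = {1: "a", 2: "b", 3: "c"}
target = {1: "a", 2: "B", 4: "d"}

strict = compare_rows(source, target)
relaxed = compare_rows(source, target, ignore_case=True, ignore_spaces=True)
```

With the strict settings, id `2` is a mismatch (`"b"` vs `"B"`); with `ignore_case=True` it counts as a match. In both cases id `3` exists only in the source and id `4` only in the target.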

### Compare Pandas DataFrames

```python
import pandas as pd
from pycaroline import compare_dataframes

source_df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
target_df = pd.DataFrame({"id": [1, 2, 4], "value": ["a", "B", "d"]})

result = compare_dataframes(source_df, target_df, join_columns=["id"])
print(f"Matching rows: {result.matching_rows}")
```

### Compare S3 Files

```python
from pycaroline import compare_dataframes
from pycaroline.connectors import S3Connector

with S3Connector(bucket="my-bucket") as conn:
    source_df = conn.query("data/source.parquet")
    target_df = conn.query("data/target.parquet")

result = compare_dataframes(source_df, target_df, join_columns=["id"])
```

### Command Line

```bash
# Validate using config file
pycaroline validate --config validation_config.yaml --output ./reports

# Quick comparison
pycaroline compare \
    --source-type snowflake \
    --target-type bigquery \
    --source-table my_schema.customers \
    --target-table my_dataset.customers \
    --join-columns customer_id
```

## Configuration

Create a `validation_config.yaml`:

```yaml
source:
  type: snowflake
  connection:
    account: ${SNOWFLAKE_ACCOUNT}
    user: ${SNOWFLAKE_USER}
    password: ${SNOWFLAKE_PASSWORD}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    database: my_database

target:
  type: bigquery
  connection:
    project: ${GCP_PROJECT}
    credentials_path: ${GOOGLE_APPLICATION_CREDENTIALS}

tables:
  - source_table: customers
    target_table: customers
    join_columns: [customer_id]
    sample_size: 10000  # Optional: limit for large tables

comparison:
  abs_tol: 0.0001
  ignore_case: false
  ignore_spaces: true

output_dir: ./validation_results
```
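
The `${VAR}` placeholders are resolved from environment variables when the config is loaded. As a rough illustration of what such substitution involves, here is a minimal standard-library sketch. It is not PyCaroline's implementation: `substitute_env` is a hypothetical helper, and raising an error for an unset variable is an assumption about how a loader might behave.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value, env=None):
    """Recursively replace ${VAR} placeholders in strings, dicts, and lists."""
    env = os.environ if env is None else env
    if isinstance(value, str):

        def _lookup(match):
            name = match.group(1)
            if name not in env:
                # Assumption: fail loudly rather than leave the placeholder in place.
                raise KeyError(f"environment variable {name!r} is not set")
            return env[name]

        return _ENV_PATTERN.sub(_lookup, value)
    if isinstance(value, dict):
        return {key: substitute_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [substitute_env(item, env) for item in value]
    return value  # numbers, booleans, None are left untouched


# A fragment shaped like the parsed YAML above, with a made-up account value.
raw = {
    "source": {
        "type": "snowflake",
        "connection": {"account": "${SNOWFLAKE_ACCOUNT}", "database": "my_database"},
    }
}
resolved = substitute_env(raw, env={"SNOWFLAKE_ACCOUNT": "example-account"})
```

Keeping credentials in environment variables rather than in the YAML file means the config can be committed to version control without exposing secrets.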

## Report Output

```text
validation_results/
├── customers_summary.json       # Match statistics
├── customers_report.html        # Visual HTML report
├── customers_column_stats.csv   # Column-level stats
├── customers_rows_only_in_source.csv
├── customers_rows_only_in_target.csv
└── customers_mismatched_rows.csv
```
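
In a CI job it can be handy to confirm that the reports were actually written. The sketch below builds the expected paths from the naming pattern shown above and lists any that are missing. `expected_report_files` and `missing_report_files` are hypothetical helpers, not part of PyCaroline, and the sketch assumes all six files are produced for every table, which may not hold (for example, when there are no mismatched rows).

```python
from pathlib import Path

# File name suffixes taken from the directory listing above.
REPORT_SUFFIXES = (
    "summary.json",
    "report.html",
    "column_stats.csv",
    "rows_only_in_source.csv",
    "rows_only_in_target.csv",
    "mismatched_rows.csv",
)


def expected_report_files(output_dir, table):
    """Return the report paths expected for one table, e.g. customers_summary.json."""
    return [Path(output_dir) / f"{table}_{suffix}" for suffix in REPORT_SUFFIXES]


def missing_report_files(output_dir, table):
    """Return the expected report paths that do not exist on disk."""
    return [path for path in expected_report_files(output_dir, table) if not path.exists()]


names = [path.name for path in expected_report_files("validation_results", "customers")]
```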

## Documentation

Full documentation is available at [https://ryankarlos.github.io/pycaroline](https://ryankarlos.github.io/pycaroline).

## API Reference

### Core Classes

| Class | Description |
|-------|-------------|
| `DataValidator` | Main orchestrator for validation workflows |
| `ConfigLoader` | Loads YAML configuration with env var substitution |
| `DataComparator` | Compares DataFrames using datacompy |
| `ReportGenerator` | Generates JSON, CSV, and HTML reports |
| `ConnectorFactory` | Factory for creating database connectors |

### Exceptions

| Exception | Description |
|-----------|-------------|
| `ValidationError` | Validation operation failed |
| `ConfigurationError` | Invalid configuration |
| `ConnectionError` | Database connection failed |
| `QueryError` | Query execution failed |

## Development

```bash
# Clone and install
git clone https://github.com/ryankarlos/pycaroline.git
cd pycaroline
uv sync --all-extras

# Run tests
uv run pytest

# Run with coverage
uv run pytest --cov=pycaroline --cov-report=html

# Serve documentation locally
uv run mkdocs serve

# Lint and format
uv run ruff check .
uv run ruff format .
```

## Contributing

Contributions welcome! Please read [CONTRIBUTING.md](CONTRIBUTING.md) before submitting PRs.

## License

MIT License - see [LICENSE](LICENSE) for details.
