Metadata-Version: 2.4
Name: lavendertown
Version: 0.6.0
Summary: A Streamlit-first Python package for detecting and visualizing data quality issues
Project-URL: Homepage, https://github.com/eddiethedean/lavendertown
Project-URL: Repository, https://github.com/eddiethedean/lavendertown
Project-URL: Issues, https://github.com/eddiethedean/lavendertown/issues
Author-email: Odos Matthews <odosmatthews@gmail.com>
License: MIT
License-File: LICENSE
Keywords: data-profiling,data-quality,data-validation,streamlit
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Requires-Dist: altair>=4.2.1
Requires-Dist: click>=8.0.0
Requires-Dist: pandas>=1.5.0
Requires-Dist: streamlit>=1.52.0
Provides-Extra: all
Requires-Dist: great-expectations>=0.18.0; extra == 'all'
Requires-Dist: orjson>=3.9.0; extra == 'all'
Requires-Dist: pandera>=0.18.0; extra == 'all'
Requires-Dist: pyarrow>=14.0.0; extra == 'all'
Requires-Dist: pyod>=1.1.0; extra == 'all'
Requires-Dist: python-dotenv>=1.0.0; extra == 'all'
Requires-Dist: rich>=13.0.0; extra == 'all'
Requires-Dist: ruptures>=1.1.0; extra == 'all'
Requires-Dist: scikit-learn>=1.0.0; extra == 'all'
Requires-Dist: scipy>=1.10.0; extra == 'all'
Requires-Dist: statsmodels>=0.14.0; extra == 'all'
Requires-Dist: typer>=0.9.0; extra == 'all'
Requires-Dist: ydata-profiling>=4.5.0; extra == 'all'
Provides-Extra: cli
Requires-Dist: python-dotenv>=1.0.0; extra == 'cli'
Requires-Dist: rich>=13.0.0; extra == 'cli'
Requires-Dist: typer>=0.9.0; extra == 'cli'
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: faker>=20.0.0; extra == 'dev'
Requires-Dist: hypothesis>=6.0.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: great-expectations
Requires-Dist: great-expectations>=0.18.0; extra == 'great-expectations'
Provides-Extra: ml
Requires-Dist: pyod>=1.1.0; extra == 'ml'
Requires-Dist: scikit-learn>=1.0.0; extra == 'ml'
Provides-Extra: pandera
Requires-Dist: pandera>=0.18.0; extra == 'pandera'
Provides-Extra: parquet
Requires-Dist: pyarrow>=14.0.0; extra == 'parquet'
Provides-Extra: polars
Requires-Dist: polars>=0.19.0; extra == 'polars'
Provides-Extra: profiling
Requires-Dist: ydata-profiling>=4.5.0; extra == 'profiling'
Provides-Extra: stats
Requires-Dist: scipy>=1.10.0; extra == 'stats'
Provides-Extra: timeseries
Requires-Dist: ruptures>=1.1.0; extra == 'timeseries'
Requires-Dist: statsmodels>=0.14.0; extra == 'timeseries'
Description-Content-Type: text/markdown

# LavenderTown

> A Streamlit-first Python package for detecting and visualizing "data ghosts": type inconsistencies, nulls, invalid values, schema drift, and anomalies in tabular datasets.

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![PyPI version](https://img.shields.io/pypi/v/lavendertown.svg)](https://pypi.org/project/lavendertown/)
[![Documentation](https://readthedocs.org/projects/lavendertown/badge/?version=latest)](https://lavendertown.readthedocs.io/en/latest/?badge=latest)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

LavenderTown helps you quickly identify data quality issues in your datasets through an interactive Streamlit interface. It is aimed at data scientists, analysts, and engineers who need to understand their data quality before starting analysis.

## ✨ Features

- 🔍 **Zero-config data quality insights** - Get started with minimal setup
- 📊 **Streamlit-native UI** - No HTML embeds, fully integrated with Streamlit
- 🎯 **Interactive ghost detection** - Drill down into problematic rows
- 🐼 **Pandas & Polars support** - Works with your existing data pipelines
- 📤 **Exportable findings** - Download results as JSON, CSV, or Parquet with one click
- 🔄 **Dataset Comparison** - Detect schema and distribution drift between datasets
- ⚙️ **Custom Rules** - Create and manage custom data quality rules via UI
- 📁 **Enhanced File Upload** - Drag-and-drop interface with animated progress and automatic encoding detection
- 🚀 **High Performance** - Optimized for datasets up to millions of rows with fast JSON serialization
- 🛠️ **Enhanced CLI Tool** - Beautiful, interactive CLI with progress bars and formatted output for batch processing and automation (Click and Typer support)
- 🔗 **Ecosystem Integration** - Export rules to Pandera and Great Expectations
- ⚙️ **Configuration Management** - Environment-based configuration with `.env` file support
- 🤖 **Advanced ML Detection** - 40+ ML anomaly detection algorithms via PyOD integration
- 📈 **Time-Series Analysis** - Change point detection with Ruptures, comprehensive profiling with ydata-profiling
- 📊 **Statistical Tests** - Kolmogorov-Smirnov and chi-square tests for rigorous drift detection
- 💾 **Parquet Export** - Efficient columnar storage format for large datasets

## 📦 Installation

Install LavenderTown using pip:

```bash
pip install lavendertown
```

For Polars support, install with the optional dependency:

```bash
pip install lavendertown[polars]
```

For ecosystem integrations (Pandera and Great Expectations), install with:

```bash
pip install lavendertown[pandera]
pip install lavendertown[great-expectations]
```

For Phase 6 features (ML algorithms, time-series, profiling, Parquet export):

```bash
pip install lavendertown[ml]          # PyOD + scikit-learn for 40+ ML anomaly detection algorithms
pip install lavendertown[timeseries]  # Ruptures + statsmodels for change point detection and time-series analysis
pip install lavendertown[profiling]   # ydata-profiling for comprehensive data profiling reports
pip install lavendertown[parquet]     # PyArrow for Parquet export/import
pip install lavendertown[stats]       # scipy.stats for statistical tests in drift detection
```

**Note:** LavenderTown is compatible with both altair 4.x and 5.x. Installing Great Expectations will automatically install altair 4.x (which is compatible with LavenderTown).

For all Phase 6 features, install with:
```bash
pip install lavendertown[ml,timeseries,profiling,parquet,stats]
```

## 🚀 Quick Start

### Basic Usage

```python
import streamlit as st
from lavendertown import Inspector
import pandas as pd

# Load your data
df = pd.read_csv("your_data.csv")

# Create inspector and render
inspector = Inspector(df)
inspector.render()  # This must be called within a Streamlit app context
```

That's it! Save this code in a file (e.g., `app.py`) and run `streamlit run app.py` to see the interactive data quality dashboard.

### Using Polars

LavenderTown works seamlessly with Polars DataFrames:

```python
import streamlit as st
from lavendertown import Inspector
import polars as pl

# Load your data with Polars
df = pl.read_csv("your_data.csv")

# Create inspector and render (works with Polars too!)
inspector = Inspector(df)
inspector.render()  # This must be called within a Streamlit app context
```

### Standalone CSV Upload App

For quick analysis without writing code, use the included Streamlit app:

```bash
streamlit run examples/app.py
```

This opens a web interface where you can:
- Upload CSV files via drag-and-drop or the file browser
- See animated progress indicators during file processing
- Rely on automatic encoding detection (UTF-8, Latin-1/ISO-8859-1, CP1252)
- Preview your data before analysis
- View interactive data quality insights
- Export findings with download buttons

**New in v0.6.0:** The upload experience includes polished animations, automatic encoding detection, and enhanced visual feedback for a professional user experience.

See the [examples directory](https://github.com/eddiethedean/lavendertown/tree/main/examples) and [examples/README.md](https://github.com/eddiethedean/lavendertown/blob/main/examples/README.md) for more usage examples and detailed instructions.
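
For readers curious how encoding detection of the kind listed above can work, here is a minimal sketch of one common approach: try each candidate encoding in order and keep the first that decodes cleanly. This is not necessarily how LavenderTown implements it. The function name `decode_with_fallback` and the candidate order are assumptions made for illustration. Latin-1 and ISO-8859-1 name the same codec in Python, and that codec accepts every byte, so it has to come last.

```python
# Illustrative sketch only: not LavenderTown's actual detection code.
# The candidate order is an assumption; latin-1 must be last because it never fails.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(raw: bytes) -> tuple[str, str]:
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Not expected to happen: latin-1 maps every byte to a character.
    raise ValueError("no candidate encoding could decode the input")


# A CSV snippet saved as CP1252 is not valid UTF-8, so the second candidate wins.
text, used = decode_with_fallback("name,city\nZoë,Kraków\n".encode("cp1252"))
print(used)  # cp1252
```

A clean decode does not prove the guess is right, since CP1252 and Latin-1 accept almost any byte sequence, so previewing the data before analysis remains worthwhile.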

## 📚 Usage Examples

### Dataset Comparison (Drift Detection)

Compare datasets to detect schema and distribution changes:

```python
from lavendertown import Inspector
import pandas as pd

# Create baseline and current datasets
baseline_df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [25, 30, 35, 40, 45],
    "purchase_amount": [100.50, 200.00, 150.75, 300.00, 250.50],
})

current_df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5, 6],  # New row
    "age": [25, 30, 35, 40, 45, 50],  # New row
    "purchase_amount": [100.50, 250.00, 150.75, 400.00, 250.50, 500.00],  # Changed values
    "new_column": [1, 2, 3, 4, 5, 6],  # New column
})

inspector = Inspector(current_df)
drift_findings = inspector.compare_with_baseline(
    baseline_df=baseline_df,
    comparison_type="full"  # or "schema_only", "distribution_only"
)

# Drift findings have ghost_type="drift"
for finding in drift_findings:
    if finding.ghost_type == "drift":
        print(f"{finding.column}: {finding.description}")
        print(f"  Change type: {finding.metadata.get('change_type', 'N/A')}")
```

**Example Output:**
```
new_column: New column 'new_column' added to dataset
  Change type: column_added
age: Column 'age' range shifted from [25.00, 45.00] to [25.00, 50.00]
  Change type: numeric_range
purchase_amount: Column 'purchase_amount' range shifted from [100.50, 300.00] to [100.50, 500.00]
  Change type: numeric_range
```

> **Note:** This output is representative of running the code above. The exact drift findings depend on the differences between your baseline and current datasets.
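
For intuition about the `column_added` and `numeric_range` change types shown above, the following plain-Python sketch compares two small tables by column set and by min/max range. It is a simplified stand-in, not LavenderTown's drift engine. The function name `sketch_drift` and the `column_removed` label are hypothetical.

```python
# Illustrative sketch only: a plain-Python stand-in, not LavenderTown's drift engine.
Table = dict[str, list[float]]


def sketch_drift(baseline: Table, current: Table) -> list[tuple[str, str]]:
    """Return (column, change_type) pairs for schema and range differences."""
    changes: list[tuple[str, str]] = []
    # Schema drift: columns present only in the current table.
    for column in current:
        if column not in baseline:
            changes.append((column, "column_added"))
    for column in baseline:
        if column not in current:
            changes.append((column, "column_removed"))  # hypothetical label
            continue
        # Distribution drift, reduced here to a change in the observed min/max.
        old_range = (min(baseline[column]), max(baseline[column]))
        new_range = (min(current[column]), max(current[column]))
        if old_range != new_range:
            changes.append((column, "numeric_range"))
    return changes


baseline = {
    "age": [25, 30, 35, 40, 45],
    "purchase_amount": [100.50, 200.00, 150.75, 300.00, 250.50],
}
current = {
    "age": [25, 30, 35, 40, 45, 50],
    "purchase_amount": [100.50, 250.00, 150.75, 400.00, 250.50, 500.00],
    "new_column": [1, 2, 3, 4, 5, 6],
}
changes = sketch_drift(baseline, current)
for column, change_type in changes:
    print(f"{column}: {change_type}")
```

A min/max comparison is the crudest possible distribution check. The statistical tests available through the `stats` extra (Kolmogorov-Smirnov and chi-square) are the more rigorous option.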

### Custom Data Quality Rules

Create custom rules through the Streamlit UI:

1. Click "Manage Rules" in the sidebar
2. Create rules of different types:
   - **Range rules**: Validate numeric values within min/max bounds
   - **Regex rules**: Pattern matching for string columns
   - **Enum rules**: Allow only specific values in a column
3. Rules execute automatically with each analysis
4. Export/import rules as JSON for reuse across projects
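
To make the three rule types concrete, here is a minimal plain-Python sketch of what each one checks. It does not use LavenderTown's rule engine or its JSON rule format. The function names are hypothetical, and skipping null values (leaving them to null detection) is an assumption.

```python
import re
from typing import Optional, Sequence

# Illustrative sketch only: these helpers are hypothetical, not LavenderTown APIs.


def range_violations(
    values: Sequence[Optional[float]], minimum: float, maximum: float
) -> list[int]:
    """Row indices whose value falls outside [minimum, maximum]; nulls are skipped."""
    return [
        i for i, v in enumerate(values)
        if v is not None and not (minimum <= v <= maximum)
    ]


def regex_violations(values: Sequence[Optional[str]], pattern: str) -> list[int]:
    """Row indices whose string does not fully match the pattern; nulls are skipped."""
    compiled = re.compile(pattern)
    return [
        i for i, v in enumerate(values)
        if v is not None and compiled.fullmatch(v) is None
    ]


def enum_violations(values: Sequence[Optional[str]], allowed: set[str]) -> list[int]:
    """Row indices whose value is not in the allowed set; nulls are skipped."""
    return [i for i, v in enumerate(values) if v is not None and v not in allowed]


prices = [10.99, 25.50, None, 45.00, -5.00, 100.00, 200.00, 300.00]
categories = ["A", "B", "A", "C", "A", "B", "A", "C"]
skus = ["AB-123", "ab-123", "XY-9"]  # made-up sample values

print(range_violations(prices, 0, 250))           # [4, 7]
print(regex_violations(skus, r"[A-Z]{2}-\d{3}"))  # [1, 2]
print(enum_violations(categories, {"A", "B"}))    # [3, 7]
```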

### Command-Line Interface (CLI)

LavenderTown includes a powerful CLI with beautiful, interactive output for batch processing and automation. The CLI features progress bars, formatted tables, and color-coded messages for an enhanced user experience:

```bash
# Analyze a single CSV file
lavendertown analyze data.csv --output-format json --output-dir results/

# Batch process multiple files
lavendertown analyze-batch data/ --output-dir results/

# Compare datasets for drift detection
lavendertown compare baseline.csv current.csv --output-format json

# Export rules to Pandera or Great Expectations
lavendertown export-rules rules.json --format pandera --output-file schema.py
lavendertown export-rules rules.json --format great_expectations --output-file suite.json
```

**CLI Options:**
- `--rules PATH`: Path to rules JSON file
- `--output-format [json|csv]`: Output format (default: `json`)
- `--output-dir DIRECTORY`: Output directory (for batch processing)
- `--output-file PATH`: Specific output file path (overrides output-dir)
- `--backend [pandas|polars]`: DataFrame backend (default: `pandas`)
- `--quiet`: Suppress progress output
- `--verbose`: Verbose output with detailed error messages

**Note:** For the best CLI experience with enhanced formatting, install with the `cli` extra:
```bash
pip install lavendertown[cli]
```

This includes Rich for formatted terminal output, python-dotenv for configuration management, and Typer for type-hint-based CLI commands. orjson, used for faster JSON processing, is not part of the `cli` extra; it is included in the `all` extra.

**New in Phase 6:** LavenderTown now supports Parquet export format and includes a modern Typer-based CLI (available alongside the existing Click CLI). Install `lavendertown[parquet]` for Parquet support.

**Example CLI Usage:**

```bash
# Analyze with verbose output
lavendertown analyze data.csv --verbose

# Batch process with Polars backend
lavendertown analyze-batch data/ --output-dir results/ --backend polars

# Analyze with custom rules
lavendertown analyze data.csv --rules my_rules.json --output-format csv
```

See `lavendertown --help` or `lavendertown analyze --help` for full documentation.
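
For automation from Python, one option is to assemble the command line shown above and hand it to `subprocess`. The sketch below uses only flags listed in this README. The helper names `build_analyze_command` and `run_analyze` are hypothetical, and the meaning of the CLI's exit codes is not documented here, so the code simply returns whatever the process reports.

```python
import shlex
import shutil
import subprocess
from typing import Optional

# Illustrative sketch only: helper names are hypothetical wrappers around the CLI.


def build_analyze_command(
    csv_path: str,
    rules: Optional[str] = None,
    output_format: str = "json",
    verbose: bool = False,
) -> list[str]:
    """Assemble the argument list for `lavendertown analyze`."""
    command = ["lavendertown", "analyze", csv_path, "--output-format", output_format]
    if rules is not None:
        command += ["--rules", rules]
    if verbose:
        command.append("--verbose")
    return command


def run_analyze(command: list[str]) -> int:
    """Run the command and return its exit code; fail early if the CLI is missing."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} is not on PATH")
    return subprocess.run(command, check=False).returncode


command = build_analyze_command("data.csv", rules="my_rules.json", output_format="csv")
print(shlex.join(command))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.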

### Programmatic Usage

Use LavenderTown in your Python scripts:

```python
from lavendertown import Inspector
import pandas as pd

# Create sample data with quality issues
data = {
    "product_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "price": [10.99, 25.50, None, 45.00, -5.00, 100.00, 200.00, 300.00],
    "quantity": [100, 50, 75, None, 200, 150, 0, 300],
    "category": ["A", "B", "A", "C", "A", "B", "A", "C"],
}
df = pd.DataFrame(data)

inspector = Inspector(df)

# Get findings programmatically
findings = inspector.detect()

# Filter by severity
errors = [f for f in findings if f.severity == "error"]
warnings = [f for f in findings if f.severity == "warning"]
info_items = [f for f in findings if f.severity == "info"]

print(f"Total findings: {len(findings)}")
print(f"Errors: {len(errors)}, Warnings: {len(warnings)}, Info: {len(info_items)}")

# Access finding details
for finding in findings:
    print(f"\nColumn: {finding.column}")
    print(f"Type: {finding.ghost_type}")
    print(f"Severity: {finding.severity}")
    print(f"Description: {finding.description}")
    if finding.row_indices:
        print(f"Affected rows: {len(finding.row_indices)}")
```

**Example Output:**
```
Total findings: 2
Errors: 0, Warnings: 0, Info: 2

Column: price
Type: null
Severity: info
Description: Column 'price' has 1 null values (12.5% of 8 rows)
Affected rows: 1

Column: quantity
Type: null
Severity: info
Description: Column 'quantity' has 1 null values (12.5% of 8 rows)
Affected rows: 1
```

> **Note:** This is actual output from running the code above. The exact findings may vary based on the data and detection thresholds.

## 👻 Ghost Categories

LavenderTown detects four main categories of data quality issues:

1. **Structural Ghosts** - Mixed dtypes, schema drift, unexpected nullability
2. **Value Ghosts** - Out-of-range values, regex violations, enum violations
3. **Completeness Ghosts** - Null density thresholds, conditional nulls
4. **Statistical Ghosts** - Outliers (IQR method), distribution shifts

Each finding includes:
- **Ghost type**: Category of the issue
- **Column**: Affected column name
- **Severity**: `info`, `warning`, or `error`
- **Description**: Human-readable explanation
- **Row indices**: Specific rows affected (when applicable)
- **Metadata**: Additional diagnostic information
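
The IQR method mentioned under Statistical Ghosts can be sketched with the standard library alone. This is a textbook version, not LavenderTown's detector. The function name, the 1.5 multiplier, and the choice of linear-interpolation quartiles are assumptions, so the library's thresholds may differ.

```python
import statistics
from typing import Optional, Sequence

# Illustrative sketch only: a textbook IQR rule, not LavenderTown's detector.


def iqr_outlier_indices(
    values: Sequence[Optional[float]], k: float = 1.5
) -> list[int]:
    """Return row indices whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    present = [v for v in values if v is not None]
    if len(present) < 4:
        return []  # too few points for meaningful quartiles (assumed cutoff)
    # method="inclusive" interpolates linearly between data points.
    q1, _median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [
        i for i, v in enumerate(values)
        if v is not None and not (lower <= v <= upper)
    ]


readings = [10, 12, 11, 13, 12, 11, 95]  # made-up sample values
print(iqr_outlier_indices(readings))  # [6]
```

Here Q1 is 11 and Q3 is 12.5, so the fences sit at 8.75 and 14.75 and only the value 95 falls outside them.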

## 🏗️ Architecture

LavenderTown is built with a plugin-based architecture:

- **Inspector**: Main orchestrator that coordinates detection and rendering
- **Detectors**: Stateless, UI-agnostic modules for detecting specific ghost types
  - `NullGhostDetector`: Detects excessive null values
  - `TypeGhostDetector`: Identifies type inconsistencies
  - `OutlierGhostDetector`: Finds statistical outliers using IQR method
  - `RuleBasedDetector`: Executes custom user-defined rules
- **UI Components**: Streamlit-native visualization components
- **Export Layer**: Fast JSON, CSV, and Parquet export functionality (with orjson optimization and PyArrow support)
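
As a rough picture of what "stateless, UI-agnostic" detectors can look like, the sketch below defines a small detector interface and one null detector that plugs into it. Every name here (`FindingSketch`, `DetectorSketch`, `NullDetectorSketch`, `run_detectors`) is hypothetical. The finding fields mirror the ones listed under Ghost Categories, but LavenderTown's real classes and signatures may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol

# Illustrative sketch only: all names are hypothetical, not LavenderTown classes.
Rows = dict[str, list[Any]]


@dataclass
class FindingSketch:
    ghost_type: str
    column: str
    severity: str
    description: str
    row_indices: list[int] = field(default_factory=list)
    metadata: dict[str, Any] = field(default_factory=dict)


class DetectorSketch(Protocol):
    """Anything with a matching detect() method can act as a detector."""

    def detect(self, data: Rows) -> list[FindingSketch]: ...


class NullDetectorSketch:
    """Stateless: the result depends only on the data passed in."""

    def detect(self, data: Rows) -> list[FindingSketch]:
        findings = []
        for column, values in data.items():
            null_rows = [i for i, v in enumerate(values) if v is None]
            if null_rows:
                findings.append(FindingSketch(
                    ghost_type="null",
                    column=column,
                    severity="info",
                    description=f"{len(null_rows)} of {len(values)} values are null",
                    row_indices=null_rows,
                    metadata={"null_fraction": len(null_rows) / len(values)},
                ))
        return findings


def run_detectors(data: Rows, detectors: list[DetectorSketch]) -> list[FindingSketch]:
    """Orchestrator role: run each detector and collect the findings."""
    findings: list[FindingSketch] = []
    for detector in detectors:
        findings.extend(detector.detect(data))
    return findings


data = {
    "price": [10.99, 25.50, None, 45.00, -5.00, 100.00, 200.00, 300.00],
    "quantity": [100, 50, 75, None, 200, 150, 0, 300],
}
findings = run_detectors(data, [NullDetectorSketch()])
for f in findings:
    print(f.column, f.row_indices, f.metadata["null_fraction"])
```

Because detectors return plain data objects and never touch the UI, the same findings can feed a Streamlit view, a CLI report, or an export step.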

## ⚙️ Configuration

LavenderTown supports configuration through environment variables and `.env` files. Create a `.env` file in your project root or home directory:

```bash
# .env file example
LAVENDERTOWN_LOG_LEVEL=INFO
LAVENDERTOWN_OUTPUT_DIR=./results
```

Configuration is automatically loaded when the package is imported. See the [documentation](https://lavendertown.readthedocs.io/en/latest/) for available configuration options.
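
If your own scripts need the same values, they can read the variables directly from the environment. The sketch below is not LavenderTown's internal loader. The `SettingsSketch` class and `load_settings` function are hypothetical, and the fallback defaults are assumptions chosen to match the `.env` example above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Illustrative sketch only: not LavenderTown's loader; defaults are assumptions.


@dataclass(frozen=True)
class SettingsSketch:
    log_level: str
    output_dir: str


def load_settings(environ: Optional[Mapping[str, str]] = None) -> SettingsSketch:
    """Read the two variables from the example above, falling back to defaults."""
    env = os.environ if environ is None else environ
    return SettingsSketch(
        log_level=env.get("LAVENDERTOWN_LOG_LEVEL", "INFO").upper(),
        output_dir=env.get("LAVENDERTOWN_OUTPUT_DIR", "./results"),
    )


# Passing a mapping explicitly keeps the function easy to test.
settings = load_settings({"LAVENDERTOWN_LOG_LEVEL": "debug"})
print(settings)
```

Values from a `.env` file only appear in `os.environ` after something has loaded that file, for example python-dotenv from the `cli` extra.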

## 🛠️ Development

### Installation for Development

```bash
git clone https://github.com/eddiethedean/lavendertown.git
cd lavendertown
pip install -e ".[dev]"
```

### Running Tests

```bash
pytest tests/
```

### Code Quality

```bash
# Format code
ruff format .

# Lint
ruff check .

# Type checking
mypy lavendertown/
```

## 📊 Performance

LavenderTown is optimized for performance:

- **Small datasets (<10k rows)**: Near-instantaneous analysis
- **Medium datasets (10k-100k rows)**: Sub-second analysis
- **Large datasets (100k-1M rows)**: Optimized with caching and vectorized operations

Benchmark results and optimization recommendations are documented in [docs/PERFORMANCE.md](https://github.com/eddiethedean/lavendertown/blob/main/docs/PERFORMANCE.md).

## 🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](https://github.com/eddiethedean/lavendertown/blob/main/LICENSE) file for details.

## 🙏 Acknowledgments

- Built with [Streamlit](https://streamlit.io/) for the UI
- Powered by [Pandas](https://pandas.pydata.org/) and [Polars](https://www.pola.rs/) for data processing
- Visualizations created with [Altair](https://altair-viz.github.io/)

## 🔗 Links

- **Documentation**: https://lavendertown.readthedocs.io/en/latest/
- **Homepage**: https://github.com/eddiethedean/lavendertown
- **Repository**: https://github.com/eddiethedean/lavendertown
- **Issues**: https://github.com/eddiethedean/lavendertown/issues

---

**Made with ❤️ for the data quality community**
