Metadata-Version: 2.4
Name: additory
Version: 0.1.3a10
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Dist: polars>=0.19.0
Requires-Dist: pyarrow>=10.0.0
Requires-Dist: pandas>=1.5.0 ; extra == 'all'
Requires-Dist: pyarrow>=10.0.0 ; extra == 'all'
Requires-Dist: pytest>=7.0 ; extra == 'dev'
Requires-Dist: pandas>=1.5.0 ; extra == 'dev'
Requires-Dist: pyarrow>=10.0.0 ; extra == 'dev'
Requires-Dist: pandas>=1.5.0 ; extra == 'pandas'
Requires-Dist: pyarrow>=10.0.0 ; extra == 'pandas'
Provides-Extra: all
Provides-Extra: dev
Provides-Extra: pandas
License-File: LICENSE
Summary: Elegant data operations for DataFrames - add.to(), add.transform(), add.synthetic()
Keywords: dataframe,data,pandas,polars,rust,data-augmentation,synthetic-data
Home-Page: https://github.com/sekarkrishna/additory
Author-email: Krishnamoorthy Sankaran <krishnamoorthy.sankaran@sekrad.org>
Maintainer-email: Krishnamoorthy Sankaran <krishnamoorthy.sankaran@sekrad.org>
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Documentation, https://github.com/sekarkrishna/additory#readme
Project-URL: Homepage, https://github.com/sekarkrishna/additory
Project-URL: Issues, https://github.com/sekarkrishna/additory/issues
Project-URL: Repository, https://github.com/sekarkrishna/additory

# additory

**Elegant data operations for DataFrames**

A Rust-powered Python library for intuitive data transformations, lookups, and synthetic data generation with Polars and Pandas.

[![PyPI version](https://badge.fury.io/py/additory.svg)](https://badge.fury.io/py/additory)
[![Python Support](https://img.shields.io/pypi/pyversions/additory.svg)](https://pypi.org/project/additory/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

## Features

- 🔗 **Intuitive Lookups** - Add columns from external sources with simple syntax
- ⚡ **Powerful Transforms** - Calculate, filter, sort, aggregate with mode-based operations
- 🎲 **Synthetic Data** - Generate realistic test data or augment existing datasets
- 📊 **Lineage Tracking** - Track data transformations and view operation history
- 🔍 **Data Scanning** - Analyze data quality and inspect DataFrames
- 🚀 **Rust Performance** - Built with Rust for blazing-fast operations
- 🐼 **Polars & Pandas** - Works seamlessly with both DataFrame libraries
- 📚 **Expression Library** - 179 built-in expressions for medical, finance, physics, and more

## Installation

```bash
pip install additory
```

**Requirements:**
- Python 3.9+
- Polars 0.19.0+ and PyArrow 10.0.0+ (required)
- Pandas 1.5.0+ (optional; install with `pip install "additory[pandas]"`)

## Quick Start

```python
import additory as add
import polars as pl

# Add data from external sources
orders = pl.DataFrame({'id': [1, 2], 'customer_id': [101, 102]})
customers = pl.DataFrame({'customer_id': [101, 102], 'name': ['Alice', 'Bob']})
result = add.to(orders, bring_from=customers, bring=['name'], against='customer_id')

# Transform data
df = pl.DataFrame({'x': [1, 2, 3]})
result = add.transform('@calc', df, strategy={'x_squared': 'x ** 2'})

# Generate synthetic data
result = add.synthetic('@new', n=100, strategy={'age': 'normal(40, 10)'})
```

## Core Functions

### add.to() - Add Data from External Sources

```python
result = add.to(bring_to, bring_from=reference_df, bring=['column'], against='key',
                lineage=False)
```

Perfect for lookups and joins. Enable `lineage=True` to track data sources.

### add.transform() - Transform Data

```python
result = add.transform(mode, df, lineage=False, **parameters)
```

**Available modes:**
- `@calc` - Calculate new columns with expressions
- `@filter` - Filter rows and select columns
- `@sort` - Sort data by columns
- `@aggregate` - Group and aggregate data
- `@harmonize` - Harmonize units (10 sub-modes)
- `@round` - Round numbers (creates NEW columns)
- `@transpose` - Transpose DataFrame
- `@extract` - Extract patterns from text/dates
- `@onehotencode` - One-hot encode categorical columns
- `@deduce` - Fill missing values (7 methods)

### add.synthetic() - Synthetic Data

```python
result = add.synthetic(mode, df_or_n, lineage=False, **parameters)
```

**Available modes:**
- `@new` - Create synthetic DataFrames from scratch
- `@augment` - Add synthetic rows to existing data

### add.scan() - Inspect and Analyze DataFrames

```python
result = add.scan(mode, df)
```

**Available modes:**
- `@analyze` / `@analyse` - Analyze data quality and distributions
- `@lineage` - View lineage tracking reports (requires `lineage=True` in operations)

## Strategy Parameter

The `strategy` parameter provides fine-grained control over operations in `add.to()`, `add.transform()`, and `add.synthetic()`.

### add.to() Strategy

Control aggregation, renaming, and positioning for brought columns:

**Simple form** (aggregation only):
```python
strategy={'amount': 'sum', 'date': 'last'}
```

**Complex form** (full control):
```python
strategy={
    'amount': {
        'mode': 'sum',
        'rename': 'total_spent',
        'position': 'after:customer_id'
    }
}
```

**Aggregation modes**: first, last, sum, count, average, min, max, concat, concat[sep], most_common, least_common, median, std, variance, unique_count
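
When the source table holds several rows per key, the aggregation mode decides which single value is brought across. The following is a minimal pure-Python sketch of that idea, not additory's implementation: the helper `bring_column`, the handful of modes it covers, and the choice to fill unmatched keys with `None` are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical reducers mirroring a few of the aggregation mode names above.
AGGREGATORS = {
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
    "sum": sum,
    "count": len,
    "average": mean,
    "min": min,
    "max": max,
}


def bring_column(target_rows, source_rows, column, against, mode):
    """Left-join style lookup: reduce matching source values to one per key."""
    grouped = defaultdict(list)
    for row in source_rows:
        grouped[row[against]].append(row[column])

    reducer = AGGREGATORS[mode]
    result = []
    for row in target_rows:
        matches = grouped.get(row[against])
        # Keys with no match keep the row and get None (assumed behaviour).
        value = reducer(matches) if matches else None
        result.append({**row, column: value})
    return result


customers = [{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}, {"id": 4, "name": "Dan"}]
orders = [
    {"id": 1, "amount": 100},
    {"id": 1, "amount": 150},
    {"id": 2, "amount": 200},
]

totals = bring_column(customers, orders, "amount", against="id", mode="sum")
latest = bring_column(customers, orders, "amount", against="id", mode="last")
```

Here `totals` carries one summed amount per customer, while `latest` keeps only the last matching order, which is the distinction the simple form `{'amount': 'sum', 'date': 'last'}` expresses per column.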

### add.transform() Strategy

Mode-specific configuration:

**@calc** - Expressions for new columns:
```python
strategy={'total': 'price * quantity', 'discount': 'total * 0.1'}
```

**@sort** - Sort order:
```python
strategy={'order': 'desc'}  # or 'asc'
```

**@aggregate** - Aggregation functions:
```python
strategy={'amount': 'sum', 'count': 'count'}
```

**@round** - Custom naming and positioning:
```python
strategy={
    'price': {'name': 'price_clean', 'position': 'after:price'}
}
```
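
To make the `position` value concrete, here is a small sketch of how an `after:<column>` spec could be resolved against a list of column names. The function `insert_column` is hypothetical and handles only the `after:` form shown above; it is not taken from the library.

```python
def insert_column(columns, new_name, position):
    """Return a new column order with new_name placed per a position spec.

    Only the 'after:<column>' form is handled; any other form is
    outside the scope of this sketch.
    """
    kind, _, anchor = position.partition(":")
    if kind != "after" or anchor not in columns:
        raise ValueError(f"unsupported position spec: {position!r}")
    index = columns.index(anchor) + 1
    # Build a new list so the caller's column order is left untouched.
    return columns[:index] + [new_name] + columns[index:]


ordered = insert_column(["id", "price", "quantity"], "price_clean", "after:price")
```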

**@deduce** - KNN parameters:
```python
strategy={'k': 5, 'weights': 'distance'}
```

### add.synthetic() Strategy

Column generation specifications:

**Simple form**:
```python
strategy={'id': 'increment', 'age': 'normal(40, 10)'}
```

**Complex form**:
```python
strategy={
    'name': {'type': 'choice', 'values': ['Alice', 'Bob', 'Charlie']},
    'age': {'type': 'normal', 'mean': 35, 'std': 10}
}
```

**Generation types**: increment, pattern, choice, normal, uniform, lognormal, exponential, poisson, categorical
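
As a rough illustration of what a simple-form specification means, the sketch below parses strings such as `normal(40, 10)` and draws values with Python's standard `random` module. It is not how additory generates data: `generate_column` and `generate_table` are hypothetical names, only three generation types are covered, and starting `increment` at 1 is an assumption. The fixed seed of 42 follows the default mentioned in the changelog.

```python
import random
import re

SPEC_PATTERN = re.compile(r"^(\w+)\((.*)\)$")


def generate_column(spec, n, rng):
    """Generate n values for a simple-form spec such as 'normal(40, 10)'."""
    if spec == "increment":
        return list(range(1, n + 1))  # assumed to start at 1
    match = SPEC_PATTERN.match(spec)
    if match is None:
        raise ValueError(f"unsupported spec: {spec!r}")
    kind, raw_args = match.groups()
    args = [float(part) for part in raw_args.split(",")]
    if kind == "normal":
        mean, std = args
        return [rng.gauss(mean, std) for _ in range(n)]
    if kind == "uniform":
        low, high = args
        return [rng.uniform(low, high) for _ in range(n)]
    raise ValueError(f"unsupported generation type: {kind!r}")


def generate_table(strategy, n, seed=42):
    """Build a dict of columns; a fixed seed makes the output reproducible."""
    rng = random.Random(seed)
    return {name: generate_column(spec, n, rng) for name, spec in strategy.items()}


strategy = {"id": "increment", "age": "normal(40, 10)", "score": "uniform(0, 100)"}
table = generate_table(strategy, n=5)
```

Calling `generate_table` twice with the same strategy and seed returns identical columns, which is the practical benefit of a fixed default seed.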

## Lineage Tracking

Track data transformations across operations to understand data provenance and transformation history.

### Enable Lineage Tracking

```python
import additory as add
import pandas as pd

# Enable lineage in any operation
# (customers and orders are pandas DataFrames sharing a 'customer_id' column)
result = add.to(customers, bring_from=orders, bring=['amount'], 
                against='customer_id', lineage=True)

# Lineage is preserved across operations
result = add.transform('@calc', result, expression='amount * 1.1', 
                       name='total', lineage=True)

# View lineage report
lineage_report = add.scan('@lineage', result)
print(lineage_report)
```

### Lineage Features

- **Operation History** - Track all transformations applied to data
- **Column Sources** - See where each column came from
- **Row Mappings** - Track how rows were filtered or aggregated
- **Session-Only** - Lineage is stored in-memory (not persisted to disk)
- **Mutual Exclusion** - Cannot use `lineage=True` with `as_type` parameter

### Lineage Example

```python
# Multi-step workflow with lineage
customers = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Carol']})
orders = pd.DataFrame({'id': [1, 1, 2, 3, 3], 'amount': [100, 150, 200, 175, 125]})

# Step 1: Bring data
df = add.to(customers, bring_from=orders, bring=['amount'], against='id',
            strategy={'amount': 'sum'}, lineage=True)

# Step 2: Calculate
df = add.transform('@calc', df, expression='amount * 1.1', name='total', lineage=True)

# Step 3: Filter
df = add.transform('@filter', df, where='total > 200', lineage=True)

# View complete lineage
report = add.scan('@lineage', df)
# Shows: 3 operations, column sources, row transformations
```

### Important Notes

- Lineage is **session-only** by design (follows "no file I/O" philosophy)
- Lineage metadata is lost when DataFrames are saved with native methods
- Cannot use `lineage=True` with `as_type` parameter (metadata would be lost during conversion)
- Lineage overhead is minimal (<3ms per operation)
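
The session-only model can be pictured as a list of operation records that travels with the data in memory. The sketch below shows that general pattern in plain Python; the `TrackedRows` class and its record fields are hypothetical and do not describe additory's internal lineage format.

```python
from dataclasses import dataclass, field


@dataclass
class TrackedRows:
    """Rows plus an in-memory list of operation records (hypothetical)."""

    rows: list
    history: list = field(default_factory=list)

    def apply(self, operation, func, **details):
        """Run func on the rows and append a record describing the step."""
        new_rows = func(self.rows)
        record = {
            "operation": operation,
            "rows_in": len(self.rows),
            "rows_out": len(new_rows),
            **details,
        }
        # The history travels with the returned object and is never written to disk.
        return TrackedRows(new_rows, self.history + [record])


data = TrackedRows(
    [{"id": 1, "amount": 250}, {"id": 2, "amount": 200}, {"id": 3, "amount": 300}]
)
data = data.apply(
    "calc",
    lambda rows: [{**r, "total": r["amount"] * 1.1} for r in rows],
    added="total",
)
data = data.apply(
    "filter",
    lambda rows: [r for r in rows if r["total"] > 250],
    where="total > 250",
)
```

Because the history lives only on the Python object, writing `data.rows` to a file would drop it, which mirrors the note above about saving DataFrames with native methods.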

## Documentation

📚 **Complete documentation is available in the `/docs` directory:**

- **[API Reference](docs/reference/)** - Complete function signatures and API documentation
  - [Quick Reference](docs/reference/QUICK_REFERENCE.md) - Fast lookup guide
  - [Reference Manual](docs/reference/REFERENCE_MANUAL.md) - Comprehensive API docs
  - [Function Signatures](docs/reference/FUNCTION_SIGNATURES_WITH_LINEAGE.md) - All signatures with lineage support

- **[User Guides](docs/guides/)** - Step-by-step tutorials and concepts
  - [Migration Guide](docs/guides/MIGRATION_GUIDE_AS_TO_NAME.md) - Upgrading from older versions
  - [Lineage User Story](docs/guides/LINEAGE_USER_STORY.md) - Understanding lineage tracking
  - [Deduce Explained](docs/guides/DEDUCE_EXPLAINED.md) - Missing value imputation guide

- **[Examples](docs/examples/)** - 20+ Quarto notebooks with runnable examples
  - add.to() examples (5 notebooks)
  - add.transform() examples (5 notebooks)
  - add.synthetic() examples (4 notebooks)
  - add.scan() examples (3 notebooks)
  - Lineage tracking examples (2 notebooks)
  - [Troubleshooting Guide](docs/examples/troubleshooting-guide.qmd)

See [docs/README.md](docs/README.md) for the complete documentation index.

## Examples

### Lookup Example

```python
import additory as add
import polars as pl

# Orders with customer IDs
orders = pl.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [101, 102, 101],
    'amount': [100, 200, 150]
})

# Customer reference data
customers = pl.DataFrame({
    'customer_id': [101, 102],
    'name': ['Alice', 'Bob'],
    'city': ['NYC', 'LA']
})

# Add customer info to orders
result = add.to(orders, bring_from=customers, bring=['name', 'city'], against='customer_id')
```

### Transform Example

```python
# Calculate with expressions
df = pl.DataFrame({'price': [100, 200, 300], 'quantity': [2, 3, 1]})
result = add.transform('@calc', df, strategy={'total': 'price * quantity'})

# Filter data
result = add.transform('@filter', df, where='price > 150')

# Sort data
result = add.transform('@sort', df, by='price', strategy={'order': 'desc'})

# Aggregate data
df = pl.DataFrame({'category': ['A', 'B', 'A'], 'value': [10, 20, 30]})
result = add.transform('@aggregate', df, by='category', strategy={'value': 'sum'})

# Round numbers (creates NEW columns)
df = pl.DataFrame({'price': [10.567, 20.123, 30.999]})
result = add.transform('@round:2', df, columns='price')  # Creates price_round

# Fill missing values
df = pl.DataFrame({'age': [25, None, 35, None, 45]})
result = add.transform('@deduce', df, columns='age', method='mean')
```
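
For readers unfamiliar with mean imputation, the idea behind `method='mean'` can be sketched in a few lines of plain Python. This shows the general technique only; `fill_with_mean` is a hypothetical helper, and the library's own handling of edge cases (for example an all-missing column) may differ.

```python
from statistics import mean


def fill_with_mean(values):
    """Replace None entries with the mean of the non-missing values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to estimate from
    fill = mean(observed)
    return [fill if v is None else v for v in values]


ages = fill_with_mean([25, None, 35, None, 45])
```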

### Synthetic Data Example

```python
# Create synthetic data
result = add.synthetic('@new', n=1000, strategy={
    'age': 'normal(40, 10)',
    'salary': 'normal(75000, 15000)',
    'score': 'uniform(0, 100)'
})

# Augment existing data
df = pl.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
result = add.synthetic('@augment', df, n=100)

# Analyze data quality (a mode of add.scan())
result = add.scan('@analyze', df)
```

## Version

Current version: **0.1.3** (Stable Alpha)

### What's New in v0.1.3

- ✅ **Lineage Tracking** - Track data transformations with `lineage=True` parameter
- ✅ **add.scan() Function** - Unified interface for `@analyze` and `@lineage` modes
- ✅ **~95% Rust Implementation** - Optimized code distribution for performance
- ✅ **Mutual Exclusion Validation** - Clear error messages for `lineage` + `as_type`
- ✅ **Helper Functions** - Internal utilities for lineage tracking
- ✅ **Bug Fixes** - Fixed add.to() parameter mapping bug
- ✅ **Code Cleanup** - Removed orphan files and dead code
- ✅ **341/341 Tests Passing** - 100% pass rate

## Development

### Building from Source

```bash
# Clone the repository
git clone https://github.com/sekarkrishna/additory.git
cd additory

# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build the package
cd rust-core
export PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1
maturin build --release

# Install locally
pip install target/wheels/*.whl
```

### Running Tests

```bash
# Run comprehensive test suite
python test_all_modes_comprehensive.py

# Run specific tests
pytest tests/
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

MIT License - see LICENSE file for details

## Changelog

### v0.1.3 (March 9, 2026)
- **Lineage Tracking** - Track data transformations across operations
- **add.scan() Function** - Unified scanning interface (@analyze, @lineage)
- **~95% Rust Implementation** - Optimized Python/Rust code distribution
- **Bug Fixes** - Fixed add.to() parameter mapping, cleaned up code
- **341/341 Tests Passing** - 100% pass rate

### v0.1.3a9 (March 4, 2026)
- Updated API signatures for natural language (bring_to, bring_from, bring)
- Lists everywhere instead of tuples
- @round creates NEW columns (philosophy compliant)
- @deduce mode for missing value imputation
- @extract merged with datetime parsing
- Removed add.set() and add.deduce() functions
- Default seed=42 for reproducibility
- 100% philosophy compliance

### v0.1.3a3 (February 9, 2026)
- Made pandas optional
- Added cross-platform build scripts
- Fixed pandas import issues
- 100% test pass rate

### v0.1.3a2 (February 9, 2026)
- Added banker's rounding (@bankers_round mode)
- Expanded expression library to 179 expressions
- Fixed mode detection issues
- Fixed power operator (`**`) support

### v0.1.3a1 (February 2026)
- Initial alpha release
- Rust core with PyO3 bindings
- Three-function API (to, transform, synthetic)

## Support

For issues, questions, or contributions, please visit:
- GitHub Issues: https://github.com/sekarkrishna/additory/issues
- Documentation: https://github.com/sekarkrishna/additory#readme

## Credits

Built with:
- [Rust](https://www.rust-lang.org/)
- [PyO3](https://pyo3.rs/)
- [Polars](https://www.pola.rs/)
- [maturin](https://github.com/PyO3/maturin)

---

**Made with ❤️ for the data science community**

