Metadata-Version: 2.4
Name: cleanalytix
Version: 0.1.0
Summary: Cleanalytix is a modular Python library for profiling, scoring, and cleaning tabular datasets.
Author: Probot-DATA contributors
License: MIT License
        
        Copyright (c) 2026 Probot-DATA contributors
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/Probot-DATA/Cleanalytix_Repo
Project-URL: Repository, https://github.com/Probot-DATA/Cleanalytix_Repo
Project-URL: Issues, https://github.com/Probot-DATA/Cleanalytix_Repo/issues
Keywords: data quality,data cleaning,data profiling,EDA,machine learning,pandas
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.5.0
Requires-Dist: numpy>=1.23.0
Requires-Dist: scikit-learn>=1.1.0
Requires-Dist: nltk>=3.8.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: ruff>=0.4.0; extra == "dev"
Dynamic: license-file

# Cleanalytix

`Cleanalytix` is a Python library that profiles, scores, cleans, and monitors tabular datasets for data quality through a single pipeline.

It is designed for pandas-first workflows and supports:

- baseline dataset profiling and scoring
- optional cleaning recommendations and automatic cleaning
- optional production/new-dataset monitoring
- optional business rules, thresholds, weights, and type inference for new data

## Installation

From a source checkout:

```bash
git clone https://github.com/Probot-DATA/Cleanalytix_Repo
cd Cleanalytix_Repo
pip install -e ".[dev]"
```

Once this project is published to PyPI, the install command will be:

```bash
pip install cleanalytix
```

Runtime requirements (minimum versions as declared in the package metadata):

- Python 3.9+
- pandas >= 1.5.0
- numpy >= 1.23.0
- scikit-learn >= 1.1.0
- nltk >= 3.8.0

## Quick Start

```python
import pandas as pd
from cleanalytix import Run_DQ_Pipeline

df = pd.read_csv("my_data.csv")

result = Run_DQ_Pipeline(
    dataset_names=["my_dataset"],
    dataset_list=[df],
)

print(result["base_data"]["dirty_scores"])
print(result["base_data"]["meta_before_cleaning"])
print(result["base_data"]["recommendations"])
```

## Production / Monitoring Example

```python
import pandas as pd
from cleanalytix import Run_DQ_Pipeline

train_df = pd.read_csv("train.csv")
prod_df = pd.read_csv("production.csv")

rules = {
    "age": lambda value: pd.isna(value) or 0 <= float(value) <= 120,
}

result = Run_DQ_Pipeline(
    dataset_names=["customers"],
    dataset_list=[train_df],
    new_dataset_list=[prod_df],
    rules=rules,
    cleaning=True,
    interactive=False,
    score_mode="exponential",
)

print(result["base_data"]["dirty_scores"])
print(result["base_data"]["cleaned_scores"])
print(result["prod_data"]["dirty_scores"])
print(result["prod_data"]["change_log"])
```
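
Before passing a rule to the pipeline, it can help to check the predicate on its own. The sketch below uses only pandas and applies the `age` rule from the example above to a small hand-made Series. How `Run_DQ_Pipeline` evaluates rules internally is not documented here, so the element-wise `map` call is an assumption made for illustration, and the sample values are invented.

```python
import pandas as pd

# Same rule as in the monitoring example: missing values pass,
# present values must fall within [0, 120].
rules = {
    "age": lambda value: pd.isna(value) or 0 <= float(value) <= 120,
}

# Hand-made sample values (illustrative only).
ages = pd.Series([34, None, 150, -2, 120], name="age")

# Apply the predicate to each value. This mimics a per-value check and is
# an assumption, not necessarily how Cleanalytix applies rules.
passes = ages.map(rules["age"]).astype(bool)
violations = ages[~passes]

print(violations.tolist())
```

Note that `float(value)` raises `ValueError` for non-numeric strings, so a rule written this way assumes the column is already numeric or missing.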

## Public API

The primary entrypoint is:

```python
from cleanalytix import Run_DQ_Pipeline
```

Additional building blocks are also exported:

```python
from cleanalytix import (
    Compute_DQ_Score,
    DEFAULT_THRESHOLDS,
    generate_meta,
    cleaning_recommendations,
    get_cleaned_data,
    get_table_for_DQ_computation,
    summarize_dataset_health,
    learn_reference_profile,
    adjust_prod_meta_with_reference,
    infer_and_fix_types,
)
```

## Pipeline Output Structure

`Run_DQ_Pipeline` returns:

```python
{
    "base_data": {...},
    "prod_data": {...},
}
```

Both blocks share the same keys:

- `dirty_scores`
- `cleaned_scores`
- `cleaned_datasets`
- `meta_before_cleaning`
- `meta_after_cleaning`
- `recommendations`
- `change_log`
- `summarized_before`
- `summarized_after`
- `main_metrics_before`
- `main_metrics_after`
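
Because both blocks share one key set, downstream code can check a result defensively before reading from it. The following is a minimal sketch: `missing_keys` is a hypothetical helper, not part of the library, and `fake_result` is a stand-in dictionary with placeholder values, since the value types behind each key are not described here.

```python
# Keys listed in this README for both "base_data" and "prod_data".
EXPECTED_KEYS = (
    "dirty_scores",
    "cleaned_scores",
    "cleaned_datasets",
    "meta_before_cleaning",
    "meta_after_cleaning",
    "recommendations",
    "change_log",
    "summarized_before",
    "summarized_after",
    "main_metrics_before",
    "main_metrics_after",
)


def missing_keys(result):
    """Hypothetical helper: map each block name to the documented keys it lacks."""
    problems = {}
    for block_name in ("base_data", "prod_data"):
        block = result.get(block_name, {})
        missing = [key for key in EXPECTED_KEYS if key not in block]
        if missing:
            problems[block_name] = missing
    return problems


# Stand-in for a pipeline result; the None values are placeholders only.
fake_result = {
    "base_data": {key: None for key in EXPECTED_KEYS},
    "prod_data": {key: None for key in EXPECTED_KEYS if key != "change_log"},
}

print(missing_keys(fake_result))
```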

## Examples

Runnable examples live in [examples](./examples):

```bash
python examples/simple_usage.py
python examples/production_usage.py
```

These examples assume the package has already been installed in the active environment.

## Validation

The [validation](./validation) folder contains a portable real-world validation workflow.

- Large raw datasets are intentionally not committed to the repository.
- Put the expected files under `validation/datasets/` by following
  [validation/datasets/README.md](./validation/datasets/README.md).
- Run the validation script:

```bash
python validation/run_validation.py
```

- The script saves non-empty outputs to `validation/outputs/<dataset_name>/`.
- The notebook [validation/main.ipynb](./validation/main.ipynb) uses the same relative-path workflow.

## Repository Layout

- `cleanalytix/` - installable library package
- `examples/` - small runnable examples
- `tests/` - smoke tests and lightweight sample fixtures
- `validation/` - public-friendly validation workflow and output folder
- `archive/legacy/` - historical prototype notebook/code kept for reference, not for active use

## Known Limitations

- Validation datasets are not bundled with the repository.
- The yellow taxi validation workflow samples the first `20,000` rows from each configured monthly file to match the original project workflow and to keep validation practical.
- Interactive cleaning is intended for notebook/CLI use and will prompt for input when `interactive=True`.
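
The first-rows sampling mentioned for the yellow taxi workflow can be approximated with the `nrows` argument of `pandas.read_csv`. This is a sketch under stated assumptions, not the code in `validation/run_validation.py`: the month labels, column names, and in-memory CSV contents are invented stand-ins, and the sample size is reduced from 20,000 to 3 so the example stays small.

```python
import io

import pandas as pd

SAMPLE_ROWS = 3  # the validation workflow uses 20,000; kept tiny here

# In-memory stand-ins for monthly CSV files (hypothetical labels and columns).
monthly_files = {
    "2024-01": io.StringIO("fare,distance\n10,1.2\n12,2.0\n8,0.9\n15,3.1\n"),
    "2024-02": io.StringIO("fare,distance\n9,1.0\n11,1.8\n"),
}

# nrows limits parsing to the first N data rows of each file, so a large
# file does not need to be loaded in full.
samples = [
    pd.read_csv(handle, nrows=SAMPLE_ROWS).assign(month=month)
    for month, handle in monthly_files.items()
]
sampled = pd.concat(samples, ignore_index=True)

print(len(sampled))
```

Taking the first rows is cheap but not random, so the sample reflects whatever ordering the source file has.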

## Contributing

See [CONTRIBUTING.md](./CONTRIBUTING.md).

## License

[MIT](./LICENSE) (c) 2026 Probot-DATA contributors
