Metadata-Version: 2.4
Name: aidatapilot
Version: 0.2.4
Summary: Lightweight Intelligent Data Automation Engine — plug-and-play pipelines for everyone.
Author-email: Rooben RS <rooben@dhsit.co.uk>
License: MIT
Keywords: data,pipeline,automation,etl,cleaning,ml,preprocessing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.5.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: click>=8.1
Requires-Dist: typing_extensions>=4.0; python_version < "3.11"
Requires-Dist: openpyxl>=3.0.0
Requires-Dist: pyarrow>=10.0.0
Requires-Dist: matplotlib>=3.6.0
Requires-Dist: seaborn>=0.12.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: pypdf>=3.0.0
Requires-Dist: python-docx>=0.8.11
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Dynamic: license-file

# aidatapilot 🚀 — High-Impact Data Automation Engine

[![Version](https://img.shields.io/badge/version-0.2.4-blue)](https://github.com/aidatapilot)
[![Build](https://img.shields.io/badge/build-passing-brightgreen)](https://github.com/aidatapilot)
[![License](https://img.shields.io/badge/license-MIT-green)](LICENSE)

**aidatapilot** is an intelligent automation engine that transforms raw, messy datasets into production-ready signals. It is designed to bridge the gap between "Raw Data" and "Actionable Insights" by automating the most time-consuming parts of data engineering: profiling, cleaning, and preparation.

---

## 📖 Table of Contents
1. [🚀 Quick Start](#-quick-start)
2. [🏗️ Usage Levels](#-usage-levels)
   - [Level 1: Autonomous (`auto_pilot`)](#level-1-autonomous-auto_pilot)
   - [Level 2: Simplified (Fast Actions)](#level-2-simplified-fast-actions)
   - [Level 3: Professional (Fluent API)](#level-3-professional-fluent-api)
3. [🧠 Intelligence Advisor](#-intelligence-advisor)
4. [📊 Visualization Layer](#-visualization-layer)
5. [🛠️ Built-in Processing Steps](#-built-in-processing-steps)
6. [🏗️ Technical Architecture](#-technical-architecture)
7. [🔌 Extensibility: Custom Steps](#-extensibility-custom-steps)
8. [📦 Installation](#-installation)

---

## 🚀 Quick Start

Get from messy CSV to clean data in just **two lines of code**:

```python
import aidatapilot

# The "One-Line" Master Command
aidatapilot.auto_pipeline("messy_data.csv", "cleaned_data.csv", visualize=True)
```

This single command performs:
1. **Profiling**: Detects if your data is Transactional, Tabular, or Text.
2. **Analysis**: Identifies nulls, outliers, and formatting errors.
3. **Execution**: Builds and runs a custom cleaning pipeline.
4. **Reporting**: Generates visual charts in the `reports/` folder.

---

## 🏗️ Usage Levels

### Level 1: Autonomous (`auto_pilot`)
Perfect for unknown or highly inconsistent datasets. The engine uses a **Heuristic Rules Engine** to decide which cleaning template to apply.

```python
import aidatapilot
result = aidatapilot.auto_pilot("raw_data.csv")
print(f"Algorithm Selected: {result.state}")
```

### Level 2: Simplified (Fast Actions)
For when you know what you want. Use opinionated scripts for specific domains:

| Command | Best For | Technical Features |
| :--- | :--- | :--- |
| `auto_clean()` | Daily Reporting | Null-filling, Deduplication, ID repair. |
| `auto_ml_prep()` | Model Training | Label-encoding, MinMax scaling, Outlier clipping. |
| `auto_text_prep()` | LLM & RAG | Contextual chunking, Text sanitization. |
| `auto_analytics()` | BI & Dashboards | Date formatting, KPI placeholders. |
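To make the `auto_ml_prep()` row above concrete, here is a minimal, standalone pandas sketch of label-encoding and MinMax scaling. The column names and exact logic are illustrative assumptions, not aidatapilot internals:

```python
import pandas as pd

# Toy frame standing in for raw training data (illustrative only)
df = pd.DataFrame({"color": ["red", "blue", "red"], "price": [10.0, 20.0, 30.0]})

# Label-encoding: map each category to a numeric code
df["color"] = df["color"].astype("category").cat.codes

# MinMax scaling: squeeze numeric values into [0, 1]
df["price"] = (df["price"] - df["price"].min()) / (df["price"].max() - df["price"].min())

print(df)
```

In practice the fast-action commands bundle steps like these (plus outlier clipping) behind a single call, so you only drop to this level when you need non-default behavior.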

### Level 3: Professional (Fluent API)
For Data Engineers who need exact control over the execution DAG.

```python
from aidatapilot import Pipeline

(
    Pipeline(template="analytics_cleaning")
    .set_source("sales_data.csv")
    .then("normalize_columns")
    .then("format_date", columns=["order_date"])
    .then("filter_rows", condition="price > 0")
    .set_output("ready_for_bi.csv")
    .run()
)
```

---

## 🧠 Intelligence Advisor

The **Advisor** is a proactive diagnostic tool. Instead of just cleaning data, it tells you *why* it needs cleaning.

```python
from aidatapilot import Advisor

advisor = Advisor("data.csv")
print(f"Health Score: {advisor.get_readiness_score()}%")
print(f"Primary Insight: {advisor.get_primary_insight()}")

# Detailed JSON report
report = advisor.analyze()
print(report.diagnostics["null_map"])
```

---

## 📊 Visualization Layer

Visual evidence of data health is critical for stakeholder communication. `aidatapilot` generates these automatically:

```python
aidatapilot.visualize_dataset(df, report_dir="reports/")
```
*   **Missing Data Heatmap**: See exactly where gaps are clustering.
*   **Correlation Matrix**: Understand relationships between features.
*   **Outlier Boxplots**: Identify anomalies visually.

---

## 🛠️ Built-in Processing Steps

Every `then()` or `add_step()` call resolves its step name against an internal registry. Top steps include:

*   **`normalize_columns`**: Standardizes headers to `snake_case`.
*   **`infer_types`**: Auto-detects Dates, Integers, and Floats.
*   **`handle_missing_data`**: Smart-fills based on column semantics.
*   **`interpolate_ids`**: Repairs broken or missing sequential IDs.
*   **`encode_categorical`**: Converts text labels to numeric codes.
*   **`scale_numeric`**: Scales data using MinMax or Standard (Z-Score) methods.
*   **`chunk_text`**: Splits long text for Vector DBs with sentence-boundary awareness.
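For a sense of what a step like `normalize_columns` does, here is a rough, standalone sketch of snake_case header normalization. The regex and edge-case handling are assumptions for illustration, not the library's implementation:

```python
import re
import pandas as pd

def normalize_columns(df):
    """Standardize headers to snake_case (illustrative sketch)."""
    df = df.copy()
    df.columns = [
        # lowercase, collapse runs of non-alphanumerics into "_", trim edges
        re.sub(r"[^0-9a-z]+", "_", col.strip().lower()).strip("_")
        for col in df.columns
    ]
    return df

df = pd.DataFrame(columns=["Order Date", " Total Price ($) "])
print(normalize_columns(df).columns.tolist())  # → ['order_date', 'total_price']
```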

---

## 🏗️ Technical Architecture

`aidatapilot` is built on a modular "Factory" architecture:
1. **Connectors**: Load data from CSV, Excel, or SQL (Registry-based).
2. **Compiler**: Transforms your `Pipeline` definition into a **Directed Acyclic Graph (DAG)** of execution nodes.
3. **Runtime**: Executes the nodes using a thread-safe engine with **Memory Safety** mode for large datasets.
4. **Publishers**: Exports the final result to your destination (File, Cloud, or memory).
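The registry/compiler split described above can be pictured with a minimal, hypothetical sketch (none of these names are aidatapilot's real internals): steps register themselves under a string key, and a pipeline definition "compiles" into an ordered chain of execution nodes.

```python
# Hypothetical sketch of a registry-based step factory and pipeline compiler
STEP_REGISTRY = {}

def register_step(name):
    def decorator(fn):
        STEP_REGISTRY[name] = fn
        return fn
    return decorator

@register_step("double")
def double(value):
    return value * 2

@register_step("increment")
def increment(value):
    return value + 1

def compile_pipeline(step_names):
    """Resolve names against the registry into an ordered chain of nodes."""
    nodes = [STEP_REGISTRY[name] for name in step_names]
    def run(value):
        for node in nodes:
            value = node(value)
        return value
    return run

run = compile_pipeline(["double", "increment"])
print(run(5))  # → 11
```

The real engine operates on DataFrames rather than scalars and adds DAG scheduling and memory safety, but the resolve-then-execute shape is the same.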

---

## 🔌 Extensibility: Custom Steps

You can easily add your own logic to the engine using the `@register_step` decorator:

```python
from aidatapilot.core.registry import register_step

@register_step("my_custom_cleanup")
def my_custom_cleanup(df, **params):
    # Your custom pandas logic here
    df['new_col'] = df['old_col'] * 2
    return df

# Now it's available in any pipeline!
pipeline.then("my_custom_cleanup")
```

---

## 📦 Installation

```bash
# Standard Install
pip install aidatapilot

# Development Install
git clone https://github.com/aidatapilot/aidatapilot.git
cd aidatapilot
pip install -e .
```

---

*“The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency.” — Bill Gates*

**AIDataPilot | DHS IT Solutions**
