Metadata-Version: 2.4
Name: aidatapilot
Version: 0.1.1
Summary: Lightweight Intelligent Data Automation Engine — plug-and-play pipelines for everyone.
Author-email: Rooben RS <rooben.rs@dhsit.co.uk>
License: MIT
Keywords: data,pipeline,automation,etl,cleaning,ml,preprocessing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.5.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: click>=8.1
Requires-Dist: typing_extensions>=4.0; python_version < "3.11"
Requires-Dist: openpyxl>=3.0.0
Requires-Dist: pyarrow>=10.0.0
Requires-Dist: matplotlib>=3.6.0
Requires-Dist: seaborn>=0.12.0
Requires-Dist: psutil>=5.9.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Dynamic: license-file

# aidatapilot 🚀 — Your Partner in Data Automation

> "The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency." — Bill Gates

Welcome to **aidatapilot**. I'm here to guide you from raw, messy datasets to production-ready signals in just one line of code. aidatapilot isn't just a library; it's an intelligent engine designed to handle the heavy lifting of data engineering so you can focus on the *insight*.

---

## 🎓 The aidatapilot Way (Mentoring Guide)

As your guide, I recommend starting with the "Simple API." It's designed to give you professional-grade results without the complexity of manual boilerplate.

### 1. The "Master Brain": `auto_pipeline` (Adaptive)
The most powerful command in aidatapilot. It analyzes your data, detects quality issues, and **dynamically constructs a custom pipeline** without any manual configuration. It also prints a beautiful **Automation Decision Report** explaining its choices.

```python
import aidatapilot

# One command to rule them all
result = aidatapilot.auto_pipeline("messy_raw_data.csv")
```

### 2. General Data Cleaning: `auto_clean`
The "Gold Standard" for standard tabular data. It normalizes your column names, infers data types, **fills missing values** (0 for ints, "null" for strings), and removes duplicates.

```python
# Perfect for daily reporting and BI
aidatapilot.auto_clean("sales_raw.xlsx", "sales_final.csv")
```

### 3. Machine Learning Ready: `auto_ml_prep`
Preparing data for a model? This command does everything `auto_clean` does, plus **Outlier Clipping**, **Categorical Encoding**, and **MinMax Scaling**.

```python
# From raw data to 'model.fit()' ready
aidatapilot.auto_ml_prep("users.csv", "training_data.csv")
```

### 4. Specialized: `auto_analytics` & `auto_text_prep`
- Use **`auto_analytics`** for time-series and BI reports.
- Use **`auto_text_prep`** for LLM and RAG workflows (it handles text cleaning and chunking).

---

## 🧠 Intelligence Advisor

Before you clean, you might want to understand *what's wrong*. Run the **Intelligence Advisor** to get a proactive report on your data health:

```python
aidatapilot.analyze_dataset("mysterious_data.csv")
```

---

## 🛠 Becoming a Pro: The Pipeline Class

For those who need granular control, the `Pipeline` class is your cockpit. You can mix and match templates or define custom steps.

```python
from aidatapilot import Pipeline

# Craft a custom journey
pipe = Pipeline(template="ml_preprocess")
pipe.set_source("raw.csv")
pipe.set_output("ready.csv")

# Execute with performance tracking
result = pipe.run()
```

---

## 📦 Installation

```bash
pip install aidatapilot
```

## 🏗 Why aidatapilot?

*   **Production-Ready**: Built with registry patterns and robust error handling.
*   **Memory Safe**: Designed to handle large datasets without crashing your environment.
*   **Intelligent**: Heuristic-based suggestions that improve over time.

*Happy Automating! Feel free to reach out if you need help navigating your data pipelines.*

---

## 🔍 Deep Dive: Understanding the Operations

Every "auto" command in aidatapilot is carefully designed to handle specific business and data needs. Here is exactly what happens under the hood:

### 🚀 `auto_pipeline` (The Smart Choice)
*   **Components**: `IntelligenceAdvisor` + Recommended Template.
*   **What it does**: Dynamically analyzes data patterns (null counts, skewness, text length, dates) and selects the best cleaning path.
*   **Why it's useful**: Eliminates guesswork. Ideal for unknown or messy data when you don't know where to start. It's the "set it and forget it" tool.
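
To make the decision process concrete, here is a rough sketch of the kind of pandas statistics such an advisor can use to pick a path. The `recommend_template` function, its thresholds, and the template names are illustrative inventions for this sketch, not aidatapilot's internal API:

```python
import pandas as pd

def recommend_template(df: pd.DataFrame) -> str:
    """Illustrative heuristic: choose a cleaning path from simple data stats."""
    null_ratio = df.isna().mean().mean()
    text_cols = df.select_dtypes(include="object").columns
    # Long free-text columns suggest a text/LLM preparation path
    if any(df[c].astype(str).str.len().mean() > 100 for c in text_cols):
        return "text_prep"
    # Date-like columns suggest an analytics path
    if any("date" in c.lower() for c in df.columns):
        return "analytics"
    # Heavy missingness calls for the general cleaning path first
    if null_ratio > 0.2:
        return "clean"
    return "ml_preprocess"

df = pd.DataFrame({"signup_date": ["2024-01-02", "2024-02-03"],
                   "score": [1.0, 2.0]})
print(recommend_template(df))  # → analytics
```

The real advisor inspects more signals (skewness, text length distributions, duplicate rates), but the shape of the logic is the same: profile first, then route.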

### ✨ `auto_clean` (The Gold Standard)
*   **Components**: `normalize_columns`, `infer_types`, `handle_missing_data`, `deduplicate`.
*   **What it does**: Cleans column names, casts types, **interpolates IDs** while filling other missing values (0 for integers, "null" for strings), and removes duplicates.
*   **Why it's useful**: The perfect daily cleaning tool. Ensures your data is tidy and error-free for most general tasks.
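
The four steps map to roughly the following pandas operations — a simplified sketch of the behavior described above, not the library's actual implementation:

```python
import pandas as pd

df = pd.DataFrame({" First Name ": ["Ana", "Ana", None],
                   "Order ID": [1, 1, 2]})

# normalize_columns: strip, lowercase, snake_case
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# handle_missing_data: 0 for numeric columns, "null" for string columns
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(0)
    else:
        df[col] = df[col].fillna("null")

# deduplicate: drop exact duplicate rows
df = df.drop_duplicates().reset_index(drop=True)
print(df)  # 2 rows, columns: first_name, order_id
```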

### 🤖 `auto_ml_prep` (Model Readiness)
*   **Components**: `auto_clean` steps + `outlier_detection`, `encode_categorical`, `scale_numeric`.
*   **What it does**: Beyond basic cleaning, it handles numeric outliers (via clipping), **encodes text categories to numbers**, and scales values (MinMax 0–1).
*   **Why it's useful**: High-speed preparation for model training. Most ML models (Scikit-Learn, PyTorch) require numeric, scaled data with no missing values.
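
In plain pandas, the three extra ML steps look roughly like this. The IQR clipping rule and integer category codes are common choices used here for illustration; aidatapilot's exact rules may differ:

```python
import pandas as pd

df = pd.DataFrame({"age": [21, 25, 30, 500],   # 500 is an outlier
                   "plan": ["free", "pro", "free", "pro"]})

# outlier clipping: cap values outside 1.5 * IQR of the quartiles
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age"] = df["age"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# encode_categorical: map each category to an integer code
df["plan"] = df["plan"].astype("category").cat.codes

# scale_numeric: MinMax scaling into the 0-1 range
df["age"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
print(df)
```

After these steps every column is numeric, bounded, and free of missing values — exactly what `model.fit()` expects.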

### 📊 `auto_analytics` (BI & Reporting)
*   **Components**: `normalize_columns`, `format_date`, `handle_missing_data`, `deduplicate`, `basic_aggregation`.
*   **What it does**: Special focus on **Universal Date Parsing** and deduplication. Includes optional aggregation for quick reporting.
*   **Why it's useful**: Best for time-series data and business dashboards, where date consistency and low redundancy are critical.
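
A minimal pandas sketch of the date-parsing, dedup, and aggregation flow — column names and the monthly grouping are made up for illustration, and the real `basic_aggregation` options may differ:

```python
import pandas as pd

df = pd.DataFrame({"order date": ["2024-01-05", "2024-01-05", "2024-02-10"],
                   "sales": [10, 10, 30]})
df.columns = [c.replace(" ", "_") for c in df.columns]

# format_date: coerce to one consistent datetime dtype
# (unparseable values become NaT instead of raising)
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# deduplicate, then a basic aggregation: monthly totals for reporting
df = df.drop_duplicates()
monthly = df.groupby(df["order_date"].dt.to_period("M"))["sales"].sum()
print(monthly)  # 2024-01: 10, 2024-02: 30
```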

### 📄 `auto_text_prep` (LLM & RAG)
*   **Components**: `normalize_columns`, `clean_text`, `generate_metadata`, `chunk_text`.
*   **What it does**: Cleans document text, calculates word counts/lengths, and **splits long text into overlapping chunks**.
*   **Why it's useful**: Essential for AI applications. Prepares documents for embedding and storage in **Vector Databases** (like Pinecone or Chroma).
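
The overlapping-chunk step can be sketched in plain Python. The function name, chunk size, and overlap here are illustrative, not aidatapilot's real defaults:

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks that overlap, so content at a
    chunk boundary is never lost (illustrative sketch)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

chunks = chunk_text("a" * 500, size=200, overlap=50)
print(len(chunks), [len(c) for c in chunks])  # → 3 [200, 200, 200]
```

The overlap matters for RAG: a sentence cut in half at a chunk boundary is fully contained in the neighboring chunk, so the embedding of at least one chunk captures it intact.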
