Metadata-Version: 2.4
Name: prepro-auto
Version: 1.0.0b4
Summary: AI-assisted, human-in-the-loop tabular data preprocessing — profile, clean, transform, and export any dataset with a reproducible pipeline, from a notebook or the web.
Author: Shivanshu Pandey
License: MIT
Project-URL: Homepage, https://github.com/Chilliflex/prepro_auto
Project-URL: Documentation, https://github.com/Chilliflex/prepro_auto#readme
Project-URL: Repository, https://github.com/Chilliflex/prepro_auto
Keywords: data-preprocessing,data-cleaning,machine-learning,pandas,data-quality,feature-engineering,etl,data-science
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastapi>=0.110
Requires-Dist: uvicorn[standard]>=0.29
Requires-Dist: python-multipart>=0.0.9
Requires-Dist: pydantic>=2.6
Requires-Dist: pydantic-settings>=2.2
Requires-Dist: sqlalchemy>=2.0
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.10
Requires-Dist: reportlab>=4.0
Requires-Dist: pyarrow>=14.0
Requires-Dist: openpyxl>=3.1
Requires-Dist: python-dotenv>=1.0
Requires-Dist: psutil>=5.9
Requires-Dist: nest-asyncio>=1.5
Provides-Extra: groq
Requires-Dist: groq>=0.11; extra == "groq"
Provides-Extra: openai
Requires-Dist: openai>=1.40; extra == "openai"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40; extra == "anthropic"
Provides-Extra: gemini
Requires-Dist: google-generativeai>=0.7; extra == "gemini"
Provides-Extra: mistral
Requires-Dist: mistralai>=1.0; extra == "mistral"
Provides-Extra: ai
Requires-Dist: groq>=0.11; extra == "ai"
Requires-Dist: openai>=1.40; extra == "ai"
Requires-Dist: anthropic>=0.40; extra == "ai"
Requires-Dist: google-generativeai>=0.7; extra == "ai"
Requires-Dist: mistralai>=1.0; extra == "ai"
Provides-Extra: hosting
Requires-Dist: psycopg2-binary>=2.9; extra == "hosting"
Requires-Dist: boto3>=1.34; extra == "hosting"
Requires-Dist: alembic>=1.13; extra == "hosting"
Requires-Dist: celery>=5.3; extra == "hosting"
Requires-Dist: redis>=5.0; extra == "hosting"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: httpx>=0.27; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Dynamic: license-file

# PrePro Auto

[![PyPI version](https://img.shields.io/pypi/v/prepro-auto.svg)](https://pypi.org/project/prepro-auto/)
[![Python](https://img.shields.io/pypi/pyversions/prepro-auto.svg)](https://pypi.org/project/prepro-auto/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Tests](https://img.shields.io/badge/tests-200%20passing-brightgreen.svg)](https://github.com/Chilliflex/prepro_auto)
[![Downloads](https://static.pepy.tech/badge/prepro-auto)](https://pepy.tech/project/prepro-auto)
[![CI](https://github.com/Chilliflex/prepro_auto/actions/workflows/test.yml/badge.svg)](https://github.com/Chilliflex/prepro_auto/actions)

**AI-assisted tabular data preprocessing with human-in-the-loop control.**

Profile, clean, transform, and export any tabular dataset — from a Jupyter notebook or a local web UI — with every step undoable, auditable, and reproducible. The same engine drives both interfaces, so results are identical wherever you call it from.

```bash
pip install prepro-auto
```

> Author: [Shivanshu Pandey](https://github.com/Chilliflex) · Source: [github.com/Chilliflex/prepro_auto](https://github.com/Chilliflex/prepro_auto)

---

## Contents

- [Quickstart](#quickstart) — get going in 30 seconds
- [Ways to give PrePro Auto your data](#ways-to-give-prepro-auto-your-data) — 4 from notebook, 3 from web UI
- [1. Input functions (notebook)](#1-input-functions-notebook) — how to load data into a session
- [2. Preprocessing functions](#2-preprocessing-functions) — clean, transform, visualize
- [3. Output functions](#3-output-functions) — DataFrames, files, audit PDFs, pipelines
- [AI providers](#ai-providers-optional) — optional, 5 providers supported
- [REST API reference](#rest-api-reference)
- [Documentation](#documentation)

---

## Quickstart

**Notebook (recommended for data scientists):**

```python
import prepro_auto

# Easiest: point at a file, auto-detects encoding (handles Latin-1, cp1252, BOM)
session = prepro_auto.launch_file(r"C:\path\to\your_data.csv")
# Click the printed http://127.0.0.1:8721/workbench?job=... link
```

**Web UI (recommended for analysts):** open a terminal (Command Prompt, PowerShell, or any shell; not a Jupyter cell) and run:

```bash
prepro_auto
```

Then open `http://127.0.0.1:8000/workbench` and drag-drop a file.

> **Note:** `prepro_auto` typed inside a Jupyter cell just prints the module object — it doesn't start a server. The CLI command runs from a terminal only. Inside a notebook, use `prepro_auto.launch_file(path)` or `prepro_auto.launch(df)` instead.

---

## Screenshots

### Step 1 — Upload
Drop any CSV, Parquet, Excel, or JSON file. Encoding and delimiter are auto-detected. The sidebar shows live RAM-aware upload limits for your machine.

![Upload](https://raw.githubusercontent.com/Chilliflex/prepro_auto/main/docs/screenshots/Upload.png)

### Step 2 — Profile
Per-column semantic type inference, missing rates, cardinality, and a 0–100 dataset quality score — all in one pass, before any data is changed.

![Profile](https://raw.githubusercontent.com/Chilliflex/prepro_auto/main/docs/screenshots/Profile.png)

### Step 3 — View data
Live table toggling between Original (raw) and Current (cleaned). Version label, row/column count, and quality score update after every operation.

![View data](https://raw.githubusercontent.com/Chilliflex/prepro_auto/main/docs/screenshots/View%20Dataset.png)

### Step 4 — Clean (human-in-the-loop)
Each stage (Missing Values, Outliers, Scaling, Correlation, Encoding) generates per-column decision cards. Approve the recommendation, Override to an alternative, Skip, or Drop — then Execute to commit.

![Clean](https://raw.githubusercontent.com/Chilliflex/prepro_auto/main/docs/screenshots/Clean.png)

### Step 5a — Preset Operations
17 built-in transforms (rename, cast, filter, merge, math, string ops, regex, sort, dedup, and more). Every operation is a single undoable version.

![Preset Operations](https://raw.githubusercontent.com/Chilliflex/prepro_auto/main/docs/screenshots/Transform_Preset_Operation.png)

### Step 5b — AI Assistant
Describe a transform in plain English. The AI proposes the pandas code, shows a preview, and waits for your confirmation before touching the data.

![AI Assistant](https://raw.githubusercontent.com/Chilliflex/prepro_auto/main/docs/screenshots/Transform_AI_Assistant.png)

### Step 5c — Expression Editor
Write any sandboxed pandas expression directly: `df['profit'] = df['revenue'] - df['cost']`. Validated against the current schema before execution.

![Expression Editor](https://raw.githubusercontent.com/Chilliflex/prepro_auto/main/docs/screenshots/Transform_Experssion(Advance).png)

### Step 5d — Visualization
Histograms, bar charts, scatter plots, and condition-based metrics — all rendered live against the current dataset version.

![Visualization](https://raw.githubusercontent.com/Chilliflex/prepro_auto/main/docs/screenshots/Transform_Visualization.png)

### Step 5e — Before & After Dashboard
KPI tiles comparing raw upload to current cleaned version: quality score delta, per-column type changes, and data samples side-by-side.

![Before and After Dashboard](https://raw.githubusercontent.com/Chilliflex/prepro_auto/main/docs/screenshots/Transform_Dashboard(Before%20and%20After).png)

### Step 5f — Data Drift Detection
Upload a second dataset (e.g. last month's production data) and compare distributions. PSI + KS test per column with stable / moderate / significant severity bands.

![Data Drift](https://raw.githubusercontent.com/Chilliflex/prepro_auto/main/docs/screenshots/Transform_Data_Drift.png)

### Step 6 — Results & Export
Quality score before vs after, column-type changes, and three downloads: cleaned data (CSV or Parquet), audit PDF, and a standalone pipeline script.

![Results and Export](https://raw.githubusercontent.com/Chilliflex/prepro_auto/main/docs/screenshots/Result.png)

---

## Ways to give PrePro Auto your data

There are **4 input methods in the notebook** and **3 in the web UI**. Pick whichever fits your workflow.

### From a Jupyter notebook (4 ways)

| # | Method | When to use it |
|---|---|---|
| 1 | `prepro_auto.launch_file(path)` | You have a file on disk — CSV, Excel, JSON, Parquet, etc. Easiest path. Auto-detects encoding and delimiter. **No `pd.read_csv()` needed.** |
| 2 | `prepro_auto.launch(df)` | You already have a pandas DataFrame in memory (from a database query, API response, generated data, or a tricky read you handled yourself). |
| 3 | `session.update(df)` | You already have a session and want to push a new DataFrame to it (e.g. after notebook-side edits). Commits a new undoable version. |
| 4 | Web upload, then notebook reads | Start the server with the CLI, upload via browser, then in the notebook do `prepro_auto.Session(job_id, port).current()` to pull the data back into Python. Rare but valid. |

### From the web UI (3 ways)

| # | Method | When to use it |
|---|---|---|
| 1 | **Drag-and-drop upload** on Step 1 of the workbench | Standard. Drop a CSV/Parquet/Excel/JSON file into the upload box. The engine auto-detects encoding and delimiter. |
| 2 | **File picker** on Step 1 | Same as drag-and-drop, just clicked. Useful when dragging is awkward (split screens, touchpads). |
| 3 | **URL parameter** `?job=<id>` | When the notebook launched the session, the printed URL already includes `?job=...` — no upload needed, the workbench adopts the existing job. |

### Supported file formats

| Format | Extensions | Notes |
|---|---|---|
| CSV | `.csv`, `.tsv`, `.txt` | Auto-detects encoding (utf-8 / utf-8-sig / latin-1 / cp1252) and delimiter (comma, tab, semicolon, pipe) |
| Excel | `.xlsx`, `.xls`, `.xlsm` | First sheet by default; multi-sheet handling via the upload form |
| Parquet | `.parquet`, `.pq` | Fastest format for large datasets, preserves dtypes |
| JSON | `.json`, `.jsonl`, `.ndjson` | JSON-records and JSON-lines both supported |
| Feather | `.feather` | Apache Arrow's native columnar format |
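
To make the auto-detection in the CSV row concrete, here is a minimal standard-library sketch of one way to guess encoding and delimiter. It is an illustration only, not PrePro Auto's actual implementation: the function name `sniff_text_format` and the order in which encodings are tried are assumptions.

```python
import csv

# Order matters: utf-8-sig also reads plain UTF-8 and strips a BOM if present;
# latin-1 accepts every byte sequence, so it must come last as the fallback.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")
CANDIDATE_DELIMITERS = ",\t;|"   # comma, tab, semicolon, pipe


def sniff_text_format(raw: bytes) -> tuple[str, str]:
    """Guess (encoding, delimiter) for the raw bytes of a delimited text file."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            text = raw.decode(encoding)
            break                      # first encoding that decodes cleanly wins
        except UnicodeDecodeError:
            continue
    try:
        delimiter = csv.Sniffer().sniff(text, delimiters=CANDIDATE_DELIMITERS).delimiter
    except csv.Error:
        delimiter = ","                # fall back to comma when the sniffer cannot decide
    return encoding, delimiter
```

Because `latin-1` never fails to decode, a wrong guess shows up as odd characters rather than an exception, which is one reason to inspect the profile before cleaning.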

**Not supported:** PDF, DOCX, HTML, images. PrePro Auto is a tabular-data tool — these formats need a dedicated extraction step first (Camelot or pdfplumber for PDFs, BeautifulSoup for HTML).

#### Have a PDF with a table?

Extract it to a DataFrame first, then hand it to PrePro Auto:

```python
import pdfplumber, pandas as pd, prepro_auto

with pdfplumber.open("report.pdf") as pdf:
    rows = pdf.pages[0].extract_table()        # pick the right page
if not rows:                                    # extract_table() returns None when no table is found
    raise ValueError("No table found on that page")
df = pd.DataFrame(rows[1:], columns=rows[0])    # first row is the header
session = prepro_auto.launch(df)                # now clean it like any DataFrame
```

For PDFs with merged cells or complex layouts, try `camelot-py` (better for bordered tables) or `tabula-py` (requires Java). PrePro Auto deliberately leaves PDF extraction to specialised tools because generic PDF-to-table conversion succeeds only ~30–70% of the time depending on the document — bundling it would mean silent extraction errors hidden under PrePro Auto's name.

---

## 1. Input functions (notebook)

Everything you call **before** preprocessing starts: the functions that get data into a session.

| Function | Parameters | Returns | What it does |
|---|---|---|---|
| `prepro_auto.launch_file(file_path, domain="general", port=None, open_browser=False)` | `file_path`: str or Path | `Session` | Reads a file from disk with auto-encoding-detection, starts the local workbench, returns a session. Handles all supported formats. Prints the workbench URL. |
| `prepro_auto.launch(df, domain="general", port=None, open_browser=False)` | `df`: pandas DataFrame | `Session` | Registers an in-memory DataFrame as a job (no upload, no file I/O), starts the workbench, returns a session. Use when you already have a DataFrame. |
| `prepro_auto.Session(job_id, port)` | `job_id`: str, `port`: int | `Session` | Reconnect to an existing session by ID. Use when the notebook restarted but the server is still running, or to attach to a job created from the web UI. |
| `prepro_auto.set_api_key(provider, api_key, model=None)` | `provider`: one of `"groq" / "openai" / "anthropic" / "gemini" / "mistral"` | `dict` with `ok`, `verified`, `provider`, `model`, `reason` | Configures the AI provider at runtime (in-memory only — not written to disk). Makes a tiny test call to verify the key works. Call before `launch()` if you want AI features active for the session. |

**Example — most common pattern:**

```python
import prepro_auto

# Optional: enable AI features for this session
prepro_auto.set_api_key("openai", "sk-...")

# Load a file (auto-encoding-detection)
session = prepro_auto.launch_file(r"C:\Users\me\data\sales.csv")
```
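
`set_api_key` returns a dict with the keys `ok`, `verified`, `provider`, `model`, and `reason`, so it may be worth checking the result before relying on AI features. The helper below is a hypothetical sketch: `summarize_key_check` is not part of the package, and the meaning it assigns to each field (for example, that `ok` without `verified` means the key was stored but the test call failed) is an assumption.

```python
def summarize_key_check(result: dict) -> str:
    """Turn the dict returned by set_api_key into a one-line status message.

    The interpretation of each field is an assumption made for illustration.
    """
    provider = result.get("provider", "unknown provider")
    model = result.get("model") or "default model"
    reason = result.get("reason") or "no reason given"
    if result.get("ok") and result.get("verified"):
        return f"{provider} ({model}): key verified"
    if result.get("ok"):
        return f"{provider} ({model}): key stored but not verified ({reason})"
    return f"{provider}: key rejected ({reason})"


# In a notebook this would be used as:
#   result = prepro_auto.set_api_key("openai", "sk-...")
#   print(summarize_key_check(result))
```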

---

## 2. Preprocessing functions

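The work itself: clean, transform, version. These are called **on the session object** that the input functions returned, or via the REST endpoints listed below. Paths in the tables are relative to `/api/v1`, so `POST /datasets/{job_id}/profile` means `POST /api/v1/datasets/{job_id}/profile`.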

### Profile and clean

| Function / Endpoint | What it does |
|---|---|
| `POST /datasets/{job_id}/profile` | Per-column type inference, missing rates, 0–100 quality score. Run once after upload. |
| `POST /datasets/{job_id}/stages/missing_values` | Detect missingness mechanism (MCAR / MAR / MNAR), recommend fill strategy per column. Creates decision cards. |
| `POST /datasets/{job_id}/stages/outliers` | IQR + modified Z-score + Isolation Forest. Classifies findings as data errors vs rare events. |
| `POST /datasets/{job_id}/stages/scaling` | Normality-driven scaler choice: Standard / Robust / Box-Cox / Yeo-Johnson / MinMax / log1p. |
| `POST /datasets/{job_id}/stages/correlation` | Find correlated pairs, detect constant / ID-like / target-leaking columns. |
| `POST /datasets/{job_id}/stages/encoding` | Categorical encoding routed by cardinality: label / ordinal / one-hot / frequency / target. |
| `POST /datasets/{job_id}/stages/{stage_name}/execute` | Apply your approved decisions, commit a new version. `stage_name` is one of the five above. |
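
Two of the detectors named in the outliers row can be sketched with the standard library alone. This is a minimal illustration, not the package's code: the 1.5 fence multiplier and the 3.5 cut-off with the 0.6745 constant are the common textbook defaults, and PrePro Auto's actual thresholds are not documented here. Isolation Forest is left out because it needs scikit-learn.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


def modified_z_outliers(values, threshold=3.5):
    """Return values whose modified Z-score (based on the median and MAD) exceeds threshold."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:                       # more than half the values are identical; score is undefined
        return []
    return [v for v in values if abs(0.6745 * (v - median) / mad) > threshold]


readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]
flagged_by_both = set(iqr_outliers(readings)) & set(modified_z_outliers(readings))
```

Requiring agreement between detectors, as in `flagged_by_both`, is one conservative way to decide what is shown as a likely data error; whether a flagged value is an error or a rare event remains a human decision.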

### Decision cards (the human-in-the-loop)

| Endpoint | What it does |
|---|---|
| `GET /datasets/{job_id}/decisions?stage=<stage>` | List decision cards for a stage |
| `POST /decisions/{decision_id}/approve` | Use the recommended action |
| `POST /decisions/{decision_id}/override` | Use an alternative action (body: `{"action": "...", "reason": "..."}`) |
| `POST /decisions/{decision_id}/skip` | Don't change this column |
| `POST /decisions/{decision_id}/drop-column` | Drop the column entirely |
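
For scripted use, the decision endpoints above can be wrapped in a small request builder. The sketch below only constructs the request and does not send it. It assumes the server is on the CLI's default `http://127.0.0.1:8000`, that the decision routes sit under `/api/v1`, and that only `override` needs a JSON body; `decision_request` is a hypothetical helper, and the action name `"median"` used in the usage note is a made-up example.

```python
import json
import urllib.request

API_ROOT = "http://127.0.0.1:8000/api/v1"   # assumed default; notebook sessions use session.port
VERDICTS = {"approve", "override", "skip", "drop-column"}


def decision_request(decision_id, verdict, action=None, reason=None, api_root=API_ROOT):
    """Build (but do not send) the POST request that resolves one decision card."""
    if verdict not in VERDICTS:
        raise ValueError(f"verdict must be one of {sorted(VERDICTS)}")
    data, headers = None, {}
    if verdict == "override":
        if not action:
            raise ValueError("override needs an alternative action")
        # Body shape taken from the table above: {"action": "...", "reason": "..."}
        data = json.dumps({"action": action, "reason": reason or ""}).encode("utf-8")
        headers = {"Content-Type": "application/json"}
    url = f"{api_root}/decisions/{decision_id}/{verdict}"
    return urllib.request.Request(url, data=data, headers=headers, method="POST")
```

With a server running, `urllib.request.urlopen(decision_request(card_id, "override", action="median", reason="skewed column"))` would submit the override; nothing changes in the data until the stage's `execute` endpoint is called.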

### Manual transforms (when you need more control)

| Endpoint | What it does |
|---|---|
| `GET /datasets/{job_id}/transform/operations` | List all 17 preset operations and their parameters |
| `POST /datasets/{job_id}/transform/preset` | Apply one preset op (including rename, drop, cast, fillna, filter, merge, math, map, string ops, regex, sort, dedup, and extract-number) |
| `POST /datasets/{job_id}/transform/expression` | Run a sandboxed pandas expression (e.g. `df["profit"] = df["revenue"] - df["cost"]`) |
| `POST /datasets/{job_id}/transform/batch` | Apply one operation across many columns as a single undoable step |
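
Expressions are validated against the current schema before they run. One way such a check could work is to parse the expression and compare the columns it reads with the columns that exist. The sketch below uses Python's `ast` module; it is not PrePro Auto's sandbox, and `columns_read` and `missing_columns` are hypothetical names.

```python
import ast


def columns_read(expression: str) -> set[str]:
    """Collect column names read via df["name"] in a pandas expression."""
    found = set()
    for node in ast.walk(ast.parse(expression)):   # ast.parse raises SyntaxError on invalid code
        if (
            isinstance(node, ast.Subscript)
            and isinstance(node.ctx, ast.Load)          # reads only, not assignment targets
            and isinstance(node.value, ast.Name)
            and node.value.id == "df"
            and isinstance(node.slice, ast.Constant)
            and isinstance(node.slice.value, str)
        ):
            found.add(node.slice.value)
    return found


def missing_columns(expression: str, schema: list[str]) -> list[str]:
    """Return the columns the expression reads that are absent from the schema."""
    return sorted(columns_read(expression) - set(schema))
```

For the example expression `df["profit"] = df["revenue"] - df["cost"]`, a schema without `cost` would be reported before any data is touched, while `profit` is ignored because it is only written.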

### AI-assisted transforms (optional, needs an API key)

| Endpoint | What it does |
|---|---|
| `POST /datasets/{job_id}/transform/ai-propose` | Describe a change in plain English; AI proposes a concrete transform with preview |
| `POST /datasets/{job_id}/transform/ai-advise` | Ask the AI for advice on a column without changing anything |
| `POST /datasets/{job_id}/transform/assistant` | One-shot assistant call (full message) |
| `POST /datasets/{job_id}/transform/chat` | Multi-turn conversation preserving history |

### Versioning and history

| Endpoint | What it does |
|---|---|
| `GET /datasets/{job_id}/view` | Current (active-version) data with shape, dtypes, sample rows |
| `GET /datasets/{job_id}/history` | Full version history with labels |
| `POST /datasets/{job_id}/undo` | Move active pointer back one version |
| `POST /datasets/{job_id}/redo` | Move active pointer forward one version |
| `GET /datasets/{job_id}/snapshots` | List all committed snapshots |
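
The undo and redo endpoints move an active pointer along the version history. The toy model below illustrates that idea in plain Python. It is a conceptual sketch, not the package's storage layer: the class name `VersionHistory` is hypothetical, and discarding the redo tail when a new version is committed after an undo is an assumption.

```python
class VersionHistory:
    """Linear version history with an active pointer (illustrative model only)."""

    def __init__(self, initial, label="upload"):
        self._versions = [(label, initial)]
        self._active = 0

    def commit(self, label, data):
        # Assumption: committing after an undo drops the versions that were undone.
        del self._versions[self._active + 1:]
        self._versions.append((label, data))
        self._active += 1

    def undo(self) -> bool:
        """Move the pointer back one version; return False at the first version."""
        if self._active == 0:
            return False
        self._active -= 1
        return True

    def redo(self) -> bool:
        """Move the pointer forward one version; return False at the latest version."""
        if self._active == len(self._versions) - 1:
            return False
        self._active += 1
        return True

    @property
    def current(self):
        return self._versions[self._active][1]

    def labels(self):
        return [label for label, _ in self._versions]
```

In this model the raw upload is always version 0, so repeated undo calls eventually return to the original data and then stop.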

### Visualization and monitoring

| Endpoint | What it does |
|---|---|
| `POST /datasets/{job_id}/viz/chart` | Build a histogram, bar, or scatter chart |
| `POST /datasets/{job_id}/viz/metric` | Compute a condition-based metric (e.g. "rows where price > 1000") |
| `POST /datasets/{job_id}/viz/compare` | Compare one column's distribution raw vs current |
| `GET /datasets/{job_id}/viz/dashboard` | Power-BI-style before/after dashboard (KPI tiles + per-column comparison) |
| `POST /drift/compare` | Compare two uploaded datasets for distribution drift (PSI + KS) |
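
The Population Stability Index used by the drift comparison can be computed in a few lines. This standard-library sketch uses equal-width bins taken from the reference sample; the bin count, the `eps` floor for empty bins, and the 0.1 / 0.25 cut-offs are common rules of thumb assumed here for illustration, not values confirmed for PrePro Auto. The KS test is omitted; `scipy.stats.ks_2samp` provides it.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the reference sample `expected`."""
    low, high = min(expected), max(expected)
    width = (high - low) / bins or 1.0          # guard against a constant reference column

    def shares(sample):
        counts = [0] * bins
        for x in sample:
            index = int((x - low) / width)
            counts[min(max(index, 0), bins - 1)] += 1   # clamp values outside the reference range
        return [max(c / len(sample), eps) for c in counts]

    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(shares(expected), shares(actual))
    )


def severity(psi_value, moderate=0.1, significant=0.25):
    """Map a PSI value to a severity band (thresholds are assumed rules of thumb)."""
    if psi_value >= significant:
        return "significant"
    return "moderate" if psi_value >= moderate else "stable"


baseline = list(range(100))
shifted = [v + 50 for v in baseline]
```

Here `severity(psi(baseline, baseline))` is `"stable"` because the two samples are identical, while the shifted sample lands in the `"significant"` band.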

---

## 3. Output functions

The artifacts you take away from a session. Notebook methods return Python objects; REST endpoints return downloadable files.

### From the notebook (Python objects)

| Method | Returns | Where to use it |
|---|---|---|
| `session.current()` | pandas DataFrame | The current (active-version) DataFrame as it stands in the UI. Drop straight into `model.fit(X, y)`. |
| `session.url` | str | The workbench URL for this session — useful for re-opening after closing the tab. |
| `session.job_id` | str | The internal job ID — use it for raw REST API calls. |
| `session.port` | int | The local port the server is running on. |

### From the REST API or web UI (downloadable files)

| Endpoint | File | Where to use it |
|---|---|---|
| `GET /datasets/{job_id}/export/data?format=csv` | Cleaned CSV | Share with teammates, load into BI tools (Tableau, Power BI, Looker), commit to a versioned data repo. |
| `GET /datasets/{job_id}/export/data?format=parquet` | Cleaned Parquet | Faster and smaller than CSV for large datasets; preserves dtypes exactly. |
| `GET /datasets/{job_id}/export/audit` | Audit PDF | Compliance trail listing every transformation with parameters, before/after stats, who approved. Attach to a model-card or hand to a data-governance reviewer. |
| `GET /datasets/{job_id}/export/pipeline` | Runnable `.py` script | Reproduces the exact cleaning with pandas + scikit-learn, no PrePro Auto dependency. Drop into Airflow / Prefect / GitHub Actions. Run with `python pipeline.py raw.csv ready.csv`. |
| `POST /drift/compare` (returns JSON) | Drift report | Per-column PSI / KS verdicts with severity bands. Plug into a monitoring dashboard, alert on `overall_verdict == "significant_drift"`. |
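
The export endpoints can also be fetched from a script using `session.job_id` and `session.port`. The sketch below builds the URLs from the table above and streams a download to disk. It assumes the `/api/v1` prefix and a server on `127.0.0.1`; `export_url` and `download` are hypothetical helpers, not part of the package.

```python
import shutil
import urllib.request
from urllib.parse import urlencode

EXPORT_PATHS = {"data": "export/data", "audit": "export/audit", "pipeline": "export/pipeline"}


def export_url(job_id, artifact, port=8000, fmt=None):
    """Build the download URL for one export artifact of a job."""
    if artifact not in EXPORT_PATHS:
        raise ValueError(f"artifact must be one of {sorted(EXPORT_PATHS)}")
    url = f"http://127.0.0.1:{port}/api/v1/datasets/{job_id}/{EXPORT_PATHS[artifact]}"
    if artifact == "data":
        if fmt not in ("csv", "parquet"):
            raise ValueError("fmt must be 'csv' or 'parquet' for data exports")
        url += "?" + urlencode({"format": fmt})
    return url


def download(url, destination):
    """Stream the response body to a local file (requires a running server)."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        shutil.copyfileobj(response, out)
```

In a notebook, `download(export_url(session.job_id, "pipeline", port=session.port), "pipeline.py")` would save the standalone script next to the notebook.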

### Two typical workflows end-to-end

```python
# Workflow 1 — notebook to model, no file I/O:
session = prepro_auto.launch_file(r"C:\data\sales.csv")
# ...clean visually in the browser, then:
df = session.current()                  # fetch the cleaned DataFrame once
X = df.drop(columns=["target"])
y = df["target"]
model.fit(X, y)                         # `model` is any estimator you have already created

# Workflow 2 — clean once, productionize the pipeline:
# 1) Download pipeline.py from the workbench's Export step
# 2) Commit it to your model repo
# 3) In production:
#    subprocess.run(["python", "pipeline.py", "incoming.csv", "ready.csv"])
```

---

## What it does

- **Profile** — per-column type inference, missing rates, 0–100 quality score
- **Clean (guided)** — five HITL stages: missing values, outliers, scaling, correlation/leakage, encoding
- **Transform (manual)** — 17 preset ops, sandboxed expressions, multi-column batches
- **AI assistant** — optional; describe a change in plain English; preview before applying
- **Visualize & dashboard** — histograms, bar, scatter; before/after dashboard with KPI tiles
- **Data drift** — PSI + KS test between two datasets
- **Undo/redo** — every change is a version
- **Export** — cleaned data (CSV/Parquet), audit PDF, runnable Python pipeline

## Methods

Field-standard methods throughout: MICE / KNN / median imputation, IQR + MAD + Isolation Forest for outliers, normality-driven scaling (Standard / Robust / Box-Cox / Yeo-Johnson), label / ordinal / one-hot / frequency / target encoding. No accuracy compromises — the same algorithms a data scientist would write by hand.

---

## AI providers (optional)

AI features are **optional**. Everything works offline without a key. PrePro Auto supports five providers:

| Provider | ID | Install | Get a key |
|---|---|---|---|
| Groq (free tier, fast) | `groq` | `pip install prepro-auto[groq]` | https://console.groq.com |
| OpenAI / GPT | `openai` | `pip install prepro-auto[openai]` | https://platform.openai.com |
| Anthropic Claude | `anthropic` | `pip install prepro-auto[anthropic]` | https://console.anthropic.com |
| Google Gemini | `gemini` | `pip install prepro-auto[gemini]` | https://aistudio.google.com/app/apikey |
| Mistral | `mistral` | `pip install prepro-auto[mistral]` | https://console.mistral.ai |

Or install all five at once: `pip install prepro-auto[ai]`.

### Three ways to give PrePro Auto your API key

**1. Notebook (in-memory, session-only — safest):**
```python
prepro_auto.set_api_key("openai", "sk-...")
```

**2. Web UI:** click **AI Provider → Configure API key** in the side rail, paste key, click Test & apply.

**3. `.env` file (survives restarts):**
```bash
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
```

**Security note:** the `.env` file stores the key in plain text. That is fine on a personal machine; on a shared or hosted deployment, do not persist keys to disk.

---

## REST API reference

The web app and SDK both call the same endpoints under `/api/v1`. Once the server is running, the interactive Swagger UI is at `http://localhost:8000/docs`.

For the full table organized by category, see [Section 2 — Preprocessing functions](#2-preprocessing-functions) and [Section 3 — Output functions](#3-output-functions) above. System endpoints:

| Endpoint | Purpose |
|---|---|
| `GET /api/v1/health` | Liveness check |
| `GET /api/v1/system/limits` | Live RAM-aware upload limits |
| `GET /api/v1/system/llm` | List providers + active one |
| `POST /api/v1/system/llm/configure` | Set provider + key at runtime |

---

## Documentation

- [Complete Guide (PDF)](https://github.com/Chilliflex/prepro_auto/blob/main/docs/PrePro_Auto_Complete_Guide.pdf) — project overview, architecture, all ML/stats models used, accuracy benchmarks, full user guide for notebook and web UI
- Interactive Swagger at `http://localhost:8000/docs` (once running)

## Contributing

Contributions welcome — bugs, tests, docs, and features. See [CONTRIBUTING.md](CONTRIBUTING.md) to get started.

## Roadmap

See [ROADMAP.md](ROADMAP.md) for what's released, what's in progress, and what's planned.

## License

MIT
