Metadata-Version: 2.4
Name: prepro-auto
Version: 1.0.0b1
Summary: AI-assisted, human-in-the-loop tabular data preprocessing — profile, clean, transform, and export any dataset with a reproducible pipeline, from a notebook or the web.
Author: Shivanshu Pandey
License: MIT
Project-URL: Homepage, https://github.com/Chilliflex/prepro_auto
Project-URL: Documentation, https://github.com/Chilliflex/prepro_auto#readme
Project-URL: Repository, https://github.com/Chilliflex/prepro_auto
Keywords: data-preprocessing,data-cleaning,machine-learning,pandas,data-quality,feature-engineering,etl,data-science
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastapi>=0.110
Requires-Dist: uvicorn[standard]>=0.29
Requires-Dist: python-multipart>=0.0.9
Requires-Dist: pydantic>=2.6
Requires-Dist: pydantic-settings>=2.2
Requires-Dist: sqlalchemy>=2.0
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.10
Requires-Dist: reportlab>=4.0
Requires-Dist: pyarrow>=14.0
Requires-Dist: openpyxl>=3.1
Requires-Dist: python-dotenv>=1.0
Requires-Dist: psutil>=5.9
Requires-Dist: nest-asyncio>=1.5
Provides-Extra: groq
Requires-Dist: groq>=0.11; extra == "groq"
Provides-Extra: openai
Requires-Dist: openai>=1.40; extra == "openai"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40; extra == "anthropic"
Provides-Extra: gemini
Requires-Dist: google-generativeai>=0.7; extra == "gemini"
Provides-Extra: mistral
Requires-Dist: mistralai>=1.0; extra == "mistral"
Provides-Extra: ai
Requires-Dist: groq>=0.11; extra == "ai"
Requires-Dist: openai>=1.40; extra == "ai"
Requires-Dist: anthropic>=0.40; extra == "ai"
Requires-Dist: google-generativeai>=0.7; extra == "ai"
Requires-Dist: mistralai>=1.0; extra == "ai"
Provides-Extra: hosting
Requires-Dist: psycopg2-binary>=2.9; extra == "hosting"
Requires-Dist: boto3>=1.34; extra == "hosting"
Requires-Dist: alembic>=1.13; extra == "hosting"
Requires-Dist: celery>=5.3; extra == "hosting"
Requires-Dist: redis>=5.0; extra == "hosting"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.23; extra == "dev"
Requires-Dist: httpx>=0.27; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Dynamic: license-file

# PrePro Auto

**AI-assisted tabular data preprocessing with human-in-the-loop control.**

Profile, clean, transform, and export any tabular dataset — from a Jupyter notebook or a local web UI — with every step undoable, auditable, and reproducible. The same engine drives both interfaces, so results are identical wherever you call it from.

```bash
pip install prepro-auto
```

> Author: [Shivanshu Pandey](https://github.com/Chilliflex) · Source: [github.com/Chilliflex/prepro_auto](https://github.com/Chilliflex/prepro_auto)

---

## Quickstart — Notebook (no upload)

```python
import pandas as pd
import prepro_auto

df = pd.read_csv("your_data.csv")
session = prepro_auto.launch(df)        # opens the local workbench, NO upload
# -> click the printed http://127.0.0.1:8721/workbench?job=... link

# clean visually in the browser, then back in the notebook:
cleaned = session.current()             # the UI-edited DataFrame
session.update(cleaned)                 # push notebook edits back to the UI
```

That's the whole loop. Your DataFrame is loaded directly from the notebook's memory — no file upload, no context switch. `df` (your original) never changes; `session.current()` always returns the latest cleaned version.

## Quickstart — Web UI

```bash
prepro_auto                             # starts the workbench at http://127.0.0.1:8000
```

Then open `http://127.0.0.1:8000/workbench` and upload a file.

---

## What it does

- **Profile** — per-column type inference, missing rates, 0–100 quality score
- **Clean (guided)** — missing values, outliers, scaling, correlation/leakage, encoding; each issue becomes a reviewable decision with a recommended action and alternatives
- **Transform (manual)** — 17 preset ops, sandboxed expressions, multi-column batches
- **AI assistant** — optional; describe a change in plain English, and the assistant confirms your intent and shows a real preview before applying it
- **Visualize & dashboard** — histogram, bar, and scatter charts, plus a before/after dashboard with KPI tiles and per-column comparison
- **Data drift** — compare two datasets to detect distribution shifts (PSI + KS)
- **Undo/redo** — every change is a version
- **Export** — clean data (CSV/Parquet), audit PDF, and a runnable Python pipeline script

---

## What you get out of PrePro Auto

A session produces five concrete outputs, each designed to plug straight into a real-world workflow:

| Output | What it is | Where to use it |
|---|---|---|
| **Cleaned DataFrame** | The in-memory DataFrame after all your cleaning + transforms, returned by `session.current()` in the notebook | Feed straight into `model.fit(X, y)` for scikit-learn, XGBoost, LightGBM, PyTorch, TensorFlow. No file I/O needed. |
| **Cleaned dataset file** | A CSV or Parquet file via `GET /datasets/{job_id}/export/data?format=csv` (or `format=parquet`) | Share with teammates, upload to a feature store, load into BI tools (Tableau, Power BI, Looker), commit to a versioned data repo, or feed into downstream ETL jobs. Parquet files are typically smaller and faster to read for large datasets. |
| **Audit PDF** | A multi-page PDF via `GET /datasets/{job_id}/export/audit` listing every transformation with its parameters, before/after stats, and who approved it | Compliance trail for regulated industries (finance, healthcare, insurance); attach to a model-card or experiment-tracking entry; hand to a reviewer or data-governance team to prove the cleaning is reproducible and reasoned, not arbitrary. |
| **Runnable Python pipeline** | A standalone `.py` script via `GET /datasets/{job_id}/export/pipeline` that reproduces the exact cleaning with pandas + scikit-learn and has no PrePro Auto dependency | Drop into a production training pipeline, an Airflow/Prefect/Dagster DAG, a CI job, or a coworker's machine. Anyone can run `python pipeline.py raw.csv clean.csv` and get the same result you produced visually. |
| **Drift report** | A per-column JSON verdict (PSI, KS test, severity bands) via `POST /drift/compare` between two datasets | Monitor a deployed model — compare last month's input distribution to this month's. Catch silent data shifts (a new product category, a sensor recalibration, a market regime change) before they degrade model performance. Plug into a monitoring dashboard or alert on `overall_verdict == "significant_drift"`. |
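
For intuition about the two drift metrics named in the last row, here is a minimal standard-library sketch of PSI and the two-sample KS statistic. It is not PrePro Auto's implementation: the function names, the ten equal-width bins, the epsilon smoothing, and the 0.1 / 0.25 rule-of-thumb PSI bands are assumptions for illustration, and the package's own severity bands may differ.

```python
import math
from bisect import bisect_right


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index using equal-width bins over the baseline range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for a constant column

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def psi_band(value):
    """Common rule-of-thumb bands (an assumption, not the package's thresholds)."""
    if value < 0.1:
        return "stable"
    return "moderate" if value < 0.25 else "significant"


baseline = [i / 100 for i in range(1000)]  # evenly spread over [0, 10)
shifted = [x + 3.0 for x in baseline]      # same shape, moved right by 3
print(psi_band(psi(baseline, baseline)), psi_band(psi(baseline, shifted)))
print(round(ks_statistic(baseline, shifted), 2))
```

In practice you would read the verdict from the `POST /drift/compare` response rather than recompute it; the sketch only shows what the numbers measure.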

**Two common workflows:**

```python
# Workflow 1 — notebook to model, all in-process (zero file I/O):
session = prepro_auto.launch(df)
# ...clean visually in the browser...
X = session.current().drop(columns=["target"])
y = session.current()["target"]
model.fit(X, y)

# Workflow 2 — clean once, productionize with the exported pipeline:
# 1) export pipeline.py from the workbench
# 2) commit pipeline.py to your model repo
# 3) in production: subprocess.run(["python", "pipeline.py", "incoming.csv", "ready.csv"])
```
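
Step 3 of Workflow 2 could be wrapped along these lines. This is a sketch, assuming the exported script takes the input and output paths as its two positional arguments (as in `python pipeline.py raw.csv clean.csv`) and exits non-zero on failure; `build_pipeline_cmd` and `run_pipeline` are hypothetical helper names, not part of PrePro Auto.

```python
import subprocess
import sys
from pathlib import Path


def build_pipeline_cmd(script, src, dst):
    """Argument list for `python pipeline.py <input> <output>`."""
    # sys.executable keeps the run inside the current interpreter/virtualenv.
    return [sys.executable, str(script), str(src), str(dst)]


def run_pipeline(script, src, dst):
    """Run the exported script and fail loudly if it does not produce output."""
    if not Path(src).is_file():
        raise FileNotFoundError(f"input dataset not found: {src}")
    # check=True raises CalledProcessError on a non-zero exit code.
    subprocess.run(build_pipeline_cmd(script, src, dst), check=True)
    if not Path(dst).is_file():
        raise RuntimeError(f"pipeline finished but wrote no file at {dst}")
    return Path(dst)


print(build_pipeline_cmd("pipeline.py", "incoming.csv", "ready.csv")[1:])
```

Checking both the exit code and the output file means a silent failure in a scheduled job surfaces as an exception instead of a missing dataset downstream.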

---

## Methods

Standard, widely used methods throughout: MICE / KNN / median imputation; IQR, MAD, and Isolation Forest for outliers; normality-driven scaling (Standard / Robust / Box-Cox / Yeo-Johnson); and label / ordinal / one-hot / frequency / target encoding. These are the same algorithms a data scientist would typically write by hand, with no simplified stand-ins.
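
To make two of the outlier rules concrete, here is a small standard-library sketch of IQR fences and the MAD-based modified z-score. The function names and the thresholds (`k=1.5`, `threshold=3.5`) are conventional defaults chosen for illustration; they are assumptions, not values taken from PrePro Auto, and Isolation Forest is left out because it needs scikit-learn.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v < low or v > high for v in values]


def mad_outliers(values, threshold=3.5):
    """Flag points whose modified z-score exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)  # no spread to measure against
    # 0.6745 rescales MAD to be comparable to a standard deviation.
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


prices = [10, 12, 11, 13, 12, 11, 10, 95]
print(iqr_outliers(prices))
print(mad_outliers(prices))
```

Both rules flag only the last value here. MAD is the more robust of the two when a column already contains several extreme points, because the median is barely moved by them.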

---

## AI providers (optional)

AI features are **optional**. All non-AI features work offline without a key. PrePro Auto supports five providers:

| Provider | ID | Install | Get a key |
|---|---|---|---|
| Groq (free tier, fast) | `groq` | `pip install prepro-auto[groq]` | https://console.groq.com |
| OpenAI / GPT | `openai` | `pip install prepro-auto[openai]` | https://platform.openai.com |
| Anthropic Claude | `anthropic` | `pip install prepro-auto[anthropic]` | https://console.anthropic.com |
| Google Gemini | `gemini` | `pip install prepro-auto[gemini]` | https://aistudio.google.com/app/apikey |
| Mistral | `mistral` | `pip install prepro-auto[mistral]` | https://console.mistral.ai |

Or install all five at once: `pip install prepro-auto[ai]`.

### Three ways to give PrePro Auto your API key

**1. From the notebook (in-memory, session-only — safest):**

```python
import prepro_auto
prepro_auto.set_api_key("openai", "sk-...")  # any of the 5 provider IDs
session = prepro_auto.launch(df)
```

The key lives only in the running process and is lost on restart, so re-enter it each session. PrePro Auto makes a tiny test call before returning, so you know immediately whether the key works.
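
The API reference below lists the return value of `set_api_key` as `{ok, provider, model, verified, reason}`. A sketch of acting on that result might look like the following, assuming it is a dict with exactly those keys; `describe_key_check` is a hypothetical helper, and the sample values are made up for illustration.

```python
def describe_key_check(result):
    """Turn a set_api_key-style result into a one-line status message."""
    provider = result.get("provider", "unknown provider")
    if result.get("ok") and result.get("verified"):
        model = result.get("model") or "default model"
        return f"{provider}: key verified, using {model}"
    reason = result.get("reason") or "no reason given"
    return f"{provider}: key not usable ({reason})"


# Hypothetical result shaped like the documented return value.
sample = {"ok": False, "provider": "openai", "model": None,
          "verified": False, "reason": "invalid key"}
print(describe_key_check(sample))
```

In a notebook this would be called as `describe_key_check(prepro_auto.set_api_key("openai", key))` before launching a session.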

**2. From the web UI (in-memory by default, optional .env persistence):**

In the workbench, click **"AI settings (API key)"** in the side rail. Pick a provider, paste the key, click **Test & apply**. PrePro Auto verifies the key with a live test call before accepting it. Tick **"Also save to .env"** if you want it to survive restarts (local convenience only — leave unchecked on any shared/hosted machine).

**3. From a `.env` file (persists across restarts):**

Add to `.env` in the project root:
```bash
LLM_PROVIDER=openai
OPENAI_API_KEY=sk-...
```

Each provider has its own environment variable: `GROQ_API_KEY`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, `MISTRAL_API_KEY`.
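
A quick pre-flight check that `LLM_PROVIDER` and the matching key variable agree could be sketched as below. The mapping mirrors the names listed above; `resolve_api_key` is a hypothetical helper, and how PrePro Auto itself reads these variables is not shown here.

```python
import os

# Provider ID -> environment variable name, as listed above.
ENV_KEY_NAMES = {
    "groq": "GROQ_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "gemini": "GEMINI_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}


def resolve_api_key(env=None):
    """Return (provider, key) for the provider named in LLM_PROVIDER, or None."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "").strip().lower()
    name = ENV_KEY_NAMES.get(provider)
    key = env.get(name, "").strip() if name else ""
    return (provider, key) if key else None


print(resolve_api_key({"LLM_PROVIDER": "openai", "OPENAI_API_KEY": "sk-test"}))
```

A `None` result usually means the provider ID is misspelled or the key was saved under a different provider's variable.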

**Honest security note:** the `.env` file is plain text. Fine for a personal machine; never use the persist option on a hosted/shared deployment until proper per-user auth is in place.

---

## Notebook API reference

After `import prepro_auto`, these are the top-level functions:

| Function | What it does |
|---|---|
| `prepro_auto.launch(df, domain="general", port=None, open_browser=False)` | Registers an in-memory DataFrame as a job (no upload), starts the local workbench server, returns a `Session`. Prints a clickable URL. |
| `prepro_auto.set_api_key(provider, api_key, model=None)` | Sets the AI provider and key at runtime (in-memory). Returns `{ok, provider, model, verified, reason}` after a live test call. |

After `session = prepro_auto.launch(df)`:

| Method / property | What it does |
|---|---|
| `session.current()` | Returns the current (active-version) DataFrame as it stands in the UI right now. |
| `session.update(df)` | Pushes a notebook-edited DataFrame to the UI as a new undoable version. |
| `session.url` | The workbench URL for this session. |
| `session.job_id` | The internal job ID for this session. |
| `session.port` | The local port the workbench server is running on. |

Typical sync cycle:

```python
cur = session.current()                            # pull current state from UI
cur["price_per_sqft"] = cur["price"] / cur["sqft"] # your own code
session.update(cur)                                # push back, refresh UI to see it
```

---

## REST API reference

The web app and SDK both call the same endpoints, all under `/api/v1`. Once the server is running, the live interactive docs are at `http://localhost:8000/docs`.

| Endpoint | Purpose |
|---|---|
| `POST /datasets/upload` | Upload a dataset |
| `GET  /datasets/{job_id}/preview` | First rows + shape |
| `POST /datasets/{job_id}/profile` | Per-column profile + quality score |
| `GET  /datasets/{job_id}/view` | The current (active-version) data |
| `GET  /datasets/{job_id}/comparison` | Raw vs current summary |
| `POST /datasets/{job_id}/stages/{stage}` | Run a cleaning stage (`missing_values`, `outliers`, `scaling`, `correlation`, `encoding`) |
| `POST /datasets/{job_id}/stages/{stage}/execute` | Apply approved decisions, commit a snapshot |
| `GET  /datasets/{job_id}/decisions` | List decision cards (filter by `?stage=`) |
| `POST /decisions/{id}/approve` · `/override` · `/skip` · `/drop-column` | Resolve a card |
| `GET  /datasets/{job_id}/queue` | Decision summary across stages |
| `GET  /datasets/{job_id}/history` · `POST /undo` · `POST /redo` | Version history & navigation |
| `GET  /datasets/{job_id}/snapshots` | List committed versions |
| `GET  /datasets/{job_id}/transform/operations` | List available preset ops |
| `POST /datasets/{job_id}/transform/preset` | Apply a preset op (rename, drop, cast, fillna, filter, …) |
| `POST /datasets/{job_id}/transform/expression` | Run a sandboxed pandas expression |
| `POST /datasets/{job_id}/transform/batch` | Apply one op to many columns as one undoable step |
| `POST /datasets/{job_id}/transform/ai-propose` · `ai-advise` · `assistant` · `chat` | AI helpers (needs a key) |
| `POST /datasets/{job_id}/viz/chart` · `metric` · `compare` · `ask` | Charts and condition counts |
| `GET  /datasets/{job_id}/viz/dashboard` | Before/after KPI dashboard |
| `POST /drift/compare` | Drift detection between two uploaded datasets |
| `GET  /datasets/{job_id}/export/data` · `audit` · `pipeline` | Clean data, audit PDF, reproducible script |
| `GET  /system/limits` | Live RAM-aware upload limits |
| `GET  /system/llm` | List available providers + active one |
| `POST /system/llm/configure` | Set provider + key at runtime |

Open `http://localhost:8000/docs` after `prepro_auto` is running for the interactive Swagger UI with full request/response schemas.
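
As an illustration of calling the export endpoints from a script, the sketch below builds the URLs from the table and saves a response to disk with the standard library. It assumes the web UI's default port 8000 and that table paths join directly onto `/api/v1`; `export_url` and `download_export` are hypothetical helpers, and `JOB_ID` is a placeholder. For a notebook session, use `session.port` and `session.job_id` instead.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

API_ROOT = "http://localhost:8000/api/v1"  # assumed default; notebook sessions use session.port


def export_url(job_id, kind, fmt=None, root=API_ROOT):
    """Build the URL for one of the three export endpoints."""
    if kind not in {"data", "audit", "pipeline"}:
        raise ValueError(f"unknown export kind: {kind}")
    url = f"{root}/datasets/{job_id}/export/{kind}"
    # This README shows a format parameter only for the data export.
    return f"{url}?{urlencode({'format': fmt})}" if fmt else url


def download_export(job_id, kind, dest, fmt=None):
    """Save an export to disk; needs a running server and a real job ID."""
    path, _headers = urlretrieve(export_url(job_id, kind, fmt), dest)
    return path


print(export_url("JOB_ID", "data", fmt="parquet"))
```

Only `export_url` runs without a server; `download_export("JOB_ID", "audit", "audit.pdf")` would fetch the audit PDF once the workbench is up.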

---

## License

MIT
