Metadata-Version: 2.4
Name: athena-eda
Version: 0.2.1
Summary: Automated EDA, ML readiness scoring, and data quality checks — locally in your browser.
Author: Athena
License: MIT
Project-URL: Homepage, https://athena-alpha-hazel.vercel.app
Project-URL: Repository, https://github.com/SaurabhLingam/Athena
Keywords: eda,machine-learning,data-science,exploratory-data-analysis
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: fastapi>=0.111.0
Requires-Dist: uvicorn[standard]>=0.29.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: scikit-learn>=1.3.0
Requires-Dist: lightgbm>=4.3.0
Requires-Dist: nltk>=3.8.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: slowapi>=0.1.9
Requires-Dist: python-multipart>=0.0.9
Requires-Dist: scipy>=1.11.0
Requires-Dist: rich>=13.7.0
Requires-Dist: typer>=0.12.0
Requires-Dist: starlette>=0.27.0
Requires-Dist: statsmodels>=0.14.0
Requires-Dist: pmdarima>=2.0.0
Provides-Extra: all
Requires-Dist: statsmodels>=0.14.0; extra == "all"
Requires-Dist: pmdarima>=2.0.0; extra == "all"

# Athena — ML Diagnostics Platform

Upload a CSV. Get a full diagnostic report. Download a preprocessing script.

Athena is a machine learning readiness tool that analyzes tabular datasets and surfaces what matters before you start modeling — leakage risks, class imbalance, outliers, feature redundancy, NLP readiness, and more.

---

## How It Works

1. Upload a CSV — up to 200k rows, labeled or unlabeled
2. Select Supervised or Unsupervised mode. For supervised, specify the target column
3. Choose an analysis profile: Standard, Finance, Healthcare, or NLP
4. Athena runs a full diagnostic pass and returns a structured report
5. Explore results across five tabs: Overview, Features, Quality, ML Diagnostics, and NLP
6. Download a ready-to-run Python preprocessing script tailored to your dataset, or export the full report as HTML

---

## Features

- **ML Readiness Score** — composite score across data health, trainability, and leakage risk
- **Leakage Detection** — flags identifier columns and suspiciously correlated features
- **Baseline Probe** — 3-fold cross-validated LightGBM baseline with learning curves
- **Outlier and Skew Analysis** — per-feature skewness, kurtosis, log-transform recommendations
- **NLP Readiness** — detects free-text columns, vocabulary analysis, embedding recommendations
- **Drift Comparison** — upload train and test splits to get per-column distribution drift scores
- **Feature Redundancy** — correlation matrix analysis for highly redundant feature pairs
- **Analysis Profiles** — Standard, Finance, Healthcare, NLP
- **Preprocessing Script** — one-click export of a production-ready Python preprocessing script

---

## Stack

- **Frontend** — React, TypeScript, Vite, Recharts
- **Backend** — FastAPI, Python, LightGBM, scikit-learn, pandas
