Metadata-Version: 2.4
Name: eda-k
Version: 0.1.0
Summary: Automated, local exploratory data analysis: stats, charts, correlations, outliers, a chat assistant, and self-contained HTML reports.
Author: Kishan Prajapati
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: licence.txt
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.10
Requires-Dist: plotly>=5.20
Requires-Dist: openpyxl>=3.1
Requires-Dist: xlrd>=2.0
Requires-Dist: pyarrow>=14.0
Provides-Extra: app
Requires-Dist: streamlit>=1.36; extra == "app"
Provides-Extra: trend
Requires-Dist: statsmodels; extra == "trend"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Dynamic: license-file

# eda-k

Automated, local exploratory data analysis — as a **Python library** you can
`import`, with an optional Streamlit UI on top.

Runs 100% locally. Your data never leaves your machine, and no API key is needed.

---

## Install

```bash
pip install -e .
```

Want everything in one shot (library + Streamlit app + OLS trendlines)?

```bash
pip install -e ".[app,trend]"
```

Or pick extras individually.

Need only the bundled Streamlit app?

```bash
pip install -e ".[app]"
```

Need OLS trendlines on scatter plots (`charts.pairwise_scatter_with_trendline`)?

```bash
pip install -e ".[trend]"
```

---

## Use it as a library

```python
import eda_k

result = eda_k.analyze("data.csv")   # path, file-like, or DataFrame all work

print(result)                  # <EDAResult 'data.csv' rows=150 cols=5 ...>
print(result.summary())        # quick text overview
print(result.ask("which columns have missing values?"))

result.to_html("report.html")     # self-contained HTML report (charts inline)
result.to_csv_zip("tables.zip")   # every summary table as CSVs in one ZIP
```

`result.df` is the loaded `pandas.DataFrame`, and `result.results` is the raw
dict of every table (`overview`, `missing_summary`, `numeric_summary`,
`outliers`, `categorical_summary`, `correlation`, `top_correlations`,
`dtype_table`) if you want to work with the data directly.
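
For a quick inventory of what `result.results` holds before digging in, a small helper can list each table's type and size. This is a minimal sketch: `describe_tables` is a hypothetical helper, not part of eda-k, and it assumes only that each value either exposes `.shape` (as a DataFrame does) or has a length. The stand-in dict is made up for illustration.

```python
from typing import Any, Mapping


def describe_tables(results: Mapping[str, Any]) -> dict:
    """Return a one-line description (type and size) per table name."""
    described = {}
    for name, table in results.items():
        shape = getattr(table, "shape", None)  # DataFrames and arrays expose .shape
        if shape is not None:
            described[name] = f"{type(table).__name__} shape={tuple(shape)}"
        elif hasattr(table, "__len__"):
            described[name] = f"{type(table).__name__} len={len(table)}"
        else:
            described[name] = type(table).__name__
    return described


# Stand-in data for illustration only. With eda-k installed you would pass
# eda_k.analyze("data.csv").results instead.
fake_results = {
    "overview": {"rows": 150, "cols": 5},
    "dtype_table": [("some_column", "int64")],
}
for name, text in describe_tables(fake_results).items():
    print(f"{name}: {text}")
```

The helper never assumes a particular container type, so it should keep working whichever form each table takes.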

### Lower-level access

The original modules are available as submodules, unchanged, for full control:

```python
from eda_k import eda_engine, charts, chat_assistant, report_builder

# Open in binary mode and close the handle once loading is done.
with open("data.csv", "rb") as fh:
    df = eda_engine.load_file(fh, "data.csv")
results = eda_engine.run_full_eda(df)
fig = charts.histogram(df, "some_column")
```

---

## Use the Streamlit UI

```bash
pip install -e ".[app]"
streamlit run apps/streamlit_app.py
```

Opens a browser tab with upload, tabs (Overview / Missing / Numeric / Outliers
/ Categorical / Correlation / Chat / Download), and one-click export of the
HTML report or a ZIP of CSVs. The app is built on top of the installed `eda_k`
package rather than loose scripts.

---

## Project layout

```
eda-k/
├── pyproject.toml
├── README.md
├── requirements.txt          # convenience: pip install -r requirements.txt == pip install -e ".[app]"
├── src/
│   └── eda_k/
│       ├── __init__.py        # public API: analyze(), EDAResult
│       ├── eda_engine.py       # core analysis (pandas/numpy/scipy, no UI)
│       ├── charts.py           # Plotly chart builders
│       ├── chat_assistant.py   # local rule-based Q&A
│       └── report_builder.py   # self-contained HTML report builder
└── apps/
    └── streamlit_app.py        # optional UI, imports from the installed package
```

## Supported file types
CSV, TSV, TXT (auto-delimiter-detect), XLSX, XLS, JSON, Parquet.
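
To illustrate what delimiter auto-detection involves, here is a minimal sketch using the standard library's `csv.Sniffer`. It is not necessarily how eda-k detects delimiters internally; `sniff_delimiter` and its candidate list are assumptions made for this example.

```python
import csv
import io


def sniff_delimiter(sample_text, candidates=",\t;|"):
    """Guess the delimiter of a text sample; fall back to a comma."""
    try:
        return csv.Sniffer().sniff(sample_text, delimiters=candidates).delimiter
    except csv.Error:
        # Sniffer could not decide (e.g. empty or single-column text).
        return ","


sample = "id;name;score\n1;Ada;9.5\n2;Linus;8.0\n"
delimiter = sniff_delimiter(sample)
rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))
print(repr(delimiter), rows[0])
```

If detection picks the wrong separator for an unusual file, one workaround is to load the file yourself with pandas and pass the resulting DataFrame to `eda_k.analyze`.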

## Notes / known limits
- Very large files (millions of rows) will be slower to chart; consider
  sampling first if you hit performance issues.
- The "likely datetime column" detector is a heuristic that inspects only a
  small sample, so check its result against the Overview before relying on it.
- The normality test (Shapiro-Wilk) samples large columns down to 5,000 rows
  for speed.
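
For the sampling suggestion above, one option that avoids loading the whole file into memory is reservoir sampling over the CSV rows. This is a sketch under stated assumptions: `sample_csv_rows` is a hypothetical helper (not part of eda-k), the file is assumed to have a header row, and the file names in the usage comment are placeholders.

```python
import csv
import random


def sample_csv_rows(src_path, dst_path, k, seed=0):
    """Write the header plus a uniform random sample of up to k data rows.

    Uses reservoir sampling (Algorithm R), so only k rows are held in memory.
    Returns the number of data rows written.
    """
    rng = random.Random(seed)
    reservoir = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)  # assumes the first row is a header
        for i, row in enumerate(reader):
            if i < k:
                reservoir.append(row)
            else:
                j = rng.randint(0, i)  # inclusive on both ends
                if j < k:
                    reservoir[j] = row
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)


# Usage (placeholder file names):
# sample_csv_rows("big.csv", "big_sample.csv", k=100_000)
# result = eda_k.analyze("big_sample.csv")
```

The sampled rows come out in reservoir order rather than file order, so sort afterwards if row order matters for your analysis.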
