Metadata-Version: 2.4
Name: eda-k
Version: 0.1.1
Summary: Automated, local exploratory data analysis: stats, charts, correlations, outliers, a chat assistant, and self-contained HTML reports.
Author: Kishan Prajapati
License: MIT
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: licence.txt
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: scipy>=1.10
Requires-Dist: plotly>=5.20
Requires-Dist: openpyxl>=3.1
Requires-Dist: xlrd>=2.0
Requires-Dist: pyarrow>=14.0
Provides-Extra: app
Requires-Dist: streamlit>=1.36; extra == "app"
Provides-Extra: trend
Requires-Dist: statsmodels; extra == "trend"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Dynamic: license-file

# eda-k

Automated, local exploratory data analysis — as a **Python library** you can
`import`, with an optional Streamlit UI on top.

Runs 100% locally. Your data never leaves your machine, and no API key is needed.

---

## Install

```bash
pip install -e .
```

Want everything in one shot (library + Streamlit app + OLS trendlines)?

```bash
pip install -e ".[app,trend]"
```

Or pick extras individually. For the bundled Streamlit app:

```bash
pip install -e ".[app]"
```

Need OLS trendlines on scatter plots (`charts.pairwise_scatter_with_trendline`)?

```bash
pip install -e ".[trend]"
```

---

## Quick start (recommended)

The simplest way to use the library — one function call analyzes your data,
one method call exports a report:

```python
import eda_k

result = eda_k.analyze("amazon.csv")     # path, file-like object, or DataFrame all work

result.summary()                          # quick text overview
result.ask("which columns have missing values?")

result.to_html("amazon_report.html")      # self-contained HTML report
result.to_csv_zip("amazon_tables.zip")    # every summary table as CSVs in one ZIP
```

That's the entire workflow for most use cases. Everything below explains what
each piece does and how to drop down to the lower-level modules if you need
more control.

---

## The `analyze()` function and `EDAResult` object

### `eda_k.analyze(source, filename=None, outlier_method="Both", correlation_method="pearson", max_sample_size=5000)`

Loads your data and runs the complete EDA pipeline in one call.

- `source` — a file path (`str`/`Path`), an open file-like object, or an
  already-loaded `pandas.DataFrame`.
- `filename` — only needed if `source` is a file-like object without a
  `.name` attribute; used to detect file type and for report titles.
- `outlier_method` — `"IQR"`, `"Z-score"`, or `"Both"`.
- `correlation_method` — `"pearson"`, `"spearman"`, or `"kendall"`.
- `max_sample_size` — row cap used when sampling for the Shapiro-Wilk
  normality test (large columns get sampled down for speed).

Returns an `EDAResult`.
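
The choice of `correlation_method` matters when a relationship is monotonic but not linear. The sketch below is a standard-library illustration of that difference, not the library's own code (which presumably delegates to pandas); the helper names `pearson`, `average_ranks`, and `spearman` are made up for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]          # monotonic but curved
linear_r = pearson(x, y)         # below 1 because the curve is not a line
rank_r = spearman(x, y)          # 1 (up to rounding) because the order agrees
```

On data like this, `"spearman"` reports a perfect monotonic relationship while `"pearson"` reports a slightly weaker linear one.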

### `EDAResult` — what you get back

| Member | What it does |
|---|---|
| `result.df` | The loaded `pandas.DataFrame`. |
| `result.results` | Raw dict of every computed table — see below. |
| `result.summary()` | Plain-text dataset overview (rows, columns, missing %, dtypes, memory). |
| `result.ask(question)` | Ask a natural-language question; see chat assistant section below. |
| `result.build_figures()` | Builds the full dict of Plotly figures used in the HTML report. |
| `result.to_html(path=None, ...)` | Builds the self-contained HTML report. Returns the HTML string; writes to `path` if given. |
| `result.to_csv_zip(path=None)` | Builds a ZIP of every summary table as CSV. Returns the ZIP bytes; writes to `path` if given. |
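
The "return the bytes, and also write to `path` if given" pattern used by `to_csv_zip()` can be sketched with the standard library alone. This is an illustration under assumptions, not the package's implementation: `tables_to_zip_bytes` is a hypothetical helper, and tables are represented as plain lists of rows rather than DataFrames.

```python
import csv
import io
import zipfile


def tables_to_zip_bytes(tables, path=None):
    """Pack {name: rows} into a ZIP of CSV files and return the ZIP as bytes.

    `tables` maps a table name to a list of rows (header row first).
    If `path` is given, the same bytes are also written to disk.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, rows in tables.items():
            text = io.StringIO()
            csv.writer(text).writerows(rows)
            archive.writestr(f"{name}.csv", text.getvalue())
    data = buffer.getvalue()
    if path is not None:
        with open(path, "wb") as f:
            f.write(data)
    return data


# Made-up sample tables, only to exercise the helper.
zip_bytes = tables_to_zip_bytes({
    "missing_summary": [["column", "missing"], ["price", 3]],
    "overview": [["rows", "columns"], [100, 5]],
})
names = zipfile.ZipFile(io.BytesIO(zip_bytes)).namelist()
```

Returning bytes keeps the function usable both for saving to disk and for handing the archive to something like a download button.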

`result.results` contains these keys, each produced by `eda_engine`:

- `overview` — shape, dtypes, missing %, duplicates, memory usage, likely
  datetime/ID columns
- `dtype_table` — per-column dtype, missing count, uniqueness
- `missing_summary` — missing count/% per column
- `numeric_summary` — mean/median/std/skew/kurtosis/normality per numeric column
- `outliers` — IQR and Z-score outlier counts per numeric column
- `categorical_summary` — unique count, mode, top values per categorical column
- `correlation` — full correlation matrix
- `top_correlations` — strongest correlated pairs, ranked
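
To make the relationship between `correlation` and `top_correlations` concrete, here is a minimal sketch of ranking pairs by absolute correlation. It assumes the matrix is a nested dict rather than a DataFrame, and `top_correlated_pairs` is a hypothetical stand-in, not the library function.

```python
from itertools import combinations


def top_correlated_pairs(corr, top_n=10):
    """Rank unique column pairs by absolute correlation, strongest first."""
    pairs = [
        (a, b, corr[a][b])
        for a, b in combinations(corr, 2)  # each unordered pair once, no diagonal
    ]
    pairs.sort(key=lambda p: abs(p[2]), reverse=True)
    return pairs[:top_n]


# Made-up correlation matrix for illustration.
corr = {
    "price":    {"price": 1.0, "discount": -0.8, "rating": 0.1},
    "discount": {"price": -0.8, "discount": 1.0, "rating": 0.3},
    "rating":   {"price": 0.1, "discount": 0.3, "rating": 1.0},
}
ranked = top_correlated_pairs(corr, top_n=2)
```

Sorting on the absolute value means a strong negative correlation ranks above a weak positive one.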

---

## Lower-level modules

If you want to call the underlying functions directly instead of using
`analyze()`, every module is importable on its own. **Note the correct
signatures.** A common mistake is passing Series into a chart function; the
per-column chart functions take the **DataFrame plus a column name string**,
not Series.

```python
from eda_k import eda_engine, charts, chat_assistant, report_builder

# 1. Load the file (note: pass an open file object, not just the path string)
with open("amazon.csv", "rb") as f:  # context manager closes the handle afterwards
    df = eda_engine.load_file(f, "amazon.csv")

# 2. Run the full pipeline
results = eda_engine.run_full_eda(df)

# 3. Build individual charts — pass (df, "column_name"), not df["column_name"]
fig = charts.histogram(df, "discounted_price")       # ✅ correct
fig = charts.bar_categorical(df, "category")          # ✅ correct
# fig = charts.histogram(df["product_name"], df["category"])  # ❌ wrong — two Series, not df + col name

# 4. Build the figures dict report_builder.build_html_report() expects
ov = results["overview"]
figures = {
    "missing_bar": charts.missing_values_bar(results["missing_summary"]),
    "histograms": {c: charts.histogram(df, c) for c in ov["numeric_cols"]},
    "boxplots": {c: charts.boxplot(df, c) for c in ov["numeric_cols"]},
    "categorical_bars": {c: charts.bar_categorical(df, c) for c in ov["categorical_cols"]},
    "corr_heatmap": (
        charts.correlation_heatmap(results["correlation"])
        if not results["correlation"].empty else None
    ),
}

# 5. Build and save the HTML report
html = report_builder.build_html_report(df, results, figures, filename="amazon.csv")
with open("amazon_report.html", "w", encoding="utf-8") as f:
    f.write(html)
```

This is exactly what `result.to_html()` does internally — use the high-level
`analyze()` API unless you specifically need this manual control.

### `eda_engine` — core analysis (pandas/numpy/scipy, no UI)

| Function | What it does |
|---|---|
| `load_file(file, filename)` | Loads CSV, TSV, TXT, XLSX, XLS, JSON, or Parquet into a DataFrame based on the filename extension. |
| `get_overview(df)` | Row/column counts, missing %, duplicate rows, numeric/categorical/datetime column lists, likely-datetime and ID-like column detection, memory usage. |
| `get_dtype_table(df)` | Per-column dtype, missing count/%, unique count/%, potential-ID flag. |
| `get_missing_summary(df)` | Missing count and % per column, sorted worst-first. |
| `get_numeric_summary(df, numeric_cols, max_sample_size=5000)` | Mean, median, std, min/max, IQR, CV%, skew, kurtosis, and a Shapiro-Wilk normality flag per numeric column. |
| `detect_outliers(df, numeric_cols, method="Both")` | IQR-fence and/or Z-score (\|z\|>3) outlier counts per numeric column. |
| `get_categorical_summary(df, categorical_cols, top_n=10)` | Unique count, missing count, mode, mode %, and top-N value counts per categorical column. |
| `get_correlation(df, numeric_cols, method="pearson")` | Correlation matrix (`pearson`, `spearman`, or `kendall`). |
| `get_top_correlated_pairs(corr_df, top_n=10)` | Strongest correlated column pairs, ranked by absolute correlation. |
| `run_full_eda(df, outlier_method="Both", correlation_method="pearson", max_sample_size=5000)` | Runs everything above and returns it all as one results dict — this is what `analyze()` calls. |
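
The two outlier rules named for `detect_outliers` can disagree, which is one reason to run both. The following standard-library sketch shows the idea; it assumes the conventional 1.5 × IQR fences and a population standard deviation, which may not match the library's exact choices, and both helper names are hypothetical.

```python
import statistics


def iqr_outlier_count(values, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return sum(1 for v in values if v < low or v > high)


def zscore_outlier_count(values, threshold=3.0):
    """Count values whose |z| exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0  # constant column: nothing can be an outlier
    return sum(1 for v in values if abs((v - mean) / std) > threshold)


# Made-up column with one obviously extreme value.
values = [10, 11, 12, 12, 13, 13, 14, 15, 100]
by_iqr = iqr_outlier_count(values)
by_z = zscore_outlier_count(values)
```

Here the IQR rule flags `100`, but the Z-score rule does not: in a sample this small a single extreme value inflates the standard deviation enough to keep its own |z| under 3.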

### `charts` — Plotly chart builders

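The per-column functions take `(df, column_name)` (or a column list), not raw
Series; the summary charts (`missing_values_bar`, `missing_pattern_heatmap`,
`correlation_heatmap`) take a precomputed table instead. Each returns a Plotly
figure (or `None` if there isn't enough data to plot).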

| Function | Chart |
|---|---|
| `missing_values_bar(missing_df)` | Bar chart of missing values by column. |
| `missing_pattern_heatmap(missing_matrix)` | Heatmap of where missing values cluster across rows/columns. |
| `correlation_heatmap(corr_df)` | Annotated correlation heatmap. |
| `histogram(df, col, bins=40)` | Histogram with marginal boxplot, mean/median lines. (Numeric columns only.) |
| `boxplot(df, col)` | Boxplot with IQR fences annotated, outliers highlighted. (Numeric columns only.) |
| `qq_plot(df, col)` | Q-Q plot against a normal distribution (needs scipy). |
| `bar_categorical(df, col, top_n=15)` | Horizontal bar chart of the top-N most frequent values. (Categorical columns.) |
| `scatter_matrix(df, numeric_cols, max_cols=5)` | Pairwise scatter matrix across several numeric columns at once. |
| `pairwise_scatter(df, col_x, col_y)` | Scatter plot of two numeric columns with a correlation annotation. |
| `pairwise_scatter_with_trendline(df, col_x, col_y)` | Same as above, plus an OLS trendline (needs the `[trend]` extra / `statsmodels`; falls back to a plain scatter if not installed). |
| `time_series_plot(df, date_col, value_col)` | Line chart of a value over a datetime column. |
| `multi_histogram(df, numeric_cols, max_cols=4)` | Grid of histograms for several numeric columns at once. |

### `chat_assistant` — local rule-based Q&A

`answer_question(question, df, results)` answers natural-language questions
about the dataset using the `results` dict — no API key, no internet, pure
keyword matching against the EDA results. Recognized topics:

- **Summary/overview** — "give me a summary of this dataset", "tell me about this data"
- **Missing values** — "which columns have missing values?", "any nulls?"
- **Correlations** — "what are the top correlated pairs?", "any relationships?"
- **Outliers** — "which columns have the most outliers?"
- **Duplicates** — "are there duplicate rows?"
- **Numeric columns** — "describe the numeric columns"
- **Categorical columns** — "what categorical columns are in the data?"
- **Skewness** — "which column has the highest skewness?"
- **Normality** — "is this data normally distributed?"
- **A specific column by name** — e.g. "describe discounted_price" (fuzzy-matches column names, including ones in quotes)
- **Row/column counts** — "how many rows?", "how many columns?"
- **Help** — "help", "what can you do?"

`SUGGESTED_QUESTIONS` is a ready-made list of example prompts (used to
populate quick-reply buttons in the Streamlit UI, but usable anywhere).
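
As a rough picture of how keyword matching plus fuzzy column lookup can work, here is a standard-library sketch. The keyword table, the 0.8 similarity cutoff, and the function names are all assumptions for illustration; the real `answer_question` rules may differ.

```python
import difflib

# Hypothetical topic -> trigger keywords table (substring matches).
TOPIC_KEYWORDS = {
    "missing": ("missing", "null", "nan"),
    "correlation": ("correlat", "relationship"),
    "outliers": ("outlier",),
    "duplicates": ("duplicate",),
}


def detect_topic(question):
    """Return the first topic whose keyword appears in the question, else None."""
    text = question.lower()
    for topic, keywords in TOPIC_KEYWORDS.items():
        if any(k in text for k in keywords):
            return topic
    return None


def match_column(question, columns):
    """Return the column the question most likely names, or None."""
    # Treat quotes as spaces so quoted column names split into clean words.
    words = question.lower().replace('"', " ").replace("'", " ").split()
    lowered = {c.lower(): c for c in columns}
    for word in words:
        hit = difflib.get_close_matches(word, list(lowered), n=1, cutoff=0.8)
        if hit:
            return lowered[hit[0]]
    return None


topic = detect_topic("which columns have missing values?")
column = match_column('describe "discountd_price"', ["discounted_price", "category"])
```

A high cutoff keeps ordinary words such as "describe" from being mistaken for a column, while still tolerating a small typo in the column name.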

### `report_builder` — self-contained HTML report

`build_html_report(df, results, figures, filename="dataset", include_advanced_stats=True)`
assembles one standalone HTML file (Plotly JS embedded inline, so it works
fully offline — open it in any browser, or print to PDF). It includes:

- Header with dataset name and generation timestamp
- Stat cards (rows, columns, missing %, duplicates, numeric/categorical
  counts, memory usage)
- Column type & completeness table
- Missing values chart + table
- Numeric summary table, plus skew/kurtosis/normality table
- Outlier detection table (IQR + Z-score) with method explanations
- A histogram + boxplot pair for every numeric column
- Correlation heatmap + top correlated pairs table
- A bar chart + top-values table for every categorical column
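
The core idea of a single-file report (all markup and styles inline, user-supplied text escaped) can be sketched in a few lines. This is a heavily simplified illustration with a hypothetical `build_minimal_report` helper; it covers only the header and stat cards and embeds no Plotly figures.

```python
import html
from datetime import datetime


def build_minimal_report(title, stats):
    """Return one standalone HTML string with a header and stat cards."""
    cards = "\n".join(
        f'<div class="card"><b>{html.escape(str(value))}</b>'
        f"<span>{html.escape(label)}</span></div>"
        for label, value in stats.items()
    )
    generated = datetime.now().strftime("%Y-%m-%d %H:%M")
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><meta charset='utf-8'><title>{html.escape(title)}</title>\n"
        # Styles are inlined so the file has no external dependencies.
        "<style>.card{display:inline-block;margin:8px;padding:12px;"
        "border:1px solid #ccc}.card span{display:block}</style></head>\n"
        f"<body><h1>{html.escape(title)}</h1><p>Generated {generated}</p>\n"
        f"{cards}\n</body></html>"
    )


# Made-up values, only to show the output shape.
report = build_minimal_report("example.csv", {"Rows": 100, "Columns": 5})
```

Escaping the title and labels matters because dataset and column names come from user files and may contain characters such as `<` or `&`.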

---



## Streamlit app

The optional Streamlit app (installed with the `[app]` extra) opens a browser
tab with file upload, tabs (Overview / Missing / Numeric / Outliers /
Categorical / Correlation / Chat / Download), and one-click export of the HTML
report or a ZIP of CSVs. It is built on top of the installed `eda_k` package
rather than loose scripts.



## Supported file types
CSV, TSV, TXT (auto-delimiter-detect), XLSX, XLS, JSON, Parquet.
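
Because the file type is chosen from the filename extension, a loader of this kind usually starts with a small dispatch step like the one below. The mapping and the `detect_format` name are hypothetical; this sketch only classifies the name and does not read any data.

```python
from pathlib import Path

# Hypothetical mapping from extension to a format label.
FORMAT_BY_EXTENSION = {
    ".csv": "csv", ".tsv": "tsv", ".txt": "text",
    ".xlsx": "excel", ".xls": "excel",
    ".json": "json", ".parquet": "parquet",
}


def detect_format(filename):
    """Map a filename to a format label, ignoring the case of the extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return FORMAT_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}") from None


kind = detect_format("Sales.CSV")
```

This is also why a file-like object without a `.name` needs an explicit `filename`: without an extension there is nothing to dispatch on.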

## Notes / known limits
- Very large files (millions of rows) will be slower to chart; consider
  sampling first if you hit performance issues.
- The "likely datetime column" detector is a heuristic on a small sample —
  always double check it against the Overview before trusting it blindly.
- Normality test (Shapiro-Wilk) auto-samples to 5,000 rows for large columns
  for speed.
- Chart functions take a DataFrame plus a column name (`charts.histogram(df, "col")`),
  not a Series (`charts.histogram(df["col"])`). Passing only a Series will
  raise an error or silently misbehave, depending on the function.
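
For the sampling advice above, a reproducible row cap is usually enough. The sketch below uses only the standard library and a hypothetical `sample_rows` helper operating on a list of rows; with a DataFrame you would typically reach for its own sampling method instead.

```python
import random


def sample_rows(rows, max_rows=5000, seed=0):
    """Return at most `max_rows` rows, sampled without replacement."""
    if len(rows) <= max_rows:
        return list(rows)  # small enough: keep everything, in order
    # A seeded generator makes the sample repeatable between runs.
    return random.Random(seed).sample(rows, max_rows)


big = list(range(20000))
sampled = sample_rows(big)
```

Fixing the seed keeps charts and statistics stable when the same file is analyzed twice.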

