Metadata-Version: 2.4
Name: thaieda
Version: 1.5.0
Summary: AutoEDA สำหรับข้อมูลภาษาไทย — Exploratory data analysis that speaks Thai
Project-URL: Homepage, https://github.com/peetwan/thaieda
Project-URL: Repository, https://github.com/peetwan/thaieda
Project-URL: Issues, https://github.com/peetwan/thaieda/issues
Project-URL: Changelog, https://github.com/peetwan/thaieda/blob/main/CHANGELOG.md
Author: Peet Wannasarnmetha
License: Apache-2.0
License-File: LICENSE
Keywords: autoeda,data-quality,eda,exploratory-data-analysis,nlp,profiling,text-analysis,thai
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Natural Language :: English
Classifier: Natural Language :: Thai
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.10
Requires-Dist: attacut>=1.0
Requires-Dist: chardet>=5.0
Requires-Dist: ftfy>=6.0
Requires-Dist: jinja2>=3.1
Requires-Dist: matplotlib>=3.7
Requires-Dist: nlpo3>=1.2
Requires-Dist: numpy>=1.24
Requires-Dist: openpyxl>=3.1
Requires-Dist: pandas>=2.0
Requires-Dist: plotly>=5.18.0
Requires-Dist: pythainlp>=5.0
Requires-Dist: python-crfsuite>=0.9
Requires-Dist: rapidfuzz>=3.0
Requires-Dist: scikit-learn>=1.3
Requires-Dist: scipy>=1.11
Requires-Dist: statsmodels>=0.14
Requires-Dist: wordcloud>=1.9
Provides-Extra: all
Requires-Dist: attacut>=1.0; extra == 'all'
Requires-Dist: chardet>=5.0; extra == 'all'
Requires-Dist: ftfy>=6.0; extra == 'all'
Requires-Dist: litellm>=1.0; extra == 'all'
Requires-Dist: nlpo3>=1.2; extra == 'all'
Requires-Dist: openpyxl>=3.1; extra == 'all'
Requires-Dist: pythainlp>=5.0; extra == 'all'
Requires-Dist: python-crfsuite>=0.9; extra == 'all'
Requires-Dist: rapidfuzz>=3.0; extra == 'all'
Requires-Dist: scikit-learn>=1.3; extra == 'all'
Requires-Dist: scipy>=1.11; extra == 'all'
Requires-Dist: statsmodels>=0.14; extra == 'all'
Requires-Dist: wordcloud>=1.9; extra == 'all'
Provides-Extra: detect
Requires-Dist: chardet>=5.0; extra == 'detect'
Provides-Extra: dev
Requires-Dist: hypothesis>=6.0; extra == 'dev'
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pre-commit>=3.7; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Provides-Extra: dl
Requires-Dist: attacut>=1.0; extra == 'dl'
Provides-Extra: fast
Requires-Dist: nlpo3>=1.2; extra == 'fast'
Provides-Extra: fix
Requires-Dist: ftfy>=6.0; extra == 'fix'
Provides-Extra: fuzzy
Requires-Dist: rapidfuzz>=3.0; extra == 'fuzzy'
Provides-Extra: llm
Requires-Dist: litellm>=1.0; extra == 'llm'
Provides-Extra: ml
Requires-Dist: scikit-learn>=1.3; extra == 'ml'
Provides-Extra: ner
Requires-Dist: pythainlp>=5.0; extra == 'ner'
Requires-Dist: python-crfsuite>=0.9; extra == 'ner'
Provides-Extra: stats
Requires-Dist: scipy>=1.11; extra == 'stats'
Provides-Extra: thai
Requires-Dist: pythainlp>=5.0; extra == 'thai'
Provides-Extra: timeseries
Requires-Dist: statsmodels>=0.14; extra == 'timeseries'
Provides-Extra: viz
Requires-Dist: wordcloud>=1.9; extra == 'viz'
Description-Content-Type: text/markdown

# ThaiEDA

**Exploratory data analysis that actually understands Thai.**

[![PyPI](https://img.shields.io/pypi/v/thaieda.svg)](https://pypi.org/project/thaieda/)
[![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-blue.svg)](https://www.python.org/downloads/)
[![License: Apache-2.0](https://img.shields.io/badge/License-Apache--2.0-yellow.svg)](https://opensource.org/licenses/Apache-2.0)
[![Tests: 645 passed](https://img.shields.io/badge/tests-645%20passed-brightgreen.svg)]()
[![Code Style: ruff](https://img.shields.io/badge/code%20style-ruff-261230.svg)](https://docs.astral.sh/ruff/)
[![Language aware](https://img.shields.io/badge/language-Thai%20%2B%20English%20aware-blueviolet.svg)]()

---

## What is ThaiEDA?

ThaiEDA is a Python library that automates exploratory data analysis for Thai and mixed Thai/English datasets. Give it a DataFrame and it returns a full report: smart pre-analysis, language detection, column types, quality issues, anomalies, cross-column insights, charts, and an executive-style HTML report, all from a single call.

It handles the things generic EDA tools miss: Buddhist Era dates, Thai numerals, zero-width spaces, Thai vowel/tone marks, mixed Thai/English cells, mojibake encoding, Thai month names, national ID card validation, Thai address parsing, and PII like phone numbers.

---

## Quick Start

```bash
pip install thaieda
```

```python
import thaieda
import pandas as pd

df = pd.read_csv("data.csv")
result = thaieda.run(df)          # full EDA in one line
result.to_html("report.html")     # self-contained HTML report
```

`pip install thaieda` installs everything: Thai tokenizer, NER, ML, Excel, stats, encoding detection, and interactive charts. No extras needed.

---

## Why ThaiEDA?

**Generic tools don't understand Thai data.** Pandas Profiling, ydata-profiling, and Sweetviz are great — until you feed them Thai data. They miss Buddhist Era years (พ.ศ.), Thai numerals (๑๒๓), zero-width spaces that break tokenization, and mojibake from TIS-620 encoding. ThaiEDA catches all of these.

**Privacy-first LLM analysis.** Want to ask an LLM about your data but can't send raw rows to a cloud API? ThaiEDA has 4 privacy modes — the default sends zero raw data off your machine. Perfect for government, finance, and medical data under PDPA.

**Insights, not just summaries.** A cross-column insight engine finds non-obvious patterns — "column A strongly predicts column B", "this group is 3× higher than average" — ranked by statistical interestingness with Benjamini-Hochberg correction.
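
As a rough illustration of the Benjamini-Hochberg step-up procedure named above, the sketch below flags which p-values survive at a chosen false discovery rate. It is a textbook version written for this README, not ThaiEDA's internal code; the function name `benjamini_hochberg` and the `alpha` default are assumptions.

```python
def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Return one flag per p-value: True if it is kept at FDR level `alpha`."""
    m = len(p_values)
    # Indices of the p-values, smallest p first.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= k / m * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Everything ranked at or below the cutoff is kept, even if an earlier
    # p-value missed its own threshold (that is the "step-up" part).
    keep = [False] * m
    for idx in order[:cutoff]:
        keep[idx] = True
    return keep


print(benjamini_hochberg([0.5, 0.03, 0.02, 0.024]))  # [False, True, True, True]
```

Note that `0.02` is kept although it misses its own rank-1 threshold (`0.0125`), because a larger p-value further down the ranking passes its threshold.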

**Thai-specific validation.** National ID card checksum validation, Thai address parsing (province/district/subdistrict), Thai holiday awareness for timeseries spike attribution. No other EDA tool does this.

**One line to get everything.** `thaieda.run(df)` chains the full pipeline: type detection → smart cleaning → quality checks → anomaly detection → insight discovery → visualization → HTML report. No config needed.

---

## How It Works

```
DataFrame
    │
    ▼
┌──────────────────────────────────────────────┐
│  thaieda.run(df)                             │
│                                              │
│  0. pre-analyze → data type + language       │
│  1. detect      → column types + Thai months │
│  2. clean       → smart cleaning (auto-decide)│
│  3. quality     → language-aware checks      │
│  4. anomaly     → statistical + ML + text    │
│  5. insights    → 6 cross-column patterns    │
│  6. viz         → interactive + static charts│
│  7. report      → executive HTML narrative   │
│                                              │
│  + optional: LLM analysis (4 privacy modes)  │
│  + optional: compare(df1, df2) side-by-side  │
└──────────────────────────────────────────────┘
    │
    ▼
EDAResult
  .to_html()        → report.html
  .to_dict()        → Python dict
  .to_json()        → JSON string
  .insights         → insight cards
  .cleaned_df       → cleaned DataFrame
  .quality_issues   → list of issues
  .quality_score    → 0-100 score with grade
  .anomalies        → anomaly findings
  .llm_response     → LLM analysis (if enabled)
  ._repr_html_()    → Jupyter rich display

run_folder("data/")  → FolderResult
  .to_html("dir/")      → individual HTML per file
  .to_master_html()     → single master HTML with sidebar
  .summary()            → text summary
  ._repr_html_()        → Jupyter rich display
```

---

## What's New

### Scale & Performance

Tested across 14 public datasets — from 500 rows to 541K rows, 8 to 171 columns. Every report stays under 2 MB and finishes under 120 seconds.

- **Insight capping** — reports surface the 30 most important findings instead of hundreds. Critical insights are always kept; warnings and info fill the rest. The executive summary shows the true count ("679 found, showing top 30").
- **HTML bloat control** — dual chart budget (40 charts max, 1.6 MB max). Quality and anomaly tables collapse after 50 rows. Wide tables switch to a summary view past 60 columns.
- **Wide-table fast path** — the insight engine samples breakdowns and measures when columns exceed 100. Correlation heatmaps and scatter matrices skip automatically on very wide data.
- **Tall-table fast path** — anomaly, quality, and outlier checks sample 50K rows when data exceeds 100K. Correlation computes on a sample. Timeseries decomposition skips past 200K rows.
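
The insight-capping rule above (always keep critical findings, then fill the remaining slots with warnings and info) can be sketched as follows. The dict shape with `severity` and `score` keys and the function name `cap_insights` are hypothetical; ThaiEDA's real insight objects may look different.

```python
def cap_insights(insights: list[dict], limit: int = 30) -> tuple[list[dict], int]:
    """Return (insights to show, true total count)."""
    # Lower rank fills first; unknown severities go last.
    fill_rank = {"warning": 0, "info": 1}
    critical = [i for i in insights if i["severity"] == "critical"]
    others = sorted(
        (i for i in insights if i["severity"] != "critical"),
        key=lambda i: (fill_rank.get(i["severity"], 2), -i["score"]),
    )
    # Critical insights are never dropped, so the remaining room can be zero.
    room = max(limit - len(critical), 0)
    return critical + others[:room], len(insights)


insights = (
    [{"severity": "critical", "score": 1.0}] * 4
    + [{"severity": "warning", "score": 0.5}] * 75
    + [{"severity": "info", "score": 0.1}] * 600
)
shown, total = cap_insights(insights)
print(f"{total} found, showing top {len(shown)}")  # 679 found, showing top 30
```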

### Data Quality & Cleaning

- **High-NA handling** — columns over 80% missing are flagged as `mostly_missing` with NaN preserved. Columns over 40% get a warning to drop or impute with domain knowledge. Below 40% is unchanged.
- **Smarter type detection** — Thai low-cardinality text is classified as categorical, not free text. Text-named columns like `review` and `feedback` stay text even with few unique values.
- **Cleaning safeguards** — numeric strings like `1.00005` are left alone. Keyboard layout conversion only runs when Thai characters are present. Repeated-character spam on short codes is suppressed.
- **ID/FK awareness** — ID columns are excluded from categorical anomaly checks. `*_id` columns are detected even with low unique ratio. Buddhist Era checks skip IDs. Timeseries excludes ID/FK/code columns from measures.
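
The high-NA thresholds above amount to a small decision rule. The sketch below is a minimal version under stated assumptions: only the `mostly_missing` label comes from this README, while `missing_policy` and the other two labels are made up for illustration.

```python
def missing_policy(values: list) -> str:
    """Classify a column by its share of missing cells (None or NaN)."""
    if not values:
        return "unchanged"
    # `v != v` is True only for NaN, so this counts both None and float NaN.
    missing = sum(v is None or v != v for v in values)
    fraction = missing / len(values)
    if fraction > 0.80:
        return "mostly_missing"        # flagged, NaN preserved
    if fraction > 0.40:
        return "warn_drop_or_impute"   # hypothetical label for the warning tier
    return "unchanged"


print(missing_policy([None] * 9 + [1]))      # mostly_missing
print(missing_policy([None] * 5 + [1] * 5))  # warn_drop_or_impute
print(missing_policy([None] + [1] * 9))      # unchanged
```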

### Reporting

- **Executive briefing format** — reports flow from executive summary to key findings, business translation, priority actions, and plain-language explanations.
- **Template pagination** — Key Insights shows the top 20 with a collapsible section for the rest. Count badges are preserved.
- **Fewer false positives** — fuzzy duplicate guard skips short near-identical labels. Script mixing is skipped on low-cardinality columns. Outliers on heavy-tail distributions (skew > 2.0) are downgraded to info.
- **Folder reports** — `run_folder()` analyzes CSV, Excel, JSON, JSONL, and TSV folders. `FolderResult.to_master_html()` builds one master HTML with sidebar navigation.

### Smart Pre-Analysis

- **Language detection** — Thai, English, mixed, and numeric data detected with confidence and per-column detail.
- **Data type classification** — transaction, registry, survey, timeseries, and mixed datasets classified before EDA.
- **Language-aware quality** — English-only data skips Thai-specific warnings automatically.

---

## Benchmarks

ThaiEDA is tested on 14 public datasets ranging from 500 rows to 541K rows, 8 to 171 columns. Every dataset produces a report under 2 MB in under 120 seconds.

| Dataset | Rows | Cols | Time | HTML | Insights |
|---------|-----:|-----:|-----:|-----:|---------:|
| titanic | 891 | 12 | 8 s | 0.79 MB | 27 |
| telco-churn | 7,043 | 21 | 11 s | 0.84 MB | 11 |
| wine-quality | 1,599 | 12 | 7 s | 0.93 MB | 29 |
| california-housing | 20,640 | 10 | 15 s | 0.99 MB | 30 |
| superstore | 10,800 | 21 | 31 s | 1.46 MB | 30 |
| adult | 32,561 | 15 | 22 s | 1.03 MB | 29 |
| bank-marketing | 41,188 | 21 | 21 s | 0.94 MB | 30 |
| online-retail | 541,909 | 8 | **81 s** | 0.96 MB | 30 |
| dirty-thai-retail | 500 | 8 | 2 s | 0.51 MB | 15 |
| absenteeism | 740 | 21 | 10 s | 1.25 MB | 30 |
| online-shoppers | 12,330 | 18 | 18 s | 1.06 MB | 30 |
| aps-failure | 16,000 | 171 | **100 s** | 0.48 MB | 30 |
| beijing-pm25 | 43,824 | 13 | 12 s | 0.76 MB | 19 |
| bike-sharing | 17,379 | 17 | 42 s | 1.55 MB | 30 |

All 14 datasets pass QA with 0 defects. The datasets come from the UCI ML Repository and other public sources.

---

## Examples

### One-Line EDA

```python
import thaieda
import pandas as pd

df = pd.read_csv("data.csv")

# Full pipeline in one call
result = thaieda.run(df)

# Access results
result.to_html("report.html")
print(result.quality_issues)
print(result.insights)

# In Jupyter: just display the result
result  # renders HTML report inline
```

### Folder Mode — Analyze Every File at Once

```python
import thaieda

# One line — analyzes every CSV/Excel/JSON in the folder
results = thaieda.run_folder("data/")

# Print summary
print(results.summary())
# ThaiEDA FolderResult — data/
#   Files: 5 (✅ 5 / ❌ 0)
#   ✅ customers.csv — 10,000 rows × 8 cols, 15 insights
#   ✅ orders.csv    — 50,000 rows × 12 cols, 28 insights
#   ...

# Save individual HTML reports
results.to_html("reports/")

# Generate a single master HTML with sidebar navigation
results.to_master_html("master-report.html")
```

**`run_folder()` features:**
- Auto-scans for CSV, Excel (.xlsx/.xls), JSON, JSONL, TSV
- `recursive=True` to include subfolders
- `output_dir=` to specify where HTML goes
- Error isolation — one broken file doesn't crash the rest
- `progress=` callback for progress tracking
- All `run()` kwargs supported (`lang`, `clean`, `llm`, etc.)
- **`to_master_html()`** — combines all reports into one page with sidebar nav + summary table
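
Error isolation of the kind listed above usually comes down to a per-file `try`/`except`. The following sketch shows the pattern with a caller-supplied `analyze` callable; `analyze_folder`, `SUFFIXES`, and the return shape are assumptions for illustration and do not describe how `run_folder()` is implemented.

```python
from pathlib import Path
from typing import Callable

# File types named in the feature list above.
SUFFIXES = {".csv", ".tsv", ".json", ".jsonl", ".xlsx", ".xls"}


def analyze_folder(
    folder: str,
    analyze: Callable[[Path], object],
    recursive: bool = False,
) -> tuple[dict[str, object], dict[str, str]]:
    """Run `analyze` on every supported file; collect failures instead of raising."""
    root = Path(folder)
    candidates = root.rglob("*") if recursive else root.iterdir()
    results: dict[str, object] = {}
    errors: dict[str, str] = {}
    for path in sorted(candidates):
        if not path.is_file() or path.suffix.lower() not in SUFFIXES:
            continue
        key = str(path.relative_to(root))
        try:
            results[key] = analyze(path)
        except Exception as exc:  # one broken file must not stop the rest
            errors[key] = repr(exc)
    return results, errors
```

Returning the errors alongside the results lets a caller print a pass/fail summary per file, similar to the `summary()` output shown earlier.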

### With LLM Analysis (Privacy-Safe)

```python
import pandas as pd
import thaieda

df = pd.read_csv("data.csv")

# Default: zero raw data leaves your machine
result = thaieda.run(df, llm=True, privacy="insight_only", provider="ollama")
print(result.llm_response)
```

### Compare Two Datasets

```python
from thaieda.compare import compare_datasets

diff = compare_datasets(df_train, df_test, labels=("train", "test"))
print(diff["schema_diff"])      # columns added/removed
print(diff["drift"]["numeric"]) # KS statistic per column
```
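
For readers unfamiliar with the KS statistic reported per numeric column, here is a small pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic: the largest gap between the two empirical distribution functions. It is illustrative only; `ks_statistic` is a hypothetical helper, and in practice `scipy.stats.ks_2samp` computes the same statistic together with a p-value.

```python
from bisect import bisect_right


def ks_statistic(sample_a: list[float], sample_b: list[float]) -> float:
    """Largest absolute gap between the two empirical CDFs (0 = same, 1 = disjoint)."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_values: list[float], x: float) -> float:
        # Fraction of observations less than or equal to x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    # The gap can only change at observed values, so checking those is enough.
    return max(
        abs(ecdf(a_sorted, x) - ecdf(b_sorted, x)) for x in a_sorted + b_sorted
    )


print(ks_statistic([1, 2, 3], [1, 2, 3]))     # 0.0
print(ks_statistic([1, 2, 3], [10, 11, 12]))  # 1.0
```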

### Thai ID Card Validation

```python
from thaieda.quality import validate_thai_id, validate_thai_id_column

# Single ID
validate_thai_id("1-1234-56789-01-2")  # → True/False

# Entire column
result = validate_thai_id_column(df["id_card"])
print(f"Valid: {result['valid_count']}, Invalid: {result['invalid_count']}")
```
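
To show what checksum validation involves, the sketch below applies the widely documented mod-11 rule for 13-digit Thai national ID numbers: weight the first 12 digits by 13 down to 2, then compare the result with the final digit. It is an independent illustration rather than the body of `validate_thai_id`; the helper name `thai_id_checksum_ok` and the synthetic ID strings are assumptions.

```python
import re


def thai_id_checksum_ok(raw: str) -> bool:
    """Check the mod-11 check digit of a 13-digit Thai national ID."""
    digits = re.sub(r"[\s-]", "", raw)  # tolerate "1-2345-67890-12-1" style input
    if not re.fullmatch(r"[0-9]{13}", digits):
        return False
    # Weights 13, 12, ..., 2 applied to the first twelve digits.
    total = sum(int(d) * w for d, w in zip(digits[:12], range(13, 1, -1)))
    expected = (11 - total % 11) % 10
    return expected == int(digits[12])


# Synthetic numbers, chosen only to exercise the arithmetic.
print(thai_id_checksum_ok("1-2345-67890-12-1"))  # True
print(thai_id_checksum_ok("1-2345-67890-12-2"))  # False
```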

### Thai Address Parsing

```python
from thaieda.detect import parse_thai_address

addr = parse_thai_address("123 หมู่ 4 ต.บางบัว อ.บางบัว จ.กรุงเทพฯ 10230")
print(addr)
# {'house_number': '123', 'moo': '4', 'subdistrict': 'บางบัว',
#  'district': 'บางบัว', 'province': 'กรุงเทพฯ', 'postal_code': '10230'}
```
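
A simplified way to pull such fields out of an address string is one regular expression per field, keyed on the usual Thai prefixes (`หมู่`/`ม.`, `ต.`, `อ.`, `จ.`). The sketch below is a toy version under that assumption; `parse_address_sketch` and `FIELD_PATTERNS` are hypothetical names, and a real parser has to handle many more variants (for example Bangkok's แขวง/เขต).

```python
import re

# One pattern per output field; each captures the value after its prefix.
FIELD_PATTERNS = {
    "house_number": r"^\s*(\d+(?:/\d+)?)",
    "moo": r"(?:หมู่|ม\.)\s*(\d+)",
    "subdistrict": r"(?:ตำบล|ต\.)\s*(\S+)",
    "district": r"(?:อำเภอ|อ\.)\s*(\S+)",
    "province": r"(?:จังหวัด|จ\.)\s*(\S+)",
    "postal_code": r"(?<!\d)(\d{5})(?!\d)",
}


def parse_address_sketch(text: str) -> dict[str, str]:
    """Return only the fields whose pattern matched."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            fields[name] = match.group(1)
    return fields


print(parse_address_sketch("123 หมู่ 4 ต.บางบัว อ.บางบัว จ.กรุงเทพฯ 10230"))
```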

### Language Detection

```python
import pandas as pd
from thaieda.detect import _detect_language

df = pd.DataFrame({
    "product": ["กาแฟ", "ชาไทย", "ขนม"],
    "review": ["อร่อยมาก 5/5 stars", "ดีครับ", "ไม่ดี"],
    "sku": ["SKU001", "SKU002", "SKU003"],
})

info = _detect_language(df)
print(info["language"], info["confidence"])
print(info["columns"])
# thai/mixed/english/numeric + per-column language map
```

**Language Detection v2 features:**
- Unicode Thai block analysis (U+0E00–U+0E7F) including vowels/tone marks (U+0E30–U+0E4D)
- Zero-width-space aware (`\u200b`, BOM, word joiner)
- Mixed-cell detection, e.g. `"อร่อยมาก 5/5 stars"`
- Common Thai word hints: `ครับ`, `ค่ะ`, `ไทย`, `อร่อย`, `ดี`, `ไม่`, `มี`, `และ`
- Lazy `pythainlp` tokenizer when installed; regex fallback when unavailable
- Per-column `column_details` + dataset-level `confidence` (0.0–1.0)
- Sample-based scan (first 500 rows/column) for large DataFrames
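
The first two items (Thai Unicode block analysis and zero-width-space awareness) can be approximated in a few lines. This is a minimal sketch, assuming a simple Thai-to-Latin letter ratio with made-up thresholds; `classify_script` is a hypothetical name and the real detector combines more signals than this.

```python
# Zero-width space, word joiner, and BOM are dropped before counting.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u2060\ufeff"))


def classify_script(text: str, low: float = 0.1, high: float = 0.9) -> str:
    """Label a cell as 'thai', 'english', 'mixed', or 'numeric'."""
    cleaned = text.translate(ZERO_WIDTH)
    # Thai block U+0E00 to U+0E7F, which includes vowels and tone marks.
    thai = sum("\u0e00" <= ch <= "\u0e7f" for ch in cleaned)
    latin = sum(ch.isascii() and ch.isalpha() for ch in cleaned)
    if thai + latin == 0:
        return "numeric"  # no letters at all: digits, punctuation, or empty
    ratio = thai / (thai + latin)
    if ratio >= high:
        return "thai"
    if ratio <= low:
        return "english"
    return "mixed"


print(classify_script("อร่อยมาก 5/5 stars"))  # mixed
print(classify_script("SKU001"))              # english
print(classify_script("ดี\u200bครับ"))         # thai
```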

### Smart Pre-Analysis

ThaiEDA profiles the dataset *before* running the full report, so the narrative and quality checks match the data:

```python
from thaieda.report import _detect_data_type

pre = _detect_data_type(df)
print(pre["label"], pre["language"]["language"])
print(pre["focus"])
```

Smart pre-analysis detects:
- **Transaction data** — orders, payments, revenue, invoices
- **Registry/master data** — customers, products, stores, entity attributes
- **Survey/review data** — ratings, comments, feedback text
- **Timeseries data** — datetime index/columns + numeric measures
- **Mixed data** — conservative fallback when signals overlap
- **Language impact** — Thai/mixed data enables Thai-specific checks; English-only data skips Buddhist Era (พ.ศ.) and Thai numeral checks automatically

### Data Quality Score

```python
from thaieda.quality import compute_quality_score

score = compute_quality_score(quality_issues, n_columns=10, n_rows=1000)
print(f"Score: {score.score}/100 ({score.grade})")
# Score: 85/100 (B)
```

### Smart Cleaning

```python
from thaieda.clean._smart import plan_cleaning

plan = plan_cleaning(df)
print(f"Actions: {plan.actions}")    # ['zwspace', 'numerals', 'duplicates']
print(f"Skipped: {plan.skipped}")    # ['encoding', 'whitespace']
```

---

## Privacy Modes

Control exactly what data leaves your machine when using LLM analysis:

| Mode | What Leaves | Guarantee | When to Use |
|------|------------|-----------|-------------|
| `insight_only` (default) | Stats + insights only | Raw data never leaves | Government, medical, PDPA data |
| `anonymized` | Data with PII → tokens | Names/phones/ID cards masked | Need structure without raw PII |
| `dp_noise` | Stats + Laplace noise | Prevents re-identification | Small datasets where stats leak |
| `full` | Everything | None — you accept the risk | Public data, demos |

---

## What ThaiEDA Catches

| Problem | Example | What Happens |
|---------|---------|-------------|
| Buddhist Era dates | `15/03/2567` | Auto-detects พ.ศ. → converts to CE |
| Thai numerals | `๑๒๓` in numeric column | Converts to `123` |
| Zero-width spaces | `สม\u200bชาย` | Strips invisible chars and reports language evidence |
| Thai vowel/tone marks | `อร่อยค่ะ` | Counts U+0E30–U+0E4D for better Thai detection |
| Mixed Thai/English cells | `อร่อยมาก 5/5 stars` | Detects as mixed language instead of English/numeric |
| Thai text in English-heavy tables | Thai product column + English IDs | Column-level language detection preserves Thai checks |
| Common Thai words | `ครับ`, `ค่ะ`, `ไม่ดี` | Boosts confidence for short Thai text |
| Mojibake encoding | `Ã ¬Â¸Â¡Â¹` | Auto-detects TIS-620 → UTF-8 |
| Thai month names | `มกราคม` | Parses to ISO date |
| Phone numbers | `081-234-5678` | Detects + normalizes |
| National ID cards | `1-1234-56789-01-2` | Checksum validation |
| Thai addresses | `123 ม.4 ต.บางบัว อ.บางบัว จ.กรุงเทพฯ` | Parses to structured fields |
| Placeholder values | `-`, `N/A`, `ไม่มี` | Flags as missing |
| Constant columns | All same value | Flags as useless |
| Smart data type | Orders/reviews/timeseries | Pre-classifies transaction/registry/survey/timeseries/mixed |
| Language-aware checks | English-only DataFrame | Skips Thai-specific Buddhist Era (พ.ศ.) and Thai numeral warnings automatically |
| Thai holidays | Spike on Dec 5 | Attributes to Father's Day |
| ID/FK semantics | `order_id`, `store_id` | Detected as ID even with low unique ratio; excluded from category anomaly |
| BE on ID | `order_id=2531` | No longer flagged as พ.ศ. (BE check requires date-like column) |
| Numeric string preservation | `1.00005` in numeric data | `fix_repeated_chars` skips decimal/numeric strings |
| Keyboard layout guard | `Floyd` in English column | Not converted to Thai (requires Thai chars in column first) |
| Payment method detection | `payment_method` column | Classified as categorical, not amount column |
| Index artifact cleanup | `Unnamed: 0` column | Ignored in analysis, flagged as index artifact |
| Per-column missing values | `Age` 20% missing | Quality issue with severity threshold (warning >5%, info 1-5%) |
| CSV delimiter warning | `;`-delimited file read as 1 column | Warns to re-read with `sep=';'` |
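
The first three rows of the table rest on simple transformations: Thai digits map one-to-one onto ASCII digits, zero-width characters can be deleted, and a Buddhist Era year is the CE year plus 543. The sketch below demonstrates those three steps with the standard library only; `normalize_cell` and `parse_be_date` are hypothetical helpers, not ThaiEDA functions, and the date parser assumes a plain `DD/MM/YYYY` string.

```python
from datetime import date

# Thai digits are U+0E50 (zero) through U+0E59 (nine).
THAI_DIGITS = str.maketrans("".join(chr(0x0E50 + i) for i in range(10)), "0123456789")
# Zero-width space, word joiner, and BOM.
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u2060\ufeff"))


def normalize_cell(text: str) -> str:
    """Drop zero-width characters and convert Thai digits to ASCII digits."""
    return text.translate(ZERO_WIDTH).translate(THAI_DIGITS)


def parse_be_date(text: str) -> date:
    """Parse DD/MM/YYYY where YYYY is a Buddhist Era year (BE = CE + 543)."""
    day, month, year = (int(part) for part in normalize_cell(text).split("/"))
    return date(year - 543, month, day)


print(normalize_cell("๑๒๓"))        # 123
print(parse_be_date("15/03/2567"))  # 2024-03-15
```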

---

## Visualization

ThaiEDA generates both static (matplotlib) and interactive (Plotly) charts:

- **Static**: correlation heatmap, distribution, box/violin, missing matrix, scatter matrix, wordcloud, timeseries, pair plot, KDE, QQ plot, sunburst
- **Interactive**: hover tooltips, zoom, pan — using Plotly with Thai font (Sarabun) via Google Fonts
- **Color palette**: Okabe-Ito colorblind-safe (7 colors)
- **Thai font**: auto-detected for matplotlib, CSS-loaded for Plotly

```python
from thaieda.viz._interactive import create_correlation_heatmap_interactive

html_div = create_correlation_heatmap_interactive(df)  # → HTML <div> for reports
```

---

## Installation

```bash
# Install everything with a single command
pip install thaieda
```

No extras needed: `pip install thaieda` installs everything, including the Thai tokenizer, NER, ML, interactive charts, Excel, stats, and encoding detection.

LLM providers remain optional (lazy-imported, so there is nothing to install unless you use them):

```bash
pip install openai       # OpenAI GPT
pip install anthropic    # Anthropic Claude
pip install ollama       # Ollama local LLM (or use the HTTP fallback, no install needed)
```

**Requirements:** Python 3.10+

---

## Modules

| Module | What It Does |
|--------|-------------|
| `run()` / `EDA()` | One-liner API — full pipeline in one call |
| `run_folder()` | Analyze every CSV/Excel/JSON in a folder + master HTML |
| `compare()` | Side-by-side dataset comparison with drift detection |
| `io/` | Auto-read CSV/JSON/JSONL/Excel + encoding detection |
| `detect/` | Column type detection + Thai month names + address parsing + language detection v2 |
| `clean/` | Smart cleaning: auto-decide what to fix (encoding, numerals, BE, zwspace) |
| `quality/` | Language-aware quality checks + score 0-100 + Thai ID card validation |
| `anomaly/` | Statistical + ML + text anomaly detection |
| `ner/` | Thai NER: person/place/organization |
| `insight_engine/` | 6 cross-column insight patterns (BH-corrected) |
| `viz/` | Static + interactive charts with colorblind-safe palette |
| `report/` | Executive HTML report + smart pre-analysis (`_detect_data_type`) |
| `llm/` | Privacy-preserving LLM analysis (4 modes, 3 providers) |
| `timeseries/` | Trend/seasonality/STL/ACF + Thai holiday awareness |
| `schema/` | Multi-file PK/FK discovery + relationship matching |

---

## Testing

```bash
pytest tests/ -v                    # all tests
pytest tests/test_language_detection.py  # language detection + language-aware quality
pytest tests/test_thai_id.py        # ID card validation
pytest tests/test_thai_address.py   # address parsing
pytest tests/test_compare.py        # dataset comparison
pytest tests/test_run_folder.py     # folder mode + master HTML
pytest tests/test_llm.py            # LLM + privacy modes
ruff check src/ tests/              # lint
ruff format src/ tests/             # format
```

---

## License

[Apache-2.0](LICENSE) © Peet Wannasarnmetha