Metadata-Version: 2.4
Name: ragtab
Version: 0.1.3
Summary: Trích xuất bảng từ ảnh/PDF thành Markdown cho RAG pipeline
Author-email: Dinh Duc Tai <dinhductai2004@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/tai03102004/rag-table
Project-URL: Repository, https://github.com/tai03102004/rag-table
Keywords: table-extraction,ocr,unet,rag
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0
Requires-Dist: torchvision
Requires-Dist: numpy
Requires-Dist: opencv-python
Requires-Dist: Pillow
Requires-Dist: paddleocr
Requires-Dist: scipy
Requires-Dist: matplotlib
Requires-Dist: timm
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"

# RagTable

**Extract tables from images/PDFs to Markdown — fast, accurate, RAG-ready**

RagTable converts table images into structured Markdown using a segmentation model and per-cell OCR. Built for borderless and complex tables (academic papers, reports, scanned documents), with output that plugs directly into RAG pipelines or LLM contexts.

---

## Features

- **Table structure segmentation** via EfficientUNet (~19M params) — detects `row`, `col`, `col_header`, `row_header`, and `span` masks
- **Span detection** — merged cells are identified and rendered correctly in Markdown
- **Per-cell OCR** via PaddleOCR — each cell is cropped and read individually to minimize noise
- **Smart header handling** — header rows get a dedicated OCR pass with spatial mapping for better accuracy
- **Automatic post-processing** — removes phantom edge columns, empty rows, and footer artifacts
- **Fully offline** — no external API calls, suitable for sensitive or air-gapped environments
- **Multi-format input**: PNG, JPG, TIFF

---

## Installation

```bash
pip install ragtab
```

**Requirements:**
- Python ≥ 3.10
- PyTorch (install separately if not already present)
- PaddleOCR (installed automatically)

**PaddleOCR requires `paddlepaddle` — install separately based on your platform:**

```bash
pip install paddlepaddle
```

### Install from source

```bash
git clone https://github.com/tai03102004/rag-table
cd ragtab
pip install -e .
```

---

## Quickstart

```python
from ragtab.pipeline import extract_table

markdown, cells = extract_table(
    "table.png",
    model_path="checkpoints/unet_best.pt",
    ocr_engine="paddleocr"
)

print(markdown)
```

Output:

```
| Item        | Price | Qty |
| ----------- | ----- | --- |
| iPhone 15   | 999   | 12  |
| Samsung S24 | 899   | 8   |
```

---

## How It Works

```
Input image (resized to 384×384)
       │
       ▼
[1] EfficientUNet → 5 segmentation masks
       │
       ▼
[2] Projection analysis → row/column separator positions
       │
       ▼
[3] Span detection → connected components on span mask
       │
       ▼
[4] Grid construction → per-cell bounding boxes
       │
       ▼
[5] OCR — header rows: spatial mapping pass
        — body cells: per-cell crop + PaddleOCR
       │
       ▼
[6] Post-processing → drop phantom columns, empty rows, footers
       │
       ▼
[7] Markdown export
```

Each stage is independently accessible so you can customize or swap components.

---

## Model & Checkpoints

- **Train from scratch** using `notebooks/02_table-recognition.ipynb`
- **Download pretrained checkpoint:**
  - [Google Drive](https://drive.google.com/drive/folders/1ILV2zmI6Go-u16bFRbM_Hi1vAyLzc1dQ?usp=drive_link)
  - [Hugging Face](https://huggingface.co/henryhs/rag-table/blob/main/unet_best.pt)

Place the checkpoint at `checkpoints/unet_best.pt` and pass it via `model_path`.

---

## Project Structure

```
RagTable/
├── python/
│   └── ragtab/
│       ├── __init__.py
│       ├── detection.py     # Mask → grid cells
│       ├── model.py         # EfficientUNet definition
│       ├── ocr.py           # PaddleOCR wrapper + text cleaning
│       ├── pipeline.py      # End-to-end extract_table()
│       └── utils.py
├── checkpoints/
├── notebooks/
│   └── 02_table-recognition.ipynb
└── README.md
```

---

## License

MIT — free to use, including for commercial purposes.

---

## Author

**Dinh Duc Tai** — [dinhductai2004@gmail.com](mailto:dinhductai2004@gmail.com)

If you find this useful, consider giving it a ⭐️ on GitHub!
