Metadata-Version: 2.4
Name: sinhala-pdf2md
Version: 0.2.1
Summary: Convert Sinhala PDF documents into clean Markdown using OCR and text extraction
Project-URL: Homepage, https://github.com/RMCV-Rajapaksha/Sinhala-OCR
Project-URL: Documentation, https://github.com/RMCV-Rajapaksha/Sinhala-OCR/tree/main/docs
Project-URL: Repository, https://github.com/RMCV-Rajapaksha/Sinhala-OCR
Project-URL: Issues, https://github.com/RMCV-Rajapaksha/Sinhala-OCR/issues
Project-URL: Changelog, https://github.com/RMCV-Rajapaksha/Sinhala-OCR/blob/main/CHANGELOG.md
Author: Sinhala-PDF2MD Contributors
License: MIT License
        
        Copyright (c) 2024–2026 Sinhala-PDF2MD Contributors
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: document-conversion,markdown,nlp,ocr,pdf,sinhala,sinhalese
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing
Classifier: Topic :: Text Processing :: Markup :: Markdown
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: numpy>=1.24.0
Requires-Dist: opencv-python-headless>=4.8.0
Requires-Dist: pdfplumber>=0.10.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pymupdf>=1.23.0
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: rich>=13.0.0
Requires-Dist: typer[all]>=0.9.0
Provides-Extra: ai
Requires-Dist: google-generativeai>=0.5.0; extra == 'ai'
Requires-Dist: ollama>=0.1.0; extra == 'ai'
Requires-Dist: openai>=1.0.0; extra == 'ai'
Provides-Extra: ai-gemini
Requires-Dist: google-generativeai>=0.5.0; extra == 'ai-gemini'
Provides-Extra: ai-ollama
Requires-Dist: ollama>=0.1.0; extra == 'ai-ollama'
Provides-Extra: ai-openai
Requires-Dist: openai>=1.0.0; extra == 'ai-openai'
Provides-Extra: all
Requires-Dist: black>=24.0.0; extra == 'all'
Requires-Dist: google-generativeai>=0.5.0; extra == 'all'
Requires-Dist: mypy>=1.5.0; extra == 'all'
Requires-Dist: ollama>=0.1.0; extra == 'all'
Requires-Dist: openai>=1.0.0; extra == 'all'
Requires-Dist: pre-commit>=3.5.0; extra == 'all'
Requires-Dist: pytest-cov>=4.1.0; extra == 'all'
Requires-Dist: pytest-mock>=3.11.0; extra == 'all'
Requires-Dist: pytest>=7.4.0; extra == 'all'
Requires-Dist: ruff>=0.4.0; extra == 'all'
Requires-Dist: surya-ocr>=0.6.0; extra == 'all'
Requires-Dist: tox>=4.0.0; extra == 'all'
Requires-Dist: types-pillow; extra == 'all'
Provides-Extra: dev
Requires-Dist: black>=24.0.0; extra == 'dev'
Requires-Dist: mypy>=1.5.0; extra == 'dev'
Requires-Dist: pre-commit>=3.5.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.11.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.4.0; extra == 'dev'
Requires-Dist: tox>=4.0.0; extra == 'dev'
Requires-Dist: types-pillow; extra == 'dev'
Provides-Extra: surya
Requires-Dist: surya-ocr>=0.6.0; extra == 'surya'
Description-Content-Type: text/markdown

# sinhala-pdf2md

**Version: 0.2.1**

**Convert Sinhala PDF documents into clean, readable Markdown — with or without OCR.**

[![PyPI](https://img.shields.io/pypi/v/sinhala-pdf2md)](https://pypi.org/project/sinhala-pdf2md/)
[![Python](https://img.shields.io/pypi/pyversions/sinhala-pdf2md)](https://pypi.org/project/sinhala-pdf2md/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![CI](https://github.com/RMCV-Rajapaksha/Sinhala-OCR/actions/workflows/ci.yml/badge.svg)](https://github.com/RMCV-Rajapaksha/Sinhala-OCR/actions)
[![Coverage](https://img.shields.io/badge/coverage-85%25+-brightgreen)](https://github.com/RMCV-Rajapaksha/Sinhala-OCR)

---

## Why Does This Exist?

Working with Sinhala text in PDF format is painful. Most tools either ignore Unicode entirely, mangle the script's conjunct consonants, or produce OCR output full of garbled characters.

`sinhala-pdf2md` was built to solve this specifically for Sinhala documents. It:

- **Knows the difference** between a text-based PDF (which has a proper text layer) and a scanned one (which is just an image)
- **Picks the right tool** for each page — direct text extraction for digital PDFs, OCR for scanned ones
- **Fixes Unicode issues** that appear after OCR — broken ZWJ sequences, misplaced combining marks, control characters
- **Optionally runs an LLM** (OpenAI, Gemini, or a local Ollama model) to clean up structure

If you're digitising Sinhala books, government documents, or any scanned Sinhala content, this tool handles the messy bits so you can focus on the content.
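
As a rough illustration of the Unicode repair step described above, the sketch below applies NFC normalisation, drops stray control and format characters, and keeps a zero-width joiner only when it sits next to a virama. The function name `repair_text` and the "adjacent to a virama" rule are assumptions made for this example, not the library's actual implementation.

```python
import unicodedata

VIRAMA = "\u0dca"  # Sinhala al-lakuna (virama)
ZWJ = "\u200d"     # zero-width joiner
ZWNJ = "\u200c"    # zero-width non-joiner
KEEP = {"\n", "\t", ZWNJ}  # control/format characters we never strip


def repair_text(text: str) -> str:
    """NFC-normalise text and remove stray control/format characters."""
    text = unicodedata.normalize("NFC", text)
    out = []
    for i, ch in enumerate(text):
        if ch == ZWJ:
            prev_ch = text[i - 1] if i > 0 else ""
            next_ch = text[i + 1] if i + 1 < len(text) else ""
            # Assumed rule: a joiner is only meaningful next to a virama.
            if VIRAMA in (prev_ch, next_ch):
                out.append(ch)
        elif ch in KEEP or unicodedata.category(ch) not in ("Cc", "Cf"):
            out.append(ch)
    return "".join(out)
```

A conjunct such as `"\u0d9a\u0dca\u200d\u0dbb"` passes through unchanged, while a joiner stranded between unrelated characters is removed.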

---

## Features

- **Smart page classification** — detects whether each page needs OCR or direct extraction
- **Two OCR engines** — Tesseract (default, free) or Surya (transformer-based, higher accuracy)
- **Image pre-processing** — deskew, denoise, and binarize scanned images before OCR
- **Heading detection** — infers headings from font size ratios (not just guessing)
- **Table support** — extracts PDF tables and renders them as GitHub Flavored Markdown
- **List detection** — recognises bullets, numbered lists, and common Sinhala bullet characters
- **Unicode repair** — NFC normalisation + ZWJ and virama (්‍) sequence fixing
- **AI cleanup** — optional post-processing with OpenAI, Gemini, or Ollama
- **Batch conversion** — convert an entire directory in one command
- **Python API** — use as a library in your own pipelines
- **Env-var config** — all settings configurable via `PDF2MD_*` environment variables
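
To make the heading detection idea concrete, here is a minimal sketch that treats the median font size on a page as body text and marks any line at least `ratio` times larger as a heading. The `(text, font_size)` input shape, the `detect_headings` name, and the choice of `##` as the heading level are hypothetical; only the 1.3 default comes from the configuration table in this README.

```python
from statistics import median


def detect_headings(lines: list[tuple[str, float]], ratio: float = 1.3) -> list[str]:
    """Prefix lines whose font size is >= ratio * body size with '## '."""
    if not lines:
        return []
    # Assumption: most lines on a page are body text, so the median is the body size.
    body_size = median(size for _, size in lines)
    result = []
    for text, size in lines:
        if size >= body_size * ratio:
            result.append(f"## {text}")
        else:
            result.append(text)
    return result


page = [("Title", 18.0), ("first paragraph", 11.0), ("second paragraph", 11.0)]
print(detect_headings(page))
```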

---

## Architecture Overview

```
PDF File
   │
   ▼
PageAnalyzer ──────── Classifies each page (text / scanned / mixed)
   │
   ├─── TEXT page ──► PDFExtractor (pdfplumber) ──► MarkdownFormatter
   │
   ├─── SCANNED page ──► PageRenderer (PyMuPDF) ──► Image Preprocessor
   │                                                       │
   │                                                       ▼
   │                                               OCREngine (Tesseract / Surya)
   │                                                       │
   │                                                       ▼
   │                                               MarkdownFormatter
   │
   └─── MIXED page ──► Both paths, combined
           │
           ▼
   MarkdownCleaner (Unicode repair, whitespace)
           │
           ▼
   [Optional] AIFormatter (OpenAI / Gemini / Ollama)
           │
           ▼
   Output .md file
```

See [`docs/architecture.md`](docs/architecture.md) for the full breakdown with Mermaid diagrams.
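
The routing in the diagram can be summarised in a few lines of Python. This is only a sketch: the stage names are copied from the diagram, while `PageKind`, `plan_stages`, and the idea of returning stage names as strings are illustrative assumptions rather than the package's real classes.

```python
from enum import Enum


class PageKind(Enum):
    TEXT = "text"
    SCANNED = "scanned"
    MIXED = "mixed"


# Stage names mirror the diagram; these lists are illustrative only.
TEXT_PATH = ["PDFExtractor"]
OCR_PATH = ["PageRenderer", "ImagePreprocessor", "OCREngine"]


def plan_stages(kind: PageKind, ai_cleanup: bool = False) -> list[str]:
    """Return the ordered stage names a page of this kind passes through."""
    if kind is PageKind.TEXT:
        stages = list(TEXT_PATH)
    elif kind is PageKind.SCANNED:
        stages = list(OCR_PATH)
    else:  # MIXED: run both paths, then combine the results
        stages = TEXT_PATH + OCR_PATH
    # Every page is formatted and cleaned; AI cleanup is optional.
    stages += ["MarkdownFormatter", "MarkdownCleaner"]
    if ai_cleanup:
        stages.append("AIFormatter")
    return stages


print(plan_stages(PageKind.SCANNED, ai_cleanup=True))
```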

---

## Installation

### Prerequisites

**Tesseract** (required for OCR on scanned pages):

```bash
# Ubuntu / Debian
sudo apt-get install tesseract-ocr tesseract-ocr-sin

# macOS
brew install tesseract tesseract-lang

# Windows — download from https://github.com/UB-Mannheim/tesseract/wiki
# Then add to PATH. Make sure the "sin" language data is included.
```

### Install the Package

```bash
pip install sinhala-pdf2md
```

### Optional Extras

```bash
# Surya OCR engine (transformer-based, higher accuracy)
# ⚠️ Non-commercial use only — see https://github.com/VikParuchuri/surya
pip install "sinhala-pdf2md[surya]"

# AI cleanup with OpenAI
pip install "sinhala-pdf2md[ai-openai]"

# AI cleanup with Gemini
pip install "sinhala-pdf2md[ai-gemini]"

# AI cleanup with Ollama (local)
pip install "sinhala-pdf2md[ai-ollama]"

# Everything at once (dev included)
pip install "sinhala-pdf2md[all]"
```

See [`docs/installation.md`](docs/installation.md) for detailed platform-specific instructions.

---

## Quick Start

### Command Line

```bash
# Convert a single PDF
pdf2md document.pdf

# Specify output path
pdf2md document.pdf -o output.md

# Use Surya OCR engine
pdf2md document.pdf --ocr-engine surya

# Higher DPI for better OCR quality
pdf2md document.pdf --dpi 400

# Enable AI cleanup (requires OpenAI API key)
pdf2md document.pdf --ai-cleanup openai

# Verbose logging
pdf2md document.pdf --verbose
```

### Batch Conversion

```bash
# Convert all PDFs in a directory
pdf2md batch ./documents/

# With output directory and recursive search
pdf2md batch ./documents/ --output-dir ./markdown/ --recursive
```

### Python API

```python
from sinhala_pdf2md import Converter

# Simple conversion
converter = Converter()
output_path = converter.convert("document.pdf")

# Custom configuration
converter = Converter(
    ocr_engine="tesseract",
    ocr_language="si",
    page_render_dpi=400,
    preserve_page_breaks=True,
)
converter.convert("document.pdf", "output.md")

# Get Markdown as a string (don't write a file)
markdown = converter.convert_to_string("document.pdf")

# Batch convert a directory
results = converter.convert_batch("./pdfs/", output_dir="./output/", recursive=True)
print(f"Converted {len(results)} files")
```

---

## CLI Usage

```
Usage: pdf2md [COMMAND] [OPTIONS]

Commands:
  convert   Convert a single PDF file to Markdown. (default)
  batch     Convert all PDF files in a directory to Markdown.

Convert Options:
  PDF_PATH              Path to the input PDF file
  -o, --output PATH     Output Markdown file path
  -e, --ocr-engine TEXT OCR engine: tesseract (default) or surya
  -l, --lang TEXT       Language code: si (Sinhala, default), en, ta, hi
  -d, --dpi INT         Render DPI for scanned pages (72–600, default 300)
  -v, --verbose         Enable debug logging
  --ai-cleanup TEXT     AI provider for post-processing: openai, gemini, ollama

Batch Options:
  INPUT_DIR             Directory containing PDF files
  -o, --output-dir DIR  Output directory for .md files
  -r, --recursive       Search subdirectories
  (plus all convert options above)
```

### Examples

```bash
# Basic usage — output next to input file
pdf2md report.pdf
# → report.md

# English document
pdf2md letter.pdf --lang en

# High-quality scanned document
pdf2md scanned_book.pdf --dpi 450 --ocr-engine tesseract

# With AI cleanup via local Ollama
pdf2md document.pdf --ai-cleanup ollama

# Batch with verbose output
pdf2md batch ./inbox/ --output-dir ./processed/ --recursive --verbose
```

---

## Python API Usage

### Basic

```python
from sinhala_pdf2md import Converter

converter = Converter()
path = converter.convert("input.pdf", "output.md")
print(f"Saved to: {path}")
```

### Full Configuration via `ConverterConfig`

```python
from sinhala_pdf2md import Converter, ConverterConfig, OCREngineType, AIProviderType

config = ConverterConfig(
    ocr_engine=OCREngineType.TESSERACT,
    ocr_language="si",
    page_render_dpi=400,
    ocr_confidence_threshold=0.6,
    preserve_page_breaks=True,
    heading_detection_enabled=True,
    table_detection_enabled=True,
    heading_font_size_ratio=1.3,
    image_preprocess_enabled=True,
    ai_provider=AIProviderType.OPENAI,
    ai_model="gpt-4o",
    ai_api_key="sk-...",
)

converter = Converter(config=config)
converter.convert("document.pdf", "output.md")
```

### Batch Processing with Error Handling

```python
from sinhala_pdf2md import Converter
from sinhala_pdf2md.exceptions import BatchConversionError

converter = Converter(ocr_engine="tesseract")

try:
    results = converter.convert_batch("./pdfs/", output_dir="./out/", recursive=True)
    print(f"Successfully converted {len(results)} files")
except BatchConversionError as e:
    print(f"Some files failed: {e.failures}")
```

### Return Markdown Without Writing a File

```python
converter = Converter()
markdown_text = converter.convert_to_string("document.pdf")
# Process the string however you like
```
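
One simple way to process the returned string is to split it back into pages. The sketch below assumes `preserve_page_breaks` is enabled and that the marker is exactly the `<!-- page-break -->` comment listed in the Configuration table; `split_pages` is a hypothetical helper, not part of the package.

```python
PAGE_BREAK = "<!-- page-break -->"


def split_pages(markdown_text: str) -> list[str]:
    """Split converted Markdown into per-page chunks, dropping empty ones."""
    chunks = markdown_text.split(PAGE_BREAK)
    return [chunk.strip() for chunk in chunks if chunk.strip()]


sample = "# Title\n\nfirst page\n\n<!-- page-break -->\n\nsecond page\n"
for number, page in enumerate(split_pages(sample), start=1):
    print(number, len(page))
```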

---

## Configuration

Settings can be supplied as constructor arguments, through a `ConverterConfig` object, or as environment variables with the `PDF2MD_` prefix.

| Setting | Default | Env Var | Description |
|---------|---------|---------|-------------|
| `ocr_engine` | `tesseract` | `PDF2MD_OCR_ENGINE` | OCR backend (`tesseract` or `surya`) |
| `ocr_language` | `si` | `PDF2MD_OCR_LANGUAGE` | ISO 639-1 language code |
| `ocr_confidence_threshold` | `0.5` | `PDF2MD_OCR_CONFIDENCE_THRESHOLD` | Min confidence score to keep OCR output |
| `page_render_dpi` | `300` | `PDF2MD_PAGE_RENDER_DPI` | DPI for rendering scanned pages (72–600) |
| `preserve_page_breaks` | `true` | `PDF2MD_PRESERVE_PAGE_BREAKS` | Insert `<!-- page-break -->` between pages |
| `heading_detection_enabled` | `true` | `PDF2MD_HEADING_DETECTION_ENABLED` | Enable font-size heading detection |
| `table_detection_enabled` | `true` | `PDF2MD_TABLE_DETECTION_ENABLED` | Enable table extraction |
| `heading_font_size_ratio` | `1.3` | `PDF2MD_HEADING_FONT_SIZE_RATIO` | Font size ratio threshold for headings |
| `image_preprocess_enabled` | `true` | `PDF2MD_IMAGE_PREPROCESS_ENABLED` | Deskew/denoise/binarize before OCR |
| `ai_provider` | `None` | `PDF2MD_AI_PROVIDER` | AI cleanup provider (`openai`, `gemini`, `ollama`) |
| `ai_model` | `None` | `PDF2MD_AI_MODEL` | Model name for the AI provider |
| `ai_api_key` | `None` | `PDF2MD_AI_API_KEY` | API key for the AI provider |
| `ai_base_url` | `None` | `PDF2MD_AI_BASE_URL` | Base URL override (for Ollama or custom endpoints) |
| `output_dir` | `None` | `PDF2MD_OUTPUT_DIR` | Default output directory |
| `log_level` | `INFO` | `PDF2MD_LOG_LEVEL` | Logging level |

Example with environment variables:

```bash
export PDF2MD_OCR_ENGINE=surya
export PDF2MD_PAGE_RENDER_DPI=400
export PDF2MD_AI_PROVIDER=openai
export PDF2MD_AI_API_KEY=sk-...

pdf2md document.pdf
```
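
The naming convention maps each `PDF2MD_*` variable to the lower-cased setting name from the table. The package lists `pydantic-settings` as a dependency, but how it is wired up is not shown in this README, so the following is a stdlib-only illustration of the convention with a hypothetical `load_settings` helper. The 72–600 DPI range is taken from the table.

```python
import os


def load_settings(environ=os.environ, prefix: str = "PDF2MD_") -> dict:
    """Collect prefixed environment variables into a settings dict."""
    settings = {
        key[len(prefix):].lower(): value
        for key, value in environ.items()
        if key.startswith(prefix)
    }
    if "page_render_dpi" in settings:
        dpi = int(settings["page_render_dpi"])
        # Range taken from the configuration table (72-600).
        if not 72 <= dpi <= 600:
            raise ValueError(f"page_render_dpi out of range: {dpi}")
        settings["page_render_dpi"] = dpi
    return settings


print(load_settings({"PDF2MD_OCR_ENGINE": "surya", "PDF2MD_PAGE_RENDER_DPI": "400"}))
```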

See [`docs/configuration.md`](docs/configuration.md) for the full reference.

---

## Supported OCR Engines

### Tesseract (Default)

- Free, open source, widely available
- Requires the `tesseract` binary and language data files
- Good accuracy for clean, high-DPI scans
- Supports Sinhala (`sin`), English (`eng`), Tamil (`tam`), Hindi (`hin`)
- Install: `pip install sinhala-pdf2md` (Tesseract binary installed separately)
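
The CLI and configuration use two-letter codes (`si`, `en`, `ta`, `hi`), while Tesseract's language data uses the three-letter names listed above. A translation along these lines is implied; the table and function name below are a sketch and may not match how the package does it internally.

```python
# Two-letter codes accepted by the CLI mapped to Tesseract language data names.
ISO_TO_TESSERACT = {
    "si": "sin",  # Sinhala
    "en": "eng",  # English
    "ta": "tam",  # Tamil
    "hi": "hin",  # Hindi
}


def tesseract_lang(code: str) -> str:
    """Translate a two-letter language code to a Tesseract language name."""
    try:
        return ISO_TO_TESSERACT[code.lower()]
    except KeyError:
        raise ValueError(f"Unsupported language code: {code!r}") from None


print(tesseract_lang("si"))
```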

### Surya (Optional)

- Transformer-based, generally higher accuracy
- Language-agnostic (handles Sinhala without explicit training data)
- Requires PyTorch and significant RAM/GPU
- **⚠️ License restriction:** organisations above $5M in revenue or funding may use it for non-commercial purposes only
- Install: `pip install "sinhala-pdf2md[surya]"`

---

## Limitations

- **Scanned page quality matters** — very low-resolution or heavily degraded scans will produce poor OCR results regardless of which engine you use. 300+ DPI is recommended.
- **Complex layouts** — multi-column documents, footnotes, and sidebar text may not reconstruct perfectly. The formatter works page-by-page and doesn't do global layout analysis.
- **Surya licensing** — the Surya engine is not free for commercial use above the license thresholds. Check the [Surya license](https://github.com/VikParuchuri/surya) before using it in production.
- **AI cleanup costs money** — OpenAI and Gemini API calls are billed per token. Large documents with many pages can accumulate costs quickly.
- **Mixed pages** — pages that have both text and images use a heuristic: if the text layer has 100+ characters, OCR is skipped. This works well in practice but isn't perfect.
- **No image extraction** — embedded images in PDFs are not extracted or described.
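
The mixed-page heuristic can be written down in a few lines. This sketch assumes the decision depends only on the stripped length of the text layer and on whether the page contains images; `needs_ocr` and its parameters are hypothetical names, and only the 100-character threshold comes from the description above.

```python
def needs_ocr(text_layer: str, has_images: bool, min_chars: int = 100) -> bool:
    """Decide whether a page should go through OCR.

    A page with a substantial text layer skips OCR even if it has images;
    a page with little or no text is OCR'd only when there is an image to read.
    """
    if len(text_layer.strip()) >= min_chars:
        return False
    return has_images


print(needs_ocr("x" * 150, has_images=True))   # enough text: OCR skipped
print(needs_ocr("page 3", has_images=True))    # sparse text plus image: OCR
```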

---

## Performance Notes

- **Text-based PDFs** are fast — pdfplumber extracts text in milliseconds per page.
- **Scanned pages** take longer — rendering + image preprocessing + OCR can take 2–10 seconds per page depending on DPI and hardware.
- **Surya** loads a transformer model on first use — there's a cold-start delay of several seconds, but subsequent pages are faster.
- **Image preprocessing** (deskew, denoise, binarize) adds ~0.5–2 seconds per page but significantly improves OCR accuracy on noisy scans.
- The OCR engine is **lazily initialised** — if your document has no scanned pages, no OCR overhead is incurred at all.
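
Lazy initialisation of this kind is commonly done with `functools.cached_property`. The sketch below shows the pattern in isolation: `LazyOCR` and the `factory` callable are illustrative names, and the real converter may implement it differently.

```python
from functools import cached_property


class LazyOCR:
    """Hold an OCR engine factory and build the engine on first access only."""

    def __init__(self, factory):
        self._factory = factory

    @cached_property
    def engine(self):
        # Runs once; the result is cached on the instance afterwards.
        return self._factory()


created = []


def make_engine():
    created.append("engine")  # stands in for an expensive model load
    return object()


lazy = LazyOCR(make_engine)
print(len(created))  # nothing built yet
first = lazy.engine
second = lazy.engine
print(len(created), first is second)
```

A document made up entirely of text pages never touches `engine`, so the factory is never called.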

---

## Contributing

Contributions are welcome. Whether you're fixing a bug, adding a feature, or writing tests, please:

1. **Fork** the repository and create a branch from `main`.
2. **Install dev dependencies**: `pip install -e ".[dev]"`
3. **Set up pre-commit hooks**: `pre-commit install`
4. **Run tests**: `make test` or `pytest tests/`
5. **Check types**: `make typecheck`
6. **Lint and format**: `make format`
7. **Open a pull request** with a clear description.

See [`docs/developer-guide.md`](docs/developer-guide.md) for detailed contribution instructions, including how to add a new OCR engine or AI provider.

### Common `make` Targets

```bash
make dev         # Install in editable mode with dev dependencies
make test        # Run the full test suite
make test-unit   # Unit tests only
make lint        # Check code style
make format      # Auto-fix formatting
make typecheck   # mypy static analysis
make clean       # Remove build artifacts
```

---

## Documentation

| Document | Purpose |
|----------|---------|
| [Architecture](docs/architecture.md) | How the system works, data flow, component responsibilities |
| [Design Decisions](docs/design-decisions.md) | Why specific libraries and patterns were chosen |
| [Project Structure](docs/project-structure.md) | Folder and file layout explained |
| [Workflows](docs/workflows.md) | Step-by-step processing flows with diagrams |
| [Developer Guide](docs/developer-guide.md) | How to extend the project |
| [API Reference](docs/api-reference.md) | Public classes, methods, and exceptions |
| [Testing Guide](docs/testing-guide.md) | How to run tests and contribute test coverage |
| [Configuration](docs/configuration.md) | Full configuration reference |
| [Installation](docs/installation.md) | Platform-specific setup instructions |
| [Changelog](CHANGELOG.md) | Version history |

---

## License

[MIT](LICENSE) — free to use, modify, and distribute.

> **Note on Surya:** The optional Surya OCR engine uses a modified Open Rail-M license that restricts commercial use. The `sinhala-pdf2md` library itself is MIT — the restriction only applies if you install and use the `[surya]` extra. See [Surya's license](https://github.com/VikParuchuri/surya) for details.
