Metadata-Version: 2.4
Name: markitdown-pro
Version: 2.0.0
Summary: A package that converts almost any file format to Markdown.
Author: Developer
License-Expression: MIT
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.13
Requires-Dist: aiohttp
Requires-Dist: azure-ai-documentintelligence==1.0.2
Requires-Dist: azure-cognitiveservices-speech==1.42.0
Requires-Dist: beautifulsoup4
Requires-Dist: cairosvg
Requires-Dist: chardet==5.2.0
Requires-Dist: ebooklib
Requires-Dist: httpx
Requires-Dist: huggingface-hub[hf-xet]
Requires-Dist: langchain-core==0.3.74
Requires-Dist: langchain-openai==0.3.29
Requires-Dist: markitdown[all]==0.1.5
Requires-Dist: nbformat
Requires-Dist: pillow-heif
Requires-Dist: pymupdf
Requires-Dist: pypdf
Requires-Dist: python-dotenv
Requires-Dist: requests
Requires-Dist: tabulate
Requires-Dist: youtube-transcript-api
Description-Content-Type: text/markdown

# MarkItDown-Pro

**MarkItDown-Pro** is a Python library that converts **50+ document formats into Markdown**, built to power RAG (Retrieval-Augmented Generation) pipelines for semantic search. It extends [Microsoft MarkItDown](https://github.com/microsoft/markitdown) with Azure AI services, per-page OCR, and customizable converter pipelines.

---

## Features

- **Async-first API** -- all public methods are `async`, designed for concurrent document processing
- **Per-page PDF routing** -- classifies each page as text or image, extracts text locally and OCRs only image pages
- **Customizable pipelines** -- inject your own converter order per handler to optimize for quality, speed, or cost
- **GPT Vision OCR** -- concurrent page-by-page OCR via Azure OpenAI (gpt-5.4-mini default)
- **Gotenberg integration** -- convert Office files to PDF for full OCR via [Gotenberg](https://gotenberg.dev) HTTP API
- **Azure Document Intelligence** -- layout-aware text extraction with the `prebuilt-layout` model
- **Azure Speech-to-Text** -- audio transcription with automatic language detection
- **Structured logging** -- `Component | filename.ext | page N | method | message` format for Log Analytics
- **Graceful degradation** -- missing API keys or services are handled automatically; converters fall back silently

## Supported Formats

| Category        | Formats                                                                             |
| --------------- | ----------------------------------------------------------------------------------- |
| **PDF**         | `.pdf` (text, scanned, mixed — per-page routing)                                    |
| **Office**      | `.docx`, `.pptx` (via Gotenberg + DocIntelligence + MarkItDown)                     |
| **Spreadsheet** | `.csv`, `.tsv`, `.xls`, `.xlsx`                                                     |
| **Images**      | `.png`, `.jpg`, `.jpeg`, `.gif`, `.bmp`, `.svg`, `.tiff`, `.webp`, `.heic`, `.heif` |
| **Audio**       | `.mp3`, `.wav`                                                                      |
| **Email**       | `.eml`, `.msg`, `.p7s`                                                              |
| **Archives**    | `.pst` (Outlook)                                                                    |
| **E-books**     | `.epub`                                                                             |
| **Notebooks**   | `.ipynb`                                                                            |
| **Markup**      | `.html`, `.htm`, `.xml`, `.json`, `.ndjson`, `.yaml`, `.yml`                        |
| **Text**        | `.txt`, `.md`, `.py`, `.go`                                                         |

## Architecture

```
ConversionPipeline (async)
    |-- detect extension
    |-- route to Handler
    |       |-- try Converter 1 (primary)
    |       |-- try Converter 2 (fallback)
    |       |-- try Converter N
    |-- validate content
    |-- clean markdown
```

### Default Converter Pipelines

| Handler            | Pipeline (in order)                                                                              | What each captures                           |
| ------------------ | ------------------------------------------------------------------------------------------------ | -------------------------------------------- |
| **PDFHandler**     | MarkItDown (all-text only) → PagePDFConverter (per-page: PyMuPDF → GPT Vision → DocIntelligence) | Text + images + scanned content              |
| **OfficeHandler**  | Gotenberg → DocIntelligence → MarkItDown                                                         | Text + images (via Gotenberg PDF conversion) |
| **ImageHandler**   | GPT Vision (primary model) → GPT Vision (fallback model)                                         | OCR on images                                |
| **AudioHandler**   | Azure Speech                                                                                     | Transcription                                |
| **TabularHandler** | openpyxl/pandas                                                                                  | Tables to markdown                           |
| **MarkupHandler**  | BeautifulSoup/yaml/json                                                                          | Structured markup                            |
| **TextHandler**    | chardet encoding detection                                                                       | Raw text                                     |
| **EmailHandler**   | Python email parser                                                                              | Email text + attachments                     |

## Installation

### Prerequisites

- Python >= 3.13
- [uv](https://docs.astral.sh/uv/) (package manager)
- System dependencies: `ffmpeg` (audio)
- Optional: [Gotenberg](https://gotenberg.dev) Docker service (for Office → PDF OCR)

### Install

```bash
git clone https://github.com/your-org/markitdown-pro.git
cd markitdown-pro

# Install all dependencies (creates .venv automatically)
uv sync

# With dev tools (pytest, ruff)
uv sync --dev
```

### Configure Environment

Create a `.env` file in the project root:

```bash
# Azure OpenAI (required for GPT Vision OCR)
AZURE_OPENAI_ENDPOINT="https://<resource>.openai.azure.com"
AZURE_OPENAI_API_KEY="your-key"
AZURE_OPENAI_API_VERSION="2024-12-01-preview"

# Azure Document Intelligence (required for doc intelligence fallback)
AZURE_DOCINTEL_ENDPOINT="https://<resource>.cognitiveservices.azure.com"
AZURE_DOCINTEL_KEY="your-key"

# Azure Speech (required for audio transcription)
AZURE_SPEECH_KEY="your-key"
AZURE_SPEECH_REGION="eastus"

# Gotenberg (optional — for Office → PDF → OCR)
GOTENBERG_URL="http://gotenberg:3000"

# OCR model configuration (optional — defaults shown)
MARKITDOWN_OCR_MODEL="gpt-5.4-mini"
MARKITDOWN_OCR_FALLBACK_MODEL="gpt-5.4"
MARKITDOWN_OCR_TIMEOUT="60.0"
MARKITDOWN_OCR_MAX_RETRIES="6"
MARKITDOWN_MIN_IMAGE_AREA="150000"

# General
LOG_LEVEL=20  # 10=DEBUG, 20=INFO, 30=WARNING
```

All services are optional -- the library degrades gracefully when credentials are missing.

## Usage

### Basic

```python
import asyncio
from markitdown_pro.conversion_pipeline import ConversionPipeline

async def main():
    pipeline = ConversionPipeline()
    try:
        md = await pipeline.convert_document_to_md("/path/to/document.pdf")
        print(md)
    finally:
        await pipeline.aclose()

asyncio.run(main())
```

### Custom Pipeline (speed-first)

Skip Gotenberg and GPT Vision, use only local converters:

```python
from markitdown_pro.conversion_pipeline import ConversionPipeline
from markitdown_pro.converters.markitdown_converter import MarkItDownConverter
from markitdown_pro.converters.doc_intel_converter import DocIntelligenceConverter
from markitdown_pro.handlers.office_handler import OfficeHandler

# Office: MarkItDown first (fast, local), DocIntelligence fallback
office = OfficeHandler(pipeline=[
    (MarkItDownConverter(), "MarkItDown"),
    (DocIntelligenceConverter(), "DocIntelligence"),
])

pipeline = ConversionPipeline(office_handler=office)
```

### Custom Pipeline (quality-first with Gotenberg)

Ensure Office files go through Gotenberg for full OCR:

```python
from markitdown_pro.converters.gotenberg_converter import GotenbergConverter
from markitdown_pro.converters.doc_intel_converter import DocIntelligenceConverter
from markitdown_pro.converters.markitdown_converter import MarkItDownConverter
from markitdown_pro.handlers.office_handler import OfficeHandler

office = OfficeHandler(pipeline=[
    (GotenbergConverter(gotenberg_url="http://localhost:3000"), "Gotenberg"),
    (DocIntelligenceConverter(), "DocIntelligence"),
    (MarkItDownConverter(), "MarkItDown"),
])

pipeline = ConversionPipeline(office_handler=office)
```

### Custom Pipeline (cost-first, no API calls)

```python
from markitdown_pro.converters.markitdown_converter import MarkItDownConverter
from markitdown_pro.handlers.office_handler import OfficeHandler

office = OfficeHandler(pipeline=[
    (MarkItDownConverter(), "MarkItDown"),
])

pipeline = ConversionPipeline(office_handler=office)
```

### Custom OCR Model

```python
from markitdown_pro.converters.gpt_vision_converter import GPTVisionConverter
from markitdown_pro.handlers.image_handler import ImageHandler

image = ImageHandler(pipeline=[
    GPTVisionConverter(model_name="gpt-5.4-nano"),  # cheapest
])

pipeline = ConversionPipeline(image_handler=image)
```

### All Handler Overrides

```python
pipeline = ConversionPipeline(
    pdf_handler=my_pdf_handler,
    office_handler=my_office_handler,
    image_handler=my_image_handler,
    audio_handler=my_audio_handler,
    # text, tabular, markup, email, epub, pst, ipynb also injectable
)
```

### Gotenberg Setup

[Gotenberg](https://gotenberg.dev) converts Office files to PDF for per-page OCR. Run it as a Docker container:

```bash
docker run -d -p 3000:3000 gotenberg/gotenberg:8
```

Or in Docker Compose:

```yaml
services:
  gotenberg:
    image: gotenberg/gotenberg:8
    ports:
      - "3000:3000"
```

Set the URL in your environment:

```bash
GOTENBERG_URL=http://localhost:3000
```

## Testing

```bash
# Unit tests (fast, no credentials, ~0.2s)
uv run pytest tests/unit/ -v

# Integration tests (requires Azure credentials in .env)
uv run pytest -m integration -v

# Local-only integration tests (no Azure needed)
uv run pytest tests/integration/test_text_handler.py tests/integration/test_markup_handler.py tests/integration/test_tabular_handler.py tests/integration/test_email_handler.py -v

# All tests
uv run pytest tests/unit/ tests/integration/ -v
```

Test expectations are defined per-handler in `tests/data/test_expectations.yaml`.

## Development

```bash
# Lint
uv run ruff check markitdown_pro/

# Format
uv run ruff format markitdown_pro/

# Build package
uv build
```

## License

MIT
