Metadata-Version: 2.1
Name: markitdown-pdf
Version: 1.0.0
Summary: Convert PDF to Markdown with image preservation. Combines markitdown + PyMuPDF for best results.
Author: Leomeie
License: MIT
Project-URL: Homepage, https://github.com/Leomeie/pdf-to-markdown
Project-URL: Repository, https://github.com/Leomeie/pdf-to-markdown
Project-URL: Issues, https://github.com/Leomeie/pdf-to-markdown/issues
Keywords: pdf,markdown,conversion,images,markitdown,pymupdf
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Markup :: Markdown
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pymupdf>=1.24.0
Requires-Dist: markitdown[all]>=0.0.1
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Provides-Extra: ocr
Requires-Dist: marker-pdf>=1.0.0; extra == "ocr"

# pdf-to-markdown

Convert PDF to Markdown **with image preservation**. Solves markitdown's image loss problem by combining markitdown + PyMuPDF.

## Why?

[markitdown](https://github.com/microsoft/markitdown) (133k+ ⭐) is excellent at converting documents to Markdown, but its PDF output **drops every embedded image**. This tool fills that gap: it keeps markitdown's text and structure, extracts the embedded images with PyMuPDF, and inserts image references back into the Markdown at the matching positions.

| Metric | markitdown only | PyMuPDF only | **This tool** |
|--------|:---:|:---:|:---:|
| Text quality | 95% | 90% | **95%** |
| Image preservation | ❌ 0% | ✅ 99% | **✅ 99%** |
| Table support | ✅ 85% | ⚠️ 60% | **✅ 85%** |
| Speed | ⚡ Fast | ⚡ Fast | **⚡ Fast** |
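
The image half of the approach can be sketched with PyMuPDF alone. This is a minimal illustration, not this project's actual code: the function names and the `page003_img02.png` naming scheme are assumptions made for the example. It relies only on `Page.get_images()` and `Document.extract_image()`, which return the embedded image bytes without re-encoding.

```python
from pathlib import Path


def image_filename(page_number: int, image_index: int, ext: str) -> str:
    """Build a sortable file name such as 'page003_img02.png' (hypothetical scheme)."""
    return f"page{page_number:03d}_img{image_index:02d}.{ext}"


def extract_images(pdf_path: str, out_dir: str) -> list[str]:
    """Save every embedded image in the PDF and return the written paths."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependency

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for image_index, img in enumerate(page.get_images(full=True), start=1):
                xref = img[0]  # first tuple item is the image's cross-reference number
                info = doc.extract_image(xref)  # dict holding raw bytes and extension
                target = out / image_filename(page_number, image_index, info["ext"])
                target.write_bytes(info["image"])
                written.append(str(target))
    return written
```

Calling `extract_images("document.pdf", "images")` would write one file per embedded image, grouped by page through the file name.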

## Quick Start

```bash
# Install dependencies
pip install pymupdf "markitdown[all]"

# Convert (auto-detect strategy)
python -m pdf_to_markdown document.pdf

# Specify output directory
python -m pdf_to_markdown document.pdf -o output/

# Batch convert
python -m pdf_to_markdown *.pdf -o output/

# Extract images only
python -m pdf_to_markdown document.pdf --images-only

# Force strategy
python -m pdf_to_markdown document.pdf --strategy merge    # markitdown + PyMuPDF
python -m pdf_to_markdown document.pdf --strategy pymupdf  # pure PyMuPDF
```
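
On shells that do not expand `*.pdf` (for example `cmd.exe` on Windows), the batch case can be driven from Python instead. The sketch below only wraps the command line shown above; `build_command` and `convert_all` are hypothetical helper names, and it assumes the CLI exits with a non-zero status when a conversion fails.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf, out_dir, strategy=None):
    """Assemble the documented CLI call for one PDF."""
    cmd = [sys.executable, "-m", "pdf_to_markdown", str(pdf), "-o", str(out_dir)]
    if strategy is not None:
        cmd += ["--strategy", strategy]
    return cmd


def convert_all(folder, out_dir, strategy=None):
    """Convert every PDF in a folder and return the ones that failed."""
    failed = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        result = subprocess.run(build_command(pdf, out_dir, strategy))
        if result.returncode != 0:  # assumption: non-zero exit means failure
            failed.append(pdf)
    return failed
```

Running one process per file keeps a single broken PDF from aborting the rest of the batch.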

## How It Works

```
Input PDF
    │
    ▼
┌─────────────────────────────────────┐
│  Step 1: Auto-detect PDF type       │
│  - Page count, image count, scanned │
│  - Select best strategy             │
└─────────────────────────────────────┘
    │
    ├─ Text-only PDF ──→ pymupdf (fastest)
    │
    └─ Mixed content ──→ markitdown + PyMuPDF merge
    │
    ▼
┌─────────────────────────────────────┐
│  Step 2: Parallel extraction        │
│  - markitdown for text/structure    │
│  - PyMuPDF for images (parallel)    │
└─────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────┐
│  Step 3: Smart merge                │
│  - Images inserted at correct pos   │
│  - Footer/header pattern matching   │
│  - Quality report generation        │
└─────────────────────────────────────┘
    │
    ▼
Output: document_with_images.md + images/
```
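
Step 3 can be illustrated with a small sketch. It is one plausible reading of "footer/header pattern matching", not the project's actual merge logic: it assumes each page of the markitdown text ends with a footer line such as `Page 12`, and it places that page's images just before the footer. The pattern and the function name are hypothetical.

```python
import re

# Hypothetical footer format; real documents need their own pattern.
FOOTER_RE = re.compile(r"^\s*Page (\d+)\s*$")


def merge_images(markdown: str, images_by_page: dict[int, list[str]]) -> str:
    """Insert image links just before the footer line that ends each page."""
    merged = []
    for line in markdown.splitlines():
        match = FOOTER_RE.match(line)
        if match:
            page = int(match.group(1))
            for path in images_by_page.get(page, []):
                merged.append(f"![]({path})")
        merged.append(line)
    return "\n".join(merged)
```

Placing images at the end of their page is coarse. A finer merge would need per-image coordinates from PyMuPDF, which this sketch leaves out.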

## Strategies

| Strategy | Best For | How |
|----------|----------|-----|
| `auto` (default) | Everything | Detects PDF type, picks best strategy |
| `merge` | Mixed content PDFs | markitdown text + PyMuPDF images |
| `pymupdf` | Text-only PDFs | Pure PyMuPDF extraction |
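
The `auto` decision can be sketched as follows, under stated assumptions. The `PdfStats` fields mirror the signals named in the diagram (page count, image count, scanned), but the class, the function names, and the 50-characters-per-page threshold are illustrative guesses rather than the tool's real heuristics.

```python
from dataclasses import dataclass


@dataclass
class PdfStats:
    page_count: int
    image_count: int
    text_chars: int


def choose_strategy(stats: PdfStats) -> str:
    """Pick 'pymupdf' for text-only PDFs and 'merge' for mixed content."""
    return "pymupdf" if stats.image_count == 0 else "merge"


def looks_scanned(stats: PdfStats, min_chars_per_page: int = 50) -> bool:
    """Guess that a PDF is scanned: it has images but almost no text layer."""
    if stats.page_count == 0:
        return False
    chars_per_page = stats.text_chars / stats.page_count
    return stats.image_count > 0 and chars_per_page < min_chars_per_page


def inspect_pdf(pdf_path: str) -> PdfStats:
    """Collect the three statistics with PyMuPDF."""
    import fitz  # PyMuPDF, imported lazily so the functions above stay dependency-free

    with fitz.open(pdf_path) as doc:
        images = sum(len(page.get_images(full=True)) for page in doc)
        chars = sum(len(page.get_text().strip()) for page in doc)
        return PdfStats(doc.page_count, images, chars)
```

A PDF flagged by `looks_scanned` has little text for either strategy to extract, which is the case the optional OCR install targets.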

## Quality Check

After conversion, validate the output:

```bash
python -m pdf_to_markdown.quality_check output.md
python -m pdf_to_markdown.quality_check output.md --verbose  # detailed report
python -m pdf_to_markdown.quality_check output.md --json     # machine-readable
```

Example output:
```
📊 Basic Info:
  File size: 1,234 KB
  Total lines: 18,533
📝 Structure:
  Headings: 320
  Tables: 4,400 rows
🖼️ Images:
  References: 645
Quality Score: 95/100 (✅ Excellent)
```
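
The counts in such a report can be approximated with a few regular expressions. This is a rough sketch, not the `quality_check` module itself: the function name and dictionary keys are assumptions, and it does not attempt to reproduce the quality score, whose formula is not documented here.

```python
import re

HEADING_RE = re.compile(r"^#{1,6} ")                # ATX headings: '# ' to '###### '
TABLE_ROW_RE = re.compile(r"^\|.*\|\s*$")           # lines that start and end with '|'
IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]+\)")      # ![alt](path)


def collect_stats(markdown: str) -> dict:
    """Count lines, headings, table rows, and image references."""
    lines = markdown.splitlines()
    return {
        "total_lines": len(lines),
        "headings": sum(1 for line in lines if HEADING_RE.match(line)),
        "table_rows": sum(1 for line in lines if TABLE_ROW_RE.match(line)),
        "image_references": len(IMAGE_RE.findall(markdown)),
    }
```

The counts are approximate: table separator rows such as `|---|---|` are included, and lines inside fenced code blocks are not excluded. Passing the result to `json.dumps` gives a machine-readable form.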

## Benchmark

Tested on a 550-page Chinese technical manual containing 645 images:

| Tool | Time | Images | Text Accuracy | Overall |
|------|:----:|:------:|:-------------:|:-------:|
| markitdown only | 15s | ❌ 0 | 95% | ⭐⭐⭐ |
| PyMuPDF only | 8s | ✅ 645 | 90% | ⭐⭐⭐⭐ |
| **This tool** | 20s | ✅ 645 | 95% | ⭐⭐⭐⭐⭐ |
| marker-pdf | 120s | ✅ 620 | 98% | ⭐⭐⭐⭐⭐ |

## Installation

```bash
# Minimal (recommended)
pip install pymupdf "markitdown[all]"

# For scanned PDFs (OCR)
pip install marker-pdf

# From source
git clone https://github.com/Leomeie/pdf-to-markdown.git
cd pdf-to-markdown
pip install -e .
```

## Tool Comparison

See [docs/tool-comparison.md](docs/tool-comparison.md) for a detailed comparison of PDF-to-Markdown tools.

## Related Projects

- [markitdown](https://github.com/microsoft/markitdown) - Microsoft's document converter (133k+ ⭐)
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF) - Fast PDF library (5k+ ⭐)
- [marker](https://github.com/VikParuchuri/marker) - High-quality PDF converter (20k+ ⭐)

## License

MIT

## Contributing

Contributions welcome! Please open an issue or PR.
