Metadata-Version: 2.4
Name: vlmparse
Version: 0.1.25
Requires-Python: >=3.11.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: devtools>=0.12.2
Requires-Dist: docker>=7.1.0
Requires-Dist: html-to-markdown>=3.0.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: loguru>=0.7.3
Requires-Dist: nest-asyncio>=1.6.0
Requires-Dist: numpy>=2.3.2
Requires-Dist: openai>=1.102.0
Requires-Dist: orjson>=3.11.3
Requires-Dist: pillow>=11.3.0
Requires-Dist: pydantic
Requires-Dist: pypdfium2>=5.0.0
Requires-Dist: lxml>=6.0.2
Requires-Dist: tabulate>=0.9.0
Requires-Dist: beautifulsoup4>=4.14.2
Requires-Dist: typer>=0.19.2
Provides-Extra: dev
Requires-Dist: jupyter; extra == "dev"
Provides-Extra: docling-core
Requires-Dist: docling-core; extra == "docling-core"
Provides-Extra: docling-parse
Requires-Dist: docling-parse; extra == "docling-parse"
Requires-Dist: rapidfuzz>=3.14.0; extra == "docling-parse"
Provides-Extra: st-app
Requires-Dist: streamlit>=1.49.0; extra == "st-app"
Provides-Extra: bench
Requires-Dist: html-to-markdown>=1.9.0; extra == "bench"
Requires-Dist: loguru>=0.7.3; extra == "bench"
Requires-Dist: nest-asyncio>=1.6.0; extra == "bench"
Requires-Dist: numpy>=2.3.2; extra == "bench"
Requires-Dist: pillow>=11.3.0; extra == "bench"
Requires-Dist: pydantic; extra == "bench"
Requires-Dist: rapidfuzz>=3.14.0; extra == "bench"
Requires-Dist: unidecode>=1.4.0; extra == "bench"
Requires-Dist: fire>=0.7.1; extra == "bench"
Requires-Dist: lxml>=6.0.2; extra == "bench"
Requires-Dist: datasets>=4.4.1; extra == "bench"
Requires-Dist: openpyxl>=3.1.5; extra == "bench"
Requires-Dist: joblib>=1.5.2; extra == "bench"
Requires-Dist: playwright; extra == "bench"
Requires-Dist: fuzzysearch>=0.8.1; extra == "bench"
Provides-Extra: test
Requires-Dist: pre-commit; extra == "test"
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
Requires-Dist: pytest-mock; extra == "test"
Requires-Dist: ruff; extra == "test"
Requires-Dist: isort; extra == "test"
Requires-Dist: ty>=0.0.21; extra == "test"
Dynamic: license-file

# vlmparse

<div align="center">

[[📜 arXiv]](https://arxiv.org/abs/2602.11960) | [[Dataset (🤗Hugging Face)]](https://huggingface.co/datasets/pulsia/fr-bench-pdf2md) | [[pypi]](https://pypi.org/project/vlmparse/) | [[vlmparse]](https://github.com/ld-lab-pulsia/vlmparse) | [[Benchmark]](https://github.com/ld-lab-pulsia/benchpdf2md) | [[Leaderboard]](https://huggingface.co/spaces/pulsia/fr-bench-pdf2md)

</div>

A unified wrapper for Vision Language Models (VLM) and OCR solutions to parse PDF documents into Markdown.

Features:

- ⚡ Async/concurrent processing for high throughput
- 🐳 Automatic Docker server management for local models
- 🔄 Unified interface across all VLM/OCR providers
- 📊 Built-in result visualization with Streamlit

Supported Converters:

- **Open Source Small VLMs**: `LightOnOCR-1B-1025`, `LightOnOCR-2-1B`, `MinerU2.5-2509-1.2B`, `HunyuanOCR`, `PaddleOCR-VL-1.5`, `granite-docling-258M`, `olmOCR-2-7B-1025-FP8`, `dots.ocr`, `dots.ocr-1.5`, `dots.mocr`, `chandra`, `chandra-ocr-2`, `DeepSeek-OCR`, `DeepSeek-OCR-2`, `Nanonets-OCR2-3B`, `GLM-OCR`, `FireRed-OCR`, `OCRVerse`, `Qianfan-OCR`
- **Open Source Generalist VLMs**: any open-weight generalist VLM, such as the Qwen family.
- **Pipelines**: `docling`
- **Proprietary LLMs**: `gemini`, `gpt`

## Installation

The simplest option, installing only the CLI:

```bash
uv tool install vlmparse
```

If you want to run the granite-docling model or use the Streamlit viewing app:

```bash
uv tool install 'vlmparse[docling-core,st-app]'
```

If you prefer cloning the repository and using the local version:
```bash
uv sync
```

With optional dependencies:

```bash
uv sync --all-extras
```

Activate the virtual environment:
```bash
source .venv/bin/activate
```

## CLI Usage

Note that you can skip the installation step above and instead prefix each of the commands below with `uvx`.

### Convert PDFs

With a general-purpose VLM (requires setting your API key as an environment variable):

```bash
vlmparse convert "*.pdf" -o ./output --model gemini-2.5-flash-lite
```

Convert with automatic deployment of a small VLM (or any Hugging Face VLM model; requires a GPU and a Docker installation):

```bash
vlmparse convert "*.pdf" -o ./output --model nanonets/Nanonets-OCR2-3B
```

### Deploy a local model server

Deployment (requires a GPU and a Docker installation):
- You need a GPU dedicated to this.
- Check that the port is not already used by another service.

```bash
vlmparse serve lightonocr2 --port 8000 --gpu 1
```

Then convert:

```bash
vlmparse convert "*.pdf" -o ./output --uri http://localhost:8000/v1
```

You can also list all running servers:

```bash
vlmparse list
```

You can list the registered models and their capabilities (`ocr`, `ocr_layout`, `table`, `image_description`) with:

```bash
vlmparse registry
```

Show logs of a server (if only one server is running, the container name is not needed):
```bash
vlmparse log <container_name>
```

Stop a server (if only one server is running, the container name is not needed):
```bash
vlmparse stop <container_name>
```

### View conversion results with Streamlit

```bash
vlmparse view ./output
```

## Configuration

Set API keys as environment variables:

```bash
export GOOGLE_API_KEY="your-key"
export OPENAI_API_KEY="your-key"
```

## Python API

Client interface:

```python
from vlmparse.registries import converter_config_registry

# Get a converter configuration
config = converter_config_registry.get("gemini-2.5-flash-lite")
client = config.get_client()

# Convert a single PDF
document = client("path/to/document.pdf")
print(document.to_markdown())

# Batch convert multiple PDFs
documents = client.batch(["file1.pdf", "file2.pdf"])
```

Docker server interface:

```python
from vlmparse.registries import docker_config_registry

config = docker_config_registry.get("lightonocr")
server = config.get_server()
server.start()

# Client calls...

server.stop()
```


Converter with automatic server management:

```python
from vlmparse.converter_with_server import ConverterWithServer

with ConverterWithServer(model="mineru25") as converter_with_server:
    documents = converter_with_server.parse(inputs=["file1.pdf", "file2.pdf"], out_folder="./output")
```

Note that if you pass the URI of a vLLM server to `ConverterWithServer`, the model name is inferred automatically and no server is started.

## Credits

This work was carried out by members of [Probayes](https://www.probayes.com/) and [OpenValue](https://openvalue.co/), two subsidiaries of La Poste.
