Metadata-Version: 2.4
Name: memect-ppx
Version: 0.2.2
Summary: memect-ppx is a PDF / image document parsing tool that converts input files into structured Markdown and JSON. It supports a local model (default) and multiple LLM backends (DeepSeek, PaddleOCR, GLM, etc.), and is suited for high-accuracy document-understanding scenarios.
License-Expression: PolyForm-Noncommercial-1.0.0
Project-URL: Documentation, https://github.com/memect/memect-ppx#readme
Project-URL: Homepage, https://github.com/memect/memect-ppx
Project-URL: Issues, https://github.com/memect/memect-ppx/issues
Project-URL: License, https://github.com/memect/memect-ppx/blob/HEAD/LICENSE
Project-URL: Repository, https://github.com/memect/memect-ppx
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: anyio>=4.12.1
Requires-Dist: beautifulsoup4>=4.14.3
Requires-Dist: concurrent-log-handler>=0.9.29
Requires-Dist: fastapi>=0.135.1
Requires-Dist: filetype>=1.2.0
Requires-Dist: fonttools>=4.62.1
Requires-Dist: freetype-py>=2.5.1
Requires-Dist: httpx[brotli,http2,socks]>=0.28.1
Requires-Dist: openai>=2.29.0
Requires-Dist: orjson>=3.11.7
Requires-Dist: pdf-oxide>=0.3.17
Requires-Dist: pillow>=12.1.1
Requires-Dist: pptx-ea-font>=0.0.4
Requires-Dist: pydantic>=2.12.5
Requires-Dist: pymupdf>=1.27.2
Requires-Dist: pypdf>=6.9.1
Requires-Dist: python-multipart>=0.0.22
Requires-Dist: python-pptx>=1.0.2
Requires-Dist: rapid-layout>=1.2.1
Requires-Dist: rapidocr>=3.8.1
Requires-Dist: rich>=14.3.3
Requires-Dist: shapely>=2.1.2
Requires-Dist: table-cls>=1.2.0
Requires-Dist: typer>=0.24.1
Requires-Dist: uvicorn[standard]>=0.42.0
Requires-Dist: openvino==2025.4.1; sys_platform == "darwin" and platform_machine == "x86_64"
Requires-Dist: openvino; sys_platform != "darwin" or platform_machine != "x86_64"
Requires-Dist: psutil>=7.2.2
Requires-Dist: rapid-latex-ocr>=0.0.6
Requires-Dist: filelock>=3.25.2
Requires-Dist: pyclipper>=1.4.0
Requires-Dist: rapidfuzz>=3.14.5
Provides-Extra: cuda
Requires-Dist: nvidia-cuda-runtime-cu12; extra == "cuda"
Requires-Dist: nvidia-cudnn-cu12; extra == "cuda"
Requires-Dist: nvidia-cublas-cu12; extra == "cuda"
Requires-Dist: nvidia-cufft-cu12; extra == "cuda"
Requires-Dist: nvidia-curand-cu12; extra == "cuda"
Requires-Dist: nvidia-cuda-nvrtc-cu12; extra == "cuda"
Requires-Dist: nvidia-nvjpeg-cu12; extra == "cuda"
Dynamic: license-file

<p align="center">
  <img src="docs/images/logo.png" alt="PPX Logo" width="60" style="vertical-align:middle"/> &nbsp;<strong style="font-size:1.5em">PPX — High-Accuracy PDF & Image Parser</strong>
</p>

[![PyPI version](https://img.shields.io/pypi/v/memect-ppx.svg)](https://pypi.org/project/memect-ppx/)
[![PyPI downloads](https://img.shields.io/pypi/dm/memect-ppx.svg)](https://pypi.org/project/memect-ppx/)
[![Python](https://img.shields.io/badge/python-%3E%3D3.12-blue)](https://www.python.org/)
[![License](https://img.shields.io/badge/license-PolyForm%20Noncommercial%201.0.0-orange)](LICENSE)
[![Issues](https://img.shields.io/github/issues/memect/memect-ppx)](https://github.com/memect/memect-ppx/issues)

[简体中文](README_zh-CN.md) | English

---

**Convert PDF and images to structured Markdown / JSON — locally, accurately, production-ready.**

PPX is a source-available document parsing engine built for high-fidelity extraction of text, tables, figures, formulas, and layout from PDFs and images. It ships with a built-in OCR + layout pipeline and optionally offloads recognition to state-of-the-art LLM backends (DeepSeek-OCR, PaddleOCR-VL, GLM-OCR).

- **What output do I get?** — Markdown and JSON; every object carries page coordinates.
- **Do I need a GPU?** — No. The default backend runs on CPU. GPU (CUDA) is optional for throughput.
- **Does it handle scanned PDFs?** — Yes. OCR is applied automatically when native text is absent.
- **Can I use my own LLM?** — Yes. Any OpenAI-compatible endpoint is accepted via `--backend`.
- **Is it embeddable?** — Yes, as a library; it is free for personal, research, and noncommercial use. For commercial use, contact `contact@memect.co`.

---

## Install

```bash
# Requires Python >= 3.12
uv venv -p 3.12
source .venv/bin/activate

# Upgrade an existing install
uv pip install --upgrade memect-ppx
# In a pre-existing environment, uninstall conflicting packages first:
# uv pip uninstall opencv-python opencv-contrib-python opencv-python-headless opencv-contrib-python-headless
# uv pip uninstall onnxruntime onnxruntime-gpu

# CPU version
uv pip install memect-ppx
uv pip install onnxruntime --no-config
# or opencv-contrib-python-headless
uv pip install opencv-contrib-python --no-config

# GPU version
# This installs the CUDA libraries PPX depends on. Skip it if they are already
# installed system-wide; the versions must match what onnxruntime-gpu expects.
# For other versions, install the libraries that onnxruntime-gpu requires.
uv pip install "memect-ppx[cuda]"
uv pip install onnxruntime-gpu --no-config
# or opencv-contrib-python-headless
uv pip install opencv-contrib-python --no-config


# Method 2: install from source
git clone https://github.com/memect/memect-ppx ppx
cd ppx
uv venv -p 3.12
# On older operating systems (<= Ubuntu 20.04), add:
# --no-install-package pdf-oxide
uv sync --no-install-project
# For GPU support (skip this if the CUDA libraries are already installed
# system-wide, or install a different version if needed):
uv sync --extra cuda --no-install-project

# These two must be installed manually:
# or opencv-contrib-python-headless
uv pip install opencv-contrib-python --no-config
# or onnxruntime-gpu
uv pip install onnxruntime --no-config


# Which command to run:
# installed as a package: ppx
# cloned from source:     ./ppx

# Default parsing
ppx parse a.pdf

# LLM-based parsing: just point at the URL. Currently only models such as
# deepseek-ocr, paddleocr-vl, and glm-ocr are supported.
ppx parse a.pdf --llm http://127.0.0.1:4000/v1
# If the model name does not contain deepseek, paddle, or glm, specify it
# explicitly:
ppx parse a.pdf --llm '{"name":"deepseek","base_url":"http://127.0.0.1:4000/v1","model":"xxxx","api_key":""}'

# For frequent use, put the settings in a config file
mkdir conf
# Either a JSON file or a Python file defining: settings = {}
# See src/memect/conf/settings.custom.py for the syntax
vi conf/settings.py
vi conf/log.py

# Once the paths and models are set in the config file, they no longer need to
# be passed on the command line
ppx parse a.pdf --backend deepseek
```

PPX uses the pipeline mode by default. The parsed Markdown is typically written
to `output/doc.md` when `-o output/` is provided.

---

## What Problems Does This Solve?

| Problem | How PPX Handles It |
| ------- | ------------------ |
| Native-text PDF with invisible/garbled characters | Detects encoding anomalies; falls back to OCR per page |
| Scanned document with no embedded text | Full-page OCR or vLLM backend |
| Complex table spanning multiple columns/rows | LLM-based structural parsing, `colspan`/`rowspan` preserved |
| Math-heavy academic paper | LaTeX formula extraction |
| Batch processing thousands of files | Directory-level `parse dir/` with `-o output/` |

---

## Example Outputs

### Mixed table content

This example shows a mixed table scenario where the table body contains
editable text, while much of the header area is still image-based.

Input snippet:

![Input snippet](docs/release/assets/局部图片/ori_image.png)

Markdown output:

![Markdown output](docs/release/assets/局部图片/pdf2md.png)

JSON output:

![JSON output](docs/release/assets/局部图片/pdf2json.png)

### Scanned English table

This example shows the parsing result for a scanned English-language table.

Markdown output:

![Scanned table Markdown output](docs/release/assets/英文扫描件表格解析/pdf2md.png)

JSON output:

![Scanned table JSON output](docs/release/assets/英文扫描件表格解析/pdf2json.png)

---

## Benchmarks

See [docs/BENCHMARKS.md](docs/BENCHMARKS.md) for benchmark results, citation,
attribution, and compliance notes.

---

## Capability Matrix

| Capability | Default (Local) | DeepSeek-OCR | PaddleOCR-VL | GLM-OCR |
| ---------- | :-------------: | :----------: | :----------: | :-----: |
| Text extraction | ✅ | ✅ | ✅ | ✅ |
| Per-character coordinates | ✅ | ❌ | ❌ | ❌ |
| Table structure (colspan / rowspan) | ✅ | ✅ | ✅ | ✅ |
| Formula → LaTeX | ✅ | ✅ | ✅ | ✅ |
| Figure region extraction | ✅ | ✅ | ✅ | ✅ |
| CPU-only mode | ✅ | ✅ | ✅ | ✅ |
| CUDA acceleration | ✅ | ✅ | ✅ | ✅ |
| No external service required | ✅ | ❌ | ❌ | ❌ |

---

## Which Backend Should I Use?

| Scenario | Recommended Backend |
| -------- | ------------------- |
| Privacy-sensitive documents, air-gapped environment | `default` |
| Highest accuracy on complex layouts | `deepseek` |
| Good accuracy, lighter GPU footprint (~10 GB) | `paddle` |
| Fast inference with speculative decoding | `glm` |
| Quick integration test / CI pipeline | `default` (CPU) |

---

## Quick Start

### Default pipeline mode

```bash
ppx parse <input_path> -o <output_path>

# Example
ppx parse report.pdf -o output/
```

### Parse a single file

```bash
# Auto-detect whether OCR is needed
ppx parse report.pdf

# Force OCR on every page
ppx parse report.pdf --ocr yes

# Skip OCR entirely
ppx parse report.pdf --ocr no

# Parse an image
ppx parse scan.png
```

### Batch processing

```bash
# Parse all PDFs and images in a directory
ppx parse docs/

# Write output to a specific directory
ppx parse docs/ -o output/
```

### Use an LLM backend

```bash
# DeepSeek-OCR (requires ~20 GB VRAM via vLLM)
ppx parse report.pdf --backend deepseek \
  --deepseek '{"base_url":"http://127.0.0.1:4000/v1","model":"deepseek-ocr-2","api_key":""}'

# PaddleOCR-VL (requires ~10 GB VRAM)
ppx parse report.pdf --backend paddle \
  --paddle '{"base_url":"http://127.0.0.1:4001/v1","model":"paddleocr-vl","api_key":""}'

# GLM-OCR (requires ~10 GB VRAM)
ppx parse report.pdf --backend glm \
  --glm '{"base_url":"http://127.0.0.1:4002/v1","model":"glmocr","api_key":""}'
```

### Persist configuration

Tired of typing the same flags? Drop a config file:

```bash
mkdir conf
# conf/settings.py  (Python dict) or conf/settings.json
# Reference: src/memect/conf/settings.custom.py
```

```python
# conf/settings.py
settings = {
    "pdf_parser.deepseek.model.base_url": "http://127.0.0.1:4000/v1",
    "pdf_parser.paddle.model.base_url": "http://127.0.0.1:4001/v1",
    "pdf_parser.glm.model.base_url": "http://127.0.0.1:4002/v1",
}
```
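If you prefer JSON, the same settings can be kept in `conf/settings.json` instead; this sketch assumes the JSON file simply mirrors the keys of the Python dict:

```json
{
  "pdf_parser.deepseek.model.base_url": "http://127.0.0.1:4000/v1",
  "pdf_parser.paddle.model.base_url": "http://127.0.0.1:4001/v1",
  "pdf_parser.glm.model.base_url": "http://127.0.0.1:4002/v1"
}
```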

Now just run:

```bash
ppx parse report.pdf --backend deepseek
```

---

## Use from Python

PPX can be used directly as a library. If you call it repeatedly, a single global `Parser` instance is usually enough.

```python
from memect.pdf.parser import Parser
from memect.pdf.base import KDocument, KDocumentFactory

# If you call it repeatedly, a single global parser is usually enough.
# If no arguments are passed, the default settings are used.
with Parser() as parser:
    doc = KDocument("/path/your.pdf")
    parser.parse(doc)

# Batch parsing with multiprocessing and default settings.
doc = KDocumentFactory("/path/your.pdf", params=None)
docs = [doc]
Parser.batch(docs, max_workers=1)
```

---

## CLI Reference

```text
ppx parse <path> [OPTIONS]

Arguments:
  path          PDF file, image file, or directory

Options:
  --backend     default | deepseek | paddle | glm    (default: default)
  --ocr         yes | no | auto                      (default: auto)
  --table       no | ybk | wbk | auto | llm          (default: auto)
  --pages       Page range, e.g. "1-5,10"
  --mode        page | tree                          (default: page)
  -o, --output  Output directory
```
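The `--pages` syntax combines single pages and inclusive ranges separated by commas. As a purely illustrative sketch (not PPX's actual implementation), a spec like `"1-5,10"` can be expanded as follows:

```python
def expand_pages(spec: str) -> list[int]:
    """Expand a page-range spec like "1-5,10" into a sorted list of pages.

    Illustrative only; PPX's internal parsing may differ.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))  # inclusive range
        else:
            pages.add(int(part))
    return sorted(pages)

print(expand_pages("1-5,10"))  # [1, 2, 3, 4, 5, 10]
```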

Other subcommands:

```text
ppx start               Launch HTTP API server
```

## Output Format

Each parsed document is written to `<input>.out/`:

```text
report.pdf.out/
├── doc.md          # full document in Markdown
├── doc.json        # full structured data with per-object coordinates
├── pages/          # per-page breakdown (one entry per page)
└── images/         # extracted figures/images (present when figures are detected)
```

| Path | Description |
| ---- | ----------- |
| `doc.md` | Markdown with figure references |
| `doc.json` | JSON tree: document → pages → objects, each with bounding-box coordinates |
| `pages/` | Per-page Markdown and JSON, useful for page-level processing |
| `images/` | Extracted image regions; only present when the document contains figures |

---

## Platform Support

| Platform | Python | CPU | CUDA | Notes |
| -------- | ------ | :-: | :--: | ----- |
| Linux | >= 3.12 | ✅ | ✅ | Recommended for production |
| macOS (Apple Silicon) | >= 3.12 | ✅ | ❌ | |
| macOS (Intel) | 3.12 – 3.13 | ✅ | ❌ | Capped by OpenVINO |
| Windows | >= 3.12 | ✅ | ✅ | Community-tested |

CUDA requires NVIDIA driver + CUDA 12.x and `onnxruntime-gpu` built for that CUDA version.
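To check whether `onnxruntime-gpu` can actually see the GPU, you can inspect the available execution providers; `CUDAExecutionProvider` and `get_available_providers()` are onnxruntime's standard names:

```python
def has_cuda(providers: list[str]) -> bool:
    """Return True if the CUDA execution provider is in the list."""
    return "CUDAExecutionProvider" in providers


if __name__ == "__main__":
    try:
        import onnxruntime as ort

        available = ort.get_available_providers()
        print(available)
        print("CUDA available:", has_cuda(available))
    except ImportError:
        print("onnxruntime is not installed in this environment")
```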

---

## Launching LLM Services

PPX LLM backends are served via [vLLM](https://github.com/vllm-project/vllm).

```bash
# Common environment variables; they can also be prefixed to the command line
export CUDA_VISIBLE_DEVICES=0
# Inside mainland China, ModelScope is recommended. The model IDs below are
# ModelScope IDs; the HuggingFace ones may differ.
export VLLM_USE_MODELSCOPE=True
```

### DeepSeek-OCR-2 (~20 GB VRAM)

[ModelScope](https://modelscope.cn/models/deepseek-ai/DeepSeek-OCR-2) — note: vllm==0.19.1 produces garbled output; use a newer version.

```bash
vllm serve deepseek-ai/DeepSeek-OCR-2 \
  --served-model-name deepseek-ocr-2 \
  --logits-processors vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor \
  --mm-processor-cache-gb 0 \
  --no-enable-prefix-caching \
  --gpu-memory-utilization 0.8 \
  --port 4000
```

### PaddleOCR-VL / PaddleOCR-VL-1.5 (~10 GB VRAM)

[ModelScope PaddleOCR-VL](https://modelscope.cn/models/PaddlePaddle/PaddleOCR-VL) · [PaddleOCR-VL-1.5](https://modelscope.cn/models/PaddlePaddle/PaddleOCR-VL-1.5)

```bash
# PaddleOCR-VL
vllm serve PaddlePaddle/PaddleOCR-VL \
  --served-model-name paddleocr-vl \
  --trust-remote-code \
  --max-num-batched-tokens 16384 \
  --no-enable-prefix-caching \
  --mm-processor-cache-gb 0 \
  --gpu-memory-utilization 0.5 \
  --port 4001

# PaddleOCR-VL-1.5 (same model name and port; config otherwise unchanged)
vllm serve PaddlePaddle/PaddleOCR-VL-1.5 \
  --served-model-name paddleocr-vl \
  --trust-remote-code \
  --max-num-batched-tokens 16384 \
  --no-enable-prefix-caching \
  --mm-processor-cache-gb 0 \
  --gpu-memory-utilization 0.5 \
  --port 4001
```

### GLM-OCR (~10 GB VRAM)

[ModelScope](https://modelscope.cn/models/ZhipuAI/GLM-OCR)

```bash
vllm serve ZhipuAI/GLM-OCR \
  --served-model-name glmocr \
  --max-num-batched-tokens 16384 \
  --max-model-len 16384 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --gpu-memory-utilization 0.5 \
  --port 4002
```

## FAQ

### Does PPX support password-protected PDFs?

Not currently. Strip the password with a tool like `qpdf` before passing the file to PPX.
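For example, with `qpdf` installed, the password can be removed in one command (replace `PASSWORD` and the file names with your own):

```shell
# Write an unencrypted copy of a password-protected PDF
qpdf --password=PASSWORD --decrypt locked.pdf unlocked.pdf

# Then parse the decrypted copy
ppx parse unlocked.pdf
```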

### How do I resolve `opencv` version conflicts?

Uninstall all existing opencv variants first, then reinstall:

```bash
uv pip uninstall opencv-python opencv-contrib-python \
                  opencv-python-headless opencv-contrib-python-headless
uv pip install opencv-contrib-python --no-config
```

### `ImportError: libGL.so.1` on Linux servers

Install the headless OpenCV variant instead:

```bash
uv pip install opencv-python-headless
```

Or install the system library: `sudo apt-get install -y libgl1`

### Can `onnxruntime` and `onnxruntime-gpu` coexist?

No. Install exactly one. The GPU variant must match your system's CUDA version.

### Can I use PPX on Mac with GPU acceleration?

No. Neither Apple Silicon nor Intel Macs support CUDA. The CPU backend works on both.

### Can I embed PPX in a commercial product?

Not under the default license. PPX is free for personal, research, and noncommercial use. For commercial use, contact `contact@memect.co`.

### How do I parse only specific pages?

```bash
ppx parse report.pdf --pages "1-5,10,15-20"
```

---

## Product Experience

Web experience for pdf2x: <https://pdf2x.cn/>

[Apply for a free API key](https://pdf2x.cn/api/apikey/page) to call the API.

Mini Program experience:

![pdf2x Mini Program code](docs/images/pdf2x.jpg)

---

## Contributing

We welcome bug reports, feature requests, and pull requests.

1. Fork the repository and create a feature branch.
2. Run tests: `uv run pytest`
3. Submit a PR — please describe the motivation and include test cases.

See [CONTRIBUTING.md](CONTRIBUTING.md) for full guidelines.

---

## Star History

<a href="https://www.star-history.com/?repos=memect%2Fmemect-ppx&type=date&legend=top-left">
 <picture>
   <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/chart?repos=memect/memect-ppx&type=date&theme=dark&legend=top-left" />
   <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/chart?repos=memect/memect-ppx&type=date&legend=top-left" />
   <img alt="Star History Chart" src="https://api.star-history.com/chart?repos=memect/memect-ppx&type=date&legend=top-left" />
 </picture>
</a>

---

## License

PPX is released under the [PolyForm Noncommercial License 1.0.0](LICENSE).

PPX is free for personal, research, and noncommercial use. For commercial use, contact `contact@memect.co`.

For bundled third-party code and assets, see [NOTICE](NOTICE) and [docs/THIRD_PARTY_LICENSES.md](docs/THIRD_PARTY_LICENSES.md). Those files document attribution and redistribution review items for vendored components and bundled resources shipped with this repository.
