Metadata-Version: 2.4
Name: paddleocr_api
Version: 0.0.1
Summary: A Python async SDK that wraps the PaddleOCR AI Studio API into a clean, type-safe interface.
Author-email: Jerry <wujr24@m.fudan.edu.cn>
License: Apache-2.0
Project-URL: Homepage, https://gitee.com/Jerry-Wu-Gitee/paddleocr-api-python
Project-URL: Repository, https://gitee.com/Jerry-Wu-Gitee/paddleocr-api-python.git
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: aiofiles
Requires-Dist: httpx
Requires-Dist: python-dotenv
Requires-Dist: typing-extensions
Dynamic: license-file

# paddleocr-api-python

[English](#english) | [中文](#中文)

---

## English

A Python async SDK that wraps the [PaddleOCR AI Studio API](https://aistudio.baidu.com/) into a clean, type-safe interface. Upload a document, await the result, and get Markdown back — without touching raw HTTP.

### Features

- **Async-first** — built on `httpx.AsyncClient` and `asyncio`, with native context manager support.
- **Full model coverage** — `PaddleOCR-VL-1.6` (default), `PaddleOCR-VL-1.5`, `PaddleOCR-VL`, `PP-OCRv5`, and `PaddleOCR`.
- **Flexible input** — submit by local file path, raw bytes, or remote URL.
- **Rich job control** — poll real-time state, extracted page count, start/end times, and error messages.
- **Markdown export** — get a clean Markdown document plus the URLs of all embedded images.
- **Fine-grained options** — toggle layout detection, chart/seal/table recognition, cross-page table merging, title leveling, NMS, image orientation correction, and more.

### Installation

```bash
pip install paddleocr_api
```

Dependencies: `aiofiles`, `httpx`, `typing-extensions`, `python-dotenv`.

### Authentication

Get an access token from [https://aistudio.baidu.com/account/accessToken](https://aistudio.baidu.com/account/accessToken).

Either pass it explicitly:

```python
from paddleocr_api import AistudioClient

client = AistudioClient(api_key="your_token_here")
```

Or set it via an environment variable (a `.env` file is loaded automatically):

```
AISTUDIO_ACCESS_TOKEN=your_token_here
```
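
The lookup order described above can be sketched with the standard library alone. This is an illustration, not the SDK's actual code: `resolve_token` is a hypothetical helper, and the rule that an explicit `api_key` wins over the environment variable is an assumption.

```python
import os
from typing import Optional

ENV_VAR = "AISTUDIO_ACCESS_TOKEN"  # variable name taken from the README


def resolve_token(api_key: Optional[str] = None) -> str:
    """Return the explicit token if given, otherwise fall back to the environment.

    Hypothetical helper: the precedence (explicit argument first) is assumed.
    """
    token = api_key or os.environ.get(ENV_VAR)
    if not token:
        raise RuntimeError(f"No token found: pass api_key=... or set {ENV_VAR}")
    return token
```

Failing early with a clear message tends to be easier to debug than an authentication error returned by the first request.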

### Quick Start

```python
import asyncio
from paddleocr_api import AistudioClient, State

async def main():
    async with AistudioClient() as client:
        job = await client.create_job(file_path="paper.pdf")

        async with job:
            while True:
                state = await job.state
                if state == State.DONE:
                    break
                if state == State.FAILED:
                    raise RuntimeError(await job.error_message)
                await asyncio.sleep(5)

            markdown = await job.markdown
            with open("output.md", "w", encoding="utf-8") as f:
                f.write(markdown.text)

asyncio.run(main())
```
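
The loop above polls forever if a job never leaves `RUNNING`. A bounded variant might look like the sketch below. It is self-contained and does not import the SDK: the `State` class is a stand-in whose member values are invented, and `wait_until_finished` is a hypothetical helper.

```python
import asyncio
import enum
from typing import Awaitable, Callable


class State(enum.Enum):
    """Stand-in for the SDK's State enum; only the member names come from the README."""
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


async def wait_until_finished(
    get_state: Callable[[], Awaitable[State]],
    interval: float = 5.0,
    timeout: float = 600.0,
) -> State:
    """Poll get_state() until DONE or FAILED; raise asyncio.TimeoutError after `timeout` seconds."""

    async def _poll() -> State:
        while True:
            state = await get_state()
            if state in (State.DONE, State.FAILED):
                return state
            await asyncio.sleep(interval)

    # wait_for cancels the polling coroutine once the deadline passes.
    return await asyncio.wait_for(_poll(), timeout=timeout)
```

With a real job, `get_state` could be a small wrapper such as `async def get_state(): return await job.state`, with `State` imported from the package instead of redefined.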

### Submitting Jobs

`create_job` accepts three input modes:

```python
# From a local path
await client.create_job(file_path="doc.pdf")

# From bytes already in memory
await client.create_job(file_bytes=pdf_bytes)

# From a public URL
await client.create_job(file_url="https://example.com/doc.pdf")
```
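
When the source arrives as a single value (for example from a command-line argument), a small dispatcher can choose the keyword to pass. The sketch below is a hypothetical helper, not part of the SDK; the only names taken from the README are `file_path`, `file_bytes`, and `file_url`.

```python
from pathlib import Path
from typing import Any, Dict, Union


def source_kwargs(source: Union[str, bytes, bytearray, Path]) -> Dict[str, Any]:
    """Map one source value to exactly one of file_path / file_bytes / file_url."""
    if isinstance(source, (bytes, bytearray)):
        return {"file_bytes": bytes(source)}
    if isinstance(source, Path):
        # Path objects are always treated as local files.
        return {"file_path": str(source)}
    if source.startswith(("http://", "https://")):
        return {"file_url": source}
    return {"file_path": source}
```

The result can then be unpacked into the call, as in `await client.create_job(**source_kwargs(src))`.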

### Selecting a Model

```python
from paddleocr_api import Model

await client.create_job(
    file_path="doc.pdf",
    model=Model.PADDLE_OCR_VL_1_6,  # default
)
```

| Model | Notes |
|---|---|
| `PaddleOCR-VL-1.6` | Default. Latest vision-language model. |
| `PaddleOCR-VL-1.5` | Scheduled for retirement on 2026-06-17. |
| `PaddleOCR-VL` | Base VL model. |
| `PP-OCRv5` | Classic OCR pipeline. |
| `PaddleOCR` | Base OCR. |

### Optional Payload

Pass an `OptionalPayload` dict to fine-tune recognition behavior:

```python
from paddleocr_api import LayoutShapeMode

await client.create_job(
    file_path="doc.pdf",
    optional_payload={
        "useLayoutDetection": True,
        "useChartRecognition": True,
        "useSealRecognition": True,
        "mergeTables": True,
        "relevelTitles": True,
        "layoutShapeMode": LayoutShapeMode.AUTO,
        "repetitionPenalty": 1.0,
        "temperature": 0.0,
        "topP": 1.0,
    },
)
```

Key options:

| Field | Default | Purpose |
|---|---|---|
| `useDocOrientationClassify` | `False` | Auto-correct 0/90/180/270° rotation. |
| `useDocUnwarping` | `False` | Flatten warped or wrinkled pages. |
| `useLayoutDetection` | `True` | Region-aware parsing. Disable for single-region docs. |
| `useChartRecognition` | `False` | Convert charts to tables. |
| `useSealRecognition` | `True` | Extract seal text. |
| `useOcrForImageBlock` | `False` | OCR inside image regions. |
| `mergeTables` | `True` | Merge tables that span pages. |
| `relevelTitles` | `True` | Infer heading hierarchy. |
| `repetitionPenalty` | `1.0` | Raise to suppress repeated output. |
| `temperature` | `0.0` | Lower for stability, higher to reduce omissions. |
| `topP` | `1.0` | Lower for more conservative output. |
| `layoutNms` | `True` | Drop overlapping detection boxes. |
| `markdownIgnoreLabels` | all | Filter headers, footers, page numbers, footnotes, etc. |
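
One way to keep option handling in a single place is to start from the documented defaults and apply overrides on top. The sketch below is illustrative only: `DEFAULT_OPTIONS` and `build_payload` are hypothetical names, the values are transcribed from the table, and whether the API treats an explicitly sent default the same as an omitted field is an assumption.

```python
from typing import Any, Dict

# Defaults transcribed from the table above. markdownIgnoreLabels is left out
# because its default is described only as "all".
DEFAULT_OPTIONS: Dict[str, Any] = {
    "useDocOrientationClassify": False,
    "useDocUnwarping": False,
    "useLayoutDetection": True,
    "useChartRecognition": False,
    "useSealRecognition": True,
    "useOcrForImageBlock": False,
    "mergeTables": True,
    "relevelTitles": True,
    "repetitionPenalty": 1.0,
    "temperature": 0.0,
    "topP": 1.0,
    "layoutNms": True,
}


def build_payload(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the defaults, type-checking the fields listed in the table."""
    payload = dict(DEFAULT_OPTIONS)
    for key, value in overrides.items():
        default = DEFAULT_OPTIONS.get(key)
        if isinstance(default, bool) and not isinstance(value, bool):
            raise TypeError(f"{key} expects a bool, got {type(value).__name__}")
        if isinstance(default, float) and (
            isinstance(value, bool) or not isinstance(value, (int, float))
        ):
            raise TypeError(f"{key} expects a number, got {type(value).__name__}")
        # Keys that are not in the table (e.g. layoutShapeMode) pass through unchecked.
        payload[key] = value
    return payload
```

The result can be passed as `optional_payload=build_payload(useChartRecognition=True)`.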

### Tracking a Job

```python
async with job:
    print(await job.state)              # State.PENDING / RUNNING / DONE / FAILED
    print(await job.total_pages)        # e.g. 8
    print(await job.extracted_pages)    # e.g. 3
    print(await job.start_time)         # datetime
    print(await job.end_time)           # datetime
    print(await job.error_message)      # str or None
```

Status queries are cached for `status_update_interval` seconds (default `2`) to avoid hammering the API.
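
The caching behavior can be pictured as a time-based cache in front of the status request. The sketch below shows the general technique and is not the SDK's implementation: `CachedStatus` is a hypothetical class, and only the 2-second default mirrors `status_update_interval`.

```python
import time
from typing import Any, Awaitable, Callable, Optional


class CachedStatus:
    """Cache the result of an async fetch for `interval` seconds."""

    def __init__(
        self,
        fetch: Callable[[], Awaitable[Any]],
        interval: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._interval = interval
        self._clock = clock  # injectable so the cache can be tested without sleeping
        self._value: Any = None
        self._fetched_at: Optional[float] = None

    async def get(self) -> Any:
        """Return the cached value, refetching only when it is older than `interval`."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._interval:
            self._value = await self._fetch()
            self._fetched_at = now
        return self._value
```

Under a design like this, reading several job properties in quick succession would trigger one request instead of one per property.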

### Working with Results

```python
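from pathlib import Path

import httpx

# Run inside an async function, with `job` created as shown earlier.
result = await job.result          # full Result object
markdown = await job.markdown      # Markdown(text=..., images=...)

# Save Markdown
with open("doc.md", "w", encoding="utf-8") as f:
    f.write(markdown.text)

# Download embedded images to the relative paths referenced by the Markdown
async with httpx.AsyncClient() as http:
    for rel_path, url in markdown.images.items():
        response = await http.get(url)
        response.raise_for_status()
        target = Path(rel_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(response.content)
```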

The `Result` object also exposes per-page layout details via `layout_parsing_results`, raw page sizes via `data_info`, and preprocessed image URLs via `preprocessed_images`.

### Error Handling

All exceptions inherit from `PaddleOCRError`:

- `AistudioClientError` — client configuration issues (e.g. missing token).
- `JobCreationError` — failure when submitting a job.
- `JobStatusQueryError` — failure when polling status.

Use `job.query_status_safe()` instead of `query_status()` to get the cached state on failure rather than raising.
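
When a failed status query should be retried rather than swallowed, a small backoff wrapper is one option. The sketch below is self-contained: the two exception classes are stand-ins that only mirror the names listed above, and `query_with_retry` is a hypothetical helper, not an SDK function.

```python
import asyncio
from typing import Any, Awaitable, Callable


class PaddleOCRError(Exception):
    """Stand-in for the SDK's base exception."""


class JobStatusQueryError(PaddleOCRError):
    """Stand-in for the SDK's status-query exception."""


async def query_with_retry(
    query: Callable[[], Awaitable[Any]],
    retries: int = 3,
    base_delay: float = 1.0,
) -> Any:
    """Call query(); on JobStatusQueryError, retry with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return await query()
        except JobStatusQueryError:
            if attempt == retries:
                raise  # out of retries: let the caller handle the error
            await asyncio.sleep(base_delay * 2 ** attempt)
```

With the real package, the exception classes would be imported from the SDK instead of redefined; where exactly they are exported is not stated in this README.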

### License

[Apache-2.0](LICENSE)

---

## 中文

将 [PaddleOCR AI Studio API](https://aistudio.baidu.com/) 封装为简洁、类型安全的 Python 异步 SDK。上传文档、等待结果、拿到 Markdown —— 无需手写任何 HTTP 请求。

### 特性

- **异步优先** —— 基于 `httpx.AsyncClient` 与 `asyncio` 构建，原生支持上下文管理器。
- **全模型支持** —— `PaddleOCR-VL-1.6`（默认）、`PaddleOCR-VL-1.5`、`PaddleOCR-VL`、`PP-OCRv5`、`PaddleOCR`。
- **灵活输入** —— 支持本地路径、字节流、远程 URL 三种提交方式。
- **完善的任务控制** —— 实时查询状态、已抽取页数、起止时间、错误信息。
- **Markdown 导出** —— 直接获取整洁的 Markdown 文本及所有内嵌图片 URL。
- **细粒度参数** —— 可控制版面分析、图表/印章/表格识别、跨页表格合并、标题分级、NMS、图像方向矫正等。

### 安装

```bash
pip install paddleocr_api
```

依赖：`aiofiles`、`httpx`、`typing-extensions`、`python-dotenv`。

### 身份验证

在 [https://aistudio.baidu.com/account/accessToken](https://aistudio.baidu.com/account/accessToken) 获取访问令牌。

可以显式传入：

```python
from paddleocr_api import AistudioClient

client = AistudioClient(api_key="your_token_here")
```

也可以通过环境变量传入（自动加载 `.env` 文件）：

```
AISTUDIO_ACCESS_TOKEN=your_token_here
```

### 快速上手

```python
import asyncio
from paddleocr_api import AistudioClient, State

async def main():
    async with AistudioClient() as client:
        job = await client.create_job(file_path="paper.pdf")

        async with job:
            while True:
                state = await job.state
                if state == State.DONE:
                    break
                if state == State.FAILED:
                    raise RuntimeError(await job.error_message)
                await asyncio.sleep(5)

            markdown = await job.markdown
            with open("output.md", "w", encoding="utf-8") as f:
                f.write(markdown.text)

asyncio.run(main())
```

### 提交任务

`create_job` 支持三种输入方式：

```python
# 本地路径
await client.create_job(file_path="doc.pdf")

# 内存字节流
await client.create_job(file_bytes=pdf_bytes)

# 公网 URL
await client.create_job(file_url="https://example.com/doc.pdf")
```

### 选择模型

```python
from paddleocr_api import Model

await client.create_job(
    file_path="doc.pdf",
    model=Model.PADDLE_OCR_VL_1_6,  # 默认
)
```

| 模型 | 备注 |
|---|---|
| `PaddleOCR-VL-1.6` | 默认，最新视觉语言模型。 |
| `PaddleOCR-VL-1.5` | 计划于 2026-06-17 下线。 |
| `PaddleOCR-VL` | 基础 VL 模型。 |
| `PP-OCRv5` | 经典 OCR 流水线。 |
| `PaddleOCR` | 基础 OCR。 |

### 可选参数

通过 `OptionalPayload` 字典精调识别行为：

```python
from paddleocr_api import LayoutShapeMode

await client.create_job(
    file_path="doc.pdf",
    optional_payload={
        "useLayoutDetection": True,
        "useChartRecognition": True,
        "useSealRecognition": True,
        "mergeTables": True,
        "relevelTitles": True,
        "layoutShapeMode": LayoutShapeMode.AUTO,
        "repetitionPenalty": 1.0,
        "temperature": 0.0,
        "topP": 1.0,
    },
)
```

常用参数：

| 字段 | 默认值 | 作用 |
|---|---|---|
| `useDocOrientationClassify` | `False` | 自动矫正 0/90/180/270° 旋转。 |
| `useDocUnwarping` | `False` | 矫正褶皱、倾斜等扭曲图像。 |
| `useLayoutDetection` | `True` | 版面分区与排序。文档仅含单一区域时可关闭。 |
| `useChartRecognition` | `False` | 将图表解析为表格。 |
| `useSealRecognition` | `True` | 识别印章文字。 |
| `useOcrForImageBlock` | `False` | 对图片区域中的文字进行 OCR。 |
| `mergeTables` | `True` | 合并跨页表格。 |
| `relevelTitles` | `True` | 识别段落标题级别。 |
| `repetitionPenalty` | `1.0` | 出现重复内容时可调高。 |
| `temperature` | `0.0` | 调低更稳定，调高减少漏识别。 |
| `topP` | `1.0` | 调低让模型更保守。 |
| `layoutNms` | `True` | 移除重叠的检测框。 |
| `markdownIgnoreLabels` | 全部 | 过滤页眉、页脚、页码、脚注等辅助元素。 |

### 追踪任务

```python
async with job:
    print(await job.state)              # State.PENDING / RUNNING / DONE / FAILED
    print(await job.total_pages)        # 如 8
    print(await job.extracted_pages)    # 如 3
    print(await job.start_time)         # datetime
    print(await job.end_time)           # datetime
    print(await job.error_message)      # str 或 None
```

状态查询带有 `status_update_interval` 秒的缓存（默认 `2` 秒），避免频繁请求。

### 处理结果

```python
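from pathlib import Path

import httpx

# 在异步函数中运行，`job` 按前文方式创建。
result = await job.result          # 完整的 Result 对象
markdown = await job.markdown      # Markdown(text=..., images=...)

# 保存 Markdown
with open("doc.md", "w", encoding="utf-8") as f:
    f.write(markdown.text)

# 将内嵌图片下载到 Markdown 引用的相对路径
async with httpx.AsyncClient() as http:
    for rel_path, url in markdown.images.items():
        response = await http.get(url)
        response.raise_for_status()
        target = Path(rel_path)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(response.content)
```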

`Result` 对象还通过 `layout_parsing_results` 暴露每页的版面细节，通过 `data_info` 提供原始页面尺寸，通过 `preprocessed_images` 提供预处理图像 URL。

### 异常处理

所有异常都继承自 `PaddleOCRError`：

- `AistudioClientError` —— 客户端配置错误（如缺少令牌）。
- `JobCreationError` —— 任务提交失败。
- `JobStatusQueryError` —— 状态查询失败。

如果希望查询失败时返回缓存而非抛出异常，使用 `job.query_status_safe()` 代替 `query_status()`。

### 许可证

[Apache-2.0](LICENSE)
