Metadata-Version: 2.4
Name: pdf-insight-mcp
Version: 0.2.1
Summary: MCP server for extracting text, images, tables, links, annotations, and metadata from PDF files
Project-URL: Homepage, https://github.com/Xvvln/pdf-reader-mcp
Project-URL: Repository, https://github.com/Xvvln/pdf-reader-mcp
Project-URL: Issues, https://github.com/Xvvln/pdf-reader-mcp/issues
Author-email: xwell <3369759202@qq.com>
License: MIT License
        
        Copyright (c) 2026 xwell
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: llm,mcp,model-context-protocol,pdf,reader
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Markup
Requires-Python: >=3.10
Requires-Dist: mcp>=1.0.0
Requires-Dist: pymupdf>=1.24.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# pdf-reader-mcp

A feature-rich MCP server that lets LLM (large language model) clients read and analyze PDF files.

<!-- mcp-name: io.github.Xvvln/pdf-reader-mcp -->

## Features

| Tool | Description |
| --- | --- |
| `get_pdf_info` | Read document metadata, page count, size, and encryption status |
| `read_pdf_as_text` | Extract text content from selected pages |
| `read_pdf_as_images` | Render selected pages as base64-encoded images |
| `get_pdf_outline` | Read bookmarks and outline structure |
| `search_pdf_text` | Search text and return per-page results with context |
| `extract_pdf_tables` | Extract structured tables when detectable |
| `extract_pdf_images` | Extract embedded images from the PDF |
| `get_pdf_page_info` | Inspect a single page's dimensions, text, images, and links |
| `extract_pdf_links` | Extract external URLs and internal page jumps |
| `get_pdf_annotations` | Read comments, highlights, and annotation data |
| `get_pdf_text_stats` | Compute text, line, and paragraph statistics and a scan-likelihood estimate |
| `compare_pdf_pages` | Compare text similarity between two pages |
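
MCP clients invoke each of these tools through the protocol's `tools/call` method. The sketch below builds such a JSON-RPC request by hand to show the shape of the message. The argument name `file_path` is a hypothetical placeholder, not a confirmed parameter of this server; the real names come from each tool's input schema, which a client obtains via `tools/list`.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC request ids should be unique within a session


def build_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build an MCP `tools/call` JSON-RPC 2.0 request for the given tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


if __name__ == "__main__":
    # `file_path` is a hypothetical argument name; check the tool's input
    # schema (from `tools/list`) for the parameters this server expects.
    request = build_tool_call("get_pdf_info", {"file_path": "/path/to/report.pdf"})
    print(json.dumps(request, indent=2))
```

In practice an MCP client library handles this framing, so the helper is mainly useful for debugging or for understanding what goes over the wire.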

## Why this project

Many LLM workflows need more than raw text extraction. They also need structured information such as outlines, tables, images, annotations, and links.

This server provides a unified MCP interface for:

- text-heavy PDFs
- scanned or layout-sensitive PDFs
- table and image extraction
- metadata and structure inspection
- annotation and link analysis

## Installation

### Prerequisites

- Python 3.10+
- `uv` or another Python environment manager

Install `uv` on macOS or Linux:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Windows PowerShell:

```powershell
irm https://astral.sh/uv/install.ps1 | iex
```

### Install from PyPI

Once the package is published, you can run it directly with `uvx`:

```bash
uvx pdf-insight-mcp
```

You can also install it first, then run it. The PyPI distribution is named `pdf-insight-mcp`; the command it installs is `pdf-reader-mcp`:

```bash
python -m pip install pdf-insight-mcp
pdf-reader-mcp
```

### Local development setup

```bash
uv sync
```

### Run the server

```bash
uv run pdf-reader-mcp
```

## Configure in an MCP client

Example configuration using the published PyPI package:

```json
{
  "mcpServers": {
    "pdf-reader": {
      "command": "uvx",
      "args": ["pdf-insight-mcp"]
    }
  }
}
```

Example configuration for a local checkout:

```json
{
  "mcpServers": {
    "pdf-reader": {
      "command": "uv",
      "args": [
        "--directory",
        "/absolute/path/to/pdf-reader-mcp",
        "run",
        "pdf-reader-mcp"
      ]
    }
  }
}
```

Replace `/absolute/path/to/pdf-reader-mcp` with the path to your local repository.
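
If you prefer not to edit the path by hand, the local-checkout entry can be generated. The following is a minimal sketch, assuming the script is given the repository directory; the helper name `local_checkout_config` is illustrative and not part of this project.

```python
import json
from pathlib import Path


def local_checkout_config(repo_dir: str, server_name: str = "pdf-reader") -> dict:
    """Return an `mcpServers` entry that runs the server from a local checkout."""
    # Resolve to an absolute path, matching the placeholder in the example above.
    repo = Path(repo_dir).expanduser().resolve()
    return {
        "mcpServers": {
            server_name: {
                "command": "uv",
                "args": ["--directory", str(repo), "run", "pdf-reader-mcp"],
            }
        }
    }


if __name__ == "__main__":
    # Run from the repository root and paste the output into the client config.
    print(json.dumps(local_checkout_config("."), indent=2))
```

`json.dumps` takes care of escaping backslashes, which matters for Windows paths.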

## Release

Recommended release path:

1. Publish the Python package to PyPI
2. Publish `server.json` to the official MCP Registry

The recommended automation is GitHub Actions + PyPI Trusted Publishing (OIDC) + MCP Registry GitHub OIDC.

Typical release flow:

```bash
# 1. Bump the version number (pyproject.toml and server.json)
# 2. Commit the changes
git commit -am "Release v0.2.0"

# 3. Create the tag
git tag v0.2.0

# 4. Push the branch and the tag
git push origin main --tags
```

On `v*` tags, the release workflow will:

- run tests
- build the sdist and wheel
- run `twine check`
- publish to PyPI
- publish to the MCP Registry

## Response size and large-PDF notes

- `read_pdf_as_images` returns base64 image payloads, so response size grows quickly.
- Image rendering is limited to 20 pages per call.
- `read_pdf_as_text` defaults to at most 50 pages and 200,000 characters; output beyond either limit is truncated and a warning is attached.
- `read_pdf_as_images` defaults to an overall payload cap of about 20 MB; when the cap is reached, it stops early and attaches a warning.
- For scanned PDFs, prefer small page ranges, a lower `dpi`, the `jpeg` format, and a lower `quality`.
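
To make these limits concrete, here is a small client-side sketch for splitting a long document into page ranges and estimating base64 overhead. It is an illustration only: the helper names are hypothetical, pages are assumed to be numbered from 1, and the cap is taken as exactly 20 MiB even though the server only promises "about 20MB".

```python
import math

MAX_PAGES_PER_CALL = 20               # image-rendering page limit stated above
PAYLOAD_CAP_BYTES = 20 * 1024 * 1024  # assumed exact value of the ~20MB cap


def page_batches(first_page: int, last_page: int, batch_size: int = 5):
    """Yield inclusive (start, end) page ranges of at most `batch_size` pages."""
    if not 1 <= batch_size <= MAX_PAGES_PER_CALL:
        raise ValueError(f"batch_size must be between 1 and {MAX_PAGES_PER_CALL}")
    for start in range(first_page, last_page + 1, batch_size):
        yield start, min(start + batch_size - 1, last_page)


def base64_size(raw_bytes: int) -> int:
    """Size in bytes of `raw_bytes` of data after padded base64 encoding."""
    return 4 * math.ceil(raw_bytes / 3)


def fits_under_cap(image_sizes: list[int]) -> bool:
    """Check whether images of the given raw sizes stay under the payload cap."""
    return sum(base64_size(n) for n in image_sizes) <= PAYLOAD_CAP_BYTES


if __name__ == "__main__":
    for start, end in page_batches(1, 12, batch_size=5):
        print(f"request pages {start}-{end}")
```

Base64 encodes every 3 bytes as 4 characters, so a rendered page takes roughly a third more space in the response than on disk. That overhead is why small batches and lower `dpi` help most on scanned documents.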

## Development

Install dev dependencies:

```bash
uv sync --extra dev
```

Run tests:

```bash
uv run pytest
```

## Tech stack

- Python 3.10+
- [MCP Python SDK](https://github.com/modelcontextprotocol/python-sdk)
- [PyMuPDF](https://pymupdf.readthedocs.io/)

## License

MIT
