Metadata-Version: 2.4
Name: awslabs.document-loader-mcp-server
Version: 1.0.14
Summary: An AWS Labs Model Context Protocol (MCP) server for document parsing
Project-URL: Homepage, https://awslabs.github.io/mcp/
Project-URL: Documentation, https://awslabs.github.io/mcp/servers/document-loader-mcp-server/
Project-URL: Source, https://github.com/awslabs/mcp.git
Project-URL: Bug Tracker, https://github.com/awslabs/mcp/issues
Project-URL: Changelog, https://github.com/awslabs/mcp/blob/main/src/document-loader-mcp-server/CHANGELOG.md
Author: Amazon Web Services
Author-email: AWSLabs MCP <203918161+awslabs-mcp@users.noreply.github.com>
License: Apache-2.0
License-File: LICENSE
License-File: NOTICE
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.10
Requires-Dist: fastmcp>=2.14.0
Requires-Dist: loguru>=0.7.0
Requires-Dist: markitdown[docx,pptx,xlsx]>=0.1.2
Requires-Dist: pdf2image>=1.16.0
Requires-Dist: pdfplumber>=0.10.0
Provides-Extra: dev
Requires-Dist: pyright>=1.1.398; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# Document Loader MCP Server

Model Context Protocol (MCP) server for document parsing and content extraction

This MCP server provides tools to parse and extract content from various document formats including PDF, Word documents, Excel spreadsheets, PowerPoint presentations, and images.

## Features

- **PDF Text Extraction**: Extract text content from PDF files using pdfplumber
- **Word Document Processing**: Convert DOCX/DOC files to markdown using markitdown
- **Excel Spreadsheet Reading**: Parse XLSX/XLS files and convert to markdown
- **PowerPoint Presentation Processing**: Extract content from PPTX/PPT files
- **Image Loading**: Load and display various image formats (PNG, JPG, GIF, BMP, TIFF, WEBP)
- **Slide Image Extraction**: Extract individual slides/pages as PNG images from PPTX, PPT, or PDF files using LibreOffice and poppler

## Prerequisites

### Installation Requirements

1. Install `uv` from [Astral](https://docs.astral.sh/uv/getting-started/installation/) or the [GitHub README](https://github.com/astral-sh/uv#installation)
2. Install Python 3.10 or newer using `uv python install 3.10` (or a more recent version)

### Optional: Slide Image Extraction

The `extract_slides_as_images` tool requires external system packages:

- **LibreOffice** (for PPTX/PPT → PDF conversion):
  - Ubuntu/Debian: `sudo apt install libreoffice`
  - macOS: `brew install --cask libreoffice`
  - Windows: [Download from libreoffice.org](https://www.libreoffice.org/download/)
- **poppler-utils** (for PDF → image rendering):
  - Ubuntu/Debian: `sudo apt install poppler-utils`
  - macOS: `brew install poppler`
  - Windows: [Download from GitHub](https://github.com/oschwartz10612/poppler-windows/releases) and add to PATH

## Installation

| Kiro | Cursor | VS Code |
|:----:|:------:|:-------:|
| [![Add to Kiro](https://kiro.dev/images/add-to-kiro.svg)](https://kiro.dev/launch/mcp/add?name=awslabs.document-loader-mcp-server&config=%7B%22command%22%3A%22uvx%22%2C%22args%22%3A%5B%22awslabs.document-loader-mcp-server%40latest%22%5D%2C%22env%22%3A%7B%22FASTMCP_LOG_LEVEL%22%3A%22ERROR%22%7D%7D) | [![Install MCP Server](https://cursor.com/deeplink/mcp-install-light.svg)](https://cursor.com/en/install-mcp?name=awslabs.document-loader-mcp-server&config=eyJjb21tYW5kIjoidXZ4IiwiYXJncyI6WyJhd3NsYWJzLmRvY3VtZW50LWxvYWRlci1tY3Atc2VydmVyQGxhdGVzdCJdLCJlbnYiOnsiRkFTVE1DUF9MT0dfTEVWRUwiOiJFUlJPUiJ9LCJkaXNhYmxlZCI6ZmFsc2UsImF1dG9BcHByb3ZlIjpbXX0%3D) | [![Install on VS Code](https://img.shields.io/badge/Install_on-VS_Code-FF9900?style=flat-square&logo=visualstudiocode&logoColor=white)](https://insiders.vscode.dev/redirect/mcp/install?name=Document%20Loader%20MCP%20Server&config=%7B%22command%22%3A%22uvx%22%2C%22args%22%3A%5B%22awslabs.document-loader-mcp-server%40latest%22%5D%2C%22env%22%3A%7B%22FASTMCP_LOG_LEVEL%22%3A%22ERROR%22%7D%2C%22disabled%22%3Afalse%2C%22autoApprove%22%3A%5B%5D%7D) |

Configure the MCP server in your MCP client configuration:

```json
{
  "mcpServers": {
    "awslabs.document-loader-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.document-loader-mcp-server@latest"],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    }
  }
}
```

For Kiro MCP configuration, see the [Kiro IDE documentation](https://kiro.dev/docs/mcp/configuration/) or the [Kiro CLI documentation](https://kiro.dev/docs/cli/mcp/configuration/) for details.

For global configuration, edit `~/.kiro/settings/mcp.json`. For project-specific configuration, edit `.kiro/settings/mcp.json` in your project directory.

## Available Tools

- `read_document`: Extract content from various document formats by specifying file_path and file_type ('pdf', 'docx', 'doc', 'xlsx', 'xls', 'pptx', 'ppt')
- `read_image`: Load image files for LLM viewing and analysis
- `extract_slides_as_images`: Extract slides/pages as individual PNG images from PPTX, PPT, or PDF files. Requires [LibreOffice](https://www.libreoffice.org/) (for PPTX/PPT) and [poppler-utils](https://poppler.freedesktop.org/) (for PDF-to-image rendering)

## Environment Variables

- `FASTMCP_LOG_LEVEL`: Set logging level (ERROR, INFO, DEBUG)
- `MAX_FILE_SIZE_MB`: Maximum allowed file size in megabytes (default: 50). Must be a positive integer.
- `DOCUMENT_BASE_DIR`: Base directory for file access security. Restricts document loading to files within this directory. Defaults to the current working directory.

## Development

### Setup

```bash
# Clone the repository
git clone https://github.com/awslabs/mcp.git
cd mcp/src/document-loader-mcp-server

# Install dependencies
uv sync

# Install in development mode
uv pip install -e .
```

### Testing

```bash
# Run tests
uv run pytest

# Run with coverage
uv run pytest --cov=awslabs.document_loader_mcp_server
```

The test suite includes:

- Server functionality validation
- Document parsing tests with generated sample files
- Error handling verification

### Sample Documents

The test suite automatically generates sample documents for testing:

- PDF with multi-page content
- DOCX with formatted text and lists
- XLSX with multiple sheets and data
- PPTX with slides and content
- Various image formats

## Docker

You can also run this server in a Docker container:

```bash
docker build -t document-loader-mcp-server .
docker run -p 8000:8000 document-loader-mcp-server
```

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](https://github.com/awslabs/mcp/blob/main/src/document-loader-mcp-server/LICENSE) file for details.

## Contributing

We welcome contributions! Please see [CONTRIBUTING.md](https://github.com/awslabs/mcp/blob/main/CONTRIBUTING.md) for details.

## Support

For issues and questions, please use the [GitHub issue tracker](https://github.com/awslabs/mcp/issues).
