Metadata-Version: 2.4
Name: doc-parser-mcp
Version: 0.1.0
Summary: MCP server for passport MRZ extraction and document parsing
Author-email: Ankit gupta <ankitgupta1117@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/yourusername/doc-parser-mcp
Project-URL: Repository, https://github.com/yourusername/doc-parser-mcp
Project-URL: Issues, https://github.com/yourusername/doc-parser-mcp/issues
Keywords: mcp,passport,mrz,ocr,document-parsing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: mcp>=1.0.0
Requires-Dist: passport-mrz-extractor>=1.0.0
Requires-Dist: opencv-python>=4.5.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: Pillow>=8.0.0
Requires-Dist: pytesseract>=0.3.8
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.18.0; extra == "dev"
Requires-Dist: black>=21.0; extra == "dev"
Requires-Dist: flake8>=3.8; extra == "dev"
Requires-Dist: mypy>=0.800; extra == "dev"
Dynamic: license-file

# Doc Parser MCP Server

A Model Context Protocol (MCP) server for passport MRZ (Machine Readable Zone) extraction and document parsing.

## Features

- **MRZ Extraction**: Extract MRZ data from passport images
- **MRZ Detection**: Locate MRZ regions in passport images
- **Text Parsing**: Parse MRZ text into structured information
- **Checksum Validation**: Validate MRZ checksums
- **Multiple Input Formats**: Supports file paths and base64-encoded images
- **Fallback Processing**: Works with or without the external `passport_mrz_extractor` library
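
The fallback behaviour could be driven by a simple availability check. The following is a minimal sketch, not the server's actual code: the import name `passport_mrz_extractor` is assumed from the project linked in the Acknowledgments, and `is_available` and `choose_backend` are hypothetical helpers.

```python
import importlib.util


def is_available(module_name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(module_name) is not None


def choose_backend() -> str:
    """Pick an extraction backend (labels are illustrative only)."""
    # Assumption: the external library is importable as `passport_mrz_extractor`.
    if is_available("passport_mrz_extractor"):
        return "passport_mrz_extractor"
    # Otherwise use the OpenCV + Tesseract path described under Dependencies.
    return "fallback"
```

Checking with `find_spec` avoids importing the library just to learn whether it is installed.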

## Installation

### From PyPI (when published)

```bash
pip install doc-parser-mcp
```

### From Source

```bash
git clone https://github.com/yourusername/doc-parser-mcp.git
cd doc-parser-mcp
pip install -e .
```

## Dependencies

The server requires the following dependencies:

- `mcp>=1.0.0` - Model Context Protocol framework
- `opencv-python>=4.5.0` - Image processing
- `numpy>=1.21.0` - Numerical operations
- `Pillow>=8.0.0` - Image manipulation
- `pytesseract>=0.3.8` - Python bindings for Tesseract OCR (used in fallback mode)

### Optional Dependencies

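- `passport-mrz-extractor>=1.0.0` - Specialized passport MRZ extraction library. The package metadata declares it as a regular requirement, so `pip` installs it by default; the server falls back to its own processing when the library is unavailable.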

### System Dependencies

For OCR functionality (fallback mode), you need Tesseract:

**macOS:**
```bash
brew install tesseract
```

**Ubuntu/Debian:**
```bash
sudo apt-get install tesseract-ocr
```

**Windows:**
Download and install from [Tesseract at UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki)

## Usage

### As MCP Server

Run the server using stdio transport:

```bash
doc-parser-mcp
```

### Available Tools

The server provides the following tools:

#### 1. extract_passport_mrz
Extract MRZ data from a passport image.

**Parameters:**
- `image_path` (string): Path to the passport image file
- `image_data` (string): Base64 encoded image data (alternative to image_path)

**Example:**
```json
{
  "image_path": "/path/to/passport.jpg"
}
```

#### 2. detect_mrz_region
Detect and locate the MRZ region in a passport image.

**Parameters:**
- `image_path` (string): Path to the passport image file
- `image_data` (string): Base64 encoded image data (alternative to image_path)
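
Both image tools accept either a path or base64 data. A client might prepare the arguments roughly as follows. This is a sketch under assumptions: `encode_image_bytes` and `build_image_arguments` are hypothetical helpers, and it assumes the server expects plain base64 without a `data:` URI prefix.

```python
import base64
from pathlib import Path


def encode_image_bytes(raw: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(raw).decode("ascii")


def build_image_arguments(image_path: str, inline: bool = False) -> dict:
    """Build the arguments object for an image-based tool call.

    With inline=False the path is passed through unchanged; with
    inline=True the file is read and sent as `image_data` instead.
    """
    if not inline:
        return {"image_path": image_path}
    return {"image_data": encode_image_bytes(Path(image_path).read_bytes())}
```

Sending `image_data` is useful when the server cannot read the client's file system, at the cost of a larger request.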

#### 3. parse_mrz_text
Parse MRZ text and extract structured information.

**Parameters:**
- `mrz_text` (string): Raw MRZ text (2-3 lines)

**Example:**
```json
{
  "mrz_text": "P<UTOERIKSSON<<ANNA<MARIA<<<<<<<<<<<<<<<<<<<\nL898902C36UTO7408122F1204159ZE184226B<<<<<10"
}
```
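
To illustrate what parsing involves, here is a minimal sketch that splits a two-line, 44-character passport MRZ by the fixed field positions defined in ICAO Doc 9303. It is not the server's implementation; `parse_td3` is a hypothetical name, the keys mirror the Example Output section, and dates are left as raw `YYMMDD` strings.

```python
def parse_td3(mrz_text: str) -> dict:
    """Split a two-line, 44-character passport MRZ into its fields."""
    lines = mrz_text.strip().splitlines()
    if len(lines) != 2 or any(len(line) != 44 for line in lines):
        raise ValueError("expected two lines of 44 characters")
    top, bottom = lines

    # Names start at position 6; surname and given names are separated by "<<".
    surname, _, given = top[5:].partition("<<")
    return {
        "document_type": top[0:2].replace("<", ""),
        "country_code": top[2:5],
        "surname": surname.replace("<", " ").strip(),
        "given_names": given.replace("<", " ").strip(),
        "passport_number": bottom[0:9].replace("<", ""),
        "nationality": bottom[10:13],
        "birth_date": bottom[13:19],        # YYMMDD
        "sex": bottom[20],
        "expiration_date": bottom[21:27],   # YYMMDD
        "personal_number": bottom[28:42].replace("<", ""),
    }


# The specimen MRZ from the example above; ljust pads line 1 to 44 characters.
sample = (
    "P<UTOERIKSSON<<ANNA<MARIA".ljust(44, "<")
    + "\n"
    + "L898902C36UTO7408122F1204159ZE184226B<<<<<10"
)
fields = parse_td3(sample)
```

The check digits at positions 10, 20, 28, 43, and 44 of the second line are skipped here; they are validated separately.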

#### 4. validate_mrz_checksum
Validate MRZ checksums.

**Parameters:**
- `mrz_line` (string): MRZ line to validate
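
For reference, MRZ check digits follow the ICAO Doc 9303 scheme: each character is mapped to a number (digits keep their value, `A`-`Z` map to 10-35, `<` counts as 0), multiplied by the repeating weights 7, 3, 1, and the sum is taken modulo 10. The sketch below shows the calculation for a single field; `mrz_check_digit` is a hypothetical name, not necessarily the function the server uses.

```python
def mrz_check_digit(field: str) -> int:
    """Compute the ICAO 9303 check digit for one MRZ field."""
    weights = (7, 3, 1)
    total = 0
    for index, char in enumerate(field):
        if "0" <= char <= "9":
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10
        elif char == "<":
            value = 0
        else:
            raise ValueError(f"invalid MRZ character: {char!r}")
        total += value * weights[index % 3]
    return total % 10


# Fields from the specimen MRZ used in this README.
assert mrz_check_digit("L898902C3") == 6        # passport number
assert mrz_check_digit("740812") == 2           # birth date
assert mrz_check_digit("120415") == 9           # expiration date
assert mrz_check_digit("ZE184226B<<<<<") == 1   # personal number
```

A field is valid when the computed digit equals the digit printed immediately after it in the MRZ.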

## Example Output

### MRZ Extraction Result

```json
{
  "format": "TD3",
  "document_type": "P",
  "country_code": "UTO",
  "surname": "ERIKSSON",
  "given_names": "ANNA MARIA",
  "passport_number": "L898902C3",
  "nationality": "UTO",
  "birth_date": {
    "year": 1974,
    "month": 8,
    "day": 12,
    "formatted": "1974-08-12"
  },
  "sex": "F",
  "expiration_date": {
    "year": 2012,
    "month": 4,
    "day": 15,
    "formatted": "2012-04-15"
  },
  "personal_number": "ZE184226B",
  "check_digits": {
    "passport": "6",
    "birth": "2",
    "expiration": "9",
    "personal": "1"
  }
}
```
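
The MRZ stores dates as `YYMMDD`, so producing the four-digit years shown above requires a century rule. The sketch below shows one possible rule and is an assumption, not the server's documented behaviour: birth years that would lie in the future are moved back a century, and expiry years more than 50 years ahead are treated the same way. `expand_mrz_date` is a hypothetical helper.

```python
from datetime import date
from typing import Optional


def expand_mrz_date(yymmdd: str, is_expiry: bool = False,
                    today: Optional[date] = None) -> dict:
    """Expand a YYMMDD MRZ date into year/month/day plus an ISO string."""
    today = today or date.today()
    yy, month, day = int(yymmdd[0:2]), int(yymmdd[2:4]), int(yymmdd[4:6])

    # Start in the current century, then apply the (assumed) pivot rule.
    year = today.year - today.year % 100 + yy
    limit = today.year + 50 if is_expiry else today.year
    if year > limit:
        year -= 100

    parsed = date(year, month, day)  # raises ValueError for impossible dates
    return {"year": year, "month": month, "day": day,
            "formatted": parsed.isoformat()}
```

Passing `today` explicitly keeps the result reproducible in tests.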

## Supported Formats

### MRZ Formats
- **TD3**: 2-line format (44 characters per line) - Standard passports
- **TD1**: 3-line format (30 characters per line) - ID cards
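
Under ICAO Doc 9303, a TD3 MRZ has two lines of 44 characters and a TD1 MRZ has three lines of 30 characters, so the format can be inferred from the shape of the text alone. A minimal sketch, with `detect_mrz_format` as a hypothetical helper name:

```python
from typing import Optional

# (line count, characters per line) -> format name
MRZ_LAYOUTS = {(2, 44): "TD3", (3, 30): "TD1"}


def detect_mrz_format(mrz_text: str) -> Optional[str]:
    """Return "TD3" or "TD1" based on line count and length, else None."""
    lines = [line.strip() for line in mrz_text.strip().splitlines()]
    if not lines:
        return None
    lengths = {len(line) for line in lines}
    if len(lengths) != 1:
        return None  # lines of unequal length match no supported layout
    return MRZ_LAYOUTS.get((len(lines), lengths.pop()))
```

Returning `None` for unrecognised shapes lets the caller decide whether to reject the input or retry OCR.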

### Image Formats
- JPEG
- PNG
- BMP
- TIFF

## Development

### Setup Development Environment

```bash
git clone https://github.com/yourusername/doc-parser-mcp.git
cd doc-parser-mcp

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install development dependencies
pip install -e ".[dev]"
```

### Running Tests

```bash
pytest
```

### Code Formatting

```bash
black doc_parser_mcp/
```

### Type Checking

```bash
mypy doc_parser_mcp/
```

## Configuration

The server can be configured through environment variables:

- `TESSERACT_CMD`: Path to Tesseract executable (if not in PATH)
- `MRZ_DEBUG`: Enable debug logging (set to "1")
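
Reading these variables could look like the sketch below. `load_settings` is a hypothetical helper and the returned keys are illustrative; only the two variable names come from this README.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the server's settings from environment variables."""
    env = os.environ if env is None else env
    return {
        # None means "rely on Tesseract being on PATH".
        "tesseract_cmd": env.get("TESSERACT_CMD") or None,
        # Debug logging is enabled only by the exact value "1".
        "debug": env.get("MRZ_DEBUG") == "1",
    }


# With pytesseract, a custom path is typically applied by assigning
# pytesseract.pytesseract.tesseract_cmd; whether this server does so
# is an assumption.
settings = load_settings()
```

Accepting a mapping argument makes the function easy to test without touching the real environment.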

## Limitations

- Image quality affects extraction accuracy
- Works best with clear, high-resolution passport images
- MRZ should be visible and not obstructed
- Some passports may have non-standard formats

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- [passport_mrz_extractor](https://github.com/Azim-Kenzh/passport_mrz_extractor) - Primary MRZ extraction library
- [OpenCV](https://opencv.org/) - Computer vision functionality
- [Tesseract](https://github.com/tesseract-ocr/tesseract) - OCR engine
- [Model Context Protocol](https://modelcontextprotocol.io/) - Protocol framework

## Support

For issues and questions:
- Create an issue on [GitHub](https://github.com/yourusername/doc-parser-mcp/issues)
- Check the [documentation](https://github.com/yourusername/doc-parser-mcp/wiki)
