Metadata-Version: 2.4
Name: translatepdf
Version: 1.0.0
Summary: Translate PDF documents using OCR and machine translation
Author: Bhanu Pratap Singh Rathore
Maintainer: Bhanu Pratap Singh Rathore
License: MIT
Project-URL: Homepage, https://github.com/BhanuJodha/PDF-Translator
Project-URL: Documentation, https://github.com/BhanuJodha/PDF-Translator#readme
Project-URL: Repository, https://github.com/BhanuJodha/PDF-Translator
Project-URL: Issues, https://github.com/BhanuJodha/PDF-Translator/issues
Keywords: pdf,translation,ocr,surya,google-translate,document-translation,hindi,multilingual
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Multimedia :: Graphics :: Graphics Conversion
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21.0
Requires-Dist: Pillow>=9.0.0
Requires-Dist: pdf2image>=1.16.0
Requires-Dist: deep-translator>=1.11.0
Requires-Dist: surya-ocr>=0.4.0
Requires-Dist: pymupdf>=1.24.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: pytest-mock>=3.10.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
Requires-Dist: build>=1.0.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Dynamic: license-file

# TranslatePDF

![Test](https://github.com/BhanuJodha/PDF-Translator/actions/workflows/test.yml/badge.svg)
![Lint](https://github.com/BhanuJodha/PDF-Translator/actions/workflows/lint.yml/badge.svg)
![Type Check](https://github.com/BhanuJodha/PDF-Translator/actions/workflows/typecheck.yml/badge.svg)
![Python](https://img.shields.io/badge/python-3.9%2B-blue)
![License](https://img.shields.io/badge/license-MIT-green)

Translate PDFs using OCR and machine translation. Both scanned documents and digital PDFs are supported.

This tool extracts text from PDF pages using [Surya OCR](https://github.com/VikParuchuri/surya) (for scanned documents) or [PyMuPDF](https://pymupdf.readthedocs.io/) (for digital PDFs), translates it via Google Translate, and renders the translated text back onto the original document while preserving layout, colors, and formatting.

## Features

- **Dual-mode translation**: Automatically detects PDF type and uses the optimal method
  - **Digital PDFs**: Fast text extraction and in-place replacement (no OCR needed)
  - **Scanned PDFs**: High-quality OCR powered by Surya (supports 90+ languages)
- **Automatic text color detection** ensures readability on any background
- **Batch processing** for faster translation of multi-page documents
- **GPU acceleration** on NVIDIA (CUDA) and Apple Silicon (MPS)
- **Page range selection** to translate specific pages
- **Preserves formatting** including bold and underlined text

## Installation

```bash
pip install translatepdf
```

### System Dependencies

You'll need `poppler` for PDF rendering:

```bash
# macOS
brew install poppler

# Ubuntu/Debian
sudo apt-get install poppler-utils

# Windows (via conda)
conda install -c conda-forge poppler
```

## Quick Start

### Command Line

```bash
# Basic usage - auto-detects PDF type, translates English to Hindi
translatepdf document.pdf

# Specify languages
translatepdf document.pdf --source en --target hi

# Force digital mode (faster for PDFs with embedded text)
translatepdf document.pdf --mode digital

# Force OCR mode (for scanned documents)
translatepdf document.pdf --mode ocr

# Translate specific pages
translatepdf document.pdf --pages 1-5

# Custom output path
translatepdf document.pdf -o translated.pdf

# Higher quality OCR (slower)
translatepdf document.pdf --dpi 300
```

### Python API

```python
from pdf_translator import PDFTranslator, TranslationConfig

# Simple usage - auto-detects PDF type
translator = PDFTranslator()
translator.translate("document.pdf", target_lang="hi")

# Force digital mode for PDFs with embedded text
config = TranslationConfig(
    source_lang="en",
    target_lang="hi",
    mode="digital",  # "auto", "ocr", or "digital"
)

translator = PDFTranslator(config)
translator.translate("document.pdf")

# Force OCR mode with custom settings
config = TranslationConfig(
    source_lang="en",
    target_lang="hi",
    mode="ocr",
    device="mps",  # or "cuda", "cpu"
    dpi=200,
)

translator = PDFTranslator(config)
output_path = translator.translate(
    "document.pdf",
    output_path="translated.pdf",
    page_range="1-10",
)
```

### Device-Specific Configs

```python
from pdf_translator import TranslationConfig

# For Apple Silicon Macs
config = TranslationConfig.for_apple_silicon()

# For NVIDIA GPUs
config = TranslationConfig.for_nvidia_gpu()

# For CPU-only systems
config = TranslationConfig.for_cpu()
```

## CLI Options

| Option | Short | Description | Default |
|--------|-------|-------------|---------|
| `--output` | `-o` | Output file path | `input_translated.pdf` |
| `--source` | `-s` | Source language code | `en` |
| `--target` | `-t` | Target language code | `hi` |
| `--mode` | `-m` | Translation mode: `auto`, `ocr`, `digital` | `auto` |
| `--pages` | `-p` | Page range (e.g., "1-5", "1,3,5") | `all` |
| `--dpi` | `-d` | Rendering resolution (OCR mode only) | `200` |
| `--batch-size` | `-b` | Pages per OCR batch (OCR mode only) | `4` |
| `--device` | | Compute device (OCR mode only) | `auto` |

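The `--pages` formats in the table above (`"1-5"`, `"1,3,5"`, `all`) can be illustrated with a small parser. This is a minimal sketch, not the package's actual implementation: the function name `parse_page_range`, the 1-based numbering, and the strict bounds checking are all assumptions made for illustration.

```python
def parse_page_range(spec, total_pages):
    """Return sorted, de-duplicated 1-based page numbers.

    Hypothetical helper: accepts "all", single pages ("3"),
    ranges ("1-5"), and comma-separated mixes ("1-3,7").
    """
    if spec.strip().lower() == "all":
        return list(range(1, total_pages + 1))

    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Assumed policy: reject anything outside the document.
        if start < 1 or end > total_pages or start > end:
            raise ValueError(
                f"invalid page range {part!r} for a {total_pages}-page document"
            )
        pages.update(range(start, end + 1))
    return sorted(pages)
```

With a 10-page document, `parse_page_range("1-5", 10)` yields pages 1 through 5, and `parse_page_range("1,3,5", 10)` yields `[1, 3, 5]`.
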
### Translation Modes

| Mode | Description | Best For |
|------|-------------|----------|
| `auto` | Auto-detect PDF type | Most PDFs (default) |
| `digital` | Extract embedded text directly | Digital/native PDFs, Word exports, LaTeX |
| `ocr` | Use Surya OCR on page images | Scanned documents, image-based PDFs |

**Digital mode** is significantly faster and produces better results for PDFs with embedded text (e.g., documents created in Word, LaTeX, or other text editors).

**OCR mode** is required for scanned documents or image-based PDFs where text is not selectable.

## Supported Languages

The source language can be any of the 90+ languages supported by Surya OCR. Target languages depend on Google Translate availability.

Common language codes:
- `en` - English
- `hi` - Hindi
- `es` - Spanish
- `fr` - French
- `de` - German
- `zh` - Chinese
- `ja` - Japanese
- `ko` - Korean
- `ar` - Arabic
- `ru` - Russian

## How It Works

### Auto Mode (Default)

The tool first checks if the PDF contains extractable text. If it does, it uses **digital mode**; otherwise, it falls back to **OCR mode**.

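The detection step can be pictured as a simple decision over the text extracted from each page. The sketch below is an illustration only: `choose_mode` and the `min_chars` threshold are hypothetical, and the real check may differ. It assumes the per-page strings were obtained beforehand, for example from PyMuPDF's `page.get_text()`.

```python
def choose_mode(page_texts, min_chars=20):
    """Return "digital" if any page has meaningful embedded text, else "ocr".

    page_texts: one extracted string per page (may be empty for scans).
    min_chars:  assumed threshold that filters out stray characters
                such as page numbers stamped onto a scanned page.
    """
    for text in page_texts:
        # Count non-whitespace characters only.
        if len("".join(text.split())) >= min_chars:
            return "digital"
    return "ocr"
```

A scanned PDF usually yields empty strings for every page, so it falls through to `"ocr"`; a Word or LaTeX export returns `"digital"` on the first page with real text.
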
### Digital Mode

1. **Text Extraction**: PyMuPDF extracts text with position and formatting info
2. **Translation**: Text is batch-translated via Google Translate
3. **In-Place Replacement**: Original text is replaced with translations using redaction annotations
4. **Output**: Native PDF with translated text

This mode preserves the original PDF structure and is much faster than OCR.

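Step 2 (batch translation) can be sketched independently of any translation backend. The example below groups extracted strings into batches under a character budget and passes each batch to a caller-supplied function. The names `batch_texts` and `translate_all` and the `max_chars=4500` budget are assumptions for illustration; in real use the callable might wrap deep-translator's `GoogleTranslator.translate_batch`.

```python
def batch_texts(texts, max_chars=4500):
    """Group strings into consecutive batches within a character budget.

    A single string longer than max_chars still gets a batch of its own;
    splitting it further is left to the caller.
    """
    batches, current, current_len = [], [], 0
    for text in texts:
        if current and current_len + len(text) > max_chars:
            batches.append(current)
            current, current_len = [], 0
        current.append(text)
        current_len += len(text)
    if current:
        batches.append(current)
    return batches


def translate_all(texts, translate_batch, max_chars=4500):
    """Translate texts batch by batch, preserving the original order.

    translate_batch: any callable mapping a list of strings to a list of
    translated strings of the same length (hypothetical interface).
    """
    translated = []
    for batch in batch_texts(texts, max_chars):
        translated.extend(translate_batch(batch))
    return translated
```

Keeping the output in input order matters here because each translated string has to be written back to the position of the text it replaces.
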
### OCR Mode

1. **PDF to Images**: Each page is rendered as a high-resolution image
2. **OCR**: Surya detects and recognizes text regions with their positions
3. **Translation**: Text is batch-translated via Google Translate
4. **Rendering**: Original text is erased and replaced with translations
5. **Output**: Processed images are combined back into a PDF

The tool samples background colors around each text region to cleanly erase the original text, then automatically chooses black or white text for maximum contrast.

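The black-or-white choice can be illustrated with a contrast calculation. The sketch below uses the WCAG relative-luminance and contrast-ratio formulas, which is a swapped-in technique chosen for illustration; the tool's actual color logic is not documented here. The function names are hypothetical, and the background pixels are assumed to be RGB tuples already sampled from around the text region.

```python
def average_color(pixels):
    """Mean RGB of sampled background pixels, as 0-255 integers."""
    count = len(pixels)
    return tuple(sum(p[i] for p in pixels) // count for i in range(3))


def relative_luminance(rgb):
    """WCAG relative luminance of an sRGB color (0.0 = black, 1.0 = white)."""
    def linearize(channel):
        c = channel / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def pick_text_color(background):
    """Return black or white, whichever contrasts more with the background."""
    lum = relative_luminance(background)
    contrast_with_black = (lum + 0.05) / 0.05
    contrast_with_white = 1.05 / (lum + 0.05)
    if contrast_with_black >= contrast_with_white:
        return (0, 0, 0)
    return (255, 255, 255)
```

For example, a region sampled from a dark blue banner such as `(30, 30, 120)` gets white text, while an off-white paper background gets black text.
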
## Performance Tips

- **Lower the DPI** (150-200) for faster OCR processing, or raise it (300+) for better quality
- **Increase the batch size** in OCR mode if you have spare GPU memory
- **Use page ranges** to translate only what you need
- **GPU acceleration** provides 5-10x speedup over CPU

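To see why DPI has such a large effect in OCR mode, it helps to work out the rendered image size. The arithmetic below assumes an A4 page (210 mm × 297 mm); `page_pixels` is a hypothetical helper, not part of the package.

```python
MM_PER_INCH = 25.4
A4_INCHES = (210 / MM_PER_INCH, 297 / MM_PER_INCH)


def page_pixels(width_in, height_in, dpi):
    """Pixel dimensions of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (150, 200, 300):
    width, height = page_pixels(*A4_INCHES, dpi)
    print(f"{dpi} DPI: {width} x {height} px ({width * height / 1e6:.1f} MP)")
```

Pixel count grows with the square of the DPI, so a page at 300 DPI has 2.25 times as many pixels to process as the same page at the default 200 DPI.
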
## Development

### Setup

```bash
# Clone the repo
git clone https://github.com/BhanuJodha/PDF-Translator.git
cd PDF-Translator

# Create virtual environment
python -m venv .venv

# Activate it
source .venv/bin/activate  # Linux/macOS
# .venv\Scripts\activate   # Windows

# Install in development mode with dev dependencies
pip install -e ".[dev]"

# Set up pre-commit hooks
pre-commit install
```

### Running Tests

```bash
# Run all tests
make test

# Run tests with coverage report
make test-cov

# Or using pytest directly
pytest tests/ -v

# Run specific test file
pytest tests/core/test_config.py -v

# Run with coverage
pytest tests/ --cov=pdf_translator --cov-report=term-missing
```

### Linting and Formatting

```bash
# Check code style with ruff
make lint

# Auto-fix linting issues
make lint-fix

# Format code with black
make format

# Check formatting without changes
make format-check

# Run type checker
make typecheck

# Run all checks (format, lint, typecheck, test)
make check
```

### Pre-commit Hooks

Pre-commit hooks run automatically on `git commit`. To run them manually:

```bash
# Run on all files
pre-commit run --all-files

# Run specific hook
pre-commit run black --all-files
```

### Building and Publishing

```bash
# Build package
make build

# Publish to Test PyPI (for testing)
make publish-test

# Publish to PyPI
make publish
```

## Project Structure

```
pdf-translator/
├── pdf_translator/              # Main package
│   ├── __init__.py              # Package exports
│   ├── cli.py                   # Command-line interface
│   ├── py.typed                 # PEP 561 type marker
│   ├── core/                    # Core functionality
│   │   ├── config.py            # Configuration management
│   │   ├── ocr.py               # Surya OCR wrapper
│   │   ├── pdf_extractor.py     # Digital PDF text extraction
│   │   ├── pdf_renderer.py      # Digital PDF text replacement
│   │   ├── renderer.py          # Image-based text rendering
│   │   ├── text_translator.py   # Google Translate wrapper
│   │   └── translator.py        # Main orchestrator
│   └── utils/                   # Utilities
│       ├── fonts.py             # Cross-platform font discovery
│       └── page_range.py        # Page range parsing
├── tests/                       # Test suite (mirrors package structure)
│   ├── conftest.py              # Shared fixtures
│   ├── test_cli.py
│   ├── core/                    # Tests for core modules
│   │   ├── test_config.py
│   │   ├── test_ocr.py
│   │   ├── test_pdf_extractor.py
│   │   ├── test_pdf_renderer.py
│   │   ├── test_renderer.py
│   │   └── test_text_translator.py
│   └── utils/                   # Tests for utilities
│       ├── test_fonts.py
│       └── test_page_range.py
├── docs/                        # Documentation
│   └── PROJECT_STRUCTURE.md     # Explains all project files
├── pyproject.toml               # Package configuration (main config!)
├── setup.py                     # Legacy compatibility
├── Makefile                     # Command shortcuts
├── MANIFEST.in                  # Package manifest
├── .pre-commit-config.yaml      # Pre-commit hooks
├── .gitignore                   # Git ignore rules
├── .editorconfig                # Editor settings
├── .python-version              # Python version (for pyenv)
└── LICENSE                      # MIT License
```

## License

MIT License - see [LICENSE](LICENSE) for details.

## Maintainer

**Bhanu Pratap Singh Rathore**

## Acknowledgments

- [Surya OCR](https://github.com/VikParuchuri/surya) for the excellent OCR engine
- [PyMuPDF](https://pymupdf.readthedocs.io/) for digital PDF text extraction and manipulation
- [deep-translator](https://github.com/nidhaloff/deep-translator) for the translation API wrapper
- [pdf2image](https://github.com/Belval/pdf2image) for PDF rendering
