Metadata-Version: 2.4
Name: nougat-ocr-cli
Version: 0.3.0
Summary: CLI for PDF text extraction using Meta's Nougat model with GPU acceleration
Project-URL: Homepage, https://github.com/r-uben/nougat-ocr-cli
Project-URL: Repository, https://github.com/r-uben/nougat-ocr-cli
Project-URL: Issues, https://github.com/r-uben/nougat-ocr-cli/issues
Author-email: Ruben Fernandez-Fuertes <fernandezfuertesruben@gmail.com>
License: MIT
License-File: LICENSE
Keywords: cli,document,extraction,gpu,nougat,ocr,pdf
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: General
Classifier: Topic :: Utilities
Requires-Python: <3.13,>=3.11
Requires-Dist: albumentations==1.3.1
Requires-Dist: click>=8.1.0
Requires-Dist: nougat-ocr>=0.1.17
Requires-Dist: pillow>=9.0.0
Requires-Dist: pypdfium2<5.0.0,>=4.30.0
Requires-Dist: rich>=13.0.0
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers<4.36.0,>=4.35.0
Provides-Extra: dev
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.8.0; extra == 'dev'
Description-Content-Type: text/markdown

# Nougat OCR CLI

[![CI](https://github.com/r-uben/nougat-ocr-cli/actions/workflows/ci.yml/badge.svg)](https://github.com/r-uben/nougat-ocr-cli/actions/workflows/ci.yml)
[![PyPI version](https://badge.fury.io/py/nougat-ocr-cli.svg)](https://badge.fury.io/py/nougat-ocr-cli)
[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A command-line tool for OCR with Meta's [Nougat](https://github.com/facebookresearch/nougat) model. It extracts text from PDFs as Markdown, with GPU acceleration on CUDA and Apple Metal (MPS).

## Installation

Requires Python 3.11. A GPU is recommended but not required; CPU inference works but is slow.

```bash
pip install nougat-ocr-cli
```

Or install from source (requires [uv](https://github.com/astral-sh/uv)):

```bash
git clone https://github.com/r-uben/nougat-ocr-cli.git
cd nougat-ocr-cli
uv sync
```

## Quick start

```bash
# Process a single file
nougat-ocr paper.pdf

# Process a directory
nougat-ocr ./papers/ -o ./results/

# Preview what would be processed (no model loading)
nougat-ocr ./papers/ --dry-run

# Process specific pages (zero-indexed)
nougat-ocr paper.pdf --pages 0-5

# Use CPU instead of GPU
nougat-ocr paper.pdf --device cpu
```
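
To drive the CLI from a Python script, the commands above can be assembled as an argument list and passed to `subprocess`. This is a minimal sketch that uses only the flags documented in this README; the helper names `build_command` and `run` are hypothetical and not part of the package.

```python
import subprocess


def build_command(input_path, output_dir=None, pages=None, device=None, dry_run=False):
    """Assemble an argv list for nougat-ocr (hypothetical helper, not shipped with the package)."""
    cmd = ["nougat-ocr", str(input_path)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]
    if pages is not None:
        cmd += ["--pages", pages]
    if device is not None:
        cmd += ["--device", device]
    if dry_run:
        cmd.append("--dry-run")
    return cmd


def run(cmd):
    """Run the command; check=True raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)


# Build (but do not execute) the dry-run command for a directory of papers.
print(build_command("./papers/", output_dir="./results/", dry_run=True))
```

Passing a list rather than a single shell string avoids quoting problems with paths that contain spaces.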

## Options

```
Usage: nougat-ocr [OPTIONS] INPUT_PATH

Options:
  -o, --output-dir PATH           Output directory (default: <input_dir>/nougat_ocr_output/)
  --model TEXT                    Nougat model tag (default: 0.1.0-base)
  --batch-size N                  Batch size for inference (auto-detected if not set)
  --full-precision                Use FP32 instead of BF16 (slower but more accurate)
  --pages TEXT                    Page range (e.g., '0-5' or '1,3,5')
  --device [auto|cuda|mps|cpu]    Device for inference (default: auto)

  --reprocess                     Reprocess already-processed files
  --dry-run                       List files without loading the model
  -q, --quiet                     Suppress all output except errors
  -v, --verbose                   Enable verbose/debug output
  --info                          Show device and system info
  --version                       Show version
  --help                          Show this message
```
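
The `--pages` value combines ranges and comma-separated page numbers. The sketch below shows one way such a string could be expanded into a list of zero-indexed pages. It is illustrative only: the function name `parse_pages` is hypothetical, and treating `0-5` as inclusive of both ends is an assumption, not confirmed behavior of the CLI.

```python
def parse_pages(spec):
    """Expand a page spec such as '0-5' or '1,3,5' into a sorted list of page indices.

    Assumption: ranges are inclusive on both ends.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start, end = part.split("-", 1)
            lo, hi = int(start), int(end)
            if lo > hi:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(lo, hi + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_pages("0-5"))    # [0, 1, 2, 3, 4, 5]
print(parse_pages("1,3,5"))  # [1, 3, 5]
```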

## Output structure

```
nougat_ocr_output/
├── document_name/
│   └── document_name.md        # OCR markdown (clean text only)
├── another_document/
│   └── ...
└── metadata.json               # processing stats, checksums, file list
```
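
Given this layout, downstream scripts can collect the results with `pathlib`. The following is a rough sketch: the helper names are hypothetical, and because the schema of `metadata.json` is not documented here, the file is loaded as a plain dictionary without assuming any particular keys.

```python
import json
from pathlib import Path


def collect_markdown(output_dir):
    """Map each document directory name to the text of its Markdown file."""
    results = {}
    for md_file in sorted(Path(output_dir).glob("*/*.md")):
        results[md_file.parent.name] = md_file.read_text(encoding="utf-8")
    return results


def load_metadata(output_dir):
    """Return metadata.json as a dict, or None if the file is absent."""
    meta_path = Path(output_dir) / "metadata.json"
    if not meta_path.is_file():
        return None
    with meta_path.open(encoding="utf-8") as fh:
        return json.load(fh)
```

For example, `collect_markdown("papers/nougat_ocr_output")` would return one entry per processed document.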

## Device selection

With `--device auto` (the default), the CLI picks the first available device in this order:

1. **CUDA** — NVIDIA GPUs (fastest)
2. **MPS** — Apple Metal on M-series Macs
3. **CPU** — fallback (slow, not recommended for large documents)

Override with `--device cuda|mps|cpu`.
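
The priority order above can be sketched as a small function. This is an illustration of the selection logic rather than the tool's actual code: `pick_device` is a hypothetical name, and the availability flags are passed in so the example runs without PyTorch installed.

```python
def pick_device(requested="auto", cuda_available=False, mps_available=False):
    """Return the device name to use, preferring CUDA, then MPS, then CPU.

    With PyTorch, the flags could be obtained from
    torch.cuda.is_available() and torch.backends.mps.is_available().
    """
    if requested != "auto":
        return requested  # an explicit --device choice wins
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


print(pick_device("auto", cuda_available=False, mps_available=True))  # mps
```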

## Development

```bash
# Install dev dependencies
uv sync --extra dev

# Run tests
uv run pytest

# Lint
uv run ruff check .

# Format
uv run ruff format .

# Type check
uv run mypy nougat_ocr/ --ignore-missing-imports
```

## Limitations

- Python 3.11 only (nougat-ocr dependency constraint)
- Model weights: ~1.3 GB (auto-downloaded on first run)
- GPU strongly recommended for reasonable performance
- Supported formats: PDF, JPG, JPEG, PNG, WEBP, BMP, TIFF
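
To check which files in a directory match the supported formats before running the tool, a filter along these lines may help. It is a sketch under assumptions: `find_inputs` is a hypothetical helper, matching is case-insensitive, and subdirectories are not searched, none of which is confirmed as the CLI's own behavior.

```python
from pathlib import Path

# Extensions taken from the supported formats listed above.
SUPPORTED_SUFFIXES = {".pdf", ".jpg", ".jpeg", ".png", ".webp", ".bmp", ".tiff"}


def find_inputs(path):
    """Return supported files at `path` (a single file or a directory, non-recursive)."""
    p = Path(path)
    if p.is_file():
        return [p] if p.suffix.lower() in SUPPORTED_SUFFIXES else []
    return sorted(
        f for f in p.iterdir()
        if f.is_file() and f.suffix.lower() in SUPPORTED_SUFFIXES
    )
```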

## License

MIT License - see [LICENSE](LICENSE) for details.
