Metadata-Version: 2.4
Name: gemini-ocr-cli
Version: 0.3.2
Summary: CLI tool for OCR processing using Google Gemini's vision capabilities
Project-URL: Homepage, https://github.com/r-uben/gemini-ocr-cli
Project-URL: Repository, https://github.com/r-uben/gemini-ocr-cli
Project-URL: Issues, https://github.com/r-uben/gemini-ocr-cli/issues
Author-email: Ruben Fernandez-Fuertes <fernandezfuertesruben@gmail.com>
License: MIT
License-File: LICENSE
Keywords: cli,document-processing,gemini,google,ocr,pdf,vision
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: General
Requires-Python: >=3.11
Requires-Dist: click>=8.1.0
Requires-Dist: google-genai>=1.0.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pymupdf>=1.24.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: rich>=13.0.0
Provides-Extra: dev
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.8.0; extra == 'dev'
Description-Content-Type: text/markdown

# Gemini OCR CLI

[![CI](https://github.com/r-uben/gemini-ocr-cli/actions/workflows/ci.yml/badge.svg)](https://github.com/r-uben/gemini-ocr-cli/actions/workflows/ci.yml)
[![PyPI version](https://badge.fury.io/py/gemini-ocr-cli.svg)](https://badge.fury.io/py/gemini-ocr-cli)
[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

A command-line tool for OCR processing using Google Gemini's vision capabilities. Process PDFs and images to extract text, tables, equations, and figures.

## Choosing an OCR tool

This is one of five OCR CLI tools with a shared design: clean Markdown output, batch processing, and figure extraction. Pick based on your constraints:

| Tool | Engine | Runs | Cost | Best for |
|------|--------|------|------|----------|
| [deepseek-ocr-cli](https://github.com/r-uben/deepseek-ocr-cli) | DeepSeek vision | Local (Ollama / vLLM) | Free | General-purpose local OCR with multi-backend flexibility |
| **gemini-ocr-cli** (this repo) | Google Gemini | Cloud API | Free tier / Pay-per-use | Fast cloud OCR with concurrent processing |
| [marker-ocr-cli](https://github.com/r-uben/marker-ocr-cli) | Marker (Surya + Texify) | Local | Free | Academic papers with equations, tables, complex layouts |
| [mistral-ocr-cli](https://github.com/r-uben/mistral-ocr-cli) | Mistral OCR API | Cloud API | ~$1/1k pages | Structured extraction (tables, headers, footers) |
| [nougat-ocr-cli](https://github.com/r-uben/nougat-ocr-cli) | Meta Nougat | Local (GPU) | Free | Academic papers, GPU-accelerated batch processing |

## Installation

Requires Python 3.11+ and a [Google Gemini API key](https://aistudio.google.com/apikey).

```bash
pip install gemini-ocr-cli
```

Or from source:

```bash
git clone https://github.com/r-uben/gemini-ocr-cli.git
cd gemini-ocr-cli
uv sync
```

## Quick start

```bash
# Set your API key
export GEMINI_API_KEY="your_key_here"

# Process a single file
gemini-ocr document.pdf

# Process a directory
gemini-ocr ./documents -o ./results

# Preview what would be processed (no API calls)
gemini-ocr ./documents --dry-run

# Process 4 files concurrently
gemini-ocr ./documents -w 4
```

## Options

```
Usage: gemini-ocr [OPTIONS] INPUT_PATH

Options:
  -o, --output-dir PATH           Output directory (default: <input_dir>/gemini_ocr_output/)
  --api-key TEXT                  Gemini API key (or set GEMINI_API_KEY env var)
  --model TEXT                    Model to use (default: gemini-3-flash-preview)
  --task [convert|extract|table|describe_figure]
                                  OCR task type (default: convert)
  --prompt TEXT                   Custom prompt for OCR processing

  --include-images/--no-images    Extract embedded images (default: True)
  --save-originals/--no-save-originals  Copy original images to output (default: True)

  -w, --workers N                 Concurrent workers for batch processing (default: 1)
  --reprocess                     Reprocess already-processed files
  --dry-run                       List files without calling the API
  -q, --quiet                     Suppress all output except errors
  -v, --verbose                   Enable verbose/debug output
  --info                          Show configuration and system info
  --env-file PATH                 Path to .env file
  --version                       Show version
  --help                          Show this message
```

## Output structure

```
gemini_ocr_output/
├── document_name/
│   ├── document_name.md        # OCR markdown (clean text only)
│   └── figures/                # extracted embedded images
│       ├── page1_img1.png
│       └── page2_img1.png
├── another_document/
│   └── ...
└── metadata.json               # processing stats, checksums, file list
```

## API key resolution

**Priority order** (highest first):
1. `--api-key` CLI argument
2. `GEMINI_API_KEY` environment variable
3. `GOOGLE_API_KEY` environment variable (fallback)
4. `.env` file in the current directory

## Configuration

The settings below can also be set via environment variables or a `.env` file. The last three rows have no CLI flag:

| CLI flag | Environment variable | Default |
|----------|---------------------|---------|
| `--api-key` | `GEMINI_API_KEY` | (required) |
| `--model` | `GEMINI_MODEL` | `gemini-3-flash-preview` |
| `--include-images` | `GEMINI_INCLUDE_IMAGES` | `true` |
| `--save-originals` | `GEMINI_SAVE_ORIGINAL_IMAGES` | `true` |
| `--workers` | `GEMINI_MAX_WORKERS` | `1` |
| `--verbose` | `GEMINI_VERBOSE` | `false` |
| | `GEMINI_MAX_FILE_SIZE_MB` | `50` |
| | `GEMINI_MAX_RETRIES` | `3` |
| | `GEMINI_RETRY_BASE_DELAY` | `1.0` |

CLI flags override environment variables when explicitly passed.

## Development

```bash
# Install dev dependencies
uv sync --extra dev

# Run tests
uv run pytest

# Lint
uv run ruff check .

# Format
uv run ruff format .

# Type check
uv run mypy gemini_ocr/ --ignore-missing-imports
```

## Limitations

- Maximum file size: 50 MB (configurable via `GEMINI_MAX_FILE_SIZE_MB`)
- Supported formats: PDF, JPG, JPEG, PNG, WEBP, GIF, BMP, TIFF

## License

MIT License - see [LICENSE](LICENSE) for details.
