Metadata-Version: 2.4
Name: emx-mistral-ocr-cli
Version: 0.1.1
Summary: CLI tool for converting PDF documents to Markdown or HTML using Mistral OCR.
License: MIT License
        
        Copyright (c) 2026 emmtrix Technologies GmbH
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Repository, https://github.com/emmtrix/emx-mistral-ocr-cli
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: markdown
Requires-Dist: mistralai
Requires-Dist: pypdf
Dynamic: license-file

# emx-mistral-ocr-cli

CLI tool for converting PDF documents to Markdown or HTML using Mistral OCR.

## Features

- PDF -> Markdown (default) or HTML output
- Automatic output format detection from `--out` extension (`.html`/`.htm` -> HTML)
- Optional page selection via `--pages` (`1-12`, `2,5,10-12`, ...)
- Optional local PDF slicing before upload (`--slice-pdf`) to help with very large PDFs (e.g. >1000 pages)
- Optional extracted image export
- HTML mode with embedded HTML tables and built-in CSS styling
- Local chapter index analysis before OCR (`--analyze-index`)
- Retry handling for temporary Mistral API errors
- Safe output behavior (no overwrite without `--force`)
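
The retry feature listed above could look roughly like the following exponential-backoff sketch. This is an illustration only, not the tool's actual implementation: `TransientError`, `call_with_retries`, the attempt count, and the delays are all hypothetical names and values chosen for the example.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a temporary API failure (assumption)."""


def call_with_retries(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(); on TransientError wait 1x, 2x, 4x ... base_delay and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.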

## Requirements

- Python 3.10+
- A valid Mistral API key in the environment variable `MISTRAL_API_KEY`

Install dependencies:

```bash
pip install -r requirements.txt
```

## Setup

Set your API key:

Linux/macOS (bash/zsh):

```bash
export MISTRAL_API_KEY="your_key_here"
```

PowerShell:

```powershell
$env:MISTRAL_API_KEY="your_key_here"
```

Windows cmd.exe:

```cmd
set MISTRAL_API_KEY=your_key_here
```
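
If you wrap the CLI in your own Python script, a small pre-flight check can catch a missing key before any upload starts. The helper name `get_api_key` is hypothetical and not part of this tool; only the variable name `MISTRAL_API_KEY` comes from this README.

```python
import os
from collections.abc import Mapping


def get_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the API key from the environment or fail with a clear message."""
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("MISTRAL_API_KEY is not set; see the Setup section")
    return key
```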

## Usage

```bash
python mistral_ocr_cli.py <input.pdf> [options]
```

Show help:

```bash
python mistral_ocr_cli.py -h
```

## Common Examples

Default Markdown output:

```bash
python mistral_ocr_cli.py doc.pdf
```

Write Markdown to a specific file:

```bash
python mistral_ocr_cli.py doc.pdf --out result.md
```

HTML output (auto-selected by extension):

```bash
python mistral_ocr_cli.py doc.pdf --out result.html
```

Explicit HTML output:

```bash
python mistral_ocr_cli.py doc.pdf --output-format html --out result.html
```

Process only selected pages:

```bash
python mistral_ocr_cli.py doc.pdf --pages "1-20"
```
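
To make the page specification syntax concrete, here is a minimal sketch of how a spec such as `2,5,10-12` expands into 1-based page numbers. The function `parse_pages` is hypothetical; the tool's real parser may validate or order pages differently.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec like "1-12" or "2,5,10-12" into sorted 1-based page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(parse_pages("2,5,10-12"))  # [2, 5, 10, 11, 12]
```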

Slice selected pages locally before upload:

```bash
python mistral_ocr_cli.py doc.pdf --pages "1150-1200" --slice-pdf --out result.html --force
```

Disable images entirely:

```bash
python mistral_ocr_cli.py doc.pdf --no-images
```

Export images to a custom directory:

```bash
python mistral_ocr_cli.py doc.pdf --images-dir extracted_images
```

Analyze chapter index locally (no OCR call):

```bash
python mistral_ocr_cli.py doc.pdf --analyze-index
```

Analyze the chapter index and write it to a file:

```bash
python mistral_ocr_cli.py doc.pdf --analyze-index --chapter-index-out index.tsv --force
```

## Options

- `--out <path>`: Output file path
- `--output-format {markdown,html}`: Output format (default: `markdown`)
- `--force`: Overwrite existing outputs
- `--pages "<spec>"`: 1-based page selection, e.g. `1-12`, `2,5,10-12`
- `--slice-pdf`: Build a temporary sliced PDF locally before upload (requires `--pages`). Useful when Mistral rejects very large PDFs (e.g. >1000 pages) and you want to process them in chunks.
- `--images-dir <dir>`: Directory for extracted images (default: `<out_stem>_images`)
- `--no-images`: Disable image extraction/export
- `--image-limit <n>`: Maximum number of images to extract
- `--image-min-size <px>`: Minimum image width/height
- `--no-header-footer`: Disable header/footer extraction
- `--chapter-index-out <file>`: Write the local chapter index to this file
- `--analyze-index`: Run the local chapter index analysis and exit (no OCR call)
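
The `<out_stem>_images` default and the `--force` rule can be illustrated with the sketch below. Both helpers are hypothetical, and placing the image directory next to the output file is an assumption; the README only states the directory's name pattern.

```python
from pathlib import Path


def default_images_dir(out_path: str) -> Path:
    """Derive "<out_stem>_images" beside the output file (location is an assumption)."""
    out = Path(out_path)
    return out.with_name(f"{out.stem}_images")


def check_writable(path: Path, force: bool = False) -> None:
    """Refuse to overwrite an existing path unless force is set."""
    if path.exists() and not force:
        raise FileExistsError(f"{path} exists; pass --force to overwrite")


print(default_images_dir("result.html"))  # result_images
```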

## Notes

- In HTML mode, OCR tables are requested as HTML and embedded into the final HTML document. HTML is generally more expressive than Markdown for complex layouts (e.g. tables with `colspan`/`rowspan`, which standard Markdown tables do not support).
- For large PDFs, `--slice-pdf` can still take time (PDF parsing/writing), but it reduces the upload size and the amount of content processed, and it can avoid API errors for extremely large documents (e.g. >1000 pages).
- `--analyze-index` is useful to discover chapter boundaries and page numbers so you can select specific chapters via `--pages`.
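
To process a very large PDF in chunks, you can generate one `--pages` spec per chunk and run the CLI once for each. The sketch below only prints the commands; the helper `chunk_page_specs`, the chunk size of 500, and the `partN.md` output names are illustrative assumptions, not defaults of this tool.

```python
def chunk_page_specs(total_pages: int, chunk_size: int = 500) -> list[str]:
    """Split pages 1..total_pages into consecutive specs such as "1-500"."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    specs = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        specs.append(str(start) if start == end else f"{start}-{end}")
    return specs


# Print one command per chunk for a 1200-page document (hypothetical file names).
for i, spec in enumerate(chunk_page_specs(1200), start=1):
    print(f'python mistral_ocr_cli.py doc.pdf --pages "{spec}" --slice-pdf --out part{i}.md')
```

The page count itself can be read from the `--analyze-index` output or from any PDF viewer.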
