Metadata-Version: 2.4
Name: mistocr
Version: 0.0.3
Summary: Simple batch OCR for PDFs using Mistral's state-of-the-art vision model
Home-page: https://github.com/franckalbinet/mistocr
Author: Solveit
Author-email: nobody@fast.ai
License: Apache Software License 2.0
Keywords: nbdev jupyter notebook python
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: Apache Software License
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastcore
Requires-Dist: mistralai
Requires-Dist: pillow
Requires-Dist: dotenv
Provides-Extra: dev
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# mistocr


<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Why mistocr?

**Performance**: Mistral’s OCR delivers state-of-the-art accuracy on
complex documents including tables, charts, and multi-column layouts.

**Scale**: Process entire folders of PDFs in a single batch job. Upload
once, process asynchronously, and retrieve results when ready - perfect
for large document sets.

**Cost savings**: Batch OCR mode reduces costs from \$1/1000 pages to
\$0.50/1000 pages - a 50% reduction compared to synchronous processing.
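
As a rough illustration of the figures above, the sketch below estimates the cost of a job at both rates. The name `estimate_cost` and the two constants are illustrative only (they are not part of mistocr) and simply restate the prices quoted here, which may change.

``` python
# Prices in USD per 1000 pages, as quoted above (assumed, may change).
SYNC_PRICE_PER_1000 = 1.00
BATCH_PRICE_PER_1000 = 0.50

def estimate_cost(n_pages: int, batch: bool = True) -> float:
    "Estimated OCR cost in USD for `n_pages` pages."
    rate = BATCH_PRICE_PER_1000 if batch else SYNC_PRICE_PER_1000
    return n_pages * rate / 1000

pages = 12_000
print(f"sync:  ${estimate_cost(pages, batch=False):.2f}")  # sync:  $12.00
print(f"batch: ${estimate_cost(pages):.2f}")               # batch: $6.00
```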

**Simplicity**: A single `ocr()` function handles everything -
uploading, batch submission, polling for completion, and saving results
as markdown with extracted images. Process one PDF or an entire folder
with the same simple interface.

**Organized output**: Each PDF is automatically saved to its own folder
with pages as separate markdown files and images in an `img` subfolder,
making results easy to navigate and process further.

## Installation

Install the latest version from the GitHub
[repository](https://github.com/franckalbinet/mistocr):

``` sh
$ pip install git+https://github.com/franckalbinet/mistocr.git
```

or from [PyPI](https://pypi.org/project/mistocr/):

``` sh
$ pip install mistocr
```

## How to use

``` python
from mistocr.core import ocr
```

- **Process a single PDF:**

``` python
fname = 'files/test/attention-is-all-you-need.pdf'
result = ocr(fname)
```

    files/test/md/attention-is-all-you-need:
    img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
    page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
    page_10.md  page_13.md  page_2.md   page_5.md  page_8.md

    files/test/md/attention-is-all-you-need/img:
    img-0.jpeg  img-1.jpeg  img-2.jpeg  img-3.jpeg  img-4.jpeg

- **Or process an entire folder:**

``` python
results = ocr('files/test')
```

    files/test/md:
    attention-is-all-you-need/  resnet/

    files/test/md/attention-is-all-you-need:
    img/        page_11.md  page_14.md  page_3.md  page_6.md  page_9.md
    page_1.md   page_12.md  page_15.md  page_4.md  page_7.md
    page_10.md  page_13.md  page_2.md   page_5.md  page_8.md

    files/test/md/attention-is-all-you-need/img:
    img-0.jpeg  img-1.jpeg  img-2.jpeg  img-3.jpeg  img-4.jpeg

    files/test/md/resnet:
    img/       page_10.md  page_12.md  page_3.md  page_5.md  page_7.md  page_9.md
    page_1.md  page_11.md  page_2.md   page_4.md  page_6.md  page_8.md

    files/test/md/resnet/img:
    img-0.jpeg  img-2.jpeg  img-4.jpeg  img-6.jpeg
    img-1.jpeg  img-3.jpeg  img-5.jpeg
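
Because pages are saved as `page_1.md`, `page_2.md`, and so on, a plain alphabetical sort places `page_10.md` before `page_2.md`. The helper below is a minimal sketch, not part of mistocr, that joins one document's pages in numeric order. It assumes the folder layout shown above; `read_document` is a hypothetical name.

``` python
import re
from pathlib import Path

# Matches file names such as `page_12.md` and captures the page number.
PAGE_RE = re.compile(r"page_(\d+)\.md")

def read_document(doc_dir) -> str:
    "Join all `page_N.md` files in `doc_dir` in numeric page order."
    numbered = []
    for p in Path(doc_dir).iterdir():
        m = PAGE_RE.fullmatch(p.name)
        if m:  # skips `img/` and any other non-page entries
            numbered.append((int(m.group(1)), p))
    numbered.sort(key=lambda item: item[0])
    return "\n\n".join(p.read_text(encoding="utf-8") for _, p in numbered)
```

For the example above, `read_document('files/test/md/resnet')` would return the text of all twelve pages as a single string.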

- **Customize the output:**

``` python
results = ocr('files/test', out_dir='output', inc_img=False, poll_interval=5)
```

**Parameters:**

- **`path`**: A single PDF file or folder containing multiple PDFs
- **`out_dir`**: Directory name for saving markdown output (default:
  `'md'`)
- **`inc_img`**: Include extracted images in the output (default:
  `True`)
- **`key`**: Your Mistral API key (uses `MISTRAL_API_KEY` environment
  variable if not provided)
- **`poll_interval`**: Seconds between batch job status checks (default:
  `2`)

**Returns:** List of paths to the generated markdown files
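
When no `key` is passed, the `MISTRAL_API_KEY` environment variable is used. If you prefer to fail early with a clear message before any upload starts, a small check along these lines may help. This is a sketch only: `resolve_key` is a hypothetical helper, and mistocr's own key handling may differ.

``` python
import os

def resolve_key(key=None):
    "Return `key` if given, else the `MISTRAL_API_KEY` environment variable."
    key = key or os.environ.get("MISTRAL_API_KEY")
    if not key:
        raise ValueError("No API key: pass `key=...` or set MISTRAL_API_KEY")
    return key
```

The result can then be passed explicitly, for example `ocr('files/test', key=resolve_key())`.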

## Developer Guide

If you are new to `nbdev`, here are some useful pointers to get you
started.

### Install mistocr in development mode

``` sh
# make sure mistocr package is installed in development mode
$ pip install -e .

# make changes under nbs/ directory
# ...

# compile to have changes apply to mistocr
$ nbdev_prepare
```

### Documentation

Documentation is hosted on this GitHub
[repository](https://github.com/franckalbinet/mistocr)’s
[pages](https://franckalbinet.github.io/mistocr/). Package-manager-specific
guidelines are available on
[conda](https://anaconda.org/franckalbinet/mistocr) and
[PyPI](https://pypi.org/project/mistocr/).
