Metadata-Version: 2.1
Name: opticr
Version: 0.2.0
Summary: expose a single interface and API to few OCR tools
License: Apache-2.0
Author: lzayep
Author-email: ec@lza.sh
Requires-Python: >=3.10,<4.0
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Requires-Dist: ocrmypdf (>=14.0.0,<15.0.0)
Requires-Dist: pdf2image (>=1.16.0,<2.0.0)
Requires-Dist: pydantic (>=1.10.2,<2.0.0)
Requires-Dist: pytesseract (>=0.3.10,<0.4.0)
Requires-Dist: requests (>=2.28.1,<3.0.0)
Description-Content-Type: text/markdown

# opticr

A Python library that exposes a single interface and API to a few OCR tools (Google Vision, Tesseract).

## Install
### Required binaries available in the $PATH
#### poppler-utils (pdf2image)

[https://github.com/Belval/pdf2image#how-to-install](https://github.com/Belval/pdf2image#how-to-install)

#### tesseract

[https://tesseract-ocr.github.io](https://tesseract-ocr.github.io/tessdoc/Home.html)
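
Before installing, it can help to confirm that both binaries are reachable. The snippet below is a minimal sketch using only the standard library; `missing_binaries` is a hypothetical helper, not part of opticr, and it assumes that `pdftoppm` (shipped with poppler-utils) and `tesseract` are the executables to look for.

``` python
import shutil


def missing_binaries(names: list[str]) -> list[str]:
    """Return the names that cannot be found on the current $PATH."""
    return [name for name in names if shutil.which(name) is None]


# pdftoppm ships with poppler-utils; tesseract is the Tesseract CLI.
missing = missing_binaries(["pdftoppm", "tesseract"])
if missing:
    print("Missing from $PATH:", ", ".join(missing))
```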

### Install OpticR
#### With pip

``` shell
pip install opticr
```

#### With poetry

``` shell
poetry add opticr
```

Or, to get the latest (possibly unstable) version from the `main` branch:

``` shell
poetry add git+https://github.com/lzayep/opticr@main
```

## Usage

``` python
from opticr import OpticR

ocr = OpticR("tesseract")
pathtofile = "test/contract.pdf"
pages: list[str] = ocr.get_pages(pathtofile)

```

With google-vision:

``` python
from opticr import OpticR

ocr = OpticR("google-vision", options={"google-vision": {"auth": {"token": ""}}})

# the file can also come from a URL
pathtofile = "https://example.com/contract.pdf"
pages: list[str] = ocr.get_pages(pathtofile)

```

Cache the result: if the file has already been OCRed, the previous result is returned immediately.
Results are stored temporarily in local storage or in shared storage such as Redis.

``` python
from opticr import OpticR

ocr = OpticR("tesseract", options={"cache": {"backend": "redis", "redis": "redis://"}})

# the file can also come from a URL
pathtofile = "https://example.com/contract.pdf"
pages: list[str] = ocr.get_pages(pathtofile, cache=True)

```
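
To illustrate the caching idea described above, here is a minimal sketch, not opticr's actual implementation. It assumes results are keyed by a SHA-256 digest of the file contents and kept in a plain dictionary. `PageCache` and `run_ocr` are hypothetical names introduced only for this example.

``` python
import hashlib
from typing import Callable


class PageCache:
    """Hypothetical in-memory cache keyed by a digest of the file bytes."""

    def __init__(self) -> None:
        self._store: dict[str, list[str]] = {}

    def get_pages(
        self, data: bytes, run_ocr: Callable[[bytes], list[str]]
    ) -> list[str]:
        # Identical file contents always map to the same key.
        key = hashlib.sha256(data).hexdigest()
        if key not in self._store:
            # Cache miss: run the expensive OCR step once and remember it.
            self._store[key] = run_ocr(data)
        return self._store[key]
```

A shared backend such as Redis would replace the dictionary with a networked key-value store, most likely with an expiry on each entry so that results are kept only temporarily.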

