Metadata-Version: 2.2
Name: bocr
Version: 0.2.0
Summary: A Python package for OCR using Vision LLMs
Author-email: Adrian Phoulady <adrian.phoulady@gmail.com>
License: MIT
Project-URL: Repository, https://github.com/adrianphoulady/bocr
Keywords: ocr,vision,llm,vllm,bocr,text extraction,qwen-vl,llama-vision,phi-vision,ollama
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: accelerate>=0.34.0
Requires-Dist: torch<2.6.0,>=2.4.0
Requires-Dist: torchvision<0.21.0,>=0.19.0
Requires-Dist: transformers>=4.49
Requires-Dist: qwen_vl_utils>=0.0.10
Requires-Dist: ollama>=0.1
Requires-Dist: opencv-python-headless>=4.10.0.84
Requires-Dist: pdf2image>=1.16.0
Requires-Dist: pillow>=10.0.0
Requires-Dist: markdown>=3.4.0
Requires-Dist: pypandoc>=1.11

# bOCR: OCR Framework with Vision LLMs

**bOCR** is an Optical Character Recognition (OCR) framework that uses Vision Large Language Models (VLLMs) for text extraction and document processing.

## Features

- **Minimal Setup**: Requires just a single backbone file (e.g., `qwen.py` or `ollamas.py`) for OCR execution, making it lightweight and easy to use.
- **Broad Vision LLM Support**: Integrates with vision LLMs like `Qwen`, `Llama`, `Phi`, and various VLLMs included in the `Ollama` package.
- **Customizable Prompts**: Fine-tune OCR output using either a custom or default prompt.
- **Automated Preprocessing**: Image denoising, resizing, and PDF-to-image conversion.
- **Postprocessing & Export**: Supports merging pages and multiple export formats (`txt`, `md`, `docx`, `pdf`).
- **Configurable Pipeline**: A single `Config` object centralizes OCR settings.
- **Detailed Logging**: Integrated verbose logging for insights and debugging.

---

## Installation

### Install from PyPI (Recommended)

```bash
pip install bocr
```

### Install from Source (Development Version)

```bash
git clone https://github.com/adrianphoulady/bocr.git
cd bocr
pip install .
```

### Required Dependencies

Beyond the Python packages, bOCR relies on a few system tools: `poppler` for PDF-to-image conversion, `pandoc` for document export, and a LaTeX distribution for PDF output. You can install them as follows:

#### Linux (Debian/Ubuntu)
```bash
sudo apt install poppler-utils pandoc texlive-xetex texlive-fonts-recommended lmodern
```

#### macOS (using Homebrew)
```bash
brew install poppler pandoc
brew install --cask mactex-no-gui
```

#### Windows (using Chocolatey)
```powershell
choco install poppler pandoc miktex
```

---

## Quick Start

### Simple Example (Single File OCR)

A single backbone from the `backbones` package, such as `qwen.py`, is all you need to run OCR on an image:

```python
from bocr.backbones.qwen import extract_text

result = extract_text("sample1.png")
print(result)
```

---

### Advanced Usage

```python
from bocr import Config, ocr

config = Config(model_id="Qwen/Qwen2-VL-7B-Instruct", export_results=True, export_format="pdf", verbose=True)
files = ["sample2.pdf"]
results = ocr(files, config)
print(results)
```

### Command Line Example

```bash
bocr sample1.jpg --export-results --export-format docx --verbose
```
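
The flags in this command share their names with `Config` fields. As a rough illustration only (this is not the package's actual CLI code), a parser accepting the same three flags could be sketched with `argparse`. The `build_parser` helper is hypothetical, and the default and choices for `--export-format` are assumptions taken from the `export_format` row of the Configuration table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the command above."""
    parser = argparse.ArgumentParser(prog="bocr")
    parser.add_argument("files", nargs="+", help="Input images, PDFs, or URLs")
    parser.add_argument("--export-results", action="store_true")
    parser.add_argument(
        "--export-format",
        default="md",  # assumed from the Configuration table
        choices=["txt", "md", "docx", "pdf"],
    )
    parser.add_argument("--verbose", action="store_true")
    return parser


args = build_parser().parse_args(
    ["sample1.jpg", "--export-results", "--export-format", "docx", "--verbose"]
)
# argparse converts dashes to underscores, so the attribute names
# match the Config field names (export_results, export_format, verbose).
print(args.files, args.export_results, args.export_format, args.verbose)
```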

---

## Configuration

The `Config` class centralizes OCR settings. Key parameters:

| Parameter        | Type         | Description                                            | Default                       |
|------------------|--------------|--------------------------------------------------------|-------------------------------|
| `prompt`         | `str`/`None` | Custom OCR prompt or `None` for default.               | `None`                        |
| `model_id`       | `str`        | Vision LLM model identifier.                           | `Qwen/Qwen2.5-VL-3B-Instruct` |
| `max_new_tokens` | `int`        | Max tokens generated by model.                         | `1024`                        |
| `preprocess`     | `bool`       | Enable preprocessing of input files.                   | `False`                       |
| `resolution`     | `int`        | DPI for PDF-to-image conversion.                       | `150`                         |
| `max_image_size` | `int`/`None` | Resize images to a max size. No resizing if `None`.    | `1920`                        |
| `result_format`  | `str`        | Output format (`plain`, `markdown`).                   | `md`                          |
| `merge_text`     | `bool`       | Merge extracted text.                                  | `False`                       |
| `export_results` | `bool`       | Save results to files.                                 | `False`                       |
| `export_format`  | `str`        | File output format (`txt`, `md`, `docx`, `pdf`).       | `md`                          |
| `export_dir`     | `str`/`None` | Directory for output files. `./ocr_exports` if `None`. | `None`                        |
| `verbose`        | `bool`       | Enables detailed logging for debugging.                | `False`                       |

---

## OCR Pipeline

### 1. Preprocessing

- **URL Handling**: Downloads remote files if the input is a URL.
- **PDF Conversion**: Converts PDFs into image format (requires `poppler` installed and in `PATH`).
- **Image Enhancement**: Applies denoising and contrast adjustment.
- **Resizing**: Scales images to fit within `max_image_size` before they are passed to the Vision LLM.
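
To get a feel for how `resolution` and `max_image_size` interact, the arithmetic can be sketched in plain Python. This is not bOCR's internal code: `rendered_size` and `fit_within` are hypothetical helpers, and the sketch assumes that `max_image_size` bounds the longer side of the image and that images are never upscaled.

```python
from typing import Optional, Tuple


def rendered_size(width_in: float, height_in: float, dpi: int) -> Tuple[int, int]:
    """Pixel dimensions of a page rasterized at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def fit_within(width: int, height: int, max_size: Optional[int]) -> Tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_size.

    Assumption: max_size bounds the longer side; smaller images are left alone.
    """
    longest = max(width, height)
    if max_size is None or longest <= max_size:
        return width, height
    scale = max_size / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


# A US Letter page (8.5 x 11 in) at the default 150 DPI.
page = rendered_size(8.5, 11, 150)
print(page)                     # (1275, 1650)
print(fit_within(*page, 1920))  # already within the limit, so unchanged

# The same page at 300 DPI exceeds 1920 px and is scaled down.
print(fit_within(*rendered_size(8.5, 11, 300), 1920))  # (1484, 1920)
```

Under these assumptions, a Letter-sized page at the default 150 DPI already fits within the default `max_image_size` of 1920, so resizing only matters for higher resolutions or larger inputs.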

### 2. Text Extraction

- Extracts text using Vision LLMs, with support for custom prompts for tailored OCR instructions.

### 3. Postprocessing

- Formats the extracted text in the specified result format and merges pages if requested.
- Converts the text into the specified export format (e.g., Markdown, PDF).
- Saves results if configured.
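
The merge and save steps can be pictured with a small standard-library sketch. It is illustrative only: `merge_pages` and `export_path` are hypothetical helpers, and naming the output file after the input file's stem is an assumption. The `./ocr_exports` fallback follows the `export_dir` row of the Configuration table.

```python
from pathlib import Path
from typing import List, Optional


def merge_pages(pages: List[str], separator: str = "\n\n") -> str:
    """Join per-page text into one string, skipping empty pages."""
    return separator.join(p.strip() for p in pages if p.strip())


def export_path(source: str, export_format: str = "md",
                export_dir: Optional[str] = None) -> Path:
    """Build an output path, falling back to ./ocr_exports when export_dir is None."""
    directory = Path(export_dir) if export_dir is not None else Path("ocr_exports")
    # Assumption: the output file reuses the input file's stem.
    return directory / f"{Path(source).stem}.{export_format}"


text = merge_pages(["# Page 1\nHello", "", "Page 2 text\n"])
target = export_path("sample2.pdf", export_format="txt")
print(target)

# Writing the file is left to the caller, for example:
# target.parent.mkdir(parents=True, exist_ok=True)
# target.write_text(text, encoding="utf-8")
```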

---

## Logging

Enable logging by setting `verbose=True` in the `Config` object. Logs provide insights into preprocessing, extraction, and postprocessing steps.
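
How a `verbose` flag typically translates into log output can be sketched with Python's standard `logging` module. Whether bOCR uses `logging` internally is an assumption here; `configure_logging` and the logger name `bocr_demo` are hypothetical and exist only for illustration.

```python
import logging


def configure_logging(verbose: bool) -> logging.Logger:
    """Hypothetical helper: DEBUG output when verbose, warnings only otherwise."""
    logger = logging.getLogger("bocr_demo")  # illustrative logger name
    logger.setLevel(logging.DEBUG if verbose else logging.WARNING)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger


log = configure_logging(verbose=True)
log.debug("preprocessing: converting PDF pages to images")
log.info("extraction: running model on page 1")
```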

---

## Supported Models

bOCR supports Vision LLMs such as:

- `Qwen/Qwen2.5-VL-3B-Instruct`
- `Qwen/Qwen2.5-VL-7B-Instruct`
- `Qwen/Qwen2.5-VL-72B-Instruct`
- `Qwen/Qwen2-VL-2B-Instruct`
- `Qwen/Qwen2-VL-7B-Instruct`
- `Qwen/Qwen2-VL-72B-Instruct`
- `Qwen/QVQ-72B-Preview`
- `meta-llama/Llama-3.2-11B-Vision-Instruct`
- `meta-llama/Llama-3.2-90B-Vision-Instruct`
- `microsoft/Phi-3.5-vision-instruct`
- `llama3.2-vision:11b` from Ollama
- `llama3.2-vision:90b` from Ollama

Additional models can be supported by implementing a new backbone in `bocr/backbones/` and updating `mappings.yaml`.

---

## License

This project is licensed under the MIT License.
