Metadata-Version: 2.4
Name: RobustDocOCR
Version: 1.0.3
Summary: A robust preprocessing pipeline for document OCR that significantly improves Tesseract accuracy on mobile-captured ID documents
Home-page: https://github.com/3bsalam-1/RobustDocOCR
Author: Ahmed Mohamed
Author-email: Ahmed Mohamed <3bsalam0@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/3bsalam-1/RobustDocOCR
Project-URL: Documentation, https://github.com/3bsalam-1/RobustDocOCR/blob/main/docs
Project-URL: Repository, https://github.com/3bsalam-1/RobustDocOCR
Project-URL: Issues, https://github.com/3bsalam-1/RobustDocOCR/issues
Project-URL: Changelog, https://github.com/3bsalam-1/RobustDocOCR/blob/main/CHANGELOG.md
Keywords: ocr,document,preprocessing,image-processing,tesseract,computer-vision,document-analysis,text-recognition
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Image Processing
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Multimedia :: Graphics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: opencv-python-headless>=4.5.0
Requires-Dist: numpy>=1.20.0
Requires-Dist: pillow>=9.0.0
Requires-Dist: matplotlib>=3.5.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=3.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=4.0.0; extra == "dev"
Requires-Dist: isort>=5.0.0; extra == "dev"
Requires-Dist: mypy>=0.900; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Requires-Dist: build>=0.8.0; extra == "dev"
Provides-Extra: ocr
Requires-Dist: pytesseract>=0.3.0; extra == "ocr"
Provides-Extra: all
Requires-Dist: RobustDocOCR[dev,ocr]; extra == "all"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# Robust Document OCR Preprocessing Pipeline

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![Code Style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![PyPI version](https://img.shields.io/pypi/v/RobustDocOCR)](https://pypi.org/project/RobustDocOCR/)

A robust preprocessing pipeline for document OCR that improves Tesseract accuracy on mobile-captured ID documents by deskewing, binarizing, and denoising each image before recognition.

## 🚀 Quick Start

### Installation

```bash
# Install from PyPI
pip install RobustDocOCR

# Install with OCR support (quotes keep shells such as zsh from globbing the brackets)
pip install "RobustDocOCR[ocr]"

# Install with development dependencies
pip install "RobustDocOCR[dev]"
```

### Basic Usage

```python
from robustdococr import preprocess_document, load_image

# Load your document image
image = load_image("document.jpg")

# Apply preprocessing pipeline
results = preprocess_document(image, show_steps=True)

# Access preprocessed image
preprocessed_image = results['final']
```

### Command Line Interface

```bash
# Process single image
robustdococr input.jpg --output output.jpg

# Process with intermediate steps display
robustdococr input.jpg --show-steps
```

## 📦 Features

### 4-Stage Preprocessing Pipeline

1. **Deskewing**: Straightens rotated documents using Hough transform
2. **Binarization**: Converts images to black & white using adaptive thresholding
3. **Noise Removal**: Cleans up artifacts using two-stage denoising (one pass before binarization, one after)
4. **OCR Ready**: Produces optimized images for Tesseract OCR

### Key Technical Features

- **Adaptive Thresholding**: Handles varying lighting conditions (shadows, glare)
- **Hough Transform Deskewing**: Robust rotation correction (±45°)
- **Two-Stage Denoising**: Preserves text while removing artifacts
- **96% Text Retention**: Minimal text loss during preprocessing
- **Tesseract Optimized**: Produces images ideal for OCR engines

## 🎯 Performance Metrics

| Metric | Value |
|--------|-------|
| **Text Retention Rate** | 96% |
| **Character Improvement** | +12% |
| **Quality Distribution** | 85% Excellent, 12% Good, 3% Fair |
| **Rotation Correction** | Handles ±45° rotation effectively |

## 📂 Project Structure

```text
robustdococr/
 ├── preprocessing/          # Core preprocessing modules
 │   ├── deskewing.py        # Image straightening
 │   ├── binarization.py     # Adaptive thresholding
 │   ├── noise_removal.py    # Artifact cleaning
 │   └── pipeline.py         # Complete pipeline
 ├── utils/                  # Utility functions
 │   ├── image_utils.py      # Image utilities
 │   ├── ocr_utils.py        # OCR utilities
 │   └── visualization.py    # Visualization tools
 ├── cli.py                  # CLI entry point
 ├── main.py                 # Main module
 └── __init__.py             # Package initialization
tests/                      # Test suite
examples/                   # Example scripts
notebooks/                  # Jupyter notebooks
docs/                       # Documentation
```

## 🔧 Configuration

### Requirements

- Python 3.8+
- OpenCV
- NumPy
- Pillow
- Matplotlib (for visualization)
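- Tesseract OCR (optional, for OCR features; the `ocr` extra installs only the `pytesseract` wrapper, so the Tesseract engine itself must be installed separately)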
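
Because Tesseract is optional, it can be useful to check whether OCR is usable before calling any OCR features. The snippet below is a minimal sketch using only the standard library; `ocr_backend_status` is a hypothetical helper name and is not part of this package's API.

```python
import importlib.util
import shutil


def ocr_backend_status() -> dict:
    """Report whether the optional OCR pieces can be found on this machine."""
    return {
        # Python wrapper, installed by the "ocr" extra
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        # Tesseract engine, a separate program that must be on PATH
        "tesseract_binary": shutil.which("tesseract") is not None,
    }


if __name__ == "__main__":
    for name, found in ocr_backend_status().items():
        print(f"{name}: {'found' if found else 'missing'}")
```

If either entry is reported as missing, the preprocessing stages should still be usable on their own, since only the OCR features depend on Tesseract.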

### Installation Options

```bash
# Basic installation
pip install RobustDocOCR

# Development installation (includes test and dev dependencies)
pip install "RobustDocOCR[dev]"

# Installation with OCR support
pip install "RobustDocOCR[ocr]"

# Installation with all extras
pip install "RobustDocOCR[all]"
```

## 📊 Technical Specifications

### Deskewing Algorithm

- **Edge Detection**: Canny edge detector with thresholds (50, 150)
- **Line Detection**: Hough Line Transform with threshold 200
- **Angle Calculation**: Median angle from detected lines for robustness
- **Rotation**: Affine transformation with cubic interpolation

### Binarization Algorithm

- **CLAHE Enhancement**: Contrast Limited Adaptive Histogram Equalization
  - `clipLimit`: 2.0
  - `tileGridSize`: (8, 8)
- **Adaptive Thresholding**: Gaussian-weighted local thresholding
  - `blockSize`: 25
  - `C`: 10
  - `adaptiveMethod`: `ADAPTIVE_THRESH_GAUSSIAN_C`

### Noise Removal Algorithm

- **Stage 1**: Non-Local Means Denoising (`h=10`) applied before binarization
- **Stage 2**: Morphological operations (2×2 kernel, 1 iteration) applied after binarization

## 🧪 Testing

Run the test suite:

```bash
pytest
```

Run tests with coverage:

```bash
pytest --cov=robustdococr --cov-report=html
```

## 📚 Documentation

- [Architecture Documentation](docs/architecture.md)
- [Usage Guide](docs/usage-guide.md)
- [Decision Log](docs/decision-log.md)
- [API Reference](docs/api-reference.md)
- [Kaggle Notebook](https://www.kaggle.com/code/ahmedmohamedab/robust-document-ocr-preprocessing-pipeline) - Complete preprocessing pipeline demonstration

## 🤝 Contributing

We welcome contributions! Please see our:

- [Contributing Guidelines](CONTRIBUTING.md)
- [Code of Conduct](CODE_OF_CONDUCT.md)
- [Issue Templates](.github/ISSUE_TEMPLATE/)

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🎓 Citation

If you use this pipeline in your research, please cite:

```bibtex
@misc{robust-doc-ocr-preprocessing,
  author = {3BSALAM},
  title = {Robust Document OCR Preprocessing Pipeline},
  year = {2026},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/3bsalam-1/RobustDocOCR}}
}
```

## 🔗 Related Projects

- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)
- [OpenCV](https://opencv.org/)
- [MIDV-500 Dataset](https://www.kaggle.com/datasets/kontheeboonmeeprakob/midv500)

## 📦 PyPI

This package is available on PyPI: [https://pypi.org/project/RobustDocOCR/](https://pypi.org/project/RobustDocOCR/)

---

**© 2026 Robust Document OCR Preprocessing Pipeline**
