Metadata-Version: 2.4
Name: handscribe
Version: 0.1.0
Summary: Multilingual handwritten OCR for student notes - production-grade text extraction
Author-email: Ronald Gosso <ronaldgosso@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/ronaldgosso/handscribe
Project-URL: Bug Reports, https://github.com/ronaldgosso/handscribe/issues
Project-URL: Source, https://github.com/ronaldgosso/handscribe
Keywords: ocr,handwriting,multilingual,student-notes,easyocr,paddleocr,trocr
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: opencv-python-headless>=4.8
Requires-Dist: pillow>=10.0
Requires-Dist: numpy>=1.24
Requires-Dist: typer>=0.9
Requires-Dist: rich>=13.0
Requires-Dist: fastapi>=0.110
Requires-Dist: uvicorn>=0.29
Requires-Dist: python-multipart>=0.0.6
Provides-Extra: easyocr
Requires-Dist: easyocr>=1.7; extra == "easyocr"
Provides-Extra: paddle
Requires-Dist: paddlepaddle>=2.6; extra == "paddle"
Requires-Dist: paddleocr>=2.7; extra == "paddle"
Provides-Extra: trocr
Requires-Dist: transformers>=4.40; extra == "trocr"
Requires-Dist: torch>=2.0; extra == "trocr"
Provides-Extra: dev
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: mypy>=1.5; extra == "dev"
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: httpx>=0.24; extra == "dev"
Provides-Extra: all
Requires-Dist: handscribe[easyocr,paddle,trocr]; extra == "all"
Dynamic: license-file

# 🖋️ HandScribe OCR

> **Production-grade multilingual handwritten OCR for student notes**

[![CI](https://github.com/ronaldgosso/handscribe/actions/workflows/ci.yml/badge.svg)](https://github.com/ronaldgosso/handscribe/actions)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![Docker](https://img.shields.io/badge/docker-ready-blue.svg)](https://hub.docker.com/r/ronaldgosso/handscribe)

---

## ✨ Features

- **Three OCR Backends**: EasyOCR, PaddleOCR, and TrOCR behind a unified interface
- **Multilingual Support**: 80+ languages including English, Swahili, Arabic, Hindi, French
- **Advanced Preprocessing**: Denoising, CLAHE contrast enhancement, adaptive binarization, deskew
- **CLI & REST API**: Use from command line or integrate into any application
- **Docker Ready**: One-command deployment, no Python environment needed
- **Batch Processing**: Process hundreds of student notes automatically
- **Production-Grade**: Tested, typed, CI/CD-enabled, PyPI-ready

---

## 🚀 Quick Start

### Option 1: Docker (Recommended)

```bash
docker run -p 8000:8000 ronaldgosso/handscribe

# Test it
curl -X POST http://localhost:8000/ocr \
  -F "file=@my_notes.jpg" \
  -F "languages=en,sw"
```

### Option 2: Python Package

```bash
pip install "handscribe[easyocr]"  # quotes keep shells like zsh from expanding the brackets

# CLI
handscribe extract student_notes.jpg -b easyocr -l en,sw

# API server
uvicorn ocr_engine.api:api --host 0.0.0.0 --port 8000
```

### Option 3: From Source

```bash
git clone https://github.com/ronaldgosso/handscribe.git
cd handscribe
pip install -e ".[easyocr]"
handscribe --help
```

---

## 📖 Usage

### CLI

```bash
# Extract text
handscribe extract image.jpg -b easyocr -l en,sw

# Output as JSON
handscribe extract image.jpg --json

# Save to file with a 0.6 confidence threshold
handscribe extract image.jpg -o output.txt -c 0.6

# Batch process a directory
handscribe batch ./student_notes/ -o ./results/

# Compare all backends on the same image
handscribe compare image.jpg -l en,sw
```

### REST API

Start the server, then open **http://localhost:8000/docs** for the interactive API docs (FastAPI's built-in Swagger UI).

```bash
# Extract text with bounding boxes
curl -X POST http://localhost:8000/ocr \
  -F "file=@notes.jpg" \
  -F "backend=easyocr" \
  -F "languages=en,sw" \
  -F "confidence=0.5"

# Plain text only
curl -X POST http://localhost:8000/ocr/text \
  -F "file=@notes.jpg"

# Batch (up to 10 files)
curl -X POST http://localhost:8000/ocr/batch \
  -F "files=@img1.jpg" -F "files=@img2.jpg"
```
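
The batch endpoint above is noted as taking up to 10 files per request, so a larger folder has to be split on the client side. The following is a minimal sketch of that splitting step only. `IMAGE_SUFFIXES`, `collect_images`, and `chunked` are hypothetical helpers written for this illustration and are not part of HandScribe; the set of accepted file extensions is likewise an assumption.

```python
from pathlib import Path

# Assumption: which extensions count as images is not specified by HandScribe.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def collect_images(directory):
    """Return image files directly inside `directory`, sorted by name."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


def chunked(items, size=10):
    """Yield consecutive slices of `items` holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

Each chunk can then be sent to `/ocr/batch` as repeated `files` fields, mirroring the curl call above.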

### Python API

```python
from ocr_engine import OCREngine, OCRBackend

engine = OCREngine(
    backend=OCRBackend.EASYOCR,
    languages=["en", "sw"],
    confidence_threshold=0.5,
)

# With bounding boxes
results = engine.extract("student_notes.jpg")
for r in results:
    print(f"{r.text} (conf: {r.confidence:.2f})")

# Plain text
text = engine.extract_text("student_notes.jpg")

# Batch
batch = engine.extract_batch(["note1.jpg", "note2.jpg"])
```
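
To make the effect of a confidence threshold concrete, here is a small standalone sketch. It does not call HandScribe: `Detection` is a hypothetical stand-in that only mirrors the `text` and `confidence` attributes used above, and treating the threshold as inclusive (`>=`) is an assumption about the engine's behaviour.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for one OCR result (text plus confidence)."""
    text: str
    confidence: float


def filter_confident(results, threshold=0.5):
    """Keep detections whose confidence is at or above the threshold."""
    return [r for r in results if r.confidence >= threshold]


def join_text(results):
    """Join the surviving detections into one line of plain text."""
    return " ".join(r.text for r in results)


sample = [
    Detection("Habari", 0.91),
    Detection("zz#", 0.22),        # low-confidence noise
    Detection("class notes", 0.74),
]

print(join_text(filter_confident(sample, threshold=0.5)))  # Habari class notes
```

Raising the threshold trades recall for precision: at `0.8` only `Habari` would survive in this sample.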

---

## 🔧 OCR Backends Comparison

| Backend | Best For | Languages | Speed | Accuracy |
|---------|----------|-----------|-------|----------|
| **EasyOCR** | Quick setup, mixed scripts | 80+ | ⚡⚡⚡ | ⭐⭐⭐⭐ |
| **PaddleOCR** | Fast processing, documents | 80+ | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ |
| **TrOCR** | Handwriting accuracy | English* | ⚡⚡ | ⭐⭐⭐⭐⭐ |

*TrOCR can be fine-tuned for other languages.

### Language Codes

| Language | EasyOCR | PaddleOCR |
|----------|---------|-----------|
| English | `en` | `en` |
| Swahili | `sw` | `en` (Latin script) |
| Arabic | `ar` | `arabic` |
| Hindi | `hi` | `hi` |
| French | `fr` | `french` |

> **Tanzanian Context**: For mixed Swahili + English notes, use `-l en,sw` with EasyOCR. Swahili is written in the Latin alphabet, so Latin-script recognition handles most Swahili text well. For production-grade Swahili accuracy, fine-tuning TrOCR on a Swahili handwriting dataset is recommended.
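
Because the two backends use different codes for the same language, a lookup table can help when switching between them. The sketch below simply encodes the language table above as a dictionary. `LANGUAGE_CODES` and `codes_for` are hypothetical names for this illustration, not part of the HandScribe API.

```python
# Codes copied from the language table above; keys are lowercase names.
LANGUAGE_CODES = {
    "english": {"easyocr": "en", "paddleocr": "en"},
    "swahili": {"easyocr": "sw", "paddleocr": "en"},  # Latin-script fallback
    "arabic": {"easyocr": "ar", "paddleocr": "arabic"},
    "hindi": {"easyocr": "hi", "paddleocr": "hi"},
    "french": {"easyocr": "fr", "paddleocr": "french"},
}


def codes_for(backend, languages):
    """Translate language names into backend-specific codes, without duplicates."""
    codes = []
    for name in languages:
        try:
            code = LANGUAGE_CODES[name.lower()][backend]
        except KeyError:
            raise ValueError(f"No {backend} code known for {name!r}") from None
        if code not in codes:  # Swahili and English share "en" on PaddleOCR
            codes.append(code)
    return codes


print(",".join(codes_for("easyocr", ["English", "Swahili"])))    # en,sw
print(",".join(codes_for("paddleocr", ["English", "Swahili"])))  # en
```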

---

## 🐳 Docker

```bash
# Run
docker run -p 8000:8000 ronaldgosso/handscribe

# Build from source
docker build -t handscribe .
docker run -p 8000:8000 handscribe

# Docker Compose
docker compose up -d
```

See [CONTRIBUTING.md](CONTRIBUTING.md) for full Docker setup instructions.

---

## 🏗️ Architecture

```
handscribe/
├── ocr_engine/
│   ├── __init__.py          # Package exports
│   ├── engine.py            # Core OCR engine (3 backends)
│   ├── preprocessing.py     # Advanced image preprocessing
│   ├── cli.py               # CLI interface (Typer)
│   └── api.py               # REST API (FastAPI)
├── tests/
│   └── test_engine.py       # Comprehensive test suite
├── pyproject.toml           # Package configuration
├── Dockerfile               # Multi-stage Docker build
├── docker-compose.yml       # Docker Compose setup
└── .github/workflows/ci.yml # CI/CD pipeline
```
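
Adaptive binarization is one of the preprocessing steps HandScribe lists. The toy below shows why it suits photographed notes with uneven lighting. It is a pure-Python illustration of mean-based adaptive thresholding, not the project's implementation (which would normally rely on OpenCV, a declared dependency). The function names, the 3x3 window, and the offset of 15 are all assumptions chosen for the example.

```python
def global_binarize(gray, threshold):
    """One fixed threshold for the whole image (0 = ink, 255 = paper)."""
    return [[0 if value < threshold else 255 for value in row] for row in gray]


def adaptive_binarize(gray, window=3, offset=15):
    """Mark a pixel as ink if it is clearly darker than its local mean."""
    height, width = len(gray), len(gray[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clamp the neighbourhood to the image borders.
            ys = range(max(0, y - half), min(height, y + half + 1))
            xs = range(max(0, x - half), min(width, x + half + 1))
            neighbours = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(neighbours) / len(neighbours)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        result.append(row)
    return result


# Paper brightness fades from left (220) to right (100), as under a desk lamp.
PAPER = [220, 200, 180, 160, 140, 120, 100]
# Two ink strokes in the middle row: value 100 on bright paper, 30 in shadow.
page = [PAPER[:], [220, 100, 180, 160, 140, 30, 100], PAPER[:]]

print(global_binarize(page, 110)[0])  # [255, 255, 255, 255, 255, 255, 0]
print(adaptive_binarize(page)[1])     # [255, 0, 255, 255, 255, 0, 255]
```

With a global threshold of 110, the shadowed paper in the last column is misread as ink, because it is exactly as dark as the left-hand stroke. The local comparison recovers both strokes and nothing else.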

---

## 🧪 Development

See [CONTRIBUTING.md](CONTRIBUTING.md) for the full developer guide, including:

- Virtual environment setup
- Running the CLI, API server, and Docker
- Running tests with coverage
- Linting with ruff, black, and mypy
- Git workflow and PR submission
- Adding new OCR backends

Quick start for developers:

```bash
git clone https://github.com/ronaldgosso/handscribe.git
cd handscribe
python -m venv .venv && source .venv/bin/activate  # or .venv\Scripts\Activate on Windows
pip install -e ".[all,dev]"
pytest tests/ -v
```

---

## 📝 Examples

### Process Tanzanian Student Notes

```bash
handscribe extract notes.jpg -b easyocr -l en,sw -c 0.5
handscribe compare notes.jpg -l en,sw
handscribe batch ./semester_notes/ -l en,sw -o ./ocr_results/
```

### API Integration (Python)

```python
import requests

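with open("student_notes.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:8000/ocr",
        files={"file": f},
        data={"backend": "easyocr", "languages": "en,sw", "confidence": 0.5},
        timeout=60,  # OCR can be slow; avoid waiting forever on a stalled server
    )

response.raise_for_status()  # surface HTTP errors before reading the JSON body
print(response.json()["full_text"])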
```

---

## 🙏 Acknowledgments

- **EasyOCR** — Jaided AI for excellent multilingual OCR
- **PaddleOCR** — PaddlePaddle team for fast OCR implementation
- **TrOCR** — Microsoft for transformer-based handwriting OCR
- **Tanzanian Students** — Inspiring this tool for real-world impact

---

## 📧 Contact

**Ronald Gosso** — ronaldgosso@gmail.com

Project Link: [https://github.com/ronaldgosso/handscribe](https://github.com/ronaldgosso/handscribe)

---

*Made with ❤️ for students everywhere*
