Metadata-Version: 2.4
Name: iqedge-ai-ocr
Version: 0.2.0
Summary: IQEdge.ai OCR — PDF to Text converter with multi-pass self-correction. Extract text from scanned PDFs, images, books. 63 languages including Hindi, Telugu, Tamil, Arabic, Chinese, Japanese. Table detection, searchable PDF output.
Project-URL: Homepage, https://iqedge.ai/ocr
Project-URL: Repository, https://github.com/iqedge/iqedge-ai-ocr
Author-email: "IQEdge.ai" <ocr@iqedge.ai>
License-Expression: MIT
Keywords: arabic-ocr,book-scanner,chinese-ocr,document-ocr,easyocr,google-vision,hindi-ocr,image-to-text,indian-languages,japanese-ocr,multilingual-ocr,ocr,pdf-converter,pdf-ocr,pdf-reader,pdf-to-text,scan-to-text,scanned-pdf,searchable-pdf,table-extraction,telugu-ocr,tesseract,text-extractor
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Requires-Python: >=3.9
Requires-Dist: fpdf2>=2.7.0
Requires-Dist: opencv-python-headless>=4.7.0
Requires-Dist: pdf2image>=1.16.0
Requires-Dist: pdfminer-six>=20221105
Requires-Dist: pillow>=9.0.0
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: rapidfuzz>=3.0
Provides-Extra: all
Requires-Dist: easyocr>=1.7; extra == 'all'
Requires-Dist: google-cloud-vision>=3.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: pytest; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: ruff; extra == 'dev'
Provides-Extra: easyocr
Requires-Dist: easyocr>=1.7; extra == 'easyocr'
Provides-Extra: google
Requires-Dist: google-cloud-vision>=3.0; extra == 'google'
Description-Content-Type: text/markdown

# IQEdge.ai OCR

**PDF to Text converter with multi-pass self-correction.** Extracts text from scanned PDFs, images, and books in 63 languages, including Hindi, Telugu, Tamil, Arabic, Chinese, and Japanese, with table detection and searchable PDF output.

## Install

```bash
pip install iqedge-ai-ocr
```

The default install covers the Tesseract engine. EasyOCR and Google Cloud Vision are optional extras: `pip install "iqedge-ai-ocr[easyocr]"`, `pip install "iqedge-ai-ocr[google]"`, or `pip install "iqedge-ai-ocr[all]"` for both. `pytesseract` and `pdf2image` are wrappers, so the Tesseract and Poppler binaries must also be installed on the system.

## Usage

```bash
# Basic OCR — PDF to text
iqedge-ai-ocr ocr input.pdf

# Structured JSON output
iqedge-ai-ocr ocr input.pdf -o output.json

# Multi-language (Telugu + English)
iqedge-ai-ocr ocr input.pdf --lang tel+eng

# Full 4-pass correction (best quality)
iqedge-ai-ocr ocr input.pdf --passes 4

# CSV output
iqedge-ai-ocr ocr input.pdf --format csv -o output.csv

# Specific pages
iqedge-ai-ocr ocr input.pdf --pages 1-5,10

# Use Google Cloud Vision (most accurate)
iqedge-ai-ocr ocr input.pdf --engine google_vision

# Train from your documents
iqedge-ai-ocr train corpus_dir/

# System info
iqedge-ai-ocr info
```
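
The `--pages 1-5,10` form mixes ranges and single pages. As a rough illustration of how such a spec can be expanded, here is a minimal sketch; `parse_page_spec` is a hypothetical helper and not part of the package, and the CLI's actual parsing rules (for example, how it treats duplicates or descending ranges) may differ.

```python
def parse_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "1-5,10" into sorted, de-duplicated 1-based pages."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,3"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers are 1-based")
    return sorted(pages)


print(parse_page_spec("1-5,10"))  # [1, 2, 3, 4, 5, 10]
```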

## Python API

```python
from iqedge_ocr import OCRAgent

agent = OCRAgent(lang="tel+eng", max_passes=4)
doc = agent.process("input.pdf")

print(doc.text)           # full text
print(doc.confidence)     # 0.0 - 1.0
print(doc.word_count)     # total words
print(doc.to_dict())      # structured dict (JSON-serializable)
```

## Languages (63)

**Indian (14):** Hindi, Telugu, Tamil, Kannada, Malayalam, Bengali, Gujarati, Marathi, Punjabi, Odia, Urdu, Assamese, Nepali, Sanskrit

**Western Europe (9):** English, Spanish, French, German, Italian, Portuguese, Dutch, Catalan, Greek

**Scandinavia (4):** Swedish, Norwegian, Danish, Finnish

**Eastern Europe (13):** Russian, Polish, Ukrainian, Czech, Hungarian, Romanian, Bulgarian, Croatian, Serbian, Slovak, Lithuanian, Latvian, Estonian

**Caucasus (3):** Georgian, Armenian, Azerbaijani

**Middle East (4):** Arabic, Hebrew, Persian, Turkish

**East Asia (4):** Chinese (Simplified), Chinese (Traditional), Japanese, Korean

**Southeast Asia (9):** Thai, Vietnamese, Indonesian, Malay, Filipino, Cebuano, Burmese, Khmer, Lao

**Central Asia (1):** Uzbek

**Africa (2):** Swahili, Amharic
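
The `tel+eng` value in the usage examples follows Tesseract's convention of joining language codes with `+`. The sketch below builds such a string from language names. The codes are Tesseract's traineddata names; `TESSERACT_CODES` and `build_lang` are hypothetical helpers, and it is an assumption that the other engines accept the same codes.

```python
# Tesseract traineddata codes for a few of the languages listed above.
TESSERACT_CODES = {
    "English": "eng",
    "Hindi": "hin",
    "Telugu": "tel",
    "Tamil": "tam",
    "Arabic": "ara",
    "Chinese (Simplified)": "chi_sim",
    "Chinese (Traditional)": "chi_tra",
    "Japanese": "jpn",
    "Korean": "kor",
}


def build_lang(*names: str) -> str:
    """Join the codes for the given language names into a `+`-separated string."""
    try:
        codes = [TESSERACT_CODES[name] for name in names]
    except KeyError as exc:
        raise ValueError(f"no code known for {exc.args[0]!r}") from None
    # dict.fromkeys drops duplicates while keeping the requested order.
    return "+".join(dict.fromkeys(codes))


print(build_lang("Telugu", "English"))  # tel+eng
```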

## OCR Engines

- **Tesseract** (default) — free, offline, 63 languages
- **Google Cloud Vision** — cloud, highest accuracy for complex scripts
- **EasyOCR** — good for handwriting and curved text

## How It Works

1. **Pass 1 (200 DPI):** Fast scan, extract all text with per-word confidence scores
2. **Pass 2 (300 DPI):** Re-OCR only low-confidence regions
3. **Pass 3 (400 DPI):** Re-OCR remaining stubborn regions
4. **Pass 4:** Post-processing with learned corrections, pattern matching, and dictionary lookup

Each pass re-renders only the regions that need it, not the whole page. Document structure is preserved: text stays text, images stay images, and tables stay tables.
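
The escalation in passes 1 to 3 can be sketched as a loop that retries only the regions still below a confidence threshold, each time at a higher DPI. This is an illustration under stated assumptions, not the package's implementation: `Region`, `multi_pass_ocr`, the 0.80 threshold, and the keep-the-better-reading rule are all hypothetical, and `fake_ocr` stands in for a real engine.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Region:
    region_id: int
    text: str = ""
    confidence: float = 0.0
    dpi: int = 0  # DPI of the reading currently kept; 0 means not read yet


# An OCR callback takes (region_id, dpi) and returns (text, confidence).
OcrFn = Callable[[int, int], Tuple[str, float]]


def multi_pass_ocr(
    region_ids: Iterable[int],
    ocr: OcrFn,
    dpis: Tuple[int, ...] = (200, 300, 400),
    threshold: float = 0.80,  # assumed cut-off for "low confidence"
) -> List[Region]:
    regions = [Region(rid) for rid in region_ids]
    pending = regions  # the first pass reads every region
    for dpi in dpis:
        for region in pending:
            text, confidence = ocr(region.region_id, dpi)
            # Keep the first reading, then replace it only with a better one.
            if region.dpi == 0 or confidence > region.confidence:
                region.text, region.confidence, region.dpi = text, confidence, dpi
        # Later passes revisit only regions that are still below the threshold.
        pending = [r for r in pending if r.confidence < threshold]
        if not pending:
            break
    return regions


def fake_ocr(region_id: int, dpi: int) -> Tuple[str, float]:
    # Stand-in engine: region 0 is clean, region 1 needs 300 DPI, region 2 stays hard.
    scores = {
        0: {200: 0.95},
        1: {200: 0.55, 300: 0.91},
        2: {200: 0.40, 300: 0.50, 400: 0.62},
    }
    return f"region{region_id}@{dpi}", scores[region_id][dpi]


for region in multi_pass_ocr([0, 1, 2], fake_ocr):
    print(region.region_id, region.dpi, region.confidence)
```

In this run region 0 is read once, region 1 twice, and region 2 three times, which is where the savings over re-rendering whole pages come from.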

## Output Formats

- **Text** — plain text (default)
- **JSON** — structured with pages, regions, words, confidence scores
- **CSV** — tabular output with per-line confidence
- **Searchable PDF** — invisible text overlay on original pages
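
To make the JSON and CSV descriptions concrete, here is a sketch of one possible document shape and how per-line confidence could be derived from word scores. Every field name (`pages`, `page`, `regions`, `kind`, `words`, `line`, `confidence`) and the mean-of-words rule are assumptions for illustration; the package's real schema may differ.

```python
import csv
import io
import json

# Hypothetical document shape: pages contain regions, regions contain words.
document = {
    "pages": [
        {
            "page": 1,
            "regions": [
                {
                    "kind": "text",
                    "words": [
                        {"text": "Invoice", "line": 1, "confidence": 0.98},
                        {"text": "N0.", "line": 1, "confidence": 0.52},
                        {"text": "Total", "line": 2, "confidence": 0.90},
                    ],
                }
            ],
        }
    ]
}


def line_rows(doc):
    """Yield (page, line, text, mean word confidence) for each line."""
    for page in doc["pages"]:
        lines = {}
        for region in page["regions"]:
            for word in region["words"]:
                lines.setdefault(word["line"], []).append(word)
        for line_no in sorted(lines):
            words = lines[line_no]
            mean_conf = sum(w["confidence"] for w in words) / len(words)
            text = " ".join(w["text"] for w in words)
            yield page["page"], line_no, text, round(mean_conf, 2)


def to_csv(doc) -> str:
    """Render the per-line rows as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["page", "line", "text", "confidence"])
    writer.writerows(line_rows(doc))
    return buffer.getvalue()


print(json.dumps(document, ensure_ascii=False, indent=2))
print(to_csv(document))
```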

## Training

Train IQEdge.ai OCR on your own documents to improve accuracy:

```bash
# Place PDF + ground-truth pairs in a directory:
#   corpus/document1.pdf + corpus/document1.txt
#   corpus/document2.pdf + corpus/document2.txt

iqedge-ai-ocr train corpus/
```

The training system learns character confusions and word corrections specific to your documents.
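
One way to derive character confusions and word corrections from an OCR output and its ground truth is to align the two and record where they differ. The sketch below uses the standard library's `difflib` for the alignment. The function names are hypothetical and the technique is a stand-in chosen for illustration; how `iqedge-ai-ocr train` actually learns is not described here.

```python
from collections import Counter
from difflib import SequenceMatcher


def learn_confusions(ocr_text: str, truth_text: str) -> Counter:
    """Count (ocr_char, true_char) substitutions between the aligned texts."""
    confusions: Counter = Counter()
    matcher = SequenceMatcher(None, ocr_text, truth_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements map cleanly to per-character pairs.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            confusions.update(zip(ocr_text[i1:i2], truth_text[j1:j2]))
    return confusions


def learn_word_corrections(ocr_text: str, truth_text: str) -> dict:
    """Map misread words to their ground-truth form, aligning word by word."""
    ocr_words, truth_words = ocr_text.split(), truth_text.split()
    corrections = {}
    matcher = SequenceMatcher(None, ocr_words, truth_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            corrections.update(zip(ocr_words[i1:i2], truth_words[j1:j2]))
    return corrections


def apply_corrections(text: str, corrections: dict) -> str:
    """Replace any word that has a learned correction; leave the rest as-is."""
    return " ".join(corrections.get(word, word) for word in text.split())


ocr_output = "Tota1 am0unt due"
ground_truth = "Total amount due"

confusions = learn_confusions(ocr_output, ground_truth)
corrections = learn_word_corrections(ocr_output, ground_truth)

print(dict(confusions))
print(corrections)
print(apply_corrections("Tota1 due today", corrections))
```

A correction table like this could feed a post-processing step, with the confusion counts indicating which substitutions (here `1` for `l` and `0` for `o`) are worth trying on unseen words.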

## License

MIT
