Metadata-Version: 2.4
Name: abstract_ocr
Version: 0.0.1.61
Summary: A structured OCR pipeline designed for layout-aware text extraction from complex documents, combining preprocessing, column detection, region classification, and ordered OCR assembly.
Author: putkoff
Author-email: partners@abstractendeavors.com
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Python: >=3.6
Description-Content-Type: text/markdown
Requires-Dist: abstract_utilities
Requires-Dist: requests
Requires-Dist: pdf2image
Requires-Dist: PyPDF2
Requires-Dist: easyocr
Requires-Dist: pytest
Requires-Dist: pytesseract
Requires-Dist: lxml
Requires-Dist: moviepy==1.0.3
Requires-Dist: spacy
Requires-Dist: abstract_hugpy
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

## **abstract_ocr / layout_ocr — Layout-Aware OCR Pipeline**

A structured OCR pipeline designed for **layout-aware text extraction from complex documents**, combining preprocessing, column detection, region classification, and ordered OCR assembly.

Built to handle:

* multi-column PDFs
* mixed-content layouts (text, figures, captions)
* noisy or scanned documents
* large-scale document ingestion pipelines

---

## 🔹 What This System Is

This is not a simple OCR wrapper — it is a **typed, multi-stage processing pipeline**:

* transforms raw images into structured page representations
* detects document layout (columns, headers, regions)
* classifies content blocks (text, figures, captions)
* applies OCR at the region level
* reconstructs output in correct reading order

The system is designed for **deterministic, reproducible extraction** driven by explicit configuration, rather than ad-hoc text scraping.

---

## 🔹 Pipeline Overview

```text
PageImage
    ↓
Preprocess (denoise + binarize)
    ↓
Layout Detection
    ├─ Column detection (vertical projection)
    ├─ Header detection
    └─ Region classification (text / figure / caption)
    ↓
Region OCR
    ├─ Crop per region
    ├─ Apply OCR (Tesseract)
    └─ Fallback to column-level OCR if needed
    ↓
Reading Order Assembly
    ↓
OCRResult (structured blocks + raw text)
```
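
The "Column detection (vertical projection)" stage can be illustrated with a small, dependency-free sketch. It is not the package's implementation: the function name `find_column_spans`, the `min_gap` parameter, and the list-of-lists image representation are all assumptions made for the example.

```python
from typing import List, Tuple


def find_column_spans(binary: List[List[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges of text columns, with `end` exclusive.

    `binary` is a row-major grid where 1 marks ink and 0 marks background.
    A run of at least `min_gap` empty pixel columns is treated as a gutter
    (a "valley" in the vertical projection profile).
    """
    if not binary:
        return []
    width = len(binary[0])
    # Vertical projection: count ink pixels in each pixel column.
    profile = [sum(row[x] for row in binary) for x in range(width)]

    spans = []
    start = None  # x where the current text column began, if any
    gap = 0       # length of the current run of empty pixel columns
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The valley is wide enough: close the column before it.
                spans.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close the last column, trimming any trailing empty pixels.
        spans.append((start, width - gap))
    return spans


# Two 4-pixel-wide columns separated by a 2-pixel gutter.
page = [[1, 1, 1, 1, 0, 0, 1, 1, 1, 1]] * 3
print(find_column_spans(page))
```

A real page would need a noise-tolerant threshold instead of `ink > 0`; the strict test keeps the sketch short.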

---

## 🔹 Core Capabilities

* **Layout Detection**

  * Column detection via vertical projection valleys
  * Header segmentation via density scanning
  * Multi-column classification (single / dual / mixed)

* **Region Classification**

  * Connected-component analysis
  * Density-based classification (text vs figure vs caption)
  * Column-aware region assignment

* **Region-Level OCR**

  * OCR applied per detected block (not full-page)
  * Adaptive Tesseract configuration by region type
  * Automatic fallback to column-level OCR when detection fails

* **Reading Order Reconstruction**

  * Column-aware ordering
  * Top-to-bottom sequencing within columns
  * Header/body/caption prioritization

* **Typed Pipeline Execution**

  * All steps validated via explicit input/output types
  * Registry-driven execution model
  * No implicit coupling between pipeline stages
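
As an illustration of "adaptive Tesseract configuration by region type", the sketch below maps a region kind to a Tesseract page-segmentation mode. The mapping itself, the `PSM_BY_KIND` table, and the `tesseract_config` helper are assumptions for the example; the package's actual settings may differ.

```python
from typing import Optional

# Illustrative mapping from region kind to Tesseract page-segmentation mode.
# These choices are assumptions, not the package's defaults.
PSM_BY_KIND = {
    "text": 6,     # PSM 6: assume a single uniform block of text
    "caption": 7,  # PSM 7: treat the image as a single text line
}


def tesseract_config(kind: str) -> Optional[str]:
    """Return a Tesseract config string, or None to skip OCR for this kind."""
    psm = PSM_BY_KIND.get(kind)
    return None if psm is None else f"--psm {psm}"


# The resulting string could be passed to pytesseract, for example:
#   pytesseract.image_to_string(crop, config=tesseract_config("text"))
print(tesseract_config("text"))
```

Returning `None` for figures lets the caller skip OCR on regions that are unlikely to contain text.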

---

## 🔹 Architecture

The pipeline is built around a **step registry + type-safe execution chain**:

* Each step declares:

  * input type
  * output type
* The pipeline validates compatibility before execution
* Execution is explicit, deterministic, and observable

Example chain:

```python
chain = ["preprocess", "detect_layout", "ocr_regions"]  # registered step names, in execution order
```

Each step is independently replaceable and composable.
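
A minimal sketch of how a step registry with type-checked chaining might look is shown here. The names `Step`, `REGISTRY`, `register`, `validate_chain`, and `run_chain` are hypothetical, and the four classes are empty stand-ins for the real dataclasses.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


# Empty stand-ins for the pipeline's dataclasses (real fields omitted).
class PageImage: ...
class PreprocessedImage: ...
class LayoutDetection: ...
class OCRResult: ...


@dataclass(frozen=True)
class Step:
    name: str
    input_type: type
    output_type: type
    run: Callable[[Any], Any]


REGISTRY: Dict[str, Step] = {}


def register(name: str, input_type: type, output_type: type):
    """Decorator that records a step together with its declared types."""
    def decorator(fn):
        REGISTRY[name] = Step(name, input_type, output_type, fn)
        return fn
    return decorator


@register("preprocess", PageImage, PreprocessedImage)
def preprocess(page):
    return PreprocessedImage()


@register("detect_layout", PreprocessedImage, LayoutDetection)
def detect_layout(image):
    return LayoutDetection()


@register("ocr_regions", LayoutDetection, OCRResult)
def ocr_regions(layout):
    return OCRResult()


def validate_chain(names: List[str]) -> List[Step]:
    """Resolve step names and check adjacent types before anything runs."""
    steps = [REGISTRY[name] for name in names]  # KeyError for unknown names
    for prev, nxt in zip(steps, steps[1:]):
        if prev.output_type is not nxt.input_type:
            raise TypeError(
                f"{prev.name} outputs {prev.output_type.__name__}, "
                f"but {nxt.name} expects {nxt.input_type.__name__}"
            )
    return steps


def run_chain(names: List[str], value: Any) -> Any:
    """Validate the whole chain first, then execute the steps in order."""
    for step in validate_chain(names):
        value = step.run(value)
    return value


result = run_chain(["preprocess", "detect_layout", "ocr_regions"], PageImage())
print(type(result).__name__)
```

Because validation happens before the first step runs, a mis-ordered chain fails immediately rather than partway through a page.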

---

## 🔹 Key Design Decisions

### **Typed Data Flow**

All intermediate results are structured dataclasses:

* `PageImage`
* `PreprocessedImage`
* `LayoutDetection`
* `OCRResult`

Stages never pass ad-hoc dictionaries to each other, which ensures:

* traceability
* consistency
* debuggability
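
To make the idea concrete, here is one possible shape for two of these dataclasses. Every field name below (`bbox`, `kind`, `column`, `blocks`, `raw_text`) and the helper types `Region` and `TextBlock` are assumptions for illustration; the package's real definitions may differ.

```python
from dataclasses import dataclass
from typing import Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1); hypothetical convention


@dataclass(frozen=True)
class Region:
    bbox: BBox
    kind: str    # "text", "figure", or "caption"
    column: int  # index of the column the region belongs to


@dataclass(frozen=True)
class LayoutDetection:
    columns: Tuple[Tuple[int, int], ...]  # x-ranges of detected columns
    regions: Tuple[Region, ...]


@dataclass(frozen=True)
class TextBlock:
    region: Region
    text: str


@dataclass(frozen=True)
class OCRResult:
    blocks: Tuple[TextBlock, ...]

    @property
    def raw_text(self) -> str:
        # Raw text is derived from the ordered blocks, so the two views
        # of the result cannot drift apart.
        return "\n".join(block.text for block in self.blocks)


region = Region(bbox=(0, 0, 100, 40), kind="text", column=0)
result = OCRResult(blocks=(TextBlock(region, "first"), TextBlock(region, "second")))
print(result.raw_text)
```

Frozen dataclasses reject attribute assignment after construction, which helps keep intermediate results traceable between stages.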

---

### **Layout-First OCR**

OCR is applied **after structure is understood**, not before.

This prevents:

* column interleaving
* incorrect reading order
* misclassification of content
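
The column interleaving problem is easy to reproduce. In the sketch below, `Block` and `reading_order` are hypothetical names and the coordinates are made up; it only contrasts a naive top-to-bottom sort with a column-aware one.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    column: int  # 0-based column index, left to right
    top: int     # y coordinate of the block's top edge
    text: str


def reading_order(blocks: List[Block]) -> List[Block]:
    """Order blocks column by column, top to bottom within each column."""
    return sorted(blocks, key=lambda b: (b.column, b.top))


blocks = [
    Block(1, 10, "right-1"),
    Block(0, 12, "left-1"),
    Block(0, 80, "left-2"),
    Block(1, 85, "right-2"),
]

# Sorting by vertical position alone mixes the two columns together.
naive = [b.text for b in sorted(blocks, key=lambda b: b.top)]
# Sorting by column first keeps each column's text contiguous.
ordered = [b.text for b in reading_order(blocks)]
print(naive)
print(ordered)
```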

---

### **Fallback Over Failure**

If region detection fails:

* the system falls back to column-level OCR
* the output therefore remains usable
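
The fallback can be sketched as a simple choice of OCR targets. This example assumes that "detection fails" means no regions were found, and it takes the OCR engine as a callable so that it runs without Tesseract installed; `ocr_with_fallback` is a hypothetical name.

```python
from typing import Callable, List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1); hypothetical convention


def ocr_with_fallback(
    regions: Sequence[BBox],
    columns: Sequence[BBox],
    ocr: Callable[[BBox], str],
) -> List[str]:
    """OCR each detected region, or each whole column if none were detected."""
    targets = regions if regions else columns
    return [ocr(box) for box in targets]


def fake_ocr(box: BBox) -> str:
    # Stand-in for a real engine; it only reports which box it was given.
    return f"ocr{box}"


print(ocr_with_fallback([], [(0, 0, 10, 10), (12, 0, 20, 10)], fake_ocr))
```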

---

### **Determinism Over Heuristics**

* explicit thresholds (config-driven)
* no hidden behavior
* reproducible results across runs
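
One way to keep thresholds explicit is to hold them in a frozen config object that every decision reads from. The sketch below applies this to density-based region classification; `LayoutConfig`, `classify_region`, and the threshold values are illustrative placeholders, not the package's defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutConfig:
    # Illustrative placeholder values, not the package's defaults.
    figure_min_density: float = 0.45  # ink fraction at or above which a block is a figure
    caption_max_height: int = 40      # maximum pixel height for a caption block


def classify_region(
    ink_pixels: int,
    width: int,
    height: int,
    cfg: LayoutConfig = LayoutConfig(),
) -> str:
    """Classify a block as "figure", "caption", or "text" from its ink density."""
    if width <= 0 or height <= 0:
        raise ValueError("region must have positive size")
    density = ink_pixels / (width * height)
    if density >= cfg.figure_min_density:
        return "figure"
    if height <= cfg.caption_max_height:
        return "caption"
    return "text"


print(classify_region(ink_pixels=1000, width=100, height=100))
```

With the same inputs and the same `LayoutConfig`, the function always returns the same label, so changing behavior requires changing the config rather than the code.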

---

## 🔹 Why This Exists

Traditional OCR pipelines:

* ignore layout
* operate on full pages
* produce inconsistent reading order
* fail silently on complex documents

This system:

* understands document structure
* isolates regions before OCR
* enforces reading order
* produces structured outputs suitable for downstream systems

---

## 🔹 Example Use Cases

* PDF → structured text extraction
* research document ingestion pipelines
* financial filings parsing
* multi-column article extraction
* preprocessing for NLP / LLM pipelines
* search indexing and document analysis

---

## 🔹 Integration Context

This module is designed to plug into:

* document ingestion systems
* OCR + NLP pipelines (e.g., `abstract_hugpy`)
* search and indexing systems
* large-scale document processing workflows

---

## 🔹 Design Philosophy

* **Structure before extraction**
* **Determinism over convenience**
* **Typed pipelines over implicit flows**
* **Fallback over failure**

---
