Metadata-Version: 2.3
Name: trk-ftz-ocr
Version: 0.1.0
Summary: OCR toolkit for Arabic PDFs and directories
Author: Tarek
Author-email: Tarek <tarek12305@gmail.com>
Requires-Dist: pillow>=12.2.0
Requires-Dist: pymupdf>=1.27.2.3
Requires-Dist: pytesseract>=0.3.13
Requires-Dist: tqdm>=4.67.3
Requires-Dist: pandas>=2.0.0
Requires-Dist: openpyxl>=3.1.0
Requires-Python: >=3.12
Description-Content-Type: text/markdown

# TRK FTZ OCR

TRK FTZ OCR is a lightweight OCR toolkit for extracting text from single PDFs and folders of PDFs, with Arabic (`ara`) as the default OCR language.  
It supports single-page extraction, full PDF processing, directory processing, parallel execution, and multiple output formats.

---

## 🚀 Features

- 📄 Extract text from a single page
- 📚 Process full PDF files
- 📁 Process directories of PDFs
- ⚡ Parallel processing for speed
- 🧠 Adaptive OCR pipeline (based on Tesseract page segmentation modes, PSM)
- 📊 Export results to:
  - Dictionary (Python)
  - TXT files (per page / per file)
  - Excel (.xlsx)
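
This README does not say how the adaptive pipeline picks a page segmentation mode, so the following is only a minimal sketch of one plausible strategy: run OCR with several PSM values and keep the output with the most non-whitespace characters. The names `best_psm_text` and `tesseract_runner`, the candidate list, and the scoring rule are all assumptions made for illustration, not the package's API.

```python
from typing import Callable, Iterable


def best_psm_text(
    run_ocr: Callable[[int], str],
    psm_candidates: Iterable[int] = (6, 4, 3),  # assumed candidate order
) -> tuple[int, str]:
    """Try each PSM and keep the output with the most non-whitespace characters."""
    best_psm, best_text, best_score = -1, "", -1
    for psm in psm_candidates:
        text = run_ocr(psm)
        score = sum(1 for ch in text if not ch.isspace())
        if score > best_score:
            best_psm, best_text, best_score = psm, text, score
    return best_psm, best_text


def tesseract_runner(image, lang: str = "ara") -> Callable[[int], str]:
    """Build a run_ocr callable backed by pytesseract (imported lazily)."""
    import pytesseract  # needs the Tesseract binary on the system

    return lambda psm: pytesseract.image_to_string(
        image, lang=lang, config=f"--psm {psm}"
    )
```

With a PIL image in hand, `best_psm_text(tesseract_runner(image))` would return the chosen PSM together with its text. Injecting `run_ocr` keeps the selection logic testable without Tesseract installed.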

---

## 📦 Installation

```bash
pip install trk-ftz-ocr
```

---

## 🖥️ CLI Usage

After installation, list the available options with:

```bash
trk-ftz-ocr --help
```

---

## 📄 Process a single PDF

### Return text (default)

```bash
trk-ftz-ocr file.pdf --mode dict
```

---

### Save as folder (per-page TXT files)

```bash
trk-ftz-ocr file.pdf --mode folder --output output/
```

---

### Save as single Excel file

```bash
trk-ftz-ocr file.pdf --mode excel --output output.xlsx
```

---

## 📁 Process a directory

```bash
trk-ftz-ocr pdfs/ --mode folder --output output/
```

This creates one subfolder per PDF, with one TXT file per page:

```
output/
  file1/
    file1_page_1.txt
    file1_page_2.txt
  file2/
    file2_page_1.txt
```
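
The naming scheme in the tree above can be reproduced with `pathlib`. This is a sketch rather than the package's own code: `page_txt_path` and `write_pages` are hypothetical helpers, and 1-based page numbers are assumed from the example file names.

```python
from pathlib import Path


def page_txt_path(output_dir, pdf_path, page_number: int) -> Path:
    """Return <output_dir>/<pdf stem>/<pdf stem>_page_<n>.txt."""
    stem = Path(pdf_path).stem
    return Path(output_dir) / stem / f"{stem}_page_{page_number}.txt"


def write_pages(output_dir, pdf_path, pages: dict[int, str]) -> list[Path]:
    """Write one UTF-8 text file per page, in page order."""
    written = []
    for page_number in sorted(pages):
        target = page_txt_path(output_dir, pdf_path, page_number)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(pages[page_number], encoding="utf-8")
        written.append(target)
    return written
```

Writing with an explicit `encoding="utf-8"` matters for Arabic text, since the platform default encoding is not always UTF-8.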

---

## ⚙️ Options

| Option | Description |
|------|------------|
| `--page` | Extract a single page |
| `--output` | Output path (file or folder) |
| `--lang` | OCR language (default: ara) |
| `--zoom` | Render zoom factor; higher values give sharper page images but slower OCR (default: 6) |
| `--parallel` | Enable parallel processing |
| `--mode` | Output mode: `dict`, `file`, `folder`, `excel` |
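
To give a feel for what `--zoom` means in pixels, here is a small worked calculation. It assumes the zoom value is applied as a uniform scale on top of the PDF base resolution of 72 points per inch, which is how PyMuPDF scaling matrices behave. Whether the tool applies it exactly this way is an assumption, and `rendered_size` is a hypothetical helper.

```python
BASE_DPI = 72  # PDF user space: 1 point = 1/72 inch


def rendered_size(width_pt: float, height_pt: float, zoom: float = 6.0) -> tuple[int, int, int]:
    """Return (width_px, height_px, effective_dpi) for a page rendered at `zoom`."""
    return round(width_pt * zoom), round(height_pt * zoom), round(BASE_DPI * zoom)


# An A4 page is roughly 595 x 842 points.
print(rendered_size(595, 842))  # (3570, 5052, 432)
```

At the default zoom of 6, an A4 page becomes an image of about 18 megapixels, so lowering the zoom is a direct way to trade accuracy for speed and memory.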

---

## 📊 Output Modes

### `dict`
Returns a Python dictionary mapping each page number to its text:
```python
{page_number: text}
```
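
A dictionary in this shape is easy to post-process. The sketch below joins pages into one string in page order. `join_pages` is a hypothetical helper, and sorting the keys is a precaution that assumes pages may have been inserted out of order.

```python
def join_pages(pages: dict[int, str], separator: str = "\n\n") -> str:
    """Concatenate page texts in ascending page order."""
    return separator.join(pages[n].strip() for n in sorted(pages))


print(join_pages({2: "second page ", 1: " first page"}))
```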

### `file`
Saves all text into a single `.txt` file.

### `folder`
Saves each page as a separate `.txt` file.

### `excel`
Saves results as a spreadsheet with one row per page and the columns:
```
file | page | text
```
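
One way to produce this layout from per-file page dictionaries is sketched below. The input shape `{file_name: {page_number: text}}` and the helpers `to_rows` and `save_excel` are assumptions for illustration. Only the column names come from this README.

```python
def to_rows(results: dict[str, dict[int, str]]) -> list[dict[str, object]]:
    """Flatten {file: {page: text}} into one row per page, sorted by file then page."""
    rows = []
    for file_name in sorted(results):
        for page in sorted(results[file_name]):
            rows.append({"file": file_name, "page": page, "text": results[file_name][page]})
    return rows


def save_excel(rows: list[dict[str, object]], output_path: str) -> None:
    """Write the rows to an .xlsx file (pandas uses openpyxl for this format)."""
    import pandas as pd  # imported lazily so to_rows works without pandas

    pd.DataFrame(rows, columns=["file", "page", "text"]).to_excel(output_path, index=False)
```

Passing `index=False` keeps the sheet to exactly the three columns shown above.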

---

## 🧠 Example (Python API)

```python
from trk_ftz_ocr.pipeline import process

result = process(
    path="file.pdf",
    output_path="output/",
    mode="dict",
    parallel=True
)

print(result)
```

---

## ⚡ Performance

- Parallel page processing (ThreadPool)
- Parallel directory processing
- Lightweight preprocessing pipeline
- Adaptive OCR configuration
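
Parallel page processing with a thread pool can be sketched with the standard library alone. This is not the package's implementation: `ocr_pages_parallel` and the injected `ocr_page` callable are hypothetical, and the worker count is an arbitrary default.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def ocr_pages_parallel(
    page_numbers: Iterable[int],
    ocr_page: Callable[[int], str],
    max_workers: int = 4,  # assumed default
) -> dict[int, str]:
    """Run ocr_page over all pages concurrently and return {page_number: text}."""
    page_numbers = list(page_numbers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order, regardless of completion order.
        texts = pool.map(ocr_page, page_numbers)
        return dict(zip(page_numbers, texts))
```

Threads are a reasonable fit here because pytesseract runs the Tesseract binary as a subprocess, so Python threads mostly wait on external work rather than competing for the interpreter lock.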

---

## 📌 Requirements

- Python >= 3.12
- Tesseract OCR installed on the system, including the language data for the chosen `--lang` (Arabic, `ara`, by default)
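
Before running the tool, it can help to confirm that Tesseract is reachable. The check below is a sketch with hypothetical helper names. It assumes the binary is called `tesseract` and is on `PATH`, and the language check uses `pytesseract.get_languages`.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the Tesseract executable can be found on PATH."""
    return shutil.which(binary) is not None


def has_language(lang: str = "ara") -> bool:
    """Return True if Tesseract reports the given language as installed."""
    import pytesseract  # imported lazily; requires the Tesseract binary

    return lang in pytesseract.get_languages(config="")


if not tesseract_available():
    print("Tesseract was not found on PATH; install it before running OCR.")
```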

---

## 🛠️ Dependencies

- PyMuPDF
- pytesseract
- pillow
- tqdm
- pandas
- openpyxl

---

## 📄 License

MIT License