Metadata-Version: 2.4
Name: trk-mmr-tools
Version: 0.2.0
Summary: OCR and Arabic text correction tools with WordBank
Author: Tarek
Author-email: Tarek <your.email@example.com>
License-Expression: MIT
Classifier: Programming Language :: Python :: 3
Classifier: Operating System :: OS Independent
Requires-Dist: pymupdf>=1.22.0
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: pandas>=2.0.0
Requires-Dist: numpy>=1.26.0
Requires-Dist: pillow>=8.0.0
Requires-Python: >=3.12
Description-Content-Type: text/markdown

# trk-mmr-tools

***PDF extraction, Arabic text correction, and WordBank tools***

## Description

`trk-mmr-tools` provides a streamlined workflow for handling complex Arabic document processing and automated text cleanup. It specializes in:

* **Hybrid PDF Extraction:** High-performance text extraction using PyMuPDF with an automatic OCR fallback via Tesseract.
* **Arabic Text Normalization:** Advanced cleaning and correction tailored for Arabic script.
* **WordBank Integration:** Automated spell-checking and dictionary-based validation using high-performance indexing for large datasets.

It is designed for researchers, developers, and data scientists processing Arabic PDFs for NLP, media ethnography, or building academic word databases.
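
To make the normalization step more concrete, here is a minimal sketch of the kind of cleanup commonly applied to Arabic text before dictionary lookup. It is not the package's actual rule set: the function name `normalize_arabic` and the choice of rules (strip diacritics, strip tatweel, unify alef forms, collapse whitespace) are assumptions made for illustration.

```python
import re
import unicodedata

# Arabic diacritics (tashkeel): fathatan (U+064B) through sukun (U+0652).
_TASHKEEL = re.compile("[\u064B-\u0652]")
# Tatweel (kashida), the elongation character.
_TATWEEL = "\u0640"
# Alef with madda, hamza above, or hamza below -> bare alef (U+0627).
_ALEF_VARIANTS = str.maketrans(
    {"\u0622": "\u0627", "\u0623": "\u0627", "\u0625": "\u0627"}
)


def normalize_arabic(text: str) -> str:
    """Illustrative normalization only; the real rules may differ."""
    # NFKC folds presentation-form glyphs, which often appear in PDF
    # extractions, back into ordinary Arabic letters.
    text = unicodedata.normalize("NFKC", text)
    text = _TASHKEEL.sub("", text)
    text = text.replace(_TATWEEL, "")
    text = text.translate(_ALEF_VARIANTS)
    # Collapse runs of whitespace left behind by extraction.
    return " ".join(text.split())
```

Whether to also fold letters such as alef maksura or teh marbuta depends on the downstream task, so this sketch leaves them untouched.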

---

## Prerequisites

The OCR features rely on `pytesseract`, a Python wrapper around the **Tesseract OCR engine**. pip does not install the engine itself, so you must install it, together with the Arabic data, on your operating system for the OCR features to work:

* **Ubuntu / Google Colab:**
  ```bash
  sudo apt-get update
  sudo apt-get install tesseract-ocr tesseract-ocr-ara
  ```
* **macOS:**
  ```bash
  brew install tesseract tesseract-lang
  ```
* **Windows:** Download the installer from [UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki). Ensure you check the box for **Arabic** script data during installation and add the Tesseract directory to your System PATH.
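
After installing, it can help to confirm that the engine is on your PATH and that the Arabic data is present. The following is a small standard-library sketch, not part of `trk-mmr-tools`; the helper name `tesseract_status` is hypothetical. It relies only on the `tesseract --list-langs` command.

```python
import shutil
import subprocess


def tesseract_status(lang: str = "ara") -> tuple[bool, bool]:
    """Return (engine_found, language_installed) for the local Tesseract."""
    exe = shutil.which("tesseract")
    if exe is None:
        return False, False
    try:
        result = subprocess.run(
            [exe, "--list-langs"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        # The binary exists but could not be run; treat the language as missing.
        return True, False
    # Some builds print the language list to stderr, so scan both streams.
    listed = (result.stdout + "\n" + result.stderr).split()
    return True, lang in listed


if __name__ == "__main__":
    found, has_arabic = tesseract_status("ara")
    print(f"tesseract on PATH: {found}, Arabic data installed: {has_arabic}")
```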

---

## Installation

```bash
# Using pip
pip install trk-mmr-tools
```

> **Note:** This package requires **Python 3.12 or higher** and **NumPy 1.26.0 or higher**. If your environment pins an older NumPy (for example, alongside some versions of JAX or OpenCV), upgrade NumPy before installing.

---

## Usage

The following example demonstrates how to process a PDF (or a folder of PDFs) using the OCR method with Arabic language support and WordBank corrections.

```python
from pathlib import Path
from trk_mmr_tools.pdf.processor import process_pdfs
from trk_mmr_tools.text.correction import TextCorrection

# Define input and output paths
pdf_input = Path("tests/sample.pdf")  # Can be a file or a directory
output_dir = Path("output")
output_dir.mkdir(exist_ok=True)

# Initialize the Arabic text corrector (loads WordBank assets)
corrector = TextCorrection()

# Process the PDFs
process_pdfs(
    source=pdf_input,
    output_dir=output_dir,
    method="ocr",       # Uses Tesseract for scanned/image-based PDFs
    lang="ara",         # Specifies Arabic language for OCR
    clean=True,         # Normalizes characters and removes noise
    corrector=corrector  # Applies the dictionary-based validator
)
```
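
The call above forces `method="ocr"`. How `process_pdfs` implements the hybrid text-then-OCR behaviour mentioned in the Description is not documented here, so the following is only a sketch of the general idea. The function `extract_with_fallback` and the `min_chars` threshold are assumptions. The two extractors are passed in as callables so the example runs without any PDF library.

```python
from typing import Callable, Iterable, TypeVar

Page = TypeVar("Page")


def extract_with_fallback(
    pages: Iterable[Page],
    read_text_layer: Callable[[Page], str],
    run_ocr: Callable[[Page], str],
    min_chars: int = 20,  # hypothetical threshold for "page has real text"
) -> list[tuple[str, str]]:
    """Return one (method, text) pair per page.

    With PyMuPDF, read_text_layer could be `lambda page: page.get_text()`.
    run_ocr could render the page to an image and pass it to
    `pytesseract.image_to_string(image, lang="ara")`.
    """
    results = []
    for page in pages:
        text = read_text_layer(page).strip()
        if len(text) >= min_chars:
            # The embedded text layer looks usable; skip the slower OCR pass.
            results.append(("text", text))
        else:
            # Little or no embedded text, which is typical of scanned pages.
            results.append(("ocr", run_ocr(page).strip()))
    return results
```

A character-count threshold is a crude signal. A text layer can be long and still garbled, so a real pipeline may want a stricter test before trusting it.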

---

## Project Structure

For your custom WordBank and assets to be recognized, ensure your package follows this structure:

```text
trk-mmr-tools/
├── pyproject.toml
├── README.md
└── src/
    └── trk_mmr_tools/
        ├── __init__.py
        ├── pdf/
        ├── text/
        └── assets/
            ├── bank.pkl
            └── data.xlsx
```
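
With this layout, files under `assets/` can be located at runtime through `importlib.resources` rather than hard-coded paths. The helper below is a sketch with a hypothetical name (`asset`); the package's own loader may work differently. Data files are only shipped with an installed package if the build configuration includes them.

```python
from importlib.resources import files


def asset(package: str, name: str):
    """Return a path-like handle for `<package>/assets/<name>`.

    The handle supports .is_file() and .open(), whether or not the
    file actually exists.
    """
    return files(package) / "assets" / name


if __name__ == "__main__":
    # Assumes trk-mmr-tools is installed in the current environment.
    bank = asset("trk_mmr_tools", "bank.pkl")
    print(bank, "exists" if bank.is_file() else "is missing")
```

Because `bank.pkl` is a pickle file, only load it from a source you trust.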

---

## License

MIT License

## Author

**Tarek**
