Metadata-Version: 2.3
Name: trk-mmr-tools
Version: 0.1.0
Summary: OCR and Arabic text correction tools with WordBank
Requires-Dist: pymupdf>=1.22.0,<1.28.0
Requires-Dist: pytesseract>=0.3.10,<0.4.0
Requires-Dist: pandas>=2.0.0,<3.0.0
Requires-Dist: numpy>=1.26.0,<2.0.0
Requires-Dist: pillow>=8.0.0,<12.0.0
Requires-Python: >=3.12
Description-Content-Type: text/markdown

# TRT Tarek Tools



***PDF extraction, Arabic text correction, and WordBank tools***



## Description



This package provides tools for:



* Extracting text from PDF files (with OCR fallback using PyMuPDF and Tesseract)
* Cleaning and correcting Arabic text
* Checking words against a WordBank and applying corrections



It is useful for processing Arabic PDFs, preparing text for NLP, or building word databases.
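
To make the cleaning and WordBank steps concrete, here is a minimal standard-library sketch of what such processing can look like. It is not the package's implementation: `clean_arabic`, `apply_wordbank`, and the dictionary-based WordBank format are hypothetical names and assumptions used for illustration only, and the normalization rules (dropping diacritics and tatweel, folding alef variants) are one common choice, not necessarily what `TextCorrection` does.

```python
import re

# Arabic diacritics (tashkeel, U+064B to U+0652) plus superscript alef (U+0670).
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
_TATWEEL = "\u0640"  # elongation character used for justification
# Fold alef with hamza above/below and alef with madda into bare alef.
# This is lossy and is an assumption, not the package's documented behavior.
_ALEF_VARIANTS = str.maketrans(
    {"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"}
)


def clean_arabic(text: str) -> str:
    """Hypothetical cleaner: strip diacritics and tatweel, fold alef forms."""
    text = _DIACRITICS.sub("", text)
    text = text.replace(_TATWEEL, "")
    text = text.translate(_ALEF_VARIANTS)
    # Collapse runs of whitespace left behind by OCR or PDF extraction.
    return re.sub(r"\s+", " ", text).strip()


def apply_wordbank(text: str, wordbank: dict[str, str]) -> str:
    """Hypothetical lookup: replace each word that has a WordBank correction."""
    return " ".join(wordbank.get(word, word) for word in text.split())
```

Folding alef variants makes dictionary lookups more forgiving of OCR noise, at the cost of discarding spelling distinctions; whether that trade-off is appropriate depends on the downstream task.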



## Installation



```bash
# Using pip (distribution name taken from the package metadata)
pip install trk-mmr-tools
```



## Usage



Here is a simple example using the `process_pdfs` function:



```python
from pathlib import Path

from trk_mmr_tools.pdf.processor import process_pdfs
from trk_mmr_tools.text.correction import TextCorrection

pdf_input = Path("tests/sample.pdf")  # or folder of PDFs
output_dir = Path("output")

output_dir.mkdir(exist_ok=True)

corrector = TextCorrection()

process_pdfs(
    source=pdf_input,
    output_dir=output_dir,
    method="ocr",
    lang="ara",
    clean=True,
    corrector=corrector,
)
```
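
For readers curious how extraction with OCR fallback can work, the following is a rough sketch that calls PyMuPDF and pytesseract directly. It does not show how `process_pdfs` is implemented: `needs_ocr`, `extract_page_texts`, the 20-character threshold, and the 300 DPI render are all assumptions chosen for illustration. It also assumes the Tesseract binary and its Arabic language data (`ara`) are installed on the system.

```python
import io
from pathlib import Path


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned when its text layer is (nearly) empty.

    The threshold is an illustrative assumption, not a package default.
    """
    return len(text.strip()) < min_chars


def extract_page_texts(pdf_path: Path, lang: str = "ara", dpi: int = 300) -> list[str]:
    """Hypothetical helper: use the embedded text layer, else OCR the page."""
    # Third-party imports are local so needs_ocr() stays usable without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    texts = []
    with fitz.open(str(pdf_path)) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text):
                # Render the page to an image and run Tesseract on it.
                pix = page.get_pixmap(dpi=dpi)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(image, lang=lang)
            texts.append(text)
    return texts
```

Checking the text layer first avoids running OCR on born-digital PDFs, where the embedded text is usually both faster to read and more accurate than a recognition pass.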



## License



MIT License



## Author



Tarek



