Metadata-Version: 2.3
Name: trt-secnd-attempt
Version: 0.0.2
Summary: Extract OCR text and annotations from PDF files
Author: Tarek
Author-email: Tarek <tarek12305@gmail.com>
License: MIT
Requires-Dist: pytesseract
Requires-Dist: pdf2image
Requires-Dist: pymupdf
Requires-Dist: pandas
Requires-Dist: tqdm
Requires-Python: >=3.12
Description-Content-Type: text/markdown


PDF Annot Extractor
===================

A Python package for extracting:

- OCR text from PDF pages
- PDF annotations (comments, highlights, etc.)



----------------------------------
INSTALLATION
----------------------------------

    pip install .



----------------------------------
USAGE
----------------------------------

1) Python API
-------------

    from pdf_annot_extractor import PDFTextAnnotationExtractor

    # Basic usage (output defaults to current folder)
    extractor = PDFTextAnnotationExtractor("file.pdf")

    # Save OCR text (one file per page)
    extractor.save_text()

    # Export annotations to Excel
    extractor.export_annotations_excel()





2) Custom Output Folder
------------------------

    extractor = PDFTextAnnotationExtractor("file.pdf", "output/")
    extractor.save_text()
    extractor.export_annotations_excel("output/result.xlsx")





3) Directory Processing
-----------------------

    from pdf_annot_extractor import process_directory

    results = process_directory("pdfs/")
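
The return value of `process_directory` is not described here. As a rough sketch of the directory step only, assuming it amounts to finding every PDF in a folder and handling each one in turn, the enumeration could look like the following. `find_pdfs` is a hypothetical helper for illustration, not part of the package.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return all PDF files directly inside `folder`, sorted by name.

    Hypothetical helper for illustration; not part of pdf_annot_extractor.
    """
    root = Path(folder)
    # Compare the suffix case-insensitively so "SCAN.PDF" is picked up too,
    # and skip directories that merely happen to end in ".pdf".
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Each returned path could then be passed to `PDFTextAnnotationExtractor(str(path))` as in the Python API example.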





4) CLI Usage
------------

Extract annotations to Excel:

    python extractor.py --input file.pdf --output result.xlsx

Extract text files:

    python extractor.py --input file.pdf --mode text

Process directory:

    python extractor.py --input pdfs/ --output all.xlsx





----------------------------------
OPTIONS
----------------------------------

    --input   : PDF file or directory
    --output  : Output file (optional for text mode)
    --mode    : text | excel
    --lang    : OCR language (default: ara)
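
A minimal sketch of how the options above could be declared with `argparse`. Only the flag names, the `text | excel` choices, and the `ara` default come from this document. The use of `argparse`, the `build_parser` name, and the `excel` default for `--mode` (inferred from the CLI examples that omit `--mode`) are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Extract OCR text and annotations from PDF files")
    parser.add_argument("--input", required=True,
                        help="PDF file or directory")
    parser.add_argument("--output", default=None,
                        help="Output file (optional for text mode)")
    # Assumption: excel is the default, since the Excel examples omit --mode.
    parser.add_argument("--mode", choices=["text", "excel"], default="excel",
                        help="text | excel")
    parser.add_argument("--lang", default="ara",
                        help="OCR language (default: ara)")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--input", "my.pdf", "--mode", "text"])
print(args.input, args.mode, args.lang, args.output)
```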





----------------------------------
NOTES
----------------------------------

- The output folder defaults to the current working directory when none is specified.
- Tesseract OCR must be installed, including the Arabic language pack (the default --lang is ara).
- Poppler must be installed; pdf2image depends on it.
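
Both external tools normally have to be reachable on `PATH`. A quick pre-flight check can be sketched with the standard library alone: `tesseract` is the Tesseract executable and `pdftoppm` is one of the Poppler utilities that pdf2image calls. `missing_tools` is a hypothetical helper, not part of this package.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of required executables that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("Tesseract and Poppler executables found")
```

Whether the Arabic data is installed can be checked separately with `tesseract --list-langs`, which should include `ara` in its output.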





----------------------------------
EXAMPLE
----------------------------------

    python extractor.py --input my.pdf --mode text

→ Creates:

    page_001.txt
    page_002.txt
    ...
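
The file names above suggest a zero-padded, three-digit page counter starting at 1. A sketch of that naming scheme, assuming a plain `page_NNN.txt` pattern; `page_filename` is a hypothetical helper and the package's internal code may differ.

```python
from pathlib import Path


def page_filename(page_number, output_dir="."):
    """Return the text-file path for a 1-based page number."""
    # :03d pads to three digits, so names sort correctly up to page 999.
    return Path(output_dir) / f"page_{page_number:03d}.txt"


for n in (1, 2, 12):
    print(page_filename(n).name)
```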





    python extractor.py --input my.pdf --output annotations.xlsx

→ Creates:

    annotations.xlsx

