Metadata-Version: 2.4
Name: pydocai
Version: 0.1.0
Summary: Extract text from PDFs using pypdfium2 with OCR fallback via pytesseract
Author: Catherine Nelson
License: Apache-2.0
Project-URL: Homepage, https://github.com/catherinenelson1/pydocai
Project-URL: Repository, https://github.com/catherinenelson1/pydocai
Project-URL: Issues, https://github.com/catherinenelson1/pydocai/issues
Keywords: pdf,text extraction,ocr,pypdfium2,tesseract,document processing
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pypdfium2>=4.0.0
Requires-Dist: pytesseract>=0.3.10
Requires-Dist: Pillow>=9.0.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Dynamic: license-file

# pydocai

Extract text from PDF documents using pypdfium2 with automatic OCR fallback via pytesseract.

## Features

- **Fast native text extraction** using pypdfium2 (Apache 2.0 licensed)
- **Automatic OCR fallback** for scanned documents using pytesseract
- **Smart detection** of sparse/scanned PDFs that need OCR
- **Memory-efficient** processing with lazy imports and temporary file handling
- **Page limit** to prevent processing extremely large documents (default: 15 pages)

## Installation

```bash
pip install pydocai
```

### System Dependencies

The OCR fallback requires the Tesseract OCR engine to be installed on your system. `pip` installs the `pytesseract` wrapper but not the engine itself:

**macOS:**
```bash
brew install tesseract
```

**Ubuntu/Debian:**
```bash
sudo apt-get install tesseract-ocr
```

**Windows:**
Download and install from: https://github.com/UB-Mannheim/tesseract/wiki
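
On any platform, you may want to confirm that the Tesseract executable is reachable before relying on the OCR fallback. The following is a minimal sketch using only the standard library. The helper name `tesseract_available` is hypothetical and is not part of pydocai.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the named executable can be found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found on PATH; the OCR fallback should be usable.")
    else:
        print("Tesseract not found on PATH; install it before relying on OCR.")
```

If the check fails on Windows even after installation, the install directory is probably not on `PATH`. In that case you can either add it to `PATH` or point pytesseract at the executable by setting `pytesseract.pytesseract.tesseract_cmd`.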

## Usage

```python
from pydocai import extract_pdf_text

# Extract text and save to file
success = extract_pdf_text("document.pdf", "output.txt")

# Or let it auto-generate the output filename
success = extract_pdf_text("document.pdf")  # Creates document_extracted.txt
```
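
The auto-generated filename in the second call follows the pattern `<name>_extracted.txt`. As an illustration only, here is one way such a name could be derived with `pathlib`. The helper `default_output_path` is hypothetical, and placing the output next to the input PDF is an assumption rather than documented behavior.

```python
from pathlib import Path


def default_output_path(pdf_path: str) -> Path:
    """Derive '<stem>_extracted.txt' from a PDF path.

    Assumption: the output file sits in the same directory as the input.
    """
    source = Path(pdf_path)
    # Path.stem drops only the final suffix, so "q1.scan.pdf" -> "q1.scan".
    return source.with_name(f"{source.stem}_extracted.txt")


if __name__ == "__main__":
    print(default_output_path("document.pdf"))
```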

### How it works

1. First attempts native text extraction using pypdfium2
2. Checks whether the extracted text has sufficient content (at least `MIN_LINES_PER_PAGE` lines per page, 5 by default)
3. If content is sparse (likely a scanned document), falls back to OCR
4. Saves extracted text to the specified output file
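
The steps above can be sketched roughly as follows. This is not the package's actual implementation: the function names are hypothetical, the extractors are passed in as plain callables so the sketch runs without any PDF libraries, and interpreting "lines per page" as the average number of non-blank lines across pages is an assumption.

```python
from typing import Callable, List, Sequence

MIN_LINES_PER_PAGE = 5  # mirrors the documented default


def count_nonblank_lines(text: str) -> int:
    """Count lines that contain something other than whitespace."""
    return sum(1 for line in text.splitlines() if line.strip())


def is_sparse(
    page_texts: Sequence[str],
    min_lines_per_page: int = MIN_LINES_PER_PAGE,
) -> bool:
    """Return True when the average non-blank line count per page is below the threshold."""
    if not page_texts:
        return True  # nothing extracted at all: treat as sparse
    total_lines = sum(count_nonblank_lines(text) for text in page_texts)
    return total_lines < min_lines_per_page * len(page_texts)


def extract_with_fallback(
    native_extract: Callable[[], List[str]],
    ocr_extract: Callable[[], List[str]],
) -> str:
    """Try native extraction first; switch to OCR only if the result looks sparse."""
    pages = native_extract()
    if is_sparse(pages):
        pages = ocr_extract()
    return "\n".join(pages)
```

Passing the extractors in as callables keeps the OCR path lazy: the expensive step runs only when the native result fails the density check.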

### Configuration

The default configuration values can be imported and inspected:

```python
from pydocai import (
    OCR_DPI,           # DPI for OCR rendering (default: 100)
    MAX_PDF_PAGES,     # Maximum pages to process (default: 15)
    MIN_LINES_PER_PAGE # Minimum lines to consider page valid (default: 5)
)
```
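
To get a feel for what an `OCR_DPI` of 100 means in practice, note that PDF page dimensions are measured in points (72 per inch), and that pypdfium2's renderer takes a `scale` factor where 1 corresponds to 72 DPI. The sketch below shows the arithmetic only. The helper names are hypothetical, and whether pydocai converts DPI to scale in exactly this way is an assumption.

```python
PDF_POINTS_PER_INCH = 72  # PDF user-space units per inch


def dpi_to_scale(dpi: float) -> float:
    """Convert a target DPI into a render scale factor (1.0 == 72 DPI)."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / PDF_POINTS_PER_INCH


def rendered_size_px(width_pt: float, height_pt: float, dpi: float) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return (
        round(width_pt * dpi / PDF_POINTS_PER_INCH),
        round(height_pt * dpi / PDF_POINTS_PER_INCH),
    )


if __name__ == "__main__":
    # US Letter is 612 x 792 points.
    print(rendered_size_px(612, 792, 100))  # (850, 1100)
    print(rendered_size_px(612, 792, 300))  # (2550, 3300)
```

A lower DPI keeps the rendered images small, which saves memory and time, at the cost of OCR accuracy on small or faint text.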

## Development

```bash
# Clone the repository
git clone https://github.com/catherinenelson1/pydocai.git
cd pydocai

# Install in development mode with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest
```

## License

Apache License 2.0 - see [LICENSE](LICENSE) for details.
