Metadata-Version: 2.1
Name: text-peeler
Version: 0.1.0
Summary: PDF text, table, image, and form extraction utilities
License: MIT
Keywords: pdf,epub,mobi,ebook,extraction,ocr,tables,forms,images
Author: Craig Trim
Author-email: craigtrim@gmail.com
Requires-Python: >=3.10,<4.0
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing
Requires-Dist: Pillow (>=11,<12)
Requires-Dist: beautifulsoup4 (>=4.12,<5.0)
Requires-Dist: ebooklib (>=0.18,<0.19)
Requires-Dist: ocrmypdf (>=16,<17)
Requires-Dist: pdfplumber (>=0.11,<0.12)
Requires-Dist: pymupdf (>=1.25,<2.0)
Requires-Dist: pyobjc-framework-quartz (>=12.1,<13.0)
Requires-Dist: pytesseract (>=0.3,<0.4)
Project-URL: Homepage, https://github.com/craigtrim/text-peeler
Project-URL: Issues, https://github.com/craigtrim/text-peeler/issues
Project-URL: Repository, https://github.com/craigtrim/text-peeler
Description-Content-Type: text/markdown

# Text Peeler

[![PyPI version](https://img.shields.io/pypi/v/text-peeler)][pypi]
[![Python](https://img.shields.io/pypi/pyversions/text-peeler)][pypi]
[![Downloads](https://img.shields.io/pypi/dm/text-peeler)][pypi]
[![License](https://img.shields.io/pypi/l/text-peeler)][license]

One command to extract text, tables, images, and forms from any PDF.

Text Peeler analyzes your document, picks the right extraction strategy, and delivers clean output in the format you need. Digital PDFs, scanned pages, mixed documents, ebooks: one tool handles them all.

## Install

```bash
pip install text-peeler
```

OCR support requires [Tesseract][tesseract]:

```bash
brew install tesseract    # macOS
apt install tesseract-ocr # Debian/Ubuntu
```

See the [full installation guide][docs-install] for all options.

## Quick Start

```bash
# Auto-detect and extract
text-peeler-detect report.pdf

# Extract text from a digital PDF
text-peeler-native report.pdf

# Pull tables as JSON
text-peeler-tables report.pdf --format json

# Run every relevant extractor at once
text-peeler-ensemble report.pdf
```

Or use the shell router:

```bash
./extract.sh auto report.pdf
./extract.sh tables report.pdf output.csv --format csv
```

See the [quickstart guide][docs-quickstart] for more examples.

## What It Does

| Mode | Purpose |
|------|---------|
| `native` | Digital PDFs with selectable text |
| `scanned` | Image-only PDFs via OCR |
| `mixed` | Per-page routing (native or OCR) |
| `tables` | Structured table extraction |
| `images` | Embedded images with surrounding context |
| `forms` | Fillable form field extraction |
| `epub` | EPUB ebook chapter extraction |
| `ebook` | Legacy ebook formats (.mobi, .lit, .prc) |
| `detect` | Analyze a PDF and recommend extractors |
| `ensemble` | Run all relevant extractors, merge into one JSON |
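
To make the difference between `native`, `scanned`, and `mixed` concrete, the sketch below routes each page by how much selectable text it contains. It is a simplified illustration, not the package's actual logic: the function names, the character-count heuristic, and the `min_chars=50` threshold are all assumptions.

```python
def route_pages(char_counts: list[int], min_chars: int = 50) -> list[str]:
    """Choose 'native' or 'ocr' for each page.

    char_counts holds the number of selectable characters found on each page.
    min_chars is a hypothetical threshold, not a documented default.
    """
    return ["native" if count >= min_chars else "ocr" for count in char_counts]


def recommend_mode(char_counts: list[int], min_chars: int = 50) -> str:
    """Collapse per-page routes into a single suggested mode."""
    if not char_counts:
        raise ValueError("document has no pages")
    routes = set(route_pages(char_counts, min_chars))
    if routes == {"native"}:
        return "native"
    if routes == {"ocr"}:
        return "scanned"
    return "mixed"


if __name__ == "__main__":
    counts = [1200, 0, 37, 400]  # made-up per-page character counts
    print(route_pages(counts))
    print(recommend_mode(counts))
```

A document whose pages all clear the threshold maps to `native`, one where none do maps to `scanned`, and anything in between maps to `mixed`.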

## Output Formats

Each extractor listed below supports several output formats. Choose the one that fits your pipeline.

| Extractor | Default | Supported |
|-----------|---------|-----------|
| native | txt | txt, json, md |
| scanned | txt | txt, json, md |
| mixed | txt | txt, json, md |
| tables | json | json, csv, txt, md |
| images | json | json, md, txt |
| forms | json | json, txt, md, csv |

See [output format details][docs-formats] for schema documentation.
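
As a rough picture of how one extracted table could be rendered in several of these formats, here is a minimal standard-library sketch. `render_table` is a hypothetical helper; the real schemas are defined in the output format documentation and may differ from what is shown here.

```python
import csv
import io
import json


def render_table(rows: list[list[str]], fmt: str = "json") -> str:
    """Render one table as json, csv, or md; the first row is the header."""
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    if fmt == "json":
        # One object per data row, keyed by column name.
        records = [dict(zip(header, row)) for row in body]
        return json.dumps(records, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        return buffer.getvalue()
    if fmt == "md":
        lines = [
            "| " + " | ".join(header) + " |",
            "|" + "|".join(" --- " for _ in header) + "|",
        ]
        lines += ["| " + " | ".join(row) + " |" for row in body]
        return "\n".join(lines)
    raise ValueError(f"unsupported format: {fmt}")


if __name__ == "__main__":
    table = [["item", "qty"], ["apple", "3"]]  # made-up sample data
    for fmt in ("json", "csv", "md"):
        print(render_table(table, fmt))
```

Dispatching on a format string keeps each renderer small and mirrors the `--format` flag used on the command line.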

## Use Cases

Text Peeler is built for pipelines. Here are the most common workflows:

- **[LLM Ingestion][uc-llm]**: Feed PDFs into language models as clean, structured text
- **[Batch Processing][uc-batch]**: Process hundreds of mixed PDFs in a single pass
- **[Scanned Documents][uc-scanned]**: OCR pipeline with smart per-page text density checks
- **[Table Extraction][uc-tables]**: Pull tabular data out of PDFs as CSV or JSON
- **[Form Extraction][uc-forms]**: Extract fillable form fields with type-aware parsing
- **[Image Extraction][uc-images]**: Recover embedded images with page context
- **[Ebook Conversion][uc-ebooks]**: Chapter-level text from EPUB and legacy ebook formats
- **[Document Auditing][uc-audit]**: Characterize a PDF without extracting anything

## Architecture

Each extractor is a standalone Python script. No shared base class, no deep inheritance. Shared formatting lives in `output_utils.py`.

```text
text_peeler/
├── detect.py           # PDF analysis + routing
├── ensemble.py         # Multi-extractor runner
├── output_utils.py     # txt/json/md/csv formatting
└── extractors/
    ├── native.py       # pymupdf text
    ├── scanned.py      # pymupdf + tesseract OCR
    ├── mixed.py        # per-page routing
    ├── tables.py       # pdfplumber tables
    ├── images.py       # pymupdf images
    ├── forms.py        # pymupdf form widgets
    ├── epub.py         # ebooklib EPUB
    └── ebook.py        # Calibre ebook conversion
```

See the [architecture guide][docs-arch] for implementation details.

## Documentation

| Guide | Description |
|-------|-------------|
| [Installation][docs-install] | All install methods, system dependencies, troubleshooting |
| [Quickstart][docs-quickstart] | First extraction in under a minute |
| [CLI Reference][docs-cli] | Every flag and option for every mode |
| [Output Formats][docs-formats] | JSON schemas, CSV layouts, Markdown structure |
| [Architecture][docs-arch] | How the pieces fit together |

## See Also

**[Gutenfetchen][gutenfetchen-github]** ([PyPI][gutenfetchen-pypi]) - Bulk download and process public domain texts from Project Gutenberg. Pairs well with Text Peeler for building large text corpora from mixed sources.

## License

[MIT][license]

<!-- Link references (absolute URLs for PyPI compatibility) -->

[pypi]: https://pypi.org/project/text-peeler/
[license]: https://github.com/craigtrim/text-peeler/blob/main/LICENSE
[tesseract]: https://github.com/tesseract-ocr/tesseract

[docs-install]: https://github.com/craigtrim/text-peeler/blob/main/docs/installation.md
[docs-quickstart]: https://github.com/craigtrim/text-peeler/blob/main/docs/quickstart.md
[docs-cli]: https://github.com/craigtrim/text-peeler/blob/main/docs/cli-reference.md
[docs-formats]: https://github.com/craigtrim/text-peeler/blob/main/docs/output-formats.md
[docs-arch]: https://github.com/craigtrim/text-peeler/blob/main/docs/architecture.md

[uc-llm]: https://github.com/craigtrim/text-peeler/blob/main/docs/use-cases/llm-ingestion.md
[uc-batch]: https://github.com/craigtrim/text-peeler/blob/main/docs/use-cases/batch-processing.md
[uc-scanned]: https://github.com/craigtrim/text-peeler/blob/main/docs/use-cases/scanned-documents.md
[uc-tables]: https://github.com/craigtrim/text-peeler/blob/main/docs/use-cases/table-extraction.md
[uc-forms]: https://github.com/craigtrim/text-peeler/blob/main/docs/use-cases/form-extraction.md
[uc-images]: https://github.com/craigtrim/text-peeler/blob/main/docs/use-cases/image-extraction.md
[uc-ebooks]: https://github.com/craigtrim/text-peeler/blob/main/docs/use-cases/ebook-conversion.md
[uc-audit]: https://github.com/craigtrim/text-peeler/blob/main/docs/use-cases/document-auditing.md

[gutenfetchen-github]: https://github.com/craigtrim/gutenfetchen
[gutenfetchen-pypi]: https://pypi.org/project/gutenfetchen/

