Metadata-Version: 2.4
Name: doctape
Version: 0.1.0
Summary: Chop large PDFs into page windows, convert each with docling, and reassemble to Markdown
Author-email: Cat Pereira <catpereiradev@gmail.com>
License: MIT
Project-URL: Repository, https://github.com/catherinepereira/doctape
Project-URL: Bug Tracker, https://github.com/catherinepereira/doctape/issues
Keywords: pdf,markdown,docling,ocr,document-conversion
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Text Processing :: Markup :: Markdown
Classifier: Topic :: Utilities
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: docling>=2.99
Requires-Dist: pypdfium2
Provides-Extra: ocr
Requires-Dist: easyocr>=1.7; extra == "ocr"
Dynamic: license-file

## doctape: PDF to Markdown

Converts large PDFs to Markdown by chopping them into page windows, running each window through [docling](https://github.com/docling-project/docling), and reassembling the results. Per-window processing keeps memory bounded, shows progress, and makes long jobs resumable after a crash.

### Installation

```bash
pip install doctape
```

This installs the fast layout-based pipeline. For OCR on scanned or cover-art pages, install the extra:

```bash
pip install "doctape[ocr]"
```

### Usage

Put PDFs in `docs/` and name the one to convert:

```bash
doctape complan.pdf
```

The PDF is converted in 20-page windows. Per-window Markdown lands in `out/chunks/<name>/`, and the reassembled document in `out/<name>.md` with `<!-- pages NNNN-NNNN -->` markers between windows.
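To make the windowing concrete, here is a minimal sketch of how page ranges and markers of this shape could be computed. It is not doctape's actual code: the helper names `page_windows` and `page_marker` are hypothetical, and the 1-based inclusive page numbering and four-digit zero padding are assumptions inferred from the `NNNN-NNNN` marker pattern.

```python
def page_windows(total_pages: int, chunk_size: int = 20) -> list[tuple[int, int]]:
    """Return inclusive (first, last) page ranges covering the document.

    Assumes 1-based page numbers; the last window may be shorter.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def page_marker(first: int, last: int) -> str:
    """Format a window marker, assuming four-digit zero-padded page numbers."""
    return f"<!-- pages {first:04d}-{last:04d} -->"


# A 45-page PDF with the default chunk size yields three windows.
windows = page_windows(45)
print(windows)                    # [(1, 20), (21, 40), (41, 45)]
print(page_marker(*windows[-1]))  # <!-- pages 0041-0045 -->
```

A smaller `--chunk-size` produces more, shorter windows, which lowers peak memory per conversion and reduces the work lost if a run is interrupted.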

### Options

| Argument       | Default  | Meaning                                                 |
| -------------- | -------- | ------------------------------------------------------- |
| `pdf`          | required | PDF to convert (filename under `--docs-dir`, or a path) |
| `--docs-dir`   | `docs`   | Directory of source PDFs                                |
| `--out-dir`    | `out`    | Output directory                                        |
| `--chunk-size` | `20`     | Pages per window                                        |
| `--force`      | off      | Re-convert chunks that already exist                    |
| `--ocr`        | off      | Force EasyOCR (requires the `ocr` extra; much slower)   |

### Resuming

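A chunk with an existing non-empty `.md` is skipped, so re-running the same command after an interruption picks up where it stopped. Reassembly always reflects whatever chunks are on disk, so an interrupted run leaves a partial `out/<name>.md` until the remaining chunks are converted. Use `--force` to re-convert chunks that already exist.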
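The skip rule described above can be sketched as a small predicate. This is an illustration of the stated behavior, not doctape's implementation; the function name `should_convert` is hypothetical.

```python
from pathlib import Path


def should_convert(chunk_md: Path, force: bool = False) -> bool:
    """Return True when a chunk has to be converted or re-converted."""
    if force:
        # --force redoes chunks regardless of what is on disk.
        return True
    # A missing or zero-byte .md counts as unfinished work.
    return not (chunk_md.is_file() and chunk_md.stat().st_size > 0)
```

Treating an empty file as unfinished matters for crash recovery: a process killed after opening the output file but before writing to it would otherwise leave a chunk that looks complete.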

### OCR

The default pipeline uses layout and table-structure detection without OCR, which is fast and accurate on digital-native PDFs. Scanned pages and stylized cover art need OCR: install `doctape[ocr]` and pass `--ocr`. OCR is much slower on CPU, so reserve it for documents that need it. When comparing an OCR run against a non-OCR run, write the OCR output to a separate `--out-dir` so that existing chunks are neither skipped nor overwritten.

### Python API

```python
from pathlib import Path
from doctape import build_converter, convert_pdf

converter = build_converter(ocr=False)
convert_pdf(Path("docs/complan.pdf"), Path("out"), chunk_size=20,
            force=False, converter=converter)
```
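
After a conversion, the page markers in the reassembled Markdown offer a quick way to see which page ranges the file mentions, for example to spot windows that are still missing after an interrupted run. The sketch below uses only the standard library. The helper `marked_ranges` is hypothetical and not part of the doctape API, and it assumes the markers contain plain decimal page numbers in the `<!-- pages NNNN-NNNN -->` form.

```python
import re

# Matches markers such as "<!-- pages 0021-0040 -->".
MARKER = re.compile(r"<!-- pages (\d+)-(\d+) -->")


def marked_ranges(markdown: str) -> list[tuple[int, int]]:
    """Return the (first, last) page ranges named by markers, in document order."""
    return [(int(first), int(last)) for first, last in MARKER.findall(markdown)]


sample = (
    "Intro text\n"
    "<!-- pages 0001-0020 -->\n"
    "More text\n"
    "<!-- pages 0021-0040 -->\n"
    "Closing text\n"
)
print(marked_ranges(sample))  # [(1, 20), (21, 40)]
```

To apply this to real output, pass in the contents of the reassembled file, for example `Path("out/complan.md").read_text(encoding="utf-8")`.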
