Metadata-Version: 2.4
Name: pdf2epub-cli
Version: 0.1.0
Summary: Python CLI for turning PDF books into clean, readable EPUB files
Keywords: cli,ebook,epub,ocr,pdf,pdf-to-epub
Author: 破晓
Author-email: 破晓 <AHpx@yandex.com>
License-Expression: AGPL-3.0-or-later
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Text Processing
Classifier: Topic :: Utilities
Requires-Dist: openai>=2.29.0
Requires-Dist: pymupdf>=1.27.2.2
Requires-Dist: pypdf>=6.9.1
Requires-Python: >=3.12
Project-URL: Homepage, https://github.com/ahpxex/pdf2epub
Project-URL: Repository, https://github.com/ahpxex/pdf2epub
Project-URL: Issues, https://github.com/ahpxex/pdf2epub/issues
Project-URL: Documentation, https://github.com/ahpxex/pdf2epub#readme
Project-URL: Changelog, https://github.com/ahpxex/pdf2epub/blob/master/CHANGELOG.md
Description-Content-Type: text/markdown

# pdf2epub

`pdf2epub` is a Python CLI for converting PDF books into readable EPUB files.
The scope is intentionally narrow for now: one solid pipeline, `PDF -> EPUB`.

Requires Python `3.12+`.

## What it does

- reads a PDF and extracts text page by page
- supports both text-layer PDFs and scanned or image-only PDFs
- can call a local or self-hosted OpenAI-compatible model for OCR cleanup and
  Markdown structuring
- builds an EPUB from the cleaned and reflowed content instead of dumping raw
  text

## Install

Published package name on PyPI: `pdf2epub-cli`

Installed command name: `pdf2epub`

If you only want the CLI on your machine, prefer `uv tool` or `pipx`:

```bash
uv tool install pdf2epub-cli
pdf2epub --help
```

```bash
pipx install pdf2epub-cli
pdf2epub --help
```

If you want it inside an existing Python environment:

```bash
pip install pdf2epub-cli
pdf2epub --help
```

## Run from source

Sync dependencies:

```bash
uv sync
```

Run the CLI in the repo:

```bash
uv run pdf2epub --help
```

If you want a local editable command while developing in the repo:

```bash
uv tool install -e .
pdf2epub --help
```

You can also install it directly from GitHub:

```bash
uv tool install git+https://github.com/ahpxex/pdf2epub
pdf2epub --help
```

The PyPI distribution is called `pdf2epub-cli` because the plain `pdf2epub`
package name is already taken on PyPI. The executable command remains
`pdf2epub`.

## CLI usage

```bash
uv run pdf2epub <pdf_path> [options]
```

Common options:

- `-o, --output <path>`: output EPUB path, default is `<pdf_name>.epub`
- `--page <n>`: print text from a single 1-based page and exit
- `--title <text>`: override EPUB title
- `--author <text>`: override EPUB author
- `--language <code>`: override language code such as `en` or `zh`
- `--text-only`: print extracted text only; do not generate an EPUB
- `--extract-mode <auto|native|llm>`: choose the extraction strategy
- `--batch-size <n>`: pages per LLM batch, default `5`
- `--max-workers <n>`: concurrent LLM requests, default `4`
- `--llm-model <name>`: model name
- `--llm-base-url <url>`: OpenAI-compatible base URL
- `--llm-api-key <key>`: API key
- `--llm-timeout <seconds>`: per-request timeout, default `120`
- `--llm-temperature <value>`: sampling temperature, default `0.0`

## Most common workflows

### Convert a text-based PDF

If the PDF already has a selectable text layer, prefer `native` mode:

```bash
uv run pdf2epub ./book.pdf --extract-mode native
```

This writes `./book.epub` by default.
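
The default output location can be reproduced in a few lines of Python. This is
a sketch of the documented behavior (same directory, `.epub` extension), and
`default_output_path` is a hypothetical helper name, not part of the package's
API.

```python
from pathlib import Path


def default_output_path(pdf_path):
    """Return the PDF path with its final suffix replaced by .epub."""
    return Path(pdf_path).with_suffix(".epub")
```

Only the last suffix is replaced, so a file named `my.book.pdf` would map to
`my.book.epub` in this sketch.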

### Set output path and metadata

```bash
uv run pdf2epub ./book.pdf \
  --extract-mode native \
  --output ./out/book.epub \
  --title "Custom Title" \
  --author "Author Name" \
  --language en
```

### Inspect one page

```bash
uv run pdf2epub ./book.pdf --page 12
```

This is useful when you want to debug extraction quality on a specific page.
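
Because `--page` is 1-based while most PDF libraries index pages from zero, a
conversion step is needed somewhere. The sketch below shows one way to do it;
`page_index` is a hypothetical helper, and raising on out-of-range input is an
assumption, since the CLI's actual behavior in that case is not documented here.

```python
def page_index(page_number, page_count):
    """Convert a 1-based page number to a 0-based index, validating the range."""
    if not 1 <= page_number <= page_count:
        raise ValueError(
            f"page must be between 1 and {page_count}, got {page_number}"
        )
    return page_number - 1
```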

### Print cleaned text only

```bash
uv run pdf2epub ./book.pdf --text-only --extract-mode native
```

### Convert a scanned PDF

For a local Ollama server or any other OpenAI-compatible endpoint:

```bash
export PDF2EPUB_LLM_BASE_URL=http://127.0.0.1:11434/v1
export PDF2EPUB_LLM_API_KEY=dummy
export PDF2EPUB_LLM_MODEL=glm-ocr

uv run pdf2epub ./scanned-book.pdf --extract-mode llm
```

You can also pass the same config via flags:

```bash
uv run pdf2epub ./scanned-book.pdf \
  --extract-mode llm \
  --llm-model glm-ocr \
  --llm-base-url http://127.0.0.1:11434/v1 \
  --llm-api-key dummy
```

## Extraction modes

### `native`

Reads the embedded PDF text layer directly.

- best for normal text PDFs
- fast
- does not require a model
- does not work for image-only scanned PDFs

```bash
uv run pdf2epub ./book.pdf --extract-mode native
```

### `llm`

Runs the LLM pipeline for OCR, cleanup, and Markdown structuring.

- best for scanned PDFs, image-heavy PDFs, or noisy extraction
- more robust on image pages
- slower than native mode
- requires model configuration

```bash
uv run pdf2epub ./scan.pdf --extract-mode llm
```

### `auto`

Chooses between `native` and `llm` based on text density.

- in `--text-only` mode, it can fall back automatically
- in normal EPUB generation mode, you should still configure the LLM in advance,
  because the document may be classified as OCR-needed

If you already know the PDF has a clean text layer, `--extract-mode native` is
usually the safer choice.
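
A density-based choice of the kind `auto` makes could look like the sketch
below. The real heuristic is not documented here, so treat everything specific
as an assumption: the `choose_mode` name, the 200-character threshold, averaging
across pages, and defaulting to `llm` for an empty document.

```python
def choose_mode(page_texts, min_chars_per_page=200):
    """Pick 'native' when the average text per page looks dense enough."""
    if not page_texts:
        # Assumption: with nothing extracted, treat the document as OCR-needed.
        return "llm"
    total = sum(len(text.strip()) for text in page_texts)
    average = total / len(page_texts)
    return "native" if average >= min_chars_per_page else "llm"
```

A scanned book whose pages yield little or no embedded text would fall below
the threshold and be routed to `llm`, which is why the model should be
configured before running `auto` on unknown input.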

## LLM configuration

`pdf2epub` looks for these environment variables first:

- `PDF2EPUB_LLM_MODEL`
- `PDF2EPUB_LLM_BASE_URL`
- `PDF2EPUB_LLM_API_KEY`

It also accepts these standard fallbacks:

- `OPENAI_MODEL`
- `OPENAI_BASE_URL`
- `OPENAI_API_KEY`

Example `.env`:

```dotenv
PDF2EPUB_LLM_BASE_URL=http://127.0.0.1:11434/v1
PDF2EPUB_LLM_API_KEY=dummy
PDF2EPUB_LLM_MODEL=glm-ocr
```
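
The lookup order described above can be sketched like this. `resolve_setting`
is a hypothetical helper, and letting a CLI flag override both environment
variables is an assumption; the documented part is only that the `PDF2EPUB_LLM_*`
name is checked before its `OPENAI_*` fallback.

```python
import os


def resolve_setting(name, cli_value=None, env=None):
    """Resolve one LLM setting such as MODEL, BASE_URL, or API_KEY."""
    env = os.environ if env is None else env
    if cli_value:
        # Assumption: an explicit flag wins over any environment variable.
        return cli_value
    # Project-specific variable first, then the standard OpenAI-style fallback.
    return env.get(f"PDF2EPUB_LLM_{name}") or env.get(f"OPENAI_{name}")
```

For example, `resolve_setting("MODEL")` would return the value of
`PDF2EPUB_LLM_MODEL` if set, otherwise `OPENAI_MODEL`, otherwise `None`.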

## Output behavior

- default output path: same directory as the source PDF, with an `.epub`
  extension
- if `--title`, `--author`, or `--language` is omitted, the CLI tries to infer
  the missing value from PDF metadata and extracted text
- if text extraction succeeds but no chapters can be built, the CLI exits with
  an error instead of silently generating a broken EPUB
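
The "fail instead of writing a broken EPUB" rule can be illustrated with a small
chapter splitter. This is a sketch under stated assumptions, not the project's
implementation: it assumes chapters start at level-1 or level-2 Markdown
headings, and `split_chapters` is a hypothetical name.

```python
import re


def split_chapters(markdown):
    """Split Markdown into (title, body) pairs at level-1 or level-2 headings."""
    chapters = []
    title, lines = None, []
    for line in markdown.splitlines():
        match = re.match(r"^#{1,2}\s+(.*\S)\s*$", line)
        if match:
            if title is not None:
                chapters.append((title, "\n".join(lines).strip()))
            title, lines = match.group(1), []
        elif title is not None:
            # Text before the first heading is ignored in this sketch.
            lines.append(line)
    if title is not None:
        chapters.append((title, "\n".join(lines).strip()))
    if not chapters:
        # Fail loudly rather than producing an EPUB with no content.
        raise ValueError("no chapters could be built from the extracted text")
    return chapters
```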

## Testing

Run the unit test suite:

```bash
uv run pytest
```

Run the live local-model integration test:

```bash
PDF2EPUB_RUN_LIVE_LLM=1 uv run pytest -m live_llm -s
```

## Benchmarks and local book fixtures

- manual end-to-end artifacts live under `benchmarks/artifacts/e2e/`
- local linked book fixtures live under `benchmarks/local/downloads_books/files/`
- automated quality regression script:

```bash
uv run python scripts/run_quality_regression.py
```

## Naming and scope

The project used to be called `any2epub`. It is now intentionally narrowed to
`pdf2epub`: the current goal is not "convert anything to EPUB", but to make the
PDF-to-EPUB pipeline stable and high quality first.

For package distribution, PyPI uses `pdf2epub-cli` while the executable command
stays `pdf2epub`.

## License

This project is licensed under `AGPL-3.0-or-later`. See `LICENSE`.
