Metadata-Version: 2.4
Name: lazypdf
Version: 0.2.0
Summary: Simple PDF manipulation and conversion for Python
Author-email: Joao Manoel Feck <jmfeck@gmail.com>
License-Expression: BSD-3-Clause
Project-URL: Homepage, https://github.com/jmfeck/lazypdf
Project-URL: Repository, https://github.com/jmfeck/lazypdf
Project-URL: Issues, https://github.com/jmfeck/lazypdf/issues
Keywords: pdf,merge,split,compress,watermark,ocr,convert
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries
Classifier: Topic :: Utilities
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pymupdf<2.0,>=1.24.0
Provides-Extra: ocr
Requires-Dist: pytesseract>=0.3.10; extra == "ocr"
Requires-Dist: Pillow>=10.0.0; extra == "ocr"
Provides-Extra: office
Requires-Dist: python-docx>=1.1.0; extra == "office"
Requires-Dist: python-pptx>=0.6.23; extra == "office"
Requires-Dist: openpyxl>=3.1.0; extra == "office"
Provides-Extra: tables
Requires-Dist: pdfplumber>=0.11.0; extra == "tables"
Provides-Extra: html
Requires-Dist: weasyprint>=60.0; extra == "html"
Provides-Extra: browser
Requires-Dist: playwright>=1.40.0; extra == "browser"
Provides-Extra: repair
Requires-Dist: pikepdf>=8.0.0; extra == "repair"
Provides-Extra: msoffice
Requires-Dist: pywin32>=306; sys_platform == "win32" and extra == "msoffice"
Provides-Extra: all
Requires-Dist: lazypdf[browser,html,msoffice,ocr,office,repair,tables]; extra == "all"
Provides-Extra: dev
Requires-Dist: lazypdf[all]; extra == "dev"
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov>=5.0; extra == "dev"
Requires-Dist: ruff>=0.5.0; extra == "dev"
Dynamic: license-file

# lazypdf

[![Tests](https://github.com/jmfeck/lazypdf/actions/workflows/tests.yml/badge.svg)](https://github.com/jmfeck/lazypdf/actions/workflows/tests.yml)
[![Python](https://img.shields.io/pypi/pyversions/lazypdf)](https://pypi.org/project/lazypdf/)
[![License](https://img.shields.io/github/license/jmfeck/lazypdf)](LICENSE)

Simple PDF manipulation and conversion for Python. Read a PDF, transform it, export to another format. That's it.

No complex pipelines, no bloated abstractions — just a clean, fluent API to merge, split, compress, watermark, convert, and more.

## Install

```bash
pip install lazypdf
```

Optional extras:

```bash
pip install "lazypdf[ocr]"       # OCR support (pytesseract + Pillow)
pip install "lazypdf[office]"    # DOCX/XLSX/PPTX export (python-docx, openpyxl, python-pptx)
pip install "lazypdf[tables]"    # Table extraction (pdfplumber)
pip install "lazypdf[html]"      # HTML to PDF via WeasyPrint engine
pip install "lazypdf[browser]"   # HTML to PDF via Playwright engine (Chromium)
pip install "lazypdf[repair]"    # PDF repair via pikepdf engine
pip install "lazypdf[msoffice]"  # MS Office COM automation on Windows (pywin32)
pip install "lazypdf[all]"       # Everything (quotes keep shells like zsh from expanding the brackets)
```

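To see which extras are already usable in the current environment, a small check along these lines may help. It is a sketch rather than part of lazypdf: `EXTRA_MODULES` and `available_extras()` are hypothetical names, and the mapping simply pairs each extra listed above with the import names of the packages it installs.

```python
from importlib.util import find_spec

# Extra name -> top-level import names of the packages it installs.
# Import names differ from PyPI names for some packages
# (Pillow -> PIL, python-docx -> docx, python-pptx -> pptx, pywin32 -> win32com).
EXTRA_MODULES = {
    "ocr": ["pytesseract", "PIL"],
    "office": ["docx", "pptx", "openpyxl"],
    "tables": ["pdfplumber"],
    "html": ["weasyprint"],
    "browser": ["playwright"],
    "repair": ["pikepdf"],
    "msoffice": ["win32com"],
}


def available_extras():
    """Return, per extra, whether all of its modules can be located."""
    return {
        extra: all(find_spec(name) is not None for name in modules)
        for extra, modules in EXTRA_MODULES.items()
    }


if __name__ == "__main__":
    for extra, ok in available_extras().items():
        print(f"{extra:<9} {'installed' if ok else 'missing'}")
```

This only confirms that the Python packages can be found; it does not check system tools such as Tesseract or LibreOffice.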
## Quick Start

```python
import lazypdf as lz

# Read -> Transform -> Export
lz.read("input.pdf").rotate(90).compress().to_pdf("output.pdf")

# Merge multiple PDFs
lz.merge("file1.pdf", "file2.pdf", "file3.pdf").to_pdf("merged.pdf")

# Convert images to PDF
lz.read_images("scan1.jpg", "scan2.jpg").to_pdf("scans.pdf")

# Read Office documents (requires MS Office or LibreOffice)
lz.read_docx("report.docx").add_watermark("DRAFT").to_pdf("draft.pdf")
lz.read_xlsx("data.xlsx").to_png("output/")
lz.read_pptx("slides.pptx").extract_pages([1, 3]).to_pdf("summary.pdf")

# Extract specific pages
lz.read("big.pdf").extract_pages([1, 3, 5]).to_pdf("selected.pdf")

# Add watermark and page numbers
(
    lz.read("report.pdf")
    .add_watermark("CONFIDENTIAL", opacity=0.2)
    .add_page_numbers(position="bottom-center")
    .to_pdf("final.pdf")
)

# Export to images
lz.read("slides.pdf").to_png("output_dir/", dpi=300)

# Extract text
text = lz.read("document.pdf").extract_text()

# Encrypt / decrypt
lz.read("doc.pdf").encrypt("password").to_pdf("protected.pdf")
lz.read("protected.pdf").decrypt("password").to_pdf("unlocked.pdf")

# Redact sensitive text (case-sensitive, exact match)
lz.read("doc.pdf").redact("SECRET-123").to_pdf("redacted.pdf")

# Split into individual pages
lz.read("doc.pdf").split("output_dir/", every=1)

# Chain anything
(
    lz.read("input.pdf")
    .merge("extra.pdf")
    .remove_pages([2, 4])
    .rotate(90, pages=[1])
    .crop(left=50, right=50)
    .add_watermark("DRAFT")
    .compress()
    .to_pdf("result.pdf")
)
```

## API Reference

### Entry Points

| Function | Description | Dependency |
|----------|-------------|------------|
| `lz.read(path)` | Read a PDF file | pymupdf |
| `lz.read_pdf(path)` | Alias for `read()` | pymupdf |
| `lz.merge(*paths)` | Merge multiple PDFs | pymupdf |
| `lz.read_images(*paths, page_size=)` | Create PDF from images (`page_size` defaults to `"fit"`) | pymupdf |
| `lz.read_jpg(*paths, page_size=)` | Create PDF from JPEGs | pymupdf |
| `lz.read_png(*paths, page_size=)` | Create PDF from PNGs | pymupdf |
| `lz.read_html(path_or_url, engine=)` | Create PDF from HTML (`engine` defaults to `"pymupdf"`) | pymupdf |
| `lz.read_docx(path)` | Read Word document | MS Office / LibreOffice |
| `lz.read_xlsx(path)` | Read Excel spreadsheet | MS Office / LibreOffice |
| `lz.read_pptx(path)` | Read PowerPoint presentation | MS Office / LibreOffice |
| `lz.read_csv(path)` | Read CSV file | MS Office / LibreOffice |
| `lz.from_bytes(data)` | Create PDF from raw bytes | pymupdf |

### Chainable Operations

| Method | Description |
|--------|-------------|
| `.merge(*others)` | Append more PDFs (paths, objects, or lists) |
| `.rotate(degrees, pages=)` | Rotate pages (multiple of 90) |
| `.crop(left=, top=, right=, bottom=, pages=)` | Crop page margins (in points) |
| `.compress(img_quality=, compression_level=)` | Reduce file size (deflate compression, dedup objects) |
| `.add_watermark(text, ...)` | Add text watermark |
| `.add_image_watermark(path, ...)` | Add image watermark (with opacity) |
| `.add_page_numbers(...)` | Insert page numbers |
| `.resize(size, pages=)` | Resize pages to standard paper size (a4, letter, etc.) |
| `.flatten(dpi=, pages=)` | Rasterize pages (burns annotations/forms into flat image) |
| `.extract_pages(pages)` | Keep only specified pages |
| `.remove_pages(pages)` | Remove specified pages |
| `.reorder(order)` | Reorder/duplicate pages |
| `.reverse()` | Reverse page order |
| `.encrypt(password, algorithm=)` | Add password protection (`algorithm` defaults to AES-256-R5) |
| `.decrypt(password)` | Remove password protection |
| `.redact(text)` | Black out text permanently |
| `.repair(engine=)` | Fix corrupted PDFs (`engine` defaults to `"auto"`) |
| `.ocr(language=)` | Make scanned pages searchable |
| `.copy()` | Create independent copy |

All page parameters are **1-indexed** (first page = 1).

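PyMuPDF itself numbers pages from 0, so the 1-indexed convention matters when mixing lazypdf calls with direct PyMuPDF code. The helper below is a hypothetical illustration of the conversion (it is not a lazypdf function), including the range check a caller would likely want:

```python
def to_zero_based(pages, page_count):
    """Validate 1-indexed page numbers and convert them to 0-based indices."""
    indices = []
    for page in pages:
        if not 1 <= page <= page_count:
            raise ValueError(f"page {page} is outside 1..{page_count}")
        indices.append(page - 1)
    return indices


print(to_zero_based([1, 3, 5], page_count=5))  # [0, 2, 4]
```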
### Export (Terminal Operations)

| Method | Returns |
|--------|---------|
| `.to_pdf(path)` | `str` (output path) |
| `.to_jpg(output_dir)` | `list[str]` (image paths) |
| `.to_png(output_dir)` | `list[str]` (image paths) |
| `.to_images(output_dir, fmt=)` | `list[str]` (image paths) |
| `.to_docx(path)` | `str` (output path) |
| `.to_xlsx(path)` | `str` (output path) |
| `.to_pdfa(path, level=, engine=)` | `str` (output path; `engine` defaults to `"pymupdf"`) |
| `.to_bytes()` | `bytes` |
| `.split(output_dir, every=)` | `list[str]` (PDF paths) |
| `.split_at(output_dir, at=)` | `list[str]` (PDF paths) |

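As a rough mental model for `.split(output_dir, every=)`, the sketch below groups 1-indexed page numbers into consecutive chunks. It assumes that `every=n` puts `n` pages in each output file and that a shorter final chunk is kept; the README only confirms the `every=1` case, and `split_groups` is a hypothetical name, not a lazypdf function.

```python
def split_groups(page_count, every=1):
    """Group 1-indexed page numbers into consecutive chunks of `every` pages."""
    if every < 1:
        raise ValueError("every must be >= 1")
    pages = list(range(1, page_count + 1))
    # Step through the page list `every` pages at a time.
    return [pages[i:i + every] for i in range(0, page_count, every)]


print(split_groups(5, every=2))  # [[1, 2], [3, 4], [5]]
```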
### Extraction & Info

| Method / Property | Returns |
|----------|---------|
| `.extract_text(pages=, engine=, page_separator=)` | `str` |
| `.extract_tables(pages=, flavor=)` | `list[list[list[str]]]` |
| `.extract_images(output_dir, pages=)` | `list[str]` (image paths) |
| `.metadata` | `dict` |
| `.page_count` | `int` |
| `.page_sizes()` | `list[tuple[float, float]]` |

## Limitations

- **Office reads** (`read_docx`, `read_xlsx`, `read_pptx`, `read_csv`) require either Microsoft Office (Windows, auto-detected) or LibreOffice (any OS, must be on PATH). No pure-Python solution exists for reliable Office-to-PDF conversion.
- **`to_docx()`** extracts text only. Images, tables, and complex formatting are not preserved.
- **`to_xlsx()`** only exports tables found in the PDF. Requires the `[tables]` and `[office]` extras.
- **OCR** (`ocr()`) requires Tesseract to be installed on the system in addition to the `[ocr]` pip extra.
- **`read_html()`** defaults to the PyMuPDF Story engine (basic CSS). For better rendering, use `engine="weasyprint"` (requires the Pango/GTK system libraries) or `engine="playwright"` (requires Chromium).
- **Redaction** (`redact()`) matches text exactly and case-sensitively. Save the result with `to_pdf()` to persist it.
- **PDF/A** (`to_pdfa()`) defaults to the PyMuPDF engine, which may not pass strict validators. Use `engine="ghostscript"` for full compliance (requires the Ghostscript binary).
- **Flatten** (`flatten()`) rasterizes pages to images, so text becomes non-searchable. Default DPI is 72; use higher values for better quality.
- **Image watermark** (`add_image_watermark()`) requires Pillow (included in the `[ocr]` extra).

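Several of these limitations come down to an external program being installed. A quick pre-flight check with the standard library might look like the sketch below. The executable names are the usual ones for Tesseract, LibreOffice, and Ghostscript, but how lazypdf itself locates them is not documented here, so treat `EXTERNAL_TOOLS` and `find_external_tools()` as hypothetical helpers. Microsoft Office detection on Windows is not covered.

```python
import shutil

# Feature -> candidate executable names to look for on PATH.
# Ghostscript is usually `gs` on Linux/macOS and `gswin64c`/`gswin32c` on Windows.
EXTERNAL_TOOLS = {
    "ocr (Tesseract)": ["tesseract"],
    "office reads (LibreOffice)": ["soffice", "libreoffice"],
    "PDF/A (Ghostscript)": ["gs", "gswin64c", "gswin32c"],
}


def find_external_tools():
    """Map each feature to the first matching executable path, or None."""
    found = {}
    for feature, names in EXTERNAL_TOOLS.items():
        paths = [shutil.which(name) for name in names]
        found[feature] = next((p for p in paths if p), None)
    return found


if __name__ == "__main__":
    for feature, path in find_external_tools().items():
        print(f"{feature}: {path or 'not found on PATH'}")
```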
## License

BSD-3-Clause
