Metadata-Version: 2.4
Name: pdf-rs
Version: 0.1.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Rust
Classifier: Topic :: Text Processing
Summary: Rust PDF parser with Python bindings
Keywords: pdf,parser,markdown,llm,ocr
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# pdf-rs

`pdf-rs` is a Rust PDF parser with Python bindings. It focuses on fast structural
inspection, text extraction, Markdown output, image and font metadata,
incremental updates to metadata and annotations, and LLM-friendly page chunks
with OCR handoff requests.

## Features

- Parse classic xref tables, xref streams, and compressed object streams.
- Decode common stream filters including Flate, ASCIIHex, ASCII85, RunLength,
  and PNG predictors.
- Extract document metadata, page text, outlines, names, page labels, links,
  annotations, forms, embedded files, fonts, and image XObjects.
- Export Markdown and page chunks for RAG/LLM ingestion.
- Produce OCR request placeholders for scanned pages or image regions so an
  external OCR engine can be plugged in without coupling the parser to one OCR
  backend.
- Provide Python bindings through `maturin` / PyO3.
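
To make the two simplest filters above concrete, here is a minimal pure-Python
sketch of ASCIIHex and RunLength decoding as the PDF specification describes
them. It is illustrative only and is not the Rust implementation inside
`pdf-rs`; the function names `ascii_hex_decode` and `run_length_decode` are
hypothetical.

```python
# Illustrative sketch only: not pdf-rs code, and the function names are hypothetical.

PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHex data: hex digit pairs, whitespace ignored, '>' ends the data."""
    digits = bytearray()
    for byte in data:
        if byte == ord(">"):        # end-of-data marker
            break
        if byte in PDF_WHITESPACE:  # whitespace between digits is skipped
            continue
        digits.append(byte)
    if len(digits) % 2:             # an odd final digit is read as if followed by 0
        digits.append(ord("0"))
    return bytes.fromhex(digits.decode("ascii"))


def run_length_decode(data: bytes) -> bytes:
    """Decode RunLength data made of (length byte, payload) runs."""
    out = bytearray()
    i = 0
    while i < len(data):
        length = data[i]
        i += 1
        if length == 128:           # end-of-data marker
            break
        if length < 128:            # copy the next length + 1 bytes literally
            out += data[i:i + length + 1]
            i += length + 1
        else:                       # repeat the next byte 257 - length times
            out += data[i:i + 1] * (257 - length)
            i += 1
    return bytes(out)
```

Flate and the PNG predictors follow the same decode-bytes-to-bytes shape but
need more machinery, so they are left out of this sketch.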

## Rust Usage

```rust
use pdf_rs::{Document, LlmParseOptions, PdfError};

fn main() -> Result<(), PdfError> {
    // Parse the file and print the extracted text of the whole document.
    let document = Document::parse_file("paper.pdf")?;
    println!("{}", document.text()?);

    // Build the LLM-oriented view and walk it page by page.
    let llm = document.to_llm_document_with_options(LlmParseOptions::default())?;
    for page in llm.pages {
        println!("page {}: {}", page.page_number, page.text);
    }
    Ok(())
}
```

## Python Usage

```python
import pdf_rs

doc = pdf_rs.Document.open_mmap("paper.pdf")
print(doc.to_markdown())

for chunk in doc.llm_chunks():
    print(chunk["page_number"], chunk["text"])

for request in doc.ocr_requests(ocr="auto"):
    print(request)
```
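
For RAG ingestion, chunks usually end up in a line-oriented store. The sketch
below shows one way to turn chunks into JSON Lines. It assumes only what the
example above shows, namely that each chunk can be indexed by `"page_number"`
and `"text"`; the helper `chunks_to_jsonl` and the `source` field are
hypothetical and not part of `pdf_rs`.

```python
import json


def chunks_to_jsonl(chunks, source):
    """Serialize page chunks to JSON Lines, one record per non-empty page.

    Hypothetical helper: assumes each chunk exposes "page_number" and "text".
    """
    lines = []
    for chunk in chunks:
        text = chunk["text"].strip()
        if not text:  # skip pages with no extractable text, such as scans awaiting OCR
            continue
        record = {
            "source": source,
            "page_number": chunk["page_number"],
            "text": text,
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data so the sketch runs without a PDF; with pdf_rs installed,
# pass doc.llm_chunks() instead.
sample_chunks = [
    {"page_number": 1, "text": "Introduction"},
    {"page_number": 2, "text": "   "},
]
print(chunks_to_jsonl(sample_chunks, "paper.pdf"))
```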

## CLI

```powershell
cargo run --bin pdf-rs-cli -- paper.pdf --mmap --summary
cargo run --bin pdf-rs-cli -- paper.pdf --text
cargo run --bin pdf-rs-cli -- paper.pdf --ocr-requests
```

## Local Checks

```powershell
cargo fmt --check
cargo test
cargo clippy --all-targets -- -D warnings
cargo check --features mimalloc
cargo clippy --features python --all-targets -- -D warnings
uvx maturin build --features python --out target\wheels
```

To run the Python smoke test after building a wheel:

```powershell
uv venv target\py-smoke --seed
uv pip install --python target\py-smoke\Scripts\python.exe --force-reinstall (Get-Item target\wheels\pdf_rs-*.whl).FullName
target\py-smoke\Scripts\python.exe tests\python_api.py
```

## Release

GitHub Actions builds wheels on Linux, Windows, and macOS when a `v*.*.*` tag is
pushed. The release workflow uploads distributions to PyPI using the
`PYPI_API_TOKEN` repository secret and creates a GitHub release from the same
artifacts.

```powershell
git tag v0.1.0
git push origin v0.1.0
```
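
Before pushing, it can help to confirm that the tag will trigger the workflow
and agrees with the package version. The following is a minimal sketch,
assuming the `v*.*.*` trigger behaves like a GitHub Actions tag filter in which
`*` matches any run of characters except `/`. The regular expression is an
approximation of that filter, and both function names are hypothetical.

```python
import re

# Approximation of the `v*.*.*` tag filter: each `*` becomes "any characters except '/'".
RELEASE_TAG = re.compile(r"v[^/]*\.[^/]*\.[^/]*")


def is_release_tag(tag: str) -> bool:
    """Return True if the tag would match the approximated release pattern."""
    return RELEASE_TAG.fullmatch(tag) is not None


def tag_matches_version(tag: str, version: str) -> bool:
    """Return True if the tag is exactly 'v' followed by the package version."""
    return tag == f"v{version}"


if __name__ == "__main__":
    tag, version = "v0.1.0", "0.1.0"  # version as declared in the package metadata
    print(is_release_tag(tag), tag_matches_version(tag, version))
```

Note that the pattern is loose: a tag such as `v0.1.0-rc1` also matches it, so
the exact-version comparison is the stricter of the two checks.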

