Metadata-Version: 2.4
Name: smartpdfplumber
Version: 0.1.3
Summary: A wrapper around LangChain’s PDFPlumber integration with added support for image-aware data extraction.
Requires-Python: >=3.13
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: dotenv>=0.9.9
Requires-Dist: langchain-community>=0.4.1
Requires-Dist: langchain-core>=1.3.2
Requires-Dist: pdfplumber>=0.11.9
Provides-Extra: dev
Requires-Dist: google-genai>=1.74.0; extra == "dev"
Requires-Dist: groq>=1.2.0; extra == "dev"
Requires-Dist: pytest>=9.0.3; extra == "dev"
Requires-Dist: torch>=2.11.0; extra == "dev"
Requires-Dist: torchvision>=0.26.0; extra == "dev"
Requires-Dist: transformers>=5.7.0; extra == "dev"
Dynamic: license-file

# Smart PDF Plumber

A lightweight wrapper over LangChain’s PDFPlumber integration that extends PDF parsing with image understanding: it extracts not just text, but also contextual insights from embedded images.

It is designed for two common cases:

- Extract plain text from PDFs page by page.
- Describe embedded images and include those descriptions in the page text.

## Features

- Page-level PDF parsing with `pdfplumber`.
- Optional character deduplication for PDFs with repeated text.
- Optional image description support using Google Gemini, Groq, or Hugging Face vision-language models.
- LangChain-friendly output: a list of `Document` objects with metadata such as source path, page number, and total pages.

## Installation

Install from PyPI:

```bash
pip install smartpdfplumber
```

The project currently targets Python 3.13 or newer.

## Quick Start

```python
from smartpdfplumber.loader import SmartPDFLoader

loader = SmartPDFLoader("path/to/file.pdf", describe_image=True, inference="groq_ai")
documents = loader.load()

for document in documents:
    print(document.metadata)
    print(document.page_content)
```
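
Since each result exposes `page_content` and `metadata`, downstream code can index pages directly. The following is a minimal sketch, assuming the page number is stored under a metadata key named `page`; that key name is an assumption, so inspect `document.metadata` for the actual keys. `index_by_page` is a hypothetical helper and not part of the package, and the sample objects are stand-ins shaped like LangChain `Document` objects.

```python
from types import SimpleNamespace


def index_by_page(documents, page_key="page"):
    """Map page number -> page text for objects exposing .metadata and .page_content."""
    pages = {}
    for document in documents:
        # Skip documents that lack the assumed page key rather than guessing a value.
        if page_key not in document.metadata:
            continue
        pages[document.metadata[page_key]] = document.page_content
    return pages


# Stand-in objects used only for illustration; real code would pass loader.load().
sample = [
    SimpleNamespace(page_content="First page", metadata={"source": "file.pdf", "page": 0}),
    SimpleNamespace(page_content="Second page", metadata={"source": "file.pdf", "page": 1}),
]
pages = index_by_page(sample)
```

With real output, `index_by_page(loader.load())` would give the same kind of mapping, provided the key name matches.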

### `text_kwargs`

Extra keyword arguments passed to `pdfplumber.Page.extract_text()`.
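
As a sketch, the options below are parameters that `pdfplumber.Page.extract_text()` accepts. That `SmartPDFLoader` takes them as a dict through `text_kwargs` is an assumption based on the parameter name, and `build_loader` is a hypothetical helper introduced only for this example.

```python
# Keyword arguments understood by pdfplumber's Page.extract_text().
TEXT_KWARGS = {
    "x_tolerance": 2,  # a space is added when the horizontal gap between characters exceeds this
    "y_tolerance": 3,  # a newline is added when the vertical gap between characters exceeds this
    "layout": False,   # True asks pdfplumber to try to mimic the page's visual layout
}


def build_loader(path):
    """Create a loader that forwards TEXT_KWARGS to text extraction."""
    # Imported here so the options above can be reused without the package installed.
    from smartpdfplumber.loader import SmartPDFLoader

    # Assumption: text_kwargs is accepted as a dict and passed through unchanged.
    return SmartPDFLoader(path, text_kwargs=TEXT_KWARGS)
```

Calling `build_loader("path/to/file.pdf").load()` would then extract text with these tolerances, if the assumption about `text_kwargs` holds.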

### `dedupe`

Set `dedupe=True` to call `page.dedupe_chars()` before extracting text. This can help with PDFs that repeat characters in the output.
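
To show what character deduplication does, here is a simplified sketch of the idea in plain Python. It is not pdfplumber's implementation, which may compare additional attributes such as font name and size. `dedupe_chars_sketch` is a hypothetical function, and the dictionaries imitate the `text`, `x0`, and `top` fields of pdfplumber character objects.

```python
def dedupe_chars_sketch(chars, tolerance=1):
    """Drop characters that repeat an already-kept character at nearly the same position."""
    kept = []
    for char in chars:
        is_duplicate = any(
            char["text"] == other["text"]
            and abs(char["x0"] - other["x0"]) <= tolerance
            and abs(char["top"] - other["top"]) <= tolerance
            for other in kept
        )
        if not is_duplicate:
            kept.append(char)
    return kept


# "H" is drawn twice at almost the same spot, as can happen with faux-bold text.
chars = [
    {"text": "H", "x0": 10.0, "top": 20.0},
    {"text": "H", "x0": 10.3, "top": 20.0},
    {"text": "i", "x0": 18.0, "top": 20.0},
]
text = "".join(char["text"] for char in dedupe_chars_sketch(chars))
```

Without deduplication the joined text would read `HHi`; with it, the overlapping copy is dropped.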

### `describe_image`

Set `describe_image=True` to include image descriptions inline in the page text.

When this is enabled, you must also provide `inference`.

### `inference`

Supported values:

- `gemini`
- `hf_transformers`
- `groq_ai`
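
The relationship between `describe_image` and `inference` can be sketched as a small validation step. This is an illustration, not the package's code: `check_image_options` is a hypothetical helper, and rejecting unsupported values with a `ValueError` and ignoring `inference` when image descriptions are off are both assumptions.

```python
SUPPORTED_INFERENCE = ("gemini", "hf_transformers", "groq_ai")


def check_image_options(describe_image=False, inference=None):
    """Return the backend to use, or None when image descriptions are disabled."""
    if not describe_image:
        # Assumption: the backend is irrelevant when no images are described.
        return None
    if inference not in SUPPORTED_INFERENCE:
        raise ValueError(
            f"inference must be one of {SUPPORTED_INFERENCE}, got {inference!r}"
        )
    return inference


backend = check_image_options(describe_image=True, inference="gemini")
```

Running a check like this before constructing the loader surfaces a misspelled backend name early.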

## Image Descriptions

```python
from smartpdfplumber.loader import SmartPDFLoader

loader = SmartPDFLoader(
    "path/to/file.pdf",
    dedupe=True,
    describe_image=True,
    inference="hf_transformers",
    model="Qwen/Qwen3.5-0.8B",  # optional; defaults to "Qwen/Qwen3.5-0.8B"
)
documents = loader.load()
```

## Notes

- If you enable image descriptions without passing `inference`, the parser raises a `ValueError`.
- If you use `inference="hf_transformers"`, the model is loaded lazily and cached for reuse.
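
Lazy loading with caching generally means that nothing is loaded until the first request, and later requests reuse the same object. The sketch below shows that pattern with `functools.lru_cache`. It is an illustration under assumptions, not the package's implementation: `get_model` is a hypothetical function, and the dictionary stands in for a real model object.

```python
from functools import lru_cache

load_calls = []  # records each real load so the caching effect is visible


@lru_cache(maxsize=None)
def get_model(model_name):
    """Load a model on first use and return the cached object afterwards."""
    # Stand-in for an expensive step such as downloading and initialising weights.
    load_calls.append(model_name)
    return {"name": model_name}


first = get_model("Qwen/Qwen3.5-0.8B")   # triggers the load
second = get_model("Qwen/Qwen3.5-0.8B")  # served from the cache
```

Both calls return the same object, and the expensive step runs only once per model name.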

## License

MIT
