Metadata-Version: 2.4
Name: smartpdfplumber
Version: 0.1.2
Summary: A wrapper around LangChain’s PDFPlumber integration with added support for image-aware data extraction.
Requires-Python: >=3.13
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: dotenv>=0.9.9
Requires-Dist: langchain-community>=0.4.1
Requires-Dist: langchain-core>=1.3.2
Requires-Dist: pdfplumber>=0.11.9
Dynamic: license-file

# Smart PDF Plumber

A lightweight wrapper over LangChain’s PDFPlumber integration that extends PDF parsing with image understanding: it extracts not just text but also descriptions of embedded images.

It is designed for two common cases:

- Extract plain text from PDFs page by page.
- Optionally describe embedded images and include those descriptions in the page text.

## Features

- Page-level PDF parsing with `pdfplumber`.
- Optional character deduplication for PDFs with repeated text.
- Optional image description support using either Google Gemini or Hugging Face vision-language models.
- LangChain-friendly output: a list of `Document` objects with metadata such as source path, page number, and total pages.

## Installation

Install from PyPI:

```bash
pip install smartpdfplumber
```

The project currently targets Python 3.13 or newer.

## Quick Start

```python
from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader

loader = SmartPDFLoader("path/to/file.pdf")
documents = loader.load()

for document in documents:
	print(document.metadata)
	print(document.page_content)
```

## Options

`SmartPDFLoader` forwards keyword arguments to `PDFPlumberParser`:

```python
SmartPDFLoader(
	file_path="path/to/file.pdf",
	text_kwargs={"x_tolerance": 2},
	dedupe=True,
	describe_image=False,
	model=None,
)
```

### `text_kwargs`

Extra keyword arguments passed to `pdfplumber.Page.extract_text()`.

### `dedupe`

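Set `dedupe=True` to call `page.dedupe_chars()` before extracting text. This helps with PDFs that draw the same character more than once at nearly the same position (a common way to simulate bold), which otherwise shows up as doubled letters in the extracted text.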
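
As a rough illustration of what `text_kwargs` and `dedupe` correspond to in plain `pdfplumber`, the sketch below calls `dedupe_chars()` and forwards the keyword arguments to `extract_text()`. The helper name `extract_page_texts` is hypothetical, and this is not the package's actual implementation.

```python
def extract_page_texts(path, text_kwargs=None, dedupe=False):
    """Return one extracted string per page (hypothetical helper)."""
    # Imported here so that defining the helper does not require pdfplumber.
    import pdfplumber

    text_kwargs = text_kwargs or {}
    texts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            if dedupe:
                # dedupe_chars() returns a copy of the page without duplicated characters.
                page = page.dedupe_chars()
            # extract_text() accepts layout options such as x_tolerance.
            texts.append(page.extract_text(**text_kwargs) or "")
    return texts
```

Calling `extract_page_texts("path/to/file.pdf", text_kwargs={"x_tolerance": 2}, dedupe=True)` would mirror the options shown above.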

### `describe_image`

Set `describe_image=True` to include image descriptions inline in the page text.

When this is enabled, you must also provide `model`.

### `model`

Selects the backend used to describe images. Pass one of these values as a string:

- `gemini`
- `huggingface`
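
The relationship between `describe_image` and `model` can be sketched as a small validation step. The function name `check_image_options` is hypothetical, and rejecting unknown model names is an assumption; the README only states that a missing `model` raises a `ValueError`.

```python
SUPPORTED_MODELS = ("gemini", "huggingface")


def check_image_options(describe_image, model):
    """Validate the option combination (hypothetical helper, not the package's code)."""
    if not describe_image:
        # Without image descriptions, the model setting is irrelevant.
        return
    if model is None:
        raise ValueError("describe_image=True requires model to be set")
    if model not in SUPPORTED_MODELS:
        # Assumption: unsupported names are rejected the same way.
        raise ValueError(f"unsupported model {model!r}; expected one of {SUPPORTED_MODELS}")


check_image_options(describe_image=True, model="gemini")  # passes silently
```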

## Image Descriptions

### Google Gemini

```python
from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader

loader = SmartPDFLoader(
	"path/to/file.pdf",
	describe_image=True,
	model="gemini",
)
documents = loader.load()
```

This path uses `google-genai`, which is not among the package's required dependencies, so install it separately. Make sure your Google authentication is configured before running it.
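
A preflight check can catch missing credentials before any page is parsed. The sketch below assumes API-key authentication through the `GEMINI_API_KEY` or `GOOGLE_API_KEY` environment variables, which the `google-genai` client can read; whether this package relies on that mechanism is an assumption, and `has_gemini_key` is a hypothetical helper.

```python
import os


def has_gemini_key(env=None):
    """Return True if an API key variable is set and non-empty (hypothetical helper)."""
    env = os.environ if env is None else env
    # Assumption: authentication uses one of these environment variables.
    return any(env.get(name) for name in ("GEMINI_API_KEY", "GOOGLE_API_KEY"))


if not has_gemini_key():
    print("No Gemini API key found in the environment.")
```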

### Hugging Face

```python
from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader

loader = SmartPDFLoader(
	"path/to/file.pdf",
	dedupe=True,
	describe_image=True,
	model="huggingface",
)
documents = loader.load()
```

This path uses `transformers`, `torch`, and `torchvision` to run a vision-language model locally. These packages are not among the package's required dependencies, so install them separately.
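
Before enabling this backend, you can check whether the required packages are importable. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper and not part of this package.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found (hypothetical helper)."""
    # find_spec() locates a package without importing it.
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(["transformers", "torch", "torchvision"])
if missing:
    print("Install before using model='huggingface':", ", ".join(missing))
```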

## Output Format

Each page becomes a LangChain `Document` with metadata similar to:

```python
{
	"source": "path/to/file.pdf",
	"file_path": "path/to/file.pdf",
	"page": 0,
	"total_pages": 12,
}
```

The `page` value is zero-based.
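
Because `page` is zero-based, add one when showing page numbers to readers. A small sketch, where `page_label` is a hypothetical helper operating on the metadata dictionary shown above:

```python
def page_label(metadata):
    """Format zero-based page metadata as a one-based label (hypothetical helper)."""
    return f"page {metadata['page'] + 1} of {metadata['total_pages']}"


metadata = {
    "source": "path/to/file.pdf",
    "file_path": "path/to/file.pdf",
    "page": 0,
    "total_pages": 12,
}
print(page_label(metadata))  # page 1 of 12
```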

## Example

```python
from SmartPDFPlumber.smart_pdfplumber import SmartPDFLoader

loader = SmartPDFLoader(
	"assets/sample.pdf",
	dedupe=True,
	describe_image=True,
	model="huggingface",
)

documents = loader.load()

for document in documents:
	print(document.page_content[:200])
```

## Notes

- If you enable image descriptions without passing `model`, the parser raises a `ValueError`.
- If you use `model="gemini"`, `google-genai` must be installed and your Google credentials must be configured.
- If you use `model="huggingface"`, the model is loaded lazily and cached for reuse.
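
The lazy-load-and-cache behavior in the last note can be illustrated with `functools.lru_cache`. This is a generic sketch of the pattern, not the package's implementation; `get_model` is a hypothetical name and the returned object is a stand-in for a real model.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_model(name):
    """Create the model on first use and reuse it afterwards (hypothetical helper)."""
    # Stand-in for an expensive load; a real backend would load weights here.
    print(f"loading {name}")
    return {"name": name}


first = get_model("huggingface")   # runs the loader
second = get_model("huggingface")  # served from the cache, no second load
```

With this pattern, nothing is loaded until the first page that contains an image is processed, and later pages reuse the same object.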

## License

MIT
