Metadata-Version: 2.4
Name: billfox
Version: 0.2.1
Summary: Composable document data extraction: load, preprocess, OCR, LLM parse, store with vector search.
Project-URL: Homepage, https://github.com/billfox-ai/billfox
Project-URL: Repository, https://github.com/billfox-ai/billfox
Project-URL: Issues, https://github.com/billfox-ai/billfox/issues
Author: Tuong Le
License-Expression: MIT
License-File: LICENSE
Keywords: document,extraction,ocr,pipeline,receipt
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Image Processing
Requires-Python: >=3.11
Requires-Dist: docling>=2.0
Requires-Dist: httpx>=0.24
Requires-Dist: numpy>=1.24
Requires-Dist: onnxruntime>=1.16
Requires-Dist: pillow>=10.0
Requires-Dist: pydantic>=2.0
Requires-Dist: python-dotenv>=1.0
Requires-Dist: python-ulid>=3.0
Requires-Dist: rich>=13.0
Requires-Dist: tomli-w>=1.0
Requires-Dist: typer>=0.13
Provides-Extra: all
Requires-Dist: aiosqlite>=0.19; extra == 'all'
Requires-Dist: anthropic>=0.40; extra == 'all'
Requires-Dist: google-api-python-client>=2.80; extra == 'all'
Requires-Dist: google-auth-httplib2>=0.2; extra == 'all'
Requires-Dist: google-auth-oauthlib>=1.0; extra == 'all'
Requires-Dist: mistralai>=1.0; extra == 'all'
Requires-Dist: openai>=1.0; extra == 'all'
Requires-Dist: pydantic-ai>=0.1; extra == 'all'
Requires-Dist: sqlalchemy[asyncio]>=2.0; extra == 'all'
Requires-Dist: sqlite-vec>=0.1; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.40; extra == 'anthropic'
Provides-Extra: dev
Requires-Dist: aiosqlite>=0.19; extra == 'dev'
Requires-Dist: anthropic>=0.40; extra == 'dev'
Requires-Dist: coverage>=7.0; extra == 'dev'
Requires-Dist: google-api-python-client>=2.80; extra == 'dev'
Requires-Dist: google-auth-httplib2>=0.2; extra == 'dev'
Requires-Dist: google-auth-oauthlib>=1.0; extra == 'dev'
Requires-Dist: mistralai>=1.0; extra == 'dev'
Requires-Dist: mypy>=1.5; extra == 'dev'
Requires-Dist: openai>=1.0; extra == 'dev'
Requires-Dist: pydantic-ai>=0.1; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.1; extra == 'dev'
Requires-Dist: sqlalchemy[asyncio]>=2.0; extra == 'dev'
Requires-Dist: sqlite-vec>=0.1; extra == 'dev'
Provides-Extra: google-drive
Requires-Dist: google-api-python-client>=2.80; extra == 'google-drive'
Requires-Dist: google-auth-httplib2>=0.2; extra == 'google-drive'
Requires-Dist: google-auth-oauthlib>=1.0; extra == 'google-drive'
Provides-Extra: llm
Requires-Dist: pydantic-ai>=0.1; extra == 'llm'
Provides-Extra: mistral
Requires-Dist: mistralai>=1.0; extra == 'mistral'
Provides-Extra: openai
Requires-Dist: openai>=1.0; extra == 'openai'
Provides-Extra: store
Requires-Dist: aiosqlite>=0.19; extra == 'store'
Requires-Dist: sqlalchemy[asyncio]>=2.0; extra == 'store'
Requires-Dist: sqlite-vec>=0.1; extra == 'store'
Description-Content-Type: text/markdown

# billfox

[![PyPI version](https://img.shields.io/pypi/v/billfox.svg)](https://pypi.org/project/billfox/)
[![Python 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![CI](https://github.com/billfox-ai/billfox/actions/workflows/ci.yml/badge.svg)](https://github.com/billfox-ai/billfox/actions)

**Composable document data extraction**: load, preprocess, OCR, LLM parse, store with vector search.

billfox is a Python library that lets you build document processing pipelines from independent, swappable stages. Each stage implements a simple protocol, so you can mix built-in modules with your own.

## Architecture

```
                          billfox pipeline
 ┌─────────┐  ┌──────────────┐  ┌───────────┐  ┌────────┐  ┌───────┐
 │ Source  │→ │ Preprocessor │→ │ Extractor │→ │ Parser │→ │ Store │
 │         │  │  (optional)  │  │   (OCR)   │  │ (LLM)  │  │       │
 └─────────┘  └──────────────┘  └───────────┘  └────────┘  └───────┘
  LocalFile    Resize, YOLO,     MistralOCR     LLMParser   SQLite +
               Chain                            (any LLM)   hybrid
                                                            search
```

**Protocols at every boundary** -- implement `DocumentSource`, `Preprocessor`, `Extractor`, `Parser[T]`, `Embedder`, or `DocumentStore[T]` to plug in your own components.
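
As a rough illustration of what "implementing a protocol" means here, the sketch below defines a custom extractor using structural typing. The `Document`, `ExtractionResult`, and `Extractor` definitions are hypothetical stand-ins written for this example, not billfox's actual types; the only detail taken from this README is that an extractor exposes an async `extract(doc)` whose result has a `.markdown` attribute. Check the real signatures in the library before relying on them.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class Document:
    """Hypothetical stand-in for a loaded document (not billfox's type)."""
    content: bytes
    mime_type: str


@dataclass
class ExtractionResult:
    """Hypothetical stand-in for a result that exposes `.markdown`."""
    markdown: str


@runtime_checkable
class Extractor(Protocol):
    """Assumed protocol shape: a single async `extract` method."""

    async def extract(self, document: Document) -> ExtractionResult: ...


class PlainTextExtractor:
    """Custom stage that decodes text bytes instead of calling an OCR service.

    It does not inherit from Extractor; having a matching method is enough.
    """

    async def extract(self, document: Document) -> ExtractionResult:
        text = document.content.decode("utf-8", errors="replace")
        return ExtractionResult(markdown=text)


async def demo() -> str:
    extractor: Extractor = PlainTextExtractor()
    doc = Document(content=b"# Invoice\nTotal: 42.00", mime_type="text/plain")
    result = await extractor.extract(doc)
    return result.markdown


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because the match is structural, a component written this way can be swapped in wherever the corresponding stage is expected without subclassing anything from the library.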

## Installation

```bash
# Core install (includes the CLI and image preprocessing dependencies)
pip install billfox

# With Mistral OCR
pip install 'billfox[mistral]'

# With LLM parsing (pydantic-ai)
pip install 'billfox[llm]'

# With SQLite storage and search
pip install 'billfox[store]'

# The CLI dependencies (typer, rich, tomli-w) are part of the core install;
# this release declares no separate 'cli' extra.

# Everything
pip install 'billfox[all]'
```

## Quick Start

### 1. OCR Only -- Extract Markdown from a Document

```python
import asyncio
from billfox.source import LocalFileSource
from billfox.extract import MistralExtractor

async def main():
    source = LocalFileSource()
    extractor = MistralExtractor()  # uses MISTRAL_API_KEY env var

    doc = await source.load("invoice.pdf")
    result = await extractor.extract(doc)
    print(result.markdown)

asyncio.run(main())
```

### 2. Full Pipeline -- OCR + LLM Parse + Store

```python
import asyncio
from pydantic import BaseModel
from billfox import Pipeline
from billfox.source import LocalFileSource
from billfox.extract import MistralExtractor
from billfox.parse import LLMParser
from billfox.preprocess import ResizePreprocessor
from billfox.store import SQLiteDocumentStore

class Invoice(BaseModel):
    vendor_name: str
    total: float
    date: str

async def main():
    pipeline = Pipeline(
        source=LocalFileSource(),
        extractor=MistralExtractor(),
        parser=LLMParser(
            model="openai:gpt-4.1",
            output_type=Invoice,
            system_prompt="Extract invoice fields from this document.",
        ),
        preprocessors=[ResizePreprocessor(max_side=1024)],
        store=SQLiteDocumentStore(db_path="invoices.db", schema=Invoice),
    )

    invoice = await pipeline.run("scan.jpg", document_id="inv-001")
    print(f"{invoice.vendor_name}: ${invoice.total}")

asyncio.run(main())
```

### 3. CLI -- Process from the Terminal

```bash
# Extract markdown via OCR
billfox extract receipt.jpg

# Parse into structured JSON
billfox parse receipt.jpg --schema ./models.py:Receipt --model openai:gpt-4.1

# Search stored documents
billfox search "coffee" --db invoices.db

# Configure API keys
billfox config set api_keys.mistral sk-...
```
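
The `--schema ./models.py:Receipt` argument pairs a file path with a class name. The sketch below shows one way such a `path:ClassName` spec could be resolved using only the standard library. `load_schema` is a hypothetical helper written for this example, and billfox's own loader may work differently. The demo writes a throwaway `models.py` containing a plain class; a real schema would presumably be a pydantic model like `Invoice` in the Quick Start.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_schema(spec: str) -> type:
    """Resolve 'path/to/file.py:ClassName' to the class object (hypothetical helper)."""
    # Split on the last colon so paths that contain colons still work.
    path_str, sep, class_name = spec.rpartition(":")
    if not sep or not path_str or not class_name:
        raise ValueError(f"expected 'path.py:ClassName', got {spec!r}")
    path = Path(path_str)
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot import {path}")
    module = importlib.util.module_from_spec(module_spec)
    module_spec.loader.exec_module(module)  # runs the file's top-level code
    return getattr(module, class_name)


# Demo with a temporary models.py holding a placeholder class.
with tempfile.TemporaryDirectory() as tmp:
    models = Path(tmp) / "models.py"
    models.write_text("class Receipt:\n    fields = ('merchant', 'total')\n")
    schema = load_schema(f"{models}:Receipt")

print(schema.__name__)
```

Note that loading a schema this way executes the file, so only point it at modules you trust.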

## Optional Extras

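| Extra          | Packages                                                                   | Use case                     |
|----------------|----------------------------------------------------------------------------|------------------------------|
| `mistral`      | `mistralai`                                                                | Mistral OCR extraction       |
| `llm`          | `pydantic-ai`                                                              | LLM structured parsing       |
| `openai`       | `openai`                                                                   | OpenAI text embeddings       |
| `anthropic`    | `anthropic`                                                                | Anthropic API client         |
| `google-drive` | `google-api-python-client`, `google-auth-httplib2`, `google-auth-oauthlib` | Google Drive access          |
| `store`        | `sqlalchemy`, `aiosqlite`, `sqlite-vec`                                    | SQLite storage + search      |
| `all`          | All of the extras above                                                    | Everything                   |
| `dev`          | `all` plus `pytest`, `pytest-asyncio`, `coverage`, `mypy`, `ruff`          | Development and testing      |

The CLI (`typer`, `rich`, `tomli-w`) and YOLO document cropping (`onnxruntime`, `numpy`, `Pillow`) dependencies are installed with the core package in this release, so they need no extra.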

## Documentation

Full documentation lives in [docs/](docs/):

- [Getting Started](docs/getting-started.md)
- [Custom Extractor](docs/custom-extractor.md)
- [Custom Preprocessor](docs/custom-preprocessor.md)
- [Storage and Search](docs/storage-and-search.md)

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, running tests, and submitting pull requests.

## License

MIT -- see [LICENSE](LICENSE) for details.
