Metadata-Version: 2.4
Name: catchfly
Version: 0.1.0
Summary: Many strategies, one pipeline — from unstructured text to structured data.
Project-URL: Homepage, https://github.com/silene-systems/catchfly
Project-URL: Documentation, https://catchfly.dev
Project-URL: Repository, https://github.com/silene-systems/catchfly
Project-URL: Issues, https://github.com/silene-systems/catchfly/issues
Project-URL: Changelog, https://github.com/silene-systems/catchfly/blob/main/CHANGELOG.md
Author-email: Adrian Michalski <adrian@silene.systems>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: clustering,data-extraction,llm,nlp,normalization,pydantic,schema-discovery,structured-extraction
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: httpx>=0.25
Requires-Dist: pydantic<3.0,>=2.0
Provides-Extra: all
Requires-Dist: instructor>=1.5; extra == 'all'
Requires-Dist: numpy>=1.24; extra == 'all'
Requires-Dist: openai>=1.50; extra == 'all'
Requires-Dist: pandas>=2.0; extra == 'all'
Requires-Dist: pronto>=2.5; extra == 'all'
Requires-Dist: pyarrow>=15.0; extra == 'all'
Requires-Dist: pymupdf>=1.24; extra == 'all'
Requires-Dist: scikit-learn>=1.3; extra == 'all'
Requires-Dist: sentence-transformers>=3.0; extra == 'all'
Requires-Dist: umap-learn>=0.5; extra == 'all'
Provides-Extra: clustering
Requires-Dist: numpy>=1.24; extra == 'clustering'
Requires-Dist: scikit-learn>=1.3; extra == 'clustering'
Requires-Dist: umap-learn>=0.5; extra == 'clustering'
Provides-Extra: docling
Requires-Dist: docling>=2.0; extra == 'docling'
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=3.0; extra == 'embeddings'
Provides-Extra: export
Requires-Dist: pandas>=2.0; extra == 'export'
Requires-Dist: pyarrow>=15.0; extra == 'export'
Provides-Extra: instructor
Requires-Dist: instructor>=1.5; extra == 'instructor'
Provides-Extra: medical
Requires-Dist: pronto>=2.5; extra == 'medical'
Provides-Extra: openai
Requires-Dist: openai>=1.50; extra == 'openai'
Provides-Extra: pdf
Requires-Dist: pymupdf>=1.24; extra == 'pdf'
Description-Content-Type: text/markdown

# catchfly

*Many strategies, one pipeline — from unstructured text to structured data.*

**Catchfly** automates the pipeline of **schema discovery → structured extraction → normalization** from unstructured text at scale. Interchangeable strategies at each stage let you go from raw documents to clean, normalized, structured data with minimal effort.

## Quick Start

```bash
pip install catchfly[openai,clustering]
```

```python
from catchfly import Pipeline
from catchfly.demo import load_samples

# Load built-in demo data (10 product reviews)
docs = load_samples("product_reviews")

# One line to create a full pipeline
pipeline = Pipeline.quick(model="gpt-4o-mini")

# Discover schema → extract records → normalize values
results = pipeline.run(
    documents=docs,
    domain_hint="Electronics product reviews",
    normalize_fields=["pros"],
)

print(results.schema)            # Discovered Pydantic model
print(results.to_dataframe())    # Extracted + normalized data
print(results.report)            # Cost & usage stats
```

## Local Models (Ollama)

```python
pipeline = Pipeline.quick(
    model="qwen3.5",
    base_url="http://localhost:11434/v1",
)
```
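
The same `base_url` override works for any OpenAI-compatible server (vLLM, for example). A minimal sketch; the model name and port below are placeholders for whatever your server actually exposes:

```python
# Hypothetical vLLM server; substitute your own served model name and address.
pipeline = Pipeline.quick(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model identifier
    base_url="http://localhost:8000/v1",       # vLLM's default OpenAI-compatible route
)
```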

## Modular Usage

Each stage works independently:

```python
# Discovery only
from catchfly.discovery.single_pass import SinglePassDiscovery
discovery = SinglePassDiscovery(model="gpt-4o-mini")
schema = discovery.discover(documents=docs, domain_hint="...")

# Extraction only (bring your own schema)
from catchfly.extraction.llm_direct import LLMDirectExtraction
extractor = LLMDirectExtraction(model="gpt-4o-mini")
records = extractor.extract(schema=MyModel, documents=docs)

# Normalization only (bring your own data)
from catchfly.normalization.embedding_cluster import EmbeddingClustering
normalizer = EmbeddingClustering(embedding_model="text-embedding-3-small")
mapping = normalizer.normalize(values=["NYC", "New York", "NY"], context_field="city")
```

## Async Support

All strategies provide async methods — async-first, sync-friendly:

```python
# Async
results = await pipeline.arun(documents=docs, domain_hint="...")

# Sync (works in notebooks too)
results = pipeline.run(documents=docs, domain_hint="...")
```
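
Because `arun` is a coroutine, batches can be processed concurrently with standard `asyncio` tooling. A minimal sketch, assuming `arun` is safe to call concurrently on the same pipeline (not stated here; if in doubt, create one pipeline per task):

```python
import asyncio

async def process_batches(pipeline, batches, domain_hint):
    # Fan out one arun() call per document batch and await them together.
    tasks = [
        pipeline.arun(documents=batch, domain_hint=domain_hint)
        for batch in batches
    ]
    return await asyncio.gather(*tasks)

# results_per_batch = asyncio.run(
#     process_batches(pipeline, batches, "Electronics product reviews")
# )
```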

## Installation

```bash
pip install catchfly                        # Core only (~5 MB)
pip install catchfly[openai]                # + OpenAI SDK
pip install catchfly[clustering]            # + scikit-learn, numpy, umap
pip install catchfly[export]                # + pandas, pyarrow
pip install catchfly[all]                   # Everything
```

Or with uv:

```bash
uv add catchfly[openai,clustering]
```

## Features

- **Schema Discovery** — LLM proposes a Pydantic schema from sample documents
- **Structured Extraction** — LLM extracts data per-document with retries and validation
- **Normalization** — Cluster and canonicalize messy values (embedding + HDBSCAN)
- **Async-first** — All operations support async with sync wrappers
- **LLM-agnostic** — Works with any OpenAI-compatible endpoint (OpenAI, Ollama, vLLM)
- **Lightweight core** — Only pydantic + httpx; heavy deps are optional
- **Production-ready** — Error handling, cost tracking, provenance, export to DataFrame/CSV/Parquet (see the sketch below)
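
As a sketch of cost tracking and export, continuing from the Quick Start `results` object: `to_dataframe()` and `report` are shown above, while the CSV/Parquet writing here goes through pandas (the `export` extra); the file names are placeholders.

```python
# Requires the `export` extra (pandas + pyarrow).
df = results.to_dataframe()            # extracted + normalized records as a DataFrame
df.to_csv("reviews.csv", index=False)  # plain CSV via pandas
df.to_parquet("reviews.parquet")       # Parquet via pyarrow

print(results.report)                  # cost & usage stats for the run
```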

## Requirements

- Python 3.10+
- An OpenAI-compatible LLM endpoint

## License

Apache 2.0 — see [LICENSE](LICENSE).

## Links

- [Documentation](https://catchfly.dev)
- [Product Requirements](catchfly_prd.md)
- [Implementation Plan](IMPLEMENTATION_PLAN.md)
