Metadata-Version: 2.4
Name: libkit
Version: 0.2.1
Summary: Toolkit for the retrieval half of RAG: ingest, chunk, embed, store, and hybrid-search a document corpus for LLM skills and MCP services
Project-URL: Homepage, https://github.com/emerose/libkit
Project-URL: Repository, https://github.com/emerose/libkit
Project-URL: Issues, https://github.com/emerose/libkit/issues
Project-URL: Changelog, https://github.com/emerose/libkit/blob/main/CHANGELOG.md
Author-email: Sam Quigley <quigley@emerose.com>
License-Expression: MIT
License-File: LICENSE
Keywords: duckdb,embeddings,hybrid-search,mcp,rag,rerank,retrieval,vector-search
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Indexing
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: aiofiles>=24.1
Requires-Dist: chonkie>=1.0
Requires-Dist: diskcache>=5.6
Requires-Dist: duckdb>=1.1
Requires-Dist: openai>=1.40
Requires-Dist: opentelemetry-api>=1.27
Requires-Dist: platformdirs>=4.0
Requires-Dist: pyarrow>=24.0.0
Requires-Dist: tenacity>=9.0
Requires-Dist: tiktoken>=0.13.0
Provides-Extra: bench
Requires-Dist: datasets>=3.0; extra == 'bench'
Requires-Dist: langchain-community<0.4; extra == 'bench'
Requires-Dist: opentelemetry-sdk>=1.27; extra == 'bench'
Requires-Dist: psutil>=6.0; extra == 'bench'
Requires-Dist: ragas<0.3,>=0.2; extra == 'bench'
Requires-Dist: rapidfuzz>=3.0; extra == 'bench'
Provides-Extra: cohere
Requires-Dist: cohere>=5.0; extra == 'cohere'
Provides-Extra: dev
Requires-Dist: opentelemetry-sdk>=1.27; extra == 'dev'
Requires-Dist: pyright>=1.1; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: fancychunk-cuda
Requires-Dist: nvidia-cudnn-cu12>=9.0; (sys_platform == 'linux' and platform_machine == 'x86_64') and extra == 'fancychunk-cuda'
Requires-Dist: onnxruntime-gpu>=1.20; (sys_platform == 'linux' and platform_machine == 'x86_64') and extra == 'fancychunk-cuda'
Provides-Extra: fancychunk-mlx
Requires-Dist: fancychunk[mlx]>=0.8; (sys_platform == 'darwin' and platform_machine == 'arm64') and extra == 'fancychunk-mlx'
Provides-Extra: fancychunk-torch
Requires-Dist: fancychunk[torch]>=0.8; extra == 'fancychunk-torch'
Provides-Extra: local-rerank
Requires-Dist: accelerate>=1.0; extra == 'local-rerank'
Requires-Dist: sentence-transformers>=3.0; extra == 'local-rerank'
Provides-Extra: mcp
Requires-Dist: mcp>=1.27.1; extra == 'mcp'
Provides-Extra: pdf
Requires-Dist: pdfmux>=0.1; extra == 'pdf'
Provides-Extra: zeroentropy
Requires-Dist: httpx>=0.27; extra == 'zeroentropy'
Description-Content-Type: text/markdown

# libkit

[![CI](https://github.com/emerose/libkit/actions/workflows/ci.yml/badge.svg)](https://github.com/emerose/libkit/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/libkit.svg)](https://pypi.org/project/libkit/)
[![Python](https://img.shields.io/pypi/pyversions/libkit.svg)](https://pypi.org/project/libkit/)
[![License: MIT](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

**libkit** is a toolkit for the *retrieval* half of RAG. It ingests documents
(PDF, Markdown, Office), chunks and embeds them, stores everything in a single
[DuckDB](https://duckdb.org) file, and answers queries with **hybrid search**
(vector + full-text, fused with RRF) plus optional reranking and attribute
weighting.

There's no generation here — libkit gives you the building blocks to stand up a
knowledge base for an LLM **skill** or an **MCP** service, with sensible defaults
and an "it just works" entry point.

```python
from libkit import Library

# Top-level `await` needs an async context, e.g. the `python -m asyncio` REPL.
lib = await Library.open("corpus.duckdb")        # smart defaults
await lib.ingest("paper.pdf")                      # → chunk → embed → store
hits = await lib.query("how does cache eviction work?", limit=5)
for h in hits:
    print(h.score, h.chunk.text[:80])
```

## Why libkit

- **Async-first, batteries-included.** `Library.open()` wires up a recommended
  embedder, the standard loader map, persistent caching, and adaptive request
  coalescing — every piece overridable.
- **Hybrid retrieval.** Dense vector search and DuckDB full-text BM25 run in
  parallel and fuse with Reciprocal Rank Fusion; an optional cross-encoder
  reranker and per-query attribute weighting refine the ranking.
- **One file, no services.** Documents, chunks, vectors, and the FTS index all
  live in a single DuckDB database. No external vector DB to run.
- **Generic metadata.** Four auto-filled top-level fields (`source_url`,
  `content_type`, `title`, `date`) plus a free-form `metadata` JSON column;
  filters and weights work over both.
- **Pluggable backends.** Loaders, embedders, and rerankers are injected as
  protocol-conforming instances — bring your own, or use the bundled adapters
  (OpenAI, DeepInfra, vLLM, local MLX/torch, Cohere, ZeroEntropy, Datalab,
  pdfmux, LibreOffice).
- **Strictly typed.** Ships `py.typed`; `pyright`-checked.

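As a rough illustration of the "protocol-conforming instances" idea, the sketch below defines a structural `Protocol` and a toy class that satisfies it without inheriting from anything. The `Embedder` protocol, its `embed` method, and `ToyEmbedder` are hypothetical names chosen for this example; libkit's real protocol definitions may use different names and signatures.

```python
import asyncio
from typing import Protocol, runtime_checkable


@runtime_checkable
class Embedder(Protocol):
    """Hypothetical protocol: anything with a matching async `embed` method."""

    async def embed(self, texts: list[str]) -> list[list[float]]: ...


class ToyEmbedder:
    """Stand-in backend; it conforms structurally, with no base class."""

    async def embed(self, texts: list[str]) -> list[list[float]]:
        # Two toy features per text: character count and space count.
        return [[float(len(t)), float(t.count(" "))] for t in texts]


backend = ToyEmbedder()
# runtime_checkable only checks that the method exists, not its signature.
conforms = isinstance(backend, Embedder)
vectors = asyncio.run(backend.embed(["hello world", "hi"]))
print(conforms, vectors)
```

Structural typing means a custom backend can live in your own code base and still type-check against the library's interface.
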
## Install

```bash
pip install libkit            # or: uv add libkit
```

libkit itself is pure Python with a small set of core dependencies (some of
them, such as DuckDB and PyArrow, ship compiled wheels). Heavier or
service-specific backends are opt-in extras:

| Extra | Pulls in | For |
| --- | --- | --- |
| `pdf` | `pdfmux` | Local PDF extraction |
| `cohere` | `cohere` | Cohere reranker |
| `zeroentropy` | `httpx` | ZeroEntropy hosted reranker |
| `local-rerank` | `sentence-transformers`, `accelerate` | In-process cross-encoder rerank |
| `mcp` | `mcp` | Serve a `Library` over MCP |
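| `fancychunk-torch` | `fancychunk[torch]` | Local embedding/chunking (PyTorch) |
| `fancychunk-mlx` | `fancychunk[mlx]` | Local embedding/chunking on Apple silicon (macOS arm64 only) |
| `fancychunk-cuda` | `onnxruntime-gpu`, `nvidia-cudnn-cu12` | GPU (CUDA) support for local embedding/chunking (Linux x86_64 only) |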

```bash
pip install "libkit[pdf,cohere,mcp]"
```

> Some embedders/loaders call hosted APIs (OpenAI, DeepInfra, Cohere,
> ZeroEntropy, Datalab) and read their keys from the environment
> (`OPENAI_API_KEY`, `DEEPINFRA_API_KEY`, `DATALAB_API_KEY`, …).

## Quickstart

```python
import asyncio
from libkit import Library, QueryWeights


async def main():
    # Smart defaults: remote bulk-ingest embeddings, local interactive query
    # embeddings, caching, and coalescing. db_path is the only requirement.
    lib = await Library.open(
        "corpus.duckdb",
        embedding="auto",        # "auto" | "local" | "remote"
        model="qwen3_600m",
    )

    # Ingest. Idempotent on content hash; the loader is chosen by extension.
    # The four top-level fields are auto-filled; override any via metadata=,
    # and add arbitrary keys (stored in the metadata JSON).
    await lib.ingest("paper.pdf", metadata={"doc_type": "paper", "author": "Smith"})
    await lib.ingest("notes.md")

    # Batch ingest yields a result per document as it finishes.
    async for r in lib.ingest_batch(["a.pdf", "b.pdf", "c.md"]):
        if r.error:
            print("failed:", r.path, r.error)

    # Hybrid query with optional recency/attribute weighting and filters.
    results = await lib.query(
        "how does cache eviction work?",
        limit=8,
        weights=QueryWeights(recency=0.2, attributes={"doc_type": {"paper": 1.5}}),
        filters={"author": "Smith"},
    )
    for r in results:
        print(f"{r.score:.3f}  {r.chunk.source_url}\n      {r.chunk.text[:100]}")

    await lib.close()


asyncio.run(main())
```
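
The "idempotent on content hash" behaviour mentioned in the Quickstart comments can be pictured with a toy in-memory store. This is a minimal sketch under assumptions: SHA-256 over the raw bytes and the `TinyStore` class are illustrative choices, and the README does not say which hash libkit uses or exactly what it hashes.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex digest of the document bytes (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).hexdigest()


class TinyStore:
    """Hypothetical in-memory stand-in for a document table keyed by hash."""

    def __init__(self) -> None:
        self._docs: dict[str, bytes] = {}

    def ingest(self, data: bytes) -> bool:
        """Store the document; return False if identical content is already present."""
        key = content_hash(data)
        if key in self._docs:
            return False  # same bytes seen before: nothing to re-chunk or re-embed
        self._docs[key] = data
        return True

    def __len__(self) -> int:
        return len(self._docs)


store = TinyStore()
first = store.ingest(b"# Notes\nCache eviction uses LRU.")
second = store.ingest(b"# Notes\nCache eviction uses LRU.")
print(first, second, len(store))
```

Keying on content rather than path means re-running an ingest job over the same files does no duplicate work, while an edited file is treated as new content.
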

### Full control

`Library.open()` is a convenience over an explicit, frozen `LibraryConfig`:

```python
from libkit import Library, LibraryConfig
from libkit.embedders import default_embedder
from libkit.loaders import MarkdownLoader

lib = Library(
    LibraryConfig(
        db_path="corpus.duckdb",
        embedder=default_embedder(embedding="remote"),
        loaders={".md": MarkdownLoader()},
        chunk_size_tokens=512,
        chunk_overlap_tokens=64,
    )
)
```
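
To make `chunk_size_tokens` and `chunk_overlap_tokens` concrete, here is a sketch of a fixed-size sliding window over a token list. It assumes the simplest possible strategy; libkit delegates chunking to its own pipeline, which may respect sentence or section boundaries and so produce different splits. `chunk_tokens` is a hypothetical helper.

```python
def chunk_tokens(tokens: list[int], size: int = 512, overlap: int = 64) -> list[list[int]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks: list[list[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


tokens = list(range(1000))  # stand-in token IDs
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With 1000 tokens and these settings each window advances by 448 tokens, so the last 64 tokens of one chunk reappear at the start of the next.
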

### Serve over MCP

```python
from libkit import Library
from libkit.mcp import serve_mcp        # requires the `mcp` extra

# Top-level `await` needs an async context, e.g. the `python -m asyncio` REPL.
lib = await Library.open("corpus.duckdb")
await serve_mcp(lib)                     # exposes ingest/query/get/list/delete tools
```

## How it works

```
ingest → load (PDF/MD/Office → Markdown) → chunk → embed → DuckDB
query  → embed query → [vector top-k ‖ FTS top-k] → RRF fuse
         → optional rerank → attribute weighting → results
```
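
The "RRF fuse" step in the pipeline above can be sketched in a few lines. Reciprocal Rank Fusion scores each item as the sum of `1 / (k + rank)` over the lists it appears in. The constant `k = 60` is the conventional default from the RRF literature, not a value confirmed for libkit, and `rrf_fuse` and the chunk IDs are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = ["c3", "c1", "c7"]  # hypothetical top-k from vector search
fts_hits = ["c1", "c9", "c3"]     # hypothetical top-k from full-text search
fused = rrf_fuse([vector_hits, fts_hits])
print([doc_id for doc_id, _ in fused])
```

Because only ranks are used, the two retrievers' raw scores (cosine similarity and BM25) never need to be put on a common scale.
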

See [`docs/DESIGN.md`](docs/DESIGN.md) for the full design — schema, the
adaptive-concurrency pipeline, caching, and the correctness invariants.

## Status

libkit is at **0.2**: the API is usable and tested, but may still change before
1.0. Issues and PRs are welcome; see [CONTRIBUTING.md](CONTRIBUTING.md).

## License

[MIT](LICENSE) © Sam Quigley
