Metadata-Version: 2.4
Name: hybriddb
Version: 0.4.0
Summary: Embedded, local, open-source hybrid search for AI agents — SQLite + FTS5 + ChromaDB with self-healing journal
Author: Eddy Xu
License: MIT
License-File: LICENSE
Keywords: chromadb,embeddings,fts5,hybrid-search,keyword-search,sqlite,vector-search
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.11
Requires-Dist: chromadb>=0.5.0
Provides-Extra: all
Requires-Dist: duckdb>=1.0.0; extra == 'all'
Requires-Dist: networkx>=3.0; extra == 'all'
Provides-Extra: analytics
Requires-Dist: duckdb>=1.0.0; extra == 'analytics'
Provides-Extra: benchmark
Requires-Dist: numpy>=1.24.0; extra == 'benchmark'
Requires-Dist: pytest-benchmark>=4.0.0; extra == 'benchmark'
Requires-Dist: pytest-timeout>=2.3.0; extra == 'benchmark'
Requires-Dist: sentence-transformers>=3.0.0; extra == 'benchmark'
Provides-Extra: dev
Requires-Dist: duckdb>=1.0.0; extra == 'dev'
Requires-Dist: networkx>=3.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.24.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Provides-Extra: graph
Requires-Dist: networkx>=3.0; extra == 'graph'
Provides-Extra: sentence-transformers
Requires-Dist: sentence-transformers>=3.0.0; extra == 'sentence-transformers'
Description-Content-Type: text/markdown

# HybridDB

> **Purposefully built for AI Agents.** HybridDB gives agents persistent, searchable memory — every conversation turn is indexed and retrievable via keyword, vector, or hybrid search. Used in production by the [Executive Assistant](https://github.com/open-assistants-lab) agent system.

> **Embedded. Local. Open source.** No cloud APIs, no vector DB services, no internet connection required. Runs entirely on-device with SQLite + ChromaDB + your choice of local embedding model. Ships as a single Python package with zero external infrastructure dependencies.

**SQLite + FTS5 + ChromaDB with a self-healing journal.** One Python class that gives you keyword search, vector search, SQL queries, and structured filtering — all kept in sync automatically.

```python
from hybriddb import HybridDB, LONGTEXT, TEXT

db = HybridDB("./my_data")
db.create_table("docs", {"title": TEXT, "body": LONGTEXT})

db.insert("docs", {"title": "Getting Started", "body": "A guide to using HybridDB..."})
db.insert("docs", {"title": "API Reference", "body": "Full API documentation..."})

# Search every text column
db.search("docs", "getting started")

# Search one column
db.search("docs", "body", "how do I begin", mode="hybrid")

# Structured query with parameters
db.query("docs", where="title LIKE ?", params=("%start%",))
```

## Why HybridDB?

Every serious project that needs **both** keyword and semantic search ends up wiring SQLite + FTS5 + ChromaDB together. You handle schema creation, FTS5 triggers, ChromaDB collection management, keeping them in sync, recovering from crashes, rebuilding indexes...

HybridDB does all of that once, and does it right.

| Feature | Status |
|---------|--------|
| SQL CRUD (insert, update, delete, get, query) | ✅ |
| FTS5 keyword search with BM25 scoring | ✅ |
| ChromaDB semantic/vector search with HNSW | ✅ |
| Hybrid search (RRF fusion of keyword + semantic) | ✅ |
| Recency-weighted scoring | ✅ |
| Schema management (create, add/drop/rename columns) | ✅ |
| Self-healing journal (crash recovery) | ✅ |
| Sync + async APIs | ✅ |
| No external API dependencies (works offline) | ✅ |
| Pluggable embedding model (sentence-transformers, OpenAI, custom) | ✅ |

## Documentation

- [API reference](docs/API.md) — stable public methods, sync/async examples, graph and OLAP facades
- [Benchmarks](docs/BENCHMARKS.md) — smoke vs full benchmark commands and expected runtime behavior
- [Release guide](docs/RELEASE.md) — local build, wheel smoke test, TestPyPI/PyPI publishing

## Installation

```bash
pip install hybriddb
```

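HybridDB uses ChromaDB's bundled local MiniLM embedding by default. No API key required.

The package metadata also declares optional extras: `analytics` (DuckDB), `graph` (NetworkX), `sentence-transformers`, and `all` (DuckDB and NetworkX). Install one with, for example, `pip install "hybriddb[graph]"`.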

## Core Concepts

### Column Types

HybridDB maps Python-friendly types to SQLite storage and automatically sets up the right search indexes:

| Type | SQLite | FTS5 | ChromaDB | Use for |
|------|--------|------|----------|---------|
| `TEXT` | TEXT | ✅ | — | Names, titles, short strings |
| `LONGTEXT` | TEXT | ✅ | ✅ | Documents, messages, memory content |
| `INTEGER` | INTEGER | — | — | Counts, ages, IDs |
| `REAL` | REAL | — | — | Prices, scores, confidence values |
| `BOOLEAN` | INTEGER | — | — | Flags, status indicators |
| `JSON` | TEXT | — | — | Tags, metadata, structured data |

**TEXT** columns get automatic FTS5 keyword search.
**LONGTEXT** columns get FTS5 + ChromaDB semantic search.
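
To make the FTS5 column of the table concrete, here is a minimal sketch of a keyword index over two text columns using only the standard-library `sqlite3` module. It is not HybridDB's internal schema: the names `docs_fts`, `docs_ai`, `build_index`, and `keyword_search` are hypothetical, and the sketch assumes your Python's SQLite build includes FTS5.

```python
import sqlite3


def build_index() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
    # External-content FTS5 table: the index points back at rows in `docs`.
    conn.execute(
        "CREATE VIRTUAL TABLE docs_fts USING fts5("
        "title, body, content='docs', content_rowid='id')"
    )
    # Trigger keeps the index in step with inserts (update/delete triggers omitted).
    conn.execute(
        "CREATE TRIGGER docs_ai AFTER INSERT ON docs BEGIN "
        "INSERT INTO docs_fts(rowid, title, body) "
        "VALUES (new.id, new.title, new.body); END"
    )
    return conn


def keyword_search(conn: sqlite3.Connection, query: str) -> list[int]:
    # bm25() yields lower values for better matches, so ascending order is best-first.
    rows = conn.execute(
        "SELECT rowid FROM docs_fts WHERE docs_fts MATCH ? ORDER BY bm25(docs_fts)",
        (query,),
    )
    return [row[0] for row in rows]


conn = build_index()
conn.executemany(
    "INSERT INTO docs (title, body) VALUES (?, ?)",
    [
        ("Getting Started", "A guide to using HybridDB..."),
        ("API Reference", "Full API documentation..."),
    ],
)
hits = keyword_search(conn, "guide")  # ids of rows whose text contains "guide"
```

A complete setup also needs update and delete triggers, which is part of the wiring HybridDB is described as handling for you.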

### Search Modes

```python
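from hybriddb import HYBRID, LONGTEXT, TEXT, Column, HybridDB, SearchMode

db = HybridDB("./my_data")
db.create_table("docs", {"title": Column(TEXT), "body": LONGTEXT})
# The "contacts" and "memories" tables used below are assumed to exist already.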

# Keyword only — fast, exact, great for names and titles
db.search("contacts", "name", "Alice", mode="keyword")

# Semantic only — finds "9am standup" when searching for "morning meetings"
db.search("memories", "content", "team rituals", mode=SearchMode.SEMANTIC)

# Hybrid — best of both, RRF fusion, the default
db.search("docs", "body", "getting started guide", mode=SearchMode.HYBRID)
db.search("docs", "body", "getting started guide", mode=HYBRID)

# Search across ALL text columns at once
db.search("contacts", "engineering manager")
db.search_columns("contacts", "engineering manager")
```
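
Hybrid mode is described above as RRF (reciprocal rank fusion) of the keyword and semantic result lists. The following is a minimal sketch of that technique in plain Python, not HybridDB's actual code: the function name `rrf_fuse` is hypothetical, and the constant `k=60` is a commonly used default that is assumed here rather than taken from the library.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings into one list of document ids."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); high ranks count more.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["a", "b", "c"]   # best-first ids from keyword search
semantic_hits = ["b", "d", "a"]  # best-first ids from vector search
fused = rrf_fuse([keyword_hits, semantic_hits])
```

Because RRF only looks at ranks, it does not need BM25 scores and vector distances to share a scale, which is one reason it is a popular way to combine the two.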

### Async API

All core operations have async wrappers that run blocking SQLite/ChromaDB work in a worker thread:

```python
# Run inside an async function (or an asyncio-aware REPL)
await db.acreate_table("messages", {"content": LONGTEXT})
await db.ainsert("messages", {"content": "async-safe memory"})
results = await db.asearch("messages", "content", "memory")
```
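
The "blocking work in a worker thread" pattern can be sketched with `asyncio.to_thread` from the standard library. This is an illustration under assumptions, not HybridDB's implementation: the `TinyStore` class and its methods are hypothetical, and a lock is added so that concurrent calls do not share the SQLite connection at the same time.

```python
import asyncio
import sqlite3
import threading


class TinyStore:
    def __init__(self, path: str = ":memory:") -> None:
        # check_same_thread=False because calls arrive from worker threads.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        self._lock = threading.Lock()
        self._conn.execute("CREATE TABLE messages (content TEXT)")

    def insert(self, content: str) -> int:
        """Blocking insert; returns the new row id."""
        with self._lock:
            cur = self._conn.execute(
                "INSERT INTO messages (content) VALUES (?)", (content,)
            )
            self._conn.commit()
            return cur.lastrowid

    async def ainsert(self, content: str) -> int:
        # Hand the blocking call to a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.insert, content)


async def main() -> int:
    store = TinyStore()
    return await store.ainsert("async-safe memory")


row_id = asyncio.run(main())
```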

### Public Cursor

For small custom SQL reads or migrations, use the public cursor context manager:

```python
with db.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM messages")
    count = cur.fetchone()[0]
```
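
For readers curious what a cursor context manager usually does, here is a small sketch built on `sqlite3` and `contextlib`. Whether HybridDB's `db.cursor()` commits on exit or rolls back on error is not stated in this README, so both behaviors below are assumptions, and the standalone `cursor(conn)` helper is hypothetical.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def cursor(conn: sqlite3.Connection):
    cur = conn.cursor()
    try:
        yield cur
        conn.commit()    # assumed: commit when the block exits cleanly
    except Exception:
        conn.rollback()  # assumed: undo partial work on error
        raise
    finally:
        cur.close()


conn = sqlite3.connect(":memory:")
with cursor(conn) as cur:
    cur.execute("CREATE TABLE messages (content TEXT)")
    cur.execute("INSERT INTO messages VALUES ('hello')")

with cursor(conn) as cur:
    cur.execute("SELECT COUNT(*) FROM messages")
    count = cur.fetchone()[0]
```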

### Namespaced Advanced APIs

Graph and OLAP helpers remain available directly on `HybridDB`; the `db.graph` and `db.olap` facades group them for easier discovery:

```python
node_id = db.graph.add_node("Alice", type="person")
rows = db.olap.query("SELECT COUNT(*) AS total FROM messages")
```

### Recency Scoring

Boost recent content over older content:

```python
results = db.search(
    "messages", "content", "project update",
    recency_weight=0.3,        # 30% weight to recency
    recency_column="timestamp"
)
```
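
One way to read `recency_weight=0.3` is as a linear blend between a relevance score and a recency score. The README does not give HybridDB's exact formula, so the sketch below is an assumption throughout: the function `blend`, the exponential half-life decay, and the `half_life_days` parameter are all hypothetical.

```python
def blend(
    relevance: float,
    age_days: float,
    recency_weight: float = 0.3,
    half_life_days: float = 30.0,
) -> float:
    """Mix a relevance score in [0, 1] with a recency score in (0, 1]."""
    # Recency halves every `half_life_days`; a brand-new item scores 1.0.
    recency = 0.5 ** (age_days / half_life_days)
    return (1.0 - recency_weight) * relevance + recency_weight * recency


fresh = blend(relevance=0.8, age_days=0)    # slightly less relevant, brand new
stale = blend(relevance=0.9, age_days=300)  # more relevant, ten half-lives old
```

With these numbers the fresh item outranks the stale one, which is the effect the recency weight is meant to produce. Setting `recency_weight=0` reduces the score to pure relevance.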

### Self-Healing Journal

All ChromaDB mutations (adds, updates, deletes) are journaled in SQLite. With `sync=True` (the default), an insert processes the journal immediately. With `sync=False`, journal entries are deferred until you call `process_journal()`:

```python
# Batch insert — defer ChromaDB sync for speed
db.insert_batch("contacts", big_list_of_rows, sync=False)
db.process_journal()  # Sync everything at once
```

If your process crashes mid-write, the journal replays pending entries on next startup. No ghosts, no drift.
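
The journaling pattern can be sketched with `sqlite3` and a plain dict standing in for the vector store. This illustrates the general write-ahead idea only; the table layout (`items`, `journal`), the column names, and these standalone functions are hypothetical and do not reflect HybridDB's actual schema.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, body TEXT)")
    conn.execute(
        "CREATE TABLE journal (seq INTEGER PRIMARY KEY AUTOINCREMENT, "
        "op TEXT, item_id INTEGER, body TEXT, done INTEGER DEFAULT 0)"
    )
    return conn


def insert(conn: sqlite3.Connection, body: str) -> int:
    # The row and its journal entry commit together in one transaction.
    with conn:
        cur = conn.execute("INSERT INTO items (body) VALUES (?)", (body,))
        conn.execute(
            "INSERT INTO journal (op, item_id, body) VALUES ('add', ?, ?)",
            (cur.lastrowid, body),
        )
    return cur.lastrowid


def process_journal(conn: sqlite3.Connection, vector_index: dict) -> int:
    """Apply pending entries to the vector index; returns how many were applied."""
    pending = conn.execute(
        "SELECT seq, op, item_id, body FROM journal WHERE done = 0 ORDER BY seq"
    ).fetchall()
    for seq, op, item_id, body in pending:
        if op == "add":
            vector_index[item_id] = body  # idempotent, so replays are safe
        elif op == "delete":
            vector_index.pop(item_id, None)
        with conn:
            conn.execute("UPDATE journal SET done = 1 WHERE seq = ?", (seq,))
    return len(pending)


conn = open_store()
index: dict = {}
insert(conn, "first")
insert(conn, "second")
# Pretend the process died here: SQLite has the rows, the index has nothing.
replayed = process_journal(conn, index)  # what a startup replay would do
```

Because each operation is idempotent and marked done only after it is applied, a crash between the two steps leads to a harmless repeat on the next run rather than a lost write.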

### Health & Maintenance

```python
# Check if SQLite and ChromaDB are in sync
health = db.health("contacts")
# {"sqlite_rows": 5000, "chroma_docs": {"contacts_bio": 5000}, "status": "ok"}

# Reconcile: delete ghosts, add missing docs
result = db.reconcile("contacts")
# {"ghosts_deleted": 0, "missing_added": 3, "metadata_updated": 0}
```
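
At its core, reconciliation is a set comparison between the ids SQLite holds and the ids the vector store holds. The sketch below uses plain dicts for both sides and treats SQLite as the source of truth, which is an assumption. The standalone `reconcile` function is hypothetical, and it omits the `metadata_updated` counter shown above.

```python
def reconcile(sqlite_rows: dict[int, str], vector_docs: dict[int, str]) -> dict[str, int]:
    """Make vector_docs match sqlite_rows; report what changed."""
    ghosts = set(vector_docs) - set(sqlite_rows)   # indexed, but row is gone
    missing = set(sqlite_rows) - set(vector_docs)  # row exists, never indexed
    for doc_id in ghosts:
        del vector_docs[doc_id]
    for doc_id in missing:
        vector_docs[doc_id] = sqlite_rows[doc_id]
    return {"ghosts_deleted": len(ghosts), "missing_added": len(missing)}


sqlite_rows = {1: "alpha", 2: "beta", 3: "gamma"}
vector_docs = {1: "alpha", 9: "stale entry"}
result = reconcile(sqlite_rows, vector_docs)
```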

## Custom Embedding Models

By default, HybridDB uses ChromaDB's bundled local MiniLM embedding. Plug in any embedding function if you want a specific model or provider:

```python
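# Requires the optional extra: pip install "hybriddb[sentence-transformers]"
from hybriddb import HybridDB
from sentence_transformers import SentenceTransformer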

model = SentenceTransformer("all-MiniLM-L6-v2")
db = HybridDB("./data", embedding_fn=lambda text: model.encode(text).tolist())
```

Works with any embedding provider — OpenAI, Cohere, Hugging Face, local models.
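
Judging by the example above, an embedding function takes a string and returns a list of floats. For tests that should not download a model, a deterministic stand-in with that shape can be handy. The sketch below is a toy hashed bag-of-words embedding, not a semantic model; `toy_embedding` and `cosine` are hypothetical helpers, and whether HybridDB accepts the function unchanged is an assumption based on the lambda shown earlier.

```python
import hashlib
import math


def toy_embedding(text: str, dims: int = 64) -> list[float]:
    """Hash each token into one of `dims` buckets, then L2-normalize."""
    vec = [0.0] * dims
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    # Inputs are already unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


same = cosine(toy_embedding("hello world"), toy_embedding("hello world"))
overlap = cosine(toy_embedding("hello world"), toy_embedding("hello there"))
```

Under that assumption it would be passed as `HybridDB("./data", embedding_fn=toy_embedding)`. It only captures shared tokens, so use a real model for anything beyond tests.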

## License

MIT — see [LICENSE](LICENSE).

## Author

Eddy Xu

Inspired by [claude-mem](https://github.com/thedotmack/claude-mem) by [Matt Mack](https://github.com/thedotmack).

## Status

Alpha — actively developed, API may evolve. Core CRUD and search are stable with full test coverage (35+ tests). Currently used in production in the Executive Assistant agent system.
