Metadata-Version: 2.4
Name: haystack-velesdb
Version: 1.15.0
Summary: Haystack 2.x DocumentStore for VelesDB: The Local AI Memory Database.
Author-email: VelesDB Team <contact@wiscale.fr>
License: MIT
Project-URL: Homepage, https://github.com/cyberlife-coder/VelesDB
Project-URL: Documentation, https://velesdb.com/docs/integrations/haystack
Project-URL: Repository, https://github.com/cyberlife-coder/VelesDB
Keywords: haystack,velesdb,vector-database,embeddings,rag,local-first,semantic-search
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: haystack-ai>=2.0.0
Requires-Dist: velesdb>=1.14.0
Requires-Dist: velesdb-common>=1.14.0
Provides-Extra: dev
Requires-Dist: pytest<9.0,>=7.0; extra == "dev"
Dynamic: license-file

# haystack-velesdb

A Haystack 2.x `DocumentStore` backed by [VelesDB](https://github.com/cyberlife-coder/VelesDB) —
the local-first, microsecond-latency vector database.

This integration joins the existing [LangChain](../langchain/) and [LlamaIndex](../llamaindex/)
connectors, completing the trio of major Python RAG frameworks supported by VelesDB.

## Installation

```bash
pip install haystack-velesdb
```

For development:

```bash
pip install -e "integrations/haystack[dev]"
```

## Quick start

```python
from haystack_velesdb import VelesDBDocumentStore
from haystack.dataclasses import Document

store = VelesDBDocumentStore(
    path="./my_docs",
    collection_name="knowledge_base",
    embedding_dim=768,
    metric="cosine",
)

# Write pre-embedded documents. The constant vectors below are placeholders;
# real embeddings come from a model and must have `embedding_dim` (768) entries.
documents = [
    Document(id="doc1", content="VelesDB is fast.", embedding=[0.1] * 768),
    Document(id="doc2", content="Local-first AI memory.", embedding=[0.3] * 768),
]
store.write_documents(documents)

# Retrieve by vector
results = store.embedding_retrieval(query_embedding=[0.1] * 768, top_k=5)
for doc in results:
    print(doc.content, doc.score)
```

## Full RAG pipeline

See [`examples/rag_pipeline.py`](examples/rag_pipeline.py) for a complete PDF ingestion
and semantic search example using `SentenceTransformersDocumentEmbedder`.

```python
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_velesdb import VelesDBDocumentStore

store = VelesDBDocumentStore(path="./rag_store", embedding_dim=384)

# Indexing pipeline
indexer = Pipeline()
indexer.add_component("converter", PyPDFToDocument())
indexer.add_component("splitter", DocumentSplitter(split_by="sentence", split_length=3))
indexer.add_component("embedder", SentenceTransformersDocumentEmbedder(model="all-MiniLM-L6-v2"))
indexer.add_component("writer", DocumentWriter(document_store=store))
indexer.connect("converter", "splitter")
indexer.connect("splitter", "embedder")
indexer.connect("embedder", "writer")
indexer.run({"converter": {"sources": ["paper.pdf"]}})

# Query pipeline. `InMemoryEmbeddingRetriever` is bound to `InMemoryDocumentStore`
# and does not work with a custom DocumentStore, so wrap `embedding_retrieval`
# in a thin Haystack component that forwards the call. A full working example is in
# `integrations/haystack/examples/rag_pipeline.py` (`_VelesRetriever`).
from haystack import component
from haystack.dataclasses import Document
from typing import List

@component
class VelesRetriever:
    def __init__(self, document_store, top_k: int = 10):
        self._store = document_store
        self._top_k = top_k

    @component.output_types(documents=List[Document])
    def run(self, query_embedding: List[float]):
        return {"documents": self._store.embedding_retrieval(query_embedding, top_k=self._top_k)}

querier = Pipeline()
querier.add_component("embedder", SentenceTransformersTextEmbedder(model="all-MiniLM-L6-v2"))
querier.add_component("retriever", VelesRetriever(document_store=store))
querier.connect("embedder.embedding", "retriever.query_embedding")
result = querier.run({"embedder": {"text": "What is VelesDB?"}})
print(result["retriever"]["documents"])
```

## API reference

### `VelesDBDocumentStore`

| Parameter | Default | Description |
|-----------|---------|-------------|
| `path` | `"./velesdb_haystack"` | Directory where VelesDB persists data |
| `collection_name` | `"haystack_documents"` | VelesDB collection name |
| `embedding_dim` | `768` | Embedding vector dimension |
| `metric` | `"cosine"` | Distance metric: `"cosine"`, `"euclidean"`, or `"dot"` |

### Methods

| Method | Description |
|--------|-------------|
| `write_documents(documents, policy)` | Upsert documents; returns count written |
| `filter_documents(filters)` | Scroll documents matching a VelesDB filter dict |
| `embedding_retrieval(query_embedding, top_k, filters, scale_score)` | Vector similarity search |
| `count_documents()` | Total document count |
| `delete_documents(document_ids)` | Delete by Haystack string IDs |
| `to_dict()` / `from_dict()` | Haystack pipeline serialisation |

**Note on `DuplicatePolicy`:** `NONE` and `OVERWRITE` use VelesDB upsert semantics
and always overwrite on collision.  `FAIL` is fully enforced: a pre-scan is
performed before writing and `DuplicateDocumentError` is raised if any document
already exists (prefer `OVERWRITE` or `NONE` for bulk loads to skip the scan cost).
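
The difference between the two behaviours can be sketched with a plain dictionary standing in for the collection. This is a minimal illustration, not the store's implementation: `write_with_policy` is a hypothetical helper, the policies are passed as plain strings, and `DuplicateDocumentError` is redefined locally so the snippet runs without Haystack installed.

```python
class DuplicateDocumentError(Exception):
    """Local stand-in for the Haystack error of the same name."""


def write_with_policy(existing: dict, new_docs: dict, policy: str) -> int:
    """Write new_docs into existing under a "FAIL" or "OVERWRITE" policy (hypothetical helper)."""
    if policy == "FAIL":
        # Pre-scan: reject the whole batch before anything is written.
        clashes = sorted(set(existing) & set(new_docs))
        if clashes:
            raise DuplicateDocumentError(f"IDs already exist: {clashes}")
    # Upsert: new values replace old ones on collision.
    existing.update(new_docs)
    return len(new_docs)


collection = {"doc1": "old"}
written = write_with_policy(collection, {"doc1": "new"}, "OVERWRITE")

try:
    write_with_policy(collection, {"doc1": "newer", "doc2": "x"}, "FAIL")
except DuplicateDocumentError as err:
    print(err)
```

The extra set intersection in the `FAIL` branch is the scan cost mentioned above; the upsert path skips it entirely.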

**Note on document IDs and SHA-256:** Haystack string IDs are mapped to 63-bit
integers using the first 8 bytes of SHA-256 (~9.2 × 10¹⁸ slots). For a
1 M-document collection, the probability that any two IDs collide is roughly
5 × 10⁻⁸ (birthday bound, n² / 2N), which is negligible for typical RAG
workloads. A `ValueError` is raised at write time if a collision is detected
between a new document and an existing one.
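
A mapping of this kind can be sketched with the standard library alone. The function names are hypothetical, and the UTF-8 encoding, big-endian byte order, and top-bit masking are assumptions made for illustration; the integration's actual choices may differ.

```python
import hashlib
import math


def haystack_id_to_int(doc_id: str) -> int:
    """Map a string ID to a 63-bit integer via the first 8 bytes of SHA-256.

    Assumptions (not confirmed by the source): UTF-8 encoding, big-endian
    byte order, and clearing the top bit to go from 64 to 63 bits.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << 63) - 1)


def birthday_collision_probability(n_docs: int, bits: int = 63) -> float:
    """Approximate probability that any two of n_docs IDs collide."""
    slots = 2 ** bits
    # 1 - exp(-n(n-1) / 2N); expm1 keeps precision for tiny probabilities.
    return -math.expm1(-n_docs * (n_docs - 1) / (2 * slots))


print(haystack_id_to_int("doc1"))
print(f"{birthday_collision_probability(1_000_000):.2e}")
```

The helper evaluates the standard birthday approximation, so it can be used to check how the risk grows with collection size before committing to very large collections.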

**Note on `scale_score`:** When `True` (default), cosine similarity scores
are normalised from `[-1, 1]` to `[0, 1]`, so downstream re-ranking can treat
them as values in the unit interval, much like probabilities.
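
The simplest normalisation with this range is a linear map. The sketch below assumes that formula; the source only states the input and output ranges, and `scale_cosine_score` is a hypothetical name.

```python
def scale_cosine_score(score: float) -> float:
    """Linearly map a cosine similarity from [-1, 1] to [0, 1] (assumed formula)."""
    return (score + 1.0) / 2.0


for raw in (-1.0, 0.0, 0.5, 1.0):
    print(raw, scale_cosine_score(raw))
```

Because the map is monotonic, ranking order is unchanged; only the numeric range differs.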

## Running tests

```bash
cd integrations/haystack
pip install -e ".[dev]"
pytest tests/ -v
```

Tests use lightweight fake VelesDB objects — no running server required.
