Metadata-Version: 2.4
Name: vajra-search
Version: 0.2.1
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Rust
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Indexing
Requires-Dist: numpy>=1.20.0
Requires-Dist: sentence-transformers>=2.2.0 ; extra == 'vector'
Provides-Extra: vector
Summary: High-performance BM25 + HNSW vector search using category theory, written in Rust
Keywords: search,bm25,hnsw,vector-search,information-retrieval,category-theory,rust
Author: Rajesh Sampathkumar
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM
Project-URL: Homepage, https://github.com/aiexplorations/vajra_search_engine
Project-URL: Repository, https://github.com/aiexplorations/vajra_search_engine

# Vajra Search Engine (`vajra-search`)

Rust-backed search framework with a Python interface for:

- lexical BM25 search,
- vector ANN search (HNSW),
- hybrid BM25 + vector fusion.

The package ships with a compiled Rust extension (`vajra_search._vajra_search`) and Python orchestration layers for embeddings, vector indexing, and hybrid fusion.

## Installation

Base install:

```bash
pip install vajra-search
```

Optional embedding-model dependency (for `TextEmbeddingMorphism`):

```bash
pip install "vajra-search[vector]"
```

## Search Modes

### 1) Lexical Search (BM25)

```python
from vajra_search import Document, DocumentCorpus, VajraSearch

docs = [
    Document("1", "Rust for Search", "Rust enables predictable low-level performance."),
    Document("2", "BM25 Overview", "BM25 is a lexical ranking algorithm for keyword search."),
    Document("3", "Hybrid Retrieval", "Hybrid retrieval combines lexical and vector signals."),
]

corpus = DocumentCorpus(docs)
engine = VajraSearch(corpus, k1=1.5, b=0.75)

results = engine.search("bm25 keyword ranking", top_k=3)
for r in results:
    # BM25 rank from the Rust layer is zero-based; display as one-based.
    print(f"rank={r.rank + 1} id={r.doc_id} score={r.score:.4f} title={r.title}")
```
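
`k1` controls term-frequency saturation and `b` controls document-length normalization. As a point of reference, here is a minimal pure-Python sketch of the standard Okapi BM25 contribution of one query term (the Rust layer's exact IDF variant may differ; this uses the common Lucene-style smoothed IDF):

```python
import math


def bm25_score(tf: int, df: int, n_docs: int, doc_len: int, avg_len: float,
               k1: float = 1.5, b: float = 0.75) -> float:
    """Okapi BM25 contribution of a single query term to one document."""
    # Smoothed IDF: rarer terms (low df) contribute more.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    # Length normalization: b=0 ignores length, b=1 fully normalizes.
    norm = 1.0 - b + b * doc_len / avg_len
    return idf * tf * (k1 + 1.0) / (tf + k1 * norm)


# A term appearing twice in an average-length document, in 1 of 3 docs:
print(round(bm25_score(tf=2, df=1, n_docs=3, doc_len=10, avg_len=10.0), 4))
```

Raising `k1` lets repeated occurrences of a term keep adding score for longer before saturating; lowering `b` reduces the penalty on long documents.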

### 2) Vector Search (HNSW)

The example below uses a tiny deterministic embedder so it runs without external model downloads.

```python
from typing import List

import numpy as np

from vajra_search import (
    Document,
    NativeHNSWIndex,
    VajraVectorSearch,
)
from vajra_search.embeddings import EmbeddingMorphism


class TinyEmbedding(EmbeddingMorphism[str]):
    """Very small keyword-count embedder for demos/tests."""

    VOCAB = ("rust", "search", "bm25", "vector")

    @property
    def dimension(self) -> int:
        return len(self.VOCAB)

    def embed(self, text: str) -> np.ndarray:
        t = text.lower()
        vec = np.array([t.count(tok) for tok in self.VOCAB], dtype=np.float32)
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec

    def embed_batch(self, texts: List[str]) -> np.ndarray:
        return np.vstack([self.embed(t) for t in texts]).astype(np.float32)


docs = [
    Document("1", "Rust Search", "Rust vector search with HNSW."),
    Document("2", "Lexical BM25", "BM25 is strong for exact keyword matching."),
    Document("3", "Vector Retrieval", "Vector search captures semantic similarity."),
]

embedder = TinyEmbedding()
index = NativeHNSWIndex(dimension=embedder.dimension, metric="cosine", max_elements=100)
vsearch = VajraVectorSearch(embedder, index)
vsearch.index_documents(docs, show_progress=False)

results = vsearch.search("vector search in rust", top_k=3)
for r in results:
    print(f"rank={r.rank} id={r.id} score={r.score:.4f} title={r.document.title}")
```
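
`TinyEmbedding` L2-normalizes its vectors; for unit-length vectors, cosine similarity reduces to a plain dot product, which is what the `cosine` metric relies on. A small illustrative sketch (plain Python, independent of the index):

```python
import math


def normalize(v):
    """Scale a vector to unit length (L2 norm 1), as TinyEmbedding does."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else v


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


a = normalize([2.0, 0.0, 1.0])
b = normalize([1.0, 1.0, 1.0])
# For unit vectors, dot(a, b) equals the cosine of the angle between them.
print(round(dot(a, b), 4))
```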

### 3) Hybrid Search (BM25 + Vector)

```python
from typing import List

import numpy as np

from vajra_search import (
    Document,
    DocumentCorpus,
    HybridSearchEngine,
    NativeHNSWIndex,
    VajraSearch,
    VajraVectorSearch,
)
from vajra_search.embeddings import EmbeddingMorphism


class TinyEmbedding(EmbeddingMorphism[str]):
    VOCAB = ("rust", "search", "bm25", "vector")

    @property
    def dimension(self) -> int:
        return len(self.VOCAB)

    def embed(self, text: str) -> np.ndarray:
        t = text.lower()
        vec = np.array([t.count(tok) for tok in self.VOCAB], dtype=np.float32)
        norm = np.linalg.norm(vec)
        return vec / norm if norm > 0 else vec

    def embed_batch(self, texts: List[str]) -> np.ndarray:
        return np.vstack([self.embed(t) for t in texts]).astype(np.float32)


docs = [
    Document("1", "Rust HNSW", "Rust implementation of HNSW vector search."),
    Document("2", "BM25 Fundamentals", "BM25 ranks documents by lexical relevance."),
    Document("3", "Hybrid Ranking", "Hybrid ranking combines BM25 and vector signals."),
]

corpus = DocumentCorpus(docs)
bm25 = VajraSearch(corpus)

embedder = TinyEmbedding()
index = NativeHNSWIndex(dimension=embedder.dimension, metric="cosine", max_elements=100)
vector = VajraVectorSearch(embedder, index)
vector.index_documents(docs, show_progress=False)

hybrid = HybridSearchEngine(bm25, vector, alpha=0.5, method="rrf")
results = hybrid.search("rust vector search ranking", top_k=3)
for r in results:
    print(f"rank={r.rank} id={r.id} score={r.score:.4f} title={r.document.title}")
```
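
`method="rrf"` selects rank-based fusion. Assuming standard Reciprocal Rank Fusion (the constant `k=60` below is the conventional default from the RRF literature, not necessarily the engine's internal value), the idea can be sketched as:

```python
from collections import defaultdict


def rrf_fuse(rankings, k: int = 60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


bm25_ids = ["1", "2", "3"]    # hypothetical BM25 ordering
vector_ids = ["3", "1", "2"]  # hypothetical vector ordering
print(rrf_fuse([bm25_ids, vector_ids]))
```

Because RRF only uses ranks, it needs no score calibration between the lexical and vector scorers, which is why it is a common default for hybrid fusion.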

## Runnable Examples

The repository includes these runnable example scripts:


- `examples/lexical_search.py`
- `examples/vector_search.py`
- `examples/hybrid_search.py`

Run:

```bash
python examples/lexical_search.py
python examples/vector_search.py
python examples/hybrid_search.py
```

## API Surface (Python)

Main exports:

- BM25: `Document`, `DocumentCorpus`, `BM25Params`, `VajraSearch`, `VajraSearchParallel`
- HNSW: `HnswIndex` (raw Rust binding), `NativeHNSWIndex` (Python wrapper)
- Vector layer: `VajraVectorSearch`, `VectorSearchResult`
- Hybrid layer: `HybridSearchEngine`
- Embeddings: `TextEmbeddingMorphism`, `PrecomputedEmbeddingMorphism`, `IdentityEmbeddingMorphism`

## Persistence

Vector index persistence is exposed via:

- `NativeHNSWIndex.save(path)`
- `NativeHNSWIndex.load(path)`
- `VajraVectorSearch.save(path)` / `VajraVectorSearch.load(path, embedder, index_class)`

## Reproducibility and Benchmarks

- Repro steps: `reproduction.md`
- Benchmark harness and datasets are documented in the companion benchmark repos referenced from the project documentation.

### Benchmark Snapshot (Python Interface)

Measured on **2026-03-02** on **Darwin arm64**, **Python 3.13.7** using:

- query protocol: 10 warmup + 100 measured queries (`top_k=10`)
- corpus: deterministic synthetic topic-keyword documents with mixed selectivity queries (broad + selective)
- modes benchmarked through the Python API (`VajraSearch`, `VajraSearchParallel`, `VajraVectorSearch`, `HybridSearchEngine`)
- `lexical_parallel` measures per-query latency from batched `search_batch` execution
- `vector` numbers use a tiny deterministic embedder, so they represent index-path latency (not transformer inference latency)
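
For reference, the latency and throughput columns below can be derived from per-query timings roughly as follows (an illustrative sketch, not the project's actual harness code):

```python
import statistics


def summarize(latencies_ms):
    """Percentile + throughput summary from per-query latencies in ms."""
    # quantiles(n=100) returns 99 cut points; index 49 is the 50th percentile.
    qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    total_s = sum(latencies_ms) / 1000.0
    return {
        "p50": qs[49],
        "p95": qs[94],
        "p99": qs[98],
        "qps": len(latencies_ms) / total_s,
    }


stats = summarize([0.5] * 95 + [2.0] * 5)  # 100 measured queries
print(stats["p50"], round(stats["qps"], 1))
```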

| Size | Mode | Build (s) | p50 (ms) | p95 (ms) | p99 (ms) | QPS |
|---|---|---:|---:|---:|---:|---:|
| 1,000 | lexical | 0.006 | 0.019 | 0.113 | 0.114 | 25799.8 |
| 1,000 | lexical_parallel | 0.004 | 0.020 | 0.024 | 0.024 | 49387.8 |
| 1,000 | vector | 0.055 | 0.016 | 0.019 | 0.019 | 62366.4 |
| 1,000 | hybrid | 0.060 | 0.057 | 0.158 | 0.223 | 11873.4 |
| 10,000 | lexical | 0.048 | 0.244 | 1.410 | 1.865 | 2018.7 |
| 10,000 | lexical_parallel | 0.044 | 0.157 | 0.171 | 0.171 | 6664.5 |
| 10,000 | vector | 0.508 | 0.017 | 0.018 | 0.019 | 61373.2 |
| 10,000 | hybrid | 0.566 | 0.279 | 1.651 | 1.740 | 2039.0 |
| 20,000 | lexical | 0.091 | 0.517 | 4.436 | 6.013 | 759.2 |
| 20,000 | lexical_parallel | 0.089 | 0.281 | 0.331 | 0.331 | 3512.9 |
| 20,000 | vector | 1.005 | 0.017 | 0.039 | 0.060 | 47738.4 |
| 20,000 | hybrid | 1.192 | 0.659 | 4.042 | 4.402 | 785.4 |
| 50,000 | lexical | 0.238 | 1.859 | 12.796 | 13.658 | 237.9 |
| 50,000 | lexical_parallel | 0.276 | 1.115 | 1.257 | 1.257 | 882.2 |
| 50,000 | vector | 2.625 | 0.017 | 0.026 | 0.035 | 55942.5 |
| 50,000 | hybrid | 2.815 | 1.837 | 12.550 | 13.674 | 237.9 |

Re-run this benchmark:

```bash
./.venv/bin/python scripts/benchmark_python_modes.py --sizes 1000 10000 20000 50000
```

Raw outputs are written to:

- `scripts/benchmark_python_modes_latest.json`
- `scripts/benchmark_python_modes_latest.md`

### Wikipedia Vector Benchmark Snapshot (Companion Harness)

For production-style vector benchmarking against Wikipedia embeddings (1k/10k/20k/50k) and ZVec comparison, use the companion harness documented in `reproduction.md`.

For reproducible build-time comparisons, install the local extension with native CPU flags enabled:

```bash
RUSTFLAGS="-C target-cpu=native" pip install -e ~/Github/vajra_search_engine --no-build-isolation
```

One 50k snapshot from that track (fresh run on 2026-03-04):

| Engine/Profile | Build (s) | p50 (ms) | QPS | Recall@10 |
|---|---:|---:|---:|---:|
| ZVec | 2.724 | 0.764 | 1312.7 | 0.998 |
| Vajra quality | 51.083 | 0.208 | 4753.0 | 0.998 |
| Vajra fast | 17.340 | 0.170 | 5682.1 | 0.908 |
| Vajra instant | 4.194 | 0.075 | 10787.2 | 0.718 |

## Release Checks

Before publishing to PyPI:

```bash
./scripts/release_check.sh
```

This validates:

- Python tests
- coverage threshold (`>=80%`)
- Rust workspace tests (with pinned PyO3 interpreter)
- wheel build + clean-venv install/import smoke test

## Release Process

- Tag push (`v*`) builds cross-platform wheels/sdist and creates a GitHub Release.
- PyPI upload is a separate manual action via GitHub workflow (`Publish PyPI`), using the chosen tag.
- Full runbook: `RELEASING.md`

## License

MIT

