Metadata-Version: 2.4
Name: vajra-bm25
Version: 0.1.0
Summary: Categorical BM25 search engine using pure category theory
Author: Rajesh Sampathkumar
License: MIT
Project-URL: Homepage, https://github.com/aiexplorations/vajra_bm25
Project-URL: Documentation, https://github.com/aiexplorations/vajra_bm25#readme
Project-URL: Repository, https://github.com/aiexplorations/vajra_bm25
Project-URL: Issues, https://github.com/aiexplorations/vajra_bm25/issues
Keywords: bm25,search,category-theory,information-retrieval,coalgebra,morphism,functor,text-search,ranking
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Text Processing :: Indexing
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: optimized
Requires-Dist: numpy>=1.20.0; extra == "optimized"
Requires-Dist: scipy>=1.7.0; extra == "optimized"
Provides-Extra: persistence
Requires-Dist: joblib>=1.0.0; extra == "persistence"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: rank-bm25>=0.2.2; extra == "dev"
Provides-Extra: all
Requires-Dist: numpy>=1.20.0; extra == "all"
Requires-Dist: scipy>=1.7.0; extra == "all"
Requires-Dist: joblib>=1.0.0; extra == "all"
Dynamic: license-file

# Vajra BM25

[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

**Vajra** (Sanskrit: वज्र, "thunderbolt/diamond") is a BM25 search engine built on pure category theory.

## What Makes Vajra Different

Vajra implements the standard BM25 ranking algorithm using rigorous mathematical abstractions:

- **Morphisms**: BM25 scoring as a mathematical arrow `(Query, Document) → ℝ`
- **Coalgebras**: Search as state unfolding `QueryState → List[SearchResult]`
- **Functors**: The List functor captures multiple-results semantics

The math is the same; only the vocabulary differs. The core BM25 formula is identical to other implementations: category theory provides the organizational structure, not runtime magic.
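As a rough illustration of the `(Query, Document) → ℝ` arrow, here is a minimal standard-library sketch of BM25 scoring. It is not taken from Vajra's source: the function name `bm25_score`, the whitespace tokenization, and the smoothed IDF variant are all assumptions made for the example.

```python
import math
from collections import Counter
from typing import List


def bm25_score(query: List[str], doc: List[str], corpus: List[List[str]],
               k1: float = 1.5, b: float = 0.75) -> float:
    """Score one tokenized document against a tokenized query."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        if tf[term] == 0:
            continue  # term absent from this document
        df = sum(1 for d in corpus if term in d)  # document frequency
        # Assumed IDF variant: smoothed so it is never negative.
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        # Length normalization scales the saturation constant k1.
        norm = k1 * (1.0 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1.0) / (tf[term] + norm)
    return score


# Toy corpus, tokenized by whitespace (an assumption for this sketch).
corpus = [
    "functors preserve structure".split(),
    "coalgebras model dynamics".split(),
    "bfs explores level by level".split(),
]
scores = [bm25_score(["functors"], doc, corpus) for doc in corpus]
```

Only the first document contains the query term, so it is the only one with a non-zero score.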

## Installation

```bash
# Basic installation (zero dependencies)
pip install vajra-bm25

# With optimizations (NumPy + SciPy for vectorized operations)
pip install vajra-bm25[optimized]

# With index persistence (save/load indices)
pip install vajra-bm25[persistence]

# Everything
pip install vajra-bm25[all]
```

## Quick Start

```python
from vajra_bm25 import VajraSearch, Document, DocumentCorpus

# Create documents
documents = [
    Document(id="1", title="Category Theory", content="Functors preserve structure"),
    Document(id="2", title="Coalgebras", content="Coalgebras model dynamics"),
    Document(id="3", title="Search Algorithms", content="BFS explores level by level"),
]
corpus = DocumentCorpus(documents)

# Create search engine
engine = VajraSearch(corpus)

# Search
results = engine.search("category functors", top_k=5)

for r in results:
    print(f"{r.rank}. {r.document.title} (score: {r.score:.3f})")
```

## Optimized Usage

For larger corpora (1000+ documents), use the optimized version:

```python
from vajra_bm25 import VajraSearchOptimized, DocumentCorpus

# Load corpus from JSONL
corpus = DocumentCorpus.load_jsonl("corpus.jsonl")

# Create optimized engine
# Automatically uses sparse matrices for >10K documents
engine = VajraSearchOptimized(corpus)

# Search (vectorized, cached)
results = engine.search("neural networks", top_k=10)
```

## Parallel Batch Processing

When many queries need to be answered at once, use the parallel engine:

```python
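from vajra_bm25 import DocumentCorpus, VajraSearchParallel

# Load a corpus (or build one as in Quick Start)
corpus = DocumentCorpus.load_jsonl("corpus.jsonl")

engine = VajraSearchParallel(corpus, max_workers=4)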

# Process multiple queries in parallel
queries = ["machine learning", "deep learning", "neural networks"]
batch_results = engine.search_batch(queries, top_k=5)
```
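For readers curious about the general pattern behind batch search, the sketch below fans queries out over a thread pool using only the standard library. How `VajraSearchParallel` is implemented internally is not documented here; the helper `run_batch` and the stand-in `toy_search` are hypothetical names used purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_batch(search: Callable[[str], List[str]], queries: List[str],
              max_workers: int = 4) -> List[List[str]]:
    """Run `search` over all queries concurrently, preserving query order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(search, queries))


def toy_search(query: str) -> List[str]:
    # Hypothetical stand-in for a real search call.
    return [word.upper() for word in query.split()]


batch = run_batch(toy_search, ["machine learning", "deep learning"])
```

Results come back in the same order as the input queries, regardless of which worker finishes first.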

## Performance

At 100,000 documents:

| Implementation | Query Latency | Recall@10 |
|----------------|---------------|-----------|
| rank-bm25 | 133.54 ms | baseline |
| Vajra (base) | 59.14 ms | 65.0% |
| Vajra (optimized) | 1.39 ms | 66.5% |

Vajra (optimized) achieves a **96x speedup** over rank-bm25 (133.54 ms vs. 1.39 ms per query) through:
- Vectorized NumPy operations
- Pre-computed IDF values
- Sparse matrix representation
- LRU query caching
- Partial sort for top-k

## JSONL Format

Vajra stores corpora as JSONL, with one JSON object per document on each line:

```jsonl
{"id": "doc1", "title": "First Document", "content": "Content here"}
{"id": "doc2", "title": "Second Document", "content": "More content"}
```

Load and save:

```python
from vajra_bm25 import DocumentCorpus

# Save an existing DocumentCorpus (here named `corpus`)
corpus.save_jsonl("corpus.jsonl")

# Load
corpus = DocumentCorpus.load_jsonl("corpus.jsonl")
```
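
To show what the format amounts to, here is a minimal sketch of JSONL serialization using only the `json` module. The helpers `dumps_jsonl` and `loads_jsonl` are hypothetical and are not part of the Vajra API.

```python
import json
from typing import Dict, List


def dumps_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def loads_jsonl(text: str) -> List[Dict[str, str]]:
    """Parse JSONL text, skipping blank lines."""
    # Split on "\n" only: json.dumps escapes newlines inside strings.
    return [json.loads(line) for line in text.split("\n") if line.strip()]


records = [
    {"id": "doc1", "title": "First Document", "content": "Content here"},
    {"id": "doc2", "title": "Second Document", "content": "More content"},
]
text = dumps_jsonl(records)
restored = loads_jsonl(text)
```

Because each line is an independent JSON object, a corpus can be appended to or streamed line by line without parsing the whole file.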

## BM25 Parameters

```python
from vajra_bm25 import VajraSearch, BM25Parameters

# Custom BM25 parameters
params = BM25Parameters(
    k1=1.5,  # Term frequency saturation (default: 1.5)
    b=0.75   # Length normalization (default: 0.75)
)

# `corpus` is a DocumentCorpus, built as in Quick Start
engine = VajraSearch(corpus, params=params)
```
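
To build intuition for the two parameters, the sketch below evaluates the term-frequency component of the standard BM25 formula on its own. The function name `tf_weight` and the example document lengths are made up for illustration.

```python
def tf_weight(tf: int, doc_len: int, avg_len: float,
              k1: float = 1.5, b: float = 0.75) -> float:
    """BM25 term-frequency component for one term in one document."""
    norm = k1 * (1.0 - b + b * doc_len / avg_len)
    return tf * (k1 + 1.0) / (tf + norm)


# Saturation: each extra occurrence adds less, bounded above by k1 + 1.
weights = [tf_weight(tf, doc_len=100, avg_len=100.0) for tf in (1, 2, 10, 100)]

# Length normalization: with b=0, document length is ignored entirely.
short_b0 = tf_weight(3, doc_len=50, avg_len=100.0, b=0.0)
long_b0 = tf_weight(3, doc_len=500, avg_len=100.0, b=0.0)

# With the default b=0.75, a long document is penalized.
long_default = tf_weight(3, doc_len=500, avg_len=100.0)
```

A larger `k1` lets repeated terms keep contributing for longer before the weight flattens out, while `b` moves between no length normalization (`b=0`) and full normalization (`b=1`).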

## Categorical Abstractions (Advanced)

For users interested in the category theory foundations:

```python
from vajra_bm25 import (
    Morphism, FunctionMorphism, IdentityMorphism,
    Coalgebra, SearchCoalgebra,
    Functor, ListFunctor,
)

# Morphism composition
f = FunctionMorphism(lambda x: x + 1)
g = FunctionMorphism(lambda x: x * 2)
h = f >> g  # h(x) = (x + 1) * 2

# Identity laws
identity = IdentityMorphism()
assert (f >> identity).apply(5) == f.apply(5)  # f >> id = f
assert (identity >> f).apply(5) == f.apply(5)  # id >> f = f
```
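
For readers who want to see how a `>>` operator of this kind could work, here is a minimal stand-alone sketch. The class `Arrow` is a hypothetical stand-in and does not reproduce Vajra's `Morphism` classes; it only assumes the same left-to-right reading of `>>` shown above.

```python
from typing import Any, Callable


class Arrow:
    """Hypothetical stand-in for a morphism: a wrapped function."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn

    def apply(self, x: Any) -> Any:
        return self.fn(x)

    def __rshift__(self, other: "Arrow") -> "Arrow":
        # Left-to-right composition: (f >> g).apply(x) == g.apply(f.apply(x))
        return Arrow(lambda x: other.apply(self.apply(x)))


identity = Arrow(lambda x: x)
f = Arrow(lambda x: x + 1)
g = Arrow(lambda x: x * 2)
h = Arrow(lambda x: x - 3)

# Associativity: grouping does not change the result.
left = ((f >> g) >> h).apply(5)
right = (f >> (g >> h)).apply(5)
```

Together with the identity laws, associativity is what makes the wrapped functions behave like arrows in a category.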

## API Reference

### Core Classes

- `Document(id, title, content, metadata=None)` - Immutable document
- `DocumentCorpus(documents)` - Collection of documents
- `VajraSearch(corpus, params=None)` - Base search engine
- `VajraSearchOptimized(corpus, k1=1.5, b=0.75)` - Vectorized search
- `VajraSearchParallel(corpus, max_workers=4)` - Parallel batch search

### Search Results

```python
@dataclass
class SearchResult:
    document: Document  # The matched document
    score: float        # BM25 relevance score
    rank: int           # Position in results (1-indexed)
```

## Why Category Theory?

Category theory provides:

1. **Unified abstractions** - Same `Coalgebra.structure_map()` interface for graph search and document retrieval
2. **Explicit type signatures** - `BM25: (Query, Document) → ℝ` makes inputs/outputs clear
3. **Composable pipelines** - `preprocess >> score >> rank` as morphism composition

What it doesn't provide:
- Performance improvements (those come from NumPy/sparse matrices)
- Novel algorithms (BM25 is BM25)
- Runtime machinery (it's just well-organized code)

The honest summary: **category theory is a design vocabulary, not a runtime mechanism**.
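
As a concrete reading of the "same interface" point above, the sketch below treats a structure map as a plain function from a state to a list of next values, and uses that one shape for both graph search and retrieval. All names here (`bfs_unfold`, `graph_step`, `retrieval_step`, the toy data) are hypothetical and are not Vajra's `Coalgebra` API.

```python
from collections import deque
from typing import Callable, Dict, List, TypeVar

S = TypeVar("S")


def bfs_unfold(structure_map: Callable[[S], List[S]], start: S) -> List[S]:
    """Repeatedly apply a structure map, visiting states breadth-first."""
    seen, order, queue = {start}, [start], deque([start])
    while queue:
        for nxt in structure_map(queue.popleft()):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Graph search: a node unfolds to its neighbours.
GRAPH: Dict[str, List[str]] = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}


def graph_step(node: str) -> List[str]:
    return GRAPH.get(node, [])


# Retrieval: a query state unfolds to the ids of matching documents.
DOCS: Dict[str, str] = {
    "doc1": "functors preserve structure",
    "doc2": "coalgebras model dynamics",
}


def retrieval_step(query: str) -> List[str]:
    terms = query.split()
    return [doc_id for doc_id, text in DOCS.items()
            if any(term in text.split() for term in terms)]


visited = bfs_unfold(graph_step, "a")
hits = retrieval_step("coalgebras dynamics")
```

Both `graph_step` and `retrieval_step` have the shape `State → List[...]`; only the meaning of the state and of the results differs.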

## Development

```bash
# Clone repository
git clone https://github.com/aiexplorations/vajra_bm25.git
cd vajra_bm25

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Run with coverage
pytest --cov=vajra_bm25 --cov-report=html
```

## License

MIT License - see [LICENSE](LICENSE) for details.

## Acknowledgments

- BM25 algorithm: Robertson & Zaragoza, "The Probabilistic Relevance Framework"
- Category theory foundations: Rutten, "Universal Coalgebra: A Theory of Systems"
- Inspired by the [State Dynamic Modeling](https://github.com/aiexplorations/state_dynamic_modeling) project
