Metadata-Version: 2.4
Name: full-text-sparse-engine
Version: 0.4.0
Author: Carlo Moro
Author-email: Carlo Moro <cnmoro@gmail.com>
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.7
Description-Content-Type: text/markdown
Requires-Dist: lmdb
Requires-Dist: numpy
Requires-Dist: scipy
Requires-Dist: unidecode
Requires-Dist: scikit-learn
Requires-Dist: lenlp
Requires-Dist: xxhash
Requires-Dist: lightning-disk-kv>=1.1.0
Dynamic: author
Dynamic: requires-python

# Full text search engine - Hybrid LMDB/Inverted Index

This project is a high-performance, persistent full-text search engine written in Python. It utilizes a **Hybrid Architecture** combining an inverted index for sub-100ms candidate filtering and an on-the-fly BM25 vectorizer for relevant ranking.

## Key Features

*   **Sub-100ms Search:** optimized for real-time querying using a custom Inverted Index.
*   **Bidirectional Tree Lookup:** Finds documents using smart prefix expansion ("exon" -> "exonerated") and recursive reduction ("exonerações" -> "exonera").
*   **Metadata Filtering:** Powerful filtering with support for exact matches, ranges (`$gt`, `$lt`), and sets (`$in`).
*   **Atomic Batching:** High-throughput indexing (thousands of docs/sec) via atomic batch commits to LMDB.
*   **Thread-Safe:** Based on LMDB (Lightning Memory-Mapped Database).
*   **CRUD Support:** Fully supports Create, Read, Update, and Delete operations.

## Architecture

Unlike traditional vector search engines that rely heavily on pre-computed large dense matrices, this engine uses a two-step process:

1.  **Filtering (Inverted Index):** A sharded inverted index maps tokens to document IDs. It uses a bidirectional strategy:
    *   **Forward Scan:** Finds words starting with the query token (e.g., query "work" matches "working").
    *   **Backward Scan:** If no exact match is found, it reduces the query token to find root words (e.g., query "working" matches "work").
2.  **Ranking (On-the-fly BM25):** Once candidates are filtered (by metadata and text), a lightweight `IncrementalBM25` model is built instantly in memory for just those candidates to score and rank them.

## Usage

### Initialization

```python
from engine import SearchEngine
import os, shutil

# Define paths
paths = ["./db", "./db_meta", "./db_meta_idx", "./db_text_idx", "./matrix"]

# Cleanup for fresh start
for p in paths:
    if os.path.exists(p): shutil.rmtree(p)

# Initialize
engine = SearchEngine(
    storage_base_path=paths[0],
    metadata_storage_base_path=paths[1],
    metadata_index_storage_base_path=paths[2],
    text_index_storage_base_path=paths[3],
    matrix_path=paths[4]
)
```

### 1. Create (Insert)

You can insert single documents or batches. Batching is recommended for speed.

```python
# Single Insert
doc_id = engine.store_data("The quick brown fox", {"category": "animals"})
print(f"Inserted document ID: {doc_id}")

# Batch Insert (Faster)
docs = ["Apple pie recipe", "Banana bread recipe"]
metas = [{"type": "food"}, {"type": "food"}]
doc_ids = engine.store_data_batch(docs, metas)
print(f"Inserted batch: {doc_ids}")
```

### 2. Read (Search)

Search combines text queries with metadata filters.

```python
# Simple Text Search
results = engine.search("apple", {}) 
# Returns: [(doc_id, text, metadata), ...]

# Metadata Filter + Text
# Finds "fox" only in docs where category == 'animals'
results = engine.search("fox", {"category": "animals"})

# Advanced Range Query
# Find "report" where year >= 2022
results = engine.search("report", {"year": {"$gte": 2022}})

# Set Query
# Find "recipe" where type is 'food' or 'dessert'
results = engine.search("recipe", {"type": {"$in": ["food", "dessert"]}})
```

### 3. Update

To update a document, simply store it again with the same `doc_id`. Note: This engine treats updates as "Upserts". For a clean update (ensuring old index keys are removed), it is often safer to Delete then Insert, though `store_data` with an existing ID will update the storage and add new index keys (old keys remain but point to the valid ID).

```python
# 1. Store original
uid = engine.store_data("Old text content", {"version": 1}, doc_id="custom-id-123")

# 2. Update (Overwrite)
# This updates the content and adds 'new' and 'content' to the index for this ID.
engine.store_data("New content", {"version": 2}, doc_id="custom-id-123")
```

### 4. Delete

Delete documents based on metadata queries. This cleans up the storage and the indexes.

```python
# Delete all documents with category 'animals'
deleted_count = engine.delete({"category": "animals"})
print(f"Deleted {deleted_count} documents.")

# Delete a specific document by ID (if ID is stored in metadata)
# Assuming you stored {"id": "custom-id-123"} in metadata:
engine.delete({"id": "custom-id-123"})
```

### Performance Notes

*   **Indexing:** Use `store_data_batch` for mass ingestion. It groups atomic transactions to drastically reduce disk I/O.
*   **Search:** The engine is optimized for high-selectivity queries (where metadata or text filters reduce the candidate set to < 10,000 documents).

The following performance metrics were collected on a standard machine. The use of a memory-mapped index allows for fast search performance while keeping RAM usage low.

| Metric                | Value               |
|-----------------------|---------------------|
| Number of documents   | 1000                |
| Document size (chars) | 500                 |
| Storage throughput    | 22.63 docs/sec      |
| Search throughput     | 6089.11 queries/sec |

*These numbers are meant to be indicative. Actual performance will vary depending on the hardware and the nature of the data.*
