Metadata-Version: 2.4
Name: mistralai-search-toolkit
Version: 0.0.8
Summary: Modular framework for building IR systems
Author-email: Mistral AI <support@mistral.ai>
License: Apache-2.0
License-File: LICENSE
Keywords: ai,information-retrieval,llm,mistral,rag,search
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: <3.15,>=3.12
Requires-Dist: aiofiles>=24.1.0
Requires-Dist: dynaconf>=3.2.11
Requires-Dist: httpx<1.0.0,>=0.27.0
Requires-Dist: mistral-common[sentencepiece]<2.0.0,>=1.10.0
Requires-Dist: mistralai<3.0.0,>=2.4.4
Requires-Dist: opentelemetry-api<2.0.0,>=1.30.0
Requires-Dist: pydantic<3.0.0,>=2.7.4
Requires-Dist: pypdf<7.0.0,>=6.10.2
Requires-Dist: striprtf<1.0.0,>=0.0.29
Requires-Dist: structlog<26,>=24
Provides-Extra: all
Requires-Dist: eml-parser; extra == 'all'
Requires-Dist: extract-msg; extra == 'all'
Requires-Dist: langchain-core~=1.2; extra == 'all'
Requires-Dist: langchain-text-splitters~=1.1; extra == 'all'
Requires-Dist: markdownify<1,>=0.14; extra == 'all'
Requires-Dist: mistralai-search-toolkit-plugins-vespa; extra == 'all'
Requires-Dist: mistralai-search-toolkit-storage-azure; extra == 'all'
Requires-Dist: mistralai-search-toolkit-storage-gcs; extra == 'all'
Requires-Dist: numbers-parser<4.16.3,>=4.16.2; extra == 'all'
Requires-Dist: pandas~=3.0; extra == 'all'
Requires-Dist: pymupdfpro>=1.25.5; extra == 'all'
Requires-Dist: python-calamine<1.0.0,>=0.5.3; extra == 'all'
Requires-Dist: tqdm>=4.67.0; extra == 'all'
Provides-Extra: extractor-email
Requires-Dist: eml-parser; extra == 'extractor-email'
Requires-Dist: extract-msg; extra == 'extractor-email'
Requires-Dist: markdownify<1,>=0.14; extra == 'extractor-email'
Provides-Extra: extractor-pymupdf
Requires-Dist: pymupdfpro>=1.25.5; extra == 'extractor-pymupdf'
Provides-Extra: extractor-spreadsheet
Requires-Dist: numbers-parser<4.16.3,>=4.16.2; extra == 'extractor-spreadsheet'
Requires-Dist: pandas~=3.0; extra == 'extractor-spreadsheet'
Requires-Dist: python-calamine<1.0.0,>=0.5.3; extra == 'extractor-spreadsheet'
Provides-Extra: html-converter-markdownify
Requires-Dist: markdownify<1,>=0.14; extra == 'html-converter-markdownify'
Provides-Extra: storage-azure
Requires-Dist: mistralai-search-toolkit-storage-azure; extra == 'storage-azure'
Provides-Extra: storage-gcs
Requires-Dist: mistralai-search-toolkit-storage-gcs; extra == 'storage-gcs'
Provides-Extra: text-splitter-langchain
Requires-Dist: langchain-core~=1.2; extra == 'text-splitter-langchain'
Requires-Dist: langchain-text-splitters~=1.1; extra == 'text-splitter-langchain'
Provides-Extra: vespa
Requires-Dist: mistralai-search-toolkit-plugins-vespa; extra == 'vespa'
Description-Content-Type: text/markdown

# Search Toolkit

Modular, backend-agnostic framework for building and evaluating Information Retrieval systems.

## Overview

Search Toolkit provides plug-and-play, extensible components for building production-ready IR pipelines. Every component can be swapped or customized, so you assemble only what your use case needs.

## What's Included

### Core Components

- **Ingestion**: Document loaders, extractors, text splitters, enrichment, and indexing pipelines
- **Retrieval**: Vector (semantic), keyword (BM25), and hybrid search with RRF fusion
- **Query Processing**: LLM reformulation and custom preprocessing
- **Reranking**: Rerank results to surface the most relevant information
- **Embedders**: Generate embeddings for documents and queries using Mistral's embedding models
- **Storage**: Abstract object storage interface for document persistence
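
To illustrate the RRF fusion mentioned under Retrieval, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It is not the toolkit's implementation: the function name, the document IDs, and the constant `k = 60` (a commonly used default) are assumptions made for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists from a vector retriever and a keyword retriever.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print([doc_id for doc_id, _ in fused])
```

Documents that rank well in both lists (`doc_b`, `doc_a`) end up ahead of documents that appear in only one.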

### Backend Agnostic

The toolkit talks to search backends through plugins. To use a particular vector database or search engine, install the plugin for it (see [Plugins](#plugins)).
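
The idea of a backend-agnostic design can be sketched with a small interface that application code depends on, and a toy backend that satisfies it. Every name below (`SearchBackend`, `Hit`, `InMemoryBackend`, `run_query`) is hypothetical and does not reflect the toolkit's actual plugin contract.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Hit:
    doc_id: str
    score: float


class SearchBackend(Protocol):
    """Hypothetical interface; not the toolkit's actual plugin contract."""

    def index(self, doc_id: str, text: str) -> None: ...

    def search(self, query: str, top_k: int) -> list[Hit]: ...


class InMemoryBackend:
    """Toy backend that scores documents by the number of shared lowercase terms."""

    def __init__(self) -> None:
        self._docs: dict[str, set[str]] = {}

    def index(self, doc_id: str, text: str) -> None:
        self._docs[doc_id] = set(text.lower().split())

    def search(self, query: str, top_k: int) -> list[Hit]:
        terms = set(query.lower().split())
        hits = [Hit(doc_id, float(len(terms & words))) for doc_id, words in self._docs.items()]
        hits = [hit for hit in hits if hit.score > 0]
        hits.sort(key=lambda hit: (-hit.score, hit.doc_id))
        return hits[:top_k]


def run_query(backend: SearchBackend, query: str, top_k: int = 3) -> list[str]:
    """Application code depends only on the interface, not on a concrete backend."""
    return [hit.doc_id for hit in backend.search(query, top_k)]


backend = InMemoryBackend()
backend.index("a", "hybrid search with vespa")
backend.index("b", "keyword search")
backend.index("c", "object storage")
print(run_query(backend, "hybrid search"))
```

Swapping `InMemoryBackend` for another class with the same two methods would leave `run_query` unchanged, which is the property a plugin-based design aims for.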

## Installation

Install the base package:

```bash
pip install mistralai-search-toolkit
```

Install optional components:

```bash
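# Text extraction from PDFs (requires pymupdfpro)
pip install "mistralai-search-toolkit[extractor-pymupdf]"

# HTML to markdown conversion
pip install "mistralai-search-toolkit[html-converter-markdownify]"

# Email extraction
pip install "mistralai-search-toolkit[extractor-email]"

# Spreadsheet parsing
pip install "mistralai-search-toolkit[extractor-spreadsheet]"

# LangChain text splitting
pip install "mistralai-search-toolkit[text-splitter-langchain]"

# Vespa search backend plugin
pip install "mistralai-search-toolkit[vespa]"

# Azure Blob Storage or Google Cloud Storage backends
pip install "mistralai-search-toolkit[storage-azure]"
pip install "mistralai-search-toolkit[storage-gcs]"

# All optional dependencies at once
pip install "mistralai-search-toolkit[all]"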
```
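
After installing, a quick way to see which optional dependencies are importable is to probe for their modules. This is a minimal sketch using only the standard library. The mapping from extras to import names is an assumption based on the packages' usual module names, and the helper `missing_modules` is hypothetical.

```python
from importlib.util import find_spec

# Assumed top-level import names for a few of the optional dependencies.
OPTIONAL_MODULES: dict[str, list[str]] = {
    "html-converter-markdownify": ["markdownify"],
    "extractor-spreadsheet": ["pandas", "python_calamine"],
    "text-splitter-langchain": ["langchain_text_splitters"],
}


def missing_modules(modules: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


for extra, modules in OPTIONAL_MODULES.items():
    missing = missing_modules(modules)
    status = "ok" if not missing else f"missing: {', '.join(missing)}"
    print(f"{extra}: {status}")
```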

## Quick Start

### 1. Load and Process Documents

```python
import os
from mistralai.search.toolkit.ingestion.loaders import FilesystemFileLoader
from mistralai.search.toolkit.ingestion.text_splitters import CharacterTextSplitter
from mistralai.client import Mistral

# Load documents from a directory
loader = FilesystemFileLoader()
documents = loader.load(path="/path/to/documents")

# Split into chunks
splitter = CharacterTextSplitter(chunk_size=512)
chunks = splitter.split(documents)
```
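
To make the splitting step concrete, here is a plain-Python sketch of fixed-size character chunking. It only illustrates the general technique and is not how `CharacterTextSplitter` is implemented. The function name and the `overlap` parameter are assumptions added for this example.

```python
def split_by_characters(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Cut text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters (hypothetical parameter).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last chunk already reaches the end of the text.
        start += step
    return chunks


print(split_by_characters("abcdefghij", chunk_size=4))
print(split_by_characters("abcdefghij", chunk_size=4, overlap=1))
```

A small overlap keeps text that straddles a chunk boundary visible in both neighbouring chunks, at the cost of indexing some characters twice.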

### 2. Generate Embeddings

```python
from mistralai.search.toolkit.embedders import MistralEmbedder, MODEL_1024_EMBEDDING

# Create embedder (uses Mistral's API)
mistral_client = Mistral(api_key=os.environ.get("MISTRAL_API_KEY", "your-api-key"))
embedder = MistralEmbedder(client=mistral_client, model_name=MODEL_1024_EMBEDDING)

# Embed your chunks
embedded_chunks = embedder.embed(chunks)
```
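
Once chunks are embedded, vector search ranks them by how close their vectors are to the query vector. The sketch below uses cosine similarity on toy three-dimensional vectors. Cosine similarity is one common measure, and the metric a given backend actually uses is not stated here. The function and variable names are assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vectors standing in for real 1024-dimensional embeddings.
query_vector = [0.9, 0.1, 0.0]
chunk_vectors = {
    "chunk_1": [1.0, 0.0, 0.0],
    "chunk_2": [0.0, 1.0, 0.0],
}

ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
    reverse=True,
)
print(ranked)
```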

### 3. Create an Index and Search

The toolkit supports multiple search backends through plugins. See the [Vespa Plugin](#vespa-plugin-creating-a-search-index) section below for a complete example.

## Vespa Plugin: Creating a Search Index

[Vespa](https://vespa.ai/) is an open-source search engine. The Vespa plugin connects it to the toolkit as a search backend.

### Prerequisites

- The Vespa plugin: `pip install mistralai-search-toolkit-plugins-vespa`
- Docker for local development

### Getting Started with Vespa

#### Step 1: Bootstrap Your Vespa Application

First, create the application structure with an initial migration:

```bash
uv run mistral-vespa generate-migration --app-dir ./vespa_app initial_schema
```

This creates `./vespa_app/` and generates a migration file. Fill it with your schema definition:

```python
from mistralai.search.toolkit.plugins.vespa.app.schemas.app import SearchMode
from mistralai.search.toolkit.plugins.vespa.migration import VespaMigration, create_default_schema, set_app_name

class InitialSchema(VespaMigration):
    def migrate(self) -> None:
        set_app_name("articles")
        create_default_schema(
            name="articles",
            mode=SearchMode.INDEX,
            embedding_dimensions=1024,  # Match your embedder's dimensions
            schema_version=1,
        )
```
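
Because `embedding_dimensions` has to match the embedder's output, it can be worth checking vector lengths before indexing. The helper below is a hypothetical sketch, not part of the toolkit or the Vespa plugin.

```python
def check_embedding_dimensions(vectors: list[list[float]], expected: int) -> None:
    """Raise ValueError if any vector's length differs from the schema's dimension."""
    for position, vector in enumerate(vectors):
        if len(vector) != expected:
            raise ValueError(
                f"vector {position} has {len(vector)} dimensions, schema expects {expected}"
            )


SCHEMA_DIMENSIONS = 1024  # Same value as embedding_dimensions in the migration

# Placeholder vectors; real ones would come from the embedder.
check_embedding_dimensions([[0.0] * 1024, [0.1] * 1024], SCHEMA_DIMENSIONS)
print("all vectors match the schema")
```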

#### Step 2: Start a Local Vespa Instance

```bash
uv run mistral-vespa local up --query-port 18080 --config-port 19171 --name vespa-dev
```
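
The container may take a moment to become ready. As a sketch, the script below polls Vespa's `/state/v1/health` endpoint until it reports `up`. It assumes that the `--config-port` value above (19171) exposes the config server's HTTP API on localhost. The helper names and timeouts are hypothetical.

```python
import json
import time
import urllib.error
import urllib.request


def is_up(payload: dict) -> bool:
    """Interpret a /state/v1/health response body."""
    return payload.get("status", {}).get("code") == "up"


def wait_until_up(base_url: str, timeout_s: float = 60.0, interval_s: float = 2.0) -> bool:
    """Poll the health endpoint until it reports 'up' or the timeout expires."""
    url = f"{base_url.rstrip('/')}/state/v1/health"
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if is_up(json.load(response)):
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # Not reachable yet, or the body was not valid JSON; retry.
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    # Assumed address: the config port chosen in the command above.
    ready = wait_until_up("http://localhost:19171")
    print("config server ready" if ready else "config server not reachable")
```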

#### Step 3: Deploy Your Application

Deploy the migrations to the running Vespa instance:

```bash
uv run mistral-vespa migrate \
  --app-dir ./vespa_app \
  --config-server http://localhost:19171 \
  --query-port 18080
```

This generates the `vespa_app` module that you can import.

#### Step 4: Ingest and Search Documents

After deployment, use the generated `vespa_app` to index and search:

```python
import asyncio
import os

from mistralai.client import Mistral
from mistralai.search.toolkit.embedders import MistralEmbedder, MODEL_1024_EMBEDDING
from mistralai.search.toolkit.ingestion.loaders import FilesystemFileLoader
from mistralai.search.toolkit.ingestion.pipelines import Pipeline
from mistralai.search.toolkit.ingestion.text_splitters import CharacterTextSplitter
from mistralai.search.toolkit.plugins.vespa import VespaClientConfig
from mistralai.search.toolkit.retrieval import QueryEngine
from mistralai.search.toolkit.retrieval.retrievers import VectorRetriever
from vespa_app import app  # Generated by migration deployment


async def main() -> None:
    # Configuration
    mistral_client = Mistral(api_key=os.environ.get("MISTRAL_API_KEY", "your-api-key"))
    vespa_config = VespaClientConfig(
        endpoint=os.environ.get("VESPA_ENDPOINT", "http://localhost:18080"),
    )
    collection_name = "articles"

    # One embedder is shared by ingestion and retrieval so that document
    # and query vectors come from the same model
    embedder = MistralEmbedder(client=mistral_client, model_name=MODEL_1024_EMBEDDING)

    # Connect to Vespa
    vector_store = app.get_search_index(vespa_config, collection_name=collection_name)

    # INGESTION: Index your documents
    pipeline = Pipeline(
        loader=FilesystemFileLoader(),
        text_splitter=CharacterTextSplitter(chunk_size=512),
        embedder=embedder,
        stores=vector_store,
    )

    num_chunks = await pipeline.run(documents=["doc1.pdf", "doc2.pdf"])

    # RETRIEVAL: Search your documents
    query_engine = QueryEngine(
        retriever=[VectorRetriever(client=vector_store, embedder=embedder)],
    )

    results = await query_engine.search(query="What is RAG?", top_k=5)

    # Print results
    for result in results.results:
        print(f"Score: {result.score}")
        print(f"Content: {result.content}\n")


# `await` is only valid inside a coroutine, so run main() in an event loop.
# In a notebook or async REPL, call `await main()` instead.
asyncio.run(main())
```

## Plugins

Extend the toolkit with specialized backends:

| Plugin | Package | Description |
|--------|---------|-------------|
| [Vespa Plugin](https://pypi.org/project/mistralai-search-toolkit-plugins-vespa) | `mistralai-search-toolkit-plugins-vespa` | Vespa search backend |
| [AWS S3 Storage](https://pypi.org/project/mistralai-search-toolkit-storage-s3) | `mistralai-search-toolkit-storage-s3` | AWS S3 storage backend |
| [Azure Blob Storage](https://pypi.org/project/mistralai-search-toolkit-storage-azure) | `mistralai-search-toolkit-storage-azure` | Azure Blob Storage backend |
| [Google Cloud Storage](https://pypi.org/project/mistralai-search-toolkit-storage-gcs) | `mistralai-search-toolkit-storage-gcs` | Google Cloud Storage backend |

## License

This package is licensed under the Apache License 2.0.

## Support

For more information and examples on the Vespa backend, see the [Vespa documentation](https://docs.vespa.ai/).
