Metadata-Version: 2.4
Name: retrievalagent
Version: 0.2.5
Summary: retrievalagent — multi-backend retrieval-augmented generation with LangGraph
Project-URL: Homepage, https://github.com/bmsuisse/retrievalagent
Project-URL: Repository, https://github.com/bmsuisse/retrievalagent
Author: Dominik Peter
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: langchain-core>=0.3.0
Requires-Dist: langchain-openai>=0.3.0
Requires-Dist: langchain>=0.3.0
Requires-Dist: langgraph>=0.4.0
Requires-Dist: numpy>=1.26.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: requests>=2.31.0
Requires-Dist: stop-words>=2024.1.1
Requires-Dist: tenacity>=8.2.0
Provides-Extra: all
Requires-Dist: azure-identity>=1.19.0; extra == 'all'
Requires-Dist: azure-search-documents>=11.6.0; extra == 'all'
Requires-Dist: chromadb>=1.0.0; extra == 'all'
Requires-Dist: cohere>=5.21.1; extra == 'all'
Requires-Dist: duckdb>=1.2.0; extra == 'all'
Requires-Dist: faiss-cpu>=1.9.0; extra == 'all'
Requires-Dist: httpx>=0.27.0; extra == 'all'
Requires-Dist: lancedb>=0.20.0; extra == 'all'
Requires-Dist: meilisearch>=0.40.0; extra == 'all'
Requires-Dist: pgvector>=0.4.0; extra == 'all'
Requires-Dist: psycopg[binary]>=3.2.0; extra == 'all'
Requires-Dist: python-dotenv>=1.2.2; extra == 'all'
Requires-Dist: qdrant-client>=1.12.0; extra == 'all'
Requires-Dist: rerankers>=0.6.0; extra == 'all'
Requires-Dist: rich>=13.0.0; extra == 'all'
Requires-Dist: sentence-transformers>=3.0.0; extra == 'all'
Provides-Extra: azure
Requires-Dist: azure-identity>=1.19.0; extra == 'azure'
Requires-Dist: azure-search-documents>=11.6.0; extra == 'azure'
Provides-Extra: cache-pg
Requires-Dist: psycopg-pool>=3.2.0; extra == 'cache-pg'
Requires-Dist: psycopg[binary]>=3.2.0; extra == 'cache-pg'
Provides-Extra: chromadb
Requires-Dist: chromadb>=1.0.0; extra == 'chromadb'
Provides-Extra: cli
Requires-Dist: python-dotenv>=1.2.2; extra == 'cli'
Requires-Dist: rich>=13.0.0; extra == 'cli'
Provides-Extra: cohere
Requires-Dist: cohere>=5.21.1; extra == 'cohere'
Provides-Extra: duckdb
Requires-Dist: duckdb>=1.2.0; extra == 'duckdb'
Provides-Extra: embed-anything
Requires-Dist: embed-anything>=0.7.0; extra == 'embed-anything'
Provides-Extra: eval
Requires-Dist: bm25s>=0.3.3; extra == 'eval'
Requires-Dist: datasets>=4.8.4; extra == 'eval'
Requires-Dist: mteb>=2.12.11; extra == 'eval'
Requires-Dist: pandas>=3.0.2; extra == 'eval'
Requires-Dist: pyarrow>=23.0.1; extra == 'eval'
Requires-Dist: python-dotenv>=1.2.2; extra == 'eval'
Requires-Dist: rich>=13.0.0; extra == 'eval'
Provides-Extra: faiss
Requires-Dist: faiss-cpu>=1.9.0; extra == 'faiss'
Provides-Extra: huggingface
Requires-Dist: sentence-transformers>=3.0.0; extra == 'huggingface'
Provides-Extra: jina
Requires-Dist: httpx>=0.27.0; extra == 'jina'
Provides-Extra: lancedb
Requires-Dist: lancedb>=0.20.0; extra == 'lancedb'
Requires-Dist: pyarrow>=23.0.1; extra == 'lancedb'
Provides-Extra: meilisearch
Requires-Dist: meilisearch>=0.40.0; extra == 'meilisearch'
Provides-Extra: pgvector
Requires-Dist: pgvector>=0.4.0; extra == 'pgvector'
Requires-Dist: psycopg[binary]>=3.2.0; extra == 'pgvector'
Provides-Extra: qdrant
Requires-Dist: qdrant-client>=1.12.0; extra == 'qdrant'
Provides-Extra: recommended
Requires-Dist: cohere>=5.21.1; extra == 'recommended'
Requires-Dist: meilisearch>=0.40.0; extra == 'recommended'
Requires-Dist: python-dotenv>=1.2.2; extra == 'recommended'
Requires-Dist: rich>=13.0.0; extra == 'recommended'
Provides-Extra: rerankers
Requires-Dist: rerankers>=0.6.0; extra == 'rerankers'
Provides-Extra: tune
Requires-Dist: optuna>=4.8.0; extra == 'tune'
Description-Content-Type: text/markdown

# retrievalagent

<div align="center">

**An autonomous retrieval-augmented generation agent.**
Plug in any vector store, any LLM, any reranker. Hybrid search,
reranking, query rewriting, an LLM quality gate, and an autonomous
retry loop — built on LangGraph.

[![PyPI](https://img.shields.io/pypi/v/retrievalagent)](https://pypi.org/project/retrievalagent/)
[![Python](https://img.shields.io/pypi/pyversions/retrievalagent)](https://pypi.org/project/retrievalagent/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![CI](https://github.com/bmsuisse/retrievalagent/actions/workflows/ci.yml/badge.svg)](https://github.com/bmsuisse/retrievalagent/actions/workflows/ci.yml)

</div>

---

```python
from retrievalagent import init_agent

rag = init_agent("documents", model="openai:gpt-5.4", backend="qdrant")
state = rag.chat("What is the status of operation overlord?")
print(state.answer)
```

---

## Scope — Retrieval, Not Ingestion

`retrievalagent` is built for **retrieval quality at query time** —
hybrid search, reranking, query rewriting, an autonomous retry loop,
and an LLM quality gate.

**Ingestion is out of scope.** The library does not chunk, clean,
embed-at-scale, or index your corpus. Use a dedicated tool for that —
[Docling](https://github.com/docling-project/docling),
[Unstructured](https://unstructured.io),
[LlamaIndex](https://www.llamaindex.ai), a Databricks job, or a
custom script — then point `retrievalagent` at the resulting index.
Every backend exposes a minimal `add_documents()` helper for
convenience and smoke tests; it is not meant to replace a real
ingestion pipeline.

The narrow surface is deliberate: one thing, done well.

---

## What it does

Most retrieval systems do a single search pass. `retrievalagent`
runs a state machine that retrieves, evaluates the result, rewrites
when needed, and retries — all autonomously, up to `max_iter` rounds.

Per query the agent will:

1. **Understand the intent** — rewrite the question into precise
   search keywords, detect keyword-vs-semantic, and pick the hybrid
   ratio.
2. **Search broadly** — run query variants in parallel across BM25
   and vector search; fuse the results; rerank.
3. **Evaluate** — an LLM quality gate decides whether the retrieved
   docs actually answer the question.
4. **Adapt** — if not, rewrite the query and retry; on hard
   failures, swarm-retrieve with parallel strategies as a fallback.
5. **Generate** — only once the evidence holds, produce a cited,
   grounded answer.
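
The loop leaves its tracks on the returned state, so you can watch the retry
behaviour directly (fields as listed in the API reference below):

```python
from retrievalagent import init_agent

rag = init_agent("docs")  # in-memory backend, LLM from env vars

state = rag.invoke("vague question that may need a rewrite")
print(state.iterations)  # retrieval rounds the quality gate demanded
print(state.query)       # the final rewritten search query
print(state.answer)      # generated only once the evidence held
```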

---

## Features

- **Fully async pipeline** — parallel HyDE + preprocessing, zero
  blocking calls; every public op has sync and async variants.
- **LLM quality gate** — rejects weak results, drives the rewrite
  loop until the evidence holds.
- **Multi-query swarm** — fans out across BM25 + vector
  simultaneously, fuses results.
- **Autonomous retry loop** — retrieve → judge → rewrite → retry,
  up to `max_iter` rounds.
- **Hybrid search** — BM25 + vector, fused with RRF or DBSF.
- **HyDE** — hypothetical document embeddings for vague queries.
- **Tool-calling agent** — `get_index_settings`, `get_filter_values`,
  `search_hybrid`, `search_bm25`, `rerank_results`; the LLM picks
  tools dynamically.
- **Multiple rerankers** — Cohere, HuggingFace, Jina, ColBERT,
  RankGPT, embed-anything, or a custom callable.
- **8 search backends** — Meilisearch, Azure AI Search, ChromaDB,
  LanceDB, Qdrant, pgvector, DuckDB, InMemory.
- **Any LLM** — OpenAI, Azure, Anthropic, Ollama, Vertex AI, or any
  LangChain `BaseChatModel`.
- **One-line init** — `init_agent("docs", model="openai:gpt-5.4", backend="qdrant")`.
- **Multi-turn chat** — conversation history with citation-aware
  answers.
- **Auto-strategy** — the agent samples your collection at init and
  tunes itself.
- **Optuna auto-tuner** — 20+ retrieval knobs tuned to your corpus
  in ~5 min; `None` is a first-class value for disabling stages.
  [Full guide](docs/auto-tuning.md).

---

## Install

```bash
# Recommended — Meilisearch + Cohere reranker + interactive CLI
pip install retrievalagent[recommended]

# Base only — in-memory backend, BM25 keyword search
pip install retrievalagent
```

| Extra | What you get | Command |
|-------|-------------|---------|
| **`recommended`** | Meilisearch + Cohere reranker + Rich CLI | `pip install retrievalagent[recommended]` |
| `cli` | Interactive CLI with guided setup wizard | `pip install retrievalagent[cli]` |
| `all` | Every backend + reranker + CLI | `pip install retrievalagent[all]` |

<details>
<summary>Individual backends &amp; rerankers</summary>

```bash
pip install retrievalagent[meilisearch]
pip install retrievalagent[azure]
pip install retrievalagent[chromadb]
pip install retrievalagent[lancedb]
pip install retrievalagent[pgvector]
pip install retrievalagent[qdrant]
pip install retrievalagent[duckdb]
pip install retrievalagent[cohere]
pip install retrievalagent[huggingface]
pip install retrievalagent[jina]
pip install retrievalagent[rerankers]      # ColBERT, Flashrank, RankGPT, …
pip install retrievalagent[embed-anything] # Local Rust-accelerated embeddings + reranking
```

Mix and match: `pip install retrievalagent[qdrant,cohere,cli]`

</details>

---

## Quick Start

### One-liner with `init_agent`

The fastest way to get started — no provider imports, string aliases for everything:

```python
from retrievalagent import init_agent

# Minimal — in-memory backend, LLM from env vars
rag = init_agent("docs")

# OpenAI + Qdrant + Cohere reranker
rag = init_agent(
    "my-collection",
    model="openai:gpt-5.4",
    backend="qdrant",
    backend_url="http://localhost:6333",
    reranker="cohere",
)

# Anthropic + Azure AI Search (native vectorization, no client-side embeddings)
rag = init_agent(
    "my-index",
    model="anthropic:claude-sonnet-4-6",
    gen_model="anthropic:claude-opus-4-6",
    backend="azure",
    backend_url="https://my-search.search.windows.net",
    reranker="huggingface",
    auto_strategy=True,
)

# Fully local — Ollama + ChromaDB + HuggingFace cross-encoder
rag = init_agent(
    "docs",
    model="ollama:llama3",
    backend="chroma",
    reranker="huggingface",
    reranker_model="cross-encoder/ms-marco-MiniLM-L-6-v2",
)
```

### Multi-collection routing

Pass several collections and let the agent decide which to search. The LLM
picks the relevant subset before retrieval, using either the collection names
alone or optional natural-language descriptions.

```python
from retrievalagent import init_agent

# List form — LLM routes by name only
rag = init_agent(
    collections=["products", "faq", "policies"],
    backend="qdrant",
    backend_url="http://localhost:6333",
    model="openai:gpt-5.4",
)

# Dict form — LLM routes using descriptions (better precision)
rag = init_agent(
    collections={
        "products": "Product catalog: SKUs, prices, specs, availability",
        "faq":      "Customer-facing FAQ, troubleshooting, return policy",
        "policies": "Internal HR/legal/compliance policy documents",
    },
    backend="qdrant",
    backend_url="http://localhost:6333",
    model="openai:gpt-5.4",
)

rag.invoke("What's our return policy?")       # → routes to faq / policies
rag.invoke("Price of SKU 12345?")              # → routes to products
```

Each retrieved document carries its origin in `metadata["_collection"]` so you
can merge, filter, or attribute citations downstream. One backend instance is
built per collection; they share the same backend type and URL.
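
A minimal attribution sketch, assuming `retrieve_documents` works the same way
in multi-collection mode:

```python
from collections import defaultdict

# Group retrieved docs by their source collection
_, docs = rag.retrieve_documents("What's our return policy?")
by_collection: dict[str, list] = defaultdict(list)
for doc in docs:
    by_collection[doc.metadata["_collection"]].append(doc)

for name, hits in by_collection.items():
    print(f"{name}: {len(hits)} hits")
```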

**Backend aliases**

| Alias | Class | Extra |
|-------|-------|-------|
| `"memory"` / `"in_memory"` | `InMemoryBackend` | _(none)_ |
| `"chroma"` / `"chromadb"` | `ChromaDBBackend` | `retrievalagent[chromadb]` |
| `"qdrant"` | `QdrantBackend` | `retrievalagent[qdrant]` |
| `"lancedb"` / `"lance"` | `LanceDBBackend` | `retrievalagent[lancedb]` |
| `"duckdb"` | `DuckDBBackend` | `retrievalagent[duckdb]` |
| `"pgvector"` / `"pg"` | `PgvectorBackend` | `retrievalagent[pgvector]` |
| `"meilisearch"` | `MeilisearchBackend` | `retrievalagent[meilisearch]` |
| `"azure"` | `AzureAISearchBackend` | `retrievalagent[azure]` |

**Reranker aliases**

| Alias | Class | `reranker_model` | Extra |
|-------|-------|-----------------|-------|
| `"cohere"` | `CohereReranker` | Cohere model name (default: `rerank-v3.5`) | `retrievalagent[cohere]` |
| `"huggingface"` / `"hf"` | `HuggingFaceReranker` | HF model name (default: `cross-encoder/ms-marco-MiniLM-L-6-v2`) | `retrievalagent[huggingface]` |
| `"jina"` | `JinaReranker` | Jina model name (default: `jina-reranker-v2-base-multilingual`) | `retrievalagent[jina]` |
| `"llm"` | `LLMReranker` | _(uses the agent's LLM)_ | _(none)_ |
| `"rerankers"` | `RerankersReranker` | Any model from the `rerankers` library | `retrievalagent[rerankers]` |
| `"embed-anything"` | `EmbedAnythingReranker` | ONNX reranker model (default: `jina-reranker-v1-turbo-en`) | `retrievalagent[embed-anything]` |

```python
# Cohere (default model)
rag = init_agent("docs", model="openai:gpt-5.4", reranker="cohere")

# HuggingFace — multilingual model
rag = init_agent("docs", model="openai:gpt-5.4", reranker="huggingface",
                 reranker_model="cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

# Jina
rag = init_agent("docs", model="openai:gpt-5.4", reranker="jina")  # uses JINA_API_KEY

# ColBERT via rerankers library
rag = init_agent("docs", model="openai:gpt-5.4", reranker="rerankers",
                 reranker_model="colbert-ir/colbertv2.0",
                 reranker_kwargs={"model_type": "colbert"})

# Pass a pre-built reranker instance directly
from retrievalagent import CohereReranker
rag = init_agent("docs", reranker=CohereReranker(model="rerank-v3.5", api_key="..."))
```

**Model strings:** any `"provider:model-name"` from LangChain's `init_chat_model` — `openai`, `anthropic`, `azure_openai`, `google_vertexai`, `ollama`, `groq`, `mistralai`, and more.

### Manual setup

```python
from retrievalagent import Agent, InMemoryBackend

backend = InMemoryBackend(embed_fn=my_embed_fn)
backend.add_documents([
    {"content": "RAG combines retrieval with generation", "source": "wiki"},
    {"content": "Vector search finds similar embeddings", "source": "docs"},
])

rag = Agent(index="demo", backend=backend)

# Single query → full answer
state = rag.invoke("What is retrieval-augmented generation?")
print(state.answer)

# Retrieve only — documents without LLM answer
query, docs = rag.retrieve_documents("What is retrieval-augmented generation?")
for doc in docs:
    print(doc.page_content)

# Override top-K at call time
query, docs = rag.retrieve_documents("hybrid search", top_k=3)
```

### `Agent.from_model` — model string with explicit backend

```python
from retrievalagent import Agent, QdrantBackend

rag = Agent.from_model(
    "openai:gpt-5.4-mini",          # fast model for routing & rewriting
    index="docs",
    gen_model="openai:gpt-5.4",     # powerful model for the final answer
    backend=QdrantBackend("docs", url="http://localhost:6333"),
)
```

---

## Multi-turn Chat

```python
from retrievalagent import Agent, ConversationTurn

rag = Agent(index="articles")
history: list[ConversationTurn] = []

state = rag.chat("What is hybrid search?", history)
history.append(ConversationTurn(question="What is hybrid search?", answer=state.answer))

state = rag.chat("How does it compare to pure vector search?", history)
print(state.answer)
print(f"Sources: {len(state.documents)}")
```

Async variant:

```python
state = await rag.achat("What is hybrid search?", history)
```

### Search-knowledge memory with mem0

`history=` only carries the current session. For long-term **search
knowledge** that improves retrieval on the same corpus over time,
plug [mem0](https://docs.mem0.ai) into the agent. The store grows
into a corpus-aware glossary of term mappings the agent has learned
— informal-to-formal terms, brand spellings, aliases, common typos.
It is **not** a user-preferences store.

When a user query resolves through a non-trivial term expansion (the
matching documents used a different surface form than the query),
the agent's grader flags it for storage. On future queries, mem0
recalls the relevant mapping and feeds it into BM25 so the same
expansion happens automatically.

```bash
pip install mem0ai
```

```python
from retrievalagent import init_agent

rag = init_agent("articles", memory=True)

cfg = {"configurable": {"user_id": "alice"}}
rag.invoke("...", config=cfg)
# → grader may store a search-fact (synonym / alias / typo mapping)
rag.invoke("...", config=cfg)
# → if a stored mapping clears the relevance gate it is injected
#   into BM25 and the system prompt
```

Two thresholds gate the flow (a tuning sketch follows the list):

- **`memory_relevance_threshold`** (env `RAG_MEMORY_RELEVANCE_THRESHOLD`,
  default `0.7`) — mem0 cosine score the recall must exceed before
  a stored fact reaches retrieval/generation.
- **`memory_storage_threshold`** (env `RAG_MEMORY_STORAGE_THRESHOLD`,
  default `0.85`) — LLM `memory_confidence` the grader must report
  before a new fact is persisted.
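
Both can be tightened per deployment via the env vars; a minimal sketch,
assuming they are read when the agent is constructed:

```python
import os

# Recall facts only on very close matches; store only near-certain ones.
# The exact values here are illustrative, not recommendations.
os.environ["RAG_MEMORY_RELEVANCE_THRESHOLD"] = "0.8"
os.environ["RAG_MEMORY_STORAGE_THRESHOLD"] = "0.9"

from retrievalagent import init_agent

rag = init_agent("articles", memory=True)
```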

Writes are **fire-and-forget**: the graph schedules `mem0.add(...)`
as a background `asyncio` task; the user-facing response never waits
on memory I/O. `await rag.adrain_background()` before shutdown if you
need the writes to land before exit.
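
For example, in an `asyncio` shutdown path (a sketch; only `achat` and
`adrain_background` come from the API above):

```python
import asyncio

async def main() -> None:
    state = await rag.achat("What is hybrid search?", [])  # `rag` from above
    print(state.answer)
    # Flush any pending mem0 writes before the process exits
    await rag.adrain_background()

asyncio.run(main())
```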

`state.trace` carries the decision events (`read_memory` with
`memories`/`n_kept`/`threshold` or `skipped: below_threshold`;
`final_grade` with `memory_should_store`/`memory_confidence`) so you
can tune the thresholds for your corpus. See
[`docs/memory.md`](docs/memory.md) for the full memory matrix
(history vs. checkpointer vs. memory_store vs. mem0).
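
A sketch of reading those events; the exact entry shape is an assumption, only
the event and key names come from the docs:

```python
state = rag.invoke("...", config=cfg)

# Assumed shape: each trace entry is a mapping with an event name plus details
for event in state.trace:
    if event.get("event") == "read_memory":
        print("recalled:", event.get("n_kept"), "kept above", event.get("threshold"))
    elif event.get("event") == "final_grade":
        print("store?", event.get("memory_should_store"), event.get("memory_confidence"))
```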

---

## Architecture

retrievalagent has two operating modes — both fully autonomous:

### Graph mode (`rag.chat` / `rag.invoke`)

The default. A LangGraph state machine that runs the full agentic pipeline:

```
Query
  │
  ├─[HyDE]──────────────────────────────────────────┐
  │  Hypothetical document embedding (parallel)      │
  │                                                  ▼
  ▼                                         [Embed HyDE text]
[Preprocess]                                         │
  Extract keywords + variants                        │
  Detect semantic_ratio + fusion strategy            │
  │                                                  │
  └──────────────────────────────────────────────────┘
                        │
                        ▼
              [Hybrid Search × N queries]
               BM25 + Vector, multi-arm
                        │
                        ▼
               [RRF / DBSF Fusion]
                        │
                        ▼
                    [Rerank]
               Cohere / HF / Jina / embed-anything / LLM
                        │
                        ▼
               [Quality Gate]
               LLM judges relevance
                   │         │
                (good)     (bad)
                   │         │
                   ▼         ▼
              [Generate]  [Rewrite] ──► loop (max_iter)
                   │
                   ▼
        Answer + [n] inline citations
```
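
RRF here is standard reciprocal-rank fusion. As a reference sketch (not the
library's internal code):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each doc id by the sum of 1 / (k + rank) across rankings."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)

# Fuse a BM25 ranking with a vector ranking: d1 and d3 rise to the top
fused = rrf([["d1", "d2", "d3"], ["d3", "d1", "d4"]])
```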

### Tool-calling agent mode (`rag.invoke_agent`)

The agent receives a set of tools and reasons step-by-step, calling them in whatever order makes sense for the question. No fixed pipeline; the agent improvises the route for each question:

```
Query
  │
  ▼
[LLM Agent]  ◄──────────────────────────────────────┐
  Thinks: "What do I need to answer this?"           │
  │                                                  │
  ├── get_index_settings()                           │
  │   Discover filterable / sortable / boost fields  │
  │                                                  │
  ├── get_filter_values(field)                       │
  │   Sample real stored values for a field          │
  │   → build precise filter expressions             │
  │                                                  │
  ├── search_hybrid(query, filter, sort_fields)      │
  │   BM25 + vector, optional filter + sort boost    │
  │                                                  │
  ├── search_bm25(query, filter)                     │
  │   Fallback pure keyword search                   │
  │                                                  │
  ├── rerank_results(query, hits)                    │
  │   Re-rank with configured reranker               │
  │                                                  │
  └── [needs more info?] ─────────────────────────► │

  [done]
  │
  ▼
Answer  (tool calls explained inline)
```

Use `invoke_agent` when questions involve **dynamic filtering** — the agent inspects the index schema, samples real field values, builds filters on the fly, and decides whether to sort by business signals like popularity or recency.

---

## Examples

### 1. Knowledge base Q&A (InMemory, no external services)

```python
from retrievalagent import AgenticRAG, InMemoryBackend
from langchain_openai import ChatOpenAI

docs = [
    {"id": "1", "content": "The Eiffel Tower was built in 1889 for the World's Fair in Paris.", "topic": "history"},
    {"id": "2", "content": "The Louvre is the world's largest art museum, located in Paris.", "topic": "art"},
    {"id": "3", "content": "Photosynthesis converts sunlight and CO2 into glucose and oxygen.", "topic": "science"},
    {"id": "4", "content": "The Python programming language was created by Guido van Rossum in 1991.", "topic": "tech"},
    {"id": "5", "content": "Machine learning is a subset of artificial intelligence.", "topic": "tech"},
]

backend = InMemoryBackend(documents=docs)
llm = ChatOpenAI(model="gpt-5.4-mini")

rag = AgenticRAG(index="kb", backend=backend, llm=llm, gen_llm=llm)

state = rag.invoke("When was the Eiffel Tower built?")
print(state.answer)
# → "The Eiffel Tower was built in 1889 for the World's Fair in Paris. [1]"
print(state.query)        # rewritten query
print(state.iterations)   # how many retrieval rounds it took
```

---

### 2. Retrieve documents without generating an answer

Useful when you want the docs and will handle the answer yourself:

```python
from retrievalagent import AgenticRAG, InMemoryBackend

rag = AgenticRAG(index="kb", backend=backend)

query, docs = rag.retrieve_documents("machine learning", top_k=3)
print(f"Rewritten query: {query}")
for doc in docs:
    print(doc.page_content)
    print(doc.metadata)  # original fields + _rankingScore
```

---

### 3. Multi-turn chat

```python
from retrievalagent import AgenticRAG, InMemoryBackend, ConversationTurn

rag = AgenticRAG(index="kb", backend=backend, llm=llm, gen_llm=llm)
history: list[ConversationTurn] = []

q1 = "What is machine learning?"
s1 = rag.chat(q1, history)
history.append(ConversationTurn(question=q1, answer=s1.answer))
print(s1.answer)

q2 = "How does it relate to AI?"   # pronoun resolved from history
s2 = rag.chat(q2, history)
history.append(ConversationTurn(question=q2, answer=s2.answer))
print(s2.answer)
```

---

### 4. Always-on filter (e-commerce: in-stock items only)

```python
from retrievalagent import AgenticRAG, MeilisearchBackend

backend = MeilisearchBackend(
    "products",
    url="http://localhost:7700",
    api_key="masterKey",
)

# Every search is scoped to in-stock items — no per-call boilerplate
rag = AgenticRAG(
    index="products",
    backend=backend,
    filter="is_in_stock = true",
    llm=llm,
    gen_llm=llm,
)

state = rag.invoke("red running shoes size 42")
for doc in state.documents:
    print(doc.metadata["product_name"], "|", doc.metadata["price"])
```

---

### 5. Filter + own-brand exclusion

```python
# Exclude own-brand articles and search for third-party alternatives
rag = AgenticRAG(
    index="products",
    backend=backend,
    filter="is_own_brand = false",
    llm=llm,
    gen_llm=llm,
)

state = rag.invoke("Find alternatives to our house-brand brake cleaner 500ml")
print(state.answer)
# LLM strips the brand prefix, rewrites to "brake cleaner 500ml",
# filter ensures only third-party results are returned.
```

---

### 6. Async usage (FastAPI / Databricks / Jupyter)

```python
import asyncio
from retrievalagent import AgenticRAG, InMemoryBackend

rag = AgenticRAG(index="kb", backend=backend, llm=llm, gen_llm=llm)

# Async single query
state = await rag.ainvoke("What is photosynthesis?")
print(state.answer)

# Async batch — runs all queries in parallel
states = await rag.abatch([
    "What is photosynthesis?",
    "Who created Python?",
    "Where is the Louvre?",
])
for s in states:
    print(s.answer)
```

Sync variants work from any context, including Databricks/Jupyter; an already-running event loop is handled automatically:

```python
# Safe to call from a notebook cell even with a running event loop
state = rag.invoke("What is photosynthesis?")
states = rag.batch(["question one", "question two"])
```

---

### 7. Tool-calling agent — dynamic filter discovery

When you don't know the filter values upfront, the agent inspects the schema and samples field values itself:

```python
from retrievalagent import AgenticRAG, MeilisearchBackend

rag = AgenticRAG(
    index="products",
    backend=MeilisearchBackend("products", url="http://localhost:7700"),
    llm=llm,
    gen_llm=llm,
)

# Agent calls get_index_settings() → get_filter_values("brand") →
# search_hybrid(filter="brand = 'Bosch'", sort_fields=["popularity"])
result = rag.invoke_agent("Show me the most popular Bosch power tools")
print(result)
```

---

### 8. Streaming the final answer

```python
import asyncio

async def stream_answer():
    async for chunk in rag.astream("Explain hybrid search in simple terms"):
        print(chunk, end="", flush=True)

asyncio.run(stream_answer())
```

---

### 9. Qdrant — vector search with metadata payloads

```python
from retrievalagent import AgenticRAG, QdrantBackend
from qdrant_client import QdrantClient, models

# Insert docs (done once). `embed` is your embedding function, (str) -> list[float].
client = QdrantClient("http://localhost:6333")
client.upsert("articles", points=[
    models.PointStruct(id=1, vector=embed("RAG combines retrieval and generation"),
                       payload={"content": "RAG combines retrieval and generation", "year": 2023}),
    models.PointStruct(id=2, vector=embed("Vector databases store high-dimensional embeddings"),
                       payload={"content": "Vector databases store high-dimensional embeddings", "year": 2022}),
])

rag = AgenticRAG(
    index="articles",
    backend=QdrantBackend("articles", url="http://localhost:6333", embed_fn=embed),
    llm=llm,
    gen_llm=llm,
)

# Full pipeline: answer with citations
state = rag.invoke("what is RAG?")

# Retrieve only: documents, no LLM answer. To pin every search to a
# metadata subset, use the always-on `filter` kwarg (see example 4).
_, docs = rag.retrieve_documents("vector databases")
```

---

### 10. Custom instructions (tone / domain)

```python
rag = AgenticRAG(
    index="legal_docs",
    backend=backend,
    llm=llm,
    gen_llm=llm,
    instructions=(
        "You are a legal assistant. Answer in formal language. "
        "Always cite the article number when referencing a law. "
        "If the context is insufficient, say so explicitly."
    ),
)

state = rag.invoke("What are the notice periods for dismissal?")
print(state.answer)
```

---

## Backends

### Azure AI Search

Native hybrid search — no client-side embeddings needed when the index has an integrated vectorizer:

```python
from retrievalagent import Agent, AzureAISearchBackend

# Native vectorization — service embeds the query server-side
rag = Agent(
    index="my-index",
    backend=AzureAISearchBackend(
        "my-index",
        endpoint="https://my-search.search.windows.net",
        api_key="...",
    ),
)

# Client-side vectorization
rag = Agent(
    index="my-index",
    backend=AzureAISearchBackend(
        "my-index",
        endpoint="https://my-search.search.windows.net",
        api_key="...",
        embed_fn=my_embed_fn,
    ),
)

# With Azure semantic reranking
rag = Agent(
    index="my-index",
    backend=AzureAISearchBackend(
        "my-index",
        endpoint="https://my-search.search.windows.net",
        api_key="...",
        semantic_config="my-semantic-config",
    ),
)
```

### Qdrant

```python
from retrievalagent import Agent, QdrantBackend

rag = Agent(
    index="my_collection",
    backend=QdrantBackend("my_collection", url="http://localhost:6333", embed_fn=my_embed_fn),
)
```

### ChromaDB

```python
from retrievalagent import Agent, ChromaDBBackend

rag = Agent(
    index="my_collection",
    backend=ChromaDBBackend("my_collection", path="./chroma_db", embed_fn=my_embed_fn),
)
```

### LanceDB

```python
from retrievalagent import Agent, LanceDBBackend

rag = Agent(
    index="docs",
    backend=LanceDBBackend("docs", db_uri="./lancedb", embed_fn=my_embed_fn),
)
```

### PostgreSQL + pgvector

```python
from retrievalagent import Agent, PgvectorBackend

rag = Agent(
    index="documents",
    backend=PgvectorBackend(
        "documents",
        dsn="postgresql://user:pass@localhost:5432/mydb",
        embed_fn=my_embed_fn,
    ),
)
```

### DuckDB

```python
from retrievalagent import Agent, DuckDBBackend

rag = Agent(
    index="vectors",
    backend=DuckDBBackend("vectors", db_path="./my.duckdb", embed_fn=my_embed_fn),
)
```

### Meilisearch

```python
from retrievalagent import Agent, MeilisearchBackend

rag = Agent(
    index="articles",
    backend=MeilisearchBackend("articles", url="http://localhost:7700", api_key="masterKey"),
)
```

### InMemory (default, zero dependencies)

```python
from retrievalagent import Agent, InMemoryBackend

backend = InMemoryBackend(embed_fn=my_embed_fn)
backend.add_documents([
    {"content": "RAG combines retrieval with generation", "source": "wiki"},
    {"content": "Vector search finds similar embeddings", "source": "docs"},
])

rag = Agent(index="demo", backend=backend)
```

---

## LLM Configuration

Pass a pre-built LangChain model or use `init_agent` / `Agent.from_model` for string-based init.  
When using `Agent` directly, configure via env vars or pass an explicit model instance.

### OpenAI

```python
from langchain_openai import ChatOpenAI
from retrievalagent import Agent

rag = Agent(
    index="articles",
    llm=ChatOpenAI(model="gpt-5.4", api_key="sk-..."),
    gen_llm=ChatOpenAI(model="gpt-5.4", api_key="sk-..."),
)
```

### Azure OpenAI (explicit keys)

```python
from langchain_openai import AzureChatOpenAI
from retrievalagent import Agent

llm = AzureChatOpenAI(
    azure_endpoint="https://my-resource.openai.azure.com",
    azure_deployment="gpt-5.4",
    api_key="...",
    api_version="2024-12-01-preview",
)
rag = Agent(index="articles", llm=llm, gen_llm=llm)
```

### Azure OpenAI (env vars)

```python
# Set: AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY, AZURE_OPENAI_DEPLOYMENT
from retrievalagent import Agent

rag = Agent(index="articles")  # auto-detected
```

### Azure OpenAI with Managed Identity (no API key)

```python
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from langchain_openai import AzureChatOpenAI
from retrievalagent import Agent

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)
llm = AzureChatOpenAI(
    azure_endpoint="https://my-resource.openai.azure.com",
    azure_deployment="gpt-5.4",
    azure_ad_token_provider=token_provider,
    api_version="2024-12-01-preview",
)
rag = Agent(index="articles", llm=llm, gen_llm=llm)
```

### Anthropic Claude

```bash
pip install langchain-anthropic
```

```python
from langchain_anthropic import ChatAnthropic
from retrievalagent import Agent

llm = ChatAnthropic(model="claude-sonnet-4-6", api_key="sk-ant-...")
rag = Agent(index="articles", llm=llm, gen_llm=llm)
```

### Ollama (local, no API key)

```bash
pip install langchain-ollama
```

```python
from langchain_ollama import ChatOllama
from retrievalagent import Agent

rag = Agent(
    index="articles",
    llm=ChatOllama(model="llama3.2", base_url="http://localhost:11434"),
    gen_llm=ChatOllama(model="llama3.2", base_url="http://localhost:11434"),
)
```

### Google Vertex AI

```bash
pip install langchain-google-vertexai
```

```python
from langchain_google_vertexai import ChatVertexAI
from retrievalagent import Agent

llm = ChatVertexAI(model="gemini-2.0-flash", project="my-gcp-project", location="us-central1")
rag = Agent(index="articles", llm=llm, gen_llm=llm)
```

### Separate fast and generation models

Use a cheap/fast model for query rewriting and routing, a powerful model for the final answer:

```python
from langchain_openai import AzureChatOpenAI
from retrievalagent import Agent

fast_llm = AzureChatOpenAI(azure_deployment="gpt-5.4-mini", api_key="...", api_version="2024-12-01-preview")
gen_llm  = AzureChatOpenAI(azure_deployment="gpt-5.4",      api_key="...", api_version="2024-12-01-preview")

rag = Agent(index="articles", llm=fast_llm, gen_llm=gen_llm)
```

---

## Rerankers

### Cohere

```python
from retrievalagent import Agent, CohereReranker

rag = Agent(index="articles", reranker=CohereReranker(model="rerank-v3.5", api_key="..."))
```

### HuggingFace cross-encoder (local, no API key)

```bash
pip install retrievalagent[huggingface]
```

```python
from retrievalagent import Agent, HuggingFaceReranker

rag = Agent(index="articles", reranker=HuggingFaceReranker())

# Multilingual
rag = Agent(index="articles", reranker=HuggingFaceReranker(model="cross-encoder/mmarco-mMiniLMv2-L12-H384-v1"))
```

### Jina (multilingual API)

```bash
pip install retrievalagent[jina]
```

```python
from retrievalagent import Agent, JinaReranker

rag = Agent(index="articles", reranker=JinaReranker(api_key="..."))  # or JINA_API_KEY env var
```

### rerankers — ColBERT / Flashrank / RankGPT / any cross-encoder

Unified bridge to the [`rerankers`](https://github.com/AnswerDotAI/rerankers) library by answer.ai:

```bash
pip install retrievalagent[rerankers]
```

```python
from retrievalagent import Agent, RerankersReranker

rag = Agent(index="articles", reranker=RerankersReranker("cross-encoder/ms-marco-MiniLM-L-6-v2", model_type="cross-encoder"))
rag = Agent(index="articles", reranker=RerankersReranker("colbert-ir/colbertv2.0", model_type="colbert"))
rag = Agent(index="articles", reranker=RerankersReranker("flashrank", model_type="flashrank"))
rag = Agent(index="articles", reranker=RerankersReranker("gpt-5.4-mini", model_type="rankgpt", api_key="..."))
```

### Embed-anything — Rust-accelerated local embeddings + reranking

Embeddings and reranking in a single Rust-powered package. Fully
local — no API keys, no network calls. Powered by
[embed-anything](https://github.com/StarlightSearch/EmbedAnything).

```bash
pip install retrievalagent[embed-anything]
```

```python
from retrievalagent import Agent, EmbedAnythingEmbedder, EmbedAnythingReranker, QdrantBackend

# Local embeddings — works as embed_fn (callable)
embedder = EmbedAnythingEmbedder("sentence-transformers/all-MiniLM-L6-v2")

# Local reranker — implements Reranker protocol
reranker = EmbedAnythingReranker("jinaai/jina-reranker-v1-turbo-en")

rag = Agent(
    index="articles",
    backend=QdrantBackend("articles", url="http://localhost:6333", embed_fn=embedder),
    embed_fn=embedder,
    reranker=reranker,
)
```

Mix and match freely — use embed-anything for one piece and a cloud provider for the other:

```python
from retrievalagent import Agent, EmbedAnythingEmbedder, CohereReranker

# Local embeddings + cloud reranker
rag = Agent(index="docs", embed_fn=EmbedAnythingEmbedder(), reranker=CohereReranker())

# Cloud embeddings + local reranker
from retrievalagent import EmbedAnythingReranker
rag = Agent(index="docs", embed_fn=azure_embed_fn, reranker=EmbedAnythingReranker())
```

### Custom reranker

```python
from retrievalagent import Agent, RerankResult

class MyReranker:
    def rerank(self, query: str, documents: list[str], top_n: int) -> list[RerankResult]:
        # Toy scorer: keep the first top_n documents, scored by inverse position
        return [RerankResult(index=i, relevance_score=1.0 / (i + 1)) for i in range(top_n)]

rag = Agent(index="articles", reranker=MyReranker())
```

---

## Tools

When using `invoke_agent`, the LLM has access to a set of tools it can call in any order. No fixed pipeline — the agent decides what it needs.

| Tool | Description |
|------|-------------|
| `get_index_settings()` | Discover filterable, searchable, sortable, and boost fields from the index schema |
| `get_filter_values(field)` | Sample real stored values for a field — used to build precise filter expressions |
| `search_hybrid(query, filter_expr, semantic_ratio, sort_fields)` | BM25 + vector hybrid search with optional filter and sort boost |
| `search_bm25(query, filter_expr)` | Pure keyword search — fallback when hybrid returns poor results |
| `rerank_results(query, hits)` | Re-rank a list of hits with the configured reranker |

The agent follows this reasoning pattern:

1. Call `get_index_settings()` to learn the schema
2. If the question names a specific entity, call `get_filter_values(field)` to find the exact stored value
3. Call `search_hybrid()` with a filter and/or sort if relevant, otherwise broad hybrid search
4. Fall back to `search_bm25()` if results are thin
5. Call `rerank_results()` to surface the most relevant hits
6. Summarise — explaining which filters and signals influenced the answer

```python
from retrievalagent import Agent

rag = Agent(index="products")

# Agent inspects schema, detects brand field, samples values,
# builds filter, sorts by popularity signal — all autonomously
result = rag.invoke_agent("Show me the most popular Bosch power tools")
print(result)
```

---

## Constructor Reference

```python
Agent(
    index="my_index",           # collection / index name
    backend=...,                # SearchBackend (default: InMemoryBackend)
    llm=...,                    # fast LLM — routing, rewrite, filter
    gen_llm=...,                # generation LLM — final answer
    reranker=...,               # Cohere / HuggingFace / Jina / custom
    top_k=10,                   # final result count            [RAG_TOP_K]
    rerank_top_n=5,             # reranker top-n                [RAG_RERANK_TOP_N]
    retrieval_factor=4,         # over-retrieval multiplier     [RAG_RETRIEVAL_FACTOR]
    max_iter=20,                # max retrieve-rewrite cycles   [RAG_MAX_ITER]
    semantic_ratio=0.5,         # hybrid semantic weight        [RAG_SEMANTIC_RATIO]
    fusion="rrf",               # "rrf" or "dbsf"               [RAG_FUSION]
    instructions="",            # extra system prompt for generation
    embed_fn=None,              # (str) -> list[float]
    boost_fn=None,              # (doc_dict) -> float score boost
    filter=None,                # always-on Meilisearch filter expr (e.g. "brand = 'Bosch'")
    category_fields=None,       # fields used by alternative retrieve (None → auto-detect via regex)
    hyde_min_words=8,           # min words to trigger HyDE     [RAG_HYDE_MIN_WORDS]
    hyde_style_hint="",         # style hint for HyDE prompt
    auto_strategy=True,         # auto-tune from index samples
)
```
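
For instance, a `boost_fn` sketch favouring in-stock items. The signature is
the one annotated above; how the returned value is combined with the base
score is up to the library, so treat the weight as illustrative:

```python
from retrievalagent import Agent

def in_stock_boost(doc: dict) -> float:
    # Score boost per document; 0.0 leaves the ranking unchanged
    return 0.2 if doc.get("is_in_stock") else 0.0

rag = Agent(index="products", boost_fn=in_stock_boost)
```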

### Always-on filter

Pin every search to a subset of the index with `filter` — Meilisearch syntax,
AND-joined with any per-call filter (intent, language, ...):

```python
rag = init_agent("products", filter="brand = 'Bosch'")
# every BM25 + vector + hybrid search scoped to Bosch only
```

The legacy `base_filter` kwarg still works but emits a `DeprecationWarning` —
migrate to `filter` at your convenience.

### Category fields (alternative retrieve)

The alternative-retrieve fallback broadens the search by pivoting on
category-like fields (product groups, taxonomy levels, sections, ...). By
default, retrievalagent auto-detects them from the index schema via regex — matching
names like `category`, `product_group_l3`, `article_group_name`, `kategorie`,
`family`, `section`, ... — and prioritises deeper taxonomy levels
(`_l3 > _l2 > _l1`).

Override explicitly when your schema uses unusual names:

```python
rag = init_agent(
    "products",
    category_fields=["taxonomy_leaf", "taxonomy_parent", "department"],
)
```

Pass `category_fields=[]` to disable the fallback entirely.

---

## API Reference

| Method | Returns | Description |
|--------|---------|-------------|
| `rag.invoke(query)` | `RAGState` | Full RAG pipeline (sync) |
| `rag.ainvoke(query)` | `RAGState` | Full RAG pipeline (async) |
| `rag.chat(query, history)` | `RAGState` | Multi-turn chat (sync) |
| `rag.achat(query, history)` | `RAGState` | Multi-turn chat (async) |
| `rag.retrieve_documents(query, top_k)` | `(str, list[Document])` | Retrieve only, no answer |
| `rag.query(query)` | `str` | Answer string directly |
| `rag.invoke_agent(query)` | `str` | Tool-calling agent mode (sync) |
| `rag.ainvoke_agent(query)` | `str` | Tool-calling agent mode (async) |
| `rag.batch(queries)` | `list[RAGState]` | Batch of queries (sync) |
| `rag.abatch(queries)` | `list[RAGState]` | Batch of queries, run in parallel (async) |
| `rag.astream(query)` | async chunk iterator | Stream the final answer |

`RAGState` fields: `answer` · `documents` · `query` · `question` · `history` · `iterations` · `trace`

---

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint | — |
| `AZURE_OPENAI_API_KEY` | Azure OpenAI API key | — |
| `AZURE_OPENAI_DEPLOYMENT` | Default deployment | — |
| `AZURE_OPENAI_FAST_DEPLOYMENT` | Fast model deployment | → `DEPLOYMENT` |
| `AZURE_OPENAI_GENERATION_DEPLOYMENT` | Generation deployment | → `DEPLOYMENT` |
| `AZURE_OPENAI_API_VERSION` | API version | `2024-12-01-preview` |
| `OPENAI_API_KEY` | OpenAI API key (fallback) | — |
| `OPENAI_MODEL` | OpenAI model name | `gpt-5.4` |
| `AZURE_COHERE_ENDPOINT` | Azure Cohere endpoint | — |
| `AZURE_COHERE_API_KEY` | Azure Cohere API key | — |
| `COHERE_API_KEY` | Cohere API key (fallback) | — |
| `JINA_API_KEY` | Jina reranker API key | — |
| `MEILI_URL` | Meilisearch URL | `http://localhost:7700` |
| `MEILI_KEY` | Meilisearch API key | `masterKey` |
| `RAG_TOP_K` | Final result count | `10` |
| `RAG_RERANK_TOP_N` | Reranker top-n | `5` |
| `RAG_RETRIEVAL_FACTOR` | Over-retrieval multiplier | `4` |
| `RAG_MAX_ITER` | Max retrieve-rewrite cycles | `20` |
| `RAG_SEMANTIC_RATIO` | Hybrid semantic weight | `0.5` |
| `RAG_FUSION` | Fusion strategy | `rrf` |
| `RAG_HYDE_MIN_WORDS` | Min words to trigger HyDE | `8` |
| `RAG_MEMORY_RELEVANCE_THRESHOLD` | mem0 recall relevance gate | `0.7` |
| `RAG_MEMORY_STORAGE_THRESHOLD` | Grader confidence to persist a memory | `0.85` |
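
With the `cli` / `recommended` extras, `python-dotenv` is included; a common
pattern is loading a local `.env` before constructing the agent:

```python
from dotenv import load_dotenv

load_dotenv()  # reads AZURE_OPENAI_*, RAG_*, MEILI_* ... from ./.env

from retrievalagent import init_agent

rag = init_agent("articles")
```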

---

## Tune It For Your Data

**`retrievalagent` ships with curated tuned defaults** in `[tool.retrievalagent]` of `pyproject.toml`,
found by running the built-in tuner against real German/Swiss product catalog
data. These are better than hand-picked defaults for most retrieval tasks.

For peak performance on **your** corpus (product vs. legal vs. support vs.
scientific), run the tuner yourself — it searches 20+ knobs with
[Optuna](https://optuna.org) TPE sampler and usually beats defaults by 3–10%
combined score (hit@5 + paraphrase consistency).

**Real benchmark** (Meilisearch cloud, 3 German product-catalog indexes, 39
hit cases + 8 paraphrase groups):

| Config | hit@5 | consistency | stable_top1 | combined |
|--------|------:|------------:|------------:|---------:|
| Library defaults | 0.903 | 0.750 | 0.250 | 0.727 |
| Shipped tuned (`[tool.retrievalagent]`) | 0.968 | 0.792 | 0.250 | **0.761** |
| Full corpus-tuned (local) | **0.968** | **0.792** | **0.375** | **0.761** |

`combined = hit@5×0.4 + consistency×0.35 + stable_top1×0.25`

> 📘 **Full walkthrough:** [`docs/auto-tuning.md`](docs/auto-tuning.md) — testset
> design, CLI and Python API, every searched parameter, result interpretation,
> troubleshooting.

### 1. Install

```bash
pip install 'retrievalagent[tune]'
```

### 2. Write a testset

A list of `(query, expected_doc_ids, id_field)` tuples — or a JSON file:

```json
[
  {"query": "Makita Akku Bohrhammer 18V", "expected_ids": ["SKU-1065144"], "id_field": "sku"},
  {"query": "Bosch Winkelschleifer 125mm", "expected_ids": ["SKU-1057802"], "id_field": "sku"}
]
```

### 3. Run the tuner

**CLI (fastest path):**

```bash
python -m retrievalagent.tuner \
  --index my_index \
  --hits testset.json \
  --trials 50 \
  --patience 8         # early-stop if 8 trials show no improvement
```

Use `--pyproject` to write to `pyproject.toml [tool.retrievalagent]` instead of the
gitignored `retrievalagent.config.toml`.

**Python API (full control):**

```python
from retrievalagent import MeilisearchBackend, RAGConfig
from retrievalagent.tuner import RAGTuner, load_testset
from retrievalagent.utils import _make_azure_embed_fn

tuner = RAGTuner(
    backend_factory=lambda: MeilisearchBackend(index="my_index"),
    embed_fn=_make_azure_embed_fn(),
    hit_cases=load_testset("testset.json"),
    eval_k=5,
    # Optional: let the tuner mix/match weak + thinking models across
    # cost tiers (gen_llm / strong_model stays fixed — quality-critical).
    candidate_models=["azure:gpt-5.4-mini", "azure:gpt-5.4-nano"],
)

best = tuner.tune(
    n_trials=50,
    patience=8,             # early-stop on plateau
    trial_timeout_s=120,    # hung trials score 0, never block the study
)
best.save_toml("retrievalagent.config.toml")   # gitignored — local override (recommended)
# or: best.save_pyproject()          # [tool.retrievalagent] — commit if your team shares tuning
```

### 4. Use the tuned config

`RAGConfig.auto()` picks up `[tool.retrievalagent]` (and any local override) automatically; no per-knob code changes required:

```python
from retrievalagent import AgenticRAG, RAGConfig, MeilisearchBackend

rag = AgenticRAG(
    index="my_index",
    backend=MeilisearchBackend("my_index"),
    embed_fn=embed_fn,
    config=RAGConfig.auto(),   # discovers retrievalagent.config.toml → pyproject.toml → env
)
```

### Config discovery order

1. **Runtime kwarg** — `AgenticRAG(config=RAGConfig(...))`
2. **`retrievalagent.config.toml`** — per-deployment local override. **Gitignored by
   default** so your corpus-specific tuning doesn't leak into source control.
   Wins over pyproject defaults — drop a tuned file here and you're done.
3. **`[tool.retrievalagent]` in `pyproject.toml`** — shipped / shared defaults.
   `retrievalagent` ships with curated values here (tuned on real product-catalog data).
   Matches ruff/black/mypy convention for committed tool config.
4. **`RAG_*` env vars** — containers/CI overrides.
5. **Library defaults** — fallback if nothing else is set.

### What gets tuned

**Scalar thresholds:**

| Parameter | Range | Effect |
|-----------|-------|--------|
| `retrieval_factor` | 2–8 | Over-retrieve multiplier before reranking |
| `rerank_top_n` | 3–10 | Docs kept post-rerank |
| `rerank_cap_multiplier` | 1.5–4 | Caps reranker input at `top_k × m` |
| `semantic_ratio` | 0.3–0.9 | BM25 ⇄ vector balance |
| `fusion` | `rrf` / `dbsf` | Score fusion strategy |
| `short_query_threshold` | 3–8 | When to skip LLM preprocessing |
| `short_query_sort_tokens` | bool | Sort tokens for paraphrase invariance |
| `bm25_fallback_semantic_ratio` | 0.7–1.0 | Semantic ratio used when BM25 fails |
| `rerank_skip_gap` | 0.05–0.3 | Top-1 vs top-5 score gap to skip reranker |
| `name_field_boost_max` | 0.0–0.5 | Post-rerank boost for docs matching query tokens in `name_field`. Higher → precise lookups win; lower → paraphrase stability. |

**Optional (`None` = disable stage, first-class tuning option):**

| Parameter | Range / None | Effect |
|-----------|--------------|--------|
| `bm25_fallback_threshold` | 0.2–0.6 / `None` | BM25 score below which we boost semantic. `None` = never boost. |
| `fast_accept_score` | 0.5–0.95 / `None` | BM25 score to accept fast path. `None` = always slow path. |
| `fast_accept_confidence` | 0.6–0.95 / `None` | LLM confidence to accept fast path. `None` = no LLM confirm. |
| `rerank_skip_dominance` | 0.6–0.95 / `None` | Score to skip reranker on obvious hits. `None` = always rerank. |
| `expert_threshold` | 0.05–0.3 / `None` | Gap to escalate to expert reranker. `None` = never escalate. |
| `hyde_min_words` | 2–20 / `None` | Min words to trigger HyDE. `None` = disable HyDE. |

**Pipeline-stage toggles:**

| Parameter | Default | Effect |
|-----------|---------|--------|
| `enable_hyde` | `true` | Hypothetical-document expansion |
| `enable_filter_intent` | `true` | LLM detects filter intent from query |
| `enable_quality_gate` | `true` | LLM judges retrieval quality before answering |
| `enable_preprocess_llm` | `true` | LLM query rewrite + variant generation |
| `enable_reasoning` | `true` | Per-document relevance reasoning (uses `thinking_model`) |

**LLM tiers (cost ⇄ quality, mix and match):**

| Parameter | Role | Typical cheap pick |
|-----------|------|--------------------|
| `strong_model` | Final cited answer — gen_llm | `azure:gpt-5.4` |
| `weak_model` | Preprocess / quality / filter-intent / rewrite | `azure:gpt-5.4-mini` |
| `thinking_model` | Per-document reasoning when `enable_reasoning=true` | `azure:gpt-5.4-mini` |

Specs are `provider:model` — resolved via LangChain's `init_chat_model`, so
every supported provider works (`azure:`, `openai:`, `anthropic:`,
`bedrock:`, `ollama:`, etc.). `None` = inherit from env vars.

Optuna's TPE sampler learns from prior trials — **50 trials usually beats
hand-tuning**. The tuner explores *disabling* each optional stage as a
first-class hypothesis, so it can discover e.g. "your corpus doesn't need
filter-intent detection." Use a cheap LLM (e.g. `gpt-5.4-mini`) during tuning to
keep cost down; swap to your production LLM afterwards.

### Disabling stages in TOML

TOML has no `null`, so disabled fields go in a `disable` list:

```toml
[tool.retrievalagent]
top_k = 10
semantic_ratio = 0.5
# disable these optional stages for this corpus:
disable = ["bm25_fallback_threshold", "expert_threshold"]
```

`RAGConfig.from_toml()` / `from_pyproject()` translates `disable = [...]`
into `None` values on load.
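
A minimal round-trip check (assumes the tuned knobs are exposed as attributes
on `RAGConfig`; only `from_toml` itself is documented above):

```python
from retrievalagent import RAGConfig

cfg = RAGConfig.from_toml("retrievalagent.config.toml")
# Fields listed under `disable` come back as None, i.e. stage off
assert cfg.bm25_fallback_threshold is None
assert cfg.expert_threshold is None
```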

---

## CLI

```bash
pip install retrievalagent[recommended]

# Guided setup wizard — choose LLM, embedder, backend, reranker
retrievalagent

# Chat mode — full agentic pipeline
retrievalagent --chat -c my_index

# Retriever mode — documents only, no LLM
retrievalagent --retriever -c my_index

# Skip wizard, use env vars
retrievalagent --skip-wizard -c my_index
```

The wizard guides you through:
1. **LLM provider** — OpenAI, Anthropic, Ollama, or env default
2. **Embedding model** — OpenAI, Azure OpenAI, Ollama, or none (BM25 only)
3. **Vector store** — InMemory, Meilisearch, ChromaDB, Qdrant, pgvector, DuckDB, LanceDB, Azure AI Search
4. **Reranker** — Cohere, Jina, HuggingFace, LLM-based, or none
5. **Mode** — Chat (with answers) or Retriever (documents only)

---

## License

MIT — *Licence to code.*
