Metadata-Version: 2.4
Name: ragbuilder
Version: 0.1.6
Summary: RagBuilder SDK - Create optimal Production-ready RAG pipelines
Author-email: Ashwin Aravind <ashwin@krux.ai>, Aravind Parameswaran <aravind@krux.ai>
License: Apache-2.0
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: chromadb
Requires-Dist: datasets>=2.18.0
Requires-Dist: fastapi>=0.100.0
Requires-Dist: jinja2
Requires-Dist: langchain>=0.3.24
Requires-Dist: langchain_chroma==0.2.3
Requires-Dist: langchain-community==0.3.22
Requires-Dist: langchain-core==0.3.55
Requires-Dist: langchain-huggingface==0.1.2
Requires-Dist: langchain-openai==0.3.14
Requires-Dist: opentelemetry-api>=1.23.0
Requires-Dist: opentelemetry-sdk>=1.23.0
Requires-Dist: opentelemetry-exporter-otlp>=1.23.0
Requires-Dist: optuna
Requires-Dist: platformdirs
Requires-Dist: pydantic>=2.0.0
Requires-Dist: python-dotenv
Requires-Dist: ragas==0.2.14
Requires-Dist: rerankers
Requires-Dist: rich>=13.0.0
Requires-Dist: sentence-transformers
Requires-Dist: tenacity==8.4.2
Requires-Dist: rank-bm25
Requires-Dist: uvicorn>=0.30.0
Provides-Extra: graph
Requires-Dist: neo4j>=5.23.0; extra == "graph"
Requires-Dist: langchain-community[neo4j]; extra == "graph"
Provides-Extra: vectorstores
Requires-Dist: elasticsearch>=8.0.0; extra == "vectorstores"
Requires-Dist: faiss-cpu>=1.7.4; extra == "vectorstores"
Requires-Dist: pinecone-client>=3.0.0; extra == "vectorstores"
Requires-Dist: pymilvus>=2.3.0; extra == "vectorstores"
Requires-Dist: qdrant-client>=1.7.0; extra == "vectorstores"
Requires-Dist: weaviate-client>=3.25.0; extra == "vectorstores"
Requires-Dist: langchain-pinecone; extra == "vectorstores"
Requires-Dist: langchain-weaviate; extra == "vectorstores"
Requires-Dist: langchain-qdrant; extra == "vectorstores"
Requires-Dist: langchain-milvus; extra == "vectorstores"
Requires-Dist: langchain-postgres; extra == "vectorstores"
Requires-Dist: langchain-elasticsearch; extra == "vectorstores"
Provides-Extra: document-processors
Requires-Dist: pymupdf>=1.23.0; extra == "document-processors"
Requires-Dist: python-docx>=1.0.0; extra == "document-processors"
Requires-Dist: pikepdf>=8.11.0; extra == "document-processors"
Requires-Dist: pandoc>=2.3; extra == "document-processors"
Requires-Dist: pypdf>=3.17.0; extra == "document-processors"
Requires-Dist: markdown>=3.5.0; extra == "document-processors"
Requires-Dist: beautifulsoup4>=4.12.0; extra == "document-processors"
Requires-Dist: unstructured[all-docs]>=0.11.0; extra == "document-processors"
Provides-Extra: all
Requires-Dist: ragbuilder[graph]; extra == "all"
Requires-Dist: ragbuilder[vectorstores]; extra == "all"
Requires-Dist: ragbuilder[document_processors]; extra == "all"
Dynamic: license-file

![RagBuilder logo](./assets/ragbuilder_dark.png#gh-dark-mode-only)
![RagBuilder logo](./assets/ragbuilder_light.png#gh-light-mode-only)

# 

[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
[![GitHub release](https://img.shields.io/github/release/KruxAI/ragbuilder.svg)](https://github.com/KruxAI/ragbuilder/releases/)
[![GitHub license](https://badgen.net/github/license/KruxAI/ragbuilder)](https://github.com/KruxAI/ragbuilder/blob/master/LICENSE)
[![GitHub commits](https://badgen.net/github/commits/KruxAI/ragbuilder)](https://github.com/KruxAI/ragbuilder/commit/)


![11926](https://github.com/user-attachments/assets/af9e241a-b648-4b2f-ab2a-3c268c7f1ca8)

RagBuilder is a toolkit that automatically creates an optimal, production-ready Retrieval-Augmented Generation (RAG) setup for your data. It performs hyperparameter tuning over RAG parameters (e.g., chunking strategy: semantic, character, etc.; chunk size: 1000, 2000, etc.) and evaluates each configuration against a test dataset to identify the best-performing setup for your data. RagBuilder also includes several state-of-the-art, pre-defined RAG templates that have shown strong performance across diverse datasets. Bring your data, and RagBuilder will generate a production-grade RAG setup in minutes.
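
To make the idea of a configuration search concrete, here is a minimal sketch. It is not RagBuilder's implementation: it enumerates a tiny, hypothetical search space exhaustively, whereas RagBuilder uses Bayesian optimization. The `toy_score` function is a made-up stand-in for evaluating a pipeline against a test dataset, and all names in the block are illustrative.

```python
from itertools import product

# Hypothetical search space; the values mirror the examples mentioned above.
SEARCH_SPACE = {
    "chunking_strategy": ["semantic", "character"],
    "chunk_size": [1000, 2000],
}

def toy_score(config):
    """Made-up stand-in for a real evaluation run against a test dataset."""
    strategy_bonus = {"semantic": 0.10, "character": 0.05}[config["chunking_strategy"]]
    size_penalty = abs(config["chunk_size"] - 1000) / 10000
    return 0.70 + strategy_bonus - size_penalty

def best_config(space, score_fn):
    """Score every combination in the space and return the highest-scoring one."""
    keys = list(space)
    candidates = [dict(zip(keys, values)) for values in product(*space.values())]
    return max(candidates, key=score_fn)

best = best_config(SEARCH_SPACE, toy_score)
print(best)
```

Exhaustive search grows multiplicatively with every added parameter, which is why a sampling-based optimizer is the more practical choice for real pipelines.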


## Features

- **Hyperparameter Tuning**: Efficiently optimize your RAG configurations using Bayesian optimization
- **Pre-defined RAG Templates**: Use state-of-the-art templates that have demonstrated strong performance (e.g., Graph retriever, Contextual chunker)
- **Evaluation Dataset Options**: Generate a synthetic test dataset or provide your own
- **Component Access**: Direct access to vectorstore, retriever, and generator components
- **API Deployment**: Easily deploy as an API service
- **Project Persistence**: Save and load optimized RAG pipelines


## Installation

```bash
# Create a new venv
uv venv ragbuilder

# Activate the new venv
source ragbuilder/bin/activate

# Install
uv pip install ragbuilder
```

For other installation options, see the [installation guide](https://docs.ragbuilder.io/quickstart/#installation).

## Quick Start

```python
from ragbuilder import RAGBuilder

# Initialize and optimize with defaults
builder = RAGBuilder.from_source_with_defaults(input_source='https://lilianweng.github.io/posts/2023-06-23-agent/')
results = builder.optimize()

# Run a query through the complete pipeline
response = results.invoke("What is HNSW?")

# View optimization summary
print(results.summary())
```

### Setting Default Models

You can specify default LLM and embedding models that will be used throughout the pipeline:

`````python
from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragbuilder import RAGBuilder

# Initialize with custom defaults
builder = RAGBuilder.from_source_with_defaults(
    input_source='data.pdf',
    default_llm=AzureChatOpenAI(model="gpt-4o", temperature=0.0),
    default_embeddings=AzureOpenAIEmbeddings(model="text-embedding-3-large"),
    n_trials=20  # Set number of optimization trials
)

# Or when creating a RAGBuilder instance with a fine-grained custom configuration
# (data_ingest_config is a DataIngestOptionsConfig; see Advanced Configuration)
builder = RAGBuilder(
    data_ingest_config=data_ingest_config, # Custom Data Ingestion parameters
    default_llm=AzureChatOpenAI(model="gpt-4o", temperature=0.0),
    default_embeddings=AzureOpenAIEmbeddings(model="text-embedding-3-large")
)
`````

## Configuration Guide

### Basic Configuration
For most use cases, the default configuration provides good results:

```python
builder = RAGBuilder.from_source_with_defaults(
    input_source='path/to/your/data',
    test_dataset='path/to/test/data'  # Optional
)
```

### Advanced Configuration

For fine-grained control over your RAG pipeline, you can customize every aspect:

````python
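from langchain_openai import AzureChatOpenAI, AzureOpenAIEmbeddings
from ragbuilder import RAGBuilder
from ragbuilder.config import (
    DataIngestOptionsConfig,
    RetrievalOptionsConfig,
    GenerationOptionsConfig,
    LLMConfig,  # Assumed to be importable from ragbuilder.config; adjust if it lives elsewhere
)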

# Configure data ingestion
data_ingest_config = DataIngestOptionsConfig(
    input_source="data.pdf",
    document_loaders=[
        {"type": "pymupdf"},
        {"type": "unstructured"}
    ],
    chunking_strategies=[{
        "type": "RecursiveCharacterTextSplitter",
        "chunker_kwargs": {"separators": ["\n\n", "\n", " ", ""]}
    }],
    chunk_size={"min": 500, "max": 2000, "stepsize": 500},
    embedding_models=[{
        "type": "openai",
        "model_kwargs": {"model": "text-embedding-3-large"}
    }]
)

# Initialize with custom configs
builder = RAGBuilder(
    data_ingest_config=data_ingest_config,
    default_llm=AzureChatOpenAI(model="gpt-4o", temperature=0.0),
    default_embeddings=AzureOpenAIEmbeddings(model="text-embedding-3-large")
)

# Run individual module level optimization
builder.optimize_data_ingest()


# Configure retrieval options
retrieval_config = RetrievalOptionsConfig(
    retrievers=[
        {
            "type": "vector_similarity",
            "retriever_k": [20],
            "weight": 0.5
        },
        {
            "type": "bm25",
            "retriever_k": [20],
            "weight": 0.5
        }
    ],
    rerankers=[{
        "type": "BAAI/bge-reranker-base"
    }],
    top_k=[3, 5]
)


# Run retrieval optimization with custom config
builder.optimize_retrieval(retrieval_config)

# Configure Generation related options
gen_config = GenerationOptionsConfig(
    llms = [
        LLMConfig(type="azure_openai", model_kwargs={'model':'gpt-4o-mini', 'temperature':0.2}),
        LLMConfig(type="azure_openai", model_kwargs={'model':'gpt-4o', 'temperature':0.2}),
    ],
    optimization={
        "n_trials": 10, 
        "n_jobs": 1,
        "study_name": "lillog_agents_study",
        "optimization_direction": "maximize"
    },
    evaluation_config={"type": "ragas"},
)

# Run generation optimization with custom config
builder.optimize_generation(gen_config)

# Retrieve the combined results and run a query through the optimized pipeline
results = builder.optimization_results
response = results.invoke("What is HNSW?")
````
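
The two retrievers in the configuration above each carry a `weight` of 0.5. How RagBuilder combines their results is not spelled out here; one common approach to hybrid retrieval is weighted reciprocal rank fusion, sketched below. The function name `weighted_rrf`, the constant `c=60`, and the document IDs are assumptions made for illustration only.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of document IDs with weighted reciprocal rank fusion.

    Each document earns weight / (c + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = defaultdict(float)
    for docs, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(docs, start=1):
            scores[doc_id] += weight / (c + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from a vector retriever and a BM25 retriever.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([vector_hits, bm25_hits], weights=[0.5, 0.5])
print(fused)
```

Documents that rank well in both lists rise to the top, so a passage found by both keyword and vector search outranks one found by only a single retriever.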


## Component Options Reference

### Document Loaders
- `unstructured`: General-purpose loader
- `pymupdf`: Optimized for PDFs
- `pypdf`: Alternative PDF loader
- `web`: Web page loader
- Custom loaders via `custom_class`

### Chunking Strategies
- `RecursiveCharacterTextSplitter`: Recursive character text splitter
- `CharacterTextSplitter`: Character text splitter
- `MarkdownHeaderTextSplitter`: Markdown-header based splitter
- `HTMLHeaderTextSplitter`: HTML-header based splitter
- `SemanticChunker`: Semantic chunker
- `TokenTextSplitter`: Token-based splitter
- Custom splitters via `custom_class`


### Retrievers
- `vector_similarity`: Vector similarity search
- `vector_mmr`: Vector search with maximal marginal relevance (MMR)
- `bm25`: Keyword-based search using BM25
- `multi_query`: Multi-query retriever
- `parent_doc_full`: Parent document full-doc retrieval
- `parent_doc_large`: Parent document large-chunks retrieval
- `graph`: Graph-based retrieval (requires Neo4j)
- Custom retrievers via `custom_class`

### Rerankers
- `BAAI/bge-reranker-base`: BGE base reranker
- `mixedbread-ai/mxbai-rerank-base-v1`: mxbai reranker base v1
- `mixedbread-ai/mxbai-rerank-large-v1`: mxbai reranker large v1
- `cohere`: Cohere's reranking model
- `jina`: Jina reranker
- `flashrank`: FlashRank reranker
- `rankllm`: RankLLM reranker
- `colbert`: ColBERT reranker
- Custom rerankers via `custom_class`


## Environment Variables

Create a `.env` file in your project directory:

````env
# Required
OPENAI_API_KEY=your_key_here

# Optional - For additional features
MISTRAL_API_KEY=your_key_here
COHERE_API_KEY=your_key_here
AZURE_OPENAI_API_KEY=your_key_here
AZURE_OPENAI_ENDPOINT=your_endpoint_here

# For Graph-based RAG
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_password
````
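
Before starting a long optimization run, it can help to confirm that the needed variables are actually set. The following is a minimal sketch using only the standard library; the helper name `missing_env_vars` and the grouping of variables into lists are illustrative choices, not part of RagBuilder.

```python
import os

# Variable names taken from the .env example above.
REQUIRED_VARS = ["OPENAI_API_KEY"]
GRAPH_VARS = ["NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD"]

def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# A hypothetical environment used here instead of the real os.environ.
example_env = {
    "OPENAI_API_KEY": "your_key_here",
    "NEO4J_URI": "bolt://localhost:7687",
}

print(missing_env_vars(REQUIRED_VARS, example_env))
print(missing_env_vars(GRAPH_VARS, example_env))
```

In a real project, calling `load_dotenv()` from `python-dotenv` first and then `missing_env_vars(REQUIRED_VARS)` checks the values loaded from the `.env` file.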

## Advanced Topics

### Custom Evaluation Metrics
```python
from ragbuilder import EvaluationConfig

config = EvaluationConfig(
    type="custom",
    custom_class="your_module.CustomEvaluator",
    evaluator_kwargs={
        "metrics": ["precision", "recall", "f1_score"]
    }
)
```
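
The interface that a `custom_class` evaluator must implement is not described here, so the sketch below shows only the metric arithmetic behind the three names in `evaluator_kwargs`. The function name and the document IDs are hypothetical, and the metrics are computed over sets of retrieved versus relevant documents as one plausible interpretation.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute set-based precision, recall, and F1 for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall (0 when both are 0).
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1_score": f1}

# Hypothetical retrieval result and ground truth for a single query.
metrics = precision_recall_f1(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d5"],
)
print(metrics)
```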

### Optimization Configuration
Fine-tune the optimization parameters:
```python
from ragbuilder import OptimizationConfig

config = OptimizationConfig(
    n_trials=20,
    n_jobs=1,
    study_name="my_optimization",
    optimization_direction="maximize"
)
```

## API Deployment

RAGBuilder can be deployed as an API service:

````python
from ragbuilder import RAGBuilder

# Initialize and optimize
builder = RAGBuilder.from_source_with_defaults('data.pdf')
results = builder.optimize()

# Deploy as API
builder.serve(host="0.0.0.0", port=8000)
````

Access via:
- `POST /query` - Run queries through the RAG pipeline
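
A client call to this endpoint might look like the sketch below, which uses only the standard library. The request body shape (`{"query": ...}`) and the JSON response are assumptions, since the schema is not documented here; check the running service before relying on them. The helper names are hypothetical.

```python
import json
import urllib.request

def build_query_request(question, base_url="http://localhost:8000"):
    """Build a POST /query request. The {"query": ...} body is an assumption."""
    payload = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_query(question, base_url="http://localhost:8000", timeout=30):
    """Send the request and decode a JSON response (needs a running server)."""
    request = build_query_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Building the request does not touch the network, so it is safe to inspect.
request = build_query_request("What is HNSW?")
print(request.get_method(), request.full_url)

# With the server from builder.serve() running, one could then call:
# answer = send_query("What is HNSW?")
```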

## Project Management

Save and load optimized RAG pipelines:

````python
# Save project
builder.save('rag_project/')

# Load existing project
builder = RAGBuilder.load('rag_project/')

# Access components
vectorstore = builder.data_ingest.get_vectorstore()
retriever = builder.retrieval.get_retriever()
generator = builder.generation.get_generator()
````

## Best Practices

1. **Start Simple**
   - Begin with `from_source_with_defaults()`
   - Add complexity only when needed

2. **Test Data Quality**
   - Provide representative test queries
   - Use domain-specific evaluation metrics

3. **Resource Management**
   - Monitor memory usage with large datasets
   - Use chunking for large documents

4. **Production Deployment**
   - Save optimized projects for reuse
   - Monitor API performance metrics
   - Implement rate limiting for API endpoints

## Usage Analytics

We collect anonymous usage metrics to improve RAGBuilder:
- Number of optimization runs
- Success/failure rates
- No personal or business data is collected

To opt out, set `ENABLE_ANALYTICS=False` in your `.env` file.

## Contributing

We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.
