Metadata-Version: 2.4
Name: pydantic-ai-ragstack
Version: 0.1.0
Summary: A standalone RAG package for Pydantic AI with support for multiple vector stores.
Author-email: Rizki Sasri <rizki@example.com>
Requires-Python: >=3.12
Requires-Dist: boto3>=1.34.0
Requires-Dist: cohere>=5.0.0
Requires-Dist: langchain-text-splitters>=0.2.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: pydantic-ai>=0.0.18
Requires-Dist: pydantic-settings>=2.0.0
Requires-Dist: pymupdf>=1.24.0
Requires-Dist: python-docx>=1.1.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: rank-bm25>=0.2.2
Requires-Dist: tabulate>=0.9.0
Provides-Extra: all
Requires-Dist: asyncpg>=0.29.0; extra == 'all'
Requires-Dist: chromadb>=0.5.0; extra == 'all'
Requires-Dist: google-api-python-client>=2.196.0; extra == 'all'
Requires-Dist: google-auth-oauthlib>=1.2.0; extra == 'all'
Requires-Dist: liteparse>=0.0.1; extra == 'all'
Requires-Dist: llama-cloud>=0.0.5; extra == 'all'
Requires-Dist: pgvector>=0.2.0; extra == 'all'
Requires-Dist: pymilvus>=2.4.0; extra == 'all'
Requires-Dist: sentence-transformers>=3.0.0; extra == 'all'
Requires-Dist: sqlalchemy[asyncio]>=2.0.0; extra == 'all'
Provides-Extra: chroma
Requires-Dist: chromadb>=0.5.0; extra == 'chroma'
Provides-Extra: embeddings
Requires-Dist: sentence-transformers>=3.0.0; extra == 'embeddings'
Provides-Extra: loaders
Requires-Dist: google-api-python-client>=2.196.0; extra == 'loaders'
Requires-Dist: google-auth-oauthlib>=1.2.0; extra == 'loaders'
Requires-Dist: liteparse>=0.0.1; extra == 'loaders'
Requires-Dist: llama-cloud>=0.0.5; extra == 'loaders'
Provides-Extra: milvus
Requires-Dist: pymilvus>=2.4.0; extra == 'milvus'
Provides-Extra: pgvector
Requires-Dist: asyncpg>=0.29.0; extra == 'pgvector'
Requires-Dist: pgvector>=0.2.0; extra == 'pgvector'
Requires-Dist: sqlalchemy[asyncio]>=2.0.0; extra == 'pgvector'
Description-Content-Type: text/markdown

# Pydantic AI RAG Stack

`pydantic-ai-ragstack` is a standalone Retrieval-Augmented Generation (RAG) package for Pydantic AI. It uses the `pydantic` library to enforce explicit data schemas across the whole pipeline: document ingestion, embedding, storage, and retrieval. Validating data at each stage makes AI applications built on proprietary knowledge bases more reliable and predictable.

## 🚀 Features

* **Structured Data Modeling:** Uses Pydantic models (`Document`, `DocumentPage`, etc.) to define explicit schemas for all data components (metadata, content, images).
* **End-to-End Ingestion Pipeline:** Handles document loading, chunking, image extraction, and metadata enrichment.
* **Vector Store Integration:** Manages interaction with vector databases such as Chroma, Milvus, and pgvector (`pydantic_ai_ragstack/vectorstore.py`).
* **Advanced Retrieval:** Includes capabilities for content re-ranking (`pydantic_ai_ragstack/reranker.py`) to improve search relevance.
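
To illustrate the re-ranking idea from the last bullet, here is a minimal sketch that reorders first-stage results by lexical overlap with the query. It is not the logic in `reranker.py`: the `Candidate` class, the `rerank` function, and the overlap scoring are all assumptions made for this example, using only the standard library.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for a retrieved chunk; not the package's own model."""
    content: str
    score: float  # similarity score from the first-stage vector search


def tokenize(text: str) -> set[str]:
    # Lowercase and strip basic punctuation so "relevance?" matches "relevance".
    return {word.strip(".,;:!?()\"'").lower() for word in text.split()} - {""}


def rerank(query: str, candidates: list[Candidate], top_k: int = 3) -> list[Candidate]:
    """Reorder candidates by the fraction of query terms they contain."""
    query_terms = tokenize(query)

    def overlap(candidate: Candidate) -> float:
        if not query_terms:
            return 0.0
        return len(query_terms & tokenize(candidate.content)) / len(query_terms)

    # Sort by overlap first; fall back to the original vector score on ties.
    ranked = sorted(candidates, key=lambda c: (overlap(c), c.score), reverse=True)
    return ranked[:top_k]


candidates = [
    Candidate("Milvus is a vector database.", score=0.82),
    Candidate("Re-ranking improves search relevance.", score=0.80),
    Candidate("Chunking splits documents into pieces.", score=0.78),
]
reranked = rerank("How does re-ranking improve relevance?", candidates, top_k=2)
for candidate in reranked:
    print(candidate.content)
```

The second candidate moves to the top because it shares terms with the query, even though its vector score is slightly lower. A production re-ranker would more likely call a dedicated model, but the contract is the same: candidates in, reordered candidates out.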

## 🛠️ Installation

This project requires Python 3.12+ and several core dependencies (e.g., `langchain-text-splitters`, PyMuPDF). Vector store clients are installed through optional extras (`chroma`, `milvus`, `pgvector`).

```bash
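# Core install (assumes the package is available under its metadata name)
pip install pydantic-ai-ragstack
# Optional extras declared in the metadata: chroma, milvus, pgvector, embeddings, loaders, all
pip install "pydantic-ai-ragstack[chroma]"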
```
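
The package metadata declares optional extras for the vector stores and local embeddings. The sketch below checks which of them are importable in the current environment. `EXTRA_MODULES` and `available_extras` are hypothetical helpers written for this README, not part of the package, and the `loaders` extra is left out because its import names are not confirmed here.

```python
from importlib.util import find_spec

# Import names for the optional dependencies listed under each extra.
EXTRA_MODULES = {
    "chroma": ["chromadb"],
    "milvus": ["pymilvus"],
    "pgvector": ["asyncpg", "pgvector", "sqlalchemy"],
    "embeddings": ["sentence_transformers"],
}


def available_extras() -> dict[str, bool]:
    """Report, per extra, whether all of its modules can be found."""
    return {
        extra: all(find_spec(module) is not None for module in modules)
        for extra, modules in EXTRA_MODULES.items()
    }


for extra, installed in available_extras().items():
    print(f"{extra}: {'installed' if installed else 'missing'}")
```

`find_spec` only locates a module without importing it, so the check stays fast even when heavy packages such as `sentence_transformers` are installed.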

## 📂 Project Structure Overview

The core logic is organized within the `pydantic_ai_ragstack` package:

* **`models.py`**: Defines all core data structures (e.g., `Document`, `DocumentPageChunk`, `SearchResult`) using Pydantic models.
* **`ingestion.py`**: Manages the entire lifecycle of ingesting raw documents into structured chunks and vector embeddings.
* **`documents.py`**: Handles document loading and parsing (PDF, etc.).
* **`embeddings.py`**: Manages the generation of vector embeddings for content chunks.
* **`retrieval.py`**: Contains logic for querying the vector store and retrieving relevant documents/chunks.
* **`vectorstore.py`**: Abstraction layer for interacting with the underlying vector databases (e.g., Chroma, Milvus, pgvector).
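
As a rough picture of what an abstraction layer over vector databases can look like, the sketch below defines a small interface and a brute-force in-memory backend. The class names, method signatures, and cosine scoring are assumptions for illustration; the actual interface in `vectorstore.py` may differ.

```python
import math
from abc import ABC, abstractmethod


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class VectorStore(ABC):
    """Hypothetical interface that each backend would implement."""

    @abstractmethod
    def add(self, doc_id: str, vector: list[float]) -> None: ...

    @abstractmethod
    def search(self, query: list[float], top_k: int = 3) -> list[tuple[str, float]]: ...


class InMemoryVectorStore(VectorStore):
    """Brute-force backend: compares the query against every stored vector."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, doc_id: str, vector: list[float]) -> None:
        self._vectors[doc_id] = vector

    def search(self, query: list[float], top_k: int = 3) -> list[tuple[str, float]]:
        scored = [(doc_id, cosine(query, vec)) for doc_id, vec in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


store = InMemoryVectorStore()
store.add("a", [1.0, 0.0])
store.add("b", [0.0, 1.0])
store.add("c", [1.0, 1.0])
hits = store.search([1.0, 0.0], top_k=2)
print(hits)
```

Keeping the interface this small means a Chroma, Milvus, or pgvector backend only has to translate two calls, and calling code never needs to know which backend is in use.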

## 🧠 Usage Example: Ingestion and Retrieval Flow

The typical workflow involves three main stages: Load -> Index -> Query.

### Step 1: Document Loading & Processing

Raw documents are loaded and parsed into structured `Document` objects.

```python
from pydantic_ai_ragstack import documents

# Assumes 'path/to/document.pdf' exists
raw_docs = documents.load("path/to/document.pdf")
structured_data = raw_docs  # list[Document]
```

### Step 2: Indexing (Embedding and Storing)

The structured data is chunked, embedded, and stored in the vector store.

```python
from pydantic_ai_ragstack import ingestion

# MyEmbeddings is a placeholder for your configured embedding model class;
# structured_data is the list[Document] produced in Step 1.
status = ingestion.index_documents(structured_data, embedder=MyEmbeddings())

if status == "done":
    print("Indexing complete. Data is ready for retrieval.")
```
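
The chunking part of this step can be pictured with a fixed-size, overlapping character splitter. This is a standard-library sketch, not the package's implementation (the package depends on `langchain-text-splitters`, which presumably does the real work). The `chunk_text` function and its parameters are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of up to chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(chunks)
```

Larger overlaps improve recall at chunk boundaries but increase the number of embeddings to compute and store.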

### Step 3: Retrieval and Question Answering (RAG)

A query is run against the indexed documents to retrieve relevant context, which is then passed to an LLM to generate the final answer. The LLM call itself is not included in this core module.

```python
from pydantic_ai_ragstack import retrieval
# Query the system
query = "What are the main features of the RAG framework?"
search_results = retrieval.search(query, collection_name="my_knowledge_base")

if search_results:
    print("--- Retrieved Context ---")
    for result in search_results:
        print(f"Score: {result.score:.2f}\nContent: {result.content[:100]}...")
    
    # Pass context and original query to an LLM call here
```
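
The last comment in the example above leaves the LLM hand-off open. One possible way to assemble the prompt is sketched below. `Result` is a stand-in exposing only the `content` and `score` attributes used above, and `build_prompt` with its `min_score` threshold is a hypothetical helper, not part of the package.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Stand-in for a search result with the two attributes used above."""
    content: str
    score: float


def build_prompt(query: str, results: list[Result], min_score: float = 0.0) -> str:
    """Format retrieved chunks and the query into a single prompt string."""
    # Drop low-scoring chunks, then number the rest so answers can cite them.
    kept = [result for result in results if result.score >= min_score]
    lines = [f"[{index}] {result.content}" for index, result in enumerate(kept, start=1)]
    context = "\n".join(lines) if lines else "(no context retrieved)"
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


prompt = build_prompt(
    "What is RAG?",
    [
        Result("RAG combines retrieval with generation.", 0.9),
        Result("Unrelated text.", 0.1),
    ],
    min_score=0.5,
)
print(prompt)
```

The resulting string can be passed to whichever LLM client the application uses.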

## 🧪 Testing

To run unit tests, use `pytest`:

```bash
pip install pytest
pytest tests/
```
