Metadata-Version: 2.4
Name: ragwire
Version: 1.2.9
Summary: RAGWire — Production-grade RAG toolkit for document ingestion and retrieval with hybrid search support
Author-email: KGP Talkie Private Limited <udemy@kgptalkie.com>
Maintainer-email: KGP Talkie Private Limited <udemy@kgptalkie.com>
License-Expression: MIT
Project-URL: Homepage, https://laxmimerit.github.io/RAGWire/
Project-URL: Documentation, https://laxmimerit.github.io/RAGWire/
Project-URL: Repository, https://github.com/laxmimerit/ragwire.git
Project-URL: Issues, https://github.com/laxmimerit/ragwire/issues
Project-URL: YouTube, https://youtube.com/kgptalkie
Keywords: rag,retrieval,vector-database,qdrant,embeddings,hybrid-search,nlp,document-processing
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: langchain>=0.1.0
Requires-Dist: langchain-core>=0.1.0
Requires-Dist: langchain-community>=0.0.0
Requires-Dist: langchain-text-splitters>=0.0.1
Requires-Dist: qdrant-client>=1.6.0
Requires-Dist: langchain-qdrant>=0.1.0
Requires-Dist: markitdown[pdf]>=0.0.1
Requires-Dist: pyyaml>=6.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: tqdm>=4.66.0
Provides-Extra: openai
Requires-Dist: langchain-openai>=0.0.0; extra == "openai"
Provides-Extra: huggingface
Requires-Dist: langchain-huggingface>=0.0.0; extra == "huggingface"
Provides-Extra: ollama
Requires-Dist: langchain-ollama>=0.0.0; extra == "ollama"
Provides-Extra: google
Requires-Dist: langchain-google-genai>=0.0.0; extra == "google"
Provides-Extra: fastembed
Requires-Dist: fastembed>=0.2.0; extra == "fastembed"
Provides-Extra: groq
Requires-Dist: langchain-groq>=0.0.0; extra == "groq"
Provides-Extra: anthropic
Requires-Dist: langchain-anthropic>=0.0.0; extra == "anthropic"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: isort>=5.0.0; extra == "dev"
Requires-Dist: flake8>=6.0.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
Requires-Dist: mkdocs-material>=9.0.0; extra == "dev"
Provides-Extra: all
Requires-Dist: ragwire[openai]; extra == "all"
Requires-Dist: ragwire[huggingface]; extra == "all"
Requires-Dist: ragwire[ollama]; extra == "all"
Requires-Dist: ragwire[google]; extra == "all"
Requires-Dist: ragwire[fastembed]; extra == "all"
Requires-Dist: ragwire[groq]; extra == "all"
Requires-Dist: ragwire[anthropic]; extra == "all"
Requires-Dist: ragwire[dev]; extra == "all"
Dynamic: license-file

<p align="center">
  <img src="https://raw.githubusercontent.com/laxmimerit/RAGWire/main/assets/ragwire.png" alt="RAGWire logo" width="120"/>
</p>

<h1 align="center">RAGWire</h1>
<p align="center">Production-grade RAG toolkit for document ingestion and retrieval</p>

<p align="center">
  <a href="https://pypi.org/project/ragwire"><img src="https://img.shields.io/pypi/v/ragwire" alt="PyPI"/></a>
  <a href="https://github.com/laxmimerit/ragwire/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-MIT-green" alt="License"/></a>
  <a href="https://youtube.com/kgptalkie"><img src="https://img.shields.io/badge/YouTube-KGP%20Talkie-red" alt="YouTube"/></a>
</p>

<p align="center">
  <a href="https://laxmimerit.github.io/RAGWire/">
    <img src="https://img.shields.io/badge/📖%20Full%20Documentation-laxmimerit.github.io%2FRAGWire-blue?style=for-the-badge&logo=readthedocs&logoColor=white" alt="Documentation"/>
  </a>
</p>

---

## Features

- **Document Loading** — PDF, DOCX, XLSX, PPTX and more via MarkItDown
- **LLM Metadata Extraction** — extracts company, doc type, fiscal period using your LLM; fully customisable via YAML
- **Smart Text Splitting** — markdown-aware and recursive chunking strategies
- **Multiple Embedding Providers** — Ollama, OpenAI, HuggingFace, Google, FastEmbed
- **Qdrant Vector Store** — dense, sparse, and hybrid search
- **Advanced Retrieval** — similarity, MMR, and hybrid search with metadata filtering
- **SHA256 Deduplication** — at both file and chunk level
- **Directory Ingestion** — ingest an entire folder with one call, with optional recursive scan
- **Env Var Substitution** — use `${VAR}` in `config.yaml` for secrets
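
The SHA256 deduplication listed above can be pictured with a short standard-library sketch. It is only an illustration of the idea: the helper names `file_sha256`, `chunk_sha256`, and `new_chunks` are hypothetical and are not part of RAGWire's API, and the in-memory `set` stands in for whatever store RAGWire actually uses to remember hashes.

```python
import hashlib
from pathlib import Path


def file_sha256(path, block_size=1 << 20):
    """Hash a file in blocks so large documents are not read into memory at once."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def chunk_sha256(text):
    """Hash one chunk of text (UTF-8 encoded)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def new_chunks(chunks, seen_hashes):
    """Return only chunks whose hash is not yet in seen_hashes, recording them as seen."""
    fresh = []
    for chunk in chunks:
        chunk_hash = chunk_sha256(chunk)
        if chunk_hash not in seen_hashes:
            seen_hashes.add(chunk_hash)
            fresh.append(chunk)
    return fresh


seen = set()
first = new_chunks(["revenue grew", "net income fell", "revenue grew"], seen)
second = new_chunks(["revenue grew", "new paragraph"], seen)
print(first)   # the repeated "revenue grew" is dropped
print(second)  # only "new paragraph" is new on the second run
```

Hashing content rather than file names is what makes a re-run safe: a renamed copy of the same file still produces the same digest.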

## Architecture

<p align="center">
  <img src="https://raw.githubusercontent.com/laxmimerit/RAGWire/main/assets/RAGWire-block-diagram.png" alt="RAGWire Architecture" width="100%"/>
</p>

## Installation

```bash
pip install ragwire

# With Ollama support (local, no API key)
pip install "ragwire[ollama]"

# With all providers
pip install "ragwire[all]"
```

## Quick Start

```python
from ragwire import RAGWire

rag = RAGWire("config.yaml")

# Ingest files — SHA256 deduplication, safe to re-run
stats = rag.ingest_documents(["data/Apple_10k_2025.pdf", "data/Microsoft_10k_2025.pdf"])
print(f"Processed: {stats['processed']}, Skipped: {stats['skipped']}, Chunks: {stats['chunks_created']}")

# Or ingest an entire directory
stats = rag.ingest_directory("data/", recursive=True)

# Basic retrieval — returns list of LangChain Document objects
results = rag.retrieve("What is the total revenue?", top_k=5)
for doc in results:
    print(doc.page_content[:300])
    print(doc.metadata["company_name"])   # str, lowercased — e.g. "apple"
    print(doc.metadata["fiscal_year"])    # list[int] — e.g. [2025]  ← NOT a plain int
    print(doc.metadata["file_name"])      # str — e.g. "Apple_10k_2025.pdf"

# Retrieval with explicit metadata filters
results = rag.retrieve(
    "What is the net income?",
    filters={"company_name": "apple", "fiscal_year": 2025}  # pass year as int
)

# OR logic within a field — matches any of the listed values
results = rag.retrieve("Compare revenue trends", filters={"fiscal_year": [2023, 2024, 2025]})

# Agent-controlled filtering (recommended for AI agents)
filters = rag.extract_filters("Apple's revenue in 2025")
# → {"company_name": "apple", "fiscal_year": 2025} or None
results = rag.retrieve("Apple's revenue in 2025", filters=filters)
```
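
The filter examples above suggest a simple matching rule: a list of values means "any of these", and stored metadata such as `fiscal_year` may itself be a list. The sketch below spells that rule out in plain Python. It rests on assumptions: fields are assumed to combine with AND, the `matches` helper is hypothetical, and in RAGWire the real filtering presumably happens inside the vector store rather than in Python.

```python
def matches(metadata, filters):
    """Return True if metadata satisfies every field in filters.

    Assumed semantics: fields are ANDed together; a list of wanted values
    is ORed within its field; stored values may be scalars or lists.
    """
    if not filters:  # None or {} means "no filtering"
        return True
    for field, wanted in filters.items():
        if field not in metadata:
            return False
        stored = metadata[field]
        stored_values = stored if isinstance(stored, list) else [stored]
        wanted_values = wanted if isinstance(wanted, list) else [wanted]
        if not any(value in stored_values for value in wanted_values):
            return False
    return True


doc_meta = {"company_name": "apple", "fiscal_year": [2025], "file_name": "Apple_10k_2025.pdf"}

print(matches(doc_meta, {"company_name": "apple", "fiscal_year": 2025}))  # True
print(matches(doc_meta, {"fiscal_year": [2023, 2024]}))                   # False
print(matches(doc_meta, None))                                            # True
```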

## Configuration

Copy `config.example.yaml` to `config.yaml` and edit it. Secrets can be injected from environment variables using `${VAR}` placeholders:

```yaml
vectorstore:
  url: "https://your-cluster.qdrant.io"
  api_key: "${QDRANT_API_KEY}"

llm:
  provider: "openai"
  model: "gpt-5.4-nano"
  api_key: "${OPENAI_API_KEY}"
```

Full example:

```yaml
embeddings:
  provider: "ollama"
  model: "qwen3-embedding:0.6b"
  base_url: "http://localhost:11434"

llm:
  provider: "ollama"
  model: "qwen3.5:9b"
  num_ctx: 16384

vectorstore:
  url: "http://localhost:6333"
  collection_name: "my_docs"
  use_sparse: true

retriever:
  search_type: "hybrid"
  top_k: 5
  auto_filter: false   # set true to enable LLM-based filter extraction from every query
```
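
To make the `${VAR}` placeholders concrete, here is a minimal sketch of how such substitution could be applied to an already-parsed config. It is not RAGWire's loader: `substitute_env` is a hypothetical helper, and raising an error for an unset variable is a choice made for this sketch, since the behaviour for missing variables is not described here.

```python
import os
import re

# Matches ${NAME} where NAME is a conventional environment variable name.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_env(value, env=None):
    """Recursively replace ${VAR} in strings found inside dicts and lists."""
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {key: substitute_env(item, env) for key, item in value.items()}
    if isinstance(value, list):
        return [substitute_env(item, env) for item in value]
    if isinstance(value, str):
        def replace(match):
            name = match.group(1)
            if name not in env:
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _PLACEHOLDER.sub(replace, value)
    return value  # numbers, booleans, None pass through unchanged


config = {
    "vectorstore": {"url": "https://your-cluster.qdrant.io", "api_key": "${QDRANT_API_KEY}"},
    "retriever": {"top_k": 5},
}
resolved = substitute_env(config, env={"QDRANT_API_KEY": "example-key"})
print(resolved["vectorstore"]["api_key"])  # example-key
```

Failing loudly on a missing variable avoids sending a literal `${QDRANT_API_KEY}` string as a credential.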

## Embedding Providers

```yaml
# Pick ONE of the blocks below; they are alternatives, not a single file.

# Ollama (local)
embeddings:
  provider: "ollama"
  model: "qwen3-embedding:0.6b"

# OpenAI
embeddings:
  provider: "openai"
  model: "text-embedding-3-small"

# HuggingFace (local)
embeddings:
  provider: "huggingface"
  model_name: "sentence-transformers/all-MiniLM-L6-v2"

# Google
embeddings:
  provider: "google"
  model: "models/embedding-001"
```
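
Each provider depends on an optional package, so it can help to check what is installed before choosing one. The sketch below uses only `importlib` and is an assumption-laden illustration: the `PROVIDER_MODULES` mapping and the `missing_extra` helper are hypothetical, and the import names are the usual module names for the packages behind the `ragwire` extras.

```python
import importlib.util

# Provider name -> (import name, pip extra). Illustrative mapping, not RAGWire's.
PROVIDER_MODULES = {
    "ollama": ("langchain_ollama", "ragwire[ollama]"),
    "openai": ("langchain_openai", "ragwire[openai]"),
    "huggingface": ("langchain_huggingface", "ragwire[huggingface]"),
    "google": ("langchain_google_genai", "ragwire[google]"),
    "fastembed": ("fastembed", "ragwire[fastembed]"),
}


def missing_extra(provider):
    """Return the pip extra to install for provider, or None if its module is importable."""
    if provider not in PROVIDER_MODULES:
        raise ValueError(f"unknown provider: {provider!r}")
    module_name, extra = PROVIDER_MODULES[provider]
    if importlib.util.find_spec(module_name) is None:
        return extra
    return None


for name in PROVIDER_MODULES:
    extra = missing_extra(name)
    status = "installed" if extra is None else f'missing, try: pip install "{extra}"'
    print(f"{name}: {status}")
```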

## Component Usage

```python
from ragwire import (
    MarkItDownLoader,
    get_splitter,
    get_markdown_splitter,
    get_embedding,
    QdrantStore,
    MetadataExtractor,
    hybrid_search,
    mmr_search,
)

# Load a document
loader = MarkItDownLoader()
result = loader.load("document.pdf")

# Split text
splitter = get_markdown_splitter(chunk_size=10000, chunk_overlap=2000)
chunks = splitter.split_text(result["text_content"])

# Embeddings
embedding = get_embedding({"provider": "ollama", "model": "qwen3-embedding:0.6b"})

# Vector store
store = QdrantStore(config={"url": "http://localhost:6333"}, embedding=embedding)
store.set_collection("my_collection")
vectorstore = store.get_store()
```
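
The import list above includes `mmr_search`, but its signature is not shown here. As a rough illustration of the underlying technique only, the following pure-Python sketch selects documents by maximal marginal relevance (MMR): each pick balances similarity to the query against similarity to documents already chosen. The `cosine` and `mmr_select` helpers and the toy two-dimensional vectors are hypothetical and unrelated to RAGWire's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance.

    lambda_mult=1.0 ranks purely by relevance; lower values favour diversity.
    """
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]  # docs 0 and 1 are near-duplicates

print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
print(mmr_select(query, docs, k=2, lambda_mult=0.3))  # [0, 2]
```

With `lambda_mult=0.3` the near-duplicate document 1 is skipped in favour of the less similar document 2, which is the behaviour MMR is meant to provide.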

## Project Structure

```
ragwire/
├── core/          # Config loader + RAGWire orchestrator
├── loaders/       # MarkItDown document converter
├── processing/    # Text splitters + SHA256 hashing
├── metadata/      # Pydantic schema + LLM extractor
├── embeddings/    # Multi-provider embedding factory
├── vectorstores/  # Qdrant wrapper with hybrid search
├── retriever/     # Similarity, MMR, hybrid retrieval
└── utils/         # Logging
```

## Troubleshooting

| Error | Fix |
|-------|-----|
| Qdrant connection refused | `docker run -p 6333:6333 qdrant/qdrant` |
| `markitdown[pdf]` missing | `pip install "markitdown[pdf]"` |
| Ollama model not found | `ollama pull <model-name>` |
| `fastembed` missing | `pip install fastembed` (needed for hybrid search) |
| Embedding dimension mismatch | Set `force_recreate: true` in the config for one run so the collection is rebuilt, then set it back to `false` |
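
For the "connection refused" case, a quick way to tell whether anything is listening on the Qdrant port is a plain TCP check. This is a generic standard-library sketch, not a RAGWire or Qdrant API: the `port_is_open` helper is hypothetical, and the defaults assume the local `http://localhost:6333` URL used in the examples above.

```python
import socket


def port_is_open(host="localhost", port=6333, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


if port_is_open():
    print("Something is listening on localhost:6333")
else:
    print("Nothing is listening on localhost:6333; start Qdrant first")
```

A successful check only shows that the port accepts connections; it does not confirm that the listener is Qdrant or that the API key is valid.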

## License

MIT © 2026 [KGP Talkie Private Limited](https://kgptalkie.com)

## Links

<p align="center">
  <a href="https://laxmimerit.github.io/RAGWire/">
    <img src="https://img.shields.io/badge/📖%20Documentation-Visit%20Docs-2ea44f?style=for-the-badge&logo=gitbook&logoColor=white" alt="Documentation"/>
  </a>
  &nbsp;
  <a href="https://github.com/laxmimerit/ragwire">
    <img src="https://img.shields.io/badge/⭐%20GitHub-Star%20the%20Repo-181717?style=for-the-badge&logo=github&logoColor=white" alt="GitHub"/>
  </a>
  &nbsp;
  <a href="https://youtube.com/kgptalkie">
    <img src="https://img.shields.io/badge/▶%20YouTube-KGP%20Talkie-FF0000?style=for-the-badge&logo=youtube&logoColor=white" alt="YouTube"/>
  </a>
</p>

- 🌐 Website: [kgptalkie.com](https://kgptalkie.com)
- 📖 Docs: [laxmimerit.github.io/RAGWire](https://laxmimerit.github.io/RAGWire/)
- 💻 GitHub: [github.com/laxmimerit/ragwire](https://github.com/laxmimerit/ragwire)
- 📧 Email: udemy@kgptalkie.com
