Metadata-Version: 2.4
Name: omnirag
Version: 2.0.1
Summary: India's First Multi-Language Voice RAG Framework
Home-page: https://github.com/Giri530/omnirag
Author: Girinath V
Author-email: girinathv48@gmail.com
Project-URL: Bug Tracker, https://github.com/Giri530/omnirag/issues
Project-URL: Documentation, https://github.com/Giri530/omnirag#readme
Project-URL: Source Code, https://github.com/Giri530/omnirag
Keywords: rag,retrieval-augmented-generation,llm,ai,machine-learning,nlp,multi-language,voice,tamil,hindi,translation,faiss,vector-database,qwen,huggingface
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: transformers>=4.30.0
Requires-Dist: torch>=2.0.0
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: faiss-cpu>=1.7.4
Requires-Dist: numpy>=1.24.0
Requires-Dist: accelerate>=0.20.0
Requires-Dist: PyPDF2>=3.0.0
Requires-Dist: ddgs>=0.3.0
Requires-Dist: requests>=2.31.0
Requires-Dist: deep-translator>=1.11.4
Requires-Dist: langdetect>=1.0.9
Requires-Dist: pyttsx3>=2.90
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=4.0.0; extra == "dev"
Requires-Dist: twine>=4.0.0; extra == "dev"
Requires-Dist: build>=0.10.0; extra == "dev"
Provides-Extra: voice-input
Requires-Dist: SpeechRecognition>=3.10.0; extra == "voice-input"
Requires-Dist: pyaudio>=0.2.13; extra == "voice-input"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# 🚀 OmniRAG v2.0 - Multi-Language Voice RAG

[![PyPI version](https://badge.fury.io/py/omnirag.svg)](https://pypi.org/project/omnirag/)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Downloads](https://pepy.tech/badge/omnirag)](https://pepy.tech/project/omnirag)

**India's First Multi-Language Voice RAG Framework** 🇮🇳

Intelligent RAG combining **Liquid + Agentic + Chain** architectures with **unique features**:

- 🌍 **Smart Multi-Language Translation** - 27+ languages including Tamil and Hindi
- 🎤 **Voice Input & Output** - Speak questions, hear answers
- 🧠 **Adaptive RAG** - Automatically adjusts to user expertise level

---

## 🆕 What's New in v2.0?

### ✨ Feature 1: Smart Post-Retrieval Translation

**Revolutionary architecture**: Documents stay in **original language**, translation happens **AFTER retrieval**!

**Why this is better:**
- ✅ Better embeddings (preserve semantic meaning)
- ✅ No storage duplication
- ✅ One document → Many output languages
- ✅ 70% more efficient than traditional approaches

```python
from omnirag import OmniRAG

# Documents in English
rag = OmniRAG(output_language="Tamil")
rag.add_documents(["AI helps solve complex problems."])

# Query in English, get Tamil answer!
result = rag.query("What is AI?")
print(result['answer'])
# Example output (exact wording depends on the model):
# "செயற்கை நுண்ணறிவு சிக்கலான சிக்கல்களைத் தீர்க்க உதவுகிறது."
```
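
The post-retrieval idea can be pictured with a minimal sketch. This is not OmniRAG's actual implementation: `retrieve`, `answer_in`, and `fake_translate` are hypothetical names, word overlap stands in for embedding search, and the translator only tags the text so the example runs offline.

```python
from typing import Callable, Dict, List


def retrieve(query: str, documents: List[str], top_k: int = 1) -> List[str]:
    """Rank documents by word overlap with the query (stand-in for vector search)."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def answer_in(
    query: str,
    documents: List[str],
    translate: Callable[[str, str], str],
    target_language: str,
) -> str:
    """Retrieve in the original language, then translate only the final answer."""
    context = retrieve(query, documents)
    answer = " ".join(context)  # a real system would call an LLM here
    return translate(answer, target_language)


def fake_translate(text: str, target_language: str) -> str:
    """Hypothetical translator: tags the text instead of calling a service."""
    return f"[{target_language}] {text}"


docs = ["AI helps solve complex problems.", "Bread is baked in an oven."]
# The same stored document serves several output languages.
outputs: Dict[str, str] = {
    lang: answer_in("What problems does AI solve?", docs, fake_translate, lang)
    for lang in ("Tamil", "Spanish")
}
```

Because translation is the last step, the document store holds one copy of each text regardless of how many output languages are requested.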

**Supported Languages (27+):**
- **Indian:** Tamil, Hindi, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Bengali
- **European:** Spanish, French, German, Italian, Portuguese, Russian, Polish, Dutch, Turkish
- **Asian:** Chinese, Japanese, Korean, Vietnamese, Thai, Indonesian, Malay
- **Other:** Arabic, English

### ✨ Feature 2: Voice Input & Output

**First open-source RAG with built-in voice support!**

```python
# Voice output (text-to-speech)
rag = OmniRAG(enable_voice=True, output_language="Tamil")
result = rag.query("What is Python?", speak_answer=True)
# The answer is spoken aloud in Tamil! 🔊

# Voice input (speech-to-text) - requires microphone
result = rag.voice_query()
# Speak your question, hear the answer!
```

---

## 🎯 What is OmniRAG?

OmniRAG is an advanced Retrieval-Augmented Generation system that combines three powerful RAG techniques:

### 🌊 Liquid RAG
Automatically adapts answers to user expertise level:
- **Beginner**: Simple explanations with examples
- **Intermediate**: Balanced technical content
- **Expert**: Deep technical details
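
How OmniRAG detects the level is not described here. One plausible approach, shown purely as an illustration, is a keyword heuristic over the query text. The `detect_level` function and both cue sets are hypothetical.

```python
import re

# Hypothetical cue words; a real detector might use an LLM or a classifier instead.
BEGINNER_CUES = {"simple", "basics", "beginner", "eli5", "introduction"}
EXPERT_CUES = {"complexity", "asymptotic", "internals", "benchmark", "implementation"}


def detect_level(query: str) -> str:
    """Classify a query as 'beginner', 'intermediate', or 'expert' from its wording."""
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    beginner_hits = len(words & BEGINNER_CUES)
    expert_hits = len(words & EXPERT_CUES)
    if expert_hits > beginner_hits:
        return "expert"
    if beginner_hits > expert_hits:
        return "beginner"
    return "intermediate"  # no clear signal either way
```

The detected label could then select a prompt template (simple, balanced, or detailed) before the answer is generated.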

### 🤖 Agentic RAG
Intelligently chooses the best information source:
- **VectorDB**: For local documents
- **Web Search**: For current information
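
A rough sketch of such a routing decision, assuming a simple keyword rule. The "Enable Web Search" section mentions that queries about "latest" or "recent" go to the web; the other cue words and the `choose_tool` name are assumptions, not OmniRAG internals.

```python
import re

# "latest" and "recent" come from this README; the remaining cues are assumptions.
RECENCY_CUES = {"latest", "recent", "today", "current", "news"}


def choose_tool(query: str, web_search_enabled: bool = True) -> str:
    """Return 'web' for time-sensitive queries, otherwise 'vectordb'."""
    words = set(re.findall(r"[a-z0-9]+", query.lower()))
    if web_search_enabled and words & RECENCY_CUES:
        return "web"
    return "vectordb"
```

With web search disabled, every query falls back to the local vector database.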

### ⛓️ Chain RAG
Handles complex multi-part questions:
- Breaks down complex queries
- Answers each part separately
- Synthesizes coherent final answer
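
The three steps above can be sketched as follows. This is a simplified illustration with hypothetical `decompose` and `synthesize` helpers: it splits on sentence boundaries with a regular expression, whereas the library may well use the LLM itself for both steps.

```python
import re
from typing import List


def decompose(query: str) -> List[str]:
    """Split a multi-part query into sub-queries at sentence ends and newlines."""
    parts = re.split(r"(?<=[.?!])\s+|\n+", query.strip())
    return [part.strip() for part in parts if part.strip()]


def synthesize(sub_answers: List[str]) -> str:
    """Join sub-answers into one response (a real system would ask an LLM to merge them)."""
    return " ".join(sub_answers)


sub_queries = decompose(
    "Compare Python vs Java for machine learning.\n"
    "Which is better for beginners?\n"
    "What are the performance differences?"
)
```

A regex split like this breaks on abbreviations such as "e.g.", which is one reason an LLM-based decomposer is usually preferred in practice.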

---

## ✨ All Features

**v2.0 NEW:**
- 🌍 **Multi-Language Translation** (27+ languages)
- 🎤 **Voice Input & Output** (speak & hear)
- 🔤 **Full Language Names** ("Tamil" not "ta")
- 🔧 **UTF-8 Support** (perfect Tamil/Hindi display)

**v1.0 CORE:**
- ✅ **PDF Support** - Load PDF files directly
- ✅ **Multiple LLM Models** - Qwen, Flan-T5, Mistral, Phi-2
- ✅ **FAISS Vector DB** - Fast similarity search
- ✅ **Web Search** - DuckDuckGo integration (free!)
- ✅ **Smart User Detection** - Auto expertise level detection
- ✅ **Query Decomposition** - Handles complex questions
- ✅ **Fast Caching** - 3x speedup on repeated queries
- ✅ **100% FREE** - No API costs!
- ✅ **Works on CPU** - No GPU required
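
The caching feature listed above can be pictured as memoizing answers per query. The sketch below is an assumption about the general technique, not OmniRAG's `cache.py`. `QueryCache` and its key format are hypothetical.

```python
from typing import Callable, Dict, Tuple


class QueryCache:
    """Memoize answers keyed by normalized query text and output language."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}
        self.hits = 0

    def get_or_compute(
        self, query: str, language: str, compute: Callable[[str], str]
    ) -> str:
        # Collapse case and whitespace so trivially different queries share an entry.
        key = (" ".join(query.lower().split()), language.lower())
        if key in self._store:
            self.hits += 1
            return self._store[key]
        answer = compute(query)
        self._store[key] = answer
        return answer

    def clear(self) -> None:
        """Drop all cached answers, e.g. after new documents are added."""
        self._store.clear()
```

Including the language in the key keeps a Tamil answer from being returned for a later Spanish request with the same question.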

---

## 📦 Installation

```bash
pip install omnirag
```

### With Voice Input (Optional)

**Windows:**
```bash
pip install pipwin
pipwin install pyaudio
pip install omnirag[voice-input]
```

**Mac:**
```bash
brew install portaudio
pip install "omnirag[voice-input]"
```

**Linux:**
```bash
sudo apt-get install portaudio19-dev
pip install "omnirag[voice-input]"
```

### From Source

```bash
git clone https://github.com/Giri530/omnirag.git
cd omnirag
pip install -e .
```

---

## 🚀 Quick Start

### Basic Usage

```python
from omnirag import OmniRAG

# Initialize
rag = OmniRAG(model_name="google/flan-t5-small")

# Add documents
rag.add_documents([
    "Python is a programming language.",
    "It is used for AI and data science."
])

# Query
result = rag.query("What is Python?")
print(result['answer'])
```

### Multi-Language Example

```python
from omnirag import OmniRAG

# Initialize with Spanish output
rag = OmniRAG(
    model_name="google/flan-t5-small",
    output_language="Spanish"  # or "Tamil", "Hindi", etc.
)

# Add English documents
rag.add_documents([
    "AI helps solve complex problems.",
    "Machine Learning is a subset of AI."
])

# Query in English, get Spanish answer!
result = rag.query("What is AI?")
print(result['answer'])
# Example output (exact wording depends on the model):
# "La IA ayuda a resolver problemas complejos."
```

### Voice Example

```python
from omnirag import OmniRAG

# Initialize with voice
rag = OmniRAG(
    enable_voice=True,
    output_language="Tamil"
)

rag.add_documents(["Python is great for AI."])

# Text input, voice output
result = rag.query("What is Python?", speak_answer=True)
# The answer is spoken aloud in Tamil! 🔊

# Voice input, voice output (requires microphone)
result = rag.voice_query()
# Speak question, hear answer!
```

---

## 💡 Usage Examples

### Load Different File Types

```python
# PDF files
rag.load_from_file("research_paper.pdf")

# Text files
rag.load_from_file("notes.txt")

# JSON data
rag.load_from_file("data.json")

# Entire folder
rag.load_from_folder("./documents")

# With chunking for large files
rag.load_from_file("big_file.pdf", chunk_size=500)

# Direct text
rag.add_documents([
    "Python is great for ML.",
    "Qwen is a powerful language model."
])
```
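
To make the `chunk_size` option concrete, here is a minimal chunking sketch. It assumes the size is measured in characters and that chunks break on whitespace. The README does not say whether OmniRAG counts characters, words, or tokens, and `chunk_text` is a hypothetical helper.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 500) -> List[str]:
    """Split text into chunks of at most chunk_size characters, breaking on whitespace.

    A single word longer than chunk_size is kept whole as its own chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Smaller chunks give more precise retrieval hits, while larger chunks keep more surrounding context in each hit.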

### Different Output Languages

```python
# Default: Spanish
rag = OmniRAG(output_language="Spanish")

# Query 1: Spanish (default)
result1 = rag.query("What is AI?")

# Query 2: Override to Tamil
result2 = rag.query("What is ML?", output_language="Tamil")

# Query 3: Override to French
result3 = rag.query("What is DL?", output_language="French")
```

### Full Language Names

```python
# All these work!
rag = OmniRAG(output_language="Spanish")  # ✅
rag = OmniRAG(output_language="spanish")  # ✅
rag = OmniRAG(output_language="es")       # ✅

# Same for all languages
rag = OmniRAG(output_language="Tamil")    # ✅
rag = OmniRAG(output_language="Hindi")    # ✅
```
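
Accepting all three spellings amounts to normalizing the input to a language code. The sketch below shows one way to do that. The table is a partial, illustrative list of ISO 639-1 codes and `normalize_language` is a hypothetical helper, not the library's own lookup.

```python
# Partial, illustrative table of ISO 639-1 codes; not OmniRAG's internal mapping.
LANGUAGE_CODES = {
    "tamil": "ta",
    "hindi": "hi",
    "spanish": "es",
    "french": "fr",
    "german": "de",
    "english": "en",
}


def normalize_language(name: str) -> str:
    """Map 'Spanish', 'spanish', or 'es' to the code 'es'; raise on unknown input."""
    key = name.strip().lower()
    if key in LANGUAGE_CODES:
        return LANGUAGE_CODES[key]
    if key in LANGUAGE_CODES.values():
        return key
    raise ValueError(f"Unsupported language: {name!r}")
```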

### Complex Queries

```python
# OmniRAG automatically breaks down and answers
result = rag.query("""
Compare Python vs Java for machine learning.
Which is better for beginners?
What are the performance differences?
""")

print(result['answer'])
```

### Enable Web Search

```python
rag = OmniRAG(
    model_name="google/flan-t5-small",
    enable_web_search=True  # Free DuckDuckGo search
)

# Queries about "latest" or "recent" topics automatically use web search
result = rag.query("Latest AI developments in 2025")
```

---

## 🎨 Supported Models

### Qwen Models (Recommended!)

```python
# Fast & Efficient
rag = OmniRAG(model_name="Qwen/Qwen2.5-0.5B-Instruct")

# Balanced (Best Choice!)
rag = OmniRAG(model_name="Qwen/Qwen2.5-1.5B-Instruct")

# High Quality
rag = OmniRAG(model_name="Qwen/Qwen2.5-3B-Instruct")
```

### Flan-T5 Models

```python
# Small & Fast
rag = OmniRAG(model_name="google/flan-t5-small")   # 80M params

# Medium
rag = OmniRAG(model_name="google/flan-t5-base")    # 250M params

# Larger & Better
rag = OmniRAG(model_name="google/flan-t5-large")   # 780M params
```

---

## 🏗️ Architecture

```
User Query
    ↓
🌊 LIQUID RAG: Detect expertise level
    ↓
⛓️ CHAIN RAG: Break into sub-queries (if complex)
    ↓
FOR EACH SUB-QUERY:
    ↓
🤖 AGENTIC RAG: Choose tool (VectorDB or Web)
    ↓
    Retrieve relevant chunks (ORIGINAL language)
    ↓
🌊 LIQUID RAG: Transform to user level
    ↓
    Generate sub-answer
    ↓
⛓️ CHAIN RAG: Synthesize all sub-answers
    ↓
🌍 TRANSLATION: Convert to target language (NEW!)
    ↓
🔊 VOICE: Speak answer (if enabled) (NEW!)
    ↓
✨ Perfect Answer!
```

---

## 📊 Performance

| Model | Size | RAM | Speed | Quality |
|-------|------|-----|-------|---------|
| flan-t5-small | 80M | 0.5GB | ⚡⚡⚡ | ⭐⭐ |
| flan-t5-base | 250M | 1GB | ⚡⚡⚡ | ⭐⭐⭐ |
| **Qwen-0.5B** | 0.5B | 1GB | ⚡⚡ | ⭐⭐⭐ |
| **Qwen-1.5B** | 1.5B | 2GB | ⚡⚡ | ⭐⭐⭐⭐ |
| Qwen-3B | 3B | 4GB | ⚡ | ⭐⭐⭐⭐⭐ |

**Recommended:** 
- **For testing:** `flan-t5-small` (fast!)
- **For production:** `flan-t5-base` or `Qwen-0.5B` (balanced)
- **For quality:** `Qwen-1.5B` (best balance of quality and speed), or `Qwen-3B` for the highest quality rating in the table above

---

## 🔧 Configuration

```python
rag = OmniRAG(
    # LLM Model
    model_name="google/flan-t5-small",
    
    # Embedding Model
    embedding_model="all-MiniLM-L6-v2",
    
    # Web Search
    enable_web_search=True,
    
    # NEW: Output Language
    output_language="Tamil",  # or "auto" for no translation
    
    # NEW: Voice I/O
    enable_voice=True,
    
    # Verbose Output
    verbose=True
)
```

---

## 📖 API Reference

### OmniRAG Class

#### `__init__(model_name, embedding_model, enable_web_search, verbose, output_language, enable_voice)`
Initialize OmniRAG system.

**New Parameters:**
- `output_language` (str): Target language ("Tamil", "Spanish", "auto", etc.)
- `enable_voice` (bool): Enable voice input/output

#### `query(user_query, output_language=None, speak_answer=False)`
Query the system and get answer.

**New Parameters:**
- `output_language` (str): Override default language for this query
- `speak_answer` (bool): Speak the answer aloud

**Returns:**
```python
{
    'answer': str,              # Generated answer
    'sources': list,            # Retrieved sources
    'user_level': str,          # Detected expertise level
    'output_language': str,     # Output language code
    'spoken': bool,             # Whether answer was spoken
}
```
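
For type-checked code, the returned dictionary can be described with a `TypedDict`. This is a convenience sketch rather than something the package exports: the key names come from the listing above, while `QueryResult`, `summarize`, and the `List[str]` element type for `sources` are assumptions.

```python
from typing import List, TypedDict


class QueryResult(TypedDict):
    """Shape of the dictionary documented above; key names come from the README."""

    answer: str           # Generated answer
    sources: List[str]    # Retrieved sources (element type is an assumption)
    user_level: str       # Detected expertise level
    output_language: str  # Output language code
    spoken: bool          # Whether the answer was spoken


def summarize(result: QueryResult) -> str:
    """Format a one-line summary of a result for logging."""
    return (
        f"[{result['output_language']}/{result['user_level']}] "
        f"{result['answer']} ({len(result['sources'])} sources)"
    )
```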

#### `voice_query(output_language=None)` **NEW!**
Voice-to-voice query (requires microphone).

#### `save_to_file(result, filename)` **NEW!**
Save result to file with UTF-8 encoding.
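
The reason explicit UTF-8 matters can be shown with a small standalone sketch. It assumes a JSON output format, which the README does not specify for `save_to_file`. `save_result` and `load_result` are hypothetical helpers.

```python
import json
from pathlib import Path
from typing import Any, Dict


def save_result(result: Dict[str, Any], filename: str) -> None:
    """Write a result dict as JSON, keeping Tamil/Hindi text readable on disk."""
    with open(filename, "w", encoding="utf-8") as handle:
        # ensure_ascii=False stores non-ASCII characters as-is, not as \uXXXX escapes.
        json.dump(result, handle, ensure_ascii=False, indent=2)


def load_result(filename: str) -> Dict[str, Any]:
    """Read a result back using the same explicit encoding."""
    return json.loads(Path(filename).read_text(encoding="utf-8"))
```

Without `encoding="utf-8"`, Python falls back to the platform default encoding, which on some Windows setups cannot represent Indic scripts.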

#### Other Methods (from v1.0)

- `load_from_file(file_path, chunk_size=None)`
- `load_from_folder(folder_path, file_extensions=None)`
- `add_documents(documents)`
- `get_stats()`
- `clear_cache()`

---

## 🌍 Use Cases

### Customer Support (Multi-Language)
```python
rag = OmniRAG(output_language="Hindi", enable_voice=True)
rag.load_from_file("product_manual.pdf")

# Hindi-speaking customer
result = rag.query("How do I reset my device?", speak_answer=True)
# Answer in Hindi + spoken aloud!
```

### Educational Platform (Tamil)
```python
rag = OmniRAG(output_language="Tamil")
rag.load_from_file("class10_science.pdf")

# Student query
result = rag.query("What is photosynthesis?")
# Answer in Tamil!
```

### Accessibility Tool
```python
# For visually impaired users
rag = OmniRAG(enable_voice=True)
rag.load_from_folder("./personal_docs")

# Completely hands-free
while True:
    result = rag.voice_query()
    # Note: this checks the generated answer (not the spoken question) for "exit"
    if "exit" in result.get('answer', '').lower():
        break
```

---

## 🌟 Why OmniRAG?

| Feature | LangChain | LlamaIndex | **OmniRAG** |
|---------|-----------|------------|-------------|
| **Post-Retrieval Translation** | ❌ No | ❌ No | ✅ **YES** |
| **Built-in Voice I/O** | ❌ No | ❌ No | ✅ **YES** |
| **Indian Language Support** | ⚠️ Basic | ⚠️ Basic | ✅ **Native** |
| **Full Language Names** | ❌ No | ❌ No | ✅ **YES** |
| **Beginner Friendly** | ⚠️ Complex | ⚠️ Complex | ✅ **Simple** |
| **100% Free** | ✅ Yes | ✅ Yes | ✅ **Yes** |

---

## 🛠️ Development

### Install for Development

```bash
git clone https://github.com/Giri530/omnirag.git
cd omnirag
pip install -e ".[dev]"
```

### Project Structure

```
omnirag/
├── omnirag/
│   ├── __init__.py
│   ├── omnirag.py              # Main class
│   ├── smart_translator.py     # NEW: Translation
│   ├── voice_processor.py      # NEW: Voice I/O
│   ├── liquid_analyzer.py      # User level detection
│   ├── chain_decomposer.py     # Query decomposition
│   ├── agentic_planner.py      # Tool selection
│   ├── content_transformer.py  # Content adaptation
│   ├── vectordb_tool.py        # FAISS database
│   ├── web_search_tool.py      # Web search
│   ├── llm_client.py           # LLM wrapper
│   └── cache.py                # Caching
├── examples/
│   └── quickstart.py
├── setup.py
├── pyproject.toml
├── requirements.txt
└── README.md
```

---

## 📝 Requirements

- Python 3.8+
- 1-4GB RAM (depends on model)
- CPU or GPU (GPU recommended for speed)

**Core Dependencies:**
- transformers, torch, sentence-transformers
- faiss-cpu, PyPDF2, ddgs

**New Dependencies (v2.0):**
- deep-translator, langdetect (translation)
- pyttsx3 (voice output)
- SpeechRecognition, pyaudio (voice input - optional)

---

## 🤝 Contributing

Contributions welcome! Please:

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing`)
5. Open a pull request

---

## 📄 License

MIT License - Free for commercial and personal use!

See [LICENSE](LICENSE) for details.

---

## 🙏 Acknowledgments

- **HuggingFace** for transformers library
- **Qwen Team** for excellent models
- **FAISS** for fast vector search
- **Sentence Transformers** for embeddings
- **Deep Translator** for translation API
- **pyttsx3** for text-to-speech

---

## 📧 Contact

- **GitHub**: [@Giri530](https://github.com/Giri530)
- **Email**: girinathv48@gmail.com
- **Issues**: [Report bugs or request features](https://github.com/Giri530/omnirag/issues)

---

## 🌟 Star History

If you find OmniRAG useful, please ⭐ star the repo!

---

## 📚 Citation

```bibtex
@software{omnirag2025,
  title={OmniRAG: Multi-Language Voice RAG Framework},
  author={Girinath V},
  year={2025},
  version={2.0.1},
  url={https://github.com/Giri530/omnirag}
}
```

---

## 🎯 Roadmap

**v2.0 (Current):**
- ✅ Multi-language translation (27+ languages)
- ✅ Voice input and output
- ✅ UTF-8 encoding support

**v2.1 (Planned):**
- [ ] More file formats (DOCX, XLSX)
- [ ] Custom translation models
- [ ] Voice language selection
- [ ] GUI interface

**v3.0 (Future):**
- [ ] Real-time translation
- [ ] Multi-modal RAG (images)
- [ ] Cloud deployment
- [ ] API server

---

**Made with ❤️ in India 🇮🇳**

**100% FREE Forever!**

**Happy RAG-ing! 🚀**
