Metadata-Version: 2.4
Name: docnav
Version: 1.0.1
Summary: AI-powered document querying with citations
Home-page: https://github.com/ailaysa/docnav
Author: Mukesh Anand G
Author-email: Mukesh Anand G <ai.mukeshanandg@gmail.com>
Maintainer-email: Mukesh Anand G <ai.mukeshanandg@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/Mukesh-Anand-G/DocNav
Keywords: document processing,RAG,ai,llm,search,query,rag,retrieval augmented generation,document management,text analysis,citations,openai,gemini,claude
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Indexing
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.21.0
Requires-Dist: tqdm>=4.62.0
Requires-Dist: pathlib2>=2.3.6; python_version < "3.4"
Provides-Extra: full
Requires-Dist: sentence-transformers>=2.0.0; extra == "full"
Requires-Dist: python-docx>=0.8.11; extra == "full"
Requires-Dist: PyPDF2>=3.0.0; extra == "full"
Requires-Dist: pandas>=1.3.0; extra == "full"
Requires-Dist: python-pptx>=0.6.21; extra == "full"
Requires-Dist: openai>=1.0.0; extra == "full"
Requires-Dist: google-generativeai>=0.3.0; extra == "full"
Requires-Dist: anthropic>=0.8.0; extra == "full"
Provides-Extra: ocr
Requires-Dist: pdf2image>=1.16.0; extra == "ocr"
Requires-Dist: pytesseract>=0.3.10; extra == "ocr"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: flake8>=4.0.0; extra == "dev"
Requires-Dist: mypy>=0.991; extra == "dev"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# DocNav: AI-Powered Document Querying with Citations

[![PyPI version](https://badge.fury.io/py/docnav.svg)](https://badge.fury.io/py/docnav)
[![Python versions](https://img.shields.io/pypi/pyversions/docnav.svg)](https://pypi.org/project/docnav/)
[![License](https://img.shields.io/pypi/l/docnav.svg)](https://pypi.org/project/docnav/)
[![Downloads](https://img.shields.io/pypi/dm/docnav.svg)](https://pypi.org/project/docnav/)

DocNav is a document management and querying system: ask questions about your documents and get answers backed by source citations. It can be used from the command line or through a Python API.

## ✨ Features

- **📚 Multi-format Support**: PDF, DOCX, TXT, MD, CSV, Excel, PowerPoint
- **🧠 Smart Chunking**: Intelligent document segmentation for better context
- **🔍 Vector Search**: Fast similarity-based document retrieval
- **🤖 Multiple LLMs**: OpenAI, Gemini, Claude support
- **📝 Citations**: Answers include source document references
- **⚡ Fast Processing**: Parallel document processing with progress bars
- **🎯 Industry Ready**: Production-grade with error handling and logging
- **🔧 Flexible**: CLI tool and Python API

## 🚀 Quick Start

### Installation

```bash
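# Basic installation
pip install docnav

# Full installation with all optional dependencies
# (quote the extras so shells such as zsh do not expand the brackets)
pip install "docnav[full]"

# With OCR support for scanned PDFs
# (pytesseract and pdf2image also need the Tesseract and Poppler system packages)
pip install "docnav[full,ocr]"

# Development tools (pytest, black, flake8, mypy)
pip install "docnav[dev]"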
```

### CLI Usage

```bash
# Create a new corpus
docnav new mydocs

# Add documents
docnav add mydocs documents/ reports.pdf

# Query your documents
docnav query mydocs "What are the main findings?"

# Use different LLM providers
docnav query mydocs "Summarize the budget" --provider gemini --model gemini-2.5-flash
docnav query mydocs "Extract key dates" --provider claude --model claude-3-haiku-20240307

# List documents
docnav list mydocs

# Get statistics
docnav stats mydocs

# Quick query without creating a corpus
docnav quick document.pdf "What is this about?"
```

### Python API Usage

```python
from docnav import Corpus

# Create or load a corpus
corpus = Corpus("mydocs")

# Add documents
corpus.add(["document.pdf", "report.docx"])

# Ask questions
answer = corpus.ask("What are the main findings?")
print(answer.text)

# Access sources
for source in answer.sources:
    print(f"Source: {source.metadata['file_name']}")
    print(f"Content: {source.text[:200]}...")

# List all documents
documents = corpus.list()
for doc in documents:
    print(f"{doc['file_name']} ({doc['chunks']} chunks)")

# Get statistics
stats = corpus.stats()
print(f"Total documents: {stats['total_documents']}")
print(f"Total chunks: {stats['total_chunks']}")
```

## 📋 Commands Reference

### Corpus Management
- `docnav new <name>` - Create new corpus
- `docnav add <corpus> <files>` - Add documents to corpus
- `docnav list <corpus>` - List documents in corpus
- `docnav stats <corpus>` - Show corpus statistics
- `docnav remove <corpus> <file>` - Remove specific document
- `docnav clear <corpus>` - Clear entire corpus
- `docnav corpora` - List all available corpora

### Querying
- `docnav query <corpus> "<question>"` - Ask question about corpus
- `docnav quick <file> "<question>"` - Quick query single document

### Options
- `--provider <openai|gemini|claude>` - LLM provider
- `--model <model_name>` - Specific model to use
- `--api-key <key>` - API key (overrides environment)
- `--top-k <number>` - Number of chunks to consider (default: 5)
- `--use-ocr` - Use OCR for scanned PDFs
- `--details` - Show detailed information

## 🔧 Configuration

### Environment Variables

Set these for different LLM providers:

```bash
# OpenAI
export OPENAI_API_KEY="your-openai-key"

# Google Gemini
export GOOGLE_API_KEY="your-gemini-key"

# Anthropic Claude
export ANTHROPIC_API_KEY="your-claude-key"
```

### Default Models

- **OpenAI**: `gpt-3.5-turbo`
- **Gemini**: `gemini-2.5-flash`
- **Claude**: `claude-3-haiku-20240307`
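
The environment variables and default models above can be combined into a small lookup. The sketch below is illustrative only: `resolve_llm_settings` is a hypothetical helper, not part of DocNav, and the precedence it applies (explicit argument first, then environment, then default model) is an assumption based on the `--api-key` option description.

```python
import os
from typing import Dict, Optional, Tuple

# Names and values copied from this README; the lookup logic is an assumption.
ENV_VARS: Dict[str, str] = {
    "openai": "OPENAI_API_KEY",
    "gemini": "GOOGLE_API_KEY",
    "claude": "ANTHROPIC_API_KEY",
}
DEFAULT_MODELS: Dict[str, str] = {
    "openai": "gpt-3.5-turbo",
    "gemini": "gemini-2.5-flash",
    "claude": "claude-3-haiku-20240307",
}


def resolve_llm_settings(
    provider: str,
    model: Optional[str] = None,
    api_key: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (model, api_key), preferring explicit arguments over defaults."""
    if provider not in ENV_VARS:
        raise ValueError(f"Unknown provider: {provider!r}")
    key = api_key or os.environ.get(ENV_VARS[provider])
    if not key:
        raise RuntimeError(f"Set {ENV_VARS[provider]} or pass an explicit API key")
    return model or DEFAULT_MODELS[provider], key
```

Failing early with the name of the missing variable tends to be easier to debug than an authentication error returned by the provider.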

## 📁 Storage

DocNav stores corpora in `~/.docnav/corpora/` by default:

```
~/.docnav/
├── corpora/
│   ├── mydocs/
│   │   ├── corpus_index.pkl
│   │   └── metadata.json
│   └── another_corpus/
│       ├── corpus_index.pkl
│       └── metadata.json
```
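
Given this layout, corpora can be discovered by scanning the directory. The following is a rough sketch that assumes a corpus is any sub-directory containing a `metadata.json` file; `list_corpora` is a hypothetical helper written for this example, and `docnav corpora` remains the supported way to list them.

```python
from pathlib import Path
from typing import List, Optional


def list_corpora(root: Optional[Path] = None) -> List[str]:
    """Names of sub-directories of `root` that contain a metadata.json file."""
    if root is None:
        # Default location described in this README.
        root = Path.home() / ".docnav" / "corpora"
    if not root.is_dir():
        return []
    return sorted(
        p.name
        for p in root.iterdir()
        if p.is_dir() and (p / "metadata.json").is_file()
    )
```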

## 🎯 Advanced Usage

### Custom Chunking

```python
from docnav import Corpus

# Custom chunk size
corpus = Corpus("mydocs", chunk_size=2000)

# Add with custom chunking
corpus.add(["large_document.pdf"], chunk_size=1500)
```
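
To give a feel for what `chunk_size` controls, here is a simplified chunker that greedily packs whole words into chunks of at most `chunk_size` characters. It is a sketch under assumptions: this README does not say whether DocNav measures chunks in characters or tokens, and the default of 1000 below is made up for the example.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000) -> List[str]:
    """Greedily pack whole words into chunks of at most `chunk_size` characters.

    A single word longer than `chunk_size` becomes its own oversized chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks: List[str] = []
    current: List[str] = []
    length = 0
    for word in text.split():
        # +1 accounts for the joining space when the chunk is non-empty.
        extra = len(word) + (1 if current else 0)
        if current and length + extra > chunk_size:
            # The word does not fit: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current, length = [], 0
            extra = len(word)
        current.append(word)
        length += extra
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Smaller chunks make retrieval more precise but give the model less surrounding context per hit; larger chunks do the opposite.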

### Filtering Queries

```python
# Query with metadata filters
answer = corpus.ask(
    "Budget information",
    where={"type": "pdf", "file_name": "budget_report.pdf"}
)
```
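
One plausible reading of `where` is that a chunk is kept only if every key/value pair matches its metadata exactly. The sketch below illustrates that interpretation with plain dictionaries; the helper names and the exact-match semantics are assumptions, not DocNav's documented behaviour.

```python
from typing import Any, Dict, List


def matches(metadata: Dict[str, Any], where: Dict[str, Any]) -> bool:
    """True when every key in `where` is present in `metadata` with an equal value."""
    return all(key in metadata and metadata[key] == value for key, value in where.items())


def filter_chunks(chunks: List[Dict[str, Any]], where: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Keep only the chunks whose 'metadata' dict satisfies `where`."""
    return [chunk for chunk in chunks if matches(chunk["metadata"], where)]


chunks = [
    {"text": "Q1 budget...", "metadata": {"type": "pdf", "file_name": "budget_report.pdf"}},
    {"text": "Meeting notes", "metadata": {"type": "docx", "file_name": "notes.docx"}},
]
selected = filter_chunks(chunks, {"type": "pdf", "file_name": "budget_report.pdf"})
```

Under this reading, an empty `where` matches every chunk, so omitting the filter and passing `{}` behave the same.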

### Batch Processing

```python
# Process multiple files efficiently
files = [
    "reports/q1.pdf",
    "reports/q2.pdf",
    "reports/q3.pdf"
]
corpus.add(files, use_ocr=True)
```
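
For larger folders, the file list can be built programmatically instead of typed by hand. This sketch walks a directory and keeps files whose extension looks supported. The extension set is inferred from the formats listed under Features and is an assumption; `collect_files` is a hypothetical helper, not part of DocNav.

```python
from pathlib import Path
from typing import List

# Guessed from the Features list (PDF, DOCX, TXT, MD, CSV, Excel, PowerPoint).
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".md", ".csv", ".xlsx", ".pptx"}


def collect_files(folder: Path) -> List[str]:
    """Recursively gather files under `folder` with a supported suffix, sorted by path."""
    return sorted(
        str(p)
        for p in folder.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

The returned list of paths has the same shape as the `files` list passed to `corpus.add(...)` in the example above.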

## 🔌 API Integration

### OpenAI Integration

```python
# Using OpenAI with custom model
answer = corpus.ask(
    "Analyze the trends",
    llm_provider="openai",
    llm_model="gpt-4-turbo",
    api_key="your-key"
)
```

### Gemini Integration

```python
# Using Google Gemini
answer = corpus.ask(
    "Extract insights",
    llm_provider="gemini",
    llm_model="gemini-2.5-flash",
    api_key="your-gemini-key"
)
```

### Claude Integration

```python
# Using Anthropic Claude
answer = corpus.ask(
    "Summarize findings",
    llm_provider="claude",
    llm_model="claude-3-sonnet-20240229",
    api_key="your-claude-key"
)
```

## 🛠️ Development

### Setup Development Environment

```bash
# Clone repository
git clone https://github.com/Mukesh-Anand-G/DocNav.git
cd DocNav

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black docnav/
```

### Project Structure

```
docnav/
├── docnav/
│   ├── __init__.py      # Package initialization
│   ├── core.py          # Core functionality
│   ├── cli.py           # Command-line interface
│   └── handlers.py      # CLI command handlers
├── setup.py             # Package setup
├── pyproject.toml       # Modern Python packaging
├── requirements.txt     # Dependencies
└── README.md            # This file
```

## 📊 Performance

- **Processing Speed**: ~1000 pages/minute (depends on hardware)
- **Memory Usage**: ~50MB for 1000 documents
- **Search Latency**: <100ms for typical queries
- **Supported Formats**: 10+ document types

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- OpenAI for GPT models
- Google for Gemini models
- Anthropic for Claude models
- Sentence Transformers team for embedding models
- All contributors and users


## 🗺️ Roadmap

- [ ] Web interface
- [ ] Real-time document monitoring
- [ ] Advanced filtering
- [ ] Graph visualization
- [ ] Plugin system
- [ ] Multi-language support

---

**Made with ❤️ by Mukesh Anand G**
