Metadata-Version: 2.4
Name: deepseek-visor-agent
Version: 0.2.0
Summary: Production-ready wrapper for DeepSeek-OCR - Convert documents to structured data for AI agents
Author-email: Jack Chen <jack_ai@qq.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/JackChen-ai/deepseek-visor-agent
Project-URL: Documentation, https://github.com/JackChen-ai/deepseek-visor-agent/blob/main/README.md
Project-URL: Repository, https://github.com/JackChen-ai/deepseek-visor-agent
Project-URL: Issues, https://github.com/JackChen-ai/deepseek-visor-agent/issues
Keywords: deepseek,deepseek-ocr,ocr,ai-agent,langchain,llamaindex,document-understanding,vision-model,document-ocr
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: NOTICE
Requires-Dist: torch>=2.6.0
Requires-Dist: transformers>=4.46.3
Requires-Dist: tokenizers>=0.20.3
Requires-Dist: einops
Requires-Dist: addict
Requires-Dist: easydict
Requires-Dist: pillow>=10.0.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: loguru>=0.7.0
Requires-Dist: PyMuPDF>=1.23.0
Requires-Dist: matplotlib>=3.5.0
Provides-Extra: flash-attn
Requires-Dist: flash-attn==2.7.3; extra == "flash-attn"
Provides-Extra: api
Requires-Dist: fastapi>=0.104.0; extra == "api"
Requires-Dist: uvicorn[standard]>=0.24.0; extra == "api"
Provides-Extra: all
Requires-Dist: flash-attn==2.7.3; extra == "all"
Requires-Dist: fastapi>=0.104.0; extra == "all"
Requires-Dist: uvicorn[standard]>=0.24.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Dynamic: license-file

# DeepSeek Visor Agent

> **Production-ready wrapper for [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR)** - Convert documents to structured data in 3 lines of code

[![PyPI version](https://badge.fury.io/py/deepseek-visor-agent.svg)](https://badge.fury.io/py/deepseek-visor-agent)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)

**Keywords**: DeepSeek OCR, DeepSeek-OCR wrapper, document OCR, AI agent vision tool, LangChain OCR, LlamaIndex OCR

---

## ⚠️ **GPU Requirements (CRITICAL)**

**NVIDIA GPU with Turing+ architecture required**

| ✅ Supported | ❌ Not Supported |
|-------------|-----------------|
| RTX 20/30/40 series (Turing/Ampere/Ada) | GTX 10 series (Pascal - no FlashAttention) |
| Tesla T4, A10, A100 | GTX 1080 Ti, GTX 1660 |
| **Minimum**: RTX 2060 (6GB VRAM) | CPU-only mode |
| **Recommended**: RTX 3090 (24GB VRAM) | AMD GPUs (ROCm) |

**Why?** DeepSeek-OCR requires [FlashAttention 2.x](https://github.com/Dao-AILab/flash-attention), which only supports compute capability 7.5+ (Turing and newer).

**No GPU?** Join our hosted API waitlist (planned for future release).

📖 **Detailed compatibility guide**: [GPU_COMPATIBILITY.md](docs/GPU_COMPATIBILITY.md)
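
To check a machine against the compute capability 7.5 threshold before installing, a minimal sketch along these lines may help. `torch.cuda.get_device_capability` is a real PyTorch call; the helper names `meets_minimum` and `detect_capability` are illustrative and not part of this package. Treat the check as necessary rather than sufficient, since the table above also excludes some cards by model.

```python
from typing import Optional, Tuple

# Minimum compute capability stated in this README (Turing).
MIN_CAPABILITY: Tuple[int, int] = (7, 5)


def meets_minimum(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability is 7.5 or newer."""
    return tuple(capability) >= MIN_CAPABILITY


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return the first CUDA device's capability, or None if unavailable."""
    try:
        import torch  # imported lazily so the helper also loads without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)
```

Calling `detect_capability()` and passing the result to `meets_minimum` gives a quick yes/no answer; a `None` result means no usable CUDA device was found.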

---

## 🎯 What is This?

DeepSeek Visor Agent is a **production-ready Python wrapper** for [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR), the state-of-the-art open-source OCR model by DeepSeek AI.

**Built on DeepSeek-OCR**, this wrapper makes document understanding **effortless for AI agents** by handling all the complexity:

- ✅ **Auto device detection** (CUDA with Turing+ GPUs)
- ✅ **Automatic fallback** (Gundam → Large → Base → Small → Tiny mode on OOM)
- ✅ **Structured output** (Markdown + extracted fields)
- ✅ **Agent-ready** (LangChain, LlamaIndex, Dify compatible)

## ⚡ Quick Start

### Prerequisites

Before installation, ensure you have:
1. **NVIDIA GPU** with Turing+ architecture (RTX 20/30/40 series, Tesla T4/A100)
2. **CUDA 11.8+** installed and configured
3. **Python 3.9+**

### Installation

**Step 1: Install the package**

```bash
pip install deepseek-visor-agent
```

**Step 2: (First-time only) Model download**

The first time you run the tool, it will automatically download the DeepSeek-OCR model (~6.2 GB) from HuggingFace:

```python
from deepseek_visor_agent import VisionDocumentTool

# This will trigger model download on first run
tool = VisionDocumentTool()
```

The model will be cached in `~/.cache/huggingface/` and reused for subsequent runs.
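
To confirm that the download completed, or to see how much disk space the cache uses, a small standard-library sketch like the following could be used. It assumes the default cache location named above; a custom `HF_HOME` would change the path. The function name `cache_size_bytes` is illustrative.

```python
from pathlib import Path


def cache_size_bytes(cache_root: Path) -> int:
    """Sum the sizes of regular files under cache_root, skipping symlinks."""
    if not cache_root.exists():
        return 0
    return sum(
        p.stat().st_size
        for p in cache_root.rglob("*")
        # Skip symlinks so files linked from several places are counted once.
        if p.is_file() and not p.is_symlink()
    )


if __name__ == "__main__":
    # Cache location named in this README.
    default_cache = Path.home() / ".cache" / "huggingface"
    print(f"Hugging Face cache: {cache_size_bytes(default_cache) / 1e9:.2f} GB")
```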

**Step 3: (Optional) Install FlashAttention for better performance**

```bash
# For GPUs with compute capability 7.5+; 2.7.3 is the version pinned by the flash-attn extra
pip install flash-attn==2.7.3 --no-build-isolation
```

### Basic Usage

**Process Images:**

```python
from deepseek_visor_agent import VisionDocumentTool

# Initialize the tool (auto-detects best device and model)
tool = VisionDocumentTool()

# Process a document image
result = tool.run("invoice.jpg")

print(result["fields"]["total"])  # "$199.00"
print(result["fields"]["date"])   # "2024-01-15"
print(result["document_type"])    # "invoice"
```

**Process PDFs:**

```python
# PDFs work the same way: pages are converted to images automatically
result = tool.run("contract.pdf")

print(f"Processed {result['pages']} pages")
print(result["markdown"])  # Multi-page PDFs have <--- Page Split ---> separators

# Process specific pages only
result = tool.run("long_document.pdf", pdf_start_page=0, pdf_end_page=2)
```

That's it! No configuration needed.

## 📖 Complete User Journey

### Scenario 1: Standalone Python Script

**Use Case**: Extract invoice data for accounting automation

```python
from deepseek_visor_agent import VisionDocumentTool
import json

# Initialize once
tool = VisionDocumentTool()

# Process multiple invoices
invoices = ["invoice1.jpg", "invoice2.pdf", "invoice3.png"]
results = []

for invoice_path in invoices:
    result = tool.run(invoice_path, document_type="invoice")
    results.append({
        "file": invoice_path,
        "total": result["fields"]["total"],
        "vendor": result["fields"]["vendor"],
        "date": result["fields"]["date"]
    })

# Export to JSON
with open("extracted_invoices.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"Processed {len(results)} invoices")
```

**Timeline**:
- First run: ~30 seconds for model load and first inference, plus the one-time ~6.2 GB model download (time depends on your connection)
- Subsequent runs: ~5-7 seconds per page
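
In a longer batch, one unreadable file or a missing field should not stop the whole run. The sketch below shows one way to guard against that. It takes any callable that maps a path to a result dict, so `process_batch` is a hypothetical helper, not part of this package, and it assumes the result layout shown in this README.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def process_batch(
    paths: Iterable[str],
    run: Callable[[str], Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Tuple[str, str]]]:
    """Run `run` on each path; collect successes and (path, error) failures."""
    results: List[Dict[str, Any]] = []
    failures: List[Tuple[str, str]] = []
    for path in paths:
        try:
            result = run(path)
        except Exception as exc:  # keep going, but record what went wrong
            failures.append((path, f"{type(exc).__name__}: {exc}"))
            continue
        fields = result.get("fields", {})
        results.append({
            "file": path,
            # .get() yields None when the parser did not find a field
            "total": fields.get("total"),
            "vendor": fields.get("vendor"),
            "date": fields.get("date"),
        })
    return results, failures
```

With the tool from the example above, this could be called as `process_batch(invoices, lambda p: tool.run(p, document_type="invoice"))`.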

### Scenario 2: LangChain AI Agent

**Use Case**: Build a chatbot that can answer questions about uploaded documents

```python
from langchain.tools import tool
from langchain.agents import initialize_agent, AgentType
from langchain_openai import ChatOpenAI
from deepseek_visor_agent import VisionDocumentTool

# Initialize OCR tool
ocr_tool = VisionDocumentTool()

@tool
def analyze_document(image_path: str) -> dict:
    """Analyze any document image and extract structured data"""
    return ocr_tool.run(image_path, document_type="auto")

# Create agent
llm = ChatOpenAI(model="gpt-4", temperature=0)
agent = initialize_agent(
    [analyze_document],
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# User interaction
response = agent.run("What is the total amount in invoice.jpg?")
print(response)
# Output: "The total amount in the invoice is $199.00, dated 2024-01-15 from Acme Corp."
```

**User Flow**:
1. User uploads document image via chat interface
2. Agent calls `analyze_document` tool
3. DeepSeek-OCR extracts text + fields
4. LLM interprets results and responds naturally

### Scenario 3: Batch PDF Processing

**Use Case**: Process hundreds of multi-page contracts

```python
from deepseek_visor_agent import VisionDocumentTool
from pathlib import Path
import json

tool = VisionDocumentTool()

# Find all PDFs
contracts_dir = Path("./contracts/")
pdf_files = list(contracts_dir.glob("*.pdf"))

results = []
for pdf_path in pdf_files:
    print(f"Processing {pdf_path.name}...")

    result = tool.run(
        str(pdf_path),
        document_type="contract",
        pdf_start_page=0,  # Process first 3 pages only
        pdf_end_page=2
    )

    results.append({
        "filename": pdf_path.name,
        "parties": result["fields"]["parties"],
        "effective_date": result["fields"]["effective_date"],
        "pages_processed": result["pages"]
    })

# Save results
with open("contracts_summary.json", "w") as f:
    json.dump(results, f, indent=2)

print(f"Processed {len(results)} contracts")
```

**Performance**: ~6-7 seconds per page on Tesla T4
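
For very long PDFs it can help to process a few pages at a time rather than the whole file in one call. The sketch below generates page windows for `pdf_start_page` and `pdf_end_page`. It assumes the end page is inclusive, which is inferred from the "first 3 pages" comment above rather than confirmed; `page_windows` is an illustrative helper name.

```python
from typing import Iterator, Tuple


def page_windows(total_pages: int, window: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) page ranges, 0-indexed with an inclusive end."""
    if window < 1:
        raise ValueError("window must be at least 1")
    for start in range(0, total_pages, window):
        yield start, min(start + window, total_pages) - 1


print(list(page_windows(7, 3)))  # [(0, 2), (3, 5), (6, 6)]
```

Each pair can then be passed as `tool.run(path, pdf_start_page=start, pdf_end_page=end)`. The total page count has to come from elsewhere, for example PyMuPDF's `page_count` on an opened document.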

### Scenario 4: REST API for No-Code Platforms (Dify/Flowise)

**Use Case**: Integrate with Dify's visual workflow builder

See [Dify Integration Guide](examples/dify_integration.md) for complete setup.

**High-level flow**:
1. Deploy FastAPI wrapper (provided in examples)
2. Configure Dify HTTP node with OCR endpoint
3. Build visual workflow: Upload → OCR → Parse → Respond
4. No Python code needed for end users

## 🔗 Integrations

### LangChain

```python
from langchain.tools import tool
from deepseek_visor_agent import VisionDocumentTool

ocr_tool = VisionDocumentTool()

@tool
def extract_invoice_data(image_path: str) -> dict:
    """Extract structured data from invoice images"""
    return ocr_tool.run(image_path, document_type="invoice")

# Use in your agent
from langchain.agents import initialize_agent, AgentType
from langchain_openai import OpenAI  # langchain.llms.OpenAI is deprecated

tools = [extract_invoice_data]
agent = initialize_agent(tools, OpenAI(temperature=0), agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION)

response = agent.run("Extract the total from invoice.jpg")
```

### LlamaIndex

```python
from llama_index.core.tools import FunctionTool  # import path for llama-index >= 0.10
from deepseek_visor_agent import VisionDocumentTool

tool = VisionDocumentTool()

def ocr_document(image_path: str) -> dict:
    """Process documents with OCR"""
    return tool.run(image_path)

llama_tool = FunctionTool.from_defaults(fn=ocr_document)
```

### Dify / Flowise

See [integration guide](examples/dify_integration.md) for REST API setup.

## 📊 Features

### Automatic Device Management

The tool automatically detects your hardware and selects the optimal configuration:

| Hardware | Inference Mode | Memory Usage |
|----------|----------------|--------------|
| RTX 4090 (24GB) | Gundam | ~10GB |
| RTX 3090 (24GB) | Base | ~6GB |
| RTX 2060 (6GB) | Tiny | ~3GB |
| CPU only | Not Supported | - |

### Automatic Fallback

If inference fails (out-of-memory or other CUDA errors), the tool automatically falls back to lower-resolution modes:

```
Gundam mode (OOM) → Large mode → Base mode → Small mode → Tiny mode (Success!)
```
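
The chain above can be pictured as a simple retry loop. The following is a sketch of the idea only, not the package's internal implementation: `infer` is a hypothetical callable that runs one inference in a given mode, and the lowercase mode names are an assumption based on the `inference_mode` value shown in the output format.

```python
from typing import Any, Callable, Dict, Sequence

# Order taken from the chain above; the lowercase names are an assumption.
FALLBACK_ORDER = ("gundam", "large", "base", "small", "tiny")


def run_with_fallback(
    infer: Callable[[str], Dict[str, Any]],
    modes: Sequence[str] = FALLBACK_ORDER,
) -> Dict[str, Any]:
    """Try each mode in order and return the first successful result."""
    last_error = None
    for mode in modes:
        try:
            result = infer(mode)
        except RuntimeError as exc:  # PyTorch reports CUDA OOM as a RuntimeError subclass
            last_error = exc
            continue
        # Record which mode finally succeeded.
        result.setdefault("metadata", {})["inference_mode"] = mode
        return result
    raise RuntimeError("all inference modes failed") from last_error
```

Catching only `RuntimeError` keeps unrelated problems, such as a missing file, from being silently retried in every mode.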

### Supported Document Types

- ✅ **Invoices** - Extracts total, date, vendor, line items
- ✅ **Contracts** - Extracts parties, effective date, terms
- ✅ **PDF Documents** - Multi-page PDFs with automatic page splitting
- 🚧 **Resumes** - Coming soon
- 🚧 **Forms** - Coming soon

### PDF Support

Based on the [official DeepSeek-OCR implementation](https://github.com/deepseek-ai/DeepSeek-OCR/blob/master/DeepSeek-OCR-vllm/run_dpsk_ocr_pdf.py):

- ✅ Multi-page PDF processing
- ✅ Automatic page-to-image conversion (PyMuPDF)
- ✅ Configurable DPI (default: 144, same as the official implementation)
- ✅ Page range selection
- ✅ Same API as image processing

```python
# Process entire PDF
result = tool.run("contract.pdf")

# Process specific pages (0-indexed)
result = tool.run("doc.pdf", pdf_start_page=0, pdf_end_page=2)

# Adjust quality
result = tool.run("scan.pdf", pdf_dpi=200)  # Higher DPI = better quality
```
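
Multi-page results join the pages with a `<--- Page Split --->` separator, as noted in Quick Start. To work with pages individually, a small helper like this one can split the combined Markdown. It assumes the separator appears verbatim in `result["markdown"]`; `split_pages` is an illustrative name.

```python
from typing import List

# Separator shown in the Quick Start comment; assumed to appear verbatim.
PAGE_SEPARATOR = "<--- Page Split --->"


def split_pages(markdown: str) -> List[str]:
    """Split combined Markdown into per-page chunks, dropping empty ones."""
    chunks = (chunk.strip() for chunk in markdown.split(PAGE_SEPARATOR))
    return [chunk for chunk in chunks if chunk]
```

For example, `split_pages(result["markdown"])[0]` would give the first page's text.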

### Output Format

```python
{
    "markdown": "# Invoice\n\nDate: 2024-01-15\n...",
    "fields": {
        "total": "$199.00",
        "date": "2024-01-15",
        "vendor": "Acme Corp"
    },
    "document_type": "invoice",
    "confidence": 0.95,
    "metadata": {
        "inference_mode": "tiny",
        "device": "cuda",
        "inference_time_ms": 1823
    },
    "pages": 1  # Number of pages processed (1 for images, N for PDFs)
}
```
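
Since extracted fields depend on the document, downstream code may want to read them defensively. The sketch below reads one field from a result shaped like the example above and optionally ignores low-confidence results. `get_field` and its `min_confidence` parameter are hypothetical helpers, and treating `confidence` as a document-level score is an assumption based on where it sits in the example.

```python
from typing import Any, Dict, Optional


def get_field(
    result: Dict[str, Any],
    name: str,
    min_confidence: float = 0.0,
) -> Optional[str]:
    """Return result["fields"][name], or None if missing or below the threshold."""
    if result.get("confidence", 0.0) < min_confidence:
        return None
    return result.get("fields", {}).get(name)
```

A call such as `get_field(result, "total", min_confidence=0.9)` returns `None` instead of raising `KeyError` when the field was not extracted.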

## ⚡ Performance

✅ **GPU-Tested on Tesla T4 (16GB VRAM)** - 2025-10-21

| Inference Mode | Inference Time | Test Environment | Notes |
|----------------|----------------|------------------|-------|
| **Tiny** | 5.35s/page | Tesla T4, Simple Doc | Fastest, 64 tokens |
| **Small** | 6.53s/page | Tesla T4, Simple Doc | 100 tokens |
| **Base** | 6.77s/page | Tesla T4, Simple Doc | 256 tokens, **Most Common** |
| **Large** | 6.35s/page | Tesla T4, Simple Doc | 400 tokens |
| **Gundam** | 6.67s/page | Tesla T4, Simple Doc | Crop mode, 256+400 tokens |

⚠️ **Note**: Timings were measured on simple text documents. Results for complex real-world documents (tables, images, forms) may differ.
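
For rough capacity planning, the per-page timings can be turned into a batch estimate. This is a back-of-the-envelope sketch that assumes pages are processed one after another on a single Tesla T4 and that the simple-document timings above hold; `estimate_minutes` is an illustrative name.

```python
# Seconds per page copied from the Tesla T4 table above.
SECONDS_PER_PAGE = {
    "tiny": 5.35,
    "small": 6.53,
    "base": 6.77,
    "large": 6.35,
    "gundam": 6.67,
}


def estimate_minutes(pages: int, mode: str = "base") -> float:
    """Estimate sequential processing time in minutes for `pages` pages."""
    return pages * SECONDS_PER_PAGE[mode] / 60


print(f"{estimate_minutes(1000):.1f} min")  # prints "112.8 min" for Base mode
```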

## 📚 Documentation

### 🚀 Getting Started
- **[📚 Documentation Center](docs/README.md)** - Complete documentation hub
- [GPU Compatibility Guide](docs/GPU_COMPATIBILITY.md)
- [Hardware Limitations](docs/HARDWARE_LIMITATIONS.md)

### 🏗️ For Developers
- [Dify Integration](examples/dify_integration.md)
- [LangChain Example](examples/langchain_example.py)
- [LlamaIndex Example](examples/llamaindex_example.py)
- [PDF Processing Example](examples/pdf_example.py)

## 🛣️ Roadmap

- [x] Core OCR engine with auto-fallback
- [x] Invoice parser
- [x] Contract parser (basic)
- [x] PDF support (via PyMuPDF, official DeepSeek-OCR method)
- [ ] Resume parser
- [ ] Multi-language support
- [ ] Hosted API (Cloud version)
- [ ] LlamaIndex native tool
- [ ] Dify plugin

## 🤝 Contributing

We welcome contributions! Areas where help is needed:

1. **New parsers** - Add support for new document types
2. **Testing** - More test cases and edge cases
3. **Documentation** - Improve guides and examples
4. **Performance** - Optimization suggestions

Please submit issues or pull requests on [GitHub](https://github.com/JackChen-ai/deepseek-visor-agent).

## 📖 Citation

Built on top of [DeepSeek-OCR](https://huggingface.co/deepseek-ai/DeepSeek-OCR):

```bibtex
@misc{deepseek-ocr,
  author = {DeepSeek AI},
  title = {DeepSeek-OCR},
  year = {2025},
  url = {https://huggingface.co/deepseek-ai/DeepSeek-OCR}
}
```

## 📄 License

Apache License 2.0 - see [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- DeepSeek AI team for the amazing OCR model
- Hugging Face for model hosting
- LangChain and LlamaIndex communities for inspiration

## 📬 Contact

- GitHub Issues: [Report bugs or request features](https://github.com/JackChen-ai/deepseek-visor-agent/issues)
- Email: jack_ai@qq.com

---

**Star ⭐ this repo if you find it useful!**
