Metadata-Version: 2.4
Name: netintel-ocr
Version: 0.1.18.2
Summary: Enterprise Document Intelligence Platform with High-Performance C++ Extensions, API v2, MCP, and Vector Search
Home-page: https://github.com/VisionMLNet/NetIntelOCR
Author: VisionML
Author-email: VisionML <info@visionml.net>
Maintainer-email: VisionML Team <support@visionml.net>
License: MIT
Project-URL: Homepage, https://github.com/VisionMLNet/NetIntelOCR
Project-URL: Issues, https://github.com/VisionMLNet/NetIntelOCR/issues
Project-URL: Changelog, https://github.com/VisionMLNet/NetIntelOCR/changelog.md
Keywords: ocr,document-intelligence,network-analysis,knowledge-graph,vector-search,milvus,falkordb,pykeen,api,mcp,c++,avx2,openmp
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: C++
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: General
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pymupdf>=1.26.1
Requires-Dist: pillow>=11.2.1
Requires-Dist: opencv-python-headless>=4.5.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: click>=8.1.0
Requires-Dist: typer>=0.9.0
Requires-Dist: tqdm>=4.67.1
Requires-Dist: pandas>=2.0.0
Requires-Dist: jsonschema>=4.0.0
Requires-Dist: pyyaml>=6.0.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: requests>=2.31.0
Requires-Dist: ollama>=0.1.0
Requires-Dist: tomli>=2.0.0; python_version < "3.11"
Provides-Extra: kg
Requires-Dist: falkordb>=1.0.8; extra == "kg"
Requires-Dist: graphiti-core>=0.3.0; extra == "kg"
Requires-Dist: pykeen>=1.10.0; extra == "kg"
Requires-Dist: minirag-hku>=0.0.2; extra == "kg"
Requires-Dist: torch>=2.0.0; extra == "kg"
Requires-Dist: torchvision>=0.15.0; extra == "kg"
Requires-Dist: scikit-learn>=1.0.0; extra == "kg"
Requires-Dist: matplotlib>=3.5.0; extra == "kg"
Requires-Dist: seaborn>=0.11.0; extra == "kg"
Requires-Dist: plotly>=5.0.0; extra == "kg"
Requires-Dist: asyncio-redis>=0.16.0; extra == "kg"
Provides-Extra: vector
Requires-Dist: pymilvus>=2.3.0; extra == "vector"
Requires-Dist: qdrant-client>=1.7.0; extra == "vector"
Requires-Dist: chromadb>=0.4.0; extra == "vector"
Requires-Dist: lancedb>=0.5.0; extra == "vector"
Provides-Extra: api
Requires-Dist: fastapi>=0.100.0; extra == "api"
Requires-Dist: uvicorn[standard]>=0.23.0; extra == "api"
Requires-Dist: python-multipart>=0.0.6; extra == "api"
Requires-Dist: pyjwt>=2.8.0; extra == "api"
Provides-Extra: mcp
Requires-Dist: fastmcp>=0.1.0; extra == "mcp"
Requires-Dist: websockets>=12.0.0; extra == "mcp"
Provides-Extra: performance
Requires-Dist: cmake>=3.12; extra == "performance"
Requires-Dist: pybind11>=2.6.0; extra == "performance"
Requires-Dist: scikit-build>=0.11.0; extra == "performance"
Requires-Dist: numba>=0.58.0; extra == "performance"
Requires-Dist: cython>=3.0.0; extra == "performance"
Provides-Extra: dev
Requires-Dist: pytest>=6.0.0; extra == "dev"
Requires-Dist: pytest-cov>=2.10.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-benchmark>=4.0.0; extra == "dev"
Requires-Dist: black>=20.8b1; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Provides-Extra: production
Requires-Dist: netintel-ocr[kg]; extra == "production"
Requires-Dist: netintel-ocr[vector]; extra == "production"
Requires-Dist: netintel-ocr[api]; extra == "production"
Requires-Dist: netintel-ocr[performance]; extra == "production"
Provides-Extra: cloud
Requires-Dist: netintel-ocr[vector]; extra == "cloud"
Requires-Dist: netintel-ocr[api]; extra == "cloud"
Requires-Dist: netintel-ocr[mcp]; extra == "cloud"
Requires-Dist: aioboto3>=12.0.0; extra == "cloud"
Requires-Dist: redis[hiredis]>=5.0.0; extra == "cloud"
Provides-Extra: all
Requires-Dist: netintel-ocr[kg]; extra == "all"
Requires-Dist: netintel-ocr[vector]; extra == "all"
Requires-Dist: netintel-ocr[api]; extra == "all"
Requires-Dist: netintel-ocr[mcp]; extra == "all"
Requires-Dist: netintel-ocr[performance]; extra == "all"
Requires-Dist: netintel-ocr[dev]; extra == "all"
Dynamic: author
Dynamic: home-page
Dynamic: requires-python

# NetIntel-OCR (Network Intelligence OCR) v0.1.18.2

🚀 **Enterprise Document Intelligence Platform with High-Performance C++ Extensions!**

[![Version](https://img.shields.io/badge/version-0.1.18.2-blue)]() [![Python](https://img.shields.io/badge/python-3.10+-green)]() [![C++](https://img.shields.io/badge/C++-Extensions-red)]() [![AVX2](https://img.shields.io/badge/AVX2-Optimized-orange)]() [![OpenMP](https://img.shields.io/badge/OpenMP-Parallel-purple)]() [![Modular](https://img.shields.io/badge/Modular-Install-orange)]() [![Knowledge Graph](https://img.shields.io/badge/Knowledge-Graph-red)]() [![FalkorDB](https://img.shields.io/badge/FalkorDB-Powered-purple)]() [![Milvus](https://img.shields.io/badge/Milvus-Powered-purple)]() [![PyKEEN](https://img.shields.io/badge/PyKEEN-Embeddings-orange)]() [![Docker](https://img.shields.io/badge/docker-ready-blue)]() [![Kubernetes](https://img.shields.io/badge/kubernetes-ready-purple)]() [![API](https://img.shields.io/badge/API-REST-orange)]() [![MCP](https://img.shields.io/badge/MCP-LLM_Ready-purple)]()

NetIntel-OCR is an enterprise document intelligence platform that combines advanced OCR, vector search, and knowledge graph capabilities. Its API v2 provides RESTful endpoints, GraphQL support, real-time WebSocket updates, and MCP integration for LLMs. Version 0.1.18.2 delivers production-ready features including OAuth2/RBAC authentication, multi-tier caching, distributed tracing, enterprise-grade monitoring, and high-performance C++ extensions with AVX2 SIMD and OpenMP parallelization.

**🎉 Version 0.1.18.2 includes high-performance C++ extensions with AVX2 SIMD instructions and OpenMP parallelization for 10x faster text deduplication and processing! This release also fixes the PyPI metadata and ships optimized manylinux wheels for Python 3.11 and 3.12.**

## 🎯 Key Capabilities

### Network Intelligence Extraction
- **Automatic Network Detection**: AI-powered identification of network diagrams in documents
- **Component Recognition**: Identifies routers, switches, firewalls, servers, and other network elements
- **Connection Mapping**: Traces and documents network paths and relationships
- **Security Architecture Analysis**: Extracts security zones, DMZs, and trust boundaries

## ✨ Features

### 🆕 New in v0.1.18.2 - High-Performance C++ Extensions & PyPI Metadata Fix

#### ⚡ Performance Optimizations
- **C++ Core Extensions**: Native C++ implementation for text deduplication
- **AVX2 SIMD Instructions**: Hardware-accelerated vector operations for 256-bit parallel processing
- **OpenMP Parallelization**: Multi-core parallel processing across CPU cores
- **SimHash Algorithm**: Fast near-duplicate detection with Hamming distance calculations
- **Content-Defined Chunking**: Efficient text segmentation with rolling hash
- **10x Performance Boost**: Dramatically faster processing for large document collections
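
To make the SimHash and Hamming-distance bullets concrete, here is a minimal pure-Python sketch of the technique. It is an illustration only, not the C++ core: the function names, the MD5-based token hashing, and the default threshold of 5 are assumptions chosen for the example.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint of the word tokens in `text` (bits <= 64)."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        # Hash each token to a stable 64-bit integer (MD5 is an arbitrary choice here).
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when the vote for that position is positive.
    return sum(1 << i for i, w in enumerate(weights) if w > 0)


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, threshold: int = 5) -> bool:
    """Treat two texts as near-duplicates when their fingerprints are close."""
    return hamming_distance(simhash(a), simhash(b)) <= threshold
```

Because similar texts share most tokens, their fingerprints differ in only a few bits, so a small Hamming threshold separates near-duplicates from unrelated documents.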

#### 🔧 Build System Improvements
- **Manylinux2014 Wheels**: Maximum compatibility across Linux distributions
- **Python 3.11 & 3.12 Support**: Optimized builds for latest Python versions
- **Automated Build Pipeline**: `quick-build.sh` for consistent C++ compilation
- **Docker-based Building**: Reproducible builds with all dependencies
- **Size Optimization**: Wheels ~2MB with all C++ extensions included

#### 📊 Enhanced Version Display with C++ Info
```bash
# Check installed components including C++ extensions
netintel-ocr --version

# Example output showing C++ components:
NetIntel-OCR v0.1.18.2
├── Core Components:
│   ├── C++ Core: ✓ v1.0.1
│   ├── AVX2: ✓
│   ├── OpenMP: ✓
│   └── Platform: Linux x86_64
├── Installed Modules:
│   ├── [base] Core OCR: ✓ (always installed)
│   ├── [performance] C++ Extensions: ✓ (dedup_core loaded)
│   └── [vector] Vector Store: ✓ (milvus 2.3.0)
└── Performance Features:
    ├── SimHash: ✓ (64-bit fingerprints)
    ├── CDC: ✓ (1024-byte chunks)
    └── Parallel Cores: 16
```

### 🆕 New in v0.1.18.0 - API v2, MCP Integration & Enterprise Features

#### 🚀 Comprehensive API v2
- **RESTful API**: Complete `/api/v2` endpoints with versioning
- **GraphQL Support**: Full schema with queries, mutations, subscriptions
- **WebSocket**: Real-time updates for document processing and search
- **Streaming Upload**: Chunked upload for files up to 5GB with resume capability
- **Document Versioning**: Complete version management and tracking

#### 🤖 MCP (Model Context Protocol) Integration
- **15+ Tools**: Document operations, Milvus management, Knowledge Graph queries
- **6 Resources**: Interactive exploration for documents, topology, KG, and search
- **5 Prompts**: Contextual analysis, synthesis, troubleshooting, security audit
- **LLM-Ready**: Seamless integration with AI assistants and agents

#### 🔐 Production-Ready Features
- **OAuth2/OIDC**: Enterprise authentication with JWT tokens
- **RBAC System**: Full role-based access control
- **Multi-Tier Caching**: Memory, Redis, and hybrid caching strategies
- **Rate Limiting**: 4 strategies (fixed window, sliding window, token bucket, leaky bucket)
- **Health Monitoring**: Comprehensive health checks and readiness probes
- **Metrics**: Prometheus integration with custom metrics
- **Audit Logging**: Complete audit trail for compliance
- **Distributed Tracing**: OpenTelemetry support for debugging
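
As an illustration of one of the rate-limiting strategies listed above, the following is a minimal token bucket sketch. The class name, parameters, and injectable clock are assumptions made for the example; the actual limiter in NetIntel-OCR may be structured differently.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.clock = clock             # injectable so the limiter is easy to test
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to the elapsed time, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket created with `TokenBucket(rate=10, capacity=20)` would admit a burst of 20 requests and then sustain about 10 requests per second.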

#### 🎯 Milvus Vector Database Integration
- **Collection Management**: Full CRUD operations with dynamic schemas
- **Advanced Search**: Vector, hybrid, and expression-based queries
- **Result Reranking**: Multiple strategies (cross-encoder, feature-based, RRF)
- **Batch Operations**: Efficient insert, upsert, delete operations
- **Index Management**: Create and manage indexes with progress tracking
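
Reciprocal Rank Fusion (RRF), one of the reranking strategies named above, merges several ranked result lists using only the rank positions. The sketch below shows the general formula; the function name and the constant `k = 60` (a commonly used value) are assumptions, not settings taken from NetIntel-OCR.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse ranked lists of document IDs; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["doc-a", "doc-b", "doc-c"]  # hypothetical vector search results
graph_hits = ["doc-b", "doc-c", "doc-a"]   # hypothetical graph traversal results
fused = reciprocal_rank_fusion([vector_hits, graph_hits])
```

Documents that rank well in several lists rise to the top even when their raw scores are not comparable across retrievers.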

#### 🏢 Enterprise Features
- **Deduplication System**: MD5, SimHash, and CDC-based deduplication
- **Performance Monitoring**: Real-time metrics and benchmarking
- **Module Management**: Dynamic module configuration
- **Batch Processing**: Parallel job submission and management
- **Configuration Templates**: Pre-defined deployment profiles

### 🆕 New in v0.1.17.1 - Modular Installation & Enhanced Version Display

#### 📦 Modular Installation
- **Reduced Base Size**: From 2.5GB to just 500MB for core OCR functionality
- **7 Optional Modules**: Choose what you need:
  - `[kg]` - Knowledge Graph with PyKEEN and FalkorDB (+1.5GB)
  - `[vector]` - Vector stores (Milvus, Qdrant, ChromaDB, LanceDB) (+300MB)
  - `[api]` - REST API server with FastAPI (+50MB)
  - `[mcp]` - Model Context Protocol server (+30MB)
  - `[performance]` - C++ optimizations and SIMD (+200MB)
  - `[dev]` - Development tools (pytest, black, ruff) (+100MB)
  - `[all]` - Everything included (2.5GB total)
- **Preset Configurations**:
  - `[production]` - KG + Vector + API + Performance
  - `[cloud]` - Vector + API + MCP
- **Smart Detection**: System automatically configures based on installed modules
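
One way such detection could work is to probe for the importable packages behind each extra. The sketch below is an assumption about the approach, not the project's actual code; the import names are inferred from the distribution names in the package metadata and only one or two packages are probed per extra.

```python
from importlib.util import find_spec

# Hypothetical probe table: extra name -> import names that indicate it is installed.
EXTRA_PROBES = {
    "kg": ("pykeen", "falkordb"),
    "vector": ("pymilvus",),
    "api": ("fastapi", "uvicorn"),
    "mcp": ("fastmcp",),
    "performance": ("pybind11",),
    "dev": ("pytest",),
}


def detect_extras(probes=EXTRA_PROBES):
    """Return {extra: True/False} depending on whether all probe modules are importable."""
    return {
        extra: all(find_spec(module) is not None for module in modules)
        for extra, modules in probes.items()
    }
```

Calling `detect_extras()` after `pip install "netintel-ocr[kg]"` would then be expected to report `kg` as present, without importing the heavy packages themselves.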

#### 📊 Enhanced Version Display
```bash
# Check what's installed and available
netintel-ocr --version

# Example output:
NetIntel-OCR v0.1.17.1
├── Core Components:
│   ├── C++ Core: ✓ v1.0.1
│   ├── AVX2: ✓
│   └── Platform: Linux x86_64
├── Installed Modules:
│   ├── [base] Core OCR: ✓ (always installed)
│   ├── [kg] Knowledge Graph: ✓ (pykeen 1.10.1)
│   └── [vector] Vector Store: ✗ (not installed)
├── Available for Install:
│   └── [vector]: pip install netintel-ocr[vector]
└── Active Features:
    ├── FalkorDB: ✓ (connected)
    └── Ollama: ✓ (connected)
```

### 🆕 New in v0.1.17 - Hierarchical CLI & Hybrid Knowledge Graph System

#### 🎯 Hierarchical CLI Structure
- 📁 **8 Command Groups**: Organized into intuitive categories for better discoverability
  - `process` - Document processing (pdf, batch, watch)
  - `server` - Server operations (api, mcp, worker, health)
  - `db` - Database management (query, merge, stats)
  - `kg` - Knowledge Graph (18+ commands)
  - `model` - Model management (list, set-default, ollama)
  - `project` - Project initialization (templates)
  - `config` - Configuration (profiles, templates, validation)
  - `system` - System utilities (check, diagnose, version)
- 🔄 **Breaking Change**: `netintel-ocr document.pdf` → `netintel-ocr process pdf document.pdf`
- 📋 **Configuration Templates**: 6 pre-built templates (minimal, development, staging, production, enterprise, cloud)
- 👤 **Profile Management**: Multiple configuration profiles with easy switching
- 🌍 **Environment Variables**: Complete configuration override capability

#### 🧠 Knowledge Graph System (Now Default!)
- 🧠 **Knowledge Graph Construction**: Automatically build graph representations from network diagrams, flow diagrams, tables, and text
- 🗄️ **FalkorDB Integration**: Unified storage for both graph structure and KG embeddings
- 🎯 **PyKEEN Embeddings**: Train knowledge graph embeddings with 8 supported models (TransE, RotatE, ComplEx, DistMult, ConvE, TuckER, HolE, RESCAL)
- 🔍 **Hybrid Retrieval**: Combine graph traversal with vector similarity search for powerful queries
- 📊 **Query Intent Classification**: Automatically route queries to optimal retrieval strategy (entity-centric, relational, topological, semantic, analytical, exploratory)
- 🚀 **4 Retrieval Strategies**: Vector-first, graph-first, parallel (with RRF), and adaptive strategies
- 🔄 **Enhanced MiniRAG**: Extended with FalkorDB storage adapter and 3 query modes (minirag_only, kg_embedding_only, hybrid)
- 📈 **Performance Metrics**: 92% query accuracy, <150ms response time, 25% storage reduction
- 🛠️ **18+ KG Commands**: Including `kg init`, `kg train-embeddings`, `kg hybrid-search`, `kg path-find`, and more
- 🐳 **Production Ready**: Docker Compose, Kubernetes manifests, REST API with health checks, and monitoring support
- 📊 **Benchmarking Suite**: Performance testing tools for all retrieval strategies
- 🎨 **Visualization Tools**: 2D/3D embedding visualization, clustering analysis, and statistics
- ✅ **Default Installation**: In v0.1.17, KG features shipped with the standard installation; as of v0.1.17.1 they are installed through the optional `[kg]` extra

### 🆕 New in v0.1.16.11 - Remote Server Support
- Dynamic OLLAMA_HOST handling for remote deployments
- Improved API endpoint resolution for better connectivity

### 🆕 New in v0.1.16 - Flow Diagrams and Prompt Customization
- 📊 **Flow Diagram Support**: Full extraction and analysis of process flows, workflows, and decision trees
- 🔄 **Unified Diagram Detection**: Automatically identifies network, flow, or hybrid diagrams
- 🎯 **Process Intelligence**: Identifies bottlenecks, optimization opportunities, and critical paths
- 📝 **Customizable Prompts**: Export, modify, and import all prompts for industry-specific needs
- 🎨 **Prompt Templates**: Pre-built templates for security, compliance, cloud, and process optimization
- 🔧 **Runtime Overrides**: Change prompts on-the-fly without editing files
- 📈 **Flow Mermaid Generation**: Automatic conversion to flowchart TD/LR format
- 🧠 **Context-Aware Analysis**: Reads 2 paragraphs before/after diagrams for accurate interpretation
- 🔍 **Type-Specific Processing**: Different analysis for network vs flow diagrams
- 🌐 **Hybrid Diagram Support**: Handles diagrams with both network and flow elements
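
To show what the flowchart TD/LR output looks like, here is a small sketch that renders nodes and edges as Mermaid text. The function name and the input shapes (a dict of node labels, a list of edge tuples) are hypothetical; the real extraction pipeline derives these from the LLM output.

```python
def to_mermaid_flowchart(nodes, edges, direction="TD"):
    """Render a Mermaid flowchart from {node_id: label} and (src, dst, label) edges."""
    if direction not in ("TD", "LR"):
        raise ValueError("direction must be 'TD' or 'LR'")
    lines = [f"flowchart {direction}"]
    for node_id, label in nodes.items():
        # Quoted labels allow spaces; labels containing quotes are not escaped here.
        lines.append(f'    {node_id}["{label}"]')
    for src, dst, label in edges:
        arrow = f"-->|{label}|" if label else "-->"
        lines.append(f"    {src} {arrow} {dst}")
    return "\n".join(lines)


diagram = to_mermaid_flowchart(
    {"A": "Receive request", "B": "Validate", "C": "Reject"},
    [("A", "B", ""), ("B", "C", "invalid")],
    direction="LR",
)
```

The resulting string can be pasted into a fenced `mermaid` block in the generated markdown.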

### Previous v0.1.15 - Milvus Vector Database Integration
- 🚀 **20-60x Faster Search**: Sub-100ms query response with Milvus distributed architecture
- 💾 **70% Memory Reduction**: Process 10x more documents with the same hardware
- 🎯 **Enterprise Scale**: From standalone to distributed deployment without code changes
- 🤖 **Qwen3-8B Embeddings**: Advanced 4096-dimensional embeddings via Ollama
- 🔄 **IVF_SQ8 Index**: CPU-optimized scalar quantization for standard hardware
- 📦 **One-Command Setup**: Automatic configuration with `netintel-ocr --init`
- 🐳 **Docker Compose Ready**: Pre-configured stack with etcd, MinIO, and Milvus
- ☸️ **Kubernetes Support**: Production-ready Helm charts for enterprise deployment
- 🔧 **OLLAMA_HOST Detection**: Automatic discovery of Ollama embedding service

### Previous v0.1.14 - High-Performance Deduplication with C++ Core
- ⚡ **50-100x Performance Boost**: C++ core with AVX2 SIMD and OpenMP parallelization
- 🎯 **Three-Level Deduplication**: MD5 (exact), SimHash (fuzzy), CDC (content-level)
- 📦 **Zero-Compilation Install**: Pre-compiled binary wheels for Linux/macOS/Windows
- 🔍 **Near-Duplicate Detection**: SimHash with configurable Hamming distance threshold
- 📊 **Content-Defined Chunking**: Remove repetitive blocks with 30-50% storage reduction
- 🎨 **Version Information**: `netintel-ocr --version` shows C++ core status
- 🔧 **Automatic Fallback**: Python implementation when C++ unavailable
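
The content-level step can be pictured with a short content-defined chunking sketch. It uses a gear-style rolling hash and MD5 chunk digests; the hash choice, the size limits, and the function names are assumptions for illustration and are not taken from the C++ core.

```python
import hashlib
import random

# Deterministic table of 64-bit values; any fixed pseudo-random table works.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def cdc_chunks(data: bytes, avg_size: int = 1024, min_size: int = 256, max_size: int = 4096):
    """Split `data` at content-defined boundaries (avg_size must be a power of two)."""
    shift = 64 - (avg_size.bit_length() - 1)  # keep the top log2(avg_size) bits
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Rolling update: older bytes are shifted out of the 64-bit state.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if length < min_size:
            continue
        # Cut where the top bits are all zero, or when the chunk grows too large.
        if (h >> shift) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def unique_chunks(chunks):
    """Drop chunks whose MD5 digest has already been seen, preserving order."""
    seen, unique = set(), []
    for chunk in chunks:
        digest = hashlib.md5(chunk).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique
```

Because boundaries depend on the content rather than on byte offsets, an insertion near the start of a document tends to change only nearby chunks, so repeated blocks further on can still be detected as duplicates.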

### Previous v0.1.13 - Service-Oriented Architecture
- 🌐 **REST API Server**: FastAPI-based server with full OpenAPI/Swagger documentation
- 🤖 **MCP Server**: Model Context Protocol server for LLM integration
- 📦 **Multi-Scale Deployments**: From single container to enterprise Kubernetes
- 🚀 **Flexible Worker Architecture**: Embedded workers or Kubernetes Jobs

### Previous v0.1.12 - Advanced Database Management
- 🗄️ **Centralized Database Management**: Unified LanceDB with deduplication and MD5 checksums
- 🔍 **Advanced Query Engine**: Vector similarity search with multi-field filtering and reranking
- 📊 **Multiple Output Formats**: JSON, Markdown, and CSV output for queries
- 🚀 **Batch Processing Pipeline**: Parallel PDF processing with progress tracking

### Core Features
- 🚀 **Vector Database Integration (v0.1.7)**: Automatic generation of LanceDB-ready chunks and vector-optimized content
- 🎯 **Intelligent Hybrid Processing**: Automatically detects and processes network diagrams as Mermaid.js, tables as JSON, text as markdown
- 📄 **PDF to Text Conversion**: Convert PDFs to markdown files locally, no token costs
- 🤖 **Multi-Model Support (v0.1.4)**: Use different models for text and network processing for optimal performance
- 📊 **Table Extraction (v0.1.6-v0.1.10)**: Automatic detection and extraction of tables with smart ToC exclusion
- 🖼️ **Visual Understanding**: Turn images and diagrams into detailed text descriptions
- 🔌 **Automatic Network Detection**: No flags needed - network diagrams are detected and converted automatically
- 🎨 **Icons by Default**: Font Awesome icons automatically added to network diagrams for better visualization
- ⏱️ **Smart Timeouts**: Operations timeout gracefully with fallback to simpler methods
- 📊 **Diagram Types Supported**: Network topology, architecture diagrams, data flow diagrams, security diagrams
- 📁 **MD5-Based Organization (v0.1.4)**: Each document stored in unique folder using MD5 checksum
- 📝 **Document Index (v0.1.4)**: Automatic index.md tracking all processed documents
- 📈 **Enhanced Metrics (v0.1.4)**: Comprehensive footer with processing details, errors, and configuration
- ⚡ **Optimized Processing**: Processes up to 100 pages per run with detailed progress tracking
- 🔧 **Flexible Output**: Unified markdown format with seamlessly embedded Mermaid diagrams and tables
- 🔄 **Checkpoint/Resume (v0.1.5)**: Resume interrupted processing from exact stopping point
- 🔍 **Vector Search Ready (v0.1.7)**: Pre-chunked content with minimal metadata for optimal vector search performance
- 🔁 **Vector Regeneration (v0.1.10)**: Regenerate vector files from existing markdown without reprocessing PDFs

## 💼 Use Cases

### Network Documentation
- Convert legacy network diagrams to modern formats
- Extract network topology from vendor documentation
- Audit and inventory network architectures

### Security Analysis
- Map security architecture from compliance documents
- Extract firewall rules and network segmentation
- Document data flow and trust boundaries

### Infrastructure Planning
- Analyze existing network designs
- Extract capacity and redundancy information
- Document interconnections and dependencies

## 📦 Requirements

- Python 3.10+
- Ollama installed and running locally or on a remote server

### Installing Ollama and the Default Model

1. Install [Ollama](https://ollama.com/)
2. Pull the default model:
```bash
ollama pull nanonets-ocr-s:latest
```

### Using a Remote Ollama Server

By default, netintel-ocr connects to Ollama running on localhost. To use a remote Ollama server, set the `OLLAMA_HOST` environment variable:

```bash
# Connect to a remote Ollama server
export OLLAMA_HOST="http://192.168.1.100:11434"
netintel-ocr process pdf document.pdf

# Or run with the environment variable inline
OLLAMA_HOST="http://remote-server:11434" netintel-ocr process pdf document.pdf
```
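
Before processing, it can help to confirm that the configured server is reachable. The sketch below resolves `OLLAMA_HOST` the way a client might (falling back to `http://localhost:11434`, Ollama's default port) and lists the available models through Ollama's `/api/tags` endpoint. The helper names are hypothetical and this is not how netintel-ocr itself resolves the host.

```python
import json
import os
import urllib.request


def ollama_base_url(env=os.environ) -> str:
    """Resolve the Ollama base URL from OLLAMA_HOST, defaulting to localhost."""
    host = env.get("OLLAMA_HOST", "").strip() or "http://localhost:11434"
    if "://" not in host:
        host = f"http://{host}"  # accept bare host:port values
    return host.rstrip("/")


def list_models(base_url: str, timeout: float = 5.0):
    """Return the model names reported by the Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return [model["name"] for model in json.load(resp)["models"]]
```

Running `list_models(ollama_base_url())` raises a `urllib.error.URLError` when the server cannot be reached, which is a quick way to catch a wrong host or a firewall issue.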

### Knowledge Graph Environment Variables (v0.1.17)

When using the hybrid Knowledge Graph system, configure these environment variables:

```bash
# LLM and Embedding Models (required - no defaults in code)
export MINIRAG_LLM="ollama/gemma3:4b-it-qat"              # Recommended LLM model
export MINIRAG_EMBEDDING="ollama/Qwen3-Embedding-8B"      # Recommended embedding model
export MINIRAG_EMBEDDING_DIM="4096"                       # Embedding dimensions

# External Ollama Server (required for KG system)
export OLLAMA_HOST="http://192.168.1.100:11434"          # External Ollama server

# FalkorDB Configuration
export FALKORDB_HOST="localhost"
export FALKORDB_PORT="6379"
export FALKORDB_GRAPH="netintel_kg"

# Milvus Configuration
export MILVUS_HOST="localhost"
export MILVUS_PORT="19530"
export MILVUS_COLLECTION="netintel_vectors"

# PyKEEN Configuration
export PYKEEN_MODEL="TransE"                             # e.g. TransE, RotatE, ComplEx (8 models supported)
export PYKEEN_EMBEDDING_DIM="200"
export PYKEEN_EPOCHS="100"
export PYKEEN_BATCH_SIZE="128"

# Query Configuration
export KG_QUERY_MODE="hybrid"                            # Options: graph, embedding, hybrid
export KG_MAX_RESULTS="20"
export KG_MIN_CONFIDENCE="0.7"
```
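
If you load these variables in your own tooling, failing early on the required ones avoids confusing errors later. The following sketch reads a subset of the variables into a typed settings object; the `KGSettings` class and `load_kg_settings` function are hypothetical, and the fallback values simply mirror the examples above rather than defaults confirmed from the NetIntel-OCR source.

```python
import os
from dataclasses import dataclass

# The LLM and embedding variables are documented as required, with no defaults.
REQUIRED = ("MINIRAG_LLM", "MINIRAG_EMBEDDING", "MINIRAG_EMBEDDING_DIM")


@dataclass(frozen=True)
class KGSettings:
    llm: str
    embedding: str
    embedding_dim: int
    falkordb_host: str
    falkordb_port: int
    query_mode: str
    max_results: int
    min_confidence: float


def load_kg_settings(env=os.environ) -> KGSettings:
    """Build settings from environment variables, failing on missing required ones."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required variables: " + ", ".join(missing))
    # Fallbacks below are assumptions copied from the example values.
    return KGSettings(
        llm=env["MINIRAG_LLM"],
        embedding=env["MINIRAG_EMBEDDING"],
        embedding_dim=int(env["MINIRAG_EMBEDDING_DIM"]),
        falkordb_host=env.get("FALKORDB_HOST", "localhost"),
        falkordb_port=int(env.get("FALKORDB_PORT", "6379")),
        query_mode=env.get("KG_QUERY_MODE", "hybrid"),
        max_results=int(env.get("KG_MAX_RESULTS", "20")),
        min_confidence=float(env.get("KG_MIN_CONFIDENCE", "0.7")),
    )
```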

## Installation

### 🆕 Modular Installation (v0.1.17.1)

Choose your installation based on needs:

```bash
# Minimal installation (500MB) - Core OCR only
pip install netintel-ocr

# With Knowledge Graph (2GB total) - Recommended
pip install "netintel-ocr[kg]"

# Production setup (2.3GB) - KG + Vector + API + Performance
pip install "netintel-ocr[production]"

# Cloud deployment (1.5GB) - Vector + API + MCP
pip install "netintel-ocr[cloud]"

# Everything (2.5GB) - All features
pip install "netintel-ocr[all]"

# Check what's installed
netintel-ocr --version
```

The package uses Ollama for embeddings (default: `qwen3-embedding:8b`, 4096 dimensions) and can optionally store vectors in Milvus.

Alternatively, install the CLI with uv:
```bash
uv tool install netintel-ocr
```

### 🚀 Quick Start - Choose Your Deployment Scale (NEW v0.1.15!)

#### Development Scale (1-50 users, up to 1M documents)
```bash
# Initialize development deployment (default)
netintel-ocr --init
# Automatically detects OLLAMA_HOST
# Generates Docker Compose with Milvus Standalone

# Start the stack
cd ~/.netintel-ocr
docker-compose up -d
# Milvus: http://localhost:19530
# API: http://localhost:8000
# MCP: http://localhost:8001
```

#### Production Scale (100+ users, 100M+ documents)
```bash
# Initialize production deployment
netintel-ocr --init --scale production

# Deploy with Kubernetes
helm install netintel-ocr ./helm \
  --namespace netintel-ocr \
  --create-namespace

# Or use Docker with full monitoring
docker-compose -f docker/docker-compose.large.yml up -d
# Grafana: http://localhost:3000
```

## Usage

### Quick Start Examples

```bash
# Process a network architecture document
netintel-ocr process pdf network-architecture.pdf

# Batch process with Knowledge Graph
netintel-ocr process batch /documents/ --parallel 4

# Query processed documents
netintel-ocr db query "firewall configuration"

# Start production server
netintel-ocr server all --api-port 8000 --mcp-port 8001

# Check system status
netintel-ocr system check
```

### 🆕 v0.1.17 Hierarchical CLI Usage

#### Process Documents (New Syntax!)
```bash
# NEW: Process a PDF document (KG enabled by default)
netintel-ocr process pdf document.pdf

# Process without Knowledge Graph
netintel-ocr process pdf document.pdf --no-kg

# Process specific pages
netintel-ocr process pdf document.pdf --start 5 --end 10

# Batch processing
netintel-ocr process batch /path/to/pdfs/

# Watch directory for new PDFs
netintel-ocr process watch /input/folder --pattern "*.pdf"
```

#### Server Operations
```bash
# Start all services (API + MCP)
netintel-ocr server all

# Start API server only
netintel-ocr server api --port 8000 --workers 4

# Start MCP server
netintel-ocr server mcp --port 8001

# Development server with hot reload
netintel-ocr server dev --reload

# Check health
netintel-ocr server health
```

#### Knowledge Graph Commands
```bash
# Initialize KG system
netintel-ocr kg init

# Process with KG
netintel-ocr kg process document.pdf

# Query the graph
netintel-ocr kg query "MATCH (n:NetworkDevice) RETURN n"

# Natural language query
netintel-ocr kg rag-query "What are the security vulnerabilities?"

# Train embeddings
netintel-ocr kg train-embeddings --model RotatE

# Find similar entities
netintel-ocr kg find-similar "Router-A"

# Visualize embeddings
netintel-ocr kg visualize --method tsne
```

#### Configuration Management
```bash
# Initialize configuration
netintel-ocr config init --template production

# Set configuration values
netintel-ocr config set server.api.port 8000
netintel-ocr config set models.default qwen2.5vl:7b

# Manage profiles
netintel-ocr config profile create production
netintel-ocr config profile use production

# Export environment variables
netintel-ocr config env export > .env
```

### Legacy CLI Usage (Deprecated)
The old syntax still works but is deprecated:
```bash
# OLD SYNTAX (deprecated)
netintel-ocr document.pdf

# Use NEW SYNTAX instead:
netintel-ocr process pdf document.pdf
```

### 🆕 v0.1.15 Commands - Milvus Integration & Vector Search

```bash
# Initialize with Milvus (auto-detects OLLAMA_HOST)
netintel-ocr --init

# Check version and capabilities
netintel-ocr --version
netintel-ocr --version-json  # JSON output with Milvus status

# Process with Milvus vector storage (20-60x faster search)
netintel-ocr document.pdf --vector-db milvus

# Vector similarity search in Milvus
netintel-ocr --search "network topology" \
  --collection netintel_vectors \
  --limit 10

# Process with full deduplication (enhanced with Milvus)
netintel-ocr document.pdf --dedup-mode full

# Find near-duplicates using Milvus binary vectors
netintel-ocr --find-duplicates document.pdf \
  --hamming-threshold 5 \
  --use-milvus

# Show Milvus collection statistics
netintel-ocr --milvus-stats

# Configure advanced processing
netintel-ocr document.pdf \
  --embedding-model qwen3-embedding:8b \
  --index-type IVF_SQ8 \
  --dedup-mode full
```

### v0.1.12 Commands - Database Management

```bash
# Query centralized database with advanced filtering
netintel-ocr --query "network security" \
  --centralized-db ./centralized.lancedb \
  --filters '{"source_type": "network_diagram"}' \
  --output-format json \
  --limit 10

# Merge documents to centralized database
netintel-ocr --merge-to-centralized \
  --output ./output \
  --centralized-db ./unified.lancedb \
  --dedup-strategy md5

# Batch process multiple PDFs with parallel processing
netintel-ocr --batch-ingest ./pdf_directory \
  --output ./batch_output \
  --parallel-workers 4 \
  --auto-merge

# Database management commands
netintel-ocr --db-stats ./centralized.lancedb
netintel-ocr --db-optimize ./centralized.lancedb --vacuum
netintel-ocr --db-export ./centralized.lancedb --format json
```

**Cloud Workflow with S3/MinIO:**
```bash
# Configure S3/MinIO storage
export S3_ENDPOINT=https://s3.amazonaws.com
export S3_BUCKET=netintel-documents
export AWS_ACCESS_KEY_ID=your-key
export AWS_SECRET_ACCESS_KEY=your-secret

# Process with cloud storage
netintel-ocr document.pdf --s3-sync --s3-bucket netintel-documents

# Batch process from cloud storage
netintel-ocr --batch-ingest s3://netintel-documents/pdfs/ \
  --output s3://netintel-documents/output/ \
  --parallel-workers 8
```

### Multi-Model Processing (NEW v0.1.4!)

Use different Ollama models optimized for specific tasks:
```bash
# Use fast OCR model for text, powerful model for diagrams
netintel-ocr document.pdf --model nanonets-ocr-s --network-model qwen2.5vl

# Fast processing with lightweight models
netintel-ocr document.pdf --model moondream --network-model bakllava

# Heavy processing for complex network diagrams
netintel-ocr document.pdf --network-model cogvlm --timeout 120
```

**Multi-Model Benefits:**
- 30-50% faster text extraction with OCR-optimized models
- Better diagram understanding with vision-language models
- Resource efficiency by using appropriate model sizes
- Flexibility to experiment with different combinations

**Recommended Model Combinations:**
| Purpose | Text Model | Network Model | Speed |
|---------|------------|---------------|-------|
| Balanced (Default) | nanonets-ocr-s | qwen2.5vl | Medium |
| Fast Processing | moondream | bakllava | Fast |
| Maximum Accuracy | qwen2.5vl | cogvlm | Slow |
| Resource Limited | moondream | llava-phi3 | Fast |

### Table Extraction (NEW v0.1.6!)

NetIntel-OCR now automatically detects and extracts tables from PDFs:
```bash
# Tables are extracted by default in hybrid mode
netintel-ocr document.pdf

# Use library-first extraction for faster processing
netintel-ocr document.pdf --table-method pdfplumber

# Use LLM for complex tables with merged cells
netintel-ocr document.pdf --table-method llm

# Save tables as separate JSON files
netintel-ocr document.pdf --save-table-json

# Disable table extraction for faster processing
netintel-ocr document.pdf --no-tables
```

**Table Extraction Features:**
- **Automatic Detection**: Tables identified alongside network diagrams
- **Multiple Methods**: Library-first (pdfplumber), LLM-enhanced, or hybrid
- **Complex Table Support**: Handles merged cells, multi-row fields, nested headers
- **Structured Output**: Tables converted to JSON with validation
- **Markdown Integration**: Tables embedded in markdown with both rendered and JSON views

### Vector Database Integration (NEW v0.1.7!)

NetIntel-OCR now **automatically generates** vector database files optimized for RAG applications:

```bash
# Vector generation is ON by default - creates LanceDB-ready chunks
netintel-ocr document.pdf

# Disable vector generation (v0.1.6 behavior)
netintel-ocr document.pdf --no-vector

# Customize chunking strategy
netintel-ocr document.pdf --chunk-size 512 --chunk-overlap 50

# Use semantic chunking (default) vs fixed-size
netintel-ocr document.pdf --chunk-strategy semantic
```

**Vector Features:**
- **Automatic Generation**: Creates `document-vector.md` and `chunks.jsonl` by default
- **Content Filtering**: Removes processing artifacts, keeps only source content
- **Minimal Metadata**: Only source filename, page numbers, and indexed date
- **LanceDB Optimized**: Pre-chunked JSONL ready for direct ingestion
- **Smart Chunking**: Semantic boundaries respect document structure
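
For comparison with semantic chunking, the fixed-size strategy controlled by `--chunk-size` and `--chunk-overlap` can be sketched as a sliding window. The unit is assumed to be whitespace-separated tokens here; the README does not state whether the tool counts tokens or characters, and the function name is hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size: int = 512, overlap: int = 50):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not tokens:
        return []
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    last_start = max(len(tokens) - overlap, 1)
    return [tokens[i:i + chunk_size] for i in range(0, last_start, step)]


words = "the core router connects both firewalls to the DMZ switch".split()
windows = chunk_with_overlap(words, chunk_size=4, overlap=1)
```

The overlap repeats the tail of one chunk at the head of the next, which keeps sentences that straddle a boundary retrievable from either side.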

**Using with LanceDB:**
```python
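import json

import lancedb

# Load chunks generated by NetIntel-OCR (replace <md5> with the document's checksum folder)
with open("output/<md5>/lancedb/chunks.jsonl", encoding="utf-8") as f:
    chunks = [json.loads(line) for line in f if line.strip()]

# Create a LanceDB table from the chunk records
db = lancedb.connect("./my_lancedb")
table = db.create_table("documents", data=chunks)

# Search your documents. A text query needs an embedding function or a
# full-text index on the table; otherwise pass a query vector to search().
results = table.search("network configuration").limit(5).to_list()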
```
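
To locate the output folder for a given PDF, you can compute its checksum yourself. This sketch assumes the folder name is the hex MD5 digest of the PDF file's bytes and that chunks live under `lancedb/chunks.jsonl` as in the path above; both helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_md5(path, block_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def chunks_path(pdf_path, output_dir="output") -> Path:
    """Build the expected chunks.jsonl location for a processed PDF (assumed layout)."""
    return Path(output_dir) / file_md5(pdf_path) / "lancedb" / "chunks.jsonl"
```

For example, `chunks_path("document.pdf")` returns a path of the form `output/<md5>/lancedb/chunks.jsonl` that can be passed to `open()`.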

### Performance Optimization

For faster processing of network diagrams, use the `--fast-extraction` flag:
```bash
# Fast extraction mode - reduces extraction time by 50-70%
netintel-ocr document.pdf --fast-extraction

# Combine with multi-model and timeout for best performance
netintel-ocr document.pdf --model nanonets-ocr-s --network-model bakllava --fast-extraction --timeout 30
```

**Fast extraction benefits:**
- Detection: ~15 seconds (vs 30-60s standard)
- Extraction: ~20 seconds (vs 30-60s standard)
- Uses simplified prompts for quicker LLM responses
- Automatic fallback if fast extraction fails

### Command Line Options

#### Basic Options
- `--output`, `-o`: Base output directory (default: "output", documents stored in `output/<md5_checksum>/`)
- `--model`, `-m`: Ollama model for text extraction (default: "nanonets-ocr-s:latest")
- `--network-model`: Separate model for network diagram processing (NEW v0.1.4)
- `--flow-model`: Dedicated model for flow diagram processing (NEW v0.1.16.6, defaults to --network-model)
- `--keep-images`, `-k`: Keep the intermediate image files (default: False)
- `--width`, `-w`: Width to resize images to, 0 to skip resizing (default: 0)
- `--start`, `-s`: Start page number (default: 0, processes from beginning)
- `--end`, `-e`: End page number (default: 0, processes to end)
- `--resume`: Resume processing from checkpoint if available (NEW v0.1.5)

#### Processing Mode Options
- `--text-only`, `-t`: Skip network diagram detection for faster text-only processing
- `--network-only`: Process only network diagrams, skip regular text pages

#### Network Diagram Options (applies to default mode)
- `--confidence`, `-c`: Minimum confidence threshold for network diagram detection (0.0-1.0, default: 0.7)
- `--no-icons`: Disable Font Awesome icons in Mermaid diagrams (icons are enabled by default)
- `--diagram-only`: Only extract network diagrams without page text (by default, both are extracted)
- `--timeout`: Timeout in seconds for each LLM operation (default: 60s, increase for complex diagrams)

#### Vector Database Options (NEW v0.1.7)
- `--no-vector`: Disable vector generation (default: enabled)
- `--vector-format`: Target vector DB format (default: lancedb, options: pinecone, weaviate, qdrant, chroma)
- `--chunk-size`: Chunk size in tokens (default: 1000)
- `--chunk-overlap`: Overlap between chunks (default: 100)
- `--chunk-strategy`: Chunking strategy (default: semantic, options: fixed, sentence)
- `--embedding-metadata`: Include extended metadata (reduces content space)
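
The interaction between `--chunk-size` and `--chunk-overlap` can be sketched in a few lines. This is a minimal illustration of fixed-size chunking, not NetIntel-OCR's actual implementation: `chunk_tokens` is a hypothetical helper, and whitespace-separated words stand in for real tokens.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 1000, chunk_overlap: int = 100) -> List[List[str]]:
    """Split a token list into fixed-size windows that share `chunk_overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Whitespace splitting stands in for a real tokenizer (assumption).
words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so a larger overlap produces more chunks for the same document.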

### Examples

#### Basic Usage (with automatic network detection)
```bash
# DEFAULT: Automatic network diagram detection (with icons)
netintel-ocr document.pdf

# Process with custom settings
netintel-ocr document.pdf --confidence 0.8

# Increase timeout for complex diagrams
netintel-ocr document.pdf --timeout 120

# Text-only mode (faster, no detection)
netintel-ocr document.pdf --text-only

# Process specific pages
netintel-ocr document.pdf --start 1 --end 5

# Use a different Ollama model
netintel-ocr document.pdf --model qwen2.5vl:latest
```

#### Specialized Processing
```bash
# Process ONLY network diagrams (skip text pages)
netintel-ocr network-architecture.pdf --network-only

# Higher confidence threshold (stricter detection)
netintel-ocr document.pdf --confidence 0.9

# Disable icons if not needed
netintel-ocr document.pdf --no-icons

# Extract only diagrams without text (faster)
netintel-ocr document.pdf --diagram-only

# Faster text-only processing
netintel-ocr text-document.pdf --text-only
```

Process large documents in sections (max 100 pages per run):
```bash
# Process first 100 pages
netintel-ocr large-document.pdf --start 1 --end 100

# Process next section
netintel-ocr large-document.pdf --start 101 --end 200

# Process specific chapter (e.g., pages 50-100)
netintel-ocr large-document.pdf --start 50 --end 100
```

## Checkpoint/Resume Capability (NEW v0.1.5)

The tool now supports automatic checkpoint saving and resume functionality for long documents:

### How It Works
- **Automatic Saving**: Processing state is saved after each page
- **Checkpoint Location**: Stored in `output/<md5>/.checkpoint/`
- **Resume on Interruption**: Use `--resume` to continue from where you left off
- **Page-Level Tracking**: Each page is tracked individually
- **Smart Skip**: Already processed pages are skipped when resuming
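
Page-level tracking with safe saves can be sketched as follows. This is an illustration under stated assumptions, not the tool's real checkpoint format: the `state.json` file name, its JSON layout, and both helper functions are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(checkpoint_dir: Path, completed_pages: set) -> Path:
    """Write the set of completed pages atomically (temp file, then rename)."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    target = checkpoint_dir / "state.json"  # hypothetical file name
    fd, tmp_name = tempfile.mkstemp(dir=checkpoint_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump({"completed_pages": sorted(completed_pages)}, tmp)
    os.replace(tmp_name, target)  # atomic when both paths are on one filesystem
    return target


def pages_to_process(checkpoint_dir: Path, total_pages: int) -> list:
    """Return the pages still pending, skipping those already checkpointed."""
    state_file = checkpoint_dir / "state.json"
    done = set()
    if state_file.exists():
        done = set(json.loads(state_file.read_text())["completed_pages"])
    return [page for page in range(1, total_pages + 1) if page not in done]


with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt = Path(tmp_dir) / ".checkpoint"
    save_checkpoint(ckpt, set(range(1, 46)))  # pages 1-45 finished
    remaining = pages_to_process(ckpt, total_pages=100)
    print(len(remaining), remaining[0])  # 55 46
```

Writing to a temporary file and renaming it means an interruption during the save leaves the previous checkpoint intact.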

### Usage Examples
```bash
# Start processing a large document
netintel-ocr large-document.pdf

# If interrupted (Ctrl+C, power failure, etc.), resume processing
netintel-ocr large-document.pdf --resume

# Resume with different settings (completed pages are kept)
netintel-ocr large-document.pdf --resume --timeout 120 --network-model qwen2.5vl
```

### Resume Information
When resuming, you'll see a summary like:
```
╔════════════════════════════════════════════════════════════╗
║                  RESUME CHECKPOINT FOUND                   ║
╠════════════════════════════════════════════════════════════╣
║ Previous Processing:                                        ║
║   • Pages completed: 45/100                                ║
║   • Network diagrams found: 5                              ║
║   • Regular pages: 40                                      ║
║   • Failed pages: 0                                        ║
║                                                            ║
║ Resume Information:                                        ║
║   • Will skip 45 already processed pages                   ║
║   • Will process 55 remaining pages                        ║
║   • Starting from page 46                                  ║
╚════════════════════════════════════════════════════════════╝
```

### Benefits
- **No Lost Work**: Never lose progress on long documents
- **Resource Efficient**: Don't reprocess completed pages
- **Flexible**: Change settings when resuming
- **Automatic**: No manual intervention needed

## Vector Regeneration (v0.1.10)

### Regenerate Vector Files Without Reprocessing
Use `--vector-regenerate` to regenerate vector database files from existing markdown output:

```bash
# First time processing
netintel-ocr document.pdf

# Regenerate vectors with different chunk settings
netintel-ocr document.pdf --vector-regenerate --chunk-size 500 --chunk-overlap 100

# Change vector database format
netintel-ocr document.pdf --vector-regenerate --vector-format pinecone

# Use different chunking strategy
netintel-ocr document.pdf --vector-regenerate --chunk-strategy sentence
```

### When to Use Vector Regeneration
- **Optimize chunk size**: Adjust for better embedding performance
- **Change vector format**: Switch between LanceDB, Pinecone, Weaviate, etc.
- **Update metadata**: Add or remove extended metadata
- **Fix errors**: Regenerate after fixing vector generation issues
- **Experiment**: Try different strategies without re-running OCR

## Processing Guidelines

### Document Size Recommendations

| Document Size | Processing Strategy | Example |
|--------------|-------------------|---------|
| 1-50 pages | Single run | `netintel-ocr doc.pdf` |
| 51-100 pages | Single run or split | `netintel-ocr doc.pdf` |
| 101-300 pages | Process in 100-page sections | See examples below |
| 300+ pages | Process key sections only | Use specific page ranges |

### Processing Large Documents

For a 250-page document:
```bash
# Section 1: Pages 1-100
netintel-ocr document.pdf --start 1 --end 100 -o output_section1

# Section 2: Pages 101-200
netintel-ocr document.pdf --start 101 --end 200 -o output_section2

# Section 3: Pages 201-250
netintel-ocr document.pdf --start 201 --end 250 -o output_section3
```
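
For documents of arbitrary length, the section boundaries can be computed rather than typed by hand. The sketch below uses a hypothetical `page_sections` helper and only prints the commands; the flags are the ones shown above.

```python
from typing import List, Tuple


def page_sections(total_pages: int, section_size: int = 100) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based (start, end) ranges that cover the document."""
    if total_pages < 0 or section_size <= 0:
        raise ValueError("total_pages must be >= 0 and section_size must be > 0")
    return [
        (start, min(start + section_size - 1, total_pages))
        for start in range(1, total_pages + 1, section_size)
    ]


# Print one command per section for a 250-page document.
for index, (start, end) in enumerate(page_sections(250), start=1):
    print(f"netintel-ocr document.pdf --start {start} --end {end} -o output_section{index}")
```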

## Network Diagram Detection (Now Default!)

**NEW**: Network diagram detection is now enabled by default! No flags needed.

netintel-ocr automatically (in order):

1. **Transcribes** text content FIRST (guaranteed capture)
2. **Detects** network diagrams in PDF pages
3. **Identifies** components (routers, switches, firewalls, servers, databases, etc.)
4. **Extracts** connections and relationships
5. **Converts** to Mermaid.js format
6. **Combines** BOTH the diagram AND the page's text content
7. **Embeds** everything in unified markdown output
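
The ordering above, with text captured before any diagram work, can be sketched as follows. This is an illustration of the control flow only: `process_page` and its three callables are hypothetical stand-ins for the model calls, and the 0.7 default mirrors the documented `--confidence` default.

```python
from typing import Callable


def process_page(
    page: bytes,
    transcribe: Callable[[bytes], str],
    detect: Callable[[bytes], float],
    to_mermaid: Callable[[bytes], str],
    confidence: float = 0.7,
) -> dict:
    """Transcribe first, then attempt diagram extraction without risking the text."""
    result = {"text": transcribe(page), "mermaid": None}  # text is captured first
    try:
        if detect(page) >= confidence:
            result["mermaid"] = to_mermaid(page)
    except Exception as exc:  # a diagram failure must not discard the text
        result["diagram_error"] = str(exc)
    return result


def broken_extractor(_page: bytes) -> str:
    raise RuntimeError("model timed out")


# The diagram step fails, but the transcribed text is still returned.
page_result = process_page(b"", lambda p: "page text", lambda p: 0.9, broken_extractor)
print(page_result["text"], page_result["mermaid"])
```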

### Supported Network Components
- 🔀 Routers and Switches
- 🛡️ Firewalls
- 🖥️ Servers and Workstations
- 💾 Databases
- ⚖️ Load Balancers
- ☁️ Cloud Services
- 📡 Wireless Access Points

### Output Format

Network diagrams are saved as markdown with embedded Mermaid code:

````markdown
# Page 5 - Network Diagram

**Type**: topology
**Detection Confidence**: 0.95
**Components**: 8 detected
**Connections**: 12 detected

## Diagram

```mermaid
graph TB
    Router([Main Router])
    Switch[Core Switch]
    FW{{Firewall}}
    Server1[(Web Server)]

    Router --> FW
    FW --> Switch
    Switch --> Server1
```

## Page Text Content

This section describes the SD-WAN architecture with multiple branch offices
connecting to headquarters through various transport methods including MPLS,
broadband, and LTE connections. The solution provides path selection,
application-aware routing, and centralized management...
````

## Output Structure (Enhanced v0.1.4)

All output is organized using MD5 checksums for unique document identification:

```
output/                                    # Base directory (configurable with --output)
├── index.md                              # Master index tracking all processed documents
├── 6c928950e6b73fffe316e0ad6bba3a67/    # MD5 checksum as folder name
│   ├── markdown/                         # All transcribed content
│   │   ├── page_001.md                  # Individual page (text or diagram)
│   │   ├── page_002.md    
│   │   └── document.md                  # Complete merged document with footer metrics
│   ├── images/                          # Original page images (if --keep-images)
│   └── summary.md                       # Processing summary and statistics
└── 0611ca05dab284e943e3b00d3993d424/    # Another document's folder
    └── ...

```

**Benefits:**
- Same document won't be processed twice (deduplication)
- Easy to find previous processing results
- `index.md` provides an overview of all processed documents

### Index File (output/index.md)
Automatically tracks all processed documents:
```markdown
| Filename | Timestamp | MD5 Checksum | Folder | Processing Time |
|----------|-----------|--------------|--------|----------------|
| network.pdf | 2025-08-20 14:30:15 | `6c9289...` | [📁 6c9289...](./6c9289.../) | 2m 30s |
| manual.pdf | 2025-08-20 14:35:22 | `0611ca...` | [📁 0611ca...](./0611ca.../) | 1m 45s |
```
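
To find a document's folder without opening the index, you can compute its checksum yourself. The sketch below assumes the folder name is the MD5 of the PDF's raw bytes, which this README does not state explicitly; `md5_checksum` and `output_folder` are hypothetical helpers.

```python
import hashlib
from pathlib import Path


def md5_checksum(path: Path, block_size: int = 1 << 20) -> str:
    """Hash a file in blocks so large PDFs are not loaded into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def output_folder(pdf_path: Path, base_dir: Path = Path("output")) -> Path:
    """Assumed layout: <base_dir>/<md5 of the PDF bytes>/."""
    return base_dir / md5_checksum(pdf_path)


pdf = Path("document.pdf")
if pdf.exists():  # only runs when a sample file is present
    print(output_folder(pdf))
```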

### Enhanced Footer Metrics (NEW v0.1.4)
Every merged document includes comprehensive processing metrics:
- **Document Info**: Source file, size, MD5 checksum, pages processed
- **Processing Details**: Date/time, models used, processing time, mode
- **Quality Report**: Errors, warnings, success metrics
- **Configuration**: Settings used during processing

## Processing Modes

### Default: Hybrid Mode (Text-First)
- **Text-First Approach**: ALWAYS transcribes text before attempting diagram detection
- **Guaranteed Content**: Text is captured even if diagram processing fails
- **Automatic Detection**: Every page is analyzed for network diagrams
- **Dual Content Extraction**: Pages with diagrams include BOTH Mermaid diagram AND text content
- **Intelligent Processing**: Network diagrams → Mermaid (with icons), Text → Markdown
- **Progress Tracking**: Detailed step-by-step progress messages
- **Smart Timeouts**: Operations time out after 60s with automatic fallback
- **Processing Time**: 30-60 seconds per page
- **Best For**: Most documents (mixed content)

### Text-Only Mode (`--text-only`)
- **No Detection**: Skip diagram detection for speed
- **Processing Time**: 15-30 seconds per page
- **Best For**: Documents with only text

### Network-Only Mode (`--network-only`)
- **Diagram Focus**: Process only network diagrams
- **Processing Time**: 30-60 seconds per diagram
- **Best For**: Network architecture documents

## Performance & Troubleshooting

### If Processing is Slow or Stuck

The tool now includes detailed progress messages showing what's happening and which models are being used:
```
  Page 3: Processing...
    Transcribing page text (nanonets-ocr-s)... Done (12.3s)  <-- Text captured first!
    Checking for network diagram (qwen2.5vl)... Done (2.1s)
    Network diagram detected (confidence: 0.90)
    Type: topology
    Extracting components (qwen2.5vl)... Done (5.1s)
    Generating Mermaid diagram (qwen2.5vl)... Done (8.2s)
    Validating Mermaid syntax... Valid (0.1s)
    Writing to file... Done (0.1s)
    Total processing time: 27.9s
```

**Important**: Text is ALWAYS transcribed first, so even if diagram processing times out or fails, you'll still have the page content.

If an operation takes too long:
- **Default timeout**: 60 seconds per operation
- **Adjust timeout**: Use `--timeout 120` for complex diagrams
- **Automatic fallback**: If the LLM times out, the tool falls back to simpler methods
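
The timeout-then-fallback pattern can be sketched with the standard library. This is a generic illustration, not the tool's internals: `run_with_timeout` and `slow_llm_call` are hypothetical, and exceptions raised by the primary call are deliberately left to propagate.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_timeout(primary: Callable[[], T], fallback: Callable[[], T], timeout: float = 60.0) -> T:
    """Return primary() if it finishes within `timeout` seconds, else fallback()."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(primary)
    try:
        return future.result(timeout=timeout)
    except FutureTimeout:
        return fallback()
    finally:
        # Do not wait for the abandoned call; a thread cannot be killed,
        # so it simply finishes in the background.
        pool.shutdown(wait=False)


def slow_llm_call() -> str:
    time.sleep(0.5)  # simulates a model that responds too slowly
    return "detailed extraction"


print(run_with_timeout(slow_llm_call, lambda: "simplified extraction", timeout=0.1))
```

Because the abandoned call keeps running until it returns, a real implementation would also want a timeout on the underlying HTTP request.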

### Common Issues and Fixes

#### Mermaid Syntax Errors (Robust Auto-Fix)
The tool uses a comprehensive validator to automatically fix Mermaid syntax issues:

**Phase 1 - Basic Cleanup:**
- C-style comments (`//`) → Removed or converted to Mermaid comments (`%%`)
- Curly braces in graph declarations → Removed
- Invalid syntax elements → Cleaned

**Phase 2 - Node ID Fixing:**
- Spaces in node IDs → Converted to underscores (e.g., `Data Center` → `Data_Center`)
- Special characters → Replaced with safe alternatives
- Duplicate node IDs → Automatically numbered (e.g., `Server`, `Server2`, `Server3`)

**Phase 3 - Connection Fixing:**
- Updates all connections to use fixed node IDs
- Preserves connection types and labels
- Maintains directional flow

**Phase 4 - Style Application:**
- Fixes class applications to use corrected node IDs
- Preserves styling and visual attributes

**Examples of Auto-Fixes:**
- `subgraph_DMZ` → `subgraph DMZ`
- `Data Center (HQ)` → `Data_Center_HQ` (as node ID)
- Parentheses in labels → Automatically quoted
- Multiple `Secure SD-WAN` nodes → `Secure_SD_WAN`, `Secure_SD_WAN2`, etc.
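
The node ID fixes from Phase 2 can be approximated in a few lines. This is a simplified sketch rather than the validator itself: `sanitize_node_id` and `assign_node_ids` are hypothetical names, and the exact replacement rules are assumed from the examples above.

```python
import re
from collections import Counter
from typing import List


def sanitize_node_id(label: str) -> str:
    """Turn a free-form label into an ID made of letters, digits, and underscores."""
    node_id = re.sub(r"[^0-9A-Za-z]+", "_", label).strip("_")
    return node_id or "node"  # fall back when nothing usable remains


def assign_node_ids(labels: List[str]) -> List[str]:
    """Sanitize each label and number repeats: X, X2, X3, ..."""
    seen: Counter = Counter()
    ids = []
    for label in labels:
        base = sanitize_node_id(label)
        seen[base] += 1
        ids.append(base if seen[base] == 1 else f"{base}{seen[base]}")
    return ids


print(assign_node_ids(["Data Center (HQ)", "Secure SD-WAN", "Secure SD-WAN", "Server"]))
# ['Data_Center_HQ', 'Secure_SD_WAN', 'Secure_SD_WAN2', 'Server']
```

This sketch does not guard against a numbered ID colliding with a label that already ends in a digit (for example a literal `Server2`); a production validator would need to check for that.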

### Centralized Database Management (NEW v0.1.12!)

NetIntel-OCR now supports unified database management with advanced query capabilities:

```bash
# Create unified database from per-document databases
netintel-ocr --merge-to-centralized --output ./documents --centralized-db ./unified.lancedb

# Query with advanced filtering and ranking
netintel-ocr --query "firewall configuration" \
  --centralized-db ./unified.lancedb \
  --filters '{"document_type": "network_diagram", "confidence": {"$gte": 0.8}}' \
  --rerank-strategy semantic \
  --output-format json \
  --limit 20

# Get database statistics and health
netintel-ocr --db-stats ./unified.lancedb
netintel-ocr --db-optimize ./unified.lancedb --vacuum --reindex
```

**Key Features:**
- **Deduplication**: Automatic MD5-based duplicate detection
- **Multi-field Filtering**: Query by source, type, confidence, date ranges
- **Reranking**: Semantic, hybrid, and temporal reranking strategies
- **Export Formats**: JSON, Markdown, CSV with customizable fields
- **Validation**: Automatic schema validation and integrity checks
- **Statistics**: Comprehensive database metrics and health monitoring
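
The `--filters` JSON shown above combines exact matches with comparison operators. The sketch below shows one way such a filter could be evaluated against a record; it is an illustration, not the query engine's code. Only `$gte` appears in this README; the other operators and the `matches` helper are assumptions.

```python
import json
from typing import Any, Dict

# "$gte" comes from the example above; the remaining operators are assumed.
OPERATORS = {
    "$gte": lambda value, bound: value >= bound,
    "$gt": lambda value, bound: value > bound,
    "$lte": lambda value, bound: value <= bound,
    "$lt": lambda value, bound: value < bound,
}


def matches(record: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Return True when the record satisfies every field condition."""
    for field, condition in filters.items():
        if field not in record:
            return False
        value = record[field]
        if isinstance(condition, dict):  # operator form, e.g. {"$gte": 0.8}
            if not all(OPERATORS[op](value, bound) for op, bound in condition.items()):
                return False
        elif value != condition:  # plain value means exact match
            return False
    return True


filters = json.loads('{"document_type": "network_diagram", "confidence": {"$gte": 0.8}}')
record = {"document_type": "network_diagram", "confidence": 0.85}
print(matches(record, filters))  # True
```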

### Enhanced Batch Processing (NEW v0.1.12!)

Process multiple PDFs efficiently with parallel processing and automatic merging:

```bash
# Batch process directory with parallel workers
netintel-ocr --batch-ingest ./pdf_directory \
  --output ./batch_output \
  --parallel-workers 6 \
  --checkpoint-interval 5 \
  --auto-merge \
  --s3-sync

# Resume interrupted batch processing
netintel-ocr --batch-ingest ./pdf_directory \
  --output ./batch_output \
  --resume-batch \
  --skip-existing
```

**Performance Benefits:**
- **Parallel Processing**: Up to 8x faster with multiple workers
- **Progress Tracking**: Real-time progress with ETA and throughput
- **Checkpoint Resume**: Resume from interruption point
- **Memory Management**: Intelligent worker allocation based on system resources
- **Auto-merge**: Automatic centralized database updates
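
The parallel-worker idea can be illustrated with a thread pool that records failures per file instead of aborting the batch. This is a sketch under assumptions: `batch_process` is a hypothetical helper, and `fake_process` stands in for the real per-document work.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Dict, Iterable


def batch_process(pdfs: Iterable[Path], process: Callable[[Path], str], workers: int = 6) -> Dict[Path, str]:
    """Run `process` on every PDF concurrently and collect one status per file."""
    results: Dict[Path, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(process, pdf): pdf for pdf in pdfs}
        for future in as_completed(futures):
            pdf = futures[future]
            try:
                results[pdf] = future.result()
            except Exception as exc:  # one bad file should not stop the batch
                results[pdf] = f"failed: {exc}"
    return results


def fake_process(pdf: Path) -> str:
    return f"processed {pdf.name}"  # stand-in for the real per-document work


print(batch_process([Path("a.pdf"), Path("b.pdf")], fake_process, workers=2))
```

Threads suit this workload when each worker mostly waits on an LLM server; CPU-bound steps would call for processes instead.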

### S3/MinIO Cloud Storage (NEW v0.1.12!)

Full cloud storage integration for distributed deployments:

```bash
# Configure cloud storage (placeholder values; substitute your own and never commit real credentials)
export S3_ENDPOINT=https://minio.company.com
export S3_BUCKET=netintel-docs
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"

# Process with cloud sync
netintel-ocr document.pdf --s3-sync --s3-backup

# Batch process from cloud
netintel-ocr --batch-ingest s3://netintel-docs/input/ \
  --output s3://netintel-docs/output/ \
  --centralized-db s3://netintel-docs/unified.lancedb
```

**Cloud Features:**
- **Bi-directional Sync**: Upload/download with versioning
- **Backup/Restore**: Automatic backup with retention policies
- **Distributed Access**: Multiple workers can access shared storage
- **Credentials Management**: Support for AWS IAM, MinIO admin, environment variables

### Advanced Embedding Management (NEW v0.1.12!)

Enhanced embedding generation with multiple providers and caching:

```bash
# Configure multiple embedding providers
netintel-ocr document.pdf \
  --embedding-provider openai \
  --embedding-model text-embedding-3-large \
  --embedding-cache-ttl 7200 \
  --batch-size 50

# Use local Ollama embeddings
netintel-ocr document.pdf \
  --embedding-provider ollama \
  --embedding-model mxbai-embed-large \
  --embedding-cache ./embeddings_cache
```

**Embedding Features:**
- **Multiple Providers**: OpenAI, Ollama, HuggingFace support
- **Caching with TTL**: Intelligent caching to avoid recomputation
- **Batch Processing**: Efficient batch embedding generation
- **Model Management**: Automatic model configuration and validation
- **Cost Optimization**: Cache hits reduce API costs by up to 90%
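
Caching with a TTL can be sketched as a small wrapper around any embedding function. This is an in-memory illustration only, not the tool's cache: the `EmbeddingCache` class, its hashing scheme, and the injectable clock are all hypothetical.

```python
import hashlib
import time
from typing import Callable, Dict, List, Tuple


class EmbeddingCache:
    """In-memory cache keyed by a hash of the text, with per-entry expiry."""

    def __init__(
        self,
        embed: Callable[[str], List[float]],
        ttl_seconds: float = 7200,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._embed = embed
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[float]]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> List[float]:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        entry = self._entries.get(key)
        now = self._clock()
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        self.misses += 1
        vector = self._embed(text)  # the expensive provider call
        self._entries[key] = (now, vector)
        return vector


# A toy embedding function stands in for a real provider.
cache = EmbeddingCache(lambda text: [float(len(text))], ttl_seconds=7200)
cache.get("firewall configuration")
cache.get("firewall configuration")
print(cache.hits, cache.misses)  # 1 1
```

Only cache misses reach the provider, which is where the API cost savings come from.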

## Recent Improvements

### Version 0.1.12 (2025-08-21)
- ✅ **Centralized Database Management**: Unified LanceDB with MD5 deduplication
- ✅ **Advanced Query Engine**: Vector search with filtering, reranking, and multiple output formats
- ✅ **Batch Processing Pipeline**: Parallel PDF processing with progress tracking and checkpoints
- ✅ **S3/MinIO Storage Backend**: Cloud storage integration with bi-directional sync
- ✅ **Enhanced CLI Commands**: --query, --merge-to-centralized, --batch-ingest, --db-stats, --db-optimize
- ✅ **Embedding Management**: Multiple provider support with caching and TTL
- ✅ **Database Optimization**: Validation, statistics, export, and backup capabilities

### Version 0.1.11 (2025-08-21)
- ✅ **Docker Support**: Complete Docker containerization with MinIO integration
- ✅ **Kubernetes Ready**: Full Helm chart for production deployments
- ✅ **Project Initialization**: `--init` command creates complete containerized environment
- ✅ **Configuration Management**: YAML-based configuration with environment variable overrides
- ✅ **Query Interface Foundation**: Query vector databases (enhanced in v0.1.12)
- ✅ **Centralized DB Foundation**: Merge per-document databases (enhanced in v0.1.12)

### Version 0.1.5 (2025-08-20)
- ✅ **Checkpoint/Resume**: Automatic saving and resume capability for long documents
- ✅ **Page-Level Tracking**: Individual page checkpoint tracking
- ✅ **Resume Summary**: Clear display of resume status and remaining work
- ✅ **Atomic Saves**: Checkpoint integrity with atomic file operations
- ✅ **Automatic Cleanup**: Checkpoints removed after successful completion

### Version 0.1.4 (2025-08-20)
- ✅ **Multi-Model Support**: Use different models for text and network processing
- ✅ **MD5-Based Output**: Unique folders per document using MD5 checksums
- ✅ **Document Index**: Automatic index.md tracking all processed documents
- ✅ **Enhanced Footer**: Comprehensive metrics in merged documents
- ✅ **Simplified Defaults**: Output to `output/` instead of timestamped folders
- ✅ **Model Progress Display**: Shows which model is being used for each operation
- ✅ **Deduplication**: Same document uses same output folder

### Version 0.1.3
- ✅ **Hybrid Mode by Default**: Automatic network diagram detection
- ✅ **Text-First Processing**: Guarantees content capture before diagram extraction
- ✅ **Fast Extraction Mode**: 50-70% faster processing option
- ✅ **Enhanced Error Recovery**: Graceful fallbacks and timeout management

### Version 0.1.0 
- ✅ **Initial PyPI release**
- ✅ **Fixed Mermaid syntax issues**: Automatically handles parentheses in node labels
- ✅ **Improved component detection**: Fixed issue with multiple types being listed
- ✅ **Enhanced error handling**: Better fallback for malformed LLM responses
- ✅ **Automatic syntax correction**: C-style comments and invalid syntax auto-fixed
- ✅ **Better type selection**: Ensures components have single, specific types

## Limitations

- **Maximum 100 pages per processing run**: This limit ensures optimal processing time and prevents memory issues. For larger documents, use the `--start` and `--end` flags to process specific sections.
- **Network Detection Accuracy**: Detection confidence varies based on diagram complexity and clarity. Adjust the `--confidence` threshold as needed.
- **Model Requirements**: Network detection requires vision-capable models (e.g., nanonets-ocr-s, qwen2.5vl, llava).
- **Timeout Behavior**: Operations that exceed the timeout fall back to simpler processing methods.
