Metadata-Version: 2.4
Name: oprel
Version: 0.5.0
Summary: Oprel is a high-performance Python library for running large language models locally. It provides a production-ready runtime with advanced memory management, hybrid offloading, and full multimodal support.
Home-page: https://github.com/Skyroot-Solutions/Oprel
Author: Oprel Team and Skyroot Solutions
Author-email: Oprel Team and Skyroot Solutions <info@skyrootsolutions.com>
License: MIT
Project-URL: Homepage, https://github.com/ragultv/oprel-SDK
Project-URL: Documentation, https://oprel.skyrootsolutions.com
Project-URL: Repository, https://github.com/ragultv/oprel-SDK
Project-URL: Issues, https://github.com/ragultv/oprel-SDK/issues
Keywords: llm,local-llm,local-ai,inference,llm-inference,ollama,ollama-alternative,ollama-python,gguf,llama-cpp,llama.cpp,quantization,llama,llama3,mistral,gemma,qwen,phi,deepseek,chatbot,text-generation,ai-chat,conversational-ai,offline-ai,cpu-inference,gpu-inference,model-server,ai-runtime,machine-learning,privacy,on-premise,edge-ai,embedded-ai
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Information Technology
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft :: Windows
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Typing :: Typed
Classifier: Environment :: Console
Classifier: Environment :: GPU
Classifier: Natural Language :: English
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: huggingface-hub>=0.20.0
Requires-Dist: psutil>=5.9.0
Requires-Dist: requests>=2.31.0
Requires-Dist: pydantic>=2.10.0
Requires-Dist: rich>=13.0.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: fastapi>=0.115.0
Requires-Dist: uvicorn[standard]>=0.32.0
Requires-Dist: aiofiles>=24.1.0
Requires-Dist: python-multipart>=0.0.20
Requires-Dist: starlette>=0.41.3
Provides-Extra: local
Requires-Dist: torch>=2.1.0; extra == "local"
Requires-Dist: transformers>=4.36.0; extra == "local"
Requires-Dist: bitsandbytes>=0.41.0; extra == "local"
Requires-Dist: accelerate>=0.25.0; extra == "local"
Provides-Extra: cuda
Requires-Dist: torch>=2.1.0; extra == "cuda"
Requires-Dist: transformers>=4.36.0; extra == "cuda"
Requires-Dist: bitsandbytes>=0.41.0; extra == "cuda"
Requires-Dist: accelerate>=0.25.0; extra == "cuda"
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.7.0; extra == "dev"
Requires-Dist: pre-commit>=3.5.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.5.0; extra == "docs"
Requires-Dist: mkdocs-material>=9.4.0; extra == "docs"
Provides-Extra: all
Requires-Dist: oprel[cuda,dev,docs,local,server]; extra == "all"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# Oprel SDK

**Production-ready local LLM inference that beats Ollama in performance**

[![PyPI version](https://badge.fury.io/py/oprel.svg)](https://pypi.org/project/oprel/)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![GitHub](https://img.shields.io/badge/GitHub-OpenSource-blue.svg)](https://github.com/Skyroot-Solutions/Oprel)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Oprel is a high-performance Python library for running large language models and multimodal AI locally. It provides a production-ready runtime with advanced memory management, hybrid offloading, and intelligent optimization.

## 🚀 Key Features

- **Multi-Backend Architecture**:
  - **llama.cpp**: Text generation & vision (GGUF models)
  - **ComfyUI Integration**: Image & video generation (Diffusion models)
  - **Hybrid GPU/CPU**: Smart layer distribution for low VRAM
  
- **Smart Hardware Optimization**:
  - **Hybrid Offloading**: Run 13B models on 4GB GPUs by splitting layers between GPU/CPU
  - **Auto-Quantization**: Automatically selects the highest-quality quantization that fits in available VRAM
  - **CPU Acceleration**: AVX2/AVX512 optimization (30-50% faster than Ollama's defaults)
  - **KV-Cache Aware**: Precise memory planning prevents OOM crashes
  
- **Production Reliability**:
  - **Memory Pressure Monitor**: Proactive warnings before crashes
  - **Idle Cleanup**: Automatically frees GPU/CPU resources when inactive (15-minute timeout)
  - **Zero-Latency**: Server mode keeps models cached for instant response
  - **Robust Error Handling**: Clear error messages, no silent failures
  
- **Oprel Studio**: Premium Web UI for chat, model management, and real-time hardware monitoring with integrated RAG and LLM Wiki.

- **Ollama Compatibility**: Drop-in replacement for Ollama API

## 📦 Installation

```bash
pip install oprel
# For server mode
pip install "oprel[server]"
```

## ⚡ Quick Start

### CLI Usage

```bash
# Chat with a model (auto-downloaded)
oprel run gemma3-1b "Explain recursion in one sentence"

# Interactive chat mode
oprel run gemma3-1b

# Server mode for persistent caching
oprel serve
oprel run gemma3-1b "Hello"  # Instant response!

# Vision models
oprel vision qwen3-vl-7b "What's in this image?" --images photo.jpg

# Start Oprel Studio (Web UI)
oprel start
```

### Python API

```python
from oprel import Model

# Auto-optimized loading
model = Model("gemma3-1b") 
print(model.generate("Write a binary search in Python"))
```

## 🌐 Oprel Studio: The Ultimate Local AI Workspace

**Oprel Studio** is a premium, browser-based command center for your local AI models. Designed for engineers and researchers, it provides a state-of-the-art interface that transforms raw inference into a productive workspace.

### ✨ Immersive Chat Experience
- **Fluid Streaming**: Ultra-fast Server-Sent Events (SSE) for instant, typewriter-style responses.
- **Thinking Process Visualization**: DeepSeek-R1 and other reasoning models show their internal "chain of thought" in a beautiful, expandable workspace.
- **Rich Markdown & Code**: Full GFM support with syntax highlighting for 50+ languages.
- **Artifacts Canvas**: Generate Mermaid diagrams or HTML/Tailwind previews and view them in a dedicated side-panel next to your chat.
- **Multi-modal Support**: Drag and drop images for vision-capable models (Qwen-VL, Llama-3.2 Vision).

### 🔌 Beyond Local: External Cloud Providers
Manage your local models alongside industry-leading cloud APIs in one unified interface:
- **Google Gemini**: Full support for 2.0 Flash/Pro with free-tier quota integration.
- **NVIDIA NIM**: High-performance inference via NVIDIA's accelerated cloud.
- **Groq**: Record-breaking inference speeds via LPU™ technology.
- **OpenRouter**: Access 200+ models from a single API key.
- **Custom OpenAI**: Connect any internal or third-party OpenAI-compatible server.

### 🏛️ Visual Model Registry
- **One-Click Deployment**: Pull, load, and switch between models without ever touching the terminal.
- **Quantization Intelligence**: See available quants (Q4_K, Q8_0, etc.) and their memory footprint before loading.
- **Smart Status**: Real-time indicators show which model is currently taking up VRAM/RAM.

### 📊 Real-time Hardware Analytics
Monitor your system's performance as the model generates:
- **Tokens per Second (TPS)**: Live tracking of inference performance.
- **VRAM & RAM**: Precise graphs showing memory consumption across CPU and GPU.
- **CPU/GPU Utilization**: Monitor load to ensure your system is running optimally.

### 🚀 Usage
Start Oprel Studio and it will automatically open in your default browser:
```bash
oprel start
```
The interface is hosted at `http://localhost:11435/gui/`.

## 🎨 Image & Video Generation

**ComfyUI is embedded**: it installs itself and downloads models automatically.

### Usage

```bash
# Specify model in command
oprel gen-image sdxl-turbo "a cyberpunk city at night"

# High quality with FLUX
oprel gen-image flux-1-schnell "a majestic dragon" --width 1024 --height 1024 --steps 30

# With negative prompt
oprel gen-image sdxl-turbo "a cute cat" --negative "blurry, low quality"

# The first run downloads the model automatically
oprel gen-image flux-1-dev "stunning landscape"  # Auto-downloads 23GB
```

### Download Models

```bash
# List available image models
oprel list-models --category text-to-image

# Pre-download model
oprel pull flux-1-schnell

# Pull video model
oprel pull svd-xt
```

## 🔍 Text Embeddings

Generate embeddings for semantic search and RAG applications:

### CLI Usage

```bash
# Single text embedding
oprel embed nomic-embed-text "Hello world"

# Process files (PDF, DOCX, TXT, JSON)
oprel embed nomic-embed-text --files document.pdf report.docx notes.txt

# Batch processing from file (one text per line)
oprel embed nomic-embed-text --batch texts.txt --output embeddings.json

# JSON output format
oprel embed nomic-embed-text "Machine learning" --format json
```

### Python API

```python
from oprel import embed

# Single embedding
vector = embed("Hello world", model="nomic-embed-text")
print(f"Dimensions: {len(vector)}")

# Batch embeddings
vectors = embed(
    ["Document 1", "Document 2", "Document 3"],
    model="nomic-embed-text"
)

# Semantic search
import math

def cosine_similarity(a, b):
    dot = sum(x*y for x,y in zip(a,b))
    mag_a = math.sqrt(sum(x*x for x in a))
    mag_b = math.sqrt(sum(x*x for x in b))
    return dot / (mag_a * mag_b)

query = embed("machine learning topic", model="nomic-embed-text")
docs = embed(["AI concepts", "cooking recipes", "ML algorithms"], model="nomic-embed-text")
similarities = [cosine_similarity(query, doc) for doc in docs]
best_match = similarities.index(max(similarities))
print(f"Best match: Document {best_match}")
```

### Available Embedding Models

- **nomic-embed-text**: General-purpose (768 dims)
- **bge-m3**: Multilingual support (1024 dims)
- **all-minilm-l6-v2**: Lightweight & fast (384 dims)
- **snowflake-arctic**: Optimized for RAG (1024 dims)

```bash
# List all embedding models
oprel list-models --category embeddings
```


## Vision Models

```bash
# Ask about an image
oprel vision qwen3-vl-7b "What's in this image?" --images photo.jpg

# Multi-image analysis
oprel vision qwen3-vl-14b "Compare these images" --images img1.jpg img2.jpg img3.jpg
```

## 🧠 Semantic Wiki & RAG

Oprel features a built-in **LlmWiki** engine: a concept-aware, interlinked knowledge base that uses semantic deduplication to organize your information.

### 📚 Wiki Management
- **Semantic Deduplication**: Automatically detect similar concepts and merge new info into existing pages.
- **Bi-directional Linking**: Auto-generate `[[wiki-links]]` between related concepts.
- **Knowledge Graph**: Automatically extract entities and triples for visual relationship mapping.
- **Self-Healing Audit**: Integrated "Lint" tool to fix broken links and detect contradictions with one click.
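
As a rough illustration of semantic deduplication, the sketch below merges a new note into an existing page when their embeddings are similar enough, and creates a new page otherwise. The `WikiPage` class, the `file_note` function, and the 0.85 threshold are hypothetical names and values chosen for this example, not part of Oprel's API. The toy 2-D vectors stand in for real embeddings from a model such as `nomic-embed-text`.

```python
import math
from dataclasses import dataclass, field


@dataclass
class WikiPage:
    """Hypothetical page record: a title, its embedding, and merged notes."""
    title: str
    vector: list
    notes: list = field(default_factory=list)


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def file_note(pages, title, vector, text, threshold=0.85):
    """Merge into the most similar page if above the threshold, else add a page."""
    best = max(pages, key=lambda p: cosine_similarity(p.vector, vector), default=None)
    if best is not None and cosine_similarity(best.vector, vector) >= threshold:
        best.notes.append(text)
        return best
    page = WikiPage(title, vector, [text])
    pages.append(page)
    return page


pages = []
file_note(pages, "KV Cache", [1.0, 0.0], "Stores attention keys and values.")
file_note(pages, "Key-value cache", [0.98, 0.05], "Grows with context length.")
file_note(pages, "Quantization", [0.0, 1.0], "Reduces weight precision.")
print([(p.title, len(p.notes)) for p in pages])
```

The second note lands on the existing "KV Cache" page because its vector is nearly parallel to the first, while "Quantization" gets its own page.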

### 🤖 Intelligent RAG (Retrieval-Augmented Generation)
Chat with your personal knowledge base in Oprel Studio by toggling "RAG" mode.
- **Hybrid Search**: Combines keyword search with graph traversal to find context.
- **Automatic Filing**: The LLM can automatically save its synthesis back into your wiki.
- **Transparent Citations**: Every claim in RAG mode is cited back to the source wiki page.

```bash
# Ingest local files or text into your wiki
oprel ingest --file documentation.pdf
oprel ingest --text "KV Cache is a memory optimization..."
```

## 🛠️ Advanced Features

### Hybrid GPU/CPU Offloading
Run larger models on limited VRAM by intelligently splitting layers.
```bash
# Automatically calculated during load
# Example: "20/40 layers on GPU, 20 on CPU"
```
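
To make the idea of a layer split concrete, here is a minimal sketch of how a GPU/CPU split could be estimated. It is not Oprel's actual planner: the function names, the equal-sized-layers assumption, the 512 MiB overhead reserve, and the model dimensions are all illustrative assumptions.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Keys plus values for every layer: 2 * layers * kv_heads * head_dim * tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


def plan_gpu_layers(model_bytes, n_layers, free_vram_bytes, kv_bytes,
                    overhead_bytes=512 * 1024**2):
    """Return how many layers fit on the GPU, assuming equal-sized layers."""
    # Each offloaded layer carries its share of the weights and of the KV cache.
    per_layer = model_bytes / n_layers + kv_bytes / n_layers
    budget = free_vram_bytes - overhead_bytes
    if budget <= 0:
        return 0
    return min(n_layers, int(budget // per_layer))


GIB = 1024**3
model_bytes = 8 * GIB  # illustrative size of a quantized model file
kv = kv_cache_bytes(n_layers=40, n_kv_heads=8, head_dim=128, context_len=4096)
gpu_layers = plan_gpu_layers(model_bytes, n_layers=40, free_vram_bytes=4 * GIB, kv_bytes=kv)
print(f"{gpu_layers}/40 layers on GPU, {40 - gpu_layers} on CPU")
```

A count like this is what llama.cpp's `--n-gpu-layers` option expects; how Oprel derives its own value is not documented here.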

### Smart Quantization
Auto-selects the best quantization that fits your hardware.
```bash
oprel run gemma3-1b --quantization auto  # Default
```
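
The selection logic can be pictured as "walk from highest to lowest precision and take the first one that fits". The sketch below assumes approximate bits-per-weight figures for common GGUF quantizations. Both the table and the `pick_quantization` function are illustrative, not Oprel's internal implementation.

```python
# Approximate bits per weight for common GGUF quantizations (illustrative values),
# ordered from highest to lowest precision.
QUANT_BITS = [
    ("Q8_0", 8.5),
    ("Q6_K", 6.6),
    ("Q5_K_M", 5.7),
    ("Q4_K_M", 4.8),
    ("Q3_K_M", 3.9),
    ("Q2_K", 3.0),
]


def pick_quantization(n_params, free_bytes, reserve_bytes=0):
    """Return the highest-precision quantization whose weights fit, or None."""
    for name, bits in QUANT_BITS:
        weight_bytes = n_params * bits / 8
        if weight_bytes + reserve_bytes <= free_bytes:
            return name
    return None


GIB = 1024**3
for free_gib in (6, 4, 1):
    print(free_gib, "GiB ->", pick_quantization(7e9, free_gib * GIB))
```

`reserve_bytes` leaves room for the KV cache and runtime buffers. A `None` result means no quantization fits entirely, which is the case where hybrid offloading would be needed.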

### OpenAI & Ollama Compatible Server

**Production-ready API server with smart model management**

Start the server:
```bash
oprel serve --host 127.0.0.1 --port 11435
```

The server provides:
- **OpenAI API compatibility**: `/v1/chat/completions`, `/v1/completions`, `/v1/models`
- **Ollama API compatibility**: `/api/chat`, `/api/generate`, `/api/tags`
- **Smart Model Management**: 
  - Models stay loaded for 15 minutes after last use
  - Automatic switching when a request names a different model
  - Zero manual load/unload needed
- **Fast SSE Streaming**: Server-Sent Events for instant token delivery
- **CORS Support**: Use from web applications

#### OpenAI API Examples

Python (using OpenAI SDK):
```python
from openai import OpenAI

# Point to local Oprel server
client = OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="not-needed"  # Oprel doesn't require API keys
)

# Chat completion
response = client.chat.completions.create(
    model="qwen3-14b",
    messages=[
        {"role": "user", "content": "Write a Python function to reverse a string"}
    ],
    stream=True  # Enable streaming for fast responses
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

cURL:
```bash
# Chat completions (streaming)
curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-0.5b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

# Text Completions
curl http://localhost:11435/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen2.5-0.5b",
    "prompt": "Once upon a time",
    "max_tokens": 50
  }'

# List Models
curl http://localhost:11435/v1/models
```
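
If you consume the streaming endpoint without an SDK, you need to parse the Server-Sent Events yourself. The sketch below assumes the server emits OpenAI-style chunks (`data: {...}` lines terminated by `data: [DONE]`). `iter_sse_content` is a hypothetical helper name, and the sample lines are hand-written rather than captured from a real server.

```python
import json


def iter_sse_content(lines):
    """Yield text deltas from OpenAI-style SSE lines ("data: {...}")."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            yield delta["content"]


sample = [
    b'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    b"",
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    b'data: {"choices": [{"delta": {"content": "lo"}}]}',
    b"data: [DONE]",
]
print("".join(iter_sse_content(sample)))
```

With the `requests` library, the same function can be fed `response.iter_lines()` from a POST made with `stream=True`.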

#### Ollama API Examples

Python (using Ollama SDK):
```python
import ollama

# Works directly with Ollama SDK
client = ollama.Client(host='http://localhost:11435')
response = client.chat(
    model='llama3', 
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True
)

for chunk in response:
    print(chunk['message']['content'], end='')
```

cURL:
```bash
# Ollama-style chat
curl http://localhost:11435/api/chat \
  -d '{
    "model": "qwen3-14b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

# List models (Ollama format)
curl http://localhost:11435/api/tags
```

#### Model Management Behavior

The server automatically manages models with these rules:

1. **First Request**: Model is loaded (takes ~5-30s depending on size)
2. **Subsequent Requests**: Model is already loaded (instant response)
3. **Model Switch**: Old model unloads, new model loads automatically
4. **Idle Timeout**: After 15 minutes of no requests, model is unloaded to free memory
5. **No Manual Management**: You never need to call load/unload - it's automatic!
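
The rules above can be sketched as a single-slot cache with an idle timer. This is an illustration of the behavior only, not Oprel's server code: `ModelSlot`, `reap_if_idle`, and the injectable `loader` and `clock` arguments are hypothetical names introduced here so the example can run without a real model.

```python
import time


class ModelSlot:
    """Holds at most one loaded model and unloads it after an idle timeout."""

    def __init__(self, loader, idle_timeout=15 * 60, clock=time.monotonic):
        self._loader = loader  # callable: model name -> model object
        self._idle_timeout = idle_timeout
        self._clock = clock
        self.name = None
        self.model = None
        self._last_used = 0.0

    def get(self, name):
        """Return the requested model, loading or switching if needed."""
        if self.name != name:
            self.unload()  # rule 3: the old model is dropped first
            self.model = self._loader(name)
            self.name = name
        self._last_used = self._clock()
        return self.model

    def unload(self):
        self.name = None
        self.model = None

    def reap_if_idle(self):
        """Unload the model if it has been idle too long; return True if unloaded."""
        idle = self._clock() - self._last_used
        if self.name is not None and idle >= self._idle_timeout:
            self.unload()
            return True
        return False


now = [0.0]  # fake clock so the example does not wait 15 minutes
loads = []
slot = ModelSlot(loader=lambda n: loads.append(n) or f"<{n}>", clock=lambda: now[0])
slot.get("qwen3-14b")  # first request: loads
slot.get("qwen3-14b")  # second request: reuses the loaded model
slot.get("llama3.1")   # switch: unloads qwen3-14b, loads llama3.1
now[0] = 15 * 60
print(loads, slot.reap_if_idle(), slot.name)
```

In a real server, something like `reap_if_idle` would be called periodically from a background task.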

Example workflow:
```bash
# Start server
oprel serve

# In another terminal:
# First request - loads qwen3-14b (~10s load time)
curl http://localhost:11435/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen3-14b","messages":[{"role":"user","content":"Hi"}]}'

# Second request - instant! Model already loaded
curl http://localhost:11435/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"qwen3-14b","messages":[{"role":"user","content":"Tell me a joke"}]}'

# Switch to different model - automatically unloads qwen3-14b and loads llama3.1
curl http://localhost:11435/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"llama3.1","messages":[{"role":"user","content":"Hi"}]}'

# After 15 minutes of inactivity, llama3.1 is automatically unloaded
```

#### Health Check

```bash
curl http://localhost:11435/health
# Returns: {"status":"healthy","timestamp":1234567890,"current_model":"qwen3-14b"}
```
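
A script that launches the server may want to wait until it is ready before sending requests. The sketch below polls the health endpoint using only the standard library. `wait_until_healthy` and `fetch_health` are hypothetical helper names, and the only thing assumed about the response is the `"status": "healthy"` field shown above.

```python
import json
import time
import urllib.request


def fetch_health(url):
    """GET the health endpoint and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_until_healthy(url, attempts=10, delay=1.0, fetch=fetch_health, sleep=time.sleep):
    """Poll until the server reports status "healthy"; return the payload or None."""
    for attempt in range(attempts):
        try:
            payload = fetch(url)
            if payload.get("status") == "healthy":
                return payload
        except (OSError, ValueError):
            pass  # server not reachable yet, or the body was not valid JSON
        if attempt < attempts - 1:
            sleep(delay)
    return None
```

Call it with the `/health` URL of your running server. The `fetch` and `sleep` parameters exist so the function can be tested without a network connection.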

## 📊 Benchmarks vs Ollama

| Feature | Ollama | Oprel SDK |
|---------|--------|-----------|
| **Model Discovery** | 10-30s | **Instant (<100ms)** |
| **Memory Planning** | Basic | **Precise (KV-Cache aware)** |
| **Low VRAM Support** | Fails/Slow | **Hybrid Offloading** |
| **CPU Speed** | Standard | **30-50% Faster (AVX)** |
| **Vision Models** | Partial | **Full Support** |
| **Image/Video Gen** | No | **ComfyUI Integration** |
| **Crash Safety** | Frequent OOM | **Proactive Warnings** |
| **Auto-Optimization** | Manual config | **Fully Automatic** |
| **Oprel Studio** | No | **Premium Web UI** |
| **RAG** | No | **Integrated** |
| **LLM Wiki** | No | **Integrated** |
| **Model Management** | Manual | **Automatic** |
 


## 🧩 Supported Models

### Text Generation Models (GGUF - llama.cpp backend)
- **Qwen 3 / 2.5**: Best all-around models (32B, 14B, 8B, 3B)
- **Qwen 3 Coder**: SOTA for code generation (32B, 14B, 8B)
- **DeepSeek R1**: Advanced reasoning (14B, 8B, 7B, 1.5B)
- **Llama 3.3 / 3.1**: Meta's flagship (70B, 8B)
- **Gemma 3 / 2**: Google's efficient models (27B, 12B, 9B, 4B)
- **Phi-4**: Microsoft's compact powerhouse (14B)

### Vision Models (VLMs) - GGUF + mmproj
- **Qwen3-VL**: Multi-image understanding (32B, 14B, 7B - supports up to 8 images)
- **Qwen2.5-VL**: Proven vision model (7B, 3B)
- **Llama 3.2 Vision**: Meta's VLM (11B)
- **MiniCPM-V**: Efficient mobile-ready VLM (2.6B)
- **Moondream 2**: Lightweight vision (1.8B)

### Image Generation (Safetensors - ComfyUI backend)
Requires ComfyUI running:
- **FLUX.1-dev**: Best quality
- **FLUX.1-schnell**: Fast generation
- **SDXL Turbo**: Fastest (1-4 steps)

### Video Generation (ComfyUI + AnimateDiff)
Requires ComfyUI with video nodes:
- AnimateDiff
- Stable Video Diffusion (SVD)
- Custom workflows

View all available GGUF models:
```bash
oprel list-models --category text-generation
oprel list-models --category vision
oprel list-models --category coding
oprel list-models --category reasoning
```

## License

MIT License. Made with ❤️ for local AI developers.
