Open Source · Apache 2.0

DeepNetz

Run massive models on minimal hardware.
An open-source LLM inference framework with intelligent backend selection, KV cache compression, and hardware auto-detection.

$ pip install deepnetz Copied!
View on GitHub
Apache-2.0 Python 3.8+ 6 Backends

Built for real-world inference

DeepNetz automatically selects the best backend for your hardware and optimizes memory usage to run models that shouldn't fit.

6 Inference Backends

llama.cpp, HuggingFace Transformers, ExLlamaV2, vLLM, CTranslate2, and ONNX Runtime. DeepNetz picks the optimal one automatically.

KV Cache Compression

Intelligent key-value cache management reduces VRAM usage significantly, enabling larger context windows on constrained hardware.

Built-in Web UI

A clean, responsive chat interface ships out of the box. No separate frontend setup needed — just run and open your browser.

Tool Calling

Native function/tool calling support compatible with OpenAI-style tool definitions. Build agents that interact with external APIs.

Hardware Auto-Detection

Detects your GPU, VRAM, CPU cores, and RAM at startup. Automatically configures quantization level, batch size, and layer offloading.

OpenAI-Compatible API

Drop-in replacement for the OpenAI API. Use your existing code and tools — just change the base URL to your local DeepNetz server.

Honest numbers, real hardware

Measured on consumer hardware. No cherry-picked results. Quality delta is vs. uncompressed FP16 baseline.

Model Parameters Quality Delta Throughput VRAM Used
Llama 3.2 3B +0.4% 42.1 tok/s 2.8 GB
Gemma 2 27B +2.0% 8.7 tok/s 14.2 GB
Qwen 2.5 35B +2.7% 6.3 tok/s 18.1 GB
Command R+ 122B 1.3 tok/s 48 GB (CPU offload)

Benchmarked on a single RTX 4090 (24 GB VRAM), 64 GB DDR5 RAM. The 122B model uses partial CPU offloading. Quality delta measured on MMLU subset. Positive values = better than baseline (due to quantization noise regularization).

How DeepNetz compares

Different tools for different needs. Here's where DeepNetz fits.

Feature DeepNetz Ollama LM Studio vLLM
Open Source ✓ Apache-2.0 ✓ MIT ✕ Proprietary ✓ Apache-2.0
Multiple Backends ✓ 6 backends ✕ llama.cpp only ✕ llama.cpp only ✕ vLLM only
Hardware Auto-Detect ✓ Full ○ Basic ○ Basic ✕ Manual
KV Cache Compression ○ Experimental
Tool Calling ✓ Native ○ Limited
Web UI ✓ Built-in ✕ CLI only ✓ Desktop app
CPU + GPU Offloading ✓ Auto ○ Manual ○ Manual ✕ GPU only
Python Library ✓ pip install ✕ Binary ✕ Binary ✓ pip install
Production Batching ○ Basic ✓ Continuous

Up and running in 30 seconds

Install, load a model, generate. That's it.

Python
# Install
pip install deepnetz

# Python API
from deepnetz import DeepNetz

# Initialize — hardware is auto-detected
dn = DeepNetz()

# Load any supported model (GGUF, HF, GPTQ, AWQ, EXL2)
dn.load("meta-llama/Llama-3.2-3B-Instruct")

# Generate
response = dn.chat("Explain quantum computing in simple terms.")
print(response)

# Or start the server with Web UI
dn.serve(port=8000)  # Opens http://localhost:8000
CLI
# Start a server (OpenAI-compatible API)
$ deepnetz serve --model meta-llama/Llama-3.2-3B-Instruct --port 8000

# Interactive chat in your terminal
$ deepnetz chat --model meta-llama/Llama-3.2-3B-Instruct

# Benchmark a model on your hardware
$ deepnetz bench --model meta-llama/Llama-3.2-3B-Instruct

How it works

DeepNetz sits between your code and the inference backend. It handles hardware detection, model loading, and optimization automatically.

┌─────────────────────────────────────────────────────────────────┐
│                        Your Application                         │
│               (Python API / CLI / Web UI / REST)                │
└──────────────────────────────┬──────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                      DeepNetz Core Engine                       │
│                                                                 │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────────────────┐ │
│  │  Hardware    │  │   Model      │  │   Optimization         │ │
│  │  Detector    │  │   Router     │  │   Pipeline             │ │
│  │             │  │              │  │                        │ │
│  │  • GPU/VRAM │  │  • Format    │  │  • KV Cache Compress.  │ │
│  │  • CPU/RAM  │  │    Detection │  │  • Layer Offloading    │ │
│  │  • Platform │  │  • Backend   │  │  • Quant. Selection    │ │
│  │             │  │    Selection │  │  • Batch Optimization  │ │
│  └──────┬──────┘  └──────┬───────┘  └───────────┬────────────┘ │
│         └────────────────┼──────────────────────┘              │
└──────────────────────────┼──────────────────────────────────────┘
                           │
            ┌──────────────┼──────────────────┐
            ▼              ▼                  ▼
┌────────────────┐ ┌──────────────┐ ┌──────────────────┐
│   llama.cpp    │ │  HuggingFace │ │    ExLlamaV2     │
│   (GGUF)       │ │  Transformers│ │    (EXL2/GPTQ)   │
└────────────────┘ └──────────────┘ └──────────────────┘
┌────────────────┐ ┌──────────────┐ ┌──────────────────┐
│     vLLM       │ │  CTranslate2 │ │  ONNX Runtime    │
│   (server)     │ │  (CT2)       │ │  (ONNX)          │
└────────────────┘ └──────────────┘ └──────────────────┘

Built by

KH

Keyvan Hardani

Software engineer and AI researcher. Building tools to make large language models accessible on everyday hardware.