Metadata-Version: 2.2
Name: sovereign-inference
Version: 0.2.8
Summary: Ultra-fast LLM inference engine with a Vulkan compute backend
License: MIT
Project-URL: Homepage, https://github.com/corbac10099/Sovereign-Engine
Project-URL: Repository, https://github.com/corbac10099/Sovereign-Engine
Requires-Python: >=3.9
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: numpy; extra == "dev"
Description-Content-Type: text/markdown

# Sovereign Engine

> **Ultra-fast, modular LLM inference engine with a Vulkan compute backend**  
> Design goal: surpass llama.cpp in throughput and VRAM efficiency (a benchmark suite is on the [roadmap](#roadmap)).

```
┌──────────────────────────────────────────────────────────────────────────┐
│  Sovereign Engine  v0.2.8                                                │
│  C++20 · Vulkan 1.3 · SPIR-V Compute · pybind11 · Mixed INT4 Quant      │
└──────────────────────────────────────────────────────────────────────────┘
```

[![Build](https://img.shields.io/badge/build-passing-brightgreen)](#building)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](#license)
[![C++20](https://img.shields.io/badge/C%2B%2B-20-orange.svg)]()
[![Vulkan 1.3](https://img.shields.io/badge/Vulkan-1.3-red.svg)]()

---

## Table of Contents

- [Overview](#overview)
- [Key Features](#key-features)
- [Architecture](#architecture)
- [Project Structure](#project-structure)
- [Requirements](#requirements)
- [Building](#building)
- [Usage](#usage)
  - [Convert a Model](#1-convert-a-model)
  - [Python API](#2-python-api)
  - [C++ API](#3-c-api)
  - [C API (FFI)](#4-c-api-ffi)
- [The .sovereign Format](#the-sovereign-format)
- [Quantiser](#quantiser)
- [Memory Manager](#memory-manager)
- [KV Cache (PagedAttention)](#kv-cache-pagedattention)
- [Vulkan Compute Shaders](#vulkan-compute-shaders)
- [Running Tests](#running-tests)
- [Roadmap](#roadmap)
- [Contributing](#contributing)
- [License](#license)

---

## Overview

Sovereign Engine is a **from-scratch, GPU-first LLM inference runtime** written in C++20.  
It targets local inference on consumer hardware (NVIDIA/AMD/Intel) using **Vulkan compute** as the sole GPU backend, which means:

- No CUDA dependency: runs on **any Vulkan 1.2+ GPU** whose driver exposes the [required extensions](#required-vulkan-extensions).
- Tight control over VRAM: **paged KV cache**, async layer streaming, dynamic CPU offload.
- **Mixed-precision quantisation** inspired by EXL2 and HQQ — assign INT4/INT3/INT2 per-tensor based on measured sensitivity.
- A clean **Python API** (via pybind11) and a stable **C ABI** for FFI from any language.

---

## Key Features

| Feature | Details |
|---|---|
| **Vulkan backend** | Compute-only, no graphics queue needed. Works on NVIDIA, AMD, Intel, ARM Mali. |
| **Mixed-precision quantisation** | FP16 → INT8 → Q4\_K → Q3\_K → Q2\_K per tensor, HQQ solver, EXL2-style importance scoring. |
| **Async layer pipeline** | Double-buffered PCIe staging: GPU runs layer N while CPU DMA-copies layer N+1. |
| **PagedAttention KV cache** | Block-based VRAM pool, copy-on-write forking, O(1) alloc/free. |
| **Dynamic CPU offload** | Falls back to AVX-512 / NEON when VRAM pressure exceeds threshold. |
| **Streaming generation** | Token-by-token callback; GIL-safe Python generator. |
| **Rich sampling** | Temperature, Top-P, Top-K, Min-P, Repetition Penalty, Mirostat v1/v2, GBNF grammar, JSON schema. |
| **Custom `.sovereign` format** | Page-aligned mmap, per-tensor CRC32C, zero-copy Vulkan upload. |
| **GQA / MHA / MQA** | All attention variants supported via a single fused GLSL shader. |
| **RoPE + sliding window** | Inline rotary embeddings, optional Mistral/Gemma sliding-window mask. |

---

## Architecture

```
┌─────────────────────────────────────────────────────────────────────────┐
│                           Python / C++ / C                               │
│                    (sovereign_inference.Engine)                          │
└────────────────────────────────┬────────────────────────────────────────┘
                                 │
                    ┌────────────▼────────────┐
                    │       engine.cpp         │  prefill / decode_step /
                    │    (inference loop)      │  generate / forward
                    └──┬──────┬───────┬───────┘
                       │      │       │
         ┌─────────────▼─┐ ┌──▼────┐ ┌▼──────────────────┐
         │  VulkanContext │ │Quant  │ │ AsyncMemoryManager │
         │  (device,      │ │izer   │ │ (layer streaming,  │
         │   pipelines,   │ │       │ │  CPU offload)      │
         │   cmd bufs)    │ └───────┘ └────────────────────┘
         └───────┬────────┘                    │
                 │                   ┌─────────▼──────────┐
     ┌───────────▼────────────┐      │  PagedKVCache       │
     │  SPIR-V Compute Shaders│      │  (block pool,       │
     │  ┌─────────────────┐   │      │   CoW fork,         │
     │  │ rmsnorm.comp    │   │      │   descriptor sets)  │
     │  │ matmul_int4.comp│   │      └────────────────────┘
     │  │ attention_gqa   │   │
     │  │ silu_gate.comp  │   │
     │  │ sampler.comp    │   │
     │  └─────────────────┘   │
     └────────────────────────┘

┌────────────────────────────────────────────────────────────┐
│                  sovereign-convert CLI                      │
│  SafeTensors → profile → budget allocate → HQQ quant       │
│               → pack INT4/3/2 → write .sovereign           │
└────────────────────────────────────────────────────────────┘
```

---

## Project Structure

```
sovereign-engine/
├── CMakeLists.txt              # Root build configuration
├── README.md
├── .gitignore
│
├── include/sovereign/          # Public C++ headers
│   ├── engine.hpp              # Top-level inference API
│   ├── format.hpp              # .sovereign binary format spec
│   ├── vulkan_context.hpp      # Vulkan device + pipeline management
│   ├── memory_manager.hpp      # Async pipeline memory manager
│   ├── kv_cache.hpp            # PagedAttention KV cache
│   └── quantizer.hpp           # Mixed-precision quantiser
│
├── src/
│   ├── vulkan/
│   │   └── vulkan_context.cpp
│   ├── format/
│   │   └── format.cpp
│   ├── compute/
│   │   └── kv_cache.cpp
│   ├── inference/
│   │   └── engine.cpp
│   ├── quantizer/
│   │   └── quantizer.cpp
│   └── memory/
│       └── memory_manager.cpp
│
├── shaders/                    # GLSL compute shaders (compiled to SPIR-V)
│   ├── rmsnorm.comp
│   ├── matmul_int4.comp
│   ├── attention_gqa.comp
│   ├── silu_gate.comp
│   └── sampler.comp
│
├── bindings/
│   └── python/
│       └── sovereign_py.cpp    # pybind11 Python bindings
│
├── tools/
│   └── converter/
│       └── main.cpp            # sovereign-convert CLI
│
├── tests/
│   ├── CMakeLists.txt
│   ├── test_format.cpp
│   ├── test_quantizer.cpp
│   ├── test_kv_cache.cpp
│   └── test_engine.cpp
│
├── package.json                # Shader compiler package metadata
├── package-lock.json           # Shader compiler lock file
│
├── examples/
│   └── basic_generate.py       # Python streaming example
│
├── scripts/
│   ├── build.sh                # Build helper script
│   └── compile_shaders.js      # Shader compiler tool using WebGPU glslang
│
└── third_party/
    ├── volk/                   # Meta-loader for dynamic Vulkan loading (tracked)
    │   ├── volk.h
    │   └── volk.c
    └── vk_mem_alloc.h          # Fetched automatically via CMake (not tracked)
```

---

## Requirements

### Runtime
* **Vulkan 1.2+ Compatible GPU**: Works on NVIDIA, AMD, Intel, Apple Silicon (via MoltenVK), and ARM Mali.
* **GPU Driver**: Must support Vulkan 1.2 and the required extensions listed below. No SDK required at runtime!

### Build (Zero-Dependency & SDK-Free)
Because Vulkan is loaded dynamically through the bundled `volk` meta-loader and CMake fetches the remaining headers automatically, the **Vulkan SDK is optional when building Sovereign Engine**. Only CMake and a C++20 compiler are mandatory.

| Dependency | Version | Mandatory? | Notes |
|---|---|---|---|
| **CMake** | ≥ 3.25 | **Yes** | Handles the build orchestration |
| **C++ Compiler** | C++20 | **Yes** | MSVC 2022 / GCC 12+ / Clang 15+ |
| **Vulkan SDK** | ≥ 1.3 | **No** (Optional) | If absent, CMake automatically fetches headers; uses precompiled SPIR-V shaders |
| **Python** | ≥ 3.9 | **No** (Optional) | Only required to compile Python/pybind11 bindings |

### Required Vulkan Extensions
Your GPU driver must support:
```
VK_KHR_timeline_semaphore        (core in 1.2)
VK_KHR_synchronization2          (core in 1.3)
VK_EXT_memory_budget
VK_KHR_buffer_device_address
VK_KHR_shader_float16_int8
VK_EXT_scalar_block_layout
VK_KHR_8bit_storage
VK_KHR_16bit_storage
```
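
To check a driver against this list, one option is to save the output of the `vulkaninfo` tool to a text file and scan it for the extension names. The helper below is a minimal sketch of that idea; `missing_extensions` is a hypothetical name and not part of the Sovereign Engine API, and a plain substring search may misreport a driver that only exposes a feature through a promoted core version.

```python
# Hypothetical helper (not part of sovereign_inference): scan saved
# `vulkaninfo` output for the extension names listed above.
REQUIRED_EXTENSIONS = (
    "VK_KHR_timeline_semaphore",
    "VK_KHR_synchronization2",
    "VK_EXT_memory_budget",
    "VK_KHR_buffer_device_address",
    "VK_KHR_shader_float16_int8",
    "VK_EXT_scalar_block_layout",
    "VK_KHR_8bit_storage",
    "VK_KHR_16bit_storage",
)


def missing_extensions(vulkaninfo_text: str) -> list[str]:
    """Return the required extension names that do not appear in the text."""
    return [name for name in REQUIRED_EXTENSIONS if name not in vulkaninfo_text]
```

For example, after running `vulkaninfo > caps.txt`, calling `missing_extensions(open("caps.txt").read())` returns an empty list when every name is present.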

---

## Building

### Quick start

```bash
# Clone
git clone https://github.com/corbac10099/sovereign-engine.git
cd sovereign-engine

# Build (fetches vk_mem_alloc.h automatically)
chmod +x scripts/build.sh
./scripts/build.sh

# Or with all options explicit:
./scripts/build.sh --release --tests --python --avx512
```

### Manual CMake

```bash
mkdir build && cd build
cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DSOVEREIGN_BUILD_PYTHON=ON \
    -DSOVEREIGN_BUILD_TESTS=ON \
    -DSOVEREIGN_ENABLE_AVX512=ON
cmake --build . --parallel $(nproc)
```

### Debug build with AddressSanitizer

```bash
./scripts/build.sh --debug
```

---

## Usage

### 1. Convert a Model

Download a Hugging Face model (e.g. Gemma 4B) in SafeTensors format, then convert it:

```bash
# Basic conversion – mixed quantisation targeting 4.5 bpw
./build/sovereign-convert \
    --input  /path/to/gemma-4b/ \
    --output gemma-4b.sovereign \
    --arch   gemma \
    --quant  mixed \
    --bpw    4.5

# With calibration corpus for better importance scoring
./build/sovereign-convert \
    --input  /path/to/gemma-4b/ \
    --output gemma-4b-calibrated.sovereign \
    --quant  mixed \
    --bpw    4.5 \
    --calib  calibration_corpus.txt \
    --verbose
```

**Quantisation modes:**

| Mode | Approx bpw | Description |
|---|---|---|
| `fp16` | 16 | No quantisation, maximum quality |
| `int8` | 8 | Symmetric INT8 throughout |
| `q4k` | 4.5 | Q4\_K block quantisation |
| `q3k` | 3.5 | Q3\_K block quantisation |
| `q2k` | 2.6 | Q2\_K aggressive compression |
| `mixed` | target | Adaptive per-tensor (recommended) |
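
As a rough guide to what a given bpw means on disk, the weight payload is approximately `parameters × bpw / 8` bytes. The sketch below does only that arithmetic; `estimate_weight_bytes` is a hypothetical helper, and the estimate ignores the header, tokenizer blob, alignment padding and the KV cache.

```python
import math


def estimate_weight_bytes(n_params: int, bpw: float) -> int:
    """Approximate size of the quantised weights in bytes (payload only)."""
    return math.ceil(n_params * bpw / 8)


# Assuming a model with 4 billion parameters quantised to 4.5 bpw:
size = estimate_weight_bytes(4_000_000_000, 4.5)
print(f"{size / 1e9:.2f} GB")  # 2.25 GB
```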

#### Python-based Conversion

You can also convert models directly inside Python without having to compile the C++ CLI tool:

```python
import sovereign_inference

result = sovereign_inference.convert(
    input_dir="/path/to/gemma-4b/",
    output_path="gemma-4b.sovereign",
    arch="gemma",
    quant="mixed",
    bpw=4.5,
    verbose=True
)

if result["success"]:
    print(f"Conversion successful! Achieved bpw: {result['achieved_bpw']:.2f}")
else:
    print(f"Conversion failed: {result['error_message']}")
```

---

### 2. Python API

```python
import sovereign_inference

# Load model
cfg = sovereign_inference.LoadConfig()
cfg.gpu_layer_count       = 2**31 - 1   # load everything into VRAM
cfg.kv_cache_vram_fraction = 0.80

with sovereign_inference.Engine.load("gemma-4b.sovereign", cfg) as engine:
    print(f"Model  : {engine.model_name}")
    print(f"Device : {engine.device_name}  ({engine.vram_gib:.1f} GiB)")

    # --- Streaming generation ---
    params = sovereign_inference.GenerateParams()
    params.max_new_tokens        = 512
    params.sampling.temperature  = 0.7
    params.sampling.top_p        = 0.9
    params.sampling.min_p        = 0.05
    params.sampling.repetition_penalty = 1.1

    stats = engine.generate(
        prompt   = "Explain quantum entanglement briefly:",
        params   = params,
        callback = lambda tok, tid, lp: print(tok, end="", flush=True) or True,
    )
    print(f"\n[{stats.tokens_per_second:.1f} tok/s | {stats.generated_tokens} tokens]")

    # --- Generator protocol ---
    for text, token_id, logprob in engine.stream("Once upon a time", params):
        print(text, end="", flush=True)

    # --- Raw logits for custom sampling ---
    ids    = engine.tokenize("The sky is")
    logits = engine.forward(ids)   # numpy float32 array [vocab_size]
```
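
The lambda above always returns `True`. A callback with state can be built as a closure instead. The sketch below assumes the Python callback follows the same convention as the C++ one (a falsy return value stops generation early); `make_budget_callback` is a hypothetical helper, not part of the library.

```python
def make_budget_callback(max_chars, sink):
    """Build a (text, token_id, logprob) callback that collects text into
    `sink` and asks generation to stop once `max_chars` characters arrived."""
    emitted = 0

    def on_token(text, token_id, logprob):
        nonlocal emitted
        sink.append(text)
        emitted += len(text)
        # Assumption: returning False stops generation, as in the C++ API.
        return emitted < max_chars

    return on_token
```

It would be passed as `callback=make_budget_callback(200, chunks)` in place of the lambda.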

---

### 3. C++ API

```cpp
#include <cstdio>
#include <iostream>
#include <string_view>

#include "sovereign/engine.hpp"

int main() {
    sovereign::LoadConfig cfg;
    cfg.kv_cache_vram_fraction = 0.80;

    auto engine = sovereign::Engine::load("gemma-4b.sovereign", cfg);
    if (!engine) return 1;

    sovereign::GenerateParams params;
    params.max_new_tokens       = 512;
    params.sampling.temperature = 0.7f;
    params.sampling.top_p       = 0.9f;

    auto stats = engine->generate(
        "Explain quantum entanglement:",
        params,
        [](std::string_view tok, sovereign::TokenId, float) {
            std::cout << tok << std::flush;
            return true;   // return false to stop early
        });

    std::fprintf(stderr, "\n%.1f tok/s\n", stats.tokens_per_second);
}
```

---

### 4. C API (FFI)

```c
#include "sovereign/engine.hpp"   // exposes extern "C" block

SovereignEngine* engine = sovereign_engine_load(
    "gemma-4b.sovereign",
    0,       // vram_budget (0 = auto)
    ~0u,     // gpu_layers  (all)
    true     // use_mmap
);

sovereign_engine_generate(
    engine,
    "Hello, world!",
    0.7f, 0.9f, 0, 0.05f, 1.1f,  // temperature, top_p, top_k, min_p, rep_penalty
    256,
    my_callback, NULL
);

sovereign_engine_free(engine);
```

---

## The .sovereign Format

The `.sovereign` binary format is designed for **zero-copy, memory-mapped inference**:

```
┌──────────────┬──────────────────────────────────────────────────────┐
│ Offset       │ Section                                              │
├──────────────┼──────────────────────────────────────────────────────┤
│ 0x0000       │ FileHeader        (256 bytes, fixed)                 │
│ 0x0100       │ ModelConfig       (256 bytes, padded to 64B)         │
│ aligned      │ TokenizerBlob     (UTF-8 JSON)                       │
│ aligned      │ TensorIndex[]     (N × 192 bytes each)               │
│ PAGE-ALIGNED │ TensorDataBlock   (mmap-ready, 4K page aligned) ◀──┐ │
└──────────────┴──────────────────────────────────────────────────────┘
                                                                       │
Vulkan can mmap this block directly into a VkBuffer via               │
VK_EXT_external_memory_host — zero CPU copy during weight loading. ───┘
```

**Key properties:**
- Magic bytes: `SVRN` (0x53, 0x56, 0x52, 0x4E)
- All multi-byte fields: **little-endian**
- Per-tensor **CRC32C checksums** (hardware-accelerated via SSE4.2)
- Per-tensor `DType` field: supports F32, F16, BF16, INT8, INT4, INT3, INT2, Q4\_K, Q3\_K, Q2\_K
- Feature flags bitmask: `MMAP_READY`, `HAS_TOKENIZER`, `GROUPED_QUERY`, `RoPE_SCALED`, …
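
A few of these properties can be checked from Python without the engine. The sketch below makes several assumptions: that the magic bytes sit at offset 0 of the `FileHeader`, and that the checksum is the standard CRC-32C (Castagnoli) with the usual initial value and final XOR. Which bytes each checksum covers is not specified here, so `crc32c` only shows the algorithm itself. All function names are illustrative.

```python
MAGIC = b"SVRN"      # 0x53, 0x56, 0x52, 0x4E
PAGE_SIZE = 4096     # the tensor data block is 4K page aligned


def has_sovereign_magic(header: bytes) -> bool:
    """Assumes the magic occupies the first four bytes of the file."""
    return header[:4] == MAGIC


def align_up(offset: int, alignment: int = PAGE_SIZE) -> int:
    """Round an offset up to the next multiple of `alignment`."""
    return (offset + alignment - 1) // alignment * alignment


def _make_table():
    # Reflected Castagnoli polynomial, as used by standard CRC-32C.
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Table-driven software CRC-32C (slow, for verification only)."""
    crc = 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF
```

The software loop is far slower than the SSE4.2 instruction, so it is only useful for spot checks and tests.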

---

## Quantiser

The quantiser runs a 3-phase pipeline:

### Phase 1 – Calibration Profiling
Computes per-tensor activation statistics on a small calibration corpus (≥ 512 tokens):
- **Hessian proxy** (mean squared activation magnitude)
- **Outlier ratio** (fraction with |w| > 3σ)
- **Kurtosis** (distribution peakedness)
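
The three statistics are simple moments of the sampled values. The following is a minimal pure-Python sketch of how they could be computed for one tensor; `tensor_stats` is a hypothetical name, the kurtosis is the plain (non-excess) fourth standardised moment, and the real profiler may define or normalise these differently.

```python
import math


def tensor_stats(values):
    """Return mean squared magnitude, 3-sigma outlier ratio and kurtosis."""
    n = len(values)
    mean = sum(values) / n
    mean_sq = sum(v * v for v in values) / n            # Hessian proxy
    var = sum((v - mean) ** 2 for v in values) / n      # population variance
    sigma = math.sqrt(var)
    outlier_ratio = sum(1 for v in values if abs(v) > 3 * sigma) / n
    m4 = sum((v - mean) ** 4 for v in values) / n       # fourth central moment
    kurtosis = m4 / (var * var) if var > 0 else 0.0
    return {
        "mean_sq": mean_sq,
        "outlier_ratio": outlier_ratio,
        "kurtosis": kurtosis,
    }
```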

### Phase 2 – Budget Allocation
Assigns a `DType` to each tensor to hit a target average bpw:
```
importance ≥ 0.75  →  FP16 / INT8   (embeddings, first/last layers, norms)
importance ≥ 0.50  →  INT4 / Q4_K  (Q/K/V projections)
importance ≥ 0.25  →  Q3_K
importance <  0.25  →  Q2_K
```
Iteratively rebalances until the achieved bpw is within 5% of the target, i.e. `|achieved_bpw - target_bpw| < 0.05 × target_bpw`.
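
The threshold table and the stopping test can be sketched in a few lines of Python. This is an illustration only: where the table offers two formats per tier the sketch arbitrarily picks one, the bits-per-weight figures are the approximate values from the quantisation modes table, the 5% tolerance is read as relative to the target, and the rebalancing loop itself is not shown. All names are hypothetical.

```python
# Approximate bits per weight, taken from the quantisation modes table.
APPROX_BPW = {"fp16": 16.0, "int8": 8.0, "q4k": 4.5, "q3k": 3.5, "q2k": 2.6}


def pick_dtype(importance: float) -> str:
    """Map an importance score in [0, 1] to a storage format."""
    if importance >= 0.75:
        return "fp16"   # the table also allows INT8 in this tier
    if importance >= 0.50:
        return "q4k"    # the table also allows plain INT4 in this tier
    if importance >= 0.25:
        return "q3k"
    return "q2k"


def average_bpw(tensors) -> float:
    """Element-weighted average bpw over (n_elements, dtype) pairs."""
    total_bits = sum(n * APPROX_BPW[dtype] for n, dtype in tensors)
    total_elements = sum(n for n, _ in tensors)
    return total_bits / total_elements


def within_tolerance(achieved: float, target: float, rel_tol: float = 0.05) -> bool:
    """Stopping test, assuming the 5% is relative to the target bpw."""
    return abs(achieved - target) < rel_tol * target
```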

### Phase 3 – HQQ Quantisation
Per-block iterative solver minimising the Hessian-weighted MSE:
```
min_{scale, zero} ‖W − dequant(quant(W, scale, zero))‖²_H
```
Default: 20 iterations, block size 128 elements, FP16 scale storage.
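
For orientation, the `quant` / `dequant` pair in the objective can be illustrated with plain min/max affine quantisation of a single block. Note that this is deliberately *not* the HQQ solver: it picks `scale` and `zero` in one shot from the block's range, with no iterations and no Hessian weighting. It only shows what the solver is refining. Function names are hypothetical.

```python
def quantize_block(weights, bits=4):
    """Min/max affine quantisation of one block; returns (codes, scale, zero)."""
    levels = (1 << bits) - 1                      # 15 for INT4
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Round to the nearest level and clamp into the representable range.
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_block(codes, scale, zero):
    """Inverse mapping: integer codes back to approximate weights."""
    return [c * scale + zero for c in codes]


# One 128-element block, matching the default block size above.
block = [i / 127 - 0.5 for i in range(128)]
codes, scale, zero = quantize_block(block)
restored = dequantize_block(codes, scale, zero)
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

With this scheme the reconstruction error per element is bounded by half a quantisation step (`scale / 2`); an iterative solver aims to lower the weighted error further.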

---

## Memory Manager

The `AsyncMemoryManager` implements a double-buffered layer-streaming pipeline:

```
CPU Thread             GPU Compute Queue       DMA Transfer Queue
──────────             ─────────────────       ─────────────────

[Layer N-1 ready] ──▶  Compute(Layer N-1)
                                │
[Stream Layer N+1] ─────────────┼──────────▶ DMA(Layer N+1)
  (from mmap/RAM)               │                  │
                                ▼                  ▼
                        Compute(Layer N) ◀── Layer N ready
```
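
The overlap in the diagram can be modelled on the CPU with a single background worker standing in for the DMA queue. The sketch below is a conceptual analogue only: it uses a Python thread pool rather than Vulkan queues and timeline semaphores, and `run_layers`, `load_layer` and `compute_layer` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def run_layers(num_layers, load_layer, compute_layer):
    """Compute layer N while layer N+1 is being loaded in the background."""
    outputs = []
    if num_layers == 0:
        return outputs
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(load_layer, 0)       # prime the first buffer
        for n in range(num_layers):
            weights = pending.result()            # wait until layer N is ready
            if n + 1 < num_layers:
                # Start streaming the next layer before computing this one.
                pending = dma.submit(load_layer, n + 1)
            outputs.append(compute_layer(n, weights))
    return outputs
```

Only two layers are ever in flight at once (the one being computed and the one being loaded), which is what makes the scheme double-buffered.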

**VRAM pressure response:**
- Above 88% VRAM usage: start evicting least-recently-used layers (tracked in an LRU free-list).
- Above 95% VRAM usage: force CPU offload via AVX-512 / NEON kernels.
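
Expressed as code, the two thresholds form a simple decision function. This is a sketch of the policy as described, not the manager's actual implementation; `pressure_action` and its return values are hypothetical.

```python
EVICT_THRESHOLD = 0.88     # start evicting LRU layers above this usage
OFFLOAD_THRESHOLD = 0.95   # force CPU offload above this usage


def pressure_action(used_bytes: int, total_bytes: int) -> str:
    """Return the response for the current VRAM usage fraction."""
    usage = used_bytes / total_bytes
    if usage > OFFLOAD_THRESHOLD:
        return "cpu_offload"
    if usage > EVICT_THRESHOLD:
        return "evict_lru"
    return "none"
```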

---

## KV Cache (PagedAttention)

Inspired by [vLLM's PagedAttention](https://arxiv.org/abs/2309.06180):

- One giant VRAM pool pre-allocated at startup (no per-block VkBuffer overhead).
- **Block size**: 16 tokens per block (configurable, must be power-of-2).
- **Copy-on-write** forking: beam search / speculative decoding shares blocks until a write occurs.
- **Descriptor sets** pre-allocated per `(block_id × layer)` pair to avoid per-inference allocation.
- Optional `ConstantContextCache` for RWKV / Mamba models (O(1) memory regardless of sequence length).
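
The bookkeeping behind the block pool and copy-on-write forking can be sketched without any GPU code. The class below is an illustrative model only: it tracks block ids and reference counts, does not copy any KV data, and `BlockPool` and `blocks_needed` are hypothetical names rather than the engine's API.

```python
class BlockPool:
    """Free-list allocator with reference counts for copy-on-write sharing."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        if block_size <= 0 or block_size & (block_size - 1):
            raise ValueError("block_size must be a power of two")
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.refs = [0] * num_blocks

    def alloc(self) -> int:
        block = self.free.pop()          # O(1); IndexError when exhausted
        self.refs[block] = 1
        return block

    def release(self, block: int) -> None:
        self.refs[block] -= 1
        if self.refs[block] == 0:
            self.free.append(block)      # O(1) return to the free list

    def fork(self, table):
        """Share every block of a sequence with a new sequence."""
        for block in table:
            self.refs[block] += 1
        return list(table)

    def write(self, table, index: int) -> int:
        """Give the writer a private block if the current one is shared."""
        block = table[index]
        if self.refs[block] > 1:
            self.refs[block] -= 1
            table[index] = self.alloc()  # a real cache would copy data here
        return table[index]


def blocks_needed(n_tokens: int, block_size: int = 16) -> int:
    """Ceiling division: blocks required to hold n_tokens."""
    return -(-n_tokens // block_size)
```

After `child = pool.fork(parent)`, both block tables point at the same blocks; the first `pool.write(child, i)` replaces only entry `i` of the child with a fresh block and leaves the parent untouched.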

---

## Vulkan Compute Shaders

All shaders are written in GLSL (`.comp`) and compiled to SPIR-V at CMake configure time (without a Vulkan SDK, the build falls back to the precompiled SPIR-V shaders):

| Shader | Purpose |
|---|---|
| `rmsnorm.comp` | Fused RMSNorm with subgroup reduction; supports Gemma variant |
| `matmul_int4.comp` | Tiled INT4×FP16 GEMM with on-the-fly dequantisation and double-buffered B tiles |
| `attention_gqa.comp` | GQA/MHA/MQA fused attention: RoPE inline, PagedAttention block table, Flash-Attention tiled softmax |
| `silu_gate.comp` | Fused SwiGLU (SiLU × hadamard) for LLaMA/Gemma FFN |
| `sampler.comp` | GPU-resident sampling: temperature → top-K → softmax → top-P → min-P → multinomial |

All shaders use `GL_EXT_scalar_block_layout` and `GL_KHR_shader_subgroup_arithmetic` for efficient subgroup reductions.
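
The stage order listed for `sampler.comp` can be mirrored by a small CPU reference, which is handy for reasoning about how the filters interact. This is a sketch under assumptions, not a port of the shader: min-P is taken to mean "drop tokens whose probability is below `min_p` times the top probability", `top_k=0` disables top-K, and `sample_token` is a hypothetical name.

```python
import math
import random


def sample_token(logits, temperature=0.7, top_k=0, top_p=0.9, min_p=0.05, rng=random):
    """temperature -> top-K -> softmax -> top-P -> min-P -> multinomial."""
    # 1. Temperature scaling (temperature must be > 0).
    scaled = sorted(((l / temperature, i) for i, l in enumerate(logits)), reverse=True)
    # 2. Top-K: keep the k highest-scoring candidates.
    if top_k > 0:
        scaled = scaled[:top_k]
    # 3. Softmax over the survivors (shifted by the max for stability).
    peak = scaled[0][0]
    exps = [math.exp(s - peak) for s, _ in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 4. Top-P: smallest prefix whose cumulative probability reaches top_p.
    cumulative, cut = 0.0, len(probs)
    for j, p in enumerate(probs):
        cumulative += p
        if cumulative >= top_p:
            cut = j + 1
            break
    # 5. Min-P: drop candidates far below the most likely token.
    keep = [j for j in range(cut) if probs[j] >= min_p * probs[0]]
    # 6. Multinomial draw; rng.choices renormalises the weights.
    chosen = rng.choices(keep, weights=[probs[j] for j in keep], k=1)[0]
    return scaled[chosen][1]
```

Passing `rng=random.Random(seed)` makes the draw reproducible; with `top_k=1` the function reduces to greedy decoding.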

---

## Running Tests

```bash
# Build and run all tests
./scripts/build.sh --tests
cd build && ctest --output-on-failure

# Run a specific suite
./build/test_quantizer --success
./build/test_format    --success
./build/test_kv_cache  --success
./build/test_engine    --success

# Integration test (requires a converted model)
SOVEREIGN_TEST_MODEL=gemma-4b.sovereign ctest -R test_integration
```

---

## Roadmap

- [ ] **Continuous batching** — interleave multiple requests in a single GPU pass
- [ ] **Speculative decoding** — draft model integration for 2-4× decode speedup
- [ ] **Cooperative matrix** — VK\_KHR\_cooperative\_matrix path for tensor-core acceleration
- [ ] **io\_uring Direct Storage** — bypass staging buffers for PCIe 4.0+ NVMe
- [ ] **Rust bindings** — PyO3 alternative to pybind11
- [ ] **Windows support** — MinGW + Vulkan SDK on Windows
- [ ] **Web UI** — minimal OpenAI-compatible HTTP server (compatible with llama.cpp clients)
- [ ] **LoRA / adapter merging** — runtime LoRA weight injection without repack
- [ ] **RWKV / Mamba** — constant-memory inference via `ConstantContextCache`
- [ ] **Benchmark suite** — automated comparison vs llama.cpp on standard prompts

---

## Contributing

Contributions are welcome. Please open an issue before submitting large pull requests.

```bash
# Fork, clone, then create a feature branch
git checkout -b feat/my-feature

# Build with tests + debug symbols
./scripts/build.sh --debug --tests

# Make sure all tests pass before submitting
cd build && ctest --output-on-failure
```

Code style: follow the existing C++20 conventions (no exceptions in hot paths, `[[nodiscard]]` everywhere, PIMPL for public headers, RAII for all Vulkan handles).

---

## License

MIT License — see [LICENSE](LICENSE) for details.

---

*Sovereign Engine is an independent project and is not affiliated with Google, NVIDIA, AMD, or any model vendor.*
