Metadata-Version: 2.4
Name: quicksilver-cpu
Version: 0.2.0
Summary: High-performance CPU inference for GGUF quantized models
Home-page: https://github.com/kossisoroyce/quicksilver-cpu
Author: Quicksilver Contributors
License: Apache-2.0
Project-URL: Homepage, https://github.com/kossisoroyce/quicksilver-cpu
Project-URL: Repository, https://github.com/kossisoroyce/quicksilver-cpu
Project-URL: Issues, https://github.com/kossisoroyce/quicksilver-cpu/issues
Keywords: inference,llm,gguf,quantization,cpu,simd,avx2
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: C++
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.20
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0; extra == "dev"
Requires-Dist: pybind11>=2.10; extra == "dev"
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# Quicksilver CPU ⚡

[![PyPI version](https://badge.fury.io/py/quicksilver-cpu.svg)](https://pypi.org/project/quicksilver-cpu/)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)

**High-performance CPU inference for GGUF quantized models**

Quicksilver CPU is a lightweight, standalone inference engine for running quantized LLMs on CPUs. Using AVX2/AVX-512 SIMD kernels, it reaches **95+ tokens/second** on SmolLM2-135M (Q4_K_M) on an Intel Core i7-9750H, about 2.2x the llama.cpp throughput in the same benchmark (see [Performance](#performance)).

## Features

- 🚀 **Blazing Fast**: 95+ tok/s on an Intel Core i7-9750H with SmolLM2-135M Q4_K_M (2.2x llama.cpp)
- 📦 **Lightweight**: Minimal dependencies, pure CPU focus
- 🔧 **Native GGUF**: Direct parsing without external libraries
- ⚡ **SIMD Optimized**: AVX2/AVX-512 + OpenMP parallelization
- 🎯 **Quantization Support**: Q4_K, Q5_0, Q6_K, Q8_0, and more
- 🔄 **Streaming**: Token-by-token generation with callbacks
- 📊 **Batch Processing**: Efficient multi-request handling
- 🧠 **Smart Caching**: Prompt cache + int8 KV compression (3.9x)
- 🛠️ **CLI Tools**: Benchmark, info, and generation commands
- 📈 **Profiling**: Built-in performance diagnostics

## Installation

### From PyPI (coming soon)

```bash
pip install quicksilver-cpu
```

### From Source

```bash
git clone https://github.com/kossisoroyce/quicksilver-cpu.git
cd quicksilver-cpu

# Install dependencies
pip install pybind11 numpy

# Build and install
pip install -e .

# Or build just the C++ kernel
cd quicksilver_cpu/csrc
python setup.py build_ext --inplace
```

## Quick Start

### Basic Inference

```python
from quicksilver_cpu import Engine

# Load model
engine = Engine("model.gguf")

# Generate tokens
tokens = engine.generate([1, 2, 3], max_tokens=50)
print(f"Generated: {tokens}")
```

### Streaming Generation

```python
from quicksilver_cpu import Engine, StreamingGenerator

engine = Engine("model.gguf")
generator = StreamingGenerator(engine)

for token in generator.stream(prompt_tokens=[1, 2, 3], max_tokens=50):
    print(f"Token: {token.token_id}", end=" ")
```

### Batch Processing

```python
from quicksilver_cpu import Engine, BatchProcessor, BatchRequest

engine = Engine("model.gguf")
processor = BatchProcessor(engine)

requests = [
    BatchRequest(id="1", prompt_tokens=[1, 2, 3], max_tokens=20),
    BatchRequest(id="2", prompt_tokens=[4, 5, 6], max_tokens=20),
]

results, metrics = processor.process_batch(requests)
print(f"Processed {len(results)} requests at {metrics.avg_tokens_per_second:.1f} tok/s")
```

### Benchmarking

```python
from quicksilver_cpu import benchmark

tok_per_sec = benchmark("model.gguf", n_tokens=100)
print(f"Speed: {tok_per_sec:.1f} tok/s")
```

### CPU Configuration

```python
from quicksilver_cpu import configure_threads, get_cpu_info, print_cpu_info

# Show CPU info
print_cpu_info()

# Configure optimal threading
config = configure_threads(num_threads=8, bind_cores=True)
print(f"Using {config.num_threads} threads")
```

### Prompt Caching

```python
from quicksilver_cpu import PromptCache

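# Example token IDs (placeholders); in practice these come from your tokenizer
system_prompt_tokens = [1, 2, 3, 4]
new_prompt_tokens = system_prompt_tokens + [5, 6]

# Cache repeated prompts for faster inference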
cache = PromptCache(max_entries=100)

# Store prompt state
cache.put(system_prompt_tokens, cache_len=len(system_prompt_tokens))

# Find matching prefix for new prompts
match, prefix_len = cache.find_prefix_match(new_prompt_tokens)
if match:
    print(f"Reusing {prefix_len} cached tokens!")
```
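
The prefix lookup can be pictured with a short, library-independent sketch. This is an illustration of the general idea only, not Quicksilver's implementation: `common_prefix_len`, `longest_cached_prefix`, and the dict-based cache are hypothetical names, and a real prompt cache would also store the KV state that belongs to each entry.

```python
from typing import Dict, List, Optional, Tuple


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Count how many leading tokens two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_cached_prefix(
    cached: Dict[str, List[int]], prompt: List[int]
) -> Tuple[Optional[str], int]:
    """Return the key of the cached entry sharing the longest prefix with `prompt`."""
    best_key, best_len = None, 0
    for key, tokens in cached.items():
        n = common_prefix_len(tokens, prompt)
        if n > best_len:
            best_key, best_len = key, n
    return best_key, best_len


# Hypothetical cache contents: two previously seen prompts
cached = {"system": [1, 15, 27, 99], "other": [1, 15, 42]}
key, n = longest_cached_prefix(cached, [1, 15, 27, 99, 7, 8])
print(key, n)  # → system 4
```

Only the tokens after the matched prefix would then need a forward pass.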

### KV Cache Compression

```python
from quicksilver_cpu import KVCacheManager

# Use int8 compression for 3.9x memory savings
kv_cache = KVCacheManager(
    num_layers=32,
    num_kv_heads=8,
    head_dim=64,
    max_seq_len=4096,
    use_int8=True,  # 3.9x compression
)

print(f"Memory: {kv_cache.memory_usage_mb():.1f} MB")
print(f"Compression: {kv_cache.compression_ratio():.1f}x")
```
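
To see where a ratio close to 3.9x can come from, here is a minimal NumPy sketch of symmetric int8 quantization. It assumes float32 source values and one float16 scale per 64-element head vector; both are assumptions made for illustration, and the layout `KVCacheManager` actually uses may differ.

```python
import numpy as np


def quantize_int8(x: np.ndarray):
    """Symmetric int8 quantization with one float16 scale per row (x ≈ q * scale)."""
    max_abs = np.abs(x).max(axis=-1, keepdims=True)
    scale = (max_abs / 127.0).astype(np.float16)
    scale[scale == 0] = 1.0  # all-zero (or underflowed) rows: avoid dividing by zero
    q = np.rint(x / scale.astype(np.float32))
    return np.clip(q, -127, 127).astype(np.int8), scale


def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate float32 values from int8 codes and per-row scales."""
    return q.astype(np.float32) * scale.astype(np.float32)


rng = np.random.default_rng(0)
keys = rng.standard_normal((8, 64)).astype(np.float32)  # 8 positions, head_dim=64

q, scale = quantize_int8(keys)
restored = dequantize_int8(q, scale)

# Bytes before vs. after: float32 values vs. int8 codes plus float16 scales
ratio = keys.nbytes / (q.nbytes + scale.nbytes)
print(f"compression: {ratio:.2f}x")
print(f"max abs error: {np.abs(keys - restored).max():.4f}")
```

Under these assumptions the ratio is 2048 / 528 ≈ 3.88x: each 256-byte float32 vector shrinks to 64 int8 bytes plus a 2-byte scale.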

### Profiling

```python
from quicksilver_cpu import get_profiler, Engine

engine = Engine("model.gguf")
profiler = get_profiler()

profiler.start("inference")
tokens = engine.generate([1, 2, 3], max_tokens=50)
profiler.stop("inference")

profiler.print_report()
```

### CLI Usage

```bash
# Show model information
quicksilver-cpu info -m model.gguf

# Benchmark inference speed
quicksilver-cpu benchmark -m model.gguf -n 100 --threads 8

# Generate text
quicksilver-cpu generate -m model.gguf -p "Hello world" --max-tokens 50 --stream
```

## Audio Support (TTS, STT, STS)

Quicksilver CPU includes full audio support for Text-to-Speech, Speech-to-Text, and Speech-to-Speech processing.

### Text-to-Speech (TTS)

```python
from quicksilver_cpu.audio import TTSEngine, TTSConfig

config = TTSConfig(
    model_path="qwen3-tts.gguf",
    sample_rate=24000,
    temperature=0.7,
)
engine = TTSEngine(config)

# Generate speech
result = engine.synthesize("Hello, welcome to Quicksilver!")
result.save("output.wav")

# Streaming generation
# (play_audio is a placeholder for your own playback function)
for chunk in engine.stream("Long text for streaming..."):
    play_audio(chunk)  # Real-time playback
```

### Speech-to-Text (STT)

```python
from quicksilver_cpu.audio import STTEngine, STTConfig

config = STTConfig(
    model_path="whisper.gguf",
    language="en",  # Set to None to auto-detect
)
engine = STTEngine(config)

# Transcribe audio
result = engine.transcribe("audio.wav")
print(result.text)

# Export subtitles
with open("subtitles.srt", "w") as f:
    f.write(result.to_srt())

# Streaming transcription
for segment in engine.stream("long_audio.wav"):
    print(f"[{segment.start:.1f}s] {segment.text}")
```
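
For reference, SubRip (`.srt`) output is a sequence of numbered cues with `HH:MM:SS,mmm` timestamps. The sketch below shows how timed segments could be rendered in that format. The `Segment` dataclass, its `end` field, and the helper names are hypothetical (the example above only shows `segment.start` and `segment.text`); this is not the library's `to_srt()` implementation.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """Hypothetical stand-in for a transcription segment."""
    start: float  # seconds
    end: float    # seconds
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 3661.5 -> 01:01:01,500."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        cues.append(f"{i}\n{timing}\n{seg.text.strip()}\n")
    return "\n".join(cues)


srt_text = segments_to_srt([Segment(0.0, 1.5, "Hello"), Segment(1.5, 3.25, "world")])
print(srt_text)
```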

### Speech-to-Speech (STS)

```python
from quicksilver_cpu.audio import STSEngine, STSConfig

config = STSConfig(
    stt_model_path="whisper.gguf",
    tts_model_path="tts.gguf",
    source_language="es",
    target_language="en",
)
engine = STSEngine(config)

# Translate speech
result = engine.translate("spanish_audio.wav")
result.save("english_audio.wav")

# Voice conversion
result = engine.convert_voice("input.wav", target_voice="voice_sample.wav")

# Real-time streaming translation
# (play_audio is a placeholder for your own playback function)
for chunk in engine.stream("live_audio.wav"):
    play_audio(chunk)
```

### Audio Utilities

```python
from quicksilver_cpu.audio import load_audio, save_audio, AudioBuffer

# Load/save audio
audio, sr = load_audio("input.wav", target_sr=24000)
save_audio(audio, "output.wav", sample_rate=24000)

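# Streaming buffer
buffer = AudioBuffer(sample_rate=24000)
audio_chunk = audio[:2400]  # placeholder chunk: first 100 ms at 24 kHz (assumes an array of samples)
buffer.append(audio_chunk)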
full_audio = buffer.get_audio()
```

## Supported Quantization Types

| Type | Bits/Weight | Block Size | Status |
|------|-------------|------------|--------|
| Q4_K | 4.5 | 256 | ✅ AVX2 optimized |
| Q5_0 | 5.5 | 32 | ✅ Supported |
| Q6_K | 6.6 | 256 | ✅ AVX2 optimized |
| Q8_0 | 8.5 | 32 | ✅ Supported |
| Q4_0 | 4.5 | 32 | ✅ Supported |
| Q2_K | 2.6 | 256 | ✅ Supported |
| Q3_K | 3.4 | 256 | ✅ Supported |
| Q5_K | 5.5 | 256 | ✅ Supported |
| F16 | 16 | 1 | ✅ Supported |
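
The bits-per-weight figures follow from the block layout: each block stores its quantized values plus a small amount of per-block metadata such as scales. As a worked example for the three 32-weight formats in the standard GGUF/ggml layout (each carries one fp16 scale per block), the arithmetic looks like this. The helper name `bits_per_weight` is hypothetical.

```python
def bits_per_weight(block_size: int, payload_bytes: int, metadata_bytes: int) -> float:
    """Average storage cost per weight, including per-block metadata."""
    return 8 * (payload_bytes + metadata_bytes) / block_size


# Q8_0: 32 int8 values + one fp16 scale (2 bytes)
q8_0 = bits_per_weight(32, payload_bytes=32, metadata_bytes=2)

# Q4_0: 32 4-bit values packed into 16 bytes + one fp16 scale
q4_0 = bits_per_weight(32, payload_bytes=16, metadata_bytes=2)

# Q5_0: 16 bytes of low nibbles + 4 bytes of high bits + one fp16 scale
q5_0 = bits_per_weight(32, payload_bytes=16 + 4, metadata_bytes=2)

print(q8_0, q4_0, q5_0)  # → 8.5 4.5 5.5
```

The K-quant formats use 256-weight super-blocks with additional sub-block scales, so their overhead is computed the same way but from a larger metadata budget.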

## Performance

Benchmarked on Intel Core i7-9750H with SmolLM2-135M Q4_K_M:

| Engine | Tokens/sec | Speedup |
|--------|------------|---------|
| llama.cpp | 43 | 1.0x |
| **Quicksilver CPU** | **95.7** | **2.22x** |

### Key Optimizations

1. **AVX2 SIMD** - 8-wide FMA operations for Q4_K/Q6_K GEMV
2. **Fused Operations** - Combined gate+up projections for better cache locality
3. **OpenMP Parallelization** - Multi-threaded layer computations
4. **Int8 KV Cache** - 3.9x memory compression with minimal quality loss
5. **Prompt Caching** - Reuse computations for repeated prefixes
6. **Memory Alignment** - 64-byte aligned allocations for SIMD efficiency
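
As an illustration of the fused gate+up idea (item 2), the NumPy sketch below stacks the two projection matrices so the input vector is consumed by a single matrix-vector product, then splits the result. It assumes a SwiGLU-style MLP as found in Llama-family models; the shapes and function names are illustrative, and this is not the C++ kernel itself.

```python
import numpy as np


def silu(x: np.ndarray) -> np.ndarray:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))


def mlp_unfused(x, w_gate, w_up, w_down):
    """Two separate projections: x is read twice."""
    return w_down @ (silu(w_gate @ x) * (w_up @ x))


def mlp_fused(x, w_gate_up, w_down):
    """One stacked projection: x is read once, then the output is split."""
    hidden = w_gate_up.shape[0] // 2
    gate_up = w_gate_up @ x
    gate, up = gate_up[:hidden], gate_up[hidden:]
    return w_down @ (silu(gate) * up)


rng = np.random.default_rng(0)
dim, hidden = 64, 256  # illustrative sizes
x = rng.standard_normal(dim)
w_gate = rng.standard_normal((hidden, dim))
w_up = rng.standard_normal((hidden, dim))
w_down = rng.standard_normal((dim, hidden))

# Stack once at load time so inference uses a single contiguous matrix
w_gate_up = np.concatenate([w_gate, w_up], axis=0)

same = np.allclose(mlp_unfused(x, w_gate, w_up, w_down), mlp_fused(x, w_gate_up, w_down))
print(f"fused matches unfused: {same}")
```

The two paths are mathematically identical; the fused form only changes how memory is traversed, which is where a cache-locality benefit would come from.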

## Requirements

### Mandatory
- **Python 3.9+**
- **NumPy >= 1.20**
- **AVX2 CPU** - Required for SIMD optimizations (Intel Haswell+, AMD Excavator+)
- **C++17 compiler** - clang++ or g++

### Strongly Recommended
- **OpenMP** - For multi-threaded inference (2-4x speedup)
  ```bash
  # macOS
  brew install libomp
  
  # Ubuntu/Debian
  sudo apt install libomp-dev
  
  # Fedora/RHEL
  sudo dnf install libomp-devel
  ```

### For Building from Source
- **pybind11** - `pip install pybind11`

### Platform Support

| Platform | Status |
|----------|--------|
| macOS (Apple Silicon) | ✅ Tested |
| macOS (Intel) | ✅ Supported |
| Linux (x86_64) | ✅ Supported |
| Windows | ⚠️ Experimental |

## API Reference

### `Engine`

```python
Engine(model_path: str, verbose: bool = True)
```

- `generate(prompt_tokens, max_tokens, temperature, top_p)` - Generate tokens
- `forward(token_id)` - Single forward pass, returns logits
- `reset_cache()` - Clear KV cache
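
The `temperature` and `top_p` parameters conventionally refer to temperature scaling and nucleus (top-p) sampling. The sketch below shows that standard technique applied to a logits vector such as the one `forward()` returns. It is not taken from the Quicksilver source, and `sample_top_p` is a hypothetical helper.

```python
import numpy as np


def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=None) -> int:
    """Sample a token ID using temperature scaling and nucleus (top-p) filtering."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64)
    if temperature <= 0:
        return int(np.argmax(logits))  # treat non-positive temperature as greedy

    # Temperature-scaled softmax (shifted by the max for numerical stability)
    scaled = logits / temperature
    scaled -= scaled.max()
    probs = np.exp(scaled)
    probs /= probs.sum()

    # Keep the smallest set of top tokens whose cumulative probability reaches top_p
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]

    # Renormalize over the kept tokens and sample
    kept_probs = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept_probs))


token = sample_top_p([2.0, 1.0, 0.5, -1.0], temperature=0.8, top_p=0.9,
                     rng=np.random.default_rng(0))
print(token)
```

Lower temperatures sharpen the distribution toward the most likely token, and lower `top_p` values shrink the candidate set.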

### `StreamingGenerator`

```python
StreamingGenerator(engine, tokenizer=None, default_max_tokens=256)
```

- `stream(prompt_tokens, max_tokens, temperature, top_p)` - Yield tokens
- `stream_async(...)` - Async version
- `stop()` - Request early stop

### `BatchProcessor`

```python
BatchProcessor(engine, tokenizer=None)
```

- `process_batch(requests, progress_callback)` - Process multiple requests

## License

Apache 2.0

## Contributing

Contributions welcome! Please open an issue or PR on GitHub.

## Related Projects

- [Quicksilver](https://github.com/kossisoroyce/quicksilver) - Full inference engine with GPU support
- [llama.cpp](https://github.com/ggerganov/llama.cpp) - Original GGUF implementation
