Metadata-Version: 2.4
Name: bit-ttt-engine
Version: 0.7.0
Classifier: Development Status :: 4 - Beta
Classifier: Programming Language :: Rust
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: License :: OSI Approved :: MIT License
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Operating System :: Microsoft :: Windows
Requires-Dist: tokenizers>=0.19
Requires-Dist: huggingface-hub>=0.20
Requires-Dist: bit-ttt-engine[server] ; extra == 'all'
Requires-Dist: fastapi>=0.100 ; extra == 'server'
Requires-Dist: uvicorn>=0.20 ; extra == 'server'
Requires-Dist: sse-starlette>=1.6 ; extra == 'server'
Provides-Extra: all
Provides-Extra: server
License-File: LICENSE
Summary: Fast local LLM inference with TTT (Test-Time Training) and LoRA — the model that learns while it runs
Keywords: llm,inference,ttt,lora,gguf,quantization,cuda
Home-Page: https://github.com/imonoonoko/Bit-TTT-Engine
Author: imonoonoko
License: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# 🧠 Bit-TTT-Engine

[![PyPI](https://img.shields.io/pypi/v/bit-ttt-engine.svg)](https://pypi.org/project/bit-ttt-engine/)
[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Rust](https://img.shields.io/badge/rust-1.70+-orange.svg)](https://www.rust-lang.org/)

**Fast local LLM inference that learns while it runs.**

- 🏎️ **47+ tok/s** on RTX 4060 Ti (7B Q4_K_M)
- 🧠 **TTT** (Test-Time Training) — adapts during inference (world's first!)
- 🎨 **LoRA** — fine-tune with one flag
- 📦 **5 model families** — Llama-2/3, Gemma-2, Qwen2.5, Mistral
- 🔌 **OpenAI-compatible API** — drop-in replacement

## 🚀 Quick Start

```bash
pip install bit-ttt-engine
```
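
The package metadata also declares a `server` extra (FastAPI, uvicorn, sse-starlette) for the API server, plus an `all` extra that includes it:

```bash
pip install "bit-ttt-engine[server]"
```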

```python
import cortex_rust

# Load any GGUF model (auto-downloads from HuggingFace!)
model = cortex_rust.load("user/model-GGUF")

# Chat
response = model.chat([
    {"role": "user", "content": "Hello!"}
])
print(response)

# Stream
for token in model.chat_stream([
    {"role": "user", "content": "Tell me a story"}
]):
    print(token, end="", flush=True)
```
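
Models referenced by repo ID are fetched through huggingface-hub, so they should land in the standard Hugging Face cache directory (configurable via `HF_HOME`).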

## 🖥️ CLI

```bash
# Interactive chat
bit-ttt chat model.gguf

# Generate text
bit-ttt generate model.gguf -p "Once upon a time"

# OpenAI-compatible API server
bit-ttt serve model.gguf --port 8000

# With LoRA + Q8 KV cache
bit-ttt chat model.gguf --lora adapter.bin --q8-cache
```

## 🧠 TTT — Test-Time Training

**The model learns while it generates.** Test-Time Training updates a small set of weights with gradient steps on the incoming context, so the model adapts within a session instead of staying frozen. No other local LLM does this.

```python
model = cortex_rust.load("model.gguf")
model.enable_ttt(True)

# Each conversation makes the model smarter
response = model.chat([{"role": "user", "content": "My name is Alice"}])
# Next time, it remembers context better!
```
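
TTT composes with the streaming API from the quick start; a minimal sketch using only the calls shown above:

```python
import cortex_rust

model = cortex_rust.load("model.gguf")
model.enable_ttt(True)  # turn on test-time adaptation

# Streaming works the same way; the model keeps adapting as it generates.
for token in model.chat_stream([
    {"role": "user", "content": "My name is Alice. Remember it."}
]):
    print(token, end="", flush=True)
```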

## ⚡ Performance

Decode throughput, measured on an RTX 4060 Ti:

| Model | Speed | VRAM |
|-------|-------|------|
| Llama-2 7B (Q4_K_M) | 47.8 tok/s | ~5 GB |
| Llama-3 8B (Q4_K_M) | 36.8 tok/s | ~6 GB |
| Mistral 7B (Q4_K_M) | 40.8 tok/s | ~5 GB |
| Qwen2.5 1.5B (Q4_K_M) | 70.4 tok/s | ~2 GB |

With `--q8-cache`: **82% VRAM reduction** for KV cache.
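
For intuition on what the KV cache costs, a back-of-the-envelope sizing sketch (illustrative only: it assumes a Llama-2-7B-like shape and models just the storage dtype, not the engine's actual cache layout):

```python
# Rough KV-cache sizing for a Llama-2-7B-shaped model:
# 32 layers, 32 KV heads, head_dim 128 (published architecture).
layers, kv_heads, head_dim = 32, 32, 128
seq_len = 4096

elems = 2 * layers * kv_heads * head_dim * seq_len  # K and V tensors
for name, bytes_per_elem in [("fp16", 2), ("q8", 1)]:
    gib = elems * bytes_per_elem / 2**30
    print(f"{name}: {gib:.2f} GiB at {seq_len} tokens")
```

Dtype alone accounts for roughly half the savings in this toy model; the engine's reported 82% reduction presumably includes further optimizations beyond storage width.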

## 🔌 OpenAI-Compatible API

```bash
bit-ttt serve model.gguf --port 8000
```
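
With the server running, a quick smoke test against the chat-completions route (assuming the standard OpenAI paths, which the Python client example below also relies on):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hi!"}]}'
```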

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hi!"}],
    stream=True,
)
# Consume the stream; each chunk carries a delta with new tokens.
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```

## 📖 Links

- [GitHub](https://github.com/imonoonoko/Bit-TTT-Engine)
- [Documentation](https://github.com/imonoonoko/Bit-TTT-Engine#readme)

## 💖 License

MIT License

