Metadata-Version: 2.4
Name: aurestral
Version: 1.0.1
Summary: Local GGUF AI inference library built on llama-cpp-python with hardware auto-tuning
Author: AyaX_CreationZ
License-Expression: MIT
Project-URL: Homepage, https://github.com/AyaX_CreationZ/aurestral
Project-URL: Documentation, https://github.com/AyaX_CreationZ/aurestral#readme
Keywords: llm,gguf,llama-cpp,inference,local-ai
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: <3.14,>=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: llama-cpp-python>=0.2.90
Requires-Dist: psutil>=5.9.0
Provides-Extra: dev
Requires-Dist: build>=1.0.0; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"
Dynamic: license-file

# Aurestral

Local GGUF inference for Python, powered by [llama-cpp-python](https://github.com/abetlen/llama-cpp-python). Aurestral discovers models in your project’s `models/` folder, auto-tunes thread count, context size, and GPU offload for your hardware, and ships with an interactive chatbot CLI. The project is still evolving; stay tuned for Aurestral Console.

## Requirements

- **Python 3.9–3.12** (3.13 may work; **3.14 is not supported yet** because no `llama-cpp-python` wheels exist for it)
- A GGUF model file (e.g. from [Hugging Face](https://huggingface.co/models?library=gguf))

## Installation

### Recommended (Windows)

PyPI ships only a **source tarball** for `llama-cpp-python`, and building it from source often fails on Windows (typically because of long-path limits). Install a **prebuilt wheel** first, then Aurestral:

```powershell
# Use Python 3.11 or 3.12 (not 3.14)
py -3.12 -m venv .venv
.\.venv\Scripts\Activate.ps1

# Prebuilt CPU wheel (fast, no compile)
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

pip install --upgrade aurestral
```

**NVIDIA GPU (e.g. RTX 4060)** — use the CUDA wheel index instead of CPU:

```powershell
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124
pip install --upgrade aurestral
```

### Simple install (Linux / macOS)

```bash
pip install aurestral
```

On **macOS**, prebuilt wheels usually include Metal acceleration.

### If install still fails on Windows

1. Enable [long path support](https://pip.pypa.io/warnings/enable-long-paths) in Windows settings.
2. Use a short temp folder before installing:
   ```powershell
   New-Item -ItemType Directory -Force C:\tmp | Out-Null
   $env:TEMP = "C:\tmp"
   $env:TMP = "C:\tmp"
   pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
   ```
3. Confirm Python version: `python --version` should show 3.12.x, not 3.14.
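
If you would rather check the interpreter from Python itself, the following is a minimal sketch. The bounds come from this package's `Requires-Python` metadata (`>=3.9,<3.14`); the helper name `is_supported` is hypothetical and not part of Aurestral.

```python
import sys

# Supported range taken from the package metadata: >=3.9, <3.14.
MIN_VERSION = (3, 9)
MAX_EXCLUSIVE = (3, 14)


def is_supported(version=None):
    """Return True if the (major, minor) version is inside the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor < MAX_EXCLUSIVE


if __name__ == "__main__":
    current = ".".join(map(str, sys.version_info[:3]))
    status = "supported" if is_supported() else "NOT supported"
    print(f"Python {current} is {status} by Aurestral")
```

Run it with the same interpreter that owns your virtual environment, so the result reflects the Python that `pip` will install into.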

## Project layout

Place GGUF files in a `models/` directory at your project root (or set `AURESTRAL_MODELS_DIR`):

```
my-project/
├── models/
│   └── llama-3.2-3b-instruct.Q4_K_M.gguf
└── main.py
```
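
To check which files would be visible under this layout, here is a rough sketch that mirrors the rule described above: use `AURESTRAL_MODELS_DIR` if it is set, otherwise `./models`. The functions `resolve_models_dir` and `find_gguf_models` are hypothetical helpers written for illustration; Aurestral's own discovery logic may differ.

```python
import os
from pathlib import Path


def resolve_models_dir(project_root="."):
    """Prefer AURESTRAL_MODELS_DIR if set, else <project_root>/models."""
    override = os.environ.get("AURESTRAL_MODELS_DIR")
    return Path(override) if override else Path(project_root) / "models"


def find_gguf_models(models_dir):
    """Return GGUF files in models_dir sorted by name (empty list if the dir is missing)."""
    models_dir = Path(models_dir)
    if not models_dir.is_dir():
        return []
    return sorted(
        p for p in models_dir.iterdir()
        if p.is_file() and p.suffix.lower() == ".gguf"
    )


if __name__ == "__main__":
    # Print each discovered model with its size on disk.
    for path in find_gguf_models(resolve_models_dir()):
        size_gib = path.stat().st_size / 1024**3
        print(f"{path.name}  ({size_gib:.2f} GiB)")
```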

## Quick start

### Interactive chatbot

```bash
cd my-project
aurestral
# or explicitly:
aurestral chat -m llama-3.2-3b-instruct.Q4_K_M.gguf
```

Chat commands: `/help`, `/clear`, `/exit`

### Python API

```python
from aurestral import load_model, ChatSession, generate

# One-shot completion
text = generate("Explain quantum entanglement in one sentence.")
print(text)

# Reusable model handle
model = load_model()  # auto-picks sole GGUF, or pass name="my-model"
reply = model.chat([
    {"role": "user", "content": "Hello!"},
])
print(reply)

# Multi-turn session with streaming
session = ChatSession.create(system_prompt="You are a concise coding assistant.")
session.send("Write a Python hello world.", stream=True)
```

### List models and hardware info

```bash
aurestral list
aurestral info
aurestral run "The capital of France is" --stream
```

## Hardware auto-tuning

On load, Aurestral inspects CPU cores, RAM, and whether `llama-cpp-python` was built with GPU offload support. It sets:

| Setting | Behavior |
|--------|----------|
| `n_threads` | Physical cores minus one |
| `n_ctx` | 1k–8k based on available RAM |
| `n_gpu_layers` | `-1` (all layers) when GPU offload is available |
| `use_mlock` | Enabled on high-RAM CPU-only setups |
| `flash_attn` | Enabled when GPU offload is available |

Override the defaults by passing an `InferenceConfig`, optionally with `auto_tune=False`:

```python
from aurestral import InferenceConfig, load_model

cfg = InferenceConfig(n_ctx=8192, n_gpu_layers=35)
model = load_model("my-model.gguf", config=cfg, auto_tune=False)
```
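
To make the auto-tuning table above more concrete, here is an illustrative sketch of how such heuristics could be written. It is not Aurestral's implementation: the function name `suggest_settings`, the RAM cut-offs, the 32 GB "high-RAM" threshold, and the choice of `0` GPU layers on CPU-only machines are all assumptions made for the example.

```python
def suggest_settings(physical_cores, available_ram_gb, gpu_offload):
    """Illustrative heuristics only; thresholds are assumptions, not Aurestral's."""
    # Leave one core free for the OS, but never drop below one thread.
    n_threads = max(1, physical_cores - 1)

    # Scale the context window between 1k and 8k with available RAM
    # (the cut-off values here are guesses).
    if available_ram_gb < 4:
        n_ctx = 1024
    elif available_ram_gb < 8:
        n_ctx = 2048
    elif available_ram_gb < 16:
        n_ctx = 4096
    else:
        n_ctx = 8192

    return {
        "n_threads": n_threads,
        "n_ctx": n_ctx,
        # -1 offloads all layers; 0 (assumed) keeps everything on the CPU.
        "n_gpu_layers": -1 if gpu_offload else 0,
        # Lock memory only on CPU-only machines with plenty of RAM (assumed >= 32 GB).
        "use_mlock": (not gpu_offload) and available_ram_gb >= 32,
        "flash_attn": bool(gpu_offload),
    }


if __name__ == "__main__":
    print(suggest_settings(physical_cores=8, available_ram_gb=16, gpu_offload=True))
```

On a real machine the inputs could come from `psutil`, which Aurestral already depends on: `psutil.cpu_count(logical=False)` for physical cores and `psutil.virtual_memory().available` for free RAM in bytes.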

## Configuration reference

**Environment**

- `AURESTRAL_MODELS_DIR` — path to models folder (instead of `./models`)

**`InferenceConfig`** — load-time: `n_ctx`, `n_batch`, `n_threads`, `n_gpu_layers`, `use_mmap`, `use_mlock`, `flash_attn`

**`GenerateConfig`** — generation-time: `max_tokens`, `temperature`, `top_p`, `top_k`, `repeat_penalty`, `stop`, `stream`

## License

MIT License — see [LICENSE](LICENSE).
