Metadata-Version: 2.4
Name: hallutok
Version: 0.2.0
Summary: Anti-Hallucination & Token Optimization library for Groq, Gemini and local LLMs
Author-email: Joel Pawar <joelpawarwork@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/joelpawar08/hallutok
Project-URL: Issues, https://github.com/joelpawar08/hallutok/issues
Project-URL: Documentation, https://github.com/joelpawar08/hallutok#readme
Keywords: llm,hallucination,token-optimization,groq,gemini,ai
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: groq
Requires-Dist: groq>=0.9.0; extra == "groq"
Provides-Extra: gemini
Requires-Dist: google-generativeai>=0.7.0; extra == "gemini"
Provides-Extra: local
Requires-Dist: ollama>=0.1.0; extra == "local"
Requires-Dist: transformers>=4.40.0; extra == "local"
Requires-Dist: torch>=2.0.0; extra == "local"
Provides-Extra: all
Requires-Dist: groq>=0.9.0; extra == "all"
Requires-Dist: google-generativeai>=0.7.0; extra == "all"
Requires-Dist: ollama>=0.1.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"

# Hallutok

**Token optimization and hallucination detection for LLM applications.**

[![PyPI version](https://badge.fury.io/py/hallutok.svg)](https://pypi.org/project/hallutok/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

Hallutok is a Python library that wraps LLM calls with two features most production apps need but rarely have built in: prompt compression to reduce token spend, and response scoring to catch likely hallucinations before they reach your users. It works with Groq, Gemini, Ollama, and HuggingFace, and ships with a full CLI so you can use every feature directly from your terminal.

---

## Table of Contents

- [Installation](#installation)
- [CLI](#cli)
  - [chat](#hallutok-chat)
  - [optimize](#hallutok-optimize)
  - [validate](#hallutok-validate)
  - [session](#hallutok-session)
  - [models](#hallutok-models)
  - [stats](#hallutok-stats)
- [Quick Start](#quick-start)
- [API Providers — Groq and Gemini](#api-providers--groq-and-gemini)
- [Runtime Engine — Local Models](#runtime-engine--local-models)
- [Complete Runtime Example](#complete-runtime-example)
- [Components Reference](#components-reference)
  - [HallutokClient](#hallutokclient)
  - [TokenOptimizer](#tokenoptimizer)
  - [HallucinationValidator](#hallucinationvalidator)
  - [HallutokEngine](#hallutokengine)
  - [ContextWindowManager](#contextwindowmanager)
  - [SessionManager](#sessionmanager)
  - [LatencyOptimizer](#latencyoptimizer)
- [Result Objects](#result-objects)
- [Roadmap](#roadmap)

---

## Installation

```bash
# Groq support (quotes keep shells such as zsh from expanding the brackets)
pip install "hallutok[groq]"

# Gemini support
pip install "hallutok[gemini]"

# Groq, Gemini, and Ollama clients
pip install "hallutok[all]"
```

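For local model support via Ollama or HuggingFace, install the `local` extra, which pulls in `ollama`, `transformers`, and `torch`, or install only the packages you need:

```bash
pip install "hallutok[local]"               # Ollama + HuggingFace dependencies
pip install ollama                          # Ollama only
pip install transformers torch              # HuggingFace only
```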

---

## CLI

Hallutok installs a `hallutok` command alongside the library. Every feature — chatting, optimizing prompts, validating text, managing sessions — is available from the terminal without writing any Python.

```
hallutok <command> [options]

Commands:
  chat        Chat with a model (Ollama, Groq, or Gemini)
  optimize    Compress a prompt and see token savings
  validate    Score any text for hallucination risk
  session     List, inspect, and export saved sessions
  models      List available Ollama models
  stats       Show installed dependencies and system info
```

---

### hallutok chat

Send a prompt to any model and get a response with token savings and hallucination analysis printed inline.

```bash
# Ollama (default — requires Ollama running locally)
hallutok chat "What are black holes?"
hallutok chat "Explain quantum computing" --model phi3 --temperature 0.3

# Groq
hallutok chat "What causes inflation?" --groq gsk_your_key
hallutok chat "Summarize this" --groq gsk_your_key --model mixtral-8x7b-32768

# Gemini
hallutok chat "Explain neural networks" --gemini AIza_your_key

# With a system prompt
hallutok chat "What is the Higgs boson?" --system "You are a physics professor."

# Continue a named session
hallutok chat "What did we discuss?" --session my-research

# Save the session after chatting
hallutok chat "Tell me about supernovae" --save session.json

# Only print the response, no analytics
hallutok chat "What is DNA?" --quiet

# Output as JSON (useful for piping to other tools)
hallutok chat "What is AI?" --json

# Read a long prompt from a file
hallutok chat --file my_prompt.txt --model llama3

# Skip optimization or validation
hallutok chat "Hello" --no-optimize --no-validate
```

**Options:**

| Flag | Default | Description |
|---|---|---|
| `--model`, `-m` | `llama3` | Model name. The default is an Ollama model; pass a provider-specific name when using `--groq`, as in the examples above |
| `--groq` | — | Use Groq with this API key |
| `--gemini` | — | Use Gemini with this API key |
| `--temperature`, `-t` | `0.4` | Sampling temperature |
| `--max-tokens` | `1024` | Max response tokens |
| `--system`, `-s` | — | System prompt |
| `--session` | — | Session name to continue |
| `--save` | — | Save session to JSON file after chat |
| `--file`, `-f` | — | Read prompt from a file |
| `--no-validate` | off | Skip hallucination validation |
| `--no-optimize` | off | Skip token optimization |
| `--json` | off | Output result as JSON |
| `--quiet`, `-q` | off | Print only the response |

---

### hallutok optimize

Compress a prompt and see exactly how many tokens were saved, before sending anything to a model.

```bash
hallutok optimize "Please note that I would like you to explain in order to help me understand what black holes are."

# With a token limit
hallutok optimize "Your long prompt here..." --max-tokens 100

# From a file
hallutok optimize --file my_prompt.txt

# JSON output
hallutok optimize "Please explain..." --json
```

Example output:

```
Original Prompt  (18 tokens)
  Please note that I would like you to explain in order to help me understand what black holes are.

Optimized Prompt  (7 tokens)
  Explain what black holes are.

Token Optimization
  Before  : 18 tokens
  After   : 7 tokens
  Saved   : 11 tokens (61.1%)
```
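
The savings figures are plain arithmetic on the two token counts. The sketch below reproduces the numbers from the example output; `savings_report` here is a standalone illustrative helper, not Hallutok's own function, and it only assumes the four report keys shown elsewhere in this README.

```python
def savings_report(tokens_before: int, tokens_after: int) -> dict:
    """Build a report with the same four keys this README shows for token reports."""
    saved = tokens_before - tokens_after
    # Guard against division by zero for an empty prompt.
    percent = round(100 * saved / tokens_before, 1) if tokens_before else 0.0
    return {
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "tokens_saved": saved,
        "percent_saved": percent,
    }


print(savings_report(18, 7))
# {'tokens_before': 18, 'tokens_after': 7, 'tokens_saved': 11, 'percent_saved': 61.1}
```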

---

### hallutok validate

Score any text for hallucination risk using the Hallucination Risk Score (HRS), a composite of four sub-scores (SCS, ECS, CDS, FGS). Useful for auditing model outputs, or any other text, before publishing.

```bash
hallutok validate "I think maybe studies show that eating chocolate probably cures cancer."

# From a file
hallutok validate --file response.txt

# Just print the risk level (LOW / MEDIUM / HIGH)
hallutok validate "Some text..." --quiet

# JSON output with full score breakdown
hallutok validate "Some text..." --json
```

Example output:

```
Hallucination Analysis
  HRS Score : ████░░░░░░░░░░░ 0.612
  Risk      : HIGH
  SCS=0.410  ECS=0.700  CDS=1.000  FGS=0.900

  Flags:
  - Hedging language detected: "I think", "maybe", "probably"
  - Ungrounded claim: "studies show" without citation

  Suggestions:
  - Remove hedging phrases or support claims with citations
  - Add specific sources for statistical claims
```

---

### hallutok session

List and inspect saved session files, or export them to Markdown.

```bash
# List all session JSON files in current directory
hallutok session list

# List sessions in a specific directory
hallutok session list --dir ./sessions

# Show session stats and full chat history
hallutok session show my_session.json

# Export session as a readable Markdown chat log
hallutok session export my_session.json
hallutok session export my_session.json --output chat_log.md

# Show session as JSON
hallutok session show my_session.json --json
```

---

### hallutok models

List all models currently available in your local Ollama installation.

```bash
hallutok models

# Specify a different Ollama host
hallutok models --host http://192.168.1.100:11434

# JSON output
hallutok models --json
```

Example output:

```
Available Ollama Models
  MODEL                               SIZE         MODIFIED
  llama3:latest                       4823 MB      2024-06-01
  mistral:latest                      4108 MB      2024-05-20
  phi3:latest                         2301 MB      2024-05-18
```

---

### hallutok stats

Show system information, installed dependencies, and quick-start examples.

```bash
hallutok stats

# JSON output
hallutok stats --json
```

Example output:

```
Version Info
  hallutok_version       : 0.2.0
  python_version         : 3.11.4
  platform               : Darwin
  architecture           : arm64

Dependencies
  groq                      : installed
  google.generativeai        : installed
  ollama                    : installed
  transformers               : not installed
  torch                     : not installed
```
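
If you want the same kind of dependency check inside your own script, one possible approach uses only the standard library. This is a sketch of the idea, not how `hallutok stats` is implemented; the module names are taken from the example output above and `dependency_status` is a hypothetical helper.

```python
import importlib.util

# Module names as they appear in the example output above.
OPTIONAL_MODULES = ["groq", "google.generativeai", "ollama", "transformers", "torch"]


def dependency_status(modules=OPTIONAL_MODULES) -> dict:
    """Map each module name to 'installed' or 'not installed'."""
    status = {}
    for name in modules:
        try:
            # find_spec locates a module without importing it
            # (for dotted names, the parent package is imported first).
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (for example 'google') is missing.
            found = False
        status[name] = "installed" if found else "not installed"
    return status


for name, state in dependency_status().items():
    print(f"{name:<22}: {state}")
```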

---

## Quick Start

```python
from hallutok import HallutokClient

client = HallutokClient.with_groq(
    api_key="gsk_your_key",
    model="llama3-8b-8192",
    temperature=0.3,
)

result = client.chat("Explain what black holes are.")

print(result.response)
print(result.token_report)
# {'tokens_before': 12, 'tokens_after': 9, 'tokens_saved': 3, 'percent_saved': 25.0}

if result.validation.is_likely_hallucination:
    print("Flags:", result.validation.flags)
```

---

## API Providers — Groq and Gemini

### Groq

```python
from hallutok import HallutokClient

client = HallutokClient.with_groq(
    api_key="gsk_your_groq_key",
    model="llama3-8b-8192",
    temperature=0.3,
    max_response_tokens=1024,
    system_prompt="You are a factual assistant. Cite sources when possible.",
)

result = client.chat(
    "Please note that I would like you to explain in order to help me "
    "understand what black holes are and how they work in detail."
)

print(result.response)
print(result.token_report)
# {'tokens_before': 34, 'tokens_after': 13, 'tokens_saved': 21, 'percent_saved': 61.8}

if result.validation.is_likely_hallucination:
    print("Risk:", result.validation.risk_level)
    print("Flags:", result.validation.flags)
    print("Suggestions:", result.validation.suggestions)
```

### Gemini

```python
from hallutok import HallutokClient

client = HallutokClient.with_gemini(
    api_key="AIza_your_gemini_key",
    model="gemini-1.5-flash",
    temperature=0.4,
)

result = client.chat("Explain quantum entanglement to a 10-year-old.")
print(result.response)
print(result.token_report)
```

### Custom provider setup

```python
from hallutok import HallutokClient
from hallutok.providers import GroqProvider, GeminiProvider

provider = GroqProvider(api_key="gsk_...", model="mixtral-8x7b-32768")
# provider = GeminiProvider(api_key="AIza_...", model="gemini-1.5-pro")

client = HallutokClient(
    provider=provider,
    optimize_tokens=True,
    validate_responses=True,
    max_prompt_tokens=512,
    temperature=0.4,
    max_response_tokens=1024,
    system_prompt="You are a factual assistant.",
    cache_enabled=True,
)

result = client.chat("What causes inflation?")
```

### Pre-flight token estimation

Check how many tokens a prompt will use before sending it:

```python
estimate = client.estimate_cost_tokens(
    "Please note that I would like you to in order to help me explain "
    "how machine learning works and what it does."
)
print(estimate)
# {'tokens_before': 28, 'tokens_after': 11, 'tokens_saved': 17, 'percent_saved': 60.7}
```

---

## Runtime Engine — Local Models

The `HallutokEngine` brings the full Hallutok pipeline to local models. Load any model from Ollama or HuggingFace and get token optimization, hallucination scoring, context window management, session persistence, and latency optimization out of the box — no API key required.

### Loading a model

```python
from hallutok.runtime import HallutokEngine

# From Ollama (requires Ollama running at localhost:11434)
engine = HallutokEngine.from_ollama("llama3")

# From HuggingFace Hub
engine = HallutokEngine.from_huggingface(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device="auto",    # auto-detects cuda / mps / cpu
    quantize=True,    # 4-bit quantization to reduce memory
)

# From a local model directory
engine = HallutokEngine.from_local("/path/to/model")
```

### Engine configuration options

```python
engine = HallutokEngine.from_ollama(
    model="llama3",
    max_tokens=4096,              # total context window token budget
    trim_strategy="sliding",      # how to handle context overflow
    kv_cache=True,                # cache identical prompts
    warm_up=True,                 # pre-warm model to cut first-call latency
    stream=False,
    system_prompt="You are a concise, factual assistant.",
)
```

---

## Complete Runtime Example

This single script demonstrates every runtime feature — context management, session tracking, latency optimization, hallucination detection, export, and engine stats. Copy and run it against any Ollama model.

```python
from hallutok.runtime import HallutokEngine

# ── 1. Load the engine ────────────────────────────────────────────────────────
engine = HallutokEngine.from_ollama(
    model="llama3",
    max_tokens=4096,
    trim_strategy="sliding",
    kv_cache=True,
    warm_up=True,
    system_prompt="You are a factual assistant. Keep answers concise.",
)

# ── 2. Create a session ───────────────────────────────────────────────────────
session = engine.create_session(
    name="demo-session",
    system_prompt="You are a factual assistant.",
    max_tokens=4096,
    trim_strategy="sliding",
)

# ── 3. Multi-turn conversation ────────────────────────────────────────────────
questions = [
    "What are black holes?",
    "Please note that I would like you to explain how Hawking radiation works.",
    "How does the event horizon relate to the singularity?",
    "What would happen to a person falling into a black hole?",
]

for question in questions:
    result = session.chat(question, temperature=0.4, max_tokens=512)

    print(f"\nQ: {question}")
    print(f"A: {result.response[:200]}...")
    print(f"   Tokens saved   : {result.tokens_saved} ({result.tokens_saved_pct}%)")
    print(f"   HRS score      : {result.hallucination_score:.3f}")
    print(f"   Risk level     : {result.hallucination_risk}")
    print(f"   Latency        : {result.latency_ms:.0f}ms")
    print(f"   Cache hit      : {result.cache_hit}")
    print(f"   Context used   : {result.context_tokens_used} / {result.context_tokens_used + result.context_tokens_available} tokens")

    if result.is_hallucination:
        print(f"   Flags          : {result.hallucination_flags}")
        print(f"   Suggestions    : {result.suggestions}")

    # Math score breakdown
    print(f"   HRS breakdown  : {result.math_scores}")

# ── 4. Flag an important turn (never trimmed from context) ────────────────────
result = session.chat(
    "Summarize everything we discussed.",
    flag_turn=True,
    temperature=0.3,
)
print(f"\nSummary: {result.response[:300]}")

# ── 5. Session analytics ──────────────────────────────────────────────────────
stats = session.get_stats()
print("\n--- Session Stats ---")
print(f"Total turns           : {stats['total_turns']}")
print(f"Total tokens saved    : {stats['total_tokens_saved']}")
print(f"Avg tokens saved      : {stats['avg_tokens_saved_pct']}%")
print(f"Hallucinations caught : {stats['total_hallucinations_caught']}")
print(f"Avg HRS score         : {stats['avg_hallucination_score']}")
print(f"Avg latency           : {stats['avg_latency_ms']}ms")
print(f"Session duration      : {stats['session_duration_s']}s")
print(f"Context trims         : {stats['context_trims']}")

# ── 6. Engine-wide stats ──────────────────────────────────────────────────────
engine_stats = engine.get_stats()
print("\n--- Engine Stats ---")
print(f"Model          : {engine_stats['model']}")
print(f"Source         : {engine_stats['source']}")
print(f"Device         : {engine_stats['device']}")
print(f"Total sessions : {engine_stats['total_sessions']}")
print(f"Uptime         : {engine_stats['uptime_s']}s")
print(f"Latency stats  : {engine_stats['latency']}")

# ── 7. Export session ─────────────────────────────────────────────────────────
session.save("my_session.json")
session.export_markdown("chat_log.md")
session.export_csv("analytics.csv")

# ── 8. Load a saved session ───────────────────────────────────────────────────
restored = engine.load_session("my_session.json")
print(f"\nRestored session: {restored.name}")
print(f"Last response: {restored.last_response()[:100]}")

# ── 9. Clear caches ───────────────────────────────────────────────────────────
engine.clear_cache()
```

---

## Components Reference

### HallutokClient

The main entry point for Groq and Gemini API usage.

```python
from hallutok import HallutokClient

client = HallutokClient(
    provider=provider,
    optimize_tokens=True,       # compress prompts before sending
    validate_responses=True,    # score responses for hallucination
    max_prompt_tokens=512,      # hard cap on prompt size (None = no cap)
    temperature=0.5,
    max_response_tokens=1024,
    system_prompt=None,
    cache_enabled=True,
)
```

| Method | Description |
|---|---|
| `chat(prompt, ...)` | Send a prompt through the full pipeline |
| `estimate_cost_tokens(prompt)` | Preview token savings before sending |
| `clear_cache()` | Flush the optimizer prompt cache |
| `HallutokClient.with_groq(api_key, model, **kwargs)` | Factory for Groq |
| `HallutokClient.with_gemini(api_key, model, **kwargs)` | Factory for Gemini |

---

### TokenOptimizer

Compresses prompts before they are sent to any model.

```python
from hallutok.optimizer import TokenOptimizer

opt = TokenOptimizer(cache_enabled=True)

raw = """
Please note that I would like you to, in order to be helpful,
can you please explain, it is important to note that, machine learning
is a subset of AI. Machine learning is a subset of AI.
"""

compressed = opt.optimize(raw, max_tokens=100)
report = opt.savings_report(raw, compressed)
print(report)
# {'tokens_before': 54, 'tokens_after': 12, 'tokens_saved': 42, 'percent_saved': 77.8}
```

The optimizer applies these steps in order:

| Step | What it does |
|---|---|
| Whitespace normalization | Collapses spaces, trims blank lines |
| Boilerplate stripping | Removes "Please note that", "I would like you to", "It is important to note", etc. |
| Deduplication | Removes repeated sentences |
| Phrase compression | "in order to" -> "to", "due to the fact that" -> "because" |
| Truncation | Cuts to `max_tokens` at a sentence boundary |
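
To make the order of the steps concrete, here is a minimal standalone sketch of such a pipeline. It is not Hallutok's implementation: the phrase lists are short illustrative samples, the token estimate is a crude word count, and `optimize` and `estimate_tokens` are hypothetical names used only for this example.

```python
import re
from typing import Optional

# Illustrative phrase lists; the library's real lists are not reproduced here.
BOILERPLATE = [
    r"please note that",
    r"i would like you to",
    r"it is important to note that",
]
COMPRESSIONS = {
    r"\bin order to\b": "to",
    r"\bdue to the fact that\b": "because",
}
SENTENCE_SPLIT = r"(?<=[.!?])\s+"


def estimate_tokens(text: str) -> int:
    """Crude estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


def optimize(prompt: str, max_tokens: Optional[int] = None) -> str:
    # 1. Whitespace normalization
    text = re.sub(r"\s+", " ", prompt).strip()
    # 2. Boilerplate stripping
    for pattern in BOILERPLATE:
        text = re.sub(pattern + r",?\s*", "", text, flags=re.IGNORECASE)
    # 3. Deduplication: keep the first occurrence of each sentence
    seen, sentences = set(), []
    for sentence in re.split(SENTENCE_SPLIT, text):
        sentence = sentence.strip()
        key = sentence.lower()
        if key and key not in seen:
            seen.add(key)
            # Re-capitalize in case stripping removed the sentence opener.
            sentences.append(sentence[0].upper() + sentence[1:])
    text = " ".join(sentences)
    # 4. Phrase compression
    for pattern, replacement in COMPRESSIONS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    # 5. Truncation at a sentence boundary
    if max_tokens is not None:
        kept, used = [], 0
        for sentence in re.split(SENTENCE_SPLIT, text):
            cost = estimate_tokens(sentence)
            if used + cost > max_tokens:
                break
            kept.append(sentence)
            used += cost
        text = " ".join(kept)
    return text


raw = ("Please note that   machine learning is a subset of AI. "
       "Machine learning is a subset of AI. Explain it in order to help me.")
print(optimize(raw))
# Machine learning is a subset of AI. Explain it to help me.
print(optimize(raw, max_tokens=8))
# Machine learning is a subset of AI.
```

Running deduplication before phrase compression, as in the table, means repeated sentences are compared in their original wording. Note that in this sketch a first sentence longer than `max_tokens` yields an empty string; a real implementation would need a fallback for that case.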

---

### HallucinationValidator

Scores any text for hallucination risk using the Hallucination Risk Score (HRS), a composite of four mathematical sub-scores.

```python
from hallutok.antihallucination import HallucinationValidator

validator = HallucinationValidator()

response = "I think maybe studies show that eating chocolate probably cures cancer."
result = validator.validate(response)

print(result.confidence_score)         # 0.0–1.0, higher = more confident
print(result.risk_level)               # "LOW" | "MEDIUM" | "HIGH"
print(result.is_likely_hallucination)  # True / False
print(result.flags)                    # list of detected issues
print(result.warnings)                 # human-readable descriptions
print(result.suggestions)             # recommended actions
print(result.cleaned_response)         # response with disclaimer appended if flagged
print(result.math_scores)             # SCS, ECS, CDS, FGS, HRS breakdown
```

**HRS scoring breakdown:**

| Score | Name | What it measures |
|---|---|---|
| SCS | Semantic Confidence Score | Hedging language ("I think", "maybe", "probably") |
| ECS | Evidence Consistency Score | Ungrounded claims ("Studies show", "Research suggests") |
| CDS | Contradiction Detection Score | Internal contradictions ("always" + "never" in same text) |
| FGS | Factual Grounding Score | Numeric anomalies, implausible figures |
| HRS | Hallucination Risk Score | Composite of all four |

**Detection layers:**

| Layer | Examples caught |
|---|---|
| Hedging | "I think", "maybe", "perhaps", "I'm not sure", "I believe" |
| Ungrounded claims | "Studies show", "Research suggests", "Experts say" |
| Numeric anomalies | Percentages over 100%, implausible statistics |
| Contradictions | Contradictory absolute terms in the same response |
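
The detection layers are pattern-based, so the general idea can be sketched in a few lines of standard-library Python. The sketch below is an illustration under stated assumptions, not the validator's actual code: the phrase lists are the examples from the table, `detect_flags` is a hypothetical function, and no sub-score or HRS value is computed because the scoring formula is not documented here.

```python
import re

# Phrase lists copied from the examples in the table above (not exhaustive).
HEDGES = ["i think", "maybe", "perhaps", "probably", "i'm not sure", "i believe"]
UNGROUNDED = ["studies show", "research suggests", "experts say"]


def detect_flags(text: str) -> list:
    """Return human-readable flags for the four detection layers."""
    lowered = text.lower()
    flags = []

    # Layer 1: hedging language (whole-phrase matches only)
    hedges = [h for h in HEDGES if re.search(r"\b" + re.escape(h) + r"\b", lowered)]
    if hedges:
        quoted = ", ".join(f'"{h}"' for h in hedges)
        flags.append(f"Hedging language detected: {quoted}")

    # Layer 2: ungrounded claims
    for phrase in UNGROUNDED:
        if phrase in lowered:
            flags.append(f'Ungrounded claim: "{phrase}" without citation')

    # Layer 3: numeric anomalies (here: any percentage above 100)
    for value in re.findall(r"(\d+(?:\.\d+)?)\s*%", text):
        if float(value) > 100:
            flags.append(f"Numeric anomaly: {value}% exceeds 100%")

    # Layer 4: contradictory absolute terms in the same text
    if re.search(r"\balways\b", lowered) and re.search(r"\bnever\b", lowered):
        flags.append('Possible contradiction: "always" and "never" both appear')

    return flags


for flag in detect_flags(
    "I think maybe studies show that eating chocolate probably cures cancer."
):
    print("-", flag)
```

Keyword matching like this is cheap and deterministic, but it judges phrasing rather than facts, so a confidently worded false statement would pass this sketch unflagged.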

---

### HallutokEngine

The runtime engine for local model inference with the full Hallutok pipeline.

```python
from hallutok.runtime import HallutokEngine

# Factory methods
engine = HallutokEngine.from_ollama(model, host, **kwargs)
engine = HallutokEngine.from_huggingface(model_id, device, quantize, token, **kwargs)
engine = HallutokEngine.from_local(path, device, **kwargs)
```

**Constructor parameters:**

| Parameter | Type | Default | Description |
|---|---|---|---|
| `max_tokens` | int | 4096 | Context window token budget |
| `trim_strategy` | str | "sliding" | Context overflow strategy |
| `kv_cache` | bool | True | Cache identical prompt responses |
| `warm_up` | bool | True | Pre-warm model on load |
| `stream` | bool | False | Enable streaming responses |
| `system_prompt` | str | None | Default system instruction |

**Methods:**

| Method | Description |
|---|---|
| `create_session(name, system_prompt, max_tokens, trim_strategy)` | Create a new chat session |
| `load_session(path, max_tokens, trim_strategy)` | Restore session from JSON |
| `get_stats()` | Engine-wide performance stats |
| `clear_cache()` | Flush KV and optimizer caches |

---

### ContextWindowManager

Manages the token budget for a conversation and automatically trims messages when the budget is exceeded.

```python
from hallutok.runtime.context_manager import ContextWindowManager

ctx = ContextWindowManager(
    max_tokens=4096,
    trim_strategy="sliding",
    reserve_tokens=512,
)

ctx.add_message("system", "You are a helpful assistant.", flagged=True)
ctx.add_message("user", "What are black holes?")
ctx.add_message("assistant", "Black holes are regions of extremely strong gravity.")

print(ctx.stats())
# {
#   'messages': 3,
#   'total_tokens': 28,
#   'available_tokens': 3556,
#   'budget': 4096,
#   'usage_percent': 0.7,
#   'trim_count': 0,
#   'strategy': 'sliding'
# }
```

**Trim strategies:**

| Strategy | Behavior |
|---|---|
| `sliding` | Keep system messages and the last N conversation turns |
| `drop_oldest` | Remove oldest non-system, non-flagged messages first |
| `summarize` | Compress older messages into an extractive summary note |
| `priority` | Keep system messages, flagged turns, and the last 6 messages |

Messages added with `flagged=True` are never removed by any trim strategy.
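
As an illustration of how a trim strategy can honor protected messages, here is a minimal standalone sketch of `drop_oldest`. It is not the `ContextWindowManager` source: `Message`, `trim_drop_oldest`, and the word-count token estimate are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str
    flagged: bool = False


def estimate_tokens(text: str) -> int:
    """Crude estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


def trim_drop_oldest(messages: list, budget: int) -> list:
    """Drop the oldest non-system, non-flagged messages until the total fits."""
    kept = list(messages)
    total = sum(estimate_tokens(m.content) for m in kept)
    index = 0
    while total > budget and index < len(kept):
        msg = kept[index]
        if msg.role == "system" or msg.flagged:
            index += 1  # protected: leave it in place and look at the next one
            continue
        total -= estimate_tokens(msg.content)
        del kept[index]  # removable: drop it, then re-check the budget
    return kept


history = [
    Message("system", "You are a helpful assistant."),
    Message("user", "What are black holes?"),
    Message("assistant", "Black holes are regions of extremely strong gravity."),
    Message("user", "Summarize everything we discussed.", flagged=True),
]

trimmed = trim_drop_oldest(history, budget=18)
print([m.role for m in trimmed])
# ['system', 'assistant', 'user']
```

If only protected messages remain, the loop ends even though the total may still exceed the budget, which matches the rule that flagged messages are never removed.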

---

### SessionManager

Tracks conversation history, computes per-session analytics, and handles persistence and export.

```python
from hallutok.runtime.session_manager import SessionManager
from hallutok.runtime.context_manager import ContextWindowManager

ctx = ContextWindowManager(max_tokens=4096)
session = SessionManager(name="my-session", context_manager=ctx)
```

**Methods:**

| Method | Description |
|---|---|
| `record_turn(prompt, optimized_prompt, response, token_report, validation_result, latency_ms)` | Record a completed turn |
| `get_stats()` | Return aggregated session analytics |
| `save(path)` | Save session to JSON |
| `SessionManager.load(path, context_manager)` | Load session from JSON |
| `export_markdown(path)` | Export readable chat log as Markdown |
| `export_csv(path)` | Export per-turn analytics as CSV |
| `last_response()` | Return the most recent assistant response |
| `clear()` | Clear history and context |

**SessionStats fields:**

```python
stats = session.get_stats()

stats.session_name
stats.total_turns
stats.total_tokens_before
stats.total_tokens_after
stats.total_tokens_saved
stats.avg_tokens_saved_pct
stats.total_hallucinations_caught
stats.avg_hallucination_score
stats.avg_latency_ms
stats.session_duration_s
stats.context_trims
```

---

### LatencyOptimizer

Manages KV caching, warm-up, and latency tracking for the runtime engine.

```python
from hallutok.runtime.latency_optimizer import LatencyOptimizer

lat = LatencyOptimizer(
    kv_cache_enabled=True,
    kv_cache_size=64,
    stream=False,
    warm_up=True,
)

# Cache operations
lat.store_cache("What is AI?", "AI is artificial intelligence.")
cached = lat.get_cached("What is AI?")  # returns response or None

# Latency stats
print(lat.latency_stats())
# {
#   'calls': 12,
#   'avg_ms': 134.2,
#   'min_ms': 98.1,
#   'max_ms': 312.4,
#   'p95_ms': 280.0,
#   'cache_hits': 3,
#   'stream_mode': False
# }
```
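
The two ideas behind this component, a bounded prompt cache and percentile tracking, can be sketched with the standard library. The code below is an illustration only: the least-recently-used eviction policy and the nearest-rank percentile method are assumptions, and `PromptCache` and `p95` are hypothetical names rather than part of Hallutok.

```python
from collections import OrderedDict


class PromptCache:
    """Tiny LRU cache keyed by the exact prompt string."""

    def __init__(self, max_size: int = 64):
        self.max_size = max_size
        self._items = OrderedDict()

    def get(self, prompt: str):
        if prompt not in self._items:
            return None
        self._items.move_to_end(prompt)  # mark as most recently used
        return self._items[prompt]

    def store(self, prompt: str, response: str) -> None:
        self._items[prompt] = response
        self._items.move_to_end(prompt)
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the least recently used entry


def p95(latencies_ms) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    if not ordered:
        raise ValueError("no latency samples recorded")
    rank = (95 * len(ordered) + 99) // 100  # integer ceil of 0.95 * n
    return ordered[rank - 1]


cache = PromptCache(max_size=2)
cache.store("What is AI?", "AI is artificial intelligence.")
print(cache.get("What is AI?"))
# AI is artificial intelligence.
print(p95(range(1, 101)))
# 95
```

An exact-match cache only helps when the same prompt text recurs, so caching the optimized prompt rather than the raw one would likely raise the hit rate.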

---

## Result Objects

### ChatResult (API providers)

Returned by `HallutokClient.chat()`.

| Field | Type | Description |
|---|---|---|
| `response` | str | Final model response (with disclaimer if flagged) |
| `original_prompt` | str | The prompt as you wrote it |
| `optimized_prompt` | str | The prompt after token optimization |
| `token_report` | dict | tokens_before, tokens_after, tokens_saved, percent_saved |
| `validation` | ValidationResult | Full hallucination validation result |
| `provider` | str | "groq" or "gemini" |
| `warnings` | list[str] | Aggregated warnings from optimizer and validator |

### EngineResult (Runtime Engine)

Returned by `session.chat()`.

| Field | Type | Description |
|---|---|---|
| `response` | str | Final model response |
| `original_prompt` | str | Raw input prompt |
| `optimized_prompt` | str | Prompt after optimization |
| `tokens_before` | int | Token count before optimization |
| `tokens_after` | int | Token count after optimization |
| `tokens_saved` | int | Tokens saved |
| `tokens_saved_pct` | float | Percentage saved |
| `hallucination_score` | float | HRS composite score (0.0–1.0) |
| `hallucination_risk` | str | "LOW", "MEDIUM", or "HIGH" |
| `is_hallucination` | bool | Whether response is flagged |
| `hallucination_flags` | list[str] | Detected issues |
| `math_scores` | dict | SCS, ECS, CDS, FGS, HRS sub-scores |
| `latency_ms` | float | End-to-end latency in milliseconds |
| `cache_hit` | bool | True if served from KV cache |
| `context_tokens_used` | int | Tokens currently in context window |
| `context_tokens_available` | int | Tokens remaining in budget |
| `suggestions` | list[str] | Recommendations if hallucination detected |

---

## Roadmap

- [x] Token optimization pipeline
- [x] Hallucination detection with mathematical HRS scoring
- [x] Groq and Gemini provider adapters
- [x] Runtime Engine with Ollama and HuggingFace support
- [x] Context Window Manager with four trim strategies
- [x] Session Manager with history, analytics, and export
- [x] Latency Optimizer with KV cache and P95 tracking
- [x] CLI with chat, optimize, validate, session, models, and stats commands
- [ ] Async support via `achat()`
- [ ] Streaming responses
- [ ] OpenAI and Together AI provider adapters
- [ ] Self-consistency hallucination verification
- [ ] Per-call token budget enforcement

---

## License

MIT License — see [LICENSE](LICENSE) for details.
