Metadata-Version: 2.4
Name: hallutok
Version: 0.1.3
Summary: Production-ready token optimization and hallucination detection for LLM applications. Works with Groq, Gemini, Ollama, and HuggingFace. Includes a full local inference runtime with context window management, session persistence, KV caching, and mathematical hallucination risk scoring.
Author-email: Joel Pawar <joelpawarwork@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/joelpawar08/hallutok
Project-URL: Issues, https://github.com/joelpawar08/hallutok/issues
Project-URL: Documentation, https://github.com/joelpawar08/hallutok#readme
Keywords: llm,hallucination,hallucination-detection,token-optimization,groq,gemini,ollama,huggingface,ai,nlp,prompt-optimization,llm-safety,local-llm,runtime-engine
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: groq
Requires-Dist: groq>=0.9.0; extra == "groq"
Provides-Extra: gemini
Requires-Dist: google-generativeai>=0.7.0; extra == "gemini"
Provides-Extra: all
Requires-Dist: groq>=0.9.0; extra == "all"
Requires-Dist: google-generativeai>=0.7.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"

# Hallutok

**Token optimization and hallucination detection for LLM applications.**

[![PyPI version](https://badge.fury.io/py/hallutok.svg)](https://pypi.org/project/hallutok/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

Hallutok is a Python library that wraps LLM calls with two things most production apps need but rarely have built in: prompt compression to reduce token spend, and response scoring to catch hallucinations before they reach your users. It works with Groq, Gemini, Ollama, and HuggingFace.

---

## Table of Contents

- [Installation](#installation)
- [Quick Start](#quick-start)
- [API Providers — Groq and Gemini](#api-providers--groq-and-gemini)
- [Runtime Engine — Local Models](#runtime-engine--local-models)
- [Complete Runtime Example](#complete-runtime-example)
- [Components Reference](#components-reference)
  - [HallutokClient](#hallutokclient)
  - [TokenOptimizer](#tokenoptimizer)
  - [HallucinationValidator](#hallucinationvalidator)
  - [HallutokEngine](#hallutokengine)
  - [ContextWindowManager](#contextwindowmanager)
  - [SessionManager](#sessionmanager)
  - [LatencyOptimizer](#latencyoptimizer)
- [Result Objects](#result-objects)
- [Roadmap](#roadmap)
- [License](#license)

---

## Installation

```bash
# Groq support (quotes keep shells such as zsh from expanding the brackets)
pip install "hallutok[groq]"

# Gemini support
pip install "hallutok[gemini]"

# Both API providers
pip install "hallutok[all]"
```

For local model support via Ollama or HuggingFace, install the additional dependencies:

```bash
pip install ollama              # for Ollama
pip install transformers torch  # for HuggingFace
```

---

## Quick Start

```python
from hallutok import HallutokClient

client = HallutokClient.with_groq(
    api_key="gsk_your_key",
    model="llama3-8b-8192",
    temperature=0.3,
)

result = client.chat("Explain what black holes are.")

print(result.response)
print(result.token_report)
# {'tokens_before': 12, 'tokens_after': 9, 'tokens_saved': 3, 'percent_saved': 25.0}

if result.validation.is_likely_hallucination:
    print("Flags:", result.validation.flags)
```

---

## API Providers — Groq and Gemini

### Groq

```python
from hallutok import HallutokClient

client = HallutokClient.with_groq(
    api_key="gsk_your_groq_key",
    model="llama3-8b-8192",
    temperature=0.3,
    max_response_tokens=1024,
    system_prompt="You are a factual assistant. Cite sources when possible.",
)

result = client.chat(
    "Please note that I would like you to explain in order to help me "
    "understand what black holes are and how they work in detail."
)

print(result.response)
print(result.token_report)
# {'tokens_before': 34, 'tokens_after': 13, 'tokens_saved': 21, 'percent_saved': 61.8}

if result.validation.is_likely_hallucination:
    print("Risk:", result.validation.risk_level)
    print("Flags:", result.validation.flags)
    print("Suggestions:", result.validation.suggestions)
```

### Gemini

```python
from hallutok import HallutokClient

client = HallutokClient.with_gemini(
    api_key="AIza_your_gemini_key",
    model="gemini-1.5-flash",
    temperature=0.4,
)

result = client.chat("Explain quantum entanglement to a 10-year-old.")
print(result.response)
print(result.token_report)
```

### Custom provider setup

```python
from hallutok import HallutokClient
from hallutok.providers import GroqProvider, GeminiProvider

provider = GroqProvider(api_key="gsk_...", model="mixtral-8x7b-32768")
# provider = GeminiProvider(api_key="AIza_...", model="gemini-1.5-pro")

client = HallutokClient(
    provider=provider,
    optimize_tokens=True,
    validate_responses=True,
    max_prompt_tokens=512,
    temperature=0.4,
    max_response_tokens=1024,
    system_prompt="You are a factual assistant.",
    cache_enabled=True,
)

result = client.chat("What causes inflation?")
```

### Pre-flight token estimation

Check how many tokens a prompt will use before sending it:

```python
estimate = client.estimate_cost_tokens(
    "Please note that I would like you to in order to help me explain "
    "how machine learning works and what it does."
)
print(estimate)
# {'tokens_before': 28, 'tokens_after': 11, 'tokens_saved': 17, 'percent_saved': 60.7}
```
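
The four report fields are related by simple arithmetic. The sketch below reproduces the sample output above from the two token counts, assuming `percent_saved` is rounded to one decimal place (which matches the reports shown in this README). `savings_report_from_counts` is a hypothetical helper for illustration, not part of Hallutok.

```python
def savings_report_from_counts(tokens_before: int, tokens_after: int) -> dict:
    """Build a report dict shaped like the ones shown in this README."""
    tokens_saved = tokens_before - tokens_after
    # Guard against an empty prompt to avoid dividing by zero.
    if tokens_before:
        percent_saved = round(100 * tokens_saved / tokens_before, 1)
    else:
        percent_saved = 0.0
    return {
        "tokens_before": tokens_before,
        "tokens_after": tokens_after,
        "tokens_saved": tokens_saved,
        "percent_saved": percent_saved,
    }


print(savings_report_from_counts(28, 11))
# {'tokens_before': 28, 'tokens_after': 11, 'tokens_saved': 17, 'percent_saved': 60.7}
```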

---

## Runtime Engine — Local Models

The `HallutokEngine` brings the full Hallutok pipeline to local models. Load any model from Ollama or HuggingFace and get token optimization, hallucination scoring, context window management, session persistence, and latency optimization out of the box — no API key required.

### Loading a model

```python
from hallutok.runtime import HallutokEngine

# From Ollama (requires Ollama running at localhost:11434)
engine = HallutokEngine.from_ollama("llama3")

# From HuggingFace Hub
engine = HallutokEngine.from_huggingface(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device="auto",    # auto-detects cuda / mps / cpu
    quantize=True,    # 4-bit quantization to reduce memory
)

# From a local model directory
engine = HallutokEngine.from_local("/path/to/model")
```

### Engine configuration options

```python
engine = HallutokEngine.from_ollama(
    model="llama3",
    max_tokens=4096,              # total context window token budget
    trim_strategy="sliding",      # how to handle context overflow
    kv_cache=True,                # cache identical prompts
    warm_up=True,                 # pre-warm model to cut first-call latency
    stream=False,
    system_prompt="You are a concise, factual assistant.",
)
```

---

## Complete Runtime Example

This single script demonstrates every runtime feature — context management, session tracking, latency optimization, hallucination detection, export, and engine stats. Copy and run it against any Ollama model.

```python
from hallutok.runtime import HallutokEngine

# ── 1. Load the engine ────────────────────────────────────────────────────────
engine = HallutokEngine.from_ollama(
    model="llama3",
    max_tokens=4096,
    trim_strategy="sliding",
    kv_cache=True,
    warm_up=True,
    system_prompt="You are a factual assistant. Keep answers concise.",
)

# ── 2. Create a session ───────────────────────────────────────────────────────
session = engine.create_session(
    name="demo-session",
    system_prompt="You are a factual assistant.",
    max_tokens=4096,
    trim_strategy="sliding",
)

# ── 3. Multi-turn conversation ────────────────────────────────────────────────
questions = [
    "What are black holes?",
    "Please note that I would like you to explain how Hawking radiation works.",
    "How does the event horizon relate to the singularity?",
    "What would happen to a person falling into a black hole?",
]

for question in questions:
    result = session.chat(question, temperature=0.4, max_tokens=512)

    print(f"\nQ: {question}")
    print(f"A: {result.response[:200]}...")
    print(f"   Tokens saved   : {result.tokens_saved} ({result.tokens_saved_pct}%)")
    print(f"   HRS score      : {result.hallucination_score:.3f}")
    print(f"   Risk level     : {result.hallucination_risk}")
    print(f"   Latency        : {result.latency_ms:.0f}ms")
    print(f"   Cache hit      : {result.cache_hit}")
    print(f"   Context used   : {result.context_tokens_used} / {result.context_tokens_used + result.context_tokens_available} tokens")

    if result.is_hallucination:
        print(f"   Flags          : {result.hallucination_flags}")
        print(f"   Suggestions    : {result.suggestions}")

    # Math score breakdown
    print(f"   HRS breakdown  : {result.math_scores}")

# ── 4. Flag an important turn (never trimmed from context) ────────────────────
result = session.chat(
    "Summarize everything we discussed.",
    flag_turn=True,
    temperature=0.3,
)
print(f"\nSummary: {result.response[:300]}")

# ── 5. Session analytics ──────────────────────────────────────────────────────
stats = session.get_stats()
print("\n--- Session Stats ---")
print(f"Total turns           : {stats['total_turns']}")
print(f"Total tokens saved    : {stats['total_tokens_saved']}")
print(f"Avg tokens saved      : {stats['avg_tokens_saved_pct']}%")
print(f"Hallucinations caught : {stats['total_hallucinations_caught']}")
print(f"Avg HRS score         : {stats['avg_hallucination_score']}")
print(f"Avg latency           : {stats['avg_latency_ms']}ms")
print(f"Session duration      : {stats['session_duration_s']}s")
print(f"Context trims         : {stats['context_trims']}")

# ── 6. Engine-wide stats ──────────────────────────────────────────────────────
engine_stats = engine.get_stats()
print("\n--- Engine Stats ---")
print(f"Model          : {engine_stats['model']}")
print(f"Source         : {engine_stats['source']}")
print(f"Device         : {engine_stats['device']}")
print(f"Total sessions : {engine_stats['total_sessions']}")
print(f"Uptime         : {engine_stats['uptime_s']}s")
print(f"Latency stats  : {engine_stats['latency']}")

# ── 7. Export session ─────────────────────────────────────────────────────────
session.save("my_session.json")
session.export_markdown("chat_log.md")
session.export_csv("analytics.csv")

# ── 8. Load a saved session ───────────────────────────────────────────────────
restored = engine.load_session("my_session.json")
print(f"\nRestored session: {restored.name}")
print(f"Last response: {restored.last_response()[:100]}")

# ── 9. Clear caches ───────────────────────────────────────────────────────────
engine.clear_cache()
```

---

## Components Reference

### HallutokClient

The main entry point for Groq and Gemini API usage.

```python
from hallutok import HallutokClient

client = HallutokClient(
    provider=provider,          # a GroqProvider or GeminiProvider instance
    optimize_tokens=True,       # compress prompts before sending
    validate_responses=True,    # score responses for hallucination
    max_prompt_tokens=512,      # hard cap on prompt size (None = no cap)
    temperature=0.5,
    max_response_tokens=1024,
    system_prompt=None,
    cache_enabled=True,
)
```

| Method | Description |
|---|---|
| `chat(prompt, ...)` | Send a prompt through the full pipeline |
| `estimate_cost_tokens(prompt)` | Preview token savings before sending |
| `clear_cache()` | Flush the optimizer prompt cache |
| `HallutokClient.with_groq(api_key, model, **kwargs)` | Factory for Groq |
| `HallutokClient.with_gemini(api_key, model, **kwargs)` | Factory for Gemini |

---

### TokenOptimizer

Compresses prompts before they are sent to any model.

```python
from hallutok.optimizer import TokenOptimizer

opt = TokenOptimizer(cache_enabled=True)

raw = """
Please note that I would like you to, in order to be helpful,
can you please explain, it is important to note that, machine learning
is a subset of AI. Machine learning is a subset of AI.
"""

compressed = opt.optimize(raw, max_tokens=100)
report = opt.savings_report(raw, compressed)
print(report)
# {'tokens_before': 54, 'tokens_after': 12, 'tokens_saved': 42, 'percent_saved': 77.8}
```

The optimizer applies these steps in order:

| Step | What it does |
|---|---|
| Whitespace normalization | Collapses spaces, trims blank lines |
| Boilerplate stripping | Removes "Please note that", "I would like you to", "It is important to note", etc. |
| Deduplication | Removes repeated sentences |
| Phrase compression | "in order to" -> "to", "due to the fact that" -> "because" |
| Truncation | Cuts to `max_tokens` at a sentence boundary |
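
The five steps above can be sketched with the standard library alone. This is an illustration of the technique, not Hallutok's implementation: the phrase lists are hypothetical, words stand in for tokens, and `optimize_sketch` is a made-up name.

```python
import re
from typing import Optional

# Hypothetical phrase lists; the library's real lists may differ.
BOILERPLATE = [
    r"please note that",
    r"i would like you to",
    r"it is important to note that",
    r"can you please",
]
COMPRESSIONS = {
    r"\bin order to\b": "to",
    r"\bdue to the fact that\b": "because",
}
SENTENCE_SPLIT = r"(?<=[.!?])\s+"


def optimize_sketch(text: str, max_words: Optional[int] = None) -> str:
    # 1. Whitespace normalization: collapse any run of whitespace to one space.
    text = re.sub(r"\s+", " ", text).strip()
    # 2. Boilerplate stripping (case-insensitive, with an optional trailing comma).
    for pattern in BOILERPLATE:
        text = re.sub(pattern + r",?\s*", "", text, flags=re.IGNORECASE)
    # 3. Deduplication: keep only the first occurrence of each sentence.
    seen, kept = set(), []
    for sentence in re.split(SENTENCE_SPLIT, text):
        key = sentence.lower().strip()
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence.strip())
    text = " ".join(kept)
    # 4. Phrase compression.
    for pattern, replacement in COMPRESSIONS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    # 5. Truncation at a sentence boundary (word count approximates tokens).
    if max_words is not None:
        out, used = [], 0
        for sentence in re.split(SENTENCE_SPLIT, text):
            n = len(sentence.split())
            if used + n > max_words:
                break
            out.append(sentence)
            used += n
        text = " ".join(out)
    return text


raw = "Please note that   I want to learn.  I want to learn. Explain it in order to help me."
print(optimize_sketch(raw))
# I want to learn. Explain it to help me.
```

Running deduplication before phrase compression follows the order in the table; a real tokenizer would replace the word count in step 5.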

---

### HallucinationValidator

Scores any text for hallucination risk using the Hallucination Risk Score (HRS), a composite of four mathematical sub-scores.

```python
from hallutok.antihallucination import HallucinationValidator

validator = HallucinationValidator()

response = "I think maybe studies show that eating chocolate probably cures cancer."
result = validator.validate(response)

print(result.confidence_score)         # 0.0–1.0, higher = more confident
print(result.risk_level)               # "LOW" | "MEDIUM" | "HIGH"
print(result.is_likely_hallucination)  # True / False
print(result.flags)                    # list of detected issues
print(result.warnings)                 # human-readable descriptions
print(result.suggestions)              # recommended actions
print(result.cleaned_response)         # response with disclaimer appended if flagged
print(result.math_scores)              # SCS, ECS, CDS, FGS, HRS breakdown
```

**HRS scoring breakdown:**

| Score | Name | What it measures |
|---|---|---|
| SCS | Semantic Confidence Score | Hedging language ("I think", "maybe", "probably") |
| ECS | Evidence Consistency Score | Ungrounded claims ("Studies show", "Research suggests") |
| CDS | Contradiction Detection Score | Internal contradictions ("always" + "never" in same text) |
| FGS | Factual Grounding Score | Numeric anomalies, implausible figures |
| HRS | Hallucination Risk Score | Composite of all four |

**Detection layers:**

| Layer | Examples caught |
|---|---|
| Hedging | "I think", "maybe", "perhaps", "I'm not sure", "I believe" |
| Ungrounded claims | "Studies show", "Research suggests", "Experts say" |
| Numeric anomalies | Percentages over 100%, implausible statistics |
| Contradictions | Contradictory absolute terms in the same response |
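
To make the detection layers concrete, here is a minimal, self-contained sketch of pattern-based scoring. The formulas and weights behind SCS, ECS, CDS, FGS, and HRS are not documented in this README, so everything here is an assumption: the phrase lists, the per-layer caps, the equal weighting, and the name `risk_sketch`.

```python
import re

# Assumed phrase lists, taken from the examples in the tables above.
HEDGES = ["i think", "maybe", "perhaps", "probably", "i'm not sure", "i believe"]
UNGROUNDED = ["studies show", "research suggests", "experts say"]


def risk_sketch(text: str) -> dict:
    lowered = text.lower()
    hedge_hits = [p for p in HEDGES if p in lowered]
    ungrounded_hits = [p for p in UNGROUNDED if p in lowered]
    # Numeric anomaly: any percentage above 100.
    percents = [float(m) for m in re.findall(r"(\d+(?:\.\d+)?)\s*%", lowered)]
    numeric_hit = any(p > 100 for p in percents)
    # Contradiction: "always" and "never" both appear in the same text.
    words = set(re.findall(r"[a-z']+", lowered))
    contradiction_hit = {"always", "never"} <= words

    # Each layer yields a value in [0, 1]; caps and equal weights are assumptions.
    layers = {
        "hedging": min(len(hedge_hits) / 3, 1.0),
        "ungrounded": min(len(ungrounded_hits) / 2, 1.0),
        "numeric": 1.0 if numeric_hit else 0.0,
        "contradiction": 1.0 if contradiction_hit else 0.0,
    }
    composite = sum(layers.values()) / len(layers)
    return {"layers": layers, "composite": round(composite, 3)}


print(risk_sketch("I think maybe studies show that eating chocolate probably cures cancer."))
# {'layers': {'hedging': 1.0, 'ungrounded': 0.5, 'numeric': 0.0, 'contradiction': 0.0}, 'composite': 0.375}
```

Substring matching of this kind is cheap but blunt: it flags phrasing, not factual errors, so a confidently worded falsehood would pass.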

---

### HallutokEngine

The runtime engine for local model inference with the full Hallutok pipeline.

```python
from hallutok.runtime import HallutokEngine

# Factory method signatures; the arguments are placeholders, not runnable values
engine = HallutokEngine.from_ollama(model, host, **kwargs)
engine = HallutokEngine.from_huggingface(model_id, device, quantize, token, **kwargs)
engine = HallutokEngine.from_local(path, device, **kwargs)
```

**Constructor parameters:**

| Parameter | Type | Default | Description |
|---|---|---|---|
| `max_tokens` | int | 4096 | Context window token budget |
| `trim_strategy` | str | "sliding" | Context overflow strategy |
| `kv_cache` | bool | True | Cache identical prompt responses |
| `warm_up` | bool | True | Pre-warm model on load |
| `stream` | bool | False | Enable streaming responses |
| `system_prompt` | str | None | Default system instruction |

**Methods:**

| Method | Description |
|---|---|
| `create_session(name, system_prompt, max_tokens, trim_strategy)` | Create a new chat session |
| `load_session(path, max_tokens, trim_strategy)` | Restore session from JSON |
| `get_stats()` | Engine-wide performance stats |
| `clear_cache()` | Flush KV and optimizer caches |

---

### ContextWindowManager

Manages the token budget for a conversation and automatically trims messages when the budget is exceeded.

```python
from hallutok.runtime.context_manager import ContextWindowManager

ctx = ContextWindowManager(
    max_tokens=4096,
    trim_strategy="sliding",
    reserve_tokens=512,
)

ctx.add_message("system", "You are a helpful assistant.", flagged=True)
ctx.add_message("user", "What are black holes?")
ctx.add_message("assistant", "Black holes are regions of extremely strong gravity.")

print(ctx.stats())
# {
#   'messages': 3,
#   'total_tokens': 28,
#   'available_tokens': 3556,
#   'budget': 4096,
#   'usage_percent': 0.7,
#   'trim_count': 0,
#   'strategy': 'sliding'
# }
```

**Trim strategies:**

| Strategy | Behavior |
|---|---|
| `sliding` | Keep system messages and the last N conversation turns |
| `drop_oldest` | Remove oldest non-system, non-flagged messages first |
| `summarize` | Compress older messages into an extractive summary note |
| `priority` | Keep system messages, flagged turns, and the last 6 messages |

Messages added with `flagged=True` are never removed by any trim strategy.
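
As an illustration of how a `drop_oldest`-style trim can respect protected messages, here is a minimal sketch. It is not the library's code: `Message`, `trim_drop_oldest`, and the whitespace word count used in place of a tokenizer are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str
    flagged: bool = False

    @property
    def tokens(self) -> int:
        # Whitespace word count as a rough stand-in for a real tokenizer.
        return len(self.content.split())


def trim_drop_oldest(messages: list, budget: int) -> list:
    """Drop the oldest unprotected messages until the total fits the budget."""
    kept = list(messages)
    total = sum(m.tokens for m in kept)
    i = 0
    while total > budget and i < len(kept):
        m = kept[i]
        if m.role == "system" or m.flagged:
            i += 1  # protected: leave it in place and look at the next one
            continue
        total -= m.tokens
        del kept[i]  # the next candidate slides into index i
    return kept


history = [
    Message("system", "You are helpful.", flagged=True),
    Message("user", "one two three four"),
    Message("assistant", "five six seven", flagged=True),
    Message("user", "eight nine"),
]
print([m.role for m in trim_drop_oldest(history, budget=8)])
# ['system', 'assistant', 'user']
```

If the protected messages alone exceed the budget, this sketch returns them anyway, so flagging many turns can leave the context over budget.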

---

### SessionManager

Tracks conversation history, computes per-session analytics, and handles persistence and export.

```python
from hallutok.runtime.session_manager import SessionManager
from hallutok.runtime.context_manager import ContextWindowManager

ctx = ContextWindowManager(max_tokens=4096)
session = SessionManager(name="my-session", context_manager=ctx)
```

**Methods:**

| Method | Description |
|---|---|
| `record_turn(prompt, optimized_prompt, response, token_report, validation_result, latency_ms)` | Record a completed turn |
| `get_stats()` | Return aggregated session analytics |
| `save(path)` | Save session to JSON |
| `SessionManager.load(path, context_manager)` | Load session from JSON |
| `export_markdown(path)` | Export readable chat log as Markdown |
| `export_csv(path)` | Export per-turn analytics as CSV |
| `last_response()` | Return the most recent assistant response |
| `clear()` | Clear history and context |

**SessionStats fields:**

```python
stats = session.get_stats()

stats.session_name
stats.total_turns
stats.total_tokens_before
stats.total_tokens_after
stats.total_tokens_saved
stats.avg_tokens_saved_pct
stats.total_hallucinations_caught
stats.avg_hallucination_score
stats.avg_latency_ms
stats.session_duration_s
stats.context_trims
```
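
For intuition about where these aggregates come from, the sketch below folds a list of per-turn records into a few of the fields above. The per-turn record shape and the helper name `aggregate_turns` are hypothetical, and computing `avg_tokens_saved_pct` over the whole session (not as a mean of per-turn percentages) is an assumption.

```python
from statistics import mean


def aggregate_turns(turns: list) -> dict:
    """Aggregate hypothetical per-turn dicts into session-level numbers."""
    before = sum(t["tokens_before"] for t in turns)
    after = sum(t["tokens_after"] for t in turns)
    saved = before - after
    return {
        "total_turns": len(turns),
        "total_tokens_before": before,
        "total_tokens_after": after,
        "total_tokens_saved": saved,
        # Assumption: one percentage over the whole session.
        "avg_tokens_saved_pct": round(100 * saved / before, 1) if before else 0.0,
        "total_hallucinations_caught": sum(1 for t in turns if t["is_hallucination"]),
        "avg_hallucination_score": round(mean(t["hallucination_score"] for t in turns), 3) if turns else 0.0,
        "avg_latency_ms": round(mean(t["latency_ms"] for t in turns), 1) if turns else 0.0,
    }


turns = [
    {"tokens_before": 34, "tokens_after": 13, "is_hallucination": False,
     "hallucination_score": 0.25, "latency_ms": 100.0},
    {"tokens_before": 28, "tokens_after": 11, "is_hallucination": True,
     "hallucination_score": 0.75, "latency_ms": 200.0},
]
print(aggregate_turns(turns)["avg_tokens_saved_pct"])
# 61.3
```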

---

### LatencyOptimizer

Manages KV caching, warm-up, and latency tracking for the runtime engine.

```python
from hallutok.runtime.latency_optimizer import LatencyOptimizer

lat = LatencyOptimizer(
    kv_cache_enabled=True,
    kv_cache_size=64,
    stream=False,
    warm_up=True,
)

# Cache operations
lat.store_cache("What is AI?", "AI is artificial intelligence.")
cached = lat.get_cached("What is AI?")  # returns response or None

# Latency stats
print(lat.latency_stats())
# {
#   'calls': 12,
#   'avg_ms': 134.2,
#   'min_ms': 98.1,
#   'max_ms': 312.4,
#   'p95_ms': 280.0,
#   'cache_hits': 3,
#   'stream_mode': False
# }
```
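
The two ideas here, a bounded cache keyed by the prompt and a P95 latency figure, can be sketched in a few lines. This is a generic illustration under stated assumptions, not the `LatencyOptimizer` internals: least-recently-used eviction, the nearest-rank percentile method, and the names `PromptCache` and `p95` are all choices made for the example.

```python
from collections import OrderedDict


class PromptCache:
    """Tiny LRU cache mapping an exact prompt string to a stored response."""

    def __init__(self, max_size: int = 64):
        self.max_size = max_size
        self._data = OrderedDict()
        self.hits = 0

    def get(self, prompt: str):
        if prompt not in self._data:
            return None
        self._data.move_to_end(prompt)  # mark as most recently used
        self.hits += 1
        return self._data[prompt]

    def store(self, prompt: str, response: str) -> None:
        self._data[prompt] = response
        self._data.move_to_end(prompt)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry


def p95(latencies_ms: list) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    if not latencies_ms:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


cache = PromptCache(max_size=2)
cache.store("What is AI?", "AI is artificial intelligence.")
print(cache.get("What is AI?"))
# AI is artificial intelligence.
print(p95([float(x) for x in range(1, 21)]))
# 19.0
```

Because the key is the exact prompt string, any change in wording or whitespace is a cache miss; normalizing the prompt before lookup would raise the hit rate.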

---

## Result Objects

### ChatResult (API providers)

Returned by `HallutokClient.chat()`.

| Field | Type | Description |
|---|---|---|
| `response` | str | Final model response (with disclaimer if flagged) |
| `original_prompt` | str | The prompt as you wrote it |
| `optimized_prompt` | str | The prompt after token optimization |
| `token_report` | dict | tokens_before, tokens_after, tokens_saved, percent_saved |
| `validation` | ValidationResult | Full hallucination validation result |
| `provider` | str | "groq" or "gemini" |
| `warnings` | list[str] | Aggregated warnings from optimizer and validator |

### EngineResult (Runtime Engine)

Returned by `session.chat()`.

| Field | Type | Description |
|---|---|---|
| `response` | str | Final model response |
| `original_prompt` | str | Raw input prompt |
| `optimized_prompt` | str | Prompt after optimization |
| `tokens_before` | int | Token count before optimization |
| `tokens_after` | int | Token count after optimization |
| `tokens_saved` | int | Tokens saved |
| `tokens_saved_pct` | float | Percentage saved |
| `hallucination_score` | float | HRS composite score (0.0–1.0) |
| `hallucination_risk` | str | "LOW", "MEDIUM", or "HIGH" |
| `is_hallucination` | bool | Whether response is flagged |
| `hallucination_flags` | list[str] | Detected issues |
| `math_scores` | dict | SCS, ECS, CDS, FGS, HRS sub-scores |
| `latency_ms` | float | End-to-end latency in milliseconds |
| `cache_hit` | bool | True if served from KV cache |
| `context_tokens_used` | int | Tokens currently in context window |
| `context_tokens_available` | int | Tokens remaining in budget |
| `suggestions` | list[str] | Recommendations if hallucination detected |

---

## Roadmap

- [x] Token optimization pipeline
- [x] Hallucination detection with mathematical HRS scoring
- [x] Groq and Gemini provider adapters
- [x] Runtime Engine with Ollama and HuggingFace support
- [x] Context Window Manager with four trim strategies
- [x] Session Manager with history, analytics, and export
- [x] Latency Optimizer with KV cache and P95 tracking
- [ ] Async support via `achat()`
- [ ] Streaming responses
- [ ] OpenAI and Together AI provider adapters
- [ ] Self-consistency hallucination verification
- [ ] Per-call token budget enforcement

---

## License

MIT License — see [LICENSE](LICENSE) for details.
