Metadata-Version: 2.4
Name: hallutok
Version: 0.1.1
Summary: Anti-Hallucination & Token Optimization library for Groq and Gemini APIs
Author-email: Joel Pawar <joelpawarwork@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/joelpawar08/hallutok
Project-URL: Issues, https://github.com/joelpawar08/hallutok/issues
Project-URL: Documentation, https://github.com/joelpawar08/hallutok#readme
Keywords: llm,hallucination,token-optimization,groq,gemini,ai
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Provides-Extra: groq
Requires-Dist: groq>=0.9.0; extra == "groq"
Provides-Extra: gemini
Requires-Dist: google-generativeai>=0.7.0; extra == "gemini"
Provides-Extra: all
Requires-Dist: groq>=0.9.0; extra == "all"
Requires-Dist: google-generativeai>=0.7.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"

# 🛡️ Hallutok

**Anti-Hallucination & Token Optimization for Groq, Gemini, Ollama, and HuggingFace**

[![PyPI version](https://badge.fury.io/py/hallutok.svg)](https://pypi.org/project/hallutok/)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue.svg)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

Hallutok targets four problems that burn API quota and undermine reliability:

| Problem | Hallutok's Solution |
|---|---|
| Long prompts burning through tokens | `TokenOptimizer` compresses prompts before sending |
| LLM making up facts / hedging | `HallucinationValidator` scores and flags sketchy responses |
| Running local/offline LLMs without guardrails | `HallutokEngine` wraps any local model with the full pipeline |
| Multi-turn context blowing past model limits | `ContextWindowManager` auto-trims with smart strategies |

---

## ✨ Features

- **Token Optimization** — whitespace cleanup, filler-phrase compression, deduplication, smart truncation, in-memory caching
- **Anti-Hallucination** — mathematical HRS scoring, detects hedging, ungrounded claims, numeric anomalies, contradictions
- **Groq + Gemini** — works with both APIs via thin, swappable provider adapters
- **Runtime Engine** — load any Ollama or HuggingFace model and get the full optimization pipeline locally
- **Context Window Manager** — smart token budget with 4 trim strategies (sliding, drop_oldest, summarize, priority)
- **Session Manager** — multi-turn history, save/load JSON, export Markdown + CSV
- **Latency Optimizer** — KV cache, warm-up pings, per-call latency tracking with P95
- **Zero hard dependencies** — core library is pure Python; providers and runtime are optional extras
- **Savings reporting** — see exactly how many tokens you saved per call

---

## 📦 Installation

```bash
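# With Groq support (quotes keep shells such as zsh from globbing the brackets)
pip install "hallutok[groq]"

# With Gemini support
pip install "hallutok[gemini]"

# Both
pip install "hallutok[all]"

# With local model support (Ollama / HuggingFace)
# The 0.1.1 package metadata declares only the groq, gemini, all, and dev extras,
# so check that your installed version actually provides `local`.
pip install "hallutok[local]"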
```
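
Because the provider SDKs are optional extras, it can help to check which ones are importable before constructing a client. The snippet below is a minimal sketch using only the standard library; `is_installed` and `OPTIONAL_SDKS` are hypothetical helper names and not part of Hallutok.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the module can be located (parent packages are imported)."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. `google`) is missing,
        # or when an already-imported module has no usable spec.
        return False


# Import names of the SDKs installed by the `groq` and `gemini` extras.
OPTIONAL_SDKS = {"groq": "groq", "gemini": "google.generativeai"}

available = {extra: is_installed(mod) for extra, mod in OPTIONAL_SDKS.items()}
print(available)
```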

---

## 🚀 Quick Start

### Using Groq

```python
from hallutok import HallutokClient

client = HallutokClient.with_groq(
    api_key="gsk_your_groq_key",
    model="llama3-8b-8192",
    temperature=0.3,
)

result = client.chat(
    "Please note that I would like you to explain in order to help me "
    "understand what black holes are and how they work."
)

print(result.response)
print(result.token_report)
# {'tokens_before': 48, 'tokens_after': 19, 'tokens_saved': 29, 'percent_saved': 60.4}

if result.validation.is_likely_hallucination:
    print("⚠️  Flags:", result.validation.flags)
```

### Using Gemini

```python
from hallutok import HallutokClient

client = HallutokClient.with_gemini(
    api_key="AIza_your_gemini_key",
    model="gemini-1.5-flash",
)

result = client.chat("Explain quantum entanglement to a 10-year-old.")
print(result.response)
```

### Using providers directly

```python
from hallutok import HallutokClient
from hallutok.providers import GroqProvider  # GeminiProvider is imported the same way

provider = GroqProvider(api_key="gsk_...", model="mixtral-8x7b-32768")

client = HallutokClient(
    provider=provider,
    optimize_tokens=True,
    validate_responses=True,
    max_prompt_tokens=512,
    temperature=0.4,
    max_response_tokens=1024,
    system_prompt="You are a factual assistant. Cite sources when possible.",
)

result = client.chat("What causes inflation?")
```

---

## 🖥️ Runtime Engine (v0.1.1 — New)

The **HallutokEngine** lets you run any local model (via Ollama or HuggingFace) with the full Hallutok pipeline — token optimization, hallucination detection, context management, session tracking, and latency optimization — all built in.

### Loading a model

```python
from hallutok.runtime import HallutokEngine

# From Ollama (requires Ollama running locally)
engine = HallutokEngine.from_ollama("llama3")

# From HuggingFace Hub
engine = HallutokEngine.from_huggingface(
    "mistralai/Mistral-7B-Instruct-v0.2",
    device="auto",      # auto-detects cuda / mps / cpu
    quantize=True,      # 4-bit quantization to save memory
)

# From a local model directory
engine = HallutokEngine.from_local("/path/to/my-model")
```

### Creating sessions and chatting

```python
session = engine.create_session(
    name="research-chat",
    system_prompt="You are a factual assistant. Always cite sources.",
    max_tokens=4096,
    trim_strategy="sliding",   # sliding | drop_oldest | summarize | priority
)

result = session.chat("Explain black holes")

print(result.response)
print(result.hallucination_score)   # HRS 0.0–1.0
print(result.hallucination_risk)    # LOW / MEDIUM / HIGH
print(result.tokens_saved)
print(result.latency_ms)
print(result.cache_hit)             # True if served from KV cache
```

### EngineResult fields

```python
result.response                 # final (possibly cleaned) response
result.original_prompt          # your raw input
result.optimized_prompt         # what was actually sent to the model
result.tokens_before            # tokens in original prompt
result.tokens_after             # tokens in optimized prompt
result.tokens_saved             # tokens saved
result.tokens_saved_pct         # percentage saved
result.hallucination_score      # HRS score 0.0–1.0
result.hallucination_risk       # "LOW" | "MEDIUM" | "HIGH"
result.is_hallucination         # bool
result.hallucination_flags      # list of detected issues
result.math_scores              # {"SCS": ..., "ECS": ..., "CDS": ..., "FGS": ..., "HRS": ...}
result.latency_ms               # end-to-end latency in milliseconds
result.cache_hit                # True if response came from KV cache
result.context_tokens_used      # tokens currently in context window
result.context_tokens_available # tokens remaining in budget
result.suggestions              # recommendations if hallucination detected
```

### Multi-turn conversations

```python
session = engine.create_session("science-qa")

r1 = session.chat("What are black holes?")
r2 = session.chat("How does Hawking radiation work?")
r3 = session.chat("What is the event horizon?", flag_turn=True)  # never trimmed
```

### Session persistence and export

```python
# Save and reload a session
session.save("my_session.json")
restored = engine.load_session("my_session.json")

# Export in multiple formats
session.export_markdown("chat_log.md")   # human-readable chat log
session.export_csv("analytics.csv")      # per-turn analytics
```

### Session analytics

```python
stats = session.get_stats()

print(stats["total_turns"])
print(stats["total_tokens_saved"])
print(stats["avg_tokens_saved_pct"])   # e.g. 42.3
print(stats["total_hallucinations_caught"])
print(stats["avg_hallucination_score"])
print(stats["avg_latency_ms"])
print(stats["session_duration_s"])
print(stats["context_trims"])          # how many times context was auto-trimmed
```

### Engine-wide stats

```python
print(engine.get_stats())
# {
#   "model": "llama3",
#   "source": "ollama",
#   "device": "cpu",
#   "quantized": True,
#   "memory_mb": 4000.0,
#   "total_sessions": 3,
#   "uptime_s": 142.7,
#   "latency": {"calls": 12, "avg_ms": 134.2, "min_ms": 98.1, "max_ms": 312.4, "p95_ms": 280.0},
#   "context_budget": 4096,
#   "trim_strategy": "sliding"
# }

engine.clear_cache()  # flush KV cache and optimizer cache
```

### Advanced engine configuration

```python
engine = HallutokEngine.from_ollama(
    model="mistral",
    max_tokens=8192,             # context window budget
    trim_strategy="priority",    # keep system + flagged + last 6 turns
    kv_cache=True,               # cache identical prompts
    warm_up=True,                # pre-warm model to reduce first-call latency
    stream=False,
    system_prompt="You are a concise, factual assistant.",
)
```

---

## 🔧 Components

### TokenOptimizer

```python
from hallutok.optimizer import TokenOptimizer

opt = TokenOptimizer()

raw = """
Please note that I would like you to, in order to be helpful,
can you please explain, it is important to note that, machine learning
is a subset of AI. Machine learning is a subset of AI.
"""

compressed = opt.optimize(raw, max_tokens=100)
report = opt.savings_report(raw, compressed)
# {'tokens_before': 54, 'tokens_after': 12, 'tokens_saved': 42, 'percent_saved': 77.8}
```

What the optimizer does, in order:
1. Normalize whitespace (collapse spaces, trim blank lines)
2. Strip boilerplate ("Please note that", "I would like you to", etc.)
3. Deduplicate repeated sentences
4. Replace verbose phrases ("in order to" → "to", "due to the fact that" → "because", …)
5. Truncate to `max_tokens` at a sentence boundary
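
To make the five steps concrete, here is a minimal standalone sketch of the same sequence. It is not Hallutok's implementation: the phrase tables are hypothetical, and tokens are approximated as whitespace-separated words, which will not match a real tokenizer.

```python
import re
from typing import Optional

# Hypothetical phrase tables; the library's real lists are not shown in this README.
BOILERPLATE = ("please note that", "i would like you to", "it is important to note that")
VERBOSE = {"in order to": "to", "due to the fact that": "because"}
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def count_tokens(text: str) -> int:
    """Crude estimate (assumption): one token per whitespace-separated word."""
    return len(text.split())


def optimize(text: str, max_tokens: Optional[int] = None) -> str:
    # 1. Normalize whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # 2. Strip boilerplate phrases (case-insensitive), plus a trailing comma if any.
    for phrase in BOILERPLATE:
        text = re.sub(re.escape(phrase) + r",?\s*", "", text, flags=re.IGNORECASE)
    # 3. Deduplicate repeated sentences, keeping the first occurrence.
    seen, kept = set(), []
    for sentence in SENTENCE_SPLIT.split(text):
        key = sentence.strip().lower()
        if key and key not in seen:
            seen.add(key)
            kept.append(sentence.strip())
    text = " ".join(kept)
    # 4. Replace verbose phrases with shorter equivalents.
    for long_form, short_form in VERBOSE.items():
        text = re.sub(re.escape(long_form), short_form, text, flags=re.IGNORECASE)
    # 5. Truncate at a sentence boundary: keep whole sentences while they fit.
    if max_tokens is not None:
        out, used = [], 0
        for sentence in SENTENCE_SPLIT.split(text):
            n = count_tokens(sentence)
            if used + n > max_tokens:
                break
            out.append(sentence)
            used += n
        text = " ".join(out)
    return text


def savings_report(before: str, after: str) -> dict:
    b, a = count_tokens(before), count_tokens(after)
    pct = round(100 * (b - a) / b, 1) if b else 0.0
    return {"tokens_before": b, "tokens_after": a,
            "tokens_saved": b - a, "percent_saved": pct}


raw = "Please note that   I need this in order to learn. Cats purr. Cats purr."
compressed = optimize(raw)
print(compressed)  # I need this to learn. Cats purr.
print(savings_report(raw, compressed))
# {'tokens_before': 14, 'tokens_after': 7, 'tokens_saved': 7, 'percent_saved': 50.0}
```

One consequence of truncating at sentence boundaries in this sketch: if the first sentence alone exceeds `max_tokens`, the result is an empty string, so a production version would need a fallback such as word-level truncation.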

### HallucinationValidator

```python
from hallutok.antihallucination import HallucinationValidator

validator = HallucinationValidator()

response = "I think maybe studies show that eating chocolate probably cures cancer."
result = validator.validate(response)

print(result.confidence_score)          # e.g. 0.72
print(result.is_likely_hallucination)   # True / False
print(result.risk_level)                # "LOW" | "MEDIUM" | "HIGH"
print(result.flags)                     # list of issues found
print(result.warnings)                  # human-readable descriptions
print(result.suggestions)               # what to do about it
print(result.cleaned_response)          # response + disclaimer if flagged
print(result.math_scores)               # SCS, ECS, CDS, FGS, HRS breakdown
```

**Detection layers:**

| Layer | What it catches |
|---|---|
| Hedging | "I think", "maybe", "perhaps", "I'm not sure", etc. |
| Ungrounded claims | "Studies show…", "Research suggests…" without citations |
| Numeric anomalies | Percentages over 100%, other implausible numbers |
| Contradictions | "always" + "never", "increases" + "decreases" in same text |
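
The four layers can be approximated with simple pattern checks. The sketch below is illustrative only: the phrase lists, the citation pattern, and the flag names are assumptions, and it does not reproduce Hallutok's scoring (the SCS/ECS/CDS/FGS/HRS values are not modeled here).

```python
import re

# Hypothetical phrase lists, loosely based on the examples in the table above.
HEDGES = ("i think", "maybe", "perhaps", "probably", "i'm not sure")
UNGROUNDED = ("studies show", "research suggests")
OPPOSITES = (("always", "never"), ("increases", "decreases"))
# Assumed citation markers: [1], (2020), or a URL.
CITATION = re.compile(r"\[\d+\]|\(\d{4}\)|https?://")


def detect_flags(text: str) -> list:
    lower = text.lower()
    flags = []
    # Hedging: any hedge phrase appearing as whole words.
    if any(re.search(r"\b" + re.escape(h) + r"\b", lower) for h in HEDGES):
        flags.append("hedging")
    # Ungrounded claims: an appeal to research with no citation marker anywhere.
    if any(u in lower for u in UNGROUNDED) and not CITATION.search(text):
        flags.append("ungrounded_claim")
    # Numeric anomalies: any percentage above 100.
    if any(float(p) > 100 for p in re.findall(r"(\d+(?:\.\d+)?)\s*%", text)):
        flags.append("numeric_anomaly")
    # Contradictions: both words of an opposing pair occur in the same text.
    if any(re.search(rf"\b{a}\b", lower) and re.search(rf"\b{b}\b", lower)
           for a, b in OPPOSITES):
        flags.append("contradiction")
    return flags


print(detect_flags("I think maybe studies show that eating chocolate probably cures cancer."))
# ['hedging', 'ungrounded_claim']
```

Checks like these are deliberately crude. A percentage above 100 can be legitimate (growth rates, for example), and "always" and "never" can coexist without contradiction, so flags are best treated as prompts for review rather than verdicts.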

### ContextWindowManager

```python
from hallutok.runtime.context_manager import ContextWindowManager

ctx = ContextWindowManager(
    max_tokens=4096,
    trim_strategy="sliding",   # sliding | drop_oldest | summarize | priority
    reserve_tokens=512,        # tokens reserved for the response
)

ctx.add_message("system", "You are a helpful assistant.", flagged=True)
ctx.add_message("user", "What are black holes?")
ctx.add_message("assistant", "Black holes are regions of extremely strong gravity.")

print(ctx.stats())
# {'messages': 3, 'total_tokens': 28, 'available_tokens': 3556, 'budget': 4096,
#  'usage_percent': 0.7, 'trim_count': 0, 'strategy': 'sliding'}
```
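
The sample output above is consistent with a simple budget calculation: available tokens are the budget minus the response reserve minus the tokens already used, and usage is reported against the full budget. The function below reproduces those two numbers; the formula is inferred from the example values, not taken from the library source, and `budget_stats` is a hypothetical name.

```python
def budget_stats(max_tokens: int, reserve_tokens: int, total_tokens: int) -> dict:
    """Recompute the budget figures from the stats example (inferred formula)."""
    available = max(0, max_tokens - reserve_tokens - total_tokens)
    usage_percent = round(100 * total_tokens / max_tokens, 1)
    return {"available_tokens": available, "usage_percent": usage_percent}


print(budget_stats(max_tokens=4096, reserve_tokens=512, total_tokens=28))
# {'available_tokens': 3556, 'usage_percent': 0.7}
```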

**Trim strategies:**

| Strategy | Behavior |
|---|---|
| `sliding` | Keep system messages + last N conversation turns |
| `drop_oldest` | Remove oldest non-system, non-flagged messages first |
| `summarize` | Compress older messages into an extractive summary |
| `priority` | Keep system + flagged turns + last 6 messages |
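
As a rough illustration of two of these strategies, the sketch below implements `drop_oldest` and `priority` over a plain list of messages. The `Message` class, the function names, and the word-based token count are all assumptions made for the example; the library's internals may differ.

```python
from dataclasses import dataclass


@dataclass
class Message:
    role: str
    content: str
    flagged: bool = False


def total_tokens(messages) -> int:
    # Crude estimate (assumption): one token per whitespace-separated word.
    return sum(len(m.content.split()) for m in messages)


def trim_drop_oldest(messages, budget: int):
    """Drop the oldest non-system, non-flagged messages until the budget fits."""
    kept = list(messages)
    while total_tokens(kept) > budget:
        idx = next((i for i, m in enumerate(kept)
                    if m.role != "system" and not m.flagged), None)
        if idx is None:
            break  # only protected messages remain; nothing more can be dropped
        del kept[idx]
    return kept


def trim_priority(messages, keep_last: int = 6):
    """Keep system messages, flagged messages, and the last `keep_last` messages."""
    cutoff = max(0, len(messages) - keep_last)
    return [m for i, m in enumerate(messages)
            if m.role == "system" or m.flagged or i >= cutoff]


history = [
    Message("system", "be brief"),
    Message("user", "one two three"),
    Message("assistant", "four five"),
    Message("user", "six seven eight nine", flagged=True),
    Message("assistant", "ten"),
]
print([m.content for m in trim_drop_oldest(history, budget=8)])
# ['be brief', 'six seven eight nine', 'ten']
print([m.content for m in trim_priority(history, keep_last=1)])
# ['be brief', 'six seven eight nine', 'ten']
```

Both functions preserve message order and never remove system or flagged messages, which mirrors the role that `flagged=True` and `flag_turn=True` play in the examples above.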

### LatencyOptimizer

```python
from hallutok.runtime.latency_optimizer import LatencyOptimizer

lat = LatencyOptimizer(kv_cache_enabled=True, kv_cache_size=64)

# KV cache
lat.store_cache("What is AI?", "AI is artificial intelligence.")
cached = lat.get_cached("What is AI?")   # returns cached response

# Latency stats
print(lat.latency_stats())
# {'calls': 12, 'avg_ms': 134.2, 'min_ms': 98.1, 'max_ms': 312.4,
#  'p95_ms': 280.0, 'cache_hits': 3, 'stream_mode': False}
```
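
For intuition, the two ideas here (a bounded prompt-to-response cache and a P95 latency figure) can be sketched with the standard library alone. Both the LRU eviction policy and the nearest-rank percentile method are assumptions for illustration; the README does not say which approaches `LatencyOptimizer` actually uses.

```python
from collections import OrderedDict


class PromptCache:
    """Bounded cache mapping a prompt to its response, evicting least recently used."""

    def __init__(self, max_size: int = 64):
        self.max_size = max_size
        self._data = OrderedDict()

    def get(self, prompt: str):
        if prompt not in self._data:
            return None
        self._data.move_to_end(prompt)  # mark as most recently used
        return self._data[prompt]

    def put(self, prompt: str, response: str) -> None:
        self._data[prompt] = response
        self._data.move_to_end(prompt)
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)  # evict the least recently used entry


def p95(samples_ms):
    """Nearest-rank 95th percentile: the value at rank ceil(0.95 * n)."""
    if not samples_ms:
        raise ValueError("p95 requires at least one sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


cache = PromptCache(max_size=2)
cache.put("What is AI?", "AI is artificial intelligence.")
cache.put("What is ML?", "ML is machine learning.")
cache.get("What is AI?")              # refreshes this entry
cache.put("What is DL?", "DL is deep learning.")
print(cache.get("What is ML?"))       # None (evicted as least recently used)
print(p95(range(1, 101)))             # 95
```

With few samples the nearest-rank method simply returns the maximum (for 12 calls, rank 12), so P95 only becomes informative once enough calls have been recorded.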

---

## 💡 Tips to Maximize Token Savings

1. **Avoid filler openers** — "Can you please", "I would like you to", "It is important that"
2. **Don't repeat yourself** — Hallutok deduplicates, but it's faster to not duplicate at all
3. **Use `max_prompt_tokens`** — set a hard cap so you never accidentally send a 4k-token prompt
4. **Lower the temperature** — a low value such as `temperature=0.3` makes output more deterministic, which tends to reduce hallucination risk
5. **Use a system prompt** — instruct the model to cite sources and avoid speculation
6. **Check `token_report` per call** — it tells you exactly what was saved
7. **Use `flag_turn=True`** for important turns you never want trimmed from context

---

## 📊 ChatResult Fields (API Providers)

```python
result.response           # final (possibly cleaned) text
result.original_prompt    # your original input
result.optimized_prompt   # what was actually sent to the API
result.token_report       # {tokens_before, tokens_after, tokens_saved, percent_saved}
result.validation         # ValidationResult object
result.provider           # "groq" or "gemini"
result.warnings           # list of human-readable warnings
```

---

## 🗺️ Roadmap

- [x] Token optimization pipeline
- [x] Hallucination detection with HRS scoring
- [x] Groq + Gemini provider adapters
- [x] Runtime Engine (Ollama + HuggingFace support)
- [x] Context Window Manager with smart trimming
- [x] Session Manager with history and export
- [x] Latency Optimizer with KV cache
- [ ] Async support (`achat()`)
- [ ] Streaming responses
- [ ] OpenAI / Together AI provider adapters
- [ ] Per-call token budget enforcement
- [ ] Self-consistency hallucination verification

---

## 📄 License

MIT License — see [LICENSE](LICENSE) for details.
