Metadata-Version: 2.4
Name: llm-guard-kit
Version: 0.1.1
Summary: Predict, diagnose, and repair LLM failures automatically. AUROC 0.966–0.993.
Author: Avighan Majumder
License: MIT
Project-URL: Repository, https://github.com/avighan/qppg
Project-URL: Issues, https://github.com/avighan/qppg/issues
Keywords: llm,reliability,failure-prediction,anomaly-detection,knn,claude,openai
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.20
Requires-Dist: scikit-learn>=0.24
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: anthropic>=0.7.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: matplotlib>=3.4; extra == "dev"

# llm-guard

**Predict, diagnose, and repair LLM failures automatically.**

[![PyPI](https://img.shields.io/pypi/v/llm-guard-kit.svg)](https://pypi.org/project/llm-guard-kit/)
[![Python](https://img.shields.io/pypi/pyversions/llm-guard-kit.svg)](https://pypi.org/project/llm-guard-kit/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

---

## What it does

`llm-guard` wraps any LLM call with a three-stage reliability layer:

1. **Predict** — scores every query for failure risk in <15ms before the LLM responds
2. **Diagnose** — clusters accumulated failures into a labeled error taxonomy
3. **Heal** — synthesises targeted repair instructions from failure patterns and applies them automatically to future queries

**Validated results** (Claude Haiku, internal benchmarks):

| Benchmark | Task type  | AUROC | Precision@10 |
|-----------|------------|-------|--------------|
| MATH-500  | Math       | 0.966 | 100%         |
| HumanEval | Code       | 0.993 | 100%         |
| TriviaQA  | Factual QA | 0.992 | 100%         |

Cost: <$0.25 to validate on 664 benchmark problems.

---

## Install

```bash
pip install llm-guard-kit
```

Requires Python 3.9+ and an Anthropic API key.

---

## Quick start — three calibration paths

### Path A: You have labeled correct examples

```python
from llm_guard import LLMGuard

guard = LLMGuard(api_key="sk-ant-...")

# Fit on questions your LLM is known to handle correctly
guard.fit(correct_questions=[
    "What is the capital of France?",
    "What is 12 * 15?",
    # ... 50+ examples recommended
])

result = guard.query("What is 15% of 240?")
print(result.answer)      # "36"
print(result.confidence)  # "high" | "medium" | "low"
print(result.risk_score)  # 0.12  (lower = more familiar = lower failure risk)
```

### Path B: No labels — use self-consistency

```python
guard = LLMGuard(api_key="sk-ant-...")

# Runs each question 5 times; those with 80%+ agreement are "probably correct"
guard.fit_from_consistency(
    questions=my_question_pool,  # 100–500 questions
    n_samples=5,
    agreement_threshold=0.8,
)

result = guard.query("Explain the water cycle.")
print(result.confidence)  # "high"
```
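
Under the hood, "agreement" here is the modal answer's share of the samples. A minimal sketch of that computation (illustrative only, not the library's internals; the normalisation step is an assumption):

```python
from collections import Counter

def agreement(answers: list[str]) -> float:
    """Fraction of samples matching the most common (normalised) answer."""
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

# With n_samples=5: ["Paris", "Paris", "paris", "Lyon", "Paris"] -> 0.8,
# so this question would clear agreement_threshold=0.8.
```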

### Path C: Automated verifier (code, math, SQL, schema)

```python
def python_verifier(question, response):
    """Return True if the code response executes without raising."""
    try:
        exec(compile(response, "<llm>", "exec"), {})
        return True
    except Exception:
        return False

guard = LLMGuard(api_key="sk-ant-...")
guard.fit_from_execution(
    questions=coding_questions,
    verifier_fn=python_verifier,
)

result = guard.query("Write a function that reverses a string.")
print(result.answer)
```

---

## Error Autopsy

Cluster accumulated failures into a labeled taxonomy (read-only, does not modify guard state):

```python
clusters = guard.diagnose(
    failed_questions=failed_qs,
    model_answers=model_answers,
    correct_answers=correct_answers,   # optional but enables suggested_fix
)

for c in clusters:
    print(f"Cluster {c['cluster_id']} ({c['size']} failures): {c['label']}")
    print(f"  Fix: {c.get('suggested_fix', 'n/a')}")
```

Example output:
```
Cluster 0 (12 failures): The model misreads multi-step word problems,
  computing intermediate values correctly but applying them to the wrong sub-question.
  Fix: Explicitly label each sub-goal before computing.
Cluster 1 (8 failures): Off-by-one errors in loop boundary conditions.
  Fix: Always verify that loop indices match the stated range inclusivity.
```
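
Conceptually, the taxonomy comes from clustering the failures in embedding space and asking the LLM to label each cluster. A minimal sketch of that idea (illustrative assumptions: KMeans on sentence-transformer embeddings; not the library's internals):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_failures(failed_questions, n_clusters=3):
    """Group failed questions by embedding similarity."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    X = embedder.encode(failed_questions)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    clusters = {}
    for question, cluster_id in zip(failed_questions, labels):
        clusters.setdefault(int(cluster_id), []).append(question)
    # Each cluster's members would then be summarised by the LLM into the
    # natural-language label and suggested_fix shown above.
    return clusters
```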

---

## Prompt Healer

Learn from failures and auto-apply targeted repairs to future queries that fall in the same error cluster:

```python
guard.learn_from_errors(
    failed_questions=failed_qs,
    model_answers=model_answers,
    correct_answers=correct_answers,
)

# Future queries near a known failure cluster get the repair instruction injected automatically
result = guard.query("If a train travels 60 mph for 2.5 hours, how far does it go?")
print(result.tool_used)   # "error_fix_0"  ← repair tool was applied
print(result.confidence)  # "medium"
```
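
"Near a known failure cluster" can be read as a distance test against cluster centroids. An illustrative sketch of that injection step (hypothetical names and threshold; not the library's internals):

```python
import numpy as np

def inject_repair(query_emb, centroids, repair_instructions, prompt,
                  max_dist=0.35):
    """Prepend the nearest cluster's repair instruction if close enough."""
    dists = np.linalg.norm(centroids - query_emb, axis=1)
    nearest = int(dists.argmin())
    if dists[nearest] < max_dist:
        # Matched a known failure cluster: apply its repair tool
        return f"{repair_instructions[nearest]}\n\n{prompt}", nearest
    return prompt, None  # no cluster match; send the prompt unchanged
```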

---

## GuardResult fields

| Field          | Type        | Description                                     |
|----------------|-------------|-------------------------------------------------|
| `answer`       | str         | LLM response text                               |
| `risk_score`   | float       | Mean KNN distance; higher = more likely to fail |
| `confidence`   | str         | `"high"` / `"medium"` / `"low"`                 |
| `tool_used`    | str \| None | Repair tool ID if applied                       |
| `cluster_id`   | int \| None | Error cluster ID if matched                     |
| `was_retried`  | bool        | True if a resource-failure retry fired          |
| `raw_response` | str         | Full LLM response (same as `answer` currently)  |
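
A typical consumer gates on these fields before trusting the answer. For example (hypothetical policy; assumes `guard` was fitted as in the quick start):

```python
result = guard.query("Integrate x^2 from 0 to 3.")

if result.confidence == "low":
    # Don't trust low-confidence answers; route elsewhere instead
    print(f"High failure risk ({result.risk_score:.2f}); escalating to review")
elif result.was_retried:
    print("Answer recovered after a max_tokens retry:", result.answer)
else:
    print(result.answer)
```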

---

## Constructor parameters

```python
guard = LLMGuard(
    api_key="sk-ant-...",           # Anthropic key (or set ANTHROPIC_API_KEY)
    model="claude-haiku-4-5-20251001",  # any Claude model
    embedding_model="all-MiniLM-L6-v2", # sentence-transformers model
    n_neighbors=5,                  # k for KNN scoring
)
```

---

## How it works

The failure predictor uses **KNN anomaly scoring** on sentence-transformer embeddings:

1. During calibration, embed all known-correct questions → build a KNN index
2. At query time, embed the new question → compute mean distance to k nearest correct examples
3. High distance = unfamiliar territory = high failure risk (AUROC 0.966–0.993)

Risk thresholds are auto-calibrated from the training distribution (75th and 95th percentiles), so they transfer across domains without manual tuning.
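
The scoring loop is small enough to sketch directly. The following is a minimal standalone version of the steps above (illustrative only; `calibrate` and `score_risk` are hypothetical names, not the library's API):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def calibrate(correct_questions, k=5):
    """Build the KNN index and derive percentile risk thresholds."""
    X = embedder.encode(correct_questions)
    knn = NearestNeighbors(n_neighbors=k).fit(X)
    # Score the training set against itself; ask for k+1 neighbours and
    # drop the first column, since each point's nearest neighbour is itself.
    dists, _ = knn.kneighbors(X, n_neighbors=k + 1)
    train_scores = dists[:, 1:].mean(axis=1)
    medium_t, high_t = np.percentile(train_scores, [75, 95])
    return knn, medium_t, high_t

def score_risk(question, knn, medium_t, high_t):
    """Mean distance to the k nearest known-correct questions."""
    dists, _ = knn.kneighbors(embedder.encode([question]))
    score = float(dists.mean())
    level = "low" if score < medium_t else "medium" if score < high_t else "high"
    return score, level
```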

**Failure-type detection** (applied at medium/high risk; see the sketch after this list):
- `stop_reason == "max_tokens"` → resource failure → retry with 2x tokens (no tool)
- Otherwise → reasoning failure → apply synthesised cluster repair tool
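
In code, that dispatch is a two-branch decision (sketch only; names are illustrative):

```python
def choose_action(stop_reason: str, risk_level: str) -> str:
    """Map a response's stop reason and risk level to the repair action."""
    if risk_level == "low":
        return "return answer as-is"
    if stop_reason == "max_tokens":
        return "retry with 2x max_tokens (resource failure, no tool)"
    return "apply synthesised cluster repair tool (reasoning failure)"
```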

---

## Limitations

- **Calibration quality matters.** `fit()` requires ≥6 correct examples; `fit_from_consistency()` works best when baseline accuracy is >70%. With very low baseline accuracy, few questions will agree across samples.
- **Embeddings are language-level.** The predictor detects unfamiliar *phrasing*, not unfamiliar *reasoning steps*. Two questions that look similar but require different reasoning may get similar scores.
- **Repair tools are heuristic.** `learn_from_errors()` synthesises prompt additions using the LLM — they help on average but are not guaranteed to fix every instance of a cluster.
- **Currently Anthropic-only.** OpenAI/other provider support is on the roadmap.
- **Not a security filter.** This tool predicts factual/reasoning failures, not prompt injection or jailbreaks.

---

## Roadmap

- [ ] OpenAI and Ollama provider support
- [ ] Async/streaming API
- [ ] Save/load guard state (`.save()` / `.load()`)
- [ ] Score-only mode (no LLM call required)
- [ ] Dashboard for failure cluster visualization

---

## License

MIT. See [LICENSE](LICENSE).

---

## Citation

If you use this in research:

```
Majumder, A. (2025). LLM Reliability Guard: KNN-based failure prediction
for large language models. AUROC 0.966–0.993 on math, code, and factual QA.
https://github.com/avighan/qppg
```
