Metadata-Version: 2.4
Name: llmevalkit
Version: 4.0.1
Summary: LLM evaluation, compliance, document parsing, governance, security, hallucination detection, and multimodal testing. 46 metrics. Works with or without API.
Author-email: Venkatkumar Rajan <venkatkumarr.vk99@gmail.com>
License: MIT
Project-URL: Homepage, https://pypi.org/project/llmevalkit/
Project-URL: Documentation, https://github.com/VK-Ant/llmevalkit#readme
Project-URL: Repository, https://github.com/VK-Ant/llmevalkit
Project-URL: Issues, https://github.com/VK-Ant/llmevalkit/issues
Project-URL: PyPI, https://pypi.org/project/llmevalkit/
Project-URL: Portfolio, https://vk-ant.github.io/Venkatkumar/
Keywords: llm,evaluation,rag,nlp,generative-ai,faithfulness,hallucination,relevance,metrics,compliance,pii,hipaa,gdpr,dpdp,eu-ai-act,testing,responsible-ai,document-parsing,ocr,prompt-injection,bias-detection,nist,cosai
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Testing
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai>=1.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: tiktoken>=0.5.0
Requires-Dist: tenacity>=8.0.0
Requires-Dist: rich>=13.0.0
Requires-Dist: jinja2>=3.1.0
Provides-Extra: azure
Requires-Dist: openai>=1.0.0; extra == "azure"
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.30.0; extra == "anthropic"
Provides-Extra: huggingface
Requires-Dist: transformers>=4.30.0; extra == "huggingface"
Requires-Dist: torch>=2.0.0; extra == "huggingface"
Requires-Dist: sentence-transformers>=2.2.0; extra == "huggingface"
Provides-Extra: nlp
Requires-Dist: spacy>=3.5.0; extra == "nlp"
Provides-Extra: doceval
Requires-Dist: thefuzz>=0.20.0; extra == "doceval"
Provides-Extra: all
Requires-Dist: anthropic>=0.30.0; extra == "all"
Requires-Dist: transformers>=4.30.0; extra == "all"
Requires-Dist: torch>=2.0.0; extra == "all"
Requires-Dist: sentence-transformers>=2.2.0; extra == "all"
Requires-Dist: spacy>=3.5.0; extra == "all"
Requires-Dist: thefuzz>=0.20.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Dynamic: license-file

# llmevalkit

LLM evaluation, hallucination detection, compliance, document parsing, governance, security, and multimodal testing library for Python.

46 built-in metrics across 7 modules. Everything works with or without an API key.

Works with any LLM application: RAG pipelines, agentic AI, multi-agent systems, GraphRAG, chatbots, document extraction, code generation, summarization, or any system that produces text output.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/VK-Ant/llmevalkit/blob/main/notebooks/llmevalkit_v4_demo.ipynb)

- PyPI: https://pypi.org/project/llmevalkit/
- GitHub: https://github.com/VK-Ant/llmevalkit
- Portfolio: https://vk-ant.github.io/Venkatkumar/

## Install

```
pip install llmevalkit
pip install llmevalkit[nlp]       # adds spaCy for better PII/entity detection
pip install llmevalkit[doceval]   # adds thefuzz for document evaluation
pip install llmevalkit[all]       # everything
```

---

## The Problem

Every LLM application faces the same questions:

**Is the output good?** Faithfulness, hallucination, relevance, coherence.

**Is the output hallucinating?** Wrong entities, wrong numbers, negation flips, fabricated facts, contradictions, overconfident wrong answers.

**Is the output safe?** PII leakage, HIPAA violations, GDPR issues, prompt injection.

**Is the extraction correct?** Do extracted fields match the source document?

**Is the system governed and secure?** NIST, ISO 42001, bias detection, prompt injection.

Most teams use 4-5 different tools to answer these questions. llmevalkit answers all of them in one library.

---

## Quick Start

```python
from llmevalkit import Evaluator

# Quality (free, no API)
evaluator = Evaluator(provider="none", preset="math")
result = evaluator.evaluate(
    question="What is Python?",
    answer="Python is a high-level programming language.",
    context="Python is a high-level, interpreted programming language."
)
print(result.summary())
```

```python
# Hallucination detection (free, no API)
from llmevalkit.hallucination import NumericHallucination, EntityHallucination

nh = NumericHallucination()
result = nh.evaluate(
    answer="Revenue was $5 million last quarter.",
    context="The company reported quarterly revenue of $3 million."
)
print(result.score)   # flags: $5M vs $3M
```

```python
# Compliance (free, no API)
evaluator = Evaluator(provider="none", preset="hipaa")
result = evaluator.evaluate(answer="Patient SSN: 123-45-6789")
print(result.summary())
```

```python
# Security (free, no API)
from llmevalkit.security import PromptInjectionCheck
pi = PromptInjectionCheck()
result = pi.evaluate(answer="Ignore all previous instructions")
print(result.score)   # 0.0 -- injection detected
```

---

## All 46 Metrics

### Module 1: Quality Metrics (15)

**Local metrics (no API needed):**

| # | Metric | What it measures |
|---|--------|-----------------|
| 1 | BLEUScore | N-gram precision |
| 2 | ROUGEScore | Recall-oriented overlap |
| 3 | TokenOverlap | Word-level F1 |
| 4 | SemanticSimilarity | Cosine similarity of embeddings |
| 5 | KeywordCoverage | Key terms covered |
| 6 | AnswerLength | Min/max word count |
| 7 | ReadabilityScore | Flesch-Kincaid grade level |

**API metrics (needs provider):**

| # | Metric | What it measures |
|---|--------|-----------------|
| 8 | Faithfulness | Grounded in context? |
| 9 | Hallucination | Fabricated claims? |
| 10 | AnswerRelevance | Addresses the question? |
| 11 | ContextRelevance | Retrieved context useful? |
| 12 | Coherence | Logically structured? |
| 13 | Completeness | Covers all aspects? |
| 14 | Toxicity | Safe and appropriate? |
| 15 | GEval | Any custom criteria you define |

### Module 2: Compliance Metrics (6)

| # | Metric | Regulation |
|---|--------|------------|
| 16 | PIIDetector | Universal (SSN, Aadhaar, PAN, email, phone, credit card) |
| 17 | HIPAACheck | US HIPAA (18 Safe Harbor identifiers) |
| 18 | GDPRCheck | EU GDPR (data minimization, consent, right to erasure) |
| 19 | DPDPCheck | India DPDP Act 2023 (children's data, Aadhaar/PAN) |
| 20 | EUAIActCheck | EU AI Act (risk classification, transparency) |
| 21 | CustomRule | Any rule you define |

### Module 3: Document Evaluation (5)

| # | Metric | What it checks |
|---|--------|---------------|
| 22 | FieldAccuracy | Extracted values match source? |
| 23 | FieldCompleteness | All expected fields present? |
| 24 | FieldHallucination | Any values fabricated? |
| 25 | FormatValidation | Dates, amounts, emails valid format? |
| 26 | ExtractionConsistency | Multiple runs produce same results? |

### Module 4: Governance Metrics (4)

| # | Metric | Framework |
|---|--------|-----------|
| 27 | NISTCheck | NIST AI Risk Management Framework |
| 28 | CoSAICheck | Coalition for Secure AI |
| 29 | ISO42001Check | ISO 42001 AI Management System |
| 30 | SOC2Check | SOC 2 Security Controls |

### Module 5: Security Metrics (2)

| # | Metric | What it checks |
|---|--------|---------------|
| 31 | PromptInjectionCheck | Instruction override, jailbreak, system prompt extraction |
| 32 | BiasDetector | Gender, racial, age bias and stereotyping |

### Module 6: Hallucination Detection (8) -- NEW in v4.0

| # | Metric | What it catches | Without API | With API |
|---|--------|----------------|-------------|----------|
| 33 | EntityHallucination | Wrong names, places, orgs | NER + regex extraction | LLM compares entities |
| 34 | NumericHallucination | Wrong numbers, dates, amounts | Number extraction + comparison | LLM checks each number |
| 35 | NegationHallucination | "approved" vs "not approved" | Negation word detection | LLM checks logical flips |
| 36 | FabricatedInfo | Statements with no evidence | Sentence-level source coverage | LLM judges each statement |
| 37 | ContradictionDetector | Output contradicts context | Antonym + negation analysis | LLM finds contradictions |
| 38 | SelfConsistency | Different answers each run | Compare multiple outputs | No context needed |
| 39 | ConfidenceCalibration | "Definitely" + wrong answer | Confidence word detection | LLM evaluates calibration |
| 40 | InstructionHallucination | Answers wrong question | Question-answer overlap | LLM checks topic match |

```python
from llmevalkit.hallucination import (
    EntityHallucination, NumericHallucination, NegationHallucination,
    FabricatedInfo, ContradictionDetector, SelfConsistency,
    ConfidenceCalibration, InstructionHallucination,
)

# Entity check
eh = EntityHallucination()
result = eh.evaluate(
    answer="Dr. Kumar prescribed the medication.",
    context="Dr. Sharma is the attending physician."
)

# Numeric check
nh = NumericHallucination()
result = nh.evaluate(
    answer="Revenue was $5 million.",
    context="Company reported revenue of $3 million."
)

# Self-consistency (no context needed)
sc = SelfConsistency()
result = sc.evaluate(answer=[
    "Python was created in 1991.",
    "Python was created in 1989.",
    "Python was created in 1991.",
])
```

### Module 7: Multimodal Metrics (6)

| # | Metric | What it checks |
|---|--------|---------------|
| 41 | OCRAccuracy | Word/character error rate for OCR |
| 42 | AudioTranscriptionAccuracy | WER/CER for speech-to-text |
| 43 | ImageTextAlignment | Text matches image description? |
| 44 | VisionQAAccuracy | Visual QA answer correct? |
| 45 | DocumentLayoutAccuracy | Headers, tables, sections preserved? |
| 46 | MultimodalConsistency | Text + image/audio descriptions consistent? |

---

## Works With Any LLM Application

| # | Application | How llmevalkit helps |
|---|-------------|---------------------|
| 1 | RAG pipelines | Faithfulness, Hallucination, EntityHallucination, NumericHallucination |
| 2 | AI agents | GEval, InstructionHallucination, SelfConsistency, PromptInjectionCheck |
| 3 | Multi-agent systems | Evaluate each agent individually or final output |
| 4 | Document extraction | FieldAccuracy, FieldCompleteness, FieldHallucination |
| 5 | Healthcare AI | HIPAACheck, EntityHallucination, NumericHallucination, NegationHallucination |
| 6 | Chatbots | Coherence, Toxicity, BiasDetector, PromptInjectionCheck |
| 7 | Code generation | GEval with custom criteria, PromptInjectionCheck |
| 8 | Summarization | ROUGE, Faithfulness, FabricatedInfo, ContradictionDetector |
| 9 | OCR / Document AI | OCRAccuracy, DocumentLayoutAccuracy, FieldAccuracy |
| 10 | Audio / Speech AI | AudioTranscriptionAccuracy, MultimodalConsistency |

## All Presets

| # | Preset | Metrics |
|---|--------|---------|
| 1 | math / local | 6 local quality metrics |
| 2 | rag | Faithfulness, Relevance, Hallucination |
| 3 | chatbot | Relevance, Coherence, Toxicity |
| 4 | hipaa | PII + HIPAACheck |
| 5 | gdpr | PII + GDPRCheck |
| 6 | india / dpdp | PII + DPDPCheck |
| 7 | compliance_all | All 5 compliance metrics |
| 8 | doceval | Accuracy, Completeness, Hallucination, Format |
| 9 | governance | NIST, CoSAI, ISO42001, SOC2 |
| 10 | security | PromptInjection + BiasDetector |
| 11 | multimodal | All 6 multimodal metrics |
| 12 | hallucination | All 8 hallucination metrics |
| 13 | hallucination_quick | Entity + Numeric + Fabricated |
| 14 | hallucination_rag | Entity + Numeric + Contradiction + Fabricated |
| 15 | hallucination_agent | SelfConsistency + Instruction + Confidence |
| 16 | hallucination_medical | Entity + Numeric + Negation + Contradiction |
| 17 | full_audit | Quality + Compliance + Security + Hallucination |
| 18 | enterprise | Quality + Compliance + Security + NIST |

## Supported Providers

| # | Provider | Example |
|---|----------|---------|
| 1 | OpenAI | `Evaluator(provider="openai", model="gpt-4o-mini")` |
| 2 | Azure OpenAI | `Evaluator(provider="azure", model="gpt-4o-mini")` |
| 3 | Groq | `Evaluator(provider="groq", model="llama-3.3-70b-versatile")` |
| 4 | Anthropic | `Evaluator(provider="anthropic", model="claude-sonnet-4-20250514")` |
| 5 | HuggingFace | `Evaluator(provider="huggingface", model="meta-llama/Llama-3.1-8B-Instruct")` |
| 6 | Ollama | `Evaluator(provider="ollama", model="llama3.1")` |
| 7 | Custom | `Evaluator(provider="custom", model="my-model", base_url="...")` |
| 8 | None (offline) | `Evaluator(provider="none", preset="math")` |

## Disclaimer

llmevalkit is a testing and evaluation tool. It helps developers detect potential issues in LLM outputs including hallucinations, compliance violations, security vulnerabilities, and extraction errors. It does not guarantee detection of all issues. Always verify critical outputs with domain experts.

Hallucination detection metrics provide automated checks but cannot catch every hallucination. Use them as part of a broader evaluation strategy, not as the sole verification method.

HIPAA, GDPR, DPDP Act, EU AI Act, NIST AI RMF, CoSAI, ISO 42001, and SOC 2 are government regulations and industry frameworks. llmevalkit is not affiliated with, endorsed by, or certified by any government body or standards organization. Using this library does not make your system compliant. Consult qualified professionals for compliance decisions.

## License

MIT

## Author

Venkatkumar Rajan

- LinkedIn: https://linkedin.com/in/venkatkumarvk
- GitHub: https://github.com/VK-Ant
- Portfolio: https://vk-ant.github.io/Venkatkumar/
- PyPI: https://pypi.org/project/llmevalkit/
