Metadata-Version: 2.4
Name: factlens
Version: 2026.4.28.2
Summary: Geometric LLM hallucination detection. No second LLM. Deterministic. Auditable.
Project-URL: Homepage, https://factlens.dev
Project-URL: Documentation, https://docs.factlens.dev
Project-URL: Repository, https://github.com/factlens/factlens
Project-URL: Issues, https://github.com/factlens/factlens/issues
Project-URL: Changelog, https://github.com/factlens/factlens/blob/main/CHANGELOG.md
Project-URL: Research (SGI), https://arxiv.org/abs/2512.13771
Project-URL: Research (DGI), https://arxiv.org/abs/2602.13224
Project-URL: Research (Confabulation), https://arxiv.org/abs/2603.13259
Author-email: Javier Marin <javier@jmarin.info>
License-Expression: MIT
License-File: LICENSE
Keywords: ai-safety,dgi,embedding-geometry,eu-ai-act,factual-accuracy,grounding,hallucination-detection,llm-evaluation,rag,sgi
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: numpy>=1.24.0
Requires-Dist: sentence-transformers<6.0.0,>=2.7.0
Provides-Extra: all
Requires-Dist: anthropic>=0.25.0; extra == 'all'
Requires-Dist: autogen-agentchat>=0.4.0; extra == 'all'
Requires-Dist: crewai>=0.80.0; extra == 'all'
Requires-Dist: google-generativeai>=0.5.0; extra == 'all'
Requires-Dist: langchain-core>=0.3.0; extra == 'all'
Requires-Dist: langsmith>=0.1.0; extra == 'all'
Requires-Dist: openai>=1.0.0; extra == 'all'
Requires-Dist: semantic-kernel>=1.0.0; extra == 'all'
Provides-Extra: anthropic
Requires-Dist: anthropic>=0.25.0; extra == 'anthropic'
Provides-Extra: autogen
Requires-Dist: autogen-agentchat>=0.4.0; extra == 'autogen'
Provides-Extra: crewai
Requires-Dist: crewai>=0.80.0; extra == 'crewai'
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pip-audit>=2.7; extra == 'dev'
Requires-Dist: pre-commit>=3.7; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest-mock>=3.12; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5.0; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-gen-files>=0.5; extra == 'docs'
Requires-Dist: mkdocs-literate-nav>=0.6; extra == 'docs'
Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.25; extra == 'docs'
Provides-Extra: google
Requires-Dist: google-generativeai>=0.5.0; extra == 'google'
Provides-Extra: integrations
Requires-Dist: autogen-agentchat>=0.4.0; extra == 'integrations'
Requires-Dist: crewai>=0.80.0; extra == 'integrations'
Requires-Dist: langchain-core>=0.3.0; extra == 'integrations'
Requires-Dist: langsmith>=0.1.0; extra == 'integrations'
Requires-Dist: semantic-kernel>=1.0.0; extra == 'integrations'
Provides-Extra: langchain
Requires-Dist: langchain-core>=0.3.0; extra == 'langchain'
Requires-Dist: langsmith>=0.1.0; extra == 'langchain'
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == 'openai'
Provides-Extra: providers
Requires-Dist: anthropic>=0.25.0; extra == 'providers'
Requires-Dist: google-generativeai>=0.5.0; extra == 'providers'
Requires-Dist: openai>=1.0.0; extra == 'providers'
Provides-Extra: semantic-kernel
Requires-Dist: semantic-kernel>=1.0.0; extra == 'semantic-kernel'
Description-Content-Type: text/markdown

<div align="center">
  <img src="https://raw.githubusercontent.com/factlens/factlens/main/docs/assets/FactLens_header-03.png" alt="factlens" width="800">


# Geometric LLM hallucination detection. No second LLM. Deterministic. Auditable.



[![Python](https://img.shields.io/badge/python-3.10%20|%203.11%20|%203.12%20|%203.13-blue?style=flat-square)](https://github.com/factlens/factlens)
[![License: MIT](https://img.shields.io/badge/License-MIT-green?style=flat-square)](https://opensource.org/licenses/MIT)
[![CI](https://img.shields.io/github/actions/workflow/status/factlens/factlens/ci.yml?branch=main&label=CI&style=flat-square)](https://github.com/factlens/factlens/actions)
[![Docs](https://img.shields.io/badge/docs-docs.factlens.dev-blue?style=flat-square)](https://docs.factlens.dev)
[![Version](https://img.shields.io/badge/version-2026.4.28-orange?style=flat-square)](https://github.com/factlens/factlens/releases)
[![Release to PyPI](https://img.shields.io/github/actions/workflow/status/factlens/factlens/release.yml?style=flat-square&label=Release%20to%20PyPI)](https://github.com/factlens/factlens/actions/workflows/release.yml)

[Documentation](https://docs.factlens.dev) | [Research Papers](#research) | [Examples](examples/) | [Contributing](CONTRIBUTING.md)

</div>

---

***factlens*** detects LLM hallucinations using embedding geometry instead of a second LLM. It computes deterministic, auditable scores from the spatial relationships between questions, responses, and source context in an embedding space. The result is a verification signal you can explain in an audit, reproduce on demand, and run in regulated environments.

## Why ***factlens***?

| Problem | How factlens solves it |
|---|---|
| Second-LLM judges are non-deterministic and expensive | Single embedding model (`all-mpnet-base-v2`), deterministic output, sub-second latency |
| Probabilistic scores cannot be audited | Geometric ratios and angular measurements with clear mathematical definitions |
| Regulatory compliance requires explainability | Every score traces to Euclidean distances and cosine similarities in $R^n$ |
| One method does not fit all use cases | SGI for RAG/context verification, DGI for context-free chat, `evaluate()` auto-selects |

`SGI`: Semantic Grounding Index | `DGI`: Directional Grounding Index


## Installation

```bash
pip install factlens
```

With LLM provider support:

```bash
pip install "factlens[openai]"       # OpenAI
pip install "factlens[anthropic]"    # Anthropic
pip install "factlens[google]"       # Google Generative AI
pip install "factlens[providers]"    # All providers
```

With framework integrations:

```bash
pip install "factlens[langchain]"    # LangChain
pip install "factlens[crewai]"       # CrewAI
pip install "factlens[semantic-kernel]"  # Semantic Kernel
pip install "factlens[autogen]"      # AutoGen
pip install "factlens[integrations]"  # All framework integrations
pip install "factlens[all]"          # All providers and integrations
```

**Requirements:** Python 3.10+, `numpy>=1.24.0`, `sentence-transformers>=2.7.0,<6.0.0`.

## Quick start

### SGI -- with context (RAG verification)

SGI (Semantic Grounding Index) measures whether a response engaged with the provided context or stayed anchored to the question. It requires three inputs: the question, the source context, and the response.

```python
from factlens import compute_sgi

result = compute_sgi(
    question="What is the capital of France?",
    context="France is in Western Europe. Its capital is Paris.",
    response="The capital of France is Paris.",
)

print(result.value)       # 1.23 — ratio of distances
print(result.normalized)  # 0.61 — mapped to [0, 1]
print(result.flagged)     # False — above review threshold
print(result.explanation) # "SGI=1.230 — strong context engagement (pass)"
```

**Interpretation:** `SGI > 1.0` means the response is closer to the context than to the question in embedding space. The response engaged with the source material.

### DGI -- without context

DGI (Directional Grounding Index) detects hallucinations without requiring source context. It checks whether the question-to-response displacement vector aligns with the characteristic direction of verified grounded responses.

```python
from factlens import compute_dgi

result = compute_dgi(
    question="What causes seasons on Earth?",
    response="Seasons are caused by Earth's 23.5-degree axial tilt.",
)

print(result.value)       # 0.42 — cosine similarity to reference direction
print(result.normalized)  # 0.71 — mapped to [0, 1]
print(result.flagged)     # False — above pass threshold (0.30)
```

**Domain calibration** raises DGI's AUROC from ~0.76 (generic reference direction) to 0.90-0.99:

```python
from factlens import compute_dgi

result = compute_dgi(
    question="What is the statute of limitations for breach of contract in California?",
    response="Four years under California Code of Civil Procedure Section 337.",
    reference_csv="legal_calibration_pairs.csv",
)
```

### evaluate() -- auto-select

The `evaluate()` function picks the right method automatically: SGI when context is provided, DGI when it is not.

```python
from factlens import evaluate

# With context -> SGI
score = evaluate(
    question="What is X?",
    response="X is Y.",
    context="According to the manual, X is Y.",
)
assert score.method == "sgi"

# Without context -> DGI
score = evaluate(
    question="What is X?",
    response="X is Y.",
)
assert score.method == "dgi"
```

### Batch evaluation

```python
from factlens import evaluate_batch

items = [
    {"question": "Q1?", "response": "A1.", "context": "Source."},
    {"question": "Q2?", "response": "A2."},
    {"question": "Q3?", "response": "A3.", "context": "Reference."},
]

results = evaluate_batch(items)
flagged = [r for r in results if r.flagged]
print(f"{len(flagged)}/{len(results)} flagged for review")
```
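Batch inputs often start life as a spreadsheet export. The following is a minimal sketch of turning CSV rows into the list-of-dicts shape used above, leaving out `context` when the cell is empty so that those rows would fall through to DGI. The helper name `rows_to_items` and the column names `question`, `response`, and `context` are assumptions made for illustration, not part of the factlens API.

```python
import csv
import io


def rows_to_items(csv_text: str) -> list[dict[str, str]]:
    """Turn CSV rows into evaluate_batch-style dicts, omitting empty context.

    Hypothetical helper: column names are assumed, not defined by factlens.
    """
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = {"question": row["question"], "response": row["response"]}
        context = (row.get("context") or "").strip()
        if context:
            # Only rows with real context carry the key.
            item["context"] = context
        items.append(item)
    return items


sample = (
    "question,response,context\n"
    "Q1?,A1.,Source.\n"
    "Q2?,A2.,\n"
)
items = rows_to_items(sample)
print(items)
```

The resulting `items` list has the same shape as the hand-written one above and could be passed to `evaluate_batch(items)`.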

### CLI

```bash
# Single response check
factlens check \
  --question "What is the capital of France?" \
  --response "The capital of France is Paris." \
  --context "France is in Western Europe. Its capital is Paris."

# Batch CSV evaluation
factlens evaluate input.csv --output results.csv

# Domain calibration
factlens calibrate --pairs domain_pairs.csv --output calibration.json

# Run the confabulation benchmark
factlens benchmark
```

### LLM provider guard

```python
from factlens.providers.openai import OpenAIProvider

provider = OpenAIProvider(model="gpt-4o")
response = provider.complete(
    prompt="Summarize this document.",
    context="The document text here...",
)

if response.factlens_score and response.factlens_score.flagged:
    print("Hallucination risk detected — review recommended.")
else:
    print(response.text)
```

## Architecture

```
factlens/
├── __init__.py              # Public API: compute_sgi, compute_dgi, evaluate, evaluate_batch, calibrate
├── sgi.py                   # Semantic Grounding Index (context-required)
├── dgi.py                   # Directional Grounding Index (context-free)
├── evaluate.py              # High-level evaluate() and evaluate_batch()
├── calibrate.py             # Domain-specific DGI calibration
├── score.py                 # Result types: SGIResult, DGIResult, FactlensScore
├── _version.py              # CalVer version (2026.4.28)
├── _internal/               # Private implementation
│   ├── geometry.py          # Euclidean distance, displacement, unit normalize
│   ├── embeddings.py        # Sentence transformer encoding
│   ├── thresholds.py        # Decision boundaries and normalization
│   └── csv_loader.py        # Calibration data loading
├── cli/
│   └── main.py              # CLI: check, evaluate, calibrate, benchmark
├── providers/               # LLM provider wrappers
│   ├── _base.py             # BaseLLMProvider protocol + LLMResponse
│   ├── openai.py            # OpenAI provider
│   ├── anthropic.py         # Anthropic provider
│   └── google.py            # Google Generative AI provider
└── integrations/            # Framework integrations
    ├── langchain/           # LangChain evaluator + callback
    ├── crewai/              # CrewAI tool
    ├── semantic_kernel/     # Semantic Kernel filter
    └── autogen/             # AutoGen checker
```

The architecture follows a layered design:

```
┌──────────────────────────────────────────────┐
│            Public API (evaluate)             │
├──────────────────┬───────────────────────────┤
│   SGI (sgi.py)   │       DGI (dgi.py)        │
├──────────────────┴───────────────────────────┤
│       _internal (geometry, embeddings)       │
├──────────────────────────────────────────────┤
│  sentence-transformers (all-mpnet-base-v2)   │
└──────────────────────────────────────────────┘
         ▲                          ▲
         │                          │
   ┌─────┴──────┐            ┌──────┴───────┐
   │ Providers  │            │ Integrations │
   │ (OpenAI,   │            │ (LangChain,  │
   │  Anthropic,│            │  CrewAI,     │
   │  Google)   │            │  SK, AutoGen)│
   └────────────┘            └──────────────┘
```

## Scoring methods

### SGI (Semantic Grounding Index)

```
SGI = dist(phi(response), phi(question)) / dist(phi(response), phi(context))
```

| Score | Interpretation |
|---|---|
| SGI > 1.20 | Strong context engagement (pass) |
| 0.95 < SGI < 1.20 | Partial engagement (review recommended) |
| SGI < 0.95 | Weak engagement (flagged) |
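
To make the ratio concrete, here is a minimal NumPy sketch that applies the formula and the table above to vectors that are already embedded. It is not the code in `sgi.py`: the function names are hypothetical, the toy 2-D vectors stand in for real `phi(...)` embeddings, and treating a score that sits exactly on 0.95 or 1.20 as "review" is an assumption, since the table does not say which band owns the boundary.

```python
import numpy as np


def sgi_from_embeddings(q: np.ndarray, c: np.ndarray, r: np.ndarray) -> float:
    """Ratio of response-question distance to response-context distance."""
    dist_rq = float(np.linalg.norm(r - q))
    dist_rc = float(np.linalg.norm(r - c))
    if dist_rc == 0.0:
        # Response coincides with the context embedding; the ratio is unbounded.
        return float("inf")
    return dist_rq / dist_rc


def sgi_band(sgi: float) -> str:
    """Map an SGI value to a band (assumption: boundary values go to review)."""
    if sgi > 1.20:
        return "pass"
    if sgi < 0.95:
        return "flagged"
    return "review"


# Toy 2-D vectors standing in for phi(question), phi(context), phi(response).
q = np.array([0.0, 0.0])
c = np.array([4.0, 0.0])
r = np.array([3.0, 0.0])

score = sgi_from_embeddings(q, c, r)  # dist(r, q) = 3, dist(r, c) = 1
print(score, sgi_band(score))
```

Here the response sits three times farther from the question than from the context, so the ratio is well above 1.20.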

### DGI (Directional Grounding Index)

```
delta = phi(response) - phi(question)
DGI = dot(delta / ||delta||, mu_hat)
```

| Score | Interpretation |
|---|---|
| DGI > 0.30 | Aligns with grounded patterns (pass) |
| 0.00 < DGI < 0.30 | Weak alignment (flagged) |
| DGI < 0.00 | Opposes grounded direction (high risk) |
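
The two-line definition above can be sketched with NumPy as follows. This is an illustration, not the code in `dgi.py`: the function names are hypothetical, `mu_hat` is a made-up 2-D direction, and sending a score of exactly 0.00 or 0.30 to "flagged" is an assumption. The affine map `(DGI + 1) / 2` reproduces the `0.42` to `0.71` pair printed in the quick start, but whether factlens normalizes exactly this way is also an assumption.

```python
import numpy as np


def unit(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length; reject the zero vector."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        raise ValueError("zero-length vector has no direction")
    return v / norm


def dgi_from_embeddings(q: np.ndarray, r: np.ndarray, mu_hat: np.ndarray) -> float:
    """Cosine between the question-to-response displacement and mu_hat."""
    delta = r - q
    return float(np.dot(unit(delta), unit(mu_hat)))


def dgi_band(dgi: float) -> str:
    """Map a DGI value to a band (assumption: boundary values are flagged)."""
    if dgi > 0.30:
        return "pass"
    if dgi < 0.00:
        return "high risk"
    return "flagged"


def normalize_dgi(dgi: float) -> float:
    """Assumed affine map from [-1, 1] to [0, 1]."""
    return (dgi + 1.0) / 2.0


# Toy 2-D vectors standing in for real embeddings and a reference direction.
q = np.array([0.0, 0.0])
r = np.array([3.0, 4.0])
mu_hat = np.array([1.0, 0.0])

dgi = dgi_from_embeddings(q, r, mu_hat)
print(round(dgi, 2), dgi_band(dgi), round(normalize_dgi(dgi), 2))
```

Because only the direction of `delta` enters the score, scaling the response embedding away from the question along the same line leaves DGI unchanged.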

## Providers and integrations

| Component | Install extra | Description |
|---|---|---|
| OpenAI | `openai` | Wraps `openai` SDK with automatic scoring |
| Anthropic | `anthropic` | Wraps `anthropic` SDK with automatic scoring |
| Google | `google` | Wraps `google-generativeai` with automatic scoring |
| LangChain | `langchain` | Evaluator + callback handler |
| CrewAI | `crewai` | Tool for agent pipelines |
| Semantic Kernel | `semantic-kernel` | Function calling filter |
| AutoGen | `autogen` | Agent chat checker |

## Domain calibration

Generic DGI uses a bundled reference direction that achieves AUROC ~0.76. For production use, calibrate with 20-100 verified question-response pairs from your domain:

```python
from factlens import calibrate

result = calibrate(csv_path="my_domain_pairs.csv")
print(f"Concentration: {result.concentration:.2f}")
result.save("calibration.json")
```
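
One plausible way to obtain a reference direction from verified pairs is to average their unit displacement vectors. The sketch below works under that assumption and is not taken from `calibrate.py`. The function name `reference_direction` is hypothetical, and reading "concentration" as the mean resultant length (the norm of the averaged unit vectors, between 0 and 1) is a guess at what `result.concentration` reports.

```python
import numpy as np


def reference_direction(q_emb: np.ndarray, r_emb: np.ndarray) -> tuple[np.ndarray, float]:
    """Estimate a mean direction from paired embeddings of shape (n, d).

    Returns (mu_hat, concentration). Concentration here is the mean
    resultant length: 1.0 means every displacement points the same way.
    """
    deltas = r_emb - q_emb
    norms = np.linalg.norm(deltas, axis=1, keepdims=True)
    if np.any(norms == 0.0):
        raise ValueError("a pair has identical question and response embeddings")
    units = deltas / norms
    mean_vec = units.mean(axis=0)
    concentration = float(np.linalg.norm(mean_vec))
    if concentration == 0.0:
        raise ValueError("displacements cancel out; no mean direction")
    return mean_vec / concentration, concentration


# Toy data: three pairs in 2-D, two pointing along x and one along y.
q_emb = np.zeros((3, 2))
r_emb = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])

mu_hat, concentration = reference_direction(q_emb, r_emb)
print(mu_hat, round(concentration, 3))
```

Under this reading, a low concentration would suggest the calibration pairs do not share a common direction, which is a hint that the set mixes domains or is too small.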

Domain-specific calibration typically reaches AUROC 0.90-0.99. The confabulation benchmark (arXiv:2603.13259) reports DGI AUROC 0.958 with domain calibration.
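
AUROC figures like these can be checked on your own labelled data. The sketch below computes AUROC directly from its definition: the probability that a randomly chosen grounded response scores higher than a randomly chosen hallucinated one, with ties counted as half. The function name, the scores, and the labels are invented for illustration and are not benchmark data.

```python
def auroc(scores: list[float], labels: list[int]) -> float:
    """Pairwise AUROC; labels use 1 = grounded, 0 = hallucinated."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one item of each class")
    # Count grounded/hallucinated pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up DGI scores with human-verified labels.
scores = [0.42, 0.25, 0.10, 0.31, -0.05]
labels = [1, 1, 0, 0, 0]
print(round(auroc(scores, labels), 3))
```

The pairwise form is quadratic in the number of items, which is fine for a calibration set of a few hundred pairs; larger evaluations would normally use a rank-based implementation.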

## Research

factlens implements the methods described in three research papers, all available on arXiv:

1. **Semantic Grounding Index (SGI)**
   Marin, J. (2025). *Semantic Grounding Index for LLM Hallucination Detection.*
   [arXiv:2512.13771](https://arxiv.org/abs/2512.13771)

2. **Directional Grounding Index (DGI)**
   Marin, J. (2026). *A Geometric Taxonomy of Hallucinations in Large Language Models.*
   [arXiv:2602.13224](https://arxiv.org/abs/2602.13224)

3. **Confabulation Benchmark**
   Marin, J. (2026). *Rotational Dynamics of Factual Constraint Processing in Large Language Models.*
   [arXiv:2603.13259](https://arxiv.org/abs/2603.13259)

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for development setup, code standards, and PR process.

## License

[MIT](LICENSE) -- Javier Marin (javier@jmarin.info)
