Metadata-Version: 2.4
Name: verifyindex
Version: 0.1.0
Summary: A multi-dimensional metric for evaluating the evidential quality of LLM responses
Author-email: Vishal Srivastava <vishal@evaluabilityai.com>, Tanmay Sah <tanmay@evaluabilityai.com>
License: MIT
Project-URL: Homepage, https://verifyindex.ai
Project-URL: Documentation, https://verifyindex.ai/docs
Project-URL: Repository, https://github.com/evaluabilityai/verifyindex
Project-URL: Paper, https://arxiv.org/abs/XXXX.XXXXX
Keywords: llm,evaluation,factuality,hallucination,governance,rbed,verifyindex
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: ml
Requires-Dist: transformers>=4.40; extra == "ml"
Requires-Dist: sentence-transformers>=2.7; extra == "ml"
Requires-Dist: torch>=2.0; extra == "ml"
Provides-Extra: retrieval
Requires-Dist: faiss-cpu>=1.7.4; extra == "retrieval"
Requires-Dist: wikipedia-api>=0.6; extra == "retrieval"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: black>=24.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: mypy>=1.10; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Dynamic: license-file

# VerifyIndex

**A multi-dimensional metric for evaluating the evidential quality of Large Language Model responses.**

VerifyIndex decomposes the factuality of an LLM response into five sub-dimensions (Verifiability, Evidence coverage, Retrieval precision, Inferential support, and Fidelity) and combines them into a composite Yield score. The per-dimension breakdown lets you diagnose failure modes that a single-percentage metric conflates.

Designed for both research evaluation and enterprise deployment governance, VerifyIndex integrates with the [R-BED (Risk-Based Evaluability Design)](https://evaluabilityai.com) framework for operational use in regulated industries.

## Installation

```bash
pip install verifyindex
```

For the full retrieval-and-classifier pipeline (recommended for production use):

```bash
pip install "verifyindex[ml,retrieval]"
```

## Quick Start

```python
from verifyindex import VerifyIndex

vi = VerifyIndex()

result = vi.score(
    response="Marie Curie was born in Warsaw in 1867 and won two Nobel Prizes in Physics and Chemistry.",
    knowledge_source="wikipedia",
)

print(result.summary())
```

Output:

```
VerifyIndex score: 0.850
  Verifiability (V):        1.000
  Evidence coverage (E):    1.000
  Retrieval precision (R):  0.887
  Inferential support (I):  1.000
  Fidelity (F):             0.500
  Total atomic claims:      4
```

The headline score is the composite Yield (Y), defined as the geometric mean of V, E, R, I, and F. The per-dimension values form the profile, which you can inspect to diagnose why a response scored low.

## The VerifyIndex Profile

Each response produces a six-dimensional profile:

| Dim | Name | Measures |
|-----|------|----------|
| **V** | Verifiability | Fraction of claims that are checkable against sources |
| **E** | Evidence Coverage | Fraction of verifiable claims with retrievable evidence |
| **R** | Retrieval Precision | Quality of retrieved evidence for each claim |
| **I** | Inferential Support | Fraction of claims entailed by their evidence |
| **F** | Fidelity | Fraction of entailed claims that faithfully represent the evidence |
| **Y** | Yield (composite) | Geometric mean of V, E, R, I, F |

Two responses can have identical Y scores but very different profiles. VerifyIndex reports the individual dimensions alongside the composite so that downstream decisions can tell such responses apart.
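
To illustrate how identical composites can hide different profiles, here is a minimal standalone sketch that computes the geometric mean directly, without using the package. The `yield_score` helper and both profiles are hypothetical values chosen for illustration, not library output.

```python
import math

def yield_score(profile):
    """Geometric mean of the five sub-dimension scores (hypothetical helper)."""
    dims = ("V", "E", "R", "I", "F")
    return math.prod(profile[d] for d in dims) ** (1 / len(dims))

# Two made-up profiles: one weak on Fidelity, one weak on Evidence coverage.
low_fidelity = {"V": 1.0, "E": 1.0, "R": 0.8, "I": 1.0, "F": 0.5}
low_coverage = {"V": 1.0, "E": 0.5, "R": 0.8, "I": 1.0, "F": 1.0}

y_a = yield_score(low_fidelity)
y_b = yield_score(low_coverage)

# Both lines print the same composite even though the weak dimension differs.
print(f"Y (low fidelity): {y_a:.3f}")
print(f"Y (low coverage): {y_b:.3f}")
```

Because the composite is a product under a root, a zero in any single dimension drives Y to zero regardless of the other scores.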

## Enterprise Deployment: R-BED Integration

For regulated deployments using the R-BED governance framework:

```python
from verifyindex import VerifyIndex
from verifyindex.rbed import rbed_evidence_report

vi = VerifyIndex()
response_text = "Marie Curie was born in Warsaw in 1867."  # the LLM response to evaluate
result = vi.score(response=response_text, knowledge_source="internal_kb")

# Produce structured evidence for R-BED sub-dimensions
evidence = rbed_evidence_report(
    profile=result.profile,
    thresholds={"V": 0.85, "E": 0.80, "R": 0.75, "I": 0.85, "F": 0.90, "Y": 0.75},
)

for key, finding in evidence.findings.items():
    status = "PASS" if finding["passed"] else "FAIL"
    print(f"{key}: {finding['score']:.3f} ({status}) — R-BED Vertex {finding['vertex']}")
```

Each VerifyIndex sub-dimension maps to specific R-BED sub-dimensions. For the full mapping, see Section 6 of the [paper](https://arxiv.org/abs/XXXX.XXXXX) or Chapters 6 to 8 of the R-BED book.
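
The pass/fail logic in the loop above can be pictured with a plain-Python sketch that does not depend on the package. The `check_thresholds` helper, the inclusive `>=` comparison, and the profile values are all assumptions made for illustration; the actual `rbed_evidence_report` output may be structured differently, and the R-BED vertex mapping is omitted here.

```python
def check_thresholds(profile, thresholds):
    """Compare each score to its minimum; return {dim: (score, passed)}."""
    return {
        dim: (profile[dim], profile[dim] >= minimum)  # inclusive bound is assumed
        for dim, minimum in thresholds.items()
    }

# Hypothetical profile; Y is set close to the geometric mean of the other five.
profile = {"V": 0.90, "E": 0.82, "R": 0.70, "I": 0.88, "F": 0.95, "Y": 0.845}
# Thresholds copied from the example above.
thresholds = {"V": 0.85, "E": 0.80, "R": 0.75, "I": 0.85, "F": 0.90, "Y": 0.75}

findings = check_thresholds(profile, thresholds)
failed = [dim for dim, (_, passed) in findings.items() if not passed]
print("Failed dimensions:", failed)
```

In this made-up case the composite clears its threshold while Retrieval precision does not, which is the kind of situation per-dimension thresholds are meant to catch.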

## Current Status

VerifyIndex 0.1.0 is an **alpha release** providing the package structure, interfaces, and reference stub implementations. Production-grade classifiers for Verifiability and Fidelity are in development and will be released in v0.2.0.

For an early-stage integration, plug in your own retrieval and classifier implementations by subclassing `Retriever` and passing model identifiers to the `VerifyIndex` constructor.
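
As a rough sketch of what a custom retriever might look like, the example below defines a stand-in `Retriever` base class rather than importing the real one, because the actual interface is not documented here. The `retrieve` method name, its `top_k` parameter, and the `KeywordRetriever` class are all hypothetical; check the package source for the real signatures before adapting this.

```python
from abc import ABC, abstractmethod

class Retriever(ABC):
    """Stand-in for verifyindex's Retriever; the real interface may differ."""

    @abstractmethod
    def retrieve(self, claim, top_k=3):
        """Return up to top_k evidence passages for a claim."""

class KeywordRetriever(Retriever):
    """Toy retriever that ranks in-memory documents by shared lowercase tokens."""

    def __init__(self, documents):
        self.documents = list(documents)

    def retrieve(self, claim, top_k=3):
        claim_tokens = set(claim.lower().split())
        scored = []
        for doc in self.documents:
            overlap = len(claim_tokens & set(doc.lower().split()))
            if overlap:  # skip documents with no shared tokens
                scored.append((overlap, doc))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:top_k]]

kb = [
    "Marie Curie was born in Warsaw in 1867.",
    "Paris is the capital of France.",
]
retriever = KeywordRetriever(kb)
hits = retriever.retrieve("Curie was born in Warsaw")
print(hits)
```

How such an instance is handed to `VerifyIndex` depends on the constructor signature, which this sketch does not assume.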

## Citation

If you use VerifyIndex in your research or product, please cite:

```bibtex
@article{srivastava2026verifyindex,
  title={VerifyIndex: A Multi-Dimensional Metric for Evaluating the Evidential Quality of Large Language Model Responses},
  author={Srivastava, Vishal and Sah, Tanmay},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}
```

And, if applicable to your governance context, the R-BED framework:

```bibtex
@book{srivastava2026rbed,
  title={The AI Evaluability Crisis: How to Build Evaluable AI Systems Using R-BED},
  author={Srivastava, Vishal and Sah, Tanmay},
  year={2026},
  publisher={EvaluabilityAI}
}
```

## License

Released under the MIT License. See the `LICENSE` file for details.
