Metadata-Version: 2.4
Name: vrty
Version: 1.0.1
Summary: Deterministic LLM-output quality scoring in milliseconds. No AI judge in the loop.
License-Expression: MIT
Project-URL: Homepage, https://github.com/sundeyp/vrty
Project-URL: Repository, https://github.com/sundeyp/vrty
Project-URL: Issues, https://github.com/sundeyp/vrty/issues
Keywords: llm,evaluation,scoring,tf-idf,deterministic
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: <3.12,>=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: dev
Requires-Dist: pytest==8.3.3; extra == "dev"
Dynamic: license-file

# VRTY

[![CI](https://github.com/sundeyp/vrty/actions/workflows/vrty.yml/badge.svg)](https://github.com/sundeyp/vrty/actions/workflows/vrty.yml)
[![PyPI](https://img.shields.io/pypi/v/vrty.svg)](https://pypi.org/project/vrty/)
[![Python 3.11](https://img.shields.io/badge/python-3.11.9-blue.svg)](https://www.python.org/downloads/release/python-3119/)
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![Runtime deps: 0](https://img.shields.io/badge/runtime%20deps-0-brightgreen.svg)](pyproject.toml)

**The deterministic, zero-dependency LLM evaluator. Single-digit milliseconds, no API key, byte-identical across runs.**

*A stdlib alternative to ROUGE for no-reference scoring, and a sanity layer
in front of GPT-as-judge when reproducibility matters.*

VRTY scores a `(prompt, response)` pair on four standard, auditable
dimensions and returns a single composite plus a per-dimension breakdown.
Every formula is a textbook formula you can verify against a reference in
five minutes. There is no LLM call anywhere in the scoring path.

> **What VRTY does not do.** VRTY measures *surface text properties* —
> vocabulary overlap, sentence flow, term coverage, information density.
> **It does not check whether the answer is true.** A confident wrong answer
> that echoes the prompt's vocabulary will score *higher* than a correct
> one-word answer (see [Known properties and limitations](#known-properties-and-limitations):
> `"London is the capital of France."` scores 0.879; `"Paris."` scores 0.350).
> Use VRTY to catch malformed, off-topic, or padded output; pair it with a
> fact-check or human review when correctness matters.

```python
from vrty import score
result = score("What is the capital of France?", "Paris is the capital of France.")
print(result.composite)               # 0.8653358523094898
print(result.explanations["relevance"])  # Relevance: 0.83 - response strongly overlaps with the prompt's key terms.
```

That is the entire 60-second example. Four lines, runs as-is, returns a
score. No configuration, no API key.

> **About that 0.865.** That number is what *factoid* prompts look like —
> short prompt, short answer, heavy vocabulary overlap. Open-ended prompts
> (customer support, instruction-following, prose drafts) typically score
> **0.20 – 0.40** because the response is *expected* not to echo prompt
> vocabulary. VRTY is calibrated *relative to a fixed prompt*, not as an
> absolute quality threshold. See [Calibration bands](#calibration-bands)
> below before setting CI gates.

---

## Install

```sh
pip install vrty
```

Or from source:

```sh
git clone https://github.com/sundeyp/vrty
cd vrty
pip install -e .
```

Determinism is guaranteed only on the pinned interpreter (Python 3.11.9)
with the bundled, checksum-verified IDF data. The scoring path has **zero
third-party runtime dependencies**; everything is Python stdlib. See
[Determinism](#determinism) below.

---

## The four dimensions

| Dimension | Formula | What it measures |
|---|---|---|
| **Relevance** | TF·IDF weighted cosine similarity between prompt and response | How much the response's content overlaps the prompt's content |
| **Coherence** | Mean cosine similarity of adjacent-sentence TF·IDF vectors | How much each sentence shares with the next (topical flow) |
| **Completeness** | IDF-weighted fraction of prompt content terms that appear in the response | How many of the prompt's key terms are addressed |
| **Conciseness** | `\|unique content tokens\| / \|total tokens\|` (content-word type–token ratio) | Information density vs padding |
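
As a rough illustration of the relevance row, the sketch below computes a TF·IDF weighted cosine between two token lists. `TOY_IDF`, `DEFAULT_IDF`, and the whitespace tokenizer are assumptions made for this example. VRTY's own tokenizer, content-word filtering, and bundled IDF data are not reproduced here, so the number it prints is not a VRTY score.

```python
import math
from collections import Counter

# Hypothetical IDF weights for illustration only; VRTY ships its own table.
TOY_IDF = {"capital": 2.0, "france": 3.0, "paris": 3.5,
           "what": 1.0, "is": 1.0, "the": 1.0, "of": 1.0}
DEFAULT_IDF = 4.0  # assumed weight for tokens missing from TOY_IDF


def tfidf_vector(tokens):
    """Map each token to its term frequency times its IDF weight."""
    counts = Counter(tokens)
    return {tok: n * TOY_IDF.get(tok, DEFAULT_IDF) for tok, n in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    # Sum in sorted key order so the float result does not depend on hash order.
    dot = sum(a[t] * b[t] for t in sorted(a.keys() & b.keys()))
    norm_a = math.sqrt(sum(v * v for _, v in sorted(a.items())))
    norm_b = math.sqrt(sum(v * v for _, v in sorted(b.items())))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


prompt = "what is the capital of france".split()
response = "paris is the capital of france".split()
similarity = cosine(tfidf_vector(prompt), tfidf_vector(response))
print(round(similarity, 3))
```

An empty vector on either side returns `0.0` here, which lines up with the empty-prompt behavior documented in the input contract.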

Each dimension returns a value in `[0.0, 1.0]`. The composite is a fixed,
version-locked weighted sum:

```
composite = 0.35 * relevance
          + 0.20 * coherence
          + 0.30 * completeness
          + 0.15 * conciseness
```

The weights are pinned constants, not configurable. Configurability is
explicitly post-v1.0.
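
The weighted sum is easy to check by hand. The sketch below recomputes two composites from dimension values quoted elsewhere in this README: the verbose `"Paris is the capital of France."` response and a terse one-word response. The function name `composite` and the summation order are choices made for this example, not necessarily what the library does internally.

```python
# Weights copied from the formula above.
WEIGHTS = {"relevance": 0.35, "coherence": 0.20, "completeness": 0.30, "conciseness": 0.15}


def composite(dimensions):
    """Weighted sum of the four dimension scores, each in [0.0, 1.0]."""
    # Sum in a fixed key order so the result does not depend on dict ordering.
    return sum(WEIGHTS[name] * dimensions[name] for name in sorted(WEIGHTS))


verbose = {"relevance": 0.8295310065985426, "coherence": 1.0,
           "completeness": 1.0, "conciseness": 0.5}
terse = {"relevance": 0.0, "coherence": 1.0, "completeness": 0.0, "conciseness": 1.0}

print(round(composite(verbose), 3))  # 0.865
print(round(composite(terse), 3))    # 0.35
```

The terse case shows where the 0.350 floor for one-word answers comes from: coherence and conciseness alone contribute `0.20 + 0.15`.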

---

## What you get back

`score()` returns a frozen `VrtyScore` object with a 9-key `to_dict()`:

```python
{
  "composite":       0.8653358523094898,
  "relevance":       0.8295310065985426,
  "coherence":       1.0,
  "completeness":    1.0,
  "conciseness":     0.5,
  "explanations": {
    "relevance":    "Relevance: 0.83 - response strongly overlaps with the prompt's key terms.",
    "coherence":    "Coherence: 1.00 - adjacent sentences carry consistent topic.",
    "completeness": "Completeness: 1.00 - most of the prompt's key terms appear in the response.",
    "conciseness":  "Conciseness: 0.50 - response has moderate information density."
  },
  "vrty_version": "1.0.1",
  "idf_sha256":      "0e475bcaa5524d1e26cbb166bb5c138e37f87e1e47b75e6506c6460a94259fd2",
  "weights":         {"relevance": 0.35, "coherence": 0.20, "completeness": 0.30, "conciseness": 0.15}
}
```

`vrty_version` and `idf_sha256` make every score reproducible — together
they pin the scoring logic and the exact IDF data used.
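
One way to use those two fields is to refuse to compare stored scores unless both match. The helper below is a hypothetical sketch that works on plain dicts shaped like `to_dict()` output; the name `same_scoring_setup` and the trimmed sample payloads are invented for the example.

```python
def same_scoring_setup(a, b):
    """True when two score dicts came from the same code version and IDF data."""
    return (a["vrty_version"] == b["vrty_version"]
            and a["idf_sha256"] == b["idf_sha256"])


# Trimmed stand-ins for two stored `to_dict()` payloads.
old = {"composite": 0.61, "vrty_version": "1.0.0", "idf_sha256": "aaa"}
new = {"composite": 0.64, "vrty_version": "1.0.1", "idf_sha256": "aaa"}

if same_scoring_setup(old, new):
    print("delta:", new["composite"] - old["composite"])
else:
    print("not comparable: re-score both responses with one version")
```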

---

## CLI

```sh
vrty --prompt "What is the capital of France?" \
        --response "Paris is the capital of France."
```

Equivalent stdlib invocation:

```sh
python -m vrty --prompt "..." --response "..."
```

Accepts `--prompt-file PATH` / `--response-file PATH` for long inputs;
`/dev/stdin` works as a file path. `--pretty` indents the JSON.
Exit codes: `0` success, `1` I/O error, `2` argparse error.
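
If you would rather shell out to the CLI from Python than import the library, a thin wrapper could look like the sketch below. It uses only the flags listed above and assumes the CLI writes a single JSON object to stdout. `build_argv` and `score_via_cli` are hypothetical helper names, not part of the package.

```python
import json
import subprocess


def build_argv(prompt, response, pretty=False):
    """Assemble the command line using only the flags documented above."""
    argv = ["vrty", "--prompt", prompt, "--response", response]
    if pretty:
        argv.append("--pretty")
    return argv


def score_via_cli(prompt, response):
    """Run the CLI and parse stdout as JSON (assumed to be one JSON object)."""
    proc = subprocess.run(build_argv(prompt, response), capture_output=True, text=True)
    if proc.returncode != 0:
        # 1 = I/O error, 2 = argparse error, per the exit codes listed above.
        raise RuntimeError(f"vrty exited with {proc.returncode}: {proc.stderr.strip()}")
    return json.loads(proc.stdout)


print(build_argv("What is the capital of France?", "Paris is the capital of France."))
```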

---

## Benchmarks

VRTY is not an embedding-based scorer; if you need semantic similarity that
survives paraphrase, use **BERTScore** or **MoverScore**. VRTY is not n-gram
precision against a reference; if you have reference answers, use **BLEU**
or **ROUGE**. VRTY's niche is *no-reference, no-model, deterministic*
scoring — the gap ROUGE leaves when you don't have a gold reference, and
the gap GPT-as-judge leaves when you need reproducibility.

The table below compares reproducibility, cost, and latency against ROUGE
and LLM-as-judge. VRTY and ROUGE were measured on the same machine with the
same 1000 synthetic (prompt, response) pairs per response-size bucket;
reproduce via `python tools/benchmark.py`. LLM-as-judge cost and latency
are intentionally not measured here: they depend on model choice and
provider pricing, both of which drift. Fill them in for your own model
before relying on the comparison.

|                        | VRTY | ROUGE (rouge-score 0.1.2) | LLM-as-judge |
|------------------------|------|---------------------------|--------------|
| **Reproducibility**    | Byte-identical across processes (pinned Python 3.11.9, asserted in CI on three subprocesses with adversarial `PYTHONHASHSEED` values) | Deterministic for a fixed tokenizer | Non-deterministic; varies with temperature, sampling, model version |
| **Cost per score**     | $0 (no API call) | $0 (local) | $ per call × tokens; measure with your chosen model |
| **Latency p99 — 100 tokens**  | **0.16 ms** | 1.66 ms | typically 500–2000 ms (network + inference) |
| **Latency p99 — 500 tokens**  | **0.52 ms** | 6.66 ms | typically 500–2000 ms |
| **Latency p99 — 2000 tokens** | **2.94 ms** | 25.96 ms | typically 1000–5000 ms |
| **Network required**   | No | No | Yes |
| **Reference hardware** | AMD Ryzen 7 8745HS, 8 cores / 16 threads, 27 GiB RAM, Ubuntu 24.04, Python 3.11.9 | (same) | (varies by provider) |

**Latency claim (v1.0)**: `< 3 ms p99 for responses under 2000 tokens on
AMD Ryzen 7 8745HS`. Reproduce: `python tools/benchmark.py` from a clean
venv with `vrty` and `rouge-score==0.1.2` installed.

VRTY is roughly **9–13× faster than ROUGE** across the input sizes in this
table (10.4× at 100 tokens, 12.8× at 500, 8.8× at 2000) because the scoring
path is pure stdlib with no regex-based stemmer and no sentence-pair grid
construction.
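
To get a p99 figure for your own hardware without the benchmark script, a minimal timing harness might look like this. It is a sketch, not a copy of `tools/benchmark.py`: `p99_latency_ms` and `stand_in_scorer` are hypothetical names, and the stand-in workload should be replaced with the function you actually want to measure.

```python
import statistics
import time


def p99_latency_ms(fn, inputs):
    """Time fn(*args) for each input and return the 99th-percentile latency in ms."""
    samples = []
    for args in inputs:
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns 99 cut points; the last one is the 99th percentile.
    return statistics.quantiles(samples, n=100)[-1]


def stand_in_scorer(prompt, response):
    """Placeholder workload; swap in the scorer you want to measure."""
    return len(set(prompt.split()) & set(response.split()))


pairs = [("what is the capital of france", "paris is the capital of france")] * 1000
print(f"p99: {p99_latency_ms(stand_in_scorer, pairs):.4f} ms")
```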

---

## Calibration bands

Expected composite ranges by prompt type, observed across realistic input.
Use these to set CI gates and user-facing displays — do not assume a single
threshold works across prompt types.

| Prompt type | Typical composite | Use the score as |
|---|---|---|
| Factoid Q&A where the answer echoes prompt vocabulary (`"capital of France?"` → `"Paris is the capital of France."`) | 0.70 – 0.90 | Absolute threshold viable |
| Customer-support / instruction-following | 0.20 – 0.40 | Relative delta from a baseline answer on the *same* prompt |
| Open-ended prose (email drafts, summaries) | 0.15 – 0.35 | Relative delta only |
| Repetition / padding spam with OOV technical terms | can score 0.60+ | Catch by pairing with a length / repetition sanity check |

**Practical rule.** Compute a baseline composite on a known-good response
to your prompt, then gate on `score >= baseline * k` for some
`k ∈ [0.7, 0.9]`. Do not gate on `composite > 0.8` as an absolute: that
gate will reject obviously fine open-ended responses, which typically score
well below 0.8.
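
The relative gate is a one-liner. The sketch below uses made-up composites for an open-ended support prompt; `passes_gate` is a hypothetical helper name and `k=0.8` is just the midpoint of the suggested range.

```python
def passes_gate(candidate, baseline, k=0.8):
    """Relative gate: accept when candidate >= baseline * k."""
    return candidate >= baseline * k


# Hypothetical composites for an open-ended support prompt.
baseline = 0.31

print(passes_gate(0.27, baseline))  # True: 0.27 >= 0.248
print(passes_gate(0.20, baseline))  # False: 0.20 < 0.248
print(0.27 > 0.8)                   # False: an absolute gate rejects the good answer
```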

---

## Determinism

Identical input returns byte-identical output. This guarantee holds under
the following conditions, all of which are documented and enforced:

- **Pinned interpreter**: Python 3.11.9 (CPython, official build or
  python-build-standalone). The CI matrix runs on this version. Other
  3.11.x patch releases are likely to produce identical output but are
  not asserted, and the package's `Requires-Python` constraint excludes
  other minor versions.
- **Pinned IDF data**: `vrty/data/idf.json.gz` ships with the package
  and is SHA-256-verified at import. A modified data file fails fast with
  `VrtyDataError` before any score is computed.
- **Zero third-party runtime dependencies**: the scoring path uses only
  CPython stdlib (`re`, `math`, `collections`, `json`, `gzip`,
  `hashlib`, `importlib.resources`, `unicodedata`). No `numpy`, no
  `scikit-learn`, no BLAS-backed FP variance.
- **Sort-before-reduction**: every set and dict is sorted before any
  floating-point accumulation, so dict-iteration order under
  `PYTHONHASHSEED` randomization cannot change the result.
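
The fail-fast data check can be pictured with the stdlib alone. The sketch below hashes a gzip blob, compares it with a pinned digest, and only then decodes the JSON. It assumes the digest covers the compressed bytes, which this README does not specify. `DataIntegrityError` is a hypothetical stand-in for `VrtyDataError`, and the tiny in-memory blob replaces the real data file.

```python
import gzip
import hashlib
import json


class DataIntegrityError(RuntimeError):
    """Hypothetical stand-in for the `VrtyDataError` named above."""


def load_verified(blob, expected_sha256):
    """Check the digest of the compressed bytes, then decode the JSON payload."""
    actual = hashlib.sha256(blob).hexdigest()
    if actual != expected_sha256:
        raise DataIntegrityError(f"digest mismatch: {actual}")
    return json.loads(gzip.decompress(blob))


# Build a tiny in-memory data file so the sketch runs without package data.
blob = gzip.compress(json.dumps({"capital": 2.0}).encode("utf-8"), mtime=0)
pinned = hashlib.sha256(blob).hexdigest()

print(load_verified(blob, pinned))  # {'capital': 2.0}
```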

The test suite asserts byte-identity on `json.dumps(result.to_dict(),
sort_keys=True)` across three fresh OS subprocesses with `PYTHONHASHSEED`
set to `0`, `12345`, and the CPython default (`random`).
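
A scaled-down version of that check, using a stand-in computation instead of VRTY itself, might look like the following. The child program sums float weights held in a dict built from a set and sorts the keys before accumulating, which is the sort-before-reduction idea from the list above. `CHILD` and `run_with_seed` are names invented for the sketch.

```python
import os
import subprocess
import sys

# Child program: sums float weights keyed by strings that start out in a set.
# Sorting the keys first fixes the accumulation order regardless of hash seed.
CHILD = (
    "weights = {w: len(w) / 7.0 for w in {'paris', 'capital', 'france', 'of', 'the'}}\n"
    "print(repr(sum(weights[w] for w in sorted(weights))))\n"
)


def run_with_seed(seed):
    """Run CHILD in a fresh interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout


outputs = {run_with_seed(seed) for seed in ("0", "12345", "random")}
print("byte-identical" if len(outputs) == 1 else "mismatch")
```

Dropping the `sorted(...)` call in the child would let the summation order follow set iteration order, which is exactly what hash randomization perturbs.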

---

## Self-host

A one-command Docker self-host is shipped alongside the library. See the
[Dockerfile](Dockerfile) for the pinned image and the
[GitHub Actions snippet](.github/workflows/vrty.yml) for CI/CD
integration.

```sh
docker build -t vrty:1.0.1 .
docker run --rm vrty:1.0.1 \
  --prompt "What is the capital of France?" \
  --response "Paris is the capital of France."
```

---

## Known properties and limitations

**Read this section before integrating VRTY into anything load-bearing.**
Seven honest limitations of the v1.0 design.

### 1. VRTY scores surface properties, not factual correctness

The four dimensions measure **term overlap, sentence flow, key-term
coverage, and information density**. They do *not* verify that the response
is factually true. A correct answer that does not echo prompt vocabulary
scores low on relevance and completeness; a confident wrong answer that
echoes prompt vocabulary scores high.

Worked example, prompt = `"What is the capital of France?"`:

| Response                                  | Correct? | Composite | Relevance | Completeness | Conciseness |
|-------------------------------------------|----------|-----------|-----------|--------------|-------------|
| `"Paris is the capital of France."`       | yes      | 0.865     | 0.830     | 1.000        | 0.500       |
| `"London is the capital of France."`      | **no**   | 0.879     | 0.867     | 1.000        | 0.500       |
| `"Paris."`                                | yes      | 0.350     | 0.000     | 0.000        | 1.000       |
| `"London."`                               | **no**   | 0.350     | 0.000     | 0.000        | 1.000       |
| `"Banana."`                               | **no**   | 0.350     | 0.000     | 0.000        | 1.000       |

The verbose incorrect answer scores *higher* than the verbose correct one
(slight IDF asymmetry between `"london"` and `"paris"` in the bundled
corpus); the three terse responses — one correct, two wrong — receive
identical 0.350 scores. **VRTY cannot distinguish them; an external
fact-check must.** Use VRTY to detect malformed, off-topic, or padded
outputs; use a separate fact-check or human review to verify truth.

### 2. Conciseness and completeness intentionally pull against each other

A response that covers every prompt term tends to be longer (lower
conciseness); a terse response tends to omit prompt terms (lower
completeness). This tension is correct behavior, not a bug. Always read
the per-dimension breakdown — a single composite hides the trade-off.

### 3. Single-sentence coherence returns 1.0 by deliberate choice

When the response is one sentence (or zero sentences; a fully empty
response is handled separately and scores 0.0 on every dimension, see
[Input contract](#input-contract)), there is no adjacent-sentence pair that
can disagree, so
coherence is set to 1.0. This is a deliberate v1.0 convention: penalizing
short responses on coherence would double-count what completeness already
measures via prompt-term coverage.
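
The convention is easiest to see in code. This sketch averages the cosine of adjacent sentence vectors and returns 1.0 when there is no pair to compare. It uses raw term counts and whitespace tokens instead of VRTY's TF·IDF vectors, so treat it as an outline of the control flow only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two term-count vectors."""
    dot = sum(a[t] * b[t] for t in sorted(a.keys() & b.keys()))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def coherence(sentences):
    """Mean cosine of adjacent sentence vectors; 1.0 when there is no pair."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    if len(vectors) < 2:
        return 1.0  # the single-sentence convention described above
    pairs = list(zip(vectors, vectors[1:]))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


print(coherence(["Paris is the capital of France."]))               # 1.0
print(coherence(["Paris is the capital.", "Bananas are yellow."]))  # 0.0
```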

### 4. OOV tokens receive maximum IDF weight by deliberate choice

Tokens not present in the bundled IDF corpus are assigned `idf_oov =
log(N+1) + 1`, the value the smoothed IDF formula assigns to a token that
appears in zero documents. This treats unseen words as maximally
informative — the standard add-one (Laplace) smoothing choice — so
technical jargon and proper nouns are not silently dropped to zero weight.
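
To make the OOV value concrete, the sketch below assumes the common add-one smoothed form `log((1 + N) / (1 + df)) + 1`, which reduces to `log(N + 1) + 1` when `df = 0`. The exact smoothed formula VRTY uses is not spelled out here, so `smoothed_idf` is an assumption that merely agrees with the stated OOV value. `N = 5400` is an approximation of the corpus size mentioned under limitation 7.

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Add-one smoothed IDF: log((1 + N) / (1 + df)) + 1 (assumed form)."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0


N = 5400  # approximate pseudo-document count; the exact N is not stated here

idf_oov = smoothed_idf(N, 0)
print(idf_oov == math.log(N + 1) + 1)  # True
print(smoothed_idf(N, 1) < idf_oov)    # True: any seen token weighs less
print(smoothed_idf(N, N))              # 1.0: a token in every document
```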

### 5. Conciseness is a type–token ratio, which is mildly length-sensitive

The conciseness measure (`|unique content tokens| / |total tokens|`) tends
to decline for longer responses because the vocabulary saturates while the
length keeps growing. This is a known property of the type–token ratio
(Hess et al. 1986). Two responses of very different lengths are not
directly comparable on conciseness alone; interpret the conciseness score
together with the other dimensions and the response length.
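
The length sensitivity is visible in a few lines. The sketch below applies the ratio to a short response and to the same response repeated four times. `STOPWORDS` is a hypothetical list chosen for the example; VRTY's own definition of a content token may differ.

```python
# Hypothetical stop-word list; VRTY's own definition of "content token" may differ.
STOPWORDS = {"is", "the", "of", "a", "and", "to", "in"}


def conciseness(tokens):
    """Unique content tokens divided by total tokens."""
    if not tokens:
        return 0.0
    content = {t for t in tokens if t not in STOPWORDS}
    return len(content) / len(tokens)


short = "paris is the capital of france".split()
padded = short * 4  # same vocabulary, four times the length

print(conciseness(short))   # 0.5
print(conciseness(padded))  # 0.125
```

The vocabulary stays at three content words while the denominator grows, so the ratio falls from 0.5 to 0.125.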

### 6. Repetition can score high when prompt terms are out-of-corpus

Because OOV tokens receive maximum IDF weight (limitation 4 above) and
conciseness is a type–token ratio (limitation 5), a response that *repeats*
OOV technical terms (e.g. `"multi-head multi-head attention attention
attention transformer transformer transformer."` against a transformer-
architecture prompt) can score *higher* than a substantive paragraph on the
same prompt. Mitigation: combine the VRTY composite with a basic length /
repetition sanity check, or treat the composite as one signal among
several. This is a known property of TF·IDF-family scorers, not unique to
VRTY.
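
A basic length and repetition check of the kind suggested above could be as small as the following sketch. The tokenizer, the `max_top_share` and `min_tokens` thresholds, and the `repetition_flags` name are all placeholders to tune for your own data; none of them come from VRTY.

```python
import re
from collections import Counter


def repetition_flags(text, max_top_share=0.25, min_tokens=5):
    """Return simple length and repetition signals for a response."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if not tokens:
        return {"too_short": True, "repetitive": False}
    top_count = Counter(tokens).most_common(1)[0][1]
    return {
        "too_short": len(tokens) < min_tokens,
        # Share of the response taken up by its single most frequent token.
        "repetitive": top_count / len(tokens) >= max_top_share,
    }


spam = ("multi-head multi-head attention attention attention "
        "transformer transformer transformer.")
print(repetition_flags(spam))  # {'too_short': False, 'repetitive': True}
print(repetition_flags("Paris is the capital of France."))
```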

### 7. The bundled IDF corpus is 19th-century English literature

IDF weights are computed from ten US-public-domain Project Gutenberg books
(Austen, Melville, Shelley, Doyle, Stoker, Carroll, Wilde, Dickens, Wells,
Thoreau), split into about 5,400 pseudo-documents of 200 tokens each, with
a vocabulary of about 32,000 words.
Modern technical vocabulary like `"API"`, `"endpoint"`, `"deploy"`,
`"kubernetes"`, `"async"` is not in the corpus and falls into the OOV
bucket, where it receives the maximum IDF weight (see limitation 4).

This generally *helps* technical text (rare jargon is correctly treated as
informative) but can cause uneven weighting when one technical term is
in-corpus by coincidence and a similar one is not. **A domain-matched IDF
corpus is explicitly post-v1.0**; v1.0 disclaims this rather than fixes it.
Non-English text scores as-is with no special handling and is similarly
disclaimed.

---

## Input contract

Behavior on degenerate inputs is part of the v1.0 spec, not an afterthought:

| Input                                | Behavior                                                       |
|--------------------------------------|----------------------------------------------------------------|
| Empty response                       | Every dimension and the composite return `0.0`; explanations say "response contained no scorable tokens." |
| Empty prompt                         | Relevance and completeness return `0.0`; coherence and conciseness depend only on the response and score normally |
| Inputs above 2,048 tokens            | Truncated at 2,048 tokens (the `MAX_TOKENS` constant) before scoring; truncation is deterministic |
| Non-English text                     | NFKD-normalized then ASCII-stripped; accented Latin folds to base letters; non-Latin scripts (CJK, Cyrillic, Arabic, ...) drop entirely. Quality outside English is not claimed |
| Response identical to prompt         | Scored normally; no special case |
| Single word                          | Scored normally; no special case |

---

## License

MIT — see [LICENSE](LICENSE).

---

## Versioning

`vrty_version` is included with every score so any historical score is
traceable to the exact scoring logic that produced it. The bundled IDF
data file's SHA-256 (`idf_sha256`) is also returned with every score so
two scores from different builds can be compared at the data-pinning
level, not just the code level. Changing either the scoring logic or the
IDF data invalidates byte-equality guarantees and requires a version bump.

A score from `vrty_version="1.0.0"` will be reproducible on any future
machine that installs `vrty==1.0.0` on Python 3.11.9.
