Metadata-Version: 2.4
Name: trace-score
Version: 0.1.1
Summary: Multi-turn LLM Conversation Consistency Metric
Home-page: https://github.com/Giri530/trace-score
Author: Girinath V
Author-email: Girinath V <your-email@gmail.com>
License: MIT License
        
        Copyright (c) 2026 Girinath V
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
        
Project-URL: Homepage, https://github.com/Giri530/trace-score
Project-URL: Repository, https://github.com/Giri530/trace-score
Keywords: nlp,llm,evaluation,multi-turn,consistency,trace-score
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: numpy>=1.21.0
Requires-Dist: torch>=1.11.0
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# TRACE Score

**Multi-turn LLM Conversation Consistency Metric**

> The unified, deterministic, reference-free evaluation metric for
> multi-turn conversational consistency in Large Language Models.

[![PyPI version](https://badge.fury.io/py/trace-score.svg)](https://pypi.org/project/trace-score/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)

---

## The Problem

BLEU, ROUGE, BERTScore, and RAGAS evaluate each conversation turn in isolation. They cannot detect failures that span multiple turns:

| Failure Type | Example | BLEU | ROUGE | BERTScore | TRACE |
|---|---|---|---|---|---|
| Fact forgotten | User says "I am diabetic" at turn 1 → model recommends sugary food at turn 6 | Miss | Miss | Miss | Catch |
| Correction ignored | User corrects model → model reverts to old behavior next turn | Miss | Miss | Miss | Catch |
| Self-contradiction | Model says X at turn 2, contradicts X at turn 7 | Miss | Miss | Miss | Catch |
| Topic drift | Conversation drifts off topic over multiple turns | Miss | Miss | Miss | Catch |

---

## Formula

```
TRACE(C) = Σ(wᵢ · Sᵢ) − λ·P − δ·V + α·(T·C) + β·(A·R)
```

Time-decay aggregation weights recent turns more heavily:

```
Sᵢ = (1/Z) · Σ γ^(N-t) · Sᵢ,ₜ
```

| Component | Measures |
|-----------|---------|
| T — Temporal Retention | Did the model remember user-stated facts? |
| R — Reliability Consistency | Did the model contradict itself? |
| A — Adaptive Correction | Did the model retain user corrections? |
| C — Context Coherence | Did the conversation stay on topic? |
| E — Epistemic Stability | Did the model's confidence stay calibrated? |

Default: γ=0.80, λ=0.15, δ=0.10, α=0.05, β=0.05, all wᵢ=0.20

---

## Install

```bash
pip install trace-score
```

---

## Quick Start

```python
from trace_score import compute_TRACE

conversation = [
    ("user",      "I am diabetic and allergic to nuts."),
    ("assistant", "I will suggest safe low-sugar options."),
    ("user",      "Actually I eat fish too. I am pescatarian."),
    ("assistant", "Spicy chicken with cashews would be great!"),
]

result = compute_TRACE(conversation, verbose=True)

print(result["trace_score"])        # 0.41
print(result["A"])                  # 0.00 — correction ignored
print(result["interpretation"])     # Poor consistency
print(result["formula_breakdown"])
```

---

## Batch Evaluation

```python
from trace_score import TRACEEvaluator

evaluator = TRACEEvaluator()   # models loaded once
results   = [evaluator.evaluate(conv) for conv in conversations]
```

---

## Benchmark Results

Evaluated on **102 multi-turn conversations** (34 templates × 3 runs)
generated by Llama-3.1-8B via Groq API.

### Overall Metric Comparison

| Category | n | TRACE | BLEU | ROUGE-L | BERTScore |
|---|---|---|---|---|---|
| Fact Memory | 36 | 0.688 | 0.048 | 0.172 | 0.796 |
| Correction | 36 | 0.632 | 0.183 | 0.321 | 0.840 |
| Contradiction | 30 | 0.871 | 0.124 | 0.255 | 0.837 |
| **Overall** | **102** | **0.721** | **0.108** | **0.236** | **0.822** |

TRACE category separation range: **0.239**
BERTScore category separation range: **0.044**
TRACE separates 5.4x more than BERTScore across categories.

---

### TRACE Component Breakdown by Category

| Category | T | R | A | C | E |
|---|---|---|---|---|---|
| Fact Memory | 0.137 | 0.955 | 1.000 | 0.503 | 0.697 |
| Correction | 0.491 | 0.927 | 0.144 | 0.465 | 0.712 |
| Contradiction | 0.973 | 0.875 | 0.900 | 0.510 | 0.696 |

The A component (Adaptive Correction) drops to 0.144 for Correction
conversations, revealing that Llama-3.1-8B ignores user corrections
85.6% of the time. BERTScore scores the same conversations at 0.840.
This failure is invisible to all per-turn metrics.

---

### The Gap TRACE Reveals

Conversations where BERTScore is high but TRACE is low:

| Category | TRACE | BERTScore | Gap |
|---|---|---|---|
| Correction | 0.314 | 0.876 | 0.562 |
| Correction | 0.381 | 0.861 | 0.480 |
| Correction | 0.442 | 0.822 | 0.380 |
| Correction | 0.494 | 0.864 | 0.370 |
| Correction | 0.535 | 0.884 | 0.349 |

In all cases, A=0.00 — the model acknowledged the correction but
failed to retain it. BERTScore sees fluent per-turn outputs and reports
high scores. TRACE sees the cross-turn failure.

---

### Human Evaluation

102 conversations rated by 3 annotators on 5 consistency dimensions
(Q1 Memory, Q2 No-Contradiction, Q3 Correction, Q4 Coherence,
Q5 Overall, scale 1-5).

| Annotator | n | Q1 | Q2 | Q3 | Q4 | Q5 |
|---|---|---|---|---|---|---|
| Girinath V | 34 | 4.50 | 4.41 | 4.35 | 4.47 | 4.35 |
| Hari V | 34 | 4.09 | 4.29 | 4.26 | 4.26 | 4.21 |
| Kaarthic VR | 34 | 4.85 | 4.88 | 4.94 | 4.97 | 4.85 |
| Combined | 102 | 4.48 | 4.53 | 4.52 | 4.57 | 4.47 |

Human overall mean: 4.47/5 (0.868 normalized to [0,1]).

---

## Why TRACE?

| Metric | Multi-turn | Reference-free | Deterministic | Time-decay | Diagnostic |
|--------|-----------|----------------|---------------|-----------|-----------|
| BLEU | No | No | Yes | No | No |
| ROUGE | No | No | Yes | No | No |
| BERTScore | No | No | Yes | No | No |
| RAGAS | No | Yes | No | No | Partial |
| TRACE | Yes | Yes | Yes | Yes | Yes |

---

## Models Used

| Model | Purpose | Size |
|-------|---------|------|
| all-MiniLM-L6-v2 | Semantic similarity (T, A, C, E) | 80MB |
| cross-encoder/nli-deberta-v3-small | Contradiction detection (R, A) | 184MB |

Models download automatically on first use. CPU-friendly, no GPU required.

---

## Citation

```bibtex
@article{girinathv2026trace,
  title   = {TRACE: A Unified Deterministic Metric for Multi-turn
             Conversational Consistency in Large Language Models},
  author  = {Girinath.V},
  year    = {2026}
}
```

---

*Author: Girinath V*
*GitHub: https://github.com/Giri530/trace-score*
