Metadata-Version: 2.1
Name: trust_eval
Version: 0.1.1
Summary: Metric to measure RAG responses with inline citations
Home-page: https://github.com/shanghongsim/trust-eval
License: CC BY-NC 4.0
Keywords: RAG,evaluation,metrics,citation
Author: Shang Hong Sim
Author-email: simshanghong@gmail.com
Requires-Python: >=3.10,<3.12
Classifier: License :: Other/Proprietary License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Dist: colorlog (>=6.9.0,<7.0.0)
Requires-Dist: fuzzywuzzy (>=0.18.0,<0.19.0)
Requires-Dist: nltk (>=3.9.1,<4.0.0)
Requires-Dist: peft (>=0.14.0,<0.15.0)
Requires-Dist: py3nvml (>=0.2.7,<0.3.0)
Requires-Dist: python-levenshtein (>=0.26.1,<0.27.0)
Requires-Dist: scipy (>=1.14.1,<2.0.0)
Requires-Dist: sentence-transformers (>=3.4.0,<4.0.0)
Requires-Dist: vllm (>=0.6.6.post1,<0.7.0)
Project-URL: Repository, https://github.com/shanghongsim/trust-eval
Description-Content-Type: text/markdown

# Trust Eval

Welcome to **Trust Eval**! 🌟  

A comprehensive tool for evaluating the trustworthiness of inline-cited outputs generated by large language models (LLMs) within the Retrieval-Augmented Generation (RAG) framework. Our suite of metrics measures correctness, citation quality, and groundedness.

This is the official implementation of the metrics introduced in the paper *"Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse"* (accepted at ICLR '25).

## Installation 🛠️

### Prerequisites

- **OS:** Linux  
- **Python:** 3.10 or 3.11 (3.10.13 recommended)  
- **GPU:** Compute capability 7.0 or higher (e.g., V100, T4, RTX20xx, A100, L4, H100)

### Steps

1. **Set up a Python environment**

   ```bash
   conda create -n trust_eval python=3.10.13
   conda activate trust_eval
   ```

2. **Install dependencies**

   ```bash
   pip install trust_eval
   ```

   > **Note:** vLLM will be installed with CUDA 12.1. Please ensure your CUDA setup is compatible.

3. **Set up NLTK**

   Run the following in a Python interpreter (this is a one-time download):

   ```python
   import nltk
   nltk.download('punkt_tab')
   ```

4. **Download benchmark datasets**

   Download the evaluation dataset from [Huggingface](https://huggingface.co/datasets/declare-lab/Trust-Score/tree/main/Trust-Score) and place the folder at the same level as the `prompt` folder (see the demo for an example).
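A rough sketch of the expected layout, using the folder names from the quickstart and the Hugging Face dataset (your exact names may differ; see the demo):

```text
your_project/
├── prompt/                 # prompt templates (from the demo)
├── Trust-Score/            # downloaded benchmark datasets
├── generator_config.yaml
└── eval_config.yaml
```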

## Quickstart 🔥

Evaluate your RAG setup in just eight lines of code.

### Generating Responses

```python
from config import EvaluationConfig, ResponseGeneratorConfig
from evaluator import Evaluator
from logging_config import logger
from response_generator import ResponseGenerator

# Configure the response generator
generator_config = ResponseGeneratorConfig.from_yaml(yaml_path="generator_config.yaml")

# Generate and save responses
generator = ResponseGenerator(generator_config)
generator.generate_responses()
generator.save_responses()
```
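The generator is driven entirely by `generator_config.yaml`, whose schema is defined by `ResponseGeneratorConfig`. The fields below are illustrative placeholders only, not the actual keys; consult the [quickstart](./docs/quickstart/) for the real schema:

```yaml
# Illustrative placeholders -- the real field names come from ResponseGeneratorConfig
model: <your-model-name-or-path>
data_file: <path-to-benchmark-data>
output_file: <where-to-save-responses>
```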

### Evaluating Responses

```python
# Configure the evaluator
evaluation_config = EvaluationConfig.from_yaml(yaml_path="eval_config.yaml")

# Compute and save evaluation metrics
evaluator = Evaluator(evaluation_config)
evaluator.compute_metrics()
evaluator.save_results()
```
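Once `save_results()` completes, the metrics can be inspected like any JSON file. The path below is a hypothetical placeholder; the actual output location depends on your `eval_config.yaml`:

```python
import json
from pathlib import Path

# Hypothetical output path -- adjust to wherever save_results() wrote the metrics.
results_path = Path("eval_results.json")
if results_path.exists():
    results = json.loads(results_path.read_text())
    for metric, value in results.items():
        print(f"{metric}: {value}")
```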

Please refer to [quickstart](./docs/quickstart/) for the complete guide.

## Contact 📬

For questions or feedback, reach out to Shang Hong (`simshanghong@gmail.com`).

## Citation 📝

If you use this software in your research, please cite the [Trust-Eval](https://arxiv.org/abs/2409.11242) paper as follows.

```bibtex
@misc{song2024measuringenhancingtrustworthinessllms,
      title={Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse}, 
      author={Maojia Song and Shang Hong Sim and Rishabh Bhardwaj and Hai Leong Chieu and Navonil Majumder and Soujanya Poria},
      year={2024},
      eprint={2409.11242},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.11242}, 
}
```

