Metadata-Version: 2.4
Name: smallevals
Version: 0.1.8
Summary: Small Language Models Evaluation Suite for RAG Systems
Author-email: Mehmet Burak Sayıcı <mburaksayici@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/mburaksayici/smallevals
Project-URL: Repository, https://github.com/mburaksayici/smallevals
Project-URL: Issues, https://github.com/mburaksayici/smallevals/issues
Project-URL: Documentation, https://github.com/mburaksayici/smallevals#readme
Keywords: rag,evaluation,vector-database,retrieval,nlp,llm,small-language-models
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai>=2.8.1
Requires-Dist: python-dotenv>=1.2.1
Requires-Dist: tqdm>=4.67.1
Requires-Dist: sentence-transformers>=5.1.2
Requires-Dist: pandas>=2.0.0
Requires-Dist: peft>=0.18.0
Requires-Dist: docling>=2.60.0
Requires-Dist: nltk>=3.8
Requires-Dist: numpy<2.3,>=2
Requires-Dist: transformers>=4.30.0; sys_platform == "win32"
Requires-Dist: hf-transfer>=0.1.9; sys_platform == "win32"
Requires-Dist: torch>=2.0.0; sys_platform == "win32"
Requires-Dist: llama-cpp-python>=0.3.16; sys_platform == "darwin" or sys_platform == "linux"
Requires-Dist: plotly>=5.0.0
Requires-Dist: dash>=2.14.0
Requires-Dist: dash-bootstrap-components>=1.5.0
Requires-Dist: chromadb>=1.3.5
Requires-Dist: weaviate-client>=4.18.1
Requires-Dist: faiss-cpu>=1.13.0
Requires-Dist: pinecone-client>=3.0.0
Requires-Dist: pymilvus>=2.6.3
Requires-Dist: elasticsearch>=8.11
Requires-Dist: elastic-transport>=8.11
Requires-Dist: psycopg2-binary>=2.9.11
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: qdrant-client>=1.16.1
Requires-Dist: pymongo>=4.0.0
Requires-Dist: turbopuffer>=0.1.0
Provides-Extra: test
Requires-Dist: pytest>=9.0.1; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Requires-Dist: testcontainers[elasticsearch,weaviate]>=4.13.3; extra == "test"
Requires-Dist: pyarrow>=10.0.0; extra == "test"
Dynamic: license-file

# smallevals <img src="logo/smallevals_emoji_32_32.png" alt="logo" width="32" height="32"> - Small Language Models Evaluation Suite for RAG Systems

A lightweight evaluation framework powered by tiny (really tiny <img src="logo/smallevals_emoji_32_32.png" alt="logo" width="32" height="32">) 0.6B models. It runs 100% locally on CPU, GPU, or MPS: attach any vector DB connection and run, fast and free.

![smallevals demo](logo/demo.gif)

Most evaluation tools rely on LLM-as-a-judge or external APIs, which cost money and don't scale easily. <img src="logo/smallevals_emoji_32_32.png" alt="logo" width="32" height="32"> evaluates in seconds on a GPU and in minutes on any CPU <img src="logo/smallevals_emoji_32_32.png" alt="logo" width="32" height="32"> <img src="logo/smallevals_emoji_32_32.png" alt="logo" width="32" height="32">!


## Evaluate Retrieval

Evaluating a RAG system covers both the retrieval stage and the answer generation stage. <img src="logo/smallevals_emoji_32_32.png" alt="logo" width="32" height="32"> currently tests retrieval; evaluation of RAG answers is planned for the near future!

## Models

| Model Name | Task | Status | Link |
|------------|-------|--------|------|
| **QAG-0.6B** | Generate golden Q/A from chunks or docs (synthetic evaluation data) | Available | [🤗](https://huggingface.co/mburaksayici/golden_generate_qwen_0.6b_v3) |
| **CRC-0.6B** | Context relevance classifier (question ↔ retrieved chunk) | Incoming | — |
| **GJ-0.6B** | Groundedness / faithfulness judge (answer ↔ context) | Incoming | — |
| **ASM-0.6B** | Answer correctness / semantic similarity | Incoming | — |

**Current Focus**: Retrieval evaluation (QAG-0.6B). Once the model reliably generates correct answers and better questions for RAG (it already does, but there is still room for improvement), it will become the first model of a pipeline leading to the RAG generation evaluation models (CRC-0.6B, GJ-0.6B, ASM-0.6B), which are future work.


### How does it work?
The question generator model reads one of your chunks, generates a question that the chunk answers, and then queries your vector DB with that question to check whether the same chunk comes back.

This lets you directly test the retrieval pipeline behind your RAG system. However complex the RAG system is, you can verify that your vector queries work as expected.
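The round trip described above can be sketched in plain Python. This is only an illustration, not the library's implementation: the bag-of-words `retrieve` function stands in for a real vector DB query, the questions are hand-written stand-ins for QAG-0.6B output, and every function name here is hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    stripped = (t.strip(".,?!") for t in text.split())
    return [t.lower() for t in stripped if t]


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, chunks, top_k):
    """Toy stand-in for a vector DB query: rank chunk ids by similarity."""
    q = Counter(tokenize(question))
    ranked = sorted(
        chunks,
        key=lambda cid: cosine(q, Counter(tokenize(chunks[cid]))),
        reverse=True,
    )
    return ranked[:top_k]


def round_trip_hits(chunks, questions, top_k):
    """For each generated question, check that its source chunk is retrieved."""
    return {cid: cid in retrieve(q, chunks, top_k) for cid, q in questions.items()}


chunks = {
    "c1": "The Eiffel Tower was completed in 1889 in Paris.",
    "c2": "Python is a programming language created by Guido van Rossum.",
    "c3": "FAISS is a library for efficient similarity search.",
}
# Hand-written questions standing in for the question generator's output.
questions = {
    "c1": "When was the Eiffel Tower completed?",
    "c2": "Who created the Python programming language?",
    "c3": "What is FAISS a library for?",
}

hits = round_trip_hits(chunks, questions, top_k=1)
hit_rate = sum(hits.values()) / len(hits)
print(hits, hit_rate)
```

A chunk that fails this round trip points to a retrieval problem (embedding, chunking, or index configuration) rather than a generation problem, since the question was derived from that chunk in the first place.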

### Why is this needed?
Other frameworks depend on external APIs, which makes them costly and hard to scale, although their evaluations are better (for now).


## Installation

```bash
pip install smallevals
```

## Quick Start

### Evaluate Retrieval Quality (Python)

Connect to your favourite Vector DB (Milvus, Elastic, PGVector, Chroma, Pinecone, FAISS, Weaviate), attach your favourite embeddings, generate questions, and visualise results!

Under the hood, <img src="logo/smallevals_emoji_32_32.png" alt="logo" width="32" height="32"> generates one question per chunk, queries the vector DB with it, treats the source chunk as the single relevant document, and calculates retrieval scores.

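```python
import chromadb
from sentence_transformers import SentenceTransformer

from smallevals import evaluate_retrievals, SmallEvalsVDBConnection

# Illustrative setup (assumption): the path and model name are placeholders;
# substitute your own vector DB client and embedding model.
chroma_client = chromadb.PersistentClient(path="./chroma_db")
embedding = SentenceTransformer("all-MiniLM-L6-v2")

vdb = SmallEvalsVDBConnection(
    connection=chroma_client,
    collection="my_collection",
    embedding=embedding,
)

# Generate questions for 200 chunks, then try to retrieve each source chunk
result = evaluate_retrievals(connection=vdb, top_k=10, n_chunks=200)
```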
Then explore the results in the dashboard:

```bash
smallevals dash --host 0.0.0.0 --port 8050 --debug
```
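This README does not spell out which retrieval scores are calculated. As an illustration only, two common rank-based metrics for the "one relevant chunk per question" setup are hit rate@k and mean reciprocal rank (MRR). The function names and the `ranks` input format below are assumptions made for this sketch, not part of the smallevals API.

```python
from typing import Optional, Sequence


def hit_rate_at_k(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of questions whose source chunk was retrieved at rank <= k.

    Each entry is the 1-based rank of the source chunk, or None if it was
    not retrieved at all.
    """
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank, counting missing chunks as 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks for four generated questions; None means "not retrieved".
ranks = [1, 2, None, 4]

print(hit_rate_at_k(ranks, k=1))    # 0.25
print(hit_rate_at_k(ranks, k=3))    # 0.5
print(mean_reciprocal_rank(ranks))  # 0.4375
```

Hit rate@k answers "did the right chunk show up at all within the top k", while MRR also rewards placing it near the top.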

### Generate QA from Documents (CLI)

```bash
smallevals generate_qa --docs-dir ./documents --num-questions 100
```


### **QAG-0.6B**

The model was trained on TriviaQA, SQuAD 2.0, and hand-curated synthetic data generated with Qwen-70B. Its task is to generate a question from a chunk or document.


```
Given the passage below, extract ONE question/answer pair grounded strictly in a single atomic fact.

PASSAGE:
"Eiffel tower is built at 1889"

Return ONLY a JSON object.
```

```
{
  "question": "When was the Eiffel Tower completed?",
  "answer": "1889"
}
```
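A small sketch of how the prompt above could be assembled and the JSON reply validated. The prompt text is copied from the example; the helper names `build_prompt` and `parse_qa` are hypothetical, and how the model itself is invoked is deliberately left out.

```python
import json
from typing import Dict, Optional

PROMPT_TEMPLATE = (
    "Given the passage below, extract ONE question/answer pair "
    "grounded strictly in a single atomic fact.\n\n"
    'PASSAGE:\n"{passage}"\n\n'
    "Return ONLY a JSON object."
)


def build_prompt(passage: str) -> str:
    """Fill the passage into the prompt template."""
    return PROMPT_TEMPLATE.format(passage=passage)


def parse_qa(raw: str) -> Optional[Dict[str, str]]:
    """Extract a {'question', 'answer'} dict from raw model output.

    Returns None when no valid JSON object with both string fields is found.
    """
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    question, answer = data.get("question"), data.get("answer")
    if not isinstance(question, str) or not isinstance(answer, str):
        return None
    return {"question": question.strip(), "answer": answer.strip()}


prompt = build_prompt("The Eiffel Tower was completed in 1889.")
reply = '{"question": "When was the Eiffel Tower completed?", "answer": "1889"}'
print(parse_qa(reply))
```

Rejecting malformed replies up front keeps badly formed generations out of the evaluation set instead of letting them skew retrieval scores.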

Known issues:
- The model is trained on text/wiki data, so it is biased towards well-structured text.
- The dataset contains some overly generic questions; it will be crafted more carefully in v3.

### Other Models

Other models will be trained to eliminate the need for external LLMs.

- **CRC-0.6B**: Context relevance classifier (question ↔ retrieved chunk)
- **GJ-0.6B**: Groundedness / faithfulness judge (answer ↔ context)
- **ASM-0.6B**: Answer correctness / semantic similarity
