Metadata-Version: 2.4
Name: smallevals
Version: 0.1.2
Summary: Small Language Models Evaluation Suite for RAG Systems
Author-email: Mehmet Burak Sayıcı <mburaksayici@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/mburaksayici/smallevals
Project-URL: Repository, https://github.com/mburaksayici/smallevals
Project-URL: Issues, https://github.com/mburaksayici/smallevals/issues
Project-URL: Documentation, https://github.com/mburaksayici/smallevals#readme
Keywords: rag,evaluation,vector-database,retrieval,nlp,llm,small-language-models
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai>=2.8.1
Requires-Dist: python-dotenv>=1.2.1
Requires-Dist: tqdm>=4.67.1
Requires-Dist: sentence-transformers>=5.1.2
Requires-Dist: pandas>=2.0.0
Requires-Dist: peft>=0.18.0
Requires-Dist: langchain-community>=0.4.1
Requires-Dist: llama-index-core>=0.10.0
Requires-Dist: transformers>=4.30.0; sys_platform == "win32"
Requires-Dist: hf-transfer>=0.1.9; sys_platform == "win32"
Requires-Dist: torch>=2.0.0; sys_platform == "win32"
Requires-Dist: llama-cpp-python>=0.3.16; sys_platform == "darwin" or sys_platform == "linux"
Requires-Dist: transformers>=4.30.0; sys_platform == "darwin" or sys_platform == "linux"
Requires-Dist: torch>=2.0.0; sys_platform == "darwin" or sys_platform == "linux"
Requires-Dist: plotly>=5.0.0
Requires-Dist: dash>=2.14.0
Requires-Dist: dash-bootstrap-components>=1.5.0
Requires-Dist: chromadb>=1.3.5
Requires-Dist: weaviate-client>=4.18.1
Requires-Dist: faiss-cpu>=1.13.0
Requires-Dist: pinecone-client>=3.0.0
Requires-Dist: pymilvus>=2.6.3
Requires-Dist: elasticsearch>=8.11
Requires-Dist: elastic-transport>=8.11
Requires-Dist: psycopg2-binary>=2.9.11
Requires-Dist: sqlalchemy>=2.0.0
Requires-Dist: qdrant-client>=1.16.1
Requires-Dist: pymongo>=4.0.0
Requires-Dist: turbopuffer>=0.1.0
Provides-Extra: test
Requires-Dist: pytest>=9.0.1; extra == "test"
Requires-Dist: pytest-cov>=4.0.0; extra == "test"
Requires-Dist: testcontainers[elasticsearch,weaviate]>=4.13.3; extra == "test"
Requires-Dist: pyarrow>=10.0.0; extra == "test"
Dynamic: license-file

# smallevals <img src="logo/smallevals_emoji_32_32.png" alt="logo" width="32" height="32"> - Small Language Models Evaluation Suite for RAG Systems

A lightweight evaluation framework powered by tiny (really tiny <img src="logo/smallevals_emoji_32_32.png" alt="logo" width="32" height="32">) 0.6B models. It runs 100% locally on CPU, GPU, or MPS, and is fast and cheap.

Evaluation tools that rely on LLM-as-a-judge are costly and do not scale easily. <img src="logo/smallevals_emoji_32_32.png" alt="logo" width="32" height="32"> evaluates in seconds on a GPU and in minutes on any CPU <img src="logo/smallevals_emoji_32_32.png" alt="logo" width="32" height="32"> <img src="logo/smallevals_emoji_32_32.png" alt="logo" width="32" height="32">!

## Evaluate Retrieval

Evaluating a RAG system covers both the retrieval stage and the answer generation stage. <img src="logo/smallevals_emoji_32_32.png" alt="logo" width="32" height="32"> tackles retrieval testing today, and evaluation of RAG answers is coming in the near future!

## Models

| Model Name | Task | Status | Link |
|------------|-------|--------|------|
| **QAG-0.6B** | Generate golden Q/A from chunks (synthetic evaluation data) | Available | [🤗](mburaksayici/golden_generate_qwen_0.6b_v2_gguf) |
| **CRC-0.6B** | Context relevance classifier (question ↔ retrieved chunk) | Incoming | — |
| **GJ-0.6B** | Groundedness / faithfulness judge (answer ↔ context) | Incoming | — |
| **ASM-0.6B** | Answer correctness / semantic similarity | Incoming | — |

**Current Focus**: Retrieval evaluation (QAG-0.6B). Generation evaluation models (CRC-0.6B, GJ-0.6B, ASM-0.6B) are future work.

## Installation

```bash
pip install smallevals
```

## Quick Start

### Evaluate Retrieval Quality (Python)

Connect to your favourite vector DB (Milvus, Elastic, PGVector, Chroma, Pinecone, FAISS, Weaviate), attach your favourite embeddings, generate questions, and visualise the results!

Under the hood, <img src="logo/smallevals_emoji_32_32.png" alt="logo" width="32" height="32"> generates one question per chunk, treats that chunk as the single relevant document for the question, tries to retrieve it, and calculates scores from where it ranks.

```python
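from smallevals import evaluate_retrievals, SmallEvalsVDBConnection

# `chroma_client` and `embedding` must be created beforehand: an existing
# Chroma client and the embedding model used to index the collection.
vdb = SmallEvalsVDBConnection(
    connection=chroma_client,
    collection="my_collection",
    embedding=embedding,
)

# Generate questions for 200 chunks, then try to retrieve each source chunk
# within the top 10 results.
result = evaluate_retrievals(connection=vdb, top_k=10, n_chunks=200)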
```
And evaluate results!
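
The README does not say which scores are calculated. As a rough illustration only, the sketch below computes two common single-relevant-document metrics, hit rate@k and mean reciprocal rank (MRR), from the rank at which each source chunk comes back. The function names and the toy chunk IDs are hypothetical and are not part of the smallevals API.

```python
from typing import Optional, Sequence


def first_hit_rank(retrieved_ids: Sequence[str], source_id: str) -> Optional[int]:
    """Return the 1-based rank of the source chunk, or None if it was not retrieved."""
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id == source_id:
            return rank
    return None


def retrieval_scores(ranks: Sequence[Optional[int]], top_k: int) -> dict:
    """Compute hit rate@k and MRR@k over one rank per generated question."""
    n = len(ranks)
    if n == 0:
        return {"hit_rate": 0.0, "mrr": 0.0}
    # Keep only questions whose source chunk appeared within the top_k results.
    hits = [r for r in ranks if r is not None and r <= top_k]
    return {
        "hit_rate": len(hits) / n,
        "mrr": sum(1.0 / r for r in hits) / n,
    }


# Toy data: (retrieved chunk IDs, ID of the chunk the question was generated from).
results = [
    (["c1", "c7", "c3"], "c1"),  # source chunk ranked first
    (["c9", "c2", "c4"], "c2"),  # source chunk ranked second
    (["c5", "c6", "c8"], "c0"),  # source chunk not retrieved
]
ranks = [first_hit_rank(ids, src) for ids, src in results]
scores = retrieval_scores(ranks, top_k=3)
print(scores)
```

Because each question has exactly one relevant chunk, these two numbers already summarise how often and how high the correct chunk is retrieved.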

### Generate QA from Documents (CLI)

```bash
smallevals --docs-dir ./documents --num-questions 100
```


### **QAG-0.6B**

The model was trained on TriviaQA, SQuAD 2.0, and hand-curated synthetic data generated using Qwen-70B. Given a chunk or document, it generates a question/answer pair, as in the example prompt and output below.


```
Given the passage below, extract ONE question/answer pair grounded strictly in a single atomic fact.

PASSAGE:
"Eiffel tower is built at 1889"

Return ONLY a JSON object.
```

```
{
  "question": "When was the Eiffel Tower completed?",
  "answer": "1889"
}
```
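
A minimal sketch of how the prompt above could be assembled and the model's reply parsed, assuming the reply contains a single JSON object with `question` and `answer` keys as in the example. `build_prompt` and `parse_qa_pair` are hypothetical helper names, not part of smallevals, and the call to the model itself is left out.

```python
import json
from typing import Optional

# Prompt text copied from the example above; {passage} is the chunk to ask about.
PROMPT_TEMPLATE = (
    "Given the passage below, extract ONE question/answer pair "
    "grounded strictly in a single atomic fact.\n\n"
    'PASSAGE:\n"{passage}"\n\n'
    "Return ONLY a JSON object."
)


def build_prompt(passage: str) -> str:
    """Fill the passage into the prompt template."""
    return PROMPT_TEMPLATE.format(passage=passage)


def parse_qa_pair(raw: str) -> Optional[dict]:
    """Extract {"question", "answer"} from a model reply, or None if it is malformed."""
    # Tolerate stray text around the JSON object by slicing from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    question, answer = obj.get("question"), obj.get("answer")
    if not isinstance(question, str) or not isinstance(answer, str):
        return None
    return {"question": question.strip(), "answer": answer.strip()}


prompt = build_prompt("Eiffel tower is built at 1889")
reply = '{"question": "When was the Eiffel Tower completed?", "answer": "1889"}'
qa = parse_qa_pair(reply)
print(qa)
```

Returning `None` for malformed replies lets a caller skip that chunk instead of failing the whole run.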

Known issues:
- The model is trained on text/wiki data, so it is biased towards well-structured text.
- The dataset contains some overly generic questions. It will be curated more carefully in v3.

### Other Models

Other models will be trained to eliminate the need for external LLMs.

- **CRC-0.6B**: Context relevance classifier (question ↔ retrieved chunk)
- **GJ-0.6B**: Groundedness / faithfulness judge (answer ↔ context)
- **ASM-0.6B**: Answer correctness / semantic similarity
