Metadata-Version: 2.4
Name: llm-reasoning-quality
Version: 1.0.0
Summary: A multi-dimensional behavioral framework for evaluating LLM reasoning quality beyond accuracy
Author: Garima Agrawal, Huan Liu
Author-email: Ali Şenol <alisenol@tarsus.edu.tr>
License: MIT
Project-URL: Homepage, https://github.com/senolali/LLM-Reasoning-Quality-Evaluation-Metrics
Project-URL: Repository, https://github.com/senolali/LLM-Reasoning-Quality-Evaluation-Metrics
Project-URL: Bug Tracker, https://github.com/senolali/LLM-Reasoning-Quality-Evaluation-Metrics/issues
Project-URL: Paper, https://arxiv.org/abs/2605.24661
Keywords: llm,evaluation,reasoning,nlp,large-language-models,benchmarking,consistency,robustness,logical-coherence
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Requires-Dist: pyyaml>=6.0
Requires-Dist: openai>=1.0.0
Requires-Dist: anthropic>=0.20.0
Requires-Dist: google-generativeai>=0.5.0
Requires-Dist: datasets>=2.14.0
Requires-Dist: sentence-transformers>=2.2.0
Requires-Dist: bert-score>=0.3.13
Requires-Dist: openpyxl>=3.1.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: requests>=2.31.0
Provides-Extra: local
Requires-Dist: torch>=2.0.0; extra == "local"
Requires-Dist: transformers>=4.40.0; extra == "local"
Requires-Dist: accelerate>=0.20.0; extra == "local"
Requires-Dist: bitsandbytes>=0.41.0; extra == "local"
Requires-Dist: sentencepiece>=0.1.99; extra == "local"
Provides-Extra: viz
Requires-Dist: matplotlib>=3.7.0; extra == "viz"
Requires-Dist: seaborn>=0.12.0; extra == "viz"
Provides-Extra: all
Requires-Dist: llm-reasoning-quality[local,viz]; extra == "all"

# LLM Reasoning Quality Evaluation Framework

> A config-driven, multi-dimensional framework for evaluating reasoning quality in Large Language Models — beyond simple answer correctness.

**6 metrics · 8 models (4 API + 4 local) · 4 benchmark datasets · no code changes needed to add models or datasets**

---

## Table of Contents

- [Overview](#overview)
- [Metrics](#metrics)
- [Models](#models)
- [Datasets](#datasets)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Configuration](#configuration)
- [Outputs](#outputs)
- [Project Structure](#project-structure)
- [Adding New Models](#adding-new-models)
- [Adding New Datasets](#adding-new-datasets)
- [Known Issues & Platform Notes](#known-issues--platform-notes)
- [Citation](#citation)

---

## Overview

Standard LLM evaluation asks: *"Is the answer correct?"*

This framework asks: *"How well does the model reason?"*

It evaluates models across **6 complementary dimensions** of reasoning quality, producing a composite score that captures correctness, behavioral stability, robustness, logical integrity, and efficiency simultaneously.

```
Q = f(CQ, CS, RS, LS, ES, SS)
```

The framework is fully **config-driven** — models, datasets, metrics, and aggregation weights are all controlled from a single YAML file. No code changes are needed for common use cases.

---

## Metrics

| Symbol | Name | Formula | What It Measures |
|--------|------|---------|--------------------|
| **CQ** | Correctness | `(1/N) Σ I(ŷᵢ = yᵢ)` | Fraction of correct answers |
| **CS** | Consistency | `(2/K(K−1)) Σ I(ŷᵢ⁽ᵏ⁾ = ŷᵢ⁽ˡ⁾)` | Same answer across K repeated runs (pairwise)? |
| **RS** | Robustness | `(1/N) Σ (1/P) Σ I(ŷᵢ = ŷᵢᵖ) · I(yᵢ = ŷᵢ)` | Same answer on semantically equivalent rephrases? |
| **LS** | Logical Coherence | `1 − (1/N) Σ (1/(nᵢ−1)) Σ ψ(sⱼ, sⱼ₊₁)` | No contradictions between consecutive reasoning steps? |
| **ES** | Efficiency | Harmonic mean of CQ and inverse normalized token count | Correct **and** concise? |
| **SS** | Stability | `(2/K(K−1)) Σ BERTScore(Tᵢ⁽ᵏ⁾, Tᵢ⁽ˡ⁾)` | Same reasoning *process* across K runs? |

### Key design decisions

**CQ — Multi-strategy matching pipeline:** Raw model outputs are often verbose (e.g. *"John has 8 apples."* instead of *"8"*). The correctness metric applies 7 sequential matching strategies before marking an answer wrong: exact match → normalized → number extraction → yes/no extraction → A/B/C/D extraction → substring match → numeric tolerance. This prevents local models from being penalized purely for output format.
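
The fall-through idea can be sketched in a few lines. This is a simplified illustration, not the code in `metrics/accuracy.py`: the function names, the regular expressions, and the rule that a numeric gold answer is compared against the *last* number in the output are all assumptions made for the example.

```python
import re

_NUMBER = r"-?\d+(?:\.\d+)?"

def _normalize(text: str) -> str:
    """Lowercase, strip most punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s.\-]", "", text.lower())
    return " ".join(text.split()).strip(".")

def _last_number(text: str):
    """Return the last number found in the text as a float, or None."""
    found = re.findall(_NUMBER, text.replace(",", ""))
    return float(found[-1]) if found else None

def is_correct(prediction: str, gold: str, tol: float = 1e-6) -> bool:
    """Hypothetical matcher: try cheap strategies first, then looser ones."""
    pred, ref = prediction.strip(), gold.strip()
    # Exact match, then normalized match.
    if pred == ref or _normalize(pred) == _normalize(ref):
        return True
    # Numeric gold answer: compare against the last number, within a tolerance.
    ref_plain = ref.replace(",", "")
    if re.fullmatch(_NUMBER, ref_plain):
        value = _last_number(pred)
        return value is not None and abs(value - float(ref_plain)) <= tol
    # Yes/no gold answer: use the first yes/no word in the output.
    norm_ref = _normalize(ref)
    if norm_ref in {"yes", "no"}:
        hit = re.search(r"\b(yes|no)\b", pred.lower())
        return hit is not None and hit.group(1) == norm_ref
    # Multiple-choice gold answer: use the first standalone capital A-D.
    if re.fullmatch(r"[A-D]", ref):
        hit = re.search(r"\b([A-D])\b", pred)
        return hit is not None and hit.group(1) == ref
    # Last resort: normalized substring match.
    return bool(norm_ref) and norm_ref in _normalize(pred)
```

Deciding numeric answers numerically, rather than letting them fall through to the substring check, avoids counting `"18"` as a match for `"8"`.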

**RS — Conditioned on correctness:** Robustness is only counted for questions the model originally answered correctly. Without this condition, a model that gives the same wrong answer to every rephrasing would trivially score RS=1.0.
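
A minimal sketch of this conditioning, following the RS formula in the table (which divides by N, so items answered incorrectly contribute zero). The function name and the use of plain string equality are assumptions for illustration; the framework's matching is more lenient.

```python
def robustness(gold, original, perturbed):
    """RS sketch: per-item agreement on rephrasings, counted only when
    the original answer was correct, averaged over all N items."""
    if not gold:
        return 0.0
    total = 0.0
    for y, y_hat, variants in zip(gold, original, perturbed):
        if y_hat != y or not variants:
            continue  # wrong originals (or items without rephrasings) add 0
        total += sum(v == y_hat for v in variants) / len(variants)
    return total / len(gold)

gold = ["4", "yes"]
original = ["4", "no"]                          # second item is answered wrongly
perturbed = [["4", "4", "5"], ["no", "no", "no"]]
rs = robustness(gold, original, perturbed)      # (2/3 + 0) / 2
```

The second item is perfectly stable across rephrasings but wrong, so it earns no robustness credit.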

**LS — NLI-based contradiction detection:** Uses `cross-encoder/nli-deberta-v3-small` to detect contradictions between consecutive reasoning steps. Falls back gracefully to LS=1.0 if the NLI model is unavailable.
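
The LS formula and its fallback can be sketched with a pluggable contradiction detector standing in for ψ. Everything here is illustrative: `logical_coherence` and the toy detector are hypothetical names, and treating a single-step trace as fully coherent is an assumption. In the framework, ψ would be backed by the NLI model named above.

```python
def logical_coherence(traces, contradicts=None):
    """LS sketch: 1 minus the mean contradiction rate over consecutive steps.

    `traces` is a list of reasoning traces, each a list of step strings.
    `contradicts(a, b)` plays the role of psi; None models the case where
    no NLI model is available, which falls back to LS = 1.0.
    """
    if contradicts is None:
        return 1.0
    scores = []
    for steps in traces:
        pairs = list(zip(steps, steps[1:]))
        if not pairs:
            scores.append(1.0)  # assumption: nothing to contradict
            continue
        rate = sum(bool(contradicts(a, b)) for a, b in pairs) / len(pairs)
        scores.append(1.0 - rate)
    return sum(scores) / len(scores) if scores else 1.0

# Toy detector for the example only: a step that negates the previous one.
def toy_detector(a, b):
    return b == "not " + a

traces = [["x > 0", "not x > 0", "done"], ["a", "b"]]
ls = logical_coherence(traces, toy_detector)
```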

**ES — Harmonic mean:** Prevents a short-but-wrong or long-but-correct response from scoring well. Both correctness and conciseness must be high for ES to be high.
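
A short sketch of why the harmonic mean behaves this way. The normalization used here (conciseness as one minus the mean token count divided by a token budget) is an assumption for the example; the framework's exact normalization is not spelled out in this README.

```python
def efficiency(cq: float, mean_tokens: float, token_budget: float) -> float:
    """ES sketch: harmonic mean of correctness and a conciseness score."""
    # Assumed normalization: fewer tokens relative to the budget is better.
    conciseness = min(max(1.0 - mean_tokens / token_budget, 0.0), 1.0)
    if cq + conciseness == 0:
        return 0.0
    return 2 * cq * conciseness / (cq + conciseness)

concise = efficiency(cq=0.9, mean_tokens=64, token_budget=512)
verbose = efficiency(cq=0.9, mean_tokens=500, token_budget=512)
wrong = efficiency(cq=0.0, mean_tokens=10, token_budget=512)
```

With the same correctness of 0.9, the verbose configuration scores below 0.1 because the harmonic mean is dominated by the smaller of its two inputs; an arithmetic mean would have reported roughly 0.46.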

**SS — BERTScore similarity:** Measures semantic similarity between reasoning traces across runs, not just whether the final answer matches. Falls back to Jaccard similarity if `bert-score` is not installed.
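
The fallback path can be illustrated with token-set Jaccard similarity averaged over all pairs of runs, which mirrors the pairwise structure of the SS formula. The whitespace tokenization and function names are assumptions for this sketch.

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase token sets of two traces."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def stability(traces) -> float:
    """SS sketch: mean pairwise similarity across K reasoning traces."""
    pairs = list(combinations(traces, 2))
    if not pairs:
        return 1.0  # assumption: a single run is trivially stable
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

runs = ["add 2 and 2", "add 2 and 2", "sum 2 with 2"]
ss = stability(runs)  # pair scores: 1.0, 0.2, 0.2
```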

**CS/SS and temperature:** Running with `deterministic: true` (temperature=0) produces CS=SS=1.0 for all models — this is a mathematical artifact, not a real measurement. Set `temperature: 0.7` per model in config to get meaningful CS/SS scores.
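
The artifact is easy to see from the pairwise definition of CS. The sketch below uses plain string equality on final answers, which is a simplification of the framework's matching.

```python
from itertools import combinations

def consistency(answers) -> float:
    """CS sketch: fraction of run pairs that produced the same final answer."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # assumption: a single run is trivially consistent
    return sum(a == b for a, b in pairs) / len(pairs)

greedy = consistency(["8", "8", "8"])    # identical runs, as with temperature=0
sampled = consistency(["8", "8", "9"])   # one of three runs disagrees
```

Identical outputs always give 1.0 regardless of the model, so the score only becomes informative once sampling introduces variation between runs.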


### Aggregation strategies

Seven built-in weighting schemes are computed for every experiment. All appear as separate columns in the Excel output.

| Strategy            | CQ   | CS   | RS   | LS   | ES   | SS   | Use case                          |
| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | --------------------------------- |
| Balanced            | 1/6  | 1/6  | 1/6  | 1/6  | 1/6  | 1/6  | General comparison                |
| Safety Priority     | 0.30 | 0.20 | 0.30 | 0.10 | 0.05 | 0.05 | High-stakes deployment            |
| Accuracy Priority   | 0.40 | 0.25 | 0.15 | 0.10 | 0.05 | 0.05 | Accuracy-critical tasks           |
| Efficiency Priority | 0.20 | 0.15 | 0.15 | 0.10 | 0.30 | 0.10 | Resource-constrained deployment   |
| Medical Triage      | 0.40 | 0.05 | 0.30 | 0.20 | 0.03 | 0.02 | Clinical decision support         |
| Legal/Compliance    | 0.15 | 0.25 | 0.20 | 0.35 | 0.03 | 0.02 | Audit-sensitive applications      |
| Edge Device/IoT     | 0.30 | 0.03 | 0.10 | 0.05 | 0.50 | 0.02 | Resource-limited edge deployment  |

Custom strategies can be added directly in `config.yaml`; no code changes are needed.

---

## Models

| # | Model | Provider | Type | Parameters | RAM estimate |
|---|-------|----------|------|------------|-------------|
| 1 | GPT-4o-mini | OpenAI | API | — | — |
| 2 | Gemini 2.0 Flash | Google | API | — | — |
| 3 | DeepSeek-V3 | DeepSeek | API | — | — |
| 4 | Groq LLaMA-3.3-70B | Groq | API (OpenAI-compatible) | — | — |
| 5 | Phi-2 | Microsoft | Local (HF) | 2.7B | ~6 GB (float32) |
| 6 | Qwen2.5-1.5B-Instruct | Alibaba | Local (HF) | 1.5B | ~4 GB (float32) |
| 7 | Mistral-7B-Instruct-v0.3 | Mistral AI | Local (HF) | 7B | ~5 GB (4-bit) |
| 8 | LLaMA-3-8B-Instruct | Meta | Local (HF) | 8B | ~6 GB (4-bit) |
| — | Claude Haiku 4.5 | Anthropic | API | — | — |

Local models are **loaded one at a time** and released from RAM before the next model loads — allowing evaluation on machines without enough RAM to hold all models simultaneously.

HuggingFace models are **downloaded automatically** on first run and cached in `~/.cache/huggingface/`.

---

## Datasets

| Dataset | Type | Size used | Answer format | Source |
|---------|------|-----------|---------------|--------|
| Synthetic | Auto-generated | Configurable | Mixed | Built-in |
| GSM8K | Math word problems | 250 (default) | Numerical | `openai/gsm8k` |
| StrategyQA | Commonsense reasoning | 250 (default) | Yes / No | `wics/strategy-qa` |
| MMLU | Multi-subject knowledge | 225 (default) | A / B / C / D | `cais/mmlu` |

> **Note:** MMLU loads 225 items by default (not 250) because the `moral_reasoning` subject does not exist in the `cais/mmlu` dataset. The framework skips missing subjects automatically and continues with the remaining ones.

The **synthetic dataset** generates three categories of questions automatically: arithmetic/logic reasoning, adversarial questions (designed to expose brittle reasoning), and robustness items (paraphrase pairs).

Custom JSON datasets can also be added — see [Adding New Datasets](#adding-new-datasets).

---

## Installation

### Prerequisites

- Python 3.11 (recommended)
- Miniconda or Anaconda

> ⚠️ **PyTorch must be installed first and separately** — platform-specific instructions below. Do not run `pip install -r requirements.txt` before PyTorch is installed.

---

### Windows + NVIDIA GPU (tested & recommended)

**Tested configuration:** GTX 1650 4 GB · CUDA 12.1 · Python 3.11 · PyTorch 2.4.0 · bitsandbytes 0.44.0

> ⚠️ Use `conda install` for PyTorch on Windows — **not** `pip install torch --index-url`. The pip CUDA wheels cause `fbgemm.dll` or `cusparse64_11.dll` errors on many Windows systems. Conda resolves all DLL dependencies automatically.

**Step 1 — Create environment:**
```powershell
conda create -n llm_eval_gpu python=3.11 -y
conda activate llm_eval_gpu
```

**Step 2 — Install PyTorch via conda (CUDA 12.1):**
```powershell
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
```

**Step 3 — Verify GPU:**
```powershell
python -c "import torch; print('CUDA:', torch.cuda.is_available()); print('GPU:', torch.cuda.get_device_name(0))"
```
Expected: `CUDA: True` and your GPU name.

**Step 4 — Install bitsandbytes (pinned version):**
```powershell
pip install bitsandbytes==0.44.0
```

**Step 5 — Install remaining dependencies:**
```powershell
pip install -r requirements.txt
pip install transformers -U
```

---

### Windows CPU-only

**Step 1 — Create environment:**
```powershell
conda create -n llm_eval python=3.11 -y
conda activate llm_eval
```

**Step 2 — Install PyTorch CPU wheel (max 2.3.x):**
```powershell
pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cpu
```
> ⚠️ Do NOT use PyTorch 2.4+ on Windows CPU — it causes `fbgemm.dll` errors.

**Step 3 — Pin transformers:**
```powershell
pip install "transformers==4.45.2"
```

**Step 4 — Install remaining dependencies:**
```powershell
pip install -r requirements.txt
```

---

### Linux / macOS

```bash
conda create -n llm_eval_gpu python=3.11 -y
conda activate llm_eval_gpu

# Option A: GPU (CUDA 12.1, tested)
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install bitsandbytes==0.44.0

# Option B: CPU-only (run this instead of the two GPU lines above)
pip install torch

pip install -r requirements.txt
pip install transformers -U
```

---

## Quick Start

### 1. Set API keys

**Windows PowerShell:**
```powershell
$env:OPENAI_API_KEY    = "sk-..."
$env:ANTHROPIC_API_KEY = "sk-ant-..."
$env:GOOGLE_API_KEY    = "AIza..."
$env:DEEPSEEK_API_KEY  = "sk-..."
```

**Linux / macOS:**
```bash
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export GOOGLE_API_KEY="AIza..."
export DEEPSEEK_API_KEY="sk-..."
```

> Models with missing API keys are **automatically skipped** — you don't need all keys to run the framework.
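
The skipping behavior can be pictured as a filter over the `models` list from the config. This is a hypothetical sketch, not the logic in `main.py`: the function name is invented, and only the `api_key_env` field is taken from the configuration format shown in this README.

```python
import os

def runnable_models(models, environ=os.environ):
    """Keep models whose API key variable is set, plus models that need no key."""
    kept = []
    for model in models:
        key_name = model.get("params", {}).get("api_key_env")
        if key_name and not environ.get(key_name):
            print(f"Skipping {model['name']}: {key_name} is not set")
            continue
        kept.append(model)
    return kept

models = [
    {"name": "GPT-4o-mini", "type": "openai",
     "params": {"api_key_env": "OPENAI_API_KEY"}},
    {"name": "Qwen2.5-1.5B", "type": "local", "params": {}},
]
# With an empty environment only the local model remains.
available = runnable_models(models, environ={})
```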

### 2. Run full evaluation

```bash
python main.py --config config/config.yaml
```

Or specify a custom config:
```bash
python main.py --config config/my_experiment.yaml
```

---

## Configuration

Everything is controlled from a single YAML file. The default is `config/config.yaml`.

### Experiment settings

```yaml
experiment:
  name: "my_experiment"     # Used as prefix for output folder name
  seed: 42                  # Random seed for reproducibility
  deterministic: true       # true = greedy decoding (temperature=0)
  output_dir: "outputs"     # Where results are saved

max_workers: 1              # Always set to 1 for local models — parallel loading
                            # causes meta tensor errors and CUDA OOM
```

### Adding an API model

```yaml
models:
  - name: "GPT-4o"          # Display name (appears in Excel / radar chart)
    type: "openai"           # openai | anthropic | gemini | deepseek | local | mock
    params:
      model_id: "gpt-4o"
      api_key_env: "OPENAI_API_KEY"   # Environment variable name
      max_tokens: 512
      temperature: 0.7       # Optional — overrides deterministic setting for CS/SS
      max_retries: 3
      timeout: 60
```

### Adding a local HuggingFace model

```yaml
models:
  - name: "Qwen2.5-1.5B"
    type: "local"
    params:
      model_id: "Qwen/Qwen2.5-1.5B-Instruct"
      device: "cuda"
      use_4bit: true                   # Attempts 4-bit; falls back to float16 if unsupported
      max_new_tokens: 64               # 64 is sufficient for all benchmark answer formats
      temperature: 0.7
```

**Memory guide for local models:**

| Model size | `use_4bit` | VRAM needed |
|---|---|---|
| 1.5B–2.7B | `false` | ~4–6 GB (float32) |
| 1.5B–2.7B | `true` | ~1.5–2 GB (4-bit, may fall back to float16) |
| 7B–8B | `true` | ~5–6 GB (4-bit, required) |

> **Note on 4-bit fallback:** For small models (Qwen2.5-1.5B, Phi-2) on some hardware/driver configurations, 4-bit loading may fail with a meta tensor error. The framework catches this automatically and falls back to float16 CUDA. The `copying from a non-meta parameter` warnings in the log are expected in this case and do not affect results.

> **Note on `max_new_tokens`:** 64 tokens is sufficient for numerical answers (GSM8K, synthetic), Yes/No answers (StrategyQA), and A/B/C/D answers (MMLU). Longer explanations may be truncated, but this does not affect metric scoring since answer extraction reads the first matching token.

### Metric settings

```yaml
metrics:
  consistency_runs: 3           # K — number of repeated runs per question for CS
  robustness_perturbations: 3   # P — number of paraphrase variants per question for RS
  stability_runs: 3             # K — number of repeated runs per question for SS
  nli_model: "cross-encoder/nli-deberta-v3-small"
  bertscore_model: "distilbert-base-uncased"
```

> **Performance note for local models:** Each item requires `1 + consistency_runs + robustness_perturbations` inference calls. With the defaults (3+3) this is 7 calls/item. At ~7–8 s/call on a GTX 1650 (float16), 975 items take approximately **12–15 hours per local model**. Reducing to `consistency_runs: 2` and `robustness_perturbations: 2` brings this to ~8–10 hours.
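
The estimate follows directly from the call count. A small helper (hypothetical, not part of the framework) makes it easy to budget a run before starting it; the 7.5 s/call figure is simply the midpoint of the range quoted above.

```python
def estimated_hours(num_items, consistency_runs, robustness_perturbations,
                    seconds_per_call):
    """Rough wall-clock estimate for one local model, in hours."""
    calls_per_item = 1 + consistency_runs + robustness_perturbations
    return num_items * calls_per_item * seconds_per_call / 3600

default_run = estimated_hours(975, 3, 3, seconds_per_call=7.5)   # about 14.2
reduced_run = estimated_hours(975, 2, 2, seconds_per_call=7.5)   # about 10.2
```

Actual times depend on hardware, prompt length, and `max_new_tokens`, so treat the result as an order-of-magnitude guide.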

### Adding a custom aggregation strategy

```yaml
aggregation:
  strategies:
    my_strategy:
      correctness:       0.50
      robustness:        0.30
      logical_coherence: 0.20
      consistency:       0.00
      efficiency:        0.00
      stability:         0.00
```

Weights are **auto-normalized** if they don't sum exactly to 1.0.
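
A minimal sketch of auto-normalization followed by a weighted sum, assuming the composite Q is a plain weighted average of the metric values. The function names are hypothetical; the metric keys match the YAML example above.

```python
def normalize_weights(weights):
    """Rescale weights so they sum to 1.0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: value / total for name, value in weights.items()}

def composite_score(metrics, weights):
    """Weighted composite over the metrics named in `weights`."""
    normalized = normalize_weights(weights)
    return sum(w * metrics.get(name, 0.0) for name, w in normalized.items())

# Weights given as 5 : 3 : 2 are rescaled to 0.5 / 0.3 / 0.2.
weights = {"correctness": 5, "robustness": 3, "logical_coherence": 2}
metrics = {"correctness": 0.8, "robustness": 0.6, "logical_coherence": 1.0}
q = composite_score(metrics, weights)
```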

### Temperature and CS/SS measurement

By default, `deterministic: true` sets temperature=0. This causes CS=SS=1.0 for all models (deterministic models always produce the same output — this is a mathematical artifact, not a meaningful measurement).

To get meaningful CS/SS scores, add `temperature: 0.7` per model:

```yaml
experiment:
  deterministic: true    # Keep this — only temperature param overrides it

models:
  - name: "GPT-4o-mini"
    type: "openai"
    params:
      model_id: "gpt-4o-mini"
      temperature: 0.7    # ← This overrides deterministic for this model only
```

---

## Outputs

All results are saved to `outputs/<experiment_name>_<timestamp>/`:

| File | Description |
|------|-------------|
| `reasoning_quality_results.xlsx` | Full results: raw metrics, all aggregation strategies, per-dataset breakdown, metadata |
| `radar_plot.png` | Multi-dimensional radar chart — one polygon per model |
| `summary.json` | Complete results in machine-readable JSON |
| `<ModelName>_result.json` | Per-model detailed results |

### Excel structure

The Excel file contains multiple sheets:

- **Results** — one row per model, columns: CQ, CS, RS, LS, ES, SS + all aggregation strategy scores
- **Per-Dataset Breakdown** — same metrics split by dataset (GSM8K, MMLU, etc.)
- **Experiment Metadata** — config parameters, timestamps, dataset sizes

---

## Project Structure

```
LLM-Reasoning-Quality-Evaluation-Metrics/
│
├── config/
│   ├── config.yaml             ← Main config: add models/datasets/strategies here
│   └── config_test.yaml        ← Quick test (mock + Phi-2 + synthetic, ~5 min)
│
├── models/
│   ├── base_model.py           ← Abstract base class (cache, interface)
│   │                             Cache disabled for stochastic models (temperature>0)
│   ├── openai_model.py         ← GPT-4o-mini, GPT-4o, any OpenAI-compatible API
│   ├── anthropic_model.py      ← Claude models
│   ├── gemini_model.py         ← Gemini models
│   ├── deepseek_model.py       ← DeepSeek (OpenAI-compatible endpoint)
│   ├── local_model.py          ← HuggingFace local models
│   │                             4-bit quantization with float16 fallback
│   │                             Sequential RAM management (one model at a time)
│   │                             Pre-loading before evaluation loop (no per-item reload)
│   │                             Prompt templates per model family
│   └── mock_model.py           ← Deterministic mock for testing without APIs
│
├── llm_datasets/
│   ├── base_dataset.py         ← Abstract base + JSON file loader
│   ├── synthetic_dataset.py    ← Auto-generated reasoning/adversarial/robustness items
│   ├── gsm8k_dataset.py        ← GSM8K math word problems
│   ├── mmlu_dataset.py         ← MMLU multi-subject multiple choice
│   │                             Skips missing subjects (e.g. moral_reasoning) gracefully
│   ├── strategyqa_dataset.py   ← StrategyQA commonsense yes/no
│   └── multi_dataset.py        ← Combines multiple datasets, tracks source per item
│
├── metrics/
│   ├── accuracy.py             ← CQ — 7-strategy fuzzy matching pipeline
│   ├── consistency.py          ← CS — pairwise agreement across K runs
│   ├── robustness.py           ← RS — perturbation matching (conditioned on CQ)
│   ├── logical_consistency.py  ← LS — NLI contradiction detection
│   ├── efficiency.py           ← ES — harmonic mean of CQ and inverse token count
│   ├── explainability.py       ← SS — BERTScore across reasoning traces
│   └── aggregation.py          ← Weighted composite Q score, 7 built-in strategies
│
├── evaluation/
│   └── evaluator.py            ← Main pipeline: load → generate → 6 metrics → export
│                                 Local model detection → forces workers=1
│                                 Pre-loads model once before evaluation loop
│                                 Per-dataset breakdown support
│
├── visualization/
│   └── radar_plot.py           ← Radar chart + grouped bar chart
│
├── utils/
│   ├── logger.py               ← Structured logging
│   ├── reproducibility.py      ← Seed setting across Python / NumPy / PyTorch
│   └── experiment_tracker.py   ← JSON + Excel export, result aggregation
│
├── outputs/                    ← Auto-created; all results go here
├── requirements.txt
└── main.py                     ← Entry point; config parsing + model/dataset registration
```

---

## Adding New Models

### Option A — Config only (API models)

For any OpenAI-compatible API:

```yaml
- name: "My-Model"
  type: "openai"
  params:
    model_id: "my-model-id"
    api_key_env: "MY_API_KEY"
    max_tokens: 512
```

For HuggingFace local models:

```yaml
- name: "My-Local-Model"
  type: "local"
  params:
    model_id: "org/model-name"
    device: "cuda"
    use_4bit: true            # Falls back to float16 if unsupported
    max_new_tokens: 64
    temperature: 0.7
```

### Option B — Custom model class

1. Create `models/my_model.py` extending `BaseModel`
2. Implement `generate(prompt)` and `generate_with_trace(prompt)`
3. Add a `_build_mytype()` function in `main.py`
4. Register in the `MODEL_BUILDERS` dict in `main.py`
5. Use `type: "mytype"` in config
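
The steps above might look roughly like the following. The `BaseModel` class here is a stand-in written for this example so that it runs on its own; the real base class in `models/base_model.py`, the builder signature, and the return type of `generate_with_trace` may all differ, so check the source before copying.

```python
from abc import ABC, abstractmethod

class BaseModel(ABC):
    """Stand-in for models/base_model.py (assumed interface)."""

    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def generate(self, prompt: str) -> str: ...

    @abstractmethod
    def generate_with_trace(self, prompt: str): ...

class FixedAnswerModel(BaseModel):
    """Toy model that always answers '4'; replace with real inference."""

    def generate(self, prompt: str) -> str:
        return "4"

    def generate_with_trace(self, prompt: str):
        # Assumption: returns an (answer, reasoning trace) pair.
        trace = f"Step 1: read the question: {prompt}\nStep 2: give the answer."
        return self.generate(prompt), trace

def _build_mytype(name, params):
    """Hypothetical builder; the real signature in main.py may differ."""
    return FixedAnswerModel(name)

# In the framework this dict lives in main.py; here it is recreated locally.
MODEL_BUILDERS = {"mytype": _build_mytype}
model = MODEL_BUILDERS["mytype"]("My-Model", {})
```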

---

## Adding New Datasets

### Option A — JSON file (no code needed)

Prepare a JSON file with this structure:

```json
[
  {
    "id": "q001",
    "question": "What is 2 + 2?",
    "answer": "4",
    "type": "reasoning",
    "perturbations": [
      "What does 2 plus 2 equal?",
      "Calculate 2 + 2",
      "Find the sum of 2 and 2"
    ]
  }
]
```

Then add to config:

```yaml
datasets:
  - name: "my_dataset"
    type: "json"
    params:
      path: "data/my_questions.json"
      num_samples: 100
```

The `perturbations` field is used for the RS (robustness) metric. If omitted, robustness is skipped for that item.
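
Before a long run it can help to sanity-check the file. The checker below is a hypothetical helper, not part of the framework, and the choice of `id`, `question`, and `answer` as required fields is an assumption based on the example above.

```python
import json

REQUIRED_FIELDS = ("id", "question", "answer")  # assumed minimum

def validate_items(items):
    """Return a list of human-readable problems; empty means the items look fine."""
    problems = []
    for index, item in enumerate(items):
        missing = [field for field in REQUIRED_FIELDS if field not in item]
        if missing:
            problems.append(f"item {index}: missing {', '.join(missing)}")
        if not isinstance(item.get("perturbations", []), list):
            problems.append(f"item {index}: perturbations must be a list")
    return problems

sample = json.loads(
    '[{"id": "q001", "question": "What is 2 + 2?", "answer": "4", '
    '"type": "reasoning"}]'
)
issues = validate_items(sample)  # no perturbations is allowed, so no issues
```

For a real file, load it with `json.load(open(path, encoding="utf-8"))` and pass the resulting list to `validate_items`.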

### Option B — HuggingFace dataset class

1. Create `llm_datasets/my_dataset.py` extending `BaseDataset`
2. Implement the `load()` method to populate `self._data`
3. Register the type in `main.py`
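
As a rough illustration of steps 1 and 2, the sketch below defines a stand-in `BaseDataset` so the example runs by itself. The real class in `llm_datasets/base_dataset.py` may have a different constructor and extra methods; only `load()` and `self._data` are taken from the steps above, and the item fields follow the JSON format shown earlier.

```python
class BaseDataset:
    """Stand-in for llm_datasets/base_dataset.py (assumed interface)."""

    def __init__(self, num_samples=None):
        self.num_samples = num_samples
        self._data = []

    def load(self):
        raise NotImplementedError

    def __len__(self):
        return len(self._data)

class TinyArithmeticDataset(BaseDataset):
    """Toy dataset of addition questions with one rephrasing each."""

    def load(self):
        items = [
            {
                "id": f"add_{a}_{b}",
                "question": f"What is {a} + {b}?",
                "answer": str(a + b),
                "type": "reasoning",
                "perturbations": [f"Calculate {a} + {b}"],
            }
            for a in range(1, 4)
            for b in range(1, 4)
        ]
        # Truncate to num_samples when it is set.
        self._data = items[: self.num_samples] if self.num_samples else items
        return self._data

dataset = TinyArithmeticDataset(num_samples=5)
dataset.load()
```

A HuggingFace-backed class would replace the list comprehension with a call to `datasets.load_dataset` and map each row into the same dictionary shape.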

---

## Known Issues & Platform Notes

### Windows GPU — DLL errors (fbgemm.dll / cusparse64_11.dll)

Both errors share the same cause: pip CUDA wheels have DLL dependency issues on many Windows systems.

**Fix — use `conda install` instead of `pip install`:**
```powershell
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
```
This resolves all DLL issues automatically.

> ⚠️ **Do NOT install torch 2.10.x.** It breaks torchvision/torchaudio compatibility and reintroduces DLL errors. If you accidentally upgrade, restore with:
> ```powershell
> pip uninstall torch torchvision torchaudio bitsandbytes -y
> pip cache purge
> conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
> pip install bitsandbytes==0.44.0
> ```

### bitsandbytes version compatibility

| bitsandbytes | torch | Status |
|---|---|---|
| 0.44.0 | 2.4.0 + CUDA 12.1 | ✅ Tested, works |
| 0.49.x | 2.4.0 | ❌ Incompatible — causes CUDA errors |
| any | 2.10.x | ❌ Do not use torch 2.10.x |

### Windows CPU — PyTorch version

PyTorch 2.4+ pip wheels cause `fbgemm.dll` errors on Windows CPU. Use 2.3.x for CPU-only:

```powershell
pip install torch==2.3.1 --index-url https://download.pytorch.org/whl/cpu
pip install "transformers==4.45.2"
```

### transformers version (CPU-only systems)

`transformers >= 4.46` requires `torch >= 2.4`. On Windows CPU where 2.4 cannot be installed, pin transformers to 4.45.2. On GPU systems with PyTorch 2.4+, install the latest transformers freely:

```bash
pip install transformers -U
```

### Local model — meta tensor / copying from non-meta parameter warnings

During 4-bit model loading you may see many warnings like:
```
UserWarning: for model.layers.X...: copying from a non-meta parameter in the
checkpoint to a meta parameter in the current model, which is a no-op.
```

This is **expected and harmless**. It means the 4-bit loading path was attempted but fell back to float16 CUDA. The model loads correctly in float16 and inference proceeds normally. The warning appears once per layer (28 layers × ~10 weights = ~280 lines for Qwen2.5-1.5B).

### Local model — 4-bit fallback to float16

For small models (Qwen2.5-1.5B, Phi-2) 4-bit quantization may fail on some hardware with:
```
Cannot copy out of meta tensor; no data!
```
The framework catches this and automatically falls back to float16 CUDA. You will see:
```
[Qwen2.5-1.5B] 4-bit failed (...), falling back to float16 CUDA
```
This is not an error — evaluation continues normally. float16 uses slightly more VRAM (~3 GB for 1.5B) but works reliably on GTX 1650.

### Local model — workers must be 1

The evaluator automatically detects local models and forces `workers=1` regardless of the `max_workers` config setting. Running multiple local model workers causes repeated HuggingFace downloads, CUDA OOM, and meta tensor errors. This is by design.

### Flash attention warning

```
Torch was not compiled with flash attention.
```

This is harmless on GTX 1650 (Turing architecture). The model uses standard scaled dot-product attention instead, which works correctly. Flash attention requires Ampere or newer (RTX 3000+).

### logits type warning

```
Starting from v4.46, the logits model output will have the same type as the model
```

Harmless informational warning from `transformers`. Does not affect results.

### MMLU — moral_reasoning subject not found

```
[MMLU] Could not load subject 'moral_reasoning': BuilderConfig 'moral_reasoning' not found.
```

`moral_reasoning` does not exist in the `cais/mmlu` dataset. The framework logs this warning and skips it automatically. MMLU loads 225 items from the remaining subjects instead of 250. This is expected.

### Mistral / LLaMA tokenizer error

```
Cannot instantiate this tokenizer from a slow version... sentencepiece
```

Fix:
```bash
pip install sentencepiece
```

### CS = SS = 1.0 for all models

This happens when `deterministic: true` and no `temperature` is set per model. The cache returns the same response for all K runs. Fix: add `temperature: 0.7` to each model in config. See [Temperature and CS/SS measurement](#temperature-and-css-measurement).

### Per-dataset breakdown is slow (NLI/BERTScore reload)

In earlier versions, NLI and BERTScore models were reloaded for each dataset in the per-dataset breakdown, causing 30–80 min overhead per dataset. This is fixed in v5: NLI/BERTScore models are now kept in memory across all per-dataset passes and released only once after all datasets are processed.

---

## Citation

If you use this framework in your research, please cite:

```bibtex
@article{senol2026reasoning,
  title         = {Measuring Reasoning Quality in Large Language Models: A Multi-Dimensional Behavioral Framework},
  author        = {Şenol, Ali and Agrawal, Garima and Liu, Huan},
  year          = {2026},
  eprint        = {2605.24661},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url           = {https://arxiv.org/abs/2605.24661}
}
```

---

## License

MIT License — see `LICENSE` for details.
