Metadata-Version: 2.4
Name: coreb
Version: 0.1.0
Summary: CoREB: Code Retrieval and Reranking Benchmark — a graded-relevance benchmark for code retrieval and reranking models
Author: CoREB Team
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/hq-bench/coreb
Project-URL: Documentation, https://github.com/hq-bench/coreb
Project-URL: Repository, https://github.com/hq-bench/coreb
Project-URL: Dataset, https://huggingface.co/datasets/hq-bench/coreb
Keywords: code retrieval,embedding,benchmark,code search,information retrieval
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: datasets>=2.14.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: torch>=2.0.0
Requires-Dist: pytrec-eval-terrier>=0.5.5
Requires-Dist: tqdm>=4.60.0
Requires-Dist: easyllm_kit>=0.0.9
Provides-Extra: hf
Requires-Dist: transformers>=4.30.0; extra == "hf"
Provides-Extra: gemini
Requires-Dist: google-genai>=1.0.0; extra == "gemini"
Provides-Extra: all
Requires-Dist: transformers>=4.30.0; extra == "all"
Requires-Dist: google-genai>=1.0.0; extra == "all"
Requires-Dist: sentence-transformers>=2.2.0; extra == "all"
Requires-Dist: scikit-learn>=1.0.0; extra == "all"
Requires-Dist: pandas>=1.5.0; extra == "all"
Requires-Dist: omegaconf>=2.3.0; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=4.0; extra == "dev"
Dynamic: license-file

# CoREB: Code Retrieval and Reranking Benchmark

[![PyPI version](https://img.shields.io/pypi/v/coreb)](https://pypi.org/project/coreb/)
[![Downloads](https://img.shields.io/pypi/dm/coreb)](https://pypi.org/project/coreb/)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue)](LICENSE)
[![Dataset](https://img.shields.io/badge/HuggingFace-hq--bench%2Fcoreb-yellow)](https://huggingface.co/datasets/hq-bench/coreb)

**CoREB** is a graded-relevance benchmark for evaluating code retrieval and reranking models across three tasks:

| Task | Query | Target | Example |
|------|-------|--------|---------|
| **Text-to-Code** (T2C) | Natural language description | Code solution | "Find the longest substring without repeating characters" → Python solution |
| **Code-to-Code** (C2C) | Code in language A | Equivalent code in language B | Python solution → Java translation |
| **Code-to-Text** (C2T) | Code snippet | Problem description | Python solution → problem statement |

## Key Features

- **Graded relevance**: 3-level qrel scheme (rel=2: positive, rel=1: hard negative, rel=0: irrelevant); hard negatives are same-problem distractors that penalize nDCG when retrieved above true positives (see the sketch after this list)
- **5 programming languages**: Python, C++, Java, Go, Ruby
- **Problem-disjoint train/test splits**: v202602 (training) and v202603 (testing) cover non-overlapping contest windows
- **Drop-in evaluation**: compatible with standard IR evaluation (`pytrec_eval`) with `relevance_level=2`
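
To make the qrel scheme concrete, here is a minimal sketch in the nested-dict format that `pytrec_eval` consumes; all IDs below are hypothetical, the real qrels ship with the dataset:

```python
# Hypothetical IDs, for illustration only.
qrels = {
    "t2c_query_1": {
        "code_doc_a": 2,  # positive: a correct solution to the query's problem
        "code_doc_b": 1,  # hard negative: same-problem distractor
        "code_doc_c": 0,  # irrelevant document
    }
}
```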

## Installation

```bash
pip install coreb
```

For HuggingFace model support:
```bash
pip install coreb[hf]        # transformers backend
pip install coreb[gemini]    # Google Gemini API
pip install coreb[all]       # everything
```

## Quick Start

### Load the Dataset

```python
from datasets import load_dataset

# Load v202603 release (latest)
code_corpus = load_dataset("hq-bench/coreb", "code_corpus", split="release_v2603")
text_corpus = load_dataset("hq-bench/coreb", "text_corpus", split="release_v2603")

# Load task-specific queries and qrels
t2c_queries = load_dataset("hq-bench/coreb", "text2code_queries", split="release_v2603")
t2c_qrels = load_dataset("hq-bench/coreb", "text2code_qrels", split="release_v2603")

print(f"Code corpus: {len(code_corpus)} documents")
print(f"T2C queries: {len(t2c_queries)} queries, {len(t2c_qrels)} qrels")
```
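
To feed these splits into the evaluation pipeline below, one option is to materialize them as local JSONL files with the `datasets` exporter. Note this is a sketch: it assumes the raw HF column layout is what the `convert_*_to_coir_format` helpers expect.

```python
# Export the HF splits to JSONL so the converters in the next section can read them.
code_corpus.to_json("code_corpus.jsonl", lines=True)
t2c_queries.to_json("text2code_queries.jsonl", lines=True)
t2c_qrels.to_json("text2code_qrels.jsonl", lines=True)
```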

### Run Evaluation

```python
from coreb_runner.benchmark import (
    load_jsonl,
    convert_corpus_to_coir_format,
    convert_queries_to_coir_format,
    convert_qrels_to_coir_format,
    EvaluateRetrieval,
    DenseRetrievalExactSearch,
    create_model_wrapper,
)

# Load data (from local JSONL files or convert from HF datasets)
corpus = convert_corpus_to_coir_format(load_jsonl("code_corpus.jsonl"))
queries = convert_queries_to_coir_format(load_jsonl("text2code_queries.jsonl"))
qrels = convert_qrels_to_coir_format(load_jsonl("text2code_qrels.jsonl"))

# Create model wrapper
model = create_model_wrapper("jinaai/jina-embeddings-v3", model_type="huggingface")

# Run retrieval + evaluation
retriever = DenseRetrievalExactSearch(model, batch_size=64)
evaluator = EvaluateRetrieval(retriever, k_values=[1, 3, 5, 10])
results = evaluator.retrieve(corpus, queries)
ndcg, _map, recall, precision = evaluator.evaluate(qrels, results, evaluator.k_values)

print(f"nDCG@10: {ndcg['NDCG@10']:.4f}")
print(f"Recall@10: {recall['Recall@10']:.4f}")
```
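
To report the full sweep rather than a single cutoff, you can iterate `evaluator.k_values`. This assumes all four metric dicts follow the same `Metric@k` key pattern seen above (BEIR-style); only the `NDCG@k` and `Recall@k` keys are shown by the example.

```python
# Print every metric at every cutoff (key pattern assumed from the example above).
for k in evaluator.k_values:
    print(f"k={k:>2}  nDCG={ndcg[f'NDCG@{k}']:.4f}  MAP={_map[f'MAP@{k}']:.4f}  "
          f"Recall={recall[f'Recall@{k}']:.4f}  P={precision[f'P@{k}']:.4f}")
```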

### Evaluation with Graded Relevance

CoREB uses `relevance_level=2`: only items with rel >= 2 count as relevant for the binary metrics (Recall, MAP, Precision). Hard negatives (rel=1) earn zero gain, so retrieving one above a true positive drags nDCG down without inflating Recall or MRR.

```python
# The EvaluateRetrieval class handles this automatically:
# - rel=1 (hard negatives) are zeroed out for nDCG computation
# - relevance_level=2 is set for pytrec_eval binary metrics
print(f"Relevance threshold: {EvaluateRetrieval.RELEVANCE_LEVEL}")  # 2
```
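
If you evaluate with `pytrec_eval` directly instead of `EvaluateRetrieval`, the same convention looks roughly like the sketch below. The toy IDs and scores are made up, and it assumes your `pytrec_eval` build exposes the `relevance_level` argument (which the wrapper above relies on).

```python
import pytrec_eval

# Toy run with made-up IDs: a hard negative outranks the true positive.
qrels = {"q1": {"doc_pos": 2, "doc_hard": 1, "doc_rand": 0}}
run = {"q1": {"doc_hard": 0.9, "doc_pos": 0.8, "doc_rand": 0.1}}

# Zero out rel=1 so hard negatives earn no nDCG gain.
ndcg_qrels = {q: {d: (r if r >= 2 else 0) for d, r in ds.items()}
              for q, ds in qrels.items()}
ndcg_eval = pytrec_eval.RelevanceEvaluator(ndcg_qrels, {"ndcg_cut.10"})

# relevance_level=2 makes only rel >= 2 count for the binary metrics.
bin_eval = pytrec_eval.RelevanceEvaluator(qrels, {"recall.10", "P.10"},
                                          relevance_level=2)

print(ndcg_eval.evaluate(run)["q1"]["ndcg_cut_10"])  # < 1.0: hard negative penalty
print(bin_eval.evaluate(run)["q1"]["recall_10"])     # 1.0: the positive was retrieved
```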

## Dataset Structure

Available on HuggingFace: [`hq-bench/coreb`](https://huggingface.co/datasets/hq-bench/coreb)

**8 configs** × **2 splits** (`release_v2602`, `release_v2603`):

| Config | v2603 Rows | Description |
|--------|-----------|-------------|
| `code_corpus` | 1,744 | Code solutions (5 languages, 2 generator models) |
| `text_corpus` | 875 | Problem descriptions (175 original + 700 LLM-generated noise docs) |
| `text2code_queries` | 1,123 | T2C queries (canonical, full, search subtasks) |
| `text2code_qrels` | 5,950 | T2C relevance judgments (2,814 pos + 3,136 hard neg) |
| `code2code_queries` | 278 | C2C queries (cross-language) |
| `code2code_qrels` | 1,457 | C2C relevance judgments (623 pos + 834 hard neg) |
| `code2text_queries` | 1,200 | C2T queries (canonical, full, match subtasks) |
| `code2text_qrels` | 4,610 | C2T relevance judgments (820 pos + 2,650 hard neg) |
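
To sanity-check a download against the row counts above, you can loop over all eight configs:

```python
from datasets import load_dataset

CONFIGS = [
    "code_corpus", "text_corpus",
    "text2code_queries", "text2code_qrels",
    "code2code_queries", "code2code_qrels",
    "code2text_queries", "code2text_qrels",
]
for cfg in CONFIGS:
    ds = load_dataset("hq-bench/coreb", cfg, split="release_v2603")
    print(f"{cfg}: {len(ds)} rows")
```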

## Benchmark Results (v202603, nDCG@10)

| Rank | Model | Avg | T2C | C2C | C2T |
|------|-------|-----|-----|-----|-----|
| 1 | GemEmb-2 | 0.639 | 0.434 | 0.698 | 0.784 |
| 2 | C2LLM-7B | 0.623 | 0.443 | 0.659 | 0.766 |
| 3 | jina-code-1.5b | 0.607 | 0.414 | 0.671 | 0.735 |
| 4 | C2LLM-0.5B | 0.604 | 0.430 | 0.657 | 0.725 |
| 5 | jina-code-0.5b | 0.596 | 0.386 | 0.677 | 0.725 |
| 6 | F2LLM-4B | 0.547 | 0.407 | 0.500 | 0.735 |
| 7 | Qwen3-Emb-4B | 0.495 | 0.390 | 0.392 | 0.704 |
| 8 | F2LLM-1.7B | 0.485 | 0.383 | 0.383 | 0.690 |
| 9 | Qwen3-Emb-0.6B | 0.443 | 0.349 | 0.384 | 0.597 |
| 10 | F2LLM-0.6B | 0.439 | 0.344 | 0.334 | 0.641 |
| 11 | Qwen3-Emb-8B | 0.428 | 0.328 | 0.320 | 0.635 |

## Citation

Coming soon.

## License

This project is licensed under the Apache License 2.0 — see [LICENSE](LICENSE) for details.
