Metadata-Version: 2.4
Name: clever-bench
Version: 1.4.0
Summary: Python library for running the CLEVER Benchmark
Project-URL: Homepage, https://github.com/trishullab/clever-bench
Project-URL: Issues, https://github.com/trishullab/clever/issues
Author-email: Amitayush Thakur <amitayush@utexas.edu>
License-File: LICENSE
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Requires-Python: >=3.9
Requires-Dist: pyyaml>=6.0
Description-Content-Type: text/markdown

# CLEVER: Curated Lean Verified Code Generation Benchmark

[![CI](https://github.com/trishullab/clever/actions/workflows/lean_action_ci.yml/badge.svg)](https://github.com/trishullab/clever/actions/workflows/lean_action_ci.yml)
[![PyPI version](https://badge.fury.io/py/clever-bench.svg)](https://badge.fury.io/py/clever-bench)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/clever-bench)](https://pypi.org/project/clever-bench/)
[![arXiv](https://img.shields.io/badge/arXiv-2505.13938-b31b1b.svg)](https://arxiv.org/abs/2505.13938)
[![🏆 leaderboard](https://img.shields.io/badge/🏆-leaderboard-ff8811)](https://trishullab.github.io/clever/leaderboard.html)

## Overview

CLEVER is a benchmark suite for **end-to-end code generation and formal verification** in Lean 4, adapted from the HumanEval dataset. The goal is to move beyond test-case-driven evaluation by requiring models to generate not only implementations but also formal specifications and proofs — all verifiable by Lean’s type checker.

![CLEVER Framework](img-clever-overview.png)

## Benchmark Focus

CLEVER evaluates models across a staged verification pipeline:

- **Task 1:** Generate a formal specification from a natural language description.
- **Task 2:** Prove semantic equivalence to a hidden ground-truth specification.
- **Task 3:** Synthesize a Lean implementation that satisfies the specification.
- **Task 4:** Prove the implementation's correctness against the reference specification.

Each step is independently verified, allowing fine-grained diagnosis of model performance.

## Key Features

- **Non-computable specifications** that prevent implementation leakage and enforce semantic reasoning.
- **Staged certification** design to isolate failures across spec generation, implementation, and proof.
- **Supports symbolic proof search** with agents like COPRA, enabling deeper proof automation analysis.

---

## 🚀 Submitting Your Solutions to CLEVER

To evaluate your LLM-generated solutions against the CLEVER benchmark, use the Python API to package and submit them as `LeanProblemView` objects. Each submission is compiled and verified using Lean 4, and results are returned as structured `ValidationResult` objects.

### 🔧 Steps

0. **Installing the python package**:
   You can simply use `pip install clever-bench` to install the package from PyPi.
   Alternatively, you can install from source by cloning the repository and then use `pip install -e .` 
   
   After installing the package, run the command `clever-bench-install` (this will work on Linux, no prerequisite of installing Lean 4).

   If you are not on Linux, then you will have install the Lean 4 dependency by yourself. After Lean 4 installation, you can run `cd <path-to-clever-package-installation/lean4-dir> && lake exe cache get && lake build` (or equivalent command) to build the Lean 4 environment (This is a one time step only).

1. **Load the benchmark**:
   ```python
   from clever_bench.benchmark import Benchmark
   benchmark = Benchmark(is_sample=True)  # or is_sample=False for actual HumanEval problems in `src/lean4/human_eval`
   benchmark.load_all()
   ```

2. **Select a task** (e.g., proof generation):
   ```python
   from clever_bench.task import ProblemViewTask, TaskComponent
   task = ProblemViewTask(benchmark, TaskComponent.PROOF_GENERATION)
   ```

3. **Get a problem and fill in your solution**:
   ```python
   problem = task.get_view(3) # Abstraction to hide the staged problem details and only show relevant fields for the selected task for problem with id 3
   problem.implementation = "<your Lean implementation>"
   problem.correctness_proof = "<your proof>"
   ```

4. **Submit the solution**:
   ```python
   import asyncio
   result = asyncio.run(task.submit_async(problem, timeout_in_ms=30000))
   print(result.correctness_ok, result.error_message)
   ```

### ✅ Result

The returned `ValidationResult` will tell you whether your implementation compiled, and whether the proofs were accepted by Lean (i.e., no `sorry`).

CLEVER also supports multi-stage verification: the Python API automatically hides irrelevant fields during each task (e.g., only showing the natural language description field for spec generation), enabling clean task-specific prompting and evaluation.

This process allows you to validate your solutions programmatically—whether you're using LLMs, proof agents, or writing Lean by hand.

You can try our baselines here: [Baseline Provers](https://github.com/trishullab/clever-prover)

## 🏆 Leaderboard

View the latest results and submit your own: [**CLEVER Leaderboard**](https://trishullab.github.io/clever/leaderboard.html)

We welcome submissions from the community! To add your results to the leaderboard:

1. **Evaluate your approach** using the Python API described above
2. **Document your methodology** with a preprint or publication
3. **Submit your results** by contacting us with your evaluation metrics and methodology details

For submissions, please contact: [amitayush@utexas.edu](mailto:amitayush@utexas.edu)


## Build Instructions
- Install Lean 4: https://leanprover.github.io/lean4/doc/quickstart.html
- Run `lake build clever` to build the project after changing the directory to `src/lean4`. OR use VS Code with Lean 4 extension to build the project.
- All the formalizations are in the `src/lean4/human_eval` directory.
- Some extra formalizations are provided in `src/lean4/sample_examples/` directory, some of which were used in prompts created for evaluating the benchmark.

## Paper

You can find the paper describing CLEVER at https://arxiv.org/abs/2505.13938.
```bibtex
@inproceedings{thakur2025clever,
  title={CLEVER: A Curated Benchmark for Formally Verified Code Generation},
  author={Amitayush Thakur and Jasper Lee and George Tsoukalas and Meghana Sistla and Matthew Zhao and Stefan Zetzsche and Greg Durrett and Yisong Yue and Swarat Chaudhuri},  
  booktitle={39th Conference on Neural Information Processing Systems (NeurIPS 2025)},
  year={2025}
}
```
