Metadata-Version: 2.4
Name: tempobench
Version: 0.1.3
Summary: Benchmarking framework and datasets for temporal automata tasks
Project-URL: Homepage, https://github.com/nik-hz/tempobench
Project-URL: Issues, https://github.com/nik-hz/tempobench/issues
Author: Nikolaus Holzer
License: MIT
Keywords: automata,benchmark,evaluation,llm,temporal-logic
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Requires-Dist: pydantic>=2
Requires-Dist: requests>=2.31
Requires-Dist: tqdm>=4.66
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: pytest-cov>=5; extra == 'dev'
Requires-Dist: pytest>=8; extra == 'dev'
Requires-Dist: ruff>=0.6; extra == 'dev'
Provides-Extra: hf
Requires-Dist: accelerate>=0.30; extra == 'hf'
Requires-Dist: datasets>=2.20; extra == 'hf'
Requires-Dist: transformers>=4.43; extra == 'hf'
Provides-Extra: openai
Requires-Dist: openai>=1.40; extra == 'openai'
Provides-Extra: synth
Requires-Dist: docker>=7; extra == 'synth'
Provides-Extra: vllm
Requires-Dist: httpx>=0.27; extra == 'vllm'
Description-Content-Type: text/markdown

# Tempo-Bench

Formally grounded **LLM benchmark** for temporal reasoning over automata/traces. Runs locally via a clean CLI or Python API. Keeps the wheel thin (code + tiny samples) and lets you plug in **any model** (OpenAI-compatible, Hugging Face, vLLM, or your own class/function).

---

## Features
- Tasks: **trace acceptance** & **temporal causality** (with per-feature metrics).
- Backends: OpenRouter/OpenAI (OpenAI-compatible), **HF pipelines**, **vLLM**, or **custom** Python adapters.
- Outputs: row-wise JSONL + CSV with accuracy and F1s (AP & timestep).
- Reproducible runs: fixed seeds, manifest-friendly outputs, small packaged sample datasets.
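The AP-level precision/recall/F1 reported in the outputs are set-based metrics. A minimal sketch of that computation (a hypothetical helper for illustration, not tempobench's actual implementation):

```python
# Hypothetical sketch: set-based precision/recall/F1 over predicted vs. gold
# atomic propositions. Not part of the tempobench API.
def set_prf1(gold: set, pred: set) -> tuple[float, float, float]:
    tp = len(gold & pred)  # true positives: items in both sets
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(set_prf1({"a", "b"}, {"a", "c"}))  # (0.5, 0.5, 0.5)
```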

---

## Install

```bash
pip install tempobench
```

Requires Python ≥ 3.12.
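Optional extras (names taken from the package metadata) pull in backend-specific dependencies:

```shell
pip install "tempobench[hf]"      # Hugging Face backend: transformers, datasets, accelerate
pip install "tempobench[openai]"  # OpenAI client
pip install "tempobench[vllm]"    # vLLM server client (httpx)
pip install "tempobench[synth]"   # docker, for dataset synthesis
pip install "tempobench[dev]"     # pytest, pytest-cov, ruff, mypy
```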

---

## Quickstart (CLI)

```bash
# OpenRouter (OpenAI-compatible) example
export OPENROUTER_API_KEY=YOUR_KEY

tempobench run \
  --dataset_path src/tempobench/data/causal-done.jsonl \
  --task causal \
  --backend openrouter \
  --model-id openai/gpt-4o-mini \
  --gen-args '{"temperature":0.0,"max_tokens":256}' \
  --outdir benchmark_results --console-prints
```

Other backends are **not yet implemented** and are tracked as an open issue; the intended usage:

```bash
# OpenAI
export OPENAI_API_KEY=YOUR_KEY
tempobench run --dataset_path ... --task causal --backend openai --model-id gpt-4o-mini

# Hugging Face (local model)
tempobench run --dataset_path ... --task causal \
  --backend hf --model-id meta-llama/Meta-Llama-3.1-8B-Instruct \
  --model-args '{"device":0}' --gen-args '{"max_new_tokens":256}'

# vLLM server (OpenAI API compatible)
tempobench run --dataset_path ... --task trace \
  --backend vllm --model-id my-vllm \
  --model-args '{"base_url":"http://127.0.0.1:8000/v1","api_key":"nokey"}'
```

**Outputs** land under `benchmark_results/<task>/` as both `.jsonl` and `.csv`.

---

## Python API
You can use the `Benchmark` class to build custom benchmarking workflows on top of tempobench's logic. See `benchmark.py` in the project's GitHub repository.

```python
from tempobench import Benchmark

bench = Benchmark(
    dataset_path="src/tempobench/data/causal-done.jsonl",
    task="causal",
    model_id="openai/gpt-4o-mini",
    results_dir="benchmark_results",
    console_prints=True,
)

df = bench.evaluate()
print(df.head())
```

---

## Datasets

The public tempobench benchmarking datasets are available on the author's Hugging Face profile.

If you are interested in access to our datasets for reasoning SFT (supervised fine-tuning), please reach out to the author.

---

## Env vars
Set the environment variable for whichever backend you use:

- `OPENROUTER_API_KEY` (for `--backend openrouter`)
- `OPENAI_API_KEY` (for `--backend openai`)
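A small sketch of failing fast when the required key is missing (a hypothetical helper, not part of tempobench):

```python
# Hypothetical helper: look up the API key for a backend and fail early
# if it is not set. Backend names match the CLI's --backend values.
import os

BACKEND_ENV = {
    "openrouter": "OPENROUTER_API_KEY",
    "openai": "OPENAI_API_KEY",
}

def require_key(backend: str) -> str:
    var = BACKEND_ENV[backend]
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} must be set for --backend {backend}")
    return key
```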

---

## Results schema (per row)

`results_*.jsonl` contains:
```json
{
  "model": "openai/gpt-4o-mini",
  "gold": "... (gold JSON) ...",
  "pred": "... (raw text) ...",
  "GT": { "...parsed..." },
  "PRED": { "...parsed..." },
  "correct": true,
  "precision_ap": 1.0,
  "recall_ap": 1.0,
  "F1_ap": 1.0,
  "precision_timestep": 1.0,
  "recall_timestep": 1.0,
  "F1_timestep": 1.0,
  "cost": 0.0023,
  "generation_id": "gen_...",
  "native_prompt_tokens": 123,
  "native_completion_tokens": 45
}
```
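Since each row is one JSON object per line, the results file can be aggregated with the standard library alone. A sketch (field names follow the schema above; the path is illustrative):

```python
# Sketch: summarize a per-row results JSONL file using only the stdlib.
# Field names ("correct", "F1_ap", "cost") follow the schema shown above.
import json

def summarize(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    n = len(rows)
    return {
        "rows": n,
        "accuracy": sum(r["correct"] for r in rows) / n,
        "mean_F1_ap": sum(r["F1_ap"] for r in rows) / n,
        "total_cost": sum(r.get("cost", 0.0) for r in rows),
    }
```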

## License

MIT (see `LICENSE`).
