Metadata-Version: 2.4
Name: synthbench
Version: 0.1.0
Summary: Synthetic datasets for ML benchmarking with controllable complexity, configurable corruptions, and full provenance.
Project-URL: Homepage, https://github.com/JanTeichertKluge/synth-bench
Project-URL: Documentation, https://JanTeichertKluge.github.io/synth-bench
Project-URL: Repository, https://github.com/JanTeichertKluge/synth-bench.git
Project-URL: Issues, https://github.com/JanTeichertKluge/synth-bench/issues
Project-URL: Changelog, https://github.com/JanTeichertKluge/synth-bench/releases
Author-email: Jan Teichert-Kluge <janteiklu@gmail.com>
License-Expression: MIT
License-File: LICENSE
Keywords: benchmarking,data-generating-process,dataset-generation,machine-learning,reproducibility,synthetic-data
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Typing :: Typed
Requires-Python: >=3.12
Requires-Dist: numpy>=2.0
Requires-Dist: scikit-learn>=1.5
Requires-Dist: scipy>=1.12
Provides-Extra: dev
Requires-Dist: pre-commit>=3.5; extra == 'dev'
Requires-Dist: pytest-cov>=7.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.5; extra == 'dev'
Provides-Extra: docs
Requires-Dist: matplotlib>=3.7; extra == 'docs'
Requires-Dist: mkdocs-jupyter>=0.26; extra == 'docs'
Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
Requires-Dist: mkdocs>=1.6; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=1.0; extra == 'docs'
Requires-Dist: openml>=0.14; extra == 'docs'
Requires-Dist: pandas>=2.0; extra == 'docs'
Provides-Extra: io
Requires-Dist: pyarrow>=14.0; extra == 'io'
Provides-Extra: neural
Requires-Dist: torch>=2.1; extra == 'neural'
Description-Content-Type: text/markdown

<div align="center">
  <img src="icon.png" alt="synthbench" width="420">
</div>

---

synthbench is a small Python library for generating synthetic datasets that are actually useful for benchmarking. You control the signal complexity, add noise or missing data on top, and get back a dataset with full provenance so you know exactly what you generated and why. Every result is reproducible from a single integer seed.

It covers eight data-generating-process (DGP) families, five feature corruptors plus label noise, metadata enrichment (Bayes error, effective rank), Parquet/CSV serialization, and sweep helpers for running ablation grids.

## Installation

```bash
pip install synthbench
```

For Parquet support:

```bash
pip install "synthbench[io]"
```

For `RandomNeuralDGP` (needs PyTorch):

```bash
pip install "synthbench[neural]"
```

## Basic usage

```python
from synthbench import BenchPipeline, LinearDGP, MissingDataCorruptor

pipeline = BenchPipeline(
    LinearDGP(complexity="medium", task_type="classification"),
    corruptors=[MissingDataCorruptor(proportion=0.1, mechanism="mar")],
)
result = pipeline.run(n_samples=500, n_features=10, random_state=42)

print(result.X.shape)                     # (500, 10)
print(result.metadata["bayes_error"])     # empirical difficulty estimate
print(result.metadata["effective_rank"])  # feature space dimensionality
```

## What it does

**Data-generating processes** — Linear, Polynomial, Tree, Friedman (variants 1/2/3), Additive, Sparse, Geometric, and RandomNeural. Each takes a `complexity` parameter and records ground-truth feature importances alongside the data.
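
A minimal sketch of swapping complexity and task type (only `LinearDGP` and `RandomNeuralDGP` appear by name in this README; the metadata key for the recorded importances is an assumption):

```python
from synthbench import BenchPipeline, LinearDGP

# Any DGP family slots into the same pipeline; complexity is "low" / "medium" / "high".
dgp = LinearDGP(complexity="high", task_type="regression")
result = BenchPipeline(dgp).run(n_samples=1000, n_features=20, random_state=7)

# Ground-truth feature importances are recorded alongside the data;
# the exact metadata key is an assumption.
print(result.metadata.get("feature_importances"))
```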

**Corruptors** — MeasurementNoise, Outlier, MissingData, Collinearity, and Categorical corruptors for the feature matrix, plus `LabelNoiseCorruptor` for flipping labels or injecting regression noise. They chain together in a canonical order and track how much signal they degrade.
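
A sketch of chaining several corruptors in one pipeline. `MissingDataCorruptor` and `LabelNoiseCorruptor` are named above; `MeasurementNoiseCorruptor` and its `severity` argument, and the `proportion` argument to `LabelNoiseCorruptor`, are assumptions inferred from the family names:

```python
from synthbench import (
    BenchPipeline,
    LabelNoiseCorruptor,
    LinearDGP,
    MeasurementNoiseCorruptor,  # class name inferred from the list above
    MissingDataCorruptor,
)

# Corruptors chain in the library's canonical order,
# not necessarily the order given here.
pipeline = BenchPipeline(
    LinearDGP(complexity="medium", task_type="classification"),
    corruptors=[
        MeasurementNoiseCorruptor(severity="low"),        # argument name assumed
        MissingDataCorruptor(proportion=0.05, mechanism="mar"),
        LabelNoiseCorruptor(proportion=0.02),             # argument name assumed
    ],
)
result = pipeline.run(n_samples=500, n_features=10, random_state=0)
```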

**Metadata** — every result carries `bayes_error`, `effective_rank`, corruptor parameters, and version provenance. Enough to reconstruct the generating pipeline from scratch.
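
For example, dumping the recorded keys on a clean run shows what provenance travels with each result (only `bayes_error` and `effective_rank` are documented key names here; the rest is whatever the library records):

```python
from synthbench import BenchPipeline, LinearDGP

result = BenchPipeline(
    LinearDGP(complexity="low", task_type="classification")
).run(n_samples=300, n_features=8, random_state=1)

# "bayes_error" and "effective_rank" are documented keys; listing the rest
# shows the corruptor parameters and version provenance that were recorded.
print(result.metadata["bayes_error"], result.metadata["effective_rank"])
print(sorted(result.metadata))
```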

**Sweeps** — `severity_sweep` and `difficulty_sweep` for single-axis ablations, and `experiment_grid` for full factorial runs across sample size, complexity, and severity. Seeds are derived hierarchically so cells are independent but deterministic.
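
A hedged sketch of a single-axis sweep, assuming `severity_sweep` mirrors `experiment_grid`'s calling convention (see the ablation example below) and returns a severity-keyed mapping of results:

```python
from synthbench import LinearDGP, OutlierCorruptor, severity_sweep

# Single-axis ablation over corruption severity; the keyword names mirror
# experiment_grid but are assumptions for severity_sweep itself.
results = severity_sweep(
    LinearDGP,
    OutlierCorruptor,
    severities=["low", "medium", "high"],
    n_samples=500,
    n_features=10,
    random_state=0,
    task_type="classification",
)
for severity, result in results.items():
    print(severity, result.metadata["bayes_error"])
```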

**Named suites** — `BenchSuite("easy-classification").run()` returns a labelled dict of results for a curated collection. Good for quick sanity checks or as a shared benchmark baseline.
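
For instance, assuming the returned dict maps dataset labels to results:

```python
from synthbench import BenchSuite

# run() returns a labelled dict of results; the label format is up to the suite.
for name, result in BenchSuite("easy-classification").run().items():
    print(name, result.X.shape, result.metadata["bayes_error"])
```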

**Serialization** — `to_parquet` / `from_parquet` and `to_csv` / `from_csv` round-trip everything including metadata. `BenchPipeline.from_metadata` reconstructs and re-runs the pipeline for bit-identical replay.
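
A round-trip sketch (requires the `io` extra; treating `to_parquet` / `from_parquet` as module-level functions, as written here, is an assumption):

```python
from synthbench import BenchPipeline, LinearDGP, from_parquet, to_parquet

result = BenchPipeline(
    LinearDGP(complexity="medium", task_type="classification")
).run(n_samples=500, n_features=10, random_state=42)

# Round-trip the dataset and its metadata through Parquet.
to_parquet(result, "run.parquet")
restored = from_parquet("run.parquet")

# Reconstruct the generating pipeline from stored metadata for replay;
# re-passing the original run parameters is an assumption about the API.
replay = BenchPipeline.from_metadata(restored.metadata)
replayed = replay.run(n_samples=500, n_features=10, random_state=42)
```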

## Ablation example

```python
from synthbench import LinearDGP, OutlierCorruptor, experiment_grid

grid = experiment_grid(
    LinearDGP,
    OutlierCorruptor,
    n_samples_list=[200, 500, 1000],
    complexities=["low", "medium", "high"],
    severities=["low", "medium", "high"],
    n_features=10,
    random_state=0,
    task_type="classification",
)

result = grid[(500, "high", "medium")]
print(result.metadata["bayes_error"])
```

## Docs

Full reference at [JanTeichertKluge.github.io/synth-bench](https://JanTeichertKluge.github.io/synth-bench).
