Metadata-Version: 2.4
Name: pybenchtool
Version: 0.0.9
Summary: A comparative benchmarking framework for Python.
Author: Friedrich Schwarz
License-Expression: GPL-3.0-or-later
Project-URL: Homepage, https://github.com/fschwar4/PyBenchTool
Project-URL: Documentation, https://fschwar4.github.io/PyBenchTool/
Keywords: benchmark,performance,profiling,HPC,SLURM
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: System :: Benchmark
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy
Requires-Dist: pandas
Requires-Dist: psutil
Requires-Dist: tqdm
Requires-Dist: scipy
Requires-Dist: threadpoolctl
Requires-Dist: loky
Provides-Extra: plotting
Requires-Dist: seaborn; extra == "plotting"
Requires-Dist: matplotlib; extra == "plotting"
Provides-Extra: stats
Requires-Dist: pingouin; extra == "stats"
Provides-Extra: hardware
Requires-Dist: llvmlite; extra == "hardware"
Provides-Extra: full
Requires-Dist: pybenchtool[plotting]; extra == "full"
Requires-Dist: pybenchtool[stats]; extra == "full"
Requires-Dist: pybenchtool[hardware]; extra == "full"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: sphinx; extra == "dev"
Requires-Dist: pydata-sphinx-theme; extra == "dev"
Requires-Dist: sphinx-autodoc-typehints; extra == "dev"
Dynamic: license-file

# PyBenchTool

**A Comparative Benchmarking Framework for Python**

PyBenchTool is a benchmarking framework for systematic comparison of multiple
function implementations across varying inputs. It records full per-iteration
runtime distributions together with disk I/O counters, system metadata, and
HPC environment variables, and provides built-in inferential statistics
(Welch ANOVA, Games-Howell post-hoc) for rigorous evaluation of performance
differences.

The framework targets workloads where run-to-run variance carries information
— for instance, I/O-bound compression benchmarks on shared cluster nodes —
rather than micro-benchmarks where sub-microsecond precision is the primary
concern.

## Scope and Positioning

| Capability | timeit / pyperf | perfplot | **PyBenchTool** |
|---|---|---|---|
| Multiple kernels × multiple inputs | Manual loop | Built-in | Built-in |
| Randomised execution order | No | No | Yes (seed is logged) |
| Full per-iteration distributions | pyperf: yes | No (reports minimum) | Yes |
| Disk I/O tracking per iteration | No | No | Yes |
| SLURM / HPC metadata capture | No | No | Yes (40+ fields) |
| Inferential statistics (ANOVA) | No | No | Welch ANOVA + Games-Howell |
| Cold start (skip warmup) | No | No | Yes |
| Subprocess isolation | pyperf: yes | No | Yes (loky/cloudpickle) |
| Three-clock timing (wall/CPU/thread) | pyperf: wall+CPU | No | Yes + derived I/O metrics |
| Warmup calibration / CPU pinning | pyperf: yes | No | No (single-pass warmup) |

**PyBenchTool** is appropriate when comparing multiple implementations across
varying inputs and the analysis requires full runtime distributions, disk I/O
measurements, or hypothesis testing — particularly in HPC environments where
SLURM metadata and environment reproducibility are relevant.

**[pyperf](https://github.com/psf/pyperf)** is the better choice for
low-noise measurement of a single function, where automatic warmup
calibration, outlier detection, CPU pinning, and system tuning are
needed to minimise OS-level variance (e.g. micro-benchmarks).

**[perfplot](https://github.com/nschloe/perfplot)** is sufficient for visual
scaling comparisons when per-run distributions and statistical testing are not
required.

## Features

- **Controlled benchmarking** — separate setup and cleanup phases, excluded from timing, allow data preparation and post-measurement metadata collection
- **Subprocess isolation** — each iteration runs in a fresh subprocess by default (via loky/cloudpickle), providing clean memory, disk I/O counters, and GC state per iteration; in-process mode available for low-overhead micro-benchmarks
- **Three-clock timing** — wall-clock (`perf_counter_ns`), CPU (`process_time_ns`), and thread (`thread_time_ns`) timers with derived metrics (I/O wait, I/O fraction, thread-pool parallelism); see the sketch after this list
- **Metadata capture** — CPU architecture, RAM, OS, SLURM environment, library versions, disk I/O counters, and GC state (40+ fields per row)
- **Randomised execution** — the full kernel × input × n_runs matrix is shuffled to prevent systematic ordering effects; the seed is logged for reproducibility
- **Statistical analysis** — Welch ANOVA and Games-Howell post-hoc tests account for heterogeneous variances
- **Visualisation** — bar plots and box plots via matplotlib and seaborn
- **Full distributions** — every iteration is stored individually for distribution-level analysis
- **Cold start mode** — skips the warmup run for workloads where the warmup itself is prohibitively expensive
- **HPC integration** — automatic SLURM variable capture; optional OS page-cache clearing for I/O benchmarks
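
Two of the bullets above deserve a concrete illustration. The derived I/O metrics come from comparing the clocks: time during which the wall clock advances but the process accumulates no CPU time is attributed to waiting (I/O, sleeps, scheduler delays). A minimal standalone sketch of the idea using the standard library and psutil follows; the function and its result keys are illustrative, not PyBenchTool's actual API:

```python
import time

import psutil


def timed(fn, *args):
    """Run fn once; return wall/CPU seconds plus derived I/O metrics.

    Illustrative only: PyBenchTool computes similar per-iteration
    metrics internally, but this function is not its API.
    """
    proc = psutil.Process()
    io0 = proc.io_counters()  # cumulative per-process disk counters
                              # (Linux/Windows; unavailable on macOS)
    w0, c0 = time.perf_counter_ns(), time.process_time_ns()
    fn(*args)
    wall = (time.perf_counter_ns() - w0) / 1e9
    cpu = (time.process_time_ns() - c0) / 1e9
    io1 = proc.io_counters()
    # Time spent off-CPU (I/O, sleeps). Clamped at zero because
    # multithreaded CPU-bound code can accumulate cpu > wall.
    io_wait = max(wall - cpu, 0.0)
    return {
        "wall_s": wall,
        "cpu_s": cpu,
        "io_wait_s": io_wait,
        "io_fraction": io_wait / wall if wall else 0.0,
        "read_bytes": io1.read_bytes - io0.read_bytes,
        "write_bytes": io1.write_bytes - io0.write_bytes,
    }


# A sleep registers almost entirely as I/O wait: wall advances, CPU barely moves.
print(timed(time.sleep, 0.1))
```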

## Known Limitations

- **Single-pass warmup** — one warmup iteration per kernel (or none with `cold_start=True`); no automatic calibration to determine when measurements have stabilised (unlike pyperf)
- **Subprocess startup overhead** — the default `isolation="subprocess"` mode spawns a fresh subprocess per iteration via loky (~200–500 ms overhead). For sub-millisecond kernels, use `isolation="inprocess"` (see the measurement sketch after this list)
- **No outlier detection** — outliers from OS scheduling are retained; the statistical tests tolerate moderate outliers but no automatic flagging is performed
- **No CPU pinning** — on SLURM clusters the scheduler controls affinity; no built-in pinning for bare-metal benchmarks
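
The quoted subprocess overhead is easy to observe with loky itself: the first task submitted to a fresh executor pays the worker spawn and import cost, while a second task on the now-warm worker does not. A rough measurement sketch follows (absolute numbers vary by machine and Python version, and this does not reproduce PyBenchTool's exact isolation path):

```python
import time

from loky import get_reusable_executor


def noop():
    return None


executor = get_reusable_executor(max_workers=1)

# Cold: the first round-trip includes spawning the worker process.
t0 = time.perf_counter()
executor.submit(noop).result()
cold = time.perf_counter() - t0

# Warm: the same worker is reused, so only pickling + IPC remain.
t0 = time.perf_counter()
executor.submit(noop).result()
warm = time.perf_counter() - t0

print(f"cold start: {cold * 1e3:.1f} ms, warm round-trip: {warm * 1e3:.3f} ms")
```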

## Installation

```bash
pip install pybenchtool
```

For development, install in editable mode with the optional and dev extras:

```bash
pip install -e ".[full,dev]"
```

## Quick Start

```python
from pybenchtool import BenchTool

bt = BenchTool(
    name="Codec Comparison",
    description="blosc2 vs zstd on varying array sizes.",
    verbose=True,
    version_key_libraries=["numpy", "blosc2"],
)

# prepare_data, compress_blosc2, compress_zstd, measure_ratio, and the
# *_array inputs are user-defined callables/objects (definitions omitted)
results = bt.bench(
    setup=[prepare_data],
    kernel=[compress_blosc2, compress_zstd],
    cleanup=[measure_ratio],
    input_=[small_array, medium_array, large_array],
    n_runs=10,
    show_progress=True,
)

bt.results2csv(".")
print(bt.summary())
bt.boxplot()
bt.runtime_htest()
```
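
For readers unfamiliar with the tests behind `runtime_htest()`, the sketch below runs the same pair, a Welch ANOVA followed by Games-Howell post-hoc comparisons, directly with pingouin (the optional `stats` dependency) on a toy long-format table. The column names are illustrative, not PyBenchTool's CSV schema:

```python
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(42)

# Toy long-format results: one row per iteration, one column per factor.
df = pd.DataFrame({
    "kernel": ["blosc2"] * 30 + ["zstd"] * 30,
    "runtime": np.concatenate([
        rng.normal(1.00, 0.05, 30),  # kernel A: mean 1.00 s, low variance
        rng.normal(1.10, 0.20, 30),  # kernel B: slower, higher variance
    ]),
})

# Welch ANOVA: do mean runtimes differ, without assuming equal variances?
print(pg.welch_anova(data=df, dv="runtime", between="kernel"))

# Games-Howell: pairwise comparisons, also robust to unequal variances.
print(pg.pairwise_gameshowell(data=df, dv="runtime", between="kernel"))
```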

For a complete walkthrough, see the
[Quick Start guide](https://fschwar4.github.io/PyBenchTool/quickstart.html).

## Documentation

Full documentation is available at
[https://fschwar4.github.io/PyBenchTool/](https://fschwar4.github.io/PyBenchTool/).

To build locally:

```bash
pip install -r docs/requirements.txt
cd docs && make html
```

## Building a Distribution

```bash
python -m pip install build
python -m build
```

Artefacts are placed in the `dist/` directory.

## Repository Structure

```
PyBenchTool/
├── pybenchtool/
│   ├── __init__.py      # Package entry point, version
│   ├── _facade.py       # BenchTool facade class (composes the modules below)
│   ├── _metadata.py     # System metadata collection, get_conda_version()
│   ├── _runner.py       # Benchmark orchestration (bench, _run_iteration)
│   ├── _io.py           # CSV I/O (results2csv, load)
│   ├── _analysis.py     # Summary statistics, hypothesis testing
│   ├── _plotting.py     # Bar plots, box plots, unit conversion
│   └── _utils.py        # Shared helpers
├── tests/               # Pytest test suite
├── notebooks/           # Example Jupyter notebooks
├── docs/                # Sphinx documentation source
├── scripts/
│   └── conda_env_export.sh
├── CHANGELOG.md
├── LICENSE
├── README.md
└── pyproject.toml
```
