Metadata-Version: 2.4
Name: benchaudit
Version: 0.1.0
Summary: BenchAudit -- data hygiene and similarity audits for molecular and DTI benchmarks.
Author: Maximilian G. Schuh, Aleksandra Daniluk, Stephan A. Sieber
License: MIT
Keywords: chemistry,benchmarking,molecular,smiles,dti
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: alembic==1.13.1
Requires-Dist: anyio==4.3.0
Requires-Dist: argon2-cffi==23.1.0
Requires-Dist: argon2-cffi-bindings==21.2.0
Requires-Dist: arrow==1.3.0
Requires-Dist: asttokens==2.4.1
Requires-Dist: async-lru==2.0.4
Requires-Dist: attrs==23.2.0
Requires-Dist: babel==2.14.0
Requires-Dist: beautifulsoup4==4.12.3
Requires-Dist: bleach==6.1.0
Requires-Dist: certifi==2024.2.2
Requires-Dist: cffi==1.16.0
Requires-Dist: charset-normalizer==3.3.2
Requires-Dist: chembl-structure-pipeline>=1.2.2
Requires-Dist: colorlog==6.8.2
Requires-Dist: comm==0.2.2
Requires-Dist: contourpy==1.2.1
Requires-Dist: cycler==0.12.1
Requires-Dist: debugpy==1.8.1
Requires-Dist: decorator==5.1.1
Requires-Dist: defusedxml==0.7.1
Requires-Dist: executing==2.0.1
Requires-Dist: fastjsonschema==2.19.1
Requires-Dist: filelock==3.13.4
Requires-Dist: fonttools==4.51.0
Requires-Dist: fqdn==1.5.1
Requires-Dist: fsspec==2024.3.1
Requires-Dist: greenlet==3.0.3
Requires-Dist: h11==0.14.0
Requires-Dist: httpcore==1.0.5
Requires-Dist: httpx==0.27.0
Requires-Dist: idna==3.7
Requires-Dist: ipykernel==6.29.4
Requires-Dist: ipython==8.23.0
Requires-Dist: ipywidgets==8.1.2
Requires-Dist: isoduration==20.11.0
Requires-Dist: jedi==0.19.1
Requires-Dist: jinja2==3.1.3
Requires-Dist: joblib==1.4.0
Requires-Dist: json5==0.9.25
Requires-Dist: jsonpointer==2.4
Requires-Dist: jsonschema==4.21.1
Requires-Dist: jsonschema-specifications==2023.12.1
Requires-Dist: jupyter==1.0.0
Requires-Dist: jupyter-client==8.6.1
Requires-Dist: jupyter-console==6.6.3
Requires-Dist: jupyter-core==5.7.2
Requires-Dist: jupyter-events==0.10.0
Requires-Dist: jupyter-lsp==2.2.5
Requires-Dist: jupyter-server==2.14.0
Requires-Dist: jupyter-server-terminals==0.5.3
Requires-Dist: jupyterlab==4.1.6
Requires-Dist: jupyterlab-pygments==0.3.0
Requires-Dist: jupyterlab-server==2.27.1
Requires-Dist: jupyterlab-widgets==3.0.10
Requires-Dist: kiwisolver==1.4.5
Requires-Dist: levenshtein>=0.27.1
Requires-Dist: lightgbm==4.3.0
Requires-Dist: mako==1.3.3
Requires-Dist: markupsafe==2.1.5
Requires-Dist: matplotlib==3.8.4
Requires-Dist: matplotlib-inline==0.1.7
Requires-Dist: mistune==3.0.2
Requires-Dist: mpmath==1.3.0
Requires-Dist: msgpack==1.0.8
Requires-Dist: nbclient==0.10.0
Requires-Dist: nbconvert==7.16.3
Requires-Dist: nbformat==5.10.4
Requires-Dist: neovim==0.3.1
Requires-Dist: nest-asyncio==1.6.0
Requires-Dist: networkx==3.3
Requires-Dist: notebook==7.1.3
Requires-Dist: notebook-shim==0.2.4
Requires-Dist: numpy==1.26.4
Requires-Dist: nvidia-cublas-cu12==12.1.3.1
Requires-Dist: nvidia-cuda-cupti-cu12==12.1.105
Requires-Dist: nvidia-cuda-nvrtc-cu12==12.1.105
Requires-Dist: nvidia-cuda-runtime-cu12==12.1.105
Requires-Dist: nvidia-cudnn-cu12==8.9.2.26
Requires-Dist: nvidia-cufft-cu12==11.0.2.54
Requires-Dist: nvidia-curand-cu12==10.3.2.106
Requires-Dist: nvidia-cusolver-cu12==11.4.5.107
Requires-Dist: nvidia-cusparse-cu12==12.1.0.106
Requires-Dist: nvidia-nccl-cu12==2.20.5
Requires-Dist: nvidia-nvjitlink-cu12==12.4.127
Requires-Dist: nvidia-nvtx-cu12==12.1.105
Requires-Dist: optuna==3.6.1
Requires-Dist: overrides==7.7.0
Requires-Dist: packaging==24.0
Requires-Dist: pairwise-sequence-alignment>=1.0.3
Requires-Dist: pandas==2.2.2
Requires-Dist: pandocfilters==1.5.1
Requires-Dist: parso==0.8.4
Requires-Dist: pexpect==4.9.0
Requires-Dist: pillow==10.3.0
Requires-Dist: platformdirs==4.2.1
Requires-Dist: polaris-lib~=0.13
Requires-Dist: prometheus-client==0.20.0
Requires-Dist: prompt-toolkit==3.0.43
Requires-Dist: psutil==5.9.8
Requires-Dist: psycopg2-binary>=2.9.10
Requires-Dist: ptyprocess==0.7.0
Requires-Dist: pure-eval==0.2.2
Requires-Dist: pycparser==2.22
Requires-Dist: pygments==2.17.2
Requires-Dist: pyparsing==3.1.2
Requires-Dist: pytdc>=1.1.15
Requires-Dist: python-dateutil==2.9.0.post0
Requires-Dist: python-json-logger==2.0.7
Requires-Dist: pydantic<3,>=2
Requires-Dist: pytz==2024.1
Requires-Dist: pyyaml==6.0.1
Requires-Dist: pyzmq==26.0.2
Requires-Dist: qtconsole==5.5.1
Requires-Dist: qtpy==2.4.1
Requires-Dist: rdkit==2023.9.5
Requires-Dist: referencing==0.35.0
Requires-Dist: requests==2.31.0
Requires-Dist: rfc3339-validator==0.1.4
Requires-Dist: rfc3986-validator==0.1.1
Requires-Dist: rpds-py==0.18.0
Requires-Dist: scikit-learn==1.4.2
Requires-Dist: scipy==1.13.0
Requires-Dist: seaborn==0.13.2
Requires-Dist: send2trash==1.8.3
Requires-Dist: setuptools==68.2.2
Requires-Dist: six==1.16.0
Requires-Dist: sniffio==1.3.1
Requires-Dist: soupsieve==2.5
Requires-Dist: sqlalchemy<2.0
Requires-Dist: stack-data==0.6.3
Requires-Dist: sympy==1.12
Requires-Dist: terminado==0.18.1
Requires-Dist: threadpoolctl==3.4.0
Requires-Dist: tinycss2==1.3.0
Requires-Dist: torch==2.3.0
Requires-Dist: tornado==6.4
Requires-Dist: tqdm==4.66.2
Requires-Dist: traitlets==5.14.3
Requires-Dist: types-python-dateutil==2.9.0.20240316
Requires-Dist: typing-extensions>=4.12.0
Requires-Dist: tzdata==2024.1
Requires-Dist: umap==0.1.1
Requires-Dist: uri-template==1.3.0
Requires-Dist: urllib3==2.2.1
Requires-Dist: useful-rdkit-utils>=0.3.12
Requires-Dist: wcwidth==0.2.13
Requires-Dist: webcolors==1.13
Requires-Dist: webencodings==0.5.1
Requires-Dist: websocket-client==1.8.0
Requires-Dist: wheel==0.41.2
Requires-Dist: widgetsnbextension==4.0.10
Requires-Dist: xgboost==2.0.3
Dynamic: license-file

# BenchAudit

[![CI](https://github.com/sieber-lab/bench/actions/workflows/ci.yml/badge.svg)](https://github.com/sieber-lab/bench/actions/workflows/ci.yml)
[![Publish to PyPI](https://github.com/sieber-lab/bench/actions/workflows/publish-pypi.yml/badge.svg)](https://github.com/sieber-lab/bench/actions/workflows/publish-pypi.yml)
[![PyPI version](https://img.shields.io/pypi/v/benchaudit.svg)](https://pypi.org/project/benchaudit/)
[![Python versions](https://img.shields.io/pypi/pyversions/benchaudit.svg)](https://pypi.org/project/benchaudit/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

BenchAudit is a lightweight pipeline for auditing molecular property and drug–target interaction benchmarks. It standardizes SMILES strings, checks split hygiene, surfaces label conflicts and activity cliffs, and can run simple baseline models. Outputs are machine‑readable summaries and drill‑down tables you can inspect or feed into other tools.

## Features
- Config‑driven analysis of tabular, TDC, Polaris, and DTI datasets.
- SMILES standardization with optional REOS alerts and configurable fingerprint settings.
- Split hygiene reports: duplicates, cross‑split contamination, and nearest‑neighbor similarity.
- Conflict and activity‑cliff detection for classification and regression tasks.
- DTI extras: sequence normalization, cross‑split pair conflicts, and EMBOSS `stretcher` alignment summaries.
- Optional simple baselines for quick performance sanity checks.
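
To make the hygiene terms above concrete, here is a minimal plain-Python sketch of cross-split contamination, label conflicts, and Tanimoto similarity. It is an illustration only, not BenchAudit's implementation: the function names, the `(smiles, split, label)` record layout, and the toy data are assumptions, and the SMILES are assumed to be standardized already so that string equality means "same molecule".

```python
from collections import defaultdict


def cross_split_duplicates(records):
    """Map each SMILES that appears in more than one split to those splits."""
    splits_by_smiles = defaultdict(set)
    for smiles, split, _label in records:
        splits_by_smiles[smiles].add(split)
    return {s: sorted(v) for s, v in splits_by_smiles.items() if len(v) > 1}


def label_conflicts(records):
    """Map each SMILES that carries more than one distinct label to those labels."""
    labels_by_smiles = defaultdict(set)
    for smiles, _split, label in records:
        labels_by_smiles[smiles].add(label)
    return {s: sorted(v) for s, v in labels_by_smiles.items() if len(v) > 1}


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


# Hypothetical toy records: (standardized SMILES, split, label).
records = [
    ("CCO", "train", 1),
    ("CCO", "test", 0),
    ("c1ccccc1", "train", 0),
    ("CC(=O)O", "test", 1),
]

leaks = cross_split_duplicates(records)   # ethanol appears in train and test
conflicts = label_conflicts(records)      # ethanol also has two different labels
similarity = tanimoto({1, 2, 3}, {2, 3, 4})
```

In a real audit the fingerprints would come from a cheminformatics toolkit and the nearest-neighbor similarity would be computed between splits; the sets of integers here only stand in for on-bit indices.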

## Installation

### From PyPI
Install the published package:

```bash
pip install benchaudit
```

or with `uv`:

```bash
uv pip install benchaudit
```

### From source with `uv`
BenchAudit uses a standard `pyproject.toml`. The quickest source setup is with [`uv`](https://docs.astral.sh/uv/), run from the root of a repository checkout:

```bash
# 1) Create a virtual environment
uv venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# 2) Install dependencies declared in pyproject.toml
uv sync
```

If you need optional sequence alignment support, install EMBOSS so `stretcher` is available (e.g., `sudo apt install emboss` on Debian/Ubuntu).

## Automated PyPI publishing
This repo includes `.github/workflows/publish-pypi.yml` for automated releases.

1. In PyPI, configure a Trusted Publisher for this GitHub repository and workflow file (`.github/workflows/publish-pypi.yml`), using environment `pypi`.
2. Bump `project.version` in `pyproject.toml`.
3. Create and push a tag `vX.Y.Z` matching that version (for example `v0.1.1`).
4. GitHub Actions builds with `uv build` and publishes to PyPI automatically when the repository visibility is `public` (publishing is skipped while private).

Detailed release and install documentation: [`docs/publishing_and_installation.md`](docs/publishing_and_installation.md)

## References
- Package on PyPI: <https://pypi.org/project/benchaudit/>
- Publish workflow: [`.github/workflows/publish-pypi.yml`](.github/workflows/publish-pypi.yml)
- CI workflow: [`.github/workflows/ci.yml`](.github/workflows/ci.yml)
- `uv` docs: <https://docs.astral.sh/uv/>
- PyPI Trusted Publishers: <https://docs.pypi.org/trusted-publishers/>

## Usage
The main entry point is `run.py`, which consumes one or more YAML configs and writes results under `runs/` by default. After `uv sync`, you can call it via `uv run python run.py ...` or the installed console scripts:
- `uv run benchaudit ...` (primary)
- `uv run bench ...` (legacy alias)

```bash
# Analyze all configs in a folder
uv run python run.py --configs configs --out-root runs
# or: uv run benchaudit --configs configs --out-root runs

# Analyze a single config and train baselines
uv run python run.py --config configs/example.yml --benchmark
# or: uv run benchaudit --config configs/example.yml --benchmark
```

Outputs per config:
- `summary.json`: split sizes, hygiene counts, similarity and conflict statistics.
- `records.csv`: per-row view with cleaned SMILES, labels, and split tags.
- `conflicts.jsonl`: detailed conflict rows.
- `cliffs.jsonl`: detailed activity cliff rows.
- `sequence_alignments.jsonl`: (DTI only) top alignments between splits.
- `performance.json`: (when `--benchmark`) baseline model metrics and predictions.
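
Because the artifacts are plain JSON, CSV, and JSON Lines files, they can be loaded with the standard library alone. The sketch below collects whichever of the files listed above exist in a given directory. The function names are hypothetical, the exact per-config directory layout under `runs/` is an assumption, and the keys and columns inside each file are not documented here, so the code does not rely on any of them.

```python
import csv
import json
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def load_run_artifacts(run_dir):
    """Load the artifacts that are present in one output directory."""
    run_dir = Path(run_dir)
    artifacts = {}

    # JSON artifacts: summary.json, plus performance.json when --benchmark was used.
    for name in ("summary", "performance"):
        path = run_dir / f"{name}.json"
        if path.exists():
            artifacts[name] = json.loads(path.read_text(encoding="utf-8"))

    # records.csv: one dict per row, with all values read as strings.
    records_path = run_dir / "records.csv"
    if records_path.exists():
        with records_path.open(newline="", encoding="utf-8") as handle:
            artifacts["records"] = list(csv.DictReader(handle))

    # JSON Lines artifacts: one JSON object per line.
    for name in ("conflicts", "cliffs", "sequence_alignments"):
        path = run_dir / f"{name}.jsonl"
        if path.exists():
            artifacts[name] = read_jsonl(path)

    return artifacts
```

For example, `load_run_artifacts("runs/example")` would return a dictionary keyed by artifact name, where `runs/example` is a placeholder for wherever a given config's outputs were written.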

## Project layout
- `run.py`: CLI runner that loads configs, builds loaders/analyzers, and writes artifacts.
- `utils/`: loaders, analyzers, baseline helpers, and logging utilities.
- `configs/`: example YAML configurations for supported datasets.
- `data/`, `runs/`: expected input data and output locations (not tracked in version control).

## Development
- Code style: keep changes simple, PEP 8-ish. Add short docstrings for public functions.
- Typing: prefer explicit, lightweight type hints when types are clear.
- Tests: run `python -m unittest discover -s tests -p "test_*.py"` (or `pytest tests` if pytest is installed).
- Test data: tiny dummy benchmark datasets live under `tests/data/`.
- Benchmark/analysis docs: run `python scripts/generate_benchmark_analysis_class_docs.py --output docs/benchmark_and_analysis_class_reference.md` to regenerate the class reference; CI enforces freshness via `.github/workflows/benchmark-analysis-docs.yml`.
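- Optional features: Polaris datasets use `polaris-lib` and sequence alignment uses `pairwise-sequence-alignment`, both of which are declared as regular package dependencies; alignment also needs the EMBOSS binaries, which must be installed separately.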
