Metadata-Version: 2.4
Name: msdatasets
Version: 0.1.2
Summary: A unified dataset framework for mass spectrometry
Project-URL: Homepage, https://chrisagrams.github.io/msdatasets
Project-URL: Repository, https://github.com/chrisagrams/msdatasets
Author-email: Chris Grams <chrisagrams@gmail.com>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Requires-Dist: httpx-sse>=0.4
Requires-Dist: httpx>=0.27
Requires-Dist: mscompress>=1.0.13
Requires-Dist: mstransfer>=0.2.3
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=13.0
Provides-Extra: all
Requires-Dist: torch>=2.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: mypy>=1.15; extra == 'dev'
Requires-Dist: pre-commit>=4.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=1.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.9; extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.5; extra == 'docs'
Requires-Dist: mkdocs>=1.6; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.24; extra == 'docs'
Provides-Extra: torch
Requires-Dist: torch>=2.0; extra == 'torch'
Description-Content-Type: text/markdown

# msdatasets

[![CI](https://github.com/chrisagrams/msdatasets/actions/workflows/ci.yml/badge.svg)](https://github.com/chrisagrams/msdatasets/actions/workflows/ci.yml)
[![codecov](https://codecov.io/gh/chrisagrams/msdatasets/graph/badge.svg)](https://codecov.io/gh/chrisagrams/msdatasets)
[![PyPI version](https://badge.fury.io/py/msdatasets.svg)](https://pypi.org/project/msdatasets/)

A unified dataset framework for mass spectrometry.

`msdatasets` is a Python client and CLI for downloading mass spectrometry
datasets from the msdatasets server. Datasets are fetched by server UUID or
by repository accession (PRIDE, MassIVE), cached on disk, and optionally
loaded as a PyTorch `Dataset` for training pipelines.

## Features

- Download by server UUID or by PRIDE / MassIVE accession; the server
  imports and converts remote projects on demand
- Choose the on-disk format per download: `mszx` (raw archive), `msz`
  (inner compressed MS data), or `mzml` (fully decompressed)
- Parallel downloads with a live progress bar
- Filename subsets via `accession[file1.raw,file2.mzML]` syntax
- Server-side extraction is tracked over Server-Sent Events (SSE) until files
  are ready
- Optional PyTorch integration via the `torch` extra
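
The spec syntax listed above (a server UUID, or `repository/accession` with an optional `[file1,file2]` suffix) can be illustrated with a small parser. This is a minimal sketch, not the library's actual implementation: `DatasetSpec`, `parse_spec`, and the regular expression are hypothetical names introduced here, and the real client may accept or reject inputs differently.

```python
import re
import uuid
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical pattern: "<repository>/<accession>" plus an optional "[a,b,...]" suffix.
_SPEC_RE = re.compile(
    r"^(?P<repo>[A-Za-z]+)/(?P<accession>[^\[\]/]+)(?:\[(?P<files>[^\[\]]*)\])?$"
)


@dataclass
class DatasetSpec:
    """Illustrative container; not part of the msdatasets API."""

    uuid: Optional[str] = None
    repository: Optional[str] = None
    accession: Optional[str] = None
    filenames: List[str] = field(default_factory=list)


def parse_spec(spec: str) -> DatasetSpec:
    """Split a dataset spec into a UUID or a repository/accession/filenames triple."""
    # A spec that parses as a UUID is treated as a server UUID.
    try:
        return DatasetSpec(uuid=str(uuid.UUID(spec)))
    except ValueError:
        pass

    match = _SPEC_RE.match(spec)
    if match is None:
        raise ValueError(f"unrecognized dataset spec: {spec!r}")

    # Filenames are comma-separated; surrounding whitespace and empty entries are dropped.
    raw_files = match.group("files") or ""
    filenames = [name.strip() for name in raw_files.split(",") if name.strip()]
    return DatasetSpec(
        repository=match.group("repo").lower(),
        accession=match.group("accession"),
        filenames=filenames,
    )
```

Under these assumptions, `parse_spec("pride/PXD075509[19HCD_3.mzML]")` yields the repository `pride`, the accession `PXD075509`, and a single filename, while a spec without brackets yields an empty filename list.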

## Installation

```bash
pip install msdatasets              # base install
pip install 'msdatasets[torch]'     # with PyTorch integration
```
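
Because PyTorch is an optional extra, code that runs against both install variants may want to check for it before using the PyTorch integration. The following is a generic standard-library sketch; `has_torch` is a hypothetical helper, not something `msdatasets` provides.

```python
import importlib.util


def has_torch() -> bool:
    """Return True if the optional `torch` package can be imported."""
    # find_spec looks the module up without importing it, so the check is cheap.
    return importlib.util.find_spec("torch") is not None


if __name__ == "__main__":
    if has_torch():
        print("PyTorch found: the torch extra is usable")
    else:
        print("PyTorch missing: install with pip install 'msdatasets[torch]'")
```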

## Quick start

### CLI

```bash
# By server UUID
msdatasets download 550e8400-e29b-41d4-a716-446655440000

# From a PRIDE project
msdatasets download pride/PXD075509

# Subset of files, stored as mzML (quoted so the shell does not glob the brackets)
msdatasets download 'pride/PXD075509[19HCD_3.mzML]' --store-as mzml

# Write directly to a directory instead of the shared cache
msdatasets download massive/MSV000101460 -o ./my-data
```

### Python

```python
from msdatasets import download_dataset, download_repo_dataset

# By UUID
ds = download_dataset("550e8400-e29b-41d4-a716-446655440000")
print(ds.dataset_name, len(ds), "files")
for path in ds:  # iterate over the downloaded files
    print(path)

# By PRIDE accession (filename subset, stored as mzML)
ds = download_repo_dataset(
    "pride",
    "PXD075509",
    filenames=["19HCD_3.mzML"],
    store_as="mzml",
)
```

### PyTorch

```python
from msdatasets import load_dataset

# Returns an mscompress.datasets.torch.MSCompressDataset.
# Accepts UUIDs and repository specs.
dataset = load_dataset("pride/PXD075509[19HCD_3.mzML]")
```

## Configuration

| Environment variable | Purpose                                      | Default                   |
|----------------------|----------------------------------------------|---------------------------|
| `MS_API_URL`         | Server base URL                              | `https://datasets.lab.gy` |
| `MS_DATASETS_CACHE`  | Explicit cache directory                     | —                         |
| `MS_HOME`            | Alternative cache root (`$MS_HOME/datasets`) | `~/.ms`                   |

Full CLI reference, storage-format details, and Python API are in the
[documentation](https://chrisagrams.github.io/msdatasets/).
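
The two cache variables in the table suggest a lookup order, which can be sketched as follows. The precedence shown (an explicit `MS_DATASETS_CACHE` wins, otherwise `$MS_HOME/datasets`, with `MS_HOME` defaulting to `~/.ms`) is an assumption inferred from the table, and `resolve_cache_dir` is a hypothetical helper rather than a function exported by the package.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Pick a cache directory from environment variables (assumed precedence)."""
    env = os.environ if env is None else env

    # Assumption: an explicit cache directory takes priority and is used as-is.
    explicit = env.get("MS_DATASETS_CACHE")
    if explicit:
        return Path(explicit).expanduser()

    # Otherwise fall back to "<MS_HOME>/datasets", with MS_HOME defaulting to ~/.ms.
    home = env.get("MS_HOME") or "~/.ms"
    return Path(home).expanduser() / "datasets"


if __name__ == "__main__":
    print(resolve_cache_dir())
```

Passing a plain dictionary instead of `os.environ` makes the lookup easy to exercise without touching the real environment.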

## Development

```bash
git clone https://github.com/chrisagrams/msdatasets.git
cd msdatasets
uv sync --extra dev --extra docs
uv run pre-commit install
uv run pytest
```

Pre-commit runs `ruff`, `mypy`, and `pytest` (90% coverage gate). CI runs on
Python 3.10, 3.11, and 3.12.

## License

MIT
