Metadata-Version: 2.4
Name: annbatch
Version: 0.0.1
Summary: A minibatch loader for AnnData stores
Project-URL: Documentation, https://annbatch.readthedocs.io/
Project-URL: Homepage, https://github.com/scverse/annbatch
Project-URL: Source, https://github.com/scverse/annbatch
Author: Ilan Gold, Felix Fischer
Maintainer-email: Ilan Gold <ilan.gold@scverse.org>, Felix Fischer <felix.fischer@lamin.ai>
License: MIT License
        
        Copyright (c) 2025, Ilan Gold
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: <3.14,>=3.12
Requires-Dist: anndata[lazy]
Requires-Dist: dask
Requires-Dist: pandas
Requires-Dist: scipy>1.15
Requires-Dist: session-info2
Requires-Dist: tqdm
Requires-Dist: zarr>=3
Provides-Extra: cupy-cuda12
Requires-Dist: cupy-cuda12x; extra == 'cupy-cuda12'
Provides-Extra: cupy-cuda13
Requires-Dist: cupy-cuda13x; extra == 'cupy-cuda13'
Provides-Extra: dev
Requires-Dist: pre-commit; extra == 'dev'
Requires-Dist: twine>=4.0.2; extra == 'dev'
Provides-Extra: doc
Requires-Dist: docutils!=0.18.*,!=0.19.*,>=0.8; extra == 'doc'
Requires-Dist: ipykernel; extra == 'doc'
Requires-Dist: ipython; extra == 'doc'
Requires-Dist: myst-nb>=1.1; extra == 'doc'
Requires-Dist: pandas; extra == 'doc'
Requires-Dist: scanpydoc[theme,typehints]>=0.15.3; extra == 'doc'
Requires-Dist: sphinx-autodoc-typehints; extra == 'doc'
Requires-Dist: sphinx-book-theme>=1; extra == 'doc'
Requires-Dist: sphinx-copybutton; extra == 'doc'
Requires-Dist: sphinx-issues>=5.0.1; extra == 'doc'
Requires-Dist: sphinx-tabs; extra == 'doc'
Requires-Dist: sphinx>=8.1; extra == 'doc'
Requires-Dist: sphinxcontrib-bibtex>=1; extra == 'doc'
Requires-Dist: sphinxext-opengraph; extra == 'doc'
Provides-Extra: test
Requires-Dist: coverage; extra == 'test'
Requires-Dist: pytest; extra == 'test'
Requires-Dist: zarrs>=0.2.1; extra == 'test'
Provides-Extra: torch
Requires-Dist: torch; extra == 'torch'
Provides-Extra: zarrs
Requires-Dist: zarrs>=0.2.1; extra == 'zarrs'
Description-Content-Type: text/markdown

<!--Links at the top because this document is split for docs home page-->

[uv]: https://github.com/astral-sh/uv

[scverse discourse]: https://discourse.scverse.org/

[issue tracker]: https://github.com/scverse/annbatch/issues

[tests]: https://github.com/scverse/annbatch/actions/workflows/test.yaml

[documentation]: https://annbatch.readthedocs.io

[changelog]: https://annbatch.readthedocs.io/en/latest/changelog.html

[api documentation]: https://annbatch.readthedocs.io/en/latest/api.html

[pypi]: https://pypi.org/project/annbatch

[zarrs-python]: https://zarrs-python.readthedocs.io/

[lamin]: https://lamin.ai/

[scverse]: https://scverse.org/

[in-depth section of our docs]: https://annbatch.readthedocs.io/en/latest/#in-depth

# annbatch

> [!CAUTION]
> This package does not have a stable API.
> However, we do not anticipate the on-disk format to change in an incompatible manner.

[![Tests][badge-tests]][tests]
[![Documentation][badge-docs]][documentation]

[badge-tests]: https://img.shields.io/github/actions/workflow/status/scverse/annbatch/test.yaml?branch=main

[badge-docs]: https://img.shields.io/readthedocs/annbatch

A data loader and I/O utilities for minibatching on-disk AnnData, co-developed by [lamin][] and [scverse][].

## Getting started

Please refer to the [documentation][], in particular the [API documentation][].

## Installation

You need Python 3.12 or 3.13 installed on your system.
If you don't have Python installed, we recommend installing [uv][].

To install the latest release of `annbatch` from [PyPI][]:

```bash
pip install annbatch
```

We provide extras (declared in `pyproject.toml`) for `torch`, `cupy-cuda12`, `cupy-cuda13`, and `zarrs` (which installs [zarrs-python][]).
`cupy` provides accelerated handling of the data via `preload_to_gpu` once it has been read off disk and does not need to be used in conjunction with `torch`.

> [!IMPORTANT]
> [zarrs-python][] gives the necessary performance boost for the sharded data produced by our preprocessing functions to be useful when loading data off a local filesystem.
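
Extras are requested with the usual pip syntax, for example `pip install "annbatch[zarrs,torch]"`. To see which optional modules ended up in your environment, a minimal sketch along these lines may help. The helper name `available_extras` is hypothetical and not part of `annbatch`; the import names assume that the `cupy-cuda12x` and `cupy-cuda13x` wheels both install the `cupy` module.

```python
from importlib.util import find_spec

# Import names for the optional extras (assumed mapping, see the note above).
# `zarrs` matches the "zarrs.ZarrsCodecPipeline" path used in the usage examples.
OPTIONAL_MODULES = {"torch": "torch", "cupy": "cupy", "zarrs": "zarrs"}


def available_extras() -> dict[str, bool]:
    """Report which optional modules can be found, without importing them."""
    return {
        name: find_spec(module) is not None
        for name, module in OPTIONAL_MODULES.items()
    }


print(available_extras())
```

A `False` entry for `zarrs` suggests the `zarrs` extra is missing, in which case the `zarr.config.set(...)` call in the examples would point at a codec pipeline that cannot be imported.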

## Basic usage example

Basic preprocessing:
```python
import zarr

from annbatch import create_anndata_collection

# Using zarrs is necessary for local filesystem performance.
# Install it via our `zarrs` extra, i.e., `pip install annbatch[zarrs]`, to get the right version.
zarr.config.set(
    {"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}
)

create_anndata_collection(
    adata_paths=[
        "path/to/your/file1.h5ad",
        "path/to/your/file2.h5ad"
    ],
    output_path="path/to/output/collection", # a directory containing `dataset_{i}.zarr`
    shuffle=True,  # shuffling is needed if you want to use chunked access
)
```

Data loading:

```python
from pathlib import Path

from annbatch import ZarrSparseDataset
import anndata as ad
import zarr

# Using zarrs is necessary for local filesystem performance.
# Install it via our `zarrs` extra, i.e., `pip install annbatch[zarrs]`, to get the right version.
zarr.config.set(
    {"codec_pipeline.path": "zarrs.ZarrsCodecPipeline"}
)

ds = ZarrSparseDataset(
    batch_size=4096,
    chunk_size=32,
    preload_nchunks=256,
).add_anndatas(
    [
        ad.AnnData(
            # note that you can open an AnnData file using any type of zarr store
            X=ad.io.sparse_dataset(zarr.open(p)["X"]),
            obs=ad.io.read_elem(zarr.open(p)["obs"]),
        )
        for p in Path("path/to/output/collection").glob("*.zarr")
    ],
    obs_keys="label_column",
)

# Iterate over the data loader (drop-in replacement for torch.utils.data.DataLoader)
for batch in ds:
    ...
```
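
To get a feel for how the three loader parameters relate, the arithmetic below may be useful. It rests on an assumption drawn only from the parameter names, not on confirmed library behavior: each preload gathers `preload_nchunks` contiguous runs of `chunk_size` rows, and batches of `batch_size` rows are then cut from that pool.

```python
# Values copied from the data-loading example above.
batch_size = 4096
chunk_size = 32
preload_nchunks = 256

# Assumption: one preload holds `preload_nchunks` contiguous runs of
# `chunk_size` rows each, and batches are drawn from that pool.
rows_per_preload = chunk_size * preload_nchunks
batches_per_preload, leftover_rows = divmod(rows_per_preload, batch_size)

print(f"{rows_per_preload} rows per preload")
print(f"{batches_per_preload} full batches, {leftover_rows} rows left over")
```

Under that reading, the example settings hold 8192 rows in memory at a time and split them evenly into two batches. Please check the [API documentation][] for the exact semantics before tuning these values.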

<!--TODO: proper intersphinx and/or migrate note-->

For usage of our loader inside of `torch`, please see [this note](https://annbatch.readthedocs.io/en/latest/#user-configurable-sampling-strategy). At a minimum, be aware that deadlocking will occur on Linux unless you pass `multiprocessing_context="spawn"` to the `DataLoader`.
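
As a hedged sketch of the `spawn` advice, the helper below builds keyword arguments for `torch.utils.data.DataLoader`. The function name `spawn_loader_kwargs` is hypothetical and not part of `annbatch` or `torch`. It only sets `multiprocessing_context` when worker processes are requested, because `DataLoader` accepts that argument only together with `num_workers > 0`.

```python
def spawn_loader_kwargs(num_workers: int) -> dict[str, object]:
    """Build DataLoader keyword arguments that use the "spawn" start method."""
    kwargs: dict[str, object] = {"num_workers": num_workers}
    if num_workers > 0:
        # Only meaningful (and only accepted by DataLoader) with worker processes.
        kwargs["multiprocessing_context"] = "spawn"
    return kwargs


print(spawn_loader_kwargs(4))

# Illustrative use, assuming `torch` is installed and `ds` is the dataset
# built above; see the linked note for the other arguments to pass:
#   from torch.utils.data import DataLoader
#   loader = DataLoader(ds, **spawn_loader_kwargs(4))
```

With the `spawn` start method, worker code is imported fresh in each process, so a script that creates the `DataLoader` should do so under an `if __name__ == "__main__":` guard.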

<!--HEADER-->

For a deeper dive into this example, please see the [in-depth section of our docs][].

<!--FOOTER-->
## Release notes

See the [changelog][].

## Contact

For questions and help requests, you can reach out in the [scverse discourse][].
If you found a bug, please use the [issue tracker][].
