Metadata-Version: 2.4
Name: raw2features
Version: 0.1.0
Summary: Read OME-Zarr whole-slide images and emit patch- and slide-level foundation-model embeddings - storage backend and model independently swappable.
Keywords: digital-pathology,whole-slide-image,ome-zarr,ome-ngff,foundation-models,embeddings,feature-extraction
Author: Craig Myles
Author-email: Craig Myles <me@craig.im>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Image Processing
Classifier: Topic :: Scientific/Engineering :: Medical Science Apps.
Requires-Dist: numpy>=2.0
Requires-Dist: pydantic>=2.7
Requires-Dist: pyyaml>=6.0
Requires-Dist: typer>=0.12
Requires-Dist: tqdm>=4.66
Requires-Dist: jsonschema>=4.18
Requires-Dist: raw2features[zarr,image,torch,models,stain] ; extra == 'all'
Requires-Dist: psutil>=5.9 ; extra == 'benchmark'
Requires-Dist: pynvml>=11.5 ; extra == 'benchmark'
Requires-Dist: raw2features[torch,models] ; extra == 'chief'
Requires-Dist: gdown>=5.2 ; extra == 'chief'
Requires-Dist: raw2features[torch,models] ; extra == 'conch'
Requires-Dist: raw2features[torch,models] ; extra == 'gigapath-slide'
Requires-Dist: raw2features[torch] ; extra == 'grandqc'
Requires-Dist: segmentation-models-pytorch>=0.3 ; extra == 'grandqc'
Requires-Dist: timm>=1.0 ; extra == 'grandqc'
Requires-Dist: h5py>=3.0 ; extra == 'h5'
Requires-Dist: transformers<5 ; extra == 'hibou'
Requires-Dist: opencv-python-headless>=4.9 ; extra == 'image'
Requires-Dist: shapely>=2.0 ; extra == 'image'
Requires-Dist: pillow>=11.0 ; extra == 'image'
Requires-Dist: raw2features[torch,models] ; extra == 'kronos'
Requires-Dist: tifffile>=2024 ; extra == 'kronos'
Requires-Dist: imagecodecs ; extra == 'kronos'
Requires-Dist: raw2features[torch,models] ; extra == 'madeleine'
Requires-Dist: timm>=1.0 ; extra == 'models'
Requires-Dist: huggingface-hub>=0.24 ; extra == 'models'
Requires-Dist: einops>=0.7 ; extra == 'models'
Requires-Dist: transformers>=4.40 ; extra == 'models'
Requires-Dist: einops-exts>=0.0.4 ; extra == 'models'
Requires-Dist: raw2features[torch,models] ; extra == 'musk'
Requires-Dist: fairscale ; extra == 'musk'
Requires-Dist: open-clip-torch ; extra == 'musk'
Requires-Dist: raw2features[torch,models] ; extra == 'open-clip'
Requires-Dist: open-clip-torch>=2.20 ; extra == 'open-clip'
Requires-Dist: raw2features[torch,models] ; extra == 'prism'
Requires-Dist: environs ; extra == 'prism'
Requires-Dist: sacremoses ; extra == 'prism'
Requires-Dist: s3fs>=2024.0 ; extra == 's3'
Requires-Dist: raw2features[torch,models] ; extra == 'seal'
Requires-Dist: peft ; extra == 'seal'
Requires-Dist: scanpy ; extra == 'seal'
Requires-Dist: spatialdata>=0.2 ; extra == 'spatialdata'
Requires-Dist: geopandas>=0.14 ; extra == 'spatialdata'
Requires-Dist: scikit-learn>=1.3 ; extra == 'stain'
Requires-Dist: pandas>=2.2 ; extra == 'tables'
Requires-Dist: pyarrow>=17.0 ; extra == 'tables'
Requires-Dist: raw2features[torch,models] ; extra == 'tangle'
Requires-Dist: gdown>=5.2 ; extra == 'tangle'
Requires-Dist: torch>=2.2 ; extra == 'torch'
Requires-Dist: torchvision>=0.17 ; extra == 'torch'
Requires-Dist: zarr>=3.1 ; extra == 'zarr'
Requires-Dist: numcodecs>=0.15 ; extra == 'zarr'
Requires-Dist: ngff-zarr>=0.13 ; extra == 'zarr'
Requires-Dist: ome-zarr>=0.10 ; extra == 'zarr'
Requires-Dist: fsspec>=2024.0 ; extra == 'zarr'
Requires-Python: >=3.11
Project-URL: Homepage, https://github.com/CraigMyles/raw2features
Project-URL: Repository, https://github.com/CraigMyles/raw2features
Project-URL: Issues, https://github.com/CraigMyles/raw2features/issues
Provides-Extra: all
Provides-Extra: benchmark
Provides-Extra: chief
Provides-Extra: conch
Provides-Extra: gigapath-slide
Provides-Extra: grandqc
Provides-Extra: h5
Provides-Extra: hibou
Provides-Extra: image
Provides-Extra: kronos
Provides-Extra: madeleine
Provides-Extra: models
Provides-Extra: musk
Provides-Extra: open-clip
Provides-Extra: prism
Provides-Extra: s3
Provides-Extra: seal
Provides-Extra: spatialdata
Provides-Extra: stain
Provides-Extra: tables
Provides-Extra: tangle
Provides-Extra: torch
Provides-Extra: zarr
Description-Content-Type: text/markdown

# raw2features

<p align="center">
  <img src="assets/raw2features-diagram.svg" alt="raw2features: OME-Zarr whole-slide image in, patch-level and slide-level foundation-model embeddings out" width="100%">
</p>

Read a whole-slide image in **OME-Zarr / OME-NGFF** and emit **patch- and slide-level
foundation-model embeddings** - with storage backend and embedding models
independently swappable.

**Cloud-native and [FAIR](https://www.go-fair.org/fair-principles/):** slides read directly
from cloud storage, and each embedding carries the metadata needed to interpret and reuse it.

Think of it as `bioformats2raw` and `raw2ometiff`, but for features: point it at a raw OME-Zarr WSI,
choose from 30+ feature extractors (UNI/UNI2, Virchow/Virchow2, CONCH,
GigaPath, H-optimus, Phikon, CTransPath, …; full list in [MODELS.md](docs/MODELS.md)),
and get back a compact, self-describing `*.embeddings.zarr` with per-patch
coordinates, so that every embedding can be mapped back to its location on the slide.

> Status: alpha, under active development. Contributions welcome.

## What is OME-Zarr?

[**Zarr**](https://zarr.dev/) stores large N-dimensional arrays as chunked, compressed
pieces you can read individually - enabling you to stream just the region you need, directly from
cloud storage. [**OME-Zarr**](https://ngff.openmicroscopy.org/) (OME-NGFF) is the
bioimaging convention on top of Zarr: a multi-resolution pyramid plus standard metadata
(pixel size, axes, channels). It is a community-driven, widely adopted format for bioimaging.
The [BioImage Archive](https://www.ebi.ac.uk/bioimage-archive/) and
[IDR](https://idr.openmicroscopy.org/) are adopting it at scale: the IDR has
[migrated its infrastructure to OME-Zarr](https://forum.image.sc/t/idr-switchover-test-and-idr-migration/121370)
(image viewing and raw-pixel access are now served directly from OME-Zarr in public object
storage), and the BioImage Archive publishes whole-slide images in the same FAIR format.
raw2features reads OME-Zarr (local or remote) and writes its embeddings back out as a Zarr
store too.
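
To make the "read only the region you need" idea concrete, here is a minimal sketch that works out which chunks of a chunked 2D array overlap a requested region. It is plain Python, not the Zarr API: the helper name `chunks_for_region` and the 512 x 512 chunk size are assumptions chosen for illustration.

```python
from itertools import product


def chunks_for_region(y0, x0, height, width, chunk_h, chunk_w):
    """Return the (row, col) indices of every chunk that overlaps a region.

    The region starts at pixel (y0, x0) and spans height x width pixels.
    Hypothetical helper for illustration; Zarr does this bookkeeping for you.
    """
    if height <= 0 or width <= 0:
        return []
    first_row, last_row = y0 // chunk_h, (y0 + height - 1) // chunk_h
    first_col, last_col = x0 // chunk_w, (x0 + width - 1) // chunk_w
    return list(product(range(first_row, last_row + 1),
                        range(first_col, last_col + 1)))


# A 224 x 224 patch at (y=1000, x=2000) in an array stored as 512 x 512 chunks.
touched = chunks_for_region(1000, 2000, 224, 224, 512, 512)
print(touched)  # [(1, 3), (1, 4), (2, 3), (2, 4)]
```

Only the four chunks that overlap the patch need to be fetched and decompressed, however large the full array is.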

## Why

- **OME-Zarr in, embeddings out.** raw2features focuses on
  cloud-optimised, parallel-friendly NGFF reads → embeddings.
- **Exact MPP.** Patches are extracted at the requested microns per pixel
  (e.g. the default 0.5 µm/px at 224 px) by downsampling from the nearest finer pyramid
  level, so that embeddings are comparable across slides and
  datasets.
- **Modular implementation.** The `reader`, `segmenter`, `patcher`, `embedder` and `sink` are
  plugin seams exposed via Python entry-points: add a model or backend by
  shipping a package.
- **FAIR & provenance-first.** Each model's weights are pinned to an **immutable HuggingFace
  revision** (or a sha256-pinned URL), with preprocessing sourced from each
  model's card. Every output records that provenance plus a 1:1
  coords↔features mapping, so an embedding is reproducible and traceable to the exact
  weights that made it.
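
The "nearest finer pyramid level" rule from the Exact MPP point can be sketched as follows. This is a rough illustration and not the package's actual implementation: the function name `pick_level`, its signature, and the example pyramids are all assumptions.

```python
def pick_level(level_mpps, target_mpp, patch_px=224):
    """Pick the coarsest pyramid level that is still at least as fine as target_mpp.

    level_mpps: microns/pixel of each pyramid level, index 0 = full resolution.
    Returns (level index, downsample factor, pixels to read at that level).
    """
    finer = [(mpp, i) for i, mpp in enumerate(level_mpps) if mpp <= target_mpp]
    if not finer:
        raise ValueError("slide has no level as fine as the requested MPP")
    level_mpp, level = max(finer)           # largest MPP that is still <= target
    factor = target_mpp / level_mpp         # how much to shrink the read region
    read_px = round(patch_px * factor)      # source pixels behind one output patch
    return level, factor, read_px


# Hypothetical pyramid that contains the target exactly: no resampling needed.
print(pick_level([0.25, 0.5, 1.0, 2.0], target_mpp=0.5))

# Hypothetical scanner at 0.2428 um/px: read a larger region, then downsample.
print(pick_level([0.2428, 0.9712], target_mpp=0.5))
```

In the second case the sketch reads a 461 px region from level 0 and would resize it to 224 px, so the patch covers the same physical area as on a slide scanned at a different native resolution.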

## Install

```bash
pip install "raw2features[all]"     # full stack: OME-Zarr reads + segmentation + torch + models
pip install "raw2features[zarr]"    # lean: remote/Zarr reads only, no torch
pip install raw2features            # core only (bring your own reader/model extras)
```

Extras are composable - e.g. `raw2features[zarr,torch,models]`. The export bridges
(`spatialdata`, `h5`) stay opt-in; see [MODEL_LICENSES.md](docs/MODEL_LICENSES.md) and
[INTEROP.md](docs/INTEROP.md).

**Gated git-package encoders.** A few encoders (CONCH, KRONOS, MUSK) ship as gated,
non-PyPI git packages, so they install in two steps. The extra pulls the PyPI stack,
then one command installs the package itself:

```bash
pip install "raw2features[conch]"  && pip install git+https://github.com/Mahmoodlab/CONCH.git
pip install "raw2features[kronos]" && pip install git+https://github.com/mahmoodlab/KRONOS.git
pip install "raw2features[musk]"   && pip install git+https://github.com/lilab-stanford/MUSK
```

The same pattern covers the other gated encoders - mostly slide encoders (e.g. `madeleine`,
`gigapath_slide`, `seal`), a few with extra model-specific steps (a pinned fork, `flash-attn`,
or Drive-hosted weights). Each model's exact install steps are in its [`MODELS.md`](docs/MODELS.md) row
and in the matching extra's comment in `pyproject.toml`.

**Development** (from a clone, with [uv](https://docs.astral.sh/uv/)):

```bash
uv sync                      # core
uv sync --extra zarr --extra image --extra torch --extra models   # full stack
```

## Quickstart

With the stack installed (above):

```bash
raw2features sample sample.ome.zarr                          # synthetic slide
raw2features embed  sample.ome.zarr out/ -m resnet50 --device auto
```

`--device auto` picks CUDA → Apple MPS → CPU, so this runs anywhere. Tested on A100, L40S,
GB10, and CPU.
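
The CUDA → Apple MPS → CPU fallback order can be sketched as below. This only illustrates the ordering and is not the package's code: `resolve_device` and `detect_device` are hypothetical names, and the probe falls back to CPU when PyTorch is not installed.

```python
def resolve_device(requested, cuda_ok, mps_ok):
    """Resolve a device string, preferring CUDA, then Apple MPS, then CPU."""
    if requested != "auto":
        return requested          # an explicit choice is passed through unchanged
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_device(requested="auto"):
    """Probe PyTorch for available backends; fall back to CPU if torch is absent."""
    try:
        import torch
    except ImportError:
        return resolve_device(requested, cuda_ok=False, mps_ok=False)
    mps = getattr(torch.backends, "mps", None)   # absent on very old torch builds
    return resolve_device(
        requested,
        cuda_ok=torch.cuda.is_available(),
        mps_ok=mps is not None and mps.is_available(),
    )


print(detect_device())
```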

## Notebooks

Runnable tutorials live in [`notebooks/`](notebooks/). Start with the
[**visual walkthrough**](notebooks/02_visual_walkthrough.ipynb) - a real SurGen H&E slide
resolved from the BioImage Archive and taken **cloud-direct** (nothing downloaded) from
thumbnail → tissue segmentation → patch tiles → a ResNet-50 feature map of the slide, all
on CPU with no model access token. Its figures are pre-rendered on GitHub.

## Usage

**Full guide: [`docs/usage.md`](docs/usage.md)** - every command, what actually
happens under the hood (exact MPP, decode-once fan-out, output schema), the
rerun-safe / skip-if-complete model, thumbnails, and example SLURM cohort runs.

```bash
raw2features info slide.ome.zarr
raw2features embed slide.ome.zarr out/ \
    --model uni --model resnet50 \
    --mpp 1.0 --patch-size 224 --hf-token "$HF_TOKEN" \
    --emit-thumbnail                                  # optional QC thumbnail + overlay
raw2features list embedders

# Thumbnails can also be made standalone, before/after the embed run. By default
# they render at the segmentation MPP, so --overlay aligns the tissue mask + the
# kept-patch grid with no resampling (--thumbnail-mpp / --max-px to override).
raw2features thumbnail slide.ome.zarr out/ --overlay

# Optional post-hoc exports from the native out/slide.embeddings.zarr store:
# SpatialData for squidpy/napari, or HDF5 for TRIDENT/CLAM/TITAN/STAMP.
# These never re-compute embeddings; install [spatialdata] or [h5] as needed.
raw2features export-spatialdata out/slide.embeddings.zarr   # -> slide.spatialdata.zarr
raw2features export-h5 out/slide.embeddings.zarr --layout trident   # or --layout clam / stamp
```

## Output

```
<slide_id>.embeddings.zarr/
├── .zattrs                  # source, provenance + a grids index
└── grids/<mpp>_<px>/        # one per geometry (usually just one, e.g. mpp0.5_px224)
    ├── .zattrs              # this grid's full header (patching, models, provenance)
    ├── coords/              # (N,2) int32 level-0 (x,y) - 1:1 with every features/<model>
    ├── grid_index/          # (N,2) int32 (row,col)
    ├── mask/                # (rows,cols) uint8 fraction of each cell that is tissue, 0-255 (unless --no-seg)
    └── features/<model>/    # (N, dim) float16

<slide_id>.thumbnail.png            # optional (--emit-thumbnail / thumbnail cmd)
<slide_id>.thumbnail.overlay.png    # optional QC overlay: tissue tint + kept-patch grid

<slide_id>.spatialdata.zarr/        # optional - `export-spatialdata`, see docs/INTEROP.md
<slide_id>.h5                       # optional - `export-h5` (TRIDENT/STAMP), see docs/INTEROP.md
```
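
As a rough illustration of how the arrays in one grid relate, the sketch below looks up each kept patch's tissue fraction via `grid_index` and filters `coords` accordingly. Plain Python lists stand in for the Zarr arrays; the helper name, the example values, and the linear 0-255 → 0-1 scaling are assumptions.

```python
def tissue_fraction_per_patch(grid_index, mask):
    """Look up each kept patch's tissue fraction from the per-cell mask.

    grid_index: N (row, col) pairs, in the same order as coords and features.
    mask: rows x cols nested list of uint8 values, assumed 0-255 = 0-100% tissue.
    """
    return [mask[row][col] / 255 for row, col in grid_index]


# Hypothetical 2 x 3 grid in which three cells were kept as patches.
mask = [[0, 255, 128],
        [64, 0, 0]]
grid_index = [(0, 1), (0, 2), (1, 0)]
coords = [(448, 0), (896, 0), (0, 448)]      # level-0 (x, y), 1:1 with grid_index

assert len(coords) == len(grid_index)        # the documented 1:1 mapping
fractions = tissue_fraction_per_patch(grid_index, mask)
keep = [xy for xy, f in zip(coords, fractions) if f >= 0.5]
print(keep)  # [(448, 0), (896, 0)]
```

Because row `i` of `coords`, `grid_index` and every `features/<model>` array refers to the same patch, the same index selects the matching embedding.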

**Interop (optional export for supported packages):**
embeddings can be exported to scverse **SpatialData** (squidpy / napari-spatialdata) or to
pathology-MIL **HDF5** (TRIDENT/CLAM/TITAN, KatherLab STAMP). These are one-way export
bridges for feeding existing toolchains; for full FAIR provenance use the default
`.embeddings.zarr`. See [`INTEROP.md`](docs/INTEROP.md).

## Remote / cloud reads (no download)

Any command that takes a slide path also takes a **remote OME-Zarr URL** - the reader
opens `http(s)://`, `s3://`, `gs://`, etc. via `fsspec`/`zarr`, so the **whole pipeline
(segment → tile → embed) runs directly against a cloud store without downloading the
slide**. Needs the `[zarr]` extra (ships `fsspec`); `s3://` also needs `s3fs` (the `[s3]` extra) and `gs://` needs `gcsfs`.

```bash
# Extract straight from the EBI BioImage Archive - nothing lands on local disk.
raw2features embed \
  https://uk1s3.embassy.ebi.ac.uk/bia-integrator-data/S-BIAD1285/.../image.ome.zarr/0 \
  out/ -m uni --mpp 0.5 --read-block 16        # fewer, larger reads cut round-trips
```

Validated end-to-end against the EBI BioImage Archive. Remote reads are latency-bound, but
**in our read benchmark** (`raw2features benchmark`) a cold embed-once run (the normal case)
was only about 1.6x slower than local. GPU, segmentation, and write work dominate and
do not depend on where the slide lives, so the raw-read gap (around 16x on warm re-reads)
mostly disappears. On a slow store, `--read-block N` groups patches
into N×N reads to cut round-trips (bit-identical output; try 16 for remote, 8 for local), and 8
read workers was the sweet spot either way. For large cohorts, staging slides to local
storage is still faster. See [`docs/usage.md`](docs/usage.md) for the remote-read and
`--read-block` guidance.
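
To see why grouping patches into N×N reads cuts round-trips, the sketch below buckets patch grid cells into blocks and counts the resulting reads. It illustrates the idea only: the helper `group_into_blocks`, the grid-aligned bucketing, and the dense 32 x 32 grid are assumptions, not a description of how `--read-block` is implemented.

```python
from collections import defaultdict


def group_into_blocks(grid_index, n):
    """Group patch (row, col) cells into n x n blocks keyed by block position."""
    blocks = defaultdict(list)
    for row, col in grid_index:
        blocks[(row // n, col // n)].append((row, col))
    return dict(blocks)


# Hypothetical dense 32 x 32 patch grid (1024 patches).
patches = [(r, c) for r in range(32) for c in range(32)]
for n in (1, 8, 16):
    print(n, len(group_into_blocks(patches, n)))
# 1 1024
# 8 16
# 16 4
```

Each block becomes one larger read, so a latency-bound store sees far fewer requests, while every patch is still cut from the same pixels.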

## Licence

MIT - see [LICENSE](LICENSE). If you use raw2features, please cite it (see
[CITATION.cff](CITATION.cff)).

raw2features **does not ship model weights** and grants no rights to them. When using a
pretrained encoder please refer to **that model's own licence** (several are
non-commercial, e.g. CC-BY-NC-ND). See
[MODEL_LICENSES.md](docs/MODEL_LICENSES.md).
