Metadata-Version: 2.4
Name: stereosegger
Version: 0.1.1
Summary: Fast and accurate cell segmentation for single-molecule spatial omics (Stereo-seq)
Author-email: Elyas Heidari <elyas.heidari@dkfz-heidelberg.de>
License: MIT
Project-URL: bug_tracker, https://github.com/nrclaudio/stereosegger/issues
Project-URL: source_code, https://github.com/nrclaudio/stereosegger
Project-URL: repository, https://github.com/nrclaudio/stereosegger
Keywords: segmentation,deep learning,pytorch,geometric deep learning,stereo-seq
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch==2.5.1
Requires-Dist: torchvision==0.20.1
Requires-Dist: nvidia-nvjitlink-cu12>=12.4.127
Requires-Dist: distributed==2024.7.1
Requires-Dist: numpy>=1.24.4
Requires-Dist: pandas==2.1.4
Requires-Dist: scipy>=1.7.0
Requires-Dist: matplotlib>=3.4.0
Requires-Dist: seaborn>=0.11.0
Requires-Dist: tqdm>=4.61.0
Requires-Dist: lightning>=1.9.0
Requires-Dist: torchmetrics>=0.5.0
Requires-Dist: zarr<3.0.0,>=2.6.1
Requires-Dist: anndata==0.10.8
Requires-Dist: scanpy==1.10.2
Requires-Dist: squidpy==1.4.1
Requires-Dist: adjustText>=0.8
Requires-Dist: scikit-learn>=0.24.0
Requires-Dist: geopandas<1.0,>=0.9.0
Requires-Dist: shapely>=1.7.0
Requires-Dist: path>=17.0.0
Requires-Dist: pyarrow>=16.1.0
Requires-Dist: dask==2024.7.1
Requires-Dist: dask_geopandas>=0.4.0
Requires-Dist: dask-cuda>=23.10.0
Requires-Dist: torch-geometric>=2.3.0
Requires-Dist: pqdm>=0.2.0
Requires-Dist: rtree>=1.3.0
Requires-Dist: cudf-cu12<24.9.0,>=24.8.0
Requires-Dist: cuml-cu12<24.9.0,>=24.8.0
Requires-Dist: cugraph-cu12<24.9.0,>=24.8.0
Requires-Dist: cuspatial-cu12<24.9.0,>=24.8.0
Requires-Dist: cupy-cuda12x>=12.0.0
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: flake8; extra == "dev"
Requires-Dist: twine>=4.0.2; extra == "dev"
Provides-Extra: tests
Requires-Dist: pytest; extra == "tests"
Requires-Dist: coverage; extra == "tests"
Dynamic: license-file

# StereoSegger: Fast and Accurate Cell Segmentation for Spatial Omics

> **Note:** This project is heavily inspired by the original **Segger** implementation by Elyas Heidari. You can find the original repository at [EliHei2/segger_dev](https://github.com/EliHei2/segger_dev).

## Installation

StereoSegger requires **CUDA 12** for GPU acceleration; the pinned dependencies target **CUDA 12.4**.

### Option 1: Automated Setup (Recommended)

We provide a setup script that handles the complex dependency chain (PyTorch 2.5.1, RAPIDS 24.08, CUDA 12.4) automatically. This is the **most reliable method** to ensure GPU acceleration works.

```bash
# Clone the repository
git clone https://github.com/nrclaudio/stereosegger.git
cd stereosegger

# Run the setup script (requires Conda installed)
bash scripts/setup_segger_env.sh

# Activate the environment
conda activate segger_env
```

### Option 2: Pip Install (Advanced)

If you are managing your own CUDA environment, you can install via pip. Note that you **must** include the NVIDIA and PyTorch indices to get the correct GPU-accelerated wheels.

```bash
pip install stereosegger \
  --extra-index-url https://pypi.nvidia.com \
  --extra-index-url https://download.pytorch.org/whl/cu124
```
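
After either installation route, it can help to confirm which GPU wheels were actually resolved. The sketch below is one way to do that using only the standard library. The distribution names come from this package's declared dependencies; the helper name `report_gpu_stack` is hypothetical.

```python
from importlib import metadata

# Distribution names as declared in the package requirements.
GPU_STACK = ["torch", "torchvision", "cudf-cu12", "cuml-cu12", "cupy-cuda12x"]


def report_gpu_stack(names=GPU_STACK):
    """Return {distribution name: installed version, or None if missing}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_gpu_stack().items():
        print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

A `None` entry means the distribution is not installed in the active environment. To confirm that PyTorch can see the GPU at runtime, `torch.cuda.is_available()` should return `True`.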

---

## Inputs & Outputs

### 1. Inputs

StereoSegger primarily operates on **Parquet** files derived from standard spatial formats.

#### A. Raw Input (SAW Output)
- **Format:** `h5ad` (AnnData)
- **Source:** Output from the SAW pipeline (Stereo-seq Analysis Workflow).
- **Requirements:**
    - `.X`: Sparse matrix of gene counts.
    - `.obsm['spatial']`: (x, y) coordinates of the bins.
    - `.var`: Index must contain unique gene names.

#### B. Processed Input (StereoSegger Native)
If you are skipping the conversion step, provide a directory containing:
- **`transcripts.parquet`**: Long-form table of gene-location occurrences (`transcript_id`, `gene_id`, `x`, `y`, `count`, `bx`, `by`).
- **`genes.parquet`**: Mapping of `gene_id` to `gene_name`.
- **`boundaries.parquet`** (Optional): WKB-encoded polygons (e.g., nuclei masks).
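
To illustrate how the long-form `transcripts.parquet` layout relates to a bin-by-gene count matrix, here is a minimal sketch in plain Python. It is not the converter shipped with StereoSegger. The function name `to_long_form` is hypothetical, and deriving `bx`/`by` as integer grid indices from `x`/`y` and the bin pitch is an assumption.

```python
def to_long_form(counts, coords, bin_pitch=1.0, min_count=1):
    """Flatten per-bin gene counts into long-form rows.

    counts: list of dicts {gene_id: count}, one dict per bin.
    coords: list of (x, y) tuples, one per bin, same order as `counts`.
    """
    rows = []
    for (x, y), gene_counts in zip(coords, counts):
        # ASSUMPTION: bx/by are integer grid indices derived from the bin pitch.
        bx = int(round(x / bin_pitch))
        by = int(round(y / bin_pitch))
        for gene_id, count in sorted(gene_counts.items()):
            if count < min_count:
                continue  # drop entries below the count threshold
            rows.append(
                {
                    "transcript_id": len(rows),  # running row index
                    "gene_id": gene_id,
                    "x": x,
                    "y": y,
                    "count": count,
                    "bx": bx,
                    "by": by,
                }
            )
    return rows
```

Each nonzero (bin, gene) entry becomes one row, so the table grows with the number of nonzero entries rather than with the total UMI count.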

---

### 2. Outputs

The pipeline produces three main types of output files, depending on the stage and your configuration.

#### A. Segmentation Results (`.h5ad`) - **Recommended**
The primary output for downstream analysis. Generated when `file_format=anndata`.
- **Expression Matrix (`X`)**: A sparse matrix of shape `(n_cells, n_genes)` containing UMI counts.
- **Cell Metadata (`obs`)**:
    - `transcripts`: Total UMI count per cell.
    - `unique_transcripts`: Number of unique genes detected in the cell.
    - `cell_centroid_x`, `cell_centroid_y`: Spatial center of the segmented cell.
    - `cell_area`: Area of the cell (computed via its convex hull).
- **Gene Metadata (`var`)**:
    - `total_assigned`: Number of transcripts of this gene assigned to cells.
    - `total_unassigned`: Number of transcripts of this gene that remain unassigned.
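
The per-cell fields above can be illustrated with a small standard-library sketch. It assumes each row already combines a transcript's `gene_id`, `x`, `y`, and `count` with its `seg_label`, and that the centroid is the unweighted mean of the member positions. Both are assumptions, and the function names are hypothetical rather than StereoSegger's own.

```python
from collections import defaultdict


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices counter-clockwise."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Shoelace formula; returns 0.0 for fewer than three vertices."""
    n = len(vertices)
    if n < 3:
        return 0.0
    total = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def cell_metadata(rows):
    """rows: dicts with keys seg_label, gene_id, x, y, count."""
    by_cell = defaultdict(list)
    for row in rows:
        by_cell[row["seg_label"]].append(row)
    obs = {}
    for cell, members in by_cell.items():
        points = [(r["x"], r["y"]) for r in members]
        obs[cell] = {
            "transcripts": sum(r["count"] for r in members),
            "unique_transcripts": len({r["gene_id"] for r in members}),
            # ASSUMPTION: centroid is the unweighted mean of member positions.
            "cell_centroid_x": sum(x for x, _ in points) / len(points),
            "cell_centroid_y": sum(y for _, y in points) / len(points),
            "cell_area": polygon_area(convex_hull(points)),
        }
    return obs
```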

#### B. Segmentation Table (`.csv` or `.parquet`)
A long-form record of every transcript's assignment.
- **Columns**:
    - `transcript_id`: The unique ID of the input transcript.
    - `seg_label`: The ID of the cell this transcript was assigned to.
    - `score`: The model's confidence score for the assignment.
    - `bound`: Boolean flag (1 = assigned to a nucleus/seed; 0 = assigned via graph-based connected components).
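
Because the segmentation table is keyed by `transcript_id`, it can be joined back to the transcript table to recover gene and position for each assignment. The following is a minimal sketch in plain Python. The function name `join_assignments` and the `min_score` filter are hypothetical and not part of the StereoSegger API.

```python
def join_assignments(seg_rows, transcript_rows, min_score=0.0):
    """Attach transcript fields to each assignment with score >= min_score.

    seg_rows: dicts with transcript_id, seg_label, score, bound.
    transcript_rows: dicts with transcript_id plus gene_id, x, y, count, ...
    """
    by_id = {t["transcript_id"]: t for t in transcript_rows}
    joined = []
    for seg in seg_rows:
        if seg["score"] < min_score:
            continue  # drop low-confidence assignments
        transcript = by_id.get(seg["transcript_id"])
        if transcript is None:
            continue  # assignment refers to an unknown transcript
        joined.append({**transcript, **seg})
    return joined
```

Filtering on `bound` afterwards separates nucleus-seeded assignments (`bound == 1`) from those obtained via connected components (`bound == 0`).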

#### C. Intermediate Tiled Dataset (`.pt`)
Generated by `create_dataset_fast` in your `data_dir`.
- **Content**: Serialized PyTorch Geometric `HeteroData` objects.
- **Use Case**: These are used for training and as the immediate input for the `predict` step. They contain the spatial graph (nodes, edges, features) for 1000x1000 pixel tiles.
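
As a rough illustration of tiling, the sketch below buckets points into 1000x1000 tiles. It assumes axis-aligned, non-overlapping tiles anchored at the origin; the actual dataset builder may use margins or a different layout, and the function names are hypothetical.

```python
import math
from collections import defaultdict

TILE_SIZE = 1000  # tile edge length in pixels, as described above


def tile_key(x, y, tile_size=TILE_SIZE):
    """Return the (column, row) index of the tile containing (x, y)."""
    return (math.floor(x / tile_size), math.floor(y / tile_size))


def group_by_tile(points, tile_size=TILE_SIZE):
    """Map each tile index to the list of point indices falling inside it."""
    tiles = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        tiles[tile_key(x, y, tile_size)].append(idx)
    return dict(tiles)
```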

---

## Quickstart: Stereo-seq SAW bin1

### 1. Convert Data & Create Dataset

```bash
# 1. Convert H5AD to Parquet
python -m stereosegger.cli.convert_saw_h5ad_to_segger_parquet \
  --h5ad C04895D5_tissue.h5ad \
  --out_dir ./raw_data \
  --bin_pitch 1.0 \
  --min_count 1

# 2. Build Graph Dataset
python -m stereosegger.cli.create_dataset_fast \
  --base_dir ./raw_data \
  --data_dir ./processed_dataset \
  --sample_type saw_bin1 \
  --tx_graph_mode grid_bins \
  --grid_connectivity 8 \
  --within_bin_edges star
```

### 2. Train Model

```bash
python -m stereosegger.cli.train_model \
  --dataset_dir ./processed_dataset \
  --models_dir ./models \
  --sample_tag my_sample \
  --max_epochs 200 \
  --accelerator cuda \
  --devices 1
```

### 3. Run Segmentation (Predict)

```bash
python -m stereosegger.cli.predict \
  --segger_data_dir ./processed_dataset \
  --models_dir ./models \
  --benchmarks_dir ./results \
  --transcripts_file ./raw_data/transcripts.parquet \
  --model_version 0
```

---

## Technical Details

### Stereo-seq SAW bin1 Methodology
StereoSegger implements specific logic to handle SAW bin1 data efficiently:
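- **Regular Grid**: SAW bin1 data already lies on a regular grid. We exploit this with grid adjacency: the neighbors of a bin are the bins directly up, down, left, and right (plus the four diagonals when `grid_connectivity=8`). Finding them is an `O(1)` lookup per bin (`O(N)` overall), compared to roughly `O(N log N)` for distance-based kNN on pseudo-points.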
- **Consistency**: Grid adjacency keeps local structure consistent with the chip layout and avoids sensitivity to sparsity or count magnitude.
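
The constant-time neighbor lookup can be sketched with a dictionary keyed by integer bin coordinates. This is an illustration of the idea, not StereoSegger's implementation; `grid_edges` is a hypothetical name, and the bins are assumed to be unique.

```python
OFFSETS_4 = [(1, 0), (-1, 0), (0, 1), (0, -1)]
OFFSETS_8 = OFFSETS_4 + [(1, 1), (1, -1), (-1, 1), (-1, -1)]


def grid_edges(bins, connectivity=4):
    """Return directed edges (i, j) between occupied neighboring bins.

    bins: list of unique (bx, by) integer grid coordinates.
    """
    if connectivity not in (4, 8):
        raise ValueError("connectivity must be 4 or 8")
    offsets = OFFSETS_4 if connectivity == 4 else OFFSETS_8
    index = {b: i for i, b in enumerate(bins)}  # O(1) lookup per coordinate
    edges = []
    for (bx, by), i in index.items():
        for dx, dy in offsets:
            j = index.get((bx + dx, by + dy))
            if j is not None:
                edges.append((i, j))
    return edges
```

Each bin checks a fixed number of offsets (4 or 8), so the total work is linear in the number of occupied bins and independent of how sparse the tissue is.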

#### Graph Modes & Definitions
1.  **Pseudo-transcript (Gene-Bin Node)**: A node created from a nonzero (bin, gene) entry. All gene nodes in a bin connect to a central "hub" gene node, and hubs connect across adjacent bins. **Recommended for SAW**.
2.  **Aggregated Bin Node**: A node representing an entire spatial bin, aggregating all transcripts within it. Features are `[log(total_count), log(n_genes)]`.
3.  **Grid Adjacency**: Two bins are neighbors if their integer grid coordinates differ by one step. `grid_connectivity=8` includes diagonals.
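
The definitions above can be sketched as follows. The source does not state how the hub is chosen, so picking the highest-count node in each bin is an assumption made here for illustration. The function names are hypothetical, and natural logarithms are assumed for the aggregated features.

```python
import math
from collections import defaultdict


def build_star_grid_edges(nodes, connectivity=8):
    """nodes: list of (bx, by, gene_id, count). Returns (hubs, edges).

    hubs maps each bin to its hub node index; edges are undirected (i, j), i < j.
    """
    members = defaultdict(list)
    for idx, (bx, by, _gene, _count) in enumerate(nodes):
        members[(bx, by)].append(idx)
    # ASSUMPTION: the hub is the highest-count node (ties: lowest index).
    hubs = {b: max(idxs, key=lambda i: (nodes[i][3], -i)) for b, idxs in members.items()}
    edges = set()
    for b, idxs in members.items():  # star edges within each bin
        for i in idxs:
            if i != hubs[b]:
                edges.add((min(i, hubs[b]), max(i, hubs[b])))
    offsets = [
        (dx, dy)
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0) and (connectivity == 8 or dx == 0 or dy == 0)
    ]
    for (bx, by), hub in hubs.items():  # hub-to-hub edges across adjacent bins
        for dx, dy in offsets:
            other = hubs.get((bx + dx, by + dy))
            if other is not None:
                edges.add((min(hub, other), max(hub, other)))
    return hubs, sorted(edges)


def bin_features(nodes):
    """Aggregated bin node features: [log(total_count), log(n_genes)] per bin."""
    totals, genes = defaultdict(int), defaultdict(set)
    for bx, by, gene, count in nodes:
        totals[(bx, by)] += count
        genes[(bx, by)].add(gene)
    return {b: [math.log(totals[b]), math.log(len(genes[b]))] for b in totals}
```

With the star layout, a bin holding `g` genes contributes `g - 1` within-bin edges instead of the `g(g - 1)/2` a fully connected bin would need.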

### Architecture
StereoSegger employs a Heterogeneous Graph Attention Network (GATv2) to segment transcripts based on their spatial neighborhood and identity.
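
For reference, a single GATv2 attention layer over one edge type can be written as below. This is the generic formulation of the operator; how StereoSegger configures it (number of heads, hidden sizes, per-edge-type parameters) is not specified here. The symbols $\mathbf{W}_l$, $\mathbf{W}_r$, $\mathbf{a}$, and $\sigma$ denote the learnable target and source projections, the attention vector, and a nonlinearity.

$$
e_{ij} = \mathbf{a}^{\top}\,\mathrm{LeakyReLU}\!\left(\mathbf{W}_l \mathbf{h}_i + \mathbf{W}_r \mathbf{h}_j\right), \qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}, \qquad
\mathbf{h}_i' = \sigma\!\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\,\mathbf{W}_r \mathbf{h}_j\Big)
$$

Here $\mathbf{h}_i$ is the feature vector of target node $i$ and $\mathcal{N}(i)$ is its set of neighbors under the given edge type. In a heterogeneous graph, a layer of this form is typically applied per edge type and the results are combined at each target node.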

#### 1. Nodes (The Graph Components)
- **Transcript Nodes (`tx`)**: Each node represents a specific gene at a spatial location. Gene embeddings are scaled by `(1 + log(count))` to represent signal intensity without inflating the graph size.
- **Boundary Nodes (`bd`)**: Each node represents a polygon boundary (e.g., a nucleus). Features such as area are log-transformed for numerical stability.
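
The count scaling can be illustrated in a few lines. This sketch assumes a natural logarithm and a plain list as the embedding; `scale_embedding` is a hypothetical helper, not a StereoSegger function.

```python
import math


def scale_embedding(embedding, count):
    """Scale a gene embedding by (1 + log(count)); count must be >= 1."""
    if count < 1:
        raise ValueError("count must be at least 1")
    factor = 1.0 + math.log(count)  # ASSUMPTION: natural logarithm
    return [factor * value for value in embedding]


if __name__ == "__main__":
    base = [0.5, -1.0]
    for count in (1, 10, 100):
        print(count, scale_embedding(base, count))
```

A count of 1 leaves the embedding unchanged, while a count of 100 scales it by only about 5.6, so a single node can carry intensity information without being replicated once per molecule.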

#### 2. Edges (The Connections)
- **`tx` $\leftrightarrow$ `tx` (Transcript-Transcript)**: Star topology (within bin) + Grid adjacency (across bins).
- **`tx` $\rightarrow$ `bd` (Transcript-Boundary Neighbors)**: Connects transcripts to nearby candidate cells.
- **`tx` $\rightarrow$ `bd` (Supervision)**: Connects a transcript to the _correct_ ground-truth boundary during training.
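
One way to picture the transcript-to-boundary neighbor edges is a brute-force nearest-centroid search. The source does not define "nearby", so the centroid-based distance, the `k` and `max_dist` parameters, and the function name `neighbor_edges` are all assumptions for illustration only.

```python
import math


def neighbor_edges(transcripts, boundary_centroids, k=3, max_dist=10.0):
    """Connect each transcript to up to k boundary centroids within max_dist.

    transcripts: list of (x, y); boundary_centroids: list of (x, y).
    Returns directed edges (transcript_index, boundary_index).
    """
    edges = []
    for t, (tx, ty) in enumerate(transcripts):
        # Distance from this transcript to every boundary centroid.
        dists = sorted(
            (math.hypot(tx - cx, ty - cy), b)
            for b, (cx, cy) in enumerate(boundary_centroids)
        )
        for dist, b in dists[:k]:
            if dist <= max_dist:
                edges.append((t, b))
    return edges
```

The brute-force search is quadratic and only suitable for small examples; a spatial index would be the usual choice at scale.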
