Metadata-Version: 2.4
Name: bkeccl
Version: 0.1.0
Summary: GPU connected-component labeling for 2D/3D CuPy and DLPack arrays via the Block-based Komura Equivalence algorithm
Author: Ben Elfner
Author-email: Ben Elfner <belfner@belfner.com>
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Operating System :: POSIX :: Linux
Classifier: Environment :: GPU :: NVIDIA CUDA
Classifier: Topic :: Scientific/Engineering :: Image Recognition
Classifier: Topic :: Scientific/Engineering :: Mathematics
Classifier: Typing :: Typed
Requires-Python: >=3.10
Project-URL: Homepage, https://github.com/belfner/bkeccl
Project-URL: Repository, https://github.com/belfner/bkeccl
Description-Content-Type: text/markdown

# bkeccl

GPU-accelerated connected-component labeling for **2D and 3D** arrays using the
**Block-based Komura Equivalence (BKE)** algorithm
([IEEE TPDS 2019](https://ieeexplore.ieee.org/document/8798895)). bkeccl is a
pure-Python package built on [CuPy](https://cupy.dev/): the CUDA kernels are
compiled on the device at runtime, so there is no compiled extension and no
build step.

- **2D 8-connectivity** (2x2 blocks) and **3D 26-connectivity** (2x2x2 blocks)
- Works with **`cupy.ndarray`** or any **CUDA DLPack tensor** (PyTorch, JAX,
  Triton, ...) imported zero-copy, so it drops into an existing GPU pipeline
  without a host round-trip
- **`uint8` and `bool`** inputs used directly; other dtypes reduced to their
  non-zero mask
- **Dense** (`1..N`) or **raw** (native sparse) label output
- Per-component **boolean mask stacks** in one call
- A **shape-specialized runner** for locked-down hot loops

### When to use bkeccl

bkeccl is optimized for one workflow: full-connectivity binary CCL on a single
GPU array, kept on-device, called repeatedly at a fixed shape. Inside that
workflow it is very fast. It deliberately omits the flexibility that workflow
does not need (4-connectivity, batching, CPU arrays, on-the-fly shapes).

If you need that flexibility, use
[cuCIM](https://docs.rapids.ai/api/cucim/stable/)'s
[`cucim.skimage.measure.label`](https://docs.rapids.ai/api/cucim/stable/api/#cucim.skimage.measure.label):
it is a more general, scikit-image-compatible GPU labeler (selectable
connectivity, n-dimensional, broader dtype/array handling). It is slightly
slower than bkeccl on bkeccl's specialized path but still a fast GPU
implementation, and the better choice when you need the generality.

## Installation

bkeccl does **not** depend on a pinned CuPy. CuPy ships as a
CUDA-version-specific wheel, so you install the one matching your CUDA toolkit
first, then install bkeccl.

### Requirements

- Python >= 3.10
- An NVIDIA GPU and a working CUDA runtime
- A CuPy wheel matching your CUDA major version

### Step 1: install the CuPy wheel for your CUDA

With pip:

```bash
pip install cupy-cuda13x      # CUDA 13.x
# or
pip install cupy-cuda12x      # CUDA 12.x
```

For a uv-managed project:

```bash
uv add cupy-cuda13x      # CUDA 13.x
# or
uv add cupy-cuda12x      # CUDA 12.x
```

For other CUDA versions or the bundled-toolkit `[ctk]` extra, see the
[CuPy install guide](https://docs.cupy.dev/en/stable/install.html). Importing
bkeccl without CuPy raises an `ImportError` that names this step.

### Step 2: install bkeccl

```bash
pip install bkeccl
```

For a uv-managed project:

```bash
uv add bkeccl
```

### Install from source

```bash
git clone https://github.com/belfner/bkeccl
cd bkeccl
pip install .
```

## Usage

| Goal | Function |
|------|----------|
| Label a 2D image | `ccl(img)` |
| Label a 3D volume | `ccl(vol, dimensionality="3D")` |
| Native sparse labels (skip the densify pass) | `ccl(img, output_mode="raw")` |
| One boolean mask per component | `ccl_masks(img)` |
| Masks from a label map you already have | `masks_from_labels(labels, count)` |
| Skip the per-call output allocation | `ccl(img, label_output=buf)` |
| Repeated fixed-shape calls in a hot loop | `make_ccl_2D` / `make_ccl_3D` |

### Core workflow

Labeling an array and reading the result. Most code needs only this section.

#### Label a 2D image

```python
import cupy
from bkeccl import ccl

# Binary image on the GPU. Any non-zero value is foreground; 0 is background.
img = cupy.asarray(
    [[1, 1, 0, 0],
     [0, 0, 0, 1],
     [0, 1, 1, 1]],
    dtype=cupy.uint8,
)

labels, count = ccl(img)
# labels: int32 (H, W), background 0, components renumbered 1..N (dense).
# count:  0-dim int32 device array holding N (read with int(count)).
```

`ccl` returns `(labels, count)` in the default `output_mode="dense"`. `count`
is kept on the device; calling `int(count)` is the one host synchronization.

#### Label a 3D volume

Pass `dimensionality="3D"` (26-connectivity on 2x2x2 blocks):

```python
vol = cupy.asarray(volume, dtype=cupy.uint8)        # (D, H, W)
labels, count = ccl(vol, dimensionality="3D")
```

#### Use a tensor from another framework

Any GPU-resident object exposing the DLPack protocol is imported zero-copy, so
a PyTorch/JAX/Triton CUDA tensor can be passed directly:

```python
import torch
from bkeccl import ccl

t = torch.zeros((512, 512), dtype=torch.uint8, device="cuda")
t[100:200, 100:200] = 1

labels, count = ccl(t)        # zero-copy import, no host round-trip
```

Host-resident tensors are rejected with a `TypeError` rather than silently
staged onto the device.

### Inspecting components

#### Native (raw) labels

`output_mode="raw"` skips the densify pass and returns the labeler's native
sparse, root-derived IDs (positive, not a contiguous `1..N` range), as a single
array:

```python
raw = ccl(img, output_mode="raw")     # int32 (H, W), sparse positive IDs
```

Use raw when you will reduce or threshold the labels yourself and do not need
contiguous IDs; use the default dense mode when you want `1..N`.

#### Component masks

```python
from bkeccl import ccl_masks

masks = ccl_masks(img)                 # bool (N, H, W), one plane per component
```

`ccl_masks` labels and expands in one call, staying on the GPU; the only host
sync is the count read needed to size the stack. For a label map you already
have, expand it directly:

```python
from bkeccl import ccl, masks_from_labels

labels, count = ccl(img)
masks = masks_from_labels(labels, count)   # bool (N, H, W)
```

For thousands of components the dense `(N, *spatial)` stack is large; keep the
label map instead when you do not need per-component planes.

### Performance and reuse

Opt in once the basic workflow works and you are calling it repeatedly.

#### Buffer reuse

Supply a pre-allocated `int32` output buffer to skip the per-call output
allocation. It is keyword-only and the result aliases it:

```python
import cupy
from bkeccl import ccl

buf = cupy.empty((512, 512), dtype=cupy.int32)

results = []
for img in image_batch:
    labels, count = ccl(img, label_output=buf)
    # `labels` IS `buf` (aliased). Retaining it across iterations requires a
    # copy, or every stored result points at the last image.
    results.append(labels.copy())
```

The buffer must be C-contiguous `int32` with the same shape as the input.

#### Shape-specialized runner

For a hot loop with a fixed shape, `make_ccl_2D` / `make_ccl_3D` bake the launch
grid, kernel handles, and all scratch once and return a reusable callable. The
factory is memoized per argument set:

```python
from bkeccl import make_ccl_2D

runner = make_ccl_2D((512, 512), algorithm="bke", output_mode="dense")
for img in image_batch:
    labels, count = runner(img)        # consume or copy before the next call
```

The runner performs **no input validation** and reuses one process-owned
output buffer: the caller must uphold the shape/dtype/device contract, and must
consume or copy the result before the next call. It is **not** thread- or
stream-safe. Use `ccl` unless you can guarantee that contract. Full contract in
[`REFERENCE.md`](https://github.com/belfner/bkeccl/blob/master/REFERENCE.md#shape-specialized-runners).

#### Algorithm variants

```python
labels, count = ccl(img, algorithm="bke")      # default, standard BKE
labels, count = ccl(img, algorithm="bke_ic")   # inline path compression
```

Both produce the same partition. `bke` is the default; benchmark `bke_ic`
against it on your workload if variant choice matters.

## API Reference

Full per-function signatures, parameters, return types, and raised exceptions
are in **[`REFERENCE.md`](https://github.com/belfner/bkeccl/blob/master/REFERENCE.md)**.

## Algorithm

bkeccl implements the **Block-based Komura Equivalence (BKE)** algorithm:

> Stefano Allegretti, Federico Bolelli, Michele Cancilla, Costantino Grana.
> "Optimized Block-Based Algorithms to Label Connected Components on GPUs."
> *IEEE Transactions on Parallel and Distributed Systems*, 31(2):423-438, 2019.
> [IEEE Xplore](https://ieeexplore.ieee.org/document/8798895) -
> [doi:10.1109/TPDS.2019.2934683](https://doi.org/10.1109/TPDS.2019.2934683) -
> ([open-access PDF](https://iris.unimore.it/bitstream/11380/1179616/1/2018_TPDS_Optimized_Block_Based_Algorithms_to_Label_Connected_Components_on_GPUs.pdf))

BKE operates on **2x2 blocks** (2D) or **2x2x2 blocks** (3D) rather than
individual elements, reducing memory accesses and atomic operations. The
pipeline is Init (block connectivity) -> Compress (path compression) ->
Reduction (union for remaining connections) -> Compress -> FinalLabel (block
labels to elements). The `bke_ic` variant updates the parent at each traversal
step (inline compression) for faster convergence. The block structure makes
8-connectivity (2D) and 26-connectivity (3D) the native, fixed connectivity.
The 3D variant follows Section 5.3 ("3D extension") of the same paper.

## Limitations

- **CUDA-only.** Inputs must be GPU-resident; host arrays are rejected with a
  `TypeError`, with no CPU fallback.
- **2D or 3D single array.** No batched API; iterate in Python for batches.
- **Fixed connectivity.** 2D is 8-connected and 3D is 26-connected; lower
  connectivities are not provided (the block structure is intrinsically
  fully-connected).
- **int32 label space.** Index arithmetic is `int32`; arrays with more than
  ~2^31 elements are unsupported.
- **Non-contiguous input is copied.** A contiguous `uint8` temporary is
  allocated internally, so buffer reuse is allocation-free only for contiguous
  input.
- **Concurrency.** The shape-specialized runners and their reused buffers are
  single-context: one per device, per stream, per in-flight op.

## Semantics

- **`dimensionality` must match `ndim`.** A `ValueError` is raised on mismatch.
- **Dense vs. raw labels.** The default `output_mode="dense"` returns a
  deterministic `1..N` renumbering. `output_mode="raw"` IDs are positive but
  root-derived, not contiguous; use dense mode when you need `1..N`.
- **Supplied buffers.** `label_output` must be C-contiguous `int32` with the
  input's shape; the return aliases it.
- **Stream/device.** Launches use the current CUDA stream and device;
  correctness under CUDA graph capture or multi-stream pipelines requires the
  caller to manage stream/event ordering.
- **Determinism.** For a given input and algorithm the output is reproducible
  run-to-run, and the *partition* into components is stable. Dense labels are a
  deterministic `1..N` renumbering of that partition. Raw label *values* are
  union-find-root-derived: sparse, and specific to the device and algorithm, so
  they can differ across devices and between `bke`/`bke_ic`.

## Efficiency Tips

- **Reuse a buffer** with `label_output=` for repeated same-shape calls; the
  return aliases the buffer, so `.copy()` any result you retain.
- **Keep inputs contiguous** (`cupy.ascontiguousarray(x)` once, upstream) so
  buffer reuse actually avoids all allocations.
- **Use a shape-specialized runner** for a fixed-shape hot loop; one runner per
  stream/device.
- **Read `count` once.** `int(count)` (and `ccl_masks` / `masks_from_labels`)
  forces a host sync; avoid it inside tight inner loops where you only need the
  label map.
- **Prefer `output_mode="raw"`** when you do not need contiguous IDs; it skips
  the densify pass.
- **Prefer keeping the label map** over `ccl_masks` when you do not need
  per-component planes; the `(N, *spatial)` stack memory dominates for many
  components.

### Measuring on your workload

Performance depends on array size, component count and size, label-field
contiguity, allocation mode, and GPU architecture. Benchmark with your data and
a warmed device:

```python
import cupy
from bkeccl import make_ccl_2D

img = your_binary_image_cupy
runner = make_ccl_2D(img.shape)
for _ in range(10):                       # warm up (includes kernel compile)
    runner(img)
cupy.cuda.Device().synchronize()
start = cupy.cuda.Event(); end = cupy.cuda.Event()
start.record()
for _ in range(100):
    runner(img)
end.record(); end.synchronize()
print(cupy.cuda.get_elapsed_time(start, end) / 100, "ms/call")
```

Compare against `scipy.ndimage.label` or `cc3d` on your inputs for a meaningful
baseline.

## Development

Development uses [uv](https://docs.astral.sh/uv/). The dev dependency group
pins a CuPy wheel so `uv sync` and the GPU test suite run locally:

```bash
uv sync
uv run ruff check src/
uv run ruff format src/
uv run mypy
uv run pytest -q
```

The test suite requires a usable CUDA device; it aborts with a clear error if
none is visible.

## License

MIT License - see [LICENSE](LICENSE).

## Citation

```bibtex
@article{allegretti2019optimized,
  title={Optimized Block-Based Algorithms to Label Connected Components on GPUs},
  author={Allegretti, Stefano and Bolelli, Federico and Cancilla, Michele and Grana, Costantino},
  journal={IEEE Transactions on Parallel and Distributed Systems},
  volume={31},
  number={2},
  pages={423--438},
  year={2019},
  publisher={IEEE},
  doi={10.1109/TPDS.2019.2934683}
}
```

## Contributing

Contributions are welcome. Please open a Pull Request.
