Metadata-Version: 2.4
Name: bm3d-cupy
Version: 0.1.3
Summary: GPU-accelerated BM3D denoising with CuPy
Author: Kaibo Tang
License-Expression: MIT
Keywords: BM3D,CuPy,denoising,image-processing,GPU
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Image Processing
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: PyWavelets>=1.5
Provides-Extra: cuda12
Requires-Dist: cupy-cuda12x[ctk]>=13.0; extra == "cuda12"
Provides-Extra: cuda13
Requires-Dist: cupy-cuda13x[ctk]>=13.0; extra == "cuda13"
Provides-Extra: benchmark
Requires-Dist: bm3d>=4.0.0; extra == "benchmark"
Requires-Dist: matplotlib>=3.7; extra == "benchmark"
Requires-Dist: scikit-image>=0.22; extra == "benchmark"
Provides-Extra: dev
Requires-Dist: build>=1.2; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Dynamic: license-file

BM3D with CuPy
==============

This repository contains a GPU-accelerated implementation of BM3D built on CuPy. The main file, `bm3d_cupy.py`, provides the function `bm3d_gpu`, which denoises 2D images of shape `(H, W)` or `(H, W, C)` on the GPU. The `bm3d_cupy_benchmark.py` script benchmarks the GPU implementation against the reference CPU implementation.

This implementation aims to stay as close as possible to the official reference CPU implementation of BM3D, which is available on PyPI as [`bm3d`](https://pypi.org/project/bm3d/). Most of the transforms used here are identical to those in the official implementation.

Transforms other than the defaults can also be used, but doing so requires editing the transform setup in the code yourself:

```python
# 2d transforms (along spatial dimensions)
spatial_ht, spatial_ht_inv = _transform_matrices(p, 'bior1.5')  # biorthogonal WT, (p, p)
spatial_wiener, spatial_wiener_inv = _transform_matrices(p, 'dct')  # DCT, (p, p)
# 1d transforms (along group dimension)
group_ht, group_ht_inv = _transform_matrices(ht_group_size, 'haar')  # Haar, (k_ht, k_ht)
group_wiener, group_wiener_inv = _transform_matrices(wiener_group_size, 'haar')  # Haar, (k_w, k_w)
```
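To illustrate what a `(p, p)` forward/inverse transform pair looks like, here is a minimal sketch that builds an orthonormal DCT-II matrix and checks that its transpose acts as the inverse. It is not the package's `_transform_matrices` function, whose internals are not shown here. The helper names `dct_matrix`, `matmul`, and `transpose` and the patch size `p = 8` are assumptions made for illustration. It uses only the standard library so that it runs without a GPU.

```python
import math


def dct_matrix(n):
    """Return the n x n orthonormal DCT-II matrix as a list of rows."""
    rows = []
    for k in range(n):
        # Row 0 is the constant (DC) basis vector and needs a different scale.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


p = 8  # hypothetical patch size, chosen only for this illustration
forward = dct_matrix(p)
inverse = transpose(forward)  # orthonormal, so the inverse is the transpose

# forward @ inverse should equal the identity up to rounding error.
product = matmul(forward, inverse)
max_error = max(abs(product[i][j] - (1.0 if i == j else 0.0))
                for i in range(p) for j in range(p))
print(f"max deviation from identity: {max_error:.2e}")
```

For an orthonormal transform such as this DCT, the inverse is simply the transpose. A biorthogonal wavelet transform such as `bior1.5` is not orthogonal, so its inverse must be computed separately from the forward matrix.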

The purpose of this implementation is to provide a fast and GPU-accelerated version of BM3D, which is beneficial for **fast prototyping** and, in particular, for **integration as a plug-and-play prior** into an iterative image reconstruction algorithm that is already implemented on the GPU (e.g., SigPy). 


Installation
------------

The package is published on PyPI under the distribution name `bm3d-cupy`; the import name is `bm3d_cupy`. To install it for CUDA 13, use:

```bash
pip install "bm3d-cupy[cuda13]"
```

To use this package with CUDA 12, use:

```bash
pip install "bm3d-cupy[cuda12]"
```


Usage
-----

```python
import cupy as cp
import numpy as np
from skimage.data import cat

from bm3d_cupy import bm3d_gpu


rng = np.random.default_rng(seed=0)


x = cat().astype('float32') / 255.0
sigma = 0.1
y = (x + rng.standard_normal(x.shape).astype('float32') * sigma).astype('float32')

y_gpu = cp.asarray(y)
sigma_gpu = cp.asarray(sigma, dtype=cp.float32)
z_gpu = bm3d_gpu(y_gpu, sigma_gpu)
z = cp.asnumpy(z_gpu)
```
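To check the quality of the denoised output, you can compare it with the clean image using PSNR, the metric reported in the results. The sketch below is a plain-Python version of the standard PSNR formula, written so that it runs without any array library. The function name `psnr` and the toy data are assumptions for illustration, and the benchmark script may compute the metric differently.

```python
import math


def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized flat sequences."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of samples")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(data_range ** 2 / mse)


# Toy data: every sample is off by 0.1, so the MSE is about 0.01
# and the PSNR is about 20 dB for a data range of 1.0.
clean = [0.0, 0.5, 1.0, 0.5]
noisy = [0.1, 0.6, 0.9, 0.4]
value = psnr(clean, noisy)
print(f"PSNR: {value:.2f} dB")
```

With the arrays from the usage example, `skimage.metrics.peak_signal_noise_ratio(x, z, data_range=1.0)` computes the same quantity directly on NumPy arrays.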


Result
------

On an NVIDIA RTX 6000 Ada, the GPU implementation achieves a speedup of about 16x over the reference CPU implementation on the cat image.

```
BM3D (CPU) time: 3.0155 (0.0226) s
  PSNR: 30.85 dB
  SSIM: 0.8194


BM3D (GPU) time: 0.1881 (0.0004) s
  Speedup: 16.03x
  PSNR: 30.90 dB
  SSIM: 0.8228
```

The `chunk_size` argument sets the number of groups processed in parallel on the GPU. A larger `chunk_size` increases GPU memory usage and can improve speed, but in practice an extremely large `chunk_size` slows things down, possibly due to overhead. The default, `chunk_size=2048`, was chosen based on the experiment shown above, run on the RTX 6000 Ada. You may need to adjust it for other GPUs and images.
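One way to pick a value for your own hardware is to time a few candidates and keep the fastest. The following is a minimal sketch of such a sweep. The helper `sweep_chunk_sizes` and the CPU stand-in `fake_run` are hypothetical and not part of this package; the stand-in exists only so that the sketch runs without a GPU.

```python
import time


def sweep_chunk_sizes(run, candidates, repeats=3):
    """Time run(chunk_size) for each candidate; return {chunk_size: best seconds}."""
    timings = {}
    for chunk_size in candidates:
        run(chunk_size)  # warm-up call, not timed
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run(chunk_size)
            best = min(best, time.perf_counter() - start)
        timings[chunk_size] = best
    return timings


def fake_run(chunk_size):
    # Hypothetical stand-in workload: walk over 20000 "groups" in chunks.
    n_groups = 20000
    for start in range(0, n_groups, chunk_size):
        sum(range(start, min(start + chunk_size, n_groups)))


timings = sweep_chunk_sizes(fake_run, [512, 1024, 2048, 4096])
best_chunk_size = min(timings, key=timings.get)
print(f"fastest chunk_size in this run: {best_chunk_size}")
```

On a GPU, `run` would instead call something like `bm3d_gpu(y_gpu, sigma_gpu, chunk_size=chunk_size)`, assuming `chunk_size` is accepted as a keyword argument. Follow the call with `cp.cuda.Stream.null.synchronize()` so that the timer measures completed GPU work rather than only the kernel launches.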

Acknowledgement
---------------

If you use this package, please consider citing the original BM3D paper:

K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image Denoising by Sparse 3-D Transform-Domain Collaborative Filtering,” *IEEE Transactions on Image Processing*, vol. 16, no. 8, pp. 2080–2095, Aug. 2007, doi: [10.1109/TIP.2007.901238](https://ieeexplore.ieee.org/document/4271520).
