Metadata-Version: 2.4
Name: SeismicFlowCUDA
Version: 0.0.1
Summary: GPU-accelerated seismic attribute computation via fused CUDA kernels
Author: Anonymous
License: GPL-3.0
Project-URL: Homepage, https://github.com/SeismicFlow/SeismicFlowCUDA
Keywords: seismic,geophysics,attributes,cuda,gpu,cupy
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Physics
Classifier: Topic :: Scientific/Engineering :: GIS
Requires-Python: >=3.12
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.26.4

# SeismicFlowCUDA

**GPU-accelerated seismic attribute computation for NVIDIA CUDA hardware.**

SeismicFlowCUDA implements seismic attributes as fused CUDA kernels compiled at runtime via CuPy's RawModule API — delivering throughput structurally inaccessible to CPU implementations, and establishing CUDA programming as the new standard for high-performance computing in exploration seismology.

---

## Why SeismicFlowCUDA?

SeismicFlowCore demonstrated that compiled, parallelised CPU code outperforms pure-Python implementations by up to 190×. SeismicFlowCUDA goes further: by moving computation to the GPU, it eliminates the memory bandwidth and thread-count ceilings that constrain even the best CPU implementations.

- **Fused CUDA kernels** — gradient computation and coherence accumulation in a single pass; shared-memory tiling eliminates redundant global memory round-trips
- **Shared memory tiling with bank-conflict elimination** — 16×16 thread blocks, coalesced loads on the crossline (fastest) dimension, +1 crossline pad on the shared-memory tile
- **Constant memory** — Gaussian derivative kernel uploaded once per compilation, resident in L1 for every thread
- **Module caching** — kernels compiled once per `(device_id, kernel_size)` pair and reused across every subsequent call in a session
- **Upper-triangle GLCM** — stores only `N*(N+1)/2` entries instead of `N²`, halving shared memory, doubling occupancy, and halving atomic contention
- **Warp-level reductions** — `__shfl_down_sync` replaces shared-memory reduction trees for energy and entropy, eliminating synchronisation overhead
- **NumPy-native API** — input and output are float32 NumPy arrays; GPU transfer is handled internally

---

## Requirements

### Hardware

An NVIDIA GPU with CUDA Compute Capability 3.5 or higher. SeismicFlowCUDA was developed and validated on an NVIDIA GeForce GTX 1080 Ti (Compute Capability 6.1).

### Software

| Package | Tested version |
|---|---|
| Python | 3.12 (64-bit) |
| CuPy | 13.4.1 |
| NumPy | ≥ 1.26.4 |
| CUDA Toolkit | matching your CuPy build |

---

## Installation

### Step 1 — Install CUDA Toolkit

Download from: https://developer.nvidia.com/cuda-downloads

The CUDA Toolkit version must match the CuPy build you install in the next step. Check compatibility at: https://docs.cupy.dev/en/stable/install.html

Verify after install:
```
nvcc --version
```

---

### Step 2 — Install CuPy

Install the CuPy wheel that matches your CUDA Toolkit version:

```bash
# CUDA 12.x
pip install cupy-cuda12x

# CUDA 11.x
pip install cupy-cuda11x
```

Verify:
```python
import cupy
print(cupy.__version__)   # should print e.g. 13.4.1
```

If CuPy imports but CUDA kernels fail to compile, your CUDA Toolkit and CuPy wheel versions are mismatched.

---

### Step 3 — Install SeismicFlowCUDA

```bash
git clone https://github.com/SeismicFlow/SeismicFlowCUDA
cd SeismicFlowCUDA
pip install -e .
```

No compilation step is required. Kernels are compiled at first call via CuPy's RawModule API and cached for the duration of the session.

---

### Step 4 — Verify your install

```python
import numpy as np
import seismicflowcuda as sfc

data = np.random.randn(100, 50, 50).astype(np.float32)

print("coherence:", sfc.coherence(data).shape)
print("glcm:     ", sfc.glcm(data, metric='homogeneity').shape)
print()
print("OK — SeismicFlowCUDA loaded successfully")
```

If both lines print a shape, your install is complete. On first call, CuPy will compile and cache the kernels — this takes a few seconds and does not repeat.

---

## Quick Start

```python
import numpy as np
import seismicflowcuda as sfc

# Seismic volume: (time, inline, crossline), float32
data = np.random.randn(500, 200, 200).astype(np.float32)

coh = sfc.coherence(data)
hom = sfc.glcm(data, metric='homogeneity')
ent = sfc.glcm(data, metric='entropy')
```

All functions accept `(time, inline, crossline)` float32 arrays and return float32 arrays of the same shape.

---

## Attributes

### Discontinuity and Structural Continuity

| Function | Description |
|---|---|
| `sfc.coherence(data)` | GST coherence — fused shared-memory kernel, Cardano eigenvalue decomposition |

### Texture

| Function | Description |
|---|---|
| `sfc.glcm(data, metric='homogeneity')` | GLCM-based texture attribute — upper-triangle shared-memory GLCM, warp-level reduction |

---

## API Reference

### `sfc.coherence`

```python
sfc.coherence(data, window=(3, 3, 3), sigma=1.0)
```

GPU GST coherence. Numerically equivalent to the SeismicFlowCore reference implementation.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `data` | float32 ndarray `(n_time, n_inline, n_crossline)` | — | Input seismic volume |
| `window` | tuple of 3 odd ints | `(3, 3, 3)` | Centred analysis window |
| `sigma` | float | `1.0` | Gaussian derivative sigma |

Returns `float32 ndarray`, same shape as `data`, values in `[0, 1]`.

---

### `sfc.glcm`

```python
sfc.glcm(data, metric='homogeneity', half_levels=16, lateral_radius=4,
         vertical_radius=1, clip_sigmas=1.5, co_offset=1)
```

GLCM-based seismic texture attribute.

| Parameter | Type | Default | Description |
|---|---|---|---|
| `data` | float32 ndarray `(nt, ni, nx)` | — | Input seismic volume |
| `metric` | str | `'homogeneity'` | `'contrast'` \| `'dissimilarity'` \| `'homogeneity'` \| `'energy'` \| `'entropy'` |
| `half_levels` | int in `[1, 16]` | `16` | Quantisation half-range |
| `lateral_radius` | int | `4` | Spatial window half-width |
| `vertical_radius` | int | `1` | Temporal window half-width |
| `clip_sigmas` | float > 0 | `1.5` | RMS clipping scale |
| `co_offset` | int ≥ 1 | `1` | Co-occurrence offset distance |

Returns `float32 ndarray`, same shape as `data`.

---

## CUDA Kernel Design

### GST Coherence — `gst_coherence_fused`

The reference SeismicFlowCore implementation uses four separate kernel launches: three calls to `compute_gradient` (one per spatial axis) followed by `compute_coherence`. Each launch reads the full volume from global memory, producing six global memory round-trips in total.

SeismicFlowCUDA replaces this with a single fused kernel:

- A shared-memory tile of shape `(ts0, ts1, ts2_pad)` is loaded cooperatively by all threads in the block, with reflect-padded boundary handling
- All three directional gradients are computed from the shared tile; no additional global memory reads are required
- The GST tensor is accumulated inline and eigenvalues are solved analytically via Cardano's method
- `__constant__` memory holds the Gaussian derivative kernel, uploaded once per `(device_id, kernel_size)` pair
- The crossline tile dimension is padded by +1 to eliminate shared memory bank conflicts on time-axis taps

Result: one global memory read pass instead of four, with all intermediate gradient values consumed from L1-resident shared memory.

### GLCM — `glcm_direct` and `glcm_ee`

Two kernels serve the five GLCM metrics:

**`glcm_direct`** (contrast, dissimilarity, homogeneity) — one thread per output pixel. Accumulates co-occurrence weighted metrics directly without storing a GLCM matrix. Separate passes over the amplitude and Hilbert transform volumes; results summed at the output.

**`glcm_ee`** (energy, entropy) — one warp per output pixel. The full `N×N` GLCM is symmetric, so only the upper triangle (`N*(N+1)/2` entries) is stored in shared memory per warp slot:

- For `N = 33` levels: full matrix = 1,089 floats × 4 bytes = 4,356 bytes/warp; upper triangle = 561 floats × 4 bytes = 2,244 bytes/warp
- At 4 warps per block: 8,976 bytes/block → 5 blocks per SM → 20 active warps/SM
- Diagonal entries represent `c[i,i]`; off-diagonal entries store `c[i,j] = c[j,i]` once, with the factor-of-two applied at reduction time
- Amplitude and Hilbert transform are processed in two sequential passes that reuse the same GLCM slot, avoiding a doubled shared-memory footprint
- Warp-level reductions over the triangle use `__shfl_down_sync` with `FULL_MASK`

---

## Performance

Benchmarked on an NVIDIA GeForce GTX 1080 Ti against SeismicFlowCore (Intel Core i7-7700K, 8 threads, AVX2) on a 500×500×500 float32 volume. CPU times are wall-clock from SeismicFlowCore; GPU times include host-to-device transfer, kernel execution, and device-to-host transfer.

| Attribute | CPU (ms) | GPU (ms) | Speedup |
|---|---|---|---|
| GST Coherence | 31,225.5 | 681.4 | **45.8×** |
| Contrast | 334,339.9 | 2,361.8 | **141.6×** |
| Dissimilarity | 331,176.2 | 2,377.7 | **139.3×** |
| Homogeneity | 341,142.8 | 2,398.7 | **142.2×** |
| Energy | 306,925.9 | 6,084.5 | **50.4×** |
| Entropy | 495,425.9 | 6,094.4 | **81.3×** |

The GLCM contrast, dissimilarity, and homogeneity attributes achieve the largest speedups (up to 142×) because the fused `glcm_direct` kernel eliminates the intermediate GLCM matrix entirely — a data structure the CPU implementation must write and read back from memory for every output pixel. Energy and entropy (50–81×) involve the upper-triangle shared-memory GLCM and warp-level reductions, which carry slightly more per-pixel work but remain heavily GPU-bound. GST coherence (45.8×) benefits primarily from the fused single-pass kernel and shared-memory tiling, which reduce global memory traffic from six round-trips to one.

These results are reported in full in the accompanying manuscript submitted to *Geophysics* (SEG).

---

## Relationship to SeismicFlowCore

SeismicFlowCUDA and SeismicFlowCore share the same public API contract: both accept `(time, inline, crossline)` float32 NumPy arrays and return float32 NumPy arrays. The GPU coherence implementation is byte-for-byte equivalent to the SeismicFlowCore reference on all tested volumes.

SeismicFlowCore remains the recommended choice for environments without NVIDIA hardware, or when a full 27-attribute pipeline is required. SeismicFlowCUDA is the recommended choice when a GPU is available and maximum throughput is the priority.

---

## Citation

If you use SeismicFlowCUDA in your research, please cite:

> Anonymous. (2026). *SeismicFlowCUDA: Establishing CUDA Programming as the New Standard for High-Performance Computing in Exploration Seismology*. Manuscript submitted to *Geophysics* (Society of Exploration Geophysicists).

If you also use SeismicFlowCore:

> Anonymous. (2026). *SeismicFlowCore: A High-Performance, Industry-Validated Open-Source Python Library for 3D Seismic Attribute Computation*. Manuscript submitted for publication.

---

## License

GNU General Public License v3.0 — see [LICENSE](LICENSE) for details.
