# GPU Neural Networks via Vulkan

## GPU Acceleration with Vulkan

Grilly runs on any GPU with Vulkan drivers — AMD, NVIDIA, or Intel. No CUDA required.

## Initializing Vulkan

The `Compute` class initializes a Vulkan device, loads SPIR-V shader bytecode, and sets up compute pipelines. One call gives you access to all GPU operations:

```python
import grilly

# Initialize Vulkan backend (selects the best available GPU)
compute = grilly.Compute()
print(f"GPU: {compute.core.device_properties.deviceName}")
```

Output:

```
GPU: AMD Radeon RX 6750 XT
```
**vs PyTorch:** PyTorch uses `torch.device('cuda')` and requires NVIDIA drivers plus the CUDA toolkit. Grilly uses Vulkan, which ships with the standard GPU drivers of all vendors; no separate SDK install is needed.

## Environment Variables

Control GPU selection and debugging behavior at startup:

| Variable | Purpose | Example |
| --- | --- | --- |
| `VK_GPU_INDEX` | Select a specific GPU by index (0-based) | `VK_GPU_INDEX=1` |
| `GRILLY_DEBUG` | Enable verbose Vulkan logging | `GRILLY_DEBUG=1` |
| `ALLOW_CPU_VULKAN` | Allow the llvmpipe software renderer (for CI) | `ALLOW_CPU_VULKAN=1` |
```bash
# Use the second GPU, enable debug logging
VK_GPU_INDEX=1 GRILLY_DEBUG=1 python train.py
```
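Startup code for flags like these is usually a thin layer over `os.environ`. A minimal sketch of that pattern (illustrative only, not Grilly's actual source):

```python
import os

# Hypothetical startup-time parsing for the variables in the table above.
gpu_index = int(os.environ.get("VK_GPU_INDEX", "0"))        # default: first GPU
debug = os.environ.get("GRILLY_DEBUG", "0") == "1"          # logging off by default
allow_cpu = os.environ.get("ALLOW_CPU_VULKAN", "0") == "1"  # reject llvmpipe by default

print(gpu_index, debug, allow_cpu)
```

Reading everything once at import time keeps the rest of the code free of environment lookups.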

## Module-Level GPU Mode

By default, each layer transfers data CPU → GPU → CPU for every forward call. Enable `gpu_mode` to keep intermediate activations in VRAM between layers:

```python
import grilly.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# Keep tensors on GPU between layers (avoid PCIe round-trips)
model.gpu_mode(True)

# Alternative: set device explicitly
model.to('vulkan')
```
**Tip:** GPU VRAM bandwidth is typically 20–30x higher than PCIe. On an AMD RX 6750 XT, VRAM delivers ~384 GB/s versus ~14 GB/s for PCIe 3.0 x16. Enabling `gpu_mode` is the single biggest performance knob.
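The quoted ratio follows directly from those two figures:

```python
# Rough bandwidth ratio from the RX 6750 XT figures above
vram_gb_s = 384.0  # on-card VRAM bandwidth
pcie_gb_s = 14.0   # usable PCIe 3.0 x16 bandwidth

ratio = vram_gb_s / pcie_gb_s
print(f"VRAM is ~{ratio:.0f}x faster than PCIe")
```

Every forward call that bounces activations over the bus pays that ~27x penalty twice (upload and download), which is why keeping tensors resident dominates other optimizations.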

## VulkanTensor: Zero-Copy Pipeline

For maximum control, use `VulkanTensor` to manage GPU uploads and downloads explicitly:

```python
from grilly import VulkanTensor
import numpy as np

# Prepare data on CPU
x_cpu = np.random.randn(1024, 784).astype(np.float32)

# Upload once to GPU
x_gpu = VulkanTensor(x_cpu)
x_gpu.upload()

# Run model (data stays in VRAM)
output_gpu = model(x_gpu)

# Download only at the end
result = output_gpu.numpy()
print(f"Result shape: {result.shape}")

# Clean up GPU memory
x_gpu.release_gpu()
```
**vs PyTorch:** In PyTorch you write `x.to('cuda')` and the result lives on the GPU until you call `.cpu()`. Grilly's `VulkanTensor` is the same concept: the `.upload()` / `.numpy()` calls are explicit equivalents of `.to('cuda')` / `.cpu()`.
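The explicit upload/download contract can be illustrated with a toy host-only class. This is purely illustrative, not Grilly's implementation; a second numpy array stands in for the VRAM allocation:

```python
import numpy as np

class ToyDeviceTensor:
    """Toy model of explicit upload/download semantics.

    `_device` stands in for a VRAM allocation; a real backend
    would hold a Vulkan buffer here instead.
    """

    def __init__(self, host_array):
        self._host = np.asarray(host_array, dtype=np.float32)
        self._device = None

    def upload(self):
        # One explicit host -> "device" copy
        self._device = self._host.copy()
        return self

    def numpy(self):
        # One explicit "device" -> host copy at the end
        if self._device is None:
            raise RuntimeError("call upload() before numpy()")
        return self._device.copy()

    def release_gpu(self):
        # Free the "device" allocation
        self._device = None

t = ToyDeviceTensor(np.ones((2, 3))).upload()
out = t.numpy()
print(out.shape)
t.release_gpu()
```

The key property the toy preserves: copies happen only where the caller asks for them, never implicitly per operation.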

## Checking Availability

Gracefully handle systems without Vulkan:

```python
import grilly

if grilly.VULKAN_AVAILABLE:
    print("Vulkan available - GPU acceleration enabled")
    compute = grilly.Compute()
else:
    print("Vulkan not found - using CPU numpy fallback")
```
**Tip:** All `nn.Module` layers work without Vulkan; they fall back to numpy automatically. You can develop on a laptop without a GPU and deploy to a GPU server without changing any code.
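A flag like `VULKAN_AVAILABLE` is commonly implemented as a guarded import at package load time. A generic sketch of that pattern (not Grilly's actual source; `_vulkan_backend` is a hypothetical module name):

```python
# Guarded-import pattern for an optional accelerated backend.
try:
    import _vulkan_backend  # hypothetical compiled extension
    VULKAN_AVAILABLE = True
except ImportError:
    VULKAN_AVAILABLE = False

def backend_name():
    # Callers branch on the flag instead of wrapping every call in try/except
    return "vulkan" if VULKAN_AVAILABLE else "numpy"

print(backend_name())
```

Probing once at import keeps the failure mode (no Vulkan loader installed) out of the hot path.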

## Performance: CPU vs GPU

Measure the speedup for a large matrix operation:

```python
import grilly
import grilly.functional as F
import numpy as np
import time

x = np.random.randn(512, 1024).astype(np.float32)
w = np.random.randn(2048, 1024).astype(np.float32)
b = np.zeros(2048, dtype=np.float32)

# CPU (numpy)
t0 = time.perf_counter()
for _ in range(100):
    cpu_out = x @ w.T + b
cpu_ms = (time.perf_counter() - t0) * 1000

# GPU (Vulkan)
t0 = time.perf_counter()
for _ in range(100):
    gpu_out = F.linear(x, w, b)
gpu_ms = (time.perf_counter() - t0) * 1000

print(f"CPU: {cpu_ms:.1f} ms  |  GPU: {gpu_ms:.1f} ms  |  Speedup: {cpu_ms/gpu_ms:.1f}x")
```
Output (example on RX 6750 XT):

```
CPU: 4821.3 ms  |  GPU: 312.7 ms  |  Speedup: 15.4x
```
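One caveat when timing GPU paths: the first dispatch often includes one-time costs such as shader compilation and pipeline setup, so warm up before measuring. A small generic helper, assuming nothing beyond the standard library:

```python
import time

def bench_ms(fn, iters=100, warmup=5):
    """Average wall-clock time of fn() in milliseconds, after warmup runs."""
    for _ in range(warmup):
        fn()  # absorb one-time costs (e.g. pipeline/shader setup)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) * 1000 / iters

# Works for any zero-argument callable, CPU or GPU
print(f"{bench_ms(lambda: sum(range(1000))):.4f} ms")
```

Dropping the benchmark loops above into `bench_ms` would also let you vary the iteration count without duplicating the timing boilerplate.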