# GPU Acceleration with Vulkan
Grilly runs on any GPU with Vulkan drivers — AMD, NVIDIA, or Intel. No CUDA required.
## Initializing Vulkan
The `Compute` class initializes a Vulkan device, loads SPIR-V shader bytecode, and sets up compute pipelines. One call gives you access to all GPU operations:
```python
import grilly

# Initialize Vulkan backend (selects the best available GPU)
compute = grilly.Compute()
print(f"GPU: {compute.core.device_properties.deviceName}")
```
**Output:**
```
GPU: AMD Radeon RX 6750 XT
```
> **vs PyTorch:** PyTorch uses `torch.device('cuda')` and requires NVIDIA drivers plus the CUDA toolkit. Grilly uses Vulkan, which ships with standard GPU drivers from all vendors. No separate SDK install is needed.
## Environment Variables
Control GPU selection and debugging behavior at startup:
| Variable | Purpose | Example |
|---|---|---|
| `VK_GPU_INDEX` | Select a specific GPU by index (0-based) | `VK_GPU_INDEX=1` |
| `GRILLY_DEBUG` | Enable verbose Vulkan logging | `GRILLY_DEBUG=1` |
| `ALLOW_CPU_VULKAN` | Allow the llvmpipe software renderer (for CI) | `ALLOW_CPU_VULKAN=1` |
```bash
# Use the second GPU, enable debug logging
VK_GPU_INDEX=1 GRILLY_DEBUG=1 python train.py
```
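Environment variables can also be set from Python, as long as they are in place before the backend initializes. A minimal sketch using only the standard library (assuming, as is typical, that the variables are read when `grilly` is first imported):

```python
import os

# Set before importing grilly, so the Vulkan backend sees them at init time
os.environ["VK_GPU_INDEX"] = "1"   # pick the second GPU
os.environ["GRILLY_DEBUG"] = "1"   # verbose Vulkan logging

# import grilly  # import only after the variables are in place
```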
## Module-Level GPU Mode
By default, each layer transfers data CPU → GPU → CPU for every forward call. Enable `gpu_mode` to keep intermediate activations in VRAM between layers:
```python
import grilly.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# Keep tensors on GPU between layers (avoid PCIe round-trips)
model.gpu_mode(True)

# Alternative: set device explicitly
model.to('vulkan')
```
> **Tip:** GPU VRAM bandwidth is typically 20-30x higher than PCIe bandwidth. On an AMD RX 6750 XT, VRAM runs at ~384 GB/s vs ~14 GB/s for PCIe 3.0 x16. Enabling `gpu_mode` is the single biggest performance knob.
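To see why the round-trips matter, it helps to put numbers on one batch. A back-of-the-envelope calculation using the bandwidth figures above (the batch shape matches the 1024 x 784 input used later in this page):

```python
# Rough transfer-time estimate for one batch of fp32 activations
batch_bytes = 1024 * 784 * 4   # 1024 samples x 784 features x 4 bytes each

pcie_bps = 14e9                # ~PCIe 3.0 x16 effective bandwidth (bytes/s)
vram_bps = 384e9               # ~RX 6750 XT VRAM bandwidth (bytes/s)

pcie_us = batch_bytes / pcie_bps * 1e6
vram_us = batch_bytes / vram_bps * 1e6

# The ratio is just 384/14, i.e. about 27x per traversal
print(f"PCIe: {pcie_us:.0f} us | VRAM: {vram_us:.1f} us | ratio: {pcie_us / vram_us:.0f}x")
```

Every layer boundary in the default mode pays that PCIe cost twice (download then re-upload), which is exactly what `gpu_mode(True)` eliminates.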
## VulkanTensor: Zero-Copy Pipeline
For maximum control, use `VulkanTensor` to manage GPU uploads and downloads explicitly:
```python
from grilly import VulkanTensor
import numpy as np

# Prepare data on CPU
x_cpu = np.random.randn(1024, 784).astype(np.float32)

# Upload once to GPU
x_gpu = VulkanTensor(x_cpu)
x_gpu.upload()

# Run model (data stays in VRAM)
output_gpu = model(x_gpu)

# Download only at the end
result = output_gpu.numpy()
print(f"Result shape: {result.shape}")

# Clean up GPU memory
x_gpu.release_gpu()
```
> **vs PyTorch:** In PyTorch you write `x.to('cuda')` and the result lives on the GPU until you call `.cpu()`. Grilly's `VulkanTensor` is the same concept: `.upload()` and `.numpy()` are explicit equivalents of `.to('cuda')` and `.cpu()`.
## Checking Availability
Gracefully handle systems without Vulkan:
```python
import grilly

if grilly.VULKAN_AVAILABLE:
    print("Vulkan available - GPU acceleration enabled")
    compute = grilly.Compute()
else:
    print("Vulkan not found - using CPU numpy fallback")
```
> **Tip:** All `nn.Module` layers work without Vulkan; they fall back to numpy automatically. You can develop on a laptop without a GPU and deploy to a GPU server without changing any code.
## Performance: CPU vs GPU
Measure the speedup for a large matrix operation:
```python
import grilly
import grilly.functional as F
import numpy as np
import time

x = np.random.randn(512, 1024).astype(np.float32)
w = np.random.randn(2048, 1024).astype(np.float32)
b = np.zeros(2048, dtype=np.float32)

# CPU (numpy)
t0 = time.perf_counter()
for _ in range(100):
    cpu_out = x @ w.T + b
cpu_ms = (time.perf_counter() - t0) * 1000

# GPU (Vulkan)
t0 = time.perf_counter()
for _ in range(100):
    gpu_out = F.linear(x, w, b)
gpu_ms = (time.perf_counter() - t0) * 1000

print(f"CPU: {cpu_ms:.1f} ms | GPU: {gpu_ms:.1f} ms | Speedup: {cpu_ms/gpu_ms:.1f}x")
```
**Output (example on RX 6750 XT):**
```
CPU: 4821.3 ms | GPU: 312.7 ms | Speedup: 15.4x
```
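One caveat with a naive timing loop like the one above: the first GPU call typically includes one-time costs such as shader and pipeline compilation, which inflates the measurement. A small backend-agnostic helper that warms up before timing (a sketch using only the standard library and numpy; the matmul here is a stand-in workload, not the Grilly API):

```python
import time
import numpy as np

def bench(fn, warmup=3, iters=100):
    """Time iters calls of fn() after warmup calls; returns total milliseconds."""
    for _ in range(warmup):   # absorb one-time costs (e.g. pipeline compilation)
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) * 1000

x = np.random.randn(64, 128).astype(np.float32)
w = np.random.randn(256, 128).astype(np.float32)

cpu_ms = bench(lambda: x @ w.T)
print(f"CPU: {cpu_ms:.2f} ms over 100 iters")
```

Passing the CPU and GPU variants of the same operation through `bench` gives a fairer comparison than timing cold starts.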