GPU Neural Networks via Vulkan

Tensors and Operations

Grilly uses plain numpy arrays as its core data format; no custom tensor class is needed for most work.

The Data Format

Unlike PyTorch's torch.Tensor, grilly operates directly on numpy float32 arrays. Every layer accepts and returns numpy arrays. The framework also provides PyTorch-style factory functions for convenience:

python
import grilly
import numpy as np

# 1D tensor (vector)
x = grilly.tensor([1.0, 2.0, 3.0])
print("1D:", x, x.dtype)

# 2D tensor (matrix)
m = grilly.tensor([[1, 2], [3, 4]])
print("2D:", m)

# Factory functions
z = grilly.zeros(2, 3)     # shape (2, 3), all zeros
o = grilly.ones(2, 3)      # shape (2, 3), all ones
r = grilly.randn(2, 3)     # shape (2, 3), random normal
print("Zeros:", z)
print("Ones:", o)
print("Random:", r)
Output
1D: [1. 2. 3.] float32
2D: [[1. 2.]
 [3. 4.]]
Zeros: [[0. 0. 0.]
 [0. 0. 0.]]
Ones: [[1. 1. 1.]
 [1. 1. 1.]]
Random: [[-0.4532  1.2041  0.0893]
 [ 0.7612 -0.3287  1.5140]]
vs PyTorch All of these return np.ndarray with dtype=float32. There is no .cuda() or .to(device) call needed for basic operations — data is always numpy until you opt into GPU mode with VulkanTensor.
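Since the factories return plain arrays, they behave as if they were thin wrappers over numpy. A rough numpy-only sketch, assuming `tensor` and `zeros` simply coerce to float32 (these re-implementations are illustrative, not grilly's actual source):

```python
import numpy as np

# Hypothetical re-implementations of grilly's factory functions,
# shown only to illustrate that the return type is a plain np.ndarray.
def tensor(data):
    return np.asarray(data, dtype=np.float32)

def zeros(*shape):
    return np.zeros(shape, dtype=np.float32)

x = tensor([1.0, 2.0, 3.0])
z = zeros(2, 3)
print(type(x).__name__, x.dtype)   # ndarray float32
print(z.shape)                     # (2, 3)
```

Because the result is an ordinary `np.ndarray`, anything that consumes numpy (matplotlib, scipy, pandas) consumes grilly tensors directly.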

Indexing and Slicing

Since grilly tensors are numpy arrays, you get full numpy indexing, slicing, and fancy indexing for free:

python
import numpy as np

tensor = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.float32)

# Single element
element = tensor[1, 0]
print(f"Element [1,0]: {element}")

# Slicing: first two rows
sliced = tensor[:2, :]
print(f"First two rows:\n{sliced}")

# Boolean mask
mask = tensor > 3
print(f"Elements > 3: {tensor[mask]}")

# Fancy indexing
rows = np.array([0, 2])
print(f"Rows 0 and 2:\n{tensor[rows]}")
Output
Element [1,0]: 3.0
First two rows:
[[1. 2.]
 [3. 4.]]
Elements > 3: [4. 5. 6.]
Rows 0 and 2:
[[1. 2.]
 [5. 6.]]

Reshaping

Reshape arrays to adapt tensor dimensions between layers. For contiguous arrays, `reshape` and `.T` return views rather than copies; note that `flatten()` always returns a copy (use `ravel()` for a view when possible):

python
tensor = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.float32)

# Reshape (3, 2) -> (2, 3)
reshaped = tensor.reshape(2, 3)
print(f"Reshaped (2x3):\n{reshaped}")

# Flatten to 1D
flat = tensor.flatten()
print(f"Flat: {flat}")

# Transpose
t = tensor.T
print(f"Transposed (2x3):\n{t}")

# Add/remove dimensions
expanded = np.expand_dims(tensor, axis=0)  # (1, 3, 2)
squeezed = np.squeeze(expanded)              # (3, 2)
print(f"Expanded shape: {expanded.shape}")
print(f"Squeezed shape: {squeezed.shape}")
Output
Reshaped (2x3):
[[1. 2. 3.]
 [4. 5. 6.]]
Flat: [1. 2. 3. 4. 5. 6.]
Transposed (2x3):
[[1. 3. 5.]
 [2. 4. 6.]]
Expanded shape: (1, 3, 2)
Squeezed shape: (3, 2)
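To check which of these operations copy, `np.shares_memory` is useful. On a contiguous array, `reshape`, `.T`, and `ravel()` share the original buffer, while `flatten()` copies:

```python
import numpy as np

tensor = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.float32)

print(np.shares_memory(tensor, tensor.reshape(2, 3)))   # True  (view)
print(np.shares_memory(tensor, tensor.T))               # True  (view)
print(np.shares_memory(tensor, tensor.flatten()))       # False (copy)
print(np.shares_memory(tensor, tensor.ravel()))         # True  (view)

# -1 lets numpy infer one dimension from the total size
print(tensor.reshape(-1, 3).shape)   # (2, 3)
```

This matters for large activations: a view is free, a copy doubles the memory for that buffer.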

Broadcasting and Matrix Multiplication

Broadcasting lets you combine arrays of different shapes. Matrix multiplication is at the heart of every neural network layer:

python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
b = np.array([[10, 20, 30]], dtype=np.float32)

# Broadcasting: (2, 3) + (1, 3) -> (2, 3)
result = a + b
print(f"Broadcast add:\n{result}")

# Matrix multiplication: (2, 3) @ (3, 2) -> (2, 2)
matmul = a @ a.T
print(f"A @ A^T:\n{matmul}")

# Element-wise operations
print(f"a * 2:\n{a * 2}")
print(f"np.sqrt(a):\n{np.sqrt(a)}")
Output
Broadcast add:
[[11. 22. 33.]
 [14. 25. 36.]]
A @ A^T:
[[14. 32.]
 [32. 77.]]
a * 2:
[[ 2.  4.  6.]
 [ 8. 10. 12.]]
np.sqrt(a):
[[1.    1.414 1.732]
 [2.    2.236 2.449]]
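These two primitives are exactly what a dense layer needs: a matmul for the weights and a broadcast add for the bias. A minimal numpy-only sketch of a linear layer forward pass (the function name and shapes are illustrative, not grilly's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_forward(x, W, b):
    # x: (batch, in_features), W: (in_features, out_features), b: (out_features,)
    # The bias broadcasts from (out_features,) to (batch, out_features)
    return x @ W + b

x = rng.standard_normal((4, 3)).astype(np.float32)
W = rng.standard_normal((3, 2)).astype(np.float32)
b = np.zeros(2, dtype=np.float32)

y = linear_forward(x, W, b)
print(y.shape)   # (4, 2)
```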

VulkanTensor: GPU-Resident Data

VulkanTensor wraps a numpy array with an optional GPU buffer. When modules run in gpu_mode, data stays in VRAM between layers, avoiding PCIe round-trips:

python
from grilly import VulkanTensor
import numpy as np

# Create from numpy
data = np.random.randn(32, 128).astype(np.float32)
vt = VulkanTensor(data)

# Factory methods
vt_zeros = VulkanTensor.zeros((32, 128))
vt_ones  = VulkanTensor.ones((16, 64))
vt_empty = VulkanTensor.empty((8, 256))

# Properties
print(vt.shape)    # (32, 128)
print(vt.dtype)    # float32
print(vt.on_gpu)   # True or False

# Transfer between CPU and GPU
vt.upload()        # force upload to GPU
arr = vt.numpy()   # download to CPU numpy
arr = vt.cpu()      # alias for numpy()
Note VulkanTensor requires Vulkan drivers to be installed. Check grilly.VULKAN_AVAILABLE at runtime. Without Vulkan, grilly still works — it falls back to CPU numpy automatically.

Autograd Variable

For automatic differentiation, grilly provides Variable — a wrapper that tracks operations and computes gradients via backpropagation:

python
from grilly.nn import Variable, no_grad
import numpy as np

# Create differentiable variables
x = Variable(np.array([1.0, 2.0, 3.0]), requires_grad=True)
w = Variable(np.array([0.5, -0.3, 0.8]), requires_grad=True)

# Forward: build computation graph
y = x * w
loss = y.sum()

# Backward: compute gradients
loss.backward()
print("x.grad:", x.grad)  # d(loss)/dx = w
print("w.grad:", w.grad)  # d(loss)/dw = x

# Disable gradient tracking
with no_grad():
    val = x * 2   # no graph, no gradient
Output
x.grad: [ 0.5 -0.3  0.8]
w.grad: [1. 2. 3.]
Tip Variable-based autograd is optional. Most grilly training code uses the nn.Module backward methods with explicit gradient passing, which is faster for standard architectures.
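For comparison, the explicit-gradient style mentioned in the tip computes the same derivatives by hand. For loss = sum(x * w), the chain rule gives d(loss)/dx = w and d(loss)/dw = x; a numpy-only sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0], dtype=np.float32)
w = np.array([0.5, -0.3, 0.8], dtype=np.float32)

# Forward
y = x * w
loss = y.sum()

# Backward by hand: d(loss)/dy is 1 for every element,
# then the product rule for y = x * w distributes it
grad_y = np.ones_like(y)
grad_x = grad_y * w   # matches x.grad above
grad_w = grad_y * x   # matches w.grad above

print("grad_x:", grad_x)   # [ 0.5 -0.3  0.8]
print("grad_w:", grad_w)   # [1. 2. 3.]
```

The explicit version skips graph construction entirely, which is why it tends to be faster for fixed architectures.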