GPU Neural Networks via Vulkan

Advanced Features

Attention mechanisms, efficient fine-tuning, automatic differentiation, and the full functional API.

Multi-Head Attention

Self-attention lets each position attend to all other positions in a sequence. Grilly provides both standard and flash attention:

python
import grilly.nn as nn
import numpy as np

attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)

# Self-attention: Q = K = V
x = np.random.randn(4, 32, 256).astype(np.float32)
# (batch=4, seq_len=32, embed_dim=256)

output, weights = attn(x, x, x)
print(f"Output: {output.shape}")     # (4, 32, 256)
print(f"Weights: {weights.shape}")   # (4, 8, 32, 32) attention maps
Output
Output: (4, 32, 256)
Weights: (4, 8, 32, 32)
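
The same layer also handles cross-attention: pass one tensor for the queries and a different tensor for the keys and values. A minimal sketch reusing the attn layer from above (the decoder/encoder roles are illustrative):

python
# Cross-attention: queries from one sequence, keys/values from another
queries = np.random.randn(4, 16, 256).astype(np.float32)  # e.g. decoder states
context = np.random.randn(4, 32, 256).astype(np.float32)  # e.g. encoder states

out, w = attn(queries, context, context)
# Expected shapes: out (4, 16, 256), w (4, 8, 16, 32)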

FlashAttention2

FlashAttention2 uses tiled GPU computation to reduce memory from O(seq²) to O(seq), enabling much longer sequences:

python
fa2 = nn.FlashAttention2(
    embed_dim=256,
    num_heads=8,
    use_rope=False,  # True to fuse RoPE into attention
)

q = k = v = np.random.randn(4, 32, 256).astype(np.float32)
out = fa2(q, k, v)
print(f"FlashAttention2 output: {out.shape}")
Output
FlashAttention2 output: (4, 32, 256)
Tip Set use_rope=True to fuse rotary position embeddings directly into the attention kernel, saving an extra GPU dispatch. The fused shader is at flash-attention2-rope.spv.
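
A minimal sketch of the fused variant, assuming the call signature is unchanged when use_rope=True:

python
fa2_rope = nn.FlashAttention2(
    embed_dim=256,
    num_heads=8,
    use_rope=True,  # rotary embeddings applied inside the attention kernel
)

q = k = v = np.random.randn(4, 32, 256).astype(np.float32)
out = fa2_rope(q, k, v)  # no separate RoPE dispatch needed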

Rotary Position Embeddings (RoPE)

RoPE encodes position information by rotating query and key vectors in complex space. Used by LLaMA, Mistral, and many modern architectures:

python
from grilly.nn import RoPE

rope = RoPE(head_dim=64, max_seq_len=2048, base=10000.0)

q = np.random.randn(4, 32, 8, 64).astype(np.float32)
# (batch, seq_len, num_heads, head_dim)

q_rotated = rope(q, seq_len=32)
print(f"RoPE output: {q_rotated.shape}")
Output
RoPE output: (4, 32, 8, 64)
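
Under the hood, RoPE treats each pair of channels (x1, x2) as a point in the plane and rotates it by a position-dependent angle m·θ_i, with θ_i = base^(-2i/head_dim). A plain-numpy reference of that rotation (an illustrative sketch using the interleaved pairing convention, not Grilly's GPU kernel):

python
def rope_reference(x, base=10000.0):
    # x: (batch, seq_len, num_heads, head_dim), head_dim must be even
    b, s, h, d = x.shape
    pos = np.arange(s, dtype=np.float32)                          # position m
    theta = base ** (-np.arange(0, d, 2, dtype=np.float32) / d)   # theta_i
    angles = pos[:, None] * theta[None, :]                        # (seq_len, d/2)
    cos = np.cos(angles)[None, :, None, :]
    sin = np.sin(angles)[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out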

LoRA: Low-Rank Adaptation

LoRA adds small trainable low-rank matrices to frozen base weights, cutting the trainable parameters of a d x d weight matrix from d² to 2dr for rank r ≪ d. This enables efficient fine-tuning of large models:

python
from grilly.nn import LoRAConfig, LoRAModel, LoRALinear

# Direct LoRA linear layer
lora_layer = LoRALinear(
    in_features=768,
    out_features=768,
    rank=8,          # low rank (r << d)
    alpha=16.0,      # scaling factor
)
# LoRA: W_eff = W_base + (alpha/r) * B @ A
# Only A (r x d) and B (d x r) are trainable

# Config-based approach for wrapping a full model
config = LoRAConfig(
    rank=8,
    alpha=16.0,
    dropout=0.1,
    target_modules=['q_proj', 'v_proj'],
)

# Wrap an existing model
# lora_model = LoRAModel(base_model, config)
# Only LoRA params are trainable; base weights frozen
vs PyTorch In PyTorch you'd use the peft library for LoRA. Grilly has LoRA built into grilly.nn with the same API pattern, no extra dependency needed.
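
To make the savings concrete, here is the parameter arithmetic and the effective forward pass for the 768 x 768 layer above (a plain-numpy sketch of the formula in the comments, not the library's internals):

python
import numpy as np

d, r, alpha = 768, 8, 16.0

base_params = d * d      # 589,824 frozen weights
lora_params = 2 * d * r  # 12,288 trainable weights (A and B)
print(f"trainable fraction: {lora_params / base_params:.1%}")  # ~2.1%

# Effective forward pass: y = x @ (W_base + (alpha/r) * B @ A).T
W = np.random.randn(d, d).astype(np.float32)            # frozen base weight
A = (np.random.randn(r, d) * 0.01).astype(np.float32)   # trainable (r x d)
B = np.zeros((d, r), dtype=np.float32)                  # trainable (d x r), starts at zero
x = np.random.randn(32, d).astype(np.float32)
y = x @ (W + (alpha / r) * (B @ A)).T                    # (32, 768)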

The Functional API

Stateless functional versions of every operation, matching torch.nn.functional:

python
import grilly.functional as F
import numpy as np

x = np.random.randn(32, 128).astype(np.float32)
w = np.random.randn(64, 128).astype(np.float32)
b = np.zeros(64, dtype=np.float32)

# Linear + activation
out = F.linear(x, w, b)           # x @ w.T + b
out = F.relu(out)                  # max(0, x)
out = F.gelu(out)                  # GELU activation
out = F.silu(out)                  # x * sigmoid(x)

# Normalization
gamma = np.ones(64, dtype=np.float32)
beta  = np.zeros(64, dtype=np.float32)
out = F.layer_norm(out, gamma, beta)

# Dropout (training only)
out = F.dropout(out, p=0.1, training=True)

# Softmax
probs = F.softmax(out)

# Loss
targets = np.random.randint(0, 64, 32)
loss = F.cross_entropy(out, targets)

# Attention
q = k = v = np.random.randn(2, 16, 64).astype(np.float32)
attn_out = F.attention(q, k, v)
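
F.attention implements scaled dot-product attention, softmax(Q·Kᵀ/√d)·V. A plain-numpy sketch of that formula for reference (illustrative of the math, not Grilly's GPU kernel):

python
def sdpa_reference(q, k, v):
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (batch, q_len, k_len)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v                                  # (batch, q_len, head_dim)

ref = sdpa_reference(q, k, v)  # same q, k, v as above -> (2, 16, 64)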

Variable Autograd

Grilly includes a full autograd engine with a PyTorch-like Variable that builds a computation graph and computes gradients automatically:

python
from grilly.nn import Variable, no_grad
import numpy as np

# Create tracked variables
a = Variable(np.array([1.0, 2.0, 3.0]), requires_grad=True)
b = Variable(np.array([4.0, 5.0, 6.0]), requires_grad=True)

# Build computation graph
c = a * b + a.sum()
loss = c.sum()

# Backward: compute all gradients
loss.backward()

print("a.grad:", a.grad)  # d(loss)/da
print("b.grad:", b.grad)  # d(loss)/db

# Stop gradient tracking
with no_grad():
    val = a * 2  # no graph, no gradient overhead
Output
a.grad: [7. 8. 9.]
b.grad: [1. 2. 3.]

The autograd system supports arithmetic, reductions, activations, matrix operations, loss functions, and shape operations. Available operations include:

python
from grilly.nn import (
    # Arithmetic
    add, sub, mul, div, neg, pow, matmul,
    # Reductions
    sum, mean, max, min, var, std, norm,
    # Activations
    relu, sigmoid, tanh, gelu, silu, softmax,
    # Shape
    reshape, transpose, squeeze, unsqueeze, flatten,
    concat, stack, clone,
    # Loss
    cross_entropy, mse_loss, l1_loss, bce_loss,
)
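
A short sketch chaining a few of these operations through the graph, using only the names listed above and assuming mse_loss takes (prediction, target):

python
x = Variable(np.random.randn(8, 4).astype(np.float32), requires_grad=True)
w = Variable(np.random.randn(4, 1).astype(np.float32), requires_grad=True)
target = Variable(np.zeros((8, 1), dtype=np.float32))

pred = relu(matmul(x, w))   # (8, 1)
loss = mse_loss(pred, target)
loss.backward()

print(w.grad.shape)         # gradients flow back through relu and matmul: (4, 1)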

Optimizers Reference

Optimizer                Constructor                                     Notes
AdamW                    AdamW(params, lr=1e-3, weight_decay=0.01)       Default choice for most tasks
Adam                     Adam(params, lr=1e-3)                           Classic adaptive learning rate
SGD                      SGD(params, lr=0.01, momentum=0.9)              Stochastic gradient descent
NLMS                     NLMS(params, lr=1e-2)                           Normalized LMS (neuromorphic)
NaturalGradient          NaturalGradient(params, lr=1e-3)                Fisher-preconditioned updates
AutoHypergradientAdamW   AutoHypergradientAdamW(params)                  Auto LR via OSGM surprise signal
AffectAdam               AffectAdam(params, lr=1e-3)                     Affect-modulated Adam

Learning Rate Schedulers

python
import grilly.optim as optim

optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# Decay LR by 0.1 every 30 epochs
scheduler = optim.StepLR(optimizer, step_size=30, gamma=0.1)

# Cosine annealing
scheduler = optim.CosineAnnealingLR(optimizer, T_max=100)

# Reduce on plateau
scheduler = optim.ReduceLROnPlateau(optimizer, patience=10)

# One-cycle policy (call per batch, not per epoch)
scheduler = optim.OneCycleLR(optimizer, max_lr=1e-2, total_steps=1000)

# In training loop:
# scheduler.step()          # per epoch (StepLR, CosineAnnealingLR)
# scheduler.step(val_loss)  # per epoch (ReduceLROnPlateau)
# scheduler.step()          # per batch (OneCycleLR)
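
A skeleton of where the scheduler call sits relative to the optimizer, assuming a PyTorch-style zero_grad()/step() interface (loader and compute_loss are placeholders for your data pipeline and model):

python
scheduler = optim.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    for batch_x, batch_y in loader:            # placeholder data iterator
        optimizer.zero_grad()
        loss = compute_loss(batch_x, batch_y)  # placeholder forward + loss
        loss.backward()
        optimizer.step()
    scheduler.step()                           # per-epoch schedulers step here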

Transformer Layers

Build transformers from encoder and decoder layer blocks:

python
from grilly.nn import TransformerEncoderLayer, TransformerDecoderLayer
import numpy as np

# Single encoder layer: self-attention + FFN
encoder = TransformerEncoderLayer(
    d_model=512,
    nhead=8,
    dim_feedforward=2048,
    dropout=0.1,
)

x = np.random.randn(4, 64, 512).astype(np.float32)
out = encoder(x)
print(f"Encoder output: {out.shape}")  # (4, 64, 512)

HuggingFace Bridge

Load pre-trained weights from HuggingFace models without the PyTorch runtime:

python
from grilly.utils import HuggingFaceBridge

bridge = HuggingFaceBridge()

# Load weights as numpy arrays (no torch dependency)
weights = bridge.load_weights('bert-base-uncased')
# Returns dict: {'encoder.layer.0.attention.self.query.weight': ndarray, ...}
Tip The HuggingFace bridge downloads safetensors or pytorch_model.bin and extracts the weight tensors directly into numpy arrays, bypassing PyTorch entirely. This lets you run inference on grilly without installing torch.
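
Once loaded, the numpy tensors plug straight into the functional API. A sketch applying one projection from the dict with F.linear (the bias key name is assumed to follow the same pattern as the weight key; bert-base-uncased uses a hidden size of 768):

python
import grilly.functional as F
import numpy as np

w_q = weights['encoder.layer.0.attention.self.query.weight']  # (768, 768)
b_q = weights['encoder.layer.0.attention.self.query.bias']    # (768,)

hidden = np.random.randn(128, 768).astype(np.float32)  # fake hidden states
q = F.linear(hidden, w_q, b_q)                          # query projection: (128, 768)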