Advanced Features
Attention mechanisms, efficient fine-tuning, automatic differentiation, and the full functional API.
Multi-Head Attention
Self-attention lets each position attend to all other positions in a sequence. Grilly provides both standard and flash attention:
import grilly.nn as nn
import numpy as np
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8)
# Self-attention: Q = K = V
x = np.random.randn(4, 32, 256).astype(np.float32)
# (batch=4, seq_len=32, embed_dim=256)
output, weights = attn(x, x, x)
print(f"Output: {output.shape}") # (4, 32, 256)
print(f"Weights: {weights.shape}") # (4, 8, 32, 32) attention maps
Output: (4, 32, 256)
Weights: (4, 8, 32, 32)
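The same module computes cross-attention when Q and K/V come from different sequences. A short sketch, assuming the layer accepts differing query and key/value lengths (standard multi-head attention semantics; not verified against Grilly's implementation):
# Cross-attention: queries from one sequence, keys/values from another
q = np.random.randn(4, 10, 256).astype(np.float32)   # e.g. decoder states, seq_len=10
kv = np.random.randn(4, 32, 256).astype(np.float32)  # e.g. encoder states, seq_len=32
out, w = attn(q, kv, kv)
# Expected shapes: out (4, 10, 256), w (4, 8, 10, 32)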
FlashAttention2
FlashAttention2 uses tiled GPU computation to reduce memory from O(seq²) to O(seq), enabling much longer sequences:
fa2 = nn.FlashAttention2(
    embed_dim=256,
    num_heads=8,
    use_rope=False,  # True to fuse RoPE into attention
)
q = k = v = np.random.randn(4, 32, 256).astype(np.float32)
out = fa2(q, k, v)
print(f"FlashAttention2 output: {out.shape}")
FlashAttention2 output: (4, 32, 256)
Set use_rope=True to fuse rotary position embeddings directly into the attention kernel, saving an extra GPU dispatch. The fused shader is at flash-attention2-rope.spv.
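To see what the tiling saves, here is a plain-numpy reference implementation of scaled dot-product attention; it materializes the full (seq, seq) score matrix that FlashAttention2 avoids. A sketch for intuition, not Grilly code:
import numpy as np

def naive_attention(q, k, v):
    # q, k, v: (batch, seq_len, head_dim)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # (batch, seq, seq): the O(seq^2) buffer
    scores -= scores.max(axis=-1, keepdims=True)    # numerically stable softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v                                    # (batch, seq_len, head_dim)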
Rotary Position Embeddings (RoPE)
RoPE encodes position information by rotating query and key vectors in complex space. Used by LLaMA, Mistral, and many modern architectures:
from grilly.nn import RoPE
rope = RoPE(head_dim=64, max_seq_len=2048, base=10000.0)
q = np.random.randn(4, 32, 8, 64).astype(np.float32)
# (batch, seq_len, num_heads, head_dim)
q_rotated = rope(q, seq_len=32)
print(f"RoPE output: {q_rotated.shape}")
RoPE output: (4, 32, 8, 64)
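Conceptually, RoPE pairs up the dimensions of each head and rotates pair i at position p by the angle p * base^(-2i/d). A plain-numpy sketch of that rotation, using the interleaved-pair convention (Grilly's kernel may pair dimensions differently):
import numpy as np

def rope_reference(x, base=10000.0):
    # x: (batch, seq_len, num_heads, head_dim), head_dim must be even
    b, s, h, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2, dtype=np.float32) / d)  # (d/2,)
    angles = np.arange(s, dtype=np.float32)[:, None] * inv_freq     # (seq_len, d/2)
    cos = np.cos(angles)[None, :, None, :]                          # broadcast over batch/heads
    sin = np.sin(angles)[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out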
LoRA: Low-Rank Adaptation
LoRA adds small trainable matrices to frozen base weights, reducing trainable parameters per layer from d² to 2dr (two rank-r factors). This enables efficient fine-tuning of large models:
from grilly.nn import LoRAConfig, LoRAModel, LoRALinear
# Direct LoRA linear layer
lora_layer = LoRALinear(
    in_features=768,
    out_features=768,
    rank=8,      # low rank (r << d)
    alpha=16.0,  # scaling factor
)
# LoRA: W_eff = W_base + (alpha/r) * B @ A
# Only A (r x d) and B (d x r) are trainable
# Config-based approach for wrapping a full model
config = LoRAConfig(
    rank=8,
    alpha=16.0,
    dropout=0.1,
    target_modules=['q_proj', 'v_proj'],
)
# Wrap an existing model
# lora_model = LoRAModel(base_model, config)
# Only LoRA params are trainable; base weights frozen
In the PyTorch ecosystem, LoRA usually means adding the separate peft library. Grilly has LoRA built into grilly.nn with the same API pattern, so no extra dependency is needed.
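To make the W_eff formula concrete, here is a plain-numpy sketch of the effective forward pass (illustrative variable names, not Grilly API):
import numpy as np

d, r, alpha = 768, 8, 16.0
W = np.random.randn(d, d).astype(np.float32) * 0.02  # frozen base weight
A = np.random.randn(r, d).astype(np.float32) * 0.01  # trainable, (r, d)
B = np.zeros((d, r), dtype=np.float32)               # trainable, initialized to zero
x = np.random.randn(4, d).astype(np.float32)

# Base path plus scaled low-rank path: x @ W_eff.T
y = x @ W.T + (alpha / r) * (x @ A.T) @ B.T

# With B at zero the LoRA path contributes nothing, so fine-tuning
# starts from exactly the base model's behavior
assert np.allclose(y, x @ W.T)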
The Functional API
Stateless functional versions of every operation, matching torch.nn.functional:
import grilly.functional as F
import numpy as np
x = np.random.randn(32, 128).astype(np.float32)
w = np.random.randn(64, 128).astype(np.float32)
b = np.zeros(64, dtype=np.float32)
# Linear + activation
out = F.linear(x, w, b) # x @ w.T + b
out = F.relu(out) # max(0, x)
out = F.gelu(out) # GELU activation
out = F.silu(out) # x * sigmoid(x)
# Normalization
gamma = np.ones(64, dtype=np.float32)
beta = np.zeros(64, dtype=np.float32)
out = F.layer_norm(out, gamma, beta)
# Dropout (training only)
out = F.dropout(out, p=0.1, training=True)
# Softmax
probs = F.softmax(out)
# Loss
targets = np.random.randint(0, 64, 32)
loss = F.cross_entropy(out, targets)
# Attention
q = k = v = np.random.randn(2, 16, 64).astype(np.float32)
attn_out = F.attention(q, k, v)
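Because every call is stateless, layers compose as ordinary functions. A small MLP forward built only from the calls shown above:
def mlp_forward(x, w1, b1, w2, b2, training=False):
    # Two-layer MLP: linear -> GELU -> dropout -> linear
    h = F.gelu(F.linear(x, w1, b1))
    h = F.dropout(h, p=0.1, training=training)
    return F.linear(h, w2, b2)

w2 = np.random.randn(10, 64).astype(np.float32)
b2 = np.zeros(10, dtype=np.float32)
y = mlp_forward(x, w, b, w2, b2)  # (32, 10)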
Variable Autograd
Grilly includes a full autograd engine with a PyTorch-like Variable that builds a computation graph and computes gradients automatically:
from grilly.nn import Variable, no_grad
import numpy as np
# Create tracked variables
a = Variable(np.array([1.0, 2.0, 3.0]), requires_grad=True)
b = Variable(np.array([4.0, 5.0, 6.0]), requires_grad=True)
# Build computation graph
c = a * b + a.sum()
loss = c.sum()
# Backward: compute all gradients
loss.backward()
print("a.grad:", a.grad) # d(loss)/da
print("b.grad:", b.grad) # d(loss)/db
# Stop gradient tracking
with no_grad():
    val = a * 2  # no graph, no gradient overhead
a.grad: [7. 8. 9.]
b.grad: [1. 2. 3.]
The autograd system supports arithmetic, reductions, activations, matrix operations, loss functions, and shape operations. Available operations include:
from grilly.nn import (
    # Arithmetic
    add, sub, mul, div, neg, pow, matmul,
    # Reductions
    sum, mean, max, min, var, std, norm,
    # Activations
    relu, sigmoid, tanh, gelu, silu, softmax,
    # Shape
    reshape, transpose, squeeze, unsqueeze, flatten,
    concat, stack, clone,
    # Loss
    cross_entropy, mse_loss, l1_loss, bce_loss,
)
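A short sketch composing these free functions, assuming they accept Variables and mirror the method forms shown above:
import numpy as np
from grilly.nn import Variable, matmul, relu, mse_loss

W = Variable(np.random.randn(4, 3).astype(np.float32), requires_grad=True)
x = Variable(np.random.randn(8, 4).astype(np.float32))
target = Variable(np.zeros((8, 3), dtype=np.float32))

pred = relu(matmul(x, W))      # graph flows through both ops
loss = mse_loss(pred, target)
loss.backward()
print(W.grad.shape)            # (4, 3): gradient reaches W through relu and matmul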
Optimizers Reference
| Optimizer | Constructor | Notes |
|---|---|---|
| AdamW | AdamW(params, lr=1e-3, weight_decay=0.01) | Default choice for most tasks |
| Adam | Adam(params, lr=1e-3) | Classic adaptive learning rate |
| SGD | SGD(params, lr=0.01, momentum=0.9) | Stochastic gradient descent |
| NLMS | NLMS(params, lr=1e-2) | Normalized LMS (neuromorphic) |
| NaturalGradient | NaturalGradient(params, lr=1e-3) | Fisher-preconditioned updates |
| AutoHypergradientAdamW | AutoHypergradientAdamW(params) | Auto LR via OSGM surprise signal |
| AffectAdam | AffectAdam(params, lr=1e-3) | Affect-modulated Adam |
Learning Rate Schedulers
import grilly.optim as optim
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
# Decay LR by 0.1 every 30 epochs
scheduler = optim.StepLR(optimizer, step_size=30, gamma=0.1)
# Cosine annealing
scheduler = optim.CosineAnnealingLR(optimizer, T_max=100)
# Reduce on plateau
scheduler = optim.ReduceLROnPlateau(optimizer, patience=10)
# One-cycle policy (call per batch, not per epoch)
scheduler = optim.OneCycleLR(optimizer, max_lr=1e-2, total_steps=1000)
# In training loop:
# scheduler.step() # per epoch (StepLR, Cosine)
# scheduler.step(val_loss) # per epoch (ReduceLROnPlateau)
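Putting it together, an illustrative epoch loop for ReduceLROnPlateau (train_one_epoch and evaluate are placeholder names, not Grilly API):
scheduler = optim.ReduceLROnPlateau(optimizer, patience=10)
for epoch in range(100):
    train_one_epoch(model, train_loader, optimizer)  # placeholder training pass
    val_loss = evaluate(model, val_loader)           # placeholder validation pass
    scheduler.step(val_loss)                         # plateau scheduler reads the metric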
Transformer Layers
Build transformers from encoder and decoder layer blocks:
from grilly.nn import TransformerEncoderLayer, TransformerDecoderLayer
import numpy as np
# Single encoder layer: self-attention + FFN
encoder = TransformerEncoderLayer(
    d_model=512,
    nhead=8,
    dim_feedforward=2048,
    dropout=0.1,
)
x = np.random.randn(4, 64, 512).astype(np.float32)
out = encoder(x)
print(f"Encoder output: {out.shape}") # (4, 64, 512)
HuggingFace Bridge
Load pre-trained weights from HuggingFace models without the PyTorch runtime:
from grilly.utils import HuggingFaceBridge
bridge = HuggingFaceBridge()
# Load weights as numpy arrays (no torch dependency)
weights = bridge.load_weights('bert-base-uncased')
# Returns dict: {'encoder.layer.0.attention.self.query.weight': ndarray, ...}
The bridge reads safetensors or pytorch_model.bin checkpoints and extracts the weight tensors directly into numpy arrays, bypassing PyTorch entirely. This lets you run inference on grilly without installing torch.
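Once loaded, the arrays plug straight into the functional API. A sketch using the query projection from the dict layout shown above (the bias key is extrapolated from the same naming scheme, so check your checkpoint's actual keys):
import numpy as np
import grilly.functional as F

w = weights['encoder.layer.0.attention.self.query.weight']  # (768, 768) for bert-base
b = weights['encoder.layer.0.attention.self.query.bias']    # assumed key, (768,)
x = np.random.randn(1, 768).astype(np.float32)
q = F.linear(x, w, b)  # query projection for a single token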