Datasets and DataLoaders
Wrap your data, split train/val, apply transforms, and iterate batches efficiently.
TensorDataset
The simplest way to wrap numpy arrays into a dataset. Each array must have the same first dimension (number of samples):
python
from grilly.utils.data import TensorDataset, DataLoader
import numpy as np
X = np.random.randn(1000, 64).astype(np.float32)
y = np.random.randint(0, 10, 1000).astype(np.int64)
dataset = TensorDataset(X, y)
print(f"Dataset size: {len(dataset)}")
print(f"First sample: X={dataset[0][0].shape}, y={dataset[0][1]}")
Output
Dataset size: 1000
First sample: X=(64,), y=7
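TensorDataset is not limited to two arrays. A sketch, assuming the variadic PyTorch-style API where any number of aligned arrays can be wrapped and indexing returns one element per array:
python
# weights is an illustrative third array, aligned on the first dimension
weights = np.random.rand(1000).astype(np.float32)
dataset3 = TensorDataset(X, y, weights)
x0, y0, w0 = dataset3[0]  # one element per wrapped array
print(x0.shape, y0, w0)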
Custom Dataset
For more control, subclass Dataset and implement __len__ and __getitem__:
python
from grilly.utils.data import Dataset
class MyDataset(Dataset):
    def __init__(self):
        self.data = np.array([
            [1.0, 2.0],
            [3.0, 4.0],
            [5.0, 6.0],
        ], dtype=np.float32)
        self.labels = np.array([0, 1, 0])

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx], self.labels[idx]

dataset = MyDataset()
print(f"Sample 1: {dataset[1]}")
Output
Sample 1: (array([3., 4.], dtype=float32), 1)
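Because __getitem__ runs only when a sample is requested, a custom Dataset can load data lazily instead of holding everything in memory. A sketch that reads one .npy file per sample (the folder layout and label scheme here are hypothetical):
python
import os
import numpy as np
from grilly.utils.data import Dataset

class NpyFolderDataset(Dataset):
    def __init__(self, root):
        self.root = root
        self.files = sorted(f for f in os.listdir(root) if f.endswith(".npy"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Loaded on demand, so only one sample is in memory at a time
        sample = np.load(os.path.join(self.root, self.files[idx]))
        label = 0  # hypothetical: parse the label from the filename instead
        return sample.astype(np.float32), label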
DataLoader: Batching and Shuffling
DataLoader iterates through a dataset in batches, optionally shuffling each epoch:
python
loader = DataLoader(dataset, batch_size=2, shuffle=True)
for batch_data, batch_labels in loader:
    print("Data:", batch_data)
    print("Labels:", batch_labels)
    print()
Output
Data: [[5. 6.]
 [1. 2.]]
Labels: [0 0]

Data: [[3. 4.]]
Labels: [1]
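Note that the final short batch is kept rather than dropped (the second batch above has a single sample), so the number of batches is the ceiling of len(dataset) / batch_size, which is what len(loader) reports:
python
import math

loader = DataLoader(dataset, batch_size=2, shuffle=False)
print(len(loader))                  # 2
print(math.ceil(len(dataset) / 2))  # 2 (3 samples, batch_size=2)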
vs PyTorch
The API is identical: DataLoader(dataset, batch_size, shuffle). The num_workers parameter is accepted for compatibility but currently ignored (all loading is single-process).
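Because the constructor signature matches, moving a data pipeline between the two libraries is, in principle, just an import swap — a sketch, assuming your code relies only on the shared arguments:
python
# from torch.utils.data import DataLoader   # PyTorch
from grilly.utils.data import DataLoader    # drop-in equivalent

loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)
# num_workers is accepted here but ignored: loading stays single-process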
Train/Validation Split
Use random_split to divide a dataset into train and validation subsets:
python
from grilly.utils.data import random_split
# Split 1000 samples into 800 train + 200 val
full_dataset = TensorDataset(X, y)
train_set, val_set = random_split(full_dataset, [800, 200])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64, shuffle=False)
print(f"Train: {len(train_set)} samples, {len(train_loader)} batches")
print(f"Val: {len(val_set)} samples, {len(val_loader)} batches")
Output
Train: 800 samples, 25 batches
Val: 200 samples, 4 batches
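random_split takes absolute subset sizes. If you prefer to think in fractions, compute the counts first — a small helper, not part of the library:
python
def split_sizes(n, train_frac=0.8):
    # Convert a fraction into two sizes that sum exactly to n
    n_train = int(n * train_frac)
    return [n_train, n - n_train]

train_set, val_set = random_split(full_dataset, split_sizes(len(full_dataset)))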
Transforms
Chain preprocessing steps with Compose. Transforms are applied per-sample when using ArrayDataset:
python
from grilly.utils.data import (
    ArrayDataset, Compose, ToFloat32,
    Normalize, Flatten, RandomNoise, RandomFlip,
)

# Define a transform pipeline
transform = Compose([
    ToFloat32(scale=1.0/255.0),    # uint8 -> float32, normalize to [0, 1]
    Normalize(mean=0.5, std=0.5),  # center to [-1, 1]
    RandomNoise(std=0.01),         # data augmentation
    Flatten(),                     # flatten to 1D
])

# Wrap data with transforms
images = np.random.randint(0, 256, (500, 28, 28)).astype(np.uint8)
labels = np.random.randint(0, 10, 500)
dataset = ArrayDataset(
    data=images,
    labels=labels,
    transform=transform,
)
sample, label = dataset[0]
print(f"Transformed sample shape: {sample.shape}")
print(f"Value range: [{sample.min():.2f}, {sample.max():.2f}]")
Output
Transformed sample shape: (784,)
Value range: [-1.02, 1.01]
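Compose simply applies each transform in order. Running the same pipeline by hand on one raw sample makes the chain visible — a sketch, assuming each transform is an object callable on a single array (which the per-sample behavior of ArrayDataset implies):
python
raw = images[0]                       # uint8, shape (28, 28)
x = ToFloat32(scale=1.0/255.0)(raw)   # float32 in [0, 1]
x = Normalize(mean=0.5, std=0.5)(x)   # roughly [-1, 1]
x = Flatten()(x)                      # shape (784,)
print(x.shape, x.dtype)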
Tip
Use RandomFlip(p=0.5) for image augmentation and OneHot(num_classes) for label encoding. The Lambda(fn) transform lets you apply any custom function.
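For example, a per-sample standardization that the built-ins don't cover can be written with Lambda — a sketch, assuming Lambda lives alongside the other transforms in grilly.utils.data:
python
from grilly.utils.data import Lambda

# Zero-mean, unit-variance per sample; epsilon guards against a zero std
standardize = Lambda(lambda x: (x - x.mean()) / (x.std() + 1e-8))

transform = Compose([
    ToFloat32(scale=1.0/255.0),
    standardize,
    Flatten(),
])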
Complete Data Pipeline
Putting it all together — dataset, transforms, split, and batched training:
python
import grilly.nn as nn
import grilly.optim as optim
from grilly.utils.data import TensorDataset, DataLoader, random_split
import numpy as np
# Create dataset
X = np.random.randn(1000, 64).astype(np.float32)
y = np.random.randint(0, 10, 1000).astype(np.int64)
dataset = TensorDataset(X, y)
# Split and create loaders
train_set, val_set = random_split(dataset, [800, 200])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
# Model + optimizer
model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
optimizer = optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Training loop with batches
for epoch in range(5):
    model.train()
    total_loss = 0.0
    for X_batch, y_batch in train_loader:
        output = model(X_batch)
        loss = loss_fn(output, y_batch)
        grad = loss_fn.backward(np.ones_like(loss), output, y_batch)
        model.zero_grad()
        model.backward(grad)
        optimizer.step()
        total_loss += float(np.mean(loss))
    print(f"Epoch {epoch+1}: avg_loss={total_loss/len(train_loader):.4f}")
Output
Epoch 1: avg_loss=2.3142
Epoch 2: avg_loss=2.2801
Epoch 3: avg_loss=2.2453
Epoch 4: avg_loss=2.2089
Epoch 5: avg_loss=2.1704
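The validation half of the pipeline follows the same pattern, minus the backward pass. A sketch of the evaluation loop, assuming model.eval() mirrors model.train() and taking predictions as the argmax over the output logits:
python
val_loader = DataLoader(val_set, batch_size=64, shuffle=False)

model.eval()  # assumed counterpart to model.train()
correct = 0
for X_batch, y_batch in val_loader:
    output = model(X_batch)            # (batch, 10) logits
    preds = np.argmax(output, axis=1)  # predicted class per sample
    correct += int(np.sum(preds == y_batch))
print(f"Val accuracy: {correct / len(val_set):.2%}")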