Quick Start Guide
This tutorial will walk you through the basic steps of using indexed-parquet-dataset to manage large Parquet files for machine learning.
Step 1: Prepare your data
Suppose you have a folder structure with multiple Parquet files:
Step 2: Initialize the Dataset
The easiest way to get started is by using from_folder. It scans the directory, indexes all files, and creates a dataset object.
from indexed_parquet import IndexedParquetDataset
dataset = IndexedParquetDataset.from_folder("data", pattern="*.parquet", recursive=True)
print(f"Total rows: {len(dataset)}")
print(f"Columns: {dataset.schema}")
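To make the effect of pattern and recursive concrete, here is a minimal sketch of recursive file discovery using only the standard library. It is not the library's implementation: the helper find_parquet_files and the folder layout are hypothetical, and the files are empty placeholders created only for illustration.

```python
import tempfile
from pathlib import Path


def find_parquet_files(root, pattern="*.parquet", recursive=True):
    """Return files under root that match pattern, sorted for a stable order.

    Hypothetical helper for illustration; not part of indexed-parquet-dataset.
    """
    root = Path(root)
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())


# Hypothetical layout built from empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    for rel in ["train/part-0.parquet", "train/part-1.parquet",
                "extra.parquet", "notes.txt"]:
        path = data / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()

    found_recursive = [p.relative_to(data).as_posix()
                       for p in find_parquet_files(data)]
    found_top_level = [p.relative_to(data).as_posix()
                       for p in find_parquet_files(data, recursive=False)]

print(found_recursive)   # files in data and all of its subfolders
print(found_top_level)   # only files directly inside data
```

In this sketch, a non-recursive search sees only files directly inside the folder, while the recursive search also descends into subfolders. Files that do not match the pattern are skipped in both cases.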
Step 3: Accessing Data
You can access rows by index just like a regular Python list. This operation is O(1) and does not depend on the dataset size.
# Single row access
row = dataset[0] # {'id': 1, 'name': 'Item A', ...}
# Slicing
subset = dataset[10:20] # Returns a list of dictionaries
# Fancy indexing
items = dataset[[1, 5, 100]] # Returns a list of dictionaries
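How can a row be found without scanning the dataset? One common approach, shown here as a sketch and not necessarily what the library does internally, is to store cumulative row counts per file and binary-search them. The class RowLocator and the row counts below are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


class RowLocator:
    """Maps a global row index to (file_index, local_row).

    Hypothetical illustration; not part of indexed-parquet-dataset.
    """

    def __init__(self, rows_per_file):
        # ends[i] is the total number of rows in files 0..i.
        self.ends = list(accumulate(rows_per_file))

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def locate(self, index):
        n = len(self)
        if index < 0:
            index += n  # support negative indices like a list
        if not 0 <= index < n:
            raise IndexError(f"row index out of range: {index}")
        # First file whose cumulative end is strictly greater than index.
        file_index = bisect_right(self.ends, index)
        start = self.ends[file_index - 1] if file_index else 0
        return file_index, index - start


# Three hypothetical files holding 100, 250, and 50 rows.
locator = RowLocator([100, 250, 50])
print(len(locator))         # 400
print(locator.locate(0))    # (0, 0)
print(locator.locate(100))  # (1, 0)
print(locator.locate(399))  # (2, 49)
```

In this sketch the lookup is a binary search over file boundaries, so its cost grows with the logarithm of the number of files and not with the number of rows.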
Step 4: Shuffling and Splitting
Working with training and validation sets is easy with built-in methods.
# Shuffle the whole dataset
shuffled_ds = dataset.shuffle(seed=42)
# Split into 80% train and 20% test
train_ds, test_ds = dataset.train_test_split(test_size=0.2, seed=42)
print(f"Training rows: {len(train_ds)}")
print(f"Testing rows: {len(test_ds)}")
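Shuffling and splitting a large dataset does not have to move any data. A common technique is to permute row indices and slice the permutation. The sketch below illustrates that idea with the standard library only. The functions shuffled_indices and split_indices are hypothetical, and the library's own shuffle and train_test_split may work differently.

```python
import random


def shuffled_indices(n, seed=None):
    """Return a seeded permutation of 0..n-1 without touching any row data."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return indices


def split_indices(n, test_size=0.2, seed=None):
    """Split a seeded permutation into (train_indices, test_indices)."""
    indices = shuffled_indices(n, seed)
    n_test = round(n * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(1000, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))  # 800 200
```

Because the permutation is derived from the seed, the same seed reproduces the same split, and the two index lists never overlap.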
Step 5: Integration with PyTorch
The IndexedParquetDataset class inherits from torch.utils.data.Dataset (if torch is installed), so it works out-of-the-box with DataLoader.
from torch.utils.data import DataLoader
loader = DataLoader(
    train_ds,
    batch_size=32,
    shuffle=True,
    num_workers=4
)
for batch in loader:
    # batch is a dictionary of tensors/lists
    images = batch['image']
    labels = batch['label']
    ...
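Why is each batch a dictionary? When a dataset yields one dictionary per row, the loader regroups a list of rows into a single dictionary keyed by column. The sketch below is a simplified pure-Python stand-in for that regrouping. The function collate_rows and the sample values are hypothetical. PyTorch's default collation additionally converts numeric values to tensors and handles nested sequences differently.

```python
def collate_rows(rows):
    """Turn a list of row dictionaries into one dictionary of column lists.

    Simplified stand-in for illustration; not PyTorch's default collation.
    """
    if not rows:
        return {}
    keys = rows[0].keys()
    return {key: [row[key] for row in rows] for key in keys}


# Hypothetical rows using the 'image' and 'label' keys from the loop above.
rows = [
    {"image": [0.1, 0.2], "label": 0},
    {"image": [0.3, 0.4], "label": 1},
]
batch = collate_rows(rows)
print(batch["label"])  # [0, 1]
print(batch["image"])  # [[0.1, 0.2], [0.3, 0.4]]
```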
Next Steps
- Learn about handling Schema Evolution when your Parquet files have different structures.
- Dive into the API Reference for detailed class documentation.