Indexed Parquet Dataset
Indexed Parquet Dataset is a high-performance Python library designed for O(1) random access to large-scale Parquet datasets.
It is specifically optimized for Deep Learning workflows, providing a seamless Dataset class for PyTorch while maintaining minimal memory overhead and supporting complex features like Schema Evolution.
Key Features
- ⚡ O(1) Random Access: Instantly jump to any row in a multi-gigabyte dataset without scanning.
- 🔄 Schema Evolution: Handle datasets where files have different schemas, missing columns, or renamed fields.
- 📦 Lazy Loading: Only opens file handles and reads data when requested, with an efficient LRU handle cache.
- 🔥 PyTorch Integration: Native
torch.utils.data.Datasetsupport with batching and shuffling. - 🛠️ Powerful API: Built-in support for filtering, mapping, splitting, and sharding.
Installation
To include PyTorch support:
Quick Example
from indexed_parquet import IndexedParquetDataset
# Scan a directory and index all parquet files
ds = IndexedParquetDataset.from_folder("path/to/data")
# Access any row instantly
row = ds[12345] # Returns a dictionary
print(row)
# Integration with PyTorch
from torch.utils.data import DataLoader
loader = DataLoader(ds, batch_size=32, shuffle=True)
Check out the Quick Start guide to learn more!