Metadata-Version: 2.3
Name: grist
Version: 0.1.0.dev0
Summary: Generative Retrieval Id Semantic Transforms on top of Google Grain.
Author: chicham
Author-email: chicham <hicham.randrianarivo@artefact.com>
Requires-Python: >=3.10, <3.13
Description-Content-Type: text/markdown

# GRIST 🌾

Generative Retrieval ID Semantic Transforms for reproducible data pipelines.

GRIST is a focused Python library for bridging raw research datasets and generative retrieval models. It enriches datasets with Semantic Identifiers, guarantees deterministic preprocessing, and provides helpers for publishing results to public hubs like HuggingFace and Kaggle. It is designed to work smoothly with existing data pipeline tooling, including Grain.

## Why GRIST

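Generative Retrieval (GR) research depends on reproducibility: if two runs assign different IDs to the same document, their results cannot be compared. GRIST treats a dataset not as a static file but as the output of a deterministic pipeline. Every transformation, from text cleaning to model-based ID generation, is designed to produce the same output from the same input and configuration.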

## Features

- Pipeline-native: Fits into existing data pipeline tooling such as Grain, with no new paradigm to learn.
- Semantic ID injection: Built-in `MapTransform` classes for UUIDs, hashes, or model-generated codes.
- Inference-ready: Wrap any pre-trained model (HuggingFace, JAX, PyTorch) as an ID generator.
- Publishing helpers: Tools to facilitate uploads to HuggingFace or Kaggle.
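
As an illustration of the hash-based ID injection listed above, here is a minimal sketch. It is not GRIST's API: `HashIdTransform`, `text_key`, `id_key`, and `length` are hypothetical names, and the class only mirrors the `map(element)` shape of a Grain `MapTransform` so that it runs with the standard library alone.

```python
import hashlib
from typing import Any, Mapping


class HashIdTransform:
    """Hypothetical transform that adds a content-derived ID to each record.

    Assumption: records are mappings with a string field to hash. In a Grain
    pipeline, a class like this would subclass Grain's MapTransform instead.
    """

    def __init__(self, text_key: str = "text", id_key: str = "semantic_id", length: int = 16):
        self.text_key = text_key
        self.id_key = id_key
        self.length = length

    def map(self, element: Mapping[str, Any]) -> dict[str, Any]:
        # SHA-256 depends only on the input bytes, so the same text always
        # yields the same ID, across runs and across machines.
        digest = hashlib.sha256(element[self.text_key].encode("utf-8")).hexdigest()
        # Return a new dict so the input record is left unmodified.
        return {**element, self.id_key: digest[: self.length]}


transform = HashIdTransform()
record = transform.map({"doc": 1, "text": "wheat and chaff"})
print(record["semantic_id"])
```

Hashing the content rather than drawing a random UUID is what makes the ID repeatable: rerunning the pipeline on unchanged text reproduces the same identifier.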

## Installation

```bash
uv add grist
```

## Quick Start

TODO: Quick start example for the planned public API.

## Concepts

- Semantic Identifiers: Stable, model-aware IDs that augment dataset samples for generative retrieval.
- Deterministic pipelines: Transforms are defined so that the same input and configuration always yield the same output.
- Dataset configs: Optional, reusable configuration files for well-known datasets.
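
One way to check the deterministic-pipeline property is to run the same preprocessing twice with the same seed and compare a digest of the outputs. The sketch below assumes a toy pipeline (seeded shuffle plus text normalization); `run_pipeline` and `fingerprint` are hypothetical helpers for illustration, not part of GRIST.

```python
import hashlib
import json
import random
from typing import Any


def run_pipeline(records: list[dict[str, Any]], seed: int) -> list[dict[str, Any]]:
    """Toy pipeline: seeded shuffle followed by text normalization."""
    rng = random.Random(seed)  # local RNG, so global random state is never touched
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [{**r, "text": r["text"].strip().lower()} for r in shuffled]


def fingerprint(records: list[dict[str, Any]]) -> str:
    """Order-sensitive SHA-256 digest over canonical JSON encodings."""
    h = hashlib.sha256()
    for r in records:
        # sort_keys gives a stable encoding regardless of dict key order.
        h.update(json.dumps(r, sort_keys=True, ensure_ascii=False).encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


data = [{"doc": i, "text": f"  Document {i} "} for i in range(5)]
first = fingerprint(run_pipeline(data, seed=0))
second = fingerprint(run_pipeline(data, seed=0))
print(first == second)  # prints True
```

Because the digest covers record order as well as content, a mismatch between two runs points to a source of nondeterminism, such as an unseeded shuffle.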

## Why the Name

In milling, grist is the grain separated from its chaff and ready to be ground. This library turns your "raw grain" (datasets) into grist: a refined form ready for the "mill" of generative retrieval models.
