Metadata-Version: 2.4
Name: macrodata-refiner
Version: 0.2.2
Summary: Refiner by Macrodata Labs, a data processing framework for large-scale machine learning datasets
Author: Macrodata Labs
License-Expression: Apache-2.0
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: cloudpickle==3.1.2
Requires-Dist: fsspec
Requires-Dist: httpx
Requires-Dist: loguru
Requires-Dist: opentelemetry-exporter-otlp-proto-http
Requires-Dist: opentelemetry-sdk
Requires-Dist: numpy
Requires-Dist: psutil
Requires-Dist: orjson
Requires-Dist: pyarrow
Requires-Dist: msgspec>=0.20.0
Provides-Extra: video
Requires-Dist: av; extra == "video"
Provides-Extra: robotics
Requires-Dist: macrodata-refiner[video]; extra == "robotics"
Requires-Dist: huggingface-hub>=1.4.1; extra == "robotics"
Requires-Dist: hf>=1.7.1; extra == "robotics"
Provides-Extra: text
Requires-Dist: warcio; extra == "text"
Provides-Extra: s3
Requires-Dist: s3fs; extra == "s3"
Provides-Extra: testing
Requires-Dist: macrodata-refiner[robotics]; extra == "testing"
Requires-Dist: macrodata-refiner[text]; extra == "testing"
Requires-Dist: macrodata-refiner[s3]; extra == "testing"
Requires-Dist: pytest>=8.0.0; extra == "testing"
Requires-Dist: pytest-cov>=5.0.0; extra == "testing"
Provides-Extra: all
Requires-Dist: macrodata-refiner[testing]; extra == "all"
Dynamic: license-file

<p align="center">
  <img src="https://macrodata.co/logo.svg" alt="Macrodata" width="180">
</p>

<h1 align="center">Macrodata Refiner</h1>

Refiner is an open-source engine for turning raw, unstructured, and multimodal data into **high-quality datasets** for large model training.

It replaces the brittle scripts and stitched-together data tooling that teams still use for training data work, and adds built-in support for multimodal data, robotics workflows, and model-based processing.

It also plugs into the Macrodata platform, which gives you visibility into what is happening to your data while pipelines run: job and shard lifecycle, logs, metrics, manifests, and pipeline behavior. The same code can run locally for development and then scale out through Macrodata's elastic serverless cloud.

## Quickstart

Install:

```bash
pip install macrodata-refiner
```

Create a Macrodata API key:

- https://macrodata.co/settings/api-keys

Log in:

```bash
macrodata login
```

### Cloud example

Launch a robotics pipeline on Macrodata Cloud. This requires a valid API key, created and registered with `macrodata login` as described above.

```python
import refiner as mdr

(
    mdr.read_lerobot("hf://datasets/macrodata/aloha_static_battery_ep005_009")
    .map(
        mdr.robotics.motion_trim(
            threshold=0.001,
            pad_frames=5,
        )
    )
    .write_lerobot("hf://buckets/macrodata/test_bucket/aloha_motion")
    .launch_cloud(
        name="motion_trim",
        num_workers=4,
    )
)
```
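
To give a feel for what the `threshold` and `pad_frames` parameters control, here is a plain-Python sketch of one plausible motion-trimming rule. It is an illustration under stated assumptions, not Refiner's implementation: the helper `motion_trim_bounds`, the "any joint moved more than `threshold` since the previous frame" criterion, and the keep-everything fallback are all hypothetical.

```python
from typing import Sequence


def motion_trim_bounds(
    positions: Sequence[Sequence[float]],
    threshold: float = 0.001,
    pad_frames: int = 5,
) -> tuple[int, int]:
    """Return the half-open [start, end) frame range to keep.

    Hypothetical helper: a frame counts as "moving" when any joint value
    changed by more than `threshold` since the previous frame.
    """
    moving = [
        i
        for i in range(1, len(positions))
        if max(abs(a - b) for a, b in zip(positions[i], positions[i - 1])) > threshold
    ]
    if not moving:
        # Assumption: with no detected motion, keep the episode unchanged.
        return 0, len(positions)
    # Keep `pad_frames` idle frames on each side of the moving segment.
    start = max(0, moving[0] - pad_frames)
    end = min(len(positions), moving[-1] + 1 + pad_frames)
    return start, end


# One joint: 10 idle frames, 3 frames of motion, then 10 idle frames.
episode = [[0.0]] * 10 + [[0.1], [0.2], [0.3]] + [[0.3]] * 10
start, end = motion_trim_bounds(episode, threshold=0.001, pad_frames=5)
trimmed = episode[start:end]
```

In this toy episode the idle head and tail shrink to five frames each, so 13 of the 23 frames are kept. See [Robotics](docs/robotics.md) for the actual behavior of `mdr.robotics.motion_trim`.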

Need cloud GPUs? See [Launchers](docs/launchers.md) for the GPU-specific cloud options.

### Local example

Launch a local pipeline:

```python
import refiner as mdr

def add_preview(row):
    return row.update(
        preview=" ".join(row["text"].split()[:20]),
    )

(
    mdr.read_jsonl("input/*.jsonl")
    .filter(mdr.col("lang") == "en")
    .with_columns(
        text=mdr.col("text").str.strip(),
        text_len=mdr.col("text").str.len(),
    )
    .map(add_preview)
    .write_parquet("s3://my-bucket/english-cleanup/")
    .launch_local(
        name="english-cleanup",
        num_workers=2,
    )
)
```
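
For intuition, the row-level logic of the pipeline above can be sketched in plain Python without Refiner. This is only an approximation: `clean_rows` is a hypothetical helper, rows are modeled as plain dicts, and the sketch assumes `text_len` is computed from the original, unstripped `text` column. Refiner's expression semantics may differ on that last point.

```python
def clean_rows(rows):
    """Plain-Python sketch of the filter / with_columns / map steps."""
    for row in rows:
        # .filter(mdr.col("lang") == "en")
        if row.get("lang") != "en":
            continue
        # text=mdr.col("text").str.strip()
        text = row["text"].strip()
        yield {
            **row,
            "text": text,
            # Assumption: the length is measured on the input text,
            # i.e. both expressions read the original column.
            "text_len": len(row["text"]),
            # add_preview: first 20 whitespace-separated tokens.
            "preview": " ".join(text.split()[:20]),
        }


rows = [
    {"lang": "en", "text": "  hello world  "},
    {"lang": "fr", "text": "bonjour"},
]
result = list(clean_rows(rows))
```

Only the English row survives, with its text stripped and the two derived columns added.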

`pip install` gives you:

- the Python package as `refiner`
- the CLI as `macrodata`

The package also declares optional extras for specific workloads: `video`, `robotics`, `text`, and `s3`. For example, `pip install "macrodata-refiner[s3]"` adds `s3fs` for `s3://` paths.

## Batteries included

- training-data-first pipeline primitives instead of generic ETL abstractions
- multimodal processing, with robotics support today
- many built-in readers, transforms, sinks, and lifecycle/runtime machinery so you do not have to rebuild the same scaffolding in scripts
- access to any storage backend supported by `fsspec` (S3, GCS, Hugging Face, etc.)
- local execution for development and elastic cloud execution for large runs
- built-in observability through the Macrodata platform, so you can inspect how your data is changing instead of debugging blindly after the fact

## Docs

Getting started:

- [Pipeline basics](docs/pipeline-basics.md)
- [Launchers](docs/launchers.md)
- [CLI](docs/cli.md)

Core concepts:

- [Reading and writing data](docs/reading-and-writing.md)
- [Transforms](docs/transforms.md)
- [Expressions](docs/expressions.md)
- [In-process debugging](docs/in-process-debugging.md)
- [Task pipelines](docs/task-pipelines.md)

Modalities and platform:

- [Robotics](docs/robotics.md)
- [Observability](docs/observability.md)

## Community

- join the Macrodata Discord: https://discord.gg/S8kZtmBR2x
