Metadata-Version: 2.4
Name: auralith-data-pipeline
Version: 0.1.11
Summary: Production-grade data collection and processing pipeline for training LLMs and multimodal AI
Project-URL: Homepage, https://github.com/AuralithAI/Auralith-Data-Pipeline
Project-URL: Documentation, https://github.com/AuralithAI/Auralith-Data-Pipeline#readme
Project-URL: Repository, https://github.com/AuralithAI/Auralith-Data-Pipeline
Project-URL: Issues, https://github.com/AuralithAI/Auralith-Data-Pipeline/issues
Project-URL: Changelog, https://github.com/AuralithAI/Auralith-Data-Pipeline/releases
Author-email: AuralithAI <contact@auralith.ai>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: data-pipeline,data-processing,llm,machine-learning,multimodal,nlp,tokenization
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Requires-Dist: click<9.0,>=8.1.0
Requires-Dist: datasets<4.0,>=2.14.0
Requires-Dist: datasketch<2.0,>=1.6.0
Requires-Dist: ftfy<7.0,>=6.1.0
Requires-Dist: huggingface-hub<2.0,>=0.19.0
Requires-Dist: langdetect<2.0,>=1.0.9
Requires-Dist: numpy<3.0,>=1.24.0
Requires-Dist: pillow<12.0,>=10.0.0
Requires-Dist: pyyaml<7.0,>=6.0
Requires-Dist: requests<3.0,>=2.31.0
Requires-Dist: rich<14.0,>=13.0.0
Requires-Dist: safetensors<1.0,>=0.4.0
Requires-Dist: scipy<2.0,>=1.11.0
Requires-Dist: sentencepiece<1.0,>=0.1.99
Requires-Dist: soundfile<1.0,>=0.12.0
Requires-Dist: tqdm<5.0,>=4.65.0
Requires-Dist: xxhash<4.0,>=3.4.0
Provides-Extra: all
Requires-Dist: astropy<7.0,>=5.3.0; extra == 'all'
Requires-Dist: azure-storage-blob<13.0,>=12.19.0; extra == 'all'
Requires-Dist: black>=23.7.0; extra == 'all'
Requires-Dist: boto3<2.0,>=1.28.0; extra == 'all'
Requires-Dist: decord<1.0,>=0.6.0; extra == 'all'
Requires-Dist: extract-msg<1.0,>=0.48.0; extra == 'all'
Requires-Dist: faiss-cpu<2.0,>=1.7.4; extra == 'all'
Requires-Dist: google-cloud-storage<3.0,>=2.10.0; extra == 'all'
Requires-Dist: h5py<4.0,>=3.9.0; extra == 'all'
Requires-Dist: librosa<1.0,>=0.10.0; extra == 'all'
Requires-Dist: mlflow>=2.9.0; extra == 'all'
Requires-Dist: mypy>=1.5.0; extra == 'all'
Requires-Dist: opencv-python<5.0,>=4.8.0; extra == 'all'
Requires-Dist: openpyxl<4.0,>=3.1.0; extra == 'all'
Requires-Dist: pdfplumber<1.0,>=0.10.0; extra == 'all'
Requires-Dist: pre-commit>=3.4.0; extra == 'all'
Requires-Dist: psutil>=5.9.0; extra == 'all'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'all'
Requires-Dist: pytest-cov>=4.1.0; extra == 'all'
Requires-Dist: pytest>=7.4.0; extra == 'all'
Requires-Dist: python-docx<2.0,>=1.0.0; extra == 'all'
Requires-Dist: python-pptx<1.0,>=0.6.23; extra == 'all'
Requires-Dist: rarfile<5.0,>=4.1; extra == 'all'
Requires-Dist: ray[default]>=2.9.0; extra == 'all'
Requires-Dist: ruff>=0.1.0; extra == 'all'
Requires-Dist: sentence-transformers<4.0,>=2.2.0; extra == 'all'
Requires-Dist: striprtf<1.0,>=0.0.26; extra == 'all'
Requires-Dist: timm<2.0,>=0.9.0; extra == 'all'
Requires-Dist: transformers<5.0,>=4.35.0; extra == 'all'
Requires-Dist: tree-sitter-languages<2.0,>=1.10.0; extra == 'all'
Requires-Dist: tree-sitter<0.22.0,>=0.21.0; extra == 'all'
Requires-Dist: wandb>=0.16.0; extra == 'all'
Requires-Dist: warcio<2.0,>=1.7.4; extra == 'all'
Requires-Dist: zarr<3.0,>=2.16.0; extra == 'all'
Provides-Extra: cloud
Requires-Dist: azure-storage-blob<13.0,>=12.19.0; extra == 'cloud'
Requires-Dist: boto3<2.0,>=1.28.0; extra == 'cloud'
Requires-Dist: google-cloud-storage<3.0,>=2.10.0; extra == 'cloud'
Provides-Extra: code
Requires-Dist: tree-sitter-languages<2.0,>=1.10.0; extra == 'code'
Requires-Dist: tree-sitter<0.22.0,>=0.21.0; extra == 'code'
Provides-Extra: compound
Requires-Dist: astropy<7.0,>=5.3.0; extra == 'compound'
Requires-Dist: extract-msg<1.0,>=0.48.0; extra == 'compound'
Requires-Dist: h5py<4.0,>=3.9.0; extra == 'compound'
Requires-Dist: rarfile<5.0,>=4.1; extra == 'compound'
Requires-Dist: warcio<2.0,>=1.7.4; extra == 'compound'
Requires-Dist: zarr<3.0,>=2.16.0; extra == 'compound'
Provides-Extra: dev
Requires-Dist: black>=23.7.0; extra == 'dev'
Requires-Dist: mypy>=1.5.0; extra == 'dev'
Requires-Dist: openpyxl<4.0,>=3.1.0; extra == 'dev'
Requires-Dist: pdfplumber<1.0,>=0.10.0; extra == 'dev'
Requires-Dist: pre-commit>=3.4.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.21.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: python-docx<2.0,>=1.0.0; extra == 'dev'
Requires-Dist: python-pptx<1.0,>=0.6.23; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Requires-Dist: striprtf<1.0,>=0.0.26; extra == 'dev'
Provides-Extra: distributed
Requires-Dist: psutil>=5.9.0; extra == 'distributed'
Requires-Dist: ray[default]>=2.9.0; extra == 'distributed'
Provides-Extra: multimodal
Requires-Dist: decord<1.0,>=0.6.0; extra == 'multimodal'
Requires-Dist: librosa<1.0,>=0.10.0; extra == 'multimodal'
Requires-Dist: opencv-python<5.0,>=4.8.0; extra == 'multimodal'
Requires-Dist: timm<2.0,>=0.9.0; extra == 'multimodal'
Provides-Extra: pdf
Requires-Dist: openpyxl<4.0,>=3.1.0; extra == 'pdf'
Requires-Dist: pdfplumber<1.0,>=0.10.0; extra == 'pdf'
Requires-Dist: python-docx<2.0,>=1.0.0; extra == 'pdf'
Requires-Dist: python-pptx<1.0,>=0.6.23; extra == 'pdf'
Requires-Dist: striprtf<1.0,>=0.0.26; extra == 'pdf'
Provides-Extra: quality
Requires-Dist: faiss-cpu<2.0,>=1.7.4; extra == 'quality'
Requires-Dist: sentence-transformers<4.0,>=2.2.0; extra == 'quality'
Requires-Dist: transformers<5.0,>=4.35.0; extra == 'quality'
Provides-Extra: tracking
Requires-Dist: mlflow>=2.9.0; extra == 'tracking'
Requires-Dist: wandb>=0.16.0; extra == 'tracking'
Description-Content-Type: text/markdown

# Auralith Data Pipeline

**Production-grade multimodal data processing pipeline for training [RT-DLM](https://github.com/AuralithAI/RT-DLM) and large-scale AI systems.**

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://github.com/AuralithAI/Auralith-Data-Pipeline/blob/main/LICENSE)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![PyPI](https://img.shields.io/pypi/v/auralith-data-pipeline)](https://pypi.org/project/auralith-data-pipeline/)

---

## Overview

Auralith Data Pipeline ingests raw **text, images, audio, video, and code**, applies production-quality curation (perplexity filtering, LLM-as-Judge scoring, FAISS deduplication, PII scrubbing, license detection), tokenizes everything through **BPE + Vector Quantization**, and outputs **SafeTensors shards** ready for distributed model training.

### Pipeline Stages

| Stage | What happens |
|-------|-------------|
| **Ingestion** | Text (HuggingFace, Common Crawl, local), images (`.npy`/JPEG/PNG), audio (`.wav`/`.npy`), video (`.mp4`), code (TheStack) |
| **Quality Curation** | GPT-2 perplexity filter, LLM-as-Judge scoring, FAISS embedding dedup, license detection |
| **Tokenization** | BPE for text, patch + VQ for images, mel + VQ for audio, frame + VQ for video |
| **Sharding** | SafeTensors v2 schema with `input_ids`, `attention_mask`, `modality_mask`, `targets` |
| **Observability** | MLflow / W&B tracking, per-sample lineage, auto data cards |
| **Orchestration** | Argo Workflows, Helm/K8s, Ray, distributed coordinator + workers |

---

## Installation

```bash
# Core text pipeline
pip install auralith-data-pipeline

# With all extras (multimodal, cloud, distributed, dev tools)
pip install "auralith-data-pipeline[all]"

# Pick only what you need
pip install "auralith-data-pipeline[quality]"        # + perplexity filter + FAISS dedup
pip install "auralith-data-pipeline[distributed]"    # + Ray
pip install "auralith-data-pipeline[cloud,pdf]"      # + S3/GCS/Azure + PDF extraction
pip install "auralith-data-pipeline[multimodal]"     # + video/image/audio (PyTorch)
pip install "auralith-data-pipeline[tracking]"       # + MLflow + W&B
```

---

## Quick Start

### CLI

```bash
# List available datasets
auralith-pipeline list-datasets

# Process Wikipedia dataset
auralith-pipeline collect \
  --dataset wikipedia \
  --output ./data/shards \
  --max-samples 100000 \
  --preset production
```

### End-to-End Workflow

```bash
# 1. Train tokenizers (BPE + VQ codebooks)
auralith-pipeline train-tokenizer all \
  --corpus  data/corpus/ \
  --images  data/images/ \
  --audio   data/audio/ \
  --videos  data/videos/ \
  --output  tokenizers/ \
  --vocab-size 32000 \
  --codebook-size 1024

# 2. Process raw data into SafeTensors shards
auralith-pipeline process \
  --input  data/raw/ \
  --output shards/ \
  --tokenizers tokenizers/ \
  --max-seq-len 4096 \
  --shard-size 10000

# 3. Upload to cloud storage or HuggingFace Hub
auralith-pipeline upload --source shards/ --dest s3://my-bucket/training-data/
```

### Python API

```python
from auralith_pipeline import Pipeline, PipelineConfig
from auralith_pipeline.sources import create_source

config = PipelineConfig.from_preset("production")
pipeline = Pipeline(config)
source = create_source("wikipedia", streaming=True, max_samples=1_000_000)
pipeline.add_source(source)

stats = pipeline.run()
print(stats.summary())
```

---

## Key Features

### Data Processing
- Multi-source ingestion (HuggingFace, Common Crawl, local files, video)
- Weighted round-robin interleaving across multiple sources
- MinHash + FAISS embedding deduplication
- Quality filtering (length, language, perplexity, LLM-as-Judge)
- PII removal (multi-jurisdiction, 15+ countries)
- License compliance scanning for code data
- Document extraction (PDF, DOCX, HTML, Markdown)
- SafeTensors sharding with Zstd compression and SHA-256 checksums
- Streaming checkpointing with seeded reproducibility
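The MinHash stage hashes shingled text many times and keeps the minimum per hash function, so two documents' signatures agree in roughly their Jaccard-similarity fraction of slots. A stdlib-only sketch of the idea (the pipeline itself uses `datasketch`; the shingle size and hash count here are illustrative):

```python
import hashlib

def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """Build a MinHash signature from word 3-shingles using salted hashes."""
    words = text.lower().split()
    shingles = {" ".join(words[i:i + 3]) for i in range(max(1, len(words) - 2))}
    sig = []
    for seed in range(num_perm):
        # One salted hash function per signature slot; keep the minimum.
        sig.append(min(
            int.from_bytes(hashlib.blake2b(
                f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in shingles
        ))
    return sig

def jaccard_estimate(a: list[int], b: list[int]) -> float:
    """Fraction of matching slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the sleepy dog"
sim = jaccard_estimate(minhash_signature(doc_a), minhash_signature(doc_b))
# Near-duplicates score high; unrelated documents score near zero.
```

In production the signatures go into an LSH index so candidate pairs are found without comparing every document to every other.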

### Tokenization
- Custom BPE tokenizer with 16 special tokens and byte-level fallback
- Vector quantization for images, audio, and video
- Multimodal token fusion with `encode_with_mask()`
- Configurable vocab size (32k-128k)
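Vector quantization turns each continuous patch or frame embedding into a discrete token: the index of its nearest codebook vector. A conceptual sketch with a random codebook (the real codebooks are trained via `train-tokenizer`, and the dimensions here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 64))   # codebook-size x embedding-dim

def vq_tokenize(patches: np.ndarray) -> np.ndarray:
    """Return the nearest-codebook index for each patch embedding."""
    # Squared L2 distance from every patch to every codebook entry.
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)          # one discrete token per patch

patches = rng.normal(size=(16, 64))      # e.g. 16 image patches
tokens = vq_tokenize(patches)           # shape (16,), values in [0, 1024)
```

The resulting integer tokens can then be interleaved with BPE text tokens in a single `input_ids` sequence.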

### Distributed Processing
- **Embedded mode** — in-process coordinator + workers (no Redis needed)
- **External mode** — multi-machine with Redis state store
- Worker failure detection + automatic task requeue
- Linear scaling up to 64+ workers
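Embedded mode amounts to a shared task queue drained by in-process workers. A stdlib-only sketch of that pattern (the actual coordinator additionally handles heartbeats, failure detection, and requeueing):

```python
import queue
import threading

tasks = queue.Queue()
results = []
lock = threading.Lock()

def worker() -> None:
    """Pull tasks until the queue drains; append processed results."""
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return
        with lock:
            results.append(task * 2)     # stand-in for real processing
        tasks.task_done()

for i in range(100):
    tasks.put(i)
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# All 100 tasks processed exactly once across 4 workers.
```

External mode replaces the in-memory queue with a Redis-backed state store so workers on other machines can join.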

### Observability & Compliance
- MLflow / Weights & Biases experiment tracking
- Per-sample lineage (source to shard provenance)
- Auto-generated data cards (HuggingFace-compatible)
- Full audit logging (JSONL) for accept/reject decisions
- Credential and secret sanitization

---

## SafeTensors Schema (v2)

Every output shard is directly compatible with RT-DLM training.

| Tensor | Dtype | Shape | Description |
|--------|-------|-------|-------------|
| `input_ids` | int32 | (batch, seq_len) | All tokens (text + image + audio + video + code) |
| `attention_mask` | uint8 | (batch, seq_len) | 1 = real token, 0 = padding |
| `modality_mask` | uint8 | (batch, seq_len) | 0=text, 1=image, 2=audio, 3=video, 4=code |
| `targets` | int32 | (batch, seq_len) | `input_ids` shifted by one position (next-token targets for causal LM) |
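The `targets` tensor pairs each position with the token the model should predict next. A minimal sketch of that relationship (whether the final slot is filled with the pad ID, as here, is an assumption):

```python
PAD_ID = 0

def make_targets(input_ids: list[int]) -> list[int]:
    """Shift input_ids by one position so targets[i] is the next token."""
    return input_ids[1:] + [PAD_ID]

input_ids = [2, 101, 102, 103, 3]        # <BOS> ... <EOS>
targets = make_targets(input_ids)        # [101, 102, 103, 3, 0]
```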

### Special Tokens

| ID | Token | Purpose |
|----|-------|---------|
| 0 | `<PAD>` | Padding |
| 1 | `<UNK>` | Unknown |
| 2 | `<BOS>` | Beginning of sequence |
| 3 | `<EOS>` | End of sequence |
| 4-5 | `<IMG>` / `<IMG_END>` | Image region |
| 6-7 | `<AUDIO>` / `<AUDIO_END>` | Audio region |
| 8-9 | `<VIDEO>` / `<VIDEO_END>` | Video region |
| 10 | `<FUSE>` | Cross-modal fusion |
| 11 | `<SEP>` | Separator |
| 12 | `<MASK>` | Masked LM |
| 13-14 | `<CODE>` / `<CODE_END>` | Code block |
| 15 | `<THINK>` | Chain-of-thought |
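Using the IDs above, a mixed text-and-image sequence brackets the image region with `<IMG>`/`<IMG_END>` while `modality_mask` tags each position. A hand-rolled sketch of how the two arrays line up (the token values inside each region, and the choice to tag the markers themselves as image-modality, are illustrative assumptions):

```python
BOS, EOS, IMG, IMG_END = 2, 3, 4, 5      # IDs from the table above
TEXT, IMAGE = 0, 1                       # modality_mask codes

text_tokens = [101, 102]                 # illustrative BPE token IDs
image_tokens = [700, 701, 702]           # illustrative VQ codes

input_ids = [BOS] + text_tokens + [IMG] + image_tokens + [IMG_END] + [EOS]
modality_mask = (
    [TEXT] * (1 + len(text_tokens))          # BOS + text
    + [IMAGE] * (len(image_tokens) + 2)      # <IMG> ... <IMG_END>
    + [TEXT]                                 # EOS
)
assert len(input_ids) == len(modality_mask)
```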

---

## Available Datasets

| Dataset | Size | Description |
|---------|------|-------------|
| wikipedia | 20 GB | English Wikipedia |
| c4 | 750 GB | Cleaned Common Crawl |
| redpajama | 1.2 TB | Open reproduction of the LLaMA training data |
| openwebtext | 40 GB | Web text from Reddit-shared links |
| bookcorpus | 5 GB | 11k books |
| the_stack | 3 TB | Source code (deduplicated) |

---

## Performance

| Operation | Speed |
|-----------|-------|
| Text preprocessing | 10k samples/sec |
| MinHash deduplication | 5k samples/sec |
| FAISS dedup | 3k samples/sec |
| BPE encoding | < 1 ms/sample |
| SafeTensors writing | 50 MB/s |
| Image tokenization | 50 ms/image |
| Video tokenization | 200 ms/video |

---

## Configuration

```yaml
# configs/production.yaml
pipeline:
  name: production-pipeline
  output_dir: ./data/shards
  deduplicate: true
  quality_filter: true
  remove_pii: true
  seed: 42
  checkpoint_every: 10000

advanced_quality:
  enabled: true
  perplexity_filter: true
  max_perplexity: 1500.0
```
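Since PyYAML is a core dependency, a config like the one above parses with `yaml.safe_load` into plain dicts with native types; a quick sanity check (the inline string stands in for reading `configs/production.yaml` from disk):

```python
import yaml

config_text = """
pipeline:
  name: production-pipeline
  seed: 42
  checkpoint_every: 10000
advanced_quality:
  enabled: true
  max_perplexity: 1500.0
"""
cfg = yaml.safe_load(config_text)
# Nested keys come back as dicts; scalars keep their YAML types.
assert cfg["pipeline"]["seed"] == 42
assert cfg["advanced_quality"]["max_perplexity"] == 1500.0
```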

---

## Documentation

For the full documentation, architecture diagrams, distributed processing guide, and contributor guide, visit the [GitHub repository](https://github.com/AuralithAI/Auralith-Data-Pipeline).

- [Architecture](https://github.com/AuralithAI/Auralith-Data-Pipeline/blob/main/docs/ARCHITECTURE.md)
- [Contributing](https://github.com/AuralithAI/Auralith-Data-Pipeline/blob/main/docs/CONTRIBUTING.md)
- [Distributed Processing](https://github.com/AuralithAI/Auralith-Data-Pipeline/blob/main/docs/DISTRIBUTED_PROCESSING.md)
- [Changelog](https://github.com/AuralithAI/Auralith-Data-Pipeline/releases)

---

## License

Apache License 2.0 — see [LICENSE](https://github.com/AuralithAI/Auralith-Data-Pipeline/blob/main/LICENSE).

---

Built by [AuralithAI](https://github.com/AuralithAI) for [RT-DLM](https://github.com/AuralithAI/RT-DLM).
