Metadata-Version: 2.4
Name: xfmr-zem
Version: 0.3.5
Summary: Zem: Unified Data Pipeline Framework (ZenML + NeMo Curator + DataJuicer) for multi-domain processing
Project-URL: Homepage, https://github.com/OAI-Labs/xfmr-zem
Project-URL: Repository, https://github.com/OAI-Labs/xfmr-zem
Project-URL: Issues, https://github.com/OAI-Labs/xfmr-zem/issues
Project-URL: Changelog, https://github.com/OAI-Labs/xfmr-zem/blob/main/CHANGELOG.md
Author-email: Khai Hoang <khaihq@vbiacademy.edu.vn>
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: data-juicer,data-pipeline,mlops,nemo-curator,xfmr-zem,zenml
Requires-Python: <3.13,>=3.10
Requires-Dist: click>=8.0.0
Requires-Dist: fastmcp>=0.1.0
Requires-Dist: loguru>=0.7.0
Requires-Dist: mcp>=0.1.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: pyarrow>=15.0.0
Requires-Dist: pydantic>=2.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.0.0
Requires-Dist: zenml[local,server]>=0.75.0
Provides-Extra: all
Requires-Dist: dask-cuda>=24.0.0; extra == 'all'
Requires-Dist: datasketch>=1.6.0; extra == 'all'
Requires-Dist: faiss-cpu>=1.7.0; extra == 'all'
Requires-Dist: fastapi; extra == 'all'
Requires-Dist: ftfy>=6.3.1; extra == 'all'
Requires-Dist: librosa; extra == 'all'
Requires-Dist: nemo-curator>=0.5.0; extra == 'all'
Requires-Dist: openai-whisper; extra == 'all'
Requires-Dist: opik>=1.10.9; extra == 'all'
Requires-Dist: py-data-juicer>=1.0.0; extra == 'all'
Requires-Dist: python-magic>=0.4.27; extra == 'all'
Requires-Dist: python-multipart; extra == 'all'
Requires-Dist: sentence-transformers>=2.2.0; extra == 'all'
Requires-Dist: soundfile; extra == 'all'
Requires-Dist: underthesea>=6.8.0; extra == 'all'
Requires-Dist: unstructured[all-docs]>=0.16.0; extra == 'all'
Requires-Dist: uvicorn; extra == 'all'
Provides-Extra: argilla
Requires-Dist: argilla>=2.0.0; extra == 'argilla'
Requires-Dist: krippendorff>=0.6.0; extra == 'argilla'
Requires-Dist: scikit-learn>=1.3.0; extra == 'argilla'
Provides-Extra: audio
Requires-Dist: lhotse>=1.24.0; extra == 'audio'
Requires-Dist: sentencepiece>=0.1.99; extra == 'audio'
Provides-Extra: corrector
Requires-Dist: numpy<2.0.0; extra == 'corrector'
Requires-Dist: pillow>=12.1.0; extra == 'corrector'
Requires-Dist: pymupdf>=1.26.7; extra == 'corrector'
Requires-Dist: torch>=2.5.0; extra == 'corrector'
Requires-Dist: torchvision>=0.20.1; extra == 'corrector'
Requires-Dist: transformers>=4.55.2; extra == 'corrector'
Provides-Extra: datajuicer
Requires-Dist: py-data-juicer>=1.0.0; extra == 'datajuicer'
Provides-Extra: deduplication
Requires-Dist: datasketch>=1.6.0; extra == 'deduplication'
Requires-Dist: faiss-cpu>=1.7.0; extra == 'deduplication'
Requires-Dist: sentence-transformers>=2.2.0; extra == 'deduplication'
Requires-Dist: underthesea>=6.8.0; extra == 'deduplication'
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: document
Requires-Dist: python-magic>=0.4.27; extra == 'document'
Requires-Dist: unstructured[all-docs]>=0.16.0; extra == 'document'
Provides-Extra: evaluator
Requires-Dist: litellm>=1.0.0; extra == 'evaluator'
Requires-Dist: opik>=1.10.9; extra == 'evaluator'
Provides-Extra: evaluator-local
Requires-Dist: accelerate>=0.25.0; extra == 'evaluator-local'
Requires-Dist: litellm>=1.0.0; extra == 'evaluator-local'
Requires-Dist: opik>=1.10.9; extra == 'evaluator-local'
Requires-Dist: torch>=2.1.0; extra == 'evaluator-local'
Requires-Dist: transformers>=4.40.0; extra == 'evaluator-local'
Provides-Extra: nemo
Requires-Dist: dask-cuda>=24.0.0; extra == 'nemo'
Requires-Dist: ftfy>=6.3.1; extra == 'nemo'
Requires-Dist: nemo-curator>=0.5.0; extra == 'nemo'
Provides-Extra: nlp
Requires-Dist: dask-cuda>=24.0.0; extra == 'nlp'
Requires-Dist: datasketch>=1.6.0; extra == 'nlp'
Requires-Dist: faiss-cpu>=1.7.0; extra == 'nlp'
Requires-Dist: ftfy>=6.3.1; extra == 'nlp'
Requires-Dist: nemo-curator>=0.5.0; extra == 'nlp'
Requires-Dist: py-data-juicer>=1.0.0; extra == 'nlp'
Requires-Dist: sentence-transformers>=2.2.0; extra == 'nlp'
Requires-Dist: underthesea>=6.8.0; extra == 'nlp'
Provides-Extra: ocr
Requires-Dist: cachetools>=5.0.0; extra == 'ocr'
Requires-Dist: einops; extra == 'ocr'
Requires-Dist: landingai-ade>=1.5.0; extra == 'ocr'
Requires-Dist: numpy<2.0.0; extra == 'ocr'
Requires-Dist: onnxruntime>=1.16.0; extra == 'ocr'
Requires-Dist: opencv-python>=4.8.0; extra == 'ocr'
Requires-Dist: paddleocr>=2.7.0; extra == 'ocr'
Requires-Dist: paddlepaddle>=2.6.0; extra == 'ocr'
Requires-Dist: pdfplumber>=0.11.0; extra == 'ocr'
Requires-Dist: pillow>=10.0.0; extra == 'ocr'
Requires-Dist: pyclipper; extra == 'ocr'
Requires-Dist: pymupdf>=1.23.0; extra == 'ocr'
Requires-Dist: pytesseract>=0.3.10; extra == 'ocr'
Requires-Dist: ruamel-yaml>=0.17.0; extra == 'ocr'
Requires-Dist: shapely; extra == 'ocr'
Requires-Dist: torch>=2.5.1; extra == 'ocr'
Requires-Dist: torchvision>=0.20.1; extra == 'ocr'
Requires-Dist: transformers>=4.40.0; extra == 'ocr'
Provides-Extra: ui
Requires-Dist: fastapi; extra == 'ui'
Requires-Dist: python-multipart; extra == 'ui'
Requires-Dist: uvicorn; extra == 'ui'
Provides-Extra: vn-asr
Requires-Dist: pyvi>=0.1.1; extra == 'vn-asr'
Requires-Dist: transformers>=4.40.0; extra == 'vn-asr'
Provides-Extra: voice
Requires-Dist: accelerate>=0.25.0; extra == 'voice'
Requires-Dist: einops>=0.7.0; extra == 'voice'
Requires-Dist: h5py>=3.10.0; extra == 'voice'
Requires-Dist: librosa>=0.10.0; extra == 'voice'
Requires-Dist: matplotlib>=3.8.0; extra == 'voice'
Requires-Dist: noisereduce>=3.0.0; extra == 'voice'
Requires-Dist: onnxruntime-gpu>=1.16.0; extra == 'voice'
Requires-Dist: openpyxl>=3.1.0; extra == 'voice'
Requires-Dist: pandas>=2.0.0; extra == 'voice'
Requires-Dist: pesq>=0.0.4; extra == 'voice'
Requires-Dist: pyannote-audio>=3.1.0; extra == 'voice'
Requires-Dist: pydantic-settings>=2.0.0; extra == 'voice'
Requires-Dist: pydub>=0.25.0; extra == 'voice'
Requires-Dist: pyloudnorm>=0.1.0; extra == 'voice'
Requires-Dist: pystoi>=0.3.3; extra == 'voice'
Requires-Dist: scipy>=1.11.0; extra == 'voice'
Requires-Dist: soundfile>=0.12.0; extra == 'voice'
Requires-Dist: tabulate>=0.9.0; extra == 'voice'
Requires-Dist: thop>=0.1.1; extra == 'voice'
Requires-Dist: toml>=0.10.0; extra == 'voice'
Requires-Dist: torchaudio>=2.1.0; extra == 'voice'
Requires-Dist: torchinfo>=1.8.0; extra == 'voice'
Description-Content-Type: text/markdown

# 🚀 Zem

[![Version](https://img.shields.io/badge/version-0.3.5-blue.svg)](https://github.com/OAI-Labs/xfmr-zem/releases)
[![License](https://img.shields.io/badge/license-Apache--2.0-green.svg)](LICENSE)
[![ZenML](https://img.shields.io/badge/Orchestration-ZenML-blueviolet)](https://zenml.io)
[![MCP](https://img.shields.io/badge/Interface-MCP-orange)](https://modelcontextprotocol.io)

**Zem** is a high-performance, unified data pipeline framework designed for the modern AI era. It seamlessly bridges **ZenML's** production-grade orchestration with specialized curation powerhouses like **NVIDIA NeMo Curator** and **Alibaba Data-Juicer** using the **Model Context Protocol (MCP)**.

---

## ✨ Key Features

- 🏗️ **Config-Driven Power**: Define complex, production-ready pipelines, each in a single YAML file.
- ⚡ **True Parallel DAGs**: Execute independent processing branches concurrently using a custom `ParallelLocalOrchestrator`.
- 🧠 **Frontier LLM Integration**: Smart data masking, classification, and summarization via **Ollama** or **OpenAI**.
- 📊 **Deep Observability**: Real-time profiling, per-tool performance metrics, and a beautiful integrated dashboard.
- 🔄 **Adaptive Caching**: Fine-grained, step-level cache control to optimize your development cycles.
- 🔌 **Cloud Native**: Native support for S3, GCS, and Parquet with seamless export to **Hugging Face Hub** and **Vector DBs**.
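
The config-driven model above can be illustrated with a minimal pipeline file. The schema below (the `steps`, `tool`, `depends_on`, and `cache` keys) is a hypothetical sketch for illustration only, not Zem's exact format; see the standard example under `tests/manual/` for the authoritative schema.

```yaml
# pipeline.yaml — illustrative sketch; key names are assumptions, not Zem's schema
name: text_curation_demo
steps:
  - name: load
    tool: reader.parquet            # read raw data from cloud storage
    params:
      path: s3://my-bucket/raw/
  - name: dedup
    tool: nemo.fuzzy_dedup          # dispatched to the NeMo Curator MCP server
    depends_on: [load]
    cache: true                     # step-level cache control
  - name: quality_filter
    tool: datajuicer.perplexity_filter
    depends_on: [load]              # independent branch — runs in parallel with dedup
  - name: export
    tool: sink.huggingface
    depends_on: [dedup, quality_filter]
```

Because `dedup` and `quality_filter` both depend only on `load`, the parallel orchestrator can execute them concurrently.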

---

## 🏗️ Architecture

```mermaid
graph TD
    YAML["📄 pipeline.yaml"] --> Client["🛠️ Zem CLI / Client"]
    Client --> ZenML["🌀 ZenML Orchestrator"]
    ZenML --> Parallel["⚡ Parallel Local Orchestrator"]
    Parallel --> MCP_Bridge["🔗 MCP Bridge"]
    
    subgraph "Specialized Servers (MCP)"
        MCP_Bridge --> Nemo["🦁 NeMo Curator (GPU)"]
        MCP_Bridge --> DJ["🧃 Data-Juicer"]
        MCP_Bridge --> LLM["🤖 Frontier LLMs"]
        MCP_Bridge --> Prof["📈 Profiler"]
    end
    
    subgraph "Storage & Sinks"
        Nemo --> S3["☁️ Cloud / Parquet"]
        DJ --> HF["🤗 Hugging Face"]
        LLM --> VDB["🌐 Vector DB"]
    end
```
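
The "Parallel Local Orchestrator" box in the diagram schedules independent branches of the pipeline DAG concurrently. The sketch below illustrates the scheduling idea only; Zem's actual `ParallelLocalOrchestrator` is more involved, and `run_dag`/`execute` are hypothetical names, not part of Zem's API.

```python
from concurrent.futures import ThreadPoolExecutor

def run_dag(steps: dict[str, list[str]], execute) -> list[str]:
    """Run a DAG in parallel waves. `steps` maps step name -> dependency names.

    Illustrative sketch of level-by-level DAG scheduling, not Zem's
    actual orchestrator implementation.
    """
    done: set[str] = set()
    order: list[str] = []
    with ThreadPoolExecutor() as pool:
        while len(done) < len(steps):
            # Every step whose dependencies are satisfied forms one parallel wave.
            ready = [s for s, deps in steps.items()
                     if s not in done and all(d in done for d in deps)]
            if not ready:
                raise ValueError("cycle detected in pipeline DAG")
            list(pool.map(execute, ready))  # run the whole wave concurrently
            done.update(ready)
            order.extend(sorted(ready))     # sorted for a deterministic trace
    return order
```

Each wave blocks until all its steps finish, so dependents never start before their inputs exist.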

---

## 🚀 Quick Start

### 1. Installation
```bash
git clone https://github.com/OAI-Labs/xfmr-zem.git
cd xfmr-zem
uv sync
```

### 2. Initialize a New Project
```bash
# Bootstrap a standalone project with a sample agent
uv run zem init my_project
cd my_project
```

### 3. Run Your First Pipeline
```bash
uv run zem run pipeline.yaml
```

### 4. Visualize & Inspect
```bash
# Open ZenML Dashboard
uv run zem dashboard

# Preview results with sampling
uv run zem preview <artifact_id> --sample --limit 5
```

---

## 📦 Data Versioning (DVC)

Zem ships with built-in DVC integration for versioning large datasets, using MinIO (S3-compatible) as remote storage.

### Configure credentials

```bash
export DVC_MINIO_ENDPOINT=
export DVC_MINIO_BUCKET=
export DVC_MINIO_ACCESS_KEY=
export DVC_MINIO_SECRET_KEY=
```

### Workflow

```bash
# Initialize a project with DVC + MinIO
uv run zem init my_project --dvc-remote minio

# Track dataset
cd my_project
uv run zem data add data/dataset.parquet -m "add training data v1"

# Push to / pull from the remote
uv run zem data push
uv run zem data pull

# Check status & lineage
uv run zem data status
uv run zem data lineage data/dataset.parquet
```

DVC metadata (hash, git commit) is automatically logged to the ZenML artifact when a pipeline runs, so the exact data version behind each experiment can always be traced.

---

## 📖 Guided Documentation

| Topic | Description | Link |
|-------|-------------|------|
| **Core Concepts** | Understand the Zem architecture and MCP model. | [AGENTS.md](AGENTS.md) |
| **Pipeline YAML** | How to write and validate your pipeline configs. | [Standard Example](tests/manual/standard_data_pipeline.yaml) |
| **Advanced Parallelism** | Setup true local concurrency. | [Parallel Guide](tests/manual/parallel_test.yaml) |
| **LLM & Sinks** | Connecting to external AI stacks. | [Phase 4 Demo](tests/manual/phase4_test.yaml) |

---

## 🤝 Contributing

We welcome contributions! Whether it's a new MCP server, a performance fix, or a typo in the docs, feel free to open a Pull Request. 

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

---

## ⚖️ License

Distributed under the **Apache-2.0 License**. See `LICENSE` for more information.

---

<p align="center">
  Built with ❤️ by the <b>OAI-Labs</b> Team
</p>
