Metadata-Version: 2.4
Name: sdg_hub
Version: 0.1.0a4
Summary: Synthetic Data Generation
Author-email: Red Hat AI Innovation <abhandwa@redhat.com>
License: Apache-2.0
Project-URL: homepage, https://ai-innovation.team/
Project-URL: source, https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub
Project-URL: issues, https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: License :: OSI Approved :: Apache Software License
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX :: Linux
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: Implementation :: CPython
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click<9.0.0,>=8.1.7
Requires-Dist: datasets<4.0.0,>=2.18.0
Requires-Dist: httpx<1.0.0,>=0.25.0
Requires-Dist: jinja2
Requires-Dist: langchain-text-splitters
Requires-Dist: openai<2.0.0,>=1.13.3
Requires-Dist: rich
Requires-Dist: tenacity!=8.4.0,>=8.3.0
Requires-Dist: tqdm<5.0.0,>=4.66.2
Dynamic: license-file

# sdg_hub: Synthetic Data Generation Toolkit for LLMs

![Build](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/actions/workflows/pypi.yaml/badge.svg?branch=main)
![Release](https://img.shields.io/github/v/release/Red-Hat-AI-Innovation-Team/sdg_hub)
![License](https://img.shields.io/github/license/Red-Hat-AI-Innovation-Team/sdg_hub)
[![Tests](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/actions/workflows/test.yml/badge.svg)](https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub/actions/workflows/test.yml)
[![codecov](https://codecov.io/gh/Red-Hat-AI-Innovation-Team/sdg_hub/graph/badge.svg?token=SP75BCXWO2)](https://codecov.io/gh/Red-Hat-AI-Innovation-Team/sdg_hub)

sdg_hub is a modular, scalable, and efficient solution for creating synthetic data generation workflows in a "no-code" manner. At its core, this framework is designed to simplify data creation for LLMs, allowing users to chain computational units and build powerful pipelines for generating data and processing tasks.


## Installation

Latest release from PyPI

```sh
pip install sdg-hub
```

Latest main branch
```sh
pip install git+https://github.com/Red-Hat-AI-Innovation-Team/sdg_hub.git
```

## Core Design Principles

The framework is built around the following principles:

1. **Modular Design**: Highly composable blocks form the building units of the framework, allowing users to build workflows effortlessly.
2. **No-Code Workflow Creation**: Specify workflows using simple YAML configuration files.
3. **Scalability and Performance**: Optimized for handling large-scale workflows with millions of records.

---

## Framework Architecture

![overview](assets/imgs/overview.png)

### Blocks: The Fundamental Unit

At the heart of the framework is the **Block**. Each block is a self-contained computational unit that performs specific tasks, such as:

- Making LLM calls
- Performing data transformations
- Applying filters

Blocks are designed to be:
- **Modular**: Reusable across multiple pipelines.
- **Composable**: Easily chained together to create workflows.

These blocks are implemented in the [src/sdg_hub/blocks](src/sdg_hub/blocks) directory.

### Prompts

Prompts are at the core of how LLMs are instructed within SDG Hub. Each `LLMBlock` is associated with a prompt configuration file written in YAML, allowing users to define the exact behavior of the language model — including system instructions, generation principles, and output formatting.

#### Prompt YAML Structure

A typical prompt YAML file looks like this:

```yaml
system: You are a helpful assistant that can summarize text.
introduction: Give me a short summary of the text.
principles:
  - Do not add any new information.
  - Do not miss any key points from the provided text.
examples:
  - input: Red Hat announced the acquisition of Neural Magic...
    output: Red Hat acquired Neural Magic to enhance its AI optimization capabilities.
generation: Here is the document to summarize: {{document}}
```

#### Key Fields
* `system`: A high-level instruction that sets the persona or behavior of the model.
* `introduction`: Optional introduction to set context for the user.
* `principles`: A list of guiding constraints or rules the model should follow during generation.
* `examples`: Few-shot examples (optional) to guide output format or tone.
* `generation`: The actual template used to generate the model input. This supports variable injection using {{variable_name}}.

### YAML-Based Workflow: The Flow

The YAML configuration file, known as the **Flow**, is central to defining data generation workflows in the SDG Framework. A Flow describes how blocks and pipelines are orchestrated to process and generate data efficiently. By leveraging YAML, users can create highly customizable and modular workflows without writing any code.

#### Key Features of a Flow

1. **Modular Design**:
   - Flows are composed of blocks, which can be chained together into pipelines.
   - Each block performs a specific task, such as generating, filtering, or transforming data.

2. **Reusability**:
   - Blocks and configurations defined in a Flow can be reused across different workflows.
   - YAML makes it easy to tweak or extend workflows without significant changes.

3. **Ease of Configuration**:
   - Users can specify block types, configurations, and data processing details in a simple and intuitive manner.



## Hello World Example

Let’s say you have a document and want to generate a concise summary using an LLM. Here’s how simple that is in sdg\_hub:

```yaml
- block_type: LLMBlock
  block_config:
    block_name: gen_summary
    config_path: prompts/summarization.yaml
    model_id: meta-llama/Llama-3.3-70B-Instruct
    output_cols:
      - summary
  gen_kwargs:
    max_tokens: 512
```

Want to go further? Add another block to extract keywords from the summary:

```yaml
- block_type: LLMBlock
  block_config:
    block_name: gen_keywords
    config_path: prompts/keywords.yaml
    model_id: meta-llama/Llama-3.3-70B-Instruct
    output_cols:
      - keywords
  gen_kwargs:
    max_tokens: 64
```

Just like that, you’ve built a multi-step LLM workflow using nothing but YAML.

## Available Blocks

The SDG Framework provides a rich set of blocks for different data processing needs. Here's a comprehensive overview of the available blocks and when to use them:

### Base Block Class

The framework is built around the abstract `Block` class, which serves as the foundation for all other blocks:

- **Purpose**: Provides core functionality and interface for all blocks
- **Key Features**:
  - Template validation for input data
  - Configuration loading from YAML files
  - Standardized block initialization
  - Common interface for all blocks
- **Core Methods**:
  - `_validate`: Validates input data against templates
  - `_load_config`: Loads configuration from YAML files
  - `generate`: Abstract method for block execution

All blocks inherit from this base class, ensuring consistent behavior and interface across the framework.

### LLM Blocks

1. **LLMBlock**
   - **Purpose**: Generate text using language models
   - **Use Cases**: 
     - Generating questions, responses, or any text content
     - Single-prompt generation with structured outputs
   - **Features**: 
     - Supports batched processing
     - Configurable output parsing
     - Template-based prompt generation

2. **ConditionalLLMBlock**
   - **Purpose**: Generate text based on conditional logic
   - **Use Cases**:
     - Different prompt templates based on input conditions
     - Multi-path text generation workflows
   - **Features**:
     - Multiple config paths for different conditions
     - Dynamic prompt selection

3. **LLMLogProbBlock**
   - **Purpose**: Generate text with log probabilities
   - **Use Cases**:
     - Analyzing model confidence
     - Quality scoring of generations
   - **Features**:
     - Returns top-k log probabilities
     - JSON-formatted output

4. **LLMMessagesBlock**
   - **Purpose**: Chat-based text generation
   - **Use Cases**:
     - Multi-turn conversations
     - Chat-based interactions
   - **Features**:
     - Supports message history
     - Chat completion API

### Filtering and Processing Blocks

1. **FilterByValueBlock**
   - **Purpose**: Filter datasets based on column values
   - **Use Cases**:
     - Removing unwanted samples
     - Data cleaning
     - Quality filtering
   - **Features**:
     - Multiple filter operations
     - Type conversion support
     - Parallel processing

2. **IterBlock**
   - **Purpose**: Iterative processing of data
   - **Use Cases**:
     - Multiple generation attempts
     - Iterative refinement
   - **Features**:
     - Configurable number of iterations
     - Nested block execution



### Utility Blocks

1. **SamplePopulatorBlock**
   - **Purpose**: Populate samples with configuration data
   - **Use Cases**:
     - Adding metadata
     - Configuration injection

2. **SelectorBlock**
   - **Purpose**: Select data based on mapping
   - **Use Cases**:
     - Conditional data selection
     - Data routing

3. **CombineColumnsBlock**
   - **Purpose**: Merge multiple columns
   - **Use Cases**:
     - Text concatenation
     - Feature combination

4. **FlattenColumnsBlock**
   - **Purpose**: Convert wide to long format
   - **Use Cases**:
     - Data reshaping
     - Variable-value pairs

5. **DuplicateColumns**
   - **Purpose**: Create column copies
   - **Use Cases**:
     - Data preservation
     - Multiple processing paths

6. **RenameColumns**
   - **Purpose**: Rename dataset columns
   - **Use Cases**:
     - Standardizing column names
     - Data reorganization

7. **SetToMajorityValue**
   - **Purpose**: Replace values with majority
   - **Use Cases**:
     - Data normalization
     - Outlier handling

---
### Dataflow and Storage

- **Data Representation**: Dataflow between blocks and pipelines is handled using **Hugging Face Datasets**, which are based on Arrow tables. This provides:
  - Native parallelization capabilities (e.g., maps, filters).
  - Support for efficient data transformations.

- **Data Checkpoints**: Intermediate caches of generated data. Checkpoints allow users to:
  - Resume workflows from the last successful state if interrupted.
  - Improve reliability for long-running workflows.


## Examples

For sample use cases and implementation examples, please refer to the [examples](examples) directory. This directory contains various examples demonstrating different workflows and use cases of the SDG Framework.
