Metadata-Version: 2.4
Name: omnigen-usf
Version: 0.0.1.post2
Summary: Enterprise-Grade Synthetic Data Generation
Author-email: Ultrasafe AI <support@us.inc>
Maintainer-email: Ultrasafe AI <support@us.inc>
License: MIT
Project-URL: Homepage, https://us.inc
Project-URL: Documentation, https://github.com/ultrasafe-ai/omnigen
Project-URL: Repository, https://github.com/ultrasafe-ai/omnigen
Keywords: synthetic-data,data-generation,conversational-ai,llm,pipeline,conversation-extension,machine-learning,ai
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai>=1.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: pytz>=2023.3
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: typer>=0.9.0
Requires-Dist: pymongo>=4.0.0
Provides-Extra: hf
Requires-Dist: datasets>=2.14.0; extra == "hf"
Requires-Dist: huggingface_hub>=0.16.0; extra == "hf"
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.0.280; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Dynamic: license-file

# OmniGen 🚀

**Generate synthetic data at scale using an enterprise-ready framework with fully customizable configuration, security, and ease of use**

Built by [Ultrasafe AI](https://us.inc) for production environments.

---

## What is OmniGen?

**OmniGen** is an enterprise-grade framework for generating synthetic datasets at scale—from scratch or from base data. Generate **trillions of tokens** and **billions of samples** across multiple modalities:

### 🎯 Data Types Supported
- 💬 **Conversational Data** - Single-turn to multi-turn dialogues
- 🤖 **Agentic Datasets** - Tool use, function calling, multi-step reasoning
- 🎨 **Multimodal Datasets** - Text, images, audio, video combinations
- 🖼️ **Images** - Synthetic image generation and editing
- 🎵 **Audio** - Speech, music, sound effects
- 🎬 **Video** - Synthetic video sequences

### 🎓 Use Cases
- **Fine-Tuning** - Instruction following, task-specific models
- **Supervised Fine-Tuning (SFT)** - High-quality labeled datasets
- **Offline Reinforcement Learning** - Preference datasets with rewards
- **Online Reinforcement Learning** - Ground truth with reward checking scripts
- **Pre-Training** - Large-scale corpus generation
- **Machine Learning** - Training data for any ML task

### 🏗️ Why OmniGen?
- ✅ **Enterprise-Ready** - Built for production at scale
- ✅ **Fully Customizable** - Configure every aspect of generation
- ✅ **Secure** - Complete isolation, no data mixing
- ✅ **Easy** - Simple API, clear examples
- ✅ **Modular** - Independent pipelines for different data types

---

## 🚀 Currently Available Pipeline

### **conversation_extension** - Extend Single-Turn to Multi-Turn Conversations

Turn your base questions into rich multi-turn dialogues. This is just the first pipeline—more coming soon!

---

## Pipeline Highlights

✅ **Simple** - One command to generate thousands of conversations  
✅ **Scalable** - Parallel processing for fast generation  
✅ **Flexible** - Mix different AI providers (OpenAI, Anthropic, Ultrasafe AI)  
✅ **Production Ready** - Built for SaaS platforms with multi-tenant support  

---

## Quick Start

### 1. Install

```bash
pip install omnigen-usf
```

### 2. Prepare Base Data

Create a file `base_data.jsonl` with your starting questions:

```jsonl
{"conversations": [{"role": "user", "content": "How do I learn Python?"}]}
{"conversations": [{"role": "user", "content": "What is machine learning?"}]}
{"conversations": [{"role": "user", "content": "Explain neural networks"}]}
```
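
If your starting questions live in a Python list, a short script can write them in this layout. This is a minimal sketch using only the standard library; the helper name `write_base_data` is illustrative and not part of OmniGen.

```python
import json
from pathlib import Path


def write_base_data(questions, path="base_data.jsonl"):
    """Write one single-turn conversation per line (JSON Lines)."""
    out = Path(path)
    with out.open("w", encoding="utf-8") as f:
        for question in questions:
            record = {"conversations": [{"role": "user", "content": question}]}
            # ensure_ascii=False keeps non-English text readable in the file
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out


if __name__ == "__main__":
    write_base_data([
        "How do I learn Python?",
        "What is machine learning?",
        "Explain neural networks",
    ])
```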

### 3. Generate Conversations

```python
from omnigen.pipelines.conversation_extension import (
    ConversationExtensionConfigBuilder,
    ConversationExtensionPipeline
)

# Configure the pipeline
config = (ConversationExtensionConfigBuilder()
    # User followup generator
    .add_provider(
        role='user_followup',
        name='ultrasafe',
        api_key='your-api-key',
        model='usf-mini'
    )
    # Assistant response generator
    .add_provider(
        role='assistant_response',
        name='ultrasafe',
        api_key='your-api-key',
        model='usf-mini'
    )
    # Generation settings
    .set_generation(
        num_conversations=100,
        turn_range=(3, 8)  # 3-8 turns per conversation
    )
    # Input data
    .set_data_source(
        source_type='file',
        file_path='base_data.jsonl'
    )
    # Output
    .set_storage(
        type='jsonl',
        output_file='output.jsonl'
    )
    .build()
)

# Run the pipeline
pipeline = ConversationExtensionPipeline(config)
pipeline.run()
```

### 4. Get Results

Your generated conversations will be in `output.jsonl`, one JSON object per line. A single record is shown pretty-printed here for readability:

```json
{
  "id": 0,
  "conversations": [
    {"role": "user", "content": "How do I learn Python?"},
    {"role": "assistant", "content": "Great choice! Start with the basics..."},
    {"role": "user", "content": "What resources do you recommend?"},
    {"role": "assistant", "content": "I recommend these resources..."},
    {"role": "user", "content": "How long will it take?"},
    {"role": "assistant", "content": "With consistent practice..."}
  ],
  "num_turns": 3,
  "success": true
}
```
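
To check a run quickly, you can tally the records in the output file. The sketch below assumes each record occupies one line of the file and carries the `success` and `num_turns` fields shown in the example; `summarize_output` is a hypothetical helper, not an OmniGen API.

```python
import json
from collections import Counter


def summarize_output(path="output.jsonl"):
    """Count records, successes, and the distribution of num_turns."""
    total = 0
    succeeded = 0
    turns = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            total += 1
            succeeded += bool(record.get("success"))
            turns[record.get("num_turns")] += 1
    return {"total": total, "succeeded": succeeded, "turns": dict(turns)}
```

Calling `summarize_output("output.jsonl")` returns a small dictionary that is easy to log or assert on in a test.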

---

## Supported AI Providers

| Provider | Model Examples |
|----------|----------------|
| **Ultrasafe AI** | `usf-mini`, `usf-max` |
| **OpenAI** | `gpt-4-turbo`, `gpt-3.5-turbo` |
| **Anthropic** | `claude-3-5-sonnet`, `claude-3-opus` |
| **OpenRouter** | Various models |

### Mix Different Providers

```python
config = (ConversationExtensionConfigBuilder()
    # Each provider needs its own API key
    .add_provider('user_followup', 'openai', openai_api_key, 'gpt-4-turbo')
    .add_provider('assistant_response', 'anthropic', anthropic_api_key, 'claude-3-5-sonnet')
    # ... rest of config
    .build()
)
```

---

## Advanced Features

### Multi-Tenant SaaS Support

Perfect for platforms serving multiple users concurrently:

```python
# Each user gets isolated workspace
workspace_id = f"user_{user_id}_session_{session_id}"

config = (ConversationExtensionConfigBuilder(workspace_id=workspace_id)
    .add_provider('user_followup', 'ultrasafe', shared_api_key, 'usf-mini')
    .add_provider('assistant_response', 'ultrasafe', shared_api_key, 'usf-mini')
    .set_storage('jsonl', output_file='output.jsonl')  # Auto-isolated
    .build()
)

# Storage automatically goes to: workspaces/{workspace_id}/output.jsonl
```
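
The isolation described above amounts to prefixing every output path with a per-workspace directory. The sketch below illustrates that layout with `pathlib`. Both `workspace_path` and the character whitelist are assumptions made for illustration, not OmniGen's internal implementation.

```python
import re
from pathlib import Path


def workspace_path(workspace_id, filename, root="workspaces"):
    """Return root/<workspace_id>/<filename>, rejecting unsafe IDs."""
    # Allow only characters that cannot escape the workspace directory.
    if not re.fullmatch(r"[A-Za-z0-9_-]+", workspace_id):
        raise ValueError(f"unsafe workspace_id: {workspace_id!r}")
    # Keep only the final path component so the file stays inside the workspace.
    return Path(root) / workspace_id / Path(filename).name


print(workspace_path("user_42_session_abc", "output.jsonl").as_posix())
# workspaces/user_42_session_abc/output.jsonl
```

Validating the ID before building the path matters when `workspace_id` is derived from user input, since a value such as `../other_user` would otherwise point outside the intended directory.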

### Parallel Dataset Generation

```python
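import os
from concurrent.futures import ProcessPoolExecutor

from omnigen.pipelines.conversation_extension import (
    ConversationExtensionConfigBuilder,
    ConversationExtensionPipeline
)

api_key = os.environ["ULTRASAFE_API_KEY"]  # read the key from the environment

def process_dataset(input_file, output_file):
    config = (ConversationExtensionConfigBuilder()
        .add_provider('user_followup', 'ultrasafe', api_key, 'usf-mini')
        .add_provider('assistant_response', 'ultrasafe', api_key, 'usf-mini')
        .set_data_source('file', file_path=input_file)
        .set_storage('jsonl', output_file=output_file)
        .build()
    )
    ConversationExtensionPipeline(config).run()

# Process 3 datasets in parallel
if __name__ == "__main__":  # needed where worker processes are spawned (Windows, macOS)
    jobs = [('data1.jsonl', 'out1.jsonl'),
            ('data2.jsonl', 'out2.jsonl'),
            ('data3.jsonl', 'out3.jsonl')]
    with ProcessPoolExecutor(max_workers=3) as executor:
        futures = [executor.submit(process_dataset, src, dst) for src, dst in jobs]
        for future in futures:
            future.result()  # re-raise any exception from a worker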

---

## 📖 Complete Configuration Reference

### All Configuration Options Explained

Below is a comprehensive YAML configuration showing **ALL** available options with detailed explanations:

```yaml
# ==============================================================================
# WORKSPACE ISOLATION (Optional)
# ==============================================================================
# Unique ID for multi-tenant environments - auto-isolates all output files
workspace_id: "user_123_session_abc"

# ==============================================================================
# PROVIDERS - AI Model Configuration
# ==============================================================================
# Configure different AI providers for each role
# Each role can use a different provider/model combination

providers:
  # Provider for generating user follow-up questions
  user_followup:
    name: ultrasafe              # Options: ultrasafe, openai, anthropic, openrouter
    api_key: ${API_KEY}          # Use env var: ${VAR_NAME} or direct key
    model: usf-mini              # Model identifier
    temperature: 0.7             # Randomness (0.0-1.0): higher = more creative
    max_tokens: 2048             # Max tokens in response
    timeout: 300                 # Request timeout in seconds
    max_retries: 5               # Number of retry attempts on failure
    retry_delay: 2               # Delay between retries in seconds
  
  # Provider for generating assistant responses
  assistant_response:
    name: ultrasafe              # Can use different provider than user_followup
    api_key: ${API_KEY}
    model: usf-mini
    temperature: 0.7
    max_tokens: 8192             # Larger for detailed responses
    timeout: 300
    max_retries: 5
    retry_delay: 2

# PROVIDER OPTIONS:
# ----------------
# ultrasafe:
#   models: usf-mini, usf-max
#
# openai:
#   models: gpt-4-turbo, gpt-4, gpt-3.5-turbo, gpt-4o, gpt-4o-mini
#
# anthropic:
#   models: claude-3-5-sonnet-20241022, claude-3-opus-20240229,
#           claude-3-sonnet-20240229, claude-3-haiku-20240307
#
# openrouter:
#   models: Any OpenRouter supported model
#   base_url: https://openrouter.ai/api/v1 (optional)

# ==============================================================================
# GENERATION SETTINGS
# ==============================================================================
generation:
  num_conversations: 100           # Total conversations to generate
  
  turn_range:                      # Number of turns per conversation
    min: 3                         # Minimum turns
    max: 8                         # Maximum turns
  
  parallel_workers: 10             # Concurrent workers (balance speed vs rate limits)
  
  # Extension behavior for multi-turn input
  extension_mode: "smart"          # Options: "smart" | "legacy"
  # - smart: Intelligently handle multi-turn conversations
  # - legacy: Always extract first user message only
  
  skip_invalid: true               # Skip invalid patterns (recommended: true)
  
  # Turn calculation method
  turn_calculation: "additional"   # Options: "additional" | "total"
  # - additional: Add NEW turns on top of existing (default)
  # - total: Keep total turns within range (never removes existing)

# ==============================================================================
# DATA SOURCE CONFIGURATION
# ==============================================================================
base_data:
  enabled: true                    # Enable base data loading
  
  # OPTION 1: Local File
  source_type: file                # Use local JSONL/JSON file
  file_path: data/input.jsonl      # Path to file
  format: conversations            # JSON key containing conversation array
  shuffle: false                   # Shuffle data before processing
  
  # OPTION 2: HuggingFace Dataset
  # source_type: huggingface       # Use HuggingFace dataset
  # hf_dataset: username/dataset   # HuggingFace dataset path
  # hf_split: train                # Dataset split: train, test, validation
  # hf_token: ${HF_TOKEN}          # HuggingFace API token (if private)
  # hf_streaming: false            # Stream dataset (for large datasets)
  # format: conversations          # Field name in dataset
  # shuffle: true                  # Shuffle after loading

# ==============================================================================
# STORAGE CONFIGURATION
# ==============================================================================
storage:
  type: jsonl                      # Options: jsonl | mongodb
  
  # JSONL Storage (Default)
  output_file: output.jsonl        # Successful conversations
  partial_file: partial.jsonl      # Partial/incomplete conversations
  failed_file: failed.jsonl        # Failed conversations
  
  # MongoDB Storage (Alternative)
  # type: mongodb
  # mongodb:
  #   connection_string: mongodb://localhost:27017
  #   database: omnigen
  #   collection: conversations
  #   output_collection: output          # Successful
  #   partial_collection: partial        # Partial
  #   failed_collection: failed          # Failed

# ==============================================================================
# DATETIME CONFIGURATION (Optional)
# ==============================================================================
datetime_config:
  enabled: true                    # Enable datetime generation
  mode: random_from_range          # Options: random_from_range | current | fixed
  timezone: UTC                    # Timezone (UTC, America/New_York, Asia/Dubai, etc.)
  format: "%Y-%m-%d %H:%M:%S"      # Python strftime format
  
  # For random_from_range mode
  range:
    start: "2024-01-01 00:00:00"   # Start datetime
    end: "2024-12-31 23:59:59"     # End datetime
  
  # For fixed mode
  # fixed_datetime: "2024-06-15 12:00:00"

# ==============================================================================
# SYSTEM MESSAGES (Optional)
# ==============================================================================
system_messages:
  # Prepend system message to every conversation
  prepend_always:
    enabled: true
    content: "You are a helpful AI assistant. Current time: {current_datetime} ({timezone})."
  
  # Append system message to every conversation
  append_always:
    enabled: false
    content: "Remember to be concise and helpful."
  
  # Add system message only if none exists
  add_if_missing:
    enabled: false
    content: "You are an AI assistant."

# Available variables in system messages:
# - {current_datetime}: Generated datetime
# - {timezone}: Configured timezone
# - {workspace_id}: Current workspace ID

# ==============================================================================
# CUSTOM PROMPTS (Optional)
# ==============================================================================
prompts:
  # Custom prompt for user follow-up generation
  followup_question: |
    ## Your Task
    Generate an intelligent follow-up user question based on conversation history.
    
    ### CONVERSATION HISTORY:
    {history}
    
    ### INSTRUCTIONS:
    - Generate a meaningful follow-up question
    - Be conversational and natural
    - Vary your phrasing and tone
    - Build on the assistant's last response
    
    Return your follow-up question wrapped in XML tags:
    <user>Your follow-up question here</user>
  
  # Custom prompt for assistant response generation
  # assistant_response: |
  #   Your custom assistant response prompt here...

# ==============================================================================
# DEBUG OPTIONS (Optional)
# ==============================================================================
debug:
  log_api_timing: true             # Log API call timings
  log_parallel_status: true        # Log parallel worker status
  verbose: false                   # Verbose logging
```
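
The `${VAR_NAME}` placeholders in the reference keep secrets out of the config file. As a rough illustration of how such placeholders can be resolved, the sketch below expands them from the environment using only the standard library. `expand_env` is a hypothetical helper; OmniGen's own loader may behave differently, for example in how it treats unset variables.

```python
import os
import re

_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Recursively replace ${VAR} in strings inside dicts and lists."""
    env = os.environ if env is None else env
    if isinstance(value, dict):
        return {k: expand_env(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v, env) for v in value]
    if isinstance(value, str):
        def lookup(match):
            name = match.group(1)
            if name not in env:
                raise KeyError(f"environment variable {name} is not set")
            return env[name]
        return _PLACEHOLDER.sub(lookup, value)
    return value  # numbers, booleans, and None pass through unchanged


config = {"providers": {"user_followup": {"api_key": "${API_KEY}", "max_tokens": 2048}}}
resolved = expand_env(config, env={"API_KEY": "sk-demo"})
print(resolved["providers"]["user_followup"]["api_key"])  # sk-demo
```

Failing loudly on an unset variable is a deliberate choice in this sketch: a missing key is easier to diagnose at load time than as an authentication error in the middle of a run.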

### Quick Configuration Examples

#### Example 1: Local File Input
```yaml
providers:
  user_followup:
    name: ultrasafe
    api_key: ${ULTRASAFE_API_KEY}
    model: usf-mini
  assistant_response:
    name: ultrasafe
    api_key: ${ULTRASAFE_API_KEY}
    model: usf-mini

generation:
  num_conversations: 100
  turn_range: {min: 3, max: 8}

base_data:
  source_type: file
  file_path: input.jsonl

storage:
  type: jsonl
  output_file: output.jsonl
```

#### Example 2: HuggingFace Dataset Input
```yaml
providers:
  user_followup:
    name: openai
    api_key: ${OPENAI_API_KEY}
    model: gpt-4-turbo
  assistant_response:
    name: anthropic
    api_key: ${ANTHROPIC_API_KEY}
    model: claude-3-5-sonnet-20241022

generation:
  num_conversations: 1000
  turn_range: {min: 5, max: 10}
  parallel_workers: 20

base_data:
  source_type: huggingface
  hf_dataset: username/my-dataset
  hf_split: train
  hf_token: ${HF_TOKEN}
  format: conversations
  shuffle: true

storage:
  type: jsonl
  output_file: output.jsonl
```

#### Example 3: Mixed Providers with MongoDB
```yaml
providers:
  user_followup:
    name: openai
    api_key: ${OPENAI_API_KEY}
    model: gpt-3.5-turbo
    temperature: 0.8
  assistant_response:
    name: anthropic
    api_key: ${ANTHROPIC_API_KEY}
    model: claude-3-5-sonnet-20241022
    temperature: 0.7

generation:
  num_conversations: 500
  turn_range: {min: 3, max: 8}

base_data:
  source_type: file
  file_path: questions.jsonl

storage:
  type: mongodb
  mongodb:
    connection_string: mongodb://localhost:27017
    database: omnigen
    collection: conversations
```

#### Example 4: Programmatic Configuration (Python)
```python
from omnigen.pipelines.conversation_extension import (
    ConversationExtensionConfigBuilder,
    ConversationExtensionPipeline
)

# Build configuration programmatically
config = (ConversationExtensionConfigBuilder()
    # Workspace isolation
    .set_workspace_id("user_123_session_abc")
    
    # Providers
    .add_provider(
        role='user_followup',
        name='ultrasafe',
        api_key='your-api-key',
        model='usf-mini',
        temperature=0.7,
        max_tokens=2048
    )
    .add_provider(
        role='assistant_response',
        name='ultrasafe',
        api_key='your-api-key',
        model='usf-mini',
        temperature=0.7,
        max_tokens=8192
    )
    
    # Generation settings
    .set_generation(
        num_conversations=100,
        turn_range=(3, 8),
        parallel_workers=10,
        extension_mode='smart',
        skip_invalid=True,
        turn_calculation='additional'
    )
    
    # Data source - Local file
    .set_data_source(
        source_type='file',
        file_path='input.jsonl',
        format='conversations',
        shuffle=False
    )
    
    # Data source - HuggingFace (alternative)
    # .set_data_source(
    #     source_type='huggingface',
    #     hf_dataset='username/dataset',
    #     hf_split='train',
    #     hf_token='your-token',
    #     format='conversations',
    #     shuffle=True
    # )
    
    # Storage
    .set_storage(
        type='jsonl',
        output_file='output.jsonl',
        partial_file='partial.jsonl',
        failed_file='failed.jsonl'
    )
    
    # Custom prompts (optional)
    .set_prompts(
        followup_question="Your custom prompt here with {history}"
    )
    
    .build()
)

# Run pipeline
pipeline = ConversationExtensionPipeline(config)
pipeline.run()
```

---

## 📖 Conversation Extension Pipeline - Complete Guide

### Overview

The **Conversation Extension Pipeline** intelligently transforms base conversations into rich multi-turn dialogues. It can handle both single-turn questions and extend existing multi-turn conversations.

### Key Features

- ✅ **Smart Extension** - Continues from existing conversations based on last role
- ✅ **Flexible Input** - Handles single-turn or multi-turn base data
- ✅ **Provider Mix** - Use different AI providers for user and assistant
- ✅ **Multi-Tenant** - Complete workspace isolation
- ✅ **Configurable** - Full control over generation behavior

### Configuration Options

#### Extension Modes

**Smart Mode (Default)**
```yaml
generation:
  extension_mode: "smart"
```

- **Single-turn input** → Generate the full conversation starting from the single user message
- **Multi-turn (user last)** → Add 1 assistant response, then continue
- **Multi-turn (assistant last)** → Add user + assistant, then continue
- **Invalid patterns** → Skip row entirely

**Legacy Mode**
```yaml
generation:
  extension_mode: "legacy"
```
- Always extracts first user message only (original behavior)

#### Turn Calculation

**Additional Mode (Default)** - Add NEW turns on top of existing
```yaml
generation:
  turn_calculation: "additional"  # Add 3-8 NEW turns
```

**Total Mode** - Keep total within range (never removes existing)
```yaml
generation:
  turn_calculation: "total"  # Total should be 3-8 turns
```

#### Complete Configuration

```yaml
# Workspace isolation (optional)
workspace_id: "user_123"

# AI Providers
providers:
  user_followup:
    name: "ultrasafe"
    api_key: "${ULTRASAFE_API_KEY}"
    model: "usf-mini"
    temperature: 0.7
    max_tokens: 2048
  
  assistant_response:
    name: "ultrasafe"
    api_key: "${ULTRASAFE_API_KEY}"
    model: "usf-mini"
    temperature: 0.7
    max_tokens: 8192

# Generation Settings
generation:
  num_conversations: 100
  turn_range:
    min: 3
    max: 8
  parallel_workers: 10
  
  # Extension behavior
  extension_mode: "smart"        # "smart" | "legacy"
  skip_invalid: true             # Skip invalid patterns
  turn_calculation: "additional" # "additional" | "total"

# Input Data
base_data:
  enabled: true
  source_type: "file"
  file_path: "base_data.jsonl"
  format: "conversations"
  shuffle: false

# Output Storage
storage:
  type: "jsonl"
  output_file: "output.jsonl"
  partial_file: "partial.jsonl"
  failed_file: "failed.jsonl"

# System Messages (optional)
system_messages:
  add_if_missing:
    enabled: true
    content: "You are a helpful assistant. Current datetime: {current_datetime}"

# DateTime (optional)
datetime_config:
  enabled: true
  timezone: "UTC"
  format: "%Y-%m-%d %H:%M:%S"
  range:
    start: "2024-01-01 00:00:00"
    end: "2024-12-31 23:59:59"
```
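
To make the `datetime_config` and `system_messages` settings concrete, the sketch below samples a datetime from a range, formats it with the configured `strftime` pattern, and fills the `{current_datetime}` placeholder. It is an illustration built on the standard library and treats all values as UTC; the function names are hypothetical and OmniGen's internals may differ.

```python
import random
from datetime import datetime, timedelta

FORMAT = "%Y-%m-%d %H:%M:%S"


def random_datetime(start, end, fmt=FORMAT, rng=random):
    """Pick a uniformly random second between start and end (inclusive)."""
    start_dt = datetime.strptime(start, fmt)
    end_dt = datetime.strptime(end, fmt)
    if end_dt < start_dt:
        raise ValueError("end must not be earlier than start")
    span_seconds = int((end_dt - start_dt).total_seconds())
    return start_dt + timedelta(seconds=rng.randint(0, span_seconds))


def render_system_message(template, current_datetime, timezone="UTC", fmt=FORMAT):
    """Fill the {current_datetime} and {timezone} placeholders."""
    return template.format(
        current_datetime=current_datetime.strftime(fmt),
        timezone=timezone,
    )


moment = random_datetime("2024-01-01 00:00:00", "2024-12-31 23:59:59")
print(render_system_message(
    "You are a helpful assistant. Current datetime: {current_datetime}", moment
))
```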

### Input Data Formats

#### Valid Patterns

**Single-turn** ✅
```json
{"conversations": [{"role": "user", "content": "How do I learn Python?"}]}
```

**Multi-turn (user last)** ✅
```json
{
  "conversations": [
    {"role": "user", "content": "How do I learn Python?"},
    {"role": "assistant", "content": "Start with basics..."},
    {"role": "user", "content": "What resources?"}
  ]
}
```

**Multi-turn (assistant last)** ✅
```json
{
  "conversations": [
    {"role": "user", "content": "How do I learn Python?"},
    {"role": "assistant", "content": "Start with basics..."}
  ]
}
```

#### Invalid Patterns (Skipped)

❌ First message not user
```json
{"conversations": [{"role": "assistant", "content": "Hello"}]}
```

❌ Empty conversations
```json
{"conversations": []}
```
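
The valid and invalid patterns above can be expressed as a small classifier, which is handy for checking a dataset before a run. This is a sketch of the rules as described in this guide, not OmniGen's actual validation code; the function name and the returned labels are made up for illustration.

```python
def classify_conversation(record, key="conversations"):
    """Return 'single_turn', 'user_last', 'assistant_last', or 'invalid'."""
    messages = record.get(key)
    if not isinstance(messages, list) or not messages:
        return "invalid"  # missing or empty conversation
    if not all(isinstance(m, dict) for m in messages):
        return "invalid"  # every message must be an object with a role
    if messages[0].get("role") != "user":
        return "invalid"  # first message must come from the user
    if len(messages) == 1:
        return "single_turn"
    last_role = messages[-1].get("role")
    if last_role == "user":
        return "user_last"
    if last_role == "assistant":
        return "assistant_last"
    return "invalid"
```

Running this over each line of the input file and counting the `invalid` results gives an estimate of how many rows would be skipped.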

### Programmatic Usage

```python
from omnigen.pipelines.conversation_extension import (
    ConversationExtensionConfigBuilder,
    ConversationExtensionPipeline
)

config = (ConversationExtensionConfigBuilder()
    .add_provider('user_followup', 'ultrasafe', 'api-key', 'usf-mini')
    .add_provider('assistant_response', 'ultrasafe', 'api-key', 'usf-mini')
    .set_generation(
        num_conversations=100,
        turn_range=(3, 8),
        parallel_workers=10,
        extension_mode='smart',      # Handle multi-turn intelligently
        skip_invalid=True,            # Skip invalid patterns
        turn_calculation='additional' # Add new turns (default)
    )
    .set_data_source('file', file_path='base_data.jsonl')
    .set_storage('jsonl', output_file='output.jsonl')
    .build()
)

pipeline = ConversationExtensionPipeline(config)
pipeline.run()
```

### Turn Calculation Examples

**Additional Mode (Default)**
```
Existing: 2 turns
Config: turn_range = (3, 8)
Result: Add 3-8 NEW turns → Total: 5-10 turns
```

**Total Mode**
```
Existing: 2 turns
Config: turn_range = (3, 8)
Result: Add 1-6 turns → Total: 3-8 turns

Existing: 10 turns (already > max)
Config: turn_range = (3, 8)
Result: Add 0 turns → Keep 10 turns (never remove)
```
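
The arithmetic in these examples can be written as a short function. The sketch below is one reading of the two modes, not OmniGen's source: in `total` mode it assumes a target length is drawn from the range and existing turns are subtracted, with a floor of zero. `turns_to_add` is a hypothetical name.

```python
import random


def turns_to_add(existing, turn_range, mode="additional", rng=random):
    """Return how many new turns to generate for one conversation."""
    low, high = turn_range
    if mode == "additional":
        # The range applies to new turns only.
        return rng.randint(low, high)
    if mode == "total":
        # The range applies to the final length; existing turns are never removed.
        target = rng.randint(low, high)
        return max(0, target - existing)
    raise ValueError(f"unknown turn_calculation mode: {mode!r}")
```

With 2 existing turns and a range of `(3, 8)`, `additional` mode yields 3 to 8 new turns and `total` mode yields 1 to 6. With 10 existing turns, `total` mode always yields 0.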

### Best Practices

**Provider Selection**
- Use stronger models for assistant responses (e.g. `claude-3-5-sonnet`, `gpt-4-turbo`)
- Use cheaper models for user follow-ups (e.g. `usf-mini`, `gpt-3.5-turbo`)

**Turn Range**
- Quick exchanges: `(2, 4)`
- In-depth: `(5, 10)`
- Balanced: `(3, 8)` ✅

**Parallel Workers**
- Conservative: `5` (avoid rate limits)
- Balanced: `10` ✅
- Aggressive: `20` (watch for rate limits)

### Troubleshooting

**Issue: Empty output**
- Check the input data format (the first message in each conversation must have role `user`)
- Set `skip_invalid: false` to surface errors for invalid rows instead of silently skipping them

**Issue: Rate limits**
- Reduce `parallel_workers`
- Check provider API limits
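
When rate limits persist, retrying with a growing delay usually helps more than retrying immediately. The sketch below shows a generic exponential-backoff wrapper. It is not OmniGen's retry logic (that is configured through `max_retries` and `retry_delay`), and `call_with_backoff` is a hypothetical helper.

```python
import random
import time


def call_with_backoff(func, max_retries=5, base_delay=2.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt)
            # Up to 10% random jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists so the wait can be stubbed out in tests; in normal use the default `time.sleep` applies.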

**Issue: Low quality**
- Tune temperature: raise it (0.8-0.9) if output is repetitive, lower it if responses drift off topic
- Use better models
- Add custom prompts and system messages

---

## License

MIT License - Ultrasafe AI © 2024

---

## About Ultrasafe AI

Enterprise-grade AI tools with a focus on safety and performance.

- 🌐 Website: [us.inc](https://us.inc)
- 📧 Email: support@us.inc

---

<div align="center">

**Made with ❤️ by [Ultrasafe AI](https://us.inc)**

</div>
