Metadata-Version: 2.4
Name: omnigen-usf
Version: 0.0.1.post1
Summary: Enterprise-Grade Synthetic Data Generation
Author-email: Ultrasafe AI <support@us.inc>
Maintainer-email: Ultrasafe AI <support@us.inc>
License: MIT
Project-URL: Homepage, https://us.inc
Project-URL: Documentation, https://github.com/ultrasafe-ai/omnigen
Project-URL: Repository, https://github.com/ultrasafe-ai/omnigen
Keywords: synthetic-data,data-generation,conversational-ai,llm,pipeline,conversation-extension,machine-learning,ai
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: openai>=1.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: tqdm>=4.65.0
Requires-Dist: pytz>=2023.3
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: typer>=0.9.0
Requires-Dist: pymongo>=4.0.0
Provides-Extra: hf
Requires-Dist: datasets>=2.14.0; extra == "hf"
Requires-Dist: huggingface_hub>=0.16.0; extra == "hf"
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: ruff>=0.0.280; extra == "dev"
Requires-Dist: mypy>=1.5.0; extra == "dev"
Dynamic: license-file

# OmniGen 🚀

**Generate synthetic data at scale using an enterprise-ready framework with fully customizable configuration, security, and ease of use**

Built by [Ultrasafe AI](https://us.inc) for production environments.

---

## What is OmniGen?

**OmniGen** is an enterprise-grade framework for generating synthetic datasets at scale, either from scratch or from base data. Generate **trillions of tokens** and **billions of samples** across multiple modalities:

### 🎯 Data Types Supported
- 💬 **Conversational Data** - Single-turn to multi-turn dialogues
- 🤖 **Agentic Datasets** - Tool use, function calling, multi-step reasoning
- 🎨 **Multimodal Datasets** - Text, images, audio, video combinations
- 🖼️ **Images** - Synthetic image generation and editing
- 🎵 **Audio** - Speech, music, sound effects
- 🎬 **Video** - Synthetic video sequences

### 🎓 Use Cases
- **Fine-Tuning** - Instruction following, task-specific models
- **Supervised Fine-Tuning (SFT)** - High-quality labeled datasets
- **Offline Reinforcement Learning** - Preference datasets with rewards
- **Online Reinforcement Learning** - Ground truth with reward checking scripts
- **Pre-Training** - Large-scale corpus generation
- **Machine Learning** - Training data for any ML task

### 🏗️ Why OmniGen?
- ✅ **Enterprise-Ready** - Built for production at scale
- ✅ **Fully Customizable** - Configure every aspect of generation
- ✅ **Secure** - Complete isolation, no data mixing
- ✅ **Easy** - Simple API, clear examples
- ✅ **Modular** - Independent pipelines for different data types

---

## 🚀 Currently Available Pipeline

### **conversation_extension** - Extend Single-Turn to Multi-Turn Conversations

Turn your base questions into rich multi-turn dialogues. This is the first pipeline, with more coming soon!

---

## Why OmniGen?

✅ **Simple** - One command to generate thousands of conversations  
✅ **Scalable** - Parallel processing for fast generation  
✅ **Flexible** - Mix different AI providers (OpenAI, Anthropic, Ultrasafe AI)  
✅ **Production Ready** - Built for SaaS platforms with multi-tenant support  

---

## Quick Start

### 1. Install

```bash
pip install omnigen-usf
```

### 2. Prepare Base Data

Create a file `base_data.jsonl` with your starting questions:

```jsonl
{"conversations": [{"role": "user", "content": "How do I learn Python?"}]}
{"conversations": [{"role": "user", "content": "What is machine learning?"}]}
{"conversations": [{"role": "user", "content": "Explain neural networks"}]}
```
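
If your questions live in a Python list rather than a file, a small helper can write and sanity-check `base_data.jsonl`. This is a minimal sketch using only the standard library; `write_base_data` and `validate_base_data` are hypothetical helper names, not part of OmniGen, and the checks only assume the record shape shown above.

```python
import json
from pathlib import Path


def write_base_data(questions, path="base_data.jsonl"):
    """Write one single-turn conversation per line, in the format shown above."""
    lines = []
    for question in questions:
        record = {"conversations": [{"role": "user", "content": question}]}
        # ensure_ascii=False keeps non-English text readable in the file
        lines.append(json.dumps(record, ensure_ascii=False))
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


def validate_base_data(path="base_data.jsonl"):
    """Return (line_number, problem) pairs; an empty list means no problems were found."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            turns = record.get("conversations") if isinstance(record, dict) else None
            if not isinstance(turns, list) or not turns:
                problems.append((line_number, "missing or empty 'conversations' list"))
            elif not all(isinstance(t, dict) and "role" in t and "content" in t for t in turns):
                problems.append((line_number, "each turn needs 'role' and 'content'"))
    return problems
```

Running `validate_base_data()` before starting a long generation job catches malformed lines early, before any API calls are made.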

### 3. Generate Conversations

```python
from omnigen.pipelines.conversation_extension import (
    ConversationExtensionConfigBuilder,
    ConversationExtensionPipeline
)

# Configure the pipeline
config = (ConversationExtensionConfigBuilder()
    # User followup generator
    .add_provider(
        role='user_followup',
        name='ultrasafe',
        api_key='your-api-key',
        model='usf-mini'
    )
    # Assistant response generator
    .add_provider(
        role='assistant_response',
        name='ultrasafe',
        api_key='your-api-key',
        model='usf-mini'
    )
    # Generation settings
    .set_generation(
        num_conversations=100,
        turn_range=(3, 8)  # 3-8 turns per conversation
    )
    # Input data
    .set_data_source(
        source_type='file',
        file_path='base_data.jsonl'
    )
    # Output
    .set_storage(
        type='jsonl',
        output_file='output.jsonl'
    )
    .build()
)

# Run the pipeline
pipeline = ConversationExtensionPipeline(config)
pipeline.run()
```

### 4. Get Results

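Your generated conversations will be in `output.jsonl`, one JSON object per line. A single record is shown here expanded across several lines for readability: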

```jsonl
{
  "id": 0,
  "conversations": [
    {"role": "user", "content": "How do I learn Python?"},
    {"role": "assistant", "content": "Great choice! Start with the basics..."},
    {"role": "user", "content": "What resources do you recommend?"},
    {"role": "assistant", "content": "I recommend these resources..."},
    {"role": "user", "content": "How long will it take?"},
    {"role": "assistant", "content": "With consistent practice..."}
  ],
  "num_turns": 3,
  "success": true
}
```
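
To inspect the results programmatically, the output file can be read back line by line. The sketch below assumes each record sits on a single line and carries the `success` and `num_turns` fields shown in the sample above; `load_results` and `summarize` are hypothetical helpers, not OmniGen functions.

```python
import json


def load_results(path="output.jsonl"):
    """Yield one record per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def summarize(records):
    """Count records and successes, and average the turn count over successful records."""
    total = 0
    turns = []
    for record in records:
        total += 1
        if record.get("success"):
            turns.append(record.get("num_turns", 0))
    average = sum(turns) / len(turns) if turns else 0.0
    return {"total": total, "successful": len(turns), "avg_turns": average}
```

For example, `summarize(load_results("output.jsonl"))` gives a quick view of how many conversations completed before you feed the file into training.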

---

## Supported AI Providers

| Provider | Model Examples |
|----------|----------------|
| **Ultrasafe AI** | `usf-mini`, `usf-max` |
| **OpenAI** | `gpt-4-turbo`, `gpt-3.5-turbo` |
| **Anthropic** | `claude-3-5-sonnet`, `claude-3-opus` |
| **OpenRouter** | Various models |

### Mix Different Providers

```python
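import os

# Each provider needs its own API key; the environment variable names are illustrative
config = (ConversationExtensionConfigBuilder()
    .add_provider('user_followup', 'openai', os.environ['OPENAI_API_KEY'], 'gpt-4-turbo')
    .add_provider('assistant_response', 'anthropic', os.environ['ANTHROPIC_API_KEY'], 'claude-3-5-sonnet')
    # ... rest of config
    .build()
)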
```

---

## Advanced Features

### Multi-Tenant SaaS Support

Perfect for platforms serving multiple users concurrently:

```python
# Each user gets isolated workspace
workspace_id = f"user_{user_id}_session_{session_id}"

config = (ConversationExtensionConfigBuilder(workspace_id=workspace_id)
    .add_provider('user_followup', 'ultrasafe', shared_api_key, 'usf-mini')
    .add_provider('assistant_response', 'ultrasafe', shared_api_key, 'usf-mini')
    .set_storage('jsonl', output_file='output.jsonl')  # Auto-isolated
    .build()
)

# Storage automatically goes to: workspaces/{workspace_id}/output.jsonl
```
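
Because the workspace ID ends up in a storage path, it is worth normalizing user-supplied values before building it. The following is a minimal sketch of one way to do that; `make_workspace_id` and `expected_output_path` are hypothetical helpers, and the path layout is taken from the comment in the example above rather than from OmniGen's internals.

```python
import re
from pathlib import PurePosixPath

_UNSAFE = re.compile(r"[^A-Za-z0-9_-]")


def make_workspace_id(user_id, session_id):
    """Build a workspace ID that is safe to use as a single directory name."""
    # Replace anything outside [A-Za-z0-9_-] so values like "../other"
    # cannot point outside the workspace root
    safe_user = _UNSAFE.sub("_", str(user_id))
    safe_session = _UNSAFE.sub("_", str(session_id))
    if not safe_user or not safe_session:
        raise ValueError("user_id and session_id must be non-empty")
    return f"user_{safe_user}_session_{safe_session}"


def expected_output_path(workspace_id, output_file="output.jsonl"):
    """Mirror the documented layout: workspaces/{workspace_id}/{output_file}."""
    return str(PurePosixPath("workspaces") / workspace_id / output_file)
```

The resulting ID can be passed as `workspace_id` to the builder exactly as in the example above.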

### Parallel Dataset Generation

```python
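import os
from concurrent.futures import ProcessPoolExecutor

from omnigen.pipelines.conversation_extension import (
    ConversationExtensionConfigBuilder,
    ConversationExtensionPipeline
)

api_key = os.environ['ULTRASAFE_API_KEY']  # illustrative variable name

def process_dataset(input_file, output_file):
    config = (ConversationExtensionConfigBuilder()
        .add_provider('user_followup', 'ultrasafe', api_key, 'usf-mini')
        .add_provider('assistant_response', 'ultrasafe', api_key, 'usf-mini')
        .set_data_source('file', file_path=input_file)
        .set_storage('jsonl', output_file=output_file)
        .build()
    )
    ConversationExtensionPipeline(config).run()

# The __main__ guard is needed where worker processes are spawned (e.g. Windows, macOS)
if __name__ == '__main__':
    jobs = [('data1.jsonl', 'out1.jsonl'), ('data2.jsonl', 'out2.jsonl'), ('data3.jsonl', 'out3.jsonl')]
    # Process 3 datasets in parallel
    with ProcessPoolExecutor(max_workers=3) as executor:
        futures = [executor.submit(process_dataset, src, dst) for src, dst in jobs]
        for future in futures:
            future.result()  # re-raises any exception from the worker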
```
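
If you start from one large input file rather than several datasets, you can split it into shards first and hand each shard to a worker. This is a sketch using only the standard library; `split_jsonl` and the `shards/` directory are assumptions for illustration, not OmniGen features.

```python
from pathlib import Path


def split_jsonl(input_file, num_shards, out_dir="shards"):
    """Distribute the lines of a JSONL file round-robin across num_shards files.

    Returns (shard_input, shard_output) path pairs for the shards that received lines.
    """
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    stem = Path(input_file).stem
    shard_paths = [out_path / f"{stem}_{i}.jsonl" for i in range(num_shards)]
    handles = [p.open("w", encoding="utf-8") for p in shard_paths]
    counts = [0] * num_shards
    try:
        with open(input_file, encoding="utf-8") as source:
            index = 0
            for line in source:
                if not line.strip():
                    continue  # skip blank lines
                target = index % num_shards
                handles[target].write(line if line.endswith("\n") else line + "\n")
                counts[target] += 1
                index += 1
    finally:
        for handle in handles:
            handle.close()
    # Suggest an output name next to each non-empty shard
    return [
        (str(p), str(p.with_name(p.stem + "_out.jsonl")))
        for p, c in zip(shard_paths, counts)
        if c
    ]
```

Each returned pair can be passed to `process_dataset` from the example above, and the per-shard outputs can be concatenated afterwards.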

---

## Examples

See [`examples/conversation_extension/`](examples/conversation_extension/) for more examples:
- Simple usage with JSONL files
- Multi-dataset parallel processing  
- Multi-tenant SaaS implementation

---

## Documentation

- [Complete Guide](examples/conversation_extension/README.md)
- [API Reference](https://docs.us.inc/omnigen)

---

## License

MIT License - Ultrasafe AI © 2024

---

## About Ultrasafe AI

Enterprise-grade AI tools with a focus on safety and performance.

- 🌐 Website: [us.inc](https://us.inc)
- 📧 Email: support@us.inc

---

<div align="center">

**Made with ❤️ by [Ultrasafe AI](https://us.inc)**

</div>
