Metadata-Version: 2.1
Name: opendatagen
Version: 0.0.12
Summary: Steerable data generation system for LLM fine-tuning
Home-page: https://github.com/thoddnn/open-datagen
Author: Thomas DORDONNE
Author-email: dordonne.thomas@gmail.com
License: UNKNOWN
Platform: UNKNOWN
Description-Content-Type: text/markdown
Requires-Dist: openai (>=1.1.1)
Requires-Dist: python-dotenv (>=0.17.1)
Requires-Dist: numpy (>=1.23.4)
Requires-Dist: trafilatura (>=0.9.1)
Requires-Dist: requests (>=2.29.0)
Requires-Dist: tenacity (>=8.2.2)
Requires-Dist: pydantic (>=2)
Requires-Dist: spacy (>=3)
Requires-Dist: tiktoken (>=0.5)
Requires-Dist: PyPDF2 (>=3)
Requires-Dist: pandas (>=2)

# ⬜️ Open Datagen ⬜️

**Open Datagen** is a steerable data generation system for training ML models.

## Features

- Generate synthetic datasets with a fixed format using:
    - User-defined template with pydantic object
    - Predefined templates

- Data anonymization 

- Quality enhancement with RAG from Internet and local files

- (SOON) Data evaluation & cleaning agent

- (SOON) Open-source model support (+ local inference)

- (SOON) Multimodality

## Installation

```bash
pip install --upgrade opendatagen
```

### Setting up the OpenAI API key

```bash
export OPENAI_API_KEY='your_openai_api_key'
```

### Setting up the SERPLY API key for Google Search API (optional)

```bash
export SERPLY_API_KEY='your_serply_api_key'
```
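To confirm from Python that the keys are visible before you start generating data, a minimal check could look like the sketch below. It uses only the standard library; `missing_keys` is a hypothetical helper written for this example and is not part of `opendatagen`.

```python
import os


def missing_keys(required=("OPENAI_API_KEY",), env=None):
    """Return the names in `required` that are unset or empty.

    Hypothetical helper for illustration; not part of opendatagen.
    """
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]


# The OpenAI key is required; the SERPLY key is optional.
missing = missing_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
if missing_keys(("SERPLY_API_KEY",)):
    print("SERPLY_API_KEY is not set (optional).")
```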

## Usage

Example: generating a dataset to train a small model to write good Python code.

```python
from opendatagen.data_generator import DataGenerator
from opendatagen.model import LLM
from opendatagen.template import Template, Variable

variation_model = LLM.load_chat.GPT_35_TURBO_CHAT 
completion_model = LLM.load_instruct.GPT_35_TURBO_INSTRUCT

# Create the custom template using the Pydantic models
user_template = Template(
    description="Custom template for Python exercises",
    prompt="Python exercise statement: {python_exercise_statement}",
    completion="Answer:\n{python_code}",
    prompt_variation_number=1,
    prompt_variables={
        "python_exercise_statement": Variable(
            name="Python exercise statement",
            temperature=1,
            max_tokens=120,
            generation_number=10
        )
    },
    completion_variables={
        "python_code": Variable(
            name="Python code",
            temperature=0,
            max_tokens=256,
            generation_number=1
        )
    }
)

generator = DataGenerator(template=user_template, variation_model=variation_model, completion_model=completion_model)

data = generator.generate_data(output_path="output.csv")

print(data)
```

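This code generates a dataset of Python exercise/answer pairs in the format defined by the template and writes it to `output.csv`. With `generation_number=10` on the prompt variable and `generation_number=1` on the completion variable, the template asks for 10 exercise statements with one answer each.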
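To take a quick look at the generated file, you could read it back with the standard library. This is only a sketch: `preview_csv` is a hypothetical helper, and because the column names of `output.csv` are not documented here, it reads whatever header the file contains.

```python
import csv
from pathlib import Path


def preview_csv(path, limit=3):
    """Return the header and the first `limit` rows of a CSV file.

    Hypothetical helper for illustration; not part of opendatagen.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        # zip() stops after `limit` rows without reading the whole file.
        rows = [row for _, row in zip(range(limit), reader)]
        return reader.fieldnames, rows


# Only preview the file if the generator has already written it.
if Path("output.csv").exists():
    columns, rows = preview_csv("output.csv")
    print("Columns:", columns)
    for row in rows:
        print(row)
```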

### Predefined templates

```python
from opendatagen.data_generator import DataGenerator
from opendatagen.model import LLM
from opendatagen.template import TemplateManager, TemplateName

variation_model = LLM.load_chat.GPT_35_TURBO_CHAT
completion_model = LLM.load_instruct.GPT_35_TURBO_INSTRUCT

manager = TemplateManager()
template = manager.get_template(TemplateName.PRODUCT_REVIEW)

generator = DataGenerator(template=template, variation_model=variation_model, completion_model=completion_model)

data = generator.generate_data(output_path="output.csv")

print(data)
```

You can find the templates in the [template.json](https://github.com/thoddnn/open-datagen/blob/main/opendatagen/files/template.json) file.

## Contribution 

We welcome contributions to Open Datagen! Whether you want to fix bugs, add templates or new features, or improve the documentation, your help is greatly appreciated.

## Note 

Please note that `opendatagen` is currently powered by OpenAI's models. Be aware of potential biases, and use the `start_with` and `note` fields to guide outputs.

## Acknowledgements

We would like to express our gratitude to the following open source projects and individuals that have inspired and helped us:

- **Textbook Generation** by [VikParuchuri](https://github.com/VikParuchuri/textbook_quality)

- **Evol-Instruct Paper** ([Read the paper](https://arxiv.org/abs/2306.08568)) by [WizardLM_AI](https://twitter.com/WizardLM_AI)

## Connect 

Reach us on Twitter: [@thoddnn](https://twitter.com/thoddnn).

