Metadata-Version: 2.4
Name: Data_Generation_Agents
Version: 1.0.0
Summary: AI-powered synthetic data generation pipeline with web search, topic extraction, and persistent state management
Author-email: Omar Youssef <omarjooo595@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/Omar-YYoussef/Data_Gen_Agent
Project-URL: Repository, https://github.com/Omar-YYoussef/Data_Gen_Agent
Keywords: synthetic data,ai,data generation,nlp
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pydantic<3.0.0,>=2.7.4
Requires-Dist: google-generativeai>=0.3.0
Requires-Dist: langgraph>=0.2.20
Requires-Dist: langchain>=0.3.0
Requires-Dist: langchain-core>=0.3.0
Requires-Dist: langchain-community>=0.3.0
Requires-Dist: tavily-python>=0.7.0
Requires-Dist: crawl4ai>=0.7.0
Requires-Dist: requests>=2.31.0
Requires-Dist: beautifulsoup4>=4.12.3
Requires-Dist: aiohttp>=3.9.0
Requires-Dist: httpx>=0.25.0
Requires-Dist: Scrapy>=2.11.0
Requires-Dist: scraperapi-sdk>=1.5.3
Requires-Dist: langchain-scraperapi>=0.1.0
Requires-Dist: pandas>=2.2.3
Requires-Dist: numpy<3.0.0,>=1.26.0
Requires-Dist: jsonschema>=4.23.0
Requires-Dist: asyncio-throttle>=1.0.2
Requires-Dist: nest-asyncio>=1.5.0
Requires-Dist: langdetect>=1.0.9
Requires-Dist: structlog>=23.2.0
Requires-Dist: colorlog>=6.8.0
Requires-Dist: cryptography>=41.0.8
Requires-Dist: python-jose>=3.3.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: black>=23.10.0; extra == "dev"
Requires-Dist: isort>=5.12.0; extra == "dev"
Requires-Dist: mypy>=1.7.0; extra == "dev"
Provides-Extra: api
Requires-Dist: fastapi>=0.104.0; extra == "api"
Requires-Dist: uvicorn>=0.24.0; extra == "api"
Requires-Dist: python-multipart>=0.0.6; extra == "api"
Provides-Extra: ui
Requires-Dist: gradio>=4.0.0; extra == "ui"
Requires-Dist: streamlit>=1.28.0; extra == "ui"
Requires-Dist: plotly>=5.17.0; extra == "ui"
Provides-Extra: cli
Requires-Dist: click>=8.1.7; extra == "cli"
Requires-Dist: rich>=13.7.0; extra == "cli"
Requires-Dist: typer>=0.9.0; extra == "cli"
Provides-Extra: monitoring
Requires-Dist: prometheus-client>=0.19.0; extra == "monitoring"
Dynamic: license-file

# Synthetic Data Pipeline

AI-powered synthetic data generation pipeline with web search and topic extraction.

## Installation

Install the package using pip:

    pip install Data_Generation_Agents

## Requirements

- Python >= 3.8
- API Keys for: Gemini, Tavily, ScraperAPI

## Quick Start

**Step 1: Create Configuration File**

Create a `.env` file in your project directory:

    GEMINI_API_KEY=your_gemini_api_key_here
    TAVILY_API_KEY=your_tavily_api_key_here
    SCRAPERAPI_API_KEY=your_scraper_api_key_here
    OUTPUT_DIR=/path/to/your/output

**Note**: The `OUTPUT_DIR` variable is mandatory. The pipeline will not start without it. This directory is where all the generated data and state files will be saved.

**Step 2: Use in Python Code**

Basic usage example:

    from Data_Generation_Agents import generate_synthetic_data
    
    generate_synthetic_data("prompt")

Advanced usage with custom parameters:

    generate_synthetic_data(
        user_query="prompt",
        refined_queries_count=20,
        search_results_per_query=5,
        rows_per_subtopic=5,
        gemini_model_name="gemini-2.0-flash"
    )

**Step 3: Use CLI**

Command line usage:

    synthetic-data "prompt"

## Configuration

**Environment Variables**

| Variable | Required | Description |
|----------|----------|-------------|
| GEMINI_API_KEY | Yes | Google Gemini API key |
| TAVILY_API_KEY | Yes | Tavily search API key |
| SCRAPERAPI_API_KEY | Yes | ScraperAPI key |
| OUTPUT_DIR | Yes | Output directory path |
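
Since all four variables are required, it can help to check them before starting a long run. The following is a minimal sketch, not part of the package: `missing_settings` is a hypothetical helper, and it assumes the variables have already been loaded into the process environment (for example from the `.env` file).

```python
import os
from typing import List, Mapping

# Variable names taken from the table above.
REQUIRED_VARS = ("GEMINI_API_KEY", "TAVILY_API_KEY", "SCRAPERAPI_API_KEY", "OUTPUT_DIR")


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are absent or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Report anything that still needs to be set in the current environment.
missing = missing_settings(os.environ)
if missing:
    print("Missing required settings: " + ", ".join(missing))
```

Passing the mapping in as an argument keeps the check easy to run against a plain dictionary as well as `os.environ`.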


## Pipeline Output Structure

When you run the pipeline, it creates a new directory for each run inside your specified `OUTPUT_DIR`, named with a unique workflow ID. Inside this directory, you will find the following files, which are updated in real time:

- `pipeline_state.json`: The main state file with metadata about the run.
- `refined_queries.json`: The search queries generated by the `QueryRefinerAgent`.
- `search_results.json`: The results from the web search.
- `scraped_content.json`: The content scraped from the web pages.
- `all_chunks.json`: The scraped content, broken down into smaller chunks.
- `all_extracted_topics.json`: The topics extracted from the content chunks.
- `synthetic_data.json`: The final generated synthetic data, with each data point saved as it is generated.

This structure provides a complete and real-time record of the data generation process.
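
Because the files appear as the run progresses, a small script can report how far a run has got. This is a rough sketch under stated assumptions: `run_progress` and `latest_run` are hypothetical helpers, not package functions, and "latest" is guessed from directory modification time because the workflow ID format is not documented here.

```python
from pathlib import Path
from typing import Dict

# File names as listed above.
EXPECTED_FILES = (
    "pipeline_state.json",
    "refined_queries.json",
    "search_results.json",
    "scraped_content.json",
    "all_chunks.json",
    "all_extracted_topics.json",
    "synthetic_data.json",
)


def run_progress(run_dir: Path) -> Dict[str, bool]:
    """Map each expected file name to whether it exists in `run_dir`."""
    return {name: (run_dir / name).is_file() for name in EXPECTED_FILES}


def latest_run(output_dir: Path) -> Path:
    """Return the most recently modified run directory under `output_dir`."""
    runs = [p for p in output_dir.iterdir() if p.is_dir()]
    if not runs:
        raise FileNotFoundError(f"no run directories in {output_dir}")
    return max(runs, key=lambda p: p.stat().st_mtime)
```

For example, `run_progress(latest_run(Path(os.environ["OUTPUT_DIR"])))` would show which stages of the most recent run have written output so far.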

## API Reference

### `generate_synthetic_data(user_query: str, refined_queries_count: Optional[int] = None, search_results_per_query: Optional[int] = None, rows_per_subtopic: Optional[int] = None, gemini_model_name: Optional[str] = None)`

Generate synthetic data based on a natural language prompt. The `user_query` is parsed to automatically determine the number of samples, data type, language, and a detailed description of the data to be generated.

**Categories Feature:**
When you specify categories within your domain (e.g., "cardiovascular and neurology" for medical domain), the pipeline will:
- Focus search queries specifically on those categories
- Generate more targeted and relevant content
- Distribute queries across all specified categories
- Use category-specific terminology and concepts

If no categories are specified, the pipeline will comprehensively cover the entire domain.

**Parameters:**
- `user_query` (str): **Required**. A natural language description of the data you want to generate. This query should implicitly or explicitly contain:
    - **Number of samples**: The total count of data entries to generate (e.g., "100"). (required)
    - **Data type**: The structure or format of the data (e.g., "QA pairs", "product reviews", "customer support conversations"). (required)
    - **Language**: The desired language for the generated data (e.g., "English", "French", "Egyptian_Arabic"). (required)
    - **Description**: A detailed explanation of the data's content and context. (required)
    - **Domain**: The desired domain for the generated data (e.g., "Finance", "Medical", "Law"). (optional)
    - **Categories**: Specific subcategories within the domain to focus on (e.g., "cardiovascular, neurology" for medical domain). (optional)
- `refined_queries_count` (int, optional): Number of refined search queries to generate. Defaults to a value from `.env` or internal settings.
- `search_results_per_query` (int, optional): Number of web search results to consider per refined query. Defaults to a value from `.env` or internal settings.
- `rows_per_subtopic` (int, optional): Number of synthetic data rows to generate per extracted subtopic. Defaults to a value from `.env` or internal settings.
- `gemini_model_name` (str, optional): The name of the Gemini model to use (e.g., "gemini-pro", "gemini-1.5-flash"). Defaults to "gemini-2.5-flash" or a value from `.env`.
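
One way to make sure a query carries every required element is to assemble it from named parts. The sketch below is only an illustration: `build_user_query` is a hypothetical helper, not part of the package, and the sentence template is an assumption about what the query parser handles well.

```python
from typing import Optional, Sequence


def build_user_query(
    samples: int,
    data_type: str,
    language: str,
    description: str,
    domain: Optional[str] = None,
    categories: Sequence[str] = (),
) -> str:
    """Assemble a prompt that states each required element explicitly."""
    parts = [f"Generate {samples} {language} {data_type}"]
    if domain:
        parts.append(f"in the {domain} domain")
    if categories:
        parts.append("covering " + ", ".join(categories))
    return " ".join(parts) + ". " + description.strip()


query = build_user_query(
    100,
    "QA pairs",
    "English",
    "Each row has two columns (question, answer).",
    domain="Medical",
    categories=["cardiovascular", "neurology"],
)
print(query)
```

The resulting string can then be passed as `user_query` to `generate_synthetic_data`.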


**Examples:**

```python
from Data_Generation_Agents import generate_synthetic_data

generate_synthetic_data(
    user_query="Generate 5000 diverse, contextually rich English-to-Egyptian Arabic translation pairs in the Law domain with varying sentence complexities, ensuring authentic colloquial Egyptian Arabic translations while preserving English technical terms, proper nouns, and specialized terminology untranslated. The data contains two columns (English, Egyptian Arabic).",
    refined_queries_count=25,
    search_results_per_query=5,
    rows_per_subtopic=5
)
```

```python
from Data_Generation_Agents import generate_synthetic_data

generate_synthetic_data(
    user_query="Generate 2000 finance classification examples in Arabic covering banking, insurance, and investment topics, the data contains two columns (Text, classification_type)",
    refined_queries_count=30,
    search_results_per_query=5,
    rows_per_subtopic=5,
    gemini_model_name="gemini-1.5-pro"
)
```

## Development

**Local Installation**

    git clone https://github.com/Omar-YYoussef/Data_Gen_Agent
    cd Data_Gen_Agent
    pip install -e .

## License

MIT License - see LICENSE file for details.

## Support

- Issues: GitHub Issues
- Email: omarjooo595@gmail.com
