Metadata-Version: 2.4
Name: makeitup
Version: 0.1.0
Summary: LLM-based synthetic dataset generation
Author-email: Tomek Kopczynski <wonsek@gmail.com>
License-Expression: MIT
Project-URL: Homepage, https://github.com/tkopczynski/makeitup
Project-URL: Repository, https://github.com/tkopczynski/makeitup
Project-URL: Issues, https://github.com/tkopczynski/makeitup/issues
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: langchain>=0.1.0
Requires-Dist: langchain-openai>=0.0.5
Requires-Dist: openai>=1.0.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: pyarrow>=14.0.0
Requires-Dist: openpyxl>=3.1.0
Provides-Extra: dev
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: ruff>=0.8.0; extra == "dev"
Requires-Dist: scikit-learn>=1.3.0; extra == "dev"
Dynamic: license-file

# makeitup

Generate synthetic datasets with an LLM. Describe your columns in plain English and get realistic data back.

```python
from makeitup import make

df = make(
    columns={
        "name": "Person's full name",
        "age": "Age between 25 and 55",
        "email": "Work email address",
    },
    num_rows=100
)
```

## Quick Start

```bash
# Install
uv venv && source .venv/bin/activate
uv pip install -e .

# Configure
cp .env.example .env
# Add your OpenAI API key to .env
```
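Before calling `make`, it can help to confirm that the key is actually visible to the process. The sketch below assumes the key is stored under `OPENAI_API_KEY`, the variable the `openai` client reads by default; check `.env.example` for the name this project actually uses. `require_api_key` is a hypothetical helper, not part of makeitup.

```python
import os
from collections.abc import Mapping


def require_api_key(
    env: Mapping[str, str] = os.environ,
    name: str = "OPENAI_API_KEY",  # assumption: the variable name used in .env
) -> str:
    """Return the API key from `env`, or raise with a hint if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add it to .env or export it in your shell")
    return key
```

Calling `require_api_key()` at the top of a script fails fast with a readable message instead of an authentication error partway through generation.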

## Examples

### Basic Data

```python
from makeitup import make

# Customer data
df = make(
    columns={
        "customer_id": "Unique customer identifier",
        "name": "Customer full name",
        "email": "Email address",
        "signup_date": "Date when customer signed up, 2020-2024",
    },
    num_rows=100
)
```

### ML Dataset with Target Column

```python
from makeitup import make

df = make(
    columns={
        "tenure_months": "Months as customer, 1-60",
        "monthly_spend": "Monthly spending in USD, 10-500",
        "support_tickets": "Number of support tickets, 0-10",
    },
    target={
        "name": "churned",
        "prompt": "Boolean indicating if customer churned"
    },
    num_rows=500
)
```

### Data Quality Degradation

```python
from makeitup import make

# Generate a dataset with intentional quality issues for testing data pipelines
df = make(
    columns={
        "name": "Person's full name",
        "age": "Age between 20 and 60",
        "salary": "Annual salary in USD, 30000-150000",
    },
    num_rows=100,
    quality_issues=["nulls", "outliers"],  # Options: nulls, outliers, typos, duplicates
)
```
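The comment in the example above lists four options for `quality_issues`. A small guard like the one below can catch a misspelled option before any API call is made. `check_quality_issues` is a hypothetical helper, not part of makeitup, and the library may well perform its own validation.

```python
# The four option names come from the example above; everything else is illustrative.
ALLOWED_QUALITY_ISSUES = {"nulls", "outliers", "typos", "duplicates"}


def check_quality_issues(issues: list[str]) -> list[str]:
    """Return `issues` unchanged, or raise ValueError naming any unknown options."""
    unknown = sorted(set(issues) - ALLOWED_QUALITY_ISSUES)
    if unknown:
        raise ValueError(f"Unknown quality issues: {', '.join(unknown)}")
    return issues
```

For instance, `check_quality_issues(["nulls", "outlier"])` raises a `ValueError` that names `outlier`, which is easier to spot than a silently ignored option.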

### Save to File

```python
from makeitup import make

# CSV, JSON, Parquet, or Excel: the format is detected from the file extension
df = make(
    columns={"name": "Product name", "price": "Price in USD, 10-1000"},
    num_rows=200,
    output_path="products.csv"
)
```

## Output Formats

| Format | Extension |
|--------|-----------|
| CSV | `.csv` |
| JSON | `.json` |
| Parquet | `.parquet` |
| Excel | `.xlsx` |
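
As a rough illustration of extension-based detection, the sketch below maps a path's suffix to one of the four formats in the table. `detect_format` and its return labels are hypothetical; makeitup's own logic may differ, for example in how it treats uppercase or unknown extensions.

```python
from pathlib import Path

# Mirrors the table above; the format labels are illustrative.
FORMATS = {".csv": "csv", ".json": "json", ".parquet": "parquet", ".xlsx": "excel"}


def detect_format(output_path: str) -> str:
    """Return the format label for `output_path` based on its file extension."""
    suffix = Path(output_path).suffix.lower()  # assumption: case-insensitive match
    if suffix not in FORMATS:
        raise ValueError(
            f"Unsupported extension {suffix!r}; expected one of {sorted(FORMATS)}"
        )
    return FORMATS[suffix]
```

Checking a path this way before generating data avoids spending API calls on a run whose output cannot be written.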

## Requirements

- Python >= 3.12
- OpenAI API key

## Documentation

See [DEVELOPER.md](DEVELOPER.md) for technical details, API reference, and development setup.

## License

MIT. See the LICENSE file for details.
