Metadata-Version: 2.4
Name: mostlyai-mock
Version: 0.2.2
Summary: Synthetic Mock Data
Project-URL: homepage, https://github.com/mostly-ai/mostlyai-mock
Project-URL: repository, https://github.com/mostly-ai/mostlyai-mock
Project-URL: documentation, https://mostly-ai.github.io/mostlyai-mock/
Author-email: MOSTLY AI <dev@mostly.ai>
License-Expression: Apache-2.0
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Financial and Insurance Industry
Classifier: Intended Audience :: Healthcare Industry
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Telecommunications Industry
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Libraries
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: litellm>=1.67.0
Requires-Dist: numpy>=1.26.3
Requires-Dist: pandas>=2.0.0
Requires-Dist: pyarrow>=14.0.0
Requires-Dist: pydantic<3.0.0,>=2.0.0
Requires-Dist: tenacity>=9.1.2
Provides-Extra: litellm-proxy
Requires-Dist: litellm[proxy]>=1.67.0; extra == 'litellm-proxy'
Provides-Extra: mcp
Requires-Dist: fastmcp<3.0.0,>=2.0.0; extra == 'mcp'
Description-Content-Type: text/markdown

# Synthetic Mock Data 🔮

[![Documentation](https://img.shields.io/badge/docs-latest-green)](https://mostly-ai.github.io/mostlyai-mock/) [![stats](https://pepy.tech/badge/mostlyai-mock)](https://pypi.org/project/mostlyai-mock/) ![license](https://img.shields.io/github/license/mostly-ai/mostlyai-mock) ![GitHub Release](https://img.shields.io/github/v/release/mostly-ai/mostlyai-mock)

Use LLMs to generate any Tabular Data towards your needs. Create from scratch, expand existing datasets, or enrich tables with new columns. Your prompts, your rules, your data.

## Key Features

* A light-weight python client for prompting LLMs for mixed-type tabular data.
* Select from a wide range of LLM endpoints and LLM models.
* Supports single-table as well as multi-table scenarios.
* Supports variety of data types: `string`, `integer`, `float`, `category`, `boolean`, `date`, and `datetime`.
* Specify context, distributions and rules via dataset-, table- or column-level prompts.
* Create from scratch or enrich existing datasets with new columns and/or rows.
* Tailor the diversity and realism of your generated data via temperature and top_p.

## Getting Started

1. Install the latest version of the `mostlyai-mock` python package.

```bash
pip install -U mostlyai-mock
```

2. Set the API key of your LLM endpoint (if not done yet)

```python
import os
os.environ["OPENAI_API_KEY"] = "your-api-key"
# os.environ["GEMINI_API_KEY"] = "your-api-key"
# os.environ["GROQ_API_KEY"] = "your-api-key"
```

Note: You will need to obtain your API key directly from the LLM service provider (e.g. for Open AI from [here](https://platform.openai.com/api-keys)). The LLM endpoint will be determined by the chosen `model` when making calls to `mock.sample`.

3. Create your first basic mock table from scratch

```python
from mostlyai import mock

tables = {
    "guests": {
        "prompt": "Guests of an Alpine ski hotel in Austria",
        "columns": {
            "nationality": {"prompt": "2-letter code for the nationality", "dtype": "string"},
            "name": {"prompt": "first name and last name of the guest", "dtype": "string"},
            "gender": {"dtype": "category", "values": ["male", "female"]},
            "age": {"prompt": "age in years; min: 18, max: 80; avg: 25", "dtype": "integer"},
            "date_of_birth": {"prompt": "date of birth", "dtype": "date"},
            "checkin_time": {"prompt": "the check in timestamp of the guest; may 2025", "dtype": "datetime"},
            "is_vip": {"prompt": "is the guest a VIP", "dtype": "boolean"},
            "price_per_night": {"prompt": "price paid per night, in EUR", "dtype": "float"},
            "room_number": {"prompt": "room number", "dtype": "integer", "values": [101, 102, 103, 201, 202, 203, 204]}
        },
    }
}
df = mock.sample(
    tables=tables,   # provide table and column definitions
    sample_size=10,  # generate 10 records
    model="openai/gpt-5-nano",  # select the LLM model (optional)
)
print(df)
#   nationality                 name  gender  age date_of_birth        checkin_time is_vip  price_per_night  room_number
# 0          FR          Jean Dupont    male   29    1994-03-15 2025-01-10 14:30:00  False            150.0          101
# 1          DE         Anna Schmidt  female   34    1989-07-22 2025-01-11 16:45:00   True            200.0          201
# 2          IT          Marco Rossi    male   45    1979-11-05 2025-01-09 10:15:00  False            180.0          102
# 3          AT         Laura Gruber  female   28    1996-02-19 2025-01-12 09:00:00  False            165.0          202
# 4          CH         David Müller    male   37    1987-08-30 2025-01-08 17:20:00   True            210.0          203
# 5          NL  Sophie van den Berg  female   22    2002-04-12 2025-01-10 12:00:00  False            140.0          103
# 6          GB         James Carter    male   31    1992-09-10 2025-01-11 11:30:00  False            155.0          204
# 7          BE        Lotte Peeters  female   26    1998-05-25 2025-01-09 15:45:00  False            160.0          201
# 8          DK        Anders Jensen    male   33    1990-12-03 2025-01-12 08:15:00   True            220.0          202
# 9          ES         Carlos Lopez    male   38    1985-06-14 2025-01-10 18:00:00  False            170.0          203
```

4. Create your first multi-table mock dataset

```python
from mostlyai import mock

tables = {
    "customers": {
        "prompt": "Customers of a hardware store",
        "columns": {
            "customer_id": {"prompt": "the unique id of the customer", "dtype": "string"},
            "name": {"prompt": "first name and last name of the customer", "dtype": "string"},
        },
        "primary_key": "customer_id",
    },
    "warehouses": {
        "prompt": "Warehouses of a hardware store",
        "columns": {
            "warehouse_id": {"prompt": "the unique id of the warehouse", "dtype": "string"},
            "name": {"prompt": "the name of the warehouse", "dtype": "string"},
        },
        "primary_key": "warehouse_id",
    },
    "orders": {
        "prompt": "Orders of a Customer",
        "columns": {
            "customer_id": {"prompt": "the customer id for that order", "dtype": "string"},
            "warehouse_id": {"prompt": "the warehouse id for that order", "dtype": "string"},
            "order_id": {"prompt": "the unique id of the order", "dtype": "string"},
            "text": {"prompt": "order text description", "dtype": "string"},
            "amount": {"prompt": "order amount in USD", "dtype": "float"},
        },
        "primary_key": "order_id",
        "foreign_keys": [
            {
                "column": "customer_id",
                "referenced_table": "customers",
                "prompt": "each customer has anywhere between 2 and 3 orders",
            },
            {
                "column": "warehouse_id",
                "referenced_table": "warehouses",
            },
        ],
    },
    "items": {
        "prompt": "Items in an Order",
        "columns": {
            "item_id": {"prompt": "the unique id of the item", "dtype": "string"},
            "order_id": {"prompt": "the order id for that item", "dtype": "string"},
            "name": {"prompt": "the name of the item", "dtype": "string"},
            "price": {"prompt": "the price of the item in USD", "dtype": "float"},
        },
        "foreign_keys": [
            {
                "column": "order_id",
                "referenced_table": "orders",
                "prompt": "each order has between 1 and 2 items",
            }
        ],
        "primary_key": "item_id",
    },
}
data = mock.sample(
    tables=tables,
    sample_size=2,
    model="openai/gpt-5",
    n_workers=1,
)
print(data["customers"])
#   customer_id             name
# 0   B0-100235  Danielle Rogers
# 1   B0-100236       Edward Kim
print(data["warehouses"])
#   warehouse_id                          name
# 0       B0-001  Downtown Distribution Center
# 1       B0-002     Westside Storage Facility
print(data["orders"])
#   customer_id warehouse_id    order_id                                               text   amount
# 0   B0-100235       B0-002  B0-3010021  Office furniture replenishment - desks, chairs...  1268.35
# 1   B0-100235       B0-001  B0-3010022  Bulk stationery order: printer paper, notebook...    449.9
# 2   B0-100235       B0-001  B0-3010023  Electronics restock: monitors and wireless key...    877.6
# 3   B0-100236       B0-001  B1-3010021  Monthly cleaning supplies: disinfectant, trash...   314.75
# 4   B0-100236       B0-002  B1-3010022  Breakroom essentials restock: coffee, tea, and...   182.45
print(data["items"])
#      item_id    order_id                                   name   price
# 0  B0-200501  B0-3010021                  Ergonomic Office Desk  545.99
# 1  B0-200502  B0-3010021              Mesh Back Executive Chair   399.5
# 2  B1-200503  B0-3010022   Multipack Printer Paper (500 sheets)  129.95
# 3  B1-200504  B0-3010022             Spiral Notebooks - 12 Pack   59.99
# 4  B2-200505  B0-3010023               27" LED Computer Monitor  489.95
# 5  B2-200506  B0-3010023            Wireless Ergonomic Keyboard  387.65
# 6  B3-200507  B1-3010021  Industrial Disinfectant Solution (5L)  148.95
# 7  B3-200508  B1-3010021  Commercial Trash Liners - Case of 100    84.5
# 8  B4-200509  B1-3010022        Premium Ground Coffee (2lb Bag)   74.99
# 9  B4-200510  B1-3010022         Bottled Spring Water (24 Pack)   34.95
```

5. Create your first self-referencing mock table with auto-increment integer primary keys

```python
from mostlyai import mock

tables = {
    "employees": {
        "prompt": "Employees of a company",
        "columns": {
            "employee_id": {"dtype": "integer"},
            "name": {"prompt": "first name and last name of the employee", "dtype": "string"},
            "boss_id": {"dtype": "integer"},
            "role": {"prompt": "the role of the employee", "dtype": "string"},
        },
        "primary_key": "employee_id",
        "foreign_keys": [
            {
                "column": "boss_id",
                "referenced_table": "employees",
                "prompt": "each boss has at most 3 employees",
            },
        ],
    }
}
df = mock.sample(tables=tables, sample_size=10, model="openai/gpt-5", n_workers=1)
print(df)
#   employee_id              name  boss_id                   role
# 0            1      Patricia Lee     <NA>              President
# 1            2  Edward Rodriguez        1       VP of Operations
# 2            3      Maria Cortez        1          VP of Finance
# 3            4     Thomas Nguyen        1       VP of Technology
# 4            5        Rachel Kim        2     Operations Manager
# 5            6     Jeffrey Patel        2      Supply Chain Lead
# 6            7      Olivia Smith        2  Facilities Supervisor
# 7            8      Brian Carter        3     Accounting Manager
# 8            9   Lauren Anderson        3      Financial Analyst
# 9           10   Santiago Romero        3     Payroll Specialist
```

6. Enrich existing data with additional columns

```python
from mostlyai import mock
import pandas as pd

tables = {
    "guests": {
        "prompt": "Guests of an Alpine ski hotel in Austria",
        "columns": {
            "gender": {"dtype": "category", "values": ["male", "female"]},
            "age": {"prompt": "age in years; min: 18, max: 80; avg: 25", "dtype": "integer"},
            "room_number": {"prompt": "room number", "dtype": "integer"},
            "is_vip": {"prompt": "is the guest a VIP", "dtype": "boolean"},
        },
        "primary_key": "guest_id",
    }
}
existing_guests = pd.DataFrame({
    "guest_id": [1, 2, 3],
    "name": ["Anna Schmidt", "Marco Rossi", "Sophie Dupont"],
    "nationality": ["DE", "IT", "FR"],
})
df = mock.sample(
    tables=tables,
    existing_data={"guests": existing_guests},
    model="openai/gpt-5-nano"
)
print(df)
#   guest_id           name nationality  gender  age  room_number is_vip
# 0        1   Anna Schmidt          DE  female   30          102  False
# 1        2    Marco Rossi          IT    male   27          215   True
# 2        3  Sophie Dupont          FR  female   22          108  False
```

## MCP Server

This repo comes with MCP Server. It can be easily consumed by any MCP Client by providing the following configuration:

```json
{
    "mcpServers": {
        "mostlyai-mock-mcp": {
            "command": "uvx",
            "args": ["--from", "mostlyai-mock[mcp]", "mcp-server"],
            "env": {
                "OPENAI_API_KEY": "PROVIDE YOUR KEY",
                "GEMINI_API_KEY": "PROVIDE YOUR KEY",
                "GROQ_API_KEY": "PROVIDE YOUR KEY",
                "ANTHROPIC_API_KEY": "PROVIDE YOUR KEY"
            }
        }
    }
}
```

For example:
- in Claude Desktop, go to "Settings" > "Developer" > "Edit Config" and paste the above into `claude_desktop_config.json`
- in Cursor, go to "Settings" > "Cursor Settings" > "MCP" > "Add new global MCP server" and paste the above into `mcp.json`

Troubleshooting:
1. If the MCP Client fails to detect the MCP Server, provide the absolute path in the `command` field, for example: `/Users/johnsmith/.local/bin/uvx`
2. To debug MCP Server issues, you can use MCP Inspector by running: `npx @modelcontextprotocol/inspector -- uvx --from mostlyai-mock[mcp] mcp-server`
3. In order to develop locally, modify the configuration by replacing `"command": "uv"` (or use the full path to `uv` if needed) and `"args": ["--directory", "/Users/johnsmith/mostlyai-mock", "run", "--extra", "mcp", "mcp-server"]`
