Metadata-Version: 2.4
Name: chaoslake
Version: 0.1.2
Summary: Synthetic relational dataset generator with schema-driven chaos controls.
Author: Gourav Shokeen
License: MIT
Project-URL: Homepage, https://github.com/gouravshokeen/chaoslake
Project-URL: Bug Tracker, https://github.com/gouravshokeen/chaoslake/issues
Keywords: synthetic data,data generation,faker,testing,chaos
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: faker>=28.0.0
Requires-Dist: numpy>=1.26.0
Requires-Dist: polars>=1.0.0
Requires-Dist: pydantic>=2.8.0
Requires-Dist: pyyaml>=6.0.1
Requires-Dist: rich>=13.7.0
Requires-Dist: scipy>=1.13.0
Requires-Dist: typer>=0.12.3
Provides-Extra: dev
Requires-Dist: duckdb>=1.1.0; extra == "dev"
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: sqlalchemy>=2.0.0; extra == "dev"
Dynamic: license-file

# chaoslake

**Generate synthetic relational datasets from a YAML schema in seconds.**

`chaoslake` preserves foreign-key integrity across tables, supports statistical distributions, and can deliberately inject controlled data-quality issues (nulls, duplicates, drift, mixed date formats). That makes it well suited to testing data pipelines, ML models, and analytics dashboards.

[![PyPI](https://img.shields.io/pypi/v/chaoslake)](https://pypi.org/project/chaoslake/)
[![Python](https://img.shields.io/pypi/pyversions/chaoslake)](https://pypi.org/project/chaoslake/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

---

## Install

```bash
pip install chaoslake
```

To use the optional DuckDB backend and database introspection, install the `dev` extra (DuckDB, SQLAlchemy, and pytest):

```bash
pip install "chaoslake[dev]"
```

---

## Quick Start

### 1. Try it instantly — no schema needed

```bash
chaoslake quick
```

This generates a `demo_output/` folder containing a `users` table and a `transactions` table linked to it by a foreign key.

### 2. Create a schema interactively

```bash
chaoslake init
```

The command asks whether you want to answer a short questionnaire (table name, row count, columns, chaos settings) or start from a ready-made template. Either way, it writes `chaoslake.yaml` to the current directory.

### 3. Validate your schema

```bash
chaoslake check chaoslake.yaml
```
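
The schema reference in this README states two structural rules: each table has exactly one primary key, and every `foreign_key` column references `table.column`. The sketch below illustrates those two checks on a schema that has already been loaded into a Python dict. It is not chaoslake's validator; `check_schema` is a hypothetical helper, and the real `check` command may verify more than this.

```python
def check_schema(schema):
    """Return a list of problems; an empty list means both rules hold.

    Hypothetical helper for illustration, not part of chaoslake.
    """
    problems = []
    # Map each table name to the set of its column names for FK lookups.
    columns_by_table = {
        t["name"]: {c["name"] for c in t["columns"]} for t in schema["tables"]
    }
    for table in schema["tables"]:
        # Rule 1: exactly one column flagged as primary key.
        pk_count = sum(1 for c in table["columns"] if c.get("primary_key"))
        if pk_count != 1:
            problems.append(
                f"{table['name']}: expected exactly one primary key, found {pk_count}"
            )
        # Rule 2: every foreign key points at an existing table.column.
        for col in table["columns"]:
            if col.get("type") != "foreign_key":
                continue
            ref = col.get("references", "")
            parent, _, parent_col = ref.partition(".")
            if parent_col not in columns_by_table.get(parent, set()):
                problems.append(
                    f"{table['name']}.{col['name']}: unresolved reference {ref!r}"
                )
    return problems


schema = {
    "tables": [
        {"name": "customers", "columns": [
            {"name": "id", "type": "integer", "primary_key": True},
            {"name": "email", "type": "email", "unique": True},
        ]},
        {"name": "orders", "columns": [
            {"name": "id", "type": "integer", "primary_key": True},
            {"name": "customer_id", "type": "foreign_key",
             "references": "customers.id"},
        ]},
    ]
}
print(check_schema(schema) or "schema looks consistent")
```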

### 4. Generate the dataset

```bash
chaoslake generate --schema chaoslake.yaml --output ./output --format csv
```

Options:

| Flag | Default | Description |
|------|---------|-------------|
| `--schema` | `chaoslake.yaml` | Path to YAML schema |
| `--output` | `./output` | Output directory |
| `--format` | `csv` | `csv` or `parquet` |
| `--seed` | _(none)_ | Integer seed for reproducible output |
| `--engine` | `auto` | `polars`, `duckdb`, or `auto` |
| `--verbose` | `false` | Show per-column progress logs |
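
When generation is driven from a test suite or a script, the flags above can be assembled programmatically. The following is a minimal sketch that only uses the flags listed in the table; `build_generate_command` and `run_generate` are hypothetical helper names, and `run_generate` assumes the `chaoslake` executable is on `PATH`.

```python
import shlex
import subprocess


def build_generate_command(schema="chaoslake.yaml", output="./output",
                           fmt="csv", seed=None, engine="auto", verbose=False):
    """Build the argv list for `chaoslake generate` from the documented flags."""
    cmd = [
        "chaoslake", "generate",
        "--schema", schema,
        "--output", output,
        "--format", fmt,
        "--engine", engine,
    ]
    if seed is not None:
        # The seed is optional; omit the flag entirely when it is not set.
        cmd += ["--seed", str(seed)]
    if verbose:
        cmd.append("--verbose")
    return cmd


def run_generate(**kwargs):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_generate_command(**kwargs), check=True)


cmd = build_generate_command(fmt="parquet", seed=42)
print(shlex.join(cmd))
```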

---

## Schema Reference

```yaml
tables:
  - name: customers          # lowercase identifier
    rows: 1000
    columns:
      - name: id
        type: integer
        primary_key: true    # required — exactly one per table

      - name: full_name
        type: name           # pre-baked numpy name pool (~6M rows/sec)

      - name: email
        type: email
        unique: true

      - name: signup_date
        type: datetime
        range: [2020-01-01, 2024-12-31]

  - name: orders
    rows: 5000
    columns:
      - name: id
        type: integer
        primary_key: true

      - name: customer_id
        type: foreign_key
        references: customers.id   # always table.column

      - name: amount
        type: float
        range: [5, 5000]
        distribution: lognormal    # see distributions below

      - name: order_date
        type: datetime
        range: [2021-01-01, 2024-12-31]

    chaos:
      null_probability: 0.03         # 3 % of non-PK/FK values become null
      duplicate_probability: 0.01    # 1 % extra duplicate rows appended
      drift:                         # concept drift on a numeric column
        field: amount
        distribution: norm
        start_after_row: 4000
      format_inconsistency:          # randomly mix date formats
        fields: [order_date]
        style: mixed_us_dates
```
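
To make the `null_probability` and `duplicate_probability` settings concrete, here is a small standard-library sketch of what they describe: values outside the key columns are nulled with a given probability, and a proportional number of duplicate rows is appended. It is an illustration, not chaoslake's implementation. `apply_chaos` is a hypothetical function, and applying nulls before duplicates is an assumption.

```python
import random


def apply_chaos(rows, protected, null_probability, duplicate_probability, seed=None):
    """Null out unprotected values, then append duplicated rows.

    `protected` names the columns (primary and foreign keys) that are never nulled.
    """
    rng = random.Random(seed)
    noisy = [
        {
            key: None if key not in protected and rng.random() < null_probability else value
            for key, value in row.items()
        }
        for row in rows
    ]
    # Append copies of randomly chosen rows; the count is proportional to the input size.
    n_dupes = round(len(noisy) * duplicate_probability)
    dupes = [dict(rng.choice(noisy)) for _ in range(n_dupes)]
    return noisy + dupes


rows = [{"id": i, "customer_id": 1 + i % 100, "amount": float(i)} for i in range(1, 1001)]
noisy = apply_chaos(
    rows,
    protected={"id", "customer_id"},
    null_probability=0.03,
    duplicate_probability=0.01,
    seed=42,
)
print(len(noisy), "rows,", sum(r["amount"] is None for r in noisy), "null amounts")
```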

### Column types

| Type | Description |
|------|-------------|
| `integer` | Random integers. Add `primary_key: true` for sequential IDs. |
| `float` | Uniform floats. Combine with `distribution` for realistic shapes. |
| `string` | Random 12-char alphanumeric strings. |
| `name` | Full person names sampled from a pre-baked pool. |
| `email` | `firstname.lastname@domain.tld` — add `unique: true` for deduplication. |
| `datetime` | Timestamps within an optional `range`. |
| `foreign_key` | Samples from a parent table's primary key. Requires `references`. |
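
The `foreign_key` type is what keeps tables consistent: child values are drawn from the parent's primary keys, so no orphaned reference can appear. A minimal sketch of that idea using only the standard library follows. `sample_foreign_keys` is a hypothetical name, sampling uniformly with replacement is an assumption, and so are sequential IDs starting at 1.

```python
import random


def sample_foreign_keys(parent_keys, n_rows, seed=None):
    """Draw n_rows values (with replacement) from the parent's primary keys."""
    rng = random.Random(seed)
    return rng.choices(parent_keys, k=n_rows)


customer_ids = list(range(1, 1001))  # assumed: sequential primary keys 1..1000
order_customer_ids = sample_foreign_keys(customer_ids, 5000, seed=7)

# Every child value exists in the parent, so the set of orphans is empty.
orphans = set(order_customer_ids) - set(customer_ids)
print(len(order_customer_ids), len(orphans))  # 5000 0
```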

### Distributions

Set `distribution` on `float` columns; it also applies to `integer` columns when used with `drift`:

| Value | Shape |
|-------|-------|
| `lognormal` / `lognorm` | Right-skewed — good for prices, revenue |
| `normal` / `norm` | Bell curve |
| `uniform` | Flat |
| `expon` | Exponential decay |
| `poisson` | Integer-valued counts |
| Any `scipy.stats` name | Falls back to SciPy, e.g. `beta`, `gamma` |
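
The difference between a flat and a right-skewed column is easiest to see by comparing the mean and the median. The sketch below uses Python's `random` module rather than chaoslake itself. The lognormal parameters (`mu=5.0`, `sigma=1.0`) are arbitrary assumptions, since this README does not say how a column's `range` maps to distribution parameters.

```python
import random
import statistics

rng = random.Random(0)

# Flat: every value in [5, 5000] is equally likely.
uniform = [rng.uniform(5, 5000) for _ in range(10_000)]

# Right-skewed: many small values and a long tail of large ones.
skewed = [rng.lognormvariate(5.0, 1.0) for _ in range(10_000)]

for label, values in (("uniform", uniform), ("lognormal", skewed)):
    print(label, round(statistics.mean(values)), round(statistics.median(values)))
```

For the lognormal sample the mean sits well above the median, which is the signature of the long right tail seen in prices and revenue.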

---

## All Commands

```bash
chaoslake --help
```

| Command | What it does |
|---------|-------------|
| `generate` | Generate tables from a YAML schema |
| `init` | Create `chaoslake.yaml` (interactive or static template) |
| `check` | Validate a schema file without generating data |
| `quick` | One-command demo — no schema file needed |
| `introspect` | Reflect a live database and emit a Chaoslake YAML schema |
| `bench` | Run the built-in benchmark (see Performance) |

### Introspect a database

```bash
chaoslake introspect --db sqlite:///mydb.db
chaoslake introspect --db postgresql://user:pass@localhost/mydb --tables customers,orders
```

Introspection requires SQLAlchemy, which is installed by the `dev` extra: `pip install "chaoslake[dev]"`.

---

## Reproducible Output

Pass the same integer `--seed` on every run to get bit-identical datasets:

```bash
chaoslake generate --schema chaoslake.yaml --output ./output --seed 42
```
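
One way to confirm that two runs with the same seed really produced identical files is to hash both output directories and compare the digests. This is a sketch using only the standard library. `directory_digest` is a hypothetical helper, and the temporary directories below stand in for two real output folders.

```python
import hashlib
import tempfile
from pathlib import Path


def directory_digest(path):
    """Combine the relative name and bytes of every file under `path` into one SHA-256."""
    root = Path(path)
    digest = hashlib.sha256()
    # Sort so the result does not depend on filesystem listing order.
    for file in sorted(p for p in root.rglob("*") if p.is_file()):
        digest.update(file.relative_to(root).as_posix().encode("utf-8"))
        digest.update(file.read_bytes())
    return digest.hexdigest()


# Demo with stand-in data: two identical folders, then one modified file.
with tempfile.TemporaryDirectory() as run_a, tempfile.TemporaryDirectory() as run_b:
    for folder in (run_a, run_b):
        Path(folder, "users.csv").write_text("id,name\n1,Ada\n")
    same = directory_digest(run_a) == directory_digest(run_b)

    Path(run_b, "users.csv").write_text("id,name\n1,Bob\n")
    different = directory_digest(run_a) != directory_digest(run_b)

print(same, different)  # True True
```

In practice, point `directory_digest` at the `--output` directories of two separate runs and compare the results.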

---

## Performance

Benchmarked on an Apple M-series machine (1.1M rows, 3 columns):

| Engine | Rows / sec | Notes |
|--------|-----------|-------|
| Polars (default) | ~6 000 000 | |
| DuckDB | _(not reported)_ | Auto-selected above 500k total rows |

Run the benchmark yourself:

```bash
chaoslake bench
```

---

## License

MIT © Gourav Shokeen
