Metadata-Version: 2.4
Name: datacrafter-ai
Version: 0.1.0
Summary: Datacrafter — AI-based, schema-driven synthetic data generator with a plugin architecture.
Author: Mahalakshmi Shanmuga Sundaram
License: MIT
Keywords: synthetic-data,data-generator,fake-data,yaml,csv,json,xml,datacrafter,ai-based,test-data,plugin
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Testing
Classifier: Topic :: Utilities
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: PyYAML>=6.0.1
Requires-Dist: Faker>=20.0.0
Requires-Dist: click>=8.1.7
Requires-Dist: python-dateutil>=2.8.2
Dynamic: license-file

# Datacrafter

**AI-based, schema-driven synthetic data generator with a plugin architecture.**

Design datasets in **YAML**, generate realistic **CSV / JSON / JSONL / XML / Parquet** files, and extend functionality with custom **providers** or **writers** — no core changes required.

---

## ✨ Features

* **Schema-driven** – Define structure, constraints, and output using YAML
* **Deterministic** – Use `seed` for reproducible datasets
* **Rich providers** – uuid, integer, float, boolean, categorical, datetime, person.*, text.*, geo
* **Advanced controls** – `unique`, `null_rate`, `regex`, distributions
* **Templating** – `${first}.${last}@domain.com`
* **Multiple formats** – CSV, JSON, JSONL, XML, Parquet
* **Plugin architecture** – Extend without modifying core
* **CLI + Python API**

---

## 📦 Installation

```bash
pip install datacrafter-ai
```

**Requirements:** Python 3.9+

---

## 🚀 Quickstart (CLI)

### 1. Create a schema (`examples/simple.yaml`)

```yaml
version: 1
seed: 42
rows: 20

fields:
  id:
    type: uuid

  name:
    type: person.name

  age:
    type: integer
    params:
      min: 18
      max: 60

output:
  format: csv
  path: ./output/simple.csv
```

---

### 2. Generate data

```bash
datacrafter generate --schema examples/simple.yaml
```

---

### 3. Output

```
./output/simple.csv
```

---

## 🧠 Quickstart (Python)

```python
from datacrafter.schema_loader import load_schema
from datacrafter.generator import Generator

schema = load_schema("examples/simple.yaml")

gen = Generator(schema)
rows = gen.generate()  # build the rows in memory
gen.write()            # write them to the path configured under `output`
```
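
Because `generate()` returns the rows in memory, they can be filtered or inspected before (or instead of) writing. A minimal sketch with stand-in data, assuming each row is a plain dict keyed by column name:

```python
# Stand-in for the list returned by gen.generate(); the exact row shape
# (a list of dicts keyed by column name) is an assumption.
rows = [
    {"id": "a1", "name": "Ada", "age": 36},
    {"id": "b2", "name": "Grace", "age": 17},
]

# Keep only rows that satisfy a predicate before exporting them.
adults = [r for r in rows if r["age"] >= 18]
print(len(adults))  # 1
```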

---

## 🧾 YAML Schema (v1)

| Key      | Type      | Required | Description               |
| -------- | --------- | -------- | ------------------------- |
| version  | int       | Yes      | Schema version (use 1)    |
| seed     | int       | No       | Deterministic output seed |
| rows     | int       | Yes*     | Number of rows            |
| fields   | map       | Yes*     | Column definitions        |
| output   | map       | Yes*     | Output configuration      |
| datasets | list<map> | No       | Multi-dataset support     |

> *Required when `datasets` is not used

---

## 📌 Field Definition

```yaml
<column_name>:
  type: <provider.name>
  params: {}
  unique: false
  null_rate: 0.0
  regex: null

  distribution:
    name: normal
    mean: 35
    std: 10
    min: 18
    max: 75

  categorical:
    values: [IN, US, DE]
    weights: [0.6, 0.3, 0.1]

  template: "${first}.${last}@${domain}"
  depends_on: ["first", "last", "domain"]
  transform: ["lower", "strip"]
```

---

## 📤 Output Configuration

```yaml
output:
  format: csv
  path: ./out/customers.csv
  options:
    delimiter: ","
    header: true
    encoding: "utf-8"
```

---

## 🧩 Built-in Providers

* **IDs** → `uuid`, `id.incremental`
* **Numeric** → `integer`, `float`
* **Boolean** → `boolean`
* **Text** → `text.lorem`, `text.short`, `text.word`, `string.regex`
* **Person** → `person.*`
* **Datetime** → `datetime`
* **Categorical** → `categorical`
* **Geo** → `geo.country`

---

## 🎛️ Constraints & Validation

* `unique` → Enforce value uniqueness within a column
* `null_rate` → Probability (0.0–1.0) that a value is null
* `regex` → Validate generated values against a pattern
* `distribution` → Control the statistical shape of numeric values
* `template` → Compose a field from other fields
* `depends_on` → Ensure referenced fields are generated first
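
Several of these controls compose. For instance, a derived email column might combine `template`, `depends_on`, `transform`, and `null_rate` (a schema sketch; the field names and domain are illustrative):

```yaml
fields:
  first: { type: person.first_name }
  last:  { type: person.last_name }
  email:
    template: "${first}.${last}@example.com"
    depends_on: ["first", "last"]
    transform: ["lower"]
    null_rate: 0.05   # ~5% of rows get a null email
```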

---

## 🖥️ CLI Reference

```bash
# Generate data
datacrafter generate --schema schema.yaml

# Validate schema
datacrafter validate --schema schema.yaml

# List providers & writers
datacrafter list providers
datacrafter list writers

# Create starter schema
datacrafter init --template minimal
```

---

## 🔌 Plugins

Install external plugins:

```bash
pip install datacrafter-healthcare
pip install datacrafter-parquet-writer
```

### Example plugin registration

```toml
[project.entry-points."datacrafter.providers"]
health = "dc_health.providers:register"

[project.entry-points."datacrafter.writers"]
parquet = "dc_parquet.writer:register"
```
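
The `register` hooks referenced by those entry points are plugin initializers. As a rough sketch of what one might look like (the `registry` interface, the `health.phone` provider name, and the provider signature are all assumptions, not Datacrafter's actual API):

```python
import random


def phone_number(params, rng):
    """Hypothetical provider: return a fake 10-digit phone number."""
    return "".join(str(rng.randint(0, 9)) for _ in range(10))


def register(registry):
    # `registry.add(name, callable)` is an assumed interface; consult the
    # plugin documentation for the real registration contract.
    registry.add("health.phone", phone_number)
```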

---

## 🧪 Example Schemas

### Customers (CSV)

```yaml
version: 1
rows: 5000

fields:
  id: { type: uuid, unique: true }
  first: { type: person.first_name }
  last:  { type: person.last_name }

output:
  format: csv
  path: ./out/customers.csv
```

---

### Events (JSONL)

```yaml
version: 1
rows: 10000

fields:
  event_id: { type: uuid, unique: true }
  user_id:  { type: id.incremental }

output:
  format: jsonl
  path: ./out/events.jsonl
```

---

### Articles (XML)

```yaml
version: 1
rows: 200

fields:
  uid: { type: uuid }

output:
  format: xml
  path: ./out/articles.xml
```

---

## 🛠️ Troubleshooting

* **PyPI name conflict** → Pick a different `name` in `pyproject.toml` before publishing
* **Non-reproducible output** → Set a `seed` in the schema
* **Uniqueness failures** → Widen the value domain (e.g. a larger `min`/`max` range)
* **Slow generation on large datasets** → Generate in smaller chunks

---

## 📦 Development

```bash
python -m pip install --upgrade build twine
python -m build
twine check dist/*
```

### Publish

```bash
twine upload dist/*
```

---

## 🔒 License

MIT © 2026 Mahalakshmi Shanmuga Sundaram

---
## 🏢 About

Datacrafter is developed and maintained by **DHS Tech Services**.

---

## 🙌 Acknowledgements

Inspired by modern synthetic data generation and schema-driven design.
