Metadata-Version: 2.4
Name: chaos-engine
Version: 0.1.0
Summary: Inject configurable data quality chaos into clean datasets to stress-test DQ frameworks.
License: MIT
Keywords: data-quality,testing,great-expectations,soda,chaos-engineering
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Testing
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: pandas>=2.0
Requires-Dist: numpy>=1.24
Requires-Dist: pyyaml>=6.0
Requires-Dist: pyarrow>=12.0
Requires-Dist: pydantic>=2.0
Requires-Dist: rich>=13.0
Requires-Dist: click>=8.0
Requires-Dist: faker>=20.0
Provides-Extra: ge
Requires-Dist: great-expectations>=0.18; extra == "ge"
Provides-Extra: soda
Requires-Dist: soda-core>=3.0; extra == "soda"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Provides-Extra: all
Requires-Dist: chaos-engine[dev,ge,soda]; extra == "all"

# chaos-engine

> Inject configurable, reproducible data quality chaos into clean DataFrames — then prove your DQ framework catches it.

[![Python](https://img.shields.io/badge/python-3.10%2B-blue)]()
[![Tests](https://img.shields.io/badge/tests-39%20passed-brightgreen)]()
[![License](https://img.shields.io/badge/license-MIT-lightgrey)]()

---

## Why this exists

Every data team needs to stress-test its data quality framework (Great Expectations, Soda, dbt tests), but production data is usually off-limits in dev, and hand-crafted bad data is tedious to build and hard to reproduce.

**chaos-engine** solves this: give it a clean DataFrame and a YAML config, and it injects precisely the anomalies you want — nulls, duplicates, type mismatches, schema drift, late-arriving rows, statistical outliers — with a deterministic seed so CI is always reproducible.

The `ChaosReport` tells you exactly *what* was injected, so you can assert your DQ suite caught it:

```python
engine = ChaosEngine.from_yaml("chaos_config.yaml")
corrupted_df, report = engine.run(clean_df)

# Now prove Great Expectations detected what we injected
assert "email" in report.null_columns
assert ge_suite.validate(corrupted_df)["statistics"]["unsuccessful_expectations"] > 0
```

---

## GE Detection Matrix

Great Expectations **detected** each of the injected anomalies below (`schema_drop` refers to the `drop` option of the `schema_drift` injector):

| Injector     | Anomaly injected              | GE result  |
|--------------|-------------------------------|------------|
| nulls        | 10% of email column nulled    | DETECTED   |
| duplicates   | 5% duplicate customer IDs     | DETECTED   |
| outliers     | revenue values at 10σ         | DETECTED   |
| schema_drop  | `revenue` column removed      | DETECTED   |

Run `pytest tests/test_ge_suite.py -v -s` to reproduce this matrix (`-s` disables pytest's output capture, so anything the tests print is shown).
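
To make the four checks in the matrix concrete without depending on any particular GE version, here is a minimal sketch that uses plain pandas assertions as a stand-in for GE expectations. The `detect` helper, the expected schema, and the revenue range are all hypothetical and are not part of chaos-engine or Great Expectations.

```python
import pandas as pd

EXPECTED_COLUMNS = ["customer_id", "email", "revenue"]  # hypothetical schema
REVENUE_RANGE = (0, 10_000)                              # hypothetical valid range


def detect(df: pd.DataFrame) -> dict:
    """Report which anomaly classes are visible, using plain pandas checks."""
    cols = set(df.columns)
    low, high = REVENUE_RANGE
    return {
        # nulls: any missing value in the email column
        "nulls": "email" in cols and bool(df["email"].isna().any()),
        # duplicates: any repeated customer ID
        "duplicates": "customer_id" in cols
        and bool(df["customer_id"].duplicated().any()),
        # outliers: any revenue outside the accepted range
        "outliers": "revenue" in cols
        and bool(((df["revenue"] < low) | (df["revenue"] > high)).any()),
        # schema_drop: at least one expected column is missing
        "schema_drop": not cols.issuperset(EXPECTED_COLUMNS),
    }


clean = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
    "revenue": [120.0, 80.0, 95.0],
})
print(detect(clean))                            # every check is False
print(detect(clean.drop(columns=["revenue"])))  # schema_drop is True
```

A real suite would express the same ideas as expectations (not-null, unique, between, column-exists) and compare the failures against the `ChaosReport`.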

---

## Installation

```bash
# Core library
pip install chaos-engine

# With Great Expectations support
pip install "chaos-engine[ge]"

# With Soda support
pip install "chaos-engine[soda]"
```

For local development:

```bash
git clone https://github.com/yourhandle/chaos-engine
cd chaos-engine
pip install -e ".[dev,ge]"
pytest
```

---

## Quick start

```python
import pandas as pd
from chaos_engine import ChaosEngine

# 1. Load a clean dataset
df = pd.read_csv("clean_customers.csv")

# 2. Create engine from YAML config
engine = ChaosEngine.from_yaml("examples/chaos_config.yaml")

# 3. Run — input is never mutated
corrupted_df, report = engine.run(df)

# 4. Inspect what changed
print(report.summary())
# ChaosReport (seed=42)
#   Total mutations : 87
#   Injectors used  : duplicates, late_arriving, nulls, outliers, schema_drift, type_mismatch
#   [nulls] Injected 10 nulls into 'email' via random
#   [nulls] Injected 10 nulls into 'phone' via random
#   [duplicates] Injected 6 near duplicate rows
#   ...

# 5. Save output
corrupted_df.to_csv("corrupted_customers.csv", index=False)
report_json = report.to_json()
```

Or configure the engine programmatically, without YAML:

```python
engine = ChaosEngine(seed=42, injectors={
    "nulls":      {"enabled": True, "rate": 0.05, "columns": ["email"]},
    "duplicates": {"enabled": True, "rate": 0.03, "mode": "near"},
    "outliers":   {"enabled": True, "columns": ["revenue"], "sigma": 6},
})
corrupted_df, report = engine.run(df)
```

---

## CLI

```bash
# Run chaos injection from the command line
chaos-engine run examples/chaos_config.yaml clean_customers.csv \
    --output corrupted.parquet --format parquet \
    --report chaos_report.json

# Show which injectors are enabled in a config
chaos-engine inspect examples/chaos_config.yaml
```

---

## Injectors

### `nulls` — random or pattern-based nulls

```yaml
nulls:
  enabled: true
  rate: 0.05          # fraction of rows to null per column
  columns: [email, phone]
  strategy: random    # or: pattern
  # pattern: "^test_"
  # pattern_column: email
```
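
For intuition, the two strategies can be sketched in plain pandas and NumPy. This is an illustration under assumptions, not the library's implementation: the `inject_nulls` function is hypothetical, it assumes `rate` is applied independently to each column, and it assumes the `pattern` strategy nulls every row whose `pattern_column` matches the regex.

```python
import numpy as np
import pandas as pd


def inject_nulls(df, columns, rate, rng, pattern=None, pattern_column=None):
    """Return a copy of df with nulls injected (illustrative sketch only)."""
    out = df.copy()
    for col in columns:
        if pattern is None:
            # random strategy: null a fresh sample of `rate` of the rows
            n = int(round(rate * len(df)))
            positions = rng.choice(len(df), size=n, replace=False)
        else:
            # pattern strategy: null rows whose pattern_column matches the regex
            matches = df[pattern_column].astype(str).str.contains(pattern, regex=True)
            positions = np.flatnonzero(matches.to_numpy())
        out.iloc[positions, out.columns.get_loc(col)] = None
    return out


df = pd.DataFrame({
    "email": [f"test_{i}@example.com" if i < 3 else f"user{i}@example.com"
              for i in range(100)],
    "phone": [f"555-{i:04d}" for i in range(100)],
})
rng = np.random.default_rng(42)
random_nulls = inject_nulls(df, ["email", "phone"], rate=0.05, rng=rng)
pattern_nulls = inject_nulls(df, ["phone"], rate=0.0, rng=rng,
                             pattern="^test_", pattern_column="email")
print(random_nulls.isna().sum().to_dict())  # 5 nulls in each column
print(pattern_nulls["phone"].isna().sum())  # 3 rows match "^test_"
```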

### `duplicates` — exact and near-duplicate rows

```yaml
duplicates:
  enabled: true
  rate: 0.03
  mode: near          # exact | near (near fuzzes one field slightly)
```
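
The difference between the two modes matters for detection: an exact copy is caught by a whole-row duplicate check, while a near copy is only caught by a key-level check. The sketch below illustrates this with a hypothetical `inject_duplicates` function; the trailing-space fuzz and the `fuzz_column` parameter are assumptions chosen for the example, not the library's actual behaviour.

```python
import numpy as np
import pandas as pd


def inject_duplicates(df, rate, rng, mode="exact", fuzz_column=None):
    """Append duplicated rows to a copy of df (illustrative sketch only)."""
    n = int(round(rate * len(df)))
    picked = rng.choice(len(df), size=n, replace=False)
    dupes = df.iloc[picked].copy()
    if mode == "near":
        # Assumed fuzz: a trailing space, so the row no longer matches exactly
        dupes[fuzz_column] = dupes[fuzz_column].astype(str) + " "
    # Duplicates are appended, so they receive the highest row indices
    return pd.concat([df, dupes], ignore_index=True)


df = pd.DataFrame({"customer_id": range(100),
                   "name": [f"user{i}" for i in range(100)]})
rng = np.random.default_rng(42)
exact = inject_duplicates(df, rate=0.03, rng=rng)
near = inject_duplicates(df, rate=0.03, rng=rng, mode="near", fuzz_column="name")
print(exact.duplicated().sum())                # exact copies match on every column
print(near.duplicated().sum())                 # near copies evade whole-row checks
print(near["customer_id"].duplicated().sum())  # but still collide on the key
```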

### `type_mismatch` — wrong types in typed columns

```yaml
type_mismatch:
  enabled: true
  rate: 0.04
  columns:
    age: string        # inject words into int column
    revenue: negative  # inject negative numbers
    status: boolean    # inject "yes"/"no" into a categorical
```

Supported targets: `string`, `boolean`, `negative`, `future_date`, `empty_string`.
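
As a rough picture of what each target does to a cell, the sketch below maps the five target names to replacement values and writes them into a copy of the frame. The `bad_value` and `inject_type_mismatch` helpers are hypothetical, and the concrete replacement values (for example `"not_a_number"` or the year 2100) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def bad_value(target, original, rng):
    """Produce a replacement for one cell (all values are assumptions)."""
    if target == "string":
        return "not_a_number"
    if target == "boolean":
        return str(rng.choice(["yes", "no"]))
    if target == "negative":
        return -abs(float(original)) - 1.0
    if target == "future_date":
        return pd.Timestamp("2100-01-01")
    if target == "empty_string":
        return ""
    raise ValueError(f"unknown target: {target}")


def inject_type_mismatch(df, columns, rate, rng):
    """Return a copy of df with mistyped cells (illustrative sketch only)."""
    out = df.copy()
    n = int(round(rate * len(out)))
    for col, target in columns.items():
        j = out.columns.get_loc(col)
        out[col] = out[col].astype(object)  # allow mixed types in one column
        for pos in rng.choice(len(out), size=n, replace=False):
            out.iat[int(pos), j] = bad_value(target, out.iat[int(pos), j], rng)
    return out


df = pd.DataFrame({
    "age": np.arange(100) % 60 + 18,
    "revenue": np.linspace(10.0, 1000.0, 100),
    "status": ["active"] * 100,
})
corrupted = inject_type_mismatch(
    df, {"age": "string", "revenue": "negative", "status": "boolean"},
    rate=0.04, rng=np.random.default_rng(42),
)
print(corrupted["age"].map(type).value_counts())  # mostly int, a few str
```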

### `outliers` — statistical anomalies

```yaml
outliers:
  enabled: true
  rate: 0.02
  columns: [revenue, quantity]
  sigma: 6             # standard deviations beyond the mean
  mode: both           # high | low | both
```
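
One plausible reading of `sigma` and `mode` is that each injected value is placed exactly `sigma` standard deviations above or below the column mean. The sketch below works through that arithmetic; the `inject_outliers` function is hypothetical, and the real injector may compute or randomise its outliers differently.

```python
import numpy as np
import pandas as pd


def inject_outliers(df, columns, rate, sigma, rng, mode="both"):
    """Return a copy of df with outliers injected (illustrative sketch only)."""
    out = df.copy()
    n = int(round(rate * len(out)))
    for col in columns:
        # Statistics come from the clean column, before any mutation
        mean, std = df[col].mean(), df[col].std()
        positions = rng.choice(len(out), size=n, replace=False)
        if mode == "high":
            signs = np.ones(n)
        elif mode == "low":
            signs = -np.ones(n)
        else:  # "both": pick a side at random for each outlier
            signs = rng.choice([-1.0, 1.0], size=n)
        values = out[col].to_numpy(dtype=float, copy=True)
        values[positions] = mean + signs * sigma * std
        out[col] = values
    return out


df = pd.DataFrame({"revenue": np.linspace(0.0, 100.0, 1000)})
corrupted = inject_outliers(df, ["revenue"], rate=0.02, sigma=6,
                            rng=np.random.default_rng(42))
z = (corrupted["revenue"] - df["revenue"].mean()) / df["revenue"].std()
print((z.abs() > 5).sum())  # 20 rows now sit 6 standard deviations out
```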

### `late_arriving` — shifted timestamps

```yaml
late_arriving:
  enabled: true
  rate: 0.02
  columns: [created_at]
  max_delay_days: 14
  direction: past      # past | future
```
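
A sketch of the timestamp shift, assuming that `direction: past` moves the selected timestamps earlier by a whole number of days between 1 and `max_delay_days`. The `inject_late_arriving` function is hypothetical and only meant to show the mechanics on a timezone-naive datetime column.

```python
import numpy as np
import pandas as pd


def inject_late_arriving(df, columns, rate, max_delay_days, rng, direction="past"):
    """Return a copy of df with shifted timestamps (illustrative sketch only)."""
    out = df.copy()
    n = int(round(rate * len(out)))
    sign = -1 if direction == "past" else 1
    for col in columns:
        positions = rng.choice(len(out), size=n, replace=False)
        # Whole-day delays between 1 and max_delay_days, inclusive
        delays = rng.integers(1, max_delay_days + 1, size=n).astype("timedelta64[D]")
        values = out[col].to_numpy(copy=True)
        values[positions] = values[positions] + sign * delays
        out[col] = values
    return out


df = pd.DataFrame({"created_at": pd.date_range("2024-01-01", periods=100, freq="D")})
corrupted = inject_late_arriving(df, ["created_at"], rate=0.02, max_delay_days=14,
                                 rng=np.random.default_rng(42))
shift_days = (df["created_at"] - corrupted["created_at"]).dt.days
print((shift_days > 0).sum())  # 2 rows were moved into the past
```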

### `schema_drift` — structural changes

```yaml
schema_drift:
  enabled: true
  rename: {customer_id: cust_id}   # break joins
  drop: [internal_flag]             # remove expected columns
  add: {mystery_column: null}       # add unexpected columns
  reorder: false                    # set to true to shuffle column order
```
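
Each of the four options corresponds to a familiar pandas operation. The sketch below shows one way they could be combined; the `apply_schema_drift` function and the order in which it applies drop, rename, add, and reorder are assumptions, not the library's documented behaviour.

```python
import numpy as np
import pandas as pd


def apply_schema_drift(df, rename=None, drop=None, add=None, reorder=False, rng=None):
    """Return a structurally altered copy of df (illustrative sketch only)."""
    out = df.copy()
    out = out.drop(columns=drop or [])       # remove expected columns
    out = out.rename(columns=rename or {})   # break joins on renamed keys
    for name, value in (add or {}).items():
        out[name] = value                    # scalar is broadcast to every row
    if reorder:
        # Shuffle column positions with the seeded generator
        out = out.iloc[:, rng.permutation(out.shape[1])]
    return out


df = pd.DataFrame({
    "customer_id": [1, 2],
    "internal_flag": [True, False],
    "revenue": [10.0, 20.0],
})
drifted = apply_schema_drift(
    df,
    rename={"customer_id": "cust_id"},
    drop=["internal_flag"],
    add={"mystery_column": None},
)
print(list(drifted.columns))  # ['cust_id', 'revenue', 'mystery_column']
```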

---

## ChaosReport API

```python
corrupted_df, report = engine.run(df)

report.total_mutations         # int — total cells/rows affected
report.injector_names          # ['duplicates', 'nulls', 'outliers', ...]
report.null_columns            # ['email', 'phone']
report.duplicate_row_indices   # [200, 201, 202, ...]
report.schema_changes          # [{'rename_map': {...}}, ...]

report.by_injector("nulls")    # list[InjectionRecord]
report.summary()               # human-readable string
report.to_json()               # JSON string
report.to_dataframe()          # tidy pandas DataFrame
```

---

## Custom injectors

Register your own injector with a decorator:

```python
from chaos_engine import ChaosEngine, BaseInjector, InjectionRecord
import pandas as pd
import numpy as np

@ChaosEngine.register("encoding_chaos")
class EncodingInjector(BaseInjector):
    name = "encoding_chaos"

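    def inject(self, df, rng, config):
        col = config.get("column", "name")
        # Pick the row positions to corrupt, using the seeded generator
        rows = self._sample_rows(df, config.get("rate", 0.02), rng)
        # Cast to object so arbitrary strings can be written into the column
        df[col] = df[col].astype(object)
        for i in rows:
            df.at[df.index[i], col] = "Ren\u00e9e M\u00fcller \u4e2d\u6587"  # unicode chaos
        # Describe the mutation so it shows up in the ChaosReport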
        record = InjectionRecord(
            injector=self.name,
            description=f"Injected encoding chaos into '{col}'",
            affected_rows=rows.tolist(),
            affected_columns=[col],
        )
        return df, [record]

# Now use it in a config
engine = ChaosEngine(seed=42, injectors={
    "encoding_chaos": {"enabled": True, "column": "name", "rate": 0.03},
})
```

---

## Running tests

```bash
# All tests
pytest

# Unit tests only (no GE dependency)
pytest tests/test_injectors.py -v

# GE integration tests + detection matrix
pytest tests/test_ge_suite.py -v -s
```

---

## Architecture

```
ChaosEngine.run(df)
    │
    ├── NullInjector       → random / pattern nulls
    ├── DupeInjector       → exact / near duplicates
    ├── TypeInjector       → type mismatches
    ├── OutlierInjector    → statistical anomalies
    ├── LateInjector       → timestamp shifts
    └── SchemaInjector     → rename / drop / add columns (always last)
              │
              ▼
    ChaosReport            → audit trail of every mutation
```

Injectors run in a canonical order (schema drift last, since it renames columns others reference). The seeded `numpy.random.Generator` is threaded through every injector so the full pipeline is deterministic.
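
The determinism argument can be demonstrated in isolation: if one seeded `numpy.random.Generator` is passed through a fixed sequence of steps, every step consumes the same random draws on every run. The sketch below uses two hypothetical stand-in steps (`null_step`, `jitter_step`) and a hypothetical `run_pipeline` helper rather than the real injectors.

```python
import numpy as np
import pandas as pd


def null_step(df, rng):
    """Null two randomly chosen emails."""
    out = df.copy()
    positions = rng.choice(len(out), size=2, replace=False)
    out.iloc[positions, out.columns.get_loc("email")] = None
    return out


def jitter_step(df, rng):
    """Add Gaussian noise to revenue."""
    out = df.copy()
    out["revenue"] = out["revenue"] + rng.normal(0.0, 1.0, size=len(out))
    return out


def run_pipeline(df, seed, steps):
    rng = np.random.default_rng(seed)  # one generator shared by every step
    for step in steps:                 # fixed order keeps the random draws aligned
        df = step(df, rng)
    return df


df = pd.DataFrame({
    "email": [f"user{i}@example.com" for i in range(10)],
    "revenue": np.arange(10, dtype=float),
})
first = run_pipeline(df, seed=42, steps=[null_step, jitter_step])
second = run_pipeline(df, seed=42, steps=[null_step, jitter_step])
print(first.equals(second))  # True: same seed, same order, same corruption
```

Reordering the steps or creating a fresh generator inside each step would change which draws each step sees, which is why a canonical order and a single shared generator go together.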

---

## Stack

- Python 3.10+, pandas, numpy, PyArrow
- Great Expectations v1.x (integration tests)
- pydantic, PyYAML, click, rich
- pytest + pytest-cov (CI)

---

## License

MIT
