Metadata-Version: 2.4
Name: ghostdq
Version: 0.1.1
Summary: GhostDQ SDK: compute data-quality metrics locally and ship them to GhostDQ.
Project-URL: Homepage, https://ghostdq.io
Project-URL: Repository, https://github.com/jamolpe/ghostdq
Project-URL: Bug Tracker, https://github.com/jamolpe/ghostdq/issues
License: Apache-2.0
License-File: LICENSE
Keywords: data-quality,data-validation,dq,metrics
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Software Development :: Quality Assurance
Requires-Python: >=3.10
Requires-Dist: fastavro<2.0,>=1.9
Requires-Dist: pandas<3.0,>=2.0
Requires-Dist: pyarrow<19.0,>=15.0
Requires-Dist: pyyaml<7.0,>=6.0
Provides-Extra: dev
Requires-Dist: mypy<2.0,>=1.10; extra == 'dev'
Requires-Dist: pandas-stubs<3.0,>=2.0; extra == 'dev'
Requires-Dist: pytest-cov<6.0,>=5.0; extra == 'dev'
Requires-Dist: pytest<9.0,>=8.0; extra == 'dev'
Requires-Dist: ruff<1.0,>=0.5; extra == 'dev'
Requires-Dist: types-pyyaml<7.0,>=6.0; extra == 'dev'
Description-Content-Type: text/markdown

# ghostdq — Python SDK

[![PyPI](https://img.shields.io/pypi/v/ghostdq)](https://pypi.org/project/ghostdq/)
[![License: Apache-2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)

The **GhostDQ SDK** lets you compute data-quality metrics **locally** and ship only the aggregated numbers to the GhostDQ cloud — your raw data never leaves your infrastructure.

---

## Install

```bash
pip install ghostdq
```

CSV, Parquet, and Avro support all ship with the core install (`pandas`, `pyarrow`, and `fastavro` are regular dependencies). The only optional extra is `dev`, which adds the development tooling:

```bash
pip install "ghostdq[dev]"   # adds pytest, ruff, mypy, stubs
```

---

## Quick start

```python
from ghostdq import read_file, parse_contract, compute_metrics, GhostDQClient

# 1. Load your data
df = read_file("sales_2024.parquet")   # .csv / .parquet / .avro

# 2. Parse the contract (or fetch it from the API — see below)
with open("sales_contract.yaml", encoding="utf-8") as fh:  # close the file promptly
    contract = parse_contract(fh.read())

# 3. Compute metrics *locally* — no raw data leaves your machine
metrics = compute_metrics(df, contract.rules)
# → {"row_count": 120000, "null_rate:country": 0.02, ...}

# 4. Ship the metrics to GhostDQ
client = GhostDQClient(
    api_key="ghd_your_key",
    ingest_url="https://ingest.ghostdq.io",
)
result = client.create_run(dataset_id="<dataset-uuid>", metrics=metrics)
print(result.run_id, result.status)  # ⇒ <uuid>  pending
```

---

## CLI

```bash
# Validate a file against a local contract
ghostdq run \
  --dataset-id <uuid> \
  --file sales.csv \
  --contract contract.yaml \
  --api-key ghd_xxx \
  --ingest-url https://ingest.ghostdq.io

# Fetch the contract automatically from the API
ghostdq run \
  --dataset-id <uuid> \
  --file sales.parquet \
  --api-key ghd_xxx \
  --ingest-url https://ingest.ghostdq.io
```

The API key and ingest URL can also be supplied through environment variables instead of flags:

```bash
export GHOSTDQ_API_KEY=ghd_xxx
export GHOSTDQ_INGEST_URL=https://ingest.ghostdq.io
ghostdq run --dataset-id <uuid> --file sales.csv
```
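When using the Python API rather than the CLI, the same two variables can be read explicitly before constructing the client. The sketch below is only an illustration: `resolve_settings` is a hypothetical helper, not part of the SDK, and it is an assumption that `GhostDQClient` does not read these variables on its own.

```python
import os
from typing import Mapping

# Variable names as used by the CLI examples above.
REQUIRED = ("GHOSTDQ_API_KEY", "GHOSTDQ_INGEST_URL")


def resolve_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Hypothetical helper: collect connection settings from the environment."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {
        "api_key": env["GHOSTDQ_API_KEY"],
        "ingest_url": env["GHOSTDQ_INGEST_URL"],
    }
```

The returned dictionary uses the same keyword names as the Quick start, so it could be passed as `GhostDQClient(**resolve_settings())`.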

---

## Supported file formats

| Format  | Extension   | Engine       |
|---------|-------------|--------------|
| CSV     | `.csv`      | pandas       |
| Parquet | `.parquet`  | pyarrow      |
| Avro    | `.avro`     | fastavro     |

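To make the table concrete, here is a rough sketch of how an extension-based loader could dispatch to each engine. It is not the SDK's `read_file` implementation: `pick_engine` and `load_frame` are hypothetical names, and the case-insensitive extension match is an assumption. The pandas and fastavro imports are deferred so the dispatch logic can be used without them installed.

```python
from pathlib import Path

# Extension-to-engine mapping, copied from the table above.
ENGINES = {".csv": "pandas", ".parquet": "pyarrow", ".avro": "fastavro"}


def pick_engine(path: str) -> str:
    """Return the engine name for a file, based on its extension."""
    suffix = Path(path).suffix.lower()  # assumption: match case-insensitively
    if suffix not in ENGINES:
        raise ValueError(f"unsupported file format: {path!r}")
    return ENGINES[suffix]


def load_frame(path: str):
    """Hypothetical loader: read a supported file into a pandas DataFrame."""
    engine = pick_engine(path)
    import pandas as pd  # deferred import

    if engine == "pandas":
        return pd.read_csv(path)
    if engine == "pyarrow":
        return pd.read_parquet(path, engine="pyarrow")

    import fastavro  # deferred import

    with open(path, "rb") as fh:
        # fastavro.reader yields one dict per Avro record.
        return pd.DataFrame.from_records(list(fastavro.reader(fh)))
```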
---

## Supported rule types

| Rule             | Metric key(s)                                     |
|------------------|---------------------------------------------------|
| `row_count`      | `row_count`                                       |
| `null_rate`      | `null_rate:{column}`                              |
| `unique`         | `duplicate_count:{column}`                        |
| `value_range`    | `value_min:{column}`, `value_max:{column}`        |
| `allowed_values` | `disallowed_count:{column}`                       |

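The metric keys above can be illustrated with a small dependency-free sketch over a list of row dictionaries. This is not the SDK's `compute_metrics`; `column_metrics` is a hypothetical function, and the exact definitions are assumptions: nulls are `None`, `duplicate_count` counts non-null rows beyond the first occurrence of each value, and `disallowed_count` ignores nulls.

```python
from typing import Any, Optional


def column_metrics(
    rows: list[dict[str, Any]],
    column: str,
    allowed: Optional[set] = None,
) -> dict[str, Any]:
    """Hypothetical sketch: compute the README's metric keys for one column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]

    metrics: dict[str, Any] = {"row_count": len(rows)}
    # Share of rows whose value is missing (0.0 for an empty dataset).
    metrics[f"null_rate:{column}"] = (
        (len(values) - len(present)) / len(values) if values else 0.0
    )
    # Non-null rows that repeat an earlier value (assumed definition).
    metrics[f"duplicate_count:{column}"] = len(present) - len(set(present))
    if present:
        metrics[f"value_min:{column}"] = min(present)
        metrics[f"value_max:{column}"] = max(present)
    if allowed is not None:
        # Non-null values outside the allowed set.
        metrics[f"disallowed_count:{column}"] = sum(v not in allowed for v in present)
    return metrics


sample = [
    {"country": "ES", "amount": 10},
    {"country": "ES", "amount": 5},
    {"country": None, "amount": 7},
    {"country": "XX", "amount": 7},
]
print(column_metrics(sample, "country", allowed={"ES", "FR"}))
```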
---

## Local development

Requires Python **3.10+** (3.13 recommended). From the repo root:

```bash
python3.13 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
pytest tests
ruff check src tests
mypy src tests --ignore-missing-imports
```

---

## License & disclaimer

Licensed under [Apache License 2.0](LICENSE).

This software is provided **“as is”**, without warranty of any kind. You are responsible for evaluating whether it fits your use case and for any outcomes from using it. See the [LICENSE](LICENSE) for the full terms, including limitations of liability.
