PACKAGE: packages/sdk  (public PyPI package: ghostdq)
======================================================

PURPOSE
-------
The public, lightweight data-quality SDK. Installed by end-users via `pip install ghostdq`.
It computes metrics locally from a file and ships only the aggregated numbers to the
GhostDQ Ingest API. Raw data never leaves the user's machine.

CRITICAL DEPENDENCY RULE
------------------------
This package MUST NOT import from ghostdq_core. ghostdq_core pulls in SQLAlchemy,
boto3, psycopg, etc. — heavyweight infrastructure that has no place in a public SDK.
If you need a type or helper that exists in ghostdq_core, DUPLICATE it here.
The contract types in `ghostdq.contract` are a deliberate duplicate of the logic
in `ghostdq_core.contract` for this reason.
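
One way to keep this rule from regressing is a guard that scans SDK source for
forbidden imports. The sketch below is only an illustration: `forbidden_imports`
and `FORBIDDEN_ROOTS` are hypothetical names, not part of the package, and the
check is static (it parses source text and never imports anything).

```python
import ast

# Assumption: only the top-level package name matters for the rule.
FORBIDDEN_ROOTS = frozenset({"ghostdq_core"})


def forbidden_imports(source, forbidden=FORBIDDEN_ROOTS):
    """Return the forbidden modules imported anywhere in `source`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # Absolute `from x.y import z`; relative imports stay inside the SDK.
            names = [node.module]
        else:
            continue
        hits.extend(name for name in names if name.split(".")[0] in forbidden)
    return hits


ok_source = "import pandas as pd\nfrom ghostdq.contract import Contract\n"
bad_source = "from ghostdq_core.contract import Contract\n"

assert forbidden_imports(ok_source) == []
assert forbidden_imports(bad_source) == ["ghostdq_core.contract"]
```

A test could run this over every `.py` file under src/ghostdq/ and fail on any hit.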

MODULE LAYOUT
-------------
src/ghostdq/
  __init__.py         — public API surface, re-exports everything a user needs
  py.typed            — PEP 561 marker (tells mypy and other type checkers the package ships type hints)
  client.py           — backward-compat shim → ghostdq.export
  evaluate.py         — backward-compat shim → ghostdq.evaluation
  io_pandas.py        — backward-compat shim → ghostdq.reading

  contract/
    models.py         — Contract, RuleSpec, SchemaField
    parser.py         — ContractParser, parse_contract, required_columns

  reading/
    pandas.py         — PandasFileReader, read_file/read_csv/read_parquet/read_avro
    types.py          — PathLike

  metrics/
    engine.py         — MetricsEngine (pandas DataFrame)
    arrow.py          — ArrowMetricsEngine (PyArrow table / Parquet)
    streaming.py      — StreamingCsvMetricsEngine (chunked CSV)
    polars_engine.py  — PolarsMetricsEngine (optional: pip install ghostdq[polars])
    duckdb_engine.py  — DuckDBMetricsEngine (optional: pip install ghostdq[duckdb])
    router.py         — compute_metrics_file() backend auto-selection
    plans.py          — ColumnMetricsPlan, build_metric_plan
    accumulators.py   — streaming chunk state

  evaluation/
    evaluator.py      — RuleEvaluator, evaluate_rules, format_evaluation_line
    models.py         — RuleEvaluation

  export/
    client.py         — GhostDQClient, RunResult
    constants.py      — DEFAULT_INGEST_URL
    exceptions.py     — GhostDQAPIError

  cli/
    run.py            — `ghostdq run ...` implementation
    __init__.py       — main() entry point

tests/
  conftest.py              — shared fixtures (simple_df, contract_yaml_minimal/full)
  contract/
    test_parser.py         — parse_contract + metric_keys derivation
  reading/
    conftest.py            — tmp_csv, tmp_parquet
    test_pandas.py         — read_file for CSV + Parquet + Avro
  metrics/
    test_engine.py         — compute_metrics for all rule types
  evaluation/
    test_evaluator.py      — evaluate_rules + formatting
  export/
    test_client.py         — GhostDQClient with mocked urlopen (no network)
  cli/
    conftest.py            — contract_file, data_file
    test_run.py            — `ghostdq run` with mocked GhostDQClient

METRIC KEY CONTRACT (must match ghostdq_core.rules)
----------------------------------------------------
  row_count              → int    (total rows)
  null_rate:{col}        → float  [0.0, 1.0] (fraction of null values)
  duplicate_count:{col}  → int    (rows whose value occurs more than once; every occurrence is counted)
  value_min:{col}        → float  (smallest value in the column)
  value_max:{col}        → float  (largest value in the column)
  disallowed_count:{col} → int    (rows whose value is not in the rule's `values` list)
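
To make the key semantics concrete, here is a dependency-free sketch that derives
each key from a small list of dicts. It is not the SDK's pandas implementation;
the helper names and the sample rows are made up for illustration, and null
handling beyond what is shown is an assumption.

```python
from collections import Counter

rows = [
    {"id": 1, "status": "open", "amount": 10.0},
    {"id": 2, "status": "closed", "amount": None},
    {"id": 2, "status": "stale", "amount": 7.5},
    {"id": 3, "status": "open", "amount": 12.0},
]


def column(rows, name):
    return [row[name] for row in rows]


def null_rate(values):
    # Assumption: an empty column reports 0.0 rather than raising.
    return sum(v is None for v in values) / len(values) if values else 0.0


def duplicate_count(values):
    # Counts every row in a duplicate group, not just the "extra" ones.
    counts = Counter(values)
    return sum(1 for v in values if counts[v] > 1)


def disallowed_count(values, allowed):
    # Both sides are compared as strings, as described under DESIGN DECISIONS.
    allowed_as_str = {str(a) for a in allowed}
    return sum(1 for v in values if str(v) not in allowed_as_str)


amounts = [v for v in column(rows, "amount") if v is not None]

metrics = {
    "row_count": len(rows),
    "null_rate:amount": null_rate(column(rows, "amount")),
    "duplicate_count:id": duplicate_count(column(rows, "id")),
    "value_min:amount": min(amounts),
    "value_max:amount": max(amounts),
    "disallowed_count:status": disallowed_count(column(rows, "status"), ["open", "closed"]),
}

assert metrics["duplicate_count:id"] == 2      # both rows with id == 2
assert metrics["disallowed_count:status"] == 1  # only "stale"
```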

DESIGN DECISIONS
----------------
- CLI uses stdlib `argparse` + `urllib` (no click, no httpx) to keep deps minimal.
- Lazy imports inside _cmd_run keep `ghostdq --help` fast.
- compute_metrics deduplicates: if two rules share a metric key, it's computed once.
- _duplicate_count uses duplicated(keep=False) — counts ALL rows in a duplicate group,
  not just the "extra" ones. This matches the fail condition in ghostdq_core.rules.
- _disallowed_count casts both the column and the allowed list to str before comparing,
  so the result does not depend on whether YAML parsed a contract value as an int, bool, or str.
- GhostDQAPIError carries .status_code for programmatic handling by callers.

TESTING
-------
  cd <repo-root>
  pytest packages/sdk/tests -v

All 54 tests are pure unit tests — no network, no DB, no filesystem side effects
beyond pytest's tmp_path. Patch targets: `ghostdq.export.GhostDQClient` for CLI,
`ghostdq.export.client.urlopen` for client unit tests.
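
The no-network pattern for client tests looks roughly like the sketch below.
`post_metrics` is a hypothetical stand-in for the client's HTTP call, and the URL
is a placeholder. The sketch patches `urllib.request.urlopen` because everything
lives in one file; the real tests patch the name where the client module looks it
up, i.e. the `ghostdq.export.client.urlopen` target named above.

```python
import json
import urllib.request
from unittest import mock


def post_metrics(url, metrics):
    """POST aggregated metrics as JSON and return the HTTP status code."""
    body = json.dumps(metrics).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def test_post_metrics_without_network():
    fake_response = mock.MagicMock()
    fake_response.status = 202
    # urlopen() is used as a context manager, so __enter__ must yield the fake.
    fake_response.__enter__.return_value = fake_response

    with mock.patch("urllib.request.urlopen", return_value=fake_response) as fake_urlopen:
        status = post_metrics("https://ingest.example.invalid/runs", {"row_count": 3})

    assert status == 202
    sent_request = fake_urlopen.call_args.args[0]
    assert json.loads(sent_request.data) == {"row_count": 3}


test_post_metrics_without_network()
```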

VERSIONS
--------
  ghostdq 0.1.0  (M3 — first real release, goes to TestPyPI then PyPI)
  Python ≥ 3.10
  pandas ≥ 2.0, pyarrow ≥ 15, fastavro ≥ 1.9, pyyaml ≥ 6.0

NEXT STEPS (post-M3)
--------------------
- Publish to TestPyPI: `hatch build && twine upload --repository testpypi dist/*`
- Add `ghostdq check` subcommand for offline local-only validation (no API call)
- Spark metrics module (packages/sdk/src/ghostdq/metrics_spark.py)
- SQL metrics module for warehouse use cases
