Metadata-Version: 2.4
Name: conformare
Version: 0.2.7
Summary: Capture the authored dataframe transformation pipeline (Narwhals + PySpark) and profile data at each step to produce lineage diagrams and data-quality docs.
Project-URL: Homepage, https://kaelonlloyd.github.io/conformare-docs/
Project-URL: Documentation, https://kaelonlloyd.github.io/conformare-docs/
Author-email: Kaelon Lloyd <kaelonlloyd@gmail.com>
License: PolyForm-Noncommercial-1.0.0
License-File: LICENSE
Keywords: data-governance,data-lineage,data-quality,governance,great-expectations,lineage,narwhals,pandas,pii,profiling,pyspark
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: Free for non-commercial use
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: executing>=2.0
Requires-Dist: narwhals>=1.0
Provides-Extra: dbt
Requires-Dist: pyyaml>=5; extra == 'dbt'
Provides-Extra: dev
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pandas>=2.0; extra == 'dev'
Requires-Dist: pyarrow>=12.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: pyyaml>=5; extra == 'dev'
Requires-Dist: ruff<0.16,>=0.15; extra == 'dev'
Requires-Dist: scikit-learn>=1.3; extra == 'dev'
Provides-Extra: gx
Requires-Dist: great-expectations>=0.18; extra == 'gx'
Provides-Extra: sklearn
Requires-Dist: scikit-learn>=1.3; extra == 'sklearn'
Provides-Extra: spark
Requires-Dist: pyspark>=3.4; extra == 'spark'
Provides-Extra: test
Requires-Dist: pandas>=2.0; extra == 'test'
Requires-Dist: pyarrow>=12.0; extra == 'test'
Requires-Dist: pytest>=7.0; extra == 'test'
Requires-Dist: pyyaml>=5; extra == 'test'
Provides-Extra: whylogs
Requires-Dist: whylogs>=1.3; extra == 'whylogs'
Description-Content-Type: text/markdown

# Conformare

[![CI](https://github.com/kaelonlloyd/conformare/actions/workflows/ci.yml/badge.svg)](https://github.com/kaelonlloyd/conformare/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/conformare)](https://pypi.org/project/conformare/)
[![Python versions](https://img.shields.io/pypi/pyversions/conformare)](https://pypi.org/project/conformare/)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![License: PolyForm Noncommercial 1.0.0](https://img.shields.io/badge/license-PolyForm%20Noncommercial%201.0.0-blue.svg)](https://polyformproject.org/licenses/noncommercial/1.0.0)

**[Live demo and reports](https://kaelonlloyd.github.io/conformare-docs/)**: explore real interactive reports generated by the examples, and read the [docs](https://kaelonlloyd.github.io/conformare-docs/).

**The governance layer for analytical outputs built on top of governed data.**

Data science and analytics teams build models, features, scores and reports on top of the
structured datasets Data Engineering delivers, in imperative pandas / PySpark / Narwhals code
that no SQL model graph or data catalogue reaches. Conformare governs *that* layer. It captures
the authored transformation as it runs, profiles the data at each step, and records the
**risks, mitigations, owners and business definitions** behind it. It then renders everything
as one self-contained, interactive HTML report and rolls every run up into a cross-pipeline
fleet view. It also **inherits the governance Data Engineering already attached** to the inputs
(table comments, dbt `meta`, an OpenMetadata catalogue) so a developer sees the upstream risks
on the data they consume.

Works with **Narwhals**, **PySpark**, and **native pandas**.

## What it does

Conformare does two jobs for a data-processing pipeline:

1. **Governance** : track the **risks, mitigations and owners**, and the **business
   definitions** implemented by each step.
2. **Process & data** : document the **end-to-end process**, flag **PII / sensitive
   data**, and **profile** the data (counts, distributions, null rates, outliers,
   expectations) at each step.

It exists to support the **development, diagnostics, and governance** of the **analytical work
built on top of governed data**: not the data platform or the engineering pipelines that feed
it, but the models, features, scores and reports a data science or analytics team produces from
the datasets it is given.

## Mission

Data Engineering's outputs are governed: modelled, contracted, catalogued, lineage-tracked.
What teams **build on top of them** usually is not. Data science and analytics work, such as
feature engineering, forecasts, customer-lifetime-value, scoring and board reports, is
imperative, often exploratory, frequently never materialised as a governed table, and produced
under deadline. The assumptions and risks behind it live in someone's head or a buried comment,
and the governance, when it exists at all, is written **after the fact** in documents that drift
out of sync with the code.

That gap is widest exactly where it matters most: these analytical outputs **drive decisions**,
such as pricing, risk and strategy, yet have no governance tooling of their own.

Conformare's mission is to be that missing layer. Author risks, mitigations, owners and
definitions next to the analytical logic they describe; **inherit** the governance Data
Engineering already attached to the inputs; generate the documentation *from* the implementation
so it stays current; and **flag when the code drifts** from the governance it claims.

## Two ways to use it

- **Integrated (intrusive)** : add profilers and `describe()` / `risk()` context in your
  code. Unlocks the full feature set: profiling, lineage, sensitivity, expectations and
  governance.
- **Non-intrusive** : get governance and process documentation **without rewriting** your
  pipeline, via **docstring tagging** (risk / mitigation / owner / purpose declared in
  docstrings) or **bootstrapping** (instrument an unmodified script from a separate entry
  point).

See [Choosing an integration style](#choosing-an-integration-style) for the trade-offs.

## Install

```bash
pip install conformare           # core: Narwhals + executing
pip install "conformare[spark]"  # + PySpark
pip install "conformare[gx]"     # + Great Expectations (optional validation profiler)
```

## Quick start

```python
import narwhals as nw
import pandas as pd
import conformare as cf

cf.trackNarwhals()
cf.set_profiles({"*": [cf.rowCount, cf.dataSize, cf.histogram(columns="all")]})

with cf.describe("Clean customers", purpose="Keep UK adults only",
                 definition_owner="data-governance",
                 risks=cf.risk("privacy.pii_exposure", "compliance.gdpr",
                               mitigation="Drop email before export", owner="data-governance")):
    customers = nw.from_native(pd.read_csv("customers.csv"))
    adults = customers.filter(nw.col("age") >= 18)

cf.to_html("report.html", title="Customer pipeline")   # open in any browser
```

Existing **PySpark** or **native pandas** code uses the same API: call `cf.trackSpark()` or
`cf.trackPandas()` and run your pipeline **unchanged**:

```python
# `cf` is imported as in the quick start; `df` and `orders` are existing Spark DataFrames.
cf.trackSpark()                               # the only line you add
active   = df.filter(df.status == "active")   # tracked automatically
enriched = active.join(orders, on="id")       # tracked, two parents
```
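
To make the "tracked, two parents" comment concrete, here is a minimal, standard-library sketch of a lineage graph in which a join node keeps two parents, rendered as Mermaid flowchart text. The `Node` class and `render_mermaid` function are hypothetical illustrations, not Conformare's internal data model or its `to_mermaid()` output format.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One authored step; `parents` are the frames it was derived from."""
    name: str
    op: str
    parents: list["Node"] = field(default_factory=list)


def render_mermaid(nodes: list[Node]) -> str:
    """Render the nodes as a Mermaid flowchart, one edge per parent link."""
    lines = ["flowchart TD"]
    for node in nodes:
        lines.append(f'    {node.name}["{node.name}: {node.op}"]')
    for node in nodes:
        for parent in node.parents:
            lines.append(f"    {parent.name} --> {node.name}")
    return "\n".join(lines)


# Hypothetical mirror of the Spark snippet above: a filter has one parent,
# a join has two.
df = Node("df", "read")
orders = Node("orders", "read")
active = Node("active", "filter", parents=[df])
enriched = Node("enriched", "join", parents=[active, orders])

diagram = render_mermaid([df, orders, active, enriched])
print(diagram)
```

Pasting the printed text into any Mermaid renderer should show `enriched` with two incoming edges, one from `active` and one from `orders`.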

## Choosing an integration style

|  | Integrated (explicit) | Docstring tagging | Bootstrapping |
|---|---|---|---|
| Change to pipeline code | profilers + `describe`/`risk` inline | docstrings only | none (separate entry point) |
| What you get | everything | risk / mitigation / owner / purpose docs | process tracking + grouping + profiling |
| Best for | new code, deep diagnostics, full report | adding governance with little code change | documenting a script you cannot or will not edit |

- **Explicit integration** (the *Integrated* column) gives the richest result: every profiler,
  the complete lineage and column-level detail, data-quality checkpoints, and governance, all
  in one report. Best when you own the code and want both diagnostics and a full governance
  artifact.
- **Indirect integration** (docstring tagging or bootstrapping) trades feature coverage for
  **zero or low intrusion**:
  - *Docstrings* keep governance (risks, owners, definitions) inside the function that
    implements the concept, so the two are hard to let drift apart and hot paths need no
    extra imports. With `track_functions()` on, a `Conformare:` docstring block is applied
    automatically; see the [docstring tagging example](https://kaelonlloyd.github.io/conformare-docs/examples/docstring-governance.html).
  - *Bootstrapping* documents and profiles an **unmodified** production script from the
    outside, ideal for legacy or third-party pipelines and for audits.
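
As a rough illustration of the docstring-tagging idea, the sketch below pulls `key: value` pairs out of a `Conformare:` block using only the standard library. The block layout and the field names shown here are assumptions made for the example; the linked docstring tagging example is the reference for the syntax Conformare actually accepts, and `parse_governance_block` is a hypothetical helper, not part of the library.

```python
import inspect


def parse_governance_block(func, marker="Conformare:"):
    """Return the `key: value` pairs that follow `marker` in func's docstring."""
    doc = inspect.getdoc(func) or ""
    tags = {}
    in_block = False
    for line in doc.splitlines():
        stripped = line.strip()
        if stripped == marker:
            in_block = True
            continue
        if in_block:
            if not stripped or ":" not in stripped:
                break  # a blank or non key/value line ends the block
            key, _, value = stripped.partition(":")
            tags[key.strip()] = value.strip()
    return tags


def clean_customers(rows):
    """Keep adult customers only.

    Conformare:
        purpose: Keep adults only
        risk: privacy.pii_exposure
        mitigation: Drop email before export
        owner: data-governance
    """
    # Assumed layout above: one `key: value` pair per line.
    return [row for row in rows if row["age"] >= 18]


tags = parse_governance_block(clean_customers)
print(tags)
```

The point of the design is that the governance text travels with the function: moving or deleting the function moves or deletes its risk and owner declarations too.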

## Features at a glance

- **Process map & lineage** : a diagram of the pipeline **as authored** (no engine plan
  required), with column-level lineage, a created-column catalog, and each node's
  operation logic shown inline.
- **Per-step profiling** : row/column counts, data size, histograms, null fractions and
  IQR outliers at each step, plus a *distribution follower* to watch a column evolve.
- **Data-quality checkpoints** : drop [Great Expectations](https://greatexpectations.io/)
  in at any step and see exactly where a contract starts failing (with severities).
- **PII / sensitivity** : name-based heuristics flag candidate PII, and the report shows
  whether each sensitive column reaches a written output.
- **Governance** : risks, mitigations, owners, business **definition owners**, Markdown
  context details, a process-wide description, and a governance ranking (an owned risk
  ranks as low-concern). Surfaced as a risk register and a context register.
- **Self-contained HTML report** : one interactive page (diagram, column highlighter,
  KPIs, dark mode); no CDN, no build step.
- **Formal risk checklist** : export the risk register as a sign-off-ready Markdown
  document (`to_risk_checklist`) with blank columns and a sign-off block for a
  governance team to review, comment, and date, keeping an auditable trail.
- **Three backends, one report** : Narwhals (new code), PySpark and native pandas
  (existing code, tracked in place).
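
The PII bullet above describes name-based heuristics and a check on whether a sensitive column reaches a written output. A minimal sketch of that idea follows; the token list in `PII_NAME_PATTERN` and both helper functions are hypothetical and are not Conformare's built-in rules.

```python
import re

# Hypothetical name tokens; real rules would be broader and tuned per organisation.
PII_NAME_PATTERN = re.compile(
    r"(^|_)(email|phone|ssn|dob|name|address|postcode|ip)($|_)", re.IGNORECASE
)


def candidate_pii(columns):
    """Flag columns whose underscore-separated name contains a PII-like token."""
    return [column for column in columns if PII_NAME_PATTERN.search(column)]


def sensitive_reaching_output(input_columns, output_columns):
    """Map each flagged input column to whether it is present in the written output."""
    written = set(output_columns)
    return {column: column in written for column in candidate_pii(input_columns)}


inputs = ["customer_id", "email", "first_name", "age", "ip_address", "zip"]
outputs = ["customer_id", "first_name", "age"]

flagged = candidate_pii(inputs)
exposure = sensitive_reaching_output(inputs, outputs)
print(flagged)
print(exposure)
```

Matching on whole underscore-delimited tokens keeps `zip` from being flagged just because it contains `ip`. A name heuristic only nominates candidates, so manual tagging remains the way to confirm or correct them.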

## Backends

- **`trackNarwhals()`** : for new, dataframe-agnostic code on
  [Narwhals](https://narwhals-dev.github.io/narwhals/); patches the `nw.from_native`
  chokepoint.
- **`trackSpark()`** : for existing PySpark code, tracked **in place with zero changes**.
- **`trackPandas()`** : for existing **native pandas** code; tracks idiomatic indexing
  like `df[df.col == 1]`, `df[["a","b"]]`, `query`, `merge`, `groupby`.
- **`trackAll()`** : adapters plus automatic function-boundary tracking, for mixed codebases.

## Public API

| Call | Purpose |
|---|---|
| `trackNarwhals()` / `trackSpark()` / `trackPandas()` / `trackAll()` | Start tracking the chosen backend(s). |
| `set_profiles({...})` / `with profile(...)` | Registry mapping operations to profilers / scoped overlay on top of it. |
| `with force_profile(..., cache=)` | Opt-in profiling at a chosen point (optionally cache on Spark). |
| `with describe(...)` / `with risk(...)` | Annotate code with purpose / governance risks (owner, mitigation, definition owner, Markdown details). |
| `describe_process(description, risks=...)` | Process-wide description and risks. |
| `register_risk(...)` | Extend the built-in risk catalog. |
| `mark_sensitive()` / `classify_column()` | Manual / heuristic sensitivity tagging. |
| `@opaque` / `opaque_module(*prefixes)` | Record a function/library call as one node, suppressing its internals (`pyspark.ml` opaque by default). |
| `@track_step` / `track_functions()` | Function-boundary tracking (explicit / automatic, including docstring tagging). |
| `environment()` / `in_notebook()` / `mark_user_packages(*names)` | Detect the runtime (Databricks / Jupyter / IPython / Python); opt your own pip-installed pipeline code into user-code tracking. |
| `bootstrap(run, docs=[doc(...)], ...)` | Instrument an unmodified script from the outside, run it, write a report. |
| `to_mermaid()` / `to_json()` / `to_html()` | Export the lineage (Mermaid / JSON / interactive HTML). |
| `to_risk_checklist(path, process=, reviewers=)` | Export the risk register as a formal, sign-off-ready Markdown checklist for a governance team. |
| `restore()` | Unpatch everything (captured lineage is kept). |

## Profilers

A **profiler** measures something about the data at a step (for example a row count or a
column's distribution) and attaches the result to that node in the report. You choose
which profilers run on which operations, and they execute as the pipeline runs.

Built-in profilers:

- **`rowCount`** : number of rows.
- **`columnCount`** : number of columns.
- **`dataSize`** : approximate in-memory size (with the column count).
- **`histogram(columns=...)`** : per-column distribution (numeric bins, or top values for categorical columns).
- **`nullFraction(columns=...)`** : fraction of nulls per column.
- **`iqrOutliers(columns="all", k=1.5)`** : flags values outside Tukey's IQR fences and summarises the outliers per column.
- **`greatExpectations(*expectations, hard_severities=())`** : runs [Great Expectations](https://greatexpectations.io/) checks as a validation checkpoint at a step, showing which pass or fail (with severities). Accepts native GX objects or portable dicts. Optional dependency (`pip install "conformare[gx]"`); degrades to a status note if absent.
- **`whylogs(columns=...)`** : optional [whylogs](https://github.com/whylabs/whylogs) profile summary (requires whylogs).
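
For readers unfamiliar with the two simplest column statistics in this list, here is a standard-library sketch of a null fraction and of Tukey's IQR fences with `k=1.5`. The quartile method (`statistics.quantiles` with `method="inclusive"`) is an assumption; Conformare's `iqrOutliers` may compute quartiles differently, and these functions are not its implementation.

```python
import statistics


def null_fraction(values):
    """Fraction of entries that are None."""
    if not values:
        return 0.0
    return sum(value is None for value in values) / len(values)


def iqr_outliers(values, k=1.5):
    """Return Tukey's fences and the values that fall outside them."""
    present = [value for value in values if value is not None]
    # Assumed quartile method; other methods give slightly different fences.
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    outliers = [value for value in present if value < low or value > high]
    return {"low": low, "high": high, "outliers": outliers}


amounts = [10, 12, 11, 13, None, 12, 95, 11]
nulls = null_fraction(amounts)    # 1 of 8 entries is None -> 0.125
summary = iqr_outliers(amounts)   # fences 8.75 and 14.75, so 95 is an outlier
print(nulls, summary)
```

With these values the quartiles are 11.0 and 12.5, so the IQR is 1.5 and the fences sit 2.25 below and above the quartiles.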

On Spark, every count or aggregate runs as a full job, so prefer profiling only chosen steps with `force_profile`:

```python
cf.set_profiles({})                                  # profile nothing by default
with cf.force_profile(cf.rowCount, cf.histogram(columns="amount"), cache=True, only="last"):
    enriched = adults.join(orders, on="id")          # only this step is profiled (and cached)
```

Under the hood, a profiler is a callable `(frame, backend) -> value`. Conditions
(`contains_columns`, `schema_has`, `min_rows`) compose with `&`, `|` and `~`. Configuration is
layered: an upfront `set_profiles` registry can be overridden by scoped `with profile(...)`
overlays. Counts always run on the full data; distribution profilers sample 10,000 rows by
default.
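
The two ideas in this paragraph, a profiler as a plain callable and conditions that compose with operators, can be sketched in a few lines. The `Condition` class, the toy dict-of-lists frame, and these stand-in versions of `contains_columns` and `min_rows` are assumptions for illustration; they share names with the library's conditions but are not its implementation.

```python
class Condition:
    """Toy predicate over (frame, backend) that composes with &, | and ~."""

    def __init__(self, check, label):
        self.check = check
        self.label = label

    def __call__(self, frame, backend):
        return bool(self.check(frame, backend))

    def __and__(self, other):
        return Condition(lambda f, b: self(f, b) and other(f, b),
                         f"({self.label} & {other.label})")

    def __or__(self, other):
        return Condition(lambda f, b: self(f, b) or other(f, b),
                         f"({self.label} | {other.label})")

    def __invert__(self):
        return Condition(lambda f, b: not self(f, b), f"~{self.label}")


def row_count(frame, backend):
    """A profiler: a callable (frame, backend) -> value."""
    return len(next(iter(frame.values()), []))


def contains_columns(*names):
    return Condition(lambda f, b: all(name in f for name in names),
                     f"contains_columns({', '.join(names)})")


def min_rows(n):
    return Condition(lambda f, b: row_count(f, b) >= n, f"min_rows({n})")


def run_if(condition, profiler, frame, backend):
    """Run the profiler only when the condition holds; otherwise return None."""
    return profiler(frame, backend) if condition(frame, backend) else None


frame = {"id": [1, 2, 3], "amount": [9.5, 20.0, 3.25]}   # toy columnar frame
small_with_amount = contains_columns("amount") & ~min_rows(5)
result = run_if(small_with_amount, row_count, frame, backend="toy")
skipped = run_if(min_rows(5) | contains_columns("email"), row_count, frame, backend="toy")
```

Overloading `__and__`, `__or__` and `__invert__` returns a new `Condition` each time, so composed conditions can themselves be composed further and carry a readable label for reports.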

## How it compares

Most data-governance and lineage tools serve the **Data Engineering layer**: the modelled,
materialised, catalogued datasets that are *delivered to* analytics teams. Conformare serves the
layer **downstream of that**: the analytical work data science and analytics teams **build on
top of** those datasets, in imperative dataframe code. It does not re-derive the lineage Data
Engineering already has; it governs the analytical outputs that consume it, and **inherits** the
upstream governance at the handshake.

- **[dbt](https://www.getdbt.com/)** : the canonical Data Engineering-layer tool, with
  model-level lineage, docs, contracts and access governance for the SQL/warehouse
  transformations that *produce* the structured inputs. Conformare governs the imperative
  analytics built *on* those inputs, centred on risk / owner / definition rather than SQL model
  graphs. Vertically adjacent, not competing, and Conformare can read a table's governance
  straight from a dbt model's `meta`.
- **[OpenLineage](https://openlineage.io/) / [Marquez](https://marquezproject.ai/)** : an open
  standard and service for emitting **run / dataset / job** lineage across a platform (Airflow,
  Spark, dbt). It answers "which datasets and jobs feed which" at the orchestration level.
  Conformare documents the authored *analytical steps* inside one process and the **risk and
  ownership** behind them, as a self-contained artifact rather than a metadata service.
- **[DataHub](https://datahub.com/) / [OpenMetadata](https://open-metadata.org/)** : enterprise
  metadata **catalogues** of the Data Engineering estate: ownership, glossaries, column lineage,
  policies and data contracts, served centrally. Conformare is lightweight and **code-proximate**
  and complementary: it can **import** a catalogue's governance for the inputs an analysis reads,
  and **export** its analytical lineage and risk back.
- **[Spline](https://absaoss.github.io/spline/)** : captures Spark **execution-plan** lineage
  (column-level) from logical plans, an engineering concern. Conformare captures the process **as
  authored** (no engine plan, so it also works for pandas / Narwhals) and adds the risk / owner /
  definition governance Spline does not.
- **Python pipeline frameworks ([Dagster](https://dagster.io/), [Hamilton](https://hamilton.dagworks.io/),
  [Kedro](https://kedro.org/))** : impose structure on data pipelines you *build in the framework*,
  with assets / nodes, lineage and metadata. Conformare instruments imperative code you already
  wrote (no framework to adopt) and centres on the **risk / owner narrative and drift**, not
  pipeline orchestration.
- **[Great Expectations](https://greatexpectations.io/) / [Soda](https://www.soda.io/) /
  [whylogs](https://github.com/whylabs/whylogs)** : data validation / profiling. Conformare *uses*
  Great Expectations as an optional checkpoint profiler; its own contribution is placing those
  checks (and profiles) on the **process map**, next to the governance.

**In short:** dbt, the catalogues and the lineage services govern the **data Data Engineering
produces**. Conformare governs the **analytical outputs teams build on top of it**, the models,
features, scores and reports that drive decisions but otherwise have no governance layer of their
own. It sits where their reach ends, and inherits from them at the boundary.

## Examples & tests

Browse the [examples gallery](https://kaelonlloyd.github.io/conformare-docs/examples.html),
where each example shows its code next to the live report it produces. Highlights:

- `example_streaming.py` / `example_streaming_spark.py` : a full pipeline, Narwhals vs PySpark.
- `example_pandas.py` : native-pandas idiomatic tracking (`df[df.col==1]`, `merge`, `groupby`).
- `example_great_expectations.py` / `_spark.py` : validation checkpoints that pinpoint where data breaks its contract.
- `example_docstring_tagging.py` : governance declared purely in docstrings (no decorators).
- `bootstrap/` : instrument a pure, unmodified PySpark script from the outside.

```bash
python -m pytest            # Spark tests skip automatically if no JVM is available
```

## Versioning

Conformare follows [Semantic Versioning](https://semver.org/). While in `0.x`, the API
may still change: breaking changes can land in a **minor** release (`0.1` to `0.2`),
patch releases are bug fixes, and `1.0.0` will mark a commitment to backward
compatibility. The installed version is available as `conformare.__version__`, and
changes are recorded in the [changelog](https://kaelonlloyd.github.io/conformare-docs/changelog.html).
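
The `0.x` policy above can be turned into a small upgrade check. The sketch below encodes one reading of it: within `0.x`, a change of minor version may break; from `1.0.0` on, only a change of major version may. The `parse` and `may_break` helpers are hypothetical, handle only plain `X.Y.Z` strings, and are not shipped with Conformare.

```python
def parse(version):
    """Split a plain 'X.Y.Z' string into integers (no pre-release suffixes)."""
    major, minor, patch = (int(part) for part in version.split(".")[:3])
    return major, minor, patch


def may_break(installed, candidate):
    """True if moving from `installed` to `candidate` may include breaking changes."""
    old, new = parse(installed), parse(candidate)
    if old[0] == 0 and new[0] == 0:
        return new[1] != old[1]   # in 0.x, a minor bump may break
    return new[0] != old[0]       # otherwise only a major bump may break


patch_upgrade = may_break("0.2.7", "0.2.9")   # patch release: bug fixes only
minor_upgrade = may_break("0.2.7", "0.3.0")   # minor release in 0.x: may break
print(patch_upgrade, minor_upgrade)
```

In practice the same policy can be expressed as a requirement specifier such as `conformare>=0.2.7,<0.3`, which accepts patch releases and holds back the next minor release until it has been reviewed.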

## License

Conformare is licensed under the
[PolyForm Noncommercial License 1.0.0](https://polyformproject.org/licenses/noncommercial/1.0.0):
free to use for any **noncommercial** purpose (personal, research, education, non-profits,
public-sector and similar). It is source-available, not open-source.

**Commercial use requires a separate license.** For commercial licensing, contact
Kaelon Lloyd at kaelonlloyd@gmail.com.

