Metadata-Version: 2.4
Name: conformare
Version: 0.1.8
Summary: Capture the authored dataframe transformation pipeline (Narwhals + PySpark) and profile data at each step to produce lineage diagrams and data-quality docs.
Project-URL: Homepage, https://kaelonlloyd.github.io/conformare-docs/
Project-URL: Documentation, https://kaelonlloyd.github.io/conformare-docs/
Author-email: Kaelon Lloyd <kaelonlloyd@gmail.com>
License: PolyForm-Noncommercial-1.0.0
License-File: LICENSE
Keywords: data-governance,data-lineage,data-quality,governance,great-expectations,lineage,narwhals,pandas,pii,profiling,pyspark
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: Free for non-commercial use
Classifier: License :: Other/Proprietary License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: executing>=2.0
Requires-Dist: narwhals>=1.0
Provides-Extra: dev
Requires-Dist: mypy; extra == 'dev'
Requires-Dist: pandas>=2.0; extra == 'dev'
Requires-Dist: pyarrow>=12.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff<0.16,>=0.15; extra == 'dev'
Requires-Dist: scikit-learn>=1.3; extra == 'dev'
Provides-Extra: gx
Requires-Dist: great-expectations>=0.18; extra == 'gx'
Provides-Extra: sklearn
Requires-Dist: scikit-learn>=1.3; extra == 'sklearn'
Provides-Extra: spark
Requires-Dist: pyspark>=3.4; extra == 'spark'
Provides-Extra: test
Requires-Dist: pandas>=2.0; extra == 'test'
Requires-Dist: pyarrow>=12.0; extra == 'test'
Requires-Dist: pytest>=7.0; extra == 'test'
Provides-Extra: whylogs
Requires-Dist: whylogs>=1.3; extra == 'whylogs'
Description-Content-Type: text/markdown

# Conformare

[![CI](https://github.com/kaelonlloyd/conformare/actions/workflows/ci.yml/badge.svg)](https://github.com/kaelonlloyd/conformare/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/conformare)](https://pypi.org/project/conformare/)
[![Python versions](https://img.shields.io/pypi/pyversions/conformare)](https://pypi.org/project/conformare/)
[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
[![License: PolyForm Noncommercial 1.0.0](https://img.shields.io/badge/license-PolyForm%20Noncommercial%201.0.0-blue.svg)](https://polyformproject.org/licenses/noncommercial/1.0.0)

**[Live demo and reports](https://kaelonlloyd.github.io/conformare-docs/)**: explore real interactive reports generated by the examples, and read the [docs](https://kaelonlloyd.github.io/conformare-docs/).

**Tie data-pipeline governance to the code that implements it.** Conformare captures the
*authored* transformation pipeline of your dataframe code, profiles the data at each step,
and records the **risks, mitigations, owners and business definitions** behind it, then
renders the whole thing as one self-contained, interactive HTML report.

Works with **Narwhals**, **PySpark**, and **native pandas**.

## What it does

Conformare does two jobs for a data-processing pipeline:

1. **Governance** : track the **risks, mitigations and owners**, and the **business
   definitions** implemented by each step.
2. **Process & data** : document the **end-to-end process**, flag **PII / sensitive
   data**, and **profile** the data (counts, distributions, null rates, outliers,
   expectations) at each step.

It exists to support the **development, diagnostics, and governance** of pipeline
*implementations*: not the data platform as a whole, but the specific code that does the work.

## Mission

Governance usually lives *away* from the code. Risk reviews raise every assumption and
implementation consideration; each must be owned, approved and tracked, and this work is
almost always done **after the fact**, in documents that quietly drift out of sync with
the implementation.

Conformare's mission is to **bring that governance into the code**: author risks,
mitigations, owners and definitions next to the logic they describe, so the documentation
is generated *from* the implementation and stays current with it.

## Two ways to use it

- **Integrated (intrusive)** : add profilers and `describe()` / `risk()` context in your
  code. Unlocks the full feature set: profiling, lineage, sensitivity, expectations and
  governance.
- **Non-intrusive** : get governance and process documentation **without rewriting** your
  pipeline, via **docstring tagging** (risk / mitigation / owner / purpose declared in
  docstrings) or **bootstrapping** (instrument an unmodified script from a separate entry
  point).

See [Choosing an integration style](#choosing-an-integration-style) for the trade-offs.

## Install

```bash
pip install conformare           # core: Narwhals + executing
pip install "conformare[spark]"  # + PySpark
pip install "conformare[gx]"     # + Great Expectations (optional validation profiler)
```

## Quick start

```python
import narwhals as nw
import pandas as pd
import conformare as cf

cf.trackNarwhals()
cf.set_profiles({"*": [cf.rowCount, cf.dataSize, cf.histogram(columns="all")]})

with cf.describe("Clean customers", purpose="Keep UK adults only",
                 definition_owner="data-governance",
                 risks=cf.risk("privacy.pii_exposure", "compliance.gdpr",
                               mitigation="Drop email before export", owner="data-governance")):
    customers = nw.from_native(pd.read_csv("customers.csv"))
    adults = customers.filter(nw.col("age") >= 18)

cf.to_html("report.html", title="Customer pipeline")   # open in any browser
```

Existing **PySpark** or **native pandas** code uses the same API: call `cf.trackSpark()` or
`cf.trackPandas()` and run your pipeline **unchanged**:

```python
cf.trackSpark()                               # the only line you add
active   = df.filter(df.status == "active")   # tracked automatically
enriched = active.join(orders, on="id")       # tracked, two parents
```

## Choosing an integration style

|  | Integrated (explicit) | Docstring tagging | Bootstrapping |
|---|---|---|---|
| Change to pipeline code | profilers + `describe`/`risk` inline | docstrings only | none (separate entry point) |
| What you get | everything | risk / mitigation / owner / purpose docs | process tracking + grouping + profiling |
| Best for | new code, deep diagnostics, full report | adding governance with little code change | documenting a script you cannot or will not edit |

- **Explicit integration** (the integrated style above) gives the richest result: every
  profiler, the complete lineage and column-level detail, data-quality checkpoints, and
  governance, all in one report. Best when you own the code and want both diagnostics and
  a full governance artifact.
- **Indirect integration** (docstring tagging or bootstrapping) trades feature coverage
  for **zero or low intrusion**:
  - *Docstrings* keep governance (risks, owners, definitions) literally inside the
    function that implements the concept, so it cannot drift and needs no imports in hot
    paths. With `track_functions()` on, a `Conformare:` docstring block is applied
    automatically; see the [docstring tagging example](https://kaelonlloyd.github.io/conformare-docs/examples/docstring-governance.html).
  - *Bootstrapping* documents and profiles an **unmodified** production script from the
    outside, ideal for legacy or third-party pipelines and for audits.
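
To make the docstring-tagging idea concrete, here is a minimal sketch of how a tagged block could be read out of a function's docstring. This is not Conformare's parser, and the indented `key: value` layout under the `Conformare:` header is an assumption made for illustration; the linked docstring tagging example shows the syntax the library actually accepts. `parse_governance_block` and `keep_adults` are hypothetical names.

```python
import inspect


def parse_governance_block(func, header="Conformare:"):
    """Extract key/value pairs from the indented block under `header`."""
    doc = inspect.getdoc(func) or ""   # getdoc() also strips the common indentation
    fields, in_block = {}, False
    for line in doc.splitlines():
        if line.strip() == header:
            in_block = True
            continue
        if in_block:
            if not line.startswith(" ") or ":" not in line:
                break  # block ends at the first non-indented or non key/value line
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip()
    return fields


def keep_adults(rows):
    """Keep customers aged 18 or over.

    Conformare:
        purpose: Keep UK adults only
        risk: privacy.pii_exposure
        mitigation: Drop email before export
        owner: data-governance
    """
    # ASSUMED block layout above; check the library docs for the real syntax.
    return [row for row in rows if row["age"] >= 18]


print(parse_governance_block(keep_adults))
```

The governance text lives in the same function as the logic, so a reviewer who changes the filter sees the declared risk and owner in the same diff.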

## Features at a glance

- **Process map & lineage** : a diagram of the pipeline **as authored** (no engine plan
  required), with column-level lineage, a created-column catalog, and each node's
  operation logic shown inline.
- **Per-step profiling** : row/column counts, data size, histograms, null fractions and
  IQR outliers at each step, plus a *distribution follower* to watch a column evolve.
- **Data-quality checkpoints** : drop [Great Expectations](https://greatexpectations.io/)
  in at any step and see exactly where a contract starts failing (with severities).
- **PII / sensitivity** : name-based heuristics flag candidate PII, and the report shows
  whether each sensitive column reaches a written output.
- **Governance** : risks, mitigations, owners, business **definition owners**, Markdown
  context details, a process-wide description, and a governance ranking in which owned
  items rank as low-concern. Surfaced as a risk register and a context register.
- **Self-contained HTML report** : one interactive page (diagram, column highlighter,
  KPIs, dark mode); no CDN, no build step.
- **Formal risk checklist** : export the risk register as a sign-off-ready Markdown
  document (`to_risk_checklist`) with blank columns and a sign-off block for a
  governance team to review, comment, and date, keeping an auditable trail.
- **Three backends, one report** : Narwhals (new code), PySpark and native pandas
  (existing code, tracked in place).
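
As an illustration of the name-based PII heuristics mentioned in the list above, the sketch below flags columns by matching their names against a few regular expressions and then checks which flagged columns survive into an output. The patterns, categories and function names are all hypothetical; Conformare's built-in rules may differ.

```python
import re

# Hypothetical patterns for illustration only; not Conformare's rule set.
PII_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "name": re.compile(r"(^|_)(first|last|full)_?name($|_)", re.IGNORECASE),
    "birth_date": re.compile(r"(^|_)(dob|date_of_birth|birth_?date)($|_)", re.IGNORECASE),
}


def flag_candidate_pii(columns):
    """Return {column: category} for columns whose name matches a pattern."""
    flagged = {}
    for column in columns:
        for category, pattern in PII_PATTERNS.items():
            if pattern.search(column):
                flagged[column] = category
                break  # first matching category wins
    return flagged


def sensitive_columns_reaching_output(flagged, output_columns):
    """List flagged columns that are still present in a written output."""
    return sorted(set(flagged) & set(output_columns))


source_columns = ["customer_id", "email", "first_name", "age", "DOB"]
flagged = flag_candidate_pii(source_columns)
leaked = sensitive_columns_reaching_output(flagged, ["customer_id", "age", "email"])
print(flagged)
print(leaked)
```

Name matching only yields *candidates*: a column called `notes` can hold personal data and a column called `email_opt_in` may not, which is why manual tagging remains useful alongside the heuristic.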

## Backends

- **`trackNarwhals()`** : for new, dataframe-agnostic code on
  [Narwhals](https://narwhals-dev.github.io/narwhals/); patches the `nw.from_native`
  chokepoint.
- **`trackSpark()`** : for existing PySpark code, tracked **in place with zero changes**.
- **`trackPandas()`** : for existing **native pandas** code; tracks idiomatic indexing
  like `df[df.col == 1]`, `df[["a","b"]]`, `query`, `merge`, `groupby`.
- **`trackAll()`** : adapters plus automatic function-boundary tracking, for mixed codebases.
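
The "tracked in place" behaviour can be pictured as method patching. The sketch below is a generic illustration of that technique on a stand-in `Frame` class, not Conformare's implementation: `track`, `restore`, `lineage` and `Frame` here are hypothetical names defined in the snippet itself.

```python
import functools


class Frame:
    """Stand-in for a dataframe class from a third-party library."""

    def __init__(self, rows):
        self.rows = rows

    def filter(self, predicate):
        return Frame([row for row in self.rows if predicate(row)])


lineage = []      # captured steps; kept after restore()
_originals = {}   # original methods, so patches can be undone


def track(cls, method_name):
    """Replace cls.method_name with a wrapper that records each call."""
    original = getattr(cls, method_name)
    _originals[(cls, method_name)] = original

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        result = original(self, *args, **kwargs)
        lineage.append(
            {"op": method_name, "rows_in": len(self.rows), "rows_out": len(result.rows)}
        )
        return result

    setattr(cls, method_name, wrapper)


def restore():
    """Put every original method back; the lineage list is left untouched."""
    for (cls, method_name), original in _originals.items():
        setattr(cls, method_name, original)
    _originals.clear()


track(Frame, "filter")                                   # the only line added
people = Frame([{"age": 17}, {"age": 30}, {"age": 45}])
adults = people.filter(lambda row: row["age"] >= 18)     # unchanged pipeline code
restore()
minors = people.filter(lambda row: row["age"] < 18)      # no longer recorded
print(lineage)
```

Because the wrapper calls the original method and returns its result untouched, the pipeline's behaviour stays the same while each step is recorded.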

## Public API

| Call | Purpose |
|---|---|
| `trackNarwhals()` / `trackSpark()` / `trackPandas()` / `trackAll()` | Start tracking the chosen backend(s). |
| `set_profiles({...})` / `with profile(...)` | Registry mapping operations to profilers / scoped overlay on that registry. |
| `with force_profile(..., cache=)` | Opt-in profiling at a chosen point (optionally cache on Spark). |
| `with describe(...)` / `with risk(...)` | Annotate code with purpose / governance risks (owner, mitigation, definition owner, Markdown details). |
| `describe_process(description, risks=...)` | Process-wide description and risks. |
| `register_risk(...)` | Extend the built-in risk catalog. |
| `mark_sensitive()` / `classify_column()` | Manual / heuristic sensitivity tagging. |
| `@opaque` / `opaque_module(*prefixes)` | Record a function/library call as one node, suppressing its internals (`pyspark.ml` opaque by default). |
| `@track_step` / `track_functions()` | Function-boundary tracking (explicit / automatic, including docstring tagging). |
| `environment()` / `in_notebook()` / `mark_user_packages(*names)` | Detect the runtime (Databricks / Jupyter / IPython / Python); opt your own pip-installed pipeline code into user-code tracking. |
| `bootstrap(run, docs=[doc(...)], ...)` | Instrument an unmodified script from the outside, run it, write a report. |
| `to_mermaid()` / `to_json()` / `to_html()` | Export the lineage (Mermaid / JSON / interactive HTML). |
| `to_risk_checklist(path, process=, reviewers=)` | Export the risk register as a formal, sign-off-ready Markdown checklist for a governance team. |
| `restore()` | Unpatch everything (captured lineage is kept). |
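
To show what a Mermaid export of lineage amounts to, here is a hand-rolled sketch that turns a small parent/child graph into Mermaid flowchart text. It does not call `to_mermaid()`, and the exact text Conformare emits may look different; `lineage_to_mermaid` and the `steps` structure are hypothetical.

```python
def lineage_to_mermaid(steps):
    """Render {node_id: (label, [parent_ids])} as Mermaid flowchart text."""
    lines = ["flowchart TD"]
    # Declare every node first, then the edges from each parent to its child.
    for node_id, (label, _parents) in steps.items():
        lines.append(f'    {node_id}["{label}"]')
    for node_id, (_label, parents) in steps.items():
        for parent in parents:
            lines.append(f"    {parent} --> {node_id}")
    return "\n".join(lines)


# A three-step pipeline: two sources, a filter, and a join with two parents.
steps = {
    "df": ("read df", []),
    "orders": ("read orders", []),
    "active": ("filter status == active", ["df"]),
    "enriched": ("join on id", ["active", "orders"]),
}
diagram = lineage_to_mermaid(steps)
print(diagram)
```

The resulting text can be pasted into any Mermaid renderer; the join node shows up with two incoming edges, one per parent.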

## Profilers

A **profiler** measures something about the data at a step (for example a row count or a
column's distribution) and attaches the result to that node in the report. You choose
which profilers run on which operations, and they execute as the pipeline runs.

Built-in profilers:

- **`rowCount`** : number of rows.
- **`columnCount`** : number of columns.
- **`dataSize`** : approximate in-memory size, reported alongside the column count.
- **`histogram(columns=...)`** : per-column distribution (numeric bins, or top values for categorical columns).
- **`nullFraction(columns=...)`** : fraction of nulls per column.
- **`iqrOutliers(columns="all", k=1.5)`** : flags values outside Tukey's IQR fences and summarises the outliers per column.
- **`greatExpectations(*expectations, hard_severities=())`** : runs [Great Expectations](https://greatexpectations.io/) checks as a validation checkpoint at a step, showing which pass or fail (with severities). Accepts native GX objects or portable dicts. Optional dependency (`pip install "conformare[gx]"`); degrades to a status note if absent.
- **`whylogs(columns=...)`** : optional [whylogs](https://github.com/whylabs/whylogs) profile summary (requires whylogs).
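
For readers unfamiliar with the statistics behind two of these profilers, the sketch below computes a null fraction and Tukey's IQR fences with the standard library. It illustrates the arithmetic only; the function names are hypothetical and Conformare's own quantile method and output format may differ.

```python
import statistics


def null_fraction(values):
    """Fraction of entries that are None (0.0 for an empty column)."""
    if not values:
        return 0.0
    return sum(value is None for value in values) / len(values)


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using Tukey's rule."""
    present = [value for value in values if value is not None]
    q1, _median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    # Anything further than k * IQR beyond the quartiles counts as an outlier.
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [value for value in present if value < lower or value > upper]
    return lower, upper, outliers


amounts = [10, 12, 11, 13, None, 12, 95, 11, 10]
print(null_fraction(amounts))
print(iqr_outliers(amounts))
```

With `k=1.5` only the value `95` falls outside the fences here; a larger `k` widens the fences and flags fewer values.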

On Spark, every count or aggregate runs as a full job, so prefer profiling only chosen steps with `force_profile`:

```python
cf.set_profiles({})                                  # profile nothing by default
with cf.force_profile(cf.rowCount, cf.histogram("amount"), cache=True, only="last"):
    enriched = adults.join(orders, on="id")          # only this step is profiled (and cached)
```

Under the hood: a profiler is a callable `(frame, backend) -> value`. Conditions
(`contains_columns`, `schema_has`, `min_rows`) compose with `&`, `|` and `~`.
Configuration is layered: an upfront `set_profiles` registry that scoped
`with profile(...)` overlays can override. Counts never sample; distribution profilers
default to a 10,000-row sample.
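
The pattern described here can be sketched in plain Python. The snippet below is an assumption-laden illustration, not Conformare's code: it reuses the names `contains_columns` and `min_rows` but defines its own versions with guessed signatures, and it treats a "frame" as a plain dict of column lists. `Condition`, `row_count` and `run_profilers` are hypothetical.

```python
class Condition:
    """A predicate over a frame that composes with &, | and ~."""

    def __init__(self, check):
        self.check = check

    def __call__(self, frame):
        return self.check(frame)

    def __and__(self, other):
        return Condition(lambda frame: self(frame) and other(frame))

    def __or__(self, other):
        return Condition(lambda frame: self(frame) or other(frame))

    def __invert__(self):
        return Condition(lambda frame: not self(frame))


# Frames here are plain dicts of column name -> list of values (an assumption).
def contains_columns(*names):
    return Condition(lambda frame: all(name in frame for name in names))


def min_rows(n):
    return Condition(lambda frame: len(next(iter(frame.values()), [])) >= n)


def row_count(frame, backend):
    """A profiler in the (frame, backend) -> value shape."""
    return len(next(iter(frame.values()), []))


def run_profilers(frame, backend, rules):
    """Run each profiler whose condition accepts the frame."""
    return {
        profiler.__name__: profiler(frame, backend)
        for condition, profiler in rules
        if condition(frame)
    }


frame = {"id": [1, 2, 3], "amount": [9.5, 20.0, 3.25]}
rules = [(contains_columns("amount") & ~min_rows(1000), row_count)]
print(run_profilers(frame, "dict", rules))
```

Overloading `&`, `|` and `~` keeps rules readable as expressions, at the cost of needing parentheses around comparisons because these operators bind more tightly than `and`, `or` and `not`.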

## Related projects

Conformare overlaps with the data-lineage / governance ecosystem but occupies a different
niche: it maps the **inside of a specific process implementation** and binds **governance
documentation to the code**, rather than cataloguing datasets across a platform.

- **[OpenLineage](https://openlineage.io/) / [Marquez](https://marquezproject.ai/)** : an
  open *standard* and service for emitting **run / dataset / job** lineage across a stack
  (Airflow, Spark, dbt). It answers "which datasets and jobs feed which" at the platform
  level. Conformare instead documents the authored *steps inside one pipeline* and the
  governance behind them, as a single self-contained artifact rather than a metadata service.
- **[dbt](https://www.getdbt.com/)** : model-level lineage, docs, contracts and access
  governance for SQL/warehouse transformations. Conformare targets **imperative dataframe
  code** (Narwhals / pandas / Spark) and centres on risk / owner / definition governance
  rather than SQL model graphs.
- **[Spline](https://absaoss.github.io/spline/)** : captures Spark **execution-plan**
  lineage (column-level) from logical plans. Conformare captures the pipeline **as
  authored** (no engine plan, so it also works for pandas/Narwhals) and adds the
  governance layer Spline does not.
- **[DataHub](https://datahub.com/) / [OpenMetadata](https://open-metadata.org/)** :
  enterprise metadata **catalogs**: ownership, glossaries, column lineage, policies and
  data contracts, served centrally. Conformare is lightweight and **code-proximate**:
  governance authored alongside the implementation and rendered per run, not a central
  catalog populated separately.
- **[Great Expectations](https://greatexpectations.io/) / [Soda](https://www.soda.io/) /
  [whylogs](https://github.com/whylabs/whylogs)** : data validation / profiling.
  Conformare *uses* Great Expectations as an optional checkpoint profiler; its own
  contribution is placing those checks (and profiles) on the **process map**, next to the
  governance.

**In short:** tools like OpenLineage and DataHub focus on higher-order data lineage and
central cataloging across a platform; Conformare zooms in on one pipeline's
*implementation* and ties its governance documentation to the code. Complementary, not
competing.

## Examples & tests

Browse the [examples gallery](https://kaelonlloyd.github.io/conformare-docs/examples.html),
where each example shows its code next to the live report it produces. Highlights:

- `example_streaming.py` / `example_streaming_spark.py` : a full pipeline, Narwhals vs PySpark.
- `example_pandas.py` : native-pandas idiomatic tracking (`df[df.col==1]`, `merge`, `groupby`).
- `example_great_expectations.py` / `_spark.py` : validation checkpoints that pinpoint where data breaks its contract.
- `example_docstring_tagging.py` : governance declared purely in docstrings (no decorators).
- `bootstrap/` : instrument a pure, unmodified PySpark script from the outside.

```bash
python -m pytest            # Spark tests skip automatically if no JVM is available
```

## Versioning

Conformare follows [Semantic Versioning](https://semver.org/). While in `0.x`, the API
may still change: breaking changes can land in a **minor** release (`0.1` to `0.2`),
patch releases are bug fixes, and `1.0.0` will mark a commitment to backward
compatibility. The installed version is available as `conformare.__version__`, and
changes are recorded in the [changelog](https://kaelonlloyd.github.io/conformare-docs/changelog.html).
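
The rule above can be expressed as a small check. This is a sketch for plain `MAJOR.MINOR.PATCH` strings only (no pre-release or build suffixes), and `parse` and `may_break` are hypothetical helpers, not part of Conformare.

```python
def parse(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def may_break(installed, candidate):
    """True if upgrading could include breaking changes under the rule above."""
    old, new = parse(installed), parse(candidate)
    if old[0] == 0 and new[0] == 0:
        return new[1] != old[1]      # in 0.x, a minor bump may break
    return new[0] != old[0]          # otherwise only a major bump may break


print(may_break("0.1.8", "0.1.9"))   # patch release within 0.1
print(may_break("0.1.8", "0.2.0"))   # minor release within 0.x
```

In practice this means that, while the project is in `0.x`, a dependency pin such as `conformare>=0.1.8,<0.2` accepts bug-fix releases while holding back minor releases that may change the API.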

## License

Conformare is licensed under the
[PolyForm Noncommercial License 1.0.0](https://polyformproject.org/licenses/noncommercial/1.0.0):
free to use for any **noncommercial** purpose (personal, research, education, non-profits,
public-sector and similar). It is source-available, not open-source.

**Commercial use requires a separate license.** For commercial licensing, contact
Kaelon Lloyd at kaelonlloyd@gmail.com.

