Metadata-Version: 2.3
Name: consist
Version: 0.1.1
Summary: Provenance tracking, intelligent caching, and data virtualization for scientific simulation workflows.
Keywords: provenance,caching,lineage,reproducibility,duckdb,scientific-computing
Author: Zach Needell
Author-email: Zach Needell <zaneedell@lbl.gov>
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Typing :: Typed
Requires-Dist: duckdb>=1.4.2
Requires-Dist: duckdb-engine>=0.17.0
Requires-Dist: pandas>=2.1.0
Requires-Dist: pyarrow>=22.0.0
Requires-Dist: pydantic>=2.12.4
Requires-Dist: rich>=13.7.1
Requires-Dist: ruff>=0.15.7
Requires-Dist: sqlalchemy>=2.0.44,<2.0.45
Requires-Dist: sqlmodel>=0.0.31
Requires-Dist: tqdm>=4.67.1
Requires-Dist: ty>=0.0.15
Requires-Dist: typer>=0.12.3
Requires-Dist: zensical>=0.0.32
Requires-Dist: pyyaml>=6.0 ; extra == 'activitysim'
Requires-Dist: cftime>=1.6.0 ; extra == 'all'
Requires-Dist: docker>=7.1.0 ; extra == 'all'
Requires-Dist: dlt>=1.21.0 ; extra == 'all'
Requires-Dist: geopandas>=0.14.0 ; extra == 'all'
Requires-Dist: h5netcdf>=1.3.0 ; extra == 'all'
Requires-Dist: h5py>=3.15.1 ; extra == 'all'
Requires-Dist: netcdf4>=1.6.0 ; extra == 'all'
Requires-Dist: openmatrix>=0.3.5.0 ; extra == 'all'
Requires-Dist: pyhocon>=0.3.60 ; extra == 'all'
Requires-Dist: pyyaml>=6.0 ; extra == 'all'
Requires-Dist: tables>=3.10.2 ; extra == 'all'
Requires-Dist: xarray>=2024.9.0 ; extra == 'all'
Requires-Dist: zarr>=2.18,<3 ; extra == 'all'
Requires-Dist: pyhocon>=0.3.60 ; extra == 'beam'
Requires-Dist: polars>=1.35.2 ; extra == 'bench'
Requires-Dist: psutil>6.0.0 ; extra == 'bench'
Requires-Dist: dlt>=1.21.0 ; extra == 'dev'
Requires-Dist: h5netcdf>=1.3.0 ; extra == 'dev'
Requires-Dist: h5py>=3.15.1 ; extra == 'dev'
Requires-Dist: openmatrix>=0.3.5.0 ; extra == 'dev'
Requires-Dist: pyhocon>=0.3.60 ; extra == 'dev'
Requires-Dist: pytest>=9.0.1 ; extra == 'dev'
Requires-Dist: pytest-cov>=7.0.0 ; extra == 'dev'
Requires-Dist: pytest-mock>=3.15.1 ; extra == 'dev'
Requires-Dist: ruff>=0.14.6 ; extra == 'dev'
Requires-Dist: tables>=3.10.2 ; extra == 'dev'
Requires-Dist: xarray>=2024.9.0 ; extra == 'dev'
Requires-Dist: zarr>=2.18,<3 ; extra == 'dev'
Requires-Dist: docker>=7.1.0 ; extra == 'docker'
Requires-Dist: mkdocs-material>=9.7.0 ; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=1.0.0 ; extra == 'docs'
Requires-Dist: ipykernel>=6.29.0 ; extra == 'examples'
Requires-Dist: jupyterlab>=4.0.0 ; extra == 'examples'
Requires-Dist: matplotlib>=3.8.0 ; extra == 'examples'
Requires-Dist: seaborn>=0.13.0 ; extra == 'examples'
Requires-Dist: ipywidgets>=8.1.8 ; extra == 'examples'
Requires-Dist: h5py>=3.15.1 ; extra == 'hdf5'
Requires-Dist: tables>=3.10.2 ; extra == 'hdf5'
Requires-Dist: dlt>=1.21.0 ; extra == 'ingest'
Requires-Dist: cftime>=1.6.0 ; extra == 'netcdf'
Requires-Dist: h5netcdf>=1.3.0 ; extra == 'netcdf'
Requires-Dist: xarray>=2024.9.0 ; extra == 'netcdf'
Requires-Dist: netcdf4>=1.6.0 ; extra == 'netcdf4'
Requires-Dist: xarray>=2024.9.0 ; extra == 'netcdf4'
Requires-Dist: openmatrix>=0.3.5.0 ; extra == 'omx'
Requires-Dist: geopandas>=0.14.0 ; extra == 'spatial'
Requires-Dist: h5netcdf>=1.3.0 ; extra == 'test'
Requires-Dist: openmatrix>=0.3.5.0 ; extra == 'test'
Requires-Dist: pyhocon>=0.3.60 ; extra == 'test'
Requires-Dist: pytest>=9.0.1 ; extra == 'test'
Requires-Dist: pytest-cov>=7.0.0 ; extra == 'test'
Requires-Dist: pytest-mock>=3.15.1 ; extra == 'test'
Requires-Dist: ruff>=0.14.6 ; extra == 'test'
Requires-Dist: xarray>=2024.9.0 ; extra == 'test'
Requires-Dist: zarr>=2.18,<3 ; extra == 'test'
Requires-Dist: pandas-stubs~=2.3.3 ; extra == 'typecheck'
Requires-Dist: ty>=0.0.7 ; extra == 'typecheck'
Requires-Dist: xarray>=2024.9.0 ; extra == 'zarr'
Requires-Dist: zarr>=2.18,<3 ; extra == 'zarr'
Requires-Python: >=3.11
Provides-Extra: activitysim
Provides-Extra: all
Provides-Extra: beam
Provides-Extra: bench
Provides-Extra: dev
Provides-Extra: docker
Provides-Extra: docs
Provides-Extra: examples
Provides-Extra: hdf5
Provides-Extra: ingest
Provides-Extra: netcdf
Provides-Extra: netcdf4
Provides-Extra: omx
Provides-Extra: spatial
Provides-Extra: test
Provides-Extra: typecheck
Provides-Extra: zarr
Description-Content-Type: text/markdown

<p align="center">
  <img src="docs/assets/logo.png" alt="Consist" width="320">
</p>

<p align="center">
  <a href="https://github.com/LBNL-UCB-STI/consist/actions/workflows/ci.yml">
    <img src="https://github.com/LBNL-UCB-STI/consist/actions/workflows/ci.yml/badge.svg" alt="CI">
  </a>
  <img src="https://img.shields.io/badge/python-3.11+-blue.svg" alt="Python 3.11+">
  <a href="LICENSE"><img src="https://img.shields.io/badge/license-BSD--3--Clause-blue.svg" alt="License BSD 3-Clause"></a>
</p>

**Consist** is a caching layer for scientific simulation workflows that makes provenance queryable. It automatically
records what code, configuration, and input data produced each output in your pipeline—eliminating redundant computation
and enabling post-hoc inspection of results via SQL.

### Why Consist?

Multi-run simulation workflows typically accumulate friction:

- **Provenance ambiguity**: "Which configuration produced those results in Figure 3?"
- **Redundant computation**: Re-running a 4-hour pipeline because you changed one unrelated parameter.
- **Scattered outputs**: Finding and comparing results across scenario variants manually.
- **Hidden wiring**: Tools with implicit dependencies (name-based injection, global state) are hard to debug and modify
  when something breaks.

Consist tracks lineage explicitly. Tasks are ordinary Python functions; dependencies flow through concrete values, not
framework magic. Your pipeline remains inspectable and testable.

---

## Installation

```bash
pip install consist
```

Optional extras cover specific file formats and workflows (for example `ingest`, `hdf5`, `netcdf`, `zarr`, `omx`, `spatial`, `docker`, or `all`):

```bash
pip install "consist[ingest]"
pip install "consist[hdf5,zarr]"  # combine extras as needed
pip install "consist[all]"
```

> [!NOTE]
> Consist is pre-`1.0`. The library is ready for real workflows, but minor
> releases may still include breaking changes while the API continues to settle.

---

## Quick Example

```python
import consist
from pathlib import Path
from consist import ExecutionOptions, Tracker
import pandas as pd

tracker = Tracker(run_dir="./runs", db_path="./provenance.duckdb")


def clean_data(raw: Path, threshold: float = 0.5) -> dict[str, Path]:
    df = pd.read_parquet(raw)
    out = Path("./cleaned.parquet")
    df[df["value"] > threshold].to_parquet(out)
    return {"cleaned": out}


# Executes function and records inputs, config, and output artifact
result = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},  # hashed for cache identity
    config={"threshold": 0.5},  # hashed for cache identity
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

# Second call with identical inputs: instant cache hit, no execution
result = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},
    config={"threshold": 0.5},
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)

# Artifact: the output file with provenance metadata attached
artifact = result.outputs["cleaned"]
print(artifact.path)  # -> PosixPath('./cleaned.parquet')

# Load as a DataFrame when needed
cleaned_df = consist.load_df(artifact)
```

`input_binding="paths"` keeps the file boundary explicit: the function receives
the local `Path` values named in `inputs`, while those same inputs still define
cache identity and lineage.

**Summary**: Consist computes a deterministic fingerprint from your code version, config, and input files. If you change
anything upstream, only affected downstream steps will re-execute.
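The fingerprint idea can be sketched in a few lines of standard-library Python. This is **not** Consist's actual hashing code, just an illustration of how a deterministic cache key falls out of a code version, canonicalized config, and input-file contents:

```python
import hashlib
import json
from pathlib import Path


def fingerprint(code_version: str, config: dict, input_files: dict[str, Path]) -> str:
    """Illustrative cache key: hash the code version, the config serialized
    canonically, and the content hash of each named input file."""
    h = hashlib.sha256()
    h.update(code_version.encode())
    # Canonical JSON so the key order of the config dict doesn't matter
    h.update(json.dumps(config, sort_keys=True).encode())
    # Sort input names so iteration order doesn't matter either
    for name in sorted(input_files):
        h.update(name.encode())
        h.update(hashlib.sha256(input_files[name].read_bytes()).hexdigest().encode())
    return h.hexdigest()
```

With a scheme like this, only real changes produce a new key: editing the code version, a config value, or an input file's bytes changes the fingerprint, while dict insertion order and file timestamps do not.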

### Multi-Step Pipeline

Dependencies are explicit: the output of one step becomes the input of the next via a concrete reference, not name
matching or injection.

```python
def analyze_data(cleaned: Path) -> dict[str, Path]:
    df = pd.read_parquet(cleaned)
    out = Path("./analysis.parquet")
    summary = df.groupby("category")["value"].mean().reset_index()  # Series -> DataFrame for parquet
    summary.to_parquet(out)
    return {"analysis": out}


preprocess = tracker.run(
    fn=clean_data,
    inputs={"raw": Path("raw.parquet")},
    config={"threshold": 0.5},
    outputs=["cleaned"],
    execution_options=ExecutionOptions(input_binding="paths"),
)
analyze = tracker.run(
    fn=analyze_data,
    inputs={"cleaned": consist.ref(preprocess, key="cleaned")},  # explicit artifact reference
    outputs=["analysis"],
    execution_options=ExecutionOptions(input_binding="paths"),
)
```

Use `output_paths` when a function returns `None` but writes files, or when you need explicit destination control.

---

## Key Features

- **Deterministic Caching**: Cache identity is based on an inspectable fingerprint of code, config, and inputs. Only
  affected downstream steps re-execute when any upstream piece changes.
- **Plain Python**: Tasks are ordinary Python functions—callable and testable without the tracker. The tracker is
  additive and does not restructure your code.
- **Complete Lineage**: Every result is tagged with the exact code and config that created it. Trace lineage from any
  output back to its sources.
- **SQL-Native Analysis**: All metadata is indexed in DuckDB. Query across runs, join tables, and compare variants using
  standard SQL.
- **HPC and Container Support**: Run tasks in Docker and Singularity containers, with image digests and mounted
  volumes included in the cache signature. Ideal for long-running jobs on shared compute.
- **Queryable CLI**: Inspect history, trace lineage, and compare results from the command line after a job completes. No
  code required.
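
As a taste of the SQL-native analysis, here is the style of query a provenance database supports. The table and column names below are invented for illustration (Consist's real schema will differ), and `sqlite3` stands in for DuckDB only so the sketch is self-contained; the SQL itself is portable:

```python
import sqlite3  # stand-in for DuckDB in this self-contained sketch

# Hypothetical provenance table -- names are illustrative only.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE runs (run_id TEXT, task TEXT, config_threshold REAL, cached INTEGER)"
)
con.executemany(
    "INSERT INTO runs VALUES (?, ?, ?, ?)",
    [
        ("r1", "clean_data", 0.5, 0),
        ("r2", "clean_data", 0.5, 1),
        ("r3", "clean_data", 0.7, 0),
    ],
)

# Which threshold values have been run, and how often did the cache hit?
rows = con.execute(
    "SELECT config_threshold, COUNT(*) AS n, SUM(cached) AS cache_hits "
    "FROM runs GROUP BY config_threshold ORDER BY config_threshold"
).fetchall()
print(rows)  # [(0.5, 2, 1), (0.7, 1, 0)]
```

Because every run's code, config, and cache outcome land in one database, questions like "which variants have we already computed?" become a `GROUP BY` instead of a directory crawl.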

---

## Documentation Index

| Section                                                   | Description                                                  |
|:----------------------------------------------------------|:-------------------------------------------------------------|
| **[Getting Started](docs/getting-started/quickstart.md)** | 5-minute guide to your first tracked run.                    |
| **[Usage Guide](docs/usage-guide.md)**                    | Detailed patterns for scenarios and complex workflows.       |
| **[Architecture](docs/architecture.md)**                  | Deep dive into hashing, lineage, and the DuckDB core.        |
| **[CLI Reference](docs/cli-reference.md)**                | Guide to the `consist` command-line tools.                   |
| **[DB Maintenance](docs/db-maintenance.md)**              | Operational runbooks for inspect/doctor/purge/merge/rebuild. |
| **[Example Gallery](docs/examples.md)**                   | Interactive notebooks for Monte Carlo, Demand Modeling, etc. |

---

## Etymology

In railroad terminology, a **consist** (noun, pronounced *CON-sist*) is the specific lineup of locomotives and cars that
make up a train. In this library, a **consist** is the immutable record of exactly which components—code, config, and
inputs—were coupled together to produce a result.
