Metadata-Version: 2.4
Name: lakelogic
Version: 1.31.0
Summary: A declarative, contract-driven medallion pipeline engine for data mesh architectures. Write once. Run on Spark, Polars, or DuckDB.
Author-email: LakeLogic Team <lakelogic@gmail.com>
License: Apache-2.0
License-File: LICENSE
Keywords: data-contracts,data-engineering,data-governance,data-pipeline,data-quality,delta-lake,duckdb,etl,lakehouse,lineage,medallion-architecture,polars,quarantine,schema-validation,spark
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Software Development :: Quality Assurance
Classifier: Typing :: Typed
Requires-Python: >=3.10
Requires-Dist: deltalake>=0.15.0
Requires-Dist: duckdb>=0.9.0
Requires-Dist: httpx<1,>=0.27.0
Requires-Dist: jinja2>=3.1.0
Requires-Dist: loguru>=0.7.0
Requires-Dist: polars>=0.20.0
Requires-Dist: pyarrow>=14.0.0
Requires-Dist: pydantic<3,>=2.0.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: sqlglot>=20.0.0
Requires-Dist: typer>=0.9.0
Provides-Extra: ai
Requires-Dist: anthropic>=0.18.0; extra == 'ai'
Requires-Dist: google-genai>=0.5.0; extra == 'ai'
Requires-Dist: instructor>=1.5.0; extra == 'ai'
Requires-Dist: openai>=1.0.0; extra == 'ai'
Requires-Dist: typing-extensions>=4.12.0; extra == 'ai'
Provides-Extra: all
Requires-Dist: deltalake>=0.15.0; extra == 'all'
Requires-Dist: duckdb>=0.9.0; extra == 'all'
Requires-Dist: lxml>=4.9.0; extra == 'all'
Requires-Dist: openpyxl>=3.1.0; extra == 'all'
Requires-Dist: polars>=0.20.0; extra == 'all'
Requires-Dist: pyarrow>=14.0.0; extra == 'all'
Requires-Dist: sqlglot>=20.0.0; extra == 'all'
Requires-Dist: typer>=0.9.0; extra == 'all'
Provides-Extra: api
Requires-Dist: requests>=2.31.0; extra == 'api'
Provides-Extra: aws-messaging
Requires-Dist: boto3>=1.28.0; extra == 'aws-messaging'
Provides-Extra: azure
Requires-Dist: adlfs>=2023.10.0; extra == 'azure'
Requires-Dist: azure-identity>=1.15.0; extra == 'azure'
Requires-Dist: azure-keyvault-secrets>=4.7.0; extra == 'azure'
Requires-Dist: azure-storage-blob>=12.19.0; extra == 'azure'
Requires-Dist: cryptography>=41.0.0; extra == 'azure'
Requires-Dist: databricks-sdk>=0.18.0; extra == 'azure'
Requires-Dist: fsspec>=2023.10.0; extra == 'azure'
Provides-Extra: azure-messaging
Requires-Dist: azure-eventgrid>=4.17.0; extra == 'azure-messaging'
Requires-Dist: azure-identity>=1.15.0; extra == 'azure-messaging'
Requires-Dist: azure-servicebus>=7.11.0; extra == 'azure-messaging'
Provides-Extra: azuresql
Requires-Dist: azure-identity>=1.15.0; extra == 'azuresql'
Requires-Dist: pyodbc>=5.0.0; extra == 'azuresql'
Provides-Extra: bigquery
Requires-Dist: google-cloud-bigquery>=3.11.0; extra == 'bigquery'
Provides-Extra: bytewax
Requires-Dist: bytewax>=0.19.0; extra == 'bytewax'
Provides-Extra: cli
Requires-Dist: typer>=0.9.0; extra == 'cli'
Provides-Extra: cloud
Requires-Dist: azure-eventgrid>=4.17.0; extra == 'cloud'
Requires-Dist: azure-servicebus>=7.11.0; extra == 'cloud'
Requires-Dist: google-cloud-bigquery>=3.11.0; extra == 'cloud'
Requires-Dist: google-cloud-pubsub>=2.18.0; extra == 'cloud'
Requires-Dist: google-cloud-secret-manager>=2.16.0; extra == 'cloud'
Requires-Dist: google-cloud-storage>=2.10.0; extra == 'cloud'
Requires-Dist: google-genai>=0.5.0; extra == 'cloud'
Requires-Dist: snowflake-connector-python>=3.5.0; extra == 'cloud'
Provides-Extra: databases
Requires-Dist: azure-identity>=1.15.0; extra == 'databases'
Requires-Dist: psycopg2-binary>=2.9.0; extra == 'databases'
Requires-Dist: pymongo>=4.6.0; extra == 'databases'
Requires-Dist: pymysql>=1.1.0; extra == 'databases'
Requires-Dist: pyodbc>=5.0.0; extra == 'databases'
Provides-Extra: delta
Requires-Dist: azure-identity>=1.15.0; extra == 'delta'
Requires-Dist: azure-storage-blob>=12.19.0; extra == 'delta'
Requires-Dist: boto3>=1.28.0; extra == 'delta'
Requires-Dist: databricks-sdk>=0.18.0; extra == 'delta'
Requires-Dist: deltalake>=0.15.0; extra == 'delta'
Requires-Dist: google-cloud-storage>=2.10.0; extra == 'delta'
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: commitizen>=3.0.0; extra == 'dev'
Requires-Dist: git-cliff>=2.0.0; extra == 'dev'
Requires-Dist: hypothesis>=6.100.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: dlt
Requires-Dist: dlt[parquet]>=1.0; extra == 'dlt'
Requires-Dist: pyarrow>=14.0.0; extra == 'dlt'
Provides-Extra: docs
Requires-Dist: mkdocs-jupyter>=0.24.0; extra == 'docs'
Requires-Dist: mkdocs-material>=9.0.0; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.20.0; extra == 'docs'
Provides-Extra: duckdb
Requires-Dist: deltalake>=0.15.0; extra == 'duckdb'
Requires-Dist: duckdb>=0.9.0; extra == 'duckdb'
Requires-Dist: lxml>=4.9.0; extra == 'duckdb'
Requires-Dist: openpyxl>=3.1.0; extra == 'duckdb'
Requires-Dist: pandas>=2.0.0; extra == 'duckdb'
Requires-Dist: pyarrow>=14.0.0; extra == 'duckdb'
Provides-Extra: engines
Requires-Dist: deltalake>=0.15.0; extra == 'engines'
Requires-Dist: duckdb>=0.9.0; extra == 'engines'
Requires-Dist: lxml>=4.9.0; extra == 'engines'
Requires-Dist: openpyxl>=3.1.0; extra == 'engines'
Requires-Dist: polars>=0.20.0; extra == 'engines'
Requires-Dist: pyarrow>=14.0.0; extra == 'engines'
Requires-Dist: sqlglot>=20.0.0; extra == 'engines'
Provides-Extra: enterprise
Requires-Dist: bytewax>=0.19.0; extra == 'enterprise'
Requires-Dist: dataprofiler>=0.9.0; extra == 'enterprise'
Requires-Dist: nbclient>=0.9.0; extra == 'enterprise'
Requires-Dist: nbformat>=5.9.0; extra == 'enterprise'
Requires-Dist: presidio-analyzer>=2.2.0; extra == 'enterprise'
Requires-Dist: presidio-anonymizer>=2.2.0; extra == 'enterprise'
Requires-Dist: pyspark>=3.3.0; extra == 'enterprise'
Provides-Extra: extraction
Requires-Dist: pypandoc-binary>=1.13; extra == 'extraction'
Requires-Dist: spacy>=3.7.0; extra == 'extraction'
Requires-Dist: textblob>=0.18.0; extra == 'extraction'
Requires-Dist: unstructured[csv,doc,docx,md,pptx,rst,tsv,xlsx]>=0.12.0; extra == 'extraction'
Provides-Extra: extraction-ocr
Requires-Dist: pdfplumber>=0.10.0; extra == 'extraction-ocr'
Requires-Dist: pypandoc-binary>=1.13; extra == 'extraction-ocr'
Requires-Dist: rapidocr-onnxruntime>=1.4.0; extra == 'extraction-ocr'
Requires-Dist: spacy>=3.7.0; extra == 'extraction-ocr'
Requires-Dist: textblob>=0.18.0; extra == 'extraction-ocr'
Requires-Dist: unstructured[csv,doc,docx,md,pptx,rst,tsv,xlsx]>=0.12.0; extra == 'extraction-ocr'
Provides-Extra: gcp-messaging
Requires-Dist: google-cloud-pubsub>=2.18.0; extra == 'gcp-messaging'
Provides-Extra: integrations
Requires-Dist: azure-eventgrid>=4.17.0; extra == 'integrations'
Requires-Dist: azure-identity>=1.15.0; extra == 'integrations'
Requires-Dist: azure-servicebus>=7.11.0; extra == 'integrations'
Requires-Dist: boto3>=1.28.0; extra == 'integrations'
Requires-Dist: google-cloud-pubsub>=2.18.0; extra == 'integrations'
Requires-Dist: paramiko>=4.0.1; extra == 'integrations'
Requires-Dist: requests>=2.31.0; extra == 'integrations'
Provides-Extra: kafka
Requires-Dist: kafka-python>=2.0.2; extra == 'kafka'
Provides-Extra: mongodb
Requires-Dist: pymongo>=4.6.0; extra == 'mongodb'
Provides-Extra: mysql
Requires-Dist: pymysql>=1.1.0; extra == 'mysql'
Provides-Extra: nlp
Requires-Dist: spacy>=3.7.0; extra == 'nlp'
Requires-Dist: textblob>=0.18.0; extra == 'nlp'
Provides-Extra: notebook
Requires-Dist: nbclient>=0.9.0; extra == 'notebook'
Requires-Dist: nbformat>=5.9.0; extra == 'notebook'
Provides-Extra: notifications
Requires-Dist: apprise>=1.7.0; extra == 'notifications'
Requires-Dist: azure-identity>=1.15.0; extra == 'notifications'
Requires-Dist: azure-keyvault-secrets>=4.7.0; extra == 'notifications'
Requires-Dist: boto3>=1.28.0; extra == 'notifications'
Requires-Dist: cryptography>=41.0.0; extra == 'notifications'
Requires-Dist: google-cloud-secret-manager>=2.16.0; extra == 'notifications'
Requires-Dist: hvac>=2.0.0; extra == 'notifications'
Requires-Dist: jinja2>=3.1.0; extra == 'notifications'
Provides-Extra: notify
Requires-Dist: apprise>=1.7.0; extra == 'notify'
Requires-Dist: hvac>=2.0.0; extra == 'notify'
Provides-Extra: pathway
Requires-Dist: pathway>=0.15.0; (python_version >= '3.10' and python_version < '3.14') and extra == 'pathway'
Provides-Extra: pii
Requires-Dist: dataprofiler>=0.9.0; extra == 'pii'
Requires-Dist: presidio-analyzer>=2.2.0; extra == 'pii'
Requires-Dist: presidio-anonymizer>=2.2.0; extra == 'pii'
Provides-Extra: polars
Requires-Dist: deltalake>=0.15.0; extra == 'polars'
Requires-Dist: lxml>=4.9.0; extra == 'polars'
Requires-Dist: openpyxl>=3.1.0; extra == 'polars'
Requires-Dist: polars>=0.20.0; extra == 'polars'
Provides-Extra: postgresql
Requires-Dist: azure-identity>=1.15.0; extra == 'postgresql'
Requires-Dist: psycopg2-binary>=2.9.0; extra == 'postgresql'
Provides-Extra: profiling
Requires-Dist: dataprofiler>=0.9.0; extra == 'profiling'
Requires-Dist: presidio-analyzer>=2.2.0; extra == 'profiling'
Requires-Dist: presidio-anonymizer>=2.2.0; extra == 'profiling'
Provides-Extra: review
Requires-Dist: datacontract-cli>=0.10; extra == 'review'
Requires-Dist: ruff<0.14,>=0.6; extra == 'review'
Requires-Dist: sqlfluff<4,>=3; extra == 'review'
Provides-Extra: sftp
Requires-Dist: paramiko>=4.0.1; extra == 'sftp'
Provides-Extra: snowflake
Requires-Dist: snowflake-connector-python>=3.5.0; extra == 'snowflake'
Provides-Extra: spark
Requires-Dist: pyspark>=3.3.0; extra == 'spark'
Provides-Extra: sse
Requires-Dist: sseclient-py>=1.8.0; extra == 'sse'
Provides-Extra: streaming
Requires-Dist: bytewax>=0.19.0; extra == 'streaming'
Requires-Dist: kafka-python>=2.0.2; extra == 'streaming'
Requires-Dist: pathway>=0.15.0; (python_version >= '3.10' and python_version < '3.14') and extra == 'streaming'
Requires-Dist: sseclient-py>=1.8.0; extra == 'streaming'
Requires-Dist: websocket-client>=1.6.0; extra == 'streaming'
Provides-Extra: websocket
Requires-Dist: websocket-client>=1.6.0; extra == 'websocket'
Description-Content-Type: text/markdown

# LakeLogic

**Your Data Estate. Under Contract.**

[![Documentation](https://img.shields.io/badge/docs-GitHub%20Pages-blue)](https://LakeLogic.github.io/LakeLogic/)
[![PyPI](https://img.shields.io/pypi/v/lakelogic?logo=pypi&logoColor=white)](https://pypi.org/project/lakelogic/)
[![CI](https://github.com/LakeLogic/LakeLogic/actions/workflows/ci-gate.yml/badge.svg)](https://github.com/LakeLogic/LakeLogic/actions/workflows/ci-gate.yml)
[![codecov](https://codecov.io/gh/LakeLogic/LakeLogic/graph/badge.svg)](https://codecov.io/gh/LakeLogic/LakeLogic)
[![Python](https://img.shields.io/badge/python-3.10+-blue?logo=python&logoColor=white)](https://www.python.org)
[![License](https://img.shields.io/badge/license-Apache%202.0-green)](LICENSE)

**Executable + Enforceable Data Contracts. Validate at runtime. Block bad merges in CI/CD.**

> Describe your data products in YAML — LakeLogic materializes them as Delta/Iceberg tables with lineage, quality, and SCD2 built in.
>
> Write once. Run on [Spark](https://spark.apache.org/), [Polars](https://pola.rs/), or [DuckDB](https://duckdb.org/).
> **Data Contracts as Code — the executable layer for data mesh.**

![LakeLogic Architecture](docs/assets/lakelogic_architecture.png)

> **One contract. Executed at runtime. Enforced in CI/CD.** Every row flows through the same gates — across Spark, Polars, or DuckDB — with bad data quarantined and breaking changes blocked before merge.

---

## Data Mesh Alignment

LakeLogic is the missing runtime layer for Data Mesh — where domain ownership and federated governance stop being principles and start being enforced.

| Pillar | How LakeLogic Delivers |
| :--- | :--- |
| **Domain Ownership** | Contracts are owned and defined by domain teams (e.g., CRM, Finance) who know the data best. |
| **Data as a Product** | The contract IS the product interface — a versioned, schema-enforced, SLA-backed guarantee that consuming teams can depend on. |
| **Self-Serve Platform** | A standardized runtime that any team can use to deploy quality gates without infra silos. |
| **Federated Governance** | PII masking rules, SLA thresholds, and schema standards defined once in a central registry — automatically enforced at every domain pipeline. |

---

## Quick Start

```bash
pip install lakelogic
```

**Runtime — execute a contract against your data:**

```python
from lakelogic import DataProcessor

result = DataProcessor("contract.yaml").run_source()
print(f"Valid: {result.good_count}  |  Quarantined: {result.bad_count}  |  Quality: {result.quality_score:.1f}")
```

**CI/CD — block bad contract changes before they merge:**

```bash
# Static validation — no data needed
lakelogic validate \
  --contract contract.yaml \
  --gates breaking_change,pii_classification,lineage_break
```

Drop `lakelogic validate` into your GitHub Actions workflow to enforce schema, PII, and lineage standards on every pull request.

---

## Technical Capabilities

### Data Quality & Trust

- **100% Reconciliation** — Every source row is accounted for: `source_count = good_count + bad_count`. Rows that fail validation are quarantined, never silently dropped
- **[Pydantic](https://docs.pydantic.dev/)-Powered Validation** — Every contract, system config, and domain config is parsed through strict Pydantic models with `Literal` type enforcement, so invalid YAML is caught at load time, not at runtime
- **SQL-First Rules** — Define business logic in the language your team already speaks — no SDK, no custom DSL
- **SLO Monitoring & Anomaly Detection** — Native freshness, row count, and statistical anomaly detection with automatic multi-channel alerting when thresholds breach
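
To make the statistical anomaly detection bullet concrete, here is a minimal sketch of a row-count check based on a z-score. It is not LakeLogic's implementation: the function name `is_row_count_anomaly`, the default threshold of 3 standard deviations, and the sample history are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def is_row_count_anomaly(history, current, z_threshold=3.0):
    """Return True if `current` is more than `z_threshold` sample standard
    deviations away from the mean of `history` (hypothetical helper)."""
    if len(history) < 2:
        return False  # not enough history to estimate spread
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All historical counts are identical: any difference is anomalous
        return current != mu
    return abs(current - mu) / sigma > z_threshold


# Made-up row counts from five previous runs
history = [10_120, 9_980, 10_050, 10_210, 9_940]

print(is_row_count_anomaly(history, 10_100))  # close to the historical mean
print(is_row_count_anomaly(history, 2_300))   # far below the historical mean
```

A check like this would typically run after each load, with a breach routed to whatever alerting channel the pipeline uses.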

> **[✏️ Try it out in Google Colab: Data Quality & Trust](https://colab.research.google.com/github/LakeLogic/LakeLogic/blob/main/examples/colab/01_data_quality_trust.ipynb)**

### Compliance & Governance

- **Contract Gates (CI/CD Enforcement)** — Static-analysis gates that block PRs introducing breaking schema changes, unmasked PII, or broken lineage. Run via `lakelogic validate --gates breaking_change,pii_classification,lineage_break` before any data touches your pipeline
- **GDPR & HIPAA Compliance** — Contract-driven `forget_subjects()` with nullify, hash, or redact strategies and immutable audit trail
- **Zero-Retention Architecture** — Built-in `zero_retention_days` enforcement for transient data layers, automatically purging micro-batches after successful downstream processing
- **Automated PII Handling** — Declarative encryption and hashing (`pii: true`, `masking: "encrypt"`) applied at the Bronze layer, before data is written at rest
- **Pipeline Cost Intelligence** — Per-entity compute cost attribution with domain-level budget governance, autoscaling-aware estimation, and Databricks Unity Catalog billing integration
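
The idea behind declarative hash masking can be sketched in a few lines of plain Python. The README does not say which algorithm LakeLogic uses for `masking: "hash"`, so the keyed HMAC-SHA256 below is an assumption, as are the helper names `mask_hash` and `mask_row` and the demo key.

```python
import hashlib
import hmac


def mask_hash(value, secret_key):
    """Replace a value with a keyed SHA-256 digest (hypothetical helper).
    None is passed through so missing values stay missing."""
    if value is None:
        return None
    digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def mask_row(row, pii_fields, secret_key):
    """Return a copy of `row` with every field listed in `pii_fields` hashed."""
    return {
        name: mask_hash(value, secret_key) if name in pii_fields else value
        for name, value in row.items()
    }


KEY = b"demo-key-not-for-production"  # assumption: a real key would come from a secret store
row = {"customer_id": 42, "email": "ada@example.com", "status": "active"}
masked = mask_row(row, {"email"}, KEY)
print(masked)
```

Because the digest is deterministic for a given key, masked values can still be used as join keys or for deduplication without exposing the original value.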

> **[✏️ Try it out in Google Colab: Compliance & Governance](https://colab.research.google.com/github/LakeLogic/LakeLogic/blob/main/examples/colab/02_compliance_governance.ipynb)**

### Engine & Scale

- **Engine Agnostic** — Write once, run on [Spark](https://spark.apache.org/), [Polars](https://pola.rs/), or [DuckDB](https://duckdb.org/) — same contract, zero code changes
- **Multi-Format Materialization** — Natively output validated data to **Apache Iceberg** or **Delta Lake** open-table formats without requiring pipeline rewrites
- **Dimensional Modeling** — Native SCD Type 2 (slowly changing dimensions), merge/upsert (SCD1), append-only fact tables, periodic snapshot overwrites, and partition-aware writes — all declared in YAML, no manual `MERGE INTO` SQL required
- **Incremental-First** — Built-in watermarking, CDC, and file-mtime tracking
- **Parallel Processing** — Concurrent multi-contract execution with data-layer-aware orchestration and topological dependency ordering
- **Backfill & Reprocessing** — Targeted late-arriving data reprocessing with partition-aware filters — no full reload required
- **External Logic** — Plug in custom Python scripts or notebooks for complex Gold-layer transformations while preserving full contract validation and lineage
- **Production Resilience** — Built-in exponential-backoff retries, per-entity timeouts, and circuit-breaker thresholds (`max_consecutive_failures`) — pipelines self-heal transient failures without operator intervention
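
For readers unfamiliar with SCD Type 2, the following sketch shows the bookkeeping that a declarative SCD2 strategy replaces. It is a plain-Python illustration, not LakeLogic's merge logic: the function `apply_scd2` and the column names `valid_from`, `valid_to`, and `is_current` are assumptions.

```python
from datetime import date


def apply_scd2(dimension, incoming, key, tracked, as_of):
    """Return a new dimension with SCD Type 2 history applied (hypothetical helper).

    A changed record closes its current version and appends a new one;
    an unchanged record is left alone; an unseen key is inserted as current.
    """
    result = [dict(r) for r in dimension]  # copy so the input is not mutated
    current = {r[key]: r for r in result if r["is_current"]}
    for new in incoming:
        old = current.get(new[key])
        if old is not None and all(old[c] == new[c] for c in tracked):
            continue  # no tracked column changed, nothing to do
        if old is not None:
            old["valid_to"] = as_of      # close the previous version
            old["is_current"] = False
        version = {**new, "valid_from": as_of, "valid_to": None, "is_current": True}
        result.append(version)
        current[new[key]] = version
    return result


dim = [{"customer_id": 1, "status": "active",
        "valid_from": date(2024, 1, 1), "valid_to": None, "is_current": True}]
updates = [{"customer_id": 1, "status": "churned"},
           {"customer_id": 2, "status": "pending"}]

dim = apply_scd2(dim, updates, key="customer_id", tracked=["status"], as_of=date(2024, 6, 1))
for record in dim:
    print(record)
```

Re-applying the same `updates` leaves the dimension unchanged, which is the property that makes this kind of merge safe to re-run.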

> **[✏️ Try it out in Google Colab: Engine & Scale](https://colab.research.google.com/github/LakeLogic/LakeLogic/blob/main/examples/colab/03_engine_scale.ipynb)**

### Developer Experience

- **Structured Diagnostics & Observability** — Deep contextual logging out-of-the-box (powered by [`loguru`](https://loguru.readthedocs.io/)) featuring precise timestamps, severity levels, exact function paths, and execution tags to drastically cut troubleshooting time
- **Dry Run Mode** — Validate contracts, resolve dependencies, and preview execution plans without touching any data
- **DDL-Only Mode** — Generate and apply schema DDL (CREATE/ALTER) from contracts without running the pipeline — perfect for CI/CD migrations
- **DAG Dependency Viewer** — Visualize cross-contract lineage and execution order before running — understand your pipeline graph at a glance
- **Data Reset & Reload** — Surgically reset and reload specific entities or data layers (Bronze/Silver/Gold) without impacting the rest of the lakehouse
- **Multi-Channel Alerts** — Powered by [Apprise](https://github.com/caronc/apprise) for Slack, Email (SMTP/SendGrid), Teams, and Webhook notifications with ownership-based auto-routing and full [Jinja2](https://jinja.palletsprojects.com/) templating support for custom formatting
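
The execution order that a DAG viewer displays can be derived with a topological sort. The sketch below uses Python's standard-library `graphlib` to group contracts into waves that could run in parallel; the entity names and the `execution_waves` helper are hypothetical, and this is not how LakeLogic is known to compute its plan.

```python
from graphlib import TopologicalSorter

# Each contract maps to the set of contracts it depends on (made-up names)
deps = {
    "bronze_customers": set(),
    "bronze_orders": set(),
    "silver_customers": {"bronze_customers"},
    "silver_orders": {"bronze_orders"},
    "gold_customer_ltv": {"silver_customers", "silver_orders"},
}


def execution_waves(dependencies):
    """Group nodes into waves; every node in a wave has all of its
    dependencies satisfied by earlier waves."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(execution_waves(deps), start=1):
    print(number, wave)
```

Contracts inside one wave have no dependencies on each other, so they are candidates for concurrent execution.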

> **[✏️ Try it out in Google Colab: Developer Experience](https://colab.research.google.com/github/LakeLogic/LakeLogic/blob/main/examples/colab/04_developer_experience.ipynb)**

### Data Generation & AI

- **Synthetic Data** — Built-in `DataGenerator` (powered by [Faker](https://faker.readthedocs.io/)) with streaming simulation, time-windowed output, referential integrity, and edge case injection — generate realistic error rows (SQL injection, type confusion, boundary values) for stress testing and quarantine validation
- **Descriptive AI Test Data** — Steer synthetic data generation with natural language prompts (e.g. *"Generate users who are French or Japanese only, enterprise-tier, over 60 years old with SQL injection attempts in email fields"*) — output strictly adheres to the YAML contract schema
- **AI Contract Onboarding** — `lakelogic infer` auto-generates contracts from sample data with LLM-powered enrichment: automatic PII detection, column labelling, and quality rule suggestions
- **Unstructured Processing** — LLM-based extraction from PDFs, images, and audio, with the same contract validation and lineage as structured sources
- **Automated Run Logs** — Every pipeline run emits structured JSON with row counts, quality scores, durations, and error details — queryable as a Delta table
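
Edge case injection is easy to picture with a small, seeded generator. The sketch below does not use LakeLogic's `DataGenerator` or Faker; the function `generate_customers`, the `error_rate` parameter, the `is_edge_case` marker column, and the list of malformed emails are assumptions made for illustration.

```python
import random

# Deliberately malformed or hostile values to mix into the email column
EDGE_CASE_EMAILS = ["", "no-at-sign", "x' OR '1'='1", "a" * 300 + "@example.com"]
STATUSES = ["active", "churned", "pending"]


def generate_customers(n, error_rate=0.2, seed=0):
    """Generate `n` customer rows, replacing the email with an edge case
    in roughly `error_rate` of them (hypothetical helper)."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    rows = []
    for customer_id in range(1, n + 1):
        inject = rng.random() < error_rate
        email = rng.choice(EDGE_CASE_EMAILS) if inject else f"user{customer_id}@example.com"
        rows.append({
            "customer_id": customer_id,
            "email": email,
            "status": rng.choice(STATUSES),
            "is_edge_case": inject,
        })
    return rows


sample = generate_customers(10, error_rate=0.3, seed=7)
for row in sample:
    print(row)
```

Feeding rows like these through a contract is one way to confirm that the quarantine path catches what it is supposed to catch.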

> **[✏️ Try it out in Google Colab: Data Generation & AI](https://colab.research.google.com/github/LakeLogic/LakeLogic/blob/main/examples/colab/05_data_generation_ai.ipynb)**

### Integrations

- **[dbt](https://www.getdbt.com/) Adapter** — Import dbt `schema.yml` models and sources as LakeLogic contracts — reuse existing dbt definitions without rewriting
- **[dlt](https://dlthub.com/) (Data Load Tool)** — Native `DltAdapter` supporting 100+ verified sources (Stripe, Shopify, SQL databases, Google Analytics, and more) plus declarative REST API ingestion — all with contract-driven quality gates on arrival
- **Native Streaming Connectors** — Built-in `WebSocketConnector`, `SSEConnector`, `KafkaConnector`, `WebhookConnector` (plus Azure Event Grid, Service Bus, AWS SQS, GCP Pub/Sub) with pre-validation rename transformations for real-time feeds
- **Native Database Ingestion** — High-performance SQL extraction via [Polars/ConnectorX](https://pola.rs/) and [DuckDB](https://duckdb.org/) — PostgreSQL, MySQL, SQL Server, SQLite with automatic dialect detection
- **Incremental CDC** — Watermark-based change data capture with automatic state tracking — only processes rows newer than the last run
- **Batch Processing** — Memory-safe chunked ingestion via `fetch_size` for massive initial loads — handles 100GB+ tables without OOM
- **Column Projection Pushdown** — Automatically constructs `SELECT` queries from `model.fields` — only extracts what the contract declares
- **Cloud Data Sources** — Native `abfss://`, `s3://`, `gs://` URI support with automatic credential resolution via `CloudCredentialResolver` — Azure AD, AWS IAM, GCP ADC, service principals, and Databricks secret scopes
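
Column projection and watermark-based extraction combine naturally into a single query. The sketch below builds such a query from a list of declared field names and runs it against an in-memory SQLite table; the helper `build_incremental_query`, the `updated_at` watermark column, and the sample data are assumptions, and the generated SQL is not claimed to match what LakeLogic emits.

```python
import sqlite3


def build_incremental_query(table, fields, watermark_column):
    """Select only the declared fields, restricted to rows newer than a
    watermark supplied as a bound parameter (hypothetical helper).
    Identifiers are assumed to come from a trusted contract."""
    columns = ", ".join(f'"{name}"' for name in fields)
    return (
        f'SELECT {columns} FROM "{table}" '
        f'WHERE "{watermark_column}" > ? ORDER BY "{watermark_column}"'
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers "
    "(customer_id INTEGER, email TEXT, status TEXT, internal_notes TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (1, "a@example.com", "active", "vip", "2024-01-01T00:00:00"),
        (2, "b@example.com", "pending", "", "2024-02-01T00:00:00"),
        (3, "c@example.com", "churned", "", "2024-03-01T00:00:00"),
    ],
)

# internal_notes is not declared, so it is never extracted
fields = ["customer_id", "email", "status", "updated_at"]
query = build_incremental_query("customers", fields, "updated_at")

last_watermark = "2024-01-15T00:00:00"  # state saved by the previous run
rows = conn.execute(query, (last_watermark,)).fetchall()

# Advance the watermark to the newest row seen, or keep it if nothing arrived
new_watermark = rows[-1][fields.index("updated_at")] if rows else last_watermark
print(query)
print(rows)
print(new_watermark)
```

ISO 8601 timestamps stored as text compare correctly as strings, which is why the plain `>` comparison works here.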

> **[✏️ Try it out in Google Colab: Integrations](https://colab.research.google.com/github/LakeLogic/LakeLogic/blob/main/examples/colab/06_integrations.ipynb)**

---

## What a Contract Looks Like

One YAML file replaces hundreds of lines of ingestion, validation, and materialization code:

```yaml
version: "1.0"
info:
  title: "Silver Customers"
  domain: "CRM"
  system: "Salesforce"

model:
  fields:
    - name: customer_id
      type: integer
      required: true
    - name: email
      type: string
      pii: true
      masking: "hash"
    - name: status
      type: string

transformations:
  - deduplicate: [customer_id]
  - sql: "SELECT *, UPPER(status) AS status_norm FROM source"
    phase: pre

quality:
  row_rules:
    - sql: "email LIKE '%@%.%'"
    - sql: "status IN ('active', 'churned', 'pending')"
  dataset_rules:
    - unique: customer_id

materialization:
  strategy: merge
  merge_keys: [customer_id]
  format: iceberg  # natively supports iceberg, delta, parquet, csv
```

Same contract, **any engine** — swap `engine="polars"` for `"spark"` or `"duckdb"`. Zero code changes.

> **Analogy:** A contract is like a building inspection checklist. The inspector (LakeLogic) checks every room (row) against the blueprint (schema), flags violations (quarantine), and stamps a certificate (lineage) — regardless of whether the building was constructed with bricks (Spark), timber (Polars), or prefab (DuckDB).
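
As a rough illustration of how the `row_rules` in the contract above split rows into good and quarantined sets, the sketch below evaluates the same two SQL predicates with Python's built-in `sqlite3` module. It does not use LakeLogic itself; the sample rows, the `rule_N` column names, and the way failure reasons are recorded are assumptions made for the example.

```python
import sqlite3

# The two row rules from the contract above, reused verbatim
ROW_RULES = [
    "email LIKE '%@%.%'",
    "status IN ('active', 'churned', 'pending')",
]

rows = [
    (1, "ada@example.com", "active"),
    (2, "not-an-email", "active"),
    (3, "bob@example.com", "deleted"),
    (4, "eve@example.com", "pending"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (customer_id INTEGER, email TEXT, status TEXT)")
conn.executemany("INSERT INTO source VALUES (?, ?, ?)", rows)

# One 0/1 column per rule; COALESCE treats a NULL result as a failure
checks = ", ".join(
    f"COALESCE(({rule}), 0) AS rule_{i}" for i, rule in enumerate(ROW_RULES)
)
sql = f"SELECT customer_id, email, status, {checks} FROM source ORDER BY customer_id"

good, bad = [], []
for record in conn.execute(sql):
    data, flags = record[:3], record[3:]
    failed = [rule for rule, passed in zip(ROW_RULES, flags) if not passed]
    if failed:
        bad.append((data, failed))  # quarantine the row with its failure reasons
    else:
        good.append(data)

# Reconciliation: every source row lands in exactly one bucket
assert len(rows) == len(good) + len(bad)
print("good:", good)
print("quarantined:", bad)
```

Keeping the failed rule text next to each quarantined row is what makes the quarantine table useful for debugging upstream sources.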

### What this buys you

| Without LakeLogic | With LakeLogic |
| :--- | :--- |
| 500+ lines of PySpark/Pandas validation per table | 40 lines of YAML |
| Bad rows silently dropped or crash the pipeline | Bad rows quarantined with error reasons |
| Schema drift discovered in production dashboards | Schema drift caught at ingestion **and blocked in CI/CD** |
| Manual dedup scripts per team | `deduplicate: [key]` — one line |
| PII scattered across notebooks | `pii: true, masking: hash` — automatic |
| Breaking contract changes shipped to prod | `lakelogic validate --gates breaking_change` blocks the PR |
| No audit trail | Every row stamped with run ID, source path, timestamp |

> [!TIP]
> **[View the Complete Contract Reference](docs/contract_template.md)** for every available configuration option.

---

## Architecture

LakeLogic enforces Data Contracts as quality gates across the Medallion Architecture (Bronze → Silver → Gold). Each layer uses its own contract:

| Layer | Role | Guarantee |
| :--- | :--- | :--- |
| **Bronze** | Capture everything raw, no validation | Immutable record of source |
| **Silver** | Full validation, business rules, dedup | Trusted, queryable data |
| **Gold** | Aggregations, KPIs, ML features | Analytics-ready datasets |
| **Quarantine** | Failed rows isolated with error reasons | Nothing silently dropped |

**Key Guarantee:** `source_count = good_count + bad_count` — 100% reconciliation, always.

## Examples

For a complete list of runnable guides and end-to-end notebooks, please visit the **[Examples section of our Documentation](https://lakelogic.github.io/LakeLogic/examples.html)**.

---

## Documentation

For full guides, API references, tutorials, and contract templates, please visit the **[LakeLogic Documentation Site](https://lakelogic.github.io/LakeLogic/)**.

## Contributing

See `CONTRIBUTING.md` to get started, or `docs/installation.md#developer-installation` for environment setup.

---

## License

Apache-2.0
