Metadata-Version: 2.4
Name: substrate-data
Version: 0.1.0
Summary: The data layer for AI agents. Extract, transform, and expose data via MCP.
Author: Ian McLaughlin
License-Expression: MIT
Requires-Python: >=3.11
Requires-Dist: duckdb>=1.0
Requires-Dist: httpx>=0.27
Requires-Dist: mcp[cli]>=1.0
Requires-Dist: psycopg[binary]>=3.1
Requires-Dist: structlog>=24.0
Description-Content-Type: text/markdown

# Substrate

**Canonicalize data from any source — APIs, databases, PDFs, agent outputs, ML models — into structured, joinable, traceable, portable bundles your AI agent can query.**

Extract once, query forever, trace everything.

## The Thesis

Most useful data lives behind APIs, in databases, in PDFs, in agent transcripts, in model outputs — and most of it stays trapped there. Every Claude session that needs real data either hallucinates or scrapes something live. Slow, expensive, no provenance, different results every time.

Substrate is a small kernel that turns *any* of these sources into typed, canonical, joinable rows in a portable DuckDB bundle, with per-row provenance preserved through every transform. Three ops, two extensibility layers, one bundle format. Any agent can attach and query.

## The Four Properties

| Property | What it means |
|---|---|
| **Structured** | Typed rows, declared schemas, asserts that fail loudly |
| **Joinable** | Canonical-key registry types cross-source joins |
| **Traceable** | Per-row provenance (run_id, source, model, cost, sql_hash) follows every transform |
| **Portable** | DuckDB bundle is one file, attaches over HTTP, no infrastructure |

dbt has structured + joinable. Iceberg has structured + portable. Fivetran has structured. Substrate is the only thing that brings all four together.

## Install

```bash
pip install substrate-data
```

The PyPI distribution is `substrate-data`; the Python import and CLI remain
`substrate`. The bare `substrate` package name is owned by an unrelated
archived AI SDK.

Domain libraries are separate packages — install what you need:

```bash
pip install substrate-govdata     # 26 US agency connectors + 21 datasets
pip install substrate-llm         # LLM extraction recipes (OpenRouter / Anthropic)
pip install substrate-pdf         # PDF text + section extraction recipes
```

## Hello World (no external services required)

```python
from substrate import DuckDBStore
from substrate.connectors.csv_file import CSVConnector

csv = CSVConnector({
    "path": "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
})

store = DuckDBStore("./hello.duckdb")
store.sync(csv, "titanic")

store.derive(
    "survival_by_class",
    "SELECT Pclass, AVG(CAST(Survived AS DOUBLE)) AS survival_rate "
    "FROM titanic GROUP BY Pclass ORDER BY Pclass",
    depends_on=["titanic"],
)

for row in store.query("SELECT * FROM survival_by_class"):
    print(row)
```

Then expose it to Claude:

```bash
substrate serve ./hello.duckdb
```

## The Kernel: Three Ops

```
sync       — external source → table          (connectors live here)
transform  — table → table via callable       (recipes live here)
derive     — SQL transform across tables      (joins, canonicalize, project)
```

Plus three orchestrators: `materialize` runs the DAG, `publish` ships a bundle to a Registry (R2, GitHub Releases, local dir), `pull`/`from_url` fetches one and ATTACHes it via DuckDB httpfs.

> **Compatibility.** `extract_text`, `section_extract`, and `llm_extract`
> remain available as convenience methods. They are now wrappers over
> `transform`; the next package split moves their implementation into
> `substrate-pdf` and `substrate-llm`.

## Operational Stores vs Bundle Format

Do not confuse where a pipeline runs with what gets distributed.

| Concern | What it is for | Substrate contract |
|---|---|---|
| **Execution store** | Runs `sync`, `transform`, `derive`, asserts, and `materialize` | `ExecutionStore` |
| **Bundle publisher** | Ships a portable DuckDB artifact to agents, analysts, MCP/API consumers | `BundlePublisher` |

`DuckDBStore` satisfies both contracts today: it can run pipelines locally and
publish the same file as a bundle. Flightdeck-style apps should keep Postgres
as the operational store for concurrency, tenancy, RLS, jobs, and hot reads.
That path is `PostgresStore`: it implements `ExecutionStore` against one
schema in an application Postgres database, with an explicit export bridge that
produces DuckDB bundles for distribution.

`pull()` and `from_url()` stay DuckDB-specific because they are consumer-side
bundle conveniences, not requirements for every operational backend.

```python
from substrate import PostgresStore

store = PostgresStore(settings.DATABASE_URL, schema=f"tenant_{workspace_id}")
result = store.materialize(my_dataset)

# Operational tables live in Postgres; portable consumers still get DuckDB.
bundle = store.export_to_duckdb("./workspace_bundle.duckdb", overwrite=True)
```

## Two Extensibility Layers

### Connectors — for "I have a data source"

Implement three methods on `BaseConnector` and you have a substrate connector. The kernel handles state, manifests, retries with backoff, fan_out across parameter combos, partial-failure resumption, MCP exposure, provenance.

```python
from substrate.connectors.base import BaseConnector

class MyAPIConnector(BaseConnector):
    @classmethod
    def connector_type(cls) -> str:
        return "myapi"

    def list_streams(self):
        return [StreamInfo(name="orders"), StreamInfo(name="customers")]

    def get_stream_schema(self, stream):
        ...

    def extract_streaming(self, stream, stream_filters=None):
        for page in self.paginate(stream, stream_filters):
            yield from page
```

Built-in: `CSVConnector`, `PostgresConnector`, `RESTAPIConnector`. Domain packages add their own — `substrate-govdata` ships 26 US federal agency connectors with canonical join keys (FIPS, NAICS, year).

### Recipes — for "I have unstructured stuff that needs structure"

A recipe is a batch-capable callable that turns source rows into one or more
output rows, with declared output schema and provenance facets. Simple per-row
recipes subclass `RowRecipe`; cost-sensitive recipes can process batches
directly. PDFs, OCR, LLM extraction, vision classification, agent loops, audio
transcription — every modality becomes a recipe on the same `transform` op.

```python
from substrate import ColumnSpec, RowRecipe, TransformContext

class UpperTitle(RowRecipe):
    output_schema = [ColumnSpec("upper_title", "VARCHAR")]
    recipe_name = "examples.upper_title"

    def transform_row(self, row: dict, context: TransformContext) -> dict:
        return {"upper_title": row["title"].upper()}

store.transform(
    "upper_docs",
    source="docs",
    recipe=UpperTitle(),
    pass_through_columns=["id"],
)
```

Transform outputs get common provenance (`__substrate_recipe`,
`__substrate_source_row_id`, `__substrate_input_hash`, etc.). Row or batch
failures can be recorded in `_substrate_transform_errors`; error thresholds
decide whether the transform blocks.

```python
from substrate import DuckDBStore
from substrate_pdf import extract_text
from substrate_llm import llm_extract

store.sync(pdf_connector, "pdfs")              # URLs / paths → table
extract_text(store, "pdfs", "acfr_text")       # PDF rows → text rows
llm_extract(                                   # text rows → structured rows
    store, "acfr_text", "acfr_financials",
    prompt="Extract total_revenue_usd, total_expenses_usd, unfunded_opeb_usd.",
    schema={
        "total_revenue_usd": "DOUBLE",
        "total_expenses_usd": "DOUBLE",
        "unfunded_opeb_usd": "DOUBLE",
    },
)
```

Real example: `govdata/examples/acfr_extraction.py` extracts 56 financial fields from 168 municipal audit PDFs for $1.87 total. Every row traces to its source PDF page, LLM model, prompt hash, and cost.

## Provenance by Default

Every substrate-managed table carries provenance columns, but the exact column
set depends on the operation that emitted the row:

Sync tables record source-call lineage:

```python
__substrate_run_id
__substrate_synced_at
__substrate_source_stream
__substrate_source_filter
__substrate_connector_type
```

Derive tables record SQL lineage:

```python
__substrate_run_id
__substrate_sql_hash
__substrate_depends_on
__substrate_derived_at
__substrate_contributor
```

Recipe outputs add recipe-specific facets (cost, model, prompt_hash,
confidence, bbox, source page, extraction status, etc.) — declared by each
recipe, no hand-maintenance.

```python
store.query("""
  SELECT __substrate_model, __substrate_cost_usd
  FROM acfr_financials
  LIMIT 1
""")
# [{'__substrate_model': 'anthropic/claude-sonnet-4', '__substrate_cost_usd': 0.011}]
```

## Datasets and Bundles

A `Dataset` is a declarative recipe — syncs + derives + asserts + canonical join keys. `materialize(dataset)` runs the whole DAG. `publish(dataset, registry)` ships a versioned bundle to R2 (or GitHub Releases, or a local dir) plus an `index.json` entry. Anyone with the URL can `attach` it from DuckDB without installing substrate:

```sql
INSTALL httpfs; LOAD httpfs;
ATTACH 'https://substrate.example.com/datasets/economy/state_panel/v1/store.duckdb'
       AS state_panel (READ_ONLY);
SELECT * FROM state_panel.state_panel WHERE year = 2023;
```

Or via the substrate library:

```python
from substrate import DuckDBStore
store = DuckDBStore.from_url("https://substrate.example.com/.../store.duckdb")
store.query("""
  SELECT period_label, value
  FROM fred_unemployment_rate
  ORDER BY date DESC
  LIMIT 5
""")
```

## Build Plane vs Serve Plane

Substrate has two separate hardening surfaces:

| Plane | Goal | Hardening that belongs here |
|---|---|---|
| **Build plane** | Create correct bundles | connector contracts, fan-out provenance, min-row checks, canonical-key conformance, join-key column contract checks, failed-assert blocking |
| **Serve plane** | Make bundles safe and easy to use | catalog metadata, `raw` vs `curated`, `live` vs `draft` vs `broken`, MCP attach/query workflow, API schema, honest status/quality signals |

Keeping these separate matters. A dataset can be a good recipe but not yet materialized locally. A bundle can be queryable but still draft. MCP should never hide those states: agents need to know both whether data is available now and whether the recipe has passed its build checks.

### Real-data smoke checks

The repo includes smoke scripts that hit real upstream sources and exercise both
planes:

```bash
# Build plane: materialize raw, heavy fan-out, and curated cross-agency bundles.
uv run python scripts/smoke_real_datasets.py

# Serve plane: start the platform, then call MCP with a real MCP client.
cd platform/backend
SUBSTRATE_PLATFORM_HOME=/tmp/substrate-e2e/platform \
  uv run uvicorn main:app --host 127.0.0.1 --port 8790

# In another terminal:
uv run python scripts/smoke_mcp_client.py \
  --url http://127.0.0.1:8790/mcp/mcp \
  --dataset census_acs1_state

# Operational Postgres path: run in a temp schema, export to DuckDB, query both.
SUBSTRATE_POSTGRES_TEST_DSN=postgresql://... \
  uv run python scripts/smoke_postgres_store.py
```

The default build smoke materializes `census_acs1_state`,
`fred_macro_indicators`, and `state_panel`. The FRED tables use
`period_label`, not `period`, as their canonical time key.

## Cross-Source Joins via Canonical Keys

The killer feature of the connector ecosystem: joins across sources are *typed*. govdata declares `state_fips`, `state_abbrev`, `county_fips`, `cbsa`, `year`, `naics`, `cik`, `ein`, etc. as canonical kinds. Any dataset declaring `JoinKey("state_fips", ...)` gets per-row IN-list conformance asserts for free, and joins to any other dataset claiming the same kind.

```python
JOINS_ON = {
    "bls_qcew_state": [
        JoinKey("state_fips", "state_fips", "VARCHAR"),
        JoinKey("year", "year", "INTEGER"),
    ],
}
```

The same pattern works for private-sector data — match Stripe customers to Salesforce accounts via email, match QBO vendors to HubSpot companies via EIN. Each domain package ships its own canonical-key registry.

## CLI

```
substrate version
substrate serve   <store.duckdb>           # MCP server over stdio
substrate tables  <store.duckdb>           # list tables + row counts
substrate query   <store.duckdb> "<sql>"   # run a SELECT/WITH query
substrate inspect <module:Class> <stream>  # preview a connector stream
substrate doctor  [--package P]            # health-check installed datasets
```

`substrate serve` is what you point Claude Desktop at:

```json
{
  "mcpServers": {
    "substrate": {
      "command": "substrate",
      "args": ["serve", "/absolute/path/to/store.duckdb"]
    }
  }
}
```

## What Substrate Is NOT

- **Not dbt.** dbt is the SQL-transformation graph. Substrate's unique value is `sync` and `transform` — getting data IN from APIs, models, documents. dbt has zero LLM ops, no portable bundles, no canonical-key join registry. Use both: substrate produces tables, dbt transforms them.
- **Not Fivetran/Airbyte.** Those ship to your warehouse. Substrate ships to a portable DuckDB bundle. Different audience: agent builders and analysts who don't run a warehouse.
- **Not HuggingFace datasets.** HF is for ML training. Substrate is for analytical SQL/MCP queries.
- **Not Iceberg/Delta Sharing.** Those are table formats for already-existing data at scale. Substrate is end-to-end, opinionated about how data gets created.
- **Not LangChain document loaders.** Loaders are stateless converters with no provenance. Substrate's `transform` op is the durable, auditable, idempotent equivalent.
- **Not a hosted platform (yet).** Substrate is a library you install. Hosting is an option, not a requirement.

## Architecture

```
                         Connectors                   Recipes
                  (sync — APIs, DBs, PDFs)    (transform — LLM, OCR,
                                               vision, agent loops, ML)
                              │                           │
                              ▼                           ▼
                          ┌────────────────────────────────────┐
                          │   3-op kernel + ExecutionStore     │
                          │   sync → transform → derive        │
                          │   DuckDB today; Postgres path      │
                          │   (per-row provenance throughout)  │
                          └────────────────────────────────────┘
                                          │
                                  export / publish
                                          ▼
                          ┌────────────────────────────────────┐
                          │   DuckDB bundle + Registry         │
                          │   index.json + versioned bundles   │
                          └────────────────────────────────────┘
                                          │
                                  pull / from_url
                                          │
                                          ▼
                              MCP server / DuckDB ATTACH
                              Any agent queries with SQL
```

Three ops. Two layers (connectors, recipes). One bundle format. Provenance everywhere.

## Domain Libraries

| Package | What it adds | Status |
|---|---|---|
| `substrate-govdata` | 26 US agency connectors, 21 published datasets, FIPS/NAICS canonical keys | shipping |
| `substrate-acfr` | Municipal audit PDF extraction pipeline | shipping |
| `substrate-llm` | LLM extraction recipe (OpenRouter, Anthropic, OpenAI) | factoring out of kernel |
| `substrate-pdf` | PDF text + section extraction recipes | factoring out of kernel |
| `substrate-stripe` | Stripe connector + canonical schemas | planned |
| `substrate-salesforce` | Salesforce connector + canonical schemas | planned |
| `substrate-ocr` | Tesseract / Textract OCR recipes | planned |
| `substrate-vision` | CLIP / vision-model classification recipes | planned |
| `substrate-agent` | Tool-using agent loop recipes | planned |

Same kernel, same primitives, different sources and recipes.

## Where This Is Going

See [`research/substrate_thesis_2026_04_28.md`](research/substrate_thesis_2026_04_28.md) for the full thesis and roadmap. The short version:

1. **Phase 0** — taxonomy hygiene (plumb `kind`/`health`, lint tag prefixes)
2. **Phase 1** — UI collapse (drop hardcoded packs, derive groupings from data)
3. **Phase 2** — kernel reshape (`transform` landed; extract LLM/PDF into separate packages next)
4. **Phase 3** — hierarchical dataset namespaces (`economy.state_panel` etc.)
5. **Phase 4** — first private-sector connector + first new modality recipe (proves the package ecosystem)

## License

MIT.
