Metadata-Version: 2.4
Name: semlay
Version: 0.1.0
Summary: File-based semantic layer for AI agents: YAML in git, validated and compiled by Rust
Author-email: Daniel Beach <daniel.beach@rippleshot.com>
License: Apache-2.0
Requires-Python: >=3.9
Description-Content-Type: text/markdown; charset=UTF-8; variant=GFM

# semlay

**A file-based semantic layer for AI agents.** Define tables, columns, metrics,
joins, and business context in simple YAML files. Store them in git, validate them
in CI, compile them to a single context bundle, and deploy it to S3 — no running
service required. Rust core, Python bindings.

```
your-repo/
├── semantic.yml          # project manifest
└── models/
    ├── orders.yml        # one table per file: source, columns, metrics, joins
    └── customers.yml
```

## Why

AI agents writing SQL against raw schemas guess at business logic and get it
wrong. Semantic layers fix that, but the existing ones are SaaS-attached,
dbt-coupled, or API-centric platforms you have to operate. semlay is just files:

- **Physical format is first-class.** Declare whether a dataset is Delta Lake,
  Iceberg, Parquet, CSV, or a Postgres table. SQL generation uses it — DuckDB
  queries against Parquet render as `read_parquet(...)` and run with zero setup.
- **Agent context is first-class.** Metrics carry `context` (pitfalls, caveats),
  `example_queries`, valid time grains, and deprecation notes pointing at
  replacements. The linter warns when context is missing.
- **Structured errors.** A bad query returns machine-readable JSON so agents
  self-correct instead of hallucinating (see below).
- **Git-native deployment.** Validate on PR, compile on merge, push one JSON
  bundle to S3. Agents load the bundle; nothing to host.

## The spec in one file

```yaml
# models/orders.yml
version: 1
source:
  format: parquet            # delta | iceberg | parquet | csv | txt | json
  location: data/orders.parquet
  # ...or a database:        # postgres | mysql | snowflake | bigquery | databricks
  # format: postgres
  # connection: warehouse    # defined in semantic.yml; DSN comes from an env var
  # schema: analytics
  # relation: fct_orders     # physical name, if it differs from table.name

table:
  name: orders
  description: One row per order line item from the order management system.
  grain: order line item     # what one row means — agents need this
  owner: data-eng@example.com
  status: certified          # certified | experimental | deprecated

columns:
  - name: order_id
    type: int64              # Arrow type names: int64, float64, decimal(18,2),
    primary_key: true        # string, bool, date, timestamp, time, binary...
    description: Natural key from the OMS.
  - name: email
    type: string
    classification: pii      # public | internal | restricted | pii
  - name: status
    type: string
    description: One of placed, shipped, returned, cancelled.
    synonyms: [order state]  # how humans actually ask for it
  - name: ordered_at
    type: timestamp
    time_dimension: true     # default time axis for grain queries

metrics:
  - name: gross_revenue
    expr: SUM(amount)
    description: Pre-discount revenue in USD.
    context: |
      Excludes refunds — join refunds for net revenue. Cancelled lines are
      excluded by the built-in filter.
    example_queries:                 # anchor agent question-matching
      - gross revenue by region last 30 days
    filters:
      - status <> 'cancelled'        # always applied, per-metric
    valid_grains: [day, week, month, quarter, year]   # hour is rejected

relationships:
  - to: customers
    type: many_to_one        # one_to_one | one_to_many | many_to_one | many_to_many
    on: customer_id          # or columns: {local: ..., remote: ...}

context: |
  The OMS backfills late order edits up to 48 hours; treat the last two days
  as provisional when reporting.
```

JSON Schemas for editor autocomplete and CI validation live in
[`schemas/`](schemas/).

Four complete example projects live in [`examples/`](examples/) — Parquet
with runnable sample data, Snowflake with connections and deprecation,
Postgres with views and explicit join columns, and a Delta/Iceberg/CSV
lakehouse with two-hop joins. See [`examples/README.md`](examples/README.md).

## CLI walkthrough

All commands run against [`examples/shop`](examples/shop), which ships with
sample Parquet data.

**Validate** (hard errors) and **lint** (weak agent context):

```console
$ semlay validate examples/shop
ok: 2 model(s), 0 errors

$ semlay lint examples/shop
ok: 2 model(s), 0 warning(s)
```

On a broken project, errors name the file and the fix:

```
error: models/bad.yml: source format `Postgres` requires `connection`
error: models/bad.yml: column `c`: unknown column type `varchar` (expected one of int8..int64, ...)
warning: models/bad.yml: metric `m` has no `context`; agents will guess at caveats
```

**Export agent context** — the whole project as compact markdown for a system
prompt, `CLAUDE.md`, or RAG ingestion. This is how agents consume the layer
with zero tooling on their side:

```console
$ semlay context examples/shop -o semantic-context.md
```

The output covers every table (grain, source, deprecations), column tables
with PII flags and synonyms, and each metric's expression, always-applied
filters, valid grains, caveats, and example questions.

**Generate SQL** from a semantic query:

```console
$ semlay sql examples/shop -m gross_revenue,order_count -d region -g month
```

```sql
SELECT
  date_trunc('month', "orders"."ordered_at") AS "ordered_at_month",
  "customers"."region" AS "region",
  SUM("orders"."amount") FILTER (WHERE "orders"."status" <> 'cancelled') AS "gross_revenue",
  COUNT(DISTINCT "orders"."order_id") AS "order_count"
FROM read_parquet('data/orders.parquet') AS "orders"
LEFT JOIN read_parquet('data/customers.parquet') AS "customers"
  ON "orders"."customer_id" = "customers"."customer_id"
GROUP BY 1, 2
ORDER BY 1, 2
```

Note what the layer did: qualified every column, applied `gross_revenue`'s
built-in filter to *its* aggregate only (cancelled orders still count in
`order_count`), planned the join from the relationship graph, and rendered the
Parquet source as a DuckDB reader so the SQL runs as-is. Dialects: `duckdb`,
`postgres`, `databricks`, `snowflake`. Dialect differences are handled per
construct — Snowflake has no `FILTER` clause, so the same metric renders
there as `SUM(CASE WHEN status <> 'cancelled' THEN amount END)`.

Top-N queries sort by any metric or dimension:

```console
$ semlay sql examples/shop -m gross_revenue -d region --sort gross_revenue:desc -l 5
```

**Structured errors** — typos and bad grains come back as JSON agents can act on:

```console
$ semlay sql examples/shop -m gross_revenu
{
  "code": "unknown_metric",
  "name": "gross_revenu",
  "suggestions": ["gross_revenue"]
}

$ semlay sql examples/shop -m gross_revenue -g hour
{
  "code": "incompatible_grain",
  "metric": "gross_revenue",
  "grain": "hour",
  "valid": ["day", "week", "month", "quarter", "year"]
}
```

Other codes: `unknown_dimension` (with suggestions), `ambiguous_dimension`
(with qualified candidates), `no_join_path`, `metrics_span_tables`,
`no_time_dimension`, `unknown_sort_field` (with the sortable fields).

**Fan-out protection.** Relationship cardinality is enforced, not decorative.
If reaching a dimension requires traversing a one-to-many join, the row
multiplication would silently inflate `SUM`s — the classic semantic-layer
correctness bug. semlay refuses with a structured error instead:

```json
{
  "code": "fan_out_join",
  "from": "orders",
  "to": "line_items"
}
```

(The message tells the agent to define the metric on the many side instead.)

**Expression safety.** Metric expressions and filters are parsed as a single
SQL expression and re-rendered from the AST. Statements can't smuggle in —
`status = 'x'; DROP TABLE orders` is rejected as trailing input, not
truncated, not executed.

**Compile** the whole project into one deployable context bundle:

```console
$ semlay compile examples/shop -o bundle.json
```

The bundle is the project with refs resolved plus a join graph — everything an
agent needs in one fetch. See
[`examples/deploy-bundle.yml`](examples/deploy-bundle.yml) for a GitHub Actions
workflow that validates on PR and ships the bundle to S3 on merge.

## MCP server

Serve the project to any MCP client (Claude Code, Claude Desktop, Cursor)
over stdio — no network, no auth, nothing to host:

```console
$ semlay mcp examples/shop
```

Claude Code registration:

```bash
claude mcp add shop-semantics -- semlay mcp /path/to/your/semantic-layer
```

Six tools: `search` (names, synonyms, descriptions, example questions),
`list_tables`, `get_table` (full per-table context), `list_metrics`,
`generate_sql`, `get_context`. Tool errors return the same structured JSON as
the CLI with `isError: true`, so the model reads `unknown_metric` +
suggestions and retries — the self-correction loop is covered by an
integration test that drives the real binary over stdio.

The server reloads YAML from disk on every call: edit a model, the agent sees
it on its next tool call.

## Python

```python
import semlay

semlay.validate("examples/shop")        # [] when clean, else [{file, message}]
semlay.lint("examples/shop")            # warnings, same shape
bundle = semlay.compile_bundle("examples/shop")   # dict

md = semlay.context_markdown("examples/shop")  # agent-ready markdown

sql = semlay.generate_sql(
    "examples/shop",
    {
        "metrics": ["gross_revenue"],
        "dimensions": ["region"],
        "grain": "month",
        "order_by": [{"field": "gross_revenue", "desc": True}],
        "limit": 5,
    },
    dialect="duckdb",
)

# Structured errors raise ValueError carrying JSON:
import json
try:
    semlay.generate_sql("examples/shop", {"metrics": ["gross_revenu"]})
except ValueError as e:
    err = json.loads(str(e))
    assert err["code"] == "unknown_metric"
    assert err["suggestions"] == ["gross_revenue"]
```

Build locally with `maturin develop --manifest-path python/Cargo.toml`.

## How it works

```
crates/semlay-core   spec types (serde), loader, validator, linter, bundle compiler
crates/semlay-sql    semantic query -> resolved plan -> dialect SQL (sqlparser AST)
crates/semlay-mcp    MCP stdio server (hand-rolled JSON-RPC; no async runtime)
crates/semlay-cli    `semlay` binary
python/              PyO3 bindings (abi3, Python >= 3.9)
schemas/             generated JSON Schemas for the YAML files
```

Column types map onto Apache Arrow's type system — one type model across every
physical format. Metric expressions and filters are parsed with sqlparser; bare
column references are qualified against the base table, and per-metric filters
attach as `FILTER (WHERE ...)` on aggregate AST nodes, which DuckDB, Postgres,
and Spark/Databricks all support.

## Tests

The test suite doubles as a tour of the behavior:

- [`crates/semlay-core/tests/validate_lint_tests.rs`](crates/semlay-core/tests/validate_lint_tests.rs)
  — one test per validation rule and lint.
- [`crates/semlay-sql/tests/sql_tests.rs`](crates/semlay-sql/tests/sql_tests.rs)
  — golden SQL, dialect quoting, error codes.
- [`crates/semlay-sql/tests/edge_tests.rs`](crates/semlay-sql/tests/edge_tests.rs)
  — two-hop joins, ambiguous dimensions, ratio metrics filtering both
  aggregates, database sources.
- [`crates/semlay-cli/tests/cli_tests.rs`](crates/semlay-cli/tests/cli_tests.rs)
  — the real binary end to end, including executing generated SQL in DuckDB
  and asserting the numbers.
- [`python/tests/test_semlay.py`](python/tests/test_semlay.py) — bindings.

- [`crates/semlay-mcp/tests/protocol_tests.rs`](crates/semlay-mcp/tests/protocol_tests.rs)
  — MCP handshake, tool listing, the agent self-correction loop.

```bash
cargo test --workspace          # 65 tests
pytest python/tests             # 10 tests (after maturin develop)
```

## Limitations (current, by design of v0)

- **Metrics in one query must live on one table.** Cross-table metric math
  (revenue from `orders` divided by spend from `ads`) is not composed yet;
  query each side and combine downstream.
- **Derived metrics can't reference other metrics.** Ratios work as plain
  expressions (`SUM(a) / COUNT(DISTINCT b)`); `metric_a / metric_b` reuse is
  planned.
- **Query filters are scoped to the base table's columns.** Filter on joined
  tables by qualifying dimensions instead.
- **No derived dimensions** (CASE bucketing, concatenations) — model them as
  columns upstream for now.
- **`week` grain follows each engine's week-start convention**; semlay does
  not normalize it.

## Status

Early. Spec v1, validator, linter, context bundle (with provenance), agent
context export, SQL generation with fan-out protection and top-N, MCP server,
Python bindings. Planned next: schema drift detection against real
Parquet/Delta/Iceberg metadata (Arrow-based), OSI (Open Semantic Interchange)
import/export.

## License

Apache-2.0

