Metadata-Version: 2.4
Name: sql-code-graph
Version: 1.1.3
Summary: SQL code graph analyzer and lineage tracer
Project-URL: Homepage, https://github.com/Warhorze/sql-code-graph
Project-URL: Repository, https://github.com/Warhorze/sql-code-graph
Project-URL: Issues, https://github.com/Warhorze/sql-code-graph/issues
Project-URL: Changelog, https://github.com/Warhorze/sql-code-graph/blob/master/CHANGELOG.md
Author-email: wesley <rademakerwesley@gmail.com>
License: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.12
Requires-Dist: kuzu==0.11.3
Requires-Dist: mcp<2.0,>=1.27.0
Requires-Dist: pathspec>=0.12.1
Requires-Dist: pydantic>=2.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: rich>=13.7.0
Requires-Dist: sqlglot==30.6.0
Requires-Dist: sqlglotc==30.6.0
Requires-Dist: typer>=0.9.0
Requires-Dist: watchdog>=3.0.0
Provides-Extra: dbt
Requires-Dist: dbt-core>=1.7; extra == 'dbt'
Provides-Extra: neo4j
Requires-Dist: neo4j>=5.15.0; extra == 'neo4j'
Provides-Extra: snowflake
Requires-Dist: acryl-datahub<0.15.0,>=0.14.0; extra == 'snowflake'
Description-Content-Type: text/markdown

# sql-code-graph

> **Pre-1.0 — expect breaking changes.** APIs, CLI flags, and graph schema may
> change between releases without a deprecation period. Pin to an exact version
> in production. Re-indexing is always the migration path.

> **Dialect support.** sqlcg is built on [sqlglot](https://github.com/tobymao/sqlglot),
> so in theory it can work for any dialect sqlglot parses. In practice we've only
> tested it against a production **Snowflake** warehouse (~1,400 SQL files) — and
> getting that one dialect working properly was already challenging. Other dialects
> (`bigquery`, `postgres`, `ansi`, `tsql`, `dbt`) may well work, but their lineage
> hasn't been validated, so expect rough edges. We'd gladly collaborate to get
> another dialect integrated — [open an issue](https://github.com/Warhorze/sql-code-graph/issues)
> with a minimal, anonymised corpus we can test against.

SQL lineage and dependency analysis as an MCP server for Claude Code.

Indexes a directory of `.sql` files into a graph database and exposes lineage
queries as MCP tools — so Claude can answer questions like *"what tables does
this view depend on?"* or *"where is `orders.customer_id` derived from?"*
without reading every file.

## Quick start

Choose one:

**Permanent install** (recommended):
```bash
uv tool install sql-code-graph    # Fast, managed, no isolation needed
sqlcg install                     # Register MCP server in Claude Code
```

**One-shot try** (cold cache warning):
```bash
uvx sql-code-graph                # First run is slow (downloads deps)
                                  # Subsequent runs use cache, ~1s startup
```

Restart Claude Code, then inside your project ask:

```
Index my SQL files at ./sql --dialect snowflake
```

That's it. The MCP tools are now available to Claude in every conversation
for that project.

### Workflow (3 steps)

1. **Initialize**: `sqlcg db init`
2. **Index**: `sqlcg index ./sql --dialect snowflake`
3. **Keep fresh**: `sqlcg git install-hooks` (optional)

## Full setup (recommended)

```bash
# 1. Install
pip install sql-code-graph

# 2. Register with Claude Code (~/.claude/settings.json)
sqlcg install

# 3. Restart Claude Code

# 4. Index your SQL repo
# Only git-tracked files are indexed — build artefacts, node_modules,
# and .venv are ignored automatically.
sqlcg db init
sqlcg index ./sql --dialect snowflake   # snowflake is the only tested dialect

# 5. (Optional) Keep the graph fresh on branch switches
cd /your/sql/repo
sqlcg git install-hooks
```

Step 5 installs a `post-checkout` git hook that re-indexes automatically
whenever you switch branches. Without it the graph may be stale after a
`git checkout` until you re-run `sqlcg index` manually.

## Dialect config

To avoid passing `--dialect` every time, create `.sqlcg.toml` in your repo root:

```toml
[sqlcg]
dialect = "snowflake"   # the only tested dialect
```

The git hook and `sqlcg index --dialect auto` both read this file.

## Add to your project CLAUDE.md (recommended)

Adding a short note to your project's `CLAUDE.md` helps Claude know the tools
are available and when to use them:

```markdown
## SQL lineage
This project uses sql-code-graph. MCP tools are available:
- `db_info` — check graph health and parse quality before running lineage queries
- `index_repo` — index or re-index a directory of SQL files
- `find_table_usages` — find all queries that read a table
- `trace_column_lineage` — trace where a column's value comes from
- `get_upstream_dependencies` / `get_downstream_dependencies` — dependency chains
- `search_sql_pattern` — full-text search across all indexed SQL
- `execute_cypher` — raw graph query for advanced analysis
```

The MCP server works without this — Claude can discover the tools on its own —
but the CLAUDE.md snippet ensures they get used proactively.

## Parse quality

After indexing, `sqlcg gain` shows a **parse quality breakdown** that tells you how
much column-level lineage was extracted:

| Quality | Meaning | Tools affected |
|---|---|---|
| `FULL` | Column-level lineage extracted | All tools work |
| `TABLE_ONLY` | Table edges only — no column lineage | `trace_column_lineage`, `get_*_dependencies` return empty |
| `SCRIPTING_FALLBACK` | sqlglot fell back to raw command node | Partial table edges; column lineage unavailable |
| `FAILED` | File failed to parse entirely | File invisible to all queries |

Quality is shown per-file after `sqlcg index` and in `sqlcg gain` Section F.
`list_dialects_and_repos()` warns when scripting fallback exceeds 20% of queries.

**What causes TABLE_ONLY?** Mostly `SELECT *` — sqlglot can't trace column names through
a wildcard without knowing the source table's columns. See [Resolving SELECT *](#resolving-select-)
below for how sqlcg expands wildcards automatically from your DDL and CTAS bodies.

**What causes SCRIPTING_FALLBACK?** Snowflake `$$` procedure bodies or `BEGIN…END` scripting
blocks. sqlglot parses the block as a raw `Command` node and extracts DML via tokenizer
fallback. Table edges are usually correct; column edges are not.

Check `sqlcg db info` for the parsing mode distribution across all indexed queries.

## Resolving SELECT *

A `SELECT *` ETL produces `TABLE_ONLY` parse quality because sqlglot needs the source
table's column list to expand the wildcard into individual columns. sqlcg resolves this
**automatically, with no extra setup** — there is no CSV to export or command to run.

Wildcards are expanded from two sources harvested while indexing:

1. **DDL files** — `CREATE TABLE` / `CREATE VIEW` statements give sqlcg the column list
   for any table they define.
2. **Cross-file CTAS bodies** — a `CREATE TABLE … AS SELECT` (or CTE) in one file is used
   to resolve `SELECT *` against that table in another file.

So the only thing you need to do is **index the DDL alongside your ETLs** — point `sqlcg
index` at a path that contains both (e.g. the repo root, or both `ddl/` and `etl/`). The
more of your `CREATE` statements sqlcg can see, the more wildcards it resolves.

```bash
sqlcg index . --dialect snowflake   # index DDL + ETLs together
```

After indexing, `sqlcg db info` shows non-zero `STAR_EXPANSION lineage edges`, and
`trace_column_lineage` returns results for queries that previously returned empty.

> **Note:** earlier versions accepted an exported `INFORMATION_SCHEMA` CSV (`sqlcg
> load-schema`). That path was **removed** — profiling showed it added zero lineage
> edges over DDL + cross-file CTAS resolution on a real warehouse. DDL is now the
> single source of column truth; no CSV is needed or accepted.

## MCP tools reference

| Tool | Description |
|------|-------------|
| `index_repo(repo_path, dialect)` | Index a directory of SQL files |
| **Lineage & dependencies** | |
| `trace_column_lineage(table_col)` | Trace a column's value upstream to its sources |
| `get_upstream_dependencies(table_col)` | Full upstream dependency chain |
| `get_downstream_dependencies(table_col)` | Full downstream dependency chain |
| `find_table_usages(table_name)` | Find all queries that read a table |
| `find_definition(table_qualified)` | Find where a table/view is defined |
| **Change impact** | |
| `get_change_scope(table_qualified)` | Blast radius of changing a table (impact + risk) |
| `diff_impact(changed_files)` | What a set of changed files affects downstream |
| `get_backfill_order(table_qualified)` | Topological rebuild/backfill order |
| `scope_change(target)` | Synthesised change-scope summary for a target |
| **Search & meta** | |
| `search_sql_pattern(query)` | Full-text search across indexed SQL |
| `list_dialects_and_repos()` | List indexed repos and dialects (catalogue) |
| `db_info()` | Graph health, node counts, parse quality breakdown, warnings, freshness (indexed SHA vs HEAD) |
| `execute_cypher(query)` | Raw Cypher query against the graph |
| `submit_feedback(...)` | Report a false positive/negative to improve metrics |

> **Input format**: lineage/dependency tools expect a **schema-qualified** column
> reference — `schema.table.column` (e.g. `ba.orders.customer_id`), not a bare
> `table.column`. Each returned node carries both `name` (the bare column) and
> `table` (the owning `schema.table`), so results are navigable without a second lookup.

> **Provenance fields**: lineage edges now carry `file`, `line`, and `expression`
> (where the lineage was derived from), a `confidence` of `1.0` for plainly-parsed
> facts (lower for inferred edges, with a `reason`), and a `table_kind`
> (`table` / `cte` / `derived` / `external`) so CTE and derived aliases are
> distinguishable from real tables.

> **LLM agent tip**: call `db_info()` before lineage queries to check that
> `SqlColumn > 0` and `warnings` is empty. If `parse_quality["scripting_block"]`
> is high, column lineage will be limited for those files — use table-level tools
> (`find_table_usages`, `get_*_dependencies`) instead.

## CLI reference

Full option reference: [docs/cli.md](docs/cli.md)

```bash
sqlcg install                          # register MCP server in Claude Code
sqlcg db init                          # initialise graph database
sqlcg index <path> --dialect snowflake # index SQL files (snowflake is the tested dialect)
sqlcg index <path> --dialect auto      # read dialect from .sqlcg.toml
sqlcg index <path> --profile           # index + print per-stage timing and slowest files
sqlcg index <path> --include-working-tree  # also index uncommitted changes (marks graph dirty)
sqlcg reindex <path> --from <sha> --to <sha>  # incremental resync of only changed files
sqlcg analyze unused                   # tables with no query references
sqlcg analyze upstream/downstream      # trace lineage from the CLI
sqlcg find table/column/pattern        # search the graph
sqlcg watch <path>                     # watch for file changes
sqlcg db info                          # graph stats + freshness (indexed SHA vs HEAD)
sqlcg git install-hooks                # install post-checkout + post-merge resync hooks
sqlcg gain                             # show usage metrics
sqlcg report                           # generate FP/error report
sqlcg mcp best-practices               # print the fact/heuristic boundary for the MCP tools
sqlcg mcp start                        # start MCP server manually
sqlcg mcp status                       # server status JSON (via control socket)
sqlcg mcp stop                         # stop the running MCP server gracefully
sqlcg mcp restart                      # stop the server (client must respawn it)
sqlcg version                          # show installed version
```

## Supported dialects

sqlcg is built on [sqlglot](https://github.com/tobymao/sqlglot), so other dialects
*can* be parsed in theory. Only Snowflake has been tested against a real corpus,
though — the table reflects what's actually been exercised, not what sqlglot can do.

| Dialect | Status |
|---------|--------|
| `snowflake` | ✅ Tested against a production DWH (~1,400 files) |
| `bigquery` | ⚠️ Unproven — parses via sqlglot, lineage not validated |
| `postgres` | ⚠️ Unproven — parses via sqlglot, lineage not validated |
| `ansi` | ⚠️ Unproven — parses via sqlglot, lineage not validated |
| `tsql` | ⚠️ Unproven — parses via sqlglot, lineage not validated |
| `dbt` | ⚠️ Unproven — via optional extra, lineage not validated |

Want another dialect properly supported? We'd be glad to collaborate — open an issue
with a minimal, anonymised corpus we can develop and test against. Getting Snowflake
right was real work, so a representative corpus is what makes the difference.

## Development

```bash
git clone https://github.com/Warhorze/sql-code-graph
cd sql-code-graph
uv sync --all-extras
uv run pytest tests/unit
```

## Issues

Bug reports and feature requests: [github.com/Warhorze/sql-code-graph/issues](https://github.com/Warhorze/sql-code-graph/issues)

Questions and discussion: [github.com/Warhorze/sql-code-graph/discussions](https://github.com/Warhorze/sql-code-graph/discussions)

## License

MIT
