Metadata-Version: 2.4
Name: statcast-bigquery
Version: 0.3.1
Summary: Statcast → BigQuery: idempotent ingestion + LLM-friendly docs + Baseball Savant verification
Project-URL: Homepage, https://github.com/blahovec-labs/statcast-bigquery
Project-URL: Issues, https://github.com/blahovec-labs/statcast-bigquery/issues
Project-URL: Changelog, https://github.com/blahovec-labs/statcast-bigquery/blob/main/CHANGELOG.md
Author: Jason Blahovec
License: MIT
License-File: LICENSE
Keywords: baseball,bigquery,data-engineering,mlb,statcast
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering
Requires-Python: >=3.11
Requires-Dist: db-dtypes<2.0,>=1.0
Requires-Dist: google-cloud-bigquery<4.0,>=3.20
Requires-Dist: pandas<3.0,>=2.0
Requires-Dist: pyarrow<19.0,>=15.0
Requires-Dist: pybaseball<3.0,>=2.2.7
Provides-Extra: dev
Requires-Dist: build>=1.2.0; extra == 'dev'
Requires-Dist: pyright>=1.1.380; extra == 'dev'
Requires-Dist: pytest-cov>=5.0; extra == 'dev'
Requires-Dist: pytest>=8.0; extra == 'dev'
Requires-Dist: ruff>=0.6.0; extra == 'dev'
Requires-Dist: sqlglot>=25.0; extra == 'dev'
Description-Content-Type: text/markdown

# statcast-bigquery

Idempotent Statcast → BigQuery ingestion, with first-class documentation for SQL/LLM agents and round-trip validation against Baseball Savant.

## Install

    pip install statcast-bigquery

## Quickstart

    gcloud auth application-default login
    statcast-bigquery sync \
        --start 2024-04-01 --end 2024-10-31 \
        --table myproject.mydataset.statcast_pitches

## Backfill

Backfill historical seasons in resumable chunks:

    statcast-bigquery sync \
        --start 2015-04-01 --end 2026-05-11 \
        --chunk-by year --resume \
        --table myproject.mydataset.statcast_pitches

`--resume` skips chunks already recorded as successful in
`<dataset>._statcast_ingest_runs`. Pass `--runs-table` to keep the run
log in a sidecar dataset. Re-running with the same `--chunk-by` is
safe. Switching from `--chunk-by year` to `--chunk-by month` between
runs causes chunks to be re-processed, because a chunk is skipped only
when it matches a logged chunk exactly.
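
The exact-match rule can be illustrated with a small standalone sketch. It is not the package's implementation: `chunk_range` and `pending_chunks` are hypothetical helpers, and the assumption that chunks follow calendar-year or calendar-month boundaries (clipped to `--start`/`--end`) is made here only for illustration.

```python
from datetime import date, timedelta


def chunk_range(start: date, end: date, chunk_by: str) -> list[tuple[date, date]]:
    """Split [start, end] into inclusive (chunk_start, chunk_end) pairs.

    Assumption: chunks align to calendar years or calendar months.
    """
    if chunk_by not in ("year", "month"):
        raise ValueError(f"unsupported chunk_by: {chunk_by!r}")
    chunks = []
    cursor = start
    while cursor <= end:
        if chunk_by == "year" or cursor.month == 12:
            period_end = date(cursor.year, 12, 31)
        else:
            # Last day of the current month.
            period_end = date(cursor.year, cursor.month + 1, 1) - timedelta(days=1)
        chunk_end = min(period_end, end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)
    return chunks


def pending_chunks(chunks, succeeded):
    """Keep only chunks whose exact (start, end) pair is not in the run log."""
    return [chunk for chunk in chunks if chunk not in succeeded]


start, end = date(2023, 4, 1), date(2024, 10, 31)
yearly = chunk_range(start, end, "year")
succeeded = set(yearly)  # pretend every yearly chunk was logged as a success

# Same chunking: every chunk matches the log, so nothing is left to run.
print(len(pending_chunks(yearly, succeeded)))

# Monthly chunking: no (start, end) pair matches a yearly one, so all re-run.
monthly = chunk_range(start, end, "month")
print(len(pending_chunks(monthly, succeeded)))
```

In this sketch the monthly chunk starting on 2023-04-01 ends on 2023-04-30, while the logged yearly chunk with the same start ends on 2023-12-31, so the pair does not match and the chunk is not skipped.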

## Documentation

    statcast-bigquery docs --format llm > STATCAST_FOR_LLMS.md

## Seed your data dictionary

If you maintain a `data_dictionary` table (one row per column with
business definitions, tags, lineage), you can seed it directly:

    statcast-bigquery docs --format dictionary --apply \
        --dataset mydataset --table myproject.mydataset.statcast_pitches \
        --dictionary-table myproject.shared_ops.data_dictionary

This atomically replaces only the rows for the given `(dataset, table)`
pair; other entries in the dictionary table are untouched. Required
target schema:

    dataset, table, column, dtype, description, business_definition,
    owner, tags ARRAY<STRING>, source_system, upstream_lineage_json,
    created_at TIMESTAMP, updated_at TIMESTAMP
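
The replace-only-one-pair behavior can be sketched locally. This example uses Python's built-in `sqlite3` as a stand-in, not BigQuery, and it is not how the package performs the replacement. The `replace_entries` helper, the reduced four-column table, and all row values are hypothetical.

```python
import sqlite3


def replace_entries(conn, dataset, table, rows):
    """Swap all rows for one (dataset, table) pair inside one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            'DELETE FROM data_dictionary WHERE dataset = ? AND "table" = ?',
            (dataset, table),
        )
        conn.executemany(
            'INSERT INTO data_dictionary (dataset, "table", "column", description) '
            "VALUES (?, ?, ?, ?)",
            [(dataset, table, column, description) for column, description in rows],
        )


# Reduced stand-in schema: only four of the required columns, for brevity.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE data_dictionary '
    '(dataset TEXT, "table" TEXT, "column" TEXT, description TEXT)'
)
conn.executemany(
    "INSERT INTO data_dictionary VALUES (?, ?, ?, ?)",
    [
        ("mydataset", "statcast_pitches", "old_col", "stale placeholder row"),
        ("other", "games", "game_id", "unrelated placeholder row"),
    ],
)
conn.commit()

# Replace the statcast_pitches entries; the unrelated row must survive.
replace_entries(
    conn,
    "mydataset",
    "statcast_pitches",
    [("release_speed", "placeholder description")],
)

remaining = conn.execute(
    'SELECT dataset, "table", "column" FROM data_dictionary ORDER BY dataset'
).fetchall()
print(remaining)
```

Because the delete and the inserts share one transaction, a failure partway through leaves the previous rows in place rather than an empty or half-written entry.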

## Verification

    statcast-bigquery verify \
        --source baseball-savant \
        --aggregation player-season \
        --metric all --season 2024 \
        --table myproject.mydataset.statcast_pitches

MIT licensed. This software does not include or distribute MLB data.
