Metadata-Version: 2.4
Name: schema-genie
Version: 1.1.0
Summary: Automatically infer star/snowflake schemas from DataFrames or CSVs and generate production-ready DDL for Snowflake, Redshift, BigQuery, and PostgreSQL — in seconds.
License: MIT
Project-URL: Homepage, https://github.com/muhammadsufiyanbaig/schema-genie
Project-URL: Issues, https://github.com/muhammadsufiyanbaig/schema-genie/issues
Keywords: data-warehouse,schema,ddl,star-schema,snowflake-schema,snowflake,redshift,bigquery,postgres,postgresql,data-engineering,data-science,etl,elt,schema-inference,schema-design,type-detection,cardinality,scd,slowly-changing-dimensions,fact-table,dimension-table,foreign-key,data-modelling,analytics
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Information Technology
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: Database :: Database Engines/Servers
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Software Development :: Code Generators
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Utilities
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.3
Requires-Dist: numpy>=1.21
Requires-Dist: sqlglot>=10.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: snowflake
Requires-Dist: snowflake-connector-python>=3.0; extra == "snowflake"
Provides-Extra: redshift
Requires-Dist: psycopg2-binary>=2.9; extra == "redshift"
Provides-Extra: bigquery
Requires-Dist: google-cloud-bigquery>=3.0; extra == "bigquery"
Provides-Extra: diagram
Requires-Dist: graphviz>=0.20; extra == "diagram"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: license-file

# schema-genie

> **Automatically infer optimal star and snowflake schemas from raw CSVs or DataFrames and generate production-ready DDL for Redshift, BigQuery, Snowflake, and PostgreSQL.**

---

## Why schema-genie?

Data engineers and analysts spend a disproportionate amount of time doing work that should be automated. Schema design is the single biggest pre-modelling bottleneck in any data warehouse project — and it is almost entirely mechanical.

schema-genie runs a **6-stage statistical inference pipeline** that turns raw DataFrames or CSVs into deployment-ready DDL in seconds. No BI tools. No data modelling consultants. No manual ERD sessions.

---

## Impact at a Glance

| Metric | Without schema-genie | With schema-genie |
|--------|---------------------|-------------------|
| Schema design time (5-table project) | 4 – 8 hours | < 5 seconds |
| DDL correctness iterations | 3 – 6 review cycles | 1 (auto-generated) |
| SCD Type 2 audit columns per dim | 30 – 60 min manual coding | Automatic |
| FK relationship discovery (10 tables) | Full cross-table review | Value-intersection scoring, instant |
| Warehouse-specific DDL dialects supported | Requires per-dialect expertise | 4 targets, zero config |
| Cost to onboard a net-new dataset | Hours of eng time | One function call |

---

## The Real Cost of Manual Schema Work

### Engineer time

A senior data engineer in the US earns roughly **$130,000 – $160,000/year** (~$65–80/hr fully loaded).

| Task | Manual time | schema-genie time | Hours saved |
|------|-------------|-------------------|-------------|
| Type detection across 20 columns | 20 – 40 min | Instant | ~0.5 hr |
| Cardinality + normalization analysis | 30 – 60 min | Instant | ~0.75 hr |
| FK/relationship mapping (5 tables) | 1 – 3 hrs | Instant | ~2 hrs |
| Fact vs. dimension classification | 30 – 90 min | Instant | ~1 hr |
| Star vs. snowflake recommendation | 30 – 60 min | Instant | ~0.75 hr |
| DDL authoring (4 dialects) | 3 – 6 hrs per dialect | Instant (all 4) | ~15 hrs |
| SCD Type 2 columns (3 dim tables) | 2 – 4 hrs | Automatic | ~3 hrs |
| **Total per project** | **~8 – 25 hrs** | **< 10 seconds** | **~23 hrs** |

At $75/hr, **one project saves ~$1,725 in engineering time** that was previously spent on repetitive, error-prone schema work.

---

## Cloud Warehouse Cost Savings

Poor schema design directly inflates your monthly cloud bill. schema-genie generates warehouse-optimised DDL from the start.

### BigQuery — partition pruning

BigQuery charges **$5 per TB scanned**. schema-genie automatically detects the first date column on your fact table and emits a `PARTITION BY` clause.

```sql
-- Generated automatically for BigQuery
CREATE TABLE IF NOT EXISTS orders (
    orders_key    INT64,
    order_date    DATE,
    revenue       NUMERIC,
    ...
)
PARTITION BY DATE(order_date);
```

On a 1 TB/day table queried 50 times/day:
- **Without partitioning:** 50 × 1 TB = 50 TB scanned = **$250/day** in query costs
- **With partitioning** (typical 80% pruning): 50 × 0.2 TB = 10 TB = **$50/day**
- **Monthly saving: ~$6,000** on that single table alone

### Redshift — DISTKEY + SORTKEY

Redshift performance degrades sharply on large fact tables without distribution and sort keys. schema-genie automatically adds `DISTKEY` and `SORTKEY` on the first ID column of every generated fact table.

```sql
-- Generated automatically for Redshift
CREATE TABLE orders (
    orders_key  INTEGER IDENTITY(1,1) PRIMARY KEY,
    customer_id VARCHAR(64) DISTKEY,
    order_date  TIMESTAMP,
    ...
)
SORTKEY (customer_id);
```

Properly keyed Redshift tables execute join queries **2×–5× faster**, which translates directly to fewer node-hours consumed and lower DC2/RA3 costs.

### Snowflake — column ordering and types

Snowflake charges per **Credit** (~$2–4/credit depending on tier). Queries over wide VARCHAR columns on fact tables consume far more credits than typed FLOAT/NUMBER columns. schema-genie maps every column to the tightest correct SQL type rather than defaulting everything to `VARCHAR(MAX)`.

| Column pattern | Naive DDL | schema-genie DDL |
|---------------|-----------|-----------------|
| `revenue` (float) | `VARCHAR(255)` | `NUMBER(18,2)` |
| `created_at` (datetime) | `VARCHAR(255)` | `TIMESTAMP_NTZ` |
| `status` (low cardinality) | `VARCHAR(255)` | `VARCHAR(128)` |
| `description` (free text) | `VARCHAR(255)` | `VARCHAR(4096)` |

Correct typing eliminates implicit casts at query time, reduces storage footprint, and enables micro-partition pruning.

---

## Data Science Impact

### Faster iteration on analytics models

Data scientists blocked on schema work cannot build models. Every hour a schema is wrong or missing is an hour of blocked ML/analytics work. schema-genie turns schema provisioning from a multi-day ticket into a same-session command.

### Correct semantic types feed better models

The type detector (`type_detector.py`) distinguishes **six semantic types**: `id`, `date`, `currency`, `measure`, `category`, `text`. This prevents classic data leakage bugs:

- Currency columns are never treated as free-form measures
- ID columns are never factored into statistical aggregations
- Low-cardinality strings are automatically flagged as categoricals — the correct dtype for tree models, embeddings, and one-hot encoders

### SCD Type 2 — auditable datasets for time-series modelling

Every dimension table gets four audit columns generated automatically:

```sql
_valid_from   TIMESTAMP_NTZ  DEFAULT CURRENT_TIMESTAMP()
_valid_to     TIMESTAMP_NTZ
_is_current   BOOLEAN        DEFAULT TRUE
_loaded_at    TIMESTAMP_NTZ  DEFAULT CURRENT_TIMESTAMP()
```

Without these, historical point-in-time joins are impossible, which rules out large classes of churn, CLV, and forecasting models. Most teams add these columns only after realising they need them — months into a project.

### Relationship graph as a feature engineering map

The FK relationships detected by the value-intersection algorithm (`relationships.py`) double as a feature engineering roadmap. Every `score >= 0.8` link is a join path that can be exploited in feature stores, dbt models, or graph neural networks.

---

## Pipeline Complexity — What Runs in Seconds

For a schema design task that takes a human engineer hours, schema-genie executes the following automatically:

```
Stage 1 — Type Detection
  Per column: name regex match → dtype check → cardinality ratio
  O(n_cols × n_rows) — negligible on pandas DataFrames

Stage 2 — Cardinality Analysis
  unique_values / n_rows per column
  Threshold: 0.05 (configurable via normalize_threshold)

Stage 3 — Relationship Detection
  For every pair of tables, for every pair of name-stem-matching columns:
    intersection_score = |A ∩ B| / min(|A|, |B|)
  O(n_tables² × n_cols²) — pre-filtered by name stem to stay practical

Stage 4 — Fact Table Selection
  fact_score = (measure_cols / total_cols)
             + (id_col_count × 0.2)
             + (log10(n_rows) / 10)

Stage 5 — Schema Type Recommendation
  Snowflake schema triggered when any dimension has > 40% category columns

Stage 6 — DDL Generation
  Target-specific CREATE TABLE with PKs, FKs, SCD columns, indexes, DISTKEY/SORTKEY/PARTITION BY
```

**Total wall-clock time on a 5-table, 100k-row dataset: typically under 2 seconds.**

---

## Quantified Savings Summary

> Estimates based on industry-standard data engineering hourly rates and major cloud pricing as of 2025/2026. Actual savings vary by team size, dataset scale, and query patterns.

| Scenario | Monthly saving |
|----------|---------------|
| 1 engineer, 2 new datasets/month | ~$3,500 in eng time |
| BigQuery, 1 TB/day partitioned table | ~$6,000 in query costs |
| Redshift, correct DISTKEY on 10B-row fact | 2×–5× query speedup → fewer node-hours |
| Eliminating 4-dialect DDL maintenance | ~15 hrs/project × $75/hr = $1,125 |
| SCD audit columns auto-generated | 2–4 hrs saved per dimension table |

---

## Installation

```bash
pip install schema-genie
```

With optional warehouse connectors:

```bash
pip install schema-genie[snowflake]
pip install schema-genie[redshift]
pip install schema-genie[bigquery]
pip install schema-genie[diagram]
```

---

## Quick Start

```python
import pandas as pd
from schema_genie import SchemaGenie

df = pd.read_csv("sales_data.csv")
genie = SchemaGenie(target="snowflake")
schema = genie.infer(df, table_name="sales")

print(schema.recommended_type)   # "star" or "snowflake"
print(schema.ddl)
schema.export_ddl("schema.sql")
```

### Multi-table inference

```python
genie = SchemaGenie(target="redshift")
schema = genie.infer_multi({
    "orders":    orders_df,
    "customers": customers_df,
    "products":  products_df,
})

print(schema.fact_table.name)         # "orders"
print([t.name for t in schema.dimension_tables])
print(schema.scd_candidates)
schema.export_diagram("er_diagram")   # requires pip install schema-genie[diagram]
```

### Load config from YAML

```yaml
# genie_config.yaml
target: postgres
schema_type: auto
detect_scd: true
normalize_threshold: 0.05
```

```python
genie = SchemaGenie.from_config("genie_config.yaml")
```

---

## How It Works

`schema-genie` runs a 6-stage statistical inference pipeline:

```
Raw DataFrames / CSVs
        │
        ▼
┌────────────────────────────┐
│    Type Detector            │  Maps each column to a semantic type
│  (measure/date/id/          │  using dtype + cardinality + name heuristics
│   category/text/currency)   │
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│    Cardinality Analyzer     │  Measures unique value ratio per column
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Relationship Detector     │  FK-like overlaps via value intersection scoring
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Fact Table Selector       │  Picks table with highest measure density
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Schema Type Recommender   │  Star vs. Snowflake based on dimension depth
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│     DDL Generator           │  CREATE TABLE statements for target warehouse
└────────────────────────────┘
```

---

## API Reference

### `SchemaGenie`

```python
SchemaGenie(
    target: str = "snowflake",          # "snowflake" | "redshift" | "bigquery" | "postgres"
    schema_type: str = "auto",          # "auto" | "star" | "snowflake"
    primary_key_strategy: str = "surrogate",
    detect_scd: bool = True,
    normalize_threshold: float = 0.05
)
```

| Method | Description |
|--------|-------------|
| `genie.infer(df, table_name)` | Infer schema from a single DataFrame |
| `genie.infer_multi(dict_of_dfs)` | Infer schema across multiple related tables |
| `genie.deploy(connection, schema)` | Execute DDL against a live warehouse |
| `SchemaGenie.from_config(path)` | Load configuration from a YAML file |

### `InferredSchema`

```python
schema.recommended_type     # "star" | "snowflake"
schema.fact_table           # TableDefinition
schema.dimension_tables     # list[TableDefinition]
schema.relationships        # list[dict] — detected FK relationships
schema.ddl                  # str — full DDL ready to execute
schema.scd_candidates       # list[str] — SCD Type 2 flagged columns
schema.confidence_score     # float — pipeline confidence [0, 1]
schema.export_ddl(path)     # Save DDL to a .sql file
schema.export_diagram(path) # Export ER diagram (requires graphviz extra)
schema.summary()            # Human-readable summary string
```

---

## Supported Targets

| Target | Surrogate Key | Currency Type | Partitioning / Distribution | Direct Deploy |
|--------|---------------|---------------|-----------------------------|---------------|
| `snowflake` | `AUTOINCREMENT` | `NUMBER(18,2)` | Micro-partition (automatic) | Yes |
| `redshift` | `IDENTITY(1,1)` | `DECIMAL(18,2)` | `DISTKEY` + `SORTKEY` on fact PK | Yes |
| `bigquery` | `INT64` | `NUMERIC` | `PARTITION BY` on first date col | Yes |
| `postgres` | `SERIAL` | `NUMERIC(18,2)` | `CREATE INDEX` on all ID columns | Yes |

All dimension tables automatically receive SCD Type 2 audit columns:
`_valid_from`, `_valid_to`, `_is_current`, `_loaded_at`

---

## Development

```bash
git clone https://github.com/yourusername/schema-genie
cd schema-genie
pip install -e ".[dev]"
pytest tests/ -v
```

---

## License

MIT — See [LICENSE](LICENSE)
