Metadata-Version: 2.4
Name: schema-genie
Version: 1.0.0
Summary: Infer star/snowflake schemas from DataFrames and generate DDL
License: MIT
Project-URL: Homepage, https://github.com/muhammadsufiyanbaig/schema-genie
Project-URL: Issues, https://github.com/muhammadsufiyanbaig/schema-genie/issues
Keywords: data-warehouse,schema,ddl,star-schema,snowflake,redshift,bigquery
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Topic :: Database
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.3
Requires-Dist: numpy>=1.21
Requires-Dist: sqlglot>=10.0
Requires-Dist: pyyaml>=6.0
Provides-Extra: snowflake
Requires-Dist: snowflake-connector-python>=3.0; extra == "snowflake"
Provides-Extra: redshift
Requires-Dist: psycopg2-binary>=2.9; extra == "redshift"
Provides-Extra: bigquery
Requires-Dist: google-cloud-bigquery>=3.0; extra == "bigquery"
Provides-Extra: diagram
Requires-Dist: graphviz>=0.20; extra == "diagram"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0; extra == "dev"
Requires-Dist: build; extra == "dev"
Requires-Dist: twine; extra == "dev"
Dynamic: license-file

# schema-genie

> **Automatically infer optimal star and snowflake schemas from raw CSVs or DataFrames and generate production-ready DDL for Redshift, BigQuery, Snowflake, and PostgreSQL.**

---

## Installation

```bash
pip install schema-genie
```

With optional extras (warehouse connectors and ER diagram export):

```bash
# Quote the requirement so shells such as zsh do not expand the brackets
pip install "schema-genie[snowflake]"
pip install "schema-genie[redshift]"
pip install "schema-genie[bigquery]"
pip install "schema-genie[diagram]"
```

---

## Quick Start

```python
import pandas as pd
from schema_genie import SchemaGenie

df = pd.read_csv("sales_data.csv")
genie = SchemaGenie(target="snowflake")
schema = genie.infer(df, table_name="sales")

print(schema.recommended_type)   # "star" or "snowflake"
print(schema.ddl)
schema.export_ddl("schema.sql")
```

### Multi-table inference

```python
genie = SchemaGenie(target="redshift")
schema = genie.infer_multi({
    "orders":    orders_df,
    "customers": customers_df,
    "products":  products_df,
})

print(schema.fact_table.name)         # "orders"
print([t.name for t in schema.dimension_tables])
print(schema.scd_candidates)
schema.export_diagram("er_diagram")   # requires: pip install "schema-genie[diagram]"
```

### Load config from YAML

```yaml
# genie_config.yaml
target: postgres
schema_type: auto
detect_scd: true
normalize_threshold: 0.05
```

```python
genie = SchemaGenie.from_config("genie_config.yaml")
```
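
The YAML keys mirror the `SchemaGenie` constructor arguments. The sketch below shows the kind of validation a config loader might apply to the parsed mapping. It is not the library's implementation: `GenieConfig` and `validate_config` are hypothetical names, and only the allowed values listed in the API reference are checked.

```python
from dataclasses import dataclass, fields

# Allowed values as listed in the API reference of this README.
VALID_TARGETS = {"snowflake", "redshift", "bigquery", "postgres"}
VALID_SCHEMA_TYPES = {"auto", "star", "snowflake"}


@dataclass(frozen=True)
class GenieConfig:
    """Hypothetical typed container; not part of schema-genie."""
    target: str = "snowflake"
    schema_type: str = "auto"
    primary_key_strategy: str = "surrogate"
    detect_scd: bool = True
    normalize_threshold: float = 0.05


def validate_config(raw: dict) -> GenieConfig:
    """Reject unknown keys and unsupported values, then fill in defaults."""
    known = {f.name for f in fields(GenieConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    cfg = GenieConfig(**raw)
    if cfg.target not in VALID_TARGETS:
        raise ValueError(f"unsupported target: {cfg.target!r}")
    if cfg.schema_type not in VALID_SCHEMA_TYPES:
        raise ValueError(f"unsupported schema_type: {cfg.schema_type!r}")
    return cfg


# This dict is what yaml.safe_load would return for genie_config.yaml.
config = validate_config({
    "target": "postgres",
    "schema_type": "auto",
    "detect_scd": True,
    "normalize_threshold": 0.05,
})
print(config)
```

Keys omitted from the file, such as `primary_key_strategy` here, fall back to the constructor defaults.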

---

## How It Works

`schema-genie` runs a six-stage statistical inference pipeline:

```
Raw DataFrames / CSVs
        │
        ▼
┌────────────────────────────┐
│    Type Detector            │  Maps each column to a semantic type
│  (measure/date/id/          │  using dtype + cardinality + name heuristics
│   category/text/currency)   │
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│    Cardinality Analyzer     │  Measures unique value ratio per column
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Relationship Detector     │  FK-like overlaps via value intersection scoring
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Fact Table Selector       │  Picks table with highest measure density
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│   Schema Type Recommender   │  Star vs. Snowflake based on dimension depth
└──────────────┬─────────────┘
               │
               ▼
┌────────────────────────────┐
│     DDL Generator           │  CREATE TABLE statements for target warehouse
└────────────────────────────┘
```
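
To make the middle stages concrete, here is a minimal sketch of three of the heuristics named above: unique value ratio, value intersection scoring, and measure density. The function names and exact formulas are assumptions chosen for illustration, not the library's internals. Each function accepts any iterable of hashable values, so a pandas column can be passed as `df["col"].dropna()`.

```python
from typing import Hashable, Iterable, Mapping


def unique_ratio(values: Iterable[Hashable]) -> float:
    """Cardinality: distinct values divided by total values (0.0 if empty)."""
    items = list(values)
    return len(set(items)) / len(items) if items else 0.0


def fk_overlap(child: Iterable[Hashable], parent: Iterable[Hashable]) -> float:
    """Share of distinct child values that also appear in the parent column."""
    child_set, parent_set = set(child), set(parent)
    return len(child_set & parent_set) / len(child_set) if child_set else 0.0


def measure_density(column_types: Mapping[str, str]) -> float:
    """Fraction of a table's columns whose semantic type is 'measure'."""
    if not column_types:
        return 0.0
    measures = sum(1 for t in column_types.values() if t == "measure")
    return measures / len(column_types)


# Toy data: orders reference customers through customer_id.
orders_customer_id = [1, 2, 2, 3, 3, 3]
customers_id = [1, 2, 3, 4]

print(unique_ratio(customers_id))                    # 1.0: behaves like a key
print(unique_ratio(orders_customer_id))              # 0.5: repeated values
print(fk_overlap(orders_customer_id, customers_id))  # 1.0: every value matches

# Semantic types as a type detector might assign them (hypothetical output).
tables = {
    "orders": {"order_id": "id", "customer_id": "id",
               "amount": "measure", "quantity": "measure"},
    "customers": {"id": "id", "name": "text", "segment": "category"},
}
fact = max(tables, key=lambda name: measure_density(tables[name]))
print(fact)  # orders
```

A column with a unique ratio of 1.0 is a key candidate, and a high overlap score from a repeated column into such a key suggests a foreign-key relationship.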

---

## API Reference

### `SchemaGenie`

```python
SchemaGenie(
    target: str = "snowflake",          # "snowflake" | "redshift" | "bigquery" | "postgres"
    schema_type: str = "auto",          # "auto" | "star" | "snowflake"
    primary_key_strategy: str = "surrogate",
    detect_scd: bool = True,
    normalize_threshold: float = 0.05
)
```

| Method | Description |
|--------|-------------|
| `genie.infer(df, table_name)` | Infer schema from a single DataFrame |
| `genie.infer_multi(dict_of_dfs)` | Infer schema across multiple related tables |
| `genie.deploy(connection, schema)` | Execute DDL against a live warehouse |
| `SchemaGenie.from_config(path)` | Load configuration from a YAML file |

### `InferredSchema`

```python
schema.recommended_type     # "star" | "snowflake"
schema.fact_table           # TableDefinition
schema.dimension_tables     # list[TableDefinition]
schema.relationships        # list[dict] — detected FK relationships
schema.ddl                  # str — full DDL ready to execute
schema.scd_candidates       # list[str] — SCD Type 2 flagged columns
schema.confidence_score     # float — pipeline confidence [0, 1]
schema.export_ddl(path)     # Save DDL to a .sql file
schema.export_diagram(path) # Export ER diagram (requires the "diagram" extra)
schema.summary()            # Human-readable summary string
```

---

## Supported Targets

| Target | Surrogate Key | Currency Type | Direct Deploy |
|--------|---------------|---------------|---------------|
| `snowflake` | `AUTOINCREMENT` | `NUMBER(18,2)` | Yes |
| `redshift` | `IDENTITY(1,1)` | `DECIMAL(18,2)` | Yes |
| `bigquery` | `INT64` | `NUMERIC` | Yes |
| `postgres` | `SERIAL` | `NUMERIC(18,2)` | Yes |

All dimension tables automatically receive SCD Type 2 audit columns:
`_valid_from`, `_valid_to`, `_is_current`, `_loaded_at`
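
As a rough illustration of what such a dimension table could look like for the `postgres` target, the sketch below assembles a `CREATE TABLE` statement with a `SERIAL` surrogate key and the four audit columns. The helper `dimension_ddl`, the audit column types, and the defaults are assumptions; the DDL that `schema-genie` actually emits may differ.

```python
from typing import Sequence, Tuple

# Audit column names come from this README; the SQL types are assumptions.
AUDIT_COLUMNS = [
    ("_valid_from", "TIMESTAMP NOT NULL"),
    ("_valid_to", "TIMESTAMP"),
    ("_is_current", "BOOLEAN NOT NULL DEFAULT TRUE"),
    ("_loaded_at", "TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP"),
]


def dimension_ddl(table: str, key: str,
                  columns: Sequence[Tuple[str, str]]) -> str:
    """Build PostgreSQL DDL: surrogate key, business columns, audit columns."""
    lines = [f"{key} SERIAL PRIMARY KEY"]
    lines += [f"{name} {sql_type}" for name, sql_type in columns]
    lines += [f"{name} {sql_type}" for name, sql_type in AUDIT_COLUMNS]
    body = ",\n    ".join(lines)
    return f"CREATE TABLE {table} (\n    {body}\n);"


ddl = dimension_ddl(
    "dim_customer",
    "customer_key",
    [("customer_id", "INTEGER"), ("segment", "VARCHAR(64)")],
)
print(ddl)
```

With these columns, a changed attribute is recorded by closing the current row (setting `_valid_to` and clearing `_is_current`) and inserting a new row, which is the usual SCD Type 2 pattern.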

---

## Development

```bash
git clone https://github.com/muhammadsufiyanbaig/schema-genie
cd schema-genie
pip install -e ".[dev]"
pytest tests/ -v
```

---

## License

MIT — See [LICENSE](LICENSE)
