Metadata-Version: 2.4
Name: pyde-toolkit
Version: 1.2.0
Summary: A growing toolkit of data-engineering helper functions and CLI commands — starting with schema inference (column standardisation, type inference, schema + DDL generation for Pandas/ANSI SQL or PySpark/Spark SQL).
Author: Your Name
License: MIT
Project-URL: Homepage, https://github.com/your-org/pyde-toolkit
Project-URL: Issues, https://github.com/your-org/pyde-toolkit/issues
Keywords: pandas,pyspark,schema,ddl,data-engineering,delta-lake,databricks,toolkit
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Libraries
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.3
Requires-Dist: numpy>=1.21
Provides-Extra: excel
Requires-Dist: openpyxl>=3.0; extra == "excel"
Requires-Dist: xlrd>=2.0; extra == "excel"
Requires-Dist: odfpy>=1.4; extra == "excel"
Provides-Extra: memcheck
Requires-Dist: psutil>=5.9; extra == "memcheck"
Provides-Extra: all
Requires-Dist: openpyxl>=3.0; extra == "all"
Requires-Dist: xlrd>=2.0; extra == "all"
Requires-Dist: odfpy>=1.4; extra == "all"
Requires-Dist: psutil>=5.9; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=4.0; extra == "dev"
Dynamic: license-file

# pyde-toolkit

A growing toolkit of data-engineering helper functions and CLI commands.
The first tool is **schema inference**: standardise column names, infer
data types from sample data, and emit ready-to-use schema definitions and
`CREATE TABLE` DDL in either Pandas/ANSI SQL or PySpark/Spark SQL form (with
bronze/silver/gold layer support for Databricks / Unity Catalog workflows).

## Install

```bash
pip install pyde-toolkit

# with Excel support (.xlsx, .xls, .xlsm, .xlsb, .ods)
pip install "pyde-toolkit[excel]"

# with the pre-flight memory check for large full-file reads
pip install "pyde-toolkit[memcheck]"

# everything
pip install "pyde-toolkit[all]"
```

Already have it installed and want the latest release?

```bash
pip install --upgrade pyde-toolkit
```

## Quick start — Python

```python
from pyde_toolkit import infer_file          # top-level convenience re-export
# or, namespaced (recommended as the toolkit grows):
from pyde_toolkit.schema_inferencer import infer_file

result = infer_file(
    "sales.csv",
    pyspark=True,
    casing="pascal",
    table_name="sales_fact",
    header_row=0,        # skip junk title rows if needed, e.g. header_row=4
    type_threshold=0.95, # tolerate a few dirty values before falling back to string
)

print(result["schema"])         # PySpark StructType or pandas dtype dict
print(result["create_table"])   # SQL DDL
print(result["rename_code"])    # copy-paste column rename snippet
print(result["report"])         # full formatted text report
```

## Quick start — CLI

```bash
pyde-toolkit schema-infer sales.csv
pyde-toolkit schema-infer sales.csv --pyspark true --case pascal --layer bronze --catalog prod
pyde-toolkit schema-infer sales.xlsx --sheet Sheet2 --layer silver
pyde-toolkit schema-infer messy.csv --header-row 4 --type-threshold 0.80
pyde-toolkit --version
```

Run `pyde-toolkit schema-infer --help` for the full flag reference, or see
the module docstring in `pyde_toolkit/schema_inferencer/core.py`.

## Features

- **Column standardisation** — camel, pascal, snake, screaming, kebab, or
  skip casing, with symbol expansion (`/` → `or`, `%` → `pct`, etc.)
- **Type inference** — bool, int32/int64, float, date, datetime, string,
  with a configurable conformance threshold (`--type-threshold`) to tolerate
  dirty data
- **Header offset** — `--header-row` to skip junk/title rows above the real
  header, for both CSV and Excel
- **Dual output modes** — Pandas dtypes + ANSI SQL, or PySpark StructType +
  Spark SQL
- **Layered outputs** — bronze, parquet_bronze, silver, gold, gold_vw (view),
  or all five at once
- **Table types** — managed Delta, external, or external Delta tables
- **Flexible input** — CSV/TSV (delimiter auto-detected), Excel
  (`.xlsx .xls .xlsm .xlsb .ods`), or a pandas DataFrame directly

## Project structure (for contributors)

```
src/pyde_toolkit/
├── __init__.py            # top-level re-exports + __version__
├── cli.py                 # top-level CLI dispatcher (registers subcommands)
└── schema_inferencer/     # one subpackage per feature
    ├── __init__.py        # public API for this feature
    ├── core.py            # logic only, no argparse
    └── cli.py             # add_arguments(parser) + run(args) for this feature
```

**Adding a new feature later:** create `src/pyde_toolkit/<your_feature>/` with
the same three-file shape (`__init__.py`, `core.py`, `cli.py`), then register it
with one line in `build_parser()` in `src/pyde_toolkit/cli.py`. No other files
need to change.
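
As a rough sketch of that pattern, the snippet below wires one feature into
an `argparse` dispatcher. The names `add_arguments`, `run`, and
`build_parser` come from the layout above. The `hello` feature, the
`_register` helper, and `main` are hypothetical and only illustrate the
shape, so the real dispatcher may differ.

```python
import argparse


# --- what a feature's cli.py might expose (hypothetical "hello" feature) ---
def add_arguments(parser):
    """Declare this feature's flags on the subparser it is given."""
    parser.add_argument("name", help="who to greet")


def run(args):
    """Execute the feature using the parsed arguments; return an exit code."""
    print(f"hello, {args.name}")
    return 0


# --- what the top-level dispatcher might look like ---
def _register(subparsers, name, add_arguments, run):
    """Hypothetical helper: attach one feature as a subcommand."""
    sub = subparsers.add_parser(name)
    add_arguments(sub)
    sub.set_defaults(func=run)


def build_parser():
    parser = argparse.ArgumentParser(prog="pyde-toolkit")
    subparsers = parser.add_subparsers(dest="command", required=True)
    # One line per feature is all the dispatcher needs.
    _register(subparsers, "hello", add_arguments, run)
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    return args.func(args)
```

Calling `main(["hello", "world"])` dispatches to the feature's `run()`.
Keeping `argparse` out of `core.py` means the same logic stays importable
from Python without going through the CLI.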

## Releasing a new version

The version lives in one place (`pyproject.toml`). The installed package's
`__version__` is read from package metadata at runtime, so there is nothing
else to keep in sync.

```bash
python scripts/bump_version.py patch   # or minor / major / an exact X.Y.Z
python -m build
twine upload dist/*
```

Existing users pick up the release with `pip install --upgrade pyde-toolkit`;
there is no need to uninstall first.

## License

MIT — see [LICENSE](LICENSE).
