Metadata-Version: 2.4
Name: pyde-toolkit
Version: 1.0.5
Summary: A growing toolkit of data-engineering helper functions and CLI commands — starting with schema inference (column standardisation, type inference, schema + DDL generation for Pandas/ANSI SQL or PySpark/Spark SQL).
Author: Your Name
License: MIT
Project-URL: Homepage, https://github.com/your-org/pyde-toolkit
Project-URL: Issues, https://github.com/your-org/pyde-toolkit/issues
Keywords: pandas,pyspark,schema,ddl,data-engineering,delta-lake,databricks,toolkit
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Database
Classifier: Topic :: Software Development :: Libraries
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.3
Requires-Dist: numpy>=1.21
Provides-Extra: excel
Requires-Dist: openpyxl>=3.0; extra == "excel"
Requires-Dist: xlrd>=2.0; extra == "excel"
Requires-Dist: odfpy>=1.4; extra == "excel"
Provides-Extra: memcheck
Requires-Dist: psutil>=5.9; extra == "memcheck"
Provides-Extra: all
Requires-Dist: openpyxl>=3.0; extra == "all"
Requires-Dist: xlrd>=2.0; extra == "all"
Requires-Dist: odfpy>=1.4; extra == "all"
Requires-Dist: psutil>=5.9; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: build>=1.0; extra == "dev"
Requires-Dist: twine>=4.0; extra == "dev"
Dynamic: license-file

# pyde-toolkit

A growing toolkit of data-engineering helper functions and CLI commands. Each tool lives in its own submodule, so the package can keep expanding without tools colliding with one another.

**Tools currently included:**

| Submodule | What it does |
|---|---|
| `pyde_toolkit.schema_inferencer` | Infers column names, data types, schema definitions, and `CREATE TABLE`/`CREATE VIEW` DDL from a CSV/TSV/Excel file **or a pandas DataFrame already in memory**. Outputs Pandas/ANSI SQL or PySpark/Spark SQL, with optional Databricks medallion-layer (bronze/silver/gold) support. |

## Installation

```bash
pip install pyde-toolkit
```

Reading Excel files (for `schema_inferencer`) needs the optional extra:

```bash
pip install "pyde-toolkit[excel]"
```

> Not yet on PyPI? See [Building, releasing & installing upgrades](#building-releasing--installing-upgrades) below to build and install it locally first.

## Quick Start — Schema Inferencer

Pass a DataFrame directly — no file I/O required:

```python
import pandas as pd
from pyde_toolkit.schema_inferencer import infer_file

df = pd.DataFrame({
    "Plant Description": ["Mumbai Plant", "Pune Plant"],
    "ZODI/ZLDI":          ["ZODI", "ZLDI"],
    "Cost %":              [12.5, 8.0],
})

result = infer_file(df, pyspark=True, casing="snake", table_name="plant_master")

print(result["schema"])        # PySpark StructType, ready to paste
print(result["create_table"])  # CREATE TABLE ... USING DELTA
print(result["rename_code"])   # df.withColumnRenamed(...) snippet
```

A top-level convenience import also works for the most common function:

```python
from pyde_toolkit import infer_file
```

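It works the same way with a Spark DataFrame in a Databricks notebook once the DataFrame is converted with `toPandas()`. That call collects all rows onto the driver, so limit or sample a large DataFrame first: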

```python
result = infer_file(spark_df.toPandas(), pyspark=True, casing="snake",
                     table_name="sales_fact", layer="silver", catalog="prod")
```

Or from a file path:

```python
result = infer_file("Sales1.csv", casing="pascal")   # Pandas + ANSI SQL by default
```
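
To give a feel for what a `casing="snake"` rename involves, here is a minimal standalone sketch. It is not the package's implementation: `to_snake` is a hypothetical helper, and the real casing rules (documented in `docs/schema_inferencer.md`) may treat symbols such as `%` differently.

```python
import re


def to_snake(name: str) -> str:
    """Hypothetical helper: a generic snake_case normaliser (not pyde-toolkit's own)."""
    # Replace every run of non-alphanumeric characters with a single underscore.
    cleaned = re.sub(r"[^0-9A-Za-z]+", "_", name)
    # Drop leading/trailing underscores and lowercase the result.
    return cleaned.strip("_").lower()


columns = ["Plant Description", "ZODI/ZLDI", "Cost %"]
renamed = {col: to_snake(col) for col in columns}
print(renamed)
# {'Plant Description': 'plant_description', 'ZODI/ZLDI': 'zodi_zldi', 'Cost %': 'cost'}
```

Compare this against `result["rename_code"]` from a real run to see the names the tool actually produces.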

## Command line

The package installs a single `pyde-toolkit` command. Each tool is a subcommand:

```bash
pyde-toolkit schema-infer Sales1.csv
pyde-toolkit schema-infer Sales1.csv --pyspark true --case pascal
pyde-toolkit schema-infer Sales1.csv --pyspark true --layer all --catalog prod
pyde-toolkit schema-infer --help
pyde-toolkit --version
```

## Full documentation

- [`docs/schema_inferencer.md`](docs/schema_inferencer.md) — complete reference for the schema inferencer: every flag/parameter, casing rules, type-inference behaviour, sampling, medallion layers, table types, and the full `infer_file()` return value.
- [`docs/RELEASING.md`](docs/RELEASING.md) — step-by-step checklist for making a change, bumping the version, building, publishing, and installing the upgrade.

(As more tools are added, each gets its own `docs/<tool_name>.md`.)

## Adding a new tool to the toolkit

The package is structured so new tools drop in without touching existing ones:

1. Create `src/pyde_toolkit/<your_tool>/` with its own `core.py` (the logic) and `cli.py` exposing two functions: `add_arguments(parser)` to register its flags, and `run(args)` to execute. See `schema_inferencer/cli.py` for the pattern.
2. In `src/pyde_toolkit/cli.py`, register it as a new subcommand — one `subparsers.add_parser(...)` call plus `your_tool_cli.add_arguments(...)`. Dispatch is generic, so nothing else needs to change.
3. Optionally re-export its main function from `src/pyde_toolkit/__init__.py` for a top-level convenience import.
4. Add `docs/<your_tool>.md` and `tests/<your_tool>/`.
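
The steps above can be sketched roughly as follows, collapsed into one file for readability. Everything tool-specific here is hypothetical: the `row-count` subcommand, its flags, and the `build_parser`/`main` names are illustrations only, and wiring dispatch through `set_defaults(func=...)` is an assumption about how the generic dispatch might work. Treat `schema_inferencer/cli.py` as the authoritative pattern.

```python
import argparse


# --- would live in src/pyde_toolkit/row_counter/cli.py (hypothetical tool) ---
def add_arguments(parser: argparse.ArgumentParser) -> None:
    """Register this tool's flags on its own subparser."""
    parser.add_argument("path", help="file to inspect")
    parser.add_argument("--skip-header", action="store_true",
                        help="do not count the first line")


def run(args: argparse.Namespace) -> int:
    """Execute the tool and return a process exit code."""
    with open(args.path, encoding="utf-8") as fh:
        rows = sum(1 for _ in fh)
    if args.skip_header and rows:
        rows -= 1
    print(rows)
    return 0


# --- would live in src/pyde_toolkit/cli.py ---
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="pyde-toolkit")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # One add_parser(...) call plus add_arguments(...) per tool.
    sub = subparsers.add_parser("row-count", help="count rows in a file")
    add_arguments(sub)
    sub.set_defaults(func=run)  # assumed dispatch mechanism
    return parser


def main(argv=None) -> int:
    args = build_parser().parse_args(argv)
    return args.func(args)  # generic: no per-tool branching needed
```

With a layout like this, adding another tool only touches the registration lines inside `build_parser`; `main` itself never changes.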

## Building, releasing & installing upgrades

Quick version — see [`docs/RELEASING.md`](docs/RELEASING.md) for the full checklist (versioning rules, publishing options, troubleshooting):

```bash
pip install -e ".[dev]"              # 1. dev install
pytest                               # 2. test your changes
# 3. bump version = "X.Y.Z" in pyproject.toml (a one-line change)
rm -rf build dist src/*.egg-info     #    remove stale build artifacts
python -m build                      # 4. build dist/*.whl and dist/*.tar.gz
twine upload dist/*                  # 5. publish (PyPI or your private index)
pip install --upgrade pyde-toolkit   # 6. install the new version
```

`pyde_toolkit.__version__` and `pyde-toolkit --version` are read automatically from whatever version is installed — no need to edit any source file besides `pyproject.toml`.

## License

MIT — see [`LICENSE`](LICENSE).
