Metadata-Version: 2.4
Name: spark-fuse
Version: 0.2.0
Summary: Open-source PySpark toolkit with connectors and CLI for Azure Storage, Databricks, Microsoft Fabric Lakehouses, Unity Catalog, and Hive Metastore.
Project-URL: Homepage, https://kevinsames.github.io/spark-fuse/
Project-URL: Documentation, https://kevinsames.github.io/spark-fuse/
Project-URL: Repository, https://github.com/kevinsames/spark-fuse
Project-URL: Issues, https://github.com/kevinsames/spark-fuse/issues
Author: Kevin Sames
License: Copyright 2025 Kevin Sames
        
        Licensed under the Apache License, Version 2.0 (the "License");
        you may not use this file except in compliance with the License.
        You may obtain a copy of the License at
        
            http://www.apache.org/licenses/LICENSE-2.0
        
        Unless required by applicable law or agreed to in writing, software
        distributed under the License is distributed on an "AS IS" BASIS,
        WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
        See the License for the specific language governing permissions and
        limitations under the License.
License-File: LICENSE
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.9
Requires-Dist: adlfs>=2023.4
Requires-Dist: azure-identity>=1.14
Requires-Dist: delta-spark<4,>=3
Requires-Dist: pydantic>=2
Requires-Dist: pyspark<4,>=3.4
Requires-Dist: requests>=2
Requires-Dist: rich>=13
Requires-Dist: typer>=0.9
Provides-Extra: dev
Requires-Dist: build>=1; extra == 'dev'
Requires-Dist: pre-commit>=3; extra == 'dev'
Requires-Dist: pytest-cov>=4; extra == 'dev'
Requires-Dist: pytest>=7; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: twine>=5; extra == 'dev'
Provides-Extra: qdrant
Requires-Dist: qdrant-client>=1.8; extra == 'qdrant'
Description-Content-Type: text/markdown

spark-fuse
================

![CI](https://github.com/kevinsames/spark-fuse/actions/workflows/ci.yml/badge.svg)
![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)

spark-fuse is an open-source toolkit for PySpark that provides utilities, connectors, and tools to fuse your data workflows across Azure Storage (ADLS Gen2), Databricks, Microsoft Fabric Lakehouses (via OneLake/Delta), Unity Catalog, and Hive Metastore.

Features
- Connectors for ADLS Gen2 (`abfss://`), Fabric OneLake (`onelake://` or `abfss://...onelake.dfs.fabric.microsoft.com/...`), and Databricks DBFS (`dbfs:/`).
- Unity Catalog and Hive Metastore helpers to create catalogs/schemas and register external Delta tables.
- SparkSession helpers with sensible defaults and environment detection (Databricks/Fabric/local).
- LLM-powered semantic column normalization that batches API calls and caches responses.
- Typer-powered CLI: list connectors, preview datasets, register tables, submit Databricks jobs.

Installation
- Create a virtual environment (recommended)
  - macOS/Linux:
    - `python3 -m venv .venv`
    - `source .venv/bin/activate`
    - `python -m pip install --upgrade pip`
  - Windows (PowerShell):
    - `python -m venv .venv`
    - `.\.venv\Scripts\Activate.ps1`
    - `python -m pip install --upgrade pip`
- From source (dev): `pip install -e ".[dev]"`
- From PyPI: `pip install "spark-fuse>=0.2.0"`

Quickstart
1) Create a SparkSession with helpful defaults
```python
from spark_fuse.spark import create_session
spark = create_session(app_name="spark-fuse-quickstart")
```
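The defaults that `create_session` applies are not listed here. As a rough sketch only, the block below shows the two settings Delta Lake documents for a plain Spark session, plus a hypothetical `detect_environment` helper. The helper names and the decision to add Delta settings only for local runs are assumptions, not spark-fuse's actual behavior, and Fabric detection is left out because its signal is not described in this README.

```python
import os

# Settings that Delta Lake documents for enabling Delta in a plain Spark session.
DELTA_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def detect_environment(env=None):
    """Hypothetical helper: guess the runtime from environment variables."""
    env = os.environ if env is None else env
    # Databricks runtimes export DATABRICKS_RUNTIME_VERSION.
    if "DATABRICKS_RUNTIME_VERSION" in env:
        return "databricks"
    return "local"


def session_conf(app_name, env=None):
    """Hypothetical helper: build the key/value settings for a session."""
    conf = {"spark.app.name": app_name}
    # Assumption: managed runtimes already ship Delta, so only local runs need it.
    if detect_environment(env) == "local":
        conf.update(DELTA_CONF)
    return conf
```

Each pair from `session_conf(...)` could then be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.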

2) Read a Delta table from ADLS or OneLake
```python
from spark_fuse.io.azure_adls import ADLSGen2Connector

df = ADLSGen2Connector().read(spark, "abfss://container@account.dfs.core.windows.net/path/to/delta")
df.show(5)
```
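To make the shape of the `abfss://` path above concrete, here is a small standalone sketch that splits such a URI into container, storage account, and path using only the standard library. `split_abfss` is a hypothetical helper for illustration and is not part of spark-fuse.

```python
from urllib.parse import urlsplit


def split_abfss(uri):
    """Hypothetical helper: split abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    parts = urlsplit(uri)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URI: {uri}")
    if not parts.username or not parts.hostname:
        raise ValueError(f"expected <container>@<host> in: {uri}")
    return {
        "container": parts.username,                 # text before the '@'
        "account": parts.hostname.split(".", 1)[0],  # first label of the host
        "host": parts.hostname,
        "path": parts.path.lstrip("/"),
    }


info = split_abfss("abfss://container@account.dfs.core.windows.net/path/to/delta")
print(info["container"], info["account"], info["path"])
```

A check like this can catch a malformed location before a Spark job is started.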

3) Register an external table in Unity Catalog
```python
from spark_fuse.catalogs import unity

unity.create_catalog(spark, "analytics")
unity.create_schema(spark, catalog="analytics", schema="core")
unity.register_external_delta_table(
    spark,
    catalog="analytics",
    schema="core",
    table="events",
    location="abfss://container@account.dfs.core.windows.net/path/to/delta",
)
```
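For orientation, the three calls above correspond roughly to the Databricks SQL statements built below. This is a sketch of the equivalent SQL under that assumption, not the library's verified implementation; `quote` and `unity_statements` are hypothetical names.

```python
def quote(name):
    """Backtick-quote an identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def unity_statements(catalog, schema, table, location):
    """Hypothetical helper: SQL for catalog, schema, and external Delta table."""
    fq_schema = f"{quote(catalog)}.{quote(schema)}"
    fq_table = f"{fq_schema}.{quote(table)}"
    loc = location.replace("'", "\\'")  # escape single quotes in the path literal
    return [
        f"CREATE CATALOG IF NOT EXISTS {quote(catalog)}",
        f"CREATE SCHEMA IF NOT EXISTS {fq_schema}",
        f"CREATE TABLE IF NOT EXISTS {fq_table} USING DELTA LOCATION '{loc}'",
    ]


for stmt in unity_statements(
    "analytics",
    "core",
    "events",
    "abfss://container@account.dfs.core.windows.net/path/to/delta",
):
    print(stmt)  # each statement could be run with spark.sql(stmt)
```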

LLM-Powered Column Mapping
```python
from spark_fuse.utils.transformations import map_column_with_llm

standard_values = ["Apple", "Banana", "Cherry"]
mapped_df = map_column_with_llm(
    df,
    column="fruit",
    target_values=standard_values,
    model="o4-mini",
    temperature=None,
)
mapped_df.select("fruit", "fruit_mapped").show()
```

Set `dry_run=True` to see how many rows already match the target values without spending LLM tokens. Before running live mappings, configure your OpenAI or Azure OpenAI credentials through the usual environment variables. Some provider models accept only their default sampling configuration; pass `temperature=None` to omit the parameter in that case. This helper is available in spark-fuse 0.2.0 and later.
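The batching, caching, and dry-run ideas can be illustrated without Spark or a real LLM. The sketch below is a plain-Python approximation under stated assumptions: `map_values`, its parameters, and the case-insensitive exact match are all hypothetical and do not describe how `map_column_with_llm` works internally. `llm_batch_fn` stands in for whatever client call returns one mapped value, or `None`, per input.

```python
def map_values(values, target_values, llm_batch_fn, batch_size=20, cache=None, dry_run=False):
    """Hypothetical sketch: resolve values to targets, calling the LLM only when needed."""
    cache = {} if cache is None else cache
    targets = {t.lower(): t for t in target_values}
    distinct = list(dict.fromkeys(values))  # drop duplicates, keep first-seen order

    # Values that already equal a target (ignoring case) never reach the LLM.
    for value in distinct:
        hit = targets.get(value.strip().lower())
        if hit is not None:
            cache.setdefault(value, hit)

    pending = [value for value in distinct if value not in cache]
    stats = {
        "matched_rows": sum(1 for value in values if value in cache),
        "pending_values": len(pending),
    }

    if not dry_run:
        # One call per batch of distinct values; results are stored for reuse.
        for start in range(0, len(pending), batch_size):
            batch = pending[start:start + batch_size]
            for value, mapped in zip(batch, llm_batch_fn(batch, target_values)):
                cache[value] = mapped

    return [cache.get(value) for value in values], stats
```

Because duplicates are collapsed before batching, the number of LLM calls depends on the count of distinct unmatched values rather than on the row count, which is the main reason to cache at all.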

CLI Usage
- `spark-fuse --help`
- `spark-fuse connectors`
- `spark-fuse read --path abfss://container@account.dfs.core.windows.net/path/to/delta --show 5`
- `spark-fuse uc-create --catalog analytics --schema core`
- `spark-fuse uc-register-table --catalog analytics --schema core --table events --path abfss://.../delta`
- `spark-fuse hive-register-external --database analytics_core --table events --path abfss://.../delta`
- `spark-fuse fabric-register --table lakehouse_table --path onelake://workspace/lakehouse/Tables/events`
- `spark-fuse databricks-submit --json job.json`

CI
- GitHub Actions runs ruff and pytest for Python 3.9–3.11.

License
- Apache 2.0
