Metadata-Version: 2.4
Name: sparktimise
Version: 2.0.0
Summary: A Python wrapper that analyses DataFrames and applies optimisation techniques to maximise PySpark session performance.
Author: Keilan Evans
License: MIT
Keywords: pyspark,spark,optimisation,dataframe,RAP
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Typing :: Typed
Requires-Python: <3.13,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pyarrow>=10.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: tomli>=2.0; python_version < "3.11"
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Requires-Dist: pytest-cov>=4.1; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: mypy>=1.8; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: pre-commit; extra == "dev"
Requires-Dist: types-PyYAML; extra == "dev"
Provides-Extra: pandas
Requires-Dist: pandas>=1.5; extra == "pandas"
Dynamic: license-file

<p align="center">
  <img src="https://github.com/KeilanEvans/sparktimise/blob/main/logo_white.svg" alt="sparktimise logo" width="400">
</p>

# sparktimise

A PySpark optimisation library that inspects DataFrames and applies targeted performance improvements with minimal user code changes.

[![Python 3.10–3.12](https://img.shields.io/badge/python-3.10%20%7C%203.11%20%7C%203.12-blue?logo=python&logoColor=white)](https://github.com/KeilanEvans/sparktimise)
[![Version](https://img.shields.io/badge/version-2.0.0-green)](https://github.com/KeilanEvans/sparktimise)
[![License: MIT](https://img.shields.io/badge/license-MIT-yellow)](https://github.com/KeilanEvans/sparktimise/blob/main/LICENSE)
[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000?logo=ruff&logoColor=white)](https://docs.astral.sh/ruff/)
[![PySpark](https://img.shields.io/badge/PySpark-compatible-E25A1C?logo=apachespark&logoColor=white)](https://spark.apache.org/docs/latest/api/python/)

## What The Package Does

sparktimise provides two ways to optimise PySpark jobs:

1. A pipeline context-manager workflow through `optimise` / `SparkPipelineAutoTuner`.
2. A functional workflow through `analyse_*` and `optimise_*` functions for explicit control.

| Capability | Description | Primary API |
|---|---|---|
| Partition optimisation | Estimates optimal shuffle partitions and low-cardinality partition candidates | `optimise_partitions` |
| Skew mitigation | Detects skewed keys and applies salting columns | `optimise_skew` |
| Cache strategy | Recommends and applies `StorageLevel` persistence | `optimise_cache` |
| Spark session tuning | Recommends and applies Spark SQL/session settings | `SparkPipelineAutoTuner` / `optimise_context` |
| Broadcast analysis | Profiles table sizes for join strategy advice and optional hints | `analyse_broadcast` / `apply_broadcast_hints` |
| Reporting | Summarises pipeline steps and metadata in text or dict form | `OptimisationReport` |
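
To make the skew-mitigation row concrete, here is a minimal pure-Python sketch of the general idea behind key salting. It is not the library's implementation: the helper names `find_skewed_keys` and `salt_key`, the 30% threshold, and the salt count are all assumptions chosen for illustration.

```python
from collections import Counter


def find_skewed_keys(keys, threshold=0.3):
    """Return keys whose share of all rows exceeds `threshold` (hypothetical cutoff)."""
    counts = Counter(keys)
    total = sum(counts.values())
    return {key for key, n in counts.items() if n / total > threshold}


def salt_key(key, row_index, skewed, num_salts=4):
    """Append a salt suffix to skewed keys so their rows spread over `num_salts` buckets."""
    if key in skewed:
        return f"{key}_{row_index % num_salts}"
    return key


# Key "a" holds 80% of the rows, so it is the only key flagged as skewed.
keys = ["a"] * 8 + ["b", "c"]
skewed = find_skewed_keys(keys)
salted = [salt_key(k, i, skewed) for i, k in enumerate(keys)]
```

After salting, the eight `"a"` rows are split across four distinct keys, so no single key carries more than two rows. In Spark the same effect is usually achieved by adding a salt column and including it in the join or grouping key.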

## Installation

```bash
pip install sparktimise
```

For local development:

```bash
pip install -e ".[dev]"
pip install pyspark
```

## Quick Start

### Context-manager usage

```python
from pyspark.sql import SparkSession
from sparktimise import SparkPipelineAutoTuner

spark = SparkSession.builder.appName("orders-job").getOrCreate()


def run_pipeline():
    orders = spark.read.parquet("s3a://my-bucket/orders/")
    return orders.groupBy("customer_id").count()


with SparkPipelineAutoTuner(
    spark=spark,
    pipeline_name="orders_pipeline",
    watched_modules=["my_project.orders"],
) as tuner:
    tuner.execute("run_orders", run_pipeline)
```

### Entry-point usage through `optimise`

```python
from pyspark.sql import SparkSession
from sparktimise import optimise

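spark = SparkSession.builder.appName("orders-job").getOrCreate()


def run_pipeline():
    # Same pipeline function as in the context-manager example above.
    orders = spark.read.parquet("s3a://my-bucket/orders/")
    return orders.groupBy("customer_id").count()


with optimise(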
    spark,
    "orders_pipeline",
    run_type="optimise",
    watched_modules=["my_project.orders"],
) as tuner:
    tuner.execute("run_orders", run_pipeline)
```

### Functional usage

```python
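from sparktimise.optimisation import optimise_partitions, optimise_skew

# `df` is an existing PySpark DataFrame.
# 134_217_728 bytes = 128 MiB target partition size.
step1 = optimise_partitions(df, target_partition_bytes=134_217_728)
# Each step returns a result object; pass its `.df` to the next step.
step2 = optimise_skew(step1.df, columns=["customer_id"])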

optimised_df = step2.df
print(step1.transformations_applied)
print(step2.transformations_applied)
```
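
For intuition about `target_partition_bytes`, the arithmetic below shows how a target partition size typically maps to a partition count. This is a hedged sketch only: `estimate_partition_count` is a hypothetical helper, and the estimate that `optimise_partitions` actually computes may take other factors into account.

```python
import math


def estimate_partition_count(total_bytes, target_partition_bytes=134_217_728):
    """Smallest partition count that keeps each partition at or under the target size."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# A 10 GiB dataset with a 128 MiB target needs 80 partitions.
ten_gib = 10 * 1024**3
partitions = estimate_partition_count(ten_gib)
```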

## Configuration And Runtime Controls

The primary runtime configuration surface is the `optimise` context manager.

| Parameter | Type | Default | Effect |
|---|---|---|---|
| run_type | str | "optimise" | `"optimise"`, `"baseline"`, or `"report"` |
| watched_functions | list[str] \| None | None | Exact or wildcard qualified function names to auto-assess |
| watched_modules | list[str] \| None | None | Module prefixes to auto-assess |
| auto_capture | bool | True | Enables context-manager function-return capture |
| include_plan | bool | False | Stores full plan text in assessments |
| run_id | str \| None | None | Groups results under a named folder; defaults to pipeline name |
| results_root | str \| None | None | Root folder to write sparktimise_results under |
| spark | SparkSession | Required | Session used for safe SQL/session tuning |

For full configuration details, including file-backed config loading and Spark recommendation settings, see [docs/configuration.md](docs/configuration.md).
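
The sketch below illustrates one plausible reading of how `watched_functions` (exact or wildcard names) and `watched_modules` (module prefixes) could select functions. The matching rules shown are an assumption built on the standard library's `fnmatch`, and `is_watched` is a hypothetical helper, not part of the sparktimise API.

```python
from fnmatch import fnmatchcase


def is_watched(qualified_name, watched_functions=None, watched_modules=None):
    """Return True if a function's qualified name matches either watch list."""
    # Exact names match themselves; patterns such as "pkg.mod.load_*" act as wildcards.
    for pattern in watched_functions or []:
        if fnmatchcase(qualified_name, pattern):
            return True
    # Treat everything before the final dot as the module path.
    module_name = qualified_name.rpartition(".")[0]
    for prefix in watched_modules or []:
        # Match the module itself or any submodule, but not a sibling
        # whose name merely starts with the same characters.
        if module_name == prefix or module_name.startswith(prefix + "."):
            return True
    return False


matched = is_watched(
    "my_project.orders.run_pipeline",
    watched_modules=["my_project.orders"],
)
```

Under these assumed rules, `my_project.orders.run_pipeline` is watched via the module prefix, while a function in `my_project.orders_extra` would not be.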

## Architecture And Process Flow

sparktimise follows a hybrid pattern:

1. Functional core: analyser and optimiser functions.
2. Imperative shell: context-manager orchestration.
3. OOP boundaries: adapters, config, and reporting.

Detailed architecture and sequence diagrams are documented in [docs/architecture.md](docs/architecture.md).
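
As a rough illustration of the functional-core / imperative-shell split, the sketch below pairs a pure step function with a context manager that owns the side effects. `StepResult`, `drop_empty_rows`, and `pipeline` are hypothetical names for this example; only the `df` and `transformations_applied` attribute names are borrowed from the functional usage shown earlier, and plain lists stand in for DataFrames.

```python
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass(frozen=True)
class StepResult:
    """Hypothetical result type: the transformed data plus a log of what was done."""

    df: list
    transformations_applied: tuple = ()


def drop_empty_rows(rows):
    """Functional core: a pure function that returns a new result and mutates nothing."""
    kept = [row for row in rows if row]
    applied = ("drop_empty_rows",) if len(kept) != len(rows) else ()
    return StepResult(df=kept, transformations_applied=applied)


@contextmanager
def pipeline(name, report):
    """Imperative shell: owns setup, teardown, and side effects such as reporting."""
    report.append(f"start {name}")
    try:
        yield
    finally:
        report.append(f"end {name}")


report = []
with pipeline("demo", report):
    result = drop_empty_rows([{"id": 1}, {}, {"id": 2}])
    report.extend(result.transformations_applied)
```

Keeping the step pure means it can be tested without a session or any setup, while the shell is the only place that records state.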

## Documentation Map

| Document | Purpose |
|---|---|
| [docs/README.md](docs/README.md) | Documentation index |
| [docs/usage.md](docs/usage.md) | End-to-end usage patterns and examples |
| [docs/configuration.md](docs/configuration.md) | Configuration variables and file formats |
| [docs/architecture.md](docs/architecture.md) | Internal design and process flow |
| [docs/troubleshooting.md](docs/troubleshooting.md) | Common setup/runtime/CI issues and fixes |
| [CHANGELOG.md](CHANGELOG.md) | Versioned release and change history |

## Development Setup

### Requirements

| Dependency | Purpose |
|---|---|
| Python 3.10–3.12 | Runtime and tooling |
| Java (JRE/JDK) | Required for Spark tests |
| PySpark | Required at runtime for Spark operations; installed separately |

### Local setup

```bash
python -m pip install -U pip
python -m pip install -e ".[dev]"
python -m pip install pyspark
```

### Quality checks

```bash
ruff check src/sparktimise/ tests/
ruff format --check src/sparktimise/ tests/
mypy src/sparktimise/
```

### Tests

```bash
# Unit tests (default)
python -m pytest tests/unit/ --tb=short -q

# Integration tests (requires Java + Spark)
python -m pytest tests/integration/ --run-spark --spark-smoke-timeout 60 --tb=short -q
```

### Build package artifacts

```bash
python -m pip install build
python -m build --sdist --wheel
```

Artifacts are created under `dist/`.

## License

MIT
