Metadata-Version: 2.4
Name: lakebench-k8s
Version: 1.0.0
Summary: Deploy and benchmark lakehouse stacks on Kubernetes
Project-URL: Homepage, https://github.com/PureStorage-OpenConnect/lakebench-k8s
Project-URL: Documentation, https://github.com/PureStorage-OpenConnect/lakebench-k8s/tree/main/docs
Project-URL: Repository, https://github.com/PureStorage-OpenConnect/lakebench-k8s
Author: Andrew Sillifant
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: benchmark,iceberg,kubernetes,lakehouse,spark
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Database
Classifier: Topic :: System :: Systems Administration
Requires-Python: >=3.10
Requires-Dist: boto3>=1.34.0
Requires-Dist: httpx>=0.27.0
Requires-Dist: jinja2>=3.1.0
Requires-Dist: kubernetes>=29.0.0
Requires-Dist: prometheus-client>=0.20.0
Requires-Dist: pydantic-settings>=2.1.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: pyyaml>=6.0
Requires-Dist: rich>=13.7.0
Requires-Dist: typer>=0.12.0
Provides-Extra: build
Requires-Dist: pyinstaller>=6.0.0; extra == 'build'
Provides-Extra: dev
Requires-Dist: moto>=5.0.0; extra == 'dev'
Requires-Dist: mypy>=1.8.0; extra == 'dev'
Requires-Dist: pre-commit>=3.0.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.15.0; extra == 'dev'
Requires-Dist: types-boto3>=1.34.0; extra == 'dev'
Requires-Dist: types-pyyaml>=6.0.0; extra == 'dev'
Description-Content-Type: text/markdown

# Lakebench

CLI tool for deploying and benchmarking lakehouse architectures on Kubernetes.

> **Note:** This package is published as `lakebench-k8s` on PyPI. Install with `pip install lakebench-k8s`. The CLI command is `lakebench`.

Choosing between Hive and Polaris, Iceberg and Delta, or sizing Spark for 100 GB
vs 10 TB shouldn't require weeks of manual setup. Lakebench deploys a complete
lakehouse stack from a single YAML file, generates realistic data at any scale,
runs the pipeline, benchmarks query performance, and tears everything down, so
you can focus on comparing architectures, not plumbing.

## Installation

```bash
pip install lakebench-k8s
```

Or with [pipx](https://pipx.pypa.io/): `pipx install lakebench-k8s`

Pre-built binaries (no Python required) are available on
[GitHub Releases](https://github.com/PureStorage-OpenConnect/lakebench-k8s/releases).

### Prerequisites

- Python 3.10+
- `kubectl` and `helm` on PATH
- Kubernetes cluster (1.26+)
- S3-compatible object storage (FlashBlade, MinIO, AWS S3, etc.)
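
A quick preflight check for the local prerequisites might look like the sketch below. The function names `check_tools` and `check_python` are hypothetical helpers, not part of Lakebench, and the sketch covers only the tools on PATH and the Python version. It does not check the cluster version or the object storage; `lakebench validate <config>` is the command that tests connectivity.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helpers; not shipped with lakebench.

# Report every required tool that is missing from PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Succeed only if the python3 on PATH is version 3.10 or newer.
check_python() {
  python3 -c 'import sys; sys.exit(0 if sys.version_info >= (3, 10) else 1)'
}
```

After sourcing the file, `check_tools kubectl helm python3 && check_python && echo "prerequisites OK"` runs both checks. If you install a pre-built binary, the Python check can be skipped.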

## Quick Start

```bash
# 1. Generate config (interactive prompts for S3 details)
lakebench init --interactive

# 2. Deploy infrastructure
lakebench deploy lakebench.yaml

# 3. Generate test data
lakebench generate lakebench.yaml --wait

# 4. Run the pipeline + benchmark
lakebench run lakebench.yaml

# 5. View results
lakebench report

# 6. Tear down
lakebench destroy lakebench.yaml
```
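
For unattended runs, the same steps can be wrapped in a small script that always tears the deployment down, even when a step fails partway. This is a minimal sketch, not something Lakebench ships: `run_cycle` and the `LAKEBENCH` override variable are hypothetical names, and only commands shown in this README are used.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper; LAKEBENCH can point at a different binary path.
LAKEBENCH="${LAKEBENCH:-lakebench}"

# Run one full benchmark cycle for the given config file.
# The body is a subshell, so the EXIT trap and "set -e" stay local to it.
run_cycle() (
  set -euo pipefail
  config="$1"

  # Fail fast on a bad config before anything is deployed.
  "$LAKEBENCH" validate "$config"

  # From here on, always tear down, whether the steps succeed or fail.
  trap '"$LAKEBENCH" destroy "$config"' EXIT

  "$LAKEBENCH" deploy "$config"
  "$LAKEBENCH" generate "$config" --wait
  "$LAKEBENCH" run "$config"
  "$LAKEBENCH" report
)
```

Call it as `run_cycle lakebench.yaml`. The trap is registered only after validation succeeds, so a rejected config never triggers a teardown of resources that were never created.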

## Commands

| Command | Description |
|---------|-------------|
| `lakebench init` | Generate a starter configuration file |
| `lakebench validate <config>` | Validate config and test connectivity |
| `lakebench info <config>` | Show configuration summary |
| `lakebench recommend` | Recommend cluster sizing for a scale factor |
| `lakebench deploy <config>` | Deploy all infrastructure |
| `lakebench generate <config>` | Generate synthetic data to bronze bucket |
| `lakebench run <config>` | Execute the medallion pipeline with metrics |
| `lakebench benchmark <config>` | Run 8-query Trino benchmark |
| `lakebench query <config>` | Execute SQL queries against Trino |
| `lakebench status [config]` | Show deployment status |
| `lakebench logs <component> [config]` | Stream logs from a component |
| `lakebench report` | Generate HTML benchmark report |
| `lakebench destroy <config>` | Tear down all resources |

## How It Works

Lakebench deploys a three-layer stack on Kubernetes:

1. **Platform** -- Kubernetes namespace, S3 secrets, PostgreSQL (metadata store)
2. **Data architecture** -- catalog (Hive or Polaris), table format (Iceberg or Delta),
   query engine (Trino, Spark Thrift, or DuckDB), all wired together via
   [recipes](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/recipes.md)
3. **Observability** -- optional Prometheus + Grafana stack for platform metrics

Once deployed, the pipeline runs three Spark jobs in sequence (bronze, silver, gold), followed by the query benchmark:

```
Raw Parquet (S3)  -->  Bronze (validate, deduplicate)
                  -->  Silver (normalize, enrich -- Iceberg table)
                  -->  Gold (aggregate -- Iceberg table)
                  -->  Benchmark (8 analytical queries via query engine)
```

The benchmark produces an HTML report with query latencies, throughput scores,
and optional platform metrics (CPU, memory, S3 I/O per pod). See the
[Architecture](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/architecture.md)
doc for the full picture.

## Component Versions

| Component | Version |
|-----------|---------|
| Apache Spark | 3.5.4 |
| Spark Operator | 2.4.0 (Kubeflow) |
| Apache Iceberg | 1.10.1 |
| Delta Lake | 3.0.0 |
| Hive Metastore | 3.1.3 (Stackable 25.7.0) |
| Apache Polaris | 1.3.0-incubating |
| Trino | 479 |
| DuckDB | bundled (Python 3.11) |
| PostgreSQL | 17 |

All versions are configurable. See
[Supported Components](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/supported-components.md)
for the full matrix of components, recipes, and override options.

## Documentation

Full documentation is in the [docs/](https://github.com/PureStorage-OpenConnect/lakebench-k8s/tree/main/docs) directory:

- [Getting Started](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/getting-started.md) -- prerequisites, install, first deployment
- [Configuration](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/configuration.md) -- full YAML reference
- [CLI Reference](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/cli-reference.md) -- all commands and flags
- [Recipes](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/recipes.md) -- supported component combinations
- [Supported Components](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/supported-components.md) -- versions, images, and recipe matrix
- [Deployment](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/deployment.md) -- deploy lifecycle and status checks
- [Running Pipelines](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/running-pipelines.md) -- batch and streaming modes
- [Benchmarking](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/benchmarking.md) -- query suite and scoring
- [Architecture](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/architecture.md) -- system design and component layers
- [Troubleshooting](https://github.com/PureStorage-OpenConnect/lakebench-k8s/blob/main/docs/troubleshooting.md) -- common errors and fixes

## License

Apache 2.0
