Metadata-Version: 2.4
Name: databricks-tpcds
Version: 0.2.0
Summary: Run the TPC-DS benchmark on Databricks (Delta Lake).
Home-page: https://github.com/onehouseinc/onebench
Author: Onehouse
License: Apache-2.0
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.9
Description-Content-Type: text/markdown
Dynamic: author
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: requires-python
Dynamic: summary

## Running TPC-DS on Databricks
This document describes how to run the TPC-DS benchmark on Databricks. TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. It is intended as a representative evaluation of performance for a general-purpose decision support system. The benchmark is defined and maintained by the Transaction Processing Performance Council (TPC); "DS" stands for decision support.

### Pre-requisites
1. A Databricks workspace
2. A Databricks metastore configured for the workspace
3. A Databricks cluster (jobs or all-purpose)

## Install from PyPI
Install the package directly in a Databricks notebook:
```shell
%pip install databricks-tpcds
```

The package provides the `DatabricksTPCDS` class, which you drive from an entrypoint script such as
the Delta Lake example below.

## Delta Lake entrypoint example
Replace the placeholder values of `catalog_name`, `bucket_name`, `prefix`, and `schema_name` with your own,
then run the script on your Databricks cluster.

```python
from pyspark.sql import SparkSession
from databricks_tpcds.databricks_tpcds import DatabricksTPCDS


def main():
    catalog_name = 'my_catalog'
    bucket_name = 'my-bucket'
    prefix = 'path/to/tpcds-datasets/1TB'
    schema_name = 'my_schema'

    # Initialize Spark session
    spark = SparkSession.builder.appName("TPCDS Query Runner").getOrCreate()

    # Disable the Databricks disk cache; set to "true" to enable it
    spark.conf.set("spark.databricks.io.cache.enabled", "false")

    databricks_tpcds = DatabricksTPCDS(spark, schema_name=schema_name, catalog_name=catalog_name)

    # Create catalog
    databricks_tpcds.create_catalog()

    # Create schema
    databricks_tpcds.create_schema()

    # Create a single table, provide the table name
    # databricks_tpcds.create_table(bucket_name, prefix, "call_center")

    # Create multiple tables, provide the list of table names
    # databricks_tpcds.create_tables(bucket_name, prefix, ["call_center", "catalog_page"])

    # Create all tables from the data under the given bucket and prefix
    databricks_tpcds.create_all_tables(bucket_name, prefix)

    # Run the full query set three times, printing per-query timings as CSV after each run
    for _ in range(3):
        time_taken_by_queries = databricks_tpcds.run_all_queries(should_warmup=False)
        print("QUERY_NUMBER,TIME_TAKEN")
        for query_no, time_taken in time_taken_by_queries.items():
            print(f"{query_no},{time_taken}")


if __name__ == "__main__":
    main()
```
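Because the example runs the query set three times, it can be convenient to collapse the runs into one row per query. The following is a minimal sketch, not part of the package: `summarize_runs` is a hypothetical helper, and it assumes that each value returned by `run_all_queries` is a dict of query number to a numeric elapsed time, as the loop above suggests. The timings shown are illustrative placeholders, not measured results.

```python
from statistics import median


def summarize_runs(runs):
    """Return {query_no: (min, median, max)} across several runs.

    `runs` is a list of dicts shaped like the value printed in the
    entrypoint example: query number -> elapsed time (assumed numeric).
    """
    per_query = {}
    for run in runs:
        for query_no, time_taken in run.items():
            per_query.setdefault(query_no, []).append(float(time_taken))
    return {q: (min(ts), median(ts), max(ts)) for q, ts in per_query.items()}


# Illustrative timings only; in the entrypoint these would be the three
# dicts collected from databricks_tpcds.run_all_queries(...).
runs = [
    {"q1": 4.0, "q2": 10.0},
    {"q1": 3.0, "q2": 12.0},
    {"q1": 5.0, "q2": 11.0},
]
summary = summarize_runs(runs)

print("QUERY_NUMBER,MIN,MEDIAN,MAX")
for query_no, (fastest, typical, slowest) in summary.items():
    print(f"{query_no},{fastest},{typical},{slowest}")
```

To use it in the entrypoint, append each `time_taken_by_queries` dict to a list inside the loop and call `summarize_runs` once after the loop finishes. The median is used here because it is less sensitive to a single slow run than the mean.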

## Developing locally
1. Modify the code in `src/databricks_tpcds/databricks_tpcds.py` if necessary.
2. Review or modify the queries in `src/resources/queries/`.
3. Build the package (this uses the PyPA `build` tool, which can be installed with `pip install build`):
```shell
cd tpcds/databricks
python3.10 -m build
```
4. Upload the built `.whl` to your Databricks workspace and install it in a notebook:
```shell
%pip install path/to/databricks_tpcds-0.2.0-py3-none-any.whl --force-reinstall
```
5. Run the benchmark using the Delta Lake entrypoint example above.
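After installing a locally built wheel in step 4, one way to confirm which version the notebook actually picked up is to query the installed distribution metadata. This is a small sketch using only the standard library; `installed_version` is a hypothetical helper name, and the distribution name `databricks-tpcds` is taken from the package metadata.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="databricks-tpcds"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Prints the installed version, or None when the package is not installed.
print(installed_version())
```

If the printed version does not match the one in the wheel file name, the earlier installation is probably still in use and the install step may need to be repeated.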
