Metadata-Version: 2.4
Name: spark-rapids-user-tools
Version: 26.4.0
Summary: A simple wrapper process around cloud service providers to run tools for the RAPIDS Accelerator for Apache Spark.
Author-email: NVIDIA Corporation <spark-rapids-support@nvidia.com>
License-Expression: Apache-2.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: <3.13,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=2.2.6
Requires-Dist: chevron==0.14.0
Requires-Dist: fastprogress==1.0.3
Requires-Dist: fastcore==1.7.28
Requires-Dist: fire==0.7.1
Requires-Dist: pandas==2.3.0
Requires-Dist: pyYAML>=6.0.2
Requires-Dist: pyaml-env==1.2.2
Requires-Dist: tabulate==0.9.0
Requires-Dist: importlib-resources==6.5.2
Requires-Dist: requests==2.33.0
Requires-Dist: packaging==25.0
Requires-Dist: certifi==2025.6.15
Requires-Dist: urllib3==2.6.3
Requires-Dist: pygments==2.20.0
Requires-Dist: pydantic==2.11.7
Requires-Dist: pylint-pydantic==0.3.5
Requires-Dist: pyarrow==20.0.0
Requires-Dist: azure-storage-blob==12.25.1
Requires-Dist: adlfs==2024.12.0
Requires-Dist: progress==1.6.1
Requires-Dist: xgboost==3.0.2
Requires-Dist: shap==0.48.0
Requires-Dist: scikit-learn==1.7.0
Requires-Dist: psutil==7.0.0
Requires-Dist: zstandard==0.25.0
Requires-Dist: pyspark<4.0.0,>=3.5.7
Requires-Dist: jproperties==2.1.2
Requires-Dist: optuna==4.4.0
Requires-Dist: optuna-integration==4.4.0
Provides-Extra: test
Requires-Dist: tox; extra == "test"
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
Requires-Dist: cli_test_helpers; extra == "test"
Requires-Dist: behave; extra == "test"
Requires-Dist: flake8; extra == "test"
Requires-Dist: flake8-pydantic; extra == "test"
Requires-Dist: pylint==3.3.7; extra == "test"
Provides-Extra: qualx
Requires-Dist: holoviews; extra == "qualx"
Requires-Dist: matplotlib; extra == "qualx"
Requires-Dist: seaborn; extra == "qualx"
Provides-Extra: dev-env
Requires-Dist: tox; extra == "dev-env"
Requires-Dist: pytest; extra == "dev-env"
Requires-Dist: pytest-cov; extra == "dev-env"
Requires-Dist: cli_test_helpers; extra == "dev-env"
Requires-Dist: behave; extra == "dev-env"
Requires-Dist: flake8; extra == "dev-env"
Requires-Dist: flake8-pydantic; extra == "dev-env"
Requires-Dist: pylint==3.3.7; extra == "dev-env"
Requires-Dist: holoviews; extra == "dev-env"
Requires-Dist: matplotlib; extra == "dev-env"
Requires-Dist: seaborn; extra == "dev-env"
Dynamic: license-file

# spark-rapids-user-tools

User tools to help with the adoption, installation, execution, and tuning of RAPIDS Accelerator for Apache Spark.

The wrapper improves the end-user experience along the following dimensions:
1. **Qualification**: Educate CPU customers on the cost savings and acceleration potential of the RAPIDS Accelerator for
   Apache Spark. The output shows a list of apps recommended for the RAPIDS Accelerator for Apache Spark, with estimated savings
   and speedup.
2. **Tuning**: Tune RAPIDS Accelerator for Apache Spark configs based on an initial job run, leveraging Spark event logs. The output
   shows recommended per-app RAPIDS Accelerator for Apache Spark config settings.
3. **Diagnostics**: Run diagnostic functions to validate a Dataproc environment with the RAPIDS Accelerator for Apache Spark, to
   make sure the cluster is healthy and ready for Spark jobs.
4. **Prediction**: Predict the speedup of running a Spark application with Spark RAPIDS on GPUs.
5. **Train**: Train a model to predict the performance of a Spark job on the RAPIDS Accelerator for Apache Spark. The output is
   a model file that can be used to predict the performance of a Spark job.
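
The tools above are exposed through the `spark_rapids` CLI once the package is installed. As an illustrative sketch (the event-log path is a placeholder, and the platform value depends on your environment; see the usage guide referenced later in this document for the full option set):

```sh
# Analyze CPU Spark event logs to identify apps that are good candidates
# for GPU acceleration. Replace the path and platform with your own values.
$ spark_rapids qualification --eventlogs /path/to/eventlogs --platform onprem
```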


## Getting started

Set up a Python environment with a Python version between 3.10 and 3.12 (inclusive).

1. Run the project in a virtual environment. Note that `.venv` is the directory where the virtual
   environment is created; change the path if you prefer a different location.
    ```sh
    $ python -m venv .venv
    $ source .venv/bin/activate
    ```
2. Install spark-rapids-user-tools
    - Using released package.

      ```sh
      $ pip install spark-rapids-user-tools
      ```
    - Install from source.

      ```sh
      $ pip install -e .
      ```

      Note:
      - To install dependencies required for running unit tests, use the optional `test` parameter: `pip install -e '.[test]'`
      - To install dependencies required for QualX training, use the optional `qualx` parameter: `pip install -e '.[qualx]'`
      - To install all of the above dependencies, use the optional `dev-env` parameter: `pip install -e '.[dev-env]'`

    - Using a wheel package built from the repo (see the build steps below).

      ```sh
      $ pip install <wheel-file>
      ```

3. Make sure to install the SDK for your cloud service provider (CSP) if you plan to run the tool wrapper.

## Building from source

Set up a Python environment as described in the steps above.

1. Create a virtual environment. Note that `.venv` is the directory where the virtual
   environment is created; change the path if you prefer a different location.
    ```sh
    $ python -m venv .venv
    $ source .venv/bin/activate
    ```

2. Run the provided build script to compile the project.

   ```sh
   $ ./build.sh
   ```

3. **Fat Mode:** Similar to a `fat jar` in Java, this mode addresses environments where web access is not
   available to download resources from URL paths (http/https).
   The command builds the tools jar file, downloads the necessary dependencies, and packages them
   with the source code into a single wheel file.

   ```sh
   $ ./build.sh fat
   ```

## Logging Configuration

The core tools project uses Log4j for logging; the default log level is INFO.
You can configure logging settings in the `log4j.properties` file located in the
`src/spark_rapids_pytools/resources/dev/` directory. This applies when
you clone the project and build it from source.
To change the logging level, modify the `log4j.rootLogger` property.
Possible levels include `DEBUG`, `INFO`, `WARN`, and `ERROR`.
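
For example, a minimal sketch of such a change (this assumes the file defines a console appender named `console`; match the appender name actually used in your copy of the file):

```properties
# src/spark_rapids_pytools/resources/dev/log4j.properties
# Raise the root logger from INFO to DEBUG; "console" is assumed to be
# the name of an appender already defined in the file.
log4j.rootLogger=DEBUG, console
```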

## Usage and supported platforms

Please refer to the [spark-rapids-user-tools guide](https://github.com/NVIDIA/spark-rapids-tools/blob/main/user_tools/docs/index.md) for details on how to use the tools
on the supported platforms.

Please refer to the [qualx guide](https://github.com/NVIDIA/spark-rapids-tools/blob/main/user_tools/docs/qualx.md) for details on how to use the QualX tool for prediction and training.

## What's new

Please refer to [CHANGELOG.md](https://github.com/NVIDIA/spark-rapids-tools/blob/main/CHANGELOG.md) for our latest changes.
