Metadata-Version: 2.4
Name: databricks-job-runner
Version: 0.4.1
Summary: Reusable CLI for uploading, submitting, validating, fetching logs, and cleaning Databricks job runs
Author: Ryan Knight
Author-email: Ryan Knight <ryan.knight@neo4j.com>
License-Expression: MIT
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Typing :: Typed
Requires-Dist: databricks-sdk
Requires-Dist: pydantic>=2
Requires-Python: >=3.12
Project-URL: Repository, https://github.com/neo4j-partners/databricks-job-runner
Description-Content-Type: text/markdown

# databricks-job-runner

Reusable CLI for uploading, submitting, validating, fetching logs, and cleaning Databricks job runs.

Wraps the [Databricks Python SDK](https://docs.databricks.com/dev-tools/sdk-python.html) into a small library that each project configures with a `Runner` instance. One `Runner` gives you eight CLI subcommands — `upload`, `submit`, `validate`, `logs`, `clean`, `catalog`, `schema`, and `volume` — without writing any Databricks API code in your project.

## Installation

```bash
uv add databricks-job-runner
```

Or with pip:

```bash
pip install databricks-job-runner
```

For local development against a checkout:

```toml
# pyproject.toml
[tool.uv.sources]
databricks-job-runner = { path = "../databricks-job-runner", editable = true }
```

> **Warning — do not list `databricks-job-runner` as a core dependency.**
>
> `databricks-job-runner` is a **local-only CLI tool** — it is not published to PyPI. If you add it to your project's `[project.dependencies]` (core dependencies), any wheel you build from that project will declare it as a requirement. When Databricks serverless (or any remote environment) tries to install your wheel, pip will fail because it cannot resolve `databricks-job-runner`.
>
> Instead, put it in an **optional extras group** so it is only installed locally:
>
> ```toml
> [project.optional-dependencies]
> cli = ["databricks-job-runner"]
> ```
>
> Then install locally with `uv sync --extra cli` (or `pip install -e '.[cli]'`). Your submitted scripts (e.g. `run_my_package.py`) should never import `databricks_job_runner` — they run on Databricks where it is not available.

## Quick start

Create a `cli/` package in your project with two files:

**`cli/__init__.py`**

```python
from databricks_job_runner import Runner

runner = Runner(
    run_name_prefix="my_project",
    wheel_package="my_package",  # optional
)
```

**`cli/__main__.py`**

```python
from cli import runner
runner.main()
```

Then run from **your project's** root (not from the `databricks-job-runner` repo — this is a library, not a standalone CLI):

```bash
uv run python -m cli upload --all          # upload agent_modules/*.py
uv run python -m cli upload test_hello.py  # upload a single file
uv run python -m cli upload --wheel        # build and upload wheel
uv run python -m cli submit test_hello.py  # submit a job and wait
uv run python -m cli submit test_hello.py --no-wait
uv run python -m cli validate              # list remote workspace contents
uv run python -m cli validate test_hello.py  # verify a specific file is uploaded
uv run python -m cli logs                  # stdout/stderr from the most recent run
uv run python -m cli logs 12345            # stdout/stderr from a specific run
uv run python -m cli clean --yes           # clean workspace + runs
uv run python -m cli clean --runs --yes    # clean only runs

# Unity Catalog management
uv run python -m cli catalog list
uv run python -m cli catalog get my_catalog              # show storage location
uv run python -m cli catalog create my_catalog --comment "Analytics"
uv run python -m cli catalog create my_catalog --storage-root "abfss://container@account.dfs.core.windows.net/path"
uv run python -m cli catalog delete my_catalog --force --yes

uv run python -m cli schema list my_catalog
uv run python -m cli schema create my_catalog.my_schema
uv run python -m cli schema delete my_catalog.my_schema --yes

uv run python -m cli volume list my_catalog.my_schema
uv run python -m cli volume create my_catalog.my_schema.my_vol
uv run python -m cli volume create my_catalog.my_schema.ext_vol --volume-type EXTERNAL --storage-location s3://bucket/path
uv run python -m cli volume delete my_catalog.my_schema.my_vol --yes
```

## Configuration

The runner reads a `.env` file from the project root. Core keys (all prefixed with `DATABRICKS_` for consistency):

| Key | Default | Required | Description |
|-----|---------|----------|-------------|
| `DATABRICKS_PROFILE` | — | no | CLI profile in `~/.databrickscfg`. When unset, the SDK's unified auth falls back to env vars (`DATABRICKS_HOST`/`DATABRICKS_TOKEN`), Azure CLI, service principals, etc. |
| `DATABRICKS_COMPUTE_MODE` | `cluster` | no | `cluster` or `serverless`. Selects the compute backend for submitted jobs. |
| `DATABRICKS_CLUSTER_ID` | — | when `DATABRICKS_COMPUTE_MODE=cluster` | All-purpose cluster to run jobs on. Started automatically if terminated. |
| `DATABRICKS_SERVERLESS_ENV_VERSION` | `3` | no | Serverless environment version (e.g. `3` for Python 3.12). |
| `DATABRICKS_WORKSPACE_DIR` | — | yes | Remote workspace path (e.g. `/Users/you/my_project`). |
| `DATABRICKS_VOLUME_PATH` | — | when using `upload --wheel` | UC Volume path for wheel uploads. |

**Precedence:** pre-existing environment variables override `.env` values, matching 12-factor conventions (CI/CD and shell exports can override the file).

Additional non-core keys are captured in `RunnerConfig.extras` and automatically passed to submitted jobs as `KEY=VALUE` parameters. Scripts call `inject_params()` at startup to load them into `os.environ`, then use pydantic `BaseSettings` to read configuration.
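
For illustration, a submitted script might consume the injected extras roughly like this. This is a sketch, not part of the package: it assumes `pydantic-settings` is installed in the job environment and that the `.env` carries the `NEO4J_*` keys shown in the example further down.

```python
import os
import sys

from pydantic_settings import BaseSettings

# Inline equivalent of inject_params(): copy KEY=VALUE job parameters into os.environ.
for arg in sys.argv[1:]:
    if "=" in arg and not arg.startswith("-"):
        key, _, value = arg.partition("=")
        os.environ.setdefault(key, value)


class Settings(BaseSettings):
    neo4j_uri: str       # filled from NEO4J_URI
    neo4j_password: str  # filled from NEO4J_PASSWORD


settings = Settings()
```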

### Compute modes

- **Classic cluster** (`DATABRICKS_COMPUTE_MODE=cluster`, the default): jobs submit to an existing all-purpose cluster identified by `DATABRICKS_CLUSTER_ID`. The runner auto-starts the cluster if it is terminated, and attaches wheels via `Library(whl=...)`.
- **Serverless** (`DATABRICKS_COMPUTE_MODE=serverless`): jobs submit to Databricks serverless compute with a job-level environment spec. No cluster ID needed; wheels attach as `Environment.dependencies` entries (UC Volume paths are supported directly).

### Example `.env` (classic cluster)

```
DATABRICKS_PROFILE=my-profile
DATABRICKS_CLUSTER_ID=0123-456789-abcdef
DATABRICKS_WORKSPACE_DIR=/Users/ryan.knight@example.com/my_project
DATABRICKS_VOLUME_PATH=/Volumes/catalog/schema/volume
NEO4J_URI=neo4j+s://abc123.databases.neo4j.io
NEO4J_PASSWORD=secret
```

### Example `.env` (serverless)

```
DATABRICKS_PROFILE=my-profile
DATABRICKS_COMPUTE_MODE=serverless
DATABRICKS_SERVERLESS_ENV_VERSION=3
DATABRICKS_WORKSPACE_DIR=/Users/ryan.knight@example.com/my_project
DATABRICKS_VOLUME_PATH=/Volumes/catalog/schema/volume
```

All `DATABRICKS_*` keys listed above become typed fields on `RunnerConfig`; any other keys (like `NEO4J_URI` above) go into `config.extras`.

## API

### `Runner`

```python
Runner(
    run_name_prefix: str,
    project_dir: Path | str | None = None,
    wheel_package: str | None = None,
)
```

| Parameter | Description |
|-----------|-------------|
| `run_name_prefix` | Prefix for job run names and cleanup filtering |
| `project_dir` | Project root (defaults to `cwd()`). Must contain `.env` and `agent_modules/` |
| `wheel_package` | Package name for wheel builds. Enables `upload --wheel`. Wheels upload to `<DATABRICKS_VOLUME_PATH>/wheels/` |

### `RunnerConfig`

Pydantic model holding parsed `.env` values. Frozen (immutable) after construction.

| Field | Type | Description |
|-------|------|-------------|
| `databricks_profile` | `str \| None` | CLI profile name, or `None` for unified-auth fallback |
| `databricks_compute_mode` | `Literal["cluster", "serverless"]` | Compute backend (`"cluster"` by default) |
| `databricks_cluster_id` | `str \| None` | Cluster ID (required when `databricks_compute_mode == "cluster"`) |
| `databricks_serverless_env_version` | `str` | Serverless environment version (default `"3"`) |
| `databricks_workspace_dir` | `str` | Remote workspace root (required) |
| `databricks_volume_path` | `str \| None` | UC Volume path for wheel uploads |
| `extras` | `dict[str, str]` | All non-core keys from `.env` |

The `env_params()` method returns `extras` (plus `DATABRICKS_VOLUME_PATH`) as `KEY=VALUE` strings suitable for job parameter injection.
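
As a rough illustration of that shape (plain Python, not the package's code), using the extras from the classic-cluster example `.env` above:

```python
# Hypothetical values matching the example .env in the Configuration section.
extras = {
    "NEO4J_URI": "neo4j+s://abc123.databases.neo4j.io",
    "NEO4J_PASSWORD": "secret",
}
volume_path = "/Volumes/catalog/schema/volume"

params = [f"{key}={value}" for key, value in extras.items()]
params.append(f"DATABRICKS_VOLUME_PATH={volume_path}")
# ['NEO4J_URI=neo4j+s://abc123.databases.neo4j.io',
#  'NEO4J_PASSWORD=secret',
#  'DATABRICKS_VOLUME_PATH=/Volumes/catalog/schema/volume']
```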

### `inject_params`

```python
from databricks_job_runner import inject_params
inject_params()
```

Call at the top of submitted scripts to parse `KEY=VALUE` parameters from `sys.argv` into `os.environ`. This lets scripts use pydantic `BaseSettings` or `os.getenv()` to read configuration that the runner injected from `.env`. Uses `setdefault` so pre-existing env vars take precedence.

> **Note:** `databricks_job_runner` is not available on the Databricks cluster. For standalone scripts (not part of a wheel), inline the equivalent logic instead of importing:
>
> ```python
> import os, sys
> for _arg in sys.argv[1:]:
>     if "=" in _arg and not _arg.startswith("-"):
>         _key, _, _value = _arg.partition("=")
>         os.environ.setdefault(_key, _value)
> ```
>
> For wheel-based scripts, the wheel's entry point can call `inject_params()` only if `databricks_job_runner` is listed as a wheel dependency — but since it is not published to PyPI, the inline approach is preferred.

### `RunnerError`

Raised when a runner operation cannot proceed (missing config, file not found, cluster stopped, job failed). The CLI formats and exits; library callers can catch and handle.
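
A minimal sketch of library-style handling (assumptions: `RunnerError` is importable from the package root and `Runner` exposes a method per subcommand, as described in the architecture notes; check the class for the exact names):

```python
from databricks_job_runner import Runner, RunnerError

runner = Runner(run_name_prefix="my_project")

try:
    # Hypothetical method name mirroring the `validate` subcommand.
    runner.validate()
except RunnerError as exc:
    print(f"Runner operation failed: {exc}")
    raise SystemExit(1)
```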

## Project layout

The runner expects this layout in your project:

```
my_project/
  .env
  agent_modules/
    test_hello.py
    run_lab2.py
    ...
  cli/
    __init__.py    # Runner config
    __main__.py    # entry point
```

Scripts in `agent_modules/` are uploaded to `{DATABRICKS_WORKSPACE_DIR}/agent_modules/` on Databricks and submitted as Spark Python tasks.

## Subcommands

### `upload`

- **`upload <file>`** — Upload a single file from `agent_modules/`
- **`upload --all`** — Upload all `*.py` files from `agent_modules/`
- **`upload --wheel`** — Build a wheel with `uv build` and upload to the UC Volume (requires `wheel_package` and `DATABRICKS_VOLUME_PATH`)

### `submit`

- **`submit <script>`** — Submit a script as a one-time Databricks job and wait for completion. Default: `test_hello.py`
- **`submit <script> --no-wait`** — Submit without waiting

In classic mode, if the target cluster is not already `RUNNING`, it is started automatically and the submit waits (up to 20 minutes, the SDK default) for it to reach `RUNNING`. On serverless, no warm-up step is required. When submitting a script named `run_{wheel_package}.py`, the runner automatically attaches the wheel — as a `Library(whl=...)` on classic, or as an `Environment.dependencies` entry on serverless.
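
For orientation, the underlying SDK call in classic mode looks roughly like the sketch below. It is illustrative only (placeholder paths and IDs); the runner's real code also handles cluster startup, parameter injection from `env_params()`, and the serverless environment spec.

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute, jobs

w = WorkspaceClient()  # unified auth: profile or DATABRICKS_HOST/DATABRICKS_TOKEN

task = jobs.SubmitTask(
    task_key="main",
    existing_cluster_id="0123-456789-abcdef",
    spark_python_task=jobs.SparkPythonTask(
        python_file="/Users/you/my_project/agent_modules/test_hello.py",
        source=jobs.Source.WORKSPACE,
        parameters=["NEO4J_URI=neo4j+s://abc123.databases.neo4j.io"],
    ),
    libraries=[compute.Library(whl="/Volumes/catalog/schema/volume/wheels/my_package-0.1.0-py3-none-any.whl")],
)

run = w.jobs.submit(run_name="my_project:test_hello.py", tasks=[task]).result()
print(run.state.result_state)
```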

### `validate`

- **`validate`** — List the remote workspace directory and its `agent_modules/` subdirectory. On classic, auto-starts the cluster if needed; on serverless, this is a no-op.
- **`validate <file>`** — Also verify that `{DATABRICKS_WORKSPACE_DIR}/agent_modules/<file>` exists; exits non-zero if not.

### `logs`

- **`logs`** — Print stdout/stderr, error, and trace from the most recent run matching `{run_name_prefix}:*`
- **`logs <run_id>`** — Print output for a specific parent run ID

Output is fetched via the Jobs API's `get_run_output`, which returns the **last 5 MB** of stdout/stderr captured per task (the API caps output size; truncation is signaled in the output). The runner resolves the parent run to its task-level run IDs automatically, so pass the parent `run_id` shown at submit time. Databricks auto-expires runs after 60 days.
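
A rough sketch of the SDK calls involved (not the runner's exact code):

```python
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

parent = w.jobs.get_run(run_id=12345)                # parent run id shown at submit time
for task in parent.tasks or []:
    out = w.jobs.get_run_output(run_id=task.run_id)  # task-level run id
    print(f"--- {task.task_key} ---")
    print(out.logs or "")                            # tail of captured stdout/stderr
    if out.error:
        print(out.error)
        print(out.error_trace or "")
```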

### `clean`

- **`clean`** — Delete the remote workspace directory and all matching job runs
- **`clean --workspace`** — Delete only the workspace directory
- **`clean --runs`** — Delete only job runs
- **`clean --yes`** — Skip confirmation prompt

### `catalog`

Manage Unity Catalog catalogs.

- **`catalog list`** — List all catalogs visible to the current user
- **`catalog get <name>`** — Show details for a catalog, including the **storage root** and **storage location** (managed location). Use this to find where a catalog's managed tables are stored.
- **`catalog create <name> [--storage-root URL] [--comment TEXT]`** — Create a new catalog. `--storage-root` sets the managed storage location (equivalent to `MANAGED LOCATION` in SQL). Required on metastores that use per-catalog storage roots instead of a metastore-level default.
- **`catalog delete <name> [--yes] [--force]`** — Delete a catalog. `--force` cascades to all schemas, tables, and volumes inside it. Prompts for confirmation unless `--yes` is given.

#### Finding the managed location

To find the managed storage location for an existing catalog:

```bash
uv run python -m cli catalog get my_catalog
```

This prints the **storage root** (the URL set at creation time) and the **storage location** (the full resolved path where managed tables are stored). Example output:

```
Catalog: my_catalog
  Owner:            ryan.knight@example.com
  Storage root:     abfss://container@account.dfs.core.windows.net/path
  Storage location: abfss://container@account.dfs.core.windows.net/path/__unitystorage/catalogs/abc123
  Comment:          Analytics catalog
  Created:          2025-01-15 10:30:00
```

### `schema`

Manage Unity Catalog schemas. Schema names use dotted notation: `catalog.schema`.

- **`schema list <catalog>`** — List schemas in a catalog
- **`schema get <catalog.schema>`** — Show details for a schema
- **`schema create <catalog.schema> [--comment TEXT]`** — Create a new schema
- **`schema delete <catalog.schema> [--yes]`** — Delete a schema. Prompts for confirmation unless `--yes` is given.

### `volume`

Manage Unity Catalog volumes. Volume names use dotted notation: `catalog.schema.volume`.

- **`volume list <catalog.schema>`** — List volumes in a schema
- **`volume get <catalog.schema.volume>`** — Show details (type, owner, storage location) for a volume
- **`volume create <catalog.schema.volume> [--volume-type MANAGED|EXTERNAL] [--storage-location URL] [--comment TEXT]`** — Create a volume. Defaults to `MANAGED`. `EXTERNAL` volumes require `--storage-location`.
- **`volume delete <catalog.schema.volume> [--yes]`** — Delete a volume. Prompts for confirmation unless `--yes` is given.

## Releasing

Releases are published to PyPI automatically when you push a Git tag. The version in the tag becomes the package version.

```bash
git tag v0.3.0
git push origin v0.3.0
```

The GitHub Actions workflow strips the `v` prefix, patches `pyproject.toml` with the version, builds the wheel and sdist, and publishes to PyPI via trusted publishing.

## Requirements

- Python 3.12+
- Databricks authentication: either a [Databricks CLI profile](https://docs.databricks.com/dev-tools/cli/index.html), or env vars (`DATABRICKS_HOST`/`DATABRICKS_TOKEN`), or any other [unified-auth](https://docs.databricks.com/dev-tools/auth/) method
- Either a Databricks all-purpose cluster (auto-started if terminated) or serverless compute enabled for the workspace
- [uv](https://docs.astral.sh/uv/) (for wheel building only)

## Architecture

`databricks-job-runner` is layered into a thin CLI, an orchestrator, and a set of single-purpose action modules. `Runner` is the only class that consuming projects need to touch.

```
cli.py          argparse + dispatch (flags -> Runner method calls)
  |
runner.py       Runner: holds config, owns the WorkspaceClient,
  |             exposes one method per subcommand
  |
  |-- config.py     RunnerConfig (frozen pydantic) + .env parser
  |-- compute.py    ClassicCluster / Serverless strategies (Protocol)
  |-- inject.py     inject_params() for submitted scripts
  |-- upload.py     workspace file + wheel upload
  |-- submit.py     compute-agnostic job submission
  |-- validate.py   workspace listing + file-existence checks
  |-- logs.py       per-task stdout/stderr retrieval
  |-- catalog.py    Unity Catalog catalog/schema/volume management
  |-- clean.py      workspace + run cleanup
  |-- errors.py     RunnerError
```

### Layers

- **CLI (`cli.py`)** owns all argparse setup and translates the parsed namespace into method calls on `Runner`. Formats `RunnerError` into friendly exit messages. No argparse knowledge lives outside this file.
- **Orchestration (`runner.py`)** exposes the `Runner` class. `RunnerConfig` and the `WorkspaceClient` are built lazily on first access, so importing a project's `cli/__init__.py` doesn't touch Databricks. Each public method coordinates a single subcommand end-to-end.
- **Action modules** (`upload.py`, `submit.py`, `validate.py`, `logs.py`, `clean.py`, `catalog.py`) are plain functions wrapping SDK calls. None know about argparse or `Runner`, keeping each unit composable and independently testable.
- **Compute strategies (`compute.py`)** implement the `Compute` protocol. A strategy knows how to (1) validate that its backend is ready, (2) decorate a `SubmitTask` with backend-specific fields, and (3) produce the top-level `environments[]` list for `jobs.submit`. `submit_job` is compute-agnostic — swapping backends is a strategy change, not a conditional branch.
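
A minimal sketch of that protocol shape (illustrative names only; the real definitions live in `compute.py` and may differ):

```python
from dataclasses import dataclass
from typing import Protocol

from databricks.sdk.service import jobs


class Compute(Protocol):
    def validate(self) -> None:
        """Check that the backend is ready (e.g. the cluster is running)."""
        ...

    def decorate(self, task: jobs.SubmitTask) -> jobs.SubmitTask:
        """Add backend-specific fields to the task."""
        ...

    def environments(self) -> list[jobs.JobEnvironment] | None:
        """Top-level environments[] passed to jobs.submit (serverless only)."""
        ...


@dataclass(frozen=True)
class ClassicCluster:
    cluster_id: str

    def validate(self) -> None:
        pass  # e.g. ensure the cluster is RUNNING, starting it if terminated

    def decorate(self, task: jobs.SubmitTask) -> jobs.SubmitTask:
        task.existing_cluster_id = self.cluster_id
        return task

    def environments(self) -> list[jobs.JobEnvironment] | None:
        return None
```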

### Design choices

- **Strategy pattern for compute.** `Compute` is a `typing.Protocol`, so adding a new backend is a new frozen dataclass that matches the shape — no changes to `submit_job`, `Runner`, or the CLI. `ClassicCluster` and `Serverless` are both frozen dataclasses for value-equality and immutability.
- **Single validation point.** Required-key enforcement lives entirely in `RunnerConfig.from_env_file`, branching on `DATABRICKS_COMPUTE_MODE` (only `DATABRICKS_CLUSTER_ID` is required when mode is `cluster`). Downstream code trusts the config is valid.
- **Automatic parameter injection.** All non-core `.env` keys are passed to submitted jobs as `KEY=VALUE` parameters via `RunnerConfig.env_params()`. Scripts call `inject_params()` at startup to load them into `os.environ`, then use pydantic `BaseSettings` to read configuration. No callback or per-project wiring needed.
- **Wheel convention.** A submitted script named exactly `run_{wheel_package}.py` auto-attaches the latest wheel from `dist/` — as `Library(whl=...)` on classic, or an `Environment.dependencies` entry on serverless. Ties `upload --wheel` and `submit run_xxx.py` together without adding a CLI flag.
- **12-factor `.env`.** Pre-existing env vars override `.env` values, so CI/CD exports and shell overrides trump the file — matching standard `.env` semantics.
