Metadata-Version: 2.4
Name: metaflow-skypilot
Version: 0.0.1
Summary: Skypilot extension for Metaflow
Author: Outerbounds
Author-email: help@outerbounds.co
License: Apache Software License
Classifier: Development Status :: 4 - Beta
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: skypilot>=0.11.1
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# SkyPilot extension for Metaflow

This extension adds support for executing steps in Metaflow flows on any cloud provider via [SkyPilot](https://skypilot.readthedocs.io/).

## Installation

```bash
pip install metaflow-skypilot
```

SkyPilot also requires cloud credentials to be configured. Follow the [SkyPilot setup guide](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html) for your cloud provider.
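
After configuring credentials, SkyPilot's own CLI can confirm which clouds are usable (output depends on your account and installed cloud SDKs):

```shell
# Report which cloud providers SkyPilot can currently access
sky check
```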

## Basic Usage

Add the `@skypilot` decorator to any step you want to run on the cloud:

```python
from metaflow import FlowSpec, step, skypilot

class MyFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    @skypilot(cpus='2+', memory='8+')
    @step
    def train(self):
        # This step runs on cloud via SkyPilot
        print("Training on the cloud!")
        self.next(self.end)

    @step
    def end(self):
        print("Done!")

if __name__ == '__main__':
    MyFlow()
```
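
Assuming the flow above is saved as `my_flow.py`, it runs with the standard Metaflow command; only the `@skypilot`-decorated step is shipped to the cloud, while the other steps run locally:

```shell
# Run the flow; the train step is provisioned and executed via SkyPilot
python my_flow.py run
```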

## Resource Configuration

The `@skypilot` decorator accepts all `sky.Resources` parameters directly. An example using a few of them:

```python
@skypilot(
    infra='aws',
    cpus='4+',
    memory='16+',
    accelerators='A100:1',
)
@step
def gpu_step(self):
    ...
```
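
Accelerator strings follow SkyPilot's `NAME:COUNT` format (e.g. `A100:1` above). The accelerator names SkyPilot recognizes can be listed with its CLI:

```shell
# List GPU/TPU names, counts, and where they are available
sky show-gpus
```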

## Reusing a Named Cluster

By default, each run provisions a fresh ephemeral cluster that is torn down after the job finishes. To reuse a persistent cluster across runs, pass a `cluster_name`:

```python
@skypilot(
    cpus='2+',
    cluster_name='my-persistent-cluster',
)
@step
def my_step(self):
    ...
```

- **Without `cluster_name`**: a new cluster is provisioned, runs the job, and is terminated after 10 idle minutes.
- **With `cluster_name`**: the cluster is reused across runs (auto-started if stopped). It stops automatically after 10 idle minutes but is not terminated — it will be restarted on the next run.

Each task always runs in an isolated working directory (`~/metaflow/assets/<job_name>/`) regardless of cluster type, so there are no filesystem clashes when reusing a cluster.

## Using with `@pypi`

Use `@pypi` to install Python dependencies on the remote VM:

```python
@skypilot(cpus='2+')
@pypi(python='3.9', packages={'numpy': '1.24.0', 'pandas': '2.0.0'})
@step
def my_step(self):
    import numpy as np
    ...
```

## Supplying Credentials

Cloud credentials for accessing the Metaflow datastore (e.g. S3) can be supplied in three ways:

- **Instance IAM role / cloud identity**: if the provisioned resource has access to the datastore via its cloud identity, no extra configuration is needed.
- **Environment variables** via the `@environment` decorator:

```python
@environment(vars={
    "AWS_ACCESS_KEY_ID": "XXXX",
    "AWS_SECRET_ACCESS_KEY": "YYYY"
})
@skypilot(cpus='2+')
@step
def my_step(self):
    ...
```

- **Secrets manager** via the `@secrets` decorator.
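
As a sketch, assuming a secret named `my-aws-creds` (a hypothetical name) whose keys match the AWS environment variable names, `@secrets` injects those keys as environment variables before the step runs:

```python
@secrets(sources=['my-aws-creds'])  # 'my-aws-creds' is a hypothetical secret name
@skypilot(cpus='2+')
@step
def my_step(self):
    ...
```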

If you are on the [Outerbounds](https://outerbounds.com/) platform, authentication is handled automatically.

## Things to Note

- `@skypilot` cannot be combined with `@kubernetes`, `@batch`, or `@slurm` on the same step.
- `@parallel` is not supported with `@skypilot`.
- The minimum step timeout is 60 seconds.

### Fin.
