Metadata-Version: 2.4
Name: metaflow-observability
Version: 0.1.0
Summary: Automatic observability for Metaflow — step duration, CPU, memory, disk, and GPU metrics via OpenTelemetry
License: Apache-2.0
License-File: LICENSE
Requires-Python: >=3.9
Requires-Dist: metaflow>=2.9
Requires-Dist: opentelemetry-exporter-prometheus>=0.45b0
Requires-Dist: opentelemetry-sdk>=1.24
Requires-Dist: psutil>=5.9
Provides-Extra: dev
Requires-Dist: mypy>=1.10; extra == 'dev'
Requires-Dist: opentelemetry-sdk>=1.24; extra == 'dev'
Requires-Dist: pytest-cov; extra == 'dev'
Requires-Dist: pytest>=7; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Requires-Dist: types-psutil; extra == 'dev'
Provides-Extra: gpu
Requires-Dist: pynvml>=11.0.0; extra == 'gpu'
Description-Content-Type: text/markdown

# metaflow-observability

[![CI](https://github.com/npow/metaflow-observability/actions/workflows/ci.yml/badge.svg)](https://github.com/npow/metaflow-observability/actions/workflows/ci.yml)
[![PyPI](https://img.shields.io/pypi/v/metaflow-observability)](https://pypi.org/project/metaflow-observability/)
[![License: Apache-2.0](https://img.shields.io/badge/License-Apache--2.0-blue.svg)](LICENSE)
[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/)
[![Docs](https://img.shields.io/badge/docs-mintlify-18a34a?style=flat-square)](https://mintlify.com/npow/metaflow-observability)

Get production metrics for every Metaflow step — without changing your flow code.

## The problem

When a Metaflow pipeline slows down or crashes in production, you have no time-series data to tell you whether it was CPU saturation, a memory spike, a disk bottleneck, or a GPU stall. You're left digging through logs after the fact. Metaflow's built-in tooling gives you per-run artifacts and cards, but nothing you can alert on or trend over time.

## Quick start

```bash
pip install metaflow-observability
```

```python
from metaflow import FlowSpec, step
from metaflow.decorators import observability

class MyFlow(FlowSpec):

    @step
    def start(self):  # Metaflow requires every flow to begin with a `start` step
        self.next(self.train)

    @observability
    @step
    def train(self):
        ...  # your code — metrics collected automatically
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MyFlow()
```

Metrics are exported via OpenTelemetry. Point them at an OpenTelemetry Collector (feeding Prometheus + Grafana) with:

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
python flow.py run
```

## Install

```bash
# Core (CPU, memory, disk, duration)
pip install metaflow-observability

# With GPU support (NVIDIA only, requires CUDA drivers)
pip install "metaflow-observability[gpu]"
```

## Usage

**Zero-config with Prometheus**

Add `@observability` to any step. By default, metrics are exposed on a Prometheus scrape endpoint on port 8000.

```python
@observability
@step
def preprocess(self):
    ...
```

**Custom OTel backend**

Use any OpenTelemetry-compatible backend (Grafana Cloud, Datadog, Honeycomb, etc.) via standard OTel environment variables:

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=https://otlp.example.com
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <token>"
```

**GPU metrics**

Install the GPU extra and run on a CUDA-enabled machine — GPU utilization and memory are collected automatically per device, tagged with `gpu_index`.

```bash
pip install "metaflow-observability[gpu]"
```
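Under the hood, per-device readings of this kind come from NVML. The sketch below shows the shape of such a sampler using the standard `pynvml` API; the function name `sample_gpus` and the exact wiring are illustrative, not this package's internals. It degrades gracefully, returning an empty list on machines without an NVIDIA driver.

```python
def sample_gpus():
    """Illustrative per-device GPU sampling via pynvml (NVML bindings).
    Returns [] when no NVIDIA driver/NVML library is available."""
    try:
        import pynvml
        pynvml.nvmlInit()
    except Exception:
        return []  # CPU-only machine or missing driver
    samples = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        samples.append({
            "gpu_index": i,                     # tag attached to GPU metrics
            "utilization": util.gpu,            # percent
            "memory_used_mb": mem.used / 2**20, # NVML reports bytes
        })
    pynvml.nvmlShutdown()
    return samples
```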

## How it works

`@observability` wraps `task_pre_step` / `task_post_step` / `task_exception` hooks in Metaflow's decorator API. Before your step code runs, it starts background threads that sample CPU%, RSS memory, disk I/O throughput, and (optionally) GPU utilization at 1-second intervals. When the step finishes, samples are aggregated and exported as OpenTelemetry instruments:

| Metric | Instrument | Tags |
|---|---|---|
| `step.duration` | Histogram (seconds) | `step`, `flow`, `run_id`, `retry` |
| `step.cpu.pct` | Gauge (avg / max / p95) | same |
| `step.memory.mb` | Gauge (avg / max RSS) | same |
| `step.disk.read_bytes` | Counter | same |
| `step.disk.write_bytes` | Counter | same |
| `step.disk.read_throughput` | Gauge (MB/s) | same |
| `step.disk.write_throughput` | Gauge (MB/s) | same |
| `step.gpu.utilization` | Gauge | + `gpu_index` |
| `step.gpu.memory.used_mb` | Gauge | + `gpu_index` |
| `step.retries` | Gauge | same |
| `step.failures` | Counter | same |
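The background-sampling strategy above can be sketched as a small stop-event loop. The `StepSampler` name and pluggable `read_sample` callable here are illustrative (in the real extension the reading would come from e.g. `psutil.Process().cpu_percent()`, and aggregates would be exported via OTel instruments rather than returned as a dict):

```python
import threading

class StepSampler:
    """Sketch: a daemon thread calls `read_sample` every `interval` seconds
    while the step body runs; aggregates are computed when the step ends."""

    def __init__(self, read_sample, interval=1.0):
        self.read_sample = read_sample
        self.interval = interval
        self.samples = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        while not self._stop.is_set():
            self.samples.append(self.read_sample())
            self._stop.wait(self.interval)  # interruptible sleep

    def __enter__(self):          # start sampling (task_pre_step analogue)
        self._thread.start()
        return self

    def __exit__(self, *exc):     # stop and join (task_post_step analogue)
        self._stop.set()
        self._thread.join()

    def aggregates(self):
        if not self.samples:
            return {"avg": 0.0, "max": 0.0}
        return {"avg": sum(self.samples) / len(self.samples),
                "max": max(self.samples)}
```

Using it as a context manager mirrors the pre/post-step hook lifecycle: sampling starts when the block is entered and stops, with aggregation, when the step body finishes or raises.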

## Configuration

All configuration is via standard OpenTelemetry environment variables. No extension-specific config needed.

| Variable | Purpose |
|---|---|
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP endpoint for traces and metrics |
| `OTEL_EXPORTER_OTLP_HEADERS` | Auth headers (e.g., `Authorization=Bearer ...`) |
| `OTEL_SERVICE_NAME` | Service name tag on all metrics |

If `OTEL_EXPORTER_OTLP_ENDPOINT` is unset, metrics are printed to stdout via the OTel console exporter (useful for local debugging).

## Development

```bash
git clone https://github.com/npow/metaflow-observability
cd metaflow-observability
pip install -e ".[dev]"

# Run tests
pytest

# Lint + format
ruff check src tests
ruff format src tests

# Type check
mypy
```

CI runs the full suite across Python 3.9, 3.10, 3.11, and 3.12 on every push.

## License

[Apache-2.0](LICENSE)
