Metadata-Version: 2.4
Name: rundial
Version: 1.0.0rc1
Summary: Rundial Python SDK (non-blocking ingest with bounded spool and ergonomic run API)
Author: Rundial
License-Expression: Apache-2.0
Project-URL: Documentation, https://github.com/rundial-dev/rundial/tree/main/docs
Project-URL: Issues, https://github.com/rundial-dev/rundial/issues
Project-URL: Source, https://github.com/rundial-dev/rundial
Keywords: machine-learning,experiments,observability,metrics,sdk
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: System :: Monitoring
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: integrations
Requires-Dist: keras<4,>=3; extra == "integrations"
Requires-Dist: lightning<3,>=2.2; extra == "integrations"
Requires-Dist: transformers<5,>=4.40; extra == "integrations"
Provides-Extra: test-integrations
Requires-Dist: keras<4,>=3; extra == "test-integrations"
Requires-Dist: lightning<3,>=2.2; extra == "test-integrations"
Requires-Dist: transformers<5,>=4.40; extra == "test-integrations"
Dynamic: license-file

# rundial

```bash
pip install rundial
```

Phase 4 introduces a non-blocking metrics transport with:

- bounded in-memory queue on the training thread
- background flush worker
- bounded disk spool (default enabled)
- gzip compression in worker transport (threshold-based)
- retry with exponential backoff + jitter
- diagnostics counters for dropped/accepted/retried points
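
The retry item in the list above can be illustrated with a minimal, standalone sketch of exponential backoff with full jitter. It is not Rundial's implementation: `backoff_delays`, `base_seconds`, and `cap_seconds` are hypothetical names and values chosen for illustration only.

```python
import random


def backoff_delays(attempts, base_seconds=0.5, cap_seconds=30.0, rng=None):
    """Yield one sleep duration per retry attempt (exponential backoff, full jitter).

    Hypothetical helper for illustration; not part of the rundial API.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        # The ceiling doubles on every attempt and is clamped to cap_seconds.
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling] so retrying clients spread out.
        yield rng.uniform(0.0, ceiling)


delays = list(backoff_delays(6, rng=random.Random(0)))
```

Each delay is bounded by its attempt's ceiling, so the worst-case wait stays predictable while simultaneous retries from many clients are less likely to line up.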

## CLI quickstart

Installing `rundial` also installs the `rundial` CLI:

```bash
rundial init --endpoint http://127.0.0.1:8787
rundial auth whoami
rundial target ls
rundial doctor
```

Operational commands:

```bash
rundial workspace ls
rundial project ls --workspace default-workspace

rundial run start --workspace default-workspace --project default-project --name baseline-001 --kind training
rundial run list --workspace default-workspace --project default-project --status running
rundial run status run_...
rundial run finish run_... --state completed

rundial metrics tail run_... --workspace default-workspace --project default-project --keys train/loss
rundial metrics export run_... \
  --workspace default-workspace \
  --project default-project \
  --keys train/loss \
  --format csv \
  --out metrics.csv
rundial logs tail run_... --workspace default-workspace --project default-project --min-level info
rundial logs export run_... \
  --workspace default-workspace \
  --project default-project \
  --format json \
  --out logs.json
```

All commands accept the global `--json` flag. Exit codes are stable: `0` for success, `1` for a
command or transport error, and `2` for an authentication/authorization failure.
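
Because the exit codes are stable, a wrapper script can branch on them. The sketch below assumes the `rundial` executable is on `PATH`; `describe_exit` and `run_cli` are hypothetical helpers, not part of the SDK.

```python
import subprocess

# Exit-code meanings as documented above.
EXIT_MEANINGS = {
    0: "success",
    1: "command or transport error",
    2: "authentication/authorization failure",
}


def describe_exit(code):
    """Map a rundial CLI exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, "undocumented exit code")


def run_cli(args):
    """Run the rundial CLI and return (exit_code, meaning, stdout)."""
    proc = subprocess.run(["rundial", *args], capture_output=True, text=True)
    return proc.returncode, describe_exit(proc.returncode), proc.stdout
```

For example, `code, meaning, out = run_cli(["doctor"])` lets a CI job distinguish an expired key (`2`) from a transport problem (`1`).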

To run the CLI operational smoke test with the open-core stack running:

```bash
RUNDIAL_API_KEY=rdk_... python user_tests/cli_operational_parity_smoke.py \
  --workspace default-workspace \
  --project default-project
```

Start a run with workspace/project strings only:

```bash
rundial run start \
  --endpoint http://127.0.0.1:8787 \
  --workspace default-workspace \
  --project default-project \
  --name baseline-001 \
  --kind training
```

Config precedence:

1. CLI flags
2. env vars (`RUNDIAL_API_KEY`, `RUNDIAL_ENDPOINT`, `RUNDIAL_WORKSPACE`, `RUNDIAL_PROJECT`)
3. `~/.config/rundial/config.toml`
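
The precedence order above can be sketched as a first-match lookup. This is an illustration only: `resolve_setting` is a hypothetical helper, the config-file key names are assumed to match the flag names, and treating empty values as unset is also an assumption.

```python
def resolve_setting(name, cli_flags, env, file_config):
    """Return the first value found: CLI flag, then env var, then config file."""
    env_name = "RUNDIAL_" + name.upper()  # e.g. "api_key" -> "RUNDIAL_API_KEY"
    for source, key in ((cli_flags, name), (env, env_name), (file_config, name)):
        value = source.get(key)
        if value:  # assumption: missing and empty values both count as unset
            return value
    return None


cli_flags = {"endpoint": "http://127.0.0.1:8787"}
env = {"RUNDIAL_ENDPOINT": "http://example.invalid", "RUNDIAL_WORKSPACE": "team-alpha"}
file_config = {"workspace": "default-workspace", "project": "default-project"}

resolved = {
    name: resolve_setting(name, cli_flags, env, file_config)
    for name in ("endpoint", "workspace", "project", "api_key")
}
```

Here the endpoint comes from the flag, the workspace from the environment, the project from the config file, and the API key stays unset.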

## Quick start (recommended)

```python
import rundial as rd

with rd.init(
    workspace="team-alpha",
    project="mnist-demo",
    name="baseline-001",
    kind="training",
    endpoint="http://127.0.0.1:8787",
    api_key="rdk_...",
    mode="online",
) as run:
    run.log({"train/loss": 0.42, "train/acc": 0.91}, step=1)
    run.log_metric("eval/loss", value=0.31, time_ms=1_760_000_000_000)
    run.log_text("starting eval loop", level="info")
    run.checkpoint("checkpoints/model.pt", step=1)

# `run.finish()` / `run.close()` finalize the run as `completed`.
# Use `run.fail(...)` or `run.abort(...)` for explicit terminal outcomes.
```

This slug-first mode resolves `workspace/project` to the canonical internal run target before the run starts.
Use `kind="agent"` or `kind="eval"` for agent and evaluation runs; the default is
`kind="training"`.

## Logs and console capture

`run.log_text(message, level="info")` shares the same bounded, non-blocking queue as metric
logging. Messages are capped at 8 KiB, truncated lines are flagged, and queue drops are visible
through `run.diagnostics()`.
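
As a rough illustration of the 8 KiB cap, the sketch below truncates a message and reports whether it was cut. It assumes the cap is measured in UTF-8 bytes, which the text above does not state; `truncate_message` is a hypothetical helper, not the SDK's code.

```python
MAX_MESSAGE_BYTES = 8 * 1024  # 8 KiB cap from the text above


def truncate_message(message, limit=MAX_MESSAGE_BYTES):
    """Return (text, was_truncated), cutting on a UTF-8 character boundary."""
    encoded = message.encode("utf-8")
    if len(encoded) <= limit:
        return message, False
    # errors="ignore" drops a partial multi-byte character left at the cut point.
    return encoded[:limit].decode("utf-8", errors="ignore"), True


short_text, short_flag = truncate_message("ok")
long_text, long_flag = truncate_message("a" * 10_000)
euro_text, euro_flag = truncate_message("€" * 3_000)  # 3 bytes per character
```

Cutting on a character boundary keeps the stored line valid UTF-8 even when the limit lands in the middle of a multi-byte character.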

```python
import rundial as rd

with rd.init(
    workspace="team-alpha",
    project="mnist-demo",
    name="logs-demo",
    endpoint="http://127.0.0.1:8787",
    api_key="rdk_...",
    capture_console=True,
) as run:
    print("stdout is mirrored into Rundial logs")
    run.log_text("manual warning", level="warn")
```

`capture_console=True` tees stdout as `info` and stderr as `error`. The caller still writes to
the original stream, and when the bounded queue is full Rundial drops and counts lines instead of
blocking the training process.
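
The tee-and-drop behavior can be sketched with the standard library. This is a simplified model, not the SDK's implementation: `TeeStream` is a hypothetical class, and an `io.StringIO` stands in for `sys.stdout`.

```python
import io
import queue


class TeeStream:
    """Forward writes to the original stream and mirror them into a bounded queue."""

    def __init__(self, original, sink, level):
        self.original = original
        self.sink = sink
        self.level = level
        self.dropped = 0

    def write(self, text):
        # The caller's output always reaches the original stream first.
        written = self.original.write(text)
        try:
            self.sink.put_nowait((self.level, text))
        except queue.Full:
            # Never block the caller: count the drop instead.
            self.dropped += 1
        return written

    def flush(self):
        self.original.flush()


sink = queue.Queue(maxsize=2)
original = io.StringIO()  # stands in for sys.stdout
tee = TeeStream(original, sink, level="info")
for line in ("one\n", "two\n", "three\n"):
    tee.write(line)
```

With a queue of size 2 and no consumer, the third line still reaches the original stream but is counted as dropped rather than mirrored.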

## Lightweight traces

Trace spans use the same non-blocking ingest worker and disk spool as metrics and logs.
Attributes and events are normalized in the worker; large prompt, completion, or tool-output
values above 16 KiB are uploaded through the artifact pipe and replaced on the span with a
small evidence reference.

```python
import rundial as rd

with rd.init(
    workspace="team-alpha",
    project="mnist-demo",
    name="agent-demo",
    kind="agent",
    endpoint="http://127.0.0.1:8787",
    api_key="rdk_...",
) as run:
    with run.trace("planner.step", attrs={"phase": "plan"}) as span:
        span.event("prompt.ready", {"tokens": 128})
        span.set_attrs({"model": "example-model"})

    run.tool_call("search", input={"q": "Ada Lovelace"}, output="large tool output...")
```
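
The offloading rule described in this section (values above 16 KiB are replaced by a small reference) can be sketched as follows. Everything here is illustrative: `offload_large_attrs`, the `upload` callback, and the `{"ref": ..., "bytes": ...}` reference shape are assumptions, not Rundial's actual evidence format.

```python
import hashlib

MAX_INLINE_BYTES = 16 * 1024  # 16 KiB threshold from the text above


def offload_large_attrs(attrs, upload, limit=MAX_INLINE_BYTES):
    """Return a copy of attrs with oversized string values replaced by a small reference."""
    result = {}
    for key, value in attrs.items():
        payload = value.encode("utf-8") if isinstance(value, str) else b""
        if len(payload) > limit:
            digest = hashlib.sha256(payload).hexdigest()
            upload(digest, payload)  # hypothetical stand-in for the artifact pipe
            result[key] = {"ref": "sha256:" + digest, "bytes": len(payload)}
        else:
            result[key] = value
    return result


uploaded = {}
span_attrs = offload_large_attrs(
    {"model": "example-model", "completion": "x" * 20_000},
    uploaded.__setitem__,
)
```

Small attributes pass through unchanged, so the span row stays compact regardless of prompt or tool-output size.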

## Artifacts and checkpoints

`run.log_artifact(path_or_dir, name="checkpoint")` enqueues artifact work and returns before
hashing or uploading files. A dedicated background uploader handles manifest hashing,
pre-signed upload URLs, multipart uploads for large files, and finalization without sharing
the metrics/log worker.

```python
import rundial as rd

with rd.init(workspace="team-alpha", project="mnist-demo", api_key="rdk_...") as run:
    run.log_artifact("outputs/eval-report", name="eval-report")
    run.checkpoint("checkpoints/model.pt", step=100, keep_last=5)
    rd.checkpoint("checkpoints/model.pt", step=101)
```

Artifact upload jobs are journaled in the SDK spool directory and retried by the next client
process if an upload is interrupted. `run.checkpoint(...)` and the current-run convenience
`rd.checkpoint(...)` use artifact type `checkpoint`, alias `latest`, and a server-enforced
keep-last retention policy. The API default is to keep the latest 5 finalized checkpoints
per run/name when a client omits the hint; pass `keep_last=K` to tune it for a checkpoint
call.
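
Retention is enforced by the server, but the keep-last idea itself is simple. The sketch below is a hypothetical model that orders checkpoints by `step` (an assumption; the server may order by finalization time instead) and keeps the newest `keep_last`.

```python
def select_for_retention(checkpoints, keep_last=5):
    """Split finalized checkpoints into (kept, expired), newest step first."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    ordered = sorted(checkpoints, key=lambda c: c["step"], reverse=True)
    return ordered[:keep_last], ordered[keep_last:]


checkpoints = [{"name": "checkpoint", "step": s} for s in range(100, 800, 100)]
kept, expired = select_for_retention(checkpoints, keep_last=5)
kept_steps = [c["step"] for c in kept]
expired_steps = [c["step"] for c in expired]
```

With seven checkpoints and the default of 5, the two oldest steps fall out of the retained set.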

To consume an artifact from another run, record lineage and download through a blocking
handle:

```python
import rundial as rd

with rd.init(workspace="team-alpha", project="mnist-demo", api_key="rdk_...") as run:
    artifact = run.use_artifact("checkpoint:latest")
    artifact.download("inputs/checkpoint")
```

Lineage UI is still in progress for the v1 artifact milestone.

## Media

`run.log(...)` accepts image and table helper values for common visual inspection workflows.
Media bytes ride the artifact uploader, while Rundial stores only a bounded manifest row for
querying and display.

```python
import rundial as rd

with rd.init(workspace="team-alpha", project="mnist-demo", api_key="rdk_...") as run:
    run.log({"samples": rd.Image("outputs/sample-grid.png", caption="validation samples")}, step=10)
    run.log(
        {
            "predictions": rd.Table(
                columns=["id", "label", "score"],
                rows=[["img-1", "cat", 0.91], ["img-2", "dog", 0.87]],
            )
        },
        step=10,
    )
```

`rd.Image(...)` accepts filesystem paths, PIL-like objects with `save(...)`, and uint8
numpy-like arrays shaped `(height, width)`, `(height, width, 1)`, `(height, width, 3)`, or
`(height, width, 4)`. Array and PIL-like serialization happens in the artifact worker, not
inside `run.log(...)`. File-backed media jobs are replayable through the artifact journal;
generated media is best-effort until the worker materializes the generated file.
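
To check an array before logging it, a small pre-flight test on the shape is enough. This sketch covers only the shapes listed above; `is_supported_image_shape` is a hypothetical helper, the positive-dimension requirement is an assumption, and the `uint8` dtype check is left out.

```python
ALLOWED_CHANNELS = (1, 3, 4)


def is_supported_image_shape(shape):
    """Return True for (H, W), (H, W, 1), (H, W, 3), or (H, W, 4) with positive sizes."""
    if len(shape) not in (2, 3):
        return False
    if any(not isinstance(dim, int) or dim <= 0 for dim in shape[:2]):
        return False
    return len(shape) == 2 or shape[2] in ALLOWED_CHANNELS


checks = {
    shape: is_supported_image_shape(shape)
    for shape in [(28, 28), (28, 28, 3), (28, 28, 2), (3, 28, 28), (28,)]
}
```

Channels-first arrays such as `(3, 28, 28)` fail this check and would need to be transposed to `(height, width, channels)` first.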

## Framework integrations

Install optional framework adapters only when you need them:

```bash
pip install "rundial[integrations]"
```

| Framework | Import | What it maps |
| --- | --- | --- |
| PyTorch Lightning | `from rundial.integrations import RundialLogger` | hyperparams to run config, metrics to `run.log(...)`, checkpoints to artifacts |
| Hugging Face Transformers | `from rundial.integrations import RundialCallback` | Trainer args/model config to run config, logs/eval metrics to `run.log(...)`, saved checkpoints to artifacts |
| Keras | `from rundial.integrations import RundialKerasCallback` | fit/optimizer params to run config, epoch/batch metrics to `run.log(...)`, checkpoint paths to artifacts |

The base `rundial` install has no hard framework dependencies. Adapter imports remain safe
without Lightning, Transformers, or Keras installed; installing the extra provides the native
callback base classes for framework type checks.

## W&B compatibility

For common W&B-style training scripts, swap only the import line:

```python
import rundial.compat.wandb as wandb
```

The shim supports `wandb.init`, `wandb.log`, `wandb.config`, `wandb.finish`, `run.summary`,
`wandb.Image`, `wandb.Table`, `wandb.watch`, `wandb.define_metric`, and `wandb.login`.
Unsupported symbols raise `NotImplementedError` with a pointer to the compatibility table in
`docs/wandb-compat.md`.

## Resume existing runs

Use `run_id` with an explicit `resume` mode when restarting a crashed or interrupted job:

```python
import rundial as rd

with rd.init(
    workspace="team-alpha",
    project="mnist-demo",
    run_id="run_abc123",
    resume="allow",
    endpoint="http://127.0.0.1:8787",
    api_key="rdk_...",
) as run:
    run.log({"train/loss": 0.38}, step=50)
```

Resume modes:

- `resume="never"` (default): create `run_id` only if it does not already exist.
- `resume="allow"`: attach to a running run or create it if missing; terminal runs are not reopened.
- `resume="must"`: require an existing run; terminal runs are explicitly reopened as `running`.

Duplicate steps are resolved at query time. Rundial keeps raw metric rows append-only, but series
queries show the latest accepted value per `(runId, metricKey, step)` using ingest time, with a
stable row-id tie breaker. This keeps training-loop ingest fast while resumed curves remain
monotonic by step.
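
The query-time rule can be modeled in a few lines. This is a sketch under assumptions: the field names (`run_id`, `metric_key`, `ingest_time_ms`, `row_id`) are hypothetical, and the tie breaker is assumed to prefer the larger row id.

```python
def latest_per_step(rows):
    """Collapse append-only rows to one row per (run_id, metric_key, step).

    The winner has the greatest ingest time; ties fall back to the greatest row id.
    """
    winners = {}
    for row in rows:
        key = (row["run_id"], row["metric_key"], row["step"])
        rank = (row["ingest_time_ms"], row["row_id"])
        best = winners.get(key)
        if best is None or rank > (best["ingest_time_ms"], best["row_id"]):
            winners[key] = row
    return sorted(winners.values(), key=lambda r: (r["run_id"], r["metric_key"], r["step"]))


def make_row(row_id, step, value, ingest_time_ms):
    return {"run_id": "run_abc123", "metric_key": "train/loss", "step": step,
            "value": value, "ingest_time_ms": ingest_time_ms, "row_id": row_id}


rows = [
    make_row(1, 49, 0.40, 1000),
    make_row(2, 50, 0.41, 1001),
    make_row(3, 50, 0.38, 2000),  # same step logged again after a resume
]
series = [(r["step"], r["value"]) for r in latest_per_step(rows)]
```

All three raw rows remain stored; only the series view collapses step 50 to the value ingested last.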

## Discovery helpers

```python
import rundial as rd

client = rd.Client(
    endpoint="http://127.0.0.1:8787",
    api_key="rdk_...",
    spool_enabled=False,
    start_worker_on_init=False,
)
print(client.whoami())
print(client.list_workspaces())
print(client.list_projects("default-workspace"))
client.close(timeout_seconds=0.1, drain=False)
```

If the server does not expose `/api/v1/runs/resolve-target`, slug-first run start fails with an
actionable stale-build error. Rebuild or restart the API server, then retry.

## Runtime notes

- `run.log()` / `run.log_metric()` are non-blocking and never perform network or disk I/O.
- `run.log_text()` and opt-in console capture use the same non-blocking queue and expose
  `log_lines_truncated`, `dropped_log_lines_queue_full`, and
  `dropped_log_lines_invalid` diagnostics.
- system metrics are sampled by a background thread by default and logged as ordinary
  `system/*` metrics; pass `system_metrics=False` to `rd.init(...)` to opt out, or
  `system_metrics_interval_seconds=...` to tune the cadence (minimum 2 seconds).
- `run.finish()` / `run.close()` flush and finalize the run; use `client.close(...)` when you only want to release the client transport.
- NaN and infinite metric values are dropped without raising, counted in
  `run.diagnostics().non_finite_dropped`, and warn once per metric key.
- disk spool is enabled by default at `.rundial_spool` and is bounded by size/age.
- if disk spool writes fail, the fallback in-memory buffer stays bounded and drops the oldest points.
- `close()` returns within the requested timeout plus a bounded transport wait; when it
  cannot send all pending points before the deadline, un-sent points are handed to the disk
  spool and re-sent by the next process.
- `run.diagnostics().pending_spooled_batches` reports durable batches waiting for delivery.
- worker transport can gzip large payloads (`gzip_enabled`, `gzip_min_bytes`).
- use `run.diagnostics()` to inspect queue pressure, retries, and drop counters.
- modes:
  - `online` (default): upload in background with retries/spool fallback
  - `offline`: buffer to spool only (no upload attempts)
  - `disabled`: safe no-op logging for tests and dry-runs
- distributed policy:
  - `distributed="rank0"` (default): only rank 0 emits logs
  - `distributed="all"`: all ranks emit logs (use with caution for cardinality/volume)
- rank detection uses common env vars (`RANK`, `LOCAL_RANK`, `SLURM_PROCID`, etc.);
  override explicitly with `distributed_rank=<int>`.
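
The note above on NaN and infinite metric values (drop, count, warn once per key) can be sketched with the standard library. `NonFiniteFilter` is a hypothetical class for illustration; only the counter name `non_finite_dropped` is taken from the notes.

```python
import math
import warnings


class NonFiniteFilter:
    """Drop NaN/inf values without raising, count them, and warn once per metric key."""

    def __init__(self):
        self.non_finite_dropped = 0
        self._warned_keys = set()

    def accept(self, key, value):
        """Return True if the point should be enqueued, False if it was dropped."""
        if math.isfinite(value):
            return True
        self.non_finite_dropped += 1
        if key not in self._warned_keys:
            self._warned_keys.add(key)
            warnings.warn(f"dropping non-finite value for metric {key!r}")
        return False


flt = NonFiniteFilter()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    results = [flt.accept("train/loss", v) for v in (0.42, float("nan"), float("inf"))]
```

Warning once per key keeps a diverging loss from flooding the console while the counter still records every dropped point.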

## Backward-compatible low-level API

```python
from rundial_sdk import RundialClient
```

`RundialClient` remains supported for advanced/manual lifecycle control.

## Benchmark guardrail

Run the Phase 4 benchmark/guardrail script:

```bash
bun run test:phase4:sdk:benchmark
```

The command validates hot-path latency and bounded spool behavior under sustained retryable failures.
