Metadata-Version: 2.4
Name: resontech
Version: 0.1.2
Summary: SDK for submitting federated learning jobs to the ResonTech platform
Author: ResonTech
License-Expression: Apache-2.0
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: httpx>=0.27
Requires-Dist: boto3<1.41,>=1.34
Provides-Extra: dev
Requires-Dist: pytest>=8; extra == "dev"
Requires-Dist: fastapi>=0.111; extra == "dev"
Requires-Dist: uvicorn[standard]>=0.30; extra == "dev"
Requires-Dist: httpx>=0.27; extra == "dev"
Requires-Dist: moto[s3]>=5; extra == "dev"
Dynamic: license-file

# ResonTech SDK

Python SDK for submitting federated learning jobs to the ResonTech platform.

The SDK is **task-agnostic** — it does not assume a specific task (image
classification, NLP, regression, segmentation, RL, …), dataset format, loss,
or optimizer. You write your model and a one-round training function; the SDK
ships them to every worker and plumbs global weights in/out of the federation.

The SDK's job ends at submission. Progress, logs, output files, and the final
model are tracked on the web dashboard. The `Job` returned from `rt_submit`
carries a `dashboard_url` you can open to follow along.

---

## Installation

```bash
pip install resontech
# or, from source:
pip install -e /path/to/sdk
```

**Requirements:** Python 3.11+, `httpx`, `boto3`. PyTorch is required on the
worker side (the NVFlare workflow operates on PyTorch state dicts).

---

## Setup — once per user

1. Log in to the ResonTech web UI.
2. Go to **Profile → Storage** and click **Provision Bucket**.
3. Copy the `accessKeyId` and `secretAccessKey` it shows you; the secret
   is displayed only once. The SDK uses these credentials to upload directly
   to your bucket.

No SSH keys, no SFTP — everything goes over the S3 API.

---

## The user contract

Define your model class **and** an `fl_train` function in the same module
(or notebook cell). The SDK extracts the file/cell verbatim and ships it as
`model_def.py`. Each round, the worker:

1. Loads the latest global weights into your model.
2. Calls `fl_train(model, env, out_dir, logger)`.
3. Sends the returned weights back for aggregation.

```python
import torch.nn as nn


class Model(nn.Module):                       # any name, any signature
    def __init__(self, **kwargs):             # gets ModelConfig.model_args
        super().__init__()
        ...
    def forward(self, x): ...

def fl_train(model, env, out_dir, logger=None) -> dict:
    """
    env contains: CURRENT_ROUND, SITE_NAME, JOB_ID, DATA_ROOT,
                  EPOCHS, BATCH_SIZE, LR, plus every key from TrainingConfig.extra.
    Returns: {"weights": dict[str, Tensor], "samples": int, ...}
    """
    # You own the dataset, loss, optimizer, and training loop.
    num_samples = 0  # set this to the number of local samples trained on this round
    return {"weights": model.state_dict(), "samples": num_samples}

def fl_validate(model, env, out_dir, logger=None) -> dict:    # optional
    ...
```
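
To illustrate why `fl_train` reports `samples` alongside `weights`, here is a minimal sketch of sample-weighted averaging in the style of FedAvg. It assumes the server-side aggregator weights each site's update by its reported sample count; plain floats stand in for tensors, and `weighted_average` is a hypothetical helper, not part of the SDK or NVFlare.

```python
def weighted_average(results):
    """Average per-site weights, weighting each site by its sample count.

    `results` is a list of dicts shaped like the fl_train return value,
    with floats standing in for tensors.
    """
    total = sum(r["samples"] for r in results)
    if total <= 0:
        raise ValueError("at least one site must report samples > 0")
    keys = results[0]["weights"].keys()
    return {
        k: sum(r["weights"][k] * r["samples"] for r in results) / total
        for k in keys
    }


# Two hypothetical sites: site B trained on three times as much data as site A.
site_a = {"weights": {"w": 1.0, "b": 0.0}, "samples": 100}
site_b = {"weights": {"w": 3.0, "b": 2.0}, "samples": 300}
global_weights = weighted_average([site_a, site_b])
print(global_weights)  # {'w': 2.5, 'b': 1.5}
```

Under this assumption, a site that reports an inflated or zero `samples` value skews the global model, so return the number of samples the round actually trained on.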

---

## Quick start (image classification)

```python
import json, os
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset
from torchvision import models, transforms

from resontech import (
    ResonTech, ResonTechConfig,
    TrainingConfig, FederationConfig, ModelConfig,
)


class MyResNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.backbone = models.resnet18(weights=None)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_classes)

    def forward(self, x):
        return self.backbone(x)


def fl_train(model, env, out_dir, logger=None):
    img_size = int(env.get("IMG_SIZE", 224))
    num_classes = int(env.get("NUM_CLASSES", 10))
    manifest = os.path.join(env["DATA_ROOT"], env.get("MANIFEST_FILE", "manifest.ndjson"))

    class _Dataset(Dataset):
        def __init__(self, path):
            from PIL import ImageFile
            ImageFile.LOAD_TRUNCATED_IMAGES = True
            with open(path) as f:  # close the manifest once it is parsed
                self.records = [json.loads(line) for line in f if line.strip()]
            self.tx = transforms.Compose([
                transforms.Resize(max(img_size, 256)),
                transforms.CenterCrop(img_size),
                transforms.ToTensor(),
                transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
            ])

        def __len__(self): return len(self.records)

        def __getitem__(self, i):
            from PIL import Image
            rec = self.records[i]
            with Image.open(rec["uri"]) as img:
                x = self.tx(img.convert("RGB"))
            y = torch.zeros(num_classes)
            for j in rec.get("y", []):
                y[int(j)] = 1.0
            return x, y

    ds = _Dataset(manifest)
    # Cast the hints explicitly, matching how EPOCHS is handled below.
    loader = DataLoader(ds, batch_size=int(env["BATCH_SIZE"]), shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=float(env["LR"]))
    loss_fn = nn.BCEWithLogitsLoss()
    device = next(model.parameters()).device

    model.train()
    for _ in range(int(env["EPOCHS"])):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad(set_to_none=True)
            loss_fn(model(x), y).backward()
            optimizer.step()

    return {
        "weights": {k: v.detach().cpu() for k, v in model.state_dict().items()},
        "samples": len(ds),
    }


config = ResonTechConfig(
    base_url="https://api.reson.tech",
    email="you@example.com",
    password="your-password",
    s3_access_key_id="AKIA...",
    s3_secret_access_key="...",
)

sdk = ResonTech(config)
sdk.login()

job = sdk.rt_submit(
    model=MyResNet,
    name="my-fl-job",
    shards_dir="./shards",
    requirements_txt="./requirements.txt",
    training=TrainingConfig(
        local_epochs=2,
        batch_size=32,
        learning_rate=1e-3,
        extra={"IMG_SIZE": 224, "NUM_CLASSES": 10, "MANIFEST_FILE": "manifest.ndjson"},
    ),
    federation=FederationConfig(num_rounds=5, min_clients=1),
    model_config=ModelConfig(model_args={"num_classes": 10}),
)

print(f"Submitted: {job.id}")
print(f"Track at:  {job.dashboard_url}")
```

For NLP, tabular, regression, segmentation, RL, etc., **replace `fl_train`**
with whatever your task needs. The SDK does not change.

---

## Core concepts

### Model requirements

The user's model class can have any signature. The constructor receives
**only** the keys you put in `ModelConfig.model_args` — the SDK injects
nothing of its own.

### Dataset shards

Each worker receives one file from `shards_dir` (one shard per worker),
mounted under `env["DATA_ROOT"]` on the worker. The SDK does not enforce a
file format; your `fl_train` decides how to read whatever you uploaded
(NDJSON manifests, Parquet, raw images, HDF5, anything).
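
One possible way to prepare `shards_dir` is sketched below: records are split round-robin into one zip per worker, each holding an NDJSON manifest. The `shard_<i>.zip` names follow the layout used elsewhere in this README and the `uri` / `y` fields follow the quick start; `write_shards` is a hypothetical helper, and whether the platform unpacks the archive before mounting it is not covered here.

```python
import json
import tempfile
import zipfile
from pathlib import Path


def write_shards(records, shards_dir, num_workers, manifest_name="manifest.ndjson"):
    """Split records round-robin into shard_<i>.zip files, one per worker."""
    shards_dir = Path(shards_dir)
    shards_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(num_workers):
        part = records[i::num_workers]  # every num_workers-th record, starting at i
        body = "".join(json.dumps(rec) + "\n" for rec in part)
        path = shards_dir / f"shard_{i}.zip"
        with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            zf.writestr(manifest_name, body)
        paths.append(path)
    return paths


# Seven hypothetical records spread over three workers.
records = [{"uri": f"images/{n}.jpg", "y": [n % 10]} for n in range(7)]
with tempfile.TemporaryDirectory() as tmp:
    shard_paths = write_shards(records, tmp, num_workers=3)
    shard_names = [p.name for p in shard_paths]
    with zipfile.ZipFile(shard_paths[0]) as zf:
        lines = zf.read("manifest.ndjson").decode().splitlines()
        first_shard = [json.loads(line) for line in lines]

print(shard_names)       # ['shard_0.zip', 'shard_1.zip', 'shard_2.zip']
print(len(first_shard))  # 3
```

A real shard would also carry the files the manifest points to; this sketch writes only the manifest to keep the example short.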

### What the SDK uploads for you

`sdk.rt_submit(...)` writes directly to `s3://<your-bucket>/jobs/<name>/`:

```
jobs/<name>/
├── scripts/
│   ├── model_def.py                  ← your file/cell, verbatim
│   ├── custom_client_executor.py     ← bundled GenericClientExecutor (or your own)
│   └── custom_persistor.py
├── configs/
│   ├── config_fed_client.json
│   ├── config_fed_server.json
│   └── meta.json
├── requirements/
│   └── requirements.txt
├── shards/
│   ├── shard_0.zip
│   └── …
└── model/                            ← optional warm-start weights
    └── checkpoint.pt
```

No local temp directories are created. Shards stream directly from your disk
via S3 multipart upload (50 MB parts, 5 concurrent).
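
As a rough guide to what those numbers mean for a given shard, the sketch below computes how many parts an upload is split into. It assumes "50 MB" means 50 MiB and treats concurrency as strict waves of five parts, which is a simplification; `multipart_plan` is a hypothetical helper, not an SDK function.

```python
import math

PART_SIZE = 50 * 1024 * 1024  # assumption: "50 MB" is read as 50 MiB
MAX_CONCURRENCY = 5           # parts in flight at once


def multipart_plan(size_bytes, part_size=PART_SIZE, concurrency=MAX_CONCURRENCY):
    """Return (number of parts, number of upload waves) for one shard."""
    parts = max(1, math.ceil(size_bytes / part_size))
    waves = math.ceil(parts / concurrency)
    return parts, waves


print(multipart_plan(1 * 1024**3))   # 1 GiB shard  -> (21, 5)
print(multipart_plan(10 * 1024**2))  # 10 MiB shard -> (1, 1)
```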

---

## API reference

### `ResonTechConfig`

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `base_url` | str | — | REST API base URL (HTTPS required for non-loopback hosts) |
| `email` | str | — | Account email |
| `password` | str | — | Account password |
| `s3_access_key_id` | str | — | Garage S3 access key (from the UI) |
| `s3_secret_access_key` | str | — | Garage S3 secret (from the UI) |
| `s3_bucket_alias` | str | `""` | Bucket alias — auto-resolved on login if empty |
| `s3_endpoint` | str | `https://s3.reson.tech` | S3-compatible endpoint URL |
| `s3_region` | str | `"garage"` | AWS region label |
| `dashboard_url` | str | `https://app.reson.tech` | Web UI base used in the tracking message |
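
The `base_url` rule (HTTPS required for non-loopback hosts) can be checked before constructing the config. The sketch below shows one way such a check might look; `check_base_url` and `is_loopback` are hypothetical helpers, and the SDK's own validation may differ in detail.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback(host):
    """True for 'localhost' and for loopback IP literals such as 127.0.0.1 or ::1."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:  # not an IP literal, so treat it as a remote hostname
        return False


def check_base_url(base_url):
    """Return base_url if it is HTTPS, or plain HTTP to a loopback host."""
    parsed = urlparse(base_url)
    host = parsed.hostname or ""
    if parsed.scheme == "https":
        return base_url
    if parsed.scheme == "http" and is_loopback(host):
        return base_url
    raise ValueError(f"HTTPS is required for non-loopback host {host!r}")


for url in ("https://api.reson.tech", "http://127.0.0.1:8000", "http://api.reson.tech"):
    try:
        check_base_url(url)
        print(url, "ok")
    except ValueError as exc:
        print(url, "rejected:", exc)
```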

### `TrainingConfig`

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `local_epochs` | int | `1` | Local epochs per round → `env["EPOCHS"]` |
| `batch_size` | int | `32` | Mini-batch size hint → `env["BATCH_SIZE"]` |
| `learning_rate` | float | `0.001` | Optimizer LR hint → `env["LR"]` |
| `extra` | dict | `{}` | Free-form, task-specific knobs merged into `env` |

Anything task-specific — image size, num_classes, sequence length, optimizer
name, mixup alpha, etc. — goes into `extra` and reaches your `fl_train`
function via the `env` dict, with no SDK interpretation.
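
A minimal sketch of how the built-in hints and `extra` might combine into `env`, and how `fl_train` can read them defensively. `build_env` is a hypothetical stand-in for what the worker does; letting `extra` override the built-in keys is an assumption, and values are cast on read in case they arrive as strings.

```python
def build_env(local_epochs=1, batch_size=32, learning_rate=0.001, extra=None):
    """Hypothetical stand-in: merge the built-in hints with TrainingConfig.extra."""
    env = {"EPOCHS": local_epochs, "BATCH_SIZE": batch_size, "LR": learning_rate}
    env.update(extra or {})  # assumption: extra keys are merged last
    return env


env = build_env(local_epochs=2, extra={"IMG_SIZE": "224", "OPTIMIZER": "adamw"})

# Inside fl_train: cast on read and supply defaults for optional knobs.
epochs = int(env["EPOCHS"])
img_size = int(env.get("IMG_SIZE", 224))
seq_len = int(env.get("SEQ_LEN", 128))  # absent key falls back to the default
optimizer_name = env.get("OPTIMIZER", "adam")

print(epochs, img_size, seq_len, optimizer_name)  # 2 224 128 adamw
```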

### `ModelConfig`

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `model_class` | str | auto | Importable path to the model class (auto-filled by `rt_submit`) |
| `model_args` | dict | `{}` | Kwargs forwarded to the model constructor on every site |
| `adapter_module` | str | `"model_def"` | Module the executor imports |
| `train_fn` | str | `"fl_train"` | Name of the user training function |
| `validate_fn` | str | `"fl_validate"` | Name of the optional user validation function |

### `FederationConfig`

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `num_rounds` | int | `5` | Total training rounds |
| `min_clients` | int | `1` | Clients required per round |
| `wait_time_after_min_received` | int | `10` | Seconds to wait after min clients respond |
| `heart_beat_timeout` | int | `600` | Client heartbeat timeout (seconds) |
| `job_name` | str | `"rt_job"` | Name stored in job metadata |
| `aggregator_path` / `aggregator_args` | str / dict | `InTimeAccumulateWeightedAggregator` w/ `expected_data_kind="WEIGHTS"` | Server-side aggregator |
| `shareable_generator_path` / `shareable_generator_args` | str / dict | `FullModelShareableGenerator` | Wraps weights into NVFlare Shareables |
| `workflow_path` / `workflow_extra_args` | str / dict | `ScatterAndGather` | Server workflow — swap in FedProx, FedOpt, cyclic, swarm, etc. |

### `ResonTech`

#### `sdk.login() → User`

Authenticates and sets the JWT token. Auto-resolves `s3_bucket_alias` via
`GET /api/users/storage/bucket` if not set. Raises `StorageError` if the
user has not provisioned a bucket yet.

#### `sdk.rt_submit(...) → Job`

Render, upload, and submit a federated learning job in one call.

```python
job = sdk.rt_submit(
    model=MyResNet,                      # nn.Module subclass (required)
    name="my-fl-job",                    # display name (required)
    shards_dir="./shards",               # local directory of shard files (required)
    requirements_txt="./req.txt",        # optional
    training=TrainingConfig(...),
    federation=FederationConfig(...),
    model_config=ModelConfig(...),       # optional (auto-filled from class name)
    executor=None,                       # custom NVFlare Executor class (optional)
    persistor=None,                      # custom NVFlare ModelPersistor class (optional)
    model_checkpoint="./warmstart.pt",   # optional warm-start file
    estimated_memory_mb=None,            # VRAM hint in MB (optional)
    worker_ids=None,                     # explicit list (optional)
    auto_select_workers=False,           # let server pick (optional)
    only_mine_workers=False,             # restrict to your own nodes
)
```

#### `sdk.storage`

Thin boto3 wrapper pointed at the user's bucket. Exposes `put_bytes`,
`upload_file`, `head`, `list`, `presign_get`. Useful for double-checking
what landed in the bucket or sharing a download URL.

#### `sdk.dashboard.worker_stats() → WorkerStats`

Returns the current cluster availability (online workers, free VRAM).

#### `sdk.jobs`

`JobsResource` — low-level job submission and retrieval (`submit`, `get`,
`list`). Prefer `rt_submit` unless you have already uploaded a workspace.

### `Job`

Thin handle returned by `rt_submit` and `jobs.submit`.

| Attribute | Description |
|-----------|-------------|
| `job.id` | Unique job ID |
| `job.name` | Job name |
| `job.state` | Initial `JobState` returned by the API |
| `job.dashboard_url` | URL to the job detail page on the web UI |

`JobState` values: `PENDING`, `RUNNING`, `FINISHED`, `FAILED`.

---

## Advanced: custom executor / persistor

Pass your own NVFlare subclasses to override the bundled defaults. The SDK
extracts their source the same way it does for the model:

```python
from nvflare.apis.executor import Executor

class MyExecutor(Executor): ...   # full control over the per-round flow

job = sdk.rt_submit(
    model=MyResNet,
    name="custom-fl-job",
    shards_dir="./shards",
    executor=MyExecutor,
)
```

Note: a custom executor *replaces* `GenericClientExecutor` entirely — it does
not have to obey the `fl_train(model, env, out_dir, logger)` contract. You
own the whole NVFlare task lifecycle.

## Advanced: swap the workflow / aggregator

`FederationConfig` exposes the NVFlare component paths. Replace the defaults
with FedProx, FedOpt, cyclic, hierarchical, or any other workflow:

```python
federation = FederationConfig(
    num_rounds=10,
    workflow_path="nvflare.app_common.workflows.cyclic.Cyclic",
    # In NVFlare, FedOpt is provided as a shareable generator.
    shareable_generator_path="nvflare.app_opt.pt.fedopt.PTFedOptModelShareableGenerator",
    shareable_generator_args={"optimizer_args": {"lr": 0.01}},
)
```

## Advanced: inspect generated configs

`RTJobBuilder` renders configs and scripts as strings — no disk writes:

```python
from resontech.rt.builder import RTJobBuilder

builder = RTJobBuilder(training_config=training, federation_config=federation)
configs = builder.render_configs()              # {"client": "...", "server": "...", "meta": "..."}
scripts = builder.render_scripts(MyResNet)      # {"model_def.py": "...", ...}

print(configs["client"])
```

---

## Troubleshooting

**`StorageError: No S3 bucket configured`**
You haven't provisioned a bucket yet. Go to the web UI's Profile → Storage
page, click "Provision Bucket", then copy the access key ID and secret access
key into `ResonTechConfig`.

**`StorageError: ... AccessDenied`**
The credentials in `s3_access_key_id` / `s3_secret_access_key` do not
match the bucket on your account. Rotate them from Profile → Storage →
Rotate Key and try again.

**`ValidationError: shards_dir is not a directory` or empty shards list**
Each worker needs its own shard file. Drop them into `shards_dir` as
`shard_0.zip`, `shard_1.zip`, … (one per worker).

**`name 'nn' is not defined` (server-side error)**
Put all imports (`import torch`, `import torch.nn as nn`, …) in the
**same notebook cell** as your model class and `fl_train`. The SDK extracts
the whole cell verbatim — anything outside that cell does not travel.

**`User module 'model_def' does not define class 'X'`** or
**`... does not define 'fl_train'`**
Your model class and `fl_train` function must live in the file/cell that
the SDK extracts. If you renamed either, adjust `ModelConfig.model_class`
and `ModelConfig.train_fn` accordingly.
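
Both errors can be caught locally before submitting. The sketch below parses a cell or file's source with the standard `ast` module and reports which required top-level names are missing; `missing_definitions` is a hypothetical helper, not part of the SDK, and it inspects only top-level definitions.

```python
import ast


def missing_definitions(source, class_name, train_fn="fl_train"):
    """Return the required top-level names that `source` does not define."""
    tree = ast.parse(source)  # parses only; nothing is imported or executed
    classes = {n.name for n in tree.body if isinstance(n, ast.ClassDef)}
    functions = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    missing = []
    if class_name not in classes:
        missing.append(class_name)
    if train_fn not in functions:
        missing.append(train_fn)
    return missing


cell = '''
import torch.nn as nn

class MyResNet(nn.Module):
    pass

def train_one_round(model, env, out_dir, logger=None):
    return {}
'''

print(missing_definitions(cell, "MyResNet"))                              # ['fl_train']
print(missing_definitions(cell, "MyResNet", train_fn="train_one_round"))  # []
```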

---

## Runnable examples

Examples live outside the published package, in the project repository:

- `submit_job.py` — minimal script demonstrating the `fl_train` contract.
- `01_quickstart.ipynb` — walk-through notebook mirroring this README.
- `02_custom_hyperparameters.ipynb` — using `TrainingConfig.extra`.
- `03_custom_executor_persistor.ipynb` — replacing the bundled NVFlare classes.
- `04_warmstart_and_inspection.ipynb` — pre-trained checkpoint + bucket inspection.

Set the following environment variables before running any of them:
`RT_BASE_URL`, `RT_EMAIL`, `RT_PASSWORD`,
`RT_S3_ACCESS_KEY_ID`, `RT_S3_SECRET_ACCESS_KEY`,
and optionally `RT_S3_ENDPOINT`.
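
For your own scripts, one way to turn these variables into config arguments is sketched below. The mapping from variable names to `ResonTechConfig` fields is an assumption based on the parameter table above, and `config_kwargs_from_env` is a hypothetical helper.

```python
import os

# Assumed mapping from environment variable to ResonTechConfig field.
REQUIRED = {
    "RT_BASE_URL": "base_url",
    "RT_EMAIL": "email",
    "RT_PASSWORD": "password",
    "RT_S3_ACCESS_KEY_ID": "s3_access_key_id",
    "RT_S3_SECRET_ACCESS_KEY": "s3_secret_access_key",
}
OPTIONAL = {"RT_S3_ENDPOINT": "s3_endpoint"}


def config_kwargs_from_env(environ=None):
    """Collect config keyword arguments, failing early on missing variables."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    kwargs = {field: environ[name] for name, field in REQUIRED.items()}
    for name, field in OPTIONAL.items():
        if environ.get(name):
            kwargs[field] = environ[name]
    return kwargs


# Demo with a plain dict standing in for os.environ.
fake_environ = {
    "RT_BASE_URL": "https://api.reson.tech",
    "RT_EMAIL": "you@example.com",
    "RT_PASSWORD": "your-password",
    "RT_S3_ACCESS_KEY_ID": "example-key-id",
    "RT_S3_SECRET_ACCESS_KEY": "example-secret",
}
kwargs = config_kwargs_from_env(fake_environ)
print(sorted(kwargs))
# With the SDK installed: config = ResonTechConfig(**config_kwargs_from_env())
```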
