Metadata-Version: 2.4
Name: krunic
Version: 0.1.10
Summary: Hyperparameter search for image classifiers using Ray Tune + SkyPilot
Requires-Python: >=3.12
Requires-Dist: argparse>=1.4.0
Requires-Dist: gdown>=5.2.1
Requires-Dist: h5py>=3.16.0
Requires-Dist: matplotlib>=3.10.8
Requires-Dist: medmnist>=3.0.2
Requires-Dist: numpy>=2.4.3
Requires-Dist: optuna>=4.7.0
Requires-Dist: pillow>=12.1.1
Requires-Dist: ray[train,tune]>=2.54.0
Requires-Dist: scikit-learn>=1.8.0
Requires-Dist: skypilot[aws,lambda]==0.12.0
Requires-Dist: timm>=1.0.25
Requires-Dist: torch>=2.0.0
Requires-Dist: webdataset>=1.0.2
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == 'dev'
Description-Content-Type: text/markdown

# krunic

[![tests](https://github.com/ljbuturovic/krunic/actions/workflows/test.yml/badge.svg)](https://github.com/ljbuturovic/krunic/actions/workflows/test.yml)

Automated hyperparameter search for image classifiers - from dataset to tuned model with one command. Distributed across GPUs and across hosts, locally or in the cloud (AWS).

krunic uses off-the-shelf models and packages, so you won't get SOTA performance. That said, it can get surprisingly close with almost zero effort. Useful as a baseline, or for experimenting with architectures, GPUs, etc.

Built on [Ray Tune](https://docs.ray.io/en/latest/tune/index.html), [Optuna](https://optuna.org/), [timm](https://github.com/huggingface/pytorch-image-models), and [SkyPilot](https://skypilot.readthedocs.io/). NOTE: requires Python ≥ 3.12; Ubuntu 22.04 users need to install Python 3.12 via the deadsnakes PPA or upgrade their OS.

## Install (Mac and Linux)

```bash
$ pipx install krunic
```

This installs three commands: `tunic` (local training), `krunic` (cloud launcher), and `tunic-plotter` (results visualizer). Installation takes a couple of minutes.

## Install (Windows)

Notes: (1) this is untested because I don't have Windows access, but it should work; (2) SkyPilot support for Windows is poor, so only `tunic` (local classifier tuning) is expected to work.

```bash
winget install Python.Python.3   # if Python not already installed
py -m pip install --user pipx
py -m pipx ensurepath            # restart terminal after this
pipx install krunic
```

## Quick start

**Local:**
```bash
$ tunic --data /path/to/dataset --model resnet50 --n_trials 30 --epochs 30 --output results.json
```

**Cloud (AWS):**

This requires an AWS account. The image data must be copied to S3 before the run, for example:

```bash
$ aws s3 sync ~/image_data/tin s3://image.data/tin
```

```bash
$ krunic \
  --cluster skya \
  --s3-path my-dataset \
  --model resnet50 \
  --accelerator T4:4 \
  --num-nodes 4 \
  --n-trials 48 \
  --n-epochs 50 \
  --prefix kaws
```

SkyPilot creates the cluster and Ray distributes the load across the GPUs. In my experiments, it achieves near-perfect utilization:

![GPU utilization during a run](https://raw.githubusercontent.com/ljbuturovic/krunic/main/pcam14_GPU_load.png)


Upon completion, get the best model hyperparameters:

```bash
$ aws s3 cp s3://image.data/ray-results/tin6/kaws_results.json .
```
**Plot metric per trial:**
```bash
$ tunic-plotter kaws_results.json
```
![Validation accuracy per trial](https://raw.githubusercontent.com/ljbuturovic/krunic/main/tin5_results_val_acc.png)

Remember to take the cluster down after downloading the results.

```bash
$ yes | sky down skya
```


**Train final model from tuning results (locally):**
```bash
$ tunic --final kaws_results.json --data /path/to/dataset --epochs 50 --amp
```


## Results on common benchmarks

| Dataset | Model | Metric | Validation | Test | SOTA |
|---|---|---|---|---|---|
| PCam | ResNet18 | AUROC | 0.96 | 0.96 | 0.96 |
| TinyImageNet | ViT-Small | Accuracy | 0.87  | | 0.91 |
| ChestMNIST | ResNet18 | AUROC | 0.75 | 0.75 | 0.77 |
| TissueMNIST | ResNet18 | AUROC | 0.92 | 0.94 | 0.93 |

All runs use generic off-the-shelf models with no domain-specific modifications.

## Search space

| Parameter | Range |
|---|---|
| Optimizer | AdamW, SGD |
| Learning rate | 1e-5 – 1e-1 (log) |
| Weight decay | 1e-6 – 1e-1 (log) |
| Label smoothing | 0 – 0.3 |
| Dropout rate | 0 – 0.5 |
| RandAugment magnitude | 1 – 15 |
| RandAugment num ops | 1 – 4 |
| Mixup alpha | 0 – 0.5 |
| CutMix alpha | 0 – 1.0 |

Override any part with a YAML file via `--search-space`.
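For example, to pin the optimizer and narrow the learning-rate range, an override file might look like the sketch below. The key names here are assumptions based on the table above, not the package's documented schema; check the bundled default search space for the exact format.

```yaml
# hypothetical search-space override, passed via --search-space overrides.yaml
optimizer: [AdamW]        # restrict to a single choice
lr:
  low: 1.0e-4             # narrower than the default 1e-5 – 1e-1
  high: 1.0e-2
  log: true
```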

## tunic - local hyperparameter search

```
tunic --data PATH --model MODEL [options]
```

| Flag | Default | Description |
|---|---|---|
| `--data` | required | Dataset root (ImageFolder or WebDataset) |
| `--model` | required | Any timm model name |
| `--n_trials` | 80 | Number of Optuna trials |
| `--epochs` | 30 | Training epochs per trial (also used for `--final`) |
| `--tune-metric` | `val_auroc` | Metric for trial selection and pruning |
| `--training_fraction` | 1.0 | Fraction of training data (val always uses 1.0) |
| `--batch-size` | 32 | Batch size per trial |
| `--amp` | | Enable automatic mixed precision |
| `--resume` | | Warm-start from a previous experiment directory |
| `--final` | | Skip tuning; train final model from results JSON |
| `--combine` | | Train final model on train+val combined |
| `--final-model` | `tunic_final.pt` | Output path for final model weights |
| `--final-stats` | | Output path for final model stats (JSON) |
| `--device` | `auto` | `auto`, `cuda`, `mps`, or `cpu` |
| `--smoke-test` | | Quick end-to-end test with synthetic data |

## krunic - cloud launcher

krunic generates a SkyPilot YAML and launches the job. The dataset is S3-mounted (or copied); results are uploaded to S3 when the job completes.

### Prerequisites

**1. AWS credentials**

```bash
aws configure
```

Prompts for your Access Key ID, Secret Access Key, and region (e.g. `us-east-1`). Your IAM user needs EC2 and S3 permissions. SkyPilot uses these credentials directly - no separate SkyPilot account or configuration needed.

**2. Verify SkyPilot sees AWS**

```bash
sky check
```

Should show `AWS: enabled`.

**3. Dataset in S3**

```bash
aws s3 sync ~/image_data/my-dataset s3://my-bucket/my-dataset
```

### Monitor and tear down

krunic launches the cluster and streams logs. Once the job completes, download results and tear down:

```bash
sky status                          # check cluster state
sky logs my-cluster 1               # stream logs (job ID increments with each run)
aws s3 cp s3://my-bucket/ray-results/prefix/prefix_results.json .
yes | sky down my-cluster           # terminate cluster
```

`--workdir` defaults to the installed package directory (contains `tunic.py` and `requirements.txt`). Override it only if you are developing from a local source checkout and want to test unpublished changes.

```
krunic --cluster NAME --workdir DIR --s3-path PATH --model MODEL [options]
```

| Flag | Default | Description |
|---|---|---|
| `--cluster` | required | SkyPilot cluster name |
| `--workdir` | package dir | Local directory synced to the cluster. Used for development |
| `--s3-path` | required | Dataset path within the S3 bucket |
| `--model` | required | Any timm model name |
| `--accelerator` | `T4:4` | GPU spec (e.g. `T4:4`, `A10G:1`, `A100:8`) |
| `--num-nodes` | 1 | Number of cluster nodes |
| `--n-trials` | 30 | Number of Optuna trials |
| `--n-epochs` | 30 | Training epochs per trial |
| `--batch-size` | 32 | Batch size per trial |
| `--training-fraction` | 1.0 | Fraction of training data per trial |
| `--tune-metric` | `val_auroc` | Metric for trial selection and pruning |
| `--bucket` | `image.data` | S3 bucket name |
| `--prefix` | `tunic` | Prefix for output files and S3 paths |
| `--spot` | — | Use spot instances (with retry-until-up) |
| `--copy` | — | Copy data from S3 to local disk instead of mounting |
| `--idle-minutes` | 60 | Auto-stop cluster after N idle minutes |
| `--no-autostop` | — | Disable auto-stop |

Results are uploaded to `s3://<bucket>/ray-results/<prefix>/<prefix>_results.json`.

## tunic-plotter - visualize results

```bash
tunic-plotter results.json                  # plots val_auroc and val_acc
tunic-plotter results.json --metric val_acc # single metric
tunic-plotter results.json --trial_sort     # keep original trial order, show running best
```

Saves PNG files alongside the results JSON.

## Dataset format

tunic auto-detects the dataset format:

- **ImageFolder** - standard `split/class/image.ext` layout
- **WebDataset** - sharded TAR files; detected when `wds/dataset_info.json` exists
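An ImageFolder dataset, for instance, is laid out like this (the split and class names are illustrative; only the `split/class/image.ext` structure matters):

```
dataset/
├── train/
│   ├── cats/
│   │   └── img001.jpg
│   └── dogs/
│       └── img002.jpg
└── val/
    ├── cats/
    └── dogs/
```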

## Scaling

Concurrent trials = total GPUs: `--num-nodes 4 --accelerator T4:4` → 16 concurrent trials.

Optuna's TPE needs ~20 trials before it outperforms random search. 32–64 trials is a practical range for most problems.
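The concurrency arithmetic above can be sketched in a few lines (the helper name is hypothetical, not part of the krunic API):

```python
def concurrent_trials(num_nodes: int, accelerator: str) -> int:
    """Total GPUs = nodes x GPUs-per-node, parsed from a spec like 'T4:4'."""
    _, _, count = accelerator.partition(":")
    return num_nodes * int(count or 1)

# --num-nodes 4 --accelerator T4:4 -> 16 concurrent trials
print(concurrent_trials(4, "T4:4"))
```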

## Output format

```json
{
  "model": "resnet18",
  "best_val_auroc": 0.963,
  "best_val_acc": 0.891,
  "best_params": {
    "optimizer": "AdamW",
    "lr": 0.0028,
    "weight_decay": 3.6e-06,
    "label_smoothing": 0.058,
    "drop_rate": 0.183
  },
  "n_trials": 48,
  "completed_trials": 48,
  "epochs": 50,
  "all_trials": [...]
}
```
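To reuse the tuned hyperparameters programmatically, the results file can be read with the standard library. A minimal sketch, using field names and sample values from the schema above:

```python
import json

# Example payload matching the results schema (in practice, read the
# downloaded *_results.json file instead of an inline string).
results_json = """
{
  "model": "resnet18",
  "best_val_auroc": 0.963,
  "best_params": {"optimizer": "AdamW", "lr": 0.0028}
}
"""

results = json.loads(results_json)
best = results["best_params"]
print(f"{results['model']}: AUROC {results['best_val_auroc']:.3f}, lr {best['lr']}")
```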
