Metadata-Version: 2.4
Name: coolingcube
Version: 0.1.1
Summary: GPU cluster tail latency optimizer for DDP training and inference
Home-page: https://coolingcube.cc
Author: Cooling Cube
Author-email: CoolingCubeInfo@proton.me
Keywords: gpu ddp training inference tail-latency optimization multi-gpu
Classifier: Programming Language :: Python :: 3
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: System :: Distributed Computing
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.20
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Cooling Cube

GPU cluster tail latency optimizer for DDP training and multi-GPU inference.

Cooling Cube reduces per-step tail latency by identifying pressure workers and computing per-worker start-time offsets. It targets clusters of 8–64 GPUs and guarantees that the reported gain is never negative.

## Install

```bash
pip install coolingcube
```

## CLI usage

```bash
# From a timing log file
coolingcube --logs timing.json

# Inline worker times (microseconds)
coolingcube --workers '{"0": 12000, "1": 11800, "2": 15500, "3": 12500}'

# JSON output
coolingcube --logs timing.json --json
```
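If the per-rank times already live in a Python dict, one way to produce a file for `--logs` is to dump them in the flat rank-to-time layout described under "Log file formats". This is a minimal sketch: `write_timing_log` is a hypothetical helper, not part of the package, and it assumes the values are already in microseconds.

```python
import json

def write_timing_log(times_us, path="timing.json"):
    """Write per-rank times as a flat {"rank": time} JSON object.

    Hypothetical helper, not part of coolingcube. Assumes times_us maps
    a rank (int or str) to a time in microseconds.
    """
    payload = {str(rank): t for rank, t in times_us.items()}  # JSON keys must be strings
    with open(path, "w") as f:
        json.dump(payload, f)
    return payload

# Example call (writes timing.json in the current directory):
# write_timing_log({0: 12000, 1: 11800, 2: 15500, 3: 12500})
```

The resulting file can then be passed to `coolingcube --logs timing.json`.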

## Python usage

```python
from coolingcube import optimize

result = optimize({
    "0": 12000,
    "1": 11800,
    "2": 15500,
    "3": 12500,
})

print(f"Gain: {result['gain_pct']:.3f}% ({result['gain_us']:.1f} µs)")
print(f"Schedule: {result['best_schedule']}")
```

## Collecting logs from PyTorch DDP

```python
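import time
import torch
import torch.distributed as dist

# Assumes model, batch, optimizer, and num_steps already exist and the
# process group is initialized.
rank = dist.get_rank()
total_us = 0.0

for step in range(num_steps):
    t0 = time.perf_counter()
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    optimizer.step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for queued GPU work before reading the clock
    # Stop the clock before the barrier so the wait for other ranks is not counted
    total_us += (time.perf_counter() - t0) * 1e6
    dist.barrier()

# Mean step time for this rank, in microseconds
mean_us = total_us / num_steps

# Each rank only knows its own time, so collect every rank's mean on all ranks
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, (rank, mean_us))
timing_logs = {str(r): t for r, t in gathered}

# After training loop, on rank 0:
if rank == 0:
    from coolingcube import optimize
    result = optimize(timing_logs)
    print(f"Gain: {result['gain_pct']:.3f}%")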
```

## Log file formats

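Cooling Cube detects the log format automatically. It accepts a flat rank-to-time map, a list of per-rank records, or an object that wraps such a list under `workers`: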

```json
{"0": 12345, "1": 11800, "2": 13000}
```

```json
[{"rank": 0, "total_iter_time": 0.186}, {"rank": 1, "total_iter_time": 0.212}]
```

```json
{"workers": [{"rank": 0, "step_time": 0.172}, ...]}
```
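To make the three layouts concrete, here is a rough sketch of how they could be reduced to one rank-to-time dict. It is not the package's own parser: `normalize_timing` and `TIME_KEYS` are hypothetical names, only the key names shown in the examples above are handled, and values are left in whatever unit the file uses.

```python
# Time keys that appear in the examples above; the real parser may accept others.
TIME_KEYS = ("total_iter_time", "step_time")

def normalize_timing(data):
    """Return {rank_as_str: time} for the three layouts shown above.

    Hypothetical helper for illustration only; no unit conversion is done.
    """
    if isinstance(data, dict) and "workers" in data:
        data = data["workers"]  # unwrap {"workers": [...]}
    if isinstance(data, dict):
        # Flat map: {"0": 12345, ...}
        return {str(rank): float(t) for rank, t in data.items()}
    result = {}
    for record in data:  # list of per-rank records
        key = next(k for k in TIME_KEYS if k in record)
        result[str(record["rank"])] = float(record[key])
    return result

print(normalize_timing([{"rank": 0, "total_iter_time": 0.186},
                        {"rank": 1, "total_iter_time": 0.212}]))
```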

## How it works

Standard DDP holds all workers at the barrier until the slowest finishes. The bottleneck is usually not the straggler itself but the pressure workers pushing it — workers with slightly elevated times that create synchronization pressure.
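The barrier effect can be illustrated with a few lines of arithmetic: the step takes as long as the slowest worker, and every other worker idles for the difference. The sketch below uses the sample times from the usage examples; `barrier_idle_times` is a hypothetical helper and does not reproduce how Cooling Cube selects pressure workers.

```python
def barrier_idle_times(worker_times_us):
    """Return (step_time, idle_per_worker) for one synchronized step.

    With a barrier, the step lasts as long as the slowest worker, and each
    worker waits for the difference between that time and its own.
    """
    step_time = max(worker_times_us.values())
    idle = {rank: step_time - t for rank, t in worker_times_us.items()}
    return step_time, idle

step_time, idle = barrier_idle_times({"0": 12000, "1": 11800, "2": 15500, "3": 12500})
print(step_time)  # 15500
print(idle)       # {'0': 3500, '1': 3700, '2': 0, '3': 3000}
```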

Cooling Cube identifies those pressure workers and computes per-worker start-time offsets to reduce tail latency. The algorithm uses a Ridge surrogate model and converges in 40–80 oracle calls regardless of cluster size.

Typical gains: 0.2–0.9% step-time reduction on heterogeneous or PCIe-bound clusters.
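For a sense of scale, the percentages can be converted to absolute time per step. The snippet below is plain arithmetic on the quoted range, applied to an assumed 15500 µs step (the slowest worker in the sample data); `gain_range_us` is a hypothetical helper, and the result is not a measurement.

```python
def gain_range_us(step_time_us, low_pct=0.2, high_pct=0.9):
    """Convert the quoted percentage range into microseconds for one step."""
    return step_time_us * low_pct / 100, step_time_us * high_pct / 100

low, high = gain_range_us(15500)
print(f"{low:.1f} to {high:.1f} µs per step")  # 31.0 to 139.5 µs per step
```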

## Free

Free for open source and research use. No account required.

https://coolingcube.cc · CoolingCubeInfo@proton.me
