Metadata-Version: 2.4
Name: gpu-choochoo
Version: 0.1.0
Summary: Gymnasium environments, workloads, and scheduling baselines for realistic GPU cluster research.
Author: Soren Madsen
Project-URL: Homepage, https://github.com/sorenmadsen/gpu-choochoo
Project-URL: Repository, https://github.com/sorenmadsen/gpu-choochoo
Project-URL: Issues, https://github.com/sorenmadsen/gpu-choochoo/issues
Keywords: reinforcement-learning,gymnasium,gpu,scheduling,simulation
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Visualization
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: gymnasium>=0.29
Requires-Dist: numpy>=1.22
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Requires-Dist: stable-baselines3>=2.0; extra == "dev"
Dynamic: license-file

# GPU ChooChoo

Keep your GPUs chugging along!

## Overview

GPU ChooChoo is a library of Gymnasium environments for training deep RL agents on GPU cluster scheduling. Its statistically realistic workloads span diverse cluster configurations, so learned policies generalize instead of overfitting to a single setup.

## Key Features

### 1. Realistic Workload Generation
Based on actual ML/AI cluster characteristics (a generator sketch follows this list):

- **Poisson arrivals with time-varying rate λ(t)**
  - Business hours effect (higher load 9am-5pm)
  - Non-homogeneous Poisson process
  - Burst arrivals (researchers submitting job batches)

- **Power-law job sizes** (P(k GPUs) ∝ k^(-2.5))
  - Most jobs are small: ~75% request a single GPU
  - Few jobs are large: only ~4-5% request 5+ GPUs
  - Matches the heavy-tailed job sizes seen in real ML clusters

- **Log-normal durations** with size correlation
  - Heavy-tailed distribution
  - Larger jobs run longer (correlation)
  - Range: 1 minute to 48 hours

- **Correlated characteristics**
  - Larger jobs prefer newer GPUs (H100 > A100)
  - VRAM scales with job size
  - Realistic GPU-type preferences
- **GPU tier awareness**
  - Built-in catalog describing V100/A100/H100-style tiers
  - Jobs expose preferred/acceptable GPU type lists and per-GPU VRAM minima
  - Scheduler enforces compatibility automatically
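
A minimal sketch of how such a generator can be assembled appears below. It is illustrative only: the function names, distribution parameters, and the thinning construction are assumptions, not the internals of the library's `RealisticWorkloadGenerator`.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_arrivals(horizon_h, base_rate, peak_boost=2.0):
    """Non-homogeneous Poisson arrivals via thinning: propose events at the
    maximum rate lambda_max, then accept each with probability
    lambda(t) / lambda_max. The rate is boosted during business hours."""
    lam_max = base_rate * peak_boost
    t, arrivals = 0.0, []
    while True:
        t += rng.exponential(1.0 / lam_max)
        if t >= horizon_h:
            return np.array(arrivals)
        lam_t = base_rate * (peak_boost if 9 <= t % 24 < 17 else 1.0)
        if rng.random() < lam_t / lam_max:
            arrivals.append(t)

def sample_size(max_gpus=16, alpha=2.5):
    """Discrete power law P(k) proportional to k^-alpha, truncated at the
    largest node size so every job is feasible."""
    k = np.arange(1, max_gpus + 1)
    p = k ** -alpha
    return int(rng.choice(k, p=p / p.sum()))

def sample_duration_h(size, mu=-1.0, sigma=1.3, size_coef=0.15):
    """Log-normal duration whose median grows with job size, clamped to
    the 1-minute..48-hour range described above."""
    hours = rng.lognormal(mu + size_coef * np.log(size), sigma)
    return float(np.clip(hours, 1 / 60, 48.0))

jobs = []
for t in sample_arrivals(horizon_h=24, base_rate=4.0):
    size = sample_size()
    jobs.append({"arrival_h": float(t), "gpus": size,
                 "duration_h": sample_duration_h(size)})
```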

### 2. Multi-Scenario Training

Train on diverse scenarios to learn general policies:

- **Curriculum learning**: Easy → Medium → Hard progression (one possible mix is sketched below)
- **Difficulty levels**: Based on load factor and cluster size
- **Held-out test scenarios**: Evaluate generalization
- **Automatic infeasible job filtering**: No hanging on impossible jobs
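
How the `difficulty_distribution` option might translate into a scenario mix is sketched below; the exact fractions are made up for illustration and are not the wrapper's real defaults.

```python
# Hypothetical scenario mixes per difficulty_distribution value (fractions
# are illustrative, not the library's actual weights).
SCENARIO_MIX = {
    "curriculum": [("easy", 0.4), ("medium", 0.4), ("hard", 0.2)],  # easy -> hard
    "balanced":   [("easy", 1 / 3), ("medium", 1 / 3), ("hard", 1 / 3)],
    "easy-heavy": [("easy", 0.6), ("medium", 0.3), ("hard", 0.1)],
    "hard-heavy": [("easy", 0.1), ("medium", 0.3), ("hard", 0.6)],
}

def scenario_counts(num_scenarios, distribution):
    """Turn a fractional mix into integer per-difficulty scenario counts."""
    return {name: round(frac * num_scenarios)
            for name, frac in SCENARIO_MIX[distribution]}
```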

### 3. Safety Features

- **Adaptive step limits**: Automatically scales with scenario size (num_jobs × 3)
  - Easy scenarios (17 jobs): ~50-100 steps
  - Medium scenarios (95 jobs): ~300 steps
  - Hard scenarios (760 jobs): ~2300 steps
  - Ensures 24-hour traces complete successfully
- **Infeasible job removal**: Jobs requiring more GPUs than any single node offers are filtered out (see the sketch below)
- **Proper termination**: All episodes complete without hanging
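
The sketch below shows how these two safeguards might look in code; the helper names and the `max(50, ...)` floor are assumptions, not the environment's exact implementation.

```python
def filter_infeasible(jobs, nodes):
    """Drop jobs that request more GPUs than the largest node offers,
    so no episode can stall waiting on an unschedulable job."""
    largest = max(n["gpus"] for n in nodes)
    return [j for j in jobs if j["gpus"] <= largest]

def adaptive_step_limit(num_jobs):
    """Scale the episode step budget with scenario size (num_jobs * 3),
    with a small floor so tiny scenarios still have room to finish."""
    return max(50, num_jobs * 3)
```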

## Quick Start

```python
from gym_env.multi_scenario_wrapper import create_diverse_training_env

# Create environment with 30 diverse scenarios
env = create_diverse_training_env(
    num_scenarios=30,
    difficulty_distribution='curriculum',  # or 'balanced', 'easy-heavy', 'hard-heavy'
    seed=42,
    max_steps=200  # Truncate after 200 steps
)

# Standard Gymnasium loop
obs, info = env.reset()
done = False

while not done:
    action = your_policy(obs)  # Your RL agent
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

# Get scenario info
scenario_info = env.get_scenario_info()
print(f"Difficulty: {scenario_info['difficulty']}")
print(f"Load Factor: {scenario_info['load_factor']:.2f}")
```
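
To go beyond the generic loop above, the `dev` extra pulls in `stable-baselines3`, so a PPO baseline can be trained directly on the wrapped environment. This sketch assumes the environment's observation and action spaces are SB3-compatible; if the observation is a `Dict` space, swap in `"MultiInputPolicy"`.

```python
from stable_baselines3 import PPO

from gym_env.multi_scenario_wrapper import create_diverse_training_env

env = create_diverse_training_env(
    num_scenarios=30,
    difficulty_distribution='curriculum',
    seed=42,
    max_steps=200,
)

# Train a PPO agent and save the resulting policy.
model = PPO("MlpPolicy", env, verbose=1, seed=42)
model.learn(total_timesteps=100_000)
model.save("ppo_gpu_scheduler")
```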

## Running Tests

### Test Workload Generation
```bash
python -m pytest gpu_choochoo/tests/test_gpu_tier_preferences.py
```

Validates:
- Poisson arrival process (homogeneous & time-varying; see the dispersion check below)
- Power-law job size distribution
- Log-normal duration with size correlation
- GPU type selection logic, tier preferences, and per-GPU VRAM constraints enforced by the scheduler
- Load factor computation
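
As a flavor of what such a test can assert: counts of a homogeneous Poisson process in equal-width windows have variance roughly equal to their mean, i.e. an index of dispersion near 1. The check below is a sketch, not the suite's actual test code.

```python
import numpy as np

def check_poisson_dispersion(arrivals, horizon_h):
    """Per-hour counts of homogeneous Poisson arrivals should have an
    index of dispersion (variance / mean) close to 1."""
    counts, _ = np.histogram(arrivals, bins=int(horizon_h), range=(0, horizon_h))
    dispersion = counts.var() / counts.mean()
    assert 0.7 < dispersion < 1.3, f"dispersion {dispersion:.2f} is far from 1"
```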

### Test Multi-Scenario Wrapper
```bash
python test_multi_scenario_quick.py
```

Tests:
- Multi-scenario environment creation
- Episode execution without hanging (see the sketch below)
- Performance across difficulty levels
- Proper truncation with max_steps
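
Condensed to its core, the no-hang check looks roughly like this (a sketch, not the file's actual contents):

```python
from gym_env.multi_scenario_wrapper import create_diverse_training_env

def test_episode_terminates():
    """A random policy must reach terminated or truncated within budget."""
    env = create_diverse_training_env(num_scenarios=3, seed=0, max_steps=200)
    obs, info = env.reset()
    terminated = truncated = False
    for _ in range(500):
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
        if terminated or truncated:
            break
    assert terminated or truncated
```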

## Files

### Core Implementation
- `gym_env/gpu_scheduler_env.py` - Base Gymnasium environment
- `gym_env/realistic_workload_generator.py` - Statistical workload generation
- `gym_env/multi_scenario_wrapper.py` - Multi-scenario training wrapper

### Tests
- `test_realistic_workloads.py` - Validate workload statistics
- `test_multi_scenario_quick.py` - Quick integration test
- `test_simple_scenario.py` - Debug single scenario

### Original Files
- `test_baseline.py` - Test EASY Backfilling baseline
- `test_all_baselines.py` - Compare FCFS, SJF, EASY Backfilling
- `test_gym_env.py` - Basic environment tests
- `example_rl_training.py` - Training loop template

## Workload Statistics

### Job Size Distribution (Power-Law, α=2.5)
```
1 GPU:    74.4% ####################################
2 GPUs:   14.0% #######
3 GPUs:    5.2% ##
4 GPUs:    2.1% #
5+ GPUs:   4.3% ##
```
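
These empirical fractions agree with the analytic truncated power law. A quick sanity check, assuming truncation at 16 GPUs:

```python
import numpy as np

k = np.arange(1, 17)          # job sizes, truncated at 16 GPUs (assumed cap)
p = k ** -2.5
p /= p.sum()
print(f"P(1 GPU)   = {p[0]:.1%}")         # ~75.1%
print(f"P(2 GPUs)  = {p[1]:.1%}")         # ~13.3%
print(f"P(5+ GPUs) = {p[4:].sum():.1%}")  # ~4.5%
```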

### Duration by Job Size (Log-Normal)
```
Size   | Mean Duration | Median Duration
-------|---------------|----------------
1 GPU  |   0.77 hours  |   0.28 hours
2 GPU  |   1.21 hours  |   0.41 hours
4 GPU  |   1.26 hours  |   0.41 hours
8 GPU  |   1.55 hours  |   0.58 hours
16 GPU |   1.91 hours  |   0.73 hours
```

## Difficulty Levels

### Easy
- 2 nodes, 4-8 GPUs per node
- ~17 jobs over 4 hours
- Load factor: 0.5-2.0
- No burst arrivals

### Medium
- 4 nodes, 4-8 GPUs per node
- ~95 jobs over 8 hours
- Load factor: 0.7-1.0
- 2 burst events

### Hard
- 8 nodes, 4-16 GPUs per node
- ~760 jobs over 24 hours
- Load factor: 0.8-1.2 (see the definition sketched below)
- 5 burst events
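
Load factor here follows the usual queueing definition of offered load over capacity: requested GPU-hours divided by available GPU-hours, with values near 1.0 meaning the trace roughly saturates the cluster. A sketch of that computation (field names assumed):

```python
def load_factor(jobs, total_gpus, horizon_h):
    """Offered GPU-hours / available GPU-hours over the trace horizon."""
    demand = sum(j["gpus"] * j["duration_h"] for j in jobs)
    return demand / (total_gpus * horizon_h)
```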

## Baseline Performance

Running `benchmark_baselines.py` gives the following example output:
```
======================================================================
BASELINE SCHEDULER BENCHMARK ON REALISTIC SCENARIOS
======================================================================

Generating test scenarios...
Created 30 scenarios:
  Easy: 10
  Medium: 10
  Hard: 10

======================================================================
Benchmarking: EASY Backfilling
======================================================================
  Scenario 5: easy   - Util= 18.3%, Jobs=15/15
  Scenario 10: easy   - Util= 15.2%, Jobs=19/19
  Scenario 15: medium - Util= 52.7%, Jobs=99/99
  Scenario 20: medium - Util= 26.9%, Jobs=85/85
  Scenario 25: hard   - Util= 31.9%, Jobs=731/731
  Scenario 30: hard   - Util= 27.7%, Jobs=736/736

EASY Backfilling Results:
======================================================================

EASY:
  Scenarios: 10
  Utilization: 16.08% ± 6.28%
  Range: [8.3%, 28.2%]
  Median: 13.58%
  Avg Wait Time: 0.4 minutes

MEDIUM:
  Scenarios: 10
  Utilization: 29.55% ± 10.73%
  Range: [16.6%, 52.7%]
  Median: 25.65%
  Avg Wait Time: 10.7 minutes

HARD:
  Scenarios: 10
  Utilization: 32.21% ± 6.81%
  Range: [21.8%, 43.4%]
  Median: 31.28%
  Avg Wait Time: 29.9 minutes

ALL:
  Scenarios: 30
  Utilization: 25.95% ± 10.81%
  Range: [8.3%, 52.7%]
  Median: 25.63%
  Avg Wait Time: 13.7 minutes

======================================================================
Benchmarking: Shortest Job First
======================================================================
  Scenario 5: easy   - Util= 18.3%, Jobs=15/15
  Scenario 10: easy   - Util= 15.2%, Jobs=19/19
  Scenario 15: medium - Util= 40.0%, Jobs=99/99
  Scenario 20: medium - Util= 23.6%, Jobs=85/85
  Scenario 25: hard   - Util= 31.9%, Jobs=731/731
  Scenario 30: hard   - Util= 27.7%, Jobs=736/736

Shortest Job First Results:
======================================================================

EASY:
  Scenarios: 10
  Utilization: 16.08% ± 6.28%
  Range: [8.3%, 28.2%]
  Median: 13.58%
  Avg Wait Time: 0.4 minutes

MEDIUM:
  Scenarios: 10
  Utilization: 27.66% ± 8.25%
  Range: [16.6%, 40.0%]
  Median: 23.99%
  Avg Wait Time: 10.5 minutes

HARD:
  Scenarios: 10
  Utilization: 28.32% ± 5.52%
  Range: [21.0%, 38.1%]
  Median: 26.60%
  Avg Wait Time: 18.4 minutes

ALL:
  Scenarios: 30
  Utilization: 24.02% ± 8.81%
  Range: [8.3%, 40.0%]
  Median: 23.72%
  Avg Wait Time: 9.8 minutes

======================================================================
Benchmarking: Pure FCFS
======================================================================
  Scenario 5: easy   - Util= 18.3%, Jobs=15/15
  Scenario 10: easy   - Util= 15.2%, Jobs=19/19
  Scenario 15: medium - Util= 22.4%, Jobs=99/99
  Scenario 20: medium - Util= 25.3%, Jobs=85/85
  Scenario 25: hard   - Util= 27.3%, Jobs=731/731
  Scenario 30: hard   - Util=  3.1%, Jobs=736/736

Pure FCFS Results:
======================================================================

EASY:
  Scenarios: 10
  Utilization: 16.08% ± 6.28%
  Range: [8.3%, 28.2%]
  Median: 13.58%
  Avg Wait Time: 0.4 minutes

MEDIUM:
  Scenarios: 10
  Utilization: 25.06% ± 6.37%
  Range: [16.6%, 35.3%]
  Median: 22.91%
  Avg Wait Time: 79.8 minutes

HARD:
  Scenarios: 10
  Utilization: 23.55% ± 9.23%
  Range: [3.1%, 34.2%]
  Median: 24.95%
  Avg Wait Time: 1954.4 minutes

ALL:
  Scenarios: 30
  Utilization: 21.56% ± 8.39%
  Range: [3.1%, 35.3%]
  Median: 20.84%
  Avg Wait Time: 678.2 minutes

======================================================================
BASELINE COMPARISON SUMMARY
======================================================================

Scheduler                 | Overall      | Easy         | Medium       | Hard        
------------------------------------------------------------------------------------------
EASY Backfilling          | 25.95% ± 10.81 | 16.08% ± 6.28 | 29.55% ± 10.73 | 32.21% ± 6.81
Shortest Job First        | 24.02% ± 8.81 | 16.08% ± 6.28 | 27.66% ± 8.25 | 28.32% ± 5.52
Pure FCFS                 | 21.56% ± 8.39 | 16.08% ± 6.28 | 25.06% ± 6.37 | 23.55% ± 9.23

======================================================================
TARGET FOR RL AGENT: Beat 25.95% average utilization
======================================================================
```

## Packaging and PyPI Release

The repository already contains a standard `pyproject.toml` that points `setuptools` at the `gpu_choochoo` package living one directory below the repo root. To ship a new release:

1. **Update metadata**  
   - Bump the version string in both `pyproject.toml` (`[project].version`) and `gpu_choochoo/__init__.py` (`__version__`).  
   - Fill in/adjust the author, license, and classifiers so they match the release you want to publish.

2. **Build the distribution artifacts**  
   ```bash
   python -m pip install --upgrade build twine
   rm -rf dist/ build/
   python -m build
   ```
   The `dist/` folder will contain both a source tarball and a wheel that package the `gpu_choochoo` module tree.

3. **Smoke-test the build locally**  
   ```bash
   python -m venv .venv-test
   source .venv-test/bin/activate
   python -m pip install dist/gpu_choochoo-*.whl
   python -c "import gpu_choochoo; print(gpu_choochoo.__version__)"
   ```

4. **Upload to TestPyPI (recommended) and then PyPI**  
   ```bash
   # TestPyPI
   python -m twine upload --repository testpypi dist/*

   # Production PyPI
   python -m twine upload dist/*
   ```
   Twine will prompt for an API token (username `__token__`) or can read credentials from `~/.pypirc`. To verify the TestPyPI upload before publishing for real, install with `python -m pip install --index-url https://test.pypi.org/simple/ gpu-choochoo`.

5. **Consume the published package**  
   ```bash
   python -m pip install gpu-choochoo
   ```
   Users can then import everything from `gpu_choochoo`, including `GPUSchedulerEnv`, `MultiScenarioWrapper`, and `RealisticWorkloadGenerator`. Package discovery works even though the code lives in the nested `gpu_choochoo/` directory because `setuptools` is configured to include `gpu_choochoo*`.

## Troubleshooting

### Episodes not terminating?
- Increase `max_steps` parameter (default: 1000)
- Check for infeasible jobs in logs (verbose=True)

### Low utilization?
- Check load factor of scenarios (should be 0.7-1.2)
- Verify policy is scheduling jobs (not all no-ops)
- Compare against baselines (see test_all_baselines.py)

### Jobs filtered as infeasible?
- Workload generator ensures max_gpus ≤ largest node
- Check cluster config has sufficient capacity
- Set verbose=True to see which jobs are filtered

## Next Steps

1. **Train your RL agent** on diverse scenarios
2. **Compare to baselines** (target: beat EASY Backfilling's 25.95% average utilization)
3. **Test generalization** on held-out scenarios
4. **Analyze learned policies** - what strategies emerge?
5. **Scale up** - add more scenarios, larger clusters

## References

- Workload characteristics based on Google cluster traces, Azure ML traces
- Power-law distributions: Reiss et al., "Google cluster-usage traces" (2011)
- EASY Backfilling: Lifka, "The ANL/IBM SP scheduling system" (1995)
