Metadata-Version: 2.4
Name: PyGPUkit
Version: 0.2.0
Summary: A lightweight GPU runtime for Python with Rust-powered scheduler, NVRTC JIT compilation, and NumPy-like API
Keywords: gpu,cuda,nvrtc,jit,numpy,array
Author: m96-chan
License-Expression: MIT
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Microsoft :: Windows
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Project-URL: Homepage, https://github.com/m96-chan/PyGPUkit
Project-URL: Repository, https://github.com/m96-chan/PyGPUkit
Project-URL: Issues, https://github.com/m96-chan/PyGPUkit/issues
Requires-Python: >=3.10
Requires-Dist: numpy>=1.20.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: mypy>=1.0.0; extra == "dev"
Requires-Dist: psutil>=5.0.0; extra == "dev"
Requires-Dist: pre-commit>=3.0.0; extra == "dev"
Description-Content-Type: text/markdown


# PyGPUkit — Lightweight GPU Runtime for Python
*A minimal, modular GPU runtime with Rust-powered scheduler, NVRTC JIT compilation, and a clean NumPy-like API.*

[![PyPI version](https://badge.fury.io/py/PyGPUkit.svg)](https://badge.fury.io/py/PyGPUkit)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

---

## Overview
**PyGPUkit** is a lightweight GPU runtime for Python that provides:
- **Rust-powered scheduler** with admission control, QoS, and resource partitioning
- NVRTC-based JIT kernel compilation
- A NumPy-like `GPUArray` type
- Kubernetes-inspired GPU scheduling (bandwidth + memory guarantees)
- Extensible operator set (add/mul/matmul, custom kernels)
- Minimal dependencies and embeddable runtime

PyGPUkit aims to be the "micro-runtime for GPU computing": small, fast, and ideal for research, inference tooling, DSP, and real-time systems.

---

## v0.2 Features (NEW)

### Core Infrastructure (Rust)
| Feature | Description |
|---------|-------------|
| **Memory Pool** | LRU eviction, size-class free lists |
| **Scheduler** | Priority queue, memory reservation |
| **Transfer Engine** | Separate H2D/D2H streams, priority |
| **Kernel Dispatch** | Per-stream limits, lifecycle tracking |
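The pool design in the table (size-class free lists plus LRU eviction under a quota) can be sketched in pure Python. This is purely illustrative; `SimplePool` is not the actual Rust `MemoryPool` API:

```python
from collections import OrderedDict

class SimplePool:
    """Illustrative sketch of a quota-bound pool with LRU eviction.

    Sizes are rounded up to power-of-two size classes, and the
    least-recently-allocated blocks are evicted when the quota
    would otherwise be exceeded.
    """

    def __init__(self, quota: int):
        self.quota = quota
        self.used = 0
        self.blocks: OrderedDict[int, int] = OrderedDict()  # block id -> size
        self._next_id = 0

    @staticmethod
    def size_class(n: int) -> int:
        # Round the request up to the next power of two (its size class)
        c = 1
        while c < n:
            c *= 2
        return c

    def allocate(self, n: int) -> int:
        size = self.size_class(n)
        # Evict least-recently-used blocks until the request fits
        while self.used + size > self.quota and self.blocks:
            _, freed = self.blocks.popitem(last=False)
            self.used -= freed
        if self.used + size > self.quota:
            raise MemoryError("request exceeds pool quota")
        bid = self._next_id
        self._next_id += 1
        self.blocks[bid] = size
        self.used += size
        return bid

pool = SimplePool(quota=1024)
a = pool.allocate(300)   # rounded up to the 512 size class
b = pool.allocate(300)   # another 512; pool is now full
c = pool.allocate(100)   # 128; evicts block `a` (LRU) to make room
```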

### Advanced Features (Rust)
| Feature | Description |
|---------|-------------|
| **Admission Control** | Deterministic admission, quota enforcement |
| **QoS Policy** | Guaranteed/Burstable/BestEffort tiers |
| **Kernel Pacing** | Bandwidth-based throttling per stream |
| **Micro-Slicing** | Kernel splitting, round-robin fairness |
| **Pinned Memory** | Page-locked host memory with pooling |
| **Kernel Cache** | PTX caching, LRU eviction, TTL |
| **GPU Partitioning** | Resource isolation, multi-tenant support |
| **Tiled Matmul** | Shared memory + double buffering |

### Performance (RTX 3090 Ti)
| Matrix Size | Performance | vs NumPy |
|-------------|-------------|----------|
| 512x512 | 1262 GFLOPS | 11.6x |
| 1024x1024 | 1350 GFLOPS | 2.2x |
| 2048x2048 | 4417 GFLOPS | 6.1x |
| 4096x4096 | **6555 GFLOPS** | 7.9x |
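The throughput figures follow the standard 2·N³ floating-point operation count for an N×N matmul (one multiply and one add per inner-product term):

```python
def matmul_gflops(n: int, seconds: float) -> float:
    """GFLOPS for an n x n matmul: 2*n^3 FLOPs (multiply + add)."""
    return 2 * n**3 / seconds / 1e9

# e.g. a 4096x4096 matmul finishing in ~21 ms works out to roughly 6500 GFLOPS
gflops = matmul_gflops(4096, 0.021)
```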

---

## Features
- **Lightweight** — no PyTorch/CuPy overhead
- **Modular** — runtime / memory / scheduler / JIT / ops
- **Rust Backend** — memory pool, scheduler, dispatch in Rust
- **GPUArray** with NumPy interop
- **NVRTC JIT** for CUDA kernels
- **Advanced Scheduler** with memory & bandwidth guarantees
- **106 Rust tests** for core components

---

## Installation

```bash
pip install pygpukit
```

From source:

```bash
git clone https://github.com/m96-chan/PyGPUkit
cd PyGPUkit
pip install -e .
```

Requirements:
- Python 3.10+
- CUDA 11+
- NVRTC (included with the CUDA Toolkit)
- NVIDIA GPU

**Supported GPUs:**
- RTX 30XX series (Ampere) and above
- Performance tuning targets GPUs with large L2 caches (6 MB+)
- Older GPUs (RTX 20XX, GTX 10XX, etc.) are not tuned and may see suboptimal performance

---

## Project Goals
1. Provide the smallest usable GPU runtime for Python
2. Expose GPU scheduling (bandwidth, memory, partitioning)
3. Make writing custom GPU kernels easy
4. Serve as a building block for inference engines, DSP systems, and real-time workloads

---

## Usage Examples

### Allocate Arrays
```python
import pygpukit as gp

x = gp.zeros((1024, 1024), dtype="float32")
y = gp.ones((1024, 1024), dtype="float32")
```

### Basic Operations
```python
z = gp.add(x, y)
w = gp.matmul(x, y)
```

### CPU <-> GPU Transfer
```python
arr = z.to_numpy()
garr = gp.from_numpy(arr)
```

### Custom NVRTC Kernel
```cuda
extern "C" __global__
void scale(float* x, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) x[idx] *= factor;
}
```

```python
# src holds the CUDA source above as a Python string
kernel = gp.jit(src, func="scale")
kernel(x, factor=0.5, n=x.size)
```

### Rust Scheduler (v0.2)
```python
import _pygpukit_rust as rust

# Memory Pool with LRU eviction
pool = rust.MemoryPool(quota=100 * 1024 * 1024, enable_eviction=True)
block = pool.allocate(4096)

# QoS-aware task scheduling
evaluator = rust.QosPolicyEvaluator(total_memory=8*1024**3, total_bandwidth=1.0)
task = rust.QosTaskMeta.guaranteed("task-1", "Critical Task", 256*1024*1024)
result = evaluator.evaluate(task)

# GPU Partitioning
manager = rust.PartitionManager(rust.PartitionConfig(total_memory=8*1024**3))
manager.create_partition("inference", "Inference",
    rust.PartitionLimits().memory(4*1024**3).compute(0.5))
```

---

# Scheduler — Kubernetes-Inspired GPU Orchestration

PyGPUkit includes an experimental scheduler that treats a single GPU as a **multi-tenant compute node**, similar to how Kubernetes orchestrates CPU workloads. The goal is to provide **resource isolation, guarantees, and fair sharing** across multiple GPU tasks.

### **Core Capabilities**

---

## **1. GPU Memory Reservation**
Tasks may request a guaranteed block of GPU memory.

- Hard guarantees -> task is rejected if memory cannot be allocated
- Soft guarantees -> best-effort allocation
- Overcommit strategies (evict to host when pressure is high)
- Reclaim policies (LRU GPUArray eviction)

**Example:**
```python
task = scheduler.submit(
    fn,
    memory="512MB",
)
```
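The hard/soft distinction above amounts to a different failure mode when the reservation cannot be fully satisfied. A minimal sketch with hypothetical names (not the actual scheduler internals):

```python
def try_reserve(requested: int, free: int, hard: bool) -> int:
    """Return the number of bytes reserved for a task.

    Hard guarantee: reject (raise) if the full request cannot be met.
    Soft guarantee: best effort, grant whatever is available.
    """
    if requested <= free:
        return requested
    if hard:
        raise MemoryError(f"cannot reserve {requested} bytes ({free} free)")
    return free  # best-effort: partial grant

assert try_reserve(256, 512, hard=True) == 256    # full grant
assert try_reserve(512, 256, hard=False) == 256   # partial, best-effort
```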

---

## **2. GPU Bandwidth Guarantees / Throttling**
Tasks may request a specific percentage of GPU compute bandwidth.

Bandwidth control is implemented via:
- Stream priority
- Kernel pacing (launch intervals)
- Micro-slicing large kernels
- Cooperative time-quantized scheduling
- Persistent dispatcher kernels (planned)

**Example:**
```python
task = scheduler.submit(
    fn,
    bandwidth=0.20,   # 20% GPU compute share
)
```
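Kernel pacing enforces a bandwidth share by stretching the interval between launches: if a kernel occupies the GPU for `t` seconds and the task is entitled to a share `s`, launches are spaced at least `t / s` apart. A pure-Python sketch of that relationship (illustrative, not the Rust pacing engine):

```python
def pacing_interval(kernel_time: float, share: float) -> float:
    """Minimum seconds between launches so that GPU time consumed
    over any window stays at or below `share` of the window."""
    if not 0 < share <= 1:
        raise ValueError("share must be in (0, 1]")
    return kernel_time / share

# A 2 ms kernel at a 20% share launches at most every 10 ms,
# occupying 2 ms of every 10 ms window = 20% of compute bandwidth.
interval = pacing_interval(0.002, 0.20)
```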

---

## **3. Logical GPU Partitioning**
PyGPUkit implements **software-defined GPU slicing**, similar in spirit to Kubernetes device plugin resource partitioning.

Slices may define:
- Memory quota
- Bandwidth share
- Stream priority band
- Isolation level

Useful for:
- Multi-tenant inference servers
- Real-time audio/DSP workloads
- Background/foreground GPU task separation
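At its core, slice accounting reduces to per-partition quotas debited at allocation time. A pure-Python sketch of that bookkeeping (a hypothetical `Partition` class, not the `PartitionManager` internals):

```python
class Partition:
    """Illustrative software-defined GPU slice: a memory quota
    plus a fraction of compute bandwidth."""

    def __init__(self, name: str, memory_quota: int, compute_share: float):
        self.name = name
        self.memory_quota = memory_quota
        self.compute_share = compute_share  # fraction of GPU bandwidth
        self.memory_used = 0

    def allocate(self, n: int) -> None:
        # Allocations are debited against this slice only
        if self.memory_used + n > self.memory_quota:
            raise MemoryError(f"partition {self.name!r} quota exceeded")
        self.memory_used += n

# Two tenants sharing one 8 GiB GPU, isolated from each other
inference = Partition("inference", memory_quota=4 * 1024**3, compute_share=0.5)
batch = Partition("batch", memory_quota=4 * 1024**3, compute_share=0.5)

inference.allocate(1 * 1024**3)  # fits within the 4 GiB slice
```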

---

## **4. Scheduling Policies**
The scheduler supports multiple policies:

- **Guaranteed** — exclusive reservation, strict QoS
- **Burstable** — partial guarantees, opportunistic bandwidth
- **BestEffort** — uses leftover GPU cycles
- **Priority scheduling**
- **Deadline scheduling** (planned)
- **Weighted fair sharing**

**Example:**
```python
task = scheduler.submit(
    fn,
    policy="guaranteed",
    memory="1GB",
    bandwidth=0.10,
)
```

---

## **5. Admission Control**
Before executing a task, the scheduler performs:

- Resource validation
- Quota check
- QoS matching
- Scheduling feasibility

Results in:
- **admitted**
- **queued**
- **rejected**
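One way to picture the pipeline is a single decision function folding those checks into one of the three outcomes (illustrative pseudologic, not the Rust admission-control API):

```python
def admit(memory: int, bandwidth: float, free_memory: int,
          free_bandwidth: float, quota: int) -> str:
    """Fold validation, quota, and feasibility checks into
    'admitted' / 'queued' / 'rejected'."""
    # Resource validation: malformed requests are rejected outright
    if memory <= 0 or not 0 < bandwidth <= 1:
        return "rejected"
    # Quota check: requests beyond the tenant's quota can never run
    if memory > quota:
        return "rejected"
    # Feasibility: valid requests wait if resources are busy right now
    if memory > free_memory or bandwidth > free_bandwidth:
        return "queued"
    return "admitted"

assert admit(256, 0.2, free_memory=512, free_bandwidth=1.0, quota=1024) == "admitted"
assert admit(256, 0.2, free_memory=128, free_bandwidth=1.0, quota=1024) == "queued"
assert admit(2048, 0.2, free_memory=512, free_bandwidth=1.0, quota=1024) == "rejected"
```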

---

## **6. Monitoring & Introspection**
PyGPUkit exposes live metrics:

- Memory usage per task
- SM occupancy and GPU utilization
- Throttling / pacing logs
- Queue position / execution state
- Reclaim/eviction count

**Example:**
```python
stats = scheduler.stats(task_id)
```

---

## **7. Soft Isolation Model**
While this is not hardware-level isolation, each GPU task receives:

- Dedicated stream groups
- Guaranteed memory pools
- Kernel pacing to enforce bandwidth
- Optional sandboxed GPUArray region

This provides practical multi-tenant safety without requiring MIG or MPS.

---

# Project Structure
```
PyGPUkit/
  src/pygpukit/    # Python API (NumPy-compatible)
  native/          # C++ backend (CUDA Driver/Runtime/NVRTC)
  rust/            # Rust backend (memory pool, scheduler, dispatch)
    pygpukit-core/   # Pure Rust core logic
    pygpukit-python/ # PyO3 bindings
  examples/        # Demo scripts
  tests/           # Test suite
```

---

## Roadmap

### **v0.1 (Released)**
- [x] GPUArray
- [x] NVRTC JIT
- [x] add/mul/matmul ops
- [x] Basic stream manager
- [x] Packaging + wheels

### **v0.2 (Released)**
- [x] Rust Memory Pool (LRU, size-class)
- [x] Rust Scheduler (priority, memory reservation)
- [x] Rust Transfer Engine (async H2D/D2H)
- [x] Rust Kernel Dispatch Controller
- [x] Admission Control
- [x] QoS Policy Framework (Guaranteed/Burstable/BestEffort)
- [x] Kernel Pacing Engine
- [x] Micro-Slicing Framework
- [x] Pinned Memory Support
- [x] Kernel Cache (PTX caching)
- [x] GPU Partitioning
- [x] Tiled Matmul (shared memory)
- [x] 106 Rust tests

### **v0.3 (Planned)**
- [ ] Triton optional backend
- [ ] Advanced ops (softmax, layernorm)
- [ ] Inference-oriented plugin system
- [ ] MPS/MIG integration

---

## Contributing
Contributions and discussions are welcome!
Please open Issues for feature requests, bugs, or design proposals.

---

## License
MIT License

---

## Acknowledgements
Inspired by:
- CUDA Runtime
- NVRTC
- PyCUDA
- CuPy
- Triton

PyGPUkit aims to fill the gap for a tiny, embeddable GPU runtime for Python.
