Metadata-Version: 2.4
Name: pydotcompute
Version: 0.2.0
Summary: Python port of DotCompute's Ring Kernel System - GPU-native actor model with persistent kernels and message passing
Project-URL: Homepage, https://github.com/mivertowski/PyDotCompute
Project-URL: Documentation, https://mivertowski.github.io/PyDotCompute/
Project-URL: Repository, https://github.com/mivertowski/PyDotCompute
Author: Michael Ivertowski
License-Expression: Apache-2.0
License-File: LICENSE
Keywords: actor-model,cuda,gpu,hpc,message-passing,parallel-computing,ring-kernel
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: System :: Distributed Computing
Classifier: Typing :: Typed
Requires-Python: >=3.11
Requires-Dist: msgpack>=1.0.0
Requires-Dist: numpy>=1.26.0
Provides-Extra: cuda
Requires-Dist: cupy-cuda12x>=13.0.0; extra == 'cuda'
Requires-Dist: numba>=0.59.0; extra == 'cuda'
Requires-Dist: pynvml>=11.5.0; extra == 'cuda'
Provides-Extra: cython
Requires-Dist: cython>=3.0.0; extra == 'cython'
Requires-Dist: setuptools>=68.0.0; extra == 'cython'
Provides-Extra: dev
Requires-Dist: hypothesis>=6.0.0; extra == 'dev'
Requires-Dist: mypy>=1.8.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=8.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Requires-Dist: uvloop>=0.19.0; (sys_platform != 'win32') and extra == 'dev'
Provides-Extra: docs
Requires-Dist: mkdocs-material>=9.0.0; extra == 'docs'
Requires-Dist: mkdocs>=1.5.0; extra == 'docs'
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == 'docs'
Provides-Extra: fast
Requires-Dist: uvloop>=0.19.0; (sys_platform != 'win32') and extra == 'fast'
Provides-Extra: metal
Requires-Dist: mlx>=0.4.0; (sys_platform == 'darwin') and extra == 'metal'
Description-Content-Type: text/markdown

# PyDotCompute

[![PyPI version](https://img.shields.io/pypi/v/pydotcompute.svg)](https://pypi.org/project/pydotcompute/)
[![Python versions](https://img.shields.io/pypi/pyversions/pydotcompute.svg)](https://pypi.org/project/pydotcompute/)
[![License](https://img.shields.io/pypi/l/pydotcompute.svg)](https://github.com/mivertowski/PyDotCompute/blob/main/LICENSE)
[![CI](https://github.com/mivertowski/PyDotCompute/actions/workflows/ci.yml/badge.svg)](https://github.com/mivertowski/PyDotCompute/actions/workflows/ci.yml)
[![Documentation](https://img.shields.io/badge/docs-mkdocs-blue.svg)](https://mivertowski.github.io/PyDotCompute/)

A Python port of DotCompute's Ring Kernel System - a GPU-native actor model with persistent kernels and message passing.

## Overview

PyDotCompute brings GPU-native actor model capabilities to Python, enabling developers to create persistent GPU kernels that communicate through message queues. This approach is ideal for:

- Real-time GPU compute pipelines
- Streaming data processing on GPU
- Actor-based GPU programming
- High-throughput message-driven architectures

## Performance Highlights

| Metric | Value |
|--------|-------|
| Message latency (p50) | **21μs** |
| Message latency (p99) | **131μs** |
| GPU graph processing | **1.7M edges/sec** |
| Actor throughput | **76K msg/sec** |
| Cython queue ops | **0.33μs** |

*Benchmarked with uvloop on Linux. See [Benchmarks](#benchmarks) for details.*

## Features

- **Ring Kernel System**: Persistent GPU kernels with infinite loops and message queues
- **High Performance**: uvloop event loop enabled automatically, for 21μs p50 message latency
- **Message Passing**: Type-safe message serialization with msgpack
- **Unified Memory**: Transparent host-device memory management with lazy synchronization
- **Lifecycle Management**: Two-phase launch (launch -> activate) with graceful shutdown
- **Telemetry**: Real-time GPU monitoring and kernel performance metrics
- **Backend Support**: CPU simulation, CUDA via Numba/CuPy, and Metal via MLX (macOS)
- **Performance Tiers**: From uvloop (default) to Cython extensions

## Installation

```bash
# Basic installation (CPU only)
pip install pydotcompute

# With CUDA support (NVIDIA GPUs)
pip install "pydotcompute[cuda]"

# With Metal support (macOS/Apple Silicon)
pip install "pydotcompute[metal]"

# With performance optimizations (uvloop - Linux/macOS)
pip install "pydotcompute[fast]"

# With Cython extensions (maximum performance)
pip install "pydotcompute[cython]"
python setup_cython.py build_ext --inplace

# Development installation
pip install -e ".[dev]"
```

## Quick Start

```python
import asyncio
from pydotcompute import RingKernelRuntime, ring_kernel, message

# Define message types
@message
class ComputeRequest:
    values: list[float]

@message
class ComputeResponse:
    result: float

# Define a ring kernel actor
@ring_kernel(
    kernel_id="compute",
    input_type=ComputeRequest,
    output_type=ComputeResponse,
)
async def compute_actor(ctx):
    while not ctx.should_terminate:
        msg = await ctx.receive()
        result = sum(msg.values)
        await ctx.send(ComputeResponse(result=result))

# Use the runtime (automatically uses uvloop for best performance)
async def main():
    async with RingKernelRuntime() as runtime:
        await runtime.launch("compute")
        await runtime.activate("compute")

        await runtime.send("compute", ComputeRequest(values=[1.0, 2.0, 3.0]))
        response = await runtime.receive("compute")

        print(f"Result: {response.result}")  # 6.0

asyncio.run(main())
```
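
The internals of `@message` are not shown in this README. As a rough mental model only, a message type can be pictured as a dataclass with a bytes round trip. The sketch below uses the standard-library `json` module as a stand-in for msgpack so that it runs without third-party packages; `sketch_message`, `to_bytes`, and `from_bytes` are hypothetical names, not PyDotCompute API.

```python
import json
from dataclasses import asdict, dataclass


def sketch_message(cls):
    """Hypothetical stand-in for @message: a dataclass plus a bytes round trip."""
    cls = dataclass(cls)

    def to_bytes(self) -> bytes:
        # json replaces msgpack here so the sketch has no external dependency
        return json.dumps(asdict(self)).encode("utf-8")

    def from_bytes(klass, payload: bytes):
        # Rebuild the instance from the decoded field dictionary
        return klass(**json.loads(payload.decode("utf-8")))

    cls.to_bytes = to_bytes
    cls.from_bytes = classmethod(from_bytes)
    return cls


@sketch_message
class ComputeRequest:
    values: list[float]


original = ComputeRequest(values=[1.0, 2.0, 3.0])
restored = ComputeRequest.from_bytes(original.to_bytes())
```

Because the decorator builds on `dataclass`, `restored == original` compares field by field.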

## Performance Tiers

PyDotCompute offers three performance tiers to match your use case:

| Tier | Implementation | Latency (p50) | Use Case |
|------|---------------|---------------|----------|
| **1 (Default)** | uvloop + FastMessageQueue | **21μs** | Async Python code |
| 2 | ThreadedRingKernel | ~100μs | Blocking I/O, C extensions |
| 3 | CythonRingKernel | **0.33μs** queue ops | Multi-process IPC |

### Tier 1: Async (Default)

Enabled automatically when you import `pydotcompute`. Uses uvloop on Linux/macOS when it is installed (for example via the `fast` extra).

```python
async with RingKernelRuntime() as runtime:
    # uvloop is auto-installed for 21μs latency
    await runtime.launch("my_kernel")
    await runtime.activate("my_kernel")
```

### Tier 2: Threaded

For blocking operations or GIL-releasing C extensions:

```python
from pydotcompute.ring_kernels import ThreadedRingKernel, ThreadedKernelContext

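def process(msg):
    # Placeholder: replace with your blocking or GIL-releasing work
    return msg

def blocking_kernel(ctx: ThreadedKernelContext):
    while not ctx.should_terminate:
        msg = ctx.receive(timeout=0.1)
        if msg:
            ctx.send(process(msg))

# `request` stands for any message your kernel accepts
with ThreadedRingKernel("worker", blocking_kernel) as kernel:
    kernel.send(request)
    response = kernel.receive()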
```
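
To make the polling pattern in `blocking_kernel` concrete without installing the package, here is a minimal standard-library sketch of a worker thread that receives with a timeout and checks a stop flag between polls. It is an illustration of the general pattern, not how `ThreadedRingKernel` is implemented; `run_threaded_worker` is a hypothetical helper.

```python
import queue
import threading


def run_threaded_worker(handler, requests):
    """Hypothetical sketch: feed requests to a worker thread and collect replies."""
    inbox: queue.Queue = queue.Queue()
    outbox: queue.Queue = queue.Queue()
    stop = threading.Event()

    def loop():
        # Poll with a timeout so the stop flag is checked regularly,
        # mirroring the `while not ctx.should_terminate` loop above.
        while not stop.is_set():
            try:
                msg = inbox.get(timeout=0.1)
            except queue.Empty:
                continue
            outbox.put(handler(msg))

    worker = threading.Thread(target=loop, daemon=True)
    worker.start()
    for item in requests:
        inbox.put(item)
    # One worker and FIFO queues keep replies in request order
    replies = [outbox.get(timeout=5.0) for _ in requests]
    stop.set()
    worker.join()
    return replies


replies = run_threaded_worker(sum, [[1.0, 2.0], [3.0, 4.0, 5.0]])
```

The short receive timeout bounds how long shutdown can take: the worker notices the stop flag within one polling interval.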

### Tier 3: Cython (Maximum Performance)

For multi-process scenarios or Cython extensions:

```python
from pydotcompute.ring_kernels import CythonRingKernel, is_cython_kernel_available

if is_cython_kernel_available():
    # 0.33μs queue operations
    with CythonRingKernel("fast_worker", my_kernel) as kernel:
        kernel.send(request)
```
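
The README does not describe how the Cython queue works internally. For readers unfamiliar with single-producer/single-consumer (SPSC) queues, the following pure-Python sketch shows the general ring-buffer technique: one index advanced only by the producer, one only by the consumer. `SpscRing` is a hypothetical class for illustration and makes no claim about the layout or performance of `FastSPSCQueue`.

```python
class SpscRing:
    """Hypothetical fixed-capacity single-producer/single-consumer ring buffer."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._slots = [None] * capacity
        self._capacity = capacity
        self._head = 0  # total items popped; advanced only by the consumer
        self._tail = 0  # total items pushed; advanced only by the producer

    def try_push(self, item) -> bool:
        if self._tail - self._head == self._capacity:
            return False  # full: the producer must retry later
        self._slots[self._tail % self._capacity] = item
        self._tail += 1
        return True

    def try_pop(self):
        if self._head == self._tail:
            return None  # empty (so None itself cannot be used as an item)
        index = self._head % self._capacity
        item = self._slots[index]
        self._slots[index] = None  # drop the reference held by the slot
        self._head += 1
        return item


ring = SpscRing(capacity=2)
pushed = [ring.try_push(x) for x in ("a", "b", "c")]
popped = [ring.try_pop() for _ in range(3)]
```

Both operations touch a fixed number of slots and indices, which is why queues of this shape have constant-time push and pop.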

## Benchmarks

### Message Latency

```
GPU Actors (1000 samples):
  p50:  63μs
  p95:  103μs
  p99:  131μs
  mean: 70μs
```
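
The benchmark scripts are not reproduced here. As a hedged illustration of how p50/p95/p99 figures like the ones above are typically derived, the sketch below times round trips through a plain `asyncio.Queue` and applies a nearest-rank percentile. It measures a standard-library queue, not a GPU actor, so its numbers are not comparable to the table; `percentile` and `measure_round_trips` are hypothetical helpers.

```python
import asyncio
import math
import time


def percentile(sorted_samples, fraction):
    """Nearest-rank percentile over an already sorted, non-empty list."""
    rank = max(1, math.ceil(fraction * len(sorted_samples)))
    return sorted_samples[rank - 1]


async def measure_round_trips(samples: int = 1000):
    """Time put/get on an asyncio.Queue and summarize in microseconds."""
    q: asyncio.Queue = asyncio.Queue()
    latencies_us = []
    for _ in range(samples):
        start = time.perf_counter_ns()
        await q.put(start)
        await q.get()
        latencies_us.append((time.perf_counter_ns() - start) / 1_000)
    latencies_us.sort()
    return {
        "p50": percentile(latencies_us, 0.50),
        "p95": percentile(latencies_us, 0.95),
        "p99": percentile(latencies_us, 0.99),
        "mean": sum(latencies_us) / len(latencies_us),
    }


stats = asyncio.run(measure_round_trips())
```

Reporting percentiles alongside the mean matters because a few slow outliers can move the mean well above the median.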

### Graph Processing (PageRank)

| Graph Size | CPU Sparse | GPU Batch | Speedup |
|------------|------------|-----------|---------|
| 1K nodes   | 6.8ms      | 64ms      | CPU wins |
| 5K nodes (dense) | 256ms | 200ms | **GPU 1.28x** |
| 1M nodes   | 39.6s      | 4.25s     | **GPU 9.3x** |

**Crossover**: GPU wins at 50K+ nodes
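
For context on the workload being timed, here is a textbook power-iteration PageRank in pure Python. It is a minimal sketch of the algorithm itself, not the "CPU Sparse" or "GPU Batch" implementation used in the benchmark; the damping factor of 0.85 and the iteration count are conventional defaults chosen for illustration.

```python
def pagerank(out_edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> list of successors."""
    nodes = list(out_edges)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-edges is spread evenly over all nodes
        dangling = sum(rank[node] for node in nodes if not out_edges[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            targets = out_edges[node]
            for target in targets:
                # Each node splits its rank equally among its successors
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
```

Each iteration visits every edge once, so the cost per iteration grows with the edge count. This is the kind of regular, data-parallel work that tends to favor a GPU once the graph is large enough to amortize launch and transfer costs.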

### Streaming Throughput

| Scenario | GPU Actors | Advantage |
|----------|-----------|-----------|
| Persistent state | Yes | No repeated GPU transfers |
| Transfer overhead | 0% | vs 16-28% for batch |
| Best for | Long-running pipelines | Context preservation |

## Architecture

```
PyDotCompute Ring Kernel System
├── Ring Kernels          │ Performance Tiers      │ GPU Backends
│   • RingKernelRuntime   │ • uvloop (21μs)        │ CUDA:
│   • FastMessageQueue    │ • ThreadedRingKernel   │ • Numba JIT, CuPy arrays
│   • @ring_kernel        │ • CythonRingKernel     │ • Zero-copy DMA, PTX caching
│   • @message            │ • FastSPSCQueue        │ Metal (macOS):
│                         │                        │ • MLX, Unified memory
├─────────────────────────┴────────────────────────┴─────────────────
│ Memory: UnifiedBuffer (.host, .device, .metal), MemoryPool, Accelerator
```

## Core Components

### UnifiedBuffer

Transparent host-device memory management:

```python
import asyncio

import numpy as np
from pydotcompute import UnifiedBuffer


async def main():
    buffer = UnifiedBuffer((1000,), dtype=np.float32)
    buffer.allocate()

    # Write on host
    buffer.host[:] = np.random.randn(1000)
    buffer.mark_host_dirty()

    # Access on CUDA device (auto-syncs)
    await buffer.ensure_on_device()
    device_data = buffer.device

    # Access on Metal/macOS (auto-syncs)
    metal_data = buffer.metal  # MLX array


# `await` is only valid inside a coroutine, so run the example via asyncio
asyncio.run(main())
```
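
The "lazy synchronization" idea can be illustrated with a dirty flag: a copy happens only when a stale side is actually read. The sketch below uses plain Python lists in place of host and device memory and is an assumption about the general technique, not the implementation of `UnifiedBuffer`; `LazySyncBuffer` and its `copies` counter are hypothetical.

```python
class LazySyncBuffer:
    """Hypothetical two-copy buffer that syncs only when a stale side is read."""

    def __init__(self, size: int) -> None:
        self._host = [0.0] * size
        self._device = [0.0] * size  # plain list standing in for GPU memory
        self._host_dirty = False
        self.copies = 0  # number of simulated host-to-device transfers

    @property
    def host(self):
        return self._host

    def mark_host_dirty(self) -> None:
        # The caller declares that host data changed; nothing is copied yet
        self._host_dirty = True

    @property
    def device(self):
        if self._host_dirty:
            # Copy on first device access after a host write, then clear the flag
            self._device[:] = self._host
            self._host_dirty = False
            self.copies += 1
        return self._device


buf = LazySyncBuffer(4)
buf.host[:] = [1.0, 2.0, 3.0, 4.0]
buf.mark_host_dirty()
first = list(buf.device)
second = list(buf.device)  # already clean, so no second copy
```

Deferring the copy means several host writes in a row cost a single transfer, at the price of requiring an explicit `mark_host_dirty()` call.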

### Ring Kernels

Persistent actors with message queues:

```python
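from pydotcompute import ring_kernel

@ring_kernel(kernel_id="processor", queue_size=4096)
async def processor(ctx):
    while not ctx.should_terminate:
        msg = await ctx.receive(timeout=0.1)
        if msg is None:
            # Assumption: a timed-out receive yields None, as the threaded example suggests
            continue
        # Process the message; echoing it back is only a placeholder
        response = msg
        await ctx.send(response)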
```

### Lifecycle Management

```python
async with RingKernelRuntime() as runtime:
    # Phase 1: Launch (allocate resources)
    await runtime.launch("my_kernel")

    # Phase 2: Activate (start processing)
    await runtime.activate("my_kernel")

    # Use the kernel...

    # Deactivate (pause), then reactivate (resume); termination is not shown here
    await runtime.deactivate("my_kernel")
    await runtime.reactivate("my_kernel")
```
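
The lifecycle calls above imply an ordering: a kernel is launched before it is activated, and reactivated only after being deactivated. One way to picture that ordering is a small transition table. The state names and the set of allowed transitions below are assumptions made for illustration; the README only names the actions, and `lifecycle.py` may model this differently.

```python
from enum import Enum, auto


class State(Enum):
    CREATED = auto()
    LAUNCHED = auto()
    ACTIVE = auto()
    DEACTIVATED = auto()
    TERMINATED = auto()


# Allowed (state, action) -> next state; names and edges are assumptions
TRANSITIONS = {
    (State.CREATED, "launch"): State.LAUNCHED,
    (State.LAUNCHED, "activate"): State.ACTIVE,
    (State.ACTIVE, "deactivate"): State.DEACTIVATED,
    (State.DEACTIVATED, "reactivate"): State.ACTIVE,
    (State.LAUNCHED, "terminate"): State.TERMINATED,
    (State.ACTIVE, "terminate"): State.TERMINATED,
    (State.DEACTIVATED, "terminate"): State.TERMINATED,
}


def step(state: State, action: str) -> State:
    """Apply one lifecycle action, rejecting transitions not in the table."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} from {state.name}") from None


state = State.CREATED
for action in ("launch", "activate", "deactivate", "reactivate", "terminate"):
    state = step(state, action)
```

Separating launch from activate lets resources be allocated for several kernels before any of them starts processing messages.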

### Metal Backend (macOS)

GPU acceleration on Apple Silicon using MLX:

```python
from pydotcompute.backends.metal import MetalBackend, get_vector_add_kernel
import numpy as np

# Initialize backend
backend = MetalBackend()

if backend.is_available:
    # Copy data to Metal GPU
    a = backend.copy_to_device(np.array([1, 2, 3], dtype=np.float32))
    b = backend.copy_to_device(np.array([4, 5, 6], dtype=np.float32))

    # Use pre-built kernels
    add_kernel = get_vector_add_kernel()
    result = add_kernel(a, b)  # [5, 7, 9]

    # Copy back to host
    result_np = backend.copy_to_host(result)

    # Or compile custom kernels
    compiled = backend.compile_kernel(lambda x: x * 2 + 1)
    output = compiled(np.array([1, 2, 3], dtype=np.float32))
```

## Project Structure

```
pydotcompute/
├── core/
│   ├── accelerator.py      # GPU device abstraction
│   ├── unified_buffer.py   # Host-device memory
│   ├── memory_pool.py      # Memory pooling
│   └── orchestrator.py     # Compute coordination
├── ring_kernels/
│   ├── runtime.py          # Main runtime (uvloop)
│   ├── message.py          # Message serialization
│   ├── queue.py            # Async message queues
│   ├── fast_queue.py       # O(1) priority queue
│   ├── lifecycle.py        # Kernel lifecycle
│   ├── telemetry.py        # Performance monitoring
│   ├── _loop.py            # uvloop auto-install
│   ├── sync_queue.py       # Threading queues
│   ├── threaded_kernel.py  # Tier 2 kernel
│   ├── cython_kernel.py    # Tier 3 kernel
│   └── _cython/            # Cython extensions
│       └── fast_spsc.pyx   # 0.33μs queue
├── backends/
│   ├── cpu.py              # CPU simulation
│   ├── cuda.py             # CUDA via Numba/CuPy
│   └── metal.py            # Metal via MLX (macOS)
├── compilation/
│   ├── compiler.py         # Kernel compilation
│   └── cache.py            # PTX caching
└── decorators/
    ├── kernel.py           # @kernel decorator
    ├── ring_kernel.py      # @ring_kernel decorator
    └── validators.py       # Runtime validation
```

## Testing

```bash
# Run all tests (398 passing)
pytest

# Run with coverage
pytest --cov=pydotcompute

# Run only unit tests
pytest tests/unit/

# Skip CUDA tests (if no GPU)
pytest -m "not cuda"

# Skip Metal tests (if not on macOS)
pytest -m "not metal"

# Run benchmarks
python benchmarks/extended_benchmark.py
python benchmarks/pagerank_benchmark.py
python benchmarks/realtime_anomaly_benchmark.py
python benchmarks/metal_benchmark.py  # macOS only
```

## Requirements

- Python >= 3.11
- numpy >= 1.26.0
- msgpack >= 1.0.0

### Optional Dependencies

| Package | Purpose |
|---------|---------|
| uvloop | 20-40% faster event loop (Linux/macOS) |
| cupy-cuda12x | CUDA array operations |
| numba | GPU kernel JIT compilation |
| pynvml | GPU monitoring |
| mlx | Metal GPU acceleration (macOS only) |
| cython | Maximum performance queues |

## Disabling uvloop

If you need to disable uvloop auto-installation:

```bash
PYDOTCOMPUTE_NO_UVLOOP=1 python my_script.py
```
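
If you want to check from Python which event loop a script would end up with, a helper along these lines can mirror the decision. This is a sketch under stated assumptions: the README only shows `PYDOTCOMPUTE_NO_UVLOOP=1`, so treating any non-empty value other than `"0"` as "disabled" is a guess, and `choose_event_loop` is a hypothetical function, not part of PyDotCompute.

```python
import importlib.util
import os


def choose_event_loop(environ=os.environ) -> str:
    """Return "uvloop" or "asyncio" without installing or importing a loop."""
    # Assumption: any non-empty value other than "0" disables uvloop
    if environ.get("PYDOTCOMPUTE_NO_UVLOOP", "") not in ("", "0"):
        return "asyncio"
    if importlib.util.find_spec("uvloop") is None:
        return "asyncio"  # uvloop is an optional extra and may be absent
    return "uvloop"


disabled = choose_event_loop({"PYDOTCOMPUTE_NO_UVLOOP": "1"})
```

Passing the environment as an argument keeps the function easy to test without mutating `os.environ`.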

## Contributing

Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines and `docs/IMPLEMENTATION_PLAN.md` for the project roadmap.

## License

Apache License 2.0. See the [LICENSE](LICENSE) file for details.

## Related

- [DotCompute](https://github.com/mivertowski/DotCompute) - Original .NET implementation
- [Numba CUDA](https://numba.readthedocs.io/en/stable/cuda/) - Python CUDA JIT
- [CuPy](https://cupy.dev/) - NumPy-compatible GPU arrays
- [MLX](https://ml-explore.github.io/mlx/) - Apple's ML framework for Apple Silicon
- [uvloop](https://github.com/MagicStack/uvloop) - Fast asyncio event loop
