Metadata-Version: 2.4
Name: tileops
Version: 0.0.1.dev2
Summary: TileOPs kernels for efficient inference
Author: Tile-AI
License-Expression: MIT
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.1.0
Requires-Dist: numpy>=1.23.5
Requires-Dist: einops>=0.6.0
Requires-Dist: Cython
Requires-Dist: tqdm>=4.62.3
Requires-Dist: typing_extensions>=4.10.0
Requires-Dist: cloudpickle
Requires-Dist: ml_dtypes
Requires-Dist: psutil
Requires-Dist: tilelang>=0.1.6.post1
Provides-Extra: dev
Requires-Dist: yapf==0.40.2; extra == "dev"
Requires-Dist: toml==0.10.2; extra == "dev"
Requires-Dist: tomli==2.0.1; extra == "dev"
Requires-Dist: ruff==0.6.5; extra == "dev"
Requires-Dist: codespell==2.3.0; extra == "dev"
Requires-Dist: clang-format==15.0.7; extra == "dev"
Dynamic: license-file

# TileOPs (TOP)

**TileOPs (TOP)** is a collection of high-performance machine learning operators built on top of [TileLang](https://github.com/tile-ai/tilelang). It offers efficient, modular, and composable kernel implementations optimized for AI workloads.

Note: TileOPs is still under rapid development.

---

![DeepSeek-V3.2-Exp DeepSeek Sparse Attention (DSA) performance on H800 SXM](https://raw.githubusercontent.com/tile-ai/TileOPs/main/docs/figures/sparse_mla_perf.png)
*DeepSeek-V3.2-Exp DeepSeek Sparse Attention (DSA) performance on H800 SXM*

## 📦 Installation

### Requirements

- Python 3.8+
- PyTorch >= 2.1
- GLIBCXX_3.4.32
- [TileLang](https://github.com/tile-ai/tilelang)

### Method 1: Install with Pip

```bash
pip install tileops
```

### Method 2: Install from source (editable mode for development)

```bash
git clone https://github.com/tile-ai/TileOPs
cd TileOPs
pip install -e '.[dev]' -v  # drop -e for a non-editable install; -v enables verbose output
```

## 🚀 Quick Usage

### Sparse MLA

```python
import torch
from top import SparseMLAKernel

batch_size = 1
seq_len = 1024
seq_len_kv = 2048
q_start_index_s = 1024
n_heads = 128
head_dim = 512
tail_dim = 64
topk = 2048
kv_stride = 1
kv_group = 1
sm_scale = None

sparse_mla = SparseMLAKernel(
    batch=batch_size,
    seq_len=seq_len,
    seq_len_kv=seq_len_kv,
    q_start_index_s=q_start_index_s,
    heads=n_heads,
    dim=head_dim,
    tail_dim=tail_dim,
    topk=topk,
    kv_stride=kv_stride,
    kv_group=kv_group,
    sm_scale=sm_scale,
    is_casual=True,
    dtype=torch.bfloat16,
    device='cuda',
)

# Verify correctness, then measure kernel latency
sparse_mla.check()
latency = sparse_mla.profile()
print(f"Latency: {latency:.4f} ms")

fwd_tflops = (batch_size * seq_len * (head_dim + tail_dim + head_dim) * topk * 2 * n_heads) / (latency * 1e-3) / 1e12
print(f"fwd tflops = {fwd_tflops:.2f}")
```
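
The throughput formula in the snippet can be factored into a small helper (illustrative only, not part of the `top` API): per (query, selected key) pair the kernel does GEMM work over `head_dim + tail_dim` for QK plus `head_dim` for PV, at 2 FLOPs per multiply-add, across all heads.

```python
def sparse_mla_fwd_tflops(batch, seq_len, head_dim, tail_dim, topk, heads, latency_ms):
    """Achieved forward TFLOPs for a sparse MLA kernel run.

    Mirrors the formula printed in the example above; latency_ms is the
    measured forward latency in milliseconds.
    """
    flops = batch * seq_len * (head_dim + tail_dim + head_dim) * topk * 2 * heads
    return flops / (latency_ms * 1e-3) / 1e12
```

For example, with the configuration above and a hypothetical 1 ms latency, `sparse_mla_fwd_tflops(1, 1024, 512, 64, 2048, 128, 1.0)` evaluates to about 584 TFLOPs.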

### MLA

```python
import torch
import top
from top import MLAKernel

device = "cuda"
dtype = torch.float16

batch = 128
heads = 64
kv_heads = 1
kv_ctx = 8192
dim = 512
pe_dim = 64

# Query input: [batch, heads, dim]
q = torch.randn(batch, heads, dim, device=device, dtype=dtype)

# Query positional encoding: [batch, heads, pe_dim]
q_pe = torch.randn(batch, heads, pe_dim, device=device, dtype=dtype)

# KV cache input: [batch, kv_ctx, kv_heads, dim]
kv = torch.randn(batch, kv_ctx, kv_heads, dim, device=device, dtype=dtype)

# KV positional encoding: [batch, kv_ctx, kv_heads, pe_dim]
k_pe = torch.randn(batch, kv_ctx, kv_heads, pe_dim, device=device, dtype=dtype)

# Use MLA kernel
block_N = 64
block_H = 64
num_split = 1

mla = MLAKernel(batch, heads, kv_heads, kv_ctx, dim, pe_dim, block_N, block_H, num_split)

out = mla(q, q_pe, kv, k_pe)
```
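
The kernel consumes the four tensors in the layouts noted in the comments above; a tiny helper (illustrative only, not part of the `top` API) can document those shapes for building or validating inputs:

```python
def mla_input_shapes(batch, heads, kv_heads, kv_ctx, dim, pe_dim):
    """Expected input shapes for the MLA decode example above.

    Mirrors the tensor-layout comments in the snippet; not part of
    the top package itself.
    """
    return {
        "q": (batch, heads, dim),          # query
        "q_pe": (batch, heads, pe_dim),    # query positional encoding
        "kv": (batch, kv_ctx, kv_heads, dim),       # KV cache
        "k_pe": (batch, kv_ctx, kv_heads, pe_dim),  # key positional encoding
    }
```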

## Acknowledgments
