Metadata-Version: 2.4
Name: xoron-kernel
Version: 1.0.0
Summary: High-performance kernels for Xoron multimodal model with runtime dispatch, JIT compilation, and multi-GPU support
Home-page: https://github.com/nigfuapp-web/xoron-kernel
Author: XTransformers Team
Author-email: XTransformers Team <xoron-kernel@example.com>
Maintainer-email: XTransformers Team <xoron-kernel@example.com>
License: Apache-2.0
Project-URL: Homepage, https://github.com/nigfuapp-web/xoron-kernel
Project-URL: Documentation, https://github.com/nigfuapp-web/xoron-kernel#readme
Project-URL: Repository, https://github.com/nigfuapp-web/xoron-kernel.git
Project-URL: Issues, https://github.com/nigfuapp-web/xoron-kernel/issues
Keywords: transformers,moe,mixture-of-experts,multimodal,llm,cuda,triton,avx512,amx,gguf,flash-attention,ring-attention,mla,xoron
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: C++
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: numpy>=1.20.0
Requires-Dist: safetensors>=0.3.0
Provides-Extra: triton
Requires-Dist: triton>=2.0.0; extra == "triton"
Provides-Extra: sglang
Requires-Dist: sglang>=0.1.0; extra == "sglang"
Provides-Extra: flash
Requires-Dist: flash-attn>=2.0.0; extra == "flash"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-benchmark>=4.0.0; extra == "dev"
Requires-Dist: black>=22.0.0; extra == "dev"
Requires-Dist: isort>=5.0.0; extra == "dev"
Requires-Dist: mypy>=0.900; extra == "dev"
Requires-Dist: ruff>=0.0.270; extra == "dev"
Provides-Extra: all
Requires-Dist: triton>=2.0.0; extra == "all"
Requires-Dist: sglang>=0.1.0; extra == "all"
Requires-Dist: flash-attn>=2.0.0; extra == "all"
Dynamic: author
Dynamic: home-page
Dynamic: license-file
Dynamic: requires-python

# XTransformers

🚀 **High-Performance Kernels for Xoron Multimodal Model**

XTransformers (published on PyPI as `xoron-kernel`) is a custom kernel library designed specifically for the [Xoron](https://github.com/nigfuapp-web/Xoron-Dev) multimodal model, providing state-of-the-art optimizations for running large language models on consumer-grade hardware.

## ✨ Features

### Hardware Support
- **Multi-GPU**: NVIDIA CUDA, AMD ROCm, Intel oneAPI
- **Apple Silicon**: Metal Performance Shaders
- **CPU**: Intel (AVX2/AVX512/AMX), AMD (AVX2/AVX512), ARM (NEON/SVE)
- **Cross-Platform**: Triton JIT kernels for portability

### Runtime Optimization
- **Runtime Dispatch**: Automatically selects optimal kernel variant at startup
- **JIT Compilation**: Compiles kernels optimized for your specific hardware
- **NUMA Awareness**: Efficient memory placement on multi-socket systems

### Model Optimizations
- **MoE Expert Offloading**: Cold experts offloaded to CPU, hot experts on GPU
- **GGUF Support**: On-the-fly dequantization (Q2_K–Q6_K, Q4_0/Q8_0, IQ4_XS, FP8)
- **MLA (Multi-Head Latent Attention)**: 4-8x KV cache compression (sketched below)
- **Ring Attention**: Efficient 128K+ context processing
- **Flash Attention**: Memory-efficient attention with O(N) memory
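
To make the KV-cache figure concrete, here is a minimal sketch of the MLA idea with illustrative sizes (not Xoron's actual dimensions): the cache stores one small latent per token and rebuilds K/V from it at attention time.

```python
import torch

# Illustrative sizes only; chosen so the ratio lands in the 4-8x range.
hidden, n_heads, head_dim, d_latent = 4096, 32, 128, 1024

W_down = torch.randn(hidden, d_latent) * 0.02             # hidden -> latent
W_uk = torch.randn(d_latent, n_heads * head_dim) * 0.02   # latent -> K
W_uv = torch.randn(d_latent, n_heads * head_dim) * 0.02   # latent -> V

h_t = torch.randn(1, hidden)                   # one token's hidden state
c_t = h_t @ W_down                             # cache only this latent
k_t = (c_t @ W_uk).view(n_heads, head_dim)     # K rebuilt on the fly
v_t = (c_t @ W_uv).view(n_heads, head_dim)     # V rebuilt on the fly

plain_kv_floats = 2 * n_heads * head_dim       # 8192 floats/token uncompressed
print(plain_kv_floats / d_latent)              # 8.0 -> 8x smaller cache
```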

### Multimodal
- **Vision**: SigLIP encoder with TiTok 1D tokenization
- **Video**: 3D-RoPE temporal encoding, VidTok compression
- **Audio**: Conformer encoder/decoder, raw waveform processing
- **Generation**: MoE-DiT image/video generation with Flow Matching

## 📦 Installation

### From PyPI (Recommended)

```bash
pip install xoron-kernel
```

### From Source (Development)

```bash
git clone https://github.com/nigfuapp-web/xoron-kernel.git
cd xoron-kernel
pip install -e .
```

### With GPU Support

```bash
# NVIDIA CUDA
XT_USE_CUDA=1 pip install xoron-kernel

# AMD ROCm
XT_USE_ROCM=1 pip install xoron-kernel

# With Triton (cross-platform GPU)
pip install "xoron-kernel[triton]"
```

### Build Options

Control the build with environment variables:

```bash
# Force a specific CPU variant
XT_CPU_VARIANT=AVX512_BF16 pip install xoron-kernel

# Enable/disable features
XT_ENABLE_AMX=ON pip install xoron-kernel
XT_ENABLE_GGUF=ON pip install xoron-kernel
XT_ENABLE_JIT=ON pip install xoron-kernel

# Target CUDA architectures (compute capabilities to compile for)
XT_CUDA_ARCHS="70;75;80;86;89;90" pip install xoron-kernel

# Parallel build jobs
XT_PARALLEL=16 pip install xoron-kernel
```

## 🚀 Quick Start

```python
import xtransformers

# Initialize with hardware detection
xtransformers.init()

# Print detected hardware
xtransformers.print_hardware_info()

# Get optimal kernel variant
variant = xtransformers.get_best_variant()
print(f"Using kernel variant: {variant}")
```

### MoE Inference

```python
import xtransformers
import torch

# Configure MoE kernel
config = xtransformers.MoEKernelConfig()
config.num_experts = 8
config.num_experts_per_tok = 2
config.hidden_size = 4096
config.intermediate_size = 11008
config.enable_expert_offload = True
config.quant_bits = 4  # Q4 quantization

# Create kernel with runtime dispatch
moe_kernel = xtransformers.create_moe_kernel(config)

# Load GGUF weights
moe_kernel.load_weights("/path/to/model/experts")

# Forward pass (illustrative shapes)
batch_size, seq_len = 2, 128
expert_ids = torch.randint(0, 8, (batch_size, seq_len, 2))
routing_weights = torch.softmax(torch.randn(batch_size, seq_len, 2), dim=-1)
input_hidden = torch.randn(batch_size, seq_len, 4096, dtype=torch.bfloat16)
output = torch.zeros_like(input_hidden, dtype=torch.float32)

moe_kernel.forward(batch_size, seq_len, expert_ids, routing_weights, input_hidden, output)
```

### Triton Kernels (Cross-Platform GPU)

```python
import torch
from xtransformers.triton_kernels import TritonKernels

# Illustrative inputs: (batch, heads, seq_len, head_dim) on a CUDA device
Q = torch.randn(1, 32, 1024, 128, device="cuda", dtype=torch.float16)
K, V = torch.randn_like(Q), torch.randn_like(Q)

# Flash Attention
output = TritonKernels.flash_attention(Q, K, V, causal=True)

# MoE Routing: (num_tokens, num_experts) logits -> top-k experts per token
router_logits = torch.randn(1024, 8, device="cuda")
expert_ids, routing_weights = TritonKernels.moe_routing(router_logits, top_k=2)

# GGUF Dequantization (`quantized`/`scales`/`num_elements` come from a loaded GGUF tensor)
fp32_weights = TritonKernels.dequantize_q4(quantized, scales, num_elements)
```

## 🏗️ Architecture

```
xoron-kernel/
├── cmake/                     # CMake modules for CPU detection
├── xtransformers_kernel/
│   ├── cpu_backend/           # NUMA-aware worker pool, SIMD kernels
│   ├── cuda/                  # CUDA kernels (MoE, attention)
│   ├── operators/
│   │   ├── moe/              # MoE with expert offloading
│   │   ├── attention/        # Flash/Ring/MLA attention
│   │   └── multimodal/       # Vision, video, audio projectors
│   ├── python/               # Python bindings and Triton kernels
│   └── ext_bindings.cpp      # pybind11 bindings
├── third_party/              # pybind11, llama.cpp headers
├── examples/                 # Usage examples
├── tests/                    # Unit tests
├── setup.py                  # Build script with JIT detection
└── pyproject.toml           # Package configuration
```

## 📊 Performance

### MoE Expert Processing (DeepSeek-V3 style)

| Hardware | Variant | Throughput (tokens/sec) |
|----------|---------|------------------------|
| Intel Sapphire Rapids | AMX | 2,500 |
| Intel Ice Lake | AVX512-BF16 | 1,800 |
| AMD EPYC Genoa | AVX512-VNNI | 1,600 |
| Apple M2 Ultra | NEON | 1,200 |

### Memory Efficiency

| Feature | Memory Reduction |
|---------|-----------------|
| MLA KV Compression | 4-8x |
| Expert Offloading | ~60% less GPU memory |
| Q4 Quantization | 4x |
| Ring Attention | O(N) vs O(N²) |

## 🔧 Configuration

### CPU Variant Selection

XTransformers selects the best kernel variant automatically, once at install time and again at runtime, in this priority order (a runtime-inspection sketch follows the list):

1. **AMX** (Intel Sapphire Rapids+): Highest performance for matrix operations
2. **AVX512-BF16**: Optimal for bfloat16 operations
3. **AVX512-VNNI**: Accelerated INT8/INT16 operations
4. **AVX512**: Base AVX-512 operations
5. **AVX2**: Fallback for older x86 CPUs
6. **SVE**: ARM Scalable Vector Extensions
7. **NEON**: ARM SIMD (Apple Silicon, etc.)
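
To see which variant was picked on your machine, call `get_best_variant()` after `init()`, as in the Quick Start. The snippet below also pins a variant via `XT_CPU_VARIANT`; the build-time variable is documented above, and treating it as a runtime override is an assumption here:

```python
import os

# Assumption: XT_CPU_VARIANT is honored at runtime as well as at build time;
# set it before init() to pin a variant for debugging or benchmarking.
os.environ["XT_CPU_VARIANT"] = "AVX2"

import xtransformers

xtransformers.init()
print(xtransformers.get_best_variant())  # e.g. "AVX2" if the override applies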

### Expert Offloading

Configure which experts stay on GPU vs CPU:

```python
config.enable_expert_offload = True
config.gpu_expert_ids = [0, 1, 2, 3]  # Hot experts on GPU
config.cpu_expert_ids = [4, 5, 6, 7]  # Cold experts on CPU
```
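
If routing statistics are available, the hot/cold split can be derived rather than hand-picked. A minimal heuristic sketch (not a library API), reusing the `config` object from the MoE example above:

```python
import torch

num_experts, gpu_budget = 8, 4
# Sampled routing decisions, e.g. collected from a short calibration run
sampled_ids = torch.randint(0, num_experts, (10_000, 2))
counts = torch.bincount(sampled_ids.flatten(), minlength=num_experts)
hot = torch.argsort(counts, descending=True)[:gpu_budget].tolist()

config.gpu_expert_ids = sorted(hot)                                 # hot experts on GPU
config.cpu_expert_ids = sorted(set(range(num_experts)) - set(hot))  # cold experts on CPU
```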

### GGUF Quantization

Supported formats (a minimal Q4_0 decoding sketch follows the list):
- `Q2_K`, `Q3_K`, `Q4_K`, `Q5_K`, `Q6_K`: K-quant formats
- `Q4_0`, `Q8_0`: Basic formats
- `IQ4_XS`: imatrix quantization
- `FP8`: 8-bit floating point
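
To make the formats concrete, here is an illustrative NumPy decoder for the simplest one, Q4_0, following the llama.cpp block layout; the library's own kernels do this on-device:

```python
import numpy as np

def dequantize_q4_0(block: bytes) -> np.ndarray:
    """Dequantize one GGUF Q4_0 block: 18 bytes -> 32 float32 values.

    Layout (per llama.cpp): a float16 scale `d`, then 16 bytes holding
    32 packed 4-bit values; each value decodes as d * (q - 8).
    """
    d = np.frombuffer(block[:2], dtype=np.float16)[0].astype(np.float32)
    qs = np.frombuffer(block[2:18], dtype=np.uint8)
    lo = (qs & 0x0F).astype(np.int8) - 8   # first 16 elements (low nibbles)
    hi = (qs >> 4).astype(np.int8) - 8     # last 16 elements (high nibbles)
    return d * np.concatenate([lo, hi]).astype(np.float32)
```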

## 🔗 Integration

### With sglang

```python
# Coming soon: Native sglang backend
from xtransformers.sglang import XTransformersBackend
```

### With Xoron Model

```python
from xoron import XoronMultimodalModel
import xtransformers

# XTransformers is automatically used when available
model = XoronMultimodalModel.from_pretrained("path/to/model")
```

## 📝 License

Apache 2.0 License

## 🤝 Contributing

Contributions are welcome! Please see our [Contributing Guide](CONTRIBUTING.md).

## 🙏 Acknowledgments

- [KTransformers](https://github.com/kvcache-ai/ktransformers) for inspiration
- [llama.cpp](https://github.com/ggerganov/llama.cpp) for GGUF format
- [Triton](https://github.com/openai/triton) for cross-platform GPU kernels
- [Flash Attention](https://github.com/Dao-AILab/flash-attention) for attention algorithms
