Metadata-Version: 2.4
Name: dfastllm
Version: 0.0.2
Summary: High-performance inference engine for Diffusion Language Models - 3x faster with advanced optimizations
Author-email: DFastLLM Team <dfastllm.project@gmail.com>
Maintainer-email: DFastLLM Team <dfastllm.project@gmail.com>
License: Apache-2.0
Project-URL: Homepage, https://dfastllm-project.github.io/dfastllm-website/
Project-URL: Documentation, https://dfastllm-project.github.io/dfastllm-website/
Project-URL: Repository, https://github.com/dfastllm-project/dfastllm
Project-URL: Issues, https://github.com/dfastllm-project/dfastllm/issues
Project-URL: Changelog, https://github.com/dfastllm-project/dfastllm/blob/main/CHANGELOG.md
Keywords: llm,diffusion,serving,inference,llada,dream,mdlm,dfastllm,kubernetes,kserve,production,openai-api
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: Intended Audience :: System Administrators
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Internet :: WWW/HTTP :: HTTP Servers
Classifier: Typing :: Typed
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.1.0
Requires-Dist: transformers>=4.36.0
Requires-Dist: accelerate>=0.25.0
Requires-Dist: safetensors>=0.4.0
Requires-Dist: fastapi>=0.104.0
Requires-Dist: uvicorn[standard]>=0.24.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: prometheus-client>=0.19.0
Requires-Dist: httpx>=0.25.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: einops>=0.7.0
Requires-Dist: starlette>=0.27.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"
Requires-Dist: pytest-cov>=4.1.0; extra == "dev"
Requires-Dist: pytest-timeout>=2.2.0; extra == "dev"
Requires-Dist: ruff>=0.1.0; extra == "dev"
Requires-Dist: black>=23.0.0; extra == "dev"
Requires-Dist: mypy>=1.7.0; extra == "dev"
Requires-Dist: pre-commit>=3.5.0; extra == "dev"
Requires-Dist: types-requests>=2.31.0; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.5.0; extra == "docs"
Requires-Dist: mkdocs-material>=9.5.0; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.24.0; extra == "docs"
Provides-Extra: monitoring
Requires-Dist: opentelemetry-api>=1.21.0; extra == "monitoring"
Requires-Dist: opentelemetry-sdk>=1.21.0; extra == "monitoring"
Requires-Dist: opentelemetry-instrumentation-fastapi>=0.42b0; extra == "monitoring"
Provides-Extra: quantization
Requires-Dist: bitsandbytes>=0.41.0; extra == "quantization"
Requires-Dist: auto-gptq>=0.5.0; extra == "quantization"
Provides-Extra: flash
Requires-Dist: flash-attn>=2.3.0; extra == "flash"
Requires-Dist: triton>=2.1.0; extra == "flash"
Provides-Extra: all
Requires-Dist: dfastllm[dev,docs,flash,monitoring,quantization]; extra == "all"
Dynamic: license-file

<p align="center">
  <img src="https://img.shields.io/badge/DFast-LLM-blue?style=for-the-badge&labelColor=4F46E5&color=06B6D4" alt="DFastLLM" height="40">
</p>

<h1 align="center">DFastLLM</h1>

<p align="center">
  <strong>🚀 High-Performance Inference Engine for Diffusion Language Models</strong>
</p>

<p align="center">
  <a href="https://pypi.org/project/dfastllm/"><img src="https://img.shields.io/pypi/v/dfastllm?style=flat-square&color=blue" alt="PyPI"></a>
  <a href="https://github.com/dfastllm-project/dfastllm/actions"><img src="https://img.shields.io/github/actions/workflow/status/dfastllm-project/dfastllm/ci.yaml?style=flat-square" alt="CI"></a>
  <a href="https://github.com/dfastllm-project/dfastllm/blob/main/LICENSE"><img src="https://img.shields.io/badge/license-Apache%202.0-blue?style=flat-square" alt="License"></a>
  <img src="https://img.shields.io/badge/python-3.10+-blue?style=flat-square" alt="Python">
  <a href="https://dfastllm-project.github.io/dfastllm-website/"><img src="https://img.shields.io/badge/docs-website-blue?style=flat-square" alt="Docs"></a>
</p>

<p align="center">
  <a href="#-quick-start">Quick Start</a> •
  <a href="#-features">Features</a> •
  <a href="#-performance">Performance</a> •
  <a href="#-documentation">Documentation</a> •
  <a href="#-contributing">Contributing</a>
</p>

---

## 🎯 What is DFastLLM?

**DFastLLM** is a production-ready inference engine optimized for **Diffusion Language Models** (LLaDA, Dream, MDLM). Unlike autoregressive models, which emit one token at a time, diffusion LLMs start from a fully masked sequence and unmask many tokens in parallel over a small number of iterative denoising steps, enabling substantially higher throughput per forward pass.

```
Traditional LLM:     Token → Token → Token → Token (sequential)
Diffusion LLM:       [████████] → [████████] → Done! (parallel)
```
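
As a conceptual sketch of that loop (a toy illustration, not DFastLLM's actual implementation; `predict` is a hypothetical stand-in for the model's per-position prediction), generation begins fully masked and fills in several positions per denoising step:

```python
import random

MASK = "<mask>"

def toy_denoise(length: int, steps: int, predict) -> list[str]:
    """Toy diffusion-style decoding: begin fully masked, then unmask
    a fixed fraction of the remaining positions on each step."""
    seq = [MASK] * length
    per_step = max(1, length // steps)
    for _ in range(steps):
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        if not masked:
            break
        # A real model chooses positions by prediction confidence;
        # random selection keeps the sketch simple.
        for i in random.sample(masked, min(per_step, len(masked))):
            seq[i] = predict(i)
    return seq
```

Each step is one forward pass over the whole sequence, so roughly `length // steps` tokens are produced per pass instead of one.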

---

## ⚡ Quick Start

### Installation

```bash
pip install dfastllm
```

### Generate Text

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dfastllm.engine.diffusion import DiffusionEngine

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "GSAI-ML/LLaDA-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("GSAI-ML/LLaDA-8B-Instruct", trust_remote_code=True)

# Create engine and generate
engine = DiffusionEngine(model, tokenizer)
output = engine.generate("What is artificial intelligence?", max_tokens=64)
print(output)
```

### With Quantization (2-4x Memory Savings)

```bash
pip install "dfastllm[quantization]"
```

```python
from dfastllm import load_quantized_model

# Load the 8B model in ~6 GB instead of ~16.8 GB (see Memory Usage below)
model = load_quantized_model("GSAI-ML/LLaDA-8B-Instruct", "int4")
```

### Batch Processing (Up to 4x Throughput)

```python
prompts = ["What is AI?", "What is ML?", "What is DL?", "What is NLP?"]
outputs = engine.generate(prompts, max_tokens=64)
# A batch of 4 reaches ~786 tok/s on an L40S (see Performance below)
```

---

## 🔥 Features

| Feature | Description | Status |
|---------|-------------|--------|
| **Diffusion Generation** | Parallel token unmasking | ✅ |
| **Batch Processing** | Process multiple requests | ✅ |
| **INT4/INT8 Quantization** | 2-4x memory reduction | ✅ |
| **torch.compile** | JIT compilation, up to 2x speedup | ✅ |
| **FlashAttention** | Memory-efficient attention | ✅ |
| **Multi-GPU** | Tensor parallelism | ✅ |
| **OpenAI API** | Drop-in compatible server | ✅ |
| **Streaming** | Real-time token streaming | ✅ |
| **CUDA Graphs** | Reduced kernel-launch overhead | ✅ |
| **Kubernetes** | Production deployment | ✅ |

---

## 📊 Performance

Benchmarked on **NVIDIA L40S** (46GB) with **LLaDA-8B**:

| Batch Size | Throughput | Latency | Speedup |
|------------|------------|---------|---------|
| 1 | 265 tok/s | 241 ms | 1.0x |
| 2 | 484 tok/s | 132 ms | 1.8x |
| 4 | 786 tok/s | 81 ms | 3.0x |
| **8** | **1,056 tok/s** | 61 ms | **4.0x** |
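
The speedup column follows directly from the throughput figures; comparing speedup to batch size also shows how far the scaling is from linear on this hardware:

```python
# Throughput (tok/s) from the L40S / LLaDA-8B benchmark table above
throughput = {1: 265, 2: 484, 4: 786, 8: 1056}
baseline = throughput[1]
for batch, tps in throughput.items():
    speedup = tps / baseline
    efficiency = speedup / batch  # 1.0 would be perfectly linear scaling
    print(f"batch={batch}: {speedup:.1f}x speedup, {efficiency:.0%} efficiency")
```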

### Memory Usage

| Configuration | Memory | Notes |
|---------------|--------|-------|
| BF16 | 16.8 GB | Default |
| INT8 | ~10 GB | 1.7x reduction |
| INT4 | ~6 GB | 2.8x reduction |
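
These figures are close to what a weights-only estimate predicts (parameter count times bytes per parameter), with the measured numbers including a few extra GB of runtime overhead for activations and buffers. A back-of-the-envelope check:

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Weights-only memory estimate in GB: params * bytes per param."""
    return params_billion * bits_per_param / 8

# LLaDA-8B; measured usage above sits a few GB higher than each estimate
for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_gb(8, bits):.0f} GB of weights")
```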

---

## 🐳 Docker

```bash
# GPU image
docker run --gpus all -p 8000:8000 ghcr.io/dfastllm-project/dfastllm:gpu

# CPU image
docker run -p 8000:8000 ghcr.io/dfastllm-project/dfastllm:latest
```

---

## ☸️ Kubernetes

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dfastllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dfastllm
  template:
    metadata:
      labels:
        app: dfastllm
    spec:
      containers:
      - name: dfastllm
        image: ghcr.io/dfastllm-project/dfastllm:gpu
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000

---

## 🌐 OpenAI-Compatible API

Start the server:

```bash
dfastllm-serve --model GSAI-ML/LLaDA-8B-Instruct --port 8000
```

Use with OpenAI client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="GSAI-ML/LLaDA-8B-Instruct",
    messages=[{"role": "user", "content": "What is AI?"}],
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```

---

## 🛠️ Supported Models

| Model | Parameters | Status |
|-------|------------|--------|
| **LLaDA-8B-Instruct** | 8B | ✅ Full Support |
| **LLaDA-8B-Base** | 8B | ✅ Full Support |
| **Dream** | 7B | ⚠️ Experimental |
| **MDLM** | Various | ⚠️ Experimental |

---

## 📚 Documentation

- [📖 Documentation Website](https://dfastllm-project.github.io/dfastllm-website/)
- [🚀 Getting Started](docs/getting-started/quickstart.md)
- [⚙️ Configuration](docs/getting-started/configuration.md)
- [📊 Benchmarks](docs/benchmarks/performance.md)
- [☸️ Kubernetes Deployment](docs/deployment/kubernetes.md)

---

## 🤝 Contributing

We welcome contributions! Here's how to get started:

```bash
# Clone the repo
git clone https://github.com/dfastllm-project/dfastllm.git
cd dfastllm

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/

# Run linting
ruff check dfastllm/
black --check dfastllm/
```

See [CONTRIBUTING.md](CONTRIBUTING.md) for detailed guidelines.

---

## 📄 License

Apache 2.0 - See [LICENSE](LICENSE) for details.

---

## 🙏 Acknowledgments

- [LLaDA](https://github.com/GSAI-ML/LLaDA) - The primary diffusion LLM we support
- [HuggingFace Transformers](https://github.com/huggingface/transformers) - Model loading infrastructure
- [PyTorch](https://pytorch.org/) - Deep learning framework

---

<p align="center">
  Made with ❤️ by the <a href="https://github.com/dfastllm-project">DFastLLM Team</a>
</p>
