Metadata-Version: 2.4
Name: openllava
Version: 3.0.0
Summary: Open-Source Multimodal Vision Injection Framework for Any Language Model
Author: OpceanAI Research Team
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/OpceanAI/openllava
Project-URL: Documentation, https://github.com/OpceanAI/openllava
Project-URL: Repository, https://github.com/OpceanAI/openllava
Project-URL: Issues, https://github.com/OpceanAI/openllava/issues
Project-URL: Changelog, https://github.com/OpceanAI/openllava/releases
Keywords: multimodal,vision-language,llava,llm,deep-learning,cuda,quantization
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: C
Classifier: Programming Language :: C++
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.3.0
Requires-Dist: transformers>=4.45.0
Requires-Dist: accelerate>=0.33.0
Requires-Dist: peft>=0.12.0
Requires-Dist: bitsandbytes>=0.43.0
Requires-Dist: pillow>=10.0
Requires-Dist: datasets>=2.16.0
Requires-Dist: huggingface-hub>=0.24.0
Requires-Dist: tqdm>=4.66.0
Requires-Dist: einops>=0.8.0
Requires-Dist: safetensors>=0.4.3
Requires-Dist: numpy>=1.24
Requires-Dist: sentencepiece>=0.2.0
Requires-Dist: trl>=0.12.0
Requires-Dist: wandb>=0.17.0
Requires-Dist: tensorboard>=2.16.0
Provides-Extra: cli
Requires-Dist: click>=8.0; extra == "cli"
Requires-Dist: typer>=0.12; extra == "cli"
Requires-Dist: rich>=13.0; extra == "cli"
Provides-Extra: serve
Requires-Dist: fastapi>=0.110; extra == "serve"
Requires-Dist: uvicorn[standard]>=0.29; extra == "serve"
Provides-Extra: rl
Requires-Dist: trl>=0.12.0; extra == "rl"
Requires-Dist: vllm>=0.5.0; extra == "rl"
Provides-Extra: cuda
Requires-Dist: nvidia-cuda-runtime-cu12; extra == "cuda"
Requires-Dist: nvidia-cuda-nvcc-cu12; extra == "cuda"
Requires-Dist: nvidia-cublas-cu12; extra == "cuda"
Provides-Extra: tpu
Requires-Dist: jax>=0.4.20; extra == "tpu"
Requires-Dist: jaxlib>=0.4.20; extra == "tpu"
Requires-Dist: flax>=0.8.0; extra == "tpu"
Requires-Dist: torch_xla>=2.3.0; extra == "tpu"
Provides-Extra: eval
Requires-Dist: lmms-eval>=0.2.0; extra == "eval"
Provides-Extra: dev
Requires-Dist: pytest>=8.0; extra == "dev"
Requires-Dist: black>=24.0; extra == "dev"
Requires-Dist: ruff>=0.4; extra == "dev"
Requires-Dist: pre-commit>=3.6; extra == "dev"
Provides-Extra: ui
Requires-Dist: gradio>=4.0; extra == "ui"
Provides-Extra: all
Requires-Dist: openllava[cli,cuda,dev,eval,rl,serve,tpu,ui]; extra == "all"
Dynamic: license-file
Dynamic: requires-python

<p align="center">
  <img src="https://img.shields.io/badge/OpenLLaVA-v3.0.0-7C4DFF?style=for-the-badge&labelColor=0A0A0A" alt="OpenLLaVA v3.0.0">
  <img src="https://img.shields.io/badge/License-Apache--2.0-7C4DFF?style=for-the-badge&labelColor=0A0A0A" alt="License">
  <img src="https://img.shields.io/badge/Python-3.10+-3776AB?style=for-the-badge&labelColor=0A0A0A&logo=python&logoColor=3776AB" alt="Python">
  <img src="https://img.shields.io/badge/PyTorch-2.3+-EE4C2C?style=for-the-badge&labelColor=0A0A0A&logo=pytorch&logoColor=EE4C2C" alt="PyTorch">
</p>

<p align="center">
  <img src="https://img.shields.io/badge/CUDA-8.0%2B-76B900?style=for-the-badge&labelColor=0A0A0A&logo=nvidia&logoColor=76B900" alt="CUDA">
  <img src="https://img.shields.io/badge/ROCm-AMD-ED2B23?style=for-the-badge&labelColor=0A0A0A" alt="ROCm">
  <img src="https://img.shields.io/badge/TPU-Google-4285F4?style=for-the-badge&labelColor=0A0A0A" alt="TPU">
  <img src="https://img.shields.io/badge/MLX-Apple-555555?style=for-the-badge&labelColor=0A0A0A&logo=apple&logoColor=white" alt="MLX">
  <img src="https://img.shields.io/badge/XPU-Intel-0071C5?style=for-the-badge&labelColor=0A0A0A&logo=intel&logoColor=0071C5" alt="XPU">
</p>

---

<p align="center">
  <strong>OpenLLaVA</strong> is an open-source multimodal vision injection framework for adding vision capabilities to any language model. Architecture-agnostic, multi-backend, and production-ready — from research to deployment.
</p>

<p align="center">
  <a href="#quickstart">Quickstart</a> ·
  <a href="#architecture">Architecture</a> ·
  <a href="#core-concepts">Core Concepts</a> ·
  <a href="#training-pipeline">Training</a> ·
  <a href="#optimizations">Optimizations</a> ·
  <a href="#cli-reference">CLI</a> ·
  <a href="#distributed-training">Distributed</a>
</p>

---

## Table of Contents

- [Overview](#overview)
- [Key Features](#key-features)
- [Quickstart](#quickstart)
- [Architecture](#architecture)
- [Core Concepts](#core-concepts)
- [Training Pipeline](#training-pipeline)
- [Optimizations](#optimizations)
- [CLI Reference](#cli-reference)
- [API Reference](#api-reference)
- [Distributed Training](#distributed-training)
- [RL Alignment](#rl-alignment)
- [Export and Deployment](#export-and-deployment)
- [Evaluation](#evaluation)
- [Configuration](#configuration)
- [Backends](#backends)
- [Performance](#performance)
- [Project Structure](#project-structure)
- [License](#license)

---

## Overview

OpenLLaVA is a comprehensive framework for injecting vision capabilities into any HuggingFace language model. It provides a complete pipeline — from model construction through training, inference, serving, export, and evaluation — all accessible through a unified Python API and CLI.

The framework supports any LLM architecture (Llama, Mistral, Qwen, Gemma, Phi, etc.) and any HuggingFace-compatible vision encoder. It automatically detects model dimensions, constructs the appropriate projector, patches the tokenizer with visual tokens, and configures the training and inference pipelines.

> [!NOTE]
> OpenLLaVA is backend-agnostic. The same code runs on CUDA, ROCm, Apple MLX, Intel XPU, Google TPU, and CPU — with automatic hardware detection and optimal configuration selection.

### Design Principles

| Principle | Description |
|:----------|:------------|
| **Architecture Agnostic** | Works with any HuggingFace LLM and vision encoder |
| **Multi-Backend** | CUDA, ROCm, TPU, MLX, XPU, CPU — auto-detected |
| **Production Ready** | Continuous batching, PagedAttention, speculative decoding |
| **Optimization Suite** | 40+ built-in optimizations for training and inference |
| **Full Pipeline** | Train, serve, export, evaluate — all in one framework |

---

## Key Features

### Model Construction

- **Vision Injection**: Add vision capabilities to any language model in 3 lines of code
- **AnyRes Processing**: Dynamic high-resolution image support with patch grouping
- **YakiProjector**: MLP-based vision-to-LLM alignment with configurable depth and width
- **Token Extending**: Automatic tokenizer patching with `<image>` special tokens
- **Architecture Detection**: Auto-detects LLM hidden dimensions, attention heads, and vocabulary size

### Training Pipeline

- **3-Phase Training**: Pretraining alignment, visual instruction tuning, RLHF/DPO alignment
- **LoRA Variants**: LoRA, LoRA+, LoRAGA, LoRAFA, DoRA, QLoRA, Split LoRA
- **BitNet Training**: Ternary weight training (b1.58) with absmean quantization
- **MoE + LoRA Fusion**: Mixture-of-Experts with LoRA adapters per expert
- **Curriculum Learning**: Progressive difficulty scheduling
- **Padding-Free Training**: Variable-length sequences without padding tokens
- **Sequence Packing**: Pack multiple sequences into single training examples
- **FP8 Training**: Native FP8 training on H100 GPUs (Hopper architecture)

### Inference and Serving

- **Continuous Batching**: Request-level dynamic batching; sequences join and leave in-flight batches as they arrive and finish
- **PagedAttention**: Block-level KV cache management for 4x memory efficiency
- **Speculative Decoding**: Eagle, Medusa, NGram draft models for 2-3x throughput
- **KV Cache Optimizations**: Quantization, eviction (H2O, SnapKV, FastGen, WG), compression (PackKV, SWAN)
- **Sparse Attention**: Dynamic sparse attention pattern selection
- **Chunked Prefill**: Split long prompts into manageable chunks
- **OpenAI-Compatible API**: FastAPI server with `/v1/chat/completions`, streaming, and vision support

### Optimization Suite

- **40+ Built-in Optimizations**: From FP8 training to KV cache compression
- **torch.compile**: Full-graph compilation with custom backends
- **torchao Integration**: Quantization-aware training, weight-only quantization, sparsity
- **GPTQ / AWQ**: Post-training weight quantization
- **FP4 / NVFP4**: 4-bit floating point quantization for H100
- **GaLore**: Gradient Low-Rank Projection for memory-efficient full finetuning
- **EMA**: Exponential Moving Average for training stability
- **Selective Checkpointing**: Memory-efficient activation checkpointing

### Distributed Training

- **FSDP2**: Fully Sharded Data Parallel with mixed precision
- **DeepSpeed ZeRO**: ZeRO stages 0-3 with CPU/NVMe offload
- **Tensor Parallelism**: Megatron-style tensor parallel (1D, 2D, 3D)
- **Pipeline Parallelism**: GPipe / 1F1B pipeline scheduling
- **Expert Parallelism**: Distributed MoE training
- **Ring Attention**: Sequence parallelism for long-context training
- **Heterogeneous Training**: GPU + CPU + TPU mixed-device training
- **ZeRO++**: Hierarchical ZeRO with communication compression

### Multi-Backend Support

| Backend | Hardware | Status |
|:--------|:---------|:-------|
| CUDA | NVIDIA GPUs (Ampere, Ada, Hopper) | Production |
| ROCm | AMD GPUs | Production |
| CPU FP32 | x86-64 and ARM CPUs | Production |
| TPU (XLA/SPMD) | Google TPU v3-v5 | Beta |
| MLX | Apple Silicon (M1-M4) | Beta |
| XPU | Intel GPUs (Arc, Data Center) | Beta |

---

## Quickstart

### Installation

```bash
# Core installation (CUDA auto-detected)
pip install openllava

# With CLI tools
pip install openllava[cli]

# With serving capabilities
pip install openllava[serve]

# Full installation
pip install openllava[all]
```

> [!IMPORTANT]
> OpenLLaVA requires PyTorch 2.3 or later. Install PyTorch separately if your environment requires a specific CUDA version.

### Build from Source

```bash
git clone https://github.com/OpceanAI/openllava.git
cd openllava

# Install with CUDA extensions
pip install -e .[all]

# Install without CUDA (CPU-only)
OPENLLAVA_NO_CUDA=1 pip install -e .[all]
```

### Inject Vision Into Any LLM

```python
from openllava import OpenLLaVA, Backend

model = OpenLLaVA(
    llm="meta-llama/Llama-3-8B",
    vision_encoder="google/siglip2-so400m-patch14-384",
    backend=Backend.AUTO,
)

# View the patched model architecture
print(model)
```

### Train with LoRA

```python
# Apply LoRA adapters
model.lora(r=64, alpha=128, dropout=0.05)

# Phase 1: Vision-language alignment
model.train(phase1=dict(
    dataset="liuhaotian/LLaVA-Pretrain",
    samples=100_000,
    learning_rate=1e-3,
    batch_size=128,
))

# Phase 2: Visual instruction tuning
model.train(phase2=dict(
    dataset="liuhaotian/LLaVA-Instruct-150K",
    learning_rate=2e-4,
    batch_size=32,
))

# Push to HuggingFace Hub
model.push("my-org/my-model")
```

### Run Inference

```python
from openllava import OpenLLaVA

model = OpenLLaVA.from_pretrained("openllava/yaki-8b")

response = model.generate(
    images=["chart.png"],
    prompt="Describe the key trends in this chart.",
    max_new_tokens=512,
    temperature=0.7,
)

print(response)
```

### Serve as OpenAI-Compatible API

```bash
openllava serve openllava/yaki-8b --port 8000
```

```python
from openai import OpenAI

client = OpenAI(
    api_key="openllava",
    base_url="http://localhost:8000/v1",
)

response = client.chat.completions.create(
    model="yaki-8b",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=512,
)

print(response.choices[0].message.content)
```

---

## Architecture



### How Vision Injection Works

1. **Vision Encoding**: The input image is processed by a vision encoder (e.g., SigLIP2) producing a grid of patch embeddings.

2. **Patch Grouping**: Adjacent patches are grouped (default 3x3) to reduce the visual token count and capture local spatial structure. With SigLIP2, each 3x3 group is flattened into a single 10368-dimensional (9 x 1152) vector.

3. **Projection**: The YakiProjector MLP maps grouped vision features into the LLM's hidden dimension (e.g., 4096 for Llama-3-8B) through a 2-layer GELU-activated MLP.

4. **Token Patching**: The tokenizer is extended with a `<image>` special token. During processing, this token is replaced by the projected vision embeddings, which are inserted before or interleaved with text embeddings.

5. **Generation**: The LLM attends to both visual and textual tokens, enabling multimodal understanding and generation.

> [!TIP]
> The patch grouping size is configurable. Larger groups reduce sequence length at the cost of spatial resolution. The default 3x3 grouping with SigLIP2 produces 81 visual tokens per image.
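
To make steps 2 and 3 concrete, here is a minimal, framework-independent sketch of the grouping and projection math. The tensor shapes follow the SigLIP2 / Llama-3-8B numbers above; the real `YakiProjector` implementation may differ in details.

```python
import torch
import torch.nn as nn

# Shapes assume SigLIP2 so400m (27x27 patches, hidden size 1152) and Llama-3-8B (hidden size 4096).
patches = torch.randn(1, 27, 27, 1152)               # [batch, H, W, vision_hidden]

# Step 2: group 3x3 neighborhoods -> 9x9 groups, each flattened to 9 * 1152 = 10368 dims.
groups = patches.unfold(1, 3, 3).unfold(2, 3, 3)      # [1, 9, 9, 1152, 3, 3]
groups = groups.permute(0, 1, 2, 4, 5, 3).reshape(1, 81, 9 * 1152)

# Step 3: a 2-layer GELU MLP projects each group into the LLM embedding space.
projector = nn.Sequential(
    nn.Linear(9 * 1152, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)
visual_tokens = projector(groups)                     # [1, 81, 4096] -> 81 visual tokens per image
```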

---

## Core Concepts

### OpenLLaVA Class

The central entry point. It orchestrates model loading, patching, training, and inference.

```python
import torch

from openllava import OpenLLaVA, Backend

model = OpenLLaVA(
    llm="OpceanAI/Yuuki-RxG",                           # HF model ID or local path
    vision_encoder="google/siglip2-so400m-patch14-384",  # HF vision encoder
    architecture="llava",                                # Architecture variant
    backend=Backend.AUTO,                                # Backend selection
    torch_dtype=torch.bfloat16,                          # Compute dtype
    device_map="auto",                                   # Device mapping strategy
)
```

### YakiProjector

The MLP projector that aligns vision features to the LLM's embedding space.

```python
from openllava import YakiProjector

projector = YakiProjector(
    vision_hidden_size=1152,      # SigLIP2 hidden dimension
    llm_hidden_size=4096,         # Llama-3-8B hidden dimension
    patch_group=3,                # 3x3 patch grouping
    projector_depth=2,            # MLP depth
    activation="gelu",            # Activation function
    dropout=0.0,                  # Dropout rate
)
```

### FastVisionModel API

An Unsloth-style API for quick model loading and PEFT configuration.

```python
import torch

from openllava.api import FastVisionModel

model, tokenizer = FastVisionModel.from_pretrained(
    "openllava/yaki-8b",
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=torch.bfloat16,
)

model = FastVisionModel.get_peft_model(
    model,
    r=16,
    alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

FastVisionModel.for_training(model)
```

### Backend Abstraction

Hardware backends are auto-detected on import. Explicit selection is also supported.

```python
from openllava import Backend, BackendManager

# Auto-detection (default)
manager = BackendManager()

# Explicit selection
manager = BackendManager(Backend.CUDA)

# Available backends
for backend in Backend:
    print(backend.value)
    # auto, cuda, cpu_fp32, tpu, xpu, rocm, mlx, heterogeneous
```

---

## Training Pipeline

OpenLLaVA employs a 3-phase training pipeline designed for optimal vision-language alignment.

### Phase 1: Vision-Language Alignment

Aligns the vision encoder and projector with the LLM's embedding space using image-caption pairs.

| Parameter | Recommended Value | Description |
|:----------|:------------------|:------------|
| Dataset | `liuhaotian/LLaVA-Pretrain` | 100K image-caption pairs |
| Learning Rate | 1e-3 | High LR for projector convergence |
| Batch Size | 128 | Large batches recommended |
| Optimizer | AdamW | Standard optimizer |
| Scheduler | Cosine | Cosine decay with warmup |
| Epochs | 1 | Single pass sufficient |

### Phase 2: Visual Instruction Tuning

Fine-tunes the entire model (or LoRA adapters) on visual instruction-following data.

| Parameter | Recommended Value | Description |
|:----------|:------------------|:------------|
| Dataset | `liuhaotian/LLaVA-Instruct-150K` | 150K visual instructions |
| Learning Rate | 2e-4 | Lower LR for instruction tuning |
| Batch Size | 32 | Moderate batch size |
| Optimizer | AdamW | Standard optimizer |
| Scheduler | Cosine | Cosine decay with warmup |
| Epochs | 3-5 | Multiple epochs beneficial |
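
Mapped onto the `model.train(...)` API from the quickstart, the recommended values above look roughly like this (only the argument names already shown in the quickstart are used; epoch counts are set via `TrainingConfig`, documented under Configuration):

```python
# Phase 1: vision-language alignment (high LR, large batches, single pass)
model.train(phase1=dict(
    dataset="liuhaotian/LLaVA-Pretrain",
    learning_rate=1e-3,
    batch_size=128,
))

# Phase 2: visual instruction tuning (lower LR, moderate batches)
model.train(phase2=dict(
    dataset="liuhaotian/LLaVA-Instruct-150K",
    learning_rate=2e-4,
    batch_size=32,
))
```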

### Phase 3: RL Alignment (Optional)

Aligns the model with human preferences using RLHF, DPO, GRPO, or ORPO. The `OpenLLaVATrainer` drives all three phases from a single config:

```python
from openllava.api import OpenLLaVATrainer, TrainingConfig

config = TrainingConfig(
    phase1_dataset="liuhaotian/LLaVA-Pretrain",
    phase2_dataset="liuhaotian/LLaVA-Instruct-150K",
    output_dir="./yaki-checkpoints",
    lora_r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    phase1_learning_rate=1e-3,
    phase2_learning_rate=2e-4,
    phase1_batch_size=128,
    phase2_batch_size=32,
    phase2_num_epochs=3,
    save_steps=500,
    logging_steps=10,
    report_to="wandb",
)

trainer = OpenLLaVATrainer(config)
trainer.train()

# Or train step-by-step
trainer.train_phase1()
trainer.train_phase2()
trainer.train_rl(method="dpo")   # optional Phase 3 preference alignment
```

### Training Modes

| Mode | Description | Memory Usage | Speed |
|:-----|:------------|:-------------|:------|
| `lora` | Low-Rank Adaptation | Low | Fast |
| `qlora` | 4-bit LoRA | Very Low | Moderate |
| `lora_plus` | LoRA with different LR for A/B matrices | Low | Fast |
| `dora` | Weight-Decomposed Low-Rank Adaptation | Low | Fast |
| `lora_ga` | LoRA with Gradient Approximation | Low | Moderate |
| `lora_fa` | LoRA with frozen A matrices (memory-efficient) | Low | Fast |
| `full_finetune` | Full parameter fine-tuning | High | Slow |
| `bitnet` | Ternary weight training (b1.58) | Very Low | Fast |
| `moe_lora` | Mixture-of-Experts with LoRA | Moderate | Moderate |
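
Most of these modes are exposed as chainable methods on `OpenLLaVA` (see the API reference below). A brief sketch using the documented signatures:

```python
# QLoRA: 4-bit quantized base weights with LoRA adapters
model.qlora(r=64, load_in_4bit=True)

# DoRA: weight-decomposed low-rank adaptation
model.dora(r=64, alpha=128)

# LoRA+: different learning rates for the A and B matrices
model.lora_plus(r=64, lr_ratio=16.0)

# BitNet b1.58: ternary weight training
model.bitnet()
```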

---

## Optimizations

OpenLLaVA ships with 40+ built-in optimizations covering training, inference, memory, and quantization.

> [!NOTE]
> All optimizations are opt-in and configurable. The framework applies sensible defaults based on hardware detection.

### Training Optimizations

| Optimization | Description | Hardware |
|:-------------|:------------|:---------|
| **FP8 Training** | Native FP8 forward/backward pass | H100 (Hopper) |
| **Padding-Free** | Variable-length sequences without padding | All |
| **Sequence Packing** | Pack multiple sequences per example | All |
| **Selective Checkpointing** | Activation checkpointing with heuristics | All |
| **CPU Offloading** | Async CPU offload for optimizer states | All |
| **GPU Memory Pooling** | Pre-allocated memory pool for tensors | CUDA |
| **torch.compile** | Full-graph compilation | All |
| **EMA** | Exponential Moving Average | All |
| **Curriculum Learning** | Progressive difficulty scheduling | All |

### Quantization

| Technique | Bits | Type | Use Case |
|:----------|:-----|:-----|:---------|
| **GPTQ** | 2-4 | Post-training | Inference speedup |
| **AWQ** | 4 | Post-training | Inference speedup |
| **FP8** | 8 | Training/Inference | H100 training |
| **FP4 (NVFP4)** | 4 | Inference | H100 inference |
| **QAT** | 2-8 | Training | Quantization-aware training |
| **torchao** | 2-8 | Post-training | Weight-only quantization |
| **NF4 (bitsandbytes)** | 4 | Training | QLoRA |
| **BitNet b1.58** | 1.58 | Training | Ternary weights |

### KV Cache Optimizations

| Optimization | Method | Memory Savings |
|:-------------|:-------|:---------------|
| **KV Quantization** | FP8/INT8 KV cache | 50% |
| **H2O Eviction** | Heavy Hitter Oracle policy | 20-50% |
| **SnapKV** | Snapshot-based eviction | 20-50% |
| **FastGen** | Generation-aware eviction | 20-40% |
| **WG Eviction** | Window-Guided eviction | 30-50% |
| **PackKV** | Cache compression | 50-75% |
| **SWAN** | Sliding Window Attention with cache | 40-60% |
| **Chunked Prefill** | Split long prompts into chunks | Variable |
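
The eviction policies above (H2O, SnapKV, FastGen, WG) are driven by attention statistics. For intuition, here is a minimal, library-independent sketch of the H2O idea (keep the positions that have accumulated the most attention plus a recent window, evict the rest); it is an illustration, not OpenLLaVA's implementation:

```python
import torch

def h2o_keep_indices(cum_attn: torch.Tensor, keep_heavy: int, keep_recent: int) -> torch.Tensor:
    """Return KV-cache positions to retain under an H2O-style eviction policy.

    cum_attn: [seq_len] cumulative attention mass each cached token has received.
    """
    seq_len = cum_attn.shape[0]
    cutoff = max(seq_len - keep_recent, 0)
    recent = torch.arange(cutoff, seq_len)                 # always keep the recent window
    k = min(keep_heavy, cutoff)
    heavy = torch.topk(cum_attn[:cutoff], k=k).indices     # heavy hitters outside that window
    return torch.sort(torch.cat([heavy, recent])).values
```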

### Speculative Decoding

| Method | Description | Speedup |
|:-------|:------------|:--------|
| **Eagle Draft** | Eagle-style draft model | 2-3x |
| **Medusa Heads** | Multi-head speculative decoding | 2-3x |
| **NGram Draft** | N-gram based draft model | 1.5-2x |
| **Tree Verification** | Parallel verification of draft tokens | 2-3x |
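
The n-gram drafter is the simplest of these: it proposes a continuation by matching the trailing (n-1)-gram earlier in the sequence, and the target model then verifies the draft in one forward pass, accepting the longest correct prefix. A minimal, library-independent sketch of the drafting step (not OpenLLaVA's implementation):

```python
from typing import List

def ngram_draft(tokens: List[int], n: int = 3, k: int = 4) -> List[int]:
    """Propose up to k draft tokens by matching the trailing (n-1)-gram earlier in the sequence."""
    if len(tokens) < n:
        return []
    key = tuple(tokens[-(n - 1):])
    for i in range(len(tokens) - n, -1, -1):      # scan backwards for the most recent match
        if tuple(tokens[i:i + n - 1]) == key:
            return tokens[i + n - 1:i + n - 1 + k]
    return []

# The trailing bigram (5, 6) also occurs at position 1, so we draft the tokens that followed it.
print(ngram_draft([4, 5, 6, 7, 8, 9, 5, 6], n=3, k=3))   # -> [7, 8, 9]
```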

### Other Optimizations

| Optimization | Description |
|:-------------|:------------|
| **torchao Sparsity** | Weight sparsification for inference |
| **MXFP8 MoE** | MXFP8 format for MoE layers |
| **VQ Codebook EMA** | YADIS VQ codebook EMA updates |
| **Fused Cross-Attention** | YADIS fused cross-attention |
| **Adaptive MoE Routing** | YADIS dynamic expert routing |
| **Split LoRA** | Split LoRA across devices |
| **GaLore** | Gradient Low-Rank Projection |
| **Mixed-Precision Quantization** | MicroMix per-layer precision |
| **Async I/O** | nvJPEG + async data loading |

```python
from openllava.optimizations import (
    compile_model,
    enable_fp8_training,
    gptq_quantize,
    EMAModel,
)

# Compile the model
model.model = compile_model(model.model, mode="max-autotune")

# Enable FP8 training (H100 only)
enable_fp8_training(model.model)

# GPTQ quantization
gptq_quantize(model.model, bits=4, dataset="c4")

# Enable EMA tracking
ema = EMAModel(model.model, decay=0.999)
```

---

## CLI Reference

The `openllava` CLI provides the following main commands.

```bash
openllava --help
```

### train

Train a vision-language model.

```bash
openllava train \
  --llm meta-llama/Llama-3-8B \
  --vision-encoder google/siglip2-so400m-patch14-384 \
  --phase1-dataset liuhaotian/LLaVA-Pretrain \
  --phase2-dataset liuhaotian/LLaVA-Instruct-150K \
  --output-dir ./checkpoints \
  --lora-r 64 \
  --lora-alpha 128 \
  --batch-size 128 \
  --learning-rate 1e-3 \
  --num-epochs 3 \
  --report-to wandb
```

Training modes:

```bash
# QLoRA (4-bit quantized)
openllava train --mode qlora --load-in-4bit

# BitNet (ternary weights)
openllava train --mode bitnet

# Full fine-tuning
openllava train --mode full_finetune

# MoE + LoRA
openllava train --mode moe_lora --num-experts 8
```

### serve

Launch an OpenAI-compatible inference server.

```bash
openllava serve openllava/yaki-8b --port 8000

# With advanced features
openllava serve openllava/yaki-8b \
  --port 8000 \
  --batch-size 64 \
  --max-seq-len 4096 \
  --paged-attention \
  --continuous-batching \
  --speculative-decoding \
  --kv-cache-dtype fp8
```

> [!TIP]
> The inference server supports all OpenAI SDK features: streaming, vision inputs, function calling, and structured JSON output.
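
For example, streaming works through the standard OpenAI SDK (reusing the client setup from the quickstart):

```python
from openai import OpenAI

client = OpenAI(api_key="openllava", base_url="http://localhost:8000/v1")

stream = client.chat.completions.create(
    model="yaki-8b",
    messages=[{"role": "user", "content": "Summarize the chart in one sentence."}],
    max_tokens=128,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```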

### export

Export a model to various formats.

```bash
# HuggingFace SafeTensors
openllava export openllava/yaki-8b --format safetensors --output ./export

# GGUF (for llama.cpp)
openllava export openllava/yaki-8b --format gguf --quant q4_k_m

# ONNX
openllava export openllava/yaki-8b --format onnx --output ./export

# vLLM
openllava export openllava/yaki-8b --format vllm

# MLX (Apple Silicon)
openllava export openllava/yaki-8b --format mlx
```

### benchmark

Benchmark model performance.

```bash
openllava benchmark openllava/yaki-8b

# Specific benchmarks
openllava benchmark openllava/yaki-8b \
  --throughput \
  --latency \
  --memory \
  --batch-sizes 1,8,32,64
```

### info

Display system and framework information.

```bash
openllava info
```

---

## API Reference

### OpenLLaVA

```python
class OpenLLaVA:
    def __init__(
        self,
        llm: str,
        vision_encoder: str = "google/siglip2-so400m-patch14-384",
        architecture: str = "llava",
        backend: Backend = Backend.AUTO,
        torch_dtype: Optional[torch.dtype] = None,
        device_map: str = "auto",
        attn_implementation: str = "flash_attention_2",
        trust_remote_code: bool = False,
    )

    # Training
    def lora(self, r: int = 64, alpha: int = 128, dropout: float = 0.05,
             target_modules: Optional[List[str]] = None) -> "OpenLLaVA":
    def dora(self, r: int = 64, alpha: int = 128, ...) -> "OpenLLaVA":
    def qlora(self, r: int = 64, ..., load_in_4bit: bool = True) -> "OpenLLaVA":
    def lora_plus(self, r: int = 64, lr_ratio: float = 16.0) -> "OpenLLaVA":
    def bitnet(self) -> "OpenLLaVA":
    def train(self, phase1: Optional[dict] = None,
              phase2: Optional[dict] = None, **kwargs):

    # RL Alignment
    def dpo(self, dataset: str, ..., beta: float = 0.1):
    def grpo(self, dataset: str, ..., group_size: int = 8):
    def orpo(self, dataset: str, ...):

    # Inference
    def generate(self, images: Union[str, List[str]], prompt: str,
                 **generate_kwargs) -> str:
    def chat(self, messages: List[dict], images: Optional[List[str]] = None,
             **generate_kwargs) -> str:

    # I/O
    def save(self, path: str, merge_lora: bool = False):
    def push(self, repo_id: str, merge_lora: bool = False):
    @classmethod
    def from_pretrained(cls, repo_id: str, **kwargs) -> "OpenLLaVA":
```

### FastVisionModel

```python
class FastVisionModel:
    @classmethod
    def from_pretrained(
        cls,
        model_id: str,
        max_seq_length: int = 2048,
        load_in_4bit: bool = False,
        load_in_8bit: bool = False,
        dtype: Optional[torch.dtype] = None,
        device_map: str = "auto",
        attn_implementation: str = "flash_attention_2",
    ) -> Tuple[nn.Module, AutoTokenizer]:

    @classmethod
    def get_peft_model(
        cls,
        model: nn.Module,
        r: int = 16,
        alpha: int = 32,
        target_modules: Optional[List[str]] = None,
        modules_to_save: Optional[List[str]] = None,
    ) -> nn.Module:

    @classmethod
    def for_training(cls, model: nn.Module):
    @classmethod
    def for_inference(cls, model: nn.Module):
```

### OpenLLaVATrainer

```python
class OpenLLaVATrainer:
    def __init__(self, config: TrainingConfig):
    def train(self):
    def train_phase1(self):
    def train_phase2(self):
    def train_rl(self, method: str = "dpo", **kwargs):
    def save(self, path: str):
    def push(self, repo_id: str):
    def evaluate(self, benchmarks: List[str] = ["scienceqa", "mmbench"]):
```

### InferenceEngine

```python
class OpenLLaVAInferenceEngine:
    def __init__(self, model_id: str, **kwargs):
    def generate(self, prompt: str, images: Optional[List[str]] = None,
                 max_tokens: int = 512, temperature: float = 0.7,
                 stream: bool = False) -> Union[str, Generator]:
    def chat(self, messages: List[dict], **kwargs) -> str:
    def get_stats(self) -> dict:
```
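
A minimal usage sketch based on the signature above (the import path is assumed from the project layout; adjust to the actual module):

```python
from openllava.inference import OpenLLaVAInferenceEngine  # import path assumed

engine = OpenLLaVAInferenceEngine("openllava/yaki-8b")

# Blocking generation
print(engine.generate("Describe this image.", images=["chart.png"], max_tokens=256))

# Streaming generation (returns a generator; chunk format assumed to be text)
for chunk in engine.generate("Describe this image.", images=["chart.png"], stream=True):
    print(chunk, end="", flush=True)
```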

### Server

```python
from openllava.serve import OpenLLaVAServer

server = OpenLLaVAServer(
    model_id="openllava/yaki-8b",
    host="0.0.0.0",
    port=8000,
    api_key="sk-openllava",           # Optional auth
    rate_limit=100,                    # Requests per minute
    continuous_batching=True,
    paged_attention=True,
)

server.run()
```

---

## Distributed Training

OpenLLaVA supports a comprehensive distributed training stack spanning multiple parallelism strategies.

> [!WARNING]
> Distributed training requires a cluster with high-speed interconnects (NVLink, InfiniBand, or RoCE). The framework auto-detects topology and recommends optimal strategies.

### Parallelism Strategy Comparison

| Strategy | Description | When to Use |
|:---------|:------------|:------------|
| **FSDP2** | Fully Sharded Data Parallel | Single-node multi-GPU |
| **DeepSpeed ZeRO-1** | Optimizer state partitioning | Large models, moderate speedup |
| **DeepSpeed ZeRO-2** | Optimizer + gradient partitioning | Large models, good speedup |
| **DeepSpeed ZeRO-3** | Full parameter partitioning | Very large models (>13B) |
| **Tensor Parallel (1D)** | Split tensors across GPUs | >13B, high-bandwidth interconnect |
| **Tensor Parallel (2D/3D)** | 2D/3D tensor sharding | Very large models, multi-node |
| **Pipeline Parallel** | Layer-level partitioning | Multi-node, deep models |
| **Expert Parallel** | Distribute MoE experts | MoE models |
| **Ring Attention** | Sequence parallelism | Long context (>32K) |
| **Heterogeneous** | GPU+CPU+TPU mixed | Resource-constrained environments |

### FSDP2

```python
from openllava import OpenLLaVA
from openllava.distributed import FSDPConfig

config = FSDPConfig(
    sharding_strategy="hybrid",
    cpu_offload=False,
    mixed_precision="bf16",
    activation_checkpointing=True,
    limit_all_gathers=True,
)

model = OpenLLaVA(
    llm="meta-llama/Llama-3-8B",
    vision_encoder="google/siglip2-so400m-patch14-384",
)

model.train(
    phase2=dict(dataset="liuhaotian/LLaVA-Instruct-150K"),
    distributed="fsdp",
    fsdp_config=config,
)
```

### DeepSpeed ZeRO

```python
from openllava.distributed import DeepSpeedConfig

config = DeepSpeedConfig(
    zero_stage=3,
    offload_optimizer="cpu",
    offload_params="nvme",
    gradient_accumulation_steps=4,
    gradient_clipping=1.0,
    communication_dtype="bf16",
)
```
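
Wiring this config into training is sketched below by analogy with the FSDP example above; the `distributed` value and `deepspeed_config` keyword are assumptions, not confirmed API:

```python
# Assumed kwargs, mirroring the FSDP example above.
model.train(
    phase2=dict(dataset="liuhaotian/LLaVA-Instruct-150K"),
    distributed="deepspeed",      # assumed value
    deepspeed_config=config,      # assumed kwarg name
)
```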

### Auto-Parallelism

```python
from openllava.distributed import auto_parallel
from openllava.utils import HardwareDetector

detector = HardwareDetector()
topology = detector.detect_topology()

strategy = auto_parallel(
    model_size=8_000_000_000,    # 8B parameters
    hardware=topology,
    memory_budget_gb=80,
    target_throughput=1000,       # tokens per second
)

print(f"Recommended strategy: {strategy.name}")
print(f"World size: {strategy.world_size}")
print(f"Strategy config: {strategy.config}")
```

---

## RL Alignment

OpenLLaVA supports four RL alignment methods for post-training preference optimization.

| Method | Description | Use Case |
|:-------|:------------|:---------|
| **DPO** | Direct Preference Optimization | Binary preference pairs |
| **GRPO** | Group Relative Policy Optimization | Multi-response ranking |
| **ORPO** | Odds Ratio Preference Optimization | Preference optimization without reference model |
| **PPO** | Proximal Policy Optimization | Full RLHF pipeline with reward model |

```python
# DPO
model.dpo(
    dataset="your-dpo-dataset",
    beta=0.1,
    learning_rate=5e-6,
    batch_size=16,
)

# GRPO
model.grpo(
    dataset="your-grpo-dataset",
    group_size=8,
    learning_rate=1e-6,
)

# ORPO
model.orpo(
    dataset="your-orpo-dataset",
    lambda_weight=0.5,
    learning_rate=1e-6,
)
```

### Reward Functions

```python
from openllava.rl.rewards import (
    ExactMatchReward,
    F1Reward,
    FormatReward,
    SafetyReward,
    CompositeReward,
)

reward_fn = CompositeReward([
    ExactMatchReward(target="expected_answer"),
    FormatReward(pattern=r"```.*```"),
    SafetyReward(),
])
```
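
Hypothetically, the composite reward could be passed to GRPO like this (the `reward_fn` keyword is an assumption; check the `grpo` signature in the API reference):

```python
# Assumed: model.grpo accepts a reward callable alongside the dataset.
model.grpo(
    dataset="your-grpo-dataset",
    group_size=8,
    learning_rate=1e-6,
    reward_fn=reward_fn,   # assumed kwarg name
)
```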

---

## Export and Deployment

### Model Export Formats

| Format | Use Case | Tool |
|:-------|:---------|:-----|
| **SafeTensors** | HuggingFace Hub, PyTorch | `openllava export` |
| **GGUF** | llama.cpp, Ollama, local CPU inference | `openllava export --format gguf` |
| **ONNX** | ONNX Runtime, cross-platform inference | `openllava export --format onnx` |
| **vLLM** | High-throughput production serving | `openllava export --format vllm` |
| **MLX** | Apple Silicon inference | `openllava export --format mlx` |

```python
from openllava.export import export_to_gguf, export_to_onnx, push_to_hub

# Export to GGUF
export_to_gguf(model, output_path="./model.gguf", quant="q4_k_m")

# Export to ONNX
export_to_onnx(model, output_path="./model.onnx")

# Push to HuggingFace Hub
model.push("openllava/yaki-8b", private=False)

# Or push a local checkpoint directory with the helper
push_to_hub(
    repo_id="openllava/yaki-8b",
    local_path="./checkpoints",
    commit_message="Release Yaki-8B v1",
)
```

### LoRA Merge

```python
from openllava.export import merge_lora_weights

# Merge LoRA weights into base model
model = merge_lora_weights(model)
model.save("./merged-model")
model.push("my-org/my-model-merged")
```

---

## Evaluation

OpenLLaVA integrates with standard multimodal benchmarks.

```python
from openllava.eval import EvalRunner

runner = EvalRunner(
    model=model,
    benchmarks=["scienceqa", "mmbench", "textvqa"],
    batch_size=16,
)

results = runner.run()
print(results)

# Example results per benchmark (illustrative)
{
    "scienceqa": {"accuracy": 0.912, "samples": 4241},
    "mmbench": {"accuracy": 0.763, "samples": 2975},
    "textvqa": {"accuracy": 0.684, "samples": 5000},
}
```

```bash
openllava eval \
  --model openllava/yaki-8b \
  --benchmarks scienceqa,mmbench,textvqa \
  --batch-size 16
```

---

## Configuration

### Training Configuration

```python
from openllava.api import TrainingConfig

config = TrainingConfig(
    # Phase 1
    phase1_dataset="liuhaotian/LLaVA-Pretrain",
    phase1_learning_rate=1e-3,
    phase1_batch_size=128,
    phase1_max_samples=100_000,

    # Phase 2
    phase2_dataset="liuhaotian/LLaVA-Instruct-150K",
    phase2_learning_rate=2e-4,
    phase2_batch_size=32,
    phase2_num_epochs=3,

    # LoRA
    lora_r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    lora_target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],

    # Optimization
    optim="adamw_torch",
    warmup_ratio=0.03,
    weight_decay=0.0,
    gradient_accumulation_steps=1,
    max_grad_norm=1.0,

    # Precision
    torch_dtype="bfloat16",
    load_in_4bit=False,

    # Checkpointing
    output_dir="./checkpoints",
    save_steps=500,
    save_total_limit=5,
    logging_steps=10,
    report_to="wandb",

    # Distributed
    distributed_strategy="fsdp",
    fsdp_sharding_strategy="hybrid",
    deepspeed_zero_stage=3,
)
```

### Environment Variables

| Variable | Default | Description |
|:---------|:--------|:------------|
| `CUDA_VISIBLE_DEVICES` | — | GPU device IDs |
| `OPENLLAVA_BACKEND` | auto | Force backend selection |
| `OPENLLAVA_CACHE_DIR` | ~/.cache/openllava | Cache directory |
| `OPENLLAVA_NO_CUDA` | false | Disable CUDA detection |
| `HF_TOKEN` | — | HuggingFace Hub token |
| `WANDB_API_KEY` | — | Weights & Biases key |
| `PJRT_DEVICE` | — | TPU device type |

---

## Backends

OpenLLaVA supports six hardware backends with automatic device detection and operation routing.

### CUDA (NVIDIA)

```python
from openllava import Backend

model = OpenLLaVA(llm="...", backend=Backend.CUDA)
```

Optimized for NVIDIA Ampere (A100/A30), Ada Lovelace (RTX 4090), and Hopper (H100) architectures. Uses FlashAttention-2, FP8 training on H100, and CUDA graphs for reduced kernel launch overhead.

> [!IMPORTANT]
> CUDA 11.8 or later is required. Ampere or newer architecture recommended. FlashAttention-2 is auto-enabled when supported.

### ROCm (AMD)

```python
model = OpenLLaVA(llm="...", backend=Backend.ROCM)
```

Supports AMD MI250, MI300X, and RX 7000 series GPUs. Uses ROCm-aware Triton kernels and the Composable Kernel library for optimized matmul and attention.

### CPU FP32

```python
model = OpenLLaVA(llm="...", backend=Backend.CPU_FP32)
```

Falls back to FP32 computation with SIMD-optimized kernels (AVX-512, AVX2, NEON). Suitable for CPU-only inference and development environments.

### TPU (Google)

```python
model = OpenLLaVA(llm="...", backend=Backend.TPU)
```

Requires `torch_xla` and `jax`. Supports TPU v3-v5 with SPMD (Single Program Multiple Data) for model parallelism.

### MLX (Apple Silicon)

```python
model = OpenLLaVA(llm="...", backend=Backend.MLX)
```

Requires `mlx` and `mlx-lm`. Optimized for Apple M1-M4 series with unified memory architecture.

### XPU (Intel)

```python
model = OpenLLaVA(llm="...", backend=Backend.XPU)
```

Supports Intel Arc A-series and Data Center GPU Max Series via `intel-extension-for-pytorch`.

### Heterogeneous

```python
model = OpenLLaVA(llm="...", backend=Backend.HETEROGENEOUS)
```

Distributes model layers across multiple device types (e.g., GPU + CPU + TPU) for resource-constrained environments.

---

## Performance

### Training Throughput (tokens/second, BF16)

| Model | GPU | LoRA | Full FT |
|:------|:----|:-----|:--------|
| LLaVA-7B (Llama-2) | 1x A100-80GB | 2,850 | 1,240 |
| LLaVA-13B (Vicuna) | 1x A100-80GB | 1,620 | 680 |
| LLaVA-7B | 8x A100-80GB (FSDP) | 21,400 | 9,600 |
| LLaVA-13B | 8x A100-80GB (FSDP) | 12,800 | 5,400 |

### Inference Latency (first token, ms)

| Model | GPU | FlashAttn | PagedAttn | Speculative |
|:------|:----|:----------|:----------|:------------|
| Yaki-7B | A100-80GB | 45 | 38 | 22 |
| Yaki-7B | RTX 4090 | 38 | 32 | 18 |
| Yaki-13B | A100-80GB | 72 | 61 | 35 |
| Yaki-13B | 2x A100 (TP) | 40 | 34 | 20 |

### Memory Usage (GB, Yaki-7B)

| Configuration | Peak Memory | Notes |
|:--------------|:------------|:------|
| FP32 Full FT | 56.2 | Not recommended |
| BF16 Full FT | 28.8 | Recommended |
| BF16 LoRA (r=64) | 18.4 | Default |
| FP16 QLoRA (4-bit) | 10.2 | Memory-constrained |
| BitNet b1.58 | 6.8 | Maximum efficiency |

---

## Project Structure

```
openllava/
├── openllava/                    # Main Python package
│   ├── core/                     # Core model, backend, patcher
│   ├── api/                      # High-level FastModel + Trainer API
│   ├── cli/                      # Click-based CLI (train, serve, export, benchmark)
│   ├── data/                     # Dataset loading, templates, collators, streaming
│   ├── training/                 # LoRA variants, BitNet, DoRA, checkpointing
│   ├── rl/                       # RL alignment (DPO, GRPO, ORPO, PPO)
│   ├── inference/                # Inference engine, continuous batching, PagedAttention
│   ├── serve/                    # FastAPI OpenAI-compatible server
│   ├── optimizations/            # 40+ optimizations (FP8, KV cache, quantization, etc.)
│   ├── experts/                  # Mixture-of-Experts layers and training
│   ├── distributed/              # FSDP, DeepSpeed, TP, PP, EP, ring attention
│   ├── backends/                 # CUDA, ROCm, MLX, TPU, XPU, CPU, ONNX, GGUF
│   ├── kernels/                  # Triton kernels + CUDA graphs
│   │   ├── triton/               # Fused attention, RoPE, SwiGLU, RMSNorm, etc.
│   │   └── cuda_graphs/          # CUDA graph capture
│   ├── export/                   # GGUF, ONNX, SafeTensors, vLLM, MLX export
│   ├── eval/                     # ScienceQA, MMBench, TextVQA benchmarks
│   └── utils/                    # Hardware detection, profiling, model cards
├── csrc/                         # C++/CUDA/CPU native extensions
│   ├── gpu/                      # CUDA kernels (projector, cross-attention, VQ)
│   ├── cpu/                      # CPU fallbacks (offload, quantization, GGUF)
│   └── tpu/                      # TPU XLA backend
├── setup.py                      # Python packaging + CMake extension build
├── pyproject.toml                # Project configuration
├── CMakeLists.txt                # C++/CUDA build system
└── LICENSE                       # Apache 2.0
```

---

## License

OpenLLaVA is licensed under the [Apache License 2.0](LICENSE).

```
Copyright (c) 2024-2026 OpceanAI

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

---

<p align="center">
  <strong>OpenLLaVA</strong> — Vision injection for every language model.
</p>

<p align="center">
  Built by <a href="https://github.com/OpceanAI">OpceanAI Research Team</a>
</p>
