Metadata-Version: 2.4
Name: granite-switch
Version: 0.0.6
Summary: Granite Switch: Composable model building
License-Expression: Apache-2.0
Project-URL: Repository, https://github.com/generative-computing/granite-switch
Project-URL: Documentation, https://github.com/generative-computing/granite-switch/tree/main/tutorials
Requires-Python: <3.14,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.10.0
Requires-Dist: transformers>=5.5.1
Provides-Extra: hf
Requires-Dist: accelerate>=0.20.0; extra == "hf"
Provides-Extra: vllm
Requires-Dist: vllm<0.20.0,>=0.19.1; extra == "vllm"
Provides-Extra: vllm20
Requires-Dist: vllm<0.21.0,>=0.20.0; extra == "vllm20"
Provides-Extra: compose
Requires-Dist: huggingface_hub; extra == "compose"
Requires-Dist: pyyaml; extra == "compose"
Requires-Dist: tqdm; extra == "compose"
Requires-Dist: safetensors; extra == "compose"
Provides-Extra: build
Requires-Dist: huggingface_hub; extra == "build"
Requires-Dist: pyyaml; extra == "build"
Requires-Dist: tqdm; extra == "build"
Requires-Dist: safetensors; extra == "build"
Provides-Extra: tutorials
Requires-Dist: granite-switch[compose,hf,vllm]; extra == "tutorials"
Requires-Dist: chromadb>=0.4.0; extra == "tutorials"
Requires-Dist: httpx>=0.24.0; extra == "tutorials"
Requires-Dist: requests>=2.31.0; extra == "tutorials"
Requires-Dist: rich>=13.0.0; extra == "tutorials"
Requires-Dist: mellea>=0.1.0; extra == "tutorials"
Requires-Dist: ipython>=8.10.0; extra == "tutorials"
Requires-Dist: python-dotenv>=1.0.0; extra == "tutorials"
Provides-Extra: dev
Requires-Dist: pytest; extra == "dev"
Requires-Dist: granite-switch[compose,hf,vllm]; extra == "dev"
Provides-Extra: dev-vllm20
Requires-Dist: pytest; extra == "dev-vllm20"
Requires-Dist: granite-switch[compose,hf,vllm20]; extra == "dev-vllm20"
Provides-Extra: test
Requires-Dist: pytest; extra == "test"
Requires-Dist: bitsandbytes; extra == "test"
Requires-Dist: optimum-quanto; extra == "test"
Dynamic: license-file

# Granite Switch — Build AI models like you build software

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

| [**Browse Adapters**](https://huggingface.co/collections/ibm-granite/granite-libraries) | [Models on HF](https://huggingface.co/ibm-granite/granite-switch-4.1-8b-preview) | [Tutorials](tutorials/README.md) |

Most AI models are monolithic — all capabilities baked into one set of weights. Granite Switch lets you compose a model from independent, task-specific components: pick the capabilities you need, compose a single checkpoint in minutes, then swap or upgrade individual components as your needs change.

Browse available libraries in the [Granite Libraries collection](https://huggingface.co/collections/ibm-granite/granite-libraries) on Hugging Face.

## Key Features

- **Composable** — Combine independently developed adapters into one checkpoint, whether IBM's or yours. Swap, upgrade, or customize without retraining.
- **Fast** — Built on IBM's Activated LoRA technology for efficient KV cache reuse, low latency, and [high inference throughput](https://github.com/lastras/granite-switch/tree/alora-vs-lora-race/tutorials/alora_vs_lora_race).
- **Accurate** — Task-specific adapters can match and even surpass the accuracy of significantly larger generalist models, while requiring only a fraction of the serving cost. See the [adapter catalog](https://generative-computing.github.io/granite-switch/adapter_catalog.html#hallucination-detection) for benchmark comparisons across all 12 adapters.
- **Inference-ready** — Support for Hugging Face and vLLM.

## Quick Start

### Install

```bash
python -m venv venv && source venv/bin/activate

# Pick the extra that matches your use case:
pip install "granite-switch[compose]"   # Compose modular models
pip install "granite-switch[hf]"        # HuggingFace inference
pip install "granite-switch[vllm]"      # vLLM production inference (0.19.x)
pip install "granite-switch[vllm20]"    # vLLM 0.20+ (requires CUDA 13+)
pip install "granite-switch[dev]"       # Everything (uses vLLM 0.19.x by default)
pip install "granite-switch[dev-vllm20]" # Dev environment with vLLM 0.20+
```

Requires Python 3.10 through 3.13 and PyTorch 2.10+.
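
The choice between the `vllm` and `vllm20` extras can be scripted. The sketch below is illustrative only: `python_supported`, `pick_vllm_extra`, and `install_command` are hypothetical helper names, not part of granite-switch. It assumes the package's `Requires-Python` bound (`>=3.10,<3.14`) and the CUDA 13+ requirement noted for vLLM 0.20, and it expects you to supply the CUDA major version yourself.

```python
import sys


def python_supported(version=sys.version_info[:2]):
    """Return True when (major, minor) satisfies >=3.10,<3.14."""
    return (3, 10) <= tuple(version) < (3, 14)


def pick_vllm_extra(cuda_major):
    """Hypothetical helper: map a CUDA major version to an extra name."""
    # Assumption taken from the install comments: vLLM 0.20+ needs CUDA 13+.
    return "vllm20" if cuda_major >= 13 else "vllm"


def install_command(cuda_major):
    """Build the pip command for the chosen vLLM extra."""
    return f'pip install "granite-switch[{pick_vllm_extra(cuda_major)}]"'


print("Python supported:", python_supported())
print(install_command(12))
print(install_command(13))
```

How you detect the CUDA version (for example from `nvidia-smi` output) is left out on purpose, since it varies by environment.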

> **vLLM version note:** This project currently defaults to vLLM 0.19.x (0.19.1 or later) because
> vLLM 0.20 depends on CUDA 13.0+ (via PyTorch 2.11), which is incompatible with many existing
> environments running CUDA 12.x drivers. Use the `vllm20` extra if your environment supports CUDA 13+.

### Compose a Model

Compose a base Granite model with adapter libraries into a single deployable checkpoint:

```bash
python -m granite_switch.composer.compose_granite_switch \
  --base-model ibm-granite/granite-4.1-3b \
  --adapters ibm-granite/granitelib-core-r1.0 ibm-granite/granitelib-rag-r1.0 ibm-granite/granitelib-guardian-r1.0 \
  --output ./my-model
```

Use the **[Adapter Composer](https://generative-computing.github.io/granite-switch/adapter_catalog.html)** to browse available adapters, compare benchmarks, and generate a ready-to-run compose command.

This downloads the base model, embeds compatible LoRA adapters (with a preference towards activated LoRA), adds control tokens and a chat template, and produces a model directory that works with both HuggingFace and vLLM.
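
If you compose models from a script rather than a shell, the same command can be assembled as an argument list. This is a minimal sketch that uses only the three flags shown above (`--base-model`, `--adapters`, `--output`); `build_compose_command` is a hypothetical helper name, and the snippet prints the command instead of running it.

```python
import shlex
import sys


def build_compose_command(base_model, adapters, output_dir):
    """Hypothetical helper: assemble the argv list for the compose CLI."""
    return [
        sys.executable, "-m", "granite_switch.composer.compose_granite_switch",
        "--base-model", base_model,
        "--adapters", *adapters,
        "--output", output_dir,
    ]


adapters = [
    "ibm-granite/granitelib-core-r1.0",
    "ibm-granite/granitelib-rag-r1.0",
    "ibm-granite/granitelib-guardian-r1.0",
]
cmd = build_compose_command("ibm-granite/granite-4.1-3b", adapters, "./my-model")

# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```

Passing a list rather than a single string avoids shell quoting problems when a path contains spaces.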

For convenience, pre-composed Granite Switch models for the Granite 4.1 family are available on Hugging Face:

- [ibm-granite/granite-switch-4.1-3b-preview](https://huggingface.co/ibm-granite/granite-switch-4.1-3b-preview)
- [ibm-granite/granite-switch-4.1-8b-preview](https://huggingface.co/ibm-granite/granite-switch-4.1-8b-preview)
- [ibm-granite/granite-switch-4.1-30b-preview](https://huggingface.co/ibm-granite/granite-switch-4.1-30b-preview)


### Run Inference

**vLLM + Mellea (recommended):**

```bash
pip install mellea
# Example with the 3B model
python -m vllm.entrypoints.openai.api_server --model ibm-granite/granite-switch-4.1-3b-preview --port 8000
```

```python
from mellea.backends.openai import OpenAIBackend
from mellea.stdlib.components.intrinsic import rag
from mellea.stdlib.context import ChatContext

backend = OpenAIBackend(
    model_id="ibm-granite/granite-switch-4.1-3b-preview",
    base_url="http://localhost:8000/v1",
    api_key="unused",
)
backend.register_embedded_adapter_model("ibm-granite/granite-switch-4.1-3b-preview")

query = "I want to ask you something. what is...mmmm the the main city(capital you call it,right?) of France?"
ctx = ChatContext()

rewritten = rag.rewrite_question(query, ctx, backend)
print(f"original:  {query}")
print(f"rewritten: {rewritten}")
# => "What is the capital of France?"
```

**HuggingFace:**

```python
import granite_switch.hf  # Register HF backend

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-switch-4.1-3b-preview", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("ibm-granite/granite-switch-4.1-3b-preview")

messages = [{"role": "user", "content": "What is the capital of France?"}]
documents = [{"doc_id": "1", "text": "Paris is the capital of France."}]

prompt = tokenizer.apply_chat_template(
    messages,
    documents=documents,
    adapter_name="answerability",  # activates the answerability adapter
    add_generation_prompt=True,
    tokenize=False,
)
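inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))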
# => "answerable"
```

## How It Works

Granite Switch uses a **switch layer**: a small attention-based mechanism that reads control tokens from the input and determines which adapter's LoRA weights to apply at each position.

**What makes composition work:**

- **KV cache normalization** — each adapter sees only the base model's KV cache, never another adapter's internal state
- **No joint training required** — adapters are developed, tested, and published independently
- **Standard inference** — The entire model loads in vLLM with zero code changes
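
To make the per-position idea concrete, here is a toy sketch in plain Python. It is not the Granite Switch implementation: it replaces the attention-based switch with a simple left-to-right scan, and the control token `<rag>`, the adapter name, and all weights are made up for illustration. Positions before the control token use only the base weights; positions from the control token onward add a low-rank update `B (A x)`.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def route(tokens, control_tokens):
    """Per position, the adapter chosen by the latest control token (None = base)."""
    active, routing = None, []
    for tok in tokens:
        if tok in control_tokens:
            active = control_tokens[tok]
        routing.append(active)
    return routing


def project(x, base_w, adapter=None):
    """Base projection W x, plus B (A x) when an adapter is active."""
    y = matvec(base_w, x)
    if adapter is not None:
        a, b = adapter
        delta = matvec(b, matvec(a, x))
        y = [yi + di for yi, di in zip(y, delta)]
    return y


# Hypothetical control token and a toy rank-1 adapter (A is 1x2, B is 2x1).
CONTROL = {"<rag>": "rag"}
ADAPTERS = {"rag": ([[1.0, 1.0]], [[0.5], [0.0]])}
BASE_W = [[1.0, 0.0], [0.0, 1.0]]

tokens = ["What", "is", "<rag>", "answer"]
routing = route(tokens, CONTROL)
x = [1.0, 2.0]  # the same toy hidden state at every position
outputs = [project(x, BASE_W, ADAPTERS.get(name)) for name in routing]

for tok, name, y in zip(tokens, routing, outputs):
    print(tok, name, y)
```

Because the prefix before the control token is computed with base weights only, its results do not depend on which adapter is selected later, which is the intuition behind reusing the base model's KV cache across adapters.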

## Documentation

For detailed tutorials and many working examples, see the [Tutorials](tutorials/README.md) section.

## IBM ❤️ Open Source AI

Granite Switch was started by IBM Research.

## License

Granite Switch has an Apache-2.0 license, as found in the [LICENSE](LICENSE) file.
