Metadata-Version: 2.4
Name: qwerky-vllm-models
Version: 0.2.6
Summary: vLLM plugin for Qwerky AI MambaInLlama hybrid models
Author-email: Qwerky AI <contact@qwerky.ai>
License: Apache-2.0
Project-URL: Homepage, https://github.com/qwerkyai/qwerky-vllm-models
Project-URL: Repository, https://github.com/qwerkyai/qwerky-vllm-models
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.40.0
Requires-Dist: vllm>=0.14.0
Requires-Dist: einops>=0.7.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"

# Qwerky vLLM Models

A vLLM plugin for serving Qwerky AI's MambaInLlama hybrid models without the `--trust-remote-code` flag.

## Installation

```bash
pip install vllm qwerky-vllm-models
```

## Usage

After installing, serve Qwerky models with vLLM:

```bash
vllm serve QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill --max-model-len 4096
```
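Once the server is up, it exposes vLLM's OpenAI-compatible API (on port 8000 by default). A minimal query with the `openai` Python client might look like the following; the prompt and sampling settings are illustrative:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on http://localhost:8000/v1 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill",
    prompt="The Mamba architecture replaces attention with",
    max_tokens=32,
)
print(completion.choices[0].text)
```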

The plugin automatically registers the model architecture with vLLM on import.
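Because registration happens through the plugin hook rather than the CLI, offline (in-process) inference works the same way. A minimal sketch using vLLM's `LLM` API, with illustrative prompt and sampling parameters:

```python
from vllm import LLM, SamplingParams

# The plugin registers the architecture when vLLM loads its plugins,
# so no trust_remote_code argument is needed here.
llm = LLM(
    model="QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill",
    max_model_len=4096,
)

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["State-space models differ from attention in that"], params)
print(outputs[0].outputs[0].text)
```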

## Supported Models

- `QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill`

## How It Works

This package uses vLLM's plugin system (the `vllm.general_plugins` entry point) to register the MambaInLlama model architecture; a sketch of the registration hook follows the list below. In practice, this means:

- No fork of vLLM required
- No `--trust-remote-code` flag needed
- Works with standard vLLM installation
- Uses vLLM's native Triton-accelerated Mamba kernels
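The entry point resolves to a registration function that vLLM calls at startup. Roughly, it looks like the sketch below; the architecture string and module path (`MambaInLlamaForCausalLM`, `qwerky_vllm_models.modeling`) are illustrative placeholders, not the package's actual identifiers:

```python
from vllm import ModelRegistry

def register() -> None:
    """Entry point declared under `vllm.general_plugins` in the package metadata."""
    # Map the architecture name from the model's HF config.json to the
    # plugin's model class, referenced lazily as "module:ClassName".
    if "MambaInLlamaForCausalLM" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model(
            "MambaInLlamaForCausalLM",
            "qwerky_vllm_models.modeling:MambaInLlamaForCausalLM",
        )
```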

## Requirements

- Python >= 3.10
- vLLM >= 0.14.0
- PyTorch >= 2.0.0

## Changelog

### 0.2.6
- Added the `embed_input_ids` method required by vLLM's `VllmModelForTextGeneration` interface
- The missing method was the root cause of the "This model does not support `--runner generate`" error

### 0.2.5
- Fixed vLLM runner detection: added `MambaInLlamaMambaForCausalLM` alias for HF config compatibility
- Added proper protocol inheritance (`HasInnerState`, `IsHybrid`) from `vllm.model_executor.models.interfaces`
- Fixed class variable type hints (`ClassVar[Literal[True]]`) for vLLM model inspection
- Simplified model registration code

### 0.2.4
- Complete architecture rewrite with explicit state cache management
- Separate prefill and decode paths for Mamba layers
- Grouped-head Mamba support (`num_xb_head`, `num_C_head`, `repeat_group`)
- Pure PyTorch SSM implementation (preparing for vLLM Triton op integration)

### 0.2.3
- Fixed `d_xb` default value computation in configuration
- Removed unsupported `device`/`dtype` kwargs from RMSNorm calls

### 0.2.2
- Fixed vLLM 0.14+ compatibility issues with Mamba ops API

### 0.2.1
- Updated README, removed SFT model reference

### 0.2.0
- Initial public release with vLLM plugin system integration

## License

Apache 2.0
