Metadata-Version: 2.4
Name: qwerky-vllm-models
Version: 0.2.23
Summary: vLLM plugin for Qwerky AI MambaInLlama hybrid models
Author-email: Qwerky AI <contact@qwerky.ai>
License: Apache-2.0
Project-URL: Homepage, https://github.com/qwerkyai/qwerky-vllm-models
Project-URL: Repository, https://github.com/qwerkyai/qwerky-vllm-models
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: torch>=2.0.0
Requires-Dist: transformers>=4.40.0
Requires-Dist: vllm>=0.14.0
Requires-Dist: einops>=0.7.0
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-asyncio>=0.21.0; extra == "dev"

# Qwerky vLLM Models

A vLLM plugin for serving Qwerky AI's MambaInLlama hybrid models without the `--trust-remote-code` flag.

## Installation

```bash
pip install vllm qwerky-vllm-models
```

## Usage

After installing, serve Qwerky models with vLLM:

```bash
vllm serve QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill --max-model-len 4096
```

The plugin automatically registers the model architecture with vLLM on import.

## Supported Models

- `QwerkyAI/Qwerky-Llama3.2-Mamba-3B-Llama3.3-70B-base-distill`

## How It Works

This package uses vLLM's plugin system (`vllm.general_plugins` entry point) to register the MambaInLlama model architecture. This means:

- No fork of vLLM required
- No `--trust-remote-code` flag needed
- Works with standard vLLM installation
- Uses vLLM's native Triton-accelerated Mamba kernels
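
For reference, a vLLM general plugin is wired up through the `vllm.general_plugins` entry-point group in the package metadata, roughly as below. The module path and function name here are illustrative, not this package's actual internals; vLLM calls the registered function at startup, which can then register the architecture (e.g. via `vllm.ModelRegistry.register_model`):

```toml
[project.entry-points."vllm.general_plugins"]
# vLLM discovers and invokes this callable before loading any model,
# so the MambaInLlama architecture is known without --trust-remote-code.
register_qwerky_models = "qwerky_vllm_models:register"
```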

## Requirements

- Python >= 3.10
- vLLM >= 0.14.0
- PyTorch >= 2.0.0

## Changelog

### 0.2.23
- **CRITICAL FIX**: Wrong `in_proj` split order that caused gibberish output
- The reference implementation splits as `[z(d_inner), x(d_xb), B(d_xb), C(d_inner), dt(dt_rank)]`
- Our code incorrectly used `[z(d_inner), x(d_inner), B(d_xb), C(d_xb), dt(dt_rank)]`
- `x` is `d_xb`-sized (needs `repeat_kv` expansion); `C` is `d_inner`-sized (already full width)
- Fixed `_prefill` and `_decode_step` to handle the `x`/`C` dimensions correctly
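
The corrected split order can be sketched in plain Python (illustrative sizes only; the real code splits torch tensors along the projection dimension):

```python
def split_in_proj(vec, d_inner, d_xb, dt_rank):
    """Split a fused in_proj output row in the reference order:
    [z(d_inner), x(d_xb), B(d_xb), C(d_inner), dt(dt_rank)].
    Versions before 0.2.23 swapped the x and C widths."""
    sizes = [d_inner, d_xb, d_xb, d_inner, dt_rank]
    assert len(vec) == sum(sizes), "in_proj output width mismatch"
    parts, start = [], 0
    for size in sizes:
        parts.append(vec[start:start + size])
        start += size
    return parts  # z, x, B, C, dt

# Hypothetical tiny sizes for illustration, not the shipped checkpoint's:
d_inner, d_xb, dt_rank = 8, 2, 3
z, x, B, C, dt = split_in_proj(list(range(23)), d_inner, d_xb, dt_rank)
print(len(z), len(x), len(B), len(C), len(dt))  # 8 2 2 8 3
```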

### 0.2.22
- **FIX**: Double bias in Mamba dt computation (partial fix; not the root cause)
- Removed redundant bias addition in `_ssm_scan` and `_ssm_state_update`

### 0.2.21
- **FIX**: Dtype mismatch in rotary position embeddings
- Cast cos/sin to match q's dtype before applying rotation
- Fixes `RuntimeError: expected scalar type Float but found BFloat16` in Q×K matmul

### 0.2.20
- **FIX**: Dtype mismatch in attention matmul
- After softmax (computed in float32), convert to `v.dtype` instead of `q.dtype`
- Fixes `RuntimeError: expected scalar type Float but found BFloat16`

### 0.2.19
- **FIX**: Handle vLLM warmup where seq_len exceeds KV cache size
- During warmup/autotune, `max_num_batched_tokens=8192` but cache only holds 2048
- Skip KV caching when tokens don't fit, allowing warmup to complete

### 0.2.18
- Added extensive debug logging to diagnose attention layer shape issue
- Logs: input shape, batch_size, seq_len, Q/K/V shapes, rotary output, KV cache shapes

### 0.2.17
- Added debug logging in MHADecoderLayer to trace tensor shapes

### 0.2.16
- Fixed attention layer to handle vLLM's flattened 2D tensor format
- vLLM passes `[total_tokens, hidden]` but attention needs `[batch, seq, hidden]`
- Added automatic batch dimension handling in MHADecoderLayer
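
The reshaping described above can be sketched with plain Python lists standing in for tensors (a simplified illustration; the real code reshapes torch tensors and must also track vLLM's per-sequence metadata):

```python
def add_batch_dim(flat, batch_size):
    """Sketch of the 0.2.16 fix: vLLM hands the layer a flattened
    [total_tokens, hidden] tensor, but the attention math expects
    [batch, seq, hidden]. Here `flat` is a list of `hidden`-vectors."""
    total_tokens = len(flat)
    assert total_tokens % batch_size == 0, "tokens must divide evenly into batches"
    seq_len = total_tokens // batch_size
    return [flat[b * seq_len:(b + 1) * seq_len] for b in range(batch_size)]

# Six tokens, batch of two -> two sequences of three tokens each
flat = [[float(i)] for i in range(6)]
batched = add_batch_dim(flat, batch_size=2)
print(len(batched), len(batched[0]))  # 2 3
```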

### 0.2.15
- Fixed attention layer KV cache shape mismatch
- Removed incorrect tensor transpositions in KV cache assignment

### 0.2.14
- Fixed `mamba_config.json` loading - removed `local_files_only=True` restriction
- Now properly downloads mamba_config.json from HuggingFace Hub if not cached
- Added more detailed logging for config loading

### 0.2.13
- **CRITICAL FIX**: Load `mamba_config.json` for `attn_layers`, `d_inner`, `d_xb`
- MambaInLlama models store Mamba-specific config in separate `mamba_config.json` file
- Main `config.json` has `model_type: "llama"` without Mamba params
- Fixed: the model was treating ALL layers as Mamba (`attn_layers=[]`) because the config wasn't loaded
- Added better logging for weight loading diagnostics
- Attention layers at indices `[3, 8, 13, 18, 23, 27]` now properly recognized

### 0.2.12
- **CRITICAL FIX**: Corrected `d_xb` default to match qwerky-distill PR #81
- `d_xb = num_key_value_heads * head_dim` (GQA-style, e.g., 8×128=1024 for 8B)
- Fixed in_proj split: `[z(d_inner), x(d_inner), B(d_xb), C(d_xb), dt(dt_rank)]`
- Added repeat_kv expansion for C (same as B) in Mamba1 architecture
- Fixed head count: `num_heads = d_inner // d_state` after B/C expansion
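
The sizing relationships from this fix can be checked numerically. The values below use the 8B example from the changelog entry plus assumed values for `hidden_size` and `d_state` (standard Llama-8B/Mamba1 figures, not read from the shipped 3B checkpoint):

```python
# GQA-style Mamba sizing per the 0.2.12 fix (8B illustration).
num_key_value_heads = 8
head_dim = 128
d_xb = num_key_value_heads * head_dim  # x/B projection width: 8 * 128 = 1024

hidden_size = 4096       # assumed for the 8B illustration
d_inner = hidden_size    # per the 0.2.11 fix: d_inner = hidden_size
d_state = 16             # typical Mamba1 state size (assumption)

repeat_group = d_inner // d_xb   # how many times the d_xb-sized slices repeat
num_heads = d_inner // d_state   # head count after B/C expansion
print(d_xb, repeat_group, num_heads)  # 1024 4 256
```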

### 0.2.11
- **CRITICAL FIX**: Changed `d_inner` default from `intermediate_size` to `hidden_size`
- MambaInLlama Mamba layers use `d_inner = hidden_size`, not `intermediate_size`
- Fixed `d_xb` default: `hidden_size // 16` (was `hidden_size // 4`)
- This fixes the shape mismatch for all Mamba layer weights (A_log, D, conv1d, dt_proj, in_proj, out_proj)

### 0.2.10
- Added debug logging to weight loading to diagnose parameter mapping issues
- Logs first 20 model params, first 20 checkpoint weights, and all skipped weights

### 0.2.9
- Fixed weight loading: split fused `mha.in_proj` into separate q/k/v projections
- Renamed `mha.out_proj` to `o_proj` for checkpoint compatibility
- Should now load all ~395 parameters instead of just 163

### 0.2.8
- Fixed dtype mismatch in SSM scan: `F.softplus`/`torch.exp` compute in float32, now cast back to original dtype
- This caused an `expected BFloat16 but found Float` error in the einsum

### 0.2.7
- Fixed tensor broadcasting bug in `_ssm_scan`: `A.unsqueeze(0).unsqueeze(-1)` -> `A.unsqueeze(0).unsqueeze(2)`
- This caused shape mismatch (8192 vs 16) during SSM discretization

### 0.2.6
- Added `embed_input_ids` method required by vLLM's `VllmModelForTextGeneration` interface
- This was the root cause of "This model does not support `--runner generate`" error

### 0.2.5
- Fixed vLLM runner detection: added `MambaInLlamaMambaForCausalLM` alias for HF config compatibility
- Added proper protocol inheritance (`HasInnerState`, `IsHybrid`) from `vllm.model_executor.models.interfaces`
- Fixed class variable type hints (`ClassVar[Literal[True]]`) for vLLM model inspection
- Simplified model registration code

### 0.2.4
- Complete architecture rewrite with explicit state cache management
- Separate prefill and decode paths for Mamba layers
- Grouped-head Mamba support (`num_xb_head`, `num_C_head`, `repeat_group`)
- Pure PyTorch SSM implementation (preparing for vLLM Triton op integration)

### 0.2.3
- Fixed `d_xb` default value computation in configuration
- Removed unsupported `device`/`dtype` kwargs from RMSNorm calls

### 0.2.2
- Fixed vLLM 0.14+ compatibility issues with Mamba ops API

### 0.2.1
- Updated README, removed SFT model reference

### 0.2.0
- Initial public release with vLLM plugin system integration

## License

Apache 2.0
