Metadata-Version: 2.4
Name: bllm-inference
Version: 0.1.0
Summary: Simple paged attention with KV cache for RL scenarios
License: MIT
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.5.0
Requires-Dist: transformers>=4.40.0
Provides-Extra: api
Requires-Dist: requests>=2.28.0; extra == "api"
Provides-Extra: cuda
Requires-Dist: flash-attn>=2.5.0; extra == "cuda"
Provides-Extra: dev
Requires-Dist: pytest>=7.0.0; extra == "dev"
Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
Dynamic: license-file

# bllm

Fast batched LLM inference with native PyTorch. Designed for RL training scenarios where you need:
- Direct weight sharing between inference and training in the same process
- Efficient batching without external servers
- Simple setup (no Ray, no vLLM server)

## Installation

```bash
pip install -e .

# With API comparison support (for benchmarking against vLLM/Ollama)
pip install -e ".[api]"
```

## Usage

### Generate text

```bash
# Single prompt
bllm generate Qwen/Qwen2.5-0.5B-Instruct "tell me about yourself"

# Multiple prompts (batched)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '["prompt1", "prompt2", "prompt3"]'

# Shorthand for repeated prompts
bllm generate Qwen/Qwen2.5-0.5B-Instruct '["tell me about yourself"] * 20'

# Variable length prompts for testing
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]'

# Quiet mode (stats only)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '["hello"] * 10' -q
```

### Interactive chat

```bash
bllm chat Qwen/Qwen2.5-0.5B-Instruct
bllm chat Qwen/Qwen2.5-0.5B-Instruct --stream
```

## Benchmarking against vLLM or Ollama

Both vLLM and Ollama expose OpenAI-compatible APIs. Use `--api` to benchmark against either.

### vLLM (CUDA)

```bash
# Install
pip install -e ".[cuda,api]"
pip install vllm

# Start vLLM server (default uses 90% of GPU memory for KV cache)
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000

# Or limit memory usage (useful when sharing GPU)
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000 \
    --gpu-memory-utilization 0.3 --max-model-len 2048

# Compare
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q \
    --api http://localhost:8000/v1
```

### Ollama (Mac/Linux)

```bash
# Install
pip install -e ".[api]"

# Start Ollama
ollama serve
ollama pull qwen2.5:0.5b-instruct

# Compare (Ollama uses port 11434 by default)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q
bllm generate qwen2.5:0.5b-instruct '[lorem(randint(5,50)) for _ in range(20)]' -q \
    --api http://localhost:11434/v1
```

### Packed vs padded prefill (CUDA only)

```bash
# With Flash Attention (packed, efficient for variable-length prompts)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q

# Without packing (padded to max length)
bllm generate Qwen/Qwen2.5-0.5B-Instruct '[lorem(randint(5,50)) for _ in range(20)]' -q --no-pack
```

## RL Integration

The main advantage over vLLM for RL is in-process weight updates:

```python
from bllm.engine.inference_engine import InferenceEngine

# Create engine
engine = InferenceEngine("Qwen/Qwen2.5-0.5B-Instruct", device="cuda")

# Generate rollouts
outputs = engine.generate(["prompt1", "prompt2", ...], max_new_tokens=128)

# Update weights after training step (no serialization, no IPC)
engine.update_weights(new_state_dict)

# Generate with updated weights
outputs = engine.generate(["prompt1", "prompt2", ...], max_new_tokens=128)
```

## Performance

On Apple Silicon (M-series):
- True GPU batching via MPS
- 155+ tok/s for 20 batched prompts (vs Ollama's ~143 tok/s sequential)

On CUDA:
- Higher throughput with optimized kernels
- `torch.compile` support for additional speedup

Key optimizations:
- Native GQA support via `enable_gqa=True` (PyTorch 2.5+)
- View-based KV cache access (no copies for contiguous batches)
- Batched tensor operations (minimal CPU-GPU sync)
