Metadata-Version: 2.4
Name: vllm-efficient-client
Version: 0.1.0
Summary: A unified interface for efficient LLM inference with vLLM and OpenAI-compatible APIs
Project-URL: Homepage, https://github.com/yourusername/vllm-efficient-client
Project-URL: Repository, https://github.com/yourusername/vllm-efficient-client
Project-URL: Documentation, https://github.com/yourusername/vllm-efficient-client#readme
Project-URL: Bug Tracker, https://github.com/yourusername/vllm-efficient-client/issues
Author-email: Your Name <your.email@example.com>
License: MIT
License-File: LICENSE
Keywords: ai,inference,llm,nlp,openai,vllm
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.12
Requires-Dist: build>=1.3.0
Requires-Dist: hatchling>=1.28.0
Requires-Dist: tqdm>=4.65.0
Provides-Extra: all
Requires-Dist: openai>=2.8.1; extra == 'all'
Requires-Dist: torch>=2.0.0; extra == 'all'
Requires-Dist: transformers>=4.57.3; extra == 'all'
Requires-Dist: vllm>=0.11.2; extra == 'all'
Provides-Extra: dev
Requires-Dist: black>=23.0.0; extra == 'dev'
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.0.0; extra == 'dev'
Requires-Dist: pytest>=7.0.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: openai
Requires-Dist: openai>=2.8.1; extra == 'openai'
Provides-Extra: vllm
Requires-Dist: torch>=2.0.0; extra == 'vllm'
Requires-Dist: transformers>=4.57.3; extra == 'vllm'
Requires-Dist: vllm>=0.11.2; extra == 'vllm'
Description-Content-Type: text/markdown

# vLLM Efficient Client

A unified Python package for efficient Large Language Model (LLM) inference, supporting both:
- **vLLM offline inference** - High-throughput batch inference on local GPUs
- **OpenAI-compatible APIs** - Remote inference with automatic retry and resume

## Features

### vLLM Client (Offline Inference)
- 🚀 High-throughput batch inference using vLLM
- 🎯 Direct token feeding for optimal performance
- 💾 Automatic GPU memory management
- 🔄 Model switching without restarting
- 📝 HuggingFace chat template support

### OpenAI Client (API Inference)
- 🌐 Works with any OpenAI-compatible API (OpenAI, OpenRouter, DeepSeek, Anthropic, etc.)
- 🔁 Automatic retry with exponential backoff
- 💾 Resume from checkpoint (auto-saves progress)
- 🛡️ Graceful quota exhaustion handling
- 🎯 Provider-specific parameter optimization

## Installation

### Basic Installation
```bash
pip install vllm-efficient-client
```

### With vLLM Support (for offline inference)
```bash
pip install "vllm-efficient-client[vllm]"
```

### With OpenAI Support (for API inference)
```bash
pip install "vllm-efficient-client[openai]"
```

### With All Features
```bash
pip install "vllm-efficient-client[all]"
```

### Development Installation
```bash
git clone https://github.com/yourusername/vllm-efficient-client.git
cd vllm-efficient-client
pip install -e ".[dev]"
```

## Quick Start

### Using vLLM Client (Offline Inference)

```python
from vllm_efficient_client import VLLMClient, VLLMResourceConfig, SamplingConfig

# Configure vLLM resources
config = VLLMResourceConfig(
    gpu_memory_utilization=0.9,
    max_model_len=4096,
    max_num_seqs=128,
    max_num_batched_tokens=65536,
    block_size=16,
    tensor_parallel_size=1,
    dtype="bfloat16",
    trust_remote_code=True,
    disable_log_stats=True,
)

# Optional: Auto-scale for model size
config.scale_for_model_size(3)  # For 3B parameter model

# Initialize client
client = VLLMClient("meta-llama/Llama-3.2-3B-Instruct", config)

# Prepare prompts
prompts = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "metadata": {"id": 1, "category": "geography"}
    },
    {
        "messages": [
            {"role": "user", "content": "Explain quantum computing."}
        ],
        "metadata": {"id": 2, "category": "science"}
    }
]

# Run inference
results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.7, max_tokens=100)
)

# Results include metadata + generated output
for result in results:
    print(f"ID: {result['id']}")
    print(f"Output: {result['output']}")
    print()

# Clean up
client.delete_client()
```

### Using OpenAI Client (API Inference)

```python
from vllm_efficient_client import OpenAIClient, OpenAIConfig, SamplingConfig

# Configure API client
config = OpenAIConfig(
    api_key="your-api-key",
    base_url="https://api.openai.com/v1/",  # or OpenRouter, DeepSeek, etc.
    enable_retry=True,
    max_retries=5,
)

# Initialize client
client = OpenAIClient("gpt-4", config)

# Prepare prompts
prompts = [
    {
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ],
        "metadata": {"qid": 1, "variant": "base", "seed": 42}
    }
]

# Run inference with auto-save (enables resume)
results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.7, max_tokens=100, seed=42),
    output_path="results.json"  # Auto-saves for resume on failure
)

# If interrupted, just run again - it will resume from checkpoint!
```

## Advanced Usage

### Switching Models (vLLM)

```python
client = VLLMClient("model-1", config)
# ... do some work ...

# Switch to a different model
client.reset_client_to_another_model("model-2")
# ... continue with new model ...
```

### Multiple Completions (OpenAI)

```python
# Generate 5 different responses per prompt
results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.8, max_tokens=100, n=5),
    output_path="results.json"
)

# Each result contains a list of 5 outputs
for result in results:
    print(f"Generated {len(result['output'])} responses")
```

### Custom Resource Scaling (vLLM)

```python
config = VLLMResourceConfig(
    gpu_memory_utilization=0.9,
    max_model_len=8192,
    max_num_seqs=256,
    max_num_batched_tokens=131072,
    block_size=32,
    tensor_parallel_size=2,  # Use 2 GPUs
    dtype="float16",
    trust_remote_code=False,
    disable_log_stats=True,
    enable_prefix_caching=True,
)

# Automatically adjust for a 70B model
config.scale_for_model_size(70)
```

## Configuration Options

### VLLMResourceConfig

| Parameter | Type | Description |
|-----------|------|-------------|
| `gpu_memory_utilization` | float | Fraction of GPU memory to use (0.0-1.0) |
| `max_model_len` | int | Maximum sequence length |
| `max_num_seqs` | int | Maximum sequences per iteration |
| `max_num_batched_tokens` | int | Maximum tokens in a batch |
| `block_size` | int | Token block size for paged attention |
| `tensor_parallel_size` | int | Number of GPUs for tensor parallelism |
| `dtype` | str | Data type ("float16", "bfloat16", "float32") |
| `trust_remote_code` | bool | Trust remote code from model hub |
| `enable_prefix_caching` | bool | Enable KV cache prefix caching |

### OpenAIConfig

| Parameter | Type | Description |
|-----------|------|-------------|
| `api_key` | str | API key for authentication |
| `base_url` | str | Base URL for API endpoint |
| `enable_retry` | bool | Enable automatic retry on errors |
| `max_retries` | int | Maximum retry attempts |
| `initial_retry_delay` | float | Initial delay for exponential backoff (seconds) |
| `max_retry_delay` | float | Maximum delay for exponential backoff (seconds) |
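
For example, a minimal configuration sketch using the retry fields above; the exact backoff schedule is internal to the client, but delays start at `initial_retry_delay` and are capped at `max_retry_delay`:

```python
from vllm_efficient_client import OpenAIConfig

# Retry up to 5 times, starting at a 1-second delay and backing off
# exponentially to at most 60 seconds between attempts.
config = OpenAIConfig(
    api_key="your-api-key",
    base_url="https://api.openai.com/v1/",
    enable_retry=True,
    max_retries=5,
    initial_retry_delay=1.0,
    max_retry_delay=60.0,
)
```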

### SamplingConfig

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `temperature` | float | 0.0 | Sampling temperature (0.0 = deterministic) |
| `top_p` | float | 1.0 | Nucleus sampling threshold |
| `max_tokens` | int | 2 | Maximum tokens to generate |
| `n` | int | 1 | Number of completions (OpenAI only) |
| `seed` | int | None | Random seed for reproducibility |
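
For example (a minimal sketch using the fields above; note that `n` applies only to the OpenAI client):

```python
from vllm_efficient_client import SamplingConfig

# Deterministic, greedy decoding
greedy = SamplingConfig(temperature=0.0, max_tokens=256)

# Diverse sampling: 3 completions per prompt (OpenAI client only), fixed seed
sampled = SamplingConfig(temperature=0.8, top_p=0.95, max_tokens=256, n=3, seed=42)
```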

## Supported Providers (OpenAI Client)

The OpenAI client automatically adapts to different providers:

- **OpenAI** (api.openai.com)
- **OpenRouter** (openrouter.ai)
- **DeepSeek** (deepseek.com)
- **Anthropic** (anthropic.com)
- **Google Gemini** (Google AI)
- **Local vLLM servers** with OpenAI API compatibility
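
Switching providers usually only requires changing `base_url` and `api_key`; the model name follows each provider's naming scheme. The URLs below are the typical OpenAI-compatible endpoints (check your provider's documentation for the current values):

```python
from vllm_efficient_client import OpenAIClient, OpenAIConfig

# OpenRouter
client = OpenAIClient(
    "meta-llama/llama-3.1-8b-instruct",
    OpenAIConfig(api_key="your-openrouter-key", base_url="https://openrouter.ai/api/v1/"),
)

# Local vLLM server (e.g. started with `vllm serve <model>`, default port 8000)
client = OpenAIClient(
    "meta-llama/Llama-3.2-3B-Instruct",
    OpenAIConfig(api_key="EMPTY", base_url="http://localhost:8000/v1/"),
)
```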

## Project Structure

```
vllm_efficient_client/
├── src/
│   └── vllm_efficient_client/
│       ├── __init__.py          # Main exports
│       ├── base.py              # Base classes and interfaces
│       ├── vllm_client.py       # vLLM offline inference client
│       └── openai_client.py     # OpenAI API client
├── pyproject.toml               # Package configuration
├── README.md                    # This file
└── LICENSE                      # MIT License
```

## Best Practices

### For vLLM (Offline Inference)
1. Use `scale_for_model_size()` to automatically adjust parameters
2. Set `dtype="bfloat16"` for better performance on modern GPUs
3. Enable `enable_prefix_caching=True` for repeated prefixes
4. Use `tensor_parallel_size > 1` for large models (>30B params)
5. Always call `delete_client()` to free GPU memory
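
A minimal sketch combining these practices, wrapping inference in `try`/`finally` so GPU memory is released even if generation fails:

```python
from vllm_efficient_client import VLLMClient, VLLMResourceConfig, SamplingConfig

config = VLLMResourceConfig(
    gpu_memory_utilization=0.9,
    max_model_len=4096,
    max_num_seqs=128,
    max_num_batched_tokens=65536,
    block_size=16,
    tensor_parallel_size=1,
    dtype="bfloat16",
    trust_remote_code=True,
    disable_log_stats=True,
    enable_prefix_caching=True,  # reuse KV cache across repeated prefixes
)
config.scale_for_model_size(3)  # auto-adjust limits for a 3B model

prompts = [
    {
        "messages": [{"role": "user", "content": "Name three uses of prefix caching."}],
        "metadata": {"id": 1},
    }
]

client = VLLMClient("meta-llama/Llama-3.2-3B-Instruct", config)
try:
    results = client.run_batch(prompts, SamplingConfig(temperature=0.0, max_tokens=256))
finally:
    client.delete_client()  # always free GPU memory
```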

### For OpenAI (API Inference)
1. Always set `output_path` to enable resume on failure
2. Use `enable_retry=True` for production workloads
3. Include `qid`, `variant`, and `seed` in metadata for proper resume
4. Monitor logs for quota exhaustion warnings
5. Set appropriate `max_retries` based on your rate limits
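
A minimal sketch combining these practices; the `qid`/`variant`/`seed` metadata keys are what the resume logic matches on (practice 3), and `output_path` enables checkpointing (practice 1):

```python
from vllm_efficient_client import OpenAIClient, OpenAIConfig, SamplingConfig

config = OpenAIConfig(
    api_key="your-api-key",
    base_url="https://api.openai.com/v1/",
    enable_retry=True,  # retry transient errors with exponential backoff
    max_retries=5,      # tune to your provider's rate limits
)
client = OpenAIClient("gpt-4", config)

prompts = [
    {
        "messages": [{"role": "user", "content": "Summarize the theory of relativity."}],
        # qid/variant/seed let an interrupted run skip already-completed items
        "metadata": {"qid": 101, "variant": "base", "seed": 42},
    }
]

# Auto-saves to results.json; re-running the script resumes from the checkpoint
results = client.run_batch(
    prompts,
    SamplingConfig(temperature=0.7, max_tokens=256, seed=42),
    output_path="results.json",
)
```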

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Citation

If you use this package in your research, please cite:

```bibtex
@software{vllm_efficient_client,
  title = {vLLM Efficient Client: Unified Interface for LLM Inference},
  author = {Your Name},
  year = {2025},
  url = {https://github.com/yourusername/vllm-efficient-client}
}
```

## Acknowledgments

- Built on top of [vLLM](https://github.com/vllm-project/vllm)
- Uses [OpenAI Python SDK](https://github.com/openai/openai-python)
- Inspired by the need for a unified LLM inference interface

## Support

- 📧 Email: your.email@example.com
- 🐛 Issues: [GitHub Issues](https://github.com/yourusername/vllm-efficient-client/issues)
- 💬 Discussions: [GitHub Discussions](https://github.com/yourusername/vllm-efficient-client/discussions)

