Metadata-Version: 2.4
Name: maxllm_gate
Version: 0.8.0
Summary: maxllm_gate - Intelligent LLM client with built-in rate limiting. Maximizes throughput and prevents 429 errors.
Project-URL: Homepage, https://github.com/Cannonbold2412/maxllm_gate
Project-URL: Documentation, https://github.com/Cannonbold2412/maxllm_gate#readme
Project-URL: Repository, https://github.com/Cannonbold2412/maxllm_gate
Author: Cannonbold2412
License-Expression: MIT
License-File: LICENSE
Keywords: api-gateway,groq,litellm,llm,maxllm_gate,openai,rate-limiting,scheduler
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Requires-Dist: httpx>=0.26.0
Requires-Dist: litellm>=1.30.0
Requires-Dist: pydantic>=2.5.0
Requires-Dist: python-dotenv>=1.0.0
Requires-Dist: tiktoken>=0.6.0
Provides-Extra: all
Requires-Dist: fastapi>=0.109.0; extra == 'all'
Requires-Dist: prometheus-client>=0.19.0; extra == 'all'
Requires-Dist: pydantic-settings>=2.1.0; extra == 'all'
Requires-Dist: redis>=5.0.0; extra == 'all'
Requires-Dist: structlog>=24.1.0; extra == 'all'
Requires-Dist: tenacity>=8.2.0; extra == 'all'
Requires-Dist: uvicorn[standard]>=0.27.0; extra == 'all'
Provides-Extra: dev
Requires-Dist: mypy>=1.8.0; extra == 'dev'
Requires-Dist: pytest-asyncio>=0.23.0; extra == 'dev'
Requires-Dist: pytest-cov>=4.1.0; extra == 'dev'
Requires-Dist: pytest>=7.4.0; extra == 'dev'
Requires-Dist: ruff>=0.1.0; extra == 'dev'
Provides-Extra: redis
Requires-Dist: redis>=5.0.0; extra == 'redis'
Provides-Extra: server
Requires-Dist: fastapi>=0.109.0; extra == 'server'
Requires-Dist: prometheus-client>=0.19.0; extra == 'server'
Requires-Dist: pydantic-settings>=2.1.0; extra == 'server'
Requires-Dist: structlog>=24.1.0; extra == 'server'
Requires-Dist: tenacity>=8.2.0; extra == 'server'
Requires-Dist: uvicorn[standard]>=0.27.0; extra == 'server'
Description-Content-Type: text/markdown

# maxllm_gate

> **Production-ready** intelligent LLM client with built-in rate limiting, smart routing, and distributed state support. Published on PyPI as `maxllm_gate`.

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Version](https://img.shields.io/badge/version-0.8.0-green.svg)](https://github.com/Cannonbold2412/maxllm_gate)

## Overview

**maxllm_gate** is a production-ready LLM client that automatically manages rate limits across multiple API keys and providers. It works on top of [LiteLLM](https://github.com/BerriAI/litellm) as an intelligent scheduling and optimization layer.

Install from PyPI with `pip install maxllm_gate`. Import in Python with `from maxllm_gate import ...`.

```python
from maxllm_gate import maxllm_gate
import asyncio

async def main():
    # Load from environment variables
    async with maxllm_gate.from_env() as client:
        # Use like OpenAI client - rate limiting is automatic
        response = await client.chat("gpt-4o-mini", "Explain quantum computing")
        print(response.content)

asyncio.run(main())
```

### What it does automatically:
- ✅ **Multi-key management** - Manages multiple API keys across providers (Groq, OpenAI, OpenRouter, etc.)
- ✅ **Real-time rate limiting** - Tracks TPM/RPM limits per key and per model with a token bucket algorithm
- ✅ **Smart routing** - SDK supports `least_utilized`, `round_robin`, `latency_aware`, and `balanced`; the FastAPI gateway supports `least_utilized`, `round_robin`, and `token_aware`
- ✅ **No 429 errors** - Defers requests when capacity is exhausted instead of failing
- ✅ **Auto-retry** - Exponential backoff on transient failures
- ✅ **Streaming support** - Async streaming with proper token tracking
- ✅ **Input validation** - Pydantic models validate all inputs
- ✅ **Graceful shutdown** - Context manager support with proper cleanup
- ✅ **Production-ready** - Optional Redis backend for distributed state

## Installation

```bash
# Base installation
pip install maxllm_gate

# With Redis backend (for production/distributed deployments)
pip install maxllm_gate[redis]

# With server mode (optional FastAPI gateway)
pip install maxllm_gate[server]

# Everything (recommended for production)
pip install maxllm_gate[all]
```

Then import the client in Python:

```python
from maxllm_gate import maxllm_gate
```

## Quick Start

### 1. Create `.env`

```bash
cp .env.example .env
```

Set `API_KEYS_CONFIG` in `.env`:

```dotenv
API_KEYS_CONFIG='{
  "groq-1": {
    "api_key": "gsk_your_groq_key",
    "provider": "groq",
    "models": {
      "llama-3.1-70b-versatile": {"tpm_limit": 30000, "rpm_limit": 30},
      "mixtral-8x7b-32768": {"tpm_limit": 15000, "rpm_limit": 20}
    }
  },
  "openai-1": {
    "api_key": "sk-your_openai_key",
    "provider": "openai",
    "models": {
      "gpt-4o-mini": {"tpm_limit": 90000, "rpm_limit": 500},
      "gpt-4o": {"tpm_limit": 30000, "rpm_limit": 200}
    }
  }
}'

DEFAULT_STRATEGY=least_utilized
```

### 2. Use it (Async-only)

```python
from maxllm_gate import maxllm_gate
import asyncio

async def main():
    # Context manager ensures graceful shutdown
    async with maxllm_gate.from_env() as client:
        # Simple chat
        response = await client.chat("gpt-4o-mini", "Hello!")
        print(response.content)
        
        # With messages list
        response = await client.chat("mixtral-8x7b-32768", [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a haiku about Python."},
        ])
        
        # Streaming
        async for chunk in client.chat_stream("gpt-4o-mini", "Tell me a story"):
            print(chunk, end="", flush=True)
        
        # Check capacity and scores
        print(client.capacity())
        print(client.scores())  # See routing decisions

asyncio.run(main())
```

### Async Usage (Native)

```python
from maxllm_gate import maxllm_gate
import asyncio

async def main():
    # Async context manager
    async with maxllm_gate.from_env() as client:
        # Single request
        response = await client.chat("gpt-4o-mini", "Hello!")
        print(response.content)
        
        # Concurrent requests - automatically load balanced
        tasks = [
            client.chat("gpt-4o-mini", f"Question {i}")
            for i in range(10)
        ]
        responses = await asyncio.gather(*tasks)
        
        # Async streaming
        async for chunk in client.chat_stream("gpt-4o-mini", "Tell a story"):
            print(chunk, end="", flush=True)

asyncio.run(main())
```

## Configuration

### Environment Variables

`maxllm_gate` uses environment variables as the shared configuration source for both the SDK and the FastAPI gateway.

`API_KEYS_CONFIG` is required, and each configured model must declare its own `tpm_limit` and `rpm_limit`. Server settings such as `HOST`, `PORT`, `DEBUG`, and `LOG_LEVEL` are optional. `BASE_URL` is only needed by the HTTP example scripts and other external integrations that call the gateway over HTTP.

```dotenv
HOST=0.0.0.0
PORT=8000
DEBUG=false
LOG_LEVEL=INFO

API_KEYS_CONFIG='{
  "groq-1": {"api_key": "gsk_key_1", "provider": "groq", "models": {"llama-3.1-70b-versatile": {"tpm_limit": 30000, "rpm_limit": 30}}},
  "groq-2": {"api_key": "gsk_key_2", "provider": "groq", "models": {"llama-3.1-70b-versatile": {"tpm_limit": 30000, "rpm_limit": 30}}}
}'
DEFAULT_STRATEGY=least_utilized
DEFAULT_MAX_TOKENS=1024
DEFAULT_TEMPERATURE=0.7
TOKEN_ESTIMATION_BUFFER=1.1
MAX_RETRIES=3
RETRY_BASE_DELAY=1.0
RETRY_MAX_DELAY=60.0
MAX_QUEUE_SIZE=10000
DEFAULT_PRIORITY=medium
REDIS_URL=redis://localhost:6379
```
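
The `MAX_RETRIES`, `RETRY_BASE_DELAY`, and `RETRY_MAX_DELAY` settings suggest a capped exponential backoff schedule. The following is a minimal sketch of how such a schedule could be computed, assuming the delay doubles on every attempt and is clamped to the maximum. `backoff_delays` is a hypothetical helper, not part of the library, and the real retry logic may add jitter or differ in other ways.

```python
def backoff_delays(
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> list[float]:
    """Return the wait before each retry: base_delay * 2**attempt, capped at max_delay."""
    return [min(base_delay * (2 ** attempt), max_delay) for attempt in range(max_retries)]


print(backoff_delays())   # [1.0, 2.0, 4.0]
print(backoff_delays(8))  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```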

### Routing Strategies

The SDK and the FastAPI gateway do not expose the exact same routing strategies.

| Strategy | Available In | Best For | How It Works |
|----------|--------------|----------|--------------|
| **`balanced`** | SDK | Production client usage | Combines utilization, latency, recent errors, and freshness into a weighted score. |
| `latency_aware` | SDK | Low latency | Prefers keys with the fastest observed response times. |
| `least_utilized` | SDK and gateway | Safe default | Routes to the key with the most available TPM/RPM headroom. |
| `round_robin` | SDK and gateway | Fair distribution | Cycles through keys evenly. |
| `token_aware` | Gateway | Server-side scheduling | Prefers keys with enough token capacity for the request. |

**Recommended:** Use `least_utilized` in `.env` if you want one strategy setting that works for both the SDK and the gateway. Use `balanced` when working directly with the SDK and you want the richer scoring model.
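
The simpler strategies in the table can be sketched in a few lines. This is an illustration only: `KeyState`, its fields, and the three functions are hypothetical names, and the library's own implementations may weigh TPM and RPM differently.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator, Optional


@dataclass
class KeyState:
    """Hypothetical per-key usage snapshot for the current minute."""
    key_id: str
    tpm_limit: int
    tpm_used: int
    rpm_limit: int
    rpm_used: int

    def headroom(self) -> float:
        # Fraction of capacity left, limited by the tighter of TPM and RPM.
        tpm_left = 1 - self.tpm_used / self.tpm_limit
        rpm_left = 1 - self.rpm_used / self.rpm_limit
        return min(tpm_left, rpm_left)


def least_utilized(keys: list[KeyState]) -> KeyState:
    """Pick the key with the most remaining headroom."""
    return max(keys, key=KeyState.headroom)


def round_robin(keys: list[KeyState]) -> Iterator[KeyState]:
    """Cycle through keys in a fixed order."""
    return cycle(keys)


def token_aware(keys: list[KeyState], tokens_needed: int) -> Optional[KeyState]:
    """Prefer keys whose remaining TPM covers the request; None if no key can."""
    fits = [k for k in keys if k.tpm_limit - k.tpm_used >= tokens_needed]
    return least_utilized(fits) if fits else None


keys = [
    KeyState("groq-1", tpm_limit=30000, tpm_used=24000, rpm_limit=30, rpm_used=10),
    KeyState("groq-2", tpm_limit=30000, tpm_used=6000, rpm_limit=30, rpm_used=12),
]
print(least_utilized(keys).key_id)         # groq-2
rr = round_robin(keys)
print([next(rr).key_id for _ in range(3)])  # ['groq-1', 'groq-2', 'groq-1']
print(token_aware(keys, 10_000).key_id)    # groq-2
print(token_aware(keys, 50_000))           # None
```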

```python
# See routing decisions in real-time
scores = client.scores()
print(scores)
# {
#   "groq-1": {
#     "total_score": 0.23,      # Lower = better
#     "utilization": 0.15,      # 15% capacity used
#     "latency_normalized": 0.08,
#     "latency_avg_ms": 245.5,
#     "error_penalty": 0.0,     # No recent errors
#     "freshness": 0.85
#   },
#   ...
# }
```
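
One way to read the fields above is as inputs to a weighted sum where the lowest total wins. The sketch below assumes exactly that. The weights and the `balanced_score` function are made up for illustration and are not taken from the library, so the totals it produces will not match the sample output above.

```python
def balanced_score(
    utilization: float,
    latency_normalized: float,
    error_penalty: float,
    freshness: float,
    weights: tuple[float, float, float, float] = (0.5, 0.3, 0.15, 0.05),  # hypothetical
) -> float:
    """Weighted sum of penalties; lower is better."""
    w_util, w_latency, w_error, w_fresh = weights
    staleness = 1.0 - freshness  # fresher data should lower the score
    return (
        w_util * utilization
        + w_latency * latency_normalized
        + w_error * error_penalty
        + w_fresh * staleness
    )


scores = {
    "groq-1": balanced_score(0.15, 0.08, 0.0, 0.85),
    "groq-2": balanced_score(0.60, 0.05, 0.2, 0.90),
}
best = min(scores, key=scores.get)
print(best)  # groq-1
```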

### Loading Configuration from the Shell

```bash
export API_KEYS_CONFIG='{
  "groq-1": {"api_key": "gsk_...", "provider": "groq", "models": {"mixtral-8x7b-32768": {"tpm_limit": 30000, "rpm_limit": 30}}}
}'

# The SDK then reads the variable at startup, e.g. in Python:
#   from maxllm_gate import maxllm_gate
#   client = maxllm_gate.from_env()
```

### FastAPI Gateway

Install the gateway extras and start the HTTP server:

```bash
pip install maxllm_gate[server]
maxllm_gate-server
```

The gateway reads `HOST`, `PORT`, `DEBUG`, and `LOG_LEVEL` from `.env`. Set `BASE_URL` for any example scripts or integrations that call the gateway over HTTP.

## Supported Providers

maxllm_gate works with any provider supported by [LiteLLM](https://docs.litellm.ai/docs/providers):

| Provider | Config Name | Example Models |
|----------|-------------|----------------|
| OpenAI | `openai` | `gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo` |
| Groq | `groq` | `llama-3.1-70b-versatile`, `mixtral-8x7b-32768` |
| OpenRouter | `openrouter` | `anthropic/claude-3-haiku`, `meta-llama/llama-3-70b` |
| Anthropic | `anthropic` | `claude-3-haiku-20240307`, `claude-3-5-sonnet-20241022` |
| Together AI | `together_ai` | `mistralai/Mixtral-8x7B-Instruct-v0.1` |
| Anyscale | `anyscale` | `meta-llama/Llama-3-70b-chat-hf` |
| Fireworks | `fireworks_ai` | `accounts/fireworks/models/llama-v3-70b-instruct` |
| NVIDIA NIM | `nvidia_nim` | Any NVIDIA NIM endpoint |
| Azure OpenAI | `azure` | Your Azure deployments |

### Provider Configuration

```dotenv
API_KEYS_CONFIG='{
  "openai-1": {"api_key": "sk-...", "provider": "openai", "models": {"gpt-4o-mini": {"tpm_limit": 90000, "rpm_limit": 500}}},
  "groq-1": {"api_key": "gsk_...", "provider": "groq", "models": {"llama-3.1-70b-versatile": {"tpm_limit": 30000, "rpm_limit": 30}}},
  "openrouter-1": {"api_key": "sk-or-...", "provider": "openrouter", "models": {"anthropic/claude-3-haiku": {"tpm_limit": 100000, "rpm_limit": 200}}}
}'
```
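
Because `API_KEYS_CONFIG` is a JSON string, it can be worth checking it before starting the client. The sketch below assumes only the structure shown in this README (an `api_key`, a `provider`, and a `models` map whose entries carry `tpm_limit` and `rpm_limit`). `parse_keys_config` is a hypothetical helper and the library performs its own validation.

```python
import json

REQUIRED_FIELDS = ("api_key", "provider", "models")
REQUIRED_LIMITS = ("tpm_limit", "rpm_limit")


def parse_keys_config(raw: str) -> dict:
    """Parse an API_KEYS_CONFIG-style JSON string and check the required fields."""
    config = json.loads(raw)
    for key_id, entry in config.items():
        for field in REQUIRED_FIELDS:
            if field not in entry:
                raise ValueError(f"{key_id}: missing '{field}'")
        for model, limits in entry["models"].items():
            for limit in REQUIRED_LIMITS:
                value = limits.get(limit)
                if not isinstance(value, int) or value <= 0:
                    raise ValueError(f"{key_id}/{model}: '{limit}' must be a positive integer")
    return config


raw = '{"groq-1": {"api_key": "gsk_...", "provider": "groq", "models": {"llama-3.1-70b-versatile": {"tpm_limit": 30000, "rpm_limit": 30}}}}'
print(list(parse_keys_config(raw)))  # ['groq-1']
```

In practice the input would come from `os.environ["API_KEYS_CONFIG"]` rather than a literal.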

## How It Works

```
┌─────────────────────────────────────────────────────────────────┐
│                         Your Code                                │
│   response = await client.chat("gpt-4o-mini", "Hello!")         │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                        maxllm_gate Client                             │
│  1. Validate inputs (Pydantic)                                   │
│  2. Estimate tokens needed (~50 tokens)                          │
│  3. Select best key using routing strategy                       │
│  4. Check capacity - defer if needed                             │
│  5. Execute via LiteLLM                                          │
│  6. Record latency & update rate limits                          │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                          LiteLLM                                 │
│                    (handles provider API)                        │
└────────────────────────────────┬────────────────────────────────┘
                                 │
                                 ▼
                         ┌──────────────┐
                         │   OpenAI     │
                         └──────────────┘
```
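
Step 2 can be approximated without any dependencies. The library itself uses tiktoken for estimation (see Acknowledgments). The sketch below swaps in a rough characters-per-token heuristic and assumes the estimate is the buffered prompt size plus the completion budget. `estimate_tokens` and its parameters are hypothetical, with defaults borrowed from `TOKEN_ESTIMATION_BUFFER` and `DEFAULT_MAX_TOKENS`.

```python
import math


def estimate_tokens(
    messages: list[dict],
    max_tokens: int = 1024,
    buffer: float = 1.1,
    chars_per_token: float = 4.0,  # crude stand-in for a real tokenizer
) -> int:
    """Estimate the tokens a request may consume: buffered prompt + completion budget."""
    prompt_chars = sum(len(m["content"]) for m in messages)
    prompt_tokens = math.ceil(prompt_chars / chars_per_token)
    return math.ceil(prompt_tokens * buffer) + max_tokens


messages = [{"role": "user", "content": "Explain quantum computing"}]
print(estimate_tokens(messages))  # 1032
```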

### Key Features

#### 1. Deferred Execution (No 429 Errors)

When ALL keys are at capacity, maxllm_gate doesn't fail - it waits:

```python
# If all keys exhausted, request is automatically deferred
# until capacity is available (no 429 errors!)
response = await client.chat("gpt-4o-mini", "Hello")  # May wait, then succeeds
```
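
Deferral of this kind is commonly built on a token bucket that can report how long a caller must wait. The following is a minimal sketch under that assumption. `TokenBucket` and `acquire` are hypothetical names and do not describe the library's internal classes.

```python
import asyncio
import time


class TokenBucket:
    """Bucket that refills continuously at limit_per_minute / 60 units per second."""

    def __init__(self, limit_per_minute: int, clock=time.monotonic):
        self.capacity = float(limit_per_minute)
        self.refill_rate = limit_per_minute / 60.0
        self.available = self.capacity
        self._clock = clock
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        gained = (now - self._last) * self.refill_rate
        self.available = min(self.capacity, self.available + gained)
        self._last = now

    def consume(self, amount: float) -> bool:
        """Take `amount` units if available; return False without taking otherwise."""
        self._refill()
        if amount <= self.available:
            self.available -= amount
            return True
        return False

    def wait_time(self, amount: float) -> float:
        """Seconds until `amount` units are available (0.0 if they already are)."""
        if amount > self.capacity:
            raise ValueError("request exceeds bucket capacity")
        self._refill()
        return max(0.0, (amount - self.available) / self.refill_rate)


async def acquire(bucket: TokenBucket, amount: float) -> None:
    """Defer instead of failing: sleep until the bucket can cover the request."""
    while not bucket.consume(amount):
        await asyncio.sleep(bucket.wait_time(amount))


# Deterministic demo with a fake clock: 60 units per minute = 1 per second.
now = [0.0]
bucket = TokenBucket(60, clock=lambda: now[0])
print(bucket.consume(60))    # True: the full burst fits
print(bucket.wait_time(10))  # 10.0: ten more units need ten seconds
now[0] += 10
print(bucket.consume(10))    # True after the refill
```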

#### 2. Input Validation

All inputs are validated with Pydantic before execution:

```python
from maxllm_gate.validation import validate_chat_request

# Manual validation
request = validate_chat_request(
    model="gpt-4",
    messages="Hello!",
    temperature=0.7,
)

# Automatic validation (default)
response = await client.chat("gpt-4", "Hello!", validate=True)  # ✅ Validated

# Skip validation for performance (not recommended)
response = await client.chat("gpt-4", "Hello!", validate=False)
```

Validation checks:
- ✅ Model name is valid (no special characters)
- ✅ Messages are not empty or whitespace-only
- ✅ Temperature is 0-2
- ✅ Max tokens is positive
- ✅ Priority is high/medium/low
- ✅ Roles are valid (system/user/assistant/function/tool)
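
The checks above can be mirrored in plain Python to make the rules concrete. The library uses Pydantic models for this. The dataclass below is a dependency-free stand-in, `CheckedRequest` is a hypothetical name, and the exact definition of "special characters" in model names is an assumption.

```python
import re
from dataclasses import dataclass
from typing import Union

VALID_ROLES = {"system", "user", "assistant", "function", "tool"}
VALID_PRIORITIES = {"high", "medium", "low"}
# Assumption: allowed characters are letters, digits, and . _ : / -
MODEL_NAME = re.compile(r"[A-Za-z0-9._:/-]+")


@dataclass
class CheckedRequest:
    model: str
    messages: Union[str, list]
    temperature: float = 0.7
    max_tokens: int = 1024
    priority: str = "medium"

    def __post_init__(self) -> None:
        if not MODEL_NAME.fullmatch(self.model):
            raise ValueError(f"invalid model name: {self.model!r}")
        # A bare string is treated as a single user message.
        if isinstance(self.messages, str):
            self.messages = [{"role": "user", "content": self.messages}]
        if not self.messages:
            raise ValueError("messages must not be empty")
        for message in self.messages:
            if message.get("role") not in VALID_ROLES:
                raise ValueError(f"invalid role: {message.get('role')!r}")
            if not str(message.get("content", "")).strip():
                raise ValueError("message content must not be blank")
        if not 0 <= self.temperature <= 2:
            raise ValueError("temperature must be between 0 and 2")
        if self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        if self.priority not in VALID_PRIORITIES:
            raise ValueError("priority must be high, medium, or low")


request = CheckedRequest("gpt-4", "Hello!")
print(request.messages)  # [{'role': 'user', 'content': 'Hello!'}]

try:
    CheckedRequest("gpt-4", "Hello!", temperature=3.0)
except ValueError as exc:
    print(exc)  # temperature must be between 0 and 2
```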

#### 3. Graceful Shutdown

Use context managers for automatic cleanup:

```python
# Async (only mode supported)
async with maxllm_gate.from_env() as client:
    response = await client.chat(...)
# Waits for in-flight requests, then shuts down
```

Or manual shutdown:

```python
import asyncio

from maxllm_gate import maxllm_gate

async def main():
    client = maxllm_gate.from_env()
    try:
        response = await client.chat("gpt-4o-mini", "Hello!")
        print(response.content)
    finally:
        # Wait at most 30s for pending requests before closing
        await client.shutdown(timeout=30)

asyncio.run(main())
```
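
For intuition, a shutdown with a timeout can be built by tracking in-flight tasks, waiting for them up to the deadline, and cancelling whatever remains. The sketch below shows that pattern with plain `asyncio`. `InFlightTracker` is a hypothetical helper, and whether the library cancels or abandons unfinished requests after the timeout is not stated in this README.

```python
import asyncio


class InFlightTracker:
    """Hypothetical helper: tracks running tasks so shutdown can drain them."""

    def __init__(self):
        self._tasks = set()
        self._closed = False

    def submit(self, coro) -> asyncio.Task:
        """Schedule a coroutine; must be called from inside a running event loop."""
        if self._closed:
            coro.close()  # avoid a "coroutine was never awaited" warning
            raise RuntimeError("client is shutting down")
        task = asyncio.create_task(coro)
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return task

    async def shutdown(self, timeout: float = 30.0) -> int:
        """Stop accepting work, wait up to `timeout`, cancel the rest.

        Returns the number of tasks that had to be cancelled.
        """
        self._closed = True
        if not self._tasks:
            return 0
        _, pending = await asyncio.wait(self._tasks, timeout=timeout)
        for task in pending:
            task.cancel()
        # Let cancelled tasks finish unwinding before returning.
        await asyncio.gather(*pending, return_exceptions=True)
        return len(pending)


async def demo() -> int:
    tracker = InFlightTracker()
    tracker.submit(asyncio.sleep(0.01))  # finishes before the deadline
    tracker.submit(asyncio.sleep(10))    # still running at the deadline
    return await tracker.shutdown(timeout=0.1)


print(asyncio.run(demo()))  # 1
```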

## Production Deployment

### Redis Backend (Recommended for Production)

For distributed deployments or to persist rate limit state across restarts, use Redis:

```bash
pip install maxllm_gate[redis]
```

```dotenv
REDIS_URL=redis://localhost:6379
API_KEYS_CONFIG='{"openai-1": {"api_key": "sk-...", "provider": "openai", "models": {"gpt-4o-mini": {"tpm_limit": 90000, "rpm_limit": 500}}}}'
```

**Redis provides:**
- 🔄 **Persistent state** - Rate limits survive restarts
- 🌐 **Distributed coordination** - Multiple instances share state
- 📊 **Centralized metrics** - Latency tracking across all instances
- 🔒 **Distributed locks** - Atomic operations across workers

**Using HybridRateLimiter (auto-fallback):**

```python
import asyncio

from maxllm_gate.redis_backend import HybridRateLimiter

async def main():
    # Tries Redis, falls back to in-memory if unavailable
    limiter = HybridRateLimiter(
        redis_url="redis://localhost:6379",
        fallback_to_memory=True,
    )

    await limiter.initialize()

    if limiter.is_distributed:
        print("✅ Using Redis backend")
    else:
        print("⚠️ Fallback to in-memory (Redis unavailable)")

asyncio.run(main())
```

### Docker Deployment

```dockerfile
FROM python:3.11-slim

WORKDIR /app
# Install the package with all optional extras
RUN pip install --no-cache-dir "maxllm_gate[all]"

COPY .env .
COPY app.py .

CMD ["python", "app.py"]
```

```yaml
# docker-compose.yml
version: '3.8'
services:
  maxllm_gate:
    build: .
    environment:
      - REDIS_URL=redis://redis:6379
    depends_on:
      - redis
  
  redis:
    image: redis:7-alpine
    volumes:
      - redis-data:/data

volumes:
  redis-data:
```

### Monitoring & Observability

```python
from maxllm_gate import maxllm_gate

client = maxllm_gate.from_env()

# Check capacity across all keys
capacity = client.capacity()
print(f"Total capacity: {capacity['total_capacity']}")

# View latency stats per key
latency = client.latency()
for key_id, stats in latency.items():
    print(f"{key_id}: avg={stats['avg_ms']:.1f}ms, p99={stats['p99_ms']:.1f}ms")

# Debug routing decisions
scores = client.scores()
for key_id, score_data in scores.items():
    print(f"{key_id}: score={score_data['total_score']:.2f}, "
          f"util={score_data['utilization']:.2f}, "
          f"latency={score_data['latency_avg_ms']:.1f}ms")
```

### Health Checks

```python
# For Kubernetes/Docker health checks
def health_check():
    try:
        capacity = client.capacity()
        # Check if any key has capacity
        has_capacity = any(
            key['tokens_remaining'] > 1000 
            for key in capacity['keys'].values()
        )
        return has_capacity
    except Exception:
        return False
```

## API Reference

### ChatResponse

```python
response = await client.chat("gpt-4o-mini", "Hello")

response.content       # The generated text
response.model         # Model used
response.usage         # {"prompt_tokens": 10, "completion_tokens": 20, "total_tokens": 30}
response.finish_reason # "stop", "length", etc.
response.latency       # Total request time in seconds
response.llm_latency   # LLM provider time only (NEW)
response.key_used      # Which API key was used
```

### maxllm_gate Methods

| Method | Description | Returns |
|--------|-------------|---------|
| `chat(model, messages, **kwargs)` | Async chat completion | `ChatResponse` |
| `chat_stream(model, messages, **kwargs)` | Async streaming completion | `AsyncGenerator[str]` |
| `add_key(api_key, provider, models, ...)` | Add key at runtime | `None` |
| `status()` | Get scheduler status | `dict` |
| `capacity()` | Get current capacity | `dict` |
| `latency()` | Get latency stats per key (NEW) | `dict` |
| `scores()` | Get routing scores (NEW) | `dict` |
| `shutdown(timeout)` | Graceful shutdown (NEW) | `None` |

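`chat()` and `shutdown()` are coroutines and must be awaited, and `chat_stream()` is an async generator consumed with `async for`. The examples in this README call `capacity()`, `latency()`, and `scores()` without `await`.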

### Configuration Classes

```python
from maxllm_gate import maxllm_gate
from maxllm_gate.config import maxllm_gate_config, KeyConfig

# Programmatic config
config = maxllm_gate_config(
    keys=[
        KeyConfig(
            api_key="sk-...",
            provider="openai",
            models={
                "gpt-4o-mini": {"tpm_limit": 90000, "rpm_limit": 500},
            },
        )
    ],
    strategy="balanced",
    max_retries=3,
)

client = maxllm_gate(config=config)
```

### Validation

```python
from maxllm_gate.validation import validate_chat_request, ChatRequest, ChatMessage

# Validate before sending
request = validate_chat_request(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
    temperature=0.7,
    max_tokens=1024,
)

# Access validated data
print(request.model)  # "gpt-4"
print(request.messages[0].role)  # "user"
```

## Testing

```bash
# Install dev dependencies
pip install maxllm_gate[dev]

# Run all tests
pytest

# Run specific test file
pytest tests/test_sdk.py

# With coverage report
pytest --cov=maxllm_gate --cov-report=html

# Run only SDK tests (fast, no server deps needed)
pytest tests/test_sdk.py -v
```

### Test Structure

```
tests/
├── conftest.py             # Shared fixtures
├── test_api.py             # FastAPI route tests
├── test_scheduler.py       # Gateway scheduler tests
├── test_sdk.py             # SDK client tests
├── test_strategies.py      # Gateway routing strategy tests
├── test_token_bucket.py    # Rate limiting bucket tests
└── test_token_estimator.py # Token estimation tests
```

## Examples

See the [`examples/`](examples/) directory for more:

- `basic_usage.py` - Gateway HTTP example
- `simple_async.py` - Minimal async SDK usage
- `concurrent_requests.py` - Concurrent request handling
- `multi_key_config.py` - Multiple keys and providers
- `priority_requests.py` - Gateway request priorities

## Architecture

maxllm_gate is built as a scheduling layer **on top of** [LiteLLM](https://github.com/BerriAI/litellm):

```
┌──────────────────────────────────────────┐
│            Your Application              │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│          maxllm_gate (Scheduler)         │
│  • Rate limiting (token bucket)          │
│  • Smart routing (SDK and gateway)       │
│  • Queue management                      │
│  • Latency tracking                      │
│  • Input validation                      │
└────────────────┬─────────────────────────┘
                 │
                 ▼
┌──────────────────────────────────────────┐
│        LiteLLM (Execution)               │
│  • Provider abstraction                  │
│  • API key management                    │
│  • Retry logic                           │
└────────────────┬─────────────────────────┘
                 │
     ┌───────────┼───────────┐
     ▼           ▼           ▼
  ┌─────┐   ┌─────┐     ┌─────┐
  │ GPT │   │Groq │     │ ... │
  └─────┘   └─────┘     └─────┘
```
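
The queue management layer is not documented in detail here. As a rough sketch of how `high`/`medium`/`low` priorities and a `MAX_QUEUE_SIZE` bound could fit together, the following uses a binary heap with a counter as tie-breaker so that requests of equal priority stay in arrival order. `PriorityRequestQueue` is a hypothetical class, not the library's scheduler.

```python
import heapq
import itertools

PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


class PriorityRequestQueue:
    """Bounded queue that pops high priority first, FIFO within a priority."""

    def __init__(self, max_size: int = 10000):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserving arrival order
        self._max_size = max_size

    def push(self, request, priority: str = "medium") -> None:
        if len(self._heap) >= self._max_size:
            raise OverflowError("queue is full")
        entry = (PRIORITY_RANK[priority], next(self._counter), request)
        heapq.heappush(self._heap, entry)

    def pop(self):
        """Remove and return the most urgent request."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = PriorityRequestQueue(max_size=4)
queue.push("a", "low")
queue.push("b", "high")
queue.push("c")          # defaults to medium
queue.push("d", "high")
print([queue.pop() for _ in range(len(queue))])  # ['b', 'd', 'c', 'a']
```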

## Contributing

Contributions welcome! Please:

1. Fork the repo
2. Create a feature branch
3. Add tests for new features
4. Ensure all tests pass: `pytest`
5. Submit a pull request

## Roadmap

- [ ] Cost tracking and optimization
- [ ] Streaming with backpressure control
- [ ] Web dashboard UI
- [ ] Batch request helpers
- [ ] Provider-specific config presets
- [ ] Custom retry strategies

## FAQ

**Q: Why use maxllm_gate instead of calling LiteLLM directly?**

A: maxllm_gate adds intelligent scheduling, rate limiting, and multi-key management. It prevents 429 errors and maximizes throughput across multiple keys/providers.

**Q: Does this work with OpenAI's official client?**

A: maxllm_gate uses LiteLLM under the hood, which supports OpenAI and 100+ other providers. The API is similar but not identical to OpenAI's client.

**Q: What happens when all keys are rate limited?**

A: maxllm_gate automatically defers the request and waits for capacity to become available. No 429 errors!

**Q: Can I use this in production?**

A: Yes. Version 0.8.0 includes the SDK, the optional FastAPI gateway, Redis support, graceful shutdown, and automated tests for the current code paths.

**Q: Which strategy should I use?**

A: Use `least_utilized` if you want one setting that works everywhere. Use `balanced` for direct SDK usage when you want routing to account for utilization, latency, recent errors, and freshness. Use `token_aware` only in the gateway.

**Q: Do I need Redis?**

A: No, Redis is optional. It's recommended for production/distributed deployments but maxllm_gate works fine with in-memory state for single-instance deployments.

## License

MIT License - see [LICENSE](LICENSE) for details.

## Acknowledgments

- Built on top of [LiteLLM](https://github.com/BerriAI/litellm) for provider abstraction
- Token estimation using [tiktoken](https://github.com/openai/tiktoken)
- Input validation with [Pydantic](https://docs.pydantic.dev/)

---

<div align="center">

**maxllm_gate v0.8.0** - Maximum LLM throughput with zero 429 errors.

[Documentation](https://github.com/Cannonbold2412/maxllm_gate) • [Issues](https://github.com/Cannonbold2412/maxllm_gate/issues) • [PyPI](https://pypi.org/project/maxllm-gate/)

Made for the AI community

</div>
