Metadata-Version: 2.4
Name: vllmoni
Version: 0.1.0
Summary: A CLI tool for monitoring and managing VLLM inference servers in Docker containers
Requires-Python: >=3.13
Description-Content-Type: text/markdown
Requires-Dist: docker>=7.1.0
Requires-Dist: dotenv>=0.9.9
Requires-Dist: fastapi>=0.116.0
Requires-Dist: hydra-core>=1.3.2
Requires-Dist: sqlalchemy>=2.0.41
Requires-Dist: toml>=0.10.2
Requires-Dist: typer>=0.16.0
Requires-Dist: uvicorn>=0.35.0

# VLLMoni

VLLMoni is a command-line tool for monitoring and managing VLLM inference servers running in Docker containers. It provides real-time GPU usage tracking, model management, and container lifecycle control.

## Features

- **Model Management**: Easily run and stop VLLM models in Docker containers
- **GPU Monitoring**: Track GPU memory usage and utilization in real-time
- **Container Control**: Start, stop, and monitor Docker containers running VLLM servers
- **Database Integration**: Persistent storage of model and container information using SQLite
- **CLI Interface**: Simple command-line interface with rich table outputs
- **Configuration**: Flexible configuration using Hydra for different models and settings

## Installation

### Prerequisites

- Python 3.13+
- Docker
- NVIDIA GPU (optional, for GPU acceleration)
- NVIDIA Container Toolkit (for GPU support)
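
A quick way to confirm the environment is ready. The GPU checks only matter if you want GPU acceleration, and the CUDA image tag below is just an example; pick one compatible with your driver:

```bash
python3 --version   # expect 3.13 or newer
docker --version

# GPU-only: verify the driver is visible and the NVIDIA Container Toolkit works
nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```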

### Install from Source

```bash
git clone <repository-url>
cd vllmonitor
pip install -e .
```

## Quick Start

1. **Initialize the database**:
   ```bash
   vllmoni init
   ```

2. **Run a model**:
   ```bash
   vllmoni run model=llama_3_1
   ```

3. **List running models**:
   ```bash
   vllmoni ls
   ```

4. **Monitor with live updates**:
   ```bash
   vllmoni ls --interval 2
   ```

## Usage

### Commands

- `vllmoni init [--override]`: Initialize the database
- `vllmoni ls [--full] [--interval SECONDS]`: List all registered models, optionally refreshing every `SECONDS` seconds
- `vllmoni run [OVERRIDES] [--follow]`: Run a VLLM model in a Docker container
- `vllmoni stop <ID>`: Stop a specific model container
- `vllmoni stop-all`: Stop all running model containers
- `vllmoni logs <ID>`: View logs for a specific model
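
A typical session, combining the commands above (the model ID `1` is illustrative; use the IDs shown by `vllmoni ls`):

```bash
vllmoni init                          # create the SQLite database
vllmoni run model=llama_3_1 --follow  # start a container and stream its logs
vllmoni ls --full                     # show all registered models with details
vllmoni logs 1                        # view logs for the model with ID 1
vllmoni stop 1                        # stop that container
vllmoni stop-all                      # or stop every running container
```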

### Configuration

Models are configured using YAML files in the `conf/model/` directory. Each model config includes:

- Model name and HuggingFace path
- GPU memory utilization
- Generation parameters (temperature, max_tokens, etc.)
- Quantization settings

Example model config (`conf/model/llama_3_1.yaml`):

```yaml
model_name: "meta-llama/Meta-Llama-3.1-8B-Instruct"
model_name_short: "llama3.1-8B-Instruct"
gpu_memory_utilization: 0.5
temperature: 0
max_tokens: 1000
max_model_len: 40000
tensor_parallel_size: 1
```

Override settings at runtime:

```bash
vllmoni run model=llama_3_1 port=8006 devices=0,1
```

## Architecture

- **CLI**: Typer-based command-line interface
- **Database**: SQLAlchemy with SQLite backend
- **Configuration**: Hydra for flexible config management
- **Container Management**: Docker Python SDK
- **Monitoring**: Real-time GPU stats via `nvidia-smi` (see the query sketch after this list)
- **Logging**: Structured logging with Rich console output
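
The GPU numbers ultimately come from `nvidia-smi`. A query of roughly this shape (an illustrative shell invocation, not necessarily the tool's exact internal call) yields the machine-readable per-GPU fields a monitor needs:

```bash
# One CSV row per GPU: index, memory used/total (MiB), utilization (%)
nvidia-smi --query-gpu=index,memory.used,memory.total,utilization.gpu \
  --format=csv,noheader,nounits
```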

## Development

### Project Structure

```
src/
├── cli/           # Command-line interface
├── app/           # Database models and repository
├── container/     # Docker container management
├── utils/         # Utilities (logging, settings)
└── tests/         # Unit tests

conf/
├── defaults.yaml  # Default configuration
└── model/         # Model-specific configs
```

### Running Tests

```bash
python -m pytest tests/
```
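
Standard pytest selection flags work as usual, for example to filter tests by name (the filter string below is illustrative):

```bash
python -m pytest tests/ -k "container" -v
```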

### Building

```bash
pip install build   # PyPA build frontend, if not already installed
python -m build
```

## Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests
5. Submit a pull request

## License

[Add your license here]

## Acknowledgments

- [VLLM](https://github.com/vllm-project/vllm) - Efficient LLM inference
- [Hydra](https://github.com/facebookresearch/hydra) - Configuration management
- [Typer](https://github.com/tiangolo/typer) - CLI framework
- [Rich](https://github.com/Textualize/rich) - Terminal formatting
